853 research outputs found

    Ab Antiquo: Neural Proto-language Reconstruction

    Historical linguists have identified regularities in the process of historic sound change. The comparative method utilizes those regularities to reconstruct proto-words based on observed forms in daughter languages. Can this process be efficiently automated? We address the task of proto-word reconstruction, in which the model is exposed to cognates in contemporary daughter languages and has to predict the proto-word in the ancestor language. We provide a novel dataset for this task, encompassing over 8,000 comparative entries, and show that neural sequence models outperform conventional methods applied to this task so far. Error analysis reveals variability in the ability of neural models to capture different phonological changes, correlating with the complexity of the changes. Analysis of learned embeddings reveals that the models learn phonologically meaningful generalizations, corresponding to well-attested phonological shifts documented by historical linguistics. Comment: Accepted as a long paper in NAACL2

    Linear mappings: semantic transfer from transformer models for cognate detection and coreference resolution

    Includes bibliographical references. 2022 Fall. Embeddings, or vector representations of language, and their properties are useful for understanding how Natural Language Processing technology works. The usefulness of embeddings, however, depends on how contextualized or information-rich they are. In this work, I apply a novel affine (linear) mapping technique, first established in the field of computer vision, to embeddings generated from large Transformer-based language models. In particular, I study its use in two challenging linguistic tasks: cross-lingual cognate detection and cross-document coreference resolution. Cognate detection for two Low-Resource Languages (LRL), Assamese and Bengali, is framed as a binary classification problem using semantic (embedding-based), articulatory, and phonetic features. Linear maps for this task are extrinsically evaluated on the extent of transfer of semantic information between monolingual as well as multilingual models, including those specialized for low-resource Indian languages. For cross-document coreference resolution, whole-document contextual representations are generated for event and entity mentions from cross-document language models like CDLM and other BERT variants, and then linearly mapped to form coreferring clusters based on their cosine similarities. I evaluate my results against gold output using established coreference metrics like BCUB and MUC. My findings reveal that linearly transforming vectors from one model's embedding space to another carries certain semantic information with high fidelity, thereby revealing the existence of a canonical embedding space and its geometric properties for language models. Interestingly, even for a much more challenging task like coreference resolution, linear maps are able to transfer semantic information between "lighter" or less contextual models and "larger" models with near-equivalent performance, or even improved results in some cases.

    The more similar the better? : factors in learning cognates, false cognates and non-cognate words

    In this study we explored factors that determine the knowledge of L2 words with orthographic neighbours in L1 (cognates and false cognates). We asked 150 Polish learners of English to translate 105 English non-cognate words, cognates, and false cognates into Polish, and to rate their confidence in each translation. Confidence ratings allowed us to employ a novel analytic procedure which disentangles knowing cognates and false cognates from strategic guessing. Mixed-effects logistic regression models revealed that cognates were known better, whereas false cognates were known worse, relative to non-cognate controls. The advantage of knowing cognates, but not false cognates, was modulated by the degree of similarity to their L1 equivalents. The knowledge of cognates and false cognates was not affected by the frequency of their formal equivalent in L1. Based on these findings we draw conclusions about how cross-linguistic formal similarity affects L2 word learnability, and propose a mechanism by which cognates and false cognates are acquired.

    Sentiment analysis for Hinglish code-mixed tweets by means of cross-lingual word embeddings


    Development and evaluation of phonological models for cognate identification

    The paper presents a methodology for the development and task-based evaluation of phonological models, which improve the accuracy of cognate terminology identification but may also be used for other applications, such as transliteration or improving character-based NMT. Terminology translation remains a bottleneck for MT, especially for under-resourced languages and domains, and automated identification of cognate terms addresses this problem. The proposed phonological models explicitly represent distinctive phonological features for each character, such as acoustic types (e.g., vowel/consonant, voiced/unvoiced/sonant) and place and manner of articulation (closed/open, front/back vowel; plosive, fricative, or labial, dental, glottal consonant). The advantage of such representations is that they explicate information about characters’ internal structure rather than treating them as elementary atomic units of comparison, placing graphemes into a feature space that provides additional information about their articulatory (pronunciation-based) or acoustic (sound-based) distances and similarity. The article presents experimental results of using the proposed phonological models for extracting cognate terminology with the phonologically aware Levenshtein edit distance, which on the Top-1 cognate ranking metric outperforms the baseline character-based Levenshtein by 16.5%. Project resources are released on: https://github.com/bogdanbabych/cognates-phonolog
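    The idea of a phonologically aware Levenshtein distance can be sketched as follows. This is only an illustration, not the authors' implementation: the feature table `FEATURES`, the Jaccard-based substitution cost, and the function names are all assumptions made for this example. Substituting one character for another costs less when the two share more phonological features, while insertions and deletions keep the usual cost of 1.

```python
# Hypothetical, tiny feature table (an assumption for illustration only).
# Each character maps to a set of distinctive phonological features.
FEATURES = {
    "p": {"consonant", "plosive", "labial", "unvoiced"},
    "b": {"consonant", "plosive", "labial", "voiced"},
    "t": {"consonant", "plosive", "dental", "unvoiced"},
    "d": {"consonant", "plosive", "dental", "voiced"},
    "a": {"vowel", "open", "back"},
    "e": {"vowel", "open", "front"},
}


def sub_cost(a: str, b: str) -> float:
    """Substitution cost in [0, 1]: 1 minus the Jaccard overlap of feature sets."""
    if a == b:
        return 0.0
    fa, fb = FEATURES.get(a), FEATURES.get(b)
    if fa is None or fb is None:
        # Characters without feature data fall back to the plain cost of 1.
        return 1.0
    return 1.0 - len(fa & fb) / len(fa | fb)


def phon_levenshtein(s: str, t: str) -> float:
    """Levenshtein distance with feature-weighted substitutions."""
    prev = [float(j) for j in range(len(t) + 1)]
    for i, a in enumerate(s, 1):
        curr = [float(i)]
        for j, b in enumerate(t, 1):
            curr.append(min(
                prev[j] + 1.0,               # deletion
                curr[j - 1] + 1.0,           # insertion
                prev[j - 1] + sub_cost(a, b),  # weighted substitution
            ))
        prev = curr
    return prev[-1]
```

    Under this toy table, `phon_levenshtein("pat", "bat")` is smaller than the plain character-based distance of 1, because "p" and "b" differ only in voicing.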

    The Influence of Cross-Linguistic Similarity and Language Background on Writing to Dictation

    The current research was completed thanks to financial aid provided by the doctoral research grant FPU16/01748 to AI and grants from the Ministerio de Ciencia, Innovacion y Universidades-Fondos Feder to TB (PGC2018-093786-B-I00) and DP (PCIN-2015-165-C02-01 and PSI2017-89324-C2-1-P) and from the Feder Andalucia to TB (A-CTS-111-UGR18 and P20.00107). This study used a word dictation task to examine the influence of a variety of factors on word writing production: cognate status (cognate vs. non-cognate words), orthographic (OS) and phonological similarity (PS) within the set of cognate words, and language learning background [late bilinguals (LBs) with academic literacy and formal instruction in English and Spanish, and heritage speakers (HSs) with academic literacy and formal instruction only in English]. Both accuracy and reaction times were assessed for the first key pressed by participants (indicating lexical access), along with the time required to type the rest of the word after the first keypress (indicating sublexical processing). The results revealed an effect of PS on the dictation task, particularly for the first keypress. That is, cognates with high PS were processed faster than cognates with low PS. In contrast to reading studies in which PS only revealed a significant effect when the OS between languages was high (O + P+ vs. O + P−), in the writing-to-dictation task phonology had a more general effect across all conditions, regardless of the level of OS. On the other hand, OS tended to be more influential for typing the rest of the word. This pattern is interpreted as indicating the importance of phonology (and PS in cognates) for initial lexical retrieval when the input is aural. In addition, the role of OS and PS during co-activation differed between groups, probably due to the participants’ linguistic learning environment. Concretely, HSs were found to show relatively lower OS effects, which is attributed to the greater emphasis on spoken language in their Spanish language learning experiences, compared to the formal education received by the LBs. Thus, the study demonstrates that PS can influence lexical processing of cognates, as long as the task demands specifically require phonological processing, and that variations in language learning experiences also modulate lexical processing in bilinguals.

    Cognate Discovery and Alignment in Computational Etymology

    This master's thesis discusses two main tasks of computational etymology: first, finding cognates in multilingual text; second, finding underlying correspondence rules by aligning cognates. For the first part, I briefly describe two categories of methods for identifying cognates: symbol-based and phonetic-based. For the second part, I describe the Etymon project, on which I have been working. The Etymon project uses a probabilistic method and the Minimum Description Length principle to align cognate sets. The objective of this project is to build a model which can automatically find as much information in the cognates as possible without linguistic knowledge, as well as find genetic relationships between the languages. I also discuss the experiment that I conducted to explore the uncertainty in the data source.

    Reading Polish with Czech Eyes: Distance and Surprisal in Quantitative, Qualitative, and Error Analyses of Intelligibility

    In CHAPTER I, I first introduce the thesis in the context of the project workflow in section 1. I then summarise the methods and findings from the project publications about the languages in focus. There I also introduce the relevant concepts and terminology viewed in the literature as possible predictors of intercomprehension and processing difficulty. CHAPTER II presents a quantitative (section 4) and a qualitative (section 5) analysis of the results of the cooperative translation experiments. The focus of this thesis, the language pair PL-CS, is explained and the hypotheses are introduced in section 6. The experiment website is introduced in section 7 with an overview of the participants, the different experiments conducted, and the sections in which they are discussed. In CHAPTER IV, free translation experiments are discussed in which two different sets of individual word stimuli were presented to Czech readers: (i) cognates that are transformable with regular PL-CS correspondences (section 12) and (ii) the 100 most frequent PL nouns (section 13). CHAPTER V presents the findings of experiments in which PL NPs in two different linearisation conditions were presented to Czech readers (sections 14.1-14.6). A short digression is made when I turn to experiments with PL internationalisms which were presented to German readers (14.7). CHAPTER VI discusses the methods and results of cloze translation experiments with highly predictable target words in sentential context (section 15) and random context with sentences from the cooperative translation experiments (section 16). A final synthesis of the findings, together with an outlook, is provided in CHAPTER VII.