37 research outputs found

    Automated identification of borrowings in multilingual wordlists

    Although lexical borrowing is an important aspect of language evolution, there have been few attempts to automate the identification of borrowings in lexical datasets. Moreover, none of the solutions which have been proposed so far identify borrowings across multiple languages. This study proposes a new method for the task and tests it on a newly compiled large comparative dataset of 48 South-East Asian languages from Southern China. The method yields very promising results, while it is conceptually straightforward and easy to apply. This makes the approach a perfect candidate for computer-assisted exploratory studies on lexical borrowing in contact areas

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. 
    This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible to enhance the framework, e.g. with confidence values for each directionality decision.

    A theoretical approach to automatic loanword detection

    For several years, computational methods have been finding their way into the humanities. In computational linguistics in particular, a range of analyses and methods are being studied, so it is not surprising that computational analysis has aroused interest in historical linguistics. Such methods allow language evolution to be studied from another point of view. Biological and linguistic evolution show certain parallels, and the parallels between phylogenetics and linguistics in particular have motivated combining the two fields. Phylogenetics provides a great number of mathematical and computational methods for different tasks. Based on these parallels, the methods can be adapted to historical linguistics. In historical linguistics, borrowing is a well-known evolutionary process in which words are taken from one language and adapted into another. Borrowing has a corresponding parallel in phylogenetics, namely horizontal gene transfer, the process of transferring genes from one organism to another. What borrowing and horizontal gene transfer have in common is that genes or words are transferred even though the organisms or languages need not be related. Phylogenetics provides several computational methods and analyses to detect horizontal gene transfer, and these might be adapted to linguistics to detect borrowing. This paper introduces the background of borrowing and phylogenetics as well as the combination of both fields. The new tree-based approach should indicate whether existing phylogenetic methods can be adapted to linguistics for the detection of borrowing.

    Computer-Assisted Language Comparison in Practice. Tutorials on Computational Approaches to the History and Diversity of Languages. Volume II

    This document summarizes all contributions to the blog "Computer-Assisted Language Comparison in Practice" from 2019, also available online at https://calc.hypotheses.org

    Algorithmic advancements in Computational Historical Linguistics

    The use of computational methods in historical linguistics has seen a large boost in recent years. An increasing availability of machine-readable data and the growing power of computers fostered this development. While the computational methods used in this research stem from different scientific disciplines, many tools from computational biology have found their way into it. Drawing inspiration from advancements in related fields, this thesis aims at improving existing computational methods in different areas of computational historical linguistics. Using advancements from machine learning and natural language processing research, I present an updated training regime for cognate detection algorithms. Besides achieving state-of-the-art performance in a cognate clustering task, the updated training scheme considerably improved computation time. Following up on these results, I develop a novel combination of tools from bioinformatics and historical linguistics. By defining an explicit model of sound evolution, I include the notion of evolutionary time in a cognate detection task. The resulting posterior distributions are used to evaluate the model on a standard cognate detection task. A standard problem in phylogenetic research is the inference of a tree. Current quasi "industry-standard" methods use the classical Metropolis-Hastings algorithm. However, this algorithm is known to be rather inefficient for high-dimensional and correlated data. To solve this problem, I present in the last chapter an algorithm which uses Hamiltonian dynamics.

    Linguistic Diversity: Empirical Perspectives

    When comparing the more than 7000 human language varieties spoken today, one encounters a huge diversity in all domains of language, ranging from phonology via morphology up to syntax and pragmatics. In the seminar, we explored how language diversity can be studied empirically. In order to do so, we looked at linguistic approaches to the study of linguistic diversity from multiple perspectives, including classical approaches in historical and areal linguistics and linguistic typology, as well as recent, predominantly quantitative approaches in the field of diversity linguistics. In terms of topics, we focused on the major domains of language, such as phonology, morphology, and structure ("grammar" in a broad sense)

    Approches Neuronales pour la Reconstruction de Mots Historiques

    In historical linguistics, cognates are words that descend in a direct line from a common ancestor, called their proto-form, and are therefore representative of the evolution of their respective languages through time, as well as of the relations between these languages synchronically. As they reflect the phonetic history of the languages they belong to, they allow linguists to better determine all manner of synchronic and diachronic linguistic relations (etymology, phylogeny, sound correspondences). Cognates of related languages tend to be linked through systematic phonetic correspondence patterns, which neural networks could well learn to model, being especially good at learning latent patterns. In this dissertation, we seek to methodically study the applicability of neural networks inspired by machine translation to historical word prediction, relying on the surface similarity of the two tasks. We first create an artificial dataset inspired by the phonetic and phonotactic rules of Romance languages, which allows us to vary task complexity and data size in a controlled environment, and thereby to identify whether and under which conditions neural networks are applicable. We then extend our work to real datasets (after updating an etymological database to gather more data), study the transferability of our conclusions to real data, and then the applicability of a number of data augmentation techniques to the task, in order to mitigate low-resource situations. Finally, we investigate in more detail our best models, multilingual neural networks. We first confirm that, on the surface, they seem to capture language relatedness information and phonetic similarity, in line with prior work. We then discover, by probing them, that the information they store is actually more complex: our multilingual models encode a phonetic language model, and learn enough latent historical information to allow decoders to reconstruct the (unseen) proto-form of the studied languages as well as or better than bilingual models trained specifically on the task. This latent information is likely the explanation for the success of multilingual methods in previous work.

    Computer-Assisted Language Comparison in Practice. Volume 3

    The weblog Computer-Assisted Language Comparison in Practice, published on the Hypotheses platform for scientific blogging, offers tutorials and discussion notes on computer-assisted approaches to the history and diversity of languages. A substantial part of its content is contributed as part of the ERC Starting Grant “Computer-Assisted Language Comparison” (CALC, 715618), funded by the European Research Council. In the long run, however, we want to make this blog a platform for everybody willing to share ideas on small or big problems involving data preparation and analysis in computer-assisted or computer-based approaches to language comparison. This document summarizes all contributions from 2020. If you want to cite them, please follow the instructions at the end of each contribution. I express my gratitude to all contributors, who helped to make this an interesting collection of tutorials, algorithms, and initial theories related to the field of computer-assisted language comparison

    The Negative Existential Cycle

    In 1991, William Croft suggested that negative existentials (typically lexical expressions that mean ‘not exist, not have’) are one possible source for negation markers and gave his hypothesis the name Negative Existential Cycle (NEC). It is a variationist model based on cross-linguistic data. For a good twenty years following its formulation, it was cited at face-value without ever having been tested by (historical)-comparative data. Over the last decade, Ljuba Veselinova has worked on testing the model in a comparative perspective, and this edited volume further expands on her work. The collection presented here features detailed studies of several language families such as Bantu, Chadic and Indo-European. A number of articles focus on the micro-variation and attested historical developments within smaller groups and clusters such as Arabic, Mandarin and Cantonese, and Nanaic. Finally, variation and historical developments in specific languages are discussed for Ancient Hebrew, Ancient Egyptian, Moksha-Mordvin (Uralic), Bashkir (Turkic), Kalmyk (Mongolic), three Pama-Nyungan languages, O’dam (Southern Uto-Aztecan) and Tacana (Takanan, Amazonian Bolivia). The book is concluded by two chapters devoted to modeling cyclical processes in language change from different theoretical perspectives. Key notions discussed throughout the book include affirmative and negative existential constructions, the expansion of the latter into verbal negation, and subsequently from more specific to more general markers of negation. Nominalizations as well as the uses of negative existentials as standalone negative answers figure among the most frequent pathways whereby negative existentials evolve as general negation markers. The operation of the Negative Existential Cycle appears partly genealogically conditioned, as the cycle is found to iterate regularly within some families but never starts in others, as is the case in Bantu. 
    In addition, other special negation markers such as nominal negators are found to undergo similar processes, i.e. they expand into the verbal domain and thereby develop into more general negation markers. The book provides rich information on a specific path of the evolution of negation and on cyclical processes in language change, and it showcases the historical-comparative method in a modern setting.