45 research outputs found

    Automatic Loanword Identification Using Tree Reconciliation

    The use of computational methods in historical linguistics has increased steadily in recent years. Phylogenetic methods, which explore the evolutionary history of and relationships among organisms, have found their way into historical linguistics, and the availability of machine-readable data has accelerated their adaptation and further development. While some methods addressing the evolution of languages have been integrated into linguistics, scarcely any attention has been paid to methods analyzing horizontal transmission. Inspired by the parallel between horizontal gene transfer and borrowing, this thesis adapts horizontal transfer methods into computational historical linguistics to identify borrowing scenarios along with the transferred loanwords. Computational methods modeling horizontal transfer are based on the framework of tree reconciliation: they attempt to detect horizontal transfer by fitting the evolutionary history of words to the evolution of their corresponding languages, both represented as phylogenetic trees. Discordance between the two evolutionary scenarios indicates the influence of loanwords due to language contact. The tree reconciliation framework is introduced in a linguistic setting along with an appropriate algorithm, which is applied to linguistic trees to detect loanwords.
While the reconstruction of language trees is scientifically well substantiated, little research has so far been done on the reconstruction of concept trees, which represent the words’ histories. One major innovation of this thesis is the introduction of various methods to reconstruct reliable concept trees and to determine their stability, in order to achieve reasonable results in loanword detection. The results of the tree reconciliation are evaluated against a newly developed gold standard and compared to three methods established for the task of language contact detection in computational historical linguistics. The main aim of this thesis is to clarify the purpose of tree reconciliation methods in linguistics. The following analyses should give insights into the degree to which the direct transfer of phylogenetic methods into the field of linguistics is fruitful and can be used to discover borrowings along with the transferred loanwords. The identification of loanwords is a first step toward a deeper understanding of contact scenarios and of the possible types of loanwords present in linguistic data. The adaptation of phylogenetic methods is not only worthwhile for shedding light on individual horizontal transmissions, but also serves as a basis for further, more detailed analyses in the field of contact linguistics.
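The tree-comparison idea behind the thesis can be sketched in a few lines. The following is a deliberately crude illustration, assuming trees as nested tuples and flagging concept-tree clades that are absent from the language tree; actual reconciliation algorithms model duplication, transfer, and loss events explicitly, and the trees below are invented for demonstration:

```python
def clades(tree):
    """Return (set of clades, leaf set) of a nested-tuple tree, where a
    clade is the frozenset of leaves below one internal node."""
    if isinstance(tree, str):
        return set(), frozenset([tree])
    found, leaves = set(), set()
    for child in tree:
        sub_clades, sub_leaves = clades(child)
        found |= sub_clades
        leaves |= sub_leaves
    leaves = frozenset(leaves)
    found.add(leaves)
    return found, leaves

def discordant_clades(language_tree, concept_tree):
    """Clades of the concept tree that contradict the language tree.
    Each one groups languages the language tree keeps apart, which, on
    the tree-reconciliation view, hints at horizontal transfer, i.e.
    borrowing."""
    lang_clades, _ = clades(language_tree)
    word_clades, _ = clades(concept_tree)
    return word_clades - lang_clades

# Invented example: for this concept, English groups with the Romance
# languages (think of a French loan), against its genealogical position.
language_tree = ((("English", "German"), "Dutch"), ("French", "Spanish"))
concept_tree = ((("English", "French"), "Spanish"), ("German", "Dutch"))

for clade in sorted(discordant_clades(language_tree, concept_tree), key=sorted):
    print(sorted(clade))
```

When the two trees agree, the set of discordant clades is empty; every mismatch is a candidate borrowing event to be examined further.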

    Sequence comparison in computational historical linguistics

    With increasing amounts of digitally available data from all over the world, manual annotation of cognates in multi-lingual word lists becomes more and more time-consuming in historical linguistics. Using available software packages to pre-process the data prior to manual analysis can drastically speed up the process of cognate detection. Furthermore, it allows us to get a quick overview of data which have not yet been intensively studied by experts. LingPy is a Python library which provides a large arsenal of routines for sequence comparison in historical linguistics. With LingPy, linguists can not only automatically search for cognates in lexical data, but they can also align the automatically identified words and output them in various forms, which aim at facilitating manual inspection. In this tutorial, we will briefly introduce the basic concepts behind the algorithms employed by LingPy and then illustrate in concrete workflows how automatic sequence comparison can be applied to multi-lingual word lists. The goal is to provide the readers with all information they need to (1) carry out cognate detection and alignment analyses in LingPy, (2) select the appropriate algorithms for the appropriate task, (3) evaluate how well automatic cognate detection algorithms perform compared to experts, and (4) export their data into various formats useful for additional analyses or data sharing. While basic knowledge of the Python language is useful for all analyses, our tutorial is structured in such a way that scholars with basic knowledge of computing can follow through all steps as well. This research was supported by the European Research Council Starting Grant ‘Computer-Assisted Language Comparison’ (Grant CALC 715618, J.M.L., T.T.) and the Australian Research Council’s Centre of Excellence for the Dynamics of Language (Australian National University, Grant CE140100041, S.J.G.).
As part of the GlottoBank project (http://glottobank.org), this work was further supported by the Department of Linguistic and Cultural Evolution of the Max Planck Institute for the Science of Human History (Jena) and the Royal Society of New Zealand (Marsden Fund, Grant 13-UOA-121).
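The core step of such a workflow is grouping phonetically similar words for one concept into cognate sets. LingPy itself uses refined sound-class-based methods (e.g. LexStat); as a self-contained stand-in for the general idea, the sketch below clusters rough transcriptions with a normalized edit distance and an illustrative threshold. Data and threshold are invented, not LingPy output:

```python
def edit_distance(a, b):
    """Levenshtein distance via the classic dynamic-programming recurrence."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # (mis)match
        prev = cur
    return prev[-1]

def normalized_distance(a, b):
    """Length-normalized distance in [0, 1], comparable across word lengths."""
    return edit_distance(a, b) / max(len(a), len(b))

def cluster_cognates(words, threshold=0.4):
    """Greedy single-linkage clustering of transcriptions for one concept:
    a word joins the first cluster containing a sufficiently similar word."""
    clusters = []
    for w in words:
        for cluster in clusters:
            if any(normalized_distance(w, x) <= threshold for x in cluster):
                cluster.append(w)
                break
        else:
            clusters.append([w])
    return clusters

# Rough transcriptions of 'mountain' in two Germanic and two Romance varieties.
print(cluster_cognates(["berg", "berch", "mont", "monte"]))
# → [['berg', 'berch'], ['mont', 'monte']]
```

In practice the threshold is tuned against expert-annotated cognate judgments, which is exactly the kind of evaluation step (3) of the tutorial addresses.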

    Automated methods for the investigation of language contact, with a focus on lexical borrowing

    While language contact has so far been studied predominantly on the basis of detailed case studies, the emergence of methods for phylogenetic reconstruction and automated word comparison – a result of the recent quantitative turn in historical linguistics – has also led to new proposals for studying language contact situations by means of automated approaches. This study provides a concise introduction to the most important approaches proposed in the past, presenting methods that use (A) phylogenetic networks to detect reticulation events in language history, (B) sequence comparison to identify borrowings in multilingual datasets, and (C) arguments about the borrowability of shared traits to decide whether traits have been borrowed or inherited. While the overview focuses on approaches dealing with lexical borrowing, questions of general contact inference are also discussed where applicable.

    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Evolutionary processes can be found in almost any historical, i.e. evolving, system that imperfectly copies from the past. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches. Given a well-chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close together, such as the well-known HOX cluster. These are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. Expanding gene clusters through unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still allows phylogenetic information to be extracted from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster.
This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems, such as language evolution, owing to a lack of detectable mechanisms driving the evolutionary processes in those fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore between the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Nor does this hinder studies of language evolution using automated tools. As the amount of large, digitally available corpora increased, so did the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools to linguistic needs. Here, we use jAlign, an alignment algorithm that takes the adjacency of letters into account, to calculate bigram alignments; its performance is tested in different cognate recognition tasks. One major obstacle of pairwise alignments is the systematic errors they make, such as the underestimation and misplacement of gaps.
Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and can thus overcome the problem of correct gap placement. They can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic-programming solutions have become feasible in practice for 3- and 4-way alignments as well. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic-programming solutions for particular partially local alignment problems.
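The additivity requirement mentioned above can be made concrete. The sketch below checks the classic four-point condition on a small distance matrix, distorts it with a monotonic, subadditive ζ(x) = √x, and shows that applying ζ⁻¹(x) = x² restores additivity; the tree and distances are invented for illustration, and the thesis solves the harder inverse problem of finding ζ by optimization:

```python
from itertools import combinations
from math import sqrt, isclose

def is_additive(d, taxa, tol=1e-9):
    """Four-point condition: for every quadruple of taxa, the two largest
    of the three pairwise distance sums must be (numerically) equal."""
    for i, j, k, l in combinations(taxa, 4):
        sums = sorted([d[i][j] + d[k][l], d[i][k] + d[j][l], d[i][l] + d[j][k]])
        if not isclose(sums[1], sums[2], abs_tol=tol):
            return False
    return True

taxa = "abcd"
# Additive distances read off the four-leaf tree ((a:1, b:2):1, (c:3, d:1)).
A = {"a": {"a": 0, "b": 3, "c": 5, "d": 3},
     "b": {"a": 3, "b": 0, "c": 6, "d": 4},
     "c": {"a": 5, "b": 6, "c": 0, "d": 4},
     "d": {"a": 3, "b": 4, "c": 4, "d": 0}}
# Distort with the monotonic, subadditive zeta(x) = sqrt(x) ...
D = {i: {j: sqrt(A[i][j]) for j in taxa} for i in taxa}
# ... and undo it with zeta^{-1}(x) = x^2.
R = {i: {j: D[i][j] ** 2 for j in taxa} for i in taxa}

print(is_additive(A, taxa), is_additive(D, taxa), is_additive(R, taxa))
# → True False True: the distortion hides the tree, the back-transform restores it
```

This is exactly why a poor distance measure can yield a wrong phylogeny even when the underlying history is tree-like: the observed matrix need not be additive, but a suitable ζ⁻¹ can recover an additive one.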

    Computer-Assisted Language Comparison in Practice. Tutorials on Computational Approaches to the History and Diversity of Languages. Volume II

    This document summarizes all contributions to the blog "Computer-Assisted Language Comparison in Practice" from 2019, which are also available online at https://calc.hypotheses.org.

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. 
This currently limits the new algorithms to the role of exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible to enhance the framework, e.g. with confidence values for each directionality decision.
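The basic building blocks of this approach can be sketched compactly. The code below models varieties as cognate-class assignments per concept, measures their overlap, and prunes indirect links with a crude constraint-based rule; this is only a rough stand-in for the flow-based separation criterion of the PLFI/CLFI algorithms, whose actual criteria are considerably more elaborate, and the data are invented:

```python
from itertools import combinations

def cognate_overlap(a, b):
    """Share of shared concepts for which two varieties use the same cognate class."""
    shared = [c for c in a if c in b]
    return sum(a[c] == b[c] for c in shared) / len(shared)

def skeleton(varieties, names):
    """Crude analogue of a constraint-based skeleton search: drop the direct
    link X--Y whenever some third variety Z shows at least as much overlap
    with both X and Y, i.e. Z can account for their similarity."""
    edges = set()
    for x, y in combinations(names, 2):
        o = cognate_overlap(varieties[x], varieties[y])
        if not any(min(cognate_overlap(varieties[x], varieties[z]),
                       cognate_overlap(varieties[z], varieties[y])) >= o
                   for z in names if z not in (x, y)):
            edges.add((x, y))
    return edges

# Invented toy data: cognate-class IDs per concept. A and B are related,
# C acquired part of its lexicon from B; A and C share no direct history.
A = {"hand": 1, "stone": 1, "water": 1, "fish": 1}
B = {"hand": 1, "stone": 1, "water": 1, "fish": 2}
C = {"hand": 1, "stone": 1, "water": 2, "fish": 2}

print(skeleton({"A": A, "B": B, "C": C}, ["A", "B", "C"]))
# the indirect A--C link is pruned; A--B and B--C remain
```

Orienting the surviving edges, i.e. deciding who gave and who received, is the step the evaluation above found to be the hard part.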

    Structural approaches to protein sequence analysis

    Various protein sequence analysis techniques are described, aimed at improving the prediction of protein structure by means of pattern matching. To investigate the possibility that improvements in amino acid comparison matrices could result in improvements in the sensitivity and accuracy of protein sequence alignments, a method for rapidly calculating amino acid mutation data matrices from large sequence data sets is presented. The method is then applied to the membrane-spanning segments of integral membrane proteins in order to investigate the nature of amino acid mutability in a lipid environment. Whilst purely sequence-analytic techniques work well in cases where some residual sequence similarity remains between a newly characterized protein and a protein of known 3-D structure, in the harder cases there is little or no sequence similarity with which to recognize proteins with similar folding patterns. In the light of these limitations, a new approach to protein fold recognition is described, which uses a statistically derived pairwise potential to evaluate the compatibility between a test sequence and a library of structural templates derived from solved crystal structures. The method, called optimal sequence threading, proves to be highly successful and is able to detect the common TIM barrel fold between a number of enzyme sequences, which has not been achieved by any previous sequence analysis technique. Finally, a new method for the prediction of the secondary structure and topology of membrane proteins is described. The method employs a set of statistical tables compiled from well-characterized membrane protein data, and a novel dynamic programming algorithm to recognize membrane topology models by expectation maximization. The statistical tables show definite biases towards certain amino acid species on the inside, middle and outside of a cellular membrane.
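The mutation-data matrices mentioned above are, at their core, log-odds statistics over observed aligned residue pairs. The sketch below shows that core computation on a tiny invented two-letter example; real PAM-style matrices are estimated from large curated alignment collections over the full amino-acid alphabet, with additional normalization for evolutionary distance:

```python
from math import log2
from collections import Counter

def log_odds_scores(aligned_pairs):
    """Estimate log-odds substitution scores from gap-free aligned sequence
    pairs, in the spirit of mutation-data matrices: score(a, b) =
    log2(q_ab / e_ab), where q_ab is the observed frequency of the aligned
    pair and e_ab its expected frequency under independence."""
    pair_counts = Counter()
    single_counts = Counter()
    total = 0
    for s, t in aligned_pairs:
        for a, b in zip(s, t):
            pair_counts[tuple(sorted((a, b)))] += 1  # count pairs symmetrically
            single_counts[a] += 1
            single_counts[b] += 1
            total += 1
    scores = {}
    for (a, b), n in pair_counts.items():
        q = n / total
        pa = single_counts[a] / (2 * total)
        pb = single_counts[b] / (2 * total)
        expected = pa * pb if a == b else 2 * pa * pb
        scores[(a, b)] = log2(q / expected)
    return scores

# Conserved pairs score positive, frequent substitutions score less negative.
scores = log_odds_scores([("AAV", "AVV"), ("AV", "AV")])
print(scores[("A", "A")] > 0, scores[("A", "V")] < 0)
```

The same counting scheme, restricted to columns from membrane-spanning segments, gives the lipid-environment matrices the abstract describes.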