70 research outputs found

    Alignment Analysis of Sequential Segmentation of Lexicons to Improve Automatic Cognate Detection

    Ranking functions in information retrieval are often used in search engines to recommend relevant answers to a query. This paper takes this notion from information retrieval and applies it to the problem of cognate detection. The main contributions of this paper are: (1) positional segmentation, which incorporates the sequential notion; (2) graphical error modelling, which deduces the transformations. Current research focuses on the classification problem, that is, distinguishing whether a pair of words are cognates. This paper addresses a harder problem: whether a possible cognate can be predicted from a given input. Our study shows that language modelling smoothing methods, when applied as retrieval functions and used in conjunction with positional segmentation and error modelling, give better results than competing baselines in both classification and prediction of cognates. Source code is at: https://github.com/pranav-ust/cognates. Comment: Published at ACL-SRW 201
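
The idea of using a smoothed language model as a retrieval function can be sketched as follows. This is only an illustration under stated assumptions, not the paper's implementation: the character-bigram segmentation, the Jelinek-Mercer interpolation weight `lam`, the add-one background model, and the function names are all hypothetical choices made for this sketch. Each candidate word is treated as a "document", and candidates are ranked by the smoothed likelihood of generating the query word.

```python
import math
from collections import Counter


def char_bigrams(word):
    """Split a word into overlapping character bigrams with boundary markers."""
    padded = f"#{word}#"
    return [padded[i:i + 2] for i in range(len(padded) - 1)]


def rank_candidates(query, candidates, lam=0.5):
    """Rank candidate words by Jelinek-Mercer smoothed query likelihood.

    Assumption for this sketch: each candidate is a tiny "document" of
    character bigrams, and the background model is the pooled candidates.
    """
    docs = {w: Counter(char_bigrams(w)) for w in candidates}
    collection = Counter()
    for counts in docs.values():
        collection.update(counts)
    total = sum(collection.values())
    vocab = len(collection) + 1  # reserve probability mass for unseen bigrams

    scores = {}
    for word, counts in docs.items():
        length = sum(counts.values())
        log_p = 0.0
        for gram in char_bigrams(query):
            p_doc = counts[gram] / length
            # Add-one smoothed background keeps the logarithm finite.
            p_coll = (collection[gram] + 1) / (total + vocab)
            log_p += math.log(lam * p_doc + (1 - lam) * p_coll)
        scores[word] = log_p
    return sorted(scores, key=scores.get, reverse=True)


ranking = rank_candidates("night", ["hand", "nuit", "night"])
print(ranking)
```

A candidate that shares more bigrams with the query receives a higher score; the smoothing term only prevents unseen bigrams from driving the likelihood to zero.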

    Automatic Loanword Identification Using Tree Reconciliation

    The use of computer-based methods in historical linguistics has risen steadily in recent years. Phylogenetic methods, which were developed to determine the evolutionary history of organisms and the degrees of relatedness among them, have found their way into historical linguistics. The availability of machine-readable data has fostered their adaptation and further development. While some algorithms for reconstructing the evolutionary history of languages have been adopted, methods for horizontal transfer have received hardly any attention. Drawing on the parallel between horizontal gene transfer and borrowing, this thesis uses phylogenetic methods for detecting horizontal gene transfer to identify loanwords. The algorithms for horizontal gene transfer are based on the comparison of two phylogenetic trees. In linguistics, the language tree depicts the history of the languages, while a concept tree represents the evolutionary history of individual words. The reconstruction of a language tree is scientifically well founded, whereas the reconstruction of concept trees has so far been little studied. A substantial innovation of this thesis is the introduction of various methods for reconstructing stable concept trees. Since the algorithms for detecting horizontal transfer are based on a tree comparison, the differences between a language tree and a concept tree point to loanwords in the data. Therefore, both the methodology and a suitable algorithm are introduced in a linguistic context. The results of the loanword detection are evaluated using a newly developed gold standard and compared with three further algorithms from computational historical linguistics.
The aim of the thesis is to explain to what extent algorithms based on the comparison of two trees can be used for automatic loanword detection, and to what extent loanwords can be successfully identified in the data. The identification of loanwords contributes to a deeper understanding of language contact and the different types of loanwords. The adaptation of phylogenetic methods is therefore not only worthwhile for identifying borrowings, but also serves as a basis for further, more detailed analyses in the fields of automatic loanword detection and contact linguistics.
The use of computational methods in historical linguistics has increased in recent years. Phylogenetic methods, which explore the evolutionary history and relationships among organisms, have found their way into historical linguistics. The availability of machine-readable data has accelerated their adaptation and development. While some methods addressing the evolution of languages have been integrated into linguistics, scarcely any attention has been paid to methods analyzing horizontal transmission. Inspired by the parallel between horizontal gene transfer and borrowing, this thesis aims at adapting horizontal transfer methods to computational historical linguistics in order to identify borrowing scenarios along with the transferred loanwords. Computational methods modeling horizontal transfer are based on the framework of tree reconciliation. The methods attempt to detect horizontal transfer by fitting the evolutionary history of words to the evolution of their corresponding languages, both represented as phylogenetic trees. The discordance between the two evolutionary scenarios indicates the influence of loanwords due to language contact. The tree reconciliation framework is introduced in a linguistic setting along with an appropriate algorithm, which is applied to linguistic trees to detect loanwords.
While the reconstruction of language trees is scientifically substantiated, little research has so far been done on the reconstruction of concept trees, which represent the words' histories. One major innovation of this thesis is the introduction of various methods to reconstruct reliable concept trees and determine their stability in order to achieve reasonable results in terms of loanword detection. The results of the tree reconciliation are evaluated against a newly developed gold standard and compared to three methods established for the task of language contact detection in computational historical linguistics. The main aim of this thesis is to clarify the purpose of tree reconciliation methods in linguistics. The following analyses should give insights into the degree to which the direct transfer of phylogenetic methods into the field of linguistics is fruitful and can be used to discover borrowings along with the transferred loanwords. The identification of loanwords is a first step in the direction of a deeper understanding of contact scenarios and possible types of loanwords present in linguistic data. The adaptation of phylogenetic methods is not only worthwhile for shedding light on detailed horizontal transmissions, but also serves as a basis for further, more detailed analyses in the field of contact linguistics.

    Sequence comparison in computational historical linguistics

    With increasing amounts of digitally available data from all over the world, manual annotation of cognates in multi-lingual word lists becomes more and more time-consuming in historical linguistics. Using available software packages to pre-process the data prior to manual analysis can drastically speed up the process of cognate detection. Furthermore, it allows us to get a quick overview of data which have not yet been intensively studied by experts. LingPy is a Python library which provides a large arsenal of routines for sequence comparison in historical linguistics. With LingPy, linguists can not only automatically search for cognates in lexical data, but also align the automatically identified words and output them in various forms that aim at facilitating manual inspection. In this tutorial, we will briefly introduce the basic concepts behind the algorithms employed by LingPy and then illustrate in concrete workflows how automatic sequence comparison can be applied to multi-lingual word lists. The goal is to provide the readers with all the information they need to (1) carry out cognate detection and alignment analyses in LingPy, (2) select the appropriate algorithms for the appropriate task, (3) evaluate how well automatic cognate detection algorithms perform compared to experts, and (4) export their data into various formats useful for additional analyses or data sharing. While basic knowledge of the Python language is useful for all analyses, our tutorial is structured in such a way that scholars with basic knowledge of computing can follow through all steps as well.
This research was supported by the European Research Council Starting Grant 'Computer-Assisted Language Comparison' (Grant CALC 715618, J.M.L., T.T.) and the Australian Research Council's Centre of Excellence for the Dynamics of Language (Australian National University, Grant CE140100041, S.J.G.). As part of the GlottoBank project (http://glottobank.org), this work was further supported by the Department of Linguistic and Cultural Evolution of the Max Planck Institute for the Science of Human History (Jena) and the Royal Society of New Zealand (Marsden Fund, Grant 13-UOA-121).
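
The kind of sequence comparison described in this abstract can be illustrated with a minimal global pairwise alignment. The sketch below is a plain Needleman-Wunsch implementation and deliberately does not use LingPy's API; the function name `align` and the scoring values (match, mismatch, gap) are assumptions chosen for illustration, whereas real tools use linguistically informed sound-class scoring.

```python
def align(a, b, match=1, mismatch=-1, gap=-1):
    """Globally align two segment sequences (Needleman-Wunsch).

    Returns the alignment score and the two aligned sequences,
    with "-" marking a gap. Scoring values are illustrative only.
    """
    n, m = len(a), len(b)
    score = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(1, n + 1):
        score[i][0] = i * gap
    for j in range(1, m + 1):
        score[0][j] = j * gap

    def sub(i, j):
        # Substitution score for aligning a[i-1] with b[j-1].
        return match if a[i - 1] == b[j - 1] else mismatch

    for i in range(1, n + 1):
        for j in range(1, m + 1):
            score[i][j] = max(
                score[i - 1][j - 1] + sub(i, j),
                score[i - 1][j] + gap,
                score[i][j - 1] + gap,
            )

    # Trace back from the bottom-right cell to recover one optimal alignment.
    out_a, out_b = [], []
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and score[i][j] == score[i - 1][j - 1] + sub(i, j):
            out_a.append(a[i - 1])
            out_b.append(b[j - 1])
            i, j = i - 1, j - 1
        elif i > 0 and score[i][j] == score[i - 1][j] + gap:
            out_a.append(a[i - 1])
            out_b.append("-")
            i -= 1
        else:
            out_a.append("-")
            out_b.append(b[j - 1])
            j -= 1
    return score[n][m], out_a[::-1], out_b[::-1]


total, top, bottom = align(list("hand"), list("and"))
print(total, "".join(top), "".join(bottom))
```

Passing lists of segments rather than raw strings keeps the sketch usable for tokenized phonetic transcriptions, where one segment may span several characters.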

    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Evolutionary processes can be found in almost any historical, i.e. evolving, system that copies from the past with errors. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches. Given a well-chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close together, such as the well-known HOX cluster. These are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. The expansion of gene clusters by unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still helps in extracting phylogenetic information from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster.
This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in other fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Yet, this does not hinder studies of language evolution using automated tools. As the number of large digital corpora available increased, so did the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools for linguistic purposes. Here, we use jAlign to calculate bigram alignments, i.e. alignments produced by an algorithm that takes the adjacency of letters into account. Its performance is tested in different cognate recognition tasks. One major obstacle with pairwise alignments is their systematic errors, such as the underestimation of gaps and their misplacement.
Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and thus can overcome the problem of correct gap placement. They can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
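
The role of additivity in this abstract can be made concrete with the standard four-point condition: a distance matrix is additive (representable as path lengths on a tree) exactly when, for every four taxa, the two largest of the three pairwise sums are equal. The sketch below is an illustration under assumptions, not the thesis's method: the example tree, the choice of the square root as the monotonic, subadditive distortion ζ, and the function names are hypothetical, and ζ is taken as known here rather than computed by optimization.

```python
import math
from itertools import combinations


def is_additive(dist, tol=1e-9):
    """Check the four-point condition on a symmetric distance matrix."""
    for i, j, k, l in combinations(range(len(dist)), 4):
        sums = sorted([
            dist[i][j] + dist[k][l],
            dist[i][k] + dist[j][l],
            dist[i][l] + dist[j][k],
        ])
        # The two largest sums must coincide for a tree metric.
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True


def transform(dist, func):
    """Apply a function to every entry of a distance matrix."""
    return [[func(x) for x in row] for row in dist]


# Path lengths on a quartet tree AB|CD with leaf branches 1, 2, 3, 1
# and an internal branch of length 2 (an assumed toy example).
tree_distances = [
    [0, 3, 6, 4],
    [3, 0, 7, 5],
    [6, 7, 0, 4],
    [4, 5, 4, 0],
]

# Observed distances distorted by an assumed zeta(x) = sqrt(x).
observed = transform(tree_distances, math.sqrt)
# Undoing the distortion with the inverse of zeta, here x -> x**2.
recovered = transform(observed, lambda x: x * x)

print(is_additive(tree_distances), is_additive(observed), is_additive(recovered))
```

In this toy case the distorted matrix fails the four-point condition, while applying the inverse transformation restores additivity, which is the situation the abstract describes for a distance matrix D and ζ^−1(D).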