    Temporal Phylogenetic Networks and Logic Programming

    The concept of a temporal phylogenetic network is a mathematical model of evolution of a family of natural languages. It takes into account the fact that languages can trade their characteristics with each other when linguistic communities are in contact, and also that a contact is only possible when the languages are spoken at the same time. We show how computational methods of answer set programming and constraint logic programming can be used to generate plausible conjectures about contacts between prehistoric linguistic communities, and illustrate our approach by applying it to the evolutionary history of Indo-European languages. To appear in Theory and Practice of Logic Programming (TPLP)
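
    The temporal constraint described in this abstract (a contact is only possible between languages spoken at the same time) can be illustrated with a minimal Python sketch. It is not the answer set programming or constraint logic programming encoding used in the paper. The names `Lifespan`, `contact_possible`, and `admissible_contacts` are hypothetical, and modelling each language as a single closed time interval is a simplifying assumption made here.

```python
from itertools import combinations
from typing import Dict, List, NamedTuple, Tuple


class Lifespan(NamedTuple):
    """Assumed model: a language is spoken during one closed interval."""
    start: float
    end: float


def contact_possible(a: Lifespan, b: Lifespan) -> bool:
    """Two languages can be in contact only if their intervals overlap."""
    return max(a.start, b.start) <= min(a.end, b.end)


def admissible_contacts(lifespans: Dict[str, Lifespan]) -> List[Tuple[str, str]]:
    """List every unordered pair of languages whose lifespans overlap."""
    return [
        (x, y)
        for x, y in combinations(sorted(lifespans), 2)
        if contact_possible(lifespans[x], lifespans[y])
    ]


# Made-up intervals for three placeholder languages (not real data).
spans = {
    "A": Lifespan(0, 100),
    "B": Lifespan(50, 150),
    "C": Lifespan(200, 300),
}
candidates = admissible_contacts(spans)
```

    In this toy setting only the pair A and B qualifies as a candidate contact, because C is spoken after both of the others have ended.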

    Networking Phylogeny for Indo-European and Austronesian Languages

    Harnessing cognitive abilities of many individuals, a language evolves upon their mutual interactions establishing a persistent social environment to which language is closely attuned. Human history is encoded in the rich sets of linguistic data by means of symmetry patterns that are not always feasibly represented by trees. Here we use the methods developed in the study of complex networks to decipher accurately symmetry records on the language phylogeny of the Indo-European and the Austronesian language families, considering, in both cases, the samples of fifty different languages. In particular, we support the Anatolian theory of Indo-European origin and the ‘express train’ model of Austronesian expansion from South-East Asia, with an essential role for the Batanes islands located between the Philippines and Taiwan

    The evolutionary approach to history: sociocultural phylogenetics


    PHYLO-ASP: Phylogenetic Systematics with Answer Set Programming

    This note summarizes the use of Answer Set Programming to solve various computational problems to infer phylogenetic trees and phylogenetic networks, and discusses its applicability and effectiveness on some real taxa

    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Evolutionary processes can be found in almost any historical, i.e. evolving, system that erroneously copies from the past. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches. Having a well-chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close together, such as the well-known HOX cluster. These are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. The expansion of gene clusters by unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still helps in extracting phylogenetic information from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster. 
This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in other fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Yet, this does not hinder studies of language evolution using automated tools. As the number of large digital corpora available has increased, so have the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools to linguistic purposes. Here, we use jAlign to calculate bigram alignments, i.e. alignments computed by an algorithm that takes the adjacency of letters into account. Its performance is tested in different cognate recognition tasks. One major obstacle with pairwise alignments is the systematic errors they make, such as the underestimation of gaps and their misplacement. 
Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and thus can overcome the problem of correct gap placement. They can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have also become feasible in practice for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems
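
    The distinction between global and local pairwise alignment mentioned in this abstract can be made concrete with a short Python sketch of the two textbook dynamic programs (Needleman-Wunsch for the global case, Smith-Waterman for the local case). This is not jAlign and not the partially local framework of the thesis. The function names and the unit scoring scheme (match +1, mismatch -1, gap -1) are assumptions chosen for illustration.

```python
def global_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Needleman-Wunsch score: both strings must be aligned end to end."""
    # Row 0: aligning the empty prefix of `a` against prefixes of `b` costs only gaps.
    prev = [j * gap for j in range(len(b) + 1)]
    for i in range(1, len(a) + 1):
        curr = [i * gap]  # column 0: prefix of `a` against the empty prefix of `b`
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            curr.append(max(prev[j - 1] + sub,   # align a[i-1] with b[j-1]
                            prev[j] + gap,       # gap in b
                            curr[j - 1] + gap))  # gap in a
        prev = curr
    return prev[-1]


def local_score(a: str, b: str, match: int = 1, mismatch: int = -1, gap: int = -1) -> int:
    """Smith-Waterman score: best alignment over all pairs of substrings."""
    best = 0
    prev = [0] * (len(b) + 1)  # a local alignment may start anywhere at no cost
    for i in range(1, len(a) + 1):
        curr = [0]
        for j in range(1, len(b) + 1):
            sub = match if a[i - 1] == b[j - 1] else mismatch
            cell = max(0,                  # restart: drop everything before this cell
                       prev[j - 1] + sub,
                       prev[j] + gap,
                       curr[j - 1] + gap)
            curr.append(cell)
            best = max(best, cell)         # a local alignment may also end anywhere
        prev = curr
    return best
```

    The only differences between the two functions are the boundary initialisation, the extra `0` option in the recurrence, and where the final score is read off. These are exactly the choices that, per the abstract, can be made independently for each end of each sequence once more than two sequences are involved.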

    Information-theoretic causal inference of lexical flow

    This volume seeks to infer large phylogenetic networks from phonetically encoded lexical data and contribute in this way to the historical study of language varieties. The technical step that enables progress in this case is the use of causal inference algorithms. Sample sets of words from language varieties are preprocessed into automatically inferred cognate sets, and then modeled as information-theoretic variables based on an intuitive measure of cognate overlap. Causal inference is then applied to these variables in order to determine the existence and direction of influence among the varieties. The directed arcs in the resulting graph structures can be interpreted as reflecting the existence and directionality of lexical flow, a unified model which subsumes inheritance and borrowing as the two main ways of transmission that shape the basic lexicon of languages. A flow-based separation criterion and domain-specific directionality detection criteria are developed to make existing causal inference algorithms more robust against imperfect cognacy data, giving rise to two new algorithms. The Phylogenetic Lexical Flow Inference (PLFI) algorithm requires lexical features of proto-languages to be reconstructed in advance, but yields fully general phylogenetic networks, whereas the more complex Contact Lexical Flow Inference (CLFI) algorithm treats proto-languages as hidden common causes, and only returns hypotheses of historical contact situations between attested languages. The algorithms are evaluated both against a large lexical database of Northern Eurasia spanning many language families, and against simulated data generated by a new model of language contact that builds on the opening and closing of directional contact channels as primary evolutionary events. The algorithms are found to infer the existence of contacts very reliably, whereas the inference of directionality remains difficult. 
This currently limits the new algorithms to a role as exploratory tools for quickly detecting salient patterns in large lexical datasets, but it should soon be possible for the framework to be enhanced e.g. by confidence values for each directionality decision

    Algorithmic advancements in Computational Historical Linguistics

    The use of computational methods in historical linguistics has seen a large boost in recent years. The increasing availability of machine-readable data and the growing power of computers have fostered this development. While the computational methods used in this research stem from different scientific disciplines, many tools from computational biology have found their way into it. Drawing inspiration from advancements in related fields, this thesis aims at improving existing computational methods in different areas of computational historical linguistics. Using advancements from machine learning and natural language processing research, I present an updated training regime for cognate detection algorithms. Besides achieving state-of-the-art performance in a cognate clustering task, the updated training scheme considerably improves computation time. Following up on these results, I develop a novel combination of tools from bioinformatics and historical linguistics. By defining an explicit model of sound evolution, I include the notion of evolutionary time in a cognate detection task. The resulting posterior distributions are used to evaluate the model on a standard cognate detection task. A standard problem in phylogenetic research is the inference of a tree. Current quasi "industry-standard" methods use the classical Metropolis-Hastings algorithm. However, this algorithm is known to be rather inefficient for high-dimensional and correlated data. To solve this problem, I present, in the last chapter, an algorithm which uses Hamiltonian dynamics
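
    For readers unfamiliar with the classical Metropolis-Hastings algorithm named in this abstract, the following Python sketch shows its random-walk variant on a one-dimensional toy target. It is not the tree sampler or the Hamiltonian method of the thesis. The function names, the Gaussian proposal, the step size, and the standard normal target are all assumptions made for illustration.

```python
import math
import random
from typing import Callable, List


def metropolis_hastings(log_density: Callable[[float], float],
                        start: float,
                        n_samples: int,
                        step: float = 1.0,
                        seed: int = 0) -> List[float]:
    """Random-walk Metropolis-Hastings for a 1D unnormalised log density."""
    rng = random.Random(seed)  # local generator so runs are reproducible
    x = start
    log_p = log_density(x)
    samples = []
    for _ in range(n_samples):
        # Symmetric Gaussian proposal, so the proposal ratio cancels.
        proposal = x + rng.gauss(0.0, step)
        log_p_new = log_density(proposal)
        # Accept with probability min(1, p(proposal) / p(x)), compared in log space.
        # 1 - random() lies in (0, 1], which keeps the logarithm finite.
        if math.log(1.0 - rng.random()) < log_p_new - log_p:
            x, log_p = proposal, log_p_new
        samples.append(x)  # a rejected proposal repeats the current state
    return samples


def standard_normal_log_density(x: float) -> float:
    """Toy target (an assumption): standard normal, up to an additive constant."""
    return -0.5 * x * x


samples = metropolis_hastings(standard_normal_log_density, start=0.0, n_samples=20000)
sample_mean = sum(samples) / len(samples)
```

    Each proposal perturbs the current state only locally, which is one common explanation for why such samplers mix slowly on high-dimensional, correlated targets of the kind the abstract mentions.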
