
    Studying Evolutionary Change: Transdisciplinary Advances in Understanding and Measuring Evolution

    Evolutionary processes can be found in almost any historical, i.e. evolving, system that copies from the past with errors. Well-studied examples originate not only in evolutionary biology but also in historical linguistics. Yet an approach that would bind together studies of such evolving systems is still elusive. This thesis is an attempt to narrow this gap to some extent. An evolving system can be described using characters that identify its changing features. While the problem of a proper choice of characters is beyond the scope of this thesis and remains in the hands of experts, we concern ourselves with some theoretical as well as data-driven approaches. Given a well-chosen set of characters describing a system of different entities such as homologous genes, i.e. genes of the same origin in different species, we can build a phylogenetic tree. Consider the special case of gene clusters containing paralogous genes, i.e. genes of the same origin within a species, usually located close together, such as the well-known HOX cluster. These clusters are formed by stepwise duplication of their members, often involving unequal crossing over that forms hybrid genes. Gene conversion and possibly other mechanisms of concerted evolution further obfuscate phylogenetic relationships. Hence, it is very difficult or even impossible to disentangle the detailed history of gene duplications in gene clusters. The expansion of gene clusters through unequal crossing over, as proposed by Walter Gehring, leads to distinctive patterns of genetic distances. We show that this special class of distances still helps in extracting phylogenetic information from the data. Disregarding genome rearrangements, we find that the shortest Hamiltonian path then coincides with the ordering of paralogous genes in a cluster.
This observation can be used to detect ancient genomic rearrangements of gene clusters and to distinguish gene clusters whose evolution was dominated by unequal crossing over within genes from those that expanded through other mechanisms. While the evolution of DNA or protein sequences is well studied and can be formally described, we find that this does not hold for other systems such as language evolution. This is due to a lack of detectable mechanisms that drive the evolutionary processes in other fields. Hence, it is hard to quantify distances between entities, e.g. languages, and therefore the characters describing them. Starting out with distortions of distances, we first see that poor choices of the distance measure can lead to incorrect phylogenies. Given that phylogenetic inference requires additive metrics, we can infer the correct phylogeny from a distance matrix D if there is a monotonic, subadditive function ζ such that ζ^−1(D) is additive. We compute the metric-preserving transformation ζ as the solution of an optimization problem. This result shows that the problem of phylogeny reconstruction is well defined even if a detailed mechanistic model of the evolutionary process is missing. Yet, this does not hinder studies of language evolution using automated tools. As the number of large digital corpora available increased, so did the possibilities to study them automatically. The obvious parallels between historical linguistics and phylogenetics have led to many studies adapting bioinformatics tools to linguistic purposes. Here, we use jAlign to calculate bigram alignments, i.e. alignments computed by an algorithm that takes the adjacency of letters into account. Its performance is tested in different cognate recognition tasks. One major obstacle with pairwise alignments is the systematic errors they make, such as the underestimation of gaps and their misplacement.
Applying multiple sequence alignments instead of a pairwise algorithm implicitly includes more evolutionary information and thus can overcome the problem of correct gap placement. They can be seen as a generalization of the string-to-string edit problem to more than two strings. With the steady increase in computational power, exact dynamic programming solutions have become feasible in practice also for 3- and 4-way alignments. For the pairwise (2-way) case, there is a clear distinction between local and global alignments. As more sequences are considered, this distinction, which can in fact be made independently for both ends of each sequence, gives rise to a rich set of partially local alignment problems. So far these have remained largely unexplored. Thus, a general formal framework that gives rise to a classification of partially local alignment problems is introduced. It leads to a generic scheme that guides the principled design of exact dynamic programming solutions for particular partially local alignment problems.
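
    The additivity requirement in the abstract above can be made concrete with the four-point condition: a distance matrix is additive (tree-like) if, for every quartet of taxa, the two largest of the three pairwise sums are equal. The sketch below is illustrative only and is not the thesis's method. The tree, the choice ζ = square root (one monotonic, subadditive function), and the helper names `is_additive` and `transform` are assumptions, and ζ is fixed by hand here rather than computed by optimization.

```python
from itertools import combinations
import math

def is_additive(D, tol=1e-9):
    """Four-point condition: for every quartet, the two largest of the
    three pairwise sums must be equal (up to tol)."""
    for i, j, k, l in combinations(range(len(D)), 4):
        sums = sorted([D[i][j] + D[k][l], D[i][k] + D[j][l], D[i][l] + D[j][k]])
        if abs(sums[2] - sums[1]) > tol:
            return False
    return True

def transform(D, f):
    """Apply f to every entry of the matrix D."""
    return [[f(x) for x in row] for row in D]

# Hypothetical tree ((A,B),(C,D)) with leaf branches 1, 2, 3, 4 and internal branch 1.
tree_D = [
    [0, 3, 5, 6],
    [3, 0, 6, 7],
    [5, 6, 0, 7],
    [6, 7, 7, 0],
]

observed = transform(tree_D, math.sqrt)          # distances distorted by zeta = sqrt (assumed)
restored = transform(observed, lambda x: x * x)  # zeta^-1 applied to the observed distances

print(is_additive(tree_D), is_additive(observed), is_additive(restored))
```

    In this toy case the distorted matrix fails the four-point condition, while applying the inverse transformation recovers an additive matrix from which the tree could be inferred.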

    APPLICATION OF INFORMATION SYSTEMS AND TOOLS IN BIOINFORMATICS

    Scientific data has never been produced and disseminated as quickly as it is today. Modern sequencing technologies make it possible to obtain the genome of a specific organism in a few days, and the genome of a bacterial organism in less than a day, so researchers in the life sciences face a huge amount of data that needs to be analyzed. As a result, various fields of science are converging and giving rise to new disciplines. Bioinformatics is one of these: a scientific discipline that has developed actively over the past decades and uses IT tools and methods to solve problems related to the study of biological processes. In particular, the development of new algorithms and tools, the creation of new databases, and the integration of extremely large amounts of data play a crucial role in the field. The rapid development of bioinformatics has made modern biological research possible. Bioinformatics can help a biologist extract valuable information from biological data by using tools to process them. Although bioinformatics is a relatively new discipline, various web and desktop tools already exist, most of which are freely available. This review article provides a thorough overview of some of the tools for biological analysis available to a biologist, and describes the key role of information systems in this interdisciplinary field.

    Algorithmic advancements in Computational Historical Linguistics

    The use of computational methods in historical linguistics has seen a large boost in recent years. The increasing availability of machine-readable data and the growing power of computers have fostered this development. While the computational methods used in this research stem from different scientific disciplines, many tools from computational biology have found their way into it. Drawing inspiration from advancements in related fields, this thesis aims at improving existing computational methods in different areas of computational historical linguistics. Using advancements from machine learning and natural language processing research, I present an updated training regime for cognate detection algorithms. Besides achieving state-of-the-art performance in a cognate clustering task, the updated training scheme considerably improves computation time. Following up on these results, I develop a novel combination of tools from bioinformatics and historical linguistics. By defining an explicit model of sound evolution, I include the notion of evolutionary time in a cognate detection task. The resulting posterior distributions are used to evaluate the model on a standard cognate detection task. A standard problem in phylogenetic research is the inference of a tree. Current quasi "industry-standard" methods use the classical Metropolis-Hastings algorithm. However, this algorithm is known to be rather inefficient for high-dimensional and correlated data. To solve this problem, I present in the last chapter an algorithm that uses Hamiltonian dynamics.

    Mathematical modeling of evolutionary changes of oligonucleotide frequency patterns of bacterial genomes for genome-scale phylogenetic inferences

    Thanks to advances in next-generation sequencing, modern phylogenetic studies can benefit from an analysis of complete genome sequences of various microorganisms. Evolutionary inferences based on genome-scale analysis are believed to be more accurate than gene-based ones. However, the computational complexity of current phylogenomic procedures and the lack of reliable annotation- and alignment-free evolutionary models keep microbiologists from making wider use of these opportunities. For example, the super-matrix approach of phylogenomics requires identification of clusters of orthologous genes in the compared genomes, followed by alignment of numerous sequences, before multiple trees inferred by traditional phylogenetic tools can be reconciled. In fact, the approach potentially multiplies the problems of gene annotation and sequence alignment, not to mention the computational difficulty and laboriousness of the methods. In this research, we found that an alignment- and annotation-free method based on comparison of oligonucleotide usage patterns (OUP) calculated for genome-scale DNA sequences allowed fast inference of phylogenetic trees. These trees were also congruent with the corresponding whole-genome supermatrix trees in terms of tree topology and branch lengths. Validation and benchmarking tests for OUP phylogenomics were based on comparisons to the current literature and on artificially created sequences with known phylogeny. It was demonstrated that the OUP diversification between taxa was driven by global adjustments of codon usage to fit fluctuating tRNA concentrations that were well aligned with the species evolution. A web-based program to perform OUP-based phylogenomics was released at http://swphylo.bi.up.ac.za/. Applicability of the tool was demonstrated for different taxa from species to family levels. Distinguishing between closely related taxonomic units may be improved by providing the program with alignments of marker protein sequences, e.g. gyrA. Thesis (PhD), University of Pretoria, 2018. Biochemistry.
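
    The general idea of an alignment-free comparison based on oligonucleotide usage can be sketched as follows. This is a minimal illustration, not the OUP method of the thesis: the word length k = 4, the plain relative-frequency profile, the Euclidean distance, and the function names `kmer_profile` and `profile_distance` are all assumptions made for the example.

```python
from collections import Counter
from itertools import product
import math

def kmer_profile(seq, k=4):
    """Relative frequencies of all 4**k DNA words of length k in seq.
    Windows containing characters other than A, C, G, T are ignored."""
    seq = seq.upper()
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    total = sum(counts[m] for m in kmers)
    return [counts[m] / total if total else 0.0 for m in kmers]

def profile_distance(p, q):
    """Euclidean distance between two frequency profiles."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(p, q)))

# Toy sequences (hypothetical, far shorter than real genomes).
genome_a = "ACGTACGTTGCAACGTGGCCAATT"
genome_b = "ACGTACGTTGCAACGTGGCCTTAA"
print(profile_distance(kmer_profile(genome_a), kmer_profile(genome_b)))
```

    Pairwise distances of this kind, computed for all genomes of interest, could then be passed to any distance-based tree-building method.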

    Aligning Sequences by Minimum Description Length

    This paper presents a new information theoretic framework for aligning sequences in bioinformatics. A transmitter compresses a set of sequences by constructing a regular expression that describes the regions of similarity in the sequences. To retrieve the original set of sequences, a receiver generates all strings that match the expression. An alignment algorithm uses minimum description length to encode and explore alternative expressions; the expression with the shortest encoding provides the best overall alignment. When two substrings contain letters that are similar according to a substitution matrix, a code length function based on conditional probabilities defined by the matrix will encode the substrings with fewer bits. In one experiment, alignments produced with this new method were found to be comparable to alignments from [inline graphic: 1687-4153-2007-72936-i1.gif]. A second experiment measured the accuracy of the new method on pairwise alignments of sequences from the BAliBASE alignment benchmark.
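
    The statement that similar letters are encoded with fewer bits follows from the standard relation between probability and code length: a symbol with conditional probability P(b | a) costs -log2 P(b | a) bits. The sketch below illustrates only that relation. The conditional probability table is hypothetical (0.7 for a match, 0.1 for each mismatch) and is not derived from any real substitution matrix, and `code_length_bits` is an illustrative name, not the paper's code length function.

```python
import math

ALPHABET = "ACGT"

# Hypothetical conditional probabilities P(b | a): 0.7 for identical letters,
# 0.1 for each of the three possible substitutions (each row sums to 1).
COND_PROB = {a: {b: (0.7 if a == b else 0.1) for b in ALPHABET} for a in ALPHABET}

def code_length_bits(x, y, cond_prob=COND_PROB):
    """Bits needed to encode y letter by letter when x is already known."""
    if len(x) != len(y):
        raise ValueError("substrings must have equal length")
    return sum(-math.log2(cond_prob[a][b]) for a, b in zip(x, y))

identical = code_length_bits("ACGT", "ACGT")   # all four letters match
one_change = code_length_bits("ACGT", "ACGA")  # one substitution
independent = 4 * math.log2(len(ALPHABET))     # 2 bits per letter without using x

print(identical, one_change, independent)
```

    Under these assumed probabilities, encoding a substring relative to a similar one is cheaper than encoding it independently, which is the property a minimum description length alignment exploits.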

    Variation in form and meaning across the Japonic language family: With a focus on the Ryukyuan languages


    The origins and spread of domestic horses from the Western Eurasian steppes

    Domestication of horses fundamentally transformed long-range mobility and warfare [1]. However, modern domesticated breeds do not descend from the earliest domestic horse lineage associated with archaeological evidence of bridling, milking and corralling [2,3,4] at Botai, Central Asia around 3500 BC [3]. Other longstanding candidate regions for horse domestication, such as Iberia [5] and Anatolia [6], have also recently been challenged. Thus, the genetic, geographic and temporal origins of modern domestic horses have remained unknown. Here we pinpoint the Western Eurasian steppes, especially the lower Volga-Don region, as the homeland of modern domestic horses. Furthermore, we map the population changes accompanying domestication from 273 ancient horse genomes. This reveals that modern domestic horses ultimately replaced almost all other local populations as they expanded rapidly across Eurasia from about 2000 BC, synchronously with equestrian material culture, including Sintashta spoke-wheeled chariots. We find that equestrianism involved strong selection for critical locomotor and behavioural adaptations at the GSDMC and ZFPM1 genes. Our results reject the commonly held association [7] between horseback riding and the massive expansion of Yamnaya steppe pastoralists into Europe around 3000 BC [8,9] driving the spread of Indo-European languages [10]. This contrasts with the scenario in Asia where Indo-Iranian languages, chariots and horses spread together, following the early second millennium BC Sintashta culture [11,12].