Computational phylogenetics and the classification of South American languages
In recent years, South Americanist linguists have embraced computational phylogenetic methods to resolve the numerous outstanding questions about the genealogical relationships among the languages of the continent. We provide a critical review of the methods and language classification results that have accumulated thus far, emphasizing the superiority of character-based methods over distance-based ones and the importance of developing adequate comparative datasets for producing well-resolved classifications.
On the accuracy of language trees
Historical linguistics aims at inferring the most likely language
phylogenetic tree starting from information concerning the evolutionary
relatedness of languages. The available information typically consists of lists of
homologous (lexical, phonological, syntactic) features or characters for many
different languages.
From this perspective the reconstruction of language trees is an example of
an inverse problem: starting from present, incomplete and often noisy,
information, one aims at inferring the most likely past evolutionary history. A
fundamental issue in inverse problems is the evaluation of the inference made.
A standard way of dealing with this question is to generate data with
artificial models in order to have full access to the evolutionary process one
is going to infer. This procedure presents an intrinsic limitation: when
dealing with real data sets, one typically does not know which model of
evolution is the most suitable for them. A possible way out is to compare
algorithmic inference with expert classifications. This is the point of view we
take here by conducting a thorough survey of the accuracy of reconstruction
methods as compared with the Ethnologue expert classifications. We focus in
particular on state-of-the-art distance-based methods for phylogeny
reconstruction using worldwide linguistic databases.
In order to assess the accuracy of the inferred trees we introduce and
characterize two generalizations of standard definitions of distances between
trees. Based on these scores we quantify the relative performances of the
distance-based algorithms considered. Further we quantify how the completeness
and the coverage of the available databases affect the accuracy of the
reconstruction. Finally we draw some conclusions about where the accuracy of
the reconstructions in historical linguistics stands and about the leading
directions to improve it.
Comment: 36 pages, 14 figures
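As a point of reference for the tree-distance scores mentioned in this abstract, the following is a minimal sketch of the standard Robinson-Foulds distance between two unrooted trees over the same leaves. It is not the generalized distances the authors introduce, which the abstract does not specify. The nested-tuple tree encoding, the function names, and the language labels are assumptions made for illustration.

```python
def leaf_sets(tree, out):
    """Return the leaf set under `tree`; append every internal node's leaf set to `out`."""
    if isinstance(tree, str):
        return frozenset([tree])
    leaves = frozenset().union(*(leaf_sets(child, out) for child in tree))
    out.append(leaves)
    return leaves


def splits(tree):
    """Non-trivial bipartitions of an unrooted tree given as nested tuples of leaf names."""
    clades = []
    all_leaves = leaf_sets(tree, clades)
    anchor = min(all_leaves)  # fixed reference leaf, so each split has one canonical side
    result = set()
    for clade in clades:
        # Always store the side of the split that does NOT contain the anchor leaf.
        side = all_leaves - clade if anchor in clade else clade
        if 2 <= len(side) <= len(all_leaves) - 2:
            result.add(side)
    return result


def robinson_foulds(tree_a, tree_b):
    """Number of splits found in exactly one of the two trees."""
    return len(splits(tree_a) ^ splits(tree_b))


# Hypothetical example: an "expert" tree and an "inferred" tree over the same labels.
expert = (("Spanish", "Italian"), ("German", "Dutch"))
inferred = (("Spanish", "German"), ("Italian", "Dutch"))
print(robinson_foulds(expert, inferred))
```

A distance of zero means both trees imply the same set of splits, regardless of where the nested-tuple representation places the root.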
EM for phylogenetic topology reconstruction on non-homogeneous data
Background: The reconstruction of the phylogenetic tree topology of four taxa
remains one of the main challenges in phylogenetics. Its
difficulties lie in considering evolutionary models that are not too restrictive and in
correctly dealing with the long-branch attraction problem. The correct
reconstruction of 4-taxon trees is crucial for making quartet-based methods
work and being able to recover large phylogenies.
Results: In this paper we consider an expectation-maximization method for
maximizing the likelihood of (time nonhomogeneous) evolutionary Markov models
on trees. We study its success on reconstructing 4-taxon topologies and its
performance as input method in quartet-based phylogenetic reconstruction
methods such as QFIT and QuartetSuite. Our results show that the method
proposed here outperforms neighbor-joining and the usual (time-homogeneous
continuous-time) maximum likelihood methods on 4-leaved trees with
among-lineage instantaneous rate heterogeneity, and performs similarly to the usual
continuous-time maximum likelihood when the data satisfy the assumptions of both
methods.
Conclusions: The method presented in this paper is well suited for
reconstructing the topology of any number of taxa via quartet-based methods and
is highly accurate, especially regarding largely divergent trees and time
nonhomogeneous data.
Comment: 1 main file: 6 Figures and 2 Tables. 1 Additional file with 2 Figures
and 2 Tables. To appear in "BMC Evolutionary Biology"
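To make the 4-taxon topology problem concrete, here is a minimal sketch of the classical four-point method, a simple distance-based way to choose among the three possible quartet topologies. It is not the EM method of the paper, and it is shown only as a baseline of the kind such methods are compared against. The function name, the taxon labels "A" to "D", and the branch lengths in the example are assumptions.

```python
def four_point_topology(dist):
    """Pick the quartet topology whose two within-pair distances have the smallest sum.

    `dist` maps frozenset({x, y}) to the distance between taxa x and y,
    for the four taxa "A", "B", "C", "D".
    """
    def d(x, y):
        return dist[frozenset((x, y))]

    sums = {
        "AB|CD": d("A", "B") + d("C", "D"),
        "AC|BD": d("A", "C") + d("B", "D"),
        "AD|BC": d("A", "D") + d("B", "C"),
    }
    # For distances that are additive on a tree, the true topology gives the smallest sum.
    return min(sums, key=sums.get)


# Hypothetical additive distances on the tree ((A,B),(C,D)):
# pendant branches A=0.1, B=0.2, C=0.3, D=0.4 and internal branch 0.5.
pendant = {"A": 0.1, "B": 0.2, "C": 0.3, "D": 0.4}
internal = 0.5
dist = {}
for x in "ABCD":
    for y in "ABCD":
        if x < y:
            crosses = (x in "AB") != (y in "AB")  # path uses the internal branch
            dist[frozenset((x, y))] = pendant[x] + pendant[y] + (internal if crosses else 0.0)

print(four_point_topology(dist))  # prints AB|CD
```

With distances estimated from finite data the three sums are noisy, which is where long branches can mislead a distance-based choice.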
Reliability analysis of reconstructing phylogenies under long branch attraction conditions
Master's Project (M.S.) University of Alaska Fairbanks, 2018.
In this simulation study we examined the reliability of three phylogenetic reconstruction techniques in a long branch attraction (LBA) situation: Maximum Parsimony (MP), Neighbor Joining (NJ), and Maximum Likelihood. Data were simulated under five DNA substitution models (JC, K2P, F81, HKY, and GTR) for four different taxa. Two branch length parameters of the four-taxon trees, ranging from 0.05 to 0.75 in increments of 0.02, were used to simulate DNA data under each model. For each model we simulated DNA sequences with 100, 250, 500 and 1000 sites, with 100 replicates. Given enough data, the maximum likelihood technique is the most reliable of the three methods examined in this study for reconstructing phylogenies under LBA conditions. We also find that MP is the most sensitive to LBA conditions and that Neighbor Joining performs well under LBA conditions compared to MP.
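The kind of simulation described in this abstract can be sketched as follows, assuming the simplest of the five models (JC) and a parsimony-style count of informative site patterns. This is an illustrative sketch, not the project's actual code. The tree shape ((A,B),(C,D)), the choice of A and C as the long branches, the internal branch length, and all function names are assumptions. Only the range of branch lengths (0.05 to 0.75) and the sequence lengths come from the abstract.

```python
import math
import random

BASES = "ACGT"


def jc_change_prob(t):
    """Probability that the two ends of a branch of length t differ under JC."""
    return 0.75 * (1.0 - math.exp(-4.0 * t / 3.0))


def evolve(base, t, rng):
    """Evolve one nucleotide along a branch of length t (expected substitutions per site)."""
    if rng.random() < jc_change_prob(t):
        return rng.choice([b for b in BASES if b != base])
    return base


def simulate_site(long_t, short_t, internal_t, rng):
    """One site on ((A,B),(C,D)), with A and C on long branches (an assumed LBA setup)."""
    u = rng.choice(BASES)            # state at the internal node joining A and B
    v = evolve(u, internal_t, rng)   # state at the internal node joining C and D
    return (evolve(u, long_t, rng), evolve(u, short_t, rng),
            evolve(v, long_t, rng), evolve(v, short_t, rng))


def parsimony_support(sites):
    """Count the two-state site patterns that favor each quartet topology."""
    counts = {"AB|CD": 0, "AC|BD": 0, "AD|BC": 0}
    for a, b, c, d in sites:
        if a == b and c == d and a != c:
            counts["AB|CD"] += 1
        elif a == c and b == d and a != b:
            counts["AC|BD"] += 1
        elif a == d and b == c and a != b:
            counts["AD|BC"] += 1
    return counts


rng = random.Random(2018)  # fixed seed so the run is repeatable
sites = [simulate_site(0.75, 0.05, 0.05, rng) for _ in range(1000)]
support = parsimony_support(sites)
print(support, "-> parsimony choice:", max(support, key=support.get))
```

The true topology here is AB|CD, so a parsimony choice of AC|BD in a given replicate would be an instance of the two long branches being grouped together. Repeating the run over replicates and branch lengths gives a reliability estimate of the kind the study reports.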
Phylogenetic mixtures: Concentration of measure in the large-tree limit
The reconstruction of phylogenies from DNA or protein sequences is a major
task of computational evolutionary biology. Common phenomena, notably
variations in mutation rates across genomes and incongruences between gene
lineage histories, often make it necessary to model molecular data as
originating from a mixture of phylogenies. Such mixed models play an
increasingly important role in practice. Using concentration of measure
techniques, we show that mixtures of large trees are typically identifiable. We
also derive sequence-length requirements for high-probability reconstruction.
Comment: Published at http://dx.doi.org/10.1214/11-AAP837 in the Annals of
Applied Probability (http://www.imstat.org/aap/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
A Bayesian approach for inferring the dynamics of partially observed endemic infectious diseases from space-time-genetic data
We describe a statistical framework for reconstructing the sequence of transmission events between observed cases of an endemic infectious disease using genetic, temporal and spatial information. Previous approaches to reconstructing transmission trees have assumed all infections in the study area originated from a single introduction and that a large fraction of cases were observed. There are as yet no approaches appropriate for endemic situations in which a disease is already well established in a host population and in which there may be multiple origins of infection, or that can enumerate unobserved infections missing from the sample. Our proposed framework addresses these shortcomings, enabling reconstruction of partially observed transmission trees and estimating the number of cases missing from the sample. Analyses of simulated datasets show the method to be accurate in identifying direct transmissions, while introductions and transmissions via one or more unsampled intermediate cases could be identified at high to moderate levels of case detection. When applied to partial genome sequences of rabies virus sampled from an endemic region of South Africa, our method reveals several distinct transmission cycles with little contact between them, and direct transmission over long distances, suggesting significant anthropogenic influence in the movement of infected dogs.