Polyhedral geometry of Phylogenetic Rogue Taxa
It is well known among phylogeneticists that adding an extra taxon (e.g.
species) to a data set can alter the structure of the optimal phylogenetic tree
in surprising ways. However, little is known about this "rogue taxon" effect.
In this paper we characterize the behavior of balanced minimum evolution (BME)
phylogenetics on data sets of this type using tools from polyhedral geometry.
First we show that for any distance matrix there exist distances to a "rogue
taxon" such that the BME-optimal tree for the data set with the new taxon does
not contain any nontrivial splits (bipartitions) of the optimal tree for the
original data. Second, we prove a theorem which restricts the topology of
BME-optimal trees for data sets of this type, thus showing that a rogue taxon
cannot have an arbitrary effect on the optimal tree. Third, we construct
polyhedral cones computationally which give complete answers for BME rogue
taxon behavior when our original data fits a tree on four, five, and six taxa.
We use these cones to derive sufficient conditions for rogue taxon behavior for
four taxa, and to understand the frequency of the rogue taxon effect via
simulation.
Comment: In this version, we add quartet distances and fix Table 4
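To make the BME criterion concrete for the four-taxon case, the following minimal sketch scores each of the three quartet topologies with Pauplin's formula, l(T) = sum over leaf pairs of 2^(1 - t_ij) * d_ij, where t_ij is the number of edges on the path between leaves i and j. The function names and the toy distance matrix are illustrative assumptions, not taken from the paper, and the sketch does not reproduce the paper's polyhedral cone computations.

```python
from itertools import combinations


def quartet_topologies(taxa):
    """Yield the three unrooted binary topologies on four taxa, each as a pair of cherries."""
    a, b, c, d = taxa
    yield (a, b), (c, d)
    yield (a, c), (b, d)
    yield (a, d), (b, c)


def edge_counts(taxa, cherries):
    """Path length in edges between every leaf pair of a quartet tree.

    Leaves in the same cherry are 2 edges apart; all other pairs are 3 edges apart.
    """
    counts = {}
    for x, y in combinations(taxa, 2):
        same_cherry = any({x, y} == set(cherry) for cherry in cherries)
        counts[(x, y)] = 2 if same_cherry else 3
    return counts


def bme_length(dist, counts):
    """Pauplin's formula for the balanced minimum evolution length of a tree."""
    return sum(2.0 ** (1 - counts[pair]) * dist[pair] for pair in dist)


def bme_optimal_quartet(taxa, dist):
    """Return (length, cherries) for the topology with the smallest BME length."""
    return min(
        (bme_length(dist, edge_counts(taxa, cherries)), cherries)
        for cherries in quartet_topologies(taxa)
    )


# Toy data (an assumption): distances that fit the tree AB|CD exactly,
# with all five edges of length 1.
taxa = ("A", "B", "C", "D")
dist = {
    ("A", "B"): 2.0, ("A", "C"): 3.0, ("A", "D"): 3.0,
    ("B", "C"): 3.0, ("B", "D"): 3.0, ("C", "D"): 2.0,
}
best_length, best_cherries = bme_optimal_quartet(taxa, dist)
```

For tree-like distances such as these, the BME length of the generating topology equals the sum of its edge lengths, which is why it scores lowest here.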
Cultural Phylogenetics of the Tupi Language Family in Lowland South America
Background: Recent advances in automated assessment of basic vocabulary lists allow the construction of linguistic phylogenies useful for tracing dynamics of human population expansions, reconstructing ancestral cultures, and modeling transition rates of cultural traits over time. Methods: Here we investigate the expansion of Tupi, a widely dispersed language family in lowland South America, with a distance-based phylogeny built from 40-word vocabulary lists from 48 languages. We coded 11 cultural traits across the diverse Tupi family including traditional warfare patterns, post-marital residence, corporate structure, community size, paternity beliefs, sibling terminology, presence of canoes, tattooing, shamanism, men’s houses, and lip plugs. Results/Discussion: The linguistic phylogeny supports a Tupi homeland in west-central Brazil with subsequent major expansions across much of lowland South America. Consistently, ancestral reconstructions of cultural traits over the linguistic phylogeny suggest that social complexity has tended to decline through time, most notably in the independent emergence of several nomadic hunter-gatherer societies. Estimated rates of cultural change across the Tupi expansion are on the order of only a few changes per 10,000 years, in accord with previous cultural phylogenetic results in other languag
Including RNA secondary structures improves accuracy and robustness in reconstruction of phylogenetic trees
Background: In several studies, secondary structures of ribosomal genes have been used to improve the quality of phylogenetic reconstructions. An extensive evaluation of the benefits of secondary structure, however, is lacking. Results: This is the first study to counter this deficiency. We inspected the accuracy and robustness of phylogenetics with individual secondary structures by simulation experiments for artificial tree topologies with up to 18 taxa and for divergence levels in the range of typical phylogenetic studies. We chose the internal transcribed spacer 2 (ITS2) of the ribosomal cistron as an exemplary marker region. Simulation integrated the coevolution process of sequences with secondary structures. Additionally, the phylogenetic power of marker size duplication was investigated and compared with sequence and sequence-structure reconstruction methods. The results clearly show that the accuracy and robustness of Neighbor Joining trees are largely improved by structural information compared with sequence-only data, whereas a doubled marker size improves only robustness. Conclusions: Individual secondary structures of ribosomal RNA sequences provide a valuable gain of information content that is useful for phylogenetics. Thus, the use of ITS2 sequence together with secondary structure for taxonomic inferences is recommended. Other reconstruction methods such as maximum likelihood, Bayesian inference or maximum parsimony may equally profit from secondary structure inclusion. Reviewers: This article was reviewed by Shamil Sunyaev, Andrea Tanzer (nominated by Frank Eisenhaber) and Eugene V. Koonin.
Pattern-based phylogenetic distance estimation and tree reconstruction
We have developed an alignment-free method that calculates phylogenetic
distances using a maximum likelihood approach for a model of sequence change on
patterns that are discovered in unaligned sequences. To evaluate the
phylogenetic accuracy of our method, and to conduct a comprehensive comparison
of existing alignment-free methods (freely available as Python package decaf+py
at http://www.bioinformatics.org.au), we have created a dataset of reference
trees covering a wide range of phylogenetic distances. Amino acid sequences
were evolved along the trees and input to the tested methods; from their
calculated distances we inferred trees whose topologies we compared to the
reference trees.
We find our pattern-based method statistically superior to all other tested
alignment-free methods on this dataset. We also demonstrate the general
advantage of alignment-free methods over an approach based on automated
alignments when sequences violate the assumption of collinearity. Similarly, we
compare methods on empirical data from an existing alignment benchmark set that
we used to derive reference distances and trees. Our pattern-based approach
yields distances that show a linear relationship to reference distances over a
substantially longer range than other alignment-free methods. The pattern-based
approach outperforms the other alignment-free methods and its phylogenetic
accuracy is statistically indistinguishable from that of alignment-based
distances.
Comment: 21 pages, 3 figures, 2 tables
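As a point of reference for what an alignment-free distance looks like, here is a minimal sketch of a generic k-mer frequency distance. It is not the authors' pattern-based maximum likelihood method and is not taken from the decaf+py package; the function names and the choice of Euclidean distance over k-mer frequencies are assumptions made for illustration.

```python
from collections import Counter
from math import sqrt


def kmer_profile(seq, k):
    """Relative frequency of every overlapping k-mer in an unaligned sequence."""
    if len(seq) < k:
        raise ValueError("sequence is shorter than k")
    counts = Counter(seq[i:i + k] for i in range(len(seq) - k + 1))
    total = sum(counts.values())
    return {kmer: n / total for kmer, n in counts.items()}


def kmer_distance(seq_a, seq_b, k=2):
    """Euclidean distance between the k-mer frequency profiles of two sequences."""
    profile_a = kmer_profile(seq_a, k)
    profile_b = kmer_profile(seq_b, k)
    words = set(profile_a) | set(profile_b)
    return sqrt(sum((profile_a.get(w, 0.0) - profile_b.get(w, 0.0)) ** 2 for w in words))


# Two sequences whose halves are swapped (non-collinear) still share most k-mers,
# so they stay close, while a sequence with no shared k-mers is far away.
swapped = kmer_distance("AAAACCCC", "CCCCAAAA")
unrelated = kmer_distance("AAAACCCC", "GGGGTTTT")
```

Because no alignment is computed, rearranged segments do not distort the distance, which is the property the abstract points to when sequences violate collinearity.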
Evolutionary distances in the twilight zone -- a rational kernel approach
Phylogenetic tree reconstruction is traditionally based on multiple sequence
alignments (MSAs) and heavily depends on the validity of this information
bottleneck. With increasing sequence divergence, the quality of MSAs decays
quickly. Alignment-free methods, on the other hand, are based on abstract
string comparisons and avoid potential alignment problems. However, in general
they are not biologically motivated and ignore our knowledge about the
evolution of sequences. Thus, it is still a major open question how to define
an evolutionary distance metric between divergent sequences that makes use of
indel information and known substitution models without the need for a multiple
alignment. Here we propose a new evolutionary distance metric to close this
gap. It uses finite-state transducers to create a biologically motivated
similarity score which models substitutions and indels, and does not depend on
a multiple sequence alignment. The sequence similarity score is defined in
analogy to pairwise alignments and additionally has the positive semi-definite
property. We describe its derivation and show in simulation studies and
real-world examples that it is more accurate in reconstructing phylogenies than
competing methods. The result is a new and accurate way of determining
evolutionary distances in and beyond the twilight zone of sequence alignments
that is suitable for large datasets.
Comment: to appear in PLoS ONE
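The positive semi-definite property matters because any such similarity score k induces a distance through the standard construction d(x, y) = sqrt(k(x, x) + k(y, y) - 2 k(x, y)). The sketch below illustrates that construction only. Whether the authors derive their distances this way is an assumption, and the shared k-mer count kernel used here is a simple stand-in, not the transducer-based score described in the abstract.

```python
from collections import Counter
from math import sqrt


def spectrum_kernel(seq_a, seq_b, k=2):
    """Stand-in positive semi-definite kernel: dot product of k-mer count vectors."""
    counts_a = Counter(seq_a[i:i + k] for i in range(len(seq_a) - k + 1))
    counts_b = Counter(seq_b[i:i + k] for i in range(len(seq_b) - k + 1))
    return float(sum(n * counts_b[kmer] for kmer, n in counts_a.items()))


def kernel_distance(kernel, x, y):
    """Distance induced by a positive semi-definite kernel."""
    squared = kernel(x, x) + kernel(y, y) - 2.0 * kernel(x, y)
    # Clamp tiny negative values caused by floating-point rounding.
    return sqrt(max(squared, 0.0))


d_xy = kernel_distance(spectrum_kernel, "ACGT", "ACGA")
d_xz = kernel_distance(spectrum_kernel, "ACGT", "TTTT")
d_yz = kernel_distance(spectrum_kernel, "ACGA", "TTTT")
```

A distance built this way satisfies the triangle inequality, which is what makes a positive semi-definite score usable as input to distance-based tree reconstruction.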
Inferring phylogenies with incomplete data sets: a 5-gene, 567-taxon analysis of angiosperms
Background: Phylogenetic analyses of angiosperm relationships have used only a small percentage of available sequence data, but phylogenetic data matrices often can be augmented with existing data, especially if one allows missing characters. We explore the effects on phylogenetic analyses of adding 378 matK sequences and 240 26S rDNA sequences to the complete 3-gene, 567-taxon angiosperm phylogenetic matrix of Soltis et al. Results: We performed maximum likelihood bootstrap analyses of the complete, 3-gene, 567-taxon data matrix and the incomplete, 5-gene, 567-taxon data matrix. Although the 5-gene matrix has more missing data (27.5%) than the 3-gene data matrix (2.9%), the 5-gene analysis resulted in higher levels of bootstrap support. Within the 567-taxon tree, the increase in support is most evident for relationships among the 170 taxa for which both matK and 26S rDNA sequences were added, and there is little gain in support for relationships among the 119 taxa having neither matK nor 26S rDNA sequences. The 5-gene analysis also places the enigmatic Hydrostachys in Lamiales (BS = 97%) rather than in Cornales (BS = 100% in the 3-gene analysis). The placement of Hydrostachys in Lamiales is unprecedented in molecular analyses, but it is consistent with embryological and morphological data. Conclusion: Adding available, and often incomplete, sets of sequences to existing data sets can be a fast and inexpensive way to increase support for phylogenetic relationships and produce novel and credible phylogenetic hypotheses.
Phylogenetic Divergence Time, Algorithms for Improved Accuracy and Performance
The inference of species divergence time is a key step in the study of phylogenetics. Methods have been available for the last ten years to perform the inference, but there are two significant problems with these methods. First, the performance of the methods does not yet scale well to studies with hundreds of taxa and thousands of DNA base pairs. A study of 349 taxa was estimated to require over 9 months of processing time. Second, the accuracy of the inference process is subject to bias and variance in the specification of model parameters that is not completely understood. These parameters include both the topology of the phylogenetic tree and, more importantly for our purposes, the set of fossils used to calibrate the tree.
In this work, we present new algorithms and methods to improve the performance of the divergence time process. We demonstrate a new algorithm for the computation of phylogenetic likelihood and experimentally illustrate a 90% improvement in likelihood computation time on the aforementioned dataset of 349 taxa with over 60,000 DNA base pairs. Additionally, we show a new algorithm for the computation of the Bayesian prior on node ages that is experimentally shown to reduce the time for this computation on the 349-taxon dataset by 99%.
Using our high-performance methods, we present a new method for assessing the level of support for the inferred ages. This method applies a statistical jackknifing technique to the set of fossil calibrations, producing a support value similar to the bootstrap used in phylogenetic inference.
Finally, we present efficient methods for divergence time inference on sets of trees based on our development of subtree sharing models. We show a 60% improvement in processing times on a dataset of 567 taxa with over 10,000 DNA base pairs.