Principal components analysis in the space of phylogenetic trees
Phylogenetic analysis of DNA or other data commonly gives rise to a
collection or sample of inferred evolutionary trees. Principal Components
Analysis (PCA) cannot be applied directly to collections of trees since the
space of evolutionary trees on a fixed set of taxa is not a vector space. This
paper describes a novel geometrical approach to PCA in tree-space that
constructs the first principal path in an analogous way to standard linear
Euclidean PCA. Given a data set of phylogenetic trees, a geodesic principal
path is sought that maximizes the variance of the data under a form of
projection onto the path. Due to the high dimensionality of tree-space and the
nonlinear nature of this problem, the computational complexity is potentially
very high, so approximate optimization algorithms are used to search for the
optimal path. Principal paths identified in this way reveal and quantify the
main sources of variation in the original collection of trees in terms of both
topology and branch lengths. The approach is illustrated by application to
simulated sets of trees and to a set of gene trees from metazoan (animal)
species.
Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/)
at http://dx.doi.org/10.1214/11-AOS915 by the Institute of Mathematical
Statistics (http://www.imstat.org
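For orientation, the standard linear Euclidean PCA that the abstract takes as its analogue can be sketched as follows. This is not the tree-space method of the paper: it works only on ordinary vectors and finds the direction that maximizes the variance of the projected data by power iteration on the sample covariance matrix. The function name `first_principal_direction` and the toy data are assumptions made for illustration.

```python
import math
import random


def first_principal_direction(points, iterations=200):
    """Return (mean, unit direction, projected variance) for Euclidean data.

    Illustrative only: plain vector-space PCA, not PCA in tree-space.
    """
    n = len(points)
    dim = len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(dim)]
    centered = [[p[j] - mean[j] for j in range(dim)] for p in points]
    # Sample covariance matrix of the centered data.
    cov = [[sum(row[i] * row[j] for row in centered) / (n - 1)
            for j in range(dim)] for i in range(dim)]
    # Power iteration converges to the eigenvector with the largest eigenvalue.
    v = [1.0] * dim
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(dim)) for i in range(dim)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    # Variance of the data projected onto v equals v^T cov v.
    variance = sum(v[i] * cov[i][j] * v[j]
                   for i in range(dim) for j in range(dim))
    return mean, v, variance


# Toy data (hypothetical): most of the spread lies along the first axis.
rng = random.Random(0)
points = [(rng.gauss(0, 3), rng.gauss(0, 1), rng.gauss(0, 0.2))
          for _ in range(200)]
mean, direction, variance = first_principal_direction(points)
```

In tree-space the straight line through the mean is replaced by a geodesic path, and, as the abstract notes, the search for the best path is carried out with approximate optimization rather than an eigen-decomposition.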
Ghost-tree: creating hybrid-gene phylogenetic trees for diversity analyses.
Background: Fungi play critical roles in many ecosystems, cause serious diseases in plants and animals, and pose significant threats to human health and structural integrity problems in built environments. While most fungal diversity remains unknown, the development of PCR primers for the internal transcribed spacer (ITS) combined with next-generation sequencing has substantially improved our ability to profile fungal microbial diversity. Although the high sequence variability in the ITS region facilitates more accurate species identification, it also makes multiple sequence alignment and phylogenetic analysis unreliable across evolutionarily distant fungi because the sequences are hard to align accurately. To address this issue, we created ghost-tree, a bioinformatics tool that integrates sequence data from two genetic markers into a single phylogenetic tree that can be used for diversity analyses. Our approach starts with a "foundation" phylogeny based on one genetic marker whose sequences can be aligned across organisms spanning divergent taxonomic groups (e.g., fungal families). Then, "extension" phylogenies are built for more closely related organisms (e.g., fungal species or strains) using a second more rapidly evolving genetic marker. These smaller phylogenies are then grafted onto the foundation tree by mapping taxonomic names such that each corresponding foundation-tree tip would branch into its new "extension tree" child.
Results: We applied ghost-tree to graft fungal extension phylogenies derived from ITS sequences onto a foundation phylogeny derived from fungal 18S sequences. Our analysis of simulated and real fungal ITS data sets found that phylogenetic distances between fungal communities computed using ghost-tree phylogenies explained significantly more variance than non-phylogenetic distances. The phylogenetic metrics also improved our ability to distinguish small differences (effect sizes) between microbial communities, though results were similar to non-phylogenetic methods for larger effect sizes.
Conclusions: The Silva/UNITE-based ghost tree presented here can be easily integrated into existing fungal analysis pipelines to enhance the resolution of fungal community differences and improve understanding of these communities in built environments. The ghost-tree software package can also be used to develop phylogenetic trees for other marker gene sets that afford different taxonomic resolution, or for bridging genome trees with amplicon trees.
Availability: ghost-tree is pip-installable. All source code, documentation, and test code are available under the BSD license at https://github.com/JTFouquier/ghost-tree
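The grafting step described in the Background can be sketched in a few lines. This is a minimal illustration, not the ghost-tree API: the nested-dict tree representation, the function names `graft` and `tip_names`, and the example taxa are all assumptions, and branch lengths are left out entirely.

```python
import copy


def graft(foundation, extensions):
    """Return a copy of `foundation` with extension trees attached at tips.

    `extensions` maps a taxonomic name to an extension tree. A foundation
    tip whose name matches a key branches into that extension's children.
    """
    tree = copy.deepcopy(foundation)
    stack = [tree]
    while stack:
        node = stack.pop()
        children = node.get("children", [])
        if not children and node["name"] in extensions:
            # Matching tip: attach a copy of the extension tree's children.
            node["children"] = copy.deepcopy(
                extensions[node["name"]]["children"])
        else:
            stack.extend(children)
    return tree


def tip_names(tree):
    """List tip names in left-to-right order."""
    children = tree.get("children", [])
    if not children:
        return [tree["name"]]
    names = []
    for child in children:
        names.extend(tip_names(child))
    return names


# Hypothetical foundation tree (slowly evolving marker) with genus-level tips.
foundation = {"name": "root", "children": [
    {"name": "Aspergillus"}, {"name": "Candida"}, {"name": "Mucor"}]}

# Hypothetical extension trees (rapidly evolving marker), keyed by genus.
extensions = {
    "Aspergillus": {"name": "Aspergillus", "children": [
        {"name": "A. niger"}, {"name": "A. flavus"}]},
    "Candida": {"name": "Candida", "children": [
        {"name": "C. albicans"}, {"name": "C. glabrata"}]},
}

grafted = graft(foundation, extensions)
```

Tips with no matching extension (here `Mucor`) are left as they are, and the input foundation tree is not modified.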
Inference of single-cell phylogenies from lineage tracing data using Cassiopeia.
The pairing of CRISPR/Cas9-based gene editing with massively parallel single-cell readouts now enables large-scale lineage tracing. However, the rapid growth in complexity of data from these assays has outpaced our ability to accurately infer phylogenetic relationships. First, we introduce Cassiopeia, a suite of scalable maximum parsimony approaches for tree reconstruction. Second, we provide a simulation framework for evaluating algorithms and exploring lineage tracer design principles. Finally, we generate the most complex experimental lineage tracing dataset to date, 34,557 human cells continuously traced over 15 generations, and use it for benchmarking phylogenetic inference approaches. We show that Cassiopeia outperforms traditional methods by several metrics and under a wide variety of parameter regimes, and provide insight into the principles for the design of improved Cas9-enabled recorders. Together, these should broadly enable large-scale mammalian lineage tracing efforts. Cassiopeia and its benchmarking resources are publicly available at www.github.com/YosefLab/Cassiopeia
Detecting adaptive evolution in phylogenetic comparative analysis using the Ornstein-Uhlenbeck model
Phylogenetic comparative analysis is an approach to inferring evolutionary
process from a combination of phylogenetic and phenotypic data. The last few
years have seen increasingly sophisticated models employed in the evaluation of
more and more detailed evolutionary hypotheses, including adaptive hypotheses
with multiple selective optima and hypotheses with rate variation within and
across lineages. The statistical performance of these sophisticated models has
received relatively little systematic attention, however. We conducted an
extensive simulation study to quantify the statistical properties of a class of
models toward the simpler end of the spectrum that model phenotypic evolution
using Ornstein-Uhlenbeck processes. We focused on identifying where, how, and
why these methods break down so that users can apply them with greater
understanding of their strengths and weaknesses. Our analysis identifies three
key determinants of performance: a discriminability ratio, a signal-to-noise
ratio, and the number of taxa sampled. Interestingly, we find that
model-selection power can be high even in regions that were previously thought
to be difficult, such as when tree size is small. On the other hand, we find
that model parameters are in many circumstances difficult to estimate
accurately, indicating a relative paucity of information in the data relative
to these parameters. Nevertheless, we note that accurate model selection is
often possible when parameters are only weakly identified. Our results have
implications for more sophisticated methods inasmuch as the latter are
generalizations of the case we study.
Comment: 38 pages, in press at Systematic Biolog
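As background for readers unfamiliar with the process named in the title, a single-lineage Ornstein-Uhlenbeck trajectory can be simulated with a short sketch. It assumes the usual form dX = alpha (theta - X) dt + sigma dW, with theta the optimum, alpha the strength of pull toward it, and sigma the noise intensity. These symbols and the function name `simulate_ou` are conventional or illustrative choices, not taken from the abstract, and the sketch ignores the phylogeny: it is not the simulation design of the study.

```python
import math
import random


def simulate_ou(x0, theta, alpha, sigma, dt, steps, rng):
    """Simulate an OU path using its exact Gaussian transition density.

    Over a step dt the trait decays toward theta by exp(-alpha * dt) and
    gains noise with variance sigma^2 / (2 alpha) * (1 - exp(-2 alpha dt)).
    Requires alpha > 0.
    """
    decay = math.exp(-alpha * dt)
    step_sd = math.sqrt(sigma ** 2 / (2 * alpha) * (1 - decay ** 2))
    path = [x0]
    for _ in range(steps):
        mean = theta + (path[-1] - theta) * decay
        path.append(rng.gauss(mean, step_sd))
    return path


# Hypothetical parameter values, chosen only to show the pull toward theta.
rng = random.Random(1)
path = simulate_ou(x0=0.0, theta=1.0, alpha=2.0, sigma=1.0,
                   dt=0.1, steps=5000, rng=rng)
```

After an initial transient the path fluctuates around theta with stationary variance sigma^2 / (2 alpha). When alpha is small relative to the time spanned, the pull toward theta is weak and the trajectory becomes hard to tell apart from Brownian motion.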