    Reconciling taxonomy and phylogenetic inference: formalism and algorithms for describing discord and inferring taxonomic roots

    Although taxonomy is often used informally to evaluate the results of phylogenetic inference and to find the root of phylogenetic trees, algorithmic methods to do so are lacking. In this paper we formalize these procedures and develop algorithms to solve the relevant problems. In particular, we introduce a new algorithm that solves a "subcoloring" problem for expressing the difference between the taxonomy and phylogeny at a given rank. This algorithm improves upon the current best algorithm in terms of asymptotic complexity for the parameter regime of interest; we also describe a branch-and-bound algorithm that saves orders of magnitude in computation on real data sets. We also develop a formalism and an algorithm for rooting phylogenetic trees according to a taxonomy. All of these algorithms are implemented in freely available software. Comment: Version submitted to Algorithms for Molecular Biology. A number of fixes from previous versio

    Minimizing the average distance to a closest leaf in a phylogenetic tree

    When performing an analysis on a collection of molecular sequences, it can be convenient to reduce the number of sequences under consideration while maintaining some characteristic of a larger collection of sequences. For example, one may wish to select a subset of high-quality sequences that represent the diversity of a larger collection of sequences. One may also wish to specialize a large database of characterized "reference sequences" to a smaller subset that is as close as possible on average to a collection of "query sequences" of interest. Such a representative subset can be useful whenever one wishes to find a set of reference sequences that is appropriate to use for comparative analysis of environmentally derived sequences, such as for selecting "reference tree" sequences for phylogenetic placement of metagenomic reads. In this paper we formalize these problems in terms of the minimization of the Average Distance to the Closest Leaf (ADCL) and investigate algorithms to perform the relevant minimization. We show that the greedy algorithm is not effective, show that a variant of the Partitioning Around Medoids (PAM) heuristic gets stuck in local minima, and develop an exact dynamic programming approach. Using this exact program we note that the performance of PAM appears to be good for simulated trees, and is faster than the exact algorithm for small trees. On the other hand, the exact program gives solutions for all numbers of leaves less than or equal to the given desired number of leaves, while PAM only gives a solution for the pre-specified number of leaves. Via application to real data, we show that the ADCL criterion chooses chimeric sequences less often than random subsets, while the maximization of phylogenetic diversity chooses them more often than random. These algorithms have been implemented in publicly available software.
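
    The ADCL objective named in this abstract can be illustrated with a minimal sketch. It assumes, for simplicity, that the tree is summarized by a matrix of pairwise leaf-to-leaf distances and that every leaf counts as a query; the function names `adcl` and `best_subset_brute_force` are hypothetical, and the exhaustive search is only a tiny baseline, not the paper's dynamic program or its PAM variant.

```python
from itertools import combinations


def adcl(dist, selected):
    """Average, over all leaves, of the distance to the nearest selected leaf.

    dist     -- square matrix (list of lists) of pairwise leaf distances
    selected -- non-empty collection of leaf indices that are kept
    """
    if not selected:
        raise ValueError("at least one leaf must be selected")
    n = len(dist)
    return sum(min(dist[i][j] for j in selected) for i in range(n)) / n


def best_subset_brute_force(dist, k):
    """Try every subset of k leaves and return (best_subset, best_adcl).

    Exponential in the number of leaves; usable only on toy inputs.
    """
    candidates = combinations(range(len(dist)), k)
    best = min(candidates, key=lambda s: adcl(dist, s))
    return best, adcl(dist, best)


# Toy example: four leaves whose pairwise distances are those of
# points at positions 0, 1, 5 and 6 on a line (an assumed input).
positions = [0, 1, 5, 6]
dist = [[abs(a - b) for b in positions] for a in positions]

print(adcl(dist, [0, 2]))                # one leaf kept from each cluster
print(adcl(dist, [0, 1]))                # both kept leaves in one cluster
print(best_subset_brute_force(dist, 2))
```

    Keeping one leaf from each cluster gives a much lower ADCL than keeping two leaves from the same cluster, which is the behaviour the criterion is meant to reward.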

    Molecular Infectious Disease Epidemiology: Survival Analysis and Algorithms Linking Phylogenies to Transmission Trees

    Recent work has attempted to use whole-genome sequence data from pathogens to reconstruct the transmission trees linking infectors and infectees in outbreaks. However, transmission trees from one outbreak do not generalize to future outbreaks. Reconstruction of transmission trees is most useful to public health if it leads to generalizable scientific insights about disease transmission. In a survival analysis framework, estimation of transmission parameters is based on sums or averages over the possible transmission trees. A phylogeny can increase the precision of these estimates by providing partial information about who infected whom. The leaves of the phylogeny represent sampled pathogens, which have known hosts. The interior nodes represent common ancestors of sampled pathogens, which have unknown hosts. Starting from assumptions about disease biology and epidemiologic study design, we prove that there is a one-to-one correspondence between the possible assignments of interior node hosts and the transmission trees simultaneously consistent with the phylogeny and the epidemiologic data on person, place, and time. We develop algorithms to enumerate these transmission trees and show these can be used to calculate likelihoods that incorporate both epidemiologic data and a phylogeny. A simulation study confirms that this leads to more efficient estimates of hazard ratios for infectiousness and baseline hazards of infectious contact, and we use these methods to analyze data from a foot-and-mouth disease virus outbreak in the United Kingdom in 2001. These results demonstrate the importance of data on individuals who escape infection, which is often overlooked. The combination of survival analysis and algorithms linking phylogenies to transmission trees is a rigorous but flexible statistical foundation for molecular infectious disease epidemiology. Comment: 28 pages, 11 figures, 3 table

    Visualizing Co-Phylogenetic Reconciliations

    We introduce a hybrid metaphor for the visualization of the reconciliations of co-phylogenetic trees, which are mappings among the nodes of two trees. The typical application is the visualization of the co-evolution of hosts and parasites in biology. Our strategy combines a space-filling and a node-link approach. Unlike traditional methods, it guarantees an unambiguous and 'downward' representation whenever the reconciliation is time-consistent (i.e., meaningful). We address the problem of minimizing the number of crossings in the representation by giving a characterization of planar instances and by establishing the complexity of the problem. Finally, we propose heuristics for computing representations with few crossings. Comment: This paper appears in the Proceedings of the 25th International Symposium on Graph Drawing and Network Visualization (GD 2017

    A chloroplast phylogeny of Arisaema (Araceae) illustrates Tertiary floristic links between Asia, North America, and East Africa

    The evolution of Arisaema is reconstructed, based on combined sequences (2048 aligned bases) from the chloroplast trnL intron, trnL-trnF spacer, and rpl20-rps12 spacer obtained for species from all 11 sections, including sectional type species and geographically disjunct East African and North American/Mexican species. Analyses were rooted with a representative sample of the closest outgroups, Pinellia and Typhonium, to rigorously test the monophyly of Arisaema. Sections in Arisaema are mostly based on leaf, stem, and inflorescence characters and, with one exception, are not rejected by the molecular data; however, statistical support for sectional relationships in the genus remains poor. Section Tortuosa, which includes eastern North American A. dracontium and Mexican A. macrospathum, is demonstrably polyphyletic. The third New World species, A. triphyllum, also occurs in eastern North America and groups with a different Asian clade than do A. dracontium/A. macrospathum. The genus thus appears to have entered North America twice. Fossil infructescences similar to those of A. triphyllum are known from approximately 18 million-year-old deposits in Washington State and can serve to calibrate a molecular clock. Constraining the age of A. triphyllum to 18 million years (my) and applying either a semiparametric or an ultrametric clock model to the combined data yields an age of approximately 31–49 my for the divergence of A. dracontium/A. macrospathum from their Asian relatives and of 19–32 my for the divergence between African A. schimperianum and a Tibetan/Nepalese relative. The genus thus provides an example of the Oligocene/Miocene floristic links between East Africa, Arabia, the Himalayan region, China, and North America. The phylogeny also suggests secondary loss of the environmental sex determination strategy that characterizes all arisaemas except for two subspecies of A. flavum, which have consistently bisexual spathes. These subspecies are tetraploid and capable of selfing, while a third subspecies of A. flavum is diploid and retains the sex-changing strategy. In the molecular trees, the sex-changing subspecies is sister to the two non-sex-changing ones, and the entire species is not basal in the genus.

    Learning Latent Tree Graphical Models

    We study the problem of learning a latent tree graphical model where samples are available only from a subset of variables. We propose two consistent and computationally efficient algorithms for learning minimal latent trees, that is, trees without any redundant hidden nodes. Unlike many existing methods, the observed nodes (or variables) are not constrained to be leaf nodes. Our first algorithm, recursive grouping, builds the latent tree recursively by identifying sibling groups using so-called information distances. One of the main contributions of this work is our second algorithm, which we refer to as CLGrouping. CLGrouping starts with a pre-processing procedure in which a tree over the observed variables is constructed. This global step groups the observed nodes that are likely to be close to each other in the true latent tree, thereby guiding subsequent recursive grouping (or equivalent procedures) on much smaller subsets of variables. This results in more accurate and efficient learning of latent trees. We also present regularized versions of our algorithms that learn latent tree approximations of arbitrary distributions. We compare the proposed algorithms to other methods by performing extensive numerical experiments on various latent tree graphical models such as hidden Markov models and star graphs. In addition, we demonstrate the applicability of our methods on real-world datasets by modeling the dependency structure of monthly stock returns in the S&P index and of the words in the 20 newsgroups dataset.