15 research outputs found

    Generalized Buneman pruning for inferring the most parsimonious multi-state phylogeny

    Full text link
    Accurate reconstruction of phylogenies remains a key challenge in evolutionary biology. Most biologically plausible formulations of the problem are formally NP-hard, with no known efficient solution. In practice, the standard approach is to use fast heuristic methods that are empirically known to work very well in general but can yield results arbitrarily far from optimal. Practical exact methods, which yield exponential worst-case running times but generally much better times in practice, provide an important alternative. We report progress in this direction by introducing a provably optimal method for the weighted multi-state maximum parsimony phylogeny problem. The method is based on generalizing the notion of the Buneman graph, a construction key to efficient exact methods for binary sequences, so that it applies to sequences with an arbitrary finite number of states and arbitrary state-transition weights. We implement an integer linear programming (ILP) method for the multi-state problem using this generalized Buneman graph and demonstrate that the resulting method is able to solve data sets that are intractable for prior exact methods in run times comparable with popular heuristics. Our work provides the first method for provably optimal maximum parsimony phylogeny inference that is practical for multi-state data sets of more than a few characters. Comment: 15 pages
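    The weighted parsimony criterion referred to above can be made concrete with the classic Sankoff dynamic program, which scores a single character on a fixed tree under arbitrary state-transition weights. The sketch below only illustrates that criterion on a made-up tree and cost matrix; the paper's contribution is searching over trees via the generalized Buneman graph and an ILP, not this routine.

```python
# Sketch: Sankoff dynamic program for the *weighted* parsimony cost of one
# character on a fixed rooted binary tree. Illustrates the criterion being
# optimized; the paper's method instead searches over trees with an ILP on a
# generalized Buneman graph. Tree, states and weights below are made up.
INF = float("inf")

def sankoff(tree, leaf_state, cost, states):
    """tree: dict node -> (left, right) for internal nodes, None for leaves.
    leaf_state: dict leaf -> observed state.
    cost: dict (from_state, to_state) -> transition weight.
    Returns the minimum total weighted number of state changes."""
    def score(node):
        if tree[node] is None:  # leaf: zero cost for the observed state, infinite otherwise
            return {s: (0 if s == leaf_state[node] else INF) for s in states}
        left, right = tree[node]
        sl, sr = score(left), score(right)
        return {s: min(cost[(s, t)] + sl[t] for t in states) +
                   min(cost[(s, t)] + sr[t] for t in states)
                for s in states}
    root = next(n for n in tree if all(n not in (kids or ()) for kids in tree.values()))
    return min(score(root).values())

states = ("A", "C", "G", "T")
cost = {(a, b): (0 if a == b else 1) for a in states for b in states}  # unit weights here
tree = {"r": ("u", "s3"), "u": ("s1", "s2"), "s1": None, "s2": None, "s3": None}
leaf_state = {"s1": "A", "s2": "A", "s3": "G"}
print(sankoff(tree, leaf_state, cost, states))  # -> 1 (a single A->G change suffices)
```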

    Computing the blocks of a quasi-median graph

    Get PDF
    Quasi-median graphs are a tool commonly used by evolutionary biologists to visualise the evolution of molecular sequences. As with any graph, a quasi-median graph can contain cut vertices, that is, vertices whose removal disconnects the graph. These vertices induce a decomposition of the graph into blocks, that is, maximal subgraphs that do not contain any cut vertices. Here we show that the special structure of quasi-median graphs can be used to compute their blocks without having to compute the whole graph. In particular we present an algorithm that, for a collection of n aligned sequences of length m, can compute the blocks of the associated quasi-median graph, together with the information required to correctly connect these blocks together, in run time O(n²m²), independent of the size of the sequence alphabet. Our primary motivation for presenting this algorithm is the fact that the quasi-median graph associated with a sequence alignment must contain all most parsimonious trees for the alignment, and therefore precomputing the blocks of the graph has the potential to help speed up any method for computing such trees. Comment: 17 pages, 2 figures
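    As background for readers less familiar with the terminology, the generic block decomposition of an explicitly given graph can be computed with the standard Hopcroft-Tarjan depth-first search, sketched below on a toy graph. This is not the paper's algorithm, which is precisely the point of the result: the blocks of the quasi-median graph are obtained from the alignment itself, without ever constructing the graph.

```python
# Generic block (biconnected-component) decomposition of an undirected graph via
# the standard Hopcroft-Tarjan DFS. Shown only to illustrate the notions of cut
# vertices and blocks; the paper computes the blocks of a quasi-median graph
# directly from the alignment, without building the graph first.
def blocks(adj):
    """adj: dict vertex -> list of neighbours. Returns blocks as sets of vertices."""
    index, low, stack, out, timer = {}, {}, [], [], [0]

    def dfs(v, parent):
        index[v] = low[v] = timer[0]
        timer[0] += 1
        for w in adj[v]:
            if w == parent:
                continue
            if w not in index:                 # tree edge
                stack.append((v, w))
                dfs(w, v)
                low[v] = min(low[v], low[w])
                if low[w] >= index[v]:         # v separates the subtree below w: pop a block
                    block = set()
                    while True:
                        e = stack.pop()
                        block.update(e)
                        if e == (v, w):
                            break
                    out.append(block)
            elif index[w] < index[v]:          # back edge, pushed once
                stack.append((v, w))
                low[v] = min(low[v], index[w])

    for v in adj:
        if v not in index:
            dfs(v, None)
    return out

# Two triangles sharing the cut vertex "c" decompose into two blocks.
g = {"a": ["b", "c"], "b": ["a", "c"], "c": ["a", "b", "d", "e"],
     "d": ["c", "e"], "e": ["c", "d"]}
print(blocks(g))  # [{'c', 'd', 'e'}, {'a', 'b', 'c'}] (order may vary)
```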

    Representing and extending ensembles of parsimonious evolutionary histories with a directed acyclic graph

    Full text link
    In many situations, it would be useful to know not just the best phylogenetic tree for a given data set, but the collection of high-quality trees. This goal is typically addressed using Bayesian techniques; however, current Bayesian methods do not scale to large data sets. Furthermore, for large data sets with relatively low signal one cannot even store every good tree individually, especially when the trees are required to be bifurcating. In this paper, we develop a novel object called the "history subpartition directed acyclic graph" (or "history sDAG" for short) that compactly represents an ensemble of trees with labels (e.g. ancestral sequences) mapped onto the internal nodes. The history sDAG can be built efficiently and can also be efficiently trimmed to represent only maximally parsimonious trees. We show that the history sDAG allows us to find many additional equally parsimonious trees, extending combinatorially beyond the ensemble used to construct it. We argue that this object could be useful as the "skeleton" of a more complete uncertainty quantification. Comment: To appear in JM
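    A rough intuition for why a directed acyclic graph can compactly represent many labelled trees: if internal nodes are identified by their label together with the set of leaves below them, then subtrees shared between trees collapse onto the same DAG node. The toy sketch below uses exactly that simplification and invented labels; the history sDAG's actual node identity and edge rules are more involved, so this is only meant to convey the sharing.

```python
# Toy illustration of collapsing an ensemble of labelled trees into a shared DAG.
# Nodes are keyed by (label, frozenset of descendant leaves), so identically
# labelled subtrees from different trees reuse the same node. This is a
# simplification of the history sDAG, not its actual definition.
class ToyTreeDAG:
    def __init__(self):
        self.nodes = {}  # key -> set of alternative child-key tuples

    def add_tree(self, tree):
        """tree: (label, [children]) with leaves given as (label, [])."""
        def add(node):
            label, children = node
            if not children:
                key = (label, frozenset([label]))
                self.nodes.setdefault(key, set())
                return key
            child_keys = tuple(sorted((add(c) for c in children),
                                      key=lambda k: (k[0], sorted(k[1]))))
            key = (label, frozenset().union(*(k[1] for k in child_keys)))
            self.nodes.setdefault(key, set()).add(child_keys)
            return key
        add(tree)

dag = ToyTreeDAG()
# Two invented trees that agree on the subtree grouping leaves A and B under label "AB".
dag.add_tree(("R1", [("AB", [("A", []), ("B", [])]), ("C", [])]))
dag.add_tree(("R2", [("AB", [("A", []), ("B", [])]), ("D", [])]))
print(len(dag.nodes))  # 7 shared nodes rather than the 10 of two separate trees
```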

    Fast Hash-Based Algorithms for Analyzing Large Collections of Evolutionary Trees

    Get PDF
    Phylogenetic analysis can easily produce tens of thousands of equally plausible evolutionary trees. Consensus trees and topological distance matrices are often used to summarize the evolutionary relationships among the trees of interest. However, current approaches are not designed to analyze very large tree collections. In this dissertation, we present two fast algorithms, HashCS and HashRF, for analyzing large collections of evolutionary trees based on a novel hash table data structure, which provides a convenient and fast way to store and access the bipartition information collected from the tree collections. Our HashCS algorithm is a fast O(nt) technique for constructing consensus trees, where n is the number of taxa and t is the number of trees. By reprocessing the bipartition information in our hash table, HashCS constructs strict and majority consensus trees. In addition to a consensus algorithm, we design a fast topological distance algorithm called HashRF to compute the t × t Robinson-Foulds distance matrix, which requires O(nt²) running time. An RF distance matrix provides plenty of data-mining opportunities to help researchers understand the evolutionary relationships contained in their collection of trees. We also introduce a series of extensions based on HashRF to provide researchers with a more convenient set of tools for analyzing their trees. We provide extensive experimentation on the practical performance of our hash-based algorithms across a diverse collection of biological and artificial trees. Our results show that both algorithms easily outperform existing consensus and RF matrix implementations. For example, on our biological trees, HashCS and HashRF are 1.8 and 100 times faster than PAUP*, respectively. We show two real-world applications of our fast hashing algorithms: (i) comparing phylogenetic heuristic implementations, and (ii) clustering and visualizing trees. In the first application, we design novel methods to compare PaupRat and Rec-I-DCM3, two popular phylogenetic heuristics that use the maximum parsimony criterion, and show that RF distances are more effective than parsimony scores at identifying heterogeneity within a collection of trees. In the second application, we empirically show how to determine the distinct clusters of trees within large tree collections. We use two different techniques to identify distinct tree groups; both show that partitioning the trees into distinct groups and summarizing each group separately is a better representation of the data. Additional benefits of our approach are better consensus trees as well as insightful information about the convergence behavior of phylogenetic heuristics. Our fast hash-based algorithms provide scientists with very powerful tools for analyzing the relationships within their large phylogenetic tree collections in new and exciting ways. Our work opens many opportunities for future research, including detecting convergence and designing better heuristics. Furthermore, our hash tables have many potential extensions; for example, the same hashing structure could be used to design algorithms for computing other distance metrics such as Nearest Neighbor Interchange (NNI), Subtree Pruning and Regrafting (SPR), and Tree Bisection and Reconnection (TBR) distances
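    The central idea, hashing each tree's bipartitions so that consensus and Robinson-Foulds computations reduce to table lookups, can be sketched as follows for trees supplied as sets of clades over a common taxon set. The input format, helper names, and toy trees are illustrative only; HashCS and HashRF use universal hashing of bipartition bit-vectors and considerably more careful bookkeeping.

```python
# Sketch: bipartition hashing for comparing trees. Each tree is given as a set of
# clades (frozensets of taxon names) over a common taxon set; every nontrivial
# clade induces one bipartition, stored in a canonical form so both "sides" hash
# to the same key. Toy data; not the dissertation's actual data structures.
from collections import Counter
from itertools import combinations

def bipartitions(clades, taxa):
    """Canonical nontrivial bipartitions of one tree: keep the side that excludes
    a fixed reference taxon, so A|B and B|A map to the same key."""
    ref = min(taxa)
    out = set()
    for c in clades:
        if 1 < len(c) < len(taxa) - 1:  # skip trivial splits (single leaf vs. rest)
            out.add(frozenset(c) if ref not in c else frozenset(taxa) - frozenset(c))
    return out

def rf_matrix(trees, taxa):
    """Pairwise Robinson-Foulds distances: bipartitions present in exactly one tree."""
    bips = [bipartitions(t, taxa) for t in trees]
    d = [[0] * len(trees) for _ in trees]
    for i, j in combinations(range(len(trees)), 2):
        d[i][j] = d[j][i] = len(bips[i] ^ bips[j])
    return d

def majority_bipartitions(trees, taxa):
    """Bipartitions occurring in more than half of the trees (the skeleton of a
    majority-rule consensus; assembling them into a tree is omitted here)."""
    counts = Counter(b for t in trees for b in bipartitions(t, taxa))
    return {b for b, c in counts.items() if c > len(trees) / 2}

taxa = {"a", "b", "c", "d", "e"}
t1 = [frozenset({"a", "b"}), frozenset({"a", "b", "c"})]
t2 = [frozenset({"a", "b"}), frozenset({"a", "b", "d"})]
t3 = [frozenset({"d", "e"}), frozenset({"c", "d", "e"})]  # same unrooted splits as t1
print(rf_matrix([t1, t2, t3], taxa))        # [[0, 2, 0], [2, 0, 2], [0, 2, 0]]
print(majority_bipartitions([t1, t2, t3], taxa))
```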

    Phylogeny and phylogeography of the Chacma Baboon (Papio ursinus): the role of landscape in shaping contemporary genetic structure in the southern African baboon

    Get PDF
    This thesis contributes to our understanding of the role of climate and landscape change in structuring diversity within chacma baboons (Papio ursinus). The data set comprises molecular sequences from two mitochondrial DNA markers: the Brown region and the hypervariable D-loop. DNA was extracted from faecal samples of 261 free-living chacma baboons across southern Africa. Phylogenetic and phylogeographic techniques, including coalescent modeling, were used to examine past and present population dynamics of chacma baboon populations. Bayesian tree constructions provide a timeline of diversification for the sample. Although the ecological drivers of ongoing differentiation remain unclear, it was shown that population contractions and expansions have played a significant role in driving regional genetic structure within the species

    Statistical approaches to viral phylodynamics

    Get PDF
    Recent years have witnessed a rapid increase in the quantity and quality of genomic data collected from human and animal pathogens, viruses in particular. When coupled with mathematical and statistical models, these data allow us to combine evolutionary theory and epidemiology to understand pathogen dynamics. While these developments have allowed important epidemiological questions to be tackled, they have also exposed the need for improved analytical methods. In this thesis I employ modern statistical techniques to address two pressing issues in phylodynamics: (i) computational tools for Bayesian phylogenetics and (ii) data integration. I detail the development and testing of new transition kernels for Markov chain Monte Carlo (MCMC) for time-calibrated phylogenetics in Chapter 2 and show that an adaptive kernel leads to improved MCMC performance in terms of mixing for a range of data sets, in particular for a challenging Ebola virus phylogeny with 1610 taxa/sequences. As a trade-off, I also found that the new adaptive kernels have longer warm-up times in general, suggesting room for improvement. Chapter 3 shows how to apply state-of-the-art techniques to visualise and analyse phylogenetic space and MCMC for time-calibrated phylogenies, which are crucial to the viral phylodynamics analysis pipeline. I describe a pipeline for a typical phylodynamic analysis, which includes convergence diagnostics for continuous parameters and in phylogenetic space, extending existing methods to deal with large time-calibrated phylogenies. In addition I investigate different representations of phylogenetic space through multi-dimensional scaling (MDS) or univariate distributions of distances to a focal tree, and show that even for the simplest toy examples phylogenetic space remains complex and, in particular, that not all metrics lead to desirable or useful representations. On the data integration front, Chapters 4 and 5 detail the use of data from the 2013-2016 Ebola virus disease (EVD) epidemic in West Africa to show how one can combine phylogenetic and epidemiological data to tackle epidemiological questions. I explore the determinants of the Ebola epidemic in Chapter 4 through a generalised linear model framework coupled with Bayesian stochastic search variable selection (BSSVS) to assess the relative importance of climatic and socio-economic variables for the number of EVD cases. In Chapter 5 I tackle the question of whether a particular glycoprotein mutation could lead to increased human mortality from EVD. I show that a principled analysis of the available data, accounting for several sources of uncertainty as well as shared ancestry between samples, does not allow us to ascertain the presence of such an effect of a viral mutation on mortality. Chapter 6 attempts to bring the findings of the thesis together and to discuss how the field of phylodynamics, especially its methodological aspects, might move forward
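    One of the visualisation steps mentioned above, embedding trees in a low-dimensional space from a matrix of pairwise distances, can be illustrated with classical (Torgerson) multidimensional scaling in a few lines of linear algebra. The distance matrix below is a small invented example, not data or code from the thesis.

```python
# Classical (Torgerson) MDS: embed items in 2-D from a matrix of pairwise
# distances, e.g. Robinson-Foulds distances between trees. The 4x4 matrix
# below is invented purely for illustration.
import numpy as np

def classical_mds(dist, dims=2):
    d = np.asarray(dist, dtype=float)
    n = d.shape[0]
    j = np.eye(n) - np.ones((n, n)) / n        # centering matrix
    b = -0.5 * j @ (d ** 2) @ j                # double-centred squared distances
    eigval, eigvec = np.linalg.eigh(b)
    top = np.argsort(eigval)[::-1][:dims]      # keep the largest eigenvalues
    lam = np.clip(eigval[top], 0.0, None)      # guard against tiny negative values
    return eigvec[:, top] * np.sqrt(lam)

dist = [[0, 2, 6, 6],
        [2, 0, 6, 6],
        [6, 6, 0, 2],
        [6, 6, 2, 0]]
print(np.round(classical_mds(dist), 2))        # two well-separated pairs of points
```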

    Generalized Buneman pruning for inferring the most parsimonious multi-state phylogeny.

    No full text
    Accurate reconstruction of phylogenies remains a key challenge in evolutionary biology. Most biologically plausible formulations of the problem are formally NP-hard, with no known efficient solution. In practice, the standard approach is to use fast heuristic methods that are empirically known to work very well in general but can yield results arbitrarily far from optimal. Practical exact methods, which yield exponential worst-case running times but generally much better times in practice, provide an important alternative. We report progress in this direction by introducing a provably optimal method for the weighted multi-state maximum parsimony phylogeny problem. The method is based on generalizing the notion of the Buneman graph, a construction key to efficient exact methods for binary sequences, so that it applies to sequences with an arbitrary finite number of states and arbitrary state-transition weights. We implement an integer linear programming (ILP) method for the multi-state problem using this generalized Buneman graph and demonstrate that the resulting method is able to solve data sets that are intractable for prior exact methods in run times comparable with popular heuristics. We further show that, on moderately hard problem instances, the ILP method leads to large reductions in average-case run times relative to leading heuristics. Our work provides the first method for provably optimal maximum parsimony phylogeny inference that is practical for multi-state data sets of more than a few characters.
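    To give a flavour of the ILP side on a toy instance: connecting a set of observed sequences at minimum total transition weight within a candidate graph is a Steiner-tree-style problem, which a generic single-commodity-flow formulation can express. The sketch below uses the open-source PuLP modeller (assumed to be installed) on an invented four-node graph; it is not the paper's formulation over the generalized Buneman graph, only an illustration of the kind of optimisation involved.

```python
# Toy single-commodity-flow ILP for a minimum-weight Steiner tree: connect the
# "observed" terminal nodes at minimum total edge weight, allowing extra nodes.
# Uses PuLP (assumed available); graph, weights and names are invented, and this
# is not the paper's specific formulation over the generalized Buneman graph.
import pulp

edges = {("s1", "u"): 1, ("s2", "u"): 1, ("u", "s3"): 2, ("s1", "s3"): 4}
terminals = ["s1", "s2", "s3"]                 # sequences that must be connected
nodes = {v for e in edges for v in e}
arcs = [(a, b) for (a, b) in edges] + [(b, a) for (a, b) in edges]
root, k = terminals[0], len(terminals) - 1     # root ships one flow unit to each other terminal

prob = pulp.LpProblem("toy_steiner", pulp.LpMinimize)
use = {(a, b): pulp.LpVariable(f"use_{a}_{b}", cat="Binary") for (a, b) in edges}
flow = {(a, b): pulp.LpVariable(f"flow_{a}_{b}", lowBound=0) for (a, b) in arcs}

prob += pulp.lpSum(w * use[e] for e, w in edges.items())   # minimise total edge weight
for v in nodes:                                            # flow conservation at every node
    inflow = pulp.lpSum(flow[a] for a in arcs if a[1] == v)
    outflow = pulp.lpSum(flow[a] for a in arcs if a[0] == v)
    demand = -k if v == root else (1 if v in terminals else 0)
    prob += inflow - outflow == demand
for (a, b) in edges:                                       # flow may only use chosen edges
    prob += flow[(a, b)] <= k * use[(a, b)]
    prob += flow[(b, a)] <= k * use[(a, b)]

prob.solve(pulp.PULP_CBC_CMD(msg=False))
chosen = [e for e in edges if use[e].value() > 0.5]
print(chosen, pulp.value(prob.objective))      # star through "u" with total weight 4
```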

    Evolutionary genomics: statistical and computational methods

    Get PDF
    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward

    Evolutionary Genomics

    Get PDF
    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward