
    On the complexity of computing MP distance between binary phylogenetic trees

    Within the field of phylogenetics there is great interest in distance measures to quantify the dissimilarity of two trees. Recently, a new distance measure has been proposed: the Maximum Parsimony (MP) distance. This is based on the difference of the parsimony scores of a single character on both trees under consideration, and the goal is to find the character which maximizes this difference. Here we show that computation of MP distance on two binary phylogenetic trees is NP-hard. This is a highly nontrivial extension of an earlier NP-hardness proof for two multifurcating phylogenetic trees, and it is particularly relevant given the prominence of binary trees in the phylogenetics literature. As a corollary to the main hardness result we show that computation of MP distance is also hard on binary trees if the number of states available is bounded. In fact, via a different reduction we show that it is hard even if only two states are available. Finally, as a first response to this hardness we give a simple Integer Linear Program (ILP) formulation which is capable of computing the MP distance exactly for small trees (and for larger trees when only a small number of character states are available) and which is used to computationally verify several auxiliary results required by the hardness proofs. Comment: 37 pages, 8 figures

    On the Maximum Parsimony distance between phylogenetic trees

    Within the field of phylogenetics there is great interest in distance measures to quantify the dissimilarity of two trees. Here, based on an idea of Bruen and Bryant, we propose and analyze a new distance measure: the Maximum Parsimony (MP) distance. This is based on the difference of the parsimony scores of a single character on both trees under consideration, and the goal is to find the character which maximizes this difference. In this article we show that this new distance is a metric and provides a lower bound to the well-known Subtree Prune and Regraft (SPR) distance. We also show that to compute the MP distance it is sufficient to consider only characters that are convex on one of the trees, and prove several additional structural properties of the distance. On the complexity side, we prove that calculating the MP distance is in general NP-hard, and identify an interesting island of tractability in which the distance can be calculated in polynomial time. Comment: 30 pages, 6 figures
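
    To make the definition concrete, the following is a minimal sketch of the MP distance as the abstract describes it: the largest difference, over all characters, between the parsimony scores of one character on the two trees. It is not the authors' algorithm or their ILP. It assumes rooted binary trees written as nested Python tuples of leaf names, uses Fitch's algorithm for the parsimony score, and simply enumerates every character, so it is only usable for a handful of taxa. The names `fitch_score`, `leaves`, and `mp_distance` are hypothetical, chosen for illustration.

```python
from itertools import product


def leaves(tree):
    """Return the leaf labels of a nested-tuple binary tree."""
    if not isinstance(tree, tuple):
        return [tree]
    return leaves(tree[0]) + leaves(tree[1])


def fitch_score(tree, character):
    """Parsimony score of one character (dict: leaf -> state) via Fitch's algorithm."""
    def visit(node):
        if not isinstance(node, tuple):
            return {character[node]}, 0
        left_set, left_cost = visit(node[0])
        right_set, right_cost = visit(node[1])
        common = left_set & right_set
        if common:
            # Children can agree on a state: no extra change needed here.
            return common, left_cost + right_cost
        # Children disagree: one state change is charged at this node.
        return left_set | right_set, left_cost + right_cost + 1

    return visit(tree)[1]


def mp_distance(tree1, tree2, num_states=None):
    """Brute-force MP distance: max over characters of the score difference.

    Assumption: exhaustive enumeration of all characters with at most
    `num_states` states (default: one state per taxon). Exponential time.
    """
    taxa = sorted(leaves(tree1))
    if taxa != sorted(leaves(tree2)):
        raise ValueError("both trees must be on the same taxon set")
    k = num_states or len(taxa)
    best = 0
    for states in product(range(k), repeat=len(taxa)):
        character = dict(zip(taxa, states))
        diff = abs(fitch_score(tree1, character) - fitch_score(tree2, character))
        best = max(best, diff)
    return best


if __name__ == "__main__":
    t1 = (("a", "b"), ("c", "d"))
    t2 = (("a", "c"), ("b", "d"))
    print(mp_distance(t1, t2))
```

    The exhaustive loop is exponential in the number of taxa, which is in keeping with the NP-hardness result stated above. The abstract's observation that it suffices to consider characters convex on one of the trees would shrink the search space, but this sketch does not exploit it.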

    Phylogenetic incongruence through the lens of Monadic Second Order logic

    Within the field of phylogenetics there is growing interest in measures for summarising the dissimilarity, or 'incongruence', of two or more phylogenetic trees. Many of these measures are NP-hard to compute and this has stimulated a considerable volume of research into fixed parameter tractable algorithms. In this article we use Monadic Second Order logic (MSOL) to give alternative, compact proofs of fixed parameter tractability for several well-known incongruency measures. In doing so we wish to demonstrate the considerable potential of MSOL - machinery still largely unknown outside the algorithmic graph theory community - within phylogenetics. A crucial component of this work is the observation that many of these measures, when bounded, imply the existence of an 'agreement forest' of bounded size, which in turn implies that an auxiliary graph structure, the display graph, has bounded treewidth. It is this bound on treewidth that makes the machinery of MSOL available for proving fixed parameter tractability. We give a variety of different MSOL formulations. Some are based on explicitly encoding agreement forests, while some only use them implicitly to generate the treewidth bound. Our formulations introduce a number of "phylogenetics MSOL primitives" which will hopefully be of use to other researchers.

    Impacts of terraces on phylogenetic inference

    Terraces are potentially large sets of trees with precisely the same likelihood or parsimony score, which can be induced by missing sequences in partitioned multi-locus phylogenetic data matrices. The set of trees on a terrace can be characterized by enumeration algorithms or consensus methods that exploit the pattern of partial taxon coverage in the data, independent of the sequence data themselves. Terraces add ambiguity and complexity to phylogenetic inference, particularly in settings where inference is already challenging: data sets with many taxa and relatively few loci. In this paper we present five new findings about terraces and their impacts on phylogenetic inference. First, we clarify assumptions about model parameters that are necessary for the existence of terraces. Second, we explore the dependence of terrace size on partitioning scheme and indicate how to find the partitioning scheme associated with the largest terrace containing a given tree. Third, we highlight the impact of terraces on bootstrap estimates of confidence limits in clades, and characterize the surprising result that the bootstrap proportion for a clade can be entirely determined by the frequency of bipartitions on a terrace, with some bipartitions receiving high support even when incorrect. Fourth, we dissect some effects of prior distributions of edge lengths on the computed posterior probabilities of clades on terraces, to understand an example in which long edges "attract" each other in Bayesian inference. Fifth, we show that even if data are not partitioned, patterns of missing data studied in the terrace problem can lead to instances of apparent statistical inconsistency when even a small element of heterotachy is introduced to the model generating the sequence data. Finally, we discuss strategies for remediation of some of these problems. Comment: 50 pages, 9 figures

    Species Tree Estimation Using ASTRAL: Practical Considerations

    ASTRAL is a method for reconstructing species trees after inferring a set of gene trees and is increasingly used in phylogenomic analyses. It is statistically consistent under the multi-species coalescent model, is scalable, and has shown high accuracy in simulated and empirical studies. This chapter discusses practical considerations in using ASTRAL, starting with a review of published results and pointing to the strengths and weaknesses of species tree estimation using ASTRAL. It then details the best ways to prepare input gene trees, interpret ASTRAL outputs, and perform follow-up analyses.

    Supertree Construction: Opportunities and Challenges

    Supertree construction is the process by which a set of phylogenetic trees, each on a subset of the overall set S of species, is combined into a tree on the full set S. The traditional use of supertree methods is the assembly of a large species tree from previously computed smaller species trees; however, supertree methods are also used to address large-scale tree estimation using divide-and-conquer (i.e., a dataset is divided into overlapping subsets, trees are constructed on the subsets, and then combined using the supertree method). Because most supertree methods are heuristics for NP-hard optimization problems, the use of supertree estimation on large datasets is challenging, both in terms of scalability and accuracy. In this paper, we describe the current state of the art in supertree construction and the use of supertree methods in divide-and-conquer strategies. Finally, we identify directions where future research could lead to improved supertree methods. Comment: 28 pages, will be part of Festschrift volume for Bernard Mor

    Most Compact Parsimonious Trees

    Construction of phylogenetic trees has traditionally focused on binary trees where all species appear on leaves, a problem for which numerous efficient solutions have been developed. Certain application domains though, such as viral evolution and transmission, paleontology, linguistics, and phylogenetic stemmatics, often require phylogeny inference that involves placing input species on ancestral tree nodes (live phylogeny), and polytomies. These requirements, despite their prevalence, lead to computationally harder algorithmic solutions and have been sparsely examined in the literature to date. In this article we prove some unique properties of most parsimonious live phylogenetic trees with polytomies, and describe novel algorithms to find such trees without resorting to exhaustive enumeration of all possible tree topologies. Comment: 7 pages, 4 figures, 1 table, submitted for peer review

    Gap-weighted subsequences for automatic cognate identification and phylogenetic inference

    In this paper, we describe the problem of cognate identification and its relation to phylogenetic inference. We introduce subsequence-based features for discriminating cognates from non-cognates. We show that subsequence-based features perform better than the state-of-the-art string similarity measures for the purpose of cognate identification. We use the cognate judgments for the purpose of phylogenetic inference and observe that these classifiers infer a tree which is close to the gold standard tree. The contribution of this paper is the use of subsequence features for cognate identification and the use of the resulting cognate judgments for phylogenetic inference.

    Finding the most parsimonious or likely tree in a network with respect to an alignment

    Phylogenetic networks are often constructed by merging multiple conflicting phylogenetic signals into a directed acyclic graph. It is interesting to explore whether a network constructed in this way induces biologically-relevant phylogenetic signals that were not present in the input. Here we show that, given a multiple alignment A for a set of taxa X and a rooted phylogenetic network N whose leaves are labelled by X, it is NP-hard to locate the most parsimonious phylogenetic tree displayed by N (with respect to A) even when the level of N - the maximum number of reticulation nodes within a biconnected component - is 1 and A contains only 2 distinct states. (If, additionally, gaps are allowed, the problem becomes APX-hard.) We also show that under the same conditions, and assuming a simple binary symmetric model of character evolution, finding the most likely tree displayed by the network is NP-hard. These negative results contrast with earlier work on parsimony in which it is shown that if A consists of a single column the problem is fixed parameter tractable in the level. We conclude with a discussion of why, despite the NP-hardness, both the parsimony and likelihood problems can likely be well-solved in practice.

    Gromov meets Phylogenetics - new Animals for the Zoo of Biocomputable Metrics on Tree Space

    We present a new class of metrics for unrooted phylogenetic X-trees derived from the Gromov-Hausdorff distance for (compact) metric spaces. These metrics can be efficiently computed by linear or quadratic programming. They are also robust under NNI-operations. The local behavior of the metrics shows that they are different from any formerly introduced metrics. The performance of the metrics is briefly analysed on random weighted and unweighted trees as well as random caterpillars.