78 research outputs found

    Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance

    Get PDF
    We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa.Comment: 16 pages, 11 figure

    Edge Ratchet and Simulated Annealing to Improve RF Score of the Supertree of Life

    Get PDF
    Constructing the Supertree of Life can provide crucially valuable knowledge to address many critical contemporary challenges such as fighting diseases, improving global agriculture, and protecting ecosystems to name a few. However, building such a tree is among the most complicated and challenging scientific problems. In the case of biological data, the true species tree is not available. Hence, the accuracy of the supertree is usually evaluated based on its similarity to the given source input trees. In this work, we aim at improving the accuracy of the supertree in terms of its cumulative Robinson Foulds (RF) distance to the source trees. This problem is NP-hard. Therefore, we have to resort to heuristic algorithms. We have two main contributions in this work. First, we propose a new technique, Edge Ratchet, which is used in a hill-climbing based algorithm to deal with local optimum problem. Second, we develop a Simulated Annealing algorithm to minimize total RF distance of the supertree to the source trees. Our results demonstrate that these two algorithms are able to improve the accuracy of the best existing supertree algorithms with regard to RF distance

    Optimal Completion and Comparison of Incomplete Phylogenetic Trees Under Robinson-Foulds Distance

    Get PDF

    Advancing Divide-And-Conquer Phylogeny Estimation Using Robinson-Foulds Supertrees

    Get PDF
    One of the Grand Challenges in Science is the construction of the Tree of Life, an evolutionary tree containing several million species, spanning all life on earth. However, the construction of the Tree of Life is enormously computationally challenging, as all the current most accurate methods are either heuristics for NP-hard optimization problems or Bayesian MCMC methods that sample from tree space. One of the most promising approaches for improving scalability and accuracy for phylogeny estimation uses divide-and-conquer: a set of species is divided into overlapping subsets, trees are constructed on the subsets, and then merged together using a "supertree method". Here, we present Exact-RFS-2, the first polynomial-time algorithm to find an optimal supertree of two trees, using the Robinson-Foulds Supertree (RFS) criterion (a major approach in supertree estimation that is related to maximum likelihood supertrees), and we prove that finding the RFS of three input trees is NP-hard. We also present GreedyRFS (a greedy heuristic that operates by repeatedly using Exact-RFS-2 on pairs of trees, until all the trees are merged into a single supertree). We evaluate Exact-RFS-2 and GreedyRFS, and show that they have better accuracy than the current leading heuristic for RFS

    Robinson-Foulds Supertrees

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Supertree methods synthesize collections of small phylogenetic trees with incomplete taxon overlap into comprehensive trees, or supertrees, that include all taxa found in the input trees. Supertree methods based on the well established Robinson-Foulds (RF) distance have the potential to build supertrees that retain much information from the input trees. Specifically, the RF supertree problem seeks a binary supertree that minimizes the sum of the RF distances from the supertree to the input trees. Thus, an RF supertree is a supertree that is consistent with the largest number of clusters (or clades) from the input trees.</p> <p>Results</p> <p>We introduce efficient, local search based, hill-climbing heuristics for the intrinsically hard RF supertree problem on rooted trees. These heuristics use novel non-trivial algorithms for the SPR and TBR local search problems which improve on the time complexity of the best known (naïve) solutions by a factor of Θ(<it>n</it>) and Θ(<it>n</it><sup>2</sup>) respectively (where <it>n </it>is the number of taxa, or leaves, in the supertree). We use an implementation of our new algorithms to examine the performance of the RF supertree method and compare it to matrix representation with parsimony (MRP) and the triplet supertree method using four supertree data sets. Not only did our RF heuristic provide fast estimates of RF supertrees in all data sets, but the RF supertrees also retained more of the information from the input trees (based on the RF distance) than the other supertree methods.</p> <p>Conclusions</p> <p>Our heuristics for the RF supertree problem, based on our new local search algorithms, make it possible for the first time to estimate large supertrees by directly optimizing the RF distance from rooted input trees to the supertrees. This provides a new and fast method to build accurate supertrees. RF supertrees may also be useful for estimating majority-rule(-) supertrees, which are a generalization of majority-rule consensus trees.</p

    Edge Ratchet and Simulated Annealing to Improve RF Score of the Supertree of Life

    Get PDF
    Constructing the Supertree of Life can provide crucially valuable knowledge to address many critical contemporary challenges such as fighting diseases, improving global agriculture, and protecting ecosystems to name a few. However, building such a tree is among the most complicated and challenging scientific problems. In the case of biological data, the true species tree is not available. Hence, the accuracy of the supertree is usually evaluated based on its similarity to the given source input trees. In this work, we aim at improving the accuracy of the supertree in terms of its cumulative Robinson Foulds (RF) distance to the source trees. This problem is NP-hard. Therefore, we have to resort to heuristic algorithms. We have two main contributions in this work. First, we propose a new technique, Edge Ratchet, which is used in a hill-climbing based algorithm to deal with local optimum problem. Second, we develop a Simulated Annealing algorithm to minimize total RF distance of the supertree to the source trees. Our results demonstrate that these two algorithms are able to improve the accuracy of the best existing supertree algorithms with regard to RF distance

    Fast and accurate supertrees: towards large scale phylogenies

    Get PDF
    Phylogenetics is the study of evolutionary relationships between biological entities; phylogenetic trees (phylogenies) are a visualization of these evolutionary relationships. Accurate approaches to reconstruct hylogenies from sequence data usually result in NPhard optimization problems, hence local search heuristics have to be applied in practice. These methods are highly accurate and fast enough as long as the input data is not too large. Divide-and-conquer techniques are a promising approach to boost scalability and accuracy of those local search heuristics on very large datasets. A divide-and-conquer method breaks down a large phylogenetic problem into smaller sub-problems that are computationally easier to solve. The sub-problems (overlapping trees) are then combined using a supertree method. Supertree methods merge a set of overlapping phylogenetic trees into a supertree containing all taxa of the input trees. The challenge in supertree reconstruction is the way of dealing with conflicting information in the input trees. Many different algorithms for different objective functions have been suggested to resolve these conflicts. In particular, there are methods that encode the source trees in a matrix and the supertree is constructed applying a local search heuristic to optimize the respective objective function. The most widely used supertree methods use such local search heuristics. However, to really improve the scalability of accurate tree reconstruction by divide-and-conquer approaches, accurate polynomial time methods are needed for the supertree reconstruction step. In this work, we present approaches for accurate polynomial time supertree reconstruction in particular Bad Clade Deletion (BCD), a novel heuristic supertree algorithm with polynomial running time. BCD uses minimum cuts to greedily delete a locally minimal number of columns from a matrix representation to make it compatible. Different from local search heuristics, it guarantees to return the directed perfect phylogeny for the input matrix, corresponding to the parent tree of the input trees if one exists. BCD can take support values of the source trees into account without an increase in complexity. We show how reliable clades can be used to restrict the search space for BCD and how those clades can be collected from the input data using the Greedy Strict Consensus Merger. Finally, we introduce a beam search extension for the BCD algorithm that keeps alive a constant number of partial solutions in each top-down iteration phase. The guaranteed worst-case running time of BCD with beam search extension is still polynomial. We present an exact and a randomized subroutine to generate suboptimal partial solutions. In our thorough evaluation on several simulated and biological datasets against a representative set of supertree methods we found that BCD is more accurate than the most accurate supertree methods when using support values and search space restriction on simulated data. Simultaneously BCD is faster than any other evaluated method. The beam search approach improved the accuracy of BCD on all evaluated datasets at the cost of speed. We found that BCD supertrees can boost maximum likelihood tree reconstruction when used as starting tree. Further, BCD could handle large scale datasets where local search heuristics did not converge in reasonable time. Due to its combination of speed, accuracy, and the ability to reconstruct the parent tree if one exists, BCD is a promising approach to enable outstanding scalability of divide-and-conquer approaches.Die Phylogenetik studiert die evolutionĂ€ren Beziehungen zwischen biologischen EntitĂ€ten. Phylogenetische BĂ€ume sind eine Visualisierung dieser Beziehungen. Akkurate AnsĂ€tze zur Rekonstruktion von Phylogenien aus Sequenzdaten fĂŒhren in der Regel zu NP-schweren Optimierungsproblemen, sodass in der Praxis lokale Suchheuristiken angewendet werden mĂŒssen. Diese Methoden liefern akkurate BĂ€ume und sind schnell genug, solange die Eingabedaten nicht zu groß werden. Teile-und-herrsche-Verfahren sind ein vielversprechender Ansatz, um Skalierbarkeit und Genauigkeit dieser lokalen Suchheuristiken auf sehr großen DatensĂ€tzen zu verbessern. Beim Teile-und-herrsche-Ansatz zerlegt man ein großes phylogenetisches Problem in kleinere Teilprobleme, die einfacher und schneller zu lösen sind. Die Teilprobleme, in diesem Fall ĂŒberlappende TeilbĂ€ume, mĂŒssen dann zu einem gesamtheitlichen Baum kombiniert werden. Superbaummethoden verschmelzen solche ĂŒberlappenden phylogenetischen BĂ€ume zu einem Superbaum, der alle Taxa der EingangsbĂ€ume enthĂ€lt. Die Herausforderung bei der Superbaumrekonstruktion besteht darin, mit widersprĂŒchlichen EingabebĂ€umen umzugehen. Es wurden viele verschiedene Algorithmen mit unterschiedlichen Zielfunktionen entwickelt, um solche WidersprĂŒche möglichst sinnvoll aufzulösen. Verfahren, die auf der Kodierung der EingabebĂ€ume als MatrixreprĂ€sentation basieren, sind am weitesten verbreitet. Die zum Auflösen der Konflikte verwendeten Zielfunktionen fĂŒhren in der Regel zu NP-schweren Optimierungsproblemen, sodass in der Praxis auch hier lokale Suchheuristiken zum Einsatz kommen. Da diese AnsĂ€tze nicht wesentlich besser mit der GrĂ¶ĂŸe der Eingabedaten skalieren als die direkte Rekonstruktion aus Sequenzdaten, werden fĂŒr die Superbaumrekonstruktion in Teile-undherrsche-AnsĂ€tzen akkurate Polynomialzeitmethoden benötigt. Diese Arbeit beschĂ€ftigt sich mit der akkuraten Rekonstruktion von SuperbĂ€umen in Polynomialzeit. Wir prĂ€sentieren Bad Clade Deletion (BCD), eine neue Polynomialzeitheuristik zur Superbaumrekonstruktion. BCD verwendet minimale Schnitte in Graphen, um eine minimale Anzahl von Spalten aus der MatrixreprĂ€sentation zu löschen, sodass diese konfliktfrei wird. Im Gegensatz zu lokalen Suchheuristiken garantiert BCD die Rekonstruktion einer perfekten Phylogenie, sofern eine solche fĂŒr die Eingabematrix existiert. BCD ermöglicht es, GĂŒtekriterien der EingabebĂ€ume zu berĂŒcksichtigen, ohne dass sich dadurch die KomplexitĂ€t erhöht. Weiterhin zeigen wir, wie zuverlĂ€ssige Kladen verwendet werden können, um den Suchraum fĂŒr BCD einzuschrĂ€nken und wie man diese mit Hilfe des Greedy Strict Consensus Mergers aus den Eingabedaten gewinnen kann. Schließlich stellen wir eine Strahlensuche fĂŒr BCD vor. Diese erlaubt es eine bestimmte Anzahl suboptimaler Teillösungen (anstatt nur der optimalen) zu berĂŒcksichtigen, um so das Gesamtergebnis zu verbessern. Die Worst-Case-Laufzeit der Strahlensuche ist immer noch polynomiell. Zur Berechnung suboptimaler Teillösungen stellen wir einen exakten und einen randomisierten Algorithmus vor. In einer ausfĂŒhrlichen Evaluation auf mehreren simulierten und biologischen DatensĂ€tzen vergleichen wir BCD mit einer reprĂ€sentativen Auswahl an Superbaummethoden. Wir haben herausgefunden, dass BCD bei Verwendung von GĂŒtekriterien und SuchraumbeschrĂ€nkung auf simulierten Daten genauer ist als die akkuratesten evaluierten Superbaummethoden. Gleichzeitig ist BCD deutlich schneller als alle evaluierten Methoden. Die Strahlensuche verbessert die QualitĂ€t der BCD-BĂ€ume auf allen DatensĂ€tzen, allerdings auf Kosten der Laufzeit. Weiterhin fanden wir heraus, dass ein BCD-Superbaum, der als Startbaum verwendet wird, die QualitĂ€t einer Maximum-Likelihood-Baumrekonstruktion verbessern kann. Außerdem kann BCD DatensĂ€tze verarbeiten, die so groß sind, dass lokale Suchheuristiken auf diesen nicht mehr in angemessener Zeit konvergieren. Aufgrund der Kombination aus Geschwindigkeit, Genauigkeit und der FĂ€higkeit, den Elternbaum zu rekonstruieren, sofern ein solcher existiert, ist BCD ein vielversprechender Ansatz um die Skalierbarkeit von Teile-und-herrsche-Methoden entscheidend zu verbessern
    • 

    corecore