41 research outputs found

    Optimizing Phylogenetic Supertrees Using Answer Set Programming

    Full text link
    The supertree construction problem is about combining several phylogenetic trees with possibly conflicting information into a single tree that has all the leaves of the source trees as its leaves and the relationships between the leaves are as consistent with the source trees as possible. This leads to an optimization problem that is computationally challenging and typically heuristic methods, such as matrix representation with parsimony (MRP), are used. In this paper we consider the use of answer set programming to solve the supertree construction problem in terms of two alternative encodings. The first is based on an existing encoding of trees using substructures known as quartets, while the other novel encoding captures the relationships present in trees through direct projections. We use these encodings to compute a genus-level supertree for the family of cats (Felidae). Furthermore, we compare our results to recent supertrees obtained by the MRP method.Comment: To appear in Theory and Practice of Logic Programming (TPLP), Proceedings of ICLP 201

    Polynomial supertree methods in phylogenomics: algorithms, simulations and software

    Get PDF
    One of the objectives in modern biology, especially phylogenetics, is to build larger clades of the Tree of Life. Large-scale phylogenetic analysis involves several serious challenges. The aim of this thesis is to contribute to some of the open problems in this context. In computational phylogenetics, supertree methods provide a way to reconstruct larger clades of the Tree of Life. We present a novel polynomial time approach for the computation of supertrees called FlipCut supertree. Our method combines the computation of minimum cuts from graph-based methods with a matrix representation method, namely Minimum Flip Supertrees. Here, the input trees are encoded in a 0/1/?-matrix. We present a heuristic to search for a minimum set of 0/1-flips such that the resulting matrix admits a directed perfect phylogeny. In contrast to other polynomial time approaches, our results can be interpreted in the sense that we try to minimize a global objective function, namely the number of flips in the input matrix. We extend our approach by using edge weights to weight the columns of the 0/1/?-matrix. In order to compare our new FlipCut supertree method with other recent polynomial supertree methods and matrix representation methods, we present a large scale simulation study using two different data sets. Our findings illustrate the trade-off between accuracy and running time in supertree construction, as well as the pros and cons of different supertree approaches. Furthermore, we present EPoS, a modular software framework for phylogenetic analysis and visualization. It fills the gap between command line-based algorithmic packages and visual tools without sufficient support for computational methods. By combining a powerful graphical user interface with a plugin system that allows simple integration of new algorithms, visualizations and data structures, we created a framework that is easy to use, to extend and that covers all important steps of a phylogenetic analysis

    Fixed-parameter algorithms for some combinatorial problems in bioinformatics

    Get PDF
    Fixed-parameterized algorithmics has been developed in 1990s as an approach to solve NP-hard problem optimally in a guaranteed running time. It offers a new opportunity to solve NP-hard problems exactly even on large problem instances. In this thesis, we apply fixed-parameter algorithms to cope with three NP-hard problems in bioinformatics: Flip Consensus Tree Problem is a combinatorial problem arising in computational phylogenetics. Using the formulation of the Flip Consensus Tree Problem as a graph-modification problem, we present a set of data reduction rules and two fixed-parameter algorithms with respect to the number of modifications. Additionally, we discuss several heuristic improvements to accelerate the running time of our algorithms in practice. We also report computational results on phylogenetic data. Weighted Cluster Editing Problem is a graph-modification problem, that arises in computational biology when clustering objects with respect to a given similarity or distance measure. We present one of our fixed-parameter algorithms with respect to the minimum modification cost and describe the idea of our fastest algorithm for this problem and its unweighted counterpart. Bond Order Assignment Problem asks for a bond order assignment of a molecule graph that minimizes a penalty function. We prove several complexity results on this problem and give two exact fixed-parameter algorithms for the problem. Our algorithms base on the dynamic programming approach on a tree decomposition of the molecule graph. Our algorithms are fixed-parameter with respect to the treewidth of the molecule graph and the maximum atom valence. We implemented one of our algorithms with several heuristic improvements and evaluate our algorithm on a set of real molecule graphs. It turns out that our algorithm is very fast on this dataset and even outperforms a heuristic algorithm that is usually used in practice

    Reweaving the tapestry: a supertree of birds

    Get PDF
    Supertrees are a useful method of constructing large-scale phylogenies by assembling numerous smaller phylogenies that have some, but not necessarily all, taxa in common. Birds are an obvious candidate for supertree construction as they are the most abundant land vertebrates on the planet and no comprehensive phylogeny of both extinct and extant species currently exists. In order to construct supertrees, primary analysis of characters is required. One such study, presented here, describes two new partial specimens belonging to the Primobucconidae from the Green River Formation of Wyoming (USA), which were assigned to the species Primobucco mcgrewi. Although incomplete, these specimens had preserved anatomical features not seen in other material. An attempt to further constrain their phylogenetic position was inconclusive, showing only that the Primobucconidae belong in a clade containing the extant Coraciiformes and related taxa. Over 700 such studies were used to construct a species-level supertree of Aves containing over 5000 taxa. The resulting tree shows the relationships between the main avian groups, with only a few novel clades, some of which can be explained by a lack of information regarding those taxa. The tree was constructed using a strict protocol which ensures robust, accurate and efficient data collection and processing; extending previous work by other authors. Before creating the species-level supertree the protocol was tested on the order Galliformes in order to determine the most efficient method of removing non-independent data. It was found that combining non-independent source trees via a “mini-supertree” analysis produced results more consistent with the input source data and, in addition, significantly reduced computational load. Another method for constructing large-scale trees is via a supermatrix, which is constructed from primary data collated into a single, large matrix. A molecular-only tree was constructed using both supertree and supermatrix methods, from the same data, again of the order Galliformes. Both methods performed equally as well in producing trees that fit the source data. The two methods could be considered complementary rather than conflicting as the supertree took a long time to construct but was very quick to calculate, but the supermatrix took longer to calculate, but was quicker to construct. Dependent upon the data at hand and the other factors involved, the choice of which method to use appears, from this small study, to be of little consequence. Finally an updated species-level supertree of the Dinosauria was also constructed and used to look at diversification rates in order to elucidate the “Cretaceous explosion of terrestrial life”. Results from this study show that this apparent burst in diversity at the end of the Cretaceous is a sampling artefact and in fact, dinosaurs show most of their major diversification shifts in the first third of their history

    Fast and accurate supertrees: towards large scale phylogenies

    Get PDF
    Phylogenetics is the study of evolutionary relationships between biological entities; phylogenetic trees (phylogenies) are a visualization of these evolutionary relationships. Accurate approaches to reconstruct hylogenies from sequence data usually result in NPhard optimization problems, hence local search heuristics have to be applied in practice. These methods are highly accurate and fast enough as long as the input data is not too large. Divide-and-conquer techniques are a promising approach to boost scalability and accuracy of those local search heuristics on very large datasets. A divide-and-conquer method breaks down a large phylogenetic problem into smaller sub-problems that are computationally easier to solve. The sub-problems (overlapping trees) are then combined using a supertree method. Supertree methods merge a set of overlapping phylogenetic trees into a supertree containing all taxa of the input trees. The challenge in supertree reconstruction is the way of dealing with conflicting information in the input trees. Many different algorithms for different objective functions have been suggested to resolve these conflicts. In particular, there are methods that encode the source trees in a matrix and the supertree is constructed applying a local search heuristic to optimize the respective objective function. The most widely used supertree methods use such local search heuristics. However, to really improve the scalability of accurate tree reconstruction by divide-and-conquer approaches, accurate polynomial time methods are needed for the supertree reconstruction step. In this work, we present approaches for accurate polynomial time supertree reconstruction in particular Bad Clade Deletion (BCD), a novel heuristic supertree algorithm with polynomial running time. BCD uses minimum cuts to greedily delete a locally minimal number of columns from a matrix representation to make it compatible. Different from local search heuristics, it guarantees to return the directed perfect phylogeny for the input matrix, corresponding to the parent tree of the input trees if one exists. BCD can take support values of the source trees into account without an increase in complexity. We show how reliable clades can be used to restrict the search space for BCD and how those clades can be collected from the input data using the Greedy Strict Consensus Merger. Finally, we introduce a beam search extension for the BCD algorithm that keeps alive a constant number of partial solutions in each top-down iteration phase. The guaranteed worst-case running time of BCD with beam search extension is still polynomial. We present an exact and a randomized subroutine to generate suboptimal partial solutions. In our thorough evaluation on several simulated and biological datasets against a representative set of supertree methods we found that BCD is more accurate than the most accurate supertree methods when using support values and search space restriction on simulated data. Simultaneously BCD is faster than any other evaluated method. The beam search approach improved the accuracy of BCD on all evaluated datasets at the cost of speed. We found that BCD supertrees can boost maximum likelihood tree reconstruction when used as starting tree. Further, BCD could handle large scale datasets where local search heuristics did not converge in reasonable time. Due to its combination of speed, accuracy, and the ability to reconstruct the parent tree if one exists, BCD is a promising approach to enable outstanding scalability of divide-and-conquer approaches.Die Phylogenetik studiert die evolutionĂ€ren Beziehungen zwischen biologischen EntitĂ€ten. Phylogenetische BĂ€ume sind eine Visualisierung dieser Beziehungen. Akkurate AnsĂ€tze zur Rekonstruktion von Phylogenien aus Sequenzdaten fĂŒhren in der Regel zu NP-schweren Optimierungsproblemen, sodass in der Praxis lokale Suchheuristiken angewendet werden mĂŒssen. Diese Methoden liefern akkurate BĂ€ume und sind schnell genug, solange die Eingabedaten nicht zu groß werden. Teile-und-herrsche-Verfahren sind ein vielversprechender Ansatz, um Skalierbarkeit und Genauigkeit dieser lokalen Suchheuristiken auf sehr großen DatensĂ€tzen zu verbessern. Beim Teile-und-herrsche-Ansatz zerlegt man ein großes phylogenetisches Problem in kleinere Teilprobleme, die einfacher und schneller zu lösen sind. Die Teilprobleme, in diesem Fall ĂŒberlappende TeilbĂ€ume, mĂŒssen dann zu einem gesamtheitlichen Baum kombiniert werden. Superbaummethoden verschmelzen solche ĂŒberlappenden phylogenetischen BĂ€ume zu einem Superbaum, der alle Taxa der EingangsbĂ€ume enthĂ€lt. Die Herausforderung bei der Superbaumrekonstruktion besteht darin, mit widersprĂŒchlichen EingabebĂ€umen umzugehen. Es wurden viele verschiedene Algorithmen mit unterschiedlichen Zielfunktionen entwickelt, um solche WidersprĂŒche möglichst sinnvoll aufzulösen. Verfahren, die auf der Kodierung der EingabebĂ€ume als MatrixreprĂ€sentation basieren, sind am weitesten verbreitet. Die zum Auflösen der Konflikte verwendeten Zielfunktionen fĂŒhren in der Regel zu NP-schweren Optimierungsproblemen, sodass in der Praxis auch hier lokale Suchheuristiken zum Einsatz kommen. Da diese AnsĂ€tze nicht wesentlich besser mit der GrĂ¶ĂŸe der Eingabedaten skalieren als die direkte Rekonstruktion aus Sequenzdaten, werden fĂŒr die Superbaumrekonstruktion in Teile-undherrsche-AnsĂ€tzen akkurate Polynomialzeitmethoden benötigt. Diese Arbeit beschĂ€ftigt sich mit der akkuraten Rekonstruktion von SuperbĂ€umen in Polynomialzeit. Wir prĂ€sentieren Bad Clade Deletion (BCD), eine neue Polynomialzeitheuristik zur Superbaumrekonstruktion. BCD verwendet minimale Schnitte in Graphen, um eine minimale Anzahl von Spalten aus der MatrixreprĂ€sentation zu löschen, sodass diese konfliktfrei wird. Im Gegensatz zu lokalen Suchheuristiken garantiert BCD die Rekonstruktion einer perfekten Phylogenie, sofern eine solche fĂŒr die Eingabematrix existiert. BCD ermöglicht es, GĂŒtekriterien der EingabebĂ€ume zu berĂŒcksichtigen, ohne dass sich dadurch die KomplexitĂ€t erhöht. Weiterhin zeigen wir, wie zuverlĂ€ssige Kladen verwendet werden können, um den Suchraum fĂŒr BCD einzuschrĂ€nken und wie man diese mit Hilfe des Greedy Strict Consensus Mergers aus den Eingabedaten gewinnen kann. Schließlich stellen wir eine Strahlensuche fĂŒr BCD vor. Diese erlaubt es eine bestimmte Anzahl suboptimaler Teillösungen (anstatt nur der optimalen) zu berĂŒcksichtigen, um so das Gesamtergebnis zu verbessern. Die Worst-Case-Laufzeit der Strahlensuche ist immer noch polynomiell. Zur Berechnung suboptimaler Teillösungen stellen wir einen exakten und einen randomisierten Algorithmus vor. In einer ausfĂŒhrlichen Evaluation auf mehreren simulierten und biologischen DatensĂ€tzen vergleichen wir BCD mit einer reprĂ€sentativen Auswahl an Superbaummethoden. Wir haben herausgefunden, dass BCD bei Verwendung von GĂŒtekriterien und SuchraumbeschrĂ€nkung auf simulierten Daten genauer ist als die akkuratesten evaluierten Superbaummethoden. Gleichzeitig ist BCD deutlich schneller als alle evaluierten Methoden. Die Strahlensuche verbessert die QualitĂ€t der BCD-BĂ€ume auf allen DatensĂ€tzen, allerdings auf Kosten der Laufzeit. Weiterhin fanden wir heraus, dass ein BCD-Superbaum, der als Startbaum verwendet wird, die QualitĂ€t einer Maximum-Likelihood-Baumrekonstruktion verbessern kann. Außerdem kann BCD DatensĂ€tze verarbeiten, die so groß sind, dass lokale Suchheuristiken auf diesen nicht mehr in angemessener Zeit konvergieren. Aufgrund der Kombination aus Geschwindigkeit, Genauigkeit und der FĂ€higkeit, den Elternbaum zu rekonstruieren, sofern ein solcher existiert, ist BCD ein vielversprechender Ansatz um die Skalierbarkeit von Teile-und-herrsche-Methoden entscheidend zu verbessern

    Synthesizing species trees from gene trees using the parameterized and graph-theoretic approaches

    Get PDF
    Gene trees describe how parts of the species have evolved over time, and it is assumed that gene trees have evolved along the branches of the species tree. However, some of gene trees are often discordant with the corresponding species tree due to the complicated evolution history of genes. To overcome this obstacle, median problems have emerged as a major tool for synthesizing species trees by reconciling discordance in a given collection of gene trees. Given a collection of gene trees and a cost function, the median problem seeks a tree, called median tree, that minimizes the overall cost to the gene trees. Median tree problems are typically NP-hard, and there is an increased interest in making such median tree problems available for large-scale species tree construction. In this thesis work, we first show that the gene duplication median tree problem satisfied the weaker version of the Pareto property and propose a parameterized algorithm to solve the gene duplication median tree problem. Second, we design two efficient methods to handle the issues of applying the parameterized algorithm to unrooted gene trees which are sampled from the different species. Third, we introduce the graph-theoretic formulation of the Robinson-Foulds median tree problem and a new tree edit operation. Fourth, we propose a new metric between two phylogenetic trees and examine the statistical properties of the metric. Finally, we propose a new clustering criteria in a bipartite network and propose a new NP-hard problem and its ILP formulation

    Developing and applying supertree methods in Phylogenomics and Macroevolution

    Get PDF
    Supertrees can be used to combine partially overalapping trees and generate more inclusive phylogenies. It has been proposed that Maximum Likelihood (ML) supertrees method (SM) could be developed using an exponential probability distribution to model errors in the input trees (given a proposed supertree). When the tree-­‐to-­‐tree distances used in the ML computation are symmetric differences, the ML SM has been shown to be equivalent to a Majority-­‐Rule consensus SM, and hence, exactly as the latter, it has the desirable property of being a median tree (with reference to the set of input trees). The ability to estimate the likelihood of supertrees, allows implementing Bayesian (MCMC) approaches, which have the advantage to allow the support for the clades in a supertree to be properly estimated. I present here the L.U.St software package; it contains the first implementation of a ML SM and allows for the first time statistical tests on supertrees. I also characterized the first implementation of the Bayesian (MCMC) SM. Both the ML and the Bayesian (MCMC) SMs have been tested for and found to be immune to biases. The Bayesian (MCMC) SM is applied to the reanalyses of a variety of datasets (i.e. the datasets for the Metazoa and the Carnivora), and I have also recovered the first Bayesian supertree-­‐based phylogeny of the Eubacteria and the Archaebacteria. These new SMs are discussed, with reference to other, well-­‐ known SMs like Matrix Representation with Parsimony. Both the ML and Bayesian SM offer multiple attractive advantages over current alternatives
    corecore