
    Reconstructing (super)trees from data sets with missing distances: Not all is lost

    The wealth of phylogenetic information accumulated over many decades of biological research, coupled with recent technological advances in molecular sequence generation, presents significant opportunities for researchers to investigate relationships across and within the kingdoms of life. However, to make the best use of this wealth of data, several problems must first be overcome. One key problem is finding effective strategies to deal with missing data. Here, we introduce Lasso, a novel heuristic approach for reconstructing rooted phylogenetic trees from distance matrices with missing values, for datasets where a molecular clock may be assumed. In contrast to other phylogenetic methods for partial datasets, Lasso possesses desirable properties such as its reconstructed trees being both unique and edge-weighted. These properties are achieved by Lasso restricting its leaf set to a large subset of all possible taxa, which in many practical situations is the entire taxa set. Furthermore, the Lasso approach is distance-based, rendering it very fast to run and suitable for datasets of all sizes, including large datasets such as those generated by modern Next Generation Sequencing technologies. To better understand the performance of Lasso, we assessed it by means of artificial and real biological datasets, showing its effectiveness in the presence of missing data. Furthermore, by formulating the supermatrix problem as a particular case of the missing data problem, we assessed Lasso's ability to reconstruct supertrees. We demonstrate that, although not specifically designed for such a purpose, Lasso performs better than or comparably with five leading supertree algorithms on a challenging biological dataset. Finally, we make freely available a software implementation of Lasso so that researchers may, for the first time, perform both rooted tree and supertree reconstruction with branch lengths on their own partial datasets.

    Efficient FPT algorithms for (strict) compatibility of unrooted phylogenetic trees

    In phylogenetics, a central problem is to infer the evolutionary relationships between a set of species $X$; these relationships are often depicted via a phylogenetic tree -- a tree having its leaves univocally labeled by elements of $X$ and without degree-2 nodes -- called the "species tree". One common approach for reconstructing a species tree consists in first constructing several phylogenetic trees from primary data (e.g. DNA sequences originating from some species in $X$), and then constructing a single phylogenetic tree maximizing the "concordance" with the input trees. The so-obtained tree is our estimation of the species tree and, when the input trees are defined on overlapping -- but not identical -- sets of labels, is called "supertree". In this paper, we focus on two problems that are central when combining phylogenetic trees into a supertree: the compatibility and the strict compatibility problems for unrooted phylogenetic trees. These problems are strongly related, respectively, to the notions of "containing as a minor" and "containing as a topological minor" in the graph community. Both problems are known to be fixed-parameter tractable in the number of input trees $k$, by using their expressibility in Monadic Second Order Logic and a reduction to graphs of bounded treewidth. Motivated by the fact that the dependency on $k$ of these algorithms is prohibitively large, we give the first explicit dynamic programming algorithms for solving these problems, both running in time $2^{O(k^2)} \cdot n$, where $n$ is the total size of the input. Comment: 18 pages, 1 figure.

    A Practical Algorithm for Reconstructing Level-1 Phylogenetic Networks

    Recently, much attention has been devoted to the construction of phylogenetic networks, which generalize phylogenetic trees in order to accommodate complex evolutionary processes. Here we present an efficient, practical algorithm for reconstructing level-1 phylogenetic networks - a type of network slightly more general than a phylogenetic tree - from triplets. Our algorithm has been made publicly available as the program LEV1ATHAN. It combines ideas from several known theoretical algorithms for phylogenetic tree and network reconstruction with two novel subroutines, namely an exponential-time exact algorithm and a greedy algorithm, both of which are of independent theoretical interest. Most importantly, LEV1ATHAN runs in polynomial time and always constructs a level-1 network. If the data is consistent with a phylogenetic tree, then the algorithm constructs such a tree. Moreover, if the input triplet set is dense and, in addition, is fully consistent with some level-1 network, it will find such a network. The potential of LEV1ATHAN is explored by means of an extensive simulation study and a biological dataset. One of our conclusions is that LEV1ATHAN is able to construct networks consistent with a high percentage of input triplets, even when these input triplets are affected by a low to moderate level of noise.
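To make the notion of "consistent with a percentage of input triplets" concrete, the following is a minimal sketch for the tree case only. It is not LEV1ATHAN and does not handle level-1 networks. It assumes a rooted tree encoded as nested tuples of leaf names, and a rooted triplet ab|c written as the tuple (a, b, c); the function names are hypothetical.

```python
def leaves(tree):
    """Return the set of leaf labels below a node (a leaf is a plain string)."""
    if isinstance(tree, str):
        return {tree}
    result = set()
    for child in tree:
        result |= leaves(child)
    return result


def displays_triplet(tree, a, b, c):
    """True if the rooted tree displays the triplet ab|c, i.e. the lowest
    common ancestor of a and b lies strictly below that of a, b and c."""
    if not {a, b, c} <= leaves(tree):
        return False
    node = tree
    # Descend to the lowest node whose subtree still contains all three leaves.
    while True:
        next_node = None
        for child in node:
            if {a, b, c} <= leaves(child):
                next_node = child
                break
        if next_node is None:
            break
        node = next_node
    # At that node, a and b must share one child subtree that excludes c.
    for child in node:
        below = leaves(child)
        if a in below and b in below and c not in below:
            return True
    return False


def fraction_consistent(tree, triplets):
    """Fraction of the given triplets (a, b, c), read as ab|c, displayed by the tree."""
    if not triplets:
        raise ValueError("triplet list is empty")
    hits = sum(1 for a, b, c in triplets if displays_triplet(tree, a, b, c))
    return hits / len(triplets)


if __name__ == "__main__":
    toy_tree = (("a", "b"), ("c", "d"))
    toy_triplets = [("a", "b", "c"), ("a", "c", "b"), ("c", "d", "a"), ("c", "d", "b")]
    print(fraction_consistent(toy_tree, toy_triplets))
```

In the toy input, the triplet ac|b conflicts with the tree, so it plays the role of a noisy input triplet.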

    Identifying Mislabeled Training Data

    This paper presents a new approach to identifying and eliminating mislabeled training instances for supervised learning. The goal of this approach is to improve classification accuracies produced by learning algorithms by improving the quality of the training data. Our approach uses a set of learning algorithms to create classifiers that serve as noise filters for the training data. We evaluate single-algorithm, majority-vote, and consensus filters on five datasets that are prone to labeling errors. Our experiments illustrate that filtering significantly improves classification accuracy for noise levels up to 30 percent. An analytical and empirical evaluation of the precision of our approach shows that consensus filters are conservative at throwing away good data at the expense of retaining bad data and that majority filters are better at detecting bad data at the expense of throwing away good data. This suggests that for situations in which there is a paucity of data, consensus filters are preferable, whereas majority-vote filters are preferable for situations with an abundance of data.
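The difference between the majority-vote and consensus filters can be illustrated with a short sketch. It assumes that each filter classifier has already produced a predicted label for every training instance (for example via cross-validation, which is not shown here), and that an instance counts as suspect for a classifier when the prediction disagrees with the given label. The function name and the toy data are hypothetical and are not taken from the paper.

```python
def flag_mislabeled(labels, predictions, mode="majority"):
    """Return indices of instances flagged as likely mislabeled.

    labels:      given (possibly noisy) label for each instance
    predictions: one list of predicted labels per filter classifier
    mode:        "majority"  flags an instance if more than half of the
                             classifiers disagree with its label;
                 "consensus" flags it only if every classifier disagrees.
    """
    n_classifiers = len(predictions)
    if n_classifiers == 0:
        raise ValueError("at least one classifier's predictions are required")
    flagged = []
    for i, label in enumerate(labels):
        errors = sum(1 for preds in predictions if preds[i] != label)
        if mode == "majority":
            is_noise = errors > n_classifiers / 2
        elif mode == "consensus":
            is_noise = errors == n_classifiers
        else:
            raise ValueError(f"unknown mode: {mode!r}")
        if is_noise:
            flagged.append(i)
    return flagged


if __name__ == "__main__":
    given = ["a", "b", "a", "b"]
    votes = [
        ["a", "b", "b", "a"],  # classifier 1
        ["a", "a", "b", "a"],  # classifier 2
        ["a", "b", "b", "b"],  # classifier 3
    ]
    print(flag_mislabeled(given, votes, mode="majority"))
    print(flag_mislabeled(given, votes, mode="consensus"))
```

On this toy input the consensus filter flags a subset of what the majority filter flags, which matches the abstract's description of consensus filters as the more conservative choice.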

    Near-Optimal Algorithm for Constructing Greedy Consensus Tree
