110 research outputs found

    A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees

    Reticulate events play an important role in determining evolutionary relationships. The problem of computing the minimum number of such events to explain discordance between two phylogenetic trees is a hard computational problem. Even for binary trees, exact solvers struggle to solve instances with reticulation number larger than 40-50. Here we present CycleKiller and NonbinaryCycleKiller, the first methods to produce solutions verifiably close to optimality for instances with hundreds or even thousands of reticulations. Using simulations, we demonstrate that these algorithms run quickly for large and difficult instances, producing solutions that are very close to optimality. As a spin-off from our simulations we also present TerminusEst, which is the fastest exact method currently available that can handle nonbinary trees: this is used to measure the accuracy of the NonbinaryCycleKiller algorithm. All three methods are based on extensions of previous theoretical work and are publicly available. We also apply our methods to real data.

    A Duality Based 2-Approximation Algorithm for Maximum Agreement Forest

    We give a 2-approximation algorithm for the Maximum Agreement Forest problem on two rooted binary trees. This NP-hard problem has been studied extensively in the past two decades, since it can be used to compute the rooted Subtree Prune-and-Regraft (rSPR) distance between two phylogenetic trees. Our algorithm is combinatorial and its running time is quadratic in the input size. To prove the approximation guarantee, we construct a feasible dual solution for a novel linear programming formulation. In addition, we show this linear program is stronger than previously known formulations, and we give a compact formulation, showing that it can be solved in polynomial time.

    Better Practical Algorithms for rSPR Distance and Hybridization Number

    The problem of computing the rSPR distance of two phylogenetic trees (denoted by RDC) is NP-hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by HNC). Since they are important problems in phylogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this paper, we design and implement one exact algorithm for HNC and several approximation algorithms for RDC and HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation programs output much better results than the previous best ones; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the use of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within a short time.

    SPRIT: Identifying horizontal gene transfer in rooted phylogenetic trees

    Background: Phylogenetic trees based on sequences from a set of taxa can be incongruent due to horizontal gene transfer (HGT). By identifying the HGT events, we can reconcile the gene trees and derive a taxon tree that adequately represents the species' evolutionary history. One HGT can be represented by a rooted Subtree Prune and Regraft (rSPR) operation, and the number of rSPRs separating two trees corresponds to the minimum number of HGT events. Identifying the minimum number of rSPRs separating two trees is NP-hard, but the problem is fixed-parameter tractable. A number of heuristic and two exact approaches to identifying the minimum number of rSPRs have been proposed. This is the first implementation delivering an exact solution as well as the intermediate trees connecting the input trees. Results: We present the SPR Identification Tool (SPRIT), a novel algorithm that solves the fixed-parameter tractable minimum rSPR problem, and its GPL-licensed Java implementation. The algorithm can be used in two ways: an exhaustive search that guarantees the minimum rSPR distance, and a heuristic approach that guarantees finding a solution, but not necessarily the minimum one. We benchmarked SPRIT against other software in two different settings: small to medium sized trees, i.e. five to one hundred taxa, and large trees, i.e. thousands of taxa. In the small to medium tree size setting with random artificial incongruence, SPRIT's heuristic mode outperforms the other software by always delivering a solution with a low overestimation of the rSPR distance. In the large tree setting SPRIT compares well to the alternatives when benchmarked on finding a minimum solution within a reasonable time. SPRIT presents both the minimum rSPR distance and the intermediate trees. Conclusions: When used in exhaustive search mode, SPRIT identifies the minimum number of rSPRs needed to reconcile two incongruent rooted trees. SPRIT also performs quick approximations of the minimum rSPR distance, which are comparable to, and often better than, purely heuristic solutions. Put together, SPRIT is an excellent tool for identification of HGT events and pinpointing which taxa have been involved in HGT.

    Approximating subtree distances between Phylogenies

    We give a 5-approximation algorithm for the rooted Subtree-Prune-and-Regraft (rSPR) distance between two phylogenies, which was recently shown to be NP-complete by Bordewich and Semple [5]. This paper presents the first approximation result for this important tree distance. The algorithm follows a standard format for tree distances such as Rodrigues et al. [24] and Hein et al. [13]. The novel ideas are in the analysis. In the analysis, the cost of the algorithm uses a "cascading" scheme that accounts for possible wrong moves. This accounting is missing from previous analyses of tree distance approximation algorithms. Further, we show how all algorithms of this type can be implemented in linear time and give experimental results.

    Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance

    We present a new method for inferring species trees from multi-copy gene trees. Our method is based on a generalization of the Robinson-Foulds (RF) distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple leaves can have the same label. Unlike most previous phylogenetic methods using gene trees, this method does not assume that gene tree incongruence is caused by a single, specific biological process, such as gene duplication and loss, deep coalescence, or lateral gene transfer. We prove that it is NP-hard to compute the RF distance between two mul-trees, but it is easy to calculate the generalized RF distance between a mul-tree and a singly-labeled tree. Motivated by this observation, we formulate the RF supertree problem for mul-trees (MulRF), which takes a collection of mul-trees and constructs a species tree that minimizes the total RF distance from the input mul-trees. We present a fast heuristic algorithm for the MulRF supertree problem. Simulation experiments demonstrate that the MulRF method produces more accurate species trees than gene tree parsimony methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Furthermore, the MulRF heuristic runs quickly on data sets containing hundreds of trees with up to a hundred taxa. Comment: 16 pages, 11 figures
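
For orientation, the ordinary RF distance that the abstract above generalizes can be sketched in a few lines. This is only an illustration for rooted, singly-labeled trees, not the mul-tree generalization or the MulRF heuristic from the paper. The nested-tuple tree encoding and the names `clusters` and `rf_distance` are assumptions made for this sketch.

```python
from typing import FrozenSet, Set, Tuple, Union

# Assumed encoding: a tree is a leaf label (str) or a tuple of subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clusters(tree: Tree) -> Set[FrozenSet[str]]:
    """Collect the leaf sets below every internal node except the root."""
    found: Set[FrozenSet[str]] = set()

    def leaves(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    # The root cluster is shared by all trees on the same leaf set, so drop it.
    found.discard(all_leaves)
    return found


def rf_distance(t1: Tree, t2: Tree) -> int:
    """Count the clusters that occur in exactly one of the two trees."""
    return len(clusters(t1) ^ clusters(t2))


t1 = ((("a", "b"), "c"), "d")
t2 = ((("a", "c"), "b"), "d")
print(rf_distance(t1, t2))
```

Here the two trees share the cluster {a, b, c} and differ in {a, b} versus {a, c}. Some authors halve this count; the sketch reports the raw symmetric difference.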

    Efficiency of Algorithms in Phylogenetics

    Phylogenetics is the study of evolutionary relationships between species. Phylogenetic trees have long been the standard object used in evolutionary biology to illustrate how a given set of species are related. There are some groups (including certain plant and fish species) for which the ancestral history contains reticulation events, caused by processes that include hybridization, lateral gene transfer, and recombination. For such groups of species, it is appropriate to represent their ancestral history by phylogenetic networks: rooted acyclic digraphs, where arcs represent lines of genetic inheritance and vertices of in-degree at least two represent reticulation events. This thesis is concerned with the efficiency, accuracy, and tractability of mathematical models for phylogenetic network methods. Three important and related measures for summarizing the dissimilarity in phylogenetic trees are the minimum number of hybridization events required to fit two phylogenetic trees onto a single phylogenetic network (the hybridization number), the (rooted) subtree prune and regraft distance (the rSPR distance) and the tree bisection and reconnection distance (the TBR distance) between two phylogenetic trees. The respective problems of computing these measures are known to be NP-hard, but also fixed-parameter tractable in their respective natural parameters. This means that, while they are hard to compute in general, for cases in which a parameter (here the hybridization number and rSPR/TBR distance, respectively) is small, the problem can be solved efficiently even for large input trees. 
Here, we present new analyses showing that the use of the “cluster reduction” rule – already defined for the hybridization number and the rSPR distance and introduced here for the TBR distance – can transform any O(f(p) · n)-time algorithm for any of these problems into an O(f(k) · n)-time one, where n is the number of leaves of the phylogenetic trees, p is the natural parameter and k is a much stronger (that is, smaller) parameter: the minimum level of a phylogenetic network displaying both trees. These results appear in [9]. Traditional “distance based methods” reconstruct a phylogenetic tree from a matrix of pairwise distances between taxa. A phylogenetic network is a generalization of a phylogenetic tree that can describe evolutionary events such as reticulation and hybridization that are not tree-like. Although evolution has been known to be more accurately modelled by a network than a tree for some time, only recently have efforts been made to directly reconstruct a phylogenetic network from sequence data, as opposed to reconstructing several trees first and then trying to combine them into a single coherent network. In this work, we present a generalisation of the UPGMA algorithm for ultrametric tree reconstruction which can accurately reconstruct ultrametric tree-child networks from the set of distinct distances between each pair of taxa. This result will also appear in [15]. Moreover, we analyse the safety radius of the NETWORKUPGMA algorithm and show that it has safety radius 1/2. This means that if we can obtain accurate estimates of the set of distances between each pair of taxa in an ultrametric tree-child network, then NETWORKUPGMA correctly reconstructs the true network.
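
As background for the abstract above, the classical UPGMA procedure for trees can be sketched as follows. This is the standard tree algorithm only, not the NETWORKUPGMA generalisation described in the thesis. The function name `upgma`, the nested-tuple output, and the toy distance matrix are assumptions made for this sketch.

```python
from itertools import combinations


def upgma(labels, dist):
    """Classical UPGMA on a distance map keyed by 2-tuples of labels.

    Returns (root, heights): the tree as nested 2-tuples and the height
    assigned to each internal node.
    """
    size = {name: 1 for name in labels}  # current clusters and their leaf counts
    d = {frozenset(pair): value for pair, value in dist.items()}
    heights = {}
    while len(size) > 1:
        # Merge the closest pair of current clusters.
        a, b = min(combinations(size, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        heights[merged] = d[frozenset((a, b))] / 2
        # Distance to every other cluster is the size-weighted average.
        for other in size:
            if other not in (a, b):
                d[frozenset((merged, other))] = (
                    size[a] * d[frozenset((a, other))]
                    + size[b] * d[frozenset((b, other))]
                ) / (size[a] + size[b])
        size[merged] = size[a] + size[b]
        del size[a], size[b]
    (root,) = size
    return root, heights


# Hypothetical ultrametric distances: (a, b) join at height 1, c at 2, d at 3.
toy = {
    ("a", "b"): 2, ("a", "c"): 4, ("b", "c"): 4,
    ("a", "d"): 6, ("b", "d"): 6, ("c", "d"): 6,
}
root, heights = upgma(["a", "b", "c", "d"], toy)
print(root, heights[root])
```

On exactly ultrametric input such as the toy matrix, each merge height is half the distance between the merged clusters, so the recovered node heights match the generating tree.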