A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees
Reticulate events play an important role in determining evolutionary
relationships. The problem of computing the minimum number of such events to
explain discordance between two phylogenetic trees is a hard computational
problem. Even for binary trees, exact solvers struggle to solve instances with
reticulation number larger than 40-50. Here we present CycleKiller and
NonbinaryCycleKiller, the first methods to produce solutions verifiably close
to optimality for instances with hundreds or even thousands of reticulations.
Using simulations, we demonstrate that these algorithms run quickly for large
and difficult instances, producing solutions that are very close to optimality.
As a spin-off from our simulations we also present TerminusEst, which is the
fastest exact method currently available that can handle nonbinary trees: this
is used to measure the accuracy of the NonbinaryCycleKiller algorithm. All
three methods are based on extensions of previous theoretical work and are
publicly available. We also apply our methods to real data.
A Duality Based 2-Approximation Algorithm for Maximum Agreement Forest
We give a 2-approximation algorithm for the Maximum Agreement Forest problem
on two rooted binary trees. This NP-hard problem has been studied extensively
in the past two decades, since it can be used to compute the rooted Subtree
Prune-and-Regraft (rSPR) distance between two phylogenetic trees. Our algorithm
is combinatorial and its running time is quadratic in the input size. To prove
the approximation guarantee, we construct a feasible dual solution for a novel
linear programming formulation. In addition, we show this linear program is
stronger than previously known formulations, and we give a compact formulation,
showing that it can be solved in polynomial time.
Better Practical Algorithms for rSPR Distance and Hybridization Number
The problem of computing the rSPR distance of two phylogenetic trees (denoted by RDC) is NP-hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by HNC). Since they are important problems in phylogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this paper, we design and implement one exact algorithm for HNC and several approximation algorithms for RDC and HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation programs output much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the usage of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within short time
SPRIT: Identifying horizontal gene transfer in rooted phylogenetic trees
Background: Phylogenetic trees based on sequences from a set of taxa can be incongruent due to horizontal gene transfer (HGT). By identifying the HGT events, we can reconcile the gene trees and derive a taxon tree that adequately represents the species' evolutionary history. One HGT can be represented by a rooted Subtree Prune and Regraft (rSPR) operation, and the number of rSPRs separating two trees corresponds to the minimum number of HGT events. Identifying the minimum number of rSPRs separating two trees is NP-hard, but the problem is fixed-parameter tractable. A number of heuristic approaches and two exact approaches to identifying the minimum number of rSPRs have been proposed. This is the first implementation delivering an exact solution as well as the intermediate trees connecting the input trees.
Results: We present the SPR Identification Tool (SPRIT), a novel algorithm that solves the fixed-parameter tractable minimum rSPR problem, and its GPL-licensed Java implementation. The algorithm can be used in two ways: an exhaustive search that guarantees the minimum rSPR distance, and a heuristic approach that guarantees finding a solution, but not necessarily the minimum one. We benchmarked SPRIT against other software in two different settings: small to medium-sized trees (five to one hundred taxa) and large trees (thousands of taxa). In the small to medium tree size setting with random artificial incongruence, SPRIT's heuristic mode outperforms the other software by always delivering a solution with a low overestimation of the rSPR distance. In the large tree setting, SPRIT compares well to the alternatives when benchmarked on finding a minimum solution within a reasonable time. SPRIT presents both the minimum rSPR distance and the intermediate trees.
Conclusions: When used in exhaustive search mode, SPRIT identifies the minimum number of rSPRs needed to reconcile two incongruent rooted trees. SPRIT also performs quick approximations of the minimum rSPR distance, which are comparable to, and often better than, purely heuristic solutions. Put together, SPRIT is an excellent tool for identification of HGT events and pinpointing which taxa have been involved in HGT.
Approximating subtree distances between Phylogenies
We give a 5-approximation algorithm for the rooted Subtree-Prune-and-Regraft (rSPR) distance between two phylogenies, which was recently shown to be NP-complete by Bordewich and Semple [5]. This paper presents the first approximation result for this important tree distance. The algorithm follows a standard format for tree distances such as Rodrigues et al. [24] and Hein et al. [13]. The novel ideas are in the analysis. In the analysis, the cost of the algorithm uses a "cascading" scheme that accounts for possible wrong moves. This accounting is missing from previous analyses of tree distance approximation algorithms. Further, we show how all algorithms of this type can be implemented in linear time and give experimental results.
Inferring Species Trees from Incongruent Multi-Copy Gene Trees Using the Robinson-Foulds Distance
We present a new method for inferring species trees from multi-copy gene
trees. Our method is based on a generalization of the Robinson-Foulds (RF)
distance to multi-labeled trees (mul-trees), i.e., gene trees in which multiple
leaves can have the same label. Unlike most previous phylogenetic methods using
gene trees, this method does not assume that gene tree incongruence is caused
by a single, specific biological process, such as gene duplication and loss,
deep coalescence, or lateral gene transfer. We prove that it is NP-hard to
compute the RF distance between two mul-trees, but it is easy to calculate the
generalized RF distance between a mul-tree and a singly-labeled tree. Motivated
by this observation, we formulate the RF supertree problem for mul-trees
(MulRF), which takes a collection of mul-trees and constructs a species tree
that minimizes the total RF distance from the input mul-trees. We present a
fast heuristic algorithm for the MulRF supertree problem. Simulation
experiments demonstrate that the MulRF method produces more accurate species
trees than gene tree parsimony methods when incongruence is caused by gene tree
error, duplications and losses, and/or lateral gene transfer. Furthermore, the
MulRF heuristic runs quickly on data sets containing hundreds of trees with up
to a hundred taxa.
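For orientation, the classic Robinson-Foulds distance on singly-labeled rooted trees, which MulRF generalizes, counts the clusters that occur in exactly one of the two trees. The sketch below is a minimal illustration under assumptions that are not from the abstract: trees are encoded as nested Python tuples with string leaf labels, and the helper names `clusters` and `rf_distance` are hypothetical. It does not handle mul-trees and is not the MulRF implementation.

```python
def clusters(tree):
    """Return the non-trivial clusters (leaf sets below internal nodes) of a
    rooted tree given as nested tuples, e.g. (("a", "b"), "c")."""
    found = set()

    def leaves(node):
        # A leaf is any non-tuple label; an internal node is a tuple of children.
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    all_leaves = leaves(tree)
    # The root cluster is shared by every tree on the same taxa, so drop it.
    found.discard(all_leaves)
    return found


def rf_distance(t1, t2):
    """Size of the symmetric difference of the two cluster sets
    (singly-labeled rooted trees on the same leaf set only)."""
    return len(clusters(t1) ^ clusters(t2))


# Illustrative four-taxon trees that disagree on the placement of b and c.
t1 = ((("a", "b"), "c"), "d")
t2 = ((("a", "c"), "b"), "d")
print(rf_distance(t1, t2))
```

Here `t1` contains the cluster {a, b} and `t2` contains {a, c}, while both share {a, b, c}; only the unshared clusters contribute to the distance.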
Efficiency of Algorithms in Phylogenetics
Phylogenetics is the study of evolutionary relationships between species. Phylogenetic
trees have long been the standard object used in evolutionary biology to illustrate how a
given set of species are related. There are some groups (including certain plant and fish
species) for which the ancestral history contains reticulation events, caused by processes that
include hybridization, lateral gene transfer, and recombination. For such groups of species, it
is appropriate to represent their ancestral history by phylogenetic networks: rooted acyclic
digraphs, where arcs represent lines of genetic inheritance and vertices of in-degree at least
two represent reticulation events. This thesis is concerned with the efficiency, accuracy, and
tractability of mathematical models for phylogenetic network methods.
Three important and related measures for summarizing the dissimilarity in phylogenetic
trees are the minimum number of hybridization events required to fit two phylogenetic trees
onto a single phylogenetic network (the hybridization number), the (rooted) subtree prune
and regraft distance (the rSPR distance) and the tree bisection and reconnection distance (the
TBR distance) between two phylogenetic trees. The respective problems of computing these
measures are known to be NP-hard, but also fixed-parameter tractable in their respective
natural parameters. This means that, while they are hard to compute in general, for cases
in which a parameter (here the hybridization number and rSPR/TBR distance, respectively)
is small, the problem can be solved efficiently even for large input trees. Here, we present
new analyses showing that the use of the “cluster reduction” rule – already defined for the
hybridization number and the rSPR distance and introduced here for the TBR distance – can
transform any O(f(p) · n)-time algorithm for any of these problems into an O(f(k) · n)-time
one, where n is the number of leaves of the phylogenetic trees, p is the natural parameter
and k is a much stronger (that is, smaller) parameter: the minimum level of a phylogenetic
network displaying both trees. These results appear in [9].
Traditional “distance based methods” reconstruct a phylogenetic tree from a matrix of pairwise
distances between taxa. A phylogenetic network is a generalization of a phylogenetic
tree that can describe evolutionary events such as reticulation and hybridization that are not
tree-like. Although evolution has been known to be more accurately modelled by a network
than a tree for some time, only recently have efforts been made to directly reconstruct a
phylogenetic network from sequence data, as opposed to reconstructing several trees first and then trying to combine them into a single coherent network. In this work, we present
a generalisation of the UPGMA algorithm for ultrametric tree reconstruction which can
accurately reconstruct ultrametric tree-child networks from the set of distinct distances
between each pair of taxa. This result will also appear in [15]. Moreover, we analyse the
safety radius of the NETWORKUPGMA algorithm and show that it has safety radius 1/2.
This means that if we can obtain accurate estimates of the set of distances between each pair
of taxa in an ultrametric tree-child network, then NETWORKUPGMA correctly reconstructs
the true network.
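The network method described above generalises UPGMA. As a point of reference, classic UPGMA for trees repeatedly merges the two closest clusters and replaces their distances to every other cluster with a size-weighted average. The following is a minimal sketch of that tree-only procedure, not of NETWORKUPGMA; the function name `upgma`, the nested-tuple tree encoding, and the `frozenset`-keyed distance dictionary are assumptions made for illustration.

```python
from itertools import combinations


def upgma(labels, dist):
    """Classic UPGMA on a pairwise distance table.

    labels: iterable of leaf labels.
    dist:   dict mapping frozenset({x, y}) to the distance between leaves x and y.
    Returns (root, heights): the tree as nested tuples and the height of each
    internal node (half the distance at which its two children were merged).
    """
    sizes = {label: 1 for label in labels}  # active cluster -> number of leaves
    d = dict(dist)                          # working copy, extended with merged clusters
    heights = {}
    while len(sizes) > 1:
        # Pick the closest pair of active clusters.
        a, b = min(combinations(sizes, 2), key=lambda pair: d[frozenset(pair)])
        merged = (a, b)
        heights[merged] = d[frozenset((a, b))] / 2
        size_a, size_b = sizes.pop(a), sizes.pop(b)
        # Distance from the merged cluster to every remaining cluster is the
        # size-weighted average of the two old distances.
        for c in sizes:
            d[frozenset((merged, c))] = (
                size_a * d[frozenset((a, c))] + size_b * d[frozenset((b, c))]
            ) / (size_a + size_b)
        sizes[merged] = size_a + size_b
    (root,) = sizes
    return root, heights


# Illustrative ultrametric distances on four taxa.
example = {
    frozenset(("a", "b")): 2, frozenset(("c", "d")): 4,
    frozenset(("a", "c")): 8, frozenset(("a", "d")): 8,
    frozenset(("b", "c")): 8, frozenset(("b", "d")): 8,
}
root, heights = upgma(["a", "b", "c", "d"], example)
print(root, heights[root])
```

On ultrametric input such as `example`, the merge order recovers the grouping of `a` with `b` and `c` with `d` before joining the two pairs at the root.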