On the computational complexity of the rooted subtree prune and regraft distance
The graph-theoretic operation of rooted subtree prune and regraft is increasingly being used as a tool for understanding and modelling reticulation events in evolutionary biology. In this paper, we show that computing the rooted subtree prune and regraft distance between two rooted binary phylogenetic trees on the same label set is NP-hard. This resolves a longstanding open problem. Furthermore, we show that this distance is fixed-parameter tractable when parameterised by the distance between the two trees.
A tight kernel for computing the tree bisection and reconnection distance between two phylogenetic trees
In 2001 Allen and Steel showed that, if subtree and chain reduction rules
have been applied to two unrooted phylogenetic trees, the reduced trees will
have at most 28k taxa where k is the TBR (Tree Bisection and Reconnection)
distance between the two trees. Here we reanalyse Allen and Steel's
kernelization algorithm and prove that the reduced instances will in fact have
at most 15k-9 taxa. Moreover we show, by describing a family of instances which
have exactly 15k-9 taxa after reduction, that this new bound is tight. These
instances also have no common clusters, showing that a third
commonly-encountered reduction rule, the cluster reduction, cannot further
reduce the size of the kernel in the worst case. To achieve these results we
introduce and use "unrooted generators" which are analogues of rooted
structures that have appeared earlier in the phylogenetic networks literature.
Using similar argumentation we show that, for the minimum hybridization problem
on two rooted trees, 9k-2 is a tight bound (when subtree and chain reduction
rules have been applied) and 9k-4 is a tight bound (when, additionally, the
cluster reduction has been applied) on the number of taxa, where k is the
hybridization number of the two trees.
Comment: One figure added, two small typos fixed. This version to appear in
SIDMA (SIAM Journal on Discrete Mathematics)
Rearrangement operations on unrooted phylogenetic networks
Rearrangement operations transform a phylogenetic tree into another one and hence induce a metric on the space of phylogenetic trees. Popular operations for unrooted phylogenetic trees are NNI (nearest neighbour interchange), SPR (subtree prune and regraft), and TBR (tree bisection and reconnection). Recently, these operations have been extended to unrooted phylogenetic networks, which are generalisations of phylogenetic trees that can model reticulated evolutionary relationships. Here, we study global and local properties of spaces of phylogenetic networks under these three operations. In particular, we prove connectedness and asymptotic bounds on the diameters of spaces of different classes of phylogenetic networks, including tree-based and level-k networks. We also examine the behaviour of shortest TBR-sequences between two phylogenetic networks in a class, and whether the TBR-distance changes if intermediate networks from other classes are allowed: for example, the space of phylogenetic trees is an isometric subgraph of the space of phylogenetic networks under TBR. Lastly, we show that computing the TBR-distance and the PR-distance of two phylogenetic networks is NP-hard.
Computing Maximum Agreement Forests without Cluster Partitioning is Folly
Computing a maximum (acyclic) agreement forest (M(A)AF) of a pair of phylogenetic trees is known to be fixed-parameter tractable; the two main techniques are kernelization and depth-bounded search. In theory, kernelization-based algorithms for this problem are not competitive, but they perform remarkably well in practice. We shed light on why this is the case. Our results show that, probably unsurprisingly, the kernel is often much smaller in practice than the theoretical worst case, but not small enough to fully explain the good performance of these algorithms. The key to performance is cluster partitioning, a technique used in almost all fast M(A)AF algorithms. In theory, cluster partitioning does not help: some instances are highly clusterable, others not at all. However, our experiments show that cluster partitioning leads to substantial performance improvements for kernelization-based M(A)AF algorithms. In contrast, kernelizing the individual clusters before solving them using exponential search yields only very modest performance improvements or even hurts performance; for the vast majority of inputs, kernelization leads to no reduction in the maximal cluster size at all. The choice of the algorithm applied to solve individual clusters also significantly impacts performance, even though our limited experiment to evaluate this produced no clear winner; depth-bounded search, exponential search interleaved with kernelization, and an ILP-based algorithm all achieved competitive performance.
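Cluster partitioning starts from the clusters (leaf sets below a node) that both rooted trees share. The following is a minimal sketch of that first step only, not the algorithm evaluated in the paper; the nested-tuple tree representation and the function names `clusters` and `common_clusters` are assumptions made for illustration.

```python
from typing import FrozenSet, Set, Tuple, Union

# Hypothetical representation: a leaf is a string, an internal node is a
# tuple of child subtrees.
Tree = Union[str, Tuple["Tree", ...]]


def clusters(tree: Tree) -> Set[FrozenSet[str]]:
    """Return the set of leaves below every node of a rooted tree."""
    found: Set[FrozenSet[str]] = set()

    def walk(node: Tree) -> FrozenSet[str]:
        if isinstance(node, str):
            leaves = frozenset([node])
        else:
            leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves

    walk(tree)
    return found


def common_clusters(t1: Tree, t2: Tree) -> Set[FrozenSet[str]]:
    """Clusters shared by both trees, excluding singletons and the full leaf set."""
    c1, c2 = clusters(t1), clusters(t2)
    n_leaves = max(len(c) for c in c1)
    return {c for c in c1 & c2 if 1 < len(c) < n_leaves}


t1 = ((("a", "b"), "c"), ("d", "e"))
t2 = ((("a", "c"), "b"), ("e", "d"))
print(sorted(sorted(c) for c in common_clusters(t1, t2)))
# → [['a', 'b', 'c'], ['d', 'e']]
```

Each shared cluster marks a place where the instance could be split into smaller subproblems; how those subproblems are solved and recombined is outside the scope of this sketch.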
Efficiency of Algorithms in Phylogenetics
Phylogenetics is the study of evolutionary relationships between species. Phylogenetic
trees have long been the standard object used in evolutionary biology to illustrate how a
given set of species are related. There are some groups (including certain plant and fish
species) for which the ancestral history contains reticulation events, caused by processes that
include hybridization, lateral gene transfer, and recombination. For such groups of species, it
is appropriate to represent their ancestral history by phylogenetic networks: rooted acyclic
digraphs, where arcs represent lines of genetic inheritance and vertices of in-degree at least
two represent reticulation events. This thesis is concerned with the efficiency, accuracy, and
tractability of mathematical models for phylogenetic network methods.
Three important and related measures for summarizing the dissimilarity in phylogenetic
trees are the minimum number of hybridization events required to fit two phylogenetic trees
onto a single phylogenetic network (the hybridization number), the (rooted) subtree prune
and regraft distance (the rSPR distance) and the tree bisection and reconnection distance (the
TBR distance) between two phylogenetic trees. The respective problems of computing these
measures are known to be NP-hard, but also fixed-parameter tractable in their respective
natural parameters. This means that, while they are hard to compute in general, for cases
in which a parameter (here the hybridization number and rSPR/TBR distance, respectively)
is small, the problem can be solved efficiently even for large input trees. Here, we present
new analyses showing that the use of the "cluster reduction" rule (already defined for the
hybridization number and the rSPR distance and introduced here for the TBR distance) can
transform any O(f(p) · n)-time algorithm for any of these problems into an O(f(k) · n)-time
one, where n is the number of leaves of the phylogenetic trees, p is the natural parameter
and k is a much stronger (that is, smaller) parameter: the minimum level of a phylogenetic
network displaying both trees. These results appear in [9].
Traditional "distance-based methods" reconstruct a phylogenetic tree from a matrix of pairwise
distances between taxa. A phylogenetic network is a generalization of a phylogenetic
tree that can describe evolutionary events such as reticulation and hybridization that are not
tree-like. Although evolution has been known to be more accurately modelled by a network
than a tree for some time, only recently have efforts been made to directly reconstruct a
phylogenetic network from sequence data, as opposed to reconstructing several trees first and then trying to combine them into a single coherent network. In this work, we present
a generalisation of the UPGMA algorithm for ultrametric tree reconstruction which can
accurately reconstruct ultrametric tree-child networks from the set of distinct distances
between each pair of taxa. This result will also appear in [15]. Moreover, we analyse the
safety radius of the NETWORKUPGMA algorithm and show that it has safety radius 1/2.
This means that if we can obtain accurate estimates of the set of distances between each pair
of taxa in an ultrametric tree-child network, then NETWORKUPGMA correctly reconstructs
the true network.
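For orientation, the classical UPGMA procedure for ultrametric trees, which the thesis generalises, can be sketched as follows. This is the standard tree algorithm, not NETWORKUPGMA, whose details the abstract does not give; the function name `upgma`, the `frozenset`-keyed distance dictionary, and the internal cluster ids of the form `#0`, `#1` are assumptions made for illustration.

```python
from itertools import combinations
from typing import Dict, FrozenSet, List


def upgma(taxa: List[str], dist: Dict[FrozenSet[str], float]):
    """Classical UPGMA. Returns (tree as nested tuples, root height)."""
    # Each active cluster id maps to (subtree, number of leaves, height).
    active = {name: (name, 1, 0.0) for name in taxa}
    # Working copy of the distances between active clusters.
    d = {frozenset(p): dist[frozenset(p)] for p in combinations(taxa, 2)}
    next_id = 0
    while len(active) > 1:
        # Merge the closest pair of active clusters.
        a, b = min(combinations(active, 2), key=lambda p: d[frozenset(p)])
        (ta, na, _), (tb, nb, _) = active.pop(a), active.pop(b)
        height = d[frozenset((a, b))] / 2.0
        merged = f"#{next_id}"  # assumed not to clash with any taxon name
        next_id += 1
        # Size-weighted average distance to every remaining cluster.
        for c in active:
            d[frozenset((merged, c))] = (
                na * d[frozenset((a, c))] + nb * d[frozenset((b, c))]
            ) / (na + nb)
        active[merged] = ((ta, tb), na + nb, height)
    tree, _, height = next(iter(active.values()))
    return tree, height


pairs = {
    ("a", "b"): 2.0, ("c", "d"): 4.0,
    ("a", "c"): 6.0, ("a", "d"): 6.0, ("b", "c"): 6.0, ("b", "d"): 6.0,
}
dist = {frozenset(p): v for p, v in pairs.items()}
print(upgma(["a", "b", "c", "d"], dist))
# → ((('a', 'b'), ('c', 'd')), 3.0)
```

The input here is an exact ultrametric, so the merge order is unambiguous. With noisy distance estimates the closest pair can change, which is the situation a safety radius bounds.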
Treewidth distance on phylogenetic trees
In this article we study the treewidth of the display graph, an auxiliary graph structure obtained from the fusion of phylogenetic (i.e., evolutionary) trees at their leaves. Earlier work has shown that the treewidth of the display graph is bounded if the trees are in some formal sense topologically similar. Here we further expand upon this relationship. We analyse a number of reduction rules, commonly used in the phylogenetics literature to obtain fixed-parameter tractable algorithms. In some cases (the subtree reduction) the reduction rules behave similarly with respect to treewidth, while others (the cluster reduction) behave very differently, and the behaviour of the chain reduction is particularly intriguing because of its link with graph separators and forbidden minors. We also show that the gap between treewidth and Tree Bisection and Reconnection (TBR) distance can be infinitely large, and that unlike, for example, planar graphs the treewidth of the display graph can be as much as linear in its number of vertices. A number of other auxiliary results are given. We conclude with a discussion and list a number of open problems.
Relaxed Agreement Forests
There are multiple factors which can cause the phylogenetic inference process
to produce two or more conflicting hypotheses of the evolutionary history of a
set X of biological entities. That is: phylogenetic trees with the same set of
leaf labels X but with distinct topologies. This leads naturally to the goal of
quantifying the difference between two such trees T_1 and T_2. Here we
introduce the problem of computing a 'maximum relaxed agreement forest' (MRAF)
and use this as a proxy for the dissimilarity of T_1 and T_2, which in this
article we assume to be unrooted binary phylogenetic trees. MRAF asks for a
partition of the leaf labels X into a minimum number of blocks S_1, S_2, ...
S_k such that for each i, the subtrees induced in T_1 and T_2 by S_i are
isomorphic up to suppression of degree-2 nodes and taking the labels X into
account. Unlike the earlier introduced maximum agreement forest (MAF) model,
the subtrees induced by the S_i are allowed to overlap. We prove that it is
NP-hard to compute MRAF, by reducing from the problem of partitioning a
permutation into a minimum number of monotonic subsequences (PIMS).
Furthermore, we show that MRAF has a polynomial time O(log n)-approximation
algorithm where n=|X| and permits exact algorithms with single-exponential
running time. When at least one of the two input trees has a caterpillar
topology, we prove that testing whether a MRAF has size at most k can be
answered in polynomial time when k is fixed. We also note that on two
caterpillars the approximability of MRAF is related to that of PIMS. Finally,
we establish a number of bounds on MRAF, compare its behaviour to MAF both in
theory and in an experimental setting and discuss a number of open problems.
Comment: 14 pages plus appendi
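The PIMS problem used in the hardness reduction can be illustrated on small inputs. The sketch below finds a minimum partition of a permutation into monotone (strictly increasing or strictly decreasing) subsequences by exhaustive search; it runs in exponential time, is intended only to make the problem statement concrete, and is not the reduction or any algorithm from the paper. The names `is_monotone` and `min_monotone_partition` are hypothetical.

```python
from typing import List, Optional, Sequence


def is_monotone(seq: Sequence[int]) -> bool:
    """True if seq is strictly increasing or strictly decreasing."""
    inc = all(x < y for x, y in zip(seq, seq[1:]))
    dec = all(x > y for x, y in zip(seq, seq[1:]))
    return inc or dec


def min_monotone_partition(perm: Sequence[int]) -> List[List[int]]:
    """Smallest partition of perm into monotone subsequences (brute force)."""

    def extend(i: int, blocks: List[List[int]], limit: int) -> Optional[List[List[int]]]:
        if i == len(perm):
            return [list(b) for b in blocks]
        for b in blocks:  # try appending perm[i] to an existing block
            b.append(perm[i])
            result = extend(i + 1, blocks, limit) if is_monotone(b) else None
            b.pop()
            if result is not None:
                return result
        if len(blocks) < limit:  # otherwise open a new block if allowed
            blocks.append([perm[i]])
            result = extend(i + 1, blocks, limit)
            blocks.pop()
            return result
        return None

    # Try 1 block, then 2, and so on; the first success is optimal.
    for limit in range(1, len(perm) + 1):
        result = extend(0, [], limit)
        if result is not None:
            return result
    return []


print(len(min_monotone_partition([3, 1, 4, 2, 5])))
# → 2
```

Unlike the blocks of an MRAF, the subsequences here are disjoint by construction; the correspondence between the two problems is established in the paper and is not reproduced by this sketch.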