SPRIT: Identifying horizontal gene transfer in rooted phylogenetic trees
Background: Phylogenetic trees based on sequences from a set of taxa can be incongruent due to horizontal gene transfer (HGT). By identifying the HGT events, we can reconcile the gene trees and derive a taxon tree that adequately represents the species' evolutionary history. One HGT can be represented by a rooted Subtree Prune and Regraft (rSPR) operation, and the number of rSPRs separating two trees corresponds to the minimum number of HGT events. Identifying the minimum number of rSPRs separating two trees is NP-hard, but the problem is fixed-parameter tractable. A number of heuristic approaches and two exact approaches to identifying the minimum number of rSPRs have been proposed. The implementation presented here is the first to deliver an exact solution together with the intermediate trees connecting the input trees.
Results: We present the SPR Identification Tool (SPRIT), a novel algorithm that solves the fixed-parameter tractable minimum rSPR problem, and its GPL-licensed Java implementation. The algorithm can be used in two ways: an exhaustive search that guarantees the minimum rSPR distance, and a heuristic approach that guarantees finding a solution, but not necessarily the minimum one. We benchmarked SPRIT against other software in two settings: small to medium-sized trees (five to one hundred taxa) and large trees (thousands of taxa). In the small to medium setting with random artificial incongruence, SPRIT's heuristic mode outperforms the other software by always delivering a solution with a low overestimation of the rSPR distance. In the large tree setting, SPRIT compares well to the alternatives when benchmarked on finding a minimum solution within a reasonable time. SPRIT presents both the minimum rSPR distance and the intermediate trees.
Conclusions: When used in exhaustive search mode, SPRIT identifies the minimum number of rSPRs needed to reconcile two incongruent rooted trees. SPRIT also performs quick approximations of the minimum rSPR distance, which are comparable to, and often better than, purely heuristic solutions. Put together, SPRIT is an excellent tool for identifying HGT events and pinpointing which taxa have been involved in HGT.
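As a minimal illustration of what "incongruent" means for two rooted trees, the sketch below compares the clusters (sets of leaves below each internal node) of two trees. It is not SPRIT's algorithm and does not compute an rSPR distance; the nested-tuple tree encoding and the function name `clusters` are assumptions made for this example only.

```python
def clusters(tree):
    """Return the clusters (leaf sets below each internal node) of a rooted tree.

    Assumed encoding: a leaf is any non-tuple label, an internal node is a
    tuple of its children.
    """
    found = set()

    def leaves(node):
        if not isinstance(node, tuple):
            return frozenset([node])
        below = frozenset().union(*(leaves(child) for child in node))
        found.add(below)
        return below

    leaves(tree)
    return found


# Two rooted trees on the same taxa that disagree on the placement of "b" and "c".
tree_one = ((("a", "b"), "c"), "d")
tree_two = ((("a", "c"), "b"), "d")

# Clusters present in exactly one of the two trees witness the incongruence.
conflicting = clusters(tree_one) ^ clusters(tree_two)
print(sorted(sorted(cluster) for cluster in conflicting))
```

Two trees on the same taxa are congruent in this sense exactly when the set of conflicting clusters is empty.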
VaiPhy: a Variational Inference Based Algorithm for Phylogeny
Phylogenetics is a classical methodology in computational biology that today
has become highly relevant for medical investigation of single-cell data, e.g.,
in the context of cancer development. The exponential size of the tree space
is, unfortunately, a substantial obstacle for Bayesian phylogenetic inference
using Markov chain Monte Carlo based methods since these rely on local
operations. And although more recent variational inference (VI) based methods
offer speed improvements, they rely on expensive auto-differentiation
operations for learning the variational parameters. We propose VaiPhy, a
remarkably fast VI based algorithm for approximate posterior inference in an
augmented tree space. VaiPhy produces marginal log-likelihood estimates on par
with the state-of-the-art methods on real data and is considerably faster since
it does not require auto-differentiation. Instead, VaiPhy combines coordinate
ascent update equations with two novel sampling schemes: (i) SLANTIS, a
proposal distribution for tree topologies in the augmented tree space, and (ii)
the JC sampler, to the best of our knowledge, the first-ever scheme for
sampling branch lengths directly from the popular Jukes-Cantor model. We
compare VaiPhy in terms of density estimation and runtime. Additionally, we
evaluate the reproducibility of the baselines. We provide our code on GitHub:
https://github.com/Lagergren-Lab/VaiPhy. Comment: NeurIPS-22 conference paper
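For readers unfamiliar with the Jukes-Cantor model mentioned above, the following sketch computes its standard transition probabilities for a branch of a given length. It shows the model only; it is not the JC sampler or any other part of VaiPhy, and the function name `jc69_transition_probs` is hypothetical.

```python
import math


def jc69_transition_probs(branch_length):
    """Return (p_same, p_diff) under the Jukes-Cantor (JC69) model.

    branch_length is in expected substitutions per site.
    p_same: probability that a site carries the same nucleotide at both ends.
    p_diff: probability of one specific different nucleotide.
    """
    decay = math.exp(-4.0 * branch_length / 3.0)
    p_same = 0.25 + 0.75 * decay
    p_diff = 0.25 - 0.25 * decay
    return p_same, p_diff


p_same, p_diff = jc69_transition_probs(0.1)
# One "same" outcome plus three "different" outcomes should sum to one.
print(p_same, p_diff, p_same + 3 * p_diff)
```

As the branch length grows, both probabilities approach 1/4, the uniform stationary distribution of the model.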
A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees
Reticulate events play an important role in determining evolutionary
relationships. The problem of computing the minimum number of such events to
explain discordance between two phylogenetic trees is a hard computational
problem. Even for binary trees, exact solvers struggle to solve instances with
reticulation number larger than 40-50. Here we present CycleKiller and
NonbinaryCycleKiller, the first methods to produce solutions verifiably close
to optimality for instances with hundreds or even thousands of reticulations.
Using simulations, we demonstrate that these algorithms run quickly for large
and difficult instances, producing solutions that are very close to optimality.
As a spin-off from our simulations we also present TerminusEst, which is the
fastest exact method currently available that can handle nonbinary trees: this
is used to measure the accuracy of the NonbinaryCycleKiller algorithm. All
three methods are based on extensions of previous theoretical work and are
publicly available. We also apply our methods to real data.
On unrooted and root-uncertain variants of several well-known phylogenetic network problems
The hybridization number problem requires us to embed a set of binary rooted
phylogenetic trees into a binary rooted phylogenetic network such that the
number of nodes with indegree two is minimized. However, from a biological
point of view accurately inferring the root location in a phylogenetic tree is
notoriously difficult and poor root placement can artificially inflate the
hybridization number. To this end we study a number of relaxed variants of this
problem. We start by showing that the fundamental problem of determining
whether an unrooted phylogenetic network displays (i.e. embeds) an
unrooted phylogenetic tree is NP-hard. On the positive side we show
that this problem is FPT in reticulation number. In the rooted case the
corresponding FPT result is trivial, but here we require more subtle
argumentation. Next we show that the hybridization number problem for unrooted
networks (when given two unrooted trees) is equivalent to the problem of
computing the Tree Bisection and Reconnect (TBR) distance of the two unrooted
trees. In the third part of the paper we consider the "root uncertain" variant
of hybridization number. Here we are free to choose the root location in each
of a set of unrooted input trees such that the hybridization number of the
resulting rooted trees is minimized. On the negative side we show that this
problem is APX-hard. On the positive side, we show that the problem is FPT in
the hybridization number, via kernelization, for any number of input trees. Comment: 28 pages, 8 figures
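The quantity being minimized, the number of nodes with indegree two in a binary rooted network, can be counted directly. The sketch below assumes the network is given as a list of (parent, child) edges; the encoding, the example network, and the function name `count_reticulations` are illustrative assumptions, not taken from the paper.

```python
from collections import Counter


def count_reticulations(edges):
    """Count nodes with indegree two in a network given as (parent, child) pairs."""
    indegree = Counter(child for _, child in edges)
    return sum(1 for degree in indegree.values() if degree == 2)


# A small binary rooted network on leaves a, b, c with one reticulation node h.
network = [
    ("root", "u"), ("root", "v"),
    ("u", "a"), ("u", "h"),
    ("v", "h"), ("v", "c"),
    ("h", "b"),
]
print(count_reticulations(network))  # 1
```

This only evaluates a given network; finding a network that displays the input trees with the fewest such nodes is the hard problem the abstract discusses.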
A tight kernel for computing the tree bisection and reconnection distance between two phylogenetic trees
In 2001 Allen and Steel showed that, if subtree and chain reduction rules
have been applied to two unrooted phylogenetic trees, the reduced trees will
have at most 28k taxa where k is the TBR (Tree Bisection and Reconnection)
distance between the two trees. Here we reanalyse Allen and Steel's
kernelization algorithm and prove that the reduced instances will in fact have
at most 15k-9 taxa. Moreover we show, by describing a family of instances which
have exactly 15k-9 taxa after reduction, that this new bound is tight. These
instances also have no common clusters, showing that a third
commonly-encountered reduction rule, the cluster reduction, cannot further
reduce the size of the kernel in the worst case. To achieve these results we
introduce and use "unrooted generators" which are analogues of rooted
structures that have appeared earlier in the phylogenetic networks literature.
Using similar argumentation we show that, for the minimum hybridization problem
on two rooted trees, 9k-2 is a tight bound (when subtree and chain reduction
rules have been applied) and 9k-4 is a tight bound (when, additionally, the
cluster reduction has been applied) on the number of taxa, where k is the
hybridization number of the two trees. Comment: One figure added, two small typos fixed. This version to appear in
SIDMA (SIAM Journal on Discrete Mathematics)
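To make the size of the improvement concrete, the bounds quoted in the abstract can be evaluated for a sample value of k. The sketch below simply encodes the four formulas stated above; the function names are hypothetical and k is assumed to be a positive integer.

```python
def tbr_kernel_bounds(k):
    """Bounds on the number of taxa after subtree and chain reduction (TBR distance k)."""
    return {"allen_steel_2001": 28 * k, "tight": 15 * k - 9}


def hybridization_kernel_bounds(k):
    """Tight bounds on the number of taxa for two rooted trees (hybridization number k)."""
    return {"subtree_chain": 9 * k - 2, "with_cluster": 9 * k - 4}


print(tbr_kernel_bounds(10))            # {'allen_steel_2001': 280, 'tight': 141}
print(hybridization_kernel_bounds(10))  # {'subtree_chain': 88, 'with_cluster': 86}
```

For k = 10, the reanalysis lowers the guaranteed TBR kernel size from 280 to 141 taxa.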
Cuts and Partitions in Graphs/Trees with Applications
Both the maximum agreement forest problem and the multicut on trees problem are NP-hard, and thus cannot be solved efficiently if P ≠ NP. The maximum agreement forest problem was motivated by the study of evolutionary trees in bioinformatics: we are given two leaf-labeled trees and are asked to find a maximum forest that is a subgraph of both trees. The multicut on trees problem has applications in networks: we are given a forest and a set of pairs of terminals and are asked to find a cut that separates all pairs of terminals.
We develop combinatorial and algorithmic techniques that lead to improved parameterized algorithms, approximation algorithms, and kernelization algorithms for these problems. For the maximum agreement forest problem, we proceed from the bottommost level of trees and extend solutions to whole trees. With this technique, we show that the maximum agreement forest problem is fixed-parameter tractable in general trees, resolving an open problem in this area. We also provide the first constant-ratio approximation algorithm for the problem in general trees. For the multicut on trees problem, we take a new look at the problem through the eyes of the vertex cover problem. This connection allows us to develop a kernelization algorithm for the problem, which gives an upper bound of O(k^3) on the kernel size, significantly improving the previous best upper bound of O(k^6). We further exploit this connection to give a parameterized algorithm for the problem that runs in time O*(1.62^k), improving the previous best algorithm with running time O*(2^k). In the protein complex prediction problem, which comes directly from the study of bioinformatics, we are given a protein-protein interaction network and are asked to find dense regions in this graph. We formulate this problem as a graph clustering problem and develop an algorithm to refine the results for identifying protein complexes. We test our algorithm on yeast protein-protein interaction networks, and we show that our algorithm is able to identify complexes more accurately than other existing algorithms.
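The multicut on trees problem is easy to state as a check: remove the chosen edges and confirm that no terminal pair remains connected. The sketch below verifies a candidate cut with a union-find structure. It is only a feasibility checker under an assumed encoding (nodes numbered from 0, undirected edges as pairs), not the kernelization or parameterized algorithm described above, and the name `is_multicut` is hypothetical.

```python
def is_multicut(num_nodes, edges, cut, terminal_pairs):
    """Return True if removing `cut` from the forest separates every terminal pair."""
    parent = list(range(num_nodes))

    def find(x):
        # Follow parent pointers to the component representative, halving paths.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    removed = {frozenset(edge) for edge in cut}
    for u, v in edges:
        if frozenset((u, v)) not in removed:
            parent[find(u)] = find(v)  # merge the components joined by a kept edge

    return all(find(s) != find(t) for s, t in terminal_pairs)


# A path 0 - 1 - 2 - 3 with terminal pairs (0, 2) and (1, 3).
path_edges = [(0, 1), (1, 2), (2, 3)]
pairs = [(0, 2), (1, 3)]
print(is_multicut(4, path_edges, [(1, 2)], pairs))  # True
print(is_multicut(4, path_edges, [(0, 1)], pairs))  # False
```

In the example, cutting the middle edge separates both pairs with a single edge, whereas cutting the first edge leaves nodes 1 and 3 connected.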
Phylogenetic incongruence through the lens of Monadic Second Order logic
Within the field of phylogenetics there is growing interest in measures for summarising the dissimilarity, or incongruence, of two or more phylogenetic trees. Many of these measures are NP-hard to compute and this has stimulated a considerable volume of research into fixed parameter tractable algorithms. In this article we use Monadic Second Order logic to give alternative, compact proofs of fixed parameter tractability for several well-known incongruence measures. In doing so we wish to demonstrate the considerable potential of MSOL - machinery still largely unknown outside the algorithmic graph theory community - within phylogenetics. A crucial component of this work is the observation that many measures, when bounded, imply the existence of an agreement forest of bounded size, which in turn implies that an auxiliary graph structure, the display graph, has bounded treewidth. It is this bound on treewidth that makes the machinery of MSOL available for proving fixed parameter tractability.