25 research outputs found

    A practical approximation algorithm for solving massive instances of hybridization number for binary and nonbinary trees

    Get PDF
    Reticulate events play an important role in determining evolutionary relationships. The problem of computing the minimum number of such events to explain discordance between two phylogenetic trees is a hard computational problem. Even for binary trees, exact solvers struggle to solve instances with reticulation number larger than 40-50. Here we present CycleKiller and NonbinaryCycleKiller, the first methods to produce solutions verifiably close to optimality for instances with hundreds or even thousands of reticulations. Using simulations, we demonstrate that these algorithms run quickly for large and difficult instances, producing solutions that are very close to optimality. As a spin-off from our simulations we also present TerminusEst, which is the fastest exact method currently available that can handle nonbinary trees: this is used to measure the accuracy of the NonbinaryCycleKiller algorithm. All three methods are based on extensions of previous theoretical work and are publicly available. We also apply our methods to real data

    On unrooted and root-uncertain variants of several well-known phylogenetic network problems

    Get PDF
    The hybridization number problem requires us to embed a set of binary rooted phylogenetic trees into a binary rooted phylogenetic network such that the number of nodes with indegree two is minimized. However, from a biological point of view accurately inferring the root location in a phylogenetic tree is notoriously difficult and poor root placement can artificially inflate the hybridization number. To this end we study a number of relaxed variants of this problem. We start by showing that the fundamental problem of determining whether an \emph{unrooted} phylogenetic network displays (i.e. embeds) an \emph{unrooted} phylogenetic tree, is NP-hard. On the positive side we show that this problem is FPT in reticulation number. In the rooted case the corresponding FPT result is trivial, but here we require more subtle argumentation. Next we show that the hybridization number problem for unrooted networks (when given two unrooted trees) is equivalent to the problem of computing the Tree Bisection and Reconnect (TBR) distance of the two unrooted trees. In the third part of the paper we consider the "root uncertain" variant of hybridization number. Here we are free to choose the root location in each of a set of unrooted input trees such that the hybridization number of the resulting rooted trees is minimized. On the negative side we show that this problem is APX-hard. On the positive side, we show that the problem is FPT in the hybridization number, via kernelization, for any number of input trees.Comment: 28 pages, 8 Figure

    Better Practical Algorithms for rSPR Distance and Hybridization Number

    Get PDF
    The problem of computing the rSPR distance of two phylogenetic trees (denoted by RDC) is NP-hard and so is the problem of computing the hybridization number of two phylogenetic trees (denoted by HNC). Since they are important problems in phylogenetics, they have been studied extensively in the literature. Indeed, quite a number of exact or approximation algorithms have been designed and implemented for them. In this paper, we design and implement one exact algorithm for HNC and several approximation algorithms for RDC and HNC. Our experimental results show that the resulting exact program is much faster (namely, more than 80 times faster for the easiest dataset used in the experiments) than the previous best and its superiority in speed becomes even more significant for more difficult instances. Moreover, the resulting approximation programs output much better results than the previous bests; indeed, the outputs are always nearly optimal and often optimal. Of particular interest is the usage of the Monte Carlo tree search (MCTS) method in the design of our approximation algorithms. Our experimental results show that with MCTS, we can often solve HNC exactly within short time

    Computing hybridization networks using agreement forests

    Get PDF
    Rooted phylogenetic trees are widely used in biology to represent the evolutionary history of certain species. Usually, such a tree is a simple binary tree only containing internal nodes of in-degree one and out-degree two representing specific speciation events. In applied phylogenetics, however, trees can contain nodes of out-degree larger than two because, often, in order to resolve some orderings of speciation events, there is only insufficient information available and the common way to model this uncertainty is to use nonbinary nodes (i.e., nodes of out-degree of at least three), also denoted as polytomies. Moreover, in addition to such speciation events, there exist certain biological events that cannot be modeled by a tree and, thus, require the more general concept of rooted phylogenetic networks or, more specifically, of hybridization networks. Examples for such reticulate events are horizontal gene transfer, hybridization, and recombination. Nevertheless, in order to construct hybridization networks, the less general concept of a phylogenetic tree can still be used as building block. More precisely, often, in a first step, phylogenetic trees for a set of species, each based on a distinctive orthologous gene, are constructed. In a second step, specific sets containing common subtrees of those trees, known as maximum acyclic agreement forests, are calculated, which are then glued together to a single hybridization network. In such a network, hybridization nodes (i.e., nodes of in-degree larger than or equal to two) can exist representing potential reticulate events of the underlying evolutionary history. As such events are considered as rare phenomena, from a biological point of view, especially those networks representing a minimum number of reticulate events, which is denoted as hybridization number, are of high interest. Consequently, in a mathematical aspect, the problem of calculating hybridization networks can be briefly described as follows. Given a set T of rooted phylogenetic trees sharing the same set of taxa, compute a hybridization network N displaying T with minimum hybridization number. In this context, we say that such a network N displays a phylogenetic tree T, if we can obtain T from N by removing as well as contracting some of its nodes and edges. Unfortunately, this is a computational hard problem (i.e., it is NP-hard), even for the simplest case given just two binary input trees. In this thesis, we present several methods tackling this NP-hard problem. Our first approach describes how to compute a representative set of minimum hybridization networks for two binary input trees. For that purpose, our approach implements the first non-naive algorithm - called allMAAFs - calculating all maximum acyclic agreement forests for two rooted binary phylogenetic trees on the same set of taxa. In a subsequent step, in order to maximize the efficiency of the algorithm allMAAFs, we have developed additionally several modifications each reducing the number of computational steps and, thus, significantly improving its practical runtime. Our second approach is an extension of our first approach making the underlying algorithm accessible to more than two binary input trees. For this purpose, our approach implements the algorithm allHNetworks being the first algorithm calculating all relevant hybridization networks displaying a set of rooted binary phylogenetic trees on the same set of taxa, which is a preferable feature when studying hybridization events. Lastly, we have developed a generalization of our second approach that can now deal with multiple nonbinary input trees. For that purpose, our approach implements the first non-naive algorithm - called allMulMAAFs - calculating a relevant set of nonbinary maximum acyclic agreement forests for two rooted (nonbinary) phylogenetic trees on the same set of taxa. Each of the algorithms above is integrated into our user friendly Java-based software package Hybroscale, which is freely available and platform independent, so that it runs on all major operating systems. Our program provides a graphical user interface for visualizing trees and networks. Moreover, it facilitates the interpretation of computed hybridization networks by adding specific features to its graphical representation and, thus, supports biologists in investigating reticulate evolution. In addition, we have implemented a method using a user friendly SQL-style modeling language for filtering the usually large amount of reported networks

    Computing hybridization networks using agreement forests

    Get PDF
    Rooted phylogenetic trees are widely used in biology to represent the evolutionary history of certain species. Usually, such a tree is a simple binary tree only containing internal nodes of in-degree one and out-degree two representing specific speciation events. In applied phylogenetics, however, trees can contain nodes of out-degree larger than two because, often, in order to resolve some orderings of speciation events, there is only insufficient information available and the common way to model this uncertainty is to use nonbinary nodes (i.e., nodes of out-degree of at least three), also denoted as polytomies. Moreover, in addition to such speciation events, there exist certain biological events that cannot be modeled by a tree and, thus, require the more general concept of rooted phylogenetic networks or, more specifically, of hybridization networks. Examples for such reticulate events are horizontal gene transfer, hybridization, and recombination. Nevertheless, in order to construct hybridization networks, the less general concept of a phylogenetic tree can still be used as building block. More precisely, often, in a first step, phylogenetic trees for a set of species, each based on a distinctive orthologous gene, are constructed. In a second step, specific sets containing common subtrees of those trees, known as maximum acyclic agreement forests, are calculated, which are then glued together to a single hybridization network. In such a network, hybridization nodes (i.e., nodes of in-degree larger than or equal to two) can exist representing potential reticulate events of the underlying evolutionary history. As such events are considered as rare phenomena, from a biological point of view, especially those networks representing a minimum number of reticulate events, which is denoted as hybridization number, are of high interest. Consequently, in a mathematical aspect, the problem of calculating hybridization networks can be briefly described as follows. Given a set T of rooted phylogenetic trees sharing the same set of taxa, compute a hybridization network N displaying T with minimum hybridization number. In this context, we say that such a network N displays a phylogenetic tree T, if we can obtain T from N by removing as well as contracting some of its nodes and edges. Unfortunately, this is a computational hard problem (i.e., it is NP-hard), even for the simplest case given just two binary input trees. In this thesis, we present several methods tackling this NP-hard problem. Our first approach describes how to compute a representative set of minimum hybridization networks for two binary input trees. For that purpose, our approach implements the first non-naive algorithm - called allMAAFs - calculating all maximum acyclic agreement forests for two rooted binary phylogenetic trees on the same set of taxa. In a subsequent step, in order to maximize the efficiency of the algorithm allMAAFs, we have developed additionally several modifications each reducing the number of computational steps and, thus, significantly improving its practical runtime. Our second approach is an extension of our first approach making the underlying algorithm accessible to more than two binary input trees. For this purpose, our approach implements the algorithm allHNetworks being the first algorithm calculating all relevant hybridization networks displaying a set of rooted binary phylogenetic trees on the same set of taxa, which is a preferable feature when studying hybridization events. Lastly, we have developed a generalization of our second approach that can now deal with multiple nonbinary input trees. For that purpose, our approach implements the first non-naive algorithm - called allMulMAAFs - calculating a relevant set of nonbinary maximum acyclic agreement forests for two rooted (nonbinary) phylogenetic trees on the same set of taxa. Each of the algorithms above is integrated into our user friendly Java-based software package Hybroscale, which is freely available and platform independent, so that it runs on all major operating systems. Our program provides a graphical user interface for visualizing trees and networks. Moreover, it facilitates the interpretation of computed hybridization networks by adding specific features to its graphical representation and, thus, supports biologists in investigating reticulate evolution. In addition, we have implemented a method using a user friendly SQL-style modeling language for filtering the usually large amount of reported networks

    On Unrooted and Root-Uncertain Variants of Several Well-Known Phylogenetic Network Problems

    Get PDF
    International audienceThe hybridization number problem requires us to embed a set of binary rooted phylogenetic trees into a binary rooted phylogenetic network such that the number of nodes with indegree two is minimized. However, from a biological point of view accurately inferring the root location in a phylogenetic tree is notoriously difficult and poor root placement can artificially inflate the hybridization number. To thisend we study a number of relaxed variants of this problem. We start by showing that the fundamental problem of determining whether an unrooted phylogenetic network displays (i.e. embeds) an unrooted phylogenetic tree, is NP-hard. On the positive side we show that this problem is FPT in reticulation number. In the rooted case the corresponding FPT result is trivial, but here we require more subtle argumentation. Next we show that the hybridization number problem for unrooted networks (when given two unrooted trees) is equivalent to the problem of computing the tree bisection and reconnect distance of the two unrooted trees. In the third part of the paper we consider the “root uncertain” variant of hybridization number. Here we are free to choose the root location in each of a set of unrooted input trees such that the hybridization number of the resulting rooted trees is minimized. On the negative side we show that this problem is APX-hard. On the positive side, we show that the problem is FPT in the hybridization number, via kernelization, for any number of input trees

    Cophylogenetic analysis of dated trees

    Get PDF
    Parasites and the associations they form with their hosts is an important area of research due to the associated health risks which parasites pose to the human population. The associations parasites form with their hosts are responsible for a number of the worst emerging diseases impacting global health today, including Ebola, HIV, and malaria. Macro-scale coevolutionary research aims to analyse these associations to provide further insights into these deadly diseases. This approach, first considered by Fahrenholz in 1913, has been applied to hundreds of coevolutionary systems and remains the most robust means to infer the underlying relationships which form between coevolving species. While reconciling the coevolutionary relationships between a pair of evolutionary systems is NP-Hard, it has been shown that if dating information exists there is a polynomial solution. These solutions however are computationally expensive, and are quickly becoming infeasible due to the rapid growth of phylogenetic data. If the rate of growth continues in line with the last three decades, the current means for analysing dated systems will become computationally infeasible. Within this thesis a collection of algorithms are introduced which aim to address this problem. This includes the introduction of the most efficient solution for analysing dated coevolutionary systems optimally, along with two linear time heuristics which may be applied where traditional algorithms are no longer feasible, while still offering a high degree of accuracy 91%. Finally, this work integrates these incremental results into a single model which is able to handle widespread parasitism, the case where parasites infect multiple hosts. This proposed model reconciles two competing theories of widespread parasitism, while also providing an accuracy improvement of 21%, one of the largest single improvements provided in this field to date. As such, the set of algorithms introduced within this thesis offers another step toward a unified coevolutionary analysis framework, consistent with Fahrenholz original coevolutionary analysis model
    corecore