    On unrooted and root-uncertain variants of several well-known phylogenetic network problems

    The hybridization number problem requires us to embed a set of binary rooted phylogenetic trees into a binary rooted phylogenetic network such that the number of nodes with indegree two is minimized. However, from a biological point of view, accurately inferring the root location in a phylogenetic tree is notoriously difficult, and poor root placement can artificially inflate the hybridization number. To this end we study a number of relaxed variants of this problem. We start by showing that the fundamental problem of determining whether an unrooted phylogenetic network displays (i.e. embeds) an unrooted phylogenetic tree is NP-hard. On the positive side we show that this problem is FPT in reticulation number. In the rooted case the corresponding FPT result is trivial, but here we require more subtle argumentation. Next we show that the hybridization number problem for unrooted networks (when given two unrooted trees) is equivalent to the problem of computing the Tree Bisection and Reconnect (TBR) distance of the two unrooted trees. In the third part of the paper we consider the "root uncertain" variant of hybridization number. Here we are free to choose the root location in each of a set of unrooted input trees such that the hybridization number of the resulting rooted trees is minimized. On the negative side we show that this problem is APX-hard. On the positive side, we show that the problem is FPT in the hybridization number, via kernelization, for any number of input trees. Comment: 28 pages, 8 figures.
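
    The search space of the "root uncertain" variant can be illustrated with a minimal sketch that enumerates every rooting of one unrooted tree by placing a root on each edge. The function name `rootings`, the adjacency-dictionary input, and the nested-tuple output are assumptions made for this illustration only; they are not taken from the paper, and the sketch does not compute a hybridization number.

```python
def rootings(adj):
    """Enumerate all rootings of an unrooted tree, one per edge.

    adj maps each node to the set of its neighbours; leaves have degree 1.
    Each rooted tree is returned as a canonically ordered nested tuple.
    (Hypothetical representation chosen for this sketch.)
    """
    def subtree(v, parent):
        # Everything hanging below v when we arrive from `parent`.
        kids = [w for w in adj[v] if w != parent]
        if not kids:
            return v  # v is a leaf
        return tuple(sorted((subtree(w, v) for w in kids), key=repr))

    # Each undirected edge appears once as a two-element frozenset.
    edges = {frozenset((u, v)) for u in adj for v in adj[u]}
    result = []
    for e in edges:
        u, v = tuple(e)
        # Subdivide edge {u, v} with a new root and orient away from it.
        result.append(tuple(sorted((subtree(u, v), subtree(v, u)), key=repr)))
    return result


# The quartet ab|cd has 5 edges, hence 5 rootings.
quartet = {
    "a": {"u"}, "b": {"u"}, "c": {"v"}, "d": {"v"},
    "u": {"a", "b", "v"}, "v": {"c", "d", "u"},
}
for rooted in sorted(rootings(quartet), key=repr):
    print(rooted)
```

    An unrooted binary tree on n leaves has 2n - 3 edges, so a naive search over all rootings of several input trees grows quickly, which is one reason a kernelization result is of interest.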

    Embedding Phylogenetic Trees in Networks of Low Treewidth

    Given a rooted, binary phylogenetic network and a rooted, binary phylogenetic tree, can the tree be embedded into the network? This problem, called Tree Containment, arises when validating networks constructed by phylogenetic inference methods. We present the first algorithm for (rooted) Tree Containment using the treewidth $t$ of the input network $N$ as parameter, showing that the problem can be solved in $2^{O(t^2)} \cdot |N|$ time and space.
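
    For orientation, rooted Tree Containment can also be decided by brute force: keep exactly one incoming arc per reticulation and compare the clusters of the resulting tree with those of the input tree. The sketch below follows that idea and is exponential in the number of reticulations; it is not the treewidth-based algorithm of the abstract. The names `clusters` and `displays` and the child-list dictionaries are assumptions for this example, and the tree is assumed to have the same leaf set as the network.

```python
from itertools import product


def clusters(children, root, leaves):
    """Set of non-empty leaf-label sets found below the nodes of a rooted tree."""
    found = set()

    def below(v):
        if v in leaves:
            c = frozenset([v])
        else:
            c = frozenset().union(*(below(w) for w in children.get(v, [])))
        if c:
            found.add(c)
        return c

    below(root)
    return found


def displays(net_children, root, leaves, tree_clusters):
    """Brute force: try every way of keeping one parent per reticulation."""
    parents = {}
    for u, kids in net_children.items():
        for v in kids:
            parents.setdefault(v, []).append(u)
    retics = [v for v, ps in parents.items() if len(ps) > 1]
    for choice in product(*(parents[r] for r in retics)):
        keep = dict(zip(retics, choice))
        # Drop the arc u -> v whenever v is a reticulation that kept another parent.
        switched = {u: [v for v in kids if keep.get(v, u) == u]
                    for u, kids in net_children.items()}
        if clusters(switched, root, leaves) == tree_clusters:
            return True
    return False


# Network with one reticulation h whose parents are x and y.
leaves = {"a", "b", "c"}
network = {"r": ["x", "y"], "x": ["a", "h"], "y": ["h", "c"], "h": ["b"]}
tree_ab_c = clusters({"t": ["u", "c"], "u": ["a", "b"]}, "t", leaves)
tree_ac_b = clusters({"t": ["u", "b"], "u": ["a", "c"]}, "t", leaves)
print(displays(network, "r", leaves, tree_ab_c))  # True
print(displays(network, "r", leaves, tree_ac_b))  # False
```

    Comparing cluster sets avoids explicitly deleting unlabelled dead ends and suppressing degree-two nodes, since those steps do not change the set of non-empty clusters.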

    A tight kernel for computing the tree bisection and reconnection distance between two phylogenetic trees

    In 2001 Allen and Steel showed that, if subtree and chain reduction rules have been applied to two unrooted phylogenetic trees, the reduced trees will have at most 28k taxa where k is the TBR (Tree Bisection and Reconnection) distance between the two trees. Here we reanalyse Allen and Steel's kernelization algorithm and prove that the reduced instances will in fact have at most 15k-9 taxa. Moreover we show, by describing a family of instances which have exactly 15k-9 taxa after reduction, that this new bound is tight. These instances also have no common clusters, showing that a third commonly-encountered reduction rule, the cluster reduction, cannot further reduce the size of the kernel in the worst case. To achieve these results we introduce and use "unrooted generators" which are analogues of rooted structures that have appeared earlier in the phylogenetic networks literature. Using similar argumentation we show that, for the minimum hybridization problem on two rooted trees, 9k-2 is a tight bound (when subtree and chain reduction rules have been applied) and 9k-4 is a tight bound (when, additionally, the cluster reduction has been applied) on the number of taxa, where k is the hybridization number of the two trees. Comment: One figure added, two small typos fixed. This version to appear in SIDMA (SIAM Journal on Discrete Mathematics).
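
    To make the size of the improvement concrete, the bounds quoted above can be tabulated for small k. The function names below are hypothetical and simply evaluate the formulas stated in the abstract; any restriction on the values of k for which the bounds hold is not given here and is not assumed.

```python
def allen_steel_bound(k):
    """Earlier bound on taxa after subtree and chain reduction: 28k."""
    return 28 * k


def tbr_kernel_bound(k):
    """Tight bound from the reanalysis: 15k - 9."""
    return 15 * k - 9


def hybridization_kernel_bound(k, cluster_reduced=False):
    """9k - 2 after subtree and chain reduction, 9k - 4 with cluster reduction."""
    return 9 * k - 4 if cluster_reduced else 9 * k - 2


for k in range(2, 6):
    print(k, allen_steel_bound(k), tbr_kernel_bound(k),
          hybridization_kernel_bound(k), hybridization_kernel_bound(k, True))
```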

    Treewidth of display graphs: bounds, brambles and applications

    Phylogenetic trees and networks are leaf-labelled graphs used to model evolution. Display graphs are created by identifying common leaf labels in two or more phylogenetic trees or networks. The treewidth of such graphs is bounded as a function of many common dissimilarity measures between phylogenetic trees and this has been leveraged in fixed parameter tractability results. Here we further elucidate the properties of display graphs and their interaction with treewidth. We show that it is NP-hard to recognize display graphs, but that display graphs of bounded treewidth can be recognized in linear time. Next we show that if a phylogenetic network displays (i.e. topologically embeds) a phylogenetic tree, the treewidth of their display graph is bounded by a function of the treewidth of the original network (and also by various other parameters). In fact, using a bramble argument we show that this treewidth bound is sharp up to an additive term of 1. We leverage this bound to give an FPT algorithm, parameterized by treewidth, for determining whether a network displays a tree, which is an intensively-studied problem in the field. We conclude with a discussion on the future use of display graphs and treewidth in phylogenetics
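
    The construction of a display graph described above (identify leaves that carry the same label) can be sketched as follows. The function `display_graph`, the edge-list input, and the tagged node names are assumptions for this illustration, not notation from the paper.

```python
from collections import defaultdict


def display_graph(trees):
    """Build the display graph of several leaf-labelled trees or networks.

    trees is a list of (edges, leaves) pairs: edges is an iterable of (u, v)
    pairs and leaves is the set of leaf labels. Internal nodes are kept apart
    by tagging them with the index of their tree; leaves with the same label
    are merged into a single node.
    """
    adj = defaultdict(set)
    for i, (edges, leaves) in enumerate(trees):
        for u, v in edges:
            a = ("leaf", u) if u in leaves else ("internal", i, u)
            b = ("leaf", v) if v in leaves else ("internal", i, v)
            adj[a].add(b)
            adj[b].add(a)
    return dict(adj)


# Two quartet trees on {a, b, c, d}: ab|cd and ac|bd.
leaves = {"a", "b", "c", "d"}
t1 = ([("u", "a"), ("u", "b"), ("u", "v"), ("v", "c"), ("v", "d")], leaves)
t2 = ([("u", "a"), ("u", "c"), ("u", "v"), ("v", "b"), ("v", "d")], leaves)
graph = display_graph([t1, t2])
print(len(graph))                   # 8 nodes: 4 shared leaves, 4 internal
print(len(graph[("leaf", "a")]))    # 2: each leaf is attached to both trees
```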

    The Emergence and Early Evolution of Biological Carbon-Fixation

    The fixation of CO2 into living matter sustains all life on Earth, and embeds the biosphere within geochemistry. The six known chemical pathways used by extant organisms for this function are recognized to have overlaps, but their evolution is incompletely understood. Here we reconstruct the complete early evolutionary history of biological carbon-fixation, relating all modern pathways to a single ancestral form. We find that innovations in carbon-fixation were the foundation for most major early divergences in the tree of life. These findings are based on a novel method that fully integrates metabolic and phylogenetic constraints. Comparing gene-profiles across the metabolic cores of deep-branching organisms and requiring that they are capable of synthesizing all their biomass components leads to the surprising conclusion that the most common form for deep-branching autotrophic carbon-fixation combines two disconnected sub-networks, each supplying carbon to distinct biomass components. One of these is a linear folate-based pathway of CO2 reduction previously only recognized as a fixation route in the complete Wood-Ljungdahl pathway, but which more generally may exclude the final step of synthesizing acetyl-CoA. Using metabolic constraints we then reconstruct a “phylometabolic” tree with a high degree of parsimony that traces the evolution of complete carbon-fixation pathways, and has a clear structure down to the root. This tree requires few instances of lateral gene transfer or convergence, and instead suggests a simple evolutionary dynamic in which all divergences have primary environmental causes. Energy optimization and oxygen toxicity are the two strongest forces of selection. The root of this tree combines the reductive citric acid cycle and the Wood-Ljungdahl pathway into a single connected network. This linked network lacks the selective optimization of modern fixation pathways but its redundancy leads to a more robust topology, making it more plausible than any modern pathway as a primitive universal ancestral form.

    Efficiency of Algorithms in Phylogenetics

    Phylogenetics is the study of evolutionary relationships between species. Phylogenetic trees have long been the standard object used in evolutionary biology to illustrate how a given set of species are related. There are some groups (including certain plant and fish species) for which the ancestral history contains reticulation events, caused by processes that include hybridization, lateral gene transfer, and recombination. For such groups of species, it is appropriate to represent their ancestral history by phylogenetic networks: rooted acyclic digraphs, where arcs represent lines of genetic inheritance and vertices of in-degree at least two represent reticulation events. This thesis is concerned with the efficiency, accuracy, and tractability of mathematical models for phylogenetic network methods. Three important and related measures for summarizing the dissimilarity in phylogenetic trees are the minimum number of hybridization events required to fit two phylogenetic trees onto a single phylogenetic network (the hybridization number), the (rooted) subtree prune and regraft distance (the rSPR distance) and the tree bisection and reconnection distance (the TBR distance) between two phylogenetic trees. The respective problems of computing these measures are known to be NP-hard, but also fixed-parameter tractable in their respective natural parameters. This means that, while they are hard to compute in general, for cases in which a parameter (here the hybridization number and rSPR/TBR distance, respectively) is small, the problem can be solved efficiently even for large input trees. 
Here, we present new analyses showing that the use of the “cluster reduction” rule – already defined for the hybridization number and the rSPR distance and introduced here for the TBR distance – can transform any O(f(p) · n)-time algorithm for any of these problems into an O(f(k) · n)-time one, where n is the number of leaves of the phylogenetic trees, p is the natural parameter and k is a much stronger (that is, smaller) parameter: the minimum level of a phylogenetic network displaying both trees. These results appear in [9]. Traditional “distance based methods” reconstruct a phylogenetic tree from a matrix of pairwise distances between taxa. A phylogenetic network is a generalization of a phylogenetic tree that can describe evolutionary events such as reticulation and hybridization that are not tree-like. Although evolution has been known to be more accurately modelled by a network than a tree for some time, only recently have efforts been made to directly reconstruct a phylogenetic network from sequence data, as opposed to reconstructing several trees first and then trying to combine them into a single coherent network. In this work, we present a generalisation of the UPGMA algorithm for ultrametric tree reconstruction which can accurately reconstruct ultrametric tree-child networks from the set of distinct distances between each pair of taxa. This result will also appear in [15]. Moreover, we analyse the safety radius of the NETWORKUPGMA algorithm and show that it has safety radius 1/2. This means that if we can obtain accurate estimates of the set of distances between each pair of taxa in an ultrametric tree-child network, then NETWORKUPGMA correctly reconstructs the true network
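
    As background for the generalisation mentioned above, here is a minimal sketch of the classical UPGMA algorithm for trees: repeatedly merge the two closest clusters and average their distances to the rest. It is not the NETWORKUPGMA algorithm of the thesis, whose details are not given in this abstract. The function name `upgma` and the frozenset-keyed distance dictionary are assumptions for this example.

```python
from itertools import combinations


def upgma(labels, dist):
    """Classical UPGMA on a distance dictionary.

    labels is a list of taxa; dist maps frozenset({a, b}) to a distance.
    Returns (tree, height), where tree is a nested tuple and height is the
    height of its root (half the distance between the last merged clusters).
    """
    # cluster id -> (subtree, number of taxa, height)
    clusters = {i: (lab, 1, 0.0) for i, lab in enumerate(labels)}
    d = {frozenset((i, j)): dist[frozenset((labels[i], labels[j]))]
         for i, j in combinations(range(len(labels)), 2)}
    next_id = len(labels)
    while len(clusters) > 1:
        pair = min(d, key=d.get)          # closest pair of clusters
        i, j = sorted(pair)
        (ti, ni, _), (tj, nj, _) = clusters.pop(i), clusters.pop(j)
        height = d.pop(pair) / 2
        for k in clusters:
            # Size-weighted average distance to the merged cluster.
            dik = d.pop(frozenset((i, k)))
            djk = d.pop(frozenset((j, k)))
            d[frozenset((next_id, k))] = (ni * dik + nj * djk) / (ni + nj)
        clusters[next_id] = ((ti, tj), ni + nj, height)
        next_id += 1
    (tree, _, height), = clusters.values()
    return tree, height


# Ultrametric example: a and b are close, c and d are close.
taxa = ["a", "b", "c", "d"]
distances = {frozenset(p): 8.0 for p in combinations(taxa, 2)}
distances[frozenset(("a", "b"))] = 2.0
distances[frozenset(("c", "d"))] = 4.0
print(upgma(taxa, distances))
```

    Ties between equally close pairs are broken by dictionary order here, which is an arbitrary choice made for this sketch.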

    Rearrangement operations on unrooted phylogenetic networks

    Rearrangement operations transform a phylogenetic tree into another one and hence induce a metric on the space of phylogenetic trees. Popular operations for unrooted phylogenetic trees are NNI (nearest neighbour interchange), SPR (subtree prune and regraft), and TBR (tree bisection and reconnection). Recently, these operations have been extended to unrooted phylogenetic networks, which are generalisations of phylogenetic trees that can model reticulated evolutionary relationships. Here, we study global and local properties of spaces of phylogenetic networks under these three operations. In particular, we prove connectedness and asymptotic bounds on the diameters of spaces of different classes of phylogenetic networks, including tree-based and level-k networks. We also examine the behaviour of shortest TBR-sequences between two phylogenetic networks in a class, and whether the TBR-distance changes if intermediate networks from other classes are allowed: for example, the space of phylogenetic trees is an isometric subgraph of the space of phylogenetic networks under TBR. Lastly, we show that computing the TBR-distance and the PR-distance of two phylogenetic networks is NP-hard.