    Near-optimal labeling schemes for nearest common ancestors

    We consider NCA labeling schemes: given a rooted tree $T$, label the nodes of $T$ with binary strings such that, given the labels of any two nodes, one can determine, by looking only at the labels, the label of their nearest common ancestor. For trees with $n$ nodes we present upper and lower bounds establishing that labels of size $(2\pm \epsilon)\log n$, $\epsilon<1$, are both sufficient and necessary. (All logarithms in this paper are in base 2.) Alstrup, Bille, and Rauhe (SIDMA'05) showed that ancestor and NCA labeling schemes have labels of size $\log n +\Omega(\log \log n)$. Our lower bound increases this to $\log n + \Omega(\log n)$ for NCA labeling schemes. Since Fraigniaud and Korman (STOC'10) established that labels in ancestor labeling schemes have size $\log n +\Theta(\log \log n)$, our new lower bound separates ancestor and NCA labeling schemes. Our upper bound improves the $10 \log n$ upper bound by Alstrup, Gavoille, Kaplan and Rauhe (TOCS'04), and our theoretical result even outperforms some recent experimental studies by Fischer (ESA'09) where variants of the same NCA labeling scheme are shown to all have labels of size approximately $8 \log n$.
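    To make the query model concrete, the following is a minimal sketch of a naive NCA labeling scheme, not the construction of the paper. It assumes each node is labeled by the sequence of child indices on its root path, so the NCA label is the longest common prefix of the two labels. The names `label_tree` and `nca_label` and the dictionary-based tree representation are illustrative assumptions, and these labels can grow linearly with the depth of the tree, far above the $(2\pm \epsilon)\log n$ bound.

```python
from typing import Dict, Hashable, List, Tuple

Label = Tuple[int, ...]


def label_tree(children: Dict[Hashable, List[Hashable]], root: Hashable) -> Dict[Hashable, Label]:
    """Label every node with the child indices along its path from the root."""
    labels: Dict[Hashable, Label] = {root: ()}
    stack = [root]
    while stack:
        node = stack.pop()
        for index, child in enumerate(children.get(node, [])):
            labels[child] = labels[node] + (index,)
            stack.append(child)
    return labels


def nca_label(a: Label, b: Label) -> Label:
    """Return the NCA label using only the two labels: their longest common prefix."""
    k = 0
    while k < len(a) and k < len(b) and a[k] == b[k]:
        k += 1
    return a[:k]


# Hypothetical example tree: r has children a and b; a has children c and d.
example_children = {"r": ["a", "b"], "a": ["c", "d"]}
example_labels = label_tree(example_children, "r")
```

    The query never touches the tree itself: `nca_label(example_labels["c"], example_labels["d"])` returns the label of `a`.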

    Cophenetic metrics for phylogenetic trees, after Sokal and Rohlf

    Phylogenetic tree comparison metrics are an important tool in the study of evolution, and hence the definition of such metrics is an interesting problem in phylogenetics. In a paper in Taxon fifty years ago, Sokal and Rohlf proposed to measure quantitatively the difference between a pair of phylogenetic trees by first encoding them by means of their half-matrices of cophenetic values, and then comparing these matrices. This idea has been used several times since then to define dissimilarity measures between phylogenetic trees but, to our knowledge, no proper metric on weighted phylogenetic trees with nested taxa based on this idea has been formally defined and studied yet. Actually, the cophenetic values of pairs of different taxa alone are not enough to single out phylogenetic trees with weighted arcs or nested taxa. In this paper we define a family of cophenetic metrics that compare phylogenetic trees on the same set of taxa by encoding them by means of their vectors of cophenetic values of pairs of taxa and depths of single taxa, and then computing the $L^p$ norm of the difference of the corresponding vectors. Then, we study, either analytically or numerically, some of their basic properties: neighbors, diameter, distribution, and their rank correlation with each other and with other metrics. Comment: The "authors' cut" of a paper published in BMC Bioinformatics 14:3 (2013). 46 pages.
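    The encoding described above can be sketched as follows, assuming that the cophenetic value of a pair of taxa is the depth of their lowest common ancestor and that the vector also contains the depth of each single taxon. The parent-map representation (node mapped to a `(parent, arc weight)` pair, root mapped to `None`) and all function names are illustrative assumptions, not taken from the paper.

```python
from itertools import combinations_with_replacement
from typing import Dict, Hashable, List, Optional, Sequence, Tuple

ParentMap = Dict[Hashable, Optional[Tuple[Hashable, float]]]


def root_path(parent: ParentMap, node: Hashable) -> List[Hashable]:
    """Nodes from `node` up to the root, both included."""
    path = [node]
    while parent[node] is not None:
        node = parent[node][0]
        path.append(node)
    return path


def depth(parent: ParentMap, node: Hashable) -> float:
    """Sum of arc weights on the path from the root to `node`."""
    total = 0.0
    while parent[node] is not None:
        node, weight = parent[node]
        total += weight
    return total


def cophenetic_vector(parent: ParentMap, taxa: Sequence[Hashable]) -> List[float]:
    """Depth of the LCA for every pair of taxa; a pair (a, a) gives the depth of a."""
    vector = []
    for a, b in combinations_with_replacement(taxa, 2):
        ancestors_of_a = set(root_path(parent, a))
        # The first node on b's root path that is also an ancestor of a is the LCA.
        lca = next(x for x in root_path(parent, b) if x in ancestors_of_a)
        vector.append(depth(parent, lca))
    return vector


def cophenetic_distance(t1: ParentMap, t2: ParentMap, taxa: Sequence[Hashable], p: float = 1.0) -> float:
    """L^p norm of the difference of the two cophenetic vectors."""
    v1, v2 = cophenetic_vector(t1, taxa), cophenetic_vector(t2, taxa)
    return sum(abs(x - y) ** p for x, y in zip(v1, v2)) ** (1.0 / p)


# Hypothetical unit-weight trees ((1,2),3) and (1,(2,3)) on taxa 1, 2, 3.
tree_a = {"r": None, "u": ("r", 1.0), 1: ("u", 1.0), 2: ("u", 1.0), 3: ("r", 1.0)}
tree_b = {"r": None, "v": ("r", 1.0), 1: ("r", 1.0), 2: ("v", 1.0), 3: ("v", 1.0)}
```

    Because taxa are just keys of the parent map, an internal node can also be listed in `taxa`, which is one way to model nested taxa in this sketch.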

    The generalized Robinson-Foulds distance for phylogenetic trees

    The Robinson-Foulds (RF) distance, one of the most widely used metrics for comparing phylogenetic trees, has the advantage of being intuitive, with a natural interpretation in terms of common splits, and it can be computed in linear time, but it has a very low resolution, and it may become trivial for phylogenetic trees with overlapping taxa, that is, phylogenetic trees that share some but not all of their leaf labels. In this article, we study the properties of the Generalized Robinson-Foulds (GRF) distance, a recently proposed metric for comparing any structures that can be described by multisets of multisets of labels, when applied to rooted phylogenetic trees with overlapping taxa, which are described by sets of clusters, that is, by sets of sets of labels. We show that the GRF distance has a very high resolution, it can also be computed in linear time, and it is not (uniformly) equivalent to the RF distance. This research was partially supported by the Spanish Ministry of Science, Innovation and Universities and the European Regional Development Fund through project PGC2018-096956-B-C43 (FEDER/MICINN/AEI), and by the Agency for Management of University and Research Grants (AGAUR) through grant 2017-SGR-786 (ALBCOM). Peer reviewed. Postprint (published version).
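    As a point of reference, the classical RF distance on rooted trees described by sets of clusters can be sketched as the size of the symmetric difference of the two cluster sets. This is the baseline RF distance, not the GRF distance studied in the article. The nested-tuple tree encoding and the function names are illustrative assumptions, trivial clusters are counted here although some conventions exclude them, and this straightforward version does not run in linear time.

```python
from typing import FrozenSet, Set


def clusters(tree) -> Set[FrozenSet[str]]:
    """Return the clusters of a nested-tuple tree: the leaf-label set below each node."""
    found: Set[FrozenSet[str]] = set()

    def visit(node) -> FrozenSet[str]:
        if isinstance(node, tuple):
            # Internal node: its cluster is the union of its children's clusters.
            leaves = frozenset().union(*(visit(child) for child in node))
        else:
            # Leaf: its cluster is the singleton containing its label.
            leaves = frozenset([node])
        found.add(leaves)
        return leaves

    visit(tree)
    return found


def rf_distance(t1, t2) -> int:
    """Number of clusters that appear in exactly one of the two trees."""
    return len(clusters(t1) ^ clusters(t2))


# Hypothetical examples: two trees on the same taxa, and one with overlapping taxa.
tree_1 = (("a", "b"), "c")
tree_2 = ("a", ("b", "c"))
tree_3 = (("a", "b"), "d")
```

    For `tree_1` and `tree_3`, which share only the labels `a` and `b`, every cluster containing `c` or `d` is counted as a difference, which illustrates how the classical distance behaves on overlapping taxa.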

    Algorithms for constructing more accurate and inclusive phylogenetic trees

    Despite the unprecedented outpouring of molecular sequence data in phylogenetics, the current understanding of the tree of life is still incomplete. The widespread applications of phylogenies, ranging from drug design to biodiversity conservation, repeatedly remind us of the need for more accurate and inclusive phylogenies. My thesis addresses some of the underlying challenges by presenting theoretical and empirical results, as well as algorithms for a range of phylogenetic optimization problems. In the first part of this thesis, I develop a heuristic method for the NP-hard unrooted Robinson-Foulds (RF) supertree problem, and show that it yields more accurate supertrees than those obtained from Matrix Representation with Parsimony (MRP) and the rooted RF heuristic. In the second part, I present an approach based on an RF distance measure (MulRF) for inferring a species tree from the input multi-copy gene trees, through a generalization of the RF distance to multi-labeled trees. Through simulation, I show that this approach, which is independent of gene tree discordance mechanisms, produces more accurate species trees than existing methods when incongruence is caused by gene tree error, duplications and losses, and/or lateral gene transfer. Next, I perform a simulation study to evaluate the performance of Gene Tree Parsimony (GTP) under the duplication and duplication-and-loss cost models and compare it to the MulRF method. The objective is to study the effects of various types of sampling (e.g., gene tree and sequence sampling), gene tree error, and duplication and loss rates on the accuracy of the phylogenetic estimates by GTP and MulRF. Next, I present efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence. Finally, I present NP-completeness proofs for two problems whose complexity was previously unknown.

    Algorithm design techniques for parameterized graph modification problems

    This thesis deals with the design of parameterized algorithms for graph modification problems such as Feedback Vertex Set, Multicut in Trees, Cluster Editing and Closest 3-Leaf Powers. It examines the applicability of four techniques for developing parameterized algorithms, namely data reduction, search trees, iterative compression and dynamic programming, to such graph modification problems.

    From trees to networks and back

    The evolutionary history of a set of species is commonly represented by a phylogenetic tree. Often, however, the data contain conflicting signals, which can be better represented by a more general structure, namely a phylogenetic network. Such networks allow the display of several alternative evolutionary scenarios simultaneously but this can come at the price of complex visual representations. Using so-called circular split networks reduces this complexity, because this type of network can always be visualized in the plane without any crossing edges. These circular split networks form the core of this thesis. We construct them, use them as a search space for minimum evolution trees and explore their properties. More specifically, we present a new method, called SuperQ, to construct a circular split network summarising a collection of phylogenetic trees that have overlapping leaf sets. Then, we explore the set of phylogenetic trees associated with a fixed circular split network, in particular using it as a search space for optimal trees. This set represents just a tiny fraction of the space of all phylogenetic trees, but we still find trees within it that compare quite favourably with those obtained by a leading heuristic, which uses tree edit operations for searching the whole tree space. In the last part, we advance our understanding of the set of phylogenetic trees associated with a circular split network. Specifically, we investigate the size of the so-called circular tree neighbourhood for the three tree edit operations, tree bisection and reconnection (tbr), subtree prune and regraft (spr) and nearest neighbour interchange (nni).

    A list of parameterized problems in bioinformatics

    In this report we present a list of problems that originated in bioinformatics. Our aim is to collect information on such problems that have been analyzed from the point of view of Parameterized Complexity. For every problem we give its definition and biological motivation together with known complexity results. Postprint (published version).

    LIPIcs, Volume 248, ISAAC 2022, Complete Volume

    On Reconfiguration Problems: Structure and Tractability

    Given an n-vertex graph G and two vertices s and t in G, determining whether there exists a path and computing the length of the shortest path between s and t are two of the most fundamental graph problems. In the classical battle of P versus NP or "easy" versus "hard", both of these problems are on the easy side. That is, they can be solved in poly(n) time, where poly is any polynomial function. But what if our input consisted of a 2^n-vertex graph? Of course, we can no longer assume G to be part of the input, as reading the input alone requires more than poly(n) time. Instead, we are given an oracle encoded using poly(n) bits and that can, in constant or poly(n) time, answer queries of the form "is u a vertex in G?" or "is there an edge between u and v?". Given such an oracle and two vertices of the 2^n-vertex graph, can we still determine if there is a path or compute the length of the shortest path between s and t in poly(n) time? A slightly different, but equally insightful, formulation of the question above is as follows. Given a set S of n objects, consider the graph R(S) which contains one vertex for each set in the power set of S, 2^S, and two vertices are adjacent in R(S) whenever the size of their symmetric difference is equal to one. Clearly, this graph contains 2^n vertices and can be easily encoded in poly(n) bits using the oracle described above. It is not hard to see that there exists a path between any two vertices of R(S). Moreover, computing the length of a shortest path can be accomplished in constant time; it is equal to the size of the symmetric difference of the two underlying sets. If the vertex set of R(S) were instead restricted to a subset of 2^S, both of our problems can become NP-complete or even PSPACE-complete. Therefore, another interesting question is whether we can determine what types of "restriction" on the vertex set of R(S) induce such variations in the complexity of the two problems.
These two seemingly artificial questions are in fact quite natural and appear in many practical and theoretical problems. In particular, these are exactly the types of questions asked under the reconfiguration framework, the main subject of this thesis. Under the reconfiguration framework, instead of finding a feasible solution to some instance I of a search problem Q, we are interested in structural and algorithmic questions related to the solution space of Q. Naturally, given some adjacency relation A defined over feasible solutions of Q, size of the symmetric difference being one such relation, the solution space can be represented using a graph R_Q(I). R_Q(I) contains one vertex for each feasible solution of Q on instance I and two vertices share an edge whenever their corresponding solutions are adjacent under A. An edge in R_Q(I) corresponds to a reconfiguration step, a walk in R_Q(I) is a sequence of such steps, a reconfiguration sequence, and R_Q(I) is a reconfiguration graph. Studying problems related to reconfiguration graphs has received considerable attention in recent literature, the most popular problem being to determine whether there exists a reconfiguration sequence between two given feasible solutions; for most NP-complete problems, this problem has been shown to be PSPACE-complete. The purpose of our work is to embark on a systematic investigation of the tractability and structural properties of such problems under both classical and parameterized complexity assumptions. Parameterized complexity is another framework which has become an essential tool for researchers in computational complexity during the last two decades or so and one of its main goals is to provide a better explanation of why some hard problems (in a classical sense) can be in fact much easier than others. 
Hence, we are interested in what separates the tractable instances from the intractable ones and the fixed-parameter tractable instances from the fixed-parameter intractable ones. It is clear from the generic definition of reconfiguration problems that several factors affect their complexity status. Our work aims at providing a finer classification of the complexity of reconfiguration problems with respect to some of these factors, including the definition of the adjacency relation A, structural properties of the input instance I, structural properties of the reconfiguration graph, and the length of a reconfiguration sequence. As most of these factors can be numerically quantified, we believe that the investigation of reconfiguration problems under both parameterized and classical complexity assumptions will help us further understand the boundaries between tractability and intractability. We consider reconfiguration problems related to Satisfiability, Coloring, Dominating Set, Vertex Cover, Independent Set, Feedback Vertex Set, and Odd Cycle Transversal, and provide lower bounds, polynomial-time algorithms, and fixed-parameter tractable algorithms. In doing so, we answer some of the questions left open in recent work and push the known boundaries between tractable and intractable even closer. As a byproduct of our initiating work on parameterized reconfiguration problems, we present a generic adaptation of parameterized complexity techniques which we believe can be used as a starting point for studying almost any such problem.
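The power-set graph R(S) and its restriction to feasible subsets can be illustrated with a small brute-force sketch. It assumes adjacency by symmetric difference of size one, as in the abstract, and runs a breadth-first search over the feasible subsets only. The function name `reconfiguration_distance`, the `feasible` predicate, and the example instance are illustrative assumptions. The search is exponential in the size of S, so it is useful only for tiny examples and is not one of the algorithms of the thesis.

```python
from collections import deque
from typing import Callable, FrozenSet, Hashable, Iterable, Optional


def reconfiguration_distance(
    ground: Iterable[Hashable],
    source: Iterable[Hashable],
    target: Iterable[Hashable],
    feasible: Callable[[FrozenSet], bool] = lambda subset: True,
) -> Optional[int]:
    """Length of a shortest reconfiguration sequence, or None if none exists."""
    elements = list(ground)
    start, goal = frozenset(source), frozenset(target)
    if not (feasible(start) and feasible(goal)):
        return None
    dist = {start: 0}
    queue = deque([start])
    while queue:
        current = queue.popleft()
        if current == goal:
            return dist[current]
        for x in elements:
            # Toggling one element gives a subset at symmetric difference one.
            neighbour = current ^ {x}
            if neighbour not in dist and feasible(neighbour):
                dist[neighbour] = dist[current] + 1
                queue.append(neighbour)
    return None


# Hypothetical restriction: non-empty independent sets of the path a - b - c.
PATH_EDGES = [("a", "b"), ("b", "c")]


def independent_nonempty(subset: FrozenSet) -> bool:
    """True for non-empty subsets containing no edge of the example path."""
    return len(subset) >= 1 and not any(u in subset and v in subset for u, v in PATH_EDGES)
```

Without a restriction the distance equals the size of the symmetric difference, for example 3 between {1, 2} and {2, 3, 4}. Under `independent_nonempty`, the set {b} cannot be reached from {a} at all, which shows how restricting the vertex set of R(S) can disconnect the graph.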