17 research outputs found

    Optimizing Phylogenetic Supertrees Using Answer Set Programming

    The supertree construction problem is about combining several phylogenetic trees with possibly conflicting information into a single tree that has all the leaves of the source trees as its leaves and whose relationships between the leaves are as consistent with the source trees as possible. This leads to a computationally challenging optimization problem, for which heuristic methods such as matrix representation with parsimony (MRP) are typically used. In this paper we consider the use of answer set programming to solve the supertree construction problem in terms of two alternative encodings. The first is based on an existing encoding of trees using substructures known as quartets, while the other, novel encoding captures the relationships present in trees through direct projections. We use these encodings to compute a genus-level supertree for the family of cats (Felidae). Furthermore, we compare our results to recent supertrees obtained by the MRP method. Comment: To appear in Theory and Practice of Logic Programming (TPLP), Proceedings of ICLP 201

    Explaining Evolution via Constrained Persistent Perfect Phylogeny

    BACKGROUND: The perfect phylogeny is an often used model in phylogenetics since it provides an efficient basic procedure for representing the evolution of genomic binary characters in several frameworks, for example in haplotype inference. The model, which is conceptually the simplest, is based on the infinite sites assumption, that is, no character can mutate more than once in the whole tree. A main open problem regarding the model is finding generalizations that retain the computational tractability of the original model but are more flexible in modeling biological data when the infinite sites assumption is violated because of, e.g., back mutations. A special case of back mutations that has been considered in the study of the evolution of protein domains (where a domain is acquired and then lost) is persistency, that is, the fact that a character is allowed to return to the ancestral state. In this model characters can be gained and lost at most once. In this paper we consider the computational problem of explaining binary data by the Persistent Perfect Phylogeny model (referred to as PPP), and for this purpose we investigate the problem of reconstructing an evolution where some constraints are imposed on the paths of the tree. RESULTS: We define a natural generalization of the PPP problem obtained by requiring that for some pairs (character, species), neither the species nor any of its ancestors can have the character. In other words, some characters cannot be persistent for some species. This new problem is called Constrained PPP (CPPP). Based on a graph formulation of the CPPP problem, we are able to provide a polynomial time solution for the CPPP problem for matrices whose conflict graph has no edges. Using this result, we develop a parameterized algorithm for solving the CPPP problem where the parameter is the number of characters.
CONCLUSIONS: A preliminary experimental analysis shows that the constrained persistent perfect phylogeny model makes it possible to efficiently explain data that do not conform to the classical perfect phylogeny model

    Algorithms for weighted multidimensional search and perfect phylogeny

    This dissertation is a collection of papers from two independent areas: convex optimization problems in R^d and the construction of evolutionary trees. The paper on convex optimization problems in R^d gives improved algorithms for solving the Lagrangian duals of problems that have both of the following properties. First, in the absence of the bad constraints, the problems can be solved in strongly polynomial time by combinatorial algorithms. Second, the number of bad constraints is fixed. As part of our solution to these problems, we extend Cole's circuit simulation approach and develop a weighted version of Megiddo's multidimensional search technique. The papers on evolutionary tree construction deal with the perfect phylogeny problem, where species are specified by a set of characters and each character can occur in a species in one of a fixed number of states. This problem is known to be NP-complete. The dissertation contains the following results on the perfect phylogeny problem: (1) A linear time algorithm when all the characters have two states. (2) A polynomial time algorithm when the number of character states is fixed. (3) A polynomial time algorithm when the number of characters is fixed
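
    As a rough illustration of the two-state case mentioned in result (1), the sketch below checks whether a binary character matrix admits a perfect phylogeny using the classical pairwise four-gamete test. This is not the dissertation's linear time algorithm; it is a simple quadratic check, and the function names and example matrices are hypothetical, chosen only for illustration.

```python
from itertools import combinations


def four_gamete_conflicts(matrix):
    """Return pairs of character (column) indices that fail the four-gamete test.

    `matrix` is a list of rows (species), each a list of 0/1 character states.
    Two binary characters conflict when all four state combinations
    (0,0), (0,1), (1,0), (1,1) occur across the species.
    """
    if not matrix:
        return []
    n_chars = len(matrix[0])
    conflicts = []
    for i, j in combinations(range(n_chars), 2):
        gametes = {(row[i], row[j]) for row in matrix}
        if len(gametes) == 4:
            conflicts.append((i, j))
    return conflicts


def admits_perfect_phylogeny(matrix):
    """A binary matrix admits an (undirected) perfect phylogeny
    exactly when no pair of characters conflicts."""
    return not four_gamete_conflicts(matrix)


# Hypothetical example data: rows are species, columns are characters.
compatible = [
    [1, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
    [0, 0, 0],
]
conflicting = [
    [0, 0],
    [0, 1],
    [1, 0],
    [1, 1],
]

print(admits_perfect_phylogeny(compatible))
print(four_gamete_conflicts(conflicting))
```

    The list of conflicting pairs can also be read as the edge set of a conflict graph over the characters, under the assumption that a conflict is defined by the four-gamete condition used here.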

    A new method for identifying site-specific evolutionary rates and its applications.

    In this thesis, I discuss each stage in the development of a new method for identifying site-specific evolutionary rates, from conception of the idea, through the implementation, to its application to data. TIGER, or tree independent generation of evolutionary rates, is based largely on the works of LeQuesne (1989), Wilkinson (1998) and Pisani (2004) and on the premise that each site in a multi-state character matrix can be scored based on the level of agreement it displays with the other sites. In these earlier studies, however, agreement was measured in a binary manner: sites were either compatible with each other or they were not. TIGER allows various degrees of agreement to occur between two sites, allowing it to pick up more subtle signals in the data. After the method was implemented in a software program, it could be applied to data. Using a combination of simulated and empirical datasets, TIGER was shown to produce desirable results. In particular, removal of sites identified by TIGER was shown to improve phylogenetic reconstruction of deeply diverging lineages and of taxa displaying compositional attraction. Additionally, TIGER was applied to a gene content matrix in order to identify HGT signals and integrated into the analysis of a current phylogenetic problem, the origin of the mitochondria. Although it is widely accepted that eukaryotes have a chimeric genome, the specific “parent” of the mitochondria is, as of yet, unclear. Previous studies have failed to reach agreement regarding this issue for a number of reasons. Exploration of the signals using TIGER and heterogeneous modelling reveals that multiple signals and compositional heterogeneity are among the biggest problems with datasets containing both mitochondrial and α-proteobacterial sequences

    Constructing Camin-Sokal phylogenies via answer set programming

    The problem of constructing a most parsimonious phylogenetic tree from species data, the maximum parsimony problem, is central to phylogenetics and has diverse applications elsewhere. Most natural variations of the problem, including the cladistic Camin-Sokal (CCS) version studied here, are NP-complete. The usual approach to solving these problems is branch-and-bound (BNB); packages using BNB often find approximate solutions quickly, but can establish optimality only for small instances. We present a new approach to solving the CCS problem based on Answer Set Programming (ASP), a declarative approach based on stable model semantics of logic programming. ASP proves useful in tackling hard, combinatorial search problems. Along with our base model, we describe several variations which significantly affect performance. We compare our best versions with a commonly used BNB-based approach (PHYLIP's PENNY package), and conclude that ASP offers a viable approach to solving phylogeny problems, especially when optimality is relevant

    Constructing Camin-Sokal Phylogenies via Answer Set Programming

    Abstract. Constructing parsimonious phylogenetic trees from species data is a central problem in phylogenetics, and has diverse applications, even outside biology. Many variations of the problem, including the cladistic Camin-Sokal (CCS) version, are NP-complete. We present Answer Set Programming (ASP) models for the binary CCS problem, as well as a simpler perfect phylogeny version, along with experimental results of applying the models to biological data. Our contribution is three-fold. First, we solve phylogeny problems which have not previously been tackled by ASP. Second, we report on variants of our CCS model which significantly affect run time, including the interesting case of making the program “slightly tighter”. This version exhibits some of the best performance, in contrast with a tight version of the model which exhibited poor performance. Third, we are able to find proven-optimal solutions for larger instances of the CCS problem than the widely used branch-and-bound-based PHYLIP package. Keywords: phylogeny; maximum parsimony; Camin-Sokal; answer set programming

    A quantitative approach to the study of syntactic evolution

    The dissertation experiments with quantitative algorithmic procedures for the study of language evolution. In particular, the inquiry is based on the application of quantitative methods originally designed within molecular biology and population genetics to a parametric comparative dataset. The goal is to infer hypotheses regarding genealogical relationships between a specific set of languages, accounting for the role of areal convergence in linguistic variation, and to evaluate them in light of the traditional accounts provided by historical linguistics. The first focus is on the comparison between language evolution and biological evolution. The idea is that some important features of language development may also be identified by drawing a parallel with the biological domain. On the whole, this analysis seems to show that language evolution and biological evolution are considerably different in some respects, but that the dissimilarities do not prevent the application of quantitative reconstruction procedures. Then the most recent generative views on syntactic change are taken into consideration, showing that they are perfectly compatible with the evolutionary account outlined. To this end, basic notions regarding the cognitive-biolinguistic and the formal aspects of generative grammar are illustrated and, once the parametric perspective on synchronic language variation is clarified, the discussion turns to the extension of the parametric approach to the explanation of diachronic phenomena, including genealogical development and contact. The next step is the presentation of diverse methods of comparison adopted in historical linguistics and population genetics and, in particular, of the “Parametric comparison method”. The parallel between the latter and the procedures of investigation used in molecular biology paves the way to the introduction of the relevant quantitative techniques of phylogenetic reconstruction.
After an overview of the principal datasets used so far to perform quantitative investigations on the history of languages, the parametric dataset is presented and an overview of “traditional” and quantitative-based proposals regarding the genealogical classification of the languages included in the investigation is provided. The last section of the work illustrates the quantitative analyses carried out. The preliminary character-based and distance-based review of the dataset is followed by the discussion of the choice of the phylogenetic methods adopted. Then the first set of phylogenetic reconstructions on the full dataset is offered and commented on in detail. The next focus is on possible strategies to account for homoplasy (i.e. chance and borrowing). An empirically based selection of parameters and suggestions regarding the way in which parameters might be weighted according to their genealogical relevance are proposed. Finally, some tentative analyses are introduced concerning the possibility of detecting and accounting for borrowing in phylogenetic trees, the reconstruction of ancestral states, and the mapping of syntactic distances onto the diachronic and the diatopic dimensions of variation. On the whole, the quantitative analyses appear to provide good indications of several facts: that phylogenetic techniques are to a large extent effectively applicable to the study of syntactic evolution, that the parametric comparison may successfully help shed light on both short- and long-range genealogical relationships, and that traces of proper genealogical relatedness are likely to be preserved (and to be recoverable despite homoplasy) at the level of “macro-comparison”, like that instantiated in the parametric data