105 research outputs found

    The inference of gene trees with species trees

    Get PDF
    Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution.Comment: Review article in relation to the "Mathematical and Computational Evolutionary Biology" conference, Montpellier, 201

    The Most Parsimonious Reconciliation Problem in the Presence of Incomplete Lineage Sorting and Hybridization Is NP-Hard

    Get PDF
    The maximum parsimony phylogenetic reconciliation problem seeks to explain incongruity between a gene phylogeny and a species phylogeny with respect to a set of evolutionary events. While the reconciliation problem is well-studied for species and gene trees subject to events such as duplication, transfer, loss, and deep coalescence, recent work has examined species phylogenies that incorporate hybridization and are thus represented by networks rather than trees. In this paper, we show that the problem of computing a maximum parsimony reconciliation for a gene tree and species network is NP-hard even when only considering deep coalescence. This result suggests that future work on maximum parsimony reconciliation for species networks should explore approximation algorithms and heuristics

    Efficient error correction algorithms for gene tree reconciliation based on duplication, duplication and loss, and deep coalescence

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Gene tree - species tree reconciliation problems infer the patterns and processes of gene evolution within a species tree. Gene tree parsimony approaches seek the evolutionary scenario that implies the fewest gene duplications, duplications and losses, or deep coalescence (incomplete lineage sorting) events needed to reconcile a gene tree and a species tree. While a gene tree parsimony approach can be informative about genome evolution and phylogenetics, error in gene trees can profoundly bias the results.</p> <p>Results</p> <p>We introduce efficient algorithms that rapidly search local Subtree Prune and Regraft (SPR) or Tree Bisection and Reconnection (TBR) neighborhoods of a given gene tree to identify a topology that implies the fewest duplications, duplication and losses, or deep coalescence events. These algorithms improve on the current solutions by a factor of <it>n </it>for searching SPR neighborhoods and <it>n</it><sup>2 </sup>for searching TBR neighborhoods, where <it>n </it>is the number of taxa in the given gene tree. They provide a fast error correction protocol for ameliorating the effects of gene tree error by allowing small rearrangements in the topology to improve the reconciliation cost. We also demonstrate a simple protocol to use the gene rearrangement algorithm to improve gene tree parsimony phylogenetic analyses.</p> <p>Conclusions</p> <p>The new gene tree rearrangement algorithms provide a fast method to address gene tree error. They do not make assumptions about the underlying processes of genome evolution, and they are amenable to analyses of large-scale genomic data sets. These algorithms are also easily incorporated into gene tree parsimony phylogenetic analyses, potentially producing more credible estimates of reconciliation cost.</p

    Phylogeny reconciliation under gene tree parsimony

    Get PDF
    The growing genomic and phylogenetic data sets represent a unique opportunity to analytically and computationally study the relationship among diversifying species. Unfortunately, such data often result in contradictory gene phylogenies due to common yet unobserved evolutionary events, e.g., gene duplication or deep coalescence. Gene tree parsimony (GTP) methods address such issue by reconciling gene phylogenies into one consistent species evolutionary history as well as identifying the underlying events. In this study, we solve not only the GTP problem but also propose a new method to select gene trees in order to assist biologists in gaining insight from phylogenetic analysis. First, we introduce exact solutions for the intrinsically complex GTP problem. Exact solutions for NP-hard problems, like GTP, have a long and extensive history of improvements for classic problems such as traveling salesman and knapsack. Our solutions presented here are designed via integer linear programming (ILP) and dynamic programming (DP), which are techniques widely used in solving problems of similar complexity. We also demonstrate the effectiveness of our solutions through simulation analysis and empirical datasets. To ensure input data coherence for GTP analysis, as a method to strengthen species represented in a gene tree, we introduce the quasi-biclique (QBC) approach to analyze and condense input datasets. In order to take advantage of emerging techniques that further describe the sequence-host and gene-taxon relations, quasi-bicliques are optimized via weighted edge connectivities and distribution of missing information. Our study showed these QBC mining problems are NP-hard. We describe an ILP formulation that is capable of finding optimal QBCs in an effort to support GTP analysis. We also investigate the applicability of QBC to other applications such as mining genetic interaction networks to encouraging results

    Parsimonious Inference of Hybridization in the Presence of Incomplete Lineage Sorting

    Get PDF
    Hybridization plays an important evolutionary role in several groups of organisms. A phylogenetic approach to detect hybridization entails sequencing multiple loci across the genomes of a group of species of interest, reconstructing their gene trees, and taking their differences as indicators of hybridization. However, methods that follow this approach mostly ignore population effects, such as incomplete lineage sorting (ILS). Given that hybridization occurs between closely related organisms, ILS may very well be at play and, hence, must be accounted for in the analysis framework. To address this issue, we present a parsimony criterion for reconciling gene trees within the branches of a phylogenetic network, and a local search heuristic for inferring phylogenetic networks from collections of gene-tree topologies under this criterion. This framework enables phylogenetic analyses while accounting for both hybridization and ILS. Further, we propose two techniques for incorporating information about uncertainty in gene-tree estimates. Our simulation studies demonstrate the good performance of our framework in terms of identifying the location of hybridization events, as well as estimating the proportions of genes that underwent hybridization. Also, our framework shows good performance in terms of efficiency on handling large data sets in our experiments. Further, in analyzing a yeast data set, we demonstrate issues that arise when analyzing real data sets. While a probabilistic approach was recently introduced for this problem, and while parsimonious reconciliations have accuracy issues under certain settings, our parsimony framework provides a much more computationally efficient technique for this type of analysis. Our framework now allows for genome-wide scans for hybridization, while also accounting for ILS

    Conflict Resolution Algorithms for Deep Coalescence Phylogenetic Networks

    Get PDF
    We address the problem of inferring an optimal tree displayed by a network, given a gene tree G and a tree-child network N, under the deep coalescence cost. We propose an O(|G||N|)-time dynamic programming algorithm (DP) to compute a lower bound of the optimal displayed tree cost, where |G| and |N| are the sizes of G and N, respectively. This algorithm has the ability to state whether the cost is exact or is a lower bound. In addition, our algorithm provides a set of reticulation edges that correspond to the obtained cost. If the cost is exact, the set induces an optimal displayed tree that yields the cost. If the cost is a lower bound, the set contains pairs of conflicting edges, that is, edges sharing a reticulation node. Next, we show a conflict resolution algorithm that requires 2^{r+1}-1 invocations of DP in the worst case, where r is a number of reticulations. We propose a similar O(2^k|G||N|)-time algorithm for level-k networks and a branch and bound solution to compute lower and upper bounds of optimal costs. We also show how our algorithms can be extended to a broader class of phylogenetic networks. Despite their exponential complexity in the worst case, our solutions perform significantly well on empirical and simulated datasets, thanks to the strategy of resolving internal dissimilarities between gene trees and networks. In particular, experiments on simulated data indicate that the runtime of our solution is ?(2^{0.543 k}|G||N|) on average. Therefore, our solution is an efficient alternative to enumeration strategies commonly proposed in the literature and enables analyses of complex networks with dozens of reticulations
    • …
    corecore