10 research outputs found

    Structural properties of the reconciliation space and their applications in enumerating nearly-optimal reconciliations between a gene tree and a species tree

    Get PDF
    Introduction: A gene tree for a gene family is often discordant with the containing species tree because of its complex evolutionary course during which gene duplication, gene loss and incomplete lineage sorting events might occur. Hence, it is of great challenge to infer the containing species tree from a set of gene trees. One common approach to this inference problem is through gene tree and species tree reconciliation. Results: In this paper, we generalize the traditional least common ancestor (LCA) reconciliation to define a reconciliation between a gene tree and species tree under the tree homomorphism framework. We then study the structural properties of the space of all reconciliations between a gene tree and a species tree in terms of the gene duplication, gene loss or deep coalescence costs. As application, we show that the LCA reconciliation is the unique one that has the minimum deep coalescence cost, provide a novel characterization of the reconciliations with the optimal duplication cost, and present efficient algorithms for enumerating (nearly-)optimal reconciliations with respect to each cost. Conclusions: This work provides a new graph-theoretic framework for studying gene tree and species tree reconciliations

    Geometric medians in reconciliation spaces

    Get PDF
    In evolutionary biology, it is common to study how various entities evolve together, for example, how parasites coevolve with their host, or genes with their species. Coevolution is commonly modelled by considering certain maps or reconciliations from one evolutionary tree PP to another HH, all of which induce the same map ϕ\phi between the leaf-sets of PP and HH (corresponding to present-day associations). Recently, there has been much interest in studying spaces of reconciliations, which arise by defining some metric dd on the set Rec(P,H,ϕ)Rec(P,H,\phi) of all possible reconciliations between PP and HH. In this paper, we study the following question: How do we compute a geometric median for a given subset Ψ\Psi of Rec(P,H,ϕ)Rec(P,H,\phi) relative to dd, i.e. an element ψmedRec(P,H,ϕ)\psi_{med} \in Rec(P,H,\phi) such that ψΨd(ψmed,ψ)ψΨd(ψ,ψ) \sum_{\psi' \in \Psi} d(\psi_{med},\psi') \le \sum_{\psi' \in \Psi} d(\psi,\psi') holds for all ψRec(P,H,ϕ)\psi \in Rec(P,H,\phi)? For a model where so-called host-switches or transfers are not allowed, and for a commonly used metric dd called the edit-distance, we show that although the cardinality of Rec(P,H,ϕ)Rec(P,H,\phi) can be super-exponential, it is still possible to compute a geometric median for a set Ψ\Psi in Rec(P,H,ϕ)Rec(P,H,\phi) in polynomial time. We expect that this result could be useful for computing a summary or consensus for a set of reconciliations (e.g. for a set of suboptimal reconciliations).Comment: 12 pages, 1 figur

    Exploring and visualising spaces of tree reconciliations

    Get PDF
    Tree reconciliation is the mathematical tool that is used to investigate the coevolution of organisms, such as hosts and parasites. A common approach to tree reconciliation involves specifying a model that assigns costs to certain events, such as cospeciation, and then tries to find a mapping between two specified phylogenetic trees which minimises the total cost of the implied events. For such models, it has been shown that there may be a huge number of optimal solutions, or at least solutions that are close to optimal. It is therefore of interest to be able to systematically compare and visualise whole collections of reconciliations between a specified pair of trees. In this paper, we consider various metrics on the set of all possible reconciliations between a pair of trees, some that have been defined before but also new metrics that we shall propose. We show that the diameter for the resulting spaces of reconciliations can in some cases be determined theoretically, information that we use to normalise and compare properties of the metrics. We also implement the metrics and compare their behaviour on several host parasite datasets, including the shapes of their distributions. In addition, we show that in combination with multidimensional scaling, the metrics can be useful for visualising large collections of reconciliations, much in the same way as phylogenetic tree metrics can be used to explore collections of phylogenetic trees. Implementations of the metrics can be downloaded from: https://team.inria.fr/erable/en/team-members/blerina-sinaimeri/reconciliation-distances

    Consistency Properties of Species Tree Inference by Minimizing Deep Coalescences

    Full text link
    Methods for inferring species trees from sets of gene trees need to account for the possibility of discordance among the gene trees. Assuming that discordance is caused by incomplete lineage sorting, species tree estimates can be obtained by finding those species trees that minimize the number of -deep- coalescence events required for a given collection of gene trees. Efficient algorithms now exist for applying the minimizing-deep-coalescence (MDC) criterion, and simulation experiments have demonstrated its promising performance. However, it has also been noted from simulation results that the MDC criterion is not always guaranteed to infer the correct species tree estimate. In this article, we investigate the consistency of the MDC criterion. Using the multipscies coalescent model, we show that there are indeed anomaly zones for the MDC criterion for asymmetric four-taxon species tree topologies, and for all species tree topologies with five or more taxa.Peer Reviewedhttp://deepblue.lib.umich.edu/bitstream/2027.42/90434/1/cmb-2E2010-2E0102.pd

    Efficient Algorithms for the Gene Tree-Species Tree Reconciliation Problem

    Get PDF
    Ph.DDOCTOR OF PHILOSOPH

    The Orthology Road: Theory and Methods in Orthology Analysis

    Get PDF
    The evolution of biological species depends on changes in genes. Among these changes are the gradual accumulation of DNA mutations, insertions and deletions, duplication of genes, movements of genes within and between chromosomes, gene losses and gene transfer. As two populations of the same species evolve independently, they will eventually become reproductively isolated and become two distinct species. The evolutionary history of a set of related species through the repeated occurrence of this speciation process can be represented as a tree-like structure, called a phylogenetic tree or a species tree. Since duplicated genes in a single species also independently accumulate point mutations, insertions and deletions, they drift apart in composition in the same way as genes in two related species. The divergence of all the genes descended from a single gene in an ancestral species can also be represented as a tree, a gene tree that takes into account both speciation and duplication events. In order to reconstruct the evolutionary history from the study of extant species, we use sets of similar genes, with relatively high degree of DNA similarity and usually with some functional resemblance, that appear to have been derived from a common ancestor. The degree of similarity among different instances of the “same gene” in different species can be used to explore their evolutionary history via the reconstruction of gene family histories, namely gene trees. Orthology refers specifically to the relationship between two genes that arose by a speciation event, recent or remote, rather than duplication. Comparing orthologous genes is essential to the correct reconstruction of species trees, so that detecting and identifying orthologous genes is an important problem, and a longstanding challenge, in comparative and evolutionary genomics as well as phylogenetics. A variety of orthology detection methods have been devised in recent years. Although many of these methods are dependent on generating gene and/or species trees, it has been shown that orthology can be estimated at acceptable levels of accuracy without having to infer gene trees and/or reconciling gene trees with species trees. Therefore, there is good reason to look at the connection of trees and orthology from a different angle: How much information about the gene tree, the species tree, and their reconciliation is already contained in the orthology relation among genes? Intriguingly, a solution to the first part of this question has already been given by Boecker and Dress [Boecker and Dress, 1998] in a different context. In particular, they completely characterized certain maps which they called symbolic ultrametrics. Semple and Steel [Semple and Steel, 2003] then presented an algorithm that can be used to reconstruct a phylogenetic tree from any given symbolic ultrametric. In this thesis we investigate a new characterization of orthology relations, based on symbolic ultramterics for recovering the gene tree. According to Fitch’s definition [Fitch, 2000], two genes are (co-)orthologous if their last common ancestor in the gene tree represents a speciation event. On the other hand, when their last common ancestor is a duplication event, the genes are paralogs. The orthology relation on a set of genes is therefore determined by the gene tree and an “event labeling” that identifies each interior vertex of that tree as either a duplication or a speciation event. In the context of analyzing orthology data, the problem of reconciling event-labeled gene trees with a species tree appears as a variant of the reconciliation problem where genes trees have no labels in their internal vertices. When reconciling a gene tree with a species tree, it can be assumed that the species tree is correct or, in the case of a unknown species tree, it can be inferred. Therefore it is crucial to know for a given gene tree whether there even exists a species tree. In this thesis we characterize event-labelled gene trees for which a species tree exists and species trees to which event-labelled gene trees can be mapped. Reconciliation methods are not always the best options for detecting orthology. A fundamental problem is that, aside from multicellular eukaryotes, evolution does not seem to have conformed to the descent-with-modification model that gives rise to tree-like phylogenies. Examples include many cases of prokaryotes and viruses whose evolution involved horizontal gene transfer. To treat the problem of distinguishing orthology and paralogy within a more general framework, graph-based methods have been proposed to detect and differentiate among evolutionary relationships of genes in those organisms. In this work we introduce a measure of orthology that can be used to test graph-based methods and reconciliation methods that detect orthology. Using these results a new algorithm BOTTOM-UP to determine whether a map from the set of vertices of a tree to a set of events is a symbolic ultrametric or not is devised. Additioanlly, a simulation environment designed to generate large gene families with complex duplication histories on which reconstruction algorithms can be tested and software tools can be benchmarked is presented

    Phylogenetics in the Genomic Era

    Get PDF
    Molecular phylogenetics was born in the middle of the 20th century, when the advent of protein and DNA sequencing offered a novel way to study the evolutionary relationships between living organisms. The first 50 years of the discipline can be seen as a long quest for resolving power. The goal – reconstructing the tree of life – seemed to be unreachable, the methods were heavily debated, and the data limiting. Maybe for these reasons, even the relevance of the whole approach was repeatedly questioned, as part of the so-called molecules versus morphology debate. Controversies often crystalized around long-standing conundrums, such as the origin of land plants, the diversification of placental mammals, or the prokaryote/eukaryote divide. Some of these questions were resolved as gene and species samples increased in size. Over the years, molecular phylogenetics has gradually evolved from a brilliant, revolutionary idea to a mature research field centred on the problem of reliably building trees. This logical progression was abruptly interrupted in the late 2000s. High-throughput sequencing arose and the field suddenly moved into something entirely different. Access to genome-scale data profoundly reshaped the methodological challenges, while opening an amazing range of new application perspectives. Phylogenetics left the realm of systematics to occupy a central place in one of the most exciting research fields of this century – genomics. This is what this book is about: how we do trees, and what we do with trees, in the current phylogenomic era. One obvious, practical consequence of the transition to genome-scale data is that the most widely used tree-building methods, which are based on probabilistic models of sequence evolution, require intensive algorithmic optimization to be applicable to current datasets. This problem is considered in Part 1 of the book, which includes a general introduction to Markov models (Chapter 1.1) and a detailed description of how to optimally design and implement Maximum Likelihood (Chapter 1.2) and Bayesian (Chapter 1.4) phylogenetic inference methods. The importance of the computational aspects of modern phylogenomics is such that efficient software development is a major activity of numerous research groups in the field. We acknowledge this and have included seven "How to" chapters presenting recent updates of major phylogenomic tools – RAxML (Chapter 1.3), PhyloBayes (Chapter 1.5), MACSE (Chapter 2.3), Bgee (Chapter 4.3), RevBayes (Chapter 5.2), Beagle (Chapter 5.4), and BPP (Chapter 5.6). Genome-scale data sets are so large that statistical power, which had been the main limiting factor of phylogenetic inference during previous decades, is no longer a major issue. Massive data sets instead tend to amplify the signal they deliver – be it biological or artefactual – so that bias and inconsistency, instead of sampling variance, are the main problems with phylogenetic inference in the genomic era. Part 2 covers the issues of data quality and model adequacy in phylogenomics. Chapter 2.1 provides an overview of current practice and makes recommendations on how to avoid the more common biases. Two chapters review the challenges and limitations of two key steps of phylogenomic analysis pipelines, sequence alignment (Chapter 2.2) and orthology prediction (Chapter 2.4), which largely determine the reliability of downstream inferences. The performance of tree building methods is also the subject of Chapter 2.5, in which a new approach is introduced to assess the quality of gene trees based on their ability to correctly predict ancestral gene order. Analyses of multiple genes typically recover multiple, distinct trees. Maybe the biggest conceptual advance induced by the phylogenetic to phylogenomic transition is the suggestion that one should not simply aim to reconstruct “the” species tree, but rather to be prepared to make sense of forests of gene trees. Chapter 3.1 reviews the numerous reasons why gene trees can differ from each other and from the species tree, and what the implications are for phylogenetic inference. Chapter 3.2 focuses on gene trees/species trees reconciliation methods that account for gene duplication/loss and horizontal gene transfer among lineages. Incomplete lineage sorting is another major source of phylogenetic incongruence among loci, which recently gained attention and is covered by Chapter 3.3. Chapter 3.4 concludes this part by taking a user’s perspective and examining the pros and cons of concatenation versus separate analysis of gene sequence alignments. Modern genomics is comparative and phylogenetic methods are key to a wide range of questions and analyses relevant to the study of molecular evolution. This is covered by Part 4. We argue that genome annotation, either structural or functional, can only be properly achieved in a phylogenetic context. Chapters 4.1 and 4.2 review the power of these approaches and their connections with the study of gene function. Molecular substitution rates play a key role in our understanding of the prevalence of nearly neutral versus adaptive molecular evolution, and the influence of species traits on genome dynamics (Chapter 4.4). The analysis of substitution rates, and particularly the detection of positive selection, requires sophisticated methods and models of coding sequence evolution (Chapter 4.5). Phylogenomics also offers a unique opportunity to explore evolutionary convergence at a molecular level, thus addressing the long-standing question of predictability versus contingency in evolution (Chapter 4.6). The development of phylogenomics, as reviewed in Parts 1 through 4, has resulted in a powerful conceptual and methodological corpus, which is often reused for addressing problems of interest to biologists from other fields. Part 5 illustrates this application potential via three selected examples. Chapter 5.1 addresses the link between phylogenomics and palaeontology; i.e., how to optimally combine molecular and fossil data for estimating divergence times. Chapter 5.3 emphasizes the importance of the phylogenomic approach in virology and its potential to trace the origin and spread of infectious diseases in space and time. Finally, Chapter 5.5 recalls why phylogenomic methods and the multi-species coalescent model are key in addressing the problem of species delimitation – one of the major goals of taxonomy. It is hard to predict where phylogenomics as a discipline will stand in even 10 years. Maybe a novel technological revolution will bring it to yet another level? We strongly believe, however, that tree thinking will remain pivotal in the treatment and interpretation of the deluge of genomic data to come. Perhaps a prefiguration of the future of our field is provided by the daily monitoring of the current Covid-19 outbreak via the phylogenetic analysis of coronavirus genomic data in quasi real time – a topic of major societal importance, contemporary to the publication of this book, in which phylogenomics is instrumental in helping to fight disease

    Evolutionary genomics : statistical and computational methods

    Get PDF
    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering of the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward

    Evolutionary Genomics

    Get PDF
    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering of the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward

    Table of Contents and Masthead (v. 17, no. 1)

    Full text link
    corecore