    The inference of gene trees with species trees

    Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can co-exist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice-versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. In this article we review the various models that have been used to describe the relationship between gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a better basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution. Comment: Review article in relation to the "Mathematical and Computational Evolutionary Biology" conference, Montpellier, 201

    Alignment uncertainty, regressive alignment and large scale deployment

    A multiple sequence alignment (MSA) provides a description of the relationship between biological sequences where columns represent a shared ancestry through an implied set of evolutionary events. The majority of research in the field has focused on improving the accuracy of alignments within the progressive alignment framework and has allowed for powerful inferences including phylogenetic reconstruction, homology modelling and disease prediction. Notwithstanding this, when applied to modern genomics datasets - often comprising tens of thousands of sequences - new challenges arise in the construction of accurate MSA. These issues can be generalised to form three basic problems. Foremost, as the number of sequences increases, progressive alignment methodologies exhibit a dramatic decrease in alignment accuracy. Additionally, for any given dataset many possible MSA solutions exist, a problem which is exacerbated with an increasing number of sequences due to alignment uncertainty. Finally, technical difficulties hamper the deployment of such genomic analysis workflows - especially in a reproducible manner - often presenting a high barrier for even skilled practitioners. This work aims to address this trifecta of problems through a web server for fast homology extension based MSA, two new methods for improved phylogenetic bootstrap supports incorporating alignment uncertainty, a novel alignment procedure that improves large scale alignments termed regressive MSA and finally a workflow framework that enables the deployment of large scale reproducible analyses across clusters and clouds titled Nextflow. 
Together, this work can be seen to provide both conceptual and technical advances which deliver substantial improvements to existing MSA methods and the resulting inferences.

    A practical guide to design and assess a phylogenomic study

    Over the last decade, molecular systematics has undergone a change of paradigm as high-throughput sequencing now makes it possible to reconstruct evolutionary relationships using genome-scale datasets. The advent of 'big data' molecular phylogenetics provided a battery of new tools for biologists but simultaneously brought new methodological challenges. The increase in analytical complexity comes at the price of highly specific training in computational biology and molecular phylogenetics, resulting very often in a polarized accumulation of knowledge (technical on one side and biological on the other). Interpreting the robustness of genome-scale phylogenetic studies is not straightforward, particularly as new methodological developments have consistently shown that the general belief of 'more genes, more robustness' often does not apply, and because there is a range of systematic errors that plague phylogenomic investigations. This is particularly problematic because phylogenomic studies are highly heterogeneous in their methodology, and best practices are often not clearly defined. The main aim of this article is to present what I consider as the ten most important points to take into consideration when planning a well-thought-out phylogenomic study and while evaluating the quality of published papers. The goal is to provide a practical step-by-step guide that can be easily followed by nonexperts and phylogenomic novices in order to assess the technical robustness of phylogenomic studies or improve the experimental design of a project

    The inference of gene trees with species trees.

    This article reviews the various models that have been used to describe the relationships between gene trees and species trees. Molecular phylogeny has focused mainly on improving models for the reconstruction of gene trees based on sequence alignments. Yet, most phylogeneticists seek to reveal the history of species. Although the histories of genes and species are tightly linked, they are seldom identical, because genes duplicate, are lost or horizontally transferred, and because alleles can coexist in populations for periods that may span several speciation events. Building models describing the relationship between gene and species trees can thus improve the reconstruction of gene trees when a species tree is known, and vice versa. Several approaches have been proposed to solve the problem in one direction or the other, but in general neither gene trees nor species trees are known. Only a few studies have attempted to jointly infer gene trees and species trees. These models account for gene duplication and loss, transfer or incomplete lineage sorting. Some of them consider several types of events together, but none exists currently that considers the full repertoire of processes that generate gene trees along the species tree. Simulations as well as empirical studies on genomic data show that combining gene tree-species tree models with models of sequence evolution improves gene tree reconstruction. In turn, these better gene trees provide a more reliable basis for studying genome evolution or reconstructing ancestral chromosomes and ancestral gene sequences. We predict that gene tree-species tree methods that can deal with genomic data sets will be instrumental to advancing our understanding of genomic evolution

    The Apicomplexan Whole-Genome Phylogeny: An Analysis of Incongruence among Gene Trees

    The protistan phylum Apicomplexa contains many important pathogens and is the subject of intense genome sequencing efforts. Based upon the genome sequences from seven apicomplexan species and a ciliate outgroup, we identified 268 single-copy genes suitable for phylogenetic inference. Both concatenation and consensus approaches inferred the same species tree topology. This topology is consistent with most prior conceptions of apicomplexan evolution based upon ultrastructural and developmental characters, that is, the piroplasm genera Theileria and Babesia form the sister group to the Plasmodium species, the coccidian genera Eimeria and Toxoplasma are monophyletic and are the sister group to the Plasmodium species and piroplasm genera, and Cryptosporidium forms the sister group to the above mentioned with the ciliate Tetrahymena as the outgroup. The level of incongruence among gene trees appears to be high at first glance; only 19% of the genes support the species tree, and a total of 48 different gene-tree topologies are observed. Detailed investigations suggest that the low signal-to-noise ratio in many genes may be the main source of incongruence. The probability of being consistent with the species tree increases as a function of the minimum bootstrap support observed at tree nodes for a given gene tree. Moreover, gene sequences that generate high bootstrap support are robust to the changes in alignment parameters or phylogenetic method used. However, caution should be taken in that some genes can infer a “wrong” tree with strong support because of paralogy, model violations, or other causes. The importance of examining multiple, unlinked genes that possess a strong phylogenetic signal cannot be overstated

    Systematic errors in phylogenomics with a focus on the major metazoan clade Deuterostomia

    Modern-day phylogenomics studies employ large data sets of many genes to resolve evolutionary relationships among many species. A typical phylogenomic workflow consists of certain steps: taxon sampling, orthology inference, marker selection and tree search. All of these steps contain some subjective decisions made by the researcher, posing risks for introducing systematic errors in the final results. In this thesis, I investigate the source and the impact of systematic errors in multiple steps of the phylogenomic workflow, focusing on the major clade Metazoa. First, I create simulated sets of orthologs under different settings for evolutionary rate and rate heterogeneity among sites and use OrthoFinder to infer their (known) orthology relationships. Orthology inference is sensitive to high evolutionary rates and low rate heterogeneity among sites. I show that errors in orthology inference are carried over to downstream analysis such as gene presence/absence phylogenies, gene gains/losses inference and phylostratigraphy. I also introduce a novel computational pipeline which allows us to identify the presence of a hidden break in the 28S ribosomal RNA of a given species. Mapping RNA-seq reads onto the 28S rRNA sequence reveals non-existent coverage of mapped reads near the middle of the 28S rRNA sequence of species that possess the hidden break. I apply this pipeline in hundreds of metazoan and other eukaryotic species and find that the hidden break is a rarely lost protostome feature, with surprising events of convergent evolution outside Metazoa. I finally focus on the major metazoan clade of Deuterostomia; while it has been widely accepted as a monophyletic group for over a century, recent phylogenomic studies addressing known systematic errors have recovered low support for monophyletic Deuterostomia. 
I examine five recently published metazoan phylogenomic data sets to show that monophyletic Deuterostomia is much less well supported than monophyletic Protostomia. I also create 40 new data sets, with and without fast-evolving taxa, and use them to correlate strong support for monophyletic Deuterostomia with problematic conditions in a phylogenomic analysis

    Marginal likelihoods in phylogenetics: a review of methods and applications

    By providing a framework of accounting for the shared ancestry inherent to all life, phylogenetics is becoming the statistical foundation of biology. The importance of model choice continues to grow as phylogenetic models continue to increase in complexity to better capture micro and macroevolutionary processes. In a Bayesian framework, the marginal likelihood is how data update our prior beliefs about models, which gives us an intuitive measure of comparing model fit that is grounded in probability theory. Given the rapid increase in the number and complexity of phylogenetic models, methods for approximating marginal likelihoods are increasingly important. Here we try to provide an intuitive description of marginal likelihoods and why they are important in Bayesian model testing. We also categorize and review methods for estimating marginal likelihoods of phylogenetic models, highlighting several recent methods that provide well-behaved estimates. Furthermore, we review some empirical studies that demonstrate how marginal likelihoods can be used to learn about models of evolution from biological data. We discuss promising alternatives that can complement marginal likelihoods for Bayesian model choice, including posterior-predictive methods. Using simulations, we find one alternative method based on approximate-Bayesian computation (ABC) to be biased. We conclude by discussing the challenges of Bayesian model choice and future directions that promise to improve the approximation of marginal likelihoods and Bayesian phylogenetics as a whole. Comment: 33 pages, 3 figures

    Crinoid phylogeny: a preliminary analysis (Echinodermata: Crinoidea)

    We describe the first molecular and morphological analysis of extant crinoid high-level inter-relationships. Nuclear and mitochondrial gene sequences and a cladistically coded matrix of 30 morphological characters are presented, and analysed by phylogenetic methods. The molecular data were compiled from concatenated nuclear-encoded 18S rDNA, internal transcribed spacer 1, 5.8S rDNA, and internal transcribed spacer 2, together with part of mitochondrial 16S rDNA, and comprised 3,593 sites, of which 313 were parsimony-informative. The molecular and morphological analyses include data from the bourgueticrinid Bathycrinus; the antedonid comatulids Dorometra and Florometra; the cyrtocrinids Cyathidium, Gymnocrinus, and Holopus; the isocrinids Endoxocrinus, and two species of Metacrinus; as well as from Guillecrinus and Caledonicrinus, whose ordinal relationships are uncertain, together with morphological data from Proisocrinus. Because the molecular data include indel-rich regions, special attention was given to alignment procedure, and it was found that relatively low, gene-specific, gap penalties gave alignments from which congruent phylogenetic information was obtained from both well-aligned, indel-poor and potentially misaligned, indel-rich regions. The different sequence data partitions also gave essentially congruent results. The overall direction of evolution in the gene trees remains uncertain: an asteroid outgroup places the root on the branch adjacent to the slowly evolving isocrinids (consistent with palaeontological order of first appearances), but maximum likelihood analysis with a molecular clock places it elsewhere. Despite lineage-specific rate differences, the clock model was not excluded by a likelihood ratio test. Morphological analyses were unrooted. All analyses identified three clades, two of them generally well-supported. 
One well-supported clade (BCG) unites Bathycrinus and Guillecrinus with the representative (chimaeric) comatulid in a derived position, suggesting that comatulids originated from a sessile, stalked ancestor. In this connection it is noted that because the comatulid centrodorsal ossicle originates ontogenetically from the column, it is not strictly correct to describe comatulids as unstalked crinoids. A second, uniformly well-supported clade contains members of the Isocrinida, while the third clade contains Gymnocrinus, a well-established member of the Cyrtocrinida, together with the problematic taxon Caledonicrinus, currently classified as a bourgueticrinid. Another cyrtocrinid, Holopus, joins this clade with only weak molecular, but strong morphological support. In one morphological analysis Proisocrinus is weakly attached to the isocrinid clade. Only an unusual, divergent 18S rDNA sequence was obtained from the morphologically strange cyrtocrinid Cyathidium. Although not analysed in detail, features of this sequence suggested that it may be a PCR artefact, so that the apparently basal position of this taxon requires confirmation. If not an artefact, Cyathidium either diverged from the crinoid stem much earlier than has been recognised hitherto (i.e., it may be a Palaeozoic relic), or it has an atypically high rate of molecular evolution

    Genome evolution in Prochlorococcus and marine Synechococcus
