590 research outputs found

    Résurrection du passé à l’aide de modèles hétérogènes d’évolution des séquences protéiques

    The molecular reconstruction and resurrection of ancestral proteins is the major issue tackled in this thesis manuscript. While fossil molecular data are almost nonexistent, phylogenetic methods make it possible to estimate the most likely ancestral protein sequences along a phylogenetic tree describing the relationships between extant sequences. With these ancestral sequences, several biological hypotheses can be tested, from the evolution of protein function to the inference of the ancient environments to which the ancestors were adapted. These probabilistic estimates of ancestral sequences depend on substitution models that give the probabilities of substitution between all pairs of amino acids. Classically, substitution models make the simplifying assumption that the evolutionary process remains homogeneous (constant) among sites of the multiple sequence alignment or between lineages. During the last decade, several methodological improvements were made, with the description of substitution models that account for the heterogeneity of the process among sites and over time. During my thesis, I developed new heterogeneous substitution models in a Maximum Likelihood framework that were shown to fit the data better than any other homogeneous or heterogeneous models. I also demonstrated their better performance in terms of the accuracy of ancestral sequence reconstruction. Using these models to reconstruct or resurrect ancestral proteins, my coworkers and I showed that adaptation to temperature is a major determinant of evolutionary rates in Archaea. Furthermore, we deciphered the nature of the phylogenetic signal informing substitution models to infer a non-parsimonious scenario for the adaptation to temperature during early life on Earth, with a non-hyperthermophilic last universal common ancestor living at lower temperatures than its two descendants.
Finally, we showed that the use of heterogeneous models improves the functionality of resurrected proteins, opening the way to a better understanding of the evolutionary mechanisms acting on biological sequences.
The reconstruction and molecular resurrection of ancestral proteins is at the heart of this thesis. While fossil molecular data are almost nonexistent, it is possible to estimate the most probable ancestral sequences along a phylogenetic tree describing the relationships among extant sequences. Access to these ancestral sequences makes it possible to test many biological hypotheses, from the function of ancestral proteins to the adaptation of organisms to their environment. However, these probabilistic inferences of ancestral sequences depend on substitution models that supply the probabilities of change between amino acids. Recent years have seen the development of new amino acid substitution models that better account for the biological phenomena acting on protein sequence evolution. Classically, models assume that the evolutionary process is the same for all sites of a protein alignment and that it has remained constant over time as lineages evolved. Such a model is called homogeneous in time and across sites. Recent models, called heterogeneous, have lifted these constraints by allowing sites and/or lineages to evolve under different processes. During this thesis, new time- and site-heterogeneous models were developed in a Maximum Likelihood framework.
In particular, they were shown to considerably improve the fit to the data and therefore to better account for the phenomena governing protein sequence evolution, yielding better estimates of ancestral sequences. Using these models together with the reconstruction or laboratory resurrection of ancestral proteins, adaptation to temperature was shown to be a major determinant of the variation in evolutionary rates among archaeal lineages. Likewise, applying these heterogeneous models along the universal tree of life made it possible to better understand the nature of the evolutionary signal that supports, in a non-parsimonious way, a universal ancestor living at a lower temperature than its two descendants, namely the bacterial and archaeal ancestors. Finally, it was shown that the use of such models can improve the functionality of ancestral proteins resurrected in the laboratory, opening the way to a better understanding of the evolutionary mechanisms acting on biological sequences.

    Comparative mitogenomics of the Decapoda reveals evolutionary heterogeneity in architecture and composition

    The emergence of cost-effective and rapid sequencing approaches has resulted in an exponential rise in the number of mitogenomes on public databases in recent years, providing greater opportunity for large-scale comparative genomic and systematic research. Nonetheless, current datasets predominantly come from small and disconnected studies on a limited number of related species, introducing sampling biases and impeding research of broad taxonomic relevance. This study contributes 21 crustacean mitogenomes from several under-represented decapod infraorders including Polychelida and Stenopodidea, which are used in combination with 225 mitogenomes available on NCBI to investigate decapod mitogenome diversity and phylogeny. An overview of mitochondrial gene orders (MGOs) reveals a high level of genomic variability within the Decapoda, with a large number of MGOs deviating from the ancestral arthropod ground pattern and unevenly distributed among infraorders. Despite the substantial morphological and ecological variation among decapods, there was limited evidence for correlations between gene rearrangement events and species ecology or lineage-specific nucleotide substitution rates. Within a phylogenetic context, predicted scenarios of rearrangements show some MGOs to be informative synapomorphies for some taxonomic groups, providing strong independent support for phylogenetic relationships. Additional comparisons for a range of mitogenomic features, including nucleotide composition, strand asymmetry, unassigned regions and codon usage, indicate several clade-specific trends that are of evolutionary and ecological interest.
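Strand asymmetry, one of the mitogenomic features compared above, is commonly summarised with AT skew, (A − T)/(A + T), and GC skew, (G − C)/(G + C), computed on one strand. The abstract does not say how the study measured it, so the following is only a minimal sketch under that assumption; the function name `skews` and the toy sequence are hypothetical.

```python
from collections import Counter


def skews(seq):
    """Return (AT skew, GC skew) for one strand of a DNA sequence.

    AT skew = (A - T) / (A + T); GC skew = (G - C) / (G + C).
    A skew is reported as 0.0 when its denominator is zero.
    """
    counts = Counter(seq.upper())
    a, t, g, c = (counts[base] for base in "ATGC")
    at_skew = (a - t) / (a + t) if a + t else 0.0
    gc_skew = (g - c) / (g + c) if g + c else 0.0
    return at_skew, gc_skew


# Toy sequence (not real data): 3 A, 2 T, 4 G, 1 C.
at_skew, gc_skew = skews("AAATTGGGGC")
print(at_skew, gc_skew)
```

Positive values mean the strand carries more A than T (or more G than C); comparing the sign and magnitude across taxa is one way clade-specific trends can be spotted.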

    HYMENOPTERAN MOLECULAR PHYLOGENETICS: FROM APOCRITA TO BRACONIDAE (ICHNEUMONOIDEA)

    Two separate phylogenetic studies were performed at two different taxonomic levels within Hymenoptera. The first study examined the utility of expressed sequence tags (ESTs) for resolving relationships among hymenopteran superfamilies. Transcripts were assembled from 14,000 sequenced clones for 6 disparate hymenopteran taxa, averaging over 660 unique contigs per species. Orthology and gene determination were performed using modifications to a previously developed computerized pipeline and compared against annotated insect genomes. Sequences from additional taxa were added from public databases, giving a final dataset of 24 genes for 16 taxa. The concatenated dataset recovered a robust and well-supported topology; however, there was extreme incongruity among individual gene trees. Analyses of sequences indicated strong compositional and transition biases, particularly in the third codon positions. The use of filtered supernetworks aided visualization of the congruent phylogenetic signal that existed across the individual gene trees. Additionally, treeness triangle plots indicated a strong residual signal in several gene trees and across codon positions in the concatenated dataset. However, most analyses of the concatenated dataset recovered expected relationships known from other independent analyses. Thus, ESTs provide a powerful source of information for phylogenetic analysis, but results are sensitive to low taxonomic sampling and missing data. The second study examined subfamilial relationships within the parasitoid family Braconidae, using over 4 kb of sequence data for 139 taxa. Bayesian inference of the concatenated dataset recovered a robust phylogeny, particularly for early divergences within the family. There was strong evidence supporting two independent lineages within the family: one leading to the noncyclostomes and one leading to the cyclostomes.
Ancestral state reconstructions were performed to test the theory of ectoparasitism as the ancestral condition for all taxa within the family. Results indicated an endoparasitic ancestor for the family and for the noncyclostome lineage, with an early transition to ectoparasitism for the cyclostome lineage. However, reconstructions of some nodes were sensitive to outgroup coding and are likely to change as biological knowledge increases.

    Polyploidy, base composition bias, and incomplete lineage sorting in fish phylogenetics

    Thesis (Ph.D.) University of Alaska Fairbanks, 2014. Understanding the evolutionary relationships between organisms is of fundamental importance in biology. Originally based on overall similarity in morphological traits, the depiction of evolutionary relationships is now often pursued by constructing trees from molecular data, that is, molecular phylogenetics. Molecular phylogenetic inference uses variation in molecular data in a variety of frameworks to produce hypothetical relationships between organisms. As with many practices that make use of biological data, the inherent noise and complexity challenge phylogeneticists. In this dissertation, I examine three empirical datasets while addressing three possible issues in phylogenetic inference: polyploidy, base composition bias and incomplete lineage sorting. Polyploidy leads to incorrect genes (paralogs) being analyzed, since it is often impossible to distinguish between gene copies generated as a result of polyploidization. My analysis indicates that incorrect assumptions of orthology have led to incorrect conclusions being drawn from phylogenetic studies that include the polyploid salmons (Salmoniformes). Results indicate not only that pikes (Esociformes) and the polyploid salmons are sister taxa, but also that the graylings (Thymallinae) and whitefishes (Coregoninae) are most closely related to each other. Base composition bias misleads inference because overall similarity between sequences results from changes in base composition rather than from shared evolutionary history. Incomplete lineage sorting refers to the fact that the reconstructed relationships of different genes do not agree. Genetic variants may persist through speciation events without being completely "sorted" between lineages, so a methodology is required to reconcile the different genealogies. In two chapters I focused on base composition bias and incomplete lineage sorting in a detailed study of flatfish (Pleuronectiformes) origins.
A major issue in fish phylogenetics is whether flatfish are monophyletic, a hypothesis with poor support from both morphological and molecular data. Often it appears that cranial asymmetry is the only characteristic uniting the group. I found very little evidence for a single evolutionary origin of the extant flatfishes. Base composition bias appears not to be a major contributor to flatfish non-monophyly; however, incomplete lineage sorting likely results in the inability to generate robust statistical support for inferred relationships of flatfishes and relatives. Results of my work indicate that more care should be exercised in phylogenetics in determining orthology of genes. I also find that not acknowledging the presence of paralogs does indeed mislead analyses. With increased data availability and computational capabilities, non-neutral models of nucleotide evolution should be developed and included in further studies. Presenting the heterogeneity of datasets and actively accounting for incomplete lineage sorting will definitely improve the field of phylogenetics as well.
Chapter 1. Introduction -- Chapter 2. Pike and Salmon as Sister Taxa: Detailed Intraclade Resolution and Divergence Time Estimation of Esociformes + Salmoniformes Based on Whole Mitochondrial Genome Sequences -- Chapter 3. Are Flatfishes (Pleuronectiformes) Monophyletic? -- Chapter 4. Mitochondrial Genomic Investigation of Flatfish (Pleuronectiformes) Monophyly -- Chapter 5. Conclusion

    Detecting Clusters of Mutations

    Positive selection for protein function can lead to multiple mutations within a small stretch of DNA, i.e., to a cluster of mutations. Recently, Wagner proposed a method to detect such mutation clusters. His method, however, did not take into account that residues with high solvent accessibility are inherently more variable than residues with low solvent accessibility. Here, we propose a new algorithm to detect clustered evolution. Our algorithm controls for different substitution probabilities at buried and exposed sites in the tertiary protein structure, and uses random permutations to calculate accurate P values for inferred clusters. We apply the algorithm to genomes of bacteria, fly, and mammals, and find several clusters of mutations in functionally important regions of proteins. Surprisingly, clustered evolution is a relatively rare phenomenon. Only between 2% and 10% of the genes we analyze contain a statistically significant mutation cluster. We also find that not controlling for solvent accessibility leads to an excess of clusters in terminal and solvent-exposed regions of proteins. Our algorithm provides a novel method to identify functionally relevant divergence between groups of species. Moreover, it could also be useful to detect artifacts in automatically assembled genomes
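The idea of controlling for solvent accessibility while using random permutations to obtain P values can be illustrated with a stratified permutation test. This is not the authors' algorithm, whose details the abstract does not give; it is a minimal sketch that assumes a sliding-window statistic (the largest number of mutated sites in any window) and shuffles mutation flags only among sites of the same accessibility class. All function names, the window width, and the toy data are hypothetical.

```python
import random


def max_window_count(mutated, width):
    """Largest number of mutated sites in any window of `width` consecutive sites."""
    if len(mutated) <= width:
        return sum(mutated)
    current = sum(mutated[:width])
    best = current
    for i in range(width, len(mutated)):
        current += mutated[i] - mutated[i - width]  # slide the window by one site
        best = max(best, current)
    return best


def shuffle_within_classes(mutated, exposed, rng):
    """Permute mutation flags separately among exposed sites and among buried sites."""
    shuffled = list(mutated)
    for cls in (True, False):
        idx = [i for i, e in enumerate(exposed) if e == cls]
        flags = [mutated[i] for i in idx]
        rng.shuffle(flags)
        for i, flag in zip(idx, flags):
            shuffled[i] = flag
    return shuffled


def cluster_p_value(mutated, exposed, width=5, n_perm=2000, seed=0):
    """Return (observed statistic, permutation P value) for mutation clustering."""
    rng = random.Random(seed)
    observed = max_window_count(mutated, width)
    hits = sum(
        max_window_count(shuffle_within_classes(mutated, exposed, rng), width) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction so the estimated P value is never exactly zero.
    return observed, (hits + 1) / (n_perm + 1)


# Toy protein of 60 sites: alternating exposed/buried, five adjacent mutations.
exposed = [i % 2 == 0 for i in range(60)]
mutated = [1 if 20 <= i < 25 else 0 for i in range(60)]
observed, p_value = cluster_p_value(mutated, exposed)
print(observed, p_value)
```

Because each shuffle preserves the number of mutations in buried and in exposed sites, a cluster is only called significant if it is tighter than expected given the different variability of the two classes.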

    The molecular phylogeny of placental mammals and its application to uncovering signatures of molecular adaptation.

    Considerable conflict remains in the literature as to the position of the root of placental mammals and the placement of several intra-ordinal groups. Debate continues over the use of DNA or amino acid datasets and over the use of Supertree or Supermatrix approaches. Known phenomena within mammal data complicate the reconstruction of phylogeny. These include, but are not limited to, variation in longevity, body size, metabolic rates, and germ-line generation time, which result in variation in mutation rates and composition biases. Previous attempts to resolve the placental mammal phylogeny have used homogeneous evolutionary models that cannot capture and adequately describe these features across the species sampled. In this thesis I explore the properties of different datasets and data types and their suitability for resolving the mammal phylogeny at different depths: (i) the position of the root of the placental mammals, and (ii) the intra-ordinal placements within the Laurasiatheria. The datasets tested were (i) mitochondrial and nuclear data types, (ii) previously published datasets for mammals, and (iii) datasets I assembled specifically for analyses at different phylogenetic depths. I propose the use of heterogeneous models and apply them to these datasets to resolve the position of the root of the placental mammal phylogeny. Reconstruction of a robust mammal phylogeny provides an essential framework for understanding the molecular underpinnings of adaptation to environment. The placental mammals have displayed huge variation in life traits such as longevity, body size and DNA repair efficiency since they emerged ~100 million years ago. With this robust phylogeny, I set out to determine the level of adaptive and non-adaptive processes acting on a set of mammal genes that are linked with longevity and cancer.
The results of these analyses yield important insights into data and model suitability, and provide strong evidence for a single hypothesis for the rooting of placental mammals. These results also show that Laurasiatheria intra-ordinal placements are not fully resolved and that additional sampling from this diverse clade is required. Using this resolved phylogeny, specific molecular adaptations and non-adaptive mechanisms were identified in Mammalia for a set of telomere-associated genes.

    Genome evolution in Prochlorococcus and marine Synechococcus


    Novel bioinformatics programs for taxonomical classification and functional analysis of the whole genome sequencing data of arbuscular mycorrhizal fungi

    Summary [TITLE] Taxonomic classification and position-specific functional analysis of the genomic sequences of arbuscular mycorrhizal fungi and their associated microorganisms [PROBLEM AND CONCEPTUAL FRAMEWORK] Arbuscular mycorrhizal fungi (AMF) are obligate symbionts of the roots of most vascular plants. AMF belong to the phylum Glomeromycota and are considered a primitive fungal lineage that has retained the coenocytic structure of its hyphae and the production of multinucleated asexual spores. Numerous studies have shown that many microorganisms are associated with AMF mycelia, both on the surface of hyphae and spores and inside them. Sequencing the genomes of AMF grown in vivo is a considerable challenge, because the result is a metagenome made up of the genome of the AMF itself and the genomes of its associated microbes. Consequently, identifying the taxonomic origin of each sequence is an extremely arduous task. In my project, I developed two new bioinformatics programs that classify sequences by taxonomic group and identify their functions. I created a database with 444 genomes of species belonging to 54 genera. These bacterial and fungal species were chosen on the basis of their abundance in soils. [METHODOLOGY] The bioinformatics program uses the reference table of microorganisms and statistical methods for the taxonomic classification of sequences. Synonymous codon tables were then created from the secondary structures (SS) in the Protein Data Bank (PDB) for coding sequences (CDS), and from composition motifs for non-coding sequences (non-CDS).
Each table has 3 levels: amino acid characteristics, usage of the corresponding synonymous amino acids, and usage of the corresponding synonymous codons. Compared with existing methods, which use overall average substitution rates regardless of the specific properties of amino acids in different structures, my program provides high-resolution classification for short sequences (150-300 bp), because the biases in synonymous codon usage were extracted from about 8,000 amino acid trimers specific to secondary-structure subunits, with amino acid substitutions taken into account within each specific trimer. For functional analysis, the program dynamically creates comparative data for the 54 microbial genera based on their synonymous codon usage biases for the matching three-codon DNA 9-mers identified in a query sequence. The program applies principal component analysis based on the correlation matrix, combined with k-means clustering, to the comparative data. [OUTCOMES] The correct prediction rates for CDS and non-CDS were 50 to 71% for bacteria and 65 to 73% for fungi, respectively. For AMF, 49% of CDS and 72% of non-CDS were correctly classified. The program allows us to estimate the approximate abundances of the microbial communities associated with AMF. The results of the functional analysis can provide information on important molecular interaction sites involved in sequence diversification and gene evolution. The programs are freely available at www.fungalsesame.org.
Keywords: sesame, sesame PS function, amino acid characteristics, three-codon DNA 9-mers, secondary structure, taxonomic classification, position-specific functional analysis; genetic code; comparative study; mitochondrial genome
Abstract Arbuscular Mycorrhizal Fungi (AMF) are obligate plant-root symbionts belonging to the phylum Glomeromycota. They form coenocytic hyphae and reproduce through large multinucleated asexual spores. Numerous studies have shown that AMF interact closely or loosely with a myriad of microorganisms, particularly bacteria and fungi that live on the surface of or inside their mycelia and spores. Whole genome sequencing (WGS) data of AMF grown in vivo (typically in the roots of a host plant in a pot filled with soil) contain a large number of sequences from microorganisms inhabiting their spores along with their own genome sequences, resulting in a metagenome. The goal of my study was to develop bioinformatics programs for taxonomical classification and for functional analysis of the WGS data of AMF. In metagenomics, there are two main approaches to taxonomical classification: similarity-based (i.e., homology search) and composition-based (i.e., k-mers) methods. The similarity-based method depends solely on bioinformatics sequence databases and homology search programs such as BLAST. It may not be suitable for AMF, an ancient fungal lineage, because bioinformatics databases represent only a small fraction of the diversity of existing microorganisms, and gene prediction programs are highly biased towards intensively studied microorganisms.
Considering that AMF have high inter- and intra-genome variation, in addition to coenocytic and multi-genomic characteristics, probably due to their adaptation via various kinds of symbioses, a composition-based method alone is not an effective solution for AMF, because it relies on base composition biases and focuses on taxonomical classification of prokaryotic organisms. In the first project, I developed a novel bioinformatics program, called SeSaMe (Spore associated Symbiotic Microbes), for taxonomical classification of the WGS data of AMF. I selected microorganisms that were dominant in the soil environment and grouped them into 54 genera, which were used as references. I created a reference sequence database built on a variable called the Three codon DNA 9-mer. These were derived from a large number of structure files from the Protein Data Bank (PDB): approx. 224,000 Three codon DNA 9-mers encoding subunits of protein secondary structures. Based on the reference sequence database, I created genus-specific usage databases containing codon usage and amino acid usage per genus. The program distinguishes between coding sequence (CDS) and non-CDS, detects an open reading frame, and classifies a query sequence into one of the 54 genera used as references. The developed program enables us to estimate relative abundances of taxonomic groups and to assess the symbiotic roles of taxonomic groups associated with AMF. The program can be applied to other microorganisms as well as to soil metagenome data, and has applications in applied environmental microbiology. It is available free of charge at www.fungalsesame.org. In the second project, I developed another bioinformatics program, called SeSaMe PS Function, for position-specific functional analysis of the WGS data of AMF. AMF may contain a large portion of genes with unknown functions for which we may not be able to find homologues in existing sequence databases.
While existing motif annotation programs rely on sequence alignment and are limited in inferring the functionality of novel genes, the developed program uses exploratory data analysis to identify potentially important interaction sites within a query sequence that are structurally and functionally distinct from other subsequences. The program identifies matching Three codon DNA 9-mers in a query sequence and dynamically creates a comparative dataset of the 54 genera, based on codon usage bias information retrieved from the genus-specific usage databases. The program applies correlation Principal Component Analysis in conjunction with the K-means clustering method to the comparative dataset. The program identifies outliers: Three codon DNA 9-mers assigned to a cluster with a single member or only a few members are often outliers with important structures that may play roles in molecular interaction. In the third project, I developed a novel bioinformatics program called Posts (POsition Specific genetic code Tables) that assigns a codon to an amino acid group according to the codon position. The standard genetic code table may be more readily applicable to genes whose genetic codes comply with the standard biological coding rules obtained from model organisms grown under laboratory conditions. However, it may be insufficient for studying the evolution of genetic codes, which may provide important information about codon properties. The mainstream hypotheses of genetic code origin suggest that codon position played important roles in the evolution of genetic codes. As a case study, we investigated irregular codons in 187 mitochondrial genomes of plants, lichen-forming fungi, endophytic fungi, and AMF. Each column of the Post contains 16 codons, and the amino acids encoded by these are called an amino acid characteristics group (A.A. Char Group). Based on the A.A. Char Group, an irregular codon can be classified as within-column type or trans-column type.
The majority of the identified irregular codons belonged to the within-column type. The Post may offer new perspectives on codon properties and codon assignment. The developed program is freely available at www.codon.kr. Taken together, the developed programs, SeSaMe, SeSaMe PS Function, and the Post, provide important research tools for advancing our knowledge of AMF genomics and for studying their symbiotic relations with associated microorganisms. Keywords: Sesame; Spore associated Symbiotic Microbes; Symbiosis; Sesame PS function; Arbuscular mycorrhizal fungi; Three codon DNA 9-mer; Amino acid characteristics; Secondary structure; Taxonomical classification; Position specific functional analysis; Position specific genetic code tables; Post; Comparative study; Mitochondrial genome
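To make the "Three codon DNA 9-mer" idea concrete, the sketch below walks a coding sequence codon by codon, takes each run of three consecutive in-frame codons (9 nucleotides), translates it to an amino acid trimer with the standard genetic code, and tallies which 9-mers encode each trimer. This is an illustration only, not the SeSaMe implementation: the overlapping one-codon step, the function names, and the toy sequence are assumptions, and the real program additionally ties 9-mers to secondary-structure subunits and genus-specific usage databases.

```python
from collections import Counter, defaultdict
from itertools import product

# Standard genetic code, listed in TCAG order for all three codon positions.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}


def three_codon_9mers(cds):
    """Yield (9-mer, amino acid trimer) for each run of three in-frame codons.

    Assumes `cds` starts in frame; windows overlap and advance one codon at a time.
    Windows containing a codon with ambiguous bases are skipped.
    """
    cds = cds.upper()
    codons = [cds[i:i + 3] for i in range(0, len(cds) - len(cds) % 3, 3)]
    for i in range(len(codons) - 2):
        window = codons[i:i + 3]
        if all(codon in CODON_TABLE for codon in window):
            yield "".join(window), "".join(CODON_TABLE[codon] for codon in window)


def usage_by_trimer(cds):
    """Count, for each amino acid trimer, how often each synonymous 9-mer encodes it."""
    usage = defaultdict(Counter)
    for nine_mer, trimer in three_codon_9mers(cds):
        usage[trimer][nine_mer] += 1
    return usage


# Toy coding sequence (not real data): codons ATG GCT AAA ATG GCC AAG.
usage = usage_by_trimer("ATGGCTAAAATGGCCAAG")
for trimer, counts in usage.items():
    print(trimer, dict(counts))
```

In the toy sequence the trimer MAK is encoded by two different 9-mers; differences in such synonymous usage between genera are the kind of bias a usage database could record.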