63 research outputs found

    Current Methods for Automated Filtering of Multiple Sequence Alignments Frequently Worsen Single-Gene Phylogenetic Inference

    Phylogenetic inference is generally performed on the basis of multiple sequence alignments (MSA). Because errors in an alignment can lead to errors in tree estimation, there is a strong interest in identifying and removing unreliable parts of the alignment. In recent years several automated filtering approaches have been proposed, but despite their popularity, a systematic and comprehensive comparison of different alignment filtering methods on real data has been lacking. Here, we extend and apply recently introduced phylogenetic tests of alignment accuracy on a large number of gene families and contrast the performance of unfiltered versus filtered alignments in the context of single-gene phylogeny reconstruction. Based on multiple genome-wide empirical and simulated data sets, we show that the trees obtained from filtered MSAs are on average worse than those obtained from unfiltered MSAs. Furthermore, alignment filtering often leads to an increase in the proportion of well-supported branches that are actually wrong. We confirm that our findings hold for a wide range of parameters and methods. Although our results suggest that light filtering (up to 20% of alignment positions) has little impact on tree accuracy and may save some computation time, contrary to widespread practice, we do not generally recommend the use of current alignment filtering methods for phylogenetic inference. By providing a way to rigorously and systematically measure the impact of filtering on alignments, the methodology set forth here will guide the development of better filtering algorithms.

    Toward community standards in the quest for orthologs

    The identification of orthologs (gene pairs descended from a common ancestor through speciation rather than duplication) has emerged as an essential component of many bioinformatics applications, ranging from the annotation of new genomes to experimental target prioritization. Yet the development and application of orthology inference methods are hampered by the lack of consensus on source proteomes, file formats and benchmarks. The second ‘Quest for Orthologs’ meeting brought together stakeholders from various communities to address these challenges. We report on achievements and outcomes of this meeting, focusing on topics of particular relevance to the research community at large. The Quest for Orthologs consortium is an open community that welcomes contributions from all researchers interested in orthology research and applications.

    Repeat associated mechanisms of genome evolution and function revealed by the Mus caroli and Mus pahari genomes

    Understanding the mechanisms driving lineage-specific evolution in both primates and rodents has been hindered by the lack of sister clades with a similar phylogenetic structure having high-quality genome assemblies. Here, we have created chromosome-level assemblies of the Mus caroli and Mus pahari genomes. Together with the Mus musculus and Rattus norvegicus genomes, this set of rodent genomes is similar in divergence times to the Hominidae (human-chimpanzee-gorilla-orangutan). By comparing the evolutionary dynamics between the Muridae and Hominidae, we identified punctate events of chromosome reshuffling that shaped the ancestral karyotype of Mus musculus and Mus caroli between 3 and 6 million yr ago, but that are absent in the Hominidae. Hominidae show between four- and sevenfold lower rates of nucleotide change and feature turnover in both neutral and functional sequences, suggesting an underlying coherence to the Muridae acceleration. Our system of matched, high-quality genome assemblies revealed how specific classes of repeats can play lineage-specific roles in related species. Recent LINE activity has remodeled protein-coding loci to a greater extent across the Muridae than the Hominidae, with functional consequences at the species level such as reproductive isolation. Furthermore, we charted a Muridae-specific retrotransposon expansion at unprecedented resolution, revealing how a single nucleotide mutation transformed a specific SINE element into an active CTCF binding site carrier specifically in Mus caroli, which resulted in thousands of novel, species-specific CTCF binding sites. Our results show that the comparison of matched phylogenetic sets of genomes will be an increasingly powerful strategy for understanding mammalian biology

    Sequencing of the Sea Lamprey (Petromyzon marinus) Genome Provides Insights into Vertebrate Evolution

    Lampreys are representatives of an ancient vertebrate lineage that diverged from our own ∼500 million years ago. By virtue of this deeply shared ancestry, the sea lamprey (P. marinus) genome is uniquely poised to provide insight into the ancestry of vertebrate genomes and the underlying principles of vertebrate biology. Here, we present the first lamprey whole-genome sequence and assembly. We note challenges faced owing to its high content of repetitive elements and GC bases, as well as the absence of broad-scale sequence information from closely related species. Analyses of the assembly indicate that two whole-genome duplications likely occurred before the divergence of ancestral lamprey and gnathostome lineages. Moreover, the results help define key evolutionary events within vertebrate lineages, including the origin of myelin-associated proteins and the development of appendages. The lamprey genome provides an important resource for reconstructing vertebrate origins and the evolutionary events that have shaped the genomes of extant organisms

    Reconstruction of ancestral vertebrate genomes

    Implicitly, identifying similarities between two genomes amounts to describing an ancestral property that they still share today. The abundance of genomic data from hundreds of different species makes many such comparisons possible, but they are often restricted to two species compared with each other, outside any unified framework and without particular references. This thesis describes a new method, called AGORA (Algorithms for Gene Order Reconstruction in Ancestors), to automatically and systematically reconstruct the gene order and karyotypes of all ancestral species in a given phylogeny. AGORA can handle gene duplications, deletions, and gains, and realistically interprets complex gene phylogenies. We applied the method to 46 sequenced and annotated vertebrate species (using 8 additional species as outgroups) to reconstruct ancestral gene orders in 43 ancestral genomes over nearly 600 million years of evolution. AGORA's performance was measured through simulations of vertebrate genomes and by comparison with already known ancestral genomes. The data, presented graphically on a web server named Genomicus (http://www.dyogen.ens.fr/genomicus), provide a new unified framework in which ancestral genomes can serve as a natural reference against which to compare the modern genomes descended from them. As such, these data provide a new resource for studying the evolution of the organization of genetic information in genomes.
    Biological studies are rarely limited to single-genome analysis and often include several species, thus encompassing an entire window of genome evolution and adding time and evolution as a new dimension to the study. Generally, this includes defining characters of ancestral genomes. In the absence of a comprehensive database of ancestral genomes, such analyses are often repeated. Here we describe a new method, named AGORA (Algorithms for Gene Order Reconstruction in Ancestors), to automatically and systematically reconstruct gene order and karyotypes in all the ancestral species of a given phylogeny. AGORA can handle differences in gene content between species (duplications, gains, and losses) by using accurate gene phylogenies as input. We applied AGORA to 46 sequenced and annotated vertebrate genomes (using 8 outgroup genomes) to reconstruct ancestral gene order in 43 ancestral genomes over a 600-million-year time frame. AGORA's performance was estimated using simulated datasets and by comparison with other studies. The results can be freely browsed and downloaded from a new web server, Genomicus, dedicated to the study of genome evolution, supporting areas such as gene evolution and genome rearrangements.

    Reconstruction de génomes ancestraux chez les vertébrés

    Biological studies are rarely limited to single-genome analysis and often include several species, thus encompassing an entire window of genome evolution and adding time and evolution as a new dimension to the study. Generally, this includes defining characters of ancestral genomes. In the absence of a comprehensive database of ancestral genomes, such analyses are often repeated. Here we describe a new method, named AGORA (Algorithms for Gene Order Reconstruction in Ancestors), to automatically and systematically reconstruct gene order and karyotypes in all the ancestral species of a given phylogeny. AGORA can handle differences in gene content between species (duplications, gains, and losses) by using accurate gene phylogenies as input. We applied AGORA to 46 sequenced and annotated vertebrate genomes (using 8 outgroup genomes) to reconstruct ancestral gene order in 43 ancestral genomes over a 600-million-year time frame. AGORA's performance was estimated using simulated datasets and by comparison with other studies. The results can be freely browsed and downloaded from a new web server, Genomicus, dedicated to the study of genome evolution, supporting areas such as gene evolution and genome rearrangements.
    Comparative genomics is a discipline of biology concerned with the evolution of genomes, studied by comparing their structure and the information they contain across species. Implicitly, identifying similarities between two genomes amounts to describing an ancestral property that they still share today. The abundance of genomic data from hundreds of different species makes many such comparisons possible, but they are often restricted to two species compared with each other, outside any unified framework and without particular references. This thesis describes a new method, called AGORA (Algorithms for Gene Order Reconstruction in Ancestors), to automatically and systematically reconstruct the gene order and karyotypes of all ancestral species in a given phylogeny. AGORA can handle gene duplications, deletions, and gains, and realistically interprets complex gene phylogenies. We applied the method to 46 sequenced and annotated vertebrate species (using 8 additional species as outgroups) to reconstruct ancestral gene orders in 43 ancestral genomes over nearly 600 million years of evolution. AGORA's performance was measured through simulations of vertebrate genomes and by comparison with already known ancestral genomes. The data, presented graphically on a web server named Genomicus (http://www.dyogen.ens.fr/genomicus), provide a new unified framework in which ancestral genomes can serve as a natural reference against which to compare the modern genomes descended from them. As such, these data provide a new resource for studying the evolution of the organization of genetic information in genomes.
