14 research outputs found

    Identifying orthologs with OMA: A primer [version 1; peer review: 2 approved]

    The Orthologous Matrix (OMA) is a method and database that allows users to identify orthologs among many genomes. OMA provides three different types of orthologs: pairwise orthologs, OMA Groups and Hierarchical Orthologous Groups (HOGs). This Primer is organized in two parts. In the first part, we provide all the necessary background information to understand the concepts of orthology, how we infer orthologs and the different subtypes of orthology in OMA, as well as what types of analyses they should be used for. In the second part, we describe protocols for using the OMA browser to find a specific gene and its various types of orthologs. By the end of the Primer, readers should be able to (i) understand homology and the different types of orthologs reported in OMA; (ii) understand the best type of orthologs to use for a particular analysis; (iii) find particular genes of interest in the OMA browser; and (iv) identify orthologs for a given gene. The data can be freely accessed from the OMA browser at https://omabrowser.org.

    Phylogenetic profiling: how much input data is enough?

    Phylogenetic profiling is a well-established approach for predicting gene function based on patterns of gene presence and absence across species. Most recent developments have focused on methodological improvements, but relatively little is known about the effect of input data size on the quality of predictions. In this work, we ask: how many genomes and functional annotations need to be considered for phylogenetic profiling to be effective? Phylogenetic profiling generally benefits from an increased amount of input data. However, by decomposing this improvement in predictive accuracy in terms of the contribution of additional genomes and of additional annotations, we observed diminishing returns in adding more than ~100 genomes, whereas increasing the number of annotations remained strongly beneficial throughout. We also observed that maximising phylogenetic diversity within a clade of interest improves predictive accuracy, but the effect is small compared to changes in the number of genomes under comparison. Finally, we show that these findings are supported in light of the Open World Assumption, which posits that functional annotation databases are inherently incomplete. All the tools and data used in this work are available for reuse from http://lab.dessimoz.org/14_phylprof. Scripts used to analyse the data are available on request from the authors.
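
    The core idea of phylogenetic profiling described in this abstract can be illustrated with a minimal sketch. It is not the authors' method: the gene names and presence/absence vectors are hypothetical toy data, and Jaccard similarity is just one simple way to compare profiles.

```python
def jaccard(a, b):
    """Jaccard similarity between two binary presence/absence profiles."""
    both = sum(1 for x, y in zip(a, b) if x and y)
    either = sum(1 for x, y in zip(a, b) if x or y)
    return both / either if either else 0.0


# Hypothetical toy profiles: one entry per genome,
# 1 = gene family present in that genome, 0 = absent.
profiles = {
    "geneA": [1, 1, 0, 1, 0, 1],
    "geneB": [1, 1, 0, 1, 0, 0],
    "geneC": [0, 0, 1, 0, 1, 0],
}


def most_similar(query, profiles):
    """Return (name, score) of the gene whose profile is closest to the query's."""
    scores = {
        name: jaccard(profiles[query], prof)
        for name, prof in profiles.items()
        if name != query
    }
    best = max(scores, key=scores.get)
    return best, scores[best]


if __name__ == "__main__":
    print(most_similar("geneA", profiles))
```

    In such a scheme, a gene of unknown function would tentatively inherit annotations from the genes with the most similar profiles; adding genomes lengthens each vector, while adding annotations enlarges the pool of labelled genes to borrow from.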

    The evolutionary signal in metagenome phyletic profiles predicts many gene functions

    Background. The function of many genes is still not known even in model organisms. An increasing availability of microbiome DNA sequencing data provides an opportunity to infer gene function in a systematic manner. Results. We evaluated whether the evolutionary signal contained in metagenome phyletic profiles (MPP) is predictive of a broad array of gene functions. The MPPs are an encoding of environmental DNA sequencing data that consists of relative abundances of gene families across metagenomes. We find that such MPPs can accurately predict 826 Gene Ontology functional categories, while drawing on human gut microbiomes, ocean metagenomes, and DNA sequences from various other engineered and natural environments. Overall, in this task, the MPPs are highly accurate, and moreover they provide coverage for a set of Gene Ontology terms largely complementary to standard phylogenetic profiles, derived from fully sequenced genomes. We also find that metagenomes approximated from taxon relative abundance obtained via 16S rRNA gene sequencing may provide surprisingly useful predictive models. Crucially, the MPPs derived from different types of environments can infer distinct, non-overlapping sets of gene functions and therefore complement each other. Consistently, simulations on >5000 metagenomes indicate that the amount of data is not in itself critical for maximizing predictive accuracy, while the diversity of sampled environments appears to be the critical factor for obtaining robust models. Conclusions. In past work, metagenomics has provided invaluable insight into the ecology of various habitats, into the diversity of microbial life and also into human health and disease mechanisms. We propose that environmental DNA sequencing additionally constitutes a useful tool to predict biological roles of genes, yielding inferences out of reach for existing comparative genomics approaches.

    Evolution of spatiotemporal organization of biological systems : origins and phenotypic impact of duplicated genes

    A flood of genome sequences, together with other large-scale studies characterizing molecular functions, has allowed researchers to carry out comparative studies of functional components and their interactions across a large number of species. The insights gained in this way can be investigated further and, by means of orthology (homology arising from speciation), transferred to newly sequenced species in order to learn about the evolution of molecular functions and their organization. Robust orthology is a prerequisite for accurate phylogenomic and comparative analyses. Although the field of orthology research has made progress, orthology prediction is still inconsistent and uncertain. For this reason, quality-control tests should be introduced. In this work, a phylogeny-based dataset was developed with which orthology prediction in the Animalia can be assessed. This dataset was used to evaluate the orthology predictions of five publicly accessible repositories and to examine the effects of a number of technical and biological factors. At the same time, the large number of completely sequenced genomes has led to the formulation of interesting hypotheses about the mechanisms by which molecular functions evolve. For example, paralogs, that is, homologs that arose through gene or genome duplication, have been associated with the expansion and partitioning of molecular functions. Many studies have been carried out to find out how duplicated genes that are associated with morphological changes alter their expression levels in different tissues. However, it has not yet been investigated on a large scale whether the regulatory divergence of paralogs favours particular patterns and how these patterns arose. To investigate this, expression data from 31 human tissues were used and preferred tissue combinations of sub(neo)functionalized paralogs were identified. Interestingly, paralogs that are associated with the chordate-to-vertebrate transition and were already present before the ancestral vertebrate frequently diverge between brain and non-brain tissues. In contrast to the extensive literature on the evolution of tissues and paralogy, the role of gene duplication in the temporal regulation of biological systems is less well studied. To investigate this, orthology and gene expression data were combined. We found that the cell cycle and other periodic processes (such as circadian and ultradian rhythms) are regulated by paralogs. The functional repertoire of these paralogs differs among three eukaryotic species (Arabidopsis, human and yeast), which implies that the temporal regulation of cells by paralogs evolved independently in the three organisms. In summary, the greatest challenge of the post-genomic era is the effective integration of functionally relevant genomic data in order to find out how complex traits evolved. To achieve this goal, the dynamic changes in gene inventories should be studied with regard to the relationship between orthologs (same origin) and paralogs (potential for divergence).

    Combining learning and constraints for genome-wide protein annotation

    Background. The advent of high-throughput experimental techniques paved the way to genome-wide computational analysis and predictive annotation studies. When considering the joint annotation of a large set of related entities, like all proteins of a certain genome, many candidate annotations could be inconsistent, or very unlikely, given the existing knowledge. A sound predictive framework capable of accounting for this type of constraints in making predictions could substantially contribute to the quality of machine-generated annotations at a genomic scale. Results. We present Ocelot, a predictive pipeline which simultaneously addresses functional and interaction annotation of all proteins of a given genome. The system combines sequence-based predictors for functional and protein-protein interaction (PPI) prediction with a consistency layer enforcing (soft) constraints as fuzzy logic rules. The enforced rules represent the available prior knowledge about the classification task, including taxonomic constraints over each GO hierarchy (e.g. a protein labeled with a GO term should also be labeled with all ancestor terms) as well as rules combining interaction and function prediction. An extensive experimental evaluation on the Yeast genome shows that the integration of prior knowledge via rules substantially improves the quality of the predictions. The system largely outperforms GoFDR, the only high-ranking system at the last CAFA challenge with a readily available implementation, when GoFDR is given access to intra-genome information only (as Ocelot), and has comparable or better results (depending on the hierarchy and performance measure) when GoFDR is allowed to use information from other genomes. Our system also compares favorably to recent methods based on deep learning.
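
    The taxonomic constraint mentioned in this abstract (a protein labeled with a GO term should also carry all ancestor terms) can be illustrated with a small sketch. This is not Ocelot's implementation: Ocelot is described as enforcing soft constraints through fuzzy logic rules, whereas the code below applies the rule as a hard closure over a hypothetical toy hierarchy with made-up term names.

```python
def ancestors(term, parents):
    """Collect all ancestors of `term` in a DAG given as {child: [parents]}."""
    seen = set()
    stack = list(parents.get(term, []))
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents.get(current, []))
    return seen


def propagate(labels, parents):
    """Close a set of predicted terms under the ancestor relation."""
    closed = set(labels)
    for term in labels:
        closed |= ancestors(term, parents)
    return closed


# Hypothetical toy hierarchy (child -> parents); real GO identifiers are not used.
parents = {
    "kinase_activity": ["catalytic_activity"],
    "catalytic_activity": ["molecular_function"],
    "atp_binding": ["binding"],
    "binding": ["molecular_function"],
}

if __name__ == "__main__":
    print(sorted(propagate({"kinase_activity"}, parents)))
```

    A prediction set that is not closed in this sense is inconsistent with the hierarchy, which is the kind of inconsistency a consistency layer is meant to discourage.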

    Xenacoelomorpha: The "simple" key to bilaterian ancestry?

    Xenacoelomorpha (comprising Xenoturbellida, Acoela and Nemertodermatida) is a clade of marine worms whose position in the tree of life is still under debate. Several phylogenetic analyses have placed them either at the base of all bilaterian animals (e.g. chordates, arthropods) or at a more derived position as sister group to the Ambulacraria (echinoderms and hemichordates) within the Bilateria. A key characteristic is the absence of traits found in other bilaterian animals. Orthogroups are groups of orthologous genes found in several organisms. Orthologues are assumed to retain the same function. These functions would be specific to the clade where an orthogroup is prevalent. I investigate a method to automatically establish and validate orthogroups specific to Bilateria, Protostomia and Deuterostomia. These genes could be relevant for the clades’ respective emergence and differences. These sets will also help to ascertain what genes/functions are absent from Xenacoelomorpha. MicroRNAs (miRNAs) are small non-coding RNA molecules involved in RNA silencing and post-transcriptional regulation of gene expression. MiRNAs have not been extensively studied in the Xenacoelomorpha. I introduce a fully automatic miRNA detection pipeline to infer and confirm the existence of pre-miRNA sequences in the genome of Xenoturbella bocki as well as predict miRNA candidates from several xenacoel genomes. I report previously undetected miRNA families and opine that previous analyses on Acoelomorpha failed due to loss caused by the higher evolutionary rate when compared to the Xenoturbellida.

    Interactomics-Based Functional Analysis: Using Interaction Conservation To Probe Bacterial Protein Functions

    The emergence of genomics as a discrete field of biology has changed humanity’s understanding of our relationship with bacteria. Sequencing the genome of each newly-discovered bacterial species can reveal novel gene sequences, though the genome may contain genes coding for hundreds or thousands of proteins of unknown function (PUFs). In some cases, these coding sequences appear to be conserved across nearly all bacteria. Exploring the functional roles of these cases ideally requires an integrative, cross-species approach involving not only gene sequences but knowledge of interactions among their products. Protein interactions, studied at genome scale, extend genomics into the field of interactomics. I have employed novel computational methods to provide context for bacterial PUFs and to leverage the rich genomic, proteomic, and interactomic data available for hundreds of bacterial species. The methods employed in this study began with sets of protein complexes. I initially hypothesized that, if protein interactions reveal protein functions and interactions are frequently conserved through protein complexes, then conserved protein functions should be revealed through the extent of conservation of protein complexes and their components. The subsequent analyses revealed how partial protein complex conservation may, unexpectedly, be the rule rather than the exception. Next, I expanded the analysis by combining sets of thousands of experimental protein-protein interactions. Progressing beyond the scope of protein complexes into interactions across full proteomes revealed novel evolutionary consistencies across bacteria but also exposed deficiencies among interactomics-based approaches. I have concluded this study with an expansion beyond bacterial protein interactions and into those involving bacteriophage-encoded proteins. This work concerns emergent evolutionary properties among bacterial proteins. It is primarily intended to serve as a resource for microbiologists but is relevant to any research into evolutionary biology. As microbiomes and their occupants become increasingly critical to human health, similar approaches may become increasingly necessary.

    Evolutionary genomics : statistical and computational methods

    This open access book addresses the challenge of analyzing and understanding the evolutionary dynamics of complex biological systems at the genomic level, and elaborates on some promising strategies that would bring us closer to uncovering the vital relationships between genotype and phenotype. After a few educational primers, the book continues with sections on sequence homology and alignment, phylogenetic methods to study genome evolution, methodologies for evaluating selective pressures on genomic sequences as well as genomic evolution in light of protein domain architecture and transposable elements, population genomics and other omics, and discussions of current bottlenecks in handling and analyzing genomic data. Written for the highly successful Methods in Molecular Biology series, chapters include the kind of detail and expert implementation advice that lead to the best results. Authoritative and comprehensive, Evolutionary Genomics: Statistical and Computational Methods, Second Edition aims to serve both novices in biology with strong statistics and computational skills, and molecular biologists with a good grasp of standard mathematical concepts, in moving this important field of study forward.