762 research outputs found

Orthology prediction at scalable resolution by phylogenetic tree analysis

Author: Huynen Martijn A
Snel Berend
van der Heijden René TJM
van Noort Vera
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Orthology is one of the cornerstones of gene function prediction. Dividing the phylogenetic relations between genes into either orthologs or paralogs is, however, an oversimplification. Even in two-species gene phylogenies, the complicated, non-transitive nature of phylogenetic relations results in inparalogs and outparalogs. For situations with more than two species we lack semantics to specifically describe the phylogenetic relations, let alone to exploit them. Published procedures to extract orthologous groups from phylogenetic trees do not allow identification of orthology at various levels of resolution, nor do they document the relations between the orthologous groups. RESULTS: We introduce "levels of orthology" to describe the multi-level nature of gene relations. This is implemented in a program, LOFT (Levels of Orthology From Trees), that assigns hierarchical orthology numbers to genes based on a phylogenetic tree. To decide upon speciation and gene duplication events in a tree, LOFT can be instructed either to perform classical species-tree reconciliation or to use the species overlap between partitions in the tree. The hierarchical orthology numbers assigned by LOFT effectively summarize the phylogenetic relations between genes. The resulting high-resolution orthologous groups are depicted in colour, facilitating visual inspection of (large) trees. A benchmark for orthology prediction that takes into account the varying levels of orthology between genes shows that the phylogeny-based high-resolution orthology assignments made by LOFT are reliable. CONCLUSION: The "levels of orthology" concept offers high-resolution, reliable orthology while preserving the relations between orthologous groups. A Windows as well as a preliminary Java version of LOFT is available from the LOFT website.
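
The species-overlap rule and the hierarchical numbering mentioned in this abstract can be illustrated with a small sketch. This is not LOFT's actual implementation: the `Node` class, the function names, and the numbering scheme (extending the label with `.1`, `.2` at each duplication) are assumptions made for illustration only. A node is treated as a duplication when its two child partitions share at least one species, and as a speciation otherwise.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Set


@dataclass
class Node:
    """Hypothetical binary gene-tree node; leaves carry a gene and a species."""
    children: List["Node"] = field(default_factory=list)
    gene: Optional[str] = None
    species: Optional[str] = None


def species_of(node: Node) -> Set[str]:
    """Collect the species found in the leaves below a node."""
    if not node.children:
        return {node.species}
    found: Set[str] = set()
    for child in node.children:
        found |= species_of(child)
    return found


def label_event(node: Node) -> str:
    """Species-overlap rule: shared species between the two partitions
    is taken as evidence of a duplication, otherwise a speciation."""
    left, right = node.children
    return "duplication" if species_of(left) & species_of(right) else "speciation"


def assign_levels(node: Node, label: str = "1",
                  out: Optional[Dict[str, str]] = None) -> Dict[str, str]:
    """Assumed numbering scheme: extend the label at every duplication,
    keep it unchanged across speciations."""
    if out is None:
        out = {}
    if not node.children:
        out[node.gene] = label
        return out
    if label_event(node) == "duplication":
        for i, child in enumerate(node.children, start=1):
            assign_levels(child, f"{label}.{i}", out)
    else:
        for child in node.children:
            assign_levels(child, label, out)
    return out
```

Under this scheme, genes that share a full label are orthologs at the finest level, and genes that share only a label prefix are related through an older duplication.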

Lirias

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Radboud Repository

eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges

Author: Arnold Roland
Bork Peer
Doerks Tobias
Jensen Lars J.
Kuhn Michael
Letunic Ivica
Muller Jean
Powell Sean
Rattei Thomas
Roth Alexander
Szklarczyk Damian
Trachana Kalliopi
von Mering Christian
Publication date: 02/08/2017

Orthologous relationships form the basis of most comparative genomic and metagenomic studies and are essential for proper phylogenetic and functional analyses. The third version of the eggNOG database (http://eggnog.embl.de) contains non-supervised orthologous groups constructed from 1133 organisms, doubling the number of genes with orthology assignment compared to eggNOG v2. The new release is the result of a number of improvements and expansions: (i) the underlying homology searches are now based on the SIMAP database; (ii) the orthologous groups have been extended to 41 levels of selected taxonomic ranges enabling much more fine-grained orthology assignments; and (iii) the newly designed web page is considerably faster with more functionality. In total, eggNOG v3 contains 721 801 orthologous groups, encompassing a total of 4 396 591 genes. Additionally, we updated 4873 and 4850 original COGs and KOGs, respectively, to include all 1133 organisms. At the universal level, covering all three domains of life, 101 208 orthologous groups are available, while the others are applicable at 40 more limited taxonomic ranges. Each group is amended by multiple sequence alignments and maximum-likelihood trees and broad functional descriptions are provided for 450 904 orthologous groups (62.5%

RERO DOC Digital Library

eggNOG v2.0: extending the evolutionary genealogy of genes with enhanced non-supervised orthologous groups, species and functional annotations

Author: A. Roth
Altschul
Aurrecoechea
Berglund
C. von Mering
D. Szklarczyk
Datta
Edgar
Eyre
Felsenstein
Finn
Fitch
Gilbert
Guindon
Harris
Hubbard
Huerta-Cepas
I. Letunic
J. Muller
Jensen
Jensen
Kanehisa
Katoh
Koonin
Kriventseva
Kuhn
Kuzniar
L. J. Jensen
Letunic
Letunic
Li
Loytynoja
M. Kuhn
Makarova
P. Bork
P. Julien
Pruitt
Roth
S. Powell
Saebo
Sonnhammer
Swarbreck
T. Doerks
Tatusov
Tatusov
Thompson
Thompson
Uchiyama
van der Heijden
Vilella
Wapinski
Waterhouse
Zmasek
Publication venue: Oxford University Press
Publication date: 01/01/2010

The identification of orthologous relationships forms the basis for most comparative genomics studies. Here, we present the second version of the eggNOG database, which contains orthologous groups (OGs) constructed through identification of reciprocal best BLAST matches and triangular linkage clustering. We applied this procedure to 630 complete genomes (529 bacteria, 46 archaea and 55 eukaryotes), which is a 2-fold increase relative to the previous version. The pipeline yielded 224 847 OGs, including 9724 extended versions of the original COG and KOG. We computed OGs for different levels of the tree of life; in addition to the species groups included in our first release (i.e. fungi, metazoa, insects, vertebrates and mammals), we have now constructed OGs for archaea, fishes, rodents and primates. We automatically annotate the non-supervised orthologous groups (NOGs) with functional descriptions, protein domains, and functional categories as defined initially for the COG/KOG database. In-depth analysis is facilitated by precomputed high-quality multiple sequence alignments and maximum-likelihood trees for each of the available OGs. Altogether, eggNOG covers 2 242 035 proteins (built from 2 590 259 proteins) and provides a broad functional description for at least 1 966 709 (88%) of them. Users can access the complete set of orthologous groups via a web interface at: http://eggnog.embl.de
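
As a rough illustration of the "reciprocal best BLAST matches" step named in this abstract, the sketch below finds reciprocal best hits from a table of directional similarity scores. It is not the eggNOG pipeline: the input layout (`scores` keyed by `(query, target)`, `genome_of` mapping each gene to its genome) and the function names are assumptions, and the triangular linkage clustering step is not shown.

```python
from typing import Dict, FrozenSet, Optional, Set, Tuple

Scores = Dict[Tuple[str, str], float]


def best_hit(scores: Scores, query: str, target_genome: str,
             genome_of: Dict[str, str]) -> Optional[str]:
    """Return the highest-scoring target of `query` within `target_genome`."""
    candidates = {
        target: score
        for (q, target), score in scores.items()
        if q == query and genome_of[target] == target_genome
    }
    return max(candidates, key=candidates.get) if candidates else None


def reciprocal_best_hits(scores: Scores,
                         genome_of: Dict[str, str]) -> Set[FrozenSet[str]]:
    """Keep a cross-genome pair only if each gene is the other's best hit."""
    pairs: Set[FrozenSet[str]] = set()
    for query, target in scores:
        gq, gt = genome_of[query], genome_of[target]
        if gq == gt:
            continue  # only pairs from different genomes are considered
        if (best_hit(scores, query, gt, genome_of) == target
                and best_hit(scores, target, gq, genome_of) == query):
            pairs.add(frozenset((query, target)))
    return pairs
```

For example, if genes `a1` and `a2` both hit `b1` best but `b1` hits `a1` best, only the pair `a1`/`b1` is retained.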

Crossref

PubMed Central

UCL Discovery

Copenhagen University Research Information System

ZORA

MDC Repository

Genome-wide signatures of complex introgression and adaptive evolution in the big cats.

Author: Antunes Agostinho
Assis Juliana
Azevedo Fernando CC
Bi Ke
Brassaloti Ricardo A
Coutinho Luiz L
Eizirik Eduardo
Fernandes Gabriel
Figueiró Henrique V
Gabaldón Toni
Hughes Graham M
Kantek Daniel
Komissarov Aleksey
Li Gang
Linderoth Tyler
Loska Damian
Morato Ronaldo G
Murphy William J
Nielsen Rasmus
Nunes Adauto LV
O'Brien Stephen J
Oliveira Guilherme
Pais Fabiano
Ramalho Emiliano
Rodrigues Maíra R
Santos Sarah HD
Saragüeta Patricia
Silveira Leandro
Teeling Emma C
Teixeira Rodrigo HF
Trinca Cristine S
Trindade Fernanda J
Villela Priscilla MS
Publication venue: eScholarship, University of California
Publication date: 01/01/2017

The great cats of the genus Panthera comprise a recent radiation whose evolutionary history is poorly understood. Their rapid diversification poses challenges to resolving their phylogeny while offering opportunities to investigate the historical dynamics of adaptive divergence. We report the sequence, de novo assembly, and annotation of the jaguar (Panthera onca) genome, a novel genome sequence for the leopard (Panthera pardus), and comparative analyses encompassing all living Panthera species. Demographic reconstructions indicated that all of these species have experienced variable episodes of population decline during the Pleistocene, ultimately leading to small effective sizes in present-day genomes. We observed pervasive genealogical discordance across Panthera genomes, caused by both incomplete lineage sorting and complex patterns of historical interspecific hybridization. We identified multiple signatures of species-specific positive selection, affecting genes involved in craniofacial and limb development, protein metabolism, hypoxia, reproduction, pigmentation, and sensory perception. There was remarkable concordance in pathways enriched in genomic segments implicated in interspecies introgression and in positive selection, suggesting that these processes were connected. We tested this hypothesis by developing exome capture probes targeting ~19,000 Panthera genes and applying them to 30 wild-caught jaguars. We found at least two genes (DOCK3 and COL4A5, both related to optic nerve development) bearing significant signatures of interspecies introgression and within-species positive selection. These findings indicate that post-speciation admixture has contributed genetic material that facilitated the adaptive evolution of big cat lineages

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

CONICET Digital

eScholarship - University of California

NSU Works

Graph-based methods for large-scale protein classification and orthology inference

Author: Kuzniar A.
Publication venue: S.n.
Publication date: 01/01/2009

The quest for understanding how proteins evolve and function has been a prominent and costly human endeavor. With advances in genomics and use of bioinformatics tools, the diversity of proteins in present day genomes can now be studied more efficiently than ever before. This thesis describes computational methods suitable for large-scale protein classification of many proteomes of diverse species. Specifically, we focus on methods that combine unsupervised learning (clustering) techniques with the knowledge of molecular phylogenetics, particularly that of orthology. In chapter 1 we introduce the biological context of protein structure, function and evolution, review the state-of-the-art sequence-based protein classification methods, and then describe methods used to validate the predictions. Finally, we present the outline and objectives of this thesis. Evolutionary (phylogenetic) concepts are instrumental in studying subjects as diverse as the diversity of genomes, cellular networks, protein structures and functions, and functional genome annotation. In particular, the detection of orthologous proteins (genes) across genomes provides reliable means to infer biological functions and processes from one organism to another. Chapter 2 evaluates the available computational tools, such as algorithms and databases, used to infer orthologous relationships between genes from fully sequenced genomes. We discuss the main caveats of large-scale orthology detection in general as well as the merits and pitfalls of each method in particular. We argue that establishing true orthologous relationships requires a phylogenetic approach which combines both trees and graphs (networks), reliable species phylogeny, genomic data for more than two species, and an insight into the processes of molecular evolution. Also proposed is a set of guidelines to aid researchers in selecting the correct tool. 
Moreover, this review motivates further research in developing reliable and scalable methods for functional and phylogenetic classification of large protein collections. Chapter 3 proposes a framework in which various protein knowledge-bases are combined into a unique network of mappings (links), which allows comparisons to be made between expert-curated and fully automated protein classifications from a single entry point. We developed an integrated annotation resource for protein orthology, ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap), to help researchers and database annotators who often need to assess the coherence of proposed annotations and/or group assignments, as well as users of high-throughput methodologies (e.g., microarrays or proteomics) who deal with partially annotated genomic data. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240,000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF using a fast and fully automated sequence-based mapping approach. The ProGMap database is equipped with a web interface that enables queries to be made using synonymous sequence identifiers, gene symbols, protein functions, and amino acid or nucleotide sequences. It also incorporates services, namely BLAST similarity search and QuickMatch identity search, for finding sequences similar (or identical) to a query sequence, and tools for presenting the results in graphic form. Graphs (networks) have gained increasing attention in contemporary biology because they have enabled complex biological systems and processes to be modeled and better understood. For example, protein similarity networks constructed from all-versus-all sequence comparisons are frequently used to delineate similarity groups, such as protein families or orthologous groups in comparative genomics studies. 
Chapter 4.1 presents a benchmark study of freely available graph software used for this purpose. Specifically, the computational complexity of the programs is investigated using both simulated and biological networks. We show that most available software is not suitable for large networks, such as those encountered in large-scale proteome analyses, because of the high demands on computational resources. To address this, we developed a fast and memory-efficient graph program, netclust (http://www.bioinformatics.nl/netclust/), which can scale to large protein networks, such as those constructed from millions of proteins and sequence similarities, on a standard computer. An extended version of this program called Multi-netclust is presented in chapter 4.2. This tool can find connected clusters of data represented in different network data sets. It uses user-defined threshold values to combine the data sets in such a way that clusters connected in all or in either of the networks can be retrieved efficiently. Automated protein sequence clustering is an important task in genome annotation projects and phylogenomic studies. During the past years, several protein clustering programs have been developed for delineating protein families or orthologous groups from large sequence collections. However, most of these programs have not been benchmarked systematically, in particular with respect to the trade-off between computational complexity and biological soundness. In chapter 5 we evaluate the three best-known algorithms on different protein similarity networks and validation (or 'gold' standard) data sets to find out which one can scale to hundreds of proteomes and still delineate high-quality similarity groups at the minimum computational cost. 
For this, a reliable partition-based approach was used to assess the biological soundness of predicted groups using known protein functions, manually curated protein/domain families and orthologous groups available in expert-curated databases. Our benchmark results support the view that a simple and computationally cheap method such as netclust can perform similarly to, and in some cases even better than, more sophisticated yet much more costly methods. Moreover, we introduce an efficient graph-based method that can delineate protein orthologs of hundreds of proteomes into hierarchical similarity groups de novo. The validity of this method is demonstrated on data obtained from 347 prokaryotic proteomes. The resulting hierarchical protein classification is not only in agreement with manually curated classifications but also provides an enriched framework in which the functional and evolutionary relationships between proteins can be studied at various levels of specificity. Finally, in chapter 6 we summarize the main findings and discuss the merits and shortcomings of the methods developed herein. We also propose directions for future research. The ever-increasing flood of new sequence data makes it clear that we need improved tools to be able to handle and extract relevant (orthological) information from these protein data. This thesis summarizes these needs and how they can be addressed by the available tools, or be improved by the new tools that were developed in the course of this research.
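
The idea of retrieving connected clusters from a thresholded similarity network, as described in this abstract, can be sketched with a union-find structure. This is only an assumed illustration and not the netclust code: the function name, the edge format `(u, v, weight)`, and the single `threshold` parameter are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple


def connected_clusters(edges: Iterable[Tuple[str, str, float]],
                       threshold: float) -> List[List[str]]:
    """Group nodes joined by edges whose weight is at least `threshold`."""
    parent: Dict[str, str] = {}

    def find(x: str) -> str:
        # Register unseen nodes, then follow parents with path halving.
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v, weight in edges:
        root_u, root_v = find(u), find(v)  # also registers both nodes
        if weight >= threshold and root_u != root_v:
            parent[root_u] = root_v

    clusters: Dict[str, List[str]] = {}
    for node in list(parent):
        clusters.setdefault(find(node), []).append(node)
    return sorted(sorted(members) for members in clusters.values())
```

Because each edge is visited once and union-find operations are nearly constant time, an approach of this kind needs memory proportional to the number of nodes rather than the number of edges, which is one plausible reason such simple methods scale to large networks.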

Wageningen University & Research Publications

eggNOG v3.0: orthologous groups covering 1133 organisms at 41 different taxonomic ranges

Author: A. Roth
Altenhoff
C. von Mering
Chen
Chen
Ciccarelli
Creevey
D. Szklarczyk
Eisen
Gabaldon
Hulsen
I. Letunic
J. Muller
K. Trachana
Koonin
Kuzniar
L. J. Jensen
Linard
M. Kuhn
Makarova
Milinkovitch
P. Bork
Pearson
R. Arnold
S. Powell
T. Doerks
T. Rattei
Tatusov
Tatusov
Trachana
van der Heijden
von Mering
Wapinski
Publication venue: Oxford University Press
Publication date: 01/01/2012

Crossref

University of Birmingham Research Portal

PubMed Central

Copenhagen University Research Information System

ZORA

MDC Repository

Inferring Hierarchical Orthologous Groups

Author: Train Clément
Publication venue: Université de Lausanne, Faculté de biologie et médecine
Publication date: 01/01/2019

The reconstruction of ancestral evolutionary histories is the cornerstone of most phylogenetic analyses. Many applications are possible once the evolutionary history is unveiled, such as identifying taxonomically restricted genes (genome barcoding), predicting the function of unknown genes from the gene ontologies of their evolutionarily related genes, identifying gene losses and gene gains among gene families, or pinpointing the time in evolution when particular gene families emerged (sometimes referred to as “phylostratigraphy”). Typically, the reconstruction of evolutionary histories is limited to the inference of evolutionary relationships (homology, orthology, paralogy) and basic clustering of these orthologs. In this thesis, we adopted the concept of Hierarchical Orthologous Groups (HOGs), introduced a decade ago, and proposed several improvements, both to their inference and to their use in biological analyses such as the aforementioned applications. In addition, HOGs are a powerful framework to investigate ancestral genomes since HOGs convey information regarding gene family evolution (gene losses, gene duplications or gene gains). In this thesis, an ancestral genome at a given taxonomic level denotes the last common ancestor genome for the related taxon and its hypothetical ancestral gene composition and gene order (synteny). The ancestral gene composition and ancestral synteny for a given ancestral genome provide valuable information to study genome evolution in terms of genomic rearrangement (duplication, translocation, deletion, inversion) or of gene family evolution (variation of gene function, accelerated gene evolution, duplication-rich clades). This thesis identifies three major open challenges that compose my three research arcs. First, inferring HOGs is complex and computationally demanding, meaning that robust and scalable algorithms are mandatory to generate good-quality HOGs in a reasonable time. 
Second, benchmarking orthology clustering without knowing the true evolutionary history is a difficult task, which requires appropriate benchmark strategies. Third, the lack of tools to handle HOGs limits their applications. In the first arc of the thesis, I proposed two new algorithmic refinements to improve orthology inference in order to produce orthologs less sensitive to gene fragmentation and to imbalances in the rate of evolution among paralogous copies. In addition, I introduced version 2.0 of the GETHOGs algorithm, which infers HOGs in a bottom-up fashion and has been shown to be both faster and more accurate. In the second arc, I proposed new strategies to benchmark the reconstruction of gene families using detailed case studies based on evidence from multiple sequence alignments along with reconstructed gene trees, and to benchmark orthology using a simulation framework that provides full control of the evolutionary genomic setup. This work highlights the main challenges in current methods. Third, I created pyHam (python HOG analysis method), iHam (interactive HOG analysis method) and GTM (Graph - Tree - Multiple sequence alignment), a collection of tools to process, manipulate and visualise HOGs. pyHam offers an easy way to handle and work with HOGs using simple Python code. Embedded at its heart are two visualisation tools to synthesise HOG-derived information: iHam, which allows interactive browsing of HOG structure, and a tree-based visualisation called tree profile, which pinpoints evolutionary events induced by the HOGs on a species tree. In addition, I developed GTM, an interactive web-based visualisation tool that combines, for a given gene family (or set of genes), the related sequences, gene tree and orthology graph. In this thesis, I show that HOGs are a useful framework for phylogenetics, with considerable work done to produce robust and scalable inferences. 
Another important aspect is that our inferences are benchmarked using manual case studies and automated verification using simulation or reference Quest for Orthologs Benchmarks. Lastly, one of the major advances was the conception and implementation of tools to manipulate and visualise HOGs. Such tools have already proven useful when investigating HOGs for developmental reasons or for downstream analysis. Ultimately, the HOG framework is amenable to integration of all aspects which can reasonably be expected to have evolved along the history of genes and ancestral genome reconstruction. -- [Translated from the French abstract] The reconstruction of ancestral evolutionary history is the cornerstone of most phylogenetic analyses. Many applications become possible once the evolutionary history is revealed, such as identifying taxonomically restricted genes (genome barcoding), predicting the function of unknown genes from the ontologies of evolutionarily related genes, identifying the loss or emergence of genes within gene families, or dating the emergence of gene families in the course of evolution (phylostratigraphy). Generally, the reconstruction of evolutionary history is limited to the inference of evolutionary relationships (homology, orthology, paralogy) and the construction of simple groups of orthologs. In this thesis, we adopt the concept of Hierarchical Orthologous Groups (HOGs), introduced more than 10 years ago, and propose several improvements both to their inference and to their use in the aforementioned biological analyses. This thesis aims to identify the three major challenges that make up my three research axes. 
First, inferring HOGs is complex and requires substantial computational power, which makes it necessary to create algorithms that are robust and efficient in time and space in order to keep producing results of rigorous quality in a reasonable time. Second, assessing the quality of ortholog clustering is a difficult task when the true evolutionary history is unknown, which calls for suitable quality-control strategies. Third, the lack of tools for manipulating HOGs limits their use and their applications. In the first axis of my thesis, I propose two new improvements to the orthology inference algorithm to mitigate the sensitivity of the inference to gene fragmentation and to asymmetry in the rate of evolution among paralogs. In addition, I introduce version 2.0 of the GETHOGs algorithm, which uses a new bottom-up approach to produce results faster and more accurately. In the second axis, I propose new strategies for assessing the quality of gene family reconstruction through manual case studies based on evidence from multiple sequence alignments and gene tree reconstructions, and for assessing the quality of orthology by simulating genome evolution so that the genetic material produced can be fully controlled. This work highlights the main problems of current methods. In the last axis, I present pyHam, iHam and GTM, a set of tools that I created to ease the manipulation and visualisation of HOGs using simple Python programming. 
Two visualisation tools are built directly into pyHam to synthesise the information conveyed by HOGs: iHam, which allows interactive navigation through HOGs, and another visualisation called “tree profile”, which uses a species tree on which the evolutionary events contained in the HOGs are located. In addition, I developed GTM, an interactive web tool that combines, for a given gene family (or set of genes), their aligned sequences, their gene tree and the related orthology graph. In this thesis, I show that the HOG concept is useful to phylogenetics and that considerable work has been done to improve their inference in a robust and fast manner. Another important point is that the quality of our inferences is checked through manual case studies or by using the Quest for Orthologs Benchmark, a reference in orthology quality control. Finally, one of the major advances proposed is the design and implementation of tools to visualise and manipulate HOGs. These tools are already in use both for studying HOGs with the aim of improving their quality and for using them in biological analyses. To conclude, all aspects that appear to have evolved in connection with the evolutionary history of genes or ancestral genomes can be integrated into the HOG concept.

Serveur académique lausannois

eggNOG v4.0: nested orthology inference across 3686 organisms

Author: Bork Peer
Creevey Chris
Forslund Kristoffer
Gabaldón Toni
Huerta-Cepas Jaime
Jensen Lars J.
Kuhn Michael
Powell Sean
Rattei Thomas
Roth Alexander
Szklarczyk Damian
Trachana Kalliopi
von Mering Christian
Publication date: 02/08/2017

With the increasing availability of various ‘omics data, high-quality orthology assignment is crucial for evolutionary and functional genomics studies. We here present the fourth version of the eggNOG database (available at http://eggnog.embl.de) that derives nonsupervised orthologous groups (NOGs) from complete genomes, and then applies a comprehensive characterization and analysis pipeline to the resulting gene families. Compared with the previous version, we have more than tripled the underlying species set to cover 3686 organisms, keeping track with genome project completions while prioritizing the inclusion of high-quality genomes to minimize error propagation from incomplete proteome sets. Major technological advances include (i) a robust and scalable procedure for the identification and inclusion of high-quality genomes, (ii) provision of orthologous groups for 107 different taxonomic levels compared with 41 in eggNOGv3, (iii) identification and annotation of particularly closely related orthologous groups, facilitating analysis of related gene families, (iv) improvements of the clustering and functional annotation approach, (v) adoption of a revised tree building procedure based on the multiple alignments generated during the process and (vi) implementation of quality control procedures throughout the entire pipeline. As in previous versions, eggNOGv4 provides multiple sequence alignments and maximum-likelihood trees, as well as broad functional annotation. Users can access the complete database of orthologous groups via a web interface, as well as through bulk downloa

RERO DOC Digital Library