9 research outputs found

    Large scale hierarchical clustering of protein sequences

    Background: Searching a biological sequence database with a query sequence looking for homologues has become a routine operation in computational biology. In spite of the high degree of sophistication of currently available search routines it is still virtually impossible to identify quickly and clearly a group of sequences that a given query sequence belongs to. Results: We report on our developments in grouping all known protein sequences hierarchically into superfamily and family clusters. Our graph-based algorithms take into account the topology of the sequence space induced by the data itself to construct a biologically meaningful partitioning. We have applied our clustering procedures to a non-redundant set of about 1,000,000 sequences resulting in a hierarchical clustering which is being made available for querying and browsing at http://systers.molgen.mpg.de/. Conclusions: Comparisons with other widely used clustering methods on various data sets show the abilities and strengths of our clustering methods in producing a biologically meaningful grouping of protein sequences

    Genome-Wide Comparative Gene Family Classification

    Correct classification of genes into gene families is important for understanding gene function and evolution. Although gene families of many species have been resolved both computationally and experimentally with high accuracy, gene family classification in most newly sequenced genomes has not been done to the same high standard. This project has been designed to develop a strategy to effectively and accurately classify gene families across genomes. We first examine and compare the performance of computer programs developed for automated gene family classification. We demonstrate that some programs, including the hierarchical average-linkage clustering algorithm MC-UPGMA and the popular Markov clustering algorithm TRIBE-MCL, can reconstruct manual curation of gene families accurately. However, their performance is highly sensitive to parameter settings, i.e. different gene families require different program parameters for correct resolution. To circumvent the problem of parameterization, we have developed a comparative strategy for gene family classification. This strategy takes advantage of existing curated gene families of reference species to find suitable parameters for classifying genes in related genomes. To demonstrate the effectiveness of this novel strategy, we use TRIBE-MCL to classify chemosensory and ABC transporter gene families in C. elegans and its four sister species. We conclude that fully automated programs can establish biologically accurate gene families if parameterized accordingly. Comparative gene family classification finds optimal parameters automatically, thus allowing rapid insights into gene families of newly sequenced species.
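
The comparative strategy in this abstract can be read as a parameter sweep: cluster the reference species' genes under each candidate parameter, score each result against the curated families, and reuse the best-scoring parameter for related genomes. The following is a minimal sketch under stated assumptions, not the authors' implementation: `cluster_fn` is a hypothetical stand-in for a clustering program such as TRIBE-MCL, and the pairwise F1 score is an illustrative choice, since the abstract does not say how agreement with curation is measured.

```python
from itertools import combinations


def co_clustered_pairs(clusters):
    """Return the set of unordered gene pairs that share a cluster."""
    pairs = set()
    for members in clusters:
        for a, b in combinations(sorted(members), 2):
            pairs.add((a, b))
    return pairs


def pairwise_f1(predicted, curated):
    """Agreement between two partitions, measured on co-clustered pairs.

    Assumption: pairwise F1 is used here only for illustration.
    """
    pred = co_clustered_pairs(predicted)
    gold = co_clustered_pairs(curated)
    true_positives = len(pred & gold)
    if true_positives == 0:
        return 0.0
    precision = true_positives / len(pred)
    recall = true_positives / len(gold)
    return 2 * precision * recall / (precision + recall)


def select_parameter(genes, curated, cluster_fn, candidates):
    """Pick the candidate parameter whose clustering best matches curation.

    cluster_fn(genes, param) is a hypothetical callable that returns a list
    of sets of gene identifiers.
    """
    best_param, best_score = None, -1.0
    for param in candidates:
        score = pairwise_f1(cluster_fn(genes, param), curated)
        if score > best_score:
            best_param, best_score = param, score
    return best_param, best_score
```

The parameter chosen on the reference species would then be passed to the same clustering program when classifying genes of a related, newly sequenced genome.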

    Graph-based methods for large-scale protein classification and orthology inference

    The quest for understanding how proteins evolve and function has been a prominent and costly human endeavor. With advances in genomics and use of bioinformatics tools, the diversity of proteins in present day genomes can now be studied more efficiently than ever before. This thesis describes computational methods suitable for large-scale protein classification of many proteomes of diverse species. Specifically, we focus on methods that combine unsupervised learning (clustering) techniques with the knowledge of molecular phylogenetics, particularly that of orthology. In chapter 1 we introduce the biological context of protein structure, function and evolution, review the state-of-the-art sequence-based protein classification methods, and then describe methods used to validate the predictions. Finally, we present the outline and objectives of this thesis. Evolutionary (phylogenetic) concepts are instrumental in studying subjects as diverse as the diversity of genomes, cellular networks, protein structures and functions, and functional genome annotation. In particular, the detection of orthologous proteins (genes) across genomes provides reliable means to infer biological functions and processes from one organism to another. Chapter 2 evaluates the available computational tools, such as algorithms and databases, used to infer orthologous relationships between genes from fully sequenced genomes. We discuss the main caveats of large-scale orthology detection in general as well as the merits and pitfalls of each method in particular. We argue that establishing true orthologous relationships requires a phylogenetic approach which combines both trees and graphs (networks), reliable species phylogeny, genomic data for more than two species, and an insight into the processes of molecular evolution. Also proposed is a set of guidelines to aid researchers in selecting the correct tool. 
Moreover, this review motivates further research in developing reliable and scalable methods for functional and phylogenetic classification of large protein collections. Chapter 3 proposes a framework in which various protein knowledge-bases are combined into a unique network of mappings (links), thereby allowing comparisons to be made between expert curated and fully-automated protein classifications from a single entry point. We developed an integrated annotation resource for protein orthology, ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap), to help researchers and database annotators who often need to assess the coherence of proposed annotations and/or group assignments, as well as users of high-throughput methodologies (e.g., microarrays or proteomics) who deal with partially annotated genomic data. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences which is mapped to 240,000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF using a fast and fully automated sequence-based mapping approach. The ProGMap database is equipped with a web interface that enables queries to be made using synonymous sequence identifiers, gene symbols, protein functions, and amino acid or nucleotide sequences. It also incorporates services, namely BLAST similarity search and QuickMatch identity search, for finding sequences similar (or identical) to a query sequence, and tools for presenting the results in graphic form. Graphs (networks) have gained increasing attention in contemporary biology because they have enabled complex biological systems and processes to be modeled and better understood. For example, protein similarity networks constructed of all-versus-all sequence comparisons are frequently used to delineate similarity groups, such as protein families or orthologous groups in comparative genomics studies. 
Chapter 4.1 presents a benchmark study of freely available graph software used for this purpose. Specifically, the computational complexity of the programs is investigated using both simulated and biological networks. We show that most available software is not suitable for large networks, such as those encountered in large-scale proteome analyses, because of the high demands on computational resources. To address this, we developed fast and memory-efficient graph software, netclust (http://www.bioinformatics.nl/netclust/), which can scale to large protein networks, such as those constructed of millions of proteins and sequence similarities, on a standard computer. An extended version of this program called Multi-netclust is presented in chapter 4.2. This tool can find connected clusters of data represented by different network data sets. It uses user-defined threshold values to combine the data sets in such a way that clusters connected in all or in either of the networks can be retrieved efficiently. Automated protein sequence clustering is an important task in genome annotation projects and phylogenomic studies. During the past years, several protein clustering programs have been developed for delineating protein families or orthologous groups from large sequence collections. However, most of these programs have not been benchmarked systematically, in particular with respect to the trade-off between computational complexity and biological soundness. In chapter 5 we evaluate the three best-known algorithms on different protein similarity networks and validation (or 'gold' standard) data sets to find out which one can scale to hundreds of proteomes and still delineate high quality similarity groups at the minimum computational cost. 
For this, a reliable partition-based approach was used to assess the biological soundness of predicted groups using known protein functions, manually curated protein/domain families and orthologous groups available in expert-curated databases. Our benchmark results support the view that a simple and computationally cheap method such as netclust can perform similarly to, and in some cases even better than, more sophisticated yet much more costly methods. Moreover, we introduce an efficient graph-based method that can delineate protein orthologs of hundreds of proteomes into hierarchical similarity groups de novo. The validity of this method is demonstrated on data obtained from 347 prokaryotic proteomes. The resulting hierarchical protein classification is not only in agreement with manually curated classifications but also provides an enriched framework in which the functional and evolutionary relationships between proteins can be studied at various levels of specificity. Finally, in chapter 6 we summarize the main findings and discuss the merits and shortcomings of the methods developed herein. We also propose directions for future research. The ever increasing flood of new sequence data makes it clear that we need improved tools to be able to handle and extract relevant (orthological) information from these protein data. This thesis summarizes these needs and how they can be addressed by the available tools, or be improved by the new tools that were developed in the course of this research.
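
The idea of retrieving connected clusters from one or several thresholded similarity networks can be illustrated with a small union-find sketch. This is not the netclust or Multi-netclust implementation. It assumes each network is a dictionary mapping a protein pair to a similarity score, and it interprets "connected in all" and "connected in either" as the intersection and union of the thresholded edge sets, which is one plausible reading of the abstract. All function names are hypothetical.

```python
def connected_clusters(nodes, edges):
    """Group nodes into connected components using union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        # Follow parent links to the root, halving the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for u, v in edges:
        root_u, root_v = find(u), find(v)
        if root_u != root_v:
            parent[root_u] = root_v

    clusters = {}
    for n in nodes:
        clusters.setdefault(find(n), set()).add(n)
    return sorted(sorted(members) for members in clusters.values())


def threshold_edges(weighted, cutoff):
    """Keep undirected edges whose score reaches the user-defined cutoff."""
    return {
        frozenset((u, v))
        for (u, v), score in weighted.items()
        if score >= cutoff and u != v
    }


def multi_network_clusters(nodes, networks, cutoffs, mode="all"):
    """Cluster on edges present in all networks or in any of them.

    Assumption: "all" means edge intersection and "either" means edge union.
    """
    if not networks:
        raise ValueError("at least one network is required")
    edge_sets = [threshold_edges(net, c) for net, c in zip(networks, cutoffs)]
    if mode == "all":
        combined = set.intersection(*edge_sets)
    elif mode == "either":
        combined = set.union(*edge_sets)
    else:
        raise ValueError("mode must be 'all' or 'either'")
    return connected_clusters(nodes, [tuple(edge) for edge in combined])
```

Because union-find only stores one parent pointer per node, memory grows with the number of proteins rather than with the number of similarities, which is the kind of property that matters for networks of millions of sequences.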

    Identification of similar reaction mechanisms in homologous enzymes of different function using conserved sequence domains

    Enzymes are extraordinarily efficient biocatalysts and, as such, accelerate nearly all biochemical reactions in biological systems. New enzymes do not arise de novo but evolve stepwise through modification of existing enzymes. For this reason, the reactions of the cell's basic metabolism can, despite their diversity, be traced back to relatively few basic types. This fact has been partly taken into account in the EC classification of enzymes. In general, however, assignment to EC classes is not based on common ancestry or similar reaction mechanisms, but predominantly on enzymological criteria such as reaction specificity and substrate specificity. Consequently, enzymes of the same EC class frequently show no structural similarity to one another, which implies that these enzymes arose by convergence rather than divergence, while conversely enzymes of common evolutionary origin often belong to entirely different EC classes. The latter led to the assumption that enzymes can have quite different functions despite common ancestry. There are indications, however, that these enzymes use similar reaction mechanisms to realize their different functions. While the EC classification meets all the requirements placed on it, there is thus a need for an alternative, complementary classification system that is based not on an empirical division of the observed reactions but on the evolutionary relatedness of the enzymes, and that consequently permits inferences about the underlying reaction mechanisms. This dissertation investigated whether a classification of enzymes based on sequence homology correlates with the reaction mechanisms the enzymes use. 
The aim was the systematic clustering and analysis of all known enzyme sequences in order to identify shared or similar enzyme mechanisms. A prerequisite for addressing the problem was the development of a method for identifying modular proteins, which consist of several sequence domains that are often evolutionarily independent. Because such modular enzymes can show similarity to different enzyme families in different regions, they frequently imply an apparent, but in fact nonexistent, co-occurrence of enzyme activities in a sequence cluster. The domain structure was determined from the position and extent of local sequence alignments. The sequence regions determined in this way were then grouped according to their sequence similarity into groups of homologous sequence segments. Cluster analysis was used for this purpose. The analysis was carried out at various thresholds in order to obtain a hierarchical structuring of sequence space. This showed that, depending on the threshold used, up to 40% of the generated sequence clusters contained enzymes of different enzyme classes, in some cases even of different EC main classes. The analysis showed, however, that in all cases examined, despite catalysis that appears different at first glance, the reaction mechanism or else the substrate specificity of these reactions is very similar.

    Comparative Genome Analysis of Malaria Parasite Species

    With over 200 million infections and up to one million deaths every year, malaria remains one of the most devastating infectious diseases affecting humans. Over the last few years, complete genome sequences of both human and non-human malaria parasite species have become available, adding comparative genomics to the toolbox of molecular biologists to study the genetic basis of human virulence. In this thesis, I computationally compared the published genomes of seven malaria parasite species with the aim of gaining new insights into genes underlying human virulence. This comparison was performed using two complementary approaches. In the first approach, I used whole-genome synteny analysis to find genes present in human but not non-human malaria parasites. In the second approach, I first clustered virulence-associated genes into gene families and then examined these gene families for species-specific differences. Both comparisons resulted in interesting gene lists. Synteny analysis identified three key enzymes of the thiamine (vitamin B1) biosynthesis pathway to be present in human but not rodent malaria parasites, indicating that these two groups of parasites differ in their ability to synthesize vitamin B1 de novo. My gene family classification exposed, within the largest and highly divergent surface antigen gene family pir, a group of unusually well-conserved orthologs, which should be considered as high-priority targets for experimental characterization and vaccine development. In conclusion, this thesis highlights genes and pathways that are different between human and non-human malaria parasites and therefore could play important roles in human virulence. Experimental studies can now be initiated to confirm virulence-associated functions and to explore their potential value for drug and vaccine development.

    Fifth Biennial Report : June 1999 - August 2001


    The evolution of the mammal placenta — a computational approach to the identification and analysis of placenta-specific genes and microRNAs.

    The presence of a placenta is an important synapomorphy that defines the mammal clade. From the fossil record we know that the first placental mammal lived approximately 125 million years ago, with the chorioallantoic placenta evolving not long after. In this thesis a set of 22 complete genomes from Eutherian, non-Eutherian and outgroup species are compared, the aim being to identify protein-coding and regulatory alterations that are likely to be implicated in the emergence of the mammal placenta in the fossil record. To this end we have examined the roles played by positive selection and miRNA regulation in the evolution of the placenta. We have identified those genes that underwent functional shift uniquely in the ancestral placental mammal lineage and that are also heavily implicated in disorders of the placenta. Carrying out a thorough analysis of non-coding regions of the 22 genomes included in the study, we identified a cohort of miRNAs that exist only in placental mammals. Many of the placenta-related genes described above have multiple predicted “placenta-specific” miRNA binding sites. Together these results indicate a role for both adaptation in protein-coding regions and emergence of novel noncoding regulators in the origin and evolution of mammal placentation.