208 research outputs found

    Sequence- and structure-based approaches to deciphering enzyme evolution in the Haloalkanoate Dehalogenase superfamily

    Understanding how changes in the functional requirements of the cell select for changes in protein sequence and structure is a fundamental challenge in molecular evolution. This dissertation delineates some of the underlying evolutionary forces using the Haloalkanoate Dehalogenase Superfamily (HADSF) as a model system. HADSF members have a unique cap-core architecture, with the Rossmann-fold core domain accessorized by variable cap-domain insertions (delineated by length, topology, and point of insertion). To identify the boundaries of variable domain insertions in protein sequences, I developed a comprehensive computational strategy (CapPredictor, or CP) using a novel sequence alignment algorithm in conjunction with a structure-guided sequence profile. Analysis of more than 40,000 HADSF sequences led to the following observations: (i) cap-type classes exhibit similar distributions across different phyla, indicating the existence of all cap types in the last universal common ancestor, and (ii) comparative analysis of the predicted cap type and functional diversity indicated that cap type does not dictate the divergence of substrate recognition and chemical pathway, and hence biological function. By analyzing a unique dataset of core- and cap-domain-only protein structures, I investigated the consequences of the accessory cap domain for the sequence-structure relationship of the core domain. The relationship between sequence and structure divergence in the core fold was shown to be monotonic and independent of the corresponding cap type. However, core domains with the same cap type bore a greater similarity to one another than core domains with different cap types, suggesting coevolution of the cap and core domains. Remarkably, only a few degrees of freedom are needed to describe the structural diversity of the Rossmann fold, accounting for the majority of the observed structural variance.
Finally, I examined the location and role of conserved residue positions and co-evolving residue pairs in the core domain in the context of the cap domain. Positions critical for function were conserved, while non-conserved positions mapped to highly mobile regions. Notably, we found an exponential dependence of co-variance on inter-residue distance. Collectively, these novel algorithms and analyses contribute to an improved understanding of enzyme evolution, especially in the context of the use of domain insertions to expand substrate specificity and chemical mechanism.
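The exponential dependence of co-variance on inter-residue distance reported above can be estimated with a simple log-linear least-squares fit. The sketch below is illustrative only, not the dissertation's actual procedure; the function and variable names are hypothetical:

```python
import math

def fit_exponential_decay(distances, covariances):
    """Fit c(d) = a * exp(-d / lam) by least squares on log(c) versus d.

    Assumes all covariance values are positive. Returns (a, lam):
    amplitude and decay length. Illustrative sketch only.
    """
    ys = [math.log(c) for c in covariances]
    n = len(distances)
    mx = sum(distances) / n
    my = sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in distances)
    sxy = sum((x - mx) * (y - my) for x, y in zip(distances, ys))
    slope = sxy / sxx
    intercept = my - slope * mx
    return math.exp(intercept), -1.0 / slope
```

On synthetic data generated with a known decay length, the fit recovers the parameters, which is a quick sanity check before applying it to real covariance estimates.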

    How to identify pathogenic mutations among all those variations: Variant annotation and filtration in the genome sequencing era

    High-throughput sequencing technologies have become fundamental to the identification of disease-causing mutations in human genetic diseases, in both research and clinical testing contexts. The cumulative number of genes linked to rare diseases is now close to 3,500, with more than 1,000 genes identified between 2010 and 2014 thanks to the early adoption of exome sequencing technologies. However, despite these encouraging figures, the success rate of clinical exome diagnosis remains low due to several factors, including incorrect variant annotation and non-optimal filtration practices, which may lead to misinterpretation of disease-causing mutations. In this review, we describe the critical steps of the variant annotation and filtration processes used to highlight a handful of potential disease-causing mutations for downstream analysis. We report the key annotation elements to gather at multiple levels for each mutation, and which systems are designed to help collect this mandatory information. We describe the filtration options, their efficiency, and their limits, then provide a generic filtration workflow and highlight potential pitfalls through a use case.
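A core filtration step of the kind this review describes, removing common variants and keeping those with damaging predicted consequences, can be sketched as follows. The field names, the consequence-term subset, and the 1% frequency cutoff are illustrative assumptions, not the review's recommendations:

```python
# Consequence terms follow Sequence Ontology naming; this selection is
# an illustrative subset, not an exhaustive or recommended list.
DAMAGING_CONSEQUENCES = {
    "stop_gained",
    "frameshift_variant",
    "splice_acceptor_variant",
    "splice_donor_variant",
    "missense_variant",
}

def filter_candidates(variants, max_allele_freq=0.01):
    """Keep variants that are rare in the population AND predicted damaging.

    Each variant is a dict with an 'af' (population allele frequency) and
    a 'consequence' annotation; both keys are hypothetical placeholders.
    """
    return [
        v for v in variants
        if v.get("af", 0.0) <= max_allele_freq
        and v.get("consequence") in DAMAGING_CONSEQUENCES
    ]
```

Real pipelines layer further filters on top of this (inheritance model, genotype quality, in silico pathogenicity scores); the point here is only the rare-plus-damaging intersection.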

    The Human Gene Mutation Database: towards a comprehensive repository of inherited mutation data for medical research, genetic diagnosis and next-generation sequencing studies

    The Human Gene Mutation Database (HGMD®) constitutes a comprehensive collection of published germline mutations in nuclear genes that underlie, or are closely associated with, human inherited disease. At the time of writing (March 2017), the database contained in excess of 203,000 different gene lesions identified in over 8,000 genes, manually curated from over 2,600 journals. With new mutation entries currently accumulating at a rate exceeding 17,000 per annum, HGMD represents the de facto central unified gene/disease-oriented repository of heritable mutations causing human genetic disease, used worldwide by researchers, clinicians, diagnostic laboratories and genetic counsellors, and is an essential tool for the annotation of next-generation sequencing data. The public version of HGMD (http://www.hgmd.org) is freely available to registered users from academic institutions and non-profit organisations, whilst the subscription version (HGMD Professional) is available to academic, clinical and commercial users under license via QIAGEN Inc.

    Functional Annotations of Paralogs: A Blessing and a Curse

    Gene duplication followed by mutation is a classic mechanism of neofunctionalization, producing gene families with functional diversity. In some cases, a single point mutation is sufficient to change the substrate specificity and/or the chemistry performed by an enzyme, making it difficult to accurately separate enzymes with identical functions from homologs with different functions. Because sequence similarity is often used as a basis for assigning functional annotations to genes, non-isofunctional gene families pose a great challenge for genome annotation pipelines. Here we describe how integrating evolutionary and functional information, such as genome context, phylogeny, metabolic reconstruction and signature motifs, may be required to correctly annotate multifunctional families. These integrative analyses can also lead to the discovery of novel gene functions, as hints from specific subgroups can guide the functional characterization of other members of the family. We demonstrate how careful manual curation using comparative genomics can disambiguate subgroups within large multifunctional families and discover their functions. We present the COG0720 protein family as a case study. We also discuss strategies to automate this process to improve the accuracy of genome functional annotation pipelines.

    The Evolution of Function in the Rab family of Small GTPases

    Dissertation presented to obtain the PhD degree in Computational Biology. The question of how protein function evolves is a fundamental problem with profound implications for both functional and evolutionary studies of proteins. Here, we review some of the work that has addressed or contributed to this question. We identify and comment on three different levels relevant to the evolution of protein function. First, biochemistry. This is the focus of our discussion, as protein function itself commonly receives the least attention in studies of protein evolution. (...)

    Development and Integration of Informatic Tools for Qualitative and Quantitative Characterization of Proteomic Datasets Generated by Tandem Mass Spectrometry

    Shotgun proteomic experiments provide qualitative and quantitative analytical information from biological samples ranging in complexity from simple bacterial isolates to higher eukaryotes, such as plants and humans, and even to communities of microbial organisms. Improvements to instrument performance, sample preparation, and informatic tools are increasing the scope and volume of data that can be analyzed by mass spectrometry (MS). To accommodate these advances, it is becoming increasingly essential to choose and/or create tools that not only scale well but also make more informed decisions using additional features within the data. Incorporating novel and existing tools into a scalable, modular workflow not only provides more accurate, contextualized perspectives of processed data, but also generates detailed, standardized outputs that can be used for future studies dedicated to mining general analytical or biological features, anomalies, and trends. This research developed cyber-infrastructure that allows a user to seamlessly run multiple analyses, store the results, and share processed data with other users. The work represented in this dissertation demonstrates the successful implementation of an enhanced bioinformatics workflow designed to analyze raw data generated directly by MS instruments and to create fully annotated reports of qualitative and quantitative protein information for large-scale proteomics experiments. Answering these questions requires several points of engagement between informatics and an analytical understanding of the underlying biochemistry of the system under observation. Deriving meaningful information from analytical data can be achieved by linking together the concerted efforts of more focused, logistical questions. This study focuses on the following aspects of proteomics experiments: spectrum-to-peptide matching, peptide-to-protein mapping, and protein quantification and differential expression. The interaction and usability of these analyses and other existing tools are also described. By constructing a workflow that allows high-throughput processing of massive datasets, data collected within the past decade can be standardized and updated with the most recent analyses.
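The peptide-to-protein mapping and quantification steps named above can be illustrated with a minimal spectral-counting aggregator. The shared-peptide counting policy here is a deliberately simple assumption for illustration, not the dissertation's actual algorithm:

```python
from collections import defaultdict

def spectral_counts(psm_peptides, peptide_to_proteins):
    """Aggregate peptide-spectrum matches (PSMs) into per-protein counts.

    psm_peptides: one peptide string per identified spectrum.
    peptide_to_proteins: mapping from peptide to the proteins containing it.
    Shared peptides are credited to every mapped protein here; real
    pipelines usually apply parsimony or razor-peptide rules instead.
    """
    counts = defaultdict(int)
    for peptide in psm_peptides:
        for protein in peptide_to_proteins.get(peptide, ()):
            counts[protein] += 1
    return dict(counts)
```

Per-protein counts of this kind are the input to downstream normalization and differential-expression testing.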

    Graph-based methods for large-scale protein classification and orthology inference

    The quest to understand how proteins evolve and function has been a prominent and costly human endeavor. With advances in genomics and the use of bioinformatics tools, the diversity of proteins in present-day genomes can now be studied more efficiently than ever before. This thesis describes computational methods suitable for large-scale protein classification across many proteomes of diverse species. Specifically, we focus on methods that combine unsupervised learning (clustering) techniques with knowledge from molecular phylogenetics, particularly that of orthology. In chapter 1 we introduce the biological context of protein structure, function and evolution, review the state-of-the-art sequence-based protein classification methods, and then describe the methods used to validate the predictions. Finally, we present the outline and objectives of this thesis. Evolutionary (phylogenetic) concepts are instrumental in studying subjects as diverse as the diversity of genomes, cellular networks, protein structures and functions, and functional genome annotation. In particular, the detection of orthologous proteins (genes) across genomes provides a reliable means to infer biological functions and processes from one organism to another. Chapter 2 evaluates the available computational tools, such as algorithms and databases, used to infer orthologous relationships between genes from fully sequenced genomes. We discuss the main caveats of large-scale orthology detection in general, as well as the merits and pitfalls of each method in particular. We argue that establishing true orthologous relationships requires a phylogenetic approach which combines both trees and graphs (networks), a reliable species phylogeny, genomic data for more than two species, and an insight into the processes of molecular evolution. Also proposed is a set of guidelines to aid researchers in selecting the correct tool.
Moreover, this review motivates further research into developing reliable and scalable methods for the functional and phylogenetic classification of large protein collections. Chapter 3 proposes a framework in which various protein knowledge-bases are combined into a unique network of mappings (links), which allows comparisons to be made between expert-curated and fully automated protein classifications from a single entry point. We developed an integrated annotation resource for protein orthology, ProGMap (Protein Group Mappings, http://www.bioinformatics.nl/progmap), to help researchers and database annotators who often need to assess the coherence of proposed annotations and/or group assignments, as well as users of high-throughput methodologies (e.g., microarrays or proteomics) who deal with partially annotated genomic data. ProGMap is based on a non-redundant dataset of over 6.6 million protein sequences, which is mapped to 240,000 protein group descriptions collected from UniProt, RefSeq, Ensembl, COG, KOG, OrthoMCL-DB, HomoloGene, TRIBES and PIRSF using a fast and fully automated sequence-based mapping approach. The ProGMap database is equipped with a web interface that enables queries to be made using synonymous sequence identifiers, gene symbols, protein functions, and amino acid or nucleotide sequences. It also incorporates services, namely BLAST similarity search and QuickMatch identity search, for finding sequences similar (or identical) to a query sequence, and tools for presenting the results in graphic form. Graphs (networks) have gained increasing attention in contemporary biology because they enable complex biological systems and processes to be modeled and better understood. For example, protein similarity networks constructed from all-versus-all sequence comparisons are frequently used to delineate similarity groups, such as protein families or orthologous groups, in comparative genomics studies.
Chapter 4.1 presents a benchmark study of freely available graph software used for this purpose. Specifically, the computational complexity of the programs is investigated using both simulated and biological networks. We show that most available software is not suitable for large networks, such as those encountered in large-scale proteome analyses, because of the high demands on computational resources. To address this, we developed fast and memory-efficient graph software, netclust (http://www.bioinformatics.nl/netclust/), which can scale to large protein networks, such as those constructed from millions of proteins and sequence similarities, on a standard computer. An extended version of this program, called Multi-netclust, is presented in chapter 4.2. This tool can find connected clusters of data presented by different network data sets. It uses user-defined threshold values to combine the data sets in such a way that clusters connected in all, or in any, of the networks can be retrieved efficiently. Automated protein sequence clustering is an important task in genome annotation projects and phylogenomic studies. In past years, several protein clustering programs have been developed for delineating protein families or orthologous groups from large sequence collections. However, most of these programs have not been benchmarked systematically, in particular with respect to the trade-off between computational complexity and biological soundness. In chapter 5 we evaluate the three best-known algorithms on different protein similarity networks and validation (or 'gold standard') data sets to find out which one can scale to hundreds of proteomes and still delineate high-quality similarity groups at minimum computational cost.
For this, a reliable partition-based approach was used to assess the biological soundness of predicted groups using known protein functions, manually curated protein/domain families, and orthologous groups available in expert-curated databases. Our benchmark results support the view that a simple and computationally cheap method such as netclust can perform similarly to, and in some cases even better than, more sophisticated yet much more costly methods. Moreover, we introduce an efficient graph-based method that can delineate protein orthologs of hundreds of proteomes into hierarchical similarity groups de novo. The validity of this method is demonstrated on data obtained from 347 prokaryotic proteomes. The resulting hierarchical protein classification is not only in agreement with manually curated classifications but also provides an enriched framework in which the functional and evolutionary relationships between proteins can be studied at various levels of specificity. Finally, in chapter 6 we summarize the main findings and discuss the merits and shortcomings of the methods developed herein. We also propose directions for future research. The ever-increasing flood of new sequence data makes it clear that we need improved tools to handle and extract relevant (orthology) information from these protein data. This thesis summarizes these needs and how they can be addressed by the available tools, or improved by the new tools developed in the course of this research.
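The kind of clustering netclust performs, finding connected components of a similarity network after thresholding edge scores (single linkage), can be sketched with a union-find structure. This is a simplified illustration, not the actual netclust implementation:

```python
from collections import defaultdict

def cluster_by_similarity(edges, threshold):
    """Connected components of a similarity graph after discarding edges
    whose score falls below the threshold (single-linkage clustering).

    edges: iterable of (node_a, node_b, similarity_score) triples.
    """
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for a, b, score in edges:
        ra, rb = find(a), find(b)  # registers both nodes
        if score >= threshold:
            parent[ra] = rb        # merge the two components

    clusters = defaultdict(set)
    for node in parent:
        clusters[find(node)].add(node)
    return list(clusters.values())
```

Because only the parent map is kept in memory, the approach scales linearly in the number of edges, which is the property that matters for all-versus-all similarity networks of millions of sequences.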

    Using machine-learning-driven approaches to boost hot-spot's knowledge

    Understanding protein–protein interactions (PPIs) is fundamental to describing and characterizing the formation of biomolecular assemblies, and to establishing the energetic principles underlying biological networks. One key aspect of these interfaces is the existence and prevalence of hot-spot (HS) residues that, upon mutation to alanine, negatively impact the formation of such protein–protein complexes. HS have been widely considered in research, both in case studies and in a few large-scale predictive approaches. This review aims to present the current knowledge on PPIs, providing a detailed understanding of the microspecifications of the residues involved in those interactions and the characteristics of those defined as HS, through a thorough assessment of related field-specific methodologies. We explore recent, accurate artificial-intelligence-based techniques, which are progressively replacing well-established classical energy-based methodologies. This article is categorized under: Data Science > Databases and Expert Systems; Structure and Mechanism > Computational Biochemistry and Biophysics; Molecular and Statistical Mechanics > Molecular Interactions.
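The operational definition of a hot-spot used in alanine-scanning studies, a large change in binding free energy upon mutation to alanine, fits in a few lines. The 2.0 kcal/mol cutoff below is a convention commonly used in the hot-spot literature, not a value taken from this review:

```python
def label_hot_spots(ddg_by_residue, cutoff=2.0):
    """Classify interface residues from alanine-scanning ddG values (kcal/mol).

    Residues whose mutation to alanine raises the binding free energy by
    at least `cutoff` are labelled hot-spots; the rest are null-spots.
    The cutoff is a common literature convention, not from this review.
    """
    return {
        residue: "hot-spot" if ddg >= cutoff else "null-spot"
        for residue, ddg in ddg_by_residue.items()
    }
```

Machine-learning predictors replace the experimental ddG with a value estimated from sequence and structure features, but the labelling step remains this simple thresholding.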

    In Silico prediction of the Caspase degradome

    Ph.D. thesis (Doctor of Philosophy)