
    The LabelHash algorithm for substructure matching

    Background: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Results: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95% sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs a
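The two-phase idea described in this abstract (index partial matches once, then expand them per motif) can be illustrated with a small sketch. This is not the LabelHash implementation: the data layout (residues as `(label, (x, y, z))` pairs), the function names `build_index`, `consistent`, and `match_motif`, and the parameters `k`, `max_span`, and `tol` are all assumptions made for illustration, and the geometric test is a plain pairwise-distance comparison.

```python
from collections import defaultdict
from itertools import combinations, permutations
from math import dist  # Python 3.8+


def build_index(proteins, k=2, max_span=10.0):
    """Phase 1 (hypothetical): hash every compact k-tuple of residues by its labels."""
    index = defaultdict(list)
    for pid, residues in proteins.items():
        for combo in combinations(range(len(residues)), k):
            # Only index tuples whose residues lie close together in space.
            compact = all(
                dist(residues[i][1], residues[j][1]) <= max_span
                for i, j in combinations(combo, 2)
            )
            if compact:
                for perm in permutations(combo):
                    key = tuple(residues[i][0] for i in perm)
                    index[key].append((pid, perm))
    return index


def consistent(residues, motif, assigned, tol):
    """Check that matched residues reproduce the motif's pairwise distances."""
    for a, b in combinations(range(len(assigned)), 2):
        d_motif = dist(motif[a][1], motif[b][1])
        d_prot = dist(residues[assigned[a]][1], residues[assigned[b]][1])
        if abs(d_motif - d_prot) > tol:
            return False
    return True


def match_motif(proteins, index, motif, k=2, tol=1.0):
    """Phase 2 (hypothetical): look up partial matches and expand them to full matches."""
    key = tuple(label for label, _ in motif[:k])
    matches = []
    for pid, partial in index.get(key, []):
        residues = proteins[pid]
        stack = [tuple(partial)]
        while stack:
            assigned = stack.pop()
            if not consistent(residues, motif, assigned, tol):
                continue
            if len(assigned) == len(motif):
                matches.append((pid, assigned))
                continue
            wanted = motif[len(assigned)][0]
            for i, (label, _) in enumerate(residues):
                if label == wanted and i not in assigned:
                    stack.append(assigned + (i,))
    return matches


# Toy data, invented for this example.
proteins = {
    "P1": [("H", (0, 0, 0)), ("D", (3, 0, 0)), ("S", (0, 4, 0)), ("G", (9, 9, 9))],
    "P2": [("H", (0, 0, 0)), ("D", (8, 0, 0)), ("S", (0, 4, 0))],
}
motif = [("H", (0, 0, 0)), ("D", (3, 0, 0)), ("S", (0, 4, 0))]

index = build_index(proteins)              # built once, reused for any motif
matches = match_motif(proteins, index, motif)
print(matches)
```

In this toy run only `P1` contains residues whose labels and pairwise distances agree with the motif; `P2` has the right labels but the wrong geometry, so its partial match is discarded during expansion. The index is built once and can be reused for any motif, which is the property the abstract attributes to the preprocessing phase.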

    Redundancy-aware learning of protein structure-function relationships

    The protein kinases are a large family of enzymes that play a fundamental role in propagating signals within the cell. Because of the high degree of binding site similarity shared among protein kinases, designing drug compounds with high specificity among the kinases has proven difficult. However, computational approaches to comparing the 3-dimensional geometry and physicochemical properties of key binding site residues, referred to here as substructures, have been shown to be informative of inhibitor selectivity. This thesis introduces two fundamental approaches for the comparative analysis of substructure similarity and demonstrates the importance of each method on a variety of large protein structure datasets for multiple biological applications. The Family-wise Alignment of SubStructural Templates Framework (The FASST Framework) provides an unsupervised learning approach for identifying substructure clusterings. The substructure clusterings identified by FASST allow for the automatic evaluation of substructure variability, the identification of distinct structural conformations and the selection of anomalous outlier structures within large structure datasets. These clusterings are shown to be capable of identifying biologically meaningful structure trends among a diverse number of protein families. The FASST Live visualization and analysis platform provides multiple comparative analysis pipelines and allows the user to interactively explore the substructure clusterings computed by FASST. The Combinatorial Clustering Of Residue Position Subsets (CCORPS) method provides a supervised learning approach for identifying structural features that are correlated with a given set of annotation labels. The ability of CCORPS to identify structural features predictive of functional divergence among families of homologous enzymes is demonstrated across 48 distinct protein families. 
The CCORPS method is further demonstrated to generalize to the very difficult problem of predicting protein kinase inhibitor affinity. CCORPS is demonstrated to make perfect or near-perfect predictions for the binding ability of 12 of the 38 kinase inhibitors studied, while having overall poor predictive ability for only 1 of the 38 compounds. Additionally, CCORPS is shown to identify shared structural features across phylogenetically diverse groups of kinases that are correlated with binding affinity for particular inhibitors; such instances of structural similarity among phylogenetically diverse kinases are also shown not to be rare. Finally, these function-specific structural features may serve as potential starting points for the development of highly specific kinase inhibitors. Importantly, both The FASST Framework and CCORPS implement a redundancy-aware approach to dealing with structure overrepresentation that allows for the incorporation of all available structure data. As shown in this thesis, surprising structural variability exists even among structure datasets consisting of a single protein sequence. By incorporating the full variety of structural conformations within the analysis, the methods presented here provide a richer view of the variability of large protein structure datasets.

    Combinatorial Clustering of Residue Position Subsets Predicts Inhibitor Affinity across the Human Kinome

    The protein kinases are a large family of enzymes that play fundamental roles in propagating signals within the cell. Because of the high degree of binding site similarity shared among protein kinases, designing drug compounds with high specificity among the kinases has proven difficult. However, computational approaches to comparing the 3-dimensional geometry and physicochemical properties of key binding site residue positions have been shown to be informative of inhibitor selectivity. The Combinatorial Clustering Of Residue Position Subsets (CCORPS) method, introduced here, provides a semi-supervised learning approach for identifying structural features that are correlated with a given set of annotation labels. Here, CCORPS is applied to the problem of identifying structural features of the kinase ATP binding site that are informative of inhibitor binding. CCORPS is demonstrated to make perfect or near-perfect predictions for the binding affinity profile of 8 of the 38 kinase inhibitors studied, while having overall poor predictive ability for only 1 of the 38 compounds. Additionally, CCORPS is shown to identify shared structural features across phylogenetically diverse groups of kinases that are correlated with binding affinity for particular inhibitors; such instances of structural similarity among phylogenetically diverse kinases are also shown not to be rare. Finally, these function-specific structural features may serve as potential starting points for the development of highly specific kinase inhibitors.

    Leveraging Structural Flexibility to Predict Protein Function

    Proteins are inherently versatile and flexible molecules, and understanding protein function plays a fundamental role in understanding biological systems. Protein structure comparisons are widely used for revealing protein function. However, because they assume rigidity or partial rigidity, most existing comparison methods do not consider conformational flexibility in protein structures. To address this issue, this thesis seeks to develop algorithms for flexible structure comparisons to predict one specific aspect of protein function, binding specificity. Given conformational samples as a representation of flexibility, we focus on two predictive problems related to specificity: aggregate prediction and individual prediction. For aggregate prediction, we have designed FAVA (Flexible Aggregate Volumetric Analysis). FAVA is the first conformationally general method to compare proteins with identical folds but different specificities. FAVA is able to correctly categorize members of protein superfamilies and to identify influential amino acids that cause different specificities. A second method, PEAP (Point-based Ensemble for Aggregate Prediction), employs ensemble clustering techniques over many base clusterings to predict binding specificity. This method incorporates structural motions of functional substructures and is capable of mitigating prediction errors. For individual prediction, the first method is an atomic point representation for representing flexibilities in the binding cavity. This representation is able to predict binding specificity on each protein conformation with high accuracy, and it is the first to analyze maps of binding cavity conformations that describe proteins with different specificities. Our second method introduces a volumetric lattice representation. This representation localizes the solvent-accessible shape of the binding cavity by computing cavity volume in each user-defined space. It proves to be more informative than point-based representations. 
Last but not least, we discuss a structure-independent representation. This representation builds a lattice model on protein electrostatic isopotentials. This is the first known method to predict binding specificity explicitly from the perspective of electrostatic fields. The methods presented in this thesis incorporate the variety of protein conformations into the analysis of protein ligand binding, and provide more views on flexible structure comparisons and structure-based function annotation of molecular design.

    Computational Methods for the Modulation of Protein-Protein Interactions

    During the last decades, drug discovery development has made considerable progress. However, annual numbers of released drugs for novel targets have been decreasing concomitantly. Limited success rates of combinatorial chemistry and high-throughput screening, as well as the availability of feasible targets, are some reasons for this problem. A strategy to overcome it is the exploration of novel target classes in order to expand the druggable space. One example is protein-protein interactions (PPIs), which can be inhibited or stabilized. Inhibition aims at developing binders for one protein to prevent complex formation. However, known PPI inhibitors differ significantly from conventional drugs, and current active site-biased compound libraries are probably inappropriate for discovering them. The design of novel screening libraries is thus very important. PPI stabilization aims at developing molecules that bind to a protein complex to increase its stability like a molecular glue. In contrast to inhibition, it is rather unexplored, but ground-breaking examples from nature inspire research efforts. This work presents novel theoretical and experimental drug discovery approaches for these challenges. In the first part, we introduce novel chemoinformatics approaches for clustering of large chemical libraries. The development of a fast algorithm for pairwise similarity calculations forms the basis for an exact and deterministic clustering method, which is able to process the available chemical space in a short time. We complement our chemoinformatics work with a novel approach for fast classification of small molecules according to the similarity of their frameworks, the so-called scaffolds. The method generates families of molecules that share geometry-conserving scaffolds, and we show that family members possess similar activity on identical targets. The second part introduces computational methods for PPI modulation. 
First, we present a structure-based analysis of known stabilized PPIs, which enables the development of novel in silico approaches to screen for small molecule PPI stabilizers. We demonstrate their applicability by an experimentally tested virtual screening for 14-3-3 protein interaction stabilizers. Finally, we present a virtual screening approach dedicated to identifying small molecule inhibitors of 14-3-3 protein interactions. Predicted inhibitors are experimentally verified and characterized by in vitro assays and X-ray crystallography. Structure-activity relationship studies yielded PPI inhibitors in the low micromolar range, which are also active in cell-based experiments.

    Modeling and Engineering Proteins Thermostability

    Enzymes have evolved over millions of years to become efficient catalysts for specific biochemical reactions within a specific range of working conditions in the cellular environment. The activity of an enzyme is directly related to its folded structure, and even slight changes in its 3D conformation may cause irreversible, negative effects on its activity. The structures of enzymes are very sensitive to environmental conditions, and departures from their optimal conditions, such as higher temperature, salt concentration, and pH, might result in denaturation and subsequently inactivation. In the past decades, natural enzymes extracted from different organisms have found a wide range of applications in industrial and biotechnological settings. For the majority of these applications, activity at high temperatures is more favorable; however, enzymes have evolved, in most cases, to work optimally within the limited temperature range of their native cellular environment. Therefore, enhancing enzyme thermostability will not only increase their application range, but could also shed light on new aspects of their evolution and chemical activity. The physicochemical principles underlying enzymatic thermostability are in fact key to understanding the way evolution has shaped proteins to adapt to a broad range of temperatures. Understanding the molecular determinants at the basis of protein thermostability, using both in silico methods and in vitro experiments, is also an important route to engineering more thermo-resistant enzymes for use in the industrial setting, such as DNA ligases, which are important for DNA replication and repair and have long been used in molecular biology and biotechnology. In this thesis I used in silico techniques, like molecular modeling and simulation coupled with bioinformatics analyses, to assess existing methods and predict potential thermo-stabilizing mutations for target proteins. 
First, I studied a thermophilic protein and, after exploring the origins of its thermostability, I proposed mutations to further increase its thermostability. Then, I took advantage of what I learned from this study to explore further thermostability engineering methods in order to develop faster, accurate, and easy-to-use methods that can be generally used for a broad array of proteins. 1. Understanding and engineering thermostability in the DNA ligase from Thermococcus sp. 1519 (LigTh1519). In this thesis, I first addressed the origin of thermostability in the thermophilic DNA ligase from the archaeon Thermococcus sp. 1519, and identified thermo-sensitive regions using molecular modeling and simulations. In addition, I predicted mutations that can enhance thermostability of the enzyme through bioinformatics analyses. I showed that thermo-sensitive regions of this enzyme are stabilized at higher temperatures by optimization of charged groups on the surface, and predicted that thermostability can be further increased by further optimization of the network among these charged groups. Engineering this DNA ligase by introducing selected mutations (i.e., A287K, G304D, S364I and A387K) eventually produced a significant and additive increase in the half-life of the enzyme when compared to the wild-type. Then, based on what I learned from thermostability analyses and improvement of LigTh1519, my aim was to design a general-purpose protein thermostability engineering protocol that can enable thermostability engineering [...

    Metabolomics Data Processing and Data Analysis—Current Best Practices

    Metabolomics data analysis strategies are central to transforming raw metabolomics data files into meaningful biochemical interpretations that answer biological questions or generate novel hypotheses. This book contains a variety of papers from a Special Issue around the theme “Best Practices in Metabolomics Data Analysis”. Reviews and strategies for the whole metabolomics pipeline are included, while key areas such as metabolite annotation and identification, compound and spectral databases and repositories, and statistical analysis are highlighted in various papers. Altogether, this book contains valuable information for researchers just starting their metabolomics careers as well as for those who are more experienced and are looking for additional knowledge and best practices to complement key parts of their metabolomics workflows.