10 research outputs found

    The LabelHash algorithm for substructure matching

    Background: There is an increasing number of proteins with known structure but unknown function. Determining their function would have a significant impact on understanding diseases and designing new therapeutics. However, experimental protein function determination is expensive and very time-consuming. Computational methods can facilitate function determination by identifying proteins that have high structural and chemical similarity. Results: We present LabelHash, a novel algorithm for matching substructural motifs to large collections of protein structures. The algorithm consists of two phases. In the first phase the proteins are preprocessed in a fashion that allows for instant lookup of partial matches to any motif. In the second phase, partial matches for a given motif are expanded to complete matches. The general applicability of the algorithm is demonstrated with three different case studies. First, we show that we can accurately identify members of the enolase superfamily with a single motif. Next, we demonstrate how LabelHash can complement SOIPPA, an algorithm for motif identification and pairwise substructure alignment. Finally, a large collection of Catalytic Site Atlas motifs is used to benchmark the performance of the algorithm. LabelHash runs very efficiently in parallel; matching a motif against all proteins in the 95 % sequence identity filtered non-redundant Protein Data Bank typically takes no more than a few minutes. The LabelHash algorithm is available through a web server and as a suite of standalone programs a

    Redundancy-aware learning of protein structure-function relationships

    The protein kinases are a large family of enzymes that play a fundamental role in propagating signals within the cell. Because of the high degree of binding site similarity shared among protein kinases, designing drug compounds with high specificity among the kinases has proven difficult. However, computational approaches to comparing the 3-dimensional geometry and physicochemical properties of key binding site residues, referred to here as substructures, have been shown to be informative of inhibitor selectivity. This thesis introduces two fundamental approaches for the comparative analysis of substructure similarity and demonstrates the importance of each method on a variety of large protein structure datasets for multiple biological applications. The Family-wise Alignment of SubStructural Templates Framework (The FASST Framework) provides an unsupervised learning approach for identifying substructure clusterings. The substructure clusterings identified by FASST allow for the automatic evaluation of substructure variability, the identification of distinct structural conformations and the selection of anomalous outlier structures within large structure datasets. These clusterings are shown to be capable of identifying biologically meaningful structure trends across a diverse set of protein families. The FASST Live visualization and analysis platform provides multiple comparative analysis pipelines and allows the user to interactively explore the substructure clusterings computed by FASST. The Combinatorial Clustering Of Residue Position Subsets (CCORPS) method provides a supervised learning approach for identifying structural features that are correlated with a given set of annotation labels. The ability of CCORPS to identify structural features predictive of functional divergence among families of homologous enzymes is demonstrated across 48 distinct protein families. 
The CCORPS method is further demonstrated to generalize to the very difficult problem of predicting protein kinase inhibitor affinity. CCORPS is demonstrated to make perfect or near-perfect predictions for the binding ability of 12 of the 38 kinase inhibitors studied, while only having overall poor predictive ability for 1 of the 38 compounds. Additionally, CCORPS is shown to identify shared structural features across phylogenetically diverse groups of kinases that are correlated with binding affinity for particular inhibitors; such instances of structural similarity among phylogenetically diverse kinases are also shown to not be rare among kinases. Finally, these function-specific structural features may serve as potential starting points for the development of highly specific kinase inhibitors. Importantly, both The FASST Framework and CCORPS implement a redundancy-aware approach to dealing with structure overrepresentation that allows for the incorporation of all available structure data. As shown in this thesis, surprising structural variability exists even among structure datasets consisting of a single protein sequence. By incorporating the full variety of structural conformations within the analysis, the methods presented here provide a richer view of the variability of large protein structure datasets

    Combinatorial Clustering of Residue Position Subsets Predicts Inhibitor Affinity across the Human Kinome

    The protein kinases are a large family of enzymes that play fundamental roles in propagating signals within the cell. Because of the high degree of binding site similarity shared among protein kinases, designing drug compounds with high specificity among the kinases has proven difficult. However, computational approaches to comparing the 3-dimensional geometry and physicochemical properties of key binding site residue positions have been shown to be informative of inhibitor selectivity. The Combinatorial Clustering Of Residue Position Subsets (CCORPS) method, introduced here, provides a semi-supervised learning approach for identifying structural features that are correlated with a given set of annotation labels. Here, CCORPS is applied to the problem of identifying structural features of the kinase ATP binding site that are informative of inhibitor binding. CCORPS is demonstrated to make perfect or near-perfect predictions for the binding affinity profile of 8 of the 38 kinase inhibitors studied, while only having overall poor predictive ability for 1 of the 38 compounds. Additionally, CCORPS is shown to identify shared structural features across phylogenetically diverse groups of kinases that are correlated with binding affinity for particular inhibitors; such instances of structural similarity among phylogenetically diverse kinases are also shown to not be rare among kinases. Finally, these function-specific structural features may serve as potential starting points for the development of highly specific kinase inhibitors

    Leveraging Structural Flexibility to Predict Protein Function

    Proteins are essentially versatile and flexible molecules, and understanding protein function plays a fundamental role in understanding biological systems. Protein structure comparisons are widely used for revealing protein function. However, because they assume rigidity or partial rigidity, most existing comparison methods do not consider conformational flexibility in protein structures. To address this issue, this thesis seeks to develop algorithms for flexible structure comparisons to predict one specific aspect of protein function, binding specificity. Given conformational samples as a representation of flexibility, we focus on two predictive problems related to specificity: aggregate prediction and individual prediction. For aggregate prediction, we have designed FAVA (Flexible Aggregate Volumetric Analysis). FAVA is the first conformationally general method to compare proteins with identical folds but different specificities. FAVA is able to correctly categorize members of protein superfamilies and to identify influential amino acids that cause different specificities. A second method, PEAP (Point-based Ensemble for Aggregate Prediction), employs ensemble clustering techniques over many base clusterings to predict binding specificity. This method incorporates structural motions of functional substructures and is capable of mitigating prediction errors. For individual prediction, the first method is an atomic point representation for representing flexibilities in the binding cavity. This representation is able to predict binding specificity on each protein conformation with high accuracy, and it is the first to analyze maps of binding cavity conformations that describe proteins with different specificities. Our second method introduces a volumetric lattice representation. This representation localizes the solvent-accessible shape of the binding cavity by computing cavity volume in each user-defined space. It proves to be more informative than point-based representations. 
Last but not least, we discuss a structure-independent representation. This representation builds a lattice model on protein electrostatic isopotentials. This is the first known method to predict binding specificity explicitly from the perspective of electrostatic fields. The methods presented in this thesis incorporate the variety of protein conformations into the analysis of protein ligand binding, and provide more views on flexible structure comparisons and structure-based function annotation of molecular design

    Modeling regionalized volumetric differences in protein-ligand binding cavities

    Identifying elements of protein structures that create differences in protein-ligand binding specificity is an essential method for explaining the molecular mechanisms underlying preferential binding. In some cases, influential mechanisms can be visually identified by experts in structural biology, but subtler mechanisms, whose significance may only be apparent from the analysis of many structures, are harder to find. To assist this process, we present a geometric algorithm and two statistical models for identifying significant structural differences in protein-ligand binding cavities. We demonstrate these methods in an analysis of sequentially nonredundant structural representatives of the canonical serine proteases and the enolase superfamily. Here, we observed that statistically significant structural variations identified experimentally established determinants of specificity. We also observed that an analysis of individual regions inside cavities can reveal areas where small differences in shape can correspond to differences in specificity

    A Gibbs sampling strategy for mining of protein-protein interaction networks and protein structures

    Complex networks are general and can be used to model phenomena that belong to different fields of research, from biochemical applications to social networks. However, due to the intrinsic complexity of real networks, their analysis can be computationally demanding. Recently, several statistical and probabilistic analysis approaches have been designed that prove to be much faster, more flexible and more effective than deterministic algorithms. Among statistical methods, Gibbs sampling is one of the simplest and most powerful algorithms for solving complex optimization problems and it has been applied in different contexts. It has shown its effectiveness in computational biology, but in sequence analysis rather than in network analysis. One approach to analyzing complex networks is to compare them, in order to identify similar patterns of interconnections and predict the function or the role of some unknown nodes. This motivated the main goal of the thesis: designing and implementing novel graph mining techniques based on Gibbs sampling to compare two or more complex networks. The methodology is domain-independent and can work on any complex system of interacting entities with associated attributes. However, in this thesis we focus our attention on protein analysis, overcoming the strong current limitations in this area. Proteins can be analyzed from two different points of view: (i) an internal perspective, i.e. the 3D structure of the protein, (ii) an external perspective, i.e. the interactions with other macromolecules. In both cases, a comparative analysis with other proteins of the same or distinct species can reveal important clues for the function of the protein and evolutionary convergences or divergences between different organisms in the way a specific function or process is carried out. First, we present two methods based on Gibbs sampling for the comparative analysis of protein-protein interaction networks: GASOLINE and SPECTRA. 
GASOLINE is a stochastic and greedy algorithm to find similar groups of interacting proteins in two or more networks. It can align many networks, and does so more quickly than state-of-the-art methods. SPECTRA is a framework to retrieve and compare networks of proteins that interact with one another in specific healthy or tumor tissues. The aim in this case is to identify changes in protein concentration or protein "behaviour" across different tissues. SPECTRA is an adaptation of GASOLINE for weighted protein-protein interaction networks with gene expressions as node weights. It is the first algorithm proposed for multiple comparison of tissue-specific interaction networks. We also describe a Gibbs sampling based algorithm for 3D protein structure comparison, called PROPOSAL, which finds local structural similarities across two or more protein structures. Experimental results confirm our computational predictions and show that the proposed algorithms are much faster and in most cases more accurate than existing methods

    Graph-Based Approaches to Protein Structure Comparison - From Local to Global Similarity

    The comparative analysis of protein structure data is a central aspect of structural bioinformatics. Drawing upon structural information allows the inference of function for unknown proteins even in cases where no apparent homology can be found on the sequence level. Regarding the function of an enzyme, the overall fold topology might be less important than the specific structural conformation of the catalytic site or the surface region of a protein, where the interaction with other molecules, such as binding partners, substrates and ligands, occurs. Thus, a comparison of these regions is especially interesting for functional inference, since structural constraints imposed by the demands of the catalyzed biochemical function make them more likely to exhibit structural similarity. Moreover, the comparative analysis of protein binding sites is of special interest in pharmaceutical chemistry, in order to predict cross-reactivities and gain a deeper understanding of the catalysis mechanism. From an algorithmic point of view, the comparison of structured data, or, more generally, complex objects, can be attempted based on different methodological principles. Global methods aim at comparing structures as a whole, while local methods transfer the problem to multiple comparisons of local substructures. In the context of protein structure analysis, it is not a priori clear which strategy is more suitable. In this thesis, several conceptually different algorithmic approaches have been developed, based on local, global and semi-global strategies, for the task of comparing protein structure data, more specifically protein binding pockets. The use of graphs for the modeling of protein structure data has a long-standing tradition in structural bioinformatics. Recently, graphs have been used to model the geometric constraints of protein binding sites. 
The algorithms developed in this thesis are based on this modeling concept, hence, from a computer scientist's point of view, they can also be regarded as global, local and semi-global approaches to graph comparison. The developed algorithms were mainly designed on the premise to allow for a more approximate comparison of protein binding sites, in order to account for the molecular flexibility of the protein structures. A main motivation was to allow for the detection of more remote similarities, which are not apparent by using more rigid methods. Subsequently, the developed approaches were applied to different problems typically encountered in the field of structural bioinformatics in order to assess and compare their performance and suitability for different problems. Each of the approaches developed during this work was capable of improving upon the performance of existing methods in the field. Another major aspect in the experiments was the question, which methodological concept, local, global or a combination of both, offers the most benefits for the specific task of protein binding site comparison, a question that is addressed throughout this thesis

    Methods for the Efficient Comparison of Protein Binding Sites and for the Assessment of Protein-Ligand Complexes

    In the present work, accelerated methods for the comparison of protein binding sites as well as an extended procedure for the assessment of ligand poses in protein binding sites are presented. Protein binding site comparisons are frequently used receptor-based techniques in early stages of the drug development process. Binding sites of other proteins which are similar to the binding site of the target protein can offer hints for possible side effects of a new drug prior to clinical studies. Moreover, binding site comparisons are used as an idea generator for bioisosteric replacements of individual functional groups of the newly developed drug and to unravel the function of hitherto orphan proteins. The structural comparison of binding sites is especially useful when applied on distantly related proteins as a comparison solely based on the amino acid sequence is not sufficient in such cases. Methods for the assessment of ligand poses in protein binding sites are also used in the early phase of drug development within docking programs. These programs are utilized to screen entire libraries of molecules for a possible ligand of a binding site and to furthermore estimate in which conformation the ligand will most likely bind. By employing this information, molecule libraries can be filtered for subsequent affinity assays and molecular structures can be refined with regard to affinity and selectivity

    Computational approaches for the characterization of the Dipeptidyl Peptidase IV inhibition: Applications to drug discovery, drug design and binding site similarity

    The inhibition of the dipeptidyl peptidase-IV (DPP-IV) enzyme has emerged over the last decade as one of the most effective treatments for type II diabetes mellitus, with low risk for hypoglycemia and weight gain. Structure-activity relationship analyses and virtual screening protocols have been used to explain how ligands interact with the DPP-IV binding site and to mine large databases of small molecules searching for new DPP-IV inhibitors. The present doctoral thesis has therefore focused on: (a) the characterization of DPP-IV inhibition in order to suggest how virtual screening protocols may be improved either to favor the identification of potent and selective DPP-IV inhibitors or to look for new lead molecules; (b) the design of a computational strategy suitable for identifying new lead compounds with very low (or no) similarity to known actives in purchasable databases; (c) the demonstration that, at least partly, the described antidiabetic effect of different Ephedra species extracts is the result of the DPP-IV inhibitory bioactivity of ephedrine and the ephedrine derivatives found in these extracts; and (d) the analysis of the physico-chemical features shared by the DPP-IV and β2-adrenergic receptor binding sites and their comparison in order to evaluate if small molecules with dual bioactivity as DPP-IV inhibitors and β-blockers are possible. It is noteworthy that our work provides a new hypothesis about the cardioprotective effect associated with DPP-IV inhibition and opens the door to a single treatment focused toward type II diabetes mellitus and cardiovascular diseases involved in the metabolic syndrome

    Protocols to capture the functional plasticity of protein domain superfamilies

    Most proteins comprise several domains, segments that are clearly discernable in protein structure and sequence. Over the last two decades, it has become increasingly clear that domains are often also functional modules that can be duplicated and recombined in the course of evolution. This gives rise to novel protein functions. Traditionally, protein domains are grouped into homologous domain superfamilies in resources such as SCOP and CATH. This is done primarily on the basis of similarities in their three-dimensional structures. A biologically sound subdivision of the domain superfamilies into families of sequences with conserved function has so far been missing. Such families form the ideal framework to study the evolutionary and functional plasticity of individual superfamilies. In the few existing resources that aim to classify domain families, a considerable amount of manual curation is involved. Whilst immensely valuable, the latter is inherently slow and expensive. It can thus impede large-scale application. This work describes the development and application of a fully-automatic pipeline for identifying functional families within superfamilies of protein domains. This pipeline is built around a method for clustering large-scale sequence datasets in distributed computing environments. In addition, it implements two different protocols for identifying families on the basis of the clustering results: a supervised and an unsupervised protocol. These are used depending on whether or not high-quality protein function annotation data are associated with a given superfamily. The results attained for more than 1,500 domain superfamilies are discussed in both a qualitative and quantitative manner. The use of domain sequence data in conjunction with Gene Ontology protein function annotations and a set of rules and concepts to derive families is a novel approach to large-scale domain sequence classification. 
Importantly, the focus lies on domain, not whole-protein function