506 research outputs found

    Development of computational approaches for structural classification, analysis and prediction of molecular recognition regions in proteins

    The vast and growing volume of 3D protein structural data stored in the PDB contains abundant information about macromolecular complexes, and hence, data about protein interfaces. Non-covalent contacts between amino acids are the basis of protein interactions, and they are responsible for binding affinity and specificity in biological processes. In addition, water networks in protein interfaces can complement direct interactions and contribute significantly to molecular recognition, although their exact role is still not well understood. It is estimated that protein complexes are substantially underrepresented in the PDB because they are difficult to crystallize. Methods for automatic classification and description of protein complexes are essential to study protein interfaces and to propose putative binding regions. Because of this need, several protein-protein interaction databases have been developed. However, most of them do not take into account protein-peptide complexes, solvent information, or a proper classification of the binding regions, all of which are fundamental to an accurate description of protein interfaces. In the first stage of my thesis, I developed the SCOWLP platform, a database and web application that structurally classifies protein binding regions at family level and accurately defines protein interfaces at atomic detail. The analysis of the results showed that protein-peptide complexes are substantially represented in the PDB, and are the only source of interaction information for several families. By clustering the family binding regions, I identified 9,334 binding regions and 79,803 protein interfaces in the PDB. Interestingly, I observed that 65% of protein families interact with other molecules through more than one region, and in 22% of the cases the same region recognizes different protein families. 
The database and web application are open to the research community (www.scowlp.org) and can greatly facilitate high-throughput comparative analysis of protein binding regions, as well as individual analysis of protein interfaces. SCOWLP and the other databases collect and classify protein binding regions at family level, where sequence and structure homology exist. Interestingly, it has been observed that many protein families also show structural resemblances to each other, mostly across folds. Likewise, structurally similar interacting motifs (binding regions) have been identified among proteins with different folds and functions. For these reasons, I decided to explore the possibility of inferring protein binding regions independently of their fold classification. Thus, I performed the first systematic analysis of binding region conservation within all protein families that are structurally similar, calculated using non-sequential structural alignment methods. My results indicate that a substantial amount of molecular recognition information could potentially be inferred among proteins beyond family level. I obtained a 6- to 8-fold enrichment of binding regions, and identified putative binding regions for 728 protein families that lack binding information. Among the results, I found protein complexes from different folds that present similar interfaces, confirming the predictive value of the methodology. The data obtained with my approach may complement the SCOWLP family binding regions by suggesting alternative binding regions, and can be used to assist protein-protein docking experiments and facilitate rational ligand design. In the last part of my thesis, I used the interaction information contained in the SCOWLP database to help understand the role that water plays in protein interactions in terms of affinity and specificity. 
I carried out one of the first high-throughput analyses of solvent in protein interfaces for a curated dataset of transient and obligate protein complexes. Surprisingly, the results highlight the abundance of water-bridged residues in protein interfaces (40.1% of the interfacial residues), which reinforces the importance of including solvent in protein interaction studies (14.5% extra residues interact only through water-mediated contacts). Interestingly, I also observed that obligate and transient interfaces contain a comparable amount of solvent, which contrasts with the earlier view that obligate protein complexes resemble protein cores in having dry and hydrophobic interfaces. I characterized novel features of water-bridged residues in terms of secondary structure, temperature factors, residue composition, and pairing preferences that differed from direct residue-residue interactions. The results also revealed relevant aspects of the mobility and energetics of water-bridged interfacial residues. Collectively, my doctoral thesis work can be summarized in the following points: 1. I developed SCOWLP, an improved framework that identifies protein interfaces and classifies protein binding regions at family level. 2. I developed a novel methodology to predict alternative binding regions among structurally similar protein families independently of the fold they belong to. 3. I performed a high-throughput analysis of water-bridged interactions contained in SCOWLP to study the role of solvent in protein interfaces. These three components of my thesis represent novel methods for exploiting existing structural information to gain insights into protein-protein interactions, key mechanisms for understanding biological processes.

    Protein binding hot spots and the residue-residue pairing preference: a water exclusion perspective

    Background: A protein binding hot spot is a small cluster of residues tightly packed at the center of the interface between two interacting proteins. Though a hot spot constitutes a small fraction of the interface, it is vital to the stability of protein complexes. Recently, a series of hypotheses has been proposed to characterize binding hot spots, including the pioneering O-ring theory, the insightful 'coupling' and 'hot region' principle, and our 'double water exclusion' (DWE) hypothesis. As the perspective changes from the O-ring theory to the DWE hypothesis, we examine the physicochemical properties of the binding hot spots under the new hypothesis and compare them with those under the O-ring theory. Results: The requirements for a cluster of residues to form a hot spot under the DWE hypothesis can be mathematically satisfied by a biclique subgraph if a vertex is used to represent a residue, an edge to indicate a close distance between two residues, and a bipartite graph to represent a pair of interacting proteins. We term these hot spots DWE bicliques. We identified DWE bicliques from crystal packing contacts and from obligate and non-obligate interactions. Our comparative study revealed that there are abundant bicliques unique to the biological interactions, indicating specific biological binding behaviors in contrast to crystal packing. The two sub-types of biological interactions also have their own signature bicliques. In our analysis of residue compositions and residue pairing preferences in DWE bicliques, the focus was on interaction-preferred residues (ipRs) and interaction-preferred residue pairs (ipRPs). We observed that hydrophobic residues are heavily involved in the ipRs and ipRPs of the obligate interactions, and that aromatic residues are favored in the ipRs and ipRPs of the biological interactions, especially in those of the non-obligate interactions. 
In contrast, the ipRs and ipRPs in crystal packing are dominated by hydrophilic residues, and most of the anti-ipRs of crystal packing are the ipRs of the obligate or non-obligate interactions. Conclusions: The ipRs and ipRPs in our DWE bicliques describe diverse binding features among the three types of interactions. They also highlight the specific binding behaviors of the biological interactions, which differ sharply from the artifact interfaces in crystal packing. DWE bicliques, especially the unique bicliques, can capture deep insights into the binding characteristics of protein interfaces.

    Exploring the potential of 3D Zernike descriptors and SVM for protein–protein interface prediction

    Background: The correct determination of protein–protein interaction interfaces is important for understanding disease mechanisms and for rational drug design. To date, several computational methods for the prediction of protein interfaces have been developed, but the interface prediction problem is still not fully understood. Experimental evidence suggests that the location of binding sites is imprinted in the protein structure, but there are major differences among the interfaces of the various protein types: the characterising properties can vary a lot depending on the interaction type and function. The selection of an optimal set of features characterising the protein interface and the development of an effective method to represent and capture the complex protein recognition patterns are of paramount importance for this task. Results: In this work we investigate the potential of a novel local surface descriptor based on 3D Zernike moments for the interface prediction task. Descriptors invariant to roto-translations are extracted from circular patches of the protein surface enriched with physico-chemical properties from the HQI8 amino acid index set, and are used as samples for a binary classification problem. Support Vector Machines are used as a classifier to distinguish interface local surface patches from non-interface ones. The proposed method was validated on 16 classes of proteins extracted from the Protein–Protein Docking Benchmark 5.0 and compared to other state-of-the-art protein interface predictors (SPPIDER, PrISE and NPS-HomPPI). Conclusions: The 3D Zernike descriptors are able to capture the similarity among patterns of physico-chemical and biochemical properties mapped on the protein surface arising from the various spatial arrangements of the underlying residues, and their usage can be easily extended to other sets of amino acid properties. 
The results suggest that the choice of a proper set of features characterising the protein interface is crucial for the interface prediction task, and that optimality strongly depends on the class of proteins whose interface we want to characterise. We postulate that different protein classes should be treated separately and that it is necessary to identify an optimal set of features for each protein class

    ‘Double water exclusion’: a hypothesis refining the O-ring theory for the hot spots at protein interfaces

    Motivation: The O-ring theory reveals that the binding hot spot at a protein interface is surrounded by a ring of residues that are energetically less important than the residues in the hot spot. As this ring of residues serves to occlude water molecules from the hot spot, the O-ring theory is also called the ‘water exclusion’ hypothesis. We propose a ‘double water exclusion’ hypothesis to refine the O-ring theory by assuming that the hot spot itself is water-free. To computationally model a water-free hot spot, we use a biclique pattern, defined as two maximal groups of residues from two chains in a protein complex such that every residue in one group is in contact with all residues in the other group.
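
The biclique condition described above can be illustrated with a small Python sketch. It is not the authors' implementation: the residue labels, the toy contact set, and the function names `is_biclique` and `is_maximal_biclique` are hypothetical, and a contact is assumed to be a precomputed pair (residue on chain A, residue on chain B) rather than something derived from a distance cutoff.

```python
from itertools import product


def is_biclique(group_a, group_b, contacts):
    """True if every residue in group_a contacts every residue in group_b."""
    return all((a, b) in contacts for a, b in product(group_a, group_b))


def is_maximal_biclique(group_a, group_b, chain_a, chain_b, contacts):
    """True if the groups form a biclique that no further residue can extend."""
    if not is_biclique(group_a, group_b, contacts):
        return False
    # A residue outside group_a that contacts all of group_b could be added.
    for a in chain_a - group_a:
        if all((a, b) in contacts for b in group_b):
            return False
    # The same check from the other chain's side.
    for b in chain_b - group_b:
        if all((a, b) in contacts for a in group_a):
            return False
    return True


# Hypothetical interface: residue labels and contacts are made up.
chain_a = {"A:TRP10", "A:PHE12", "A:ASP20"}
chain_b = {"B:TYR5", "B:LEU7"}
contacts = {
    ("A:TRP10", "B:TYR5"), ("A:TRP10", "B:LEU7"),
    ("A:PHE12", "B:TYR5"), ("A:PHE12", "B:LEU7"),
    ("A:ASP20", "B:TYR5"),
}

full = is_maximal_biclique({"A:TRP10", "A:PHE12"}, chain_b, chain_a, chain_b, contacts)
partial = is_maximal_biclique({"A:TRP10"}, chain_b, chain_a, chain_b, contacts)
```

In this toy example `full` is true, while `partial` is false because PHE12 also contacts both residues on chain B and could be added to the group.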

    Prediction of protein-protein interactions from primary structure using a Random Forest classifier

    The interaction between proteins is fundamental to a broad spectrum of biological functions, including regulation of metabolic pathways, immunological recognition, DNA replication, progression through the cell cycle, and protein synthesis. Due to the growing disparity between the amount of sequenced genomic content and functional data, there is a pressing need for tools and methods that enable prediction of phenotypic traits, at the molecular or organism level, based on the sequence alone. In this work we have constructed a high-quality dataset of protein structures that has enabled us to use the Random Forest non-linear classifier to develop a method for predicting interacting residues from the protein primary structure. 
Our results have shown that, although the Random Forest algorithm is capable of accurately classifying high-dimensional data, we still have an incomplete knowledge of the structural factors that determine the specificity of protein-protein interactions, which puts an upper limit on the usefulness of the machine learning approach in predicting protein interactions at the level of single amino acids.

    Prediction of protein-protein interaction types using machine learning approaches

    Prediction and analysis of protein-protein interactions (PPIs) is an important problem in life science research because of the fundamental roles of PPIs in many biological processes in living cells. One of the important problems surrounding PPIs is the identification and prediction of different types of complexes, which are characterized by properties such as the type and number of proteins that interact, the stability of the proteins, and the duration of the interactions. This thesis focuses on studying the temporal and stability aspects of PPIs, mostly using structural data. We have addressed the problem of predicting obligate and non-obligate protein complexes, as well as the related distinction between transient and permanent complexes, because of the importance of non-obligate and transient complexes as therapeutic targets for drug discovery and development. We have presented a computational model to predict protein interaction types using our proposed physicochemical features of desolvation and electrostatic energies, together with structural and sequence domain-based features. To achieve a comprehensive comparison and demonstrate the strength of our proposed features for predicting PPI types, we have also computed a wide range of properties previously used for prediction, including physical features of interface area, chemical features of hydrophobicity and amino acid composition, and physicochemical features of solvent-accessible surface area (SASA) and atomic contact vectors (ACV). After extracting the main features of the complexes, a variety of machine learning approaches have been used to predict PPI types. The prediction is performed via several state-of-the-art classification techniques, including linear dimensionality reduction (LDR), support vector machine (SVM), naive Bayes (NB) and k-nearest neighbor (k-NN). 
Moreover, several feature selection algorithms, including gain ratio (GR), information gain (IG), chi-square (Chi2) and minimum redundancy maximum relevance (mRMR), are applied to the available datasets to obtain more discriminative and relevant properties for distinguishing between these two types of complexes. Our computational results on different datasets confirm that using our proposed physicochemical features of desolvation and electrostatic energies leads to significant improvements in prediction performance. Moreover, using the structural and sequence domains of CATH and Pfam, together with biological analysis, helps us gain better insight into obligate and non-obligate complexes and their interactions.
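
Of the feature selection criteria listed above, information gain is the simplest to write down. The following is a minimal Python sketch of the standard definition (class entropy minus conditional entropy given the feature), not the thesis code. It assumes the feature has already been discretized into categories, whereas energy-based features would first need binning, and the toy labels and feature values are invented for illustration.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Entropy of the labels minus their entropy conditioned on the feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    conditional = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - conditional


# Hypothetical toy data: four complexes with two discretized features.
labels = ["obligate", "obligate", "non-obligate", "non-obligate"]
separating = ["low", "low", "high", "high"]      # splits the classes perfectly
uninformative = ["low", "high", "low", "high"]   # carries no class information

ig_separating = information_gain(separating, labels)
ig_uninformative = information_gain(uninformative, labels)
```

Features would then be ranked by this score and the top-ranked ones passed to the classifier. On the toy data the perfectly separating feature scores 1 bit and the uninformative one scores 0.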

    Computational analysis of protein-protein interactions

    In recent years, protein-protein interactions have attracted a lot of interest in the fields of pharmacy, medicine, biology, and bioinformatics. In this work, statistical information on transient protein-protein interactions is collected and analyzed. Characteristic properties are then evaluated and their predictive power estimated. To this end, the results from a common docking approach are re-evaluated with the collected information to discriminate the native structure from those that simply have a high geometric complementarity at the interface region. The results show that although there is a noticeable improvement in predictive power after applying statistical information, the overall accuracy is still low. To find other, more specific properties, transient and permanent complexes were compared to each other. The lack of data led to an extensive search for more suitable structural data and the development of an extensive database. This database was ultimately used to retrieve a large number of protein properties that were automatically analyzed for how precisely they separate the two classes. A high accuracy was obtained in separating transient and permanent interactions based on the combination of only four properties. Combining this information with common docking approaches based on geometrical complementarity may lead to satisfactory sensitivities.

    Characterization, classification and alignment of protein-protein interfaces

    Protein structural models provide essential information for research on protein-protein interactions. In this dissertation, we describe two projects on the analysis of protein interactions using structural information. The focus of the first is to characterize and classify different types of interactions. We discriminate between biological obligate interactions, biological non-obligate interactions, and crystal packing contacts. To this end, we defined six interface properties and used them to compare the three types of interactions in a hand-curated dataset. Based on the analysis, a classifier named NOXclass was constructed using a support vector machine algorithm in order to generate predictions of interaction types. NOXclass was tested on a non-redundant dataset of 243 protein-protein interactions and reaches an accuracy of 91.8%. The program is beneficial for structural biologists for the interpretation of protein quaternary structures and for forming hypotheses about the nature of protein-protein interactions when experimental data are not yet available. In the second part of the dissertation, we present Galinter, a novel program for the geometrical comparison of protein-protein interfaces. The Galinter program aims at identifying similar patterns of different non-covalent interactions at interfaces. It is a graph-based approach optimized for aligning non-covalent interactions. A scoring scheme was developed for estimating the statistical significance of the alignments. We tested the Galinter method on a published dataset of interfaces. Galinter alignments agree with those delivered by methods based on interface residue comparison and backbone structure comparison. In addition, we applied Galinter to four medically relevant examples of protein mimicry. Our results are consistent with previous human-curated analysis. 
The Galinter program provides an intuitive method of comparative analysis and visualization of binding modes and may assist in the prediction of interaction partners, and the design and engineering of protein interactions and interaction inhibitors

    Propensity vectors of low-ASA residue pairs in the distinction of protein interactions

    We introduce low-ASA residue pairs as classification features for distinguishing the different types of protein interactions. A low-ASA residue pair is defined as two contact residues, one from each chain, that have a small solvent accessible surface area (ASA). This notion of residue pairs is novel as it is the first to combine residue pairs with the O-ring theory, an influential proposition stating that the binding hot spots at the interface are often surrounded by a ring of energetically less important residues. As binding hot spots are central to the stability of protein interactions, we believe that low-ASA residue pairs can sharpen the distinction between types of protein interactions. The main part of our feature vector is 210-dimensional, consisting of all possible low-ASA residue pairs; the value of every feature is determined by a propensity measure. Our classification method, called OringPV, uses propensity vectors of protein interactions as input to a support vector machine. OringPV is tested on three benchmark datasets for a variety of classification tasks, such as the distinction between crystal packing and biological interactions and the distinction between two different types of biological interactions. The evaluation frameworks include within-dataset comparison, cross-dataset comparison, and leave-one-out cross-validation. The results show that low-ASA residue pairs and the propensity vector description of protein interactions are strong discriminators. In particular, many cross-dataset generalization tests achieved excellent recall and overall accuracy, substantially outperforming existing benchmark methods. © 2009 Wiley-Liss, Inc
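
The figure of 210 dimensions matches the number of unordered pairs, including same-residue pairs, that can be formed from the 20 standard amino acids: 20 · 21 / 2 = 210. The Python sketch below shows one way to index such a vector. It is an illustration only: the abstract does not define the propensity measure, so raw pair counts are used here as a stand-in, and the names `PAIR_INDEX` and `count_vector` are hypothetical.

```python
from itertools import combinations_with_replacement

# One-letter codes of the 20 standard amino acids, in alphabetical order.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Map each unordered residue pair to a fixed position in the feature vector.
PAIR_INDEX = {
    pair: i
    for i, pair in enumerate(combinations_with_replacement(AMINO_ACIDS, 2))
}


def count_vector(residue_pairs):
    """Count observed pairs per dimension (a stand-in for the propensity value)."""
    vector = [0] * len(PAIR_INDEX)
    for res_a, res_b in residue_pairs:
        key = tuple(sorted((res_a, res_b)))  # (W, A) and (A, W) share one slot
        vector[PAIR_INDEX[key]] += 1
    return vector


# Hypothetical low-ASA contact pairs from one interface.
example = count_vector([("W", "A"), ("A", "W"), ("L", "L")])
n_dimensions = len(PAIR_INDEX)
```

Sorting each pair before the lookup makes the representation independent of which chain a residue comes from, which is what gives 210 dimensions rather than 400.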