    Prediction of protein-protein interaction types using machine learning approaches

    
    The prediction and analysis of protein-protein interactions (PPIs) is an important problem in life science research because PPIs play fundamental roles in many biological processes in living cells. One important problem surrounding PPIs is the identification and prediction of different types of complexes, which are characterized by properties such as the type and number of interacting proteins, the stability of the proteins, and the duration of the interactions. This thesis focuses on the temporal and stability aspects of PPIs, mostly using structural data. We address the problem of predicting obligate and non-obligate protein complexes, as well as the related distinction between transient and permanent complexes, because non-obligate and transient complexes are important therapeutic targets for drug discovery and development. We present a computational model that predicts protein-protein interaction types using our proposed physicochemical features of desolvation and electrostatic energies, together with structural and sequence domain-based features. To achieve a comprehensive comparison and demonstrate the strength of the proposed features, we also computed a wide range of properties previously used for prediction, including physical features (interface area), chemical features (hydrophobicity and amino acid composition), and physicochemical features (solvent-accessible surface area (SASA) and atomic contact vectors (ACV)). After extracting the main features of the complexes, we used a variety of machine learning approaches to predict PPI types. The prediction is performed with several state-of-the-art classification techniques, including linear dimensionality reduction (LDR), support vector machine (SVM), naive Bayes (NB) and k-nearest neighbor (k-NN).
Moreover, several feature selection algorithms, including gain ratio (GR), information gain (IG), chi-square (Chi2) and minimum redundancy maximum relevance (mRMR), are applied to the available datasets to obtain more discriminative and relevant properties for distinguishing between these two types of complexes. Our computational results on different datasets confirm that using our proposed physicochemical features of desolvation and electrostatic energies leads to significant improvements in prediction performance. Moreover, using the structural and sequence domains of CATH and Pfam, together with biological analysis, gives better insight into obligate and non-obligate complexes and their interactions.
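
As an illustration of the feature selection step described above, the following is a minimal sketch of ranking features by information gain (IG). It is not the thesis implementation: the feature names, the low/high binning and the six toy complexes are hypothetical values chosen only to show the calculation.

```python
from collections import Counter
from math import log2


def entropy(labels):
    """Shannon entropy (in bits) of a sequence of class labels."""
    total = len(labels)
    return -sum((n / total) * log2(n / total) for n in Counter(labels).values())


def information_gain(feature_values, labels):
    """Label entropy minus the entropy left after splitting on a discrete feature."""
    total = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / total * entropy(g) for g in groups.values())
    return entropy(labels) - remainder


def rank_features(table, labels):
    """Return feature names sorted from most to least informative."""
    scores = {name: information_gain(values, labels) for name, values in table.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Toy data (hypothetical): six complexes with binned feature values.
labels = ["obligate", "obligate", "obligate",
          "non-obligate", "non-obligate", "non-obligate"]
table = {
    "desolvation_energy": ["low", "low", "low", "high", "high", "high"],
    "interface_area": ["high", "high", "low", "low", "low", "high"],
    "hydrophobicity": ["high", "low", "high", "low", "high", "low"],
}
ranking = rank_features(table, labels)
```

In this toy table the hypothetical desolvation feature separates the two classes perfectly, so it is ranked first. Continuous features would need to be discretized before this calculation applies.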

    Prediction of protein-protein interaction types using association rule based classification

    
    This article has been made available through the Brunel Open Access Publishing Fund. Copyright © 2009 Park et al. Background: Protein-protein interactions (PPI) can be classified according to their characteristics into, for example, obligate or transient interactions. The identification and characterization of these PPI types may help in the functional annotation of new protein complexes and in the prediction of protein interaction partners by knowledge-driven approaches. Results: This work addresses the discovery of patterns that characterize the interaction sites of four different interaction types, and uses these patterns to predict PPI types by means of Association Rule Based Classification (ARBC), which includes association rule generation and posterior classification. We incorporated domain information from protein complexes in SCOP proteins and identified 354 domain-interaction sites. Fourteen interface properties were calculated from amino acid and secondary structure composition and then used to generate a set of association rules characterizing these domain-interaction sites with the APRIORI algorithm. Our results on the classification of PPI types based on a set of discovered association rules show that the discriminative ability of the association rules can significantly affect the predictive power of classification models. We also showed that the accuracy of the classification can be improved by using structural domain information and secondary structure content. Conclusion: The advantage of our approach is that biologically significant information can be extracted from the discovered association rules, because the rules are understandable and interpretable. A web application based on our method can be found at http://bioinfo.ssu.ac.kr/~shpark/picasso/. SHP was supported by the Korea Research Foundation Grant funded by the Korean Government (KRF-2005-214-E00050).
JAR has been supported by the Programme Alβan, the European Union Programme of High Level Scholarships for Latin America, scholarship E04D034854CL. SK was supported by the Soongsil University Research Fund.
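
The rule generation step described in the Results above can be sketched as follows. This is a simplified Apriori-style search, not the authors' ARBC code: the item names (`helix=high`, `polar=low`, and so on), the five toy interaction sites and both thresholds are hypothetical.

```python
from itertools import combinations


def support(itemset, transactions):
    """Fraction of transactions that contain every item in itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)


def frequent_itemsets(transactions, min_support):
    """Level-wise (Apriori-style) search: only frequent itemsets are extended."""
    items = sorted({item for t in transactions for item in t})
    level = [frozenset([i]) for i in items
             if support(frozenset([i]), transactions) >= min_support]
    found = list(level)
    while level:
        # Join frequent k-itemsets into (k+1)-candidates, keep the frequent ones.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == len(a) + 1}
        level = [c for c in candidates if support(c, transactions) >= min_support]
        found.extend(level)
    return found


def class_rules(transactions, class_items, min_support, min_confidence):
    """Rules of the form {properties} -> class label, with support and confidence."""
    rules = []
    for itemset in frequent_itemsets(transactions, min_support):
        consequent = itemset & class_items
        antecedent = itemset - class_items
        if len(consequent) == 1 and antecedent:
            sup = support(itemset, transactions)
            conf = sup / support(antecedent, transactions)
            if conf >= min_confidence:
                rules.append((antecedent, next(iter(consequent)), sup, conf))
    return rules


# Toy interaction sites (hypothetical): discretized properties plus a class item.
transactions = [
    frozenset({"helix=high", "polar=low", "class=obligate"}),
    frozenset({"helix=high", "polar=low", "class=obligate"}),
    frozenset({"helix=high", "polar=high", "class=obligate"}),
    frozenset({"helix=low", "polar=high", "class=transient"}),
    frozenset({"helix=low", "polar=high", "class=transient"}),
]
class_items = frozenset({"class=obligate", "class=transient"})
rules = class_rules(transactions, class_items, min_support=0.4, min_confidence=0.9)
```

Each returned rule pairs a set of interface properties with one interaction type. A classifier built on such rules could, for example, assign a new site the class of the highest-confidence rule whose antecedent it satisfies, although the abstract does not specify how posterior classification is done.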

    Development of computational approaches for structural classification, analysis and prediction of molecular recognition regions in proteins

    
    The vast and growing volume of 3D protein structural data stored in the PDB contains abundant information about macromolecular complexes, and hence data about protein interfaces. Non-covalent contacts between amino acids are the basis of protein interactions, and they are responsible for binding affinity and specificity in biological processes. In addition, water networks in protein interfaces can complement direct interactions and contribute significantly to molecular recognition, although their exact role is still not well understood. It is estimated that protein complexes are substantially underrepresented in the PDB because they are difficult to crystallize. Methods for the automatic classification and description of protein complexes are essential to study protein interfaces and to propose putative binding regions. Because of this strong need, several protein-protein interaction databases have been developed. However, most of them do not take into account protein-peptide complexes, solvent information, or a proper classification of the binding regions, all of which are fundamental to an accurate description of protein interfaces. In the first stage of my thesis, I developed the SCOWLP platform, a database and web application that structurally classifies protein binding regions at family level and accurately defines protein interfaces at atomic detail. The analysis of the results showed that protein-peptide complexes are substantially represented in the PDB, and are the only source of interaction information for several families. By clustering the family binding regions, I could identify 9,334 binding regions and 79,803 protein interfaces in the PDB. Interestingly, I observed that 65% of protein families interact with other molecules through more than one region, and in 22% of the cases the same region recognizes different protein families.
The database and web application are open to the research community (www.scowlp.org) and can greatly facilitate high-throughput comparative analysis of protein binding regions, as well as individual analysis of protein interfaces. SCOWLP and the other databases collect and classify protein binding regions at family level, where sequence and structure homology exist. Interestingly, it has been observed that many protein families also show structural resemblances to each other, mostly across folds. Likewise, structurally similar interacting motifs (binding regions) have been identified among proteins with different folds and functions. For these reasons, I decided to explore the possibility of inferring protein binding regions independently of their fold classification. Thus, I performed the first systematic analysis of binding region conservation within all protein families that are structurally similar, calculated using non-sequential structural alignment methods. My results indicate that a substantial amount of molecular recognition information could potentially be inferred among proteins beyond family level. I obtained a 6- to 8-fold enrichment of binding regions, and identified putative binding regions for 728 protein families that lack binding information. Within the results, I found protein complexes from different folds that present similar interfaces, confirming the predictive value of the methodology. The data obtained with my approach may complement the SCOWLP family binding regions by suggesting alternative binding regions, and can be used to assist protein-protein docking experiments and facilitate rational ligand design. In the last part of my thesis, I used the interaction information contained in the SCOWLP database to help understand the role that water plays in protein interactions in terms of affinity and specificity.
I carried out one of the first high-throughput analyses of solvent in protein interfaces for a curated dataset of transient and obligate protein complexes. Surprisingly, the results highlight the abundance of water-bridged residues in protein interfaces (40.1% of the interfacial residues), which reinforces the importance of including solvent in protein interaction studies (14.5% extra residues interact only through water). Interestingly, I also observed that obligate and transient interfaces contain a comparable amount of solvent, which contrasts with the earlier view that obligate protein complexes should resemble protein cores, with dry and hydrophobic interfaces. I characterized novel features of water-bridged residues in terms of secondary structure, temperature factors, residue composition, and pairing preferences that differed from direct residue-residue interactions. The results also showed relevant aspects of the mobility and energetics of water-bridged interfacial residues. Collectively, my doctoral thesis work can be summarized in the following points: 1. I developed SCOWLP, an improved framework that identifies protein interfaces and classifies protein binding regions at family level. 2. I developed a novel methodology to predict alternative binding regions among structurally similar protein families independently of the fold they belong to. 3. I performed a high-throughput analysis of water-bridged interactions contained in SCOWLP to study the role of solvent in protein interfaces. These three components of my thesis represent novel methods for exploiting existing structural information to gain insights into protein-protein interactions, which are key mechanisms for understanding biological processes.

    Comparing interfacial dynamics in protein-protein complexes: an elastic network approach

    
    Background: The transient or permanent association of proteins to form organized complexes is one of the most common mechanisms of regulation of biological processes. Systematic physico-chemical studies of the binding interfaces have previously shown that a key mechanism for the formation and stabilization of dimers is the steric and chemical complementarity of the two semi-interfaces. The role of the fluctuation dynamics at the interface of the interacting subunits, although expected to be important, has proved more elusive to characterize. The aim of the present computational study is to gain insight into salient dynamics-based aspects of protein-protein interfaces. Results: The interface dynamics was characterized by means of an elastic network model for 22 representative dimers covering three main interface types. The three groups gather dimers sharing the same interface but with good (type I) or poor (type II) similarity of the overall fold, or dimers sharing only one of the semi-interfaces (type III). The set comprises obligate dimers, which are complexes for which no structural representative of the free form(s) is available. Considerations were accordingly limited to bound and unbound forms of the monomeric subunits of the dimers. We proceeded by first computing the mobility of amino acids at the interface of the bound forms and comparing it with the mobility of (i) other surface amino acids and (ii) interface amino acids in the unbound forms. In both cases, different dynamic patterns were observed across interface types and depending on whether the interface belongs to an obligate or non-obligate complex. Conclusions: The comparative investigation indicated that the mobility of amino acids at the dimeric interface is generally lower than that of other amino acids at the protein surface.
The change in interfacial mobility upon removing the partner monomer "in silico" (unbound form) was next found to be correlated with the interface type, size and obligate nature of the complex. In particular, going from the unbound to the bound forms, the interfacial mobility is noticeably reduced for dimers with type I interfaces, while it is largely unchanged for type II ones. The results suggest that these structurally and biologically different types of interfaces are stabilized by different balancing mechanisms between enthalpy and conformational entropy.

    An evolutionary perspective on the kinome of malaria parasites

    
    Malaria parasites belong to an ancient lineage that diverged very early from the main branch of eukaryotes. The approximately 90-member plasmodial kinome includes a majority of eukaryotic protein kinases that clearly cluster within the AGC, CMGC, TKL, CaMK and CK1 groups found in yeast, plants and mammals, testifying to the ancient ancestry of these families. However, several hundred million years of independent evolution, and the specific pressures brought about by first a photosynthetic and then a parasitic lifestyle, led to the emergence of unique features in the plasmodial kinome. These include taxon-restricted kinase families, and unique peculiarities of individual enzymes even when they have homologues in other eukaryotes. Here, we merge essential aspects of all three malaria-related communications that were presented at the Evolution of Protein Phosphorylation meeting, and propose an integrated discussion of the specific features of the parasite's kinome and phosphoproteome.

    Sequence homology based protein-protein interacting residue predictions and the applications in ranking docked conformations

    
    Protein-protein interactions play a central role in the formation of protein complexes and the biological pathways that orchestrate virtually all cellular processes. Three-dimensional structures of a complex formed by a protein with one or more of its interaction partners provide useful information about the specific amino acid residues that make up the interface between proteins. The emergence of high-throughput techniques such as yeast two-hybrid (Y2H) assays has made it possible to identify putative interactions between thousands of proteins (but not the interfaces that form the structural basis of interactions or the structures of protein complexes that result from such interactions). Reliable identification of the specific amino acid residues that form the interface of a protein with one or more other proteins is critical for understanding the structural and physico-chemical basis of protein interactions and their role in key cellular processes, for predicting protein complexes, for validating protein interactions predicted by high-throughput methods, for ranking conformations of protein complexes generated by docking, and for identifying and prioritizing drug targets in computational drug design. However, given the high cost of experimental determination of the structures of protein complexes, there is an urgent need for reliable and fast computational methods for identifying interface residues and/or predicting the structure of a complex formed by a protein of interest with its interaction partners. Given the large and growing gap between the number of known protein sequences and the number of experimentally determined structures, sequence-based methods for predicting protein-protein interfaces are of particular interest. Against this background, we develop HomPPI (http://homppi.cs.iastate.edu/), a class of sequence homology based approaches to protein interface prediction.
We present two variants of HomPPI: (i) NPS-HomPPI (non-partner-specific HomPPI), which can be used to predict interface residues of a query protein in the absence of knowledge of the interaction partner. NPS-HomPPI is based on the results of a systematic analysis of the conditions under which interface residues of a query protein are conserved among its sequence homologs (and hence can be inferred from the known interface residues in proteins that are sequence homologs of the query protein). Our experiments suggest that when sequence homologs of the query protein can be reliably identified, NPS-HomPPI is competitive with several state-of-the-art interface prediction servers, including those that exploit the structure of the query proteins. (ii) PS-HomPPI (partner-specific HomPPI), which can be used to predict the interface residues of a query protein with a specific target protein. PS-HomPPI is based on a systematic analysis of the conditions under which the residues that make up the interface between a query protein and its interaction partner are preserved among their homo-interologs, i.e., complexes formed by their respective sequence homologs. To the best of our knowledge, with the exception of protein-protein docking (which is computationally much more expensive than PS-HomPPI), PS-HomPPI is one of the first partner-specific protein-protein interface predictors. Our experiments with PS-HomPPI show that when homo-interologs of a query protein and its putative interaction partner can be reliably identified, the interface predictions generated by PS-HomPPI are significantly more reliable than those generated by NPS-HomPPI. Protein-protein docking offers a powerful approach to the computational determination of the three-dimensional conformation of protein complexes and protein-protein interfaces.
However, the reliability of conformations produced by docking is limited by the efficacy of the scoring functions used to select a few near-native conformations from among the tens of thousands of possible conformations generated by docking programs. Against this background, we introduce DockRank, a novel approach to ranking docked conformations based on the degree to which the interface residues inferred from the docked conformation match the interface residues predicted by PS-HomPPI, a partner-specific sequence homology based interface predictor. On a data set of 69 docked cases with 54,000 decoys per case, we compare the rankings of conformations produced using DockRank's interface similarity scoring function applied to predicted interface residues obtained from four protein interface predictors (PS-HomPPI and three NPS interface predictors: NPS-HomPPI, PRISE, and meta-PPISP) with the rankings produced by two state-of-the-art energy-based scoring functions, ZRank and IRAD. Our results show that DockRank significantly outperforms these ranking methods. The finding that NPS interface predictors (homology based and machine learning based) failed to select near-native conformations superior to those selected by DockRank (which is based on partner-specific interface prediction) highlights the importance of knowing the binding partners when using predicted interfaces to rank docked models. The application of DockRank as a third-party scoring function, without access to all the original docked models, to improve ClusPro results on two benchmark data sets of 32 and 56 test cases shows the viability of combining our scoring function with existing docking software. An online implementation of DockRank is available at http://einstein.cs.iastate.edu/DockRank/.
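
The ranking idea behind DockRank can be sketched as follows. The abstract does not specify the interface similarity function, so this sketch assumes a plain Jaccard overlap between residue sets; the residue identifiers and decoy names are hypothetical.

```python
def jaccard(a, b):
    """Overlap between two residue sets: |a & b| / |a | b| (0.0 if both are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def rank_decoys(predicted_interface, decoy_interfaces):
    """Sort decoy names by decreasing similarity to the predicted interface."""
    scores = {name: jaccard(predicted_interface, residues)
              for name, residues in decoy_interfaces.items()}
    return sorted(scores, key=scores.get, reverse=True)


# Hypothetical interface residues ("chain:position") for one protein.
predicted = {"A:45", "A:46", "A:49", "A:80"}  # e.g. from a partner-specific predictor
decoys = {
    "decoy_1": {"A:10", "A:11", "A:12"},
    "decoy_2": {"A:45", "A:46", "A:49", "A:81"},
    "decoy_3": {"A:45", "A:80", "A:120", "A:121"},
}
order = rank_decoys(predicted, decoys)
```

Docked conformations whose interfaces agree best with the prediction rise to the top of the list. Because only residue sets are compared, a score of this kind can be applied to models from any docking program without access to its internal energy terms.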

    The NMR structure of the engineered halophilic DnaE intein for segmental isotopic labeling using conditional protein splicing

    
    Protein trans-splicing catalyzed by split inteins has been used for segmental isotopic labeling of proteins to alleviate the complexity of NMR signals. Whereas inteins spontaneously trigger protein splicing upon protein folding, inteins from extremely halophilic organisms require high salinity to induce protein splicing. We designed and created a salt-inducible intein from the widely used DnaE intein from Nostoc punctiforme by introducing 29 mutations; the resulting intein requires a lower salt concentration than naturally occurring halo-obligate inteins. We determined the NMR solution structure of the engineered salt-inducible DnaE intein in 2 M NaCl, which shows essentially the same three-dimensional structure as the original intein, although it unfolds in the absence of salt. The NMR structure of a halo-obligate intein under high salinity suggests that the stabilization of the active folded conformation is not a mere result of various intramolecular interactions but of a subtle energy balance among complex interactions, including the solvation energy, which involve water, ions, co-solutes, and protein polypeptide chains. © 2022 The Authors. Published by Elsevier Inc.