7 research outputs found

    Discovery of stable and significant binding motif pairs from PDB complexes and protein interaction datasets

    Motivation: Discovery of binding sites is important in the study of protein-protein interactions. In this paper, we introduce stable and significant motif pairs to model protein-binding sites. Stability is a pattern's resistance to some transformation. Significance is the unexpected frequency of occurrence of the pattern in a sequence dataset comprising known interacting protein pairs. Discovery of stable motif pairs is an iterative process that passes through a chain of changing but converging patterns. Determining the starting point for such a chain is an interesting problem. We use a protein complex dataset extracted from the Protein Data Bank to help identify those starting points, so that the computational complexity of the problem is greatly reduced. Results: We found 913 stable motif pairs, of which 765 are significant. We evaluated these motif pairs by comprehensive comparison against random patterns. Motifs discovered in wet-lab experiments and reported in the literature were also used to confirm the effectiveness of our method. © Oxford University Press 2004; all rights reserved.
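The notion of significance described in this abstract (a motif pair occurring unexpectedly often among known interacting protein pairs) can be illustrated with a small permutation test. This is a minimal sketch, not the authors' algorithm: it assumes motifs are written as regular expressions, and the function names `count_pair_hits` and `empirical_p_value`, the shuffled-partner baseline, and the toy sequences are all hypothetical choices made for illustration.

```python
import random
import re


def count_pair_hits(motif_a, motif_b, pairs):
    """Count protein pairs in which one sequence matches motif_a and the other motif_b."""
    ra, rb = re.compile(motif_a), re.compile(motif_b)
    hits = 0
    for s1, s2 in pairs:
        # Accept the motif pair in either orientation.
        if (ra.search(s1) and rb.search(s2)) or (ra.search(s2) and rb.search(s1)):
            hits += 1
    return hits


def empirical_p_value(motif_a, motif_b, pairs, n_shuffles=200, seed=0):
    """Compare the observed hit count with counts after randomly re-pairing partners.

    Assumption: shuffling partners is used here as a simple stand-in for a
    'random' background; the abstract does not specify the background model.
    """
    rng = random.Random(seed)
    observed = count_pair_hits(motif_a, motif_b, pairs)
    left = [p[0] for p in pairs]
    right = [p[1] for p in pairs]
    at_least = 0
    for _ in range(n_shuffles):
        shuffled = right[:]
        rng.shuffle(shuffled)
        if count_pair_hits(motif_a, motif_b, list(zip(left, shuffled))) >= observed:
            at_least += 1
    # Add-one correction keeps the estimate strictly positive.
    return observed, (at_least + 1) / (n_shuffles + 1)


# Toy, made-up sequences (not real proteins).
toy_pairs = [("AAKLM", "GGPWD"), ("CCKLM", "TTPWD"), ("AAAAA", "GGGGG")]
observed, p_value = empirical_p_value("K.M", "PW", toy_pairs)
```

A small p-value would suggest the motif pair co-occurs in interacting pairs more often than under random re-pairing; with only three toy pairs the estimate carries no statistical weight and serves only to show the mechanics.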

    Using structural motif descriptors for sequence-based binding site prediction

    All authors are with the Biotechnological Center, TU Dresden, Tatzberg 47-51, 01307 Dresden, Germany, and Wan Kyu Kim is with the Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX 78712, USA. Background: Many protein sequences are still poorly annotated. Functional characterization of a protein is often improved by the identification of its interaction partners. Here, we aim to predict protein-protein interactions (PPI) and protein-ligand interactions (PLI) at the sequence level using 3D information. To this end, we use machine learning to compile sequential segments that constitute structural features of an interaction site into one profile Hidden Markov Model descriptor. The resulting collection of descriptors can be used to screen sequence databases in order to predict functional sites. Results: We generate descriptors for 740 classified types of protein-protein binding sites and for more than 3,000 protein-ligand binding sites. Cross validation reveals that two thirds of the PPI descriptors are sufficiently conserved and significant enough to be used for binding site recognition. We further validate 230 PPIs that were extracted from the literature, where we additionally identify the interface residues. Finally, we test ligand-binding descriptors for the case of ATP. From sequences with the Swiss-Prot annotation "ATP-binding", we achieve a recall of 25% with a precision of 89%, whereas Prosite's P-loop motif recognizes an equal number of hits at the expense of a much higher number of false positives (precision: 57%). Our method yields 771 hits with a precision of 96% that were not previously picked up by any Prosite pattern. Conclusion: The automatically generated descriptors are a useful complement to known Prosite/InterPro motifs. They serve to predict protein-protein as well as protein-ligand interactions, along with their binding site residues, for proteins where only sequence information is available.
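The precision and recall figures quoted in this abstract follow the standard definitions, which can be sketched as below. The function name `precision_recall` and the example counts are hypothetical: the counts were chosen only so that they reproduce a precision of 89% and a recall of 25%, and they are not taken from the paper.

```python
def precision_recall(true_positives, false_positives, false_negatives):
    """Return (precision, recall) from raw prediction counts.

    precision = TP / (TP + FP): fraction of reported hits that are correct.
    recall    = TP / (TP + FN): fraction of true sites that are recovered.
    """
    predicted = true_positives + false_positives
    actual = true_positives + false_negatives
    precision = true_positives / predicted if predicted else 0.0
    recall = true_positives / actual if actual else 0.0
    return precision, recall


# Hypothetical counts for illustration only (not from the paper):
# 89 correct hits, 11 false hits, 267 annotated sites that were missed.
precision, recall = precision_recall(89, 11, 267)
```

High precision with low recall, as reported for the ATP-binding descriptors, means few reported hits are wrong but many annotated sites go undetected.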

    Automated linear motif discovery from protein interaction network

    Master's (Master of Science)

    A structural classification of protein-protein interactions for detection of convergently evolved motifs and for prediction of protein binding sites on sequence level

    BACKGROUND: A long-standing challenge in the post-genomic era of bioinformatics is the prediction of protein-protein interactions, and ultimately the prediction of protein functions. The problem is intrinsically harder when only amino acid sequences are available, but a solution is more universally applicable. So far, the problem of uncovering protein-protein interactions has been addressed in a variety of ways, both experimentally and computationally. MOTIVATION: The central problem is: How can protein complexes with solved three-dimensional structure be utilized to identify and classify protein binding sites, and how can knowledge be inferred from this classification such that protein interactions can be predicted for proteins without solved structure? The underlying hypothesis is that protein binding sites are often restricted to a small number of residues, which are additionally often well conserved in order to maintain an interaction. Therefore, the signal-to-noise ratio in binding sites is expected to be higher than in other parts of the surface. This enables binding site detection in unknown proteins when homology-based annotation transfer fails. APPROACH: The problem is addressed by first investigating how geometrical aspects of domain-domain associations can lead to a rigorous structural classification of the multitude of protein interface types. The interface types are explored with respect to two aspects: First, how do interface types with one-sided homology reveal convergently evolved motifs? Second, how can sequential descriptors for local structural features be derived from the interface type classification? Then, the use of sequential representations for binding sites in order to predict protein interactions is investigated. The underlying algorithms are based on machine learning techniques, in particular Hidden Markov Models. RESULTS: This work includes a novel approach to a comprehensive geometrical classification of domain interfaces. 
Alternative structural domain associations are found for 40% of all family-family interactions. Evaluation of the classification algorithm on a hand-curated set of interfaces yielded a precision of 83% and a recall of 95%. For the first time, a systematic screen of convergently evolved motifs in 102,000 protein-protein interactions with structural information is derived. With respect to this dataset, all cases related to viral mimicry of human interface bindings are identified. Finally, a library of 740 motif descriptors for binding site recognition, encoded as Hidden Markov Models, is generated and cross-validated. Tests for the significance of motifs are provided. The usefulness of descriptors for protein-ligand binding sites is demonstrated for the case of "ATP-binding", where a precision of 89% is achieved, thus outperforming comparable motifs from PROSITE. In particular, a novel descriptor for a P-loop variant has been used to identify ATP-binding sites in 60 protein sequences that have not been annotated before by existing motif databases.

    Sequence-based prediction of RNA-protein interactions

    The interaction of RNAs with proteins is fundamental for executing many of the key roles they play in living systems, including translation, post-transcriptional regulation of gene expression, RNA splicing, and viral replication. Recently, new roles for RNA-protein interactions have emerged, following the discovery that the human genome is pervasively transcribed and produces thousands of non-coding RNAs (ncRNAs). Although the functions of many ncRNAs are not yet known, one emerging theme is that long non-coding RNAs (lncRNAs) often drive the formation of ribonucleoprotein (RNP) complexes, which in turn influence the regulation of gene expression. Although the human genome is predicted to encode almost as many different RNA-binding proteins as DNA-binding transcription factors, our current understanding of the cellular roles of RNA-binding proteins, how they recognize their targets, and how they are regulated, lags far behind our understanding of transcription factors. To improve our comprehension of RNA-protein recognition and the regulation of RNA-protein interaction networks within cells, this dissertation has four related goals: (i) performing a rigorous and systematic evaluation of sequence- and structure-based methods for predicting RNA-binding residues in proteins; (ii) developing an improved method for predicting interfacial residues in RNA-binding proteins, using only sequence information; (iii) generating a comprehensive collection of RNA-protein interaction motifs (RPIMs); and (iv) developing improved methods for RNA-protein interaction partner prediction. First, we present a systematic evaluation of state-of-the-art machine learning methods for predicting RNA-binding residues in proteins, using three carefully curated benchmark datasets and a rich set of data representations. 
We show that sequence-based methods trained using position-specific scoring matrices (PSSMs) perform better than structure-based methods, which use more complex features extracted from the 3D structures of proteins. Second, we present RNABindRPlus, a new method for predicting RNA-binding residues in proteins, using only sequence information. The predictor combines output from an optimized Support Vector Machine (SVM) classifier with the output from a novel homology-based method (HomPRIP). We show that RNABindRPlus performs better than all currently available methods for predicting interfacial residues in proteins. Third, we extract more than 30,000 unique RNA-protein interfacial motifs (RPIMs), consisting of contiguous residues from both the RNA and protein chains of characterized RNA-protein complexes. Lastly, we demonstrate the utility of RPIMs in predicting RNA-protein interaction partners. We employ them in an innovative and significantly improved method for partner prediction and show that it has both a high true positive rate and a much lower false positive rate than other available methods. Taken together, the results presented here provide important new insights into the determinants of RNA-protein recognition, in addition to valuable new software tools for interrogating and predicting RNA-protein complexes and interaction networks.