3,194 research outputs found

    Using structural motif descriptors for sequence-based binding site prediction

    All authors are with the Biotechnological Center, TU Dresden, Tatzberg 47-51, 01307 Dresden, Germany, and Wan Kyu Kim is with the Institute for Cellular and Molecular Biology, University of Texas at Austin, Austin, TX 78712, USA.

    Background: Many protein sequences are still poorly annotated. Functional characterization of a protein is often improved by identifying its interaction partners. Here, we aim to predict protein-protein interactions (PPI) and protein-ligand interactions (PLI) at the sequence level using 3D information. To this end, we use machine learning to compile the sequential segments that constitute the structural features of an interaction site into a single profile Hidden Markov Model descriptor. The resulting collection of descriptors can be used to screen sequence databases in order to predict functional sites.

    Results: We generate descriptors for 740 classified types of protein-protein binding sites and for more than 3,000 protein-ligand binding sites. Cross-validation reveals that two thirds of the PPI descriptors are sufficiently conserved and significant to be used for binding site recognition. We further validate 230 PPIs that were extracted from the literature, for which we additionally identify the interface residues. Finally, we test ligand-binding descriptors for the case of ATP. On sequences with the Swiss-Prot annotation "ATP-binding", we achieve a recall of 25% with a precision of 89%, whereas Prosite's P-loop motif recognizes an equal number of hits at the expense of a much higher number of false positives (precision: 57%). Our method yields 771 hits with a precision of 96% that were not previously picked up by any Prosite pattern.

    Conclusion: The automatically generated descriptors are a useful complement to known Prosite/InterPro motifs. They serve to predict protein-protein as well as protein-ligand interactions, along with their binding site residues, for proteins where only sequence information is available.
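The recall and precision figures quoted above follow the standard definitions, which can be sketched as below. The counts in the usage note are hypothetical, chosen only so that the ratios reproduce the reported 89% and 25%; they are not taken from the paper.

```python
def precision_recall(true_pos: int, false_pos: int, false_neg: int) -> tuple[float, float]:
    """Return (precision, recall) from raw hit counts.

    precision = TP / (TP + FP): fraction of reported hits that are correct.
    recall    = TP / (TP + FN): fraction of annotated sites that are recovered.
    """
    reported = true_pos + false_pos
    annotated = true_pos + false_neg
    precision = true_pos / reported if reported else 0.0
    recall = true_pos / annotated if annotated else 0.0
    return precision, recall


# Hypothetical counts (an assumption for illustration, not data from the paper).
precision, recall = precision_recall(true_pos=89, false_pos=11, false_neg=267)
print(f"precision={precision:.2f} recall={recall:.2f}")
```

A descriptor with high precision but low recall, as reported here, makes few false calls but misses many annotated sites, which is why the abstract presents it as a complement to existing motifs rather than a replacement.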

    Kernel-based machine learning protocol for predicting DNA-binding proteins

    DNA-binding proteins (DNA-BPs) play a pivotal role in various intra- and extra-cellular activities ranging from DNA replication to gene expression control. Previous attempts to identify DNA-BPs from their sequence and structural information have reached moderate accuracy. Here we develop a machine learning protocol for the prediction of DNA-BPs in which the classifier is a Support Vector Machine (SVM). The information used for classification is derived from characteristics that include surface and overall composition, overall charge, and positive potential patches on the protein surface. In total, 121 DNA-BPs and 238 non-binding proteins are used to build and evaluate the protocol. In the self-consistency test, an accuracy of 100% is achieved. For cross-validation (CV) optimization over the entire dataset, we report an accuracy of 90%. Using leave-1-pair holdout evaluation, an accuracy of 86.3% is achieved. When we restrict the dataset to proteins with less than 20% sequence identity to one another, the holdout accuracy is 85.8%. Furthermore, seven DNA-BPs with unbound structures are all correctly predicted. The current performance is better than previously published results. The higher accuracy achieved here originates from two factors: the ability of the SVM to handle features with a wide range of discriminatory power, and a different definition of the positive patch. Since our protocol does not rely on sequence or structural homology, it can be used to identify or predict proteins with DNA-binding function(s) regardless of their homology to known ones.
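To make the "overall composition" and "overall charge" features more concrete, here is a minimal sketch of how such values could be computed from a sequence alone. It is an assumption-laden simplification, not the authors' protocol: the function names are hypothetical, charge is approximated by counting Lys/Arg as +1 and Asp/Glu as -1, and the surface-based features described in the abstract (which require a 3D structure) are not covered.

```python
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues
POSITIVE = set("KR")  # simplification: histidine and termini are ignored
NEGATIVE = set("DE")


def composition(seq: str) -> dict[str, float]:
    """Fraction of each standard amino acid in the sequence."""
    seq = seq.upper()
    if not seq:
        raise ValueError("sequence must not be empty")
    counts = Counter(seq)
    return {aa: counts[aa] / len(seq) for aa in AMINO_ACIDS}


def net_charge(seq: str) -> int:
    """Crude net charge: (#K + #R) - (#D + #E)."""
    seq = seq.upper()
    return sum(aa in POSITIVE for aa in seq) - sum(aa in NEGATIVE for aa in seq)


def feature_vector(seq: str) -> list[float]:
    """20 composition fractions followed by the net charge (21 values)."""
    comp = composition(seq)
    return [comp[aa] for aa in AMINO_ACIDS] + [float(net_charge(seq))]


# Toy peptide used purely for illustration.
print(feature_vector("KKRDE"))
```

A vector of this kind is the sort of fixed-length input an SVM classifier expects; in practice the features would also be scaled before training.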

    A structural classification of protein-protein interactions for detection of convergently evolved motifs and for prediction of protein binding sites on sequence level

    BACKGROUND: A long-standing challenge in the post-genomic era of bioinformatics is the prediction of protein-protein interactions, and ultimately the prediction of protein functions. The problem is intrinsically harder when only amino acid sequences are available, but a solution is more universally applicable. So far, the problem of uncovering protein-protein interactions has been addressed in a variety of ways, both experimentally and computationally.

    MOTIVATION: The central problem is: how can protein complexes with solved three-dimensional structure be utilized to identify and classify protein binding sites, and how can knowledge be inferred from this classification such that protein interactions can be predicted for proteins without solved structure? The underlying hypothesis is that protein binding sites are often restricted to a small number of residues, which additionally are often well conserved in order to maintain an interaction. Therefore, the signal-to-noise ratio in binding sites is expected to be higher than in other parts of the surface. This enables binding site detection in unknown proteins when homology-based annotation transfer fails.

    APPROACH: The problem is addressed by first investigating how geometrical aspects of domain-domain associations can lead to a rigorous structural classification of the multitude of protein interface types. The interface types are explored with respect to two aspects: first, how do interface types with one-sided homology reveal convergently evolved motifs? Second, how can sequential descriptors for local structural features be derived from the interface type classification? Then, the use of sequential representations of binding sites to predict protein interactions is investigated. The underlying algorithms are based on machine learning techniques, in particular Hidden Markov Models.

    RESULTS: This work includes a novel approach to a comprehensive geometrical classification of domain interfaces. Alternative structural domain associations are found for 40% of all family-family interactions. Evaluation of the classification algorithm on a hand-curated set of interfaces yielded a precision of 83% and a recall of 95%. For the first time, a systematic screen of convergently evolved motifs in 102,000 protein-protein interactions with structural information is derived. With respect to this dataset, all cases related to viral mimicry of human interface bindings are identified. Finally, a library of 740 motif descriptors for binding site recognition, encoded as Hidden Markov Models, is generated and cross-validated. Tests for the significance of motifs are provided. The usefulness of descriptors for protein-ligand binding sites is demonstrated for the case of "ATP-binding", where a precision of 89% is achieved, thus outperforming comparable motifs from PROSITE. In particular, a novel descriptor for a P-loop variant has been used to identify ATP-binding sites in 60 protein sequences that have not been annotated before by existing motif databases.

    Pockets as structural descriptors of EGFR kinase conformations

    Epidermal Growth Factor Receptor (EGFR), a tyrosine kinase receptor, is one of the main tumor markers in different types of cancers. The kinase native state is mainly composed of two populations of conformers: active and inactive. Several sequence variations in the EGFR kinase region promote the differential enrichment of conformers with higher activity. Some structural characteristics have been proposed to differentiate kinase conformations, but these considerations could lead to ambiguous classifications. We present a structural characterisation of EGFR kinase conformers, focused on active site pocket comparisons and the mapping of known pathological sequence variations. A structure-based clustering of this pocket accurately discriminates active from inactive, well-characterised conformations. Furthermore, this main pocket contains, or is in close contact with, ≈65% of cancer-related variation positions. Although the relevance of protein dynamics to explaining biological function has been extensively recognised, the use of the ensemble of conformations in dynamic equilibrium to represent the functional state of proteins, and the importance of pockets, cavities and/or tunnels, were often neglected in previous studies. These functional structures and the equilibrium between them could be structurally analysed in wild type as well as in sequence variants. Our results indicate that biologically important pockets, as well as their shape and dynamics, are central to understanding protein function in wild-type, polymorphic or disease-related variations.

    Authors: Hasenahuer, Marcia Anahí; Barletta Roldan, Patricio German; Fernández Alberti, Sebastián; Parisi, Gustavo Daniel; Fornasari, Maria Silvina. All authors: Universidad Nacional de Quilmes, Departamento de Ciencia y Tecnología, Argentina; Consejo Nacional de Investigaciones Científicas y Técnicas, Argentina.

    Prediction of the functional class of metal-binding proteins from sequence derived physicochemical properties by support vector machine approach

    Metal-binding proteins play important roles in structural stability, signaling, regulation, transport, immune response, metabolism control, and metal homeostasis. Because of their functional and sequence diversity, it is desirable to explore additional methods for predicting metal-binding proteins irrespective of sequence similarity. This work explores support vector machines (SVM) as such a method. SVM prediction systems were developed using 53,333 metal-binding and 147,347 non-metal-binding proteins, and evaluated on an independent set of 31,448 metal-binding and 79,051 non-metal-binding proteins. The computed prediction accuracy is 86.3%, 81.6%, 83.5%, 94.0%, 81.2%, 85.4%, 77.6%, 90.4%, 90.9%, 74.9% and 78.1% for calcium-binding, cobalt-binding, copper-binding, iron-binding, magnesium-binding, manganese-binding, nickel-binding, potassium-binding, sodium-binding, zinc-binding, and all metal-binding proteins, respectively. The accuracy for the non-member proteins of each class is 88.2%, 99.9%, 98.1%, 91.4%, 87.9%, 94.5%, 99.2%, 99.9%, 99.9%, 98.0%, and 88.0%, respectively. Comparable accuracies were obtained using a different SVM kernel function. Our method predicts as metal-binding 67% of the 87 metal-binding proteins that are non-homologous to any protein in the Swissprot database, and 85.3% of the 333 proteins with known metal-binding domains. These results suggest the usefulness of SVM for facilitating the prediction of metal-binding proteins. Our software can be accessed at the SVMProt server.

    Molecular Modeling of the Enzyme 3β-Hydroxysteroid Dehydrogenase Type 2: Combined Homology Modeling, Docking and QSAR Approach

    A homology model of human 3β-HSD type 2 has been developed by homology modeling using the Phyre2 server and refined with ModRefiner. The PROCHECK, QMEAN and ProSA-web online tools were used to evaluate the stereochemical quality of the model. The Ramachandran plot produced by PROCHECK showed that 84.5% of residues are in the most favored region, 13.7% are in the additional allowed region, 1.5% are in the generously allowed region and 0.3% are in the disallowed region. The QMEAN score (Z-score) is 0.509 (-3.006), and the Z-score from ProSA-web is -7.10. Negative values of protein fold energy were also found across almost the entire sequence. Furthermore, molecular docking was applied to validate the model using MOE. Hydrogen bonding interactions with Tyr154, Ser124, and Ser218 are found in all docked substrates as well as in known inhibitors (trilostane and epostane). A dataset of azasteroid inhibitors was also docked into the substrate active site of human 3β-HSD2. These docked structures were used to construct a corresponding docking-based QSAR equation by employing genetic algorithm (GA) statistical analysis. The best QSAR equation constructed has robust predictive power according to its statistical parameters, and hence may be applied to supersede the default scoring function provided by the docking software. These results indicate that the human 3β-HSD2 model was successfully evaluated as a good model.