
    A two-stage approach for improved prediction of residue contact maps

    BACKGROUND: Protein topology representations such as residue contact maps are an important intermediate step towards ab initio prediction of protein structure. Although predictors have improved in recent years, the problem of accurately predicting residue contact maps from primary sequences is still largely unsolved. Among the reasons for this are the unbalanced nature of the problem (with far fewer examples of contacts than non-contacts), the formidable challenge of capturing long-range interactions in the maps, and the intrinsic difficulty of mapping one-dimensional input sequences into two-dimensional output maps. To alleviate these problems and achieve improved contact map predictions, in this paper we split the task into two stages: the prediction of a map's principal eigenvector (PE) from the primary sequence, and the reconstruction of the contact map from the PE and primary sequence. Predicting the PE from the primary sequence amounts to mapping a vector into a vector. This task is less complex than mapping vectors directly into two-dimensional matrices, since the size of the problem is drastically reduced and so is the scale length of the interactions that need to be learned. RESULTS: We develop architectures composed of ensembles of two-layered bidirectional recurrent neural networks to classify the components of the PE in 2, 3 and 4 classes from protein primary sequence, predicted secondary structure, and hydrophobicity interaction scales. Our predictor, tested on a non-redundant set of 2171 proteins, achieves classification performances of up to 72.6%, 16% above a base-line statistical predictor. We design a system for the prediction of contact maps from the predicted PE. Our results show that predicting maps through the PE yields sizeable gains, especially for long-range contacts, which are particularly critical for accurate protein 3D reconstruction. The final predictor's accuracy on a non-redundant set of 327 targets is 35.4% and 19.8% for minimum contact separations of 12 and 24, respectively, when the top length/5 contacts are selected. On the 11 CASP6 Novel Fold targets we achieve similar accuracies (36.5% and 19.7%). This compares favourably with the best automated predictors at CASP6. CONCLUSION: Our final system for contact map prediction achieves state-of-the-art performances, and may provide valuable constraints for improved ab initio prediction of protein structures. A suite of predictors of structural features, including the PE, and PE-based contact maps, is available at
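
The two quantities this abstract relies on, the principal eigenvector (PE) of a contact map and the accuracy of the top length/5 contacts at a minimum sequence separation, can be illustrated with a minimal Python sketch. It is not the authors' predictor (which uses recurrent neural networks): the function names, the 8 Å contact threshold and the shifted power iteration are assumptions made here for illustration only.

```python
import math


def contact_map(coords, threshold=8.0):
    """Binary contact map from 3D coordinates (one point per residue).

    The 8 Angstrom threshold is an assumption, not taken from the abstract.
    """
    n = len(coords)
    return [[1 if i != j and math.dist(coords[i], coords[j]) < threshold else 0
             for j in range(n)] for i in range(n)]


def principal_eigenvector(matrix, iterations=500):
    """Power iteration on (A + I) for a symmetric, non-negative matrix A.

    Adding the identity leaves the eigenvectors unchanged and avoids the
    oscillation plain power iteration can show on bipartite contact graphs.
    """
    n = len(matrix)
    v = [1.0 / math.sqrt(n)] * n
    for _ in range(iterations):
        w = [v[i] + sum(matrix[i][j] * v[j] for j in range(n)) for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


def top_contacts_accuracy(scores, native, min_separation=12, divisor=5):
    """Fraction of true contacts among the top length/divisor scored pairs
    whose sequence separation is at least min_separation."""
    n = len(native)
    pairs = [(scores[i][j], i, j)
             for i in range(n) for j in range(i + min_separation, n)]
    pairs.sort(key=lambda t: t[0], reverse=True)
    selected = pairs[:max(1, n // divisor)]
    if not selected:
        return 0.0
    return sum(native[i][j] for _, i, j in selected) / len(selected)
```

Calling `principal_eigenvector(contact_map(coords))` gives one component per residue, which is the vector the first stage is described as classifying; `top_contacts_accuracy` mirrors the "top length/5 contacts" evaluation quoted above.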

    An adaptive model for learning molecular endpoints

    I will describe a recursive neural network that deals with undirected graphs, and its application to predicting property labels or activity values of small molecules. The model is entirely general, in that it can process any undirected graph with a finite number of nodes by factorising it into a number of directed graphs with the same skeleton. The model's only input in the applications I will present is the graph representing the chemical structure of the molecule. In spite of its simplicity, the model outperforms or matches the state of the art in three of the four tasks, and in the fourth is outperformed only by a method resorting to a very problem-specific feature
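
The abstract does not say how the undirected graph is factorised, so the following Python sketch shows only one possible scheme, assumed here for illustration: for each node taken as root, every edge is oriented towards the endpoint that is closer to the root (ties broken by node id), giving one directed acyclic graph per node with the same skeleton as the input. The function names and the tie-breaking rule are hypothetical, and the graph is assumed to be connected.

```python
from collections import deque


def bfs_distances(adjacency, root):
    """Shortest-path distance (in edges) from root to every reachable node."""
    dist = {root: 0}
    queue = deque([root])
    while queue:
        node = queue.popleft()
        for neighbour in adjacency[node]:
            if neighbour not in dist:
                dist[neighbour] = dist[node] + 1
                queue.append(neighbour)
    return dist


def orient_towards(adjacency, root):
    """Orient every edge from the endpoint with the larger (distance, id) key
    to the smaller one. The key is a total order, so the result is acyclic,
    keeps every edge, and has the root as its only sink."""
    dist = bfs_distances(adjacency, root)
    edges = set()
    for u in adjacency:
        for v in adjacency[u]:
            if u in dist and v in dist and (dist[u], u) > (dist[v], v):
                edges.add((u, v))
    return edges


def factorise(adjacency):
    """One directed graph per node of the undirected input graph."""
    return {root: orient_towards(adjacency, root) for root in adjacency}
```

For a molecule, `adjacency` would map each atom index to the indices of its bonded atoms; a recursive network could then be unrolled along each directed graph, ending at its root.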

    Spritz: a server for the prediction of intrinsically disordered regions in protein sequences using kernel machines

    Intrinsically disordered proteins have long stretches of their polypeptide chain that, in the absence of binding partners, do not adopt a single native structure composed of stable secondary and tertiary structure. Predicting intrinsically disordered regions from sequence is of increasing interest, as many such regions are being discovered in complete genome sequences and important functional roles are associated with them. We have developed a machine learning approach based on two support vector machines (SVMs) to discriminate disordered regions from sequence. The SVMs are trained and benchmarked on two sets, representing long and short disordered regions. A preliminary version of Spritz was shown to perform consistently well at the recent biennial CASP-6 experiment [Critical Assessment of Techniques for Protein Structure Prediction (CASP), 2004]. The fully developed Spritz method is freely available as a web server at and

    Distill: a suite of web servers for the prediction of one-, two- and three-dimensional structural features of proteins

    BACKGROUND: We describe Distill, a suite of servers for the prediction of protein structural features: secondary structure; relative solvent accessibility; contact density; backbone structural motifs; residue contact maps at 6, 8 and 12 Angstrom; coarse protein topology. The servers are based on large-scale ensembles of recursive neural networks and trained on large, up-to-date, non-redundant subsets of the Protein Data Bank. Together with structural feature predictions, Distill includes a server for prediction of Cα traces for short proteins (up to 200 amino acids). RESULTS: The servers are state-of-the-art, with secondary structure predicted correctly for nearly 80% of residues (currently the top performance on EVA), 2-class solvent accessibility nearly 80% correct, and contact maps exceeding 50% precision on the top non-diagonal contacts. A preliminary implementation of the predictor of protein Cα traces featured among the top 20 Novel Fold predictors at the last CASP6 experiment as group Distill (ID 0348). The majority of the servers, including the Cα trace predictor, now take into account homology information from the PDB, when available, resulting in greatly improved reliability. CONCLUSION: All predictions are freely available through a simple joint web interface and the results are returned by email. In a single submission the user can send protein sequences for a total of up to 32k residues to all or a selection of the servers. Distill is accessible at the address:

    Ab initio and homology based prediction of protein domains by recursive neural networks

    Background: Proteins, especially larger ones, are often composed of individual evolutionary units, domains, which have their own function and structural fold. Predicting domains is an important intermediate step in protein analyses, including the prediction of protein structures. Results: We describe novel systems for the prediction of protein domain boundaries powered by Recursive Neural Networks. The systems rely on a combination of primary sequence and evolutionary information, predictions of structural features such as secondary structure, solvent accessibility and residue contact maps, and structural templates, both annotated for domains (from the SCOP dataset) and unannotated (from the PDB). We gauge the contribution of contact maps, and PDB and SCOP templates independently and for different ranges of template quality. We find that accurately predicted contact maps are informative for the prediction of domain boundaries, while the same is not true for contact maps predicted ab initio. We also find that gap information from PDB templates is informative, but, not surprisingly, less so than SCOP annotations. We test both systems trained on templates of all qualities, and systems trained only on templates of marginal similarity to the query (less than 25% sequence identity). While the first batch of systems produces near perfect predictions in the presence of fair to good templates, the second batch outperforms or matches ab initio predictors down to essentially any level of template quality. We test all systems in 5-fold cross-validation on a large non-redundant set of multi-domain and single domain proteins. The final predictors are state-of-the-art, with a template-less prediction boundary recall of 50.8% (precision 38.7%) within ± 20 residues and a single domain recall of 80.3% (precision 78.1%). The SCOP-based predictors achieve a boundary recall of 74% (precision 77.1%), again within ± 20 residues, and classify single domain proteins as such in over 85% of cases, when we allow a mix of bad and good quality templates. If we only allow marginal templates (max 25% sequence identity to the query) the scores remain high, with boundary recall and precision of 59% and 66.3%, and 80% of all single domain proteins predicted correctly. Conclusion: The systems presented here may prove useful in large-scale annotation of protein domains in proteins of unknown structure. The methods are available as public web servers at the address: http://distill.ucd.ie/shandy/ and we plan to run them on a multi-genomic scale and make the results public in the near future. Funding: Science Foundation Ireland; Health Research Board; UCD President's Award 2004.
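
The boundary recall and precision "within ± 20 residues" quoted above can be made concrete with a short Python sketch. The abstract does not spell out the matching protocol, so the rule used here (a boundary counts as matched if any boundary in the other set lies within the tolerance) and the function name are assumptions for illustration.

```python
def boundary_scores(predicted, true, tolerance=20):
    """Recall and precision of predicted domain boundaries.

    predicted, true: lists of residue positions of domain boundaries.
    A true boundary is recalled if some prediction lies within `tolerance`
    residues of it; a prediction is correct if some true boundary lies
    within `tolerance` residues of it (an assumed matching rule).
    """
    recalled = sum(1 for t in true
                   if any(abs(t - p) <= tolerance for p in predicted))
    correct = sum(1 for p in predicted
                  if any(abs(p - t) <= tolerance for t in true))
    recall = recalled / len(true) if true else 0.0
    precision = correct / len(predicted) if predicted else 0.0
    return recall, precision
```

For example, with predicted boundaries at residues 100 and 250 and true boundaries at 110 and 400, only the first pair falls within 20 residues, so both recall and precision are 0.5.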

    Ab initio and template-based prediction of multi-class distance maps by two-dimensional recursive neural networks

    Background: Prediction of protein structures from their sequences is still one of the open grand challenges of computational biology. Some approaches to protein structure prediction, especially ab initio ones, rely to some extent on the prediction of residue contact maps. Residue contact map predictions have been assessed at the CASP competition for several years now. Although it has been shown that exact contact maps generally yield correct three-dimensional structures, this is true only at a relatively low resolution (3–4 Å from the native structure). Another known weakness of contact maps is that they are generally predicted ab initio, that is, without exploiting information about potential homologues of known structure. Results: We introduce a new class of distance restraints for protein structures: multi-class distance maps. We show that Cα trace reconstructions based on 4-class native maps are significantly better than those from residue contact maps. We then build two predictors of 4-class maps based on recursive neural networks: one ab initio, relying on the sequence and on evolutionary information; and one template-based, in which homology information to known structures is provided as a further input. We show that virtually any level of sequence similarity to structural templates (down to less than 10%) yields more accurate 4-class maps than the ab initio predictor. We show that template-based predictions by recursive neural networks are consistently better than the best template and than a number of combinations of the best available templates. We also extract binary residue contact maps at an 8 Å threshold (as per CASP assessment) from the 4-class predictors and show that the template-based version is also more accurate than the best template and consistently better than the ab initio one, down to very low levels of sequence identity to structural templates. Furthermore, we test both ab initio and template-based 8 Å predictions on the CASP7 targets using a pre-CASP7 PDB, and find that both predictors are state-of-the-art, with the template-based one far outperforming the best CASP7 systems if templates with sequence identity to the query of 10% or better are available. Although this is not the main focus of this paper, we also report on reconstructions of Cα traces based on both ab initio and template-based 4-class map predictions, showing that the latter are generally more accurate even when homology is dubious. Conclusion: Accurate predictions of multi-class maps may provide valuable constraints for improved ab initio and template-based prediction of protein structures, naturally incorporate multiple templates, and yield state-of-the-art binary maps. Predictions of protein structures and 8 Å contact maps based on the multi-class distance map predictors described in this paper are freely available to academic users at the url http://distill.ucd.ie/. Funding: Science Foundation Ireland; Health Research Board; UCD President's Award 2004.
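
A multi-class distance map generalises a binary contact map by binning each inter-residue distance into one of several classes. The Python sketch below illustrates the idea for native coordinates. The abstract only states that there are 4 classes and that binary maps are extracted at 8 Å; the class boundaries used here (8, 13 and 19 Å) and the function names are assumptions for illustration.

```python
import bisect
import math

# Hypothetical class boundaries in Angstrom: [0, 8), [8, 13), [13, 19), [19, inf).
# Only the 8 Angstrom boundary is taken from the abstract.
THRESHOLDS = (8.0, 13.0, 19.0)


def distance_class(distance, thresholds=THRESHOLDS):
    """Index of the distance bin (0 is the closest class)."""
    return bisect.bisect_right(thresholds, distance)


def multi_class_map(coords, thresholds=THRESHOLDS):
    """Matrix of distance classes from 3D coordinates (one point per residue)."""
    n = len(coords)
    return [[distance_class(math.dist(coords[i], coords[j]), thresholds)
             for j in range(n)] for i in range(n)]


def contacts_from_classes(class_map):
    """Binary contact map: off-diagonal pairs in class 0 (closer than 8 Angstrom)."""
    n = len(class_map)
    return [[1 if i != j and class_map[i][j] == 0 else 0 for j in range(n)]
            for i in range(n)]
```

Because the first class boundary coincides with the contact threshold, the binary map follows from the multi-class one by keeping class 0 only, which is the sense in which binary maps can be extracted from 4-class predictions.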

    Infra red spectroscopy of the regulated asbestos amphiboles

    Vibrational spectroscopies (Fourier Transform Infra Red, FTIR, and Raman) are exceptionally valuable tools for the identification and crystal–chemical study of fibrous minerals, and asbestos amphiboles in particular. Raman spectroscopy has been widely applied in toxicological studies and thus a large corpus of reference data on regulated species is found in the literature. However, FTIR spectroscopy has been mostly used in crystal–chemical studies and very few data are found on asbestos amphiboles. This paper is intended to fill this gap. We report new FTIR data collected on a suite of well-characterized samples of the five regulated amphibole species: anthophyllite, amosite, and crocidolite, provided by the Union for International Cancer Control (UICC) Organization, and tremolite and actinolite, from two well-known occurrences. The data from these reference samples have been augmented by results from additional specimens to clarify some aspects of their spectroscopic features. We show that the FTIR spectra in both the OH-stretching region and in the lattice modes region can be effective for rapid identification of the asbestos type

    DOME: recommendations for supervised machine learning validation in biology

    Supervised machine learning is widely used in biology and deserves more scrutiny. We present a set of community-wide recommendations (DOME) aiming to help establish standards of supervised machine learning validation in biology. Formulated as questions, the DOME recommendations improve the assessment and reproducibility of papers when included as supplementary material. The work of the Machine Learning Focus Group was funded by ELIXIR, the research infrastructure for life-science data. I.W. was funded by the A*STAR Career Development Award (project no. C210112057) from the Agency for Science, Technology and Research (A*STAR), Singapore. D.F. was supported by Estonian Research Council grants (PRG1095, PSG59 and ERA-NET TRANSCAN-2 (BioEndoCar)); Project No 2014-2020.4.01.16-0271, ELIXIR and the European Regional Development Fund through EXCITE Center of Excellence. S.C.E.T. has received funding from the European Union’s Horizon 2020 research and innovation programme under Marie Skłodowska-Curie Grant agreements No. 778247 and No. 823886, and Italian Ministry of University and Research PRIN 2017 grant 2017483NH8. Peer Reviewed. "Article signed by 8 authors plus 28 authors of the ELIXIR Machine Learning Focus Group: Emidio Capriotti, Rita Casadio, Salvador Capella-Gutierrez, Davide Cirillo, Alessio Del Conte, Alexandros C. Dimopoulos, Victoria Dominguez Del Angel, Joaquin Dopazo, Piero Fariselli, José Maria Fernández, Florian Huber, Anna Kreshuk, Tom Lenaerts, Pier Luigi Martelli, Arcadi Navarro, Pilib Ó Broin, Janet Piñero, Damiano Piovesan, Martin Reczko, Francesco Ronzano, Venkata Satagopam, Castrense Savojardo, Vojtech Spiwok, Marco Antonio Tangaro, Giacomo Tartari, David Salgado, Alfonso Valencia & Federico Zambelli." Postprint (author's final draft)