70,299 research outputs found

    Better prediction of protein contact number using a support vector regression analysis of amino acid sequence

    BACKGROUND: Protein tertiary structure can be partly characterized via each amino acid's contact number measuring how residues are spatially arranged. The contact number of a residue in a folded protein is a measure of its exposure to the local environment, and is defined as the number of Cβ atoms in other residues within a sphere around the Cβ atom of the residue of interest. Contact number is partly conserved between protein folds and thus is useful for protein fold and structure prediction. In turn, each residue's contact number can be partially predicted from primary amino acid sequence, assisting tertiary fold analysis from sequence data. In this study, we provide a more accurate contact number prediction method from protein primary sequence. RESULTS: We predict contact number from protein sequence using a novel support vector regression algorithm. Using protein local sequences with multiple sequence alignments (PSI-BLAST profiles), we demonstrate a correlation coefficient between predicted and observed contact numbers of 0.70, which outperforms previously achieved accuracies. Including additional information about sequence weight and amino acid composition further improves prediction accuracies significantly with the correlation coefficient reaching 0.73. If residues are classified as being either "contacted" or "non-contacted", the prediction accuracies are all greater than 77%, regardless of the choice of classification thresholds. CONCLUSION: The successful application of support vector regression to the prediction of protein contact number reported here, together with previous applications of this approach to the prediction of protein accessible surface area and B-factor profile, suggests that a support vector regression approach may be very useful for determining the structure-function relation between primary protein sequence and higher order consecutive protein structural and functional properties.
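The contact number defined in the abstract above (Cβ atoms of other residues inside a sphere centred on a residue's own Cβ atom) can be illustrated with a minimal sketch. The abstract does not state the sphere radius, so the 10 Å default here is an assumption, as are the function name `contact_numbers` and the toy coordinates; this is not the authors' code.

```python
import math

def contact_numbers(cb_coords, radius=10.0):
    """For each residue, count the C-beta atoms of OTHER residues within `radius`.

    `radius` (in angstroms) is an assumed value; the abstract does not give one.
    """
    counts = []
    for i, a in enumerate(cb_coords):
        n = 0
        for j, b in enumerate(cb_coords):
            # Skip the residue itself; count neighbours inside the sphere.
            if i != j and math.dist(a, b) <= radius:
                n += 1
        counts.append(n)
    return counts

# Made-up coordinates: four C-beta atoms on a line, 6 angstroms apart.
coords = [(0.0, 0.0, 0.0), (6.0, 0.0, 0.0), (12.0, 0.0, 0.0), (18.0, 0.0, 0.0)]
print(contact_numbers(coords))  # [1, 2, 2, 1]
```

The two end residues each see one neighbour within 10 Å and the two inner residues see two, which is the "exposure" signal that the regression model is trained to predict from sequence.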

    A two-stage approach for improved prediction of residue contact maps

    BACKGROUND: Protein topology representations such as residue contact maps are an important intermediate step towards ab initio prediction of protein structure. Although improvements have occurred in recent years, the problem of accurately predicting residue contact maps from primary sequences is still largely unsolved. Among the reasons for this are the unbalanced nature of the problem (with far fewer examples of contacts than non-contacts), the formidable challenge of capturing long-range interactions in the maps, and the intrinsic difficulty of mapping one-dimensional input sequences into two-dimensional output maps. In order to alleviate these problems and achieve improved contact map predictions, in this paper we split the task into two stages: the prediction of a map's principal eigenvector (PE) from the primary sequence, and the reconstruction of the contact map from the PE and primary sequence. Predicting the PE from the primary sequence consists of mapping a vector into a vector. This task is less complex than mapping vectors directly into two-dimensional matrices since the size of the problem is drastically reduced and so is the scale length of interactions that need to be learned. RESULTS: We develop architectures composed of ensembles of two-layered bidirectional recurrent neural networks to classify the components of the PE in 2, 3 and 4 classes from protein primary sequence, predicted secondary structure, and hydrophobicity interaction scales. Our predictor, tested on a non-redundant set of 2171 proteins, achieves classification performances of up to 72.6%, 16% above a baseline statistical predictor. We design a system for the prediction of contact maps from the predicted PE. Our results show that predicting maps through the PE yields sizeable gains especially for long-range contacts which are particularly critical for accurate protein 3D reconstruction.
The final predictor's accuracy on a non-redundant set of 327 targets is 35.4% and 19.8% for minimum contact separations of 12 and 24, respectively, when the top length/5 contacts are selected. On the 11 CASP6 Novel Fold targets we achieve similar accuracies (36.5% and 19.7%). This favourably compares with the best automated predictors at CASP6. CONCLUSION: Our final system for contact map prediction achieves state-of-the-art performances, and may provide valuable constraints for improved ab initio prediction of protein structures. A suite of predictors of structural features, including the PE, and PE-based contact maps, is available at
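To make the intermediate target of the two-stage approach concrete, the sketch below computes the principal eigenvector (PE) of a known binary contact map by power iteration. It only shows what the PE is; it does not reproduce the recurrent-network predictor described in the abstract. The function name, the iteration count, and the three-residue toy map are illustrative assumptions.

```python
import math

def principal_eigenvector(contact_map, iterations=200):
    """Power iteration for the eigenvector of the largest eigenvalue.

    `contact_map` is a symmetric 0/1 matrix given as a list of lists.
    """
    n = len(contact_map)
    v = [1.0] * n
    for _ in range(iterations):
        # Multiply by (A + I): the identity shift leaves the eigenvectors
        # unchanged but stops the iteration oscillating on bipartite maps.
        w = [v[i] + sum(contact_map[i][j] * v[j] for j in range(n))
             for i in range(n)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Toy map: residue 1 contacts residues 0 and 2.
toy_map = [
    [0, 1, 0],
    [1, 0, 1],
    [0, 1, 0],
]
pe = principal_eigenvector(toy_map)
print([round(x, 4) for x in pe])  # [0.5, 0.7071, 0.5]
```

The residue with the most contacts receives the largest PE component, which is why a length-N vector can carry useful information about the N×N map.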

    Predicting residue-wise contact orders in proteins by support vector regression

    BACKGROUND: The residue-wise contact order (RWCO) describes the sequence separations between the residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered as a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and give deep insights into protein sequence-structure relationships. RESULTS: We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance, including local sequence in the form of PSI-BLAST profiles, local sequence plus amino acid composition, local sequence plus molecular weight, local sequence plus secondary structure predicted by PSIPRED, local sequence plus molecular weight and amino acid composition, local sequence plus molecular weight and predicted secondary structure, and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55, and root mean square error (RMSE) of 0.82, based on a well-defined dataset with 680 protein sequences.
Moreover, by incorporating global features such as molecular weight and amino acid composition we could further improve the prediction performance with the CC to 0.57 and an RMSE of 0.79. In addition, combining the predicted secondary structure by PSIPRED was found to significantly improve the prediction performance and could yield the best prediction accuracy with a CC of 0.60 and RMSE of 0.78, which provided at least comparable performance compared with the other existing methods. CONCLUSION: The SVR method shows a prediction performance competitive with or at least comparable to the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the fact that support vector regression is a powerful tool in extracting the protein sequence-structure relationship and in estimating the protein structural profiles from amino acid sequences
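As a rough illustration of the quantity being predicted, the sketch below computes a residue-wise contact order from a known binary contact map. The abstract does not give the formula, so this uses one plausible formulation as an assumption: for each residue, sum the sequence separations |i - j| to its contacting residues and divide by the chain length. The `min_separation` cutoff and the toy map are hypothetical too.

```python
def residue_wise_contact_order(contact_map, min_separation=3):
    """Sum of sequence separations to contacting residues, divided by chain length.

    Assumed formulation; `min_separation` (ignoring near neighbours along the
    chain) is a hypothetical parameter, not taken from the abstract.
    """
    n = len(contact_map)
    rwco = []
    for i in range(n):
        total = sum(
            abs(i - j)
            for j in range(n)
            if abs(i - j) >= min_separation and contact_map[i][j]
        )
        rwco.append(total / n)
    return rwco

# Toy 5-residue map with contacts (0, 4) and (1, 4), stored symmetrically.
toy_contacts = [[0] * 5 for _ in range(5)]
for a, b in [(0, 4), (1, 4)]:
    toy_contacts[a][b] = toy_contacts[b][a] = 1

print(residue_wise_contact_order(toy_contacts))  # [0.8, 0.6, 0.0, 0.0, 1.4]
```

Residue 4 has the highest value because both of its contacts are far away along the chain, which is the long-range signal that RWCO is meant to capture.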

    Computational protein structure prediction using deep learning

    Protein structure prediction is of great importance in bioinformatics and computational biology. Over the past 30 years, many machine learning methods have been developed for this problem in homology-based and ab-initio approaches. Recently, deep learning has been successfully applied and has outperformed previous methods. Deep learning methods could effectively handle high dimensional feature inputs in modeling the complex mapping from protein primary amino acid sequences to protein 2-D or 3-D structures. In this dissertation, new deep learning methods and deep learning networks have been proposed for three problems in protein structure prediction: loop modeling, contact map prediction, and contact map refinement. They have been implemented in the state-of-the-art MUFOLD software and obtained significant performance improvement. The goal of loop modeling is to predict the conformation of a relatively short stretch of protein backbone. A new method based on Generative Adversarial Network (GAN), called MUFOLD-LM, is proposed. The protein 3-D structure can be represented using the 2-D distance map of Cα atoms. The missing region in the structure will be a missing region in the distance map correspondingly. Our network uses the Generator Network to fill in the missing regions in the distance map based on the context, and the Discriminator Network will take both the predicted complete distance map and the ground truth as input to distinguish between them. The method utilizes both the features and context of the missing loop region to make better prediction of the 3-D structure of the loop region. In experiments using commonly used benchmark datasets 8-Res and 12-Res, MUFOLD-LM outperformed previous methods significantly, up to 43.9% and 4.13% in RMSD, respectively. To the best of our knowledge, it is the first successful GAN application in protein structure prediction.
The goal of contact map prediction is to predict whether the distance between two Cβ atoms (Cα for Glycine) in a protein falls within a certain threshold. It can help to determine the global structure of a protein in order to assist the 3D modeling process. In this work, a new two-stage multi-branch neural network based on Fully Convolutional Network and Dilated Residual Network, called MUFOLD_Contact, is proposed. It formulates the problem as a pixel-wise regression and classification problem. The first stage predicts distance maps for short-, medium-, and long-range residue pairs. The second stage takes the predicted distances from stage 1 along with other features as input to predict a binary contact map. The method utilizes the distance distribution information in the feature set to improve the binary prediction results. In experiments using CASP13 targets, the new method outperformed single stage networks and is comparable with the best existing tools. In addition to predicting contact directly using deep neural networks, a new method, called TPCref (Template Prediction Correction refinement), is proposed to refine and improve the prediction results of a contact predictor using protein templates. Based on the idea of collaborative filtering from recommendation system, TPCref first finds multiple template sequences based on the target sequence and uses the templates' structures and the templates' predicted contact map generated by a contact predictor to form a target contact map filter using the idea of collaborative filtering. Then the contact-map filter is used to refine the predicted contact map. In experimental results using recently released PDB proteins, TPCref significantly improved the contact prediction results of existing predictors, improving MUFOLD_Contact, MetaPSICOV, and CCMPred by 5.0%, 12.8%, and 37.2%, respectively.
The proposed new methods have been implemented in MUFOLD, a comprehensive platform for protein structure prediction. It provides a rich set of functions, including database generation, secondary and supersecondary structure prediction, beta-turn and gamma-turn prediction, contact map prediction and refinement, protein 3D structure prediction, loop modeling, model quality assessment, and model refinement. In this work, a new modularized MUFOLD pipeline has been designed and developed. Each module is decoupled from each other and provides standard communication protocol interfaces for other programs to call. The modularization provides the capability to easily integrate new algorithms and tools to have a fast iteration during research. In addition, a new web portal for MUFOLD has been designed and implemented to provide online services or APIs of our tools to the community

    Protein secondary structure: Entropy, correlations and prediction

    Is protein secondary structure primarily determined by local interactions between residues closely spaced along the amino acid backbone, or by non-local tertiary interactions? To answer this question we have measured the entropy densities of primary structure and secondary structure sequences, and the local inter-sequence mutual information density. We find that the important inter-sequence interactions are short ranged, that correlations between neighboring amino acids are essentially uninformative, and that only 1/4 of the total information needed to determine the secondary structure is available from local inter-sequence correlations. Since the remaining information must come from non-local interactions, this observation supports the view that the majority of most proteins fold via a cooperative process where secondary and tertiary structure form concurrently. To provide a more direct comparison to existing secondary structure prediction methods, we construct a simple hidden Markov model (HMM) of the sequences. This HMM achieves a prediction accuracy comparable to other single sequence secondary structure prediction algorithms, and can extract almost all of the inter-sequence mutual information. This suggests that these algorithms are almost optimal, and that we should not expect a dramatic improvement in prediction accuracy. However, local correlations between secondary and primary structure are probably of under-appreciated importance in many tertiary structure prediction methods, such as threading. Comment: 8 pages, 5 figures

    CLP-based protein fragment assembly

    The paper investigates a novel approach, based on Constraint Logic Programming (CLP), to predict the 3D conformation of a protein via fragment assembly. The fragments are extracted by a preprocessor (also developed for this work) from a database of known protein structures that clusters and classifies the fragments according to similarity and frequency. The problem of assembling fragments into a complete conformation is mapped to a constraint solving problem and solved using CLP. The constraint-based model uses a medium discretization degree Ca-side chain centroid protein model that offers efficiency and a good approximation for space filling. The approach adapts existing energy models to the protein representation used and applies a large neighboring search strategy. The results show the feasibility and efficiency of the method. The declarative nature of the solution allows future extensions to be included, e.g., different size fragments for better accuracy. Comment: special issue dedicated to ICLP 201

    Modelling the structure of full-length Epstein-Barr virus nuclear antigen 1

    Epstein-Barr virus (EBV) is a clinically important human virus associated with several cancers and is the etiologic agent of infectious mononucleosis. The viral nuclear antigen-1 (EBNA1) is central to the replication and propagation of the viral genome and likely contributes to tumourigenesis. We have compared EBNA1 homologues from other primate lymphocryptoviruses (LCV) and found that the central glycine/alanine repeat (GAr) domain, as well as predicted cellular protein (USP7 and CK2) binding sites are present in homologues in the Old World primates, but not the marmoset, suggesting that these motifs may have co-evolved. Using the resolved structure of the C-terminal one third of EBNA1 (homodimerisation and DNA binding domain), we have gone on to develop monomeric and dimeric models in silico of the full length protein. The C-terminal domain is predicted to be structurally highly similar between homologues, indicating conserved function. Zinc could be stably incorporated into the model, bonding with two N-terminal cysteines predicted to facilitate multimerisation. The GAr contains secondary structural elements in the models, while the protein binding regions are unstructured, irrespective of the prediction approach used and sequence origin. These intrinsically disordered regions may facilitate the diversity observed in partner interactions. We hypothesise that the structured GAr could mask the disordered regions, thereby protecting the protein from default degradation. In the dimer conformation, the C-terminal tails of each monomer wrap around a proline-rich protruding loop of the partner monomer, providing dimer stability, a feature which could be exploited in therapeutic design.