38,809 research outputs found

    Improving protein structure prediction by deep learning and computational optimization

    Protein structure prediction is one of the most important scientific problems in bioinformatics and computational biology. The availability of a protein's three-dimensional (3D) structure is crucial for studying its biological and cellular functions. Four major sub-problems in protein structure prediction are widely recognized: protein secondary structure prediction, protein fold recognition, protein quality assessment, and multi-domain assembly. In recent years, deep learning has proved to be a highly effective machine learning method and has brought revolutionary advances in computer vision, speech recognition, and bioinformatics. This dissertation describes five contributions. First, DNSS2, a method for protein secondary structure prediction using a one-dimensional deep convolutional network. Second, DeepSF, a method that applies a deep convolutional network to classify a protein sequence into one of thousands of known folds. Third, CNNQA and DeepRank, two deep neural network approaches that systematically evaluate the quality of predicted protein structures and select the most accurate model as the final protein structure prediction. Fourth, MULTICOM, a protein structure prediction system empowered by deep learning and protein contact prediction. Finally, SAXSDOM, a data-assisted method for protein domain assembly using small-angle X-ray scattering data. All the methods are freely available to the scientific community as software tools or web servers. Includes vita and bibliographical references.

    Improving prediction of secondary structure, local backbone angles, and solvent accessible surface area of proteins by iterative deep learning

    Direct prediction of protein structure from sequence is a challenging problem. An effective approach is to break it up into sub-problems, such as prediction of protein secondary structure, that can then be solved independently. In a previous study, we found that an iterative use of predicted secondary structure and backbone torsion angles can further improve secondary structure and torsion angle prediction. In this study, we expand the iterative features to include solvent accessible surface area and backbone angles and dihedrals based on Cα atoms. By using a deep learning neural network in three iterations, we achieved 82% accuracy for secondary structure prediction, 0.76 for the correlation coefficient between predicted and actual solvent accessible surface area, 19° and 30° for mean absolute errors of backbone φ and ψ angles, respectively, and 8° and 32° for mean absolute errors of Cα-based θ and τ angles, respectively, for an independent test dataset of 1199 proteins. The accuracy of the method is slightly lower for 72 CASP11 targets but much higher than that of model structures from current state-of-the-art techniques. This suggests that these predicted properties could be beneficial for model assessment and ranking.
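The mean absolute errors quoted above are for angles, which are periodic: a prediction of 179° against a true value of -179° is off by 2°, not 358°. The following is a minimal sketch of an angle-aware mean absolute error in Python. The function name `angular_mae` and the choice to take the shorter arc are assumptions made for illustration; the abstract does not spell out how the authors computed the error.

```python
def angular_mae(predicted, actual):
    """Mean absolute error between two lists of angles given in degrees.

    Assumption: each difference is measured along the shorter arc,
    so it always lies in the range [0, 180].
    """
    if len(predicted) != len(actual) or not predicted:
        raise ValueError("inputs must be non-empty and of equal length")
    total = 0.0
    for p, a in zip(predicted, actual):
        diff = abs(p - a) % 360.0
        total += min(diff, 360.0 - diff)  # wrap around the circle
    return total / len(predicted)


# Two residues whose errors are 20 and 10 degrees once wraparound is handled.
print(angular_mae([170.0, -175.0], [-170.0, 175.0]))  # 15.0
```

Without the wraparound step the same two residues would report errors of 340° and 350°, which is why a plain absolute difference is misleading for φ, ψ, θ, and τ.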

    Computational protein structure prediction using deep learning

    Protein structure prediction is of great importance in bioinformatics and computational biology. Over the past 30 years, many machine learning methods have been developed for this problem in homology-based and ab initio approaches. Recently, deep learning has been successfully applied and has outperformed previous methods. Deep learning methods can effectively handle high-dimensional feature inputs in modeling the complex mapping from protein primary amino acid sequences to protein 2-D or 3-D structures. In this dissertation, new deep learning methods and deep learning networks have been proposed for three problems in protein structure prediction: loop modeling, contact map prediction, and contact map refinement. They have been implemented in the state-of-the-art MUFOLD software and obtained significant performance improvement. The goal of loop modeling is to predict the conformation of a relatively short stretch of protein backbone. A new method based on a Generative Adversarial Network (GAN), called MUFOLD-LM, is proposed. The protein 3-D structure can be represented using the 2-D distance map of Cα atoms. A missing region in the structure corresponds to a missing region in the distance map. Our network uses the Generator Network to fill in the missing regions in the distance map based on the context, and the Discriminator Network takes both the predicted complete distance map and the ground truth as input to distinguish between them. The method utilizes both the features and context of the missing loop region to make better predictions of the 3-D structure of the loop region. In experiments using the commonly used benchmark datasets 8-Res and 12-Res, MUFOLD-LM outperformed previous methods significantly, by up to 43.9% and 4.13% in RMSD, respectively. To the best of our knowledge, it is the first successful GAN application in protein structure prediction.
The goal of contact map prediction is to predict whether the distance between two Cβ atoms (Cα for glycine) in a protein falls within a certain threshold. It can help to determine the global structure of a protein in order to assist the 3D modeling process. In this work, a new two-stage multi-branch neural network based on a Fully Convolutional Network and a Dilated Residual Network, called MUFOLD_Contact, is proposed. It formulates the problem as a pixel-wise regression and classification problem. The first stage predicts distance maps for short-, medium-, and long-range residue pairs. The second stage takes the predicted distances from stage 1 along with other features as input to predict a binary contact map. The method utilizes the distance distribution information in the feature set to improve the binary prediction results. In experiments using CASP13 targets, the new method outperformed single-stage networks and is comparable with the best existing tools. In addition to predicting contacts directly using deep neural networks, a new method, called TPCref (Template Prediction Correction refinement), is proposed to refine and improve the prediction results of a contact predictor using protein templates. Based on the idea of collaborative filtering from recommender systems, TPCref first finds multiple template sequences based on the target sequence, and then uses the templates' structures and the templates' predicted contact maps generated by a contact predictor to form a target contact-map filter. The contact-map filter is then used to refine the predicted contact map. In experiments using recently released PDB proteins, TPCref significantly improved the contact prediction results of existing predictors, improving MUFOLD_Contact, MetaPSICOV, and CCMpred by 5.0%, 12.8%, and 37.2%, respectively.
The proposed new methods have been implemented in MUFOLD, a comprehensive platform for protein structure prediction. It provides a rich set of functions, including database generation, secondary and supersecondary structure prediction, beta-turn and gamma-turn prediction, contact map prediction and refinement, protein 3D structure prediction, loop modeling, model quality assessment, and model refinement. In this work, a new modularized MUFOLD pipeline has been designed and developed. The modules are decoupled from one another, and each provides standard communication protocol interfaces for other programs to call. The modularization makes it easy to integrate new algorithms and tools and to iterate quickly during research. In addition, a new web portal for MUFOLD has been designed and implemented to provide online services and APIs for our tools to the community.
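The abstract above relies on two representations: a 2-D distance map built from per-residue atom coordinates, and a binary contact map obtained by thresholding it. The Python sketch below shows both, plus how a missing loop appears as a masked band in the distance map. The 8 Å default threshold is a commonly used value and is an assumption here, since the abstract only says "a certain threshold"; the helper names `distance_map`, `contact_map`, and `mask_loop` are hypothetical and are not MUFOLD functions.

```python
import math


def distance_map(coords):
    """Pairwise Euclidean distances between per-residue (x, y, z) coordinates."""
    return [[math.dist(a, b) for b in coords] for a in coords]


def contact_map(dmap, threshold=8.0):
    """Binary map: 1 where two different residues are closer than `threshold`.

    The 8.0 default (in angstroms) is an assumed, commonly used cutoff.
    """
    n = len(dmap)
    return [[1 if i != j and dmap[i][j] < threshold else 0 for j in range(n)]
            for i in range(n)]


def mask_loop(dmap, start, end):
    """Copy of `dmap` with every entry involving residues start..end-1 set to None.

    This mimics a missing loop: its rows and columns are unknown.
    """
    n = len(dmap)
    return [[None if (start <= i < end or start <= j < end) else dmap[i][j]
             for j in range(n)] for i in range(n)]


# Four residues spaced 3.8 angstroms apart along a straight line (toy data).
coords = [(0.0, 0.0, 0.0), (3.8, 0.0, 0.0), (7.6, 0.0, 0.0), (11.4, 0.0, 0.0)]
dmap = distance_map(coords)
print(contact_map(dmap)[0])       # contacts of residue 0 with residues 0..3
print(mask_loop(dmap, 1, 2)[0])   # residue 1 treated as a missing loop
```

A loop-modeling network in the style described would take the masked map as input and try to fill in the `None` band; that part is not sketched here.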

    Distance-based Protein Folding Powered by Deep Learning

    Contact-assisted protein folding has made very good progress, but two challenges remain. One is accurate contact prediction for proteins lacking many sequence homologs, and the other is that time-consuming folding simulation is often needed to predict good 3D models from predicted contacts. We show that a protein distance matrix can be predicted well by deep learning and then directly used to construct 3D models without any folding simulation. Using distance geometry to construct 3D models from our predicted distance matrices, we successfully folded 21 of the 37 CASP12 hard targets with a median family size of 58 effective sequence homologs within 4 hours on a Linux computer with 20 CPUs. In contrast, contacts predicted by direct coupling analysis (DCA) cannot fold any of them in the absence of folding simulation, and the best CASP12 group folded 11 of them by integrating predicted contacts into complex, fragment-based folding simulation. Rigorous experimental validation on 15 CASP13 targets shows that among the 3 hardest targets of new fold, our distance-based folding servers successfully folded 2 large ones with <150 sequence homologs while the other servers failed on all three, and that our ab initio folding server also predicted the best, high-quality 3D model for a large homology modeling target. Further experimental validation in CAMEO shows that our ab initio folding server predicted the correct fold for a membrane protein of new fold with 200 residues and 229 sequence homologs while all the other servers failed. These results imply that deep learning offers an efficient and accurate solution for ab initio folding on a personal computer.
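To illustrate how 3D coordinates can be built directly from a distance matrix, here is a minimal sketch of classical multidimensional scaling (MDS), one standard distance-geometry technique. It is not necessarily the procedure used by the authors, which the abstract does not detail. It assumes a complete and exactly Euclidean distance matrix, whereas predicted matrices are noisy and would need a more robust treatment. The function name `classical_mds` and the use of power iteration for the eigenvectors are choices made here to keep the example dependency-free.

```python
import math
import random


def classical_mds(dist, dims=3, iters=500, seed=0):
    """Coordinates (up to rotation, reflection, translation) from pairwise distances."""
    n = len(dist)
    # Double-center the squared distances to obtain the Gram matrix.
    sq = [[dist[i][j] ** 2 for j in range(n)] for i in range(n)]
    row_mean = [sum(row) / n for row in sq]
    grand_mean = sum(row_mean) / n
    gram = [[-0.5 * (sq[i][j] - row_mean[i] - row_mean[j] + grand_mean)
             for j in range(n)] for i in range(n)]

    rng = random.Random(seed)
    coords = [[0.0] * dims for _ in range(n)]
    for d in range(dims):
        # Power iteration: converges to the dominant eigenvector, assuming
        # the Gram matrix is positive semidefinite (exact Euclidean input).
        v = [rng.uniform(-1.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in v))
        v = [x / norm for x in v]
        for _ in range(iters):
            w = [sum(g * x for g, x in zip(row, v)) for row in gram]
            norm = math.sqrt(sum(x * x for x in w))
            if norm < 1e-12:
                break  # no variance left in the remaining directions
            v = [x / norm for x in w]
        gv = [sum(g * x for g, x in zip(row, v)) for row in gram]
        eigval = sum(x * y for x, y in zip(v, gv))  # Rayleigh quotient
        scale = math.sqrt(max(eigval, 0.0))
        for i in range(n):
            coords[i][d] = v[i] * scale
        # Deflate so the next pass finds the next eigenvector.
        for i in range(n):
            for j in range(n):
                gram[i][j] -= eigval * v[i] * v[j]
    return coords


# Toy usage: rebuild five points from their distances alone.
points = [(0, 0, 0), (4, 0, 0), (0, 3, 0), (0, 0, 1), (4, 3, 1)]
dmat = [[math.dist(a, b) for b in points] for a in points]
model = classical_mds(dmat)
```

Distances alone cannot distinguish a structure from its mirror image, so any distance-based reconstruction needs a separate step to choose the correct handedness.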

    Accurate De Novo Prediction of Protein Contact Map by Ultra-Deep Learning Model

    Recently, exciting progress has been made on protein contact prediction, but the predicted contacts for proteins without many sequence homologs are still of low quality and not very useful for de novo structure prediction. This paper presents a new deep learning method that predicts contacts by integrating both evolutionary coupling (EC) and sequence conservation information through an ultra-deep neural network formed by two deep residual networks. This deep neural network allows us to model very complex sequence-contact relationships as well as long-range inter-contact correlation. Our method greatly outperforms existing contact prediction methods and leads to much more accurate contact-assisted protein folding. Tested on three datasets of 579 proteins, the average top L long-range prediction accuracy obtained by our method, the representative EC method CCMpred, and the CASP11 winner MetaPSICOV is 0.47, 0.21 and 0.30, respectively; the average top L/10 long-range accuracy of our method, CCMpred and MetaPSICOV is 0.77, 0.47 and 0.59, respectively. Ab initio folding using our predicted contacts as restraints can yield correct folds (i.e., TMscore>0.6) for 203 test proteins, while that using MetaPSICOV- and CCMpred-predicted contacts can do so for only 79 and 62 proteins, respectively. Further, our contact-assisted models have much better quality than template-based models. Using our predicted contacts as restraints, we can (ab initio) fold 208 of the 398 membrane proteins with TMscore>0.5. By contrast, when the training proteins of our method are used as templates, homology modeling can only do so for 10 of them. One interesting finding is that even if we do not train our prediction models with any membrane proteins, our method works very well on membrane protein prediction. Finally, in the recent blind CAMEO benchmark, our method successfully folded 5 test proteins with a novel fold.
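The "top L" and "top L/10" long-range accuracies reported above rank predicted residue pairs by confidence and check what fraction of the highest-ranked ones are true contacts, where L is the sequence length. The Python sketch below shows one way to compute this. Treating "long-range" as a sequence separation of at least 24 residues is a common convention and is an assumption here, as is dividing by the number of pairs actually selected; the function name and argument layout are hypothetical.

```python
def top_k_long_range_precision(scores, native, seq_len, k_divisor=1, min_sep=24):
    """Fraction of the top L/k_divisor long-range predictions that are true contacts.

    scores: dict mapping (i, j) residue index pairs to predicted probability.
    native: set of (i, j) pairs, with i < j, that are true contacts.
    min_sep: assumed minimum sequence separation for a long-range pair.
    """
    candidates = []
    for (i, j), prob in scores.items():
        lo, hi = min(i, j), max(i, j)
        if hi - lo >= min_sep:
            candidates.append((prob, lo, hi))
    candidates.sort(reverse=True)  # highest predicted probability first
    top = candidates[: max(1, seq_len // k_divisor)]
    if not top:
        return 0.0
    hits = sum(1 for _, lo, hi in top if (lo, hi) in native)
    return hits / len(top)


# Toy example for a 50-residue protein; (2, 10) is ignored as short-range.
scores = {(0, 30): 0.9, (1, 40): 0.8, (2, 10): 0.99, (5, 45): 0.1}
native = {(0, 30), (5, 45)}
print(top_k_long_range_precision(scores, native, seq_len=50, k_divisor=25))  # 0.5
```

With `k_divisor=25` the function keeps the 2 best long-range pairs, of which one is a true contact, giving 0.5. Setting `k_divisor=1` or `k_divisor=10` gives the top L and top L/10 variants.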

    Protein Secondary Structure Prediction Using Cascaded Convolutional and Recurrent Neural Networks

    Protein secondary structure prediction is an important problem in bioinformatics. Inspired by the recent successes of deep neural networks, in this paper, we propose an end-to-end deep network that predicts protein secondary structures from integrated local and global contextual features. Our deep architecture leverages convolutional neural networks with different kernel sizes to extract multiscale local contextual features. In addition, considering the long-range dependencies in amino acid sequences, we set up a bidirectional recurrent network built from gated recurrent units to capture global contextual features. Furthermore, multi-task learning is used to predict secondary structure labels and amino-acid solvent accessibility simultaneously. Our proposed deep network demonstrates its effectiveness by achieving state-of-the-art performance, i.e., 69.7% Q8 accuracy on the public benchmark CB513, 76.9% Q8 accuracy on CASP10 and 73.1% Q8 accuracy on CASP11. Our model and results are publicly available. Comment: 8 pages, 3 figures, Accepted by International Joint Conferences on Artificial Intelligence (IJCAI
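Q8 accuracy, as reported above, is the fraction of residues whose predicted secondary structure matches the true label across eight classes. The sketch below computes it in Python and also shows a reduction from eight classes to three for a Q3 score. The mapping used (H, G, I to helix; E, B to strand; everything else to coil) is one common convention and is an assumption here, since papers differ on it; the function names are hypothetical.

```python
# Assumed 8-to-3 reduction: helices -> H, strands -> E, all other labels -> C.
EIGHT_TO_THREE = {"H": "H", "G": "H", "I": "H", "E": "E", "B": "E"}


def per_residue_accuracy(predicted, actual):
    """Fraction of positions where the predicted label equals the true label."""
    if len(predicted) != len(actual) or not actual:
        raise ValueError("label strings must be non-empty and of equal length")
    correct = sum(1 for p, a in zip(predicted, actual) if p == a)
    return correct / len(actual)


def reduce_to_three(labels):
    """Map 8-class labels to 3 classes using the assumed convention above."""
    return "".join(EIGHT_TO_THREE.get(label, "C") for label in labels)


predicted = "HHHHEEEE"
actual = "HHHGEEBT"
print(per_residue_accuracy(predicted, actual))  # Q8 -> 0.625
print(per_residue_accuracy(reduce_to_three(predicted),
                           reduce_to_three(actual)))  # Q3 -> 0.875
```

The same prediction scores higher under Q3 than under Q8 because confusions within a group, such as H versus G, no longer count as errors. This is one reason Q3 and Q8 figures from different papers are not directly comparable.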