
    New methods for protein structure prediction using machine learning and deep learning

    Computational protein structure prediction is one of the most challenging problems in bioinformatics. Due to the widespread use of the sampling-and-selection strategy, protein model quality assessment has become important. In this dissertation, new machine learning and deep learning methods are proposed for protein model quality assessment, protein contact prediction, protein model refinement, and loop modeling. The goal of model quality assessment (QA) is to estimate the quality of predicted protein models. First, two new single-model QA methods based on Residual Neural Networks, called PDRN and VDRN, were proposed to achieve state-of-the-art performance. They used a comprehensive set of structure features to predict a quality score in the range [0, 1]. Next, three single-model QA methods, MMQA-1, MMQA-2, and MMQA-HE, were proposed based on the ideas of two-stage learning and hierarchical ensembles. MMQA-1 and MMQA-2 divided the entire feature set into two different sets and used different feature sets and training data in each stage of learning. In addition, MMQA-HE created ensembles of models in the first stage of learning for improved performance. In CASP14, MMQA-1 ranked No. 2 in terms of average GDT-TS difference. MMQA-2 and MMQA-HE outperformed MMQA-1 consistently across different QA performance metrics in our experiments. Furthermore, a quasi-single-model QA method called INC-QA was proposed; it trained a deep neural network as a QA predictor for each protein target based on template structure information generated from the target sequence. Experimental results using CASP data showed that INC-QA achieved state-of-the-art results, outperforming existing methods in the CASP QA stage 2 category on CASP13 targets. With the release of the groundbreaking protein structure prediction software AlphaFold2 and RoseTTAFold, many research teams have started using them to generate highly accurate protein models.
We evaluated the performance of different QA methods on models generated by these tools and randomly modified by 3DRobot, and found that multi-model QA methods were still better than single-model QA methods on this kind of high-performance model pool. Finally, in terms of the prediction of overall folding accuracy and overall interface accuracy for protein complexes in CASP15, we found a strong correlation between the predicted folding accuracy and predicted interface accuracy of protein models. Loop modeling tries to predict the conformation of a relatively short stretch of protein backbone and side chains. It is a difficult problem due to conformational variability. AlphaFold2 achieved outstanding results in 3-D protein structure prediction and was expected to perform well on loop modeling. We investigated the performance of AlphaFold2 variants on loop modeling benchmark datasets and proposed an efficient constant-time method of using AlphaFold2 for loop modeling, called IAFLoop. To predict the structure of a loop region, IAFLoop ran a fast version of AlphaFold2 with a reduced database and without ensembling on an extended segment of the target loop region, and used RMSD-based consensus scores to select the top models. Our experimental results showed that IAFLoop generated highly accurate loop models, outperforming basic AlphaFold2 by up to 17 percent in RMSD error while using less than half of the time. Compared to the previous best method, IAFLoop reduced the RMSD error by more than half. Contact map prediction aims to predict whether the Euclidean distance between two Cβ atoms (Cα for glycine) in a protein structure is less than 8 angstroms. Contact information can act as a powerful constraint for determining the overall structure and can assist the protein 3D structure prediction process.
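To make the contact definition above concrete, the following is a minimal sketch of how a binary contact map could be derived from known coordinates. It is not the dissertation's code. The function name `contact_map`, the constant `CONTACT_THRESHOLD`, and the toy coordinates are all hypothetical, and the input is assumed to be one representative atom position per residue (Cβ, or Cα for glycine) in angstroms.

```python
import math

# 8 angstrom cutoff, as stated in the abstract's definition of a contact.
CONTACT_THRESHOLD = 8.0


def contact_map(coords, threshold=CONTACT_THRESHOLD):
    """Return a symmetric 0/1 matrix of residue-residue contacts.

    coords: one (x, y, z) tuple per residue, assumed to be the C-beta
    position (C-alpha for glycine). A pair is in contact when its
    Euclidean distance is strictly less than `threshold`.
    The diagonal (a residue with itself) is left at 0.
    """
    n = len(coords)
    cmap = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if math.dist(coords[i], coords[j]) < threshold:
                cmap[i][j] = cmap[j][i] = 1
    return cmap


# Hypothetical toy coordinates, not taken from any real structure.
toy_coords = [
    (0.0, 0.0, 0.0),
    (3.8, 0.0, 0.0),
    (7.6, 0.0, 0.0),
    (20.0, 0.0, 0.0),
]
toy_map = contact_map(toy_coords)
for row in toy_map:
    print(row)
```

In this toy case the first three residues are mutually within 8 angstroms, while the fourth is in contact with none of them. A predictor such as the one described here would estimate this matrix from sequence-derived features rather than from coordinates.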
Building on MUFold-Contact, a new two-stage multi-branch deep neural network based on the Residual Network and Inception V3 Network was proposed to improve on the performance of MUFold-Contact. In the first stage, distance maps of short-range, medium-range, and long-range residue pairs were predicted separately, and the predicted distances, along with other features, were used as input to predict a binary contact map in the second stage. The role of protein structure refinement is to take models generated by the protein structure prediction process and bring them closer to the true native structure. Inspired by AlphaFold in CASP13, a new protein structure refinement process, MUFOLD-REFINE, based on the distance distribution of a template pool, was developed and achieved improved performance over the MUFOLD refinement method used in CASP13.
Includes bibliographical references.