98,127 research outputs found

    Collective estimation of multiple bivariate density functions with application to angular-sampling-based protein loop modeling

    Get PDF
    This article develops a method for simultaneous estimation of density functions for a collection of populations of protein backbone angle pairs using a data-driven, shared basis that is constructed by bivariate spline functions defined on a triangulation of the bivariate domain. The circular nature of angular data is taken into account by imposing appropriate smoothness constraints across boundaries of the triangles. Maximum penalized likelihood is used to fit the model and an alternating blockwise Newton-type algorithm is developed for computation. A simulation study shows that the collective estimation approach is statistically more efficient than estimating the densities individually. The proposed method was used to estimate neighbor-dependent distributions of protein backbone dihedral angles (i.e., Ramachandran distributions). The estimated distributions were applied to protein loop modeling, one of the most challenging open problems in protein structure prediction, by feeding them into an angular-sampling-based loop structure prediction framework. Our estimated distributions compared favorably to the Ramachandran distributions estimated by fitting a hierarchical Dirichlet process model; and in particular, our distributions showed significant improvements on the hard cases where existing methods do not work well

    LEAP: highly accurate prediction of protein loop conformations by integrating coarse-grained sampling and optimized energy scores with all-atom refinement of backbone and side chains

    Get PDF
    Prediction of protein loop conformations without any prior knowledge (ab initio prediction) is an unsolved problem. Its solution will significantly impact protein homology and template-based modeling as well as ab initio protein-structure prediction. Here, we developed a coarse-grained, optimized scoring function for initial sampling and ranking of loop decoys. The resulting decoys are then further optimized in backbone and side-chain conformations and ranked by all-atom energy scoring functions. The final integrated technique called loop prediction by energy-assisted protocol achieved a median value of 2.1 Å root mean square deviation (RMSD) for 325 12-residue test loops and 2.0 Å RMSD for 45 12-residue loops from critical assessment of structure-prediction techniques (CASP) 10 target proteins with native core structures (backbone and side chains). If all side-chain conformations in protein cores were predicted in the absence of the target loop, loop-prediction accuracy only reduces slightly (0.2 Å difference in RMSD for 12-residue loops in the CASP target proteins). The accuracy obtained is about 1 Å RMSD or more improvement over other methods we tested. The executable file for a Linux system is freely available for academic users at http://sparks-lab.org

    New methods for protein structure prediction using machine learning and deep learning

    Get PDF
    Computational protein structure prediction is one of the most challenging problems in bioinformatics area. Due to the widespread use of sampling-and-selection strategy, protein model quality assessment became important. In this dissertation, new machine learning and deep learning methods have been proposed for protein model quality assessment, protein contact prediction, protein model refinement, and loop modeling. The goal of model quality assessment (QA) is to estimate the quality of predicted protein models. First, two new single-model QA methods based on Residual Neural Networks, called PDRN and VDRN, were proposed to achieve state-of-the-art performance. They used a comprehensive set of structure features to predict a quality score in the range of [0, 1]. Next, three single-model QA methods, MMQA-1 MMQA-2 and MMQA-HE, were proposed based on ideas of two-stage learning and hierarchical ensembles. MMQA-1 and MMQA-2 divided the entire feature set into two different sets and used different feature sets and training data in each stage of learning. In addition, MMQA-HE created ensembles of models in the first stage of learning for improved performance. In CASP14, MMQA-1 ranked NO. 2 in terms of average GDT-TS difference. MMQA-2 and MMQA-HE outperformed MMQA-1 consistently across different QA performance metrics in our experiments. Furthermore, a quasi-single-model QA method called INC-QA was proposed using a new method that trained a deep neural network as a QA predictor for each protein target based on template structure information generated from the target sequence. Experimental results using CASP data showed that INC-QA achieved state-of-the-art results, outperforming existing methods on CASP QA stage 2 category on CASP 13 targets. With the release of groundbreaking protein structure prediction software AlphaFold2 and RosettaFold, many research teams start using them to generate highly accurate protein models. We evaluated the performance of different QA methods on models generated by them with random modification by 3DRobot and found that multi-model QA methods were still better than single-model QA methods on these kind of high-performance model pools. Finally, in terms of the prediction of overall folding accuracy and overall interface accuracy for protein complexes in CASP15, we found a strong correlation between the predicted folding accuracy and predicted interface accuracy of protein models. Loop modeling tries to predict the conformation of a relatively short stretch of protein backbone and sidechain. It is a difficult problem due to conformational variability. AlphaFold2 achieved outstanding results in 3-D protein structure prediction and was expected to perform well on loop modeling. We investigated the performances of AlphaFold2 variants on loop modeling benchmark datasets and proposed an efficient constant-time method of using AlphaFold2 for loop modeling, called IAFLoop. To predict the structure of a loop region, IAFLoop ran a fast version of AlphaFold2 with a reduced database without ensembling on an extended segment of the target loop region, and used RMSD based consensus scores to select the top models. Our experimental results showed that IAFLoop generated highly accurate loop models, outperforming basic AlphaFold2 by up to 17 percent in RMSD error, while using less than half of the time. Compared to the previous best method, IAFLoop reduces the RMSD error by more than half. Contact map prediction is to predict whether the Euclidean distance between two C[beta] atoms (C[alpha] for Glycine) in a protein structure is less than 8 angstroms. Contacts information can act as a powerful constraint for determining the overall structural and assist the protein 3D structure prediction process. Based on MUFold-Contact, a new two-stage multi-branch deep neural network based on Residual Network and Inception V3 Network was proposed to improve the performance of MUFold-Contact. In the first stage, distance maps of shortrange, medium-range and long-range residue pairs were predicted, respectively, and the predicted distance along with other features were used as input to predict a binary contact map in the second stage. The role of protein structure refinement is to take models generated by protein structure prediction process and bring them closer to the true native structure. Inspired by AlphaFold in CASP13, a new protein structure refinement process MUFOLD-REFINE based on distance distribution of template pool was developed and achieve improved performance over the MUFOLD refinement method used in CASP13Includes bibliographical references

    Including Functional Annotations and Extending the Collection of Structural Classifications of Protein Loops (ArchDB)

    Get PDF
    Loops represent an important part of protein structures. The study of loop is critical for two main reasons: First, loops are often involved in protein function, stability and folding. Second, despite improvements in experimental and computational structure prediction methods, modeling the conformation of loops remains problematic. Here, we present a structural classification of loops, ArchDB, a mine of information with application in both mentioned fields: loop structure prediction and function prediction. ArchDB (http://sbi.imim.es/archdb) is a database of classified protein loop motifs. The current database provides four different classification sets tailored for different purposes. ArchDB-40, a loop classification derived from SCOP40, well suited for modeling common loop motifs. Since features relevant to loop structure or function can be more easily determined on well-populated clusters, we have developed ArchDB-95, a loop classification derived from SCOP95. This new classification set shows a ~40% increase in the number of subclasses, and a large 7-fold increase in the number of putative structure/function-related subclasses. We also present ArchDB-EC, a classification of loop motifs from enzymes, and ArchDB-KI, a manually annotated classification of loop motifs from kinases. Information about ligand contacts and PDB sites has been included in all classification sets. Improvements in our classification scheme are described, as well as several new database features, such as the ability to query by conserved annotations, sequence similarity, or uploading 3D coordinates of a protein. The lengths of classified loops range between 0 and 36 residues long. ArchDB offers an exhaustive sampling of loop structures. Functional information about loops and links with related biological databases are also provided. All this information and the possibility to browse/query the database through a web-server outline an useful tool with application in the comparative study of loops, the analysis of loops involved in protein function and to obtain templates for loop modeling

    Computational protein structure prediction using deep learning

    Get PDF
    Protein structure prediction is of great importance in bioinformatics and computational biology. Over the past 30 years, many machine learning methods have been developed for this problem in homology-based and ab-initio approaches. Recently, deep learning has been successfully applied and has outperformed previous methods. Deep learning methods could effectively handle high dimensional feature inputs in modeling the complex mapping from protein primary amino acid sequences to protein 2-D or 3-D structures. In this dissertation, new deep learning methods and deep learning networks have been proposed for three problems in protein structure prediction: loop modeling, contact map prediction, and contact map refinement. They have been implemented in the state-of-the-art MUFOLD software and obtained significant performance improvement. The goal of loop modeling is to predict the conformation of a relatively short stretch of protein backbone. A new method based on Generative Adversarial Network (GAN), called MUFOLD-LM, is proposed. The protein 3-D structure can be represented using the 2-D distance map of C [subscript alpha] atoms. The missing region in the structure will be a missing region in the distance map correspondingly. Our network uses the Generator Network to fill in the missing regions in the distance map based on the context, and the Discriminator Network will take both the predicted complete distance map and the ground truth as input to distinguish between them. The method utilizes both the features and context of the missing loop region to make better prediction of the 3-D structure of the loop region. In experiments using commonly used benchmark datasets 8-Res and 12-Res, MUFOLD-LM outperformed previous methods significantly, up to 43.9 [percent] and 4.13 [percent] in RMSD, respectively. To the best of our knowledge, it is the first successful GAN application in protein structure prediction. The goal of contact map prediction is to predict whether the distance between two C [subscript beta] atoms (C [subscript alpha] for Glycine) in a protein falls within a certain threshold. It can help to determine the global s"tructure of a protein in order to assist the 3D modeling process. In this work, a new two-stage multi-branch neural network based on Fully Convolutional Network and Dilated Residual Network, called MUFOLD_Contact, is proposed. It formulates the problem as a pixel-wise regression and classification problem. The first stage predicts distance maps for short-, medium-, and long-range residue pairs. The second stage takes the predicted distances from stage 1 along with other features as input to predict a binary contact map. The method utilizes the distance distribution information in the feature set to improve the binary prediction results. In experiments using CASP13 targets, the new method outperformed single stage networks and is comparable with the best existing tools. In addition to predicting contact directly using deep neural networks, a new method, called TPCref (Template Prediction Correction refinement), is proposed to refine and improve the prediction results of a contact predictor using protein templates. Based on the idea of collaborative filtering from recommendation system, TPCref first finds multiple template sequences based on the target sequence and uses the templates' structures and the templates' predicted contact map generated by a contact predictor to form a target contact map filter using the idea of collaborative filtering. Then the contact-map filter is used to refine the predicted contact map. In experimental results using recently released PDB proteins, TPCref significantly improved the contact prediction results of existing predictors, improving MUFOLD_Contact, MetaPSICOV, and CCMPred by 5.0 [percent], 12.8 [percent], and 37.2 [percent], respectively. The proposed new methods have been implemented in MUFOLD, a comprehensive platform for protein structure prediction. It provides a rich set of functions, including database generation, secondary and supersecondary structure prediction, beta-turn and gamma-turn prediction, contact map prediction and refinement, protein 3D structure prediction, loop modeling, model quality assessment, and model refinement. In this work, a new modularized MUFOLD pipeline has been designed and developed. Each module is decoupled from each other and provides standard communication protocol interfaces for other programs to call. The modularization provides the capability to easily integrate new algorithms and tools to have a fast iteration during research. In addition, a new web portal for MUFOLD has been designed and implemented to provide online services or APIs of our tools to the community

    The FALC-Loop web server for protein loop modeling

    Get PDF
    The FALC-Loop web server provides an online interface for protein loop modeling by employing an ab initio loop modeling method called FALC (fragment assembly and analytical loop closure). The server may be used to construct loop regions in homology modeling, to refine unreliable loop regions in experimental structures or to model segments of designed sequences. The FALC method is computationally less expensive than typical ab initio methods because the conformational search space is effectively reduced by the use of fragments derived from a structure database. The analytical loop closure algorithm allows efficient search for loop conformations that fit into the protein framework starting from the fragment-assembled structures. The FALC method shows prediction accuracy comparable to other state-of-the-art loop modeling methods. Top-ranked model structures can be visualized on the web server, and an ensemble of loop structures can be downloaded for further analysis. The web server can be freely accessed at http://falc-loop.seoklab.org/

    Full cyclic coordinate descent: solving the protein loop closure problem in Cα space

    Get PDF
    BACKGROUND: Various forms of the so-called loop closure problem are crucial to protein structure prediction methods. Given an N- and a C-terminal end, the problem consists of finding a suitable segment of a certain length that bridges the ends seamlessly. In homology modelling, the problem arises in predicting loop regions. In de novo protein structure prediction, the problem is encountered when implementing local moves for Markov Chain Monte Carlo simulations. Most loop closure algorithms keep the bond angles fixed or semi-fixed, and only vary the dihedral angles. This is appropriate for a full-atom protein backbone, since the bond angles can be considered as fixed, while the (φ, ψ) dihedral angles are variable. However, many de novo structure prediction methods use protein models that only consist of Cα atoms, or otherwise do not make use of all backbone atoms. These methods require a method that alters both bond and dihedral angles, since the pseudo bond angle between three consecutive Cα atoms also varies considerably. RESULTS: Here we present a method that solves the loop closure problem for Cα only protein models. We developed a variant of Cyclic Coordinate Descent (CCD), an inverse kinematics method from the field of robotics, which was recently applied to the loop closure problem. Since the method alters both bond and dihedral angles, which is equivalent to applying a full rotation matrix, we call our method Full CCD (FCDD). FCCD replaces CCD's vector-based optimization of a rotation around an axis with a singular value decomposition-based optimization of a general rotation matrix. The method is easy to implement and numerically stable. CONCLUSION: We tested the method's performance on sets of random protein Cα segments between 5 and 30 amino acids long, and a number of loops of length 4, 8 and 12. FCCD is fast, has a high success rate and readily generates conformations close to those of real loops. The presence of constraints on the angles only has a small effect on the performance. A reference implementation of FCCD in Python is available as supplementary information

    Protein Loop Prediction by Fragment Assembly

    Get PDF
    If the primary sequence of a protein is known, what is its three-dimensional structure? This is one of the most challenging problems in molecular biology and has many applications in proteomics. During the last three decades, this issue has been extensively researched. Techniques such as the protein folding approach have been demonstrated to be promising in predicting the core areas of proteins - α-helices and β-strands. However, loops that contain no regular units of secondary structure elements remain the most difficult regions for prediction. The protein loop prediction problem is to predict the spatial structure of a loop given the primary sequence of a protein and the spatial structures of all the other regions. There are two major approaches used to conduct loop prediction – the ab initio folding and database searching methods. The loop prediction accuracy is unsatisfactory because of the hypervariable property of the loops. The key contribution proposed by this thesis is a novel fragment assembly algorithm using branch-and-cut to tackle the loop prediction problem. We present various pruning rules to reduce the search space and to speed up the finding of good loop candidates. The algorithm has the advantages of the database-search approach and ensures that the predicted loops are physically reasonable. The algorithm also benefits from ab initio folding since it enumerates all the possible loops in the discrete approximation of the conformation space. We implemented the proposed algorithm as a protein loop prediction tool named LoopLocker. A test set from CASP6, the world wide protein structure prediction competition, was used to evaluate the performance of LoopLocker. Experimental results showed that LoopLocker is capable of predicting loops of 4, 8, 11-12, 13-15 residues with average RMSD errors of 0.452, 1.410, 1.741 and 1.895 A respectively. In the PDB, more than 90% loops are fewer than 15 residues. This concludes that our fragment assembly algorithm is successful in tackling the loop prediction problem

    Saturating representation of loop conformational fragments in structure databanks

    Get PDF
    BACKGROUND: Short fragments of proteins are fundamental starting points in various structure prediction applications, such as in fragment based loop modeling methods but also in various full structure build-up procedures. The applicability and performance of these approaches depend on the availability of short fragments in structure databanks. RESULTS: We studied the representation of protein loop fragments up to 14 residues in length. All possible query fragments found in sequence databases (Sequence Space) were clustered and cross referenced with available structural fragments in Protein Data Bank (Structure Space). We found that the expansion of PDB in the last few years resulted in a dense coverage of loop conformational fragments. For each loops of length 8 in the current Sequence Space there is at least one loop in Structure Space with 50% or higher sequence identity. By correlating sequence and structure clusters of loops we found that a 50% sequence identity generally guarantees structural similarity. These percentages of coverage at 50% sequence cutoff drop to 96, 94, 68, 53, 33 and 13% for loops of length 9, 10, 11, 12, 13, and 14, respectively. There is not a single loop in the current Sequence Space at any length up to 14 residues that is not matched with a conformational segment that shares at least 20% sequence identity. This minimum observed identity is 40% for loops of 12 residues or shorter and is as high as 50% for 10 residue or shorter loops. We also assessed the impact of rapidly growing sequence databanks on the estimated number of new loop conformations and found that while the number of sequentially unique sequence segments increased about six folds during the last five years there are almost no unique conformational segments among these up to 12 residues long fragments. CONCLUSION: The results suggest that fragment based prediction approaches are not limited any more by the completeness of fragments in databanks but rather by the effective scoring and search algorithms to locate them. The current favorable coverage and trends observed will be further accentuated with the progress of Protein Structure Initiative that targets new protein folds and ultimately aims at providing an exhaustive coverage of the structure space

    Prediction of backbone dihedral angles and protein secondary structure using support vector machines

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The prediction of the secondary structure of a protein is a critical step in the prediction of its tertiary structure and, potentially, its function. Moreover, the backbone dihedral angles, highly correlated with secondary structures, provide crucial information about the local three-dimensional structure.</p> <p>Results</p> <p>We predict independently both the secondary structure and the backbone dihedral angles and combine the results in a loop to enhance each prediction reciprocally. Support vector machines, a state-of-the-art supervised classification technique, achieve secondary structure predictive accuracy of 80% on a non-redundant set of 513 proteins, significantly higher than other methods on the same dataset. The dihedral angle space is divided into a number of regions using two unsupervised clustering techniques in order to predict the region in which a new residue belongs. The performance of our method is comparable to, and in some cases more accurate than, other multi-class dihedral prediction methods.</p> <p>Conclusions</p> <p>We have created an accurate predictor of backbone dihedral angles and secondary structure. Our method, called DISSPred, is available online at <url>http://comp.chem.nottingham.ac.uk/disspred/</url>.</p
    corecore