241,907 research outputs found

    DeepSF: deep convolutional neural network for mapping protein sequences to folds

    Get PDF
    Motivation Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a tar get protein based on the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Only a few methods had been developed to classify protein sequences into a small number of folds due to methodological limitations, which are not generally useful in practice. Results We develop a deep 1D-convolution neural network (DeepSF) to directly classify any protein se quence into one of 1195 known folds, which is useful for both fold recognition and the study of se quence-structure relationship. Different from traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and map it to the fold space. We train and test our method on the datasets curated from SCOP1.75, yielding a classification accuracy of 80.4%. On the independent testing dataset curated from SCOP2.06, the classification accuracy is 77.0%. We compare our method with a top profile profile alignment method - HHSearch on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 14.5%-29.1% higher than HHSearch on template-free modeling targets and 4.5%-16.7% higher on hard template-based modeling targets for top 1, 5, and 10 predicted folds. The hidden features extracted from sequence by our method is robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking.Comment: 28 pages, 13 figure

    Improving protein fold recognition using the amalgamation of evolutionary-based and structural-based information

    Get PDF
    Deciphering three dimensional structure of a protein sequence is a challenging task in biological science. Protein fold recognition and protein secondary structure prediction are transitional steps in identifying the three dimensional structure of a protein. For protein fold recognition, evolutionary-based information of amino acid sequences from the position specific scoring matrix (PSSM) has been recently applied with improved results. On the other hand, the SPINE-X predictor has been developed and applied for protein secondary structure prediction. Several reported methods for protein fold recognition have only limited accuracy. In this paper, we have developed a strategy of combining evolutionary-based information (from PSSM) and predicted secondary structure using SPINE-X to improve protein fold recognition. The strategy is based on finding the probabilities of amino acid pairs (AAP). The proposed method has been tested on several protein benchmark datasets and an improvement of 8.9% recognition accuracy has been achieved. We have achieved, for the first time over 90% and 75% prediction accuracies for sequence similarity values below 40% and 25%, respectively. We also obtain 90.6% and 77.0% prediction accuracies, respectively, for the Extended Ding and Dubchak and Taguchi and Gromiha benchmark protein fold recognition datasets widely used for in the literature

    Protein fold recognition by alignment of amino acid residues using kernelized dynamic time warping

    Get PDF
    In protein fold recognition, a protein is classified into one of its folds. The recognition of a protein fold can be done by employing feature extraction methods to extract relevant information from protein sequences and then by using a classifier to accurately recognize novel protein sequences. In the past, several feature extraction methods have been developed but with limited recognition accuracy only. Protein sequences of varying lengths share the same fold and therefore they are very similar (in a fold) if aligned properly. To this, we develop an amino acid alignment method to extract important features from protein sequences by computing dissimilarity distances between proteins. This is done by measuring distance between two respective position specific scoring matrices of protein sequences which is used in a support vector machine framework. We demonstrated the effectiveness of the proposed method on several benchmark datasets. The method shows significant improvement in the fold recognition performance which is in the range of 4.3–7.6% compared to several other existing feature extraction methods

    Knowledge-based potentials in protein fold recognition

    Get PDF
    An accurate potential function is essential for protein folding problem and structure prediction. Two different types of potential energy functions are currently in use. The first type is based on the law of physics and second type is referred to as statistical potentials or knowledge based potentials.  In the latter type, the energy function is extracted from statistical analysis of experimental data of known protein structures. By increasing the amount of three dimensional protein structures, this approach is growing rapidly.There are various forms of knowledge based potentials depending on how statistics are calculated and how proteins are modeled. In this review, we explain how the knowledge based potentials are extracted by using known protein structures and briefly compare many of the potentials in theory

    DeepFrag-k: A Fragment-Based Deep Learning Approach for Protein Fold Recognition

    Get PDF
    Background: One of the most essential problems in structural bioinformatics is protein fold recognition. In this paper, we design a novel deep learning architecture, so-called DeepFrag-k, which identifies fold discriminative features at fragment level to improve the accuracy of protein fold recognition. DeepFrag-k is composed of two stages: the first stage employs a multi-modal Deep Belief Network (DBN) to predict the potential structural fragments given a sequence, represented as a fragment vector, and then the second stage uses a deep convolutional neural network (CNN) to classify the fragment vector into the corresponding fold. Results: Our results show that DeepFrag-k yields 92.98% accuracy in predicting the top-100 most popular fragments, which can be used to generate discriminative fragment feature vectors to improve protein fold recognition. Conclusions: There is a set of fragments that can serve as structural “keywords” distinguishing between major protein folds. The deep learning architecture in DeepFrag-k is able to accurately identify these fragments as structure features to improve protein fold recognition

    DeepSF: Deep convolutional neural network for mapping protein sequences to folds

    Get PDF
    Motivation Protein fold recognition is an important problem in structural bioinformatics. Almost all traditional fold recognition methods use sequence (homology) comparison to indirectly predict the fold of a target protein based on the fold of a template protein with known structure, which cannot explain the relationship between sequence and fold. Only a few methods had been developed to classify protein sequences into a small number of folds due to methodological limitations, which are not generally useful in practice. Results We develop a deep 1D-convolution neural network (DeepSF) to directly classify any protein sequence into one of 1195 known folds, which is useful for both fold recognition and the study of sequence-structure relationship. Different from traditional sequence alignment (comparison) based methods, our method automatically extracts fold-related features from a protein sequence of any length and maps it to the fold space. We train and test our method on the datasets curated from SCOP1.75, yielding an average classification accuracy of 75.3%. On the independent testing dataset curated from SCOP2.06, the classification accuracy is 73.0%. We compare our method with a top profile-profile alignment method -HHSearch on hard template-based and template-free modeling targets of CASP9-12 in terms of fold recognition accuracy. The accuracy of our method is 12.63-26.32% higher than HHSearch on template-free modeling targets and 3.39-17.09% higher on hard template-based modeling targets for top 1, 5 and 10 predicted folds. The hidden features extracted from sequence by our method is robust against sequence mutation, insertion, deletion and truncation, and can be used for other protein pattern recognition problems such as protein clustering, comparison and ranking

    Protein fold recognition using HMM–HMM alignment and dynamic programming

    Get PDF
    Detecting three dimensional structures of protein sequences is a challenging task in biological sciences. For this purpose, protein fold recognition has been utilized as an intermediate step which helps in classifying a novel protein sequence into one of its folds. The process of protein fold recognition encompasses feature extraction of protein sequences and feature identification through suitable classi- fiers. Several feature extractors are developed to retrieve useful information from protein sequences. These features are generally extracted by constituting protein’s sequential, physicochemical and evolutionary properties. The performance in terms of recognition accuracy has also been gradually improved over the last decade. However, it is yet to reach a well reasonable and accepted level. In this work, we first applied HMM–HMM alignment of protein sequence from HHblits to extract profile HMM (PHMM) matrix. Then we computed the distance between respective PHMM matrices using kernalized dynamic programming. We have recorded significant improvement in fold recognition over the state-of-the-art feature extractors. The improvement of recognition accuracy is in the range of 2.7–11.6% when experimented on three benchmark datasets from Structural Classification of Proteins

    DescFold: A web server for protein fold recognition

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Machine learning-based methods have been proven to be powerful in developing new fold recognition tools. In our previous work [Zhang, Kochhar and Grigorov (2005) <it>Protein Science</it>, <b>14</b>: 431-444], a machine learning-based method called DescFold was established by using Support Vector Machines (SVMs) to combine the following four descriptors: a profile-sequence-alignment-based descriptor using Psi-blast <it>e</it>-values and bit scores, a sequence-profile-alignment-based descriptor using Rps-blast <it>e</it>-values and bit scores, a descriptor based on secondary structure element alignment (SSEA), and a descriptor based on the occurrence of PROSITE functional motifs. In this work, we focus on the improvement of DescFold by incorporating more powerful descriptors and setting up a user-friendly web server.</p> <p>Results</p> <p>In seeking more powerful descriptors, the profile-profile alignment score generated from the COMPASS algorithm was first considered as a new descriptor (i.e., PPA). When considering a profile-profile alignment between two proteins in the context of fold recognition, one protein is regarded as a template (i.e., its 3D structure is known). Instead of a sequence profile derived from a Psi-blast search, a structure-seeded profile for the template protein was generated by searching its structural neighbors with the assistance of the TM-align structural alignment algorithm. Moreover, the COMPASS algorithm was used again to derive a profile-structural-profile-alignment-based descriptor (i.e., PSPA). We trained and tested the new DescFold in a total of 1,835 highly diverse proteins extracted from the SCOP 1.73 version. When the PPA and PSPA descriptors were introduced, the new DescFold boosts the performance of fold recognition substantially. Using the SCOP_1.73_40% dataset as the fold library, the DescFold web server based on the trained SVM models was further constructed. To provide a large-scale test for the new DescFold, a stringent test set of 1,866 proteins were selected from the SCOP 1.75 version. At a less than 5% false positive rate control, the new DescFold is able to correctly recognize structural homologs at the fold level for nearly 46% test proteins. Additionally, we also benchmarked the DescFold method against several well-established fold recognition algorithms through the LiveBench targets and Lindahl dataset.</p> <p>Conclusions</p> <p>The new DescFold method was intensively benchmarked to have very competitive performance compared with some well-established fold recognition methods, suggesting that it can serve as a useful tool to assist in template-based protein structure prediction. The DescFold server is freely accessible at <url>http://202.112.170.199/DescFold/index.html</url>.</p
    • …
    corecore