528 research outputs found

    A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machine (SVM) are the most effective and accurate methods for solving these problems. A key step to improve the performance of the SVM-based methods is to find a suitable representation of protein sequences.</p> <p>Results</p> <p>In this paper, a novel building block of proteins called Top-<it>n</it>-grams is presented, which contains the evolutionary information extracted from the protein sequence frequency profiles. The protein sequence frequency profiles are calculated from the multiple sequence alignments outputted by PSI-BLAST and converted into Top-<it>n</it>-grams. The protein sequences are transformed into fixed-dimension feature vectors by the occurrence times of each Top-<it>n</it>-gram. The training vectors are evaluated by SVM to train classifiers which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-<it>n</it>-grams and latent semantic analysis (LSA), which is an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-<it>n</it>-grams and LSA gives significantly better results compared to related methods.</p> <p>Conclusion</p> <p>The method based on Top-<it>n</it>-grams significantly outperforms the methods based on many other building blocks including N-grams, patterns, motifs and binary profiles. Therefore, Top-<it>n</it>-gram is a good building block of the protein sequences and can be widely used in many tasks of the computational biology, such as the sequence alignment, the prediction of domain boundary, the designation of knowledge-based potentials and the prediction of protein binding sites.</p

    Building multiclass classifiers for remote homology detection and fold recognition

    Get PDF
    BACKGROUND: Protein remote homology detection and fold recognition are central problems in computational biology. Supervised learning algorithms based on support vector machines are currently one of the most effective methods for solving these problems. These methods are primarily used to solve binary classification problems and they have not been extensively used to solve the more general multiclass remote homology prediction and fold recognition problems. RESULTS: We present a comprehensive evaluation of a number of methods for building SVM-based multiclass classification schemes in the context of the SCOP protein classification. These methods include schemes that directly build an SVM-based multiclass model, schemes that employ a second-level learning approach to combine the predictions generated by a set of binary SVM-based classifiers, and schemes that build and combine binary classifiers for various levels of the SCOP hierarchy beyond those defining the target classes. CONCLUSION: Analyzing the performance achieved by the different approaches on four different datasets we show that most of the proposed multiclass SVM-based classification approaches are quite effective in solving the remote homology prediction and fold recognition problems and that the schemes that use predictions from binary models constructed for ancestral categories within the SCOP hierarchy tend to not only lead to lower error rates but also reduce the number of errors in which a superfamily is assigned to an entirely different fold and a fold is predicted as being from a different SCOP class. Our results also show that the limited size of the training data makes it hard to learn complex second-level models, and that models of moderate complexity lead to consistently better results

    A study of hierarchical and flat classification of proteins

    Get PDF
    Automatic classification of proteins using machine learning is an important problem that has received significant attention in the literature. One feature of this problem is that expert-defined hierarchies of protein classes exist and can potentially be exploited to improve classification performance. In this article we investigate empirically whether this is the case for two such hierarchies. We compare multi-class classification techniques that exploit the information in those class hierarchies and those that do not, using logistic regression, decision trees, bagged decision trees, and support vector machines as the underlying base learners. In particular, we compare hierarchical and flat variants of ensembles of nested dichotomies. The latter have been shown to deliver strong classification performance in multi-class settings. We present experimental results for synthetic, fold recognition, enzyme classification, and remote homology detection data. Our results show that exploiting the class hierarchy improves performance on the synthetic data, but not in the case of the protein classification problems. Based on this we recommend that strong flat multi-class methods be used as a baseline to establish the benefit of exploiting class hierarchies in this area

    Application of nonnegative matrix factorization to improve profile-profile alignment features for fold recognition and remote homolog detection

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Nonnegative matrix factorization (NMF) is a feature extraction method that has the property of intuitive part-based representation of the original features. This unique ability makes NMF a potentially promising method for biological sequence analysis. Here, we apply NMF to fold recognition and remote homolog detection problems. Recent studies have shown that combining support vector machines (SVM) with profile-profile alignments improves performance of fold recognition and remote homolog detection remarkably. However, it is not clear which parts of sequences are essential for the performance improvement.</p> <p>Results</p> <p>The performance of fold recognition and remote homolog detection using NMF features is compared to that of the unmodified profile-profile alignment (PPA) features by estimating Receiver Operating Characteristic (ROC) scores. The overall performance is noticeably improved. For fold recognition at the fold level, SVM with NMF features recognize 30% of homolog proteins at > 0.99 ROC scores, while original PPA feature, HHsearch, and PSI-BLAST recognize almost none. For detecting remote homologs that are related at the superfamily level, NMF features also achieve higher performance than the original PPA features. At > 0.90 ROC<sub>50 </sub>scores, 25% of proteins with NMF features correctly detects remotely related proteins, whereas using original PPA features only 1% of proteins detect remote homologs. In addition, we investigate the effect of number of positive training examples and the number of basis vectors on performance improvement. We also analyze the ability of NMF to extract essential features by comparing NMF basis vectors with functionally important sites and structurally conserved regions of proteins. The results show that NMF basis vectors have significant overlap with functional sites from PROSITE and with structurally conserved regions from the multiple structural alignments generated by MUSTANG. The correlation between NMF basis vectors and biologically essential parts of proteins supports our conjecture that NMF basis vectors can explicitly represent important sites of proteins.</p> <p>Conclusion</p> <p>The present work demonstrates that applying NMF to profile-profile alignments can reveal essential features of proteins and that these features significantly improve the performance of fold recognition and remote homolog detection.</p

    Physicochemical property distributions for accurate and rapid pairwise protein homology detection

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The challenge of remote homology detection is that many evolutionarily related sequences have very little similarity at the amino acid level. Kernel-based discriminative methods, such as support vector machines (SVMs), that use vector representations of sequences derived from sequence properties have been shown to have superior accuracy when compared to traditional approaches for the task of remote homology detection.</p> <p>Results</p> <p>We introduce a new method for feature vector representation based on the physicochemical properties of the primary protein sequence. A distribution of physicochemical property scores are assembled from 4-mers of the sequence and normalized based on the null distribution of the property over all possible 4-mers. With this approach there is little computational cost associated with the transformation of the protein into feature space, and overall performance in terms of remote homology detection is comparable with current state-of-the-art methods. We demonstrate that the features can be used for the task of pairwise remote homology detection with improved accuracy versus sequence-based methods such as BLAST and other feature-based methods of similar computational cost.</p> <p>Conclusions</p> <p>A protein feature method based on physicochemical properties is a viable approach for extracting features in a computationally inexpensive manner while retaining the sensitivity of SVM protein homology detection. Furthermore, identifying features that can be used for generic pairwise homology detection in lieu of family-based homology detection is important for applications such as large database searches and comparative genomics.</p

    Motif kernel generated by genetic programming improves remote homology and fold detection

    Get PDF
    BACKGROUND: Protein remote homology detection is a central problem in computational biology. Most recent methods train support vector machines to discriminate between related and unrelated sequences and these studies have introduced several types of kernels. One successful approach is to base a kernel on shared occurrences of discrete sequence motifs. Still, many protein sequences fail to be classified correctly for a lack of a suitable set of motifs for these sequences. RESULTS: We introduce the GPkernel, which is a motif kernel based on discrete sequence motifs where the motifs are evolved using genetic programming. All proteins can be grouped according to evolutionary relations and structure, and the method uses this inherent structure to create groups of motifs that discriminate between different families of evolutionary origin. When tested on two SCOP benchmarks, the superfamily and fold recognition problems, the GPkernel gives significantly better results compared to related methods of remote homology detection. CONCLUSION: The GPkernel gives particularly good results on the more difficult fold recognition problem compared to the other methods. This is mainly because the method creates motif sets that describe similarities among subgroups of both the related and unrelated proteins. This rich set of motifs give a better description of the similarities and differences between different folds than do previous motif-based methods

    Protein Remote Homology Detection Based on an Ensemble Learning Approach

    Get PDF

    DescFold: A web server for protein fold recognition

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Machine learning-based methods have been proven to be powerful in developing new fold recognition tools. In our previous work [Zhang, Kochhar and Grigorov (2005) <it>Protein Science</it>, <b>14</b>: 431-444], a machine learning-based method called DescFold was established by using Support Vector Machines (SVMs) to combine the following four descriptors: a profile-sequence-alignment-based descriptor using Psi-blast <it>e</it>-values and bit scores, a sequence-profile-alignment-based descriptor using Rps-blast <it>e</it>-values and bit scores, a descriptor based on secondary structure element alignment (SSEA), and a descriptor based on the occurrence of PROSITE functional motifs. In this work, we focus on the improvement of DescFold by incorporating more powerful descriptors and setting up a user-friendly web server.</p> <p>Results</p> <p>In seeking more powerful descriptors, the profile-profile alignment score generated from the COMPASS algorithm was first considered as a new descriptor (i.e., PPA). When considering a profile-profile alignment between two proteins in the context of fold recognition, one protein is regarded as a template (i.e., its 3D structure is known). Instead of a sequence profile derived from a Psi-blast search, a structure-seeded profile for the template protein was generated by searching its structural neighbors with the assistance of the TM-align structural alignment algorithm. Moreover, the COMPASS algorithm was used again to derive a profile-structural-profile-alignment-based descriptor (i.e., PSPA). We trained and tested the new DescFold in a total of 1,835 highly diverse proteins extracted from the SCOP 1.73 version. When the PPA and PSPA descriptors were introduced, the new DescFold boosts the performance of fold recognition substantially. Using the SCOP_1.73_40% dataset as the fold library, the DescFold web server based on the trained SVM models was further constructed. To provide a large-scale test for the new DescFold, a stringent test set of 1,866 proteins were selected from the SCOP 1.75 version. At a less than 5% false positive rate control, the new DescFold is able to correctly recognize structural homologs at the fold level for nearly 46% test proteins. Additionally, we also benchmarked the DescFold method against several well-established fold recognition algorithms through the LiveBench targets and Lindahl dataset.</p> <p>Conclusions</p> <p>The new DescFold method was intensively benchmarked to have very competitive performance compared with some well-established fold recognition methods, suggesting that it can serve as a useful tool to assist in template-based protein structure prediction. The DescFold server is freely accessible at <url>http://202.112.170.199/DescFold/index.html</url>.</p

    Exploiting physico-chemical properties in string kernels

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>String kernels are commonly used for the classification of biological sequences, nucleotide as well as amino acid sequences. Although string kernels are already very powerful, when it comes to amino acids they have a major short coming. They ignore an important piece of information when comparing amino acids: the physico-chemical properties such as size, hydrophobicity, or charge. This information is very valuable, especially when training data is less abundant. There have been only very few approaches so far that aim at combining these two ideas.</p> <p>Results</p> <p>We propose new string kernels that combine the benefits of physico-chemical descriptors for amino acids with the ones of string kernels. The benefits of the proposed kernels are assessed on two problems: MHC-peptide binding classification using position specific kernels and protein classification based on the substring spectrum of the sequences. Our experiments demonstrate that the incorporation of amino acid properties in string kernels yields improved performances compared to standard string kernels and to previously proposed non-substring kernels.</p> <p>Conclusions</p> <p>In summary, the proposed modifications, in particular the combination with the RBF substring kernel, consistently yield improvements without affecting the computational complexity. The proposed kernels therefore appear to be the kernels of choice for any protein sequence-based inference.</p> <p>Availability</p> <p>Data sets, code and additional information are available from <url>http://www.fml.tuebingen.mpg.de/raetsch/suppl/aask</url>. Implementations of the developed kernels are available as part of the Shogun toolbox.</p
    corecore