    A discriminative method for protein remote homology detection and fold recognition combining Top-n-grams and latent semantic analysis

    Abstract

    Background: Protein remote homology detection and fold recognition are central problems in bioinformatics. Currently, discriminative methods based on support vector machines (SVM) are the most effective and accurate methods for solving these problems. A key step in improving the performance of SVM-based methods is finding a suitable representation of protein sequences.

    Results: In this paper, a novel building block of proteins called Top-n-grams is presented, which contains the evolutionary information extracted from protein sequence frequency profiles. The frequency profiles are calculated from the multiple sequence alignments output by PSI-BLAST and converted into Top-n-grams. The protein sequences are transformed into fixed-dimension feature vectors by counting the occurrences of each Top-n-gram. The training vectors are used to train SVM classifiers, which are then used to classify the test protein sequences. We demonstrate that the prediction performance of remote homology detection and fold recognition can be improved by combining Top-n-grams and latent semantic analysis (LSA), an efficient feature extraction technique from natural language processing. When tested on superfamily and fold benchmarks, the method combining Top-n-grams and LSA gives significantly better results than related methods.

    Conclusion: The method based on Top-n-grams significantly outperforms methods based on many other building blocks, including N-grams, patterns, motifs and binary profiles. Top-n-grams are therefore a good building block of protein sequences and can be widely used in many computational biology tasks, such as sequence alignment, domain boundary prediction, the design of knowledge-based potentials and the prediction of protein binding sites.
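The feature-counting step described in this abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: it assumes a Top-n-gram is the n most frequent amino acids in a profile column, written in order of decreasing frequency, and the function names, the alphabetical tie-breaking rule and the toy profile are all hypothetical. The LSA and SVM stages are not shown.

```python
from collections import Counter
from itertools import permutations

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def top_n_grams(profile, n=2):
    """Turn each profile column into one Top-n-gram.

    `profile` is a list of columns; each column maps an amino acid to its
    frequency. Ties are broken alphabetically (an assumption made here).
    """
    grams = []
    for column in profile:
        ranked = sorted(AMINO_ACIDS, key=lambda aa: (-column.get(aa, 0.0), aa))
        grams.append("".join(ranked[:n]))
    return grams


def vocabulary(n=2):
    """All ordered selections of n distinct amino acids, in a fixed order."""
    return ["".join(p) for p in permutations(AMINO_ACIDS, n)]


def count_vector(grams, n=2):
    """Fixed-dimension vector holding the occurrence count of each Top-n-gram."""
    counts = Counter(grams)
    return [counts[g] for g in vocabulary(n)]


# Toy three-column frequency profile (made-up numbers, not real PSI-BLAST output).
profile = [
    {"A": 0.6, "L": 0.3, "G": 0.1},
    {"A": 0.5, "L": 0.4, "V": 0.1},
    {"G": 0.7, "V": 0.2, "A": 0.1},
]
grams = top_n_grams(profile, n=2)
features = count_vector(grams, n=2)
```

In a full pipeline, one such vector per protein would be stacked into a matrix, optionally reduced with LSA, and passed to an SVM.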

    Protein Remote Homology Detection Based on an Ensemble Learning Approach

    Application of nonnegative matrix factorization to improve profile-profile alignment features for fold recognition and remote homolog detection

    Abstract

    Background: Nonnegative matrix factorization (NMF) is a feature extraction method that yields an intuitive parts-based representation of the original features. This ability makes NMF a potentially promising method for biological sequence analysis. Here, we apply NMF to fold recognition and remote homolog detection problems. Recent studies have shown that combining support vector machines (SVM) with profile-profile alignments remarkably improves the performance of fold recognition and remote homolog detection. However, it is not clear which parts of the sequences are essential for the performance improvement.

    Results: The performance of fold recognition and remote homolog detection using NMF features is compared to that of the unmodified profile-profile alignment (PPA) features by estimating Receiver Operating Characteristic (ROC) scores. The overall performance is noticeably improved. For fold recognition at the fold level, SVM with NMF features recognizes 30% of homologous proteins at ROC scores > 0.99, while the original PPA features, HHsearch, and PSI-BLAST recognize almost none. For detecting remote homologs that are related at the superfamily level, NMF features also achieve higher performance than the original PPA features. At ROC50 scores > 0.90, 25% of proteins correctly detect remotely related proteins with NMF features, whereas only 1% do so with the original PPA features. In addition, we investigate the effect of the number of positive training examples and the number of basis vectors on the performance improvement. We also analyze the ability of NMF to extract essential features by comparing NMF basis vectors with functionally important sites and structurally conserved regions of proteins. The results show that NMF basis vectors have significant overlap with functional sites from PROSITE and with structurally conserved regions from the multiple structural alignments generated by MUSTANG. The correlation between NMF basis vectors and biologically essential parts of proteins supports our conjecture that NMF basis vectors can explicitly represent important sites of proteins.

    Conclusion: The present work demonstrates that applying NMF to profile-profile alignments can reveal essential features of proteins and that these features significantly improve the performance of fold recognition and remote homolog detection.
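For readers unfamiliar with NMF, the factorization itself can be sketched in a few lines. The abstract does not say which NMF algorithm the authors used; the block below assumes the classic multiplicative update rules for the squared (Frobenius) error, and the function names and the toy matrix are hypothetical. It factorizes a nonnegative matrix V (rows as samples, columns as features) into W (reduced features) and H (basis vectors).

```python
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2
               for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, rank, iterations=200, seed=0, eps=1e-9):
    """Factorize nonnegative V (n x m) into W (n x rank) and H (rank x m)."""
    rng = random.Random(seed)
    n, m = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(rank)] for _ in range(n)]
    h = [[rng.random() + 0.1 for _ in range(m)] for _ in range(rank)]
    for _ in range(iterations):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(m)]
             for k in range(rank)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(rank)]
             for i in range(n)]
    return w, h


# Toy nonnegative matrix built from two parts (made-up data).
v = [
    [1.0, 2.0, 0.0, 0.0],
    [2.0, 4.0, 0.0, 0.0],
    [0.0, 0.0, 3.0, 1.0],
    [1.0, 2.0, 3.0, 1.0],
]
w, h = nmf(v, rank=2)
```

Because the updates only multiply by nonnegative ratios, W and H stay nonnegative, which is what gives the parts-based interpretation of the rows of H.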

    Probabilistic protein homology modeling

    Searching sequence databases and building 3D models for proteins are important tasks for biologists. When the structure of a query protein is given, its function can be inferred. However, experimental methods for structure determination are both expensive and time consuming. Fully automatic homology modeling refers to building a 3D model for a query sequence by computer, from an alignment to homologous proteins with known structure (templates). Current prediction servers can provide accurate models within a few hours to days. Our group has developed HHpred, which is one of the top performing structure prediction servers in the field. In general, homology based structure modeling consists of four steps: (1) finding homologous templates in a database, (2) selecting and (3) aligning templates to the query, (4) building a 3D model based on the alignment.

    In part one of this thesis, we present improvements to steps (2) and (4). Specifically, homology modeling has been shown to work best when multiple templates are selected instead of only a single one. Yet, current servers use rather ad-hoc approaches to combine information from multiple templates. We provide a rigorous statistical framework for multi-template homology modeling. Given an alignment, we employ Modeller to calculate the most probable structure for a query. The 3D model is obtained by optimally satisfying spatial restraints derived from the alignment and expressed as probability density functions. We find that the query’s atomic distance restraints can be accurately described by two-component Gaussian mixtures. Moreover, we derive statistical weights to quantify the redundancy among related templates. This allows us to apply the standard rules of probability theory to combine restraints from several templates. Together with a heuristic template selection strategy, we have implemented this approach within HHpred and could significantly improve model quality. Furthermore, we took part in CASP, a community wide competition for structure prediction, where we were ranked first in template based modeling and, at the same time, were more than 450 times faster than all other top servers.

    Homology modeling relies heavily on detecting templates and correctly aligning them to the query sequence (steps (1) and (3) above). But remote homologies are difficult to detect and hard to align on a pure sequence level. Hence, modern tools are based on profiles instead of sequences. A profile summarizes the evolutionary history of a given sequence and consists of position specific amino acid probabilities for each residue. In addition to the similarity score between profile columns, most methods use extra terms that compare 1D structural properties such as secondary structure or solvent accessibility. These can be predicted from local profile windows. In the second part of this thesis, we develop a new score that is independent of any predefined structural property. For this purpose, we learn a library of 32 profile patterns that are most conserved in alignments of remotely homologous, structurally aligned proteins. Each so called “context state” in the library consists of a 13-residue sequence profile. We integrate the new context score into our HMM-HMM alignment tool HHsearch and significantly improve the sensitivity and precision of difficult pairwise alignments in particular. Taken together, we introduced probabilistic methods to improve all four main steps in homology based structure prediction.
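The idea of expressing a distance restraint as a two-component Gaussian mixture and combining restraints from several templates can be illustrated with a small sketch. The abstract does not give the actual formulas, so the combination rule used here (a weighted sum of per-template log densities) is an assumption, and every name, weight and parameter value is hypothetical. This is not how HHpred or Modeller implement it.

```python
import math


def gaussian_pdf(x, mean, sd):
    z = (x - mean) / sd
    return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))


def mixture_pdf(x, components):
    """Density of a Gaussian mixture; `components` is a list of (weight, mean, sd)."""
    return sum(w * gaussian_pdf(x, mean, sd) for w, mean, sd in components)


def combined_log_density(x, templates):
    """Weighted sum of per-template log densities (assumed combination rule).

    `templates` is a list of (template_weight, components); a lower weight
    stands in for a template that is redundant with the others.
    """
    return sum(tw * math.log(mixture_pdf(x, comps)) for tw, comps in templates)


def most_probable_distance(templates, lo=0.0, hi=20.0, step=0.01):
    """Grid search for the distance that maximizes the combined log density."""
    steps = int(round((hi - lo) / step))
    grid = [lo + i * step for i in range(steps + 1)]
    return max(grid, key=lambda x: combined_log_density(x, templates))


# Two hypothetical templates suggesting distances of 5 and 6 Angstrom for one
# atom pair: a narrow component plus a wide background component each.
template_a = [(0.8, 5.0, 0.5), (0.2, 5.0, 2.0)]
template_b = [(0.8, 6.0, 0.5), (0.2, 6.0, 2.0)]
best = most_probable_distance([(1.0, template_a), (1.0, template_b)])
```

With equal template weights the most probable distance falls midway between the two suggestions; setting one template weight to zero recovers the other template's restraint alone.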

    Fast protein superfamily classification using principal component null space analysis.

    The protein family classification problem, which consists of determining the family memberships of given unknown protein sequences, is very important to biologists for many practical reasons, such as drug discovery, prediction of molecular functions and medical diagnosis. Neural networks and Bayesian methods have performed well on the protein classification problem, achieving accuracy ranging from 90% to 98%, but they run relatively slowly in the learning stage. In this thesis, we apply a principal component null space analysis (PCNSA) linear classifier to the problem and report excellent results compared to those of neural networks and support vector machines. The two main parameters of PCNSA are linked to the high dimensionality of the dataset used, and were optimized in an exhaustive manner to maximize accuracy. Paper copy at Leddy Library: Theses & Major Papers - Basement, West Bldg. / Call Number: Thesis2005 .F74. Source: Masters Abstracts International, Volume: 44-03, page: 1400. Thesis (M.Sc.)--University of Windsor (Canada), 2005