11,822 research outputs found

    Remote Homology Detection of Protein Sequences

    Get PDF
    The classification of protein sequences using string kernels provides valuable insights for protein function prediction. Almost all string kernels are based on patterns that are not independent, and therefore the associated scores are obtained using a set of redundant features. In this talk we will discuss how a class of patterns, called Irredundant, is specifically designed to address this issue. Loosely speaking the set of Irredundant patterns is the smallest class of independent patterns that can describe all patterns in a string. We present a classification method based on the statistics of these patterns, named Irredundant Class. Results on benchmark data show that Irredundant Class outperforms most of the string kernel methods previously proposed, and it achieves results as good as the current state-of-the-art methods with a fewer number of patterns. Unfortunately we show that the information carried by the irredundant patterns can not be easily interpreted, thus alternative notions are needed

    Physicochemical property distributions for accurate and rapid pairwise protein homology detection

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>The challenge of remote homology detection is that many evolutionarily related sequences have very little similarity at the amino acid level. Kernel-based discriminative methods, such as support vector machines (SVMs), that use vector representations of sequences derived from sequence properties have been shown to have superior accuracy when compared to traditional approaches for the task of remote homology detection.</p> <p>Results</p> <p>We introduce a new method for feature vector representation based on the physicochemical properties of the primary protein sequence. A distribution of physicochemical property scores are assembled from 4-mers of the sequence and normalized based on the null distribution of the property over all possible 4-mers. With this approach there is little computational cost associated with the transformation of the protein into feature space, and overall performance in terms of remote homology detection is comparable with current state-of-the-art methods. We demonstrate that the features can be used for the task of pairwise remote homology detection with improved accuracy versus sequence-based methods such as BLAST and other feature-based methods of similar computational cost.</p> <p>Conclusions</p> <p>A protein feature method based on physicochemical properties is a viable approach for extracting features in a computationally inexpensive manner while retaining the sensitivity of SVM protein homology detection. Furthermore, identifying features that can be used for generic pairwise homology detection in lieu of family-based homology detection is important for applications such as large database searches and comparative genomics.</p

    Word correlation matrices for protein sequence analysis and remote homology detection

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>Classification of protein sequences is a central problem in computational biology. Currently, among computational methods discriminative kernel-based approaches provide the most accurate results. However, kernel-based methods often lack an interpretable model for analysis of discriminative sequence features, and predictions on new sequences usually are computationally expensive.</p> <p>Results</p> <p>In this work we present a novel kernel for protein sequences based on average word similarity between two sequences. We show that this kernel gives rise to a feature space that allows analysis of discriminative features and fast classification of new sequences. We demonstrate the performance of our approach on a widely-used benchmark setup for protein remote homology detection.</p> <p>Conclusion</p> <p>Our word correlation approach provides highly competitive performance as compared with state-of-the-art methods for protein remote homology detection. The learned model is interpretable in terms of biologically meaningful features. In particular, analysis of discriminative words allows the identification of characteristic regions in biological sequences. Because of its high computational efficiency, our method can be applied to ranking of potential homologs in large databases.</p

    Pairwise alignment incorporating dipeptide covariation

    Full text link
    Motivation: Standard algorithms for pairwise protein sequence alignment make the simplifying assumption that amino acid substitutions at neighboring sites are uncorrelated. This assumption allows implementation of fast algorithms for pairwise sequence alignment, but it ignores information that could conceivably increase the power of remote homolog detection. We examine the validity of this assumption by constructing extended substitution matrixes that encapsulate the observed correlations between neighboring sites, by developing an efficient and rigorous algorithm for pairwise protein sequence alignment that incorporates these local substitution correlations, and by assessing the ability of this algorithm to detect remote homologies. Results: Our analysis indicates that local correlations between substitutions are not strong on the average. Furthermore, incorporating local substitution correlations into pairwise alignment did not lead to a statistically significant improvement in remote homology detection. Therefore, the standard assumption that individual residues within protein sequences evolve independently of neighboring positions appears to be an efficient and appropriate approximation

    Ultra-large alignments using Phylogeny-aware Profiles

    Full text link
    Many biological questions, including the estimation of deep evolutionary histories and the detection of remote homology between protein sequences, rely upon multiple sequence alignments (MSAs) and phylogenetic trees of large datasets. However, accurate large-scale multiple sequence alignment is very difficult, especially when the dataset contains fragmentary sequences. We present UPP, an MSA method that uses a new machine learning technique - the Ensemble of Hidden Markov Models - that we propose here. UPP produces highly accurate alignments for both nucleotide and amino acid sequences, even on ultra-large datasets or datasets containing fragmentary sequences. UPP is available at https://github.com/smirarab/sepp.Comment: Online supplemental materials and data are available at http://www.cs.utexas.edu/users/phylo/software/upp

    Protein Remote Homology Detection Based on an Ensemble Learning Approach

    Get PDF
    Protein remote homology detection is one of the central problems in bioinformatics. Although some computational methods have been proposed, the problem is still far from being solved. In this paper, an ensemble classifier for protein remote homology detection, called SVM-Ensemble, was proposed with a weighted voting strategy. SVM-Ensemble combined three basic classifiers based on different feature spaces, including Kmer, ACC, and SC-PseAAC. These features consider the characteristics of proteins from various perspectives, incorporating both the sequence composition and the sequence-order information along the protein sequences. Experimental results on a widely used benchmark dataset showed that the proposed SVM-Ensemble can obviously improve the predictive performance for the protein remote homology detection. Moreover, it achieved the best performance and outperformed other state-of-the-art methods

    Family classification without domain chaining

    Get PDF
    Motivation: Classification of gene and protein sequences into homologous families, i.e. sets of sequences that share common ancestry, is an essential step in comparative genomic analyses. This is typically achieved by construction of a sequence homology network, followed by clustering to identify dense subgraphs corresponding to families. Accurate classification of single domain families is now within reach due to major algorithmic advances in remote homology detection and graph clustering. However, classification of multidomain families remains a significant challenge. The presence of the same domain in sequences that do not share common ancestry introduces false edges in the homology network that link unrelated families and stymy clustering algorithms

    Protein sequences classification based on weighting scheme

    Get PDF
    We present a new technique to recognize remote protein homologies that rely on combining probabilistic modeling and supervised learning in high-dimensional feature spaces. The main novelty of our technique is the method of constructing feature vectors using Hidden Markov Model and the combination of this representation with a classifier capable of learning in very sparse high-dimensional spaces. Each feature vector records the sensitivity of each protein domain to a previously learned set of sub-sequences (strings). Unlike other previous methods, our method takes in consideration the conserved and non-conserved regions. The system subsequently utilizes Support Vector Machines (SVM) classifiers to learn the boundaries between structural protein classes. Experiments show that this method, which we call the String Weighting Scheme-SVM (SWS-SVM) method, significantly improves on previous methods for the classification of protein domains based on remote homologies. Our method is then compared to five existing homology detection methods

    Reconstruction of ancestral protein sequences and its applications

    Get PDF
    BACKGROUND: Modern-day proteins were selected during long evolutionary history as descendants of ancient life forms. In silico reconstruction of such ancestral protein sequences facilitates our understanding of evolutionary processes, protein classification and biological function. Additionally, reconstructed ancestral protein sequences could serve to fill in sequence space thus aiding remote homology inference. RESULTS: We developed ANCESCON, a package for distance-based phylogenetic inference and reconstruction of ancestral protein sequences that takes into account the observed variation of evolutionary rates between positions that more precisely describes the evolution of protein families. To improve the accuracy of evolutionary distance estimation and ancestral sequence reconstruction, two approaches are proposed to estimate position-specific evolutionary rates. Comparisons show that at large evolutionary distances our method gives more accurate ancestral sequence reconstruction than PAML, PHYLIP and PAUP*. We apply the reconstructed ancestral sequences to homology inference and functional site prediction. We show that the usage of hypothetical ancestors together with the present day sequences improves profile-based sequence similarity searches; and that ancestral sequence reconstruction methods can be used to predict positions with functional specificity. CONCLUSIONS: As a computational tool to reconstruct ancestral protein sequences from a given multiple sequence alignment, ANCESCON shows high accuracy in tests and helps detection of remote homologs and prediction of functional sites. ANCESCON is freely available for non-commercial use. Pre-compiled versions for several platforms can be downloaded from

    MRFalign: Protein Homology Detection through Alignment of Markov Random Fields

    Full text link
    Sequence-based protein homology detection has been extensively studied and so far the most sensitive method is based upon comparison of protein sequence profiles, which are derived from multiple sequence alignment (MSA) of sequence homologs in a protein family. A sequence profile is usually represented as a position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This paper presents a new homology detection method MRFalign, consisting of three key components: 1) a Markov Random Fields (MRF) representation of a protein family; 2) a scoring function measuring similarity of two MRFs; and 3) an efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning two MRFs. Compared to HMM that can only model very short-range residue correlation, MRFs can model long-range residue interaction pattern and thus, encode information for the global 3D structure of a protein family. Consequently, MRF-MRF comparison for remote homology detection shall be much more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that MRFalign outperforms several popular HMM or PSSM-based methods in terms of both alignment accuracy and remote homology detection and that MRFalign works particularly well for mainly beta proteins. For example, tested on the benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM succeed on 48% and 52% of proteins, respectively, at superfamily level, and on 15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign succeeds on 57.3% and 42.5% of proteins at superfamily and fold level, respectively. This study implies that long-range residue interaction patterns are very helpful for sequence-based homology detection. The software is available for download at http://raptorx.uchicago.edu/download/.Comment: Accepted by both RECOMB 2014 and PLOS Computational Biolog
    corecore