LOCSVMPSI: a web server for subcellular localization of eukaryotic proteins using SVM and profile of PSI-BLAST
The subcellular location of a protein is one of its key functional characteristics, because proteins must be localized correctly at the subcellular level to function normally. This paper introduces a novel method, LOCSVMPSI, based on the support vector machine (SVM) and the position-specific scoring matrix generated from PSI-BLAST profiles. In a jackknife test on the RH2427 data set, LOCSVMPSI achieved an overall prediction accuracy of 90.2%, higher than the results of SubLoc and ESLpred on the same data set. In addition, the prediction performance of LOCSVMPSI was evaluated with a 5-fold cross-validation test on the PK7579 data set, and the results were consistently better than those of the previous method based on several SVMs using the composition of both amino acids and amino acid pairs. A further test on the SWISSPROT new-unique data set showed that LOCSVMPSI also performed better than some widely used prediction methods, such as PSORTII, TargetP and LOCnet. All these results indicate that LOCSVMPSI is a powerful tool for the prediction of eukaryotic protein subcellular localization. An online web server (current version 1.3) based on this method has been developed and is freely available to both academic and commercial users; it can be accessed at
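As a rough illustration of what a position-specific scoring matrix contains, the sketch below builds log-odds scores from a toy alignment. It is not the PSI-BLAST procedure and not the LOCSVMPSI feature pipeline: the function name build_pssm, the uniform background frequency, the flat pseudocount, and the toy sequences are all assumptions made for the example.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

def build_pssm(alignment, pseudocount=1.0):
    """Return one {residue: log2-odds score} dict per alignment column.

    Assumptions (not from the abstract): uniform background frequencies
    and a flat pseudocount; PSI-BLAST itself uses sequence weighting and
    substitution-matrix-based pseudocounts.
    """
    background = 1.0 / len(AMINO_ACIDS)
    length = len(alignment[0])
    pssm = []
    for col in range(length):
        # Count residues in this column, ignoring gaps and unknown symbols.
        counts = Counter(seq[col] for seq in alignment if seq[col] in AMINO_ACIDS)
        total = sum(counts.values()) + pseudocount * len(AMINO_ACIDS)
        column = {}
        for aa in AMINO_ACIDS:
            freq = (counts[aa] + pseudocount) / total
            column[aa] = math.log2(freq / background)
        pssm.append(column)
    return pssm

# Hypothetical toy alignment: four aligned sequences of length four.
toy_alignment = ["ACDA", "ACEA", "ACDG", "TCDA"]
pssm = build_pssm(toy_alignment)
```

Each column of the result could then be flattened into a fixed-length feature vector for a classifier such as an SVM; how LOCSVMPSI does this is not stated in the abstract.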
A Comparative Study between Fixed-size Kernel Logistic Regression and Support Vector Machines Methods for beta-turns Prediction in Protein
The beta-turn is an important element of protein structure; it plays a significant role in protein configuration and function. Several methods have been developed for predicting beta-turns from protein sequences. The best methods are based on Neural Networks (NNs) or Support Vector Machines (SVMs). Although Kernel Logistic Regression (KLR) is a powerful classification technique that has been applied successfully to many classification problems, it is rarely used for beta-turn classification, mainly because it is computationally expensive. Fixed-Size Kernel Logistic Regression (FS-KLR) is a fast and accurate approximate implementation of KLR for large-scale data sets. It uses the trust-region Newton method for large-scale Logistic Regression (LR) as a basis to solve the approximate problem, and the Nyström method to approximate the feature matrix. In this paper we use FS-KLR for beta-turn prediction and compare the results with those obtained with SVM. Secondary structure information and Position Specific Scoring Matrices (PSSMs) are used as input features. The performance achieved using FS-KLR is found to be comparable to that of the SVM method. FS-KLR has the advantage of yielding probabilistic outputs directly, and its extension to the multi-class case is well-defined. In addition, its evaluation time is less than that of the SVM method.
Predicting Secondary Structures, Contact Numbers, and Residue-wise Contact Orders of Native Protein Structure from Amino Acid Sequence by Critical Random Networks
Prediction of one-dimensional protein structures such as secondary structures
and contact numbers is useful for three-dimensional structure prediction
and important for understanding the sequence-structure relationship. Here we
present a new machine-learning method, critical random networks (CRNs), for
predicting one-dimensional structures, and apply it, with position-specific
scoring matrices, to the prediction of secondary structures (SS), contact
numbers (CN), and residue-wise contact orders (RWCO). The present method
achieves, on average, an accuracy of 77.8% for SS and correlation coefficients
of 0.726 and 0.601 for CN and RWCO, respectively. The accuracy of the SS
prediction is comparable to other state-of-the-art methods, and that of the CN
prediction is a significant improvement over previous methods. We give a
detailed formulation of the critical random networks-based prediction scheme, and
examine the context-dependence of prediction accuracies. In order to study the
nonlinear and multi-body effects, we compare the CRNs-based method with a
purely linear method based on position-specific scoring matrices. Although not
superior to the CRNs-based method, the surprisingly good accuracy achieved by
the linear method highlights the difficulty in extracting structural features
of higher order from amino acid sequence beyond that provided by the
position-specific scoring matrices.
Comment: 20 pages, 1 figure, 5 tables; minor revision; accepted for
publication in BIOPHYSIC
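To make the two less familiar one-dimensional quantities concrete, here is a minimal sketch that computes contact numbers and residue-wise contact orders from 3D coordinates. The hard distance cutoff of 12 Å, the minimum sequence separation of 3, the normalization of RWCO by chain length, and the function name contact_profiles are assumptions for illustration; the abstract does not give the exact definitions used in the paper.

```python
import math

def contact_profiles(coords, cutoff=12.0, min_separation=3):
    """Return (contact numbers, residue-wise contact orders) per residue.

    Assumed definitions: residues i and j are in contact when they are
    at least `min_separation` apart in sequence and closer than `cutoff`
    in space. CN counts the contacts of a residue; RWCO sums the sequence
    separations |i - j| over those contacts, divided by chain length.
    """
    n = len(coords)
    cn = [0] * n
    rwco = [0.0] * n
    for i in range(n):
        for j in range(n):
            if abs(i - j) < min_separation:
                continue  # skip the residue itself and near neighbours
            if math.dist(coords[i], coords[j]) < cutoff:
                cn[i] += 1
                rwco[i] += abs(i - j)
    rwco = [value / n for value in rwco]
    return cn, rwco

# Hypothetical toy chain: six residues on a straight line, 3.8 apart.
toy_coords = [(3.8 * i, 0.0, 0.0) for i in range(6)]
cn, rwco = contact_profiles(toy_coords)
```

In this fully extended toy chain every residue has exactly one contact, the residue three positions away, which is why such chains give low values for both quantities compared with compact folds.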
MRFalign: Protein Homology Detection through Alignment of Markov Random Fields
Sequence-based protein homology detection has been extensively studied and so
far the most sensitive method is based upon comparison of protein sequence
profiles, which are derived from multiple sequence alignment (MSA) of sequence
homologs in a protein family. A sequence profile is usually represented as a
position-specific scoring matrix (PSSM) or an HMM (Hidden Markov Model) and
accordingly PSSM-PSSM or HMM-HMM comparison is used for homolog detection. This
paper presents a new homology detection method MRFalign, consisting of three
key components: 1) a Markov Random Fields (MRF) representation of a protein
family; 2) a scoring function measuring similarity of two MRFs; and 3) an
efficient ADMM (Alternating Direction Method of Multipliers) algorithm aligning
two MRFs. Compared to HMMs, which can only model very short-range residue
correlations, MRFs can model long-range residue interaction patterns and thus
encode information about the global 3D structure of a protein family.
Consequently, MRF-MRF comparison for remote homology detection should be much
more sensitive than HMM-HMM or PSSM-PSSM comparison. Experiments confirm that
MRFalign outperforms several popular HMM or PSSM-based methods in terms of both
alignment accuracy and remote homology detection and that MRFalign works
particularly well for mainly beta proteins. For example, tested on the
benchmark SCOP40 (8353 proteins) for homology detection, PSSM-PSSM and HMM-HMM
succeed on 48% and 52% of proteins, respectively, at superfamily level, and on
15% and 27% of proteins, respectively, at fold level. In contrast, MRFalign
succeeds on 57.3% and 42.5% of proteins at superfamily and fold level,
respectively. This study implies that long-range residue interaction patterns
are very helpful for sequence-based homology detection. The software is
available for download at http://raptorx.uchicago.edu/download/.
Comment: Accepted by both RECOMB 2014 and PLOS Computational Biolog
Dissecting the Specificity of Protein-Protein Interaction in Bacterial Two-Component Signaling: Orphans and Crosstalks
Predictive understanding of the myriad signal transduction pathways in a
cell is an outstanding challenge of systems biology. Such pathways are
primarily mediated by specific but transient protein-protein interactions,
which are difficult to study experimentally. In this study, we dissect the
specificity of protein-protein interactions governing two-component signaling
(TCS) systems ubiquitously used in bacteria. Exploiting the large number of
sequenced bacterial genomes and an operon structure which packages many pairs
of interacting TCS proteins together, we developed a computational approach to
extract a molecular interaction code capturing the preferences of a small but
critical number of directly interacting residue pairs. This code is found to
reflect physical interaction mechanisms, with the strongest signal coming from
charged amino acids. It is used to predict the specificity of TCS interaction:
Our results compare favorably to most available experimental results, including
the prediction of 7 (out of 8 known) interaction partners of orphan signaling
proteins in Caulobacter crescentus. A survey of the available bacterial
genomes suggests that 15-25% of the TCS proteins could participate in
out-of-operon "crosstalks". Additionally, we predict clusters of crosstalking
candidates, expanding from the anecdotally known examples in model organisms.
The tools and results presented here can be used to guide experimental studies
towards a system-level understanding of two-component signaling.
Comment: Supplementary information available on
http://www.plosone.org/article/info:doi/10.1371/journal.pone.001972
From principal component to direct coupling analysis of coevolution in proteins: Low-eigenvalue modes are needed for structure prediction
Various approaches have explored the covariation of residues in
multiple-sequence alignments of homologous proteins to extract functional and
structural information. Among those are principal component analysis (PCA),
which identifies the most correlated groups of residues, and direct coupling
analysis (DCA), a global inference method based on the maximum entropy
principle, which aims at predicting residue-residue contacts. In this paper,
inspired by the statistical physics of disordered systems, we introduce the
Hopfield-Potts model to naturally interpolate between these two approaches. The
Hopfield-Potts model allows us to identify relevant 'patterns' of residues from
the knowledge of the eigenmodes and eigenvalues of the residue-residue
correlation matrix. We show how the computation of such statistical patterns
makes it possible to accurately predict residue-residue contacts with a much
smaller number of parameters than DCA. This dimensional reduction allows us to
avoid overfitting and to extract contact information from multiple-sequence
alignments of reduced size. In addition, we show that low-eigenvalue
correlation modes, discarded by PCA, are important to recover structural
information: the corresponding patterns are highly localized, that is, they are
concentrated in few sites, which we find to be in close contact in the
three-dimensional protein fold.
Comment: Supporting information can be downloaded from:
http://www.ploscompbiol.org/article/info:doi/10.1371/journal.pcbi.100317
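The residue covariation that these methods start from can be illustrated with a much simpler statistic. The sketch below computes the mutual information between two columns of a toy multiple-sequence alignment. Mutual information is swapped in here purely for illustration: it is a local pairwise measure, not PCA, DCA, or the Hopfield-Potts model, and the function name and toy alignment are hypothetical.

```python
import math
from collections import Counter

def column_mutual_information(msa, i, j):
    """Mutual information (in bits) between alignment columns i and j."""
    n = len(msa)
    freq_i = Counter(seq[i] for seq in msa)
    freq_j = Counter(seq[j] for seq in msa)
    freq_ij = Counter((seq[i], seq[j]) for seq in msa)
    mi = 0.0
    for (a, b), count in freq_ij.items():
        p_ab = count / n
        p_a = freq_i[a] / n
        p_b = freq_j[b] / n
        mi += p_ab * math.log2(p_ab / (p_a * p_b))
    return mi

# Hypothetical toy alignment: columns 0 and 1 covary perfectly (A with K,
# G with R), column 2 is fully conserved, column 3 varies independently.
toy_msa = ["AKLE", "AKLD", "GRLE", "GRLD"]
mi_covarying = column_mutual_information(toy_msa, 0, 1)
mi_independent = column_mutual_information(toy_msa, 0, 3)
mi_conserved = column_mutual_information(toy_msa, 0, 2)
```

A pairwise score like this cannot separate direct from indirect correlations, which is the limitation that global inference methods such as DCA are designed to address.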