LipocalinPred: a SVM-based method for prediction of lipocalins

Author: A Ben-Hur
A Garg
A Garg
A Sali
AS Martin Vogt
B Adam
C Leslie
D Holloway
D Plewczynski
Dinesh Gupta
DR Flower
DR Flower
G Wang
H Saiga
H Saigo
J Ahnstrom
J Duan
J Hull-Thompson
J Thorsten
JA Swets
Jayashree Ramana
LJ McGuffin
M Sieber
M Zervakis
NV Vapnik
P Pavlidis
R Rajakariar
S Ahmad
S Arne
SF Altschul
SR Eddy
W Deng
X Yu
YR Chan
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract. Background: Functional annotation of rapidly amassing nucleotide and protein sequences presents a challenging task for modern bioinformatics. This is particularly true for protein families sharing extremely low sequence identity, as for lipocalins, a family of proteins with varied functions and great diversity at the sequence level, yet conserved structures. Results: In the present study we propose an SVM-based method for identification of lipocalin protein sequences. The SVM models were trained with input features generated from amino acid, dipeptide and secondary structure compositions as well as PSSM profiles. The model derived using both PSSM and secondary structure emerged as the best model in the study. Apart from achieving a high prediction accuracy (>90% in leave-one-out), LipocalinPred correctly differentiates closely related fatty acid-binding proteins and triabins as non-lipocalins. Conclusion: The method offers a promising approach as a lipocalin prediction tool, complementing PROSITE, Pfam and homology modelling methods.
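The amino acid and dipeptide composition features mentioned in this abstract can be illustrated with a minimal sketch. The function names, the 20-letter alphabet constant and the choice to ignore non-standard residues are assumptions made for illustration; the abstract does not describe the authors' actual implementation.

```python
from itertools import product

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def aa_composition(seq):
    """Return 20 fractions: the share of each standard residue in seq."""
    seq = seq.upper()
    # Assumption: letters outside the 20 standard residues are ignored.
    n = sum(seq.count(a) for a in AMINO_ACIDS)
    return [seq.count(a) / n if n else 0.0 for a in AMINO_ACIDS]

def dipeptide_composition(seq):
    """Return 400 fractions: the share of each ordered residue pair in seq."""
    seq = seq.upper()
    pairs = [a + b for a, b in product(AMINO_ACIDS, repeat=2)]
    counts = dict.fromkeys(pairs, 0)
    for i in range(len(seq) - 1):
        dipeptide = seq[i:i + 2]
        if dipeptide in counts:  # skip pairs containing non-standard letters
            counts[dipeptide] += 1
    total = sum(counts.values())
    return [counts[p] / total if total else 0.0 for p in pairs]
```

Concatenating the two lists gives a fixed-length 420-dimensional vector regardless of sequence length, which is what makes composition features convenient inputs for an SVM.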

Directory of Open Access Journals

ScholarWorks @ Georgia State University

Machine Learning and Graph Theory Approaches for Classification and Prediction of Protein Structure

Author: Altun Gulsah
Publication venue: ScholarWorks @ Georgia State University
Publication date: 22/04/2008

Recently, many methods have been proposed for classification and prediction problems in bioinformatics. One of these problems is protein structure prediction. Machine learning approaches and new algorithms have been proposed to solve this problem. Among the machine learning approaches, Support Vector Machines (SVMs) have attracted a lot of attention due to their high prediction accuracy. Since protein data consists of sequence and structural information, another widely used approach for modeling this structured data is to use graphs. In computer science, graph theory has been widely studied; however, it has only recently been applied to bioinformatics. In this work, we introduce new algorithms based on statistical methods, graph theory concepts and machine learning for the protein structure prediction problem. A new statistical method based on z-scores is introduced for seed selection in proteins. A new feature selection method based on finding common cliques in protein data is also introduced, which reduces noise in the data. We also introduce new binary classifiers for the prediction of structural transitions in proteins. These new binary classifiers achieve much higher accuracy than current traditional binary classifiers.

Combining classifiers for improved classification of proteins from sequence or structure

Author: Leslie Christina S
Melvin Iain
Noble William S
Weston Jason
Publication venue: BioMed Central
Publication date: 01/01/2008

Abstract. Background: Predicting a protein's structural or functional class from its amino acid sequence or structure is a fundamental problem in computational biology. Recently, there has been considerable interest in using discriminative learning algorithms, in particular support vector machines (SVMs), for classification of proteins. However, because sufficiently many positive examples are required to train such classifiers, all SVM-based methods are hampered by limited coverage. Results: In this study, we develop a hybrid machine learning approach for classifying proteins, and we apply the method to the problem of assigning proteins to structural categories based on their sequences or their 3D structures. The method combines a full-coverage but lower-accuracy nearest neighbor method with higher-accuracy but reduced-coverage multiclass SVMs to produce a full-coverage classifier with overall improved accuracy. The hybrid approach is based on the simple idea of "punting" from one method to another using a learned threshold. Conclusion: In cross-validated experiments on the SCOP hierarchy, the hybrid methods consistently outperform the individual component methods at all levels of coverage. Code and data sets are available at http://noble.gs.washington.edu/proj/sabretooth.
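The "punting" idea can be sketched as follows. This is an illustration under stated assumptions, not the authors' code: the direction of the punt (SVM first, nearest neighbor as fallback), the score dictionary format, and the helper names `hybrid_predict` and `learn_threshold` are all hypothetical.

```python
def hybrid_predict(svm_scores, nn_label, threshold):
    """Use the top-scoring SVM class if it clears the threshold, else punt.

    svm_scores: dict mapping class label -> SVM score; it may be empty when
                no SVM covers the query (the reduced-coverage case).
    nn_label:   label from the full-coverage nearest neighbor method.
    """
    if svm_scores:
        best = max(svm_scores, key=svm_scores.get)
        if svm_scores[best] >= threshold:
            return best
    return nn_label  # punt to the fallback method

def learn_threshold(examples, candidates):
    """Pick the candidate threshold with the best accuracy on held-out data.

    examples: list of (svm_scores, nn_label, true_label) tuples.
    """
    def accuracy(t):
        hits = sum(hybrid_predict(s, n, t) == y for s, n, y in examples)
        return hits / len(examples)
    return max(candidates, key=accuracy)
```

Because the fallback always returns a label, the combined classifier keeps full coverage while deferring to the SVM only where its score suggests it is reliable.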

Directory of Open Access Journals

arXiv.org e-Print Archive

Protein Long Local Structure Prediction

Author: Altschul
Baeten
Benros
Benros
Blundell
Boeckmann
Bowie
Bystroff
Bystroff
de Bakker
de Brevern
de Brevern
de Brevern
de Brevern
de Brevern
de Brevern
de Brevern
Dong
Doppelt
Dudev
Eddy
Etchebest
Etchebest
Fawcett
Fiser
Fitzkee
Fourrier
Hastie
Hazout
Ihaka
Jauch
Joachims
Jones
Karchin
Kohonen
Kuang
Kuang
Lewis
Lin
Mittelman
Murzin
Noble
Noguchi
Offmann
Pei
Rangwala
Rohl
Rooman
Sander
Sawada
Song
Soto
Tyagi
Tyagi
Tyagi
Ward
Xiang
Yang
Zhang
Zhou
Zhu
Publication venue: Wiley
Publication date: 14/01/2009

A relevant and accurate description of three-dimensional (3D) protein structures can be achieved by characterizing recurrent local structures. In a previous study, we developed a library of 120 3D structural prototypes encompassing all known 11-residue-long local protein structures and ensuring a good quality of structural approximation. A local structure prediction method was also proposed. Here, the overlapping properties of local protein structures within global ones are taken into account to characterize frequent local networks. At the same time, we propose a new long local structure prediction strategy that couples evolutionary information with Support Vector Machines (SVMs). Our prediction is evaluated by a stringent geometrical assessment: every local structure prediction with a Cα RMSD of less than 2.5 Å from the true local structure is considered correct. A global prediction rate of 63.1% is reached, corresponding to an improvement of 7.7 points compared with the previous strategy. In the same way, the prediction of 88.33% of the 120 structural classes is improved, with a mean gain of 8.65%. 85.33% of proteins have better prediction results, with an average gain of 9.43%. An analysis of prediction rate per local network also supports the global improvement and gives insights into the potential of our method for predicting super local structures. Moreover, a confidence index for the direct estimation of prediction quality is proposed. Finally, our method proves to be very competitive with cutting-edge strategies encompassing three categories of local structure predictions. Proteins 2009. (c) 2009 Wiley-Liss, Inc.
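The geometric correctness criterion (Cα RMSD below 2.5 Å) can be written out as a short sketch. It assumes the two fragments have already been optimally superimposed and are given as equal-length lists of (x, y, z) coordinates in Å; the superposition step and the function names are assumptions not taken from the abstract.

```python
import math

def ca_rmsd(coords_a, coords_b):
    """Root-mean-square deviation between two pre-superimposed C-alpha traces."""
    if not coords_a or len(coords_a) != len(coords_b):
        raise ValueError("need two non-empty traces of equal length")
    squared = sum(
        (xa - xb) ** 2 + (ya - yb) ** 2 + (za - zb) ** 2
        for (xa, ya, za), (xb, yb, zb) in zip(coords_a, coords_b)
    )
    return math.sqrt(squared / len(coords_a))

def is_correct_prediction(predicted, true, cutoff=2.5):
    """Apply the abstract's rule: correct if the RMSD is below the cutoff (in angstroms)."""
    return ca_rmsd(predicted, true) < cutoff
```

The prediction rate reported in the abstract would then be the fraction of fragments for which such a check returns True.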

SBSM-Pro: Support Bio-sequence Machine for Proteins

Author: Ding Yijie
Wang Yizheng
Zhai Yixiao
Zou Quan
Publication date: 20/08/2023

Proteins play a pivotal role in biological systems. The use of machine learning algorithms for protein classification can assist and even guide biological experiments, offering crucial insights for biotechnological applications. We propose a support bio-sequence machine for proteins, a model specifically designed for biological sequence classification. This model starts with raw sequences and groups amino acids based on their physicochemical properties. It incorporates sequence alignment to measure the similarities between proteins and uses a novel MKL approach to integrate various types of information, utilizing support vector machines for classification prediction. The results indicate that our model demonstrates commendable performance across 10 datasets in terms of the identification of protein function and posttranslational modification. This research not only showcases state-of-the-art work in protein classification but also paves the way for new directions in this domain, representing a beneficial endeavour in the development of platforms tailored for biological sequence classification. SBSM-Pro is available for access at http://lab.malab.cn/soft/SBSM-Pro/. Comment: 38 pages, 9 figures
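Grouping amino acids by physicochemical properties amounts to re-encoding a sequence in a reduced alphabet. The sketch below uses an illustrative five-group scheme chosen for this example; it is an assumption and not the grouping used by SBSM-Pro, which the abstract does not specify.

```python
# Hypothetical grouping for illustration only; SBSM-Pro's groups may differ.
GROUPS = {
    "h": "AVLIMC",   # hydrophobic / aliphatic
    "a": "FWY",      # aromatic
    "p": "STNQGP",   # polar or small
    "+": "KRH",      # positively charged
    "-": "DE",       # negatively charged
}

# Invert the table so each residue maps to its group symbol.
GROUP_OF = {res: symbol for symbol, members in GROUPS.items() for res in members}

def reduce_sequence(seq, unknown="x"):
    """Re-encode a protein sequence using one symbol per physicochemical group."""
    return "".join(GROUP_OF.get(res, unknown) for res in seq.upper())
```

A reduced alphabet makes residues with similar properties interchangeable, so alignment-based similarity scores computed on the re-encoded sequences are less sensitive to conservative substitutions.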

PSP_MCSVM: brainstorming consensus prediction of protein secondary structures using two-stage multiclass support vector machines

Author: A Kloczkowski
AA Salamov
B Rost
B Rost
B Rost
C Cole
D Frishman
Dariusz Plewczynski
DG Kneller
H Lin
J Garnier
J Garnier
J Guo
JA Cuff
JF Gibrat
K Wu
LM Jonathon
M Ouali
Mahantapas Kundu
Mita Nasipuri
N Qian
P Chatterjee
Piyali Chatterjee
PY Chou
RD King
SF Altschul
Subhadip Basu
TD Jones
Publication venue: Springer-Verlag
Publication date: 01/01/2011

Secondary structure prediction is a crucial task for understanding the variety of protein structures and the biological functions they perform. Prediction of secondary structures for new proteins using their amino acid sequences is of fundamental importance in bioinformatics. We propose a novel technique to predict protein secondary structures based on position-specific scoring matrices (PSSMs) and physicochemical properties of amino acids. It is a two-stage approach involving multiclass support vector machines (SVMs) as classifiers for three different structural conformations, viz., helix, sheet and coil. In the first stage, PSSMs obtained from PSI-BLAST and five specially selected physicochemical properties of amino acids are fed into SVMs as features for sequence-to-structure prediction. Confidence values for forming helix, sheet and coil obtained from the first-stage SVM are then used in the second-stage SVM for structure-to-structure prediction. The two-stage cascaded classifiers (PSP_MCSVM) are trained with proteins from the RS126 dataset. The classifiers are finally tested on target proteins of the Critical Assessment of protein Structure Prediction experiment 9 (CASP9). PSP_MCSVM with the brainstorming consensus procedure performs better than prediction servers such as Predator, DSC and SIMPA96 for randomly selected proteins from CASP9 targets. The overall performance is found to be comparable with the current state of the art. PSP_MCSVM source code, train-test datasets and supplementary files are freely available in the public domain at: http://sysbio.icm.edu.pl/secstruct and http://code.google.com/p/cmater-bioinfo
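A two-stage cascade of this kind needs per-residue input vectors for each stage. One common way to build them, assumed here because the abstract does not state the window scheme, is a zero-padded sliding window over per-residue rows (PSSM rows in stage one, helix/sheet/coil confidence triples in stage two). The function name and the `half_width` parameter are hypothetical.

```python
def window_features(rows, half_width):
    """Concatenate each residue's row with its neighbours' rows.

    rows:       list of equal-length numeric lists, one per residue.
    half_width: number of neighbours taken on each side (assumed parameter).
    Positions that fall outside the sequence are filled with zeros.
    """
    if not rows:
        return []
    pad = [0.0] * len(rows[0])
    features = []
    for i in range(len(rows)):
        vector = []
        for j in range(i - half_width, i + half_width + 1):
            vector.extend(rows[j] if 0 <= j < len(rows) else pad)
        features.append(vector)
    return features
```

Under this assumption, the second stage would call the same function on the three confidence values per residue produced by the first stage, letting the classifier smooth predictions using neighbouring residues.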

Machine learning integration for predicting the effect of single amino acid substitutions on protein stability

Author: Alpaydın Ethem
Gönen Mehmet
Haliloğlu Türkan
Özen Ayşegül
Publication venue: BioMed Central
Publication date: 01/01/2009

Abstract. Background: Computational prediction of protein stability change due to single-site amino acid substitutions is of interest in protein design and analysis. We consider the following four ways to improve the performance of the currently available predictors: (1) We include additional sequence- and structure-based features, namely, the amino acid substitution likelihoods, the equilibrium fluctuations of the alpha- and beta-carbon atoms, and the packing density. (2) By implementing different machine learning integration approaches, we combine information from different features or representations. (3) We compare classification vs. regression methods to predict the sign vs. the output of stability change. (4) We allow a reject option for doubtful cases where the risk of misclassification is high. Results: We investigate three different approaches: early, intermediate and late integration, which respectively combine features, kernels over feature subsets, and decisions. We perform simulations on two data sets: (1) S1615, which was used in previous studies, and (2) S2783, the updated version (as of July 2, 2009), also extracted from ProTherm. For the S1615 data set, our highest accuracy using both sequence and structure information is 0.842 on cross-validation and 0.904 on testing using early integration. Newly added features, namely, local compositional packing and the mobility extent of the mutated residues, improve accuracy significantly with intermediate integration. For the S2783 data set, we also train regression methods to estimate not only the sign but also the amount of stability change, and apply risk-based classification to reject when the learner has low confidence and the loss of misclassification is high. The highest accuracy is 0.835 on cross-validation and 0.832 on testing using only sequence information. The percentage of false positives can be decreased to less than 0.005 by rejecting 10 per cent using late integration.
Conclusion: We find that in both early and late integration, combining inputs or decisions is useful in increasing accuracy. Intermediate integration allows assessing the contributions of individual features by looking at the assigned weights. Overall accuracy of regression is not better than that of classification, but it has fewer false positives, especially when combined with the reject option. The server for stability prediction for the three integration approaches and the data sets are available at http://www.prc.boun.edu.tr/appserv/prc/mlsta.
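Late integration with a reject option can be sketched in a few lines. The sketch assumes each predictor outputs a signed score whose sign encodes the predicted direction of stability change, combines the scores by simple averaging, and rejects when the combined score is too close to zero. Averaging and the fixed margin are simplifying assumptions; the abstract describes a risk-based rejection rule whose details are not given.

```python
def late_integration(decisions, reject_margin):
    """Combine signed decision scores from several predictors.

    decisions:     non-empty list of signed scores, one per predictor.
    reject_margin: assumed confidence cutoff; combined scores whose magnitude
                   is at or below it are rejected rather than classified.
    Returns +1 or -1 for the predicted sign, or None when rejected.
    """
    if not decisions:
        raise ValueError("at least one decision score is required")
    mean = sum(decisions) / len(decisions)
    if abs(mean) <= reject_margin:
        return None  # doubtful case: abstain instead of risking an error
    return 1 if mean > 0 else -1
```

Raising `reject_margin` trades coverage for reliability, which matches the abstract's observation that rejecting a fraction of cases lowers the false positive rate.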

Directory of Open Access Journals

BhairPred: prediction of β-hairpins in a protein from multiple alignment information using ANN and SVM techniques

Author: Bhasin Manoj
Kumar Manish
Natt Navjot K.
Raghava G. P. S.
Publication venue: Oxford University Press
Publication date: 02/05/2005

This paper describes a method for predicting a supersecondary structural motif, the β-hairpin, in a protein sequence. The method was trained and tested on a set of 5102 hairpins and 5131 non-hairpins, obtained from a non-redundant dataset of 2880 proteins using the DSSP and PROMOTIF programs. Two machine-learning techniques, an artificial neural network (ANN) and a support vector machine (SVM), were used to predict β-hairpins. An accuracy of 65.5% was achieved using the ANN when an amino acid sequence was used as the input. The accuracy improved from 65.5 to 69.1% when evolutionary information (PSI-BLAST profile), observed secondary structure and surface accessibility were used as the inputs. The accuracy of the method further improved from 69.1 to 79.2% when the SVM was used for classification instead of the ANN. The performance of the methods developed was assessed in a test case, where predicted secondary structure and surface accessibility were used instead of the observed structure. The highest accuracy achieved by the SVM-based method in the test case was 77.9%. A maximum accuracy of 71.1% with a Matthews correlation coefficient of 0.41 in the test case was obtained on a dataset previously used by X. Cruz, E. G. Hutchinson, A. Shephard and J. M. Thornton (2002) Proc. Natl Acad. Sci. USA, 99, 11157–11162. The performance of the method was also evaluated on proteins used in the ‘6th community-wide experiment on the critical assessment of techniques for protein structure prediction (CASP6)’. Based on the algorithm described, a web server, BhairPred, has been developed, which can be used to predict β-hairpins in a protein using the SVM approach.
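The two metrics quoted in this abstract, accuracy and the Matthews correlation coefficient, are both computed from the four cells of a binary confusion matrix. The sketch below uses the standard definitions; the function names and the convention of returning 0.0 when the coefficient is undefined are choices made for this example.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def matthews_cc(tp, tn, fp, fn):
    """Matthews correlation coefficient, ranging from -1 to +1.

    Returns 0.0 when any marginal total is zero and the ratio is undefined
    (a convention assumed here).
    """
    denominator = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denominator == 0:
        return 0.0
    return (tp * tn - fp * fn) / denominator
```

On a nearly balanced set such as the 5102 hairpins and 5131 non-hairpins described here, accuracy is already informative, while the correlation coefficient additionally penalizes a classifier that does well on only one of the two classes.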