
University of Melbourne Institutional Repository

Predicting residue-wise contact orders in proteins by support vector regression

Author: A Bairoch
AG Murzin
AR Kinjo
B Rost
CH Tsai
D Kihara
D Sarda
DT Jones
G Pollastri
GP Raghava
HM Berman
J Song
J Wang
Jiangning Song
JM Chandonia
Kevin Burrage
KW Plaxco
M Punta
MPS Brown
NP Prabhu
S Ahmad
S Hua
V Vapnik
W Kabsch
W Liu
X Wang
Z Yuan
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The residue-wise contact order (RWCO) describes the sequence separations between a residue of interest and its contacting residues in a protein sequence. It is a new kind of one-dimensional protein structure that represents the extent of long-range contacts and is considered a generalization of contact order. Together with secondary structure, accessible surface area, the B factor, and contact number, RWCO provides comprehensive and indispensable information for reconstructing the protein three-dimensional structure from a set of one-dimensional structural properties. Accurately predicting RWCO values could have many important applications in protein three-dimensional structure prediction and protein folding rate prediction, and could give deep insights into protein sequence-structure relationships.

RESULTS: We developed a novel approach to predict residue-wise contact order values in proteins based on support vector regression (SVR), starting from primary amino acid sequences. We explored seven different sequence encoding schemes to examine their effects on the prediction performance: local sequence in the form of PSI-BLAST profiles; local sequence plus amino acid composition; local sequence plus molecular weight; local sequence plus secondary structure predicted by PSIPRED; local sequence plus molecular weight and amino acid composition; local sequence plus molecular weight and predicted secondary structure; and local sequence plus molecular weight, amino acid composition and predicted secondary structure. When using local sequences with multiple sequence alignments in the form of PSI-BLAST profiles, we could predict the RWCO distribution with a Pearson correlation coefficient (CC) between the predicted and observed RWCO values of 0.55 and a root mean square error (RMSE) of 0.82, based on a well-defined dataset of 680 protein sequences. Moreover, incorporating global features such as molecular weight and amino acid composition further improved the prediction performance, to a CC of 0.57 and an RMSE of 0.79. In addition, adding the secondary structure predicted by PSIPRED was found to significantly improve the prediction performance and yielded the best prediction accuracy, with a CC of 0.60 and an RMSE of 0.78, which is at least comparable to that of other existing methods.

CONCLUSION: The SVR method shows a prediction performance competitive with, or at least comparable to, the previously developed linear regression-based methods for predicting RWCO values. In contrast to support vector classification (SVC), SVR is very good at estimating the raw value profiles of the samples. The successful application of the SVR approach in this study reinforces the view that support vector regression is a powerful tool for extracting the protein sequence-structure relationship and for estimating protein structural profiles from amino acid sequences.
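The two figures of merit quoted in this abstract, the Pearson correlation coefficient (CC) and the root mean square error (RMSE), can be computed as in the minimal sketch below. It uses only the standard definitions; the function names and the sample values are hypothetical and are not taken from the paper.

```python
import math

def pearson_cc(observed, predicted):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(observed)
    mean_o = sum(observed) / n
    mean_p = sum(predicted) / n
    cov = sum((o - mean_o) * (p - mean_p) for o, p in zip(observed, predicted))
    var_o = sum((o - mean_o) ** 2 for o in observed)
    var_p = sum((p - mean_p) ** 2 for p in predicted)
    return cov / math.sqrt(var_o * var_p)

def rmse(observed, predicted):
    """Root mean square error between observed and predicted values."""
    n = len(observed)
    return math.sqrt(sum((o - p) ** 2 for o, p in zip(observed, predicted)) / n)

# Hypothetical per-residue values, made up for illustration only.
observed = [0.2, 1.1, -0.5, 0.8, -1.0]
predicted = [0.1, 0.9, -0.2, 0.5, -0.7]
print(pearson_cc(observed, predicted), rmse(observed, predicted))
```

A higher CC and a lower RMSE both indicate better agreement, which is why the abstract reports the two together.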

Springer - Publisher Connector

Queensland University of Technology ePrints Archive

University of Queensland eSpace

Predicting Secondary Structures, Contact Numbers, and Residue-wise Contact Orders of Native Protein Structure from Amino Acid Sequence by Critical Random Networks

Author: Altschul S. F., Madden, T. L., Sch
Baldi P., Brunak, S., Frasconi, P.
CHANDONIA J-M
Crooks G. E.
Kinjo A. R.
Kinjo A. R., Horimoto, K.
Lee B.
Li W., Jaroszewski, L.
Nishikawa K.
Pollastri G., Baldi, P., Fariselli
Rost B.
TATENO Y
Publication venue: Biophysical Society of Japan
Publication date: 01/01/2005

Prediction of one-dimensional protein structures such as secondary structures and contact numbers is useful for the three-dimensional structure prediction and important for the understanding of sequence-structure relationship. Here we present a new machine-learning method, critical random networks (CRNs), for predicting one-dimensional structures, and apply it, with position-specific scoring matrices, to the prediction of secondary structures (SS), contact numbers (CN), and residue-wise contact orders (RWCO). The present method achieves, on average, Q_3 accuracy of 77.8% for SS, correlation coefficients of 0.726 and 0.601 for CN and RWCO, respectively. The accuracy of the SS prediction is comparable to other state-of-the-art methods, and that of the CN prediction is a significant improvement over previous methods. We give a detailed formulation of critical random networks-based prediction scheme, and examine the context-dependence of prediction accuracies. In order to study the nonlinear and multi-body effects, we compare the CRNs-based method with a purely linear method based on position-specific scoring matrices. Although not superior to the CRNs-based method, the surprisingly good accuracy achieved by the linear method highlights the difficulty in extracting structural features of higher order from amino acid sequence beyond that provided by the position-specific scoring matrices.

Comment: 20 pages, 1 figure, 5 tables; minor revision; accepted for publication in BIOPHYSIC
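Q_3 is conventionally the percentage of residues whose three-state secondary structure label (helix, strand or coil) is predicted correctly. A minimal sketch of that calculation follows; the one-letter labels H/E/C and the example strings are assumptions for illustration, not data from the paper.

```python
def q3_accuracy(observed, predicted):
    """Percentage of positions where the three-state labels agree."""
    if len(observed) != len(predicted):
        raise ValueError("label sequences must have the same length")
    matches = sum(o == p for o, p in zip(observed, predicted))
    return 100.0 * matches / len(observed)

# Hypothetical labels: H = helix, E = strand, C = coil.
observed = "HHHEEECCC"
predicted = "HHHEECCCC"
print(q3_accuracy(observed, predicted))
```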

arXiv.org e-Print Archive

Springer - Publisher Connector

Prediction of cis/trans isomerization in proteins using PSI-BLAST profiles and secondary structure information

Author: Burrage Kevin
Huber Thomas
Song Jiangning
Yuan Zheng
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: The majority of peptide bonds in proteins occur in the trans conformation. However, a considerable fraction of prolyl peptide bonds adopt the cis form. Proline cis/trans isomerization is known to play a critical role in protein folding, splicing, cell signaling and transmembrane active transport. Accurate prediction of proline cis/trans isomerization in proteins would have many important applications towards the understanding of protein structure and function.

RESULTS: In this paper, we propose a new approach to predict proline cis/trans isomerization in proteins using a support vector machine (SVM). Preliminary results indicated that Radial Basis Function (RBF) kernels could lead to better prediction performance than polynomial and linear kernel functions. We used single-sequence information with different local window sizes, amino acid compositions of different local sequences, multiple sequence alignments obtained from PSI-BLAST and the secondary structure information predicted by PSIPRED, and explored these sequence encoding schemes to investigate their effects on the prediction performance. Training and testing were performed on a newly enlarged dataset of 2424 non-homologous proteins determined by X-ray diffraction, using 5-fold cross-validation. A window size of 11 provided the best performance for determining proline cis/trans isomerization from the single amino acid sequence. Using multiple sequence alignments in the form of PSI-BLAST profiles significantly improved the prediction performance: the prediction accuracy increased from 62.8% with single sequence to 69.8%, and the Matthews Correlation Coefficient (MCC) improved from 0.26 to 0.40. Furthermore, when coupled with the secondary structure information predicted by PSIPRED, our method yielded a prediction accuracy of 71.5% and an MCC of 0.43, about 9 percentage points and 0.17 higher, respectively, than the values achieved with single-sequence information.

CONCLUSION: A new method has been developed to predict proline cis/trans isomerization in proteins based on a support vector machine. It uses the single amino acid sequence with different local window sizes, the amino acid compositions of the local sequence flanking the central proline residue, the position-specific scoring matrices (PSSMs) extracted by PSI-BLAST and the predicted secondary structures generated by PSIPRED. The successful application of the SVM approach in this study reinforces that SVM is a powerful tool for predicting proline cis/trans isomerization in proteins and for biological sequence analysis.
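The accuracy and MCC values reported above are both derived from the counts of a binary confusion matrix (here, cis versus trans). The sketch below uses the standard definitions; the function names and the counts are hypothetical, and returning 0.0 when the MCC denominator vanishes is a common convention assumed here rather than something the abstract specifies.

```python
import math

def accuracy(tp, tn, fp, fn):
    """Fraction of all predictions that are correct."""
    return (tp + tn) / (tp + tn + fp + fn)

def mcc(tp, tn, fp, fn):
    """Matthews correlation coefficient from confusion-matrix counts."""
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    if denom == 0:
        return 0.0  # assumed convention for a degenerate confusion matrix
    return (tp * tn - fp * fn) / denom

# Hypothetical counts, made up for illustration only.
print(accuracy(40, 30, 20, 10), mcc(40, 30, 20, 10))
```

Unlike accuracy, MCC stays near zero for a predictor that always outputs the majority class, which makes it a useful companion measure when one class is much rarer than the other.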

Queensland University of Technology ePrints Archive

The Australian National University

University of Queensland eSpace

TANGLE: Two-Level Support Vector Regression Approach for Protein Backbone Torsion Angle Prediction from Primary Sequences

Author: A Schlessinger
AG de Brevern
B Rost
B Xue
C Bystroff
C Haynes
C Mooney
C Zhang
C Zheng
Christian Schönbach
D Xie
DT Jones
E Faraggi
G Helles
Geoffrey I. Webb
GN Ramachandran
GP Raghava
H Zhang
Hao Tan
HJ Dyson
HS Kang
J Cheng
J Gao
J Gsponer
J Song
Jiangning Song
JJ Ward
JS Chauhan
K Chen
L Chen
L Kurgan
M Kumar
Mingjun Wang
MJ Mizianty
MJ Rooman
MJ Wood
MK Kalita
MN Nguyen
MV Berjanskii
O Dor
O Zimmermann
P Chen
P Kountouris
P Sliz
PC Chen
R Gaudet
R Karchin
R Kuang
R Verma
S Ahmad
S Liang
S Qiu
S Wu
SF Altschul
T Ishida
T Zhang
Tatsuya Akutsu
V Vapnik
W Kabsch
W Liu
W Zhang
X Miao
X Wang
XY Pan
Y Ofran
YM Huang
Z Markovic-Housley
Z Yuan
Publication venue: Public Library of Science
Publication date: 01/01/2012

The protein backbone torsion angles Phi and Psi describe rotations around the Cα-N bond (Phi) and the Cα-C bond (Psi). Due to the planarity of the linked rigid peptide bonds, these two angles essentially determine the backbone geometry of proteins. Accordingly, accurate prediction of protein backbone torsion angles from sequence information can assist the prediction of protein structures. In this study, we develop a new approach called TANGLE (Torsion ANGLE predictor) to predict the protein backbone torsion angles from amino acid sequences. TANGLE uses a two-level support vector regression approach to perform real-value torsion angle prediction using a variety of features derived from amino acid sequences, including evolutionary profiles in the form of position-specific scoring matrices, predicted secondary structure, solvent accessibility and natively disordered regions, as well as other global sequence features. When evaluated on a large benchmark dataset of 1,526 non-homologous proteins, the mean absolute errors (MAEs) of the Phi and Psi angle predictions are 27.8° and 44.6°, respectively, which are 1% and 3% lower, respectively, than those obtained with ANGLOR, one of the state-of-the-art prediction tools. Moreover, the prediction of TANGLE is significantly better than that of a random predictor built on an amino acid-specific basis, with p-values < 1.46e-147 and 7.97e-150, respectively, by the Wilcoxon signed rank test. As a complementary approach to current torsion angle prediction algorithms, TANGLE should prove useful in predicting protein structural properties and in assisting protein fold recognition by applying the predicted torsion angles as restraints. TANGLE is freely accessible at http://sunflower.kuicr.kyoto-u.ac.jp/~sjn/TANGLE/
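Because torsion angles are periodic, a mean absolute error over angles is usually computed on the shorter arc between the observed and predicted values, so that 170° and -170° differ by 20° rather than 340°. The abstract does not state how TANGLE handles this wrap-around, so the sketch below is an assumption about a common convention, with hypothetical function names and example angles.

```python
def angular_error(observed_deg, predicted_deg):
    """Absolute difference between two angles in degrees, wrapped to [0, 180]."""
    diff = abs(observed_deg - predicted_deg) % 360.0
    return min(diff, 360.0 - diff)

def angular_mae(observed, predicted):
    """Mean absolute error over paired angles, using the wrapped difference."""
    errors = [angular_error(o, p) for o, p in zip(observed, predicted)]
    return sum(errors) / len(errors)

# Hypothetical angles in degrees, made up for illustration only.
print(angular_mae([170.0, -60.0], [-170.0, -45.0]))
```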

Public Library of Science (PLOS)

Monash University Research Portal

Machine Learning and Graph Theory Approaches for Classification and Prediction of Protein Structure

Author: Altun Gulsah
Publication venue: ScholarWorks @ Georgia State University
Publication date: 22/04/2008

Recently, many methods have been proposed for classification and prediction problems in bioinformatics. One of these problems is protein structure prediction, for which machine learning approaches and new algorithms have been proposed. Among the machine learning approaches, Support Vector Machines (SVM) have attracted a lot of attention due to their high prediction accuracy. Since protein data consists of sequence and structural information, another widely used approach for modeling this structured data is to use graphs. Graph theory has been widely studied in computer science, but it has only recently been applied to bioinformatics. In this work, we introduce new algorithms based on statistical methods, graph theory concepts and machine learning for the protein structure prediction problem. A new statistical method based on z-scores is introduced for seed selection in proteins. A new feature selection method based on finding common cliques in protein data is also introduced, which reduces noise in the data. We also introduce new binary classifiers for the prediction of structural transitions in proteins. These new binary classifiers achieve much higher accuracy than current traditional binary classifiers.

ScholarWorks @ Georgia State University

Prediction of peptide and protein propensity for amyloid formation

Author: A Quintas
A Trovato
AC Davison
AC Tsolis
Alexandre Quintas
AM Fernandez-Escamilla
AP Pawar
AV Finkelstein
B Rost
C Nerelius
Carlos Família
CM Dobson
D Eisenberg
David A. Phoenix
DJ Selkoe
DM Fowler
Eugene A. Permyakov
F Chiti
F Sasagawa
GG Tartaglia
H Hu
I Cherny
I Walsh
IV Baskakov
J Palau
J Tian
JC Rochet
JD Sipe
JM Zimmerman
JW Kelly
K Rajagopal
KF DuBay
KK Frousios
KT O’Neil
L Goldschmidt
LO Jimenez
M Belli
M Emily
M Hollander
M Kuhn
M López de la Paz
M Oliveberg
M Stefani
M Sunde
M Zamani
MB Kursa
MJ Thompson
MT Pastor
N Becker
N Qian
O Conchillo-Solé
PK Teng
PY Chou
RS Harrison
S Idicula-thomas
S Kawashima
S Maurer-Stroh
S Ventura
S Yoon
Sarah R. Dennison
SJ Hamodrakas
SK Maji
SO Garbuzynskiy
T Hothorn
T Scheibel
TPJ Knowles
VS Mathura
WH DePas
WT Astbury
Y Kallberg
Ž Eva
Publication venue: Public Library of Science (PLoS)
Publication date: 09/07/2014

Understanding of which peptides and proteins have the potential to undergo amyloid formation, and of what driving forces are responsible for amyloid-like fiber formation and stabilization, remains limited. This is mainly because proteins that can undergo the structural changes leading to amyloid formation are quite diverse and share no obvious sequence or structural homology, despite the structural similarity found in the fibrils. To address these issues, a novel approach based on recursive feature selection and feed-forward neural networks was undertaken to identify key features highly correlated with the self-assembly problem. This approach allowed the identification of seven physicochemical and biochemical properties of the amino acids highly associated with the self-assembly of peptides and proteins into amyloid-like fibrils (normalized frequency of β-sheet, normalized frequency of β-sheet from LG, weights for β-sheet at the window position of 1, isoelectric point, atom-based hydrophobic moment, helix termination parameter at position j+1 and ΔG° values for peptides extrapolated in 0 M urea). Moreover, these features enabled the development of a new predictor (available at http://cran.r-project.org/web/packages/appnn/index.html) capable of accurately and reliably predicting the amyloidogenic propensity from the polypeptide sequence alone, with a prediction accuracy of 84.9% against an external validation dataset of sequences with experimental in vitro evidence of amyloid formation.

Public Library of Science (PLOS)

ScholarWorks @ Georgia State University

FigShare

Protein Secondary Structure Prediction Using Support Vector Machines, Neural Networks and Genetic Algorithms

Author: Reyaz-Ahmed Anjum B
Publication venue: ScholarWorks @ Georgia State University
Publication date: 03/05/2007

Bioinformatics techniques for protein secondary structure prediction mostly depend on the information available in the amino acid sequence. Support vector machines (SVM) have shown strong generalization ability in a number of application areas, including protein structure prediction. In this study, a new sliding window scheme with multiple windows is introduced to form the protein data for training and testing the SVM. An orthogonal encoding scheme coupled with the BLOSUM62 matrix is used to make the prediction. First, the predictions of binary classifiers using multiple windows are compared with those of the single window scheme; the results show that a single window is not the best choice in all cases. Two new classifiers are then introduced for effective tertiary classification. These new classifiers use neural networks and genetic algorithms to optimize the accuracy of the tertiary classifier. The accuracy levels of the new architectures are determined and compared with other studies. The tertiary architecture is better than most available techniques.
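The sliding window and orthogonal encoding mentioned above can be illustrated with a minimal single-window sketch: each residue in a window centred on the target position becomes a one-hot vector, and the vectors are concatenated into one feature vector. The 21-slot layout (20 amino acids plus one padding/unknown slot), the odd window size and the function names are assumptions for illustration; the multiple-window scheme and the BLOSUM62 coupling described in the abstract are not reproduced here.

```python
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues

def one_hot(residue):
    """Orthogonal encoding: 20 amino-acid slots plus one padding/unknown slot."""
    vec = [0] * (len(AMINO_ACIDS) + 1)
    idx = AMINO_ACIDS.find(residue)
    vec[idx if idx >= 0 else len(AMINO_ACIDS)] = 1
    return vec

def window_features(sequence, center, window_size):
    """Concatenate one-hot vectors for a window centred on `center`.

    Positions that fall off either end of the sequence use the padding slot.
    `window_size` is assumed to be odd.
    """
    half = window_size // 2
    features = []
    for pos in range(center - half, center + half + 1):
        residue = sequence[pos] if 0 <= pos < len(sequence) else "-"
        features.extend(one_hot(residue))
    return features

# Hypothetical sequence; a window of 5 residues gives 5 * 21 = 105 features.
print(len(window_features("ACDEFG", 0, 5)))
```

Each window yields one training example for the classifier, labelled with the secondary structure state of the central residue.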

SiteSeek: Post-translational modification analysis using adaptive locality-effective kernel methods and new profiles

Author: A Radzicka
A Zanzoni
AA Salamov
Albert Y Zomaya
B Amos
B Rost
B Zhang
BA Ballif
Bing Bing Zhou
C Li
CJC Burgess
D Frishman
DT Larose
F Diella
G Horváth
G Rose
GD Rose
H Hu
H Kim
HH Jang
HM Berman
J Liu
J Shawe-Taylor
JC Obenauer
JH Kim
K Koliba
K Lin
L Graves
L Johnson
LA Pinna
LM Iakoucheva
M Hjerrild
M Mann
M Scholz
MA Kramer
MB Yaffe
MJ Korenberg
MJ Zvelebil
N Blom
NL Daly
P Baldi
P Cohen
Paul D Yoo
R David
R Lohmann
RD King
RE Schapire
SA Beausoleil
SB Ficarro
T Hunter
TG Dietterich
W Hardle
Y Freund
Y Xue
Yung Shwen Ho
Z Songyang
ZR Yang
Publication venue: BioMed Central
Publication date: 01/01/2008

BACKGROUND: Post-translational modifications have a substantial influence on the structure and functions of proteins. Post-translational phosphorylation is one of the most common modifications that occur in intracellular proteins. Accurate prediction of protein phosphorylation sites is of great importance for the understanding of diverse cellular signalling processes in both the human body and in animals. In this study, we propose a new machine learning based protein phosphorylation site predictor, SiteSeek. SiteSeek is trained using a novel compact evolutionary and hydrophobicity profile to detect possible protein phosphorylation sites for a target sequence. The newly proposed method proves to be more accurate and exhibits a much more stable predictive performance than currently existing phosphorylation site predictors.

RESULTS: The performance of the proposed model was compared to nine different existing machine learning models and four widely known phosphorylation site predictors on the newly proposed PS-Benchmark_1 dataset to contrast their accuracy, sensitivity, specificity and correlation coefficient. SiteSeek showed better predictive performance, with 86.6% accuracy, 83.8% sensitivity, 92.5% specificity and a correlation coefficient of 0.77 on the four main kinase families (CDK, CK2, PKA, and PKC).

CONCLUSION: The newly proposed methods used in SiteSeek were shown to be useful for the identification of protein phosphorylation sites, as SiteSeek performed much better than widely known predictors on the newly built PS-Benchmark_1 dataset.

Springer - Publisher Connector