Prediction of Protein Domain with mRMR Feature Selection and Analysis

AA Schaffer; AG Murzin; AK Dunker; AM Moses; AP Elhammer; B Saffari; Bi-Qing Li; Bin Xue; BQ Li; CA Orengo; D Chivian; D Li; DE Kim; E Angov; EC Mbamala; G Pugalenthi; GP Zhou; GP Zhou; H Ingolfsson; H Mohabatkar; H Peng; HB Shen; HB Shen; I Walsh; ID Campbell; IH Witten; J Chen; J Cheng; J Cheng; J Cheng; J Eickholt; J Lin; J Liu; J Liu; J Wang; JD Qiu; JE Gewehr; JJ Chou; JR Schnell; K Peng; K Shameer; K Wang; Kai-Yan Feng; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KC Chou; KK Kandaswamy; Kuo-Chen Chou; L Breiman; L Chen; L Holm; Le-Le Hu; Lei Chen; M Esmaeili; M Hayat; M Suyama; MJ Berardi; MK Yoon; N Nagarajan; N von Ohsen; NM Goldenberg; P Mundra; P Tompa; P Wang; PE Wright; PK Nielsen; Q Gu; R Apweiler; R Bondugula; R Guerois; R Linding; RA George; RA Poorman; S Gong; S Kawashima; S Roy; SC Jia; SF Altschul; SM Reynolds; T Ebina; T Huang; TA Holland; W Li; W Zhao; WR Atchley; WZ Lin; X Xiao; X Xiao; X Xiao; X Xiao; X Xiao; X Xiao; X Xiao; Y Zhang; YD Cai; YD Li; Yu-Dong Cai; YX Li; Z He; Z Qiu; ZC Wu; ZC Wu

Publication date: 1 January 2012
Publisher: Public Library of Science

Abstract

Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate protein structure prediction and speed up functional annotation. Although many efforts have been made in this regard, predicting protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), and by incorporating features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, about 28–40% higher than the rates achieved by existing methods on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, which is quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may at the very least play a complementary role to existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigation in this area. Finally, it has not escaped our notice that the current approach can also be used to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
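The combination of mRMR ranking and IFS named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes discrete feature values, the common "relevance minus mean redundancy" form of mRMR, and a toy dataset invented for illustration. All function names are hypothetical. The abstract names a random forest as the classifier; a leave-one-out nearest-neighbour scorer is swapped in here only to keep the example dependency-free.

```python
import math
from collections import Counter


def mutual_information(xs, ys):
    """Mutual information (bits) between two equal-length discrete sequences."""
    n = len(xs)
    px, py, pxy = Counter(xs), Counter(ys), Counter(zip(xs, ys))
    return sum((c / n) * math.log2(c * n / (px[x] * py[y]))
               for (x, y), c in pxy.items())


def mrmr_rank(features, target):
    """Greedy mRMR ranking (relevance minus mean redundancy).

    features: dict mapping a feature name to its list of discrete values.
    Ties are broken alphabetically by feature name.
    """
    relevance = {f: mutual_information(col, target) for f, col in features.items()}
    ranked, remaining = [], sorted(features)
    while remaining:
        def score(f):
            if not ranked:
                return relevance[f]
            redundancy = sum(mutual_information(features[f], features[s])
                             for s in ranked) / len(ranked)
            return relevance[f] - redundancy
        best = max(remaining, key=score)
        ranked.append(best)
        remaining.remove(best)
    return ranked


def incremental_feature_selection(ranked, evaluate):
    """Score the top-k ranked features for k = 1..N; keep the first best subset."""
    best_subset, best_score = [], float("-inf")
    for k in range(1, len(ranked) + 1):
        s = evaluate(ranked[:k])
        if s > best_score:
            best_subset, best_score = ranked[:k], s
    return best_subset, best_score


def loo_nearest_neighbour_accuracy(features, target):
    """Toy evaluator (assumption): leave-one-out 1-NN accuracy, Hamming distance."""
    def evaluate(subset):
        rows = list(zip(*(features[f] for f in subset)))
        hits = 0
        for i, row in enumerate(rows):
            others = [j for j in range(len(rows)) if j != i]
            nearest = min(others,
                          key=lambda j: sum(a != b for a, b in zip(row, rows[j])))
            hits += target[nearest] == target[i]
        return hits / len(rows)
    return evaluate


# Invented toy data: the class is fully determined by features "a" and "b".
features = {
    "a":      [0, 0, 0, 0, 1, 1, 1, 1],
    "a_copy": [0, 0, 0, 0, 1, 1, 1, 1],  # redundant with "a"
    "b":      [0, 0, 1, 1, 0, 0, 1, 1],
    "noise":  [0, 1, 0, 1, 0, 1, 0, 1],  # unrelated to the class
}
target = [0, 0, 1, 1, 2, 2, 3, 3]

ranked = mrmr_rank(features, target)
subset, accuracy = incremental_feature_selection(
    ranked, loo_nearest_neighbour_accuracy(features, target))
print(ranked)            # ['a', 'b', 'a_copy', 'noise']
print(subset, accuracy)  # ['a', 'b'] 1.0
```

On this toy data the redundant copy of "a" is ranked below "b" even though both are equally relevant to the class, and IFS stops at the two-feature subset because adding further features does not improve the score. In practice the evaluator would wrap whatever classifier and validation scheme the study uses.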