
1,167 research outputs found

Improved general regression network for protein domain boundary prediction

Author: A Ceroni
A Vieira
Abdur R Sikder
AK Jain
Albert Y Zomaya
AR Sikder
Bing Bing Zhou
C Chothia
C Civera
CC Lee
CR Robinson
DB Wetlaufer
FMG Pearl
G Pollastri
HC Van Leeuwen
HM Berman
J Chen
J Cheng
J Liu
J Sim
JCB Melo
JE Gewehr
JS Richardson
JSR Jang
M Dumontier
M Suyama
MJ Lehtinen
N Nagarajan
OV Galzitskaya
P Baldi
P Bork
Paul D Yoo
RA George
RE Schapire
RL Marsden
RR Copley
RR Joshi
RS Gokhale
S Prompramote
S Veretnik
SF Altschul
TA Holland
Y Freund
Publication venue: BioMed Central
Publication date: 13/02/2008
Field of study

Background: Protein domains present some of the most useful information that can be used to understand protein structure and functions. Recent research on protein domain boundary prediction has been mainly based on widely known machine learning techniques, such as Artificial Neural Networks and Support Vector Machines. In this study, we propose a new machine learning model (IGRN) that can achieve accurate and reliable classification, with significantly reduced computations. The IGRN was trained using a PSSM (Position Specific Scoring Matrix), secondary structure, solvent accessibility information and inter-domain linker index to detect possible domain boundaries for a target sequence. Results: The proposed model achieved an average prediction accuracy of 67% on the Benchmark_2 dataset for domain boundary identification in multi-domain proteins and showed superior predictive performance and generalisation ability among the most widely used neural network models. With the CASP7 benchmark dataset, it also demonstrated comparable performance to existing domain boundary predictors such as DOMpro, DomPred, DomSSEA, DomCut and DomainDiscovery with 70.10% prediction accuracy. Conclusion: The proposed model compares favourably with other existing machine learning based methods as well as widely known domain boundary predictors on two benchmark datasets, and excels in the identification of domain boundaries in terms of model bias, generalisation and computational requirements. © 2008 Yoo et al; licensee BioMed Central Ltd

Crossref

Michigan Technological University

PubMed Central

Prediction of Protein Domain with mRMR Feature Selection and Analysis

Author: AA Schaffer
AG Murzin
AK Dunker
AM Moses
AP Elhammer
B Saffari
Bi-Qing Li
Bin Xue
BQ Li
CA Orengo
D Chivian
D Li
DE Kim
E Angov
EC Mbamala
G Pugalenthi
GP Zhou
H Ingolfsson
H Mohabatkar
H Peng
HB Shen
I Walsh
ID Campbell
IH Witten
J Chen
J Cheng
J Eickholt
J Lin
J Liu
J Wang
JD Qiu
JE Gewehr
JJ Chou
JR Schnell
K Peng
K Shameer
K Wang
Kai-Yan Feng
KC Chou
KK Kandaswamy
Kuo-Chen Chou
L Breiman
L Chen
L Holm
Le-Le Hu
Lei Chen
M Esmaeili
M Hayat
M Suyama
MJ Berardi
MK Yoon
N Nagarajan
N von Ohsen
NM Goldenberg
P Mundra
P Tompa
P Wang
PE Wright
PK Nielsen
Q Gu
R Apweiler
R Bondugula
R Guerois
R Linding
RA George
RA Poorman
S Gong
S Kawashima
S Roy
SC Jia
SF Altschul
SM Reynolds
T Ebina
T Huang
TA Holland
W Li
W Zhao
WR Atchley
WZ Lin
X Xiao
X Xiao
Y Zhang
YD Cai
YD Li
Yu-Dong Cai
YX Li
Z He
Z Qiu
ZC Wu
Publication venue: Public Library of Science
Publication date: 01/01/2012

Domains are the structural and functional units of proteins. With the avalanche of protein sequences generated in the postgenomic age, it is highly desirable to develop effective methods for predicting protein domains from sequence information alone, so as to facilitate the structure prediction of proteins and speed up their functional annotation. However, although many efforts have been made in this regard, prediction of protein domains from sequence information remains a challenging and elusive problem. Here, a new method was developed by combining the techniques of RF (random forest), mRMR (maximum relevance minimum redundancy), and IFS (incremental feature selection), as well as by incorporating the features of physicochemical and biochemical properties, sequence conservation, residual disorder, secondary structure, and solvent accessibility. The overall success rate achieved by the new method on an independent dataset was around 73%, which was about 28–40% higher than that of the existing method on the same benchmark dataset. Furthermore, an in-depth analysis revealed that the features of evolution, codon diversity, electrostatic charge, and disorder played more important roles than the others in predicting protein domains, quite consistent with experimental observations. It is anticipated that the new method may become a high-throughput tool for annotating protein domains, or may, at the very least, play a complementary role to the existing domain prediction methods, and that the findings about the key features with high impact on domain prediction might provide useful insights or clues for further experimental investigations in this area. Finally, it has not escaped our notice that the current approach can also be utilized to study protein signal peptides, B-cell epitopes, and HIV protease cleavage sites, among many other important topics in protein science and biomedicine.
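
The abstract above names mRMR ranking followed by incremental feature selection (IFS) but gives no implementation details. The following is a minimal sketch of how those two steps are commonly chained, not the authors' code. It assumes that relevance and pairwise redundancy values (typically mutual information) have already been computed; the feature names, the numbers, and the `toy_accuracy` scorer are hypothetical placeholders. In the paper's setting the scorer would be the prediction accuracy of a random forest trained on the chosen features.

```python
from typing import Callable, Dict, List, Sequence, Tuple


def mrmr_rank(relevance: Dict[str, float],
              redundancy: Dict[Tuple[str, str], float]) -> List[str]:
    """Greedy mRMR ordering: repeatedly pick the feature that maximises
    relevance minus mean redundancy with the features already picked."""
    def red(a: str, b: str) -> float:
        # Redundancy is symmetric; accept either key order, default to 0.
        return redundancy.get((a, b), redundancy.get((b, a), 0.0))

    remaining = set(relevance)
    ranked: List[str] = []
    while remaining:
        def score(f: str) -> float:
            if not ranked:
                return relevance[f]
            return relevance[f] - sum(red(f, s) for s in ranked) / len(ranked)

        best = max(sorted(remaining), key=score)  # sorted() makes ties deterministic
        ranked.append(best)
        remaining.remove(best)
    return ranked


def incremental_feature_selection(
        ranked: Sequence[str],
        evaluate: Callable[[List[str]], float]) -> Tuple[List[str], float]:
    """Score the top-1, top-2, ... prefixes of the ranking and keep the best one."""
    best_subset: List[str] = []
    best_score = float("-inf")
    for k in range(1, len(ranked) + 1):
        subset = list(ranked[:k])
        s = evaluate(subset)
        if s > best_score:
            best_subset, best_score = subset, s
    return best_subset, best_score


# Hypothetical toy inputs, for illustration only.
relevance = {"conservation": 0.9, "conservation_copy": 0.85,
             "disorder": 0.8, "charge": 0.3}
redundancy = {("conservation", "conservation_copy"): 0.8,
              ("conservation", "disorder"): 0.1,
              ("conservation", "charge"): 0.1,
              ("disorder", "charge"): 0.2,
              ("conservation_copy", "disorder"): 0.1,
              ("conservation_copy", "charge"): 0.1}


def toy_accuracy(subset: List[str]) -> float:
    # Stand-in for cross-validated classifier accuracy (an assumption):
    # rewards two "useful" features and slightly penalises subset size.
    useful = {"conservation", "disorder"}
    return len(useful & set(subset)) - 0.05 * len(subset)


ranked = mrmr_rank(relevance, redundancy)
subset, score = incremental_feature_selection(ranked, toy_accuracy)
print(ranked, subset, score)
```

In this toy run the near-duplicate feature is pushed down the ranking by its redundancy with the first pick, and IFS stops growing the subset once extra features no longer improve the score.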

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

FigShare

DomSVR: Domain boundary prediction with support vector regression from sequence information alone

Author: Burge L
Chen P
Gloster C
Li J
Liu C
Mohammad M
Southerland W
Wang B
Publication venue: Springer Science and Business Media LLC
Publication date: 01/08/2010

Protein domains are structural and fundamental functional units of proteins. Information about protein domain boundaries is helpful in understanding the evolution, structures and functions of proteins, and also plays an important role in protein classification. In this paper, we propose a support vector regression-based method to address the problem of protein domain boundary identification based on novel input profiles extracted from the AA-index database. Our method achieves an average sensitivity of ∼36.5% and an average specificity of ∼81% for multi-domain protein chains, which is overall better than the performance of published approaches for identifying domain boundaries. Because our method uses sequence information alone, it is simpler and faster. © Springer-Verlag 2010

OPUS - University of Technology Sydney

PubMed Central

Accurate Demarcation of Protein Domain Linkers Based on Structural Analysis of Linker Probable Region

Author: Hulgeriy Arvind
Samant Vivekanand V.
Tendulkar Ashish V.
Valencia Alfonso
Publication venue: International Journal for Computational Biology (IJCB)
Publication date: 21/04/2014

In multi-domain proteins, the domains are connected by a flexible unstructured region called a protein domain linker. The accurate demarcation of these linkers holds a key to understanding their biochemical and evolutionary attributes. This knowledge helps in designing a suitable linker for engineering stable multi-domain chimeric proteins. Here we propose a novel method for the demarcation of the linker based on a three-dimensional protein structure and a domain definition. The proposed method is based on biological knowledge about the structural flexibility of linkers. We performed structural analysis on a linker probable region (LPR) around domain boundary points of known SCOP domains. The LPR was described using a set of overlapping peptide fragments of fixed size. Each peptide fragment was then described by geometric invariants (GIs) and subjected to a clustering process in which the fragments corresponding to the actual linker come up as outliers. We then discover the actual linkers by finding the longest continuous stretch of outlier fragments from LPRs. This method was evaluated on a benchmark dataset of 51 continuous multi-domain proteins, where it achieves an F1 score of 0.745 (0.83 precision and 0.66 recall). When the method was applied to 725 continuous multi-domain proteins, it was able to identify novel linkers that were not reported previously. This method can be used in combination with supervised/sequence-based linker prediction methods for accurate linker demarcation.

International Journal for Computational Biology (IJCB)

Protein Domain Linker Prediction: A Direction for Detecting Protein – Protein Interactions

Author: Hasan Shatnawi Maad Mohammad
Publication venue: Scholarworks@UAEU
Publication date: 01/06/2015

Protein chains are generally long and consist of multiple domains. Domains are the basic elements of protein structures that can exist, evolve and function independently. The accurate and reliable identification of protein domains and their interactions has very important impacts in several protein research areas. The accurate prediction of protein domains is a fundamental stage in both experimental and computational proteomics. Knowledge of domains is an initial stage of protein tertiary structure prediction, which can give insight into the way in which a protein works. It is also useful in classifying proteins, understanding their structures, functions and evolution, and predicting protein-protein interactions (PPI). However, predicting structural domains within proteins is a challenging task in computational biology. A promising direction for domain prediction is detecting inter-domain linkers and then predicting, accordingly, the regions of the protein sequence in which the structural domains are located. Protein-protein interactions occur at almost every level of cell function. The identification of interactions among proteins and their associated domains provides a global picture of cellular functions and biological processes. It is also an essential step in the construction of PPI networks for human and other organisms. PPI prediction has been considered a promising alternative to traditional drug design techniques. The identification of possible viral-host protein interactions can lead to a better understanding of infection mechanisms and, in turn, to the development of several medication drugs and treatment optimization. In this work, a compact and accurate approach for inter-domain linker prediction is developed based solely on protein primary structure information. Then, inter-domain linker knowledge is used in predicting structural domains and detecting PPI.
The research work in this dissertation can be summarized in three main contributions. The first contribution is predicting protein inter-domain linker regions by introducing the concept of an amino acid compositional index and refining the prediction using the Simulated Annealing optimization technique. The second contribution is identifying structural domains based on inter-domain linker knowledge. The inter-domain linker knowledge, represented by the compositional index, is enhanced by the incorporation of biological knowledge, represented by amino acid physicochemical properties, in order to develop a well-optimized Random Forest classifier for predicting novel domains and inter-domain linkers. In the third contribution, the domain information is utilized to predict protein-protein interactions. This is achieved by characterizing structural domains within protein sequences, analyzing their interactions, and predicting protein interactions based on their interacting domains. The experimental studies and the higher accuracy achieved are a valid argument in favor of the proposed framework.

United Arab Emirates University: Scholarworks@UAEU / جامعة الامارات

Ab initio methods for protein structure prediction

Author: Dousis Athanasios Dimitri
Publication venue
Publication date: 01/01/2010

Recent breakthroughs in DNA and protein sequencing have unlocked many secrets of molecular biology. A complete understanding of gene function, however, requires a protein structure in addition to its sequence. Modern protein structure determination methods such as NMR, cryo-EM and X-ray crystallography are woefully unable to keep pace with automated sequencing techniques, creating a serious gap between available sequences and structures. This thesis describes several ab initio computational methods designed, in the near term, to facilitate structure determination experiments and, in the long term, to predict protein structure completely and reliably. First, VecFold is a novel method for predicting the global tertiary structure topologies of proteins. VecFold applies fragment assembly to construct structural models from a target sequence by folding a chain of predicted secondary structure elements; these elements are represented either as Calpha-based rigid bodies or as vectors. The knowledge-based energy function OPUS-Ca or a knowledge-based geometric packing potential is used to guide the folding process. The newest version of VecFold is demonstrated to modestly outperform Rosetta, one of the leading ab initio predictors, on the CASP8 benchmark set. In our protein domain boundary prediction method OPUS-Dom, VecFold generates a large ensemble of folded structure models, and the domain boundaries of each model are labeled by a domain parsing algorithm. OPUS-Dom then derives consensus domain boundaries from the statistical distribution of the putative boundaries; the original version is also aided by three empirical sequence-based domain profiles. The latest version of OPUS-Dom outperformed, in terms of prediction sensitivity, several state-of-the-art domain prediction algorithms over various multi-domain protein sets.
Even though many VecFold-generated structures contain large errors, collectively these structures provide a more robust delineation of domain boundaries. The success of OPUS-Dom suggests that the arrangement of protein domains is more a consequence of limited coordination patterns per domain arising from tertiary packing of secondary structure segments, rather than sequence-specific constraints. Finally, the knowledge-based energy function OPUS-Core was applied to the problem of protein folding core prediction, and it was shown to outpredict two leading computational methods on a benchmark set of 29 well-characterized protein targets

DSpace at Rice University

Folding by Numbers: Primary Sequence Statistics and Their Use in Studying Protein Folding

Author: Andrew
Anfinsen
Aurora
Bacardit
Bang
Brent Wathen
Broome
Bu
Bédard
Chan
Chiti
Chou
Chou
Cohen
Colloc’h
Cootes
Costantini
Crasto
Daffner
Dasgupta
de Brevern
Dill
Dill
Dill
Doig
Doig
Dong
Dunker
Dunker
Eaton
Edgar
Englander
Ermolenko
Etchebest
Fernández-Recio
Fersht
Fetrow
Fink
Fonseca
Fooks
Galzitskaya
Gruebele
Gu
Gunasekaran
Guruprasad
Hutchinson
Hutchinson
Jiménez
Jones
Kabsch
Kapp
Karplus
Kauzmann
Klingler
Krantz
Kryshtafovych
Levinthal
Levitt
Lifson
Lifson
Liu
Luo
Mahalanobis
Maity
Mandel-Gutfreund
Marqusee
Miyazaki
Murphy
Muñoz
Nakashima
Noguchi
Onuchic
Pal
Penel
Penel
Presta
Richardson
Rigden
Romero
Rose
Rossmann
Sagermann
Santiveri
Schueler-Furman
Schwartz
Schwartz
Serrano
Serrano
Shannon
Shortle
Strait
Suyama
Swanson
Unger
Uversky
Vazquez
Ventura
Viguera
Vincent
von Heijne
Walther
Wang
Wang
Weiss
West
Wetlaufer
Wheelan
White
Wilmot
Wilson
Wolynes
Wouters
Wright
Xiong
Ye
Yon
Yoo
Zhu
Zongchao Jia
Publication venue: Molecular Diversity Preservation International (MDPI)
Publication date: 01/04/2009

The exponential growth over the past several decades in the quantity of both primary sequence data available and the number of protein structures determined has provided a wealth of information describing the relationship between protein primary sequence and tertiary structure. This growing repository of data has served as a prime source for statistical analysis, where underlying relationships between patterns of amino acids and protein structure can be uncovered. Here, we survey the main statistical approaches that have been used for identifying patterns within protein sequences, and discuss sequence pattern research as it relates to both secondary and tertiary protein structure. Limitations to statistical analyses are discussed, and a context for their role within the field of protein folding is given. We conclude by describing a novel statistical study of residue patterning in β-strands, which finds that hydrophobic (i,i+2) pairing in β-strands occurs more often than expected at locations near strand termini. Interpretations involving β-sheet nucleation and growth are discussed

Crossref

Directory of Open Access Journals

PubMed Central

A Combination of Compositional Index and Genetic Algorithm for Predicting Transmembrane Helical Segments

Author: A Krogh
A Thomas
B Rost
E Falkenauer
E Wallin
EL Sonnhammer
F Tekaia
G Tusnady
G von Heijne
GE Tusnady
H Berman
H Shen
H Zhou
J Holland
J Pylouster
JM Cuthbertson
L Kall
M Cserzo
M Suyama
MG Claros
Nazar Zaki
Pierandrea Temussi
R Garey
RY Kahsay
S Hosseini
S Jayasinghe
S Roy
Salah Bouktif
Sanja Lazarova-Molnar
T Hirokawa
T Nugent
T Taylor
Publication venue: Public Library of Science
Publication date: 01/01/2011

Transmembrane helix (TMH) topology prediction is becoming a focal problem in bioinformatics because the structure of TM proteins is difficult to determine using experimental methods. Therefore, methods that can computationally predict the topology of helical membrane proteins are highly desirable. In this paper we introduce TMHindex, a method for detecting TMH segments using only the amino acid sequence information. Each amino acid in a protein sequence is represented by a Compositional Index, which is deduced from a combination of the difference in amino acid occurrences in TMH and non-TMH segments in training protein sequences and the amino acid composition information. Furthermore, a genetic algorithm was employed to find the optimal threshold value for the separation of TMH segments from non-TMH segments. The method successfully predicted 376 out of the 378 TMH segments in a dataset consisting of 70 test protein sequences. The sensitivity and specificity for classifying each amino acid in every protein sequence in the dataset was 0.901 and 0.865, respectively. To assess the generality of TMHindex, we also tested the approach on another standard 73-protein 3D helix dataset. TMHindex correctly predicted 91.8% of proteins based on TM segments. The level of the accuracy achieved using TMHindex in comparison to other recent approaches for predicting the topology of TM proteins is a strong argument in favor of our proposed method. Availability: The datasets, software together with supplementary materials are available at: http://faculty.uaeu.ac.ae/nzaki/TMHindex.htm
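
The abstract above describes scoring each residue with a compositional index and then choosing a threshold that separates TMH from non-TMH residues, but it does not give the formula. The sketch below is an illustration under stated assumptions, not the TMHindex implementation: the index is assumed to be a smoothed log-ratio of residue frequencies in the two segment classes, and an exhaustive scan over candidate thresholds stands in for the genetic algorithm used in the paper. All function names and the toy sequences are hypothetical.

```python
import math
from collections import Counter
from typing import Dict, List, Sequence, Tuple


def compositional_index(tmh_segments: Sequence[str],
                        non_tmh_segments: Sequence[str],
                        pseudocount: float = 1.0) -> Dict[str, float]:
    """Assumed form: log of (frequency in TMH) / (frequency in non-TMH),
    with a pseudocount so unseen residues do not cause division by zero."""
    tmh = Counter("".join(tmh_segments))
    non = Counter("".join(non_tmh_segments))
    alphabet = set(tmh) | set(non)
    tmh_total = sum(tmh.values()) + pseudocount * len(alphabet)
    non_total = sum(non.values()) + pseudocount * len(alphabet)
    return {aa: math.log(((tmh[aa] + pseudocount) / tmh_total) /
                         ((non[aa] + pseudocount) / non_total))
            for aa in alphabet}


def residue_scores(sequence: str, index: Dict[str, float]) -> List[float]:
    """Replace each residue by its index value (0.0 for unseen residues)."""
    return [index.get(aa, 0.0) for aa in sequence]


def sensitivity_specificity(predicted: Sequence[bool],
                            actual: Sequence[bool]) -> Tuple[float, float]:
    """Per-residue sensitivity TP/(TP+FN) and specificity TN/(TN+FP)."""
    tp = sum(p and a for p, a in zip(predicted, actual))
    tn = sum((not p) and (not a) for p, a in zip(predicted, actual))
    fp = sum(p and (not a) for p, a in zip(predicted, actual))
    fn = sum((not p) and a for p, a in zip(predicted, actual))
    sens = tp / (tp + fn) if tp + fn else 0.0
    spec = tn / (tn + fp) if tn + fp else 0.0
    return sens, spec


def best_threshold(scores: Sequence[float], actual: Sequence[bool],
                   candidates: Sequence[float]) -> float:
    """Exhaustive scan (a stand-in for the genetic algorithm): pick the
    candidate that maximises sensitivity + specificity."""
    return max(candidates,
               key=lambda t: sum(sensitivity_specificity(
                   [s > t for s in scores], actual)))


# Hypothetical toy data, for illustration only.
index = compositional_index(["LLLLIIII"], ["KKKKDDDD"])
scores = residue_scores("KKLLLLKK", index)
actual = [False, False, True, True, True, True, False, False]
threshold = best_threshold(scores, actual, [-2.0, 0.0, 2.0])
print(threshold, sensitivity_specificity([s > threshold for s in scores], actual))
```

A search heuristic such as a genetic algorithm becomes useful when the threshold is tuned jointly with other parameters; for a single scalar threshold a plain scan like this one is enough to show the idea.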

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central

University of Southern Denmark Research Output

Exploiting residue-level and profile-level interface propensities for usage in binding sites prediction of proteins

Author: A Dubey
A Koike
A Rossi
AH Liu
AJ Bordner
AR Panchenko
AT Laurie
B Pils
B Thibert
B Wang
B Wilczynski
C Sander
C Yan
C Zhang
CC Chang
D La
DH Morgan
F Osterberg
G Cheng
H Chen
H Deng
H Neuvirth
H Yao
HX Zhou
I Res
I Xenarios
IM Nooren
J Meiler
JL Chung
JR Bradford
JW Torrance
K Henrick
KA Snyder
L Lo Conte
Lei Lin
MH Li
O Lichtarge
P Chakrabarti
Q Dong
Qiwen Dong
Qw Dong
QW Dong
S Jones
S Karlin
S Liang
SF Altschul
T Down
TJ Magliery
V Chelliah
VN Vapnik
W Kabsch
WS Valdar
Xiaolong Wang
Y Kim
Y Ofran
Yi Guan
Z Zhang
Publication venue: BioMed Central
Publication date: 01/05/2007

Background: Recognition of binding sites in proteins is a direct computational approach to the characterization of proteins in terms of biological and biochemical function. Residue preferences have been widely used in many studies but the results are often not satisfactory. Although different amino acid compositions among the interaction sites of different complexes have been observed, such differences have not been integrated into the prediction process. Furthermore, evolutionary information has not been exploited to achieve a more powerful propensity. Results: In this study, the residue interface propensities of four kinds of complexes (homo-permanent complexes, homo-transient complexes, hetero-permanent complexes and hetero-transient complexes) are investigated. These propensities, combined with sequence profiles and accessible surface areas, are input to a support vector machine for the prediction of protein binding sites. Such propensities are further improved by taking evolutionary information into consideration, which results in a class of novel propensities at the profile level, i.e. the binary profile interface propensities. Experiments are performed on 1139 non-redundant protein chains. Although different residue interface propensities among different complexes are observed, the improvement of the classifier with residue interface propensities is negligible in comparison with that without propensities. The binary profile interface propensities can significantly improve the performance of binding site prediction, by about ten percent in terms of both precision and recall. Conclusion: Although there are minor differences among the four kinds of complexes, the residue interface propensities cannot provide efficient discrimination for the complicated interfaces of proteins.
The binary profile interface propensities can significantly improve the performance of protein binding site prediction, which indicates that the propensities at the profile level are more accurate than those at the residue level.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central