Wageningen University & Research Publications

Improving protein function prediction methods with integrated literature data

Author: A Karimpour-Fard
A Vazquez
A Vinayagam
Aaron P Gabow
AK Ramani
B Schwikowski
BTF Alako
C Brun
C von Mering
Debra S Goldberg
E Nabieva
HW Mewes
I Xenarios
J Rual
K Tsuda
L Hunter
L Hunter
L Tanabe
Lawrence E Hunter
M Ashburner
M Aubry
M Chagoyen
M Huynen
M Krallinger
M Krallinger
M Pelligri
M Yetisgen-Yildiz
OG Troyanskaya
P Srinivasan
PM Bowers
R Cilibrasi
R Hoffmann
S Letovsky
S Raychaudhuri
Sonia M Leach
T Schlitt
T Tanabe
TK Jenssen
U Karaoz
William A Baumgartner
Y Ofran
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: Determining the function of uncharacterized proteins is a major challenge in the post-genomic era due to the problem's complexity and scale. Identifying a protein's function contributes to an understanding of its role in the involved pathways, its suitability as a drug target, and its potential for protein modifications. Several graph-theoretic approaches predict unidentified functions of proteins by using the functional annotations of better-characterized proteins in protein-protein interaction networks. We systematically consider the use of literature co-occurrence data, introduce a new method for quantifying the reliability of co-occurrence and test how performance differs across species. We also quantify changes in performance as the prediction algorithms annotate with increased specificity. Results: We find that including information on the co-occurrence of proteins within an abstract greatly boosts performance in the Functional Flow graph-theoretic function prediction algorithm in yeast, fly and worm. This increase in performance is not simply due to the presence of additional edges since supplementing protein-protein interactions with co-occurrence data outperforms supplementing with a comparably-sized genetic interaction dataset. Through the combination of protein-protein interactions and co-occurrence data, the neighborhood around unknown proteins is quickly connected to well-characterized nodes which global prediction algorithms can exploit. Our method for quantifying co-occurrence reliability shows superior performance to the other methods, particularly at threshold values around 10% which yield the best trade-off between coverage and accuracy. In contrast, the traditional way of asserting co-occurrence when at least one abstract mentions both proteins proves to be the worst method for generating co-occurrence data, introducing too many false positives. Annotating the functions with greater specificity is harder, but co-occurrence data still proves beneficial. Conclusion: Co-occurrence data is a valuable supplemental source for graph-theoretic function prediction algorithms. A rapidly growing literature corpus ensures that co-occurrence data is a readily-available resource for nearly every studied organism, particularly those with small protein interaction databases. Though arguably biased toward known genes, co-occurrence data provides critical additional links to well-studied regions in the interaction network that graph-theoretic function prediction algorithms can exploit.
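
The contrast drawn in this abstract, between asserting co-occurrence whenever one abstract mentions both proteins and keeping only pairs whose co-occurrence passes a reliability threshold, can be illustrated with a minimal sketch. The scoring function below (the Jaccard index of the two proteins' abstract sets), the function name `cooccurrence_edges`, the protein names and the threshold values are all assumptions made for illustration. The abstract does not specify the authors' reliability measure, and the threshold here is not claimed to correspond to their 10% value.

```python
from collections import defaultdict
from itertools import combinations


def cooccurrence_edges(abstracts, threshold=0.0):
    """Return {(protein_a, protein_b): score} for co-occurring protein pairs.

    abstracts: list of sets, each holding the protein names found in one abstract.
    The score is the Jaccard index of the two proteins' abstract sets. This is an
    illustrative stand-in, NOT the reliability measure introduced in the paper.
    With threshold=0.0 the function reproduces the traditional rule: an edge is
    asserted as soon as at least one abstract mentions both proteins.
    """
    mentions = defaultdict(set)  # protein -> indices of abstracts mentioning it
    for idx, proteins in enumerate(abstracts):
        for protein in proteins:
            mentions[protein].add(idx)

    edges = {}
    for a, b in combinations(sorted(mentions), 2):
        shared = mentions[a] & mentions[b]
        if not shared:
            continue  # the pair never co-occurs
        score = len(shared) / len(mentions[a] | mentions[b])
        if score >= threshold:
            edges[(a, b)] = score
    return edges


# Hypothetical toy corpus: P1 and P2 co-occur often, P1 and P3 only once.
abstracts = [
    {"P1", "P2"}, {"P1", "P2"}, {"P1", "P3"},
    {"P3"}, {"P3"}, {"P3"}, {"P1"},
]
traditional = cooccurrence_edges(abstracts, threshold=0.0)
filtered = cooccurrence_edges(abstracts, threshold=0.25)
print(sorted(traditional))  # both pairs are kept
print(sorted(filtered))     # only the frequently co-occurring pair survives
```

In this toy corpus the single shared abstract between P1 and P3 is enough to create an edge under the traditional rule, while the thresholded variant drops it. This is the kind of weakly supported link the abstract describes as a source of false positives.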

A probabilistic framework to predict protein function from interaction data integrated with semantic knowledge

Author: A Ruepp
A Valencia
A Vazquez
Aidong Zhang
B Schwikowski
B-J Breitkreutz
DS Goldberg
E Nabieva
E Sprinzak
EM Marcotte
H Hishigaki
H Lee
HN Chua
HW Mewes
I Friedberg
International Human Genome Sequencing Consortium
JBL Bard
JR Parrish
JZ Wang
L Salwinski
Lei Shi
M Deng
M Kirac
M Pellegrini
MB Eisen
Murali Ramanathan
P Resnik
PW Lord
R Aebersold
R Overbeek
SF Altschul
The Gene Ontology Consortium
U Karaoz
WR Pearson
X Guo
X Wu
Y Tao
Y-R Cho
Young-Rae Cho
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The functional characterization of newly discovered proteins has been a challenge in the post-genomic era. Protein-protein interactions provide insights for functional analysis because the function of an unknown protein can be postulated from its interaction evidence with known proteins. Protein-protein interaction data sets have been enriched by high-throughput experimental methods. However, functional analysis using the interaction data is limited in accuracy because of experimentally generated false positives and interactions that lack functional linkage. Results: Protein-protein interaction data can be integrated with the functional knowledge existing in the Gene Ontology (GO) database. We apply similarity measures to assess the functional similarity between interacting proteins. We present a probabilistic framework for predicting functions of unknown proteins based on the functional similarity. We use leave-one-out cross validation to compare the performance. The experimental results demonstrate that our algorithm performs better than other competing methods in terms of prediction accuracy. In particular, it handles the high false positive rates of current interaction data well. Conclusion: Experimentally determined protein-protein interactions are too error-prone to reliably uncover the functional associations among proteins. The performance of function prediction for uncharacterized proteins can be enhanced by integrating the multiple data sources available.
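
The leave-one-out evaluation mentioned in this abstract can be sketched as follows. The predictor used here is a plain majority vote over annotated interaction partners, a common baseline, and is not the probabilistic framework described by the authors. The function names, the toy network and the single-label annotations are hypothetical simplifications (real GO annotation assigns several terms per protein).

```python
from collections import Counter


def predict_function(protein, network, annotations):
    """Return the most common function among the protein's annotated neighbours."""
    votes = Counter(
        annotations[neighbour]
        for neighbour in network.get(protein, ())
        if neighbour in annotations and neighbour != protein
    )
    return votes.most_common(1)[0][0] if votes else None


def leave_one_out_accuracy(network, annotations):
    """Hide each protein's annotation in turn and try to recover it."""
    correct = 0
    for protein, true_function in annotations.items():
        # Remove the held-out protein's label so it cannot inform the prediction.
        visible = {p: f for p, f in annotations.items() if p != protein}
        if predict_function(protein, network, visible) == true_function:
            correct += 1
    return correct / len(annotations)


# Hypothetical interaction network (adjacency lists) and one label per protein.
network = {
    "P1": ["P2", "P3"],
    "P2": ["P1", "P3"],
    "P3": ["P1", "P2", "P4"],
    "P4": ["P3"],
}
annotations = {
    "P1": "transport",
    "P2": "transport",
    "P3": "transport",
    "P4": "signaling",
}
print(leave_one_out_accuracy(network, annotations))
```

Here P4 is mispredicted because its only neighbour carries a different label, so three of the four held-out annotations are recovered.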

Public Library of Science (PLOS)

Protein Function Assignment through Mining Cross-Species Protein-Protein Interactions

Author: A Bateman
A Schlicker
A Schlicker
A Vazquez
A Zanzoni
AJ Enright
B Schwikowski
C Brun
C Stark
EM Marcotte
EM Marcotte
F Ramirez
GD Bader
H Hishigaki
H Holzl
H Lee
HW Jacobs
HW Mewes
J McDermott
J Wojcik
JB Pereira-Leal
JB Pereira-Leal
JZ Wang
K Tschop
KP O'Brien
L Salwinski
M Ashburner
M Deng
M Deng
M Pellegrini
MA Crosby
MC Costanzo
Mei Liu
MO Lee
MP Brown
MY Galperin
N Nariai
OG Troyanskaya
P Gallant
P Resnik
PJ Ellis
R Apweiler
R Kraut
Robert Ward
S Letovsky
S Li
S Peri
SF Altschul
Sudhindra Gadagkar
T Pawson
V Spirin
W Poller
WR Pearson
Xue-wen Chen
XW Chen
Y Chen
Publication venue: Public Library of Science
Publication date: 01/01/2007

Background: As we move into the post genome-sequencing era, an immediate challenge is how to make best use of the large amount of high-throughput experimental data to assign functions to currently uncharacterized proteins. Here we describe CSIDOP, a new method for protein function assignment based on shared interacting domain patterns extracted from cross-species protein-protein interaction data. Methodology/Principal Findings: The proposed method is assessed both biologically and statistically over the genome of H. sapiens. The CSIDOP method is capable of predicting protein function with an accuracy of 95.42% using 2,972 Gene Ontology (GO) functional categories. In addition, we are able to assign novel functional annotations for 181 previously uncharacterized proteins in H. sapiens. Furthermore, we demonstrate that for proteins that are characterized by GO, CSIDOP may predict extra functions. This is attractive as a protein normally executes a variety of functions in different processes and its current GO annotation may be incomplete. Conclusions/Significance: Experimental results show that the CSIDOP method is reliable and practical in use. The method will continue to improve as more high quality interaction data becomes available and is readily scalable t

CiteSeerX

KU ScholarWorks

MEGADOCK 3.0: a high-performance protein-protein interaction prediction software using hybrid parallel computing for petascale supercomputing environments

Author: Masahito Ohue
Nobuyuki Uchikoga
Takashi Ishida
Takehiro Shimoda
Toshiyuki Sato
Yuri Matsuzaki
Yutaka Akiyama
Publication venue: Springer Nature
Publication date: 01/01/2013

BACKGROUND: Protein-protein interaction (PPI) plays a core role in cellular functions. Massively parallel supercomputing systems have been actively developed over the past few years, enabling large-scale biological problems to be solved, such as PPI network prediction based on tertiary structures. RESULTS: We have developed a high-throughput and ultra-fast PPI prediction system based on rigid docking, “MEGADOCK”, which employs a hybrid parallelization (MPI/OpenMP) technique designed for massively parallel supercomputing systems. MEGADOCK achieves significantly faster processing in the rigid-body docking process, which allows full utilization of protein tertiary structural data for large-scale and network-level problems in systems biology. Moreover, the system was shown to be scalable by measurements carried out in two supercomputing environments. We then predicted biological PPI networks using post-docking analysis. CONCLUSIONS: We present a new protein-protein docking engine aimed at exhaustive docking of mega-order numbers of protein pairs. The system was shown to be scalable by running on thousands of nodes. The software package is available at: http://www.bi.cs.titech.ac.jp/megadock/k/

Digital Repository @ Iowa State University (ISU)

Extraction of an Effective Pairwise Potential for Amino Acids

Author: Luo Jie
Publication venue: Iowa State University Digital Repository
Publication date: 01/01/2011

Key to successful protein structure prediction is a potential that recognizes the native state from misfolded structures. In this thesis, we introduce a novel way to extract interaction potential functions between the 20 types of amino acids, using the Modified Hypernetted Chain (MHNC) and Reverse Monte-Carlo (RMC) methods. We extracted Radial Distribution Functions (RDFs) from 996 known protein crystal structures in the Protein Data Bank and used these RDFs to first generate the potential of mean force (PMF) for different pairs of residues. We then improved these PMFs by including the higher-order terms of the Ornstein-Zernike equation through an iteration that starts from the HNC approximation for the pair interaction potential; in each subsequent step, we conducted Monte-Carlo simulations to generate the RDFs for the updated potential. The updated potential in each iteration step can be generated using either the MHNC or the RMC method. These effective pairwise potentials were then summed to obtain the total energy score for known protein structures, and their effectiveness was validated by conducting single and multiple decoy set tests using the 'R' Us decoy set
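
The first step described in this abstract, turning a radial distribution function into a potential of mean force, follows the standard relation W(r) = -k_B T ln g(r). The sketch below shows only that step and none of the MHNC or RMC refinement. The function name, the temperature of 300 K, the energy units (kcal/mol) and the sample RDF values are assumptions chosen for illustration and are not taken from the thesis.

```python
import math

# Boltzmann constant in kcal/(mol*K); the unit choice is an assumption.
K_B = 0.0019872041


def potential_of_mean_force(rdf, temperature=300.0):
    """Convert RDF values g(r) into PMF values W(r) = -k_B * T * ln g(r).

    Bins where g(r) is zero (the pair is never observed at that distance)
    are mapped to +infinity, i.e. an infinitely repulsive potential.
    """
    kt = K_B * temperature
    return [-kt * math.log(g) if g > 0 else math.inf for g in rdf]


# Hypothetical RDF for one residue pair, sampled on increasing distance bins.
rdf = [0.0, 0.4, 1.8, 1.0, 0.9]
for g, w in zip(rdf, potential_of_mean_force(rdf)):
    print(f"g(r) = {g:.1f}  ->  W(r) = {w:.3f} kcal/mol")
```

Distances at which a pair is over-represented relative to the reference (g(r) > 1) receive a negative, favourable energy, while under-represented distances receive a positive one.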

EcID. A database for the inference of functional interactions in E. coli

Author: A. Valencia
Altschul
B. Garcia
Bowers
Chenna
D. Juan
Dandekar
E. Andres Leon
Edgar
Gaasterland
Goh
Hermjakob
Hoffmann
I. Ezkurdia
Ibanez-Ruiz
Kersey
Keseler
Marcotte
Nitschk
Pazos
Pazos
Pellegrini
Peterson
Shannon
Tamames
Publication venue: Oxford University Press
Publication date: 01/01/2009

The EcID database (Escherichia coli Interaction Database) provides a framework for the integration of information on functional interactions extracted from the following sources: EcoCyc (metabolic pathways, protein complexes and regulatory information), KEGG (metabolic pathways), MINT and IntAct (protein interactions). It also includes information on protein complexes from the two E. coli high-throughput pull-down experiments and potential interactions extracted from the literature using the web services associated with the iHOP text-mining system. Additionally, EcID incorporates results of various prediction methods, including two protein interaction prediction methods based on genomic information (Phylogenetic Profiles and Gene Neighbourhoods) and three methods based on the analysis of co-evolution (Mirror Tree, In Silico 2 Hybrid and Context Mirror). EcID assigns each prediction a specifically developed confidence score. The two main features that make EcID different from other systems are the combination of co-evolution-based predictions with the experimental data, and the introduction of E. coli-specific information, such as gene regulation information from EcoCyc. The possibilities offered by the combination of the EcID database information are illustrated with a prediction of potential functions for a group of poorly characterized genes related to yeaG. EcID is available online at http://ecid.bioinfo.cnio.es

Universidad Carlos III de Madrid e-Archivo

Predicting domain-domain interaction based on domain profiles with feature selection and support vector machines

Author: González Alvaro J
Liao Li
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: Protein-protein interaction (PPI) plays essential roles in cellular functions. The cost, time and other limitations associated with the current experimental methods have motivated the development of computational methods for predicting PPIs. As protein interactions generally occur via domains instead of the whole molecules, predicting domain-domain interaction (DDI) is an important step toward PPI prediction. Computational methods developed so far have utilized information from various sources at different levels, from primary sequences, to molecular structures, to evolutionary profiles. Results: In this paper, we propose a computational method to predict DDI using support vector machines (SVMs), based on domains represented as interaction profile hidden Markov models (ipHMM) where interacting residues in domains are explicitly modeled according to the three dimensional structural information available at the Protein Data Bank (PDB). Features about the domains are extracted first as the Fisher scores derived from the ipHMM and then selected using singular value decomposition (SVD). Domain pairs are represented by concatenating their selected feature vectors, and classified by a support vector machine trained on these feature vectors. The method is tested by leave-one-out cross validation experiments with a set of interacting protein pairs adopted from the 3DID database. The prediction accuracy has shown significant improvement as compared to InterPreTS (Interaction Prediction through Tertiary Structure), an existing method for PPI prediction that also uses the sequences and complexes of known 3D structure. Conclusions: We show that domain-domain interaction prediction can be significantly enhanced by exploiting information inherent in the domain profiles via feature selection based on Fisher scores, singular value decomposition and supervised learning based on support vector machines. Datasets and source code are freely available on the web at http://liao.cis.udel.edu/pub/svdsvm. Implemented in Matlab and supported on Linux and MS Windows.