Gene Function Classification Using Bayesian Models with Hierarchy-Based Priors

Author: A Clare, A McCallum, AS Weigend, B Rost, B Schoikowski, B Shahbaba, Babak Shahbaba, BE Engelhardt, D Koller, EM Marcotte, FR Blattner, H Blockeel, I Tsochantaridis, IUBMB, J DeRisi, J Fox, J Goodman, J Struyf, J Zhang, JA Eisen, JR Guest, K Sjölander, L Cai, L Dehaspe, M Brown, M Deng, M Eisen, M Riley, N Cesa-Bianchi, O Dekel, P Pavlidis, R Caruana, R Eisner, Radford M Neal, RD King, RM Neal, S Rison, S Sattath, S Spiro, SF Altschul, ST Dumais, WR Pearson, Z Barutcuoglu
Publication date: 01/01/2006

We investigate the application of hierarchical classification schemes to the annotation of gene function based on several characteristics of protein sequences, including phylogenetic descriptors, sequence-based attributes, and predicted secondary structure. We discuss three Bayesian models and compare their performance in terms of predictive accuracy. These models are the ordinary multinomial logit (MNL) model, a hierarchical model based on a set of nested MNL models, and an MNL model with a prior that introduces correlations between the parameters for classes that are nearby in the hierarchy. We also provide a new scheme for combining different sources of information. We use these models to predict the functional class of Open Reading Frames (ORFs) from the E. coli genome. The results from all three models show substantial improvement over previous methods, which were based on the C5 algorithm. The MNL model using a prior based on the hierarchy outperforms both the non-hierarchical MNL model and the nested MNL model. In contrast to previous attempts at combining these sources of information, our approach results in a higher accuracy rate when compared to models that use each data source alone. Together, these results show that gene function can be predicted with higher accuracy than previously achieved, using Bayesian models that incorporate suitable prior information.
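
As a rough illustration of the idea in this abstract, the sketch below computes MNL class probabilities when each class's coefficient vector is the sum of coefficients attached to the branches on its path through a hierarchy. Sharing branch coefficients is one way that parameters for nearby classes can become correlated; this construction, the toy hierarchy, the names `PATHS` and `BRANCH_WEIGHTS`, and all numeric values are assumptions made for illustration and are not taken from the abstract.

```python
import math

# Hypothetical hierarchy: each class is reached through a path of branches.
PATHS = {
    "class_1": ["branch_A", "branch_A1"],
    "class_2": ["branch_A", "branch_A2"],
    "class_3": ["branch_B"],
}

# Made-up branch-level coefficients, one weight per input feature.
BRANCH_WEIGHTS = {
    "branch_A": [1.0, -0.5],
    "branch_A1": [0.2, 0.1],
    "branch_A2": [-0.2, 0.3],
    "branch_B": [-1.0, 0.5],
}


def class_weights(paths, branch_weights):
    """Sum the branch coefficients along each class's path."""
    weights = {}
    for cls, path in paths.items():
        n_features = len(branch_weights[path[0]])
        weights[cls] = [
            sum(branch_weights[branch][k] for branch in path)
            for k in range(n_features)
        ]
    return weights


def mnl_probabilities(x, weights):
    """Multinomial logit (softmax) probabilities for feature vector x."""
    scores = {
        cls: sum(w * xi for w, xi in zip(ws, x)) for cls, ws in weights.items()
    }
    top = max(scores.values())  # subtract the max for numerical stability
    exps = {cls: math.exp(s - top) for cls, s in scores.items()}
    total = sum(exps.values())
    return {cls: e / total for cls, e in exps.items()}


weights = class_weights(PATHS, BRANCH_WEIGHTS)
probs = mnl_probabilities([1.0, 2.0], weights)
```

In this toy setup, `class_1` and `class_2` share `branch_A`, so any change to that branch's coefficients moves both classes together, while `class_3` is unaffected.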

arXiv.org e-Print Archive

University of Toronto Research Repository

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

eScholarship - University of California

A probabilistic framework to predict protein function from interaction data integrated with semantic knowledge

Author: A Ruepp, A Valencia, A Vazquez, Aidong Zhang, B Schwikowski, B-J Breitkreutz, DS Goldberg, E Nabieva, E Sprinzak, EM Marcotte, H Hishigaki, H Lee, HN Chua, HW Mewes, I Friedberg, International Human Genome Sequencing Consortium, JBL Bard, JR Parrish, JZ Wang, L Salwinski, Lei Shi, M Deng, M Kirac, M Pellegrini, MB Eisen, Murali Ramanathan, P Resnik, PW Lord, R Aebersold, R Overbeek, SF Altschul, The Gene Ontology Consortium, U Karaoz, WR Pearson, X Guo, X Wu, Y Tao, Y-R Cho, Young-Rae Cho
Publication venue: BioMed Central
Publication date: 01/01/2008

Background: The functional characterization of newly discovered proteins has been a challenge in the post-genomic era. Protein-protein interactions provide insights for functional analysis because the function of an unknown protein can be postulated from its interactions with known proteins. Protein-protein interaction data sets have been enriched by high-throughput experimental methods. However, functional analysis using interaction data is limited in accuracy because the data contain experimentally generated false positives and interactions that lack functional linkage. Results: Protein-protein interaction data can be integrated with the functional knowledge in the Gene Ontology (GO) database. We apply similarity measures to assess the functional similarity between interacting proteins. We present a probabilistic framework for predicting the functions of unknown proteins based on this functional similarity. We use leave-one-out cross-validation to compare performance. The experimental results demonstrate that our algorithm performs better than other competing methods in terms of prediction accuracy. In particular, it handles the high false positive rates of current interaction data well. Conclusion: Experimentally determined protein-protein interactions alone are too error-prone to uncover the functional associations among proteins. The performance of function prediction for uncharacterized proteins can be enhanced by integrating the multiple data sources available.
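
The evaluation protocol mentioned in this abstract can be sketched as follows: hide one protein's function at a time, predict it from its interaction partners, and count how often the prediction is right. The predictor here is a simple similarity-weighted vote, which is a stand-in and not the paper's probabilistic framework. The protein names, function labels, and edge weights are all hypothetical, and the weights are treated as fixed inputs for simplicity.

```python
from collections import defaultdict

# Hypothetical toy annotations: protein -> function label.
FUNCTION = {
    "p1": "kinase",
    "p2": "kinase",
    "p3": "transport",
    "p4": "transport",
    "p5": "kinase",
}

# Hypothetical interactions with a made-up similarity weight in [0, 1].
EDGES = [
    ("p1", "p2", 0.9),
    ("p1", "p3", 0.2),
    ("p2", "p5", 0.8),
    ("p3", "p4", 0.7),
    ("p4", "p5", 0.1),
]


def build_adjacency(edges):
    """Map each protein to its (partner, weight) pairs."""
    adjacency = defaultdict(list)
    for a, b, weight in edges:
        adjacency[a].append((b, weight))
        adjacency[b].append((a, weight))
    return adjacency


def predict(protein, adjacency, known):
    """Pick the function with the largest total weight among annotated partners."""
    votes = defaultdict(float)
    for partner, weight in adjacency[protein]:
        if partner in known:
            votes[known[partner]] += weight
    return max(votes, key=votes.get) if votes else None


def leave_one_out_accuracy(functions, edges):
    """Hide each protein's label in turn and score the prediction."""
    adjacency = build_adjacency(edges)
    correct = 0
    for held_out, truth in functions.items():
        known = {p: f for p, f in functions.items() if p != held_out}
        if predict(held_out, adjacency, known) == truth:
            correct += 1
    return correct / len(functions)


accuracy = leave_one_out_accuracy(FUNCTION, EDGES)
```

Low-weight edges such as `("p1", "p3", 0.2)` contribute little to the vote, which mirrors the abstract's point that weighting interactions by functional similarity can dampen the effect of unreliable interactions.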

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Annotation Error in Public Databases: Misannotation of Molecular Function in Enzyme Superfamilies

Due to the rapid release of new data from genome sequencing projects, the majority of protein sequences in public databases have not been experimentally characterized; rather, sequences are annotated using computational analysis. The level of misannotation and the types of misannotation in large public databases are currently unknown and have not been analyzed in depth. We have investigated the misannotation levels for molecular function in four public protein sequence databases (UniProtKB/Swiss-Prot, GenBank NR, UniProtKB/TrEMBL, and KEGG) for a model set of 37 enzyme families for which extensive experimental information is available. The manually curated database Swiss-Prot shows the lowest annotation error levels (close to 0% for most families); the two other protein sequence databases (GenBank NR and TrEMBL) and the protein sequences in the KEGG pathways database exhibit similar and surprisingly high levels of misannotation that average 5%–63% across the six superfamilies studied. For 10 of the 37 families examined, the level of misannotation in one or more of these databases is >80%. Examination of the NR database over time shows that misannotation has increased from 1993 to 2005. The types of misannotation that were found fall into several categories, most associated with “overprediction” of molecular function. These results suggest that misannotation in enzyme superfamilies containing multiple families that catalyze different reactions is a larger problem than has been recognized. Strategies are suggested for addressing some of the systematic problems contributing to these high levels of misannotation.

CiteSeerX

Public Library of Science (PLOS)

Crossref

Directory of Open Access Journals

PubMed Central