Abstract

Background: The BioCreative text mining evaluation investigated the application of text mining methods to the task of automatically extracting information from text in biomedical research articles. We participated in Task 2 of the evaluation. For this task, we built a system to automatically annotate a given protein with codes from the Gene Ontology (GO) using the text of an article from the biomedical literature as evidence.

Methods: Our system relies on simple statistical analyses of the full text article provided. We learn n-gram models for each GO code using statistical methods and use these models to hypothesize annotations. We also learn a set of Naïve Bayes models that identify textual clues of possible connections between the given protein and a hypothesized annotation. These models are used to filter and rank the predictions of the n-gram models.

Results: We report experiments evaluating the utility of various components of our system on a set of data held out during development, and experiments evaluating the utility of external data sources that we used to learn our models. Finally, we report our evaluation results from the BioCreative organizers.

Conclusion: We observe that, on the test data, our system performs quite well relative to the other systems submitted to the evaluation. From other experiments on the held-out data, we observe that (i) the Naïve Bayes models were effective in filtering and ranking the initially hypothesized annotations, and (ii) our learned models were significantly more accurate when external data sources were used during learning.

Craven, Mark

Ray, Soumya

English

PubMed

Springer - Publisher Connector

Learning Statistical Models for Annotating Proteins with Function Information using Biomedical Text

Soumya Ray

Mark Craven

Crossref

Directory of Open Access Journals

BMC Bioinformatics

https://bmcbioinformatics.biomedcentral.com/track/pdf/10.1186/1471-2105-6-S1-S18?site=bmcbioinformatics.biomedcentral.com
