Background: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. We have therefore specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger.

Results: Our results show that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm, when coupled with a named entity tagger, outperforms two of the tools most widely used to extract gene/protein interactions.

Conclusion: In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.
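To make the idea of a keyword-based benchmark and the effect of tagged versus gold standard entity names more concrete, here is a minimal sketch in Python. It is not the authors' algorithm: the keyword list, the function names `predict_pairs` and `f_score`, the placeholder protein names, and the rule "every pair of entities in a sentence containing an interaction keyword is predicted to interact" are all assumptions made for illustration only.

```python
from itertools import combinations

# Hypothetical keyword list (an assumption; the paper's actual list is not given here).
INTERACTION_KEYWORDS = {"binds", "interacts", "activates", "inhibits", "phosphorylates"}


def predict_pairs(sentence, entities):
    """Predict an interaction for every unordered pair of entities found in the
    sentence, but only if the sentence contains at least one interaction keyword."""
    tokens = {tok.strip(".,;:()").lower() for tok in sentence.split()}
    if not tokens & INTERACTION_KEYWORDS:
        return set()
    present = sorted({e for e in entities if e in sentence})
    return {frozenset(pair) for pair in combinations(present, 2)}


def f_score(predicted, gold):
    """Balanced F-score (harmonic mean of precision and recall) over sets of pairs."""
    tp = len(predicted & gold)
    precision = tp / len(predicted) if predicted else 0.0
    recall = tp / len(gold) if gold else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Made-up sentence with placeholder protein names.
    sentence = "ProtA binds ProtB and interacts with ProtC."
    gold_pairs = {frozenset(("ProtA", "ProtB")), frozenset(("ProtA", "ProtC"))}

    # Gold standard entity names: all three proteins are known.
    with_gold = predict_pairs(sentence, ["ProtA", "ProtB", "ProtC"])
    # Tagged entity names: suppose a tagger misses ProtC.
    with_tagger = predict_pairs(sentence, ["ProtA", "ProtB"])

    print("F-score with gold entities:  ", round(f_score(with_gold, gold_pairs), 3))
    print("F-score with tagged entities:", round(f_score(with_tagger, gold_pairs), 3))
```

In this toy case a single missed entity name removes one of the two true interactions from reach, so the F-score falls even though the extraction rule itself is unchanged. That is the kind of effect the Results paragraph describes when gold standard names are replaced by tagger output.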

Adrian J Shepherd

Andrew B Clegg

Renata Kabiljo

BMC Bioinformatics

English

PubMed

A realistic assessment of methods for extracting gene/protein interactions from free text

http://discovery.ucl.ac.uk/118373/1/1471-2105-10-233.pdf
