130 research outputs found

A realistic assessment of methods for extracting gene/protein interactions from free text

Author: A Moschitti
AB Clegg
Adrian J Shepherd
AM Cohen
Andrew B Clegg
AS Yeh
B Settles
C Nédellec
D Rebholz-Schuhmann
H Jose
HL Johnson
J Ding
J Fluck
JD Kim
K Franzén
K Fundel
K Sagae
L Hunter
M Krallinger
N Domedel-Puig
R Bunescu
R Hoffmann
R Kabiljo
R Leaman
R Sætre
Renata Kabiljo
S Pyysalo
T Hara
WA Baumgartner
Publication venue: Springer Science and Business Media LLC
Publication date: 01/01/2009

Background: The automated extraction of gene and/or protein interactions from the literature is one of the most important targets of biomedical text mining research. In this paper we present a realistic evaluation of gene/protein interaction mining relevant to potential non-specialist users. Hence we have specifically avoided methods that are complex to install or require reimplementation, and we coupled our chosen extraction methods with a state-of-the-art biomedical named entity tagger. Results: Our results show that performance across different evaluation corpora is extremely variable; that the use of tagged (as opposed to gold standard) gene and protein names has a significant impact on performance, with a drop in F-score of over 20 percentage points being commonplace; and that a simple keyword-based benchmark algorithm, when coupled with a named entity tagger, outperforms two of the tools most widely used to extract gene/protein interactions. Conclusion: In terms of availability, ease of use and performance, the potential non-specialist user community interested in automatically extracting gene and/or protein interactions from free text is poorly served by current tools and systems. The public release of extraction tools that are easy to install and use, and that achieve state-of-the-art levels of performance, should be treated as a high priority by the biomedical text mining community.
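
The abstract above reports performance as F-score and describes drops in percentage points. As a minimal sketch of how such a figure is computed, assuming the balanced F-score (the harmonic mean of precision and recall), the following uses invented counts; the function name and all numbers are hypothetical and are not taken from the paper.

```python
def f_score(tp: int, fp: int, fn: int) -> float:
    """Balanced F-score from true positive, false positive and false negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Hypothetical counts, invented for illustration only.
with_gold_names = f_score(tp=60, fp=40, fn=40)    # entity names taken from gold annotation
with_tagged_names = f_score(tp=35, fp=45, fn=65)  # entity names produced by an automatic tagger

# A difference between two F-scores is stated in percentage points, not percent.
drop_in_points = (with_gold_names - with_tagged_names) * 100
print(f"gold: {with_gold_names:.3f}, tagged: {with_tagged_names:.3f}, drop: {drop_in_points:.1f} points")
```

Reporting the difference in percentage points avoids the ambiguity of a relative percentage change.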

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

UCL Discovery

PubMed Central

Birkbeck Institutional Research Online

Long term outcome after subarachnoid haemorrhage of unknown aetiology

Author: Keski-Nisula Leo H
Kähärä Veikko J
Niskakangas Tero T
Pyysalo Liisa M
Öhman Juha E
Publication venue: BMJ Group

Crossref

PubMed Central

BioInfer: a corpus for information extraction in the biomedical domain

Author: A Yakushiji
CF Baker
D Lin
DD Sleator
E Alphonse
E Tsivtsivadze
F Ginter
Filip Ginter
G Hripcsak
H Shatkay
J Cohen
J Ding
J Kim
Jari Björne
JM Temkin
Jorma Boberg
Jouni Järvinen
Juho Heimonen
K Franzén
K Kipper
KB Cohen
L Hirschman
L Salwinski
M Ashburner
N Daraselia
P Kingsbury
P Szolovits
S Aubin
S Pyysalo
S Siegel
Sampo Pyysalo
T Ohta
T Pahikkala
T Wattarujeekrit
Tapio Salakoski
TH King
Y Tateisi
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Lately, there has been a great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods requires annotated domain corpora. RESULTS: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation. CONCLUSION: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

Overview of the ID, EPI and REL tasks of BioNLP Shared Task 2011

Author: A Morgan
A Riggs
A Vlachos
A Yeh
Bruno Sobral
C Arighi
C Nédellec
C Quirk
C Wang
CH Wei
CH Wu
Chunhong Mao
Chunxia Wang
D Barford
D McClosky
D Rebholz-Schuhmann
D Tikk
Dan Sullivan
DD Sleator
E Buyko
E Charniak
ES Witze
EW Noreen
H Kilicoglu
H Lee
H Liu
H Poon
J Björne
J Hakenberg
J Stock
J Tsujii
J Wermter
J Wilbur
JD Kim
Jun'ichi Tsujii
K Yoshikawa
L Hirschman
L McGrath
L Tanabe
M Ashburner
M Gerner
M Glickman
M Krallinger
M Miwa
M Narayanaswamy
M Ongenaert
M Porter
MC de Marneffe
ME Winston
MS Simpson
N Chinchor
N Nguyen
O Bodenreider
P Corbett
P Stenetorp
P Thomason
P Thompson
P Zweigenbaum
Q Le Minh
R Farkas
R Hoehndorf
R Holliday
R Jaenisch
R Leaman
Rafal Rak
S Ananiadou
S Pyysalo
S Riedel
S Strassel
S Van Landeghem
Sampo Pyysalo
Sophia Ananiadou
T Krell
T Mascher
T Ohta
Tomoko Ohta
V Vincze
W Hersh
X Yuan
Y Gotoh
Y Sasaki
Y Tateisi
Y Wang
ZZ Hu
Publication venue: BioMed Central
Publication date: 01/01/2012

We present the preparation, resources, results and analysis of three tasks of the BioNLP Shared Task 2011: the main tasks on Infectious Diseases (ID) and Epigenetics and Post-translational Modifications (EPI), and the supporting task on Entity Relations (REL). The two main tasks represent extensions of the event extraction model introduced in the BioNLP Shared Task 2009 (ST'09) to two new areas of biomedical scientific literature, each motivated by the needs of specific biocuration tasks. The ID task concerns the molecular mechanisms of infection, virulence and resistance, focusing in particular on the functions of a class of signaling systems that are ubiquitous in bacteria. The EPI task is dedicated to the extraction of statements regarding chemical modifications of DNA and proteins, with particular emphasis on changes relating to the epigenetic control of gene expression. By contrast to these two application-oriented main tasks, the REL task seeks to support extraction in general by separating challenges relating to part-of relations into a subproblem that can be addressed by independent systems. Seven groups participated in each of the two main tasks and four groups in the supporting task. The participating systems indicated advances in the capability of event extraction methods and demonstrated generalization in many aspects: from abstracts to full texts, from previously considered subdomains to new ones, and from the ST'09 extraction targets to other entities and events. The highest performance achieved in the supporting task REL, 58% F-score, is broadly comparable with levels reported for other relation extraction tasks. For the ID task, the highest-performing system achieved 56% F-score, comparable to the state-of-the-art performance at the established ST'09 task. 
In the EPI task, the best result was 53% F-score for the full set of extraction targets and 69% F-score for a reduced set of core extraction targets, approaching a level of performance sufficient for user-facing applications. In this study, we extend previously reported results and perform further analyses of the outputs of the participating systems. We place specific emphasis on aspects of system performance relating to real-world applicability, considering alternate evaluation metrics and performing additional manual analysis of system outputs. We further demonstrate that the strengths of extraction systems can be combined to improve on the performance achieved by any system in isolation. The manually annotated corpora, supporting resources, and evaluation tools for all tasks are available from http://www.bionlp-st.org and the tasks continue as open challenges for all interested parties.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

The University of Manchester - Institutional Repository

A neural network multi-task learning approach to biomedical named entity recognition

Author: A Argyriou
A Maurer
A Søgaard
Anna Korhonen
B Bakker
B Chiu
Billy Chiu
CH Wei
D Campos
E Pafilis
G Zhou
Gamal Crichton
H Li
HM Alonso
J Bingel
JD Kim
JR Finkel
K Hakala
L Smith
M Bada
M Gerner
M Krallinger
M Luong
N Srivastava
O Levy
PS Huang
R Batista-Navarro
R Caruana
R Collobert
R Leaman
RI Doğan
RK Ando
S Pyysalo
Sampo Pyysalo
T Evgeniou
T Munkhdalai
T Ohta
V Nair
W Zhang
Y Qi
Y Wang
Z Wu
Publication venue: Springer Science and Business Media LLC

Crossref

Linguistic feature analysis for protein interaction extraction

Author: A Airola
A Moschitti
A Yakushiji
B Schölkopf
C Cortes
C Giuliano
C Nedellec
CC Chang
Chris Cornelis
D Haussler
H Lodhi
J Ding
J Xiao
JH Eom
K Fundel
M Collins
Martine De Cock
MF Porter
R Bunescu
R Saetre
RC Bunescu
S Katrenko
S Kim
S Pyysalo
S Van Landeghem
T Fayruzov
Timur Fayruzov
Veronique Hoste
Y Saeys
Publication venue: BioMed Central
Publication date: 01/01/2009

Background: The rapid growth in the number of publicly available reports on biomedical experimental results has recently boosted text mining approaches for protein interaction extraction. Most approaches rely implicitly or explicitly on linguistic, i.e., lexical and syntactic, data extracted from text. However, only a few attempts have been made to evaluate the contribution of the different feature types. In this work, we contribute to this evaluation by studying the relative importance of deep syntactic features, i.e., grammatical relations, shallow syntactic features (part-of-speech information) and lexical features. For this purpose, we use a recently proposed approach that uses support vector machines with structured kernels. Results: Our results reveal that the contribution of the different feature types varies for the different data sets on which the experiments were conducted. The smaller the training corpus compared to the test data, the more important the role of grammatical relations becomes. Moreover, classifiers based on deep syntactic information prove to be more robust on heterogeneous texts where no or only limited common vocabulary is shared. Conclusion: Our findings suggest that grammatical relations play an important role in the interaction extraction task. Moreover, the net advantage of adding lexical and shallow syntactic features is small relative to the number of added features. This implies that efficient classifiers can be built by using only a small fraction of the features that are typically used in recent approaches.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

Ghent University Academic Bibliography

PubMed Central

Semantically linking molecular entities in literature through entity relationships

Author: A Airola
A Reverter
Bernard De Baets
C Burgess
D Jurgens
D McClosky
DLT Rohde
E Charniak
EW Sayers
H Kilicoglu
I Tsochantaridis
J Björne
Jari Björne
JD Kim
M Buckland
M de Marneffe
M Krallinger
M Miwa
M Sahlgren
MF Porter
R Leaman
S Pyysalo
S van Dongen
S Van Landeghem
Sofie Van Landeghem
T Ohta
Tapio Salakoski
The UniProt Consortium
Thomas Abeel
TK Landauer
VN Vapnik
Yves Van de Peer
Publication venue: BioMed Central
Publication date: 01/01/2012

Background: Text mining tools have gained popularity for processing the vast number of research articles available in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real life scenarios. Studies of mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance integration of text mining results with database facts. Results: We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. For the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset, analysing the contribution of both the term detection and relation extraction modules. We further construct a hybrid system combining the two frameworks and experiment with intersection and union combinations, achieving respectively high-precision and high-recall results. Finally, we highlight extremely high-performance results (F-score > 90%) obtained for the specific subclass of embedded entity relations that are essential for integrating text mining predictions with database facts. Conclusions: The results from this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles, is an interesting and exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.
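
The abstract above mentions combining two systems by intersection and by union to obtain high-precision and high-recall results respectively. A minimal sketch of why those two combinations behave that way follows, assuming each system's output can be represented as a set of entity pairs; the relation pairs, variable names and helper function are hypothetical and do not come from the paper or its systems.

```python
def precision_recall(predicted: set, gold: set) -> tuple[float, float]:
    """Return (precision, recall) of a predicted relation set against a gold set."""
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical relation pairs, invented for illustration only.
gold = {("A", "B"), ("A", "C"), ("B", "D"), ("C", "D")}
system_1 = {("A", "B"), ("A", "C"), ("B", "E")}
system_2 = {("A", "B"), ("B", "D"), ("C", "E")}

# Intersection keeps only relations both systems agree on, which tends to favour precision.
intersection = system_1 & system_2
# Union keeps every relation either system proposes, which tends to favour recall.
union = system_1 | system_2

for name, predicted in [("system 1", system_1), ("system 2", system_2),
                        ("intersection", intersection), ("union", union)]:
    p, r = precision_recall(predicted, gold)
    print(f"{name}: precision={p:.2f} recall={r:.2f}")
```

In this toy data the intersection is the most precise combination and the union has the highest recall; with real system outputs the size of the effect depends on how much the systems' errors overlap.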

Crossref

TU Delft Repository

Springer - Publisher Connector

Ghent University Academic Bibliography

PubMed Central

Archivsystem Ask23

Investigating heterogeneous protein annotations toward cross-corpora utilization

Author: A Arnold
A Yeh
AM Cohen
B Alex
B Efron
C Nédellec
CJ Kuo
EFTK Sang
EW Noreen
F Rinaldi
F Sha
G Zhou
H Daumé III
H Shatkay
HL Johnson
J Wilbur
JD Kim
Jin-Dong Kim
Jun'ichi Tsujii
K Franzén
K Yoshida
KB Cohen
L Gillick
L Tanabe
MA Mandel
R Bunescu
R Kabiljo
RTH Tsai
Rune Sætre
S Pyysalo
Sampo Pyysalo
T Ohta
V Hatzivassiloglou
X Sun
Y Song
Y Wang
Yue Wang
Publication venue: BioMed Central
Publication date: 01/12/2009

Background: The number of corpora, collections of structured texts, has been increasing as a result of the growing interest in the application of natural language processing methods to biological texts. Many named entity recognition (NER) systems have been developed based on these corpora. However, in the biomedical community, there is as yet no general consensus regarding named entity annotation; thus, the resources are largely incompatible, and it is difficult to compare the performance of systems developed on resources that were divergently annotated. On the other hand, from a practical application perspective, it is desirable to utilize as many existing annotated resources as possible, because annotation is costly. Thus, it becomes a task of interest to integrate the heterogeneous annotations in these resources. Results: We explore the potential sources of incompatibility among gene and protein annotations that were made for three common corpora: GENIA, GENETAG and AIMed. To show the inconsistency in the corpora annotations, we first tackle the incompatibility problem caused by corpus integration, and we quantitatively measure the effect of this incompatibility on protein mention recognition. We find that the F-score performance declines tremendously when training with integrated data, instead of training with pure data; in some cases, the performance drops by nearly 12%. This degradation may be caused by the newly added heterogeneous annotations, and cannot be fixed without an understanding of the heterogeneities that exist among the corpora. Motivated by the result of this preliminary experiment, we further qualitatively analyze a number of possible sources for these differences, and investigate the factors that would explain the inconsistencies, by performing a series of well-designed experiments.
Our analyses indicate that incompatibilities in the gene/protein annotations exist mainly in the following four areas: the boundary annotation conventions, the scope of the entities of interest, the distribution of annotated entities, and the ratio of overlap between annotated entities. We further suggest that almost all of the incompatibilities can be prevented by properly considering the four aforementioned aspects. Conclusion: Our analysis covers the key similarities and dissimilarities that exist among the diverse gene/protein corpora. This paper serves to improve our understanding of the differences in the three studied corpora, which can then lead to a better understanding of the performance of protein recognizers that are based on the corpora.

Crossref

Springer - Publisher Connector

Directory of Open Access Journals

PubMed Central

University of Turku in the BioNLP'11 Shared Task

Author: A Jimeno Yepes
D McClosky
de Marneffe
E Buyko
E Charniak
Filip Ginter
H Kilicoglu
I Tsochantaridis
J Björne
J Heimonen
J Jourde
Jari Björne
JD Kim
JP Euzéby
M Miwa
MC de Marneffe
MF Porter
N Nguyen
P Stenetorp
R Bossy
S Pyysalo
S Riedel
S Van Landeghem
T Ohta
Tapio Salakoski
Y Kim
Z Ratkovic
Publication venue: BioMed Central
Publication date: 01/01/2012

Crossref

Springer - Publisher Connector

PubMed Central

Tricholoma matsutake 1-Ocen-3-ol and methyl cinnamate repel mycophagous Proisotoma minuta (Collembola: Insecta)

Author: A Ohta
E Kaminski
F Hiol Hiol
G Bengtsson
H Pyysalo
JF Ponge
K Hedlund
KA Vogt
KJ Cromack
Masahiro Suzuki
N Stark
PA Schultz
PM Hammond
RM Pfeil
S Kaneda
S Yamashita
Satoshi Shimano
T Nakamori
T Sawahata
T Terashita
Takuo Sawahata
WF Wood
Publication venue: Springer-Verlag
Publication date: 01/01/2007

Two major volatiles produced by the mycelia and fruiting bodies of Tricholoma matsutake (1-octen-3-ol and methyl cinnamate) repel a mycophagous collembolan, Proisotoma minuta. Aggregation of the collembolans on their diet was significantly inhibited by exposure to 1 ppm methyl cinnamate or 10 to 100 ppm 1-octen-3-ol. The aggregation activity decreased dose-dependently upon exposure to 1-octen-3-ol at concentrations higher than 0.01 ppm. Aggregation in the presence of methyl cinnamate exhibited three phases: no significant effect at concentrations ranging from 0.001 to 0.1 ppm, significant inhibition from 1 to 100 ppm, and strong inhibition at 1,000 ppm. These results may explain why certain collembolan species do not prefer T. matsutake fruiting bodies.

Crossref

Springer - Publisher Connector

PubMed Central