2,330 research outputs found
New Resources and Perspectives for Biomedical Event Extraction
Event extraction is a major focus of recent work in biomedical information extraction. Despite substantial advances, many challenges still remain for reliable automatic extraction of events from text. We introduce a new biomedical event extraction resource consisting of analyses automatically created by systems participating in the recent BioNLP Shared Task (ST) 2011. In providing for the first time the outputs of a broad set of state-ofthe-art event extraction systems, this resource opens many new opportunities for studying aspects of event extraction, from the identification of common errors to the study of effective approaches to combining the strengths of systems. We demonstrate these opportunities through a multi-system analysis on three BioNLP ST 2011 main tasks, focusing on events that none of the systems can successfully extract. We further argue for new perspectives to the performance evaluation of domain event extraction systems, considering a document-level, âoff-the-page â representation and evaluation to complement the mentionlevel evaluations pursued in most recent work.
The Impact of Annotation on the Performance of Protein Tagging in Biomedical Text
In this paper we discuss five different corpora annotated for protein names. We present several within- and cross-dataset protein tagging experiments showing that different annotation schemes severely affect the portability of statistical protein taggers. By means of a detailed error analysis we identify crucial annotation issues that future annotation projects should take into careful consideration
Neural Relation Extraction Within and Across Sentence Boundaries
Past work in relation extraction mostly focuses on binary relation between
entity pairs within single sentence. Recently, the NLP community has gained
interest in relation extraction in entity pairs spanning multiple sentences. In
this paper, we propose a novel architecture for this task: inter-sentential
dependency-based neural networks (iDepNN). iDepNN models the shortest and
augmented dependency paths via recurrent and recursive neural networks to
extract relationships within (intra-) and across (inter-) sentence boundaries.
Compared to SVM and neural network baselines, iDepNN is more robust to false
positives in relationships spanning sentences.
We evaluate our models on four datasets from newswire (MUC6) and medical
(BioNLP shared task) domains that achieve state-of-the-art performance and show
a better balance in precision and recall for inter-sentential relationships. We
perform better than 11 teams participating in the BioNLP shared task 2016 and
achieve a gain of 5.2% (0.587 vs 0.558) in F1 over the winning team. We also
release the crosssentence annotations for MUC6.Comment: AAAI201
Kernelized Hashcode Representations for Relation Extraction
Kernel methods have produced state-of-the-art results for a number of NLP
tasks such as relation extraction, but suffer from poor scalability due to the
high cost of computing kernel similarities between natural language structures.
A recently proposed technique, kernelized locality-sensitive hashing (KLSH),
can significantly reduce the computational cost, but is only applicable to
classifiers operating on kNN graphs. Here we propose to use random subspaces of
KLSH codes for efficiently constructing an explicit representation of NLP
structures suitable for general classification methods. Further, we propose an
approach for optimizing the KLSH model for classification problems by
maximizing an approximation of mutual information between the KLSH codes
(feature vectors) and the class labels. We evaluate the proposed approach on
biomedical relation extraction datasets, and observe significant and robust
improvements in accuracy w.r.t. state-of-the-art classifiers, along with
drastic (orders-of-magnitude) speedup compared to conventional kernel methods.Comment: To appear in the proceedings of conference, AAAI-1
Biomedical Event Trigger Identification Using Bidirectional Recurrent Neural Network Based Models
Biomedical events describe complex interactions between various biomedical
entities. Event trigger is a word or a phrase which typically signifies the
occurrence of an event. Event trigger identification is an important first step
in all event extraction methods. However many of the current approaches either
rely on complex hand-crafted features or consider features only within a
window. In this paper we propose a method that takes the advantage of recurrent
neural network (RNN) to extract higher level features present across the
sentence. Thus hidden state representation of RNN along with word and entity
type embedding as features avoid relying on the complex hand-crafted features
generated using various NLP toolkits. Our experiments have shown to achieve
state-of-art F1-score on Multi Level Event Extraction (MLEE) corpus. We have
also performed category-wise analysis of the result and discussed the
importance of various features in trigger identification task.Comment: The work has been accepted in BioNLP at ACL-201
Cell line name recognition in support of the identification of synthetic lethality in cancer from text
Motivation: The recognition and normalization of cell line names in text is an important task in biomedical text mining research, facilitating for instance the identification of synthetically lethal genes from the literature. While several tools have previously been developed to address cell line recognition, it is unclear whether available systems can perform sufficiently well in realistic and broad-coverage applications such as extracting synthetically lethal genes from the cancer literature. In this study, we revisit the cell line name recognition task, evaluating both available systems and newly introduced methods on various resources to obtain a reliable tagger not tied to any specific subdomain. In support of this task, we introduce two text collections manually annotated for cell line names: the broad-coverage corpus Gellus and CLL, a focused target domain corpus.
Results: We find that the best performance is achieved using NERsuite, a machine learning system based on Conditional Random Fields, trained on the Gellus corpus and supported with a dictionary of cell line names. The system achieves an F-score of 88.46% on the test set of Gellus and 85.98% on the independently annotated CLL corpus. It was further applied at large scale to 24 302 102 unannotated articles, resulting in the identification of 5 181 342 cell line mentions, normalized to 11 755 unique cell line database identifiers
- âŚ