179 research outputs found
Biomedical relation extraction: from binary to complex
Biomedical relation extraction aims to uncover high-quality relations from life science literature with high accuracy and efficiency. Early biomedical relation extraction tasks focused on capturing binary relations, such as protein-protein interactions, which are crucial for virtually every process in a living cell. Information about these interactions provides the foundations for new therapeutic approaches. In recent years, interest has shifted to the extraction of complex relations such as biomolecular events. Complex relations go beyond binary relations: they can involve more than two arguments and may also take another relation as an argument. In this paper, we conduct a thorough survey of the research in biomedical relation extraction. We first present a general framework for biomedical relation extraction and then discuss the approaches proposed for binary and complex relation extraction, focusing on the latter since it is a much more difficult task than binary relation extraction. Finally, we discuss the challenges facing complex relation extraction and outline possible solutions and future directions.
Biomolecular Event Extraction using Natural Language Processing
Biomedical research and discoveries are communicated through scholarly publications, and this literature is voluminous, rich in scientific text and growing exponentially by the day. Biomedical journals publish nearly three thousand research articles daily, making literature search a challenging proposition for researchers. Biomolecular events involve genes, proteins, metabolites, and enzymes that provide invaluable insights into biological processes and explain the physiological functional mechanisms. Text mining (TM), or the automatic extraction of such events from big data, is the only quick and viable solution to gather any useful information. Such events extracted from biological literature have a broad range of applications such as database curation, ontology construction, semantic web search and interactive systems. However, automatic extraction has its challenges on account of the ambiguity and diverse nature of natural language and associated linguistic phenomena such as speculation and negation, which are common in biomedical texts and lead to erroneous interpretation. In the last decade, many strategies have been proposed in this field, using different paradigms such as biomedical natural language processing (BioNLP), machine learning and deep learning. Also, new parallel computing architectures such as graphics processing units (GPUs) have emerged as possible candidates to accelerate the event extraction pipeline. This paper reviews and summarizes the key approaches in complex biomolecular big data event extraction tasks and recommends a balanced architecture in terms of accuracy, speed, computational cost, and memory usage towards developing a robust GPU-accelerated BioNLP system.
Biomedical Event Extraction with Machine Learning
Biomedical natural language processing (BioNLP) is a subfield of natural
language processing, an area of computational linguistics concerned
with developing programs that work with natural language: written texts and
speech. Biomedical relation extraction concerns the detection of
semantic relations such as protein-protein interactions (PPI) from scientific
texts. The aim is to enhance information retrieval by detecting relations
between concepts, not just individual concepts as with a keyword search.
In recent years, events have been proposed as a more detailed alternative for
simple pairwise PPI relations. Events provide a systematic, structural
representation for annotating the content of natural language texts. Events are
characterized by annotated trigger words, directed and typed arguments and the
ability to nest other events. For example, the sentence "Protein A causes
protein B to bind protein C" can be annotated with the nested event structure
CAUSE(A, BIND(B, C)). Converted to such formal representations, the
information of natural language texts can be used by computational
applications. Biomedical event annotations were introduced by the BioInfer and
GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task
on Event Extraction.
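The nested structure CAUSE(A, BIND(B, C)) can be illustrated with a small recursive data type. The following is only a minimal sketch: the class names `Entity` and `Event` and the helper `depth` are hypothetical and are not taken from TEES or any corpus format, and the sketch leaves out trigger words and argument roles for brevity.

```python
from dataclasses import dataclass
from typing import Tuple, Union


@dataclass(frozen=True)
class Entity:
    """A named entity mention, e.g. a protein (hypothetical type)."""
    name: str


@dataclass(frozen=True)
class Event:
    """A typed event whose arguments are entities or other events."""
    type: str
    args: Tuple[Union[Entity, "Event"], ...]

    def __str__(self) -> str:
        # Render entities by name and nested events recursively.
        inner = ", ".join(
            a.name if isinstance(a, Entity) else str(a) for a in self.args
        )
        return f"{self.type}({inner})"


def depth(node: Union[Entity, Event]) -> int:
    """Nesting depth: 0 for an entity, 1 + deepest argument for an event."""
    if isinstance(node, Entity):
        return 0
    return 1 + max((depth(a) for a in node.args), default=0)


# "Protein A causes protein B to bind protein C"
a, b, c = Entity("A"), Entity("B"), Entity("C")
bind = Event("BIND", (b, c))
cause = Event("CAUSE", (a, bind))

print(cause)         # CAUSE(A, BIND(B, C))
print(depth(cause))  # 2
```

Because an event may appear as an argument of another event, a binary relation such as a PPI is simply the special case of an event with two entity arguments and depth 1.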
In this thesis we present a method for automated event extraction, implemented
as the Turku Event Extraction System (TEES). A unified graph format is defined
for representing event annotations and the problem of extracting complex event
structures is decomposed into a number of independent classification tasks.
These classification tasks are solved using SVM and RLS classifiers, utilizing
rich feature representations built from full dependency parsing. Building on
earlier work on pairwise relation extraction and using a generalized graph
representation, the resulting TEES system is capable of detecting binary
relations as well as complex event structures.
We show that this event extraction system performs well,
taking first place in the BioNLP'09 Shared Task on Event Extraction. Subsequently,
TEES has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared
Tasks, and has shown competitive performance in the binary relation Drug-Drug
Interaction Extraction 2011 and 2013 shared tasks.
The Turku Event Extraction System is published as a freely available open-source
project, documenting the research in detail as well as making the method
available for practical applications. In particular, in this thesis we
describe the application of the event extraction method to PubMed-scale text
mining, showing that the developed approach not only performs well, but
also generalizes and applies to large-scale real-world text mining projects.
Finally, we discuss related literature, summarize the contributions of the work
and present some thoughts on future directions for biomedical event extraction.
This thesis includes and builds on six original research publications. The first
of these introduces the analysis of dependency parses that led to the
development of TEES. The entries in the three BioNLP Shared Tasks, as well as
in the DDIExtraction 2011 task are covered in four publications, and the sixth
one demonstrates the application of the system to PubMed-scale text mining.
Semantically linking molecular entities in literature through entity relationships
Background: Text mining tools have gained popularity for processing the vast amount of available research articles in the biomedical literature. It is crucial that such tools extract information with a sufficient level of detail to be applicable in real-life scenarios. Studies of mining non-causal molecular relations contribute to this goal by formally identifying the relations between genes, promoters, complexes and various other molecular entities found in text. More importantly, these studies help to enhance integration of text mining results with database facts. Results: We describe, compare and evaluate two frameworks developed for the prediction of non-causal or 'entity' relations (REL) between gene symbols and domain terms. For the corresponding REL challenge of the BioNLP Shared Task of 2011, these systems ranked first (57.7% F-score) and second (41.6% F-score). In this paper, we investigate the performance discrepancy of 16 percentage points by benchmarking on a related and more extensive dataset, analysing the contribution of both the term detection and relation extraction modules. We further construct a hybrid system combining the two frameworks and experiment with intersection and union combinations, achieving high-precision and high-recall results, respectively. Finally, we highlight extremely high-performance results (F-score > 90%) obtained for the specific subclass of embedded entity relations that are essential for integrating text mining predictions with database facts. Conclusions: The results from this study will enable us in the near future to annotate semantic relations between molecular entities in the entire scientific literature available through PubMed. The recent release of the EVEX dataset, containing biomolecular event predictions for millions of PubMed articles, is an interesting and exciting opportunity to overlay these entity relations with event predictions on a literature-wide scale.
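The intersection and union combinations mentioned in this abstract can be sketched with plain set operations over predicted relation pairs. This is a toy illustration only: the relation pairs, the `prf` helper and the two system outputs are invented for the example and do not reproduce the paper's systems or data.

```python
def prf(predicted, gold):
    """Return (precision, recall, F-score) of a predicted set against gold."""
    tp = len(predicted & gold)
    p = tp / len(predicted) if predicted else 0.0
    r = tp / len(gold) if gold else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


# Hypothetical gold-standard entity relations (gene symbol, domain term).
gold = {
    ("geneA", "promoterX"),
    ("geneB", "complexY"),
    ("geneC", "domainZ"),
    ("geneD", "siteW"),
}

# Hypothetical outputs of two independent systems.
system_1 = {("geneA", "promoterX"), ("geneB", "complexY"), ("geneE", "siteV")}
system_2 = {("geneA", "promoterX"), ("geneC", "domainZ"), ("geneF", "siteU")}

# Intersection keeps only relations both systems agree on.
p_inter, r_inter, _ = prf(system_1 & system_2, gold)
# Union keeps every relation proposed by either system.
p_union, r_union, _ = prf(system_1 | system_2, gold)

print(f"intersection: precision={p_inter:.2f} recall={r_inter:.2f}")
print(f"union:        precision={p_union:.2f} recall={r_union:.2f}")
```

In this toy data the intersection reaches a precision of 1.0 at a recall of 0.25, while the union reaches a recall of 0.75 at a precision of 0.6. Union recall can never fall below that of either system; the precision gain from intersection is typical but not guaranteed.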
An analysis of gene/protein associations at PubMed scale
Background: Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available. Results: In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology. Conclusions: We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.
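As an illustration of the shortest dependency path between two co-occurring entities, the sketch below runs a breadth-first search over a dependency parse treated as an undirected graph. The sentence, its hand-written parse and the function name `shortest_dependency_path` are assumptions made for the example; the study's actual parser, features and classifier are not reproduced here.

```python
from collections import deque


def shortest_dependency_path(edges, start, goal):
    """Breadth-first search over (head, dependent) edges, ignoring direction."""
    graph = {}
    for head, dep in edges:
        graph.setdefault(head, set()).add(dep)
        graph.setdefault(dep, set()).add(head)

    queue = deque([[start]])
    seen = {start}
    while queue:
        path = queue.popleft()
        node = path[-1]
        if node == goal:
            return path
        for nxt in sorted(graph.get(node, ())):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(path + [nxt])
    return None  # the two tokens are not connected


# Hand-written toy parse of "ProtA strongly induces phosphorylation of ProtB".
edges = [
    ("induces", "ProtA"),            # subject
    ("induces", "strongly"),         # adverbial modifier
    ("induces", "phosphorylation"),  # object
    ("phosphorylation", "ProtB"),    # prepositional attachment ("of")
]

path = shortest_dependency_path(edges, "ProtA", "ProtB")
print(" -> ".join(path))  # ProtA -> induces -> phosphorylation -> ProtB
```

The path drops the modifier "strongly" and keeps only the words that connect the two entities, which is why such paths are a common compact representation of a candidate association statement.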
- …