187 research outputs found
Lexical Adaptation of Link Grammar to the Biomedical Sublanguage: a Comparative Evaluation of Three Approaches
We study the adaptation of Link Grammar Parser to the biomedical sublanguage
with a focus on domain terms not found in a general parser lexicon. Using two
biomedical corpora, we implement and evaluate three approaches to addressing
unknown words: automatic lexicon expansion, the use of morphological clues, and
disambiguation using a part-of-speech tagger. We evaluate each approach
separately for its effect on parsing performance and consider combinations of
these approaches. In addition to a 45% increase in parsing efficiency, we find
that the best approach, incorporating information from a domain part-of-speech
tagger, offers a statistically signicant 10% relative decrease in error. The
adapted parser is available under an open-source license at
http://www.it.utu.fi/biolg
Adapting a general parser to a sublanguage
In this paper, we propose a method to adapt a general parser (Link Parser) to
sublanguages, focusing on the parsing of texts in biology. Our main proposal is
the use of terminology (identication and analysis of terms) in order to reduce
the complexity of the text to be parsed. Several other strategies are explored
and finally combined among which text normalization, lexicon and
morpho-guessing module extensions and grammar rules adaptation. We compare the
parsing results before and after these adaptations
A Dependency Parsing Approach to Biomedical Text Mining
Biomedical research is currently facing a new type of challenge: an excess of information, both in terms of raw data from experiments and in the number of scientific publications describing their results. Mirroring the focus on data mining techniques to address the issues of structured data, there has recently been great interest in the development and application of text mining techniques to make more effective use of the knowledge contained in biomedical scientific publications, accessible only in the form of natural human language.
This thesis describes research done in the broader scope of projects aiming to develop methods, tools and techniques for text mining tasks in general and for the biomedical domain in particular. The work described here involves more specifically the goal of extracting information from statements concerning relations of biomedical entities, such as protein-protein interactions. The approach taken is one using full parsing—syntactic analysis of the entire structure of sentences—and machine learning, aiming to develop reliable methods that can further be generalized to apply also to other domains.
The five papers at the core of this thesis describe research on a number of distinct but related topics in text mining. In the first of these studies, we assessed the applicability of two popular general English parsers to biomedical text mining and, finding their performance limited, identified several specific challenges to accurate parsing of domain text. In a follow-up study focusing on parsing issues related to specialized domain terminology, we evaluated three lexical adaptation methods. We found that the accurate resolution of unknown words can considerably improve parsing performance and introduced a domain-adapted parser that reduced the error rate of theoriginal by 10% while also roughly halving parsing time.
To establish the relative merits of parsers that differ in the applied formalisms and the representation given to their syntactic analyses, we have also developed evaluation methodology, considering different approaches to establishing comparable dependency-based evaluation results. We introduced a methodology for creating highly accurate conversions between different parse representations, demonstrating the feasibility of unification of idiverse syntactic schemes under a shared, application-oriented representation. In addition to allowing formalism-neutral evaluation, we argue that such unification can also increase the value of parsers for domain text mining. As a further step in this direction, we analysed the characteristics of publicly available biomedical corpora annotated for protein-protein interactions and created tools for converting them into a shared form, thus contributing also to the unification of text mining resources. The introduced unified corpora allowed us to perform a task-oriented comparative evaluation of biomedical text mining corpora. This evaluation established clear limits on the comparability of results for text mining methods evaluated on different resources, prompting further efforts toward standardization.
To support this and other research, we have also designed and annotated BioInfer, the first domain corpus of its size combining annotation of syntax and biomedical entities with a detailed annotation of their relationships. The corpus represents a major design and development effort of the research group, with manual annotation that identifies over 6000 entities, 2500 relationships and 28,000 syntactic dependencies in 1100 sentences. In addition to combining these key annotations for a single set of sentences, BioInfer was also the first domain resource to introduce a representation of entity relations that is supported by ontologies and able to capture complex, structured relationships.
Part I of this thesis presents a summary of this research in the broader context of a text mining system, and Part II contains reprints of the five included publications.Siirretty Doriast
BioInfer: a corpus for information extraction in the biomedical domain
BACKGROUND: Lately, there has been a great interest in the application of information extraction methods to the biomedical domain, in particular, to the extraction of relationships of genes, proteins, and RNA from scientific publications. The development and evaluation of such methods requires annotated domain corpora. RESULTS: We present BioInfer (Bio Information Extraction Resource), a new public resource providing an annotated corpus of biomedical English. We describe an annotation scheme capturing named entities and their relationships along with a dependency analysis of sentence syntax. We further present ontologies defining the types of entities and relationships annotated in the corpus. Currently, the corpus contains 1100 sentences from abstracts of biomedical research articles annotated for relationships, named entities, as well as syntactic dependencies. Supporting software is provided with the corpus. The corpus is unique in the domain in combining these annotation types for a single set of sentences, and in the level of detail of the relationship annotation. CONCLUSION: We introduce a corpus targeted at protein, gene, and RNA relationships which serves as a resource for the development of information extraction systems and their components such as parsers and domain analyzers. The corpus will be maintained and further developed with a current version being available at
Performance of the Charniak-Lease parser on biological text using different training corpora
POS tagging is used as the first step in many NLP workflows, although the accuracy of tag assignment frequently goes unchecked. We hypothesize that changing the training corpora for a parser will affect its POS tagging of a target corpus. To this end we train the Charniak-Lease parser on the WSJ corpus and two biomedical corpora and evaluate its output to MedPost, a POS tagger with a reported 97% accuracy on biomedical text. Our findings indicate that using biomedical training corpora significantly improves performance, but that minor differences in the biomedical training corpora have a significant effect on the correctness of POS tagging. Specifically, the tagging of hyphenated words and verbs was affected. This work suggests that the choice of training corpora is crucial to domain targeted NLP analysis
An analysis of gene/protein associations at PubMed scale
<p>Abstract</p> <p>Background</p> <p>Event extraction following the GENIA Event corpus and BioNLP shared task models has been a considerable focus of recent work in biomedical information extraction. This work includes efforts applying event extraction methods to the entire PubMed literature database, far beyond the narrow subdomains of biomedicine for which annotated resources for extraction method development are available.</p> <p>Results</p> <p>In the present study, our aim is to estimate the coverage of all statements of gene/protein associations in PubMed that existing resources for event extraction can provide. We base our analysis on a recently released corpus automatically annotated for gene/protein entities and syntactic analyses covering the entire PubMed, and use named entity co-occurrence, shortest dependency paths and an unlexicalized classifier to identify likely statements of gene/protein associations. A set of high-frequency/high-likelihood association statements are then manually analyzed with reference to the GENIA ontology.</p> <p>Conclusions</p> <p>We present a first estimate of the overall coverage of gene/protein associations provided by existing resources for event extraction. Our results suggest that for event-type associations this coverage may be over 90%. We also identify several biologically significant associations of genes and proteins that are not addressed by these resources, suggesting directions for further extension of extraction coverage.</p
Biomedical Event Extraction with Machine Learning
Biomedical natural language processing (BioNLP) is a subfield of natural
language processing, an area of computational linguistics concerned with
developing programs that work with natural language: written texts and
speech. Biomedical relation extraction concerns the detection of semantic
relations such as protein-protein interactions (PPI) from scientific texts.
The aim is to enhance information retrieval by detecting relations between
concepts, not just individual concepts as with a keyword search.
In recent years, events have been proposed as a more detailed alternative
for simple pairwise PPI relations. Events provide a systematic, structural
representation for annotating the content of natural language texts. Events
are characterized by annotated trigger words, directed and typed arguments
and the ability to nest other events. For example, the sentence “Protein A
causes protein B to bind protein C” can be annotated with the nested event
structure CAUSE(A, BIND(B, C)). Converted to such formal representations,
the information of natural language texts can be used by computational
applications. Biomedical event annotations were introduced by the
BioInfer and GENIA corpora, and event extraction was popularized by the
BioNLP'09 Shared Task on Event Extraction.
In this thesis we present a method for automated event extraction, implemented
as the Turku Event Extraction System (TEES). A unified graph
format is defined for representing event annotations and the problem of
extracting complex event structures is decomposed into a number of independent
classification tasks. These classification tasks are solved using SVM
and RLS classifiers, utilizing rich feature representations built from full dependency
parsing. Building on earlier work on pairwise relation extraction
and using a generalized graph representation, the resulting TEES system is
capable of detecting binary relations as well as complex event structures.
We show that this event extraction system has good performance, reaching
the first place in the BioNLP'09 Shared Task on Event Extraction.
Subsequently, TEES has achieved several first ranks in the BioNLP'11 and
BioNLP'13 Shared Tasks, as well as shown competitive performance in the
binary relation Drug-Drug Interaction Extraction 2011 and 2013 shared
tasks.
The Turku Event Extraction System is published as a freely available
open-source project, documenting the research in detail as well as making
the method available for practical applications. In particular, in this thesis
we describe the application of the event extraction method to PubMed-scale
text mining, showing how the developed approach not only shows good
performance, but is generalizable and applicable to large-scale real-world
text mining projects.
Finally, we discuss related literature, summarize the contributions of the
work and present some thoughts on future directions for biomedical event
extraction. This thesis includes and builds on six original research publications.
The first of these introduces the analysis of dependency parses that
leads to development of TEES. The entries in the three BioNLP Shared
Tasks, as well as in the DDIExtraction 2011 task are covered in four publications,
and the sixth one demonstrates the application of the system to
PubMed-scale text mining.Siirretty Doriast
Biomedical Event Extraction with Machine Learning
Biomedical natural language processing (BioNLP) is a subfield of natural
language processing, an area of computational linguistics concerned
with developing programs that work with natural language: written texts and
speech. Biomedical relation extraction concerns the detection of
semantic relations such as protein--protein interactions (PPI) from scientific
texts. The aim is to enhance information retrieval by detecting relations
between concepts, not just individual concepts as with a keyword search.
In recent years, events have been proposed as a more detailed alternative for
simple pairwise PPI relations. Events provide a systematic, structural
representation for annotating the content of natural language texts. Events are
characterized by annotated trigger words, directed and typed arguments and the
ability to nest other events. For example, the sentence ``Protein A causes
protein B to bind protein C'' can be annotated with the nested event structure
CAUSE(A, BIND(B, C)). Converted to such formal representations, the
information of natural language texts can be used by computational
applications. Biomedical event annotations were introduced by the BioInfer and
GENIA corpora, and event extraction was popularized by the BioNLP'09 Shared Task
on Event Extraction.
In this thesis we present a method for automated event extraction, implemented
as the Turku Event Extraction System (TEES). A unified graph format is defined
for representing event annotations and the problem of extracting complex event
structures is decomposed into a number of independent classification tasks.
These classification tasks are solved using SVM and RLS classifiers, utilizing
rich feature representations built from full dependency parsing. Building on
earlier work on pairwise relation extraction and using a generalized graph
representation, the resulting TEES system is capable of detecting binary
relations as well as complex event structures.
We show that this event extraction system has good performance,
reaching the first place in the BioNLP'09 Shared Task on Event Extraction. Subsequently,
TEES has achieved several first ranks in the BioNLP'11 and BioNLP'13 Shared
Tasks, as well as shown competitive performance in the binary relation Drug-Drug
Interaction Extraction 2011 and 2013 shared tasks.
The Turku Event Extraction System is published as a freely available open-source
project, documenting the research in detail as well as making the method
available for practical applications. In particular, in this thesis we
describe the application of the event extraction method to PubMed-scale text
mining, showing how the developed approach not only shows good performance, but
is generalizable and applicable to large-scale real-world text mining projects.
Finally, we discuss related literature, summarize the contributions of the work
and present some thoughts on future directions for biomedical event extraction.
This thesis includes and builds on six original research publications. The first
of these introduces the analysis of dependency parses that leads to
development of TEES. The entries in the three BioNLP Shared Tasks, as well as
in the DDIExtraction 2011 task are covered in four publications, and the sixth
one demonstrates the application of the system to PubMed-scale text mining.</p
- …