Event extraction from biomedical texts using trimmed dependency graphs
This thesis explores the automatic extraction of information from biomedical publications. Such techniques are urgently needed because the biosciences are publishing continually increasing numbers of texts. The focus of this work is on events. Information about events is currently curated manually from the literature by biocurators; biocuration, however, is time-consuming and costly, so automatic methods for information extraction from the literature are needed. This thesis is dedicated to modeling, implementing and evaluating an advanced event extraction approach based on the analysis of syntactic dependency graphs. It presents the proposed event extraction approach and its implementation, the JReX (Jena Relation eXtraction) system. This system was used by the University of Jena (JULIE Lab) team in the "BioNLP 2009 Shared Task on Event Extraction" competition, where it was ranked second among 24 competing teams. Thereafter, JReX was the highest scorer on the worldwide shared U-Compare event extraction server, outperforming the competing systems from the challenge. This success was made possible, among other things, by extensive research on event extraction solutions carried out during this thesis, e.g., exploring the effects of syntactic and semantic processing procedures on solving the event extraction task. The evaluations executed on standard, community-accepted competition data were complemented by a real-life evaluation: large-scale reconstruction of a biomedical database. This work showed that considerable parts of manually curated databases can be automatically re-created with the help of the event extraction approach developed; successful re-creation was possible for parts of RegulonDB, the world's largest database for E. coli. In summary, the event extraction approach justified, developed and implemented in this thesis meets the needs of a large community of human curators and thus helps in the acquisition of new knowledge in the biosciences.
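The core idea behind analysing syntactic dependency graphs can be pictured with a minimal sketch (this is not JReX's actual implementation): find the shortest dependency path connecting an event trigger to a candidate argument, the subgraph that trimming approaches then prune further. The toy parse, edge labels and function name below are all invented for illustration.

```python
from collections import deque

def shortest_dependency_path(edges, source, target):
    """BFS over an undirected view of a dependency graph.

    edges: list of (head, dependent, relation_label) triples.
    Returns the token sequence from source to target, or None.
    """
    adj = {}
    for head, dep, label in edges:
        adj.setdefault(head, []).append((dep, label))
        adj.setdefault(dep, []).append((head, label))

    queue = deque([(source, [source])])
    seen = {source}
    while queue:
        node, path = queue.popleft()
        if node == target:
            return path
        for nxt, _label in adj.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [nxt]))
    return None

# Hypothetical parse of "IL-2 inhibits expression of p53":
edges = [
    ("inhibits", "IL-2", "nsubj"),
    ("inhibits", "expression", "dobj"),
    ("expression", "p53", "prep_of"),
]
print(shortest_dependency_path(edges, "inhibits", "p53"))
# ['inhibits', 'expression', 'p53']
```

The path from trigger to argument typically carries the grammatical relations that an extraction system uses as features, which is why trimming away the rest of the graph loses little signal.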
Computer-assisted curation of a human regulatory core network from the biological literature
Motivation: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases together currently contain only 503 regulatory relations between human TFs.
Results: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences by a machine-learning approach. The top 2500 sentences contained ∼900 sentences that encompass relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not previously present in a regulatory database. Full-text curation allowed us to obtain detailed information on the strength of the experimental evidence supporting a relationship.
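The ranking step can be pictured with a deliberately simplified sketch. The paper's actual machine-learning model is not reproduced here; this toy scorer merely weights hand-picked cue terms and sorts candidate sentences, and all weights and example sentences are invented.

```python
# Hypothetical cue-term weights; a real ranker would learn these
# from labeled training sentences.
CUE_WEIGHTS = {
    "activates": 2.0,
    "represses": 2.0,
    "regulates": 1.5,
    "binds": 1.0,
    "promoter": 1.0,
}

def score(sentence):
    """Sum the weights of cue terms appearing in the sentence."""
    return sum(CUE_WEIGHTS.get(t, 0.0) for t in sentence.lower().split())

sentences = [
    "STAT3 activates the IL-6 promoter in hepatocytes",
    "The weather was recorded daily",
    "FOXP3 regulates T-cell differentiation",
]

# Curators would then read the list from the top down.
ranked = sorted(sentences, key=score, reverse=True)
```

Even this crude scheme illustrates the workflow's economics: curators spend their time on the highest-scoring sentences, where true regulatory statements are concentrated.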
Conclusions: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state-of-the-art.
Availability and implementation: The Web service is freely accessible at http://fastforward.sys-bio.net/.
Biomolecular Event Extraction using Natural Language Processing
Biomedical research and discoveries are communicated through scholarly publications, and this literature is voluminous, rich in scientific text and growing exponentially by the day. Biomedical journals publish nearly three thousand research articles daily, making literature search a challenging proposition for researchers. Biomolecular events involve genes, proteins, metabolites and enzymes, and provide invaluable insights into biological processes and their physiological functional mechanisms. Text mining (TM), i.e., the automatic extraction of such events from big data, is the only quick and viable way to gather this information. Events extracted from the biological literature have a broad range of applications, such as database curation, ontology construction, semantic web search and interactive systems. However, automatic extraction has its challenges on account of the ambiguity and diversity of natural language and associated linguistic phenomena such as speculation and negation, which commonly occur in biomedical texts and lead to erroneous interpretation. In the last decade, many strategies have been proposed in this field, using paradigms such as biomedical natural language processing (BioNLP), machine learning and deep learning. In addition, parallel computing architectures such as graphics processing units (GPUs) have emerged as possible candidates to accelerate the event extraction pipeline. This paper reviews and summarizes the key approaches to complex biomolecular big-data event extraction and recommends an architecture balanced in terms of accuracy, speed, computational cost and memory usage for developing a robust GPU-accelerated BioNLP system.
Negated bio-events: Analysis and identification
Background: Negation occurs frequently in scientific literature, especially in biomedical literature. It has previously been reported that around 13% of sentences found in biomedical research articles contain negation. Historically, the main motivation for identifying negated events has been to ensure their exclusion from lists of extracted interactions. However, recently, there has been a growing interest in negative results, which has resulted in negation detection being identified as a key challenge in biomedical relation extraction. In this article, we focus on the problem of identifying negated bio-events, given gold standard event annotations.
Results: We have conducted a detailed analysis of three open access bio-event corpora containing negation information (i.e., GENIA Event, BioInfer and BioNLP'09 ST), and have identified the main types of negated bio-events. We have analysed the key aspects of a machine learning solution to the problem of detecting negated events, including selection of negation cues, feature engineering and the choice of learning algorithm. Combining the best solutions for each aspect of the problem, we propose a novel framework for the identification of negated bio-events. We have evaluated our system on each of the three open access corpora mentioned above. The performance of the system significantly surpasses the best results previously reported on the BioNLP'09 ST corpus, and achieves even better results on the GENIA Event and BioInfer corpora, both of which contain more varied and complex events.
Conclusions: Recently, in the field of biomedical text mining, the development and enhancement of event-based systems has received significant interest. The ability to identify negated events is a key performance element for these systems. We have conducted the first detailed study on the analysis and identification of negated bio-events. Our proposed framework can be integrated with state-of-the-art event extraction systems. The resulting systems will be able to extract bio-events with attached polarities from textual documents, which can serve as the foundation for more elaborate systems that are able to detect mutually contradicting bio-events. © 2013 Nawaz et al.; licensee BioMed Central Ltd.
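As a rough illustration of cue-based negation detection (not the authors' framework, which combines learned features with cue selection), a windowed cue matcher over an event trigger might look like the sketch below; the cue list, window size and example sentences are all assumptions.

```python
# Hypothetical negation cue lexicon; real systems select cues
# empirically from annotated corpora.
NEGATION_CUES = {
    "not", "no", "neither", "nor", "cannot", "without",
    "fail", "fails", "failed", "unable", "lack", "absence",
}

def is_negated(sentence, trigger, window_size=4):
    """Flag an event trigger as negated if a cue appears within a
    small token window before it (a crude proxy for syntactic scope)."""
    tokens = sentence.lower().split()
    if trigger not in tokens:
        return False
    i = tokens.index(trigger)
    window = tokens[max(0, i - window_size):i]
    return any(t in NEGATION_CUES for t in window)

print(is_negated("IL-2 does not induce expression of p53", "induce"))   # True
print(is_negated("IL-2 induces expression of p53", "induces"))          # False
```

The fixed window is exactly where such heuristics break down, which is why the corpus analysis of cue types and scopes described above matters for a learned solution.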
Developing Ontological Background Knowledge for Biomedicine
Biomedicine is an impressively fast developing, interdisciplinary field of
research. To control the growing volumes of biomedical data, ontologies are
increasingly used as common organization structures. Biomedical ontologies
describe domain knowledge in a formal, computationally accessible way. They
serve as controlled vocabularies and background knowledge in applications
dealing with the integration, analysis and retrieval of heterogeneous types
of data. The development of biomedical ontologies, however, is hampered by
specific challenges. They include the lack of quality standards, resulting
in very heterogeneous resources, and the decentralized development of
biomedical ontologies, causing the increasing fragmentation of domain
knowledge across them.
In the first part of this thesis, a life cycle model for biomedical
ontologies is developed, which is intended to cope with these challenges.
It comprises the stages "requirements analysis", "design and
implementation", "evaluation", "documentation and release" and
"maintenance". For each stage, associated subtasks and activities are
specified. To promote quality standards for biomedical ontology
development, an emphasis is set on the evaluation stage. As part of it,
comprehensive evaluation procedures are specified, which make it possible
to assess the quality of ontologies at various levels. To tackle the issue of
knowledge fragmentation, the life cycle model is extended to also cover
ontology alignments. Ontology alignments specify mappings between related
elements of different ontologies. By making potential overlaps and
similarities between ontologies explicit, they support the integration of
ontologies and help reduce the fragmentation of knowledge.
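A minimal sketch of what an ontology alignment computes, assuming a simple
label-similarity heuristic (real alignment systems use far richer evidence
such as structure and synonyms; the identifiers and threshold below are
invented examples):

```python
import difflib

# Two toy ontology fragments: identifier -> preferred label.
onto_a = {"GO:0006915": "apoptotic process", "GO:0008219": "cell death"}
onto_b = {"MESH:D017209": "Apoptosis", "MESH:D002908": "Cell Death"}

def align(a, b, threshold=0.8):
    """Greedy label-similarity alignment (illustrative only).

    Returns (id_a, id_b, similarity) mappings whose label
    similarity meets the threshold.
    """
    mappings = []
    for id_a, label_a in a.items():
        def sim(label_b):
            return difflib.SequenceMatcher(
                None, label_a.lower(), label_b.lower()).ratio()
        best_id, best_label = max(b.items(), key=lambda kv: sim(kv[1]))
        ratio = sim(best_label)
        if ratio >= threshold:
            mappings.append((id_a, best_id, round(ratio, 2)))
    return mappings

print(align(onto_a, onto_b))
```

Making such mappings explicit is what lets applications treat overlapping
ontologies as one integrated resource instead of fragmented vocabularies.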
In the second part of this thesis, the life cycle model for biomedical
ontologies and alignments is validated by means of five case studies, which
confirm that the model is effective. Four of the case studies demonstrate
that it supports the development of useful new ontologies and alignments.
These resources facilitate novel natural language processing and
bioinformatics applications, and in one case form the basis of a task in
the "BioNLP Shared Task 2013", an international
challenge on biomedical information extraction. The fifth case study shows
that the presented evaluation procedures are an effective means to check
and improve the quality of ontology alignments. Hence, they support the
crucial task of quality assurance of alignments, which are themselves
increasingly used as reference standards in evaluations of automatic
ontology alignment systems. Both the presented life cycle model and the
ontologies and alignments that have resulted from its validation improve
information and knowledge management in biomedicine and thus promote
biomedical research.