
    Event extraction from biomedical texts using trimmed dependency graphs

    This thesis explores the automatic extraction of information from biomedical publications. Such techniques are urgently needed because the biosciences publish continually increasing numbers of texts. The focus of this work is on events. Information about events is currently curated manually from the literature by biocurators. Biocuration, however, is time-consuming and costly, so automatic extraction methods are needed. This thesis is dedicated to modeling, implementing and evaluating an advanced event extraction approach based on the analysis of syntactic dependency graphs. It presents the proposed event extraction approach and its implementation, the JReX (Jena Relation eXtraction) system. This system was used by the University of Jena (JULIE Lab) team in the "BioNLP 2009 Shared Task on Event Extraction" competition, where it ranked second among 24 competing teams. Thereafter, JReX was the highest scorer on the worldwide shared U-Compare event extraction server, outperforming the competing systems from the challenge. This success was made possible, among other things, by extensive research on event extraction solutions carried out during this thesis, e.g., exploring the effects of syntactic and semantic processing procedures on solving the event extraction task. The evaluations on standard, community-accepted competition data were complemented by a real-life evaluation of large-scale biomedical database reconstruction. This work showed that considerable parts of manually curated databases can be re-created automatically with the help of the event extraction approach developed; successful re-creation was possible for parts of RegulonDB, the world's largest database for E. coli. In summary, the event extraction approach justified, developed and implemented in this thesis meets the needs of a large community of human curators and thus helps in the acquisition of new knowledge in the biosciences.
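    As a rough illustration of the dependency-graph idea (a sketch only, not the JReX algorithm), an extraction system can trim a sentence's dependency parse down to the shortest path connecting an event trigger and a candidate argument before classifying the pair. In the snippet below, the sentence, tokens, edge labels and indices are hypothetical, and the networkx package is assumed.

    # A minimal sketch of dependency-graph trimming, not the JReX implementation.
    import networkx as nx

    # Toy dependency parse of "TNF strongly inhibits mutant p53": (head, dependent, label)
    edges = [(2, 0, "nsubj"), (2, 1, "advmod"), (2, 4, "dobj"), (4, 3, "amod")]
    graph = nx.Graph()
    for head, dependent, label in edges:
        graph.add_edge(head, dependent, label=label)

    trigger, argument = 2, 4                           # "inhibits" and "p53"
    path = nx.shortest_path(graph, trigger, argument)  # nodes on the trimmed path
    trimmed = graph.subgraph(path)                     # parse restricted to that path
    print(path, list(trimmed.edges(data=True)))        # modifiers like "strongly" are dropped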

    Computer-assisted curation of a human regulatory core network from the biological literature

    Motivation: A highly interlinked network of transcription factors (TFs) orchestrates the context-dependent expression of human genes. ChIP-chip experiments that interrogate the binding of particular TFs to genomic regions are used to reconstruct gene regulatory networks at genome scale, but are plagued by high false-positive rates. Meanwhile, a large body of knowledge on high-quality regulatory interactions remains largely unexplored, as it is available only in natural-language descriptions scattered over millions of scientific publications. Such data are hard to extract, and regulatory databases together currently contain only 503 regulatory relations between human TFs. Results: We developed a text-mining-assisted workflow to systematically extract knowledge about regulatory interactions between human TFs from the biological literature. We applied this workflow to the entire Medline, which helped us to identify more than 45 000 sentences potentially describing such relationships. We ranked these sentences with a machine-learning approach. The top 2500 sentences contained ∼900 sentences that encompass relations already known in databases. By manually curating the remaining 1625 top-ranking sentences, we obtained more than 300 validated regulatory relationships that were not previously present in a regulatory database. Full-text curation allowed us to obtain detailed information on the strength of the experimental evidence supporting a relationship. Conclusions: We were able to increase curated information about the human core transcriptional network by >60% compared with the current content of regulatory databases. We observed improved performance when using the network for disease gene prioritization compared with the state of the art. Availability and implementation: The web service is freely accessible at http://fastforward.sys-bio.net/.
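    The sentence-ranking step described above can be sketched roughly as follows, assuming scikit-learn; the training sentences, labels and candidate sentences are tiny illustrative placeholders, not the data, features or model used in the paper.

    # Minimal sketch: rank candidate sentences by the probability that they
    # describe a TF-TF regulatory interaction (assumes scikit-learn).
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.linear_model import LogisticRegression
    from sklearn.pipeline import make_pipeline

    train_sentences = [
        "STAT3 activates the transcription of FOXP3.",
        "GATA1 represses expression of PU.1 in erythroid cells.",
        "The samples were incubated overnight at 4 degrees.",
        "Cells were lysed and proteins separated by SDS-PAGE.",
    ]
    labels = [1, 1, 0, 0]  # 1 = sentence describes a regulatory interaction

    ranker = make_pipeline(TfidfVectorizer(ngram_range=(1, 2)), LogisticRegression())
    ranker.fit(train_sentences, labels)

    candidates = ["NF-kB induces expression of IRF4.", "The buffer contained 10 mM Tris."]
    scores = ranker.predict_proba(candidates)[:, 1]
    ranked = sorted(zip(scores, candidates), reverse=True)  # highest-scoring first, for curation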

    Biomolecular Event Extraction using Natural Language Processing

    Biomedical research and discoveries are communicated through scholarly publications, and this literature is voluminous, rich in scientific text and growing exponentially by the day. Biomedical journals publish nearly three thousand research articles daily, making literature search a challenging proposition for researchers. Biomolecular events involve genes, proteins, metabolites and enzymes, and provide invaluable insights into biological processes and their underlying physiological mechanisms. Text mining (TM), the automatic extraction of such events from this big data, is the only quick and viable way to gather useful information. Events extracted from the biological literature have a broad range of applications, such as database curation, ontology construction, semantic web search and interactive systems. However, automatic extraction is challenging on account of the ambiguity and diversity of natural language and of associated linguistic phenomena such as speculation and negation, which commonly occur in biomedical texts and lead to erroneous interpretation. In the last decade, many strategies have been proposed in this field, using different paradigms such as biomedical natural language processing (BioNLP), machine learning and deep learning. In addition, new parallel computing architectures such as graphics processing units (GPUs) have emerged as candidates to accelerate the event extraction pipeline. This paper reviews and summarizes the key approaches to complex biomolecular event extraction from big data and recommends an architecture balanced in terms of accuracy, speed, computational cost and memory usage for developing a robust GPU-accelerated BioNLP system.
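    As a rough sketch of the kind of GPU acceleration recommended above, assuming PyTorch: a toy event-trigger classifier processes feature batches on a GPU when one is available and falls back to the CPU otherwise. The model, batch size and feature dimensions are placeholders, not an architecture from the reviewed work.

    # Minimal sketch of batched, GPU-accelerated inference (assumes PyTorch).
    import torch
    import torch.nn as nn

    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    classifier = nn.Sequential(            # toy trigger classifier: 128 features -> 2 classes
        nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 2)
    ).to(device)

    features = torch.randn(4096, 128)      # placeholder sentence/token feature vectors
    with torch.no_grad():
        for batch in features.split(512):                 # process in GPU-sized batches
            logits = classifier(batch.to(device))
            predictions = logits.argmax(dim=1).cpu()      # move results back to the CPU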

    Negated bio-events: Analysis and identification

    Background: Negation occurs frequently in scientific literature, especially in biomedical literature. It has previously been reported that around 13% of sentences found in biomedical research articles contain negation. Historically, the main motivation for identifying negated events has been to ensure their exclusion from lists of extracted interactions. However, recently, there has been a growing interest in negative results, which has resulted in negation detection being identified as a key challenge in biomedical relation extraction. In this article, we focus on the problem of identifying negated bio-events, given gold standard event annotations. Results: We have conducted a detailed analysis of three open access bio-event corpora containing negation information (i.e., GENIA Event, BioInfer and BioNLP'09 ST), and have identified the main types of negated bio-events. We have analysed the key aspects of a machine learning solution to the problem of detecting negated events, including selection of negation cues, feature engineering and the choice of learning algorithm. Combining the best solutions for each aspect of the problem, we propose a novel framework for the identification of negated bio-events. We have evaluated our system on each of the three open access corpora mentioned above. The performance of the system significantly surpasses the best results previously reported on the BioNLP'09 ST corpus, and achieves even better results on the GENIA Event and BioInfer corpora, both of which contain more varied and complex events. Conclusions: Recently, in the field of biomedical text mining, the development and enhancement of event-based systems has received significant interest. The ability to identify negated events is a key performance element for these systems. We have conducted the first detailed study on the analysis and identification of negated bio-events. Our proposed framework can be integrated with state-of-the-art event extraction systems. The resulting systems will be able to extract bio-events with attached polarities from textual documents, which can serve as the foundation for more elaborate systems that are able to detect mutually contradicting bio-events. © 2013 Nawaz et al.; licensee BioMed Central Ltd.
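    A highly simplified sketch of cue-based negation flagging is shown below. It is not the authors' framework, which combines cue selection with feature engineering and a learned classifier; the negation lexicon, window size and example sentence here are hypothetical.

    # Minimal sketch: flag an event as negated if a cue from a small lexicon
    # appears near the (gold standard) event trigger.
    NEGATION_CUES = {"not", "no", "failed", "unable", "absence", "lack", "neither"}

    def is_negated(tokens, trigger_index, window=4):
        """Check a fixed-size token window around the event trigger for a negation cue."""
        start = max(0, trigger_index - window)
        end = min(len(tokens), trigger_index + window + 1)
        return any(token.lower() in NEGATION_CUES for token in tokens[start:end])

    sentence = "IL-2 stimulation failed to induce STAT5 phosphorylation".split()
    print(is_negated(sentence, trigger_index=4))  # trigger "induce" -> True ("failed" in window)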

    Developing Ontological Background Knowledge for Biomedicine

    Biomedicine is an impressively fast-developing, interdisciplinary field of research. To control the growing volumes of biomedical data, ontologies are increasingly used as common organization structures. Biomedical ontologies describe domain knowledge in a formal, computationally accessible way. They serve as controlled vocabularies and background knowledge in applications dealing with the integration, analysis and retrieval of heterogeneous types of data. The development of biomedical ontologies, however, is hampered by specific challenges. These include the lack of quality standards, resulting in very heterogeneous resources, and the decentralized development of biomedical ontologies, causing the increasing fragmentation of domain knowledge across them. In the first part of this thesis, a life cycle model for biomedical ontologies is developed, which is intended to cope with these challenges. It comprises the stages "requirements analysis", "design and implementation", "evaluation", "documentation and release" and "maintenance". For each stage, associated subtasks and activities are specified. To promote quality standards for biomedical ontology development, emphasis is placed on the evaluation stage. As part of it, comprehensive evaluation procedures are specified, which make it possible to assess the quality of ontologies on various levels. To tackle the issue of knowledge fragmentation, the life cycle model is extended to also cover ontology alignments. Ontology alignments specify mappings between related elements of different ontologies. By making potential overlaps and similarities between ontologies explicit, they support the integration of ontologies and help reduce the fragmentation of knowledge. In the second part of this thesis, the life cycle model for biomedical ontologies and alignments is validated by means of five case studies. As a result, they confirm that the model is effective. Four of the case studies demonstrate that it is able to support the development of useful new ontologies and alignments. The latter facilitate novel natural language processing and bioinformatics applications, and in one case constitute the basis of a task in the "BioNLP Shared Task 2013", an international challenge on biomedical information extraction. The fifth case study shows that the presented evaluation procedures are an effective means to check and improve the quality of ontology alignments. Hence, they support the crucial task of quality assurance for alignments, which are themselves increasingly used as reference standards in evaluations of automatic ontology alignment systems. Both the presented life cycle model and the ontologies and alignments that resulted from its validation improve information and knowledge management in biomedicine and thus promote biomedical research.
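    As a rough illustration of what an ontology alignment candidate looks like in practice (a sketch only, not the procedures developed in the thesis), the snippet below proposes mappings between two tiny hypothetical ontologies by comparing term labels with a simple string similarity; real alignment systems rely on much richer lexical and structural evidence.

    # Minimal sketch of label-based alignment candidates (standard library only).
    from difflib import SequenceMatcher

    ontology_a = {"GO:0006915": "apoptotic process", "GO:0008283": "cell proliferation"}
    ontology_b = {"MESH:D017209": "apoptosis", "MESH:D049109": "cell proliferation"}

    def label_similarity(a, b):
        """Crude similarity between two term labels, in [0, 1]."""
        return SequenceMatcher(None, a.lower(), b.lower()).ratio()

    candidate_mappings = [
        (id_a, id_b, round(label_similarity(label_a, label_b), 2))
        for id_a, label_a in ontology_a.items()
        for id_b, label_b in ontology_b.items()
        if label_similarity(label_a, label_b) > 0.8
    ]
    print(candidate_mappings)  # e.g. [('GO:0008283', 'MESH:D049109', 1.0)]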