11 research outputs found

    Ontology-Based Interactive Information Extraction From Scientific Abstracts

    Over recent years, there has been a growing interest in extracting information automatically or semi-automatically from the scientific literature. This paper describes a novel ontology-based interactive information extraction (OBIIE) framework and a specific OBIIE system. We describe how this system enables life scientists to make ad hoc queries much as they would with a standard search engine, while obtaining the results in a database format similar to that of a pre-programmed information extraction engine. We present a case study in which the system was evaluated for extracting co-factors from EMBASE and MEDLINE.

    Text mining of full-text journal articles combined with gene expression analysis reveals a relationship between sphingosine-1-phosphate and invasiveness of a glioblastoma cell line

    BACKGROUND: Sphingosine 1-phosphate (S1P), a lysophospholipid, is involved in various cellular processes such as migration, proliferation, and survival. To date, the impact of S1P on human glioblastoma is not fully understood. Particularly, the concerted role played by matrix metalloproteinases (MMP) and S1P in aggressive tumor behavior and angiogenesis remains to be elucidated. RESULTS: To gain new insights into the effect of S1P on angiogenesis and invasion of this type of malignant tumor, we used microarrays to investigate the gene expression in glioblastoma as a response to S1P administration in vitro. We compared the expression profiles for the same cell lines under the influence of epidermal growth factor (EGF), an important growth factor. We found a set of 72 genes that are significantly differentially expressed as a unique response to S1P. Based on the result of mining full-text articles from 20 scientific journals in the field of cancer research published over a period of five years, we inferred gene-gene interaction networks for these 72 differentially expressed genes. Among the generated networks, we identified a particularly interesting one. It describes a cascading event, triggered by S1P, leading to the transactivation of MMP-9 via neuregulin-1 (NRG-1), vascular endothelial growth factor (VEGF), and the urokinase-type plasminogen activator (uPA). This interaction network has the potential to shed new light on our understanding of the role played by MMP-9 in invasive glioblastomas. CONCLUSION: Automated extraction of information from biological literature promises to play an increasingly important role in biological knowledge discovery. This is particularly true for high-throughput approaches, such as microarrays, and for combining and integrating data from different sources. Text mining may hold the key to unraveling previously unknown relationships between biological entities and could develop into an indispensable instrument in the process of formulating novel and potentially promising hypotheses.

    PASBio: predicate-argument structures for event extraction in molecular biology

    Background: The exploitation of information extraction (IE), a technology aiming to provide instances of structured representations from free-form text, has been rapidly growing within the molecular biology (MB) research community to keep track of the latest results reported in the literature. IE systems have traditionally used shallow syntactic patterns for matching facts in sentences, but such approaches appear inadequate to achieve high accuracy in MB event extraction due to complex sentence structure. A consensus in the IE community is emerging on the necessity for exploiting deeper knowledge structures, such as the relations between a verb and its arguments shown by predicate-argument structure (PAS). PAS is of interest as structures typically correspond to events of interest and their participating entities. For this to be realized within IE, a key knowledge component is the definition of PAS frames. PAS frames for non-technical domains such as newswire are already being constructed in several projects such as PropBank, VerbNet, and FrameNet. Knowledge from PAS should enable more accurate applications in several areas where sentence understanding is required, like machine translation and text summarization. In this article, we explore the need to adapt PAS for the MB domain and specify PAS frames to support IE, as well as outlining the major issues that require consideration in their construction. Results: We introduce PASBio by extending a model based on PropBank to the MB domain. The hypothesis we explore is that PAS holds the key for understanding relationships describing the roles of genes and gene products in mediating their biological functions. We chose predicates describing gene expression, molecular interactions and signal transduction events with the aim of covering a number of research areas in MB. Analysis was performed on sentences containing a set of verbal predicates from MEDLINE and full-text journals. Results confirm the necessity to analyze PAS specifically for the MB domain. Conclusions: At present PASBio contains the analyzed PAS of over 30 verbs, publicly available on the Internet for use in advanced applications. In the future we aim to expand the knowledge base to cover more verbs and the nominal form of each predicate.

    Nominalization and Alternations in Biomedical Language

    Background: This paper presents data on alternations in the argument structure of common domain-specific verbs and their associated verbal nominalizations in the PennBioIE corpus. Alternation is the term in theoretical linguistics for variations in the surface syntactic form of verbs, e.g. the different forms of "stimulate" in "FSH stimulates follicular development" and "follicular development is stimulated by FSH". The data is used to assess the implications of alternations for biomedical text mining systems and to test the fit of the sublanguage model to biomedical texts. Methodology/Principal Findings: We examined 1,872 tokens of the ten most common domain-specific verbs or their zero-related nouns in the PennBioIE corpus and labelled them for the presence or absence of three alternations. We then annotated the arguments of 746 tokens of the nominalizations related to these verbs and counted alternations related to the presence or absence of arguments and to the syntactic position of non-absent arguments. We found that alternations are quite common both for verbs and for nominalizations. We also found a previously undescribed alternation involving an adjectival present participle. Conclusions/Significance: We found that even in this semantically restricted domain, alternations are quite common, and alternations involving nominalizations are exceptionally diverse. Nonetheless, the sublanguage model applies to biomedica

    Ontology Enrichment from Free-text Clinical Documents: A Comparison of Alternative Approaches

    While the biomedical informatics community widely acknowledges the utility of domain ontologies, there remain many barriers to their effective use. One important requirement of domain ontologies is that they achieve a high degree of coverage of the domain concepts and concept relationships. However, the development of these ontologies is typically a manual, time-consuming, and often error-prone process. Limited resources result in missing concepts and relationships, as well as difficulty in updating the ontology as domain knowledge changes. Methodologies developed in the fields of Natural Language Processing (NLP), Information Extraction (IE), Information Retrieval (IR), and Machine Learning (ML) provide techniques for automating the enrichment of ontologies from free-text documents. In this dissertation, I extended these methodologies into biomedical ontology development. First, I reviewed existing methodologies and systems developed in the fields of NLP, IR, and IE, and discussed how existing methods can benefit the development of biomedical ontologies. This previously unconducted review was published in the Journal of Biomedical Informatics. Second, I compared the effectiveness of three methods from two different approaches, the symbolic (the Hearst method) and the statistical (the Church and Lin methods), using clinical free-text documents. Third, I developed a methodological framework for Ontology Learning (OL) evaluation and comparison. This framework permits evaluation of the two types of OL approaches that include three OL methods. The significance of this work is as follows: 1) The results from the comparative study showed the potential of these methods for biomedical ontology enrichment. For the two targeted domains (NCIT and RadLex), the Hearst method revealed an average of 21% and 11% new concept acceptance rates, respectively. The Lin method produced a 74% acceptance rate for NCIT; the Church method, 53%. As a result of this study (published in the Journal of Methods of Information in Medicine), many suggested candidates have been incorporated into the NCIT; 2) The evaluation framework is flexible and general enough that it can analyze the performance of ontology enrichment methods for many domains, thus expediting the process of automation and minimizing the likelihood that key concepts and relationships would be missed as domain knowledge evolves.

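    Determining Relationships Between Enzymatic Activity and Diseases Through the Automatic Analysis of Scientific Publications (original title: Ermittlung von Zusammenhängen zwischen enzymatischer Aktivität und Krankheiten durch die automatische Analyse wissenschaftlicher Publikationen)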

    Due to the rapid growth of biomedical data and the associated literature, it is becoming increasingly difficult even for experts to keep track of the current state of knowledge. Building and manually extending databases is expensive and time-consuming, but it can be supported by linguistic methods that automatically extract findings from the scientific literature. This dissertation presents such a method for annotating enzyme classes with disease-relevant information. The enzyme names of 3901 enzyme classes from BRENDA, a collection of qualitative and quantitative enzyme data, were identified in a text corpus of more than 100000 abstracts from the PubMed database. Phrases in the abstracts were mapped by the MetaMap program to concepts of the UMLS (Unified Medical Language System), which allowed disease-relevant terms to be identified by their semantic fields in the UMLS ontology. Enzyme classes were assigned to disease concepts on the basis of co-occurrence within a sentence. The number of false assignments was reduced by applying various filters. These included, among others, a minimum number of co-occurrences, the removal of sentences containing a negation, and the classification of unknown sentences by a Support Vector Machine. A check of the assignments against 1500 manually annotated sentences yielded a precision of 95%, which allowed the BRENDA database to be extended directly with the assignments found.
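
The sentence-level co-occurrence step and two of the filters named in the abstract above (a minimum number of co-occurrences and the removal of negated sentences) can be illustrated with a minimal Python sketch. This is not the dissertation's implementation: the tiny `ENZYMES` and `DISEASES` lexicons, the `NEGATIONS` cue list, the naive sentence splitter, and the function names are all assumptions made for illustration, standing in for BRENDA enzyme names and MetaMap/UMLS concept mapping. The Support Vector Machine filter is omitted.

```python
import re
from collections import Counter

# Hypothetical toy lexicons (assumptions); the real system draws enzyme names
# from BRENDA and disease concepts from UMLS via MetaMap.
ENZYMES = {"catalase", "lactate dehydrogenase"}
DISEASES = {"diabetes", "anemia"}
NEGATIONS = {"no", "not", "neither", "nor", "without"}  # assumed cue words


def split_sentences(text):
    """Naive splitter: break after '.', '!' or '?' followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def has_negation(sentence):
    """True if any assumed negation cue appears as a whole word."""
    tokens = set(re.findall(r"[a-z]+", sentence.lower()))
    return bool(tokens & NEGATIONS)


def cooccurrences(abstracts, min_count=2):
    """Count enzyme-disease pairs mentioned in the same non-negated sentence,
    keeping only pairs seen at least min_count times."""
    counts = Counter()
    for abstract in abstracts:
        for sentence in split_sentences(abstract):
            if has_negation(sentence):
                continue  # filter: drop sentences containing a negation
            lowered = sentence.lower()
            enzymes = [e for e in ENZYMES if e in lowered]
            diseases = [d for d in DISEASES if d in lowered]
            for enzyme in enzymes:
                for disease in diseases:
                    counts[(enzyme, disease)] += 1
    # filter: require a minimum number of co-occurrences
    return {pair: n for pair, n in counts.items() if n >= min_count}


if __name__ == "__main__":
    docs = [
        "Catalase activity is reduced in diabetes. "
        "Lactate dehydrogenase was not associated with anemia.",
        "Low catalase levels were observed in patients with diabetes.",
        "Lactate dehydrogenase is elevated in anemia.",
    ]
    print(cooccurrences(docs, min_count=2))
```

Substring matching against a lexicon is the simplest possible stand-in for named-entity recognition; a real pipeline would need synonym handling and concept normalization, which is what MetaMap provides in the work described.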

    Extracting molecular binding relationships from biomedical text
