
    Text Mining for Pathway Curation

    Biological knowledge often involves understanding the interactions between molecules, such as proteins and genes, that form functional networks called pathways. New knowledge about pathways is typically communicated through publications and later condensed into structured formats such as textbooks, pathway databases or mathematical models. However, curating updated pathway models can be labour-intensive due to the growing volume of publications. This thesis investigates text mining methods to support pathway curation. We present PEDL (Protein-Protein-Association Extraction with Deep Language Models), a machine learning model designed to extract protein-protein associations (PPAs) from biomedical text. PEDL uses distant supervision and pre-trained language models to achieve higher accuracy than the state of the art. An expert evaluation confirms its usefulness for pathway curators. We also present PEDL+, a command-line tool that allows non-expert users to efficiently extract PPAs. When applied to pathway curation tasks, 55.6% to 79.6% of PEDL+ extractions were found useful by curators. The large number of PPAs identified by text mining can be overwhelming for researchers. To help, we present PathComplete, a model that suggests potential extensions to a pathway. It is the first method based on supervised machine learning for this task, using transfer learning from pathway databases. Our evaluations show that PathComplete significantly outperforms existing methods. Finally, we generalise pathway extension from PPAs to more realistic complex events. Here, our novel method for conditional graph modification outperforms the current best by 13-24% accuracy on three benchmarks. We also present a new dataset for event-based pathway extension. Overall, our results show that deep learning-based information extraction is a promising basis for supporting pathway curators.
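    The abstract gives no implementation details, but the core technique it names, distant supervision combined with a pre-trained language model, can be illustrated with a minimal sketch. Everything below is an assumption for illustration: the toy association database, the entity-marking scheme, and the bert-base-uncased checkpoint are stand-ins, not PEDL's actual configuration.

```python
# Minimal sketch of distant supervision for protein-protein association (PPA)
# extraction: sentences mentioning a protein pair that is already linked in a
# pathway database are treated as (noisy) positive training examples, which a
# pre-trained language model then learns to classify.
# Illustrative only; not PEDL's actual code or configuration.
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# Hypothetical knowledge base of known associations (e.g. from a pathway database).
known_ppas = {("MAP2K1", "MAPK1"), ("TP53", "MDM2")}

def distant_label(prot_a: str, prot_b: str) -> int:
    """Label 1 if the pair is a known association, else 0 (labels are noisy by design)."""
    return int((prot_a, prot_b) in known_ppas or (prot_b, prot_a) in known_ppas)

sentences = [
    ("MAP2K1 phosphorylates MAPK1 upon growth factor stimulation.", "MAP2K1", "MAPK1"),
    ("TP53 and GAPDH were both measured in the assay.", "TP53", "GAPDH"),
]

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModelForSequenceClassification.from_pretrained("bert-base-uncased", num_labels=2)

for text, a, b in sentences:
    label = distant_label(a, b)
    # Mark the candidate proteins so the model knows which pair the label refers to.
    marked = text.replace(a, f"<e1>{a}</e1>").replace(b, f"<e2>{b}</e2>")
    inputs = tokenizer(marked, return_tensors="pt", truncation=True)
    logits = model(**inputs).logits
    print(label, logits.softmax(-1).tolist())
```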

    Hey Article, What Are You About? Question Answering for Information Systems Articles through Transformer Models for Long Sequences

    Question Answering (QA) systems can significantly reduce the manual effort of searching for relevant information. However, challenges arise from a lack of domain specificity and from the fact that QA systems usually retrieve answers from short text passages rather than long scientific articles. We aim to address these challenges by (1) exploring the use of transformer models for long-sequence processing, (2) performing domain adaptation for the Information Systems (IS) discipline, and (3) developing novel techniques by performing domain adaptation in multiple training phases. Our models were pre-trained on a corpus of 2 million sentences retrieved from 3,463 articles from the Senior Scholars' Basket and fine-tuned on SQuAD and a manually created set of 500 QA pairs from the IS field. In six experiments, we tested two transfer learning techniques for fine-tuning (TANDA and FANDO). The results show that fine-tuning with task-specific domain knowledge considerably increases the models' F1 and Exact Match scores.
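    The paper's exact TANDA/FANDO configurations and long-sequence models are not described in this abstract; the sketch below only illustrates the underlying idea of multi-phase fine-tuning for extractive QA, first on a large general dataset (SQuAD), then on a small in-domain set. The checkpoint, the subset size, and the hypothetical is_qa_pairs.json file are illustrative assumptions.

```python
# Sketch of two-phase transfer learning for extractive QA: fine-tune first on a
# large general dataset (SQuAD), then on a small domain-specific QA set.
# Checkpoint and hyperparameters are illustrative, not the paper's actual setup.
from datasets import load_dataset
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

checkpoint = "bert-base-uncased"  # a long-sequence model (e.g. Longformer) could be swapped in
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForQuestionAnswering.from_pretrained(checkpoint)

def preprocess(examples):
    # Tokenise question/context pairs and map the answer's character span to token positions.
    enc = tokenizer(examples["question"], examples["context"], truncation="only_second",
                    max_length=384, padding="max_length", return_offsets_mapping=True)
    starts, ends = [], []
    for i, ans in enumerate(examples["answers"]):
        a_start = ans["answer_start"][0]
        a_end = a_start + len(ans["text"][0])
        tok_start = tok_end = 0  # fall back to [CLS] if the answer was truncated away
        for t, (offset, seq_id) in enumerate(zip(enc["offset_mapping"][i], enc.sequence_ids(i))):
            if seq_id == 1 and offset[0] <= a_start < offset[1]:
                tok_start = t
            if seq_id == 1 and offset[0] < a_end <= offset[1]:
                tok_end = t
        starts.append(tok_start)
        ends.append(tok_end)
    enc["start_positions"], enc["end_positions"] = starts, ends
    enc.pop("offset_mapping")
    return enc

# Phase 1: general QA knowledge from SQuAD (small subset here to keep the sketch fast).
squad = load_dataset("squad")["train"].select(range(2000)).map(preprocess, batched=True)
Trainer(model=model, args=TrainingArguments("phase1_squad", num_train_epochs=1),
        train_dataset=squad).train()

# Phase 2: adapt on the in-domain QA pairs (hypothetical file in SQuAD-like format).
# is_qa = load_dataset("json", data_files="is_qa_pairs.json")["train"].map(preprocess, batched=True)
# Trainer(model=model, args=TrainingArguments("phase2_is", num_train_epochs=3),
#         train_dataset=is_qa).train()
```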

    Improving Clinical Document Understanding on COVID-19 Research with Spark NLP

    Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literature review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types, including social determinants of health, anatomy, risk factors, and adverse events, in addition to other commonly used clinical and biomedical entities. Second, the text processing pipeline includes assertion status detection to distinguish between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than those previously available, leveraging an integrated pipeline of state-of-the-art pretrained named entity recognition models and improving on the previous best-performing benchmarks for assertion status detection. We illustrate extracting trends and insights, e.g. the most frequent disorders and symptoms and the most common vital signs and EKG findings, from the COVID-19 Open Research Dataset (CORD-19). The system is built using the Spark NLP library, which natively supports scaling to distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare-specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.
    Comment: Accepted to the SDU (Scientific Document Understanding) workshop at AAAI 202
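    As a rough illustration of what such a Spark NLP pipeline looks like, the sketch below assembles the standard open-source stages (document assembly, sentence detection, tokenisation, embeddings, NER). The pretrained model names used here ("glove_100d", "ner_dl") are general-purpose open-source stand-ins; the clinical entity and assertion status models described in the paper ship with the separate healthcare edition of the library and are not reproduced here.

```python
# Minimal Spark NLP NER pipeline sketch using open-source components only.
# The clinical NER and assertion models from the paper belong to the licensed
# healthcare library and are merely alluded to here.
import sparknlp
from sparknlp.base import DocumentAssembler
from sparknlp.annotator import (SentenceDetector, Tokenizer,
                                WordEmbeddingsModel, NerDLModel, NerConverter)
from pyspark.ml import Pipeline

spark = sparknlp.start()

document = DocumentAssembler().setInputCol("text").setOutputCol("document")
sentences = SentenceDetector().setInputCols(["document"]).setOutputCol("sentence")
tokens = Tokenizer().setInputCols(["sentence"]).setOutputCol("token")
embeddings = (WordEmbeddingsModel.pretrained("glove_100d")
              .setInputCols(["sentence", "token"]).setOutputCol("embeddings"))
ner = (NerDLModel.pretrained("ner_dl")
       .setInputCols(["sentence", "token", "embeddings"]).setOutputCol("ner"))
chunks = (NerConverter()
          .setInputCols(["sentence", "token", "ner"]).setOutputCol("ner_chunk"))

pipeline = Pipeline(stages=[document, sentences, tokens, embeddings, ner, chunks])

data = spark.createDataFrame(
    [["The patient denies chest pain but reports fever and shortness of breath."]]
).toDF("text")
result = pipeline.fit(data).transform(data)
result.selectExpr("explode(ner_chunk.result) as entity").show(truncate=False)
```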

    ChemNLP: A Natural Language Processing based Library for Materials Chemistry Text Data

    In this work, we present the ChemNLP library, which can be used for 1) curating open-access datasets of materials and chemistry literature; for developing and comparing traditional machine learning, transformer, and graph neural network models for 2) classifying and clustering texts, 3) named entity recognition for large-scale text mining, 4) abstractive summarization for generating article titles from abstracts, and 5) text generation for suggesting abstracts from titles; 6) integration with a density functional theory dataset for identifying potential candidate materials such as superconductors; and 7) web-interface development for text and reference queries. We primarily use the publicly available arXiv and PubChem datasets, but the tools can be applied to other datasets as well. Moreover, as new models are developed, they can easily be integrated into the library. ChemNLP is available at https://github.com/usnistgov/chemnlp and https://jarvis.nist.gov/jarvischemnlp.
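    Of the capabilities listed above, abstractive title generation is easy to illustrate with off-the-shelf tools. The sketch below uses a generic summarisation model as a stand-in; the checkpoint and the example abstract are assumptions, and ChemNLP's own models, training data, and APIs are not reproduced here.

```python
# Generating a short, title-like summary from an abstract with a generic
# seq2seq summarisation model (a stand-in, not ChemNLP's own model).
from transformers import pipeline

summarizer = pipeline("summarization", model="t5-small")

abstract = ("We report the synthesis and characterization of a layered boride that "
            "exhibits superconductivity below 39 K. Density functional theory "
            "calculations suggest strong electron-phonon coupling.")

# A short maximum length pushes the output toward a title-like phrase.
title = summarizer(abstract, max_length=16, min_length=4)[0]["summary_text"]
print(title)
```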

    Born-reusable scientific knowledge: Concept, implementation, and applications

    The exponential growth of scientific literature presents a significant challenge to effectively reading, processing, and fully comprehending the wealth of scientific knowledge. The Open Research Knowledge Graph (ORKG) aims to address this challenge by providing infrastructure that aligns with the FAIR principles to support the creation, curation, and utilization of scientific knowledge. Nevertheless, the current dependence on crowdsourcing and natural language processing (NLP) for post-publication knowledge extraction restricts the scalability and quality of such knowledge bases. In response to these challenges, we present a novel 'born-reusable' approach that seeks to create richly-detailed, machine-reusable descriptions of papers directly within the computing environment where the research was conducted, thus placing the onus on authors to ensure their research findings are FAIR prior to publication. With the help of the ORKG R package, salient scientific knowledge is captured from the paper's associated R source code and serialized to a machine-reusable format (JSON-LD) for harvesting by the ORKG via DOI lookup. By applying this approach to an unpublished soil science manuscript, we demonstrated how authors are best situated to describe their work in a richly-detailed, machine-reusable format. Furthermore, by applying this approach to two published agroecology papers, we demonstrated its relevance post-publication, suggesting that papers which share source code and data sets could be made machine-reusable retrospectively. Finally, a proof-of-concept meta-analysis was conducted to demonstrate how this approach can help facilitate research synthesis by providing FAIR scientific data. We concluded that the 'born-reusable' approach has promising implications for the reusability of scientific knowledge. However, its broad adoption faces several challenges. Therefore, solutions were explored to improve the approach's interoperability with knowledge graphs, assist authors with integrating it into their workflows, and strengthen cooperation with publishers to provide the necessary infrastructure.
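    The ORKG R package itself is not shown here, so the sketch below only illustrates the general serialization step of the born-reusable idea: a structured description of a paper's key result, written next to the analysis code and saved as JSON-LD so that a knowledge graph could later harvest it via the paper's DOI. The vocabulary, property names, DOI, and values are all placeholders, not the package's actual schema or output.

```python
# Illustrative sketch of serializing a machine-reusable paper description to
# JSON-LD alongside the analysis code. Field names, the DOI, and the values
# are hypothetical placeholders, not the ORKG R package's actual schema.
import json

paper_description = {
    "@context": {"@vocab": "https://schema.org/"},
    "@type": "ScholarlyArticle",
    "identifier": "10.1234/example-doi",          # hypothetical DOI
    "name": "Effect of cover crops on soil organic carbon",
    "about": {
        "@type": "Dataset",
        "variableMeasured": "soil organic carbon (g/kg)",
        "measurementTechnique": "dry combustion",
    },
    "result": {                                    # placeholder property
        "comparison": "cover crop vs. bare fallow",
        "effectSize": 1.8,
        "unit": "g/kg",
    },
}

# Written next to the source code so it can be harvested after publication.
with open("paper_description.jsonld", "w") as f:
    json.dump(paper_description, f, indent=2)
```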

    A recurrent neural network architecture for biomedical event trigger classification

    A “biomedical event” is a broad term used to describe the roles and interactions between entities (such as proteins, genes and cells) in a biological system. The task of biomedical event extraction aims at identifying and extracting these events from unstructured texts. An important component in the early stage of the task is biomedical trigger classification, which involves identifying and classifying words or phrases that indicate an event. In this thesis, we present our work on biomedical trigger classification developed using the multi-level event extraction dataset. We restrict the scope of our classification to 19 biomedical event types grouped under four broad categories: Anatomical, Molecular, General and Planned. While most existing approaches are based on traditional machine learning algorithms that require extensive feature engineering, our model relies on neural networks to implicitly learn important features directly from the text. We use natural language processing techniques to transform the text into vectorized inputs that can be used in a neural network architecture. To the best of our knowledge, this is the first time neural attention strategies have been explored in the area of biomedical trigger classification. Our best results were obtained from an ensemble of 50 models, which produced a micro F-score of 79.82%, an improvement of 1.3% over the previous best score.
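    The thesis's exact architecture and hyperparameters are not given in this abstract; the sketch below only illustrates the general pattern it describes, a bidirectional recurrent encoder with an attention layer that pools token representations before classifying the event type of a trigger candidate. The dimensions, vocabulary size, and number of classes (19 event types plus a negative class) are illustrative assumptions.

```python
# Minimal sketch of an attention-based recurrent trigger classifier.
# Dimensions, vocabulary, and class count are assumptions for illustration.
import torch
import torch.nn as nn

class TriggerClassifier(nn.Module):
    def __init__(self, vocab_size=5000, emb_dim=100, hidden=128, n_classes=20):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.bilstm = nn.LSTM(emb_dim, hidden, bidirectional=True, batch_first=True)
        self.attn = nn.Linear(2 * hidden, 1)   # scores each token position
        self.out = nn.Linear(2 * hidden, n_classes)

    def forward(self, token_ids):
        h, _ = self.bilstm(self.embed(token_ids))                  # (batch, seq, 2*hidden)
        weights = torch.softmax(self.attn(h).squeeze(-1), dim=1)   # attention over tokens
        context = (weights.unsqueeze(-1) * h).sum(dim=1)           # attention-pooled sentence vector
        return self.out(context)                                   # event-type logits

# Toy usage: a batch of two sentences of ten token ids each.
model = TriggerClassifier()
logits = model(torch.randint(0, 5000, (2, 10)))
print(logits.shape)  # torch.Size([2, 20])
```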