
    A study on developing novel methods for relation extraction

    Relation Extraction (RE) is a Natural Language Processing (NLP) task that detects and classifies the relations between two entities. Relation extraction in the biomedical and scientific literature domain is challenging because a text can contain multiple pairs of entities in the same instance. During the course of this research, we developed an RE framework (RelEx), which consists of five main RE paradigms: rule-based, machine learning-based, Convolutional Neural Network (CNN)-based, Bidirectional Encoder Representations from Transformers (BERT)-based, and Graph Convolutional Network (GCN)-based approaches. RelEx's rule-based approach uses co-location information of the entities to determine whether a relation exists between a selected entity and the other entities. RelEx's machine learning-based approach feeds traditional feature representations into traditional machine learning algorithms. RelEx's CNN-based approach consists of three CNN architectures: Segment-CNN, single-label Sentence-CNN, and multi-label Sentence-CNN. RelEx's BERT-based approach feeds BERT's contextualized word embeddings into a feed-forward neural network. Finally, RelEx's GCN-based approach consists of two GCN-based architectures: GCN-Vanilla and GCN-BERT. We evaluated variations of these approaches in two different domains across four distinct relation types. Overall, our findings showed that the rule-based approach is suited to datasets with few training instances. In contrast, the CNN-based, BERT-based, and GCN-based approaches perform better on labeled data with many training instances. These approaches identify patterns in the data automatically and efficiently, whereas rule-based approaches require expert knowledge to generate rules. The CNN-based and BERT-based approaches capture the local contextual information within a sentence or document by embedding both semantic and syntactic information in a learned representation. However, their ability to capture long-range, global dependency information in a text is limited. GCN-based approaches capture the global association information by performing convolution operations on neighbor nodes in a graph and incorporating information from neighbors. Combining GCN with BERT integrates the local contextual and global association information of the words and generates better representations for the words.
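
The co-location idea behind the rule-based approach can be sketched as follows. This is only an illustration under stated assumptions, not the RelEx implementation: the `Entity` type, the `colocated_relations` function, the character-distance rule, and the `max_gap` threshold are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    label: str
    start: int  # character offset where the mention begins
    end: int    # character offset just past the mention


def find(text, mention, label):
    """Helper (hypothetical): locate the first occurrence of a mention."""
    start = text.index(mention)
    return Entity(mention, label, start, start + len(mention))


def colocated_relations(anchor, others, max_gap=10):
    """Pair `anchor` with every entity whose span lies within `max_gap` characters."""
    pairs = []
    for ent in others:
        if ent == anchor:
            continue
        # Distance between the two spans; 0 if they touch or overlap.
        gap = max(ent.start - anchor.end, anchor.start - ent.end, 0)
        if gap <= max_gap:
            pairs.append((anchor.text, ent.text, f"{anchor.label}-{ent.label}"))
    return pairs


text = "Patient was given aspirin 81 mg daily; later, after discharge, metoprolol was added."
drug = find(text, "aspirin", "Drug")
candidates = [
    find(text, "81 mg", "Strength"),
    find(text, "daily", "Frequency"),
    find(text, "metoprolol", "Drug"),
]
relations = colocated_relations(drug, candidates, max_gap=10)
print(relations)
```

With this threshold the strength and frequency mentions are linked to "aspirin", while the distant "metoprolol" mention is not. A rule of this kind needs no training data, which is consistent with the abstract's observation that rule-based methods suit small datasets.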

    BERT Based Clinical Knowledge Extraction for Biomedical Knowledge Graph Construction and Analysis

    Background: Knowledge evolves over time, often as a result of new discoveries or changes in the adopted methods of reasoning. New facts or evidence may also become available, leading to new understandings of complex phenomena. This is particularly true in the biomedical field, where scientists and physicians are constantly striving to find new methods of diagnosis, treatment, and eventually cure. Knowledge Graphs (KGs) offer a real way of organizing and retrieving the massive and growing amount of biomedical knowledge. Objective: We propose an end-to-end approach for knowledge extraction and analysis from biomedical clinical notes using the Bidirectional Encoder Representations from Transformers (BERT) model and a Conditional Random Field (CRF) layer. Methods: The approach is based on knowledge graphs, which can effectively process abstract biomedical concepts such as relationships and interactions between medical entities. Besides offering an intuitive way to visualize these concepts, KGs can solve more complex knowledge retrieval problems by reducing them to simpler representations or by transforming the problems into representations from different perspectives. We created a biomedical knowledge graph using Natural Language Processing models for named entity recognition and relation extraction. The generated biomedical KGs are then used for question answering. Results: According to experimental findings based on 505 real-world, unstructured biomedical patient clinical notes, the proposed framework can extract relevant structured information with high accuracy (90.7% for named entity recognition (NER), 88% for relation extraction (RE)). Conclusions: In this paper, we propose a novel end-to-end system for the construction of a biomedical knowledge graph from clinical text using a variant of BERT models.

    GNTeam at 2018 n2c2: Feature-augmented BiLSTM-CRF for drug-related entity recognition in hospital discharge summaries

    Monitoring the administration of drugs and adverse drug reactions are key parts of pharmacovigilance. In this paper, we explore the extraction of drug mentions and drug-related information (reason for taking a drug, route, frequency, dosage, strength, form, duration, and adverse events) from hospital discharge summaries through deep learning that relies on various representations for clinical named entity recognition. This work was officially part of the 2018 n2c2 shared task, and we use the data supplied as part of the task. We developed two deep learning architectures based on recurrent neural networks and pre-trained language models. We also explore the effect of augmenting word representations with semantic features for clinical named entity recognition. Our feature-augmented BiLSTM-CRF model achieved an F1-score of 92.67% and ranked 4th in the entity extraction sub-task among the systems submitted to the n2c2 challenge. The recurrent neural networks that use pre-trained domain-specific word embeddings and a CRF layer for label optimization extract drugs, adverse events, and related entities with a micro-averaged F1-score of over 91%. Word embeddings that are pre-trained on a large unannotated corpus of relevant documents and further fine-tuned to the task perform rather well. However, augmenting the word embeddings with semantic features extracted using available clinical NLP toolkits can further improve the performance (primarily by boosting precision) of drug-related named entity recognition from electronic health records.
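
For readers unfamiliar with the metric, micro-averaged F1 pools true positives, false positives, and false negatives over all entity types before computing precision and recall. The sketch below shows the arithmetic only; the counts are invented for illustration and are not results from the n2c2 task, and the `micro_f1` function name is hypothetical.

```python
def micro_f1(counts):
    """Micro-averaged F1 from per-type counts of tp, fp, and fn."""
    tp = sum(c["tp"] for c in counts.values())
    fp = sum(c["fp"] for c in counts.values())
    fn = sum(c["fn"] for c in counts.values())
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Invented counts, for illustration only.
example_counts = {
    "Drug": {"tp": 90, "fp": 10, "fn": 10},
    "ADE": {"tp": 10, "fp": 10, "fn": 30},
}
score = micro_f1(example_counts)
print(f"micro-F1 = {score:.4f}")
```

Because the counts are pooled, frequent entity types such as drug names dominate the score, so a micro-averaged figure can hide weak performance on rare types such as adverse events.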

    Data-efficient methods for information extraction

    Structured knowledge representation systems such as knowledge bases or knowledge graphs provide insights into entities and the relationships among these entities in the real world. Such knowledge representation systems can be employed in various natural language processing applications such as semantic search, question answering, and text summarization. It is infeasible and inefficient to populate these knowledge representation systems manually. In this work, we develop methods to automatically extract named entities and the relationships among the entities from plain text. Our methods can therefore be used either to complete existing incomplete knowledge representation systems or to create a new structured knowledge representation system from scratch. Unlike mainstream supervised methods for information extraction, our methods focus on the low-data scenario and do not require a large amount of annotated data. In the first part of the thesis, we focused on the problem of named entity recognition. We participated in the Bacteria Biotope 2019 shared task, which consists of recognizing and normalizing biomedical entity mentions. Our linguistically informed named entity recognition system consists of a deep learning-based model that can extract both nested and flat entities; the model employs several linguistic features and auxiliary training objectives to enable efficient learning in data-scarce scenarios. Our entity normalization system employs string match, fuzzy search, and semantic search to link the extracted named entities to the biomedical databases. Our named entity recognition and entity normalization system achieved the lowest slot error rate of 0.715 and ranked first in the shared task. We also participated in two shared tasks, Adverse Drug Effect Span Detection (English) and Profession Span Detection (Spanish); both tasks collect data from the social media platform Twitter. We developed a named entity recognition model that improves its input representation by stacking heterogeneous embeddings from diverse domains; our empirical results demonstrate complementary learning from these heterogeneous embeddings. Our submission ranked 3rd in both shared tasks. In the second part of the thesis, we explored synthetic data augmentation strategies to address low-resource information extraction in specialized domains. Specifically, we adapted backtranslation to the token-level task of named entity recognition and the sentence-level task of relation extraction. We demonstrate that backtranslation can generate linguistically diverse and grammatically coherent synthetic sentences and serves as a competitive augmentation strategy for named entity recognition and relation extraction. In most real-world relation extraction tasks, annotated data is not available, but a large unannotated text corpus often is. Bootstrapping methods for relation extraction can operate on such a corpus because they require only a handful of seed instances. However, bootstrapping methods tend to accumulate noise over time (known as semantic drift), and this phenomenon has a drastic negative impact on the final precision of the extractions. We develop two methods that constrain the bootstrapping process to minimise semantic drift in relation extraction; our methods leverage graph theory and pre-trained language models to explicitly identify and remove noisy extraction patterns. We report experimental results on the TACRED dataset for four relations. In the last part of the thesis, we demonstrate the application of domain adaptation to the challenging task of multilingual acronym extraction. Our experiments demonstrate that domain adaptation can improve acronym extraction within scientific and legal domains in six languages, including low-resource languages such as Persian and Vietnamese.
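
The entity normalization step described in this abstract cascades from cheap to more permissive matching. The sketch below illustrates only the first two stages (exact string match, then fuzzy search) using Python's standard `difflib`; the semantic search stage is omitted. The vocabulary, the `DEMO:` identifiers, the `normalize` function, and the 0.85 cutoff are assumptions made for illustration and are not taken from the thesis.

```python
import difflib

# Hypothetical mini-vocabulary: lowercase surface form -> made-up identifier.
VOCAB = {
    "escherichia coli": "DEMO:0001",
    "bacillus subtilis": "DEMO:0002",
}


def normalize(mention, vocab, cutoff=0.85):
    """Link a mention to a vocabulary entry: exact match first, then fuzzy search."""
    key = mention.strip().lower()
    if key in vocab:
        return vocab[key], "exact"
    # Fuzzy fallback: best candidate whose similarity ratio is at least `cutoff`.
    close = difflib.get_close_matches(key, list(vocab), n=1, cutoff=cutoff)
    if close:
        return vocab[close[0]], "fuzzy"
    return None, "unlinked"


for mention in ["Escherichia coli", "eschericia coli", "yeast"]:
    print(mention, "->", normalize(mention, VOCAB))
```

Trying the exact match first keeps the common case fast and precise; the fuzzy stage only handles spelling variants, and anything it cannot link would be passed on to a semantic search stage in a fuller system.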

    Named Entity Recognition in Electronic Health Records: A Methodological Review

    Objectives: A substantial portion of the data contained in Electronic Health Records (EHRs) is unstructured, often appearing as free text. This format restricts its potential utility in clinical decision-making. Named entity recognition (NER) methods address the challenge of extracting pertinent information from unstructured text. The aim of this study was to outline the current NER methods and trace their evolution from 2011 to 2022. Methods: We conducted a methodological literature review of NER methods, with a focus on distinguishing the classification models, the types of tagging systems, and the languages employed in various corpora. Results: Several methods have been documented for automatically extracting relevant information from EHRs using natural language processing techniques such as NER and relation extraction (RE). These methods can automatically extract concepts, events, attributes, and other data, as well as the relationships between them. Most NER studies conducted thus far have utilized corpora in English or Chinese. Additionally, the Bidirectional Encoder Representations from Transformers (BERT) architecture combined with the BIO tagging scheme is the most frequently reported classification approach. We found a limited number of papers on the implementation of NER or RE tasks in EHRs within a specific clinical domain. Conclusions: EHRs play a pivotal role in gathering clinical information and could serve as the primary source for automated clinical decision support systems. However, the creation of new corpora from EHRs in specific clinical domains is essential to facilitate the swift development of NER and RE models applied to EHRs for use in clinical practice.
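
The BIO tagging scheme mentioned in this review labels each token as the Beginning of an entity, Inside an entity, or Outside any entity. A minimal sketch of converting entity spans to BIO tags is shown below; the `spans_to_bio` function, the example sentence, and the `Problem` and `Drug` labels are illustrative assumptions rather than anything drawn from the reviewed corpora.

```python
def spans_to_bio(tokens, spans):
    """Convert (start, end_exclusive, label) token spans into BIO tags."""
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        tags[start] = f"B-{label}"       # first token of the entity
        for i in range(start + 1, end):
            tags[i] = f"I-{label}"       # remaining tokens of the entity
    return tags


tokens = ["Patient", "denies", "chest", "pain", "on", "aspirin"]
spans = [(2, 4, "Problem"), (5, 6, "Drug")]
tags = spans_to_bio(tokens, spans)
for token, tag in zip(tokens, tags):
    print(f"{token}\t{tag}")
```

A sequence classifier then predicts one such tag per token, and the `B-` prefix lets adjacent entities of the same type be separated when the tags are decoded back into spans.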