
    Robust input representations for low-resource information extraction

    Recent advances in the field of natural language processing have been achieved with deep learning models. This has led to a wide range of new research questions concerning the stability of such large-scale systems and their applicability beyond well-studied tasks and datasets, such as information extraction in non-standard domains and languages, in particular in low-resource environments. In this work, we address these challenges and make important contributions across fields such as representation learning and transfer learning by proposing novel model architectures and training strategies to overcome existing limitations, including a lack of training resources, domain mismatches, and language barriers. In particular, we propose solutions for closing the domain gap of representation models, e.g., through domain-adaptive pre-training or our novel meta-embedding architecture, which creates a joint representation from multiple embedding methods. Our broad set of experiments demonstrates state-of-the-art performance of our methods on various sequence tagging and classification tasks and highlights their robustness in challenging low-resource settings across languages and domains.
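    As a concrete illustration of such a meta-embedding architecture, the sketch below fuses several source embeddings through learned attention weights over a shared projection space; the dimensions, projection layers, and scoring function are illustrative assumptions, not the dissertation's exact design.

        # Minimal sketch of attention-based meta-embeddings (illustrative):
        # each source embedding is projected into a shared space and a learned
        # attention weight decides how much each source contributes to the
        # joint representation of a token.
        import torch
        import torch.nn as nn

        class AttentionMetaEmbedding(nn.Module):
            def __init__(self, source_dims, joint_dim):
                super().__init__()
                # One projection per embedding method (e.g. fastText, BERT).
                self.projections = nn.ModuleList(
                    [nn.Linear(d, joint_dim) for d in source_dims]
                )
                # Scalar attention score per projected source embedding.
                self.scorer = nn.Linear(joint_dim, 1)

            def forward(self, embeddings):
                # embeddings: list of tensors, each (batch, seq, source_dim_i)
                projected = torch.stack(
                    [p(e) for p, e in zip(self.projections, embeddings)], dim=2
                )  # (batch, seq, n_sources, joint_dim)
                weights = torch.softmax(self.scorer(projected), dim=2)
                return (weights * projected).sum(dim=2)  # (batch, seq, joint_dim)

        # Usage: combine a 300-d static embedding with a 768-d contextual one.
        meta = AttentionMetaEmbedding(source_dims=[300, 768], joint_dim=256)
        joint = meta([torch.randn(8, 20, 300), torch.randn(8, 20, 768)])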

    Syntax-based Transfer Learning for the Task of Biomedical Relation Extraction

    Transfer learning (TL) aims to enhance machine learning performance on a problem by reusing labeled data originally created for a related problem. In particular, domain adaptation consists, for a given task, in reusing training data developed for the same task but in a distinct domain. This is particularly relevant to applications of deep learning in Natural Language Processing, which usually require large annotated corpora that may not exist for the target domain but do exist for other, related domains. In this paper, we experiment with TL for the task of Relation Extraction (RE) from biomedical texts, using the TreeLSTM model. We empirically show the impact of TreeLSTM alone and with domain adaptation, obtaining better performance than the state of the art on two biomedical RE tasks and equal performance on two others for which little annotated data is available. Furthermore, we propose an analysis of the role that syntactic features may play in TL for RE.
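    For orientation, a TreeLSTM composes a sentence representation along its syntax tree rather than left to right; the following is a minimal sketch of a Child-Sum TreeLSTM cell in the spirit of Tai et al. (2015), with gate weights fused for brevity, not the paper's own implementation.

        # Minimal sketch of a Child-Sum TreeLSTM cell: a node's state is
        # computed from its token embedding and the summed states of its
        # children, with one forget gate per child.
        import torch
        import torch.nn as nn

        class ChildSumTreeLSTMCell(nn.Module):
            def __init__(self, input_dim, hidden_dim):
                super().__init__()
                # Fused input/output/update gates over [x; sum of child h].
                self.iou = nn.Linear(input_dim + hidden_dim, 3 * hidden_dim)
                self.f_x = nn.Linear(input_dim, hidden_dim)
                self.f_h = nn.Linear(hidden_dim, hidden_dim)

            def forward(self, x, child_h, child_c):
                # x: (input_dim,); child_h, child_c: (n_children, hidden_dim)
                h_sum = child_h.sum(dim=0)
                i, o, u = self.iou(torch.cat([x, h_sum])).chunk(3)
                i, o, u = torch.sigmoid(i), torch.sigmoid(o), torch.tanh(u)
                # One forget gate per child, conditioned on that child's state.
                f = torch.sigmoid(self.f_x(x) + self.f_h(child_h))
                c = i * u + (f * child_c).sum(dim=0)
                return o * torch.tanh(c), c

        # Usage for a node with two already-processed children.
        cell = ChildSumTreeLSTMCell(input_dim=50, hidden_dim=64)
        h, c = cell(torch.randn(50), torch.randn(2, 64), torch.randn(2, 64))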

    Automated coding of under-studied medical concept domains: linking physical activity reports to the international classification of functioning, disability, and health

    Linking clinical narratives to standardized vocabularies and coding systems is a key component of unlocking the information in medical text for analysis. However, many domains of medical concepts, such as functional outcomes and social determinants of health, lack well-developed terminologies that can support effective coding of medical text. We present a framework for developing natural language processing (NLP) technologies for automated coding of medical information in under-studied domains, and demonstrate its applicability through a case study on physical mobility function. Mobility function is a component of many health measures, from post-acute care and surgical outcomes to chronic frailty and disability, and is represented as one domain of human activity in the International Classification of Functioning, Disability, and Health (ICF). However, mobility and other types of functional activity remain under-studied in the medical informatics literature, and neither the ICF nor commonly used medical terminologies capture functional status terminology as used in practice. We investigated two data-driven paradigms, classification and candidate selection, to link narrative observations of mobility status to standardized ICF codes, using a dataset of clinical narratives from physical therapy encounters. Recent advances in language modeling and word embedding were used as features for established machine learning models and a novel deep learning approach, achieving a macro-averaged F1 score of 84% on linking mobility activity reports to ICF codes. Both classification and candidate selection approaches present distinct strengths for automated coding in under-studied domains, and we highlight that the combination of (i) a small annotated dataset, (ii) expert definitions of codes of interest, and (iii) a representative text corpus is sufficient to produce high-performing automated coding systems. This research has implications for the continued development of language technologies to analyze functional status information, and for the ongoing growth of NLP tools for a variety of specialized applications in clinical care and research.
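    To illustrate the candidate-selection paradigm in miniature, the sketch below scores a narrative mobility observation against hypothetical, abbreviated ICF code definitions by text similarity; the TF-IDF features here merely stand in for the study's language-model and word-embedding features.

        # Candidate selection sketch: rank ICF codes by the similarity of
        # their definition text to a narrative mobility report, then pick
        # the best-scoring candidate. Definitions are abbreviated paraphrases.
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        icf_codes = {
            "d450": "Walking: moving along a surface on foot, step by step",
            "d455": "Moving around: climbing, running, jogging, jumping",
            "d465": "Moving around using equipment such as a wheelchair or walker",
        }
        report = "Patient ambulates 50 feet with rolling walker, supervision needed"

        vectorizer = TfidfVectorizer().fit(list(icf_codes.values()) + [report])
        scores = cosine_similarity(
            vectorizer.transform([report]),
            vectorizer.transform(icf_codes.values()),
        )[0]
        best_code = max(zip(icf_codes, scores), key=lambda pair: pair[1])
        print(best_code)  # the walker report should rank d465 highest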

    Exploration and adaptation of large language models for specialized domains

    Large language models have transformed the field of natural language processing (NLP). Their improved performance on various NLP benchmarks makes them a promising tool, including for applications in specialized domains. Such domains are characterized by highly trained professionals with particular domain expertise. Since these experts are rare, improving the efficiency of their work with automated systems is especially desirable. However, domain-specific text resources hold various challenges for NLP systems, including distinct language, noisy and scarce data, and a high level of variation. Further, specialized domains present an increased need for transparent systems, since these are often applied in high-stakes settings. In this dissertation, we examine whether large language models (LLMs) can overcome some of these challenges and propose methods to effectively adapt them to domain-specific requirements. We first investigate the inner workings and abilities of LLMs and show how they can fill the gaps present in previous NLP algorithms for specialized domains. To this end, we explore the sources of errors produced by earlier systems to identify which of them can be addressed with LLMs. Following this, we take a closer look at how information is processed within Transformer-based LLMs to better understand their capabilities. We find that their layers encode different dimensions of the input text; here, the contextual vector representations and the general language knowledge learned during pre-training are especially beneficial for solving the complex, multi-step tasks common in specialized domains. Following this exploration, we propose solutions for further adapting LLMs to the requirements of domain-specific tasks. We focus on the clinical domain, which incorporates many typical challenges of specialized domains. We show how to improve generalization by integrating different domain-specific resources into our models. We further analyze the behavior of the resulting models and propose a behavioral testing framework that can serve as a tool for communication with domain experts. Finally, we present an approach for incorporating the benefits of LLMs while fulfilling requirements such as interpretability and modularity. The presented solutions show improvements in performance on benchmark datasets and in manually conducted analyses with medical professionals. Our work provides both new insights into the inner workings of pre-trained language models and multiple adaptation methods, showing that LLMs can be an effective tool for NLP in specialized domains.
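    One common recipe for the kind of domain adaptation discussed above is to continue masked-language-model pre-training on in-domain text before task fine-tuning; the sketch below shows this with Hugging Face Transformers, where the base model, the corpus file clinical_notes.txt, and the hyperparameters are placeholders, not the dissertation's setup.

        # Continued (domain-adaptive) MLM pre-training on an in-domain corpus.
        from datasets import load_dataset
        from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                                  DataCollatorForLanguageModeling, Trainer,
                                  TrainingArguments)

        tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
        model = AutoModelForMaskedLM.from_pretrained("bert-base-uncased")

        # Hypothetical in-domain corpus: one clinical note per line.
        corpus = load_dataset("text", data_files={"train": "clinical_notes.txt"})
        tokenized = corpus["train"].map(
            lambda batch: tokenizer(batch["text"], truncation=True, max_length=128),
            batched=True,
            remove_columns=["text"],
        )

        trainer = Trainer(
            model=model,
            args=TrainingArguments(output_dir="adapted-model", num_train_epochs=1),
            train_dataset=tokenized,
            data_collator=DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15),
        )
        trainer.train()  # the adapted weights are then fine-tuned on the task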

    Innovations in Domain Knowledge Augmentation of Contextual Models

    The digital transformation of our society is creating a tremendous amount of data at an unprecedented rate. A large part of this data is in unstructured text format. While enjoying the benefit of instantaneous data access, we are also burdened by information overload. In healthcare, clinicians have to spend a significant portion of their time reading, writing, and synthesizing data in electronic patient record systems. Information overload is reported as one of the main factors contributing to physician burnout; however, information overload is not unique to healthcare. We need better practical tools to help us access the right information at the right time. This has led to a heightened interest in high-performing Natural Language Processing (NLP) research and solutions.

    NLP, or Computational Linguistics, is a sub-field of computer science that focuses on analyzing and representing human language. The most recent advancements in NLP are large pre-trained contextual language models (e.g., Transformer-based models), which are pre-trained on massive corpora; their context-sensitive embeddings (i.e., learned representations of words) are used in downstream tasks. The introduction of these models has led to significant performance gains in various downstream tasks, including sentiment analysis, entity recognition, and question answering. Such models have the ability to change the embedding of a word based on its inferred meaning, which is derived from the surrounding context.

    Contextual models can only encode the knowledge available in raw text corpora. Injecting structured domain-specific knowledge into these contextual models could further improve their performance and efficiency. However, this is not a trivial task: it requires a deep understanding of the model's architecture and of the nature and structure of the domain knowledge incorporated into the model. Another challenge facing NLP is the "low-resource" problem, arising from a shortage of publicly available (domain-specific) large datasets for training purposes. The low-resource challenge is especially acute in the biomedical domain, where strict regulation for privacy protection prohibits many datasets from being publicly available to the NLP community. The severe shortage of clinical experts further exacerbates the lack of labeled training datasets for clinical NLP research.

    We approach these challenges from the knowledge augmentation angle. This thesis explores how knowledge found in structured knowledge bases, either general-purpose lexical databases (e.g., WordNet) or domain-specific knowledge bases (e.g., the Unified Medical Language System or the International Classification of Diseases), can be used to address the low-resource problem. We show that by incorporating domain-specific prior knowledge into a deep learning NLP architecture, we can force an NLP model to learn the associations between distinctive terminologies that it otherwise may not have the opportunity to learn due to the scarcity of domain-specific datasets. Four distinct yet complementary strategies have been pursued.

    First, we investigate how contextual models can use structured knowledge contained in the lexical database WordNet to distinguish between semantically similar words. We update the input policy of a contextual model by introducing a new mix-up embedding strategy for the input embedding of the target word. We also introduce additional information, such as the degree of similarity between the definitions of the target and the candidate words. We demonstrate that this supplemental information enables the model to select candidate words that are semantically similar to the target word rather than those that are merely appropriate for the sentence's context.

    Having proven that lexical knowledge can aid a contextual model in distinguishing between semantically similar words, we extend this approach to highly specialized vocabularies such as those found in medical text. We explore whether using domain-specific (medical) knowledge from the UMLS Metathesaurus in the architecture of a Transformer-based encoder model can aid the model in building 'semantically enriched' contextual representations that benefit from both contextual learning and domain knowledge. We also investigate whether incorporating structured medical knowledge into the pre-training phase of a Transformer-based model can incentivize the model to learn the associations between distinctive terminologies more accurately. This strategy proves effective in a series of benchmark comparisons with other related models.

    After demonstrating the effect of structured (medical) domain knowledge on the performance of a Transformer-based encoder model, we extend the medical features and illustrate that structured medical knowledge can also boost the performance of a (medical) summarization Transformer-based sequence-to-sequence model. We introduce a guidance signal consisting of the medical terminologies in the input sequence. Moreover, the input policy is modified by utilizing the semantic types from UMLS, and we propose a novel weighted loss function. Our study demonstrates the benefit of these strategies in providing a stronger incentive for the model to include relevant medical facts in the summarized output.

    We further examine whether an NLP model can take advantage of both the relational information between different labels and contextual embedding information by introducing a novel attention mechanism, instead of augmenting the architecture of contextual models with structured information as described above. We tackle the challenge of automatic ICD coding, the task of assigning codes of the International Classification of Diseases (ICD) system to medical notes. Through a novel attention mechanism, we integrate the information from a Graph Convolutional Network (GCN) that captures the relationships between codes with the contextual sentence embeddings of the medical notes. Our experiments reveal that this enhancement effectively boosts the model's performance on the automatic ICD coding task.

    The main contribution of this thesis is twofold: (1) it contributes to the computer science literature by demonstrating how domain-specific knowledge can be effectively incorporated into contextual models to improve model performance in NLP tasks that lack helpful training resources; and (2) the knowledge augmentation strategies and contextual models developed in this research are shown to improve NLP performance in the biomedical field, where publicly available training datasets are scarce but domain-specific knowledge bases and data standards have achieved wide adoption in electronic medical record systems.
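    To make the last strategy concrete, the sketch below combines a one-layer GCN over a code-relation graph with label-wise attention over a note's contextual token embeddings, yielding one score per ICD code; the dimensions, the single graph layer, and the toy inputs are illustrative simplifications, not the thesis architecture.

        # GCN-informed label-wise attention for ICD coding (illustrative).
        import torch
        import torch.nn as nn

        class GCNLabelAttention(nn.Module):
            def __init__(self, adjacency, dim):
                super().__init__()
                n_codes = adjacency.size(0)
                self.register_buffer("adj", adjacency)   # normalized code graph
                self.labels = nn.Parameter(torch.randn(n_codes, dim))
                self.gcn = nn.Linear(dim, dim)           # one graph conv layer
                self.out_w = nn.Parameter(torch.randn(n_codes, dim))
                self.out_b = nn.Parameter(torch.zeros(n_codes))

            def forward(self, tokens):
                # tokens: (batch, seq, dim) contextual embeddings of the note.
                # Refine label embeddings with the code-relation graph.
                label_vecs = torch.relu(self.gcn(self.adj @ self.labels))
                # Label-wise attention: each code attends to the tokens most
                # relevant to it, giving one document vector per code.
                attn = torch.softmax(tokens @ label_vecs.T, dim=1)  # (b, s, n)
                doc = attn.transpose(1, 2) @ tokens                 # (b, n, dim)
                return (doc * self.out_w).sum(-1) + self.out_b      # (b, n) logits

        # Usage with a toy 3-code graph and random "contextual" embeddings.
        adj = torch.eye(3) + torch.tensor(
            [[0, 1, 0], [1, 0, 1], [0, 1, 0]], dtype=torch.float
        )
        adj = adj / adj.sum(1, keepdim=True)  # simple row normalization
        model = GCNLabelAttention(adj, dim=64)
        logits = model(torch.randn(2, 50, 64))  # 2 notes, 3 code scores each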

    Preface


    IberSPEECH 2020: XI Jornadas en TecnologĂ­a del Habla and VII Iberian SLTech

    IberSPEECH2020 is a two-day event bringing together the best researchers and practitioners in speech and language technologies for Iberian languages to promote interaction and discussion. The organizing committee has planned a wide variety of scientific and social activities, including technical paper presentations, keynote lectures, presentations of projects, laboratories, and recent PhD theses, discussion panels, a round table, and awards for the best thesis and papers. The program of IberSPEECH2020 includes a total of 32 contributions distributed among 5 oral sessions, a PhD session, and a projects session. To ensure the quality of all contributions, each submitted paper was reviewed by three members of the scientific review committee. All papers from the conference will be accessible through the International Speech Communication Association (ISCA) Online Archive. Paper selection was based on the scores and comments provided by the scientific review committee, which includes 73 researchers from different institutions (mainly from Spain and Portugal, but also from France, Germany, Brazil, Iran, Greece, Hungary, the Czech Republic, Ukraine, and Slovenia). Furthermore, extended versions of selected papers will be published in a special issue of the journal Applied Sciences, "IberSPEECH 2020: Speech and Language Technologies for Iberian Languages", published by MDPI with full open access. In addition to the regular paper sessions, the IberSPEECH2020 scientific program features the ALBAYZIN evaluation challenge session. Red Española de TecnologĂ­as del Habla. Universidad de Valladolid.

    Talking about personal recovery in bipolar disorder: Integrating health research, natural language processing, and corpus linguistics to analyse peer online support forum posts

    Background: Personal recovery, 'living a satisfying, hopeful and contributing life even with the limitations caused by the illness' (Anthony, 1993), is of particular value in bipolar disorder, where symptoms often persist despite treatment. So far, personal recovery has only been studied in researcher-constructed environments (interviews, focus groups). Support forum posts can serve as a complementary naturalistic data source.

    Objective: The overarching aim of this thesis was to study the personal recovery experiences that people living with bipolar disorder have shared in online support forums, by integrating health research, NLP, and corpus linguistics in a mixed-methods approach within a pragmatic research paradigm, while considering ethical issues and involving people with lived experience.

    Methods: This mixed-methods study analysed: 1) previous qualitative evidence on personal recovery in bipolar disorder from interviews and focus groups; 2) who self-reports a bipolar disorder diagnosis on the online discussion platform Reddit; 3) the relationship between mood and posting in mental health-specific Reddit forums (subreddits); 4) discussions of personal recovery in bipolar disorder subreddits.

    Results: A systematic review of qualitative evidence resulted in the first framework for personal recovery in bipolar disorder, POETIC (Purpose & meaning, Optimism & hope, Empowerment, Tensions, Identity, Connectedness). Mainly young or middle-aged US-based adults self-report a bipolar disorder diagnosis on Reddit. Of these, those experiencing more intense emotions appear to be more likely to post in mental health support subreddits. Their personal recovery-related discussions in bipolar disorder subreddits primarily focussed on three domains: Purpose & meaning (particularly reproductive decisions and work), Connectedness (romantic relationships, social support), and Empowerment (self-management, personal responsibility). Support forum data highlighted personal recovery issues that came up exclusively or more frequently online compared to previous evidence from interviews and focus groups.

    Conclusion: This project is the first to analyse non-reactive data on personal recovery in bipolar disorder. By indicating the key areas that people focus on in personal recovery when posting freely, and the language they use, it provides a helpful starting point for formal and informal carers to understand the concerns of people diagnosed with bipolar disorder and to consider how best to offer support.