    The impact of pretrained language models on negation and speculation detection in cross-lingual medical text: Comparative study

    Background: Negation and speculation are critical elements in natural language processing (NLP) tasks such as information extraction, as these phenomena change the truth value of a proposition. In the informal clinical narrative, these linguistic devices are used extensively to indicate hypotheses, impressions, or negative findings. Previous state-of-the-art approaches addressed negation and speculation detection with rule-based methods, but in the last few years, models based on machine learning and deep learning have emerged that exploit morphological, syntactic, and semantic features represented as sparse and dense vectors. However, although such named entity recognition (NER) methods employ a broad set of features, they are limited to existing pretrained models for a specific domain or language. Objective: As a fundamental subsystem of any information extraction pipeline, a system for cross-lingual and domain-independent negation and speculation detection was introduced, with special focus on the biomedical scientific literature and the clinical narrative. In this work, detection of negation and speculation was treated as a sequence-labeling task in which the cues and scopes of both phenomena are identified as a sequence of nested labels in a single step. Methods: We proposed two approaches for negation and speculation detection: (1) a bidirectional long short-term memory (Bi-LSTM) network with a conditional random field, using character, word, and sense embeddings to capture semantic, syntactic, and contextual patterns, and (2) bidirectional encoder representations from transformers (BERT) with fine-tuning for NER. Results: The approach was evaluated for English and Spanish on biomedical and review text, specifically the BioScope corpus, the IULA corpus, and the SFU Spanish Review corpus, with F-measures of 86.6%, 85.0%, and 88.1%, respectively, for NeuroNER and 86.4%, 80.8%, and 91.7%, respectively, for BERT. Conclusions: These results show that these architectures perform considerably better than previous rule-based and conventional machine learning-based systems. Moreover, our analysis shows that pretrained word embeddings, and particularly contextualized embeddings for biomedical corpora, help to capture the complexities inherent to biomedical text. This work was supported by the Research Program of the Ministry of Economy and Competitiveness, Government of Spain (DeepEMR Project TIN2017-87548-C2-1-R).
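
    Neither code nor models accompany the abstract, but the second approach, fine-tuning BERT for token classification over negation labels, can be sketched in a few lines with the HuggingFace transformers library. The checkpoint, the flattened label set, and the example sentence below are illustrative assumptions, not the study's actual configuration.

        from transformers import AutoTokenizer, AutoModelForTokenClassification
        import torch

        # Illustrative flat label set; the paper uses a nested scheme that
        # captures cue and scope labels in a single step.
        labels = ["O", "B-CUE", "I-CUE", "B-SCOPE", "I-SCOPE"]
        tokenizer = AutoTokenizer.from_pretrained("bert-base-multilingual-cased")
        model = AutoModelForTokenClassification.from_pretrained(
            "bert-base-multilingual-cased", num_labels=len(labels))

        words = ["No", "evidence", "of", "pneumonia", "."]
        word_tags = ["B-CUE", "B-SCOPE", "I-SCOPE", "I-SCOPE", "O"]

        enc = tokenizer(words, is_split_into_words=True, return_tensors="pt")
        # Map word-level tags onto subword tokens; special tokens get -100,
        # which the token-classification loss ignores.
        aligned = [-100 if w is None else labels.index(word_tags[w])
                   for w in enc.word_ids()]
        out = model(**enc, labels=torch.tensor([aligned]))
        out.loss.backward()  # an optimizer step would complete one fine-tuning update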

    Negation and Speculation in NLP: A Survey, Corpora, Methods, and Applications

    Negation and speculation are universal linguistic phenomena that affect the performance of Natural Language Processing (NLP) applications, such as those for opinion mining and information retrieval, especially in biomedical data. In this article, we review the corpora annotated with negation and speculation in various natural languages and domains. Furthermore, we discuss ongoing research into recent rule-based, supervised, and transfer learning techniques for the detection of negated and speculative content. Many English corpora for various domains are now annotated with negation and speculation; moreover, the availability of annotated corpora in other languages has started to increase. However, this growth is insufficient to address these important phenomena in languages with limited resources; cross-lingual models and translation from well-resourced languages are acceptable alternatives. We also highlight the lack of consistent annotation guidelines and the shortcomings of the existing techniques, and suggest alternatives that may speed up progress in this research direction. Adding more syntactic features may alleviate limitations of the existing techniques, such as cue ambiguity and the detection of discontinuous scopes. In some NLP applications, including a negation- and speculation-aware component improves performance, yet this step is still often not addressed or not considered essential.
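
    To make the survey's point about discontinuous scopes concrete, here is a minimal sketch of how a negation annotation might be represented when the scope is interrupted by an intervening clause; the offset-based scheme is illustrative, not a specific corpus's format.

        from dataclasses import dataclass, field

        @dataclass
        class NegationAnnotation:
            cue: tuple                                  # (start, end) character offsets
            scope: list = field(default_factory=list)   # possibly discontinuous spans

        # The parenthetical splits the scope into two disjoint spans, which
        # plain BIO tagging over a single contiguous span cannot express.
        sent = "It is not, as far as we know, effective."
        ann = NegationAnnotation(cue=(6, 9), scope=[(0, 5), (30, 39)])
        print([sent[s:e] for s, e in ann.scope])  # ['It is', 'effective']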

    Contributions to information extraction for Spanish written biomedical text

    Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in, for example, improving healthcare experiences, supporting trainee education, or enabling biomedical research. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, and we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification; specifically, we study the different approaches and their transferability across two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data or external Named Entity Recognition tools. Although technically naive, it performs on par with more sophisticated systems and does not deviate considerably from approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue and scope detection and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field.
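
    The abstract describes UMLSmapper only at a high level; a toy sketch of the general technique, greedy longest-match lookup of surface terms against a concept dictionary, is given below. The two-entry lexicon is invented for illustration; the real system draws on the full UMLS Metathesaurus.

        lexicon = {
            ("dolor", "abdominal"): "C0000737",   # abdominal pain
            ("fiebre",): "C0015967",              # fever
        }
        max_len = max(len(k) for k in lexicon)

        def map_terms(tokens):
            i, found = 0, []
            while i < len(tokens):
                # Try the longest candidate span first, then shrink.
                for n in range(min(max_len, len(tokens) - i), 0, -1):
                    cand = tuple(t.lower() for t in tokens[i:i + n])
                    if cand in lexicon:
                        found.append((" ".join(tokens[i:i + n]), lexicon[cand]))
                        i += n
                        break
                else:
                    i += 1
            return found

        print(map_terms("Paciente con dolor abdominal y fiebre".split()))
        # [('dolor abdominal', 'C0000737'), ('fiebre', 'C0015967')]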

    Adverse drug reaction extraction on electronic health records written in Spanish

    This work focuses on the automatic extraction of Adverse Drug Reactions (ADRs) in Electronic Health Records (EHRs), that is, extracting a response to a medicine which is noxious and unintended and which occurs at doses normally used. From a Natural Language Processing (NLP) perspective, this was approached as a relation extraction task in which the drug is the causative agent of a disease, sign, or symptom, that is, the adverse reaction. ADR extraction from EHRs involves major challenges. First, ADRs are rare events: relations between drugs and diseases found in an EHR are seldom ADRs (they are often unrelated or, instead, related as treatment). This implies inference from samples with a skewed class distribution. Second, EHRs are written by experts, often under time pressure, employing rich medical jargon together with colloquial expressions (not always grammatical), and it is not infrequent to find misspellings and both standard and non-standard abbreviations. All this leads to high lexical variability. We explored several ADR detection algorithms and representations to characterize the ADR candidates. In addition, we assessed the tolerance of the ADR detection model to external noise, such as the incorrect detection of the medical entities involved in ADR extraction, i.e., drugs and diseases. We took the first steps toward ADR extraction in Spanish using a corpus of real EHRs.
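
    The skewed class distribution mentioned above is typically countered by reweighting the loss. As a hedged sketch of that idea (the features, data, and classifier below are placeholders, not the thesis's actual representation), candidate drug-disease pairs can be classified with class weighting:

        from sklearn.linear_model import LogisticRegression
        import numpy as np

        rng = np.random.default_rng(0)
        X = rng.normal(size=(1000, 20))            # candidate-pair feature vectors
        y = (rng.random(1000) < 0.05).astype(int)  # ~5% positives: ADRs are rare

        # class_weight="balanced" scales each class by
        # n_samples / (n_classes * class_count), so the scarce ADR class
        # contributes as much to the loss as the majority class.
        clf = LogisticRegression(class_weight="balanced", max_iter=1000)
        clf.fit(X, y)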

    Extracting information from radiology reports by Natural Language Processing and Deep Learning

    This work was supported by the NLP4RARE-CM-UC3M project, which was developed under the Interdisciplinary Projects Program for Young Researchers at University Carlos III of Madrid. The work was also supported by the Multiannual Agreement with UC3M in the line of Excellence of University Professors (EPUC3M17), in the context of the V PRICIT (Regional Programme of Research and Technological Innovation).

    Making decisions based on context: models and applications in cognitive sciences and natural language processing

    It is known that humans are capable of making decisions based on context and of generalizing what they have learned. This dissertation considers two related problem areas and proposes models that take context information into account; by including context, the proposed models exhibit strong performance in each of the problem areas considered. The first problem area focuses on a context association task studied in cognitive science, which evaluates the ability of a learning agent to associate specific stimuli with an appropriate response in particular spatial contexts. Four neural circuit models are proposed to model how the stimulus and context information are processed to produce a response. The neural networks are trained by modifying the strength of neural connections (weights) using principles of Hebbian learning. Such learning is considered biologically plausible, in contrast to backpropagation techniques, which do not have a solid neurophysiological basis. A series of theoretical results for the neural circuit models is established, guaranteeing convergence to an optimal configuration when all the stimulus-context pairs are provided during training. Among all the models, a specific model based on ideas from recommender systems, trained with a primal-dual update rule, achieves perfect performance in learning and generalizing the mapping from context-stimulus pairs to correct responses. The second problem area focuses on clinical natural language processing (NLP); a particular application is the development of deep-learning models for analyzing radiology reports. Four NLP tasks are considered: anatomy named entity recognition, negation detection, incidental finding detection, and clinical concept extraction. A hierarchical Recurrent Neural Network (RNN) is proposed for anatomy named entity recognition, which is then used to produce a set of features for incidental finding detection of pulmonary nodules. A clinical context word embedding model is obtained and used with an RNN for clinical concept extraction. Finally, feature-enriched RNN and transformer-based models with contextual word embeddings are proposed for negation detection. All these models take (clinical) context information into account. Evaluated on different datasets, the models achieve strong performance, largely outperforming the state of the art.
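
    As a minimal sketch of the Hebbian principle used to train the circuit models (the four architectures and the primal-dual variant are not reproduced here, and all sizes are toy values), co-activation of an input and a response unit strengthens the weight between them:

        import numpy as np

        n_in, n_out, eta = 8, 3, 0.1
        W = np.zeros((n_out, n_in))

        def hebbian_train(pairs, epochs=20):
            global W
            for _ in range(epochs):
                for x, r in pairs:              # x: stimulus-context input, r: response
                    W += eta * np.outer(r, x)   # Hebb: delta_w = eta * post * pre
                    W *= 0.99                   # mild decay keeps weights bounded

        x = np.zeros(n_in); x[[1, 5]] = 1.0     # stimulus unit 1 active in context unit 5
        r = np.array([1.0, 0.0, 0.0])           # correct response: unit 0
        hebbian_train([(x, r)])
        print(int(np.argmax(W @ x)))            # -> 0: the trained association wins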

    Exploration and adaptation of large language models for specialized domains

    Large language models have transformed the field of natural language processing (NLP). Their improved performance on various NLP benchmarks makes them a promising tool, including for application in specialized domains. Such domains are characterized by highly trained professionals with particular domain expertise. Since these experts are rare, improving the efficiency of their work with automated systems is especially desirable. However, domain-specific text resources hold various challenges for NLP systems, including distinct language, noisy and scarce data, and a high level of variation. Further, specialized domains present an increased need for transparent systems, since they are often applied in high-stakes settings. In this dissertation, we examine whether large language models (LLMs) can overcome some of these challenges and propose methods to effectively adapt them to domain-specific requirements. We first investigate the inner workings and abilities of LLMs and show how they can fill the gaps present in previous NLP algorithms for specialized domains. To this end, we explore the sources of errors produced by earlier systems to identify which of them can be addressed by using LLMs. Following this, we take a closer look at how information is processed within Transformer-based LLMs to better understand their capabilities. We find that their layers encode different dimensions of the input text; here, the contextual vector representation and the general language knowledge learned during pre-training are especially beneficial for solving the complex and multi-step tasks common in specialized domains. Following this exploration, we propose solutions for further adapting LLMs to the requirements of domain-specific tasks. We focus on the clinical domain, which incorporates many typical challenges found in specialized domains. We show how to improve generalization by integrating different domain-specific resources into our models. We further analyze the behavior of the produced models and propose a behavioral testing framework that can serve as a tool for communication with domain experts. Finally, we present an approach for incorporating the benefits of LLMs while fulfilling requirements such as interpretability and modularity. The presented solutions show improvements in performance on benchmark datasets and in manually conducted analyses with medical professionals. Our work provides both new insights into the inner workings of pre-trained language models and multiple adaptation methods, showing that LLMs can be an effective tool for NLP in specialized domains.
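
    The behavioral testing framework is not specified in the abstract; as a hedged sketch of the general idea behind behavioral testing of NLP models, a negation test perturbs an input and checks that the prediction flips. The predict function below is a stand-in for whatever model is under test, not the dissertation's framework.

        # `predict` is a placeholder; a real test would call the adapted LLM.
        def predict(text: str) -> str:
            return "absent" if " no " in f" {text} " else "present"

        def negation_flip_test(cases):
            failures = []
            for affirmed, negated in cases:
                if not (predict(affirmed) == "present"
                        and predict(negated) == "absent"):
                    failures.append((affirmed, negated))
            return failures

        cases = [("Patient reports fever.", "Patient reports no fever.")]
        print(negation_flip_test(cases))  # [] -> the toy model passes this test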

    The text classification pipeline: Starting shallow, going deeper

    An increasingly relevant and crucial subfield of Natural Language Processing (NLP), tackled in this PhD thesis from a computer science and engineering perspective, is Text Classification (TC). In this field too, the exceptional success of deep learning has sparked a boom over the past ten years. Text retrieval and categorization, information extraction, and summarization all rely heavily on TC. The literature has presented numerous datasets, models, and evaluation criteria. Even if languages such as Arabic, Chinese, Hindi, and others are employed in several works, from a computer science perspective the most used and referenced language in the TC literature is English, and this is also the language mainly referenced in the rest of this PhD thesis. Even though numerous machine learning techniques have shown outstanding results, classifier effectiveness depends on the capability to comprehend intricate relations and non-linear correlations in texts. To achieve this level of understanding, it is necessary to pay attention not only to the architecture of a model but also to the other stages of the TC pipeline. In an NLP framework, a range of text representation techniques and model designs have emerged, including large language models, which are capable of turning massive amounts of text into useful vector representations that effectively capture semantically significant information. The fact that this field has been investigated by numerous communities, including data mining, linguistics, and information retrieval, is of crucial interest. These communities frequently overlap but are mostly separate and do their research on their own; bringing researchers from these groups together to improve the multidisciplinary comprehension of the field is one of the objectives of this dissertation. Additionally, this dissertation examines text mining from both a traditional and a modern perspective. This thesis covers the whole TC pipeline in detail; the main contribution, however, is to investigate the impact of every element of the TC pipeline on the final performance of a TC model. The TC pipeline is discussed, covering both traditional and the most recent deep learning-based models. This pipeline consists of the State-Of-The-Art (SOTA) datasets used as benchmarks in the literature, text preprocessing, text representation, machine learning models for TC, evaluation metrics, and current SOTA results. In each chapter of this dissertation, I go over each of these steps, covering both the technical advancements and my most significant and recent findings from experiments and novel models. The advantages and disadvantages of the various options are also listed, along with a thorough comparison of the approaches. At the end of each chapter are my contributions, with experimental evaluations and discussions of the results obtained during my three-year PhD course. The experiments and the analysis related to each chapter (i.e., each element of the TC pipeline) are the main contributions that I provide, extending the basic knowledge of a regular survey on TC.
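
    As a minimal "shallow" instance of the pipeline the thesis dissects, preprocessing, representation, model, and prediction can be chained in a few lines; the data, labels, and model choice below are toy placeholders, not the dissertation's benchmarks.

        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.linear_model import LogisticRegression
        from sklearn.pipeline import make_pipeline

        train_texts = ["great phone, lovely screen", "terrible battery, awful build",
                       "awful screen", "lovely battery life"]
        train_labels = ["pos", "neg", "neg", "pos"]

        # Preprocessing (lowercasing, n-grams) + TF-IDF representation
        # + a linear classifier, i.e. the classic shallow TC pipeline.
        clf = make_pipeline(TfidfVectorizer(lowercase=True, ngram_range=(1, 2)),
                            LogisticRegression(max_iter=1000))
        clf.fit(train_texts, train_labels)
        print(clf.predict(["lovely phone"]))  # likely ['pos'] on this toy data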

    Extreme multi-label deep neural classification of Spanish health records according to the International Classification of Diseases

    This work concerns clinical text mining, a field of Natural Language Processing applied to the biomedical domain. The goal is to automate the task of medical coding. Electronic health records (EHRs) are documents containing clinical information about a patient's health. The diagnoses and medical procedures recorded in the EHR are coded with respect to the International Classification of Diseases (ICD). In fact, the ICD is the basis for compiling international health statistics and the standard for reporting diseases and health conditions. From a machine learning perspective, the goal is to solve an extreme multi-label text classification problem, since each health record is assigned multiple ICD codes from a set of more than 70,000 diagnostic terms. A significant amount of resources is devoted to medical coding, a laborious task currently performed manually: EHRs are lengthy narratives, and medical coders review the records written by physicians and assign the corresponding ICD codes. The texts are technical, as physicians employ specialized medical jargon that is nonetheless rich in abbreviations, acronyms, and spelling errors, since physicians document the records while carrying out actual clinical practice. To address the automatic classification of health records, we investigated and developed a set of deep learning text classification techniques.
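
    The extreme multi-label setup described above amounts to one sigmoid output per ICD code trained with per-label binary cross-entropy. A minimal PyTorch sketch follows; the architecture and sizes are illustrative stand-ins (only the >70,000-code output space comes from the abstract).

        import torch
        import torch.nn as nn

        n_features, n_codes = 512, 70_000       # encoded record -> ICD code space
        model = nn.Sequential(nn.Linear(n_features, 256), nn.ReLU(),
                              nn.Linear(256, n_codes))
        loss_fn = nn.BCEWithLogitsLoss()        # independent binary loss per code

        x = torch.randn(4, n_features)          # 4 encoded health records
        y = torch.zeros(4, n_codes)
        y[0, [17, 4213]] = 1.0                  # record 0 carries two ICD codes
        loss = loss_fn(model(x), y)
        loss.backward()
        # At inference, every code whose sigmoid exceeds a threshold is assigned.
        pred = torch.sigmoid(model(x)) > 0.5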