
    Clinical text data in machine learning: Systematic review

    Background: Clinical narratives represent the main form of communication within healthcare, providing a personalized account of patient history and assessments and offering rich information for clinical decision making. Natural language processing (NLP) has repeatedly demonstrated its ability to unlock evidence buried in clinical narratives. Machine learning can facilitate the rapid development of NLP tools by leveraging large amounts of text data.

    Objective: The main aim of this study is to provide systematic evidence on the properties of text data used to train machine learning approaches to clinical NLP. We also investigate the types of NLP tasks that have been supported by machine learning and how they can be applied in clinical practice.

    Methods: Our methodology was based on the guidelines for performing systematic reviews. In August 2018, we used PubMed, a multi-faceted interface, to perform a literature search against MEDLINE. We identified a total of 110 relevant studies and extracted information about the text data used to support machine learning, the NLP tasks supported, and their clinical applications. The data properties considered included size, provenance, collection methods, annotation, and any relevant statistics.

    Results: The vast majority of datasets used to train machine learning models included only hundreds or thousands of documents. Only 10 studies used tens of thousands of documents, with a handful of studies utilizing more. Relatively small datasets were used for training even when much larger datasets were available. The main reason for such poor data utilization is the annotation bottleneck faced by supervised machine learning algorithms. Active learning, which iteratively samples a subset of data for manual annotation, was explored as a strategy for minimizing the annotation effort while maximizing the predictive performance of the model. Supervised learning was used successfully where clinical codes, recorded alongside free-text notes in electronic health records, served as class labels. Similarly, distant supervision used an existing knowledge base to automatically annotate raw text. Where manual annotation was unavoidable, crowdsourcing was explored, but it remains unsuitable owing to the sensitive nature of the data considered. Besides their small volume, training data were typically sourced from a small number of institutions, thus offering no hard evidence about the transferability of machine learning models. The vast majority of studies focused on the task of text classification. Most commonly, the classification results were used to support phenotyping, prognosis, care improvement, resource management, and surveillance.

    Conclusions: We identified the data annotation bottleneck as one of the key obstacles to machine learning approaches in clinical NLP. Active learning and distant supervision were explored as ways of reducing the annotation effort. Future research in this field would benefit from alternatives such as data augmentation and transfer learning, or from unsupervised learning, which does not require data annotation.
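
    The active learning strategy described in this review is easy to make concrete. Below is a minimal sketch of pool-based active learning with least-confidence sampling, using scikit-learn; the clinical snippets, labels, and the toy oracle standing in for a human annotator are all invented for illustration and are not drawn from any reviewed study.

```python
# Minimal pool-based active learning loop with least-confidence sampling.
# A small labeled seed set grows by querying an "oracle" (a stand-in for
# a human annotator) on the most uncertain document in the unlabeled pool.
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

labeled = ["chest pain on exertion", "routine follow-up, no complaints"]
labels = [1, 0]  # invented binary labels, e.g. 1 = symptom present
pool = ["intermittent chest tightness", "medication refill requested",
        "shortness of breath at rest", "annual physical, unremarkable"]

oracle = lambda t: int("chest" in t or "breath" in t)  # toy annotator
vec = TfidfVectorizer().fit(labeled + pool)

for _ in range(2):  # each round: train, pick most uncertain, annotate
    clf = LogisticRegression().fit(vec.transform(labeled), labels)
    probs = clf.predict_proba(vec.transform(pool))
    idx = int(np.argmin(probs.max(axis=1)))  # least-confident document
    doc = pool.pop(idx)
    labeled.append(doc)
    labels.append(oracle(doc))  # manual annotation happens here
```

    In each round the model is retrained and the annotator is asked to label only the pool document the model is least sure about, which is how active learning trades annotation effort against predictive performance.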

    Text Classification

    There is an abundance of text data in this world, but most of it is raw. We need to extract information from this data to make use of it. One way to extract information from raw text is to apply informative labels drawn from a pre-defined fixed set, i.e., text classification. In this thesis, we focus on the general problem of text classification and work towards solving challenges associated with binary, multi-class, and multi-label classification. More specifically, we deal with the problems of (i) zero-shot labels during testing; (ii) active learning for text screening; (iii) multi-label classification under low supervision; (iv) structured label spaces; and (v) classifying pairs of words in raw text, i.e., relation extraction. For (i), we use a zero-shot classification model that utilizes independently learned semantic embeddings. Regarding (ii), we propose a novel active learning algorithm that reduces the problem of bias in naive active learning algorithms. For (iii), we propose a neural candidate-selector architecture that starts from a set of high-recall candidate labels to obtain high-precision predictions. In the case of (iv), we propose an attention-based neural tree decoder that recursively decodes an abstract into the ontology tree. For (v), we propose using second-order relations, derived by explicitly connecting pairs of words via context tokens, for improved relation extraction. We use a wide variety of both traditional and deep machine learning tools: traditional models such as multi-valued linear regression and logistic regression for (i) and (ii), deep convolutional neural networks for (iii), recurrent neural networks for (iv), and transformer networks for (v).
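
    The core idea behind challenge (i), zero-shot classification with semantic embeddings, can be sketched in a few lines: documents are matched against label descriptions in a shared vector space, so a label never seen during training can still be predicted. In the sketch below, TF-IDF stands in for the independently learned embeddings the thesis actually uses, and the labels, descriptions, and documents are invented.

```python
# Sketch of embedding-based zero-shot classification: each document is
# assigned the label whose textual description is nearest in the shared
# vector space, including labels with no training examples.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

label_descriptions = {
    "sports":   "games athletes teams scores tournament match",
    "finance":  "markets stocks earnings banks inflation trading",
    "medicine": "patients diagnosis treatment clinical hospital drugs",
}
docs = ["the central bank raised rates amid inflation fears",
        "the striker scored twice in the final match"]

vec = TfidfVectorizer().fit(list(label_descriptions.values()) + docs)
sims = cosine_similarity(vec.transform(docs),
                         vec.transform(label_descriptions.values()))
names = list(label_descriptions)
for doc, row in zip(docs, sims):
    print(doc, "->", names[row.argmax()])
```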

    Bigram feature extraction and conditional random fields model to improve text classification clinical trial document

    In the field of health and medicine, there is a very important term known as clinical trials: studies that investigate the safest way to treat patients. Clinical trials are usually written as unstructured free text, which computers cannot readily interpret. The aim of this paper is to classify the texts of cancer clinical trial documents consisting of unstructured free text taken from cancer clinical trial protocols. The proposed approach combines conditional random fields with bigram features. A new classification model for cancer clinical trial document text is proposed to compete with other methods in terms of precision, recall, and F1 score. The results of this study improve on previous results, with a precision of 88.07, a recall of 88.05, and an F1 score of 88.06.
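
    As a rough illustration of pairing a CRF with bigram features (not the authors' implementation), the sketch below labels the ordered lines of a clinical trial protocol using the sklearn-crfsuite library; the protocol lines and the label set are invented examples.

```python
# Hedged sketch: a linear-chain CRF labels each line of a clinical trial
# protocol, with the bigrams of each line as its features.
# Requires: pip install sklearn-crfsuite
import sklearn_crfsuite

def bigram_features(line):
    tokens = line.lower().split()
    return {f"bg={a}_{b}": 1.0 for a, b in zip(tokens, tokens[1:])}

# One "sequence" = the ordered lines of one protocol section.
protocols = [
    ["inclusion criteria age over 18", "histologically confirmed carcinoma"],
    ["exclusion criteria prior chemotherapy", "pregnant or nursing women"],
]
labels = [["inclusion", "inclusion"], ["exclusion", "exclusion"]]

X = [[bigram_features(line) for line in doc] for doc in protocols]
crf = sklearn_crfsuite.CRF(algorithm="lbfgs", c1=0.1, c2=0.1,
                           max_iterations=50)
crf.fit(X, labels)
print(crf.predict([[bigram_features("exclusion criteria renal failure")]]))
```

    Treating the lines as a sequence lets the CRF exploit the order of protocol sections, which plain per-line classifiers ignore.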

    Natural Language Processing – Finding the Missing Link for Oncologic Data, 2022

    Oncology, like most medical specialties, is undergoing a data revolution at the center of which lie vast and growing amounts of clinical data in unstructured, semi-structured, and structured formats. Artificial intelligence approaches are widely employed in research endeavors in an attempt to harness electronic medical record data to advance patient outcomes. The use of clinical oncologic data, although collected on a large scale, particularly with the increased implementation of electronic medical records, remains limited due to missing, incorrect, or manually entered data in registries and the lack of resource allocation to data curation in real-world settings. Natural language processing (NLP) may provide an avenue to extract data from electronic medical records, and as a result its use has grown considerably in medicine for documentation, outcome analysis, phenotyping, and clinical trial eligibility. Barriers to NLP persist, including the inability to aggregate findings across studies due to the use of different methods, and significant heterogeneity at all levels, with important parameters such as patient comorbidities and performance status lacking implementation in AI approaches. The goal of this review is to provide an updated overview of NLP and the current state of its application in oncology for clinicians and researchers who wish to implement NLP to augment registries and/or advance research projects.

    Biomedical Information Extraction Pipelines for Public Health in the Age of Deep Learning

    Unstructured texts containing biomedical information from sources such as electronic health records, scientific literature, discussion forums, and social media offer an opportunity to extract information for a wide range of applications in biomedical informatics. Building scalable and efficient pipelines for natural language processing and extraction of biomedical information plays an important role in the implementation and adoption of applications in areas such as public health. Advancements in machine learning and deep learning techniques have enabled the rapid development of such pipelines. This dissertation presents entity extraction pipelines for two public health applications: virus phylogeography and pharmacovigilance. For virus phylogeography, geographical locations are extracted from biomedical scientific texts for metadata enrichment in the GenBank database, which contains 2.9 million virus nucleotide sequences. For pharmacovigilance, tools are developed to extract adverse drug reactions from social media posts, opening avenues for post-market drug surveillance from non-traditional sources. Across these pipelines, high variance is observed in extraction performance among the entities of interest, even while using state-of-the-art neural network architectures. To explain the variation, linguistic measures are proposed to serve as indicators of entity extraction performance and to provide deeper insight into the domain complexity and the challenges associated with entity extraction. For both the phylogeography and pharmacovigilance pipelines presented in this work, the annotated datasets and applications are open source and freely available to the public to foster further research in public health.
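
    For the phylogeography use case, extracting geographic locations is essentially a named entity recognition task. Below is a minimal stand-in using spaCy's off-the-shelf English model rather than the dissertation's trained pipeline; the example sentence is invented.

```python
# Illustrative geographic-entity extraction: pull location mentions
# (GPE/LOC entities) from biomedical text, as a generic proxy for the
# phylogeography pipeline's metadata enrichment step.
# Requires: pip install spacy && python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load("en_core_web_sm")
text = ("The H5N1 isolates were collected from poultry farms "
        "in Vietnam and later sequenced in Hong Kong.")
locations = [ent.text for ent in nlp(text).ents
             if ent.label_ in ("GPE", "LOC")]
print(locations)  # expected: ['Vietnam', 'Hong Kong']
```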

    Knowledge Extraction and Prediction from Behavior Science Randomized Controlled Trials: A Case Study in Smoking Cessation

    Get PDF
    Due to the fast pace at which randomized controlled trials (RCTs) are published in the health domain, researchers, consultants, and policymakers would benefit from more automatic ways to process them, both by extracting relevant information and by automating the meta-analysis process. In this paper, we present a novel methodology based on natural language processing and reasoning models to 1) extract relevant information from RCTs and 2) predict potential outcome values in novel scenarios, given the extracted knowledge, in the domain of behavior change for smoking cessation.
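
    As a toy illustration of step 1, the snippet below pulls reported outcome values out of RCT abstract text with a regular expression; the pattern and the sentence are invented and far simpler than the NLP-and-reasoning pipeline the paper actually proposes.

```python
# Toy extraction of reported outcome values from RCT abstract text.
# A real pipeline would handle many phrasings; this regex covers only
# the "<value>% in the <arm> arm" pattern used in the invented example.
import re

abstract = ("At 6 months, the cessation rate was 23.4% in the "
            "intervention arm versus 11.2% in the control arm (n=412).")

pattern = re.compile(r"(\d+(?:\.\d+)?)\s*%\s+in the (\w+) arm")
for value, arm in pattern.findall(abstract):
    print(f"{arm}: {value}%")
# intervention: 23.4%
# control: 11.2%
```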