23,467 research outputs found

    Automated Detection of Systematic Off-label Drug Use in Free Text of Electronic Medical Records.

    Get PDF
    Off-label use of a drug occurs when it is used in a manner that deviates from its FDA label. Studies estimate that 21% of prescriptions are off-label, with only 27% of those uses supported by evidence of safety and efficacy. We have developed methods to detect population level off-label usage using computationally efficient annotation of free text from clinical notes to generate features encoding empirical information about drug-disease mentions. By including additional features encoding prior knowledge about drugs, diseases, and known usage, we trained a highly accurate predictive model that was used to detect novel candidate off-label usages in a very large clinical corpus. We show that the candidate uses are plausible and can be prioritized for further analysis in terms of safety and efficacy

    Mining the Medical and Patent Literature to Support Healthcare and Pharmacovigilance

    Get PDF
    Recent advancements in healthcare practices and the increasing use of information technology in the medical domain has lead to the rapid generation of free-text data in forms of scientific articles, e-health records, patents, and document inventories. This has urged the development of sophisticated information retrieval and information extraction technologies. A fundamental requirement for the automatic processing of biomedical text is the identification of information carrying units such as the concepts or named entities. In this context, this work focuses on the identification of medical disorders (such as diseases and adverse effects) which denote an important category of concepts in the medical text. Two methodologies were investigated in this regard and they are dictionary-based and machine learning-based approaches. Futhermore, the capabilities of the concept recognition techniques were systematically exploited to build a semantic search platform for the retrieval of e-health records and patents. The system facilitates conventional text search as well as semantic and ontological searches. Performance of the adapted retrieval platform for e-health records and patents was evaluated within open assessment challenges (i.e. TRECMED and TRECCHEM respectively) wherein the system was best rated in comparison to several other competing information retrieval platforms. Finally, from the medico-pharma perspective, a strategy for the identification of adverse drug events from medical case reports was developed. Qualitative evaluation as well as an expert validation of the developed system's performance showed robust results. In conclusion, this thesis presents approaches for efficient information retrieval and information extraction from various biomedical literature sources in the support of healthcare and pharmacovigilance. The applied strategies have potential to enhance the literature-searches performed by biomedical, healthcare, and patent professionals. The applied strategies have potential to enhance the literature-searches performed by biomedical, healthcare, and patent professionals. This can promote the literature-based knowledge discovery, improve the safety and effectiveness of medical practices, and drive the research and development in medical and healthcare arena

    Doctor of Philosophy

    Get PDF
    dissertationManual annotation of clinical texts is often used as a method of generating reference standards that provide data for training and evaluation of Natural Language Processing (NLP) systems. Manually annotating clinical texts is time consuming, expensive, and requires considerable cognitive effort on the part of human reviewers. Furthermore, reference standards must be generated in ways that produce consistent and reliable data but must also be valid in order to adequately evaluate the performance of those systems. The amount of labeled data necessary varies depending on the level of analysis, the complexity of the clinical use case, and the methods that will be used to develop automated machine systems for information extraction and classification. Evaluating methods that potentially reduce cost, manual human workload, introduce task efficiencies, and reduce the amount of labeled data necessary to train NLP tools for specific clinical use cases are active areas of research inquiry in the clinical NLP domain. This dissertation integrates a mixed methods approach using methodologies from cognitive science and artificial intelligence with manual annotation of clinical texts. Aim 1 of this dissertation identifies factors that affect manual annotation of clinical texts. These factors are further explored by evaluating approaches that may introduce efficiencies into manual review tasks applied to two different NLP development areas - semantic annotation of clinical concepts and identification of information representing Protected Health Information (PHI) as defined by HIPAA. Both experiments integrate iv different priming mechanisms using noninteractive and machine-assisted methods. The main hypothesis for this research is that integrating pre-annotation or other machineassisted methods within manual annotation workflows will improve efficiency of manual annotation tasks without diminishing the quality of generated reference standards

    Spark NLP: Natural Language Understanding at Scale

    Get PDF
    Spark NLP is a Natural Language Processing (NLP) library built on top of Apache Spark ML. It provides simple, performant and accurate NLP annotations for machine learning pipelines that can scale easily in a distributed environment. Spark NLP comes with 1100 pre trained pipelines and models in more than 192 languages. It supports nearly all the NLP tasks and modules that can be used seamlessly in a cluster. Downloaded more than 2.7 million times and experiencing nine times growth since January 2020, Spark NLP is used by 54% of healthcare organizations as the worlds most widely used NLP library in the enterprise.Comment: =Accepted as a publication in Elsevier, Software Impacts Journal. arXiv admin note: substantial text overlap with arXiv:2012.0400

    Improving Clinical Document Understanding on COVID-19 Research with Spark NLP

    Get PDF
    Following the global COVID-19 pandemic, the number of scientific papers studying the virus has grown massively, leading to increased interest in automated literate review. We present a clinical text mining system that improves on previous efforts in three ways. First, it can recognize over 100 different entity types including social determinants of health, anatomy, risk factors, and adverse events in addition to other commonly used clinical and biomedical entities. Second, the text processing pipeline includes assertion status detection, to distinguish between clinical facts that are present, absent, conditional, or about someone other than the patient. Third, the deep learning models used are more accurate than previously available, leveraging an integrated pipeline of state-of-the-art pretrained named entity recognition models, and improving on the previous best performing benchmarks for assertion status detection. We illustrate extracting trends and insights, e.g. most frequent disorders and symptoms, and most common vital signs and EKG findings, from the COVID-19 Open Research Dataset (CORD-19). The system is built using the Spark NLP library which natively supports scaling to use distributed clusters, leveraging GPUs, configurable and reusable NLP pipelines, healthcare specific embeddings, and the ability to train models to support new entity types or human languages with no code changes.Comment: Accepted to SDU (Scientific Document Understanding) workshop at AAAI 202

    Contributions to information extraction for spanish written biomedical text

    Get PDF
    285 p.Healthcare practice and clinical research produce vast amounts of digitised, unstructured data in multiple languages that are currently underexploited, despite their potential applications in improving healthcare experiences, supporting trainee education, or enabling biomedical research, for example. To automatically transform those contents into relevant, structured information, advanced Natural Language Processing (NLP) mechanisms are required. In NLP, this task is known as Information Extraction. Our work takes place within this growing field of clinical NLP for the Spanish language, as we tackle three distinct problems. First, we compare several supervised machine learning approaches to the problem of sensitive data detection and classification. Specifically, we study the different approaches and their transferability in two corpora, one synthetic and the other authentic. Second, we present and evaluate UMLSmapper, a knowledge-intensive system for biomedical term identification based on the UMLS Metathesaurus. This system recognises and codifies terms without relying on annotated data nor external Named Entity Recognition tools. Although technically naive, it performs on par with more evolved systems, and does not exhibit a considerable deviation from other approaches that rely on oracle terms. Finally, we present and exploit a new corpus of real health records manually annotated with negation and uncertainty information: NUBes. This corpus is the basis for two sets of experiments, one on cue andscope detection, and the other on assertion classification. Throughout the thesis, we apply and compare techniques of varying levels of sophistication and novelty, which reflects the rapid advancement of the field

    The Benefits of Word Embeddings Features for Active Learning in Clinical Information Extraction

    Get PDF
    This study investigates the use of unsupervised word embeddings and sequence features for sample representation in an active learning framework built to extract clinical concepts from clinical free text. The objective is to further reduce the manual annotation effort while achieving higher effectiveness compared to a set of baseline features. Unsupervised features are derived from skip-gram word embeddings and a sequence representation approach. The comparative performance of unsupervised features and baseline hand-crafted features in an active learning framework are investigated using a wide range of selection criteria including least confidence, information diversity, information density and diversity, and domain knowledge informativeness. Two clinical datasets are used for evaluation: the i2b2/VA 2010 NLP challenge and the ShARe/CLEF 2013 eHealth Evaluation Lab. Our results demonstrate significant improvements in terms of effectiveness as well as annotation effort savings across both datasets. Using unsupervised features along with baseline features for sample representation lead to further savings of up to 9% and 10% of the token and concept annotation rates, respectively
    • …
    corecore