11 research outputs found

    "When they say weed causes depression, but it's your fav antidepressant": Knowledge-aware Attention Framework for Relationship Extraction

    With the increasing legalization of medical and recreational cannabis use, more research is needed to understand the association between depression and consumer behavior related to cannabis consumption. Big social media data has the potential to provide public health analysts with deeper insights about these associations. In this interdisciplinary study, we demonstrate the value of incorporating domain-specific knowledge in the learning process to identify the relationships between cannabis use and depression. We develop an end-to-end knowledge-infused deep learning framework (Gated-K-BERT) that leverages the pre-trained BERT language representation model and a domain-specific declarative knowledge source, the Drug Abuse Ontology (DAO), to jointly extract entities and their relationships using a gated fusion sharing mechanism. Our model is further tailored to focus on the entity mentions in the sentence through an entity-position-aware attention layer, where the ontology is used to locate the positions of the target entities. Experimental results show that combining the knowledge-aware attentive representation with BERT extracts the cannabis-depression relationship with better coverage than state-of-the-art relation extractors.
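    The gated fusion idea described above, blending a contextual BERT embedding with an ontology-derived knowledge embedding, can be sketched roughly as follows. The dimensions, the simple sigmoid gate, and all variable names here are illustrative assumptions, not the paper's exact architecture:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gated_fusion(bert_emb, kg_emb, W_g, b_g):
    """Blend a contextual embedding with a knowledge embedding.

    A learned gate decides, per dimension, how much contextual
    signal versus knowledge signal to keep in the fused vector.
    """
    gate = sigmoid(np.concatenate([bert_emb, kg_emb]) @ W_g + b_g)
    return gate * bert_emb + (1.0 - gate) * kg_emb

rng = np.random.default_rng(0)
d = 8
bert_emb = rng.normal(size=d)        # contextual token embedding (illustrative)
kg_emb = rng.normal(size=d)          # ontology-derived embedding (illustrative)
W_g = rng.normal(size=(2 * d, d))    # gate weights, learned in practice
b_g = np.zeros(d)

fused = gated_fusion(bert_emb, kg_emb, W_g, b_g)
print(fused.shape)  # (8,)
```

    Because the gate lies in (0, 1), each fused dimension is a convex combination of the two input embeddings, so the knowledge source can never fully overwrite the contextual signal.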

    De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of Deep Learning Models

    Electronic Medical Records (EMRs) contain clinical narrative text that is of great potential value to medical researchers. However, this information is mixed with Personally Identifiable Information (PII) that presents risks to patient and clinician confidentiality. This paper presents an end-to-end de-identification framework to automatically remove PII from hospital discharge summaries. Our corpus included 600 hospital discharge summaries extracted from the EMRs of two principal referral hospitals in Sydney, Australia. Our end-to-end de-identification framework consists of three components: 1) Annotation: labelling of PII in the 600 hospital discharge summaries using five pre-defined categories: person, address, date of birth, identification number and phone number; 2) Modelling: training six named entity recognition (NER) deep learning base-models on balanced and imbalanced datasets, and evaluating ensembles that combine all six base-models, the three base-models with the best F1 scores, and the three base-models with the best recall scores, using token-level majority voting and stacking methods; and 3) De-identification: removing PII from the hospital discharge summaries. Our results showed that the ensemble combining the three base-models with the best F1 scores via a stacking Support Vector Machine (SVM) achieved excellent results, with an F1 score of 99.16% on the test set of our corpus. We also evaluated the robustness of our modelling component on the 2014 i2b2 de-identification dataset. Our ensemble model, which uses token-level majority voting over all six base-models, achieved the highest F1 score of 96.24% at strict entity matching and the highest F1 score of 98.64% at binary token-level matching compared to two state-of-the-art methods. The framework provides a robust solution for safely de-identifying clinical narrative text.
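    The token-level majority voting used to combine the base-models can be sketched as follows. The label scheme and the three toy model outputs are invented for illustration; the paper's actual models are neural NER taggers:

```python
from collections import Counter

def majority_vote(predictions):
    """Token-level majority voting across NER base-models.

    predictions: a list of label sequences, one per base-model,
    all aligned to the same token positions. For each token, the
    most common label across models wins.
    """
    n_tokens = len(predictions[0])
    voted = []
    for i in range(n_tokens):
        labels = [p[i] for p in predictions]
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted

# Three hypothetical base-models labelling the same five tokens.
model_a = ["O", "B-PERSON", "I-PERSON", "O", "B-DATE"]
model_b = ["O", "B-PERSON", "O",        "O", "B-DATE"]
model_c = ["O", "B-PERSON", "I-PERSON", "O", "O"]

print(majority_vote([model_a, model_b, model_c]))
# → ['O', 'B-PERSON', 'I-PERSON', 'O', 'B-DATE']
```

    The stacking variant reported in the paper replaces this per-token vote with a meta-classifier (an SVM) trained on the base-models' outputs.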

    Using machine learning for automated de-identification and clinical coding of free text data in electronic medical records

    The widespread adoption of Electronic Medical Records (EMRs) in hospitals continues to increase the amount of patient data that are digitally stored. Although the primary use of the EMR is to support patient care by making all relevant information accessible, governments and health organisations are looking for ways to unleash the potential of these data for secondary purposes, including clinical research, disease surveillance and automation of healthcare processes and workflows. EMRs include large quantities of free text documents that contain valuable information. The greatest challenges in using the free text data in EMRs include the removal of personally identifiable information and the extraction of relevant information for specific tasks such as clinical coding. Machine learning-based automated approaches can potentially address these challenges. This thesis aims to explore and improve the performance of machine learning models for automated de-identification and clinical coding of free text data in EMRs, as captured in hospital discharge summaries, and to facilitate the application of these approaches in real-world use cases. It does so by 1) implementing an end-to-end de-identification framework using an ensemble of deep learning models; 2) developing a web-based system for de-identification of free text (DEFT) with an interactive learning loop; 3) proposing and implementing a hierarchical label-wise attention transformer model (HiLAT) for explainable International Classification of Diseases (ICD) coding; and 4) investigating the use of extreme multi-label long text transformer-based models for automated ICD coding. The key findings include: 1) An end-to-end framework using an ensemble of deep learning base-models achieved excellent performance on the de-identification task. 2) A new web-based de-identification software system (DEFT) can be readily adopted by data custodians and researchers to perform de-identification of free text in EMRs.
3) A novel domain-specific transformer-based model (HiLAT) achieved state-of-the-art (SOTA) results for predicting ICD codes on a Medical Information Mart for Intensive Care (MIMIC-III) dataset comprising the discharge summaries (n=12,808) that are coded with at least one of the 50 most frequent diagnosis and procedure codes. In addition, the label-wise attention scores for the tokens in the discharge summary provide a potential explainability tool for checking the face validity of ICD code predictions. 4) An optimised transformer-based model, PLM-ICD, achieved the latest SOTA results for ICD coding on all the discharge summaries of the MIMIC-III dataset (n=59,652). The segmentation method, which splits the long text into consecutive small chunks, addressed the problem of applying transformer-based models to long text datasets. However, using transformer-based models on extremely large label sets needs further research. These findings demonstrate that the de-identification and clinical coding tasks can benefit from the application of machine learning approaches, present practical tools for implementing these approaches, and highlight priorities for further research.
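    The segmentation step mentioned in finding 4, splitting a long document into consecutive fixed-size chunks so that each fits a transformer's input limit, can be sketched as follows. The chunk size and the downstream aggregation are assumptions for illustration, not the thesis's exact configuration:

```python
def chunk_tokens(token_ids, chunk_size):
    """Split a long token sequence into consecutive fixed-size chunks.

    Each chunk can then be encoded separately by a transformer with a
    limited input length; the per-chunk representations are aggregated
    downstream (e.g. pooled) before the label-wise classifier.
    """
    return [token_ids[i:i + chunk_size]
            for i in range(0, len(token_ids), chunk_size)]

# A toy 10-token document split into chunks of at most 4 tokens.
tokens = list(range(10))
print(chunk_tokens(tokens, 4))
# → [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
```

    Note the final chunk may be shorter than the others; in practice it would be padded before batching.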

    Synthetic Data Sharing and Estimation of Viable Dynamic Treatment Regimes with Observational Data

    There is significant public demand for rapid data-driven scientific investigations using observational data, especially in personalized healthcare. This dissertation addresses three complementary challenges of analyzing complex observational data in biomedical research. The ethical challenge reflects regulatory policies and social norms regarding data privacy, which tend to emphasize data security at the expense of effective data sharing, resulting in fragmentation and scarcity of available research data. In Chapter 2, we propose the DataSifter approach, which mediates this challenge by facilitating the generation of realistic synthetic data from sensitive datasets containing static and time-varying variables. The DataSifter method relies on robust imputation methods, including missForest and an iterative imputation technique for time-varying variables using the Generalized Linear Mixed Model (GLMM) and the Random Effects-Expectation Maximization tree (RE-EM tree). Applications demonstrate that, under a moderate level of obfuscation, the DataSifter guarantees sufficient per-subject perturbation of time-invariant data and preserves the joint distribution and the energy of the entire data archive, which ensures high utility and analytical value of the time-varying information. This promotes accelerated innovation by enabling secure sharing between data governors and researchers. Once sensitive data can be securely shared, effective analytical tools are needed to provide viable individualized data-driven solutions. Observational data is an important source for estimating dynamic treatment regimes (DTRs) that guide personalized treatment decisions. The second challenge concerns the viability of optimal DTR estimation, which may be affected by observed treatment combinations that are not applicable to future patients for clinical or economic reasons.
In Chapter 3, we develop restricted Tree-based Reinforcement Learning to accommodate restrictions on feasible treatment combinations in observational studies by truncating possible treatment options based on patient history in a multi-stage, multi-treatment setting. The proposed method provides optimal treatment recommendations for patients regarding viable treatment options only, and utilizes all valid observations in the dataset to avoid selection bias and improve efficiency. In addition to structured data, unstructured data, such as free text and voice notes, have rapidly become an essential component of many biomedical studies based on clinical and health data, including electronic health records (EHRs), providing extra patient information. The last two chapters of my dissertation (Chapters 4 and 5) expand the methods developed in the previous two projects by utilizing novel natural language processing (NLP) techniques to address the third challenge: handling unstructured data elements. In Chapter 4, we construct a text data anonymization tool, DataSifterText, which generates synthetic free-text data to protect sensitive unstructured data, such as personal health information. In Chapter 5, we propose to enhance the precision of optimal DTR estimation by acquiring additional information contained in clinical notes with information extraction (IE) techniques. Simulation studies and an application to blood pressure management in intensive care units demonstrated that IE techniques can provide extra patient information and more accurate counterfactual outcome modeling, owing to the potentially enhanced sample size and a wider pool of candidate tailoring variables for optimal DTR estimation. The statistical methods presented in this thesis provide theoretical and practical solutions for privacy-aware, utility-preserving, large-scale data sharing and clinically meaningful optimal DTR estimation.
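    The core idea of Chapter 3's restriction, excluding infeasible treatment options for a patient before choosing the best one, can be illustrated with a minimal sketch. The value estimates and feasibility mask here are toy inputs; the thesis's actual method estimates values within a tree-based reinforcement learning procedure:

```python
import numpy as np

def viable_best_treatment(q_values, viable_mask):
    """Pick the highest-valued treatment among viable options only.

    q_values: estimated value of each treatment option.
    viable_mask: True where the option is feasible given the
    patient's history; infeasible options are set to -inf so the
    argmax can never select them.
    """
    masked = np.where(viable_mask, q_values, -np.inf)
    return int(np.argmax(masked))

q = np.array([0.7, 0.9, 0.4])          # illustrative value estimates
mask = np.array([True, False, True])   # option 1 not viable for this patient
print(viable_best_treatment(q, mask))  # → 0
```

    Without the mask, option 1 would be recommended despite being infeasible; with it, the recommendation falls back to the best viable option.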
The general theoretical formulation of the methods leads to the design of tools and direct applications that are expected to go beyond the biomedical and health analytics domains.
    PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
    http://deepblue.lib.umich.edu/bitstream/2027.42/166113/1/zhounina_1.pd