"When they say weed causes depression, but it's your fav antidepressant": Knowledge-aware Attention Framework for Relationship Extraction
With the increasing legalization of medical and recreational use of cannabis,
more research is needed to understand the association between depression and
consumer behavior related to cannabis consumption. Big social media data has
the potential to provide deeper insights into these associations for public health
analysts. In this interdisciplinary study, we demonstrate the value of
incorporating domain-specific knowledge in the learning process to identify the
relationships between cannabis use and depression. We develop an end-to-end
knowledge infused deep learning framework (Gated-K-BERT) that leverages the
pre-trained BERT language representation model and domain-specific declarative
knowledge source (Drug Abuse Ontology (DAO)) to jointly extract entities and
their relationships using a gated fusion sharing mechanism. Our model is further
tailored to focus on the entity mentions in the sentence through an
entity-position-aware attention layer, in which the ontology is used to locate the
positions of the target entities. Experimental results show that combining the
knowledge-aware attentive representation with BERT extracts the
cannabis-depression relationship with better coverage than a
state-of-the-art relation extractor.
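The two mechanisms named in the abstract, gated fusion of text and knowledge representations and entity-position-aware attention, can be sketched as follows. This is an illustrative simplification, not the paper's implementation: the weight vectors, the exponential distance decay, and the function names are assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def gated_fusion(h_text, h_kg, w_text, w_kg, b):
    """Fuse a BERT token vector with a knowledge (ontology) vector.

    A scalar gate g in (0, 1) decides how much of each view to keep;
    the weights w_text, w_kg and bias b would be learned in practice.
    """
    g = sigmoid(sum(wt * ht for wt, ht in zip(w_text, h_text))
                + sum(wk * hk for wk, hk in zip(w_kg, h_kg)) + b)
    # Convex combination of the two representations.
    return [g * ht + (1.0 - g) * hk for ht, hk in zip(h_text, h_kg)]

def position_aware_attention(scores, positions, entity_pos, decay=0.5):
    """Re-weight token attention scores by distance to an entity mention.

    entity_pos is the token index of the target entity, here assumed to
    be located via an ontology lookup; decay is an assumed hyperparameter.
    """
    weighted = [s * math.exp(-decay * abs(p - entity_pos))
                for s, p in zip(scores, positions)]
    z = sum(weighted)
    return [w / z for w in weighted]

# With zero weights the gate is 0.5, so fusion is a plain average:
fused = gated_fusion([1.0, 2.0], [0.0, 0.0], [0.0, 0.0], [0.0, 0.0], 0.0)
# -> [0.5, 1.0]
```

The gate lets the model down-weight the ontology signal when the sentence context alone is informative, and vice versa; the position-aware re-weighting concentrates attention near the cannabis and depression mentions.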
De-identifying Hospital Discharge Summaries: An End-to-End Framework using Ensemble of Deep Learning Models
Electronic Medical Records (EMRs) contain clinical narrative text that is of
great potential value to medical researchers. However, this information is
mixed with Personally Identifiable Information (PII) that presents risks to
patient and clinician confidentiality. This paper presents an end-to-end
de-identification framework to automatically remove PII from hospital discharge
summaries. Our corpus included 600 hospital discharge summaries which were
extracted from the EMRs of two principal referral hospitals in Sydney,
Australia. Our end-to-end de-identification framework consists of three
components: 1) Annotation: labelling of PII in the 600 hospital discharge
summaries using five pre-defined categories: person, address, date of birth,
identification number, phone number; 2) Modelling: training six named entity
recognition (NER) deep learning base-models on balanced and imbalanced
datasets; and evaluating ensembles that combine all six base-models, the three
base-models with the best F1 scores and the three base-models with the best
recall scores respectively, using token-level majority voting and stacking
methods; and 3) De-identification: removing PII from the hospital discharge
summaries. Our results showed that the ensemble model combined using the
stacking Support Vector Machine (SVM) method on the three base-models with the
best F1 scores achieved excellent results with an F1 score of 99.16% on the test
set of our corpus. We also evaluated the robustness of our modelling component
on the 2014 i2b2 de-identification dataset. Our ensemble model, which uses the
token-level majority voting method on all six base-models, achieved the highest
F1 score of 96.24% at strict entity matching and the highest F1 score of 98.64%
at binary token-level matching compared to two state-of-the-art methods. The
framework provides a robust solution for safely de-identifying clinical
narrative text.
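The token-level majority voting used to combine the base-models can be sketched as below. This is a minimal illustration under the assumption that all base-models emit BIO-style tag sequences aligned to the same tokens; the tag names and function name are hypothetical.

```python
from collections import Counter

def majority_vote(model_predictions):
    """Token-level majority voting over per-model tag sequences.

    model_predictions: one tag sequence per base-model, each a list of
    tags aligned to the same tokenised discharge summary. For each token,
    the most frequent tag across models wins (ties go to the tag seen
    first, i.e. the earlier model in the list).
    """
    n_tokens = len(model_predictions[0])
    voted = []
    for i in range(n_tokens):
        tags = [pred[i] for pred in model_predictions]
        voted.append(Counter(tags).most_common(1)[0][0])
    return voted

# Three base-models disagreeing on a three-token summary fragment:
preds = [
    ["B-PERSON", "O", "B-DATE"],
    ["B-PERSON", "O", "O"],
    ["O",        "O", "B-DATE"],
]
consensus = majority_vote(preds)
# -> ["B-PERSON", "O", "B-DATE"]
```

The stacking variant described in the abstract would instead feed the per-model tags as features into an SVM meta-classifier rather than counting votes directly.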
Using machine learning for automated de-identification and clinical coding of free text data in electronic medical records
The widespread adoption of Electronic Medical Records (EMRs) in hospitals continues to increase the amount of patient data that are digitally stored. Although the primary use of the EMR is to support patient care by making all relevant information accessible, governments and health organisations are looking for ways to unleash the potential of these data for secondary purposes, including clinical research, disease surveillance and automation of healthcare processes and workflows.
EMRs include large quantities of free text documents that contain valuable information. The greatest challenges in using the free text data in EMRs include the removal of personally identifiable information and the extraction of relevant information for specific tasks such as clinical coding. Machine learning-based automated approaches can potentially address these challenges.
This thesis aims to explore and improve the performance of machine learning models for automated de-identification and clinical coding of free text data in EMRs, as captured in hospital discharge summaries, and facilitate the applications of these approaches in real-world use cases. It does so by 1) implementing an end-to-end de-identification framework using an ensemble of deep learning models; 2) developing a web-based system for de-identification of free text (DEFT) with an interactive learning loop; 3) proposing and implementing a hierarchical label-wise attention transformer model (HiLAT) for explainable International Classification of Diseases (ICD) coding; and 4) investigating the use of extreme multi-label long text transformer-based models for automated ICD coding.
The key findings include: 1) An end-to-end framework using an ensemble of deep learning base-models achieved excellent performance on the de-identification task. 2) A new web-based de-identification software system (DEFT) can be readily and easily adopted by data custodians and researchers to perform de-identification of free text in EMRs. 3) A novel domain-specific transformer-based model (HiLAT) achieved state-of-the-art (SOTA) results for predicting ICD codes on a Medical Information Mart for Intensive Care (MIMIC-III) dataset comprising the discharge summaries (n=12,808) that are coded with at least one of the most 50 frequent diagnosis and procedure codes. In addition, the label-wise attention scores for the tokens in the discharge summary presented a potential explainability tool for checking the face validity of ICD code predictions. 4) An optimised transformer-based model, PLM-ICD, achieved the latest SOTA results for ICD coding on all the discharge summaries of the MIMIC-III dataset (n=59,652). The segmentation method, which split the long text consecutively into multiple small chunks, addressed the problem of applying transformer-based models to long text datasets. However, using transformer-based models on extremely large label sets needs further research.
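The label-wise attention idea behind HiLAT can be sketched as follows: each ICD code gets its own attention distribution over the token representations, yielding a per-label document vector whose attention weights double as the explainability signal mentioned above. The dense-layer details are omitted and the label queries here are illustrative assumptions, not the model's learned parameters.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    z = sum(exps)
    return [e / z for e in exps]

def label_wise_attention(token_vecs, label_queries):
    """Build one attention-pooled document vector per ICD label.

    token_vecs: token representations (e.g. transformer outputs).
    label_queries: a learned query vector per label; here supplied
    by hand purely for illustration.
    """
    docs = {}
    dim = len(token_vecs[0])
    for label, q in label_queries.items():
        # Label-specific attention over tokens (dot-product scoring).
        attn = softmax([sum(qi * ti for qi, ti in zip(q, t))
                        for t in token_vecs])
        # Weighted sum of token vectors: this label's view of the document.
        docs[label] = [sum(a * t[d] for a, t in zip(attn, token_vecs))
                       for d in range(dim)]
    return docs

# Two one-hot "token vectors" and one hypothetical label query:
doc_vecs = label_wise_attention([[1.0, 0.0], [0.0, 1.0]],
                                {"I10": [1.0, 0.0]})
```

Because each label attends independently, inspecting a label's attention weights shows which tokens drove that specific code prediction, which is the face-validity check the findings describe.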
These findings demonstrate that the de-identification and clinical coding tasks can benefit from the application of machine learning approaches, present practical tools for implementing these approaches, and highlight priorities for further research
Synthetic Data Sharing and Estimation of Viable Dynamic Treatment Regimes with Observational Data
Significant public demand arises for rapid data-driven scientific investigations using observational data, especially in personalized healthcare. This dissertation addresses three complementary challenges of analyzing complex observational data in biomedical research.
The ethical challenge reflects regulatory policies and social norms regarding data privacy, which tend to emphasize data security at the expense of effective data sharing. This results in fragmentation and scarcity of available research data. In Chapter 2, we propose the DataSifter approach that mediates this challenge by facilitating the generation of realistic synthetic data from sensitive datasets containing static and time-varying variables. The DataSifter method relies on robust imputation methods, including missForest and an iterative imputation technique for time-varying variables using the Generalized Linear Mixed Model (GLMM) and the Random Effects-Expectation Maximization tree (RE-EM tree). Applications demonstrate that under a moderate level of obfuscation, the DataSifter guarantees sufficient per subject perturbations of time-invariant data and preserves the joint distribution and the energy of the entire data archive, which ensures high utility and analytical value of the time-varying information. This promotes accelerated innovation by enabling secure sharing among data governors and researchers.
Once sensitive data can be securely shared, effective analytical tools are needed to provide viable individualized data-driven solutions. Observational data is an important data source for estimating dynamic treatment regimes (DTR) that guide personalized treatment decisions. The second natural challenge concerns the viability of optimal DTR estimation, which may be affected by observed treatment combinations that are not applicable to future patients for clinical or economic reasons. In Chapter 3, we develop restricted Tree-based Reinforcement Learning to accommodate restrictions on feasible treatment combinations in observational studies by truncating possible treatment options based on patient history in a multi-stage, multi-treatment setting. The proposed method recommends optimal treatments only among viable options and utilizes all valid observations in the dataset to avoid selection bias and improve efficiency.
In addition to structured data, unstructured data, such as free text and voice notes, have rapidly become an essential component of many biomedical studies based on clinical and health data, including electronic health records (EHRs), providing extra patient information. The last two chapters of my dissertation (Chapters 4 and 5) expand the methods developed in the previous two projects by using novel natural language processing (NLP) techniques to address the third challenge of handling unstructured data elements. In Chapter 4, we construct a text data anonymization tool, DataSifterText, which generates synthetic free-text data to protect sensitive unstructured data, such as personal health information. In Chapter 5, we propose to enhance the precision of optimal DTR estimation by acquiring additional information contained in clinical notes with information extraction (IE) techniques. Simulation studies and an application to blood pressure management in intensive care units demonstrated that IE techniques can provide extra patient information and more accurate counterfactual outcome modeling, owing to the potentially enhanced sample size and a wider pool of candidate tailoring variables for optimal DTR estimation.
The statistical methods presented in this thesis provide theoretical and practical solutions for privacy-aware, utility-preserving, large-scale data sharing and clinically meaningful optimal DTR estimation. The general theoretical formulation of the methods leads to the design of tools and direct applications that are expected to go beyond the biomedical and health analytics domains.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
http://deepblue.lib.umich.edu/bitstream/2027.42/166113/1/zhounina_1.pd