
    A scalable formulation of joint modelling for longitudinal and time to event data and its application on large electronic health record data of diabetes complications

    INTRODUCTION: Clinical decision-making in the management of diabetes and other chronic diseases depends upon individualised risk predictions of progression of the disease or complications of disease. With sequential measurements of biomarkers, it should be possible to make dynamic predictions that are updated as new data arrive. Since the 1990s, methods have been developed to jointly model longitudinal measurements of biomarkers and time-to-event data, aiming to facilitate predictions in various fields. These methods offer a comprehensive approach to analyse both the longitudinal changes in biomarkers and the occurrence of events, allowing for a more integrated understanding of the underlying processes and improved predictive capabilities. The aim of this thesis is to investigate whether established methods for joint modelling are able to scale to large-scale electronic health record datasets with multiple biomarkers measured asynchronously, and to evaluate the performance of a novel approach that overcomes the limitations of existing methods.

    METHODS: The epidemiological study design utilised in this research is a retrospective observational study. The data used for these analyses were obtained from a registry encompassing all individuals with type 1 diabetes (T1D) in Scotland, which is delivered by the Scottish Care Information - Diabetes Collaboration platform. The two outcomes studied were time to cardiovascular disease (CVD) and time to end-stage renal disease (ESRD) from T1D diagnosis. The longitudinal biomarkers examined in the study were glycosylated haemoglobin (HbA1c) and estimated glomerular filtration rate (eGFR). These biomarkers and endpoints were selected based on their prevalence in the T1D population and the established association between these biomarkers and the outcomes. As a state-of-the-art method for joint modelling, Brilleman's stan_jm() function was evaluated. This is an implementation of a shared-parameter joint model for longitudinal and time-to-event data in Stan, contributed to the rstanarm package. This was compared with a novel approach based on sequential Bayesian updating of a continuous-time state-space model for the biomarkers, with predictions generated by a Kalman filter algorithm using the ctsem package and fed into a Poisson time-splitting regression model for the events. In contrast to the standard joint modelling approach, which can only fit a linear mixed model to the biomarkers, the ctsem package is able to fit a broader family of models that include terms for autoregressive drift and diffusion. As a baseline for comparison, a last-observation-carried-forward model was evaluated for predicting time to event.

    RESULTS: The analyses were conducted using renal replacement therapy outcome data on 29764 individuals and cardiovascular disease outcome data on 29479 individuals in Scotland (as per the 2019 national registry extract). The CVD dataset was reduced to 24779 individuals with both HbA1c and eGFR measured on the same date, a restriction imposed by the modelling function itself. The datasets include 799 events of renal replacement therapy (RRT) or death due to renal failure (6.71 years average follow-up) and 2274 CVD events (7.54 years average follow-up), respectively. The standard approach to joint modelling, which uses quadrature to integrate over the trajectories of the latent biomarker states and is implemented in rstanarm, was found to be too slow to use even with moderate-sized datasets: 17.5 hours for a subset of 2633 subjects, 35.9 hours for 5265 subjects, and more than 68 hours for 10532 subjects. The sequential Bayesian updating approach was much faster, analysing a dataset of 29121 individuals over 225598.3 person-years in 19 hours. Comparison of the fit of different longitudinal biomarker submodels showed that models that also included drift and diffusion terms fitted much better (AIC 51139 deviance units lower) than models that included only a linear mixed model slope term. Despite this, adding terms for drift and diffusion improved predictive performance only slightly for CVD (C-statistic 0.680 to 0.696 for 2112 individuals) and only moderately for end-stage renal disease (C-statistic 0.88 to 0.91 for 2000 individuals). The predictive performance of joint modelling in these datasets was only slightly better than using last-observation-carried-forward in the Poisson regression model (C-statistic 0.819 over 8625 person-years).

    CONCLUSIONS: I have demonstrated that, unlike the standard approach to joint modelling implemented in rstanarm, the time-splitting joint modelling approach based on sequential Bayesian updating can scale to a large dataset and allows biomarker trajectories to be modelled with a wider family of models that fit better than simple linear mixed models. However, in this application, where the only biomarkers were HbA1c and eGFR and the outcomes were time to CVD and end-stage renal disease, the increment in the predictive performance of joint modelling compared with last-observation-carried-forward was slight. For other outcomes, where the ability to predict time to event depends upon modelling latent biomarker trajectories rather than just using the last observation carried forward, the advantages of joint modelling may be greater.

    This thesis proceeds as follows. The first two chapters serve as an introduction to the joint modelling of longitudinal and time-to-event data and its relation to other methods for clinical risk prediction. Briefly, this part explores the rationale for utilising such an approach to better manage chronic diseases such as T1D. The methodological chapters describe the mathematical formulation of a multivariate shared-parameter joint model and introduce its application and performance on a subset of individuals with T1D and data pertaining to CVD and ESRD outcomes. Additionally, the mathematical formulation of an alternative time-splitting approach is demonstrated and compared to a conventional method for estimating longitudinal trajectories of clinical biomarkers used in risk prediction, and the key features of the pipeline required to implement this approach are outlined. The final chapters present an applied example that demonstrates the estimation and evaluation of the alternative modelling approach and explores the types of inferences that can be obtained for a subset of individuals with T1D who might progress to ESRD. Finally, this thesis highlights the strengths and weaknesses of applying and scaling up more complex modelling approaches to facilitate dynamic risk prediction for precision medicine.
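
    To make the time-splitting idea concrete, the sketch below (written in Python rather than the R packages named above, and not drawn from the thesis code) filters one irregularly measured biomarker with a univariate random-walk-plus-drift Kalman filter and feeds the filtered state into a Poisson regression with a person-time offset. The kalman_filter helper, the column names, and the toy data are all hypothetical.

```python
# Minimal sketch of a time-splitting pipeline: filter one biomarker with a
# simple state-space model, then regress events on the filtered state with a
# Poisson model and a log person-time offset. Toy data only.
import numpy as np
import pandas as pd
import statsmodels.api as sm

def kalman_filter(times, values, drift=0.0, q=0.5, r=1.0, x0=0.0, p0=10.0):
    """Filter irregularly spaced measurements (NaN = not measured).

    drift: deterministic change per unit time; q: process noise (diffusion)
    variance per unit time; r: measurement noise variance.
    """
    x, p, prev_t, out = x0, p0, times[0], []
    for t, y in zip(times, values):
        dt = t - prev_t
        x, p = x + drift * dt, p + q * dt        # predict forward in time
        if not np.isnan(y):                      # update only when observed
            k = p / (p + r)
            x, p = x + k * (y - x), (1 - k) * p
        out.append(x)
        prev_t = t
    return np.array(out)

# Hypothetical time-split data: one row per person-interval with the event
# indicator, exposure time, and the filtered biomarker state for that interval.
splits = pd.DataFrame({
    "event":        [0, 0, 1, 0, 0, 1, 0, 0],
    "person_years": [1.0, 1.0, 0.4, 1.0, 1.0, 0.7, 1.0, 1.0],
    "hba1c_state":  kalman_filter(
        np.arange(8.0),
        np.array([8.1, 8.3, np.nan, 7.9, 8.8, np.nan, 7.5, 7.2])),
})

poisson = sm.GLM(
    splits["event"],
    sm.add_constant(splits[["hba1c_state"]]),
    family=sm.families.Poisson(),
    offset=np.log(splits["person_years"]),
).fit()
print(poisson.summary())
```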

    Enhance Representation Learning of Clinical Narrative with Neural Networks for Clinical Predictive Modeling

    Medicine is undergoing a technological revolution. Understanding human health from clinical data poses major challenges from technical and practical perspectives, prompting methods that can handle large, complex, and noisy data. Such methods are particularly necessary for natural language data from clinical narratives/notes, which contain some of the richest information on a patient. Meanwhile, deep neural networks have achieved superior performance in a wide variety of natural language processing (NLP) tasks because of their capacity to encode meaningful but abstract representations and to learn the entire task end-to-end. In this thesis, I investigate representation learning of clinical narratives with deep neural networks through a number of tasks, ranging from clinical concept extraction to clinical note modeling and patient-level language representation, and I present methods that use representation learning with neural networks to support understanding of clinical text documents.

    I first introduce the notion of representation learning from natural language processing and patient data modeling. Then, I investigate word-level representation learning to improve clinical concept extraction from clinical notes. I present two works on learning word representations and evaluate them on extracting important concepts from clinical notes: the first study focuses on cancer-related information, and the second study evaluates shared-task data. The aim of these two studies is to automatically extract important entities from clinical notes. Next, I present a series of deep neural networks that encode hierarchical, longitudinal, and contextual information for modeling a series of clinical notes, and I evaluate the models by predicting clinical outcomes of interest, including mortality, length of stay, and phenotypes. Finally, I propose a novel representation learning architecture to develop a generalized and transferable language representation at the patient level, and I identify pre-training tasks appropriate for constructing a generalizable language representation. The main focus is to improve predictive performance for phenotypes with limited data, a challenging task due to data scarcity.

    Overall, this dissertation addresses issues in natural language processing for medicine, including clinical text classification and modeling, and highlights major barriers to understanding large-scale clinical notes. It is believed that developing deep representation learning methods for distilling enormous amounts of heterogeneous data into patient-level language representations will improve evidence-based clinical understanding, and this representation learning approach could be used across clinical applications despite noisy data. I conclude that considering the different linguistic components of natural language and the sequential information between clinical events is important. These results have implications beyond the immediate context of the predictions and suggest future directions for clinical machine learning research to improve clinical outcomes; they could be a starting point for future phenotyping methods based on natural language processing that construct patient-level language representations to improve clinical predictions. While significant progress has been made, many open questions remain, and I highlight a few works that demonstrate promising directions.
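
    As a purely illustrative companion to the note-modeling work described above (the dissertation's actual architectures are not specified here), the following PyTorch sketch encodes words into note vectors and a sequence of notes into a patient-level representation used for a binary outcome such as mortality. The class name, dimensions, and toy inputs are invented.

```python
# Minimal hierarchical encoder: words -> note vectors -> patient representation.
# Illustrative dimensions and data; not the dissertation's models.
import torch
import torch.nn as nn

class HierarchicalNoteEncoder(nn.Module):
    def __init__(self, vocab_size, emb_dim=128, note_dim=128, patient_dim=128, n_labels=1):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim, padding_idx=0)
        # Word-level encoder: one LSTM pass per note.
        self.note_rnn = nn.LSTM(emb_dim, note_dim, batch_first=True)
        # Note-level encoder: LSTM over the sequence of note vectors.
        self.patient_rnn = nn.LSTM(note_dim, patient_dim, batch_first=True)
        self.head = nn.Linear(patient_dim, n_labels)

    def forward(self, token_ids):
        # token_ids: (batch, n_notes, n_tokens) integer tensor, 0 = padding.
        b, n_notes, n_tokens = token_ids.shape
        words = self.embed(token_ids.view(b * n_notes, n_tokens))
        _, (note_h, _) = self.note_rnn(words)         # last hidden state per note
        note_vecs = note_h[-1].view(b, n_notes, -1)
        _, (pat_h, _) = self.patient_rnn(note_vecs)   # summary of the note sequence
        return self.head(pat_h[-1])                   # logits, e.g. mortality risk

logits = HierarchicalNoteEncoder(vocab_size=5000)(torch.randint(1, 5000, (2, 4, 30)))
print(logits.shape)  # torch.Size([2, 1])
```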

    Probabilistic Models for Exploring, Predicting, and Influencing Health Trajectories

    Over the past decade, healthcare systems around the world have transitioned from paper to electronic health records. The majority of healthcare systems today host large, on-premise clusters that support an institution-wide network of computers deployed at the point of care. A stream of transactions passes through this network each minute, recording information about what medications a patient is receiving, what procedures they have had, and the results of hundreds of physical examinations and laboratory tests. There is increasing pressure to leverage these repositories of data as a means to improve patient outcomes, drive down costs, or both. To date, however, there is no clear answer on how best to do this.

    In this thesis, we study two important problems that can help to accomplish these goals: disease subtyping and disease trajectory prediction. In disease subtyping, the goal is to better understand complex, heterogeneous diseases by discovering patient populations with similar symptoms and disease expression. As we discover and refine subtypes, we can integrate them into clinical practice to improve management and can use them to motivate new hypothesis-driven research into the genetic and molecular underpinnings of the disease. In disease trajectory prediction, our goal is to forecast how severe a patient's disease will become in the future. Tools that make accurate forecasts have clear implications for clinical decision support, but they can also improve our process for validating new therapies through trial enrichment. We identify several characteristics of EHR data that make it difficult to do subtyping and disease trajectory prediction. The key contribution of this thesis is a collection of novel probabilistic models that address these challenges and make it possible to successfully solve the subtyping and disease trajectory prediction problems using EHR data.
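
    As a generic illustration of what disease subtyping means in practice, and not of this thesis's probabilistic models, the snippet below clusters synthetic patients on two invented severity features with a Gaussian mixture and reads off candidate subtypes.

```python
# Toy subtyping example: fit a two-component Gaussian mixture to synthetic
# patient features and inspect the resulting groups. Data and features invented.
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(1)
# Two synthetic "subtypes": mild (low severity scores) and aggressive (high).
features = np.vstack([rng.normal([1.0, 0.5], 0.3, size=(50, 2)),
                      rng.normal([3.0, 2.5], 0.3, size=(50, 2))])

gmm = GaussianMixture(n_components=2, random_state=0).fit(features)
labels = gmm.predict(features)
print("subtype sizes:", np.bincount(labels))
print("subtype means:\n", gmm.means_)
```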

    Machine Learning Framework for Real-World Electronic Health Records Regarding Missingness, Interpretability, and Fairness

    Machine learning (ML) and deep learning (DL) techniques have shown promising results in healthcare applications using Electronic Health Record (EHR) data. However, their adoption in real-world healthcare settings is hindered by three major challenges. Firstly, real-world EHR data typically contain numerous missing values. Secondly, traditional ML/DL models are typically considered black boxes, whereas interpretability is required for real-world healthcare applications. Finally, differences in data distributions may lead to unfairness and performance disparities, particularly in subpopulations. This dissertation proposes methods to address these missing data, interpretability, and fairness issues. The first work proposes an ensemble prediction framework for EHR data with large missing rates that uses multiple subsets with lower missing rates. The second method integrates medical knowledge graphs and a double attention mechanism with the long short-term memory (LSTM) model to enhance interpretability by providing knowledge-based model interpretation. The third method develops an LSTM variant that integrates medical knowledge graphs and additional time-aware gates to handle multi-variable temporal missingness and interpretability concerns. Finally, a transformer-based model is proposed to learn unbiased and fair representations of diverse subpopulations using domain classifiers and three attention mechanisms.
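
    The sketch below is a rough illustration of the general idea behind the first method, ensembling over lower-missingness feature subsets, not the dissertation's exact framework: one model is trained per densely observed column subset, and predictions are averaged over the members whose features are observed for a given patient. The helper names, subset choices, and simulated data are assumptions.

```python
# Train one model per low-missingness column subset, then average predictions
# of the members whose features are fully observed for each patient.
import numpy as np
from sklearn.linear_model import LogisticRegression

def fit_subset_ensemble(X, y, subsets):
    """X: (n, d) array with np.nan for missing; subsets: list of column-index lists."""
    models = []
    for cols in subsets:
        rows = ~np.isnan(X[:, cols]).any(axis=1)   # complete cases for this subset
        models.append((cols, LogisticRegression(max_iter=1000)
                       .fit(X[np.ix_(rows, cols)], y[rows])))
    return models

def predict_subset_ensemble(models, X):
    preds = np.full((X.shape[0], len(models)), np.nan)
    for j, (cols, m) in enumerate(models):
        rows = ~np.isnan(X[:, cols]).any(axis=1)
        if rows.any():
            preds[rows, j] = m.predict_proba(X[np.ix_(rows, cols)])[:, 1]
    # Average over usable members; patients covered by no member stay NaN.
    return np.nanmean(preds, axis=1)

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 4))
X[rng.random((200, 4)) < 0.2] = np.nan               # inject missingness
y = (rng.random(200) < 0.5).astype(int)
ensemble = fit_subset_ensemble(X, y, [[0, 1], [2, 3]])
print(predict_subset_ensemble(ensemble, X)[:5])
```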

    Impact of Terminology Mapping on Population Health Cohorts (IMPaCt)

    Background and Objectives: The population health care delivery model uses phenotype algorithms in the electronic health record (EHR) system to identify patient cohorts targeted for clinical interventions such as laboratory tests and procedures. The standard terminology used to identify disease cohorts may contribute to significant variation in error rates for patient inclusion or exclusion. The United States requires EHR systems to support two diagnosis terminologies, the International Classification of Diseases (ICD) and the Systematized Nomenclature of Medicine (SNOMED). Terminology mapping enables the retrieval of diagnosis data using either terminology. There are no standards of practice by which to evaluate and report the operational characteristics of ICD and SNOMED value sets used to select patient groups for population health interventions. Establishing a best practice for terminology selection is a step forward in ensuring that the right patients receive the right intervention at the right time. The research question is, "How do the diagnosis retrieval terminology (ICD vs SNOMED) and terminology map maintenance impact population health cohorts?" Aims 1 and 2 explore this question, and Aim 3 informs practice and policy for population health programs.

    Methods. Aim 1: Quantify the impact of terminology choice (ICD vs SNOMED). ICD and SNOMED phenotype algorithms for diabetes, chronic kidney disease (CKD), and heart failure were developed using matched sets of codes from the Value Set Authority Center. The performance of the diagnosis-only phenotypes was compared to a published reference standard that included diagnosis codes, laboratory results, procedures, and medications. Aim 2: Measure the impact of terminology maintenance on SNOMED cohorts. For each disease state, the performance of a single SNOMED algorithm before and after terminology updates was evaluated against a reference standard to identify and quantify cohort changes introduced by terminology maintenance. Aim 3: Recommend methods for improving population health interventions. The socio-technical model for studying health information technology was used to inform best practice for the use of population health interventions.

    Results. Aim 1: ICD-10 value sets had better sensitivity than SNOMED for diabetes (.829 vs .662) and CKD (.242 vs .225) (N=201,713, p
    Aim 2: Following terminology maintenance, the SNOMED algorithm for diabetes increased in sensitivity from .662 to .683 (p
    Aim 3: Based on observed social and technical challenges to population health programs, including and in addition to the development and measurement of phenotypes, a practical method was proposed for population health intervention development and reporting.
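
    As a hypothetical illustration of the evaluation step in Aims 1 and 2, the snippet below flags a cohort from a small invented ICD-10 value set and computes its sensitivity against a reference-standard cohort; the codes, patients, and reference cohort are toy data, not the study's value sets.

```python
# Toy value-set evaluation: build a diagnosis-only phenotype cohort and
# measure its sensitivity against a reference-standard cohort.
icd10_diabetes = {"E10", "E11", "E13"}          # invented value set (category level)

diagnoses = {                                    # patient_id -> recorded codes
    "p1": {"E11.9", "I10"},
    "p2": {"I50.9"},
    "p3": {"E10.1"},
    "p4": {"N18.3"},
}
reference_cohort = {"p1", "p3", "p4"}            # e.g. defined by codes + labs + meds

def in_value_set(codes, value_set):
    # Match on the ICD-10 category (the part before the decimal point).
    return any(code.split(".")[0] in value_set for code in codes)

phenotype_cohort = {pid for pid, codes in diagnoses.items()
                    if in_value_set(codes, icd10_diabetes)}

true_pos = len(phenotype_cohort & reference_cohort)
sensitivity = true_pos / len(reference_cohort)
print(f"sensitivity = {sensitivity:.3f}")        # 2 of 3 reference patients found
```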

    Review of methods for detecting glycemic disorders

    Prediabetes (intermediate hyperglycemia) consists of two abnormalities, impaired fasting glucose (IFG) and impaired glucose tolerance (IGT), detected by a standardized 75-gram oral glucose tolerance test (OGTT). Individuals with isolated IGT or combined IFG and IGT have an increased risk of developing type 2 diabetes (T2D) and cardiovascular disease (CVD). Diagnosing prediabetes early and accurately is critical in order to refer high-risk individuals for intensive lifestyle modification. However, there is currently no international consensus for diagnosing prediabetes with HbA1c or glucose measurements, because the American Diabetes Association (ADA) and World Health Organization (WHO) criteria identify different populations at risk of progressing to diabetes. Various caveats affecting the accuracy of interpreting HbA1c, including genetics, complicate this further. This review describes established methods for detecting glucose disorders based upon glucose and HbA1c parameters, as well as novel approaches including the 1-hour plasma glucose (1-h PG), the glucose challenge test (GCT), the shape of the glucose curve, genetics, continuous glucose monitoring (CGM), measures of insulin secretion and sensitivity, metabolomics, and ancillary tools such as fructosamine, glycated albumin (GA), and 1,5-anhydroglucitol (1,5-AG). Of the approaches considered, the 1-h PG has considerable potential as a biomarker for detecting glucose disorders if confirmed by additional data, including health economic analysis. Whether the 1-h OGTT is superior to genetics and omics in providing greater precision for individualized treatment requires further investigation. These methods will need to demonstrate substantial superiority over simpler tools for detecting glucose disorders to justify their cost and complexity.
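
    As a worked example of the ADA glucose and HbA1c cut-points discussed in this review (fasting and 2-hour plasma glucose in mg/dL, HbA1c in %), the function below classifies a set of measurements as normoglycemia, prediabetes, or the diabetes range. It is only an illustration of the published thresholds, not a clinical tool, and it omits the confirmatory repeat testing and the random-glucose-with-symptoms criterion.

```python
# ADA cut-points: diabetes at FPG >= 126, 2-h PG >= 200, or HbA1c >= 6.5;
# prediabetes at FPG 100-125 (IFG), 2-h PG 140-199 (IGT), or HbA1c 5.7-6.4.
def ada_glycemic_status(fpg=None, two_hr_pg=None, hba1c=None):
    diabetes    = [(fpg, 126), (two_hr_pg, 200), (hba1c, 6.5)]
    prediabetes = [(fpg, 100), (two_hr_pg, 140), (hba1c, 5.7)]
    if any(v is not None and v >= cut for v, cut in diabetes):
        return "diabetes range"
    if any(v is not None and v >= cut for v, cut in prediabetes):
        return "prediabetes (IFG and/or IGT or HbA1c 5.7-6.4%)"
    return "normoglycemia"

print(ada_glycemic_status(fpg=112))                    # IFG -> prediabetes
print(ada_glycemic_status(two_hr_pg=165, hba1c=5.5))   # IGT -> prediabetes
print(ada_glycemic_status(hba1c=6.8))                  # diabetes range
```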