
    Machine Learning for Diabetes and Mortality Risk Prediction From Electronic Health Records

    Data science can provide invaluable tools to better exploit healthcare data to improve patient outcomes and increase cost-effectiveness. Today, electronic health record (EHR) systems provide a fascinating array of data that data science applications can use to revolutionise the healthcare industry. Utilising EHR data to improve the early diagnosis of a variety of medical conditions and events is a rapidly developing area that, if successful, can help to improve healthcare services across the board. Specifically, as Type-2 Diabetes Mellitus (T2DM) represents one of the most serious threats to health across the globe, analysing the huge volumes of data provided by EHR systems to investigate approaches for the early and accurate prediction of the onset of T2DM, and of medical events such as in-hospital mortality, poses two of the most important challenges data science currently faces. The present thesis addresses these challenges by examining the research gaps in the existing literature, pinpointing the un-investigated areas, and proposing novel machine learning models that account for the difficulties inherent in EHR data. To achieve these aims, the present thesis firstly introduces a unique and large EHR dataset collected from Saudi Arabia. We then investigate the use of state-of-the-art machine learning predictive models that exploit this dataset for diabetes diagnosis and the early identification of patients with pre-diabetes by predicting the blood levels of one of the main indicators of diabetes and pre-diabetes: elevated Glycated Haemoglobin (HbA1c). A novel collaborative denoising autoencoder (Col-DAE) framework is adopted to predict diabetic (high) HbA1c levels. We also employ several machine learning approaches (random forest, logistic regression, support vector machine, and multilayer perceptron) for the identification of patients with pre-diabetes (elevated HbA1c levels). 
The models employed demonstrate that a patient's risk of diabetes or pre-diabetes can be reliably predicted from EHR data. We then extend this work with a pioneering adoption of recent explainable methods to investigate the outcomes of the predictive models employed. This work also investigates the effect of using longitudinal data and more of the features available in EHR systems on the performance and feature ranking of the machine learning models employed for predicting elevated HbA1c levels in non-diabetic patients. This work demonstrates that longitudinal data and the additional EHR features can improve the performance of the machine learning models and can affect the relative order of importance of the features. Secondly, we develop a machine learning model for the early and accurate prediction of all in-hospital mortality events for such patients utilising EHR data. This work investigates a novel application of the Stacked Denoising Autoencoder (SDA) to predict in-hospital patient mortality risk. In doing so, we demonstrate how our approach uniquely overcomes the issues associated with imbalanced datasets to which existing solutions are subject. The proposed model, using clinical patient data on a variety of health conditions and without intensive feature engineering, is demonstrated to achieve robust and promising results using EHR patient data recorded during the first 24 hours after admission.

    Predicting Current Glycated Hemoglobin Levels in Adults From Electronic Health Records: Validation of Multiple Logistic Regression Algorithm

    Background: Electronic health record (EHR) systems generate large datasets that can significantly enrich the development of medical predictive models. Several attempts have been made to investigate the effect of glycated hemoglobin (HbA1c) elevation on the prediction of diabetes onset. However, there is still a need for validation of these models using EHR data collected from different populations. Objective: The aim of this study is to perform a replication study to validate, evaluate, and identify the strengths and weaknesses of replicating a predictive model that employed multiple logistic regression with EHR data to forecast the levels of HbA1c. The original study used data from a population in the United States and this differentiated replication used a population in Saudi Arabia. Methods: A total of 3 models were developed and compared with the model created in the original study. The models were trained and tested using a larger dataset from Saudi Arabia with 36,378 records. The 10-fold cross-validation approach was used for measuring the performance of the models. Results: Applying the method employed in the original study achieved an accuracy of 74% to 75% when using the dataset collected from Saudi Arabia, compared with 77% obtained using the population from the United States. The results also show a different ranking of importance for the predictors between the original study and the replication. The order of importance for the predictors with our population, from most to least important, is age, random blood sugar, estimated glomerular filtration rate, total cholesterol, non–high-density lipoprotein, and body mass index. Conclusions: This replication study shows that direct use of the models (calculators) created using multiple logistic regression to predict the level of HbA1c may not be appropriate for all populations. This study reveals that the weighting of the predictors needs to be calibrated to the population used. 
However, the study does confirm that replicating the original study using a different population can help with predicting the levels of HbA1c by using the predictors that are routinely collected and stored in hospital EHR systems.
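
The 10-fold cross-validation mentioned in the Methods can be illustrated with a minimal sketch. This is not the study's code: the function names, the shuffling seed, and the toy majority-label classifier are all assumptions made for illustration, and the sketch assumes there are at least as many samples as folds.

```python
import random


def k_fold_indices(n_samples, k=10, seed=0):
    """Yield (train_idx, test_idx) pairs for k-fold cross-validation."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)
    # Fold i takes every k-th shuffled index, so fold sizes differ by at most 1.
    folds = [indices[i::k] for i in range(k)]
    for i in range(k):
        test_idx = folds[i]
        train_idx = [j for f, fold in enumerate(folds) if f != i for j in fold]
        yield train_idx, test_idx


def cross_val_accuracy(X, y, fit, predict, k=10, seed=0):
    """Mean accuracy over k folds; `fit` and `predict` are caller-supplied."""
    scores = []
    for train_idx, test_idx in k_fold_indices(len(X), k, seed):
        model = fit([X[j] for j in train_idx], [y[j] for j in train_idx])
        correct = sum(predict(model, X[j]) == y[j] for j in test_idx)
        scores.append(correct / len(test_idx))
    return sum(scores) / len(scores)


# Hypothetical stand-in for a real classifier: predict the majority training label.
def fit_majority(X_train, y_train):
    return max(set(y_train), key=y_train.count)


def predict_majority(model, x):
    return model
```

Each record is used for testing exactly once and for training in the other nine folds, which is why the reported accuracy is an average over folds rather than a single train/test split.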

    Improving Current Glycated Hemoglobin Prediction in Adults: Use of Machine Learning Algorithms with Electronic Health Records

    Background: Predicting the risk of glycated hemoglobin (HbA1c) elevation can help identify patients with the potential for developing serious chronic health problems such as diabetes. Early preventive interventions based upon advanced predictive models using electronic health record (EHR) data for identifying such patients can ultimately help provide better health outcomes. Objective: Our study investigates the performance of predictive models to forecast HbA1c elevation levels by employing several machine learning models. We also investigate the effect of utilizing the patient's longitudinal EHR data on the performance of the predictive models. Explainable methods have been employed to interpret the decisions made by the black-box models. Methods: This study employed Multiple Logistic Regression, Random Forest, Support Vector Machine and Logistic Regression models, as well as a deep learning model (multi-layer perceptron), to classify patients with normal (<5.7%) and elevated (≥5.7%) levels of HbA1c. We also integrated current visit data with historical (longitudinal) data from previous visits. Explainable machine learning methods were used to interrogate the models and provide an understanding of the reasons behind the decisions made by the models. All models were trained and tested using a large dataset from Saudi Arabia with 18,844 unique patient records. Results: The machine learning models achieved promising results for predicting current HbA1c elevation risk. When employed with longitudinal data, the machine learning models outperformed the Multiple Logistic Regression model employed in the comparative study. The multi-layer perceptron model achieved an AUC-ROC of 83.22% when used with historical data. All models showed a close level of agreement on the contribution of the random blood sugar and age variables, with and without longitudinal data. 
Conclusions: This study shows that machine learning models can provide promising results for the task of predicting current HbA1c levels (elevated, ≥5.7%, or normal, <5.7%). Utilizing the patient's longitudinal data improved the performance and affected the relative importance of the predictors used. The models showed results that are consistent with comparable studies.
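
As a rough illustration of the classification target and the reported metric, the sketch below labels HbA1c values with the 5.7% cut-off stated in the abstract and computes AUC-ROC from its pairwise-ranking definition. The function names are hypothetical and the code is not taken from the study.

```python
HBA1C_THRESHOLD = 5.7  # percent; the cut-off stated in the abstract


def label_hba1c(hba1c_percent):
    """Return 1 for elevated (>= 5.7%) and 0 for normal (< 5.7%)."""
    return 1 if hba1c_percent >= HBA1C_THRESHOLD else 0


def auc_roc(labels, scores):
    """AUC-ROC as the probability that a random positive outranks a
    random negative, counting ties as half a win."""
    pos = [s for l, s in zip(labels, scores) if l == 1]
    neg = [s for l, s in zip(labels, scores) if l == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))
```

The pairwise form is quadratic in the number of samples, so it suits small checks only; a production pipeline would normally rely on a library implementation.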

    Type-2 Diabetes Mellitus Diagnosis From Time Series Clinical Data Using Deep Learning Models

    Clinical data is usually observed and recorded at irregular intervals and includes evaluations, treatments, vital signs and lab test results. These provide an invaluable source of information to help diagnose and understand medical conditions. In this work, we introduce the largest patient records dataset in diabetes research: King Abdullah International Medical Research Centre Diabetes (KAIMRCD), which includes data from over 14k patients. KAIMRCD contains detailed information about patients’ visits and has been labelled against T2DM by clinicians. The data is processed as time series and then investigated using temporal predictive Deep Learning models with the goal of diagnosing Type 2 Diabetes Mellitus (T2DM). Long Short-Term Memory (LSTM) and Gated Recurrent Unit (GRU) models are trained on KAIMRCD and are demonstrated here to outperform classical machine learning approaches in the literature with over 97% accuracy.
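
The abstract does not describe how irregularly spaced visits are turned into sequences for the LSTM and GRU models. One common approach, shown here purely as an assumed sketch, is to order visits by time, record the gap since the previous visit as an extra feature, and pad each patient to a fixed length with a mask. The `(day, features)` record format and the function name are hypothetical.

```python
def to_fixed_length(visits, max_len, n_features):
    """Convert irregular visits into a fixed-length sequence.

    visits: list of (day, feature_list) tuples in any order.
    Returns (sequence, mask). Each real row of `sequence` is
    [days_since_previous_visit, *features]; padding rows are all zeros
    and come first. `mask` marks real rows with 1 and padding with 0.
    Assumes max_len >= 1.
    """
    # Sort by time and keep only the most recent max_len visits.
    ordered = sorted(visits, key=lambda v: v[0])[-max_len:]
    rows = []
    prev_day = None
    for day, feats in ordered:
        gap = 0 if prev_day is None else day - prev_day
        rows.append([gap] + list(feats))
        prev_day = day
    pad = max_len - len(rows)
    sequence = [[0] * (n_features + 1) for _ in range(pad)] + rows
    mask = [0] * pad + [1] * len(rows)
    return sequence, mask
```

A recurrent model can then consume `sequence` in batches and use `mask` to ignore the padded steps.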

    Stacked Denoising Autoencoders for Mortality Risk Prediction Using Imbalanced Clinical Data

    Clinical data, such as evaluations, treatments, vital signs and lab test results, are usually observed and recorded in hospital systems. Such data provide an invaluable source of information to help physicians evaluate the mortality risk of in-hospital patients, which can ultimately help improve healthcare services. In particular, quick and accurate predictions of mortality can be valuable for physicians who are making decisions about interventions. In this work we introduce the use of a predictive Deep Learning model to help evaluate the mortality risk for in-hospital patients. A Stacked Denoising Autoencoder (SDA) has been trained using a unique time-stamped dataset from the King Abdullah International Medical Research Center (KAIMRC), which is naturally imbalanced. The results are compared to those from common deep learning approaches, using different methods for data balancing. The proposed model aims to overcome the problem of imbalanced data, and outperforms common deep learning approaches with a macro-averaged recall of 77.13%.
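
Macro-averaged recall is a natural metric for imbalanced data because it weights each class equally, regardless of how many samples it has. The sketch below, with hypothetical function names and made-up toy labels, shows how a model that never predicts the minority class can score high accuracy but only 50% macro recall.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the true labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def recall_per_class(y_true, y_pred):
    """Recall for each class: correctly predicted members / all members."""
    recalls = {}
    for c in sorted(set(y_true)):
        total = sum(1 for t in y_true if t == c)
        hits = sum(1 for t, p in zip(y_true, y_pred) if t == c and p == c)
        recalls[c] = hits / total
    return recalls


def macro_recall(y_true, y_pred):
    """Unweighted mean of the per-class recalls."""
    recalls = recall_per_class(y_true, y_pred)
    return sum(recalls.values()) / len(recalls)


# Toy data (not from the study): 90 survivors (0), 10 deaths (1),
# and a degenerate model that always predicts survival.
y_true = [0] * 90 + [1] * 10
y_pred = [0] * 100
print(accuracy(y_true, y_pred))      # 0.9
print(macro_recall(y_true, y_pred))  # 0.5
```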

    Collaborative Denoising Autoencoder for High Glycated Haemoglobin Prediction

    A pioneering study is presented demonstrating that the presence of high glycated haemoglobin (HbA1c) levels in a patient’s blood can be reliably predicted from routinely collected clinical data. This paves the way for early detection of Type-2 Diabetes Mellitus (T2DM) and will save healthcare providers a major cost associated with the administration and assessment of clinical tests for HbA1c. A novel collaborative denoising autoencoder framework is used to address this challenge. The framework builds an independent denoising autoencoder model for each of the high and low HbA1c levels, which extracts feature representations in the latent space. A baseline model using just three features (patient age, triglycerides and glucose level) achieves a 76% F1-score with an SVM classifier. The collaborative denoising autoencoder uses 78 features and can predict HbA1c level with an 81% F1-score.
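
A denoising autoencoder is trained to reconstruct a clean input from a deliberately corrupted copy. The abstract does not say which corruption is used, so the sketch below assumes masking noise (randomly zeroing features) and shows only how the per-class training pairs could be prepared; the network itself is omitted, and all names are hypothetical.

```python
import random


def mask_corrupt(x, rate, rng):
    """Set each feature to 0.0 with probability `rate` (masking noise)."""
    return [0.0 if rng.random() < rate else v for v in x]


def make_denoising_pairs(rows, rate=0.3, seed=0):
    """Return (corrupted, clean) pairs; an autoencoder would be trained
    to map each corrupted input back to its clean version."""
    rng = random.Random(seed)
    return [(mask_corrupt(r, rate, rng), list(r)) for r in rows]


def split_by_class(rows, labels):
    """Group rows by label so that one autoencoder can be trained per class,
    mirroring the independent high and low HbA1c models in the abstract."""
    groups = {}
    for row, label in zip(rows, labels):
        groups.setdefault(label, []).append(row)
    return groups
```

Under these assumptions, `split_by_class` would be applied first and `make_denoising_pairs` then called once per class to build each model's training set.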
