334 research outputs found

    Predicting disease risks from highly imbalanced data using random forest

    Get PDF
    <p>Abstract</p> <p>Background</p> <p>We present a method utilizing Healthcare Cost and Utilization Project (HCUP) dataset for predicting disease risk of individuals based on their medical diagnosis history. The presented methodology may be incorporated in a variety of applications such as risk management, tailored health communication and decision support systems in healthcare.</p> <p>Methods</p> <p>We employed the National Inpatient Sample (NIS) data, which is publicly available through Healthcare Cost and Utilization Project (HCUP), to train random forest classifiers for disease prediction. Since the HCUP data is highly imbalanced, we employed an ensemble learning approach based on repeated random sub-sampling. This technique divides the training data into multiple sub-samples, while ensuring that each sub-sample is fully balanced. We compared the performance of support vector machine (SVM), bagging, boosting and RF to predict the risk of eight chronic diseases.</p> <p>Results</p> <p>We predicted eight disease categories. Overall, the RF ensemble learning method outperformed SVM, bagging and boosting in terms of the area under the receiver operating characteristic (ROC) curve (AUC). In addition, RF has the advantage of computing the importance of each variable in the classification process.</p> <p>Conclusions</p> <p>In combining repeated random sub-sampling with RF, we were able to overcome the class imbalance problem and achieve promising results. Using the national HCUP data set, we predicted eight disease categories with an average AUC of 88.79%.</p

    Interpretable Machine Learning Model for Clinical Decision Making

    Get PDF
    Despite machine learning models being increasingly used in medical decision-making and meeting classification predictive accuracy standards, they remain untrusted black-boxes due to decision-makers\u27 lack of insight into their complex logic. Therefore, it is necessary to develop interpretable machine learning models that will engender trust in the knowledge they generate and contribute to clinical decision-makers intention to adopt them in the field. The goal of this dissertation was to systematically investigate the applicability of interpretable model-agnostic methods to explain predictions of black-box machine learning models for medical decision-making. As proof of concept, this study addressed the problem of predicting the risk of emergency readmissions within 30 days of being discharged for heart failure patients. Using a benchmark data set, supervised classification models of differing complexity were trained to perform the prediction task. More specifically, Logistic Regression (LR), Random Forests (RF), Decision Trees (DT), and Gradient Boosting Machines (GBM) models were constructed using the Healthcare Cost and Utilization Project (HCUP) Nationwide Readmissions Database (NRD). The precision, recall, area under the ROC curve for each model were used to measure predictive accuracy. Local Interpretable Model-Agnostic Explanations (LIME) was used to generate explanations from the underlying trained models. LIME explanations were empirically evaluated using explanation stability and local fit (R2). The results demonstrated that local explanations generated by LIME created better estimates for Decision Trees (DT) classifiers

    Relational data clustering algorithms with biomedical applications

    Get PDF

    Prediction of Concurrent Hypertensive Disorders in Pregnancy and Gestational Diabetes Mellitus Using Machine Learning Techniques

    Get PDF
    Gestational diabetes mellitus and hypertensive disorders in pregnancy are serious maternal health conditions with immediate and lifelong mother-child health consequences. These obstetric pathologies have been widely investigated, but mostly in silos, while studies focusing on their simultaneous occurrence rarely exist. This is especially the case in the machine learning domain. This retrospective study sought to investigate, construct, evaluate, compare, and isolate a supervised machine learning predictive model for the binary classification of co-occurring gestational diabetes mellitus and hypertensive disorders in pregnancy in a cohort of otherwise healthy pregnant women. To accomplish the stated aims, this study analyzed an extract (n=4624, n_features=38) of a labelled maternal perinatal dataset (n=9967, n_fields=79) collected by the PeriData.Net® database from a participating community hospital in Southeast Wisconsin between 2013 and 2018. The datasets were named, “WiseSample” and “WiseSubset” respectively in this study. Thirty-three models were constructed with the six supervised machine learning algorithms explored on the extracted dataset: logistic regression, random forest, decision tree, support vector machine, StackingClassifier, and KerasClassifier, which is a deep learning classification algorithm; all were evaluated using the StratifiedKfold cross-validation (k=10) method. The Synthetic Minority Oversampling Technique was applied to the training data to resolve the class imbalance that was noted in the sub-sample at the preprocessing phase. A wide range of evidence-based feature selection techniques were used to identify the best predictors of the comorbidity under investigation. Multiple model performance evaluation metrics that were employed to quantitatively evaluate and compare model performance quality include accuracy, F1, precision, recall, and the area under the receiver operating characteristic curve. Support Vector Machine objectively emerged as the most generalizable model for identifying the gravidae in WiseSubset who may develop concurrent gestational diabetes mellitus and hypertensive disorders in pregnancy, scoring 100.00% (mean) in recall. The model consisted of 9 predictors extracted by the recursive feature elimination with cross-validation with random forest. Finding from this study show that appropriate machine learning methods can reliably predict comorbid gestational diabetes and hypertensive disorders in pregnancy, using readily available routine prenatal attributes. Six of the nine most predictive factors of the comorbidity were also in the top 6 selections of at least one other feature selection method examined. The six predictors are healthy weight prepregnancy BMI, mother’s educational status, husband’s educational status, husband’s occupation in one year before the current pregnancy, mother’s blood group, and mother’s age range between 34 and 44 years. Insight from this analysis would support clinical decision making of obstetric experts when they are caring for 1.) nulliparous women, since they would have no obstetric history that could prompt their care providers for feto-maternal medical surveillance; and 2.) the experienced mothers with no obstetric history suggestive of any of the disease(s) under this study. Hence, among other benefits, the artificial-intelligence-backed tool designed in this research would likely improve maternal and child care quality outcomes

    Blood Pressure Estimation from Speech Recordings: Exploring the Role of Voice-over Artists

    Get PDF
    Hypertension, a prevalent global health concern, is associated with cardiovascular diseases and significant morbidity and mortality. Accurate and prompt Blood Pressure monitoring is crucial for early detection and successful management. Traditional cuff-based methods can be inconvenient, leading to the exploration of non-invasive and continuous estimation methods. This research aims to bridge the gap between speech processing and health monitoring by investigating the relationship between speech recordings and Blood Pressure estimation. Speech recordings offer promise for non-invasive Blood Pressure estimation due to the potential link between vocal characteristics and physiological responses. In this study, we focus on the role of Voice-over Artists, known for their ability to convey emotions through voice. By exploring the expertise of Voice-over Artists in controlling speech and expressing emotions, we seek valuable insights into the potential correlation between speech characteristics and Blood Pressure. This research sheds light on presenting an innovative and convenient approach to health assessment. By unraveling the specific role of Voice-over Artists in this process, the study lays the foundation for future advancements in healthcare and human-robot interactions. Through the exploration of speech characteristics and emotional expression, this investigation offers valuable insights into the correlation between vocal features and Blood Pressure levels. By leveraging the expertise of Voice-over Artists in conveying emotions through voice, this study enriches our understanding of the intricate relationship between speech recordings and physiological responses, opening new avenues for the integration of voice-related factors in healthcare technologies

    Development and Evaluation of an Interdisciplinary Periodontal Risk Prediction Tool Using a Machine Learning Approach

    Get PDF
    Periodontitis (PD) is a major public health concern which profoundly affects oral health and concomitantly, general health of the population worldwide. Evidence-based research continues to support association between PD and systemic diseases such as diabetes and hypertension, among others. Notably PD also represents a modifiable risk factor that may reduce the onset and progression of some systemic diseases, including diabetes. Due to lack of oral screening in medical settings, this population does not get flagged with the risk of developing PD. This study sought to develop a PD risk assessment model applicable at clinical point-of-care (POC) by comparing performance of five supervised machine learning (ML) algorithms: Naïve Bayes, Logistic Regression, Support Vector Machine, Artificial Neural Network and Decision Tree, for modeling risk by retrospectively interrogating clinical data collected across seven different models of care (MOC) within the interdisciplinary settings. Risk assessment modeling was accomplished using Waikato Environment for Knowledge Analysis (WEKA) open-sourced tool, which supported comparative assessment of the relative performance of the five ML algorithms when applied to risk prediction. To align with current conventions for clinical classification of disease severity, predicting PD risk was treated as a ‘classification problem’, where patients were sorted into two categories based on disease severity and ‘low risk PD’ was defined as no or mild gum disease (‘controls’) or ‘high risk PD’ defined as moderate to severe disease (‘cases’). To assess the predictive performance of models, the study compared performance of ML algorithms applying analysis of recall, specificity, area under the curve, precision, F-measure and Matthew’s correlation coefficient (MCC) and receiver operating characteristic (ROC) curve. A tenfold-cross validation was performed. External validation of the resultant models was achieved by creating validation data subsets applying random selection of approximately 10% of each class of data proportionately. Findings from this study have prognostic implications for assessing PD risk. Models evolved in the present study have translational value in that they can be incorporated into the Electronic Health Record (EHR) to support POC screening. Additionally, the study has defined relative performance of PD risk prediction models across various MOC environments. Moreover, these findings have established the power ML application can serve to create a decision support tool for dental providers in assessing PD status, severity and inform treatment decisions. Further, such risk scores could also inform medical providers regarding the need for patient referrals and management of comorbid conditions impacted by presence of oral disease such as PD. Finally, this study illustrates the benefit of the integrated medical and dental care delivery environment for detecting risk of periodontitis at a stage when implementation of proven interventions could delay and even prevent disease progression. Keywords: Periodontitis, Risk Assessment, Interprofessional Relations, Machine learning, Electronic Health Records, Decision Support System

    Development of Machine Learning Techniques for Diabetic Retinopathy Risk Estimation

    Get PDF
    La retinopatia diabètica (DR) és una malaltia crònica. És una de les principals complicacions de diabetis i una causa essencial de pèrdua de visió entre les persones que pateixen diabetis. Els pacients diabètics han de ser analitzats periòdicament per tal de detectar signes de desenvolupament de la retinopatia en una fase inicial. El cribratge precoç i freqüent disminueix el risc de pèrdua de visió i minimitza la càrrega als centres assistencials. El nombre dels pacients diabètics està en augment i creixements ràpids, de manera que el fa difícil que consumeix recursos per realitzar un cribatge anual a tots ells. L’objectiu principal d’aquest doctorat. la tesi consisteix en construir un sistema de suport de decisions clíniques (CDSS) basat en dades de registre de salut electrònic (EHR). S'utilitzarà aquest CDSS per estimar el risc de desenvolupar RD. En aquesta tesi doctoral s'estudien mètodes d'aprenentatge automàtic per constuir un CDSS basat en regles lingüístiques difuses. El coneixement expressat en aquest tipus de regles facilita que el metge sàpiga quines combindacions de les condicions són les poden provocar el risc de desenvolupar RD. En aquest treball, proposo un mètode per reduir la incertesa en la classificació dels pacients que utilitzen arbres de decisió difusos (FDT). A continuació es combinen diferents arbres, usant la tècnica de Fuzzy Random Forest per millorar la qualitat de la predicció. A continuació es proposen diverses tècniques d'agregació que millorin la fusió dels resultats que ens dóna cadascun dels arbres FDT. Per millorar la decisió final dels nostres models, proposo tres mesures difuses que s'utilitzen amb integrals de Choquet i Sugeno. La definició d’aquestes mesures difuses es basa en els valors de confiança de les regles. En particular, una d'elles és una mesura difusa que es troba en la qual l'estructura jeràrquica de la FDT és explotada per trobar els valors de la mesura difusa. El resultat final de la recerca feta ha donat lloc a un programari que es pot instal·lar en centres d’assistència primària i hospitals, i pot ser usat pels metges de capçalera per fer l'avaluació preventiva i el cribatge de la Retinopatia Diabètica.La retinopatía diabética (RD) es una enfermedad crónica. Es una de las principales complicaciones de diabetes y una causa esencial de pérdida de visión entre las personas que padecen diabetes. Los pacientes diabéticos deben ser examinados periódicamente para detectar signos de diabetes. desarrollo de retinopatía en una etapa temprana. La detección temprana y frecuente disminuye el riesgo de pérdida de visión y minimiza la carga en los centros de salud. El número de pacientes diabéticos es enorme y está aumentando rápidamente, lo que lo hace difícil y Consume recursos para realizar una evaluación anual para todos ellos. El objetivo principal de esta tesis es construir un sistema de apoyo a la decisión clínica (CDSS) basado en datos de registros de salud electrónicos (EHR). Este CDSS será utilizado para estimar el riesgo de desarrollar RD. En este tesis doctoral se estudian métodos de aprendizaje automático para construir un CDSS basado en reglas lingüísticas difusas. El conocimiento expresado en este tipo de reglas facilita que el médico pueda saber que combinaciones de las condiciones son las que pueden provocar el riesgo de desarrollar RD. En este trabajo propongo un método para reducir la incertidumbre en la clasificación de los pacientes que usan árboles de decisión difusos (FDT). A continuación se combinan diferentes árboles usando la técnica de Fuzzy Random Forest para mejorar la calidad de la predicción. Se proponen también varias políticas para fusionar los resultados de que nos da cada uno de los árboles (FDT). Para mejorar la decisión final propongo tres medidas difusas que se usan con las integrales Choquet y Sugeno. La definición de estas medidas difusas se basa en los valores de confianza de las reglas. En particular, uno de ellos es una medida difusa descomponible en la que se usa la estructura jerárquica del FDT para encontrar los valores de la medida difusa. Como resultado final de la investigación se ha construido un software que puede instalarse en centros de atención médica y hospitales, i que puede ser usado por los médicos de cabecera para hacer la evaluación preventiva y el cribado de la Retinopatía Diabética.Diabetic retinopathy (DR) is a chronic illness. It is one of the main complications of diabetes, and an essential cause of vision loss among people suffering from diabetes. Diabetic patients must be periodically screened in order to detect signs of diabetic retinopathy development in an early stage. Early and frequent screening decreases the risk of vision loss and minimizes the load on the health care centres. The number of the diabetic patients is huge and rapidly increasing so that makes it hard and resource-consuming to perform a yearly screening to all of them. The main goal of this Ph.D. thesis is to build a clinical decision support system (CDSS) based on electronic health record (EHR) data. This CDSS will be utilised to estimate the risk of developing RD. In this Ph.D. thesis, I focus on developing novel interpretable machine learning systems. Fuzzy based systems with linguistic terms are going to be proposed. The output of such systems makes the physician know what combinations of the features that can cause the risk of developing DR. In this work, I propose a method to reduce the uncertainty in classifying diabetic patients using fuzzy decision trees. A Fuzzy Random forest (FRF) approach is proposed as well to estimate the risk for developing DR. Several policies are going to be proposed to merge the classification results achieved by different Fuzzy Decision Trees (FDT) models to improve the quality of the final decision of our models, I propose three fuzzy measures that are used with Choquet and Sugeno integrals. The definition of these fuzzy measures is based on the confidence values of the rules. In particular, one of them is a decomposable fuzzy measure in which the hierarchical structure of the FDT is exploited to find the values of the fuzzy measure. Out of this Ph.D. work, we have built a CDSS software that may be installed in the health care centres and hospitals in order to evaluate and detect Diabetic Retinopathy at early stages
    corecore