2 research outputs found

    Machine Learning in Population Health: Frequent Emergency Department Utilization Pattern Identification and Prediction

    Emergency Department (ED) overcrowding is an emerging risk to patient safety and may significantly affect chronically ill people. For instance, overcrowding in an ED may cause delays in patient transportation or revenue loss for hospitals due to hospital diversion. Frequent users with avoidable visits play a significant role in imposing such challenges on ED settings. Non-urgent or "avoidable" ED use induces overcrowding and increases costs through unnecessary tests and treatment. It is therefore valuable to understand the pattern of ED visits in a population and to prospectively identify frequent ED users, in order to provide stratified care management and resource allocation. Although most current models use classical methods such as descriptive analysis or regression modelling, more sophisticated techniques may be needed to increase the accuracy of outcomes where big data is in use. This study focuses on Machine Learning (ML) techniques to identify the ED usage pattern among frequent users and to evaluate the predictive ability of the models. I performed an extensive literature review to generate a list of potential predictors of frequent ED use. For this thesis, I used Korean Health Panel data from 2008 to 2015. Individuals with at least one ED visit were included, among whom those with four or more visits per year were considered frequent ED users. Demographic and clinical data were collected. The relationship between predictors and frequent ED use was examined through multivariable analysis. A K-modes clustering algorithm was applied to identify ED utilization patterns among frequent users. Finally, the performance of four machine learning classification algorithms was assessed and compared to logistic regression. The classification algorithms used in my thesis were Random Forest, Support Vector Machine (SVM), Bagging, and Voting.
The models' performance was evaluated based on Positive Predictive Value (PPV), sensitivity, Area Under Curve (AUC), and classification error. A total of 9,348 individuals with 15,627 ED visits were eligible for this study. Frequent ED users accounted for 2.4% of all ED visits. Frequent ED users tended to be older, male, and more likely to arrive by ambulance than non‐frequent ED users. In the cluster analysis, I identified three subgroups among frequent ED users: (i) older patients with respiratory system complaints and the highest discharge rates, who were more likely to visit in spring and winter; (ii) older patients with the highest rate of hospitalization, who were also more likely to have used an ambulance and who visited the ED due to circulatory system complaints; and (iii) younger patients, mostly female, with the highest rate of ED visits in summer and the lowest rate of ambulance use, who visited the ED mostly due to harm such as injuries and poisoning. The ML classification algorithms predicted frequent ED users with high precision (90% - 98%) and sensitivity (87% - 91%), and showed high AUC scores, from 89% for SVM to 96% for Random Forest. The classification error varied among algorithms; logistic regression had the highest classification error (34.9%), while Random Forest had the lowest (3.8%). According to the Random Forest importance score, the top five factors predicting frequent users were disease category, age, day of the week, season, and sex. In this thesis, I showed how ML methods apply to ED users in population health. The study results show that ML classification algorithms are robust techniques with predictive power for identifying and predicting future ED visits. As more data are collected and become available, machine learning approaches are a promising tool for advancing the understanding of such ‘Big’ data.
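
The evaluation measures named in this abstract (PPV, sensitivity, AUC, and classification error) can be illustrated with a minimal Python sketch. It is not the thesis code: the function names are hypothetical, labels are assumed to be binary (1 = frequent ED user, 0 = non-frequent), and AUC is computed with the pairwise rank definition rather than any particular library routine.

```python
def confusion_counts(y_true, y_pred):
    """Count true/false positives and negatives for binary labels (1 = positive)."""
    tp = fp = tn = fn = 0
    for truth, pred in zip(y_true, y_pred):
        if pred == 1 and truth == 1:
            tp += 1
        elif pred == 1 and truth == 0:
            fp += 1
        elif pred == 0 and truth == 0:
            tn += 1
        else:
            fn += 1
    return tp, fp, tn, fn


def classification_metrics(y_true, y_pred):
    """Return PPV (precision), sensitivity (recall), and classification error."""
    tp, fp, tn, fn = confusion_counts(y_true, y_pred)
    total = tp + fp + tn + fn
    return {
        # PPV: share of predicted positives that are truly positive.
        "ppv": tp / (tp + fp) if (tp + fp) else 0.0,
        # Sensitivity: share of true positives that the model recovers.
        "sensitivity": tp / (tp + fn) if (tp + fn) else 0.0,
        # Classification error: share of all cases that are misclassified.
        "error": (fp + fn) / total if total else 0.0,
    }


def auc(y_true, scores):
    """AUC as the probability that a random positive outscores a random negative.

    Ties count as one half. This pairwise form is quadratic in the sample size,
    which is acceptable for a small illustration.
    """
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative case")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))
```

With strongly imbalanced classes, as one would expect when frequent users are a small minority, classification error alone can look favorable for a model that rarely predicts the positive class, which is one reason to read it alongside PPV and sensitivity.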

    On Parallelization of Categorical Data Clustering

    We study parallelization of categorical data clustering algorithms on an MPI platform. Clustering such data has been a daunting task even for sequential algorithms, mainly due to the challenges in finding suitable similarity/distance measures. We propose a parallel version of the k-modes algorithm, called PV3, which maintains the same clustering quality as the sequential approach while achieving reasonable speed-ups. PV3 is programmed to ensure deterministic processing in a parallel environment. To produce better clustering results, we then develop an initialization method, based on the notion of density, called the Revised Density Method (RDM). Additionally, we develop variants of RDM to further enhance its performance. We then study effective ways to parallelize RDM and its variants. To further exploit opportunities for parallelism, we develop an Ensemble Parallelizing Process (EPP) framework. This framework can be used with any desired initialization/clustering algorithms at different levels of parallelism. Using our different RDM initialization techniques along with the PV3 algorithm in the EPP framework, we then build an RDM realization of EPP, called RDM EPP. The results of our numerous experiments on benchmark categorical datasets indicate that the quality metric of RDM EPP places it among the top three when ranked alongside sequential k-modes based clustering algorithms. In terms of speed-up, the results indicate that it is 7 times faster for some datasets, though much larger datasets are required for a more comprehensive scalability study of RDM EPP.
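
For orientation, the sequential k-modes procedure that this work parallelizes can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not PV3, RDM, or EPP: it uses simple matching (number of mismatched attributes) as the dissimilarity measure, and it seeds the modes with the first k distinct records purely to keep the run deterministic. The function names are hypothetical.

```python
from collections import Counter


def matching_distance(a, b):
    """Simple matching dissimilarity: number of attributes on which a and b differ."""
    return sum(x != y for x, y in zip(a, b))


def column_modes(rows):
    """Most frequent category per attribute; ties go to the value seen first."""
    return tuple(Counter(col).most_common(1)[0][0] for col in zip(*rows))


def k_modes(records, k, max_iter=100):
    """Cluster categorical records; returns (labels, modes).

    Assumption: modes are initialized from the first k distinct records,
    a placeholder for a real initialization method.
    """
    modes = []
    for record in records:
        record = tuple(record)
        if record not in modes:
            modes.append(record)
        if len(modes) == k:
            break
    if len(modes) < k:
        raise ValueError("fewer than k distinct records")

    labels = None
    for _ in range(max_iter):
        # Assignment step: each record joins the cluster with the nearest mode.
        new_labels = [
            min(range(k), key=lambda j: matching_distance(record, modes[j]))
            for record in records
        ]
        if new_labels == labels:
            break  # converged: no record changed cluster
        labels = new_labels
        # Update step: recompute each mode attribute by attribute.
        for j in range(k):
            members = [r for r, label in zip(records, labels) if label == j]
            if members:
                modes[j] = column_modes(members)
    return labels, modes
```

The assignment step treats every record independently, which is the part that lends itself most naturally to distribution across processes; the mode update requires combining per-cluster category counts, so a parallel version has to merge those counts in a fixed order if it is to reproduce the sequential result deterministically.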