5,353 research outputs found

    Multilevel Weighted Support Vector Machine for Classification on Healthcare Data with Missing Values

    This work is motivated by the needs of predictive analytics on healthcare data as represented by Electronic Medical Records. Such data is invariably problematic: noisy, with missing entries and imbalanced classes of interest, leading to serious bias in predictive modeling. Since standard data mining methods often perform poorly, we argue for the development of specialized techniques for data preprocessing and classification. In this paper, we propose a new method to simultaneously classify large datasets and reduce the effects of missing values. It is based on a multilevel framework of the cost-sensitive SVM and the expectation-maximization imputation method for missing values, which relies on iterated regression analyses. We compare classification results of multilevel SVM-based algorithms on public benchmark datasets with imbalanced classes and missing values as well as real data in health applications, and show that our multilevel SVM-based method produces fast, more accurate, and more robust classification results. Comment: arXiv admin note: substantial text overlap with arXiv:1503.0625

    Early hospital mortality prediction using vital signals

    Early hospital mortality prediction is critical as intensivists strive to make efficient medical decisions about severely ill patients in intensive care units. As a result, various methods based on clinical records have been developed to address this problem. However, some laboratory test results take time to obtain and need to be processed. In this paper, we propose a novel method to predict mortality using features extracted from the heart signals of patients within the first hour of ICU admission. To predict the risk, quantitative features are computed from the heart rate signals of ICU patients. Each signal is described in terms of 12 statistical and signal-based features. The extracted features are fed into eight classifiers: decision tree, linear discriminant, logistic regression, support vector machine (SVM), random forest, boosted trees, Gaussian SVM, and K-nearest neighbors (K-NN). To gain insight into the performance of the proposed method, several experiments were conducted using the well-known clinical dataset Medical Information Mart for Intensive Care III (MIMIC-III). The experimental results demonstrate the capability of the proposed method in terms of precision, recall, F1-score, and area under the receiver operating characteristic curve (AUC). The decision tree classifier satisfies both accuracy and interpretability better than the other classifiers, producing an F1-score and AUC of 0.91 and 0.93, respectively. This indicates that heart rate signals can be used to predict mortality in ICU patients, achieving performance comparable to existing predictions that rely on high-dimensional features from clinical records, which need to be processed and may contain missing information. Comment: 11 pages, 5 figures, preprint of accepted paper in IEEE&ACM CHASE 2018 and published in Smart Health journa

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In its early parts, this thesis proposes a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new reliable models’ performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to favour predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performance, in line with experts' success criteria, than traditional imbalanced data resampling techniques.
Finally, the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics. The overall contributions of this thesis can be summarised as follows. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, new methods were developed within the new framework. Finally, the newly developed and accepted predictive models help detect adverse health events, namely, visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining.

    A theoretical framework for research on readmission risk prediction

    Predictive analytics is an important field of research in Information Systems (IS); however, research on predictive analytics in healthcare is still scarce in the IS literature. One area where predictive analytics can be of great benefit is unplanned readmissions. While a number of studies on readmission prediction already exist in related research areas, there are few guidelines to date on how to conduct such analytics projects. To address this gap, the paper presents the general process for developing empirical models by Shmueli and Koppius (2011) and extends it to the specific requirements of readmission risk prediction. Based on a systematic literature review, the resulting process defines important aspects of readmission prediction. It also structures relevant questions and tasks that need to be addressed in this context. This extension of the guidelines by Shmueli and Koppius (2011) provides a best practice as well as a template that can be used in future studies on readmission risk prediction, thus allowing for more comparable results across various research fields.

    Development and Evaluation of an Interdisciplinary Periodontal Risk Prediction Tool Using a Machine Learning Approach

    Periodontitis (PD) is a major public health concern which profoundly affects oral health and, concomitantly, general health of the population worldwide. Evidence-based research continues to support an association between PD and systemic diseases such as diabetes and hypertension, among others. Notably, PD also represents a modifiable risk factor that may reduce the onset and progression of some systemic diseases, including diabetes. Due to lack of oral screening in medical settings, this population does not get flagged with the risk of developing PD. This study sought to develop a PD risk assessment model applicable at clinical point-of-care (POC) by comparing the performance of five supervised machine learning (ML) algorithms: Naïve Bayes, Logistic Regression, Support Vector Machine, Artificial Neural Network and Decision Tree, for modeling risk by retrospectively interrogating clinical data collected across seven different models of care (MOC) within the interdisciplinary settings. Risk assessment modeling was accomplished using the open-source Waikato Environment for Knowledge Analysis (WEKA) tool, which supported comparative assessment of the relative performance of the five ML algorithms when applied to risk prediction. To align with current conventions for clinical classification of disease severity, predicting PD risk was treated as a ‘classification problem’, where patients were sorted into two categories based on disease severity: ‘low risk PD’, defined as no or mild gum disease (‘controls’), and ‘high risk PD’, defined as moderate to severe disease (‘cases’). To assess the predictive performance of models, the study compared the ML algorithms by analyzing recall, specificity, area under the curve, precision, F-measure, Matthews correlation coefficient (MCC), and the receiver operating characteristic (ROC) curve. Tenfold cross-validation was performed.
External validation of the resultant models was achieved by creating validation data subsets applying random selection of approximately 10% of each class of data proportionately. Findings from this study have prognostic implications for assessing PD risk. Models evolved in the present study have translational value in that they can be incorporated into the Electronic Health Record (EHR) to support POC screening. Additionally, the study has defined relative performance of PD risk prediction models across various MOC environments. Moreover, these findings have established the power ML application can serve to create a decision support tool for dental providers in assessing PD status, severity and inform treatment decisions. Further, such risk scores could also inform medical providers regarding the need for patient referrals and management of comorbid conditions impacted by presence of oral disease such as PD. Finally, this study illustrates the benefit of the integrated medical and dental care delivery environment for detecting risk of periodontitis at a stage when implementation of proven interventions could delay and even prevent disease progression. Keywords: Periodontitis, Risk Assessment, Interprofessional Relations, Machine learning, Electronic Health Records, Decision Support System
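The evaluation measures named in the abstract above (recall, specificity, precision, F-measure and MCC) can all be derived from the four counts of a binary confusion matrix. The following is a minimal sketch, not the study's WEKA workflow; the function name `binary_metrics` and the convention of returning 0.0 for an undefined ratio are assumptions made for illustration.

```python
import math


def _safe_div(numerator, denominator):
    # Assumed convention: report 0.0 when a ratio is undefined.
    return numerator / denominator if denominator else 0.0


def binary_metrics(tp, fp, tn, fn):
    """Compute common binary-classification metrics from confusion counts.

    tp/fn count the 'cases' (high risk) and tn/fp count the 'controls'.
    """
    recall = _safe_div(tp, tp + fn)          # sensitivity
    specificity = _safe_div(tn, tn + fp)
    precision = _safe_div(tp, tp + fp)
    f_measure = _safe_div(2 * precision * recall, precision + recall)
    # Matthews correlation coefficient, in the range [-1, 1].
    mcc_denominator = math.sqrt(
        (tp + fp) * (tp + fn) * (tn + fp) * (tn + fn)
    )
    mcc = _safe_div(tp * tn - fp * fn, mcc_denominator)
    return {
        "recall": recall,
        "specificity": specificity,
        "precision": precision,
        "f_measure": f_measure,
        "mcc": mcc,
    }


if __name__ == "__main__":
    # Made-up counts, purely for demonstration.
    for name, value in binary_metrics(tp=40, fp=10, tn=45, fn=5).items():
        print(f"{name}: {value:.3f}")
```

Unlike precision and recall, MCC uses all four counts, which is one reason it is often reported alongside them when classes are imbalanced.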

    Evaluating Sampling Techniques for Healthcare Insurance Fraud Detection in Imbalanced Dataset

    Detecting fraud in healthcare insurance datasets is challenging due to severe class imbalance, where fraud cases are rare compared to non-fraud cases. Various techniques have been applied to address this problem, such as oversampling and undersampling methods. However, there is a lack of comparison and evaluation of these sampling methods. Therefore, the research contribution of this study is to conduct a comprehensive evaluation of the different sampling methods in different class distributions, utilizing multiple evaluation metrics, including , , , Precision, and Recall. In addition, a model evaluation approach is proposed to address the issue of inconsistent scores across different metrics. This study employs a real-world dataset with the XGBoost algorithm alongside widely used data sampling techniques such as Random Oversampling and Undersampling, SMOTE, and Instance Hardness Threshold. Results indicate that Random Oversampling and Undersampling perform well in the 50% distribution, while SMOTE and Instance Hardness Threshold methods are more effective in the 70% distribution. Instance Hardness Threshold performs best in the 90% distribution. The 70% distribution is more robust with SMOTE and Instance Hardness Threshold, particularly in the consistency of scores across different metrics, although they have longer computation times. These models consistently performed well across all evaluation metrics, indicating their ability to generalize to new unseen data in both the minority and majority classes. The study also identifies key features such as costs, diagnosis codes, type of healthcare service, gender, and severity level of diseases, which are important for accurate healthcare insurance fraud detection. These findings could be valuable for healthcare providers to make informed decisions with lower risks. A well-performing fraud detection model ensures the accurate classification of fraud and non-fraud cases.
The findings can also be used by healthcare insurance providers to develop more effective fraud detection and prevention strategies.
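To make the idea of resampling to a chosen class distribution concrete, here is a minimal sketch of random oversampling in plain Python. It is not the study's pipeline: the function name `random_oversample` and the `target_fraction` parameter are hypothetical, and the sketch assumes the target is the minority share after resampling, which the abstract does not spell out.

```python
import math
import random


def random_oversample(rows, labels, minority_label, target_fraction, seed=0):
    """Duplicate random minority rows until they make up target_fraction.

    target_fraction is the desired minority share of the resampled data
    (an assumption for this sketch), strictly between 0 and 1.
    """
    if not 0 < target_fraction < 1:
        raise ValueError("target_fraction must be strictly between 0 and 1")
    minority_idx = [i for i, y in enumerate(labels) if y == minority_label]
    if not minority_idx:
        raise ValueError("no rows carry the minority label")

    n_majority = len(labels) - len(minority_idx)
    # Minority rows needed so that minority / (minority + majority)
    # reaches target_fraction.
    needed = math.ceil(target_fraction * n_majority / (1 - target_fraction))
    n_extra = max(0, needed - len(minority_idx))

    rng = random.Random(seed)
    extra_idx = [rng.choice(minority_idx) for _ in range(n_extra)]
    keep = list(range(len(labels))) + extra_idx
    return [rows[i] for i in keep], [labels[i] for i in keep]


if __name__ == "__main__":
    # Toy data: 90 non-fraud rows and 10 fraud rows.
    rows = [[i] for i in range(100)]
    labels = [0] * 90 + [1] * 10
    _, new_labels = random_oversample(rows, labels, 1, target_fraction=0.5)
    print(len(new_labels), sum(new_labels))
```

Resampling of this kind is normally applied to the training split only, so that evaluation still reflects the original class distribution.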

    Imbalance Learning and Its Application on Medical Datasets

    To gain more valuable information from the increasingly large amount of data, data mining has been a hot topic that has attracted growing attention over the past two decades. One of the challenges in data mining is imbalance learning, which refers to learning from imbalanced datasets. An imbalanced dataset is dominated by some classes (the majority), while other classes (the minority) are under-represented. Imbalanced datasets degrade the learning ability of traditional methods, which are designed on the assumption that all classes are balanced and have equal misclassification costs, leading to poor performance on the minority classes. This phenomenon is usually called the class imbalance problem. However, it is usually the minority classes that are of more interest and importance, such as sick cases in a medical dataset. Additionally, traditional methods are optimized to achieve maximum accuracy, which is not suitable for evaluating performance on imbalanced datasets. From the view of data space, class imbalance can be classified as extrinsic imbalance and intrinsic imbalance. Extrinsic imbalance is caused by external factors, such as data transmission or data storage, while intrinsic imbalance means the dataset is inherently imbalanced due to its nature. As extrinsic imbalance could be fixed by collecting more samples, this thesis mainly focuses on two scenarios of intrinsic imbalance: machine learning for imbalanced structured datasets and deep learning for imbalanced image datasets. Normally, the solutions for the class imbalance problem are called imbalance learning methods, which can be grouped into data-level (re-sampling) methods, algorithm-level (re-weighting) methods and hybrid methods. Data-level methods modify the class distribution of the training dataset to create balanced training sets, and typical examples are over-sampling and under-sampling.
Instead of modifying the data distribution, algorithm-level methods adjust the misclassification cost to alleviate the class imbalance problem, and one typical example is cost-sensitive methods. Hybrid methods usually combine data-level methods and algorithm-level methods. However, existing imbalance learning methods encounter different kinds of problems. Over-sampling methods increase the number of minority samples to create balanced training sets, which might lead the trained model to overfit to the minority class. Under-sampling methods create balanced training sets by discarding majority samples, which leads to information loss and poor performance of the trained model. Cost-sensitive methods usually need assistance from domain experts to define the misclassification costs, which are task-specific. Thus, the generalization ability of cost-sensitive methods is poor. In particular, when it comes to deep learning methods under class imbalance, re-sampling methods may introduce a large computation cost and existing re-weighting methods could lead to poor performance. The objective of this dissertation is to understand feature differences under class imbalance in order to improve classification performance on structured datasets or image datasets. This thesis proposes two machine learning methods for imbalanced structured datasets and one deep learning method for imbalanced image datasets. The proposed methods are evaluated on several medical datasets, which are intrinsically imbalanced. Firstly, we study the feature difference between the majority class and the minority class of an imbalanced medical dataset collected from a Chinese hospital. After data cleaning and structuring, we obtain 3292 kidney stone cases treated by Percutaneous Nephrolithotomy from 2012 to 2019. There are 651 (19.78%) cases with postoperative complications, which makes complication prediction an imbalanced classification task.
We propose a sampling-based method, SMOTE-XGBoost, and implement it to build a postoperative complication prediction model. Experimental results show that the proposed method outperforms classic machine learning methods. Furthermore, traditional prediction models for Percutaneous Nephrolithotomy are designed to predict the kidney stone status and overlook complication-related features, which could degrade their performance on complication prediction tasks. To this end, we merge more features into the proposed sampling-based method and further improve the classification performance. Overall, SMOTE-XGBoost achieves an AUC of 0.7077, which is 41.54% higher than that of S.T.O.N.E. nephrolithometry, a traditional prediction model for Percutaneous Nephrolithotomy. After reviewing the existing machine learning methods under class imbalance, we propose a novel ensemble learning approach called Multiple bAlance Subset Stacking (MASS). MASS first cuts the majority class into multiple subsets by the size of the minority set, and combines each majority subset with the minority set as one balanced subset. In this way, MASS could overcome the problem of information loss because it does not discard any majority sample. Each balanced subset is used to train one base classifier. Then, the original dataset is fed to all the trained base classifiers, whose outputs are used to generate the stacking dataset. One stacking model is trained on the stacking dataset to get the optimal weights for the base classifiers. Because the stacking dataset keeps the same labels as the original dataset, this could avoid the overfitting problem. Finally, we obtain a strong ensemble model based on the trained base classifiers and the stacking model. Extensive experimental results on three medical datasets show that MASS outperforms baseline methods. The robustness of MASS is demonstrated by using different base classifiers. We design a parallel version of MASS to reduce the training time cost.
The speedup analysis shows that Parallel MASS could greatly reduce the training time cost when applied to large datasets. In particular, Parallel MASS reduces training time by up to 101.8% compared with MASS in our experiments. When it comes to the class imbalance problem of image datasets, existing imbalance learning methods suffer from large training cost and poor performance. After introducing the problem of applying resampling methods to image classification tasks, we demonstrate issues of the re-weighting strategy using class frequencies through experimental results on one medical image dataset. We propose a novel re-weighting method, Hardness Aware Dynamic (HAD) loss, to solve the class imbalance problem of image datasets. After each training epoch of deep neural networks, we compute the classification hardness of each class. In the next epoch, we assign higher class weights to the classes that have large classification hardness values and vice versa. In this way, HAD could tune the weight of each sample in the loss function dynamically during the training process. The experimental results show that HAD significantly outperforms the state-of-the-art methods. Moreover, HAD greatly improves the classification accuracies of minority classes while making only a small compromise in majority class accuracies. In particular, HAD loss improves average precision by 10.04% over the best baseline, Focal loss, on the HAM10000 dataset. Finally, I conclude this dissertation with our contributions to imbalance learning and provide an overview of potential directions for future research, which include extensions of the three proposed methods, development of task-specific algorithms, and addressing the challenges of within-class imbalance. 2021-06-0
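The first step of MASS as described in the abstract above, cutting the majority class into minority-sized pieces and pairing each piece with the full minority set, can be sketched in a few lines of plain Python. This is only an illustration of that partitioning step under stated assumptions: the function name `balanced_subsets`, the random shuffle before cutting, and the handling of a smaller final piece are all hypothetical choices, and training the base classifiers and the stacking model is not shown.

```python
import random


def balanced_subsets(majority, minority, seed=0):
    """Split the majority samples into minority-sized chunks and append
    the whole minority set to each chunk.

    Every majority sample lands in exactly one subset, so none is
    discarded. The last chunk may be smaller than the minority set when
    the sizes do not divide evenly (an assumption of this sketch).
    """
    if not minority:
        raise ValueError("minority set must not be empty")
    rng = random.Random(seed)
    shuffled = list(majority)
    rng.shuffle(shuffled)  # assumed: random assignment of majority samples

    size = len(minority)
    subsets = []
    for start in range(0, len(shuffled), size):
        chunk = shuffled[start:start + size]
        subsets.append(chunk + list(minority))
    return subsets


if __name__ == "__main__":
    # Toy identifiers standing in for samples.
    majority = list(range(10))
    minority = ["a", "b", "c"]
    for subset in balanced_subsets(majority, minority):
        print(subset)
```

In the described method, each returned subset would train one base classifier, and the predictions of all base classifiers on the original dataset would form the inputs of the stacking model.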