471 research outputs found

    Stacked Generalizations in Imbalanced Fraud Data Sets using Resampling Methods

    Full text link
    This study uses stacked generalization, which is a two-step process of combining machine learning methods, called meta or super learners, for improving the performance of algorithms in step one (by minimizing the error rate of each individual algorithm to reduce its bias in the learning set) and then in step two inputting the results into the meta learner with its stacked blended output (demonstrating improved performance with the weakest algorithms learning better). The method is essentially an enhanced cross-validation strategy. Although the process uses great computational resources, the resulting performance metrics on resampled fraud data show that increased system cost can be justified. A fundamental key to fraud data is that it is inherently not systematic and, as of yet, the optimal resampling methodology has not been identified. Building a test harness that accounts for all permutations of algorithm sample set pairs demonstrates that the complex, intrinsic data structures are all thoroughly tested. Using a comparative analysis on fraud data that applies stacked generalizations provides useful insight needed to find the optimal mathematical formula to be used for imbalanced fraud data sets.Comment: 19 pages, 3 figures, 8 table

    Automated Intelligent Cueing Device to Improve Ambient Gait Behaviors for Patients with Parkinson\u27s Disease

    Get PDF
    Freezing of gait (FoG) is a common motor dysfunction in individuals with Parkinson’s disease (PD). FoG impairs walking and is associated with increased fall risk. Although pharmacological treatments have shown promise during ON-medication periods, FoG remains difficult to treat during medication OFF state and in advanced stages of the disease. External cueing therapy in the forms of visual, auditory, and vibrotactile, has been effective in treating gait deviations. Intelligent (or on-demand) cueing devices are novel systems that analyze gait patterns in real-time and activate cues only at moments when specific gait alterations are detected. In this study we developed methods to analyze gait signals collected through wearable sensors and accurately identify FoG episodes. We also investigated the potential of predicting the symptoms before their actual occurrence. We collected data from seven participants with PD using two Inertial Measurement Units (IMUs) on ankles. In our first study, we extracted engineered features from the signals and used machine learning (ML) methods to identify FoG episodes. We tested the performance of models using patient-dependent and patient-independent paradigms. The former models achieved 92.5% and 89.0% for average sensitivity and specificity, respectively. However, the conventional binary classification methods fail to accurately classify data if only data from normal gait periods are available. In order to identify FoG episodes in participants who did not freeze during data collection sessions, we developed a Deep Gait Anomaly Detector (DGAD) to identify anomalies (i.e., FoG) in the signals. DGAD was formed of convolutional layers and trained to automatically learn features from signals. The convolutional layers are followed by fully connected layers to reduce the dimensions of the features. A k-nearest neighbors (kNN) classifier is then used to classify the data as normal or FoG. The models identified 87.4% of FoG onsets, with 21.9% being predicted on average for each participant. This study demonstrates our algorithm\u27s potential for delivery of preventive cues. The DGAD algorithm was then implemented in an Android application to monitor gait patterns of PD patients in ambient environments. The phone triggered vibrotactile and auditory cues on a connected smartwatch if an FoG episode was identified. A 6-week in-home study showed the potentials for effective treatment of FoG severity in ambient environments using intelligent cueing devices

    Predictive Modelling Approach to Data-Driven Computational Preventive Medicine

    Get PDF
    This thesis contributes novel predictive modelling approaches to data-driven computational preventive medicine and offers an alternative framework to statistical analysis in preventive medicine research. In the early parts of this research, this thesis presents research by proposing a synergy of machine learning methods for detecting patterns and developing inexpensive predictive models from healthcare data to classify the potential occurrence of adverse health events. In particular, the data-driven methodology is founded upon a heuristic-systematic assessment of several machine-learning methods, data preprocessing techniques, models’ training estimation and optimisation, and performance evaluation, yielding a novel computational data-driven framework, Octopus. Midway through this research, this thesis advances research in preventive medicine and data mining by proposing several new extensions in data preparation and preprocessing. It offers new recommendations for data quality assessment checks, a novel multimethod imputation (MMI) process for missing data mitigation, a novel imbalanced resampling approach, and minority pattern reconstruction (MPR) led by information theory. This thesis also extends the area of model performance evaluation with a novel classification performance ranking metric called XDistance. In particular, the experimental results show that building predictive models with the methods guided by our new framework (Octopus) yields domain experts' approval of the new reliable models’ performance. Also, performing the data quality checks and applying the MMI process led healthcare practitioners to outweigh predictive reliability over interpretability. The application of MPR and its hybrid resampling strategies led to better performances in line with experts' success criteria than the traditional imbalanced data resampling techniques. Finally, the use of the XDistance performance ranking metric was found to be more effective in ranking several classifiers' performances while offering an indication of class bias, unlike existing performance metrics The overall contributions of this thesis can be summarised as follow. First, several data mining techniques were thoroughly assessed to formulate the new Octopus framework to produce new reliable classifiers. In addition, we offer a further understanding of the impact of newly engineered features, the physical activity index (PAI) and biological effective dose (BED). Second, the newly developed methods within the new framework. Finally, the newly accepted developed predictive models help detect adverse health events, namely, visceral fat-associated diseases and advanced breast cancer radiotherapy toxicity side effects. These contributions could be used to guide future theories, experiments and healthcare interventions in preventive medicine and data mining

    A Back Propagation Neural Network Model with the Synthetic Minority Over-Sampling Technique for Construction Company Bankruptcy Prediction

    Get PDF
    Improving model accuracy is one of the most frequently addressed issues in bankruptcy prediction. Several previous studies employed artificial neural networks (ANNs) to improve the accuracy at which construction company bankruptcy can be predicted. However, most of these studies use the sample-matching technique and all of the available company quarters or company years in the dataset, resulting in sample selection biases and between-class imbalances. This study integrates a back propagation neural network (BPNN) with the synthetic minority over-sampling technique (SMOTE) and the use of all of the available company-year samples during the sample period to improve the accuracy at which bankruptcy in construction companies can be predicted. In addition to eliminating sample selection biases during the sample matching and between-class imbalance, these methods also achieve the high accuracy rates. Furthermore, the approach used in this study shows optimal over-sampling times, neurons of the hidden layer, and learning rate, all of which are major parameters in the BPNN and SMOTE-BPNN models. The traditional BPNN model is provided as a benchmark for evaluating the predictive abilities of the SMOTE-BPNN model. The empirical results of this paper show that the SMOTE-BPNN model outperforms the traditional BPNN

    Quantifying soybean phenotypes using UAV imagery and machine learning, deep learning methods

    Get PDF
    Crop breeding programs aim to introduce new cultivars to the world with improved traits to solve the food crisis. Food production should need to be twice of current growth rate to feed the increasing number of people by 2050. Soybean is one the major grain in the world and only US contributes around 35 percent of world soybean production. To increase soybean production, breeders still rely on conventional breeding strategy, which is mainly a 'trial and error' process. These constraints limit the expected progress of the crop breeding program. The goal was to quantify the soybean phenotypes of plant lodging and pubescence color using UAV-based imagery and advanced machine learning. Plant lodging and soybean pubescence color are two of the most important phenotypes for soybean breeding programs. Soybean lodging and pubescence color is conventionally evaluated visually by breeders, which is time-consuming and subjective to human errors. The goal of this study was to investigate the potential of unmanned aerial vehicle (UAV)-based imagery and machine learning in the assessment of lodging conditions and deep learning in the assessment pubescence color of soybean breeding lines. A UAV imaging system equipped with an RGB (red-green-blue) camera was used to collect the imagery data of 1,266 four-row plots in a soybean breeding field at the reproductive stage. Soybean lodging scores and pubescence scores were visually assessed by experienced breeders. Lodging scores were grouped into four classes, i.e., non-lodging, moderate lodging, high lodging, and severe lodging. In contrast, pubescence color scores were grouped into three classes, i.e., gray, tawny, and segregation. UAV images were stitched to build orthomosaics, and soybean plots were segmented using a grid method. Twelve image features were extracted from the collected images to assess the lodging scores of each breeding line. Four models, i.e., extreme gradient boosting (XGBoost), random forest (RF), K-nearest neighbor (KNN), and artificial neural network (ANN), were evaluated to classify soybean lodging classes. Five data pre-processing methods were used to treat the imbalanced dataset to improve the classification accuracy. Results indicate that the pre-processing method SMOTE-ENN consistently performs well for all four (XGBoost, RF, KNN, and ANN) classifiers, achieving the highest overall accuracy (OA), lowest misclassification, higher F1-score, and higher Kappa coefficient. This suggests that Synthetic Minority Over-sampling-Edited Nearest Neighbor (SMOTE-ENN) may be an excellent pre-processing method for using unbalanced datasets and classification tasks. Furthermore, an overall accuracy of 96 percent was obtained using the SMOTE-ENN dataset and ANN classifier. On the other hand, to classify the soybean pubescence color, seven pre-trained deep learning models, i.e., DenseNet121, DenseNet169, DenseNet201, ResNet50, InceptionResNet-V2, Inception-V3, and EfficientNet were used, and images of each plot were fed into the model. Data was enhanced using two rotational and two scaling factors to increase the datasets. Among the seven pre-trained deep learning models, ResNet50 and DenseNet121 classifiers showed a higher overall accuracy of 88 percent, along with higher precision, recall, and F1-score for all three classes of pubescence color. In conclusion, the developed UAV-based high-throughput phenotyping system can gather image features to estimate soybean crucial phenotypes and classify the phenotypes, which will help the breeders in phenotypic variations in breeding trials. Also, the RGB imagery-based classification could be a cost-effective choice for breeders and associated researchers for plant breeding programs in identifying superior genotypes.Includes bibliographical references

    Automatic Multi-Label ECG Classification with Category Imbalance and Cost-Sensitive Thresholding

    Get PDF
    From MDPI via Jisc Publications RouterHistory: accepted 2021-11-12, pub-electronic 2021-11-14Publication status: PublishedFunder: Collaborative Innovation Center for Prevention and Treatment of Cardiovascular Disease of Si-chuan Province (CICPTCDSP); Grant(s): xtcx2019-01Automatic electrocardiogram (ECG) classification is a promising technology for the early screening and follow-up management of cardiovascular diseases. It is, by nature, a multi-label classification task owing to the coexistence of different kinds of diseases, and is challenging due to the large number of possible label combinations and the imbalance among categories. Furthermore, the task of multi-label ECG classification is cost-sensitive, a fact that has usually been ignored in previous studies on the development of the model. To address these problems, in this work, we propose a novel deep learning model–based learning framework and a thresholding method, namely category imbalance and cost-sensitive thresholding (CICST), to incorporate prior knowledge about classification costs and the characteristic of category imbalance in designing a multi-label ECG classifier. The learning framework combines a residual convolutional network with a class-wise attention mechanism. We evaluate our method with a cost-sensitive metric on multiple realistic datasets. The results show that CICST achieved a cost-sensitive metric score of 0.641 ± 0.009 in a 5-fold cross-validation, outperforming other commonly used thresholding methods, including rank-based thresholding, proportion-based thresholding, and fixed thresholding. This demonstrates that, by taking into account the category imbalance and predefined cost information, our approach is effective in improving the performance and practicability of multi-label ECG classification models

    Learning from class-imbalanced data: overlap-driven resampling for imbalanced data classification.

    Get PDF
    Classification of imbalanced datasets has attracted substantial research interest over the past years. This is because imbalanced datasets are common in several domains such as health, finance and security, but learning algorithms are generally not designed to handle them. Many existing solutions focus mainly on the class distribution problem. However, a number of reports showed that class overlap had a higher negative impact on the learning process than class imbalance. This thesis thoroughly explores the impact of class overlap on the learning algorithm and demonstrates how elimination of class overlap can effectively improve the classification of imbalanced datasets. Novel undersampling approaches were developed with the main objective of enhancing the presence of minority class instances in the overlapping region. This is achieved by identifying and removing majority class instances potentially residing in such a region. Seven methods under the two different approaches were designed for the task. Extensive experiments were carried out to evaluate the methods on simulated and well-known real-world datasets. Results showed that substantial improvement in the classification accuracy of the minority class was obtained with favourable trade-offs with the majority class accuracy. Moreover, successful application of the methods in predictive diagnostics of diseases with imbalanced records is presented. These novel overlap-based approaches have several advantages over other common resampling methods. First, the undersampling amount is independent of class imbalance and proportional to the degree of overlap. This could effectively address the problem of class overlap while reducing the effect of class imbalance. Second, information loss is minimised as instance elimination is contained within the problematic region. Third, adaptive parameters enable the methods to be generalised across different problems. It is also worth pointing out that these methods provide different trade-offs, which offer more alternatives to real-world users in selecting the best fit solution to the problem

    Developing and deploying data mining techniques in healthcare

    Get PDF
    Improving healthcare is a top priority for all nations. US healthcare expenditure was $3 trillion in 2014. In the same year, the share of GDP assigned to healthcare expenditure was 17.5%. These statistics shows the importance of making improvement in healthcare delivery system. In this research, we developed several data mining methods and algorithms to address healthcare problems. These methods can also be applied to the problems in other domains.The first part of this dissertation is about rare item problem in association analysis. This problem deals with the discovering rare rules, which include rare items. In this study, we introduced a novel assessment metric, called adjusted support to address this problem. By applying this metric, we can retrieve rare rules without over-generating association rules. We applied this method to perform association analysis on complications of diabetes.The second part of this dissertation is developing a clinical decision support system for predicting retinopathy. Retinopathy is the leading cause of vision loss among American adults. In this research, we analyzed data from more than 1.4 million diabetic patients and developed four sets of predictive models: basic, comorbid, over-sampled, and ensemble models. The results show that incorporating comorbidity data and oversampling improved the accuracy of prediction. In addition, we developed a novel "confidence margin" ensemble approach that outperformed the existing ensemble models. In ensemble models, we also addressed the issue of tie in voting-based ensemble models by comparing the confidence margins of the base predictors.The third part of this dissertation addresses the problem of imbalanced data learning, which is a major challenge in machine learning. While a standard machine learning technique could have a good performance on balanced datasets, when applied to imbalanced datasets its performance deteriorates dramatically. This poor performance is rather troublesome especially in detecting the minority class that usually is the class of interest. In this study, we proposed a synthetic informative minority over-sampling (SIMO) algorithm embedded into support vector machine. We applied SIMO to 15 publicly available benchmark datasets and assessed its performance in comparison with seven existing approaches. The results showed that SIMO outperformed all existing approaches
    • …
    corecore