
    Deep Over-sampling Framework for Classifying Imbalanced Data

    Class imbalance is a challenging issue in practical classification problems for deep learning models as well as for traditional models. Traditionally successful countermeasures such as synthetic over-sampling have had limited success with the complex, structured data handled by deep learning models. In this paper, we propose Deep Over-sampling (DOS), a framework that extends synthetic over-sampling to exploit the deep feature space acquired by a convolutional neural network (CNN). Its key feature is explicit, supervised representation learning: the training data pairs each raw input sample with a synthetic embedding target in the deep feature space, sampled from the linear subspace spanned by its in-class neighbors. We implement an iterative process of training the CNN and updating the targets, which induces smaller in-class variance among the embeddings and increases the discriminative power of the deep representation. We present an empirical study using public benchmarks, which shows that the DOS framework not only counteracts class imbalance better than the existing method, but also improves the performance of the CNN in standard, balanced settings.
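    A minimal sketch of the target-generation step described above, assuming the deep features of one minority class have already been extracted by the CNN. The function name make_embedding_targets and the neighbourhood size k are illustrative choices, not taken from the paper.

```python
# Sketch only: sample a synthetic embedding target for each in-class deep
# feature vector from a random convex combination of its k in-class neighbours.
import numpy as np
from sklearn.neighbors import NearestNeighbors

def make_embedding_targets(class_features: np.ndarray, k: int = 5, rng=None) -> np.ndarray:
    """Return one synthetic embedding target per in-class feature vector."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1).fit(class_features)
    _, idx = nn.kneighbors(class_features)
    # First neighbour is the point itself, so drop column 0.
    neighbours = class_features[idx[:, 1:]]           # shape (n, k, d)
    # Random convex weights over the k neighbours.
    w = rng.random((len(class_features), k))
    w /= w.sum(axis=1, keepdims=True)
    return np.einsum("nk,nkd->nd", w, neighbours)     # shape (n, d)

# Usage (hypothetical): targets = make_embedding_targets(minority_deep_features)
# The abstract describes retraining the CNN towards these targets and then
# recomputing them, iterating the two steps.
```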

    CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification

    Class imbalance classification is a challenging research problem in data mining and machine learning, as most real-life datasets are imbalanced in nature. Existing learning algorithms maximise classification accuracy by correctly classifying the majority class while misclassifying the minority class. However, in real-life applications the minority class instances represent the concept of greater interest than the majority class instances. Recently, several techniques based on sampling methods (under-sampling of the majority class and over-sampling of the minority class), cost-sensitive learning methods, and ensemble learning have been used in the literature for classifying imbalanced datasets. In this paper, we introduce a new clustering-based under-sampling approach combined with the boosting (AdaBoost) algorithm, called CUSBoost, for effective imbalanced classification. The proposed algorithm provides an alternative to the RUSBoost (random under-sampling with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with AdaBoost) algorithms. We evaluated the performance of the CUSBoost algorithm against state-of-the-art ensemble learning methods such as AdaBoost, RUSBoost, and SMOTEBoost on 13 imbalanced binary and multi-class datasets with various imbalance ratios. The experimental results show that CUSBoost is a promising and effective approach for dealing with highly imbalanced datasets.
    Comment: CSITSS-201
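    A hedged sketch of the core idea: cluster the majority class and draw an equal share from every cluster before boosting. CUSBoost itself re-samples within the boosting procedure; for brevity this sketch under-samples once and then fits scikit-learn's AdaBoostClassifier, and the parameter names are illustrative.

```python
# Sketch only: cluster-based under-sampling of the majority class, then AdaBoost.
import numpy as np
from sklearn.cluster import KMeans
from sklearn.ensemble import AdaBoostClassifier

def cluster_undersample(X, y, majority_label, n_clusters=10, keep_per_cluster=50, seed=0):
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y == majority_label)
    # Cluster only the majority-class instances.
    labels = KMeans(n_clusters=n_clusters, random_state=seed, n_init=10).fit_predict(X[maj_idx])
    kept = []
    for c in range(n_clusters):
        members = maj_idx[labels == c]
        # Keep at most keep_per_cluster instances from each cluster.
        kept.append(rng.choice(members, size=min(keep_per_cluster, len(members)), replace=False))
    sel = np.concatenate(kept + [np.flatnonzero(y != majority_label)])
    return X[sel], y[sel]

# Usage (hypothetical):
# X_bal, y_bal = cluster_undersample(X_train, y_train, majority_label=0)
# clf = AdaBoostClassifier(n_estimators=100).fit(X_bal, y_bal)
```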

    Autoencoders and Generative Adversarial Networks for Imbalanced Sequence Classification

    Generative Adversarial Networks (GANs) have been used in many different applications to generate realistic synthetic data. We introduce a novel GAN with Autoencoder (GAN-AE) architecture to generate synthetic samples for variable-length, multi-feature sequence datasets. In this model, we develop a GAN architecture with an additional autoencoder component, where recurrent neural networks (RNNs) are used for each component of the model, in order to generate synthetic data that improves classification accuracy for a highly imbalanced medical device dataset. In addition to the medical device dataset, we also evaluate GAN-AE performance on two additional datasets and demonstrate the application of GAN-AE to a sequence-to-sequence task where both synthetic sequence inputs and sequence outputs must be generated. To evaluate the quality of the synthetic data, we train encoder-decoder models both with and without the synthetic data and compare the classification model performance. We show that a model trained with GAN-AE-generated synthetic data outperforms models trained with synthetic data generated by standard oversampling techniques such as SMOTE and autoencoders, as well as by state-of-the-art GAN-based models.
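    The evaluation protocol described in the abstract, training the same classifier with and without generator-produced minority sequences and comparing scores, can be sketched as below. The helpers train_model and generate_minority_sequences are hypothetical placeholders standing in for the paper's encoder-decoder classifier and GAN-AE generator, and fixed-shape arrays are assumed for simplicity.

```python
# Sketch only: compare classifier quality with and without synthetic minority data.
import numpy as np
from sklearn.metrics import f1_score

def compare_with_and_without_synthetic(train_model, X_tr, y_tr, X_te, y_te,
                                       generate_minority_sequences, minority_label):
    # Baseline: train only on the original imbalanced data.
    baseline = train_model(X_tr, y_tr)
    f1_base = f1_score(y_te, baseline.predict(X_te), pos_label=minority_label)

    # Augmented: append synthetic minority sequences from the generator.
    X_syn = generate_minority_sequences(n_samples=len(X_tr))
    X_aug = np.concatenate([X_tr, X_syn])
    y_aug = np.concatenate([y_tr, np.full(len(X_syn), minority_label)])
    augmented = train_model(X_aug, y_aug)
    f1_aug = f1_score(y_te, augmented.predict(X_te), pos_label=minority_label)
    return f1_base, f1_aug
```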

    Sampling Strategies for Tackling Imbalanced Data in Human Activity Recognition

    Human activity recognition (HAR) using wearable sensors is a topic that is being actively researched in machine learning. Smart, sensor-embedded devices such as smartphones, fitness trackers, and smart watches that collect detailed data on movement are now widely available. HAR may be applied in areas such as healthcare, physiotherapy, and fitness to assist users of these smart devices in their daily lives. However, one of the main challenges facing HAR, particularly in supervised learning, is how balanced data may be obtained for algorithm optimisation and testing. Because users engage in some activities more than others, e.g. walking more than running, HAR datasets are typically imbalanced. The lack of dataset representation from minority classes therefore hinders the ability of HAR classifiers to sufficiently capture new instances of those activities.

    Inspired by the concept of data fusion, this thesis introduces three new hybrid sampling methods. The diversity of the synthesised samples is enhanced by combining the output of separate sampling methods into three hybrid approaches. The advantage of the hybrid method is that it provides diverse synthetic data, drawn from different sampling approaches, that can increase the size of the training data; this leads to improvements in the generalisation of a learned activity recognition model. The first strategy, known as DBM, combines the synthetic minority oversampling technique (SMOTE) with Random_SMOTE, both of which are built around the k-nearest neighbours algorithm. The second technique, called the noise detection-based method (NDBM), combines SMOTE with Tomek links (SMOTE_Tomeklinks) and the modified synthetic minority oversampling technique (MSMOTE). The third approach, titled the cluster-based method (CBM), combines cluster-based synthetic oversampling (CBSO) and the proximity weighted synthetic oversampling technique (ProWSyn). The performance of the proposed hybrid methods is compared with existing methods using accelerometer data from three commonly used benchmark datasets. The results show that the DBM, NDBM, and CBM can significantly reduce the impact of class imbalance and enhance F1 scores of the multilayer perceptron (MLP) by as much as 9% to 20% compared with their constituent sampling methods. The Friedman statistical significance test was also conducted to compare the effect of the different sampling methods; its results confirm that the CBM is more effective than the other sampling approaches.

    This thesis also introduces a method based on the Wasserstein generative adversarial network (WGAN) for generating different types of human activity data. The WGAN is more stable to train than a standard generative adversarial network (GAN) because it uses a stable metric, namely the Wasserstein distance, to compare the similarity between the real data distribution and the generated data distribution. WGAN is a deep learning approach, and in contrast to the six existing sampling methods referred to previously, it can operate on raw sensor data, as its convolutional and recurrent layers can act as feature extractors. WGAN is therefore used to generate raw sensor data and overcome the limitation of the traditional machine-learning-based sampling methods, which can only operate on extracted features. The synthetic data produced by WGAN is then used to oversample the imbalanced training data. This thesis demonstrates that this approach significantly enhances the learning ability of a convolutional neural network (CNN) by as much as 5% to 6% on imbalanced human activity datasets.

    This thesis concludes that the proposed sampling methods based on traditional machine learning are efficient when the human activity training data is imbalanced and small. These methods are less complex to implement, require less human activity training data to produce synthetic data, and need fewer computational resources than the WGAN approach. The proposed WGAN method is effective at producing raw sensor data when a large quantity of human activity training data is available. However, it is time-consuming to optimise the hyperparameters of the WGAN architecture, which significantly affect the performance of the method.
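    As a rough illustration of the hybrid idea behind DBM, NDBM, and CBM, the sketch below runs two oversamplers separately and pools their synthetic minority samples into one training set. Random_SMOTE, MSMOTE, CBSO, and ProWSyn are not available in imbalanced-learn, so SMOTE and BorderlineSMOTE stand in here purely as examples under that assumption.

```python
# Sketch only: pool synthetic samples from two different oversamplers.
import numpy as np
from imblearn.over_sampling import SMOTE, BorderlineSMOTE

def hybrid_oversample(X, y, sampler_a=None, sampler_b=None):
    sampler_a = sampler_a or SMOTE(random_state=0)
    sampler_b = sampler_b or BorderlineSMOTE(random_state=0)
    n = len(X)
    Xa, ya = sampler_a.fit_resample(X, y)
    Xb, yb = sampler_b.fit_resample(X, y)
    # Keep the original data once, plus the synthetic rows from both samplers
    # (imbalanced-learn's oversamplers append synthetic rows after the original n).
    X_hyb = np.concatenate([X, Xa[n:], Xb[n:]])
    y_hyb = np.concatenate([y, ya[n:], yb[n:]])
    return X_hyb, y_hyb

# The pooled, more diverse training set would then be used to fit the classifier
# (an MLP on extracted accelerometer features in the thesis).
```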