Deep Over-sampling Framework for Classifying Imbalanced Data
Class imbalance is a challenging issue in practical classification problems
for deep learning models as well as traditional models. Traditionally
successful countermeasures such as synthetic over-sampling have had limited
success with complex, structured data handled by deep learning models. In this
paper, we propose Deep Over-sampling (DOS), a framework for extending the
synthetic over-sampling method to exploit the deep feature space acquired by a
convolutional neural network (CNN). Its key feature is explicit, supervised
representation learning, in which the training data pairs each raw input
sample with a synthetic embedding target in the deep feature space, sampled
from the linear subspace of its in-class neighbors. We implement an
iterative process of training the CNN and updating the targets, which induces
smaller in-class variance among the embeddings, to increase the discriminative
power of the deep representation. We present an empirical study using public
benchmarks, which shows that the DOS framework not only counteracts class
imbalance better than the existing method, but also improves the performance of
the CNN in the standard, balanced settings.
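The abstract's core sampling step — drawing a synthetic embedding target from the linear subspace of a sample's in-class neighbors — can be sketched as follows. This is an illustrative reconstruction, not the paper's code; the function name, the convex-combination sampling, and the parameter `k` are assumptions.

```python
import numpy as np

def synthetic_target(embedding, in_class_embeddings, k=5, rng=None):
    """Illustrative sketch of the DOS target-sampling idea: draw a
    synthetic embedding target from the subspace spanned by the k
    nearest in-class neighbors of a sample's deep embedding."""
    rng = np.random.default_rng() if rng is None else rng
    # distances from this sample's embedding to every in-class embedding
    d = np.linalg.norm(in_class_embeddings - embedding, axis=1)
    neighbors = in_class_embeddings[np.argsort(d)[:k]]
    # random convex combination -> a point in the neighbors' subspace
    w = rng.random(k)
    w /= w.sum()
    return w @ neighbors
```

During training, each raw input would then be paired with such a target, and the targets would be resampled as the embeddings are updated.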
CUSBoost: Cluster-based Under-sampling with Boosting for Imbalanced Classification
Class imbalance classification is a challenging research problem in data
mining and machine learning, as most real-life datasets are imbalanced in
nature. Standard learning algorithms maximise classification accuracy by
correctly classifying the majority class while misclassifying the minority
class, even though in real-life applications the minority class instances
typically represent the concept of greater interest. Recently, several
techniques based on sampling methods (under-sampling of the majority class
and over-sampling of the minority class), cost-sensitive learning methods,
and ensemble learning have been used in the literature for classifying
imbalanced datasets. In this paper, we introduce a new cluster-based
under-sampling approach combined with the boosting (AdaBoost) algorithm,
called CUSBoost, for effective imbalanced classification. The proposed
algorithm provides an alternative to the RUSBoost (random under-sampling
with AdaBoost) and SMOTEBoost (synthetic minority over-sampling with
AdaBoost) algorithms. We evaluated the performance of the CUSBoost algorithm
against state-of-the-art ensemble methods such as AdaBoost, RUSBoost, and
SMOTEBoost on 13 imbalanced binary and multi-class datasets with various
imbalance ratios. The experimental results show that CUSBoost is a promising
and effective approach for dealing with highly imbalanced datasets.
Comment: CSITSS-201
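The cluster-based under-sampling step that CUSBoost is built on can be sketched as below. This is an illustrative reconstruction under assumptions, not the paper's implementation: the function name, the fixed `frac` parameter, and the use of scikit-learn's k-means are choices made here for the sketch.

```python
import numpy as np
from sklearn.cluster import KMeans

def cluster_undersample(X_maj, n_clusters=3, frac=0.5, rng=None):
    """Illustrative cluster-based under-sampling of the majority class:
    k-means partitions the majority samples, then an equal fraction is
    drawn at random from every cluster, so the reduced set still covers
    the majority class's structure (unlike purely random removal)."""
    rng = np.random.default_rng() if rng is None else rng
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=0).fit_predict(X_maj)
    keep = []
    for c in range(n_clusters):
        idx = np.flatnonzero(labels == c)
        n_keep = max(1, int(frac * len(idx)))
        keep.extend(rng.choice(idx, size=n_keep, replace=False))
    return X_maj[np.sort(keep)]
```

In the boosting loop, the under-sampled majority set would be recombined with all minority samples before each weak learner is fit.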
Autoencoders and Generative Adversarial Networks for Imbalanced Sequence Classification
Generative Adversarial Networks (GANs) have been used in many different
applications to generate realistic synthetic data. We introduce a novel GAN
with Autoencoder (GAN-AE) architecture to generate synthetic samples for
variable length, multi-feature sequence datasets. In this model, we develop a
GAN architecture with an additional autoencoder component, where recurrent
neural networks (RNNs) are used for each component of the model in order to
generate synthetic data to improve classification accuracy for a highly
imbalanced medical device dataset. In addition to the medical device dataset,
we also evaluate the GAN-AE performance on two additional datasets and
demonstrate the application of GAN-AE to a sequence-to-sequence task where both
synthetic sequence inputs and sequence outputs must be generated. To evaluate
the quality of the synthetic data, we train encoder-decoder models both with
and without the synthetic data and compare the classification model
performance. We show that a model trained with GAN-AE-generated synthetic
data outperforms models trained with synthetic data generated by standard
oversampling techniques such as SMOTE, by autoencoders, and by
state-of-the-art GAN-based models.
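For contrast with the GAN-AE, the SMOTE baseline mentioned in the abstract can be sketched in a few lines. This is a minimal illustrative version (function name and parameters are assumptions, not a library API): each synthetic point is a random interpolation between a minority sample and one of its nearest minority-class neighbours — which is exactly the kind of fixed-length, feature-vector operation that struggles with variable-length sequences.

```python
import numpy as np

def smote(X_min, n_new, k=5, rng=None):
    """Minimal SMOTE sketch: interpolate between a random minority
    sample and a randomly chosen one of its k nearest minority-class
    neighbours."""
    rng = np.random.default_rng() if rng is None else rng
    out = []
    for _ in range(n_new):
        i = rng.integers(len(X_min))
        d = np.linalg.norm(X_min - X_min[i], axis=1)
        nn = np.argsort(d)[1:k + 1]      # neighbours, excluding the point itself
        j = rng.choice(nn)
        t = rng.random()                  # interpolation coefficient in [0, 1)
        out.append(X_min[i] + t * (X_min[j] - X_min[i]))
    return np.array(out)
```

Because interpolation happens in the raw feature space, every synthetic point lies on a segment between two real minority samples.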
Sampling Strategies for Tackling Imbalanced Data in Human Activity Recognition
Human activity recognition (HAR) using wearable sensors is an actively researched topic in machine learning. Smart, sensor-embedded devices that collect detailed data on movement, such as smartphones, fitness trackers, and smart watches, are now widely available. HAR may be applied in areas such as healthcare, physiotherapy, and fitness to assist users of these smart devices in their daily lives. However, one of the main challenges facing HAR, particularly in supervised learning, is how balanced data may be obtained for algorithm optimisation and testing. Because users engage in some activities more than others, e.g. walking more than running, HAR datasets are typically imbalanced. The lack of dataset representation from minority classes therefore hinders the ability of HAR classifiers to sufficiently capture new instances of those activities. Inspired by the concept of data fusion, this thesis introduces three new hybrid sampling methods, each of which enhances the diversity of the synthesised samples by combining the output of two separate sampling methods. The advantage of the hybrid approach is that it provides diverse synthetic data drawn from different sampling approaches, increasing the size of the training data and improving the generalisation of an activity recognition model. The first strategy, the distance-based method (DBM), combines the synthetic minority oversampling technique (SMOTE) with Random_SMOTE, both of which are built around the k-nearest neighbours algorithm. The second technique, the noise detection-based method (NDBM), combines SMOTE with Tomek links (SMOTE_Tomeklinks) and the modified synthetic minority oversampling technique (MSMOTE). The third approach, the cluster-based method (CBM), combines cluster-based synthetic oversampling (CBSO) and the proximity weighted synthetic oversampling technique (ProWSyn).
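The hybrid idea described above — pooling the synthetic output of two constituent samplers to diversify the augmented training set — can be sketched generically. The function name and the even split between samplers are illustrative assumptions, not the thesis's code.

```python
import numpy as np

def hybrid_oversample(X_min, n_new, sampler_a, sampler_b):
    """Illustrative hybrid oversampling: split the requested number of
    synthetic minority samples between two constituent oversamplers
    (e.g. SMOTE and Random_SMOTE) and pool their outputs, so the
    augmented training set gains diversity from both methods."""
    n_a = n_new // 2
    return np.vstack([sampler_a(X_min, n_a),
                      sampler_b(X_min, n_new - n_a)])
```

Any pair of samplers with the signature `sampler(X_min, n) -> array of n synthetic rows` can be plugged in, which is what lets the thesis build three hybrids from six base methods.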
The performance of the proposed hybrid methods is compared with existing methods using accelerometer data from three commonly used benchmark datasets. The results show that the DBM, NDBM, and CBM can significantly reduce the impact of class imbalance and enhance F1 scores of the multilayer perceptron (MLP) by between 9% and 20% compared with their constituent sampling methods. The Friedman statistical significance test was also conducted to compare the effect of the different sampling methods; its results confirm that the CBM is more effective than the other sampling approaches. This thesis also introduces a method based on the Wasserstein generative adversarial network (WGAN) for generating different types of human activity data. The WGAN is more stable to train than a standard generative adversarial network (GAN) because it uses a stable metric, the Wasserstein distance, to compare the real data distribution with the generated data distribution. WGAN is a deep learning approach and, in contrast to the six existing sampling methods referred to previously, it can operate on raw sensor data, as its convolutional and recurrent layers act as feature extractors. WGAN is therefore used to generate raw sensor data, overcoming the limitation of the traditional machine learning-based sampling methods, which can only operate on extracted features. The synthetic data produced by the WGAN is then used to oversample the imbalanced training data. This thesis demonstrates that this approach significantly enhances the learning ability of the convolutional neural network (CNN) on imbalanced human activity datasets by between 5% and 6%. This thesis concludes that the proposed sampling methods based on traditional machine learning are efficient when human activity training data is imbalanced and small.
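The Wasserstein distance that gives the WGAN its training stability has a particularly simple empirical form in one dimension, which the following sketch illustrates (the function name is ours; real WGAN training estimates this distance with a learned critic network rather than computing it directly).

```python
import numpy as np

def wasserstein_1d(a, b):
    """Empirical 1-D Wasserstein-1 distance between two equal-size
    samples: sort both samples and average the pointwise gaps. This is
    the "earth mover's" cost of transporting one sample onto the other."""
    return float(np.mean(np.abs(np.sort(a) - np.sort(b))))
```

Unlike divergences such as the Jensen-Shannon divergence used by a standard GAN, this distance varies smoothly as the two distributions move apart, which is why it gives more informative gradients for the generator.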
These methods are less complex to implement, and they require less human activity training data and fewer computational resources to produce synthetic data than the WGAN approach. The proposed WGAN method is effective at producing raw sensor data when a large quantity of human activity training data is available. However, optimising the hyperparameters of the WGAN architecture is time-consuming, and these hyperparameters significantly impact the performance of the method.