3,176 research outputs found

    Coupling different methods for overcoming the class imbalance problem

    Get PDF
    Many classification problems must deal with imbalanced datasets where one class \u2013 the majority class \u2013 outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357

    Autoencoders and Generative Adversarial Networks for Imbalanced Sequence Classification

    Full text link
    Generative Adversarial Networks (GANs) have been used in many different applications to generate realistic synthetic data. We introduce a novel GAN with Autoencoder (GAN-AE) architecture to generate synthetic samples for variable length, multi-feature sequence datasets. In this model, we develop a GAN architecture with an additional autoencoder component, where recurrent neural networks (RNNs) are used for each component of the model in order to generate synthetic data to improve classification accuracy for a highly imbalanced medical device dataset. In addition to the medical device dataset, we also evaluate the GAN-AE performance on two additional datasets and demonstrate the application of GAN-AE to a sequence-to-sequence task where both synthetic sequence inputs and sequence outputs must be generated. To evaluate the quality of the synthetic data, we train encoder-decoder models both with and without the synthetic data and compare the classification model performance. We show that a model trained with GAN-AE generated synthetic data outperforms models trained with synthetic data generated both with standard oversampling techniques such as SMOTE and Autoencoders as well as with state of the art GAN-based models

    An empirical evaluation of imbalanced data strategies from a practitioner's point of view

    Full text link
    This research tested the following well known strategies to deal with binary imbalanced data on 82 different real life data sets (sampled to imbalance rates of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline (just the base classifier). As base classifiers we used SVM with RBF kernel, random forests, and gradient boosting machines and we measured the quality of the resulting classifier using 6 different metrics (Area under the curve, Accuracy, F-measure, G-mean, Matthew's correlation coefficient and Balanced accuracy). The best strategy strongly depends on the metric used to measure the quality of the classifier. For AUC and accuracy class weight and the baseline perform better; for F-measure and MCC, SMOTE performs better; and for G-mean and balanced accuracy, underbagging
    • …
    corecore