882 research outputs found

Oversampling for Imbalanced Learning Based on K-Means and SMOTE

Author: Bacao Fernando
Douzas Georgios
Last Felix
Publication venue: 'Elsevier BV'
Publication date: 12/12/2017

Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods that generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language. Comment: 19 pages, 8 figures
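
To make the oversampling idea in this abstract concrete, here is a minimal sketch of the interpolation step commonly associated with SMOTE: a synthetic point is placed on the line segment between a minority sample and one of its nearest minority neighbors. It is not the authors' implementation, and it leaves out the k-means clustering stage entirely. The function name `smote_samples`, its parameters, and the toy data are hypothetical and chosen only for illustration.

```python
import random
from math import dist  # Euclidean distance, Python 3.8+


def smote_samples(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority neighbors.

    Illustrative sketch only: plain SMOTE-style interpolation, with no
    clustering step and no handling of categorical features.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # The k nearest minority neighbors of the base point (base excluded).
        neighbors = sorted(
            (p for p in minority if p is not base),
            key=lambda p: dist(base, p),
        )[:k]
        neighbor = rng.choice(neighbors)
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(
            tuple(b + gap * (n - b) for b, n in zip(base, neighbor))
        )
    return synthetic


# Hypothetical toy minority class with two features.
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_samples(minority, n_new=4)
print(new_points)
```

Because each synthetic point is a convex combination of two existing minority samples, it always falls within the region spanned by the minority class, which is one reason interpolation is preferred over simple duplication.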

arXiv.org e-Print Archive

Repositório da Universidade Nova de Lisboa

Coupling different methods for overcoming the class imbalance problem

Author: Fantozzi Carlo
N. Lazzarini
Nanni Loris
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015

Many classification problems must deal with imbalanced datasets where one class – the majority class – outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357

Newcastle University E-Prints

Archivio istituzionale della ricerca - Università di Padova

Machine Learning Shrewd Approach For An Imbalanced Dataset Conversion Samples

Author: Ahmed T.
Ashraf S.
Publication venue: UTeM Press Website
Publication date: 19/06/2020

Imbalanced data refers to datasets in which at least one class is heavily outnumbered by the others. A machine learning classifier trained on an imbalanced dataset predicts the majority (frequently occurring) class more often than the minority (rarely occurring) classes. Training with an imbalanced dataset poses challenges for classifiers; however, applying suitable techniques to reduce class imbalance can enhance the classifier's performance. We take an imbalanced dataset from an educational context. First, the shortcomings of classification on imbalanced datasets are examined. We then apply data-level algorithms for class balancing and compare the performance of classifiers. Classifier performance is measured using information from the confusion matrices, namely accuracy, precision, recall, and F-measure. The results show that classification with an imbalanced dataset may produce high accuracy but low precision and recall for the minority class. The analysis confirms that both undersampling and oversampling are effective for balancing datasets; however, oversampling dominates.
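
The point about high accuracy alongside low minority-class precision and recall can be illustrated with a small worked example. The sketch below derives the four metrics named in the abstract from confusion-matrix counts, treating the minority class as the positive class. The function name `minority_metrics` and the counts are hypothetical and are not taken from the paper.

```python
def minority_metrics(tp, fp, fn, tn):
    """Accuracy, precision, recall and F-measure from confusion-matrix counts.

    The minority class is treated as the positive class. Ratios with an
    empty denominator are reported as 0.0.
    """
    total = tp + fp + fn + tn
    accuracy = (tp + tn) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f_measure = 2 * precision * recall / (precision + recall)
    else:
        f_measure = 0.0
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f_measure": f_measure,
    }


# Hypothetical counts: 100 samples, only 5 of them in the minority class.
# The classifier finds 1 of the 5 and raises 1 false alarm.
scores = minority_metrics(tp=1, fp=1, fn=4, tn=94)
print(scores)
```

With these assumed counts, accuracy is 0.95 while minority recall is only 0.2, which is the pattern the abstract describes.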

Universiti Teknikal Malaysia Melaka: UTeM Open Journal System

Advanced Data Analytics for Systematic Review Creation and Update

Author: Timsina Prem
Publication venue: Beadle Scholar
Publication date: 01/03/2016

Beadle Scholar at Dakota State University

An Empirical Study on the Joint Impact of Feature Selection and Data Re-sampling on Imbalance Classification

Author: García López Salvador
Zhang Chongsheng
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 13/09/2021

In predictive tasks, real-world datasets often present different degrees of imbalanced (i.e., long-tailed or skewed) distributions. While the majority (the head or the most frequent) classes have sufficient samples, the minority (the tail or the less frequent or rare) classes can be under-represented by a rather limited number of samples. Data pre-processing has been shown to be very effective in dealing with such problems. On one hand, data re-sampling is a common approach to tackling class imbalance. On the other hand, dimension reduction, which reduces the feature space, is a conventional technique for reducing noise and inconsistencies in a dataset. However, the possible synergy between feature selection and data re-sampling for high-performance imbalance classification has rarely been investigated before. To address this issue, we carry out a comprehensive empirical study on the joint influence of feature selection and re-sampling on two-class imbalance classification. Specifically, we study the performance of two opposite pipelines for imbalance classification by applying feature selection before or after data re-sampling. We conduct a large number of experiments, with a total of 9225 tests, on 52 publicly available datasets, using 9 feature selection methods, 6 resampling approaches for class imbalance learning, and 3 well-known classification algorithms. Experimental results show that there is no constant winner between the two pipelines; thus both of them should be considered to derive the best performing model for imbalance classification. We find that the performance of an imbalance classification model not only depends on the classifier adopted and the ratio between the number of majority and minority samples, but also depends on the ratio between the number of samples and features. Overall, this study should provide new reference value for researchers and practitioners in imbalance learning. TIN2017-89517-

Repositorio Institucional Universidad de Granada

An empirical evaluation of imbalanced data strategies from a practitioner's point of view

Author: Franceschinell Rodrigo A.
Wainer Jacques
Publication date: 16/10/2018

This research tested the following well-known strategies to deal with binary imbalanced data on 82 different real-life data sets (sampled to imbalance rates of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline (just the base classifier). As base classifiers we used SVM with RBF kernel, random forests, and gradient boosting machines, and we measured the quality of the resulting classifier using 6 different metrics (area under the curve, accuracy, F-measure, G-mean, Matthews correlation coefficient, and balanced accuracy). The best strategy strongly depends on the metric used to measure the quality of the classifier. For AUC and accuracy, class weight and the baseline perform better; for F-measure and MCC, SMOTE performs better; and for G-mean and balanced accuracy, underbagging
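
Three of the metrics listed in this abstract (G-mean, balanced accuracy, and the Matthews correlation coefficient) are less familiar than accuracy, so a short sketch of their standard binary definitions may help. It computes them from confusion-matrix counts using only the standard library. The function name `imbalance_metrics` and the example counts are hypothetical and are not results from the study.

```python
from math import sqrt


def imbalance_metrics(tp, fp, fn, tn):
    """G-mean, balanced accuracy and MCC from binary confusion-matrix counts.

    Uses the usual textbook definitions. Ratios with an empty denominator
    are reported as 0.0.
    """
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0  # recall on positives
    specificity = tn / (tn + fp) if (tn + fp) else 0.0  # recall on negatives
    balanced_accuracy = (sensitivity + specificity) / 2
    g_mean = sqrt(sensitivity * specificity)
    denominator = sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    mcc = (tp * tn - fp * fn) / denominator if denominator else 0.0
    return {
        "balanced_accuracy": balanced_accuracy,
        "g_mean": g_mean,
        "mcc": mcc,
    }


# Hypothetical counts: 10 positive and 90 negative samples.
result = imbalance_metrics(tp=8, fp=10, fn=2, tn=80)
print(result)
```

G-mean and balanced accuracy weight the two classes equally regardless of their sizes, whereas MCC also penalizes false alarms relative to the number of predicted positives, which is consistent with the abstract's observation that different strategies win under different metrics.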

arXiv.org e-Print Archive