
    Oversampling for Imbalanced Learning Based on K-Means and SMOTE

    Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language. (Comment: 19 pages, 8 figures)
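
    A minimal sketch of how such a method can be exercised, using the KMeansSMOTE class from the third-party imbalanced-learn package rather than the paper's own implementation; the dataset and every parameter value below are illustrative assumptions, not the paper's experimental setup:

    from collections import Counter

    from sklearn.cluster import KMeans
    from sklearn.datasets import make_classification
    from imblearn.over_sampling import KMeansSMOTE

    # Illustrative imbalanced dataset: roughly 5% minority class.
    X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)
    print("before:", Counter(y))

    # k-means SMOTE: cluster the input space, then apply SMOTE only inside
    # clusters with a sufficient share of minority samples, so synthetic
    # points are not generated in noisy, majority-dominated regions.
    sampler = KMeansSMOTE(
        kmeans_estimator=KMeans(n_clusters=10, random_state=0),
        cluster_balance_threshold=0.1,  # minority share required to oversample a cluster
        random_state=0,
    )
    X_res, y_res = sampler.fit_resample(X, y)
    print("after: ", Counter(y_res))  # class counts now balanced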

    An empirical evaluation of imbalanced data strategies from a practitioner's point of view

    This research tested the following well-known strategies for dealing with binary imbalanced data on 82 different real-life data sets (sampled to imbalance rates of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, underbagging, and a baseline (just the base classifier). As base classifiers we used SVM with RBF kernel, random forests, and gradient boosting machines, and we measured the quality of the resulting classifiers using 6 different metrics (area under the ROC curve, accuracy, F-measure, G-mean, Matthews correlation coefficient, and balanced accuracy). The best strategy strongly depends on the metric used to measure the quality of the classifier: for AUC and accuracy, class weight and the baseline perform better; for F-measure and MCC, SMOTE performs better; and for G-mean and balanced accuracy, underbagging performs better.
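
    A condensed sketch of the four compared strategies on a single synthetic dataset, using scikit-learn and imbalanced-learn as stand-ins; BalancedBaggingClassifier plays the role of underbagging here, and the dataset, base classifier, and parameter choices are illustrative, not the study's setup:

    from sklearn.datasets import make_classification
    from sklearn.ensemble import RandomForestClassifier
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import f1_score, matthews_corrcoef, balanced_accuracy_score
    from imblearn.pipeline import make_pipeline
    from imblearn.over_sampling import SMOTE
    from imblearn.ensemble import BalancedBaggingClassifier  # underbagging

    # Illustrative dataset at a 1% imbalance rate.
    X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

    strategies = {
        "baseline": RandomForestClassifier(random_state=0),
        "class_weight": RandomForestClassifier(class_weight="balanced", random_state=0),
        "smote": make_pipeline(SMOTE(random_state=0),
                               RandomForestClassifier(random_state=0)),
        "underbagging": BalancedBaggingClassifier(
            estimator=RandomForestClassifier(random_state=0), random_state=0),
    }
    for name, model in strategies.items():
        pred = model.fit(X_tr, y_tr).predict(X_te)
        print(f"{name:>12}  F1={f1_score(y_te, pred):.3f}  "
              f"MCC={matthews_corrcoef(y_te, pred):.3f}  "
              f"BAcc={balanced_accuracy_score(y_te, pred):.3f}")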

    Comparing the performance of oversampling techniques in combination with a clustering algorithm for imbalanced learning

    Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence. Imbalanced datasets remain an ongoing challenge in supervised learning, since standard algorithms are designed for balanced class distributions and perform poorly on imbalanced problems. Many methods have been developed to address this specific problem, but the more general approach to achieving a balanced class distribution is data-level modification rather than algorithm modification. Although class imbalance is responsible for significant performance losses in standard classifiers across many types of problems, another important aspect to consider is the small disjuncts problem. It is therefore important to understand solutions that account not only for the between-class imbalance (the imbalance between the two classes) but also for the within-class imbalance (the imbalance between the sub-clusters of each class), and to oversample the dataset by rectifying both types of imbalance simultaneously. Cluster-based oversampling has been shown to be a robust solution that takes both problems into consideration. This work sets out to study the effect and impact of combining different existing oversampling methods with a clustering-based approach. Empirical results of extensive experiments show that combining different oversampling techniques with the k-means clustering algorithm (K-Means Oversampling) improves upon the classification results obtained when the oversampling techniques are applied without a prior clustering step.
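
    A hedged sketch of the general "cluster, then oversample" recipe the dissertation evaluates; the helper name, cluster count, and minority-share threshold are hypothetical, and RandomOverSampler from imbalanced-learn stands in for any of the oversamplers that could be slotted in:

    import numpy as np
    from sklearn.cluster import KMeans
    from imblearn.over_sampling import RandomOverSampler  # stand-in oversampler

    def kmeans_then_oversample(X, y, oversampler, n_clusters=8, min_minority=0.2):
        """Oversample only inside clusters with enough minority examples,
        addressing within-class as well as between-class imbalance."""
        labels = KMeans(n_clusters=n_clusters, random_state=0).fit_predict(X)
        minority = np.argmin(np.bincount(y))
        parts_X, parts_y = [], []
        for c in range(n_clusters):
            Xc, yc = X[labels == c], y[labels == c]
            frac = np.mean(yc == minority)
            # Both classes present and enough minority mass: balance this cluster.
            if 0 < frac < 1 and frac >= min_minority:
                Xc, yc = oversampler.fit_resample(Xc, yc)
            parts_X.append(Xc)
            parts_y.append(yc)
        return np.vstack(parts_X), np.concatenate(parts_y)

    # Usage (illustrative):
    # X_bal, y_bal = kmeans_then_oversample(X, y, RandomOverSampler(random_state=0))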

    Oversampling for imbalanced learning based on k-means and SMOTE

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to its simplicity in the design of the procedure, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data, and it is featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R, and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
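
    At the heart of the surveyed technique is a single interpolation step: a synthetic minority point is drawn on the line segment between a minority sample and one of its k nearest minority-class neighbours. A minimal sketch of that step only, not a full SMOTE implementation; the helper name and defaults are illustrative:

    import numpy as np
    from sklearn.neighbors import NearestNeighbors

    def smote_sketch(X_min, n_synthetic, k=5, seed=0):
        """Generate synthetic points from the minority-class samples X_min."""
        rng = np.random.default_rng(seed)
        # k + 1 neighbours because each point's nearest neighbour is itself.
        nn = NearestNeighbors(n_neighbors=k + 1).fit(X_min)
        _, idx = nn.kneighbors(X_min)
        synthetic = []
        for _ in range(n_synthetic):
            i = rng.integers(len(X_min))   # pick a minority sample
            j = rng.choice(idx[i][1:])     # pick one of its k nearest neighbours
            gap = rng.random()             # interpolation factor in (0, 1)
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
        return np.asarray(synthetic)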