
    An empirical evaluation of imbalanced data strategies from a practitioner's point of view

    This research tested the following well-known strategies for dealing with binary imbalanced data on 82 different real-life data sets (sampled to imbalance rates of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, underbagging, and a baseline (just the base classifier). As base classifiers we used SVM with an RBF kernel, random forests, and gradient boosting machines, and we measured the quality of the resulting classifier using six different metrics (area under the curve, accuracy, F-measure, G-mean, Matthews correlation coefficient, and balanced accuracy). The best strategy strongly depends on the metric used to measure the quality of the classifier: for AUC and accuracy, class weight and the baseline perform better; for F-measure and MCC, SMOTE performs better; and for G-mean and balanced accuracy, underbagging performs better.
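    A minimal sketch of such a comparison, assuming scikit-learn and imbalanced-learn are available; BalancedBaggingClassifier stands in for underbagging, and the study's 82 datasets, tuned models, and full protocol are not reproduced here:

```python
# Hedged sketch: compare baseline, class weight, SMOTE, and an
# underbagging-style ensemble on one synthetic imbalanced dataset.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (roc_auc_score, f1_score,
                             balanced_accuracy_score, matthews_corrcoef)
from imblearn.over_sampling import SMOTE
from imblearn.ensemble import BalancedBaggingClassifier  # underbagging-style
from imblearn.metrics import geometric_mean_score

# ~3% minority class, mirroring one of the sampled imbalance rates
X, y = make_classification(n_samples=5000, weights=[0.97], flip_y=0,
                           random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

strategies = {
    "baseline": RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
    "class_weight": RandomForestClassifier(class_weight="balanced",
                                           random_state=0).fit(X_tr, y_tr),
    "smote": RandomForestClassifier(random_state=0).fit(
        *SMOTE(random_state=0).fit_resample(X_tr, y_tr)),
    "underbagging": BalancedBaggingClassifier(random_state=0).fit(X_tr, y_tr),
}
for name, model in strategies.items():
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{name}: AUC={roc_auc_score(y_te, proba):.3f} "
          f"F1={f1_score(y_te, pred):.3f} "
          f"G-mean={geometric_mean_score(y_te, pred):.3f} "
          f"MCC={matthews_corrcoef(y_te, pred):.3f} "
          f"BAcc={balanced_accuracy_score(y_te, pred):.3f}")
```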

    EPRENNID: An evolutionary prototype reduction based ensemble for nearest neighbor classification of imbalanced data

    Classification problems with an imbalanced class distribution have received an increased amount of attention within the machine learning community over the last decade. They are encountered in a growing number of real-world situations and pose a challenge to standard machine learning techniques. We propose a new hybrid method specifically tailored to handle class imbalance, called EPRENNID. It performs an evolutionary prototype reduction focused on providing diverse solutions to prevent the method from overfitting the training set. It also allows us to explicitly reduce the underrepresented class, which the most common preprocessing solutions for class imbalance usually protect. As part of the experimental study, we show that the proposed prototype reduction method outperforms state-of-the-art preprocessing techniques. The preprocessing step yields multiple prototype sets that are later used in an ensemble, performing a weighted voting scheme with the nearest neighbor classifier. EPRENNID is experimentally shown to significantly outperform previous proposals.
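    The paper's code is not shown here, so the following sketch conveys only the ensemble idea: random prototype subsets (an assumption, standing in for the paper's evolutionary search) feed nearest neighbor classifiers whose votes are weighted. Labels are assumed to be integer-coded 0..n_classes-1:

```python
# Illustrative prototype-reduction ensemble with weighted 1-NN voting,
# in the spirit of (but not identical to) EPRENNID.
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)

def prototype_ensemble(X, y, n_sets=10, frac=0.3):
    members = []
    for _ in range(n_sets):
        # random subset as a stand-in for an evolved prototype set
        idx = rng.choice(len(X), size=int(frac * len(X)), replace=False)
        knn = KNeighborsClassifier(n_neighbors=1).fit(X[idx], y[idx])
        weight = knn.score(X, y)  # weight each member by training accuracy
        members.append((knn, weight))
    return members

def predict(members, X, n_classes=2):
    votes = np.zeros((len(X), n_classes))
    for knn, w in members:
        votes[np.arange(len(X)), knn.predict(X)] += w  # weighted voting
    return votes.argmax(axis=1)
```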

    Ensemble learning for software fault prediction problem with imbalanced data

    The fault prediction problem plays a crucial role in the software development process because it contributes to reducing defects and assists the testing process towards fault-free software components. Considerable effort has therefore been devoted to this type of issue, in which static code characteristics are usually adopted to construct fault classification models. One of the challenging problems influencing the performance of predictive classifiers is the high imbalance among patterns belonging to different classes. This paper aims to integrate sampling techniques with common classification techniques to form a useful ensemble model for the software defect prediction problem. The empirical results on benchmark datasets of software projects show the promising performance of our proposal in comparison with individual classifiers.
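    A hedged sketch of the general recipe (a sampling step combined with an ensemble of common classifiers), assuming imbalanced-learn; the paper's specific ensemble design and benchmark datasets are not reproduced:

```python
# Sampling + ensemble pipeline sketch for imbalanced defect data.
from imblearn.pipeline import Pipeline
from imblearn.over_sampling import SMOTE
from sklearn.ensemble import VotingClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

ensemble = VotingClassifier(
    estimators=[("rf", RandomForestClassifier(random_state=0)),
                ("lr", LogisticRegression(max_iter=1000)),
                ("nb", GaussianNB())],
    voting="soft",  # average predicted probabilities
)
# The imblearn Pipeline applies SMOTE only during fit, so the
# test data is never resampled.
model = Pipeline([("smote", SMOTE(random_state=0)),
                  ("clf", ensemble)])
# usage: model.fit(X_train, y_train); model.predict(X_test)
```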

    Generative Adversarial Networks Selection Approach for Extremely Imbalanced Fault Diagnosis of Reciprocating Machinery

    At present, countless approaches to fault diagnosis in reciprocating machines have been proposed, all assuming that the available machinery dataset contains equal proportions of all conditions. However, as applications move closer to reality, the problem of data imbalance becomes increasingly evident. In this paper, we propose a method for creating diagnostic models that considers an extreme imbalance in the available data. Our approach first processes the vibration signals of the machine using a wavelet packet transform-based feature-extraction stage. Then, improved generative models are obtained with a dissimilarity-based model selection to artificially balance the dataset. Finally, a Random Forest classifier is created to address the diagnostic task. This methodology provides a considerable improvement under 99% data imbalance over other approaches reported in the literature, showing performance similar to that obtained with a balanced set of data. Funding: National Natural Science Foundation of China, under Grants 51605406 and 7180104.
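    The feature-extraction and classification stages might look roughly like the sketch below, assuming PyWavelets (pywt) and scikit-learn; the GAN-based balancing and dissimilarity-based model selection, which are the paper's core contributions, are omitted here, and the signals are synthetic placeholders:

```python
# Wavelet packet energy features feeding a Random Forest classifier.
import numpy as np
import pywt
from sklearn.ensemble import RandomForestClassifier

def wpt_energy_features(signal, wavelet="db4", level=3):
    wp = pywt.WaveletPacket(data=signal, wavelet=wavelet, maxlevel=level)
    # one energy value per terminal node of the wavelet packet tree
    return np.array([np.sum(node.data ** 2)
                     for node in wp.get_level(level, "natural")])

rng = np.random.default_rng(0)
signals = rng.standard_normal((100, 1024))  # stand-in vibration signals
labels = rng.integers(0, 2, size=100)       # stand-in machine conditions
X = np.vstack([wpt_energy_features(s) for s in signals])
clf = RandomForestClassifier(random_state=0).fit(X, labels)
```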

    Minority Class Oversampling for Tabular Data with Deep Generative Models

    In practice, machine learning experts are often confronted with imbalanced data. Without accounting for the imbalance, common classifiers perform poorly, and standard evaluation metrics mislead practitioners about a model's performance. A common way to treat imbalanced datasets is under- and oversampling, in which samples are either removed from the majority class or synthetic samples are added to the minority class. In this paper, we follow up on recent developments in deep learning. We take proposals of deep generative models, including our own, and study the ability of these approaches to provide realistic samples that improve performance on imbalanced classification tasks via oversampling. Across 160K+ experiments, we show that all of the new methods tend to perform better than simple baseline methods such as SMOTE, but require different under- and oversampling ratios to do so. Our experiments show that the choice of sampling method does not affect sample quality, but runtime varies widely. We also observe that the improvements in the performance metrics, while shown to be significant when ranking the methods, are often minor in absolute terms, especially compared to the required effort. Furthermore, we notice that a large part of the improvement is due to undersampling, not oversampling. We make our code and testing framework available.
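    As a minimal illustration of generative oversampling, the sketch below fits a Gaussian mixture to the minority class and samples from it; the mixture is a stand-in (an assumption) for the deep generative models the paper actually studies:

```python
# Generative oversampling sketch: fit a generative model on the
# minority class, then draw synthetic samples until classes balance.
import numpy as np
from sklearn.mixture import GaussianMixture

def generative_oversample(X, y, minority=1, random_state=0):
    X_min = X[y == minority]
    n_extra = (y != minority).sum() - len(X_min)  # samples needed to balance
    gm = GaussianMixture(n_components=min(5, len(X_min)),
                         random_state=random_state).fit(X_min)
    X_new, _ = gm.sample(n_extra)                 # synthetic minority samples
    return (np.vstack([X, X_new]),
            np.concatenate([y, np.full(n_extra, minority)]))
```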

    Evolutionary undersampling for extremely imbalanced big data classification under Apache Spark

    The classification of datasets with a skewed class distribution is an important problem in data mining. Evolutionary undersampling of the majority class has proved to be a successful approach to this issue. Such a challenging task may become even more difficult when the number of majority class examples is very large. In this scenario, the use of the evolutionary model becomes impractical due to memory and time constraints. Divide-and-conquer approaches based on the MapReduce paradigm have already been proposed to handle this type of problem by dividing data into multiple subsets. However, in extremely imbalanced cases, these models may suffer from a lack of density of the minority class in the subsets considered. To address this problem, in this contribution we provide a new big data scheme based on the emerging technology Apache Spark to tackle highly imbalanced datasets. We take advantage of its in-memory operations to diminish the effect of the small sample size. The key point of this proposal lies in the independent management of majority and minority class examples, allowing us to keep a higher number of minority class examples in each subset. In our experiments, we analyze the proposed model with several datasets of up to 17 million instances. The results demonstrate the effectiveness of this evolutionary undersampling model for extremely imbalanced big data classification.
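    The key partitioning idea can be sketched in plain NumPy, under the assumptions that simple random undersampling replaces the evolutionary search and that Spark's distributed machinery is omitted: the majority class is split into chunks, while every chunk is paired with all minority examples so that minority density is preserved in each subset.

```python
# Sketch of imbalance-aware data partitioning: split the majority
# class, replicate the full minority class into every subset.
import numpy as np

def imbalance_aware_splits(X, y, n_chunks=4, minority=1, seed=0):
    rng = np.random.default_rng(seed)
    maj_idx = rng.permutation(np.where(y != minority)[0])
    min_idx = np.where(y == minority)[0]
    for chunk in np.array_split(maj_idx, n_chunks):
        # random undersampling of the chunk stands in for the
        # evolutionary undersampling step of the paper
        keep = rng.choice(chunk, size=min(len(chunk), 10 * len(min_idx)),
                          replace=False)
        idx = np.concatenate([keep, min_idx])  # ALL minority examples kept
        yield X[idx], y[idx]
```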