2,914 research outputs found

Exploring the performance of resampling strategies for the class imbalance problem

Author: D. Wilson
G. Batista
G. Cohen
H. Han
H. He
I. Tomek
I. Witten
M. Kubat
N. Chawla
N. Japkowicz
R. Barandela
S. Daskalaki
S. García
S. Tan
T. Fawcett
V. García
Y.M. Huang
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2010

The present paper studies the influence of two distinct factors on the performance of some resampling strategies for handling imbalanced data sets. In particular, we focus on the nature of the classifier used, along with the ratio between minority and majority classes. Experiments using eight different classifiers show that the most significant differences are for data sets with low or moderate imbalance: over-sampling clearly appears better than under-sampling for local classifiers, whereas some under-sampling strategies outperform over-sampling when classifiers with global learning are employed.
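
To make the over-sampling versus under-sampling contrast concrete, here is a minimal sketch in Python. The abstract does not say which resampling strategies were tested, so plain random over-sampling and random under-sampling are assumed here as the simplest representatives; the function names and the toy data are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples of each smaller class until
    every class is as large as the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Hypothetical toy data: 8 majority samples (label 0), 2 minority (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over))   # both classes now have 8 samples
print(Counter(y_under))  # both classes now have 2 samples
```

Over-sampling keeps every original sample at the cost of duplicates, while under-sampling discards majority data; which trade-off pays off is what the paper reports as classifier-dependent.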

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositori Institucional de la Universitat Jaume I

An In-Depth Comparative Analysis of Machine Learning Techniques for Addressing Class Imbalance in Mental Health Prediction.

Author: Bokaba Tebogo
Mokheleli Tsholofelo
Museba Tinofirei
Publication venue: AIS Electronic Library (AISeL)
Publication date: 02/12/2023

The application of machine learning (ML) in predicting mental healthcare faces a challenge due to imbalanced datasets. ML techniques analyse extensive datasets to make predictions; however, the unequal distribution of samples, with the majority belonging to diagnosed mental disorders, can lead to biased model training and limited generalisation. To mitigate the issue of class imbalance in mental health datasets, this study employed diverse ML techniques, namely, resampling, ensemble, and algorithm-specific approaches and metrics such as accuracy, precision, recall and F1 score. The dataset used was collected from the Open Sourcing Mental Illness website, spanning 2016 to 2021. The findings indicate that ensemble techniques, particularly Random Forest, excelled in managing class imbalance compared to other methods. Beyond conventional performance metrics, the study introduced Kappa, balanced accuracy, and geometric mean to evaluate model effectiveness. These findings provide valuable insights for improving mental health predictions, enabling early diagnosis and personalised treatment strategies
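
As an illustration of why a chance-corrected metric such as Kappa is reported alongside accuracy, the sketch below computes Cohen's kappa from a binary confusion matrix. The counts are hypothetical and are not taken from the study; the function name is an assumption.

```python
def cohen_kappa(tp, fp, fn, tn):
    """Cohen's kappa for a binary confusion matrix:
    (observed agreement - chance agreement) / (1 - chance agreement)."""
    n = tp + fp + fn + tn
    observed = (tp + tn) / n
    # Chance agreement from the marginal totals of predictions and labels.
    expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / (n * n)
    if expected == 1.0:
        return 0.0  # degenerate case: only one class present and predicted
    return (observed - expected) / (1.0 - expected)


# Hypothetical case: 90 majority and 10 minority samples, and a model that
# always predicts the majority class (minority treated as "positive").
tp, fp, fn, tn = 0, 0, 10, 90
accuracy = (tp + tn) / (tp + fp + fn + tn)
kappa = cohen_kappa(tp, fp, fn, tn)
print(accuracy, kappa)
```

In this toy case accuracy is 0.9 while kappa is 0, because the agreement is no better than what the class proportions alone would produce.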

AIS Electronic Library (AISeL)

Gait-based Gender Classification Considering Resampling and Feature Selection

Author: García Jiménez Vicente
Martín Félez Raúl
Sánchez Garreta Josep Salvador
Publication venue: 'Engineering and Technology Publishing'
Publication date: 01/01/2013

Two intrinsic data characteristics that arise in many domains are class imbalance and high dimensionality, which pose new challenges that should be addressed. When using gait for gender classification, benchmarking public databases and renowned gait representations lead to these two problems, but they have not been jointly studied in depth. This paper is a preliminary study that aims to investigate the benefits of using several techniques to tackle the aforementioned problems either singly or in combination, and also to evaluate the order of application that leads to the best classification performance. Experimental results show the importance of jointly managing both problems for gait-based gender classification. In particular, it seems that the best strategy consists of applying resampling followed by feature selection.

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Crossref

Repositori Institucional de la Universitat Jaume I

An empirical evaluation of imbalanced data strategies from a practitioner's point of view

Author: Franceschinell Rodrigo A.
Wainer Jacques
Publication date: 16/10/2018

This research tested the following well-known strategies to deal with binary imbalanced data on 82 different real-life data sets (sampled to imbalance rates of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, Underbagging, and a baseline (just the base classifier). As base classifiers we used SVM with RBF kernel, random forests, and gradient boosting machines, and we measured the quality of the resulting classifier using 6 different metrics (Area under the curve, Accuracy, F-measure, G-mean, Matthews correlation coefficient and Balanced accuracy). The best strategy strongly depends on the metric used to measure the quality of the classifier. For AUC and accuracy, class weight and the baseline perform better; for F-measure and MCC, SMOTE performs better; and for G-mean and balanced accuracy, underbagging performs better.
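
Since the ranking of strategies depends on the metric, it can help to see how three of the listed metrics are derived from one confusion matrix. The following is a minimal sketch using the standard definitions of balanced accuracy, G-mean and the Matthews correlation coefficient; the counts are hypothetical and do not come from the paper.

```python
import math


def imbalance_metrics(tp, fp, fn, tn):
    """Return accuracy, balanced accuracy, G-mean and MCC
    for a binary confusion matrix."""
    tpr = tp / (tp + fn)  # recall on the positive (minority) class
    tnr = tn / (tn + fp)  # recall on the negative (majority) class
    denom = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "balanced_accuracy": (tpr + tnr) / 2,
        "g_mean": math.sqrt(tpr * tnr),
        # MCC is conventionally set to 0 when a marginal total is zero.
        "mcc": (tp * tn - fp * fn) / denom if denom else 0.0,
    }


# Hypothetical counts at a 1% imbalance rate: 10 positives, 990 negatives.
metrics = imbalance_metrics(tp=5, fp=10, fn=5, tn=980)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

With these counts accuracy is high because the majority class dominates it, whereas balanced accuracy, G-mean and MCC are pulled down by the missed minority samples, which is one way the metrics can disagree about which strategy is best.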

arXiv.org e-Print Archive

Fraud Guard: A Comprehensive Comparative Analysis of Machine Learning Approaches to Enhance Credit Card Fraud Detection

Author: Abdulateef Omar Gheni
Publication venue: The International Institute for Science, Technology and Education (IISTE)
Publication date: 05/04/2024

The COVID-19 pandemic has constrained people's mobility, prompting a surge in reliance on online services due to challenges in offline purchasing. Machine learning (ML) methods have played a crucial role in advancing classification and prediction techniques across various domains. In the realm of Credit Card Fraud Detection, the significance of ML is particularly pronounced. These methods harness the power of data-driven algorithms to distinguish between legitimate and fraudulent transactions, contributing significantly to the enhancement of security measures in financial transactions. The dynamic and adaptive nature of ML allows for the continuous evolution of fraud detection systems, ensuring a proactive approach to safeguarding against emerging threats in the credit card landscape. With this shift, credit card fraud has become a significant concern within the domain of internet-based transactions. Hence, there is a pressing demand to devise an optimal machine learning method for preventing fraudulent credit card transactions. The study employed four resampling techniques (CNN, AllKNN, SMOTE, and SVMSM) and three machine learning approaches (XGBoost, CatBoost, and RF) for analysing credit card fraud datasets with the aim of detection. The findings demonstrated that combining AllKNN as an undersampling technique with CatBoost as a classifier achieved superior results across the evaluated methods. The accuracy, precision, recall, and F1-score were 99.9%, 95.9%, 80%, and 87.4%, respectively. Keywords: Unbalanced data, machine learning techniques, fraud detection, and credit card fraud. DOI: 10.7176/JIEA/14-2-02 Publication date: March 31st 2024
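
The reported F1-score can be related to the reported precision and recall through the usual harmonic-mean definition. The sketch below is an illustrative recomputation from the rounded figures in the abstract, not the study's own evaluation code; the function name is an assumption.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Rounded values quoted in the abstract: precision 95.9%, recall 80%.
f1 = f1_score(0.959, 0.80)
print(f"{f1:.3f}")  # 0.872
```

The recomputed value of about 87.2% is close to the reported 87.4%. The small gap may stem from rounding of the published precision and recall or from averaging across runs; the abstract does not say.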

International Institute for Science, Technology and Education (IISTE): E-Journals

Coupling different methods for overcoming the class imbalance problem

Author: Fantozzi Carlo
N. Lazzarini
Nanni Loris
Publication venue: 'Elsevier BV'
Publication date: 01/01/2015

Many classification problems must deal with imbalanced datasets where one class (the majority class) outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical. Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches. To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature. Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357

Crossref

Newcastle University E-Prints

Archivio istituzionale della ricerca - Università di Padova

IMPROVING CREDIT CARD FRAUD DETECTION USING TRANSFER LEARNING AND DATA RESAMPLING TECHNIQUES

Author: Vinarta Charmaine Eunice Mena
Publication venue: CSUSB ScholarWorks
Publication date: 01/12/2023

This Culminating Experience Project explores the use of machine learning algorithms to detect credit card fraud. The research questions are: Q1. What cross-domain techniques developed in other domains can be effectively adapted and applied to mitigate or eliminate credit card fraud, and how do these techniques compare in terms of fraud detection accuracy and efficiency? Q2. To what extent do synthetic data generation methods effectively mitigate the challenges posed by imbalanced datasets in credit card fraud detection, and how do these methods impact classification performance? Q3. To what extent can the combination of transfer learning and innovative data resampling techniques improve the accuracy and efficiency of credit card fraud detection systems when dealing with imbalanced datasets, and what novel strategies can be developed to address this common challenge? The main findings are: Q1. Unconventional cross-domain methods improved fraud detection, holding promise for enhanced security. Q2. The problems caused by unbalanced datasets in credit card fraud detection were effectively addressed by the synthetic data generation techniques SMOTE and ADASYN, resulting in a more balanced dataset suitable for fraud classification. Q3. The combination of neural networks and data resampling techniques, such as SMOTE and ADASYN, significantly improved credit card fraud detection accuracy. The main conclusions are: Q1. Cross-domain methods are useful for credit card fraud detection, especially when it comes to online transactions. Q2. When used with various classifiers, neural networks show remarkable accuracy rates: 97% for unbalanced data, 99.47% for SMOTE, and 99.11% for ADASYN. Q3. A fraud recall of 0.99 is obtained by the model evaluation on imbalanced data, with 12,155 correct predictions out of 12,336 and 181 incorrect ones.
The identified areas for further study encompass the testing of our model on larger datasets and the optimization of hyperparameters for further enhancement.

CSUSB ScholarWorks

HoloDetect: Few-Shot Learning for Error Detection

Author: Bengio Yoshua
Elmagarmid Ahmed K.
Globerson Amir
Goodfellow Ian
Guo Chuan
Hinton G. E.
Rahm Erhard
Ratcliff John W.
Zhang Yu
Zhu Xiaojin
Publication venue: 'Association for Computing Machinery (ACM)'
Publication date: 03/04/2019

We introduce a few-shot learning framework for error detection. We show that data augmentation (a form of weak supervision) is key to training high-quality, ML-based error detection models that require minimal human involvement. Our framework consists of two parts: (1) an expressive model to learn rich representations that capture the inherent syntactic and semantic heterogeneity of errors; and (2) a data augmentation model that, given a small seed of clean records, uses dataset-specific transformations to automatically generate additional training data. Our key insight is to learn data augmentation policies from the noisy input dataset in a weakly supervised manner. We show that our framework detects errors with an average precision of ~94% and an average recall of ~93% across a diverse array of datasets that exhibit different types and amounts of errors. We compare our approach to a comprehensive collection of error detection methods, ranging from traditional rule-based methods to ensemble-based and active learning approaches. We show that data augmentation yields an average improvement of 20 F1 points while it requires access to 3x fewer labeled examples compared to other ML approaches. Comment: 18 pages

arXiv.org e-Print Archive

Crossref

Exploring synergetic effects of dimensionality reduction and resampling tools on hyperspectral imagery data classification

Author: A. Martínez-Usó
D.A. Landgrebe
D.P. Williams
F. Melgani
H. He
I.T. Jolliffe
J.A. Richards
J.C. Platt
J.R. Quinlan
L. Breiman
L. Bruzzone
L.O. Jiménez
M. Hall
M. Kubat
M. Trebar
M. Wasikowski
N. Japkowicz
N.V. Chawla
P.H. Hsu
R. Blagus
S. García
T. Fawcett
V. Kecman
V.N. Vapnik
X. Chen
Z.H. Zhou
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2011

The present paper addresses the problem of the classification of hyperspectral images with multiple imbalanced classes and very high dimensionality. Class imbalance is handled by resampling the data set, whereas PCA and a supervised filter are applied to reduce the number of spectral bands. This is a preliminary study that aims to investigate the benefits of combining several techniques to tackle the imbalance and the high dimensionality problems, and also to evaluate the order of application that leads to the best classification performance. Experimental results demonstrate the significance of using these two preprocessing tools together to improve the performance of hyperspectral imagery classification. Although it seems that the most effective order corresponds to first a resampling strategy and then a feature selection (or extraction) algorithm, this is a question that still needs a much more thorough investigation in the future. This work has partially been supported by the Spanish Ministry of Education and Science under grants CSD2007–00018, AYA2008–05965–0596 and TIN2009–14205, the Fundació Caixa Castelló–Bancaixa under grant P1–1B2009–04, and the Generalitat Valenciana under grant PROMETEO/2010/02

Crossref

LAReferencia - Red Federada de Repositorios Institucionales de Publicaciones Científicas Latinoamericanas

Repositori Institucional de la Universitat Jaume I