Exploring the performance of resampling strategies for the class imbalance problem
The present paper studies the influence of two distinct factors on the performance of some resampling strategies for handling imbalanced data sets. In particular, we focus on the nature of the classifier used, along with the ratio between minority and majority classes. Experiments using eight different classifiers show that the most significant differences arise for data sets with low or moderate imbalance: over-sampling is clearly better than under-sampling for local classifiers, whereas some under-sampling strategies outperform over-sampling when employing classifiers with global learning.
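The two resampling families compared in this abstract can be illustrated with a minimal sketch. It assumes plain random over-sampling and random under-sampling; the abstract does not say which specific strategies the paper evaluates, and the function names and toy data below are hypothetical.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, n in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - n):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def random_undersample(X, y, seed=0):
    """Keep a random subset of each class, sized to match the smallest class."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = min(counts.values())
    X_out, y_out = [], []
    for label in counts:
        pool = [x for x, lab in zip(X, y) if lab == label]
        for x in rng.sample(pool, target):
            X_out.append(x)
            y_out.append(label)
    return X_out, y_out


# Toy data (hypothetical): 8 majority samples (label 0), 2 minority samples (label 1).
X = [[float(i)] for i in range(10)]
y = [0] * 8 + [1] * 2

X_over, y_over = random_oversample(X, y)
X_under, y_under = random_undersample(X, y)
print(Counter(y_over), Counter(y_under))
```

Over-sampling keeps every original sample and grows the training set, while under-sampling discards majority samples and shrinks it, which is one reason the two can interact differently with local and global learners.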
An In-Depth Comparative Analysis of Machine Learning Techniques for Addressing Class Imbalance in Mental Health Prediction.
The application of machine learning (ML) to mental health prediction is hampered by imbalanced datasets. ML techniques analyse extensive datasets to make predictions; however, the unequal distribution of samples, with the majority belonging to diagnosed mental disorders, can lead to biased model training and limited generalisation. To mitigate class imbalance in mental health datasets, this study employed diverse ML techniques, namely resampling, ensemble, and algorithm-specific approaches, and evaluated them with metrics such as accuracy, precision, recall and F1 score. The dataset used was collected from the Open Sourcing Mental Illness website, spanning 2016 to 2021. The findings indicate that ensemble techniques, particularly Random Forest, excelled in managing class imbalance compared to other methods. Beyond conventional performance metrics, the study introduced Kappa, balanced accuracy, and geometric mean to evaluate model effectiveness. These findings provide valuable insights for improving mental health predictions, enabling early diagnosis and personalised treatment strategies.
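For reference, the three additional metrics named above can be computed from a binary confusion matrix as in the following sketch. It uses the standard definitions of Cohen's kappa, balanced accuracy, and geometric mean; the confusion-matrix counts are made up for illustration and are not taken from the study.

```python
import math


def imbalance_metrics(tp, fn, fp, tn):
    """Metrics for a binary confusion matrix; assumes both classes are present."""
    n = tp + fn + fp + tn
    accuracy = (tp + tn) / n
    sensitivity = tp / (tp + fn)  # recall on the positive class
    specificity = tn / (tn + fp)  # recall on the negative class
    balanced_accuracy = (sensitivity + specificity) / 2
    g_mean = math.sqrt(sensitivity * specificity)
    # Cohen's kappa: observed agreement corrected for chance agreement.
    p_expected = ((tp + fn) * (tp + fp) + (tn + fp) * (tn + fn)) / n**2
    kappa = (accuracy - p_expected) / (1 - p_expected)
    return {
        "accuracy": accuracy,
        "balanced_accuracy": balanced_accuracy,
        "g_mean": g_mean,
        "kappa": kappa,
    }


# Hypothetical counts: 10 positives (5 found), 90 negatives (85 correctly rejected).
metrics = imbalance_metrics(tp=5, fn=5, fp=5, tn=85)
print(metrics)
```

With these counts accuracy is 0.90, yet balanced accuracy is about 0.72 and kappa about 0.44, which shows why accuracy alone can flatter a classifier on imbalanced data.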
Gait-based Gender Classification Considering Resampling and Feature Selection
Two intrinsic data characteristics that arise in many domains are class imbalance and high dimensionality, which pose new challenges that should be addressed. When using gait for gender classification, public benchmark databases and well-known gait representations lead to these two problems, but they have not been jointly studied in depth. This paper is a preliminary study that aims to investigate the benefits of using several techniques to tackle the aforementioned problems either singly or in combination, and also to evaluate the order of application that leads to the best classification performance. Experimental results show the importance of jointly managing both problems for gait-based gender classification. In particular, it seems that the best strategy consists of applying resampling followed by feature selection.
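The order of application described above (resampling first, feature selection second) can be sketched as follows. This is only an illustration under stated assumptions: random over-sampling stands in for the resampling step, and a simple difference-of-class-means filter stands in for feature selection. Neither is confirmed as the paper's method, and all names and data are hypothetical.

```python
import random


def oversample_minority(X, y, seed=0):
    """Randomly duplicate minority rows (label 1) until both classes are the same size."""
    rng = random.Random(seed)
    minority = [x for x, lab in zip(X, y) if lab == 1]
    majority_n = sum(1 for lab in y if lab == 0)
    extra = [rng.choice(minority) for _ in range(majority_n - len(minority))]
    return list(X) + extra, list(y) + [1] * len(extra)


def select_top_features(X, y, k):
    """Score each feature by |mean(class 1) - mean(class 0)| and return the top-k indices."""
    def class_mean(label, j):
        vals = [x[j] for x, lab in zip(X, y) if lab == label]
        return sum(vals) / len(vals)

    n_features = len(X[0])
    scores = [abs(class_mean(1, j) - class_mean(0, j)) for j in range(n_features)]
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return sorted(ranked[:k])


# Hypothetical data: 4 majority rows, 2 minority rows, 3 features.
X = [[0.1, 5.0, 1.0], [0.2, 5.1, 0.9], [0.0, 4.9, 1.1], [0.1, 5.0, 1.0],
     [0.9, 5.0, 1.0], [1.0, 5.1, 3.0]]
y = [0, 0, 0, 0, 1, 1]

# Step 1: resample. Step 2: select features on the balanced data.
X_bal, y_bal = oversample_minority(X, y)
selected = select_top_features(X_bal, y_bal, k=2)
X_reduced = [[row[j] for j in selected] for row in X_bal]
print(selected)
```

Running the selection on the balanced data means the feature scores are not dominated by the majority class; swapping the two calls gives the alternative order that the study compares against.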
An empirical evaluation of imbalanced data strategies from a practitioner's point of view
This research tested the following well-known strategies to deal with binary
imbalanced data on 82 different real-life data sets (sampled to imbalance rates
of 5%, 3%, 1%, and 0.1%): class weight, SMOTE, underbagging, and a baseline
(just the base classifier). As base classifiers we used SVM with RBF kernel,
random forests, and gradient boosting machines, and we measured the quality of
the resulting classifier using 6 different metrics (area under the curve,
accuracy, F-measure, G-mean, Matthews correlation coefficient, and balanced
accuracy). The best strategy strongly depends on the metric used to measure the
quality of the classifier. For AUC and accuracy, class weight and the baseline
perform better; for F-measure and MCC, SMOTE performs better; and for G-mean
and balanced accuracy, underbagging.
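To make the "class weight" strategy concrete, the sketch below computes inverse-frequency weights at the four imbalance rates listed in the abstract. The weighting formula, w_c = n / (k * n_c), is a common "balanced" heuristic and is an assumption here; the abstract does not state which weighting scheme the authors used. The sample size of 100,000 is also hypothetical.

```python
from collections import Counter


def balanced_class_weights(y):
    """Inverse-frequency weights: n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n, k = len(y), len(counts)
    return {label: n / (k * c) for label, c in counts.items()}


def labels_at_rate(n, minority_rate):
    """Build a binary label list with the given fraction of minority (1) labels."""
    n_min = round(n * minority_rate)
    return [1] * n_min + [0] * (n - n_min)


# Imbalance rates from the abstract: 5%, 3%, 1%, and 0.1%.
weights_by_rate = {
    rate: balanced_class_weights(labels_at_rate(100_000, rate))
    for rate in (0.05, 0.03, 0.01, 0.001)
}
for rate, weights in weights_by_rate.items():
    print(f"{rate:.1%}: minority weight {weights[1]:.1f}, majority weight {weights[0]:.3f}")
```

Under this heuristic the minority weight grows from 10 at a 5% rate to 500 at a 0.1% rate, so each minority error is penalised far more heavily as the imbalance becomes more extreme.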
Fraud Guard: A Comprehensive Comparative Analysis of Machine Learning Approaches to Enhance Credit Card Fraud Detection
The COVID-19 pandemic has constrained people's mobility, prompting a surge in reliance on online services due to challenges in offline purchasing. Machine learning (ML) methods have played a crucial role in advancing classification and prediction techniques across various domains. In the realm of credit card fraud detection, the significance of ML is particularly pronounced. These methods harness the power of data-driven algorithms to distinguish between legitimate and fraudulent transactions, contributing significantly to the enhancement of security measures in financial transactions. The dynamic and adaptive nature of ML allows for the continuous evolution of fraud detection systems, ensuring a proactive approach to safeguarding against emerging threats in the credit card landscape. With the shift to online services, credit card fraud has become a significant concern within the domain of internet-based transactions. Hence, there is a pressing demand to devise an optimal machine learning method for preventing fraudulent credit card transactions. The study employed four resampling techniques (CNN, AllKNN, SMOTE, and SVMSM) and three machine learning approaches (XGBoost, CatBoost, and RF) to analyse credit card fraud datasets for detection purposes. The findings demonstrate that combining AllKNN as an undersampling technique with CatBoost as a classifier achieves the best results among the evaluated methods. The accuracy, precision, recall, and F1-score were 99.9%, 95.9%, 80%, and 87.4%, respectively. Keywords: Unbalanced data, machine learning techniques, fraud detection, and credit card fraud. DOI: 10.7176/JIEA/14-2-02. Publication date: March 31st 2024
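The reported F1-score can be sanity-checked against the reported precision and recall, since F1 is their harmonic mean. The short sketch below uses only the rounded figures from the abstract, so an exact match is not expected.

```python
def f1_score(precision, recall):
    """F1 is the harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)


# Rounded values as reported in the abstract: precision 95.9%, recall 80%.
f1 = f1_score(0.959, 0.80)
print(f"F1 from rounded inputs: {f1:.3f}")
```

The result is roughly 0.872, close to the reported 87.4%; the small gap is plausibly explained by recall being rounded to a whole percentage in the abstract.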
Coupling different methods for overcoming the class imbalance problem
Many classification problems must deal with imbalanced datasets where one class – the majority class – outnumbers the other classes. Standard classification methods do not provide accurate predictions in this setting since classification is generally biased towards the majority class. The minority classes are oftentimes the ones of interest (e.g., when they are associated with pathological conditions in patients), so methods for handling imbalanced datasets are critical.
Using several different datasets, this paper evaluates the performance of state-of-the-art classification methods for handling the imbalance problem in both binary and multi-class datasets. Different strategies are considered, including the one-class and dimension reduction approaches, as well as their fusions. Moreover, some ensembles of classifiers are tested, in addition to stand-alone classifiers, to assess the effectiveness of ensembles in the presence of imbalance. Finally, a novel ensemble of ensembles is designed specifically to tackle the problem of class imbalance: the proposed ensemble does not need to be tuned separately for each dataset and outperforms all the other tested approaches.
To validate our classifiers we resort to the KEEL-dataset repository, whose data partitions (training/test) are publicly available and have already been used in the open literature: as a consequence, it is possible to report a fair comparison among different approaches in the literature.
Our best approach (MATLAB code and datasets not easily accessible elsewhere) will be available at https://www.dei.unipd.it/node/2357
IMPROVING CREDIT CARD FRAUD DETECTION USING TRANSFER LEARNING AND DATA RESAMPLING TECHNIQUES
This Culminating Experience Project explores the use of machine learning algorithms to detect credit card fraud. The research questions are: Q1. What cross-domain techniques developed in other domains can be effectively adapted and applied to mitigate or eliminate credit card fraud, and how do these techniques compare in terms of fraud detection accuracy and efficiency? Q2. To what extent do synthetic data generation methods effectively mitigate the challenges posed by imbalanced datasets in credit card fraud detection, and how do these methods impact classification performance? Q3. To what extent can the combination of transfer learning and innovative data resampling techniques improve the accuracy and efficiency of credit card fraud detection systems when dealing with imbalanced datasets, and what novel strategies can be developed to address this common challenge?
The main findings are: Q1. Unconventional cross-domain methods improved fraud detection, holding promise for enhanced security. Q2. The problems caused by unbalanced datasets in credit card fraud detection were effectively addressed by the synthetic data generation techniques SMOTE and ADASYN, resulting in a more balanced dataset suitable for fraud classification. Q3. The combination of neural networks and data resampling techniques, such as SMOTE and ADASYN, significantly improved credit card fraud detection accuracy.
The main conclusions are: Q1. Cross-domain methods are useful for credit card fraud detection, especially when it comes to online transactions. Q2. When used with various classifiers, neural networks show remarkable accuracy rates: 97% for unbalanced data, 99.47% for SMOTE, and 99.11% for ADASYN. Q3. A fraud recall of 0.99 is obtained by the model evaluation on imbalanced data, with 12,155 correct predictions out of 12,336 and 181 incorrect ones. The identified areas for further study encompass the testing of our model on larger datasets and the optimization of hyperparameters for further enhancement.
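The synthetic data generation idea behind SMOTE, named in the findings above, can be sketched as interpolation between a minority sample and one of its nearest minority neighbours. This is a simplified, SMOTE-style illustration and not the project's implementation; ADASYN's adaptive weighting is not covered, and the function name, parameters, and toy points are hypothetical.

```python
import random


def smote_like(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points, each on the segment between a random
    minority sample and one of its k nearest minority neighbours.
    Assumes at least two minority samples."""
    rng = random.Random(seed)

    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Nearest neighbours of `base` among the other minority samples.
        neighbours = sorted(
            (p for p in minority if p is not base),
            key=lambda p: sq_dist(base, p),
        )[:k]
        other = rng.choice(neighbours)
        gap = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + gap * (o - b) for b, o in zip(base, other)])
    return synthetic


# Hypothetical minority-class points in two dimensions.
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_like(minority, n_new=6)
print(new_points)
```

Because each synthetic point lies between two real minority samples, the new data stays inside the region the minority class already occupies rather than duplicating existing rows.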
HoloDetect: Few-Shot Learning for Error Detection
We introduce a few-shot learning framework for error detection. We show that
data augmentation (a form of weak supervision) is key to training high-quality,
ML-based error detection models that require minimal human involvement. Our
framework consists of two parts: (1) an expressive model to learn rich
representations that capture the inherent syntactic and semantic heterogeneity
of errors; and (2) a data augmentation model that, given a small seed of clean
records, uses dataset-specific transformations to automatically generate
additional training data. Our key insight is to learn data augmentation
policies from the noisy input dataset in a weakly supervised manner. We show
that our framework detects errors with an average precision of ~94% and an
average recall of ~93% across a diverse array of datasets that exhibit
different types and amounts of errors. We compare our approach to a
comprehensive collection of error detection methods, ranging from traditional
rule-based methods to ensemble-based and active learning approaches. We show
that data augmentation yields an average improvement of 20 F1 points while it
requires access to 3x fewer labeled examples compared to other ML approaches. Comment: 18 pages
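The data augmentation component described above, which generates additional training data from a small seed of clean records, can be illustrated with a minimal sketch. It assumes two hand-written character-level transformations applied uniformly at random. HoloDetect itself learns its transformations and augmentation policy from the noisy input data, which this sketch does not attempt; all function names and values are hypothetical.

```python
import random


def swap_adjacent(value, rng):
    """Swap two adjacent characters (hypothetical error transformation)."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value) - 1)
    return value[:i] + value[i + 1] + value[i] + value[i + 2:]


def drop_char(value, rng):
    """Delete one character (hypothetical error transformation)."""
    if len(value) < 2:
        return value
    i = rng.randrange(len(value))
    return value[:i] + value[i + 1:]


def augment(clean_values, n_per_value=2, seed=0):
    """Return (value, label) pairs: label 0 for clean seeds, 1 for synthetic errors."""
    rng = random.Random(seed)
    transforms = [swap_adjacent, drop_char]
    examples = [(v, 0) for v in clean_values]
    for v in clean_values:
        for _ in range(n_per_value):
            corrupted = rng.choice(transforms)(v, rng)
            if corrupted != v:  # skip no-op corruptions such as swapping "tt"
                examples.append((corrupted, 1))
    return examples


clean = ["Chicago", "Boston", "Seattle"]
examples = augment(clean)
print(examples)
```

The resulting labeled pairs could then be used to train an error classifier without any manually labeled errors, which is the role augmentation plays in the framework as the abstract describes it.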
Exploring synergetic effects of dimensionality reduction and resampling tools on hyperspectral imagery data classification
The present paper addresses the problem of the classification of hyperspectral images with multiple imbalanced classes and very high dimensionality. Class imbalance is handled by resampling the data set, whereas PCA and a supervised filter are applied to reduce the number of spectral bands. This is a preliminary study that aims to investigate the benefits of combining several techniques to tackle the imbalance and the high dimensionality problems, and also to evaluate the order of application that leads to the best classification performance. Experimental results demonstrate the significance of using these two preprocessing tools together to improve the performance of hyperspectral imagery classification. Although it seems that the most effective order corresponds to first a resampling strategy and then a feature selection (or extraction) algorithm, this is a question that still needs a much more thorough investigation in the future. This work has partially been supported by the Spanish Ministry of Education and Science under grants CSD2007–00018, AYA2008–05965–0596 and TIN2009–14205, the Fundació Caixa Castelló–Bancaixa under grant P1–1B2009–04, and the Generalitat Valenciana under grant PROMETEO/2010/02.