Oversampling for Imbalanced Learning Based on K-Means and SMOTE
Learning from class-imbalanced data continues to be a common and challenging
problem in supervised learning as standard classification algorithms are
designed to handle balanced class distributions. While different strategies
exist to tackle this problem, methods which generate artificial data to achieve
a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the
training data, allowing any classifier to be used with class-imbalanced
datasets. Many algorithms have been proposed for this task, but most are
complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE
oversampling, which avoids the generation of noise and effectively overcomes
imbalances between and within classes. Empirical results of extensive
experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE
consistently outperforms other popular oversampling methods. An implementation
is made available in the Python programming language.
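The three steps this method is known for (cluster the input space with k-means, select clusters dominated by the minority class, then apply SMOTE within them) are available, for example, as KMeansSMOTE in the imbalanced-learn package. A minimal usage sketch, assuming that implementation rather than the authors' own release:

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from imblearn.over_sampling import KMeansSMOTE

# Toy data: a binary problem with a 5% minority class.
X, y = make_classification(n_samples=2000, weights=[0.95, 0.05], random_state=0)

# Cluster with k-means, then apply SMOTE only inside clusters dominated by
# the minority class; this filtering step is what limits noisy samples.
X_res, y_res = KMeansSMOTE(random_state=0).fit_resample(X, y)

# Any off-the-shelf classifier can now be trained on the balanced data.
clf = RandomForestClassifier(random_state=0).fit(X_res, y_res)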
An empirical evaluation of imbalanced data strategies from a practitioner's point of view
This research tested the following well-known strategies for dealing with binary
imbalanced data on 82 different real-life data sets (sampled to imbalance rates
of 5%, 3%, 1%, and 0.1%): class weighting, SMOTE, Underbagging, and a baseline
(just the base classifier). As base classifiers we used SVM with an RBF kernel,
random forests, and gradient boosting machines, and we measured the quality of
the resulting classifier using six different metrics (area under the curve,
accuracy, F-measure, G-mean, Matthews correlation coefficient, and balanced
accuracy). The best strategy strongly depends on the metric used to measure the
quality of the classifier: for AUC and accuracy, class weighting and the baseline
perform better; for F-measure and MCC, SMOTE performs better; and for G-mean
and balanced accuracy, Underbagging performs better.
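As an illustration of this protocol, the sketch below pits the four strategies against one another on synthetic data and reports the six metrics. It assumes scikit-learn and imbalanced-learn implementations (with BalancedBaggingClassifier standing in for Underbagging), not the authors' exact experimental code.

from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import (accuracy_score, balanced_accuracy_score,
                             f1_score, matthews_corrcoef, roc_auc_score)
from imblearn.ensemble import BalancedBaggingClassifier  # underbagging
from imblearn.metrics import geometric_mean_score
from imblearn.over_sampling import SMOTE

# Toy data at a 1% imbalance rate, one of the rates used in the study.
X, y = make_classification(n_samples=5000, weights=[0.99, 0.01], random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=0)

X_sm, y_sm = SMOTE(random_state=0).fit_resample(X_tr, y_tr)
strategies = {
    "baseline": RandomForestClassifier(random_state=0).fit(X_tr, y_tr),
    "class weight": RandomForestClassifier(class_weight="balanced",
                                           random_state=0).fit(X_tr, y_tr),
    "SMOTE": RandomForestClassifier(random_state=0).fit(X_sm, y_sm),
    "underbagging": BalancedBaggingClassifier(random_state=0).fit(X_tr, y_tr),
}
for name, model in strategies.items():
    pred = model.predict(X_te)
    proba = model.predict_proba(X_te)[:, 1]
    print(f"{name:>12}  AUC={roc_auc_score(y_te, proba):.3f}"
          f"  Acc={accuracy_score(y_te, pred):.3f}"
          f"  F1={f1_score(y_te, pred):.3f}"
          f"  G-mean={geometric_mean_score(y_te, pred):.3f}"
          f"  MCC={matthews_corrcoef(y_te, pred):.3f}"
          f"  BAcc={balanced_accuracy_score(y_te, pred):.3f}")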
An Under-Sampled Approach for Handling Skewed Data Distribution using Cluster Disjuncts
Data mining and knowledge discovery aim to uncover hidden and valuable knowledge in data sources. The traditional algorithms used for knowledge discovery are bottlenecked by the wide range of available data sources. Class imbalance is one of the problems arising when a data source provides unequal classes, i.e. examples of one class in a training data set vastly outnumber examples of the other class(es). Researchers have rigorously studied several techniques to alleviate the problem of class imbalance, including resampling algorithms and feature selection approaches. In this paper, we present a new hybrid framework, dubbed Majority Under-sampling based on Cluster Disjuncts (MAJOR_CD), for learning from skewed training data. This algorithm provides a simpler and faster alternative by using the cluster disjunct concept. We conduct experiments using twelve UCI data sets from various application domains, comparing five algorithms on six evaluation metrics. The empirical study suggests that MAJOR_CD is effective in addressing the class imbalance problem.
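The abstract does not spell out the MAJOR_CD procedure itself, so the following is only a generic sketch of the underlying idea: partition the majority class into clusters (its disjuncts) with k-means and under-sample each cluster proportionally, so that small disjuncts stay represented. All names and defaults here are illustrative.

import numpy as np
from sklearn.cluster import KMeans

def cluster_disjunct_undersample(X, y, majority_label=0, n_clusters=5,
                                 random_state=0):
    """Under-sample the majority class per k-means cluster so that the
    kept majority total roughly matches the minority class size."""
    rng = np.random.default_rng(random_state)
    maj = np.flatnonzero(y == majority_label)
    mino = np.flatnonzero(y != majority_label)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X[maj])
    keep = []
    for c in range(n_clusters):
        members = maj[labels == c]
        # Each cluster contributes in proportion to its size, so even the
        # small disjuncts of the majority class remain represented.
        quota = max(1, round(len(members) / len(maj) * len(mino)))
        keep.extend(rng.choice(members, size=min(quota, len(members)),
                               replace=False))
    idx = np.concatenate([np.asarray(keep, dtype=int), mino])
    return X[idx], y[idx]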
Comparing the performance of oversampling techniques in combination with a clustering algorithm for imbalanced learning
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence.
Imbalanced datasets in supervised learning remain an ongoing challenge for standard
algorithms, as these are designed to handle balanced class distributions and perform poorly
when applied to problems of an imbalanced nature. Many methods have been developed to address
this specific problem, but the more general approach to achieving a balanced class distribution is
data-level modification rather than algorithm modification. Although class imbalance is responsible for
significant losses of performance in standard classifiers across many different types of problems, another
important aspect to consider is the small disjuncts problem. It is therefore important to
consider and understand solutions that take into account not only the between-class imbalance
(the imbalance occurring between the two classes) but also the within-class imbalance (the imbalance
occurring between the sub-clusters of each class), and to oversample the dataset by rectifying these
two types of imbalance simultaneously. It has been shown that cluster-based oversampling is a robust
solution that takes both of these problems into consideration. This work sets out to study the effect and
impact of combining different existing oversampling methods with a clustering-based approach.
Empirical results of extensive experiments show that combining different oversampling
techniques with the k-means clustering algorithm (K-Means Oversampling) improves upon
the classification results obtained with the oversampling techniques alone, with no prior clustering step.
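A minimal sketch of the combination being studied, assuming oversamplers that expose the imbalanced-learn fit_resample interface; the function name and defaults are illustrative, not the dissertation's exact procedure.

import numpy as np
from sklearn.cluster import KMeans
from imblearn.over_sampling import SMOTE

def kmeans_oversample(X, y, oversampler, n_clusters=10, random_state=0):
    """Run the given oversampler separately inside each k-means cluster,
    targeting within-class as well as between-class imbalance."""
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)
    parts_X, parts_y = [], []
    for c in range(n_clusters):
        Xc, yc = X[labels == c], y[labels == c]
        try:
            Xc, yc = oversampler.fit_resample(Xc, yc)
        except ValueError:
            pass  # single-class cluster or too few samples: keep unchanged
        parts_X.append(Xc)
        parts_y.append(yc)
    return np.vstack(parts_X), np.concatenate(parts_y)

# Any oversampler with fit_resample can be plugged in, e.g.:
# X_res, y_res = kmeans_oversample(X, y, SMOTE(random_state=0))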
Oversampling for imbalanced learning based on k-means and SMOTE
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics.
Learning from class-imbalanced data continues to be a common and challenging problem in
supervised learning as standard classification algorithms are designed to handle balanced class
distributions. While different strategies exist to tackle this problem, methods which generate
artificial data to achieve a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any
classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this
task, but most are complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids
the generation of noise and effectively overcomes imbalances between and within classes. Empirical
results of extensive experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE consistently
outperforms other popular oversampling methods. An implementation is made available in the
Python programming language.
Borderline Over-sampling for Imbalanced Data Classification
Traditional classification algorithms often perform poorly on imbalanced data sets, in which some classes are heavily outnumbered by the remaining classes. For this kind of data, minority class instances, which are usually of much greater interest, are often misclassified. This paper proposes a method to deal with such data by changing the class distribution through over-sampling at the borderline between the minority class and the majority class of the data set. A Support Vector Machine (SVM) classifier is then trained to predict new unknown instances. Compared to other over-sampling methods, the proposed method focuses only on the minority class instances lying around the borderline, since this area is most crucial for establishing the decision boundary. Furthermore, new instances are generated in such a manner that the minority class area is expanded further toward the side of the majority class in places where few majority class instances appear. Experimental results show that the proposed method can achieve better performance than some other over-sampling methods, especially on data sets with a low degree of overlap, owing to its ability to expand the minority class area in such cases.
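This borderline over-sampling scheme is implemented, for instance, as SVMSMOTE in the imbalanced-learn package, which cites this paper. A minimal usage sketch, assuming that implementation rather than the authors' original code:

from sklearn.datasets import make_classification
from sklearn.svm import SVC
from imblearn.over_sampling import SVMSMOTE

X, y = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)

# Synthetic minority samples are generated around borderline instances,
# identified via the support vectors of an SVM fitted on the original data.
X_res, y_res = SVMSMOTE(random_state=0).fit_resample(X, y)

# The final RBF-kernel SVM is then trained on the over-sampled data.
clf = SVC(kernel="rbf").fit(X_res, y_res)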