8,902 research outputs found
Oversampling for Imbalanced Learning Based on K-Means and SMOTE
Learning from class-imbalanced data continues to be a common and challenging
problem in supervised learning as standard classification algorithms are
designed to handle balanced class distributions. While different strategies
exist to tackle this problem, methods which generate artificial data to achieve
a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the
training data, allowing any classifier to be used with class-imbalanced
datasets. Many algorithms have been proposed for this task, but most are
complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE
oversampling, which avoids the generation of noise and effectively overcomes
imbalances between and within classes. Empirical results of extensive
experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE
consistently outperforms other popular oversampling methods. An implementation
is made available in the python programming language.Comment: 19 pages, 8 figure
Learning from Imbalanced Multi-label Data Sets by Using Ensemble Strategies
Multi-label classification is an extension of conventional classification in which a single instance can be associated with multiple labels. Problems of this type are ubiquitous in everyday life. Such as, a movie can be categorized as action, crime, and thriller. Most algorithms on multi-label classification learning are designed for balanced data and don’t work well on imbalanced data. On the other hand, in real applications, most datasets are imbalanced. Therefore, we focused to improve multi-label classification performance on imbalanced datasets. In this paper, a state-of-the-art multi-label classification algorithm, which called IBLR_ML, is employed. This algorithm is produced from combination of k-nearest neighbor and logistic regression algorithms. Logistic regression part of this algorithm is combined with two ensemble learning algorithms, Bagging and Boosting. My approach is called IB-ELR. In this paper, for the first time, the ensemble bagging method whit stable learning as the base learner and imbalanced data sets as the training data is examined. Finally, to evaluate the proposed methods; they are implemented in JAVA language. Experimental results show the effectiveness of proposed methods.
Keywords: Multi-label classification, Imbalanced data set, Ensemble learning, Stable algorithm, Logistic regression, Bagging, Boostin
Imbalanced Ensemble Classifier for learning from imbalanced business school data set
Private business schools in India face a common problem of selecting quality
students for their MBA programs to achieve the desired placement percentage.
Generally, such data sets are biased towards one class, i.e., imbalanced in
nature. And learning from the imbalanced dataset is a difficult proposition.
This paper proposes an imbalanced ensemble classifier which can handle the
imbalanced nature of the dataset and achieves higher accuracy in case of the
feature selection (selection of important characteristics of students) cum
classification problem (prediction of placements based on the students'
characteristics) for Indian business school dataset. The optimal value of an
important model parameter is found. Numerical evidence is also provided using
Indian business school dataset to assess the outstanding performance of the
proposed classifier
Oversampling for imbalanced learning based on k-means and smote
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced AnalyticsLearning from class-imbalanced data continues to be a common and challenging problem in
supervised learning as standard classification algorithms are designed to handle balanced class
distributions. While different strategies exist to tackle this problem, methods which generate
artificial data to achieve a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any
classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this
task, but most are complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids
the generation of noise and effectively overcomes imbalances between and within classes. Empirical
results of extensive experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE consistently
outperforms other popular oversampling methods. An implementation is made available in the
python programming language
- …