Oversampling for Imbalanced Learning Based on K-Means and SMOTE
Learning from class-imbalanced data continues to be a common and challenging
problem in supervised learning as standard classification algorithms are
designed to handle balanced class distributions. While different strategies
exist to tackle this problem, methods which generate artificial data to achieve
a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the
training data, allowing any classifier to be used with class-imbalanced
datasets. Many algorithms have been proposed for this task, but most are
complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE
oversampling, which avoids the generation of noise and effectively overcomes
imbalances between and within classes. Empirical results of extensive
experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE
consistently outperforms other popular oversampling methods. An implementation
is made available in the Python programming language.
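As an illustration of the technique, here is a minimal sketch using the KMeansSMOTE implementation from the imbalanced-learn package. This is not necessarily the authors' reference implementation; the toy dataset, the lowered cluster_balance_threshold, and the other parameter choices are illustrative assumptions.

```python
# Minimal sketch: oversampling an imbalanced dataset with k-means SMOTE,
# using imbalanced-learn's KMeansSMOTE (assumed installed alongside scikit-learn).
from collections import Counter

from sklearn.datasets import make_classification
from imblearn.over_sampling import KMeansSMOTE

# Toy two-class dataset with a roughly 9:1 imbalance.
X, y = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
print("before:", Counter(y))

# Cluster the input space with k-means, then apply SMOTE only inside
# clusters with enough minority samples, which limits noisy synthetic points.
# If no cluster qualifies, lower cluster_balance_threshold or use more clusters.
sampler = KMeansSMOTE(cluster_balance_threshold=0.1, random_state=0)
X_res, y_res = sampler.fit_resample(X, y)
print("after: ", Counter(y_res))
```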
A performance comparison of oversampling methods for data generation in imbalanced learning tasks
Dissertation presented as the partial requirement for obtaining a Master's degree in Statistics and Information Management, specialization in Marketing Research and CRM.
The class imbalance problem is one of the most fundamental challenges faced by the machine learning community. The imbalance refers to the number of instances in the class of interest being relatively low compared to the rest of the data. Sampling is a common technique for dealing with this problem, and a number of oversampling approaches have been applied in an attempt to balance the classes. This study provides an overview of the issue of class imbalance and examines some common oversampling approaches for dealing with it. To illustrate the differences, an experiment is conducted using multiple simulated datasets to compare the performance of these oversampling methods on different classifiers according to various evaluation criteria. In addition, the effect of different parameters, such as the number of features and the imbalance ratio, on classifier performance is also evaluated.
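The comparison described above can be reproduced in outline with imbalanced-learn. The sampler list, classifier, and metric below are illustrative choices, not the dissertation's exact experimental design.

```python
# Sketch: several oversamplers applied to one simulated imbalanced dataset,
# each scored with the same classifier on a held-out test split.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import f1_score
from sklearn.model_selection import train_test_split
from imblearn.over_sampling import ADASYN, SMOTE, BorderlineSMOTE, RandomOverSampler

X, y = make_classification(
    n_samples=5000, n_features=20, weights=[0.95, 0.05], random_state=1
)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, stratify=y, random_state=1)

samplers = {
    "random": RandomOverSampler(random_state=1),
    "smote": SMOTE(random_state=1),
    "borderline": BorderlineSMOTE(random_state=1),
    "adasyn": ADASYN(random_state=1),
}
for name, sampler in samplers.items():
    # Resample only the training split; the test split stays untouched.
    X_res, y_res = sampler.fit_resample(X_tr, y_tr)
    clf = RandomForestClassifier(random_state=1).fit(X_res, y_res)
    print(f"{name:>10}: F1 = {f1_score(y_te, clf.predict(X_te)):.3f}")
```

Varying the weights and n_features arguments of make_classification reproduces the imbalance-ratio and feature-count sweeps mentioned in the abstract.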
Comparing the performance of oversampling techniques in combination with a clustering algorithm for imbalanced learning
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence.
Imbalanced datasets in supervised learning remain an ongoing challenge for standard algorithms, as they are designed to handle balanced class distributions and perform poorly when applied to imbalanced problems. Many methods have been developed to address this specific problem, but the more general approach to achieving a balanced class distribution is data-level modification rather than algorithm modification. Although class imbalances are responsible for significant losses of performance in standard classifiers across many different types of problems, another important aspect to consider is the small disjuncts problem. It is therefore important to consider and understand solutions that take into account not only the between-class imbalance (the imbalance occurring between the two classes) but also the within-class imbalance (the imbalance occurring between the sub-clusters of each class), and to oversample the dataset by rectifying these two types of imbalances simultaneously. Cluster-based oversampling has been shown to be a robust solution that takes both problems into consideration. This work sets out to study the effect and impact of combining different existing oversampling methods with a clustering-based approach. Empirical results of extensive experiments show that combining different oversampling techniques with the k-means clustering algorithm (K-Means Oversampling) improves upon the classification results obtained by the oversampling techniques alone, with no prior clustering step.
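A hedged sketch of the clustering-plus-oversampling idea: cluster with k-means, then generate SMOTE-style interpolated minority samples only inside clusters where the minority class is sufficiently present, targeting within-class as well as between-class imbalance. The function name, thresholds, and even allocation of synthetic samples across clusters are illustrative assumptions, not the dissertation's exact procedure.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.neighbors import NearestNeighbors

def kmeans_oversample(X, y, minority=1, n_clusters=8, min_share=0.5,
                      k_neighbors=5, random_state=0):
    """Cluster X with k-means, then interpolate new minority samples inside
    clusters whose minority share is at least min_share."""
    rng = np.random.default_rng(random_state)
    labels = KMeans(n_clusters=n_clusters, n_init=10,
                    random_state=random_state).fit_predict(X)
    deficit = int(np.sum(y != minority) - np.sum(y == minority))
    # Clusters dominated by the minority class, with enough points for k-NN.
    eligible = [c for c in range(n_clusters)
                if np.mean(y[labels == c] == minority) >= min_share
                and np.sum((labels == c) & (y == minority)) > k_neighbors]
    new_points = []
    for c in eligible:
        X_min = X[(labels == c) & (y == minority)]
        nn = NearestNeighbors(n_neighbors=k_neighbors + 1).fit(X_min)
        _, idx = nn.kneighbors(X_min)
        for _ in range(deficit // len(eligible)):  # even split across clusters
            i = rng.integers(len(X_min))
            j = idx[i][rng.integers(1, k_neighbors + 1)]  # skip self at index 0
            gap = rng.random()
            new_points.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    if not new_points:
        return X, y
    X_new = np.vstack([X, np.asarray(new_points)])
    y_new = np.concatenate([y, np.full(len(new_points), minority)])
    return X_new, y_new

# Usage (X, y as NumPy arrays): X_res, y_res = kmeans_oversample(X, y)
```

Swapping the interpolation step for another generator (e.g., a borderline or density-weighted variant) would correspond to the other oversampler-plus-clustering combinations the dissertation compares.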
Associative learning on imbalanced environments: An empirical study
Associative memories have emerged as a powerful computational neural network model for several pattern classification problems. Like most traditional classifiers, these models assume that the classes share similar prior probabilities. However, in many real-life applications the ratios of prior probabilities between classes are extremely skewed. Although the literature provides numerous studies that examine the performance degradation of renowned classifiers in different imbalanced scenarios, so far this effect has not been supported by a thorough empirical study in the context of associative memories. In this paper, we fix our attention on the applicability of associative neural networks to the classification of imbalanced data. The key questions addressed here are whether these models perform better, the same, or worse than other popular classifiers; how the level of imbalance affects their performance; and whether distinct resampling strategies produce a different impact on the associative memories. In order to answer these questions and gain further insight into the feasibility and efficiency of the associative memories, a large-scale experimental evaluation with 31 databases, seven classification models and four resampling algorithms is carried out here, along with a non-parametric statistical test to discover any significant differences between each pair of classifiers.
This work has been partially supported by the Mexican Science and Technology Council (CONACYT-Mexico) through the Postdoctoral Fellowship Program (232167), the Mexican PRODEP (DSA/103.5/15/7004), the Spanish Ministry of Economy (TIN2013-46522-P) and the Generalitat Valenciana (PROMETEOII/2014/062).
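The evaluation protocol outlined in the abstract (a grid of classifiers and resamplers scored over many datasets, followed by non-parametric pairwise tests) can be sketched as follows. The classifiers are stand-ins, since scikit-learn provides no associative-memory model, and the use of the Wilcoxon signed-rank test is an assumption about which non-parametric test was employed.

```python
# Sketch of a classifiers-by-resamplers evaluation over many datasets,
# with pairwise non-parametric comparison of the per-dataset scores.
from itertools import combinations

from scipy.stats import wilcoxon
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from imblearn.over_sampling import SMOTE, RandomOverSampler
from imblearn.pipeline import make_pipeline

classifiers = {"1nn": KNeighborsClassifier(n_neighbors=1),
               "tree": DecisionTreeClassifier(random_state=0)}
resamplers = {"ros": RandomOverSampler(random_state=0),
              "smote": SMOTE(random_state=0)}

def evaluate(datasets):
    """datasets: list of (X, y) pairs. Prints pairwise Wilcoxon p-values."""
    scores = {}
    for s_name, sampler in resamplers.items():
        for c_name, clf in classifiers.items():
            # Pipelining keeps resampling inside each cross-validation fold.
            pipe = make_pipeline(sampler, clf)
            scores[f"{s_name}+{c_name}"] = [
                cross_val_score(pipe, X, y, scoring="f1").mean()
                for X, y in datasets
            ]
    # Signed-rank test on paired per-dataset scores for each pair of setups.
    for a, b in combinations(scores, 2):
        _, p = wilcoxon(scores[a], scores[b])
        print(f"{a} vs {b}: p = {p:.4f}")
```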
Oversampling for imbalanced learning based on k-means and SMOTE
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics.
Learning from class-imbalanced data continues to be a common and challenging problem in
supervised learning as standard classification algorithms are designed to handle balanced class
distributions. While different strategies exist to tackle this problem, methods which generate
artificial data to achieve a balanced class distribution are more versatile than modifications to the
classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any
classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this
task, but most are complex and tend to generate unnecessary noise. This work presents a simple and
effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids
the generation of noise and effectively overcomes imbalances between and within classes. Empirical
results of extensive experiments with 71 datasets show that training data oversampled with the
proposed method improves classification results. Moreover, k-means SMOTE consistently
outperforms other popular oversampling methods. An implementation is made available in the
Python programming language.
Use of ensemble based on GA for imbalance problem
In real-world applications, it has been observed that class imbalance (significant differences in class prior probabilities) may produce an important deterioration of classifier performance, in particular for patterns belonging to the less represented classes. One method to tackle this problem consists of resampling the original training set, either by over-sampling the minority class and/or under-sampling the majority class. In this paper, we propose two ensemble models (using a modular neural network and the nearest neighbor rule) trained on datasets under-sampled with genetic algorithms. Experiments with real datasets demonstrate the effectiveness of the methodology proposed here.
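A hedged sketch of the genetic-algorithm under-sampling idea: a GA evolves a bitmask over the majority-class instances, with fitness given by the validation F1 of a 1-NN classifier trained on the reduced set. The population size, operators, mutation rate and fitness metric are illustrative assumptions, not the authors' configuration; an ensemble in the paper's spirit would combine models trained on selections from several GA runs with different seeds.

```python
import numpy as np
from sklearn.metrics import f1_score
from sklearn.neighbors import KNeighborsClassifier

def ga_undersample(X_tr, y_tr, X_val, y_val, majority=0,
                   pop=20, gens=30, keep_rate=0.2, seed=0):
    """Return training indices: all minority samples plus a GA-selected
    subset of the majority class."""
    rng = np.random.default_rng(seed)
    maj_idx = np.flatnonzero(y_tr == majority)
    min_idx = np.flatnonzero(y_tr != majority)

    def fitness(mask):
        idx = np.concatenate([min_idx, maj_idx[mask]])
        clf = KNeighborsClassifier(n_neighbors=1).fit(X_tr[idx], y_tr[idx])
        return f1_score(y_val, clf.predict(X_val))

    # Initial population: random bitmasks keeping ~keep_rate of the majority.
    population = rng.random((pop, len(maj_idx))) < keep_rate
    for _ in range(gens):
        scores = np.array([fitness(m) for m in population])
        parents = population[np.argsort(scores)[::-1][:pop // 2]]  # truncation selection
        children = []
        for _ in range(pop - len(parents)):
            a, b = parents[rng.integers(len(parents), size=2)]
            cut = rng.integers(1, len(maj_idx))            # one-point crossover
            child = np.concatenate([a[:cut], b[cut:]])
            flip = rng.random(len(child)) < 0.01           # bit-flip mutation
            children.append(np.where(flip, ~child, child))
        population = np.vstack([parents, np.array(children)])
    best = population[np.argmax([fitness(m) for m in population])]
    return np.concatenate([min_idx, maj_idx[best]])
```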
MOOC next week dropout prediction: weekly assessing time and learning patterns
Although Massive Open Online Course (MOOC) systems have become more prevalent in recent years, the associated student attrition rates are still a major drawback. In the past decade, many researchers have sought to explore the reasons behind learner attrition or lack of interest. A growing body of literature recognises the importance of the early prediction of student attrition from MOOCs, since it can lead to timely interventions. Most such work is concerned with identifying the best features for predicting dropout over the entire course. This study focuses instead on predicting student dropout by examining next-week learning activities and behaviours. The study is based on multiple MOOC platforms, covering 251,662 students from 7 courses with 29 runs spanning 2013 to 2018. It aims to build a generalised early predictive model for the weekly prediction of student completion using machine learning algorithms. In addition, this study is the first to use a ‘learner's jumping behaviour’ as a feature, obtaining high dropout prediction accuracy.
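An illustrative sketch of next-week dropout prediction from weekly activity features. The column names, including jump_count as a stand-in for the ‘learner's jumping behaviour’ feature, and the model choice are hypothetical; the study's actual feature set and algorithms may differ.

```python
# Toy example: one row per (student, week), with the label derived from
# whether the student was inactive in the following week.
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

weekly = pd.DataFrame({
    "time_spent_min": [120, 30, 0, 200, 15, 90],
    "videos_watched": [5, 1, 0, 8, 1, 4],
    "quiz_attempts":  [2, 0, 0, 3, 0, 1],
    "jump_count":     [1, 4, 0, 0, 6, 2],   # non-sequential navigation events
    "dropped_next_week": [0, 1, 1, 0, 1, 0],
})

X = weekly.drop(columns="dropped_next_week")
y = weekly["dropped_next_week"]
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.33, stratify=y, random_state=0
)
model = GradientBoostingClassifier(random_state=0).fit(X_tr, y_tr)
print("AUC:", roc_auc_score(y_te, model.predict_proba(X_te)[:, 1]))
```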