106,169 research outputs found

Oversampling for Imbalanced Learning Based on K-Means and SMOTE

Author: Bacao Fernando
Douzas Georgios
Last Felix
Publication venue: 'Elsevier BV'
Publication date: 12/12/2017

Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language. Comment: 19 pages, 8 figures
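
The abstract names the two building blocks (k-means clustering and SMOTE interpolation) without giving details. The following is a minimal sketch of one way to combine them, not the authors' implementation: the rule for choosing clusters (minority points must form more than half of a cluster), the even split of new samples across clusters, and every function and parameter name are assumptions made for illustration.

```python
import random

def dist2(a, b):
    """Squared Euclidean distance between two points given as tuples."""
    return sum((x - y) ** 2 for x, y in zip(a, b))

def kmeans(points, k, iters=20, seed=0):
    """Plain Lloyd's k-means; returns one cluster label per point."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)
    labels = [0] * len(points)
    for _ in range(iters):
        # Assign every point to its nearest center.
        labels = [min(range(k), key=lambda c: dist2(p, centers[c])) for p in points]
        # Move each center to the mean of its members (unchanged if empty).
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centers[c] = tuple(sum(col) / len(members) for col in zip(*members))
    return labels

def kmeans_smote_sketch(X, y, minority, k=2, n_new=10, n_neighbors=3, seed=0):
    """Generate n_new synthetic minority points inside minority-dominated clusters."""
    rng = random.Random(seed)
    labels = kmeans(X, k, seed=seed)
    # Assumed filter: keep clusters with at least two minority points
    # that make up more than half of the cluster.
    eligible = []
    for c in range(k):
        idx = [i for i, lab in enumerate(labels) if lab == c]
        mino = [X[i] for i in idx if y[i] == minority]
        if len(mino) >= 2 and len(mino) > len(idx) / 2:
            eligible.append(mino)
    synthetic = []
    for j in range(n_new if eligible else 0):
        cluster = eligible[j % len(eligible)]  # assumed: round-robin allocation
        i = rng.randrange(len(cluster))
        a = cluster[i]
        # SMOTE step: pick one of the nearest minority neighbours in the
        # same cluster and interpolate a random point on the segment a-b.
        others = sorted(cluster[:i] + cluster[i + 1:], key=lambda p: dist2(a, p))
        b = rng.choice(others[:n_neighbors])
        t = rng.random()
        synthetic.append(tuple(u + t * (v - u) for u, v in zip(a, b)))
    return synthetic
```

Because every synthetic point is interpolated between two minority points of the same cluster, new samples stay inside regions already occupied by the minority class, which is the intuition behind the abstract's claim of avoiding noise.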

arXiv.org e-Print Archive

Repositório da Universidade Nova de Lisboa

Recommended from our members

The evaluation of effectiveness of school-based small group mentoring system : Discovery Scholars Program at UT Austin

Author: Oh Haein
Publication date: 06/11/2017

Previous studies have found that students’ college experiences differ vastly depending on their socioeconomic status (SES), not only in their academic achievement (Chen & Carroll, 2005; Harackiewicz et al., 2014), but also in the whole process of the college experience, including college preparation and socio-cultural practices while at college (Engle & Tinto, 2008; Merritt, 2008). The present study examines the effectiveness of a small-group mentoring program, the Discovery Scholars Program (DSP), aimed at students identified as at-risk due to SES and lower entrance scores at the University of Texas at Austin. Using survey data collected as part of the program, the current study explored the following research questions: 1) Did the students engage in and benefit from DSP events and feel supported by the DSP groups? 2) Did the DSP program help students develop academic skills necessary for college life? 3) Did the DSP program help students feel more comfortable and confident as a part of the UT community? The results showed that, while the more long-term goals of the program were difficult to measure, the program succeeded in helping students develop social support groups and academic skills that aided their adjustment to college. Educational Psychology

Texas ScholarWorks

An AUC-based Permutation Variable Importance Measure for Random Forests

Author: A Estabrooks
AL Boulesteix
Anne-Laure Boulesteix
C Chen
C Liu
C Strobl
Carolin Strobl
F Briggs
G Batista
J Chang
J Van Hulse
K Nicodemus
KK Nicodemus
L Breiman
M Calle
M Cummings
M Khalilia
M Kubat
M Pepe
N Japkowicz
R Blagus
Silke Janitza
T Fawcett
T Hothorn
T Khoshgoftaar
WJ Lin
Y Huang
Y Sun
Y Xie
Publication date: 01/11/2012

The random forest (RF) method is a commonly used tool for classification with high dimensional data as well as for ranking candidate predictors based on the so-called random forest variable importance measures (VIMs). However, the classification performance of RF is known to be suboptimal in the case of strongly unbalanced data, i.e. data where response class sizes differ considerably. Suggestions were made to obtain better classification performance based either on sampling procedures or on cost sensitivity analyses. However, to our knowledge, the performance of the VIMs has not yet been examined in the case of unbalanced response classes. In this paper we explore the performance of the permutation VIM for unbalanced data settings and introduce an alternative permutation VIM based on the area under the curve (AUC) that is expected to be more robust towards class imbalance. We investigated the performance of the standard permutation VIM and of our novel AUC-based permutation VIM for different class imbalance levels using simulated data and real data. The results suggest that, as class imbalance increases, the standard permutation VIM loses its ability to discriminate between predictors that are associated with the response and predictors that are not. It is outperformed by our new AUC-based permutation VIM for unbalanced data settings, while the performance of both VIMs is very similar in the case of balanced classes. The new AUC-based VIM is implemented in the R package party for the unbiased RF variant based on conditional inference trees. The codes implementing our study are available from the companion website: http://www.ibe.med.uni-muenchen.de/organisation/mitarbeiter/070_drittmittel/janitza/index.html
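
The core idea of an AUC-based permutation importance can be illustrated in a few lines. This is a simplified sketch under stated assumptions, not the procedure implemented in the party package: a single fixed scoring function (the hypothetical `score_fn`) stands in for a fitted random forest, the whole dataset stands in for out-of-bag observations, and all names are illustrative. The importance of a predictor is taken as the average drop in AUC after its values are randomly permuted.

```python
import random

def auc(scores, labels):
    """AUC as the share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    pos = [s for s, lab in zip(scores, labels) if lab == 1]
    neg = [s for s, lab in zip(scores, labels) if lab == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

def auc_permutation_importance(score_fn, X, y, feature, n_repeats=20, seed=0):
    """Mean decrease in AUC when the values of one feature are shuffled.

    score_fn maps a row (tuple) to a score for class 1; here it is an
    assumed stand-in for a fitted model's predicted probability.
    """
    rng = random.Random(seed)
    baseline = auc([score_fn(row) for row in X], y)
    drops = []
    for _ in range(n_repeats):
        # Shuffle only the chosen column, leaving the others intact.
        column = [row[feature] for row in X]
        rng.shuffle(column)
        permuted = [row[:feature] + (v,) + row[feature + 1:]
                    for row, v in zip(X, column)]
        drops.append(baseline - auc([score_fn(r) for r in permuted], y))
    return sum(drops) / n_repeats
```

Because AUC depends only on how positives rank relative to negatives, it is not driven by the majority class the way overall error rate is, which is the intuition behind the robustness the abstract expects.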

CiteSeerX

Crossref

Springer - Publisher Connector

Open Access LMU

PubMed Central

ZORA

On the role of pre and post-processing in environmental data mining

Author: Athanasiadis Ioannis
Comas Joaquim
Gibert Karina
Holmes Geoffrey
Izquierdo Joaquin
Sanchez-Marre Miquel
Publication venue: International Environmental Modelling and Software Society
Publication date: 01/01/2008

The quality of discovered knowledge depends heavily on data quality. Unfortunately, real data often contain noise, uncertainty, errors, redundancies or even irrelevant information. The more complex the reality to be analyzed, the higher the risk of getting low-quality data. Knowledge Discovery from Databases (KDD) offers a global framework to prepare data in the right form to perform correct analyses. On the other hand, the quality of decisions taken upon KDD results depends not only on the quality of the results themselves, but also on the capacity of the system to communicate those results in an understandable form. Environmental systems are particularly complex, and environmental users particularly require clarity in their results. In this paper some details about how this can be achieved are provided. The role of pre- and post-processing in the whole process of Knowledge Discovery in environmental systems is discussed.

Research Commons@Waikato

Improving Underrepresented Minority Student Persistence in STEM.

Author: Burnett Myra
Campbell Andrew G
Campbell Patricia B
Denetclaw Wilfred F
Estrada Mica
Gutiérrez Carlos G
Hurtado Sylvia
John Gilbert H
Matsui John
McGee Richard
Okpodu Camellia Moses
Robinson T Joan
Summers Michael F
Werner-Washburne Maggie
Zavala MariaElena
Publication venue: eScholarship, University of California
Publication date: 01/01/2016

Members of the Joint Working Group on Improving Underrepresented Minorities (URMs) Persistence in Science, Technology, Engineering, and Mathematics (STEM), convened by the National Institute of General Medical Sciences and the Howard Hughes Medical Institute, review current data and propose deliberation about why the academic "pathways" leak more for URM than for white or Asian STEM students. They suggest expanding to include a stronger focus on the institutional barriers that need to be removed and the types of interventions that "lift" students' interests, commitment, and ability to persist in STEM fields. Using Kurt Lewin's planned approach to change, the committee describes five recommendations to increase URM persistence in STEM at the undergraduate level. These recommendations capitalize on known successes, recognize the need for accountability, and are framed to facilitate greater progress in the future. The impact of these recommendations rests upon enacting the first recommendation: to track successes and failures at the institutional level and collect data that help explain the existing trends.

PubMed Central

eScholarship - University of California