SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is
considered the "de facto" standard in the framework of learning from imbalanced data. This
is due to the simplicity of its design, as well as its robustness when applied
to different types of problems. Since its publication in 2002, SMOTE has proven
successful in a variety of applications from several different domains. SMOTE has also inspired
several approaches to counter the issue of class imbalance, and has significantly
contributed to new supervised learning paradigms, including multilabel classification, incremental
learning, semi-supervised learning, and multi-instance learning, among others. It is a
standard benchmark for learning from imbalanced data. It is also featured in a number of
different software packages, from open source to commercial. In this paper, marking the
fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current
state of affairs with SMOTE and its applications, and identify the next set of challenges
to extend SMOTE for Big Data problems.
This work has been partially supported by the Spanish Ministry of Science and Technology
under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project
887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016;
and the National Science Foundation (NSF) Grant IIS-1447795.
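The abstract above does not restate how SMOTE produces new samples. As a rough illustration, one common formulation creates each synthetic point by interpolating between a minority-class sample and one of its k nearest minority-class neighbours. The sketch below follows that idea under simplifying assumptions: the function name `smote_sketch`, its parameters, and the toy data are hypothetical, base points are drawn at random rather than visited in turn, and only numeric features are handled. It is not the reference implementation from the paper or from any library.

```python
import math
import random


def smote_sketch(minority, n_new, k=5, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    minority: list of equal-length numeric feature lists (minority class only).
    """
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    rng = random.Random(seed)
    k = min(k, len(minority) - 1)
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by Euclidean distance to the base point.
        others = [j for j in range(len(minority)) if j != i]
        others.sort(key=lambda j: math.dist(base, minority[j]))
        # Pick one of the k nearest neighbours at random.
        neighbour = minority[rng.choice(others[:k])]
        # The new point lies on the segment between base and neighbour.
        gap = rng.random()
        synthetic.append([b + gap * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class with two features (illustrative values only).
minority = [[1.0, 1.0], [1.2, 0.9], [0.8, 1.1], [1.1, 1.3]]
new_points = smote_sketch(minority, n_new=6, k=3)
```

Because every synthetic point is a convex combination of two existing minority samples, the generated data stays inside the region already occupied by the minority class.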
Automating the decision making process of Todd’s age estimation method from the pubic symphysis with explainable machine learning
Age estimation is a fundamental task in forensic anthropology for both the living and the
dead. The procedure consists of analyzing properties such as appearance, ossification patterns,
and morphology in different skeletonized remains. The pubic symphysis is extensively
used to assess adults’ age-at-death due to its reliability. Nevertheless, most
methods currently used for skeleton-based age estimation are carried out manually, even
though their automation has the potential to lead to a considerable improvement in terms
of economic resources, effectiveness, and execution time. In particular, explainable
machine learning emerges as a promising means of addressing this challenge by engaging
forensic experts to refine and audit the extracted knowledge and discover unknown patterns
hidden in the complex and uncertain available data. In this contribution we address
the automation of the decision making process of Todd’s pioneering age assessment
method to assist the forensic practitioner in its application. To do so, we make use of the
pubic bone database available at the Physical Anthropology lab of the University of
Granada. The machine learning task is significantly complex as it becomes an imbalanced
ordinal classification problem with a small sample size and a high dimension. We tackle it
with the combination of an ordinal classification method and oversampling techniques
through an extensive experimental setup. Two forensic anthropologists refine and validate
the derived rule base according to their own expertise and the knowledge available in the
area. The resulting automatic system, finally composed of 34 interpretable rules, outperforms
the state-of-the-art accuracy. In addition, and more importantly, it allows the forensic
experts to uncover novel and interesting insights about how Todd’s method works, in
particular, and the guidelines to estimate age-at-death from pubic symphysis characteristics,
generally.
Ministry of Science and Innovation, Spain (MICINN); Spanish Government; Agencia Estatal de Investigacion (AEI) PID2021-122916NB-I00;
Spanish Government PGC2018-101216-B-I00; Junta de Andalucia; University of Granada P18-FR-4262,
B-TIC-456-UGR20; European Commission; Universidad de Granada/CBU
Small data oversampling: improving small data prediction accuracy using the geometric SMOTE algorithm
Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics
In the age of Big Data, many machine learning tasks in numerous industries are still restricted due to the use of
small datasets. The limited availability of data often results in unsatisfactory prediction performance of
supervised learning algorithms and, consequently, poor decision making. The current research work aims to
mitigate the small dataset problem by artificial data generation in the pre-processing phase of the data analysis
process. The oversampling technique Geometric SMOTE is applied to generate new training instances and
enhance crisp data structures. Experimental results show a significant improvement in prediction accuracy
compared with using the original, small datasets and with other oversampling techniques such as Random
Oversampling, SMOTE and Borderline SMOTE. These findings show that artificial data creation is a promising
approach to overcome the problem of small data in classification tasks.
Handling Imbalance Data in Classification Model with Nominal Predictors
Decision trees, one family of classification methods, can be used to identify the factors that predict an outcome while giving interpretable results. However, when one class makes up only a small, unbalanced share of the data, the classifier tends to predict only the majority class, so class imbalance needs to be handled. One method often used for data with nominal predictors is SMOTE-N. To improve accuracy, the hybrid SMOTE-N-ENN and ADASYN-N were developed. In this study, SMOTE-N, SMOTE-N-ENN and ADASYN-N are compared for handling class imbalance in the classification of premarital sex among adolescents, using CART as the base classifier. The best method for handling class imbalance is found to be ADASYN-N, because it gives the highest AUC compared with SMOTE-N and SMOTE-N-ENN. The best decision tree indicates that the factors that can predict adolescents having premarital sexual relations are dating style, knowledge of the fertile period, knowledge of the risk of young marriage, gender, most recent education, and area of residence.
Prediction of Fatigue on Rotating-Shift Workers
Rotating shifts have become prevalent in many industries, leading to growing concern about the impact of fatigue on workers' performance and safety. It is therefore useful to develop a method to predict the fatigue of workers on rotating shifts. This thesis aims to contribute to the development of such a method by building data-driven models to predict the level of fatigue.
We use a random forest classifier and a random forest regressor to build two fatigue prediction models. A third model is built by combining the random forest classifier and regressor. Two imbalanced datasets from different groups of workers in the same industry are used. We explore two strategies for dealing with imbalanced datasets: random over-sampling and class weights.
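The two imbalance strategies named above can be sketched in a few lines. The example is a minimal illustration rather than the thesis's actual pipeline: the helper names `random_oversample` and `balanced_class_weights` and the toy data are hypothetical, and the weighting rule shown (n_samples / (n_classes * class_count)) is one widely used "balanced" heuristic, assumed here because the abstract does not say which weights were used.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen rows of smaller classes until all classes match the largest."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        rows = [x for x, lab in zip(X, y) if lab == label]
        for _ in range(target - count):
            X_out.append(rng.choice(rows))
            y_out.append(label)
    return X_out, y_out


def balanced_class_weights(y):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * count) for label, count in counts.items()}


# Toy data: four samples of class 0 and two of class 1 (illustrative only).
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
y = [0, 0, 0, 0, 1, 1]

X_res, y_res = random_oversample(X, y)   # the data itself is changed
weights = balanced_class_weights(y)      # the data is unchanged, the loss is reweighted
```

Over-sampling changes the training set and works with any learner, while class weights leave the data untouched but require a learner that accepts per-class weights.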
We select features using random forest feature importance and find that a set of 19 features, selected from the 38 original features, gives the best performance.
We obtain good prediction accuracy on both datasets. The combined model reaches a mean absolute error of 0.93 and 0.83 on the two datasets, on a 9-level scale of fatigue. In the region of high fatigue, which is of particular interest in real work, our model can predict with an average of 85% confidence that the true level falls within ±1 of the prediction.
We conclude that fatigue can be predicted with high confidence, based on a dataset of sleep patterns, work schedules and demographic data. Future work will focus on model generalization to datasets from different industries or geographical areas, and on the discovery of other sets of features that give better prediction.
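The two error measures quoted in this abstract, mean absolute error and agreement within ±1 level, can be computed as follows. This is a sketch under assumptions: the function names and the fatigue values are made up for illustration, the 1 to 9 coding of the 9-level scale is assumed, and reading the "85% confidence" figure as the share of predictions within ±1 of the true level is one plausible interpretation rather than the thesis's stated definition.

```python
def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between true and predicted levels."""
    return sum(abs(t - p) for t, p in zip(y_true, y_pred)) / len(y_true)


def within_tolerance_rate(y_true, y_pred, tol=1):
    """Share of predictions that fall within +-tol levels of the true level."""
    hits = sum(abs(t - p) <= tol for t, p in zip(y_true, y_pred))
    return hits / len(y_true)


# Made-up fatigue levels on an assumed 1-9 scale (not data from the thesis).
y_true = [7, 8, 6, 9, 7]
y_pred = [7, 7, 8, 9, 6]

mae = mean_absolute_error(y_true, y_pred)     # absolute errors are 0, 1, 2, 0, 1
rate = within_tolerance_rate(y_true, y_pred)  # four of five predictions are within +-1
```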
Study of methods for constructing classifier ensembles and applications
Artificial intelligence is concerned with creating computer systems that behave intelligently. Within this field, machine learning studies the creation of systems that learn by themselves.
One type of machine learning is supervised learning, in which the system is given both the inputs and the expected output and learns from these data. A system of this kind is called a classifier.
Sometimes, in the set of examples the system uses to learn, the number of examples of one type is much larger than the number of examples of another type. When this happens, the datasets are said to be imbalanced.
The combination of several classifiers is called an "ensemble", and it often gives better results than any of its individual members. One of the keys to a well-performing ensemble is diversity.
This thesis focuses on developing new ensemble construction algorithms, centered on techniques for increasing diversity and on imbalanced problems. In addition, these techniques are applied to solve several industrial problems.
Ministerio de Economía y Competitividad, project TIN-2011-2404
Geometric SMOTE for imbalanced datasets with nominal and continuous features
Fonseca, J., & Bacao, F. (2023). Geometric SMOTE for imbalanced datasets with nominal and continuous features. Expert Systems with Applications, 234(December), 1-9. [121053]. https://doi.org/10.1016/j.eswa.2023.121053
This research was supported by research grants of the Portuguese Foundation for Science and Technology ("Fundação para a Ciência e a Tecnologia"), references SFRH/BD/151473/2021, DSAIPA/DS/0116/2019, and by project UIDB/04152/2020, Centro de Investigação em Gestão de Informação (MagIC).
Imbalanced learning can be addressed in 3 different ways: resampling, algorithmic modifications and cost-sensitive solutions. Resampling, and specifically oversampling, is a more general approach than algorithmic and cost-sensitive methods. Since the proposal of the Synthetic Minority Oversampling TEchnique (SMOTE), various SMOTE variants and neural network-based oversampling methods have been developed. However, the options to oversample datasets with nominal and continuous features are limited. We propose Geometric SMOTE for Nominal and Continuous features (G-SMOTENC), based on a combination of G-SMOTE and SMOTENC. Our method modifies SMOTENC's encoding and generation mechanism for nominal features, while using G-SMOTE's data selection mechanism to determine the center observation and k-nearest neighbors, and G-SMOTE's generation mechanism for continuous features. G-SMOTENC's performance is compared against SMOTENC's along with two other baseline methods, a state-of-the-art oversampling method and no oversampling. The experiment was performed over 20 datasets with varying imbalance ratios, numbers of metric and non-metric features, and target classes. We found a significant improvement in classification performance when using G-SMOTENC as the oversampling method. An open-source implementation of G-SMOTENC is made available in the Python programming language.