448 research outputs found

    SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary

    The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is considered the "de facto" standard in the framework of learning from imbalanced data. This is due to the simplicity of its design, as well as its robustness when applied to different types of problems. Since its publication in 2002, SMOTE has proven successful in a variety of applications from several different domains. SMOTE has also inspired several approaches to counter the issue of class imbalance, and has significantly contributed to new supervised learning paradigms, including multilabel classification, incremental learning, semi-supervised learning, and multi-instance learning, among others. It is a standard benchmark for learning from imbalanced data. It is also featured in a number of different software packages, from open source to commercial. In this paper, marking the fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current state of affairs with SMOTE and its applications, and identify the next set of challenges in extending SMOTE to Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project 887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016; and the National Science Foundation (NSF) Grant IIS-1447795.
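
For readers unfamiliar with the technique, the core idea of SMOTE can be sketched in a few lines: pick a minority sample, pick one of its nearest minority neighbours, and create a synthetic point somewhere on the segment between them. The sketch below is a simplified illustration under stated assumptions, not the authors' reference implementation: the function name `smote_sketch`, the parameters `n_new`, `k` and `seed`, and the toy data are all hypothetical, and neighbours are found by brute force.

```python
import random


def smote_sketch(minority, n_new, k=3, seed=0):
    """Create n_new synthetic points by interpolating between minority samples.

    Illustrative only: names and defaults are assumptions, not a library API.
    """
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority)
        # Brute-force search for the k nearest minority neighbours of `base`
        # (squared Euclidean distance, excluding the base point itself).
        others = [p for p in minority if p is not base]
        others.sort(key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p)))
        neighbour = rng.choice(others[:k])
        # Random position along the segment from `base` to `neighbour`.
        gap = rng.random()
        synthetic.append(
            tuple(a + gap * (b - a) for a, b in zip(base, neighbour))
        )
    return synthetic


# Toy minority class with two numeric features (made-up values).
minority = [(1.0, 1.0), (1.2, 0.9), (0.8, 1.1), (1.1, 1.3)]
new_points = smote_sketch(minority, n_new=5)
```

Because every synthetic point is a convex combination of two existing minority samples, it always lies inside the region spanned by the minority class, which is the property that distinguishes this approach from simply duplicating samples.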

    Automating the decision making process of Todd’s age estimation method from the pubic symphysis with explainable machine learning

    Age estimation is a fundamental task in forensic anthropology for both the living and the dead. The procedure consists of analyzing properties such as appearance, ossification patterns, and morphology in different skeletonized remains. The pubic symphysis is extensively used to assess adults’ age-at-death due to its reliability. Nevertheless, most methods currently used for skeleton-based age estimation are carried out manually, even though their automation has the potential to lead to a considerable improvement in terms of economic resources, effectiveness, and execution time. In particular, explainable machine learning emerges as a promising means of addressing this challenge by engaging forensic experts to refine and audit the extracted knowledge and discover unknown patterns hidden in the complex and uncertain available data. In this contribution we address the automation of the decision making process of Todd’s pioneering age assessment method to assist the forensic practitioner in its application. To do so, we make use of the pubic bone database available at the Physical Anthropology lab of the University of Granada. The machine learning task is significantly complex as it becomes an imbalanced ordinal classification problem with a small sample size and a high dimension. We tackle it with the combination of an ordinal classification method and oversampling techniques through an extensive experimental setup. Two forensic anthropologists refine and validate the derived rule base according to their own expertise and the knowledge available in the area. The resulting automatic system, finally composed of 34 interpretable rules, exceeds state-of-the-art accuracy. In addition, and more importantly, it allows the forensic experts to uncover novel and interesting insights about how Todd’s method works, in particular, and the guidelines to estimate age-at-death from pubic symphysis characteristics, generally. Ministry of Science and Innovation, Spain (MICINN) Spanish Government; Agencia Estatal de Investigacion (AEI) PID2021-122916NB-I00 Spanish Government PGC2018-101216-B-I00; Junta de Andalucia; University of Granada P18-FR-4262 B-TIC-456-UGR20; European Commission; Universidad de Granada/CBU

    Small data oversampling: improving small data prediction accuracy using the geometric SMOTE algorithm

    Dissertation presented as the partial requirement for obtaining a Master's degree in Data Science and Advanced Analytics. In the age of Big Data, many machine learning tasks in numerous industries are still restricted due to the use of small datasets. The limited availability of data often results in unsatisfactory prediction performance of supervised learning algorithms and, consequently, poor decision making. The current research work aims to mitigate the small dataset problem by artificial data generation in the pre-processing phase of the data analysis process. The oversampling technique Geometric SMOTE is applied to generate new training instances and enhance crisp data structures. Experimental results show a significant improvement in prediction accuracy compared both with the use of the original, small datasets and with other oversampling techniques such as Random Oversampling, SMOTE and Borderline SMOTE. These findings show that artificial data creation is a promising approach to overcome the problem of small data in classification tasks.

    Handling Imbalance Data in Classification Model with Nominal Predictors

    A decision tree, one of several classification methods, can be used to identify the factors that predict an outcome while giving an interpretable result. However, when one class makes up only a small share of an unbalanced dataset, the classifier tends to predict only the majority class. Class imbalance therefore needs to be handled. One method often used for data with nominal predictors is SMOTE-N. To improve accuracy, a hybrid of SMOTE-N and ADASYN-N was developed. SMOTE-N-ENN and ADASYN-N were likewise developed to improve accuracy. In this study, SMOTE-N, SMOTE-N-ENN and ADASYN-N are compared for handling class imbalance in the classification of premarital sex among adolescents, using CART as the base classifier. The conclusion is that ADASYN-N is the best method for handling class imbalance, because it provides the highest AUC compared with SMOTE-N and SMOTE-N-ENN. The best decision tree indicates that the factors that can predict adolescents having premarital sexual relations are dating style, knowledge of the fertile period, knowledge of the risk of young marriage, gender, recent education, and area of residence.

    Prediction of Fatigue on Rotating-Shift Workers

    Rotating shifts have become prevalent in many industries, leading to a growing concern about the impact of fatigue on workers' performance and safety. Thus, it is useful to develop a method to predict the fatigue of workers on rotating shifts. This thesis aims to contribute to the development of such a method by building data-driven models to predict the level of fatigue. We use a random forest classifier and a random forest regressor to build two fatigue prediction models. A third model is built by combining the random forest classifier and regressor. Two imbalanced datasets from different groups of workers in the same industry are used. We explore two strategies to deal with imbalanced datasets: random over-sampling and class weights. We select features using random forest feature importance and find that a set of 19 features, selected from 38 original features, gives the best performance. We obtain good prediction accuracy on both datasets. The combined model reaches a mean absolute error of 0.93 and 0.83 on the two datasets, on a 9-level scale of fatigue. In the region of high fatigue, which is of particular interest in real work, our model can predict with an average of 85% confidence that the true level falls within ±1 of the prediction. We conclude that fatigue can be predicted with high confidence, based on a dataset of sleep patterns, work schedules and demographic data. Future work will focus on generalizing the model to datasets from different industries or geographical areas, and on discovering other sets of features that give better prediction.
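
The two imbalance strategies named in this abstract, random over-sampling and class weights, can be illustrated with a small sketch. It is written under assumptions and does not reproduce the thesis's code: the function names, the toy data, and in particular the weighting formula (total samples divided by the number of classes times the class count, a common "balanced" heuristic) are choices made here for illustration, since the abstract does not say how the weights were set.

```python
import random
from collections import Counter


def random_oversample(X, y, seed=0):
    """Duplicate randomly chosen samples until every class matches the largest one."""
    rng = random.Random(seed)
    counts = Counter(y)
    target = max(counts.values())
    X_out, y_out = list(X), list(y)
    for label, count in counts.items():
        pool = [x for x, lab in zip(X, y) if lab == label]
        # Draw with replacement from the original samples of this class.
        for _ in range(target - count):
            X_out.append(rng.choice(pool))
            y_out.append(label)
    return X_out, y_out


def balanced_class_weights(y):
    """Weight each class inversely to its frequency (assumed heuristic)."""
    counts = Counter(y)
    n_samples, n_classes = len(y), len(counts)
    return {label: n_samples / (n_classes * c) for label, c in counts.items()}


# Toy imbalanced dataset: four samples of class 0, two of class 1 (made-up values).
X = [[0.1], [0.2], [0.3], [0.4], [0.9], [1.0]]
y = [0, 0, 0, 0, 1, 1]

X_bal, y_bal = random_oversample(X, y)
weights = balanced_class_weights(y)
```

Over-sampling changes the training data and leaves the learner untouched, whereas class weights leave the data untouched and change how much each error costs during training; which works better is an empirical question for a given dataset.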

    Estudio de métodos de construcción de ensembles de clasificadores y aplicaciones (Study of methods for building classifier ensembles and their applications)

    Artificial intelligence is concerned with building computer systems that behave intelligently. Within this area, machine learning studies the creation of systems that learn by themselves. One type of machine learning is supervised learning, in which the system is given both the inputs and the expected output and learns from these data. A system of this kind is called a classifier. Sometimes, in the set of examples the system uses to learn, the number of examples of one type is much larger than the number of examples of another type. When this happens, the datasets are said to be imbalanced. The combination of several classifiers is called an "ensemble", and it often gives better results than any of its individual members. One of the keys to a well-performing ensemble is diversity. This thesis focuses on the development of new ensemble construction algorithms, centered on techniques for increasing diversity and on imbalanced problems. Additionally, these techniques are applied to the solution of several industrial problems. Ministerio de Economía y Competitividad, project TIN-2011-2404

    Geometric SMOTE for imbalanced datasets with nominal and continuous features

    Fonseca, J., & Bacao, F. (2023). Geometric SMOTE for imbalanced datasets with nominal and continuous features. Expert Systems with Applications, 234(December), 1-9. [121053]. https://doi.org/10.1016/j.eswa.2023.121053 --- This research was supported by research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), references SFRH/BD/151473/2021, DSAIPA/DS/0116/2019, and by project UIDB/04152/2020, Centro de Investigação em Gestão de Informação (MagIC). Imbalanced learning can be addressed in 3 different ways: resampling, algorithmic modifications and cost-sensitive solutions. Resampling, and specifically oversampling, is a more general approach than algorithmic and cost-sensitive methods. Since the proposal of the Synthetic Minority Oversampling TEchnique (SMOTE), various SMOTE variants and neural network-based oversampling methods have been developed. However, the options to oversample datasets with nominal and continuous features are limited. We propose Geometric SMOTE for Nominal and Continuous features (G-SMOTENC), based on a combination of G-SMOTE and SMOTENC. Our method modifies SMOTENC’s encoding and generation mechanism for nominal features while using G-SMOTE’s data selection mechanism to determine the center observation and k-nearest neighbors and generation mechanism for continuous features. G-SMOTENC’s performance is compared against SMOTENC’s along with two other baseline methods, a state-of-the-art oversampling method and no oversampling. The experiment was performed over 20 datasets with varying imbalance ratios, number of metric and non-metric features and target classes. We found a significant improvement in classification performance when using G-SMOTENC as the oversampling method. An open-source implementation of G-SMOTENC is made available in the Python programming language.