
    Oversampling for Imbalanced Learning Based on K-Means and SMOTE

    Learning from class-imbalanced data continues to be a common and challenging problem in supervised learning, as standard classification algorithms are designed to handle balanced class distributions. While different strategies exist to tackle this problem, methods which generate artificial data to achieve a balanced class distribution are more versatile than modifications to the classification algorithm. Such techniques, called oversamplers, modify the training data, allowing any classifier to be used with class-imbalanced datasets. Many algorithms have been proposed for this task, but most are complex and tend to generate unnecessary noise. This work presents a simple and effective oversampling method based on k-means clustering and SMOTE oversampling, which avoids the generation of noise and effectively overcomes imbalances between and within classes. Empirical results of extensive experiments with 71 datasets show that training data oversampled with the proposed method improves classification results. Moreover, k-means SMOTE consistently outperforms other popular oversampling methods. An implementation is made available in the Python programming language.
    Comment: 19 pages, 8 figures
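The abstract above combines clustering with SMOTE-style interpolation. As a rough illustration only, the sketch below shows the interpolation step applied separately inside each cluster. It is not the authors' implementation: the cluster assignments are assumed to be given (the k-means step is omitted), and the function names, the sample data, and the fixed number of synthetic points per cluster are all hypothetical.

```python
import math
import random


def nearest_neighbors(point, candidates, k):
    """Return the k candidates closest to point (Euclidean), excluding point itself."""
    others = [c for c in candidates if c is not point]
    return sorted(others, key=lambda c: math.dist(point, c))[:k]


def smote_within_cluster(minority_points, n_new, k=2, seed=0):
    """Create n_new synthetic points, each lying on the segment between a
    minority point and one of its k nearest minority neighbors."""
    rng = random.Random(seed)
    synthetic = []
    for _ in range(n_new):
        base = rng.choice(minority_points)
        neighbor = rng.choice(nearest_neighbors(base, minority_points, k))
        gap = rng.random()  # position along the segment base -> neighbor
        synthetic.append(tuple(b + gap * (n - b) for b, n in zip(base, neighbor)))
    return synthetic


# Hypothetical minority samples, already grouped into two clusters
# (assumption: a clustering step such as k-means has been run beforehand).
clusters = {
    0: [(0.0, 0.0), (1.0, 0.0), (0.0, 1.0)],
    1: [(10.0, 10.0), (11.0, 10.0), (10.0, 11.0)],
}
new_points = {cid: smote_within_cluster(pts, n_new=4) for cid, pts in clusters.items()}
```

Because each synthetic point is interpolated only between members of the same cluster, no point is generated in the empty region between the two groups, which is one way to read the abstract's claim about avoiding noise.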

    Correction to: WSMOTER: a novel approach for imbalanced regression

    Camacho, L., & Bacao, F. (2024). Correction to: WSMOTER: a novel approach for imbalanced regression. Applied Intelligence, 54, 11160. https://doi.org/10.1007/s10489-024-05704-7
    The chapter "WSMOTER: a novel approach for imbalanced regression", written by Luís Camacho and Fernando Bacao, was originally published without open access. Following the authors’ decision to opt for open access, the copyright of the chapter changed on July 29, 2024 to © The Author(s) 2024, and the chapter is now distributed under the terms of a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http://creativecommons.org/licenses/by/4.0/.
    Open Access funded by: Open access funding provided by FCT|FCCN (b-on). This work was supported by national funds through FCT (Fundação para a Ciência e a Tecnologia), under the project UIDB/04152/2020 (DOI: 10.54499/UIDB/04152/2020) - Centro de Investigação em Gestão de Informação (MagIC)/NOVA IMS.

    Tabular and latent space synthetic data generation: a literature review

    Fonseca, J., & Bacao, F. (2023). Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10, 1-37. [115]. https://doi.org/10.1186/s40537-023-00792-7
    This research was supported by two research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), references SFRH/BD/151473/2021 and DSAIPA/DS/0116/2019, and by project UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC).
    The generation of synthetic data can be used for anonymization, regularization, oversampling, semi-supervised learning, self-supervised learning, and several other tasks. Such broad potential motivated the development of new algorithms, specialized in data generation for specific data formats and Machine Learning (ML) tasks. However, one of the most common data formats used in industrial applications, tabular data, is generally overlooked: literature analyses are scarce, state-of-the-art methods are spread across domains or ML tasks, and there is little to no distinction among the main types of mechanism underlying synthetic data generation algorithms. In this paper, we analyze tabular and latent space synthetic data generation algorithms. Specifically, we propose a unified taxonomy as an extension and generalization of previous taxonomies, review 70 generation algorithms across six ML problems, distinguish the main generation mechanisms identified into six categories, describe each type of generation mechanism, discuss metrics to evaluate the quality of synthetic data, and provide recommendations for future research. We expect this study to help researchers and practitioners identify relevant gaps in the literature and design better and more informed practices with synthetic data.

    Geometric SMOTE for imbalanced datasets with nominal and continuous features

    Fonseca, J., & Bacao, F. (2023). Geometric SMOTE for imbalanced datasets with nominal and continuous features. Expert Systems with Applications, 234(December), 1-9. [121053]. https://doi.org/10.1016/j.eswa.2023.121053
    This research was supported by research grants of the Portuguese Foundation for Science and Technology (“Fundação para a Ciência e a Tecnologia”), references SFRH/BD/151473/2021 and DSAIPA/DS/0116/2019, and by project UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC).
    Imbalanced learning can be addressed in three different ways: resampling, algorithmic modifications and cost-sensitive solutions. Resampling, and specifically oversampling, is a more general approach than algorithmic and cost-sensitive methods. Since the proposal of the Synthetic Minority Oversampling TEchnique (SMOTE), various SMOTE variants and neural network-based oversampling methods have been developed. However, the options to oversample datasets with nominal and continuous features are limited. We propose Geometric SMOTE for Nominal and Continuous features (G-SMOTENC), based on a combination of G-SMOTE and SMOTENC. Our method modifies SMOTENC’s encoding and generation mechanism for nominal features, while using G-SMOTE’s data selection mechanism to determine the center observation and k-nearest neighbors, and G-SMOTE’s generation mechanism for continuous features. G-SMOTENC’s performance is compared against SMOTENC’s, along with two other baseline methods, a state-of-the-art oversampling method and no oversampling. The experiment was performed over 20 datasets with varying imbalance ratios, numbers of metric and non-metric features, and target classes. We found a significant improvement in classification performance when using G-SMOTENC as the oversampling method. An open-source implementation of G-SMOTENC is made available in the Python programming language.
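To make the nominal/continuous distinction in the abstract above concrete, here is a minimal sketch of the baseline idea commonly associated with SMOTENC, not of G-SMOTENC itself: continuous features are interpolated between a base sample and a neighbor, while each nominal feature takes the most frequent value among the neighbors. The abstract states that G-SMOTENC modifies this mechanism, and those modifications are not reproduced here. The function name, the `nominal_idx` parameter, and the sample data are hypothetical.

```python
import random
from collections import Counter


def generate_mixed_sample(base, neighbors, nominal_idx, seed=0):
    """Build one synthetic sample from a base observation and its neighbors.

    Continuous features: linear interpolation toward one randomly chosen neighbor.
    Nominal features: most frequent value among all neighbors (majority vote).
    """
    rng = random.Random(seed)
    partner = rng.choice(neighbors)
    gap = rng.random()  # interpolation weight in [0, 1)
    sample = []
    for i, value in enumerate(base):
        if i in nominal_idx:
            votes = Counter(n[i] for n in neighbors)
            sample.append(votes.most_common(1)[0][0])
        else:
            sample.append(value + gap * (partner[i] - value))
    return tuple(sample)


# Hypothetical minority observation with one nominal feature at index 1
base = (1.0, "red", 5.0)
neighbors = [(2.0, "blue", 6.0), (3.0, "blue", 7.0), (1.5, "red", 5.5)]
synthetic = generate_mixed_sample(base, neighbors, nominal_idx={1})
```

Here the nominal feature of the synthetic sample is "blue", since two of the three neighbors carry that value, while the two continuous features fall somewhere between the base sample and the chosen neighbor.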

    How does the pandemic facilitate mobile payment? An investigation on users’ perspective under the COVID-19 pandemic

    Zhao, Y., & Bacao, F. (2021). How does the pandemic facilitate mobile payment? An investigation on users’ perspective under the COVID-19 pandemic. International Journal of Environmental Research and Public Health, 18(3), 1-22. [1016]. https://doi.org/10.3390/ijerph18031016
    Owing to the convenience, reliability and contact-free nature of mobile payment (M-payment), it has been widely adopted in China during the COVID-19 pandemic to reduce direct and indirect contact in transactions, allowing social distancing to be maintained and facilitating stabilization of the social economy. This paper aims to comprehensively investigate the technological and mental factors affecting users’ adoption intentions of M-payment under the COVID-19 pandemic, in order to expand the domain of technology adoption to emergency situations. This study integrated the Unified Theory of Acceptance and Use of Technology (UTAUT) with perceived benefits from Mental Accounting Theory (MAT) and two additional variables (perceived security and trust) to investigate 739 smartphone users’ adoption intentions of M-payment during the COVID-19 pandemic in China. The empirical results showed that users’ technological and mental perceptions conjointly influence their adoption intentions of M-payment during the COVID-19 pandemic, wherein perceived benefits are significantly determined by social influence and trust, corresponding with the situation of the pandemic. This study initially integrated UTAUT with MAT to develop the theoretical framework for investigating users’ adoption intentions. Meanwhile, this study originally investigated the antecedents of M-payment adoption under the pandemic situation and indicated that users’ perceptions will be positively influenced when a technology’s specific characteristics can benefit a particular situation.

    How does gender moderate customer intention of shopping via live-streaming apps during the COVID-19 pandemic lockdown period?

    Zhao, Y., & Bacao, F. (2021). How does gender moderate customer intention of shopping via live-streaming apps during the COVID-19 pandemic lockdown period? International Journal of Environmental Research and Public Health, 18(24), 1-24. [13004]. https://doi.org/10.3390/ijerph182413004
    Shopping through Live-Streaming Shopping Apps (LSSAs) as an emerging consumption phenomenon has increased dramatically in recent years, especially during the COVID-19 lockdown period. However, few studies have focused on the psychological processes undergone in different customer demographics while shopping via LSSAs under pandemic conditions. This study integrated the Unified Theory of Acceptance and Use of Technology 2 with Flow Theory into a Stimulus-Organism-Response framework to investigate the psychological processes of different customer demographics during the COVID-19 lockdown period. A total of 374 validated records were analyzed by covariance-based structural equation modelling. The statistical results of the proposed model showed a significant discrepancy between gender groups: Flow, as a mediator representing users’ engagement and immersion in shopping via LSSAs, was significantly moderated by gender where the connection between the stimulus components (hedonic motivation, trust and social influence) and the response component (perceived value) is concerned. This study contributed a theoretical development and a practical framework to the explanation of the mental processes of different customer demographics when using an innovative e-commerce technology. Furthermore, the results can support the relevant stakeholders in e-commerce in their comprehensive understanding of customers’ behavior, allowing better strategic and managerial development.

    Advanced Genetic Programming vs. State-of-the-Art AutoML in Imbalanced Binary Classification

    The objective of this article is to provide a comparative analysis of two novel genetic programming (GP) techniques, differentiable Cartesian genetic programming for artificial neural networks (DCGPANN) and geometric semantic genetic programming (GSGP), with state-of-the-art automated machine learning (AutoML) tools, namely Auto-Keras, Auto-PyTorch and Auto-Sklearn. While all these techniques are compared to several baseline algorithms upon their introduction, research still lacks direct comparisons between them, especially of the GP approaches with state-of-the-art AutoML. This study intends to fill this gap in order to analyze the true potential of GP for AutoML. The performances of the different tools are assessed by applying them to 20 benchmark datasets from the field of imbalanced binary classification, an area that poses a frequent and challenging problem. The tools are compared across four categories (average performance, maximum performance, standard deviation within performance, and generalization ability), using the metrics F1-score, G-mean, and AUC for evaluation. The analysis finds that the GP techniques, while unable to completely outperform state-of-the-art AutoML, are already a very competitive alternative. Therefore, these advanced GP tools prove that they are able to provide a new and promising approach for practitioners developing machine learning (ML) models.
    DOI: 10.28991/ESJ-2023-07-04-021
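Two of the metrics named in the abstract above, F1-score and G-mean, can be computed directly from confusion-matrix counts. The sketch below uses the standard definitions (F1 as the harmonic mean of precision and recall, G-mean as the geometric mean of sensitivity and specificity); it is not taken from the article, the labels are made-up illustration data, and AUC is left out because it needs predicted scores rather than hard labels.

```python
import math


def confusion_counts(y_true, y_pred, positive=1):
    """Count true/false positives and negatives for a binary problem."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)
    return tp, fp, fn, tn


def f1_score(tp, fp, fn):
    """Harmonic mean of precision and recall, written in count form."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def g_mean(tp, fp, fn, tn):
    """Geometric mean of sensitivity (recall) and specificity."""
    sensitivity = tp / (tp + fn) if tp + fn else 0.0
    specificity = tn / (tn + fp) if tn + fp else 0.0
    return math.sqrt(sensitivity * specificity)


# Hypothetical imbalanced labels: 2 positives, 8 negatives
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]

tp, fp, fn, tn = confusion_counts(y_true, y_pred)
accuracy = (tp + tn) / len(y_true)  # 0.8
f1 = f1_score(tp, fp, fn)           # 0.5
gm = g_mean(tp, fp, fn, tn)         # sqrt(0.5 * 0.875), about 0.661
```

In this toy case accuracy is 0.8 even though only half of the positive class is recovered, which illustrates why imbalance-aware metrics such as F1 and G-mean are preferred for this kind of benchmark.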

    Theoretical Development: Extending the Flow Theory with Variables from the UTAUT2 Model

    Zhao, Y., & Bacao, F. (2020). Theoretical Development: Extending the Flow Theory with Variables from the UTAUT2 Model. In 2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020 (pp. 2427-2431). [9345049] (2020 IEEE 6th International Conference on Computer and Communications, ICCC 2020). Institute of Electrical and Electronics Engineers Inc. https://doi.org/10.1109/ICCC51575.2020.9345049
    With the dramatic worldwide development of innovative information technology, the business climate has changed from traditional commerce to virtual commerce over the past two decades. It is important to comprehensively understand customers' adoption intention of new technology for better business management and strategy involving information technology. Thus, this study extends the Flow theory by integrating variables from the revised Unified Theory of Acceptance and Use of Technology 2 (UTAUT2) model and satisfaction to propose a theoretical development for investigating the factors determining customers' behavioral intention to adopt new information technology. In addition, the proposed theoretical development contributes to relevant research on systematically understanding customers' adoption intention, from technological perceptions to mental cognition. Moreover, the proposed framework and measurement method can be applied as a reference for relevant researchers and stakeholders to investigate customers' behaviors for further research and future business management and strategy.

    A numeric-based machine learning design for detecting organized retail fraud in digital marketplaces

    Mutemi, A., & Bacao, F. (2023). A numeric-based machine learning design for detecting organized retail fraud in digital marketplaces. Scientific Reports, 13(1), 1-16. [12499]. https://doi.org/10.1038/s41598-023-38304-5
    Organized retail crime (ORC) is a significant issue for retailers, marketplace platforms, and consumers. Its prevalence and influence have increased rapidly in lockstep with the expansion of online commerce, digital devices, and communication platforms. Today, it is a costly affair, wreaking havoc on enterprises’ overall revenues and continually jeopardizing community security. These negative consequences are set to rocket to unprecedented heights as more people and devices connect to the Internet. Detecting and responding to these acts as early as possible is critical for protecting consumers and businesses while also keeping an eye on rising patterns and fraud. The issue of detecting fraud in general has been studied widely, especially in financial services, but studies focusing on organized retail crime are extremely rare in the literature. To contribute to the knowledge base in this area, we present a scalable machine learning strategy for detecting and isolating ORC listings posted on a prominent marketplace platform by merchants committing organized retail crimes or fraud. We employ a supervised learning approach to classify postings as fraudulent or real based on past data from buyer and seller behaviors and transactions on the platform. The proposed framework combines bespoke data preprocessing procedures, feature selection methods, and state-of-the-art class asymmetry resolution techniques to search for aligned classification algorithms capable of discriminating between fraudulent and legitimate listings in this context. Our best detection model obtains a recall score of 0.97 on the holdout set and 0.94 on the out-of-sample testing data set. We achieve these results based on a select set of 45 features out of 58.