1,748 research outputs found
Tabular and latent space synthetic data generation: a literature review
Fonseca, J., & Bacao, F. (2023). Tabular and latent space synthetic data generation: a literature review. Journal of Big Data, 10, 1-37. [115]. https://doi.org/10.1186/s40537-023-00792-7
This research was supported by two research grants of the Portuguese Foundation for Science and Technology ("Fundação para a Ciência e a Tecnologia"), references SFRH/BD/151473/2021 and DSAIPA/DS/0116/2019, and by project UIDB/04152/2020 - Centro de Investigação em Gestão de Informação (MagIC).
The generation of synthetic data can be used for anonymization, regularization, oversampling, semi-supervised learning, self-supervised learning, and several other tasks. Such broad potential motivated the development of new algorithms, specialized in data generation for specific data formats and Machine Learning (ML) tasks. However, one of the most common data formats used in industrial applications, tabular data, is generally overlooked: literature analyses are scarce, state-of-the-art methods are spread across domains or ML tasks, and there is little to no distinction among the main types of mechanisms underlying synthetic data generation algorithms. In this paper, we analyze tabular and latent space synthetic data generation algorithms. Specifically, we propose a unified taxonomy as an extension and generalization of previous taxonomies, review 70 generation algorithms across six ML problems, group the main generation mechanisms identified into six categories, describe each type of generation mechanism, discuss metrics to evaluate the quality of synthetic data, and provide recommendations for future research. We expect this study to help researchers and practitioners identify relevant gaps in the literature and design better and more informed practices with synthetic data.
DOPING: Generative Data Augmentation for Unsupervised Anomaly Detection with GAN
Recently, the introduction of the generative adversarial network (GAN) and
its variants has enabled the generation of realistic synthetic samples, which
has been used for enlarging training sets. Previous work primarily focused on
data augmentation for semi-supervised and supervised tasks. In this paper, we
instead focus on unsupervised anomaly detection and propose a novel generative
data augmentation framework optimized for this task. In particular, we propose
to oversample infrequent normal samples - normal samples that occur with small
probability, e.g., rare normal events. We show that these samples are
responsible for false positives in anomaly detection. However, oversampling of
infrequent normal samples is challenging for real-world high-dimensional data
with multimodal distributions. To address this challenge, we propose to use a
GAN variant known as the adversarial autoencoder (AAE) to transform the
high-dimensional multimodal data distributions into low-dimensional unimodal
latent distributions with well-defined tail probability. Then, we
systematically oversample at the 'edge' of the latent distributions to increase
the density of infrequent normal samples. We show that our oversampling
pipeline is a unified one: it is generally applicable to datasets with
different complex data distributions. To the best of our knowledge, our method
is the first data augmentation technique focused on improving performance in
unsupervised anomaly detection. We validate our method by demonstrating
consistent improvements across several real-world datasets. Comment: Published as a conference paper at ICDM 2018 (IEEE International
Conference on Data Mining)
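The abstract above does not spell out the sampling rule, so the following is only an illustrative sketch of what "oversampling at the edge of a latent distribution" could look like. It assumes the latent prior is a standard Gaussian and places new latent vectors on the shell whose radius encloses a chosen fraction of the prior mass. The function name `sample_latent_edge`, the `quantile` parameter and the Monte Carlo radius estimate are hypothetical choices, not the authors' procedure; decoding the vectors back to data space would require a trained AAE decoder, which is not shown.

```python
import numpy as np


def sample_latent_edge(n_samples, latent_dim, quantile=0.95, seed=0):
    """Draw latent vectors on the tail shell of a standard Gaussian prior.

    Hypothetical helper: an assumption-based illustration, not the DOPING code.
    """
    rng = np.random.default_rng(seed)
    # Monte Carlo estimate of the radius that encloses `quantile` of the prior mass.
    prior_norms = np.linalg.norm(rng.standard_normal((100_000, latent_dim)), axis=1)
    radius = np.quantile(prior_norms, quantile)
    # Uniform random directions on the unit sphere, scaled out to the edge radius.
    directions = rng.standard_normal((n_samples, latent_dim))
    directions /= np.linalg.norm(directions, axis=1, keepdims=True)
    return radius * directions, radius


if __name__ == "__main__":
    z_edge, r = sample_latent_edge(n_samples=5, latent_dim=2)
    print(r, z_edge.shape)
```

In a full pipeline, each row of `z_edge` would be passed through the decoder to obtain a synthetic infrequent normal sample.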
SMOTE for Learning from Imbalanced Data: Progress and Challenges, Marking the 15-year Anniversary
The Synthetic Minority Oversampling Technique (SMOTE) preprocessing algorithm is
considered the "de facto" standard in the framework of learning from imbalanced data. This
is due to the simplicity of its design, as well as its robustness when applied
to different types of problems. Since its publication in 2002, SMOTE has proven
successful in a variety of applications from several different domains. SMOTE has also inspired
several approaches to counter the issue of class imbalance, and has also significantly
contributed to new supervised learning paradigms, including multilabel classification, incremental
learning, semi-supervised learning, and multi-instance learning, among others. It is a
standard benchmark for learning from imbalanced data. It is also featured in a number of
different software packages, from open source to commercial. In this paper, marking the
fifteen-year anniversary of SMOTE, we reflect on the SMOTE journey, discuss the current
state of affairs with SMOTE and its applications, and also identify the next set of challenges
to extend SMOTE for Big Data problems. This work has been partially supported by the Spanish Ministry of Science and Technology
under projects TIN2014-57251-P, TIN2015-68454-R and TIN2017-89517-P; the Project
887 BigDaP-TOOLS - Ayudas Fundación BBVA a Equipos de Investigación Científica 2016;
and the National Science Foundation (NSF) Grant IIS-1447795.
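For readers unfamiliar with the technique, the core idea of SMOTE is to create each synthetic minority sample by interpolating between a minority sample and one of its k nearest minority-class neighbors. The following is a minimal NumPy sketch of that idea, not a reference implementation; the function name `smote_sketch` and its parameters are chosen here for illustration, and it assumes numeric features and at least two minority samples.

```python
import numpy as np


def smote_sketch(X_min, n_new, k=5, seed=0):
    """Generate n_new synthetic minority samples by neighbor interpolation.

    Illustrative only: assumes numeric features and len(X_min) >= 2.
    """
    rng = np.random.default_rng(seed)
    X_min = np.asarray(X_min, dtype=float)
    n = len(X_min)
    k = min(k, n - 1)  # a sample cannot have more neighbors than other samples
    # Pairwise Euclidean distances between minority samples.
    dists = np.linalg.norm(X_min[:, None, :] - X_min[None, :, :], axis=2)
    np.fill_diagonal(dists, np.inf)  # exclude each sample from its own neighbors
    neighbors = np.argsort(dists, axis=1)[:, :k]
    # Pick a random base sample and one of its k nearest neighbors for each new point.
    base = rng.integers(0, n, size=n_new)
    picked = neighbors[base, rng.integers(0, k, size=n_new)]
    # Interpolate at a random position on the segment between the two samples.
    gap = rng.random((n_new, 1))
    return X_min[base] + gap * (X_min[picked] - X_min[base])


if __name__ == "__main__":
    minority = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
    print(smote_sketch(minority, n_new=3, k=2))
```

Because every synthetic point lies on a segment between two existing minority samples, the output never leaves the bounding box of the original minority data.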
Using LUCAS survey and Recurrent Neural Networks to produce LCLU classification based on a Satellite Image time series of Sentinel-2
Dissertation presented as the partial requirement for obtaining a Master's degree in Information Management, specialization in Knowledge Management and Business Intelligence.
The need for timely and accurate information about the territory has increased over the years, making
Land Cover Land Use (LCLU) mapping one of the most common applications of remote sensing.
Recently, the advances in satellite technology and the open access policies for remote sensing data
increased the interest in exploring satellite image time series. In addition, the attention of
researchers has shifted from standard machine learning algorithms (e.g., Support Vector Machines
and Random Forest) to Recurrent Neural Networks due to their ability to exploit sequential
information. However, acquiring reference data to train these algorithms is still a hurdle. This study
aims to evaluate the capability of a Gated Recurrent Unit in performing pixel-level LCLU classification
of a satellite image time series, using Sentinel-2 imagery and having the LUCAS survey as reference
data. To assess the performance of our model we compared it to state-of-the-art classifiers (SVM and
RF). Due to the imbalanced nature of the LUCAS survey, we applied oversampling to this dataset to
increase the performance of our models, testing three different oversampling techniques. The results
attained showed that Recurrent Neural Networks did not outperform the other state-of-the-art
algorithms, when trained with a limited number of sampling units, and that oversampling the LUCAS
survey increased the performance of all the classifiers. Finally, we were able to demonstrate that it is
possible to produce LCLU classification of satellite image time series using only open-source data by
using Sentinel-2 imagery and the LUCAS survey as reference data.
The Role of Synthetic Data in Improving Supervised Learning Methods: The Case of Land Use/Land Cover Classification
A thesis submitted in partial fulfillment of the requirements for the degree of Doctor in Information Management.
In remote sensing, Land Use/Land Cover (LULC) maps constitute important assets for
various applications, promoting environmental sustainability and good resource management.
However, their production continues to be a challenging task. There are various factors
that contribute towards the difficulty of generating accurate, timely updated LULC maps,
whether via automatic or photo-interpreted LULC mapping. Data preprocessing, being a
crucial step for any Machine Learning task, is particularly important in the remote sensing
domain due to the overwhelming amount of raw, unlabeled data continuously gathered
from multiple remote sensing missions. However, a significant part of the state-of-the-art
focuses on scenarios with full access to labeled training data with relatively balanced class
distributions. This thesis focuses on the challenges found in automatic LULC classification
tasks, specifically in data preprocessing tasks. We focus on the development of novel
Active Learning (AL) and imbalanced learning techniques, to improve ML performance in
situations with limited training data and/or the existence of rare classes. We also show
that many of the contributions presented are not only successful in remote sensing problems,
but also in various other multidisciplinary classification problems. The work presented
in this thesis used open access datasets to test the contributions made in imbalanced
learning and AL. All the data pulling, preprocessing and experiments are made available at
https://github.com/joaopfonseca/publications. The algorithmic implementations are made
available in the Python package ml-research at https://github.com/joaopfonseca/ml-research
- …