3,009 research outputs found
Diffusion models for missing value imputation in tabular data
Missing value imputation in machine learning is the task of estimating the
missing values in the dataset accurately using available information. In this
task, several deep generative modeling methods have been proposed and
demonstrated their usefulness, e.g., generative adversarial imputation
networks. Recently, diffusion models have gained popularity because of their
effectiveness in the generative modeling task in images, texts, audio, etc. To
our knowledge, less attention has been paid to the investigation of the
effectiveness of diffusion models for missing value imputation in tabular data.
Based on recent development of diffusion models for time-series data
imputation, we propose a diffusion model approach called "Conditional
Score-based Diffusion Models for Tabular data" (TabCSDI). To effectively handle
categorical variables and numerical variables simultaneously, we investigate
three techniques: one-hot encoding, analog bits encoding, and feature
tokenization. Experimental results on benchmark datasets demonstrated the
effectiveness of TabCSDI compared with well-known existing methods, and also
emphasized the importance of the categorical embedding techniques.Comment: Accepted to Table Representation Learning Workshop at NeurIPS 2022.
Renamed proposed method name to TabCSD
Improving Missing Data Imputation with Deep Generative Models
Datasets with missing values are very common on industry applications, and
they can have a negative impact on machine learning models. Recent studies
introduced solutions to the problem of imputing missing values based on deep
generative models. Previous experiments with Generative Adversarial Networks
and Variational Autoencoders showed interesting results in this domain, but it
is not clear which method is preferable for different use cases. The goal of
this work is twofold: we present a comparison between missing data imputation
solutions based on deep generative models, and we propose improvements over
those methodologies. We run our experiments using known real life datasets with
different characteristics, removing values at random and reconstructing them
with several imputation techniques. Our results show that the presence or
absence of categorical variables can alter the selection of the best model, and
that some models are more stable than others after similar runs with different
random number generator seeds
- …