Airline Passenger Name Record Generation using Generative Adversarial Networks
Passenger Name Records (PNRs) are at the heart of the travel industry.
Created when an itinerary is booked, they contain travel and passenger
information. Airlines and other actors in the industry routinely exchange and
access each other's PNRs, which creates the challenge of using them without
infringing data ownership laws. To address this difficulty, we
propose a method to generate realistic synthetic PNRs using Generative
Adversarial Networks (GANs). Unlike other GAN applications, PNRs consist of
categorical and numerical features with missing/NaN values, which makes the use
of GANs challenging. We propose a solution based on Cramér GANs,
categorical feature embedding and a Cross-Net architecture. The method was
tested on a real PNR dataset, and evaluated in terms of distribution matching,
memorization, and performance of predictive models for two real business
problems: client segmentation and passenger nationality prediction. Results
show that the generated data matches well with the real PNRs without memorizing
them, and that it can be used to train models for real business applications.
Comment: ICML 2018 - workshop on Theoretical Foundations and Applications of
Deep Generative Models
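As a rough illustration of what distribution matching can mean for a single numerical feature, the energy distance that Cramér GANs build on can be estimated from two samples. The sketch below is a one-dimensional, pure-Python estimate; the function names and toy values are hypothetical, and it is not the authors' evaluation code.

```python
def mean_abs_diff(a, b):
    """Average |x - y| over all pairs (x in a, y in b)."""
    return sum(abs(x - y) for x in a for y in b) / (len(a) * len(b))


def energy_distance(real, fake):
    """Sample estimate of 2*E|X-Y| - E|X-X'| - E|Y-Y'| (biased V-statistic form)."""
    return (2.0 * mean_abs_diff(real, fake)
            - mean_abs_diff(real, real)
            - mean_abs_diff(fake, fake))


# Toy values standing in for one numerical PNR feature (hypothetical data).
real_values = [0.0, 1.0, 2.0]
close_fake = [0.0, 1.0, 2.0]
far_fake = [10.0, 11.0, 12.0]

print(energy_distance(real_values, close_fake))
print(energy_distance(real_values, far_fake))
```

The estimate is zero when the two samples coincide and grows as the generated values drift away from the real ones.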
Multiple Imputation for Biomedical Data using Monte Carlo Dropout Autoencoders
Due to complex experimental settings, missing values are common in biomedical
data. To handle this issue, many methods have been proposed, from ignoring
incomplete instances to various data imputation approaches. With the recent
rise of deep neural networks, the field of missing data imputation has oriented
towards modelling of the data distribution. This paper presents an approach
based on Monte Carlo dropout within (Variational) Autoencoders which offers not
only very good adaptation to the distribution of the data but also allows
generation of new data, adapted to each specific instance. The evaluation shows
that the imputation error and predictive similarity can be improved with the
proposed approach.
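The general idea of multiple imputation with Monte Carlo dropout can be sketched as follows: dropout stays active at inference time, so each forward pass yields a different plausible reconstruction of the missing entries. This is a minimal pure-Python sketch with a toy linear autoencoder; the weights, the zero-filling of missing inputs, and all function names are assumptions made for illustration, not the paper's implementation.

```python
import random


def dropout_forward(x, w_enc, w_dec, p_drop, rng):
    """One stochastic pass through a toy linear autoencoder with dropout kept on."""
    hidden = [sum(w * xi for w, xi in zip(row, x)) for row in w_enc]
    # Inverted dropout: zero each hidden unit with probability p_drop, rescale the rest.
    hidden = [0.0 if rng.random() < p_drop else h / (1.0 - p_drop) for h in hidden]
    return [sum(w * h for w, h in zip(row, hidden)) for row in w_dec]


def mc_dropout_impute(x, observed, w_enc, w_dec, n_draws=5, p_drop=0.5, seed=0):
    """Return n_draws completed copies of x; observed entries are never changed."""
    rng = random.Random(seed)
    # Missing entries are zero-filled before encoding (an assumption of this sketch).
    x_in = [xi if m else 0.0 for xi, m in zip(x, observed)]
    draws = []
    for _ in range(n_draws):
        recon = dropout_forward(x_in, w_enc, w_dec, p_drop, rng)
        draws.append([xi if m else ri for xi, ri, m in zip(x_in, recon, observed)])
    return draws


# Hypothetical instance with the second feature missing, and made-up weights.
x = [1.0, None, 3.0]
observed = [True, False, True]
w_enc = [[0.5, 0.1, 0.2], [0.3, 0.4, 0.1]]    # 2 hidden units x 3 inputs
w_dec = [[1.0, 0.0], [0.5, 0.5], [0.0, 1.0]]  # 3 outputs x 2 hidden units

for completed in mc_dropout_impute(x, observed, w_enc, w_dec):
    print(completed)
```

Each returned list is one candidate completion of the same instance; the spread across draws gives a rough sense of imputation uncertainty.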
PC-GAIN: Pseudo-label Conditional Generative Adversarial Imputation Networks for Incomplete Data
Datasets with missing values are very common in real world applications.
GAIN, a recently proposed deep generative model for missing data imputation,
has been shown to outperform many state-of-the-art methods. However, GAIN only uses
a reconstruction loss in the generator to minimize the imputation error of the
non-missing part, ignoring the potential category information which can reflect
the relationship between samples. In this paper, we propose a novel
unsupervised missing data imputation method named PC-GAIN, which utilizes
potential category information to further enhance the imputation power.
Specifically, we first propose a pre-training procedure to learn potential
category information contained in a subset of low-missing-rate data. Then an
auxiliary classifier is determined using the synthetic pseudo-labels. Further,
this classifier is incorporated into the generative adversarial framework to
help the generator to yield higher quality imputation results. The proposed
method can improve the imputation quality of GAIN significantly. Experimental
results on various benchmark datasets show that our method is also superior to
other baseline approaches. Our code is available at
https://github.com/WYu-Feng/pc-gain.
Comment: 18 pages
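The reconstruction loss on the non-missing part mentioned above can be illustrated with a mask vector. The following is a minimal sketch assuming continuous features, a squared-error loss, and a mask value of 1 for observed entries; the function names are hypothetical and the code is not taken from the GAIN or PC-GAIN repositories.

```python
def masked_reconstruction_loss(x, x_hat, mask):
    """Mean squared error computed over observed entries only (mask == 1)."""
    squared = [(xi - hi) ** 2 for xi, hi, m in zip(x, x_hat, mask) if m == 1]
    return sum(squared) / len(squared) if squared else 0.0


def combine(x, x_hat, mask):
    """Keep observed values and fill missing ones with the generator output."""
    return [xi if m == 1 else hi for xi, hi, m in zip(x, x_hat, mask)]


# Hypothetical row: the second feature is missing, so it never enters the loss.
x = [1.0, None, 3.0]
x_hat = [1.5, 9.0, 3.0]  # made-up generator output
mask = [1, 0, 1]

print(masked_reconstruction_loss(x, x_hat, mask))  # 0.125
print(combine(x, x_hat, mask))                     # [1.0, 9.0, 3.0]
```

Because the loss only sees observed entries, it says nothing about how samples relate to one another, which is the gap the pseudo-label classifier is meant to address.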
Conditional Wasserstein GAN-based Oversampling of Tabular Data for Imbalanced Learning
Class imbalance is a common problem in supervised learning and impedes the
predictive performance of classification models. Popular countermeasures
include oversampling the minority class. Standard methods like SMOTE rely on
finding nearest neighbours and linear interpolations which are problematic in
case of high-dimensional, complex data distributions. Generative Adversarial
Networks (GANs) have been proposed as an alternative method for generating
artificial minority examples as they can model complex distributions. However,
prior research on GAN-based oversampling does not incorporate recent
advancements from the literature on generating realistic tabular data with
GANs. Previous studies also focus on numerical variables whereas categorical
features are common in many business applications of classification methods
such as credit scoring. The paper proposes an oversampling method based on a
conditional Wasserstein GAN that can effectively model tabular datasets with
numerical and categorical variables and pays special attention to the
down-stream classification task through an auxiliary classifier loss. We
benchmark our method against standard oversampling methods and the imbalanced
baseline on seven real-world datasets. Empirical results evidence the
competitiveness of GAN-based oversampling.
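For context, the nearest-neighbour interpolation step that SMOTE-style oversampling relies on can be sketched in a few lines. This is a simplified pure-Python illustration for numerical features only, with hypothetical names and toy data; it is neither the reference SMOTE implementation nor the paper's GAN-based method.

```python
import random


def smote_sample(minority, k, rng):
    """Create one synthetic point between a minority point and one of its k nearest neighbours."""
    base = rng.choice(minority)
    others = [p for p in minority if p is not base]
    # Rank the remaining minority points by squared Euclidean distance to the base point.
    neighbours = sorted(
        others, key=lambda p: sum((a - b) ** 2 for a, b in zip(base, p))
    )[:k]
    neighbour = rng.choice(neighbours)
    lam = rng.random()  # interpolation weight in [0, 1)
    return [a + lam * (b - a) for a, b in zip(base, neighbour)]


# Toy minority class with two numerical features (hypothetical data).
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
rng = random.Random(0)
print(smote_sample(minority, k=2, rng=rng))
```

Every synthetic point lies on a straight line between two existing minority points, which is why this approach has no natural counterpart for categorical features and can struggle with complex, high-dimensional distributions.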