669 research outputs found
Wasserstein Divergence for GANs
In many domains of computer vision, generative adversarial networks (GANs)
have achieved great success, among which the family of Wasserstein GANs (WGANs)
is considered to be state-of-the-art due to the theoretical contributions and
competitive qualitative performance. However, it is very challenging to approximate the k-Lipschitz constraint required by the Wasserstein-1 metric (W-met). In this paper, we propose a novel Wasserstein divergence (W-div), which is a relaxed version of W-met and does not require the k-Lipschitz constraint. As a concrete application, we introduce a Wasserstein divergence objective for GANs (WGAN-div), which can faithfully
approximate W-div through optimization. Under various settings, including
progressive growing training, we demonstrate the stability of the proposed
WGAN-div owing to its theoretical and practical advantages over WGANs. Also, we
study the quantitative and visual performance of WGAN-div on standard image
synthesis benchmarks of computer vision, showing the superior performance of
WGAN-div compared to the state-of-the-art methods. Comment: accepted by ECCV 2018; corrected a minor error.
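For readers who want to see how the relaxed constraint shows up in training code, the sketch below gives a PyTorch-style critic loss in the spirit of WGAN-div: a Wasserstein term plus a power-p gradient penalty in place of a hard Lipschitz constraint. The interpolation scheme and the default values k=2, p=6 are illustrative assumptions, not a verbatim reproduction of the paper's algorithm.

```python
import torch

def wgan_div_critic_loss(critic, real, fake, k=2.0, p=6.0):
    """Critic loss with a power-p gradient penalty instead of a hard
    Lipschitz constraint; k and p are tunable hyperparameters."""
    # Wasserstein term: score real samples higher than fakes.
    loss_w = critic(fake).mean() - critic(real).mean()

    # Random interpolation between real and fake samples (assumes image
    # tensors of shape (B, C, H, W)).
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)

    # Gradient of the critic output w.r.t. the interpolated inputs.
    grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]

    # Penalize the p-th power of the gradient norm (no "target norm of 1"
    # as in WGAN-GP).
    penalty = grad.flatten(1).norm(2, dim=1).pow(p).mean()
    return loss_w + k * penalty
```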
Sliced Wasserstein Generative Models
In generative modeling, the Wasserstein distance (WD) has emerged as a useful
metric to measure the discrepancy between generated and real data
distributions. Unfortunately, it is challenging to approximate the WD of
high-dimensional distributions. In contrast, the sliced Wasserstein distance
(SWD) factorizes high-dimensional distributions into their multiple
one-dimensional marginal distributions and is thus easier to approximate. In
this paper, we introduce novel approximations of the primal and dual SWD.
Instead of using a large number of random projections, as is done by
conventional SWD approximation methods, we propose to approximate SWDs with a
small number of parameterized orthogonal projections in an end-to-end deep
learning fashion. As concrete applications of our SWD approximations, we design
two types of differentiable SWD blocks to equip modern generative
frameworks: Auto-Encoders (AE) and Generative Adversarial Networks (GAN). In
the experiments, we not only show the superiority of the proposed generative
models on standard image synthesis benchmarks, but also demonstrate the
state-of-the-art performance on challenging high resolution image and video
generation in an unsupervised manner. Comment: This paper is accepted by CVPR 2019; it was accidentally uploaded as a new submission (arXiv:1904.05408, which has been withdrawn). The code is available at https://github.com/musikisomorphie/swd.gi
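For context, the conventional random-projection estimator that the paper improves upon can be written in a few lines; a minimal NumPy sketch (assuming equally sized point clouds and the Wasserstein-2 cost) is shown below.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=128, seed=None):
    """Monte-Carlo estimate of the sliced Wasserstein-2 distance between two
    equally sized point clouds x, y of shape (n, d) via random projections."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]

    # Random directions on the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project onto each direction and sort: in 1-D the optimal transport plan
    # is the monotone coupling of sorted samples.
    xp = np.sort(x @ theta.T, axis=0)
    yp = np.sort(y @ theta.T, axis=0)
    return np.mean((xp - yp) ** 2)
```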
A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data
Crash data is often greatly imbalanced, with the majority of crashes being
non-fatal crashes, and only a small number being fatal crashes due to their
rarity. Such a data imbalance issue poses a challenge for crash severity modeling, since models struggle to fit and interpret fatal crash outcomes with very limited samples. Usually, such data imbalance issues are addressed by data resampling methods, such as under-sampling and over-sampling techniques. However, most traditional and deep learning-based data resampling methods, such as the synthetic minority oversampling technique (SMOTE) and Generative Adversarial Networks (GAN), are designed to process continuous variables. Though some resampling methods have been improved to handle both continuous and discrete variables, they may have difficulty dealing with the collapse issue associated with sparse discrete risk factors. Moreover, there is a lack of
comprehensive studies that compare the performance of various resampling
methods in crash severity modeling. To address the aforementioned issues, the
current study proposes a crash data generation method based on the Conditional
Tabular GAN. After data balancing, a crash severity model is employed to
estimate the performance of classification and interpretation. A comparative
study is conducted to assess classification accuracy and distribution
consistency of the proposed generation method using a 4-year imbalanced crash
dataset collected in Washington State, U.S. Additionally, Monte Carlo
simulation is employed to estimate the performance of parameter and probability
estimation in both two- and three-class imbalance scenarios. The results
indicate that using synthetic data generated by CTGAN-RU for crash severity
modeling outperforms using original data or synthetic data generated by other
resampling methods.
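As a rough illustration of CTGAN-based balancing (not the paper's exact CTGAN-RU pipeline, and with a hypothetical file name and column names), one might oversample the fatal class with the open-source ctgan package as sketched below.

```python
import pandas as pd
from ctgan import CTGAN  # open-source CTGAN implementation (ctgan package)

# Hypothetical crash table; the file name and column names are placeholders,
# not those used in the Washington State dataset.
crashes = pd.read_csv("crashes.csv")
discrete_cols = ["severity", "weather", "road_type", "lighting"]

# Fit CTGAN on the minority (fatal) class only, then synthesize enough rows
# to roughly balance the classes.
fatal = crashes[crashes["severity"] == "fatal"]
model = CTGAN(epochs=300)
model.fit(fatal, discrete_cols)

n_needed = int((crashes["severity"] != "fatal").sum()) - len(fatal)
synthetic_fatal = model.sample(max(n_needed, 0))
balanced = pd.concat([crashes, synthetic_fatal], ignore_index=True)
```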
IMPROVING STROKE PREDICTION ON IMBALANCED CLINICAL DATA USING CTGAN AND TVAE: A SYNTHETIC DATA APPROACH
Synthetic data (SD) have been evaluated and adopted in different domains and areas, especially in health. To conduct this study, we chose tabular data on stroke prediction, available in [3]. The dataset contains 11 clinical features, including the last column, positive = 1 and negative = 0 for stroke. We chose this dataset because of its imbalance, which makes it a suitable case for applying generative techniques and assessing how well the synthetic data resemble the real data. For generating SD, we use two techniques, the conditional tabular GAN (CTGAN) and the tabular variational autoencoder (TVAE), trained with different numbers of epochs and batch sizes. We further evaluated the results with three Machine Learning (ML) models as a benchmark against real data. The results highlight that data generated with CTGAN (epochs=1500, batch size=500) performs better, with an accuracy score of 0.995 on random forest (RF) and Support Vector Machine (SVM).
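A minimal sketch of the benchmarking step is shown below, assuming a train-on-synthetic, test-on-real protocol and placeholder data standing in for the stroke dataset and the fitted CTGAN/TVAE generators; the study's exact evaluation setup may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholders standing in for the real stroke table and for rows sampled
# from a fitted CTGAN/TVAE synthesizer.
X_real, y_real = make_classification(n_samples=5000, n_features=11,
                                     weights=[0.95, 0.05], random_state=0)
X_synth, y_synth = make_classification(n_samples=5000, n_features=11,
                                       random_state=1)

# Hold out part of the real data for evaluation.
_, X_test, _, y_test = train_test_split(X_real, y_real, test_size=0.2,
                                        stratify=y_real, random_state=0)

for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("SVM", SVC())]:
    clf.fit(X_synth, y_synth)      # train on synthetic data
    pred = clf.predict(X_test)     # evaluate against held-out real data
    print(name, "accuracy:", round(accuracy_score(y_test, pred), 3))
```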
PrivateCTGAN: adapting GAN for privacy-aware tabular data sharing
This research addresses the challenge of generating synthetic data that resembles real-world data while preserving privacy. With privacy laws protecting sensitive information such as healthcare data, accessing sufficient training data becomes difficult, making Machine Learning models harder to train and worse overall. Recently, there has been increased interest in the use of Generative Adversarial Networks (GAN) to generate synthetic data, since they enable researchers to generate more data to train their models. GANs, however, may not be suitable for privacy-sensitive data, since they have no concern for the privacy of the generated data. We propose modifying the known Conditional Tabular GAN (CTGAN) model by incorporating a privacy-aware loss function, resulting in the Private CTGAN (PCTGAN) method. Several experiments were carried out using 10 public domain classification datasets, comparing PCTGAN with CTGAN and the state-of-the-art privacy-preserving model, the Differential Privacy CTGAN (DP-CTGAN). The results demonstrate that PCTGAN enables users to fine-tune the privacy-fidelity trade-off through its parameters and, if desired, to achieve a higher level of privacy. This work was partially funded by projects AISym4Med (101095387) supported by Horizon Europe Cluster 1: Health, ConnectedHealth (no. 46858), supported by Competitiveness and Internationalisation Operational Programme (POCI) and Lisbon Regional Operational Programme (LISBOA 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF) and NextGenAI - Center for Responsible AI (2022-C05i0102-02), supported by IAPMEI, and also by FCT plurianual funding for 2020-2023 of LIACC (UIDB/00027/2020_UIDP/00027/2020).
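The abstract does not give the form of the privacy-aware loss; as a loose illustration only, the sketch below adds a hypothetical nearest-neighbour penalty to a standard generator loss, with lam standing in for a tunable privacy parameter. It is not the PCTGAN objective itself.

```python
import torch

def private_generator_loss(d_fake_scores, fake, real_batch, lam=1.0):
    """Adversarial term plus a hypothetical privacy penalty that discourages
    synthetic rows from landing too close to individual real rows."""
    adv = -d_fake_scores.mean()

    # Distance from each synthetic row to its nearest real row (both tensors
    # are (batch, n_features)); being very close is a crude proxy for
    # memorization risk.
    nearest = torch.cdist(fake, real_batch).min(dim=1).values
    privacy_penalty = torch.relu(1.0 - nearest).mean()

    # lam trades off fidelity (adv) against privacy (penalty).
    return adv + lam * privacy_penalty
```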
Adversarial Machine Learning-Enabled Anonymization of OpenWiFi Data
Data privacy and protection through anonymization is a critical concern for network operators and data owners before data is forwarded for other possible uses. With the adoption of Artificial Intelligence (AI), data anonymization increases the likelihood of concealing sensitive information, preventing data leakage and information loss. OpenWiFi networks are vulnerable to any adversary trying to gain access to or knowledge of traffic, regardless of the knowledge possessed by the data owners. The risk of discovering actual traffic information is addressed by applying a conditional tabular generative adversarial network (CTGAN). CTGAN yields synthetic data that looks like the actual data while keeping the sensitive information of the actual data hidden. In this paper, the similarity of synthetic and actual data is assessed using clustering algorithms, followed by a comparison of performance on unsupervised cluster validation metrics. The well-known K-means algorithm outperforms the other algorithms in the similarity assessment of synthetic data against real data, achieving scores of 0.634, 23714.57, and 0.598 on the Silhouette, Calinski-Harabasz, and Davies-Bouldin metrics, respectively. A comparative analysis of the validation scores across several algorithms shows that K-means is the most suitable unsupervised clustering algorithm, supporting the use of synthetic data as a replacement for real data. Hence, the experimental results show the viability of using CTGAN-generated synthetic data, in lieu of publishing anonymized data, for various applications. Comment: 8 pages, 4 figures, "Wireless World Research and Trends" Magazine. The initial version was presented at the 47th Wireless World Research Forum.
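The three validation metrics reported above are available in scikit-learn; a minimal sketch with a random placeholder matrix standing in for the CTGAN-generated OpenWiFi features is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Random placeholder matrix standing in for CTGAN-generated (or real)
# OpenWiFi traffic features.
X = np.random.default_rng(0).normal(size=(2000, 8))

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:       ", silhouette_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))
```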
Robustness Analysis of Deep Learning Models for Population Synthesis
Deep generative models have become useful for synthetic data generation,
particularly population synthesis. The models implicitly learn the probability
distribution of a dataset and can draw samples from that distribution. Several
models have been proposed, but their performance is only tested on a single
cross-sectional sample. The implementation of population synthesis on single
datasets is seen as a drawback that needs further studies to explore the
robustness of the models on multiple datasets. While comparing with the real
data can increase trust and interpretability of the models, techniques to
evaluate deep generative models' robustness for population synthesis remain
underexplored. In this study, we present bootstrap confidence intervals for deep generative models, an approach that computes efficient confidence intervals for mean prediction errors to evaluate the robustness of the models
to multiple datasets. Specifically, we adopt the tabular-based Composite Travel
Generative Adversarial Network (CTGAN) and Variational Autoencoder (VAE) to estimate the distribution of the population by generating agents that have tabular data, using several samples over time from the same study area. The models are implemented on multiple travel diaries of the Montreal Origin-Destination Survey of 2008, 2013, and 2018, and the predictive performance is compared under varying sample sizes from multiple surveys. Results show that the predictive errors of CTGAN have narrower confidence intervals than those of VAE, indicating its robustness to multiple datasets of varying sample sizes. In addition, the evaluation of model robustness against varying sample sizes shows only a minimal decrease in model performance as the sample size decreases.
This study directly supports agent-based modelling by enabling finer synthetic
generation of populations in a reliable environment. Comment: arXiv admin note: text overlap with arXiv:2203.03489, arXiv:1909.07689 by other authors.
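A generic percentile-bootstrap sketch of the confidence-interval idea is shown below; the error quantity and the exact resampling procedure used in the study are not specified here and are treated as assumptions.

```python
import numpy as np

def bootstrap_ci(errors, n_boot=2000, alpha=0.05, seed=None):
    """Percentile-bootstrap confidence interval for a mean error.

    `errors` is any per-sample error measure comparing synthetic agents with
    the survey data; the exact quantity used in the study is an assumption.
    """
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors)
    boot_means = np.array([
        rng.choice(errors, size=errors.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return errors.mean(), (lo, hi)

# Narrower intervals across resampled datasets indicate a more robust model.
mean_err, (lo, hi) = bootstrap_ci(np.random.default_rng(1).normal(0.10, 0.02, 200))
print(f"mean error {mean_err:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```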
Addressing the data bottleneck in medical deep learning models using a human-in-the-loop machine learning approach
Any machine learning (ML) model is highly dependent on the data it uses for learning, and this is even more important in the case of deep learning models. The problem is a data bottleneck, i.e. the difficulty in obtaining an adequate number of cases and quality data. Another issue is improving the learning process, which can be done by actively introducing experts into the learning loop, in what is known as human-in-the-loop (HITL) ML. We describe an ML model based on a neural network in which HITL techniques were used to resolve the data bottleneck problem for the treatment of pancreatic cancer. We first augmented the dataset using synthetic cases created by a generative adversarial network. We then launched an active learning (AL) process involving human experts as oracles to label both new cases and cases found by the network to be suspect. This AL process was carried out simultaneously with an interactive ML process in which feedback was obtained from humans in order to develop better synthetic cases for each iteration of training. We discuss the challenges involved in including humans in the learning process, especially in relation to human–computer interaction, which is acquiring great importance in building ML models and can condition the success of a HITL approach. This paper also discusses the methodological approach adopted to address these challenges. This work has been supported by the State Research Agency of the Spanish Government (Grant PID2019-107194GB-I00/AEI/10.13039/501100011033) and by the Xunta de Galicia (Grant ED431C 2022/44), supported in turn by the EU European Regional Development Fund. We wish to acknowledge support received from the Centro de Investigación de Galicia CITIC, funded by the Xunta de Galicia and the European Regional Development Fund (Galicia 2014–2020 Program; Grant ED431G 2019/01).
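As a loose illustration of the active learning loop (not the authors' implementation), the sketch below queries a human oracle for the cases a classifier is least certain about; label_by_expert is a hypothetical placeholder for the clinician's judgement.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def label_by_expert(case):
    """Placeholder for the human oracle: in the HITL setting a clinician
    would inspect the case and return its label."""
    raise NotImplementedError

def active_learning_round(X_labeled, y_labeled, X_pool, n_query=10):
    """One uncertainty-sampling round: train, query the expert on the
    least-certain pool cases, and fold the new labels back in."""
    clf = RandomForestClassifier(random_state=0).fit(X_labeled, y_labeled)

    proba = clf.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)           # low max-prob = uncertain
    query_idx = np.argsort(uncertainty)[-n_query:]  # most uncertain cases

    new_labels = [label_by_expert(X_pool[i]) for i in query_idx]
    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    return clf, X_labeled, y_labeled, X_pool
```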
Federated Learning for Private Synthetic Data Generation
The digital transformation of healthcare has gained momentum in recent years, as shown by the introduction of Electronic Health Record (EHR) systems and of digital infrastructures for data exchange between all actors in the healthcare sector. In Germany, insured persons will soon have the option of voluntarily donating the data stored in their electronic health record for medical research purposes. The secondary use of real-world medical data holds great potential, for example for monitoring long-term outcomes associated with specific treatments, but it also raises considerable privacy concerns, since health data is particularly sensitive due to the risk of stigmatization or discrimination resulting from misuse.
For this reason, various Privacy-Enhancing Technologies (PETs) have been proposed in the literature. Differential Privacy (DP), for example, makes it possible to limit the privacy impact of data analyses by injecting carefully calibrated noise. With recent advances in machine learning, synthetic data generation (SDG) using Generative Adversarial Networks (GANs) has gained attention as a privacy-preserving technique. Furthermore, Federated Learning (FL) allows machine learning models to be trained in a decentralized manner. By combining DP, SDG, and FL, synthetic data can be generated collaboratively that offers both strong privacy guarantees and added value for research, while the training data never has to be shared with a central entity.
This master's thesis presents a novel approach called DP-Fed-CTGAN for generating synthetic tabular data, which is based on FL and satisfies strict DP guarantees. Compared with existing approaches, DP-Fed-CTGAN aims to minimize the amount of information clients must reveal about their local training datasets during the FL procedure. The performance of the open-source implementation of DP-Fed-CTGAN is evaluated using common metrics on both medical and widely used machine learning datasets. The results show that DP-Fed-CTGAN not only achieves comparable utility and improved realism compared to the centralized DP-CTGAN approach, but can also help increase patients' acceptance of data donation and facilitate compliance with data protection laws.
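As a generic illustration of how DP and FL can be combined (a DP-FedAvg-style sketch, not the DP-Fed-CTGAN protocol itself), the snippet below clips client updates and adds Gaussian noise before aggregating them into the global model; the clipping bound and noise multiplier are illustrative parameters.

```python
import numpy as np

def dp_federated_round(global_params, client_updates, clip=1.0,
                       noise_multiplier=1.0, seed=None):
    """One aggregation round: clip each client's parameter update to an L2
    bound and add Gaussian noise before averaging into the global model."""
    rng = np.random.default_rng(seed)
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip / (norm + 1e-12)))  # L2 clipping
    avg_update = np.mean(clipped, axis=0)
    # Gaussian noise scaled to the clipping bound and the number of clients.
    noise = rng.normal(0.0, noise_multiplier * clip / len(client_updates),
                       size=avg_update.shape)
    return global_params + avg_update + noise
```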
An improved CTGAN for data processing method of imbalanced disk failure
Disk failure data is scarce, and the number of failure samples is greatly imbalanced against the number of normal samples. The existing Conditional Tabular Generative Adversarial Network (CTGAN) deep learning method has been proven effective in addressing imbalanced disk failure data, but CTGAN cannot learn the internal information of disk failure data very well. In this paper, we propose a fault diagnosis method based on an improved CTGAN, in which a classifier for specific category discrimination is added and the discriminator of the generative adversarial network is built on a residual network. We name it the Residual Conditional Tabular Generative Adversarial Network (RCTGAN). First, a residual network is utilized to enhance the stability of the system. RCTGAN uses a small amount of real failure data to synthesize fake failure data; the synthesized data is then mixed with the real data to balance the amounts of normal and failure data; finally, four classifier models (multilayer perceptron, support vector machine, decision tree, random forest) are trained on the balanced dataset, and their performance is evaluated using the G-mean. The experimental results show that the data synthesized by RCTGAN can further improve the fault diagnosis accuracy of the classifiers.
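As a generic illustration of the "discriminator based on a residual network" idea (the exact RCTGAN architecture is not given in the abstract), a simple residual block for a tabular discriminator might look as follows in PyTorch.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Fully connected residual block: two linear layers wrapped by a skip
    connection, which helps stabilize discriminator training."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.LeakyReLU(0.2),
            nn.Linear(dim, dim),
        )
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(x + self.net(x))  # skip connection around the block

# Example: a small discriminator trunk for tabular inputs of width 64.
disc = nn.Sequential(nn.Linear(64, 128), ResidualBlock(128),
                     ResidualBlock(128), nn.Linear(128, 1))
out = disc(torch.randn(32, 64))
```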
