669 research outputs found
Wasserstein Divergence for GANs
In many domains of computer vision, generative adversarial networks (GANs)
have achieved great success, among which the family of Wasserstein GANs (WGANs)
is considered to be state-of-the-art due to the theoretical contributions and
competitive qualitative performance. However, it is very challenging to approximate the k-Lipschitz constraint required by the Wasserstein-1 metric (W-met). In this paper, we propose a novel Wasserstein divergence (W-div), which is a relaxed version of W-met and does not require the k-Lipschitz constraint. As a concrete application, we introduce a Wasserstein divergence objective for GANs (WGAN-div), which can faithfully
approximate W-div through optimization. Under various settings, including
progressive growing training, we demonstrate the stability of the proposed
WGAN-div owing to its theoretical and practical advantages over WGANs. Also, we
study the quantitative and visual performance of WGAN-div on standard image
synthesis benchmarks of computer vision, showing the superior performance of
WGAN-div compared to the state-of-the-art methods. Comment: accepted by ECCV 2018; corrected a minor error.
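For readers who want to see how the relaxed constraint shows up in training code, the sketch below gives a PyTorch-style critic loss in the spirit of WGAN-div: a Wasserstein term plus a power-p gradient penalty in place of a hard Lipschitz constraint. The interpolation scheme and the default values k=2, p=6 are illustrative assumptions, not a verbatim reproduction of the paper's algorithm.

```python
import torch

def wgan_div_critic_loss(critic, real, fake, k=2.0, p=6.0):
    """Critic loss with a power-p gradient penalty instead of a hard
    Lipschitz constraint; k and p are tunable hyperparameters."""
    # Wasserstein term: score real samples higher than fakes.
    loss_w = critic(fake).mean() - critic(real).mean()

    # Random interpolation between real and fake samples (assumes image
    # tensors of shape (B, C, H, W)).
    eps = torch.rand(real.size(0), 1, 1, 1, device=real.device)
    x_hat = (eps * real + (1 - eps) * fake).detach().requires_grad_(True)

    # Gradient of the critic output w.r.t. the interpolated inputs.
    grad = torch.autograd.grad(critic(x_hat).sum(), x_hat, create_graph=True)[0]

    # Penalize the p-th power of the gradient norm (no "target norm of 1"
    # as in WGAN-GP).
    penalty = grad.flatten(1).norm(2, dim=1).pow(p).mean()
    return loss_w + k * penalty
```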
Sliced Wasserstein Generative Models
In generative modeling, the Wasserstein distance (WD) has emerged as a useful
metric to measure the discrepancy between generated and real data
distributions. Unfortunately, it is challenging to approximate the WD of
high-dimensional distributions. In contrast, the sliced Wasserstein distance
(SWD) factorizes high-dimensional distributions into their multiple
one-dimensional marginal distributions and is thus easier to approximate. In
this paper, we introduce novel approximations of the primal and dual SWD.
Instead of using a large number of random projections, as is done by
conventional SWD approximation methods, we propose to approximate SWDs with a
small number of parameterized orthogonal projections in an end-to-end deep
learning fashion. As concrete applications of our SWD approximations, we design
two types of differentiable SWD blocks to equip modern generative
frameworks: Auto-Encoders (AE) and Generative Adversarial Networks (GAN). In
the experiments, we not only show the superiority of the proposed generative
models on standard image synthesis benchmarks, but also demonstrate the
state-of-the-art performance on challenging high resolution image and video
generation in an unsupervised manner. Comment: This paper is accepted by CVPR 2019; it was accidentally uploaded as a new submission (arXiv:1904.05408, which has been withdrawn). The code is available at https://github.com/musikisomorphie/swd.gi
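For context, the conventional random-projection estimator that the paper improves upon can be written in a few lines; a minimal NumPy sketch (assuming equally sized point clouds and the Wasserstein-2 cost) is shown below.

```python
import numpy as np

def sliced_wasserstein(x, y, n_projections=128, seed=None):
    """Monte-Carlo estimate of the sliced Wasserstein-2 distance between two
    equally sized point clouds x, y of shape (n, d) via random projections."""
    rng = np.random.default_rng(seed)
    d = x.shape[1]

    # Random directions on the unit sphere.
    theta = rng.normal(size=(n_projections, d))
    theta /= np.linalg.norm(theta, axis=1, keepdims=True)

    # Project onto each direction and sort: in 1-D the optimal transport plan
    # is the monotone coupling of sorted samples.
    xp = np.sort(x @ theta.T, axis=0)
    yp = np.sort(y @ theta.T, axis=0)
    return np.mean((xp - yp) ** 2)
```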
A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data
Crash data is often greatly imbalanced, with the majority of crashes being
non-fatal crashes, and only a small number being fatal crashes due to their
rarity. Such a data imbalance issue poses a challenge for crash severity modeling, since models struggle to fit and interpret fatal crash outcomes with very limited samples. Usually, such data imbalance issues are addressed by data resampling methods, such as under-sampling and over-sampling techniques. However, most traditional and deep learning-based data resampling methods, such as the synthetic minority oversampling technique (SMOTE) and Generative Adversarial Networks (GAN), are designed to process continuous variables. Though some resampling methods have been improved to handle both continuous and discrete variables, they may have difficulty dealing with the collapse issue associated with sparse discrete risk factors. Moreover, there is a lack of
comprehensive studies that compare the performance of various resampling
methods in crash severity modeling. To address the aforementioned issues, the
current study proposes a crash data generation method based on the Conditional
Tabular GAN. After data balancing, a crash severity model is employed to
estimate the performance of classification and interpretation. A comparative
study is conducted to assess classification accuracy and distribution
consistency of the proposed generation method using a 4-year imbalanced crash
dataset collected in Washington State, U.S. Additionally, Monte Carlo
simulation is employed to estimate the performance of parameter and probability
estimation in both two- and three-class imbalance scenarios. The results
indicate that using synthetic data generated by CTGAN-RU for crash severity
modeling outperforms using original data or synthetic data generated by other
resampling methods.
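As a rough illustration of CTGAN-based balancing (not the paper's exact CTGAN-RU pipeline, and with a hypothetical file name and column names), one might oversample the fatal class with the open-source ctgan package as sketched below.

```python
import pandas as pd
from ctgan import CTGAN  # open-source CTGAN implementation (ctgan package)

# Hypothetical crash table; the file name and column names are placeholders,
# not those used in the Washington State dataset.
crashes = pd.read_csv("crashes.csv")
discrete_cols = ["severity", "weather", "road_type", "lighting"]

# Fit CTGAN on the minority (fatal) class only, then synthesize enough rows
# to roughly balance the classes.
fatal = crashes[crashes["severity"] == "fatal"]
model = CTGAN(epochs=300)
model.fit(fatal, discrete_cols)

n_needed = int((crashes["severity"] != "fatal").sum()) - len(fatal)
synthetic_fatal = model.sample(max(n_needed, 0))
balanced = pd.concat([crashes, synthetic_fatal], ignore_index=True)
```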
IMPROVING STROKE PREDICTION ON IMBALANCED CLINICAL DATA USING CTGAN AND TVAE: A SYNTHETIC DATA APPROACH
Synthetic data (SD) have been evaluated and adopted in different domains and areas, especially in health. To conduct this study, we chose tabular data on stroke prediction, available in [3]. The dataset contains 11 clinical features, including the last column, positive = 1 and negative = 0 for stroke. We chose this dataset because of its imbalance, which makes it a suitable case for applying generative techniques and assessing how well the synthetic data resemble the real data. For generating SD, we use two techniques, the conditional tabular GAN (CTGAN) and the tabular variational autoencoder (TVAE), trained with different numbers of epochs and batch sizes. We further evaluated the results with three Machine Learning (ML) models as a benchmark against real data. The results highlight that data generated with CTGAN (epochs=1500, batch size=500) performs better, with an accuracy score of 0.995 on random forest (RF) and Support Vector Machine (SVM).
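A minimal sketch of the benchmarking step is shown below, assuming a train-on-synthetic, test-on-real protocol and placeholder data standing in for the stroke dataset and the fitted CTGAN/TVAE generators; the study's exact evaluation setup may differ.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# Placeholders standing in for the real stroke table and for rows sampled
# from a fitted CTGAN/TVAE synthesizer.
X_real, y_real = make_classification(n_samples=5000, n_features=11,
                                     weights=[0.95, 0.05], random_state=0)
X_synth, y_synth = make_classification(n_samples=5000, n_features=11,
                                       random_state=1)

# Hold out part of the real data for evaluation.
_, X_test, _, y_test = train_test_split(X_real, y_real, test_size=0.2,
                                        stratify=y_real, random_state=0)

for name, clf in [("RF", RandomForestClassifier(random_state=0)),
                  ("SVM", SVC())]:
    clf.fit(X_synth, y_synth)      # train on synthetic data
    pred = clf.predict(X_test)     # evaluate against held-out real data
    print(name, "accuracy:", round(accuracy_score(y_test, pred), 3))
```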
PrivateCTGAN: adapting GAN for privacy-aware tabular data sharing
This research addresses the challenge of generating synthetic data that resembles real-world data while preserving privacy. With privacy laws protecting sensitive information such as healthcare data, accessing sufficient training data becomes difficult, making Machine Learning models harder to train and worse overall. Recently, there has been increased interest in the use of Generative Adversarial Networks (GAN) to generate synthetic data, since they enable researchers to generate more data to train their models. GANs, however, may not be suitable for privacy-sensitive data, since they have no concern for the privacy of the generated data. We propose modifying the known Conditional Tabular GAN (CTGAN) model by incorporating a privacy-aware loss function, resulting in the Private CTGAN (PCTGAN) method. Several experiments were carried out using 10 public domain classification datasets, comparing PCTGAN with CTGAN and the state-of-the-art privacy-preserving model, the Differential Privacy CTGAN (DP-CTGAN). The results demonstrate that PCTGAN enables users to fine-tune the privacy-fidelity trade-off through its parameters and, if desired, to achieve a higher level of privacy. This work was partially funded by projects AISym4Med (101095387) supported by Horizon Europe Cluster 1: Health, ConnectedHealth (no. 46858), supported by Competitiveness and Internationalisation Operational Programme (POCI) and Lisbon Regional Operational Programme (LISBOA 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF) and NextGenAI - Center for Responsible AI (2022-C05i0102-02), supported by IAPMEI, and also by FCT plurianual funding for 2020-2023 of LIACC (UIDB/00027/2020_UIDP/00027/2020).
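The abstract does not give the form of the privacy-aware loss; as a loose illustration only, the sketch below adds a hypothetical nearest-neighbour penalty to a standard generator loss, with lam standing in for a tunable privacy parameter. It is not the PCTGAN objective itself.

```python
import torch

def private_generator_loss(d_fake_scores, fake, real_batch, lam=1.0):
    """Adversarial term plus a hypothetical privacy penalty that discourages
    synthetic rows from landing too close to individual real rows."""
    adv = -d_fake_scores.mean()

    # Distance from each synthetic row to its nearest real row (both tensors
    # are (batch, n_features)); being very close is a crude proxy for
    # memorization risk.
    nearest = torch.cdist(fake, real_batch).min(dim=1).values
    privacy_penalty = torch.relu(1.0 - nearest).mean()

    # lam trades off fidelity (adv) against privacy (penalty).
    return adv + lam * privacy_penalty
```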
Adversarial Machine Learning-Enabled Anonymization of OpenWiFi Data
Data privacy and protection through anonymization is a critical concern for network operators and data owners before data is forwarded for other possible uses. With the adoption of Artificial Intelligence (AI), data anonymization increases the likelihood of concealing sensitive information, preventing data leakage and information loss. OpenWiFi networks are vulnerable to any adversary trying to gain access to or knowledge of traffic, regardless of the knowledge possessed by the data owners. The risk of discovering actual traffic information is addressed by applying a conditional tabular generative adversarial network (CTGAN). CTGAN yields synthetic data that looks like the actual data while keeping the sensitive information of the actual data hidden. In this paper, the similarity of synthetic and actual data is assessed using clustering algorithms, followed by a comparison of performance on unsupervised cluster validation metrics. The well-known K-means algorithm outperforms the other algorithms in the similarity assessment of synthetic data against real data, achieving scores of 0.634, 23714.57, and 0.598 on the Silhouette, Calinski-Harabasz, and Davies-Bouldin metrics, respectively. A comparative analysis of the validation scores across several algorithms shows that K-means is the most suitable unsupervised clustering algorithm, supporting the use of synthetic data as a replacement for real data. Hence, the experimental results show the viability of using CTGAN-generated synthetic data, in lieu of publishing anonymized data, for various applications. Comment: 8 pages, 4 figures, "Wireless World Research and Trends" Magazine. The initial version was presented at the 47th Wireless World Research Forum.
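The three validation metrics reported above are available in scikit-learn; a minimal sketch with a random placeholder matrix standing in for the CTGAN-generated OpenWiFi features is shown below.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import (silhouette_score, calinski_harabasz_score,
                             davies_bouldin_score)

# Random placeholder matrix standing in for CTGAN-generated (or real)
# OpenWiFi traffic features.
X = np.random.default_rng(0).normal(size=(2000, 8))

labels = KMeans(n_clusters=4, n_init=10, random_state=0).fit_predict(X)

print("Silhouette:       ", silhouette_score(X, labels))
print("Calinski-Harabasz:", calinski_harabasz_score(X, labels))
print("Davies-Bouldin:   ", davies_bouldin_score(X, labels))
```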
Robustness Analysis of Deep Learning Models for Population Synthesis
Deep generative models have become useful for synthetic data generation,
particularly population synthesis. The models implicitly learn the probability
distribution of a dataset and can draw samples from that distribution. Several
models have been proposed, but their performance is only tested on a single
cross-sectional sample. The implementation of population synthesis on single
datasets is seen as a drawback that needs further studies to explore the
robustness of the models on multiple datasets. While comparing with the real
data can increase trust and interpretability of the models, techniques to
evaluate deep generative models' robustness for population synthesis remain
underexplored. In this study, we present bootstrap confidence intervals for deep generative models, an approach that computes efficient confidence intervals for mean prediction errors to evaluate the robustness of the models
to multiple datasets. Specifically, we adopt the tabular-based Composite Travel
Generative Adversarial Network (CTGAN) and Variational Autoencoder (VAE) to estimate the distribution of the population by generating agents that have tabular data, using several samples over time from the same study area. The models are implemented on multiple travel diaries of the Montreal Origin-Destination Survey of 2008, 2013, and 2018, and the predictive performance is compared under varying sample sizes from multiple surveys. Results show that the predictive errors of CTGAN have narrower confidence intervals than those of VAE, indicating its robustness to multiple datasets of varying sample sizes. In addition, the evaluation of model robustness against varying sample sizes shows only a minimal decrease in model performance as the sample size decreases.
This study directly supports agent-based modelling by enabling finer synthetic
generation of populations in a reliable environment. Comment: arXiv admin note: text overlap with arXiv:2203.03489, arXiv:1909.07689 by other authors.
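A generic percentile-bootstrap sketch of the confidence-interval idea is shown below; the error quantity and the exact resampling procedure used in the study are not specified here and are treated as assumptions.

```python
import numpy as np

def bootstrap_ci(errors, n_boot=2000, alpha=0.05, seed=None):
    """Percentile-bootstrap confidence interval for a mean error.

    `errors` is any per-sample error measure comparing synthetic agents with
    the survey data; the exact quantity used in the study is an assumption.
    """
    rng = np.random.default_rng(seed)
    errors = np.asarray(errors)
    boot_means = np.array([
        rng.choice(errors, size=errors.size, replace=True).mean()
        for _ in range(n_boot)
    ])
    lo, hi = np.percentile(boot_means, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return errors.mean(), (lo, hi)

# Narrower intervals across resampled datasets indicate a more robust model.
mean_err, (lo, hi) = bootstrap_ci(np.random.default_rng(1).normal(0.10, 0.02, 200))
print(f"mean error {mean_err:.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```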
Addressing the data bottleneck in medical deep learning models using a human-in-the-loop machine learning approach
Any machine learning (ML) model is highly dependent on the data it uses for learning, and this is even more important in the case of deep learning models. The problem is a data bottleneck, i.e. the difficulty in obtaining an adequate number of cases and quality data. Another issue is improving the learning process, which can be done by actively introducing experts into the learning loop, in what is known as human-in-the-loop (HITL) ML. We describe an ML model based on a neural network in which HITL techniques were used to resolve the data bottleneck problem for the treatment of pancreatic cancer. We first augmented the dataset using synthetic cases created by a generative adversarial network. We then launched an active learning (AL) process involving human experts as oracles to label both new cases and cases found by the network to be suspect. This AL process was carried out simultaneously with an interactive ML process in which feedback was obtained from humans in order to develop better synthetic cases for each iteration of training. We discuss the challenges involved in including humans in the learning process, especially in relation to human–computer interaction, which is acquiring great importance in building ML models and can condition the success of a HITL approach. This paper also discusses the methodological approach adopted to address these challenges. This work has been supported by the State Research Agency of the Spanish Government (Grant PID2019-107194GB-I00/AEI/10.13039/501100011033) and by the Xunta de Galicia (Grant ED431C 2022/44), supported in turn by the EU European Regional Development Fund. We wish to acknowledge support received from the Centro de Investigación de Galicia CITIC, funded by the Xunta de Galicia and the European Regional Development Fund (Galicia 2014–2020 Program; Grant ED431G 2019/01).
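As a loose illustration of the active learning loop (not the authors' implementation), the sketch below queries a human oracle for the cases a classifier is least certain about; label_by_expert is a hypothetical placeholder for the clinician's judgement.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier

def label_by_expert(case):
    """Placeholder for the human oracle: in the HITL setting a clinician
    would inspect the case and return its label."""
    raise NotImplementedError

def active_learning_round(X_labeled, y_labeled, X_pool, n_query=10):
    """One uncertainty-sampling round: train, query the expert on the
    least-certain pool cases, and fold the new labels back in."""
    clf = RandomForestClassifier(random_state=0).fit(X_labeled, y_labeled)

    proba = clf.predict_proba(X_pool)
    uncertainty = 1.0 - proba.max(axis=1)           # low max-prob = uncertain
    query_idx = np.argsort(uncertainty)[-n_query:]  # most uncertain cases

    new_labels = [label_by_expert(X_pool[i]) for i in query_idx]
    X_labeled = np.vstack([X_labeled, X_pool[query_idx]])
    y_labeled = np.concatenate([y_labeled, new_labels])
    X_pool = np.delete(X_pool, query_idx, axis=0)
    return clf, X_labeled, y_labeled, X_pool
```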
Federated Learning for Private Synthetic Data Generation
The digital transformation of healthcare has gained momentum in recent years, as shown by the introduction of Electronic Health Record (EHR) systems and of digital infrastructures for data exchange between all actors in the healthcare sector. In Germany, insured persons will soon have the option of voluntarily donating the data stored in their electronic health record for medical research purposes. The secondary use of real-world medical data holds great potential, for example for monitoring long-term outcomes associated with specific treatments, but it also raises considerable privacy concerns, since health data is particularly sensitive due to the risk of stigmatization or discrimination resulting from misuse.
For this reason, various Privacy-Enhancing Technologies (PETs) have been proposed in the literature. Differential Privacy (DP), for example, makes it possible to limit the privacy impact of data analyses by injecting carefully calibrated noise. With recent advances in machine learning, synthetic data generation (SDG) using Generative Adversarial Networks (GANs) has gained attention as a privacy-preserving technique. Furthermore, Federated Learning (FL) allows machine learning models to be trained in a decentralized manner. By combining DP, SDG, and FL, synthetic data can be generated collaboratively that offers both strong privacy guarantees and added value for research, while the training data never has to be shared with a central entity.
This master's thesis presents a novel approach called DP-Fed-CTGAN for generating synthetic tabular data, which is based on FL and satisfies strict DP guarantees. Compared with existing approaches, DP-Fed-CTGAN aims to minimize the amount of information clients must reveal about their local training datasets during the FL procedure. The performance of the open-source implementation of DP-Fed-CTGAN is evaluated using common metrics on both medical and widely used machine learning datasets. The results show that DP-Fed-CTGAN not only achieves comparable utility and improved realism compared to the centralized DP-CTGAN approach, but can also help increase patients' acceptance of data donation and facilitate compliance with data protection laws.
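As a generic illustration of how DP and FL can be combined (a DP-FedAvg-style sketch, not the DP-Fed-CTGAN protocol itself), the snippet below clips client updates and adds Gaussian noise before aggregating them into the global model; the clipping bound and noise multiplier are illustrative parameters.

```python
import numpy as np

def dp_federated_round(global_params, client_updates, clip=1.0,
                       noise_multiplier=1.0, seed=None):
    """One aggregation round: clip each client's parameter update to an L2
    bound and add Gaussian noise before averaging into the global model."""
    rng = np.random.default_rng(seed)
    clipped = []
    for update in client_updates:
        norm = np.linalg.norm(update)
        clipped.append(update * min(1.0, clip / (norm + 1e-12)))  # L2 clipping
    avg_update = np.mean(clipped, axis=0)
    # Gaussian noise scaled to the clipping bound and the number of clients.
    noise = rng.normal(0.0, noise_multiplier * clip / len(client_updates),
                       size=avg_update.shape)
    return global_params + avg_update + noise
```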
An improved CTGAN for data processing method of imbalanced disk failure
Disk failure data is scarce, and the number of failure samples is greatly imbalanced against the number of normal samples. The existing Conditional Tabular Generative Adversarial Network (CTGAN) deep learning method has been proven effective in addressing imbalanced disk failure data, but CTGAN cannot learn the internal information of disk failure data very well. In this paper, we propose a fault diagnosis method based on an improved CTGAN, in which a classifier for specific category discrimination is added and the discriminator of the generative adversarial network is built on a residual network. We name it the Residual Conditional Tabular Generative Adversarial Network (RCTGAN). First, a residual network is utilized to enhance the stability of the system. RCTGAN uses a small amount of real failure data to synthesize fake failure data; the synthesized data is then mixed with the real data to balance the amounts of normal and failure data; finally, four classifier models (multilayer perceptron, support vector machine, decision tree, random forest) are trained on the balanced dataset, and their performance is evaluated using the G-mean. The experimental results show that the data synthesized by RCTGAN can further improve the fault diagnosis accuracy of the classifiers.
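As a generic illustration of the "discriminator based on a residual network" idea (the exact RCTGAN architecture is not given in the abstract), a simple residual block for a tabular discriminator might look as follows in PyTorch.

```python
import torch
import torch.nn as nn

class ResidualBlock(nn.Module):
    """Fully connected residual block: two linear layers wrapped by a skip
    connection, which helps stabilize discriminator training."""
    def __init__(self, dim):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(dim, dim), nn.LeakyReLU(0.2),
            nn.Linear(dim, dim),
        )
        self.act = nn.LeakyReLU(0.2)

    def forward(self, x):
        return self.act(x + self.net(x))  # skip connection around the block

# Example: a small discriminator trunk for tabular inputs of width 64.
disc = nn.Sequential(nn.Linear(64, 128), ResidualBlock(128),
                     ResidualBlock(128), nn.Linear(128, 1))
out = disc(torch.randn(32, 64))
```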
