669 research outputs found

    Wasserstein Divergence for GANs

    In many domains of computer vision, generative adversarial networks (GANs) have achieved great success, among which the family of Wasserstein GANs (WGANs) is considered to be state-of-the-art due to the theoretical contributions and competitive qualitative performance. However, it is very challenging to approximate the k-Lipschitz constraint required by the Wasserstein-1 metric (W-met). In this paper, we propose a novel Wasserstein divergence (W-div), which is a relaxed version of W-met and does not require the k-Lipschitz constraint. As a concrete application, we introduce a Wasserstein divergence objective for GANs (WGAN-div), which can faithfully approximate W-div through optimization. Under various settings, including progressive growing training, we demonstrate the stability of the proposed WGAN-div owing to its theoretical and practical advantages over WGANs. Also, we study the quantitative and visual performance of WGAN-div on standard image synthesis benchmarks of computer vision, showing the superior performance of WGAN-div compared to the state-of-the-art methods. Comment: accepted by ECCV 2018; minor error corrected.

    Sliced Wasserstein Generative Models

    In generative modeling, the Wasserstein distance (WD) has emerged as a useful metric to measure the discrepancy between generated and real data distributions. Unfortunately, it is challenging to approximate the WD of high-dimensional distributions. In contrast, the sliced Wasserstein distance (SWD) factorizes high-dimensional distributions into their multiple one-dimensional marginal distributions and is thus easier to approximate. In this paper, we introduce novel approximations of the primal and dual SWD. Instead of using a large number of random projections, as is done by conventional SWD approximation methods, we propose to approximate SWDs with a small number of parameterized orthogonal projections in an end-to-end deep learning fashion. As concrete applications of our SWD approximations, we design two types of differentiable SWD blocks to equip modern generative frameworks: Auto-Encoders (AE) and Generative Adversarial Networks (GAN). In the experiments, we not only show the superiority of the proposed generative models on standard image synthesis benchmarks, but also demonstrate state-of-the-art performance on challenging high-resolution image and video generation in an unsupervised manner. Comment: This paper is accepted by CVPR 2019; accidentally uploaded as a new submission (arXiv:1904.05408, which has been withdrawn). The code is available at https://github.com/musikisomorphie/swd.gi
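
    For readers unfamiliar with the baseline this abstract contrasts against, the following is a minimal sketch of the conventional random-projection SWD estimate, not the paper's parameterized orthogonal projections. It assumes equal-size samples and uses the Wasserstein-1 distance per slice; the function names and defaults are illustrative only.

```python
import math
import random


def random_unit_vector(dim, rng):
    """Draw a direction on the unit sphere by normalizing a Gaussian vector."""
    v = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def wasserstein_1d(a, b):
    """W1 between two equal-size 1-D samples: mean gap between sorted values."""
    assert len(a) == len(b), "this sketch assumes equal sample sizes"
    return sum(abs(x - y) for x, y in zip(sorted(a), sorted(b))) / len(a)


def sliced_wasserstein(xs, ys, n_projections=50, seed=0):
    """Monte Carlo SWD: average the 1-D W1 distance over random directions."""
    rng = random.Random(seed)
    dim = len(xs[0])
    total = 0.0
    for _ in range(n_projections):
        theta = random_unit_vector(dim, rng)
        # Project every point onto the direction theta (a one-dimensional marginal).
        px = [sum(t * c for t, c in zip(theta, x)) for x in xs]
        py = [sum(t * c for t, c in zip(theta, y)) for y in ys]
        total += wasserstein_1d(px, py)
    return total / n_projections


if __name__ == "__main__":
    rng = random.Random(1)
    real = [[rng.gauss(0, 1) for _ in range(3)] for _ in range(200)]
    fake = [[c + 0.5 for c in point] for point in real]  # shifted copy
    print(sliced_wasserstein(real, real))  # identical samples
    print(sliced_wasserstein(real, fake))  # shifted samples
```

    Each slice only needs a sort, which is why the one-dimensional marginals are easier to handle than the full high-dimensional distance; the cost is that many random directions may be needed for a stable estimate.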

    A Generative Deep Learning Approach for Crash Severity Modeling with Imbalanced Data

    Crash data is often greatly imbalanced: the majority of crashes are non-fatal, and only a small number are fatal. This imbalance poses a challenge for crash severity modeling, since models struggle to fit and interpret fatal crash outcomes from very limited samples. Usually, such imbalance is addressed by data resampling methods, such as under-sampling and over-sampling techniques. However, most traditional and deep learning-based data resampling methods, such as the synthetic minority oversampling technique (SMOTE) and Generative Adversarial Networks (GAN), are designed for continuous variables. Though some resampling methods have been improved to handle both continuous and discrete variables, they may have difficulties in dealing with the collapse issue associated with sparse discrete risk factors. Moreover, there is a lack of comprehensive studies that compare the performance of various resampling methods in crash severity modeling. To address these issues, the current study proposes a crash data generation method based on the Conditional Tabular GAN. After data balancing, a crash severity model is used to assess classification and interpretation performance. A comparative study is conducted to assess the classification accuracy and distribution consistency of the proposed generation method using a 4-year imbalanced crash dataset collected in Washington State, U.S. Additionally, Monte Carlo simulation is employed to assess the performance of parameter and probability estimation in both two- and three-class imbalance scenarios. The results indicate that using synthetic data generated by CTGAN-RU for crash severity modeling outperforms using original data or synthetic data generated by other resampling methods.

    IMPROVING STROKE PREDICTION ON IMBALANCED CLINICAL DATA USING CTGAN AND TVAE: A SYNTHETIC DATA APPROACH

    Synthetic data (SD) have been evaluated and adopted in many domains, especially in health. For this study, we chose tabular data on stroke prediction, available in [3]. The dataset contains 11 clinical features, including the last column, which labels stroke as positive = 1 and negative = 0. We chose this dataset because of its imbalance, which makes it a good fit for applying the generation techniques and assessing how closely the synthetic data resemble the real data. To generate SD, we use two techniques, conditional tabular GAN (CTGAN) and tabular variational autoencoder (TVAE), trained with different numbers of epochs and batch sizes. We further evaluated the results with three Machine Learning (ML) models, using real data as a benchmark. The results show that data generated with CTGAN (epochs=1500, batch size=500) perform better, with an accuracy score of 0.995 on random forest (RF) and Support Vector Machine (SVM)

    PrivateCTGAN: adapting GAN for privacy-aware tabular data sharing

    This research addresses the challenge of generating synthetic data that resembles real-world data while preserving privacy. With privacy laws protecting sensitive information such as healthcare data, accessing sufficient training data becomes difficult, which makes Machine Learning models harder to train and leads to worse models overall. Recently, there has been increased interest in using Generative Adversarial Networks (GAN) to generate synthetic data, since they enable researchers to generate more data to train their models. GANs, however, may not be suitable for privacy-sensitive data, since they have no concern for the privacy of the generated data. We propose modifying the known Conditional Tabular GAN (CTGAN) model by incorporating a privacy-aware loss function, resulting in the Private CTGAN (PCTGAN) method. Several experiments were carried out using 10 public domain classification datasets, comparing PCTGAN with CTGAN and with the state-of-the-art privacy-preserving model, the Differential Privacy CTGAN (DP-CTGAN). The results demonstrated that PCTGAN enables users to fine-tune the privacy-fidelity trade-off by leveraging parameters and, if desired, to obtain a higher level of privacy. This work was partially funded by projects AISym4Med (101095387), supported by Horizon Europe Cluster 1: Health; ConnectedHealth (n.º 46858), supported by the Competitiveness and Internationalisation Operational Programme (POCI) and the Lisbon Regional Operational Programme (LISBOA 2020), under the PORTUGAL 2020 Partnership Agreement, through the European Regional Development Fund (ERDF); and NextGenAI - Center for Responsible AI (2022-C05i0102-02), supported by IAPMEI; and also by FCT pluriannual funding for 2020-2023 of LIACC (UIDB/00027/2020_UIDP/00027/2020)

    Adversarial Machine Learning-Enabled Anonymization of OpenWiFi Data

    Protecting data privacy through anonymization is a critical issue for network operators and data owners before data is forwarded for other possible uses. With the adoption of Artificial Intelligence (AI), data anonymization increases the likelihood of concealing sensitive information, preventing data leakage and information loss. OpenWiFi networks are vulnerable to any adversary trying to gain access to, or knowledge of, traffic, regardless of the knowledge possessed by data owners. The risk of actual traffic information being discovered is addressed by applying a conditional tabular generative adversarial network (CTGAN). CTGAN yields synthetic data that pass for actual data while keeping the sensitive information of the actual data hidden. In this paper, the similarity of synthetic to actual data is assessed using clustering algorithms, followed by a comparison of unsupervised cluster validation metrics. The well-known K-means algorithm outperforms the other algorithms in the similarity assessment of synthetic against real data, achieving the closest scores of 0.634, 23714.57, and 0.598 on the Silhouette, Calinski-Harabasz, and Davies-Bouldin metrics, respectively. In a comparative analysis of validation scores across several algorithms, K-means stands out among the unsupervised clustering algorithms, supporting the use of synthetic data as a replacement for real data. Hence, the experimental results aim to show the viability of using CTGAN-generated synthetic data in lieu of publishing anonymized data to be utilized in various applications. Comment: 8 pages, 4 Figures, "Wireless World Research and Trends" Magazine. Initial version was presented in 47th Wireless World Research Foru

    Robustness Analysis of Deep Learning Models for Population Synthesis

    Deep generative models have become useful for synthetic data generation, particularly population synthesis. The models implicitly learn the probability distribution of a dataset and can draw samples from that distribution. Several models have been proposed, but their performance has only been tested on a single cross-sectional sample. Evaluating population synthesis on a single dataset is seen as a drawback, and further study is needed to explore the robustness of the models on multiple datasets. While comparison with real data can increase trust in and interpretability of the models, techniques to evaluate the robustness of deep generative models for population synthesis remain underexplored. In this study, we present a bootstrap confidence interval approach for deep generative models, which computes efficient confidence intervals for mean prediction errors to evaluate the robustness of the models across multiple datasets. Specifically, we adopt the tabular-based Composite Travel Generative Adversarial Network (CTGAN) and the Variational Autoencoder (VAE) to estimate the distribution of the population by generating agents with tabular data, using several samples over time from the same study area. The models are applied to multiple travel diaries from the Montreal Origin-Destination Survey of 2008, 2013, and 2018, and we compare their predictive performance under varying sample sizes from multiple surveys. Results show that the predictive errors of CTGAN have narrower confidence intervals than those of VAE, indicating its robustness to multiple datasets of varying sample sizes. In addition, the evaluation of model robustness against varying sample size shows a minimal decrease in model performance as sample size decreases. This study directly supports agent-based modelling by enabling finer synthetic generation of populations in a reliable environment. Comment: arXiv admin note: text overlap with arXiv:2203.03489, arXiv:1909.07689 by other author
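
    As a rough illustration of the idea of a bootstrap confidence interval for a mean prediction error, here is a minimal sketch. It assumes the simple percentile bootstrap, which may differ from the variant used in the paper; the error values, function name, and defaults are hypothetical.

```python
import math
import random
import statistics


def bootstrap_mean_ci(errors, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile-bootstrap confidence interval for the mean of `errors`."""
    rng = random.Random(seed)
    n = len(errors)
    # Resample with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(errors, k=n)) for _ in range(n_resamples)
    )
    # Read the interval bounds off the sorted resample means.
    lo_idx = max(0, math.floor((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, math.ceil((1 - alpha / 2) * n_resamples) - 1)
    return means[lo_idx], means[hi_idx]


if __name__ == "__main__":
    # Hypothetical per-attribute prediction errors for two models.
    model_a = [0.10, 0.12, 0.11, 0.09, 0.13, 0.10, 0.11, 0.12]
    model_b = [0.02, 0.25, 0.08, 0.30, 0.05, 0.18, 0.01, 0.22]
    for name, errs in (("A", model_a), ("B", model_b)):
        lo, hi = bootstrap_mean_ci(errs)
        print(f"model {name}: mean={statistics.fmean(errs):.3f}, "
              f"95% CI width={hi - lo:.3f}")
```

    Under this reading, a narrower interval means the mean error changes little from one resample to the next, which is the sense in which the abstract treats narrow intervals as evidence of robustness.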

    Addressing the data bottleneck in medical deep learning models using a human-in-the-loop machine learning approach

    Any machine learning (ML) model is highly dependent on the data it uses for learning, and this is even more important in the case of deep learning models. The problem is a data bottleneck, i.e. the difficulty in obtaining an adequate number of cases and quality data. Another issue is improving the learning process, which can be done by actively introducing experts into the learning loop, in what is known as human-in-the-loop (HITL) ML. We describe an ML model based on a neural network in which HITL techniques were used to resolve the data bottleneck problem for the treatment of pancreatic cancer. We first augmented the dataset using synthetic cases created by a generative adversarial network. We then launched an active learning (AL) process involving human experts as oracles to label both new cases and cases found by the network to be suspect. This AL process was carried out simultaneously with an interactive ML process in which feedback was obtained from humans in order to develop better synthetic cases for each iteration of training. We discuss the challenges involved in including humans in the learning process, especially in relation to human–computer interaction, which is acquiring great importance in building ML models and can condition the success of a HITL approach. This paper also discusses the methodological approach adopted to address these challenges. This work has been supported by the State Research Agency of the Spanish Government (Grant PID2019-107194GB-I00/AEI/10.13039/501100011033) and by the Xunta de Galicia (Grant ED431C 2022/44), supported in turn by the EU European Regional Development Fund. We wish to acknowledge support received from the Centro de Investigación de Galicia CITIC, funded by the Xunta de Galicia and the European Regional Development Fund (Galicia 2014–2020 Program; Grant ED431G 2019/01).

    Federated Learning for Private Synthetic Data Generation

    The digital transformation of healthcare has gained momentum in recent years, as shown by the introduction of Electronic Health Record (EHR) systems and digital infrastructures for data exchange between all actors in the healthcare sector. In Germany, insured persons will soon be able to voluntarily donate the data stored in their electronic patient record for medical research purposes. The secondary use of medical real-world data holds great potential, for example in monitoring long-term outcomes associated with particular treatments, but it also raises considerable privacy concerns, because health data are especially worthy of protection given the risk of stigmatization or discrimination resulting from misuse. For this reason, various Privacy-Enhancing Technologies (PETs) have been presented in the literature. For example, Differential Privacy (DP) makes it possible to limit the privacy impact of data analyses by adding carefully calibrated noise. With recent advances in machine learning, synthetic data generation (SDG) using Generative Adversarial Networks (GANs) has gained attention as a privacy-preserving technique. Furthermore, Federated Learning (FL) allows machine learning models to be trained in a decentralized manner. By combining DP, SDG, and FL, synthetic data can be generated collaboratively that offer both strong privacy guarantees and added value for research, while the training data need not be shared with a central entity. This master's thesis presents a novel approach called DP-Fed-CTGAN for generating synthetic tabular data, which is based on FL and satisfies strict DP guarantees. Compared with existing approaches, DP-Fed-CTGAN aims to minimize the amount of information that clients must disclose about their local training datasets during the FL procedure. The performance of the open-source implementation of DP-Fed-CTGAN is evaluated using common metrics, considering both medical and commonly used machine learning datasets. The results show that DP-Fed-CTGAN not only achieves comparable utility and improved fidelity compared with the centralized DP-CTGAN approach, but can also help to increase patients' acceptance of data donation and ease compliance with data protection laws

    An improved CTGAN for data processing method of imbalanced disk failure

    This work addresses the problem of insufficient disk failure data and the imbalance between the amounts of normal and failure data. Existing Conditional Tabular Generative Adversarial Network (CTGAN) deep learning methods have proven effective for imbalanced disk failure data, but CTGAN cannot learn the internal information of disk failure data very well. In this paper, a fault diagnosis method based on an improved CTGAN is proposed: a classifier for category-specific discrimination is added, and the discriminator of the generative adversarial network is built on a residual network. We name it Residual Conditional Tabular Generative Adversarial Networks (RCTGAN). First, a residual network is used to enhance the stability of the system. RCTGAN uses a small amount of real failure data to synthesize fake fault data; then, the synthesized data are mixed with the real data to balance the amounts of normal and failure data; finally, four classifier models (multilayer perceptron, support vector machine, decision tree, random forest) are trained on the balanced data set, and their performance is evaluated using G-mean. The experimental results show that the data synthesized by RCTGAN can further improve the fault diagnosis accuracy of the classifiers
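
    The abstract does not define G-mean. A minimal sketch follows, assuming the common binary definition (the geometric mean of sensitivity and specificity); the labels and data are hypothetical.

```python
import math


def g_mean(y_true, y_pred, positive=1):
    """Geometric mean of sensitivity and specificity for binary labels."""
    tp = fn = tn = fp = 0
    for truth, pred in zip(y_true, y_pred):
        if truth == positive:
            if pred == positive:
                tp += 1
            else:
                fn += 1
        else:
            if pred == positive:
                fp += 1
            else:
                tn += 1
    # Guard against a class being absent from y_true.
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    return math.sqrt(sensitivity * specificity)


if __name__ == "__main__":
    # Hypothetical imbalanced set: 1 failure (label 1) among 99 normal disks.
    y_true = [1] + [0] * 99
    always_normal = [0] * 100
    accuracy = sum(t == p for t, p in zip(y_true, always_normal)) / len(y_true)
    print(accuracy)                       # high despite missing the failure
    print(g_mean(y_true, always_normal))  # collapses because sensitivity is 0
```

    A classifier that never predicts a failure can still score high accuracy on imbalanced data, whereas its G-mean drops to zero, which is presumably why the metric suits this setting.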