Search CORE

132 research outputs found

Generating Artificial Data for Private Deep Learning

Author: Faltings Boi
Triastcyn Aleksei
Publication venue
Publication date: 28/04/2019
Field of study

In this paper, we propose generating artificial data that retain statistical properties of real data as the means of providing privacy with respect to the original dataset. We use generative adversarial network to draw privacy-preserving artificial data samples and derive an empirical method to assess the risk of information disclosure in a differential-privacy-like way. Our experiments show that we are able to generate artificial data of high quality and successfully train and validate machine learning models on this data while limiting potential privacy loss.Comment: Privacy-Enhancing Artificial Intelligence and Language Technologies, AAAI Spring Symposium Series, 201

arXiv.org e-Print Archive

Infoscience - École polytechnique fédérale de Lausanne

Robin Hood and Matthew Effects: Differential Privacy Has Disparate Impact on Synthetic Data

Author: Cristofaro Emiliano De
Ganev Georgi
Oprisanu Bristena
Publication venue
Publication date: 01/01/2022
Field of study

Generative models trained with Differential Privacy (DP) can be used to generate synthetic data while minimizing privacy risks. We analyze the impact of DP on these models vis-a-vis underrepresented classes/subgroups of data, specifically, studying: 1) the size of classes/subgroups in the synthetic data, and 2) the accuracy of classification tasks run on them. We also evaluate the effect of various levels of imbalance and privacy budgets. Our analysis uses three state-of-the-art DP models (PrivBayes, DP-WGAN, and PATE-GAN) and shows that DP yields opposite size distributions in the generated synthetic data. It affects the gap between the majority and minority classes/subgroups; in some cases by reducing it (a "Robin Hood" effect) and, in others, by increasing it (a "Matthew" effect). Either way, this leads to (similar) disparate impacts on the accuracy of classification tasks on the synthetic data, affecting disproportionately more the underrepresented subparts of the data. Consequently, when training models on synthetic data, one might incur the risk of treating different subpopulations unevenly, leading to unreliable or unfair conclusions

UCL Discovery