Differentially Private Data Generative Models
Deep neural networks (DNNs) have recently been widely adopted in various
applications, and such success is largely due to a combination of algorithmic
breakthroughs, computation resource improvements, and access to a large amount
of data. However, the large-scale data collections required for deep learning
often contain sensitive information, therefore raising many privacy concerns.
Prior research has shown several successful attacks in inferring sensitive
training data information, such as model inversion, membership inference, and
generative adversarial networks (GAN) based leakage attacks against
collaborative deep learning. In this paper, to enable learning efficiency as
well as to generate data with privacy guarantees and high utility, we propose a
differentially private autoencoder-based generative model (DP-AuGM) and a
differentially private variational autoencoder-based generative model
(DP-VaeGM). We evaluate the robustness of the two proposed models. We show that
DP-AuGM can effectively defend against the model inversion, membership
inference, and GAN-based attacks. We also show that DP-VaeGM is robust against
the membership inference attack. We conjecture that the key to defending
against the model inversion and GAN-based attacks is not differential privacy
itself but the perturbation of the training data. Finally, we demonstrate that both
DP-AuGM and DP-VaeGM can be easily integrated with real-world machine learning
applications, such as machine learning as a service and federated learning,
which are otherwise threatened by the membership inference attack and the
GAN-based attack, respectively.
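The DP-AuGM idea above rests on training a generative autoencoder with differentially private optimization. Below is a minimal sketch of that ingredient, DP-SGD-style per-example gradient clipping plus Gaussian noise, applied to a toy linear autoencoder in NumPy; the architecture, hyperparameters, and noise scale are illustrative assumptions rather than the paper's configuration.

```python
# Toy DP-SGD training of a linear autoencoder (illustrative, not DP-AuGM's setup).
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 256, 20, 5                  # examples, input dim, code dim
X = rng.normal(size=(n, d))

W_enc = rng.normal(scale=0.1, size=(d, k))
W_dec = rng.normal(scale=0.1, size=(k, d))

clip, sigma, lr = 1.0, 1.2, 0.05      # clipping norm, noise multiplier, step size

for step in range(200):
    batch = X[rng.choice(n, size=32, replace=False)]
    g_enc_sum = np.zeros_like(W_enc)
    g_dec_sum = np.zeros_like(W_dec)
    for x in batch:                   # per-example gradients of the squared error
        z = x @ W_enc
        x_hat = z @ W_dec
        err = x_hat - x
        g_dec = np.outer(z, err)
        g_enc = np.outer(x, err @ W_dec.T)
        norm = np.sqrt((g_dec ** 2).sum() + (g_enc ** 2).sum())
        scale = min(1.0, clip / (norm + 1e-12))   # clip the joint gradient norm
        g_dec_sum += g_dec * scale
        g_enc_sum += g_enc * scale
    # Gaussian noise calibrated to the clipping norm, then average over the batch.
    g_dec_sum += rng.normal(scale=sigma * clip, size=g_dec_sum.shape)
    g_enc_sum += rng.normal(scale=sigma * clip, size=g_enc_sum.shape)
    W_dec -= lr * g_dec_sum / len(batch)
    W_enc -= lr * g_enc_sum / len(batch)

# After training, the privately trained encoder can transform new data before
# it is shared with downstream learners, as DP-AuGM proposes.
encoded = X @ W_enc
```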
Differentially Private Continual Learning
Catastrophic forgetting can be a significant problem for institutions that
must delete historic data for privacy reasons. For example, hospitals might not
be able to retain patient data permanently. But neural networks trained on
recent data alone will tend to forget lessons learned on old data. We present a
differentially private continual learning framework based on variational
inference. We estimate the likelihood of past data given the current model
using differentially private generative models of old datasets.
Comment: Presented at the Privacy in Machine Learning and AI workshop at ICML
201
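A toy sketch of the generative-replay idea described in this abstract: a differentially private generative model fitted before the old data is deleted later stands in for that data. The "generative model" below is just a DP-noised Gaussian and the "continually trained model" a running mean, so everything here is an illustrative caricature of the variational-inference framework, not its implementation; the noise scale is not calibrated.

```python
# Generative replay with a (crudely) DP-fitted stand-in for deleted data.
import numpy as np

rng = np.random.default_rng(1)
old = rng.normal(loc=2.0, size=(500, 3))           # data that must be deleted
B = 4.0                                            # assumed bound on each value
dp_mean = old.clip(-B, B).mean(axis=0) + rng.laplace(scale=0.5, size=3)  # noisy mean
del old                                            # historic data is gone

new = rng.normal(loc=-1.0, size=(500, 3))          # data from the new task
replay = rng.normal(loc=dp_mean, size=(500, 3))    # samples from the DP generator

# The "model" sees new data plus replay, so it does not fully forget the old task.
model_mean = np.vstack([new, replay]).mean(axis=0)
print(model_mean)
```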
Differentially Private Synthetic Data: Applied Evaluations and Enhancements
Machine learning practitioners frequently seek to leverage the most
informative available data, without violating the data owner's privacy, when
building predictive models. Differentially private data synthesis protects
personal details from exposure, and allows for the training of differentially
private machine learning models on privately generated datasets. But how can we
effectively assess the efficacy of differentially private synthetic data? In
this paper, we survey four differentially private generative adversarial
networks for data synthesis. We evaluate each of them at scale on five standard
tabular datasets, and in two applied industry scenarios. We benchmark with
novel metrics from recent literature and other standard machine learning tools.
Our results suggest that some synthesizers are better suited to particular
privacy budgets, and we further demonstrate complicating domain-based tradeoffs
in selecting an approach. We offer experimental lessons from applied machine
learning scenarios with private internal data to researchers and practitioners
alike. In addition, we propose QUAIL, an ensemble-based modeling approach to
generating synthetic data. We examine QUAIL's tradeoffs, and note circumstances
in which it outperforms baseline differentially private supervised learning
models under the same budget constraint.
Comment: Under Review
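One standard utility check that evaluations of this kind rely on is train-on-synthetic, test-on-real (often abbreviated TSTR): fit a predictive model on the synthetic table and score it on held-out real records. A minimal sketch with scikit-learn follows; the random arrays merely stand in for real and DP-synthesized data and are not from the paper.

```python
# Train-on-synthetic, test-on-real (TSTR) utility check for synthetic tables.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_table(n, shift):
    # Placeholder tabular data; a real evaluation would load actual datasets.
    X = rng.normal(size=(n, 10)) + shift
    y = (X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.5, size=n) > 0).astype(int)
    return X, y

X_real, y_real = make_table(2000, shift=0.0)   # held-out "real" data
X_syn, y_syn = make_table(2000, shift=0.1)     # stand-in for DP synthetic data

clf = LogisticRegression(max_iter=1000).fit(X_syn, y_syn)
print("TSTR AUC:", roc_auc_score(y_real, clf.predict_proba(X_real)[:, 1]))
```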
Differentially Private Generative Adversarial Network
Generative Adversarial Network (GAN) and its variants have recently attracted
intensive research interests due to their elegant theoretical foundation and
excellent empirical performance as generative models. These tools provide a
promising direction in studies where data availability is limited. One
common issue in GANs is that the density of the learned generative distribution
could concentrate on the training data points, meaning that they can easily
remember training samples due to the high model complexity of deep networks.
This becomes a major concern when GANs are applied to private or sensitive data
such as patient medical records, and the concentration of distribution may
divulge critical patient information. To address this issue, in this paper we
propose a differentially private GAN (DPGAN) model, in which we achieve
differential privacy in GANs by adding carefully designed noise to gradients
during the learning procedure. We provide rigorous proof for the privacy
guarantee, as well as comprehensive empirical evidence to support our analysis,
where we demonstrate that our method can generate high quality data points at a
reasonable privacy level.
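The core DPGAN ingredient described above is a privatized gradient step for the discriminator: clip each example's gradient, add Gaussian noise, and average. A standalone sketch of such a step follows; the clipping norm and noise multiplier are illustrative, not the paper's values.

```python
# Privatized gradient step: per-example clipping plus Gaussian noise.
import numpy as np

def dp_gradient(per_example_grads, clip_norm=1.0, noise_multiplier=1.1, seed=0):
    """Clip each per-example gradient, sum, add Gaussian noise, and average."""
    rng = np.random.default_rng(seed)
    grads = np.asarray(per_example_grads)                 # shape (batch, dim)
    norms = np.linalg.norm(grads, axis=1, keepdims=True)
    clipped = grads * np.minimum(1.0, clip_norm / (norms + 1e-12))
    noisy_sum = clipped.sum(axis=0) + rng.normal(
        scale=noise_multiplier * clip_norm, size=grads.shape[1])
    return noisy_sum / len(grads)

# Example: 32 per-example discriminator gradients of dimension 100.
g = np.random.default_rng(1).normal(size=(32, 100))
print(dp_gradient(g).shape)                               # (100,)
```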
RON-Gauss: Enhancing Utility in Non-Interactive Private Data Release
A key challenge facing the design of differential privacy in the
non-interactive setting is to maintain the utility of the released data. To
overcome this challenge, we utilize the Diaconis-Freedman-Meckes (DFM) effect,
which states that most projections of high-dimensional data are nearly
Gaussian. Hence, we propose the RON-Gauss model that leverages the novel
combination of dimensionality reduction via random orthonormal (RON) projection
and the Gaussian generative model for synthesizing differentially-private data.
We analyze how RON-Gauss benefits from the DFM effect, and present multiple
algorithms for a range of machine learning applications, including both
unsupervised and supervised learning. Furthermore, we rigorously prove that (a)
our algorithms satisfy the strong epsilon-differential privacy guarantee,
and (b) RON projection can lower the level of perturbation required for
differential privacy. Finally, we illustrate the effectiveness of RON-Gauss
under three common machine learning applications -- clustering, classification,
and regression -- on three large real-world datasets. Our empirical results
show that (a) RON-Gauss outperforms previous approaches by up to an order of
magnitude, and (b) loss in utility compared to the non-private real data is
small. Thus, RON-Gauss can serve as a key enabler for real-world deployment of
privacy-preserving data release.
Comment: Appears in PoPETS 2019.
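A rough sketch of the RON-Gauss pipeline as described above: project the data with a random orthonormal matrix, fit a Gaussian model to the projection under noise, and sample synthetic records. The Laplace noise scales below are placeholders, not the calibrated values from the paper's privacy analysis, and the input data is random stand-in material.

```python
# RON projection + noisy Gaussian model + sampling (illustrative scales).
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 50))                    # private data (assumed preprocessed)

p = 10                                             # reduced dimension
Q, _ = np.linalg.qr(rng.normal(size=(50, p)))      # random orthonormal (RON) projection
Z = X @ Q                                          # nearly Gaussian by the DFM effect

mu = Z.mean(axis=0) + rng.laplace(scale=0.1, size=p)                 # noisy mean
cov = np.cov(Z, rowvar=False) + rng.laplace(scale=0.1, size=(p, p))  # noisy covariance

w, V = np.linalg.eigh((cov + cov.T) / 2)           # repair to a valid covariance
cov_psd = (V * np.clip(w, 1e-6, None)) @ V.T

synthetic = rng.multivariate_normal(mu, cov_psd, size=1000)          # synthetic release
```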
pMSE Mechanism: Differentially Private Synthetic Data with Maximal Distributional Similarity
We propose a method for the release of differentially private synthetic
datasets. In many contexts, data contain sensitive values which cannot be
released in their original form in order to protect individuals' privacy.
Synthetic data is a protection method that releases alternative values in place
of the original ones, and differential privacy (DP) is a formal guarantee for
quantifying the privacy loss. We propose a method that maximizes the
distributional similarity of the synthetic data relative to the original data
using a measure known as the pMSE, while guaranteeing epsilon-differential
privacy. Additionally, we relax common DP assumptions concerning the
distribution and boundedness of the original data. We prove theoretical results
for the privacy guarantee and provide simulations for the empirical failure
rate of the theoretical results under typical computational limitations. We
also give simulations for the accuracy of linear regression coefficients
generated from the synthetic data compared with the accuracy of
non-differentially private synthetic data and other differentially private
methods. Additionally, our theoretical results extend a prior result for the
sensitivity of the Gini Index to include continuous predictors.
Comment: 16 pages, 4 figures
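For reference, the pMSE that the mechanism above optimizes is commonly computed by pooling real and synthetic records, fitting a propensity-score model to distinguish them, and averaging the squared deviation of the predicted propensities from the synthetic fraction; lower pMSE indicates higher distributional similarity. The sketch below uses stand-in data and a logistic-regression propensity model rather than the paper's setup.

```python
# Propensity-score mean squared error (pMSE) between real and synthetic data.
import numpy as np
from sklearn.linear_model import LogisticRegression

def pmse(real, synthetic):
    X = np.vstack([real, synthetic])
    y = np.concatenate([np.zeros(len(real)), np.ones(len(synthetic))])
    c = len(synthetic) / len(X)                    # synthetic fraction
    model = LogisticRegression(max_iter=1000).fit(X, y)
    p = model.predict_proba(X)[:, 1]               # estimated propensity scores
    return np.mean((p - c) ** 2)

rng = np.random.default_rng(0)
real = rng.normal(size=(500, 5))
synth = rng.normal(loc=0.2, size=(500, 5))         # illustrative synthetic data
print(pmse(real, synth))
```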
Synthetic Data Generators: Sequential and Private
We study the sample complexity of private synthetic data generation over an
unbounded-size class of statistical queries, and show that any class that is
privately proper PAC learnable admits a private synthetic data generator
(perhaps non-efficient). Previous work on synthetic data generators focused on
the case that the query class is finite and obtained sample complexity bounds
that scale logarithmically with the size of the class. Here we construct a
private synthetic data generator whose sample complexity is independent of the
domain size, and we replace finiteness with the assumption that the class is
privately PAC learnable (a formally weaker task, hence we obtain an equivalence
between the two tasks).
Generative Models for Effective ML on Private, Decentralized Datasets
To improve real-world applications of machine learning, experienced modelers
develop intuition about their datasets, their models, and how the two interact.
Manual inspection of raw data - of representative samples, of outliers, of
misclassifications - is an essential tool in a) identifying and fixing problems
in the data, b) generating new modeling hypotheses, and c) assigning or
refining human-provided labels. However, manual data inspection is problematic
for privacy sensitive datasets, such as those representing the behavior of
real-world individuals. Furthermore, manual data inspection is impossible in
the increasingly important setting of federated learning, where raw examples
are stored at the edge and the modeler may only access aggregated outputs such
as metrics or model parameters. This paper demonstrates that generative models
- trained using federated methods and with formal differential privacy
guarantees - can be used effectively to debug many commonly occurring data
issues even when the data cannot be directly inspected. We explore these
methods in applications to text with differentially private federated RNNs and
to images using a novel algorithm for differentially private federated GANs.
Comment: 26 pages, 8 figures. Camera-ready ICLR 2020 version
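The federated, differentially private training referred to above revolves around a DP-FedAvg-style aggregation step: each client's model update is norm-clipped and Gaussian noise is added to the aggregate before it is applied to the shared generative model. The sketch below uses random placeholder updates and illustrative clip and noise values; the paper's actual algorithms for RNNs and GANs differ in detail.

```python
# DP-FedAvg-style aggregation of client updates for a shared generator.
import numpy as np

rng = np.random.default_rng(0)
dim, n_clients = 1000, 50
client_updates = [rng.normal(scale=0.1, size=dim) for _ in range(n_clients)]

clip_norm, noise_multiplier = 1.0, 0.8
clipped = [u * min(1.0, clip_norm / (np.linalg.norm(u) + 1e-12))
           for u in client_updates]                       # bound each client's influence
aggregate = np.sum(clipped, axis=0)
aggregate += rng.normal(scale=noise_multiplier * clip_norm, size=dim)  # Gaussian noise
global_update = aggregate / n_clients                     # applied to the shared model
```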
The Secret Sharer: Evaluating and Testing Unintended Memorization in Neural Networks
This paper describes a testing methodology for quantitatively assessing the
risk that rare or unique training-data sequences are unintentionally memorized
by generative sequence models---a common type of machine-learning model.
Because such models are sometimes trained on sensitive data (e.g., the text of
users' private messages), this methodology can benefit privacy by allowing
deep-learning practitioners to select means of training that minimize such
memorization.
In experiments, we show that unintended memorization is a persistent,
hard-to-avoid issue that can have serious consequences. Specifically, for
models trained without consideration of memorization, we describe new,
efficient procedures that can extract unique, secret sequences, such as credit
card numbers. We show that our testing strategy is a practical and easy-to-use
first line of defense, e.g., by describing its application to quantitatively
limit data exposure in Google's Smart Compose, a commercial text-completion
neural network trained on millions of users' email messages.
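The testing methodology above quantifies memorization with an exposure metric: insert a randomly chosen canary sequence into the training data, then rank its log-perplexity under the trained model against all candidate canaries. A sketch of the computation follows, with a random stand-in for the model's scores.

```python
# Exposure of a canary: log2(|candidates|) minus log2 of the canary's rank.
import numpy as np

rng = np.random.default_rng(0)
num_candidates = 10_000                       # size of the candidate space
scores = rng.normal(size=num_candidates)      # stand-in for model log-perplexities
canary_score = scores[0]                      # pretend index 0 is the inserted canary

rank = 1 + np.sum(scores < canary_score)      # rank among all candidates (1 = best)
exposure = np.log2(num_candidates) - np.log2(rank)
print(f"exposure = {exposure:.2f} bits (max {np.log2(num_candidates):.2f})")
```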
Security and Privacy Issues in Deep Learning
With the development of machine learning (ML), expectations for artificial
intelligence (AI) technology have been increasing daily. In particular, deep
neural networks have shown outstanding performance in many fields. Many
applications that are deeply involved in our daily life make significant
decisions based on the predictions or classifications of a deep learning (DL)
model. Hence, if a DL model produces mispredictions or misclassifications due
to malicious external influences, the consequences in real life can be severe.
Moreover, training DL models involves an enormous amount of data, and the
training data often include sensitive information. Therefore, DL models should
not expose the privacy of such data. In this paper, we review the
vulnerabilities of models and data privacy, and the defense methods that have
been developed, under the notion of secure and private AI (SPAI). We also
discuss current challenges and open issues.