Differentially Private Multivariate Statistics with an Application to Contingency Table Analysis
Differential privacy (DP) has become a central, rigorous concept in privacy
protection over the past decade. Among the various notions of DP, $f$-DP is an
easily interpretable and informative concept that tightly captures the privacy
level by comparing trade-off functions obtained from a hypothesis test of
how well the mechanism recognizes individual information in the dataset. We
adopt Gaussian differential privacy (GDP), a canonical parametric family of
$f$-DP. The Gaussian mechanism is a natural and fundamental mechanism that
tightly achieves GDP. However, the ordinary multivariate Gaussian mechanism is
not optimal with respect to statistical utility. To improve the utility, we
develop the rank-deficient and James-Stein Gaussian mechanisms for releasing
private multivariate statistics based on the geometry of multivariate Gaussian
distribution. We show that our proposals satisfy GDP and dominate the ordinary
Gaussian mechanism with respect to the $L_2$-cost. We also show that the Laplace
mechanism, the prime mechanism in the $\epsilon$-DP framework, is sub-optimal
compared to Gaussian-type mechanisms under the GDP framework. For a fair
comparison, we calibrate the Laplace mechanism to the global sensitivity of the
statistic using an exact computation of its trade-off function. We also derive
the optimal parameter for the Laplace mechanism when applied to contingency
tables. Indeed, we show that the Gaussian-type mechanisms dominate the Laplace
mechanism in contingency table analysis. In addition, we apply our findings to
propose differentially private $\chi^2$-tests on contingency tables. Numerical
results demonstrate that the proposed differentially private parametric
bootstrap tests control the type I error rates and show higher power than other
natural competitors.
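As a point of reference for the mechanisms discussed above, a Gaussian mechanism achieves $\mu$-GDP by adding noise with standard deviation equal to the statistic's $L_2$-sensitivity divided by $\mu$. A minimal Python sketch (our illustration of the plain mechanism, not the paper's rank-deficient or James-Stein variants; the function names and the example sensitivity are ours):

```python
import numpy as np

def gaussian_mechanism(statistic, l2_sensitivity, mu, rng=None):
    # Add i.i.d. N(0, (l2_sensitivity / mu)^2) noise to each coordinate.
    # When l2_sensitivity bounds the L2 distance between the statistic on
    # any two neighboring datasets, the release satisfies mu-GDP.
    rng = rng or np.random.default_rng()
    sigma = l2_sensitivity / mu
    return np.asarray(statistic, dtype=float) + rng.normal(scale=sigma,
                                                           size=np.shape(statistic))

# Example: a flattened 2x2 contingency table. If a neighboring dataset
# replaces one record, one cell count drops by 1 and another rises by 1,
# so the L2 sensitivity is sqrt(2).
table = np.array([30.0, 12.0, 8.0, 25.0])
private_table = gaussian_mechanism(table, l2_sensitivity=np.sqrt(2), mu=1.0)
```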
DPpack: An R Package for Differentially Private Statistical Analysis and Machine Learning
Differential privacy (DP) is the state-of-the-art framework for guaranteeing
privacy for individuals when releasing aggregated statistics or building
statistical/machine learning models from data. We develop the open-source R
package DPpack, which provides a broad toolkit for differentially private
analysis. The current version of DPpack implements three popular mechanisms for
ensuring DP: Laplace, Gaussian, and exponential. Beyond that, DPpack provides a
large toolkit of easily accessible privacy-preserving descriptive statistics
functions. These include mean, variance, covariance, and quantiles, as well as
histograms and contingency tables. Finally, DPpack provides user-friendly
implementations of privacy-preserving versions of logistic regression, SVM, and
linear regression, as well as differentially private hyperparameter tuning for
each of these models. This extensive collection of implemented differentially
private statistics and models permits hassle-free utilization of differential
privacy principles in commonly performed statistical analyses. We plan to
continue developing DPpack and to make it more comprehensive by including more
differentially private machine learning techniques and tools for statistical
modeling and inference.
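As a generic illustration of the kind of routine such a package wraps (this is not DPpack's actual API; all names here are ours), a Laplace-mechanism mean for data clamped to a known range can be sketched as:

```python
import numpy as np

def dp_mean(x, lower, upper, eps, rng=None):
    # Clamp each value to [lower, upper] so that one record changes the sum
    # by at most (upper - lower); the mean then has global sensitivity
    # (upper - lower) / n, and Laplace noise at scale sensitivity / eps
    # gives eps-differential privacy.
    rng = rng or np.random.default_rng()
    x = np.clip(np.asarray(x, dtype=float), lower, upper)
    sensitivity = (upper - lower) / len(x)
    return x.mean() + rng.laplace(scale=sensitivity / eps)
```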
Differentially Private Publication of Sparse Data
The problem of privately releasing data is to provide a version of a dataset
without revealing sensitive information about the individuals who contribute to
the data. The model of differential privacy allows such private release while
providing strong guarantees on the output. A basic mechanism achieves
differential privacy by adding noise to the frequency counts in the contingency
tables (or a subset of the count data cube) derived from the dataset. However,
when the dataset is sparse in its underlying space, as is the case for most
multi-attribute relations, the effect of adding noise is to vastly
increase the size of the published data: it implicitly creates a huge number of
dummy data points to mask the true data, making it almost impossible to work
with.
We present techniques to overcome this roadblock and allow efficient private
release of sparse data, while maintaining the guarantees of differential
privacy. Our approach is to release a compact summary of the noisy data.
Generating the noisy data and then summarizing it would still be very costly,
so we show how to shortcut this step, and instead directly generate the summary
from the input data, without materializing the vast intermediate noisy data. We
instantiate this outline for a variety of sampling and filtering methods, and
show how to use the resulting summary for approximate, private, query
answering. Our experimental study shows that this is an effective, practical
solution, with comparable and occasionally improved utility over the costly
materialization approach.
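One way to picture the shortcut (a hypothetical sketch using two-sided geometric noise and a fixed filtering threshold; the paper instantiates several sampling and filtering methods, and all names here are ours): noisy counts for the few nonzero cells are drawn directly, while the number of zero cells whose noise clears the threshold is itself a binomial draw, so the vast zero part of the domain is never materialized.

```python
import numpy as np

def two_sided_geometric(alpha, size=None, rng=None):
    # Two-sided geometric ("discrete Laplace") noise: P(Z = z) ~ alpha^|z|.
    # Adding it to a count (sensitivity 1) gives eps-DP for alpha = exp(-eps).
    rng = rng or np.random.default_rng()
    return (rng.geometric(1 - alpha, size=size) -
            rng.geometric(1 - alpha, size=size))

def private_sparse_summary(nonzero, n_zero_cells, eps, threshold, rng=None):
    rng = rng or np.random.default_rng()
    alpha = np.exp(-eps)
    summary = {}
    # Noise the few nonzero counts directly; keep those above the threshold.
    for cell, count in nonzero.items():
        noisy = count + two_sided_geometric(alpha, rng=rng)
        if noisy >= threshold:
            summary[cell] = int(noisy)
    # Shortcut for the huge zero part of the domain: the number of zero cells
    # whose noisy count reaches the threshold is Binomial(n, p) with
    # p = P(Z >= threshold) = alpha^threshold / (1 + alpha).
    p = alpha**threshold / (1 + alpha)
    n_pass = rng.binomial(n_zero_cells, p)
    # Conditional on clearing the threshold, the excess is geometric, so the
    # surviving dummy counts are drawn without touching the other zero cells.
    excess = rng.geometric(1 - alpha, size=n_pass) - 1
    for k, e in enumerate(excess):
        # Cell ids would be drawn uniformly from the zero cells; we use
        # placeholder keys here.
        summary[("dummy", k)] = threshold + int(e)
    return summary
```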
Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies
The protection of privacy of individual-level information in genome-wide
association study (GWAS) databases has been a major concern of researchers
following the publication of "an attack" on GWAS data by Homer et al. (2008).
Traditional statistical methods for confidentiality and privacy protection of
statistical databases do not scale well to deal with GWAS data, especially in
terms of guarantees regarding protection from linkage to external information.
The more recent concept of differential privacy, introduced by the
cryptographic community, is an approach that provides a rigorous definition of
privacy with meaningful privacy guarantees in the presence of arbitrary
external information, although the guarantees may come at a serious price in
terms of data utility. Building on such notions, Uhler et al. (2013) proposed
new methods to release aggregate GWAS data without compromising an individual's
privacy. We extend the methods developed in Uhler et al. (2013) for releasing
differentially-private $\chi^2$-statistics by allowing for an arbitrary number
of cases and controls, and for releasing differentially-private allelic test
statistics. We also provide a new interpretation by assuming the controls' data
are known, which is a realistic assumption because some GWAS use publicly
available data as controls. We assess the performance of the proposed methods
through a risk-utility analysis on a real data set consisting of DNA samples
collected by the Wellcome Trust Case Control Consortium and compare the methods
with the differentially-private release mechanism proposed by Johnson and
Shmatikov (2013).
Comment: 28 pages, 2 figures, source code available upon request.
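Schematically, once a global sensitivity bound for the test statistic has been derived (which is the technical core of this line of work), the release step reduces to the Laplace mechanism. A hedged sketch, with the sensitivity bound taken as given rather than computed:

```python
import numpy as np

def release_statistic(stat_value, sensitivity, eps, rng=None):
    # eps-DP release of a scalar test statistic whose global sensitivity
    # (its maximum change between neighboring GWAS datasets) is `sensitivity`.
    rng = rng or np.random.default_rng()
    return stat_value + rng.laplace(scale=sensitivity / eps)
```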
Accurate and Efficient Private Release of Datacubes and Contingency Tables
A central problem in releasing aggregate information about sensitive data is
to do so accurately while providing a privacy guarantee on the output. Recent
work focuses on the class of linear queries, which include basic counting
queries, data cubes, and contingency tables. The goal is to maximize the
utility of the released answers while giving a rigorous privacy guarantee. Most
results follow a common template: pick a "strategy" set of linear queries to
apply to the data, then use the noisy answers to these queries to reconstruct
the queries of interest. This entails either picking a strategy set that is
hoped to be good for the queries, or performing a costly search over the space
of all possible strategies.
In this paper, we propose a new approach that balances accuracy and
efficiency: we show how to improve the accuracy of a given query set by
answering some strategy queries more accurately than others. This leads to an
efficient optimal noise allocation for many popular strategies, including
wavelets, hierarchies, Fourier coefficients and more. For the important case of
marginal queries we show that this strictly improves on previous methods, both
analytically and empirically. Our results also extend to ensuring that the
returned query answers are consistent with an (unknown) data set at minimal
extra cost in terms of time and noise.
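The common template can be made concrete as follows (a generic matrix-mechanism sketch with uniform Laplace noise across the strategy rows; the paper's contribution is precisely to allocate noise non-uniformly, which this sketch does not attempt):

```python
import numpy as np

def strategy_mechanism(x, W, A, eps, rng=None):
    # Answer the strategy queries A @ x with Laplace noise, then reconstruct
    # the workload answers W @ x by least squares. The L1 sensitivity of
    # x -> A @ x is the largest column L1-norm of A.
    rng = rng or np.random.default_rng()
    sensitivity = np.abs(A).sum(axis=0).max()
    noisy = A @ x + rng.laplace(scale=sensitivity / eps, size=A.shape[0])
    x_hat, *_ = np.linalg.lstsq(A, noisy, rcond=None)
    return W @ x_hat

# Example: all prefix-sum queries over 8 counts, answered through the
# prefix matrix itself as the strategy.
x = np.array([3., 0., 5., 2., 1., 0., 4., 2.])
W = np.tril(np.ones((8, 8)))
answers = strategy_mechanism(x, W, A=W, eps=1.0)
```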
On the Differential Privacy of Bayesian Inference
We study how to communicate findings of Bayesian inference to third parties,
while preserving the strong guarantee of differential privacy. Our main
contributions are four different algorithms for private Bayesian inference on
probabilistic graphical models. These include two mechanisms for adding noise
to the Bayesian updates, either directly to the posterior parameters, or to
their Fourier transform so as to preserve update consistency. We also utilise a
recently introduced posterior sampling mechanism, for which we prove bounds for
the specific but general case of discrete Bayesian networks; and we introduce a
maximum-a-posteriori private mechanism. Our analysis includes utility and
privacy bounds, with a novel focus on the influence of graph structure on
privacy. Worked examples and experiments with Bayesian naïve Bayes and
Bayesian linear regression illustrate the application of our mechanisms.
Comment: AAAI 2016, Feb 2016, Phoenix, Arizona, United States.
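To give a flavor of the "noise on the posterior parameters" idea: in a Beta-Bernoulli model the posterior depends on the data only through the success count, so perturbing that sufficient statistic privatizes the whole update. A hypothetical sketch (not the authors' exact algorithm; names are ours):

```python
import numpy as np

def private_beta_posterior(data, eps, alpha0=1.0, beta0=1.0, rng=None):
    # data: array of 0/1 observations. The Beta posterior depends on the data
    # only through the success count, and one record changes that count by at
    # most 1, so Laplace(1/eps) noise on it yields an eps-DP posterior.
    rng = rng or np.random.default_rng()
    n = len(data)
    k = float(np.sum(data)) + rng.laplace(scale=1.0 / eps)
    k = min(max(k, 0.0), float(n))      # project back to a feasible count
    return alpha0 + k, beta0 + (n - k)  # parameters of the private posterior
```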
Differentially Private Exponential Random Graphs
We propose methods to release and analyze synthetic graphs in order to
protect the privacy of individual relationships captured by a social network.
Proposed techniques aim at fitting and estimating a wide class of exponential
random graph models (ERGMs) in a differentially private manner, and thus offer
rigorous privacy guarantees. More specifically, we use the randomized response
mechanism to release networks under $\epsilon$-edge differential privacy. To
maintain utility for statistical inference, treating the original graph as
missing, we propose a way to use likelihood-based inference and Markov chain
Monte Carlo (MCMC) techniques to fit ERGMs to the produced synthetic networks.
We demonstrate the usefulness of the proposed techniques on a real data
example.
Comment: minor edits.
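The randomized-response release step itself is simple to state: each potential edge is reported truthfully with probability $e^\epsilon/(1+e^\epsilon)$ and flipped otherwise. A minimal sketch on an adjacency matrix (our illustration; the paper's contribution is the likelihood-based ERGM inference on the perturbed graph):

```python
import numpy as np

def randomized_response_graph(adj, eps, rng=None):
    # adj: symmetric 0/1 adjacency matrix with a zero diagonal. Each dyad
    # keeps its true value with probability exp(eps) / (1 + exp(eps)) and
    # is flipped otherwise, which satisfies eps-edge differential privacy.
    rng = rng or np.random.default_rng()
    p_keep = np.exp(eps) / (1.0 + np.exp(eps))
    i, j = np.triu_indices(adj.shape[0], k=1)   # sample each dyad once
    flip = rng.random(i.size) >= p_keep
    out = adj.copy()
    out[i, j] = np.where(flip, 1 - adj[i, j], adj[i, j])
    out[j, i] = out[i, j]                       # keep the matrix symmetric
    return out
```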