Tight Lower Bounds for Differentially Private Selection
A pervasive task in the differential privacy literature is to select the $k$
items of "highest quality" out of a set of $d$ items, where the quality of each
item depends on a sensitive dataset that must be protected. Variants of this
task arise naturally in fundamental problems like feature selection and
hypothesis testing, and also as subroutines for many sophisticated
differentially private algorithms.
The standard approaches to these tasks---repeated use of the exponential
mechanism or the sparse vector technique---approximately solve this problem
given a dataset of samples. We provide a tight lower
bound for some very simple variants of the private selection problem. Our lower
bound shows that a sample of size is required
even to achieve a very minimal accuracy guarantee.
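For concreteness, the following is a minimal sketch (not code from the paper) of the first standard approach mentioned above: selecting $k$ of $d$ items by repeated use of the exponential mechanism, with the budget split evenly across rounds via basic composition. The function names and the budget split are illustrative choices.

    import numpy as np

    def exponential_mechanism(qualities, epsilon, sensitivity=1.0, rng=None):
        # Pick index i with probability proportional to
        # exp(epsilon * qualities[i] / (2 * sensitivity)).
        rng = np.random.default_rng() if rng is None else rng
        q = np.asarray(qualities, dtype=float)
        logits = epsilon * (q - q.max()) / (2.0 * sensitivity)  # stable exponentiation
        probs = np.exp(logits)
        probs /= probs.sum()
        return int(rng.choice(len(q), p=probs))

    def select_top_k(qualities, k, epsilon):
        # Repeated selection without replacement; each round spends epsilon / k.
        remaining = list(range(len(qualities)))
        chosen = []
        for _ in range(k):
            i = exponential_mechanism([qualities[j] for j in remaining], epsilon / k)
            chosen.append(remaining.pop(i))
        return chosen
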
Our results are based on an extension of the fingerprinting method to sparse
selection problems. Previously, the fingerprinting method has been used to
provide tight lower bounds for answering an entire set of $d$ queries, but
often only some much smaller set of $k$ queries are relevant. Our extension
allows us to prove lower bounds that depend on both the number of relevant
queries and the total number of queries.
Statistical Theory of Differentially Private Marginal-based Data Synthesis Algorithms
Marginal-based methods achieve promising performance in the synthetic data
competition hosted by the National Institute of Standards and Technology
(NIST). To deal with high-dimensional data, the distribution of synthetic data
is represented by a probabilistic graphical model (e.g., a Bayesian network),
while the raw data distribution is approximated by a collection of
low-dimensional marginals. Differential privacy (DP) is guaranteed by
introducing random noise to each low-dimensional marginal distribution. Despite
its promising performance in practice, the statistical properties of
marginal-based methods are rarely studied in the literature. In this paper, we
study DP data synthesis algorithms based on Bayesian networks (BN) from a
statistical perspective. We establish a rigorous accuracy guarantee for
BN-based algorithms, where the errors are measured by the total variation (TV)
distance or the $l^2$ distance. Related to downstream machine learning tasks,
an upper bound for the utility error of the DP synthetic data is also derived.
To complete the picture, we establish a lower bound for TV accuracy that holds
for every $\epsilon$-DP synthetic data generator.
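As an illustration of the noising step these marginal-based methods share (a sketch under my own assumptions, not code from the paper): release each low-dimensional marginal as a Laplace-noised contingency table and post-process it into a valid distribution. Attributes are assumed integer-coded; selecting which marginals to measure and fitting the Bayesian network are omitted.

    import numpy as np

    def noisy_marginal(data, cols, domain_sizes, epsilon, rng=None):
        # Release the marginal over `cols` under epsilon-DP. Adding or
        # removing one record changes the count table by 1 in L1 norm,
        # so Laplace(1/epsilon) noise per cell suffices.
        rng = np.random.default_rng() if rng is None else rng
        shape = tuple(domain_sizes[c] for c in cols)
        table = np.zeros(shape)
        for row in data:
            table[tuple(row[c] for c in cols)] += 1
        noisy = table + rng.laplace(scale=1.0 / epsilon, size=shape)
        noisy = np.clip(noisy, 0.0, None)       # post-processing is free
        return noisy / max(noisy.sum(), 1e-12)  # renormalize to a distribution
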
Privacy Amplification via Importance Sampling
We examine the privacy-enhancing properties of subsampling a data set via
importance sampling as a pre-processing step for differentially private
mechanisms. This extends the established privacy amplification by subsampling
result to importance sampling where each data point is weighted by the
reciprocal of its selection probability. The implications for privacy of
weighting each point are not obvious. On the one hand, a lower selection
probability leads to a stronger privacy amplification. On the other hand, the
higher the weight, the stronger the influence of the point on the output of the
mechanism in the event that the point does get selected. We provide a general
result that quantifies the trade-off between these two effects. We show that
heterogeneous sampling probabilities can lead to both stronger privacy and
better utility than uniform subsampling while retaining the subsample size. In
particular, we formulate and solve the problem of privacy-optimal sampling,
that is, finding the importance weights that minimize the expected subset size
subject to a given privacy budget. Empirically, we evaluate the privacy,
efficiency, and accuracy of importance sampling-based privacy amplification on
the example of k-means clustering.
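A minimal sketch of the sampling step described above (my own illustration; the paper's contribution is the privacy analysis of this step, which is not reproduced here): keep each point independently with its own probability and weight it by the reciprocal of that probability, so that weighted statistics on the subsample stay unbiased. The selection probabilities below are hypothetical.

    import numpy as np

    def importance_subsample(data, probs, rng=None):
        # Keep point i independently with probability probs[i]; weight
        # each kept point by 1 / probs[i] so weighted sums remain
        # unbiased estimates of the full-data sums.
        rng = np.random.default_rng() if rng is None else rng
        keep = rng.random(len(data)) < probs
        return data[keep], 1.0 / probs[keep]

    # Usage: a weighted sum on the subsample estimates the full sum.
    rng = np.random.default_rng(0)
    x = rng.normal(size=1000)
    p = np.clip(np.abs(x), 0.05, 1.0)   # hypothetical per-point probabilities
    subset, w = importance_subsample(x, p, rng)
    print(np.sum(w * subset), x.sum())  # close in expectation

Note how the sketch exposes the tension the abstract describes: a point with small probs[i] amplifies privacy when dropped, but carries a large weight 1 / probs[i] when kept.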
CoinPress: Practical Private Mean and Covariance Estimation
We present simple differentially private estimators for the mean and
covariance of multivariate sub-Gaussian data that are accurate at small sample
sizes. We demonstrate the effectiveness of our algorithms both theoretically
and empirically using synthetic and real-world datasets---showing that their
asymptotic error rates match the state-of-the-art theoretical bounds, and that
they concretely outperform all previous methods. Specifically, previous
estimators either have weak empirical accuracy at small sample sizes, perform
poorly for multivariate data, or require the user to provide strong a priori
estimates for the parameters.
Comment: Code is available at https://github.com/twistedcubic/coin-pres
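To convey the idea behind such estimators, here is a much-simplified one-dimensional caricature of iterative private mean estimation (my own sketch, not the paper's algorithm, which is multivariate and accounts privacy with Gaussian noise under zCDP; Laplace noise is used here for a self-contained pure-DP version, and all constants are illustrative): repeatedly clip the data to a current confidence interval, average with noise, and shrink the interval.

    import numpy as np

    def private_mean_1d(x, epsilon, r0, sigma, steps=3, rng=None):
        # x: samples with (assumed) sub-Gaussian scale <= sigma;
        # r0: crude a priori radius such that the true mean lies in [-r0, r0].
        # Each step spends epsilon / steps via basic composition.
        rng = np.random.default_rng() if rng is None else rng
        n, eps_step = len(x), epsilon / steps
        center, radius = 0.0, float(r0)
        for _ in range(steps):
            lo, hi = center - radius, center + radius
            clipped = np.clip(x, lo, hi)
            scale = (hi - lo) / (n * eps_step)  # sensitivity of the clipped mean
            center = clipped.mean() + rng.laplace(scale=scale)
            # Shrink toward sampling error plus noise (constants illustrative).
            radius = max(4.0 * sigma / np.sqrt(n) + 4.0 * scale, 1e-9)
        return center
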
Smooth Lower Bounds for Differentially Private Algorithms via Padding-and-Permuting Fingerprinting Codes
Fingerprinting arguments, first introduced by Bun, Ullman, and Vadhan (STOC
2014), are the most widely used method for establishing lower bounds on the
sample complexity or error of approximately differentially private (DP)
algorithms. Still, there are many problems in differential privacy for which we
don't know suitable lower bounds, and even for problems that we do, the lower
bounds are not smooth, and usually become vacuous when the error is larger than
some threshold.
In this work, we present a simple method to generate hard instances by
applying a padding-and-permuting transformation to a fingerprinting code. We
illustrate the applicability of this method by providing new lower bounds in
various settings:
1. A tight lower bound for DP averaging in the low-accuracy regime, which in
particular implies a new lower bound for the private 1-cluster problem
introduced by Nissim, Stemmer, and Vadhan (PODS 2016).
2. A lower bound on the additive error of DP algorithms for approximate
k-means clustering, as a function of the multiplicative error, which is tight
for a constant multiplicative error.
3. A lower bound for estimating the top singular vector of a matrix under DP
in low-accuracy regimes, which is a special case of DP subspace estimation
studied by Singhal and Steinke (NeurIPS 2021).
Our main technique is to apply a padding-and-permuting transformation to a
fingerprinting code. However, rather than proving our results using a black-box
access to an existing fingerprinting code (e.g., Tardos' code), we develop a
new fingerprinting lemma that is stronger than those of Dwork et al. (FOCS
2015) and Bun et al. (SODA 2017), and prove our lower bounds directly from the
lemma. Our lemma, in particular, gives a simpler fingerprinting code
construction with optimal rate (up to polylogarithmic factors) that is of
independent interest.
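To make the transformation concrete, here is a small sketch of one plausible reading of padding-and-permuting (an illustration under my own assumptions; the paper's construction fixes the padding lengths to target a given error regime, and its details may differ): append constant columns to every codeword of a fingerprinting code, then apply a single shared random permutation to the columns.

    import numpy as np

    def pad_and_permute(code, pad_ones, pad_zeros, rng=None):
        # code: (n, d) binary array, one fingerprinting codeword per row.
        # Append pad_ones all-one columns and pad_zeros all-zero columns,
        # then permute all columns with one shared random permutation.
        rng = np.random.default_rng() if rng is None else rng
        n, d = code.shape
        padded = np.hstack([code,
                            np.ones((n, pad_ones), dtype=code.dtype),
                            np.zeros((n, pad_zeros), dtype=code.dtype)])
        perm = rng.permutation(d + pad_ones + pad_zeros)
        return padded[:, perm]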