14 research outputs found
Mutual Information Optimally Local Private Discrete Distribution Estimation
We consider statistical learning (e.g., discrete distribution estimation) under
local $\epsilon$-differential privacy, which preserves each data provider's
privacy locally, and aim to optimize statistical data utility under the privacy
constraints. Specifically, we study maximizing the mutual information between a
provider's data and its private view, and give the exact mutual information
bound along with a mechanism that attains it: the $k$-subset mechanism. The
mutual information optimal mechanism randomly outputs a size-$k$ subset of the
original data domain with a delicate probability assignment, where $k$ varies
with the privacy level $\epsilon$ and the data domain size $d$. After analysing
the limitations of existing local private mechanisms from a mutual information
perspective, we propose an efficient implementation of the $k$-subset mechanism
for discrete distribution estimation, and show its optimality guarantees over
existing approaches.
Comment: submitted to NIPS201
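A subset-style local mechanism of the kind described above could be sketched as follows. This is a sketch under stated assumptions, not the authors' implementation: the abstract does not give the probability assignment, so the code assumes that every size-k subset containing the true value is exp(eps) times as likely as every size-k subset that does not. The function name and its parameters (`x`, `d`, `k`, `eps`, `rng`) are hypothetical.

```python
import math
import random


def k_subset_privatize(x, d, k, eps, rng=random):
    """Return a size-k subset of range(d) as a private view of x.

    Assumption (not taken from the abstract): subsets that contain x are
    exp(eps) times as likely as subsets that do not contain x.
    """
    if not (0 <= x < d and 1 <= k <= d):
        raise ValueError("need 0 <= x < d and 1 <= k <= d")
    # Probability that the reported subset contains the true value x.
    weight = k * math.exp(eps)
    p_include = weight / (weight + d - k)
    others = [v for v in range(d) if v != x]
    if rng.random() < p_include:
        # x plus k-1 other values chosen uniformly without replacement.
        return frozenset([x] + rng.sample(others, k - 1))
    # k values chosen uniformly from the domain with x excluded.
    return frozenset(rng.sample(others, k))
```

Under this assumption the ratio between the probability of any fixed output subset under two different inputs is at most exp(eps), which is the local differential privacy condition. With k = 1 the sketch reduces to d-ary randomized response.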
Compressive Privatization: Sparse Distribution Estimation under Locally Differentially Privacy
We consider the problem of discrete distribution estimation under local
differential privacy. Distribution estimation is one of the most fundamental
estimation problems, which is widely studied in both non-private and private
settings. In the local model, private mechanisms with provably optimal sample
complexity are known. However, they are optimal only in the worst-case sense;
their sample complexity is proportional to the size of the entire universe,
which could be huge in practice (e.g., all IP addresses). We show that as long
as the target distribution is sparse or approximately sparse (e.g., highly
skewed), the number of samples needed could be significantly reduced. The
sample complexity of our new mechanism is characterized by the sparsity of the
target distribution and depends only weakly on the size of the universe. Our
mechanism does privatization and dimensionality reduction simultaneously, and
the sample complexity will only depend on the reduced dimensionality. The
original distribution is then recovered using tools from compressive sensing.
To complement our theoretical results, we conduct experimental studies whose
results clearly demonstrate the advantages of our method and confirm
our theoretical findings.
Differentially Private Testing of Identity and Closeness of Discrete Distributions
We study the fundamental problems of identity testing (goodness of fit) and
closeness testing (two-sample test) of distributions over $k$ elements, under
differential privacy. While the problems have a long history in statistics,
finite sample bounds for these problems have only been established recently.
In this work, we derive upper and lower bounds on the sample complexity of
both problems under $(\varepsilon, \delta)$-differential privacy. We
provide sample-optimal algorithms for the identity testing problem for
all parameter ranges, and the first results for closeness testing. Our
closeness testing bounds are optimal in the sparse regime where the number of
samples is at most $k$.
Our upper bounds are obtained by privatizing non-private estimators for these
problems. The non-private estimators are chosen to have small sensitivity. We
propose a general framework to establish lower bounds on the sample complexity
of statistical tasks under differential privacy. We show a bound on
differentially private algorithms in terms of a coupling between the two
hypothesis classes we aim to test. By constructing carefully chosen priors over
the hypothesis classes, and using Le Cam's two-point theorem, we provide a
general mechanism for proving lower bounds. We believe that the framework can
be used to obtain strong lower bounds for other statistical tasks under
privacy.
Locally Differentially Private Frequency Estimation with Consistency
Local Differential Privacy (LDP) protects user privacy from the data
collector. LDP protocols have been increasingly deployed in the industry. A
basic building block is frequency oracle (FO) protocols, which estimate
frequencies of values. While several FO protocols have been proposed, the
design goal does not lead to optimal results for answering many queries. In
this paper, we show that adding post-processing steps to FO protocols by
exploiting the knowledge that all individual frequencies should be non-negative
and they sum up to one can lead to significantly better accuracy for a wide
range of tasks, including frequencies of individual values, frequencies of the
most frequent values, and frequencies of subsets of values. We consider 10
different methods that exploit this knowledge differently. We establish
theoretical relationships between some of them and conduct extensive
experimental evaluations to understand which methods should be used for
different query tasks.
Comment: NDSS 202
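One simple way to enforce the two constraints named in this abstract (non-negative frequencies that sum to one) is the Euclidean projection onto the probability simplex: add a common shift to every estimate and clip negatives to zero, choosing the shift so that the result sums to one. The sketch below is illustrative only and is not claimed to be any specific one of the ten methods the paper compares; the function name `project_to_simplex` is hypothetical.

```python
def project_to_simplex(estimates):
    """Project noisy frequency estimates onto {f : f_i >= 0, sum(f) = 1}.

    Returns the closest point (in Euclidean distance) to `estimates` that
    is a valid probability vector.
    """
    if not estimates:
        raise ValueError("estimates must be non-empty")
    sorted_desc = sorted(estimates, reverse=True)
    cumulative = 0.0
    shift = 0.0
    for count, value in enumerate(sorted_desc, start=1):
        cumulative += value
        # Shift that would make the `count` largest entries sum to one.
        candidate = (1.0 - cumulative) / count
        # Keep the shift for the largest prefix that stays positive.
        if value + candidate > 0:
            shift = candidate
    return [max(v + shift, 0.0) for v in estimates]
```

For example, the raw estimates `[0.6, 0.5, -0.1]` become approximately `[0.55, 0.45, 0.0]`: the negative entry is clipped and the excess mass is removed equally from the remaining entries.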
Lower Bounds for Learning Distributions under Communication Constraints via Fisher Information
We consider the problem of learning high-dimensional, nonparametric and
structured (e.g. Gaussian) distributions in distributed networks, where each
node in the network observes an independent sample from the underlying
distribution and can use $k$ bits to communicate its sample to a central
processor. We consider three different models for communication. Under the
independent model, each node communicates its sample to a central processor by
independently encoding it into $k$ bits. Under the more general sequential or
blackboard communication models, nodes can share information interactively but
each node is restricted to write at most $k$ bits on the final transcript. We
characterize the impact of the communication constraint $k$ on the minimax risk
of estimating the underlying distribution under $\ell^2$ loss. We develop
minimax lower bounds that apply in a unified way to many common statistical
models and reveal that the impact of the communication constraint can be
qualitatively different depending on the tail behavior of the score function
associated with each model. A key ingredient in our proofs is a geometric
characterization of Fisher information from quantized samples.
Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication
We study the problem of estimating $k$-ary distributions under
$\varepsilon$-local differential privacy. $n$ samples are distributed across
users who send privatized versions of their sample to a central server. All
previously known sample optimal algorithms require linear (in $k$)
communication from each user in the high privacy regime $(\varepsilon = O(1))$,
and run in time that grows as $n \cdot k$, which can be prohibitive for large
domain size $k$.
We propose Hadamard Response (HR), a local privatization scheme that requires
no shared randomness and is symmetric with respect to the users. Our scheme has
order optimal sample complexity for all $\varepsilon$, a communication of at
most $\log k + 2$ bits per user, and nearly linear running time of $\tilde{O}(n + k)$.
Our encoding and decoding are based on Hadamard matrices, and are simple to
implement. The statistical performance relies on the coding theoretic aspects
of Hadamard matrices, i.e., the large Hamming distance between the rows. An
efficient implementation of the algorithm using the Fast Walsh-Hadamard
transform gives the computational gains.
We compare our approach with Randomized Response (RR), RAPPOR, and
subset-selection mechanisms (SS), both theoretically and experimentally. For
$k = 10000$, our algorithm runs about 100x faster than SS and RAPPOR.
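The abstract attributes the computational gains to the fast Walsh-Hadamard transform. A minimal sketch of that transform is given below; it shows only the transform, not the HR encoder or decoder, and the function name `fwht` is hypothetical. It multiplies a length-m vector (m a power of two) by the Sylvester-ordered Hadamard matrix in O(m log m) additions, instead of the O(m^2) operations of an explicit matrix product.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a sequence.

    The length must be a power of two. Returns a new list equal to H @ values,
    where H is the Sylvester-ordered Hadamard matrix with +1/-1 entries.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    half = 1
    while half < n:
        # Butterfly step: combine blocks of size `half` pairwise.
        for start in range(0, n, 2 * half):
            for i in range(start, start + half):
                a, b = out[i], out[i + half]
                out[i], out[i + half] = a + b, a - b
        half *= 2
    return out
```

Because the Hadamard matrix satisfies H H = m I, applying `fwht` twice returns the input scaled by its length, which gives a quick sanity check.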
Successive Refinement of Privacy
This work examines a novel question: how much randomness is needed to achieve
local differential privacy (LDP)? A motivating scenario is providing
\emph{multiple levels of privacy} to multiple analysts, either for distribution or
for heavy-hitter estimation, using the \emph{same} (randomized) output. We call
this setting \emph{successive refinement of privacy}, as it provides
hierarchical access to the raw data with different privacy levels. For example,
the same randomized output could enable one analyst to reconstruct the input,
while another can only estimate the distribution subject to LDP requirements.
This extends the classical Shannon (wiretap) security setting to local
differential privacy. We provide (order-wise) tight characterizations of
privacy-utility-randomness trade-offs in several cases for distribution
estimation, including the standard LDP setting under a randomness constraint.
We also provide a non-trivial privacy mechanism for multi-level privacy.
Furthermore, we show that we cannot reuse random keys over time while
preserving the privacy of each user.
Private Identity Testing for High-Dimensional Distributions
In this work we present novel differentially private identity
(goodness-of-fit) testers for natural and widely studied classes of
multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known
covariance and product distributions over $\{\pm 1\}^d$. Our testers have
improved sample complexity compared to those derived from previous techniques,
and are the first testers whose sample complexity matches the order-optimal
minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes.
We construct two types of testers, exhibiting tradeoffs between sample
complexity and computational complexity. Finally, we provide a two-way
reduction between testing a subclass of multivariate product distributions and
testing univariate distributions, and thereby obtain upper and lower bounds for
testing this subclass of product distributions.
Comment: Improved the bounds and the writing
Private Hypothesis Selection
We provide a differentially private algorithm for hypothesis selection. Given
samples from an unknown probability distribution $P$ and a set of $m$
probability distributions $\mathcal{H}$, the goal is to output, in an
$\varepsilon$-differentially private manner, a distribution from $\mathcal{H}$
whose total variation distance to $P$ is comparable to that of the best such
distribution (which we denote by $\alpha$). The sample complexity of our basic
algorithm is $O\left(\frac{\log m}{\alpha^2} + \frac{\log m}{\alpha \varepsilon}\right)$,
representing a minimal cost for privacy when compared to
the non-private algorithm. We can also handle infinite hypothesis classes
$\mathcal{H}$ by relaxing to $(\varepsilon, \delta)$-differential privacy.
We apply our hypothesis selection algorithm to give learning algorithms for a
number of natural distribution classes, including Gaussians, product
distributions, sums of independent random variables, piecewise polynomials, and
mixture classes. Our hypothesis selection procedure allows us to generically
convert a cover for a class to a learning algorithm, complementing known
learning lower bounds which are in terms of the size of the packing number of
the class. As the covering and packing numbers are often closely related, for
constant $\alpha$, our algorithms achieve the optimal sample complexity for
many classes of interest. Finally, we describe an application to private
distribution-free PAC learning.
Comment: Appeared in NeurIPS 2019. Final version to appear in IEEE
Transactions on Information Theory
Differentially Private Assouad, Fano, and Le Cam
Le Cam's method, Fano's inequality, and Assouad's lemma are three widely used
techniques to prove lower bounds for statistical estimation tasks. We propose
their analogues under central differential privacy. Our results are simple and
easy to apply, and we use them to establish sample complexity bounds in several
estimation tasks. We establish the optimal sample complexity of discrete
distribution estimation under total variation distance and $\ell_2$ distance.
We also provide lower bounds for several other distribution classes, including
product distributions and Gaussian mixtures that are tight up to logarithmic
factors. The technical component of our paper relates coupling between
distributions to the sample complexity of estimation under differential
privacy.