Optimal locally private estimation under $\ell_p$ loss for $1 \le p < \infty$
We consider the minimax estimation problem of a discrete distribution with
support size $k$ under local differential privacy constraints. A
privatization scheme is applied to each raw sample independently, and we need
to estimate the distribution of the raw samples from the privatized samples. A
positive number $\epsilon$ measures the privacy level of a privatization
scheme.
In our previous work (IEEE Trans. Inform. Theory, 2018), we proposed a family
of new privatization schemes and the corresponding estimator. We also proved
that our scheme and estimator are order optimal in the regime $e^{\epsilon} \ll k$ under both $\ell_2^2$ (mean square) and $\ell_1$ loss. In this paper, we
sharpen this result by showing asymptotic optimality of the proposed scheme
under the $\ell_p$ loss for all $1 \le p < \infty$. More precisely, we show that
for any $p \ge 1$, any $k$ and any $\epsilon$, the ratio between the
worst-case $\ell_p$ estimation loss of our scheme and the optimal value
approaches $1$ as the number of samples tends to infinity. The lower bound on
the minimax risk of private estimation that we establish as a part of the proof
is valid for any $\ell_p$ loss. Comment: This paper generalizes the optimality results of the preprint
arXiv:1708.00059 from $\ell_2^2$ loss to a broader class of loss functions. The new
approach taken here also results in a much shorter proof.
Communication Complexity in Locally Private Distribution Estimation and Heavy Hitters
We consider the problems of distribution estimation and heavy hitter
(frequency) estimation under privacy and communication constraints. While these
constraints have been studied separately, optimal schemes for one are
sub-optimal for the other. We propose a sample-optimal $\epsilon$-locally
differentially private (LDP) scheme for distribution estimation, where each
user communicates only one bit, and which requires no public randomness. We show that
Hadamard Response, a recently proposed scheme for $\epsilon$-LDP
distribution estimation, is also utility-optimal for heavy hitter estimation.
Finally, we show that unlike distribution estimation, where one bit suffices
without public randomness, any heavy hitter estimation algorithm that
communicates only a small number of bits from each user cannot be
optimal. Comment: ICML 2019
Locally Differentially Private Naive Bayes Classification
In machine learning, classification models need to be trained in order to
predict class labels. When the training data contains personal information
about individuals, collecting training data becomes difficult due to privacy
concerns. Local differential privacy quantifies individual privacy when there
is no trusted data curator. Individuals interact with an
untrusted data aggregator who obtains statistical information about the
population without learning personal data. In order to train a Naive Bayes
classifier in an untrusted setting, we propose to use methods satisfying local
differential privacy. Individuals send perturbed inputs that preserve the
relationship between feature values and class labels. The data aggregator
estimates all probabilities needed by the Naive Bayes classifier. Then, new
instances can be classified based on the estimated probabilities. We propose
solutions for both discrete and continuous data. To reduce the high amount
of noise and the communication cost in multi-dimensional data, we
propose utilizing dimensionality reduction techniques which can be applied by
individuals before perturbing their inputs. Our experimental results show that
the accuracy of the Naive Bayes classifier is maintained even when the
individual privacy is guaranteed under local differential privacy, and that
using dimensionality reduction enhances the accuracy.
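One hedged way to realize the pipeline above for a single discrete feature is to apply generalized randomized response to the joint (feature, class) cell, so that each report keeps the feature-label association, and let the aggregator invert the known perturbation. This is our own minimal sketch, not the paper's exact mechanism; the function names (`krr_perturb`, `naive_bayes_tables`) are hypothetical:

```python
import numpy as np

def krr_perturb(v, k, eps, rng):
    # generalized (k-ary) randomized response: keep the true value
    # w.p. e^eps / (e^eps + k - 1), else report a uniform other value
    p_keep = np.exp(eps) / (np.exp(eps) + k - 1)
    if rng.random() < p_keep:
        return v
    other = rng.integers(k - 1)
    return other if other < v else other + 1

def krr_frequencies(reports, k, eps):
    # invert the perturbation for unbiased frequency estimates:
    # observed_v = q + pi_v * (p - q)
    p = np.exp(eps) / (np.exp(eps) + k - 1)
    q = 1.0 / (np.exp(eps) + k - 1)
    obs = np.bincount(reports, minlength=k) / len(reports)
    return (obs - q) / (p - q)

def naive_bayes_tables(pairs, n_feature, n_class, eps, rng):
    # each user perturbs the joint (feature, class) cell, so the report keeps
    # the feature-label association; the aggregator recovers the joint table
    # and derives the P(class) and P(feature | class) tables the classifier needs
    k = n_feature * n_class
    reports = np.array([krr_perturb(f * n_class + c, k, eps, rng)
                        for f, c in pairs])
    joint = krr_frequencies(reports, k, eps).reshape(n_feature, n_class)
    joint = np.clip(joint, 1e-9, None)  # unbiased estimates may dip below zero
    prior = joint.sum(axis=0)
    cond = joint / prior                # P(feature | class)
    return prior / prior.sum(), cond
```

Classification then proceeds exactly as in ordinary Naive Bayes, using the estimated tables in place of empirical frequencies.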
Lower Bounds for Locally Private Estimation via Communication Complexity
We develop lower bounds for estimation under local privacy
constraints---including differential privacy and its relaxations to approximate
or R\'{e}nyi differential privacy---by showing an equivalence between private
estimation and communication-restricted estimation problems. Our results apply
to arbitrarily interactive privacy mechanisms, and they also give sharp lower
bounds for all levels of differential privacy protections, that is, privacy
mechanisms with privacy levels $\epsilon \in [0, \infty)$. As a particular
consequence of our results, we show that the minimax mean-squared error for
estimating the mean of a bounded or Gaussian random vector in $d$ dimensions
scales as $d/(n \min\{\epsilon, \epsilon^2\})$. Comment: To appear in Conference on Learning Theory 2019
Successive Refinement of Privacy
This work examines a novel question: how much randomness is needed to achieve
local differential privacy (LDP)? A motivating scenario is providing {\em
multiple levels of privacy} to multiple analysts, either for distribution estimation or
for heavy-hitter estimation, using the \emph{same} (randomized) output. We call
this setting \emph{successive refinement of privacy}, as it provides
hierarchical access to the raw data with different privacy levels. For example,
the same randomized output could enable one analyst to reconstruct the input,
while another can only estimate the distribution subject to LDP requirements.
This extends the classical Shannon (wiretap) security setting to local
differential privacy. We provide (order-wise) tight characterizations of
privacy-utility-randomness trade-offs in several cases for distribution
estimation, including the standard LDP setting under a randomness constraint.
We also provide a non-trivial privacy mechanism for multi-level privacy.
Furthermore, we show that we cannot reuse random keys over time while
preserving the privacy of each user.
Protection Against Reconstruction and Its Applications in Private Federated Learning
In large-scale statistical learning, data collection and model fitting are
moving increasingly toward peripheral devices---phones, watches, fitness
trackers---away from centralized data collection. Concomitant with this rise in
decentralized data are increasing challenges of maintaining privacy while
allowing enough information to fit accurate, useful statistical models. This
motivates local notions of privacy---most significantly, local differential
privacy, which provides strong protections against sensitive data
disclosures---where data is obfuscated before a statistician or learner can
even observe it, providing strong protections to individuals' data. Yet local
privacy as traditionally employed may prove too stringent for practical use,
especially in modern high-dimensional statistical and machine learning
problems. Consequently, we revisit the types of disclosures and adversaries
against which we provide protections, considering adversaries with limited
prior information and ensuring that, with high probability, they cannot
reconstruct an individual's data within useful tolerances. By reconceptualizing
these protections, we allow more useful data release---large privacy parameters
in local differential privacy---and we design new (minimax) optimal locally
differentially private mechanisms for statistical learning problems for
\emph{all} privacy levels. We thus present practicable approaches to
large-scale locally private model training that were previously impossible,
showing theoretically and empirically that we can fit large-scale image
classification and language models with little degradation in utility.
Differentially Private Testing of Identity and Closeness of Discrete Distributions
We study the fundamental problems of identity testing (goodness of fit), and
closeness testing (two-sample test) of distributions over $k$ elements, under
differential privacy. While the problems have a long history in statistics,
finite sample bounds for these problems have only been established recently.
In this work, we derive upper and lower bounds on the sample complexity of
both the problems under $\epsilon$-differential privacy. We
provide optimal sample complexity algorithms for the identity testing problem for
all parameter ranges, and the first results for closeness testing. Our
closeness testing bounds are optimal in the sparse regime, where the number of
samples is small relative to the domain size.
Our upper bounds are obtained by privatizing non-private estimators for these
problems. The non-private estimators are chosen to have small sensitivity. We
propose a general framework to establish lower bounds on the sample complexity
of statistical tasks under differential privacy. We show a lower bound for
differentially private algorithms in terms of a coupling between the two
hypothesis classes we aim to test. By constructing carefully chosen priors over
the hypothesis classes, and using Le Cam's two point theorem we provide a
general mechanism for proving lower bounds. We believe that the framework can
be used to obtain strong lower bounds for other statistical tasks under
privacy.
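The "privatize a low-sensitivity statistic" step described above can be illustrated with the classical Laplace mechanism. This is a generic sketch under our own naming, not the paper's specific test; the paper's tricks for keeping the statistic's sensitivity small are omitted:

```python
import numpy as np

def chi2_statistic(counts, expected):
    # a standard (non-private) identity-testing statistic; the modifications
    # that keep its sensitivity small are not reproduced here
    return float(((counts - expected) ** 2 / expected).sum())

def private_test(statistic, threshold, sensitivity, eps, rng):
    # Laplace mechanism: adding Lap(sensitivity / eps) noise to a statistic
    # whose value changes by at most `sensitivity` when one sample changes
    # yields an eps-differentially private accept/reject decision
    noisy = statistic + rng.laplace(scale=sensitivity / eps)
    return bool(noisy > threshold)
```

The added noise inflates the sample complexity, which is where the paper's upper bounds come from; the coupling argument lower-bounds how much inflation is unavoidable.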
Minimax Optimal Procedures for Locally Private Estimation
Working under a model of privacy in which data remains private even from the
statistician, we study the tradeoff between privacy guarantees and the risk of
the resulting statistical estimators. We develop private versions of classical
information-theoretic bounds, in particular those due to Le Cam, Fano, and
Assouad. These inequalities allow for a precise characterization of statistical
rates under local privacy constraints and the development of provably (minimax)
optimal estimation procedures. We provide a treatment of several canonical
families of problems: mean estimation and median estimation, generalized linear
models, and nonparametric density estimation. For all of these families, we
provide lower and upper bounds that match up to constant factors, and exhibit
new (optimal) privacy-preserving mechanisms and computationally efficient
estimators that achieve the bounds. Additionally, we present a variety of
experimental results for estimation problems involving sensitive data,
including salaries, censored blog posts and articles, and drug abuse; these
experiments demonstrate the importance of deriving optimal procedures. Comment: 64 pages, 8 figures. arXiv admin note: substantial text overlap with
arXiv:1302.320
Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication
We study the problem of estimating $k$-ary distributions under
$\epsilon$-local differential privacy. $n$ samples are distributed across
users who send privatized versions of their sample to a central server. All
previously known sample optimal algorithms require linear (in $k$)
communication from each user in the high privacy regime ($\epsilon = O(1)$),
and run in time that grows as $n \cdot k$, which can be prohibitive for large
domain size $k$.
We propose Hadamard Response (HR), a local privatization scheme that requires
no shared randomness and is symmetric with respect to the users. Our scheme has
order optimal sample complexity for all $\epsilon$, a communication of at
most $\log k + 2$ bits per user, and nearly linear running time of $\tilde{O}(n + k)$.
Our encoding and decoding are based on Hadamard matrices, and are simple to
implement. The statistical performance relies on the coding theoretic aspects
of Hadamard matrices, i.e., the large Hamming distance between the rows. An
efficient implementation of the algorithm using the Fast Walsh-Hadamard
transform gives the computational gains.
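The encoding and decoding just described can be sketched as follows. This is a simplified reading of a Hadamard-response-style mechanism, not the paper's exact construction, and the helper names are ours: each input $x$ maps to the set of columns where row $x+1$ of a Sylvester Hadamard matrix is $+1$, the privatized output lands inside that set with probability $e^{\epsilon}/(e^{\epsilon}+1)$, and the server inverts the resulting linear relation:

```python
import numpy as np

def hr_params(k, eps):
    # K: smallest power of two > k (row 0 of the matrix is all-ones, so skip it)
    K = 1
    while K < k + 1:
        K *= 2
    a = np.exp(eps) / (np.exp(eps) + 1.0)  # probability of landing inside S_x
    return K, a

def member_mask(x, K):
    # column y is in S_x iff the Sylvester Hadamard entry H[x+1, y] is +1,
    # i.e. popcount((x+1) & y) is even; every S_x has exactly K/2 columns
    return np.array([bin((x + 1) & y).count("1") % 2 == 0 for y in range(K)])

def privatize(x, K, a, rng):
    # output a uniform column of S_x w.p. a, else a uniform column outside S_x;
    # per-output likelihood ratio is a / (1 - a) = e^eps, giving eps-LDP
    mask = member_mask(x, K)
    pool = np.flatnonzero(mask if rng.random() < a else ~mask)
    return int(pool[rng.integers(pool.size)])

def estimate(outputs, k, K, a):
    # distinct nonzero rows agree on exactly K/4 columns, so
    # P(Y in S_x) = 1/2 + p_x * (a - 1/2); invert for an unbiased estimate
    counts = np.bincount(outputs, minlength=K)
    n = counts.sum()
    return [(counts[member_mask(x, K)].sum() / n - 0.5) / (a - 0.5)
            for x in range(k)]
```

Each output is one of $K \le 2(k+1)$ symbols, i.e. roughly $\log_2 k$ bits of communication, and the set-membership structure is what the Fast Walsh-Hadamard transform exploits for the speedup.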
We compare our approach with Randomized Response (RR), RAPPOR, and
subset-selection mechanisms (SS), both theoretically and experimentally. In
our experiments, our algorithm runs about 100x faster than SS and RAPPOR.
Context-Aware Local Differential Privacy
Local differential privacy (LDP) is a strong notion of privacy for individual
users that often comes at the expense of a significant drop in utility. The
classical definition of LDP assumes that all elements in the data domain are
equally sensitive. However, in many applications, some symbols are more
sensitive than others. This work proposes a context-aware framework of local
differential privacy that allows a privacy designer to incorporate the
application's context into the privacy definition. For binary data domains, we
provide a universally optimal privatization scheme and highlight its
connections to Warner's randomized response (RR) and Mangat's improved
response. Motivated by geolocation and web search applications, for $k$-ary
data domains, we consider two special cases of context-aware LDP:
block-structured LDP and high-low LDP. We study discrete distribution
estimation and provide communication-efficient, sample-optimal schemes and
information-theoretic lower bounds for both models. We show that using
contextual information can require fewer samples than classical LDP to achieve
the same accuracy.
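For the binary case, the contrast between Warner's randomized response and a Mangat-style one-sided response can be sketched as follows. This is our own illustrative reading: the one-sided scheme deliberately gives up the symmetric classical LDP guarantee for the non-sensitive symbol, which is exactly the kind of relaxation a context-aware definition makes precise; the function names are hypothetical:

```python
import numpy as np

def warner_rr(x, eps, rng):
    # Warner's randomized response: keep the true bit w.p. e^eps / (1 + e^eps)
    keep = np.exp(eps) / (1.0 + np.exp(eps))
    return x if rng.random() < keep else 1 - x

def estimate_warner(reports, eps):
    # invert P(report = 1) = (1 - keep) + pi * (2*keep - 1)
    keep = np.exp(eps) / (1.0 + np.exp(eps))
    return (np.mean(reports) - (1.0 - keep)) / (2.0 * keep - 1.0)

def mangat_response(x, eps, rng):
    # Mangat-style one-sided scheme: holders of the sensitive value (x = 1)
    # always report 1, while non-holders report 1 only with small probability;
    # holders of the sensitive symbol keep plausible deniability, but a report
    # of 0 reveals x = 0, trading classical LDP symmetry for utility
    q = 1.0 / (1.0 + np.exp(eps))
    return 1 if x == 1 else (1 if rng.random() < q else 0)

def estimate_mangat(reports, eps):
    # invert P(report = 1) = pi + (1 - pi) * q
    q = 1.0 / (1.0 + np.exp(eps))
    return (np.mean(reports) - q) / (1.0 - q)
```

Both estimators are unbiased for the sensitive proportion; the one-sided variant randomizes fewer reports and so has lower variance, illustrating why contextual information can reduce the samples needed for a given accuracy.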