
14 research outputs found

Mutual Information Optimally Local Private Discrete Distribution Estimation

Author: Huang Liusheng
Li Xiang-Yang
Nie Yiwen
Qiao Chunming
Wang Pengzhan
Wang Shaowei
Xu Hongli
Yang Wei
Publication venue
Publication date: 27/07/2016
Field of study

We consider statistical learning (e.g., discrete distribution estimation) under local $\epsilon$-differential privacy, which preserves each data provider's privacy locally, and aim to optimize statistical data utility under the privacy constraints. Specifically, we study maximizing the mutual information between a provider's data and its private view, and give the exact mutual information bound along with a mechanism that attains it: the $k$-subset mechanism. This mutual-information-optimal mechanism randomly outputs a size-$k$ subset of the original data domain with a carefully chosen probability assignment, where $k$ varies with the privacy level $\epsilon$ and the data domain size $d$. After analysing the limitations of existing locally private mechanisms from a mutual information perspective, we propose an efficient implementation of the $k$-subset mechanism for discrete distribution estimation, and show its optimality guarantees over existing approaches. Comment: submitted to NIPS201
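The subset-output idea described in this abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names `k_subset_privatize` and `k_subset_estimate` are hypothetical, the inclusion probability is the standard one that makes the ratio of output probabilities equal to $e^{\epsilon}$, and the choice of $k$ is left to the caller because the abstract does not state the rule.

```python
import math
import random


def k_subset_privatize(value, d, k, epsilon, rng=random):
    """Report a size-k subset of range(d) in place of the true `value`.

    Assumes 1 <= k < d. The true value is included with probability
    k*e^eps / (k*e^eps + d - k); the remaining slots are filled uniformly.
    """
    weight = k * math.exp(epsilon)
    p_include = weight / (weight + d - k)
    others = [x for x in range(d) if x != value]
    if rng.random() < p_include:
        return frozenset([value] + rng.sample(others, k - 1))
    return frozenset(rng.sample(others, k))


def k_subset_estimate(reports, d, k, epsilon):
    """Unbiased estimate of the distribution from the reported subsets."""
    n = len(reports)
    weight = k * math.exp(epsilon)
    p = weight / (weight + d - k)   # P(item in report | item is the true value)
    q = (k - p) / (d - 1)           # P(item in report | item is not the true value)
    counts = [0] * d
    for subset in reports:
        for item in subset:
            counts[item] += 1
    # Invert E[count/n] = theta*p + (1 - theta)*q for each item.
    return [(c / n - q) / (p - q) for c in counts]


if __name__ == "__main__":
    rng = random.Random(0)
    data = rng.choices(range(5), weights=[5, 2, 1, 1, 1], k=10000)
    reports = [k_subset_privatize(v, 5, 2, 1.0, rng) for v in data]
    print(k_subset_estimate(reports, 5, 2, 1.0))
```

The estimates always sum to one but individual entries may be negative for small samples, since the estimator only debiases the raw inclusion counts.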

arXiv.org e-Print Archive

Compressive Privatization: Sparse Distribution Estimation under Locally Differentially Privacy

Author: Huang Zengfeng
Mao Xiaojun
Wang Jian
Xiong Zhongzheng
Ying Shan
Publication venue
Publication date: 03/12/2020
Field of study

We consider the problem of discrete distribution estimation under local differential privacy. Distribution estimation is one of the most fundamental estimation problems and is widely studied in both non-private and private settings. In the local model, private mechanisms with provably optimal sample complexity are known. However, they are optimal only in the worst-case sense: their sample complexity is proportional to the size of the entire universe, which could be huge in practice (e.g., all IP addresses). We show that as long as the target distribution is sparse or approximately sparse (e.g., highly skewed), the number of samples needed can be significantly reduced. The sample complexity of our new mechanism is characterized by the sparsity of the target distribution and depends only weakly on the size of the universe. Our mechanism performs privatization and dimensionality reduction simultaneously, and the sample complexity depends only on the reduced dimensionality. The original distribution is then recovered using tools from compressive sensing. To complement our theoretical results, we conduct experimental studies, the results of which clearly demonstrate the advantages of our method and confirm our theoretical findings.

arXiv.org e-Print Archive

Differentially Private Testing of Identity and Closeness of Discrete Distributions

Author: Acharya Jayadev
Sun Ziteng
Zhang Huanyu
Publication venue
Publication date: 31/10/2017
Field of study

We study the fundamental problems of identity testing (goodness of fit) and closeness testing (two-sample test) of distributions over $k$ elements under differential privacy. While these problems have a long history in statistics, finite-sample bounds for them have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both problems under $(\varepsilon, \delta)$-differential privacy. We provide algorithms with optimal sample complexity for the identity testing problem for all parameter ranges, and the first results for closeness testing. Our closeness testing bounds are optimal in the sparse regime where the number of samples is at most $k$. Our upper bounds are obtained by privatizing non-private estimators for these problems; the non-private estimators are chosen to have small sensitivity. We propose a general framework to establish lower bounds on the sample complexity of statistical tasks under differential privacy. We show a bound on differentially private algorithms in terms of a coupling between the two hypothesis classes we aim to test. By constructing carefully chosen priors over the hypothesis classes and using Le Cam's two-point theorem, we provide a general mechanism for proving lower bounds. We believe that the framework can be used to obtain strong lower bounds for other statistical tasks under privacy.
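The remark about privatizing low-sensitivity estimators can be illustrated with the textbook Laplace mechanism. This is a generic sketch and an assumption on our part: the abstract does not say which noise mechanism the paper uses, the Laplace mechanism gives pure $\varepsilon$-differential privacy rather than the $(\varepsilon, \delta)$ variant, and the names `laplace_noise` and `privatize_statistic` are hypothetical.

```python
import random


def laplace_noise(scale, rng=random):
    """Draw Laplace(0, scale) noise.

    The difference of two independent exponentials with mean `scale`
    follows a Laplace distribution with that scale.
    """
    return rng.expovariate(1.0 / scale) - rng.expovariate(1.0 / scale)


def privatize_statistic(value, sensitivity, epsilon, rng=random):
    """Release `value` with Laplace noise of scale sensitivity / epsilon.

    `sensitivity` is the largest change in the statistic when one sample
    is replaced; a smaller sensitivity means less noise for the same epsilon.
    """
    if sensitivity <= 0 or epsilon <= 0:
        raise ValueError("sensitivity and epsilon must be positive")
    return value + laplace_noise(sensitivity / epsilon, rng)


if __name__ == "__main__":
    rng = random.Random(0)
    print(privatize_statistic(10.0, sensitivity=1.0, epsilon=0.5, rng=rng))
```

The noise scale grows linearly with the sensitivity, which is why a test statistic with small sensitivity is the natural one to privatize.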

arXiv.org e-Print Archive

Locally Differentially Private Frequency Estimation with Consistency

Author: Li Ninghui
Li Zitao
Lopuhaä-Zwakenberg Milan
Skoric Boris
Wang Tianhao
Publication venue
Publication date: 29/01/2020
Field of study

Local Differential Privacy (LDP) protects user privacy from the data collector. LDP protocols have been increasingly deployed in industry. A basic building block is the frequency oracle (FO) protocol, which estimates frequencies of values. While several FO protocols have been proposed, their design goal does not lead to optimal results for answering many queries. In this paper, we show that adding post-processing steps to FO protocols, exploiting the knowledge that all individual frequencies should be non-negative and sum to one, can lead to significantly better accuracy for a wide range of tasks, including frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We consider 10 different methods that exploit this knowledge differently. We establish theoretical relationships between some of them and conduct extensive experimental evaluations to understand which methods should be used for different query tasks. Comment: NDSS 202
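One standard way to enforce both constraints mentioned in this abstract (non-negativity and summing to one) is Euclidean projection onto the probability simplex. The sketch below uses the usual sort-based algorithm; the function name `project_to_simplex` is hypothetical, and the abstract does not say whether this corresponds to any of the ten methods the paper compares.

```python
def project_to_simplex(freqs):
    """Return the closest (in Euclidean distance) non-negative vector summing to 1."""
    sorted_desc = sorted(freqs, reverse=True)
    cumulative = 0.0
    shift = 0.0
    for i, value in enumerate(sorted_desc, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / i
        # Keep the shift computed from the largest prefix whose smallest
        # element stays positive after the shift is subtracted.
        if value - candidate > 0:
            shift = candidate
    # Subtract the common shift and clip the remaining negatives to zero.
    return [max(f - shift, 0.0) for f in freqs]


if __name__ == "__main__":
    noisy = [0.6, 0.5, -0.1]  # raw estimates: one is negative, they sum to 1.0
    print(project_to_simplex(noisy))
```

Applied to raw frequency-oracle estimates, this removes negative entries and restores the unit sum in a single pass after sorting.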

arXiv.org e-Print Archive

Lower Bounds for Learning Distributions under Communication Constraints via Fisher Information

Author: Barnes Leighton Pate
Han Yanjun
Ozgur Ayfer
Publication venue
Publication date: 31/05/2019
Field of study

We consider the problem of learning high-dimensional, nonparametric and structured (e.g. Gaussian) distributions in distributed networks, where each node in the network observes an independent sample from the underlying distribution and can use $k$ bits to communicate its sample to a central processor. We consider three different models for communication. Under the independent model, each node communicates its sample to a central processor by independently encoding it into $k$ bits. Under the more general sequential or blackboard communication models, nodes can share information interactively but each node is restricted to write at most $k$ bits on the final transcript. We characterize the impact of the communication constraint $k$ on the minimax risk of estimating the underlying distribution under $\ell^2$ loss. We develop minimax lower bounds that apply in a unified way to many common statistical models and reveal that the impact of the communication constraint can be qualitatively different depending on the tail behavior of the score function associated with each model. A key ingredient in our proofs is a geometric characterization of Fisher information from quantized samples.

arXiv.org e-Print Archive

Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication

Author: Acharya Jayadev
Sun Ziteng
Zhang Huanyu
Publication venue
Publication date: 27/06/2018
Field of study

We study the problem of estimating $k$-ary distributions under $\varepsilon$-local differential privacy. $n$ samples are distributed across users who send privatized versions of their sample to a central server. All previously known sample-optimal algorithms require linear (in $k$) communication from each user in the high privacy regime $(\varepsilon=O(1))$, and run in time that grows as $n\cdot k$, which can be prohibitive for large domain size $k$. We propose Hadamard Response (HR), a local privatization scheme that requires no shared randomness and is symmetric with respect to the users. Our scheme has order-optimal sample complexity for all $\varepsilon$, a communication of at most $\log k+2$ bits per user, and nearly linear running time of $\tilde{O}(n + k)$. Our encoding and decoding are based on Hadamard matrices and are simple to implement. The statistical performance relies on the coding-theoretic aspects of Hadamard matrices, i.e., the large Hamming distance between the rows. An efficient implementation of the algorithm using the Fast Walsh-Hadamard transform gives the computational gains. We compare our approach with Randomized Response (RR), RAPPOR, and subset-selection mechanisms (SS), both theoretically and experimentally. For $k=10000$, our algorithm runs about 100x faster than SS and RAPPOR.
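The Fast Walsh-Hadamard transform named in this abstract can be sketched as below. This shows only the generic transform (Sylvester ordering, unnormalized), not the Hadamard Response encoder or decoder; the function name `fwht` is hypothetical. It multiplies a length-$m$ vector by the $m \times m$ Hadamard matrix in $O(m \log m)$ additions instead of $O(m^2)$.

```python
def fwht(values):
    """Return H @ values for the Sylvester Hadamard matrix H, without building H.

    The input length must be a power of two. The transform is unnormalized,
    so applying it twice multiplies the input by its length.
    """
    n = len(values)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    out = list(values)
    h = 1
    while h < n:
        # Butterfly step: combine blocks of size h into blocks of size 2h.
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    return out


if __name__ == "__main__":
    x = [1, 2, 3, 4]
    print(fwht(x))  # [10, -2, -4, 0]
```

Avoiding the explicit matrix is what makes a decoding time that is nearly linear in the domain size plausible for Hadamard-based schemes.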

arXiv.org e-Print Archive

Successive Refinement of Privacy

Author: Chaudhuri Kamalika
Data Deepesh
Diggavi Suhas
Fragouli Christina
Girgis Antonious M.
Publication venue
Publication date: 24/05/2020
Field of study

This work examines a novel question: how much randomness is needed to achieve local differential privacy (LDP)? A motivating scenario is providing multiple levels of privacy to multiple analysts, either for distribution or for heavy-hitter estimation, using the same (randomized) output. We call this setting successive refinement of privacy, as it provides hierarchical access to the raw data with different privacy levels. For example, the same randomized output could enable one analyst to reconstruct the input, while another can only estimate the distribution subject to LDP requirements. This extends the classical Shannon (wiretap) security setting to local differential privacy. We provide (order-wise) tight characterizations of privacy-utility-randomness trade-offs in several cases for distribution estimation, including the standard LDP setting under a randomness constraint. We also provide a non-trivial privacy mechanism for multi-level privacy. Furthermore, we show that we cannot reuse random keys over time while preserving privacy of each user.

arXiv.org e-Print Archive

Private Identity Testing for High-Dimensional Distributions

Author: Canonne Clément L.
Kamath Gautam
McMillan Audra
Ullman Jonathan
Zakynthinou Lydia
Publication venue
Publication date: 04/11/2019
Field of study

In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions. Comment: Improved the bounds and the writin

arXiv.org e-Print Archive

Private Hypothesis Selection

Author: Bun Mark
Kamath Gautam
Steinke Thomas
Wu Zhiwei Steven
Publication venue
Publication date: 04/01/2021
Field of study

We provide a differentially private algorithm for hypothesis selection. Given samples from an unknown probability distribution $P$ and a set of $m$ probability distributions $\mathcal{H}$, the goal is to output, in an $\varepsilon$-differentially private manner, a distribution from $\mathcal{H}$ whose total variation distance to $P$ is comparable to that of the best such distribution (which we denote by $\alpha$). The sample complexity of our basic algorithm is $O\left(\frac{\log m}{\alpha^2} + \frac{\log m}{\alpha \varepsilon}\right)$, representing a minimal cost for privacy when compared to the non-private algorithm. We can also handle infinite hypothesis classes $\mathcal{H}$ by relaxing to $(\varepsilon,\delta)$-differential privacy. We apply our hypothesis selection algorithm to give learning algorithms for a number of natural distribution classes, including Gaussians, product distributions, sums of independent random variables, piecewise polynomials, and mixture classes. Our hypothesis selection procedure allows us to generically convert a cover for a class to a learning algorithm, complementing known learning lower bounds which are in terms of the size of the packing number of the class. As the covering and packing numbers are often closely related, for constant $\alpha$, our algorithms achieve the optimal sample complexity for many classes of interest. Finally, we describe an application to private distribution-free PAC learning. Comment: Appeared in NeurIPS 2019. Final version to appear in IEEE Transactions on Information Theor

arXiv.org e-Print Archive

Differentially Private Assouad, Fano, and Le Cam

Author: Acharya Jayadev
Sun Ziteng
Zhang Huanyu
Publication venue
Publication date: 01/11/2020
Field of study

Le Cam's method, Fano's inequality, and Assouad's lemma are three widely used techniques to prove lower bounds for statistical estimation tasks. We propose their analogues under central differential privacy. Our results are simple and easy to apply, and we use them to establish sample complexity bounds in several estimation tasks. We establish the optimal sample complexity of discrete distribution estimation under total variation distance and $\ell_2$ distance. We also provide lower bounds for several other distribution classes, including product distributions and Gaussian mixtures, that are tight up to logarithmic factors. The technical component of our paper relates coupling between distributions to the sample complexity of estimation under differential privacy.

arXiv.org e-Print Archive