14 research outputs found

    Mutual Information Optimally Local Private Discrete Distribution Estimation

    Consider statistical learning (e.g. discrete distribution estimation) with local $\epsilon$-differential privacy, which preserves each data provider's privacy locally; we aim to optimize statistical data utility under the privacy constraints. Specifically, we study maximizing the mutual information between a provider's data and its private view, and give the exact mutual information bound along with a mechanism that attains it: the $k$-subset mechanism. The mutual information optimal mechanism randomly outputs a size-$k$ subset of the original data domain with delicate probability assignment, where $k$ varies with the privacy level $\epsilon$ and the data domain size $d$. After analysing the limitations of existing local private mechanisms from a mutual information perspective, we propose an efficient implementation of the $k$-subset mechanism for discrete distribution estimation, and show its optimality guarantees over existing approaches. Comment: submitted to NIPS201
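
    The sampling step described in the abstract above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the common construction in which every size-$k$ subset containing the true value is $e^{\epsilon}$ times as likely as any subset that does not (the abstract does not spell out the probability assignment). The function names and the `rng` parameter are hypothetical, and the choice of $k$ is left to the caller.

```python
import math
import random


def inclusion_probability(d, k, eps):
    """Probability that the reported subset contains the true value.

    Assumption: subsets containing the true value have weight exp(eps),
    all other size-k subsets have weight 1.
    """
    return k * math.exp(eps) / (k * math.exp(eps) + d - k)


def k_subset_report(x, d, k, eps, rng=random):
    """Sample a size-k subset of range(d) for true value x (1 <= k < d)."""
    others = [v for v in range(d) if v != x]
    if rng.random() < inclusion_probability(d, k, eps):
        # Keep the true value and pad with k-1 uniformly chosen other values.
        return {x} | set(rng.sample(others, k - 1))
    # Otherwise report k uniformly chosen values that exclude the true one.
    return set(rng.sample(others, k))
```

    With $\epsilon = 0$ the inclusion probability reduces to $k/d$, so the report is a uniformly random subset and carries no information about the input.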

    Compressive Privatization: Sparse Distribution Estimation under Locally Differentially Privacy

    We consider the problem of discrete distribution estimation under local differential privacy. Distribution estimation is one of the most fundamental estimation problems, and is widely studied in both non-private and private settings. In the local model, private mechanisms with provably optimal sample complexity are known. However, they are optimal only in the worst-case sense; their sample complexity is proportional to the size of the entire universe, which could be huge in practice (e.g., all IP addresses). We show that as long as the target distribution is sparse or approximately sparse (e.g., highly skewed), the number of samples needed could be significantly reduced. The sample complexity of our new mechanism is characterized by the sparsity of the target distribution and depends only weakly on the size of the universe. Our mechanism performs privatization and dimensionality reduction simultaneously, and the sample complexity depends only on the reduced dimensionality. The original distribution is then recovered using tools from compressive sensing. To complement our theoretical results, we conduct experimental studies, the results of which clearly demonstrate the advantages of our method and confirm our theoretical findings.

    Differentially Private Testing of Identity and Closeness of Discrete Distributions

    We study the fundamental problems of identity testing (goodness of fit) and closeness testing (two-sample test) of distributions over $k$ elements, under differential privacy. While the problems have a long history in statistics, finite sample bounds for these problems have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both problems under $(\varepsilon, \delta)$-differential privacy. We provide optimal sample complexity algorithms for the identity testing problem for all parameter ranges, and the first results for closeness testing. Our closeness testing bounds are optimal in the sparse regime where the number of samples is at most $k$. Our upper bounds are obtained by privatizing non-private estimators for these problems. The non-private estimators are chosen to have small sensitivity. We propose a general framework to establish lower bounds on the sample complexity of statistical tasks under differential privacy. We show a bound on differentially private algorithms in terms of a coupling between the two hypothesis classes we aim to test. By constructing carefully chosen priors over the hypothesis classes, and using Le Cam's two-point theorem, we provide a general mechanism for proving lower bounds. We believe that the framework can be used to obtain strong lower bounds for other statistical tasks under privacy.

    Locally Differentially Private Frequency Estimation with Consistency

    Local Differential Privacy (LDP) protects user privacy from the data collector. LDP protocols have been increasingly deployed in the industry. A basic building block is frequency oracle (FO) protocols, which estimate frequencies of values. While several FO protocols have been proposed, their design goal does not lead to optimal results for answering many queries. In this paper, we show that adding post-processing steps to FO protocols, exploiting the knowledge that all individual frequencies should be non-negative and sum to one, can lead to significantly better accuracy for a wide range of tasks, including frequencies of individual values, frequencies of the most frequent values, and frequencies of subsets of values. We consider 10 different methods that exploit this knowledge differently. We establish theoretical relationships between some of them and conduct extensive experimental evaluations to understand which methods should be used for different query tasks. Comment: NDSS 202
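
    One standard way to enforce both constraints named in the abstract above (non-negativity and summing to one) is Euclidean projection onto the probability simplex. The sketch below illustrates that generic technique; it is an assumption that this is a reasonable stand-in, and it is not claimed to be one of the ten methods the paper evaluates. The function name is hypothetical.

```python
def project_to_simplex(estimates):
    """Euclidean projection of a real vector onto the probability simplex.

    Sort-based algorithm: find a shift theta so that clipping
    (estimate - theta) at zero yields non-negative values summing to one.
    """
    u = sorted(estimates, reverse=True)
    cumulative = 0.0
    theta = 0.0
    for j, value in enumerate(u, start=1):
        cumulative += value
        candidate = (cumulative - 1.0) / j
        if value - candidate > 0:
            # The passing indices form a prefix; the last one fixes theta.
            theta = candidate
    return [max(v - theta, 0.0) for v in estimates]
```

    For example, noisy estimates `[0.6, 0.5, -0.1]` map to approximately `[0.55, 0.45, 0.0]`: the negative entry is clipped to zero and the excess mass is removed equally from the remaining entries.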

    Lower Bounds for Learning Distributions under Communication Constraints via Fisher Information

    We consider the problem of learning high-dimensional, nonparametric and structured (e.g. Gaussian) distributions in distributed networks, where each node in the network observes an independent sample from the underlying distribution and can use $k$ bits to communicate its sample to a central processor. We consider three different models for communication. Under the independent model, each node communicates its sample to a central processor by independently encoding it into $k$ bits. Under the more general sequential or blackboard communication models, nodes can share information interactively but each node is restricted to write at most $k$ bits on the final transcript. We characterize the impact of the communication constraint $k$ on the minimax risk of estimating the underlying distribution under $\ell^2$ loss. We develop minimax lower bounds that apply in a unified way to many common statistical models and reveal that the impact of the communication constraint can be qualitatively different depending on the tail behavior of the score function associated with each model. A key ingredient in our proofs is a geometric characterization of Fisher information from quantized samples.

    Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication

    We study the problem of estimating $k$-ary distributions under $\varepsilon$-local differential privacy. $n$ samples are distributed across users who send privatized versions of their sample to a central server. All previously known sample-optimal algorithms require linear (in $k$) communication from each user in the high privacy regime ($\varepsilon=O(1)$), and run in time that grows as $n\cdot k$, which can be prohibitive for large domain size $k$. We propose Hadamard Response (HR), a local privatization scheme that requires no shared randomness and is symmetric with respect to the users. Our scheme has order-optimal sample complexity for all $\varepsilon$, a communication of at most $\log k+2$ bits per user, and nearly linear running time of $\tilde{O}(n + k)$. Our encoding and decoding are based on Hadamard matrices, and are simple to implement. The statistical performance relies on the coding-theoretic aspects of Hadamard matrices, i.e., the large Hamming distance between the rows. An efficient implementation of the algorithm using the Fast Walsh-Hadamard transform gives the computational gains. We compare our approach with Randomized Response (RR), RAPPOR, and subset-selection mechanisms (SS), both theoretically and experimentally. For $k=10000$, our algorithm runs about 100x faster than SS and RAPPOR.
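
    The Fast Walsh-Hadamard transform mentioned in the abstract above can be sketched in a few lines. This shows only the generic unnormalized transform (natural, i.e. Sylvester, ordering), which multiplies a length-$k$ vector by a Hadamard matrix in $O(k \log k)$ additions; it is not the HR encoder or decoder, and the function name is hypothetical.

```python
def fwht(values):
    """Unnormalized fast Walsh-Hadamard transform of a power-of-two-length list."""
    a = list(values)
    n = len(a)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        # Butterfly pass: combine entries that are h apart within each block.
        for start in range(0, n, 2 * h):
            for j in range(start, start + h):
                x, y = a[j], a[j + h]
                a[j], a[j + h] = x + y, x - y
        h *= 2
    return a
```

    Applying `fwht` twice returns the input scaled by its length, so the inverse transform is the same routine followed by division by the length.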

    Successive Refinement of Privacy

    This work examines a novel question: how much randomness is needed to achieve local differential privacy (LDP)? A motivating scenario is providing multiple levels of privacy to multiple analysts, either for distribution or for heavy-hitter estimation, using the same (randomized) output. We call this setting successive refinement of privacy, as it provides hierarchical access to the raw data with different privacy levels. For example, the same randomized output could enable one analyst to reconstruct the input, while another can only estimate the distribution subject to LDP requirements. This extends the classical Shannon (wiretap) security setting to local differential privacy. We provide (order-wise) tight characterizations of privacy-utility-randomness trade-offs in several cases for distribution estimation, including the standard LDP setting under a randomness constraint. We also provide a non-trivial privacy mechanism for multi-level privacy. Furthermore, we show that we cannot reuse random keys over time while preserving privacy of each user.

    Private Identity Testing for High-Dimensional Distributions

    In this work we present novel differentially private identity (goodness-of-fit) testers for natural and widely studied classes of multivariate product distributions: Gaussians in $\mathbb{R}^d$ with known covariance and product distributions over $\{\pm 1\}^{d}$. Our testers have improved sample complexity compared to those derived from previous techniques, and are the first testers whose sample complexity matches the order-optimal minimax sample complexity of $O(d^{1/2}/\alpha^2)$ in many parameter regimes. We construct two types of testers, exhibiting tradeoffs between sample complexity and computational complexity. Finally, we provide a two-way reduction between testing a subclass of multivariate product distributions and testing univariate distributions, and thereby obtain upper and lower bounds for testing this subclass of product distributions. Comment: Improved the bounds and the writin

    Private Hypothesis Selection

    We provide a differentially private algorithm for hypothesis selection. Given samples from an unknown probability distribution $P$ and a set of $m$ probability distributions $\mathcal{H}$, the goal is to output, in an $\varepsilon$-differentially private manner, a distribution from $\mathcal{H}$ whose total variation distance to $P$ is comparable to that of the best such distribution (which we denote by $\alpha$). The sample complexity of our basic algorithm is $O\left(\frac{\log m}{\alpha^2} + \frac{\log m}{\alpha \varepsilon}\right)$, representing a minimal cost for privacy when compared to the non-private algorithm. We can also handle infinite hypothesis classes $\mathcal{H}$ by relaxing to $(\varepsilon,\delta)$-differential privacy. We apply our hypothesis selection algorithm to give learning algorithms for a number of natural distribution classes, including Gaussians, product distributions, sums of independent random variables, piecewise polynomials, and mixture classes. Our hypothesis selection procedure allows us to generically convert a cover for a class to a learning algorithm, complementing known learning lower bounds which are in terms of the size of the packing number of the class. As the covering and packing numbers are often closely related, for constant $\alpha$, our algorithms achieve the optimal sample complexity for many classes of interest. Finally, we describe an application to private distribution-free PAC learning. Comment: Appeared in NeurIPS 2019. Final version to appear in IEEE Transactions on Information Theor

    Differentially Private Assouad, Fano, and Le Cam

    Le Cam's method, Fano's inequality, and Assouad's lemma are three widely used techniques to prove lower bounds for statistical estimation tasks. We propose their analogues under central differential privacy. Our results are simple and easy to apply, and we use them to establish sample complexity bounds in several estimation tasks. We establish the optimal sample complexity of discrete distribution estimation under total variation distance and $\ell_2$ distance. We also provide lower bounds for several other distribution classes, including product distributions and Gaussian mixtures, that are tight up to logarithmic factors. The technical component of our paper relates coupling between distributions to the sample complexity of estimation under differential privacy.