
    Optimal locally private estimation under $\ell_p$ loss for $1\le p\le 2$

    We consider the minimax estimation problem of a discrete distribution with support size $k$ under local differential privacy constraints. A privatization scheme is applied to each raw sample independently, and we need to estimate the distribution of the raw samples from the privatized samples. A positive number $\epsilon$ measures the privacy level of a privatization scheme. In our previous work (IEEE Trans. Inform. Theory, 2018), we proposed a family of new privatization schemes and the corresponding estimator. We also proved that our scheme and estimator are order optimal in the regime $e^{\epsilon} \ll k$ under both $\ell_2^2$ (mean square) and $\ell_1$ loss. In this paper, we sharpen this result by showing asymptotic optimality of the proposed scheme under the $\ell_p^p$ loss for all $1\le p\le 2$. More precisely, we show that for any $p\in[1,2]$ and any $k$ and $\epsilon$, the ratio between the worst-case $\ell_p^p$ estimation loss of our scheme and the optimal value approaches $1$ as the number of samples tends to infinity. The lower bound on the minimax risk of private estimation that we establish as a part of the proof is valid for any loss function $\ell_p^p$, $p\ge 1$.
    Comment: This paper generalizes the optimality results of the preprint arXiv:1708.00059 from $\ell_2$ to a broader class of loss functions. The new approach taken here also results in a much shorter proof.
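    For concreteness, the loss family discussed above can be written down directly. The following minimal Python snippet is illustrative only (the paper's privatization scheme and estimator are not reproduced here); it computes the $\ell_p^p$ loss between a true and an estimated distribution for $1\le p\le 2$.

```python
import numpy as np

def lp_p_loss(p_true, p_hat, power):
    """l_p^p loss: sum_i |p_i - q_i|^power, for 1 <= power <= 2."""
    return np.sum(np.abs(np.asarray(p_true) - np.asarray(p_hat)) ** power)

# Example on a support of size k = 4: l_1 and l_2^2 (mean square) losses.
p_true = np.array([0.25, 0.25, 0.25, 0.25])
p_hat = np.array([0.28, 0.22, 0.26, 0.24])
print(lp_p_loss(p_true, p_hat, 1.0))   # l_1 loss
print(lp_p_loss(p_true, p_hat, 2.0))   # l_2^2 loss
```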

    Communication Complexity in Locally Private Distribution Estimation and Heavy Hitters

    We consider the problems of distribution estimation and heavy hitter (frequency) estimation under privacy and communication constraints. While these constraints have been studied separately, optimal schemes for one are sub-optimal for the other. We propose a sample-optimal $\varepsilon$-locally differentially private (LDP) scheme for distribution estimation, in which each user communicates only one bit and no public randomness is required. We show that Hadamard Response, a recently proposed scheme for $\varepsilon$-LDP distribution estimation, is also utility-optimal for heavy hitter estimation. Finally, we show that, unlike distribution estimation, where one bit suffices without public randomness, any heavy hitter estimation algorithm that communicates $o(\min\{\log n, \log k\})$ bits from each user cannot be optimal.
    Comment: ICML 201

    Locally Differentially Private Naive Bayes Classification

    In machine learning, classification models need to be trained in order to predict class labels. When the training data contains personal information about individuals, collecting training data becomes difficult due to privacy concerns. Local differential privacy is a definition for measuring individual privacy when there is no trusted data curator: individuals interact with an untrusted data aggregator, which obtains statistical information about the population without learning personal data. In order to train a Naive Bayes classifier in an untrusted setting, we propose to use methods satisfying local differential privacy. Individuals send perturbed inputs that preserve the relationship between feature values and class labels. The data aggregator estimates all probabilities needed by the Naive Bayes classifier. Then, new instances can be classified based on the estimated probabilities. We propose solutions for both discrete and continuous data. In order to reduce the high amount of noise and decrease communication cost for multi-dimensional data, we propose utilizing dimensionality reduction techniques, which individuals can apply before perturbing their inputs. Our experimental results show that the accuracy of the Naive Bayes classifier is maintained even when individual privacy is guaranteed under local differential privacy, and that using dimensionality reduction enhances accuracy.
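    The discrete-data approach described above can be illustrated with a small sketch. Assuming, purely as an illustration and not as the authors' exact protocol, that each individual applies $k$-ary randomized response to the joint (feature value, class label) symbol, the aggregator can debias the reported counts and recover the probabilities a Naive Bayes classifier needs; all names and parameters below are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def k_rr(symbol, k, eps):
    """k-ary randomized response: keep the true symbol w.p. e^eps/(e^eps+k-1)."""
    if rng.random() < np.exp(eps) / (np.exp(eps) + k - 1):
        return symbol
    other = rng.integers(k - 1)                 # uniform over the other k-1 symbols
    return other if other < symbol else other + 1

def debias(counts, n, k, eps):
    """Unbiased estimate of the true joint frequencies from perturbed counts."""
    p_keep = np.exp(eps) / (np.exp(eps) + k - 1)
    p_flip = 1.0 / (np.exp(eps) + k - 1)
    return (counts / n - p_flip) / (p_keep - p_flip)

# Toy data: one binary feature, binary label; joint symbol = 2*feature + label.
n, eps, k = 20000, 1.0, 4
feature = rng.integers(2, size=n)
label = feature ^ (rng.random(n) < 0.2).astype(int)     # label correlated with feature
joint = 2 * feature + label

reports = np.array([k_rr(s, k, eps) for s in joint])
joint_hat = np.clip(debias(np.bincount(reports, minlength=k), n, k, eps), 0, None)
joint_hat /= joint_hat.sum()

# Naive Bayes quantities estimated by the aggregator.
p_label = joint_hat.reshape(2, 2).sum(axis=0)              # P(label)
p_feature_given_label = joint_hat.reshape(2, 2) / p_label  # P(feature | label)
print(p_label, p_feature_given_label)
```

    Dimensionality reduction, as proposed in the paper, would be applied on the individual's side before this perturbation step.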

    Lower Bounds for Locally Private Estimation via Communication Complexity

    We develop lower bounds for estimation under local privacy constraints---including differential privacy and its relaxations to approximate or R\'{e}nyi differential privacy---by showing an equivalence between private estimation and communication-restricted estimation problems. Our results apply to arbitrarily interactive privacy mechanisms, and they also give sharp lower bounds for all levels of differential privacy protections, that is, privacy mechanisms with privacy levels $\varepsilon \in [0, \infty)$. As a particular consequence of our results, we show that the minimax mean-squared error for estimating the mean of a bounded or Gaussian random vector in $d$ dimensions scales as $\frac{d}{n} \cdot \frac{d}{\min\{\varepsilon, \varepsilon^2\}}$.
    Comment: To appear in Conference on Learning Theory 201

    Successive Refinement of Privacy

    This work examines a novel question: how much randomness is needed to achieve local differential privacy (LDP)? A motivating scenario is providing {\em multiple levels of privacy} to multiple analysts, either for distribution or for heavy-hitter estimation, using the \emph{same} (randomized) output. We call this setting \emph{successive refinement of privacy}, as it provides hierarchical access to the raw data with different privacy levels. For example, the same randomized output could enable one analyst to reconstruct the input, while another can only estimate the distribution subject to LDP requirements. This extends the classical Shannon (wiretap) security setting to local differential privacy. We provide (order-wise) tight characterizations of privacy-utility-randomness trade-offs in several cases for distribution estimation, including the standard LDP setting under a randomness constraint. We also provide a non-trivial privacy mechanism for multi-level privacy. Furthermore, we show that random keys cannot be reused over time while preserving the privacy of each user.

    Protection Against Reconstruction and Its Applications in Private Federated Learning

    In large-scale statistical learning, data collection and model fitting are moving increasingly toward peripheral devices---phones, watches, fitness trackers---away from centralized data collection. Concomitant with this rise in decentralized data are increasing challenges of maintaining privacy while allowing enough information to fit accurate, useful statistical models. This motivates local notions of privacy---most significantly local differential privacy, which provides strong protections against sensitive data disclosures---where data is obfuscated before a statistician or learner can even observe it, providing strong protections to individuals' data. Yet local privacy as traditionally employed may prove too stringent for practical use, especially in modern high-dimensional statistical and machine learning problems. Consequently, we revisit the types of disclosures and adversaries against which we provide protections, considering adversaries with limited prior information and ensuring that, with high probability, they cannot reconstruct an individual's data within useful tolerances. By reconceptualizing these protections, we allow more useful data release---large privacy parameters in local differential privacy---and we design new (minimax) optimal locally differentially private mechanisms for statistical learning problems for \emph{all} privacy levels. We thus present practicable approaches to large-scale locally private model training that were previously impossible, showing theoretically and empirically that we can fit large-scale image classification and language models with little degradation in utility.

    Differentially Private Testing of Identity and Closeness of Discrete Distributions

    We study the fundamental problems of identity testing (goodness of fit) and closeness testing (two-sample testing) of distributions over $k$ elements, under differential privacy. While these problems have a long history in statistics, finite sample bounds for them have only been established recently. In this work, we derive upper and lower bounds on the sample complexity of both problems under $(\varepsilon, \delta)$-differential privacy. We provide optimal sample complexity algorithms for the identity testing problem for all parameter ranges, and the first results for closeness testing. Our closeness testing bounds are optimal in the sparse regime where the number of samples is at most $k$. Our upper bounds are obtained by privatizing non-private estimators for these problems; the non-private estimators are chosen to have small sensitivity. We also propose a general framework for establishing lower bounds on the sample complexity of statistical tasks under differential privacy. We show a bound on differentially private algorithms in terms of a coupling between the two hypothesis classes we aim to test. By constructing carefully chosen priors over the hypothesis classes and using Le Cam's two-point theorem, we provide a general mechanism for proving lower bounds. We believe that the framework can be used to obtain strong lower bounds for other statistical tasks under privacy.
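    The recipe of privatizing a low-sensitivity statistic mentioned above is, in its simplest form, the Laplace mechanism applied to a test statistic. The sketch below is a generic illustration of that recipe, not the paper's specific identity or closeness tests, which are more involved; the statistic, sensitivity, and threshold in the example are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)

def private_test(statistic, sensitivity, eps, threshold):
    """Release T + Lap(sensitivity/eps), which is eps-DP, and threshold it."""
    noisy = statistic + rng.laplace(scale=sensitivity / eps)
    return noisy > threshold, noisy

# Toy usage: a counting statistic whose value changes by at most 1 when one
# sample changes, so its sensitivity is 1.
samples = rng.integers(10, size=5000)
T = np.sum(samples == 0)
reject, noisy_T = private_test(T, sensitivity=1.0, eps=0.5, threshold=600)
print(noisy_T, reject)
```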

    Minimax Optimal Procedures for Locally Private Estimation

    Working under a model of privacy in which data remains private even from the statistician, we study the tradeoff between privacy guarantees and the risk of the resulting statistical estimators. We develop private versions of classical information-theoretic bounds, in particular those due to Le Cam, Fano, and Assouad. These inequalities allow for a precise characterization of statistical rates under local privacy constraints and the development of provably (minimax) optimal estimation procedures. We provide a treatment of several canonical families of problems: mean estimation and median estimation, generalized linear models, and nonparametric density estimation. For all of these families, we provide lower and upper bounds that match up to constant factors, and exhibit new (optimal) privacy-preserving mechanisms and computationally efficient estimators that achieve the bounds. Additionally, we present a variety of experimental results for estimation problems involving sensitive data, including salaries, censored blog posts and articles, and drug abuse; these experiments demonstrate the importance of deriving optimal procedures.
    Comment: 64 pages, 8 figures. arXiv admin note: substantial text overlap with arXiv:1302.320
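    As one concrete instance of the kind of locally private mean estimation mechanisms studied in this line of work, the sketch below privatizes a bounded scalar: each value in $[-1,1]$ is mapped to one of two points so that the report is unbiased and satisfies $\varepsilon$-local differential privacy. This is an illustrative mechanism under stated assumptions, not necessarily the exact construction from the paper.

```python
import numpy as np

rng = np.random.default_rng(2)

def privatize_scalar(x, eps):
    """Map x in [-1, 1] to Z in {-B, +B} with E[Z | x] = x; the two report
    probabilities differ by a factor of at most e^eps, giving eps-LDP."""
    B = (np.exp(eps) + 1) / (np.exp(eps) - 1)
    p_plus = 0.5 + x * (np.exp(eps) - 1) / (2 * (np.exp(eps) + 1))
    return B if rng.random() < p_plus else -B

eps, n = 1.0, 50000
data = 0.5 * rng.uniform(-1, 1, size=n) + 0.3      # bounded data with true mean 0.3
reports = np.array([privatize_scalar(x, eps) for x in data])
print(reports.mean())   # unbiased estimate of the mean, with inflated variance
```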

    Hadamard Response: Estimating Distributions Privately, Efficiently, and with Little Communication

    We study the problem of estimating $k$-ary distributions under $\varepsilon$-local differential privacy. $n$ samples are distributed across users who send privatized versions of their sample to a central server. All previously known sample-optimal algorithms require linear (in $k$) communication from each user in the high-privacy regime ($\varepsilon=O(1)$), and run in time that grows as $n\cdot k$, which can be prohibitive for a large domain size $k$. We propose Hadamard Response (HR), a local privatization scheme that requires no shared randomness and is symmetric with respect to the users. Our scheme has order-optimal sample complexity for all $\varepsilon$, a communication of at most $\log k+2$ bits per user, and nearly linear running time of $\tilde{O}(n + k)$. Our encoding and decoding are based on Hadamard matrices and are simple to implement. The statistical performance relies on the coding-theoretic properties of Hadamard matrices, i.e., the large Hamming distance between the rows. An efficient implementation of the algorithm using the Fast Walsh-Hadamard transform gives the computational gains. We compare our approach with Randomized Response (RR), RAPPOR, and the subset-selection mechanism (SS), both theoretically and experimentally. For $k=10000$, our algorithm runs about 100x faster than SS and RAPPOR.
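    To make the encoding and decoding concrete, here is a minimal sketch of the basic high-privacy variant of Hadamard Response: symbol $x$ is associated with the set $C_x$ of rows where column $x$ of a Sylvester Hadamard matrix is $+1$, the user reports a uniform element of $C_x$ with probability $e^{\varepsilon}/(e^{\varepsilon}+1)$ and of its complement otherwise, and the server inverts the induced bias. This illustration uses a naive decoder rather than the Fast Walsh-Hadamard transform and omits the paper's block construction for general $\varepsilon$.

```python
import numpy as np
from scipy.linalg import hadamard

rng = np.random.default_rng(3)

def hr_encode(x, H, eps):
    """x in {1, ..., k}; column 0 (all ones) of H is reserved."""
    K = H.shape[0]
    C_x = np.flatnonzero(H[:, x] == 1)                 # |C_x| = K/2
    in_set = rng.random() < np.exp(eps) / (np.exp(eps) + 1)
    pool = C_x if in_set else np.setdiff1d(np.arange(K), C_x)
    return rng.choice(pool)

def hr_decode(reports, H, k, eps):
    """P(Y in C_x) = 1/2 + p(x)(e^eps-1)/(2(e^eps+1)); invert and normalize."""
    p_hat = np.empty(k)
    for x in range(1, k + 1):
        in_C_x = H[:, x] == 1
        frac = np.mean(in_C_x[reports])
        p_hat[x - 1] = (frac - 0.5) * 2 * (np.exp(eps) + 1) / (np.exp(eps) - 1)
    p_hat = np.clip(p_hat, 0, None)
    return p_hat / p_hat.sum()

k, eps, n = 6, 1.0, 40000
K = 1 << int(np.ceil(np.log2(k + 1)))                  # smallest power of 2 > k
H = hadamard(K)
p_true = np.array([0.3, 0.25, 0.2, 0.1, 0.1, 0.05])
samples = rng.choice(np.arange(1, k + 1), size=n, p=p_true)
reports = np.array([hr_encode(x, H, eps) for x in samples])
print(hr_decode(reports, H, k, eps))
```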

    Context-Aware Local Differential Privacy

    Local differential privacy (LDP) is a strong notion of privacy for individual users that often comes at the expense of a significant drop in utility. The classical definition of LDP assumes that all elements in the data domain are equally sensitive. However, in many applications, some symbols are more sensitive than others. This work proposes a context-aware framework of local differential privacy that allows a privacy designer to incorporate the application's context into the privacy definition. For binary data domains, we provide a universally optimal privatization scheme and highlight its connections to Warner's randomized response (RR) and Mangat's improved response. Motivated by geolocation and web search applications, for $k$-ary data domains, we consider two special cases of context-aware LDP: block-structured LDP and high-low LDP. We study discrete distribution estimation and provide communication-efficient, sample-optimal schemes and information-theoretic lower bounds for both models. We show that using contextual information can require fewer samples than classical LDP to achieve the same accuracy.
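    Warner's randomized response, which the binary-domain scheme above connects to, is simple enough to sketch: each user reports their true bit with probability $e^{\varepsilon}/(e^{\varepsilon}+1)$ and flips it otherwise, and the aggregator debiases the observed frequency. The snippet below is illustrative only; the context-aware and $k$-ary schemes from the paper are not reproduced.

```python
import numpy as np

rng = np.random.default_rng(4)

def warner_rr(bit, eps):
    """Report the true bit w.p. e^eps/(e^eps+1); flip it otherwise (eps-LDP)."""
    keep = np.exp(eps) / (np.exp(eps) + 1)
    return bit if rng.random() < keep else 1 - bit

def debias(reported_mean, eps):
    """Invert E[report] = (1-keep) + p*(2*keep - 1) to estimate p."""
    keep = np.exp(eps) / (np.exp(eps) + 1)
    return (reported_mean - (1 - keep)) / (2 * keep - 1)

n, eps, true_rate = 30000, 0.8, 0.35
bits = (rng.random(n) < true_rate).astype(int)
reports = np.array([warner_rr(b, eps) for b in bits])
print(debias(reports.mean(), eps))   # estimate of true_rate
```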