
    A Formal Privacy Framework for Partially Private Data

    Despite its many useful theoretical properties, differential privacy (DP) has one substantial blind spot: any release that non-trivially depends on confidential data without additional privacy-preserving randomization fails to satisfy DP. Such a restriction is rarely met in practice, as most data releases under DP are actually "partially private" data (PPD). This poses a significant barrier to accounting for privacy risk and utility under logistical constraints imposed on data curators, especially those working with official statistics. In this paper, we propose a privacy definition which accommodates PPD and prove it maintains similar properties to standard DP. We derive optimal transport-based mechanisms for releasing PPD that satisfy our definition and algorithms for valid statistical inference using PPD, demonstrating their improved performance over post-processing methods. Finally, we apply these methods to a case study on US Census and CDC PPD to investigate private COVID-19 infection rates. In doing so, we show how data curators can use our framework to overcome barriers to operationalizing formal privacy while providing more transparency and accountability to users. Comment: 31 pages, 7 figures
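    For context on the blind spot this abstract opens with: standard DP requires calibrated randomization inside the release itself, so a deterministic tabulation of confidential data can never satisfy it. A minimal sketch of the classical Laplace mechanism (illustrative code, not from the paper) shows the kind of noise injection that partially private releases lack:

```python
import numpy as np

def laplace_mechanism(true_value, sensitivity, epsilon, rng=None):
    """Release a statistic with epsilon-DP by adding Laplace noise.

    A deterministic release (returning true_value unchanged) depends
    non-trivially on the confidential data with no randomization, so
    it cannot satisfy DP -- exactly the blind spot described above.
    """
    rng = rng if rng is not None else np.random.default_rng()
    return true_value + rng.laplace(loc=0.0, scale=sensitivity / epsilon)

# Example: privately release a count (sensitivity 1, since one person
# changes the count by at most 1).
noisy_count = laplace_mechanism(true_value=412, sensitivity=1.0, epsilon=0.5)
```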

    Private Distribution Learning with Public Data: The View from Sample Compression

    We study the problem of private distribution learning with access to public data. In this setup, which we refer to as public-private learning, the learner is given public and private samples drawn from an unknown distribution p belonging to a class Q, with the goal of outputting an estimate of p while adhering to privacy constraints (here, pure differential privacy) only with respect to the private samples. We show that the public-private learnability of a class Q is connected to the existence of a sample compression scheme for Q, as well as to an intermediate notion we refer to as list learning. Leveraging this connection (1) approximately recovers previous results on Gaussians over R^d and (2) leads to new ones, including sample complexity upper bounds for arbitrary k-mixtures of Gaussians over R^d, results for agnostic and distribution-shift-resistant learners, and closure properties for public-private learnability under taking mixtures and products of distributions. Finally, via the connection to list learning, we show that for Gaussians in R^d, at least d public samples are necessary for private learnability, which is close to the known upper bound of d+1 public samples. Comment: 31 pages
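    One way to picture the list-learning route (a generic sketch, not the paper's construction; the function and parameter names are illustrative): the public samples are used to produce a finite list of candidate distributions, and a standard private selection step, such as the exponential mechanism, then picks among them using the private samples:

```python
import numpy as np

def private_select(candidates, private_data, score_fn,
                   sensitivity, epsilon, rng=None):
    """Exponential mechanism: privately choose the candidate that best
    fits the private data. score_fn must have bounded sensitivity in
    the private data (e.g., a clipped average log-likelihood)."""
    rng = rng if rng is not None else np.random.default_rng()
    scores = np.array([score_fn(c, private_data) for c in candidates])
    logits = epsilon * scores / (2.0 * sensitivity)
    logits -= logits.max()                  # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return candidates[rng.choice(len(candidates), p=probs)]
```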

    Oracle-Efficient Differentially Private Learning with Public Data

    Due to statistical lower bounds on the learnability of many function classes under privacy constraints, there has been recent interest in leveraging public data to improve the performance of private learning algorithms. In this model, algorithms must always guarantee differential privacy with respect to the private samples while also ensuring learning guarantees when the private data distribution is sufficiently close to that of the public data. Previous work has demonstrated that when sufficient public, unlabelled data is available, private learning can be made statistically tractable, but the resulting algorithms have all been computationally inefficient. In this work, we present the first computationally efficient algorithms to provably leverage public data to learn privately whenever a function class is learnable non-privately, where our notion of computational efficiency is with respect to the number of calls to an optimization oracle for the function class. In addition to this general result, we provide specialized algorithms with improved sample complexities in the special cases when the function class is convex or when the task is binary classification.
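    To make the efficiency notion concrete, here is a minimal sketch (illustrative names, not the paper's code) of what access to an optimization oracle typically means in this line of work: the learner interacts with the function class only through empirical-risk-minimization calls, and efficiency is measured by how many such calls it makes:

```python
from typing import Any, Callable, Sequence, Tuple

# An ERM oracle for a function class H: given labelled examples and
# per-example weights, it returns some h in H minimizing the weighted
# empirical loss. "Oracle-efficient" means the private learner touches
# H only through such calls, and makes polynomially many of them.
ErmOracle = Callable[[Sequence[Any], Sequence[float]], Any]

def count_oracle_calls(oracle: ErmOracle) -> Tuple[ErmOracle, dict]:
    """Wrap an oracle so its call count (the efficiency measure
    referenced in the abstract) can be tracked."""
    stats = {"calls": 0}
    def wrapped(examples, weights):
        stats["calls"] += 1
        return oracle(examples, weights)
    return wrapped, stats
```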

    Private Estimation with Public Data

    We initiate the study of differentially private (DP) estimation with access to a small amount of public data. For private estimation of d-dimensional Gaussians, we assume that the public data comes from a Gaussian that may have vanishing similarity in total variation distance with the underlying Gaussian of the private data. We show that under the constraints of pure or concentrated DP, d+1 public data samples are sufficient to remove any dependence on the range parameters of the private data distribution from the private sample complexity, which is known to be otherwise necessary without public data. For separated Gaussian mixtures, we assume that the underlying public and private distributions are the same, and we consider two settings: (1) when given a dimension-independent amount of public data, the private sample complexity can be improved polynomially in terms of the number of mixture components, and any dependence on the range parameters of the distribution can be removed in the approximate DP case; (2) when given an amount of public data linear in the dimension, the private sample complexity can be made independent of range parameters even under concentrated DP, and additional improvements can be made to the overall sample complexity. Comment: 55 pages; updated funding acknowledgement and simulation results from the NeurIPS 2022 camera-ready version
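    The high-level recipe behind removing range parameters can be sketched as follows (an illustrative construction in the spirit of the result, not the paper's algorithm; all parameter names are hypothetical): use the public samples to coarsely center and scale the data, so the private step only needs to handle bounded-norm points, which the standard Gaussian mechanism can do without a priori range bounds:

```python
import numpy as np

def private_mean_with_public(public, private, epsilon, delta,
                             clip_radius=4.0, rng=None):
    """Illustrative recipe: coarsely center and scale using the public
    samples, clip the standardized private data to a bounded ball, and
    release the mean via the Gaussian mechanism. The public step is
    what removes dependence on a priori range parameters."""
    rng = rng if rng is not None else np.random.default_rng()
    center = public.mean(axis=0)                       # coarse location
    scale = public.std(axis=0).mean() + 1e-12          # coarse scale
    z = (private - center) / scale
    norms = np.linalg.norm(z, axis=1, keepdims=True)
    z = z * np.minimum(1.0, clip_radius / np.maximum(norms, 1e-12))
    n, d = z.shape
    l2_sensitivity = 2.0 * clip_radius / n             # of the clipped mean
    sigma = l2_sensitivity * np.sqrt(2.0 * np.log(1.25 / delta)) / epsilon
    return center + scale * (z.mean(axis=0) + rng.normal(0.0, sigma, size=d))
```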

    Bayesian Federated Learning in Predictive Space

    Federated Learning (FL) involves training a model over a dataset distributed among clients, with the constraint that each client's data is private. This paradigm is useful in settings where different entities own different training points, such as when training on data stored on multiple edge devices. Within this setting, small and noisy datasets are common, which highlights the need for well-calibrated models which are able to represent the uncertainty in their predictions. Alongside this, two other important goals for a practical FL algorithm are 1) that it has low communication costs, operating over only a few rounds of communication, and 2) that it achieves good performance when client datasets are distributed differently from each other (are heterogeneous). Among existing FL techniques, the closest to achieving such goals include Bayesian FL methods which collect parameter samples from local posteriors and aggregate them to approximate the global posterior. These provide uncertainty estimates, more naturally handle data heterogeneity owing to their Bayesian nature, and can operate in a single round of communication. Of these techniques, many make inaccurate approximations to the high-dimensional posterior over parameters, which in turn negatively affects their uncertainty estimates. A Bayesian technique known as the "Bayesian Committee Machine" (BCM), originally introduced outside the FL context, remedies some of these issues by aggregating the Bayesian posteriors in the lower-dimensional predictive space instead. The BCM, in its original form, is impractical for FL due to requiring a large ensemble for inference. We first argue that it is well-suited for heterogeneous FL, then propose a modification to the BCM algorithm, involving distillation, to make it practical for FL. We demonstrate that this modified method outperforms other techniques as heterogeneity increases. We then demonstrate theoretical issues with the calibration of the BCM, namely that it is systematically overconfident. We remedy this by proposing β-Predictive Bayes, a Bayesian FL algorithm which performs a modified aggregation of the local predictive posteriors, using a tunable parameter β. β is tuned to improve the global model's calibration, before the model is distilled. We empirically evaluate this method on a number of regression and classification datasets to demonstrate that it is generally better calibrated than other baselines, over a range of heterogeneous data partitions.
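    A schematic of the kind of tunable predictive-space aggregation the abstract describes (a sketch only; the thesis's exact β-Predictive Bayes rule may differ, and the blend below between a mixture pool and a product-style pool is an assumption for illustration):

```python
import numpy as np

def beta_aggregate(client_probs, beta):
    """Blend a mixture (beta = 0) and a normalized geometric-mean,
    product-style pool (beta = 1) of client predictive distributions
    over classes. Schematic: beta is tuned for calibration."""
    client_probs = np.asarray(client_probs)          # (clients, classes)
    mixture = client_probs.mean(axis=0)
    geo = np.exp(np.log(client_probs + 1e-12).mean(axis=0))
    geo /= geo.sum()
    log_agg = ((1.0 - beta) * np.log(mixture + 1e-12)
               + beta * np.log(geo + 1e-12))
    agg = np.exp(log_agg - log_agg.max())
    return agg / agg.sum()

# beta would be tuned on held-out data to improve calibration, after
# which the aggregated predictive model is distilled into one network.
```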