A Formal Privacy Framework for Partially Private Data
Despite its many useful theoretical properties, differential privacy (DP) has
one substantial blind spot: any release that non-trivially depends on
confidential data without additional privacy-preserving randomization fails to
satisfy DP. Such a restriction is rarely met in practice, as most data releases
under DP are actually "partially private" data (PPD). This poses a significant
barrier to accounting for privacy risk and utility under logistical constraints
imposed on data curators, especially those working with official statistics. In
this paper, we propose a privacy definition which accommodates PPD and prove it
maintains similar properties to standard DP. We derive optimal transport-based
mechanisms for releasing PPD that satisfy our definition and algorithms for
valid statistical inference using PPD, demonstrating their improved performance
over post-processing methods. Finally, we apply these methods to a case study
on US Census and CDC PPD to investigate private COVID-19 infection rates. In
doing so, we show how data curators can use our framework to overcome barriers
to operationalizing formal privacy while providing more transparency and
accountability to users. Comment: 31 pages, 7 figures
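The blind spot described above can be made concrete with a minimal sketch (illustrative only, not the paper's framework; function names are invented): a deterministic count release depends non-trivially on confidential data with no randomization, so it satisfies DP for no finite epsilon, whereas the classic Laplace mechanism does satisfy epsilon-DP.

```python
import math
import random

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling for a centered Laplace(scale) variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def release_count_dp(data: list[int], epsilon: float) -> float:
    """epsilon-DP release of a counting query (sensitivity 1)
    via the Laplace mechanism: noise scale 1/epsilon."""
    return sum(data) + laplace_noise(1.0 / epsilon)

def release_count_exact(data: list[int]) -> int:
    """A deterministic release: it depends non-trivially on the
    confidential data with no added randomization, so it satisfies
    epsilon-DP for no finite epsilon."""
    return sum(data)
```

On neighboring datasets such as [1, 0, 1] and [0, 0, 1], the exact release always outputs 2 versus 1, so an observer distinguishes them with certainty; the noised release assigns positive density to every output under both datasets.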
Private Distribution Learning with Public Data: The View from Sample Compression
We study the problem of private distribution learning with access to public
data. In this setup, which we refer to as public-private learning, the learner
is given public and private samples drawn from an unknown distribution
belonging to a class, with the goal of outputting an estimate of that
distribution while adhering to privacy constraints (here, pure differential
privacy) only with respect to the private samples.
We show that the public-private learnability of a class is connected to the
existence of a sample compression scheme for that class, as well as to an
intermediate notion we refer to as list learning. Leveraging this connection,
we (1) approximately recover previous results on Gaussians over R^d, and (2)
obtain new ones, including sample complexity upper bounds for arbitrary
k-mixtures of Gaussians over R^d, results for
agnostic and distribution-shift resistant learners, as well as closure
properties for public-private learnability under taking mixtures and products
of distributions. Finally, via the connection to list learning, we show that
for Gaussians in R^d, at least d public samples are necessary for private
learnability, which is close to the known upper bound of d+1 public
samples. Comment: 31 pages
Oracle-Efficient Differentially Private Learning with Public Data
Due to statistical lower bounds on the learnability of many function classes
under privacy constraints, there has been recent interest in leveraging public
data to improve the performance of private learning algorithms. In this model,
algorithms must always guarantee differential privacy with respect to the
private samples while also ensuring learning guarantees when the private data
distribution is sufficiently close to that of the public data. Previous work
has demonstrated that when sufficient public, unlabelled data is available,
private learning can be made statistically tractable, but the resulting
algorithms have all been computationally inefficient. In this work, we present
the first computationally efficient algorithms to provably leverage public
data to learn privately whenever a function class is learnable non-privately,
where our notion of computational efficiency is with respect to the number of
calls to an optimization oracle for the function class. In addition to this
general result, we provide specialized algorithms with improved sample
complexities in the special cases when the function class is convex or when the
task is binary classification.
Private Estimation with Public Data
We initiate the study of differentially private (DP) estimation with access
to a small amount of public data. For private estimation of d-dimensional
Gaussians, we assume that the public data comes from a Gaussian that may have
vanishing similarity in total variation distance with the underlying Gaussian
of the private data. We show that under the constraints of pure or concentrated
DP, d+1 public data samples are sufficient to remove any dependence on the
range parameters of the private data distribution from the private sample
complexity, which is known to be otherwise necessary without public data. For
separated Gaussian mixtures, we assume that the underlying public and private
distributions are the same, and we consider two settings: (1) when given a
dimension-independent amount of public data, the private sample complexity can
be improved polynomially in terms of the number of mixture components, and any
dependence on the range parameters of the distribution can be removed in the
approximate DP case; (2) when given an amount of public data linear in the
dimension, the private sample complexity can be made independent of range
parameters even under concentrated DP, and additional improvements can be made
to the overall sample complexity. Comment: 55 pages; updated funding
acknowledgement + simulation results from NeurIPS 2022 camera-ready
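The one-dimensional intuition behind using a handful of public samples can be sketched as follows (a hedged illustration, not the paper's mechanism): the public samples are used only to localize the private distribution, so the Laplace noise scale depends on a public, data-driven window rather than on an a-priori range parameter. The function name and the `width_mult` slack factor are invented for illustration.

```python
import math
import random
import statistics

def laplace_noise(scale: float) -> float:
    # Inverse-CDF sampling for a centered Laplace(scale) variable.
    u = random.random() - 0.5
    return -scale * math.copysign(1.0, u) * math.log(1.0 - 2.0 * abs(u))

def private_mean_with_public(public: list[float], private: list[float],
                             epsilon: float, width_mult: float = 10.0) -> float:
    """Sketch: public samples give a rough center and scale, replacing
    the usual global range bound; the private mean is clipped to the
    resulting window and released with Laplace noise calibrated to it."""
    center = statistics.mean(public)
    spread = statistics.stdev(public) if len(public) > 1 else 1.0
    lo, hi = center - width_mult * spread, center + width_mult * spread
    clipped = [min(max(x, lo), hi) for x in private]
    sensitivity = (hi - lo) / len(private)  # sensitivity of the clipped mean
    return statistics.mean(clipped) + laplace_noise(sensitivity / epsilon)
```

Because `lo` and `hi` come from public data, the noise scale carries no dependence on any assumed range of the private distribution, mirroring the way public samples remove range parameters from the private sample complexity.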
Bayesian Federated Learning in Predictive Space
Federated Learning (FL) involves training a model over a dataset distributed among clients, with the constraint that each client's data is private. This paradigm is useful in settings where different entities own different training points, such as when training on data stored on multiple edge devices. Within this setting, small and noisy datasets are common, which highlights the need for well-calibrated models that can represent the uncertainty in their predictions. Alongside this, two other important goals for a practical FL algorithm are 1) that it has low communication costs, operating over only a few rounds of communication, and 2) that it achieves good performance when client datasets are distributed differently from each other (are heterogeneous). Among existing FL techniques, the closest to achieving such goals are Bayesian FL methods, which collect parameter samples from local posteriors and aggregate them to approximate the global posterior. These provide uncertainty estimates, handle data heterogeneity more naturally owing to their Bayesian nature, and can operate in a single round of communication. Of these techniques, many make inaccurate approximations to the high-dimensional posterior over parameters, which in turn negatively affects their uncertainty estimates. A Bayesian technique known as the "Bayesian Committee Machine" (BCM), originally introduced outside the FL context, remedies some of these issues by aggregating the Bayesian posteriors in the lower-dimensional predictive space instead.
The BCM, in its original form, is impractical for FL because it requires a large ensemble for inference. We first argue that it is well-suited for heterogeneous FL, then propose a modification to the BCM algorithm, involving distillation, to make it practical for FL. We demonstrate that this modified method outperforms other techniques as heterogeneity increases. We then demonstrate theoretical issues with the calibration of the BCM, namely that it is systematically overconfident. We remedy this by proposing β-Predictive Bayes, a Bayesian FL algorithm which performs a modified aggregation of the local predictive posteriors, using a tunable parameter β. β is tuned to improve the global model's calibration before it is distilled. We empirically evaluate this method on a number of regression and classification datasets, demonstrating that it is generally better calibrated than other baselines over a range of heterogeneous data partitions.
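For Gaussian predictive posteriors, BCM-style aggregation in predictive space can be sketched as follows (a minimal illustration under a shared zero-mean Gaussian prior; β-Predictive Bayes further tempers this aggregation with the tunable β, which is not reproduced here):

```python
def bcm_combine(means: list[float], variances: list[float],
                prior_var: float) -> tuple[float, float]:
    """Bayesian Committee Machine aggregation of M Gaussian predictive
    posteriors: client precisions add, minus (M - 1) copies of the prior
    precision that would otherwise be double-counted."""
    m = len(means)
    precision = sum(1.0 / v for v in variances) - (m - 1) / prior_var
    var = 1.0 / precision
    mean = var * sum(mu / v for mu, v in zip(means, variances))
    return mean, var
```

Combining two identical N(0, 1) predictions under a diffuse prior yields variance of roughly 0.5; when the clients' predictions are correlated rather than independent, this shrinkage is too aggressive, which is one way to see the systematic overconfidence the abstract describes.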