Differentially Private Model Selection with Penalized and Constrained Likelihood
In statistical disclosure control, the goal of data analysis is twofold: The
released information must provide accurate and useful statistics about the
underlying population of interest, while minimizing the potential for an
individual record to be identified. In recent years, the notion of differential
privacy has received much attention in theoretical computer science, machine
learning, and statistics. It provides a rigorous and strong notion of
protection for individuals' sensitive information. A fundamental question is
how to incorporate differential privacy into traditional statistical inference
procedures. In this paper we study model selection in multivariate linear
regression under the constraint of differential privacy. We show that model
selection procedures based on penalized least squares or likelihood can be made
differentially private by a combination of regularization and randomization,
and propose two algorithms to do so. We show that our private procedures are
consistent under essentially the same conditions as the corresponding
non-private procedures. We also find that under differential privacy, the
procedure becomes more sensitive to the tuning parameters. We illustrate and
evaluate our method using simulation studies and two real data examples.
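The abstracts in this listing include no code, but the regularization-plus-randomization idea above can be illustrated with a small sketch: score every candidate feature subset with a BIC-style penalized least-squares criterion and select by report-noisy-min, adding Laplace noise to each score. This is not the paper's algorithm; the penalty weight and the `score_sensitivity` bound are assumptions that would have to be derived for the data at hand.

```python
import itertools
import numpy as np

def noisy_penalized_selection(X, y, epsilon, score_sensitivity, penalty=2.0, rng=None):
    """Select a feature subset via noisy penalized least squares (report-noisy-min).

    `score_sensitivity` is an assumed bound on how much a single record can change
    any candidate's score; it must be justified for the data at hand.
    """
    rng = np.random.default_rng() if rng is None else rng
    n, p = X.shape
    best_subset, best_score = None, np.inf
    # Enumerates all 2^p subsets, so this sketch is only practical for small p.
    for k in range(p + 1):
        for subset in itertools.combinations(range(p), k):
            cols = list(subset)
            if cols:
                beta, *_ = np.linalg.lstsq(X[:, cols], y, rcond=None)
                rss = np.sum((y - X[:, cols] @ beta) ** 2)
            else:
                rss = np.sum((y - y.mean()) ** 2)
            score = n * np.log(max(rss, 1e-12) / n) + penalty * len(cols)  # penalized fit
            score += rng.laplace(scale=2.0 * score_sensitivity / epsilon)  # randomization
            if score < best_score:
                best_subset, best_score = subset, score
    return best_subset
```

The Laplace scale 2·Δ/ε is the standard report-noisy-min calibration; tighter analyses (or the exponential mechanism) could be substituted without changing the overall structure.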
Context-Aware Generative Adversarial Privacy
Preserving the utility of published datasets while simultaneously providing
provable privacy guarantees is a well-known challenge. On the one hand,
context-free privacy solutions, such as differential privacy, provide strong
privacy guarantees, but often lead to a significant reduction in utility. On
the other hand, context-aware privacy solutions, such as information theoretic
privacy, achieve an improved privacy-utility tradeoff, but assume that the data
holder has access to dataset statistics. We circumvent these limitations by
introducing a novel context-aware privacy framework called generative
adversarial privacy (GAP). GAP leverages recent advancements in generative
adversarial networks (GANs) to allow the data holder to learn privatization
schemes from the dataset itself. Under GAP, learning the privacy mechanism is
formulated as a constrained minimax game between two players: a privatizer that
sanitizes the dataset in a way that limits the risk of inference attacks on the
individuals' private variables, and an adversary that tries to infer the
private variables from the sanitized dataset. To evaluate GAP's performance, we
investigate two simple (yet canonical) statistical dataset models: (a) the
binary data model, and (b) the binary Gaussian mixture model. For both models,
we derive game-theoretically optimal minimax privacy mechanisms, and show that
the privacy mechanisms learned from data (in a generative adversarial fashion)
match the theoretically optimal ones. This demonstrates that our framework can
be easily applied in practice, even in the absence of dataset statistics.
Comment: Improved version of a paper accepted by Entropy Journal, Special Issue on Information Theory in Machine Learning and Data Science.
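As a rough illustration of the GAP minimax game (not the authors' implementation), the sketch below alternates gradient steps between a privatizer network and an adversary in PyTorch. The distortion constraint is replaced here by a Lagrangian penalty LAMBDA, and the layer sizes and synthetic data are placeholder assumptions.

```python
import torch
import torch.nn as nn

DIM, LAMBDA = 16, 5.0  # hypothetical record dimension and distortion weight

privatizer = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, DIM))
adversary  = nn.Sequential(nn.Linear(DIM, 32), nn.ReLU(), nn.Linear(32, 1))

opt_p = torch.optim.Adam(privatizer.parameters(), lr=1e-3)
opt_a = torch.optim.Adam(adversary.parameters(), lr=1e-3)
bce = nn.BCEWithLogitsLoss()

def train_step(x, s):
    """One alternating step: x is the record, s in {0, 1} the private variable."""
    # Adversary step: improve inference of s from the sanitized record.
    x_hat = privatizer(x).detach()
    loss_a = bce(adversary(x_hat), s)
    opt_a.zero_grad(); loss_a.backward(); opt_a.step()

    # Privatizer step: make inference hard while penalizing distortion (utility proxy).
    x_hat = privatizer(x)
    loss_p = -bce(adversary(x_hat), s) + LAMBDA * torch.mean((x_hat - x) ** 2)
    opt_p.zero_grad(); loss_p.backward(); opt_p.step()
    return loss_a.item(), loss_p.item()

# Toy usage with synthetic data; the paper's binary and Gaussian-mixture models
# would replace this with the appropriate dataset.
x = torch.randn(64, DIM)
s = torch.randint(0, 2, (64, 1)).float()
for _ in range(200):
    train_step(x, s)
```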
Comparing Population Means under Local Differential Privacy: with Significance and Power
A statistical hypothesis test determines whether a hypothesis should be
rejected based on samples from populations. In particular, randomized
controlled experiments (or A/B testing) that compare population means using,
e.g., t-tests, have been widely deployed in technology companies to aid in
making data-driven decisions. Samples used in these tests are collected from
users and may contain sensitive information. Both the data collection and the
testing process may compromise individuals' privacy. In this paper, we study
how to conduct hypothesis tests to compare population means while preserving
privacy. We use the notion of local differential privacy (LDP), which has
recently emerged as the main tool to ensure each individual's privacy without
the need of a trusted data collector. We propose LDP tests that inject noise
into every user's data in the samples before collecting them (so users do not
need to trust the data collector), and draw conclusions with bounded type-I
(significance level) and type-II errors (1 - power). Our approaches can be
extended to the scenario where some users require LDP while some are willing to
provide exact data. We report experimental results on real-world datasets to
verify the effectiveness of our approaches.
Comment: Full version of an AAAI 2018 conference paper.
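A minimal sketch of the local-privacy idea described above (not the paper's specific tests): each user clips their value to an assumed range [lo, hi] and adds Laplace noise on their own device, and the collector runs a two-sample z-test whose variance estimate already reflects the injected noise. The range, epsilon, and helper names are illustrative assumptions.

```python
import numpy as np
from scipy.stats import norm

def ldp_perturb(values, lo, hi, epsilon, rng=None):
    """Each user clips their value and adds Laplace noise locally before sending it."""
    rng = np.random.default_rng() if rng is None else rng
    clipped = np.clip(values, lo, hi)
    return clipped + rng.laplace(scale=(hi - lo) / epsilon, size=len(clipped))

def ldp_two_sample_test(noisy_a, noisy_b, alpha=0.05):
    """Z-test on noisy group means; sample variances already include the added noise."""
    na, nb = len(noisy_a), len(noisy_b)
    se = np.sqrt(np.var(noisy_a, ddof=1) / na + np.var(noisy_b, ddof=1) / nb)
    z = (np.mean(noisy_a) - np.mean(noisy_b)) / se
    p_value = 2 * (1 - norm.cdf(abs(z)))
    return z, p_value, p_value < alpha
```

Because the Laplace noise inflates the per-user variance, such a test needs larger samples to retain the same power as its non-private counterpart, which is the trade-off the paper quantifies through its bounded type-I and type-II error guarantees.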