A Formal Privacy Framework for Partially Private Data
Despite its many useful theoretical properties, differential privacy (DP) has
one substantial blind spot: any release that non-trivially depends on
confidential data without additional privacy-preserving randomization fails to
satisfy DP. Such a restriction is rarely met in practice, as most data releases
under DP are actually "partially private" data (PPD). This poses a significant
barrier to accounting for privacy risk and utility under logistical constraints
imposed on data curators, especially those working with official statistics. In
this paper, we propose a privacy definition that accommodates PPD and prove
that it maintains properties similar to standard DP. We derive optimal
transport-based mechanisms for releasing PPD that satisfy our definition,
along with algorithms for valid statistical inference using PPD, and
demonstrate their improved performance over post-processing methods. Finally,
we apply these methods to a case study
on US Census and CDC PPD to investigate private COVID-19 infection rates. In
doing so, we show how data curators can use our framework to overcome barriers
to operationalizing formal privacy while providing more transparency and
accountability to users.
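To make the blind spot concrete, here is a minimal illustrative sketch (not
from the paper): a deterministic summary that depends on confidential data
satisfies no finite DP budget, while the same summary with Laplace noise
calibrated to its sensitivity does. The dataset, clipping bound, and function
names are hypothetical.

    import numpy as np

    rng = np.random.default_rng(0)
    incomes = rng.uniform(0, 100_000, size=1_000)  # confidential records

    def release_mean_raw(data):
        # Deterministic and non-trivially dependent on the data: this
        # fails epsilon-DP for every finite epsilon, however aggregated
        # it looks; it is what the abstract calls "partially private".
        return data.mean()

    def release_mean_dp(data, epsilon, bound=100_000.0):
        # Clip each record to [0, bound] so replacing one record moves
        # the mean by at most bound / n, then add Laplace noise
        # calibrated to that sensitivity; this satisfies epsilon-DP.
        clipped = np.clip(data, 0.0, bound)
        sensitivity = bound / len(clipped)
        return clipped.mean() + rng.laplace(scale=sensitivity / epsilon)

    print(release_mean_raw(incomes))
    print(release_mean_dp(incomes, epsilon=1.0))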
Privacy-Preserving Data Sharing for Genome-Wide Association Studies
Traditional statistical methods for confidentiality protection of statistical
databases do not scale well to GWAS (genome-wide association study) databases,
especially in terms of guarantees against linkage to external information. The
more recent concept of differential privacy, introduced by the cryptographic
community, provides a rigorous definition of privacy with meaningful
guarantees in the presence of arbitrary external information, although these
guarantees come at a serious price in data utility. Building on such notions,
we propose
new methods to release aggregate GWAS data without compromising an individual's
privacy. We present methods for releasing differentially private minor allele
frequencies, chi-square statistics and p-values. We compare these approaches on
simulated data and on a GWAS study of canine hair length involving 685 dogs. We
also propose a privacy-preserving method for finding genome-wide associations
based on a differentially private approach to penalized logistic regression.
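As a concrete instance of the first of these releases, here is a minimal
sketch of a Laplace-mechanism release of a minor allele frequency (an
assumption about the mechanism's form; the paper's exact sensitivity analysis
may differ). With n genotyped individuals each carrying 0, 1, or 2 copies of
the minor allele, replacing one individual changes the allele count by at most
2, so the frequency count/(2n) has sensitivity 2/(2n) = 1/n.

    import numpy as np

    def dp_minor_allele_frequency(genotypes, epsilon, rng=None):
        # genotypes: per-individual minor-allele counts in {0, 1, 2}.
        # Replacing one individual changes the total count by at most 2,
        # so the MAF = count / (2n) has sensitivity 2 / (2n) = 1 / n.
        rng = rng or np.random.default_rng()
        n = len(genotypes)
        maf = np.sum(genotypes) / (2 * n)
        noisy = maf + rng.laplace(scale=1.0 / (n * epsilon))
        # Clipping to the valid range is post-processing, so it
        # preserves the epsilon-DP guarantee.
        return float(np.clip(noisy, 0.0, 0.5))

    # Illustrative cohort of 685 individuals at one simulated SNP.
    rng = np.random.default_rng(0)
    genotypes = rng.binomial(2, 0.3, size=685)
    print(dp_minor_allele_frequency(genotypes, epsilon=1.0, rng=rng))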
Differentially private model selection with penalized and constrained likelihood
Summary: In statistical disclosure control, the goal of data analysis is
twofold: the information released must provide accurate and useful statistics
about the underlying population of interest, while minimizing the potential
for an individual record to be identified. In recent years, the notion of
differential privacy has received much attention in theoretical computer
science, machine learning, and statistics. It provides a rigorous and strong
notion of protection for individuals' sensitive information. A fundamental
question is how to incorporate differential privacy into traditional
statistical inference procedures. We study model selection in multivariate
linear regression under the constraint of differential privacy. We show that
model selection procedures based on penalized least squares or likelihood can
be made differentially private by a combination of regularization and
randomization, and we propose two algorithms to do so. We show that our
privacy procedures are consistent under essentially the same conditions as the
corresponding non-private procedures. We also find that, under differential
privacy, the procedure becomes more sensitive to the tuning parameters. We
illustrate and evaluate our method using simulation studies and two real data
examples.
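One standard way to combine regularization and randomization is sketched
below (a sketch under assumptions, not necessarily either of the paper's two
algorithms): score each candidate model by a penalized least-squares
criterion and select one via the exponential mechanism. The score_sensitivity
argument is an assumed bound on how much one record can change any model's
score; establishing such a bound, e.g. from clipped data, is the technical
crux.

    import itertools
    import numpy as np

    def dp_model_selection(X, y, epsilon, lam, score_sensitivity, rng=None):
        # Exponential mechanism: sample a predictor subset with
        # probability proportional to exp(eps * score / (2 * Delta)),
        # where the score is a penalized least-squares criterion and
        # Delta (= score_sensitivity) bounds one record's influence.
        rng = rng or np.random.default_rng()
        n, p = X.shape
        # Enumerate all 2**p subsets (feasible only for small p).
        models = [s for r in range(p + 1)
                  for s in itertools.combinations(range(p), r)]
        scores = []
        for s in models:
            if s:
                Xs = X[:, list(s)]
                beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)
                rss = float(np.sum((y - Xs @ beta) ** 2))
            else:
                rss = float(np.sum(y ** 2))
            scores.append(-(rss / n + lam * len(s)))  # penalized score
        scores = np.array(scores)
        weights = np.exp(epsilon * (scores - scores.max())
                         / (2 * score_sensitivity))
        probs = weights / weights.sum()
        return models[rng.choice(len(models), p=probs)]

Note how epsilon, lam, and score_sensitivity jointly shape the selection
probabilities, which is consistent with the abstract's observation that
private selection is more sensitive to the tuning parameters than its
non-private counterpart.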