Privacy-Preserving Data Sharing for Genome-Wide Association Studies
Traditional statistical methods for confidentiality protection of statistical
databases do not scale well to deal with GWAS (genome-wide association studies)
databases, especially in terms of guarantees regarding protection from linkage
to external information. The more recent concept of differential privacy,
introduced by the cryptographic community, is an approach that provides a
rigorous definition of privacy with meaningful privacy guarantees in the
presence of arbitrary external information, although the guarantees come at a
serious price in terms of data utility. Building on such notions, we propose
new methods to release aggregate GWAS data without compromising an individual's
privacy. We present methods for releasing differentially private minor allele
frequencies, chi-square statistics and p-values. We compare these approaches on
simulated data and on a GWAS study of canine hair length involving 685 dogs. We
also propose a privacy-preserving method for finding genome-wide associations
based on a differentially-private approach to penalized logistic regression.
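The core primitive behind releasing differentially private allele frequencies is the Laplace mechanism: add noise calibrated to how much one individual can change the statistic. Below is a minimal sketch (not the authors' exact method); the function name is illustrative, and the sensitivity bound 1/n follows from one individual contributing at most 2 of the 2n alleles.

```python
import numpy as np

def private_minor_allele_freq(genotypes, epsilon, rng=None):
    """Release an epsilon-differentially-private minor allele frequency.

    genotypes: per-individual minor-allele counts (0, 1, or 2).
    epsilon:   privacy budget for this single release.

    Changing one individual's genotype shifts the frequency by at most
    2 / (2n) = 1/n, so Laplace noise with scale 1/(n * epsilon) gives
    epsilon-differential privacy for this one statistic.
    """
    if rng is None:
        rng = np.random.default_rng()
    g = np.asarray(genotypes)
    n = len(g)
    maf = g.sum() / (2 * n)                        # true minor allele frequency
    noisy = maf + rng.laplace(scale=1.0 / (n * epsilon))
    return float(np.clip(noisy, 0.0, 0.5))         # clamp to the valid MAF range
```

With a generous budget the noise is negligible; with small epsilon the released frequency can deviate substantially, which is the utility price the abstract mentions.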
Enabling Privacy-Preserving GWAS in Heterogeneous Human Populations
The projected increase of genotyping in the clinic and the rise of large
genomic databases have led to the possibility of using patient medical data to
perform genome-wide association studies (GWAS) on a larger scale and at a lower
cost than ever before. Due to privacy concerns, however, access to this data is
limited to a few trusted individuals, greatly reducing its impact on biomedical
research. Privacy preserving methods have been suggested as a way of allowing
more people access to this precious data while protecting patients. In
particular, there has been growing interest in applying the concept of
differential privacy to GWAS results. Unfortunately, previous approaches for
performing differentially private GWAS are based on rather simple statistics
that have some major limitations. In particular, they do not correct for
population stratification, a major issue when dealing with the genetically
diverse populations present in modern GWAS. To address this concern we
introduce a novel computational framework for performing GWAS that tailors
ideas from differential privacy to protect private phenotype information, while
at the same time correcting for population stratification. This framework
allows us to produce privacy-preserving GWAS results based on two of the most
commonly used GWAS statistics: EIGENSTRAT and linear mixed model (LMM) based
statistics. We test our differentially private statistics, PrivSTRAT and
PrivLMM, on both simulated and real GWAS datasets and find that they are able
to protect privacy while returning meaningful GWAS results.
Comment: To be presented at RECOMB 201
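A standard building block for privately reporting which SNPs score highest (the step PrivSTRAT and PrivLMM must ultimately protect) is the exponential mechanism. The sketch below is generic, not the authors' algorithm: the function name is illustrative, and the sensitivity bound on the association score is assumed to be supplied, since deriving it for EIGENSTRAT- or LMM-adjusted statistics is the hard part of the paper.

```python
import numpy as np

def private_top_snp(scores, sensitivity, epsilon, rng=None):
    """Select one high-scoring SNP via the exponential mechanism.

    scores:      association score per SNP (higher = stronger signal).
    sensitivity: assumed bound on how much one individual's data can
                 change any single score (must be derived separately
                 for the statistic actually in use).

    SNP i is chosen with probability proportional to
    exp(epsilon * scores[i] / (2 * sensitivity)).
    """
    if rng is None:
        rng = np.random.default_rng()
    s = np.asarray(scores, dtype=float)
    logits = epsilon * s / (2.0 * sensitivity)
    logits -= logits.max()                  # shift for numerical stability
    probs = np.exp(logits)
    probs /= probs.sum()
    return int(rng.choice(len(s), p=probs))
```

For large epsilon the mechanism almost always returns the true top SNP; shrinking epsilon spreads probability over weaker candidates, trading accuracy for privacy.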
Scalable Privacy-Preserving Data Sharing Methodology for Genome-Wide Association Studies
The protection of privacy of individual-level information in genome-wide
association study (GWAS) databases has been a major concern of researchers
following the publication of an "attack" on GWAS data by Homer et al. (2008).
Traditional statistical methods for confidentiality and privacy protection of
statistical databases do not scale well to deal with GWAS data, especially in
terms of guarantees regarding protection from linkage to external information.
The more recent concept of differential privacy, introduced by the
cryptographic community, is an approach that provides a rigorous definition of
privacy with meaningful privacy guarantees in the presence of arbitrary
external information, although the guarantees may come at a serious price in
terms of data utility. Building on such notions, Uhler et al. (2013) proposed
new methods to release aggregate GWAS data without compromising an individual's
privacy. We extend the methods developed in Uhler et al. (2013) for releasing
differentially-private chi-square statistics by allowing for an arbitrary number
cases and controls, and for releasing differentially-private allelic test
statistics. We also provide a new interpretation by assuming the controls' data
are known, which is a realistic assumption because some GWAS use publicly
available data as controls. We assess the performance of the proposed methods
through a risk-utility analysis on a real data set consisting of DNA samples
collected by the Wellcome Trust Case Control Consortium and compare the methods
with the differentially-private release mechanism proposed by Johnson and
Shmatikov (2013).
Comment: 28 pages, 2 figures, source code available upon request.
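To make the allelic-test extension concrete, here is a minimal sketch of releasing a noisy 1-degree-of-freedom allelic chi-square statistic. This is not the paper's mechanism: the function name is illustrative, and the sensitivity of the statistic, whose derivation for arbitrary case/control counts is the paper's technical contribution, is taken as a given input.

```python
import numpy as np

def private_allelic_test(case_counts, control_counts, sensitivity,
                         epsilon, rng=None):
    """Release a Laplace-noised allelic (1-df) chi-square statistic.

    case_counts / control_counts: (minor, major) allele counts.
    sensitivity: assumed bound on how much one individual can change
                 the statistic (NOT derived here).
    """
    if rng is None:
        rng = np.random.default_rng()
    a, b = case_counts        # minor, major alleles among cases
    c, d = control_counts     # minor, major alleles among controls
    n = a + b + c + d
    # Standard chi-square statistic for the 2x2 allele-count table.
    chi2 = n * (a * d - b * c) ** 2 / ((a + b) * (c + d) * (a + c) * (b + d))
    return chi2 + rng.laplace(scale=sensitivity / epsilon)
```

The known-controls interpretation in the abstract would change what "one individual" can alter, and hence the sensitivity passed in, but not the release mechanism itself.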
Differentially Private Model Selection with Penalized and Constrained Likelihood
In statistical disclosure control, the goal of data analysis is twofold: The
released information must provide accurate and useful statistics about the
underlying population of interest, while minimizing the potential for an
individual record to be identified. In recent years, the notion of differential
privacy has received much attention in theoretical computer science, machine
learning, and statistics. It provides a rigorous and strong notion of
protection for individuals' sensitive information. A fundamental question is
how to incorporate differential privacy into traditional statistical inference
procedures. In this paper we study model selection in multivariate linear
regression under the constraint of differential privacy. We show that model
selection procedures based on penalized least squares or likelihood can be made
differentially private by a combination of regularization and randomization,
and propose two algorithms to do so. We show that our private procedures are
consistent under essentially the same conditions as the corresponding
non-private procedures. We also find that under differential privacy, the
procedure becomes more sensitive to the tuning parameters. We illustrate and
evaluate our method using simulation studies and two real data examples.
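The "regularization and randomization" recipe can be illustrated with a generic sketch (not the authors' two algorithms): score each candidate model by a penalized residual sum of squares and select among them with the exponential mechanism. The function name, the simple complexity penalty, and the supplied sensitivity bound are all assumptions for illustration.

```python
import numpy as np

def private_model_select(X, y, candidates, epsilon, sensitivity,
                         lam=1.0, rng=None):
    """Randomized selection among candidate predictor subsets.

    candidates:  list of predictor-index subsets to compare.
    lam:         penalty per included predictor (regularization).
    sensitivity: assumed bound on one record's effect on a model's
                 score; deriving valid bounds is the paper's subject.
    """
    if rng is None:
        rng = np.random.default_rng()
    scores = []
    for subset in candidates:
        Xs = X[:, subset]
        beta, *_ = np.linalg.lstsq(Xs, y, rcond=None)   # least-squares fit
        rss = float(np.sum((y - Xs @ beta) ** 2))
        scores.append(-(rss + lam * len(subset)))       # higher = better
    s = np.asarray(scores)
    logits = epsilon * s / (2.0 * sensitivity)
    logits -= logits.max()                              # numerical stability
    p = np.exp(logits)
    p /= p.sum()
    return candidates[int(rng.choice(len(candidates), p=p))]
```

The randomization makes the choice of model itself private, while the penalty keeps the selected model small; as the abstract notes, the outcome becomes noticeably more sensitive to tuning parameters such as lam and epsilon than in the non-private setting.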