Privacy-Preserving Data Sharing for Genome-Wide Association Studies
Traditional statistical methods for confidentiality protection of statistical
databases do not scale well to GWAS (genome-wide association study) databases,
especially in terms of guarantees regarding protection from linkage
to external information. The more recent concept of differential privacy,
introduced by the cryptographic community, is an approach which provides a
rigorous definition of privacy with meaningful privacy guarantees in the
presence of arbitrary external information, although the guarantees come at a
serious price in terms of data utility. Building on such notions, we propose
new methods to release aggregate GWAS data without compromising an individual's
privacy. We present methods for releasing differentially private minor allele
frequencies, chi-square statistics and p-values. We compare these approaches on
simulated data and on a GWAS study of canine hair length involving 685 dogs. We
also propose a privacy-preserving method for finding genome-wide associations
based on a differentially private approach to penalized logistic regression.
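The Laplace mechanism that underlies such differentially private releases can be sketched as follows. This is a generic illustration with a hypothetical function name and a textbook sensitivity bound (changing one individual shifts the minor-allele count by at most 2, hence the frequency by at most 1/N), not the paper's exact calibration:

```python
import numpy as np

def dp_minor_allele_frequency(minor_count, n_individuals, epsilon, rng=None):
    """Release a minor allele frequency via the Laplace mechanism.

    Changing one individual's genotype shifts the minor-allele count by at
    most 2, so the frequency (count / 2N) changes by at most 1/N; that
    sensitivity scales the noise. Generic sketch, not the paper's method.
    """
    if rng is None:
        rng = np.random.default_rng()
    maf = minor_count / (2 * n_individuals)
    sensitivity = 1.0 / n_individuals
    noisy = maf + rng.laplace(0.0, sensitivity / epsilon)
    return min(max(noisy, 0.0), 1.0)  # clamp to a valid frequency

# Example: 120 minor alleles among 685 individuals, epsilon = 1
print(dp_minor_allele_frequency(120, 685, epsilon=1.0))
```

Smaller epsilon means stronger privacy but noisier frequencies, which is the utility price the abstract mentions.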
Supporting Regularized Logistic Regression Privately and Efficiently
As one of the most popular statistical and machine learning models, logistic
regression with regularization has found wide adoption in biomedicine, social
sciences, information technology, and other fields. These domains often involve
data from human subjects that fall under strict privacy regulations.
Increasing concerns over data privacy make it more and more difficult to
coordinate and conduct large-scale collaborative studies, which typically rely
on cross-institution data sharing and joint analysis. Our work focuses on
safeguarding regularized logistic regression, a machine learning model widely
used across disciplines that has nonetheless not been examined from a data
security and privacy perspective. We consider a common use scenario
of multi-institution collaborative studies, such as in the form of research
consortia or networks as widely seen in genetics, epidemiology, social
sciences, etc. To make our privacy-enhancing solution practical, we demonstrate
a non-conventional and computationally efficient method leveraging distributed
computing and strong cryptography to provide comprehensive protection over
individual-level and summary data. Extensive empirical evaluation on several
studies validated the privacy guarantees, efficiency and scalability of our
proposal. We also discuss the practical implications of our solution for
large-scale studies and applications from various disciplines, including
genetic and biomedical studies, smart grids, and network analysis.
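The idea of protecting individual-level contributions in a multi-institution study can be illustrated with additive secret sharing, where each site splits its local gradient of the regularized logistic loss into random shares that only sum correctly in aggregate. This is a conceptual sketch over real numbers; the paper's actual construction uses strong cryptography (e.g. finite-field arithmetic), which this toy version does not reproduce:

```python
import numpy as np

def share(vector, n_parties, rng):
    """Split a vector into additive shares that sum back to the vector."""
    shares = [rng.normal(size=vector.shape) for _ in range(n_parties - 1)]
    shares.append(vector - sum(shares))
    return shares

# Each site holds a local gradient; no single party ever sees a site's
# raw gradient, only sums of shares, yet the aggregate is exact.
rng = np.random.default_rng(0)
site_gradients = [rng.normal(size=3) for _ in range(4)]
all_shares = [share(g, 4, rng) for g in site_gradients]
partial_sums = [sum(s[j] for s in all_shares) for j in range(4)]
aggregate = sum(partial_sums)
assert np.allclose(aggregate, sum(site_gradients))
```

Because the aggregate gradient is exact, the fitted model matches the pooled analysis, unlike approaches that trade accuracy for privacy.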
Enabling Privacy-Preserving GWAS in Heterogeneous Human Populations
The projected increase of genotyping in the clinic and the rise of large
genomic databases have led to the possibility of using patient medical data to
perform genome-wide association studies (GWAS) on a larger scale and at a lower
cost than ever before. Due to privacy concerns, however, access to this data is
limited to a few trusted individuals, greatly reducing its impact on biomedical
research. Privacy-preserving methods have been suggested as a way of allowing
more people access to this precious data while protecting patients. In
particular, there has been growing interest in applying the concept of
differential privacy to GWAS results. Unfortunately, previous approaches for
performing differentially private GWAS are based on rather simple statistics
that have some major limitations. In particular, they do not correct for
population stratification, a major issue when dealing with the genetically
diverse populations present in modern GWAS. To address this concern we
introduce a novel computational framework for performing GWAS that tailors
ideas from differential privacy to protect private phenotype information, while
at the same time correcting for population stratification. This framework
allows us to produce privacy-preserving GWAS results based on two of the most
commonly used GWAS statistics: EIGENSTRAT and linear mixed model (LMM) based
statistics. We test our differentially private statistics, PrivSTRAT and
PrivLMM, on both simulated and real GWAS datasets and find that they are able
to protect privacy while returning meaningful GWAS results.Comment: To be presented at RECOMB 201
DPWeka: Achieving Differential Privacy in WEKA
Organizations in the government, commercial, and non-profit sectors collect and store large amounts of sensitive data, including medical, financial, and personal information. They use data mining methods to formulate business strategies that yield long-term and short-term financial benefits. While analyzing such data, the private information of the individuals in the data must be protected for moral and legal reasons. Current practices such as redacting sensitive attributes, releasing only aggregate values, and query auditing do not provide sufficient protection against an adversary armed with auxiliary information. Differential privacy, a privacy-protection framework, provides mathematical guarantees against adversarial attacks even in the presence of such background information.
Existing platforms for differential privacy employ specific mechanisms for limited applications of data mining. Additionally, widely used data mining tools do not contain differentially private data mining algorithms. As a result, for analyzing sensitive data, the cognizance of differentially private methods is currently limited outside the research community.
This thesis examines various mechanisms to realize differential privacy in practice and investigates methods to integrate them with a popular machine learning toolkit, WEKA. We present DPWeka, a package that provides differential privacy capabilities to WEKA for practical data mining. DPWeka includes a suite of differentially private algorithms supporting a variety of data mining tasks, including attribute selection and regression analysis. It has provisions for users to control privacy and model parameters, such as the privacy mechanism, privacy budget, and other algorithm-specific variables. We evaluate the private algorithms on real-world datasets, such as genetic data and census data, to demonstrate the practical applicability of DPWeka.
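Differentially private attribute selection of the kind mentioned above is often built on the exponential mechanism, which can be sketched as follows. The function name, interface, and utility scores here are hypothetical illustrations, not DPWeka's actual API:

```python
import numpy as np

def dp_select_attribute(scores, epsilon, sensitivity, rng=None):
    """Pick an attribute via the exponential mechanism.

    `scores` maps each candidate attribute to a utility value (e.g. an
    information-gain score); higher-utility attributes are exponentially
    more likely to be chosen, with epsilon controlling how sharply.
    Generic mechanism sketch, not DPWeka's own implementation.
    """
    if rng is None:
        rng = np.random.default_rng()
    names = list(scores)
    utility = np.array([scores[k] for k in names], dtype=float)
    logits = epsilon * utility / (2 * sensitivity)
    probs = np.exp(logits - logits.max())  # subtract max for stability
    probs /= probs.sum()
    return names[rng.choice(len(names), p=probs)]
```

As epsilon grows, the choice concentrates on the best-scoring attribute; as it shrinks, the selection approaches uniform, protecting individual records at the cost of utility.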
Homomorphic Encryption for Machine Learning in Medicine and Bioinformatics
Machine learning techniques are an excellent tool for the medical community for analyzing large amounts of medical and genomic data. On the other hand, ethical concerns and privacy regulations prevent the free sharing of this data. Encryption methods such as fully homomorphic encryption (FHE) provide a way to evaluate functions over encrypted data. Using FHE, machine learning models such as deep learning, decision trees, and naive Bayes have been implemented for private prediction on medical data. FHE has also been shown to enable secure genomic algorithms, such as paternity testing, and the secure application of genome-wide association studies. This survey provides an overview of fully homomorphic encryption and its applications in medicine and bioinformatics. The high-level concepts behind FHE and its history are introduced. Details on current open-source implementations are provided, as is the state of FHE for privacy-preserving techniques in machine learning and bioinformatics, along with future growth opportunities for FHE.
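The core idea of computing on encrypted data can be demonstrated with a toy Paillier cryptosystem, which is additively homomorphic (multiplying ciphertexts adds plaintexts). This is a pedagogical sketch only: Paillier is not *fully* homomorphic, and the tiny demo primes below are nowhere near a secure key size:

```python
from math import gcd
import random

# Toy Paillier cryptosystem: additively homomorphic, NOT fully
# homomorphic, and NOT secure at this key size (demo primes only).
p, q = 293, 433                       # real deployments use ~1024-bit primes
n, n2 = p * q, (p * q) ** 2
lam = (p - 1) * (q - 1) // gcd(p - 1, q - 1)
g = n + 1
mu = pow((pow(g, lam, n2) - 1) // n, -1, n)  # inverse of L(g^lam) mod n

def encrypt(m):
    r = random.randrange(1, n)
    while gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

# Homomorphic addition: multiplying ciphertexts adds the plaintexts.
c = encrypt(20) * encrypt(22) % n2
print(decrypt(c))  # 42
```

FHE schemes extend this idea to both addition and multiplication, which is what allows whole models such as decision trees to be evaluated on ciphertexts.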
sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies
Meta-analysis has been established as an effective approach to combining summary statistics of several genome-wide association studies (GWAS). However, the accuracy of meta-analysis can be attenuated in the presence of cross-study heterogeneity. We present sPLINK, a hybrid federated and user-friendly tool that performs privacy-aware GWAS on distributed datasets while preserving the accuracy of the results. sPLINK is robust against heterogeneous distributions of data across cohorts, whereas meta-analysis loses considerable accuracy in such scenarios. sPLINK achieves practical runtime and acceptable network usage for chi-square and linear/logistic regression tests.
Peer reviewed.
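Why a federated chi-square test can match pooled analysis exactly, where meta-analysis of per-study statistics may not, can be sketched as follows: each cohort contributes only aggregate contingency counts, and summing those counts reproduces the centralized statistic. This is a conceptual illustration, not sPLINK's actual protocol, which adds further privacy protections:

```python
import numpy as np

def federated_chi_square(local_tables):
    """Chi-square test of association from per-cohort contingency tables.

    Each cohort shares only its aggregate case/control x allele counts;
    summing the tables yields exactly the pooled-analysis statistic,
    regardless of how heterogeneously the data are split across cohorts.
    Returns (statistic, degrees of freedom).
    """
    table = np.sum(local_tables, axis=0).astype(float)
    row = table.sum(axis=1, keepdims=True)
    col = table.sum(axis=0, keepdims=True)
    expected = row @ col / table.sum()
    stat = ((table - expected) ** 2 / expected).sum()
    dof = (table.shape[0] - 1) * (table.shape[1] - 1)
    return stat, dof
```

A p-value follows from the chi-square survival function with `dof` degrees of freedom; the key property is that the statistic is invariant to how the rows of data are partitioned across sites.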
Sharing Privacy-sensitive Access to Neuroimaging and Genetics Data: A Review and Preliminary Validation
The growth of data sharing initiatives for neuroimaging and genomics represents an exciting opportunity to confront the “small N” problem that plagues contemporary neuroimaging studies while further understanding the role genetic markers play in the function of the brain. When it is possible, open data sharing provides the most benefits. However, some data cannot be shared at all due to privacy concerns and/or risk of re-identification. Sharing other data sets is hampered by the proliferation of complex data use agreements (DUAs), which preclude truly automated data mining. These DUAs arise because of concerns about privacy and confidentiality for subjects; though many do permit direct access to data, they often require a cumbersome approval process that can take months. An alternative approach is to share only data derivatives such as statistical summaries; the challenge here is to reformulate computational methods to quantify the privacy risks associated with sharing the results of those computations. For example, a derived map of gray matter is often as identifiable as a fingerprint, so alternative approaches to accessing data are needed. This paper reviews the relevant literature on differential privacy, a framework for measuring and tracking privacy loss in these settings, and demonstrates the feasibility of using this framework to calculate statistics on data distributed at many sites while still providing privacy.
A Multi-site Resting State fMRI Study on the Amplitude of Low Frequency Fluctuations in Schizophrenia
Background: This multi-site study compares resting state fMRI amplitude of low frequency fluctuations (ALFF) and fractional ALFF (fALFF) between patients with schizophrenia (SZ) and healthy controls (HC). Methods: Eyes-closed resting fMRI scans (5:38 min; n = 306, 146 SZ) were collected from 6 Siemens 3T scanners and one GE 3T scanner. Imaging data were pre-processed using an SPM pipeline. Power in the low frequency band (0.01–0.08 Hz) was calculated both for the original pre-processed data as well as for the pre-processed data after regressing out the six rigid-body motion parameters, mean white matter (WM) and cerebral spinal fluid (CSF) signals. Both original and regressed ALFF and fALFF measures were modeled with site, diagnosis, age, and diagnosis × age interactions. Results: Regressing out motion and non-gray matter signals significantly decreased fALFF throughout the brain as well as ALFF in the cortical edge, but significantly increased ALFF in subcortical regions. Regression had little effect on site, age, and diagnosis effects on ALFF, other than to reduce diagnosis effects in subcortical regions. There were significant effects of site across the brain in all the analyses, largely due to vendor differences. HC showed greater ALFF in the occipital, posterior parietal, and superior temporal lobe, while SZ showed smaller clusters of greater ALFF in the frontal and temporal/insular regions as well as in the caudate, putamen, and hippocampus. HC showed greater fALFF compared with SZ in all regions, though subcortical differences were only significant for original fALFF. Conclusions: SZ show greater eyes-closed resting state low frequency power in frontal cortex, and less power in posterior lobes than do HC; fALFF, however, is lower in SZ than HC throughout the cortex. These effects are robust to multi-site variability. 
Regressing out physiological noise signals significantly affects both total ALFF and fALFF measures, but does not change the pattern of case/control differences.
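The ALFF and fALFF measures analyzed above can be computed for a single voxel time series roughly as follows. Exact normalizations vary across pipelines (the study used an SPM pipeline), so treat this as an illustrative sketch rather than the paper's implementation:

```python
import numpy as np

def alff_falff(ts, tr, band=(0.01, 0.08)):
    """ALFF and fALFF for one voxel time series (illustrative sketch).

    ALFF is taken here as the mean FFT amplitude within the low-frequency
    band (0.01-0.08 Hz by default); fALFF is the in-band amplitude sum
    divided by the amplitude summed over the full frequency range, i.e.
    the fraction of power that is low-frequency.
    """
    ts = ts - ts.mean()                    # remove DC offset
    amp = np.abs(np.fft.rfft(ts))          # one-sided amplitude spectrum
    freqs = np.fft.rfftfreq(len(ts), d=tr) # Hz, given repetition time tr
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    alff = amp[in_band].mean()
    falff = amp[in_band].sum() / amp.sum()
    return alff, falff
```

A pure 0.05 Hz oscillation sampled at TR = 2 s yields fALFF near 1, while a 0.2 Hz oscillation yields fALFF near 0, matching the intuition that fALFF indexes the low-frequency fraction of total power.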