
    Privacy-Preserving Data Sharing for Genome-Wide Association Studies

    Traditional statistical methods for confidentiality protection of statistical databases do not scale well to GWAS (genome-wide association study) databases, especially in terms of guarantees of protection against linkage to external information. The more recent concept of differential privacy, introduced by the cryptographic community, provides a rigorous definition of privacy with meaningful guarantees in the presence of arbitrary external information, although those guarantees come at a serious price in data utility. Building on these notions, we propose new methods to release aggregate GWAS data without compromising an individual's privacy. We present methods for releasing differentially private minor allele frequencies, chi-square statistics, and p-values. We compare these approaches on simulated data and on a GWAS of canine hair length involving 685 dogs. We also propose a privacy-preserving method for finding genome-wide associations based on a differentially private approach to penalized logistic regression.
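    As a rough illustration of the kind of release this abstract describes, the sketch below applies the standard Laplace mechanism to a minor allele frequency. The function name, sensitivity argument (one individual changes at most 2 of 2n alleles, so the frequency moves by at most 1/n), and simulated genotypes are illustrative assumptions, not the authors' code.

```python
import numpy as np

def dp_minor_allele_frequency(genotypes, epsilon, rng=None):
    """Release a differentially private minor allele frequency.

    genotypes: array of 0/1/2 minor-allele counts, one per individual.
    epsilon:   privacy budget spent on this single release.

    Changing one individual's genotype shifts the allele count by at
    most 2 out of 2n alleles, so the frequency's sensitivity is 1/n
    and Laplace noise with scale 1 / (n * epsilon) suffices.
    """
    rng = np.random.default_rng() if rng is None else rng
    g = np.asarray(genotypes)
    n = g.size
    maf = g.sum() / (2 * n)
    noisy = maf + rng.laplace(scale=1.0 / (n * epsilon))
    return float(np.clip(noisy, 0.0, 1.0))  # frequencies live in [0, 1]

# Example: 685 simulated individuals (the size of the canine study), epsilon = 1.
rng = np.random.default_rng(0)
genotypes = rng.binomial(2, 0.3, size=685)
print(dp_minor_allele_frequency(genotypes, epsilon=1.0, rng=rng))
```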

    Supporting Regularized Logistic Regression Privately and Efficiently

    As one of the most popular statistical and machine learning models, logistic regression with regularization has found wide adoption in biomedicine, the social sciences, information technology, and beyond. These domains often involve data on human subjects that are subject to strict privacy regulations. Growing concerns over data privacy make it increasingly difficult to coordinate and conduct large-scale collaborative studies, which typically rely on cross-institution data sharing and joint analysis. Our work focuses on safeguarding regularized logistic regression, a model widely used across disciplines that has nonetheless received little attention from a data security and privacy perspective. We consider a common scenario of multi-institution collaborative studies, such as the research consortia or networks widely seen in genetics, epidemiology, and the social sciences. To make our privacy-enhancing solution practical, we demonstrate an unconventional and computationally efficient method that leverages distributed computing and strong cryptography to provide comprehensive protection of both individual-level and summary data. Extensive empirical evaluation on several studies validates the privacy guarantees, efficiency, and scalability of our proposal. We also discuss the practical implications of our solution for large-scale studies and applications in various domains, including genetic and biomedical studies, smart grids, and network analysis.
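    The paper's actual protocol is cryptographic; as a hedged sketch of the general idea of combining distributed computing with protection of individual-level data, the toy below aggregates per-institution logistic regression gradients under additive masking, where pairwise masks cancel in the sum so the coordinator learns only the aggregate. All names are invented, and the masking shortcut (a shared RNG instead of cryptographic key agreement) is an assumption for illustration.

```python
import numpy as np

def local_gradient(w, X, y):
    """Unregularized logistic-loss gradient on one institution's data."""
    p = 1.0 / (1.0 + np.exp(-X @ w))
    return X.T @ (p - y)

def masked(grads, rng):
    """Additively mask each site's gradient so only the sum is learnable.

    Pairwise masks are added by site i and subtracted by site j, so
    they cancel in the aggregate; a real deployment would derive them
    from cryptographic key agreement rather than a shared RNG.
    """
    out = [g.astype(float).copy() for g in grads]
    for i in range(len(out)):
        for j in range(i + 1, len(out)):
            r = rng.normal(size=out[0].shape)
            out[i] += r
            out[j] -= r
    return out

# Toy run: three institutions take one joint L2-regularized step.
rng = np.random.default_rng(1)
sites = [(rng.normal(size=(50, 4)), rng.integers(0, 2, 50)) for _ in range(3)]
w, lam, lr = np.zeros(4), 0.1, 0.01
grads = [local_gradient(w, X, y) for X, y in sites]
agg = sum(masked(grads, rng))            # coordinator sees only this
assert np.allclose(agg, sum(grads))      # masks cancel exactly
n_total = sum(len(y) for _, y in sites)
w -= lr * (agg / n_total + lam * w)      # regularize once, centrally
```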

    Enabling Privacy-Preserving GWAS in Heterogeneous Human Populations

    The projected increase of genotyping in the clinic and the rise of large genomic databases have made it possible to use patient medical data to perform genome-wide association studies (GWAS) on a larger scale and at a lower cost than ever before. Due to privacy concerns, however, access to these data is limited to a few trusted individuals, greatly reducing their impact on biomedical research. Privacy-preserving methods have been suggested as a way to give more people access to this precious resource while protecting patients. In particular, there has been growing interest in applying the concept of differential privacy to GWAS results. Unfortunately, previous approaches to differentially private GWAS are based on rather simple statistics with some major limitations. In particular, they do not correct for population stratification, a major issue when dealing with the genetically diverse populations present in modern GWAS. To address this concern, we introduce a novel computational framework for performing GWAS that tailors ideas from differential privacy to protect private phenotype information while simultaneously correcting for population stratification. This framework allows us to produce privacy-preserving GWAS results based on two of the most commonly used GWAS statistics: EIGENSTRAT and linear mixed model (LMM) based statistics. We test our differentially private statistics, PrivSTRAT and PrivLMM, on both simulated and real GWAS datasets and find that they protect privacy while returning meaningful GWAS results.
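    A conceptual sketch of the EIGENSTRAT-plus-noise idea, not PrivSTRAT itself: residualize the phenotype and genotypes on the top principal components, score each SNP, then add Laplace noise. The sensitivity passed to privatize below is a placeholder assumption; deriving a valid bound on one individual's influence is precisely the hard part the paper addresses.

```python
import numpy as np

def eigenstrat_style_scores(G, y, n_pcs=2):
    """Armitage-style association scores after PC correction.

    G: (individuals x SNPs) matrix of 0/1/2 genotype counts.
    y: phenotype vector. Both phenotype and each SNP are residualized
    on the top principal components of G (the EIGENSTRAT idea), then
    each SNP is scored by (n - n_pcs - 1) * r^2, approximately
    chi-square with 1 degree of freedom under the null.
    """
    Gc = G - G.mean(axis=0)
    U, _, _ = np.linalg.svd(Gc, full_matrices=False)
    P = U[:, :n_pcs]                      # axes of ancestry variation
    resid = lambda v: v - P @ (P.T @ v)   # project out the PCs
    yr = resid(y - y.mean())
    n = len(y)
    scores = []
    for j in range(G.shape[1]):
        gr = resid(Gc[:, j])
        r = gr @ yr / (np.linalg.norm(gr) * np.linalg.norm(yr) + 1e-12)
        scores.append((n - n_pcs - 1) * r ** 2)
    return np.array(scores)

def privatize(scores, epsilon, sensitivity, rng):
    """Laplace-noise the scores. `sensitivity` must be a proven bound
    on one individual's influence; here it is a placeholder input."""
    return scores + rng.laplace(scale=sensitivity / epsilon,
                                size=scores.shape)
```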

    DPWeka: Achieving Differential Privacy in WEKA

    Government, commercial, and non-profit organizations collect and store large amounts of sensitive data, including medical, financial, and personal information, and use data mining methods to formulate business strategies with high long-term and short-term financial benefits. When analyzing such data, the private information of the individuals in the data must be protected for moral and legal reasons. Current practices such as redacting sensitive attributes, releasing only aggregate values, and query auditing do not provide sufficient protection against an adversary armed with auxiliary information. Differential privacy is a privacy-protection framework that provides mathematical guarantees against adversarial attacks even in the presence of such background information. Existing platforms for differential privacy implement specific mechanisms for a limited range of data mining applications, and widely used data mining tools do not include differentially private algorithms. As a result, awareness of differentially private methods for analyzing sensitive data remains limited outside the research community. This thesis examines various mechanisms for realizing differential privacy in practice and investigates how to integrate them into a popular machine learning toolkit, WEKA. We present DPWeka, a package that adds differential privacy capabilities to WEKA for practical data mining. DPWeka includes a suite of differentially private algorithms supporting a variety of data mining tasks, including attribute selection and regression analysis, and lets users control privacy and model parameters such as the privacy mechanism, the privacy budget, and other algorithm-specific variables. We evaluate the private algorithms on real-world datasets, including genetic and census data, to demonstrate the practical applicability of DPWeka.
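    One standard building block for tasks like the attribute selection mentioned above is the exponential mechanism; the sketch below (function name and utility scores are illustrative, not DPWeka's code) samples an attribute with probability weighted by its utility while consuming a caller-specified slice of the privacy budget.

```python
import numpy as np

def exp_mech_select(utilities, epsilon, sensitivity, rng):
    """Pick one attribute with the exponential mechanism.

    utilities:   one utility score per attribute (e.g., |correlation|
                 with the class label); higher means more useful.
    sensitivity: maximum change in any utility caused by altering one
                 record in the dataset.
    Sampling attribute i with probability proportional to
    exp(epsilon * u_i / (2 * sensitivity)) satisfies epsilon-DP.
    """
    u = np.asarray(utilities, dtype=float)
    logits = epsilon * u / (2.0 * sensitivity)
    logits -= logits.max()                    # numerical stability
    probs = np.exp(logits) / np.exp(logits).sum()
    return int(rng.choice(len(u), p=probs))

rng = np.random.default_rng(2)
print(exp_mech_select([0.9, 0.2, 0.85, 0.1], epsilon=1.0,
                      sensitivity=1.0, rng=rng))
```

    Selecting k attributes by repeated draws splits the overall budget under sequential composition, e.g. epsilon/k per draw, which is the kind of privacy-budget bookkeeping the abstract says DPWeka exposes to users.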

    Homomorphic Encryption for Machine Learning in Medicine and Bioinformatics

    Machine learning techniques are an excellent tool for the medical community to analyze large amounts of medical and genomic data. On the other hand, ethical concerns and privacy regulations prevent the free sharing of these data. Encryption methods such as fully homomorphic encryption (FHE) provide a way to evaluate computations over encrypted data. Using FHE, machine learning models such as deep learning, decision trees, and naive Bayes have been implemented for private prediction on medical data. FHE has also been shown to enable secure genomic algorithms, such as paternity testing, and the secure application of genome-wide association studies. This survey provides an overview of fully homomorphic encryption and its applications in medicine and bioinformatics. The high-level concepts behind FHE and its history are introduced, details on current open-source implementations are provided, and the state of FHE for privacy-preserving techniques in machine learning and bioinformatics is reviewed along with future growth opportunities.
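    For a feel of computing on ciphertexts, here is a toy Paillier-style demo in plain Python. Note the caveats: this scheme is additively homomorphic only, not FHE (schemes like BGV or CKKS also support multiplication); the primes are far too small to be secure; and the genotype-aggregation scenario is invented for illustration.

```python
from math import gcd, lcm
import secrets

# Toy Paillier keypair. The primes are the 9999th and 10000th primes,
# chosen so the demo runs instantly; real keys use thousands of bits.
p, q = 104_723, 104_729
n, n2 = p * q, (p * q) ** 2
lam = lcm(p - 1, q - 1)
mu = pow(lam, -1, n)          # works because we fix g = n + 1

def encrypt(m):
    r = secrets.randbelow(n - 1) + 1
    while gcd(r, n) != 1:
        r = secrets.randbelow(n - 1) + 1
    return pow(n + 1, m, n2) * pow(r, n, n2) % n2

def decrypt(c):
    return (pow(c, lam, n2) - 1) // n * mu % n

# A server sums encrypted genotype counts without seeing plaintexts:
# multiplying ciphertexts adds the underlying messages.
counts = [2, 0, 1, 1, 2]
cipher_sum = 1
for m in counts:
    cipher_sum = cipher_sum * encrypt(m) % n2
assert decrypt(cipher_sum) == sum(counts)
```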

    sPLINK: a hybrid federated tool as a robust alternative to meta-analysis in genome-wide association studies

    Meta-analysis is an established and effective approach to combining the summary statistics of several genome-wide association studies (GWAS). However, its accuracy can be attenuated in the presence of cross-study heterogeneity. We present sPLINK, a hybrid federated and user-friendly tool that performs privacy-aware GWAS on distributed datasets while preserving the accuracy of the results. sPLINK is robust against heterogeneous distributions of data across cohorts, whereas meta-analysis loses considerable accuracy in such scenarios. sPLINK achieves practical runtimes and acceptable network usage for chi-square and linear/logistic regression tests.
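    The statistical core of a federated chi-square test can be sketched as pooling per-cohort contingency tables before testing, which is what keeps the result faithful to a pooled analysis under heterogeneity; sPLINK additionally protects the local tables during aggregation, which this sketch omits. The table shapes, counts, and names below are illustrative.

```python
import numpy as np
from scipy.stats import chi2

def federated_chi_square(local_tables):
    """Pool per-cohort 2x3 contingency tables (case/control rows,
    genotype 0/1/2 columns) and run a single chi-square test.

    Summing counts before testing makes the result match a pooled
    analysis even when cohorts have heterogeneous allele frequencies,
    whereas meta-analysis combines per-cohort statistics instead.
    """
    T = np.sum(local_tables, axis=0).astype(float)
    expected = np.outer(T.sum(axis=1), T.sum(axis=0)) / T.sum()
    stat = ((T - expected) ** 2 / expected).sum()
    df = (T.shape[0] - 1) * (T.shape[1] - 1)
    return stat, chi2.sf(stat, df)

# Two cohorts with different genotype distributions for one SNP.
cohort_a = np.array([[30, 45, 25], [40, 40, 20]])
cohort_b = np.array([[10, 30, 60], [20, 35, 45]])
stat, pval = federated_chi_square([cohort_a, cohort_b])
print(f"chi2 = {stat:.2f}, p = {pval:.3g}")
```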

    Sharing Privacy-sensitive Access to Neuroimaging and Genetics Data: A Review and Preliminary Validation

    The growth of data sharing initiatives for neuroimaging and genomics represents an exciting opportunity to confront the “small N” problem that plagues contemporary neuroimaging studies while furthering our understanding of the role genetic markers play in brain function. Where it is possible, open data sharing provides the most benefits. However, some data cannot be shared at all due to privacy concerns and/or the risk of re-identification. Sharing of other data sets is hampered by the proliferation of complex data use agreements (DUAs), which preclude truly automated data mining. These DUAs arise from concerns about the privacy and confidentiality of subjects; although many permit direct access to the data, they often require a cumbersome approval process that can take months. An alternative approach is to share only data derivatives such as statistical summaries; the challenge is then to reformulate computational methods so as to quantify the privacy risks of sharing the results of those computations, since a derived map of gray matter, for example, can be as identifiable as a fingerprint. This paper reviews the relevant literature on differential privacy, a framework for measuring and tracking privacy loss in these settings, and demonstrates the feasibility of using this framework to calculate statistics on data distributed across many sites while still providing privacy.
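    As a minimal sketch of the reviewed setting, assuming sites release noisy summaries rather than raw data: each site perturbs a clipped local mean with Laplace noise, and a small ledger tracks cumulative privacy loss under sequential composition. All names, bounds, and epsilon values below are assumptions for illustration.

```python
import numpy as np

class SiteBudget:
    """Track cumulative privacy loss at one site under sequential
    composition: the epsilons of successive releases simply add."""
    def __init__(self, total_epsilon):
        self.remaining = total_epsilon

    def spend(self, eps):
        if eps > self.remaining:
            raise RuntimeError("privacy budget exhausted")
        self.remaining -= eps

def dp_site_mean(values, lo, hi, eps, budget, rng):
    """Release a site's mean of values clipped to [lo, hi] with
    Laplace noise; the sensitivity of the mean is (hi - lo) / n."""
    budget.spend(eps)
    v = np.clip(values, lo, hi)
    scale = (hi - lo) / (len(v) * eps)
    return v.mean() + rng.laplace(scale=scale)

# Three sites each release a noisy gray-matter summary twice;
# the budget object stops any site from over-spending.
rng = np.random.default_rng(3)
sites = [rng.normal(0.5, 0.1, size=200) for _ in range(3)]
budgets = [SiteBudget(total_epsilon=1.0) for _ in sites]
for _ in range(2):
    noisy = [dp_site_mean(v, 0.0, 1.0, 0.4, b, rng)
             for v, b in zip(sites, budgets)]
    print("pooled estimate:", float(np.mean(noisy)))
```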

    A Multi-site Resting State fMRI Study on the Amplitude of Low Frequency Fluctuations in Schizophrenia

    Background: This multi-site study compares resting-state fMRI amplitude of low frequency fluctuations (ALFF) and fractional ALFF (fALFF) between patients with schizophrenia (SZ) and healthy controls (HC). Methods: Eyes-closed resting fMRI scans (5:38 min; n = 306, 146 SZ) were collected from six Siemens 3T scanners and one GE 3T scanner. Imaging data were pre-processed using an SPM pipeline. Power in the low-frequency band (0.01–0.08 Hz) was calculated both for the original pre-processed data and for the pre-processed data after regressing out the six rigid-body motion parameters and the mean white matter (WM) and cerebrospinal fluid (CSF) signals. Both original and regressed ALFF and fALFF measures were modeled with site, diagnosis, age, and diagnosis × age interactions. Results: Regressing out motion and non-gray-matter signals significantly decreased fALFF throughout the brain as well as ALFF at the cortical edge, but significantly increased ALFF in subcortical regions. Regression had little effect on site, age, and diagnosis effects on ALFF, other than to reduce diagnosis effects in subcortical regions. There were significant effects of site across the brain in all analyses, largely due to vendor differences. HC showed greater ALFF in occipital, posterior parietal, and superior temporal regions, while SZ showed smaller clusters of greater ALFF in frontal and temporal/insular regions as well as in the caudate, putamen, and hippocampus. HC showed greater fALFF than SZ in all regions, though subcortical differences were significant only for original fALFF. Conclusions: SZ show greater eyes-closed resting-state low-frequency power in frontal cortex, and less power in posterior lobes, than do HC; fALFF, however, is lower in SZ than HC throughout the cortex. These effects are robust to multi-site variability. Regressing out physiological noise signals significantly affects both ALFF and fALFF measures, but does not affect the pattern of case/control differences.
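    For readers unfamiliar with the measures, a minimal per-voxel sketch of ALFF and fALFF from the FFT amplitude spectrum follows. The band edges match the abstract (0.01–0.08 Hz), while the TR, series length, and normalization details are assumptions; the study's SPM-based pipeline may differ.

```python
import numpy as np

def alff_falff(ts, tr, band=(0.01, 0.08)):
    """Compute ALFF and fALFF for one voxel time series.

    ts:  1-D BOLD time series (already preprocessed).
    tr:  repetition time in seconds.
    ALFF is the summed amplitude spectrum inside the low-frequency
    band; fALFF divides that by the amplitude over all frequencies,
    which normalizes away broadband (e.g., physiological) power.
    """
    ts = ts - ts.mean()
    freqs = np.fft.rfftfreq(len(ts), d=tr)
    amp = np.abs(np.fft.rfft(ts)) / len(ts)
    in_band = (freqs >= band[0]) & (freqs <= band[1])
    alff = amp[in_band].sum()
    falff = alff / (amp[1:].sum() + 1e-12)  # skip the DC bin
    return alff, falff

# Toy voxel: 5:38 min of data at an assumed TR of 2 s (169 volumes),
# with a 0.04 Hz oscillation buried in noise.
rng = np.random.default_rng(4)
t = np.arange(169) * 2.0
ts = np.sin(2 * np.pi * 0.04 * t) + 0.5 * rng.normal(size=t.size)
print(alff_falff(ts, tr=2.0))
```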