
    Statistical Methods and Privacy Preserving Protocols for Combining Genetic Data with Electronic Health Records

    In recent years, electronic health records (EHR) have been combined with genetic data to uncover disease biology and accelerate generation of hypotheses for drug development and treatment strategies. The goal of this dissertation is to develop novel statistical models that can address the challenges of analyzing ‘imperfect’ EHR data and to propose privacy-preserving methods that enable sensitive individual-level data sharing across EHR studies and other large genetic studies. In Chapter II, we propose a statistical method to address misclassified clinical outcomes, a common challenge in EHR data. One essential step of EHR-based genome-wide association studies is constructing a cohort of cases and controls for a specific disease from billing codes and other clinical or administrative data. Nearly always, a perfect strategy for deriving disease phenotypes from billing codes is not available, resulting in some incorrect case/control labels. Here, we propose a method to estimate the misclassification of case/control status by examining genotype information of dozens of disease-associated loci. Through simulation and application to the Michigan Genomics Initiative data, we demonstrate that the method enables the evaluation of new EHR-based phenotype definition schemes and provides accurate estimates of disease association measures when phenotypes are misclassified. In Chapters III and IV, we focus on identifying overlapping samples between studies, a common challenge when aggregating information across datasets. We particularly focus on identifying duplicate or related samples when sharing the underlying individual-level genetic data is restricted. We propose methods that do not require disclosure of individual identities but that can still identify genetic relatives across datasets.
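The intuition behind the Chapter II idea, that genotypes at known disease-associated loci carry information about label quality, can be illustrated with a deliberately simplified sketch. This is not the dissertation's estimator. It assumes a plain mixture model in which mislabeled cases look like controls, takes the control labels as correct, and treats the odds ratios as known from earlier studies. The function names are hypothetical.

```python
def case_freq_from_odds_ratio(control_freq, odds_ratio):
    """Risk-allele frequency in true cases implied by an allelic odds ratio."""
    return odds_ratio * control_freq / (1.0 - control_freq + odds_ratio * control_freq)


def estimate_case_ppv(observed_case_freqs, control_freqs, odds_ratios):
    """Least-squares estimate of the share of labeled cases that are true cases.

    Assumed mixture model (an illustration, not the dissertation's model):
        observed = ppv * true_case + (1 - ppv) * control
    """
    numerator = 0.0
    denominator = 0.0
    for observed, control, odds_ratio in zip(
        observed_case_freqs, control_freqs, odds_ratios
    ):
        # Frequency gap we would expect if every labeled case were a true case.
        expected_gap = case_freq_from_odds_ratio(control, odds_ratio) - control
        numerator += (observed - control) * expected_gap
        denominator += expected_gap ** 2
    if denominator == 0.0:
        raise ValueError("loci carry no signal (all odds ratios equal 1)")
    # Clamp to [0, 1] because sampling noise can push the raw ratio outside it.
    return min(1.0, max(0.0, numerator / denominator))
```

Under this toy model, an attenuated allele-frequency gap between labeled cases and controls translates directly into a lower estimated share of true cases. Pooling many loci, as the abstract describes, reduces the noise of the estimate.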
In Chapter III, we show that by grouping genotypes into segments and calculating summary statistics within each segment, we are able to obscure and encode individual-level genetic information. Relatives can be inferred with the coded genotypes using a likelihood model. Simulation and application to the Trans-Omics for Precision Medicine (TOPMed) program data demonstrate the utility and security of the method. In Chapter IV, we extend the method further, with a strategy that guarantees stronger encryption and is expected to work across heterogeneous populations. This secure protocol can infer genetic relatives among people of diverse ethnic backgrounds. The method works by combining a cryptographic technique, homomorphic encryption, with the robust relationship inference method previously described by Manichaikul et al. (2010). Through simulations, we show that our method's performance is identical to that of implementations that use the original unencrypted genotypes. Our protocol scales well in computing time and is protected from several possible attacks. The secure protocol was again applied to the TOPMed dataset. Securely identifying related samples will facilitate combination of results across datasets when there are restrictions to sharing the underlying individual-level data. In conclusion, the methods developed here will enhance the use of EHR data and genome data to improve the accuracy of case/control status as well as decrease the inclusion of relatives across studies when desired.
PhD, Biostatistics, University of Michigan, Horace H. Rackham School of Graduate Studies
https://deepblue.lib.umich.edu/bitstream/2027.42/153415/1/xtzhao_1.pd
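For reference, the robust relationship inference estimator of Manichaikul et al. (2010) that Chapter IV builds on can be sketched on plaintext genotypes as follows. The sketch covers only the unencrypted between-family estimator; how the dissertation evaluates it under homomorphic encryption is not described in the abstract and is not attempted here. The function name and the use of `None` for missing calls are assumptions made for illustration.

```python
def king_robust_kinship(genotypes_i, genotypes_j):
    """Pairwise kinship coefficient from the KING-robust estimator.

    Genotypes are minor-allele counts (0, 1 or 2) per SNP; None marks a
    missing call. Uses the equivalent form
        phi = 1/2 - sum((g_i - g_j)^2) / (4 * min(het_i, het_j)),
    where het_i and het_j count heterozygous SNPs among jointly called sites.
    """
    het_i = 0
    het_j = 0
    squared_distance = 0
    for g_i, g_j in zip(genotypes_i, genotypes_j):
        if g_i is None or g_j is None:
            continue  # skip SNPs not called in both samples
        het_i += int(g_i == 1)
        het_j += int(g_j == 1)
        squared_distance += (g_i - g_j) ** 2
    smaller_het_count = min(het_i, het_j)
    if smaller_het_count == 0:
        raise ValueError("at least one sample has no heterozygous SNPs")
    return 0.5 - squared_distance / (4.0 * smaller_het_count)
```

Identical genotype vectors give 0.5, the expected value for duplicate samples, while first-degree relatives are expected to score near 0.25 and unrelated pairs near or below 0. Because the estimator uses only the two samples' own genotypes and no population allele frequencies, it is well suited to heterogeneous cohorts.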

    Secure, privacy-preserving and practical collaborative Genome-Wide Association Studies

    Understanding the interplay between genomics and human health is a crucial step for the advancement and development of our society. The Genome-Wide Association Study (GWAS) is one of the most popular methods for discovering associations between genomic variations and a particular phenotype (i.e., an observable trait such as a disease). Leveraging genome data from multiple institutions worldwide is now essential to produce more powerful findings by operating GWAS at a larger scale. However, this raises several security and privacy risks, not only in the computation of such statistics, but also in the public release of GWAS results. To that end, several solutions in the literature have adopted cryptographic approaches to allow secure and privacy-preserving processing of genome data for federated analysis. However, conducting federated GWAS in a secure and privacy-preserving manner is not enough, since the public releases of GWAS results might be vulnerable to known genomic privacy attacks, such as recovery and membership attacks. The present thesis explores possible solutions to enable end-to-end privacy-preserving federated GWAS in line with data privacy regulations such as GDPR. The aim is to secure the public release of the results of Genome-Wide Association Studies (GWASes) that are dynamically updated as new genomes become available, that might overlap with one another in their genomes and considered locations within the genome, that can withstand internal threats such as colluding members in the federation, and that are computed in a distributed manner without shipping actual genome data. In pursuing these goals, this work makes several contributions, described below. First, the thesis proposes DyPS, a Trusted Execution Environment (TEE)-based framework that reconciles efficient and secure genome data outsourcing with privacy-preserving data processing inside TEE enclaves to assess and create private releases of dynamic GWAS.
In particular, DyPS presents the conditions for the creation of safe dynamic releases. These conditions certify that the solution space that an external probabilistic polynomial-time (p.p.t.) adversary, or a group of colluders (up to all-but-one parties), would need to explore when launching recovery attacks based on observed GWAS statistics has a sufficiently large theoretical complexity. Besides that, DyPS executes an exhaustive verification algorithm along with a Likelihood-ratio test to measure the probability of identifying individuals in studies, thus also protecting individuals against membership inference attacks. Only safe genome data (i.e., genomes and SNPs) that DyPS selects are further used for the computation and release of GWAS results. At the same time, the remaining (unsafe) data is kept secluded and protected inside the enclave until it can eventually be used. Our results show that if dynamic releases are not properly evaluated, up to 8% of genomes could be exposed to genomic privacy attacks. Moreover, the experiments show that DyPS’ TEE-based architecture can accommodate the computational resources demanded by our algorithms and presents practical running times for larger-scale GWAS. Second, the thesis offers I-GWAS, which identifies the new conditions for safe releases when considering the existence of overlapping data among multiple GWASes (e.g., the same individuals participating in several studies). Indeed, it is shown that adversaries might leverage information from overlapping data to make both recovery and membership attacks feasible again (even if the releases are produced following the conditions for safe single-GWAS releases). Our experiments show that up to 28.6% of genetic variants of participants could be inferred during recovery attacks, and that 92.3% of these variants would enable membership attacks by adversaries observing overlapping studies; these are withheld by I-GWAS.
Last but not least, the thesis presents GenDPR, which encompasses extensions to our protocols so that the privacy-verification algorithms can be conducted in a distributed manner among the federation members without requiring genome data to be outsourced across boundaries. Further, GenDPR can also cope with collusion among participants while selecting genome data that can be used to create safe releases. Additionally, GenDPR produces the same privacy guarantees as centralized architectures, i.e., it correctly identifies and selects the same data in need of protection as centralized approaches do. In the end, the thesis presents a homogenized framework comprising DyPS, I-GWAS and GenDPR simultaneously, thus offering a usable approach for conducting practical GWAS. The method chosen for protection is of a statistical nature: it ensures that the theoretical complexity of attacks remains high and, using Likelihood-ratio tests, withholds releases of statistics that would impose membership inference risks on participants, even as adversaries gain additional information over time. The thesis also relates the findings to techniques that can be leveraged to protect releases (such as Differential Privacy). The proposed solutions leverage Intel SGX as the Trusted Execution Environment to perform selected critical operations in a performant manner; however, the work translates equally well to other trusted execution environments and other schemes, such as Homomorphic Encryption.
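The abstract does not spell out the Likelihood-ratio test used to assess membership inference risk. As a rough illustration only, a generic per-SNP log-likelihood-ratio statistic of the kind commonly used to analyze released allele frequencies could look like the sketch below. The function name, the single-allele encoding (0 or 1 per SNP) and the use of a reference panel are assumptions, not details taken from the thesis.

```python
import math


def membership_lr_statistic(alleles, pool_freqs, reference_freqs):
    """Log-likelihood ratio that a target's alleles come from the study pool
    rather than from a reference population.

    alleles: 0/1 indicator per SNP for the target individual.
    pool_freqs: released allele frequencies of the study cohort.
    reference_freqs: allele frequencies of a reference panel.
    """
    score = 0.0
    for allele, pool, reference in zip(alleles, pool_freqs, reference_freqs):
        if not (0.0 < pool < 1.0 and 0.0 < reference < 1.0):
            raise ValueError("frequencies must lie strictly between 0 and 1")
        if allele == 1:
            score += math.log(pool / reference)
        else:
            score += math.log((1.0 - pool) / (1.0 - reference))
    return score
```

A large positive score suggests the target resembles the released pool more than the reference population, which is the signal a membership attacker looks for. A release policy in the spirit of the thesis would withhold SNPs whose inclusion pushes such scores past a chosen detection threshold, although the exact selection rule is not given in the abstract.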