Genotype-phenotype association studies play a central role in precision medicine by enabling researchers to identify genetic variants that influence complex traits and diseases, with applications ranging from disease risk prediction to treatment stratification. Notable approaches include genome-wide association studies (GWAS), quantitative trait loci (QTL) mapping, and increasingly, machine learning models that learn predictive relationships between genotype and phenotype data. The power of these approaches grows substantially in collaborative research settings, where pooling data across cohorts increases the statistical power for variant discovery, and sharing trained machine learning models enables institutions with limited computational resources to perform inference on their own cohorts.
However, these collaborative gains come with a fundamental privacy-utility tradeoff: sharing raw genomic and phenotypic data can expose sensitive information about study participants, and sharing machine learning models introduces the risk of information leakage through model parameters. These concerns are amplified by strict data governance policies and institutional silos that restrict data movement, underscoring the urgent need for privacy-preserving frameworks to enable secure, collaborative analysis while maintaining statistical rigor and practical scalability.
This dissertation develops end-to-end, cryptography-based solutions for privacy-preserving genotype-phenotype association studies, addressing the entire analysis pipeline, from preprocessing and harmonization of phenotype data, through secure multi-site eQTL mapping, to inference on sensitive data with machine learning models. The overarching goal is to design frameworks that (i) provide strong cryptographic security guarantees, (ii) scale computationally to large multi-institution cohorts, and (iii) remain statistically robust, preserving accuracy comparable to non-secure baselines.
In Chapter 3, we design a privacy-preserving gene expression preprocessing framework that addresses key bottlenecks in federated transcriptomic studies. Our framework provides two secure normalization options — quantile normalization (QN) and relative log expression (RLE) — to allow flexibility depending on data sharing standards, a secure multiparty computation (MPC)-based protocol for inverse normal transformation, and a scalable local principal component analysis (PCA)-based hidden covariate correction strategy. We validate our approach using both simulated multi-institution datasets and real-world gene expression data to show that our methods achieve phenotype correction accuracy comparable to centralized, non-secure pipelines while maintaining privacy of individual-level data. These results demonstrate that federated preprocessing with local computation is feasible and effective for collaborative studies.
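To make the two transforms named above concrete, the following is a minimal plaintext sketch of quantile normalization (QN) and the rank-based inverse normal transformation. In the framework these computations run under secure protocols (MPC for the inverse normal transformation); here they are shown in the clear purely to illustrate the arithmetic, and the function names are illustrative rather than the framework's API.

```python
import numpy as np
from scipy.stats import norm, rankdata

def quantile_normalize(X):
    """Map each column (sample) of a genes x samples matrix onto the
    mean quantile profile, so all samples share one distribution."""
    ranks = np.argsort(np.argsort(X, axis=0), axis=0)   # per-column ranks 0..n-1
    mean_profile = np.sort(X, axis=0).mean(axis=1)      # reference distribution
    return mean_profile[ranks]

def inverse_normal_transform(x, offset=0.5):
    """Rank-based inverse normal transform of one gene's expression
    across samples: ranks are mapped through the standard normal
    quantile function. The offset choice is an illustrative default."""
    r = rankdata(x)                                     # average ranks for ties
    return norm.ppf((r - offset) / len(x))

# Toy genes x samples expression matrix.
X = np.array([[5.0, 2.0, 3.0],
              [4.0, 1.0, 6.0],
              [3.0, 4.0, 8.0]])
Xqn = quantile_normalize(X)          # every column now has the same distribution
z = inverse_normal_transform(X[0])   # approximately standard-normal values
```

After QN, the sorted values of every column are identical by construction, which is the property a secure QN protocol must preserve while keeping each site's raw values private.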
Building on this foundation, Chapter 4 introduces privateQTL, a secure and scalable framework for multi-center eQTL mapping. privateQTL implements practical genotype and phenotype correction strategies, including genotype population stratification via projection on public reference panels and the privacy-preserving gene expression preprocessing discussed in Chapter 3. We further propose a one-shot matrix multiplication approach that enables efficient nominal association testing and permutation-based false discovery control without repeated communication rounds, significantly reducing runtime. Our evaluation compares privateQTL against meta-analyses and centralized pipelines across multiple axes — eGene and eVariant discovery rates, robustness to batch effects, statistical power, runtime, and memory footprint — using both simulated federated datasets and real-world multi-site data with known batch heterogeneity. Our results demonstrate that privateQTL achieves superior discovery rates compared to meta-analyses, particularly in heterogeneous data settings, while maintaining strong privacy guarantees under a semi-honest adversary model.
Finally, in Chapter 5, we address the emerging challenge of secure machine learning inference on sensitive genomic data. We present two homomorphic encryption (HE)-based frameworks: (i) a secure inference method for linear models in which both inputs and model weights are encrypted, enabling end-to-end confidential inference without compromising predictive performance, and (ii) a method for secure inference on transformer architectures using approximations for non-linear functions. For linear model inference, we introduce an efficient encoding method that improves computational efficiency during encrypted dot product computation, and for transformer inference, we develop polynomial approximations for nonlinear functions such as softmax, ReLU, and layer normalization to balance computational feasibility with model accuracy. We validate our linear model inference framework on both continuous and binary phenotype prediction tasks using simulated and real data, achieving performance comparable to plaintext inference. For transformer inference, we discuss the challenges of implementing our approximations in a practical, scalable setting and lay the groundwork for future work on large-scale transformer inference.
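The need for polynomial approximations stems from the fact that HE schemes evaluate only additions and multiplications. The following sketch illustrates the general technique on one of the nonlinearities mentioned above, fitting a low-degree least-squares polynomial to ReLU on a bounded interval; the interval, degree, and fitting method are illustrative choices for exposition, not the approximations developed in the dissertation.

```python
import numpy as np

def fit_relu_poly(degree=8, lo=-5.0, hi=5.0, npts=2001):
    """Least-squares polynomial approximation of ReLU on [lo, hi].
    Coefficients are returned lowest degree first."""
    x = np.linspace(lo, hi, npts)
    return np.polynomial.polynomial.polyfit(x, np.maximum(x, 0.0), degree)

def eval_poly(coeffs, x):
    """Evaluate the polynomial; uses only additions and
    multiplications, the operations HE ciphertexts support."""
    return np.polynomial.polynomial.polyval(x, coeffs)

coeffs = fit_relu_poly()
grid = np.linspace(-5.0, 5.0, 101)
err = np.max(np.abs(eval_poly(coeffs, grid) - np.maximum(grid, 0.0)))
```

The approximation error concentrates near the kink at zero, which is why inputs must be kept within the fitted interval (e.g., by normalization) and why degree, interval, and multiplicative depth must be balanced against accuracy, the tradeoff described in the text.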
Collectively, this dissertation makes significant contributions to the field of privacy-preserving biomedical informatics. By providing scalable, modular, and cryptographically sound methods for phenotype preprocessing, federated eQTL mapping, and secure machine learning inference, this work enables collaborative genomic research while rigorously protecting sensitive participant data. The frameworks and findings presented here create a foundation for future developments in privacy-aware collaborative studies, advancing the realization of precision medicine in a manner that respects individual privacy, complies with regulatory requirements, and preserves scientific reproducibility.