High-definition likelihood framework for local genetic architecture and proteome-informed target discovery

Abstract

Genome-wide association studies (GWAS) have identified thousands of loci associated with complex traits, but translating these association signals into actionable biology remains a challenge. This is largely because most GWAS signals are non-coding, often reflect multiple correlated variants in linkage disequilibrium (LD), and rarely identify the effector gene, tissue, or mechanism. Large-scale plasma proteomics can help bridge this gap by providing molecular mediators that are closer to pharmacological intervention and can be anchored to genetic variation through protein quantitative trait loci (pQTL). However, integrative analyses that connect pQTLs to disease genetics are often unstable at the locus level due to polygenicity, allelic heterogeneity, and imperfect LD information, leading to overconfident or inconsistent conclusions. In addition, most studies focus on protein monomers, yet proteins rarely act in isolation but instead interact with other proteins to form complexes and pathways that collectively execute cellular functions. This thesis develops likelihood-based methodologies that make local genetic architecture a practical unit for inference and evidence integration, and applies them to regional genetic analysis, protein-disease colocalization, and complex-level target discovery. In Study I, we introduce HDL-L, a high-dimensional, full-likelihood framework for estimating local SNP heritability and local genetic covariance from summary association statistics. Across extensive simulations, HDL-L provides well-calibrated regional inference, yielding more consistent local heritability estimates and more efficient genetic correlation estimates than LAVA over a wide range of genetic architectures. Applied to 30 UK Biobank phenotypes, HDL-L identified 109 significant local genetic correlations with a substantial computational advantage.
In Study II, we extend the likelihood framework to perform colocalization analysis (HDL-C) of plasma proteins with complex diseases. Leveraging large-scale proteomic and GWAS resources, we prioritize protein-disease pairs driven by shared genetic effects and evaluate robustness using a cross-sex reproducibility design. Integrating genetically supported signals with drug knowledge bases distinguishes validated targets, repurposing opportunities, and novel hypotheses: among the top 50 HDL-C-prioritized protein-disease pairs in male and female UK Biobank cohorts, we recover 40 previously validated drug-protein-disease relationships with approved indications and identify 62 additional relationships with repurposing potential, alongside 63 new protein-disease candidates for therapeutic follow-up. In Study III, we move beyond protein monomers to protein complexes by combining genetically informed protein-protein interactions with structural modeling to uncover targetable protein complexes. We identify 3,370 significant cis-trans protein pairs from 2,864 pQTLs, including 633 supported by STRING and 262 supported by experimental evidence, enriched for coherent functional modules such as apolipoprotein- and CD22-centered networks. Using AlphaFold3, we show that genetically supported interactions are enriched for high-confidence assemblies relative to matched controls. Druggability analyses based on machine learning models further reveal 25 complex-specific binding pockets across seven complexes that are absent in isolated monomers, highlighting opportunities for complex-selective pharmacological modulation, exemplified by SERPINA1-CTRL and HSBP1-WASHC3 within the WASH-Arp2/3 pathway. Overall, these studies position local shared-architecture inference as a robust layer of evidence for moving from genetic association signals to biological insight.
Rather than stopping at the observation that traits are genetically related, the framework developed in this thesis helps localize where that sharing occurs in the genome and links it to specific genes, proteins, and pathways that may underlie disease mechanisms. In this way, the thesis provides both statistical tools and an analytical workflow that integrates regional genetic inference, proteomic data, and structural protein information. This combination makes it possible to move from broad genomic findings to more concrete and testable hypotheses, thereby supporting biomarker discovery, clarifying disease biology, and prioritizing therapeutic targets.

List of scientific papers

The articles will be referred to in the text by their Roman numerals, and are reproduced in full at the end of the thesis.

I. Li, Y., Pawitan, Y., & Shen, X. (2025). An enhanced framework for local genetic correlation analysis. Nature Genetics, 57(4), 1053–1058. https://doi.org/10.1038/s41588-025-02123-3

II. Li, Y., Zhai, R., Yang, Z., Li, T., Pawitan, Y., & Shen, X. (2026). High-definition likelihood inference of genetic colocalization reveals protein biomarkers for human complex diseases. GigaScience, 15, giaf155. https://doi.org/10.1093/gigascience/giaf155

III. Li, Y., Wang, Q., Zhai, R., Vu, T. N., Pawitan, Y., & Shen, X. (2026). A statistical genetics framework for discovering druggable protein complexes in human plasma. [Manuscript]

This paper was published in KI Open Archive Karolinska Institutet.

Licence: CC BY-NC 4.0