Machine learning and large scale cancer omic data: decoding the biological mechanisms underpinning cancer

Abstract

Many of the mechanisms underpinning cancer risk and tumorigenesis are still not fully understood. However, the next-generation sequencing revolution and rapid advances in big data analytics allow us to study cells and complex phenotypes at unprecedented depth and breadth. While experimental and clinical data remain fundamental for validating findings and confirming hypotheses, computational biology is key to analysing system- and population-level data, detecting hidden patterns and generating testable hypotheses. In this work, I tackle two main questions regarding cancer risk and tumorigenesis that require novel computational methods for the analysis of system-level omic data. First, I focused on how frequent, low-penetrance inherited variants modulate cancer risk in the broader population. Genome-Wide Association Studies (GWAS) have shown that Single Nucleotide Polymorphisms (SNPs) contribute to cancer risk through multiple subtle effects, but these studies still fail to give further insight into the synergistic effects of SNPs. I developed a novel hierarchical Bayesian regression model, BAGHERA, to estimate heritability at the gene level from GWAS summary statistics. I then used BAGHERA to analyse data from 38 malignancies in the UK Biobank. I showed that genes with high heritable risk are involved in key processes associated with cancer and that this risk is often localised in genes that are somatically mutated drivers. Heritability analysis, like many other omics analysis methods, studies the effects of DNA variants on single genes in isolation. However, we know that most biological processes require the interplay of multiple genes, and we often lack a broad perspective on them. For the second part of this thesis, I therefore worked on the integration of Protein-Protein Interaction (PPI) graphs and omics data, which bridges this gap and recapitulates these interactions at a system level.
First, I developed a modular and scalable Python package, PyGNA, that enables robust statistical testing of genesets' topological properties. PyGNA complements the literature with a tool that can be routinely introduced into automated bioinformatics pipelines. With PyGNA, I processed multiple genesets obtained from genomics and transcriptomics data. However, topological properties alone have proven insufficient to fully characterise complex phenotypes. Therefore, I focused on a model that combines topological and functional data to detect multiple communities associated with a phenotype. Detecting cancer-specific submodules is still an open problem, but it has the potential to elucidate mechanisms detectable only by integrating multi-omics data. Building on recent advances in Graph Neural Networks (GNNs), I present a supervised geometric deep learning model that combines GNNs and Stochastic Block Models (SBMs). The model learns multiple graph-aware representations of the attributed network, as multiple joint SBMs, accounting for nodes that participate in multiple processes. The simultaneous estimation of structure and function provides an interpretable picture of how genes interact in specific conditions and allows the detection of novel putative pathways associated with cancer.
