Many of the mechanisms underpinning cancer risk and tumorigenesis are still not
fully understood. However, the next-generation sequencing revolution and the
rapid advances in big data analytics allow us to study cells
and complex phenotypes at unprecedented depth and breadth. While experimental
and clinical data are still essential for validating findings and confirming
hypotheses, computational biology is key to analysing system- and
population-level data, detecting hidden patterns, and generating
testable hypotheses.
In this work, I tackle two main questions regarding cancer risk and tumorigenesis
that require novel computational methods for the analysis of system-level omics
data. First, I focused on how common, low-penetrance inherited variants modulate
cancer risk in the broader population. Genome-Wide Association Studies (GWAS)
have shown that Single Nucleotide Polymorphisms (SNPs) contribute to cancer risk
through multiple subtle effects, but these studies still provide little insight
into how SNPs act synergistically. I developed a novel hierarchical Bayesian
regression model, BAGHERA, to estimate heritability at the gene level from GWAS
summary statistics. I then used BAGHERA to analyse data from 38 malignancies in
the UK Biobank. I showed that genes with high heritable risk are involved in key
processes associated with cancer and are often themselves somatically mutated
driver genes.
Heritability analysis, like many other omics analysis methods, studies the effects of DNA
variants on single genes in isolation. However, we know that most biological
processes require the interplay of multiple genes, and we often lack a broad
perspective on them. In the second part of this thesis, I therefore worked on the
integration of Protein-Protein Interaction (PPI) graphs and omics data, which
bridges this gap and recapitulates these interactions at a system level. First,
I developed a modular and scalable Python package, PyGNA, that enables
robust statistical testing of genesets' topological properties. PyGNA complements
the literature with a tool that can be routinely integrated into automated
bioinformatics pipelines. With PyGNA, I processed multiple genesets obtained from
genomics and transcriptomics data. However, topological properties alone
proved insufficient to fully characterise complex phenotypes.
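
As a minimal sketch of what statistical testing of a geneset's topological
properties can look like, the snippet below runs a simple permutation test on
an undirected interaction graph. It is an illustration under stated
assumptions, not PyGNA's API: the function names (internal_edges,
permutation_pvalue), the choice of statistic (number of edges inside the
geneset) and the toy network are all hypothetical.

```python
import random


def internal_edges(adj, genes):
    """Count undirected edges whose two endpoints both lie in `genes`."""
    genes = set(genes)
    # Each internal edge is seen once from each endpoint, hence the // 2.
    return sum(1 for u in genes for v in adj.get(u, ()) if v in genes) // 2


def permutation_pvalue(adj, geneset, n_perm=1000, seed=0):
    """Empirical p-value for the geneset being more connected than random
    node sets of the same size drawn from the same network."""
    rng = random.Random(seed)
    nodes = sorted(adj)
    k = len(set(geneset))
    observed = internal_edges(adj, geneset)
    hits = sum(
        internal_edges(adj, rng.sample(nodes, k)) >= observed
        for _ in range(n_perm)
    )
    # Add-one correction keeps the estimate strictly above zero.
    return observed, (hits + 1) / (n_perm + 1)


# Hypothetical toy network: a triangle (A, B, C) attached to a chain.
edges = [("A", "B"), ("B", "C"), ("A", "C"), ("C", "D"), ("D", "E"),
         ("E", "F"), ("F", "G"), ("G", "H"), ("H", "I"), ("I", "J")]
adj = {}
for u, v in edges:
    adj.setdefault(u, set()).add(v)
    adj.setdefault(v, set()).add(u)

observed, p_value = permutation_pvalue(adj, ["A", "B", "C"])
print(observed, p_value)
```

Sampling random node sets of equal size is the simplest possible null model;
a null that also preserves node degree would be a natural refinement, since
highly connected genes tend to be over-represented in genesets.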
Therefore, I focused on a model that combines topological and functional
data to detect multiple communities associated with a phenotype. Detecting
cancer-specific submodules is still an open problem, but it has the potential to
elucidate mechanisms detectable only by integrating multi-omics data. Building
on recent advances in Graph Neural Networks (GNNs), I present a supervised
geometric deep learning model that combines GNNs and Stochastic Block Models
(SBMs). The model learns multiple graph-aware representations of the attributed
network, expressed as multiple joint SBMs, thereby accounting for nodes
participating in multiple processes. The simultaneous estimation of structure
and function provides an interpretable picture of how genes interact in specific
conditions and allows the detection of novel putative pathways associated with
cancer.