Studying Disease Dynamics with Manifold Learning

Abstract

In an effort to computationally predict causal genetic networks and derive biological pathways empirically, biomedical scientists have begun to design increasingly complicated single-cell experiments. A single study can now sequence millions of cells from hundreds of patient samples across different chronological timepoints or stages of disease progression, using multiple measurement modalities. None of these additional complexities, however, are adequately addressed by current state-of-the-art computational tools. Current machine learning techniques ignore these valuable sources of information, either by downsampling the cells to a computationally tractable number or by discarding experimental information, such as stage of disease, modality or timepoint, in order to perform associative analyses. In this thesis, I describe five manifold learning algorithms that address each of these shortcomings in an attempt to move single-cell machine learning research towards identifying the causal mechanisms that underlie disease pathogenesis. I first describe a general framework called diffusion condensation, which uses a cascade of diffusion filters to learn a hierarchy from a high-dimensional dataset. Next, I describe Cellular Analysis of Topology and Condensation Homology (CATCH), an extension of diffusion condensation that applies a cascade of manifold-intrinsic diffusion filters to single cells to learn cellular clusters across granularities, identify pathogenic populations and perform rapid differential gene expression analysis. With this approach, I identified an IL1B signaling axis between microglia and astrocytes which we show drives disease progression in age-related macular degeneration. I further extend the diffusion condensation framework to visualize cellular hierarchy by integrating diffusion condensation with potential distance theory in Multiscale PHATE.
By analyzing 54 million cells from 163 patients infected with SARS-CoV-2, Multiscale PHATE identified cell types and cellular subsets directly predictive of patient mortality. To combine information across modalities, I present integrated diffusion, a novel framework for integrating multimodal single-cell data and performing downstream analysis tasks, such as visualization and data denoising, with the goal of identifying epigenetic-genetic interactions and networks. Finally, I present TrajectoryNet, a novel trajectory inference tool that creates continuous trajectories from time-lapsed single-cell measurements. I leverage this approach to identify the transcriptional program responsible for driving metastasis in an in vitro model of mesenchymal-to-epithelial transition. Using TrajectoryNet, I identified ESRRA as a genetic switch that promotes differentiation to an epithelial cell state and metastasis. Together, these approaches integrate experimental information with complex single-cell datasets to infer the biological mechanisms driving disease pathogenesis, helping move the computational biology field away from associative research and towards causality.
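
As a rough illustration of the diffusion condensation idea summarized above (a cascade of diffusion filters that progressively collapses points into a hierarchy), the following Python sketch may be helpful. It is a toy under stated assumptions, not the implementation used in this thesis: the Gaussian kernel, the bandwidth `epsilon`, the merge threshold `merge_tol`, the 5% bandwidth growth rule and the function name `diffusion_condensation` are all hypothetical choices made for illustration.

```python
import numpy as np


def diffusion_condensation(points, epsilon=0.5, merge_tol=1e-3, max_iters=200):
    """Toy diffusion condensation (illustrative assumptions throughout).

    Returns a list of label arrays, one per level of the hierarchy.
    Each array assigns every original point to a cluster id.
    """
    x = np.asarray(points, dtype=float).copy()
    labels = np.arange(len(x))      # each point starts as its own cluster
    hierarchy = [labels.copy()]

    for _ in range(max_iters):
        if len(x) == 1:
            break

        # Gaussian affinities between the current cluster representatives.
        sq_dist = ((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=-1)
        kernel = np.exp(-sq_dist / epsilon**2)

        # Row-normalize to obtain a diffusion operator, then apply it once.
        # This is the "diffusion filter": each point moves towards the
        # kernel-weighted average of its neighbours.
        diffusion_op = kernel / kernel.sum(axis=1, keepdims=True)
        x = diffusion_op @ x

        # Greedily merge representatives that have come within merge_tol.
        keep = []                           # surviving representatives
        remap = np.empty(len(x), dtype=int)  # old index -> new index
        for i in range(len(x)):
            for new_id, j in enumerate(keep):
                if np.linalg.norm(x[i] - x[j]) < merge_tol:
                    remap[i] = new_id
                    break
            else:
                remap[i] = len(keep)
                keep.append(i)

        if len(keep) < len(x):
            # At least one merge happened: record a new hierarchy level.
            x = x[keep]
            labels = remap[labels]
            hierarchy.append(labels.copy())
        else:
            # Nothing merged: widen the kernel so that more distant
            # groups start to interact (assumed growth rule).
            epsilon *= 1.05

    return hierarchy


# Two small, well-separated groups of points in the plane.
demo_points = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1],
               [10.0, 10.0], [10.1, 10.0], [10.0, 10.1]]
for level in diffusion_condensation(demo_points):
    print(len(set(level)), "clusters:", level)
```

Reading the returned list from first to last entry moves from the finest granularity (every point separate) to the coarsest (a single cluster), which is the sense in which the procedure yields cellular clusters across granularities. Merged representatives are unweighted in this sketch, a simplification that a practical implementation would likely revisit.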
