Supervised learning methods for association detection, biomarker discovery, and pattern recognition in compositional omics data

Abstract

Rapid advances and reduced costs in high-throughput sequencing (HTS) technologies have enabled widespread profiling of microbial metagenomes and microbiomes in humans to better understand associations between microbial communities and disease. Data generated using these technologies are vast, high-dimensional, and nuanced: instrument sequencing capacity is limited, and the resulting measurements are inherently relative rather than absolute. Unlike absolute measurements, these relative counts, referred to as compositional data, require special methods for analysis and interpretation. Unfortunately, compositional data methods are esoteric and generally not well adapted to HTS data. Because of this, HTS data are often analyzed with traditional statistical methods that do not properly account for the underlying compositional sample space. This practice can lead to reports of spurious associations, which may limit generalization across studies and reproducibility. In this thesis, building on existing literature in compositional data analysis and feature selection methodology, we develop a novel statistical association test and a powerful machine learning framework using robust pairwise logratios. Additionally, for each method, we developed a freely available R package on GitHub (SelEnergyPermR and DiCoVarML) with functions to perform the core analysis of that method. In the first chapter, we provide a basic overview of compositional data and its connection to HTS data. In the second chapter, we present the SelEnergyPerm method for detecting sparse associations in high-dimensional metagenomic data. In the third chapter, building on the concept of differential compositional variation proposed in SelEnergyPerm, we present the DiCoVarML framework for supervised classification and biomarker discovery.
In the final chapter, we apply the SelEnergyPerm method to test for an association between toxicant exposures and the composition of microbial communities in the nasal passage. Using a parsimonious logratio signature detected by SelEnergyPerm, we then perform an integrative analysis exploring the connection between nasal microbiome dysbiosis and immune mediator expression in nasal lavage fluid.

Doctor of Philosophy
