Incorporation Of Quantification Uncertainty Into Bulk and Single-Cell RNA-seq Analysis

Abstract

In the first part of the dissertation, we propose a new method, CompDTU, for differential transcript usage (DTU) analysis. CompDTU applies an isometric log-ratio transform to the vector of transcript-level relative abundance proportions and assumes that the transformed data follow a multivariate normal distribution. This procedure avoids the computational speed and scalability issues present in many existing methods, making it well suited to DTU analysis with large sample sizes. We also extend CompDTU to incorporate quantification uncertainty using bootstrap replicates of abundance estimates and term this method CompDTUme. We show that CompDTU improves sensitivity and reduces false positives relative to existing methods, and that CompDTUme further improves performance over CompDTU while maintaining favorable speed and scalability.

In the second part of the dissertation, we examine properties of bootstrap replicates of gene-level quantification estimates for single-cell RNA-seq (scRNA-seq) data. Specifically, we investigate the coverage of various intervals constructed from the bootstrap replicates and demonstrate that storing only the mean and variance of the set of bootstrap replicates ("compression") is sufficient to capture gene-level quantification uncertainty. Pseudo-replicates can then be simulated from a negative binomial distribution as needed, resulting in significant decreases in the memory and storage space required to conduct uncertainty-aware analyses. We additionally extend the Swish method to use compression and show improvements in computation time and memory consumption without losses in performance.

In the third part of the dissertation, we propose a general framework for incorporating simulated pseudo-replicates into statistical analyses. The framework combines results across pseudo-replicates using either the mean test statistic or specific quantiles of the p-values across replicates. We apply the framework to trajectory-based differential expression analysis of scRNA-seq data and show reductions in false positives relative to using only the standard point estimates of expression. Lastly, we demonstrate that discarding multi-mapping reads can result in significant underestimation of counts for functionally important genes, using scRNA-seq data from developing mouse embryos.

Doctor of Philosophy
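The compression idea described in the abstract can be illustrated with a minimal sketch. It is not the dissertation's implementation: the function names `compress` and `simulate_pseudo_replicates` are hypothetical, and the negative binomial parameters are chosen by simple moment matching (variance = mean + mean^2 / size), which is an assumption made here for illustration. Genes whose stored variance does not exceed the mean fall back to a Poisson draw, which is also an assumption.

```python
import numpy as np


def compress(boot):
    """Reduce a genes x replicates matrix of bootstrap counts to per-gene
    mean and variance (the "compressed" representation)."""
    boot = np.asarray(boot, dtype=float)
    return boot.mean(axis=1), boot.var(axis=1, ddof=1)


def simulate_pseudo_replicates(mean, var, n_rep, rng):
    """Draw n_rep pseudo-replicates per gene from a negative binomial whose
    mean and variance match the stored values (moment matching; assumed)."""
    mean = np.asarray(mean, dtype=float)
    var = np.asarray(var, dtype=float)
    out = np.zeros((mean.size, n_rep), dtype=np.int64)

    # Negative binomial requires overdispersion (var > mean) and mean > 0.
    over = (var > mean) & (mean > 0)
    if over.any():
        size = mean[over] ** 2 / (var[over] - mean[over])
        prob = size / (size + mean[over])
        out[over] = rng.negative_binomial(
            size[:, None], prob[:, None], size=(int(over.sum()), n_rep)
        )

    # Fallback (assumption): Poisson when the variance does not exceed the mean.
    rest = ~over
    if rest.any():
        out[rest] = rng.poisson(
            mean[rest][:, None], size=(int(rest.sum()), n_rep)
        )
    return out
```

Under this sketch only two numbers per gene are stored, and pseudo-replicates are regenerated on demand, for example with `simulate_pseudo_replicates(mean, var, 20, np.random.default_rng(1))`.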
