Improving Statistical Rigor in Single-cell and Spatial Omics

Song, Dongyuan

Authors: Dongyuan Song
Publication date: 1 January 2024
Publisher: eScholarship, University of California

Abstract

The recent technological revolution in single-cell and spatial omics has provided unprecedented multi-modal views of individual cells, transforming our understanding of cell biology in health and disease. Numerous computational methods have been developed to analyze data generated from these technologies. However, the statistical rigor of existing computational methods is often questionable: many computational methods are complicated "black-box" algorithms (e.g., deep-learning-based methods). Therefore, it remains challenging to obtain correct statistical interpretation (e.g., well-calibrated p-values) and to avoid misinterpretation of observed data (e.g., exaggerated false discoveries). Given the large number of existing methods, the field critically needs precise, statistically robust, and interpretable methods. During my Ph.D., I have focused on combining statistics and computational biology to provide accurate statistical interpretation of computational analyses in single-cell and spatial omics. This dissertation aims to address this statistical rigor issue through three main themes.

My first theme concentrates on probabilistic generative models for high-dimensional single-cell and spatial multi-omics data. The realistic simulation of single-cell and spatial multi-omics data plays a critical role in both evaluating the performance of computational tools and facilitating the exploration of experimental designs. However, the complex topology of cells and the high-dimensional features pose significant challenges to this endeavor. To overcome this challenge, I developed scDesign3, the first unified framework for realistic in silico data generation of both single-cell and spatial omics.

My second theme focuses on differential expression (DE) tests and false discovery rate (FDR) control based on inferred covariates.
Identifying differentially expressed (DE) genes between cell states is a crucial task in investigating the underlying molecular mechanisms in cells. However, in single-cell RNA sequencing (scRNA-seq) analysis, the latent cell states are usually inferred from the data (e.g., inferred cell types by clustering or continuous trajectories by pseudotime inference). Therefore, conventional statistical tests can behave incorrectly if we ignore the fact that the covariates are inferred rather than observed. Hence, I developed PseudotimeDE, a robust DE method that accounts for pseudotime inference uncertainty and yields well-calibrated p-values. Separately, post-clustering DE is a related issue that drew our attention. This two-step procedure uses the same data twice: once to define cell clusters as potential cell types, and then again to identify DE genes as potential cell-type marker genes. This practice, often known as "double dipping," can lead to the erroneous identification of false-positive cell-type marker genes, particularly when the cell clusters themselves are not well-defined. To overcome this challenge, I proposed ClusterDE, a post-clustering DE method that uses "synthetic null data" to control the FDR of identified DE genes regardless of the clustering quality.

My third theme aims at feature selection and subsampling in large-scale scRNA-seq data. The large number of genes (~20,000) and the increasing number of measured cells (> 1 million) in scRNA-seq datasets remain a challenge for data analysis. A practical solution to this computational bottleneck involves the strategic selection of a subset of cells or genes. We developed scSampler, a fast diversity-preserving cell subsampling method inspired by space-filling design in the field of experimental design. scSampler selects a small subset of cells to accurately represent the primary variability in the entire dataset.
In addition, we developed scPNMF, an unsupervised gene selection method based on matrix factorization. This method selects a much smaller subset of genes (~100) while still achieving robust discrimination in cell-type identification. We also developed scGTM, a flexible and interpretable model that captures the trend of gene expression along the pseudotime of cells to select genes with specific expression patterns.
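
The "double dipping" problem described in the second theme can be illustrated with a small simulation. The sketch below is not ClusterDE and is not taken from the dissertation: it is a minimal illustration, assuming a homogeneous population of cells with independent standard normal expression values (so no gene is truly DE), a plain two-cluster k-means, and a Welch-type test with a normal approximation to the p-value. All function names are hypothetical.

```python
import random
from statistics import NormalDist, mean, variance


def two_means(cells, rng, n_iter=25):
    """Plain Lloyd's algorithm with k=2; returns a 0/1 label per cell."""
    centers = rng.sample(cells, 2)  # start from two randomly chosen cells
    labels = [0] * len(cells)
    for _ in range(n_iter):
        # Assign every cell to its nearest center (squared Euclidean distance).
        labels = [
            min((0, 1), key=lambda k: sum((x - c) ** 2 for x, c in zip(cell, centers[k])))
            for cell in cells
        ]
        # Move each center to the mean of its assigned cells.
        for k in (0, 1):
            members = [cell for cell, lab in zip(cells, labels) if lab == k]
            if members:  # keep the old center if a cluster empties out
                centers[k] = [mean(col) for col in zip(*members)]
    return labels


def welch_p_value(a, b):
    """Two-sided Welch-type test; normal approximation (assumes large groups)."""
    se = (variance(a) / len(a) + variance(b) / len(b)) ** 0.5
    z = (mean(a) - mean(b)) / se
    return 2 * (1 - NormalDist().cdf(abs(z)))


def rejection_rate(cells, labels, alpha=0.05):
    """Fraction of genes called DE between the two label groups."""
    n_genes = len(cells[0])
    hits = 0
    for j in range(n_genes):
        a = [cell[j] for cell, lab in zip(cells, labels) if lab == 0]
        b = [cell[j] for cell, lab in zip(cells, labels) if lab == 1]
        hits += welch_p_value(a, b) < alpha
    return hits / n_genes


def simulate(seed=0, n_cells=400, n_genes=20):
    """Null data: one homogeneous population, so every rejection is a false positive."""
    rng = random.Random(seed)
    cells = [[rng.gauss(0, 1) for _ in range(n_genes)] for _ in range(n_cells)]
    cluster_labels = two_means(cells, rng)                      # labels inferred from the data
    random_labels = [rng.randrange(2) for _ in range(n_cells)]  # labels independent of the data
    return rejection_rate(cells, cluster_labels), rejection_rate(cells, random_labels)


if __name__ == "__main__":
    double_dip_rate, random_split_rate = simulate()
    print(f"false-positive rate, clusters from the same data: {double_dip_rate:.2f}")
    print(f"false-positive rate, random split:                {random_split_rate:.2f}")
```

Because the clusters are chosen to separate the cells as much as possible, the genes that drive the split tend to look DE even though the population is homogeneous, whereas a data-independent split should reject at roughly the nominal 5% level.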

eScholarship - University of California

oai:escholarship.org:ark:/1303...

Last time updated on 16/09/2024