High-dimensional non-Gaussian data analysis based on sample relationship

Abstract

High-dimensional data are omnipresent. Although many statistical methods developed for analysing high-dimensional data adopt the normality assumption, the Gaussian distribution can be a poor approximation of real data in many applications. In this thesis, I investigate how to properly analyse such high-dimensional non-Gaussian data. Quantifying sample relationships, such as measuring inter-sample proximity and determining neighbours for samples, is an important step in numerous statistical approaches. This thesis therefore develops three methods, each based on the sample relationship, for analysing different types of high-dimensional non-Gaussian data: dimension reduction for single cell RNA-sequencing data with missingness using a proposed proximity measure, dimension reduction for data of small counts using a developed proximity measure, and modelling of skewed survival data using a proposed procedure for identifying neighbours for samples. In Chapter 3, I develop an unbiased estimator of the Gram matrix, which characterises the proximity between samples. The proposed estimator improves a broad spectrum of dimension reduction methods when applied to single cell RNA-sequencing data with missingness. In addition, the consequences of directly applying existing dimension reduction methods to data with missingness are clarified both empirically and theoretically. In Chapter 4, I develop a dissimilarity measure for count data with an excess of zeros, based on the Kullback-Leibler divergence and empirical Bayes estimators. The proposed measure is shown to have greater discriminative power than other popular measures, and it boosts the performance of standard dimension reduction methods on count data containing many zeros. In Chapter 5, I show that graphs derived from the features themselves can be beneficial for the analysis of high-dimensional survival data when used in graph convolutional networks. In addition, a sequential forward floating selection algorithm is proposed to simultaneously perform survival analysis and unveil the local neighbourhoods of samples with the aid of graph convolutional networks.
