thesis

Factor regression for dimensionality reduction and data integration techniques with applications to cancer data

Abstract

Two key challenges in modern statistical applications are the large amount of information recorded per individual and the fact that such data are often collected in batches rather than all at once, which can distort both the mean and the variance. We address both issues by introducing a novel sparse latent factor regression model for integrating heterogeneous data. The model supports data exploration via dimensionality reduction, corrects these so-called batch effects, and provides sparse low-rank covariance matrix estimates. We study several sparse priors, both local and non-local, for learning the dimension of the latent factors. The model is fitted deterministically by means of an EM algorithm for which we derive closed-form updates; this yields a novel scalable algorithm for non-local priors, which is of interest beyond the immediate scope of this thesis. We also present several examples, with a focus on bioinformatics applications. Our results mainly show an increase in the accuracy of low-dimensional data reconstructions, with non-local priors substantially improving inference on factor cardinality and on the non-zero factor loadings. Moreover, the batch effect correction considerably improves the recovery of the latent factors. Altogether, this thesis provides a novel approach to latent factor regression that balances sparsity with sensitivity while remaining highly computationally efficient, and it opens new avenues for future research on dimension-reduction-based data integration. The methodology developed in this thesis is available as an R package at https://github.com/AleAviP/BFR.BE.
