thesis

Multi-study factor models for high-dimensional biological data

Abstract

High-throughput assays are transforming the study of biology and are generating a rich, complex, and diverse collection of high-dimensional data sets. Building systematic knowledge from these data is a cumulative process that requires analyses integrating multiple sources, studies, and technologies. The growing availability of ensembles of studies on related clinical populations, assay technologies, and genomic features calls for distinguishing two kinds of statistical components: 1) common factors shared across multiple studies; 2) study-specific factors. To capture these two quantities, in this thesis we propose a novel class of factor analysis models under both frequentist and Bayesian approaches. In the frequentist approach, an expectation conditional maximization (ECM) algorithm is provided to obtain the maximum likelihood estimates. We also propose a Bayesian approach that extends the method to settings with more variables than subjects. To model dependencies among many variables, we assume a sparse structure underlying the associations among genes. Both methods allow joint analysis of multiple high-throughput studies. The results are helpful for combining multiple studies, identifying reproducible biology across studies as well as interesting study-specific components, and removing idiosyncratic variation that lacks cross-study reproducibility.
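
As an illustration only, the split into common and study-specific factors can be written as a factor model along the following lines. This is a minimal sketch, not the thesis's exact specification: the abstract gives no notation, so every symbol here is an assumption (S studies, n_s subjects in study s, P centered variables, K common factors, J_s study-specific factors, diagonal noise covariance, and mutually independent factors and errors).

```latex
% Assumed notation (not from the abstract):
%   x_{is}    : P-vector of centered measurements, subject i in study s
%   \Phi      : P x K loading matrix shared by all studies
%   \Lambda_s : P x J_s loading matrix specific to study s
%   \Psi_s    : diagonal P x P noise covariance for study s
\begin{align}
  x_{is} &= \Phi f_{is} + \Lambda_s l_{is} + e_{is},
    \qquad i = 1,\dots,n_s,\quad s = 1,\dots,S, \\
  f_{is} &\sim N_K(0, I_K), \qquad
  l_{is} \sim N_{J_s}(0, I_{J_s}), \qquad
  e_{is} \sim N_P(0, \Psi_s), \\
  \Sigma_s &= \operatorname{Cov}(x_{is})
    = \Phi\Phi^{\top} + \Lambda_s\Lambda_s^{\top} + \Psi_s .
\end{align}
```

Under these assumptions, the term \Phi\Phi^{\top} is the part of the covariance that is reproducible across studies, while \Lambda_s\Lambda_s^{\top} collects the structured variation found only in study s.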
