thesis

INTEGRATION OF MULTI-PLATFORM HIGH-DIMENSIONAL OMIC DATA

Abstract

The development of high-throughput biotechnologies has made data accessible from different platforms, such as RNA sequencing, copy number variation, DNA methylation, and protein lysate arrays. The high-dimensional omic data derived from these technological platforms have been used extensively to facilitate a comprehensive understanding of disease mechanisms and to determine personalized health treatments. Although vital to the progress of clinical research, high-dimensional multi-platform data impose new challenges for data analysis. Numerous methods have been proposed to integrate multi-platform omic data; however, few have efficiently and simultaneously addressed the problems that arise from high dimensionality and complex correlations. In my dissertation, I propose a statistical framework, the shared informative factor model (SIFORM), that can jointly analyze multi-platform omic data and explore their associations with a disease phenotype. The common disease-associated sample characteristics across different data types can be captured through the shared structure space, while the corresponding weights of genetic variables directly index the strengths of their association with the phenotype. I compare the performance of the proposed method with several popular regularized regression methods and canonical correlation analysis (CCA)-based methods through extensive simulation studies and two lung adenocarcinoma applications. The two lung adenocarcinoma applications jointly explore the associations of mRNA expression and protein expression with smoking status and survival using The Cancer Genome Atlas (TCGA) datasets. The simulation studies demonstrate the superior performance of SIFORM in terms of biomarker detection accuracy. In the lung cancer applications, SIFORM identifies many biomarkers that belong to key pathways for lung tumorigenesis.
It also discovers potential prognostic biomarkers for lung cancer patients' survival and some biomarkers that reveal different tumorigenesis mechanisms between light smokers and heavy smokers. To improve the prediction accuracy and interpretability of the proposed model, I extend it to PSIFORM by incorporating existing biological pathway information into the current statistical framework. I adopt a network-based regularization to ensure that neighboring genes in the same pathway tend to be selected (or eliminated) simultaneously. Through simulation studies and a TCGA kidney cancer application, I show that PSIFORM outperforms its competitors in both variable selection and prediction. The statistical framework of PSIFORM also has great potential for incorporating the hierarchical order across the multi-platform omic measurements.
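
To illustrate what a network-based regularizer can look like, the sketch below uses a graph Laplacian penalty, which is one common choice for encouraging neighboring genes in a pathway to receive similar weights. The abstract does not state the exact penalty used in PSIFORM, so the penalty form, the toy pathway graph, and the function names `graph_laplacian` and `network_penalty` are assumptions made for illustration only.

```python
import numpy as np


def graph_laplacian(adjacency):
    """Return the unnormalized graph Laplacian L = D - A for a symmetric adjacency matrix."""
    adjacency = np.asarray(adjacency, dtype=float)
    degree = np.diag(adjacency.sum(axis=1))
    return degree - adjacency


def network_penalty(weights, laplacian):
    """Return w^T L w, which equals the sum over edges (u, v) of (w_u - w_v)^2."""
    weights = np.asarray(weights, dtype=float)
    return float(weights @ laplacian @ weights)


# Hypothetical toy pathway: genes 0-1-2 form a chain, gene 3 is not in the pathway.
adjacency = np.array([
    [0, 1, 0, 0],
    [1, 0, 1, 0],
    [0, 1, 0, 0],
    [0, 0, 0, 0],
])
laplacian = graph_laplacian(adjacency)

# Neighboring genes share the same weight, so no edge contributes to the penalty.
smooth_weights = [1.0, 1.0, 1.0, 0.0]
# The middle gene is dropped while its neighbors are kept, so both edges are penalized.
rough_weights = [1.0, 0.0, 1.0, 0.0]

smooth_penalty = network_penalty(smooth_weights, laplacian)
rough_penalty = network_penalty(rough_weights, laplacian)
```

Under this assumed penalty, a weight vector that keeps or drops connected genes together is penalized less than one that selects a gene while eliminating its pathway neighbors, which is the qualitative behavior the abstract describes.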
