'Columbia University Libraries/Information Services'
Abstract
The theme of this dissertation is the development of robust statistical approaches for high-dimensional observational data. Advances in technology have made data sets more accessible than at any other time in history. Abundant data leads to numerous appealing findings and, at the same time, demands more thoughtful analysis. We encounter many obstacles when dealing with high-dimensional data. Heterogeneity and complex interaction structures rule out traditional mean regression and call for novel approaches that circumvent the complexity and still yield meaningful conclusions. Missing data mechanisms in high-dimensional data are complicated and hard to handle with existing methods. This dissertation contains three parts that tackle these obstacles: (1) a tree-based method integrated with domain knowledge to improve prediction accuracy; (2) a tree-based method with linear splits to accommodate large-scale and highly correlated data; (3) an integrative analysis method that reduces dimension and imputes block-wise missing data simultaneously.
In the first part of the dissertation, we propose a tree-based method called conditional quantile random forest (CQRF) to improve screening and intervention for the onset of mental disorders by incorporating rich and comprehensive electronic medical records (EMR). Our research is motivated by the REactions to Acute Care and Hospitalization (REACH) study, an ongoing prospective observational cohort study of patients with symptoms of a suspected acute coronary syndrome (ACS). We aim to develop a robust and effective statistical prediction method. The proposed approach takes population heterogeneity fully into account: we partition the sample space guided by quantile regression over the entire quantile process. As a result, the proposed CQRF can provide more comprehensive and accurate predictions. We also provide theoretical justification for the estimated quantile process.
In the second part of the dissertation, we apply the proposed CQRF to the REACH data set. The predictive analysis shows that, for both the entire sample and the high-risk group, the proposed CQRF provides more accurate predictions than other existing and widely used methods. The variable importance scores derived from the proposed CQRF are also promising: they identify two variables that the qualitative study has shown to be critical features. We also apply the proposed CQRF to the Sequenced Treatment Alternatives to Relieve Depression (STAR*D) study data set and show that it improves personalized treatment recommendation compared with existing treatment recommendation methods. Finally, we conduct two simulation studies based on the two real data sets. Both simulation studies support the consistency of the estimated quantile process.
In the second part, we also extend the proposed CQRF from univariate splits to linear splits in order to accommodate a large number of highly correlated variables. Gene-environment interaction is a widely studied topic, since the traits of complex diseases are difficult to understand and there is strong interest in interventions tailored to individual genetic variations. The proposed approach is applied to a Breast Cancer Family Registry (BCFR) study data set, with body mass index (BMI) as the response variable and several nutrition intake factors and genotype variables as predictors. We aim to determine which genetic variations modify the heterogeneous effect of the environmental factors on BMI. To determine the optimal linear combination split, we devise a criterion that measures the relationship between the response variable and the gene variants, conditional on the environmental factor. The variable importance score is calculated by summing this criterion across all splits in the random forest. The results show that the top-ranked genes prioritized by the proposed importance scores modify the effect of the environmental factors on BMI.
In the third part, we introduce an integrative analysis approach called generalized integrative principal component analysis (GIPCA). Heterogeneous data types and the presence of block-wise missing data pose significant challenges to the integration of multi-source data and to subsequent statistical analyses. No existing method in the literature readily accommodates data of multiple types with a block-wise missing structure. The proposed GIPCA is a low-rank method that performs dimension reduction and imputation of block-wise missing data simultaneously for data of multiple types. Both the simulation study and the real data analysis show that the proposed approach achieves good imputation accuracy and identifies some meaningful signals.