Methods for High Dimensional Inferences With Applications in Genomics

Abstract

In this dissertation, I develop several high-dimensional inference and computational methods motivated by problems in genomic studies. The dissertation consists of two parts. The first part is motivated by the analysis of data from genome-wide association studies (GWAS), for which I develop an optimal false discovery rate (FDR) controlling method for high-dimensional dependent data. For short-range dependent data, I show that the marginal plug-in procedure is optimal in the sense that it controls the FDR while minimizing the false non-discovery rate (FNR). When applied to the neuroblastoma GWAS data, this procedure identified six more disease-associated variants than previous p-value based procedures such as the Benjamini and Hochberg procedure. I further investigate the statistical problem of sparse signal recovery in the GWAS setting and develop a rigorous procedure for sample size and power analysis in the framework of FDR and FNR. In addition, I characterize the almost complete discovery boundary in terms of signal strength and non-null proportion, and develop a procedure that achieves almost complete recovery of the signals. The second part is motivated by the construction of gene regulatory networks from genetical genomics (eQTL) data. I develop a sparse high-dimensional multivariate regression model for studying the conditional independence relationships among a set of genes after adjusting for possible genetic effects, as well as the genetic architecture that influences gene expression. I propose a covariate-adjusted precision matrix estimation method (CAPME), which can be easily implemented by linear programming. Asymptotic convergence rates and sign consistency are established for the estimators of the regression coefficients and the precision matrix. The numerical performance of the estimator is investigated using both simulated and real data sets. Simulation results show that CAPME yields substantial improvements in both estimation and graph structure selection. I apply CAPME to a yeast eQTL data set in order to identify the gene regulatory network among a set of genes in the MAPK signaling pathway. Finally, I have developed the R software package CAPME based on my dissertation work.
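For orientation, the p-value based baseline named above, the Benjamini and Hochberg procedure, can be sketched as follows. This is a minimal illustration of the standard step-up rule only, not the marginal plug-in procedure developed in the dissertation; the function name `benjamini_hochberg` and the default level `alpha=0.05` are assumptions made for this example.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Return a list of booleans marking hypotheses rejected by the BH step-up rule.

    Hypothetical helper for illustration; alpha is the nominal FDR level.
    """
    m = len(p_values)
    # Indices of the hypotheses ordered from smallest to largest p-value.
    order = sorted(range(m), key=lambda i: p_values[i])

    # Find the largest rank k such that p_(k) <= alpha * k / m.
    k = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= alpha * rank / m:
            k = rank

    # Reject the k hypotheses with the smallest p-values.
    rejected = [False] * m
    for idx in order[:k]:
        rejected[idx] = True
    return rejected


# Example call with made-up p-values (not data from the dissertation).
flags = benjamini_hochberg([0.02, 0.9, 0.03, 0.02], alpha=0.05)
```

Because the rule is step-up, a hypothesis whose p-value exceeds its own threshold can still be rejected when a larger-ranked p-value falls below its threshold.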
