Graphical models for high dimensional genomic data

Abstract

Graphical models describe the relations among a set of random variables: vertices of the graph represent variables, and edges capture relations among them. We have developed three statistical methods for constructing graphical models from high-dimensional genomic data. First, we estimate a high-dimensional partial correlation matrix using a ridge penalty followed by hypothesis testing. Because the null distribution of test statistics derived from penalized partial correlation estimates has not been established, we estimate it from the empirical distribution of the test statistics across all penalized partial correlation estimates. We systematically evaluate the performance of this method in simulation and application studies. Second, we consider estimating Directed Acyclic Graph (DAG) models for multivariate Gaussian random variables. The skeleton of a DAG is the undirected graph obtained by removing the directions of all its edges. Given observational data, not all edge directions of a DAG are identifiable; the skeleton, however, is identifiable. We propose a novel method, PenPC, that estimates the skeleton of a high-dimensional DAG in two steps: we first estimate an undirected graph by selecting the non-zero entries of the partial correlation matrix, and then remove false connections from this undirected graph to obtain the skeleton. We systematically study the asymptotic properties of PenPC in high-dimensional problems. Both simulations and real data analysis suggest that our method has substantially higher sensitivity and specificity in estimating the network skeleton than existing methods. Third, to orient the edges in the skeleton of a DAG, we exploit interventional data on an additional set of variables. These variables are direct causes of some vertices in the DAG and make it possible to estimate the directions of edges in the skeleton. More specifically, given the skeleton of a DAG, we calculate the posterior probabilities of edge directions using the additional set of variables. We evaluate our method in simulations and in an application where the variables modeled by the DAG are gene expression and the additional variables are DNA polymorphisms.

Doctor of Philosophy
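
The first method in the abstract (ridge-penalized partial correlations tested against an empirically estimated null) can be illustrated with a rough sketch. This is not the thesis's estimator: the function names, the penalty `lam`, the cutoff `z_cut`, and the use of a median/MAD fit to approximate the empirical null are all assumptions made for illustration, and the sketch relies on most variable pairs having no edge.

```python
import numpy as np

def ridge_partial_correlations(X, lam):
    """Partial correlations from a ridge-regularized inverse correlation matrix.

    X is an (n samples) x (p variables) array; lam > 0 is a hypothetical
    ridge penalty chosen by the caller.
    """
    R = np.corrcoef(X, rowvar=False)              # p x p sample correlation
    p = R.shape[0]
    omega = np.linalg.inv(R + lam * np.eye(p))    # ridge-penalized precision estimate
    d = np.sqrt(np.diag(omega))
    pcor = -omega / np.outer(d, d)                # rho_ij = -omega_ij / sqrt(omega_ii * omega_jj)
    np.fill_diagonal(pcor, 1.0)
    return pcor

def empirical_null_edges(pcor, z_cut=3.0):
    """Select edges by comparing each statistic to a null fitted to all statistics.

    Assumption: the null is approximated as a normal distribution whose center
    and scale are the median and the scaled MAD of all Fisher-transformed
    partial correlations. This stands in for the empirical-null estimate
    described in the abstract.
    """
    p = pcor.shape[0]
    iu = np.triu_indices(p, k=1)                  # each unordered pair once
    r = np.clip(pcor[iu], -0.999999, 0.999999)    # keep arctanh finite
    stats = np.arctanh(r)                         # Fisher z-transform
    center = np.median(stats)
    scale = 1.4826 * np.median(np.abs(stats - center))
    z = (stats - center) / scale
    keep = np.abs(z) > z_cut
    adj = np.zeros((p, p), dtype=bool)
    adj[iu[0][keep], iu[1][keep]] = True
    return adj | adj.T                            # symmetric adjacency matrix
```

Fitting the null to the bulk of the statistics, rather than assuming a theoretical null, matters here because the ridge penalty shrinks the estimates, so their null spread is not the textbook one.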
