Sure Screening for Gaussian Graphical Models
We propose graphical sure screening, or GRASS, a very simple and
computationally efficient screening procedure for recovering the structure of a
Gaussian graphical model in the high-dimensional setting. The GRASS estimate of
the conditional dependence graph is obtained by thresholding the elements of
the sample covariance matrix. The proposed approach possesses the sure
screening property: with very high probability, the GRASS estimated edge set
contains the true edge set. Furthermore, with high probability, the size of the
estimated edge set is controlled. We provide a choice of threshold for GRASS
that can control the expected false positive rate. We illustrate the
performance of GRASS in a simulation study and on a gene expression data set,
and show that in practice it performs competitively with more complex and
computationally demanding techniques for graph estimation.
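Since the GRASS estimator is just elementwise thresholding, it is easy to sketch. Below is a minimal Python illustration; it thresholds the sample correlation rather than covariance matrix so that a single cutoff is scale-free, and the toy data and cutoff value are illustrative, not from the paper:

```python
import numpy as np

def grass(X, threshold):
    """Estimate a conditional dependence graph by thresholding the
    elements of the sample correlation matrix, in the spirit of GRASS.
    Returns a boolean adjacency matrix (no self-loops)."""
    R = np.corrcoef(X, rowvar=False)   # p x p sample correlations
    A = np.abs(R) > threshold          # keep edges with strong correlation
    np.fill_diagonal(A, False)         # drop self-loops
    return A

# Toy example: X2 is a noisy copy of X0, X1 is independent of both.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 3))
X[:, 2] = X[:, 0] + 0.3 * rng.standard_normal(500)
A = grass(X, threshold=0.5)
```

The cutoff here is fixed by hand; the paper instead derives a threshold that controls the expected false positive rate.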
A Double Regression Method for Graphical Modeling of High-dimensional Nonlinear and Non-Gaussian Data
Graphical models have long been studied in statistics as a tool for inferring
conditional independence relationships among a large set of random variables.
Most existing work on graphical modeling focuses on cases where the data
are Gaussian or mixed and the variables are linearly dependent. In this paper,
we propose a double regression method for learning graphical models under the
high-dimensional nonlinear and non-Gaussian setting, and prove that the
proposed method is consistent under mild conditions. The proposed method works
by performing a series of nonparametric conditional independence tests. The
conditioning set of each test is reduced via a double regression procedure
where a model-free sure independence screening procedure or a sparse deep
neural network can be employed. The numerical results indicate that the
proposed method works well for high-dimensional nonlinear and non-Gaussian
data.
Comment: 1 figure
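The screen-then-test pipeline the abstract describes can be sketched as follows. This is only a linear stand-in: marginal correlation screening plays the role of the model-free sure independence screening step, and a partial-correlation test replaces the nonparametric conditional independence test; all function names and tuning constants are illustrative:

```python
import numpy as np

def screen_neighbors(X, j, k):
    """Marginal correlation screening: a simple stand-in for the
    model-free sure independence screening (or sparse DNN) step."""
    r = np.abs(np.corrcoef(X, rowvar=False)[j])
    r[j] = -1.0                       # exclude the variable itself
    return set(np.argsort(r)[-k:])    # indices of the k largest correlations

def partial_corr(X, i, j, S):
    """Partial correlation of X_i and X_j given X_S, via residuals from
    least-squares regressions (a linear stand-in for a nonparametric test)."""
    S = sorted(S)
    if not S:
        return np.corrcoef(X[:, i], X[:, j])[0, 1]
    Z = np.column_stack([np.ones(len(X)), X[:, S]])
    ri = X[:, i] - Z @ np.linalg.lstsq(Z, X[:, i], rcond=None)[0]
    rj = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    return np.corrcoef(ri, rj)[0, 1]

def double_regression_graph(X, k=2, cutoff=0.1):
    """Test each pair given a conditioning set reduced by screening."""
    p = X.shape[1]
    A = np.zeros((p, p), dtype=bool)
    for i in range(p):
        for j in range(i + 1, p):
            # Reduced conditioning set: union of each node's screened
            # neighbours, excluding i and j themselves.
            S = (screen_neighbors(X, i, k) | screen_neighbors(X, j, k)) - {i, j}
            if abs(partial_corr(X, i, j, S)) > cutoff:
                A[i, j] = A[j, i] = True
    return A

# Toy chain X0 -> X1 -> X2, plus an isolated noise variable X3.
rng = np.random.default_rng(0)
n = 2000
X = np.empty((n, 4))
X[:, 0] = rng.standard_normal(n)
X[:, 1] = X[:, 0] + 0.5 * rng.standard_normal(n)
X[:, 2] = X[:, 1] + 0.5 * rng.standard_normal(n)
X[:, 3] = rng.standard_normal(n)
A = double_regression_graph(X, k=2, cutoff=0.1)
```

On the chain, the endpoints X0 and X2 are correlated marginally but conditionally independent given X1, so the reduced conditioning set is what removes the spurious edge.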
RANK: Large-Scale Inference with Graphical Nonlinear Knockoffs
Power and reproducibility are key to enabling refined scientific discoveries
in contemporary big data applications with general high-dimensional nonlinear
models. In this paper, we provide theoretical foundations on the power and
robustness of the model-free knockoffs procedure introduced recently in
Candès, Fan, Janson and Lv (2016) in the high-dimensional setting when the
covariate distribution is characterized by a Gaussian graphical model. We
establish that under mild regularity conditions, the power of the oracle
knockoffs procedure with known covariate distribution in high-dimensional
linear models is asymptotically one as sample size goes to infinity. When
moving away from the ideal case, we suggest a modified model-free knockoffs
method, called graphical nonlinear knockoffs (RANK), to accommodate the unknown
covariate distribution. We provide theoretical justifications on the robustness
of our modified procedure by showing that the false discovery rate (FDR) is
asymptotically controlled at the target level and the power is asymptotically
one with the estimated covariate distribution. To the best of our knowledge,
this is the first formal theoretical result on the power for the knockoffs
procedure. Simulation results demonstrate that compared to existing approaches,
our method performs competitively in both FDR control and power. A real data
set is analyzed to further assess the performance of the suggested knockoffs
procedure.
Comment: 37 pages, 6 tables, 9 pages supplementary material
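For concreteness, here is the standard model-X Gaussian knockoff construction with a known covariance matrix, i.e. the oracle setting the abstract refers to (RANK's contribution is handling an estimated covariate distribution, which this sketch does not do). The equicorrelated choice of s and all variable names are illustrative:

```python
import numpy as np

def gaussian_knockoffs(X, Sigma, rng):
    """Equicorrelated model-X knockoffs for rows X_i ~ N(0, Sigma) with
    KNOWN covariance.  Conditional on X, the knockoff copy satisfies
        Xk | X ~ N(X - X Sigma^{-1} D,  2D - D Sigma^{-1} D),  D = s I,
    which makes (X_j, Xk_j) exchangeable while preserving all
    cross-covariances."""
    p = Sigma.shape[0]
    s = min(1.0, 2.0 * np.linalg.eigvalsh(Sigma).min())  # equicorrelated s
    Sinv = np.linalg.inv(Sigma)
    mu = X - s * (X @ Sinv)                    # conditional mean
    V = 2.0 * s * np.eye(p) - (s ** 2) * Sinv  # conditional covariance
    L = np.linalg.cholesky(V)
    return mu + rng.standard_normal(X.shape) @ L.T

# Equicorrelated Gaussian design with known Sigma (unit diagonal).
rng = np.random.default_rng(0)
p, n, rho = 5, 5000, 0.3
Sigma = (1 - rho) * np.eye(p) + rho * np.ones((p, p))
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)
Xk = gaussian_knockoffs(X, Sigma, rng)
```

The knockoff copy matches the original covariance structure exactly except on the diagonal of the cross-covariance, where Cov(X_j, Xk_j) = Sigma_jj - s; that gap is what gives the procedure power to distinguish true signals from their knockoffs.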
Variable selection in high-dimensional linear models: partially faithful distributions and the PC-simple algorithm
We consider variable selection in high-dimensional linear models where the
number of covariates greatly exceeds the sample size. We introduce the new
concept of partial faithfulness and use it to infer associations between the
covariates and the response. Under partial faithfulness, we develop a
simplified version of the PC algorithm (Spirtes et al., 2000), the PC-simple
algorithm, which is computationally feasible even with thousands of covariates
and provides consistent variable selection under conditions on the random
design matrix that are of a different nature than coherence conditions for
penalty-based approaches like the Lasso. Simulations and application to real
data show that our method is competitive compared to penalty-based approaches.
We provide an efficient implementation of the algorithm in the R package pcalg.
Comment: 20 pages, 3 figures
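The idea of the algorithm can be sketched in a few lines: start from the covariates marginally correlated with the response, then repeatedly discard any covariate whose partial correlation with the response vanishes given some small subset of the remaining active set. This bare-bones Python sketch uses a fixed correlation cutoff where the actual algorithm uses Fisher's z-test at a significance level alpha; names and constants are illustrative:

```python
import numpy as np
from itertools import combinations

def partial_corr(y, X, j, S):
    """Sample partial correlation of y and X[:, j] given X[:, S],
    via residuals from least-squares regressions."""
    S = sorted(S)
    Z = np.column_stack([np.ones(len(y))] + ([X[:, S]] if S else []))
    ry = y - Z @ np.linalg.lstsq(Z, y, rcond=None)[0]
    rj = X[:, j] - Z @ np.linalg.lstsq(Z, X[:, j], rcond=None)[0]
    return np.corrcoef(ry, rj)[0, 1]

def pc_simple(y, X, cutoff=0.1, max_order=2):
    """PC-simple sketch: keep covariates whose partial correlation with y
    survives conditioning on every size-<=max_order subset of the other
    currently active covariates."""
    # Order 0: marginal correlation screening.
    active = {j for j in range(X.shape[1])
              if abs(np.corrcoef(y, X[:, j])[0, 1]) > cutoff}
    # Higher orders: grow the conditioning-set size step by step.
    for order in range(1, max_order + 1):
        for j in sorted(active):
            others = active - {j}
            if len(others) < order:
                continue
            if any(abs(partial_corr(y, X, j, S)) <= cutoff
                   for S in combinations(others, order)):
                active.discard(j)
    return active

# Toy data: y depends on X0 and X1; X2 is a proxy for X0; X3 is noise.
rng = np.random.default_rng(0)
n = 2000
X = rng.standard_normal((n, 4))
X[:, 2] = X[:, 0] + 0.5 * rng.standard_normal(n)
y = X[:, 0] + X[:, 1] + 0.5 * rng.standard_normal(n)
selected = pc_simple(y, X, cutoff=0.1, max_order=2)
```

The proxy variable X2 passes the marginal screen but is eliminated at order 1, since its partial correlation with y given X0 is essentially zero; because candidates are removed as the conditioning-set size grows, the tests stay feasible even for large numbers of covariates.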