Optimality of Graphlet Screening in High Dimensional Variable Selection
Consider a linear regression model where the design matrix X has n rows and p
columns. We assume (a) p is much larger than n, (b) the coefficient vector beta
is sparse in the sense that only a small fraction of its coordinates are
nonzero, and (c) the Gram matrix G = X'X is sparse in the sense that each row
has relatively few large coordinates (the diagonal entries of G are normalized to 1).
The sparsity of G naturally induces the sparsity of the so-called graph of
strong dependence (GOSD). We find an interesting interplay between the signal
sparsity and the graph sparsity, which ensures that, in a broad context, the set
of true signals decomposes into many small-size components of the GOSD that
are disconnected from one another.
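As a concrete illustration, here is a minimal Python sketch of how such a graph could be built: threshold the off-diagonal entries of the normalized Gram matrix and read off connected components. The threshold delta and the helper name gosd_components are our own illustrative choices, not the paper's exact construction.

import numpy as np
from scipy.sparse import csr_matrix
from scipy.sparse.csgraph import connected_components

def gosd_components(X, delta=0.5):
    """Sketch of a GOSD: nodes are the p variables, and an edge joins
    (i, j) when |G_ij| exceeds delta in the normalized Gram matrix.
    The threshold delta is an assumption for illustration."""
    n, p = X.shape
    Xn = X / np.linalg.norm(X, axis=0)   # normalize so diag(G) = 1
    G = Xn.T @ Xn
    A = np.abs(G) > delta
    np.fill_diagonal(A, False)           # no self-loops
    n_comp, labels = connected_components(csr_matrix(A), directed=False)
    return n_comp, labels

# Under sparsity of G, true signals typically fall into many small,
# mutually disconnected components of this graph.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 1000))
n_comp, labels = gosd_components(X)
print(n_comp)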
We propose Graphlet Screening (GS) as a new approach to variable selection;
it is a two-stage Screen and Clean method. The key methodological innovation
of GS is to use the GOSD to guide both the screening and the cleaning. Compared
to m-variate brute-force screening, which has a computational cost of p^m, GS
has a screening cost of only p (up to multi-log(p) factors).
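To see why restricting attention to the GOSD collapses the cost from p^m to roughly p, consider the counting sketch below (a hypothetical helper of our own, not the GS algorithm itself): in a graph of maximum degree d, the number of connected subsets of size at most m grows like p * d^(m-1) rather than p^m, so enumerating them is nearly linear in p.

def connected_subsets(adj, m):
    """Enumerate all connected vertex subsets of size <= m.

    adj: dict mapping each node to the set of its GOSD neighbors.
    When every node has at most d neighbors, the count grows like
    p * d^(m-1) rather than p^m, which is the source of the
    near-linear screening cost."""
    seen = set()
    for v in adj:
        stack = [frozenset([v])]
        while stack:
            S = stack.pop()
            if S in seen:
                continue
            seen.add(S)
            if len(S) < m:
                for u in S:
                    for w in adj[u] - S:
                        stack.append(S | {w})
    return seen

# Toy path graph on 6 nodes (max degree 2): only 15 connected subsets
# of size <= 3, versus C(6,1)+C(6,2)+C(6,3) = 41 unrestricted ones.
adj = {0: {1}, 1: {0, 2}, 2: {1, 3}, 3: {2, 4}, 4: {3, 5}, 5: {4}}
print(len(connected_subsets(adj, 3)))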
We measure the performance of any variable selection procedure by the minimax
Hamming distance. We show that in a very broad class of situations, GS achieves
the optimal rate of convergence in terms of the Hamming distance. Somewhat
surprisingly, the well-known procedures of subset selection and the lasso are
rate non-optimal, even in very simple settings and even when their tuning
parameters are ideally set.
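The Hamming-distance loss here counts the coordinates whose zero/nonzero status is misjudged, i.e., false positives plus false negatives; a one-line sketch (the function name is ours):

import numpy as np

def hamming_selection_error(beta_hat, beta):
    """Number of support-recovery errors: coordinates where the
    zero/nonzero status of beta_hat disagrees with that of beta."""
    return int(np.sum((beta_hat != 0) != (beta != 0)))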
Rate optimal multiple testing procedure in high-dimensional regression
Multiple testing and variable selection have gained much attention in
statistical theory and methodology research. They address the same problem of
identifying the important variables among many (Jin, 2012), yet there is little
overlap between the two literatures. Research on variable selection has focused
on selection consistency, i.e., both type I and type II errors converging to
zero. This is only possible when the signals are sufficiently strong, an
assumption that fails in many modern applications. In the regime where the
signals are both rare and weak, a certain amount of false discoveries is
inevitable and must be allowed, as long as some error rate can be controlled.
In this paper, motivated by the research of Ji and Jin (2012) and Jin (2012) in
the rare/weak regime, we extend their UPS procedure for variable selection to
multiple testing. Under certain conditions, the new UPT procedure achieves the
fastest convergence rate of the marginal false non-discovery rate while
asymptotically controlling the marginal false discovery rate at any designated
level. Numerical results are provided to demonstrate the advantage of the
proposed method.
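For context, the marginal FDR and FNR are ratios of expectations (E[false positives]/E[discoveries] and E[false negatives]/E[non-discoveries]). A Monte Carlo sketch of their empirical counterparts, with function and variable names of our own choosing:

import numpy as np

def marginal_fdr_fnr(selected, truth):
    """Empirical counterparts of the marginal FDR/FNR (illustrative
    assumption: averaging over repeated experiments approximates the
    ratio of expectations).

    selected, truth: boolean arrays of shape (n_reps, p);
    mFDR ~= E[#false positives] / E[#discoveries],
    mFNR ~= E[#false negatives] / E[#non-discoveries]."""
    fp = np.sum(selected & ~truth)
    fn = np.sum(~selected & truth)
    mfdr = fp / max(np.sum(selected), 1)
    mfnr = fn / max(np.sum(~selected), 1)
    return mfdr, mfnr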
Covariate assisted screening and estimation
Consider a linear model Y = X beta + sigma z, where X is the n x p design
matrix and z ~ N(0, I_n). The vector beta is unknown but is sparse in the
sense that most of its coordinates are 0. The main interest is to separate
its nonzero coordinates
from the zero ones (i.e., variable selection). Motivated by examples in
long-memory time series (Fan and Yao [Nonlinear Time Series: Nonparametric and
Parametric Methods (2003) Springer]) and the change-point problem (Bhattacharya
[In Change-Point Problems (South Hadley, MA, 1992) (1994) 28-56 IMS]), we are
primarily interested in the case where the Gram matrix is non-sparse but
sparsifiable by a finite-order linear filter. We focus on the regime where
signals are both rare and weak, so that successful variable selection is very
challenging but still possible. We approach this problem with a new procedure
called covariate-assisted screening and estimation (CASE). CASE first uses
linear filtering to reduce the original setting to a new regression model
where the corresponding Gram (covariance) matrix is sparse. The new covariance
matrix induces a sparse graph, which guides us to conduct multivariate
screening without visiting all the submodels. By interacting with the signal
sparsity, the graph enables us to decompose the original problem into many
separated small-size subproblems (if only we knew where they were!). Linear
filtering also induces a so-called problem of information leakage, which can be
overcome by the newly introduced patching technique. Together, these give rise
to CASE, which is a two-stage screen and clean [Fan and Song Ann. Statist. 38
(2010) 3567-3604; Wasserman and Roeder Ann. Statist. 37 (2009) 2178-2201]
procedure, where we first identify candidates of these submodels by patching
and screening, and then re-examine each candidate to remove false positives.
Published in the Annals of Statistics by the Institute of Mathematical
Statistics (http://dx.doi.org/10.1214/14-AOS1243).
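As a minimal illustration of the first step, the sketch below left-multiplies the model Y = X beta + sigma z by an assumed first-order differencing filter D, giving DY = (DX) beta + sigma Dz; for a change-point-type design the Gram matrix of DX is sparse even though X'X is not. The filter choice and the helper name difference_filter are our assumptions; the paper's filters, and the patching that repairs the resulting information leakage, are more involved.

import numpy as np

def difference_filter(X, Y):
    """Assumed first-order differencing filter D (rows: x_i - x_{i-1};
    first row kept as-is). Left-multiplying Y = X beta + sigma z by D
    yields a new regression DY = (DX) beta + sigma Dz whose Gram matrix
    (DX)'(DX) can be sparse. Note Dz is correlated noise, which the
    paper's analysis must (and does) account for."""
    n = X.shape[0]
    D = np.eye(n) - np.eye(n, k=-1)
    return D @ X, D @ Y

# Toy change-point design: the columns of X are step functions, so X'X
# is dense, but each column of DX is a single spike and (DX)'(DX) is
# essentially diagonal.
n, p = 100, 8
X = np.zeros((n, p))
for j in range(p):
    X[(j + 1) * 10:, j] = 1.0
DX, _ = difference_filter(X, np.zeros(n))
print(np.round(DX.T @ DX, 2))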