18 research outputs found

    Optimality of Graphlet Screening in High Dimensional Variable Selection

    Consider a linear regression model where the design matrix X has n rows and p columns. We assume (a) p is much larger than n, (b) the coefficient vector beta is sparse in the sense that only a small fraction of its coordinates is nonzero, and (c) the Gram matrix G = X'X is sparse in the sense that each row has relatively few large coordinates (diagonals of G are normalized to 1). The sparsity in G naturally induces the sparsity of the so-called graph of strong dependence (GOSD). We find an interesting interplay between the signal sparsity and the graph sparsity, which ensures that in a broad context, the set of true signals decomposes into many different small-size components of GOSD, where different components are disconnected. We propose Graphlet Screening (GS) as a new approach to variable selection, which is a two-stage Screen and Clean method. The key methodological innovation of GS is to use GOSD to guide both the screening and cleaning. Compared to m-variate brute-force screening that has a computational cost of p^m, the GS only has a computational cost of p (up to some multi-log(p) factors) in screening. We measure the performance of any variable selection procedure by the minimax Hamming distance. We show that in a very broad class of situations, GS achieves the optimal rate of convergence in terms of the Hamming distance. Somewhat surprisingly, the well-known procedures subset selection and the lasso are rate non-optimal, even in very simple settings and even when their tuning parameters are ideally set.
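The decomposition described in this abstract can be illustrated with a small sketch. It is not the Graphlet Screening procedure itself: it only builds a graph by thresholding the Gram matrix and splits a given signal support into connected components of the induced subgraph. The function name `gosd_components`, the threshold `delta`, and the toy tridiagonal Gram matrix are assumptions made for illustration.

```python
import numpy as np
from collections import deque


def gosd_components(G, support, delta):
    """Split `support` into connected components of a thresholded Gram graph.

    Nodes i and j are treated as adjacent when |G[i, j]| >= delta (i != j).
    The threshold rule is an assumption for this sketch, not taken from the abstract.
    """
    G = np.asarray(G, dtype=float)
    adj = np.abs(G) >= delta
    np.fill_diagonal(adj, False)  # ignore self-loops

    remaining = set(int(j) for j in support)
    components = []
    for start in sorted(remaining):
        if start not in remaining:
            continue  # already absorbed into an earlier component
        remaining.discard(start)
        queue = deque([start])
        comp = [start]
        while queue:
            i = queue.popleft()
            # Only edges between support nodes count (induced subgraph).
            neighbors = [j for j in remaining if adj[i, j]]
            for j in neighbors:
                remaining.discard(j)
                comp.append(j)
                queue.append(j)
        components.append(sorted(comp))
    return components


# Toy Gram matrix: unit diagonal, 0.4 on the first off-diagonals.
p = 8
G = np.eye(p)
for i in range(p - 1):
    G[i, i + 1] = G[i + 1, i] = 0.4

components = gosd_components(G, support=[0, 1, 4, 6], delta=0.2)
print(components)
```

In this toy case the signals at 0 and 1 are neighbors and form one component, while 4 and 6 are separated by the non-signal node 5 and stay in separate components. This is the kind of small, disconnected structure the abstract says can be screened component by component.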

    Rate optimal multiple testing procedure in high-dimensional regression

    Multiple testing and variable selection have gained much attention in statistical theory and methodology research. They deal with the same problem of identifying the important variables among many (Jin, 2012). However, there is little overlap in the literature. Research on variable selection has focused on selection consistency, i.e., both type I and type II errors converging to zero. This is only possible when the signals are sufficiently strong, contrary to many modern applications. For the regime where the signals are both rare and weak, it is inevitable that a certain amount of false discoveries will be allowed, as long as some error rate can be controlled. In this paper, motivated by the research by Ji and Jin (2012) and Jin (2012) in the rare/weak regime, we extend their UPS procedure for variable selection to multiple testing. Under certain conditions, the new UPT procedure achieves the fastest convergence rate of marginal false non-discovery rates, while controlling the marginal false discovery rate at any designated level $\alpha$ asymptotically. Numerical results are provided to demonstrate the advantage of the proposed method. Comment: 27 pages
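For readers unfamiliar with the two error rates named in this abstract, the definitions commonly used in the multiple testing literature are sketched below. The symbols $V$, $R$, $T$ and $p$ are assumptions introduced here, and the paper's own notation and exact definitions may differ.

```latex
% V = number of false rejections, R = total number of rejections,
% T = number of false non-rejections, p = number of hypotheses tested.
\[
  \mathrm{mFDR} = \frac{E[V]}{E[R]},
  \qquad
  \mathrm{mFNR} = \frac{E[T]}{E[p - R]} .
\]
% Goal stated in the abstract: minimize mFNR subject to mFDR <= \alpha asymptotically.
```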

    Covariate assisted screening and estimation

    Consider a linear model $Y = X\beta + z$, where $X = X_{n,p}$ and $z \sim N(0, I_n)$. The vector $\beta$ is unknown but is sparse in the sense that most of its coordinates are $0$. The main interest is to separate its nonzero coordinates from the zero ones (i.e., variable selection). Motivated by examples in long-memory time series (Fan and Yao [Nonlinear Time Series: Nonparametric and Parametric Methods (2003) Springer]) and the change-point problem (Bhattacharya [In Change-Point Problems (South Hadley, MA, 1992) (1994) 28-56 IMS]), we are primarily interested in the case where the Gram matrix $G = X'X$ is nonsparse but sparsifiable by a finite order linear filter. We focus on the regime where signals are both rare and weak so that successful variable selection is very challenging but is still possible. We approach this problem by a new procedure called the covariate assisted screening and estimation (CASE). CASE first uses a linear filtering to reduce the original setting to a new regression model where the corresponding Gram (covariance) matrix is sparse. The new covariance matrix induces a sparse graph, which guides us to conduct multivariate screening without visiting all the submodels. By interacting with the signal sparsity, the graph enables us to decompose the original problem into many separated small-size subproblems (if only we know where they are!). Linear filtering also induces a so-called problem of information leakage, which can be overcome by the newly introduced patching technique. Together, these give rise to CASE, which is a two-stage screen and clean [Fan and Song Ann. Statist. 38 (2010) 3567-3604; Wasserman and Roeder Ann. Statist. 37 (2009) 2178-2201] procedure, where we first identify candidates of these submodels by patching and screening, and then re-examine each candidate to remove false positives. Comment: Published at http://dx.doi.org/10.1214/14-AOS1243 in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org)
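The idea of a Gram matrix that is "nonsparse but sparsifiable by a finite order linear filter" can be seen in a toy change-point setting. The sketch below is not the CASE procedure: it assumes a lower-triangular design of ones (a common way to encode a piecewise-constant mean), skips the normalization of the Gram matrix, and uses a second-order differencing filter. The helper names `change_point_design` and `second_difference_filter` are hypothetical.

```python
import numpy as np


def change_point_design(p):
    """Assumed toy design: X[i, j] = 1 if i >= j, else 0 (lower-triangular ones)."""
    return np.tril(np.ones((p, p)))


def second_difference_filter(p):
    """(p - 2) x p matrix whose rows apply the filter [1, -2, 1]."""
    D = np.zeros((p - 2, p))
    for j in range(p - 2):
        D[j, j:j + 3] = [1.0, -2.0, 1.0]
    return D


p = 8
X = change_point_design(p)
G = X.T @ X                      # dense: G[j, k] = p - max(j, k)
D = second_difference_filter(p)
B = D @ G                        # filtered Gram matrix

dense_count = int(np.count_nonzero(G))
sparse_count = int(np.count_nonzero(np.abs(B) > 1e-12))
print(f"nonzeros in G: {dense_count} of {p * p}")
print(f"nonzeros in D @ G: {sparse_count} of {(p - 2) * p}")
```

Every entry of `G` is nonzero here, while each row of `D @ G` keeps a single nonzero entry, because the columns of `G` are piecewise linear with one kink. A sparse filtered matrix of this kind is what makes graph-guided screening feasible in the reduced model the abstract describes.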