
    Multiple Testing and Variable Selection along Least Angle Regression's path

    In this article, we investigate multiple testing and variable selection using the Least Angle Regression (LARS) algorithm in high dimensions under the Gaussian noise assumption. LARS is known to produce a piecewise affine solution path whose change points are referred to as the knots of the LARS path. The cornerstone of the present work is a closed-form expression for the exact joint law of K-tuples of knots conditional on the variables selected by LARS, namely the post-selection joint law of the LARS knots. Numerical experiments confirm that this law matches the empirical distribution. Our main contributions are threefold. First, we build testing procedures on variables entering the model along the LARS path in the general design case when the noise level may be unknown. These procedures, referred to as the Generalized t-Spacing tests (GtSt), are proved to have exact non-asymptotic level (i.e., the Type I error is exactly controlled). This extends the work of Taylor et al. (2014), whose Spacing test applies to consecutive knots with known variance. Second, we introduce a new exact multiple false negatives test after model selection, again for general designs and possibly unknown noise level, and prove that it has exact non-asymptotic level. Last, we give an exact control of the false discovery rate (FDR) under the orthogonal design assumption. Monte Carlo simulations and a real data experiment illustrate our results in this case. Of independent interest, we introduce an equivalent formulation of the LARS algorithm based on a recursive function.
    Comment: 62 pages. New: FDR control and power comparison between Knockoff, FCD, Slope and our proposed method; the introduction has been revised and now presents a synthesized overview of the main results, which we believe brings new insights compared to the previous version.
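    To make the spacing idea concrete, here is a minimal sketch (not the authors' GtSt implementation) that computes the LARS knots with scikit-learn and forms the simple first-knot spacing p-value of Taylor et al. (2014), assuming an orthonormal design and a known noise level sigma:

```python
# A hedged sketch, not the paper's GtSt procedure: compute the LARS knots and
# test the first variable to enter via the simple spacing statistic, which is
# Uniform(0,1) under the global null for an orthonormal design with known sigma.
import numpy as np
from scipy.stats import norm
from sklearn.linear_model import lars_path

rng = np.random.default_rng(0)
n, p, sigma = 100, 50, 1.0
X, _ = np.linalg.qr(rng.standard_normal((n, p)))  # orthonormal columns
y = sigma * rng.standard_normal(n)                # global null: no signal

# `alphas` are the knots of the piecewise affine path, in decreasing order;
# `active` lists variables in their order of entry. scikit-learn divides the
# covariances by n_samples, so rescale to recover lambda_k = max |X^T residual|.
alphas, active, _ = lars_path(X, y, method="lar")
lam1, lam2 = alphas[0] * n, alphas[1] * n

# Simple spacing p-value for the first knot (survival-function ratio).
p_value = norm.sf(lam1 / sigma) / norm.sf(lam2 / sigma)
print(f"first entrant: variable {active[0]}, spacing p-value = {p_value:.3f}")
```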

    Improved variable selection with Forward-Lasso adaptive shrinkage

    Recently, considerable interest has focused on variable selection methods in regression situations where the number of predictors, p, is large relative to the number of observations, n. Two commonly applied variable selection approaches are the Lasso, which computes highly shrunk regression coefficients, and Forward Selection, which uses no shrinkage. We propose a new approach, "Forward-Lasso Adaptive SHrinkage" (FLASH), which includes the Lasso and Forward Selection as special cases and can be used in both the linear regression and the generalized linear model (GLM) domains. As with the Lasso and Forward Selection, FLASH iteratively adds one variable to the model in a hierarchical fashion but, unlike these methods, at each step adjusts the level of shrinkage so as to optimize the selection of the next variable. We first present FLASH in the linear regression setting and show that it can be fitted using a variant of the computationally efficient LARS algorithm. Then we extend FLASH to the GLM domain and demonstrate, through numerous simulations and real-world data sets, as well as some theoretical analysis, that FLASH generally outperforms many competing approaches.
    Comment: Published at http://dx.doi.org/10.1214/10-AOAS375 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
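    To illustrate the spectrum FLASH moves along, here is a toy stagewise sketch (not the published FLASH algorithm): each step adds the variable most correlated with the residual and shrinks its univariate least-squares step by a factor delta. A full step (delta = 1) behaves like Forward Selection (exactly so under orthogonal designs), while a small fixed delta mimics the heavily shrunk, Lasso-like end. FLASH's innovation is to choose this shrinkage adaptively at every step, which the fixed `delta` below deliberately does not do:

```python
# A toy sketch, not the published FLASH algorithm: greedy stagewise selection
# with a fixed per-step shrinkage factor delta in (0, 1].
import numpy as np

def stagewise_flash_like(X, y, n_steps=10, delta=0.5):
    """Assumed toy interface: returns coefficients after n_steps greedy updates."""
    _, p = X.shape
    beta = np.zeros(p)
    residual = y.copy()
    for _ in range(n_steps):
        corr = X.T @ residual
        j = np.argmax(np.abs(corr))           # variable most correlated with residual
        step = corr[j] / (X[:, j] @ X[:, j])  # univariate least-squares step
        beta[j] += delta * step               # shrink the step; delta=1 is a full step
        residual = y - X @ beta
    return beta
```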

    Model Selection for High Dimensional Quadratic Regression via Regularization

    Quadratic regression (QR) models naturally extend linear models by considering interaction effects between the covariates. To conduct model selection in QR, it is important to maintain the hierarchical model structure between main effects and interaction effects. Existing regularization methods generally achieve this goal by solving complex optimization problems, which usually demand high computational cost and hence are not feasible for high dimensional data. This paper focuses on scalable regularization methods for model selection in high dimensional QR. We first consider two-stage regularization methods and establish theoretical properties of the two-stage LASSO. Then a new regularization method, called Regularization Algorithm under Marginality Principle (RAMP), is proposed to compute a hierarchy-preserving regularization solution path efficiently. Both methods are further extended to solve generalized QR models. Numerical results demonstrate the performance of the methods.
    Comment: 37 pages, 1 figure, with supplementary material.
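    As a rough illustration of the two-stage idea (a sketch, not the paper's RAMP path algorithm): stage 1 runs a lasso on the main effects, and stage 2 refits a lasso after adding only those interactions whose parent main effects were selected, so candidate interactions respect the marginality principle by construction:

```python
# A minimal two-stage sketch in the spirit of the two-stage LASSO; RAMP itself
# computes a whole hierarchy-preserving path and is not reproduced here.
import numpy as np
from itertools import combinations_with_replacement
from sklearn.linear_model import LassoCV

def two_stage_lasso(X, y):
    stage1 = LassoCV(cv=5).fit(X, y)
    mains = np.flatnonzero(stage1.coef_)  # main effects surviving stage 1
    # Candidate interactions (including squares) are built only from selected
    # mains, so each one has its parents available in the stage-2 design.
    inter = [X[:, j] * X[:, k]
             for j, k in combinations_with_replacement(mains, 2)]
    X2 = np.column_stack([X] + inter) if inter else X
    return LassoCV(cv=5).fit(X2, y)  # stage-2 selection over mains + interactions
```

    Note that the stage-2 lasso in this sketch may still zero out a parent main effect; hierarchy-enforcing methods like RAMP also preserve the structure in the final fit.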

    Local-Aggregate Modeling for Big-Data via Distributed Optimization: Applications to Neuroimaging

    Technological advances have led to a proliferation of structured big data that have matrix-valued covariates. We are specifically motivated to build predictive models for multi-subject neuroimaging data based on each subject's brain imaging scans. This is an ultra-high-dimensional problem that consists of a matrix of covariates (brain locations by time points) for each subject; few methods currently exist to fit supervised models directly to this tensor data. We propose a novel modeling and algorithmic strategy to apply generalized linear models (GLMs) to this massive tensor data in which one set of variables is associated with locations. Our method begins by fitting GLMs to each location separately, and then builds an ensemble by blending information across locations through regularization with what we term an aggregating penalty. Our so-called Local-Aggregate Model can be fit in a completely distributed manner over the locations using an Alternating Direction Method of Multipliers (ADMM) strategy, and thus greatly reduces the computational burden. Furthermore, we propose to select the appropriate model through a novel sequence of faster algorithmic solutions that is similar to regularization paths. We demonstrate both the computational and predictive modeling advantages of our methods via simulations and an EEG classification problem.
    Comment: 41 pages, 5 figures and 3 tables.
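    The "local" stage is easy to picture; here is a minimal sketch of it (not the authors' ADMM solver) with a crude stand-in for the aggregation: fit an independent GLM per location, then shrink the per-location coefficients toward their cross-location mean. In the actual Local-Aggregate Model this averaging is replaced by an aggregating penalty optimized jointly via ADMM; the `shrink` parameter below is purely an assumption for illustration:

```python
# A hedged sketch of the local-then-aggregate structure only; the paper's
# aggregating penalty and distributed ADMM updates are not implemented here.
import numpy as np
from sklearn.linear_model import LogisticRegression

def local_aggregate_sketch(X, y, shrink=0.5):
    """X: (subjects, locations, timepoints) covariate tensor; y: class labels."""
    n_subj, n_loc, n_time = X.shape
    B = np.zeros((n_loc, n_time))
    for loc in range(n_loc):  # embarrassingly parallel across locations
        clf = LogisticRegression(max_iter=1000).fit(X[:, loc, :], y)
        B[loc] = clf.coef_.ravel()
    # Stand-in aggregation: pull each location toward the cross-location mean.
    return (1 - shrink) * B + shrink * B.mean(axis=0, keepdims=True)
```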