A Survey of Tuning Parameter Selection for High-dimensional Regression
Penalized (or regularized) regression, as represented by Lasso and its
variants, has become a standard technique for analyzing high-dimensional data
when the number of variables substantially exceeds the sample size. The
performance of penalized regression relies crucially on the choice of the
tuning parameter, which determines the amount of regularization and hence the
sparsity level of the fitted model. The optimal choice of tuning parameter
depends on both the structure of the design matrix and the unknown random error
distribution (variance, tail behavior, etc.). This article reviews the current
literature on tuning parameter selection for high-dimensional regression from
both theoretical and practical perspectives. We discuss various strategies that
choose the tuning parameter to achieve prediction accuracy or support recovery.
We also review several recently proposed methods for tuning-free
high-dimensional regression.
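For readers who want a concrete starting point, here is a minimal sketch of the most common strategy such surveys cover, cross-validation over a grid of penalty levels, written with scikit-learn on simulated data (the dimensions, signal, and seed are our illustrative choices, not the paper's):

```python
# Minimal sketch (illustrative, not from the survey): selecting the Lasso
# tuning parameter by K-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, k = 100, 500, 5                  # high-dimensional: p >> n, k-sparse truth
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 2.0                         # only the first k coefficients are nonzero
y = X @ beta + rng.standard_normal(n)

# LassoCV fits the Lasso path over a grid of penalty levels and keeps the
# one minimizing 5-fold cross-validated prediction error.
model = LassoCV(cv=5).fit(X, y)
print("selected penalty:", model.alpha_)
print("selected support size:", int(np.sum(model.coef_ != 0)))
```

Cross-validation targets prediction accuracy; as the abstract's distinction suggests, a parameter tuned for prediction typically overselects variables, so support recovery generally calls for a different (often larger) penalty level.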
Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics,
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much recent work has focused on understanding the computational
complexity of proposed methods for "Big Data." Sample complexity, however, has
received comparatively little attention, especially in the setting where the
sample size is fixed and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche, but only the last applies to exa-scale data
dimensions. We illustrate this high dimensional framework for the problem of
correlation mining, where it is the matrix of pairwise and partial correlations
among the variables that is of interest. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks.
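As a concrete illustration of the sample-starved regime the abstract describes, the sketch below screens for large pairwise correlations when the sample size is tiny and the dimension is large (the threshold and dimensions are our own arbitrary choices; this is not the paper's procedure):

```python
# Minimal sketch (assumed, not the paper's code): correlation screening in the
# purely high dimensional regime, where n is small and p is large.
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2000                           # sample-starved: n << p
X = rng.standard_normal((n, p))           # independent variables by design

R = np.corrcoef(X, rowvar=False)          # p x p sample correlation matrix
np.fill_diagonal(R, 0.0)                  # ignore trivial self-correlations
rho = 0.7                                 # illustrative screening threshold
hits = np.argwhere(np.triu(np.abs(R) > rho))
print(f"{len(hits)} pairs exceed |corr| > {rho}")
```

With n this small, even fully independent variables typically produce dozens of sample correlations above the threshold, all of them false discoveries, which is exactly why the fixed-n, growing-p regime needs its own theory of learning rates and sample complexity.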
Minimax risks for sparse regressions: Ultra-high-dimensional phenomenons
Consider the standard Gaussian linear regression model $Y = X\theta_0 + \epsilon$,
where $Y \in \mathbb{R}^n$ is a response vector and $X \in \mathbb{R}^{n \times p}$
is a design matrix. Much work has been devoted to building efficient estimators
of $\theta_0$ when $p$ is much larger than $n$. In such a situation, a classical
approach amounts to assuming that $\theta_0$ is approximately sparse. This paper
studies the minimax risks of estimation and testing over classes of $k$-sparse
vectors $\theta_0$. These bounds shed light on the limitations due to
high-dimensionality. The results encompass the problem of prediction
(estimation of $X\theta_0$), the inverse problem (estimation of $\theta_0$), and
linear testing (testing $\theta_0 = 0$). Interestingly, an elbow effect occurs
when the number of variables is large relative to the sample size, namely when
$k\log(p/k)$ becomes large compared to $n$. Indeed, the minimax risks and
hypothesis separation distances blow up in this ultra-high dimensional setting.
We also prove that even dimension reduction techniques cannot provide
satisfactory results in an ultra-high dimensional setting. Moreover, we compute
the minimax risks when the variance of the noise is unknown. Knowledge of this
variance is shown to play a significant role in the optimal rates of estimation
and testing. All these minimax bounds characterize statistical problems that
are so difficult that no procedure can provide satisfactory results.
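For context, the elbow the abstract refers to is usually stated through the standard minimax prediction rate over $k$-sparse vectors (the display below is the generic form from this literature in our own notation, not a formula quoted from the paper):

```latex
% Minimax prediction risk over k-sparse vectors (standard form; notation ours).
% sigma^2 is the noise variance; X has n rows and p columns.
\[
  \inf_{\hat{\theta}} \sup_{\|\theta_0\|_0 \le k}
    \frac{1}{n}\,\mathbb{E}\,\bigl\| X(\hat{\theta} - \theta_0) \bigr\|_2^2
  \;\asymp\; \sigma^2 \, \frac{k \log(p/k)}{n},
  \qquad \text{valid while } k \log(p/k) \lesssim n .
\]
% The "elbow": once k*log(p/k) exceeds the order of n (the ultra-high
% dimensional regime), this rate is no longer attainable and the minimax
% risk blows up.
```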