A Survey of Tuning Parameter Selection for High-dimensional Regression
Penalized (or regularized) regression, as represented by Lasso and its
variants, has become a standard technique for analyzing high-dimensional data
when the number of variables substantially exceeds the sample size. The
performance of penalized regression relies crucially on the choice of the
tuning parameter, which determines the amount of regularization and hence the
sparsity level of the fitted model. The optimal choice of tuning parameter
depends on both the structure of the design matrix and the unknown random error
distribution (variance, tail behavior, etc.). This article reviews the current
literature on tuning parameter selection for high-dimensional regression from
both theoretical and practical perspectives. We discuss various strategies that
choose the tuning parameter to achieve prediction accuracy or support recovery.
We also review several recently proposed methods for tuning-free
high-dimensional regression.
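For readers who want a concrete starting point, here is a minimal sketch of the most common strategy such surveys cover, cross-validation over a grid of penalty levels, written with scikit-learn on simulated data (the dimensions, signal, and seed are our illustrative choices, not the paper's):

```python
# Minimal sketch (illustrative, not from the survey): selecting the Lasso
# tuning parameter by K-fold cross-validation with scikit-learn.
import numpy as np
from sklearn.linear_model import LassoCV

rng = np.random.default_rng(0)
n, p, k = 100, 500, 5                  # high-dimensional: p >> n, k-sparse truth
X = rng.standard_normal((n, p))
beta = np.zeros(p)
beta[:k] = 2.0                         # only the first k coefficients are nonzero
y = X @ beta + rng.standard_normal(n)

# LassoCV fits the Lasso path over a grid of penalty levels and keeps the
# one minimizing 5-fold cross-validated prediction error.
model = LassoCV(cv=5).fit(X, y)
print("selected penalty:", model.alpha_)
print("selected support size:", int(np.sum(model.coef_ != 0)))
```

Cross-validation targets prediction accuracy; as the abstract's distinction suggests, a parameter tuned for prediction typically overselects variables, so support recovery generally calls for a different (often larger) penalty level.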
Foundational principles for large scale inference: Illustrations through correlation mining
When can reliable inference be drawn in the "Big Data" context? This paper
presents a framework for answering this fundamental question in the context of
correlation mining, with implications for general large scale inference. In
large scale data applications like genomics, connectomics, and eco-informatics,
the dataset is often variable-rich but sample-starved: a regime where the
number of acquired samples (statistical replicates) is far fewer than the
number of observed variables (genes, neurons, voxels, or chemical
constituents). Much recent work has focused on understanding the computational
complexity of proposed methods for "Big Data." Sample complexity, however, has
received comparatively little attention, especially in the setting where the
sample size is fixed and the dimension grows without bound. To
address this gap, we develop a unified statistical framework that explicitly
quantifies the sample complexity of various inferential tasks. Sampling regimes
can be divided into several categories: 1) the classical asymptotic regime
where the variable dimension is fixed and the sample size goes to infinity; 2)
the mixed asymptotic regime where both variable dimension and sample size go to
infinity at comparable rates; 3) the purely high dimensional asymptotic regime
where the variable dimension goes to infinity and the sample size is fixed.
Each regime has its niche, but only the last applies to exa-scale data
dimensions. We illustrate this high dimensional framework for the problem of
correlation mining, where it is the matrix of pairwise and partial correlations
among the variables that is of interest. We demonstrate various regimes of
correlation mining based on the unifying perspective of high dimensional
learning rates and sample complexity for different structured covariance models
and different inference tasks.
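As a concrete illustration of the sample-starved regime the abstract describes, the sketch below screens for large pairwise correlations when the sample size is tiny and the dimension is large (the threshold and dimensions are our own arbitrary choices; this is not the paper's procedure):

```python
# Minimal sketch (assumed, not the paper's code): correlation screening in the
# purely high dimensional regime, where n is small and p is large.
import numpy as np

rng = np.random.default_rng(1)
n, p = 30, 2000                           # sample-starved: n << p
X = rng.standard_normal((n, p))           # independent variables by design

R = np.corrcoef(X, rowvar=False)          # p x p sample correlation matrix
np.fill_diagonal(R, 0.0)                  # ignore trivial self-correlations
rho = 0.7                                 # illustrative screening threshold
hits = np.argwhere(np.triu(np.abs(R) > rho))
print(f"{len(hits)} pairs exceed |corr| > {rho}")
```

With n this small, even fully independent variables typically produce dozens of sample correlations above the threshold, all of them false discoveries, which is exactly why the fixed-n, growing-p regime needs its own theory of learning rates and sample complexity.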
Minimax risks for sparse regressions: Ultra-high-dimensional phenomenons
Consider the standard Gaussian linear regression model $Y = X\theta_0 + \epsilon$,
where $Y \in \mathbb{R}^n$ is a response vector and $X \in \mathbb{R}^{n \times p}$
is a design matrix. Much work has been devoted to building efficient estimators
of $\theta_0$ when $p$ is much larger than $n$. In such a situation, a classical
approach amounts to assuming that $\theta_0$ is approximately sparse. This paper
studies the minimax risks of estimation and testing over classes of $k$-sparse
vectors $\theta_0$. These bounds shed light on the limitations due to
high-dimensionality. The results encompass the problem of prediction
(estimation of $X\theta_0$), the inverse problem (estimation of $\theta_0$), and
linear testing (testing $\theta_0 = 0$). Interestingly, an elbow effect occurs
when the number of variables is large relative to the sample size, namely when
$k\log(p/k)$ becomes large compared to $n$. Indeed, the minimax risks and
hypothesis separation distances blow up in this ultra-high dimensional setting.
We also prove that even dimension reduction techniques cannot provide
satisfactory results in an ultra-high dimensional setting. Moreover, we compute
the minimax risks when the variance of the noise is unknown. Knowledge of this
variance is shown to play a significant role in the optimal rates of estimation
and testing. All these minimax bounds characterize statistical problems that
are so difficult that no procedure can provide satisfactory results.
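For context, the elbow the abstract refers to is usually stated through the standard minimax prediction rate over $k$-sparse vectors (the display below is the generic form from this literature in our own notation, not a formula quoted from the paper):

```latex
% Minimax prediction risk over k-sparse vectors (standard form; notation ours).
% sigma^2 is the noise variance; X has n rows and p columns.
\[
  \inf_{\hat{\theta}} \sup_{\|\theta_0\|_0 \le k}
    \frac{1}{n}\,\mathbb{E}\,\bigl\| X(\hat{\theta} - \theta_0) \bigr\|_2^2
  \;\asymp\; \sigma^2 \, \frac{k \log(p/k)}{n},
  \qquad \text{valid while } k \log(p/k) \lesssim n .
\]
% The "elbow": once k*log(p/k) exceeds the order of n (the ultra-high
% dimensional regime), this rate is no longer attainable and the minimax
% risk blows up.
```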