A bias correction for the minimum error rate in cross-validation
Tuning parameters in supervised learning problems are often estimated by
cross-validation. The minimum value of the cross-validation error can be biased
downward as an estimate of the test error at that same value of the tuning
parameter. We propose a simple method for the estimation of this bias that uses
information from the cross-validation process. As a result, it requires
essentially no additional computation. We apply our bias estimate to a number
of popular classifiers in various settings, and examine its performance.
Comment: Published at http://dx.doi.org/10.1214/08-AOAS224 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org).
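The downward bias at the minimum is easy to see in a small simulation. The sketch below is an illustration only, not the bias-correction method proposed in the paper; it assumes NumPy and scikit-learn, tunes the number of neighbors in a k-NN classifier by 5-fold cross-validation, and compares the minimized CV error with the hold-out error at the selected value.

    # Illustration of the downward bias of the minimized CV error.
    # A sketch only (not the paper's bias correction); assumes NumPy
    # and scikit-learn are installed.
    import numpy as np
    from sklearn.model_selection import cross_val_score, train_test_split
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(0)
    n, p = 200, 20
    X = rng.normal(size=(n, p))
    y = (X[:, 0] + 0.5 * rng.normal(size=n) > 0).astype(int)
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

    ks = range(1, 26)
    cv_err = []
    for k in ks:
        acc = cross_val_score(KNeighborsClassifier(n_neighbors=k), X_tr, y_tr, cv=5)
        cv_err.append(1 - acc.mean())          # CV misclassification error

    best = int(np.argmin(cv_err))
    model = KNeighborsClassifier(n_neighbors=list(ks)[best]).fit(X_tr, y_tr)
    test_err = 1 - model.score(X_te, y_te)      # hold-out error at the chosen k

    print(f"min CV error: {min(cv_err):.3f}  test error at chosen k: {test_err:.3f}")
    # Averaged over repetitions, the minimum CV error sits below the test error.
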
The Lasso Problem and Uniqueness
The lasso is a popular tool for sparse linear regression, especially for
problems in which the number of variables p exceeds the number of observations
n. But when p>n, the lasso criterion is not strictly convex, and hence it may
not have a unique minimum. An important question is: when is the lasso solution
well-defined (unique)? We review results from the literature, which show that
if the predictor variables are drawn from a continuous probability
distribution, then there is a unique lasso solution with probability one,
regardless of the sizes of n and p. We also show that this result extends
easily to penalized minimization problems over a wide range of loss
functions.
A second important question is: how can we deal with the case of
non-uniqueness in lasso solutions? In light of the aforementioned result, this
case really only arises when some of the predictor variables are discrete, or
when some post-processing has been performed on continuous predictor
measurements. Though we certainly cannot claim to provide a complete answer to
such a broad question, we do present progress towards understanding some
aspects of non-uniqueness. First, we extend the LARS algorithm for computing
the lasso solution path to cover the non-unique case, so that this path
algorithm works for any predictor matrix. Next, we derive a simple method for
computing the component-wise uncertainty in lasso solutions of any given
problem instance, based on linear programming. Finally, we review results from
the literature on some of the unifying properties of lasso solutions, and also
point out particular forms of solutions that have distinctive properties.
Comment: 25 pages, 0 figures.
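A minimal way to see non-uniqueness with discrete or duplicated predictors: if two columns of the design are identical, weight can be shifted between their coefficients without changing the fitted values or the lasso objective. The sketch below is an illustration only (assuming NumPy and scikit-learn); it does not implement the paper's LARS extension or its LP-based uncertainty bounds.

    # Non-uniqueness with a duplicated predictor: shifting weight between two
    # identical columns leaves the lasso objective unchanged, so the solution
    # set is not a single point.  Sketch only; assumes NumPy and scikit-learn.
    import numpy as np
    from sklearn.linear_model import Lasso

    rng = np.random.default_rng(1)
    n = 50
    x = rng.normal(size=(n, 1))
    X = np.hstack([x, x, rng.normal(size=(n, 3))])   # columns 0 and 1 identical
    y = 2 * x[:, 0] + 0.1 * rng.normal(size=n)

    lam = 0.1
    b1 = Lasso(alpha=lam, fit_intercept=False).fit(X, y).coef_.copy()

    # Move half the weight from the larger duplicate to the other one.
    i, j = (0, 1) if abs(b1[0]) >= abs(b1[1]) else (1, 0)
    b2 = b1.copy()
    shift = 0.5 * b1[i]
    b2[i] -= shift
    b2[j] += shift

    def objective(b):
        return 0.5 / n * np.sum((y - X @ b) ** 2) + lam * np.sum(np.abs(b))

    print(objective(b1), objective(b2))      # equal objective values
    print(np.allclose(X @ b1, X @ b2))       # identical fitted values -> True
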
A General Framework for Fast Stagewise Algorithms
Forward stagewise regression follows a very simple strategy for constructing
a sequence of sparse regression estimates: it starts with all coefficients
equal to zero, and iteratively updates the coefficient (by a small amount
$\epsilon > 0$) of the variable that achieves the maximal absolute inner product
with the current residual. This procedure has an interesting connection to the
lasso: under some conditions, it is known that the sequence of forward
stagewise estimates exactly coincides with the lasso path, as the step size
goes to zero. Furthermore, essentially the same equivalence holds
outside of least squares regression, with the minimization of a differentiable
convex loss function subject to an $\ell_1$ norm constraint (the stagewise
algorithm now updates the coefficient corresponding to the maximal absolute
component of the gradient).
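The least squares version of this procedure takes only a few lines. The sketch below is a bare-bones implementation for illustration, not the general stagewise framework developed in the paper; it assumes NumPy, and the step size and iteration count are arbitrary choices.

    # Minimal forward stagewise regression for least squares: repeatedly nudge,
    # by a small step epsilon, the coefficient of the predictor with the largest
    # absolute inner product with the current residual.  A sketch only.
    import numpy as np

    def forward_stagewise(X, y, eps=0.01, n_steps=5000):
        n, p = X.shape
        beta = np.zeros(p)
        path = [beta.copy()]
        r = y.copy()                        # current residual
        for _ in range(n_steps):
            c = X.T @ r                     # inner products with the residual
            j = np.argmax(np.abs(c))        # most correlated predictor
            delta = eps * np.sign(c[j])     # small step in the sign of c_j
            beta[j] += delta
            r -= delta * X[:, j]            # update the residual incrementally
            path.append(beta.copy())
        return np.array(path)

    # Example use (standardizing the columns is conventional):
    rng = np.random.default_rng(0)
    X = rng.normal(size=(100, 10))
    X = (X - X.mean(0)) / X.std(0)
    y = X[:, 0] - 2 * X[:, 3] + rng.normal(size=100)
    path = forward_stagewise(X, y)
    print(path[-1].round(2))   # hovers near the least squares fit after enough steps
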
Even when they do not match their $\ell_1$-constrained analogues, stagewise
estimates provide a useful approximation, and are computationally appealing.
Their success in sparse modeling motivates the question: can a simple,
effective strategy like forward stagewise be applied more broadly in other
regularization settings, beyond the $\ell_1$ norm and sparsity? The current
paper is an attempt to do just this. We present a general framework for
stagewise estimation, which yields fast algorithms for problems such as
group-structured learning, matrix completion, image denoising, and more.
Comment: 56 pages, 15 figures.
Exact Post-Selection Inference for Sequential Regression Procedures
We propose new inference tools for forward stepwise regression, least angle
regression, and the lasso. Assuming a Gaussian model for the observation vector
y, we first describe a general scheme to perform valid inference after any
selection event that can be characterized as y falling into a polyhedral set.
This framework allows us to derive conditional (post-selection) hypothesis
tests at any step of forward stepwise or least angle regression, or any step
along the lasso regularization path, because, as it turns out, selection events
for these procedures can be expressed as polyhedral constraints on y. The
p-values associated with these tests are exactly uniform under the null
distribution, in finite samples, yielding exact type I error control. The tests
can also be inverted to produce confidence intervals for appropriate underlying
regression parameters. The R package "selectiveInference", freely available on
the CRAN repository, implements the new inference tools described in this
paper.
Comment: 26 pages, 5 figures.
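To make the polyhedral conditioning concrete, the sketch below computes a one-sided truncated-Gaussian p-value for a linear contrast eta'mu given a selection event of the form {Ay <= b}, with covariance sigma^2 I. Deriving the matrices A and b that encode forward stepwise, LAR, or lasso selection is the substance of the paper and is not done here; for actual use, the selectiveInference R package is the reference implementation. Assumes NumPy and SciPy.

    # Sketch of the truncated-Gaussian p-value for eta'mu after a selection
    # event of polyhedral form {A y <= b}, with y ~ N(mu, sigma^2 I).
    # Constructing A and b for stepwise/LAR/lasso steps is not done here.
    import numpy as np
    from scipy.stats import norm

    def polyhedral_pvalue(y, A, b, eta, sigma):
        s2 = sigma**2 * (eta @ eta)          # Var(eta'y)
        c = sigma**2 * eta / s2              # = eta / (eta'eta)
        t = eta @ y                          # observed value of eta'y
        z = y - c * t                        # the part of y we condition on
        Ac, Az = A @ c, A @ z
        # {A y <= b}  <=>  Ac_j * t <= (b - A z)_j for all j: an interval for t.
        neg, pos = Ac < 0, Ac > 0
        lower = np.max((b - Az)[neg] / Ac[neg], initial=-np.inf)
        upper = np.min((b - Az)[pos] / Ac[pos], initial=np.inf)
        s = np.sqrt(s2)
        # Truncated-normal tail probability of t under eta'mu = 0
        # (one-sided, against eta'mu > 0).
        num = norm.cdf(upper / s) - norm.cdf(t / s)
        den = norm.cdf(upper / s) - norm.cdf(lower / s)
        return float(num / den)
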
Excess Optimism: How Biased is the Apparent Error of an Estimator Tuned by SURE?
Nearly all estimators in statistical prediction come with an associated
tuning parameter, in one way or another. Common practice, given data, is to
choose the tuning parameter value that minimizes a constructed estimate of the
prediction error of the estimator; we focus on Stein's unbiased risk estimator,
or SURE (Stein, 1981; Efron, 1986), which forms an unbiased estimate of the
prediction error by augmenting the observed training error with an estimate of
the degrees of freedom of the estimator. Parameter tuning via SURE minimization
has been advocated by many authors, in a wide variety of problem settings, and
in general, it is natural to ask: what is the prediction error of the
SURE-tuned estimator? An obvious strategy would be to simply use the apparent
error estimate as reported by SURE, i.e., the value of the SURE criterion at
its minimum, to estimate the prediction error of the SURE-tuned estimator. But
this is no longer unbiased; in fact, we would expect the minimum of the
SURE criterion to be systematically biased downward as an estimate of the true
prediction error. In this paper, we formally describe and study this bias.
Comment: 39 pages, 3 figures.
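The bias can be reproduced in a toy setting. The sketch below is an illustration only, not the paper's analysis: it uses soft thresholding of a Gaussian mean, one standard case where SURE has a closed form, tunes the threshold by minimizing SURE, and compares the minimized SURE value with the realized loss of the tuned estimator. Assumes NumPy.

    # Simulation sketch: tune soft thresholding by minimizing SURE, then compare
    # the minimized SURE value ("apparent" risk) with the realized loss of the
    # SURE-tuned estimator.  The gap illustrates the excess optimism studied
    # in the paper.  Assumes NumPy.
    import numpy as np

    rng = np.random.default_rng(0)
    n, sigma = 100, 1.0
    mu = np.concatenate([np.full(10, 3.0), np.zeros(n - 10)])   # sparse mean
    lams = np.linspace(0.0, 4.0, 81)

    def soft(y, lam):
        return np.sign(y) * np.maximum(np.abs(y) - lam, 0.0)

    def sure(y, lam):
        # Unbiased estimate of E||soft(y, lam) - mu||^2 for y ~ N(mu, sigma^2 I).
        return (np.sum(np.minimum(y**2, lam**2)) - n * sigma**2
                + 2 * sigma**2 * np.sum(np.abs(y) > lam))

    min_sure, realized_loss = [], []
    for _ in range(2000):
        y = mu + sigma * rng.normal(size=n)
        vals = np.array([sure(y, lam) for lam in lams])
        best = lams[np.argmin(vals)]
        min_sure.append(vals.min())
        realized_loss.append(np.sum((soft(y, best) - mu) ** 2))

    print("mean minimized SURE :", np.mean(min_sure))
    print("mean realized loss  :", np.mean(realized_loss))   # typically larger
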
From Fixed-X to Random-X Regression: Bias-Variance Decompositions, Covariance Penalties, and Prediction Error Estimation
In statistical prediction, classical approaches for model selection and model
evaluation based on covariance penalties are still widely used. Most of the
literature on this topic is based on what we call the "Fixed-X" assumption,
where covariate values are assumed to be nonrandom. By contrast, it is often
more reasonable to take a "Random-X" view, where the covariate values are
independently drawn for both training and prediction. To study the
applicability of covariance penalties in this setting, we propose a
decomposition of Random-X prediction error in which the randomness in the
covariates contributes to both the bias and variance components. This
decomposition is general, but we concentrate on the fundamental case of least
squares regression. We prove that in this setting the move from Fixed-X to
Random-X prediction results in an increase in both bias and variance. When the
covariates are normally distributed and the linear model is unbiased, all terms
in this decomposition are explicitly computable, which yields an extension of
Mallows' Cp that we call $RCp$. $RCp$ also holds asymptotically for certain
classes of nonnormal covariates. When the noise variance is unknown, plugging
in the usual unbiased estimate leads to an approach that we call $\widehat{RCp}$,
which is closely related to Sp (Tukey 1967) and GCV (Craven and Wahba 1978).
For excess bias, we propose an estimate based on the "shortcut-formula" for
ordinary cross-validation (OCV), resulting in an approach we call $RCp^+$.
Theoretical arguments and numerical simulations suggest that $RCp^+$ is
typically superior to OCV, though the difference is small. We further examine
the Random-X error of other popular estimators. The surprising result we get
for ridge regression is that, in the heavily-regularized regime, Random-X
variance is smaller than Fixed-X variance, which can lead to smaller overall
Random-X error.
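A quick Monte Carlo check of the Fixed-X versus Random-X comparison for least squares. This is an illustration only, assuming NumPy; the constants quoted in the comments are the standard Gaussian-design formulas, not results taken from the paper.

    # Monte Carlo sketch comparing Fixed-X and Random-X prediction error of OLS
    # in a well-specified linear model with Gaussian covariates.  Here only the
    # variance term grows, since the model is unbiased.  Assumes NumPy.
    import numpy as np

    rng = np.random.default_rng(0)
    n, p, sigma, reps = 50, 10, 1.0, 4000
    beta = rng.normal(size=p)

    fixed_err, random_err = [], []
    for _ in range(reps):
        X = rng.normal(size=(n, p))
        y = X @ beta + sigma * rng.normal(size=n)
        bhat = np.linalg.lstsq(X, y, rcond=None)[0]
        # Fixed-X: fresh noise at the same covariate values.
        y_new = X @ beta + sigma * rng.normal(size=n)
        fixed_err.append(np.mean((y_new - X @ bhat) ** 2))
        # Random-X: fresh covariates drawn from the same distribution.
        X_new = rng.normal(size=(n, p))
        y_star = X_new @ beta + sigma * rng.normal(size=n)
        random_err.append(np.mean((y_star - X_new @ bhat) ** 2))

    print("Fixed-X error :", np.mean(fixed_err))    # ~ sigma^2 * (1 + p/n)
    print("Random-X error:", np.mean(random_err))   # ~ sigma^2 * (1 + p/(n-p-1))
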
The solution path of the generalized lasso
We present a path algorithm for the generalized lasso problem. This problem
penalizes the $\ell_1$ norm of a matrix D times the coefficient vector, and has
a wide range of applications, dictated by the choice of D. Our algorithm is
based on solving the dual of the generalized lasso, which greatly facilitates
computation of the path. For $D = I$ (the usual lasso), we draw a connection
between our approach and the well-known LARS algorithm. For an arbitrary D, we
derive an unbiased estimate of the degrees of freedom of the generalized lasso
fit. This estimate turns out to be quite intuitive in many applications.
Comment: Published at http://dx.doi.org/10.1214/11-AOS878 in the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
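For prototyping at a single value of the regularization parameter, the generalized lasso can also be handed to a generic convex solver. The sketch below is not the paper's dual path algorithm, only a direct formulation of 0.5*||y - Xb||^2 + lambda*||Db||_1; it assumes NumPy and cvxpy, and uses a first-difference matrix D as one example choice.

    # Direct formulation of the generalized lasso with a generic convex solver.
    # Not the paper's dual path algorithm; only a way to check a solution at a
    # single lambda.  Assumes cvxpy is installed.
    import numpy as np
    import cvxpy as cp

    rng = np.random.default_rng(0)
    n, p = 60, 40
    X = rng.normal(size=(n, p))
    beta0 = np.repeat([0.0, 2.0, -1.0, 0.0], 10)   # piecewise-constant signal
    y = X @ beta0 + 0.5 * rng.normal(size=n)

    # D as first differences gives a fused-lasso penalty; other D give other problems.
    D = np.diff(np.eye(p), axis=0)

    lam = 5.0
    b = cp.Variable(p)
    obj = cp.Minimize(0.5 * cp.sum_squares(y - X @ b) + lam * cp.norm1(D @ b))
    cp.Problem(obj).solve()
    print(np.round(b.value, 2))    # estimate is roughly piecewise constant
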
- …