Choosing a penalty for model selection in heteroscedastic regression
We consider the problem of choosing between several models in least-squares
regression with heteroscedastic data. We prove that any penalization procedure
is suboptimal when the penalty is a function of the dimension of the model, at
least for some typical heteroscedastic model selection problems. In particular,
Mallows' Cp is suboptimal in this framework. On the contrary, optimal model
selection is possible with data-driven penalties such as resampling or V-fold
penalties. Therefore, it is worth estimating the shape of the penalty from
data, even at the price of a higher computational cost. Simulation experiments
illustrate the existence of a trade-off between statistical accuracy and
computational complexity. In conclusion, we sketch some rules for choosing a
penalty in least-squares regression, depending on what is known about possible
variations of the noise level.
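For concreteness, here is a minimal sketch of the kind of dimension-based penalization the abstract discusses: Mallows' Cp selects the model minimizing the residual sum of squares plus twice the noise variance times the dimension. The function names, the plug-in value of sigma2, and the toy heteroscedastic data are illustrative assumptions, not the paper's experimental setup.

```python
import numpy as np

def mallows_cp_select(X, y, sigma2):
    """Select among nested least-squares models via Mallows' Cp.

    X      : (n, p) design matrix; model m uses the first m columns.
    sigma2 : plug-in noise variance, assumed constant. With
             heteroscedastic noise this dimension-based penalty can be
             suboptimal, which is the point made in the abstract above.
    """
    n, p = X.shape
    crits = []
    for m in range(1, p + 1):
        # Least-squares fit on the first m columns.
        beta, *_ = np.linalg.lstsq(X[:, :m], y, rcond=None)
        rss = np.sum((y - X[:, :m] @ beta) ** 2)
        # Cp-type criterion: RSS + 2 * sigma^2 * dimension.
        crits.append(rss + 2.0 * sigma2 * m)
    return int(np.argmin(crits)) + 1  # selected model dimension

# Toy example: polynomial regression with noise whose level changes with x.
rng = np.random.default_rng(0)
n = 200
x = rng.uniform(-1, 1, n)
y = np.sin(3 * x) + (0.1 + 0.5 * (x > 0)) * rng.normal(size=n)  # heteroscedastic
X = np.vander(x, 8, increasing=True)  # nested polynomial models
print("selected dimension:", mallows_cp_select(X, y, sigma2=0.2))
```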
Analysis of purely random forests bias
Random forests are a very effective and commonly used statistical method, but
their full theoretical analysis is still an open problem. As a first step,
simplified models such as purely random forests have been introduced, in order
to shed light on the good performance of random forests. In this paper, we
study the approximation error (the bias) of some purely random forest models in
a regression framework, focusing in particular on the influence of the number
of trees in the forest. Under some regularity assumptions on the regression
function, we show that the bias of an infinite forest decreases at a faster
rate (with respect to the size of each tree) than a single tree. As a
consequence, infinite forests attain a strictly better risk rate (with respect
to the sample size) than single trees. Furthermore, our results allow us to derive
a minimum number of trees sufficient to reach the same rate as an infinite
forest. As a by-product of our analysis, we also show a link between the bias
of purely random forests and the bias of some kernel estimators.
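To make the "purely random" mechanism concrete, here is a small sketch in which each tree's partition of [0, 1] is drawn independently of the data and the forest simply averages trees. The partition scheme, the toy signal, and all names are illustrative assumptions, not the specific models analyzed in the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

def purely_random_tree(x_train, y_train, k):
    """One 'purely random' tree on [0, 1]: the partition into k cells is
    drawn independently of the data (uniform split points), and each cell
    predicts the mean of the training responses falling into it."""
    splits = np.sort(rng.uniform(0, 1, k - 1))
    edges = np.concatenate(([0.0], splits, [1.0]))
    cell = np.searchsorted(edges, x_train, side="right") - 1
    means = np.array([y_train[cell == c].mean() if np.any(cell == c) else 0.0
                      for c in range(k)])
    return lambda x: means[np.searchsorted(edges, x, side="right") - 1]

def forest_predict(x_train, y_train, x_test, k, n_trees):
    """Average n_trees independent purely random trees; the abstract's
    point is that this average has a smaller bias than a single tree."""
    preds = [purely_random_tree(x_train, y_train, k)(x_test)
             for _ in range(n_trees)]
    return np.mean(preds, axis=0)

n = 2000
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.1 * rng.normal(size=n)
x_test = np.linspace(0.05, 0.95, 50)
truth = np.sin(2 * np.pi * x_test)
err_tree = np.mean((forest_predict(x, y, x_test, 20, 1) - truth) ** 2)
err_forest = np.mean((forest_predict(x, y, x_test, 20, 100) - truth) ** 2)
print(f"single tree MSE: {err_tree:.4f}, forest MSE: {err_forest:.4f}")
```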
Some nonasymptotic results on resampling in high dimension, I: Confidence regions, II: Multiple tests
We study generalized bootstrap confidence regions for the mean of a random
vector whose coordinates have an unknown dependency structure. The random
vector is supposed to be either Gaussian or to have a symmetric and bounded
distribution. The dimensionality of the vector can possibly be much larger than
the number of observations and we focus on a nonasymptotic control of the
confidence level, following ideas inspired by recent results in learning
theory. We consider two approaches, the first based on a concentration
principle (valid for a large class of resampling weights) and the second on a
resampled quantile, specifically using Rademacher weights. Several intermediate
results established in the approach based on concentration principles are of
interest in their own right. We also discuss the question of accuracy when
using Monte Carlo approximations of the resampled quantities.
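As an illustration of the resampled-quantile approach with Rademacher weights, here is a heuristic sketch; the paper's exact calibration constants and remainder terms are omitted, and the function names and toy dependent data are assumptions.

```python
import numpy as np

def rademacher_sup_quantile(Y, alpha=0.05, B=2000, rng=None):
    """Resampled quantile for the sup-norm deviation of the empirical mean.

    Y : (n, K) array of n i.i.d. observations of a K-dimensional vector
        (K may be much larger than n; coordinates may be dependent).
    Returns a data-driven threshold t such that
    {mu : max_k |Ybar_k - mu_k| <= t} is a (heuristic) confidence region.
    This is only the plug-in Rademacher-symmetrization version, without
    the paper's exact calibration.
    """
    rng = rng or np.random.default_rng(0)
    n, K = Y.shape
    centered = Y - Y.mean(axis=0)
    sups = np.empty(B)
    for b in range(B):
        eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher weights
        sups[b] = np.max(np.abs(eps @ centered) / n)
    return np.quantile(sups, 1 - alpha)

# Toy example: K = 500 strongly dependent coordinates, n = 50 observations.
rng = np.random.default_rng(42)
n, K = 50, 500
z = rng.normal(size=(n, 1))
Y = 0.7 * z + 0.3 * rng.normal(size=(n, K))
print("sup-norm threshold:", rademacher_sup_quantile(Y))
```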
Multi-task Regression using Minimal Penalties
In this paper we study the kernel multiple ridge regression framework, which
we refer to as multi-task regression, using penalization techniques. The
theoretical analysis of this problem shows that the key element appearing for
an optimal calibration is the covariance matrix of the noise between the
different tasks. We present a new algorithm to estimate this covariance matrix,
based on the concept of minimal penalty, which was previously used in the
single-task regression framework to estimate the variance of the noise. We
show, in a non-asymptotic setting and under mild assumptions on the target
function, that this estimator converges towards the covariance matrix. Then
plugging this estimator into the corresponding ideal penalty leads to an oracle
inequality. We illustrate the behavior of our algorithm on synthetic examples.
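The following sketch illustrates the minimal-penalty idea in the single-task case mentioned in the abstract: the noise variance is read off the "dimension jump" of the selected model as the penalty constant grows. The Fourier design, the constant grid, and the jump-detection rule are illustrative simplifications, not the paper's multi-task algorithm.

```python
import numpy as np

def minimal_penalty_variance(rss, dims, n, grid=None):
    """Single-task minimal-penalty estimator of the noise variance.

    For penalties C * D / n, the selected dimension drops sharply
    (the 'dimension jump') around C = sigma^2; the jump location is the
    variance estimate, and 2 * C_hat * D / n is the calibrated penalty.
    """
    if grid is None:
        grid = np.linspace(0.01, 3.0, 600)
    selected = np.array([dims[np.argmin(rss / n + C * dims / n)] for C in grid])
    jump = np.argmax(-np.diff(selected))   # biggest drop in selected dimension
    return grid[jump + 1]

# Toy single-task example with a well-conditioned Fourier design.
rng = np.random.default_rng(3)
n, sigma = 500, 0.8
x = np.linspace(0, 1, n, endpoint=False)
cols = [np.ones(n)]
for k in range(1, 21):
    cols += [np.sin(2 * np.pi * k * x), np.cos(2 * np.pi * k * x)]
X = np.column_stack(cols)
y = np.sin(2 * np.pi * x) + 0.5 * np.sin(6 * np.pi * x) + sigma * rng.normal(size=n)
dims = np.arange(1, X.shape[1] + 1)
rss = np.array([np.sum((y - X[:, :m] @ np.linalg.lstsq(X[:, :m], y, rcond=None)[0]) ** 2)
                for m in dims])
print("estimated noise variance:", minimal_penalty_variance(rss, dims, n),
      "(true:", sigma ** 2, ")")
```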
Coupling the Yoccoz-Birkeland population model with price dynamics: chaotic livestock commodities market cycles
We propose a new model for the time evolution of livestock commodities which
exhibits endogenous, deterministic yet stochastic-looking (chaotic) behaviour. The model is based on
the Yoccoz-Birkeland integral equation, first developed to study the time
evolution of a single species with high average fertility, a relatively
short mating season, and density-dependent reproduction rates. This equation is
then coupled with a differential equation describing the price of a livestock
commodity, driven by the imbalance between its demand and supply. At birth,
the cattle population is split into two parts: reproducing females and cattle
for butchery. The relative size of the two is determined by the spot price of
the meat. We prove the existence of an attractor and we investigate numerically
its properties: the strange attractor existing for the original
Yoccoz-Birkeland model persists, but its chaotic behaviour also depends on
the price evolution in an essential way.
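A heavily simplified, discrete-time caricature of such a coupling is sketched below. The functional forms and parameter values are invented for illustration and will not reproduce the paper's attractor, but they show the feedback loop between delayed, density-dependent reproduction and a price driven by the demand-supply imbalance.

```python
import numpy as np

T, tau = 3000, 20          # time horizon, reproduction delay
eta, mu = 0.05, 0.05       # price stiffness, death rate of breeding females
N = np.full(T, 0.5)        # breeding female population
p = np.full(T, 1.0)        # spot price of the commodity

def fertility(n):
    # Density-dependent per-capita fertility: high at low density,
    # collapsing at high density (Ricker-type nonlinearity).
    return 8.0 * np.exp(-3.0 * n)

def kept_fraction(price):
    # Share of newborn females kept for breeding; a high spot price
    # sends more animals to butchery, so this decreases with price.
    return 0.2 + 0.6 / (1.0 + price)

def demand(price):
    # Downward-sloping demand for the commodity.
    return 2.0 / (1.0 + price)

for t in range(tau, T - 1):
    births = fertility(N[t - tau]) * N[t - tau]       # delayed reproduction
    N[t + 1] = (1 - mu) * N[t] + kept_fraction(p[t]) * births
    supply = (1 - kept_fraction(p[t])) * births       # newborns sent to butchery
    p[t + 1] = max(1e-6, p[t] + eta * (demand(p[t]) - supply))

print("price over the last steps:", np.round(p[-6:], 3))
```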
Metric Learning for Temporal Sequence Alignment
In this paper, we propose to learn a Mahalanobis distance to perform
alignment of multivariate time series. The learning examples for this task are
time series for which the true alignment is known. We cast the alignment
problem as a structured prediction task, and propose realistic losses between
alignments for which the optimization is tractable. We provide experiments on
real data in the audio to audio context, where we show that the learning of a
similarity measure leads to improvements in the performance of the alignment
task. We also propose to use this metric learning framework to perform feature
selection and, from basic audio features, build a combination of them with
better performance for the alignment task.
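As an illustration of alignment under a learned metric, here is a sketch of dynamic time warping with a Mahalanobis frame distance. Learning the matrix M itself (the structured-prediction part of the paper) is not shown; the alignment routine, names, and toy signals are assumptions rather than the paper's exact setup.

```python
import numpy as np

def dtw_align(A, B, M):
    """Dynamic-time-warping alignment of two multivariate sequences
    A (Ta, d) and B (Tb, d), using the Mahalanobis frame distance
    (a - b)^T M (a - b); M is taken as given (learned elsewhere)."""
    Ta, Tb = len(A), len(B)
    diff = A[:, None, :] - B[None, :, :]              # (Ta, Tb, d)
    cost = np.einsum("ijd,de,ije->ij", diff, M, diff)
    D = np.full((Ta + 1, Tb + 1), np.inf)
    D[0, 0] = 0.0
    for i in range(1, Ta + 1):
        for j in range(1, Tb + 1):
            D[i, j] = cost[i - 1, j - 1] + min(D[i - 1, j],
                                               D[i, j - 1],
                                               D[i - 1, j - 1])
    # Backtrack the optimal warping path.
    path, i, j = [], Ta, Tb
    while (i, j) != (0, 0):
        path.append((i - 1, j - 1))
        i, j = min([(i - 1, j - 1), (i - 1, j), (i, j - 1)],
                   key=lambda s: D[s])
    return D[Ta, Tb], path[::-1]

# Toy example: the same audio-like signal at two speeds, identity metric.
t = np.linspace(0, 1, 60)
A = np.column_stack([np.sin(4 * np.pi * t), np.cos(4 * np.pi * t)])
B = A[::2]                                # B is A subsampled (played faster)
score, path = dtw_align(A, B, np.eye(2))
print("alignment cost:", round(score, 4), "path length:", len(path))
```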
Model selection by resampling penalization
In this paper, a new family of resampling-based penalization procedures for
model selection is defined in a general framework. It generalizes several
methods, including Efron's bootstrap penalization and the leave-one-out
penalization recently proposed by Arlot (2008), to any exchangeable weighted
bootstrap resampling scheme. In the heteroscedastic regression framework,
assuming the models to have a particular structure, these resampling penalties
are proved to satisfy a non-asymptotic oracle inequality with leading constant
close to 1. In particular, they are asymptotically optimal. Resampling penalties
are used for defining an estimator adapting simultaneously to the smoothness of
the regression function and to the heteroscedasticity of the noise. This is
remarkable because resampling penalties are general-purpose devices, which have
not been built specifically to handle heteroscedastic data. Hence, resampling
penalties naturally adapt to heteroscedasticity. A simulation study shows that
resampling penalties improve on V-fold cross-validation in terms of final
prediction error, in particular when the signal-to-noise ratio is not large.
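A sketch of one such resampling penalty is given below for a regressogram (histogram) model, using Efron bootstrap weights as the exchangeable weights. The overpenalization constant in front of the penalty is set to 1 here, and all names and the toy data are illustrative.

```python
import numpy as np

def resampling_penalty(x, y, edges, B=200, rng=None):
    """Resampling-based penalty for a regressogram model.

    For each resample with exchangeable weights W (here Efron bootstrap
    weights, W ~ Multinomial(n, 1/n)), refit the histogram estimator with
    weighted cell means and measure how much the weighted empirical risk
    under-estimates the risk on the original sample. The penalty is the
    average of that optimism over B resamples (constant set to 1).
    """
    rng = rng or np.random.default_rng(0)
    n = len(x)
    cell = np.searchsorted(edges, x, side="right") - 1
    k = len(edges) - 1
    pens = np.empty(B)
    for b in range(B):
        w = rng.multinomial(n, np.ones(n) / n).astype(float)
        # Weighted cell means; cells with zero bootstrap weight predict 0.
        mw = np.array([np.average(y[cell == c], weights=w[cell == c])
                       if w[cell == c].sum() > 0 else 0.0
                       for c in range(k)])
        resid = (y - mw[cell]) ** 2
        # Optimism: risk on the original sample minus weighted resample risk.
        pens[b] = resid.mean() - np.average(resid, weights=w)
    return pens.mean()

rng = np.random.default_rng(7)
n = 300
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + (0.2 + 0.6 * x) * rng.normal(size=n)  # heteroscedastic
edges = np.linspace(0, 1, 11)  # 10-cell regressogram
print("resampling penalty:", resampling_penalty(x, y, edges))
```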
V-fold cross-validation improved: V-fold penalization
We study the efficiency of V-fold cross-validation (VFCV) for model selection
from the non-asymptotic viewpoint, and suggest an improvement on it, which we
call ``V-fold penalization''. Considering a particular (though simple)
regression problem, we prove that VFCV with a bounded V is suboptimal for model
selection because it ``overpenalizes'', all the more so as V is small. Hence,
asymptotic optimality requires V to go to infinity. However, when the
signal-to-noise ratio is low, it appears that overpenalizing is necessary, so
that the optimal V is not always the largest one, despite the variability
issue. This is confirmed by some simulated data. In order to improve on the
prediction performance of VFCV, we define a new model selection procedure,
called ``V-fold penalization'' (penVF). It is a V-fold subsampling version of
Efron's bootstrap penalties, so that it has the same computational cost as
VFCV, while being more flexible. In a heteroscedastic regression framework,
assuming the models to have a particular structure, we prove that penVF
satisfies a non-asymptotic oracle inequality with a leading constant that tends
to 1 when the sample size goes to infinity. In particular, this implies
adaptivity to the smoothness of the regression function, even with a highly
heteroscedastic noise. Moreover, it is easy to overpenalize with penVF,
independently from the V parameter. A simulation study shows that this results
in a significant improvement on VFCV in non-asymptotic situations.
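The following sketch implements a penVF-style criterion for histogram regression, under a natural reading of the abstract: the full-sample empirical risk plus (V - 1) times the average optimism of the leave-one-fold-out fits. The histogram models and toy data are illustrative assumptions.

```python
import numpy as np

def hist_fit(x, y, edges):
    """Regressogram: mean of y in each cell of the partition."""
    cell = np.searchsorted(edges, x, side="right") - 1
    k = len(edges) - 1
    means = np.array([y[cell == c].mean() if np.any(cell == c) else 0.0
                      for c in range(k)])
    return lambda xx: means[np.searchsorted(edges, xx, side="right") - 1]

def vfold_penalized_crit(x, y, edges, V=5):
    """V-fold penalized criterion (penVF-style): the empirical risk of the
    full-sample fit plus (V - 1) times the average 'optimism' of the V
    leave-one-fold-out fits. The factor V - 1 makes the penalty roughly
    unbiased; multiplying it further overpenalizes, which the abstract
    notes can help when the signal-to-noise ratio is low."""
    n = len(x)
    folds = np.arange(n) % V
    full = hist_fit(x, y, edges)
    emp_risk = np.mean((y - full(x)) ** 2)
    pen = 0.0
    for v in range(V):
        f = hist_fit(x[folds != v], y[folds != v], edges)
        pen += np.mean((y - f(x)) ** 2) - np.mean(
            (y[folds != v] - f(x[folds != v])) ** 2)
    return emp_risk + (V - 1) / V * pen

rng = np.random.default_rng(11)
n = 400
x = rng.uniform(0, 1, n)
y = np.sin(2 * np.pi * x) + 0.5 * rng.normal(size=n)
crits = {k: vfold_penalized_crit(x, y, np.linspace(0, 1, k + 1))
         for k in (2, 5, 10, 20, 40)}
print("selected number of cells:", min(crits, key=crits.get))
```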
Minimal penalties and the slope heuristics: a survey
Birgé and Massart proposed in 2001 the slope heuristics as a way to
choose optimally from data an unknown multiplicative constant in front of a
penalty. It is built upon the notion of minimal penalty, and it has been
generalized since to some "minimal-penalty algorithms". This paper reviews the
theoretical results obtained for such algorithms, with a self-contained proof
in the simplest framework, precise proof ideas for further generalizations, and
a few new results. Explicit connections are made with residual-variance
estimators (with an original contribution on this topic, showing that for this
task the slope heuristics performs almost as well as a residual-based estimator
with the best model choice) and with some classical algorithms such as the
L-curve or elbow heuristics, Mallows' Cp, and Akaike's FPE. Practical issues are also
addressed, including two new practical definitions of minimal-penalty
algorithms that are compared on synthetic data to previously-proposed
definitions. Finally, several conjectures and open problems are suggested as
future research directions.
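Complementing the dimension-jump sketch given earlier, here is the "slope" flavor of the heuristics on a synthetic risk curve: the minimal-penalty constant is read off the slope of the empirical risk over the largest models, and the final penalty uses twice that constant. The risk model, fitting range, and names are illustrative assumptions, not one of the survey's precise definitions.

```python
import numpy as np

def slope_heuristics_select(rss, dims, n, fit_from=0.5):
    """'Slope' implementation of the slope heuristics: regress the
    empirical risk on the penalty shape (here D / n) over the largest
    models, where the bias is negligible; minus the fitted slope
    estimates the minimal-penalty constant kappa_min, and the final
    penalty is 2 * kappa_min * D / n. The choice of the fitting range
    (fit_from) is one of the practical issues the survey discusses."""
    big = dims >= fit_from * dims.max()
    slope, _ = np.polyfit(dims[big] / n, rss[big] / n, 1)
    kappa_min = -slope
    crit = rss / n + 2.0 * kappa_min * dims / n
    return dims[np.argmin(crit)], kappa_min

# Synthetic risk curve mimicking E[RSS(D)] = n * bias(D) + sigma^2 * (n - D),
# with a bias term decaying like 5 / D^2.
rng = np.random.default_rng(5)
n, sigma2 = 1000, 0.5
dims = np.arange(1, 201)
rss = 5000.0 / dims**2 + sigma2 * (n - dims) + rng.normal(scale=2.0, size=dims.size)
d_hat, kappa = slope_heuristics_select(rss, dims, n)
print(f"selected dimension: {d_hat}, "
      f"kappa_min estimate: {kappa:.3f} (true sigma^2 = {sigma2})")
```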