Cross-validation is the de facto standard for predictive model evaluation and
selection. In proper use, it provides an unbiased estimate of a model's
predictive performance. However, data sets often undergo various forms of
data-dependent preprocessing, such as mean-centering, rescaling, dimensionality
reduction, and outlier removal. It is commonly believed that such preprocessing
stages, if performed in an unsupervised manner (i.e., one that does not
incorporate the class labels or response values), are generally safe to apply
prior to cross-validation.
In this paper, we study three commonly-practiced preprocessing procedures prior
to a regression analysis: (i) variance-based feature selection; (ii) grouping
of rare categorical features; and (iii) feature rescaling. We demonstrate that
unsupervised preprocessing procedures can, in fact, introduce a large bias into
cross-validation estimates and potentially lead to sub-optimal model selection.
This bias may be either positive or negative and its exact magnitude depends on
all the parameters of the problem in an intricate manner. Further research is
needed to understand the real-world impact of this bias across different
application domains, particularly when dealing with low sample counts and
high-dimensional data.

Comment: 29 pages, 4 figures, 1 table
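As a minimal sketch of the two workflows under discussion, the snippet below contrasts variance-based feature selection fitted on the full data set before cross-validation ("outside") with the same selector refit inside each training fold via a pipeline ("inside"). The synthetic data, threshold value, and choice of ridge regression are illustrative assumptions, not taken from the paper; the magnitude (and sign) of any gap between the two estimates depends on the problem parameters, as the abstract notes.

```python
import numpy as np
from sklearn.feature_selection import VarianceThreshold
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline

rng = np.random.default_rng(0)
n, p = 40, 500                      # low sample count, high dimension (illustrative)
X = rng.normal(size=(n, p))
y = X[:, 0] + rng.normal(scale=0.5, size=n)

# "Outside": features are selected using the variance of the FULL data set,
# so the held-out folds influence the preprocessing step.
X_sel = VarianceThreshold(threshold=1.0).fit_transform(X)
outside = cross_val_score(Ridge(), X_sel, y, cv=5).mean()

# "Inside": the selector is refit on each training fold, so the held-out
# fold never influences the preprocessing.
pipe = make_pipeline(VarianceThreshold(threshold=1.0), Ridge())
inside = cross_val_score(pipe, X, y, cv=5).mean()

print(f"outside-CV estimate: {outside:.3f}")
print(f"inside-CV estimate:  {inside:.3f}")
```

The pipeline form is the standard safeguard against such leakage; the paper's point is that even the unsupervised "outside" variant, often assumed harmless, can bias the cross-validation estimate.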