Optimal classifier selection and negative bias in error rate estimation: An empirical study on high-dimensional prediction

By Anne-Laure Boulesteix and Carolin Strobl


In biometric practice, researchers often apply a large number of different methods in a "trial-and-error" strategy to get as much as possible out of their data and, due to publication pressure or pressure from the consulting customer, present only the most favorable results. This strategy may induce a substantial optimistic bias in prediction error estimation, which is quantitatively assessed in the present manuscript. The focus of our work is on class prediction based on high-dimensional data (e.g. microarray data), since such analyses are particularly exposed to this kind of bias. In our study we consider a total of 124 variants of classifiers (possibly including variable selection or tuning steps) within a cross-validation evaluation scheme. The classifiers are applied to original and modified real microarray data sets, some of which are obtained by randomly permuting the class labels to mimic non-informative predictors while preserving their correlation structure. We then assess the minimal misclassification rate over the different variants of classifiers in order to quantify the bias arising when the optimal classifier is selected a posteriori in a data-driven manner. The bias resulting from parameter tuning (including gene selection parameters as a special case) and the bias resulting from the choice of classification method are examined both separately and jointly. We conclude that the strategy of presenting only the optimal result is not acceptable, and suggest alternative approaches for properly reporting classification accuracy.
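The selection effect described in the abstract can be illustrated with a small simulation. The sketch below is not the authors' procedure: it replaces the 124 real classifier variants with hypothetical "classifiers" that guess at random, assumes their errors are independent (real variants are correlated, so this toy setup likely overstates the effect), and uses an arbitrary sample size of 50. All function names and parameters are illustrative assumptions.

```python
import random


def min_error_under_null(n_samples=50, n_variants=124, rng=None):
    """Return (minimal error, average error) over variants on non-informative data.

    Assumption: each 'variant' predicts labels uniformly at random, so no
    variant can truly do better than an error rate of 0.5.
    """
    rng = rng or random.Random()
    # Balanced class labels, randomly permuted to mimic non-informative predictors.
    labels = [i % 2 for i in range(n_samples)]
    rng.shuffle(labels)

    errors = []
    for _ in range(n_variants):
        preds = [rng.randint(0, 1) for _ in range(n_samples)]
        n_wrong = sum(p != y for p, y in zip(preds, labels))
        errors.append(n_wrong / n_samples)

    return min(errors), sum(errors) / len(errors)


def simulate(n_rep=200, seed=0):
    """Average the minimal and the mean error rate over n_rep repetitions."""
    rng = random.Random(seed)
    results = [min_error_under_null(rng=rng) for _ in range(n_rep)]
    mean_min = sum(r[0] for r in results) / n_rep
    mean_avg = sum(r[1] for r in results) / n_rep
    return mean_min, mean_avg


if __name__ == "__main__":
    mean_min, mean_avg = simulate()
    print(f"average error over all variants: {mean_avg:.3f}")
    print(f"average of the minimal error:    {mean_min:.3f}")
```

In this toy setting the average error over all variants stays close to 0.5, while the minimal error selected a posteriori falls well below it, even though no variant carries any signal. Reporting only that minimum would therefore give a misleadingly favorable picture.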

Topics: Technical Reports, ddc:510
Year: 2009
DOI identifier: 10.1186/1471-2288-9-85
OAI identifier: oai:epub.ub.uni-muenchen.de:10606
Provided by: Open Access LMU
