Abstract

Background: We consider the problem of designing a study to develop a predictive classifier from high-dimensional data. A common study design is to split the sample into a training set and an independent test set, where the former is used to develop the classifier and the latter to evaluate its performance. In this paper we address the question of what proportion of the samples should be devoted to the training set, and how this proportion affects the mean squared error (MSE) of the prediction accuracy estimate.

Results: We develop a nonparametric algorithm for determining an optimal splitting proportion that can be applied with a specific dataset and classifier algorithm. We also perform a broad simulation study to better understand the factors that determine the best split proportions and to evaluate commonly used splitting strategies (1/2 training or 2/3 training) under a wide variety of conditions. These methods are based on a decomposition of the MSE into three intuitive component parts.

Conclusions: By applying these approaches to a number of synthetic and real microarray datasets, we show that for linear classifiers the optimal proportion depends on the overall number of samples available and the degree of differential expression between the classes. The optimal proportion was found to depend on the full dataset size (n) and classification accuracy, with higher accuracy and smaller n resulting in more cases assigned to the training set. The commonly used strategy of allocating two-thirds of cases for training was close to optimal for reasonably sized datasets (n ≥ 100) with strong signals (i.e., 85% or greater full dataset accuracy). In general, we recommend use of our nonparametric resampling approach for determining the optimal split. This approach can be applied to any dataset, using any predictor development method, to determine the best split.
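To make the quantity discussed in the abstract concrete, the following is a minimal simulation sketch, not the authors' algorithm. It estimates, for several training proportions, the MSE of the test-set accuracy estimate relative to the true accuracy of the classifier built on all cases. Everything specific here is an assumption made for illustration: the Gaussian data model with unit variances, the nearest-centroid classifier (a simple linear rule), the stratified split, and all function and parameter names (`mse_by_fraction`, `n_per_class`, `delta`, and so on).

```python
import math
import random


def phi(z):
    """Standard normal CDF."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def draw_class(m, p, k, shift, rng):
    """m cases, p unit-variance Gaussian features; the first k have mean `shift`."""
    return [[rng.gauss(shift if j < k else 0.0, 1.0) for j in range(p)]
            for _ in range(m)]


def centroid(rows, p):
    """Feature-wise mean of a list of cases."""
    return [sum(r[j] for r in rows) / len(rows) for j in range(p)]


def fit(train0, train1, p):
    """Nearest-centroid rule: predict class 1 iff sum_j w[j]*(x[j]-mid[j]) > 0."""
    c0, c1 = centroid(train0, p), centroid(train1, p)
    w = [b - a for a, b in zip(c0, c1)]
    mid = [(a + b) / 2.0 for a, b in zip(c0, c1)]
    return w, mid


def predict(rule, x):
    w, mid = rule
    return 1 if sum(wj * (xj - mj) for wj, xj, mj in zip(w, x, mid)) > 0 else 0


def true_accuracy(rule, k, delta):
    """Exact accuracy of a fixed rule under the assumed model (equal priors)."""
    w, mid = rule
    norm = math.sqrt(sum(wj * wj for wj in w))
    s0 = sum(wj * (0.0 - mj) for wj, mj in zip(w, mid))
    s1 = sum(wj * ((delta if j < k else 0.0) - mj)
             for j, (wj, mj) in enumerate(zip(w, mid)))
    return 0.5 * (phi(-s0 / norm) + phi(s1 / norm))


def mse_by_fraction(n_per_class, p, k, delta, fractions, reps, seed=0):
    """Monte Carlo MSE of the split-sample accuracy estimate, per training fraction."""
    rng = random.Random(seed)
    sq_err = {f: 0.0 for f in fractions}
    for _ in range(reps):
        x0 = draw_class(n_per_class, p, k, 0.0, rng)
        x1 = draw_class(n_per_class, p, k, delta, rng)
        # Target: true accuracy of the rule trained on all cases.
        target = true_accuracy(fit(x0, x1, p), k, delta)
        for f in fractions:
            # Stratified split; cases are i.i.d., so the first t are a random subset.
            t = max(1, min(n_per_class - 1, round(f * n_per_class)))
            rule = fit(x0[:t], x1[:t], p)
            test0, test1 = x0[t:], x1[t:]
            correct = sum(predict(rule, x) == 0 for x in test0)
            correct += sum(predict(rule, x) == 1 for x in test1)
            est = correct / (len(test0) + len(test1))
            sq_err[f] += (est - target) ** 2
    return {f: s / reps for f, s in sq_err.items()}


if __name__ == "__main__":
    result = mse_by_fraction(n_per_class=30, p=20, k=5, delta=1.0,
                             fractions=[1 / 3, 1 / 2, 2 / 3], reps=200)
    for frac, mse in result.items():
        print(f"training fraction {frac:.2f}: estimated MSE {mse:.4f}")
```

In this sketch the target accuracy is available in closed form only because the data-generating model is known. With a real dataset that target is unknown, which is why a resampling approach such as the one the abstract recommends is needed.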

Kevin K Dobbin

Richard M Simon

English

PubMed

Springer - Publisher Connector

Optimally splitting cases for training and testing high dimensional classifiers

Simon, Richard M

Dobbin, Kevin K

Directory of Open Access Journals

BMC Medical Genomics

Crossref
