Abstract

Background: High-throughput functional genomics technologies generate large amounts of data, with hundreds or thousands of measurements per sample. The number of samples is usually much smaller, on the order of tens or hundreds. This poses statistical challenges and calls for appropriate solutions for analyzing this kind of data.

Results: Principal component discriminant analysis (PCDA), an adaptation of classical linear discriminant analysis (LDA) for high-dimensional data, was selected as an example of a base learner. The multiple versions of PCDA models obtained from repeated double cross-validation were aggregated, and the final classification was performed by majority voting. The performance of this approach was evaluated on simulated, genomics, proteomics and metabolomics data sets.

Conclusions: The aggregated PCDA learner can improve prediction performance, provide more stable results, and help reveal the variability of the models. The disadvantages and limitations of aggregating are also discussed.
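The majority-voting step mentioned in the abstract can be illustrated with a minimal sketch. It assumes that each model from the repeated double cross-validation has already produced one class label per sample. The function names, the tie-breaking rule (smallest label wins) and the example predictions are hypothetical and are not taken from the paper.

```python
from collections import Counter


def majority_vote(votes):
    """Return the most frequent label; ties go to the smallest label (an assumption)."""
    counts = Counter(votes)
    top = max(counts.values())
    return min(label for label, n in counts.items() if n == top)


def aggregate_predictions(predictions_per_model):
    """Combine per-model label lists (aligned by sample) into one label per sample."""
    n_samples = len(predictions_per_model[0])
    if any(len(p) != n_samples for p in predictions_per_model):
        raise ValueError("every model must predict the same number of samples")
    # zip(*...) regroups the predictions so each tuple holds all votes for one sample
    return [majority_vote(sample_votes) for sample_votes in zip(*predictions_per_model)]


def vote_agreement(predictions_per_model):
    """Fraction of models that agree with the majority label, per sample."""
    fractions = []
    for sample_votes in zip(*predictions_per_model):
        winner = majority_vote(sample_votes)
        fractions.append(sample_votes.count(winner) / len(sample_votes))
    return fractions


# Hypothetical predictions from three models on four samples
model_predictions = [
    [0, 1, 1, 0],
    [0, 1, 0, 0],
    [1, 1, 1, 0],
]
final_labels = aggregate_predictions(model_predictions)
agreement = vote_agreement(model_predictions)
print(final_labels)
print(agreement)
```

The agreement fraction is one simple way to look at how much the individual models vary on a given sample; the paper may quantify variability differently.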

Hoefsloot, Huub CJ

Smilde, Age K

Xu, Cheng-Jian

English

PubMed

NARCIS 

To aggregate or not to aggregate high-dimensional classifiers

Crossref

Springer - Publisher Connector

UvA-DARE

Directory of Open Access Journals

BMC Bioinformatics

International Migration, Integration and Social Cohesion online publications

https://pure.uva.nl/ws/files/1126540/103848_354023.pdf
