Motivation: Biomarker discovery from high-dimensional data is a crucial
problem with enormous applications in biology and medicine. It is also
extremely challenging from a statistical viewpoint, but surprisingly few
studies have investigated the relative strengths and weaknesses of the plethora
of existing feature selection methods. Methods: We compare 32 feature selection
methods on 4 public gene expression datasets for breast cancer prognosis, in
terms of predictive performance, stability and functional interpretability of
the signatures they produce. Results: We observe that the feature selection
method has a significant influence on the accuracy, stability and
interpretability of signatures. Simple filter methods generally outperform more
complex embedded or wrapper methods, and ensemble feature selection
generally has no positive effect. Overall, a simple Student's t-test seems to
provide the best results. Availability: Code and data are publicly available at
http://cbio.ensmp.fr/~ahaury/
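
To make the two central notions of the abstract concrete, here is a minimal sketch of a t-test filter and of one possible stability score. It assumes a pooled-variance two-sample t statistic, a signature defined as the k features with the largest absolute score, and stability measured as the Jaccard overlap between signatures selected on two disjoint halves of the data. The function names, the toy data and the choice of Jaccard overlap are illustrative assumptions, not the authors' code or protocol.

```python
import math
import random
from statistics import mean, variance


def t_statistic(pos, neg):
    """Two-sample Student's t statistic with pooled variance."""
    n1, n2 = len(pos), len(neg)
    pooled = ((n1 - 1) * variance(pos) + (n2 - 1) * variance(neg)) / (n1 + n2 - 2)
    if pooled == 0.0:
        return 0.0  # constant feature: no evidence of a difference
    return (mean(pos) - mean(neg)) / math.sqrt(pooled * (1.0 / n1 + 1.0 / n2))


def select_top_k(X, y, k):
    """Filter selection: rank features by |t| and return the k best indices."""
    n_features = len(X[0])
    scores = []
    for j in range(n_features):
        pos = [row[j] for row, label in zip(X, y) if label == 1]
        neg = [row[j] for row, label in zip(X, y) if label == 0]
        scores.append(abs(t_statistic(pos, neg)))
    ranked = sorted(range(n_features), key=lambda j: scores[j], reverse=True)
    return set(ranked[:k])


def jaccard(a, b):
    """Overlap between two signatures, between 0 (disjoint) and 1 (identical)."""
    return len(a & b) / len(a | b)


# Hypothetical toy data: 60 samples, 50 features, the first 5 informative.
rng = random.Random(0)
n_samples, n_features, k = 60, 50, 5
y = [i % 2 for i in range(n_samples)]  # alternating labels keep classes balanced
X = [
    [rng.gauss(5.0 * label if j < 5 else 0.0, 1.0) for j in range(n_features)]
    for label in y
]

# Select a signature on each disjoint half and compare the two.
half = n_samples // 2
signature_a = select_top_k(X[:half], y[:half], k)
signature_b = select_top_k(X[half:], y[half:], k)
stability = jaccard(signature_a, signature_b)
print(sorted(signature_a), sorted(signature_b), stability)
```

On real expression data the informative features are far weaker than in this toy example, so the overlap between the two signatures would typically be much lower; that gap is what a stability comparison across selection methods is meant to expose.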

English

arXiv

Haury, Anne-Claire

Gestraud, Pierre

Vert, Jean-Philippe

HAL Descartes

The influence of feature selection methods on accuracy, stability and interpretability of molecular signatures

HAL: Hyper Article en Ligne

Public Library of Science (PLOS)

Directory of Open Access Journals

PLoS ONE

HAL-MINES ParisTech

Crossref
