
Open Access LMU

ZORA

Stepwise classification of cancer samples using clinical and molecular data

Author: A Tan
AL Boulesteix
Askar Obulkasim
D Dunkler
D Krag
Gerrit A Meijer
JA Stephenson
JR Tibshirani
KA Cao
L Breiman
M Bovelstad
M Futschik
M Jelizarow
M van de Vijver
Mark A van de Wiel
RJ Nevins
SL Pomeroy
Y Qi
Z Yong
ZX Huang
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Combining clinical and molecular data types may improve the prediction accuracy of a classifier. However, there is currently a shortage of effective and efficient statistical and bioinformatic tools for true integrative data analysis. Existing integrative classifiers have two main disadvantages. First, coarse combination may cause subtle contributions of one data type to be overshadowed by more obvious contributions of the other. Second, the need to measure both data types for all patients may be both impractical and cost-inefficient. RESULTS: We introduce a novel classification method, a stepwise classifier, which takes advantage of the distinct classification power of clinical data and high-dimensional molecular data. We apply classification algorithms to the two data types independently, starting with the traditional clinical risk factors. We turn to the relatively expensive molecular data only when the uncertainty of the prediction from clinical data exceeds a predefined limit. Experimental results show that our approach is adaptive: the proportion of samples that needs to be re-classified using molecular data depends on how much we expect the predictive accuracy to increase when re-classifying those samples. CONCLUSIONS: Our method renders a more cost-efficient classifier that is at least as good as, and sometimes better than, one based on clinical or molecular data alone. Hence our approach is not just a classifier that minimizes a particular loss function. Instead, it aims to be cost-efficient by avoiding molecular tests for a potentially large subgroup of individuals; moreover, for these individuals a test result would be quickly available, which may reduce waiting times for diagnosis and hence lower the patients' distress. Stepwise classification is implemented in the R package stepwiseCM, available at the Bioconductor website.
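The two-stage decision described in this abstract can be sketched as follows. This is a minimal illustration, not the stepwiseCM implementation: the function name, the default limit, and the choice of uncertainty measure (how close the clinical class probability is to 0.5) are all assumptions, since the abstract does not say how uncertainty is quantified.

```python
from typing import Callable, Tuple


def stepwise_classify(
    p_clinical: float,
    molecular_predict: Callable[[], int],
    uncertainty_limit: float = 0.3,
) -> Tuple[int, str]:
    """Classify one sample, using molecular data only when needed.

    p_clinical        -- probability of class 1 from a clinical-data classifier
    molecular_predict -- hypothetical callback that runs the (expensive)
                         molecular classifier and returns a label 0 or 1
    uncertainty_limit -- assumed threshold on min(p, 1 - p), in [0, 0.5]
    """
    if not 0.0 <= p_clinical <= 1.0:
        raise ValueError("p_clinical must be a probability")

    # Assumed uncertainty measure: 0 when the clinical classifier is sure,
    # 0.5 when it cannot separate the two classes at all.
    uncertainty = min(p_clinical, 1.0 - p_clinical)

    if uncertainty <= uncertainty_limit:
        # Clinical prediction is trusted; no molecular test is ordered.
        return (1 if p_clinical >= 0.5 else 0), "clinical"

    # Clinical prediction is too uncertain; fall back on molecular data.
    return molecular_predict(), "molecular"
```

Passing the molecular classifier as a callback keeps the expensive measurement lazy: it is only triggered for samples that cross the limit, which is the source of the cost saving the abstract describes.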

VU Research Portal

Testing the additional predictive value of high-dimensional molecular data

Author: AL Boulesteix
Anne-Laure Boulesteix
C Truntzer
G Tutz
H Binder
H Höing
J Fridlyand
J Friedman
J Goeman
JJ Goeman
LJ van't Veer
M Schmidberger
O Gevaert
P Bühlmann
P Eden
R Tibshirani
S Chiaretti
T Golub
T Hothorn
Torsten Hothorn
X Li
Y Freund
Y Sun
Publication venue: BioMed Central
Publication date: 01/09/2009

While high-dimensional molecular data such as microarray gene expression data have been used for disease outcome prediction or diagnosis in biomedical research for about ten years, the question of the additional predictive value of such data, given that classical predictors are already available, has long been under-considered in the bioinformatics literature. We suggest an intuitive permutation-based testing procedure for assessing the additional predictive value of high-dimensional molecular data. Our method combines two well-known statistical tools: logistic regression and boosting regression. We give clear advice for the choice of the only method parameter (the number of boosting iterations). In simulations, our novel approach is found to have very good power in different settings, e.g. few strong predictors or many weak predictors. For illustrative purposes, it is applied to two publicly available cancer data sets. Our simple and computationally efficient approach can be used to globally assess the additional predictive power of a large number of candidate predictors given that a few clinical covariates or a known prognostic index are already available.
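The general shape of such a permutation test can be sketched as below. The abstract does not spell out the permutation scheme or the test statistic, so two things here are assumptions: that the molecular profiles are shuffled across patients while outcome and clinical covariates stay paired, and that the statistic is supplied by the caller. The `toy_statistic` is a deliberately crude placeholder, not the logistic-regression-plus-boosting statistic of the paper.

```python
import random
from typing import Callable, Sequence

Profile = Sequence[float]
Statistic = Callable[[Sequence[float], Sequence[int], Sequence[Profile]], float]


def permutation_p_value(
    statistic: Statistic,
    y: Sequence[float],
    clinical: Sequence[int],
    molecular: Sequence[Profile],
    n_perm: int = 200,
    seed: int = 0,
) -> float:
    """P-value for 'molecular data add nothing beyond clinical data'.

    Molecular rows are permuted across patients; (y, clinical) stay paired,
    so any clinical signal is preserved under the null (assumed scheme).
    """
    rng = random.Random(seed)
    observed = statistic(y, clinical, molecular)
    shuffled = list(molecular)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if statistic(y, clinical, shuffled) >= observed:
            exceed += 1
    # Add-one correction keeps the p-value strictly positive.
    return (1 + exceed) / (1 + n_perm)


def toy_statistic(
    y: Sequence[float], clinical: Sequence[int], molecular: Sequence[Profile]
) -> float:
    """Placeholder statistic: strongest association between any molecular
    variable and the outcome residual left after removing the mean of each
    clinical stratum (clinical is a single categorical covariate here)."""
    strata: dict = {}
    for yi, ci in zip(y, clinical):
        strata.setdefault(ci, []).append(yi)
    means = {c: sum(v) / len(v) for c, v in strata.items()}
    resid = [yi - means[ci] for yi, ci in zip(y, clinical)]
    n_vars = len(molecular[0])
    return max(
        abs(sum(r * row[j] for r, row in zip(resid, molecular)))
        for j in range(n_vars)
    )
```

A real analysis would replace `toy_statistic` with a measure of how much a model fitted on the molecular data improves on the clinical-only fit.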

Open Access LMU

Bias in random forest variable importance measures: Illustrations, sources and a solution

Author: A Bureau
A Dobra
A Liaw
Achim Zeileis
AG Heidema
AL Boulesteix
Anne-Laure Boulesteix
C Furlanello
C Strobl
Carolin Strobl
DN Politis
EC Gunther
H Kim
I Kononenko
J Friedman
K Arun
KL Lunetta
L Breiman
M van der Laan
MM Ward
MP Cummings
MR Segal
P Bühlmann
PJ Bickel
R Development Core Team
R Díaz-Uriarte
R Guha
T Hothorn
TM Therneau
Torsten Hothorn
V Svetnik
X Huang
Y Qi
Y Shih
Publication venue: BioMed Central
Publication date: 01/01/2006

BACKGROUND: Variable importance measures for random forests have been receiving increased attention as a means of variable selection in many classification tasks in bioinformatics and related scientific fields, for instance to select a subset of genetic markers relevant for the prediction of a certain disease. We show that random forest variable importance measures are a sensible means for variable selection in many applications, but are not reliable in situations where potential predictor variables vary in their scale of measurement or their number of categories. This is particularly important in genomics and computational biology, where predictors often include variables of different types, for example when predictors include both sequence data and continuous variables such as folding energy, or when amino acid sequence data show different numbers of categories. RESULTS: Simulation studies are presented illustrating that, when random forest variable importance measures are used with data of varying types, the results are misleading because suboptimal predictor variables may be artificially preferred in variable selection. The two mechanisms underlying this deficiency are biased variable selection in the individual classification trees used to build the random forest on the one hand, and effects induced by bootstrap sampling with replacement on the other hand. CONCLUSION: We propose to employ an alternative implementation of random forests that provides unbiased variable selection in the individual classification trees. When this method is applied using subsampling without replacement, the resulting variable importance measures can be used reliably for variable selection even in situations where the potential predictor variables vary in their scale of measurement or their number of categories.
The usage of both random forest algorithms and their variable importance measures in the R system for statistical computing is illustrated and documented thoroughly in an application re-analyzing data from a study on RNA editing. Therefore, the suggested method can be applied straightforwardly by scientists in bioinformatics research.
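The two resampling schemes contrasted in this abstract differ only in how the per-tree training indices are drawn. The sketch below shows that difference in isolation; it does not build trees or compute importances, the function names are hypothetical, and the default subsample fraction of 0.632 is an assumption rather than a value taken from the abstract.

```python
import random
from typing import List


def bootstrap_indices(n: int, rng: random.Random) -> List[int]:
    """Draw n indices WITH replacement (classical bagging).
    Some observations appear several times, others not at all."""
    return [rng.randrange(n) for _ in range(n)]


def subsample_indices(
    n: int, rng: random.Random, fraction: float = 0.632
) -> List[int]:
    """Draw a fraction of the n indices WITHOUT replacement.
    Every selected observation appears exactly once."""
    size = max(1, round(fraction * n))
    return rng.sample(range(n), size)
```

With subsampling, no observation is duplicated inside a tree's training set, which removes the second of the two bias mechanisms the abstract names.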

Elektronische Publikationen der Wirtschaftsuniversität Wien

Open Access LMU

Importance of data structure in comparing two dimension reduction methods for classification of microarray gene expression data

Author: A Boulesteix
A Culhane
AL Boulesteix
Caroline Truntzer
Catherine Mercier
Christian Gautier
D Nguyen
D Singh
H Hotelling
H Martens
I Frank
I Jeffery
J Dai
Jacques Estève
L Lebart
M Barker
M Shipp
M Stone
P Garthwaite
P Mahalanobis
Pascal Roy
R Fisher
S Chiaretti
S DeJong
S Doledec
T Golub
Y Escoufier
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: With the advance of microarray technology, several methods for gene classification and prognosis have already been designed. However, under various denominations, some of these methods have similar approaches. This study evaluates the influence of gene expression variance structure on the performance of methods that describe the relationship between gene expression levels and a given phenotype through projection of data onto discriminant axes. RESULTS: We compared Between-Group Analysis and Discriminant Analysis (with prior dimension reduction through Partial Least Squares or Principal Components Analysis). A geometric approach showed that these two methods are strongly related, but differ in the way they handle data structure. Data structure in turn helps in understanding the predictive efficiency of these methods. Three main structure situations may be identified. When the clusters of points are clearly split, both methods perform equally well. When the clusters overlap, both methods fail to give interesting predictions. In intermediate situations, the configuration of the clusters of points has to be handled by the projection to improve prediction. For this, we recommend Discriminant Analysis. In addition, an innovative simulation approach generated the three main structures by modelling different partitions of the whole variance into within-group and between-group variances. These simulated datasets were used to complement some well-known public datasets in order to investigate the methods' behaviour in a large diversity of structure situations. To examine the structure of a dataset before analysis and preselect an a priori appropriate method for its analysis, we proposed a two-graph preliminary visualization tool: plotting patients on the Between-Group Analysis discriminant axis (x-axis) and on the first and the second within-group Principal Components Analysis component (y-axis), respectively.
CONCLUSION: Discriminant Analysis outperformed Between-Group Analysis because it accounts for the dataset structure. A priori knowledge of that structure may guide the choice of the analysis method. Simulated datasets with known properties are valuable to assess and compare the performance of analysis methods; implementation on real datasets then checks and validates the results. Thus, we warn against the use of unchallenging datasets for method comparison, such as the Golub dataset, because their structure is such that any method would be efficient.
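The partition of total variance into between-group and within-group parts, which the simulations above manipulate, can be illustrated for a single gene. This is a minimal sketch using sums of squares; the function name is hypothetical and the multivariate case treated in the paper is not covered.

```python
from typing import Hashable, Sequence, Tuple


def variance_partition(
    values: Sequence[float], groups: Sequence[Hashable]
) -> Tuple[float, float]:
    """Return (between-group SS, within-group SS) for one variable.

    The two parts add up to the total sum of squares around the grand mean.
    """
    if len(values) != len(groups) or not values:
        raise ValueError("values and groups must be non-empty and equal length")

    grand_mean = sum(values) / len(values)

    # Collect the observations of each group.
    by_group: dict = {}
    for v, g in zip(values, groups):
        by_group.setdefault(g, []).append(v)

    between = 0.0
    within = 0.0
    for members in by_group.values():
        group_mean = sum(members) / len(members)
        # Distance of the group centre from the overall centre.
        between += len(members) * (group_mean - grand_mean) ** 2
        # Spread of the observations around their own group centre.
        within += sum((v - group_mean) ** 2 for v in members)
    return between, within
```

A large between-to-within ratio corresponds to the "clearly split" situation described in the abstract; a ratio near zero corresponds to overlapping clusters.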

HAL-HCL

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

An experimental study of the intrinsic stability of random forest variable importance measures

Author: A Altmann
A Kalousis
A Statnikov
A Verikas
AC Haury
AL Boulesteix
CH Park
D Ma
DM Reif
DS Cao
EC Fieller
Fan Yang
H Wang
Huazhen Wang
I Guyon
I Kamkar
J Paul
JM Cadenas
KK Nicodemus
L Breiman
L Hamers
L Yu
LI Kuncheva
MB Kursa
ML Calle
O Okun
R Díaz-Uriarte
R Fagin
R Genuer
S Alelyani
S Loscalzo
S Pleus
SS Lee
SY Kim
TK Ho
VY Kulkarni
Y Han
Y Zhang
Z He
Zhiyuan Luo
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2016

BACKGROUND: The stability of Variable Importance Measures (VIMs) based on random forest has recently received increased attention. Despite the extensive attention paid to traditional stability under data perturbations or parameter variations, few studies consider the influence of the intrinsic randomness in generating VIMs, i.e. bagging, randomization and permutation. To address this influence, in this paper we introduce a new concept of intrinsic stability of VIMs, which is defined as the self-consistency among feature rankings in repeated runs of VIMs without data perturbations and parameter variations. Two widely used VIMs, Mean Decrease Accuracy (MDA) and Mean Decrease Gini (MDG), are comprehensively investigated. The motivation of this study is two-fold. First, we empirically verify the prevalence of intrinsic stability of VIMs over many real-world datasets to highlight that the instability of VIMs does not originate exclusively from data perturbations or parameter variations, but also stems from the intrinsic randomness of VIMs. Second, through Spearman and Pearson tests we comprehensively investigate how different factors influence the intrinsic stability. RESULTS: The experiments are carried out on 19 benchmark datasets with diverse characteristics, including 10 high-dimensional and small-sample gene expression datasets. Experimental results demonstrate the prevalence of intrinsic stability of VIMs. Spearman and Pearson tests on the correlations between intrinsic stability and different factors show that #feature (number of features) and #sample (size of sample) have a coupling effect on the intrinsic stability. The synthetic indicator, #feature/#sample, shows both negative monotonic correlation and negative linear correlation with the intrinsic stability, while OOB accuracy has monotonic correlations with intrinsic stability.
This indicates that high-dimensional, small-sample and high-complexity datasets may suffer more from intrinsic instability of VIMs. Furthermore, with respect to parameter settings of random forest, a large number of trees is preferred. No significant correlations can be seen between intrinsic stability and other factors. Finally, the magnitude of intrinsic stability is always smaller than that of traditional stability. CONCLUSION: First, the prevalence of intrinsic stability of VIMs demonstrates that the instability of VIMs comes not only from data perturbations or parameter variations, but also from the intrinsic randomness of VIMs. This finding gives a better understanding of VIM stability, and may help reduce the instability of VIMs. Second, by investigating the potential factors of intrinsic stability, users would be more aware of the risks and hence more careful when using VIMs, especially on high-dimensional, small-sample and high-complexity datasets.
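One way to turn "self-consistency among feature rankings in repeated runs" into a number is the mean pairwise Spearman correlation between the rankings. The sketch below does that; the abstract does not state which consistency measure the authors use, so this choice, the function names, and the no-ties simplification are assumptions.

```python
from itertools import combinations
from typing import List, Sequence


def ranks(scores: Sequence[float]) -> List[int]:
    """Rank 1 = most important feature. Assumes no tied scores."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    result = [0] * len(scores)
    for position, feature in enumerate(order, start=1):
        result[feature] = position
    return result


def spearman(a: Sequence[float], b: Sequence[float]) -> float:
    """Spearman rank correlation via the no-ties shortcut formula."""
    n = len(a)
    if n != len(b) or n < 2:
        raise ValueError("need two score vectors of equal length >= 2")
    d2 = sum((ra - rb) ** 2 for ra, rb in zip(ranks(a), ranks(b)))
    return 1.0 - 6.0 * d2 / (n * (n * n - 1))


def intrinsic_stability(runs: Sequence[Sequence[float]]) -> float:
    """Mean pairwise Spearman correlation over repeated VIM runs made on
    the same data with the same parameters (only the random seed differs)."""
    if len(runs) < 2:
        raise ValueError("need at least two runs")
    pairs = list(combinations(runs, 2))
    return sum(spearman(a, b) for a, b in pairs) / len(pairs)
```

Each element of `runs` would be the importance vector (for example MDA or MDG) from one random forest fit; a value near 1 means the repeated runs rank the features almost identically.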

Royal Holloway - Pure

Factor analysis for gene regulatory networks and transcription factor activity profiles

Author: A Frigyesi
A Utsugi
AL Boulesteix
AM Martoglio
C Sabatti
E Fokoue
G Hinton
H Kaiser
H Ming
H Salgado
I Pournara
Iosifina Pournara
J Liao
K Kao
L Tran
Lorenz Wernisch
M Tipping
M West
O Aguilar
P Schönemann
W Liebermeister
Z Ghahramani
Publication venue: BioMed Central
Publication date: 01/01/2007

BACKGROUND: Most existing algorithms for the inference of the structure of gene regulatory networks from gene expression data assume that the activity levels of transcription factors (TFs) are proportional to their mRNA levels. This assumption is invalid for most biological systems. However, one might be able to reconstruct unobserved activity profiles of TFs from the expression profiles of target genes. A simple model is a two-layer network with unobserved TF variables in the first layer and observed gene expression variables in the second layer. TFs are connected to regulated genes by weighted edges. The weights, known as factor loadings, indicate the strength and direction of regulation. Of particular interest are methods that produce sparse networks, networks with few edges, since it is known that most genes are regulated by only a small number of TFs, and most TFs regulate only a small number of genes. RESULTS: In this paper, we explore the performance of five factor analysis algorithms, Bayesian as well as classical, on problems with biological context using both simulated and real data. Factor analysis (FA) models are used in order to describe a larger number of observed variables by a smaller number of unobserved variables, the factors, whereby all correlation between observed variables is explained by common factors. Bayesian FA methods allow one to infer sparse networks by enforcing sparsity through priors. In contrast, in the classical FA, matrix rotation methods are used to enforce sparsity and thus to increase the interpretability of the inferred factor loadings matrix. However, we also show that Bayesian FA models that do not impose sparsity through the priors can still be used for the reconstruction of a gene regulatory network if applied in conjunction with matrix rotation methods. Finally, we show the added advantage of merging the information derived from all algorithms in order to obtain a combined result. 
CONCLUSION: Most of the algorithms tested are successful in reconstructing the connectivity structure as well as the TF profiles. Moreover, we demonstrate that if the underlying network is sparse it is still possible to reconstruct hidden activity profiles of TFs to some degree without prior connectivity information.
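The two-layer model described in this abstract corresponds to the standard factor analysis model, written out below for reference. The symbols are not taken from the paper and are chosen here for illustration: $x$ is the vector of observed expression levels of $G$ genes in one sample, $f$ the vector of $K$ unobserved TF activities, $\Lambda$ the $G \times K$ loadings matrix, and $\Psi$ a diagonal noise covariance.

```latex
x = \Lambda f + \varepsilon,
\qquad f \sim \mathcal{N}(0, I_K),
\qquad \varepsilon \sim \mathcal{N}(0, \Psi),
\qquad \Psi = \operatorname{diag}(\psi_1, \dots, \psi_G)

\operatorname{Cov}(x) = \Lambda \Lambda^{\top} + \Psi
```

Because $\Psi$ is diagonal, every off-diagonal entry of $\operatorname{Cov}(x)$ comes from $\Lambda \Lambda^{\top}$, which is the sense in which all correlation between observed variables is explained by the common factors. A sparse network corresponds to a $\Lambda$ with few non-zero entries $\lambda_{gk}$.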

A comparative study on gene-set analysis methods for assessing differential expression associated with the survival phenotype

Author: A Rosenwald
A Subramanian
AA Alizadeh
AJ Adewale
AL Boulesteix
AP Crijns
E Bair
H Binder
HK Dressman
I Dinu
J Gui
Jinheum Kim
JJ Goeman
K Jung
L Tian
Q Liu
R Tibshirani
Seungyeoun Lee
Sunho Lee
SY Kim
TR Golub
TS Furey
VK Mootha
X Chen
Y Benjamini
Publication venue: BioMed Central
Publication date: 01/01/2011

BACKGROUND: Many gene-set analysis methods have been previously proposed and compared through simulation studies and analysis of real datasets for binary phenotypes. We focused on the survival phenotype and compared the performances of Gene Set Enrichment Analysis (GSEA), Global Test (GT), Wald-type Test (WT) and Global Boost Test (GBST) methods in a simulation study and on two ovarian cancer data sets. We considered two versions of GSEA by allowing different weights: GSEA1 uses equal weights, yielding results similar to the Kolmogorov-Smirnov test, while GSEA2's weights are based on the correlation between genes and the phenotype. RESULTS: We compared GSEA1, GSEA2, GT, WT and GBST in a simulation study with various settings for the correlation structure of the genes and the association parameter between the survival outcome and the genes. Simulation results indicated that GT, WT and GBST consistently have higher power than GSEA1 and GSEA2 across all scenarios. However, the power of the five tests depends on the combination of correlation structure and association parameter. For the ovarian cancer data set, using the FDR threshold of q … CONCLUSION: Simulation studies and a real data example indicate that GT, WT and GBST tend to have high power, whereas GSEA1 and GSEA2 have lower power. We also found that, when survival is positively associated with genes, the power of the five tests is much higher when genes are correlated than when genes are independent. It seems that there is a synergistic effect in detecting significant gene sets when significant genes have within-class correlation and the association between survival and genes is positive or negative (i.e., one-direction correlation).
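The difference between the two GSEA variants can be made concrete with the usual weighted running-sum enrichment score, where an exponent `p` controls the weighting. The sketch below assumes that `p = 0` corresponds to GSEA1 (equal weights, Kolmogorov-Smirnov-like) and `p = 1` to GSEA2 (weights from the gene-phenotype association); that mapping, the function name, and the generic `scores` input are assumptions, and the survival-specific association measure used in the paper is not reproduced.

```python
from typing import Dict, Set


def enrichment_score(
    scores: Dict[str, float], gene_set: Set[str], p: float = 1.0
) -> float:
    """Signed maximum deviation of the GSEA running sum.

    scores   -- per-gene association with the phenotype (any signed score)
    gene_set -- genes whose enrichment is tested
    p        -- weight exponent: 0 gives equal weights, 1 weights each hit
                by the absolute value of its score
    """
    ranked = sorted(scores, key=lambda g: scores[g], reverse=True)
    hits = [g in gene_set for g in ranked]
    n_hit = sum(hits)
    n_miss = len(ranked) - n_hit
    if n_hit == 0 or n_miss == 0:
        raise ValueError("gene set must contain some but not all ranked genes")

    weights = [abs(scores[g]) ** p if h else 0.0 for g, h in zip(ranked, hits)]
    total = sum(weights)
    if total == 0.0:
        raise ValueError("all hit weights are zero")

    running = 0.0
    best = 0.0
    for w, h in zip(weights, hits):
        # Step up at a set member (by its normalised weight), down otherwise.
        running += w / total if h else -1.0 / n_miss
        if abs(running) > abs(best):
            best = running
    return best
```

Significance would then be assessed by recomputing the score under permutations, which is outside this sketch.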

A classification model to predict synergism/antagonism of cytotoxic mixtures using protein-drug docking scores

Author: A Goldin
A Hoskuldsson
AL Boulesteix
BA Carlson
CD Lao
E Perola
F Abas
FM Muggia
G Patlakas
GL Warren
GR Zimmermann
J Lehar
JC Boik
JM Nabholtz
John C Boik
M Momma
M Tabata
N Akula
Robert A Newman
RP Araujo
RP Sheridan
T Hastie
T Safra
TG Dietterich
VT DeVita Jr.
Y Hayashi
Z Zsoldos
Publication venue: BioMed Central
Publication date: 01/01/2008

arXiv.org e-Print Archive

Algebraic Comparison of Partial Lists in Bioinformatics

Author: A Gobbi
A Kalousis
A Kossenkov
A Sboner
AC Haury
AL Boulesteix
Arkady B. Khodursky
B Di Camillo
B Efron
B Schowe
C Cortes
C Furlanello
C Schneider
C Soneson
C Yao
Cesare Furlanello
Consortium The MicroArray Quality Control (MAQC)
D Albanese
D Cai
D Corrada
D Critchlow
D Saari
D Witten
G Guzzetta
G Jurman
G Lance
G Smyth
Giuseppe Jurman
GS Cheon
I Guyon
I Jeffery
I Lönnstedt
J Bar-Ilan
J Borda
J Chen
J Ioannidis
J Neter
J Storey
L Ein-Dor
L Kuncheva
L Yu
L Zhang
M Desarkar
M Kauers
M Kendall
M Schimek
M Slawski
M Villarino
O Bousquet
P Baldi
P Diaconis
P Hall
P Krízek
PC Boutros
R Fagin
R Gentleman
R Graham
R Pearson
R Pique-Regi
R Simon
Roberto Visintainer
S Abramov
S Dudoit
S Lin
S Mukherjee
S Setlur
S Simić
S Vanderlooy
Samantha Riccadonna
SK Lau
T Bø
T Calders
V Tusher
Visintainer
W Fury
W Hoeffding
W Shi
X Wang
X Yang
Y Xiao
Z He
Publication venue: 'Public Library of Science (PLoS)'
Publication date: 08/04/2010

The outcome of a functional genomics pipeline is usually a partial list of genomic features, ranked by their relevance in modelling a biological phenotype in terms of a classification or regression model. Because of resampling protocols, or simply within a meta-analysis comparison, sets of alternative feature lists (possibly of different lengths) are often obtained instead of a single list. Here we introduce a method, based on the algebraic theory of symmetric groups, for studying the variability between lists ("list stability") in the case of lists of unequal length. We provide algorithms evaluating stability for lists embedded in the full feature set or just limited to the features occurring in the partial lists. The method is demonstrated first on synthetic data in a gene filtering task and then for finding gene profiles on a recent prostate cancer dataset.

Archivio della ricerca - Fondazione Bruno Kessler