Factors Influencing the Statistical Power of Complex Data Analysis Protocols for Molecular Signature Development from Microarray Data
Critical to the development of molecular signatures from microarray and other high-throughput data is testing the statistical significance of the produced signature in order to ensure its statistical reproducibility. While current best practices emphasize sufficiently powered univariate tests of differential expression, little is known about the factors that affect the statistical power of complex multivariate analysis protocols for high-dimensional molecular signature development. We show that choices of specific components of the analysis (i.e., error metric, classifier, error estimator, and event balancing) have large and compounding effects on statistical power. These effects are demonstrated empirically by an analysis of 7 of the largest microarray cancer-outcome prediction datasets and supplementary simulations, and by contrasting them with prior analyses of the same data. The findings of the present study have two important practical implications. First, by avoiding under-powered data analysis protocols, high-throughput studies can achieve substantial economies in the sample size required to demonstrate statistical significance of a predictive signal. We identify and study the factors that affect power; much smaller samples than previously thought may suffice for exploratory studies, as long as these factors are taken into consideration when designing and executing the analysis. Second, previous highly cited claims that microarray assays may not be able to predict disease outcomes better than chance are shown by our experiments to be due to under-powered data analysis combined with inappropriate statistical tests
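As a rough illustration of why the choice of error estimator and significance test matters, the sketch below (our own, not from the paper; the nearest-centroid classifier, data dimensions, and permutation count are illustrative assumptions) assesses a cross-validated classifier against a label-permutation null distribution, so that the entire analysis protocol is re-run under each shuffled labeling:

```python
import numpy as np

rng = np.random.default_rng(0)

def loo_centroid_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        c0 = X[keep & (y == 0)].mean(axis=0)
        c1 = X[keep & (y == 1)].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        hits += int(pred == y[i])
    return hits / len(y)

# synthetic two-class data: 5 of 100 features carry signal (illustrative sizes)
n, p = 40, 100
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1, :5] += 1.5

obs = loo_centroid_accuracy(X, y)

# permutation null: the whole estimation protocol is repeated under shuffled labels
null = [loo_centroid_accuracy(X, rng.permutation(y)) for _ in range(200)]
p_value = (1 + sum(a >= obs for a in null)) / (1 + len(null))
```

Because the permutation test reuses the full multivariate protocol, it respects the dependence structure that simple univariate tests ignore.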
Essential guidelines for computational method benchmarking
In computational biology and other sciences, researchers are frequently faced
with a choice between several computational methods for performing data
analyses. Benchmarking studies aim to rigorously compare the performance of
different methods using well-characterized benchmark datasets, to determine the
strengths of each method or to provide recommendations regarding suitable
choices of methods for an analysis. However, benchmarking studies must be
carefully designed and implemented to provide accurate, unbiased, and
informative results. Here, we summarize key practical guidelines and
recommendations for performing high-quality benchmarking analyses, based on our
experiences in computational biology.
Algebraic Comparison of Partial Lists in Bioinformatics
The outcome of a functional genomics pipeline is usually a partial list of
genomic features, ranked by their relevance in modelling biological phenotype
in terms of a classification or regression model. Owing to resampling protocols,
or simply within a meta-analysis comparison, one often obtains not a single list
but a set of alternative feature lists (possibly of different lengths).
Here we introduce a method, based on the algebraic theory of
symmetric groups, for studying the variability between lists ("list stability")
in the case of lists of unequal length. We provide algorithms evaluating
stability for lists embedded in the full feature set or just limited to the
features occurring in the partial lists. The method is demonstrated first on
synthetic data in a gene filtering task and then for finding gene profiles on a
recent prostate cancer dataset
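The full algebraic treatment of partial lists is beyond a short snippet, but the underlying idea of a distance between ranked lists can be sketched minimally. The example below (our illustration, restricted to the complete-list case; function names are ours) computes the Canberra distance between two rank vectors over a common feature universe, which down-weights disagreements occurring lower in the lists:

```python
def ranks(ordered_features, universe):
    """1-based rank of each feature in `universe` under the given ordering."""
    position = {f: i + 1 for i, f in enumerate(ordered_features)}
    return [position[f] for f in universe]

def canberra_rank_distance(r1, r2):
    """Canberra distance between two rank vectors of equal length."""
    return sum(abs(a - b) / (a + b) for a, b in zip(r1, r2))

universe = ["g1", "g2", "g3", "g4"]
list_a = ["g1", "g2", "g3", "g4"]
list_b = ["g2", "g1", "g4", "g3"]

# two adjacent swaps: |1-2|/3 + |2-1|/3 + |3-4|/7 + |4-3|/7 = 20/21
d = canberra_rank_distance(ranks(list_a, universe), ranks(list_b, universe))
```

A stability indicator for a set of lists can then be taken as the average pairwise distance; note how the same swap costs less when it occurs at the bottom of the list (ranks 3 and 4) than at the top (ranks 1 and 2).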
Unconventional machine learning of genome-wide human cancer data
Recent advances in high-throughput genomic technologies coupled with
exponential increases in computer processing and memory have allowed us to
interrogate the complex aberrant molecular underpinnings of human disease from
a genome-wide perspective. While the deluge of genomic information is expected
to increase, a bottleneck in conventional high-performance computing is rapidly
approaching. Inspired in part by recent advances in physical quantum
processors, we evaluated several unconventional machine learning (ML)
strategies on actual human tumor data. Here we show for the first time the
efficacy of multiple annealing-based ML algorithms for classification of
high-dimensional, multi-omics human cancer data from the Cancer Genome Atlas.
To assess algorithm performance, we compared these classifiers to a variety of
standard ML methods. Our results indicate the feasibility of using
annealing-based ML to provide competitive classification of human cancer types
and associated molecular subtypes and superior performance with smaller
training datasets, thus providing compelling empirical evidence for the
potential future application of unconventional computing architectures in the
biomedical sciences
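The annealing in this study runs on physical quantum hardware; as a loose classical analogue (our illustration, not the authors' method, with toy data and hyperparameters chosen for brevity), simulated annealing can train a linear classifier by proposing random weight perturbations and accepting worse solutions with a temperature-dependent probability:

```python
import numpy as np

rng = np.random.default_rng(1)

# toy two-class data: two well-separated Gaussian clouds (illustrative)
X = np.vstack([rng.normal(-1.0, 1.0, (50, 5)), rng.normal(1.0, 1.0, (50, 5))])
y = np.array([0] * 50 + [1] * 50)

def misclassification(w):
    """Fraction of points misclassified by the linear rule sign(X @ w)."""
    return np.mean((X @ w > 0).astype(int) != y)

w = rng.normal(size=5)
best_w, best_err = w.copy(), misclassification(w)
temperature = 1.0
for _ in range(2000):
    candidate = w + rng.normal(scale=0.2, size=5)
    delta = misclassification(candidate) - misclassification(w)
    # always accept improvements; accept worse moves with Boltzmann probability
    if delta < 0 or rng.random() < np.exp(-delta / temperature):
        w = candidate
        err = misclassification(w)
        if err < best_err:
            best_w, best_err = w.copy(), err
    temperature *= 0.999   # geometric cooling schedule
```

The acceptance of occasional uphill moves at high temperature is what lets annealing escape local minima of the non-convex 0-1 loss; hardware annealers exploit the same energy-landscape picture physically.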
Prediction of lithium response using genomic data
Predicting lithium response prior to treatment could both expedite therapy and avoid exposure to side effects. Since lithium responsiveness may be heritable, its predictability based on genomic data is of interest. We thus evaluate the degree to which lithium response can be predicted with a machine learning (ML) approach using genomic data. Using the largest existing genomic dataset in the lithium response literature (n = 2210 across 14 international sites; 29% responders), we evaluated the degree to which lithium response could be predicted based on 47,465 genotyped single nucleotide polymorphisms using a supervised ML approach. Under appropriate cross-validation procedures, lithium response could be predicted to above-chance levels in two constituent sites (Halifax, Cohen's kappa 0.15, 95% confidence interval, CI [0.07, 0.24]; and Würzburg, kappa 0.2 [0.1, 0.3]). Variants with shared importance in these models showed over-representation of postsynaptic membrane related genes. Lithium response was not predictable in the pooled dataset (kappa 0.02 [-0.01, 0.04]), although non-trivial performance was achieved within a restricted dataset including only those patients followed prospectively (kappa 0.09 [0.04, 0.14]). Genomic classification of lithium response remains a promising but difficult task. Classification performance could potentially be improved by further harmonization of data collection procedures
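Cohen's kappa, the metric reported above, corrects raw agreement for the agreement expected by chance from the marginal label frequencies, which matters here because the classes are imbalanced (29% responders). A minimal sketch with toy labels (our illustration):

```python
def cohens_kappa(y_true, y_pred):
    """Cohen's kappa for two equal-length label sequences."""
    n = len(y_true)
    # observed agreement
    p_o = sum(t == p for t, p in zip(y_true, y_pred)) / n
    # chance agreement from the marginal label frequencies
    labels = set(y_true) | set(y_pred)
    p_e = sum((y_true.count(c) / n) * (y_pred.count(c) / n) for c in labels)
    return (p_o - p_e) / (1 - p_e)

# toy example: 6/8 raw agreement with balanced marginals gives kappa = 0.5
truth = [1, 1, 0, 0, 1, 0, 1, 0]
preds = [1, 0, 0, 0, 1, 0, 1, 1]
kappa = cohens_kappa(truth, preds)
```

A kappa of 0 means no better than chance, so the pooled-dataset estimate of 0.02 above indicates essentially no predictive signal, despite whatever raw accuracy the imbalanced classes would have produced.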
Batch effect confounding leads to strong bias in performance estimates obtained by cross-validation.
BACKGROUND: With the large amount of biological data that is currently publicly available, many investigators combine multiple data sets to increase the sample size and potentially also the power of their analyses. However, technical differences ("batch effects") as well as differences in sample composition between the data sets may significantly affect the ability to draw generalizable conclusions from such studies.
FOCUS: The current study focuses on the construction of classifiers, and the use of cross-validation to estimate their performance. In particular, we investigate the impact of batch effects and differences in sample composition between batches on the accuracy of the classification performance estimate obtained via cross-validation. The focus on estimation bias is a main difference compared to previous studies, which have mostly focused on the predictive performance and how it relates to the presence of batch effects.
DATA: We work on simulated data sets. To have realistic intensity distributions, we use real gene expression data as the basis for our simulation. Random samples from this expression matrix are selected and assigned to group 1 (e.g., 'control') or group 2 (e.g., 'treated'). We introduce batch effects and select some features to be differentially expressed between the two groups. We consider several scenarios for our study, most importantly different levels of confounding between groups and batch effects.
METHODS: We focus on well-known classifiers: logistic regression, Support Vector Machines (SVM), k-nearest neighbors (kNN) and Random Forests (RF). Feature selection is performed with the Wilcoxon test or the lasso. Parameter tuning and feature selection, as well as the estimation of the prediction performance of each classifier, is performed within a nested cross-validation scheme. The estimated classification performance is then compared to what is obtained when applying the classifier to independent data
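The bias mechanism under study can be sketched in miniature (our simulation, far simpler than the paper's setup: a nearest-centroid rule instead of the classifiers above, no true class signal, and a batch shift fully confounded with the class labels). Cross-validation within the confounded data reports high accuracy, while independent batch-free data reveals chance-level performance:

```python
import numpy as np

rng = np.random.default_rng(2)

n, p = 60, 20
y = np.repeat([0, 1], n // 2)
X = rng.normal(size=(n, p))
X[y == 1] += 1.0   # batch shift perfectly confounded with class; no real signal

def loo_centroid_accuracy(X, y):
    """Leave-one-out accuracy of a nearest-centroid classifier."""
    hits = 0
    for i in range(len(y)):
        keep = np.arange(len(y)) != i
        c0 = X[keep & (y == 0)].mean(axis=0)
        c1 = X[keep & (y == 1)].mean(axis=0)
        pred = int(np.linalg.norm(X[i] - c1) < np.linalg.norm(X[i] - c0))
        hits += int(pred == y[i])
    return hits / len(y)

cv_acc = loo_centroid_accuracy(X, y)   # inflated: the batch effect is learned

# independent test data from a batch-free source: accuracy drops to chance
c0, c1 = X[y == 0].mean(axis=0), X[y == 1].mean(axis=0)
X_test = rng.normal(size=(400, p))
y_test = np.repeat([0, 1], 200)
pred = (np.linalg.norm(X_test - c1, axis=1)
        < np.linalg.norm(X_test - c0, axis=1)).astype(int)
ext_acc = np.mean(pred == y_test)
```

Because every cross-validation fold contains the same confounded batch structure, the internal estimate cannot detect that the classifier has learned the batch shift rather than any biological signal; only the independent data exposes the bias.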