Stratification bias in low signal microarray studies

Author: Bedo Justin
Guenter Simon
Parker Brian
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 09/12/2015

BACKGROUND: When analysing microarray and other small sample size biological datasets, care is needed to avoid various biases. We analyse a form of bias, stratification bias, that can substantially affect analyses using sample-reuse validation techniques and lead to inaccurate results. This bias is due to imperfect stratification of samples in the training and test sets and the dependency between these stratification errors, i.e. the variations in class proportions in the training and test sets are negatively correlated. RESULTS: We show that when estimating the performance of classifiers on low signal datasets (i.e. those which are difficult to classify), which are typical of many prognostic microarray studies, commonly used performance measures can suffer from a substantial negative bias. For error rate this bias is only severe in quite restricted situations, but can be much larger and more frequent when using ranking measures such as the receiver operating characteristic (ROC) curve and area under the ROC (AUC). Substantial biases are shown in simulations and on the van 't Veer breast cancer dataset. The classification error rate can have large negative biases for balanced datasets, whereas the AUC shows substantial pessimistic biases even for imbalanced datasets. In simulation studies using 10-fold cross-validation, AUC values of less than 0.3 can be observed on random datasets rather than the expected 0.5. Further experiments on the van 't Veer breast cancer dataset show these biases exist in practice. CONCLUSION: Stratification bias can substantially affect several performance measures. In computing the AUC, the strategy of pooling the test samples from the various folds of cross-validation can lead to large biases; computing it as the average of per-fold estimates avoids this bias and is thus the recommended approach. 
As a more general solution applicable to other performance measures, we show that stratified repeated holdout and modified versions of k-fold cross-validation (balanced, stratified cross-validation and balanced leave-one-out cross-validation) avoid the bias. Therefore, for model selection and evaluation of microarray and other small biological datasets, these methods should be used and unstratified versions avoided. In particular, the commonly used (unbalanced) leave-one-out cross-validation should not be used to estimate AUC for small datasets.
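The difference between pooling test samples across folds and averaging per-fold AUC estimates, as described in the abstract above, can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper: the toy labels, the fold assignment, and a signal-free "classifier" that scores every test sample with the fraction of positives in its training set. The function names `auc` and `prior_score_cv` are hypothetical.

```python
from statistics import mean


def auc(labels, scores):
    """AUC as the Mann-Whitney statistic; tied pairs count as 0.5."""
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        return None  # AUC is undefined when a class is missing
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def prior_score_cv(labels, folds):
    """Score each test sample with the positive fraction of its training set.

    This stand-in classifier uses no features at all (an assumption for
    illustration), so any departure from AUC = 0.5 is an artefact of the
    validation scheme rather than real signal.
    """
    results = []
    for test_idx in folds:
        held_out = set(test_idx)
        train = [y for i, y in enumerate(labels) if i not in held_out]
        score = sum(train) / len(train)
        results.append(([labels[i] for i in test_idx], [score] * len(test_idx)))
    return results


# Hypothetical toy data: 12 samples, 6 per class, split into 3 unstratified folds.
labels = [1, 1, 1, 0, 1, 0, 0, 0, 1, 1, 0, 0]
folds = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11]]

results = prior_score_cv(labels, folds)

# Strategy 1: pool all test labels and scores, then compute a single AUC.
pooled_labels = [y for ys, _ in results for y in ys]
pooled_scores = [s for _, ss in results for s in ss]
pooled_auc = auc(pooled_labels, pooled_scores)

# Strategy 2: compute the AUC within each fold, then average.
averaged_auc = mean(auc(ys, ss) for ys, ss in results)

print(f"pooled AUC:   {pooled_auc:.3f}")
print(f"averaged AUC: {averaged_auc:.3f}")
```

In this toy setup, a fold whose test set holds extra positives has a training set with fewer positives, so its samples receive lower scores. Pooling therefore ranks positives below negatives and gives an AUC well under 0.5 (about 0.28 here), while the per-fold average stays at 0.5 because scores are tied within each fold.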

The Australian National University

Class prediction for high-dimensional class-imbalanced data

Author: Blagus Rok
Lusa Lara
Publication venue: BioMed Central
Publication date: 01/01/2010

Background: The goal of class prediction studies is to develop rules to accurately predict the class membership of new samples. The rules are derived using the values of the variables available for each subject: the main characteristic of high-dimensional data is that the number of variables greatly exceeds the number of samples. Frequently the classifiers are developed using class-imbalanced data, i.e., data sets where the number of samples in each class is not equal. Standard classification methods used on class-imbalanced data often produce classifiers that do not accurately predict the minority class; the prediction is biased towards the majority class. In this paper we investigate if the high-dimensionality poses additional challenges when dealing with class-imbalanced prediction. We evaluate the performance of six types of classifiers on class-imbalanced data, using simulated data and a publicly available data set from a breast cancer gene-expression microarray study. We also investigate the effectiveness of some strategies that are available to overcome the effect of class imbalance. Results: Our results show that the evaluated classifiers are highly sensitive to class imbalance and that variable selection introduces an additional bias towards classification into the majority class. Most new samples are assigned to the majority class from the training set, unless the difference between the classes is very large. As a consequence, the class-specific predictive accuracies differ considerably. When the class imbalance is not too severe, down-sizing and asymmetric bagging embedding variable selection work well, while over-sampling does not. Variable normalization can further worsen the performance of the classifiers.
Conclusions: Our results show that matching the prevalence of the classes in training and test set does not guarantee good performance of classifiers and that the problems related to classification with class-imbalanced data are exacerbated when dealing with high-dimensional data. Researchers using class-imbalanced data should be careful in assessing the predictive accuracy of the classifiers and, unless the class imbalance is mild, they should always use an appropriate method for dealing with the class imbalance problem.
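Down-sizing, one of the strategies named in the abstract above, can be sketched as random undersampling of the larger class. The abstract does not spell out the authors' exact procedure, so this is a generic version under that assumption; the function name `downsize`, the fixed seed, and the toy data are hypothetical.

```python
import random


def downsize(samples, labels, seed=0):
    """Randomly drop samples so that every class matches the smallest class size."""
    rng = random.Random(seed)  # fixed seed so the sketch is reproducible

    # Group samples by class label.
    by_class = {}
    for x, y in zip(samples, labels):
        by_class.setdefault(y, []).append(x)

    # Keep as many samples per class as the minority class has.
    n_min = min(len(xs) for xs in by_class.values())
    balanced = [(x, y) for y, xs in by_class.items() for x in rng.sample(xs, n_min)]
    rng.shuffle(balanced)
    return balanced


# Hypothetical toy data: 3 minority-class (1) and 7 majority-class (0) samples.
samples = list(range(10))
labels = [1] * 3 + [0] * 7

balanced = downsize(samples, labels)
print(balanced)
```

When variable selection is part of the pipeline, it would normally be run on the down-sized training set only, so that the held-out samples do not influence which variables are chosen.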

Springer - Publisher Connector

Directory of Open Access Journals

Essential guidelines for computational method benchmarking

Author: Boulesteix Anne-Laure
Cannoodt Robrecht
Gardner Paul P.
Hapfelmeier Alexander
Robinson Mark D.
Saelens Wouter
Saeys Yvan
Soneson Charlotte
Weber Lukas M.
Publication date: 01/01/2019

In computational biology and other sciences, researchers are frequently faced with a choice between several computational methods for performing data analyses. Benchmarking studies aim to rigorously compare the performance of different methods using well-characterized benchmark datasets, to determine the strengths of each method or to provide recommendations regarding suitable choices of methods for an analysis. However, benchmarking studies must be carefully designed and implemented to provide accurate, unbiased, and informative results. Here, we summarize key practical guidelines and recommendations for performing high-quality benchmarking analyses, based on our experiences in computational biology.

arXiv.org e-Print Archive

Tumor Transcriptome Sequencing Reveals Allelic Expression Imbalances Associated with Copy Number Alterations

Author: A Gabory
A Jemal
A Mortazavi
Adam J. Broomer
AE Al Moustafa
AP Feinberg
Asim S. Siddiqui
Brian B. Tuch
C Leethanakul
C Mayr
C Wissmann
CH Chen
Christina B. Chung
Cinna K. Monighetti
CJ Kang
CN Henrichsen
D Lipson
D Tsafrir
David I. Smith
DM Parkin
DR Bentley
DY Chiang
E Dehan
E Schuuring
Eric J. Moore
ET Wang
F Tang
Francisco M. De La Vega
G Clement
GH Perry
H Cui
H Sasaki
I Alevizos
IF Tsui
IJ Matouk
J Gil
J Herrick
J Kim
J Miyazawa
J Sebat
JA Lee
Jan L. Kasperbauer
JC Marioni
Jian Gu
JK Kim
K Freier
K Willert
K Yoshinaga
Kerry D. Olsen
L Xu
LA Liotta
M Egeblad
M Johnsen
Matthew W. Muller
Melissa A. Barker
MR Stratton
N Adachi
ND Trinklein
O Ogawa
P Cahan
P Krimpenfort
P Lopez
Pius M. Brzoska
PJ Campbell
PM Haverty
R Nusse
Rebecca R. Laborde
RK Thomas
Ruoying Tan
S Ghaemmaghami
S Ortega
S Rainier
S Wachi
SA McCarroll
Sarah J. Stanley
Scott Kuersten
SP Bohen
ST Sherry
T Beissbarth
T LaFramboise
Timothy Ravasi
TJ Belbin
TJ Ley
TM Holm
Xing Xu
Y Du
Y Hosokawa
Y Zhang
Yan W. Asmann
YH Yu
Yongming Sun
YS Lee
Z Wang
Publication venue: Public Library of Science
Publication date: 01/01/2010

Due to growing throughput and shrinking cost, massively parallel sequencing is rapidly becoming an attractive alternative to microarrays for the genome-wide study of gene expression and copy number alterations in primary tumors. The sequencing of transcripts (RNA-Seq) should offer several advantages over microarray-based methods, including the ability to detect somatic mutations and accurately measure allele-specific expression. To investigate these advantages we have applied a novel, strand-specific RNA-Seq method to tumors and matched normal tissue from three patients with oral squamous cell carcinomas. Additionally, to better understand the genomic determinants of the gene expression changes observed, we have sequenced the tumor and normal genomes of one of these patients. We demonstrate here that our RNA-Seq method accurately measures allelic imbalance and that measurement on the genome-wide scale yields novel insights into cancer etiology. As expected, the set of genes differentially expressed in the tumors is enriched for cell adhesion and differentiation functions, but, unexpectedly, the set of allelically imbalanced genes is also enriched for these same cancer-related functions. By comparing the transcriptomic perturbations observed in one patient to his underlying normal and tumor genomes, we find that allelic imbalance in the tumor is associated with copy number mutations and that copy number mutations are, in turn, strongly associated with changes in transcript abundance. These results support a model in which allele-specific deletions and duplications drive allele-specific changes in gene expression in the developing tumor.

CiteSeerX

Directory of Open Access Journals

An improved method for detecting and delineating genomic regions with altered gene expression in cancer

Author: Fioretos Thoas
Heyden Anders
Johansson Mikael
Nelander Sven
Nilsson Björn
Publication venue: BioMed Central
Publication date: 01/01/2008

A method is presented for identifying genomic regions with altered gene expression in gene expression maps.

Lund University Publications

Malmö University Electronic Publishing

Springer - Publisher Connector

Digitala Vetenskapliga Arkivet - Academic Archive On-line