6 research outputs found

    Re-sampling strategy to improve the estimation of number of null hypotheses in FDR control under strong correlation structures

    Background: When conducting multiple hypothesis tests, it is important to control the number of false positives, or the False Discovery Rate (FDR). However, there is a tradeoff between controlling the FDR and maximizing power. Several methods, such as the q-value method, have been proposed to estimate the proportion of true null hypotheses among the tested hypotheses and to use this estimate when controlling the FDR. These methods usually depend on the assumption that the test statistics are independent (or only weakly correlated). However, many types of data, for example microarray data, often contain large-scale correlation structures. Our objective was to develop methods that control the FDR while maintaining greater power in highly correlated datasets by improving the estimation of the proportion of null hypotheses. Results: We showed that when strong correlation exists among the data, which is common in microarray datasets, the estimate of the proportion of null hypotheses can be highly variable, resulting in a high level of variation in the FDR. We therefore developed a re-sampling strategy that reduces this variation by breaking the correlations between gene expression values, and then uses a conservative strategy of selecting the upper quartile of the re-sampling estimates to obtain strong control of the FDR. Conclusion: In simulation studies and perturbations of actual microarray datasets, our method, compared to competing methods such as the q-value, generated slightly biased estimates of the proportion of null hypotheses but with lower mean squared errors. When selecting genes while controlling the same FDR level, our methods had, on average, a significantly lower false discovery rate in exchange for a minor reduction in power.
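
    The abstract above does not spell out the exact re-sampling procedure, so the following Python sketch only illustrates the general idea under stated assumptions: break inter-gene correlation by permuting each gene independently within treatment groups, re-estimate the proportion of true nulls on each resample (here with a simple Storey-style estimator), and keep the upper quartile of the estimates as the conservative value. The function names and the choice of estimator are hypothetical, not the authors' implementation.

```python
import numpy as np
from scipy import stats

def estimate_pi0(pvals, lam=0.5):
    """Storey-style estimate: rescaled fraction of p-values above lambda."""
    return min(1.0, np.mean(pvals > lam) / (1.0 - lam))

def upper_quartile_pi0(expr, groups, n_resamples=100, seed=0):
    """expr: genes x samples matrix; groups: boolean mask of group membership."""
    rng = np.random.default_rng(seed)
    estimates = []
    for _ in range(n_resamples):
        shuffled = expr.copy()
        for mask in (groups, ~groups):
            idx = np.where(mask)[0]
            for g in range(expr.shape[0]):
                # permute each gene independently within the group: group
                # differences are preserved, gene-gene correlation is broken
                shuffled[g, idx] = expr[g, rng.permutation(idx)]
        _, pvals = stats.ttest_ind(shuffled[:, groups], shuffled[:, ~groups], axis=1)
        estimates.append(estimate_pi0(pvals))
    # conservative choice described in the abstract: the upper quartile
    return float(np.quantile(estimates, 0.75))
```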

    Sources of variation in false discovery rate estimation include sample size, correlation, and inherent differences between groups

    BACKGROUND: High-throughput technologies enable the testing of tens of thousands of measurements simultaneously. Identification of genes that are differentially expressed or associated with clinical outcomes invokes the multiple testing problem. False Discovery Rate (FDR) control is a statistical method used to correct for multiple comparisons for independent or weakly dependent test statistics. Although FDR control is frequently applied to microarray data analysis, gene expression is usually correlated, which might lead to inaccurate estimates. In this paper, we evaluate the accuracy of FDR estimation. METHODS: Using two real data sets, we resampled subgroups of patients and recalculated statistics of interest to illustrate the imprecision of FDR estimation. Next, we generated many simulated data sets with block correlation structures and realistic noise parameters, using the Ultimate Microarray Prediction, Inference, and Reality Engine (UMPIRE) R package. We estimated FDR using a beta-uniform mixture (BUM) model and examined the variation in FDR estimation. RESULTS: The three major sources of variation in FDR estimation are the sample size, correlations among genes, and the true proportion of differentially expressed genes (DEGs). The sample size and proportion of DEGs affect both the magnitude and precision of FDR estimation, while the correlation structure mainly affects the variation of the estimated parameters. CONCLUSIONS: We have decomposed the various factors that affect FDR estimation and illustrated the direction and extent of their impact. We found that the proportion of DEGs has a significant impact on FDR; this factor might have been overlooked in previous studies and deserves more thought when controlling FDR.
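
    As a hedged sketch of what a BUM-based FDR estimate looks like (this is the standard beta-uniform mixture form f(p) = λ + (1 − λ)·a·p^(a−1), not the code used in the paper), the mixture can be fit to the observed p-values by maximum likelihood and then used to bound the fraction of true nulls and the FDR at a chosen cutoff:

```python
import numpy as np
from scipy.optimize import minimize

def fit_bum(pvals):
    """Fit f(p) = lam + (1 - lam) * a * p**(a - 1) by maximum likelihood."""
    p = np.clip(pvals, 1e-12, 1.0)            # guard against log(0) blow-ups
    def neg_loglik(params):
        lam, a = params
        dens = lam + (1.0 - lam) * a * np.power(p, a - 1.0)
        return -np.sum(np.log(dens))
    res = minimize(neg_loglik, x0=[0.5, 0.5],
                   bounds=[(1e-6, 1 - 1e-6), (1e-6, 1 - 1e-6)])
    return res.x                               # (lam, a)

def bum_fdr(pvals, tau):
    """Estimated FDR when calling p-values <= tau significant."""
    lam, a = fit_bum(pvals)
    pi0_upper = lam + (1.0 - lam) * a          # upper bound on the true-null fraction
    called = np.mean(pvals <= tau)             # observed fraction below the cutoff
    return min(1.0, pi0_upper * tau / called)
```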

    Heading Down the Wrong Pathway: on the Influence of Correlation within Gene Sets

    Background: Analysis of microarray experiments often involves testing for the overrepresentation of pre-defined sets of genes among lists of genes deemed individually significant. Most popular gene set testing methods assume the independence of genes within each set, an assumption that is seriously violated, as extensive correlation between genes is a well-documented phenomenon. Results: We conducted a meta-analysis of over 200 datasets from the Gene Expression Omnibus to demonstrate the practical impact of strong gene correlation patterns that are highly consistent across experiments. We show that a common independence-assumption-based gene set testing procedure produces very high false positive rates when applied to data sets in which treatment groups have been randomized, and that gene sets with high internal correlation are more likely to be declared significant. A reanalysis of the same datasets using an array resampling approach properly controls false positive rates, leading to more parsimonious and high-confidence gene set findings, which should facilitate pathway-based interpretation of the microarray data. Conclusions: These findings call into question many of the gene set testing results in the literature and argue strongly for the adoption of resampling-based gene set testing criteria in the peer-reviewed biomedical literature.
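
    The array-resampling idea can be sketched as a generic sample-label permutation test (assumed here for illustration, not taken from the paper): permuting sample labels keeps the gene-gene correlation structure intact in every permuted data set, so the null distribution of a set-level statistic automatically reflects that correlation, unlike gene-permutation or independence-based tests.

```python
import numpy as np
from scipy import stats

def set_stat(expr, groups, gene_idx):
    """Mean absolute t-statistic over the genes in the set (one simple choice)."""
    sub = expr[gene_idx]
    t, _ = stats.ttest_ind(sub[:, groups], sub[:, ~groups], axis=1)
    return np.mean(np.abs(t))

def array_resampling_pvalue(expr, groups, gene_idx, n_perm=1000, seed=0):
    """Permute sample labels (columns), preserving gene-gene correlation."""
    rng = np.random.default_rng(seed)
    observed = set_stat(expr, groups, gene_idx)
    null = np.array([set_stat(expr, rng.permutation(groups), gene_idx)
                     for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)
```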

    Towards Accurate Estimation of the Proportion of True Null Hypotheses in Multiple Testing

    BACKGROUND: Biomedical researchers are now often faced with situations where it is necessary to test a large number of hypotheses simultaneously, e.g., in comparative gene expression studies using high-throughput microarray technology. To properly control false positive errors, the FDR (false discovery rate) approach has become widely used in multiple testing. Accurate estimation of the FDR requires that the proportion of true null hypotheses be accurately estimated. To date, many methods for estimating this quantity have been proposed. Typically, when a new method is introduced, some simulations are carried out to show its improved accuracy. However, these simulations often cover only a few points in the parameter space. RESULTS: Here I have carried out extensive in silico experiments to compare some commonly used methods for estimating the proportion of true null hypotheses. The coverage of these simulations over the parameter space is unprecedentedly thorough compared to typical simulation studies in the literature. This work therefore enables us to draw global conclusions about the performance of these different methods. It was found that a very simple method gives the most accurate estimation over a dominantly large area of the parameter space. Given its simplicity and its overall superior accuracy, I recommend its use as the first choice for estimating the proportion of true null hypotheses in multiple testing.
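
    The abstract does not name the recommended estimator, so purely as an illustration of the quantity being estimated, here are two standard π0 estimators that commonly appear in such comparisons (Storey's λ-threshold estimator and the "twice the mean p-value" estimator); neither is claimed to be the method the author recommends.

```python
import numpy as np

def pi0_storey(pvals, lam=0.5):
    """Storey's estimator: p-values above lambda are assumed to be mostly nulls."""
    return min(1.0, np.mean(pvals > lam) / (1.0 - lam))

def pi0_twice_mean(pvals):
    """Pounds & Cheng style estimate: under the null E[p] = 0.5, so use 2 * mean(p)."""
    return min(1.0, 2.0 * np.mean(pvals))
```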

    Modelling p-value distributions to improve theme-driven survival analysis of cancer transcriptome datasets

    Background: Theme-driven cancer survival studies address whether the expression signature of genes related to a biological process can predict patient survival time. Although this should ideally be achieved by testing two separate null hypotheses, current methods treat both hypotheses as one. The first test should assess whether a geneset, independent of its composition, is associated with prognosis (frequently done with a survival test). The second test then verifies whether the theme of the geneset is relevant (usually done with an empirical test that compares the geneset of interest with random genesets). Current methods do not test this second null hypothesis because it has been assumed that the distribution of p-values for random genesets (when tested against the first null hypothesis) is uniform. Here we demonstrate that such an assumption is generally incorrect and, consequently, that such methods may erroneously associate the biology of a particular geneset with cancer prognosis. Results: To assess the impact of non-uniform distributions for random genesets in such studies, an automated theme-driven method was developed. This method empirically approximates the p-value distribution of sets of unrelated genes using a permutation approach, and tests whether predefined sets of biologically related genes are associated with survival. A comparison with a published theme-driven approach revealed non-uniform distributions, suggesting a significant problem with false positive rates in the original study. When applied to two public cancer datasets, our technique revealed novel ontological categories with prognostic power, including significant associations of "fatty acid metabolism" with overall survival in breast cancer, and of "receptor mediated endocytosis", "brain development", "apical plasma membrane" and "MAPK signaling pathway" with overall survival in lung cancer. Conclusions: Current methods for theme-driven survival studies assume uniformity of p-values for random genesets, which can lead to false conclusions. Our approach corrects for this pitfall and provides a novel route to identifying higher-level biological themes and pathways with prognostic power in clinical microarray datasets.
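
    A minimal sketch of the second-stage test described above, under assumed interfaces: `survival_pvalue` is a hypothetical stand-in for whatever first-stage survival test is used (for example, a log-rank test on a signature-based grouping), and the empirical distribution of its p-values over size-matched random genesets replaces the uniformity assumption.

```python
import numpy as np

def theme_pvalue(survival_pvalue, expr, times, events, gene_idx,
                 n_random=1000, seed=0):
    """Second-stage empirical p-value for a geneset of interest."""
    rng = np.random.default_rng(seed)
    n_genes, k = expr.shape[0], len(gene_idx)
    observed = survival_pvalue(expr[gene_idx], times, events)
    random_ps = np.array([
        survival_pvalue(expr[rng.choice(n_genes, size=k, replace=False)],
                        times, events)
        for _ in range(n_random)
    ])
    # empirical distribution replaces the (incorrect) uniformity assumption:
    # how often does a size-matched random geneset do at least as well?
    return (1 + np.sum(random_ps <= observed)) / (1 + n_random)
```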

    The Limitations of Simple Gene Set Enrichment Analysis Assuming Gene Independence

    Since its first publication in 2003, the Gene Set Enrichment Analysis (GSEA) method, based on the Kolmogorov-Smirnov statistic, has been heavily used, modified, and also questioned. Recently, a simplified approach that uses a one-sample t-test score to assess enrichment and ignores gene-gene correlations was proposed by Irizarry et al. 2009 as a serious contender. The argument criticizes GSEA's nonparametric nature and its use of an empirical null distribution as unnecessary and hard to compute. We refute these claims by careful consideration of the assumptions of the simplified method and of its results, including a comparison with GSEA's on a large benchmark set of 50 datasets. Our results provide strong empirical evidence that gene-gene correlations cannot be ignored, due to the significant variance inflation they produce in the enrichment scores, and should be taken into account when estimating gene set enrichment significance. In addition, we discuss the challenges that the complex correlation structure and multi-modality of gene sets pose more generally for gene set enrichment methods.
    Comment: Submitted to Statistical Methods in Medical Research
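
    The variance-inflation point can be illustrated with a small simulation (not taken from the paper): for a set of m genes whose per-gene null scores have average pairwise correlation ρ, the variance of the set average is (1 + (m − 1)ρ) times what independence predicts, so a simple one-sample t-test on the scores is badly anti-conservative under the null.

```python
import numpy as np
from scipy import stats

def false_positive_rate(m=50, rho=0.2, n_sim=5000, alpha=0.05, seed=0):
    """Type I error of a naive one-sample t test on correlated null gene scores."""
    rng = np.random.default_rng(seed)
    # equicorrelated null scores: every pair of genes has correlation rho
    cov = np.full((m, m), rho) + (1.0 - rho) * np.eye(m)
    scores = rng.multivariate_normal(np.zeros(m), cov, size=n_sim)
    # one-sample t test per simulated gene set, ignoring the correlation
    _, p = stats.ttest_1samp(scores, popmean=0.0, axis=1)
    return float(np.mean(p < alpha))

# With rho = 0.2 the set-average variance is inflated by 1 + (m - 1) * rho ~ 10.8x,
# so the observed rate is far above the nominal 0.05.
print(false_positive_rate())
```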