
    Size, power and false discovery rates

    Modern scientific technology has provided a new class of large-scale simultaneous inference problems, with thousands of hypothesis tests to consider at the same time. Microarrays epitomize this type of technology, but similar situations arise in proteomics, spectroscopy, imaging, and social science surveys. This paper uses false discovery rate methods to carry out both size and power calculations on large-scale problems. A simple empirical Bayes approach allows the false discovery rate (fdr) analysis to proceed with a minimum of frequentist or Bayesian modeling assumptions. Closed-form accuracy formulas are derived for estimated false discovery rates and used to compare different methodologies: local or tail-area fdr's; theoretical, permutation, or empirical null hypothesis estimates. Two microarray data sets as well as simulations are used to evaluate the methodology, with the power diagnostics showing why nonnull cases might easily fail to appear on a list of "significant" discoveries. Comment: Published in the Annals of Statistics (http://dx.doi.org/10.1214/009053606000001460; http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org).
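    As an illustration of the two-groups logic in this abstract, the sketch below estimates a local fdr as pi0 * f0(z) / f(z) from simulated z-values, using a theoretical N(0,1) null and a kernel estimate of the mixture density. The simulated mixture, the crude central-matching estimate of pi0, and the 0.2 cutoff are all illustrative assumptions, not Efron's actual estimator.

```python
# Minimal sketch of a local false discovery rate (fdr) estimate under the
# two-groups model fdr(z) = pi0 * f0(z) / f(z), with a theoretical N(0,1)
# null and a kernel estimate of the mixture density. Illustrative only.
import numpy as np
from scipy.stats import norm, gaussian_kde

rng = np.random.default_rng(0)
# Simulated z-values: 95% null N(0,1), 5% non-null N(3,1) (hypothetical mix).
z = np.concatenate([rng.normal(0, 1, 9500), rng.normal(3, 1, 500)])

f = gaussian_kde(z)                               # mixture density estimate f(z)
# Crude central-matching estimate of pi0: compare mass near zero to the null.
pi0 = min(1.0, np.mean(np.abs(z) < 1) / (norm.cdf(1) - norm.cdf(-1)))
fdr = np.clip(pi0 * norm.pdf(z) / f(z), 0, 1)     # local fdr for each case

print(f"estimated pi0 = {pi0:.3f}")
print(f"cases with fdr < 0.2: {(fdr < 0.2).sum()}")
```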

    Effect of pooling samples on the efficiency of comparative studies using microarrays

    Many biomedical experiments are carried out by pooling individual biological samples. However, pooling can hide biological variance and give false confidence about the significance of the data. In the context of microarray experiments for detecting differentially expressed genes, recent publications have addressed the efficiency of sample pooling, and some approximate formulas have been provided for power and sample size calculations. It is desirable to have exact formulas for these calculations and to check the approximate results against the exact ones; we show that the difference between the two can be large. In this study, we quantitatively characterize the effect of pooling samples on the efficiency of microarray experiments for detecting differential gene expression between two classes. We present exact formulas for calculating the power of microarray experimental designs involving sample pooling and technical replication. The formulas can be used to determine the total numbers of arrays and biological subjects required to achieve the desired power at a given significance level. The conditions under which a pooled design becomes preferable to a non-pooled design can then be derived from the unit costs of a microarray and of a biological subject. This paper thus provides guidance on sample pooling and cost effectiveness. The formulation is outlined in the context of microarray comparative studies, but it applies to a wide range of biomedical comparative studies in which sample pooling may be involved. Comment: 8 pages, 1 figure, 2 tables; to appear in Bioinformatics.
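    A back-of-envelope version of the power comparison discussed here, under the common assumption that the per-array variance of a pool of r subjects decomposes as biological/r + technical, might look like the following. The normal approximation and all numbers are assumptions for illustration; they are not the exact formulas derived in the paper.

```python
# Hedged sketch: approximate power for a two-group comparison with sample
# pooling. Each array measures a pool of r subjects, so its variance is
# assumed to be sigma_b^2 / r + sigma_e^2. Not the paper's exact formulas.
import math
from scipy.stats import norm

def approx_power(delta, sigma_b2, sigma_e2, n_arrays, r_pool, alpha=0.001):
    """Normal-approximation power for detecting a mean difference `delta`
    with n_arrays arrays per group, each pooling r_pool subjects."""
    var_per_array = sigma_b2 / r_pool + sigma_e2
    se = math.sqrt(2 * var_per_array / n_arrays)
    z_crit = norm.ppf(1 - alpha / 2)
    return norm.cdf(delta / se - z_crit) + norm.cdf(-delta / se - z_crit)

# Example: pooling 3 subjects per array vs. no pooling (hypothetical numbers).
print(approx_power(delta=1.0, sigma_b2=0.8, sigma_e2=0.2, n_arrays=5, r_pool=1))
print(approx_power(delta=1.0, sigma_b2=0.8, sigma_e2=0.2, n_arrays=5, r_pool=3))
```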

    Microarrays, Empirical Bayes and the Two-Groups Model

    The classic frequentist theory of hypothesis testing developed by Neyman, Pearson and Fisher has a claim to being the twentieth century's most influential piece of applied mathematics. Something new is happening in the twenty-first century: high-throughput devices, such as microarrays, routinely require simultaneous hypothesis tests for thousands of individual cases, not at all what the classical theory had in mind. In these situations empirical Bayes information begins to force itself upon frequentists and Bayesians alike. The two-groups model is a simple Bayesian construction that facilitates empirical Bayes analysis. This article concerns the interplay of Bayesian and frequentist ideas in the two-groups setting, with particular attention to Benjamini and Hochberg's False Discovery Rate method. Topics include the choice and meaning of the null hypothesis in large-scale testing situations, power considerations, the limitations of permutation methods, significance testing for groups of cases (such as pathways in microarray studies), correlation effects, multiple confidence intervals, and Bayesian competitors to the two-groups model. Comment: Commented on in [arXiv:0808.0582], [arXiv:0808.0593], [arXiv:0808.0597], [arXiv:0808.0599]; rejoinder in [arXiv:0808.0603]. Published in Statistical Science (http://dx.doi.org/10.1214/07-STS236; http://www.imstat.org/sts/) by the Institute of Mathematical Statistics (http://www.imstat.org).
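    For reference, the Benjamini and Hochberg step-up procedure mentioned in this abstract can be sketched in a few lines: sort the m p-values, find the largest k with p_(k) <= (k/m) * q, and reject the k smallest. The toy p-values in the example are made up.

```python
# Sketch of the Benjamini-Hochberg step-up procedure at FDR level q.
import numpy as np

def benjamini_hochberg(pvals, q=0.10):
    """Return a boolean array marking the hypotheses rejected at FDR level q."""
    p = np.asarray(pvals)
    m = p.size
    order = np.argsort(p)
    thresholds = q * (np.arange(1, m + 1) / m)
    below = p[order] <= thresholds
    reject = np.zeros(m, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])       # largest index meeting the bound
        reject[order[:k + 1]] = True
    return reject

print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.6, 0.9], q=0.10))
```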

    Tellipsoid: Exploiting inter-gene correlation for improved detection of differential gene expression

    Motivation: Algorithms for differential analysis of microarray data are vital to modern biomedical research. Their accuracy depends strongly on effective treatment of inter-gene correlation. Correlation is ordinarily accounted for only through its effect on significance cut-offs. In this paper it is shown that correlation can, in fact, be exploited to share information across tests, which in turn can increase statistical power. Results: Vastly and demonstrably improved differential analysis approaches result from combining identifiability (the fact that in most microarray data sets a large proportion of genes can be identified a priori as non-differential) with optimization criteria that incorporate correlation. As a special case, we develop a method that builds upon the widely used two-sample t-statistic and uses the Mahalanobis distance as an optimality criterion. Results on the prostate cancer data of Singh et al. (2002) suggest that the proposed method outperforms all published approaches in terms of statistical power. Availability: The proposed algorithm is implemented in MATLAB and in R. The software, called Tellipsoid, and relevant data sets are available at http://www.egr.msu.edu/~desaikey Comment: 19 pages; submitted to Bioinformatics.
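    The sketch below only illustrates the two ingredients named in the abstract, per-gene two-sample t-statistics and an inter-gene (Mahalanobis) covariance criterion, on simulated data; it is not the Tellipsoid algorithm, and the ridge regularization of the covariance is an assumption added for numerical stability.

```python
# Illustrative only: per-gene t-statistics plus a Mahalanobis distance of the
# mean-difference vector under the estimated inter-gene covariance.
import numpy as np

rng = np.random.default_rng(1)
genes, n1, n2 = 50, 10, 10
X = rng.normal(size=(n1, genes))              # class 1 expression (hypothetical)
Y = rng.normal(size=(n2, genes))              # class 2 expression
Y[:, :5] += 1.5                               # first 5 genes truly differential

diff = Y.mean(axis=0) - X.mean(axis=0)
pooled = np.cov(np.vstack([X - X.mean(0), Y - Y.mean(0)]).T)  # inter-gene covariance

# Per-gene two-sample t-statistics (Welch-type denominator).
s2 = X.var(0, ddof=1) / n1 + Y.var(0, ddof=1) / n2
t_stats = diff / np.sqrt(s2)

# Mahalanobis distance of the mean-difference vector, ridge-regularized
# because the sample covariance is singular when genes outnumber samples.
cov_reg = pooled + 0.1 * np.eye(genes)
maha = float(diff @ np.linalg.solve(cov_reg, diff))
print(t_stats[:5].round(2), round(maha, 2))
```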

    Sequential stopping for high-throughput experiments

    In high-throughput experiments, the sample size is typically chosen informally. Most formal sample-size calculations depend critically on prior knowledge. We propose a sequential strategy that, by updating knowledge as new data become available, depends less critically on prior assumptions. Experiments are stopped or continued based on the potential benefit of obtaining additional data. The underlying decision-theoretic framework guarantees that the design proceeds in a coherent fashion. We propose intuitively appealing, easy-to-implement utility functions. As in most sequential design problems, an exact solution is prohibitive. We propose a simulation-based approximation that uses decision boundaries. We apply the method to RNA-seq, microarray, and reverse-phase protein array studies and show its potential advantages. The approach has been added to the Bioconductor package gaga.
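    A crude stand-in for the sequential idea, assuming a simple rule in place of the paper's decision-theoretic boundaries: after each new batch of samples, recompute the number of FDR-controlled discoveries and stop when the increment no longer covers an assumed per-batch cost. This is not the gaga package's procedure.

```python
# Toy sequential stopping rule on simulated data: stop when an extra batch of
# samples no longer buys enough additional discoveries (assumed cost).
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(2)
genes, batch, max_batches, cost_in_discoveries = 2000, 4, 10, 5

def simulate_batch(n):
    x = rng.normal(size=(n, genes))
    y = rng.normal(size=(n, genes))
    y[:, :100] += 0.8                          # 100 truly differential genes
    return x, y

def n_discoveries(x, y, q=0.1):
    """Count Benjamini-Hochberg discoveries at FDR level q."""
    p = np.sort(ttest_ind(x, y, axis=0).pvalue)
    hits = np.nonzero(p <= q * np.arange(1, genes + 1) / genes)[0]
    return 0 if hits.size == 0 else int(hits.max()) + 1

X, Y = simulate_batch(batch)
prev = n_discoveries(X, Y)
for b in range(2, max_batches + 1):
    x_new, y_new = simulate_batch(batch)
    X, Y = np.vstack([X, x_new]), np.vstack([Y, y_new])
    cur = n_discoveries(X, Y)
    print(f"batch {b}: {cur} discoveries (+{cur - prev})")
    if cur - prev < cost_in_discoveries:       # assumed stopping boundary
        break
    prev = cur
```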

    The Reproducibility of Lists of Differentially Expressed Genes in Microarray Studies

    Reproducibility is a fundamental requirement in scientific experiments and clinical contexts. Recent publications raise concerns about the reliability of microarray technology because of the apparent lack of agreement between lists of differentially expressed genes (DEGs). In this study we demonstrate that (1) such discordance may stem from ranking and selecting DEGs solely by statistical significance (P) derived from widely used simple t-tests; (2) when fold change (FC) is used as the ranking criterion, the lists become much more reproducible, especially when fewer genes are selected; and (3) the instability of short DEG lists based on P cutoffs is an expected mathematical consequence of the high variability of the t-values. We recommend FC ranking plus a non-stringent P cutoff as a baseline practice for generating more reproducible DEG lists: the FC criterion enhances reproducibility, while the P criterion balances sensitivity and specificity.
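    The reproducibility comparison can be mimicked on simulated data: rank genes by |fold change| or by t-test P in two replicate data sets and count the overlap of the top-k lists. Everything below, including the effect sizes and k = 100, is a made-up illustration of the comparison, not the study's analysis.

```python
# Sketch: compare top-k list overlap across two replicate experiments when
# ranking by fold change (FC) versus by t-test P. Simulated data only.
import numpy as np
from scipy.stats import ttest_ind

rng = np.random.default_rng(3)
genes, n = 5000, 12
effects = rng.normal(0.7, 0.3, 200)            # fixed true effects, first 200 genes

def dataset():
    a = rng.normal(size=(n, genes))
    b = rng.normal(size=(n, genes))
    b[:, :200] += effects                      # same truly differential genes
    return a, b

def top_k(a, b, k=100, by="fc"):
    if by == "fc":
        score = np.abs(b.mean(0) - a.mean(0))          # |fold change| on log scale
    else:
        score = -ttest_ind(a, b, axis=0).pvalue        # smaller P = higher rank
    return set(np.argsort(score)[-k:])

(a1, b1), (a2, b2) = dataset(), dataset()
for by in ("fc", "p"):
    overlap = len(top_k(a1, b1, by=by) & top_k(a2, b2, by=by))
    print(f"top-100 overlap when ranked by {by}: {overlap}")
```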

    Random-set methods identify distinct aspects of the enrichment signal in gene-set analysis

    A prespecified set of genes may be enriched, to varying degrees, for genes that have altered expression levels relative to two or more states of a cell. Knowing the enrichment of gene sets defined by functional categories, such as gene ontology (GO) annotations, is valuable for analyzing the biological signals in microarray expression data. A common approach to measuring enrichment is to cross-classify genes according to membership in a functional category and membership on a selected list of significantly altered genes; a small Fisher's exact test p-value in this 2×2 table, for example, is indicative of enrichment. Other category analysis methods retain the quantitative gene-level scores and measure significance by referring a category-level statistic to a permutation distribution associated with the original differential expression problem. We describe a class of random-set scoring methods that measure distinct components of the enrichment signal. The class includes Fisher's test based on selected genes and also tests that average gene-level evidence across the category. Averaging and selection methods are compared empirically using Affymetrix data on expression in nasopharyngeal cancer tissue, and theoretically using a location model of differential expression. We find that each method has a domain of superiority in the state space of enrichment problems, and that both methods have benefits in practice. Our analysis also addresses two problems related to multiple-category inference: equally enriched categories are not detected with equal probability if they are of different sizes, and there is dependence among category statistics owing to shared genes. Random-set enrichment calculations do not require Monte Carlo for implementation. They are made available in the R package allez. Comment: Published in the Annals of Applied Statistics (http://dx.doi.org/10.1214/07-AOAS104; http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
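    The two kinds of enrichment evidence contrasted here can be sketched directly: a one-sided Fisher (hypergeometric) test on the 2×2 table of category membership versus list membership, and an averaging statistic that standardizes the mean gene-level score in the category against sampling without replacement. The simulated scores and the standardization below are illustrative assumptions, not the allez implementation.

```python
# Sketch: selection-based vs. averaging-based enrichment evidence.
import numpy as np
from scipy.stats import hypergeom

rng = np.random.default_rng(4)
G = 10000
scores = rng.normal(size=G)                     # gene-level evidence (e.g., |t|)
category = np.zeros(G, dtype=bool)
category[:50] = True                            # a 50-gene category
scores[:50] += 0.5                              # category mildly enriched
selected = scores > np.quantile(scores, 0.99)   # "significant" gene list

# (1) Selection-based: one-sided Fisher/hypergeometric test on the 2x2 table.
k, n_sel, m = int(np.sum(category & selected)), int(selected.sum()), int(category.sum())
p_fisher = hypergeom.sf(k - 1, G, m, n_sel)     # P(overlap >= k)
print("Fisher p =", p_fisher)

# (2) Averaging-based: standardized mean score within the category,
#     using the variance of a mean sampled without replacement.
se = scores.std(ddof=0) * np.sqrt((G - m) / (m * (G - 1)))
z = (scores[category].mean() - scores.mean()) / se
print("random-set style z =", round(z, 2))
```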

    Quick Calculation for Sample Size while Controlling False Discovery Rate with Application to Microarray Analysis

    Sample size estimation is important in microarray and proteomic experiments, since biologists can typically afford only a few repetitions. In the multiple testing problems arising from these experiments, it is more powerful and more reasonable to control the false discovery rate (FDR) or positive FDR (pFDR) rather than a type I error rate such as the family-wise error rate (FWER) (Storey and Tibshirani, 2003). However, the traditional approach to estimating sample size does not apply when controlling FDR, leaving most practitioners to rely on haphazard guessing. We propose a procedure to calculate sample size while controlling the false discovery rate. The two major definitions of the false discovery rate (FDR in Benjamini and Hochberg, 1995, and pFDR in Storey, 2002) differ only slightly, and our procedure applies to both. The proposed method is straightforward to apply and requires minimal computation, as illustrated with two-sample t-tests and F-tests. We also demonstrate by simulation that, with the calculated sample size, the desired level of power is achievable by the q-value procedure (Storey, Taylor and Siegmund, 2004) whether gene expressions are independent or dependent.
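    A back-of-envelope version of the sample-size logic, assuming the usual approximation FDR ≈ m0·α / (m0·α + m1·power) and a normal approximation to the two-sample test: solve for the per-gene level α that keeps the anticipated FDR at the target, then choose n per group to reach the desired power at that α. This is a rough sketch, not the paper's exact procedure or the q-value calibration.

```python
# Hedged sketch: per-gene significance level implied by a target FDR, and the
# two-sample sample size (normal approximation) that achieves the target power.
import math
from scipy.stats import norm

def required_n(delta, sigma, m, m1, target_fdr=0.05, power=0.8):
    """Approximate per-group sample size under FDR ~ m0*alpha/(m0*alpha + m1*power)."""
    m0 = m - m1
    alpha = (target_fdr / (1 - target_fdr)) * (m1 * power) / m0   # per-test level
    z_a, z_b = norm.ppf(1 - alpha / 2), norm.ppf(power)
    n = 2 * ((z_a + z_b) * sigma / delta) ** 2                    # per group
    return alpha, math.ceil(n)

# Example: 10,000 genes, 100 anticipated non-null, effect of 1 SD (hypothetical).
print(required_n(delta=1.0, sigma=1.0, m=10000, m1=100))
```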

    On testing the significance of sets of genes

    This paper discusses the problem of identifying differentially expressed groups of genes from a microarray experiment. The groups of genes are externally defined, for example, sets of gene pathways derived from biological databases. Our starting point is the interesting Gene Set Enrichment Analysis (GSEA) procedure of Subramanian et al. [Proc. Natl. Acad. Sci. USA 102 (2005) 15545-15550]. We study the problem in some generality and propose two potential improvements to GSEA: the maxmean statistic for summarizing gene-sets, and restandardization for more accurate inferences. We discuss a variety of examples and extensions, including the use of gene-set scores for class predictions. We also describe a new R language package, GSA, that implements our ideas. Comment: Published in the Annals of Applied Statistics (http://dx.doi.org/10.1214/07-AOAS101; http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
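    A hedged sketch of the maxmean summary described here: average the positive and negative parts of the gene-level scores in the set separately and keep whichever is larger in magnitude. Restandardization is approximated below by centering and scaling against random gene sets of the same size, which is only a stand-in for the procedure in the paper and the GSA package.

```python
# Sketch of a maxmean-style gene-set statistic on simulated gene scores.
import numpy as np

def maxmean(scores):
    """Mean of the positive or negative parts, whichever is larger in magnitude."""
    s_pos = np.mean(np.maximum(scores, 0))
    s_neg = np.mean(np.minimum(scores, 0))
    return s_pos if s_pos > -s_neg else s_neg

rng = np.random.default_rng(5)
gene_scores = rng.normal(size=10000)           # e.g., per-gene t-scores
gene_scores[:40] += 0.6                        # a modestly shifted pathway
pathway = np.arange(40)

obs = maxmean(gene_scores[pathway])
# Center/scale against random gene sets of the same size (stand-in for
# restandardization; assumption, not the paper's exact recipe).
null = np.array([maxmean(gene_scores[rng.choice(10000, 40, replace=False)])
                 for _ in range(2000)])
z = (obs - null.mean()) / null.std(ddof=1)
print(f"maxmean = {obs:.3f}, restandardized-style z = {z:.2f}")
```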