
    A Bayesian approach to efficient differential allocation for resampling-based significance testing

    Background: Large-scale statistical analyses have become hallmarks of post-genomic era biological research due to advances in high-throughput assays and the integration of large biological databases. One accompanying issue is the simultaneous estimation of p-values for a large number of hypothesis tests. In many applications, a parametric assumption on the null distribution, such as normality, may be unreasonable, and resampling-based p-values are the preferred procedure for establishing statistical significance. Using resampling-based procedures for multiple testing is computationally intensive and typically requires large numbers of resamples. Results: We present a new approach to more efficiently assign resamples (such as bootstrap samples or permutations) within a nonparametric multiple testing framework. We formulated a Bayesian-inspired approach to this problem, and devised an algorithm that adapts the assignment of resamples iteratively with negligible space and running-time overhead. In two experimental studies, a breast cancer microarray dataset and a genome-wide association study dataset for Parkinson's disease, we demonstrated that our differential allocation procedure is substantially more accurate than the traditional uniform resample allocation. Conclusion: Our experiments demonstrate that using a more sophisticated allocation strategy can improve our inference for hypothesis testing without a drastic increase in the amount of computation on randomized data. Moreover, we gain more improvement in efficiency when the number of tests is large. R code for our algorithm and the shortcut method are available at http://people.pcbi.upenn.edu/~lswang/pub/bmc2009/.
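    As a point of reference for the resampling framework this abstract builds on, the sketch below computes permutation p-values with a fixed, uniform number of resamples per test, which is the baseline the authors' differential allocation improves upon. It is a generic Python illustration, not the authors' algorithm; the data, the difference-in-means statistic, and the resample budget are illustrative assumptions.

        # Uniform-allocation permutation p-values: every hypothesis receives
        # the same number of resamples, regardless of how close its p-value
        # is to the significance threshold.
        import numpy as np

        rng = np.random.default_rng(0)

        def permutation_pvalue(x, y, n_resamples=1000):
            """Two-sample permutation p-value for a difference in means."""
            observed = abs(x.mean() - y.mean())
            pooled = np.concatenate([x, y])
            count = 0
            for _ in range(n_resamples):
                rng.shuffle(pooled)
                stat = abs(pooled[:len(x)].mean() - pooled[len(x):].mean())
                count += stat >= observed
            # add-one correction keeps the resampling p-value strictly positive
            return (count + 1) / (n_resamples + 1)

        # One test per "gene"; uniform allocation spends 1000 resamples on each.
        cases = [rng.normal(0.5, 1.0, size=10) for _ in range(5)]
        controls = [rng.normal(0.0, 1.0, size=10) for _ in range(5)]
        print([permutation_pvalue(x, y) for x, y in zip(cases, controls)])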

    Some nonasymptotic results on resampling in high dimension, I: Confidence regions, II: Multiple tests

    We study generalized bootstrap confidence regions for the mean of a random vector whose coordinates have an unknown dependency structure. The random vector is supposed to be either Gaussian or to have a symmetric and bounded distribution. The dimensionality of the vector can possibly be much larger than the number of observations, and we focus on a nonasymptotic control of the confidence level, following ideas inspired by recent results in learning theory. We consider two approaches, the first based on a concentration principle (valid for a large class of resampling weights) and the second on a resampled quantile, specifically using Rademacher weights. Several intermediate results established in the approach based on concentration principles are of interest in their own right. We also discuss the question of accuracy when using Monte Carlo approximations of the resampled quantities. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/08-AOS667, http://dx.doi.org/10.1214/08-AOS668.
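    The sketch below illustrates the second approach mentioned in the abstract, a resampled quantile with Rademacher weights used to calibrate a simultaneous (sup-norm) confidence region for a high-dimensional mean. It is a generic Python illustration under Gaussian data, not the paper's nonasymptotic construction; the dimensions, confidence level, and number of resamples are illustrative assumptions.

        # Rademacher-weight resampled quantile for a simultaneous confidence region.
        import numpy as np

        rng = np.random.default_rng(1)
        n, K = 30, 500                     # observations, coordinates (K may exceed n)
        Y = rng.normal(size=(n, K))
        alpha, B = 0.05, 2000              # confidence level, number of resamples

        Ybar = Y.mean(axis=0)
        sup_stats = np.empty(B)
        for b in range(B):
            eps = rng.choice([-1.0, 1.0], size=n)   # Rademacher weights
            sup_stats[b] = np.abs((eps[:, None] * (Y - Ybar)).mean(axis=0)).max()

        t_alpha = np.quantile(sup_stats, 1 - alpha)
        # Region: every coordinate of the mean lies in [Ybar_k - t_alpha, Ybar_k + t_alpha]
        # with approximate simultaneous coverage 1 - alpha.
        print(round(float(t_alpha), 4))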

    A statistical framework for testing functional categories in microarray data

    Ready access to emerging databases of gene annotation and functional pathways has shifted assessments of differential expression in DNA microarray studies from single genes to groups of genes with shared biological function. This paper takes a critical look at existing methods for assessing the differential expression of a group of genes (functional category), and provides some suggestions for improved performance. We begin by presenting a general framework, in which the set of genes in a functional category is compared to the complementary set of genes on the array. The framework includes tests for overrepresentation of a category within a list of significant genes, and methods that consider continuous measures of differential expression. Existing tests are divided into two classes. Class 1 tests assume gene-specific measures of differential expression are independent, despite overwhelming evidence of positive correlation. Analytic and simulated results are presented that demonstrate Class 1 tests are strongly anti-conservative in practice. Class 2 tests account for gene correlation, typically through array permutation that by construction has proper Type I error control for the induced null. However, both Class 1 and Class 2 tests use a null hypothesis that all genes have the same degree of differential expression. We introduce a more sensible and general (Class 3) null under which the profile of differential expression is the same within the category and complement. Under this broader null, Class 2 tests are shown to be conservative. We propose standard bootstrap methods for testing against the Class 3 null and demonstrate they provide valid Type I error control and more power than array permutation in simulated datasets and real microarray experiments. Comment: Published in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/07-AOAS146.
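    A minimal sketch of the array-permutation ("Class 2") style of test described above: a category score is recomputed after permuting the sample (array) labels, which preserves the gene-gene correlation structure. The data shapes, the t-statistic summary, and the score definition are illustrative assumptions, not the exact procedures studied in the paper.

        # Array-permutation test for a functional category of genes.
        import numpy as np
        from scipy import stats

        rng = np.random.default_rng(2)
        n_genes, n_arrays = 200, 20
        expr = rng.normal(size=(n_genes, n_arrays))   # expression matrix
        labels = np.array([0] * 10 + [1] * 10)        # two experimental conditions
        category = np.arange(25)                      # indices of genes in the category

        def category_score(expr, labels, category):
            # per-gene two-sample t statistics, then category vs. complement contrast
            t, _ = stats.ttest_ind(expr[:, labels == 0], expr[:, labels == 1], axis=1)
            in_set = np.zeros(expr.shape[0], dtype=bool)
            in_set[category] = True
            return np.abs(t[in_set]).mean() - np.abs(t[~in_set]).mean()

        observed = category_score(expr, labels, category)
        perm = [category_score(expr, rng.permutation(labels), category) for _ in range(999)]
        p_value = (1 + sum(s >= observed for s in perm)) / (len(perm) + 1)
        print(p_value)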

    Multiple testing in genome-wide studies

    DNA microarray technologies allow us to monitor the expression levels of thousands of genes simultaneously. A basic task in analyzing microarray data is the identification of genes that are differentially expressed under different experimental conditions, the null hypothesis being no association between the expression levels and the explanatory variables or covariates. Controlling the family-wise error rate (FWER) bounds the Type I error but is very conservative; the false discovery rate (FDR) is a less stringent criterion that aims to control the expected proportion of Type I errors among the rejected hypotheses. With thousands of genes tested simultaneously, the FDR can become inflated, and high correlation between tested genes, attributable to co-regulation and dependence in the measurement errors, further complicates the problem. Most current FDR procedures assume independence or rather restrictive dependence structures, which makes them less reliable. In this work, we address these very large multiplicity problems by adopting a two-stage FDR-controlling procedure that is valid under suitable dependence structures and is based on a Poisson distributional approximation, eliminating the need to assume restrictive dependence structures. We compare the performance of the proposed FDR procedure with that of other FDR-controlling procedures, using the leukemia microarray study of Golub et al. (1999) and simulated data for illustration; in these studies the proposed procedure has greater power without much elevation of the FDR. Current FDR procedures have not been used extensively for genomic data involving count, discrete, or purely qualitative responses under high-dimensional, low-sample-size constraints. Using a model of the 2002-03 SARS epidemic, it is shown that the proposed FDR procedure, together with an appropriate test statistic based on a pseudo-marginal approach with the Hamming distance, performs better. Finally, for classifying dependent genes with heterogeneity from a small sample, standard robust inference may break down: the problem involves hypotheses whose parameters of interest are subject to inequality restrictions, for which the usual (restricted) likelihood-based inference procedures can be computationally demanding. Roy's union-intersection principle may be a viable alternative. The breast cancer study of Lobenhofer et al. is included for numerical illustration.
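    For context, the sketch below implements the standard Benjamini-Hochberg step-up procedure that the proposed two-stage method is compared against; the p-values and the FDR level are illustrative assumptions, and the code does not reproduce the paper's Poisson-approximation-based procedure.

        # Benjamini-Hochberg step-up procedure for FDR control.
        import numpy as np

        def benjamini_hochberg(pvals, q=0.05):
            """Return a boolean mask of hypotheses rejected at FDR level q."""
            pvals = np.asarray(pvals)
            m = len(pvals)
            order = np.argsort(pvals)
            below = pvals[order] <= q * np.arange(1, m + 1) / m
            reject = np.zeros(m, dtype=bool)
            if below.any():
                k = np.nonzero(below)[0].max()   # largest i with p_(i) <= i*q/m
                reject[order[:k + 1]] = True
            return reject

        print(benjamini_hochberg([0.001, 0.008, 0.039, 0.041, 0.20, 0.74]))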

    Spectral estimation in unevenly sampled space of periodically expressed microarray time series data

    BACKGROUND: Periodogram analysis of time series is widespread in biology. A new challenge in analyzing microarray time series data is to identify genes that are periodically expressed. This challenge arises because the observed time series usually exhibit non-idealities such as noise, short length, and unevenly sampled time points. Most methods used in the literature operate on evenly sampled time series and are not suitable for unevenly sampled time series. RESULTS: For evenly sampled data, methods based on the classical Fourier periodogram are often used to detect periodically expressed genes. Recently, the Lomb-Scargle algorithm has been applied to unevenly sampled gene expression data for spectral estimation. However, since the Lomb-Scargle method assumes that there is a single stationary sinusoid wave with infinite support, it introduces spurious periodic components in the periodogram for data with a finite length. In this paper, we propose a new spectral estimation algorithm for unevenly sampled gene expression data. The new method is based on signal reconstruction in a shift-invariant signal space, where a direct spectral estimation procedure is developed using the B-spline basis. Experiments on simulated noisy gene expression profiles show that our algorithm is superior to the Lomb-Scargle algorithm and the classical Fourier periodogram based method in detecting periodically expressed genes. We have applied our algorithm to the Plasmodium falciparum and yeast gene expression data, and the results show that the algorithm is able to detect biologically meaningful periodically expressed genes. CONCLUSION: We have proposed an effective method for identifying periodic genes in unevenly sampled space of microarray time series gene expression data. The method can also be used as an effective tool for gene expression time series interpolation or resampling.
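    For reference, the sketch below applies the Lomb-Scargle periodogram, the baseline against which the proposed B-spline method is compared, to a synthetic unevenly sampled expression profile using SciPy. The signal, sampling times, and frequency grid are illustrative assumptions.

        # Lomb-Scargle periodogram on unevenly sampled data.
        import numpy as np
        from scipy.signal import lombscargle

        rng = np.random.default_rng(3)
        t = np.sort(rng.uniform(0, 48, size=25))     # uneven sampling times (hours)
        period = 24.0                                # assumed true period of the profile
        y = np.sin(2 * np.pi * t / period) + 0.3 * rng.normal(size=t.size)
        y -= y.mean()                                # classical Lomb-Scargle assumes zero-mean input

        freqs = np.linspace(0.01, 1.0, 500)          # angular frequencies (rad/hour)
        power = lombscargle(t, y, freqs)
        print(2 * np.pi / freqs[power.argmax()])     # estimated period in hours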

    Power-enhanced multiple decision functions controlling family-wise error and false discovery rates

    Improved procedures, in terms of smaller missed discovery rates (MDR), for performing multiple hypotheses testing with weak and strong control of the family-wise error rate (FWER) or the false discovery rate (FDR) are developed and studied. The improvement over existing procedures such as the Šidák procedure for FWER control and the Benjamini-Hochberg (BH) procedure for FDR control is achieved by exploiting possible differences in the powers of the individual tests. Results signal the need to take into account the powers of the individual tests and to have multiple hypotheses decision functions which are not limited to simply using the individual p-values, as is the case, for example, with the Šidák, Bonferroni, or BH procedures. They also enhance understanding of the role of the powers of individual tests, or more precisely the receiver operating characteristic (ROC) functions of decision processes, in the search for better multiple hypotheses testing procedures. A decision-theoretic framework is utilized, and through auxiliary randomizers the procedures could be used with discrete or mixed-type data or with rank-based nonparametric tests. This is in contrast to existing p-value based procedures whose theoretical validity is contingent on each of these p-value statistics being stochastically equal to or greater than a standard uniform variable under the null hypothesis. Proposed procedures are relevant in the analysis of high-dimensional "large M, small n" data sets arising in the natural, physical, medical, economic and social sciences, whose generation and creation is accelerated by advances in high-throughput technology, notably, but not limited to, microarray technology. Comment: Published in the Annals of Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical Statistics (http://www.imstat.org); DOI: http://dx.doi.org/10.1214/10-AOS844.
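    As a point of comparison for the p-value-based procedures named above, the sketch below implements the classical Bonferroni and Šidák FWER corrections; the p-values and significance level are illustrative assumptions, and the paper's decision-theoretic procedures are not reproduced here.

        # Classical single-step FWER corrections based only on p-values.
        import numpy as np

        def bonferroni(pvals, alpha=0.05):
            pvals = np.asarray(pvals)
            return pvals <= alpha / len(pvals)                      # reject if p_i <= alpha / m

        def sidak(pvals, alpha=0.05):
            pvals = np.asarray(pvals)
            return pvals <= 1 - (1 - alpha) ** (1 / len(pvals))     # slightly less conservative

        pvals = [0.004, 0.011, 0.039, 0.170, 0.630]
        print(bonferroni(pvals))
        print(sidak(pvals))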