Resampling-based tests of functional categories in gene expression studies

Abstract

DNA microarrays allow researchers to measure the expression of thousands of genes simultaneously, and are commonly used to identify changes in expression either across experimental conditions or in association with some clinical outcome. With the increasing availability of gene annotation, researchers have begun to ask global questions of functional genomics that explore the interactions of genes in cellular processes and signaling pathways. A common hypothesis test for gene categories is constructed as a post hoc analysis performed once a list of significant genes has been identified, using classically derived tests for 2×2 contingency tables. We note several drawbacks to this approach, including the violation of an independence assumption by the correlation in expression that exists among genes. To test gene categories in a more appropriate manner, we propose a flexible, permutation-based framework, termed SAFE (for Significance Analysis of Function and Expression). SAFE is a two-stage approach: gene-specific statistics are first calculated for the association between expression and the response of interest, and a global statistic is then used to detect a shift within a gene category toward more extreme associations. Significance is assessed by repeatedly permuting whole arrays, so that the correlation among all genes is held constant and accounted for. This permutation scheme also preserves the relatedness of categories containing overlapping genes, such that error rate estimates can be readily obtained for multiple dependent tests. Through a detailed survey of gene category tests and simulations based on real microarray data, we demonstrate how SAFE generates appropriate Type I error rates as compared to other methods. Under a more rigorously defined null hypothesis, permutation-based tests of gene categories are shown to be conservative because they induce a special case with maximum variance for the test statistic.
A bootstrap-based approach to hypothesis testing is incorporated into the SAFE framework, providing better coverage and improved power under a defined class of alternatives. Lastly, we extend the SAFE framework to consider gene categories in a probabilistic manner. This allows for a hypothesis test of co-regulation, using models of transcription factor binding sites to score for the presence of motifs in the upstream regions of genes.
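
The two-stage permutation scheme described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names (`local_t`, `global_rank_sum`, `safe_sketch`) are hypothetical, and the choice of a Welch t-statistic as the gene-specific statistic and a rank sum of absolute values as the global statistic are assumptions made for illustration. The sketch assumes a two-group response coded 0/1 and an expression matrix with genes in rows and arrays in columns. Array labels are permuted as a whole, so every gene sees the same relabeling and the gene-gene correlation is left intact.

```python
import numpy as np

def local_t(expr, labels):
    """Gene-specific Welch t-statistics (rows = genes, columns = arrays)."""
    a, b = expr[:, labels == 1], expr[:, labels == 0]
    se = np.sqrt(a.var(axis=1, ddof=1) / a.shape[1]
                 + b.var(axis=1, ddof=1) / b.shape[1])
    return (a.mean(axis=1) - b.mean(axis=1)) / se

def global_rank_sum(local, member):
    """Sum of the ranks of |local statistic| over the genes in one category."""
    ranks = np.abs(local).argsort().argsort() + 1  # rank 1 = least extreme
    return ranks[member].sum()

def safe_sketch(expr, labels, categories, n_perm=1000, seed=0):
    """Empirical p-value per category from whole-array permutations.

    categories: boolean matrix (n_categories x n_genes) of memberships.
    """
    rng = np.random.default_rng(seed)
    obs_local = local_t(expr, labels)
    observed = np.array([global_rank_sum(obs_local, m) for m in categories])
    exceed = np.zeros(len(categories))
    for _ in range(n_perm):
        # Permute the response across arrays; all genes share this relabeling,
        # so correlation among genes and overlap among categories are preserved.
        perm_local = local_t(expr, rng.permutation(labels))
        permuted = np.array([global_rank_sum(perm_local, m) for m in categories])
        exceed += permuted >= observed
    # Add-one correction keeps the empirical p-value strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Because each permutation yields global statistics for all categories at once, the same permutations could in principle be reused to estimate error rates across multiple dependent tests, as the abstract notes; that step is omitted here.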
