3,051 research outputs found
Nonparametric estimation of genewise variance for microarray data
Estimation of genewise variance arises from two important applications in
microarray data analysis: selecting significantly differentially expressed
genes and validation tests for normalization of microarray data. We approach
the problem by introducing a two-way nonparametric model, which is an extension
of the famous Neyman--Scott model and is applicable beyond microarray data. The
problem itself poses interesting challenges because the number of nuisance
parameters is proportional to the sample size and it is not obvious how the
variance function can be estimated when measurements are correlated. In such a
high-dimensional nonparametric problem, we proposed two novel nonparametric
estimators for genewise variance function and semiparametric estimators for
measurement correlation, via solving a system of nonlinear equations. Their
asymptotic normality is established. The finite sample property is demonstrated
by simulation studies. The estimators also improve the power of the tests for
detecting statistically differentially expressed genes. The methodology is
illustrated by the data from microarray quality control (MAQC) project.Comment: Published in at http://dx.doi.org/10.1214/10-AOS802 the Annals of
Statistics (http://www.imstat.org/aos/) by the Institute of Mathematical
Statistics (http://www.imstat.org
Diverse correlation structures in gene expression data and their utility in improving statistical inference
It is well known that correlations in microarray data represent a serious
nuisance deteriorating the performance of gene selection procedures. This paper
is intended to demonstrate that the correlation structure of microarray data
provides a rich source of useful information. We discuss distinct correlation
substructures revealed in microarray gene expression data by an appropriate
ordering of genes. These substructures include stochastic proportionality of
expression signals in a large percentage of all gene pairs, negative
correlations hidden in ordered gene triples, and a long sequence of weakly
dependent random variables associated with ordered pairs of genes. The reported
striking regularities are of general biological interest and they also have
far-reaching implications for theory and practice of statistical methods of
microarray data analysis. We illustrate the latter point with a method for
testing differential expression of nonoverlapping gene pairs. While designed
for testing a different null hypothesis, this method provides an order of
magnitude more accurate control of type 1 error rate compared to conventional
methods of individual gene expression profiling. In addition, this method is
robust to the technical noise. Quantitative inference of the correlation
structure has the potential to extend the analysis of microarray data far
beyond currently practiced methods.Comment: Published in at http://dx.doi.org/10.1214/07-AOAS120 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
BNP-Seq: Bayesian Nonparametric Differential Expression Analysis of Sequencing Count Data
We perform differential expression analysis of high-throughput sequencing
count data under a Bayesian nonparametric framework, removing sophisticated
ad-hoc pre-processing steps commonly required in existing algorithms. We
propose to use the gamma (beta) negative binomial process, which takes into
account different sequencing depths using sample-specific negative binomial
probability (dispersion) parameters, to detect differentially expressed genes
by comparing the posterior distributions of gene-specific negative binomial
dispersion (probability) parameters. These model parameters are inferred by
borrowing statistical strength across both the genes and samples. Extensive
experiments on both simulated and real-world RNA sequencing count data show
that the proposed differential expression analysis algorithms clearly
outperform previously proposed ones in terms of the areas under both the
receiver operating characteristic and precision-recall curves.Comment: To appear in Journal of the American Statistical Associatio
GaGa: A parsimonious and flexible model for differential expression analysis
Hierarchical models are a powerful tool for high-throughput data with a small
to moderate number of replicates, as they allow sharing information across
units of information, for example, genes. We propose two such models and show
its increased sensitivity in microarray differential expression applications.
We build on the gamma--gamma hierarchical model introduced by Kendziorski et
al. [Statist. Med. 22 (2003) 3899--3914] and Newton et al. [Biostatistics 5
(2004) 155--176], by addressing important limitations that may have hampered
its performance and its more widespread use. The models parsimoniously describe
the expression of thousands of genes with a small number of hyper-parameters.
This makes them easy to interpret and analytically tractable. The first model
is a simple extension that improves the fit substantially with almost no
increase in complexity. We propose a second extension that uses a mixture of
gamma distributions to further improve the fit, at the expense of increased
computational burden. We derive several approximations that significantly
reduce the computational cost. We find that our models outperform the original
formulation of the model, as well as some other popular methods for
differential expression analysis. The improved performance is specially
noticeable for the small sample sizes commonly encountered in high-throughput
experiments. Our methods are implemented in the freely available Bioconductor
gaga package.Comment: Published in at http://dx.doi.org/10.1214/09-AOAS244 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
NBLDA: Negative Binomial Linear Discriminant Analysis for RNA-Seq Data
RNA-sequencing (RNA-Seq) has become a powerful technology to characterize
gene expression profiles because it is more accurate and comprehensive than
microarrays. Although statistical methods that have been developed for
microarray data can be applied to RNA-Seq data, they are not ideal due to the
discrete nature of RNA-Seq data. The Poisson distribution and negative binomial
distribution are commonly used to model count data. Recently, Witten (2011)
proposed a Poisson linear discriminant analysis for RNA-Seq data. The Poisson
assumption may not be as appropriate as negative binomial distribution when
biological replicates are available and in the presence of overdispersion
(i.e., when the variance is larger than the mean). However, it is more
complicated to model negative binomial variables because they involve a
dispersion parameter that needs to be estimated. In this paper, we propose a
negative binomial linear discriminant analysis for RNA-Seq data. By Bayes'
rule, we construct the classifier by fitting a negative binomial model, and
propose some plug-in rules to estimate the unknown parameters in the
classifier. The relationship between the negative binomial classifier and the
Poisson classifier is explored, with a numerical investigation of the impact of
dispersion on the discriminant score. Simulation results show the superiority
of our proposed method. We also analyze four real RNA-Seq data sets to
demonstrate the advantage of our method in real-world applications
Estimating the proportion of differentially expressed genes in comparative DNA microarray experiments
DNA microarray experiments, a well-established experimental technique, aim at
understanding the function of genes in some biological processes. One of the
most common experiments in functional genomics research is to compare two
groups of microarray data to determine which genes are differentially
expressed. In this paper, we propose a methodology to estimate the proportion
of differentially expressed genes in such experiments. We study the performance
of our method in a simulation study where we compare it to other standard
methods. Finally we compare the methods in real data from two toxicology
experiments with mice.Comment: Published at http://dx.doi.org/10.1214/074921707000000076 in the IMS
Lecture Notes Monograph Series
(http://www.imstat.org/publications/lecnotes.htm) by the Institute of
Mathematical Statistics (http://www.imstat.org
A New Test Statistic Based on Shrunken Sample Variance for Identifying Differentially Expressed Genes in Small Microarray Experiments
Choosing an appropriate statistic and precisely evaluating the false discovery rate (FDR) are both essential for devising an effective method for identifying differentially expressed genes in microarray data. The t-type score proposed by Pan et al. (2003) succeeded in suppressing false positives by controlling the underestimation of variance but left the overestimation uncontrolled. For controlling the overestimation, we devised a new test statistic (variance stabilized t-type score) by placing shrunken sample variances of the James-Stein type in the denominator of the t-type score. Since the relative superiority of the mean and median FDRs was unclear in the widely adopted Significance Analysis of Microarrays (SAM), we conducted simulation studies to examine the performance of the variance stabilized t-type score and the characteristics of the two FDRs. The variance stabilized t-type score was generally better than or at least as good as the t-type score, irrespective of the sample size and proportion of differentially expressed genes. In terms of accuracy, the median FDR was superior to the mean FDR when the proportion of differentially expressed genes was large. The variance stabilized t-type score with the median FDR was applied to actual colorectal cancer data and yielded a reasonable result
- …