The importance of distinct modeling strategies for gene and gene-specific treatment effects in hierarchical models for microarray data
When analyzing microarray data, hierarchical models are often used to share
information across genes when estimating means and variances or identifying
differential expression. Many methods utilize some form of the two-level
hierarchical model structure suggested by Kendziorski et al. [Stat. Med. (2003)
22 3899-3914] in which the first level describes the distribution of latent
mean expression levels among genes and among differentially expressed
treatments within a gene. The second level describes the conditional
distribution, given a latent mean, of repeated observations for a single gene
and treatment. Many of these models, including those used in the EBarrays
package of Kendziorski et al. [Stat. Med. (2003) 22 3899-3914], assume that expression
level changes due to treatment effects have the same distribution as expression
level changes from gene to gene. We present empirical evidence that this
assumption is often inadequate and propose three-level hierarchical models as
extensions to the two-level log-normal based EBarrays models to address this
inadequacy. We demonstrate that use of our three-level models dramatically
changes analysis results for a variety of microarray data sets and verify the
validity and improved performance of our suggested method in a series of
simulation studies. We also illustrate the importance of accounting for the
uncertainty of gene-specific error variance estimates when using hierarchical
models to identify differentially expressed genes.
Comment: Published at http://dx.doi.org/10.1214/12-AOAS535 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
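The contrast drawn in the abstract above can be written out as a rough sketch. The notation is an assumption made here for illustration and is not taken from the paper: $y_{gtr}$ is the log expression for gene $g$, treatment $t$, replicate $r$; $\mu_0$, $\tau^2$, $\tau_{\mathrm{trt}}^2$ and $\sigma_g^2$ are hypothetical hyperparameters; and the three-level form shown is only one plausible way to separate the two sources of variation.

```latex
% Two-level sketch: for a differentially expressed gene, each treatment mean
% is an independent draw from the same distribution that generates
% gene-to-gene variation.
\begin{align*}
  \mu_{gt} &\sim \mathcal{N}(\mu_0, \tau^2), \\
  y_{gtr} \mid \mu_{gt} &\sim \mathcal{N}(\mu_{gt}, \sigma^2).
\end{align*}

% Three-level sketch: a gene-level mean, then treatment-specific means
% scattered around it with their own variance, then the observations.
\begin{align*}
  \mu_{g} &\sim \mathcal{N}(\mu_0, \tau^2), \\
  \mu_{gt} \mid \mu_{g} &\sim \mathcal{N}(\mu_{g}, \tau_{\mathrm{trt}}^2), \\
  y_{gtr} \mid \mu_{gt} &\sim \mathcal{N}(\mu_{gt}, \sigma_g^2).
\end{align*}
```

Under the two-level sketch, the difference between two treatment means of a differentially expressed gene has variance $2\tau^2$, so it is tied to the gene-to-gene spread. Under the three-level sketch it has variance $2\tau_{\mathrm{trt}}^2$, which can be much smaller than $2\tau^2$.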
Modeling and visualizing uncertainty in gene expression clusters using Dirichlet process mixtures
Although the use of clustering methods has rapidly become one of the standard computational approaches in the literature of microarray gene expression data, little attention has been paid to uncertainty in the results obtained. Dirichlet process mixture (DPM) models provide a nonparametric Bayesian alternative to the bootstrap approach to modeling uncertainty in gene expression clustering. Most previously published applications of Bayesian model-based clustering methods have been to short time series data. In this paper, we present a case study of the application of nonparametric Bayesian clustering methods to the clustering of high-dimensional non-time-series gene expression data using full Gaussian covariances. We use the probability that two genes belong to the same cluster in a DPM model as a measure of the similarity of these gene expression profiles. Conversely, this probability can be used to define a dissimilarity measure, which, for the purposes of visualization, can be input to one of the standard linkage algorithms used for hierarchical clustering. Biologically plausible results are obtained from the Rosetta compendium of expression profiles which extend previously published cluster analyses of this data.
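The co-clustering idea in the abstract above can be illustrated with a minimal sketch. It assumes that posterior draws of cluster labels are already available from some DPM sampler; the function name `coclustering_dissimilarity` and the toy draws are hypothetical and not from the paper. The dissimilarity is taken as one minus the fraction of draws in which two genes share a cluster.

```python
from itertools import combinations


def coclustering_dissimilarity(label_samples):
    """Return a symmetric matrix of 1 - P(genes i and j share a cluster).

    label_samples: list of label vectors, one per posterior draw; entry k of
    a vector is the cluster label assigned to gene k in that draw.
    """
    n_items = len(label_samples[0])
    n_draws = len(label_samples)
    dissim = [[0.0] * n_items for _ in range(n_items)]
    for i, j in combinations(range(n_items), 2):
        # Count the draws in which genes i and j carry the same label.
        together = sum(1 for labels in label_samples if labels[i] == labels[j])
        d = 1.0 - together / n_draws
        dissim[i][j] = dissim[j][i] = d
    return dissim


# Hypothetical posterior draws for four genes (labels are arbitrary per draw).
draws = [
    [0, 0, 1, 1],
    [0, 0, 0, 1],
    [1, 1, 0, 0],
    [0, 1, 2, 2],
]
D = coclustering_dissimilarity(draws)
```

Because only label equality within a draw is used, the matrix is unaffected by label switching between draws. The resulting matrix can then be passed to a standard linkage routine for visualization, as the abstract describes.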
Bayesian testing of many hypotheses × many genes: A study of sleep apnea
Substantial statistical research has recently been devoted to the analysis of
large-scale microarray experiments which provide a measure of the simultaneous
expression of thousands of genes in a particular condition. A typical goal is
the comparison of gene expression between two conditions (e.g., diseased vs.
nondiseased) to detect genes which show differential expression. Classical
hypothesis testing procedures have been applied to this problem and more recent
work has employed sophisticated models that allow for the sharing of
information across genes. However, many recent gene expression studies have an
experimental design with several conditions that requires an even more involved
hypothesis testing approach. In this paper, we use a hierarchical Bayesian
model to address the situation where there are many hypotheses that must be
simultaneously tested for each gene. In addition to having many hypotheses
within each gene, our analysis also addresses the more typical multiple
comparison issue of testing many genes simultaneously. We illustrate our
approach with an application to a study of genes involved in obstructive sleep
apnea in humans.
Comment: Published at http://dx.doi.org/10.1214/09-AOAS241 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
Of mice and men: Sparse statistical modeling in cardiovascular genomics
In high-throughput genomics, large-scale designed experiments are becoming
common, and analysis approaches based on highly multivariate regression and
anova concepts are key tools. Shrinkage models of one form or another can
provide comprehensive approaches to the problems of simultaneous inference that
involve implicit multiple comparisons over the many, many parameters
representing effects of design factors and covariates. We use such approaches
here in a study of cardiovascular genomics. The primary experimental context
concerns a carefully designed, and rich, gene expression study focused on
gene-environment interactions, with the goals of identifying genes implicated
in connection with disease states and known risk factors, and in generating
expression signatures as proxies for such risk factors. A coupled exploratory
analysis investigates cross-species extrapolation of gene expression
signatures--how these mouse-model signatures translate to humans. The latter
involves exploration of sparse latent factor analysis of human observational
data and of how it relates to projected risk signatures derived in the animal
models. The study also highlights a range of applied statistical and genomic
data analysis issues, including model specification, computational questions
and model-based correction of experimental artifacts in DNA microarray data.
Comment: Published at http://dx.doi.org/10.1214/07-AOAS110 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
An Empirical Bayes Approach for Multiple Tissue eQTL Analysis
Expression quantitative trait loci (eQTL) analyses, which identify genetic
markers associated with the expression of a gene, are an important tool in the
understanding of diseases in human and other populations. While most eQTL
studies to date consider the connection between genetic variation and
expression in a single tissue, complex, multi-tissue data sets are now being
generated by the GTEx initiative. These data sets have the potential to improve
the findings of single tissue analyses by borrowing strength across tissues,
and the potential to elucidate the genotypic basis of differences between
tissues.
In this paper we introduce and study a multivariate hierarchical Bayesian
model (MT-eQTL) for multi-tissue eQTL analysis. MT-eQTL directly models the
vector of correlations between expression and genotype across tissues. It
explicitly captures patterns of variation in the presence or absence of eQTLs,
as well as the heterogeneity of effect sizes across tissues. Moreover, the
model is applicable to complex designs in which the set of donors can (i) vary
from tissue to tissue, and (ii) exhibit incomplete overlap between tissues. The
MT-eQTL model is marginally consistent, in the sense that the model for a
subset of tissues can be obtained from the full model via marginalization.
Fitting of the MT-eQTL model is carried out via empirical Bayes, using an
approximate EM algorithm. Inferences concerning eQTL detection and the
configuration of eQTLs across tissues are derived from adaptive thresholding of
local false discovery rates, and maximum a-posteriori estimation, respectively.
We investigate the MT-eQTL model through a simulation study, and rigorously
establish the FDR control of the local FDR testing procedure under mild
assumptions appropriate for dependent data.
Comment: accepted by Biostatistic
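One common form of adaptive thresholding on local false discovery rates can be sketched as follows. This is a generic illustration and not necessarily the MT-eQTL procedure: it assumes the local fdr values have already been estimated, and it rejects the largest set of hypotheses whose average local fdr stays at or below a target level `alpha`. The function name and the example values are hypothetical.

```python
def adaptive_lfdr_threshold(lfdr, alpha):
    """Return indices rejected by a running-mean rule on local fdr values.

    Hypotheses are sorted by increasing local fdr and added one at a time
    for as long as the mean local fdr of the rejected set is <= alpha.
    """
    order = sorted(range(len(lfdr)), key=lambda i: lfdr[i])
    rejected = []
    running_sum = 0.0
    for rank, idx in enumerate(order, start=1):
        running_sum += lfdr[idx]
        # The running mean is nondecreasing for sorted input, so the first
        # exceedance is the stopping point.
        if running_sum / rank > alpha:
            break
        rejected.append(idx)
    return rejected


# Hypothetical local fdr values for six gene-SNP pairs.
example_lfdr = [0.01, 0.80, 0.03, 0.20, 0.95, 0.05]
rejected = adaptive_lfdr_threshold(example_lfdr, alpha=0.10)
```

The threshold is adaptive in that the cutoff on individual local fdr values is not fixed in advance: here a pair with local fdr 0.20 is still rejected because the average over the rejected set remains below 0.10.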
Laplace Approximated EM Microarray Analysis: An Empirical Bayes Approach for Comparative Microarray Experiments
A two-groups mixed-effects model for the comparison of (normalized)
microarray data from two treatment groups is considered. Most competing
parametric methods that have appeared in the literature are obtained as special
cases or by minor modification of the proposed model. Approximate maximum
likelihood fitting is accomplished via a fast and scalable algorithm, which we
call LEMMA (Laplace approximated EM Microarray Analysis). The posterior odds of
treatment gene interactions, derived from the model, involve shrinkage
estimates of both the interactions and of the gene specific error variances.
Genes are classified as being associated with treatment based on the posterior
odds and the local false discovery rate (f.d.r.) with a fixed cutoff. Our
model-based approach also allows one to declare the non-null status of a gene
by controlling the false discovery rate (FDR). It is shown in a detailed
simulation study that the approach outperforms well-known competitors. We also
apply the proposed methodology to two previously analyzed microarray examples.
Extensions of the proposed method to paired treatments and multiple treatments
are also discussed.
Comment: Published at http://dx.doi.org/10.1214/10-STS339 in Statistical
Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org