A statistical framework for joint eQTL analysis in multiple tissues
Mapping expression Quantitative Trait Loci (eQTLs) represents a powerful and
widely-adopted approach to identifying putative regulatory variants and linking
them to specific genes. Up to now eQTL studies have been conducted in a
relatively narrow range of tissues or cell types. However, understanding the
biology of organismal phenotypes will involve understanding regulation in
multiple tissues, and ongoing studies are collecting eQTL data in dozens of
cell types. Here we present a statistical framework for powerfully detecting
eQTLs in multiple tissues or cell types (or, more generally, multiple
subgroups). The framework explicitly models the potential for each eQTL to be
active in some tissues and inactive in others. By modeling the sharing of
active eQTLs among tissues this framework increases power to detect eQTLs that
are present in more than one tissue compared with "tissue-by-tissue" analyses
that examine each tissue separately. Conversely, by modeling the inactivity of
eQTLs in some tissues, the framework allows the proportion of eQTLs shared
across different tissues to be formally estimated as parameters of a model,
addressing the difficulties of accounting for incomplete power when comparing
overlaps of eQTLs identified by tissue-by-tissue analyses. Applying our
framework to re-analyze data from transformed B cells, T cells and fibroblasts
we find that it substantially increases power compared with tissue-by-tissue
analysis, identifying 63% more genes with eQTLs (at FDR=0.05). Further, the
results suggest that, in contrast to previous analyses of the same data, the
majority of eQTLs detectable in these data are shared among all three tissues. Comment: Submitted to PLoS Genetics
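The abstract above describes modeling each eQTL as active in some tissues and inactive in others. As a rough illustration only, the sketch below enumerates the possible activity patterns for a set of tissues and averages per-pattern Bayes factors under prior weights. The function names, the use of Bayes factors, and the weighting scheme are assumptions made for this example; the abstract does not specify how the framework combines evidence across patterns.

```python
from itertools import product


def activity_configurations(tissues):
    """Return every non-null on/off pattern of an eQTL across the tissues.

    Each pattern is a tuple naming the tissues in which the eQTL is active.
    """
    configs = []
    for pattern in product([0, 1], repeat=len(tissues)):
        if any(pattern):  # skip the null pattern (inactive everywhere)
            configs.append(tuple(t for t, on in zip(tissues, pattern) if on))
    return configs


def averaged_bayes_factor(config_bfs, config_weights):
    """Average hypothetical per-pattern Bayes factors under prior weights.

    Both arguments are dicts keyed by pattern; the weights must sum to 1.
    """
    total = sum(config_weights.values())
    if abs(total - 1.0) > 1e-9:
        raise ValueError("prior weights must sum to 1")
    return sum(config_weights[c] * config_bfs[c] for c in config_bfs)


tissues = ["B cells", "T cells", "fibroblasts"]
configs = activity_configurations(tissues)
```

With three tissues there are 2^3 - 1 = 7 non-null patterns, ranging from activity in a single tissue to activity in all three.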
A hierarchical Bayesian model for inference of copy number variants and their association to gene expression
A number of statistical models have been successfully developed for the
analysis of high-throughput data from a single source, but few methods are
available for integrating data from different sources. Here we focus on
integrating gene expression levels with comparative genomic hybridization (CGH)
array measurements collected on the same subjects. We specify a measurement
error model that relates the gene expression levels to latent copy number
states which, in turn, are related to the observed surrogate CGH measurements
via a hidden Markov model. We employ selection priors that exploit the
dependencies across adjacent copy number states and investigate MCMC stochastic
search techniques for posterior inference. Our approach results in a unified
modeling framework for simultaneously inferring copy number variants (CNV) and
identifying their significant associations with mRNA transcript abundance. We
show performance on simulated data and illustrate an application to data from a
genomic study on human cancer cell lines. Comment: Published at http://dx.doi.org/10.1214/13-AOAS705 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
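The abstract above links observed CGH measurements to latent copy number states through a hidden Markov model and performs inference by MCMC. The sketch below is not that procedure: it swaps in plain Viterbi decoding to show how a hidden state path can be recovered from a noisy CGH series. The Gaussian emissions, the "sticky" transition matrix, and all parameter names are assumptions chosen for illustration.

```python
import numpy as np


def viterbi_copy_number(cgh, state_means, sigma, stay_prob):
    """Most likely latent state path for a 1-D series of CGH log-ratios.

    Assumes Gaussian emissions with a shared sigma, a uniform initial
    distribution, and transitions that stay put with probability stay_prob
    (0 < stay_prob < 1) and otherwise move uniformly to another state.
    """
    cgh = np.asarray(cgh, dtype=float)
    means = np.asarray(state_means, dtype=float)
    n, k = len(cgh), len(means)

    trans = np.full((k, k), (1.0 - stay_prob) / (k - 1))
    np.fill_diagonal(trans, stay_prob)
    log_trans = np.log(trans)

    # Gaussian log density up to an additive constant shared by all states.
    log_emit = -0.5 * ((cgh[:, None] - means[None, :]) / sigma) ** 2

    score = np.full((n, k), -np.inf)
    back = np.zeros((n, k), dtype=int)
    score[0] = -np.log(k) + log_emit[0]
    for t in range(1, n):
        cand = score[t - 1][:, None] + log_trans  # cand[i, j]: move i -> j
        back[t] = np.argmax(cand, axis=0)
        score[t] = cand[back[t], np.arange(k)] + log_emit[t]

    # Trace the best path backwards from the best final state.
    path = np.zeros(n, dtype=int)
    path[-1] = int(np.argmax(score[-1]))
    for t in range(n - 2, -1, -1):
        path[t] = back[t + 1, path[t + 1]]
    return path
```

Here state indices refer to positions in `state_means`, for example loss, neutral, and gain at means -1, 0, and 1.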
The Infinite Hierarchical Factor Regression Model
We propose a nonparametric Bayesian factor regression model that accounts for
uncertainty in the number of factors, and the relationship between factors. To
accomplish this, we propose a sparse variant of the Indian Buffet Process and
couple this with a hierarchical model over factors, based on Kingman's
coalescent. We apply this model to two problems (factor analysis and factor
regression) in gene-expression data analysis.
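For readers unfamiliar with the Indian Buffet Process mentioned above, the following sketch draws a binary feature matrix from the standard IBP. It does not implement the sparse variant or the coalescent hierarchy that the abstract proposes; the function name and parameters are assumptions for illustration.

```python
import numpy as np


def sample_ibp(num_customers, alpha, rng):
    """Draw a binary matrix Z (customers x dishes) from the standard IBP."""
    dish_counts = []  # number of customers who have taken each dish so far
    rows = []
    for i in range(1, num_customers + 1):
        # Customer i takes existing dish k with probability m_k / i.
        row = [int(rng.random() < m / i) for m in dish_counts]
        for k, taken in enumerate(row):
            dish_counts[k] += taken
        # Customer i then tries Poisson(alpha / i) brand-new dishes.
        new = int(rng.poisson(alpha / i))
        row.extend([1] * new)
        dish_counts.extend([1] * new)
        rows.append(row)

    # Pad earlier rows with zeros for dishes introduced later.
    Z = np.zeros((num_customers, len(dish_counts)), dtype=int)
    for i, row in enumerate(rows):
        Z[i, :len(row)] = row
    return Z


Z = sample_ibp(10, alpha=2.0, rng=np.random.default_rng(0))
```

Rows correspond to observations and columns to latent factors; the number of columns is random, which is how the prior expresses uncertainty about the number of factors.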
Bayesian Gene Set Analysis
Gene expression microarray technologies provide the simultaneous measurements
of a large number of genes. Typical analyses of such data focus on the
individual genes, but recent work has demonstrated that evaluating changes in
expression across predefined sets of genes often increases statistical power
and produces more robust results. We introduce a new methodology for
identifying gene sets that are differentially expressed under varying
experimental conditions. Our approach uses a hierarchical Bayesian framework
where a hyperparameter measures the significance of each gene set. Using
simulated data, we compare our proposed method to alternative approaches, such
as Gene Set Enrichment Analysis (GSEA) and Gene Set Analysis (GSA). Our
approach provides the best overall performance. We also discuss the application
of our method to experimental data based on p53 mutation status.
Accelerating Bayesian hierarchical clustering of time series data with a randomised algorithm
We live in an era of abundant data. This has necessitated the development of new and innovative statistical algorithms to get the most from experimental data. For example, faster algorithms make practical the analysis of larger genomic data sets, allowing us to extend the utility of cutting-edge statistical methods. We present a randomised algorithm that accelerates the clustering of time series data using the Bayesian Hierarchical Clustering (BHC) statistical method. BHC is a general method for clustering any discretely sampled time series data. In this paper we focus on a particular application to microarray gene expression data. We define and analyse the randomised algorithm, before presenting results on both synthetic and real biological data sets. We show that the randomised algorithm leads to substantial gains in speed with minimal loss in clustering quality. The randomised time series BHC algorithm is available as part of the R package BHC, which can be downloaded from Bioconductor (version 2.10 and above) via http://bioconductor.org/packages/2.10/bioc/html/BHC.html. We have also made available a set of R scripts that can be used to reproduce the analyses carried out in this paper, at https://sites.google.com/site/randomisedbhc/
Bayesian testing of many hypotheses × many genes: A study of sleep apnea
Substantial statistical research has recently been devoted to the analysis of
large-scale microarray experiments which provide a measure of the simultaneous
expression of thousands of genes in a particular condition. A typical goal is
the comparison of gene expression between two conditions (e.g., diseased vs.
nondiseased) to detect genes which show differential expression. Classical
hypothesis testing procedures have been applied to this problem and more recent
work has employed sophisticated models that allow for the sharing of
information across genes. However, many recent gene expression studies have an
experimental design with several conditions that requires an even more involved
hypothesis testing approach. In this paper, we use a hierarchical Bayesian
model to address the situation where there are many hypotheses that must be
simultaneously tested for each gene. In addition to having many hypotheses
within each gene, our analysis also addresses the more typical multiple
comparison issue of testing many genes simultaneously. We illustrate our
approach with an application to a study of genes involved in obstructive sleep
apnea in humans. Comment: Published at http://dx.doi.org/10.1214/09-AOAS241 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
A hierarchical model of transcriptional dynamics allows robust estimation of transcription rates in populations of single cells with variable gene copy number
Motivation: cis-regulatory DNA sequence elements, such as enhancers and silencers, function to control the spatial and temporal expression of their target genes. Although the overall levels of gene expression in large cell populations seem to be precisely controlled, transcription of individual genes in single cells is extremely variable in real time. It is, therefore, important to understand how these cis-regulatory elements function to dynamically control transcription at single-cell resolution. Recently, statistical methods have been proposed to back-calculate the rates involved in mRNA transcription by estimating the parameters of a mathematical model of transcription and translation. However, a major complication in these approaches is that some of the parameters, particularly those corresponding to the gene copy number and transcription rate, cannot be distinguished; therefore, these methods cannot be used when the copy number is unknown.
Results: Here, we develop a hierarchical Bayesian model to estimate biokinetic parameters from live cell enhancer–promoter reporter measurements performed on a population of single cells. This allows us to investigate transcriptional dynamics when the copy number is variable across the population. We validate our method using synthetic data and then apply it to quantify the function of two known developmental enhancers in real time and in single cells.
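To see why copy number and transcription rate can be confounded, consider a deliberately simplified model of transcription and translation. This is an assumption made for illustration, not the authors' model: $n$ is the gene copy number, $\beta$ the per-copy transcription rate, $\alpha$ the translation rate, and $\delta_M$, $\delta_P$ the degradation rates of mRNA $M$ and reporter protein $P$.

```latex
\begin{align}
\frac{dM}{dt} &= n\,\beta - \delta_M M, \\
\frac{dP}{dt} &= \alpha M - \delta_P P .
\end{align}
```

In this sketch $n$ and $\beta$ enter only through the product $n\beta$, so the pairs $(n, \beta)$ and $(cn, \beta/c)$ produce identical trajectories for any $c > 0$. Measurements from a single cell therefore cannot separate the two without additional structure, such as information shared across a population of cells.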
Discovering transcriptional modules by Bayesian data integration
Motivation: We present a method for directly inferring transcriptional modules (TMs) by integrating gene expression and transcription factor binding (ChIP-chip) data. Our model extends a hierarchical Dirichlet process mixture model to allow data fusion on a gene-by-gene basis. This encodes the intuition that co-expression and co-regulation are not necessarily equivalent and hence we do not expect all genes to group similarly in both datasets. In particular, it allows us to identify the subset of genes that share the same structure of transcriptional modules in both datasets.
Results: We find that by working on a gene-by-gene basis, our model is able to extract clusters with greater functional coherence than existing methods. By combining gene expression and transcription factor binding (ChIP-chip) data in this way, we are better able to determine the groups of genes that are most likely to represent underlying TMs.
- …