Models for transcript quantification from RNA-Seq
RNA-Seq is rapidly becoming the standard technology for transcriptome
analysis. Fundamental to many of the applications of RNA-Seq is the
quantification problem, which is the accurate measurement of relative
transcript abundances from the sequenced reads. We focus on this problem, and
review many recently published models that are used to estimate the relative
abundances. In addition to describing the models and the different approaches
to inference, we also explain how methods are related to each other. A key
result is that we show how inference with many of the models results in
identical estimates of relative abundances, even though model formulations can
be very different. In fact, we are able to show how a single general model
captures many of the elements of previously published methods. We also review
the applications of RNA-Seq models to differential analysis, and explain why
accurate relative transcript abundance estimates are crucial for downstream
analyses.
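
The quantification problem described in this abstract can be illustrated with a small expectation-maximization (EM) sketch. This is a minimal illustration under stated assumptions, not the general model from the paper: the function name `em_relative_abundance`, the 0/1 read-to-transcript compatibility matrix, and the effective-length vector are all hypothetical inputs chosen for the example, and every read is assumed to be compatible with at least one transcript.

```python
import numpy as np

def em_relative_abundance(compat, eff_len, n_iter=200, tol=1e-10):
    """Estimate read fractions and relative transcript abundances by EM.

    compat  : (n_reads, n_transcripts) 0/1 matrix, 1 if the read is
              compatible with the transcript (hypothetical input format).
    eff_len : effective length of each transcript (assumed known).
    """
    compat = np.asarray(compat, dtype=float)
    eff_len = np.asarray(eff_len, dtype=float)
    n_reads, n_tx = compat.shape
    # alpha[t] is the fraction of reads originating from transcript t.
    alpha = np.full(n_tx, 1.0 / n_tx)
    for _ in range(n_iter):
        # E-step: share each read among its compatible transcripts,
        # in proportion to alpha[t] / eff_len[t].
        weights = compat * (alpha / eff_len)
        resp = weights / weights.sum(axis=1, keepdims=True)
        # M-step: the new read fraction is the mean responsibility.
        new_alpha = resp.sum(axis=0) / n_reads
        converged = np.abs(new_alpha - alpha).max() < tol
        alpha = new_alpha
        if converged:
            break
    # Convert read fractions to length-normalized relative abundances.
    rho = alpha / eff_len
    return alpha, rho / rho.sum()

if __name__ == "__main__":
    # 6 reads unique to transcript A, 2 unique to B, 4 compatible with both.
    reads = [[1, 0]] * 6 + [[0, 1]] * 2 + [[1, 1]] * 4
    alpha, rho = em_relative_abundance(reads, eff_len=[1000.0, 1000.0])
    print(alpha, rho)
```

In this toy input the two transcripts have equal effective lengths, so the ambiguous reads carry no information about the split and the maximum-likelihood read fraction for transcript A is 6/8 = 0.75, which is the value the iteration approaches.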
MSIQ: Joint Modeling of Multiple RNA-seq Samples for Accurate Isoform Quantification
Next-generation RNA sequencing (RNA-seq) technology has been widely used to
assess full-length RNA isoform abundance in a high-throughput manner. RNA-seq
data offer insight into gene expression levels and transcriptome structures,
enabling us to better understand the regulation of gene expression and
fundamental biological processes. Accurate isoform quantification from RNA-seq
data is challenging due to the information loss in sequencing experiments. A
recent accumulation of multiple RNA-seq data sets from the same tissue or cell
type provides new opportunities to improve the accuracy of isoform
quantification. However, existing statistical or computational methods for
multiple RNA-seq samples either pool the samples into one sample or assign
equal weights to the samples when estimating isoform abundance. These methods
ignore the possible heterogeneity in the quality of different samples and could
result in biased and non-robust estimates. In this article, we develop a method,
which we call "joint modeling of multiple RNA-seq samples for accurate isoform
quantification" (MSIQ), for more accurate and robust isoform quantification by
integrating multiple RNA-seq samples under a Bayesian framework. Our method
aims to (1) identify a consistent group of samples with homogeneous quality and
(2) improve isoform quantification accuracy by jointly modeling multiple
RNA-seq samples while allowing higher weights on the consistent group. We show
that MSIQ provides a consistent estimator of isoform abundance, and we
demonstrate the accuracy and effectiveness of MSIQ compared with alternative
methods through simulation studies on D. melanogaster genes. We justify MSIQ's
advantages over existing approaches via application studies on real RNA-seq
data from human embryonic stem cells, brain tissues, and the HepG2 immortalized
cell line.
Identifying differentially expressed transcripts from RNA-seq data with biological variation
Motivation: High-throughput sequencing enables expression analysis at the level of individual transcripts. The analysis of transcriptome expression levels and differential expression (DE) estimation requires a probabilistic approach to properly account for ambiguity caused by shared exons and finite read sampling as well as the intrinsic biological variance of transcript expression. Results: We present Bayesian inference of transcripts from sequencing data (BitSeq), a Bayesian approach for estimation of transcript expression level from RNA-seq experiments. Inferred relative expression is represented by Markov chain Monte Carlo samples from the posterior probability distribution of a generative model of the read data. We propose a novel method for DE analysis across replicates which propagates uncertainty from the sample-level model while modelling biological variance using an expression-level-dependent prior. We demonstrate the advantages of our method using simulated data as well as an RNA-seq dataset with technical and biological replication for both studied conditions. Availability: The implementation of the transcriptome expression estimation and differential expression analysis, BitSeq, has been written in C++ and Python. The software is available online from http://code.google.com/p/bitseq/; version 0.4 was used for generating the results presented in this article. Peer reviewed.
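
The idea of representing relative expression by Markov chain Monte Carlo samples from a posterior can be sketched with a toy Gibbs sampler. This is not BitSeq's implementation or its generative model: the function name `gibbs_read_fractions`, the 0/1 compatibility matrix, and the symmetric Dirichlet prior are assumptions made for illustration, and the sketch ignores alignment likelihoods, transcript lengths, and biological replication.

```python
import numpy as np

def gibbs_read_fractions(compat, n_samples=1000, burn_in=200, prior=1.0, seed=0):
    """Draw posterior samples of per-transcript read fractions.

    compat : (n_reads, n_transcripts) 0/1 compatibility matrix
             (hypothetical input; each read needs at least one 1).
    prior  : symmetric Dirichlet concentration (an assumed prior).
    """
    rng = np.random.default_rng(seed)
    compat = np.asarray(compat, dtype=float)
    n_reads, n_tx = compat.shape
    theta = np.full(n_tx, 1.0 / n_tx)
    samples = np.empty((n_samples, n_tx))
    for it in range(burn_in + n_samples):
        # Step 1: assign each read to one compatible transcript,
        # with probability proportional to the current theta.
        probs = compat * theta
        probs /= probs.sum(axis=1, keepdims=True)
        assign = np.array([rng.choice(n_tx, p=p) for p in probs])
        # Step 2: resample theta from its Dirichlet full conditional,
        # given the number of reads assigned to each transcript.
        counts = np.bincount(assign, minlength=n_tx)
        theta = rng.dirichlet(prior + counts)
        if it >= burn_in:
            samples[it - burn_in] = theta
    return samples

if __name__ == "__main__":
    # 6 reads unique to transcript A, 2 unique to B, 4 compatible with both.
    reads = [[1, 0]] * 6 + [[0, 1]] * 2 + [[1, 1]] * 4
    draws = gibbs_read_fractions(reads)
    print("posterior mean:", draws.mean(axis=0))
    print("posterior std: ", draws.std(axis=0))
```

Keeping the full set of draws, rather than a single point estimate, is what lets the spread of the posterior be carried into a later comparison between conditions.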
Statistical Modeling of RNA-Seq Data
Recently, ultra high-throughput sequencing of RNA (RNA-Seq) has been
developed as an approach for analysis of gene expression. By obtaining tens or
even hundreds of millions of reads of transcribed sequences, an RNA-Seq
experiment can offer a comprehensive survey of the population of genes
(transcripts) in any sample of interest. This paper introduces a statistical
model for estimating isoform abundance from RNA-Seq data that is flexible enough
to accommodate both single end and paired end RNA-Seq data and sampling bias
along the length of the transcript. Based on the derivation of minimal
sufficient statistics for the model, a computationally feasible implementation
of the maximum likelihood estimator of the model is provided. Further, it is
shown that using paired end RNA-Seq provides more accurate isoform abundance
estimates than single end sequencing at fixed sequencing depth. Simulation
studies are also given. Comment: Published at http://dx.doi.org/10.1214/10-STS343 in
Statistical Science (http://www.imstat.org/sts/) by the Institute of Mathematical
Statistics (http://www.imstat.org).
Quantifying alternative splicing from paired-end RNA-sequencing data
RNA-sequencing has revolutionized biomedical research and, in particular, our
ability to study gene alternative splicing. The problem has important
implications for human health, as alternative splicing may be involved in
malfunctions at the cellular level and multiple diseases. However, the
high-dimensional nature of the data and the existence of experimental biases
pose serious data analysis challenges. We find that the standard data summaries
used to study alternative splicing are severely limited, as they ignore a
substantial amount of valuable information. Current data analysis methods are
based on such summaries and are hence suboptimal. Further, they have limited
flexibility in accounting for technical biases. We propose novel data summaries
and a Bayesian modeling framework that overcome these limitations and determine
biases in a nonparametric, highly flexible manner. These summaries adapt
naturally to the rapid improvements in sequencing technology. We provide
efficient point estimates and uncertainty assessments. The approach allows the
study of alternative splicing patterns for individual samples and can also be the
basis for downstream analyses. We found a severalfold improvement in estimation
mean square error compared with popular approaches in simulations, and substantially
higher consistency between replicates in experimental data. Our findings
indicate the need for adjusting the routine summarization and analysis of
alternative splicing RNA-seq studies. We provide a software implementation in
the R package casper. Comment: Published at http://dx.doi.org/10.1214/13-AOAS687 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org). With correction