RNA-sequencing has revolutionized biomedical research and, in particular, our
ability to study the alternative splicing of genes. This topic has important
implications for human health, as alternative splicing may be involved in
malfunctions at the cellular level and in multiple diseases. However, the
high-dimensional nature of the data and the existence of experimental biases
pose serious data analysis challenges. We find that the standard data summaries
used to study alternative splicing are severely limited, as they ignore a
substantial amount of valuable information. Current data analysis methods are
based on such summaries and are hence suboptimal. Further, they have limited
flexibility in accounting for technical biases. We propose novel data summaries
and a Bayesian modeling framework that overcome these limitations and determine
biases in a nonparametric, highly flexible manner. These summaries adapt
naturally to the rapid improvements in sequencing technology. We provide
efficient point estimates and uncertainty assessments. The approach allows the
study of alternative splicing patterns in individual samples and can also serve
as the basis for downstream analyses. We found a severalfold improvement in
estimation mean squared error compared to popular approaches in simulations, and
substantially higher consistency between replicates in experimental data. Our findings
indicate the need for adjusting the routine summarization and analysis of
alternative splicing RNA-seq studies. We provide a software implementation in
the R package casper.

Comment: Published in the Annals of Applied Statistics
(http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics
(http://www.imstat.org), at http://dx.doi.org/10.1214/13-AOAS687. With correction