13 research outputs found

    Quantifying alternative splicing from paired-end RNA-sequencing data

    RNA-sequencing has revolutionized biomedical research and, in particular, our ability to study gene alternative splicing. The problem has important implications for human health, as alternative splicing may be involved in malfunctions at the cellular level and multiple diseases. However, the high-dimensional nature of the data and the existence of experimental biases pose serious data analysis challenges. We find that the standard data summaries used to study alternative splicing are severely limited, as they ignore a substantial amount of valuable information. Current data analysis methods are based on such summaries and are hence suboptimal. Further, they have limited flexibility in accounting for technical biases. We propose novel data summaries and a Bayesian modeling framework that overcome these limitations and determine biases in a nonparametric, highly flexible manner. These summaries adapt naturally to the rapid improvements in sequencing technology. We provide efficient point estimates and uncertainty assessments. The approach allows the study of alternative splicing patterns for individual samples and can also be the basis for downstream analyses. We found a severalfold improvement in estimation mean square error compared with popular approaches in simulations, and substantially higher consistency between replicates in experimental data. Our findings indicate the need for adjusting the routine summarization and analysis of alternative splicing RNA-seq studies. We provide a software implementation in the R package casper. Comment: Published at http://dx.doi.org/10.1214/13-AOAS687 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org). With correction

    A Bayesian time-varying autoregressive model for improved short‐term and long‐term prediction

    Motivated by the application to German interest rates, we propose a time‐varying autoregressive model for short‐term and long‐term prediction of time series that exhibit temporary nonstationary behavior but are assumed to mean revert in the long run. We use a Bayesian formulation to incorporate prior assumptions on the mean reverting process in the model and thereby regularize predictions in the far future. We use MCMC‐based inference by deriving relevant full conditional distributions and employ a Metropolis‐Hastings within Gibbs sampler approach to sample from the posterior (predictive) distribution. By combining data‐driven short‐term predictions with long‐term distribution assumptions, our model is competitive with the existing methods in the short horizon while yielding reasonable predictions in the long run. We apply our model to interest rate data and contrast the forecasting performance with that of a 2‐Additive‐Factor Gaussian model as well as with the predictions of a dynamic Nelson‐Siegel model. Peer Reviewed
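
The mean-reversion idea in this abstract can be illustrated with a much simpler model than the one the authors propose. The sketch below assumes a stationary AR(1) process with fixed, known parameters (no time variation, no priors, no MCMC) and shows how the predictive mean starts near the last observation and decays toward the long-run mean as the horizon grows. The function name `ar1_forecast` and all parameter values are hypothetical and chosen only for illustration.

```python
import numpy as np

def ar1_forecast(last_value, mu, phi, sigma, horizon):
    """Predictive mean and variance, 1..horizon steps ahead, of a stationary AR(1).

    Assumed model (illustrative, not the paper's time-varying model):
        x_t - mu = phi * (x_{t-1} - mu) + eps_t,  eps_t ~ N(0, sigma^2),  |phi| < 1
    """
    h = np.arange(1, horizon + 1)
    # The deviation from the long-run mean shrinks geometrically with the horizon.
    mean = mu + phi ** h * (last_value - mu)
    # The forecast variance grows toward the stationary variance sigma^2 / (1 - phi^2).
    var = sigma ** 2 * (1.0 - phi ** (2 * h)) / (1.0 - phi ** 2)
    return mean, var

# Hypothetical numbers: last observed rate 4.0, long-run mean 2.0.
mean, var = ar1_forecast(last_value=4.0, mu=2.0, phi=0.9, sigma=0.5, horizon=200)
print(mean[0], mean[-1])  # short horizon stays near 4.0, long horizon approaches 2.0
```

In this toy setting the short-horizon forecast is driven by the data and the long-horizon forecast by the assumed long-run mean, which is the trade-off the abstract describes.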

    Elastic analysis of irregularly or sparsely sampled curves

    We provide statistical analysis methods for samples of curves in two or more dimensions, where the image, but not the parameterization of the curves, is of interest and suitable alignment/registration is thus necessary. Examples are handwritten letters, movement paths, or object outlines. We focus in particular on the computation of (smooth) means and distances, allowing, for example, classification or clustering. Existing parameterization invariant analysis methods based on the elastic distance of the curves modulo parameterization, using the square‐root‐velocity framework, have limitations in common realistic settings where curves are irregularly and potentially sparsely observed. We propose using spline curves to model smooth or polygonal (Fréchet) means of open or closed curves with respect to the elastic distance and show identifiability of the spline model modulo parameterization. We further provide methods and algorithms to approximate the elastic distance for irregularly or sparsely observed curves, via interpreting them as polygons. We illustrate the usefulness of our methods on two datasets. The first application classifies irregularly sampled spirals drawn by Parkinson's patients and healthy controls, based on the elastic distance to a mean spiral curve computed using our approach. The second application clusters sparsely sampled GPS tracks based on the elastic distance and computes smooth cluster means to find new paths on the Tempelhof field in Berlin. All methods are implemented in the R‐package “elasdics” and evaluated in simulations. Peer Reviewed
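
As a rough illustration of the square-root-velocity (SRV) framework mentioned in this abstract, the sketch below computes the SRV transform q(t) = β'(t) / sqrt(|β'(t)|) of a sparsely observed curve treated as a polygon, so that q is constant on each segment. It does not perform the optimization over reparameterizations that the elastic distance requires, and it is not the elasdics implementation; the function names are hypothetical.

```python
import numpy as np

def srv_polygon(points, t=None):
    """Piecewise-constant SRV transform of a polygon through `points`.

    points: (n, d) array of observed curve points.
    t:      increasing parameter values of the observations (default: equally spaced on [0, 1]).
    Returns the SRV value on each segment and the segment parameter lengths.
    """
    points = np.asarray(points, dtype=float)
    if t is None:
        t = np.linspace(0.0, 1.0, len(points))
    t = np.asarray(t, dtype=float)
    delta = np.diff(points, axis=0)          # segment vectors
    h = np.diff(t)                           # parameter length of each segment
    seg_len = np.linalg.norm(delta, axis=1)  # Euclidean length of each segment
    q = np.zeros_like(delta)
    nonzero = seg_len > 0                    # zero-length segments have SRV value 0
    # On a segment, beta' = delta / h, hence q = delta / sqrt(|delta| * h).
    q[nonzero] = delta[nonzero] / np.sqrt(seg_len[nonzero] * h[nonzero])[:, None]
    return q, h

def squared_l2_norm(q, h):
    """Squared L2 norm of a piecewise-constant SRV function."""
    return float(np.sum(np.sum(q ** 2, axis=1) * h))

# An irregularly sampled polygon with segment lengths 5 and 6.
q, h = srv_polygon([[0, 0], [3, 4], [3, 10]], t=[0.0, 0.2, 1.0])
print(squared_l2_norm(q, h))  # equals the curve length, regardless of the parameterization
```

A useful sanity check is that the squared L2 norm of the SRV transform equals the length of the curve, whatever parameter values are attached to the observations.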

    Functional Additive Models on Manifolds of Planar Shapes and Forms

    The “shape” of a planar curve and/or landmark configuration is defined as its equivalence class under translation, rotation, and scaling; its “form” is its equivalence class under translation and rotation, with scale preserved. We extend generalized additive regression to models for such shapes/forms as responses, respecting the resulting quotient geometry by employing the squared geodesic distance as loss function and a geodesic response function to map the additive predictor to the shape/form space. For fitting the model, we propose a Riemannian L2-Boosting algorithm well suited for a potentially large number of possibly parameter-intensive model terms, which also yields automated model selection. We provide novel, intuitively interpretable visualizations for (even nonlinear) covariate effects in the shape/form space via suitable tensor-product factorization. The usefulness of the proposed framework is illustrated in an analysis of (a) astragalus shapes of wild and domesticated sheep and (b) cell forms generated in a biophysical model, as well as (c) in a realistic simulation study with response shapes and forms motivated by a dataset on bottle outlines. Supplementary materials for this article are available online. Peer Reviewed
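
The notion of shape as an equivalence class can be made concrete for planar landmark configurations. The sketch below assumes the standard complex-number representation of planar landmarks: centering removes translation, normalizing removes scale, and taking the modulus of the complex inner product removes rotation. It only computes a geodesic distance between two shapes and is not the regression or boosting method of the paper; the function names are hypothetical.

```python
import numpy as np

def preshape(landmarks):
    """Map (k, 2) landmarks to a centered, unit-norm complex vector."""
    z = np.asarray(landmarks, dtype=float)
    z = z[:, 0] + 1j * z[:, 1]
    z = z - z.mean()               # remove translation
    return z / np.linalg.norm(z)   # remove scale

def shape_distance(a, b):
    """Geodesic distance between the shapes of two planar configurations.

    The modulus of the complex inner product is invariant to rotating
    either configuration, so rotation is removed implicitly.
    """
    inner = np.vdot(preshape(a), preshape(b))
    return float(np.arccos(np.clip(abs(inner), 0.0, 1.0)))

triangle = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]])

# Rotate by 0.7 rad, scale by 3 and translate: the shape is unchanged.
angle = 0.7
rotation = np.array([[np.cos(angle), -np.sin(angle)],
                     [np.sin(angle),  np.cos(angle)]])
moved = 3.0 * triangle @ rotation.T + np.array([5.0, -2.0])

stretched = np.array([[0.0, 0.0], [2.0, 0.0], [0.0, 1.0]])  # a genuinely different shape

print(shape_distance(triangle, moved))      # numerically zero
print(shape_distance(triangle, stretched))  # clearly positive
```

For a form distance one would skip the normalization step, since scale is then part of the object being compared.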

    Boosting Functional Response Models for Location, Scale and Shape with an Application to Bacterial Competition

    We extend Generalized Additive Models for Location, Scale, and Shape (GAMLSS) to regression with functional response. This allows us to simultaneously model point-wise mean curves, variances and other distributional parameters of the response depending on various scalar and functional covariate effects. In addition, the scope of distributions is extended beyond exponential families. The model is fitted via gradient boosting, which offers inherent model selection and is shown to be suitable for both complex model structures and highly auto-correlated response curves. This enables us to analyze bacterial growth in Escherichia coli in a complex interaction scenario, fruitfully extending usual growth models. Comment: bootstrap confidence interval type uncertainty bounds added; minor changes in formulation

    Multivariate functional additive mixed models

    Multivariate functional data can be intrinsically multivariate, like movement trajectories in 2D, or complementary, such as precipitation, temperature and wind speeds over time at a given weather station. We propose a multivariate functional additive mixed model (multiFAMM) and show its application to both data situations using examples from sports science (movement trajectories of snooker players) and phonetic science (acoustic signals and articulation of consonants). The approach includes linear and nonlinear covariate effects and models the dependency structure between the dimensions of the responses using multivariate functional principal component analysis. Multivariate functional random intercepts capture both the auto-correlation within a given function and cross-correlations between the multivariate functional dimensions. They also allow us to model between-function correlations as induced by, for example, repeated measurements or crossed study designs. Modelling the dependency structure between the dimensions can generate additional insight into the properties of the multivariate functional process, improves the estimation of random effects, and yields corrected confidence bands for covariate effects. Extensive simulation studies indicate that a multivariate modelling approach is more parsimonious than fitting independent univariate models to the data while maintaining or improving model fit. Peer Reviewed

    Quantifying alternative splicing from paired-end RNA-sequencing data

    RNA-sequencing has revolutionized biomedical research and, in particular, our ability to study gene alternative splicing. The problem has important implications for human health, as alternative splicing is involved in malfunctions at the cellular level and multiple diseases. However, the high-dimensional nature of the data and the existence of experimental biases pose serious data analysis challenges. We find that the standard data summaries used to study alternative splicing are severely limited, as they ignore a substantial amount of valuable information. Current data analysis methods are based on such summaries and are hence sub-optimal. Further, they have limited flexibility in accounting for technical biases. We propose novel data summaries and a Bayesian modeling framework that overcome these limitations and determine biases in a non-parametric, data-dependent manner. These summaries adapt naturally to the rapid improvements in sequencing technology. We provide efficient point estimates and uncertainty assessments. The approach allows the study of alternative splicing patterns for individual samples and can also be the basis for downstream differential expression analysis. We found a more than fivefold improvement in estimation mean square error compared with a popular approach in simulations, and substantially higher correlations between replicates in experimental data. Our findings indicate the need for modifying the routine summarization and analysis of alternative splicing RNA-seq studies. We provide a software implementation in the R package casper.

    Pedestrian exposure to black carbon and PM2.5 emissions in urban hot spots: new findings using mobile measurement techniques and flexible Bayesian regression models

    Background: Data from extensive mobile measurements (MM) of air pollutants provide spatially resolved information on pedestrians’ exposure to particulate matter (black carbon (BC) and PM2.5 mass concentrations). Objective: We present a distributional regression model in a Bayesian framework that estimates the effects of spatiotemporal factors on the pollutant concentrations influencing pedestrian exposure. Methods: We modeled the mean and variance of the pollutant concentrations obtained from MM in two cities and extended commonly used lognormal models with a lognormal-normal convolution (logNNC) extension for BC to account for instrument measurement error. Results: The logNNC extension significantly improved the BC model. From these model results, we found that local sources, and hence local mitigation efforts to improve air quality, have more impact on the ambient levels of BC mass concentrations than on the regulated PM2.5. Significance: Firstly, this model (logNNC in the bamlss package available in R) could be used for the statistical analysis of MM data from various study areas and pollutants, with the potential for predicting pollutant concentrations in urban areas. Secondly, with respect to pedestrian exposure, it is crucial for BC mass concentration to be monitored and regulated in areas dominated by traffic-related air pollution.
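
One way to read the lognormal-normal convolution in this abstract is that an observed concentration is the sum of a lognormally distributed true concentration and an independent, zero-mean normal instrument error. That reading is an assumption made here for illustration, and the sketch below is not the bamlss implementation. It evaluates the resulting density by numerical integration over the log of the true concentration; the function names and parameter values are hypothetical.

```python
import numpy as np

def trapezoid(values, grid):
    """Plain trapezoidal rule on a one-dimensional grid."""
    return float(np.sum(0.5 * (values[1:] + values[:-1]) * np.diff(grid)))

def lognormal_normal_density(y, mu, sigma, tau, n_grid=2001):
    """Density of Y = X + E with log(X) ~ N(mu, sigma^2) and E ~ N(0, tau^2).

    Integrates over u = log(X) on mu +/- 8 sigma with the trapezoidal rule.
    """
    y = np.atleast_1d(np.asarray(y, dtype=float))
    u = np.linspace(mu - 8.0 * sigma, mu + 8.0 * sigma, n_grid)
    # Normal density of the log of the true concentration.
    log_part = np.exp(-0.5 * ((u - mu) / sigma) ** 2) / (sigma * np.sqrt(2.0 * np.pi))
    # Normal density of the measurement error y - exp(u), for every (y, u) pair.
    resid = (y[:, None] - np.exp(u)[None, :]) / tau
    err_part = np.exp(-0.5 * resid ** 2) / (tau * np.sqrt(2.0 * np.pi))
    integrand = log_part[None, :] * err_part
    return np.sum(0.5 * (integrand[:, 1:] + integrand[:, :-1]) * np.diff(u), axis=1)

# Hypothetical parameters; negative observed values are possible because of the error term.
y_grid = np.linspace(-3.0, 25.0, 1401)
density = lognormal_normal_density(y_grid, mu=0.0, sigma=0.5, tau=0.3)
total_mass = trapezoid(density, y_grid)
mean_y = trapezoid(density * y_grid, y_grid)
print(total_mass, mean_y)
```

Unlike a plain lognormal model, this density puts positive mass on values at or below zero, which is one reason such an extension can help when instrument noise is comparable to low concentrations.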