608 research outputs found

    Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large nn & large pp" sparse Bayesian regression

    Full text link
    In a modern observational study based on healthcare databases, the number of observations and of predictors typically range in the order of 10510^5 ~ 10610^6 and of 10410^4 ~ 10510^5. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression techniques provide potential solutions, one notable approach being the Bayesian methods based on shrinkage priors. In the "large nn & large pp" setting, however, posterior computation encounters a major bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix Φ\Phi is expensive to compute and factorize. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: we can cheaply generate a random vector bb such that the solution to the linear system Φβ=b\Phi \beta = b has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by Φ\Phi, without ever explicitly inverting Φ\Phi. Rapid convergence of CG in this specific context is achieved by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with nn = 72,489 patients and pp = 22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in the posterior computation.Comment: 32 pages, 7 figures + Supplement (23 pages, 7 figures

    Sex, lies and self-reported counts: Bayesian mixture models for heaping in longitudinal count data via birth-death processes

    Full text link
    Surveys often ask respondents to report nonnegative counts, but respondents may misremember or round to a nearby multiple of 5 or 10. This phenomenon is called heaping, and the error inherent in heaped self-reported numbers can bias estimation. Heaped data may be collected cross-sectionally or longitudinally and there may be covariates that complicate the inferential task. Heaping is a well-known issue in many survey settings, and inference for heaped data is an important statistical problem. We propose a novel reporting distribution whose underlying parameters are readily interpretable as rates of misremembering and rounding. The process accommodates a variety of heaping grids and allows for quasi-heaping to values nearly but not equal to heaping multiples. We present a Bayesian hierarchical model for longitudinal samples with covariates to infer both the unobserved true distribution of counts and the parameters that control the heaping process. Finally, we apply our methods to longitudinal self-reported counts of sex partners in a study of high-risk behavior in HIV-positive youth.Comment: Published at http://dx.doi.org/10.1214/15-AOAS809 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org

    Stability-mediated epistasis constrains the evolution of an influenza protein.

    Get PDF
    John Maynard Smith compared protein evolution to the game where one word is converted into another a single letter at a time, with the constraint that all intermediates are words: WORD→WORE→GORE→GONE→GENE. In this analogy, epistasis constrains evolution, with some mutations tolerated only after the occurrence of others. To test whether epistasis similarly constrains actual protein evolution, we created all intermediates along a 39-mutation evolutionary trajectory of influenza nucleoprotein, and also introduced each mutation individually into the parent. Several mutations were deleterious to the parent despite becoming fixed during evolution without negative impact. These mutations were destabilizing, and were preceded or accompanied by stabilizing mutations that alleviated their adverse effects. The constrained mutations occurred at sites enriched in T-cell epitopes, suggesting they promote viral immune escape. Our results paint a coherent portrait of epistasis during nucleoprotein evolution, with stabilizing mutations permitting otherwise inaccessible destabilizing mutations which are sometimes of adaptive value. DOI:http://dx.doi.org/10.7554/eLife.00631.001

    Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge

    Full text link
    This paper develops a new scalable sparse Cox regression tool for sparse high-dimensional massive sample size (sHDMSS) survival data. The method is a local L0L_0-penalized Cox regression via repeatedly performing reweighted L2L_2-penalized Cox regression. We show that the resulting estimator enjoys the best of L0L_0- and L2L_2-penalized Cox regressions while overcoming their limitations. Specifically, the estimator is selection consistent, oracle for parameter estimation, and possesses a grouping property for highly correlated covariates. Simulation results suggest that when the sample size is large, the proposed method with pre-specified tuning parameters has a comparable or better performance than some popular penalized regression methods. More importantly, because the method naturally enables adaptation of efficient algorithms for massive L2L_2-penalized optimization and does not require costly data driven tuning parameter selection, it has a significant computational advantage for sHDMSS data, offering an average of 5-fold speedup over its closest competitor in empirical studies
    • …
    corecore