Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large $n$ & large $p$" sparse Bayesian regression
In a modern observational study based on healthcare databases, the number of
observations and of predictors typically range in the order of $10^5 \sim 10^6$
and of $10^4 \sim 10^5$. Despite the large sample size, data rarely provide
sufficient information to reliably estimate such a large number of parameters.
Sparse regression techniques provide potential solutions, one notable approach
being the Bayesian methods based on shrinkage priors. In the "large $n$ & large
$p$" setting, however, posterior computation encounters a major bottleneck at
repeated sampling from a high-dimensional Gaussian distribution, whose
precision matrix $\Phi$ is expensive to compute and factorize. In this article,
we present a novel algorithm to speed up this bottleneck based on the following
observation: we can cheaply generate a random vector $b$ such that the solution
to the linear system $\Phi \beta = b$ has the desired Gaussian distribution. We
can then solve the linear system by the conjugate gradient (CG) algorithm
through matrix-vector multiplications by $\Phi$, without ever explicitly
inverting $\Phi$. Rapid convergence of CG in this specific context is achieved
by the theory of prior-preconditioning we develop. We apply our algorithm to a
clinically relevant large-scale observational study with $n$ = 72,489 patients
and $p$ = 22,175 clinical covariates, designed to assess the relative risk of
adverse events from two alternative blood anti-coagulants. Our algorithm
demonstrates an order of magnitude speed-up in the posterior computation.
Comment: 32 pages, 7 figures + Supplement (23 pages, 7 figures)
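The sampling idea in the abstract above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation. It assumes (the abstract does not state this) that the precision matrix has the form Phi = X^T diag(omega) X + diag(lam), with `omega` a vector of observation weights and `lam` a diagonal prior precision, and it reads "prior-preconditioning" as using diag(lam) as the CG preconditioner. All function names are hypothetical.

```python
import math
import random


def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def phi_matvec(X, omega, lam, v):
    """Return (X^T diag(omega) X + diag(lam)) v without forming the matrix."""
    Xv = [dot(row, v) for row in X]
    out = [l * v_j for l, v_j in zip(lam, v)]
    for row, w, xv in zip(X, omega, Xv):
        for j, x_ij in enumerate(row):
            out[j] += x_ij * w * xv
    return out


def sample_rhs(X, omega, lam, y, rng):
    """Draw b ~ N(X^T diag(omega) y, Phi) from independent standard normals.

    b = X^T (omega * y + sqrt(omega) * eta) + sqrt(lam) * delta, so that
    Cov(b) = X^T diag(omega) X + diag(lam) = Phi (assumed form of Phi).
    """
    b = [math.sqrt(l) * rng.gauss(0.0, 1.0) for l in lam]
    for row, w, y_i in zip(X, omega, y):
        scale = w * y_i + math.sqrt(w) * rng.gauss(0.0, 1.0)
        for j, x_ij in enumerate(row):
            b[j] += x_ij * scale
    return b


def prior_preconditioned_cg(X, omega, lam, b, tol=1e-10, max_iter=1000):
    """Solve Phi beta = b by CG, preconditioned with diag(lam)."""
    beta = [0.0] * len(b)
    r = list(b)                                   # residual at beta = 0
    z = [r_j / l for r_j, l in zip(r, lam)]       # apply inverse preconditioner
    d = list(z)                                   # initial search direction
    rz = dot(r, z)
    b_norm = math.sqrt(dot(b, b))
    for _ in range(max_iter):
        if math.sqrt(dot(r, r)) <= tol * b_norm:  # relative residual test
            break
        Ad = phi_matvec(X, omega, lam, d)
        alpha = rz / dot(d, Ad)
        beta = [b_j + alpha * d_j for b_j, d_j in zip(beta, d)]
        r = [r_j - alpha * a_j for r_j, a_j in zip(r, Ad)]
        z = [r_j / l for r_j, l in zip(r, lam)]
        rz_new = dot(r, z)
        d = [z_j + (rz_new / rz) * d_j for z_j, d_j in zip(z, d)]
        rz = rz_new
    return beta
```

Under the stated assumption, calling `sample_rhs` and then `prior_preconditioned_cg` returns a draw from the Gaussian with precision Phi and mean Phi^{-1} X^T diag(omega) y, using only matrix-vector products with X and its transpose.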
Sex, lies and self-reported counts: Bayesian mixture models for heaping in longitudinal count data via birth-death processes
Surveys often ask respondents to report nonnegative counts, but respondents
may misremember or round to a nearby multiple of 5 or 10. This phenomenon is
called heaping, and the error inherent in heaped self-reported numbers can bias
estimation. Heaped data may be collected cross-sectionally or longitudinally
and there may be covariates that complicate the inferential task. Heaping is a
well-known issue in many survey settings, and inference for heaped data is an
important statistical problem. We propose a novel reporting distribution whose
underlying parameters are readily interpretable as rates of misremembering and
rounding. The process accommodates a variety of heaping grids and allows for
quasi-heaping to values nearly but not equal to heaping multiples. We present a
Bayesian hierarchical model for longitudinal samples with covariates to infer
both the unobserved true distribution of counts and the parameters that control
the heaping process. Finally, we apply our methods to longitudinal
self-reported counts of sex partners in a study of high-risk behavior in
HIV-positive youth.
Comment: Published at http://dx.doi.org/10.1214/15-AOAS809 in the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org)
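To make the heaping phenomenon in the abstract above concrete, here is a deliberately simple sketch of a reporting rule. It is not the birth-death reporting distribution the authors propose: it only rounds a true count to the nearest grid multiple with a fixed probability. The function name and the parameters `p_round` and `grid` are hypothetical.

```python
import random


def heap_report(true_count, p_round, grid=5, rng=random):
    """Report a count, rounding to the nearest multiple of `grid`
    with probability `p_round` (ties round up); otherwise report it exactly."""
    if rng.random() < p_round:
        return grid * ((true_count + grid // 2) // grid)
    return true_count
```

Applying this rule to a sample of true counts piles reported values onto multiples of the grid, which is the distortion a heaping model has to undo.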
Stability-mediated epistasis constrains the evolution of an influenza protein.
John Maynard Smith compared protein evolution to the game where one word is converted into another a single letter at a time, with the constraint that all intermediates are words: WORD→WORE→GORE→GONE→GENE. In this analogy, epistasis constrains evolution, with some mutations tolerated only after the occurrence of others. To test whether epistasis similarly constrains actual protein evolution, we created all intermediates along a 39-mutation evolutionary trajectory of influenza nucleoprotein, and also introduced each mutation individually into the parent. Several mutations were deleterious to the parent despite becoming fixed during evolution without negative impact. These mutations were destabilizing, and were preceded or accompanied by stabilizing mutations that alleviated their adverse effects. The constrained mutations occurred at sites enriched in T-cell epitopes, suggesting they promote viral immune escape. Our results paint a coherent portrait of epistasis during nucleoprotein evolution, with stabilizing mutations permitting otherwise inaccessible destabilizing mutations which are sometimes of adaptive value. DOI:http://dx.doi.org/10.7554/eLife.00631.001
Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge
This paper develops a new scalable sparse Cox regression tool for sparse
high-dimensional massive sample size (sHDMSS) survival data. The method is a
local $L_0$-penalized Cox regression via repeatedly performing reweighted
$L_2$-penalized Cox regression. We show that the resulting estimator enjoys the
best of $L_0$- and $L_2$-penalized Cox regressions while overcoming their
limitations. Specifically, the estimator is selection consistent, oracle for
parameter estimation, and possesses a grouping property for highly correlated
covariates. Simulation results suggest that when the sample size is large, the
proposed method with pre-specified tuning parameters has a comparable or better
performance than some popular penalized regression methods. More importantly,
because the method naturally enables adaptation of efficient algorithms for
massive $L_2$-penalized optimization and does not require costly data driven
tuning parameter selection, it has a significant computational advantage for
sHDMSS data, offering an average of 5-fold speedup over its closest competitor
in empirical studies.
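The reweighting scheme described in the abstract above can be illustrated on a simpler problem. The sketch below swaps the Cox partial likelihood for ordinary least squares, so it is a broken-adaptive-ridge-style iteration for linear regression and not the authors' Cox estimator. Each step solves a ridge problem whose penalty on coefficient j is weighted by the inverse square of the previous estimate. The names `broken_adaptive_ridge`, `lam`, `xi`, and `zero_tol` are hypothetical.

```python
def solve(A, b):
    """Solve A x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [b_i] for row, b_i in zip(A, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + 1):
                M[r][c] -= f * M[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        s = sum(M[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (M[i][n] - s) / M[i][i]
    return x


def broken_adaptive_ridge(X, y, lam=1.0, xi=1.0, n_iter=50, zero_tol=1e-8):
    """Iteratively reweighted ridge for least squares (illustrative only)."""
    p = len(X[0])
    XtX = [[sum(row[a] * row[b] for row in X) for b in range(p)]
           for a in range(p)]
    Xty = [sum(row[a] * y_i for row, y_i in zip(X, y)) for a in range(p)]
    # Initial estimate: plain ridge regression with penalty xi.
    A = [[XtX[a][b] + (xi if a == b else 0.0) for b in range(p)]
         for a in range(p)]
    beta = solve(A, Xty)
    for _ in range(n_iter):
        g = [abs(b_j) for b_j in beta]
        # Substituting beta = g * u turns the weighted ridge problem into
        # (G XtX G + lam I) u = G Xty, which avoids dividing by beta_j ** 2.
        A = [[g[a] * XtX[a][b] * g[b] + (lam if a == b else 0.0)
              for b in range(p)] for a in range(p)]
        u = solve(A, [g[a] * Xty[a] for a in range(p)])
        beta = [g_j * u_j for g_j, u_j in zip(g, u)]
    # Coefficients that have collapsed towards zero are reported as exact zeros.
    return [0.0 if abs(b_j) < zero_tol else b_j for b_j in beta]
```

In this sketch, small coefficients shrink faster at every pass and end up at zero, while large coefficients settle close to their unpenalized values, which is the selection behaviour the abstract attributes to the method.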