
Prior-preconditioned conjugate gradient method for accelerated Gibbs sampling in "large $n$ & large $p$ " sparse Bayesian regression

Authors: Nishimura Akihiko; Suchard Marc A.
Publication date: 17/01/2020

In a modern observational study based on healthcare databases, the number of observations and of predictors typically range in the order of $10^5$ to $10^6$ and of $10^4$ to $10^5$. Despite the large sample size, data rarely provide sufficient information to reliably estimate such a large number of parameters. Sparse regression techniques provide potential solutions, one notable approach being the Bayesian methods based on shrinkage priors. In the "large $n$ & large $p$" setting, however, posterior computation encounters a major bottleneck at repeated sampling from a high-dimensional Gaussian distribution, whose precision matrix $\Phi$ is expensive to compute and factorize. In this article, we present a novel algorithm to speed up this bottleneck based on the following observation: we can cheaply generate a random vector $b$ such that the solution to the linear system $\Phi \beta = b$ has the desired Gaussian distribution. We can then solve the linear system by the conjugate gradient (CG) algorithm through matrix-vector multiplications by $\Phi$, without ever explicitly inverting $\Phi$. Rapid convergence of CG in this specific context is achieved by the theory of prior-preconditioning we develop. We apply our algorithm to a clinically relevant large-scale observational study with $n$ = 72,489 patients and $p$ = 22,175 clinical covariates, designed to assess the relative risk of adverse events from two alternative blood anti-coagulants. Our algorithm demonstrates an order of magnitude speed-up in the posterior computation. Comment: 32 pages, 7 figures + Supplement (23 pages, 7 figures)
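The sampling idea in this abstract can be illustrated with a small NumPy sketch. The abstract does not state the form of $\Phi$ or how $b$ is generated, so the following assumes a common setup for Gaussian-likelihood shrinkage regression: $\Phi = X^\top \Omega X + D^{-2}$ with diagonal weights $\Omega$ and prior standard deviations $D$, and $b = X^\top \Omega y + X^\top \Omega^{1/2} \eta + D^{-1} \delta$ with standard normal $\eta$ and $\delta$. The function name and all argument names are hypothetical, and the preconditioner used here is simply the diagonal prior precision.

```python
import numpy as np

def sample_gaussian_via_cg(X, omega, y, prior_sd, rng, tol=1e-8, max_iter=1000):
    """Draw beta ~ N(Phi^{-1} X' Omega y, Phi^{-1}) without forming Phi.

    Assumed form (not given in the abstract):
        Phi = X' diag(omega) X + diag(prior_sd ** -2)
    """
    n, p = X.shape
    prior_prec = prior_sd ** -2.0  # diagonal prior precision

    def phi_matvec(v):
        # Multiply by Phi using only products with X and X'.
        return X.T @ (omega * (X @ v)) + prior_prec * v

    # Random right-hand side with mean X' Omega y and covariance Phi,
    # so that the solution of Phi beta = b has covariance Phi^{-1}.
    b = (X.T @ (omega * y + np.sqrt(omega) * rng.standard_normal(n))
         + np.sqrt(prior_prec) * rng.standard_normal(p))

    # Preconditioned CG, with the prior precision as the preconditioner.
    beta = np.zeros(p)
    r = b - phi_matvec(beta)
    z = r / prior_prec
    d = z.copy()
    rz = r @ z
    b_norm = np.linalg.norm(b)
    for _ in range(max_iter):
        if np.linalg.norm(r) <= tol * b_norm:
            break
        phi_d = phi_matvec(d)
        alpha = rz / (d @ phi_d)
        beta += alpha * d
        r -= alpha * phi_d
        z = r / prior_prec
        rz_new = r @ z
        d = z + (rz_new / rz) * d
        rz = rz_new
    return beta, b

# Small synthetic usage example (all data are made up for illustration).
rng = np.random.default_rng(1)
X_demo = rng.standard_normal((200, 50))
y_demo = rng.standard_normal(200)
sd_demo = np.where(np.arange(50) < 5, 1.0, 0.01)  # a few weakly shrunk coefficients
beta_demo, _ = sample_gaussian_via_cg(X_demo, np.ones(200), y_demo, sd_demo, rng)
```

Each CG iteration costs two matrix-vector products with `X`, which is the reason a small iteration count matters when both dimensions are large.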

arXiv.org e-Print Archive

eScholarship - University of California

Sex, lies and self-reported counts: Bayesian mixture models for heaping in longitudinal count data via birth-death processes

Authors: Crawford Forrest W.; Suchard Marc A.; Weiss Robert E.
Publication venue: 'Institute of Mathematical Statistics'
Publication date: 01/01/2015

Surveys often ask respondents to report nonnegative counts, but respondents may misremember or round to a nearby multiple of 5 or 10. This phenomenon is called heaping, and the error inherent in heaped self-reported numbers can bias estimation. Heaped data may be collected cross-sectionally or longitudinally and there may be covariates that complicate the inferential task. Heaping is a well-known issue in many survey settings, and inference for heaped data is an important statistical problem. We propose a novel reporting distribution whose underlying parameters are readily interpretable as rates of misremembering and rounding. The process accommodates a variety of heaping grids and allows for quasi-heaping to values nearly but not equal to heaping multiples. We present a Bayesian hierarchical model for longitudinal samples with covariates to infer both the unobserved true distribution of counts and the parameters that control the heaping process. Finally, we apply our methods to longitudinal self-reported counts of sex partners in a study of high-risk behavior in HIV-positive youth. Comment: Published at http://dx.doi.org/10.1214/15-AOAS809 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org)
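To make the heaping phenomenon concrete, here is a deliberately simplified simulation. It is not the birth-death reporting distribution proposed in the paper: it only assumes that each respondent independently rounds the true count to the nearest multiple of 10 or 5 with fixed probabilities. The function name `heap_counts` and its parameters are hypothetical.

```python
import numpy as np

def heap_counts(true_counts, p_round5, p_round10, rng):
    """Report each count exactly, or rounded to the nearest multiple of 5 or 10.

    Simplifying assumption: rounding probabilities are constant and
    independent of the true count.
    """
    true_counts = np.asarray(true_counts)
    u = rng.random(true_counts.shape)
    # floor(x / g + 0.5) * g rounds halves up, avoiding round-half-to-even.
    to5 = (np.floor(true_counts / 5 + 0.5) * 5).astype(int)
    to10 = (np.floor(true_counts / 10 + 0.5) * 10).astype(int)
    return np.where(u < p_round10, to10,
                    np.where(u < p_round10 + p_round5, to5, true_counts))

# Usage: compare how often multiples of 5 appear before and after heaping.
rng = np.random.default_rng(0)
true_demo = rng.poisson(12, size=10_000)
reported_demo = heap_counts(true_demo, p_round5=0.3, p_round10=0.2, rng=rng)
share_true = np.mean(true_demo % 5 == 0)
share_reported = np.mean(reported_demo % 5 == 0)
```

In this toy setting `share_reported` exceeds `share_true`, which is the spike at heaping multiples that a model for heaped data has to account for.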

arXiv.org e-Print Archive

Crossref

PubMed Central

eScholarship - University of California

Stability-mediated epistasis constrains the evolution of an influenza protein.

Authors: Bloom Jesse D; Gong Lizhi Ian; Suchard Marc A
Publication venue: eScholarship, University of California
Publication date: 01/05/2013

John Maynard Smith compared protein evolution to the game where one word is converted into another a single letter at a time, with the constraint that all intermediates are words: WORD→WORE→GORE→GONE→GENE. In this analogy, epistasis constrains evolution, with some mutations tolerated only after the occurrence of others. To test whether epistasis similarly constrains actual protein evolution, we created all intermediates along a 39-mutation evolutionary trajectory of influenza nucleoprotein, and also introduced each mutation individually into the parent. Several mutations were deleterious to the parent despite becoming fixed during evolution without negative impact. These mutations were destabilizing, and were preceded or accompanied by stabilizing mutations that alleviated their adverse effects. The constrained mutations occurred at sites enriched in T-cell epitopes, suggesting they promote viral immune escape. Our results paint a coherent portrait of epistasis during nucleoprotein evolution, with stabilizing mutations permitting otherwise inaccessible destabilizing mutations which are sometimes of adaptive value. DOI:http://dx.doi.org/10.7554/eLife.00631.001

PubMed Central

eScholarship - University of California

Scalable Sparse Cox's Regression for Large-Scale Survival Data via Broken Adaptive Ridge

Authors: Kawaguchi Eric S.; Li Gang; Liu Zhenqiu; Suchard Marc A.
Publication venue: 'Wiley'
Publication date: 25/07/2018

This paper develops a new scalable sparse Cox regression tool for sparse high-dimensional massive sample size (sHDMSS) survival data. The method is a local $L_0$-penalized Cox regression via repeatedly performing reweighted $L_2$-penalized Cox regression. We show that the resulting estimator enjoys the best of $L_0$- and $L_2$-penalized Cox regressions while overcoming their limitations. Specifically, the estimator is selection consistent, oracle for parameter estimation, and possesses a grouping property for highly correlated covariates. Simulation results suggest that when the sample size is large, the proposed method with pre-specified tuning parameters has a comparable or better performance than some popular penalized regression methods. More importantly, because the method naturally enables adaptation of efficient algorithms for massive $L_2$-penalized optimization and does not require costly data driven tuning parameter selection, it has a significant computational advantage for sHDMSS data, offering an average of 5-fold speedup over its closest competitor in empirical studies.
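The idea of approximating an $L_0$ penalty by repeated reweighted $L_2$-penalized fits can be sketched in a few lines. This sketch swaps the Cox partial likelihood for ordinary least squares so that each ridge step has a closed form, and it assumes the reweighting penalizes coefficient $j$ by $\lambda / \beta_j^2$ using the previous iterate, which the abstract does not specify. The function name and parameters are hypothetical.

```python
import numpy as np

def broken_adaptive_ridge_ls(X, y, lam=1.0, ridge_init=1.0, n_iter=100, tol=1e-8):
    """Iteratively reweighted ridge for least squares (stand-in for Cox regression).

    Each step solves: minimize ||y - X b||^2 + lam * sum_j b_j^2 / beta_j^2,
    where beta is the previous iterate (assumed weighting scheme).
    """
    p = X.shape[1]
    gram = X.T @ X
    xty = X.T @ y
    # Start from an ordinary ridge estimate.
    beta = np.linalg.solve(gram + ridge_init * np.eye(p), xty)
    for _ in range(n_iter):
        g = np.abs(beta)
        # Equivalent form of (gram + lam * diag(1 / beta^2))^{-1} X'y that
        # stays well defined as coefficients shrink to zero.
        a = g[:, None] * gram * g[None, :] + lam * np.eye(p)
        beta_new = g * np.linalg.solve(a, g * xty)
        converged = np.max(np.abs(beta_new - beta)) < tol
        beta = beta_new
        if converged:
            break
    return beta

# Usage on synthetic data with two nonzero coefficients (made-up example).
rng = np.random.default_rng(0)
X_demo = rng.standard_normal((200, 10))
truth_demo = np.zeros(10)
truth_demo[:2] = [3.0, -2.0]
y_demo = X_demo @ truth_demo + 0.5 * rng.standard_normal(200)
beta_demo = broken_adaptive_ridge_ls(X_demo, y_demo, lam=2.0)
```

Small coefficients receive ever larger penalties and collapse toward zero, while large coefficients are barely shrunk, which is how the repeated ridge fits mimic subset selection.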

arXiv.org e-Print Archive

eScholarship - University of California