Fast model-fitting of Bayesian variable selection regression using the iterative complex factorization algorithm
Bayesian variable selection regression (BVSR) can jointly analyze
genome-wide genetic datasets, but slow computation via Markov chain Monte
Carlo (MCMC) has hampered its widespread use. Here we present a novel iterative
method to solve a special class of linear systems, which can increase the speed
of the BVSR model-fitting tenfold. The iterative method hinges on the complex
factorization of the sum of two matrices and the solution path resides in the
complex domain (instead of the real domain). Compared to the Gauss-Seidel
method, the complex factorization converges almost instantaneously, and its
error is several orders of magnitude smaller than that of the Gauss-Seidel
method. More importantly, its error always stays within the pre-specified
precision, whereas the Gauss-Seidel method's does not. For large problems with
thousands of covariates, the complex factorization is 10-100 times faster than
either the Gauss-Seidel method or the direct method via the Cholesky
decomposition. In BVSR, one must repeatedly solve large penalized regression systems whose
design matrices only change slightly between adjacent MCMC steps. This slight
change in design matrix enables the adaptation of the iterative complex
factorization method. The computational innovation will facilitate the
widespread use of BVSR in reanalyzing genome-wide association datasets.
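The abstract does not spell out the iterative complex factorization itself, but its two comparison points are standard. Below is a minimal sketch of those baselines, assuming the penalized systems have the usual ridge-type form (X^T X + D) beta = X^T y with D a diagonal penalty; all names (X, d, y) are generic:

```python
import numpy as np

def solve_cholesky(X, d, y):
    """Direct solve of (X^T X + D) beta = X^T y via Cholesky, D = diag(d)."""
    A = X.T @ X + np.diag(d)
    L = np.linalg.cholesky(A)
    # Forward then backward substitution.
    z = np.linalg.solve(L, X.T @ y)
    return np.linalg.solve(L.T, z)

def solve_gauss_seidel(X, d, y, tol=1e-8, max_iter=10_000):
    """Gauss-Seidel iteration on the same system; A is SPD here, so the
    iteration converges, but its error floor is what the abstract criticizes."""
    A = X.T @ X + np.diag(d)
    b = X.T @ y
    p = len(b)
    beta = np.zeros(p)
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            # Use already-updated coordinates (<j) and old ones (>j).
            r = b[j] - A[j, :j] @ beta[:j] - A[j, j + 1:] @ beta_old[j + 1:]
            beta[j] = r / A[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = rng.standard_normal(200)
d = np.full(50, 10.0)  # ridge-type penalty
print(np.max(np.abs(solve_cholesky(X, d, y) - solve_gauss_seidel(X, d, y))))
```

Because adjacent MCMC steps change the design matrix only slightly, an iterative solver can warm-start from the previous solution, which is what makes the iterative approach attractive over a fresh Cholesky factorization at every step.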
Small-world MCMC and convergence to multi-modal distributions: From slow mixing to fast mixing
We compare convergence rates of Metropolis--Hastings chains to multi-modal
target distributions when the proposal distributions can be of ``local'' and
``small world'' type. In particular, we show that by adding occasional
long-range jumps to a given local proposal distribution, one can turn a chain
that is ``slowly mixing'' (in the complexity of the problem) into a chain that
is ``rapidly mixing.'' To do this, we obtain spectral gap estimates via a new
state decomposition theorem and apply an isoperimetric inequality for
log-concave probability measures. We discuss potential applicability of our
result to Metropolis-coupled Markov chain Monte Carlo schemes.
Published at http://dx.doi.org/10.1214/105051606000000772 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org).
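As an illustration of the idea, here is a minimal sketch of a Metropolis-Hastings chain on a two-mode target in which a local Gaussian proposal is occasionally replaced by a long-range jump. The proposal scales and the 10% jump probability are arbitrary choices for the demonstration, not values from the paper:

```python
import numpy as np

def log_target(x):
    # Bimodal target: equal mixture of N(-5, 1) and N(5, 1), unnormalized.
    return np.logaddexp(-0.5 * (x + 5.0) ** 2, -0.5 * (x - 5.0) ** 2)

def mh_chain(n_steps, p_long=0.0, seed=0):
    """Metropolis-Hastings with a local N(x, 0.5^2) proposal, mixed with an
    occasional long-range N(x, 10^2) proposal with probability p_long.
    Both components are symmetric, so the acceptance ratio is just the
    target ratio."""
    rng = np.random.default_rng(seed)
    x, out = 0.0, np.empty(n_steps)
    for t in range(n_steps):
        scale = 10.0 if rng.random() < p_long else 0.5
        prop = x + scale * rng.standard_normal()
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        out[t] = x
    return out

local = mh_chain(50_000, p_long=0.0)        # rarely crosses between modes
small_world = mh_chain(50_000, p_long=0.1)  # occasional jumps reach both modes
# Fraction of time in the right-hand mode; ~0.5 indicates good mixing.
print((local > 0).mean(), (small_world > 0).mean())
```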
Quasi-likelihood for Spatial Point Processes
Fitting regression models for intensity functions of spatial point processes
is of great interest in ecological and epidemiological studies of association
between spatially referenced events and geographical or environmental
covariates. When Cox or cluster process models are used to accommodate
clustering not accounted for by the available covariates, likelihood based
inference becomes computationally cumbersome due to the complicated nature of
the likelihood function and the associated score function. It is therefore of
interest to consider alternative, more easily computable estimating functions.
We derive the optimal estimating function in a class of first-order estimating
functions. The optimal estimating function depends on the solution of a certain
Fredholm integral equation which in practice is solved numerically. The
approximate solution is equivalent to a quasi-likelihood for binary spatial
data and we therefore use the term quasi-likelihood for our optimal estimating
function approach. We demonstrate in a simulation study and a data example that
our quasi-likelihood method for spatial point processes is both statistically
and computationally efficient.
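The abstract says the optimal estimating function requires numerically solving a Fredholm integral equation. Below is a generic sketch of the standard Nystrom (quadrature) approach for a second-kind equation phi(s) + int_a^b K(s,t) phi(t) dt = f(s); the kernel and right-hand side are toy choices, not the ones arising in the paper:

```python
import numpy as np

def nystrom_fredholm(kernel, f, a, b, n=200):
    """Solve phi(s) + int_a^b K(s,t) phi(t) dt = f(s) (Fredholm, 2nd kind)
    by Nystrom quadrature: discretize to (I + K*w) phi = f on a grid."""
    t, w = np.linspace(a, b, n, retstep=True)  # simple rectangle-rule weight
    K = kernel(t[:, None], t[None, :])
    phi = np.linalg.solve(np.eye(n) + K * w, f(t))
    return t, phi

# Toy kernel and right-hand side, purely illustrative.
t, phi = nystrom_fredholm(lambda s, u: np.exp(-np.abs(s - u)),
                          lambda s: np.ones_like(s), 0.0, 1.0)
print(phi[:5])
```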
Joint modeling of longitudinal drug using pattern and time to first relapse in cocaine dependence treatment data
An important endpoint variable in a cocaine rehabilitation study is the time
to first relapse of a patient after the treatment. We propose a joint modeling
approach based on functional data analysis to study the relationship between
the baseline longitudinal cocaine-use pattern and the interval censored time to
first relapse. For the baseline cocaine-use pattern, we consider both
self-reported cocaine-use amount trajectories and dichotomized use
trajectories. Variations within the generalized longitudinal trajectories are
modeled through a latent Gaussian process, which is characterized by a few
leading functional principal components. The association between the baseline
longitudinal trajectories and the time to first relapse is built upon the
latent principal component scores. The mean and the eigenfunctions of the
latent Gaussian process as well as the hazard function of time to first relapse
are modeled nonparametrically using penalized splines, and the parameters in
the joint model are estimated by a Monte Carlo EM algorithm based on
Metropolis-Hastings steps. An Akaike information criterion (AIC) based on
effective degrees of freedom is proposed to choose the tuning parameters, and a
modified empirical information is proposed to estimate the variance-covariance
matrix of the estimators.
Published at http://dx.doi.org/10.1214/15-AOAS852 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
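To make the latent-process ingredients concrete, here is a dense-grid sketch of extracting leading functional principal components and per-subject scores from trajectories observed on a common time grid. The paper instead uses penalized splines and a Monte Carlo EM algorithm with Metropolis-Hastings steps, so this shows only the underlying idea:

```python
import numpy as np

def leading_fpcs(Y, n_components=3):
    """Given trajectories Y (n_subjects x n_timepoints) on a common grid,
    return the mean curve, leading eigenfunctions of the covariance, and
    per-subject principal component scores."""
    mu = Y.mean(axis=0)
    C = np.cov(Y, rowvar=False)            # pointwise covariance estimate
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1][:n_components]
    phi = vecs[:, order]                   # eigenfunctions (columns)
    scores = (Y - mu) @ phi                # per-subject PC scores
    return mu, phi, scores

rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 50)
scores_true = rng.standard_normal((100, 1))
Y = np.sin(2 * np.pi * grid) + scores_true * np.cos(np.pi * grid) \
    + 0.1 * rng.standard_normal((100, 50))
mu, phi, scores = leading_fpcs(Y)
print(scores.shape)  # (100, 3); such scores then enter the hazard model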
Orthogonal series estimation of the pair correlation function of a spatial point process
The pair correlation function is a fundamental spatial point process
characteristic that, given the intensity function, determines second order
moments of the point process. Non-parametric estimation of the pair correlation
function is a typical initial step of a statistical analysis of a spatial point
pattern. Kernel estimators are popular but, especially for clustered point
patterns, suffer from bias at small spatial lags. In this paper we introduce a
new orthogonal series estimator. The new estimator is consistent and
asymptotically normal according to our theoretical and simulation results. Our
simulations further show that the new estimator can outperform the kernel
estimators, in particular for Poisson and clustered point processes.
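As a rough illustration of the orthogonal series idea, the sketch below estimates coefficients theta_k = int_0^R phi_k(r) g(r) dr from pairwise distances using a cosine basis, ignoring edge correction and coefficient smoothing (both of which a practical estimator needs). The construction is ours, not the paper's exact estimator:

```python
import numpy as np

def pcf_series(points, window_area, R=0.25, K=8):
    """Orthogonal-series sketch of the pair correlation function g(r):
    via the Campbell formula (no edge correction),
    E sum_{i != j} f(d_ij) ~ rho^2 |W| int_0^R f(r) g(r) 2 pi r dr,
    so f(r) = phi_k(r) / (2 pi r) yields an estimate of theta_k, and
    g_hat(r) = sum_k theta_k phi_k(r)."""
    n = len(points)
    rho = n / window_area                       # intensity estimate
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    d = d[np.triu_indices(n, k=1)]              # unordered pairs
    d = d[(d > 0) & (d <= R)]

    def phi(k, r):                              # cosine basis on [0, R]
        return (np.sqrt(1 / R) if k == 0
                else np.sqrt(2 / R) * np.cos(k * np.pi * r / R))

    # Factor 2 converts unordered pairs to the ordered-pair sum.
    theta = [2 * np.sum(phi(k, d) / (2 * np.pi * d)) / (rho**2 * window_area)
             for k in range(K)]
    return lambda r: sum(t * phi(k, r) for k, t in enumerate(theta))

rng = np.random.default_rng(2)
pts = rng.random((400, 2))                      # ~Poisson on the unit square
g = pcf_series(pts, window_area=1.0)
print(g(np.array([0.05, 0.10, 0.20])))          # should hover around 1
```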
Bayesian variable selection regression for genome-wide association studies and other large-scale problems
We consider applying Bayesian Variable Selection Regression, or BVSR, to
genome-wide association studies and similar large-scale regression problems.
Currently, typical genome-wide association studies measure hundreds of
thousands, or millions, of genetic variants (SNPs), in thousands or tens of
thousands of individuals, and attempt to identify regions harboring SNPs that
affect some phenotype or outcome of interest. This goal can naturally be cast
as a variable selection regression problem, with the SNPs as the covariates in
the regression. Characteristic features of genome-wide association studies
include the following: (i) a focus primarily on identifying relevant variables,
rather than on prediction; and (ii) many relevant covariates may have tiny
effects, making it effectively impossible to confidently identify the complete
"correct" subset of variables. Taken together, these factors put a premium on
having interpretable measures of confidence for individual covariates being
included in the model, which we argue is a strength of BVSR compared with
alternatives such as penalized regression methods. Here we focus primarily on
analysis of quantitative phenotypes, and on appropriate prior specification for
BVSR in this setting, emphasizing the idea of considering what the priors imply
about the total proportion of variance in outcome explained by relevant
covariates. We also emphasize the potential for BVSR to estimate this
proportion of variance explained, and hence shed light on the issue of "missing
heritability" in genome-wide association studies.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS455 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
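The prior-on-PVE idea can be made concrete by simulation: draw effect vectors from a sparse prior and record the proportion of variance explained that each draw implies. The spike-and-slab parametrization below (inclusion probability pi, effect standard deviation sigma_b) is a generic stand-in for the paper's prior:

```python
import numpy as np

def prior_pve_draws(X, pi, sigma_b, sigma_e=1.0, n_draws=2000, seed=0):
    """Simulate from a spike-and-slab prior (each covariate included with
    probability pi, effect ~ N(0, sigma_b^2)) and return the implied
    PVE = Var(X beta) / (Var(X beta) + sigma_e^2) for each draw."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pve = np.empty(n_draws)
    for s in range(n_draws):
        gamma = rng.random(p) < pi
        beta = np.where(gamma, sigma_b * rng.standard_normal(p), 0.0)
        v = np.var(X @ beta)
        pve[s] = v / (v + sigma_e**2)
    return pve

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 1000))        # stand-in for a genotype matrix
draws = prior_pve_draws(X, pi=0.01, sigma_b=0.1)
print(np.quantile(draws, [0.1, 0.5, 0.9]))  # what this prior implies about PVE
```

Inspecting these quantiles before fitting shows whether a candidate prior places mass on scientifically plausible PVE values, which is the specification strategy the abstract advocates.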
Practical Issues in Imputation-Based Association Mapping
Imputation-based association methods provide a powerful framework for testing untyped variants for association with phenotypes and for combining results from multiple studies that use different genotyping platforms. Here, we consider several issues that arise when applying these methods in practice, including: (i) factors affecting imputation accuracy, including choice of reference panel; (ii) the effects of imputation accuracy on power to detect associations; (iii) the relative merits of Bayesian and frequentist approaches to testing imputed genotypes for association with phenotype; and (iv) how to quickly and accurately compute Bayes factors for testing imputed SNPs. We find that imputation-based methods can be robust to imputation accuracy and can improve power to detect associations, even when average imputation accuracy is poor. We explain how ranking SNPs for association by a standard likelihood ratio test gives the same results as a Bayesian procedure that uses an unnatural prior assumption (specifically, that difficult-to-impute SNPs tend to have larger effects), and assess the power gained from using a Bayesian approach that does not make this assumption. Within the Bayesian framework, we find that good approximations to a full analysis can be achieved by simply replacing unknown genotypes with a point estimate: their posterior mean. This approximation considerably reduces computational expense compared with published sampling-based approaches, and the methods we present are practical on a genome-wide scale with very modest computational resources (e.g., a single desktop computer). The approximation also facilitates combining information across studies, using only summary data for each SNP. Methods discussed here are implemented in the software package BIMBAM, which is available from http://stephenslab.uchicago.edu/software.html.
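A sketch of the posterior-mean approximation: collapse the imputation probabilities for 0/1/2 allele copies to a mean dosage, regress the phenotype on it, and form an approximate Bayes factor from the resulting summary statistics. The BF below is a Wakefield-style approximation with prior effect variance W, used here as a stand-in for the BIMBAM Bayes factors the paper computes:

```python
import numpy as np

def mean_dosage(probs):
    """Posterior-mean genotype from imputation probabilities for 0/1/2
    copies: the point-estimate approximation the paper finds adequate."""
    return probs @ np.array([0.0, 1.0, 2.0])

def approx_bayes_factor(dosage, y, W=0.04):
    """Single-SNP approximate Bayes factor (alternative vs. null) from
    summary statistics: BF = sqrt(V/(V+W)) * exp(z^2 W / (2(V+W))),
    where V is the sampling variance of the effect estimate."""
    X = np.column_stack([np.ones_like(dosage), dosage])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    V = s2 * np.linalg.inv(X.T @ X)[1, 1]
    z2 = beta[1] ** 2 / V
    return np.sqrt(V / (V + W)) * np.exp(z2 * W / (2 * (V + W)))

rng = np.random.default_rng(4)
probs = rng.dirichlet([1, 1, 1], size=1000)  # toy imputation probabilities
g = mean_dosage(probs)
y = 0.2 * g + rng.standard_normal(1000)
print(approx_bayes_factor(g, y))
```

Because the BF depends on the data only through the effect estimate and its variance, it can be computed from per-SNP summary statistics alone, which is what makes combining information across studies cheap.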
Estimating daily nitrogen dioxide level: Exploring traffic effects
Data used to assess acute health effects from air pollution typically have
good temporal but poor spatial resolution, or vice versa. A modified
longitudinal model was developed that sought to improve resolution in both
domains by bringing together data from three sources to estimate daily levels
of nitrogen dioxide (NO2) at a geographic location. Monthly
measurements at 316 sites were made available by the Study of
Traffic, Air quality and Respiratory health (STAR). Four US Environmental
Protection Agency monitoring stations have hourly measurements of NO2.
Finally, the Connecticut Department of Transportation provides data on
traffic density on major roadways, a primary contributor to NO2
pollution. Inclusion of a traffic variable improved performance of the model,
and it provides a method for estimating exposure at points that do not have
direct measurements of the outcome. This approach can be used to estimate daily
variation in levels of NO2 over a region.
Published at http://dx.doi.org/10.1214/13-AOAS642 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
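A much-simplified stand-in for the structure described (not the paper's modified longitudinal model): regress site-level NO2 on local traffic density plus a regional daily term derived from the monitoring stations, then predict at unmonitored locations. All variable names and simulated magnitudes are illustrative:

```python
import numpy as np

def fit_no2_model(no2, traffic, regional):
    """Least-squares fit of site-level NO2 on local traffic density and a
    regional daily NO2 term; a toy stand-in for the paper's model."""
    X = np.column_stack([np.ones_like(traffic), traffic, regional])
    coef, *_ = np.linalg.lstsq(X, no2, rcond=None)
    return coef

def predict_no2(coef, traffic, regional):
    # Prediction at a location needs only its traffic density plus the
    # regional term, so no direct NO2 measurement is required there.
    return coef[0] + coef[1] * traffic + coef[2] * regional

rng = np.random.default_rng(5)
traffic = rng.gamma(2.0, 1.0, size=316)     # traffic density at 316 sites
regional = rng.normal(20.0, 3.0, size=316)  # regional NO2 from EPA stations
no2 = 5.0 + 2.0 * traffic + 0.8 * regional + rng.standard_normal(316)
coef = fit_no2_model(no2, traffic, regional)
print(predict_no2(coef, traffic=3.0, regional=22.0))
```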