
    Fast model-fitting of Bayesian variable selection regression using the iterative complex factorization algorithm

    Bayesian variable selection regression (BVSR) is able to jointly analyze genome-wide genetic datasets, but slow computation via Markov chain Monte Carlo (MCMC) has hampered its widespread use. Here we present a novel iterative method to solve a special class of linear systems, which can increase the speed of BVSR model-fitting tenfold. The iterative method hinges on the complex factorization of the sum of two matrices, and the solution path resides in the complex domain (instead of the real domain). Compared to the Gauss-Seidel method, the complex factorization converges almost instantaneously and its error is several orders of magnitude smaller. More importantly, its error always stays within the pre-specified precision, while the Gauss-Seidel method's does not. For large problems with thousands of covariates, the complex factorization is 10 -- 100 times faster than either the Gauss-Seidel method or the direct method via the Cholesky decomposition. In BVSR, one needs to repeatedly solve large penalized regression systems whose design matrices change only slightly between adjacent MCMC steps. This slight change in the design matrix enables the adaptation of the iterative complex factorization method. The computational innovation will facilitate the widespread use of BVSR in reanalyzing genome-wide association datasets. Comment: Accepted version.
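The penalized systems at the heart of the method are standard. As a point of reference, here is a minimal sketch (with hypothetical data, not the authors' complex-factorization algorithm) of the two baselines the abstract compares against: a direct solve via the Cholesky decomposition and Gauss-Seidel iteration for a system (X'X + lam*I) b = X'y.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 50, 10
X = rng.standard_normal((n, p))   # stand-in design matrix
y = rng.standard_normal(n)
lam = 1.0

# Penalized regression system (X'X + lam*I) b = X'y, solved at each MCMC step.
A = X.T @ X + lam * np.eye(p)
rhs = X.T @ y

# Direct method: Cholesky factorization A = L L', then two triangular solves.
L = np.linalg.cholesky(A)
b_chol = np.linalg.solve(L.T, np.linalg.solve(L, rhs))

def gauss_seidel(A, rhs, tol=1e-10, max_iter=10_000):
    """Gauss-Seidel iteration: sweep through coordinates, using the
    freshly updated entries within each sweep, until successive
    iterates change by less than tol."""
    b = np.zeros_like(rhs)
    for _ in range(max_iter):
        b_old = b.copy()
        for i in range(len(rhs)):
            b[i] = (rhs[i] - A[i, :i] @ b[:i] - A[i, i + 1:] @ b[i + 1:]) / A[i, i]
        if np.max(np.abs(b - b_old)) < tol:
            break
    return b

b_gs = gauss_seidel(A, rhs)
```

Since A is symmetric positive definite, Gauss-Seidel is guaranteed to converge here; the abstract's point is that it does so much more slowly, and less accurately, than the proposed complex factorization.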

    Small-world MCMC and convergence to multi-modal distributions: From slow mixing to fast mixing

    We compare convergence rates of Metropolis--Hastings chains to multi-modal target distributions when the proposal distributions can be of ``local'' and ``small world'' type. In particular, we show that by adding occasional long-range jumps to a given local proposal distribution, one can turn a chain that is ``slowly mixing'' (in the complexity of the problem) into a chain that is ``rapidly mixing.'' To do this, we obtain spectral gap estimates via a new state decomposition theorem and apply an isoperimetric inequality for log-concave probability measures. We discuss potential applicability of our result to Metropolis-coupled Markov chain Monte Carlo schemes. Comment: Published at http://dx.doi.org/10.1214/105051606000000772 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org).
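The ``small world'' construction can be made concrete. Below is a minimal Metropolis-Hastings sampler on a hypothetical bimodal target, mixing a local random-walk proposal with occasional long-range jumps; the target, proposal scales, and jump probability are illustrative assumptions, not the paper's setting.

```python
import numpy as np

rng = np.random.default_rng(1)

def log_pi(x):
    # Unnormalized bimodal target: mixture of Gaussians at -5 and +5.
    return np.logaddexp(-0.5 * (x + 5.0) ** 2, -0.5 * (x - 5.0) ** 2)

def mh(n_steps, p_long=0.1):
    """Metropolis-Hastings with a 'small world' proposal: a local random
    walk most of the time, plus a long-range jump with probability p_long.
    The mixture proposal is symmetric, so the usual MH ratio applies."""
    x, chain = 0.0, []
    for _ in range(n_steps):
        if rng.random() < p_long:
            y = x + rng.normal(scale=20.0)   # occasional long-range jump
        else:
            y = x + rng.normal(scale=0.5)    # local move
        if np.log(rng.random()) < log_pi(y) - log_pi(x):
            x = y
        chain.append(x)
    return np.array(chain)

chain = mh(20_000)
# With long jumps the chain visits both modes; a purely local chain
# (p_long=0) typically stays stuck near one mode over the same horizon.
frac_right = np.mean(chain > 0)
```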

    Quasi-likelihood for Spatial Point Processes

    Fitting regression models for intensity functions of spatial point processes is of great interest in ecological and epidemiological studies of association between spatially referenced events and geographical or environmental covariates. When Cox or cluster process models are used to accommodate clustering not accounted for by the available covariates, likelihood-based inference becomes computationally cumbersome due to the complicated nature of the likelihood function and the associated score function. It is therefore of interest to consider alternative, more easily computable estimating functions. We derive the optimal estimating function in a class of first-order estimating functions. The optimal estimating function depends on the solution of a certain Fredholm integral equation, which in practice is solved numerically. The approximate solution is equivalent to a quasi-likelihood for binary spatial data, and we therefore use the term quasi-likelihood for our optimal estimating function approach. We demonstrate in a simulation study and a data example that our quasi-likelihood method for spatial point processes is both statistically and computationally efficient.
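A Fredholm integral equation of the second kind is typically solved numerically by quadrature (the Nystrom method). The sketch below uses a toy kernel and right-hand side chosen so that the exact solution is phi(s) = s, purely to illustrate the numerical step; the paper's actual kernel comes from its estimating-function theory.

```python
import numpy as np

# Fredholm equation of the second kind on [0, 1]:
#   phi(s) - integral_0^1 K(s, t) phi(t) dt = f(s).
# Toy choices with known solution phi(s) = s (illustrative only):
K = lambda s, t: s * t
f = lambda s: (2.0 / 3.0) * s

# Gauss-Legendre quadrature nodes/weights mapped from [-1, 1] to [0, 1].
nodes, weights = np.polynomial.legendre.leggauss(20)
t = 0.5 * (nodes + 1.0)
w = 0.5 * weights

# Nystrom discretization: (I - K W) phi = f at the quadrature nodes.
A = np.eye(len(t)) - K(t[:, None], t[None, :]) * w[None, :]
phi = np.linalg.solve(A, f(t))
```

For this separable kernel the 20-point quadrature is exact, so the computed phi matches the true solution at the nodes to machine precision.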

    Joint modeling of longitudinal drug-use pattern and time to first relapse in cocaine dependence treatment data

    An important endpoint variable in a cocaine rehabilitation study is the time to first relapse of a patient after the treatment. We propose a joint modeling approach based on functional data analysis to study the relationship between the baseline longitudinal cocaine-use pattern and the interval censored time to first relapse. For the baseline cocaine-use pattern, we consider both self-reported cocaine-use amount trajectories and dichotomized use trajectories. Variations within the generalized longitudinal trajectories are modeled through a latent Gaussian process, which is characterized by a few leading functional principal components. The association between the baseline longitudinal trajectories and the time to first relapse is built upon the latent principal component scores. The mean and the eigenfunctions of the latent Gaussian process as well as the hazard function of time to first relapse are modeled nonparametrically using penalized splines, and the parameters in the joint model are estimated by a Monte Carlo EM algorithm based on Metropolis-Hastings steps. An Akaike information criterion (AIC) based on effective degrees of freedom is proposed to choose the tuning parameters, and a modified empirical information is proposed to estimate the variance-covariance matrix of the estimators. Comment: Published at http://dx.doi.org/10.1214/15-AOAS852 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
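The functional-principal-component step can be illustrated on simulated trajectories. This is a plain sample-covariance FPCA on a common time grid, a simplified stand-in for the paper's latent-Gaussian-process model fitted by Monte Carlo EM; the data and basis below are invented for illustration.

```python
import numpy as np

rng = np.random.default_rng(2)

# Simulated longitudinal trajectories on a common time grid
# (a hypothetical stand-in for baseline use curves).
n, T = 100, 30
tgrid = np.linspace(0, 1, T)
scores_true = rng.standard_normal((n, 2)) * np.array([2.0, 1.0])
basis = np.stack([np.sin(np.pi * tgrid), np.cos(np.pi * tgrid)])
Y = scores_true @ basis + 0.1 * rng.standard_normal((n, T))

# Functional PCA via eigendecomposition of the sample covariance.
mu = Y.mean(axis=0)
C = np.cov(Y - mu, rowvar=False)
evals, evecs = np.linalg.eigh(C)
evals, evecs = evals[::-1], evecs[:, ::-1]      # sort descending

# Leading principal component scores -- the quantities a joint model
# would carry into the hazard of time to first relapse.
k = 2
pc_scores = (Y - mu) @ evecs[:, :k]
frac_var = evals[:k].sum() / evals.sum()
```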

    Orthogonal series estimation of the pair correlation function of a spatial point process

    The pair correlation function is a fundamental spatial point process characteristic that, given the intensity function, determines second order moments of the point process. Non-parametric estimation of the pair correlation function is a typical initial step of a statistical analysis of a spatial point pattern. Kernel estimators are popular but, especially for clustered point patterns, suffer from bias at small spatial lags. In this paper we introduce a new orthogonal series estimator. The new estimator is consistent and asymptotically normal according to our theoretical and simulation results. Our simulations further show that the new estimator can outperform the kernel estimators, in particular for Poisson and clustered point processes.
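To illustrate the orthogonal series idea in a simpler setting, here is series estimation of an ordinary density on [0, 1] with a cosine basis: estimate each basis coefficient by a sample average, then truncate the series. The paper's estimator targets the pair correlation function of a point process, so this is an analogy, not the proposed method; the data and truncation level are illustrative.

```python
import numpy as np

rng = np.random.default_rng(3)
x = rng.beta(2.0, 5.0, size=5000)          # hypothetical sample on [0, 1]

def cosine_basis(k, t):
    # Orthonormal cosine basis on [0, 1].
    return np.ones_like(t) if k == 0 else np.sqrt(2.0) * np.cos(k * np.pi * t)

# Estimated series coefficients: c_k = E[phi_k(X)] ~ sample mean.
K = 10                                      # truncation (a tuning parameter)
coefs = np.array([cosine_basis(k, x).mean() for k in range(K)])

# Orthogonal series density estimate on a grid.
tgrid = np.linspace(0, 1, 200)
fhat = sum(c * cosine_basis(k, tgrid) for k, c in enumerate(coefs))
```

The truncation level K plays the role the bandwidth plays for a kernel estimator: too small oversmooths, too large lets sampling noise in the high-order coefficients through.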

    Bayesian variable selection regression for genome-wide association studies and other large-scale problems

    We consider applying Bayesian Variable Selection Regression, or BVSR, to genome-wide association studies and similar large-scale regression problems. Currently, typical genome-wide association studies measure hundreds of thousands, or millions, of genetic variants (SNPs), in thousands or tens of thousands of individuals, and attempt to identify regions harboring SNPs that affect some phenotype or outcome of interest. This goal can naturally be cast as a variable selection regression problem, with the SNPs as the covariates in the regression. Characteristic features of genome-wide association studies include the following: (i) a focus primarily on identifying relevant variables, rather than on prediction; and (ii) many relevant covariates may have tiny effects, making it effectively impossible to confidently identify the complete "correct" subset of variables. Taken together, these factors put a premium on having interpretable measures of confidence for individual covariates being included in the model, which we argue is a strength of BVSR compared with alternatives such as penalized regression methods. Here we focus primarily on analysis of quantitative phenotypes, and on appropriate prior specification for BVSR in this setting, emphasizing the idea of considering what the priors imply about the total proportion of variance in outcome explained by relevant covariates. We also emphasize the potential for BVSR to estimate this proportion of variance explained, and hence shed light on the issue of "missing heritability" in genome-wide association studies. Comment: Published at http://dx.doi.org/10.1214/11-AOAS455 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
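The prior-implies-PVE idea can be illustrated by simulation: draw effects from a hypothetical spike-and-slab prior and compute the implied proportion of variance explained (PVE). All parameter values and the simulated genotype matrix below are illustrative assumptions, not the paper's recommended prior.

```python
import numpy as np

rng = np.random.default_rng(4)

n, p = 1000, 5000
pi_incl = 0.01      # prior inclusion probability (assumed)
sigma_b = 0.1       # slab s.d. of nonzero effects (assumed)
sigma_e = 1.0       # residual s.d.

X = rng.standard_normal((n, p))            # stand-in for centered genotypes

# One draw from a spike-and-slab prior on the effect sizes.
gamma = rng.random(p) < pi_incl            # which covariates are "in"
beta = np.where(gamma, rng.normal(0.0, sigma_b, p), 0.0)

# Implied proportion of variance explained by the relevant covariates.
g = X @ beta
pve = g.var() / (g.var() + sigma_e ** 2)
```

Repeating this draw many times traces out the prior distribution of PVE, which is the quantity the abstract suggests examining when specifying priors.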

    Estimating daily nitrogen dioxide level: Exploring traffic effects

    Data used to assess acute health effects from air pollution typically have good temporal but poor spatial resolution, or the opposite. A modified longitudinal model was developed that sought to improve resolution in both domains by bringing together data from three sources to estimate daily levels of nitrogen dioxide (NO2) at a geographic location. Monthly NO2 measurements at 316 sites were made available by the Study of Traffic, Air quality and Respiratory health (STAR). Four US Environmental Protection Agency monitoring stations have hourly measurements of NO2. Finally, the Connecticut Department of Transportation provides data on traffic density on major roadways, a primary contributor to NO2 pollution. Inclusion of a traffic variable improved performance of the model, and it provides a method for estimating exposure at points that do not have direct measurements of the outcome. This approach can be used to estimate daily variation in levels of NO2 over a region. Comment: Published at http://dx.doi.org/10.1214/13-AOAS642 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
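The value of a traffic covariate can be sketched with a synthetic regression: compare in-sample fit with and without traffic in a toy daily NO2 model. All variables, coefficients, and the data-generating process below are hypothetical, not the study's fitted model.

```python
import numpy as np

rng = np.random.default_rng(5)

# Synthetic daily data (hypothetical): NO2 depends on traffic density
# and temperature, plus noise.
n = 365
temp = rng.normal(15.0, 8.0, n)            # daily temperature
traffic = rng.gamma(5.0, 2.0, n)           # traffic density on nearby roads
no2 = 10.0 + 0.8 * traffic - 0.2 * temp + rng.normal(0.0, 2.0, n)

def r2(X, y):
    """In-sample R^2 of an OLS fit with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    resid = y - X1 @ beta
    return 1.0 - resid.var() / y.var()

r2_base = r2(temp[:, None], no2)                            # no traffic
r2_traffic = r2(np.column_stack([temp, traffic]), no2)      # with traffic
```

In this toy setup the traffic term accounts for most of the explainable variance, mirroring the abstract's finding that the traffic variable improved model performance.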