Fast model-fitting of Bayesian variable selection regression using the iterative complex factorization algorithm
Bayesian variable selection regression (BVSR) can jointly analyze
genome-wide genetic datasets, but slow computation via Markov chain Monte
Carlo (MCMC) has hampered its widespread use. Here we present a novel iterative
method to solve a special class of linear systems, which can increase the speed
of the BVSR model-fitting tenfold. The iterative method hinges on the complex
factorization of the sum of two matrices and the solution path resides in the
complex domain (instead of the real domain). Compared to the Gauss-Seidel
method, the complex factorization converges almost instantaneously, and its
error is several orders of magnitude smaller than that of the Gauss-Seidel
method. More importantly, its error always stays within the pre-specified
precision, whereas the Gauss-Seidel method's does not. For large problems with
thousands of covariates, the complex factorization is 10-100 times faster than
either the Gauss-Seidel method or the direct method via the Cholesky
decomposition. In BVSR, one must repeatedly solve large penalized regression systems whose
design matrices only change slightly between adjacent MCMC steps. This slight
change in design matrix enables the adaptation of the iterative complex
factorization method. The computational innovation will facilitate the
widespread use of BVSR in reanalyzing genome-wide association datasets.
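The abstract does not spell out the iterative complex factorization itself, but its two comparison points are standard. Below is a minimal sketch of those baselines, assuming the penalized systems have the usual ridge-type form (X^T X + D) beta = X^T y with D a diagonal penalty; all names (X, d, y) are generic:

```python
import numpy as np

def solve_cholesky(X, d, y):
    """Direct solve of (X^T X + D) beta = X^T y via Cholesky, D = diag(d)."""
    A = X.T @ X + np.diag(d)
    L = np.linalg.cholesky(A)
    # Forward then backward substitution.
    z = np.linalg.solve(L, X.T @ y)
    return np.linalg.solve(L.T, z)

def solve_gauss_seidel(X, d, y, tol=1e-8, max_iter=10_000):
    """Gauss-Seidel iteration on the same system; A is SPD here, so the
    iteration converges, but its error floor is what the abstract criticizes."""
    A = X.T @ X + np.diag(d)
    b = X.T @ y
    p = len(b)
    beta = np.zeros(p)
    for _ in range(max_iter):
        beta_old = beta.copy()
        for j in range(p):
            # Use already-updated coordinates (<j) and old ones (>j).
            r = b[j] - A[j, :j] @ beta[:j] - A[j, j + 1:] @ beta_old[j + 1:]
            beta[j] = r / A[j, j]
        if np.max(np.abs(beta - beta_old)) < tol:
            break
    return beta

rng = np.random.default_rng(0)
X = rng.standard_normal((200, 50))
y = rng.standard_normal(200)
d = np.full(50, 10.0)  # ridge-type penalty
print(np.max(np.abs(solve_cholesky(X, d, y) - solve_gauss_seidel(X, d, y))))
```

Because adjacent MCMC steps change the design matrix only slightly, an iterative solver can warm-start from the previous solution, which is what makes the iterative approach attractive over a fresh Cholesky factorization at every step.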
Small-world MCMC and convergence to multi-modal distributions: From slow mixing to fast mixing
We compare convergence rates of Metropolis--Hastings chains to multi-modal
target distributions when the proposal distributions can be of ``local'' and
``small world'' type. In particular, we show that by adding occasional
long-range jumps to a given local proposal distribution, one can turn a chain
that is ``slowly mixing'' (in the complexity of the problem) into a chain that
is ``rapidly mixing.'' To do this, we obtain spectral gap estimates via a new
state decomposition theorem and apply an isoperimetric inequality for
log-concave probability measures. We discuss potential applicability of our
result to Metropolis-coupled Markov chain Monte Carlo schemes.
Published at http://dx.doi.org/10.1214/105051606000000772 in the Annals of Applied Probability (http://www.imstat.org/aap/) by the Institute of Mathematical Statistics (http://www.imstat.org).
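As an illustration of the idea, here is a minimal sketch of a Metropolis-Hastings chain on a two-mode target in which a local Gaussian proposal is occasionally replaced by a long-range jump. The proposal scales and the 10% jump probability are arbitrary choices for the demonstration, not values from the paper:

```python
import numpy as np

def log_target(x):
    # Bimodal target: equal mixture of N(-5, 1) and N(5, 1), unnormalized.
    return np.logaddexp(-0.5 * (x + 5.0) ** 2, -0.5 * (x - 5.0) ** 2)

def mh_chain(n_steps, p_long=0.0, seed=0):
    """Metropolis-Hastings with a local N(x, 0.5^2) proposal, mixed with an
    occasional long-range N(x, 10^2) proposal with probability p_long.
    Both components are symmetric, so the acceptance ratio is just the
    target ratio."""
    rng = np.random.default_rng(seed)
    x, out = 0.0, np.empty(n_steps)
    for t in range(n_steps):
        scale = 10.0 if rng.random() < p_long else 0.5
        prop = x + scale * rng.standard_normal()
        if np.log(rng.random()) < log_target(prop) - log_target(x):
            x = prop
        out[t] = x
    return out

local = mh_chain(50_000, p_long=0.0)        # rarely crosses between modes
small_world = mh_chain(50_000, p_long=0.1)  # occasional jumps reach both modes
# Fraction of time in the right-hand mode; ~0.5 indicates good mixing.
print((local > 0).mean(), (small_world > 0).mean())
```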
Quasi-likelihood for Spatial Point Processes
Fitting regression models for intensity functions of spatial point processes
is of great interest in ecological and epidemiological studies of association
between spatially referenced events and geographical or environmental
covariates. When Cox or cluster process models are used to accommodate
clustering not accounted for by the available covariates, likelihood based
inference becomes computationally cumbersome due to the complicated nature of
the likelihood function and the associated score function. It is therefore of
interest to consider alternative, more easily computable estimating functions.
We derive the optimal estimating function in a class of first-order estimating
functions. The optimal estimating function depends on the solution of a certain
Fredholm integral equation which in practice is solved numerically. The
approximate solution is equivalent to a quasi-likelihood for binary spatial
data and we therefore use the term quasi-likelihood for our optimal estimating
function approach. We demonstrate in a simulation study and a data example that
our quasi-likelihood method for spatial point processes is both statistically
and computationally efficient.
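The abstract says the optimal estimating function requires numerically solving a Fredholm integral equation. Below is a generic sketch of the standard Nystrom (quadrature) approach for a second-kind equation phi(s) + int_a^b K(s,t) phi(t) dt = f(s); the kernel and right-hand side are toy choices, not the ones arising in the paper:

```python
import numpy as np

def nystrom_fredholm(kernel, f, a, b, n=200):
    """Solve phi(s) + int_a^b K(s,t) phi(t) dt = f(s) (Fredholm, 2nd kind)
    by Nystrom quadrature: discretize to (I + K*w) phi = f on a grid."""
    t, w = np.linspace(a, b, n, retstep=True)  # simple rectangle-rule weight
    K = kernel(t[:, None], t[None, :])
    phi = np.linalg.solve(np.eye(n) + K * w, f(t))
    return t, phi

# Toy kernel and right-hand side, purely illustrative.
t, phi = nystrom_fredholm(lambda s, u: np.exp(-np.abs(s - u)),
                          lambda s: np.ones_like(s), 0.0, 1.0)
print(phi[:5])
```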
Joint modeling of longitudinal drug using pattern and time to first relapse in cocaine dependence treatment data
An important endpoint variable in a cocaine rehabilitation study is the time
to first relapse of a patient after the treatment. We propose a joint modeling
approach based on functional data analysis to study the relationship between
the baseline longitudinal cocaine-use pattern and the interval censored time to
first relapse. For the baseline cocaine-use pattern, we consider both
self-reported cocaine-use amount trajectories and dichotomized use
trajectories. Variations within the generalized longitudinal trajectories are
modeled through a latent Gaussian process, which is characterized by a few
leading functional principal components. The association between the baseline
longitudinal trajectories and the time to first relapse is built upon the
latent principal component scores. The mean and the eigenfunctions of the
latent Gaussian process as well as the hazard function of time to first relapse
are modeled nonparametrically using penalized splines, and the parameters in
the joint model are estimated by a Monte Carlo EM algorithm based on
Metropolis-Hastings steps. An Akaike information criterion (AIC) based on
effective degrees of freedom is proposed to choose the tuning parameters, and a
modified empirical information is proposed to estimate the variance-covariance
matrix of the estimators.
Published at http://dx.doi.org/10.1214/15-AOAS852 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
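To make the latent-process ingredients concrete, here is a dense-grid sketch of extracting leading functional principal components and per-subject scores from trajectories observed on a common time grid. The paper instead uses penalized splines and a Monte Carlo EM algorithm with Metropolis-Hastings steps, so this shows only the underlying idea:

```python
import numpy as np

def leading_fpcs(Y, n_components=3):
    """Given trajectories Y (n_subjects x n_timepoints) on a common grid,
    return the mean curve, leading eigenfunctions of the covariance, and
    per-subject principal component scores."""
    mu = Y.mean(axis=0)
    C = np.cov(Y, rowvar=False)            # pointwise covariance estimate
    vals, vecs = np.linalg.eigh(C)
    order = np.argsort(vals)[::-1][:n_components]
    phi = vecs[:, order]                   # eigenfunctions (columns)
    scores = (Y - mu) @ phi                # per-subject PC scores
    return mu, phi, scores

rng = np.random.default_rng(1)
grid = np.linspace(0, 1, 50)
scores_true = rng.standard_normal((100, 1))
Y = np.sin(2 * np.pi * grid) + scores_true * np.cos(np.pi * grid) \
    + 0.1 * rng.standard_normal((100, 50))
mu, phi, scores = leading_fpcs(Y)
print(scores.shape)  # (100, 3); such scores then enter the hazard model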
Orthogonal series estimation of the pair correlation function of a spatial point process
The pair correlation function is a fundamental spatial point process
characteristic that, given the intensity function, determines second order
moments of the point process. Non-parametric estimation of the pair correlation
function is a typical initial step of a statistical analysis of a spatial point
pattern. Kernel estimators are popular but, especially for clustered point
patterns, suffer from bias at small spatial lags. In this paper we introduce a
new orthogonal series estimator. The new estimator is consistent and
asymptotically normal according to our theoretical and simulation results. Our
simulations further show that the new estimator can outperform the kernel
estimators, in particular for Poisson and clustered point processes.
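As a rough illustration of the orthogonal series idea, the sketch below estimates coefficients theta_k = int_0^R phi_k(r) g(r) dr from pairwise distances using a cosine basis, ignoring edge correction and coefficient smoothing (both of which a practical estimator needs). The construction is ours, not the paper's exact estimator:

```python
import numpy as np

def pcf_series(points, window_area, R=0.25, K=8):
    """Orthogonal-series sketch of the pair correlation function g(r):
    via the Campbell formula (no edge correction),
    E sum_{i != j} f(d_ij) ~ rho^2 |W| int_0^R f(r) g(r) 2 pi r dr,
    so f(r) = phi_k(r) / (2 pi r) yields an estimate of theta_k, and
    g_hat(r) = sum_k theta_k phi_k(r)."""
    n = len(points)
    rho = n / window_area                       # intensity estimate
    d = np.linalg.norm(points[:, None] - points[None, :], axis=-1)
    d = d[np.triu_indices(n, k=1)]              # unordered pairs
    d = d[(d > 0) & (d <= R)]

    def phi(k, r):                              # cosine basis on [0, R]
        return (np.sqrt(1 / R) if k == 0
                else np.sqrt(2 / R) * np.cos(k * np.pi * r / R))

    # Factor 2 converts unordered pairs to the ordered-pair sum.
    theta = [2 * np.sum(phi(k, d) / (2 * np.pi * d)) / (rho**2 * window_area)
             for k in range(K)]
    return lambda r: sum(t * phi(k, r) for k, t in enumerate(theta))

rng = np.random.default_rng(2)
pts = rng.random((400, 2))                      # ~Poisson on the unit square
g = pcf_series(pts, window_area=1.0)
print(g(np.array([0.05, 0.10, 0.20])))          # should hover around 1
```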
Bayesian variable selection regression for genome-wide association studies and other large-scale problems
We consider applying Bayesian Variable Selection Regression, or BVSR, to
genome-wide association studies and similar large-scale regression problems.
Currently, typical genome-wide association studies measure hundreds of
thousands, or millions, of genetic variants (SNPs), in thousands or tens of
thousands of individuals, and attempt to identify regions harboring SNPs that
affect some phenotype or outcome of interest. This goal can naturally be cast
as a variable selection regression problem, with the SNPs as the covariates in
the regression. Characteristic features of genome-wide association studies
include the following: (i) a focus primarily on identifying relevant variables,
rather than on prediction; and (ii) many relevant covariates may have tiny
effects, making it effectively impossible to confidently identify the complete
"correct" subset of variables. Taken together, these factors put a premium on
having interpretable measures of confidence for individual covariates being
included in the model, which we argue is a strength of BVSR compared with
alternatives such as penalized regression methods. Here we focus primarily on
analysis of quantitative phenotypes, and on appropriate prior specification for
BVSR in this setting, emphasizing the idea of considering what the priors imply
about the total proportion of variance in outcome explained by relevant
covariates. We also emphasize the potential for BVSR to estimate this
proportion of variance explained, and hence shed light on the issue of "missing
heritability" in genome-wide association studies.Comment: Published in at http://dx.doi.org/10.1214/11-AOAS455 the Annals of
Applied Statistics (http://www.imstat.org/aoas/) by the Institute of
Mathematical Statistics (http://www.imstat.org
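The prior-on-PVE idea can be made concrete by simulation: draw effect vectors from a sparse prior and record the proportion of variance explained that each draw implies. The spike-and-slab parametrization below (inclusion probability pi, effect standard deviation sigma_b) is a generic stand-in for the paper's prior:

```python
import numpy as np

def prior_pve_draws(X, pi, sigma_b, sigma_e=1.0, n_draws=2000, seed=0):
    """Simulate from a spike-and-slab prior (each covariate included with
    probability pi, effect ~ N(0, sigma_b^2)) and return the implied
    PVE = Var(X beta) / (Var(X beta) + sigma_e^2) for each draw."""
    rng = np.random.default_rng(seed)
    p = X.shape[1]
    pve = np.empty(n_draws)
    for s in range(n_draws):
        gamma = rng.random(p) < pi
        beta = np.where(gamma, sigma_b * rng.standard_normal(p), 0.0)
        v = np.var(X @ beta)
        pve[s] = v / (v + sigma_e**2)
    return pve

rng = np.random.default_rng(3)
X = rng.standard_normal((500, 1000))        # stand-in for a genotype matrix
draws = prior_pve_draws(X, pi=0.01, sigma_b=0.1)
print(np.quantile(draws, [0.1, 0.5, 0.9]))  # what this prior implies about PVE
```

Inspecting these quantiles before fitting shows whether a candidate prior places mass on scientifically plausible PVE values, which is the specification strategy the abstract advocates.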
Practical Issues in Imputation-Based Association Mapping
Imputation-based association methods provide a powerful framework for testing untyped variants for association with phenotypes and for combining results from multiple studies that use different genotyping platforms. Here, we consider several issues that arise when applying these methods in practice, including: (i) factors affecting imputation accuracy, including choice of reference panel; (ii) the effects of imputation accuracy on power to detect associations; (iii) the relative merits of Bayesian and frequentist approaches to testing imputed genotypes for association with phenotype; and (iv) how to quickly and accurately compute Bayes factors for testing imputed SNPs. We find that imputation-based methods can be robust to imputation accuracy and can improve power to detect associations, even when average imputation accuracy is poor. We explain how ranking SNPs for association by a standard likelihood ratio test gives the same results as a Bayesian procedure that uses an unnatural prior assumption (specifically, that difficult-to-impute SNPs tend to have larger effects), and assess the power gained from using a Bayesian approach that does not make this assumption. Within the Bayesian framework, we find that good approximations to a full analysis can be achieved by simply replacing unknown genotypes with a point estimate: their posterior mean. This approximation considerably reduces computational expense compared with published sampling-based approaches, and the methods we present are practical on a genome-wide scale with very modest computational resources (e.g., a single desktop computer). The approximation also facilitates combining information across studies, using only summary data for each SNP. Methods discussed here are implemented in the software package BIMBAM, which is available from http://stephenslab.uchicago.edu/software.html.
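A sketch of the posterior-mean approximation: collapse the imputation probabilities for 0/1/2 allele copies to a mean dosage, regress the phenotype on it, and form an approximate Bayes factor from the resulting summary statistics. The BF below is a Wakefield-style approximation with prior effect variance W, used here as a stand-in for the BIMBAM Bayes factors the paper computes:

```python
import numpy as np

def mean_dosage(probs):
    """Posterior-mean genotype from imputation probabilities for 0/1/2
    copies: the point-estimate approximation the paper finds adequate."""
    return probs @ np.array([0.0, 1.0, 2.0])

def approx_bayes_factor(dosage, y, W=0.04):
    """Single-SNP approximate Bayes factor (alternative vs. null) from
    summary statistics: BF = sqrt(V/(V+W)) * exp(z^2 W / (2(V+W))),
    where V is the sampling variance of the effect estimate."""
    X = np.column_stack([np.ones_like(dosage), dosage])
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    s2 = resid @ resid / (len(y) - 2)
    V = s2 * np.linalg.inv(X.T @ X)[1, 1]
    z2 = beta[1] ** 2 / V
    return np.sqrt(V / (V + W)) * np.exp(z2 * W / (2 * (V + W)))

rng = np.random.default_rng(4)
probs = rng.dirichlet([1, 1, 1], size=1000)  # toy imputation probabilities
g = mean_dosage(probs)
y = 0.2 * g + rng.standard_normal(1000)
print(approx_bayes_factor(g, y))
```

Because the BF depends on the data only through the effect estimate and its variance, it can be computed from per-SNP summary statistics alone, which is what makes combining information across studies cheap.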
Estimating daily nitrogen dioxide level: Exploring traffic effects
Data used to assess acute health effects from air pollution typically have
good temporal but poor spatial resolution, or vice versa. A modified
longitudinal model was developed that sought to improve resolution in both
domains by bringing together data from three sources to estimate daily levels
of nitrogen dioxide (NO2) at a geographic location. Monthly
measurements at 316 sites were made available by the Study of
Traffic, Air quality and Respiratory health (STAR). Four US Environmental
Protection Agency monitoring stations have hourly measurements of NO2.
Finally, the Connecticut Department of Transportation provides data on
traffic density on major roadways, a primary contributor to NO2
pollution. Inclusion of a traffic variable improved performance of the model,
and it provides a method for estimating exposure at points that do not have
direct measurements of the outcome. This approach can be used to estimate daily
variation in levels of NO2 over a region.
Published at http://dx.doi.org/10.1214/13-AOAS642 in the Annals of Applied Statistics (http://www.imstat.org/aoas/) by the Institute of Mathematical Statistics (http://www.imstat.org).
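A much-simplified stand-in for the structure described (not the paper's modified longitudinal model): regress site-level NO2 on local traffic density plus a regional daily term derived from the monitoring stations, then predict at unmonitored locations. All variable names and simulated magnitudes are illustrative:

```python
import numpy as np

def fit_no2_model(no2, traffic, regional):
    """Least-squares fit of site-level NO2 on local traffic density and a
    regional daily NO2 term; a toy stand-in for the paper's model."""
    X = np.column_stack([np.ones_like(traffic), traffic, regional])
    coef, *_ = np.linalg.lstsq(X, no2, rcond=None)
    return coef

def predict_no2(coef, traffic, regional):
    # Prediction at a location needs only its traffic density plus the
    # regional term, so no direct NO2 measurement is required there.
    return coef[0] + coef[1] * traffic + coef[2] * regional

rng = np.random.default_rng(5)
traffic = rng.gamma(2.0, 1.0, size=316)     # traffic density at 316 sites
regional = rng.normal(20.0, 3.0, size=316)  # regional NO2 from EPA stations
no2 = 5.0 + 2.0 * traffic + 0.8 * regional + rng.standard_normal(316)
coef = fit_no2_model(no2, traffic, regional)
print(predict_no2(coef, traffic=3.0, regional=22.0))
```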