Bayesian Additive Regression Trees With Parametric Models of Heteroskedasticity
We incorporate heteroskedasticity into Bayesian Additive Regression Trees
(BART) by modeling the log of the error variance parameter as a linear function
of prespecified covariates. Under this scheme, the Gibbs sampling procedure for
the original sum-of-trees model is easily modified, and the parameters for the
variance model are updated via a Metropolis-Hastings step. We demonstrate the
promise of our approach by showing that it yields more appropriate posterior
predictive intervals than homoskedastic BART in heteroskedastic settings, and
that it resists overfitting. Our implementation will be offered in an
upcoming release of the R package bartMachine.
Comment: 20 pages, 5 figures
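The variance scheme described above (log error variance linear in prespecified covariates, with a Metropolis-Hastings update for its coefficients) can be sketched as follows. This is an illustrative stand-in, not the bartMachine implementation; the function names and the random-walk proposal are assumptions:

```python
import math
import random

def log_lik(gamma, Z, resid):
    """Gaussian log-likelihood of residuals r_i ~ N(0, exp(z_i^T gamma)),
    i.e. the log of the error variance is linear in the covariates z_i."""
    ll = 0.0
    for z, r in zip(Z, resid):
        log_var = sum(g * zj for g, zj in zip(gamma, z))
        ll += -0.5 * (log_var + math.log(2 * math.pi) + r * r / math.exp(log_var))
    return ll

def mh_update_gamma(gamma, Z, resid, step=0.1, rng=random):
    """One random-walk Metropolis-Hastings update of the variance
    coefficients, holding the tree-model residuals fixed (flat prior
    assumed for simplicity)."""
    proposal = [g + rng.gauss(0.0, step) for g in gamma]
    log_alpha = log_lik(proposal, Z, resid) - log_lik(gamma, Z, resid)
    if math.log(rng.random()) < log_alpha:
        return proposal
    return gamma
```

In a full sampler this step would alternate with the usual Gibbs draws for the sum-of-trees mean model.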
Predicting Inflation: Professional Experts Versus No-Change Forecasts
We compare forecasts of United States inflation from the Survey of
Professional Forecasters (SPF) to predictions made by simple statistical
techniques. In nowcasting, economic expertise is persuasive. When projecting
beyond the current quarter, novel yet simplistic probabilistic no-change
forecasts are equally competitive. We further interpret surveys as ensembles of
forecasts, and show that they can be used similarly to the ways in which
ensemble prediction systems have transformed weather forecasting. Then we
borrow another idea from weather forecasting, in that we apply statistical
techniques to postprocess the SPF forecast, based on experience from the recent
past. The foregoing conclusions remain unchanged after survey postprocessing.
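A minimal sketch of the probabilistic no-change idea: center the predictive distribution at the latest observation and take its spread from the empirical distribution of past h-step changes. This is an illustration under simple assumptions (nearest-rank quantiles), not the paper's exact construction:

```python
def no_change_forecast(series, horizon, quantiles=(0.1, 0.5, 0.9)):
    """Probabilistic no-change forecast: the point forecast is the latest
    observation, and predictive quantiles are formed by adding empirical
    quantiles of historical h-step-ahead changes."""
    changes = sorted(series[i + horizon] - series[i]
                     for i in range(len(series) - horizon))
    last = series[-1]

    def q(p):
        # crude nearest-rank empirical quantile of the past changes
        idx = min(int(p * len(changes)), len(changes) - 1)
        return changes[idx]

    return {p: last + q(p) for p in quantiles}
```

Postprocessing a survey forecast, by contrast, would recalibrate its intervals against such recent-past errors rather than build them from scratch.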
Regression density estimation using smooth adaptive Gaussian mixtures
We model a regression density flexibly so that at each value of the covariates the density is a mixture of normals, with the means, variances and mixture probabilities of the components changing smoothly as a function of the covariates. The model extends existing models in two important ways. First, the components are allowed to be heteroscedastic regressions, because the standard model with homoscedastic regressions can give a poor fit to heteroscedastic data, especially when the number of covariates is large. Furthermore, we typically need fewer components, which makes the model easier to interpret and speeds up the computation. The second main extension is to introduce a novel variable selection prior into all the components of the model. The variable selection prior acts as a self-adjusting mechanism that prevents overfitting and makes it feasible to fit flexible high-dimensional surfaces. We use Bayesian inference and Markov chain Monte Carlo methods to estimate the model. Simulated and real examples are used to show that the full generality of our model is required to fit a large class of densities, but also that special cases of the general model are interesting models for economic data.
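To make the setup concrete, a conditional density of this form can be evaluated as a covariate-weighted mixture of heteroscedastic normal regressions. The sketch below is a hypothetical one-covariate special case (softmax gating scores, linear component means, log-linear standard deviations), not the authors' estimated model:

```python
import math

def normal_pdf(y, mu, sigma):
    """Density of N(mu, sigma^2) at y."""
    return math.exp(-0.5 * ((y - mu) / sigma) ** 2) / (sigma * math.sqrt(2 * math.pi))

def conditional_density(y, x, components):
    """p(y | x) for a smooth mixture of heteroscedastic normal regressions.
    Each component is ((w0, w1), (a, b), (c0, c1)): gating score w0 + w1*x,
    mean a + b*x, and log standard deviation c0 + c1*x."""
    scores = [w0 + w1 * x for (w0, w1), _, _ in components]
    m = max(scores)                       # stabilize the softmax
    raw = [math.exp(s - m) for s in scores]
    total = sum(raw)
    dens = 0.0
    for r, (_, (a, b), (c0, c1)) in zip(raw, components):
        mu = a + b * x
        sigma = math.exp(c0 + c1 * x)     # heteroscedastic component
        dens += (r / total) * normal_pdf(y, mu, sigma)
    return dens
```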
Bayesian Matrix Completion for Hypothesis Testing
High-throughput screening (HTS) is a well-established technology that rapidly
and efficiently screens thousands of chemicals for potential toxicity. Massive
testing using HTS primarily aims to differentiate active vs inactive chemicals
for different types of biological endpoints. However, even using
high-throughput technology, it is not feasible to test all possible
combinations of chemicals and assay endpoints, resulting in a majority of
missing combinations. Our goal is to derive posterior probabilities of activity
for each chemical by assay endpoint combination, addressing the sparsity of HTS
data. We propose a Bayesian hierarchical framework, which borrows information
across different chemicals and assay endpoints in a low-dimensional latent
space. This framework facilitates out-of-sample prediction of bioactivity
potential for new chemicals not yet tested. Furthermore, this paper makes a
novel attempt in toxicology to simultaneously model heteroscedastic errors as
well as a nonparametric mean function. It leads to a broader definition of
activity whose need has been suggested by toxicologists. Simulation studies
demonstrate that our approach shows superior performance with more realistic
inferences on activity than current standard methods. Application to an HTS
data set identifies chemicals that are most likely active for two disease
outcomes: neurodevelopmental disorders and obesity. Code is available on
GitHub.
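The borrowing of information across chemicals and assay endpoints in a low-dimensional latent space can be illustrated by a plain low-rank reconstruction, where each chemical and each endpoint carries a latent factor vector. The logistic link used below to turn a latent score into an activity probability is an assumption for illustration, not necessarily the paper's link:

```python
import math

def complete_matrix(U, V):
    """Fill every chemical-by-endpoint cell from latent factors: entry
    (i, j) is the inner product of chemical factor U[i] and endpoint
    factor V[j], so untested combinations inherit predictions."""
    return [[sum(ui * vj for ui, vj in zip(U[i], V[j]))
             for j in range(len(V))]
            for i in range(len(U))]

def activity_probability(score):
    """Map a latent score to a (0, 1) activity probability via a logistic
    link; a stand-in for a posterior probability of activity."""
    return 1.0 / (1.0 + math.exp(-score))
```

Out-of-sample prediction for a new chemical then amounts to estimating its factor vector and taking inner products with the existing endpoint factors.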
Efficient Nonparametric and Semiparametric Regression Methods with Application in Case-Control Studies
Regression analysis is one of the most important tools of statistics and is widely used in other scientific fields for prediction and for modeling the association between variables. With modern computing techniques and high-performance hardware, regression analysis in multiple dimensions has become an important issue. Our task is to address the problem of modeling with no assumption on the mean and variance structure and, further, with no assumption on the error distribution; in other words, we focus on developing robust semiparametric and nonparametric regression methods. In modern genetic epidemiological association studies, it is often important to investigate the relationships among the potential covariates related to disease in case-control data, a study known as "secondary analysis". First we model the association between the potential covariates nonparametrically in the univariate setting. Then we model the association in the multivariate setting by assuming a convenient and popular semiparametric model known as the single-index model. The secondary analysis of case-control studies is particularly challenging for several reasons: (a) the case-control sample is not a random sample, (b) the logistic intercept is practically not identifiable, and (c) misspecification of the error distribution leads to inconsistent results. For rare diseases, controls (individuals free of disease) are typically used for valid estimation. However, numerous publications utilize the entire case-control sample (including the diseased individuals) to increase efficiency. Previous work in this context has either specified a fully parametric distribution for the regression errors, specified a homoscedastic distribution for the regression errors, or assumed parametric forms for the regression mean.
In the first chapter we predict a univariate covariate Y from another univariate covariate X without imposing any parametric form on the mean function or any distributional assumption on the error, hence addressing potential heteroscedasticity, a problem which had not been studied before. We develop a tilted kernel-based estimator, which is a first attempt to model the mean function nonparametrically in secondary analysis. In the following chapters, we turn to i.i.d. samples and model both the mean and the variance function for predicting Y by multiple covariates X without assuming any form for the regression mean. In particular, we
model Y by a single-index model m(X^T θ), where θ is a single-index vector and m is unspecified. We also model the variance function by another flexible single-index model. We develop a practical and readily applicable Bayesian methodology based on penalized splines and Markov chain Monte Carlo (MCMC), both in the i.i.d. setting and in the case-control setting. For efficient estimation, we model the error distribution by a Dirichlet process mixture of normals (DPMM). In numerical examples, we illustrate the finite-sample performance of the posterior estimates in both the i.i.d. and the case-control settings. In the i.i.d. single-index setting, only one existing work, based on a local linear kernel method, addresses modeling of the variance function; we found that our DPMM-based method vastly outperforms it in terms of mean squared error efficiency and computational stability. We develop single-index modeling in secondary analysis to introduce flexible mean and variance function modeling in case-control studies, a problem which had not been studied before. We show that our method is almost twice as efficient as using only the controls, which is the typical practice. We use real data examples from the NIH-AARP study on breast cancer, from a colon cancer study on red meat consumption, and from the National Morbidity Air Pollution Study to illustrate the computational efficiency and stability of our methods.
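A heteroscedastic single-index prediction of this kind can be sketched as follows, with the unspecified functions m and v passed in as arbitrary smooth links; this is an illustrative sketch of the model structure, not the penalized-spline MCMC methodology itself:

```python
import math

def single_index_predict(x, theta_mean, m, theta_var, v):
    """Predict the mean m(x^T theta_mean) and the standard deviation
    exp(v(x^T theta_var) / 2) of Y, where m and v are unspecified smooth
    functions of scalar single indices (here supplied by the caller)."""
    u_mean = sum(a * b for a, b in zip(x, theta_mean))
    u_var = sum(a * b for a, b in zip(x, theta_var))
    return m(u_mean), math.exp(v(u_var) / 2.0)
```

The dimension reduction is the point: however many covariates x has, m and v each only ever see a one-dimensional projection of it.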
Nonparametric Bayesian Inferences on Predictor-Dependent Response Distributions
A common statistical problem in biomedical research is to characterize the relationship between a response and predictors. The heterogeneity among subjects causes the response distribution to change across the predictor space in distributional characteristics such as skewness, quantiles and residual variation. In such settings, it would be appealing to model the conditional response distributions as flexibly changing across the predictors while conducting variable selection to identify important predictors both locally (within some local regions) and globally (across the entire range of the predictor space) for the response distribution change. Nonparametric Bayes methods have been very useful for flexible modeling, where nonparametric distributions are assumed unknown and assigned priors such as the Dirichlet process (DP). In recent years, there has been growing interest in extending the DP to a prior model for predictor-dependent unknown distributions, so that the extended priors can be applied to flexible conditional distribution modeling. However, for most of the proposed extensions, construction is not simple and computation can be quite difficult. In addition, the literature has focused on estimation, and few attempts have been made to address related hypothesis testing problems such as variable selection. Paper 1 proposes a local Dirichlet process (lDP) as a generalization of the Dirichlet process to provide a prior distribution for a collection of random probability measures indexed by predictors. The lDP involves a simple construction, results in a marginal Dirichlet process prior for the random measure at any specific predictor value, and leads to a straightforward posterior computation. In Paper 2, we propose a more general approach that not only estimates the conditional response distribution but also identifies important predictors for the response distribution change, both within local regions and globally.
This is achieved through the probit stick-breaking process mixture (PSBPM) of normal linear regressions, where the PSBP is a newly proposed prior for dependent probability measures that is particularly convenient for incorporating variable selection structure. In Paper 3, we extend the Paper 2 method to longitudinal outcomes which are correlated within subject. The PSBPM of linear mixed effects (LME) models is considered, allowing for individual variability within a mixture component.
Doctor of Philosophy
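The probit stick-breaking construction can be illustrated with predictor-dependent weights: each stick fraction is a standard normal CDF applied to a score at the covariate value, and component h keeps whatever mass earlier components left over. A minimal sketch, assuming linear scores in a scalar covariate and truncation at a fixed number of components:

```python
import math

def std_normal_cdf(z):
    """Phi(z), the standard normal CDF, via the error function."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def psbp_weights(alphas, x):
    """Probit stick-breaking weights at covariate x.  alphas is a list of
    (a0, a1) pairs; stick fraction h is Phi(a0 + a1*x).  The last
    component absorbs the remaining mass so the weights sum to one."""
    weights = []
    remaining = 1.0
    for a0, a1 in alphas[:-1]:
        frac = std_normal_cdf(a0 + a1 * x)
        weights.append(remaining * frac)
        remaining *= (1.0 - frac)
    weights.append(remaining)
    return weights
```

Because the weights vary smoothly with x, the implied mixture of regressions yields a conditional response distribution that changes across the predictor space, which is the property the papers above exploit.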