
    Bayesian Additive Regression Trees With Parametric Models of Heteroskedasticity

    We incorporate heteroskedasticity into Bayesian Additive Regression Trees (BART) by modeling the log of the error variance parameter as a linear function of prespecified covariates. Under this scheme, the Gibbs sampling procedure for the original sum-of-trees model is easily modified, and the parameters of the variance model are updated via a Metropolis-Hastings step. We demonstrate the promise of our approach by providing more appropriate posterior predictive intervals than homoskedastic BART in heteroskedastic settings and by demonstrating the model's resistance to overfitting. Our implementation will be offered in an upcoming release of the R package bartMachine. Comment: 20 pages, 5 figures
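    The Metropolis-Hastings update described above can be sketched as a random-walk step on the variance coefficients. This is a minimal illustration under assumptions not taken from the paper: a flat prior on the coefficients (so the acceptance ratio reduces to a likelihood ratio) and illustrative function names and step size, not bartMachine's actual implementation.

    ```python
    import numpy as np

    def log_lik(gamma, resid, Z):
        """Gaussian log-likelihood when the log error variance is linear in covariates Z."""
        log_var = Z @ gamma
        return -0.5 * np.sum(log_var + resid**2 / np.exp(log_var))

    def mh_update_gamma(gamma, resid, Z, step=0.1, rng=None):
        """One random-walk Metropolis-Hastings update for the variance coefficients
        (flat prior assumed here, so the acceptance ratio is a likelihood ratio)."""
        rng = rng if rng is not None else np.random.default_rng()
        proposal = gamma + step * rng.standard_normal(gamma.shape)
        log_accept = log_lik(proposal, resid, Z) - log_lik(gamma, resid, Z)
        if np.log(rng.uniform()) < log_accept:
            return proposal  # accept the proposed coefficients
        return gamma         # otherwise keep the current state
    ```

    In the full sampler this step would alternate with the Gibbs updates of the sum-of-trees mean model, using the current tree residuals as `resid`.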

    Predicting Inflation: Professional Experts Versus No-Change Forecasts

    We compare forecasts of United States inflation from the Survey of Professional Forecasters (SPF) to predictions made by simple statistical techniques. In nowcasting, economic expertise is persuasive. When projecting beyond the current quarter, novel yet simplistic probabilistic no-change forecasts are equally competitive. We further interpret surveys as ensembles of forecasts, and show that they can be used similarly to the ways in which ensemble prediction systems have transformed weather forecasting. We then borrow another idea from weather forecasting and apply statistical techniques to postprocess the SPF forecasts, based on experience from the recent past. The foregoing conclusions remain unchanged after survey postprocessing.
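    A probabilistic no-change forecast of the kind described above can be sketched very compactly: center the predictive distribution at the last observed value and spread it with the empirical distribution of past h-step changes. This is a hedged illustration of the general idea, not the paper's exact construction.

    ```python
    import numpy as np

    def no_change_forecast(series, horizon, n_samples=1000, rng=None):
        """Probabilistic no-change forecast (sketch): predictive draws centered at
        the last observation, with dispersion from historical h-step changes."""
        rng = rng if rng is not None else np.random.default_rng()
        changes = series[horizon:] - series[:-horizon]  # historical h-step changes
        # resample past changes and add them to the latest value
        return series[-1] + rng.choice(changes, size=n_samples, replace=True)
    ```

    The point forecast is just the last observation ("no change"); the resampled changes turn it into a full predictive distribution that can be scored with proper scoring rules.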

    Regression density estimation using smooth adaptive Gaussian mixtures

    We model a regression density flexibly so that at each value of the covariates the density is a mixture of normals, with the means, variances and mixture probabilities of the components changing smoothly as a function of the covariates. The model extends existing models in two important ways. First, the components are allowed to be heteroscedastic regressions, as the standard model with homoscedastic regressions can give a poor fit to heteroscedastic data, especially when the number of covariates is large. Furthermore, we typically need fewer components, which makes it easier to interpret the model and speeds up the computation. The second main extension is to introduce a novel variable selection prior into all the components of the model. The variable selection prior acts as a self-adjusting mechanism that prevents overfitting and makes it feasible to fit flexible high-dimensional surfaces. We use Bayesian inference and Markov chain Monte Carlo methods to estimate the model. Simulated and real examples are used to show that the full generality of our model is required to fit a large class of densities, but also that special cases of the general model are interesting models for economic data.
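    The conditional density described above can be sketched with one common parameterization: softmax weights, linear means, and log-linear variances, all functions of the covariate vector. The parameterization below is an illustrative assumption, not the paper's prior or estimation scheme.

    ```python
    import numpy as np

    def conditional_density(y, x, W, B, G):
        """p(y|x) for a mixture of K heteroscedastic normal regressions (sketch):
        weights ∝ exp(W @ x), means = B @ x, log-variances = G @ x."""
        logits = W @ x
        w = np.exp(logits - logits.max())
        w /= w.sum()                       # mixture probabilities, smooth in x
        mu = B @ x                         # component means
        var = np.exp(G @ x)                # component variances (heteroscedastic)
        comp = np.exp(-0.5 * (y - mu)**2 / var) / np.sqrt(2 * np.pi * var)
        return float(w @ comp)
    ```

    With K = 1 this collapses to a single heteroscedastic normal regression, which is the sense in which the heteroscedastic components reduce the number of mixture components needed.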

    Bayesian Matrix Completion for Hypothesis Testing

    High-throughput screening (HTS) is a well-established technology that rapidly and efficiently screens thousands of chemicals for potential toxicity. Massive testing using HTS primarily aims to differentiate active vs. inactive chemicals for different types of biological endpoints. However, even using high-throughput technology, it is not feasible to test all possible combinations of chemicals and assay endpoints, resulting in a majority of missing combinations. Our goal is to derive posterior probabilities of activity for each chemical by assay endpoint combination, addressing the sparsity of HTS data. We propose a Bayesian hierarchical framework, which borrows information across different chemicals and assay endpoints in a low-dimensional latent space. This framework facilitates out-of-sample prediction of bioactivity potential for new chemicals not yet tested. Furthermore, this paper makes a novel attempt in toxicology to simultaneously model heteroscedastic errors as well as a nonparametric mean function. It leads to a broader definition of activity whose need has been suggested by toxicologists. Simulation studies demonstrate that our approach shows superior performance with more realistic inferences on activity than current standard methods. Application to an HTS data set identifies chemicals that are most likely active for two disease outcomes: neurodevelopmental disorders and obesity. Code is available on GitHub.
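    The low-dimensional borrowing of information can be illustrated with a bare latent-factor score: every chemical and every assay endpoint gets a latent vector, and their inner product (pushed through a link function) scores every pair, including the untested ones. The factor matrices and the logistic link here are illustrative assumptions; the paper's hierarchical model, with heteroscedastic errors and a nonparametric mean, is far richer.

    ```python
    import numpy as np

    def activity_probabilities(U, V):
        """Activity scores from a low-dimensional latent factorization (sketch).
        U: chemicals x latent dims, V: assay endpoints x latent dims."""
        logits = U @ V.T                       # latent-space affinity for every pair
        return 1.0 / (1.0 + np.exp(-logits))   # probability scale, untested pairs included
    ```

    Because every chemical-endpoint pair gets a score, rows for chemicals never tested on a given endpoint come for free, which is the out-of-sample prediction the abstract refers to.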

    Efficient Nonparametric and Semiparametric Regression Methods with Application in Case-Control Studies

    Regression analysis is one of the most important tools of statistics and is widely used across scientific fields for prediction and for modeling the association between variables. With modern computing techniques and high-performance hardware, regression analysis in multiple dimensions has become an important issue. Our task is to model data with no assumption on the mean and variance structure and, further, with no assumption on the error distribution; in other words, we focus on developing robust semiparametric and nonparametric regression methods. In modern genetic epidemiological association studies, it is often important to investigate the relationships among potential covariates related to disease in case-control data, a problem known as "secondary analysis". We first model the association between potential covariates nonparametrically in the univariate setting. We then model the association in the multivariate setting by assuming a convenient and popular multivariate semiparametric model, the single-index model. The secondary analysis of case-control studies is particularly challenging for multiple reasons: (a) the case-control sample is not a random sample, (b) the logistic intercept is practically not identifiable, and (c) misspecification of the error distribution leads to inconsistent results. For rare diseases, controls (individuals free of the disease) are typically used for valid estimation; however, numerous publications utilize the entire case-control sample (including diseased individuals) to increase efficiency. Previous work in this context has either specified a fully parametric distribution for the regression errors, specified a homoscedastic error distribution, or assumed parametric forms for the regression mean.
In the first chapter we focus on predicting a univariate covariate Y from another univariate covariate X with neither a parametric form for the mean function nor a distributional assumption on the error, hence addressing potential heteroscedasticity, a problem which has not been studied before. We develop a tilted kernel-based estimator, which is a first attempt to model the mean function nonparametrically in secondary analysis. In the following chapters, we focus on i.i.d. samples to model both the mean and variance functions for predicting Y from multiple covariates X without assuming any form for the regression mean. In particular, we model Y by a single-index model m(X^T θ), where θ is the single-index vector and m is unspecified. We also model the variance function by another flexible single-index model. We develop a practical and readily applicable Bayesian methodology based on penalized splines and Markov chain Monte Carlo (MCMC), both in the i.i.d. setting and in the case-control setting. For efficient estimation, we model the error distribution by a Dirichlet process mixture of normals (DPMM). In numerical examples, we illustrate the finite-sample performance of the posterior estimates in both the i.i.d. and case-control settings. In the single-index i.i.d. setting, only one existing method, based on local linear kernels, addresses modeling of the variance function; we found that our DPMM-based method vastly outperforms it in terms of mean squared efficiency and computational stability. We develop single-index modeling in secondary analysis to introduce flexible mean and variance function modeling in case-control studies, a problem which has not been studied before. We show that our method is almost twice as efficient as using only controls, as is typically done.
We use real data examples from the NIH-AARP study on breast cancer, from a colon cancer study on red meat consumption, and from the National Morbidity Air Pollution Study to illustrate the computational efficiency and stability of our methods.
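    The single-index mean model m(X^T θ) above can be sketched with one common identifiability convention (unit-norm θ) and a simple truncated-linear spline basis for m. The basis, knot placement, and function names are illustrative assumptions; the thesis estimates the spline coefficients and θ jointly by penalized-spline MCMC.

    ```python
    import numpy as np

    def single_index_predict(X, theta, knots, coefs):
        """Predict via a single-index model m(Xᵀθ) with a spline basis for m (sketch)."""
        theta = theta / np.linalg.norm(theta)   # unit norm for identifiability of θ
        u = X @ theta                           # scalar index per observation
        # truncated-linear spline basis: [1, u, (u - k)_+ for each knot]
        basis = np.column_stack([np.ones_like(u), u] +
                                [np.maximum(u - k, 0.0) for k in knots])
        return basis @ coefs                    # fitted mean m(u)
    ```

    The same device, with its own index vector and spline, gives the flexible single-index variance function: replace the mean coefficients with variance-model coefficients and exponentiate the result to keep it positive.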

    Nonparametric Bayesian Inferences on Predictor-Dependent Response Distributions

    A common statistical problem in biomedical research is to characterize the relationship between a response and predictors. The heterogeneity among subjects causes the response distribution to change across the predictor space in distributional characteristics such as skewness, quantiles and residual variation. In such settings, it would be appealing to model the conditional response distributions as flexibly changing across the predictors while conducting variable selection to identify important predictors both locally (within some local regions) and globally (across the entire range of the predictor space) for the response distribution change. Nonparametric Bayes methods have been very useful for flexible modeling, where nonparametric distributions are assumed unknown and assigned priors such as the Dirichlet process (DP). In recent years, there has been a growing interest in extending the DP to a prior model for predictor-dependent unknown distributions, so that the extended priors can be applied to flexible conditional distribution modeling. However, for most of the proposed extensions, construction is not simple and computation can be quite difficult. In addition, the literature has focused on estimation, and few attempts have been made to address related hypothesis testing problems such as variable selection. Paper 1 proposes a local Dirichlet process (lDP) as a generalization of the Dirichlet process to provide a prior distribution for a collection of random probability measures indexed by predictors. The lDP involves a simple construction, results in a marginal Dirichlet process prior for the random measure at any specific predictor value, and leads to a straightforward posterior computation. In paper 2, we propose a more general approach that not only estimates the conditional response distribution but also identifies important predictors for the response distribution change, both within local regions and globally.
This is achieved through the probit stick-breaking process mixture (PSBPM) of normal linear regressions, where the PSBP is a newly proposed prior for dependent probability measures that is particularly convenient for incorporating a variable selection structure. In paper 3, we extend the paper 2 method to longitudinal outcomes, which are correlated within subject. A PSBPM of linear mixed-effects (LME) models is considered, allowing for individual variability within a mixture component.
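    The stick-breaking construction underlying the PSBP can be sketched in a few lines: each mixture weight is a probit-transformed "break" of the stick remaining after the earlier breaks. In the PSBPM the probit arguments depend on the predictors; here they are plain scalars, which is an illustrative simplification.

    ```python
    import math

    def psbp_weights(alphas):
        """Probit stick-breaking weights (sketch):
        pi_h = Phi(alpha_h) * prod_{l<h} (1 - Phi(alpha_l))."""
        Phi = lambda a: 0.5 * (1.0 + math.erf(a / math.sqrt(2.0)))  # standard normal CDF
        weights, remaining = [], 1.0
        for a in alphas:
            v = Phi(a)                     # probit-transformed break proportion
            weights.append(remaining * v)  # take a piece of the remaining stick
            remaining *= (1.0 - v)         # shrink the stick for later components
        return weights
    ```

    Making each alpha a regression in the predictors is what lets the mixture weights, and hence the whole conditional response distribution, change smoothly across the predictor space.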