    parallelMCMCcombine: An R Package for Bayesian Methods for Big Data and Analytics

    Recent advances in big data and analytics research have produced a wealth of data sets too large to be analyzed in their entirety, owing to limits on computer memory or storage capacity. New Bayesian methods have been developed for data sets that are large solely because of their sample size; these methods partition a big data set into subsets and perform independent Bayesian Markov chain Monte Carlo analyses on each subset. The methods then combine the independent subset posterior samples to estimate the posterior density given the full data set. These approaches have been shown to be effective for Bayesian models including logistic regression models, Gaussian mixture models and hierarchical models. Here we introduce the R package parallelMCMCcombine, which implements four of these techniques for combining independent subset posterior samples. We illustrate each method using a Bayesian logistic regression model for simulated data and a Bayesian Gamma model for real data, and demonstrate the features and capabilities of the package. The package assumes the user has already carried out the Bayesian analysis and produced the independent subposterior samples outside the package. The methods are primarily suited to models with unknown parameters of fixed dimension in continuous parameter spaces. We envision that this tool will allow researchers to explore the various methods for their specific applications and will assist future progress in this rapidly developing field.
    Comment: for published version see: http://www.plosone.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0108425&representation=PD
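
    The shared combination step can be made concrete. Below is a minimal sketch, written in Python for illustration rather than against the package's R interface (the function and variable names are not the parallelMCMCcombine API), of a consensus-style rule: the t-th combined draw is a weighted average of the subsets' t-th draws, with inverse-sample-covariance weights.

        import numpy as np

        def consensus_combine(subposteriors):
            """Combine S subposterior sample arrays, each of shape (T, d),
            by per-draw weighted averaging with weights W_s = Cov_s^{-1}."""
            weights = [np.linalg.inv(np.cov(s, rowvar=False)) for s in subposteriors]
            total = np.linalg.inv(sum(weights))            # (sum_s W_s)^{-1}
            combined = np.empty_like(subposteriors[0])
            for t in range(combined.shape[0]):
                weighted_sum = sum(W @ s[t] for W, s in zip(weights, subposteriors))
                combined[t] = total @ weighted_sum         # consensus draw t
            return combined

        # Toy usage: three subposteriors of 1000 draws of a 2-dim parameter.
        rng = np.random.default_rng(0)
        subs = [rng.normal(loc=m, scale=1.0, size=(1000, 2)) for m in (0.9, 1.0, 1.1)]
        full_draws = consensus_combine(subs)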

    Asymptotically Exact, Embarrassingly Parallel MCMC

    Communication costs, resulting from synchronization requirements during learning, can greatly slow down many parallel machine learning algorithms. In this paper, we present a parallel Markov chain Monte Carlo (MCMC) algorithm in which subsets of data are processed independently, with very little communication. First, we arbitrarily partition the data onto multiple machines. Then, on each machine, any classical MCMC method (e.g., Gibbs sampling) may be used to draw samples from the posterior distribution given that machine's data subset. Finally, the samples from each machine are combined to form samples from the full posterior. This embarrassingly parallel algorithm allows each machine to act independently on a subset of the data (without communication) until the final combination stage. We prove that our algorithm generates asymptotically exact samples and empirically demonstrate its ability to parallelize burn-in and sampling in several models.
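
    The combination stage admits a simple closed form when each subposterior is approximated parametrically. Since the full posterior is proportional to the product of the subposterior densities, Gaussian approximations multiply analytically: precisions add, and the combined mean is the precision-weighted average. A minimal sketch of this parametric combination (illustrative names, not the paper's code):

        import numpy as np

        def gaussian_product_combine(subposteriors):
            """Combine subposterior samples via Gaussian approximations:
            Sigma = (sum_s Sigma_s^{-1})^{-1},  mu = Sigma @ sum_s Sigma_s^{-1} mu_s."""
            precisions, weighted_means = [], []
            for s in subposteriors:                        # s has shape (T, d)
                prec = np.linalg.inv(np.cov(s, rowvar=False))
                precisions.append(prec)
                weighted_means.append(prec @ s.mean(axis=0))
            Sigma = np.linalg.inv(sum(precisions))
            mu = Sigma @ sum(weighted_means)
            return mu, Sigma                               # full-posterior approximation

    The paper's asymptotically exact samplers replace this Gaussian step with nonparametric and semiparametric density estimates of the subposteriors, but the underlying product structure is the same.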

    Semiparametric Multinomial Logit Models for Analysing Consumer Choice Behaviour

    The multinomial logit model (MNL) is one of the most frequently used statistical models in marketing applications. It relates an unordered categorical response variable, for example the choice of a brand, to a vector of covariates such as the price of the brand or variables characterising the consumer. In its classical form, all covariates enter the utility function of the MNL model in a strictly parametric, linear form. In this paper, we introduce semiparametric extensions in which smooth effects of continuous covariates are modelled by penalised splines. A mixed model representation of these penalised splines is employed to obtain estimates of the corresponding smoothing parameters, leading to a fully automated estimation procedure. To validate semiparametric models against parametric ones, we utilise proper scoring rules and compare the two approaches on a number of brand choice data sets.
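
    As a rough Python analogue of this idea (the paper itself works with a mixed model representation in which the smoothing parameters are estimated automatically; here a fixed ridge penalty stands in for them), one can expand a continuous covariate such as price in a B-spline basis and fit a penalised multinomial logit:

        import numpy as np
        from sklearn.pipeline import make_pipeline
        from sklearn.preprocessing import SplineTransformer
        from sklearn.linear_model import LogisticRegression

        rng = np.random.default_rng(1)
        price = rng.uniform(1.0, 5.0, size=(500, 1))       # continuous covariate
        # Simulated choice among 3 brands with a nonlinear price effect.
        utility = np.column_stack([np.sin(price[:, 0]),
                                   -price[:, 0] ** 2 / 10,
                                   np.zeros(500)])
        choice = np.array([rng.choice(3, p=np.exp(u) / np.exp(u).sum())
                           for u in utility])

        model = make_pipeline(
            SplineTransformer(n_knots=8, degree=3),        # B-spline basis for price
            LogisticRegression(C=1.0, max_iter=1000),      # lbfgs: multinomial logit
        )
        model.fit(price, choice)
        print(model.predict_proba([[2.5]]))                # estimated choice probabilities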

    Semiparametric posterior limits

    We review the Bayesian theory of semiparametric inference following Bickel and Kleijn (2012) and Kleijn and Knapik (2013). After an overview of efficiency in parametric and semiparametric estimation problems, we consider the Bernstein-von Mises theorem (see, e.g., Le Cam and Yang (1990)) and generalize it to (LAN) regular and (LAE) irregular semiparametric estimation problems. We formulate a version of the semiparametric Bernstein-von Mises theorem that does not depend on least-favourable submodels, thus bypassing the most restrictive condition in the presentation of Bickel and Kleijn (2012). The results are applied to the (regular) estimation of the linear coefficient in partial linear regression (with a Gaussian nuisance prior) and of the kernel bandwidth in a model of normal location mixtures (with a Dirichlet nuisance prior), as well as the (irregular) estimation of the boundary of the support of a monotone family of densities (with a Gaussian nuisance prior).
    Comment: 47 pp., 1 figure, submitted for publication. arXiv admin note: substantial text overlap with arXiv:1007.017
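
    For orientation, the parametric Bernstein-von Mises theorem that the paper generalizes can be stated as follows (a standard formulation): under regularity conditions,

        \[
            \bigl\| \Pi(\,\cdot \mid X_1, \ldots, X_n) - N\bigl(\hat\theta_n,\, n^{-1} I_{\theta_0}^{-1}\bigr) \bigr\|_{\mathrm{TV}} \xrightarrow{\,P_{\theta_0}\,} 0,
        \]

    where \(\hat\theta_n\) is an efficient estimator (e.g. the maximum likelihood estimator) and \(I_{\theta_0}\) the Fisher information. In words, the posterior is asymptotically normal and centred at an efficient estimator, so Bayesian credible sets are asymptotically valid frequentist confidence sets; the semiparametric versions reviewed here give conditions under which this remains true for the marginal posterior of the parameter of interest in the presence of a nonparametric nuisance.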

    Semiparametric theory and empirical processes in causal inference

    In this paper we review important aspects of semiparametric theory and empirical processes that arise in causal inference problems. We begin with a brief introduction to the general problem of causal inference, and go on to discuss estimation and inference for causal effects under semiparametric models, which allow parts of the data-generating process to be unrestricted if they are not of particular interest (i.e., nuisance functions). These models are very useful in causal problems because the outcome process is often complex and difficult to model, and, at best, there may only be information available about the treatment process. Semiparametric theory gives a framework for benchmarking efficiency and constructing estimators in such settings. In the second part of the paper we discuss empirical process theory, which provides powerful tools for understanding the asymptotic behavior of semiparametric estimators that depend on flexible nonparametric estimators of nuisance functions. These tools are crucial for incorporating machine learning and other modern methods into causal inference analyses. We conclude by examining related extensions and future directions for work in semiparametric causal inference.
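
    As a concrete instance of this interplay, the sketch below implements a doubly robust (AIPW) estimator of the average treatment effect with cross-fitting, the sample-splitting device that allows flexible machine learning nuisance estimators to be used without Donsker-type empirical process conditions. All function names and model choices are illustrative, not from the paper:

        import numpy as np
        from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
        from sklearn.model_selection import KFold

        def aipw_ate(X, A, Y, n_splits=2, seed=0):
            """Cross-fitted AIPW estimate of E[Y^1 - Y^0] with binary treatment A."""
            psi = np.zeros(len(Y))                         # influence-function values
            for train, test in KFold(n_splits, shuffle=True, random_state=seed).split(X):
                # Nuisance 1: propensity score pi(x) = P(A=1 | X=x), clipped for stability.
                ps = RandomForestClassifier(random_state=seed).fit(X[train], A[train])
                pi = np.clip(ps.predict_proba(X[test])[:, 1], 0.01, 0.99)
                # Nuisance 2: outcome regressions mu_a(x) = E[Y | X=x, A=a].
                mu = {a: RandomForestRegressor(random_state=seed)
                         .fit(X[train][A[train] == a], Y[train][A[train] == a])
                         .predict(X[test])
                      for a in (0, 1)}
                # Doubly robust pseudo-outcome, evaluated on the held-out fold only.
                psi[test] = (mu[1] - mu[0]
                             + A[test] * (Y[test] - mu[1]) / pi
                             - (1 - A[test]) * (Y[test] - mu[0]) / (1 - pi))
            return psi.mean(), psi.std(ddof=1) / np.sqrt(len(Y))   # ATE estimate, SE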