parallelMCMCcombine: An R Package for Bayesian Methods for Big Data and Analytics
Recent advances in big data and analytics research have provided a wealth of
large data sets that are too big to be analyzed in their entirety, due to
restrictions on computer memory or storage size. New Bayesian methods have been
developed for data sets that are large solely because of their sample size;
these methods partition big data sets into subsets and perform independent
Bayesian Markov chain Monte Carlo analyses on the subsets. The methods then
combine the independent subset posterior samples to estimate a posterior
density given the full data set. These approaches were shown to be effective
for Bayesian models including logistic regression models, Gaussian mixture
models and hierarchical models. Here, we introduce the R package
parallelMCMCcombine which carries out four of these techniques for combining
independent subset posterior samples. We illustrate each of the methods using a
Bayesian logistic regression model for simulation data and a Bayesian Gamma
model for real data; we also demonstrate features and capabilities of the R
package. The package assumes the user has carried out the Bayesian analysis and
has produced the independent subposterior samples outside of the package. The
methods are primarily suited to models with unknown parameters of fixed
dimension that exist in continuous parameter spaces. We envision this tool will
allow researchers to explore the various methods for their specific
applications, and will assist future progress in this rapidly developing field.
Comment: for published version see:
http://www.plosone.org/article/fetchObject.action?uri=info%3Adoi%2F10.1371%2Fjournal.pone.0108425&representation=PD
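The combination step these methods share can be illustrated with a minimal sketch of one such strategy: consensus-style inverse-variance-weighted averaging of subposterior draws. The package itself implements four combination techniques in R; the Python sketch below is illustrative only, and the function name and toy data are hypothetical, not the package's code:

```python
import numpy as np

def consensus_combine(subposteriors):
    """Combine M subposterior sample arrays (each of shape (T, d)) by
    inverse-variance-weighted averaging, draw by draw (consensus-style)."""
    # Weights proportional to the inverse subposterior variance (diagonal sketch).
    weights = [1.0 / np.var(s, axis=0) for s in subposteriors]
    wsum = np.sum(weights, axis=0)
    combined = sum(w * s for w, s in zip(weights, subposteriors)) / wsum
    return combined

rng = np.random.default_rng(0)
# Toy example: three "subposteriors" centred at slightly different means.
subs = [rng.normal(loc=m, scale=1.0, size=(5000, 1)) for m in (0.9, 1.0, 1.1)]
full = consensus_combine(subs)
print(full.shape, full.mean())
```

The weighted average pulls the combined draws toward the subposteriors with the smallest variance; with equal variances, as in the toy example, it reduces to a plain average of the draws.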
Asymptotically Exact, Embarrassingly Parallel MCMC
Communication costs, resulting from synchronization requirements during
learning, can greatly slow down many parallel machine learning algorithms. In
this paper, we present a parallel Markov chain Monte Carlo (MCMC) algorithm in
which subsets of data are processed independently, with very little
communication. First, we arbitrarily partition data onto multiple machines.
Then, on each machine, any classical MCMC method (e.g., Gibbs sampling) may be
used to draw samples from a posterior distribution given the data subset.
Finally, the samples from each machine are combined to form samples from the
full posterior. This embarrassingly parallel algorithm allows each machine to
act independently on a subset of the data (without communication) until the
final combination stage. We prove that our algorithm generates asymptotically
exact samples and empirically demonstrate its ability to parallelize burn-in
and sampling in several models.
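The final combination stage can be sketched for the simplest, parametric case: approximate each subposterior by a Gaussian, so that their product (which targets the full posterior) is again Gaussian, with precision equal to the sum of the subposterior precisions. This standalone sketch uses hypothetical names and toy data and is not the paper's implementation:

```python
import numpy as np

def gaussian_product_combine(subposteriors):
    """Parametric combination: fit a Gaussian to each subposterior sample
    array (shape (T, d)), then form the product density, which is Gaussian
    with precision = sum of precisions and a precision-weighted mean."""
    precisions = [np.linalg.inv(np.cov(s, rowvar=False)) for s in subposteriors]
    means = [s.mean(axis=0) for s in subposteriors]
    prec = sum(precisions)
    cov = np.linalg.inv(prec)
    mean = cov @ sum(P @ m for P, m in zip(precisions, means))
    return mean, cov

rng = np.random.default_rng(1)
# Toy example: three 2-d "subposteriors" centred near (1, -1).
subs = [rng.multivariate_normal([m, -m], 0.5 * np.eye(2), size=4000)
        for m in (0.8, 1.0, 1.2)]
mean, cov = gaussian_product_combine(subs)
print(mean, cov)
```

The paper's asymptotically exact combination replaces the Gaussian fits with nonparametric kernel density estimates of each subposterior, but the product-of-densities structure is the same.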
Semiparametric Multinomial Logit Models for Analysing Consumer Choice Behaviour
The multinomial logit model (MNL) is one of the most frequently used statistical models in marketing applications. It relates an unordered categorical response variable, for example the choice of a brand, to a vector of covariates such as the price of the brand or variables characterising the consumer. In its classical form, all covariates enter the utility function of the MNL model in strictly parametric, linear form. In this paper, we introduce semiparametric extensions in which smooth effects of continuous covariates are modelled by penalised splines. A mixed model representation of these penalised splines is employed to obtain estimates of the corresponding smoothing parameters, leading to a fully automated estimation procedure. To validate semiparametric models against parametric models, we utilise proper scoring rules and compare parametric and semiparametric approaches for a number of brand choice data sets.
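The classical MNL structure the abstract starts from, linear utilities mapped to choice probabilities by a softmax, can be sketched as follows; the function name, coefficients, and covariate values are hypothetical illustrations, not estimates from the paper:

```python
import numpy as np

def mnl_probs(X, betas):
    """Choice probabilities of a multinomial logit model: the utility of
    alternative j for consumer i is x_i' beta_j, and the choice
    probabilities are the softmax of the utilities."""
    U = X @ betas                          # (n, J) utilities
    U = U - U.max(axis=1, keepdims=True)   # subtract row max for stability
    expU = np.exp(U)
    return expU / expU.sum(axis=1, keepdims=True)

# Toy example: 2 consumers, 2 covariates (e.g., price, income), 3 brands.
X = np.array([[1.0, 2.0],
              [0.5, 1.0]])
betas = np.array([[0.3, -0.2, 0.0],    # hypothetical coefficients
                  [-0.5, 0.1, 0.4]])   # one column per alternative
P = mnl_probs(X, betas)
print(P.sum(axis=1))  # each row sums to 1
```

The semiparametric extension in the paper replaces the linear terms x_i' beta_j with penalised-spline smooths of the continuous covariates, leaving the softmax link unchanged.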
Semiparametric posterior limits
We review the Bayesian theory of semiparametric inference following Bickel
and Kleijn (2012) and Kleijn and Knapik (2013). After an overview of efficiency
in parametric and semiparametric estimation problems, we consider the
Bernstein-von Mises theorem (see, e.g., Le Cam and Yang (1990)) and generalize
it to (LAN) regular and (LAE) irregular semiparametric estimation problems. We
formulate a version of the semiparametric Bernstein-von Mises theorem that does
not depend on least-favourable submodels, thus bypassing the most restrictive
condition in the presentation of Bickel and Kleijn (2012). The results are
applied to the (regular) estimation of the linear coefficient in partial linear
regression (with a Gaussian nuisance prior) and of the kernel bandwidth in a
model of normal location mixtures (with a Dirichlet nuisance prior), as well as
the (irregular) estimation of the boundary of the support of a monotone family
of densities (with a Gaussian nuisance prior).
Comment: 47 pp., 1 figure, submitted for publication. arXiv admin note:
substantial text overlap with arXiv:1007.017
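For orientation, the classical parametric Bernstein-von Mises theorem that the abstract generalizes can be stated informally: the posterior for the parameter, centred at an efficient estimator and rescaled by the square root of the sample size, converges in total variation to a normal limit,

```latex
\[
\sup_{B}\,\Bigl|\,
\Pi\bigl(\sqrt{n}\,(\theta-\hat{\theta}_n)\in B \,\bigm|\, X_1,\dots,X_n\bigr)
\;-\; N\bigl(0,\,I_{\theta_0}^{-1}\bigr)(B)
\,\Bigr| \;\xrightarrow{\;P_{\theta_0}\;}\; 0,
\]
```

where $\hat{\theta}_n$ is an efficient estimator and $I_{\theta_0}$ is the Fisher information at the true parameter. In the semiparametric versions discussed above, the Fisher information is replaced by the efficient information for the parameter of interest.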
Semiparametric theory and empirical processes in causal inference
In this paper we review important aspects of semiparametric theory and
empirical processes that arise in causal inference problems. We begin with a
brief introduction to the general problem of causal inference, and go on to
discuss estimation and inference for causal effects under semiparametric
models, which allow parts of the data-generating process to be unrestricted if
they are not of particular interest (i.e., nuisance functions). These models
are very useful in causal problems because the outcome process is often complex
and difficult to model, and there may only be information available about the
treatment process (at best). Semiparametric theory gives a framework for
benchmarking efficiency and constructing estimators in such settings. In the
second part of the paper we discuss empirical process theory, which provides
powerful tools for understanding the asymptotic behavior of semiparametric
estimators that depend on flexible nonparametric estimators of nuisance
functions. These tools are crucial for incorporating machine learning and other
modern methods into causal inference analyses. We conclude by examining related
extensions and future directions for work in semiparametric causal inference.
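A canonical example of the estimators this framework produces is the augmented inverse-probability-weighted (AIPW, doubly robust) estimator of the average treatment effect. The following minimal sketch assumes the nuisance functions (propensity score and outcome regressions) have already been estimated; here they are plugged in at their true values for a toy data-generating process, and all names are illustrative:

```python
import numpy as np

def aipw_ate(y, a, pscore, mu1, mu0):
    """Augmented IPW (doubly robust) estimate of the average treatment
    effect E[Y(1) - Y(0)], given estimated nuisance functions: the
    propensity score and the outcome regressions under treatment (mu1)
    and control (mu0)."""
    psi1 = mu1 + a * (y - mu1) / pscore
    psi0 = mu0 + (1 - a) * (y - mu0) / (1 - pscore)
    return np.mean(psi1 - psi0)

# Toy data: true treatment effect is 2; nuisances supplied at truth.
rng = np.random.default_rng(2)
n = 20000
x = rng.normal(size=n)
p = 1.0 / (1.0 + np.exp(-x))       # true propensity score
a = rng.binomial(1, p)
y = x + 2.0 * a + rng.normal(size=n)
est = aipw_ate(y, a, p, mu1=x + 2.0, mu0=x)
print(est)  # doubly robust ATE estimate
```

In practice the nuisance functions are estimated flexibly (e.g., by machine learning), and the empirical-process tools surveyed above are what justify the resulting estimator's asymptotic behavior.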