2,758 research outputs found
Streaming, Distributed Variational Inference for Bayesian Nonparametrics
This paper presents a methodology for creating streaming, distributed
inference algorithms for Bayesian nonparametric (BNP) models. In the proposed
framework, processing nodes receive a sequence of data minibatches, compute a
variational posterior for each, and make asynchronous streaming updates to a
central model. In contrast to previous algorithms, the proposed framework is
truly streaming, distributed, asynchronous, learning-rate-free, and
truncation-free. The key challenge in developing the framework, arising from
the fact that BNP models do not impose an inherent ordering on their
components, is finding the correspondence between minibatch and central BNP
posterior components before performing each update. To address this, the paper
develops a combinatorial optimization problem over component correspondences,
and provides an efficient solution technique. The paper concludes with an
application of the methodology to the DP mixture model, with experimental
results demonstrating its practical scalability and performance.Comment: This paper was presented at NIPS 2015. Please use the following
BibTeX citation: @inproceedings{Campbell15_NIPS, Author = {Trevor Campbell
and Julian Straub and John W. {Fisher III} and Jonathan P. How}, Title =
{Streaming, Distributed Variational Inference for Bayesian Nonparametrics},
Booktitle = {Advances in Neural Information Processing Systems (NIPS)}, Year
= {2015}
Speculative Approximations for Terascale Analytics
Model calibration is a major challenge faced by the plethora of statistical
analytics packages that are increasingly used in Big Data applications.
Identifying the optimal model parameters is a time-consuming process that has
to be executed from scratch for every dataset/model combination even by
experienced data scientists. We argue that the incapacity to evaluate multiple
parameter configurations simultaneously and the lack of support to quickly
identify sub-optimal configurations are the principal causes. In this paper, we
develop two database-inspired techniques for efficient model calibration.
Speculative parameter testing applies advanced parallel multi-query processing
methods to evaluate several configurations concurrently. The number of
configurations is determined adaptively at runtime, while the configurations
themselves are extracted from a distribution that is continuously learned
following a Bayesian process. Online aggregation is applied to identify
sub-optimal configurations early in the processing by incrementally sampling
the training dataset and estimating the objective function corresponding to
each configuration. We design concurrent online aggregation estimators and
define halting conditions to accurately and timely stop the execution. We apply
the proposed techniques to distributed gradient descent optimization -- batch
and incremental -- for support vector machines and logistic regression models.
We implement the resulting solutions in GLADE PF-OLA -- a state-of-the-art Big
Data analytics system -- and evaluate their performance over terascale-size
synthetic and real datasets. The results confirm that as many as 32
configurations can be evaluated concurrently almost as fast as one, while
sub-optimal configurations are detected accurately in as little as a
fraction of the time
- …