13,969 research outputs found
Stochastic Gradient Descent as Approximate Bayesian Inference
Stochastic Gradient Descent with a constant learning rate (constant SGD)
simulates a Markov chain with a stationary distribution. With this perspective,
we derive several new results. (1) We show that constant SGD can be used as an
approximate Bayesian posterior inference algorithm. Specifically, we show how
to adjust the tuning parameters of constant SGD to best match the stationary
distribution to a posterior, minimizing the Kullback-Leibler divergence between
these two distributions. (2) We demonstrate that constant SGD gives rise to a
new variational EM algorithm that optimizes hyperparameters in complex
probabilistic models. (3) We also propose SGD with momentum for sampling and
show how to adjust the damping coefficient accordingly. (4) We analyze MCMC
algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we
quantify the approximation errors due to finite learning rates. Finally (5), we
use the stochastic process perspective to give a short proof of why Polyak
averaging is optimal. Based on this idea, we propose a scalable approximate
MCMC algorithm, the Averaged Stochastic Gradient Sampler.
Comment: 35 pages, published version (JMLR 2017)
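The recipe in result (1) is easy to prototype: run SGD with a small constant learning rate past its initial convergence phase and treat the subsequent iterates as correlated draws from an approximate posterior. Below is a minimal sketch for Bayesian linear regression, where the exact posterior is Gaussian and the approximation can be checked directly; the step size, minibatch size, and burn-in length are illustrative choices, not the paper's tuned values.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy Bayesian linear regression: y = X w + noise, Gaussian prior on w.
N, D, noise_var, prior_var = 2000, 3, 0.25, 10.0
X = rng.normal(size=(N, D))
w_true = rng.normal(size=D)
y = X @ w_true + rng.normal(scale=np.sqrt(noise_var), size=N)

def grad_neg_log_post(w, idx):
    """Minibatch gradient of the negative log posterior, rescaled to full data."""
    Xb, yb = X[idx], y[idx]
    grad_lik = (N / len(idx)) * Xb.T @ (Xb @ w - yb) / noise_var
    return grad_lik + w / prior_var

eta, batch, burn_in, T = 1e-4, 32, 2000, 20000  # illustrative settings
w = np.zeros(D)
samples = []
for t in range(T):
    idx = rng.choice(N, size=batch, replace=False)
    w = w - eta * grad_neg_log_post(w, idx)
    if t >= burn_in:
        samples.append(w.copy())  # post-burn-in iterates ~ stationary distribution

samples = np.array(samples)
print("SGD iterate mean:", samples.mean(axis=0))
print("Exact posterior mean:",
      np.linalg.solve(X.T @ X / noise_var + np.eye(D) / prior_var,
                      X.T @ y / noise_var))
```

Matching the stationary covariance, not just the mean, is where the paper's KL-minimizing choices of learning rate and preconditioner come in.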
Black-box α-divergence Minimization
Black-box alpha (BB-α) is a new approximate inference method based on the minimization of α-divergences. BB-α scales to large datasets because it can be implemented using stochastic gradient descent. BB-α can be applied to complex probabilistic models with little effort since it only requires as input the likelihood function and its gradients. These gradients can be easily obtained using automatic differentiation. By changing the divergence parameter α, the method is able to interpolate between variational Bayes (VB) (α → 0) and an algorithm similar to expectation propagation (EP) (α = 1). Experiments on probit regression and neural network regression and classification problems show that BB-α with non-standard settings of α, such as α = 0.5, usually produces better predictions than with α → 0 (VB) or α = 1 (EP).
Comment: Accepted at ICML 2016. The first version (v1) was presented at NIPS workshops on Advances in Approximate Bayesian Inference and Black Box Learning and Inference.
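For reference, the α-divergence family behind this interpolation, in the convention common in the EP literature (whether the paper uses exactly this normalization is an assumption here), is

$$D_{\alpha}(p\,\|\,q) = \frac{1}{\alpha(1-\alpha)}\left(1 - \int p(\theta)^{\alpha}\, q(\theta)^{1-\alpha}\, d\theta\right),$$

which recovers $\mathrm{KL}(q\,\|\,p)$ (the VB objective) as $\alpha \to 0$ and $\mathrm{KL}(p\,\|\,q)$ (the EP objective) as $\alpha \to 1$; intermediate values such as $\alpha = 0.5$ trade off the mode-seeking and mass-covering behavior of the two limits.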
PASS-GLM: polynomial approximate sufficient statistics for scalable Bayesian GLM inference
Generalized linear models (GLMs) -- such as logistic regression, Poisson
regression, and robust regression -- provide interpretable models for diverse
data types. Probabilistic approaches, particularly Bayesian ones, allow
coherent estimates of uncertainty, incorporation of prior information, and
sharing of power across experiments via hierarchical models. In practice,
however, the approximate Bayesian methods necessary for inference have either
failed to scale to large data sets or failed to provide theoretical guarantees
on the quality of inference. We propose a new approach based on constructing
polynomial approximate sufficient statistics for GLMs (PASS-GLM). We
demonstrate that our method admits a simple algorithm as well as trivial
streaming and distributed extensions that do not compound error across
computations. We provide theoretical guarantees on the quality of point (MAP)
estimates, the approximate posterior, and posterior mean and uncertainty
estimates. We validate our approach empirically in the case of logistic
regression using a quadratic approximation and show competitive performance
with stochastic gradient descent, MCMC, and the Laplace approximation in terms
of speed and multiple measures of accuracy -- including on an advertising data
set with 40 million data points and 20,000 covariates.
Comment: In Proceedings of the 31st Annual Conference on Neural Information Processing Systems (NIPS 2017). v3: corrected typos in Appendix.
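The M = 2 instance for logistic regression is compact enough to sketch. Each log-likelihood term is φ(y_n x_n^T θ) with φ(s) = -log(1 + e^{-s}); replacing φ by a degree-2 polynomial makes the data enter only through fixed-dimension sums (the approximate sufficient statistics), computable in one streaming pass, after which the approximate log posterior is quadratic and the approximate MAP solves a linear system. The interval [-4, 4] and the plain least-squares fit below are illustrative stand-ins for the paper's Chebyshev construction.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy logistic regression data, labels y in {-1, +1}.
N, D = 100000, 5
X = rng.normal(size=(N, D))
theta_true = rng.normal(size=D)
y = np.where(rng.random(N) < 1 / (1 + np.exp(-X @ theta_true)), 1.0, -1.0)

# Degree-2 polynomial approximation of phi(s) = -log(1 + exp(-s)) on [-4, 4].
s = np.linspace(-4, 4, 200)
b2, b1, b0 = np.polyfit(s, -np.log1p(np.exp(-s)), deg=2)

# One streaming pass: data enter only through these fixed-size sums.
t1 = np.zeros(D)          # sum_n y_n x_n
t2 = np.zeros((D, D))     # sum_n (y_n x_n)(y_n x_n)^T
for start in range(0, N, 10000):            # pretend these are streamed chunks
    Z = y[start:start + 10000, None] * X[start:start + 10000]
    t1 += Z.sum(axis=0)
    t2 += Z.T @ Z

# Approximate log-likelihood: N*b0 + b1 * t1.theta + b2 * theta^T t2 theta.
# Quadratic in theta, so with a N(0, c I) prior the approximate MAP is the
# solution of a single linear system (b2 < 0, so the system is positive definite).
c = 10.0
theta_hat = np.linalg.solve(-2 * b2 * t2 + np.eye(D) / c, b1 * t1)
print("approx MAP:", theta_hat)
print("true theta:", theta_true)
```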
A Variational Analysis of Stochastic Gradient Algorithms
Stochastic Gradient Descent (SGD) is an important algorithm in machine
learning. With constant learning rates, it is a stochastic process that, after
an initial phase of convergence, generates samples from a stationary
distribution. We show that SGD with constant rates can be effectively used as
an approximate posterior inference algorithm for probabilistic modeling.
Specifically, we show how to adjust the tuning parameters of SGD so as to
match the resulting stationary distribution to the posterior. This analysis
rests on interpreting SGD as a continuous-time stochastic process and then
minimizing the Kullback-Leibler divergence between its stationary distribution
and the target posterior. (This is in the spirit of variational inference.) In
more detail, we model SGD as a multivariate Ornstein-Uhlenbeck process and then
use properties of this process to derive the optimal parameters. This
theoretical framework also connects SGD to modern scalable inference
algorithms; we analyze the recently proposed stochastic gradient Fisher scoring
under this perspective. We demonstrate that SGD with properly chosen constant
rates gives a new way to optimize hyperparameters in probabilistic models.
Comment: 8 pages, 3 figures
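The Ornstein-Uhlenbeck view makes the stationary-distribution claim concrete. Near an optimum, the iterates are modeled as a linear SDE, and the stationary covariance follows from a standard OU identity (the paper's notation additionally carries the learning rate, minibatch size, and gradient-noise covariance, which are folded into $A$ and $C$ below):

$$d\theta_t = -A\,\theta_t\,dt + C\,dW_t \quad\Longrightarrow\quad \theta_\infty \sim \mathcal{N}(0, \Sigma) \;\text{ with }\; A\Sigma + \Sigma A^{\top} = C C^{\top}.$$

Minimizing the KL divergence between this Gaussian and the target posterior over the free tuning parameters is then an explicit, solvable problem, which is how the optimal constant rates are derived.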
Provable Bayesian Inference via Particle Mirror Descent
Bayesian methods are appealing for their flexibility in modeling complex data
and their ability to capture uncertainty in parameters. However, when Bayes'
rule does not yield a tractable closed form, most approximate inference algorithms
lack either scalability or rigorous guarantees. To tackle this challenge, we
propose a simple yet provable algorithm, \emph{Particle Mirror Descent} (PMD),
to iteratively approximate the posterior density. PMD is inspired by stochastic
functional mirror descent where one descends in the density space using a small
batch of data points at each iteration, and by particle filtering where one
uses samples to approximate a function. We prove a result of the first kind: with $m$ particles, PMD provides a posterior density estimator that converges in terms of KL-divergence to the true posterior at rate $O(1/\sqrt{m})$. We demonstrate competitive empirical performance of PMD compared to several approximate inference algorithms in mixture models, logistic regression, sparse Gaussian processes and latent Dirichlet allocation on large scale datasets.
Comment: 38 pages, 26 figures
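PMD's update is easiest to see in a deliberately simplified setting: restrict the density to a fixed particle grid, and stochastic functional mirror descent with an entropy mirror map becomes a multiplicative update on the particle weights. The sketch below does exactly that for a 1-D Gaussian mean; the paper's actual algorithm additionally rejuvenates the particle locations via weighted kernel density estimation, which this sketch omits, and the step size, batch size, and grid are illustrative.

```python
import numpy as np

rng = np.random.default_rng(2)

# Target: posterior of a Gaussian mean with known unit variance.
N = 500
data = rng.normal(loc=1.5, scale=1.0, size=N)

# Fixed particle grid (PMD instead rejuvenates particles with KDE).
theta = np.linspace(-2, 4, 400)
logw = np.zeros_like(theta)          # uniform initial weights, in log space

def log_prior(t):
    return -0.5 * t**2 / 100.0       # broad N(0, 100) prior

eta, B = 0.1, 50                     # illustrative step size / batch size
for it in range(500):
    batch = rng.choice(data, size=B, replace=False)
    # Stochastic estimate of the unnormalized log posterior at each particle.
    loglik = -(N / B) * 0.5 * ((batch[None, :] - theta[:, None]) ** 2).sum(axis=1)
    grad = logw - (log_prior(theta) + loglik)   # grad of KL(q||post), up to const
    logw = logw - eta * grad                    # entropic mirror-descent step
    logw -= logw.max()
    logw -= np.log(np.exp(logw).sum())          # renormalize on the grid

w = np.exp(logw)
print("estimated posterior mean:", (w * theta).sum())
print("exact-ish posterior mean:", data.mean())
```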
Advances in Variational Inference
Many modern unsupervised or semi-supervised machine learning algorithms rely
on Bayesian probabilistic models. These models are usually intractable and thus
require approximate inference. Variational inference (VI) lets us approximate a
high-dimensional Bayesian posterior with a simpler variational distribution by
solving an optimization problem. This approach has been successfully used in
various models and large-scale applications. In this review, we give an
overview of recent trends in variational inference. We first introduce standard
mean field variational inference, then review recent advances focusing on the
following aspects: (a) scalable VI, which includes stochastic approximations,
(b) generic VI, which extends the applicability of VI to a large class of
otherwise intractable models, such as non-conjugate models, (c) accurate VI,
which includes variational models beyond the mean field approximation or with
atypical divergences, and (d) amortized VI, which implements the inference over
local latent variables with inference networks. Finally, we provide a summary
of promising future research directions.
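As an anchor for the taxonomy above: standard mean field VI maximizes the evidence lower bound (ELBO) over a factorized family,

$$\mathcal{L}(q) = \mathbb{E}_{q}\left[\log p(x, z)\right] - \mathbb{E}_{q}\left[\log q(z)\right] \le \log p(x), \qquad q(z) = \prod_i q_i(z_i),$$

with the coordinate ascent update $q_j^{\star}(z_j) \propto \exp\!\left(\mathbb{E}_{q_{-j}}\left[\log p(x, z)\right]\right)$. Trends (a)-(d) each relax one ingredient of this recipe: the full-data expectation, the closed-form update, the factorization (or the KL objective itself), and per-datapoint optimization of the local factors, respectively.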
Conjugate-Computation Variational Inference: Converting Variational Inference in Non-Conjugate Models to Inferences in Conjugate Models
Variational inference is computationally challenging in models that contain
both conjugate and non-conjugate terms. Methods specifically designed for
conjugate models, even though computationally efficient, find it difficult to
deal with non-conjugate terms. On the other hand, stochastic-gradient methods
can handle the non-conjugate terms but they usually ignore the conjugate
structure of the model which might result in slow convergence. In this paper,
we propose a new algorithm called Conjugate-computation Variational Inference
(CVI) which brings the best of the two worlds together -- it uses conjugate
computations for the conjugate terms and employs stochastic gradients for the
rest. We derive this algorithm by using a stochastic mirror-descent method in
the mean-parameter space, and then expressing each gradient step as a
variational inference in a conjugate model. We demonstrate our algorithm's
applicability to a large class of models and establish its convergence. Our
experimental results show that our method converges much faster than the
methods that ignore the conjugate structure of the model.
Comment: Published in AI-Stats 2017. Fixed some typos. This version contains a short paragraph in the conclusions section which we could not add in the conference version due to space constraints.
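The engine of CVI is a standard duality (stated schematically here; the paper's exact stepsize and splitting conventions may differ): for an exponential-family variational distribution with natural parameters $\lambda$ and mean parameters $\mu$, a stochastic mirror-descent step on the variational objective $\mathcal{L}$ in the mean-parameter space, with a KL proximity term, is a natural-gradient step in $\lambda$:

$$\mu_{t+1} = \arg\max_{\mu}\, \langle \nabla_{\mu}\mathcal{L}(\mu_t),\, \mu \rangle - \frac{1}{\beta_t}\,\mathrm{KL}\!\left[ q_{\mu} \,\|\, q_{\mu_t} \right] \quad\Longleftrightarrow\quad \lambda_{t+1} = \lambda_t + \beta_t\, \nabla_{\mu}\mathcal{L}(\mu_t).$$

For the conjugate terms of the model, $\nabla_{\mu}\mathcal{L}$ is available in closed form as a natural-parameter update, so only the non-conjugate remainder needs a stochastic gradient; each iteration then reduces to inference in a conjugate model with perturbed natural parameters, which is the conversion the title refers to.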
A Simple Baseline for Bayesian Uncertainty in Deep Learning
We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose
approach for uncertainty representation and calibration in deep learning.
Stochastic Weight Averaging (SWA), which computes the first moment of
stochastic gradient descent (SGD) iterates with a modified learning rate
schedule, has recently been shown to improve generalization in deep learning.
With SWAG, we fit a Gaussian using the SWA solution as the first moment and a
low rank plus diagonal covariance also derived from the SGD iterates, forming
an approximate posterior distribution over neural network weights; we then
sample from this Gaussian distribution to perform Bayesian model averaging. We
empirically find that SWAG approximates the shape of the true posterior, in
accordance with results describing the stationary distribution of SGD iterates.
Moreover, we demonstrate that SWAG performs well on a wide variety of tasks,
including out of sample detection, calibration, and transfer learning, in
comparison to many popular alternatives including MC dropout, KFAC Laplace,
SGLD, and temperature scaling.
Comment: Published at NeurIPS 2019
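The moment bookkeeping behind SWAG is light enough to sketch in a few lines. Below is a minimal numpy version under stated assumptions: the `iterates` array stands in for SGD weight checkpoints collected under the modified schedule, and the 1/√2 split between the diagonal and low-rank covariance parts follows my reading of the paper's sampling rule.

```python
import numpy as np

rng = np.random.default_rng(3)

d, K = 10, 20                     # weight dimension, rank of deviation buffer

# Stand-ins for SGD weight iterates along a training trajectory; in practice
# these are periodic checkpoints under a modified learning-rate schedule.
iterates = 1.0 + 0.1 * rng.normal(size=(100, d))

mean = np.zeros(d)
sq_mean = np.zeros(d)
dev_cols = []                     # last K deviations from the running mean
for i, w in enumerate(iterates, start=1):
    mean += (w - mean) / i        # running first moment (the SWA solution)
    sq_mean += (w**2 - sq_mean) / i
    dev_cols.append(w - mean)
    dev_cols = dev_cols[-K:]

D = np.stack(dev_cols, axis=1)                    # d x K deviation matrix
sigma_diag = np.maximum(sq_mean - mean**2, 1e-12) # diagonal covariance part

def sample_swag():
    """Draw one weight vector from the SWAG Gaussian posterior approximation."""
    z1, z2 = rng.normal(size=d), rng.normal(size=K)
    return (mean
            + np.sqrt(sigma_diag) * z1 / np.sqrt(2.0)
            + D @ z2 / np.sqrt(2.0 * (K - 1)))

# Bayesian model averaging would average predictions over such samples.
posterior_samples = np.stack([sample_swag() for _ in range(30)])
print("posterior mean of first weights:", posterior_samples.mean(axis=0)[:3])
```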
Distributed Bayesian Learning with Stochastic Natural-gradient Expectation Propagation and the Posterior Server
This paper makes two contributions to Bayesian machine learning algorithms.
Firstly, we propose stochastic natural gradient expectation propagation (SNEP),
a novel alternative to expectation propagation (EP), a popular variational
inference algorithm. SNEP is a black box variational algorithm, in that it does
not require any simplifying assumptions on the distribution of interest, beyond
the existence of some Monte Carlo sampler for estimating the moments of the EP
tilted distributions. Further, as opposed to EP which has no guarantee of
convergence, SNEP can be shown to be convergent, even when using Monte Carlo
moment estimates. Secondly, we propose a novel architecture for distributed
Bayesian learning which we call the posterior server. The posterior server
allows scalable and robust Bayesian learning in cases where a data set is
stored in a distributed manner across a cluster, with each compute node
containing a disjoint subset of data. An independent Monte Carlo sampler is run
on each compute node, with direct access only to the local data subset, but
which targets an approximation to the global posterior distribution given all
data across the whole cluster. This is achieved by using a distributed
asynchronous implementation of SNEP to pass messages across the cluster. We
demonstrate SNEP and the posterior server on distributed Bayesian learning of
logistic regression and neural networks.
Keywords: Distributed Learning, Large Scale Learning, Deep Learning, Bayesian Learning, Variational Inference, Expectation Propagation, Stochastic Approximation, Natural Gradient, Markov chain Monte Carlo, Parameter Server, Posterior Server.
Comment: 37 pages, 7 figures
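To make the message flow concrete, here is a deliberately miniature, synchronous caricature of the posterior server for a 1-D Gaussian mean. The real system is asynchronous, uses SNEP rather than the vanilla EP update shown, and estimates tilted moments with per-node Monte Carlo samplers; here everything is Gaussian, so the moments are exact and the natural-parameter bookkeeping stands out.

```python
import numpy as np

# Natural parameters of N(m, v) are (m/v, 1/v). Each "worker" holds a
# disjoint data shard; the server's posterior is prior x product of factors.
rng = np.random.default_rng(4)
shards = [rng.normal(loc=2.0, scale=1.0, size=50) for _ in range(4)]
noise_prec = 1.0                          # known observation precision

prior = np.array([0.0, 0.01])             # natural params of N(0, 100)
factors = [np.zeros(2) for _ in shards]   # each worker's approximating factor

for sweep in range(10):
    for k, data in enumerate(shards):
        global_nat = prior + sum(factors)   # server's current posterior
        cavity = global_nat - factors[k]    # remove this worker's contribution
        # Tilted distribution = cavity x local likelihood. All Gaussian here,
        # so its moments are exact; SNEP would estimate them by local MCMC.
        tilted_prec = cavity[1] + noise_prec * len(data)
        tilted_mp = cavity[0] + noise_prec * data.sum()
        # New factor: whatever must multiply the cavity to match the tilted
        # moments. Workers send this update back to the posterior server.
        factors[k] = np.array([tilted_mp, tilted_prec]) - cavity

post = prior + sum(factors)
print("posterior mean ~", post[0] / post[1], " posterior var ~", 1 / post[1])
```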
Scalable Training of Inference Networks for Gaussian-Process Models
Inference in Gaussian process (GP) models is computationally challenging for
large data, and often difficult to approximate with a small number of inducing
points. We explore an alternative approximation that employs stochastic
inference networks for flexible inference. Unfortunately, for such networks,
minibatch training makes it difficult to learn meaningful correlations over
function outputs for a large dataset. We propose an algorithm that enables
such training by tracking a stochastic, functional mirror-descent algorithm. At
each iteration, this only requires considering a finite number of input
locations, resulting in a scalable and easy-to-implement algorithm. Empirical
results show comparable and, sometimes, superior performance to existing sparse
variational GP methods.
Comment: ICML 2019. Updated results added in the camera-ready version.