
    Smoothed Gradients for Stochastic Variational Inference

    Stochastic variational inference (SVI) lets us scale up Bayesian computation to massive data. It uses stochastic optimization to fit a variational distribution, following easy-to-compute noisy natural gradients. As with most traditional stochastic optimization methods, SVI takes precautions to use unbiased stochastic gradients whose expectations are equal to the true gradients. In this paper, we explore the idea of following biased stochastic gradients in SVI. Our method replaces the natural gradient with a similarly constructed vector that uses a fixed-window moving average of some of its previous terms. We demonstrate the many advantages of this technique. First, its computational cost is the same as for SVI, and storage requirements only multiply by a constant factor. Second, it enjoys significant variance reduction over the unbiased estimates, smaller bias than averaged gradients, and smaller mean-squared error against the full gradient. We test our method on latent Dirichlet allocation with three large corpora.
    Comment: Appears in Neural Information Processing Systems, 201
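
    The fixed-window averaging can be sketched in a few lines. This is a minimal sketch, assuming plain averaging of the last L noisy natural-gradient estimates (the paper constructs the smoothed vector from intermediate terms of the natural gradient, which is not reproduced here); the names smoothed_svi_step, lam, rho, and L are illustrative.

    import numpy as np
    from collections import deque

    def smoothed_svi_step(lam, window, noisy_nat_grad, rho, L=10):
        # Keep only the last L noisy natural-gradient terms (fixed window).
        window.append(noisy_nat_grad)
        if len(window) > L:
            window.popleft()
        # Biased but lower-variance estimate: average over the window.
        smoothed = np.asarray(window).mean(axis=0)
        # Natural-gradient step on the variational parameter lam with step size rho.
        return lam + rho * smoothed

    # Usage: keep a single deque across iterations, e.g.
    #   window = deque()
    #   lam = smoothed_svi_step(lam, window, noisy_nat_grad_t, rho_t)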

    Probabilistic Line Searches for Stochastic Optimization

    In deterministic optimization, line searches are a standard tool ensuring stability and efficiency. Where only stochastic gradients are available, no direct equivalent has so far been formulated, because uncertain gradients do not allow for a strict sequence of decisions collapsing the search space. We construct a probabilistic line search by combining the structure of existing deterministic methods with notions from Bayesian optimization. Our method retains a Gaussian process surrogate of the univariate optimization objective, and uses a probabilistic belief over the Wolfe conditions to monitor the descent. The algorithm has very low computational cost and no user-controlled parameters. Experiments show that it effectively removes the need to define a learning rate for stochastic gradient descent.
    Comment: Extended version of the NIPS '15 conference paper, includes detailed pseudo-code, 59 pages, 35 figures
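
    A highly simplified sketch of the probabilistic acceptance test, assuming an RBF-kernel GP fit to noisy function values along the search direction and checking only the sufficient-decrease (Armijo) part of the Wolfe conditions in probability; the paper's surrogate and its full probabilistic Wolfe test are richer than this, and all names below are illustrative.

    import numpy as np
    from scipy.stats import norm

    def gp_posterior(ts, ys, t_query, noise=1e-2, ell=0.5, sf=1.0):
        # Posterior mean and std of a 1-D GP with an RBF kernel at t_query,
        # given noisy observations ys of the line-search objective at steps ts.
        ts, ys = np.asarray(ts, float), np.asarray(ys, float)
        k = lambda a, b: sf**2 * np.exp(-0.5 * (a[:, None] - b[None, :])**2 / ell**2)
        K = k(ts, ts) + noise**2 * np.eye(len(ts))
        ks = k(ts, np.atleast_1d(t_query))
        mu = ks.T @ np.linalg.solve(K, ys)
        var = sf**2 - np.sum(ks * np.linalg.solve(K, ks), axis=0)
        return mu[0], np.sqrt(max(var[0], 1e-12))

    def prob_sufficient_decrease(t, f0, df0, ts, ys, c1=1e-4):
        # P[f(t) <= f(0) + c1 * t * f'(0)] under the GP posterior: accept the
        # candidate step size t once this probability exceeds a chosen threshold.
        mu, sd = gp_posterior(ts, ys, t)
        return norm.cdf((f0 + c1 * t * df0 - mu) / sd)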

    Reducing Reparameterization Gradient Variance

    Optimization with noisy gradients has become ubiquitous in statistics and machine learning. Reparameterization gradients, or gradient estimates computed via the "reparameterization trick," represent a class of noisy gradients often used in Monte Carlo variational inference (MCVI). However, when these gradient estimators are too noisy, the optimization procedure can be slow or fail to converge. One way to reduce noise is to use more samples for the gradient estimate, but this can be computationally expensive. Instead, we view the noisy gradient as a random variable, and form an inexpensive approximation of the generating procedure for the gradient sample. This approximation has high correlation with the noisy gradient by construction, making it a useful control variate for variance reduction. We demonstrate our approach on non-conjugate multi-level hierarchical models and a Bayesian neural net where we observed gradient variance reductions of multiple orders of magnitude (20-2,000x).
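
    A generic control-variate combination, sketched under the assumption that a cheap approximation of the gradient sample with a known (or analytically computable) expectation is already available; the paper derives this approximation from the gradient-generating procedure itself, which is not reproduced here, and all names below are illustrative.

    import numpy as np

    def control_variate_gradient(noisy_grads, approx_grads, approx_mean):
        # noisy_grads:  (S, D) reparameterization gradient samples
        # approx_grads: (S, D) cheap, correlated approximation evaluated on the
        #               same random draws
        # approx_mean:  (D,)   known expectation of the approximation
        g = np.asarray(noisy_grads)
        h = np.asarray(approx_grads)
        # Per-dimension control-variate coefficient a = Cov(g, h) / Var(h).
        cov = np.mean((g - g.mean(0)) * (h - h.mean(0)), axis=0)
        a = cov / (h.var(0) + 1e-12)
        # Lower-variance estimate; unbiased when approx_mean equals E[approx_grads]
        # (estimating a from the same samples adds a small bias).
        return g.mean(0) - a * (h.mean(0) - approx_mean)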

    Bayesian filtering unifies adaptive and non-adaptive neural network optimization methods

    We formulate the problem of neural network optimization as Bayesian filtering, where the observations are the backpropagated gradients. While neural network optimization has previously been studied using natural gradient methods, which are closely related to Bayesian inference, those methods were unable to recover standard optimizers such as Adam and RMSprop with a root-mean-square gradient normalizer, instead yielding a mean-square normalizer. To recover the root-mean-square normalizer, we find it necessary to account for the temporal dynamics of all the other parameters as they are being optimized. The resulting optimizer, AdaBayes, adaptively transitions between SGD-like and Adam-like behaviour, automatically recovers AdamW, a state-of-the-art variant of Adam with decoupled weight decay, and has generalisation performance competitive with SGD.
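
    A loose sketch of the filtering view, assuming a diagonal Gaussian posterior per parameter and treating the backpropagated gradient as a noisy observation, so that a Kalman-style gain acts as a per-parameter adaptive step size. This is illustrative only and is not the paper's AdaBayes update; the function and parameter names are assumptions.

    import numpy as np

    def bayes_filter_step(w, s2, grad, obs_var=1.0, drift=1e-4):
        # w: parameter vector; s2: per-parameter posterior variance.
        # Diffuse the posterior so it can track a moving optimum as the other
        # parameters change (the "temporal dynamics" noted in the abstract).
        s2 = s2 + drift
        # Kalman-style gain: large when uncertain, small when confident,
        # so it behaves like an adaptive per-parameter learning rate.
        gain = s2 / (s2 + obs_var)
        w = w - gain * grad          # move against the observed gradient
        s2 = (1.0 - gain) * s2       # posterior variance shrinks after the update
        return w, s2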