151 research outputs found
The promises and pitfalls of Stochastic Gradient Langevin Dynamics
Stochastic Gradient Langevin Dynamics (SGLD) has emerged as a key MCMC algorithm for Bayesian learning from large scale datasets. While SGLD with decreasing step sizes converges weakly to the posterior distribution, the algorithm is often used with a constant step size in practice and has demonstrated successes in machine learning tasks. The current practice is to set the step size inversely proportional to N where N is the number of training samples. As N becomes large, we show that the SGLD algorithm has an invariant probability measure which significantly departs from the target posterior and behaves like Stochastic Gradient Descent (SGD). This difference is inherently due to the high variance of the stochastic gradients. Several strategies have been suggested to reduce this effect; among them, SGLD Fixed Point (SGLDFP) uses carefully designed control variates to reduce the variance of the stochastic gradients. We show that SGLDFP gives approximate samples from the posterior distribution, with an accuracy comparable to the Langevin Monte Carlo (LMC) algorithm for a computational cost sublinear in the number of data points. We provide a detailed analysis of the Wasserstein distances between LMC, SGLD, SGLDFP and SGD and explicit expressions of the means and covariance matrices of their invariant distributions. Our findings are supported by limited numerical experiments.
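As a hedged illustration (not the authors' code), the vanilla SGLD update and its fixed-point control-variate variant can be sketched on a toy conjugate Gaussian model; the model, step size, and batch size are assumptions chosen for readability:

```python
import numpy as np

rng = np.random.default_rng(0)
N, d = 1000, 2
data = rng.normal(1.0, 1.0, size=(N, d))  # toy data; N(theta, I) likelihood, N(0, I) prior

def stoch_grad(theta, idx):
    """Unbiased minibatch estimate of the gradient of the log posterior."""
    return -theta + (N / len(idx)) * np.sum(data[idx] - theta, axis=0)

def sgld(theta, step, batch_size, iters, theta_star=None):
    """Plain SGLD; if theta_star (an approximate mode) is given, apply
    SGLDFP-style control variates to reduce the gradient variance."""
    g_star = None
    if theta_star is not None:
        g_star = -theta_star + np.sum(data - theta_star, axis=0)  # full gradient at the mode
    samples = np.empty((iters, d))
    for k in range(iters):
        idx = rng.integers(0, N, size=batch_size)
        g = stoch_grad(theta, idx)
        if g_star is not None:
            # control variate: full gradient at the mode plus a minibatch correction
            g = g_star + (g - stoch_grad(theta_star, idx))
        theta = theta + 0.5 * step * g + np.sqrt(step) * rng.normal(size=d)
        samples[k] = theta
    return samples
```

For this conjugate model the posterior is N(sum(data)/(N+1), I/(N+1)), so the fixed-point variant with theta_star near the posterior mode and a step of order 1/N stays close to the target, whereas the invariant measure of plain SGLD inflates with the variance of the stochastic gradients.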
Hamiltonian monte carlo with energy conserving subsampling
© 2019 Khue-Dung Dang, Matias Quiroz, Robert Kohn, Minh-Ngoc Tran, Mattias Villani. Hamiltonian Monte Carlo (HMC) samples efficiently from high-dimensional posterior distributions with proposed parameter draws obtained by iterating on a discretized version of the Hamiltonian dynamics. The iterations make HMC computationally costly, especially in problems with large data sets, since it is necessary to compute posterior densities and their derivatives with respect to the parameters. Naively computing the Hamiltonian dynamics on a subset of the data causes HMC to lose its key ability to generate distant parameter proposals with high acceptance probability. The key insight in our article is that efficient subsampling HMC for the parameters is possible if both the dynamics and the acceptance probability are computed from the same data subsample in each complete HMC iteration. We show that this is possible to do in a principled way in an HMC-within-Gibbs framework where the subsample is updated using a pseudo-marginal MH step and the parameters are then updated using an HMC step, based on the current subsample. We show that our subsampling methods are fast and compare favorably to two popular sampling algorithms that use gradient estimates from data subsampling. We also explore the current limitations of subsampling HMC algorithms by varying the quality of the variance reducing control variates used in the estimators of the posterior density and its gradients.
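The key structural idea, leapfrog dynamics and the MH acceptance test computed from the same data subsample inside an HMC-within-Gibbs loop, can be sketched on a toy Gaussian model. This is only a schematic sketch, not the authors' implementation: it omits the variance-reducing control variates and the careful pseudo-marginal construction, so the subsample estimate of the potential is noisy and the scheme mixes poorly; it illustrates the structure only.

```python
import numpy as np

rng = np.random.default_rng(1)
N, d = 1000, 1
data = rng.normal(2.0, 1.0, size=(N, d))  # toy data; N(theta, 1) likelihood, N(0, 1) prior

def U_hat(theta, idx):
    """Subsample estimate of the negative log posterior."""
    return 0.5 * theta @ theta + (N / len(idx)) * 0.5 * np.sum((data[idx] - theta) ** 2)

def grad_U_hat(theta, idx):
    return theta - (N / len(idx)) * np.sum(data[idx] - theta, axis=0)

def subsampling_hmc(theta, iters=500, n_sub=100, step=0.02):
    idx = rng.integers(0, N, size=n_sub)
    samples = np.empty((iters, d))
    for k in range(iters):
        # Gibbs step 1: refresh the subsample with a Metropolis-Hastings move
        idx_new = rng.integers(0, N, size=n_sub)
        if np.log(rng.random()) < U_hat(theta, idx) - U_hat(theta, idx_new):
            idx = idx_new
        # Gibbs step 2: HMC on theta -- dynamics AND acceptance use the same subsample
        L = int(rng.integers(5, 15))  # jittered trajectory length
        p = rng.normal(size=d)
        th, q = theta.copy(), p - 0.5 * step * grad_U_hat(theta, idx)
        for i in range(L):
            th = th + step * q
            if i < L - 1:
                q = q - step * grad_U_hat(th, idx)
        q = q - 0.5 * step * grad_U_hat(th, idx)
        dH = (U_hat(theta, idx) + 0.5 * p @ p) - (U_hat(th, idx) + 0.5 * q @ q)
        if np.log(rng.random()) < dH:
            theta = th
        samples[k] = theta
    return samples
```

Because both the leapfrog gradients and the energies in dH come from the same idx, the HMC step is a valid MH move conditional on the subsample; the control variates described in the article are what make the subsample update itself behave well.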
Langevin Quasi-Monte Carlo
Langevin Monte Carlo (LMC) and its stochastic gradient versions are powerful
algorithms for sampling from complex high-dimensional distributions. To sample
from a distribution with density π(θ) ∝ exp(−U(θ)), LMC
iteratively generates the next sample by taking a step in the direction of
the gradient of log π with added Gaussian perturbations. Expectations w.r.t.
the target distribution are estimated by averaging over the LMC samples. In
ordinary Monte Carlo, it is well known that the estimation error can be
substantially reduced by replacing independent random samples by quasi-random
samples like low-discrepancy sequences. In this work, we show that the
estimation error of LMC can also be reduced by using quasi-random samples.
Specifically, we propose to use completely uniformly distributed (CUD)
sequences with certain low-discrepancy property to generate the Gaussian
perturbations. Under smoothness and convexity conditions, we prove that LMC
with a low-discrepancy CUD sequence achieves smaller error than standard LMC.
The theoretical analysis is supported by compelling numerical experiments,
which demonstrate the effectiveness of our approach.
Efficient and Generalizable Tuning Strategies for Stochastic Gradient MCMC
Stochastic gradient Markov chain Monte Carlo (SGMCMC) is a popular class of
algorithms for scalable Bayesian inference. However, these algorithms include
hyperparameters such as step size or batch size that influence the accuracy of
estimators based on the obtained posterior samples. As a result, these
hyperparameters must be tuned by the practitioner and currently no principled
and automated way to tune them exists. Standard MCMC tuning methods based on
acceptance rates cannot be used for SGMCMC, thus requiring alternative tools
and diagnostics. We propose a novel bandit-based algorithm that tunes the
SGMCMC hyperparameters by minimizing the Stein discrepancy between the true
posterior and its Monte Carlo approximation. We provide theoretical results
supporting this approach and assess various Stein-based discrepancies. We
support our results with experiments on both simulated and real datasets, and
find that this method is practical for a wide range of applications.
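A toy sketch of the ingredients: an ε-greedy bandit over candidate step sizes, scored by a kernel Stein discrepancy with an inverse-multiquadric kernel, for an unadjusted Langevin chain targeting N(0, 1). The paper's actual bandit algorithm and discrepancy choices differ; every constant here is an assumption:

```python
import numpy as np

rng = np.random.default_rng(2)

def ksd2(x):
    """Squared kernel Stein discrepancy (V-statistic) against N(0, 1),
    using the IMQ kernel k(x, y) = (1 + (x - y)^2)^(-1/2); score s(x) = -x."""
    r = x[:, None] - x[None, :]
    b = 1.0 + r ** 2
    k0 = (x[:, None] * x[None, :]) * b ** -0.5 \
         + (1.0 - r ** 2) * b ** -1.5 - 3.0 * r ** 2 * b ** -2.5
    return k0.mean()

def ula_samples(h, n=400, burn=100):
    """Unadjusted Langevin chain for N(0, 1): x' = x - h*x + sqrt(2h)*xi."""
    x, out = 0.0, np.empty(n)
    for k in range(n + burn):
        x = x - h * x + np.sqrt(2.0 * h) * rng.normal()
        if k >= burn:
            out[k - burn] = x
    return out

def tune_step(arms, rounds=30, eps=0.2):
    """Epsilon-greedy bandit: pull an arm, run a short chain, score it by KSD."""
    scores, pulls = np.zeros(len(arms)), np.zeros(len(arms))
    for _ in range(rounds):
        if rng.random() < eps or pulls.min() == 0:
            a = int(np.argmin(pulls))           # explore least-pulled arm
        else:
            a = int(np.argmin(scores / pulls))  # exploit current best mean KSD
        scores[a] += ksd2(ula_samples(arms[a]))
        pulls[a] += 1
    return arms[int(np.argmin(scores / np.maximum(pulls, 1)))]

best = tune_step([0.1, 0.5, 1.9])
```

The over-large step 1.9 inflates the chain's stationary variance far beyond the target's, which the Stein discrepancy detects, so the bandit settles on one of the smaller steps.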
Partitioned integrators for thermodynamic parameterization of neural networks
Traditionally, neural networks are parameterized using optimization
procedures such as stochastic gradient descent, RMSProp and ADAM. These
procedures tend to drive the parameters of the network toward a local minimum.
In this article, we employ alternative "sampling" algorithms (referred to here
as "thermodynamic parameterization methods") which rely on discretized
stochastic differential equations for a defined target distribution on
parameter space. We show that the thermodynamic perspective already improves
neural network training. Moreover, by partitioning the parameters based on
natural layer structure we obtain schemes with very rapid convergence for data
sets with complicated loss landscapes.
We describe easy-to-implement hybrid partitioned numerical algorithms, based
on discretized stochastic differential equations, which are adapted to
feed-forward neural networks, including a multi-layer Langevin algorithm,
AdLaLa (combining the adaptive Langevin and Langevin algorithms) and LOL
(combining Langevin and Overdamped Langevin); we examine the convergence of
these methods using numerical studies and compare their performance among
themselves and in relation to standard alternatives such as stochastic gradient
descent and ADAM. We present evidence that thermodynamic parameterization
methods can be (i) faster, (ii) more accurate, and (iii) more robust than
standard algorithms used within machine learning frameworks.
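The partitioned idea can be sketched loosely in the spirit of the LOL scheme above (this is not a faithful AdLaLa implementation): one parameter block follows Euler-discretized underdamped Langevin dynamics, the other overdamped Langevin, on a toy quadratic loss; step size, friction, and temperature are assumptions:

```python
import numpy as np

rng = np.random.default_rng(3)
a = np.array([1.0, 2.0, 3.0, 4.0])  # curvatures of a toy quadratic loss

def loss(w):
    return 0.5 * np.sum(a * w ** 2)

def grad(w):
    return a * w

def partitioned_langevin(w, steps=3000, h=0.01, gamma=1.0, tau=1e-3):
    """First block: underdamped Langevin (momentum p, friction gamma);
    second block: overdamped Langevin. Both sample approx exp(-loss/tau)."""
    nA = len(w) // 2
    p = np.zeros(nA)
    for _ in range(steps):
        g = grad(w)
        # underdamped block: momentum update with friction and thermal noise
        p += -h * g[:nA] - gamma * h * p + np.sqrt(2.0 * gamma * h * tau) * rng.normal(size=nA)
        w[:nA] += h * p
        # overdamped block: direct position update with thermal noise
        w[nA:] += -h * g[nA:] + np.sqrt(2.0 * h * tau) * rng.normal(size=len(w) - nA)
    return w

w = partitioned_langevin(np.ones(4))
```

At low temperature tau both blocks concentrate near the loss minimum, so the sampler doubles as a trainer; partitioning lets each block get a discretization and temperature suited to its role.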
Stochastic Gradient Langevin Dynamics Based on Quantized Optimization
Stochastic learning dynamics based on Langevin or Lévy stochastic
differential equations (SDEs) in deep neural networks control the noise
variance by varying the mini-batch size or the scale of the directly
injected noise. Since the noise variance affects the approximation
performance, the design of the additive noise is important in SDE-based
learning and its practical implementation. In this paper, we propose an
alternative stochastic descent
learning equation based on quantized optimization for non-convex objective
functions, adopting a stochastic analysis perspective. The proposed method
employs a quantized optimization approach that utilizes Langevin SDE dynamics,
allowing for controllable noise with an identical distribution without the need
for additive noise or adjusting the mini-batch size. Numerical experiments
demonstrate the effectiveness of the proposed algorithm on vanilla
convolutional neural network (CNN) models and the ResNet-50 architecture
across various data sets. Furthermore, we provide a simple PyTorch
implementation of the proposed algorithm.
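Schematically (this is not the paper's algorithm, only an illustration of the underlying idea that quantization can supply controlled noise without explicit additive Gaussians or mini-batch tuning), stochastic rounding of the gradient onto a grid of width delta yields an unbiased update whose rounding error plays the role of injected noise:

```python
import numpy as np

rng = np.random.default_rng(4)

def stochastic_round(g, delta):
    """Quantize g onto a grid of width delta with unbiased stochastic rounding;
    the rounding error acts as zero-mean, bounded injected noise."""
    lo = np.floor(g / delta)
    frac = g / delta - lo
    return delta * (lo + (rng.random(g.shape) < frac))

def quantized_descent(w, grad, lr=0.1, delta=0.05, steps=500):
    """Descent driven by the quantized gradient; noise scale is set by delta
    rather than by a mini-batch size or an explicit additive-noise term."""
    for _ in range(steps):
        w = w - lr * stochastic_round(grad(w), delta)
    return w

# toy quadratic objective: f(w) = 0.5 * ||w - 1||^2, so grad(w) = w - 1
w = quantized_descent(np.zeros(8), lambda w: w - 1.0)
```

Because the rounding is unbiased, the iterates drift to the minimizer in expectation while fluctuating within a band controlled by lr and delta, the "controllable noise" the abstract refers to.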