Optimal Matrix Momentum Stochastic Approximation and Applications to Q-learning
Acceleration is an increasingly common theme in the stochastic optimization
literature. The two most common examples are Nesterov's method and Polyak's
momentum technique. In this paper, two new algorithms are introduced for root
finding problems: 1) PolSA is a root finding algorithm with specially designed
matrix momentum, and 2) NeSA can be regarded as a variant of Nesterov's
algorithm, or a simplification of PolSA. The PolSA algorithm is new even in the
context of optimization (when cast as a root finding problem).
The research surveyed in this paper is motivated by applications to
reinforcement learning. It is well known that most variants of TD- and
Q-learning may be cast as SA (stochastic approximation) algorithms, and the
tools from general SA theory can be used to investigate convergence and bounds
on convergence rate. In particular, the asymptotic variance is a common metric
of performance for SA algorithms, and is also one among many metrics used in
assessing the performance of stochastic optimization algorithms. Two SA
techniques are known to have optimal asymptotic variance: the Ruppert-Polyak
averaging technique and stochastic Newton-Raphson (SNR).
The former algorithm can have extremely bad transient performance, and the
latter can be computationally expensive. It is demonstrated here that parameter
estimates from the new PolSA algorithm couple with those of the ideal (but more
complex) SNR algorithm. The new algorithm is thus a third approach to obtain
optimal asymptotic covariance.
These strong results require assumptions on the model. A linearized model is
considered, and the noise is assumed to be a martingale difference sequence.
Numerical results are obtained in a non-linear setting that is the motivation
for this work: in PolSA implementations of Q-learning, it is observed that
coupling with SNR occurs even in this non-ideal setting.
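Since the abstract contrasts the new algorithms with the two classical baselines it names, here is a minimal illustrative sketch of those baselines only: plain SA with Ruppert-Polyak averaging and a stochastic Newton-Raphson recursion, run on a toy linear root-finding problem with martingale-difference noise. The toy model, step sizes, and function names are my own assumptions; this is not the paper's PolSA or NeSA recursion.

```python
import numpy as np

# Toy linear root-finding problem: find theta* with A @ theta* = b, observed
# through noisy samples f(theta, W) = b - A @ theta + W (W: martingale noise).
rng = np.random.default_rng(0)
d = 5
A = np.diag(np.linspace(0.5, 2.0, d))           # assumed positive definite
theta_star = rng.standard_normal(d)
b = A @ theta_star

def noisy_f(theta):
    return b - A @ theta + 0.1 * rng.standard_normal(d)

def polyak_ruppert(n_iters=20000):
    """Plain SA with step size n**-0.7, plus Ruppert-Polyak iterate averaging."""
    theta = np.zeros(d)
    avg = np.zeros(d)
    for n in range(1, n_iters + 1):
        theta = theta + (n ** -0.7) * noisy_f(theta)
        avg += (theta - avg) / n                # running average of the iterates
    return avg

def stochastic_newton_raphson(n_iters=20000):
    """SNR: maintain an estimate of the linearization and premultiply by its inverse."""
    theta = np.zeros(d)
    A_hat = np.eye(d)                           # running estimate of -df/dtheta
    for n in range(1, n_iters + 1):
        A_hat += (A - A_hat) / n                # here A is observed directly, for simplicity
        theta = theta + (1.0 / n) * np.linalg.solve(A_hat, noisy_f(theta))
    return theta

print(np.linalg.norm(polyak_ruppert() - theta_star))
print(np.linalg.norm(stochastic_newton_raphson() - theta_star))
```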
Stochastic Modified Equations and Dynamics of Stochastic Gradient Algorithms I: Mathematical Foundations
We develop the mathematical foundations of the stochastic modified equations
(SME) framework for analyzing the dynamics of stochastic gradient algorithms,
in which the latter are approximated by a class of stochastic differential
equations with small noise parameters. We prove that this approximation can be
understood mathematically as a weak approximation, which leads to a number of
precise and useful results on the approximations of stochastic gradient descent
(SGD), momentum SGD and stochastic Nesterov's accelerated gradient method in
the general setting of stochastic objectives. We also demonstrate through
explicit calculations that this continuous-time approach can uncover important
analytical insights into the stochastic gradient algorithms under consideration
that may not be easy to obtain in a purely discrete-time setting.
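As a rough illustration of the continuous-time viewpoint, the sketch below compares SGD on a one-dimensional quadratic with an Euler-Maruyama discretization of the commonly cited first-order modified equation dX = -grad f(X) dt + sqrt(eta) * sigma dW. The objective, noise level, and step counts are illustrative assumptions, not taken from the paper.

```python
import numpy as np

rng = np.random.default_rng(1)

# 1-D quadratic f(x) = 0.5 * a * x^2 with additive gradient noise of std sigma.
a, sigma = 1.0, 0.5
eta = 0.05          # SGD learning rate (the small parameter in the SME framework)
T = 200             # number of SGD steps; the SDE is run up to time eta * T

def sgd_path(x0=2.0):
    x, xs = x0, [x0]
    for _ in range(T):
        g = a * x + sigma * rng.standard_normal()      # stochastic gradient
        x = x - eta * g
        xs.append(x)
    return np.array(xs)

def sme_path(x0=2.0):
    # Euler-Maruyama discretization of dX = -a X dt + sqrt(eta) * sigma dW,
    # with time step dt = eta so the two paths can be compared step by step.
    x, xs, dt = x0, [x0], eta
    for _ in range(T):
        x = x - a * x * dt + np.sqrt(eta) * sigma * np.sqrt(dt) * rng.standard_normal()
        xs.append(x)
    return np.array(xs)

# Both paths fluctuate around 0 with variance of order eta * sigma**2 / (2 * a).
print(sgd_path()[T // 2:].var(), sme_path()[T // 2:].var())
```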
Fluctuation-dissipation relations for stochastic gradient descent
The notion of the stationary equilibrium ensemble has played a central role
in statistical mechanics. In machine learning as well, training serves as
generalized equilibration that drives the probability distribution of model
parameters toward stationarity. Here, we derive stationary
fluctuation-dissipation relations that link measurable quantities and
hyperparameters in the stochastic gradient descent algorithm. These relations
hold exactly for any stationary state and can in particular be used to
adaptively set the training schedule. We can further use the relations to
efficiently extract information pertaining to a loss-function landscape such as
the magnitudes of its Hessian and anharmonicity. Our claims are empirically
verified.
Comment: 15 pages, 6 figures; v2: final version accepted at ICLR 2019, with
derivations/assumptions clarified and Adam/AMSGrad experiments added.
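A minimal sketch of how such a stationary relation could be monitored during training is given below. The specific relation assumed here, that the time average of theta . g_B matches (eta / 2) times the time average of ||g_B||^2 for plain SGD without momentum at stationarity, is my reading of the simplest fluctuation-dissipation relation and should be checked against the paper; the logging interface is hypothetical.

```python
import numpy as np

def fdr_monitor(theta_stream, grad_stream, lr, warmup=1000):
    """Compare the two time-averaged observables entering the assumed stationary relation.

    theta_stream, grad_stream: iterables of parameter vectors and minibatch gradients
    recorded once per SGD step (hypothetical logging interface, not from the paper).
    Returns (lhs, rhs), where lhs = <theta . g_B> and rhs = (lr / 2) * <||g_B||^2>
    should agree once the parameter distribution has become stationary.
    """
    lhs_sum = rhs_sum = 0.0
    count = 0
    for step, (theta, g) in enumerate(zip(theta_stream, grad_stream)):
        if step < warmup:              # discard the transient before averaging
            continue
        lhs_sum += float(np.dot(theta, g))
        rhs_sum += 0.5 * lr * float(np.dot(g, g))
        count += 1
    return lhs_sum / count, rhs_sum / count

# Possible use: once lhs / rhs stays close to 1 for a while, the chain looks stationary
# and the learning rate can be decreased according to a chosen schedule.
```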
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have the
potential to speed up training of neural networks by correcting for the
curvature of the loss function. Unfortunately, the exact natural gradient is
impractical to compute for large models, and most approximations either require
an expensive iterative procedure or make crude approximations to the curvature.
We present Kronecker Factors for Convolution (KFC), a tractable approximation
to the Fisher matrix for convolutional networks based on a structured
probabilistic model for the distribution over backpropagated derivatives.
Similarly to the recently proposed Kronecker-Factored Approximate Curvature
(K-FAC), each block of the approximate Fisher matrix decomposes as the
Kronecker product of small matrices, allowing for efficient inversion. KFC
captures important curvature information while still yielding comparably
efficient updates to stochastic gradient descent (SGD). We show that the
updates are invariant to commonly used reparameterizations, such as centering
of the activations. In our experiments, approximate natural gradient descent
with KFC was able to train convolutional networks several times faster than
carefully tuned SGD. Furthermore, it was able to train the networks in 10-20
times fewer iterations than SGD, suggesting its potential applicability in a
distributed setting.
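The computational point shared by K-FAC and KFC, that a Kronecker-factored Fisher block can be inverted by inverting two small factors, can be illustrated for a fully connected layer as follows. The statistics A (input activations) and S (backpropagated derivatives) and the identity used are standard for K-FAC; the convolutional structure that KFC adds is not reproduced in this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)
d_in, d_out, batch = 8, 4, 256

# Per-example inputs a and backpropagated derivatives g for one fully connected layer.
a = rng.standard_normal((batch, d_in))
g = rng.standard_normal((batch, d_out))

# Kronecker factors: second moments of activations and of backpropagated derivatives.
A = a.T @ a / batch + 1e-3 * np.eye(d_in)      # (d_in, d_in), damped for invertibility
S = g.T @ g / batch + 1e-3 * np.eye(d_out)     # (d_out, d_out)

G = rng.standard_normal((d_out, d_in))         # gradient of the loss w.r.t. the weights

# Approximate natural-gradient step: invert two small factors instead of one
# (d_in * d_out) x (d_in * d_out) matrix.
precond_G = np.linalg.solve(S, G) @ np.linalg.inv(A)

# Check against the explicit Kronecker-product Fisher block (column-major vec convention).
F_block = np.kron(A, S)
vec = lambda M: M.reshape(-1, order="F")
assert np.allclose(np.linalg.solve(F_block, vec(G)), vec(precond_G))
```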
Weight Normalization: A Simple Reparameterization to Accelerate Training of Deep Neural Networks
We present weight normalization: a reparameterization of the weight vectors
in a neural network that decouples the length of those weight vectors from
their direction. By reparameterizing the weights in this way we improve the
conditioning of the optimization problem and we speed up convergence of
stochastic gradient descent. Our reparameterization is inspired by batch
normalization but does not introduce any dependencies between the examples in a
minibatch. This means that our method can also be applied successfully to
recurrent models such as LSTMs and to noise-sensitive applications such as deep
reinforcement learning or generative models, for which batch normalization is
less well suited. Although our method is much simpler, it still provides much
of the speed-up of full batch normalization. In addition, the computational
overhead of our method is lower, permitting more optimization steps to be taken
in the same amount of time. We demonstrate the usefulness of our method on
applications in supervised image recognition, generative modelling, and deep
reinforcement learning.
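The reparameterization itself, w = (g / ||v||) v, is simple enough to sketch directly, together with the gradients it induces via the chain rule; the toy single-neuron example below is illustrative, and in practice the backward pass is handled by automatic differentiation.

```python
import numpy as np

def weightnorm_forward(v, g, x):
    """Compute y = w . x with the weight-normalized parameterization w = (g / ||v||) * v."""
    norm_v = np.linalg.norm(v)
    w = (g / norm_v) * v
    return w @ x, norm_v

def weightnorm_backward(grad_w, v, g, norm_v):
    """Map a gradient w.r.t. w back to gradients w.r.t. the new parameters (v, g)."""
    grad_g = (grad_w @ v) / norm_v
    grad_v = (g / norm_v) * grad_w - (g * grad_g / norm_v ** 2) * v
    return grad_v, grad_g

# Example: one "neuron" with a squared-error loss.
rng = np.random.default_rng(0)
v, g = rng.standard_normal(3), 1.0
x, target = rng.standard_normal(3), 0.7
y, norm_v = weightnorm_forward(v, g, x)
grad_w = (y - target) * x                      # dL/dw for L = 0.5 * (y - target)^2
grad_v, grad_g = weightnorm_backward(grad_w, v, g, norm_v)
```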
On the Acceleration of L-BFGS with Second-Order Information and Stochastic Batches
This paper proposes a framework of L-BFGS based on the (approximate)
second-order information with stochastic batches, as a novel approach to the
finite-sum minimization problems. Unlike classical L-BFGS, where stochastic
batches lead to instability, we use a smoothed estimate of the gradient
differences and achieve acceleration by well-scaling the initial Hessians. We
provide theoretical analyses for both convex and nonconvex cases. In addition,
we demonstrate that for the popular least-squares and cross-entropy losses, the
algorithm admits a simple implementation in the distributed environment.
Numerical experiments support the efficiency of our algorithms.
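For reference, the machinery the abstract builds on is the standard L-BFGS two-loop recursion with the usual gamma = s.y / y.y scaling of the initial Hessian, sketched below. The paper's batch smoothing of the gradient differences is not reproduced here; this is only the generic recursion.

```python
import numpy as np

def lbfgs_direction(grad, s_list, y_list):
    """Two-loop recursion: returns an approximation of H^{-1} grad from the stored
    curvature pairs (s_k, y_k) = (x_{k+1} - x_k, grad_{k+1} - grad_k)."""
    q = grad.copy()
    rhos = [1.0 / (y @ s) for s, y in zip(s_list, y_list)]
    alphas = []
    for s, y, rho in zip(reversed(s_list), reversed(y_list), reversed(rhos)):
        alpha = rho * (s @ q)
        q -= alpha * y
        alphas.append(alpha)
    # Scale the initial inverse Hessian by gamma = s^T y / y^T y (most recent pair).
    gamma = (s_list[-1] @ y_list[-1]) / (y_list[-1] @ y_list[-1]) if s_list else 1.0
    r = gamma * q
    for (s, y, rho), alpha in zip(zip(s_list, y_list, rhos), reversed(alphas)):
        beta = rho * (y @ r)
        r += (alpha - beta) * s
    return r    # the descent direction is -r
```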
Stochastic Gradient Descent as Approximate Bayesian Inference
Stochastic Gradient Descent with a constant learning rate (constant SGD)
simulates a Markov chain with a stationary distribution. With this perspective,
we derive several new results. (1) We show that constant SGD can be used as an
approximate Bayesian posterior inference algorithm. Specifically, we show how
to adjust the tuning parameters of constant SGD to best match the stationary
distribution to a posterior, minimizing the Kullback-Leibler divergence between
these two distributions. (2) We demonstrate that constant SGD gives rise to a
new variational EM algorithm that optimizes hyperparameters in complex
probabilistic models. (3) We also propose SGD with momentum for sampling and
show how to adjust the damping coefficient accordingly. (4) We analyze MCMC
algorithms. For Langevin Dynamics and Stochastic Gradient Fisher Scoring, we
quantify the approximation errors due to finite learning rates. Finally (5), we
use the stochastic process perspective to give a short proof of why Polyak
averaging is optimal. Based on this idea, we propose a scalable approximate
MCMC algorithm, the Averaged Stochastic Gradient Sampler.
Comment: 35 pages, published version (JMLR 2017).
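The sampling viewpoint in point (1) can be sketched as follows: run SGD with a fixed learning rate, discard a burn-in, and treat the subsequent iterates as approximate posterior samples. The toy Gaussian model, flat-prior simplification, and step size below are illustrative assumptions; none of the paper's formulas for tuning the constant learning rate are reproduced.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy model: Gaussian data with unknown mean and unit variance.
data = rng.normal(loc=1.5, scale=1.0, size=500)
N = len(data)

def minibatch_grad(theta, batch_size=32):
    """Stochastic gradient of the negative log-likelihood (flat-prior assumption)."""
    batch = rng.choice(data, size=batch_size, replace=False)
    return N * (theta - batch.mean())          # rescaled minibatch gradient

eta = 1e-4                                     # constant learning rate
theta, samples = 0.0, []
for step in range(20000):
    theta -= eta * minibatch_grad(theta)
    if step >= 5000:                           # discard burn-in, then collect iterates
        samples.append(theta)

# The sample mean sits near data.mean(); the spread depends on eta and the batch size,
# which is exactly what the paper's tuning of constant SGD is meant to control.
print(np.mean(samples), np.var(samples))
```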
Kalman Gradient Descent: Adaptive Variance Reduction in Stochastic Optimization
We introduce Kalman Gradient Descent, a stochastic optimization algorithm
that uses Kalman filtering to adaptively reduce gradient variance in stochastic
gradient descent by filtering the gradient estimates. We present both a
theoretical analysis of convergence in a non-convex setting and experimental
results which demonstrate improved performance on a variety of machine learning
problems, including neural networks and black-box variational inference. We also
present a distributed version of our algorithm that enables large-dimensional
optimization, and we extend our algorithm to SGD with momentum and RMSProp.
Comment: 25 pages, 5 figures.
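The general idea of filtering noisy gradients before taking a step can be sketched with a per-coordinate Kalman filter whose hidden state is the "true" gradient. The random-walk state model, noise variances, and class name below are my own illustrative choices, not the paper's construction.

```python
import numpy as np

class KalmanFilteredSGD:
    """Per-coordinate Kalman filter over the gradient: the hidden state is the true
    gradient, assumed to follow a random walk, and each stochastic gradient is a
    noisy observation of it (illustrative model choices)."""

    def __init__(self, dim, lr=0.1, process_var=1e-3, obs_var=1e-1):
        self.lr = lr
        self.q = process_var                   # random-walk (process) noise variance
        self.r = obs_var                       # observation noise variance
        self.mean = np.zeros(dim)              # filtered gradient estimate
        self.var = np.ones(dim)                # its posterior variance

    def step(self, params, noisy_grad):
        # Predict: the random-walk model keeps the mean and inflates the variance.
        var_pred = self.var + self.q
        # Update: standard scalar Kalman gain, applied coordinate-wise.
        gain = var_pred / (var_pred + self.r)
        self.mean = self.mean + gain * (noisy_grad - self.mean)
        self.var = (1.0 - gain) * var_pred
        # Descend along the filtered gradient instead of the raw one.
        return params - self.lr * self.mean

# Usage on a noisy quadratic: the exact gradient is x, observed with additive noise.
rng = np.random.default_rng(0)
opt = KalmanFilteredSGD(dim=2)
x = np.array([3.0, -2.0])
for _ in range(200):
    x = opt.step(x, x + 0.5 * rng.standard_normal(2))
```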
A Simple Baseline for Bayesian Uncertainty in Deep Learning
We propose SWA-Gaussian (SWAG), a simple, scalable, and general purpose
approach for uncertainty representation and calibration in deep learning.
Stochastic Weight Averaging (SWA), which computes the first moment of
stochastic gradient descent (SGD) iterates with a modified learning rate
schedule, has recently been shown to improve generalization in deep learning.
With SWAG, we fit a Gaussian using the SWA solution as the first moment and a
low rank plus diagonal covariance also derived from the SGD iterates, forming
an approximate posterior distribution over neural network weights; we then
sample from this Gaussian distribution to perform Bayesian model averaging. We
empirically find that SWAG approximates the shape of the true posterior, in
accordance with results describing the stationary distribution of SGD iterates.
Moreover, we demonstrate that SWAG performs well on a wide variety of tasks,
including out-of-sample detection, calibration, and transfer learning, in
comparison to many popular alternatives including MC dropout, KFAC Laplace,
SGLD, and temperature scaling.
Comment: Published at NeurIPS 2019.
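The moment collection and sampling described in the abstract can be sketched as below: a running first moment, a running second moment for the diagonal covariance, and a buffer of recent deviations for the low-rank part. The equal split between the diagonal and low-rank terms in the sampler follows my reading of the SWAG construction and should be treated as an assumption.

```python
import numpy as np

class SWAG:
    """Collect SGD iterates and form a Gaussian approximate posterior over the weights."""

    def __init__(self, dim, max_rank=20):
        self.n = 0
        self.mean = np.zeros(dim)              # running first moment of the iterates
        self.sq_mean = np.zeros(dim)           # running second moment (for the diagonal)
        self.deviations = []                   # last `max_rank` deviations (low-rank part)
        self.max_rank = max_rank

    def collect(self, theta):
        self.n += 1
        self.mean += (theta - self.mean) / self.n
        self.sq_mean += (theta ** 2 - self.sq_mean) / self.n
        self.deviations.append(theta - self.mean)
        if len(self.deviations) > self.max_rank:
            self.deviations.pop(0)

    def sample(self, rng):
        diag_var = np.maximum(self.sq_mean - self.mean ** 2, 1e-12)
        D = np.stack(self.deviations, axis=1)                    # (dim, K)
        K = D.shape[1]
        z1 = rng.standard_normal(self.mean.shape)
        z2 = rng.standard_normal(K)
        return (self.mean
                + np.sqrt(diag_var / 2.0) * z1                   # diagonal part
                + (D @ z2) / np.sqrt(2.0 * max(K - 1, 1)))       # low-rank part
```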
Accelerated Gossip via Stochastic Heavy Ball Method
In this paper we show how the stochastic heavy ball method (SHB), a popular
method for solving stochastic convex and non-convex optimization problems,
operates as a randomized gossip algorithm. In particular, we focus on two
special cases of SHB: the Randomized Kaczmarz method with momentum and its
block variant. Building upon a recent framework for the design and analysis of
randomized gossip algorithms [Loizou & Richtarik, 2016], we interpret the
distributed nature of the proposed methods. We present novel protocols for
solving the average consensus problem where in each step all nodes of the
network update their values but only a subset of them exchange their private
values. Numerical experiments on popular wireless sensor networks showing the
benefits of our protocols are also presented.
Comment: 8 pages, 5 figures, 56th Annual Allerton Conference on Communication,
Control, and Computing, 2018.
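The first special case the abstract highlights, Randomized Kaczmarz with heavy-ball momentum, can be sketched for a consistent linear system as below. The parameter values are illustrative, and the reinterpretation as gossip on a network graph (via averaging constraints on the edges) is not reproduced here.

```python
import numpy as np

def kaczmarz_momentum(A, b, omega=1.0, beta=0.3, n_iters=2000, seed=0):
    """Randomized Kaczmarz with heavy-ball momentum for a consistent system A x = b."""
    rng = np.random.default_rng(seed)
    m, n = A.shape
    row_norms_sq = np.einsum("ij,ij->i", A, A)
    probs = row_norms_sq / row_norms_sq.sum()       # sample rows proportionally to ||a_i||^2
    x_prev = x = np.zeros(n)
    for _ in range(n_iters):
        i = rng.choice(m, p=probs)
        residual = A[i] @ x - b[i]
        x_new = x - omega * (residual / row_norms_sq[i]) * A[i] + beta * (x - x_prev)
        x_prev, x = x, x_new
    return x

# Consistent toy system.
rng = np.random.default_rng(1)
A = rng.standard_normal((50, 10))
x_true = rng.standard_normal(10)
print(np.linalg.norm(kaczmarz_momentum(A, A @ x_true) - x_true))
```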