Unbiasing Truncated Backpropagation Through Time
Truncated Backpropagation Through Time (truncated BPTT) is a widespread
method for learning recurrent computational graphs. Truncated BPTT keeps the
computational benefits of Backpropagation Through Time (BPTT) while relieving
the need for a complete backtrack through the whole data sequence at every
step. However, truncation favors short-term dependencies: the gradient estimate
of truncated BPTT is biased, so it does not benefit from the convergence
guarantees of stochastic gradient theory. We introduce Anticipated Reweighted
Truncated Backpropagation (ARTBP), an algorithm that keeps the computational
benefits of truncated BPTT, while providing unbiasedness. ARTBP works by using
variable truncation lengths together with carefully chosen compensation factors
in the backpropagation equation. We check the viability of ARTBP on two tasks.
First, a simple synthetic task where careful balancing of temporal dependencies
at different scales is needed: truncated BPTT displays unreliable performance
and, in worst-case scenarios, divergence, while ARTBP converges reliably.
Second, on Penn Treebank character-level language modelling, ARTBP slightly
outperforms truncated BPTT.
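
As a rough illustration of the mechanism, here is a minimal PyTorch-style sketch
of the reweighting idea, assuming a constant cut probability p_cut and a recurrent
model that returns (output, hidden). The function name and signature are
illustrative, not the paper's code, and the paper's general form allows
time-dependent cut probabilities.

    import torch

    def artbp_step(model, loss_fn, optimizer, hidden, inputs, targets, p_cut=0.05):
        # One training pass with anticipated reweighted truncation: at each time
        # step the backward connection through the hidden state is cut with
        # probability p_cut; when it is kept, the gradient flowing through it is
        # rescaled by 1 / (1 - p_cut), which keeps the truncated gradient unbiased
        # in expectation (constant-probability special case).
        optimizer.zero_grad()
        total_loss = 0.0
        scale = 1.0 / (1.0 - p_cut)
        for x_t, y_t in zip(inputs, targets):
            if torch.rand(()).item() < p_cut:
                hidden = hidden.detach()                 # truncation point
            else:
                # forward value unchanged, backward gradient multiplied by `scale`
                hidden = scale * hidden + (1.0 - scale) * hidden.detach()
            out, hidden = model(x_t, hidden)
            total_loss = total_loss + loss_fn(out, y_t)
        total_loss.backward()
        optimizer.step()
        return hidden.detach(), float(total_loss)

Plain truncated BPTT corresponds to always detaching after a fixed number of
steps; here the detach points are random and the surviving connections are
compensated so that the expected gradient matches full BPTT.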
Efficient Optimization of Loops and Limits with Randomized Telescoping Sums
We consider optimization problems in which the objective requires an inner
loop with many steps or is the limit of a sequence of increasingly costly
approximations. Meta-learning, training recurrent neural networks, and
optimization of the solutions to differential equations are all examples of
optimization problems with this character. In such problems, it can be
expensive to compute the objective function value and its gradient, but
truncating the loop or using less accurate approximations can induce biases
that damage the overall solution. We propose randomized telescope (RT) gradient
estimators, which represent the objective as the sum of a telescoping series
and sample linear combinations of terms to provide cheap unbiased gradient
estimates. We identify conditions under which RT estimators achieve
optimization convergence rates independent of the length of the loop or the
required accuracy of the approximation. We also derive a method for tuning RT
estimators online to maximize a lower bound on the expected decrease in loss
per unit of computation. We evaluate our adaptive RT estimators on a range of
applications including meta-optimization of learning rates, variational
inference of ODE parameters, and training an LSTM to model long sequences.
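
To make the estimator concrete, below is a minimal NumPy sketch of one common RT
weighting (the Russian-roulette form), where delta(n) stands for the n-th
telescoping difference (e.g. the gradient of the n-step approximation minus that
of the (n-1)-step one) and probs is a chosen distribution over truncation levels;
all names here are placeholders rather than the authors' code.

    import numpy as np

    def rt_estimate(delta, probs, rng=None):
        # Randomized telescope estimator with Russian-roulette weights:
        # sample a truncation level N ~ probs, sum the first N telescoping
        # differences, and divide the n-th term by P(N >= n).  Then
        # E[estimate] = sum_n Delta_n, i.e. the full (untruncated) quantity.
        rng = rng or np.random.default_rng()
        probs = np.asarray(probs, dtype=float)
        tails = np.cumsum(probs[::-1])[::-1]      # tails[k] = P(N >= k + 1)
        N = int(rng.choice(len(probs), p=probs)) + 1
        return sum(delta(n) / tails[n - 1] for n in range(1, N + 1))

The online tuning described in the abstract refers to choosing this sampling
distribution and weighting so as to maximize a lower bound on the expected
decrease in loss per unit of computation.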
Online Natural Gradient as a Kalman Filter
We cast Amari's natural gradient in statistical learning as a specific case
of Kalman filtering. Namely, applying an extended Kalman filter to estimate a
fixed unknown parameter of a probabilistic model from a series of observations
is rigorously equivalent to estimating this parameter via online stochastic
natural gradient descent on the log-likelihood of the observations.
In the i.i.d. case, this relation is a consequence of the "information
filter" phrasing of the extended Kalman filter. In the recurrent (state space,
non-i.i.d.) case, we prove that the joint Kalman filter over states and
parameters is a natural gradient on top of real-time recurrent learning (RTRL),
a classical algorithm to train recurrent models.
This exact algebraic correspondence provides relevant interpretations for
natural gradient hyperparameters such as learning rates or initialization and
regularization of the Fisher information matrix.
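
To fix ideas, the following NumPy sketch shows the i.i.d. case of online
stochastic natural gradient with a running Fisher estimate, which is the update
the correspondence identifies with an extended Kalman filter on a fixed
parameter; score, fisher_sample, and the 1/t step sizes are illustrative
assumptions, not the paper's exact setup.

    import numpy as np

    def online_natural_gradient(theta0, score, fisher_sample, stream, prior_precision=1e-6):
        # score(theta, y): gradient of log p(y | theta) at observation y.
        # fisher_sample(theta, y): per-observation Fisher estimate, e.g. the
        # outer product g g^T of the score of a simulated observation.
        theta = np.asarray(theta0, dtype=float)
        # Running Fisher estimate; its (scaled) inverse plays the role of the
        # Kalman posterior covariance, and prior_precision that of the prior.
        J = prior_precision * np.eye(theta.size)
        for t, y in enumerate(stream, start=1):
            lr = 1.0 / t
            J = (1.0 - lr) * J + lr * fisher_sample(theta, y)
            theta = theta + lr * np.linalg.solve(J, score(theta, y))
        return theta

In this reading, the Kalman gain corresponds to a decayed inverse Fisher matrix,
which is why the learning rate and the initialization and regularization of the
Fisher information matrix acquire the interpretations mentioned above.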