Optimizing Neural Networks with Kronecker-factored Approximate Curvature
We propose an efficient method for approximating natural gradient descent in
neural networks which we call Kronecker-Factored Approximate Curvature (K-FAC).
K-FAC is based on an efficiently invertible approximation of a neural network's
Fisher information matrix which is neither diagonal nor low-rank, and in some
cases is completely non-sparse. It is derived by approximating various large
blocks of the Fisher (corresponding to entire layers) as being the Kronecker
product of two much smaller matrices. While only several times more expensive
to compute than the plain stochastic gradient, the updates produced by K-FAC
make much more progress optimizing the objective, which results in an algorithm
that can be much faster than stochastic gradient descent with momentum in
practice. And unlike some previously proposed approximate
natural-gradient/Newton methods which use high-quality non-diagonal curvature
matrices (such as Hessian-free optimization), K-FAC works very well in highly
stochastic optimization regimes. This is because the cost of storing and
inverting K-FAC's approximation to the curvature matrix does not depend on the
amount of data used to estimate it, which is a feature typically associated
only with diagonal or low-rank approximations to the curvature matrix.
Comment: Reduction ratio formula corrected. Removed incorrect claim about
geodesics in footnote.
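To make the block structure concrete, here is a minimal NumPy sketch (names and shapes are ours, not the paper's code) of the per-layer update described above: the layer's Fisher block is approximated as the Kronecker product of a small activation-covariance factor A and a small derivative-covariance factor G, so preconditioning only requires inverting these two factors.

```python
import numpy as np

def kfac_layer_update(acts, grads, grad_W, damping=1e-2):
    """Approximate natural-gradient direction G^{-1} grad_W A^{-1} for one layer.

    acts:   (batch, n) layer inputs a
    grads:  (batch, m) backpropagated pre-activation derivatives g
    grad_W: (m, n) gradient of the loss with respect to the layer's weights
    The layer's Fisher block is approximated as A (kron) G with A = E[a a^T]
    and G = E[g g^T], so its inverse is A^{-1} (kron) G^{-1} and the full
    (mn x mn) block never has to be formed explicitly.
    """
    batch = acts.shape[0]
    A = acts.T @ acts / batch                    # n x n factor from activations
    G = grads.T @ grads / batch                  # m x m factor from derivatives
    A_damped = A + damping * np.eye(A.shape[0])  # damping keeps the factors invertible
    G_damped = G + damping * np.eye(G.shape[0])
    return np.linalg.solve(G_damped, grad_W) @ np.linalg.inv(A_damped)
```

The cost of forming and inverting A and G depends only on the layer's dimensions, not on the amount of data used to estimate them, which is the property the abstract highlights.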
A Kronecker-factored approximate Fisher matrix for convolution layers
Second-order optimization methods such as natural gradient descent have the
potential to speed up training of neural networks by correcting for the
curvature of the loss function. Unfortunately, the exact natural gradient is
impractical to compute for large models, and most approximations either require
an expensive iterative procedure or make crude approximations to the curvature.
We present Kronecker Factors for Convolution (KFC), a tractable approximation
to the Fisher matrix for convolutional networks based on a structured
probabilistic model for the distribution over backpropagated derivatives.
Similarly to the recently proposed Kronecker-Factored Approximate Curvature
(K-FAC), each block of the approximate Fisher matrix decomposes as the
Kronecker product of small matrices, allowing for efficient inversion. KFC
captures important curvature information while still yielding updates that are
comparably efficient to those of stochastic gradient descent (SGD). We show that the
updates are invariant to commonly used reparameterizations, such as centering
of the activations. In our experiments, approximate natural gradient descent
with KFC was able to train convolutional networks several times faster than
carefully tuned SGD. Furthermore, it was able to train the networks in 10-20
times fewer iterations than SGD, suggesting its potential applicability in a
distributed setting.
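Sketching the same idea in equations (symbols and normalization are ours): for a convolutional layer, each Fisher block is again approximated as a Kronecker product, with one factor built from the extracted input patches and one from the backpropagated derivatives, pooled over spatial locations.

```latex
% Symbols and normalization are ours: a_t is the vectorized input patch and
% g_t the backpropagated pre-activation derivative at spatial location t.
\[
  F_\ell \;\approx\; \Omega_\ell \otimes \Gamma_\ell,
  \qquad
  \Omega_\ell \;\propto\; \mathbb{E}\Big[\sum_t a_t a_t^{\top}\Big],
  \qquad
  \Gamma_\ell \;\propto\; \mathbb{E}\Big[\sum_t g_t g_t^{\top}\Big].
\]
```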
Importance Weighted Autoencoders
The variational autoencoder (VAE; Kingma & Welling, 2014) is a recently
proposed generative model pairing a top-down generative network with a
bottom-up recognition network which approximates posterior inference. It
typically makes strong assumptions about posterior inference, for instance that
the posterior distribution is approximately factorial, and that its parameters
can be approximated with nonlinear regression from the observations. As we show
empirically, the VAE objective can lead to overly simplified representations
which fail to use the network's entire modeling capacity. We present the
importance weighted autoencoder (IWAE), a generative model with the same
architecture as the VAE, but which uses a strictly tighter log-likelihood lower
bound derived from importance weighting. In the IWAE, the recognition network
uses multiple samples to approximate the posterior, giving it increased
flexibility to model complex posteriors which do not fit the VAE modeling
assumptions. We show empirically that IWAEs learn richer latent space
representations than VAEs, leading to improved test log-likelihood on density
estimation benchmarks.
Comment: Submitted to ICLR 2016.
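As an illustration of the bound referred to above, the following sketch (hypothetical function names) computes the k-sample importance-weighted estimate from the log joint density and the log recognition density evaluated at k samples drawn from the recognition network.

```python
import numpy as np
from scipy.special import logsumexp

def iwae_bound(log_joint, log_q):
    """Monte Carlo estimate of the k-sample importance-weighted lower bound.

    log_joint: shape (k,) array of log p(x, h_i) for samples h_i ~ q(h | x)
    log_q:     shape (k,) array of log q(h_i | x) for the same samples
    Returns log (1/k) sum_i exp(log_joint_i - log_q_i), which lower-bounds
    log p(x) in expectation, tightens as k grows, and recovers the standard
    VAE bound at k = 1.
    """
    log_weights = np.asarray(log_joint) - np.asarray(log_q)
    k = log_weights.shape[0]
    return logsumexp(log_weights) - np.log(k)
```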
Fast Convergence of Natural Gradient Descent for Overparameterized Neural Networks
Natural gradient descent has proven effective at mitigating the effects of
pathological curvature in neural network optimization, but little is known
theoretically about its convergence properties, especially for nonlinear
networks. In this work, we analyze for the first time the speed of convergence
of natural gradient descent on nonlinear neural networks with squared-error
loss. We identify two conditions which guarantee efficient convergence from
random initializations: (1) the Jacobian matrix (of the network's output for all
training cases with respect to the parameters) has full row rank, and (2) the
Jacobian matrix is stable for small perturbations around the initialization.
For two-layer ReLU neural networks, we prove that these two conditions do in
fact hold throughout the training, under the assumptions of nondegenerate
inputs and overparameterization. We further extend our analysis to more general
loss functions. Lastly, we show that K-FAC, an approximate natural gradient
descent method, also converges to global minima under the same assumptions, and
we give a bound on the rate of this convergence.
Comment: NeurIPS 2019.
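A first-order sketch, in our own notation, of why these two conditions give fast convergence for squared-error loss:

```latex
% Our notation: u(\theta) \in \mathbb{R}^n stacks the network outputs on the
% training set, y the targets, J = \partial u / \partial\theta the Jacobian,
% and L(\theta) = \tfrac{1}{2}\|u(\theta) - y\|^2 the squared-error loss.
% With J of full row rank, the natural-gradient step (pseudo-inverse Fisher) is
\[
  \Delta\theta \;=\; -\,\eta\, J^{\top} (J J^{\top})^{-1}\,\big(u(\theta) - y\big),
  \qquad
  u(\theta + \Delta\theta) - y \;\approx\; (1 - \eta)\,\big(u(\theta) - y\big),
\]
% so to first order the residual contracts geometrically, and condition (2)
% (stability of J near initialization) keeps this linearization accurate
% along the trajectory.
```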
Testing MCMC code
Markov Chain Monte Carlo (MCMC) algorithms are a workhorse of probabilistic
modeling and inference, but are difficult to debug, and are prone to silent
failure if implemented naively. We outline several strategies for testing the
correctness of MCMC algorithms. Specifically, we advocate writing code in a
modular way, where conditional probability calculations are kept separate from
the logic of the sampler. We discuss strategies for both unit testing and
integration testing. As a running example, we show how a Python implementation
of Gibbs sampling for a mixture of Gaussians model can be tested.
Comment: Presented at the 2014 NIPS workshop on Software Engineering for
Machine Learning.
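The following sketch (hypothetical model and function names, not the paper's code) illustrates the kind of test being advocated: the conditional-probability calculation for a Gibbs update of a mixture assignment lives in its own function, and a unit test checks it against probabilities obtained by exhaustively evaluating and renormalizing the joint.

```python
import numpy as np

def conditional_assignment_probs(x_i, mus, sigma, log_pi):
    """Conditional p(z_i = k | x_i, mus) for one point in a toy Gaussian mixture.
    Kept separate from the sampler so it can be unit tested on its own."""
    log_lik = -0.5 * ((x_i - mus) / sigma) ** 2 - np.log(sigma) - 0.5 * np.log(2 * np.pi)
    log_post = log_pi + log_lik
    log_post -= np.max(log_post)
    probs = np.exp(log_post)
    return probs / probs.sum()

def log_joint(x_i, z_i, mus, sigma, log_pi):
    """Joint log p(z_i, x_i | mus) for the same toy model."""
    return (log_pi[z_i] - 0.5 * ((x_i - mus[z_i]) / sigma) ** 2
            - np.log(sigma) - 0.5 * np.log(2 * np.pi))

def test_conditional_matches_joint():
    """Unit test: the conditional must equal the joint renormalized over z_i."""
    rng = np.random.default_rng(0)
    mus, sigma, x_i = rng.normal(size=3), 1.3, 0.7
    log_pi = np.log(np.array([0.2, 0.5, 0.3]))
    brute = np.array([log_joint(x_i, k, mus, sigma, log_pi) for k in range(3)])
    brute = np.exp(brute - np.max(brute))
    brute /= brute.sum()
    np.testing.assert_allclose(
        conditional_assignment_probs(x_i, mus, sigma, log_pi), brute, rtol=1e-10)
```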
Accurate and Conservative Estimates of MRF Log-likelihood using Reverse Annealing
Markov random fields (MRFs) are difficult to evaluate as generative models
because computing the test log-probabilities requires the intractable partition
function. Annealed importance sampling (AIS) is widely used to estimate MRF
partition functions, and often yields quite accurate results. However, AIS is
prone to overestimate the log-likelihood with little indication that anything
is wrong. We present the Reverse AIS Estimator (RAISE), a stochastic lower
bound on the log-likelihood of an approximation to the original MRF model.
RAISE requires only the same MCMC transition operators as standard AIS.
Experimental results indicate that RAISE agrees closely with AIS
log-probability estimates for RBMs, DBMs, and DBNs, but typically errs on the
side of underestimating, rather than overestimating, the log-likelihood.
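A one-line Jensen argument, in our notation, for why AIS-based log-likelihood estimates tend to err on the high side, which is the failure mode RAISE guards against:

```latex
% \hat{Z} denotes the AIS estimate of the partition function, which is
% unbiased: \mathbb{E}[\hat{Z}] = Z.  By Jensen's inequality,
\[
  \mathbb{E}\big[\log \hat{Z}\big] \;\le\; \log \mathbb{E}\big[\hat{Z}\big] \;=\; \log Z,
\]
% so the plug-in estimate \log\tilde{p}(x) - \log\hat{Z} (with \tilde{p} the
% unnormalized MRF probability) overestimates
% \log p(x) = \log\tilde{p}(x) - \log Z in expectation.
```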
Noisy Natural Gradient as Variational Inference
Variational Bayesian neural nets combine the flexibility of deep learning
with Bayesian uncertainty estimation. Unfortunately, there is a tradeoff
between cheap but simple variational families (e.g., fully factorized) and
expensive and complicated inference procedures. We show that natural gradient
ascent with adaptive weight noise implicitly fits a variational posterior to
maximize the evidence lower bound (ELBO). This insight allows us to train
full-covariance, fully factorized, or matrix-variate Gaussian variational
posteriors using noisy versions of natural gradient, Adam, and K-FAC,
respectively, making it possible to scale up to modern-size ConvNets. On
standard regression benchmarks, our noisy K-FAC algorithm makes better
predictions and matches Hamiltonian Monte Carlo's predictive variances better
than existing methods. Its improved uncertainty estimates lead to more
efficient exploration in active learning, and intrinsic motivation for
reinforcement learning.
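In our notation, the objective implicitly being maximized is the standard evidence lower bound for a Gaussian variational posterior over the weights; the claim above is that natural-gradient updates computed at weights sampled from this posterior (i.e., with adaptive weight noise) ascend it.

```latex
% Our notation: w are the weights, \mathcal{D} the data, p(w) the prior, and
% q(w) = \mathcal{N}(\mu, \Sigma) the variational posterior (full-covariance,
% fully factorized, or matrix-variate, as described above).
\[
  \mathcal{L}(\mu, \Sigma)
  \;=\; \mathbb{E}_{q(w)}\big[\log p(\mathcal{D} \mid w)\big]
  \;-\; \mathrm{KL}\big(q(w)\,\|\,p(w)\big).
\]
```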
Reversible Recurrent Neural Networks
Recurrent neural networks (RNNs) provide state-of-the-art performance in
processing sequential data but are memory intensive to train, limiting the
flexibility of RNN models which can be trained. Reversible RNNs---RNNs for
which the hidden-to-hidden transition can be reversed---offer a path to reduce
the memory requirements of training, as hidden states need not be stored and
instead can be recomputed during backpropagation. We first show that perfectly
reversible RNNs, which require no storage of the hidden activations, are
fundamentally limited because they cannot forget information from their hidden
state. We then provide a scheme for storing a small number of bits in order to
allow perfect reversal with forgetting. Our method achieves comparable
performance to traditional models while reducing the activation memory cost by
a factor of 10--15. We extend our technique to attention-based
sequence-to-sequence models, where it maintains performance while reducing
activation memory cost by a factor of 5--10 in the encoder, and a factor of
10--15 in the decoder.
Comment: Published as a conference paper at NIPS 2018.
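A minimal additive-coupling sketch (illustrative only, not the paper's exact cell) of the perfectly reversible hidden-to-hidden transition discussed above: because the step can be inverted exactly, the previous hidden state can be recomputed during backpropagation instead of stored.

```python
import numpy as np

# The hidden state is split into two halves (h1, h2); f and g are arbitrary,
# possibly non-invertible, update functions of the other half and the input.
def forward(h1, h2, x, f, g):
    h1_new = h1 + f(h2, x)
    h2_new = h2 + g(h1_new, x)
    return h1_new, h2_new

def inverse(h1_new, h2_new, x, f, g):
    """Exactly recover the previous hidden state from the new one."""
    h2 = h2_new - g(h1_new, x)
    h1 = h1_new - f(h2, x)
    return h1, h2

# Tiny usage check with random affine-plus-tanh update functions.
rng = np.random.default_rng(0)
W_f, W_g = rng.normal(size=(4, 6)), rng.normal(size=(4, 6))
f = lambda h, x: np.tanh(W_f @ np.concatenate([h, x]))
g = lambda h, x: np.tanh(W_g @ np.concatenate([h, x]))
h1, h2, x = rng.normal(size=4), rng.normal(size=4), rng.normal(size=2)
assert np.allclose(inverse(*forward(h1, h2, x, f, g), x, f, g), (h1, h2))
```

This is exactly the no-forgetting regime the abstract argues is fundamentally limited; the paper's scheme additionally stores a small number of bits so that forgetting transitions can still be reversed.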
Learning Wake-Sleep Recurrent Attention Models
Despite their success, convolutional neural networks are computationally
expensive because they must examine all image locations. Stochastic
attention-based models have been shown to improve computational efficiency at
test time, but they remain difficult to train because of intractable posterior
inference and high variance in the stochastic gradient estimates. Borrowing
techniques from the literature on training deep generative models, we present
the Wake-Sleep Recurrent Attention Model, a method for training stochastic
attention networks which improves posterior inference and which reduces the
variability in the stochastic gradients. We show that our method can greatly
speed up the training time for stochastic attention networks in the domains of
image classification and caption generation.
Comment: To appear in NIPS 2015.
Sandwiching the marginal likelihood using bidirectional Monte Carlo
Computing the marginal likelihood (ML) of a model requires marginalizing out
all of the parameters and latent variables, a difficult high-dimensional
summation or integration problem. To make matters worse, it is often hard to
measure the accuracy of one's ML estimates. We present bidirectional Monte
Carlo, a technique for obtaining accurate log-ML estimates on data simulated
from a model. This method obtains stochastic lower bounds on the log-ML using
annealed importance sampling or sequential Monte Carlo, and obtains stochastic
upper bounds by running these same algorithms in reverse starting from an exact
posterior sample. The true value can be sandwiched between these two stochastic
bounds with high probability. Using the ground truth log-ML estimates obtained
from our method, we quantitatively evaluate a wide variety of existing ML
estimators on several latent variable models: clustering, a low rank
approximation, and a binary attributes model. These experiments yield insights
into how to accurately estimate marginal likelihoods.
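In our notation, the sandwiching property rests on the unbiasedness of the forward and reverse estimators together with Jensen's inequality:

```latex
% Our notation: Z = p(\mathcal{D}) is the marginal likelihood,
% \hat{w}_{\mathrm{fwd}} the forward AIS/SMC weight, and \hat{w}_{\mathrm{rev}}
% the reverse-chain weight started from an exact posterior sample, so that
% \mathbb{E}[\hat{w}_{\mathrm{fwd}}] = Z and \mathbb{E}[\hat{w}_{\mathrm{rev}}] = 1/Z.
% Jensen's inequality then gives the two-sided sandwich
\[
  \mathbb{E}\big[\log \hat{w}_{\mathrm{fwd}}\big]
  \;\le\; \log Z \;\le\;
  -\,\mathbb{E}\big[\log \hat{w}_{\mathrm{rev}}\big],
\]
% and Markov's inequality makes each one-sided bound hold with high probability
% for a single run.
```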