Auto-encoders: reconstruction versus compression
We discuss the similarities and differences between training an auto-encoder
to minimize the reconstruction error, and training the same auto-encoder to
compress the data via a generative model. Minimizing a codelength for the data
using an auto-encoder is equivalent to minimizing the reconstruction error plus
some correcting terms which have an interpretation as either a denoising or
contractive property of the decoding function. These terms are related but not
identical to those used in denoising or contractive auto-encoders [Vincent et
al. 2010, Rifai et al. 2011]. In particular, the codelength viewpoint fully
determines an optimal noise level for the denoising criterion.
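As a rough illustration of the role of the noise level (a minimal Python sketch, not taken from the paper; the model, data and value of sigma are placeholders), one can compare a plain reconstruction loss with a denoising loss in which the corruption level sigma is the explicit knob that the codelength viewpoint is said to determine:

    import numpy as np

    rng = np.random.default_rng(0)
    X = rng.normal(size=(256, 20))           # placeholder data
    W = rng.normal(scale=0.1, size=(20, 5))  # encoder weights (decoder uses W.T, tied)

    def reconstruct(X, W):
        H = np.tanh(X @ W)                   # encode
        return H @ W.T                       # decode

    def reconstruction_loss(X, W):
        return np.mean((reconstruct(X, W) - X) ** 2)

    def denoising_loss(X, W, sigma):
        # Same reconstruction error, but computed from corrupted inputs; sigma
        # plays the role of the noise level that the codelength viewpoint would
        # determine, rather than leaving it as a free hyperparameter.
        X_noisy = X + sigma * rng.normal(size=X.shape)
        return np.mean((reconstruct(X_noisy, W) - X) ** 2)

    print(reconstruction_loss(X, W))
    print(denoising_loss(X, W, sigma=0.1))   # sigma chosen arbitrarily here
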
Online Natural Gradient as a Kalman Filter
We cast Amari's natural gradient in statistical learning as a specific case
of Kalman filtering. Namely, applying an extended Kalman filter to estimate a
fixed unknown parameter of a probabilistic model from a series of observations
is rigorously equivalent to estimating this parameter via an online stochastic
natural gradient descent on the log-likelihood of the observations.
In the i.i.d. case, this relation is a consequence of the "information
filter" phrasing of the extended Kalman filter. In the recurrent (state space,
non-i.i.d.) case, we prove that the joint Kalman filter over states and
parameters is a natural gradient on top of real-time recurrent learning (RTRL),
a classical algorithm to train recurrent models.
This exact algebraic correspondence provides relevant interpretations for
natural gradient hyperparameters such as learning rates or initialization and
regularization of the Fisher information matrix.
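Schematically, and only to fix notation (this is the generic online natural gradient step, not the paper's exact statement), the update for a parameter \theta of a model p_\theta after observing y_t reads

    \theta_{t+1} = \theta_t + \eta_t \, J_t^{-1} \, \nabla_\theta \ln p_{\theta_t}(y_t),
    \qquad
    J_t = (1 - \gamma_t)\, J_{t-1} + \gamma_t \, \mathbb{E}_{y \sim p_{\theta_t}}\!\left[ \nabla_\theta \ln p_{\theta_t}(y)\, \nabla_\theta \ln p_{\theta_t}(y)^{\top} \right],

where J_t is a running estimate of the Fisher information matrix; the correspondence with the extended Kalman filter is what gives the rates \eta_t and \gamma_t and the initialization and regularization of J_t their interpretation.
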
Unbiasing Truncated Backpropagation Through Time
Truncated Backpropagation Through Time (truncated BPTT) is a widespread
method for learning recurrent computational graphs. Truncated BPTT keeps the
computational benefits of Backpropagation Through Time (BPTT) while relieving
the need for a complete backtrack through the whole data sequence at every
step. However, truncation favors short-term dependencies: the gradient estimate
of truncated BPTT is biased, so that it does not benefit from the convergence
guarantees from stochastic gradient theory. We introduce Anticipated Reweighted
Truncated Backpropagation (ARTBP), an algorithm that keeps the computational
benefits of truncated BPTT, while providing unbiasedness. ARTBP works by using
variable truncation lengths together with carefully chosen compensation factors
in the backpropagation equation. We check the viability of ARTBP on two tasks.
First, a simple synthetic task where careful balancing of temporal dependencies
at different scales is needed: truncated BPTT displays unreliable performance,
and in worst-case scenarios divergence, while ARTBP converges reliably.
Second, on Penn Treebank character-level language modelling, ARTBP slightly
outperforms truncated BPTT.
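The reweighting idea behind the unbiasedness can be seen on a toy sum (a minimal Python sketch of the mechanism, not the ARTBP equations themselves): randomly choosing where to truncate and dividing each surviving term by its probability of surviving leaves the estimate unbiased, whereas a fixed truncation systematically drops the tail.

    import numpy as np

    rng = np.random.default_rng(0)
    contributions = 0.9 ** np.arange(50)   # stand-in for per-timestep gradient terms
    true_total = contributions.sum()

    def fixed_truncation(k):
        return contributions[:k].sum()     # biased: ignores everything past step k

    def random_truncation(p_continue=0.9):
        # Extend the window with probability p_continue at each step; a term at
        # depth t survives with probability p_continue**t, so dividing by that
        # probability compensates for the terms dropped when stopping early.
        total, t = 0.0, 0
        while t < len(contributions):
            total += contributions[t] / (p_continue ** t)
            if rng.random() > p_continue:
                break
            t += 1
        return total

    print(true_total)
    print(fixed_truncation(10))                                  # systematically too small
    print(np.mean([random_truncation() for _ in range(20000)]))  # close to the true total
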
A curved Brunn--Minkowski inequality on the discrete hypercube
We compare two approaches to Ricci curvature on non-smooth spaces, in the
case of the discrete hypercube. While the coarse Ricci curvature of
the first author readily yields a positive value for curvature, the
displacement convexity property of Lott, Sturm and the second author could not
be fully implemented. Yet along the way we get new results of a combinatorial
and probabilistic nature, including a curved Brunn--Minkowski inequality on the
discrete hypercube.
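For context (a standard statement recalled here, not the paper's result), the classical Brunn--Minkowski inequality for nonempty compact sets A, B \subset \mathbb{R}^n reads

    |A + B|^{1/n} \;\ge\; |A|^{1/n} + |B|^{1/n}, \qquad A + B = \{\, a + b : a \in A,\ b \in B \,\};

curved versions of such inequalities strengthen them by a nonnegative term controlled by a lower bound on the Ricci curvature and by the distance between A and B, and the paper obtains an inequality of that flavour on the hypercube.
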
Layer-wise learning of deep generative models
When using deep, multi-layered architectures to build generative models of
data, it is difficult to train all layers at once. We propose a layer-wise
training procedure admitting a performance guarantee compared to the global
optimum. It is based on an optimistic proxy of future performance, the best
latent marginal. We interpret auto-encoders in this setting as generative
models, by showing that they train a lower bound of this criterion. We test the
new learning procedure against a state-of-the-art method (stacked RBMs), and
find it to improve performance. Both theory and experiments highlight the
importance, when training deep architectures, of using an inference model (from
data to hidden variables) richer than the generative model (from hidden
variables to data).
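A generic layer-wise sketch (a minimal Python illustration of the stacking procedure, assuming a tiny tied-weight auto-encoder per layer; it does not implement the best-latent-marginal criterion itself): each layer is trained on the codes produced by the layer below, rather than all layers being optimized jointly.

    import numpy as np

    def train_layer(data, hidden, steps=300, lr=0.02, seed=0):
        """One tied-weight auto-encoder layer trained by plain gradient descent
        on its reconstruction error (a stand-in for the per-layer training step)."""
        rng = np.random.default_rng(seed)
        W = rng.normal(scale=0.1, size=(data.shape[1], hidden))
        for _ in range(steps):
            H = np.tanh(data @ W)          # inference model: data -> hidden codes
            R = H @ W.T                    # generative model: hidden codes -> data
            E = R - data                   # reconstruction error
            G = (E @ W) * (1.0 - H ** 2)   # backpropagate through the tanh encoder
            W -= lr * (data.T @ G + E.T @ H) / len(data)  # gradient step (factor 2 absorbed in lr)
        return W, np.tanh(data @ W)

    rng = np.random.default_rng(1)
    X = rng.normal(size=(512, 32))         # placeholder data

    # Layer-wise training: fit the first layer on the data, then fit the next
    # layer on the codes produced by the first, instead of training all layers jointly.
    W1, H1 = train_layer(X, hidden=16)
    W2, H2 = train_layer(H1, hidden=8)
    print(H1.shape, H2.shape)              # (512, 16) (512, 8)
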
Objective Improvement in Information-Geometric Optimization
Information-Geometric Optimization (IGO) is a unified framework of stochastic
algorithms for optimization problems. Given a family of probability
distributions, IGO turns the original optimization problem into a new
maximization problem on the parameter space of the probability distributions.
IGO updates the parameter of the probability distribution along the natural
gradient, taken with respect to the Fisher metric on the parameter manifold,
aiming at maximizing an adaptive transform of the objective function. IGO
recovers several known algorithms as particular instances: for the family of
Bernoulli distributions IGO recovers PBIL, for the family of Gaussian
distributions the pure rank-mu CMA-ES update is recovered, and for exponential
families in expectation parametrization the cross-entropy/ML method is
recovered. This article provides a theoretical justification for the IGO
framework, by proving that any step size not greater than 1 guarantees monotone
improvement over the course of optimization, in terms of q-quantile values of
the objective function f. The range of admissible step sizes is independent of
f and its domain. We extend the result to cover the case of different step
sizes for blocks of the parameters in the IGO algorithm. Moreover, we prove
that expected fitness improves over time when fitness-proportional selection is
applied, in which case the RPP algorithm is recovered.
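As a concrete instance (a minimal Python sketch of the Bernoulli case mentioned above, with quantile-based selection weights; the objective, population size and step size are placeholders), the IGO-style update in the expectation parametrization moves the Bernoulli means toward a weighted average of the sampled points, and a step size dt <= 1 keeps them in [0, 1]:

    import numpy as np

    rng = np.random.default_rng(0)

    def f(x):
        # Placeholder objective to be minimized: f = -(number of ones),
        # so the optimum is the all-ones string.
        return -np.sum(x, axis=1)

    dim, pop, q0, dt = 20, 50, 0.25, 0.5   # illustrative settings; dt <= 1
    p = np.full(dim, 0.5)                  # Bernoulli mean parameters

    for _ in range(100):
        X = (rng.random((pop, dim)) < p).astype(float)       # sample the population
        ranks = np.argsort(np.argsort(f(X)))                 # rank 0 = best (smallest f)
        w = np.where(ranks < q0 * pop, 1.0, 0.0)             # select the best q0-quantile
        w /= w.sum()                                         # normalized selection weights
        # Natural-gradient step in the expectation parametrization:
        # p <- p + dt * sum_i w_i (x_i - p), which stays in [0, 1] whenever dt <= 1.
        p = p + dt * (w @ (X - p))

    print(p.round(2))   # the means drift toward the all-ones optimizer
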