Efficient Methods for Unsupervised Learning of Probabilistic Models
In this thesis I develop a variety of techniques to train, evaluate, and
sample from intractable and high dimensional probabilistic models. Abstract
exceeds arXiv space limitations -- see PDF
Eliminating all bad Local Minima from Loss Landscapes without even adding an Extra Unit
Recent work has noted that all bad local minima can be removed from neural
network loss landscapes, by adding a single unit with a particular
parameterization. We show that the core technique from these papers can be used
to remove all bad local minima from any loss landscape, so long as the global
minimum has a loss of zero. This procedure does not require the addition of
auxiliary units, or even that the loss be associated with a neural network. The
method works by converting all bad local minima into bad
(non-local) minima at infinity in terms of auxiliary parameters.
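A minimal sketch of one construction of this kind, assuming a modified loss L~(theta, a, b) = L(theta) * (1 + (a * exp(b) - 1)^2) + lam * a^2 with auxiliary parameters a and b. The functional form, the toy 1-D loss, and every name below are illustrative assumptions, not details stated in the abstract. Under this assumed form, a * dL~/da - dL~/db = 2 * lam * a^2, so any critical point must have a = 0, and at a = 0 the derivative dL~/da = -2 * L(theta) * exp(b) can vanish only where L(theta) = 0.

```python
import numpy as np

LAM = 1.0  # weight of the a**2 penalty (illustrative value)

def base_loss(theta):
    # Toy loss (assumption): global minimum L = 0 at theta = 1,
    # plus a bad local minimum with L > 0 near theta = -0.8.
    return (theta ** 2 - 1.0) ** 2 + 0.3 * (theta - 1.0) ** 2

def modified_loss(theta, a, b):
    # Assumed modified loss with auxiliary parameters a and b.
    return base_loss(theta) * (1.0 + (a * np.exp(b) - 1.0) ** 2) + LAM * a ** 2

def aux_gradients(theta, a, b):
    """Analytic partial derivatives of modified_loss with respect to a and b."""
    residual = a * np.exp(b) - 1.0
    d_a = 2.0 * base_loss(theta) * residual * np.exp(b) + 2.0 * LAM * a
    d_b = 2.0 * base_loss(theta) * residual * a * np.exp(b)
    return d_a, d_b

# Near the bad local minimum of the toy loss, the derivative with respect to a
# stays nonzero at a = 0 for every finite b, so no finite critical point exists there.
theta_near_bad_min = -0.8
for b in (-2.0, 0.0, 5.0):
    d_a, _ = aux_gradients(theta_near_bad_min, 0.0, b)
    print(f"b = {b:+.1f}, dL~/da at a = 0: {d_a:.4f}")
```

In this sketch the derivative with respect to a shrinks only as b tends to minus infinity, which is one way to read the abstract's statement that the bad minima move to infinity in the auxiliary parameters.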
PCA of high dimensional random walks with comparison to neural network training
One technique to visualize the training of neural networks is to perform PCA
on the parameters over the course of training and to project to the subspace
spanned by the first few PCA components. In this paper we compare this
technique to the PCA of a high dimensional random walk. We compute the
eigenvalues and eigenvectors of the covariance of the trajectory and prove that
in the long trajectory and high dimensional limit most of the variance is in
the first few PCA components, and that the projection of the trajectory onto
any subspace spanned by PCA components is a Lissajous curve. We generalize
these results to a random walk with momentum and to an Ornstein-Uhlenbeck
process (i.e., a random walk in a quadratic potential) and show that in high
dimensions the walk is not mean reverting, but will instead be trapped at a
fixed distance from the minimum. We finally compare the distribution of PCA
variances and the PCA projected training trajectories of a linear model trained
on CIFAR-10 and ResNet-50-v2 trained on ImageNet and find that the distribution
of PCA variances resembles a random walk with drift.
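The random-walk half of this comparison can be sketched numerically. The snippet below is a minimal illustration, assuming an isotropic Gaussian random walk without momentum or drift; the function name, step count, and dimension are arbitrary choices, not values from the paper. It computes the fraction of trajectory variance captured by each PCA component.

```python
import numpy as np

def random_walk_pca_variance(n_steps=500, dim=1000, seed=0):
    """Explained-variance ratios of the PCA of a Gaussian random walk trajectory."""
    rng = np.random.default_rng(seed)
    steps = rng.standard_normal((n_steps, dim))
    walk = np.cumsum(steps, axis=0)                 # positions over time, shape (n_steps, dim)
    centered = walk - walk.mean(axis=0, keepdims=True)
    # Squared singular values of the centered trajectory are proportional
    # to the PCA variances (eigenvalues of the trajectory covariance).
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()

ratios = random_walk_pca_variance()
print("variance fraction of the first five components:", ratios[:5])
print("cumulative fraction of the first three:", ratios[:3].sum())
```

Plotting the projection of `walk` onto pairs of leading components (not shown here) is one way to look for the Lissajous-like curves the abstract mentions.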
Note on Equivalence Between Recurrent Neural Network Time Series Models and Variational Bayesian Models
We observe that the standard log likelihood training objective for a
Recurrent Neural Network (RNN) model of time series data is equivalent to a
variational Bayesian training objective, given the proper choice of generative
and inference models. This perspective may motivate extensions to both RNNs and
variational Bayesian models. We propose one such extension, where multiple
particles are used for the hidden state of an RNN, allowing a natural
representation of uncertainty or multimodality.
Efficient and optimal binary Hopfield associative memory storage using minimum probability flow
We present an algorithm to store binary memories in a Hopfield neural network
using minimum probability flow, a recent technique to fit parameters in
energy-based probabilistic models. In the case of memories without noise, our
algorithm provably achieves optimal pattern storage (which we show is at least
one pattern per neuron) and outperforms classical methods both in speed and
memory recovery. Moreover, when trained on noisy or corrupted versions of a
fixed set of binary patterns, our algorithm finds networks which correctly
store the originals. We also demonstrate this finding visually with the
unsupervised storage and clean-up of large binary fingerprint images from
significantly corrupted samples.
Comment: 6 pages, 4 figures, 2012 Neural Information Processing Systems (NIPS)
workshop on Discrete Optimization in Machine Learning (DISCML)
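A minimal sketch of fitting Hopfield weights with a minimum-probability-flow style objective, under several assumptions that are not from the abstract: states are +1/-1, the energy is E(x) = -0.5 * x^T W x with symmetric W and zero diagonal, there are no thresholds, neighbours are single bit flips, and plain gradient descent is used. All function names and hyperparameters are hypothetical. With these choices, flipping bit i changes the energy by 2 * x_i * (W x)_i, so each flow term is exp(-x_i * (W x)_i).

```python
import numpy as np

def mpf_objective_and_grad(W, patterns):
    """Flow objective over single-bit-flip neighbours and its symmetrized gradient."""
    margins = patterns * (patterns @ W)         # x_i * (W x)_i, shape (P, N)
    flows = np.exp(-margins)                    # exp((E(x) - E(x with bit i flipped)) / 2)
    objective = flows.sum() / len(patterns)
    grad = -(flows * patterns).T @ patterns / len(patterns)
    grad = 0.5 * (grad + grad.T)                # keep W symmetric
    np.fill_diagonal(grad, 0.0)                 # no self-connections
    return objective, grad

def train_hopfield_mpf(patterns, n_iters=500, lr=0.05):
    n = patterns.shape[1]
    W = np.zeros((n, n))
    for _ in range(n_iters):
        _, grad = mpf_objective_and_grad(W, patterns)
        W -= lr * grad
    return W

def recall(W, state, n_sweeps=5):
    """Asynchronous sign updates starting from a (possibly corrupted) state."""
    state = state.copy()
    for _ in range(n_sweeps):
        for i in range(len(state)):
            state[i] = 1.0 if W[i] @ state >= 0 else -1.0
    return state

rng = np.random.default_rng(0)
patterns = rng.choice([-1.0, 1.0], size=(3, 32))   # 3 random memories, 32 neurons
W = train_hopfield_mpf(patterns)
noisy = patterns[0].copy()
noisy[:2] *= -1.0                                   # corrupt two bits
recovered = recall(W, noisy)
print("bits differing from the stored pattern:", int(np.sum(recovered != patterns[0])))
```

The objective is a sum of exponentials of linear functions of W, so it is convex in the weights, which is why simple gradient descent is a reasonable choice for this toy setting.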
A universal tradeoff between power, precision and speed in physical communication
Maximizing the speed and precision of communication while minimizing power
dissipation is a fundamental engineering design goal. Also, biological systems
achieve remarkable speed, precision and power efficiency using poorly
understood physical design principles. Powerful theories like information
theory and thermodynamics do not provide general limits on power, precision and
speed. Here we go beyond these classical theories to prove that the product of
precision and speed is universally bounded by power dissipation in any physical
communication channel whose dynamics is faster than that of the signal.
Moreover, our derivation involves a novel connection between friction and
information geometry. These results may yield insight into both the engineering
design of communication devices and the structure and function of biological
signaling systems.
Comment: 15 pages, 3 figures
Density estimation using Real NVP
Unsupervised learning of probabilistic models is a central yet challenging
problem in machine learning. Specifically, designing models with tractable
learning, sampling, inference and evaluation is crucial in solving this task.
We extend the space of such models using real-valued non-volume preserving
(real NVP) transformations, a set of powerful invertible and learnable
transformations, resulting in an unsupervised learning algorithm with exact
log-likelihood computation, exact sampling, exact inference of latent
variables, and an interpretable latent space. We demonstrate its ability to
model natural images on four datasets through sampling, log-likelihood
evaluation and latent variable manipulations.
Comment: 10 pages of main content, 3 pages of bibliography, 18 pages of
appendix. Accepted at ICLR 201
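To illustrate how an invertible transformation can give an exact log-likelihood, here is a minimal sketch of an affine coupling transform of the kind used in real NVP models. The scale and shift functions are toy linear maps (an assumption made for brevity; in practice they would be neural networks), and all names are hypothetical. Half of the input passes through unchanged, which makes both the inverse and the log-determinant of the Jacobian cheap to compute.

```python
import numpy as np

def scale_and_shift(x1, w_s, w_t):
    # Toy stand-ins for the learned scale and translation functions (assumption).
    s = np.tanh(x1 @ w_s)
    t = x1 @ w_t
    return s, t

def coupling_forward(x, w_s, w_t):
    d = x.shape[1] // 2
    x1, x2 = x[:, :d], x[:, d:]
    s, t = scale_and_shift(x1, w_s, w_t)
    y2 = x2 * np.exp(s) + t                  # elementwise affine map of the second half
    log_det = s.sum(axis=1)                  # log |det Jacobian| of the transform
    return np.concatenate([x1, y2], axis=1), log_det

def coupling_inverse(y, w_s, w_t):
    d = y.shape[1] // 2
    y1, y2 = y[:, :d], y[:, d:]
    s, t = scale_and_shift(y1, w_s, w_t)     # y1 equals x1, so s and t are recoverable
    x2 = (y2 - t) * np.exp(-s)
    return np.concatenate([y1, x2], axis=1)

rng = np.random.default_rng(0)
x = rng.standard_normal((5, 4))
w_s = rng.standard_normal((2, 2))
w_t = rng.standard_normal((2, 2))
y, log_det = coupling_forward(x, w_s, w_t)
x_back = coupling_inverse(y, w_s, w_t)
print("max reconstruction error:", np.abs(x - x_back).max())
```

Given a base density on y, the log-density of x would be the base log-density of y plus `log_det`, by the change-of-variables formula.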
Analyzing noise in autoencoders and deep networks
Autoencoders have emerged as a useful framework for unsupervised learning of
internal representations, and a wide variety of apparently conceptually
disparate regularization techniques have been proposed to generate useful
features. Here we extend existing denoising autoencoders to additionally inject
noise before the nonlinearity, and at the hidden unit activations. We show that
a wide variety of previous methods, including denoising, contractive, and
sparse autoencoders, as well as dropout, can be interpreted using this
framework. This noise injection framework reaps practical benefits by providing
a unified strategy to develop new internal representations by designing the
nature of the injected noise. We show that noisy autoencoders outperform
denoising autoencoders at the very task of denoising, and are competitive with
other single-layer techniques on MNIST and CIFAR-10. We also show that types
of noise other than dropout improve performance in a deep network through
sparsifying, decorrelating, and spreading information across representations.
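The three noise-injection points named above (the input, the pre-activation, and the hidden activations) can be sketched as a single encoder pass. This is a minimal illustration, assuming additive Gaussian noise, a tanh nonlinearity, and a single layer; the noise type, scales, and all names are assumptions rather than the paper's settings.

```python
import numpy as np

def noisy_encoder(x, W, b, rng, sigma_in=0.1, sigma_pre=0.1, sigma_hidden=0.1):
    """Single-layer encoder with noise injected at three points."""
    x_noisy = x + sigma_in * rng.standard_normal(x.shape)           # input noise (denoising-style)
    pre = x_noisy @ W + b
    pre_noisy = pre + sigma_pre * rng.standard_normal(pre.shape)    # noise before the nonlinearity
    hidden = np.tanh(pre_noisy)
    return hidden + sigma_hidden * rng.standard_normal(hidden.shape)  # noise at hidden activations

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 16))             # batch of 8 toy inputs
W = 0.1 * rng.standard_normal((16, 4))
b = np.zeros(4)
h = noisy_encoder(x, W, b, rng)
print("hidden representation shape:", h.shape)
```

Setting a given sigma to zero switches off that injection point, so the same function covers a plain autoencoder and the purely input-noise (denoising) case.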
Fast large-scale optimization by unifying stochastic gradient and quasi-Newton methods
We present an algorithm for minimizing a sum of functions that combines the
computational efficiency of stochastic gradient descent (SGD) with the second
order curvature information leveraged by quasi-Newton methods. We unify these
disparate approaches by maintaining an independent Hessian approximation for
each contributing function in the sum. We maintain computational tractability
and limit memory requirements even for high dimensional optimization problems
by storing and manipulating these quadratic approximations in a shared, time
evolving, low dimensional subspace. Each update step requires only a single
contributing function or minibatch evaluation (as in SGD), and each step is
scaled using an approximate inverse Hessian, and little to no adjustment of
hyperparameters is required (as is typical for quasi-Newton methods). This
algorithm contrasts with earlier stochastic second order techniques that treat
the Hessian of each contributing function as a noisy approximation to the full
Hessian, rather than as a target for direct estimation. We experimentally
demonstrate improved convergence on seven diverse optimization problems. The
algorithm is released as open source Python and MATLAB packages.
Generalizing Hamiltonian Monte Carlo with Neural Networks
We present a general-purpose method to train Markov chain Monte Carlo
kernels, parameterized by deep neural networks, that converge and mix quickly
to their target distribution. Our method generalizes Hamiltonian Monte Carlo
and is trained to maximize expected squared jumped distance, a proxy for mixing
speed. We demonstrate large empirical gains on a collection of simple but
challenging distributions, for instance achieving a 106x improvement in
effective sample size in one case, and mixing when standard HMC makes no
measurable progress in a second. Finally, we show quantitative and qualitative
gains on a real-world task: latent-variable generative modeling. We release an
open source TensorFlow implementation of the algorithm.
Comment: ICLR 201
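For context, the baseline being generalized can be sketched as follows. This is standard Hamiltonian Monte Carlo with a leapfrog integrator, not the learned sampler from the paper, run on a standard Gaussian target chosen purely for illustration. It also reports an empirical mean squared jump between successive states, the kind of quantity the abstract uses as a proxy for mixing speed. Step size, trajectory length, and all names are assumptions.

```python
import numpy as np

def neg_log_density(x):
    return 0.5 * np.sum(x ** 2)          # standard Gaussian target (assumption)

def grad_neg_log_density(x):
    return x

def leapfrog(x, p, step_size, n_leapfrog):
    p = p - 0.5 * step_size * grad_neg_log_density(x)       # initial half step in momentum
    for i in range(n_leapfrog):
        x = x + step_size * p
        if i < n_leapfrog - 1:
            p = p - step_size * grad_neg_log_density(x)
    p = p - 0.5 * step_size * grad_neg_log_density(x)       # final half step in momentum
    return x, p

def run_hmc(n_samples=2000, dim=2, step_size=0.3, n_leapfrog=10, seed=0):
    rng = np.random.default_rng(seed)
    x = np.zeros(dim)
    samples, jumps, n_accepted = [], [], 0
    for _ in range(n_samples):
        p = rng.standard_normal(dim)
        x_new, p_new = leapfrog(x, p, step_size, n_leapfrog)
        h_old = neg_log_density(x) + 0.5 * np.sum(p ** 2)
        h_new = neg_log_density(x_new) + 0.5 * np.sum(p_new ** 2)
        accept = rng.random() < np.exp(min(0.0, h_old - h_new))   # Metropolis correction
        x_next = x_new if accept else x
        jumps.append(np.sum((x_next - x) ** 2))
        n_accepted += int(accept)
        x = x_next
        samples.append(x)
    return np.array(samples), float(np.mean(jumps)), n_accepted / n_samples

samples, mean_squared_jump, accept_rate = run_hmc()
print("acceptance rate:", accept_rate)
print("mean squared jump distance:", mean_squared_jump)
```

A learned sampler in the spirit of the abstract would replace the fixed leapfrog update with a parameterized one and tune those parameters to increase the jump statistic; that training loop is outside the scope of this sketch.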