Pushing Stochastic Gradient towards Second-Order Methods -- Backpropagation Learning with Transformations in Nonlinearities
Recently, we proposed to transform the outputs of each hidden neuron in a
multi-layer perceptron network to have zero output and zero slope on average,
and use separate shortcut connections to model the linear dependencies instead.
We continue the work by first introducing a third transformation that normalizes
the scale of the outputs of each hidden neuron, and second by analyzing the
connections to second-order optimization methods. We show that the
transformations make a simple stochastic gradient behave closer to second-order
optimization methods and thus speed up learning. This is shown both in theory
and with experiments. The experiments on the third transformation show that
while it further increases the speed of learning, it can also hurt performance
by converging to a worse local optimum, where both the inputs and outputs of
many hidden neurons are close to zero.
Comment: 10 pages, 5 figures, ICLR201
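A minimal sketch of the idea behind these transformations (an illustration under stated assumptions, not the authors' code): for a tanh hidden unit, add a linear term and a bias chosen so that the batch-average slope and batch-average output are zero, then rescale to unit output scale. The function name, the choice of tanh, and the exact rescaling are assumptions made for the example.

```python
import numpy as np

def transform_tanh(x):
    """Illustrative sketch: transform a tanh hidden unit so that, over the
    batch, its output has zero mean, zero average slope with respect to its
    input, and unit scale.

        g(x) = gamma * (tanh(x) + alpha * x + beta)

    alpha cancels the average slope, beta the average output, and gamma
    normalizes the output scale (the third transformation); the linear
    dependencies removed here are assumed to be carried by separate
    shortcut connections, as in the paper."""
    f = np.tanh(x)
    df = 1.0 - f ** 2                      # derivative of tanh
    alpha = -df.mean()                     # zero average slope
    beta = -(f + alpha * x).mean()         # zero average output
    g = f + alpha * x + beta
    gamma = 1.0 / (g.std() + 1e-8)         # unit output scale
    return gamma * g

# Example: a batch of pre-activations for one hidden unit
x = np.random.randn(256)
y = transform_tanh(x)
print(y.mean(), y.std())                   # approximately 0 and 1 by construction
```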
Metric-Free Natural Gradient for Joint-Training of Boltzmann Machines
This paper introduces the Metric-Free Natural Gradient (MFNG) algorithm for
training Boltzmann Machines. Similar in spirit to the Hessian-Free method of
Martens [8], our algorithm belongs to the family of truncated Newton methods
and exploits an efficient matrix-vector product to avoid explicitly storing
the natural gradient metric. This metric is shown to be the expected second
derivative of the log-partition function (under the model distribution), or
equivalently, the variance of the vector of partial derivatives of the energy
function. We evaluate our method on the task of joint-training a 3-layer Deep
Boltzmann Machine and show that MFNG does indeed have faster per-epoch
convergence compared to Stochastic Maximum Likelihood with centering, though
wall-clock performance is currently not competitive.
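The computational point is that the metric never needs to be stored: since it is the variance of per-sample energy gradients, a metric-vector product can be estimated directly from model samples and fed to a truncated-Newton (conjugate-gradient) solve. A rough numpy sketch under that reading follows; metric_vector_product, conjugate_gradient, and the small damping term are illustrative assumptions, not the paper's implementation.

```python
import numpy as np

def metric_vector_product(dE, v):
    """dE: (N, P) matrix whose rows are per-sample partial derivatives of the
    energy w.r.t. the P parameters, drawn from the model distribution.
    Estimates G @ v where G = Cov(dE), without ever forming the P x P metric."""
    centered = dE - dE.mean(axis=0)
    return centered.T @ (centered @ v) / dE.shape[0]

def conjugate_gradient(matvec, b, iters=20, tol=1e-6):
    """Truncated-Newton style CG solve of G x = b using only matrix-vector products."""
    x = np.zeros_like(b)
    r = b - matvec(x)
    p = r.copy()
    rs = r @ r
    for _ in range(iters):
        Ap = matvec(p)
        alpha = rs / (p @ Ap)
        x += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return x

# Toy usage: 512 sampled energy gradients over 100 parameters; the 1e-4
# damping term is an assumed regularizer for conditioning.
dE = np.random.randn(512, 100)
grad = np.random.randn(100)
natural_grad = conjugate_gradient(lambda v: metric_vector_product(dE, v) + 1e-4 * v, grad)
```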
Generalized Batch Normalization: Towards Accelerating Deep Neural Networks
Utilizing recently introduced concepts from statistics and quantitative risk
management, we present a general variant of Batch Normalization (BN) that
offers accelerated convergence of Neural Network training compared to
conventional BN. In general, we show that mean and standard deviation are not
always the most appropriate choice for the centering and scaling procedure
within the BN transformation, particularly if ReLU follows the normalization
step. We present a Generalized Batch Normalization (GBN) transformation, which
can utilize a variety of alternative deviation measures for scaling and
statistics for centering, choices which naturally arise from the theory of
generalized deviation measures and risk theory in general. When used in
conjunction with the ReLU non-linearity, the underlying risk theory suggests
natural, arguably optimal choices for the deviation measure and statistic.
Utilizing the suggested deviation measure and statistic, we show experimentally
that training is accelerated more than with conventional BN, often with
an improved error rate as well. Overall, we propose a more flexible BN
transformation supported by a complementary theoretical framework that can
potentially guide design choices.
Comment: accepted at AAAI-1
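One way to read the GBN transformation is as the usual BN recipe with the centering statistic and the deviation measure made pluggable. The sketch below is illustrative only: gbn_forward is an assumed helper, and the median / lower-semideviation pairing is one plausible choice in the spirit of the abstract, not necessarily the one the paper's risk-theoretic analysis recommends.

```python
import numpy as np

def gbn_forward(x, center_fn, deviation_fn, gamma=1.0, beta=0.0, eps=1e-5):
    """Generic normalization: subtract a centering statistic and divide by a
    deviation measure, both computed per feature over the batch, then apply
    the usual learnable affine transform (gamma, beta)."""
    c = center_fn(x)
    d = deviation_fn(x, c)
    return gamma * (x - c) / (d + eps) + beta

# Conventional BN corresponds to mean centering and standard-deviation scaling.
bn = lambda x: gbn_forward(
    x,
    center_fn=lambda x: x.mean(axis=0),
    deviation_fn=lambda x, c: x.std(axis=0),
)

# An alternative pairing in the spirit of GBN (an assumed example choice):
# median centering with a lower-semideviation scale, which emphasizes the
# left tail that a following ReLU would clip.
gbn = lambda x: gbn_forward(
    x,
    center_fn=lambda x: np.median(x, axis=0),
    deviation_fn=lambda x, c: np.sqrt(np.mean(np.minimum(x - c, 0.0) ** 2, axis=0)),
)

batch = np.random.randn(128, 32)           # (batch, features) pre-activations
print(bn(batch).shape, gbn(batch).shape)
```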
Learning Generative Models across Incomparable Spaces
Generative Adversarial Networks have shown remarkable success in learning a
distribution that faithfully recovers a reference distribution in its entirety.
However, in some cases, we may want to only learn some aspects (e.g., cluster
or manifold structure), while modifying others (e.g., style, orientation or
dimension). In this work, we propose an approach to learn generative models
across such incomparable spaces, and demonstrate how to steer the learned
distribution towards target properties. A key component of our model is the
Gromov-Wasserstein distance, a notion of discrepancy that compares
distributions relationally rather than absolutely. While this framework
subsumes current generative models in identically reproducing distributions,
its inherent flexibility allows application to tasks in manifold learning,
relational learning and cross-domain learning.
Comment: International Conference on Machine Learning (ICML
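To make "relational rather than absolute" concrete: the Gromov-Wasserstein objective never compares points across the two spaces, only their intra-space distance matrices through a coupling T. Below is a small numpy sketch of that objective for a fixed coupling; gw_objective is an assumed helper, and the expansion used to avoid the explicit four-index sum relies on the distance matrices being symmetric.

```python
import numpy as np
from scipy.spatial.distance import cdist

def gw_objective(C1, C2, T):
    """Gromov-Wasserstein objective for a fixed coupling T:
        sum_{i,j,k,l} (C1[i,k] - C2[j,l])**2 * T[i,j] * T[k,l]
    C1, C2 are intra-space pairwise distance matrices, so the two samples are
    compared only through their internal geometry, never through a shared metric."""
    p = T.sum(axis=1)                        # marginal on the first space
    q = T.sum(axis=0)                        # marginal on the second space
    term1 = p @ (C1 ** 2) @ p                # sum_ik C1[i,k]^2 p_i p_k
    term2 = q @ (C2 ** 2) @ q                # sum_jl C2[j,l]^2 q_j q_l
    cross = np.sum(T * (C1 @ T @ C2))        # sum_ijkl C1[i,k] C2[j,l] T[i,j] T[k,l]
    return term1 + term2 - 2.0 * cross

# Two samples living in incomparable spaces (2-D vs. 3-D); only their
# intra-space distance matrices enter the comparison.
x = np.random.randn(50, 2)
y = np.random.randn(50, 3)
C1 = cdist(x, x)
C2 = cdist(y, y)
T = np.full((50, 50), 1.0 / (50 * 50))       # independent (uninformative) coupling
print(gw_objective(C1, C2, T))
```

In a full method the coupling T would itself be optimized (subject to its marginal constraints) rather than fixed as here; the sketch only shows what is being measured.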