Variational Inference with Numerical Derivatives: variance reduction through coupling
The Black Box Variational Inference (Ranganath et al. (2014)) algorithm
provides a universal method for Variational Inference, but taking advantage of
special properties of the approximation family or of the target can improve the
convergence speed significantly. For example, if the approximation family is a
transformation family, such as a Gaussian, then switching to the
reparameterization gradient (Kingma and Welling (2014)) often yields a major
reduction in gradient variance. Ultimately, reducing the variance can reduce
the computational cost and yield better approximations.
We present a new method to extend the reparameterization trick to more
general exponential families including the Wishart, Gamma, and Student
distributions. Variational Inference with Numerical Derivatives (VIND)
approximates the gradient with numerical derivatives and reduces its variance
using a tight coupling of the approximation family. The resulting algorithm is
simple to implement and can profit from widely known couplings. Our experiments
confirm that VIND effectively decreases the gradient variance and therefore
improves the posterior approximation in relevant cases. It thus provides an
efficient yet simple Variational Inference method for computing non-Gaussian
approximations. Comment: Under review (NeurIPS 2019).
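As a rough illustration of the idea behind VIND (not the paper's algorithm itself), the sketch below estimates the derivative of a Gamma expectation with a central finite difference in which both evaluations reuse the same uniform draws through the inverse CDF; that shared-noise coupling is what keeps the variance of the difference small. The Gamma target, the function f and all constants are illustrative choices, not taken from the paper.

```python
import numpy as np
from scipy.stats import gamma

def expected_f(shape, u, f):
    """Monte Carlo estimate of E_{x ~ Gamma(shape, 1)}[f(x)],
    reusing the same uniforms u (the coupling) via the inverse CDF."""
    x = gamma.ppf(u, a=shape)              # same u -> tightly coupled samples
    return f(x).mean()

def coupled_fd_gradient(shape, f, n_samples=10_000, eps=1e-4, rng=None):
    """Central finite-difference estimate of d/d(shape) E[f(x)].
    Both evaluations share one batch of uniforms, so most of the
    Monte Carlo noise cancels in the difference."""
    rng = np.random.default_rng() if rng is None else rng
    u = rng.uniform(size=n_samples)
    plus = expected_f(shape + eps, u, f)
    minus = expected_f(shape - eps, u, f)
    return (plus - minus) / (2 * eps)

# Example: d/d(shape) E[x] for Gamma(shape, 1) is exactly 1.
print(coupled_fd_gradient(2.0, f=lambda x: x))   # ~1.0 with low variance
```

Drawing fresh uniforms for each of the two evaluations would make the same estimator extremely noisy, which is the point the abstract makes about couplings.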
Backprop-Q: Generalized Backpropagation for Stochastic Computation Graphs
In real-world scenarios, it is appealing to learn a model carrying out
stochastic operations internally, known as stochastic computation graphs
(SCGs), rather than learning a deterministic mapping. However, standard
backpropagation is not applicable to SCGs. We attempt to address this issue
from the angle of cost propagation, with local surrogate costs, called
Q-functions, constructed and learned for each stochastic node in an SCG. Then,
the SCG can be trained based on these surrogate costs using standard
backpropagation. We propose the entire framework as a solution to generalize
backpropagation for SCGs, which resembles an actor-critic architecture but
based on a graph. For broad applicability, we study a variety of SCG structures
from one cost to multiple costs. We utilize recent advances in reinforcement
learning (RL) and variational Bayes (VB), such as off-policy critic learning
and unbiased-and-low-variance gradient estimation, and review them in the
context of SCGs. The generalized backpropagation extends transported learning
signals beyond gradients between stochastic nodes while preserving the benefit
of backpropagating gradients through deterministic nodes. Experimental
suggestions and concerns are listed to help design and test any specific model
using this framework. Comment: NeurIPS 2018 Deep Reinforcement Learning Workshop.
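The cost-propagation idea can be caricatured with a single stochastic node. The toy below is not the Backprop-Q framework: it merely learns a local surrogate cost (a "Q-function") for each outcome of one categorical node by a running average, and uses it, in place of the raw sampled cost, as the learning signal in a score-function update. The costs, step sizes and node structure are made up for the example.

```python
import numpy as np

rng = np.random.default_rng(0)

def true_cost(k):
    """Downstream cost, observed only through noisy samples."""
    return np.array([3.0, 1.0, 2.0])[k] + 0.5 * rng.standard_normal()

logits = np.zeros(3)          # parameters of the stochastic node
q = np.zeros(3)               # local surrogate cost ("critic") per outcome

for step in range(5000):
    p = np.exp(logits - logits.max())
    p /= p.sum()
    k = rng.choice(3, p=p)
    c = true_cost(k)

    # Critic update: move the surrogate toward the observed cost.
    q[k] += 0.05 * (c - q[k])

    # Actor update: score-function gradient, using the surrogate
    # (minus its mean under p as a baseline) as the learning signal.
    grad_logp = -p
    grad_logp[k] += 1.0
    logits -= 0.05 * (q[k] - q @ p) * grad_logp

print(p)   # most of the mass should end up on outcome 1 (lowest cost)
```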
KF-LAX: Kronecker-factored curvature estimation for control variate optimization in reinforcement learning
A key challenge for gradient based optimization methods in model-free
reinforcement learning is to develop an approach that is sample efficient and
has low variance. In this work, we apply Kronecker-factored curvature
estimation technique (KFAC) to a recently proposed gradient estimator for
control variate optimization, RELAX, to increase the sample efficiency of using
this gradient estimation method in reinforcement learning. The performance of
the proposed method is demonstrated on a synthetic problem and a set of three
discrete-control Atari games.
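KF-LAX itself (a learned control variate whose parameters are optimized with Kronecker-factored curvature estimates) does not fit in a few lines, but the role of a control variate in a score-function gradient can be shown generically. The sketch below uses a hand-picked constant baseline, a far weaker control variate than RELAX's learned surrogate; it only demonstrates that subtracting a baseline leaves the gradient unbiased while shrinking its variance.

```python
import numpy as np

rng = np.random.default_rng(0)
theta = 0.3
f = lambda b: (b - 0.45) ** 2            # toy downstream cost

def grad_samples(n, baseline=0.0):
    """Score-function gradient samples for d/d(theta) E[f(b)], with
    b ~ Bernoulli(sigmoid(theta)). A constant baseline is a valid
    control variate because E[d/d(theta) log p(b)] = 0."""
    p = 1.0 / (1.0 + np.exp(-theta))
    b = (rng.uniform(size=n) < p).astype(float)
    grad_logp = b - p                    # d/d(theta) log Bernoulli(b; sigmoid(theta))
    return (f(b) - baseline) * grad_logp

plain = grad_samples(100_000)
cv = grad_samples(100_000, baseline=0.25)   # hand-picked constant near E[f(b)]
print(plain.mean(), plain.var())
print(cv.mean(), cv.var())                  # same mean, much lower variance
```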
Probabilistic Binary Neural Networks
Low bit-width weights and activations are an effective way of combating the
increasing need for both memory and compute power of Deep Neural Networks. In
this work, we present a probabilistic training method for Neural Networks with
both binary weights and activations, called BLRNet. By embracing stochasticity
during training, we circumvent the need to approximate the gradient of
non-differentiable functions such as sign(), while still obtaining a fully
Binary Neural Network at test time. Moreover, it allows for anytime ensemble
predictions for improved performance and uncertainty estimates by sampling from
the weight distribution. Since all operations in a layer of the BLRNet operate
on random variables, we introduce stochastic versions of Batch Normalization
and max pooling, which transfer well to a deterministic network at test time.
We evaluate the BLRNet on multiple standardized benchmarks.
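The anytime-ensemble behaviour the abstract describes can be sketched independently of the training procedure. In the toy below, the learned Bernoulli probabilities `w_prob`, the single binary layer and the sample counts are placeholders; the point is only that each forward pass draws a fresh binary weight configuration and that averaging more passes refines the prediction.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical learned parameters: a probability per weight of being +1.
w_prob = rng.uniform(0.2, 0.8, size=(16, 10))    # 16 inputs -> 10 classes

def predict(x, n_samples=8):
    """Anytime ensemble: each sample draws binary weights in {-1, +1}
    from the learned Bernoulli probabilities; averaging more samples
    gives a smoother ensemble prediction."""
    probs = []
    for _ in range(n_samples):
        w = np.where(rng.uniform(size=w_prob.shape) < w_prob, 1.0, -1.0)
        logits = np.sign(x) @ w                   # binary activations, binary weights
        e = np.exp(logits - logits.max())
        probs.append(e / e.sum())
    return np.mean(probs, axis=0)

x = rng.standard_normal(16)
print(predict(x, n_samples=1))
print(predict(x, n_samples=32))
```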
A Review of Learning with Deep Generative Models from Perspective of Graphical Modeling
This document aims to provide a review of learning with deep generative
models (DGMs), a highly active area in machine learning and, more
generally, artificial intelligence. This review is not meant to be a tutorial,
but when necessary, we provide self-contained derivations for completeness.
This review has two features. First, though there are different perspectives to
classify DGMs, we choose to organize this review from the perspective of
graphical modeling, because the learning methods for directed DGMs and
undirected DGMs are fundamentally different. Second, we differentiate model
definitions from model learning algorithms, since different learning algorithms
can be applied to solve the learning problem on the same model, and an
algorithm can be applied to learn different models. We thus separate model
definition and model learning, with more emphasis on reviewing, differentiating
and connecting different learning algorithms. We also discuss promising future
research directions. Comment: add SN-GANs, SA-GANs, conditional generation
(cGANs, AC-GANs). arXiv admin note: text overlap with arXiv:1606.00709,
arXiv:1801.03558 by other authors.
A New Distribution on the Simplex with Auto-Encoding Applications
We construct a new distribution for the simplex using the Kumaraswamy
distribution and an ordered stick-breaking process. We explore and develop the
theoretical properties of this new distribution and prove that it exhibits
symmetry under the same conditions as the well-known Dirichlet. Like the
Dirichlet, the new distribution is adept at capturing sparsity but, unlike the
Dirichlet, has an exact and closed-form reparameterization, making it well
suited for deep variational Bayesian modeling. We demonstrate the
distribution's utility in a variety of semi-supervised auto-encoding tasks. In
all cases, the resulting models achieve competitive performance commensurate
with their simplicity, use of explicit probability models, and abstinence from
adversarial training. Comment: 15 pages, 6 figures, 1 table.
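A minimal sketch of the kind of closed-form reparameterization the abstract refers to: Kumaraswamy draws come from inverting the CDF, and an ordered stick-breaking map turns K-1 such fractions into a point on the simplex. The exact ordering and parameterization used in the paper may differ; the code is only meant to show that the sample is an explicit, differentiable function of the parameters and uniform noise.

```python
import numpy as np

def kumaraswamy_sample(a, b, u):
    """Reparameterized Kumaraswamy(a, b) draw via the closed-form inverse CDF."""
    return (1.0 - (1.0 - u) ** (1.0 / b)) ** (1.0 / a)

def stick_breaking_simplex(a, b, rng):
    """Map K-1 Kumaraswamy fractions to a point on the K-simplex.
    Because the inverse CDF is closed form, the whole map is an exact
    function of (a, b) and the uniform noise u."""
    u = rng.uniform(size=len(a))
    v = kumaraswamy_sample(np.asarray(a), np.asarray(b), u)
    pieces, remaining = [], 1.0
    for vi in v:
        pieces.append(vi * remaining)
        remaining *= (1.0 - vi)
    pieces.append(remaining)                 # last piece closes the simplex
    return np.array(pieces)

rng = np.random.default_rng(0)
x = stick_breaking_simplex(a=[1.0, 2.0, 0.5], b=[1.0, 1.0, 1.0], rng=rng)
print(x, x.sum())                            # nonnegative entries, sums to 1
```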
Training recurrent networks online without backtracking
We introduce the "NoBackTrack" algorithm to train the parameters of dynamical
systems such as recurrent neural networks. This algorithm works in an online,
memoryless setting, thus requiring no backpropagation through time, and is
scalable, avoiding the large computational and memory cost of maintaining the
full gradient of the current state with respect to the parameters.
The algorithm essentially maintains, at each time, a single search direction
in parameter space. The evolution of this search direction is partly stochastic
and is constructed in such a way to provide, at every time, an unbiased random
estimate of the gradient of the loss function with respect to the parameters.
Because the gradient estimate is unbiased, on average over time the parameter
is updated as it should.
The resulting gradient estimate can then be fed to a lightweight Kalman-like
filter to yield an improved algorithm. For recurrent neural networks, the
resulting algorithms scale linearly with the number of parameters.
Small-scale experiments confirm the suitability of the approach, showing that
the stochastic approximation of the gradient introduced in the algorithm is not
detrimental to learning. In particular, the Kalman-like version of NoBackTrack
is superior to backpropagation through time (BPTT) when the time span of
dependencies in the data is longer than the truncation span for BPTT.
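The ingredient that makes NoBackTrack memoryless is an unbiased, low-rank compression of the (never materialized) gradient of the current state with respect to the parameters. The sketch below only verifies the random-sign rank-one reduction that underlies such a compression; the full algorithm additionally rescales the terms to reduce variance and interleaves this with the network update, none of which is shown here.

```python
import numpy as np

rng = np.random.default_rng(0)

# A sum of outer products, standing in for a Jacobian accumulated over time.
vs = rng.standard_normal((5, 4))
ws = rng.standard_normal((5, 7))
exact = sum(np.outer(v, w) for v, w in zip(vs, ws))

def rank_one_estimate():
    """Collapse the sum to a single outer product using random signs.
    E[s_i s_j] = delta_ij, so the cross terms vanish in expectation and
    the rank-one estimate is unbiased for the full sum."""
    s = rng.choice([-1.0, 1.0], size=len(vs))
    return np.outer(s @ vs, s @ ws)

n = 100_000
avg = sum(rank_one_estimate() for _ in range(n)) / n
print(np.abs(avg - exact).max())    # small: the estimator is unbiased
```

Because the estimate is unbiased, it can replace the exact Jacobian in a stochastic-gradient update without introducing systematic error, which is the property the abstract relies on.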
A Comprehensive guide to Bayesian Convolutional Neural Network with Variational Inference
Artificial Neural Networks are connectionist systems that perform a given
task by learning on examples without having prior knowledge about the task.
This is done by finding an optimal point estimate for the weights in every
node. Generally, networks that use point estimates as weights perform well with
large datasets, but they fail to express uncertainty in regions with little or
no data, leading to overconfident decisions.
In this paper, a Bayesian Convolutional Neural Network (BayesCNN) using
Variational Inference is proposed, which introduces a probability distribution
over the weights. Furthermore, the proposed BayesCNN architecture is applied to
tasks like Image Classification, Image Super-Resolution and Generative
Adversarial Networks. The results are compared to point-estimates based
architectures on the MNIST, CIFAR-10 and CIFAR-100 datasets for the Image
Classification task, on the BSD300 dataset for the Image Super-Resolution task,
and again on the CIFAR-10 dataset for the Generative Adversarial Network task.
BayesCNN is based on Bayes by Backprop which derives a variational
approximation to the true posterior. We, therefore, introduce the idea of
applying two convolutional operations, one for the mean and one for the
variance. Our proposed method not only achieves performance equivalent to
frequentist inference in identical architectures but also incorporates a
measure of uncertainty and regularisation. It further eliminates the use of
dropout in the model. Moreover, we predict how certain the model prediction is
based on the epistemic and aleatoric uncertainties and empirically show how the
uncertainty can decrease, allowing the decisions made by the network to become
more deterministic as the training accuracy increases. Finally, we propose ways
to prune the Bayesian architecture and to make it more computationally and time
efficient. Comment: arXiv admin note: text overlap with arXiv:1506.02158,
arXiv:1703.04977 by other authors.
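The "two convolutional operations, one for the mean and one for the variance" can be illustrated for a single-channel layer with independent Gaussian weights: one convolution with the weight means gives the pre-activation mean, a second convolution of the squared input with the weight variances gives the pre-activation variance, and a sample is drawn from the resulting Gaussian. The filter size, parameter values and use of scipy are illustrative choices, not the paper's architecture.

```python
import numpy as np
from scipy.signal import correlate2d

rng = np.random.default_rng(0)

# Hypothetical variational parameters of one 3x3 Bayesian conv filter:
# each weight w_ij ~ N(mu_ij, sigma_ij^2).
w_mu = rng.standard_normal((3, 3))
w_sigma = 0.1 * np.ones((3, 3))

def bayes_conv(x):
    """For independent Gaussian weights the pre-activation is Gaussian with
      mean = conv(x, mu)   and   var = conv(x**2, sigma**2),
    so one convolution gives the mean and a second gives the variance."""
    out_mean = correlate2d(x, w_mu, mode="valid")
    out_var = correlate2d(x ** 2, w_sigma ** 2, mode="valid")
    eps = rng.standard_normal(out_mean.shape)
    return out_mean + np.sqrt(out_var) * eps     # one reparameterized sample

x = rng.standard_normal((8, 8))
print(bayes_conv(x).shape)                       # (6, 6)
```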
Fast Second-Order Stochastic Backpropagation for Variational Inference
We propose a second-order (Hessian or Hessian-free) based optimization method
for variational inference inspired by Gaussian backpropagation, and argue that
quasi-Newton optimization can be developed as well. This is accomplished by
generalizing the gradient computation in stochastic backpropagation via a
reparametrization trick with lower complexity. As an illustrative example, we
apply this approach to the problems of Bayesian logistic regression and
variational auto-encoder (VAE). Additionally, we compute bounds on the
estimator variance of intractable expectations for the family of Lipschitz
continuous functions. Our method is practical, scalable and model-free. We
demonstrate our method on several real-world datasets and provide comparisons
with other stochastic gradient methods to show substantial enhancement in
convergence rates. Comment: Accepted by NIPS 201
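The paper's second-order machinery is not reproduced here; the sketch below only shows the generic Hessian-free building block it rests on, a Hessian-vector product computed from two gradient evaluations of a reparameterized Monte Carlo objective. The one-dimensional Gaussian variational family, the quadratic target and the constants are toy choices.

```python
import numpy as np

rng = np.random.default_rng(0)
sigma = 0.5

def f_grad(z):
    """Gradient of the toy target f(z) = (z - 1)^2."""
    return 2.0 * (z - 1.0)

def elbo_grad(mu, eps):
    """Reparameterized Monte Carlo gradient of E_{z ~ N(mu, sigma^2)}[f(z)],
    reusing the same noise eps so repeated calls are coupled."""
    return f_grad(mu + sigma * eps).mean()

def hessian_vector_product(mu, v, eps, h=1e-4):
    """Hessian-free building block: H v from two gradient evaluations."""
    return (elbo_grad(mu + h * v, eps) - elbo_grad(mu - h * v, eps)) / (2 * h)

eps = rng.standard_normal(10_000)
print(elbo_grad(0.0, eps))                     # ~ -2.0, the true gradient at mu = 0
print(hessian_vector_product(0.0, 1.0, eps))   # ~ 2.0, the true Hessian is 2
```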
A Classification Supervised Auto-Encoder Based on Predefined Evenly-Distributed Class Centroids
Classic variational autoencoders, which are built on standard function
approximators, are used to learn complex data distributions. In particular,
VAEs have shown promise on many complex tasks. In this paper, a new
autoencoder model - classification supervised autoencoder (CSAE) based on
predefined evenly-distributed class centroids (PEDCC) is proposed. Our method
uses PEDCC of latent variables to train the network to ensure the maximization
of inter-class distance and the minimization of intra-class distance. Instead
of learning the mean/variance of the latent variable distribution and applying
the reparameterization trick of the VAE, the latent variables of CSAE are used
directly for classification and as the input to the decoder. In addition, a new
loss function is proposed that incorporates the classification loss. Based on
the basic structure of a universal autoencoder, we simultaneously achieve good
results for encoding, decoding and classification, together with good model
generalization performance. The theoretical advantages are reflected in the
experimental results. Comment: 16 pages, 12 figures, 4 tables.
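A rough sketch of the kind of objective the abstract describes, with placeholders throughout: the PEDCC centroids are approximated here by random orthonormal directions (the paper's actual construction is different), and the loss simply adds a pull of each latent code toward the fixed centroid of its class to the reconstruction error.

```python
import numpy as np

def evenly_spread_centroids(n_classes, dim, seed=0):
    """Placeholder for PEDCC: random orthonormal directions serve as
    fixed, well-separated class centroids on the unit sphere."""
    rng = np.random.default_rng(seed)
    q, _ = np.linalg.qr(rng.standard_normal((dim, n_classes)))
    return q.T[:n_classes]                       # (n_classes, dim), unit norm

def csae_loss(x, x_recon, z, labels, centroids, alpha=1.0):
    """Reconstruction term plus a pull of each latent code toward the
    predefined centroid of its class (a stand-in for the CSAE objective)."""
    recon = np.mean((x - x_recon) ** 2)
    center = np.mean((z - centroids[labels]) ** 2)
    return recon + alpha * center

centroids = evenly_spread_centroids(n_classes=10, dim=16)
rng = np.random.default_rng(1)
x = rng.standard_normal((4, 32))
z = rng.standard_normal((4, 16))
labels = np.array([0, 3, 3, 7])
print(csae_loss(x, x, z, labels, centroids))     # only the centroid term remains
```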