Reducing Reparameterization Gradient Variance
Optimization with noisy gradients has become ubiquitous in statistics and
machine learning. Reparameterization gradients, or gradient estimates computed
via the "reparameterization trick," represent a class of noisy gradients often
used in Monte Carlo variational inference (MCVI). However, when these gradient
estimators are too noisy, the optimization procedure can be slow or fail to
converge. One way to reduce noise is to use more samples for the gradient
estimate, but this can be computationally expensive. Instead, we view the noisy
gradient as a random variable, and form an inexpensive approximation of the
generating procedure for the gradient sample. This approximation has high
correlation with the noisy gradient by construction, making it a useful control
variate for variance reduction. We demonstrate our approach on non-conjugate
multi-level hierarchical models and a Bayesian neural net, where we observed
gradient variance reductions of multiple orders of magnitude (20-2,000x).
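
To make the control-variate idea concrete, here is a minimal one-dimensional sketch. It is not the construction from the paper: the objective f(z) = sin(z) + z^2/2, the function name `grad_samples`, and the values of `m` and `s` are assumptions chosen for illustration. The noisy gradient is approximated by linearizing f'(z) around the variational mean; that approximation has a known expectation, so subtracting its centered version leaves the estimator unbiased while removing most of the noise.

```python
import numpy as np

def grad_samples(m, s, n, seed=0):
    """Draw n reparameterization-gradient samples of E[f(z)] w.r.t. m,
    with z = m + s * eps and the toy objective f(z) = sin(z) + z**2 / 2
    (an assumed objective, used only for illustration)."""
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(n)
    z = m + s * eps

    # Plain estimator: f'(z) evaluated at each reparameterized sample.
    g = np.cos(z) + z

    # Cheap approximation: first-order Taylor expansion of f' around m.
    # Its expectation over eps is exactly f'(m), because E[eps] = 0.
    f1_at_m = np.cos(m) + m          # f'(m)
    f2_at_m = 1.0 - np.sin(m)        # f''(m)
    g_approx = f1_at_m + f2_at_m * s * eps

    # Control-variate estimator: subtract the centered approximation.
    g_cv = g - (g_approx - f1_at_m)
    return g, g_cv

plain, cv = grad_samples(m=0.5, s=0.3, n=100_000)
print("plain variance:", plain.var())
print("control-variate variance:", cv.var())
```

Both estimators target the same expected gradient; only their variance differs. The size of the reduction in this toy case depends on how nonlinear f' is over the spread of z, and it says nothing about the figures reported in the abstract.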
Variational Dropout and the Local Reparameterization Trick
We investigate a local reparameterization technique for greatly reducing the
variance of stochastic gradients for variational Bayesian inference (SGVB) of a
posterior over model parameters, while retaining parallelizability. This local
reparameterization translates uncertainty about global parameters into local
noise that is independent across datapoints in the minibatch. Such
parameterizations can be trivially parallelized and have variance that is
inversely proportional to the minibatch size, generally leading to much faster
convergence. Additionally, we explore a connection with dropout: Gaussian
dropout objectives correspond to SGVB with local reparameterization, a
scale-invariant prior and proportionally fixed posterior variance. Our method
allows inference of more flexibly parameterized posteriors; specifically, we
propose variational dropout, a generalization of Gaussian dropout where the
dropout rates are learned, often leading to better models. The method is
demonstrated through several experiments.
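
The local reparameterization described above can be sketched for a single linear layer, assuming a fully factorized Gaussian posterior over the weights. In that case the pre-activations are themselves Gaussian, so noise can be sampled per datapoint in activation space rather than once per minibatch in weight space. The function name `local_reparam_layer` and the array shapes are assumptions made for this example.

```python
import numpy as np

def local_reparam_layer(a, w_mean, w_var, rng):
    """Sample pre-activations b = a @ W for W[i, j] ~ N(w_mean[i, j], w_var[i, j]).

    a:      (batch, n_in) input activations
    w_mean: (n_in, n_out) posterior means of the weights
    w_var:  (n_in, n_out) posterior variances of the weights
    """
    # With independent Gaussian weights, each b[m, j] is Gaussian with
    # mean sum_i a[m, i] * w_mean[i, j] and variance sum_i a[m, i]**2 * w_var[i, j].
    act_mean = a @ w_mean
    act_var = (a ** 2) @ w_var

    # One independent noise draw per datapoint and output unit.
    eps = rng.standard_normal(act_mean.shape)
    return act_mean + np.sqrt(act_var) * eps

rng = np.random.default_rng(0)
a = rng.standard_normal((4, 3))           # minibatch of 4 inputs
w_mean = rng.standard_normal((3, 2))
w_var = np.full((3, 2), 0.1)
b = local_reparam_layer(a, w_mean, w_var, rng)
print(b.shape)
```

Because each row of `b` receives its own noise, the sampled activations are independent across datapoints, which is the property the abstract relies on for the variance to shrink with minibatch size.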
Reparameterizing the Birkhoff Polytope for Variational Permutation Inference
Many matching, tracking, sorting, and ranking problems require probabilistic
reasoning about possible permutations, a set that grows factorially with
dimension. Combinatorial optimization algorithms may enable efficient point
estimation, but fully Bayesian inference poses a severe challenge in this
high-dimensional, discrete space. To surmount this challenge, we start with the
usual step of relaxing a discrete set (here, of permutation matrices) to its
convex hull, which here is the Birkhoff polytope: the set of all
doubly-stochastic matrices. We then introduce two novel transformations: first,
an invertible and differentiable stick-breaking procedure that maps
unconstrained space to the Birkhoff polytope; second, a map that rounds points
toward the vertices of the polytope. Both transformations include a temperature
parameter that, in the limit, concentrates the densities on permutation
matrices. We then exploit these transformations and reparameterization
gradients to introduce variational inference over permutation matrices, and we
demonstrate its utility in a series of experiments.
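
The geometry behind the relaxation can be illustrated with a small sketch. This is neither the stick-breaking map nor the rounding map from the paper; it is a swapped-in, simpler technique: a convex interpolation between a doubly-stochastic matrix and its nearest permutation matrix, found by brute force. The helper names and the parameter `temperature` are hypothetical. Since the Birkhoff polytope is convex, every interpolated point stays inside it, and as `temperature` goes to 0 the point moves onto a vertex.

```python
import itertools
import numpy as np

def is_doubly_stochastic(x, tol=1e-9):
    """True if x is nonnegative and all rows and columns sum to 1."""
    return bool(
        np.all(x >= -tol)
        and np.allclose(x.sum(axis=0), 1.0, atol=tol)
        and np.allclose(x.sum(axis=1), 1.0, atol=tol)
    )

def nearest_permutation(x):
    """Permutation matrix closest to x in Frobenius norm.

    Minimizing ||x - P||_F over permutation matrices is equivalent to
    maximizing sum_i x[i, p[i]]. Brute force is used here, so this is
    only practical for small n.
    """
    n = x.shape[0]
    best = max(
        itertools.permutations(range(n)),
        key=lambda p: sum(x[i, p[i]] for i in range(n)),
    )
    perm = np.zeros_like(x)
    perm[np.arange(n), list(best)] = 1.0
    return perm

def pull_toward_vertex(x, temperature):
    """Convex combination of x and its nearest vertex; temperature in [0, 1]."""
    return temperature * x + (1.0 - temperature) * nearest_permutation(x)

x = np.array([[0.6, 0.3, 0.1],
              [0.3, 0.5, 0.2],
              [0.1, 0.2, 0.7]])
for t in (1.0, 0.5, 0.0):
    y = pull_toward_vertex(x, t)
    print(t, is_doubly_stochastic(y))
```

For larger matrices, a linear assignment solver would replace the brute-force search over permutations.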