Kernelized Wasserstein Natural Gradient
Many machine learning problems can be expressed as the optimization of some
cost functional over a parametric family of probability distributions. It is
often beneficial to solve such optimization problems using natural gradient
methods. These methods are invariant to the parametrization of the family, and
thus can yield more effective optimization. Unfortunately, computing the
natural gradient is challenging as it requires inverting a high dimensional
matrix at each iteration. We propose a general framework to approximate the
natural gradient for the Wasserstein metric, by leveraging a dual formulation
of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach
leads to an estimator for the natural gradient direction that can trade off
accuracy and computational cost, with theoretical guarantees. We verify its
accuracy on simple examples, and empirically demonstrate the advantage of using
such an estimator in classification tasks on CIFAR-10 and CIFAR-100.
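To see why the exact natural gradient is costly, here is a minimal sketch of a generic natural-gradient step, which solves a d-by-d linear system with the metric matrix at every iteration. The quadratic toy loss, the damping term, and the metric choice are illustrative assumptions, not the paper's kernelized Wasserstein estimator.

```python
import numpy as np

# Hypothetical ill-conditioned quadratic objective: loss(theta) = 0.5 * theta^T A theta
A = np.diag([100.0, 1.0])
loss_grad = lambda theta: A @ theta

def natural_gradient_step(theta, grad, metric, lr=0.1, damping=1e-3):
    """One natural-gradient update: invert the (damped) metric each iteration.

    This exact solve costs O(d^3); the paper's contribution is replacing it
    with a kernelized estimator of the Wasserstein natural gradient
    (not reproduced here).
    """
    G = metric(theta) + damping * np.eye(len(theta))
    return theta - lr * np.linalg.solve(G, grad(theta))

# With the metric matched to the curvature, the step is well conditioned.
theta = np.array([1.0, 1.0])
for _ in range(50):
    theta = natural_gradient_step(theta, loss_grad, metric=lambda th: A)
print(theta)
```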
Stein operators, kernels and discrepancies for multivariate continuous distributions
We present a general framework for setting up Stein's method for multivariate continuous distributions. The approach gives a collection of Stein characterizations, among which we highlight score-Stein operators and kernel-Stein operators. Applications include copulas and the distance between posterior distributions. We give a general explicit construction of Stein kernels for elliptical distributions and discuss Stein kernels in generality, highlighting connections with Fisher information and mass transport. Finally, a goodness-of-fit test based on Stein discrepancies is given.
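For a concrete sense of how a Stein discrepancy yields a goodness-of-fit statistic, the following is a minimal sketch of the standard kernel Stein discrepancy V-statistic with an RBF kernel. The kernel choice, bandwidth, and the Gaussian example are assumptions for illustration, not the specific operators constructed in the paper.

```python
import numpy as np

def ksd_rbf(X, score, h=1.0):
    """V-statistic estimate of the kernel Stein discrepancy with an RBF
    kernel k(x, y) = exp(-||x - y||^2 / (2 h^2)).

    score(x) must return grad log p(x). Small values are consistent with
    the samples X coming from p; large values are evidence against it.
    """
    n, d = X.shape
    S = np.stack([score(x) for x in X])           # score evaluations, (n, d)
    D = X[:, None, :] - X[None, :, :]             # pairwise differences, (n, n, d)
    r2 = np.sum(D ** 2, axis=-1)
    K = np.exp(-r2 / (2 * h ** 2))
    # Stein kernel: k_p(x, y) = s(x).s(y) k + s(x).grad_y k
    #                         + s(y).grad_x k + tr(grad_x grad_y k)
    t1 = (S @ S.T) * K
    t2 = np.einsum('id,ijd->ij', S, D) / h ** 2 * K    # s(x).grad_y k
    t3 = -np.einsum('jd,ijd->ij', S, D) / h ** 2 * K   # s(y).grad_x k
    t4 = (d / h ** 2 - r2 / h ** 4) * K                # tr(grad_x grad_y k)
    return (t1 + t2 + t3 + t4).mean()

# Standard-normal samples tested against the standard-normal score.
rng = np.random.default_rng(0)
X = rng.standard_normal((200, 2))
print(ksd_rbf(X, score=lambda x: -x))             # near zero for a good fit
```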
Birth-death dynamics for sampling: Global convergence, approximations and their asymptotics
Motivated by the challenge of sampling Gibbs measures with nonconvex
potentials, we study a continuum birth-death dynamics. We improve results in
previous works [51,57] and provide weaker hypotheses under which the
probability density of the birth-death dynamics governed by the
Kullback-Leibler divergence or by the χ² divergence converges exponentially
fast to the Gibbs equilibrium measure, with a universal rate that is
independent of the potential barrier. To
build a practical numerical sampler based on the pure birth-death dynamics, we
consider an interacting particle system, which is inspired by the gradient flow
structure and the classical Fokker-Planck equation and relies on kernel-based
approximations of the measure. Using the technique of Γ-convergence of
gradient flows, we show that on the torus, smooth and bounded positive
solutions of the kernelized dynamics converge, on finite time intervals, to the
pure birth-death dynamics as the kernel bandwidth shrinks to zero. Moreover, we
provide quantitative estimates on the bias of minimizers of the energy
corresponding to the kernelized dynamics. Finally, we prove long-time
asymptotic results on the convergence of the asymptotic states of the
kernelized dynamics towards the Gibbs measure.
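To make the kernelized dynamics concrete, here is a minimal sketch of one birth-death step with a Gaussian kernel density estimate standing in for the particle measure. The bandwidth, the duplication-resampling scheme, and the omission of the accompanying Fokker-Planck (Langevin) diffusion step are simplifying assumptions, not the authors' scheme.

```python
import numpy as np

def birth_death_step(X, log_target, dt=0.1, h=0.3, rng=None):
    """One kernelized birth-death step (an illustrative sketch).

    For the KL-governed dynamics, the instantaneous jump rate is
        lambda_i = log rho(x_i) - log pi(x_i) - mean_j [same],
    with rho replaced by a Gaussian KDE of bandwidth h. Particles with
    positive rate are killed with probability 1 - exp(-lambda_i * dt) and
    replaced by copies of surviving particles, keeping n fixed.
    """
    rng = rng or np.random.default_rng()
    n, d = X.shape
    r2 = np.sum((X[:, None, :] - X[None, :, :]) ** 2, axis=-1)
    kde = np.exp(-r2 / (2 * h ** 2)).mean(axis=1) / (2 * np.pi * h ** 2) ** (d / 2)
    lam = np.log(kde) - np.array([log_target(x) for x in X])
    lam -= lam.mean()                            # mass-preserving normalization
    kill = rng.random(n) < 1.0 - np.exp(-np.maximum(lam, 0.0) * dt)
    X = X.copy()
    if kill.any() and not kill.all():
        X[kill] = X[rng.choice(np.flatnonzero(~kill), size=kill.sum())]
    return X

# In practice one would alternate this step with a diffusion (Langevin) step.
X = np.random.default_rng(1).normal(size=(500, 1)) + 3.0
for _ in range(50):
    X = birth_death_step(X, log_target=lambda x: -0.5 * float(x @ x))
```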
Repulsive Deep Ensembles are Bayesian
Deep ensembles have recently gained popularity in the deep learning community
for their conceptual simplicity and efficiency. However, maintaining functional
diversity between ensemble members that are independently trained with gradient
descent is challenging. This can lead to pathologies when adding more ensemble
members, such as a saturation of the ensemble performance, which converges to
the performance of a single model. Moreover, this affects not only the quality
of the ensemble's predictions but, even more so, its uncertainty estimates, and
thus its performance on out-of-distribution data. We hypothesize
that this limitation can be overcome by discouraging different ensemble members
from collapsing to the same function. To this end, we introduce a kernelized
repulsive term in the update rule of the deep ensembles. We show that this
simple modification not only enforces and maintains diversity among the members
but, even more importantly, transforms the maximum a posteriori inference into
proper Bayesian inference. Namely, we show that the training dynamics of our
proposed repulsive ensembles follow a Wasserstein gradient flow of the KL
divergence to the true posterior. We study repulsive terms in weight and
function space and empirically compare their performance to standard ensembles
and Bayesian baselines on synthetic and real-world prediction tasks.
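A minimal sketch of a kernelized repulsive update in weight space may help: each member follows its posterior gradient while a kernel-normalized gradient term is subtracted to keep members apart. The RBF kernel, the fixed bandwidth, and the NumPy toy setting are assumptions; the paper also studies function-space repulsion and bandwidth heuristics.

```python
import numpy as np

def repulsive_update(W, grad_log_post, lr=1e-2, h=1.0):
    """One step for n ensemble members W (shape (n, d)): posterior-gradient
    drive plus a normalized kernel-gradient repulsion (weight-space variant).

    grad_{w_i} k(w_i, w_j) = -(w_i - w_j) / h^2 * k(w_i, w_j); subtracting
    its kernel-normalized sum pushes members away from one another.
    """
    D = W[:, None, :] - W[None, :, :]                    # pairwise differences
    K = np.exp(-np.sum(D ** 2, axis=-1) / (2 * h ** 2))  # RBF kernel matrix
    drive = np.stack([grad_log_post(w) for w in W])
    grad_k = -np.einsum('ij,ijd->id', K, D) / h ** 2     # sum_j grad_{w_i} k
    return W + lr * (drive - grad_k / K.sum(axis=1, keepdims=True))

# Toy check: members spread over a standard normal instead of collapsing at 0.
W = np.random.default_rng(2).normal(scale=0.01, size=(8, 2))
for _ in range(500):
    W = repulsive_update(W, grad_log_post=lambda w: -w)
```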
Particle-based Variational Inference with Preconditioned Functional Gradient Flow
Particle-based variational inference (VI) minimizes the KL divergence between
model samples and the target posterior with gradient flow estimates. With the
popularity of Stein variational gradient descent (SVGD), the focus of
particle-based VI algorithms has been on the properties of functions in
Reproducing Kernel Hilbert Space (RKHS) to approximate the gradient flow.
However, the requirement of RKHS restricts the function class and algorithmic
flexibility. This paper remedies the problem by proposing a general framework
to obtain tractable functional gradient flow estimates. The functional gradient
flow in our framework can be defined by a general functional regularization
term that includes the RKHS norm as a special case. We use our framework to
propose a new particle-based VI algorithm: preconditioned functional gradient
flow (PFG). Compared with SVGD, the proposed method has several advantages:
larger function class; greater scalability in large particle-size scenarios;
better adaptation to ill-conditioned distributions; provable continuous-time
convergence in KL divergence. Non-linear function classes such as neural
networks can be incorporated to estimate the gradient flow. Both theory and
experiments demonstrate the effectiveness of our framework.
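As a sketch of the alternating scheme such methods use: fit a vector field f to approximate the gradient flow by maximizing a Stein-type objective, then move the particles along f. The plain L2 penalty below stands in for the paper's preconditioned functional regularizer, and the architecture, step sizes, and standard-normal target are illustrative assumptions.

```python
import torch

def fit_flow(f, X, score, steps=20, lam=0.5):
    """Fit a vector field f to approximate the KL gradient flow by maximizing
    E[f(x).score(x) + div f(x) - lam * ||f(x)||^2]; integration by parts shows
    the optimum is (grad log p - grad log q) / (2 * lam). A plain L2 penalty
    replaces the preconditioned functional regularizer here.
    """
    opt = torch.optim.Adam(f.parameters(), lr=1e-2)
    for _ in range(steps):
        Xr = X.detach().requires_grad_(True)
        out = f(Xr)
        # Divergence of f via one autograd pass per coordinate.
        div = sum(torch.autograd.grad(out[:, i].sum(), Xr, create_graph=True)[0][:, i]
                  for i in range(X.shape[1]))
        loss = (-(out * score(X)).sum(1) - div + lam * (out ** 2).sum(1)).mean()
        opt.zero_grad()
        loss.backward()
        opt.step()

d = 2
f = torch.nn.Sequential(torch.nn.Linear(d, 64), torch.nn.Tanh(), torch.nn.Linear(64, d))
X = torch.randn(256, d)
for _ in range(100):                      # alternate fitting f and moving particles
    fit_flow(f, X, score=lambda x: -x)    # standard-normal target for illustration
    with torch.no_grad():
        X = X + 0.1 * f(X)
```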