193 research outputs found

    Kernelized Wasserstein Natural Gradient

    Full text link
    Many machine learning problems can be expressed as the optimization of some cost functional over a parametric family of probability distributions. It is often beneficial to solve such optimization problems using natural gradient methods. These methods are invariant to the parametrization of the family, and thus can yield more effective optimization. Unfortunately, computing the natural gradient is challenging as it requires inverting a high dimensional matrix at each iteration. We propose a general framework to approximate the natural gradient for the Wasserstein metric, by leveraging a dual formulation of the metric restricted to a Reproducing Kernel Hilbert Space. Our approach leads to an estimator for gradient direction that can trade-off accuracy and computational cost, with theoretical guarantees. We verify its accuracy on simple examples, and show the advantage of using such an estimator in classification tasks on Cifar10 and Cifar100 empirically

    Stein operators, kernels and discrepancies for multivariate continuous distributions

    Get PDF
    We present a general framework for setting up Stein's method for multivariate continuous distributions. The approach gives a collection of Stein characterizations, among which we highlight score-Stein operators and kernel-Stein operators. Applications include copu-las and distance between posterior distributions. We give a general explicit construction for Stein kernels for elliptical distributions and discuss Stein kernels in generality, highlighting connections with Fisher information and mass transport. Finally, a goodness-of-fit test based on Stein discrepancies is given

    Birth-death dynamics for sampling: Global convergence, approximations and their asymptotics

    Full text link
    Motivated by the challenge of sampling Gibbs measures with nonconvex potentials, we study a continuum birth-death dynamics. We improve results in previous works [51,57] and provide weaker hypotheses under which the probability density of the birth-death governed by Kullback-Leibler divergence or by χ2\chi^2 divergence converge exponentially fast to the Gibbs equilibrium measure, with a universal rate that is independent of the potential barrier. To build a practical numerical sampler based on the pure birth-death dynamics, we consider an interacting particle system, which is inspired by the gradient flow structure and the classical Fokker-Planck equation and relies on kernel-based approximations of the measure. Using the technique of Γ\Gamma-convergence of gradient flows, we show that on the torus, smooth and bounded positive solutions of the kernelized dynamics converge on finite time intervals, to the pure birth-death dynamics as the kernel bandwidth shrinks to zero. Moreover we provide quantitative estimates on the bias of minimizers of the energy corresponding to the kernelized dynamics. Finally we prove the long-time asymptotic results on the convergence of the asymptotic states of the kernelized dynamics towards the Gibbs measure.Comment: significant mathematical changes with more rigor on gradient flow

    Repulsive Deep Ensembles are Bayesian

    Full text link
    Deep ensembles have recently gained popularity in the deep learning community for their conceptual simplicity and efficiency. However, maintaining functional diversity between ensemble members that are independently trained with gradient descent is challenging. This can lead to pathologies when adding more ensemble members, such as a saturation of the ensemble performance, which converges to the performance of a single model. Moreover, this does not only affect the quality of its predictions, but even more so the uncertainty estimates of the ensemble, and thus its performance on out-of-distribution data. We hypothesize that this limitation can be overcome by discouraging different ensemble members from collapsing to the same function. To this end, we introduce a kernelized repulsive term in the update rule of the deep ensembles. We show that this simple modification not only enforces and maintains diversity among the members but, even more importantly, transforms the maximum a posteriori inference into proper Bayesian inference. Namely, we show that the training dynamics of our proposed repulsive ensembles follow a Wasserstein gradient flow of the KL divergence with the true posterior. We study repulsive terms in weight and function space and empirically compare their performance to standard ensembles and Bayesian baselines on synthetic and real-world prediction tasks

    Particle-based Variational Inference with Preconditioned Functional Gradient Flow

    Full text link
    Particle-based variational inference (VI) minimizes the KL divergence between model samples and the target posterior with gradient flow estimates. With the popularity of Stein variational gradient descent (SVGD), the focus of particle-based VI algorithms has been on the properties of functions in Reproducing Kernel Hilbert Space (RKHS) to approximate the gradient flow. However, the requirement of RKHS restricts the function class and algorithmic flexibility. This paper remedies the problem by proposing a general framework to obtain tractable functional gradient flow estimates. The functional gradient flow in our framework can be defined by a general functional regularization term that includes the RKHS norm as a special case. We use our framework to propose a new particle-based VI algorithm: preconditioned functional gradient flow (PFG). Compared with SVGD, the proposed method has several advantages: larger function class; greater scalability in large particle-size scenarios; better adaptation to ill-conditioned distributions; provable continuous-time convergence in KL divergence. Non-linear function classes such as neural networks can be incorporated to estimate the gradient flow. Both theory and experiments have shown the effectiveness of our framework.Comment: 34 pages, 8 figure
    corecore