
    Stein Variational Gradient Descent with Multiple Kernel

    Stein variational gradient descent (SVGD) and its variants have shown promising success in approximate inference for complex distributions. In practice, the kernel used in SVGD-based methods has a decisive effect on empirical performance. A radial basis function (RBF) kernel with the median heuristic is the common choice in previous approaches, but it has proven to be sub-optimal. Inspired by the paradigm of multiple kernel learning (MKL), our remedy for this flaw is to approximate the optimal kernel with a combination of multiple kernels, rather than a single one that may limit performance and flexibility. Specifically, we first extend the kernelized Stein discrepancy (KSD) to a multiple-kernel view, called multiple kernelized Stein discrepancy (MKSD), and then leverage MKSD to construct a general algorithm, multiple-kernel SVGD (MK-SVGD). Moreover, MK-SVGD automatically assigns a weight to each kernel without any extra parameters, so our method not only removes the dependence on an optimal kernel but also remains computationally efficient. Experiments on various tasks and models demonstrate that the proposed method consistently matches or outperforms competing methods.
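    As a rough illustration of the mixture-of-kernels idea, here is a minimal SVGD step that averages the update directions from several RBF bandwidths under fixed convex weights. The fixed uniform weights, the bandwidth grid, and the function names are illustrative assumptions; the paper derives the kernel weights from the MKSD itself.

```python
import numpy as np

def rbf_kernel(x, y, h):
    # RBF kernel k(x, y) = exp(-||x - y||^2 / h) and its gradient in x.
    # x: (n, d), y: (m, d).
    diff = x[:, None, :] - y[None, :, :]           # (n, m, d)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / h)    # (n, m)
    gradK = -2.0 / h * diff * K[..., None]         # d/dx k(x, y)
    return K, gradK

def mk_svgd_step(x, grad_logp, bandwidths, weights, step=0.1):
    # One SVGD update using a fixed convex combination of RBF kernels.
    # NOTE: fixed weights are a placeholder; MK-SVGD instead derives
    # the per-kernel weights from the multiple kernelized Stein
    # discrepancy (MKSD).
    n = x.shape[0]
    g = grad_logp(x)                               # scores, (n, d)
    phi = np.zeros_like(x)
    for h, w in zip(bandwidths, weights):
        K, gradK = rbf_kernel(x, x, h)
        # phi_i = (1/n) sum_j [ k(x_j, x_i) g_j + grad_{x_j} k(x_j, x_i) ]
        phi += w * (K.T @ g + gradK.sum(axis=0)) / n
    return x + step * phi
```

    Run on a standard-normal target (score is simply -x), the particles drift from their initial cluster toward the target mean while the repulsion terms keep them spread out.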

    Particle-based Variational Inference with Preconditioned Functional Gradient Flow

    Particle-based variational inference (VI) minimizes the KL divergence between model samples and the target posterior using gradient-flow estimates. With the popularity of Stein variational gradient descent (SVGD), particle-based VI algorithms have focused on functions in a reproducing kernel Hilbert space (RKHS) to approximate the gradient flow. However, the RKHS requirement restricts the function class and algorithmic flexibility. This paper remedies the problem by proposing a general framework for obtaining tractable functional gradient-flow estimates. The functional gradient flow in our framework can be defined by a general functional regularization term that includes the RKHS norm as a special case. We use the framework to propose a new particle-based VI algorithm: preconditioned functional gradient flow (PFG). Compared with SVGD, the proposed method has several advantages: a larger function class, greater scalability in large particle-size scenarios, better adaptation to ill-conditioned distributions, and provable continuous-time convergence in KL divergence. Non-linear function classes such as neural networks can be incorporated to estimate the gradient flow. Both theory and experiments show the effectiveness of our framework. (Comment: 34 pages, 8 figures)
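    PFG learns the velocity field (e.g. with a neural network) under a general functional regularizer, which is beyond the scope of a snippet. As a hedged sketch of the preconditioning idea only, here is a plain SVGD direction rescaled by a fixed positive-definite matrix `P`, taken below as the known target covariance; the function names, the choice of `P`, and the median-heuristic bandwidth are all assumptions of this sketch, not the paper's method.

```python
import numpy as np

def median_bandwidth(x):
    # Common median heuristic for the RBF bandwidth.
    sq = np.sum((x[:, None, :] - x[None, :, :]) ** 2, axis=-1)
    return np.median(sq) / np.log(x.shape[0] + 1.0) + 1e-8

def preconditioned_svgd_step(x, grad_logp, P, step=0.3):
    # SVGD velocity field, rescaled by a fixed positive-definite matrix P.
    # This only illustrates preconditioning for ill-conditioned targets;
    # PFG instead *learns* the flow under a functional regularizer.
    n = x.shape[0]
    h = median_bandwidth(x)
    diff = x[:, None, :] - x[None, :, :]            # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / h)
    # attraction K @ score plus repulsion sum_j grad_{x_j} k(x_j, x_i)
    phi = (K @ grad_logp(x) - 2.0 / h * (K[..., None] * diff).sum(axis=0)) / n
    return x + step * phi @ P                       # P assumed symmetric
```

    On a badly scaled Gaussian target N(0, diag(1, 100)), the preconditioner makes the effective dynamics isotropic, so both coordinates contract at comparable rates.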

    Accelerated Information Gradient flow

    We present a framework for Nesterov's accelerated gradient flows in probability space. Four examples of information metrics are considered: the Fisher-Rao metric, the Wasserstein-2 metric, the Kalman-Wasserstein metric, and the Stein metric. For both the Fisher-Rao and Wasserstein-2 metrics, we prove convergence properties of the accelerated gradient flows. For implementation, we propose a sampling-efficient discrete-time algorithm for the Wasserstein-2, Kalman-Wasserstein, and Stein accelerated gradient flows with a restart technique. We also formulate a kernel bandwidth selection method that learns the gradient of the log-density from Brownian-motion samples. Numerical experiments, including Bayesian logistic regression and a Bayesian neural network, show the strength of the proposed methods compared with state-of-the-art algorithms. (Comment: 33 pages)
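    The restart idea in the discrete-time algorithm can be sketched as a momentum-augmented particle update on top of a plain Stein (SVGD) direction, zeroing the velocity whenever it opposes the current descent direction. The paper derives its accelerated flows from the underlying information metrics, so the update below, including the fixed bandwidth, step size, and momentum coefficient, is an illustrative assumption rather than the paper's algorithm.

```python
import numpy as np

def svgd_direction(x, grad_logp, h=2.0):
    # Standard SVGD velocity field with a fixed RBF bandwidth h.
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]            # diff[j, i] = x_j - x_i
    K = np.exp(-np.sum(diff ** 2, axis=-1) / h)
    return (K @ grad_logp(x) - 2.0 / h * (K[..., None] * diff).sum(axis=0)) / n

def accelerated_flow(x, grad_logp, steps=500, step=0.05, alpha=0.9):
    # Momentum on the particle flow with a gradient restart: reset the
    # velocity whenever it points against the current flow direction.
    v = np.zeros_like(x)
    for _ in range(steps):
        phi = svgd_direction(x, grad_logp)
        if np.sum(v * phi) < 0.0:                   # restart condition
            v = np.zeros_like(x)
        v = alpha * v + step * phi
        x = x + v
    return x
```

    Without the restart, the momentum term causes persistent oscillation around the target mean; the restart damps each overshoot and the particle cloud settles.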

    On the geometry of Stein variational gradient descent

    Bayesian inference problems require sampling from or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest-descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems whose mean-field limit is a gradient flow on the space of probability distributions equipped with a certain geometric structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to consider certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains from these kernels in various numerical experiments.
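    The paper's adjusted-tail kernels are specific to its analysis; as an illustrative stand-in, here is an SVGD step with a Laplace kernel k(x, y) = exp(-||x - y|| / h), which is likewise nondifferentiable at x = y and heavier-tailed than the Gaussian RBF. The kernel choice, bandwidth, and step size are assumptions of this sketch.

```python
import numpy as np

def laplace_svgd_step(x, grad_logp, h=1.0, step=0.1):
    # SVGD update with a Laplace kernel k(x, y) = exp(-||x - y|| / h):
    # nondifferentiable at x = y and heavier-tailed than the RBF.
    n = x.shape[0]
    diff = x[:, None, :] - x[None, :, :]                 # diff[j, i] = x_j - x_i
    dist = np.sqrt(np.sum(diff ** 2, axis=-1))
    K = np.exp(-dist / h)
    # grad_{x_j} k(x_j, x_i) = -(x_j - x_i) / (h ||x_j - x_i||) * k;
    # the (undefined) diagonal term is set to zero.
    safe = np.where(dist > 0, dist, 1.0)
    gradK = -diff / (h * safe[..., None]) * K[..., None]
    gradK[np.arange(n), np.arange(n)] = 0.0
    phi = (K @ grad_logp(x) + gradK.sum(axis=0)) / n
    return x + step * phi
```

    Unlike the RBF, the Laplace kernel's repulsion does not vanish as two particles approach each other, which keeps the cloud from collapsing even with a small bandwidth.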