Stein Variational Gradient Descent with Multiple Kernels
Stein variational gradient descent (SVGD) and its variants have shown promising success in approximate inference for complex distributions. In practice, we observe that the kernel used in SVGD-based methods has a decisive effect on empirical performance. The radial basis function (RBF) kernel with the median heuristic is a common choice in previous approaches, but it has proven to be sub-optimal. Inspired by the paradigm of Multiple Kernel Learning (MKL), our solution to this flaw is to use a combination of multiple kernels to approximate the optimal kernel, rather than a single kernel that may limit performance and flexibility. Specifically, we first extend the Kernelized Stein Discrepancy (KSD) to its multiple-kernel counterpart, called Multiple Kernelized Stein Discrepancy (MKSD), and then leverage MKSD to construct a general algorithm, Multiple Kernel SVGD (MK-SVGD). Furthermore, MK-SVGD automatically assigns a weight to each kernel without introducing any additional parameters, which means that our method not only removes the dependence on an optimal kernel but also maintains computational efficiency. Experiments on various tasks and models demonstrate that our proposed method consistently matches or outperforms the competing methods.
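To make the update concrete, the following is a minimal numpy sketch of an SVGD step driven by a weighted combination of RBF kernels. The weighting rule used here (normalizing each kernel's empirical update magnitude) is only an illustrative placeholder for the MKSD-based weights described in the abstract, and the function and parameter names are hypothetical.

import numpy as np

def rbf_kernel(X, h):
    # Pairwise RBF kernel K[i, j] = exp(-||x_i - x_j||^2 / h) and its
    # gradient with respect to the first argument x_i.
    diff = X[:, None, :] - X[None, :, :]            # (n, n, d)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / h)     # (n, n)
    gradK = -2.0 / h * diff * K[:, :, None]         # (n, n, d)
    return K, gradK

def mk_svgd_step(X, grad_logp, bandwidths, step=1e-1):
    # One SVGD update using a weighted combination of RBF kernels.
    # The weights (proportional to each kernel's update magnitude, then
    # normalized) are a hypothetical stand-in for the MKSD-based weighting.
    n = X.shape[0]
    scores = grad_logp(X)                           # (n, d), score of the target
    directions, weights = [], []
    for h in bandwidths:
        K, gradK = rbf_kernel(X, h)
        phi = (K @ scores + gradK.sum(axis=0)) / n  # kernelized Stein direction
        directions.append(phi)
        weights.append(np.linalg.norm(phi))
    w = np.array(weights)
    w = w / w.sum()
    update = sum(wi * di for wi, di in zip(w, directions))
    return X + step * update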
Particle-based Variational Inference with Preconditioned Functional Gradient Flow
Particle-based variational inference (VI) minimizes the KL divergence between
model samples and the target posterior with gradient flow estimates. With the
popularity of Stein variational gradient descent (SVGD), particle-based VI algorithms have focused on the properties of functions in a Reproducing Kernel Hilbert Space (RKHS) to approximate the gradient flow. However, the RKHS requirement restricts the function class and algorithmic flexibility. This paper remedies the problem by proposing a general framework
to obtain tractable functional gradient flow estimates. The functional gradient
flow in our framework can be defined by a general functional regularization
term that includes the RKHS norm as a special case. We use our framework to
propose a new particle-based VI algorithm: preconditioned functional gradient
flow (PFG). Compared with SVGD, the proposed method has several advantages: a larger function class; greater scalability in large particle-size scenarios; better adaptation to ill-conditioned distributions; and provable continuous-time convergence in KL divergence. Non-linear function classes such as neural networks can be incorporated to estimate the gradient flow. Both theory and experiments demonstrate the effectiveness of our framework.
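As a concrete, simplified instance of a functional gradient flow outside the RKHS setting, the sketch below fits an affine flow field f(x) = A x + b by maximizing a regularized Stein-type objective in closed form and then moves the particles along it. The affine class and the quadratic penalty are illustrative assumptions; the paper's framework admits richer classes such as neural networks and other regularizers.

import numpy as np

def affine_flow_step(X, grad_logp, lam=1.0, step=1e-1):
    # One particle update along an affine flow field f(x) = A x + b, chosen to
    # maximize the regularized Stein-type objective
    #   (1/n) sum_i [ s(x_i) . f(x_i) + div f(x_i) ] - (lam/2)(||A||_F^2 + ||b||^2),
    # where s = grad log p.  This affine class is a simplifying assumption,
    # not the paper's full (e.g. neural-network) function class.
    n, d = X.shape
    S = grad_logp(X)                          # scores s(x_i), shape (n, d)
    A = (S.T @ X / n + np.eye(d)) / lam       # closed-form maximizer over A
    b = S.mean(axis=0) / lam                  # closed-form maximizer over b
    return X + step * (X @ A.T + b)           # move particles along f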
Accelerated Information Gradient flow
We present a framework for Nesterov's accelerated gradient flows in
probability space. Four examples of information metrics are considered: the Fisher-Rao metric, the Wasserstein-2 metric, the Kalman-Wasserstein metric, and the Stein metric. For both the Fisher-Rao and Wasserstein-2 metrics, we prove convergence properties of the accelerated gradient flows. For implementation, we propose a sampling-efficient discrete-time algorithm for the Wasserstein-2, Kalman-Wasserstein, and Stein accelerated gradient flows, together with a restart technique. We also formulate a kernel bandwidth selection method that learns the gradient of the log-density from Brownian-motion samples. Numerical experiments, including Bayesian logistic regression and Bayesian neural networks, show the strength of the proposed methods compared with state-of-the-art algorithms.
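The numpy sketch below illustrates, for the Wasserstein-2 case, the kind of momentum particle update with a restart heuristic that the abstract refers to. The kernel-density score estimate and the specific restart condition are illustrative assumptions standing in for the paper's bandwidth-selection rule and restart technique.

import numpy as np

def kde_score(X, h):
    # Kernel-density estimate of grad log density at the particle locations.
    diff = X[:, None, :] - X[None, :, :]                    # (n, n, d)
    K = np.exp(-np.sum(diff ** 2, axis=-1) / (2.0 * h))     # (n, n)
    gradK = -diff / h * K[:, :, None]                       # (n, n, d)
    return gradK.sum(axis=1) / K.sum(axis=1, keepdims=True)

def accelerated_w2_flow(X, grad_logp, steps=200, dt=0.05, damping=3.0, h=0.5):
    # Momentum (Nesterov-style) particle flow for the KL objective in the
    # Wasserstein-2 geometry, with a gradient-based restart: velocities are
    # reset whenever they oppose the driving force.
    V = np.zeros_like(X)
    for _ in range(steps):
        force = grad_logp(X) - kde_score(X, h)   # negative Wasserstein gradient of KL
        V = (1.0 - damping * dt) * V + dt * force
        if np.sum(V * force) < 0.0:              # restart heuristic
            V = np.zeros_like(X)
        X = X + dt * V
    return X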
On the geometry of Stein variational gradient descent
Bayesian inference problems require sampling or approximating high-dimensional probability distributions. The focus of this paper is on the recently introduced Stein variational gradient descent methodology, a class of algorithms that rely on iterated steepest descent steps with respect to a reproducing kernel Hilbert space norm. This construction leads to interacting particle systems, the mean-field limit of which is a gradient flow on the space of probability distributions equipped with a certain geometrical structure. We leverage this viewpoint to shed some light on the convergence properties of the algorithm, in particular addressing the problem of choosing a suitable positive definite kernel function. Our analysis leads us to considering certain nondifferentiable kernels with adjusted tails. We demonstrate significant performance gains of these kernels in various numerical experiments.
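A minimal sketch of how the kernel choice enters the SVGD update, here with a heavier-tailed Laplace kernel (nondifferentiable at the origin) plugged in as one possible alternative to the RBF default. This particular kernel is only an illustrative assumption, not necessarily the adjusted-tail kernel advocated in the paper.

import numpy as np

def svgd_step(X, grad_logp, kernel, grad_kernel, step=1e-1):
    # Generic SVGD step in which the positive definite kernel is a free choice.
    n = X.shape[0]
    diff = X[:, None, :] - X[None, :, :]            # diff[i, j] = x_i - x_j
    K = kernel(diff)                                # (n, n)
    gradK = grad_kernel(diff)                       # d/dx_i k(x_i, x_j), (n, n, d)
    phi = (K @ grad_logp(X) + gradK.sum(axis=0)) / n
    return X + step * phi

# Heavier-tailed Laplace kernel, nondifferentiable at the origin (assumed example).
h = 1.0
r = lambda diff: np.sqrt(np.sum(diff ** 2, axis=-1) + 1e-12)
laplace = lambda diff: np.exp(-r(diff) / h)
laplace_grad = lambda diff: -diff / (h * r(diff)[..., None]) * laplace(diff)[..., None]

grad_logp = lambda X: -X                            # toy standard Gaussian target
X = np.random.randn(50, 2)
X = svgd_step(X, grad_logp, laplace, laplace_grad)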