A Formalization of The Natural Gradient Method for General Similarity Measures
In optimization, the natural gradient method is well-known for likelihood
maximization. The method uses the Kullback-Leibler divergence, corresponding
infinitesimally to the Fisher-Rao metric, which is pulled back to the parameter
space of a family of probability distributions. This way, gradients with
respect to the parameters respect the Fisher-Rao geometry of the space of
distributions, which might differ vastly from the standard Euclidean geometry
of the parameter space, often leading to faster convergence. However, when
minimizing an arbitrary similarity measure between distributions, it is
generally unclear which metric to use. We provide a general framework that,
given a similarity measure, derives a metric for the natural gradient. We then
discuss connections between the natural gradient method and multiple other
optimization techniques in the literature. Finally, we provide computations of
the formal natural gradient to show overlap with well-known cases and to
compute natural gradients in novel frameworks.
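As a concrete instance of the well-known KL/Fisher case, the sketch below (not from the paper; the one-dimensional Gaussian family, step size, and names are illustrative) fits a Gaussian by preconditioning the Euclidean log-likelihood gradient with the inverse Fisher metric, which is diagonal in (mu, sigma) coordinates:

```python
import numpy as np

def natural_gradient_fit(data, steps=200, lr=0.1):
    """Fit a 1D Gaussian by natural gradient ascent on the log-likelihood.

    The Fisher information for theta = (mu, sigma) is diagonal,
    F = diag(1/sigma^2, 2/sigma^2), so the natural gradient is the
    Euclidean gradient preconditioned by F^{-1}.
    """
    mu, sigma = 0.0, 1.0
    for _ in range(steps):
        # Euclidean gradient of the average log-likelihood
        g_mu = np.mean(data - mu) / sigma**2
        g_sigma = np.mean((data - mu) ** 2) / sigma**3 - 1.0 / sigma
        # Precondition with the inverse Fisher metric
        mu += lr * sigma**2 * g_mu
        sigma += lr * (sigma**2 / 2.0) * g_sigma
    return mu, sigma

rng = np.random.default_rng(0)
samples = rng.normal(loc=2.0, scale=0.5, size=1000)
print(natural_gradient_fit(samples))  # approximately (2.0, 0.5)
```

In these coordinates the natural step rescales each parameter by the local curvature of the KL divergence, which is what makes the update invariant to reparameterization.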
Efficient Wasserstein Natural Gradients for Reinforcement Learning
A novel optimization approach is proposed for application to policy gradient methods and evolution strategies for reinforcement learning (RL). The procedure uses a computationally efficient Wasserstein natural gradient (WNG) descent that takes advantage of the geometry induced by a Wasserstein penalty to speed optimization. This method follows the recent theme in RL of including a divergence penalty in the objective to establish a trust region. Experiments on challenging tasks demonstrate improvements in both computational cost and performance over advanced baselines.
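Schematically (notation ours, not the paper's), a Wasserstein natural gradient step preconditions the objective gradient with the metric tensor $G_W$ induced by the Wasserstein-2 distance on the policy family:

```latex
\theta_{k+1} = \theta_k - \eta\, G_W(\theta_k)^{-1}\, \nabla_\theta J(\theta_k),
\qquad
W_2^2\!\left(p_\theta,\, p_{\theta+\delta}\right) = \delta^\top G_W(\theta)\, \delta + o(\|\delta\|^2).
```

For a one-dimensional Gaussian parameterized by $(\mu, \sigma)$, $G_W$ is the identity, so the WNG step coincides with plain gradient descent in those coordinates, in contrast to the Fisher-preconditioned update above.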
Solving general elliptical mixture models through an approximate Wasserstein manifold
We address the estimation problem for general finite mixture models, with a
particular focus on the elliptical mixture models (EMMs). Compared to the
widely adopted Kullback-Leibler divergence, we show that the Wasserstein
distance provides a more desirable optimisation space. We thus provide a stable
solution to the EMMs that is both robust to initialisations and reaches a
superior optimum by adaptively optimising along a manifold of an approximate
Wasserstein distance. To this end, we first provide a unifying account of
computable and identifiable EMMs, which serves as a basis to rigorously address
the underpinning optimisation problem. Due to a probability constraint, solving
this problem is extremely cumbersome and unstable, especially under the
Wasserstein distance. To relieve this issue, we introduce an efficient
optimisation method on a statistical manifold defined under an approximate
Wasserstein distance, which allows for explicit metrics and computable
operations, thus significantly stabilising and improving the EMM estimation. We
further propose an adaptive method to accelerate the convergence. Experimental
results demonstrate the excellent performance of the proposed EMM solver.
Comment: This work has been accepted to AAAI 2020. Note that this version also corrects a small error in Equation (16) in the proof.
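For reference (standard notation, not taken from the paper), an elliptical mixture model places a density generator $g$ around each component, and the mixture weights carry the probability constraint the abstract refers to:

```latex
p(x) = \sum_{k=1}^{K} \pi_k\, c_g\, |\Sigma_k|^{-1/2}\,
       g\!\left((x - \mu_k)^\top \Sigma_k^{-1} (x - \mu_k)\right),
\qquad \pi_k \ge 0, \quad \sum_{k=1}^{K} \pi_k = 1,
```

where $c_g$ normalizes each component density; choosing $g(t) = e^{-t/2}$ recovers the Gaussian mixture model.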
On parameter estimation with the Wasserstein distance
Statistical inference can be performed by minimizing, over the parameter
space, the Wasserstein distance between model distributions and the empirical
distribution of the data. We study asymptotic properties of such minimum
Wasserstein distance estimators, complementing results derived by Bassetti,
Bodini and Regazzini in 2006. In particular, our results cover the misspecified
setting, in which the data-generating process is not assumed to be part of the
family of distributions described by the model. Our results are motivated by
recent applications of minimum Wasserstein estimators to complex generative
models. We discuss some difficulties arising in the approximation of these
estimators and illustrate their behavior in several numerical experiments. Two
of our examples are taken from the literature on approximate Bayesian
computation and have likelihood functions that are not analytically tractable.
Two other examples involve misspecified models.
Comment: 29 pages (+18 pages of appendices), 6 figures. To appear in Information and Inference: A Journal of the IMA. A previous version of this paper contained work on approximate Bayesian computation with the Wasserstein distance, which can now be found at arxiv:1905.0374
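A minimal sketch of such an estimator in one dimension (illustrative only, not the authors' procedure; the grid search, sample sizes, and well-specified Gaussian location model are our assumptions, and the names `w1_1d` and `min_wasserstein_location` are hypothetical): with equal sample sizes, the 1-Wasserstein distance between two empirical distributions reduces to the mean absolute difference of sorted samples, so the estimator can be approximated by simulating from the model and minimizing over a parameter grid.

```python
import numpy as np

rng = np.random.default_rng(1)
data = rng.normal(loc=3.0, scale=1.0, size=500)

def w1_1d(x, y):
    # For equal-size empirical distributions in 1D, W1 is the mean
    # absolute difference between sorted samples (quantile coupling).
    return np.mean(np.abs(np.sort(x) - np.sort(y)))

def min_wasserstein_location(data, thetas, n_sim=500):
    # Grid-search minimum-W1 estimator for a unit-variance Gaussian
    # location model, simulating model samples at each candidate theta.
    noise = rng.normal(size=n_sim)  # common random numbers across thetas
    dists = [w1_1d(data, theta + noise) for theta in thetas]
    return thetas[int(np.argmin(dists))]

grid = np.linspace(0.0, 6.0, 601)
print(min_wasserstein_location(data, grid))  # should land near 3.0
```

Reusing common random numbers across the grid keeps the simulated objective smooth in theta; without this, the Monte Carlo noise in each evaluation illustrates one of the approximation difficulties the paper discusses.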