Approximate Newton Methods for Policy Search in Markov Decision Processes
Approximate Newton methods are standard optimization tools which aim to maintain the benefits of Newton's method, such as a fast rate of convergence, while alleviating its drawbacks, such as computationally expensive calculation or estimation of the inverse Hessian. In this work we investigate approximate Newton methods for policy optimization in Markov decision processes (MDPs). We first analyse the structure of the Hessian of the total expected reward, which is a standard objective function for MDPs. We show that, like the gradient, the Hessian exhibits useful structure in the context of MDPs and we use this analysis to motivate two Gauss-Newton methods for MDPs. Like the Gauss-Newton method for non-linear least squares, these methods drop certain terms in the Hessian. The approximate Hessians possess desirable properties, such as negative definiteness, and we demonstrate several important performance guarantees including guaranteed ascent directions, invariance to affine transformation of the parameter space and convergence guarantees. We finally provide a unifying perspective of key policy search algorithms, demonstrating that our second Gauss-Newton algorithm is closely related to both the EM-algorithm and natural gradient ascent applied to MDPs, but performs significantly better in practice on a range of challenging domains.
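As a rough illustration of why negative definiteness yields ascent directions, consider a generic preconditioned update. The notation is assumed for this sketch and is not taken from the paper: U(θ) is the total expected reward, Ĥ(θ) is any negative-definite approximation to its Hessian, and α_k > 0 is a step size.

```latex
\begin{align*}
\theta_{k+1} &= \theta_k + \alpha_k \, d_k,
\qquad
d_k = -\hat{H}(\theta_k)^{-1} \nabla_\theta U(\theta_k), \\
\nabla_\theta U(\theta_k)^{\top} d_k
  &= \nabla_\theta U(\theta_k)^{\top} \bigl(-\hat{H}(\theta_k)\bigr)^{-1} \nabla_\theta U(\theta_k) > 0
\quad \text{whenever } \nabla_\theta U(\theta_k) \neq 0 .
\end{align*}
```

Because the inverse of a positive-definite matrix is positive definite, the inner product between the gradient and d_k is strictly positive away from stationary points, so a sufficiently small step along d_k increases U. The exact Hessian of a non-concave objective offers no such guarantee.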
A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes
Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being considered the current state of the art in the field. In this article we provide a unifying perspective of these two algorithms by showing that their step-directions in the parameter space are closely related to the search direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative gradient-based method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton's method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.
Convergence Analysis of the Approximate Newton Method for Markov Decision Processes
Recently two approximate Newton methods were proposed for the optimisation of
Markov Decision Processes. While these methods were shown to have desirable
properties, such as a guarantee that the preconditioner is
negative-semidefinite when the policy is log-concave with respect to the
policy parameters, and were demonstrated to have strong empirical performance
in challenging domains, such as the game of Tetris, no convergence analysis was
provided. The purpose of this paper is to provide such an analysis. We start by
providing a detailed analysis of the Hessian of a Markov Decision Process,
which is formed of a negative-semidefinite component, a positive-semidefinite
component and a remainder term. The first part of our analysis details how the
negative-semidefinite and positive-semidefinite components relate to each
other, and how these two terms contribute to the Hessian. The next part of our
analysis shows that under certain conditions, relating to the richness of the
policy class, the remainder term in the Hessian vanishes in the vicinity of a
local optimum. Finally, we bound the behaviour of this remainder term in terms
of the mixing time of the Markov chain induced by the policy parameters, where
this part of the analysis is applicable over the entire parameter space. Given
this analysis of the Hessian we then provide our local convergence analysis of
the approximate Newton framework.
Comment: This work has been removed because a more recent piece of work (A
Gauss-Newton method for Markov Decision Processes, T. Furmston & G. Lever) has
subsumed it.
Is the Bellman residual a bad proxy?
This paper aims at theoretically and empirically comparing two standard
optimization criteria for Reinforcement Learning: i) maximization of the mean
value and ii) minimization of the Bellman residual. For that purpose, we place
ourselves in the framework of policy search algorithms, that are usually
designed to maximize the mean value, and derive a method that minimizes the
residual over policies. A theoretical analysis
shows how good this proxy is to policy optimization, and notably that it is
better than its value-based counterpart. We also propose experiments on
randomly generated generic Markov decision processes, specifically designed for
studying the influence of the involved concentrability coefficient. They show
that the Bellman residual is generally a bad proxy to policy optimization and
that directly maximizing the mean value is much better, despite the current
lack of deep theoretical analysis. This might seem obvious, as directly
addressing the problem of interest is usually better, but given the prevalence
of (projected) Bellman residual minimization in value-based reinforcement
learning, we believe that this question is worth considering.
Comment: Final NIPS 2017 version (title, among other things, changed)
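The two criteria compared in this abstract can be made concrete on a toy problem. The sketch below is a minimal illustration, not the paper's method: the two-state MDP, the uniform state weighting, and the use of an unweighted L1 norm for the residual of the optimal Bellman operator are all assumptions chosen for this example, and the helper names `policy_value` and `bellman_residual` are hypothetical.

```python
import numpy as np

def policy_value(P, R, pi, gamma):
    """Exact value of a stochastic policy pi (S x A) in a tabular MDP.

    P has shape (A, S, S) with P[a, s, t] = Pr(t | s, a); R has shape (S, A).
    """
    n_states = R.shape[0]
    P_pi = np.einsum("sa,ast->st", pi, P)   # state-to-state kernel under pi
    r_pi = (pi * R).sum(axis=1)             # expected one-step reward under pi
    return np.linalg.solve(np.eye(n_states) - gamma * P_pi, r_pi)

def bellman_residual(P, R, v, gamma):
    """L1 norm of T*v - v, where T* is the optimal Bellman operator."""
    q = R + gamma * np.einsum("ast,t->sa", P, v)
    return np.abs(q.max(axis=1) - v).sum()

# Toy MDP (assumed): action 0 stays put, action 1 moves to state 1.
# State 0 pays reward 0 and state 1 pays reward 1, whatever the action.
P = np.array([[[1.0, 0.0], [0.0, 1.0]],
              [[0.0, 1.0], [0.0, 1.0]]])
R = np.array([[0.0, 0.0], [1.0, 1.0]])
gamma = 0.9

pi_opt = np.array([[0.0, 1.0], [1.0, 0.0]])   # leave state 0 immediately
pi_stay = np.array([[1.0, 0.0], [1.0, 0.0]])  # never leave state 0

v_opt = policy_value(P, R, pi_opt, gamma)
v_stay = policy_value(P, R, pi_stay, gamma)

mean_opt, mean_stay = v_opt.mean(), v_stay.mean()   # criterion (i)
residual_opt = bellman_residual(P, R, v_opt, gamma)    # criterion (ii)
residual_stay = bellman_residual(P, R, v_stay, gamma)
```

In this toy case both criteria rank the two policies the same way: the optimal policy has zero residual and the higher mean value. The abstract's question is how the two objectives behave as optimization targets over a restricted policy class, which a two-policy comparison like this does not address.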
Polynomial Time Algorithms for Branching Markov Decision Processes and Probabilistic Min(Max) Polynomial Bellman Equations
We show that one can approximate the least fixed point solution for a
multivariate system of monotone probabilistic max(min) polynomial equations,
referred to as maxPPSs (and minPPSs, respectively), in time polynomial in both
the encoding size of the system of equations and in log(1/epsilon), where
epsilon > 0 is the desired additive error bound of the solution. (The model of
computation is the standard Turing machine model.) We establish this result
using a generalization of Newton's method which applies to maxPPSs and minPPSs,
even though the underlying functions are only piecewise-differentiable. This
generalizes our recent work which provided a P-time algorithm for purely
probabilistic PPSs.
These equations form the Bellman optimality equations for several important
classes of infinite-state Markov Decision Processes (MDPs). Thus, as a
corollary, we obtain the first polynomial time algorithms for computing to
within arbitrary desired precision the optimal value vector for several classes
of infinite-state MDPs which arise as extensions of classic, and heavily
studied, purely stochastic processes. These include both the problem of
maximizing and minimizing the termination (extinction) probability of
multi-type branching MDPs, stochastic context-free MDPs, and 1-exit Recursive
MDPs.
Furthermore, we also show that we can compute in P-time an epsilon-optimal
policy for both maximizing and minimizing branching, context-free, and
1-exit-Recursive MDPs, for any given desired epsilon > 0. This is despite the
fact that actually computing optimal strategies is Sqrt-Sum-hard and
PosSLP-hard in this setting.
We also derive, as an easy consequence of these results, an FNP upper bound
on the complexity of computing the value (within arbitrary desired precision)
of branching simple stochastic games (BSSGs).
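For intuition about the least fixed point being approximated, the purely probabilistic, single-variable case can be sketched with plain Newton iteration started from zero. This is only an illustration under assumed data: the polynomial below (a branching process in which an individual dies with probability 0.4 or produces two offspring with probability 0.6) and the function name `newton_lfp` are hypothetical, and the sketch does not implement the generalized Newton method for max/min systems described in the abstract.

```python
def newton_lfp(p, dp, tol=1e-12, max_iter=100):
    """Newton iteration on p(x) - x = 0, started at x = 0.

    p is a polynomial with non-negative coefficients summing to at most 1
    and dp is its derivative; only the single-variable case is handled.
    """
    x = 0.0
    for _ in range(max_iter):
        residual = p(x) - x
        if abs(residual) < tol:
            break
        # Newton step for f(x) = p(x) - x, whose derivative is dp(x) - 1.
        x += residual / (1.0 - dp(x))
    return x

def p(x):
    # Assumed offspring law: 0 children w.p. 0.4, 2 children w.p. 0.6.
    return 0.4 + 0.6 * x * x

def dp(x):
    return 1.2 * x

extinction = newton_lfp(p, dp)
```

The equation x = 0.4 + 0.6x² has two fixed points, 2/3 and 1. Starting from zero, the iterates increase towards the smaller one, which is the extinction probability of this process.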