A Unifying Perspective of Parametric Policy Search Methods for Markov Decision Processes

Authors: Barber D; Furmston T
Publication venue: Neural Information Processing Systems Foundation
Publication date: 01/01/2012

Parametric policy search algorithms are one of the methods of choice for the optimisation of Markov Decision Processes, with Expectation Maximisation and natural gradient ascent being considered the current state of the art in the field. In this article we provide a unifying perspective of these two algorithms by showing that their step-directions in the parameter space are closely related to the search direction of an approximate Newton method. This analysis leads naturally to the consideration of this approximate Newton method as an alternative gradient-based method for Markov Decision Processes. We are able to show that the algorithm has numerous desirable properties, absent in the naive application of Newton's method, that make it a viable alternative to either Expectation Maximisation or natural gradient ascent. Empirical results suggest that the algorithm has excellent convergence and robustness properties, performing strongly in comparison to both Expectation Maximisation and natural gradient ascent.

CiteSeerX

UCL Discovery

Smoothing Policies and Safe Policy Gradients

Authors: Papini Matteo; Pirotta Matteo; Restelli Marcello
Publication date: 08/05/2019

Policy gradient algorithms are among the best candidates for the much anticipated application of reinforcement learning to real-world control tasks, such as the ones arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify those meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimators. By a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

UPF Digital Repository

Convergence Analysis of the Approximate Newton Method for Markov Decision Processes

Authors: Furmston Thomas; Lever Guy
Publication date: 04/08/2015

Recently two approximate Newton methods were proposed for the optimisation of Markov Decision Processes. While these methods were shown to have desirable properties, such as a guarantee that the preconditioner is negative-semidefinite when the policy is log-concave with respect to the policy parameters, and were demonstrated to have strong empirical performance in challenging domains, such as the game of Tetris, no convergence analysis was provided. The purpose of this paper is to provide such an analysis. We start by providing a detailed analysis of the Hessian of a Markov Decision Process, which is formed of a negative-semidefinite component, a positive-semidefinite component and a remainder term. The first part of our analysis details how the negative-semidefinite and positive-semidefinite components relate to each other, and how these two terms contribute to the Hessian. The next part of our analysis shows that under certain conditions, relating to the richness of the policy class, the remainder term in the Hessian vanishes in the vicinity of a local optimum. Finally, we bound the behaviour of this remainder term in terms of the mixing time of the Markov chain induced by the policy parameters, where this part of the analysis is applicable over the entire parameter space. Given this analysis of the Hessian we then provide our local convergence analysis of the approximate Newton framework. Comment: This work has been removed because a more recent piece (A Gauss-Newton method for Markov Decision Processes, T. Furmston & G. Lever) of work has subsumed i

arXiv.org e-Print Archive

CiteSeerX

Expected Policy Gradients

Authors: Ciosek Kamil; Whiteson Shimon
Publication date: 01/01/2018

We propose expected policy gradients (EPG), which unify stochastic policy gradients (SPG) and deterministic policy gradients (DPG) for reinforcement learning. Inspired by expected SARSA, EPG integrates across the action when estimating the gradient, instead of relying only on the action in the sampled trajectory. We establish a new general policy gradient theorem, of which the stochastic and deterministic policy gradient theorems are special cases. We also prove that EPG reduces the variance of the gradient estimates without requiring deterministic policies and, for the Gaussian case, with no computational overhead. Finally, we show that it is optimal in a certain sense to explore with a Gaussian policy such that the covariance is proportional to the exponential of the scaled Hessian of the critic with respect to the actions. We present empirical results confirming that this new form of exploration substantially outperforms DPG with the Ornstein-Uhlenbeck heuristic in four challenging MuJoCo domains. Comment: Conference paper, AAAI-18, 12 pages including supplemen

arXiv.org e-Print Archive

Oxford University Research Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications
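
The idea in the Expected Policy Gradients abstract of integrating across actions, rather than using only the sampled action, can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a single state, a softmax policy over three discrete actions, and a fixed hypothetical critic `q_values`. All function names and numbers are illustrative assumptions.

```python
import math
import random

def softmax(logits):
    # Numerically stable softmax over a list of logits.
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def grad_log_pi(probs, action):
    # Gradient of log softmax w.r.t. the logits: one_hot(action) - probs.
    return [(1.0 if i == action else 0.0) - p for i, p in enumerate(probs)]

def sampled_gradient(probs, q_values, rng):
    # Single-action estimate: Q(a) * grad log pi(a), with a drawn from pi.
    action = rng.choices(range(len(probs)), weights=probs)[0]
    return [q_values[action] * g for g in grad_log_pi(probs, action)]

def expected_gradient(probs, q_values):
    # Sum over all actions: sum_a pi(a) * Q(a) * grad log pi(a).
    grad = [0.0] * len(probs)
    for a, (p, q) in enumerate(zip(probs, q_values)):
        for i, g in enumerate(grad_log_pi(probs, a)):
            grad[i] += p * q * g
    return grad

logits = [0.2, -0.5, 1.0]      # hypothetical policy parameters
q_values = [1.0, 3.0, -2.0]    # hypothetical critic values Q(s, a)
probs = softmax(logits)
exact = expected_gradient(probs, q_values)

# Average many single-action estimates to compare against the exact sum.
rng = random.Random(0)
n = 20000
mean = [0.0] * len(probs)
for _ in range(n):
    g = sampled_gradient(probs, q_values, rng)
    mean = [m + x / n for m, x in zip(mean, g)]

print("summed over actions:", exact)
print("mean of sampled estimates:", mean)
```

In this toy setting the summed form has no sampling noise from the action choice, while the single-action estimate only approaches it on average. Continuous actions, as in the Gaussian case the abstract mentions, would replace the sum with an integral.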

A view of Estimation of Distribution Algorithms through the lens of Expectation-Maximization

Authors: Brookes David H.; Busia Akosua; Fannjiang Clara; Listgarten Jennifer; Murphy Kevin
Publication date: 11/06/2020

We show that a large class of Estimation of Distribution Algorithms, including, but not limited to, Covariance Matrix Adaptation, can be written as a Monte Carlo Expectation-Maximization algorithm, and as exact EM in the limit of infinite samples. Because EM sits on a rigorous statistical foundation and has been thoroughly analyzed, this connection provides a new coherent framework with which to reason about EDAs.

arXiv.org e-Print Archive
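
For readers unfamiliar with Estimation of Distribution Algorithms, the sample-then-refit loop that the abstract above relates to EM can be sketched roughly as follows. This is a generic one-dimensional Gaussian EDA with truncation selection and a maximum-likelihood refit, not the paper's formal construction. The objective, the function names and all constants are hypothetical choices made for illustration.

```python
import math
import random

def objective(x):
    # Hypothetical objective to maximise; its peak is at x = 3.
    return -(x - 3.0) ** 2

def eda_step(mu, sigma, rng, n_samples=200, elite_frac=0.2):
    # Sample candidates from the current Gaussian search distribution.
    xs = [rng.gauss(mu, sigma) for _ in range(n_samples)]
    # Keep the best-scoring fraction of the samples (truncation selection).
    xs.sort(key=objective, reverse=True)
    elite = xs[: max(2, int(elite_frac * n_samples))]
    # Refit the Gaussian to the elite set by maximum likelihood.
    new_mu = sum(elite) / len(elite)
    var = sum((x - new_mu) ** 2 for x in elite) / len(elite)
    # Floor sigma so the search distribution never fully degenerates.
    return new_mu, max(math.sqrt(var), 1e-6)

rng = random.Random(0)
mu, sigma = 0.0, 4.0   # hypothetical initial search distribution
for _ in range(30):
    mu, sigma = eda_step(mu, sigma, rng)

print("final mean:", mu)
print("final std:", sigma)
```

The refit on selected samples is the step that resembles a maximisation over distribution parameters. The initial standard deviation is deliberately wide here, because a search distribution that shrinks faster than it moves can stall before reaching the optimum.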

Recommended from our members

Econometrics: A bird's eye view

Authors: Geweke J; Horowitz JL; Pesaran MH
Publication venue: Macmillan
Publication date: 01/01/2008

As a unified discipline, econometrics is still relatively young and has been transforming and expanding very rapidly over the past few decades. Major advances have taken place in the analysis of cross sectional data by means of semi-parametric and non-parametric techniques. Heterogeneity of economic relations across individuals, firms and industries is increasingly acknowledged and attempts have been made to take it into account either by integrating out its effects or by modeling the sources of heterogeneity when suitable panel data exists. The counterfactual considerations that underlie policy analysis and treatment evaluation have been given a more satisfactory foundation. New time series econometric techniques have been developed and employed extensively in the areas of macroeconometrics and finance. Non-linear econometric techniques are used increasingly in the analysis of cross section and time series observations. Applications of Bayesian techniques to econometric problems have been given new impetus largely thanks to advances in computer power and computational techniques. The use of Bayesian techniques has in turn provided the investigators with a unifying framework where the tasks of forecasting, decision making, model evaluation and learning can be considered as parts of the same interactive and iterative process, thus paving the way for establishing the foundation of "real time econometrics". This paper attempts to provide an overview of some of these developments.

Apollo (Cambridge)