1,950 research outputs found

Policy gradient in Lipschitz Markov Decision Processes

Author: Bascetta Luca
Pirotta Matteo
Restelli Marcello
Publication venue: 'Springer Science and Business Media LLC'
Publication date: 01/01/2015

This paper exploits Lipschitz continuity properties of Markov Decision Processes to safely speed up policy-gradient algorithms. Starting from assumptions about the Lipschitz continuity of the state-transition model, the reward function, and the policies considered in the learning process, we show that both the expected return of a policy and its gradient are Lipschitz continuous with respect to the policy parameters. By leveraging these properties, we define policy-parameter updates that guarantee a performance improvement at each iteration. The proposed methods are empirically evaluated and compared to other related approaches using different configurations of three popular control scenarios: the linear quadratic regulator, the mass-spring-damper system, and ship-steering control.

Archivio istituzionale della ricerca - Politecnico di Milano
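
The abstract above relies on the expected return having a Lipschitz continuous gradient. As a rough illustration of why that yields guaranteed improvement, the sketch below uses the generic smoothness argument: if the gradient of J has Lipschitz constant L, a gradient-ascent step of size 1/L increases J by at least ||grad J||^2 / (2L). The toy quadratic objective, the constants `A_DIAG` and `B`, and the function names are assumptions made for this example; the paper derives its own bounds and step sizes from the MDP's Lipschitz constants, which are not reproduced here.

```python
# Toy objective standing in for an expected return J(theta); hypothetical, not from the paper.
# J(theta) = sum_i (-0.5 * a_i * theta_i^2 + b_i * theta_i), with every a_i > 0.
A_DIAG = [1.0, 4.0]
B = [1.0, -2.0]
LIPSCHITZ = max(A_DIAG)  # Lipschitz constant of the gradient of this quadratic


def objective(theta):
    """Value of the toy objective at theta."""
    return sum(-0.5 * a * t * t + b * t for a, b, t in zip(A_DIAG, B, theta))


def gradient(theta):
    """Exact gradient of the toy objective at theta."""
    return [-a * t + b for a, b, t in zip(A_DIAG, B, theta)]


def safe_step(theta, lipschitz=LIPSCHITZ):
    """One ascent step of size 1/L, plus the improvement guaranteed by smoothness."""
    g = gradient(theta)
    alpha = 1.0 / lipschitz
    new_theta = [t + alpha * gi for t, gi in zip(theta, g)]
    guaranteed_gain = sum(gi * gi for gi in g) / (2.0 * lipschitz)
    return new_theta, guaranteed_gain


def run(theta, steps=20):
    """Repeat safe steps, recording the objective and the guaranteed gain of each step."""
    history = [objective(theta)]
    gains = []
    for _ in range(steps):
        theta, gain = safe_step(theta)
        gains.append(gain)
        history.append(objective(theta))
    return theta, history, gains
```

Calling `run([3.0, 3.0])` produces a non-decreasing sequence of objective values in which each step improves by at least the recorded guaranteed gain. In practice the gradient is estimated from samples and the Lipschitz constant is only bounded, which is where the paper's analysis is needed.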

Policy gradient in Lipschitz Markov Decision Processes

Author: DP Bertsekas
H Ammar
I Grondman
J Baxter
J Peters
J Peters
JC Spall
JJ Moré
K Hinderer
L Armijo
Luca Bascetta
M Pirotta
Marcello Restelli
Matteo Pirotta
ML Puterman
MP Deisenroth
N Ferns
N Vlassis
P Wagner
RJ Williams
Publication venue: 'Springer Science and Business Media LLC'

Crossref

Smoothing Policies and Safe Policy Gradients

Author: Papini Matteo
Pirotta Matteo
Restelli Marcello
Publication date: 08/05/2019

Policy gradient algorithms are among the best candidates for the much anticipated application of reinforcement learning to real-world control tasks, such as the ones arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify those meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimators. By a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

UPF Digital Repository

Algorithms for CVaR Optimization in MDPs

Author: Chow Yinlam
Ghavamzadeh Mohammad
Publication date: 10/07/2014

In many sequential decision-making problems we may want to manage risk by minimizing some measure of variability in costs in addition to minimizing a standard criterion. Conditional value-at-risk (CVaR) is a relatively new risk measure that addresses some of the shortcomings of the well-known variance-related risk measures, and because of its computational efficiencies has gained popularity in finance and operations research. In this paper, we consider the mean-CVaR optimization problem in MDPs. We first derive a formula for computing the gradient of this risk-sensitive objective function. We then devise policy gradient and actor-critic algorithms that each use a specific method to estimate this gradient and update the policy parameters in the descent direction. We establish the convergence of our algorithms to locally risk-sensitive optimal policies. Finally, we demonstrate the usefulness of our algorithms in an optimal stopping problem. Comment: Submitted to NIPS 1

arXiv.org e-Print Archive

CiteSeerX
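
For readers unfamiliar with the risk measure named in the abstract above, the following is a minimal sketch of how CVaR can be estimated from a batch of sampled costs. It uses the standard Rockafellar-Uryasev form, CVaR_alpha(Z) = min over nu of nu + E[max(Z - nu, 0)] / (1 - alpha), evaluated at the empirical value-at-risk. The function names and the quantile convention are assumptions made for this example; the paper's gradient formula and algorithms are not reproduced here.

```python
import math


def empirical_var(costs, alpha):
    """Empirical value-at-risk: the smallest sample with at least an alpha fraction at or below it."""
    ordered = sorted(costs)
    index = math.ceil(alpha * len(ordered)) - 1
    return ordered[max(index, 0)]


def empirical_cvar(costs, alpha):
    """Empirical CVaR via the Rockafellar-Uryasev expression evaluated at nu = VaR."""
    nu = empirical_var(costs, alpha)
    mean_excess = sum(max(c - nu, 0.0) for c in costs) / len(costs)
    return nu + mean_excess / (1.0 - alpha)
```

With costs 1, 2, 3, 4 and alpha = 0.5, `empirical_cvar` returns 3.5, the mean of the two largest costs, which matches the reading of CVaR as the expected cost in the worst (1 - alpha) tail.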