Smoothing Policies and Safe Policy Gradients

Papini, Matteo; Pirotta, Matteo; Restelli, Marcello

Publication date: 8 May 2019

Abstract

Policy gradient algorithms are among the best candidates for the much-anticipated application of reinforcement learning to real-world control tasks, such as those arising in robotics. However, the trial-and-error nature of these methods introduces safety issues whenever the learning phase itself must be performed on a physical system. In this paper, we address a specific safety formulation, where danger is encoded in the reward signal and the learning agent is constrained to never worsen its performance. By studying actor-only policy gradient from a stochastic optimization perspective, we establish improvement guarantees for a wide class of parametric policies, generalizing existing results on Gaussian policies. This, together with novel upper bounds on the variance of policy gradient estimators, allows us to identify the meta-parameter schedules that guarantee monotonic improvement with high probability. The two key meta-parameters are the step size of the parameter updates and the batch size of the gradient estimators. By a joint, adaptive selection of these meta-parameters, we obtain a safe policy gradient algorithm.
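The abstract ties the step size to an improvement guarantee and the batch size to a bound on estimator variance. The sketch below is a hedged illustration of the generic textbook form of that reasoning, not the paper's algorithm. It assumes the performance J(θ) is L-smooth with a known constant L, and that a known bound V caps the covariance trace of a single per-trajectory gradient estimate. Chebyshev's inequality is used here as an assumed concentration tool. The function names `error_radius` and `safe_step_size` are hypothetical, and the paper's own constants and schedules may differ.

```python
import math


def error_radius(variance_bound: float, batch_size: int, delta: float) -> float:
    """Chebyshev radius eps with P(||g_hat - grad J|| >= eps) <= delta.

    Assumes g_hat is the mean of batch_size i.i.d. per-trajectory gradient
    estimates whose covariance trace is at most variance_bound, so that
    E||g_hat - grad J||^2 <= variance_bound / batch_size.
    """
    return math.sqrt(variance_bound / (batch_size * delta))


def safe_step_size(g_hat_norm: float, smoothness: float, eps: float):
    """Step size maximizing a lower bound on the performance change.

    If J is smoothness-smooth and ||g_hat - grad J|| <= eps, then for
    theta' = theta + alpha * g_hat, with |g| = ||g_hat||:
        J(theta') - J(theta) >= alpha * (|g|^2 - eps * |g|)
                                - smoothness * alpha^2 * |g|^2 / 2
    Returns (alpha, guaranteed_improvement). When |g| <= eps, no positive
    improvement can be certified, so it returns (0.0, 0.0), i.e. no update.
    """
    if g_hat_norm <= eps:
        return 0.0, 0.0
    # Maximizer of the concave quadratic lower bound in alpha.
    alpha = (1.0 - eps / g_hat_norm) / smoothness
    # Value of the bound at that maximizer.
    guaranteed = (g_hat_norm - eps) ** 2 / (2.0 * smoothness)
    return alpha, guaranteed
```

For example, with L = 2, V = 1, a batch of 100 trajectories, and δ = 0.04, the error radius is 0.5. An estimated gradient of norm 1.5 then gives a step size of 1/3 and a certified improvement of at least 0.25 with probability at least 0.96. Increasing the batch size shrinks the radius, so under these assumptions step size and batch size can be traded off against each other, in the spirit of the joint selection the abstract describes.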

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:1905.03231

Last updated on 02/06/2019

Archivio istituzionale della ricerca - Politecnico di Milano

oai:re.public.polimi.it:11311/...

Last updated on 21/03/2023

UPF Digital Repository

oai:repositori.upf.edu:10230/5...

Last updated on 15/04/2023