
7 research outputs found

Diverse Exploration for Fast and Safe Policy Improvement

Author: Cohen Andrew
Wright Robert
Yu Lei
Publication date: 22/02/2018

We study an important yet under-addressed problem: quickly and safely improving policies in online reinforcement learning domains. As a solution, we propose a novel exploration strategy, diverse exploration (DE), which learns and deploys a diverse set of safe policies to explore the environment. We provide DE theory explaining why diversity in behavior policies enables effective exploration without sacrificing exploitation. Our empirical study shows that an online policy improvement algorithm framework implementing the DE strategy can achieve both fast policy improvement and safe online performance. Comment: AAAI1

arXiv.org e-Print Archive

Association for the Advancement of Artificial Intelligence: AAAI Publications

Adaptive Batch Size for Safe Policy Gradients

Author: Papini Matteo
Pirotta Matteo
Restelli Marcello
Publication venue: HAL CCSD
Publication date: 01/01/2017

Policy gradient methods are among the best Reinforcement Learning (RL) techniques for solving complex control problems. In real-world RL applications, it is common to have a good initial policy whose performance needs to be improved, and it may not be acceptable to try bad policies during the learning process. Although several methods for choosing the step size exist, research has paid less attention to determining the batch size, that is, the number of samples used to estimate the gradient direction for each update of the policy parameters. In this paper, we propose a set of methods that jointly optimize the step size and the batch size and guarantee (with high probability) an improvement in policy performance after each update. Besides providing theoretical guarantees, we show numerical simulations to analyse the behaviour of our methods.

Archivio istituzionale della ricerca - Politecnico di Milano

INRIA a CCSD electronic archive server

HAL Descartes

Hal-Diderot

Configurable Markov Decision Processes

Author: Metelli Alberto Maria
Mutti Mirco
Restelli Marcello
Publication date: 01/01/2018

In many real-world problems, it is possible to configure, to a limited extent, some environmental parameters to improve the performance of a learning agent. In this paper, we propose a novel framework, Configurable Markov Decision Processes (Conf-MDPs), to model this new type of interaction with the environment. Furthermore, we provide a new learning algorithm, Safe Policy-Model Iteration (SPMI), to jointly and adaptively optimize the policy and the environment configuration. After introducing our approach and deriving some theoretical results, we present an experimental evaluation on two illustrative problems to show the benefits of environment configurability on the performance of the learned policy.

arXiv.org e-Print Archive

Archivio istituzionale della ricerca - Politecnico di Milano

Archivio istituzionale della ricerca - Alma Mater Studiorum Università di Bologna

Configurable Markov Decision Processes

Author: Metelli Alberto Maria
Mutti Mirco
Restelli Marcello
Publication venue: PMLR
Publication date: 01/01/2018

Archivio istituzionale della ricerca - Politecnico di Milano

A fast and reliable policy improvement algorithm

Author: Abbasi-Yadkori Yasin
Bartlett Peter
Wright Stephen
Publication venue: Proceedings of Machine Learning Research
Publication date: 01/01/2016

We introduce a simple, efficient method that improves stochastic policies for Markov decision processes. The computational complexity is the same as that of the value estimation problem. We prove that when the value estimation error is small, this method gives an improvement in performance that increases with certain variance properties of the initial policy and transition dynamics. Performance in numerical experiments compares favorably with previous policy improvement algorithms.

Queensland University of Technology ePrints Archive