14 research outputs found

Fingerprint Policy Optimisation for Robust Reinforcement Learning

Author: Osborne Michael A.
Paul Supratik
Whiteson Shimon
Publication date: 27/05/2019

Policy gradient methods ignore the potential value of adjusting environment variables: unobservable state features that are randomly determined by the environment in a physical setting, but are controllable in a simulator. This can lead to slow learning, or convergence to suboptimal policies, if the environment variable has a large impact on the transition dynamics. In this paper, we present fingerprint policy optimisation (FPO), which finds a policy that is optimal in expectation across the distribution of environment variables. The central idea is to use Bayesian optimisation (BO) to actively select the distribution of the environment variable that maximises the improvement generated by each iteration of the policy gradient method. To make this BO practical, we contribute two easy-to-compute low-dimensional fingerprints of the current policy. Our experiments show that FPO can efficiently learn policies that are robust to significant rare events, which are unlikely to be observable under random sampling, but are key to learning good policies.

Comment: ICML 201

arXiv.org e-Print Archive

Oxford University Research Archive

Accelerated Policy Evaluation: Learning Adversarial Environments with Adaptive Importance Sampling

Author: Huang Peide
Huang Zhiyuan
Lam Henry
Li Fengpei
Oguchi Kentaro
Qi Xuewei
Xu Mengdi
Zhao Ding
Zhu Jiacheng
Publication date: 19/06/2021

The evaluation of rare but high-stakes events remains one of the main difficulties in obtaining reliable policies from intelligent agents, especially in large or continuous state/action spaces where limited scalability enforces the use of a prohibitively large number of testing iterations. On the other hand, a biased or inaccurate policy evaluation in a safety-critical system could potentially cause unexpected catastrophic failures during deployment. In this paper, we propose the Accelerated Policy Evaluation (APE) method, which simultaneously uncovers rare events and estimates the rare event probability in Markov decision processes. The APE method treats the environment nature as an adversarial agent and uses adaptive importance sampling to learn towards the zero-variance sampling distribution for the policy evaluation. Moreover, APE is scalable to large discrete or continuous spaces by incorporating function approximators. We investigate the convergence properties of the proposed algorithms under suitable regularity conditions. Our empirical studies show that APE estimates the rare event probability with a smaller variance while using orders of magnitude fewer samples than baseline methods in both multi-agent and single-agent environments.

Comment: 10 pages, 5 figure
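The abstract above relies on importance sampling to estimate a rare-event probability with few samples. The following is a minimal sketch of that general idea only, not the APE algorithm: it uses a fixed, hand-chosen proposal distribution rather than an adaptively learned one, and the toy problem (the tail probability of a standard normal), the function names, and the constants are all assumptions made for illustration.

```python
import math
import random


def normal_pdf(x, mean=0.0, std=1.0):
    """Density of a normal distribution at x."""
    z = (x - mean) / std
    return math.exp(-0.5 * z * z) / (std * math.sqrt(2.0 * math.pi))


def naive_estimate(threshold, n, rng):
    """Plain Monte Carlo: fraction of N(0, 1) samples above the threshold."""
    hits = sum(1 for _ in range(n) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n


def importance_estimate(threshold, n, rng, shift):
    """Importance sampling: draw from N(shift, 1) and reweight each hit
    by the likelihood ratio target_pdf / proposal_pdf."""
    total = 0.0
    for _ in range(n):
        x = rng.gauss(shift, 1.0)
        if x > threshold:
            total += normal_pdf(x) / normal_pdf(x, mean=shift)
    return total / n


def true_tail(threshold):
    """Exact P(X > threshold) for X ~ N(0, 1), used as a reference value."""
    return 0.5 * math.erfc(threshold / math.sqrt(2.0))


# Hypothetical settings chosen for illustration.
rng = random.Random(0)
THRESHOLD = 4.0
N = 20_000

naive = naive_estimate(THRESHOLD, N, rng)
weighted = importance_estimate(THRESHOLD, N, rng, shift=THRESHOLD)
exact = true_tail(THRESHOLD)

print(f"naive: {naive:.3e}  importance: {weighted:.3e}  exact: {exact:.3e}")
```

Centring the proposal on the threshold makes the rare event common under the sampling distribution, and the likelihood-ratio weights correct for the resulting bias. An adaptive scheme would adjust the proposal from data instead of fixing it in advance.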

arXiv.org e-Print Archive

Correctness-guaranteed strategy synthesis and compression for multi-agent autonomous systems

Author: Enoiu Eduard
Gu Rong
Jensen Peter G.
Lundqvist Kristina
Seceleanu Cristina
Publication venue: Elsevier BV
Publication date: 01/12/2022

VBN

Importance sampling in reinforcement learning with an estimated behavior policy

Author: A Dvoretzky
B Delyon
C Gelada
CD Manning
CJ Oates
D Silver
E Greensmith
K Hirano
M Henmi
ML Puterman
PC Austin
PR Rosenbaum
Q Liu
R Bellman
RJ Williams
RS Sutton
RY Rubinstein
S Mahadevan
SP Singh
V Mnih
Publication venue: Springer Science and Business Media LLC
Publication date: 01/06/2021

Crossref

Edinburgh Research Explorer

Robust Reinforcement Learning with Bayesian Optimisation and Quadrature

Author: Chatzilygeroudis Konstantinos
Ciosek Kamil
Mouret Jean-Baptiste
Osborne Michael
Paul Supratik
Whiteson Shimon
Publication venue: Microtome Publishing
Publication date: 01/01/2020

Bayesian optimisation has been successfully applied to a variety of reinforcement learning problems. However, the traditional approach for learning optimal policies in simulators does not utilise the opportunity to improve learning by adjusting certain environment variables: state features that are unobservable and randomly determined by the environment in a physical setting but are controllable in a simulator. This article considers the problem of finding a robust policy while taking into account the impact of environment variables. We present alternating optimisation and quadrature (ALOQ), which uses Bayesian optimisation and Bayesian quadrature to address such settings. We also present transferable ALOQ (TALOQ), for settings where simulator inaccuracies lead to difficulty in transferring the learnt policy to the physical system. We show that our algorithms are robust to the presence of significant rare events, which may not be observable under random sampling but play a substantial role in determining the optimal policy. Experimental results across different domains show that our algorithms learn robust policies efficiently.

INRIA a CCSD electronic archive server