Fingerprint Policy Optimisation for Robust Reinforcement Learning
Policy gradient methods ignore the potential value of adjusting environment
variables: unobservable state features that are randomly determined by the
environment in a physical setting, but are controllable in a simulator. This
can lead to slow learning, or convergence to suboptimal policies, if the
environment variable has a large impact on the transition dynamics. In this
paper, we present fingerprint policy optimisation (FPO), which finds a policy
that is optimal in expectation across the distribution of environment
variables. The central idea is to use Bayesian optimisation (BO) to actively
select the distribution of the environment variable that maximises the
improvement generated by each iteration of the policy gradient method. To make
this BO practical, we contribute two easy-to-compute low-dimensional
fingerprints of the current policy. Our experiments show that FPO can
efficiently learn policies that are robust to significant rare events, which
are unlikely to be observable under random sampling, but are key to learning
good policies.
Comment: ICML 201
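The objective described in the abstract above, a policy that is optimal in expectation over the environment variable, can be written compactly. The notation is an assumption made here for illustration and may differ from the paper's: \theta denotes the policy parameters, \xi the environment variable, p(\xi) its distribution in the physical setting, R(\theta, \xi) the expected return, q_{\psi} the sampling distribution used in the simulator, and \theta_{n+1}(\psi) the policy obtained after one policy gradient iteration on data sampled with \xi \sim q_{\psi}.

```latex
% Robust objective: expected return under the true distribution of \xi
J(\theta) = \mathbb{E}_{\xi \sim p(\xi)}\left[ R(\theta, \xi) \right],
\qquad
\theta^{*} = \arg\max_{\theta} J(\theta)

% One reading of the selection step: choose the simulator distribution
% that maximises the improvement produced by the next update
\psi_{n} = \arg\max_{\psi} \left[ J\big(\theta_{n+1}(\psi)\big) - J(\theta_{n}) \right]
```

Under this reading, the outer maximisation over \psi is the part the abstract assigns to Bayesian optimisation, and the policy fingerprints would serve as low-dimensional inputs describing \theta_{n}.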
Accelerated Policy Evaluation: Learning Adversarial Environments with Adaptive Importance Sampling
The evaluation of rare but high-stakes events remains one of the main
difficulties in obtaining reliable policies from intelligent agents, especially
in large or continuous state/action spaces, where limited scalability forces
the use of a prohibitively large number of testing iterations. On the other
hand, a biased or inaccurate policy evaluation in a safety-critical system
could potentially cause unexpected catastrophic failures during deployment. In
this paper, we propose the Accelerated Policy Evaluation (APE) method, which
simultaneously uncovers rare events and estimates the rare event probability in
Markov decision processes. The APE method treats the environment nature as an
adversarial agent and uses adaptive importance sampling to learn towards the
zero-variance sampling distribution for policy evaluation. Moreover, APE is
scalable to large discrete or continuous spaces by incorporating function
approximators. We investigate the convergence properties of the proposed algorithms
under suitable regularity conditions. Our empirical studies show that APE
estimates rare event probability with a smaller variance while using
orders of magnitude fewer samples than baseline methods in both
multi-agent and single-agent environments.
Comment: 10 pages, 5 figures
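The idea of adapting a sampling distribution towards the rare event can be illustrated on a toy problem. The sketch below is not the APE algorithm: it swaps in the standard cross-entropy method on a one-dimensional Gaussian, with no Markov decision process and no function approximators. The function name, the threshold of 4, and all hyperparameters are assumptions chosen for illustration.

```python
import math
import random


def rare_event_probability(threshold=4.0, n_samples=2000, n_rounds=5,
                           elite_frac=0.1, seed=0):
    """Estimate P(X > threshold) for X ~ N(0, 1) by adaptive importance sampling.

    The proposal is N(mu, 1); mu is adapted with cross-entropy updates so
    that samples concentrate on the rare region.
    """
    rng = random.Random(seed)

    def weight(x, mu):
        # Likelihood ratio p(x) / q(x) for p = N(0, 1) and q = N(mu, 1).
        return math.exp(-mu * x + 0.5 * mu * mu)

    mu = 0.0  # start from the nominal distribution
    for _ in range(n_rounds):
        xs = sorted(rng.gauss(mu, 1.0) for _ in range(n_samples))
        # Intermediate level: the elite quantile, capped at the true threshold.
        level = min(threshold, xs[int((1.0 - elite_frac) * n_samples)])
        elite = [x for x in xs if x >= level]
        ws = [weight(x, mu) for x in elite]
        # Cross-entropy update for a Gaussian mean: weighted mean of the elites.
        mu = sum(w * x for w, x in zip(ws, elite)) / sum(ws)

    # Final importance-sampling estimate under the adapted proposal.
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(mu, 1.0)
        if x > threshold:
            total += weight(x, mu)
    return total / n_samples, mu


if __name__ == "__main__":
    estimate, mu = rare_event_probability()
    exact = 0.5 * math.erfc(4.0 / math.sqrt(2.0))
    print(f"adapted proposal mean: {mu:.3f}")
    print(f"IS estimate: {estimate:.3e}, exact: {exact:.3e}")
```

For comparison, naive Monte Carlo with the same 2000 samples would expect to observe about 0.06 events above the threshold, so it would usually return an estimate of exactly zero.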
Robust Reinforcement Learning with Bayesian Optimisation and Quadrature
Bayesian optimisation has been successfully applied to a variety of reinforcement learning problems. However, the traditional approach for learning optimal policies in simulators does not utilise the opportunity to improve learning by adjusting certain environment variables: state features that are unobservable and randomly determined by the environment in a physical setting but are controllable in a simulator. This article considers the problem of finding a robust policy while taking into account the impact of environment variables. We present alternating optimisation and quadrature (ALOQ), which uses Bayesian optimisation and Bayesian quadrature to address such settings. We also present transferable ALOQ (TALOQ), for settings where simulator inaccuracies lead to difficulty in transferring the learnt policy to the physical system. We show that our algorithms are robust to the presence of significant rare events, which may not be observable under random sampling but play a substantial role in determining the optimal policy. Experimental results across different domains show that our algorithms learn robust policies efficiently.