Bayesian Optimisation for Planning and Reinforcement Learning

Abstract

This thesis addresses the problem of achieving efficient non-myopic decision making by explicitly balancing exploration and exploitation. Decision making, in both planning and reinforcement learning (RL), enables agents or robots to complete tasks by acting on their environments. Complexity arises when completing objectives requires sacrificing short-term performance to achieve better long-term performance. Decision-making algorithms with this characteristic are known as non-myopic and require long sequences of actions to be evaluated, greatly increasing the size of the search space. Optimal behaviours need to balance two key quantities: exploration and exploitation. Exploitation takes advantage of previously acquired information or high-performing solutions, whereas exploration focuses on acquiring more informative data. The balance between these quantities is crucial in both RL and planning. This thesis makes the following contributions. Firstly, a reward function trading off exploration and exploitation of gradients for sequential planning is proposed. It is based on Bayesian optimisation (BO) and is combined with a non-myopic planner to achieve efficient spatial monitoring. Secondly, the algorithm is extended to continuous action spaces as continuous belief tree search (CBTS), which uses BO to dynamically sample actions within a tree search, balancing high-performing actions and novelty. Finally, the framework is extended to RL, for which a multi-objective methodology for explicitly balancing exploration and exploitation is proposed. The two objectives are modelled explicitly and balanced at the policy level, as in BO. This allows for online exploration strategies, as well as a data-efficient model-free RL algorithm that achieves exploration by minimising the uncertainty of Q-values (EMU-Q). The proposed algorithms are evaluated on various simulated and real-world robotics problems, displaying superior performance in terms of sample efficiency and exploration.
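
The exploration-exploitation balance that underpins these contributions can be illustrated with a standard Gaussian-process upper confidence bound (GP-UCB) acquisition function. The sketch below is illustrative only: the kernel, the kappa weight, and the 1-D toy objective are assumptions for demonstration, not the thesis's actual reward function, planner, or EMU-Q algorithm.

```python
# Minimal GP-UCB sketch: exploitation (posterior mean) plus
# exploration (posterior uncertainty), as in generic BO.
import numpy as np

def rbf_kernel(a, b, lengthscale=0.2):
    """Squared-exponential kernel between two sets of 1-D points."""
    d = a[:, None] - b[None, :]
    return np.exp(-0.5 * (d / lengthscale) ** 2)

def gp_posterior(x_train, y_train, x_query, noise=1e-4):
    """GP posterior mean and standard deviation at the query points."""
    K = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    K_s = rbf_kernel(x_query, x_train)
    K_inv = np.linalg.inv(K)
    mu = K_s @ K_inv @ y_train
    var = 1.0 - np.sum((K_s @ K_inv) * K_s, axis=1)
    return mu, np.sqrt(np.maximum(var, 0.0))

def ucb(mu, sigma, kappa=2.0):
    """UCB acquisition: mu exploits known structure, kappa*sigma explores."""
    return mu + kappa * sigma

# Toy 1-D objective observed at a few points; the next sample location
# trades off high predicted value against high predictive uncertainty.
rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=5)
y_train = np.sin(6 * x_train)
x_query = np.linspace(0, 1, 200)
mu, sigma = gp_posterior(x_train, y_train, x_query)
x_next = x_query[np.argmax(ucb(mu, sigma))]
print(f"next sample location: {x_next:.3f}")
```

Raising kappa biases the choice towards unexplored regions, lowering it towards known high-value regions; the thesis applies this kind of explicit trade-off to planning rewards, action sampling in tree search, and RL policies.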
