
Learning to soar: exploration strategies in reinforcement learning for resource-constrained missions

Abstract

An unpowered aerial glider learning to soar in a wind field presents a new manifestation of the exploration-exploitation trade-off. This thesis proposes a directed, adaptive and nonmyopic exploration strategy within a temporal difference reinforcement learning framework for tackling the resource-constrained exploration-exploitation task of this autonomous soaring problem. The complete learning algorithm is developed in a SARSA(λ) framework, which uses a Gaussian process with a squared exponential covariance function to approximate the value function. The three key contributions of this thesis form the proposed exploration-exploitation strategy. Firstly, a new information measure is derived from the change in the variance volume surrounding the Gaussian process estimate; this measure of information gain defines the exploration reward of an observation. Secondly, a nonmyopic information value is presented that captures both the immediate exploration reward of taking an action and the future exploration opportunities that result. Finally, this information value is combined with the state-action value of SARSA(λ) through a dynamic weighting factor to produce an exploration-exploitation management scheme for resource-constrained learning systems. The proposed strategy encourages either exploratory or exploitative behaviour depending on the requirements of the learning task and the available resources. The performance of the learning algorithms presented in this thesis is compared against other SARSA(λ) methods. Results show that actively directing exploration to regions of the state-action space with high uncertainty improves the rate of learning, while dynamic management of the exploration-exploitation behaviour according to the available resources produces prudent learning behaviour in resource-constrained systems.
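The abstract names three concrete ingredients: a squared exponential Gaussian process over the state-action space, an exploration reward defined as the drop in the "variance volume" of the GP posterior after an observation, and a weighted combination of that information value with the SARSA(λ) state-action value. The thesis's exact formulation is not reproduced here, but the following minimal Python sketch illustrates how those pieces could fit together. All names are hypothetical, and the grid-based variance volume, the kernel hyperparameters, and the linear weighting are assumptions for illustration; the sketch also shows only the immediate (myopic) exploration reward, not the nonmyopic extension.

```python
import numpy as np

# Hypothetical sketch of the abstract's ingredients; not the thesis's code.

def sq_exp_kernel(A, B, length_scale=1.0, signal_var=1.0):
    """Squared exponential covariance k(a, b) = s^2 exp(-||a - b||^2 / (2 l^2))."""
    d2 = np.sum((A[:, None, :] - B[None, :, :]) ** 2, axis=-1)
    return signal_var * np.exp(-0.5 * d2 / length_scale**2)

def posterior_variance(X_train, X_query, noise_var=1e-2):
    """Diagonal of the GP posterior covariance at the query points."""
    K = sq_exp_kernel(X_train, X_train) + noise_var * np.eye(len(X_train))
    K_s = sq_exp_kernel(X_query, X_train)
    prior = np.diag(sq_exp_kernel(X_query, X_query))
    return prior - np.sum((K_s @ np.linalg.inv(K)) * K_s, axis=1)

def variance_volume(X_train, X_grid):
    """Crude 'variance volume': posterior variance summed over a reference grid."""
    return posterior_variance(X_train, X_grid).sum()

def exploration_reward(X_train, x_new, X_grid):
    """Information gain of one observation: the drop in variance volume it causes."""
    before = variance_volume(X_train, X_grid)
    after = variance_volume(np.vstack([X_train, x_new]), X_grid)
    return before - after

def combined_value(q_value, info_value, w):
    """Dynamically weighted exploration-exploitation score, w in [0, 1]."""
    return (1.0 - w) * q_value + w * info_value

# Usage: score a candidate state-action point against previously visited ones.
rng = np.random.default_rng(0)
X_seen = rng.uniform(0.0, 1.0, size=(5, 2))    # visited state-action points
X_grid = rng.uniform(0.0, 1.0, size=(200, 2))  # grid approximating the space
x_cand = np.array([[0.9, 0.9]])                # candidate observation
score = combined_value(q_value=0.3,
                       info_value=exploration_reward(X_seen, x_cand, X_grid),
                       w=0.5)
```

In this reading, raising `w` when resources are plentiful favours uncertainty-reducing actions, while lowering it as resources dwindle shifts the agent toward exploiting the learned value function, which is the management behaviour the abstract describes.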
