12 research outputs found
Fingerprint Policy Optimisation for Robust Reinforcement Learning
Policy gradient methods ignore the potential value of adjusting environment
variables: unobservable state features that are randomly determined by the
environment in a physical setting, but are controllable in a simulator. This
can lead to slow learning, or convergence to suboptimal policies, if the
environment variable has a large impact on the transition dynamics. In this
paper, we present fingerprint policy optimisation (FPO), which finds a policy
that is optimal in expectation across the distribution of environment
variables. The central idea is to use Bayesian optimisation (BO) to actively
select the distribution of the environment variable that maximises the
improvement generated by each iteration of the policy gradient method. To make
this BO practical, we contribute two easy-to-compute low-dimensional
fingerprints of the current policy. Our experiments show that FPO can
efficiently learn policies that are robust to significant rare events, which
are unlikely to be observable under random sampling, but are key to learning
good policies. Comment: ICML 201
Kernel quadrature with randomly pivoted Cholesky
This paper presents new quadrature rules for functions in a reproducing
kernel Hilbert space using nodes drawn by a sampling algorithm known as
randomly pivoted Cholesky. The resulting computational procedure compares
favorably to previous kernel quadrature methods, which either achieve low
accuracy or require solving a computationally challenging sampling problem.
Theoretical and numerical results show that randomly pivoted Cholesky is fast
and achieves comparable quadrature error rates to more computationally
expensive quadrature schemes based on continuous volume sampling, thinning, and
recombination. Randomly pivoted Cholesky is easily adapted to complicated
geometries with arbitrary kernels, unlocking new potential for kernel
quadrature. Comment: 19 pages, 3 figures; NeurIPS 2023 (spotlight), camera-ready version
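As a rough illustration of the procedure this abstract describes, the sketch below runs randomly pivoted Cholesky on the kernel matrix of a finite grid and turns the selected pivots into quadrature nodes. It is a simplified, assumption-laden example, not the paper's implementation: the Gaussian kernel, its bandwidth, the uniform measure on a grid of candidate points, and the function names `rp_cholesky` and `quadrature_weights` are all hypothetical choices that the abstract does not specify.

```python
import math
import random


def gaussian_kernel(x, y, bandwidth=0.2):
    """Squared-exponential kernel on the real line (bandwidth is an assumed value)."""
    return math.exp(-((x - y) ** 2) / (2.0 * bandwidth ** 2))


def rp_cholesky(points, kernel, k, rng):
    """Randomly pivoted Cholesky on the kernel matrix K of `points`.

    Returns (pivots, columns, diag): the chosen indices S, the factor columns F
    with K ~= F F^T, and the diagonal of the residual K - F F^T.
    """
    n = len(points)
    diag = [kernel(p, p) for p in points]
    columns, pivots = [], []
    for _ in range(k):
        if sum(diag) <= 1e-12:  # residual exhausted, stop early
            break
        # Draw a pivot with probability proportional to the residual diagonal.
        s = rng.choices(range(n), weights=diag, k=1)[0]
        # Column s of the current residual matrix.
        col = [kernel(points[i], points[s]) - sum(f[i] * f[s] for f in columns)
               for i in range(n)]
        if col[s] <= 0.0:  # numerical safeguard
            break
        scale = math.sqrt(col[s])
        new_col = [c / scale for c in col]
        columns.append(new_col)
        pivots.append(s)
        diag = [max(d - v * v, 0.0) for d, v in zip(diag, new_col)]
        diag[s] = 0.0  # a chosen pivot has zero residual, so it is never redrawn
    return pivots, columns, diag


def quadrature_weights(pivots, columns, mean_embedding):
    """Solve K[S, S] w = z[S] using the triangular factor L = F[S, :]."""
    k = len(pivots)
    # L[a][b] = F_b[s_a]; entries with b > a vanish, so L is lower triangular.
    L = [[columns[b][pivots[a]] for b in range(k)] for a in range(k)]
    z = [mean_embedding[s] for s in pivots]
    y = [0.0] * k
    for a in range(k):  # forward substitution: L y = z
        y[a] = (z[a] - sum(L[a][b] * y[b] for b in range(a))) / L[a][a]
    w = [0.0] * k
    for a in reversed(range(k)):  # back substitution: L^T w = y
        w[a] = (y[a] - sum(L[b][a] * w[b] for b in range(a + 1, k))) / L[a][a]
    return w


# Demo: integrate a kernel bump against the uniform measure on a grid.
rng = random.Random(0)
n = 200
points = [i / (n - 1) for i in range(n)]
# Kernel mean embedding of the grid measure: z_i = mean_j k(x_i, x_j).
z = [sum(gaussian_kernel(x, y) for y in points) / n for x in points]
pivots, columns, diag = rp_cholesky(points, gaussian_kernel, 6, rng)
weights = quadrature_weights(pivots, columns, z)
f = lambda x: gaussian_kernel(x, 0.3)
estimate = sum(w * f(points[s]) for w, s in zip(weights, pivots))
exact = sum(f(x) for x in points) / n
print("nodes:", sorted(points[s] for s in pivots))
print("quadrature estimate:", estimate, "grid average:", exact)
```

Reusing the Cholesky factor for the weight solve avoids a separate linear solver. For an integrand with unit norm in the reproducing kernel Hilbert space, the quadrature error in this discrete setting is bounded by the square root of the mean residual diagonal, `math.sqrt(sum(diag) / n)`.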
Robust Reinforcement Learning with Bayesian Optimisation and Quadrature
Bayesian optimisation has been successfully applied to a variety of reinforcement learning problems. However, the traditional approach for learning optimal policies in simulators does not utilise the opportunity to improve learning by adjusting certain environment variables: state features that are unobservable and randomly determined by the environment in a physical setting but are controllable in a simulator. This article considers the problem of finding a robust policy while taking into account the impact of environment variables. We present alternating optimisation and quadrature (ALOQ), which uses Bayesian optimisation and Bayesian quadrature to address such settings. We also present transferable ALOQ (TALOQ), for settings where simulator inaccuracies lead to difficulty in transferring the learnt policy to the physical system. We show that our algorithms are robust to the presence of significant rare events, which may not be observable under random sampling but play a substantial role in determining the optimal policy. Experimental results across different domains show that our algorithms learn robust policies efficiently.
A survey on policy search algorithms for learning robot controllers in a handful of trials
Most policy search (PS) algorithms require thousands of training episodes to find an effective policy, which is often infeasible with a physical robot. This survey article focuses on the extreme other end of the spectrum: how can a robot adapt with only a handful of trials (a dozen) and a few minutes? By analogy with the word “big-data,” we refer to this challenge as “micro-data reinforcement learning.” In this article, we show that a first strategy is to leverage prior knowledge on the policy structure (e.g., dynamic movement primitives), on the policy parameters (e.g., demonstrations), or on the dynamics (e.g., simulators). A second strategy is to create data-driven surrogate models of the expected reward (e.g., Bayesian optimization) or the dynamical model (e.g., model-based PS), so that the policy optimizer queries the model instead of the real system. Overall, all successful micro-data algorithms combine these two strategies by varying the kind of model and prior knowledge. The current scientific challenges essentially revolve around scaling up to complex robots, designing generic priors, and optimizing the computing time.
Bayesian Quadrature with Prior Information: Modeling and Policies
Quadrature is the problem of estimating intractable integrals. Such integrals regularly arise in engineering and the natural sciences, especially when Bayesian methods are applied; examples include model evidences, normalizing constants and marginal distributions. This dissertation explores Bayesian quadrature, a probabilistic, model-based quadrature method. Specifically, we study different ways in which Bayesian quadrature can be adapted to account for different kinds of prior information one may have about the task. We demonstrate that by taking into account prior knowledge, Bayesian quadrature can outperform commonly used numerical methods that are agnostic to prior knowledge, such as Monte Carlo based integration. We focus on two types of information that are (a) frequently available when faced with an intractable integral and (b) can be (approximately) incorporated into Bayesian quadrature:
• Natural bounds on the possible values that the integrand can take, e.g., when the integrand is a probability density function, it must be nonnegative everywhere.
• Knowledge about how the integral estimate will be used, i.e., for settings where quadrature is a subroutine, different downstream inference tasks can result in different priorities or desiderata for the estimate.
These types of prior information are used to inform two aspects of the Bayesian quadrature inference routine:
• Modeling: how the belief on the integrand can be tailored to account for the additional information.
• Policies: where the integrand will be observed given a constrained budget of observations.
This second aspect of Bayesian quadrature, policies for deciding where to observe the integrand, can be framed as an experimental design problem, where an agent must choose locations to evaluate a function of interest so as to maximize some notion of value. We will study the broader area of sequential experimental design, applying ideas from Bayesian decision theory to develop an efficient and nonmyopic policy for general sequential experimental design problems. We consider other sequential experimental design tasks such as Bayesian optimization and active search; in the latter, we focus on facilitating human–computer partnerships with the goal of aiding human agents engaged in data foraging through the use of active search based suggestions and an interactive visual interface. Finally, this dissertation will return to Bayesian quadrature and discuss the batch setting for experimental design, where multiple observations of the function in question are made simultaneously.
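For readers unfamiliar with the basic method this abstract builds on, the following is a minimal sketch of vanilla Bayesian quadrature on [0, 1] next to a plain Monte Carlo estimate with the same evaluation budget. It does not reproduce the dissertation's models or policies. The zero-mean Gaussian process prior, the Gaussian kernel with length scale 0.2, the fixed equispaced nodes, the test integrand, and every function name here are assumptions made for illustration only.

```python
import math
import random


def kernel(x, y, ell=0.2):
    """Gaussian (squared-exponential) kernel; the length scale is an assumed value."""
    return math.exp(-((x - y) ** 2) / (2.0 * ell ** 2))


def kernel_mean(x, ell=0.2):
    """Closed form of the integral of kernel(x, y) over y in [0, 1]."""
    c = ell * math.sqrt(math.pi / 2.0)
    s = math.sqrt(2.0) * ell
    return c * (math.erf((1.0 - x) / s) + math.erf(x / s))


def cholesky_solve(A, b):
    """Solve A x = b for a symmetric positive-definite matrix A."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = A[i][j] - sum(L[i][m] * L[j][m] for m in range(j))
            L[i][j] = math.sqrt(s) if i == j else s / L[j][j]
    y = [0.0] * n
    for i in range(n):  # forward substitution: L y = b
        y[i] = (b[i] - sum(L[i][m] * y[m] for m in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):  # back substitution: L^T x = y
        x[i] = (y[i] - sum(L[m][i] * x[m] for m in range(i + 1, n))) / L[i][i]
    return x


def bayesian_quadrature(f, nodes, ell=0.2, jitter=1e-10):
    """Posterior mean and standard deviation of the integral of f over [0, 1]."""
    K = [[kernel(a, b, ell) + (jitter if i == j else 0.0)
          for j, b in enumerate(nodes)] for i, a in enumerate(nodes)]
    z = [kernel_mean(a, ell) for a in nodes]
    weights = cholesky_solve(K, z)  # w = K^{-1} z
    estimate = sum(w * f(a) for w, a in zip(weights, nodes))
    # Prior variance of the integral: double integral of the kernel over [0, 1]^2.
    c = ell * math.sqrt(math.pi / 2.0)
    s = math.sqrt(2.0) * ell
    prior_var = 2.0 * c * s * (math.erf(1.0 / s) / s
                               + (math.exp(-1.0 / s ** 2) - 1.0) / math.sqrt(math.pi))
    variance = max(prior_var - sum(w * zi for w, zi in zip(weights, z)), 0.0)
    return estimate, math.sqrt(variance)


# Integrand chosen so that the exact integral is available in closed form.
f = lambda x: kernel(x, 0.3)
exact = kernel_mean(0.3)

nodes = [i / 7 for i in range(8)]  # 8 equispaced evaluations (a fixed design)
bq_estimate, bq_std = bayesian_quadrature(f, nodes)

rng = random.Random(0)
mc_estimate = sum(f(rng.random()) for _ in range(8)) / 8  # same budget, uniform samples

print("exact:", exact)
print("Bayesian quadrature:", bq_estimate, "+/-", bq_std)
print("Monte Carlo:", mc_estimate)
```

The node locations are fixed here. Choosing them adaptively under a budget is the "policies" question the abstract raises, and this sketch does not attempt it.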