
    Fingerprint Policy Optimisation for Robust Reinforcement Learning

    Policy gradient methods ignore the potential value of adjusting environment variables: unobservable state features that are randomly determined by the environment in a physical setting, but are controllable in a simulator. This can lead to slow learning, or convergence to suboptimal policies, if the environment variable has a large impact on the transition dynamics. In this paper, we present fingerprint policy optimisation (FPO), which finds a policy that is optimal in expectation across the distribution of environment variables. The central idea is to use Bayesian optimisation (BO) to actively select the distribution of the environment variable that maximises the improvement generated by each iteration of the policy gradient method. To make this BO practical, we contribute two easy-to-compute low-dimensional fingerprints of the current policy. Our experiments show that FPO can efficiently learn policies that are robust to significant rare events, which are unlikely to be observable under random sampling, but are key to learning good policies. Comment: ICML 2018
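    The loop below is a minimal sketch of the idea this abstract describes, under heavy simplifying assumptions: a one-parameter policy, a one-dimensional environment variable, a finite-difference stand-in for the policy gradient step, a scikit-learn GP with a UCB acquisition for the BO step, and the policy parameter itself as the "fingerprint". All names (`rollout_return`, `policy_gradient_step`, `fingerprint`) are illustrative, not the paper's implementation.

```python
# Hedged sketch of an FPO-style outer loop: BO picks the sampling distribution
# of the environment variable that is predicted to maximise the one-iteration
# improvement of an inner policy-gradient update. Toy problem, not the paper's code.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(0)

def rollout_return(theta, phi):
    # Toy one-step "environment": the return depends on the policy parameter
    # theta and on the environment variable phi (e.g. a disturbance magnitude).
    action = theta + 0.1 * rng.standard_normal()
    return -(action - phi) ** 2

def policy_gradient_step(theta, phi_mean, n_rollouts=32, lr=0.05):
    # Sample environment variables from the BO-chosen distribution and take a
    # crude finite-difference gradient step on the mean return (a stand-in for
    # the paper's policy-gradient inner loop).
    phis = phi_mean + 0.5 * rng.standard_normal(n_rollouts)
    eps = 1e-2
    grad = np.mean([rollout_return(theta + eps, p) - rollout_return(theta - eps, p)
                    for p in phis]) / (2 * eps)
    return theta + lr * grad

def fingerprint(theta):
    # Low-dimensional summary of the current policy; here simply its parameter
    # (the paper proposes richer, easy-to-compute fingerprints).
    return np.atleast_1d(theta)

theta = 0.0
history_X, history_y = [], []
gp = GaussianProcessRegressor(normalize_y=True)

for it in range(20):
    fp = fingerprint(theta)
    candidates = rng.uniform(-2.0, 2.0, size=50)   # candidate means for phi's sampling distribution
    if history_X:
        # BO step: model one-iteration improvement as a function of
        # (fingerprint, sampling-distribution mean) and pick a UCB maximiser.
        gp.fit(np.array(history_X), np.array(history_y))
        X_cand = np.column_stack([np.repeat(fp, 50), candidates])
        mu, sd = gp.predict(X_cand, return_std=True)
        phi_mean = candidates[np.argmax(mu + 2.0 * sd)]
    else:
        phi_mean = candidates[0]

    before = np.mean([rollout_return(theta, rng.standard_normal()) for _ in range(64)])
    theta = policy_gradient_step(theta, phi_mean)
    after = np.mean([rollout_return(theta, rng.standard_normal()) for _ in range(64)])

    history_X.append(np.concatenate([fp, [phi_mean]]))
    history_y.append(after - before)               # improvement is the BO objective

print("final policy parameter:", theta)
```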

    Kernel quadrature with randomly pivoted Cholesky

    This paper presents new quadrature rules for functions in a reproducing kernel Hilbert space using nodes drawn by a sampling algorithm known as randomly pivoted Cholesky. The resulting computational procedure compares favorably to previous kernel quadrature methods, which either achieve low accuracy or require solving a computationally challenging sampling problem. Theoretical and numerical results show that randomly pivoted Cholesky is fast and achieves comparable quadrature error rates to more computationally expensive quadrature schemes based on continuous volume sampling, thinning, and recombination. Randomly pivoted Cholesky is easily adapted to complicated geometries with arbitrary kernels, unlocking new potential for kernel quadrature. Comment: 19 pages, 3 figures; NeurIPS 2023 (spotlight), camera-ready version
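    The sketch below illustrates the two ingredients the abstract names: node selection by randomly pivoted Cholesky and a kernel-quadrature weight solve against an empirical kernel mean. The Gaussian kernel, the uniform measure on the unit square, and all function names are assumptions made for the illustration, not the paper's code.

```python
# Minimal sketch: randomly pivoted Cholesky picks quadrature nodes, then
# weights are chosen so the rule matches the (empirical) kernel mean embedding.
import numpy as np

rng = np.random.default_rng(0)

def gauss_kernel(X, Y, ell=0.2):
    # Gaussian (RBF) kernel matrix between point sets X and Y.
    d2 = ((X[:, None, :] - Y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * ell ** 2))

def rp_cholesky(X, k, kernel=gauss_kernel):
    # Repeatedly sample a pivot with probability proportional to the residual
    # kernel diagonal, then downdate the diagonal with the new Cholesky column.
    n = X.shape[0]
    diag = np.full(n, kernel(X[:1], X[:1])[0, 0])   # k(x, x) is constant for this kernel
    F = np.zeros((n, k))
    pivots = []
    for i in range(k):
        p = rng.choice(n, p=diag / diag.sum())
        pivots.append(p)
        col = kernel(X, X[p:p + 1])[:, 0] - F[:, :i] @ F[p, :i]
        F[:, i] = col / np.sqrt(col[p])
        diag = np.maximum(diag - F[:, i] ** 2, 0.0)
        diag[p] = 0.0                                # never re-select a pivot
    return np.array(pivots)

# Candidate points standing in for the uniform measure on the unit square.
X = rng.uniform(size=(2000, 2))
S = rp_cholesky(X, k=30)

# Quadrature weights: match the empirical kernel mean embedding on the nodes.
K_SS = gauss_kernel(X[S], X[S])
z = gauss_kernel(X[S], X).mean(axis=1)               # ≈ ∫ k(x_s, y) dμ(y)
w = np.linalg.solve(K_SS + 1e-10 * np.eye(len(S)), z)

f = lambda x: np.sin(2 * np.pi * x[:, 0]) * x[:, 1]  # test integrand
print("RPCholesky quadrature:", float(w @ f(X[S])), " MC reference:", float(f(X).mean()))
```

    Only the diagonal of the kernel matrix and one kernel column per pivot are needed, which is what makes the node-selection step cheap relative to sampling schemes that require the full kernel matrix.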

    Robust Reinforcement Learning with Bayesian Optimisation and Quadrature

    Bayesian optimisation has been successfully applied to a variety of reinforcement learning problems. However, the traditional approach for learning optimal policies in simulators does not utilise the opportunity to improve learning by adjusting certain environment variables: state features that are unobservable and randomly determined by the environment in a physical setting but are controllable in a simulator. This article considers the problem of finding a robust policy while taking into account the impact of environment variables. We present alternating optimisation and quadrature (ALOQ), which uses Bayesian optimisation and Bayesian quadrature to address such settings. We also present transferable ALOQ (TALOQ), for settings where simulator inaccuracies lead to difficulty in transferring the learnt policy to the physical system. We show that our algorithms are robust to the presence of significant rare events, which may not be observable under random sampling but play a substantial role in determining the optimal policy. Experimental results across different domains show that our algorithms learn robust policies efficiently.
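    The following is a very small, hedged sketch of the alternating structure described above, not ALOQ's actual acquisition functions: a GP surrogate over (policy parameter, environment variable), a Monte-Carlo stand-in for the Bayesian quadrature marginalisation over the environment variable, and a simple max-mean / max-variance rule for choosing the next simulator query. The toy simulator and all names are assumptions.

```python
# Hedged sketch of an ALOQ-like loop: alternate between optimising the policy
# against the marginalised surrogate and picking an informative env. variable.
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor

rng = np.random.default_rng(1)
simulate = lambda th, ph: -(th - ph) ** 2 - 0.5 * th ** 2   # toy simulator return

gp = GaussianProcessRegressor(normalize_y=True)
phi_nodes = rng.standard_normal(200)          # quadrature nodes for the env. variable
X, y = [], []

for it in range(25):
    if it < 5:
        # Initial random design.
        th_next, ph_next = rng.uniform(-2, 2), float(rng.standard_normal())
    else:
        gp.fit(np.array(X), np.array(y))
        # Quadrature step: for each candidate policy, average the GP posterior
        # mean of the return over the environment-variable distribution.
        ths = rng.uniform(-2, 2, size=64)
        grid = np.array([[t, p] for t in ths for p in phi_nodes])
        marg = gp.predict(grid).reshape(len(ths), len(phi_nodes)).mean(axis=1)
        th_next = ths[np.argmax(marg)]                       # optimisation step over the policy
        # Query the simulator at the env. variable the GP is least certain about.
        cand = np.column_stack([np.full_like(phi_nodes, th_next), phi_nodes])
        _, sd = gp.predict(cand, return_std=True)
        ph_next = float(phi_nodes[np.argmax(sd)])
    X.append([th_next, ph_next])
    y.append(simulate(th_next, ph_next))

# Final robust policy estimate: maximise the marginalised surrogate return.
gp.fit(np.array(X), np.array(y))
ths = np.linspace(-2, 2, 201)
grid = np.array([[t, p] for t in ths for p in phi_nodes])
marg = gp.predict(grid).reshape(len(ths), -1).mean(axis=1)
print("robust policy parameter estimate:", ths[np.argmax(marg)])
```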

    A survey on policy search algorithms for learning robot controllers in a handful of trials

    Most policy search (PS) algorithms require thousands of training episodes to find an effective policy, which is often infeasible with a physical robot. This survey article focuses on the extreme other end of the spectrum: how can a robot adapt with only a handful of trials (a dozen) and a few minutes? By analogy with the word “big-data,” we refer to this challenge as “micro-data reinforcement learning.” In this article, we show that a first strategy is to leverage prior knowledge on the policy structure (e.g., dynamic movement primitives), on the policy parameters (e.g., demonstrations), or on the dynamics (e.g., simulators). A second strategy is to create data-driven surrogate models of the expected reward (e.g., Bayesian optimization) or the dynamical model (e.g., model-based PS), so that the policy optimizer queries the model instead of the real system. Overall, all successful micro-data algorithms combine these two strategies by varying the kind of model and prior knowledge. The current scientific challenges essentially revolve around scaling up to complex robots, designing generic priors, and optimizing the computing time.
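    As a hedged illustration of the second strategy (a data-driven surrogate that the policy optimizer queries instead of the real system), the toy below fits a linear dynamics model from a dozen "real" transitions and then optimises a linear policy entirely inside the learned model. The 1-D system, the linear model class, and all names are assumptions for illustration, not taken from the survey.

```python
# Tiny model-based policy-search illustration: learn dynamics from 12 real
# transitions, then optimise the policy against the learned model only.
import numpy as np

rng = np.random.default_rng(2)
true_step = lambda x, u: 0.9 * x + 0.5 * u + 0.01 * rng.standard_normal()

# 1) Collect a dozen real transitions with a random exploratory policy.
data = []
x = 1.0
for _ in range(12):
    u = rng.uniform(-1, 1)
    x_next = true_step(x, u)
    data.append((x, u, x_next))
    x = x_next

# 2) Fit a linear dynamics surrogate x' ≈ a*x + b*u by least squares.
A = np.array([[s, a] for s, a, _ in data])
b = np.array([sn for _, _, sn in data])
a_hat, b_hat = np.linalg.lstsq(A, b, rcond=None)[0]

# 3) Optimise a linear policy u = -k*x using rollouts in the learned model.
def model_cost(k, horizon=30):
    s, cost = 1.0, 0.0
    for _ in range(horizon):
        u = -k * s
        s = a_hat * s + b_hat * u
        cost += s ** 2 + 0.1 * u ** 2
    return cost

ks = np.linspace(0.0, 3.0, 301)
k_best = ks[np.argmin([model_cost(k) for k in ks])]
print("policy gain learned from 12 real transitions:", k_best)
```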

    Bayesian Quadrature with Prior Information: Modeling and Policies

    Quadrature is the problem of estimating intractable integrals. Such integrals regularly arise in engineering and the natural sciences, especially when Bayesian methods are applied; examples include model evidences, normalizing constants and marginal distributions. This dissertation explores Bayesian quadrature, a probabilistic, model-based quadrature method. Specifically, we study different ways in which Bayesian quadrature can be adapted to account for different kinds of prior information one may have about the task. We demonstrate that by taking into account prior knowledge, Bayesian quadrature can outperform commonly used numerical methods that are agnostic to prior knowledge, such as Monte Carlo based integration. We focus on two types of information that are (a) frequently available when faced with an intractable integral and (b) can be (approximately) incorporated into Bayesian quadrature:
    • Natural bounds on the possible values that the integrand can take, e.g., when the integrand is a probability density function, it must be nonnegative everywhere.
    • Knowledge about how the integral estimate will be used, i.e., for settings where quadrature is a subroutine, different downstream inference tasks can result in different priorities or desiderata for the estimate.
    These types of prior information are used to inform two aspects of the Bayesian quadrature inference routine:
    • Modeling: how the belief on the integrand can be tailored to account for the additional information.
    • Policies: where the integrand will be observed given a constrained budget of observations.
    This second aspect of Bayesian quadrature, policies for deciding where to observe the integrand, can be framed as an experimental design problem, where an agent must choose locations to evaluate a function of interest so as to maximize some notion of value. We will study the broader area of sequential experimental design, applying ideas from Bayesian decision theory to develop an efficient and nonmyopic policy for general sequential experimental design problems. We consider other sequential experimental design tasks such as Bayesian optimization and active search; in the latter, we focus on facilitating human–computer partnerships with the goal of aiding human agents engaged in data foraging through the use of active search based suggestions and an interactive visual interface. Finally, this dissertation will return to Bayesian quadrature and discuss the batch setting for experimental design, where multiple observations of the function in question are made simultaneously.
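    The sketch below separates vanilla Bayesian quadrature into the two aspects named above: modeling (a GP belief on the integrand) and policy (where to observe it next). It assumes an RBF kernel, a uniform measure on [0,1] approximated by a dense grid, a nonnegative toy integrand, and a simple maximum-variance selection rule; the dissertation's bound-aware models and nonmyopic policies are not implemented here, and all names are illustrative assumptions.

```python
# Hedged sketch of vanilla Bayesian quadrature with a greedy variance-based
# node-selection policy on [0, 1].
import numpy as np

ell = 0.15
k = lambda a, b: np.exp(-(a[:, None] - b[None, :]) ** 2 / (2 * ell ** 2))
grid = np.linspace(0.0, 1.0, 2001)                 # dense grid standing in for the measure
f = lambda x: np.exp(-30 * (x - 0.6) ** 2)         # nonnegative toy integrand

# Policy: greedily add the point where the GP posterior variance is largest.
X = np.array([0.5])
for _ in range(7):
    K = k(X, X) + 1e-9 * np.eye(len(X))
    var = 1.0 - np.einsum('ij,ji->i', k(grid, X) @ np.linalg.inv(K), k(X, grid))
    X = np.append(X, grid[np.argmax(var)])

# Model: a zero-mean GP on the integrand; the BQ estimate is a weighted sum of
# function values with weights K^{-1} z, where z is the kernel mean at the nodes.
K = k(X, X) + 1e-9 * np.eye(len(X))
z = k(X, grid).mean(axis=1)                         # ≈ ∫ k(x_i, y) dμ(y)
w = np.linalg.solve(K, z)
print("BQ estimate:", float(w @ f(X)), " dense-grid reference:", float(f(grid).mean()))
```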