
    Off-Policy Actor-Critic

    This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action-value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental algorithm with linear time and space complexity that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems. Comment: Full version of the paper, appendix and errata included; Proceedings of the 2012 International Conference on Machine Learning
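    A minimal sketch of the update pattern the abstract describes, assuming linear features and a softmax target policy. The paper's critic uses gradient-TD methods; here a plain importance-weighted TD(lambda) critic stands in to keep the sketch short, and all names, step-sizes, and trace details are illustrative rather than the authors' exact algorithm.

```python
import numpy as np

def softmax_policy(theta, x, n_actions):
    """Target policy pi(a|x): softmax over linear action preferences."""
    prefs = theta.reshape(n_actions, -1) @ x
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def off_pac_step(x, a, r, x_next, b_prob, theta, w, e_v, e_u,
                 alpha_v=0.01, alpha_u=0.001, gamma=0.99, lam=0.4):
    """One incremental off-policy actor-critic update from a behaviour-policy sample.

    x, x_next : feature vectors for the current/next state
    a         : action taken by the behaviour policy
    b_prob    : behaviour policy's probability of that action
    """
    n_actions = theta.size // x.size
    pi = softmax_policy(theta, x, n_actions)
    rho = pi[a] / b_prob                          # importance-sampling ratio

    # Critic: importance-weighted TD(lambda) on a linear value function
    # (a simplification of the paper's gradient-TD critic).
    delta = r + gamma * (w @ x_next) - w @ x
    e_v = rho * (gamma * lam * e_v + x)
    w = w + alpha_v * delta * e_v

    # Actor: eligibility trace on the log-policy gradient, weighted by rho.
    grad_log = np.zeros_like(theta).reshape(n_actions, -1)
    grad_log[a] = x
    grad_log -= np.outer(pi, x)
    e_u = rho * (gamma * lam * e_u + grad_log.ravel())
    theta = theta + alpha_u * delta * e_u

    return theta, w, e_v, e_u
```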

    Step-size Optimization for Continual Learning

    In continual learning, a learner has to keep learning from data throughout its lifetime. A key issue is deciding which knowledge to keep and which to let go. In a neural network, this can be implemented with a step-size vector that scales how much each gradient sample changes the network weights. Common algorithms, like RMSProp and Adam, use heuristics, specifically normalization, to adapt this step-size vector. In this paper, we show that those heuristics ignore the effect of their adaptation on the overall objective function, for example by moving the step-size vector away from better step-size vectors. In contrast, stochastic meta-gradient descent algorithms, like IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to the overall objective function. On simple problems, we show that IDBD consistently improves step-size vectors, whereas RMSProp and Adam do not. We explain the differences between the two approaches and their respective limitations. We conclude by suggesting that combining both approaches could be a promising future direction for improving the performance of neural networks in continual learning.
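    For concreteness, a minimal sketch of the classic IDBD rule (Sutton, 1992) in the linear (LMS) setting it was introduced for. The meta step-size name meta_rate and its default value are illustrative; the paper's neural-network setting is not covered here.

```python
import numpy as np

def idbd_update(w, beta, h, x, y, meta_rate=0.01):
    """One IDBD step for linear regression.

    w    : weight vector
    beta : log step-sizes (the per-weight step-size is exp(beta))
    h    : memory trace used to estimate the meta-gradient
    x, y : input features and target for this sample
    """
    delta = y - w @ x                                # prediction error
    beta = beta + meta_rate * delta * x * h          # meta-gradient step on the log step-sizes
    alpha = np.exp(beta)                             # per-weight step-sizes
    w = w + alpha * delta * x                        # LMS step, scaled per weight
    h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
    return w, beta, h
```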

    Model-Free Reinforcement Learning with Continuous Actions (Apprentissage par Renforcement sans Modèle et avec Action Continue)

    Reinforcement learning is often considered a potential solution for allowing a robot to adapt in real time to unpredictable changes in its environment; with continuous actions, however, few existing algorithms are usable for such real-time learning. The most effective methods use a parameterized policy, often combined with a likewise parameterized estimate of that policy's value function. The goal of this article is to study such actor-critic methods in order to arrive at a fully specified algorithm that is usable in practice. Our contributions include 1) an extension of policy-gradient optimization algorithms to use eligibility traces, 2) an empirical comparison of the resulting algorithms for continuous actions, and 3) the evaluation of a gradient-scaling technique that can significantly improve performance. Finally, we apply one of these algorithms on a robot with a fast (10 ms) sensorimotor loop. Together, these results constitute an important step toward the design of continuous-action control algorithms that are easy to use in practice.
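    As a rough illustration of the kind of algorithm studied, a sketch of an actor-critic update with eligibility traces and a Gaussian policy over linear features. The gradient-scaling technique mentioned above and the paper's exact formulation are omitted; all parameter names and default values are illustrative.

```python
import numpy as np

def gaussian_ac_step(x, a, r, x_next, w, theta_mu, theta_sigma,
                     e_v, e_mu, e_sigma,
                     alpha_v=0.1, alpha_u=0.01, gamma=0.99, lam=0.7):
    """One actor-critic update with eligibility traces and a Gaussian policy.

    The policy is N(mu, sigma^2) with mu = theta_mu . x and
    sigma = exp(theta_sigma . x), over linear features x.
    """
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)

    # Critic: TD(lambda) with a linear value function.
    delta = r + gamma * (w @ x_next) - w @ x
    e_v = gamma * lam * e_v + x
    w = w + alpha_v * delta * e_v

    # Actor: traces on the log-policy gradients for the mean and std parameters.
    grad_mu = (a - mu) / (sigma ** 2) * x
    grad_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x
    e_mu = gamma * lam * e_mu + grad_mu
    e_sigma = gamma * lam * e_sigma + grad_sigma
    theta_mu = theta_mu + alpha_u * delta * e_mu
    theta_sigma = theta_sigma + alpha_u * delta * e_sigma

    return w, theta_mu, theta_sigma, e_v, e_mu, e_sigma
```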

    Meta-descent for Online, Continual Prediction

    This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second-order update: a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second-order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but that in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot. Comment: AAAI Conference on Artificial Intelligence 2019. v2: Correction to Baird's counterexample. A bug in the code led to results being reported for AMSGrad in this experiment, when they were actually results for Ada
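    For contrast with the meta-descent family (see the IDBD sketch above), a small sketch of the normalization-based, quasi-second-order family the paper compares against, in the style of RMSProp: a running average of squared gradients yields an implicit vector of per-weight step-sizes. AdaGain's own derivation is in the paper and is not reproduced here.

```python
import numpy as np

def rmsprop_update(w, v, grad, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp-style per-weight scaling: keep a running average of squared
    gradients and divide the step by its square root, giving an implicit
    vector of step-sizes lr / (sqrt(v) + eps)."""
    v = decay * v + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v
```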

    Learning the structure of Factored Markov Decision Processes in reinforcement learning problems

    Recent decision-theoretic planning algorithms are able to find optimal solutions in large problems by using Factored Markov Decision Processes (FMDPs). However, these algorithms need perfect knowledge of the structure of the problem. In this paper, we propose SDYNA, a general framework for addressing large reinforcement learning problems by trial and error and with no initial knowledge of their structure. SDYNA integrates incremental planning algorithms based on FMDPs with supervised learning techniques that build structured representations of the problem. We describe SPITI, an instantiation of SDYNA that uses incremental decision tree induction to learn the structure of a problem, combined with an incremental version of the Structured Value Iteration algorithm. We show that SPITI can build a factored representation of a reinforcement learning problem and can improve the policy faster than tabular reinforcement learning algorithms by exploiting the generalization property of decision tree induction algorithms.
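    A hedged skeleton of the trial-and-error loop the abstract describes: act, update a structured model of the transitions and rewards with supervised learning, then replan incrementally from that model. The env, model, and planner interfaces below are hypothetical placeholders, not the SPITI implementation.

```python
def sdyna_loop(env, policy, model, planner, n_steps):
    """Dyna-style loop in the spirit of SDYNA (interfaces are placeholders).

    model   : supervised structure learner (e.g. incremental decision trees)
    planner : incremental structured planner over the learned FMDP
    """
    s = env.reset()
    for _ in range(n_steps):
        a = policy.act(s)                        # acting
        s_next, r, done = env.step(a)
        model.update(s, a, r, s_next)            # supervised structure learning
        policy = planner.improve(policy, model)  # incremental planning on the FMDP
        s = env.reset() if done else s_next
    return policy
```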

    Deterministic Policy Gradient Algorithms

    In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
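    The "particularly appealing form" the abstract refers to, written roughly in the paper's notation for a deterministic policy mu_theta and its action-value function Q^mu:

```latex
% Deterministic policy gradient: the expected gradient of the
% action-value function, evaluated at the deterministic action.
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}
    \right]
```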

    Evolving a Neural Model of Insect Path Integration

    Path integration is an important navigation strategy in many animal species. We use a genetic algorithm to evolve a novel neural model of path integration, based on input from cells that encode the heading of the agent in a manner comparable to the polarization-sensitive interneurons found in insects. The home vector is encoded as a population code across a circular array of cells that integrate this input. This code can be used to control return to the home position. We demonstrate the capabilities of the network under noisy conditions in simulation and on a robot
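    A toy numeric sketch of the coding scheme in the abstract: a home vector accumulated as a population code over a ring of direction-tuned cells, with the return heading decoded as the population vector. This illustrates the idea only; it is not the evolved network.

```python
import numpy as np

def integrate_path(headings, speeds, n_cells=8):
    """Accumulate a home vector across a circular array of heading-tuned cells.

    headings, speeds : per-step heading (radians) and speed of the agent
    """
    preferred = np.linspace(0.0, 2.0 * np.pi, n_cells, endpoint=False)
    activity = np.zeros(n_cells)
    for heading, speed in zip(headings, speeds):
        # Each cell integrates movement along its preferred direction.
        activity += speed * np.cos(heading - preferred)
    # Decode the outbound displacement as the population vector,
    # then point back toward the start.
    x = activity @ np.cos(preferred)
    y = activity @ np.sin(preferred)
    home_angle = np.arctan2(-y, -x)
    return activity, home_angle
```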