
    Off-Policy Actor-Critic

    This paper presents the first actor-critic algorithm for off-policy reinforcement learning. Our algorithm is online and incremental, and its per-time-step complexity scales linearly with the number of learned weights. Previous work on actor-critic algorithms is limited to the on-policy setting and does not take advantage of the recent advances in off-policy gradient temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable a target policy to be learned while following and obtaining data from another (behavior) policy. For many problems, however, actor-critic methods are more practical than action-value methods (like Greedy-GQ) because they explicitly represent the policy; consequently, the policy can be stochastic and utilize a large action space. In this paper, we illustrate how to practically combine the generality and learning potential of off-policy learning with the flexibility in action selection given by actor-critic methods. We derive an incremental algorithm with linear time and space complexity that includes eligibility traces, prove convergence under assumptions similar to previous off-policy algorithms, and empirically show better or comparable performance to existing algorithms on standard reinforcement-learning benchmark problems. Comment: Full version of the paper, appendix and errata included; Proceedings of the 2012 International Conference on Machine Learning
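    A minimal sketch of the update pattern the abstract describes, assuming linear features and a softmax target policy. The paper's critic uses gradient-TD methods; here a plain importance-weighted TD(lambda) critic stands in to keep the sketch short, and all names, step-sizes, and trace details are illustrative rather than the authors' exact algorithm.

```python
import numpy as np

def softmax_policy(theta, x, n_actions):
    """Target policy pi(a|x): softmax over linear action preferences."""
    prefs = theta.reshape(n_actions, -1) @ x
    prefs -= prefs.max()
    p = np.exp(prefs)
    return p / p.sum()

def off_pac_step(x, a, r, x_next, b_prob, theta, w, e_v, e_u,
                 alpha_v=0.01, alpha_u=0.001, gamma=0.99, lam=0.4):
    """One incremental off-policy actor-critic update from a behaviour-policy sample.

    x, x_next : feature vectors for the current/next state
    a         : action taken by the behaviour policy
    b_prob    : behaviour policy's probability of that action
    """
    n_actions = theta.size // x.size
    pi = softmax_policy(theta, x, n_actions)
    rho = pi[a] / b_prob                          # importance-sampling ratio

    # Critic: importance-weighted TD(lambda) on a linear value function
    # (a simplification of the paper's gradient-TD critic).
    delta = r + gamma * (w @ x_next) - w @ x
    e_v = rho * (gamma * lam * e_v + x)
    w = w + alpha_v * delta * e_v

    # Actor: eligibility trace on the log-policy gradient, weighted by rho.
    grad_log = np.zeros_like(theta).reshape(n_actions, -1)
    grad_log[a] = x
    grad_log -= np.outer(pi, x)
    e_u = rho * (gamma * lam * e_u + grad_log.ravel())
    theta = theta + alpha_u * delta * e_u

    return theta, w, e_v, e_u
```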

    Step-size Optimization for Continual Learning

    In continual learning, a learner has to keep learning from data throughout its lifetime. A key issue is deciding which knowledge to keep and which to let go. In a neural network, this can be implemented with a step-size vector that scales how much each gradient sample changes the network weights. Common algorithms, like RMSProp and Adam, use heuristics, specifically normalization, to adapt this step-size vector. In this paper, we show that those heuristics ignore the effect of their adaptation on the overall objective function, for example by moving the step-size vector away from better step-size vectors. In contrast, stochastic meta-gradient descent algorithms, like IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to the overall objective function. On simple problems, we show that IDBD consistently improves step-size vectors, whereas RMSProp and Adam do not. We explain the differences between the two approaches and their respective limitations. We conclude by suggesting that combining both approaches could be a promising future direction for improving the performance of neural networks in continual learning.
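    For concreteness, a minimal sketch of the classic IDBD rule (Sutton, 1992) in the linear (LMS) setting it was introduced for. The meta step-size name meta_rate and its default value are illustrative; the paper's neural-network setting is not covered here.

```python
import numpy as np

def idbd_update(w, beta, h, x, y, meta_rate=0.01):
    """One IDBD step for linear regression.

    w    : weight vector
    beta : log step-sizes (the per-weight step-size is exp(beta))
    h    : memory trace used to estimate the meta-gradient
    x, y : input features and target for this sample
    """
    delta = y - w @ x                                # prediction error
    beta = beta + meta_rate * delta * x * h          # meta-gradient step on the log step-sizes
    alpha = np.exp(beta)                             # per-weight step-sizes
    w = w + alpha * delta * x                        # LMS step, scaled per weight
    h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
    return w, beta, h
```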

    Model-Free Reinforcement Learning with Continuous Actions (Apprentissage par Renforcement sans Modèle et avec Action Continue)

    Reinforcement learning is often considered a potential solution for allowing a robot to adapt in real time to unpredictable changes in its environment; with continuous actions, however, few existing algorithms are usable for such real-time learning. The most effective methods use a parameterized policy, often combined with a likewise parameterized estimate of that policy's value function. The goal of this article is to study such actor-critic methods in order to arrive at a fully specified algorithm that is usable in practice. Our contributions include 1) an extension of policy-gradient optimization algorithms to use eligibility traces, 2) an empirical comparison of the resulting algorithms for continuous actions, and 3) the evaluation of a gradient-scaling technique that can significantly improve performance. Finally, we apply one of these algorithms on a robot with a fast (10 ms) sensorimotor loop. Together, these results constitute an important step toward the design of continuous-action control algorithms that are easy to use in practice.
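    As a rough illustration of the kind of algorithm studied, a sketch of an actor-critic update with eligibility traces and a Gaussian policy over linear features. The gradient-scaling technique mentioned above and the paper's exact formulation are omitted; all parameter names and default values are illustrative.

```python
import numpy as np

def gaussian_ac_step(x, a, r, x_next, w, theta_mu, theta_sigma,
                     e_v, e_mu, e_sigma,
                     alpha_v=0.1, alpha_u=0.01, gamma=0.99, lam=0.7):
    """One actor-critic update with eligibility traces and a Gaussian policy.

    The policy is N(mu, sigma^2) with mu = theta_mu . x and
    sigma = exp(theta_sigma . x), over linear features x.
    """
    mu = theta_mu @ x
    sigma = np.exp(theta_sigma @ x)

    # Critic: TD(lambda) with a linear value function.
    delta = r + gamma * (w @ x_next) - w @ x
    e_v = gamma * lam * e_v + x
    w = w + alpha_v * delta * e_v

    # Actor: traces on the log-policy gradients for the mean and std parameters.
    grad_mu = (a - mu) / (sigma ** 2) * x
    grad_sigma = ((a - mu) ** 2 / sigma ** 2 - 1.0) * x
    e_mu = gamma * lam * e_mu + grad_mu
    e_sigma = gamma * lam * e_sigma + grad_sigma
    theta_mu = theta_mu + alpha_u * delta * e_mu
    theta_sigma = theta_sigma + alpha_u * delta * e_sigma

    return w, theta_mu, theta_sigma, e_v, e_mu, e_sigma
```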

    Meta-descent for Online, Continual Prediction

    This paper investigates different vector step-size adaptation approaches for non-stationary online, continual prediction problems. Vanilla stochastic gradient descent can be considerably improved by scaling the update with a vector of appropriately chosen step-sizes. Many methods, including AdaGrad, RMSProp, and AMSGrad, keep statistics about the learning process to approximate a second-order update: a vector approximation of the inverse Hessian. Another family of approaches uses meta-gradient descent to adapt the step-size parameters to minimize prediction error. These meta-descent strategies are promising for non-stationary problems, but have not been as extensively explored as quasi-second-order methods. We first derive a general, incremental meta-descent algorithm, called AdaGain, designed to be applicable to a much broader range of algorithms, including those with semi-gradient updates or even those with accelerations, such as RMSProp. We provide an empirical comparison of methods from both families. We conclude that methods from both families can perform well, but that in non-stationary prediction problems the meta-descent methods exhibit advantages. Our method is particularly robust across several prediction problems, and is competitive with the state-of-the-art method on a large-scale, time-series prediction problem on real data from a mobile robot. Comment: AAAI Conference on Artificial Intelligence 2019. v2: Correction to Baird's counterexample. A bug in the code led to results being reported for AMSGrad in this experiment, when they were actually results for Ada
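    For contrast with the meta-descent family (see the IDBD sketch above), a small sketch of the normalization-based, quasi-second-order family the paper compares against, in the style of RMSProp: a running average of squared gradients yields an implicit vector of per-weight step-sizes. AdaGain's own derivation is in the paper and is not reproduced here.

```python
import numpy as np

def rmsprop_update(w, v, grad, lr=0.001, decay=0.9, eps=1e-8):
    """RMSProp-style per-weight scaling: keep a running average of squared
    gradients and divide the step by its square root, giving an implicit
    vector of step-sizes lr / (sqrt(v) + eps)."""
    v = decay * v + (1.0 - decay) * grad ** 2
    w = w - lr * grad / (np.sqrt(v) + eps)
    return w, v
```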

    Learning the structure of Factored Markov Decision Processes in reinforcement learning problems

    Recent decision-theoretic planning algorithms are able to find optimal solutions in large problems by using Factored Markov Decision Processes (FMDPs). However, these algorithms need perfect knowledge of the structure of the problem. In this paper, we propose SDYNA, a general framework for addressing large reinforcement learning problems by trial and error and with no initial knowledge of their structure. SDYNA integrates incremental planning algorithms based on FMDPs with supervised learning techniques that build structured representations of the problem. We describe SPITI, an instantiation of SDYNA that uses incremental decision tree induction to learn the structure of a problem, combined with an incremental version of the Structured Value Iteration algorithm. We show that SPITI can build a factored representation of a reinforcement learning problem and can improve the policy faster than tabular reinforcement learning algorithms by exploiting the generalization property of decision tree induction algorithms.
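    A hedged skeleton of the trial-and-error loop the abstract describes: act, update a structured model of the transitions and rewards with supervised learning, then replan incrementally from that model. The env, model, and planner interfaces below are hypothetical placeholders, not the SPITI implementation.

```python
def sdyna_loop(env, policy, model, planner, n_steps):
    """Dyna-style loop in the spirit of SDYNA (interfaces are placeholders).

    model   : supervised structure learner (e.g. incremental decision trees)
    planner : incremental structured planner over the learned FMDP
    """
    s = env.reset()
    for _ in range(n_steps):
        a = policy.act(s)                        # acting
        s_next, r, done = env.step(a)
        model.update(s, a, r, s_next)            # supervised structure learning
        policy = planner.improve(policy, model)  # incremental planning on the FMDP
        s = env.reset() if done else s_next
    return policy
```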

    Deterministic Policy Gradient Algorithms

    In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
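    The "particularly appealing form" the abstract refers to, written roughly in the paper's notation for a deterministic policy mu_theta and its action-value function Q^mu:

```latex
% Deterministic policy gradient: the expected gradient of the
% action-value function, evaluated at the deterministic action.
\nabla_\theta J(\mu_\theta)
  = \mathbb{E}_{s \sim \rho^{\mu}}\!\left[
      \nabla_\theta \mu_\theta(s)\,
      \nabla_a Q^{\mu}(s,a)\big|_{a=\mu_\theta(s)}
    \right]
```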

    Evolving a Neural Model of Insect Path Integration

    Path integration is an important navigation strategy in many animal species. We use a genetic algorithm to evolve a novel neural model of path integration, based on input from cells that encode the heading of the agent in a manner comparable to the polarization-sensitive interneurons found in insects. The home vector is encoded as a population code across a circular array of cells that integrate this input. This code can be used to control return to the home position. We demonstrate the capabilities of the network under noisy conditions in simulation and on a robot
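    A toy numeric sketch of the coding scheme in the abstract: a home vector accumulated as a population code over a ring of direction-tuned cells, with the return heading decoded as the population vector. This illustrates the idea only; it is not the evolved network.

```python
import numpy as np

def integrate_path(headings, speeds, n_cells=8):
    """Accumulate a home vector across a circular array of heading-tuned cells.

    headings, speeds : per-step heading (radians) and speed of the agent
    """
    preferred = np.linspace(0.0, 2.0 * np.pi, n_cells, endpoint=False)
    activity = np.zeros(n_cells)
    for heading, speed in zip(headings, speeds):
        # Each cell integrates movement along its preferred direction.
        activity += speed * np.cos(heading - preferred)
    # Decode the outbound displacement as the population vector,
    # then point back toward the start.
    x = activity @ np.cos(preferred)
    y = activity @ np.sin(preferred)
    home_angle = np.arctan2(-y, -x)
    return activity, home_angle
```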