Off-Policy Actor-Critic
This paper presents the first actor-critic algorithm for off-policy
reinforcement learning. Our algorithm is online and incremental, and its
per-time-step complexity scales linearly with the number of learned weights.
Previous work on actor-critic algorithms is limited to the on-policy setting
and does not take advantage of the recent advances in off-policy gradient
temporal-difference learning. Off-policy techniques, such as Greedy-GQ, enable
a target policy to be learned while following and obtaining data from another
(behavior) policy. For many problems, however, actor-critic methods are more
practical than action value methods (like Greedy-GQ) because they explicitly
represent the policy; consequently, the policy can be stochastic and utilize a
large action space. In this paper, we illustrate how to practically combine the
generality and learning potential of off-policy learning with the flexibility
in action selection given by actor-critic methods. We derive an incremental,
linear time and space complexity algorithm that includes eligibility traces,
prove convergence under assumptions similar to previous off-policy algorithms,
and empirically show better or comparable performance to existing algorithms on
standard reinforcement-learning benchmark problems.
Comment: Full version of the paper, appendix and errata included; Proceedings of the 2012 International Conference on Machine Learning.
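The key mechanism the abstract describes is re-weighting samples drawn from the behavior policy toward the target policy via an importance-sampling ratio, while the critic carries an eligibility trace. A minimal, hypothetical sketch of one such update for a linear critic (variable names and step-sizes are illustrative, not taken from the paper):

```python
import numpy as np

def offpac_step(w, theta, e_w, x, x_next, reward, pi_prob, b_prob,
                grad_log_pi, alpha_w=0.01, alpha_theta=0.001,
                gamma=0.99, lam=0.9):
    """One illustrative off-policy actor-critic update with a linear critic.

    rho re-weights the behavior-policy sample toward the target policy,
    delta is the TD error, and e_w is the critic's eligibility trace.
    """
    rho = pi_prob / b_prob                        # importance-sampling ratio
    delta = reward + gamma * (w @ x_next) - w @ x # TD error
    e_w = rho * (gamma * lam * e_w + x)           # accumulating trace
    w = w + alpha_w * delta * e_w                 # critic update
    theta = theta + alpha_theta * rho * delta * grad_log_pi  # actor update
    return w, theta, e_w
```

Because both updates touch only vectors the size of the weight vectors, the per-step cost stays linear in the number of learned weights, matching the complexity claim above.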
Step-size Optimization for Continual Learning
In continual learning, a learner has to keep learning from the data over its
whole lifetime. A key issue is to decide what knowledge to keep and what
knowledge to let go. In a neural network, this can be implemented by using a
step-size vector to scale how much gradient samples change network weights.
Common algorithms, like RMSProp and Adam, use heuristics, specifically
normalization, to adapt this step-size vector. In this paper, we show that
those heuristics ignore the effect of their adaptation on the overall objective
function, for example by moving the step-size vector away from better step-size
vectors. On the other hand, stochastic meta-gradient descent algorithms, like
IDBD (Sutton, 1992), explicitly optimize the step-size vector with respect to
the overall objective function. On simple problems, we show that IDBD is able
to consistently improve step-size vectors, where RMSProp and Adam do not. We
explain the differences between the two approaches and their respective
limitations. We conclude by suggesting that combining both approaches could be
a promising future direction to improve the performance of neural networks in
continual learning.
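The IDBD update the abstract refers to can be sketched compactly for a linear predictor: each weight carries a log step-size beta that is itself adjusted by a meta-gradient, using a per-weight memory trace h. This is a minimal sketch of the standard IDBD rule, with an assumed meta step-size parameter:

```python
import numpy as np

def idbd_step(w, beta, h, x, target, meta=0.01):
    """One IDBD update (Sutton, 1992): meta-gradient descent on
    per-weight log step-sizes beta, so the step-size is exp(beta)."""
    delta = target - w @ x                 # prediction error
    beta = beta + meta * delta * x * h     # meta-gradient step on log step-sizes
    alpha = np.exp(beta)                   # per-weight step-sizes
    w = w + alpha * delta * x              # SGD step, scaled per weight
    h = h * np.clip(1.0 - alpha * x * x, 0.0, None) + alpha * delta * x
    return w, beta, h
```

Unlike the normalization heuristics of RMSProp and Adam, the beta update here directly follows the gradient of the prediction error with respect to the step-sizes, which is the distinction the paper draws.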
Model-Free Reinforcement Learning with Continuous Actions
Reinforcement learning is often considered a potential solution for allowing a robot to adapt in real time to unpredictable changes in its environment; but with continuous actions, few existing algorithms are usable for such real-time learning. The most effective methods use a parameterized policy, often combined with a likewise parameterized estimate of that policy's value function. The goal of this article is to study such actor-critic methods in order to assemble a fully specified algorithm that is usable in practice. Our contributions include 1) an extension of policy-gradient optimization algorithms to the use of eligibility traces, 2) an empirical comparison of the resulting algorithms for continuous actions, and 3) the evaluation of a gradient-scaling technique that can significantly improve performance. Finally, we apply one of these algorithms to a robot with a fast (10 ms) sensorimotor loop. Together, these results constitute an important step toward the design of continuous-action control algorithms that are easy to use in practice.
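For a continuous-action actor-critic of the kind discussed here, the quantity accumulated into the actor's eligibility trace is the gradient of the log of a parameterized density, typically Gaussian. A minimal sketch of that gradient for a Gaussian policy (parameterizing the log of the standard deviation, a common but assumed choice):

```python
import numpy as np

def gaussian_logpi_grad(a, mu, log_sigma):
    """Gradient of log N(a; mu, sigma^2) with respect to mu and
    log_sigma -- the per-step quantity a continuous-action
    actor-critic feeds into its actor eligibility trace."""
    sigma = np.exp(log_sigma)
    g_mu = (a - mu) / sigma**2                    # pushes mu toward good actions
    g_log_sigma = (a - mu)**2 / sigma**2 - 1.0    # widens/narrows exploration
    return g_mu, g_log_sigma
```

The gradient-scaling issue mentioned in the contributions arises because these two components can have very different magnitudes, so an unscaled step-size that suits one can destabilize the other.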
Meta-descent for Online, Continual Prediction
This paper investigates different vector step-size adaptation approaches for
non-stationary online, continual prediction problems. Vanilla stochastic
gradient descent can be considerably improved by scaling the update with a
vector of appropriately chosen step-sizes. Many methods, including AdaGrad,
RMSProp, and AMSGrad, keep statistics about the learning process to approximate
a second order update---a vector approximation of the inverse Hessian. Another
family of approaches use meta-gradient descent to adapt the step-size
parameters to minimize prediction error. These meta-descent strategies are
promising for non-stationary problems, but have not been as extensively
explored as quasi-second order methods. We first derive a general, incremental
meta-descent algorithm, called AdaGain, designed to be applicable to a much
broader range of algorithms, including those with semi-gradient updates or even
those with accelerations, such as RMSProp. We provide an empirical comparison
of methods from both families. We conclude that methods from both families can
perform well, but in non-stationary prediction problems the meta-descent
methods exhibit advantages. Our method is particularly robust across several
prediction problems, and is competitive with the state-of-the-art method on a
large-scale, time-series prediction problem on real data from a mobile robot.
Comment: AAAI Conference on Artificial Intelligence 2019. v2: Correction to Baird's counterexample. A bug in the code led to results being reported for AMSGrad in this experiment, when they were actually results for Ada
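The "quasi-second order" family the abstract contrasts with meta-descent keeps running statistics of the gradient and uses them to normalize each weight's step. RMSProp is the simplest instance; a minimal sketch (hyperparameter values are illustrative):

```python
import numpy as np

def rmsprop_step(w, v, grad, lr=0.01, decay=0.9, eps=1e-8):
    """Quasi-second-order vector step-sizes: RMSProp divides each
    weight's update by a running estimate of the gradient magnitude.
    This is a normalization heuristic, not a meta-gradient on the
    prediction objective -- the distinction the paper highlights."""
    v = decay * v + (1.0 - decay) * grad**2     # running second-moment estimate
    w = w - lr * grad / (np.sqrt(v) + eps)      # per-weight normalized step
    return w, v
```

Meta-descent methods like AdaGain instead treat the vector of step-sizes itself as a learned parameter, which is why they can track non-stationary targets that a fixed normalization scheme follows only sluggishly.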
Learning the structure of Factored Markov Decision Processes in reinforcement learning problems
Recent decision-theoretic planning algorithms are able to find optimal solutions in large problems, using Factored Markov Decision Processes (FMDPs). However, these algorithms need a perfect knowledge of the structure of the problem. In this paper, we propose SDYNA, a general framework for addressing large reinforcement learning problems by trial-and-error and with no initial knowledge of their structure. SDYNA integrates incremental planning algorithms based on FMDPs with supervised learning techniques building structured representations of the problem. We describe SPITI, an instantiation of SDYNA, that uses incremental decision tree induction to learn the structure of a problem combined with an incremental version of the Structured Value Iteration algorithm. We show that SPITI can build a factored representation of a reinforcement learning problem and may improve the policy faster than tabular reinforcement learning algorithms by exploiting the generalization property of decision tree induction algorithms.
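The factored representation at the heart of FMDPs stores one small conditional distribution per state variable instead of one table over the whole joint state space. A toy sketch of how such a model evaluates a transition probability, with each per-variable distribution held in a nested-dict "tree" (the data layout is illustrative, not SPITI's actual structure):

```python
def factored_next_state_prob(trees, state, next_state):
    """Probability of next_state under a factored model: the product
    over variables of P(x_i' | relevant parents). Each variable's
    distribution is a small tree, here a nested dict: internal nodes
    carry a "split_on" key naming the parent variable they test;
    leaves map next-values to probabilities."""
    p = 1.0
    for var, tree in trees.items():
        node = tree
        while isinstance(node, dict) and "split_on" in node:
            node = node["children"][state[node["split_on"]]]  # descend tree
        p *= node.get(next_state[var], 0.0)  # leaf: distribution over x_i'
    return p
```

Because each tree only tests the parents that matter, the model's size grows with the problem's structure rather than with the exponential joint state space, which is what lets SPITI-style learners generalize faster than tabular methods.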
Deterministic Policy Gradient Algorithms
In this paper we consider deterministic policy gradient algorithms for reinforcement learning with continuous actions. The deterministic policy gradient has a particularly appealing form: it is the expected gradient of the action-value function. This simple form means that the deterministic policy gradient can be estimated much more efficiently than the usual stochastic policy gradient. To ensure adequate exploration, we introduce an off-policy actor-critic algorithm that learns a deterministic target policy from an exploratory behaviour policy. We demonstrate that deterministic policy gradient algorithms can significantly outperform their stochastic counterparts in high-dimensional action spaces.
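The "expected gradient of the action-value function" form can be sketched directly: for a policy mu_theta, the gradient of the objective is the average over states of (d mu/d theta) times (dQ/da) evaluated at a = mu_theta(s). A minimal sketch for a linear scalar-action policy, assuming the critic's action-gradient is available as a function:

```python
import numpy as np

def dpg_gradient(theta, states, grad_q_a, features):
    """Deterministic policy gradient for a linear policy
    mu(s) = theta @ x(s): average over states of
    x(s) * dQ/da evaluated at a = mu(s). grad_q_a(s, a) is
    assumed supplied, e.g. by a learned critic."""
    g = np.zeros_like(theta)
    for s in states:
        x = features(s)
        a = theta @ x                 # deterministic action
        g += x * grad_q_a(s, a)       # chain rule: dmu/dtheta * dQ/da
    return g / len(states)
```

No integral over actions appears anywhere, which is the efficiency advantage the abstract claims over the stochastic policy gradient.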
Evolving a Neural Model of Insect Path Integration
Path integration is an important navigation strategy in many animal species. We use a genetic algorithm to evolve a novel neural model of path integration, based on input from cells that encode the heading of the agent in a manner comparable to the polarization-sensitive interneurons found in insects. The home vector is encoded as a population code across a circular array of cells that integrate this input. This code can be used to control return to the home position. We demonstrate the capabilities of the network under noisy conditions in simulation and on a robot.
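The population-code idea can be illustrated with a toy integrator: a ring of heading-tuned cells each accumulates the agent's speed weighted by the cosine of the angle between the current heading and the cell's preferred direction, and a population-vector readout recovers the travelled displacement. This is a sketch of the general principle, not the evolved network from the paper:

```python
import numpy as np

def integrate_path(headings, speeds, n_cells=8):
    """Toy ring-attractor-style path integrator: each of n_cells
    accumulates speed * cos(heading - preferred_direction); a
    population-vector decode then yields the net displacement,
    whose negation points back home."""
    prefs = np.linspace(0.0, 2.0 * np.pi, n_cells, endpoint=False)
    acc = np.zeros(n_cells)
    for h, v in zip(headings, speeds):
        acc += v * np.cos(h - prefs)          # heading-tuned input per cell
    # population-vector readout of the accumulated displacement
    dx = (2.0 / n_cells) * (acc @ np.cos(prefs))
    dy = (2.0 / n_cells) * (acc @ np.sin(prefs))
    return dx, dy
```

With cosine tuning the decode is exact, which is why even a small circular array of cells suffices to carry the home vector.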