Self-organizing developmental reinforcement learning
This paper presents a developmental reinforcement learning framework aimed at exploring rich, complex, and large sensorimotor spaces. The core of this architecture is a function approximator based on a Dynamic Self-Organizing Map (DSOM). The life-long online learning property of the DSOM allows us to take a developmental approach to learning a robotic task: the perception and motor skills of the robot can grow in richness and complexity during learning. This architecture is tested on a robotic task that looks simple but is still challenging for reinforcement learning.
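As an illustrative sketch only (the function name and parameters are assumptions, not taken from the paper), a DSOM update in the style of Rougier and Boniface's dynamic SOM scales its neighbourhood by the current quantization error, which is what keeps learning "life-long": the map keeps adapting whenever inputs are poorly represented, instead of freezing as a classic SOM's schedule decays.

```python
import numpy as np

def dsom_update(weights, grid, x, elasticity=1.0, lr=0.1):
    """One online DSOM update for input x.

    weights: (n, d) codebook vectors; grid: (n, 2) fixed node positions.
    Unlike a classic SOM, the neighbourhood width scales with the current
    quantization error, so adaptation never stops (life-long learning).
    """
    dists = np.linalg.norm(weights - x, axis=1)
    win = int(np.argmin(dists))                 # best-matching unit
    err = dists[win]                            # quantization error
    if err == 0.0:
        return weights                          # input already represented
    grid_d2 = np.sum((grid - grid[win]) ** 2, axis=1)
    h = np.exp(-grid_d2 / (elasticity ** 2 * err ** 2))  # error-scaled neighbourhood
    return weights + lr * dists[:, None] * h[:, None] * (x - weights)
```

When the input is far from every prototype, `err` is large, the neighbourhood widens, and many units move; when the map already fits the data, updates vanish.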
Empowerment for Continuous Agent-Environment Systems
This paper develops generalizations of empowerment to continuous states.
Empowerment is a recently introduced information-theoretic quantity motivated
by hypotheses about the efficiency of the sensorimotor loop in biological
organisms, but also from considerations stemming from curiosity-driven
learning. Empowemerment measures, for agent-environment systems with stochastic
transitions, how much influence an agent has on its environment, but only that
influence that can be sensed by the agent sensors. It is an
information-theoretic generalization of joint controllability (influence on
environment) and observability (measurement by sensors) of the environment by
the agent, both controllability and observability being usually defined in
control theory as the dimensionality of the control/observation spaces. Earlier
work has shown that empowerment has various interesting and relevant
properties, e.g., it allows us to identify salient states using only the
dynamics, and it can act as intrinsic reward without requiring an external
reward. However, in this previous work empowerment was limited to the case of
small-scale and discrete domains and furthermore state transition probabilities
were assumed to be known. The goal of this paper is to extend empowerment to
the significantly more important and relevant case of continuous vector-valued
state spaces and initially unknown state transition probabilities. The
continuous state space is addressed by Monte-Carlo approximation; the unknown
transitions are addressed by model learning and prediction for which we apply
Gaussian processes regression with iterated forecasting. In a number of
well-known continuous control tasks we examine the dynamics induced by
empowerment and include an application to exploration and online model
learning.
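Empowerment is the channel capacity of the channel from action sequences to sensed successor states. A discrete stand-in for the paper's continuous Monte-Carlo/Gaussian-process setting is the classical Blahut-Arimoto iteration (the code below is a generic capacity solver, not the paper's estimator; in the continuous case each row of the channel matrix would itself be estimated from model rollouts):

```python
import numpy as np

def channel_capacity(p_y_given_x, iters=200):
    """Blahut-Arimoto channel capacity (in nats) of rows p(y|x).

    For empowerment: x ranges over action sequences, y over sensed
    successor states; capacity = max_q(x) I(X; Y).
    """
    n_x = p_y_given_x.shape[0]
    q = np.full(n_x, 1.0 / n_x)          # distribution over action sequences
    eps = 1e-12                          # guards log of zero entries
    for _ in range(iters):
        r = q @ p_y_given_x              # marginal over outcomes
        # per-action KL divergence D(p(y|x) || r)
        d = np.sum(p_y_given_x * np.log((p_y_given_x + eps) / (r + eps)), axis=1)
        q = q * np.exp(d)                # exponentiated-gradient reweighting
        q /= q.sum()
    r = q @ p_y_given_x
    d = np.sum(p_y_given_x * np.log((p_y_given_x + eps) / (r + eps)), axis=1)
    return float(q @ d)
```

A deterministic, fully distinguishable channel over two outcomes has capacity log 2; a channel whose outcome is independent of the action has capacity 0, i.e., zero empowerment.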
Revisiting natural actor-critics with value function approximation
Reinforcement learning (RL) is generally considered the machine learning answer to the optimal control problem. In this paradigm, an agent learns to optimally control a dynamic system through interactions. At each time step i, the dynamic system is in a given state si and receives from the agent a command (or action) ai. According to its own dynamics, the system transitions to a new state si+1, and a reward ri is given to the agent. The objective is to learn a control policy maximizing the expected cumulative discounted reward. Actor-critic approaches were among the first proposed for handling the RL problem [1]. In this setting, two structures are maintained: one for the actor (the control organ) and one for the critic (the value function, which models the expected cumulative reward to be maximized). One advantage of such an approach is that it does not require knowledge of the system dynamics to learn an optimal policy. However, the introduction of the state-action value (or Q-) function [6] led the research community to focus on pure critic methods, in which the control policy is derived from the Q-function and no longer has a specific representation. Indeed, in contrast with the value function, the state-action value function allows deriving a greedy policy without knowing the system dynamics, and function approximation (a way to handle large problems) is easier to combine with pure critic approaches. Pure critic algorithms therefore aim at learning this Q-function. However, actor-critics have numerous advantages over pure critics: a separat
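The two-structure idea can be sketched with a minimal tabular actor-critic (this is an illustrative vanilla version, not the paper's natural-gradient variant): the critic maintains a value function updated by the TD error, and the actor maintains its own softmax policy parameters pushed in the direction of that same TD error.

```python
import numpy as np

def actor_critic_step(theta, v, s, a, r, s_next, gamma=0.99, alpha=0.1, beta=0.1):
    """One tabular actor-critic update.

    v: (n_states,) critic value estimates;
    theta: (n_states, n_actions) actor softmax preferences.
    """
    delta = r + gamma * v[s_next] - v[s]      # TD error computed by the critic
    v[s] += alpha * delta                     # critic update
    probs = np.exp(theta[s]) / np.exp(theta[s]).sum()
    grad = -probs
    grad[a] += 1.0                            # gradient of log softmax at action a
    theta[s] += beta * delta * grad           # actor follows the TD error
    return delta
```

Note that neither update touches the system dynamics: the critic learns from sampled transitions and the actor keeps an explicit policy representation, which is exactly the contrast with pure critic (Q-function) methods drawn above.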
MapReduce for Parallel Reinforcement Learning
Abstract. We investigate the parallelization of reinforcement learning algorithms using MapReduce, a popular parallel computing framework. We present parallel versions of several dynamic programming algorithms, including policy evaluation, policy iteration, and off-policy updates. Furthermore, we design parallel reinforcement learning algorithms to deal with large scale problems using linear function approximation, including model-based projection, least squares policy iteration, temporal difference learning and recent gradient temporal difference learning algorithms. We give time and space complexity analysis of the proposed algorithms. This study demonstrates how parallelization opens new avenues for solving large scale reinforcement learning problems.
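The MapReduce decomposition of policy evaluation can be sketched as follows (a sequential, single-machine stand-in for illustration: the map tasks are independent per-state Bellman backups and could be distributed, while the reduce task assembles the new value vector; function names are assumptions, not the paper's):

```python
import numpy as np

def eval_map(state, V, P, R, gamma):
    """Map task: one Bellman backup for a single state (embarrassingly parallel)."""
    return (state, R[state] + gamma * P[state] @ V)

def eval_reduce(pairs, n_states):
    """Reduce task: assemble per-state backups into the next value vector."""
    V_new = np.zeros(n_states)
    for s, v in pairs:
        V_new[s] = v
    return V_new

def policy_evaluation(P, R, gamma=0.9, sweeps=100):
    """Iterate map/reduce sweeps until V approximates (I - gamma P)^-1 R."""
    n = len(R)
    V = np.zeros(n)
    for _ in range(sweeps):
        V = eval_reduce((eval_map(s, V, P, R, gamma) for s in range(n)), n)
    return V
```

Because each map task reads the previous sweep's V and writes only its own state's entry, sweeps parallelize cleanly; the reduce step is a trivial gather.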
Learning Graph-based Representations for Continuous Reinforcement Learning Domains
Abstract. Graph-based domain representations have been used in discrete reinforcement learning domains as a basis for, e.g., autonomous skill discovery and representation learning. These abilities are also highly relevant for learning in domains with structured, continuous state spaces, as they allow complex problems to be decomposed into simpler ones and reduce the burden of hand-engineering features. However, since graphs are inherently discrete structures, the extension of these approaches to continuous domains is not straightforward. We argue that graphs should be seen as discrete, generative models of continuous domains. Based on this intuition, we define the likelihood of a graph for a given set of observed state transitions and derive a heuristic method, entitled FIGE, that learns graph-based representations of continuous domains with high likelihood. Based on FIGE, we present a new skill discovery approach for continuous domains. Furthermore, we show that the learning of representations can be considerably improved by using FIGE.
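To make the "graph as a discrete model of a continuous domain" idea concrete, here is a deliberately crude baseline (a nearest-node assignment, not FIGE itself, whose likelihood-based objective is more refined): each observed continuous state is mapped to its nearest graph node, and observed transitions become weighted edges.

```python
import numpy as np

def transition_graph(transitions, nodes):
    """Build a discrete graph over continuous transitions.

    transitions: iterable of (s, s_next) continuous state pairs;
    nodes: (n, d) array of node positions. Returns (n, n) edge counts.
    """
    n = len(nodes)
    W = np.zeros((n, n))
    for s, s_next in transitions:
        i = int(np.argmin(np.linalg.norm(nodes - s, axis=1)))       # nearest node to s
        j = int(np.argmin(np.linalg.norm(nodes - s_next, axis=1)))  # nearest node to s'
        W[i, j] += 1.0
    return W
```

The paper's point is that hard nearest-node quantization like this discards how well the graph explains the observed transitions; scoring candidate graphs by a likelihood over the same data is what FIGE adds.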
Min max generalization for deterministic batch mode reinforcement learning: relaxation schemes
We study the min max optimization problem introduced in Fonteneau et al. [Towards min max reinforcement learning, ICAART 2010, Springer, Heidelberg, 2011, pp. 61–77] for computing policies for batch mode reinforcement learning in a deterministic setting with a fixed, finite time horizon. First, we show that the min part of this problem is NP-hard. We then provide two relaxation schemes. The first relaxation scheme works by dropping some constraints in order to obtain a problem that is solvable in polynomial time. The second relaxation scheme, based on a Lagrangian relaxation where all constraints are dualized, can also be solved in polynomial time. We also theoretically prove and empirically illustrate that both relaxation schemes provide better results than those given in [Fonteneau et al., 2011, as cited above].
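Schematically (symbols here are illustrative, not the paper's exact notation), the problem pairs an outer maximization over the open-loop action sequence with an inner worst-case minimization over the system models consistent with the batch:

\[
\max_{u_0,\dots,u_{T-1}} \;\; \min_{(f,\rho)\in\mathcal{C}} \;\; \sum_{t=0}^{T-1} \rho(x_t, u_t)
\quad \text{s.t.} \quad x_{t+1} = f(x_t, u_t),
\]

where \(\mathcal{C}\) is the set of (Lipschitz-continuous) dynamics \(f\) and reward functions \(\rho\) compatible with the observed transitions. The NP-hardness result and both relaxation schemes concern this inner minimization.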
Compositional Models for Reinforcement Learning
Abstract. Innovations such as optimistic exploration, function approximation, and hierarchical decomposition have helped scale reinforcement learning to more complex environments, but these three ideas have rarely been studied together. This paper develops a unified framework that formalizes these algorithmic contributions as operators on learned models of the environment. Our formalism reveals some synergies among these innovations, and it suggests a straightforward way to compose them. The resulting algorithm, Fitted R-MAXQ, is the first to combine the function approximation of fitted algorithms, the efficient model-based exploration of R-MAX, and the hierarchical decomposition of MAXQ.
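The R-MAX ingredient, optimistic model-based exploration, can be sketched in one line (an illustrative fragment, not Fitted R-MAXQ itself): any state-action pair visited fewer than m times is assumed to yield the maximum achievable value, so the planner is drawn toward under-explored regions.

```python
import numpy as np

def optimistic_q(counts, q_model, v_max, m=5):
    """R-MAX-style optimism: "unknown" state-action pairs (visited fewer
    than m times) are replaced by the optimistic ceiling v_max.

    counts, q_model: arrays of the same shape over state-action pairs.
    """
    return np.where(counts >= m, q_model, v_max)
```

Planning with these optimistic values makes the agent either obtain near-optimal return or visit an unknown pair, which is the core of R-MAX's sample-efficiency guarantee.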