Performance Dynamics and Termination Errors in Reinforcement Learning: A Unifying Perspective
In reinforcement learning, a decision needs to be made at some point as to
whether it is worthwhile to carry on with the learning process or to terminate
it. In many such situations, stochastic elements govern the occurrence of
rewards, with positive rewards randomly interleaved with negative ones. For
most practical learners, the learning is considered useful if the number of
positive rewards always exceeds the number of negative ones. A situation that
often calls for learning termination is when the number of negative rewards
exceeds the number of positive rewards.
However, while this seems reasonable, the error of premature termination,
whereby termination is enacted along with the conclusion of learning failure
despite the positive rewards eventually far outnumbering the negative ones, can be
significant. In this paper, using combinatorial analysis we study the error
probability in wrongly terminating a reinforcement learning activity which
undermines the effectiveness of an optimal policy, and we show that the
resultant error can be quite high. Whilst we demonstrate mathematically that
such errors can never be eliminated, we propose some practical mechanisms that
can effectively reduce such errors. Simulation experiments have been carried
out, the results of which are in close agreement with our theoretical findings.
Comment: Short Paper in AIKE 201
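As an illustrative sketch (not the paper's combinatorial analysis), the premature-termination error can be estimated by Monte-Carlo simulation of a ±1 reward process with positive drift; the naive stopping rule and all parameter values below are assumptions chosen for illustration:

```python
import random

def premature_termination_rate(p_pos=0.6, horizon=2000, episodes=10000, seed=0):
    """Estimate how often a naive stop rule ('terminate once negative
    rewards outnumber positive ones') fires even though the reward
    process has positive drift (p_pos > 0.5), so positive rewards
    would eventually far outnumber negative ones.  Illustrative only."""
    rng = random.Random(seed)
    terminated = 0
    for _ in range(episodes):
        total = 0
        for _ in range(horizon):
            total += 1 if rng.random() < p_pos else -1
            if total < 0:          # negatives now exceed positives
                terminated += 1
                break
    return terminated / episodes

# For p_pos > 0.5, the classical gambler's-ruin argument predicts the
# running total ever dips below zero with probability (1 - p_pos) / p_pos,
# i.e. about 0.667 for p_pos = 0.6 -- a high error rate, consistent with
# the abstract's claim that such errors can be significant.
rate = premature_termination_rate()
print(rate)
```

Even with a clearly favorable reward process, the naive rule terminates prematurely about two thirds of the time in this toy setting.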
Stochastic Reinforcement Learning
In reinforcement learning episodes, the rewards and punishments are often
non-deterministic, and there are invariably stochastic elements governing the
underlying situation. Such stochastic elements are often numerous and cannot be
known in advance, and they have a tendency to obscure the underlying rewards
and punishments patterns. Indeed, if stochastic elements were absent, the same
outcome would occur every time and the learning problems involved could be
greatly simplified. In addition, in most practical situations, the cost of an
observation to receive either a reward or punishment can be significant, and
one would wish to arrive at the correct learning conclusion by incurring
minimum cost. In this paper, we present a stochastic approach to reinforcement
learning which explicitly models the variability present in the learning
environment and the cost of observation. Criteria and rules for learning
success are quantitatively analyzed, and probabilities of exceeding the
observation cost bounds are also obtained.
Comment: AIKE 201
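A minimal Monte-Carlo sketch of the observation-cost idea: each observation yields a reward or punishment and costs a fixed amount, a simple sequential rule decides when to stop, and we estimate the probability that the accumulated cost exceeds a bound. The stopping rule and every parameter here are illustrative assumptions, not the paper's model:

```python
import random

def cost_overrun_probability(p=0.55, margin=10, unit_cost=1.0,
                             budget=100.0, trials=10000, seed=0):
    """Observations arrive one at a time (reward with probability p,
    punishment otherwise), each costing `unit_cost`.  A simple sequential
    rule stops once rewards lead punishments (or vice versa) by `margin`.
    Returns the estimated probability that the total observation cost
    exceeds `budget` before the rule reaches a conclusion."""
    rng = random.Random(seed)
    overruns = 0
    for _ in range(trials):
        lead, cost = 0, 0.0
        while abs(lead) < margin:
            lead += 1 if rng.random() < p else -1
            cost += unit_cost
            if cost > budget:
                overruns += 1
                break
    return overruns / trials

print(cost_overrun_probability())
```

Tightening the margin lowers the overrun probability but raises the chance of a wrong learning conclusion, which is the trade-off the abstract's cost-bound analysis quantifies.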
Grounding the Meanings in Sensorimotor Behavior using Reinforcement Learning
The recent outburst of interest in cognitive developmental robotics is fueled by the ambition to propose ecologically plausible mechanisms of how, among other things, a learning agent/robot could ground linguistic meanings in its sensorimotor behavior. Along this stream, we propose a model that allows the simulated iCub robot to learn the meanings of actions (point, touch, and push) oriented toward objects in the robot's peripersonal space. In our experiments, the iCub learns to execute motor actions and comment on them. Architecturally, the model is composed of three neural-network-based modules that are trained in different ways. The first module, a two-layer perceptron, is trained by back-propagation to attend to the target position in the visual scene, given the low-level visual information and the feature-based target information. The second module, having the form of an actor-critic architecture, is the most distinguishing part of our model, and is trained by a continuous version of reinforcement learning to execute actions as sequences, based on a linguistic command. The third module, an echo-state network, is trained to provide the linguistic description of the executed actions. The trained model generalizes well in the case of novel action-target combinations with randomized initial arm positions. It can also promptly adapt its behavior if the action/target suddenly changes during motor execution.
RODE: Learning Roles to Decompose Multi-Agent Tasks
Role-based learning holds the promise of achieving scalable multi-agent
learning by decomposing complex tasks using roles. However, it is largely
unclear how to efficiently discover such a set of roles. To solve this problem,
we propose to first decompose joint action spaces into restricted role action
spaces by clustering actions according to their effects on the environment and
other agents. Learning a role selector based on action effects makes role
discovery much easier because it forms a bi-level learning hierarchy -- the
role selector searches in a smaller role space and at a lower temporal
resolution, while role policies learn in significantly reduced primitive
action-observation spaces. We further integrate information about action
effects into the role policies to boost learning efficiency and policy
generalization. By virtue of these advances, our method (1) outperforms the
current state-of-the-art MARL algorithms on 10 of the 14 scenarios that
comprise the challenging StarCraft II micromanagement benchmark and (2)
achieves rapid transfer to new environments with three times the number of
agents. Demonstrative videos are available at
https://sites.google.com/view/rode-marl
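The first step described above, grouping actions into restricted role action spaces by their effects, can be sketched in miniature: represent each primitive action by an effect vector and cluster the vectors. Plain k-means stands in for the paper's learned action encoder, and the action names and effect vectors are hypothetical:

```python
import random

def cluster_actions_by_effect(effects, k=2, iters=20, seed=0):
    """Toy sketch of role discovery: group primitive actions into
    restricted role action spaces by clustering their environment
    effects.  `effects` maps action id -> effect vector (e.g., the
    average change each action induces in the state)."""
    rng = random.Random(seed)
    actions = list(effects)
    centers = [effects[a] for a in rng.sample(actions, k)]

    def dist2(u, v):
        return sum((x - y) ** 2 for x, y in zip(u, v))

    assign = {}
    for _ in range(iters):
        # Assign each action to its nearest role centre.
        assign = {a: min(range(k), key=lambda c: dist2(effects[a], centers[c]))
                  for a in actions}
        # Recompute each centre as the mean effect of its role's actions.
        for c in range(k):
            members = [effects[a] for a in actions if assign[a] == c]
            if members:
                centers[c] = tuple(sum(xs) / len(members)
                                   for xs in zip(*members))
    return assign

# Hypothetical effect vectors: (damage dealt, distance moved).
effects = {"attack_a": (1.0, 0.0), "attack_b": (0.9, 0.1),
           "move_n": (0.0, 1.0), "move_s": (0.1, 0.9)}
roles = cluster_actions_by_effect(effects, k=2)
print(roles)  # attack actions share one role, move actions the other
```

Each role policy then searches only its own small action cluster, which is the source of the reduced action-observation spaces the abstract describes.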