Reinforcement Learning: A Survey
This paper surveys the field of reinforcement learning from a
computer-science perspective. It is written to be accessible to researchers
familiar with machine learning. Both the historical basis of the field and a
broad selection of current work are summarized. Reinforcement learning is the
problem faced by an agent that learns behavior through trial-and-error
interactions with a dynamic environment. The work described here has a
resemblance to work in psychology, but differs considerably in the details and
in the use of the word "reinforcement." The paper discusses central issues of
reinforcement learning, including trading off exploration and exploitation,
establishing the foundations of the field via Markov decision theory, learning
from delayed reinforcement, constructing empirical models to accelerate
learning, making use of generalization and hierarchy, and coping with hidden
state. It concludes with a survey of some implemented systems and an assessment
of the practical utility of current methods for reinforcement learning.
Comment: See http://www.jair.org/ for any accompanying files.
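As a concrete illustration of two of the survey's central issues, trading off exploration and exploitation and learning from delayed reinforcement, the following is a minimal tabular Q-learning sketch with epsilon-greedy action selection. The environment interface (reset, step, actions) and all hyperparameters are illustrative assumptions, not taken from the survey.

    import random
    from collections import defaultdict

    def q_learning(env, episodes=500, alpha=0.1, gamma=0.99, epsilon=0.1):
        # Tabular Q-learning with epsilon-greedy exploration.
        # `env` is assumed to expose reset() -> state and
        # step(action) -> (next_state, reward, done), plus a finite
        # action list env.actions -- a hypothetical interface.
        Q = defaultdict(float)  # Q[(state, action)] -> estimated return
        for _ in range(episodes):
            state, done = env.reset(), False
            while not done:
                # Explore with probability epsilon, otherwise exploit.
                if random.random() < epsilon:
                    action = random.choice(env.actions)
                else:
                    action = max(env.actions, key=lambda a: Q[(state, a)])
                next_state, reward, done = env.step(action)
                # TD update: propagate the delayed reward through the
                # bootstrapped value of the next state.
                best_next = max(Q[(next_state, a)] for a in env.actions)
                target = reward + (0.0 if done else gamma * best_next)
                Q[(state, action)] += alpha * (target - Q[(state, action)])
                state = next_state
        return Q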
Reinforcement Learning using Augmented Neural Networks
Neural networks allow Q-learning agents such as deep Q-networks (DQN) to
approximate complex mappings from state spaces to value functions. However,
compared with function approximators such as tile coding or its
generalisation, radial basis functions (RBF), neural networks introduce
instability as a side effect of their globalised updates: a weight change
driven by one state alters the value estimates of many other states. This
instability persists even in networks with no hidden layers. In this paper, we
show that simple modifications to the structure of the neural network can
improve the stability of DQN learning when a multi-layer perceptron is used
for function approximation.
Comment: 7 pages; two columns; 4 figures
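To make the locality contrast concrete, the sketch below shows why narrow RBF features keep value updates local: a TD-style update at one state barely changes the estimate at a distant state. This is an illustrative numpy sketch of the general point, not the paper's architecture.

    import numpy as np

    def rbf_features(state, centers, width=0.1):
        # Each feature responds only near its centre, so a weight
        # update driven by one state stays local.
        d2 = np.sum((centers - state) ** 2, axis=1)
        return np.exp(-d2 / (2 * width ** 2))

    centers = np.linspace(0.0, 1.0, 20).reshape(-1, 1)  # 1-D state space
    w = np.zeros(len(centers))                          # linear Q-weights

    def q_value(s):
        return w @ rbf_features(np.array([s]), centers)

    # One TD-style step moving Q(0.1) toward 1.0 ...
    phi = rbf_features(np.array([0.1]), centers)
    w += 0.5 * (1.0 - w @ phi) * phi
    # ... leaves Q(0.9) essentially untouched, unlike a globalised
    # neural-network update.
    print(q_value(0.1), q_value(0.9))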
Structures for Sophisticated Behaviour: Feudal Hierarchies and World Models
This thesis explores structured, reward-based behaviour in artificial agents and in animals.

In Part I we investigate how reinforcement learning agents can learn to cooperate. Drawing inspiration from the hierarchical organisation of human societies, we propose the framework of Feudal Multi-agent Hierarchies (FMH), in which coordination of many agents is facilitated by a manager agent. We outline the structure of FMH and demonstrate its potential for decentralised learning and control. We show that, given an adequate set of subgoals from which to choose, FMH performs, and in particular scales, substantially better than cooperative approaches that use shared rewards. We next investigate training FMH in simulation to solve a complex information-gathering task. Our approach introduces a ‘Centralised Policy Actor-Critic’ (CPAC) and an alteration to the conventional multi-agent policy gradient, which allows one multi-agent system to advise the training of another. We further exploit this idea for communicating agents with shared rewards and demonstrate its efficacy.

In Part II we examine how animals discover and exploit underlying statistical structure in their environments, even when such structure is difficult to learn and use. By analysing behavioural data from an extended experiment with rats, we show that such hidden structure can indeed be learned, but also that subjects suffer from imperfections in their ability to infer their current state. We account for their behaviour using a Hidden Markov Model in which recent observations are integrated imperfectly with evidence from the past. We find that over the course of training, subjects learn to track their progress through the task more accurately, a change that our model largely attributes to the more reliable integration of past evidence.
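A minimal sketch of the FMH interaction pattern described in Part I: a manager assigns subgoals, workers are rewarded only for satisfying them, and the manager alone receives the external task reward. All interfaces here (the env/policy objects and the achieved() test) are hypothetical stand-ins, not the thesis's implementation.

    def fmh_episode(env, manager, workers, subgoals, max_steps=100):
        obs = env.reset()
        # The manager chooses a subgoal for each worker.
        goals = {name: manager.choose_subgoal(obs[name], subgoals)
                 for name in workers}
        for _ in range(max_steps):
            # Each worker acts to satisfy its assigned subgoal, which
            # decentralises control of the many-agent system.
            actions = {name: workers[name].act(obs[name], goals[name])
                       for name in workers}
            obs, task_reward, done = env.step(actions)
            for name in workers:
                # Workers see subgoal-completion reward, not task reward.
                workers[name].learn(float(goals[name].achieved(obs[name])))
            manager.learn(task_reward)  # only the manager sees task reward
            if done:
                break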
Off-Policy Action Anticipation in Multi-Agent Reinforcement Learning
Learning anticipation in Multi-Agent Reinforcement Learning (MARL) is a reasoning paradigm in which agents anticipate the learning steps of other agents in order to improve cooperation among themselves. Because MARL uses gradient-based optimization, learning anticipation requires Higher-Order Gradients (HOG), computed by so-called HOG methods. Existing HOG methods are based on policy parameter anticipation: agents anticipate the changes in the policy parameters of other agents. To date, however, these methods have only been applied to differentiable games or games with small state spaces. In this work, we demonstrate that in non-differentiable games with large state spaces, existing HOG methods perform poorly and are inefficient, owing to inherent limitations of policy parameter anticipation and their multiple sampling stages. To overcome these problems, we propose Off-Policy Action Anticipation (OffPA2), a novel framework that approaches learning anticipation through action anticipation, i.e., agents anticipate the changes in the actions of other agents, via off-policy sampling. We theoretically analyze OffPA2 and employ it to develop multiple HOG methods that are applicable to non-differentiable games with large state spaces. We conduct a large set of experiments and show that our proposed HOG methods outperform existing ones in both efficiency and performance.
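For intuition about policy parameter anticipation, the HOG idea the paper builds on, here is a minimal PyTorch sketch in which agent 1 differentiates through a one-step lookahead of agent 2's gradient update. The bilinear two-player game and learning rate are illustrative assumptions, not the paper's setup.

    import torch

    torch.manual_seed(0)
    theta1 = torch.randn(2, requires_grad=True)  # agent 1's parameters
    theta2 = torch.randn(2, requires_grad=True)  # agent 2's parameters
    M = torch.tensor([[1.0, -1.0], [-1.0, 1.0]])
    lr = 0.1

    loss1 = lambda t1, t2: t1 @ M @ t2       # agent 1 minimises this
    loss2 = lambda t1, t2: -(t1 @ M @ t2)    # agent 2 minimises the opposite

    # Anticipate agent 2's next parameters; create_graph=True keeps the
    # graph so gradients can flow through the anticipated update.
    g2 = torch.autograd.grad(loss2(theta1, theta2), theta2,
                             create_graph=True)[0]
    theta2_ahead = theta2 - lr * g2

    # Agent 1 updates against the anticipated opponent -- a higher-order
    # gradient, since theta2_ahead itself depends on theta1.
    g1 = torch.autograd.grad(loss1(theta1, theta2_ahead), theta1)[0]
    with torch.no_grad():
        theta1 -= lr * g1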
Learning Existing Social Conventions via Observationally Augmented Self-Play
In order for artificial agents to coordinate effectively with people, they
must act consistently with existing conventions (e.g. how to navigate in
traffic, which language to speak, or how to coordinate with teammates). A
group's conventions can be viewed as a choice of equilibrium in a coordination
game. We consider the problem of an agent learning a policy for a coordination
game in a simulated environment and then using this policy when it enters an
existing group. When there are multiple possible conventions, we show that
learning a policy via multi-agent reinforcement learning (MARL) is likely to
yield policies that achieve high payoffs at training time but fail to
coordinate with the real group the agent joins. We assume access to
a small number of samples of behavior from the true convention and show that we
can augment the MARL objective to help it find policies consistent with the
real group's convention. In three environments from the literature - traffic,
communication, and team coordination - we observe that augmenting MARL with a
small amount of imitation learning greatly increases the probability that the
strategy found by MARL fits well with the existing social convention. We show
that this works even in an environment where standard training methods very
rarely find the true convention of the agent's partners.
Comment: Published in AAAI-AIES 2019; Best Paper.
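The augmentation the abstract describes can be pictured as a standard policy-gradient loss plus a small imitation term fitted to the observed convention samples. The sketch below is an illustrative PyTorch rendering under assumed batch shapes and a hypothetical imitation_weight; it is not the paper's exact objective.

    import torch.nn.functional as F

    def augmented_loss(policy, obs, actions, returns,
                       conv_obs, conv_actions, imitation_weight=0.1):
        # REINFORCE-style MARL term on the agent's own rollouts.
        logp = F.log_softmax(policy(obs), dim=-1)
        chosen = logp.gather(1, actions.unsqueeze(1)).squeeze(1)
        rl_loss = -(chosen * returns).mean()

        # Imitation term: match the small sample of behaviour drawn
        # from the existing group's true convention.
        il_loss = F.cross_entropy(policy(conv_obs), conv_actions)

        return rl_loss + imitation_weight * il_loss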
On Building Generalizable Learning Agents
It has been a long-standing goal in Artificial Intelligence (AI) to build machines that can solve the tasks that humans can. Thanks to recent rapid progress in data-driven methods, which train agents to solve tasks by learning from massive amounts of training data, such learning approaches have handled, and in some cases solved, a number of extremely challenging tasks, including image classification, language generation, robotics control, and several multi-player games. The key factor in all of these data-driven successes is that the trained agents generalize to test scenarios unseen during training. This generalization capability is the foundation of any practical AI system. This thesis studies generalization, a fundamental challenge in AI, and proposes solutions that improve the generalization performance of learning agents in a variety of problems. We start by providing a formal formulation of the generalization problem in the context of reinforcement learning and proposing four principles within this formulation to guide the design of training techniques for improved generalization. We validate the effectiveness of the proposed principles by considering four domains, from simple to complex, and developing domain-specific techniques following these principles. We begin with the simplest domain, path-finding on graphs (Part I), then consider visual navigation in a 3D world (Part II) and competition in complex multi-agent games (Part III), and lastly tackle several natural language processing tasks (Part IV). Empirical evidence demonstrates that the proposed principles generally lead to much improved generalization performance across a wide range of problems.
BabyWalk: Going Farther in Vision-and-Language Navigation by Taking Baby Steps
Learning to follow instructions is of fundamental importance to autonomous
agents for vision-and-language navigation (VLN). In this paper, we study how an
agent can navigate long paths when learning from a corpus that consists of
shorter ones. We show that existing state-of-the-art agents do not generalize
well. To this end, we propose BabyWalk, a new VLN agent that learns to
navigate by decomposing long instructions into shorter ones (BabySteps) and
completing them sequentially. A specially designed memory buffer is used by
the agent to turn its past experiences into contexts for future steps. The
learning
process is composed of two phases. In the first phase, the agent uses imitation
learning from demonstration to accomplish BabySteps. In the second phase, the
agent uses curriculum-based reinforcement learning to maximize rewards on
navigation tasks with increasingly longer instructions. We create two new
benchmark datasets (of long navigation tasks) and use them in conjunction with
existing ones to examine BabyWalk's generalization ability. Empirical results
show that BabyWalk achieves state-of-the-art results on several metrics and,
in particular, follows long instructions better. The code and datasets are
released on our project page: https://github.com/Sha-Lab/babywalk.
Comment: Accepted by ACL 2020.
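The decomposition-and-memory loop the abstract describes might look like the following sketch, where split_instruction, summarise, and the agent interface are hypothetical stand-ins for BabyWalk's actual components.

    def babywalk_rollout(agent, env, instruction):
        baby_steps = split_instruction(instruction)  # short sub-instructions
        memory = []                                  # past-experience buffer
        obs = env.reset()
        for step in baby_steps:
            # Condition the policy on the current BabyStep plus a
            # summary of the trajectory executed so far.
            context = summarise(memory)
            obs, trajectory = agent.follow(step, obs, context)
            memory.append((step, trajectory))
        return obs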