Learning Finite State Representations of Recurrent Policy Networks
Recurrent neural networks (RNNs) are an effective representation of control
policies for a wide range of reinforcement and imitation learning problems. RNN
policies, however, are particularly difficult to explain, understand, and
analyze due to their use of continuous-valued memory vectors and observation
features. In this paper, we introduce a new technique, Quantized Bottleneck
Insertion, to learn finite representations of these vectors and features. The
result is a quantized representation of the RNN that can be analyzed to improve
our understanding of memory use and general behavior. We present results of
this approach on synthetic environments and six Atari games. The resulting
finite representations are surprisingly small in some cases, using as few as 3
discrete memory states and 10 observations for a perfect Pong policy. We also
show that these finite policy representations lead to improved
interpretability.
Comment: Preprint. Under review at ICLR 2019.
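As a rough illustration of what such a bottleneck can look like, the sketch below quantizes a continuous memory vector to three levels per dimension using a straight-through estimator. The layer shapes and the autoencoder-style insertion are assumptions for illustration, not the paper's exact design.

import torch

class QuantizedBottleneck(torch.nn.Module):
    """Quantize a continuous vector to {-1, 0, 1} per dimension,
    passing gradients straight through the rounding step."""
    def __init__(self, in_dim, bottleneck_dim):
        super().__init__()
        self.encode = torch.nn.Linear(in_dim, bottleneck_dim)
        self.decode = torch.nn.Linear(bottleneck_dim, in_dim)

    def forward(self, h):
        z = torch.tanh(self.encode(h))    # soft codes in (-1, 1)
        z_hard = torch.round(z)           # hard codes in {-1, 0, 1}
        z = z + (z_hard - z).detach()     # straight-through estimator
        return self.decode(z), z_hard

bottleneck = QuantizedBottleneck(in_dim=64, bottleneck_dim=8)
h = torch.randn(1, 64)                    # a continuous RNN memory vector
h_hat, codes = bottleneck(h)              # discrete codes for the memory state

Enumerating the distinct values of codes visited by a trained policy is what yields the small finite state machines the abstract refers to.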
Recurrent Predictive State Policy Networks
We introduce Recurrent Predictive State Policy (RPSP) networks, a recurrent
architecture that brings insights from predictive state representations to
reinforcement learning in partially observable environments. Predictive state
policy networks consist of a recursive filter, which keeps track of a belief
about the state of the environment, and a reactive policy that directly maps
beliefs to actions, to maximize the cumulative reward. The recursive filter
leverages predictive state representations (PSRs) (Rosencrantz and Gordon,
2004; Sun et al., 2016) by modeling predictive state-- a prediction of the
distribution of future observations conditioned on history and future actions.
This representation gives rise to a rich class of statistically consistent
algorithms (Hefny et al., 2018) to initialize the recursive filter. Predictive
state serves as an equivalent representation of a belief state. Therefore, the
policy component of the RPSP-network can be purely reactive, simplifying
training while still allowing optimal behaviour. Moreover, we use the PSR
interpretation during training as well, by incorporating prediction error in
the loss function. The entire network (recursive filter and reactive policy) is
still differentiable and can be trained using gradient based methods. We
optimize our policy using a combination of policy gradient based on rewards
(Williams, 1992) and gradient descent based on prediction error. We show the
efficacy of RPSP-networks under partial observability on a set of robotic
control tasks from OpenAI Gym. We empirically show that RPSP-networks compare
favourably with memory-preserving networks such as GRUs, as well as with finite
memory models, and are the overall best-performing method.
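A minimal sketch of the combined objective described above, assuming a GRU cell as a stand-in for the PSR-based recursive filter and toy dimensions throughout; this is one plausible reading of the joint loss, not the paper's implementation.

import torch

obs_dim, state_dim, n_actions = 4, 16, 2
filter_net = torch.nn.GRUCell(obs_dim, state_dim)   # recursive filter (stand-in)
predictor  = torch.nn.Linear(state_dim, obs_dim)    # predicts next observation
policy     = torch.nn.Linear(state_dim, n_actions)  # reactive policy

def rpsp_loss(obs_seq, act_seq, returns, alpha=0.5):
    """Sum of a REINFORCE term (Williams, 1992) and the filter's
    one-step prediction error, as the abstract describes."""
    state = torch.zeros(1, state_dim)
    pg, pred_err = [], []
    for t in range(len(obs_seq) - 1):
        state = filter_net(obs_seq[t].unsqueeze(0), state)
        logp = torch.log_softmax(policy(state), dim=-1)[0, act_seq[t]]
        pg.append(-logp * returns[t])                              # policy gradient
        pred_err.append(((predictor(state) - obs_seq[t + 1]) ** 2).mean())
    return torch.stack(pg).mean() + alpha * torch.stack(pred_err).mean()

T = 5
loss = rpsp_loss(torch.randn(T, obs_dim), [0] * (T - 1), [1.0] * (T - 1))
loss.backward()   # both filter and policy receive gradients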
Learning Deep Neural Network Policies with Continuous Memory States
Policy learning for partially observed control tasks requires policies that
can remember salient information from past observations. In this paper, we
present a method for learning policies with internal memory for
high-dimensional, continuous systems, such as robotic manipulators. Our
approach consists of augmenting the state and action space of the system with
continuous-valued memory states that the policy can read from and write to.
Learning general-purpose policies with this type of memory representation
directly is difficult, because the policy must automatically figure out the
most salient information to memorize at each time step. We show that, by
decomposing this policy search problem into a trajectory optimization phase and
a supervised learning phase through a method called guided policy search, we
can acquire policies with effective memorization and recall strategies.
Intuitively, the trajectory optimization phase chooses the values of the memory
states that will make it easier for the policy to produce the right action in
future states, while the supervised learning phase encourages the policy to use
memorization actions to produce those memory states. We evaluate our method on
tasks involving continuous control in manipulation and navigation settings, and
show that our method can learn complex policies that successfully complete a
range of tasks that require memory.
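The memory augmentation can be pictured as an environment wrapper; the sketch below is a schematic of that idea. The class and names are hypothetical, and the guided policy search training procedure itself is not shown.

import numpy as np

class MemoryAugmentedEnv:
    """Wrap a gym-style environment so observations include continuous
    memory states and actions include memory writes."""
    def __init__(self, env, mem_dim=4):
        self.env, self.mem_dim = env, mem_dim

    def reset(self):
        self.memory = np.zeros(self.mem_dim)
        obs = self.env.reset()
        return np.concatenate([obs, self.memory])

    def step(self, action):
        # Split the augmented action into motor command and memory write.
        motor, write = action[:-self.mem_dim], action[-self.mem_dim:]
        self.memory = write                      # policy writes memory directly
        obs, reward, done, info = self.env.step(motor)
        return np.concatenate([obs, self.memory]), reward, done, info

The policy then reads the memory simply by conditioning on the augmented observation, and "memorization actions" are the write components of its output.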
A Survey and Critique of Multiagent Deep Reinforcement Learning
Deep reinforcement learning (RL) has achieved outstanding results in recent
years. This has led to a dramatic increase in the number of applications and
methods. Recent works have explored learning beyond single-agent scenarios and
have considered multiagent learning (MAL) scenarios. Initial results report
successes in complex multiagent domains, although there are several challenges
to be addressed. The primary goal of this article is to provide a clear
overview of current multiagent deep reinforcement learning (MDRL) literature.
Additionally, we complement the overview with a broader analysis: (i) we
revisit previous key components, originally presented in MAL and RL, and
highlight how they have been adapted to multiagent deep reinforcement learning
settings. (ii) We provide general guidelines to new practitioners in the area:
describing lessons learned from MDRL works, pointing to recent benchmarks, and
outlining open avenues of research. (iii) We take a more critical tone raising
practical challenges of MDRL (e.g., implementation and computational demands).
We expect this article will help unify and motivate future research to take
advantage of the abundant literature that exists (e.g., RL and MAL) in a joint
effort to promote fruitful research in the multiagent community.
Comment: Under review since Oct 2018. Earlier versions of this work had the
title: "Is multiagent deep reinforcement learning the answer or the question?
A brief survey".
General Value Function Networks
State construction is important for learning in partially observable
environments. A general purpose strategy for state construction is to learn the
state update using a Recurrent Neural Network (RNN), which updates the internal
state using the current internal state and the most recent observation. This
internal state provides a summary of the observed sequence, to facilitate
accurate predictions and decision-making. At the same time, RNNs can be hard to
specify and train for non-experts. Training RNNs is notoriously tricky,
particularly as the common strategy for approximating gradients back in time,
truncated Backpropagation Through Time (BPTT), can be sensitive to the
truncation window. Further, domain expertise---which can usually help constrain
the function class and so improve trainability---can be difficult to
incorporate into complex recurrent units used within RNNs. In this work, we
explore how to use multi-step predictions, as a simple and general approach to
inject prior knowledge, while retaining much of the generality and learning
power behind RNNs. In particular, we revisit the idea of using predictions to
construct state and ask: does constraining (parts of) the state to consist of
predictions about the future improve RNN trainability? We formulate a novel RNN
architecture, called a General Value Function Network (GVFN), where each
internal state component corresponds to a prediction about the future
represented as a value function. We first provide an objective for optimizing
GVFNs, and derive several algorithms to optimize this objective. We then show
that GVFNs are more robust to the truncation level, in many cases only
requiring one-step gradient updates.
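As a toy illustration of the core constraint, the sketch below trains each state component with semi-gradient TD(0) toward its own cumulant and discount. The paper derives several algorithms for its objective; this is one simple instance with a linear-tanh state update, and all names and dimensions are illustrative.

import numpy as np

def gvfn_update(W, x_t, x_tp1, obs_tp1, cumulants, gammas, lr=0.1):
    """Each state unit i predicts the discounted sum of cumulant c_i;
    update every unit with semi-gradient TD(0)."""
    s_t   = np.tanh(W @ x_t)      # state at time t: a vector of GVF predictions
    s_tp1 = np.tanh(W @ x_tp1)    # bootstrap predictions at t+1
    for i in range(W.shape[0]):
        target = cumulants[i](obs_tp1) + gammas[i] * s_tp1[i]
        delta = target - s_t[i]
        # gradient of tanh(W[i] @ x_t) with respect to W[i]
        W[i] += lr * delta * (1 - s_t[i] ** 2) * x_t
    return W, s_tp1

rng = np.random.default_rng(0)
W = rng.normal(size=(2, 5)) * 0.1            # 2 GVF units, 5 input features
cumulants = [lambda o: o[0], lambda o: float(o[1] > 0)]
gammas = [0.9, 0.0]                          # long- and short-horizon predictions
W, s = gvfn_update(W, rng.normal(size=5), rng.normal(size=5),
                   rng.normal(size=3), cumulants, gammas)

Here x_t would typically concatenate the previous state and the current observation, so the state remains a summary of history while each component is pinned to a grounded prediction.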
Non-Markovian Control with Gated End-to-End Memory Policy Networks
Partially observable environments present an important open challenge in the
domain of sequential control learning with delayed rewards. Despite numerous
attempts during the last two decades, the majority of reinforcement learning
algorithms and associated approximate models, applied to this context, still
assume Markovian state transitions. In this paper, we explore the use of a
recently proposed attention-based model, the Gated End-to-End Memory Network,
for sequential control. We call the resulting model the Gated End-to-End Memory
Policy Network. More precisely, we use a model-free value-based algorithm to
learn policies for partially observed domains using this memory-enhanced neural
network. This model is end-to-end learnable and it features unbounded memory.
Indeed, thanks to its attention mechanism and associated non-parametric
memory, the proposed model can attend over the entire observation stream,
unlike recurrent models. We show encouraging results that
illustrate the capability of our attention-based model in the context of the
continuous-state non-stationary control problem of stock trading. We also
present an OpenAI Gym environment for simulated stock exchange and explain its
relevance as a benchmark for the field of non-Markovian decision process
learning.
Comment: 11 pages, 1 figure, 1 table.
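A stripped-down sketch of the attention-over-the-stream idea: Q-values computed from soft attention over all stored observations. It omits the gating and multi-hop structure of the actual Gated End-to-End Memory Network, and the shapes are assumptions.

import torch

class AttentionMemoryQ(torch.nn.Module):
    """Q-values from attention over the whole observation history;
    the memory grows with the episode, hence 'unbounded'."""
    def __init__(self, obs_dim, key_dim, n_actions):
        super().__init__()
        self.key = torch.nn.Linear(obs_dim, key_dim)
        self.val = torch.nn.Linear(obs_dim, key_dim)
        self.query = torch.nn.Linear(obs_dim, key_dim)
        self.q_head = torch.nn.Linear(key_dim, n_actions)

    def forward(self, history, current_obs):
        # history: (T, obs_dim) of past observations; current_obs: (obs_dim,)
        keys, vals = self.key(history), self.val(history)
        attn = torch.softmax(keys @ self.query(current_obs), dim=0)  # (T,)
        read = attn @ vals                    # weighted read from memory
        return self.q_head(read)              # one Q-value per action

net = AttentionMemoryQ(obs_dim=6, key_dim=32, n_actions=3)
q_values = net(torch.randn(50, 6), torch.randn(6))   # attend over 50 past steps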
On Learning to Think: Algorithmic Information Theory for Novel Combinations of Reinforcement Learning Controllers and Recurrent Neural World Models
This paper addresses the general problem of reinforcement learning (RL) in
partially observable environments. In 2013, our large RL recurrent neural
networks (RNNs) learned from scratch to drive simulated cars from
high-dimensional video input. However, real brains are more powerful in many
ways. In particular, they learn a predictive model of their initially unknown
environment, and somehow use it for abstract (e.g., hierarchical) planning and
reasoning. Guided by algorithmic information theory, we describe RNN-based AIs
(RNNAIs) designed to do the same. Such an RNNAI can be trained on never-ending
sequences of tasks, some of them provided by the user, others invented by the
RNNAI itself in a curious, playful fashion, to improve its RNN-based world
model. Unlike our previous model-building RNN-based RL machines dating back to
1990, the RNNAI learns to actively query its model for abstract reasoning and
planning and decision making, essentially "learning to think." The basic ideas
of this report can be applied to many other cases where one RNN-like system
exploits the algorithmic information content of another. They are taken from a
grant proposal submitted in Fall 2014, and also explain concepts such as
"mirror neurons." Experimental results will be described in separate papers.Comment: 36 pages, 1 figure. arXiv admin note: substantial text overlap with
arXiv:1404.782
Reinforcement Learning in POMDPs with Memoryless Options and Option-Observation Initiation Sets
Many real-world reinforcement learning problems have a hierarchical nature,
and often exhibit some degree of partial observability. While hierarchy and
partial observability are usually tackled separately (for instance by combining
recurrent neural networks and options), we show that addressing both problems
simultaneously is simpler and more efficient in many cases. More specifically,
we make the initiation set of options conditional on the previously-executed
option, and show that options with such Option-Observation Initiation Sets
(OOIs) are at least as expressive as Finite State Controllers (FSCs), a
state-of-the-art approach for learning in POMDPs. OOIs are easy to design based
on an intuitive description of the task, lead to explainable policies and keep
the top-level and option policies memoryless. Our experiments show that OOIs
allow agents to learn optimal policies in challenging POMDPs, while being much
more sample-efficient than a recurrent neural network over options.
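The mechanism is simple enough to sketch directly: the previously executed option acts as the only memory, and it restricts which options may be initiated next. The option names and scores below are made up for illustration.

# Illustrative OOI table: which options may start after each option.
# The previous option serves as external memory, so the top-level and
# option policies themselves can stay memoryless.
initiation = {
    None:       {"go_left", "go_right"},      # episode start
    "go_left":  {"go_left", "pick_up"},
    "go_right": {"go_right", "pick_up"},
    "pick_up":  {"go_left", "go_right"},
}

def select_option(prev_option, scores):
    """Pick the highest-scoring option whose OOI admits prev_option."""
    allowed = initiation[prev_option]
    return max(allowed, key=lambda o: scores[o])

print(select_option(None, {"go_left": 0.2, "go_right": 0.7, "pick_up": 0.1}))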
Memorize or generalize? Searching for a compositional RNN in a haystack
Neural networks are very powerful learning systems, but they do not readily
generalize from one task to the other. This is partly due to the fact that they
do not learn in a compositional way, that is, by discovering skills that are
shared by different tasks, and recombining them to solve new problems. In this
paper, we explore the compositional generalization capabilities of recurrent
neural networks (RNNs). We first propose the lookup table composition domain as
a simple setup to test compositional behaviour and show that it is
theoretically possible for a standard RNN to learn to behave compositionally in
this domain when trained with standard gradient descent and provided with
additional supervision. We then remove this additional supervision and perform
a search over a large number of model initializations to investigate the
proportion of RNNs that can still converge to a compositional solution. We
discover that a small but non-negligible proportion of RNNs do reach partial
compositional solutions even without special architectural constraints. This
suggests that a combination of gradient descent and evolutionary strategies
directly favouring the minority models that developed more compositional
approaches might suffice to lead standard RNNs towards compositional solutions.
Comment: AEGAP Workshop (ICML 2018).
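A sketch of what a lookup table composition domain can look like: each atomic task is a random bijection over bit strings, and a composed task chains two of them. The details here are guesses at the setup, not the paper's code.

import itertools, random

def make_lookup_tables(n_tables=4, n_bits=3, seed=0):
    """Build random bijections over n-bit strings, one per atomic task."""
    rng = random.Random(seed)
    inputs = ["".join(b) for b in itertools.product("01", repeat=n_bits)]
    tables = []
    for _ in range(n_tables):
        outputs = inputs[:]
        rng.shuffle(outputs)
        tables.append(dict(zip(inputs, outputs)))
    return inputs, tables

inputs, tables = make_lookup_tables()
x = inputs[0]
# Composition "t1 . t2": apply table 2 first, then table 1 to the result.
composed = tables[0][tables[1][x]]
print(x, "->", composed)

A compositional learner that has mastered each table separately should solve unseen compositions of them with little or no extra training, which is what the search over initializations probes.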
On Improving Deep Reinforcement Learning for POMDPs
Deep Reinforcement Learning (RL) recently emerged as one of the most
competitive approaches for learning in sequential decision making problems with
fully observable environments, e.g., computer Go. However, very little work has
been done in deep RL to handle partially observable environments. We propose a
new architecture called Action-specific Deep Recurrent Q-Network (ADRQN) to
enhance learning performance in partially observable domains. Actions are
encoded by a fully connected layer and coupled with a convolutional observation
to form an action-observation pair. The time series of action-observation pairs
are then integrated by an LSTM layer that learns latent states based on which a
fully connected layer computes Q-values as in conventional Deep Q-Networks
(DQNs). We demonstrate the effectiveness of our new architecture in several
partially observable domains, including flickering Atari games.
Comment: 7 pages, 6 figures, 3 tables.
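The architecture as described maps onto a short sketch; the exact layer sizes below are placeholders, not the paper's configuration.

import torch

class ADRQN(torch.nn.Module):
    """Encode the previous action, couple it with convolutional
    observation features, integrate with an LSTM, and emit Q-values."""
    def __init__(self, n_actions, act_embed=32, feat=64, hidden=128):
        super().__init__()
        self.conv = torch.nn.Sequential(
            torch.nn.Conv2d(1, 8, 8, stride=4), torch.nn.ReLU(),
            torch.nn.Flatten(), torch.nn.LazyLinear(feat))
        self.act_fc = torch.nn.Linear(n_actions, act_embed)   # action encoder
        self.lstm = torch.nn.LSTM(feat + act_embed, hidden, batch_first=True)
        self.q = torch.nn.Linear(hidden, n_actions)

    def forward(self, frames, prev_actions_onehot, state=None):
        # frames: (B, T, 1, 84, 84); prev_actions_onehot: (B, T, n_actions)
        B, T = frames.shape[:2]
        feats = self.conv(frames.reshape(B * T, *frames.shape[2:]))
        pairs = torch.cat([feats.reshape(B, T, -1),
                           self.act_fc(prev_actions_onehot)], dim=-1)
        out, state = self.lstm(pairs, state)   # latent states over time
        return self.q(out), state              # Q-values per time step

net = ADRQN(n_actions=4)
acts = torch.nn.functional.one_hot(torch.zeros(2, 5, dtype=torch.long), 4).float()
q, _ = net(torch.randn(2, 5, 1, 84, 84), acts)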