Count-Based Exploration with the Successor Representation
In this paper we introduce a simple approach for exploration in reinforcement
learning (RL) that allows us to develop theoretically justified algorithms in
the tabular case but that is also extendable to settings where function
approximation is required. Our approach is based on the successor
representation (SR), which was originally introduced as a representation
defining state generalization by the similarity of successor states. Here we
show that the norm of the SR, while it is being learned, can be used as a
reward bonus to incentivize exploration. To better understand this transient
behavior of the norm of the SR, we introduce the substochastic successor
representation (SSR) and show that it implicitly counts the number
of times each state (or feature) has been observed. We use this result to
introduce an algorithm that performs as well as some theoretically
sample-efficient approaches. Finally, we extend these ideas to a deep RL
algorithm and show that it achieves state-of-the-art performance in Atari 2600
games when in a low sample-complexity regime.

Comment: This paper appears in the Proceedings of the 34th AAAI Conference on Artificial Intelligence (AAAI 2020).
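The exploration bonus lends itself to a short tabular illustration. The sketch below is not the paper's exact algorithm: it learns the SR online by TD on a toy random-walk chain and uses the inverse L1 norm of the current SR row as a count-like bonus; the chain dynamics, bonus scale `beta`, and all other constants are illustrative assumptions.

```python
import numpy as np

def sr_exploration_bonus_demo(n_states=5, gamma=0.95, lr=0.1, beta=0.05,
                              steps=2000, seed=0):
    """Tabular sketch: learn the successor representation (SR) online with TD
    and turn the norm of the current SR row into a count-like exploration
    bonus. The random-walk chain and all constants are illustrative."""
    rng = np.random.default_rng(seed)
    psi = np.zeros((n_states, n_states))  # SR estimate, one row per state
    bonuses = []
    s = 0
    for _ in range(steps):
        # Toy dynamics: uniform random walk on a chain.
        s_next = min(max(s + rng.choice([-1, 1]), 0), n_states - 1)
        # TD update for the SR: psi(s) <- psi(s) + lr*(1_s + gamma*psi(s') - psi(s)).
        target = np.eye(n_states)[s] + gamma * psi[s_next]
        psi[s] += lr * (target - psi[s])
        # While the SR is still being learned, rarely visited states have
        # small row norms, so beta / ||psi(s)||_1 rewards visiting them.
        bonuses.append(beta / max(np.linalg.norm(psi[s], ord=1), 1e-8))
        s = s_next
    return psi, bonuses

psi, _ = sr_exploration_bonus_demo()
print(np.round(np.linalg.norm(psi, ord=1, axis=1), 2))  # larger norm = more visited
```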
Distributional Reinforcement Learning with Quantile Regression
In reinforcement learning an agent interacts with the environment by taking
actions and observing the next state and reward. When sampled
probabilistically, these state transitions, rewards, and actions can all induce
randomness in the observed long-term return. Traditionally, reinforcement
learning algorithms average over this randomness to estimate the value
function. In this paper, we build on recent work advocating a distributional
approach to reinforcement learning in which the distribution over returns is
modeled explicitly instead of only estimating the mean. That is, we examine
methods of learning the value distribution instead of the value function. We
give results that close a number of gaps between the theoretical and
algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we
extend existing results to the approximate distribution setting. Second, we
present a novel distributional reinforcement learning algorithm consistent with
our theoretical formulation. Finally, we evaluate this new algorithm on the
Atari 2600 games, observing that it significantly outperforms many of the
recent improvements on DQN, including the related distributional algorithm C51.
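At the heart of the approach is a quantile regression loss: N fixed quantile fractions are fit to the target return distribution with an asymmetrically weighted Huber loss. Below is a minimal NumPy sketch of that loss; the function name, array shapes, and sample-based target framing are our own illustration rather than the paper's implementation.

```python
import numpy as np

def quantile_huber_loss(theta, target_samples, kappa=1.0):
    """Quantile regression loss for fitting N quantile estimates `theta`
    (at midpoints tau_i = (i + 0.5) / N) to samples of the target return
    distribution. Shapes and framing are illustrative."""
    n = len(theta)
    taus = (np.arange(n) + 0.5) / n                  # quantile midpoints
    u = target_samples[None, :] - theta[:, None]     # pairwise TD errors, (N, M)
    abs_u = np.abs(u)
    huber = np.where(abs_u <= kappa,
                     0.5 * u ** 2,
                     kappa * (abs_u - 0.5 * kappa))  # Huber loss L_kappa(u)
    weight = np.abs(taus[:, None] - (u < 0.0))       # asymmetric quantile weight
    return np.mean(weight * huber / kappa)

# The loss shrinks as theta approaches the quantiles of the samples.
rng = np.random.default_rng(0)
samples = rng.normal(size=1000)
print(quantile_huber_loss(np.zeros(4), samples))
print(quantile_huber_loss(np.quantile(samples, [0.125, 0.375, 0.625, 0.875]), samples))
```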
Increasing the Action Gap: New Operators for Reinforcement Learning
This paper introduces new optimality-preserving operators on Q-functions. We
first describe an operator for tabular representations, the consistent Bellman
operator, which incorporates a notion of local policy consistency. We show that
this local consistency leads to an increase in the action gap at each state;
increasing this gap, we argue, mitigates the undesirable effects of
approximation and estimation errors on the induced greedy policies. This
operator can also be applied to discretized continuous-space and continuous-time problems,
and we provide empirical results evidencing superior performance in this
context. Extending the idea of a locally consistent operator, we then derive
sufficient conditions for an operator to preserve optimality, leading to a
family of operators which includes our consistent Bellman operator. As
corollaries we provide a proof of optimality for Baird's advantage learning
algorithm and derive other gap-increasing operators with interesting
properties. We conclude with an empirical study on 60 Atari 2600 games
illustrating the strong potential of these new operators.
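As a concrete instance of such a gap-increasing operator, a sample-based form of Baird's advantage learning subtracts a fraction of the action gap from the standard Bellman target. The tabular sketch below is illustrative only, with all constants assumed.

```python
import numpy as np

def advantage_learning_update(Q, s, a, r, s_next, gamma=0.99, alpha=0.1, lr=0.5):
    """One sample-based tabular update with a gap-increasing operator in the
    spirit of Baird's advantage learning: the usual Bellman target is reduced
    by alpha * (action gap at s), which leaves the greedy action's target
    unchanged but pushes non-greedy values down, widening the action gap.
    alpha = 0 recovers ordinary Q-learning; all constants are illustrative."""
    bellman_target = r + gamma * np.max(Q[s_next])
    gap_penalty = alpha * (np.max(Q[s]) - Q[s, a])  # zero when a is greedy at s
    Q[s, a] += lr * (bellman_target - gap_penalty - Q[s, a])
    return Q

# Toy usage on a 3-state, 2-action table (hypothetical transition).
Q = np.zeros((3, 2))
Q = advantage_learning_update(Q, s=0, a=1, r=1.0, s_next=2)
print(Q)
```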