Better Optimism By Bayes: Adaptive Planning with Rich Models
The computational costs of inference and planning have confined Bayesian
model-based reinforcement learning to one of two dismal fates: powerful
Bayes-adaptive planning but only for simplistic models, or powerful Bayesian
non-parametric models but using simple, myopic planning strategies such as
Thompson sampling. We ask whether it is feasible and truly beneficial to
combine rich probabilistic models with a closer approximation to fully Bayesian
planning. First, we use a collection of counterexamples to show formal problems
with the over-optimism inherent in Thompson sampling. Then we leverage
state-of-the-art techniques in efficient Bayes-adaptive planning and
non-parametric Bayesian methods to perform qualitatively better than both
existing conventional algorithms and Thompson sampling on two contextual
bandit-like problems.
Comment: 11 pages, 11 figures
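For concreteness, here is a minimal sketch of Thompson sampling on a two-armed Bernoulli bandit, the myopic baseline whose over-optimism the paper critiques. The Beta-Bernoulli setup and all names are illustrative assumptions, not the paper's code.

```python
import numpy as np

rng = np.random.default_rng(0)
true_means = np.array([0.4, 0.6])  # unknown arm reward probabilities (assumed)
alpha = np.ones(2)                 # Beta posterior: successes + 1
beta = np.ones(2)                  # Beta posterior: failures + 1

for t in range(1000):
    # Thompson sampling: draw one posterior sample per arm and act
    # greedily on the sample -- a single-sample, myopic approximation
    # to fully Bayes-adaptive planning.
    samples = rng.beta(alpha, beta)
    arm = int(np.argmax(samples))
    reward = rng.random() < true_means[arm]
    alpha[arm] += reward
    beta[arm] += 1 - reward

print("posterior means:", alpha / (alpha + beta))
```

A Bayes-adaptive planner would instead consider how each pull changes the posterior over future steps; Thompson sampling commits to whichever model a single draw makes look best, which is the source of the over-optimism the counterexamples target.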
Deep Reinforcement Learning with Double Q-learning
The popular Q-learning algorithm is known to overestimate action values under
certain conditions. It was not previously known whether, in practice, such
overestimations are common, whether they harm performance, and whether they can
generally be prevented. In this paper, we answer all these questions
affirmatively. In particular, we first show that the recent DQN algorithm,
which combines Q-learning with a deep neural network, suffers from substantial
overestimations in some games in the Atari 2600 domain. We then show that the
idea behind the Double Q-learning algorithm, which was introduced in a tabular
setting, can be generalized to work with large-scale function approximation. We
propose a specific adaptation to the DQN algorithm and show that the resulting
algorithm not only reduces the observed overestimations, as hypothesized, but
that this also leads to much better performance on several games.
Comment: AAAI 2016
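The decoupling the paper proposes amounts to one change in the target computation: the online network selects the next action and the target network evaluates it. A minimal NumPy sketch of the two targets, assuming per-state Q-value arrays of shape (batch, n_actions); the function and variable names are illustrative.

```python
import numpy as np

def dqn_target(rewards, gamma, q_target_next, done):
    # Standard DQN: the target network both selects and evaluates the
    # next action, so the max operator compounds estimation noise,
    # producing the overestimation the paper documents.
    return rewards + gamma * (1 - done) * q_target_next.max(axis=1)

def double_dqn_target(rewards, gamma, q_online_next, q_target_next, done):
    # Double DQN: select the argmax action with the online network,
    # evaluate it with the target network, decoupling selection
    # from evaluation.
    best = q_online_next.argmax(axis=1)
    evaluated = q_target_next[np.arange(len(best)), best]
    return rewards + gamma * (1 - done) * evaluated
```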
Increasing the Action Gap: New Operators for Reinforcement Learning
This paper introduces new optimality-preserving operators on Q-functions. We
first describe an operator for tabular representations, the consistent Bellman
operator, which incorporates a notion of local policy consistency. We show that
this local consistency leads to an increase in the action gap at each state;
increasing this gap, we argue, mitigates the undesirable effects of
approximation and estimation errors on the induced greedy policies. This
operator can also be applied to discretized continuous space and time problems,
and we provide empirical results evidencing superior performance in this
context. Extending the idea of a locally consistent operator, we then derive
sufficient conditions for an operator to preserve optimality, leading to a
family of operators which includes our consistent Bellman operator. As
corollaries we provide a proof of optimality for Baird's advantage learning
algorithm and derive other gap-increasing operators with interesting
properties. We conclude with an empirical study on 60 Atari 2600 games
illustrating the strong potential of these new operators.
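As an illustration of the gap-increasing idea, a tabular sketch of the advantage-learning update discussed above, which subtracts a fraction of the action gap, max_b Q(s, b) - Q(s, a), from the ordinary Bellman target. The environment interface and parameter names are assumptions for the example.

```python
import numpy as np

def advantage_learning_update(Q, s, a, r, s_next, gamma=0.99,
                              alpha_gap=0.1, lr=0.5):
    # One-step Bellman target, as in ordinary Q-learning.
    bellman = r + gamma * Q[s_next].max()
    # Gap-increasing correction: penalize (s, a) by a fraction of its
    # action gap.  Non-greedy actions are pushed down relative to the
    # greedy one, widening the gap and (the paper argues) making the
    # induced greedy policy more robust to approximation and
    # estimation errors.
    target = bellman - alpha_gap * (Q[s].max() - Q[s, a])
    Q[s, a] += lr * (target - Q[s, a])
    return Q
```

With alpha_gap = 0 this reduces to standard Q-learning; the paper's sufficient conditions characterize which such corrections still preserve the optimal policy.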
Acceleration in Policy Optimization
We work towards a unifying paradigm for accelerating policy optimization
methods in reinforcement learning (RL) by integrating foresight in the policy
improvement step via optimistic and adaptive updates. Leveraging the connection
between policy iteration and policy gradient methods, we view policy
optimization algorithms as iteratively solving a sequence of surrogate
objectives, local lower bounds on the original objective. We define optimism as
predictive modelling of the future behavior of a policy, and adaptivity as
taking immediate and anticipatory corrective actions to mitigate accumulating
errors from overshooting predictions or delayed responses to change. We use
this shared lens to jointly express other well-known algorithms, including
model-based policy improvement based on forward search, and optimistic
meta-learning algorithms. We analyze properties of this formulation, and show
connections to other accelerated optimization algorithms. Then, we design an
optimistic policy gradient algorithm, adaptive via meta-gradient learning, and
empirically highlight several design choices pertaining to acceleration, in an
illustrative task.
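The acceleration template the abstract describes can be sketched with a generic optimistic-gradient update, which extrapolates using the previous gradient as a prediction of the next one. This is a standard device from accelerated optimization, not the paper's meta-gradient algorithm; grad_fn and all names are assumptions.

```python
import numpy as np

def optimistic_ascent(grad_fn, theta, steps=100, lr=0.01):
    # Optimistic gradient ascent: step along 2*g_t - g_{t-1}, i.e. the
    # current gradient plus a prediction of the next one.  "Optimism"
    # here is the predictive model of future behavior; "adaptivity"
    # would correspond to correcting the prediction (e.g. via
    # meta-gradients, as in the paper) rather than fixing it.
    g_prev = np.zeros_like(theta)
    for _ in range(steps):
        g = grad_fn(theta)
        theta = theta + lr * (2 * g - g_prev)
        g_prev = g
    return theta
```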