Deep Conservative Policy Iteration
Conservative Policy Iteration (CPI) is a founding algorithm of Approximate
Dynamic Programming (ADP). Its core principle is to stabilize greediness
through stochastic mixtures of consecutive policies. It comes with strong
theoretical guarantees, and inspired approaches in deep Reinforcement Learning
(RL). However, CPI itself has rarely been implemented, never with neural
networks, and has only been tested on toy problems. In this paper, we show how
CPI can be practically combined with deep RL for discrete actions. We also
introduce adaptive mixture rates inspired by the theory. We thoroughly evaluate
the resulting algorithm on the simple Cartpole problem, and validate the
proposed method on a representative subset of Atari games. Overall, this
work suggests that revisiting classic ADP may lead to improved and more stable
deep RL algorithms.
Comment: AAAI 2020 (long version).
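CPI's stabilized greedy step is a stochastic mixture of the current policy and the greedy one, pi_{k+1} = (1 - alpha) pi_k + alpha greedy(q_k). Below is a minimal tabular sketch of that step; the function name and array layout are illustrative choices, not taken from the paper, and a fixed alpha stands in for the adaptive mixture rates mentioned above.

import numpy as np

def cpi_mixture_step(pi, q, alpha):
    """One conservative greedy step: pi <- (1 - alpha) * pi + alpha * greedy(q).

    pi:    (S, A) array of action probabilities, the current stochastic policy.
    q:     (S, A) array of state-action values estimated for pi.
    alpha: mixture rate in [0, 1]; alpha = 1 recovers the usual greedy step.
    """
    greedy = np.zeros_like(pi)
    greedy[np.arange(pi.shape[0]), q.argmax(axis=1)] = 1.0  # deterministic greedy policy
    return (1.0 - alpha) * pi + alpha * greedy  # stochastic mixture of consecutive policies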
Munchausen Reinforcement Learning
Bootstrapping is a core mechanism in Reinforcement Learning (RL). Most
algorithms, based on temporal differences, replace the true value of a
transiting state by their current estimate of this value. Yet, another estimate
could be leveraged to bootstrap RL: the current policy. Our core contribution
stands in a very simple idea: adding the scaled log-policy to the immediate
reward. We show that slightly modifying Deep Q-Network (DQN) in that way
provides an agent that is competitive with distributional methods on Atari
games, without making use of distributional RL, n-step returns or prioritized
replay. To demonstrate the versatility of this idea, we also use it together
with an Implicit Quantile Network (IQN). The resulting agent outperforms
Rainbow on Atari, establishing a new state of the art with very few
modifications to the original algorithm. To add to this empirical study, we
provide strong theoretical insights on what happens under the hood -- implicit
Kullback-Leibler regularization and an increase of the action gap.
Comment: NeurIPS 2020. Code:
https://github.com/google-research/google-research/tree/master/munchausen_r
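The "very simple idea" has a compact form: with pi the softmax of the q-values at temperature tau, the regression target gains the term alpha * tau * log pi(a|s). The numpy sketch below spells out a Munchausen-style target for a single transition; the hyperparameter names and the omission of the paper's log-policy clipping are simplifications of this sketch.

import numpy as np

def munchausen_target(r, a, q_cur, q_next, done, gamma=0.99, tau=0.03, alpha=0.9):
    """Munchausen-style regression target for one transition (illustrative).

    q_cur:  (A,) online q-values at the current state s.
    q_next: (A,) target-network q-values at the next state s'.
    a: action taken at s; r: immediate reward; done: termination flag.
    """
    def log_softmax(x):
        z = x - x.max()
        return z - np.log(np.exp(z).sum())

    log_pi_cur = log_softmax(q_cur / tau)    # log-policy at s (softmax of q / tau)
    log_pi_next = log_softmax(q_next / tau)  # log-policy at s'
    pi_next = np.exp(log_pi_next)
    munchausen = alpha * tau * log_pi_cur[a]  # scaled log-policy added to the reward
    soft_next = (pi_next * (q_next - tau * log_pi_next)).sum()  # soft bootstrap at s'
    return r + munchausen + (1.0 - done) * gamma * soft_next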
Momentum in Reinforcement Learning
We adapt the concept of momentum from optimization to reinforcement learning.
Seeing the state-action value functions as an analog to the gradients in
optimization, we interpret momentum as an average of consecutive q-functions.
We derive Momentum Value Iteration (MoVI), a variation of Value Iteration that
incorporates this momentum idea. Our analysis shows that this allows MoVI to
average errors over successive iterations. We show that the proposed approach
can be readily extended to deep learning. Specifically, we propose a simple
improvement on DQN based on MoVI, and evaluate it on Atari games.
Comment: AISTATS 2020.
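To make the analogy concrete, here is a rough tabular sketch of the idea: the values driving the greedy step are a running average of consecutive q-functions rather than the latest iterate. The update below is one plausible reading of the abstract, not the paper's exact scheme, and beta is a hypothetical averaging rate.

import numpy as np

def movi_sketch(P, R, gamma=0.99, beta=0.5, iters=200):
    """Momentum-flavored value iteration on a tabular MDP (illustrative).

    P: (S, A, S) transition probabilities; R: (S, A) expected rewards.
    h is the running average of consecutive q-functions; greediness and
    bootstrapping use h instead of the latest q, averaging errors
    over successive iterations.
    """
    S, A = R.shape
    q = np.zeros((S, A))
    h = np.zeros((S, A))
    for _ in range(iters):
        h = beta * h + (1.0 - beta) * q  # momentum: average consecutive q-functions
        v = h.max(axis=1)                # greedy value w.r.t. the averaged q
        q = R + gamma * P @ v            # Bellman backup
    return h.argmax(axis=1)              # greedy policy w.r.t. the average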
Offline Reinforcement Learning with Pseudometric Learning
Offline Reinforcement Learning methods seek to learn a policy from logged
transitions of an environment, without any interaction. In the presence of
function approximation, and under the assumption of limited coverage of the
state-action space of the environment, it is necessary to constrain the policy to
visit state-action pairs close to the support of logged transitions. In this
work, we propose an iterative procedure to learn a pseudometric (closely
related to bisimulation metrics) from logged transitions, and use it to define
this notion of closeness. We show its convergence and extend it to the function
approximation setting. We then use this pseudometric to define a new
lookup-based bonus in an actor-critic algorithm: PLOFF. This bonus encourages the
actor to stay close, in terms of the defined pseudometric, to the support of
logged transitions. Finally, we evaluate the method on hand manipulation and
locomotion tasks.
Comment: ICML 2021.
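The lookup-based bonus can be pictured as the negated pseudometric distance from a candidate state-action pair to its nearest neighbor among the logged transitions. The sketch below assumes the learned pseudometric is approximated by Euclidean distance in a learned embedding; the name embed, the feature layout, and the exact form of the bonus are assumptions of this illustration.

import numpy as np

def lookup_bonus(sa, dataset_sa, embed, scale=1.0):
    """Lookup-based bonus from a learned pseudometric (illustrative).

    sa:         (d,) features of the candidate state-action pair.
    dataset_sa: (N, d) features of the logged transitions.
    embed:      learned map under which Euclidean distance approximates
                the pseudometric (an assumption of this sketch).
    """
    z = embed(sa)                                   # (k,)
    zs = np.stack([embed(x) for x in dataset_sa])   # (N, k)
    nearest = np.linalg.norm(zs - z, axis=1).min()  # distance to closest logged pair
    return -scale * nearest  # large penalty far from the support, ~0 on it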
GKD: Generalized Knowledge Distillation for Auto-regressive Sequence Models
Knowledge distillation is commonly used for compressing neural networks to
reduce their inference cost and memory footprint. However, current distillation
methods for auto-regressive models, such as generative language models (LMs),
suffer from two key issues: (1) distribution mismatch between the output
sequences seen during training and those generated by the student during its
deployment, and (2) model under-specification, where the student model may not
be expressive enough to fit the teacher's distribution. To address these
issues, we propose Generalized Knowledge Distillation (GKD). GKD mitigates
distribution mismatch by sampling output sequences from the student during
training. Furthermore, GKD handles model under-specification by optimizing
alternative divergences, such as reverse KL, that focus on generating samples
from the student that are likely under the teacher's distribution. We
demonstrate that GKD outperforms commonly-used approaches for distilling LLMs
on summarization, machine translation, and arithmetic reasoning tasks.
Comment: First two authors contributed equally.
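One ingredient can be sketched compactly: a reverse-KL distillation loss evaluated on sequences sampled from the student, which is how GKD mitigates the train/deployment mismatch. Tensor shapes, the masking convention, and the function name below are assumptions of this sketch; GKD itself generalizes over several divergences and mixtures of data sources.

import torch
import torch.nn.functional as F

def reverse_kl_distillation_loss(student_logits, teacher_logits, mask):
    """Per-token reverse KL, KL(student || teacher), on student-sampled text.

    student_logits, teacher_logits: (B, T, V) logits scored on sequences
    that were sampled from the student during training.
    mask: (B, T) float tensor, 1 for generated tokens, 0 for prompt/padding.
    """
    log_p_s = F.log_softmax(student_logits, dim=-1)
    log_p_t = F.log_softmax(teacher_logits, dim=-1)
    # Mode-seeking direction: pushes student samples to be likely under the teacher.
    kl = (log_p_s.exp() * (log_p_s - log_p_t)).sum(dim=-1)  # (B, T)
    return (kl * mask).sum() / mask.sum()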
Offline Reinforcement Learning as Anti-Exploration
Offline Reinforcement Learning (RL) aims at learning an optimal control from
a fixed dataset, without interactions with the system. An agent in this setting
should avoid selecting actions whose consequences cannot be predicted from the
data. This is the converse of exploration in RL, which favors such actions. We
thus take inspiration from the literature on bonus-based exploration to design
a new offline RL agent. The core idea is to subtract a prediction-based
exploration bonus from the reward, instead of adding it for exploration. This
allows the policy to stay close to the support of the dataset. We connect this
approach to a more common regularization of the learned policy towards the
data. Instantiated with a bonus based on the prediction error of a variational
autoencoder, we show that our agent is competitive with the state of the art on
a set of continuous control locomotion and manipulation tasks.
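In code, the core idea is a sign flip on a prediction-based bonus. The sketch below assumes a conditional VAE whose encoder and decoder are available as callables (encode and decode are hypothetical names); its reconstruction error stands in for the novelty bonus, which is subtracted from, rather than added to, the reward.

import numpy as np

def prediction_bonus(s, a, encode, decode):
    """Reconstruction-error bonus from a hypothetical conditional VAE.

    The VAE is trained on the dataset, so its error is small on the data
    support and grows for unfamiliar state-action pairs.
    """
    a_hat = decode(s, encode(s, a))  # reconstruct the action given the state
    return np.sum((a - a_hat) ** 2)  # squared reconstruction error

def anti_exploration_reward(r, bonus, scale=1.0):
    """Subtract the bonus (exploration would add it) to stay on-support."""
    return r - scale * bonus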
Leverage the Average: an Analysis of KL Regularization in Reinforcement Learning
Recent Reinforcement Learning (RL) algorithms making use of Kullback-Leibler (KL) regularization as a core component have shown outstanding performance. Yet, so far, little is understood theoretically about why KL regularization helps. We study KL regularization within an approximate value iteration scheme and show that it implicitly averages q-values. Leveraging this insight, we provide a very strong performance bound, the first to combine two desirable aspects: a linear dependency on the horizon (instead of quadratic) and an error propagation term involving an averaging effect of the estimation errors (instead of an accumulation effect). We also study the more general case of an additional entropy regularizer. The resulting abstract scheme encompasses many existing RL algorithms. Some of our assumptions do not hold with neural networks, so we complement this theoretical analysis with an extensive empirical study.
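The implicit-averaging effect can be checked numerically: the KL-regularized greedy step, pi_{k+1} proportional to pi_k exp(q_k / lambda), started from a uniform policy, yields exactly the softmax of the sum (a rescaled average) of all past q-functions. A small self-contained verification on random q-tables, with names chosen only for this sketch:

import numpy as np

def softmax(x, axis=-1):
    z = x - x.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
lam = 1.0
qs = [rng.normal(size=(4, 3)) for _ in range(5)]  # 5 iterations, 4 states, 3 actions

pi = np.full((4, 3), 1.0 / 3.0)            # uniform initial policy
for q in qs:
    pi = softmax(np.log(pi) + q / lam)     # KL-regularized greedy step
print(np.allclose(pi, softmax(sum(qs) / lam)))  # True: softmax of the summed q-values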