16 research outputs found
Offline Reinforcement Learning as Anti-Exploration
Offline Reinforcement Learning (RL) aims at learning an optimal control policy
from a fixed dataset, without interacting with the system. An agent in this setting
should avoid selecting actions whose consequences cannot be predicted from the
data. This is the converse of exploration in RL, which favors such actions. We
thus take inspiration from the literature on bonus-based exploration to design
a new offline RL agent. The core idea is to subtract a prediction-based
exploration bonus from the reward, instead of adding it for exploration. This
allows the policy to stay close to the support of the dataset. We connect this
approach to a more common regularization of the learned policy towards the
data. Instantiated with a bonus based on the prediction error of a variational
autoencoder, we show that our agent is competitive with the state of the art on
a set of continuous control locomotion and manipulation tasks.
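A minimal sketch of the reward modification described above, assuming `vae` is a variational autoencoder already trained to reconstruct state-action pairs from the offline dataset; all names are illustrative, not the authors' code:

```python
import torch

def anti_exploration_reward(vae, state, action, reward, alpha=1.0):
    """Penalized reward r - alpha * bonus(s, a) for offline RL training."""
    with torch.no_grad():
        x = torch.cat([state, action], dim=-1)
        bonus = ((vae(x) - x) ** 2).mean(dim=-1)  # VAE prediction error on (s, a)
    # Exploration methods would *add* the bonus; subtracting it instead keeps
    # the learned policy close to the support of the dataset.
    return reward - alpha * bonus
```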
Goal-Conditioned Reinforcement Learning with Imagined Subgoals
Goal-conditioned reinforcement learning endows an agent with a large variety
of skills, but it often struggles to solve tasks that require more temporally
extended reasoning. In this work, we propose to incorporate imagined subgoals
into policy learning to facilitate learning of complex tasks. Imagined subgoals
are predicted by a separate high-level policy, which is trained simultaneously
with the policy and its critic. This high-level policy predicts intermediate
states halfway to the goal using the value function as a reachability metric.
We do not require the policy to reach these subgoals explicitly. Instead, we use
them to define a prior policy, and incorporate this prior into a KL-constrained
policy iteration scheme to speed up and regularize learning. Imagined subgoals
are used during policy learning, but not during test time, where we only apply
the learned policy. We evaluate our approach on complex robotic navigation and
manipulation tasks and show that it outperforms existing methods by a large
margin.
Comment: ICML 2021. See the project webpage at
https://www.di.ens.fr/willow/research/ris
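A minimal sketch of the KL-constrained policy update described above, assuming the policy returns a torch action distribution; `high_level_policy`, `critic`, and `beta` are hypothetical names for illustration:

```python
import torch

def subgoal_policy_loss(policy, high_level_policy, critic, state, goal, beta=0.1):
    """One illustrative policy-improvement step with an imagined subgoal."""
    subgoal = high_level_policy(state, goal)   # imagined state halfway to the goal
    dist = policy(state, goal)                 # pi(a | s, g)
    with torch.no_grad():
        prior = policy(state, subgoal)         # prior policy pi(a | s, subgoal)
    action = dist.rsample()
    q_value = critic(state, action, goal)
    kl = torch.distributions.kl_divergence(dist, prior)
    # Maximize the critic's value while staying close to the subgoal prior.
    return (-q_value + beta * kl).mean()
```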
On Multi-objective Policy Optimization as a Tool for Reinforcement Learning
Many advances that have improved the robustness and efficiency of deep
reinforcement learning (RL) algorithms can, in one way or another, be
understood as introducing additional objectives, or constraints, in the policy
optimization step. This includes ideas as far ranging as exploration bonuses,
entropy regularization, and regularization toward teachers or data priors when
learning from experts or in offline RL. Often, task reward and auxiliary
objectives are in conflict with each other and it is therefore natural to treat
these examples as instances of multi-objective (MO) optimization problems. We
study the principles underlying multi-objective RL (MORL) and introduce a new algorithm,
Distillation of a Mixture of Experts (DiME), that is intuitive and
scale-invariant under some conditions. We highlight its strengths on standard
MO benchmark problems and consider case studies in which we recast offline RL
and learning from experts as MO problems. This leads to a natural algorithmic
formulation that sheds light on the connections between existing approaches. For
offline RL, we use the MO perspective to derive a simple algorithm that
optimizes the standard RL objective plus a behavioral cloning term. This
outperforms the state of the art on two established offline RL benchmarks.
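A minimal sketch of that offline RL case study, with the two objectives combined by a fixed weight; names are illustrative, and this is not DiME itself, which handles the trade-off in a multi-objective, scale-invariant way:

```python
import torch

def offline_rl_loss(policy, critic, state, action_data, lam=1.0):
    """Standard RL objective (via the critic) plus a behavioral cloning term."""
    dist = policy(state)                       # action distribution pi(. | s)
    q_value = critic(state, dist.rsample())    # objective 1: task reward
    bc_logp = dist.log_prob(action_data)       # objective 2: imitate the dataset
    # Fixed linear scalarization of the two (often conflicting) objectives.
    return (-q_value - lam * bc_logp).mean()
```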
Active Predictive Coding: Brain-Inspired Reinforcement Learning for Sparse Reward Robotic Control Problems
In this article, we propose a backpropagation-free approach to robotic
control through the neuro-cognitive computational framework of neural
generative coding (NGC), designing an agent built completely from powerful
predictive coding/processing circuits that facilitate dynamic, online learning
from sparse rewards, embodying the principles of planning-as-inference.
Concretely, we craft an adaptive agent system, which we call active predictive
coding (ActPC), that balances an internally-generated epistemic signal (meant
to encourage intelligent exploration) with an internally-generated instrumental
signal (meant to encourage goal-seeking behavior) to ultimately learn how to
control various simulated robotic systems as well as a complex robotic arm
using a realistic robotics simulator, i.e., the Surreal Robotics Suite, for the
block lifting and can pick-and-place tasks. Notably, our experimental
results demonstrate that our proposed ActPC agent performs well in the face of
sparse (extrinsic) reward signals and is competitive with or outperforms
several powerful backprop-based RL approaches.
Comment: Contains appendix with pseudocode and additional detail
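An illustrative-only sketch of how the two internal signals could be blended with the sparse extrinsic reward; the actual ActPC agent derives them from neural generative coding circuits trained with local, backprop-free updates, and all names here are hypothetical:

```python
import numpy as np

def epistemic_signal(predicted_next_obs, next_obs):
    # Exploration-driving term: prediction error of the generative circuit.
    return float(np.mean((predicted_next_obs - next_obs) ** 2))

def instrumental_signal(obs, goal_prototype):
    # Goal-seeking term: negative distance to a goal/reward prototype.
    return -float(np.mean((obs - goal_prototype) ** 2))

def combined_signal(extrinsic, epistemic, instrumental, w_e=1.0, w_i=1.0):
    # Sparse extrinsic reward augmented by both internally generated signals.
    return extrinsic + w_e * epistemic + w_i * instrumental
```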