47 research outputs found
Reinforcement Learning from Imperfect Demonstrations
Robust real-world learning should benefit from both demonstrations and
interactions with the environment. Current approaches to learning from
demonstration and reward perform supervised learning on expert demonstration
data and use reinforcement learning to further improve performance based on the
reward received from the environment. These objectives diverge and are
difficult to optimize jointly, and such methods can be very sensitive to
noisy demonstrations. We propose a unified reinforcement learning algorithm,
Normalized Actor-Critic (NAC), that effectively normalizes the Q-function,
reducing the Q-values of actions unseen in the demonstration data. NAC learns
an initial policy network from demonstrations and refines the policy in the
environment, surpassing the demonstrator's performance. Crucially, both
learning from demonstration and interactive refinement use the same objective,
unlike prior approaches that combine distinct supervised and reinforcement
losses. This makes NAC robust to suboptimal demonstration data since the method
is not forced to mimic all of the examples in the dataset. We show that our
unified reinforcement learning algorithm can learn robustly and outperform
existing baselines when evaluated on several realistic driving games.
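A minimal PyTorch sketch of the core idea, hedged: rather than reproducing the exact NAC gradient, it shows the closely related soft (entropy-regularized) Q-update, in which the value is a log-sum-exp over all actions and the same objective is applied to demonstration and interaction batches alike. Network sizes, hyperparameters, and the batch layout are assumptions, not the paper's choices.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class QNet(nn.Module):
        def __init__(self, obs_dim, n_actions, hidden=128):
            super().__init__()
            self.net = nn.Sequential(nn.Linear(obs_dim, hidden), nn.ReLU(),
                                     nn.Linear(hidden, n_actions))

        def forward(self, obs):
            return self.net(obs)  # Q(s, .)

    def soft_value(q, alpha):
        # V(s) = alpha * logsumexp_a Q(s, a)/alpha: a normalization over *all*
        # actions, which keeps actions unseen in the demonstrations from being
        # assigned inflated values.
        return alpha * torch.logsumexp(q / alpha, dim=-1)

    def unified_loss(qnet, batch, alpha=0.1, gamma=0.99):
        # batch = (obs, action, reward, next_obs, done); the same objective is
        # applied whether the batch comes from demonstrations or interaction.
        obs, action, reward, next_obs, done = batch
        q_taken = qnet(obs).gather(1, action.unsqueeze(1)).squeeze(1)
        with torch.no_grad():
            target = reward + gamma * (1.0 - done) * soft_value(qnet(next_obs), alpha)
        return F.mse_loss(q_taken, target)

    # one training step could then be:
    # loss = unified_loss(qnet, demo_batch) + unified_loss(qnet, env_batch)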
Multi-Preference Actor Critic
Policy gradient algorithms typically combine discounted future rewards with
an estimated value function to compute the direction and magnitude of
parameter updates. However, for most Reinforcement Learning tasks, humans can
provide additional insight to constrain the policy learning. We introduce a
general method to incorporate multiple different feedback channels into a
single policy gradient loss. In our formulation, the Multi-Preference Actor
Critic (M-PAC), these different types of feedback are implemented as
constraints on the policy. We use a Lagrangian relaxation to satisfy these
constraints using gradient descent while learning a policy that maximizes
rewards. Experiments in Atari and Pendulum verify that constraints are being
respected and can accelerate the learning process.
Comment: NeurIPS Workshop on Deep RL, 201
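A hedged illustration of the Lagrangian-relaxation idea (a generic sketch, not the paper's exact constraint set): each feedback channel becomes a constraint C_i(pi) <= d_i with a learnable non-negative multiplier; the policy minimizes the penalized loss while the multipliers perform gradient ascent on the violations. The constraint estimates and thresholds below are placeholders.

    import torch
    import torch.nn.functional as F

    class LagrangeMultipliers(torch.nn.Module):
        def __init__(self, n_constraints):
            super().__init__()
            self.raw = torch.nn.Parameter(torch.zeros(n_constraints))

        def forward(self):
            # softplus keeps every multiplier non-negative
            return F.softplus(self.raw)

    def constrained_losses(pg_loss, constraint_estimates, thresholds, multipliers):
        # pg_loss: the usual policy-gradient loss to be minimized
        # constraint_estimates[i]: current estimate of C_i(pi) for feedback channel i
        # thresholds[i]: allowed level d_i for that channel
        violation = constraint_estimates - thresholds  # > 0 means violated
        lam = multipliers()
        policy_loss = pg_loss + (lam.detach() * violation).sum()
        # the multipliers do gradient ascent on the violation (policy term
        # detached): they grow while a constraint is violated, shrink otherwise
        multiplier_loss = -(lam * violation.detach()).sum()
        return policy_loss, multiplier_loss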
Reinforcement Learning for Nested Polar Code Construction
In this paper, we model nested polar code construction as a Markov decision
process (MDP), and tackle it with advanced reinforcement learning (RL)
techniques. First, an MDP environment with state, action, and reward is defined
in the context of polar coding. Specifically, a state represents the
construction of a polar code, an action specifies its reduction to a
subcode, and the reward is the decoding performance. A neural network
architecture consisting of both policy and value networks is proposed to
generate actions based on the observed states, aiming at maximizing the overall
rewards. A loss function is defined to trade off exploitation against
exploration. To further improve learning efficiency and quality, an `integrated
learning' paradigm is proposed. It first employs a genetic algorithm to
generate a population of (sub-)optimal polar codes for each code dimension, and then
uses them as prior knowledge to refine the policy in RL. Such a paradigm is
shown to accelerate the training process and to converge to better performance.
Simulation results show that the proposed learning-based polar constructions
achieve performance comparable to, or even better than, the state of the art
under successive cancellation list (SCL) decoders. Last but not least, this is
achieved without exploiting any expert knowledge from polar coding theory in
the learning algorithms.
Comment: 8 pages, 10 figures, propose a multi-stage genetic algorithm
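A schematic sketch of the MDP just described, under stated assumptions: the state is the current information set, an action freezes one more bit position (yielding a nested, lower-rate subcode), and the reward stands in for decoding performance. The random reward stub is a placeholder for an actual SCL-decoder simulation, and all names are illustrative.

    import numpy as np

    class NestedPolarEnv:
        def __init__(self, block_length=64, k_max=32, k_min=8):
            self.n, self.k_max, self.k_min = block_length, k_max, k_min
            self.reset()

        def reset(self):
            # start from some high-rate code; here simply the last k_max positions
            self.info_set = set(range(self.n - self.k_max, self.n))
            return self._state()

        def _state(self):
            s = np.zeros(self.n, dtype=np.float32)
            s[list(self.info_set)] = 1.0  # 1 = information bit, 0 = frozen bit
            return s

        def step(self, action):
            # action: index of an information bit to freeze next, so every
            # lower-rate code is nested inside the previous one
            assert action in self.info_set
            self.info_set.remove(action)
            reward = self._decoding_performance()
            done = len(self.info_set) <= self.k_min
            return self._state(), reward, done

        def _decoding_performance(self):
            # placeholder: the paper would evaluate the current code under an SCL
            # decoder; a random stub just keeps the sketch runnable
            return float(np.random.rand())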
Action Space Shaping in Deep Reinforcement Learning
Reinforcement learning (RL) has been successful in training agents in various
learning environments, including video-games. However, such work often modifies
and shrinks the game's original action space. This is done to avoid trying
"pointless" actions and to ease the implementation. Currently, this is mostly
done based on intuition, with little systematic research supporting the design
decisions. In this work, we aim to gain insight into these action space
modifications by conducting extensive experiments in video-game environments.
Our results show how domain-specific removal of actions and discretization of
continuous actions can be crucial for successful learning. With these insights,
we hope to ease the use of RL in new environments by clarifying which
action spaces are easy to learn.
Comment: To appear in IEEE Conference on Games 2020. Experiment code is available at https://github.com/Miffyli/rl-action-space-shapin
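A small, hedged illustration of the two transformations studied above: removing a subset of discrete actions and discretizing a continuous control. The wrapper interface (reset/step) and the (button, throttle) action format of the wrapped environment are assumptions, not any specific game's API.

    class ShapedActionEnv:
        def __init__(self, env, kept_button_ids, throttle_levels=(0.0, 0.5, 1.0)):
            # kept_button_ids: subset of the original discrete actions to expose
            # throttle_levels: discretization of an originally continuous control
            self.env = env
            self.buttons = list(kept_button_ids)
            self.levels = list(throttle_levels)
            self.n_actions = len(self.buttons) * len(self.levels)

        def reset(self):
            return self.env.reset()

        def step(self, action):
            # decode the flat, shaped action into (original button, throttle value)
            button = self.buttons[action // len(self.levels)]
            throttle = self.levels[action % len(self.levels)]
            return self.env.step((button, throttle))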
Towards intervention-centric causal reasoning in learning agents
Interventions are central to causal learning and reasoning. Yet ultimately an
intervention is an abstraction: an agent embedded in a physical environment
(perhaps modeled as a Markov decision process) does not typically come equipped
with the notion of an intervention -- its action space is typically
ego-centric, without actions of the form `intervene on X'. Such a
correspondence between ego-centric actions and interventions would be
challenging to hard-code. It would instead be better if an agent learnt which
sequences of actions allow it to make targeted manipulations of the environment,
and learnt corresponding representations that permit learning from
observation. Here we show how a meta-learning approach can be used to perform
causal learning in this challenging setting, where the action-space is not a
set of interventions and the observation space is a high-dimensional space with
a latent causal structure. A meta-reinforcement learning algorithm is used to
learn relationships that transfer to observational causal learning tasks. This
work shows how advances in deep reinforcement learning and meta-learning can
provide intervention-centric causal learning in high-dimensional environments
with a latent causal structure.
Comment: 11 pages, 4 figures. Presented at the ICLR 2020 workshop 'Causal learning for decision making'
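A rough sketch of the training-loop shape implied above, with heavy assumptions: each episode samples a fresh environment with its own latent causal structure, and a recurrent policy must adapt within the episode; a plain REINFORCE update stands in for the paper's meta-reinforcement-learning algorithm. `policy_rnn` and `sample_causal_env` are hypothetical.

    import torch

    def meta_train(policy_rnn, sample_causal_env, optimizer, n_episodes=1000):
        for _ in range(n_episodes):
            env = sample_causal_env()          # fresh latent causal graph per episode
            hidden, obs, done = None, env.reset(), False
            log_probs, rewards = [], []
            while not done:
                obs_t = torch.as_tensor(obs, dtype=torch.float32)
                dist, hidden = policy_rnn(obs_t, hidden)  # recurrent, so it can adapt
                action = dist.sample()
                obs, reward, done = env.step(action.item())
                log_probs.append(dist.log_prob(action))
                rewards.append(reward)
            # a plain REINFORCE update on the episode return; the paper's algorithm
            # (and its observational evaluation phase) is considerably richer
            episode_return = float(sum(rewards))
            loss = -torch.stack(log_probs).sum() * episode_return
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()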
Jointly Pre-training with Supervised, Autoencoder, and Value Losses for Deep Reinforcement Learning
Deep Reinforcement Learning (DRL) algorithms are known to be data
inefficient. One reason is that a DRL agent learns both the feature and the
policy tabula rasa. Integrating prior knowledge into DRL algorithms is one way
to improve learning efficiency since it helps to build helpful representations.
In this work, we consider incorporating human knowledge to accelerate the
asynchronous advantage actor-critic (A3C) algorithm by pre-training on a small
amount of non-expert human demonstrations. We leverage the supervised
autoencoder framework and propose a novel pre-training strategy that jointly
optimizes a weighted supervised classification loss, an unsupervised
reconstruction loss, and an expected return loss. The resulting pre-trained
model learns more useful features than training independently in a
supervised or unsupervised fashion. Our pre-training method drastically
improved the learning performance of the A3C agent in the Atari games Pong and
MsPacman, exceeding the performance of state-of-the-art algorithms with far
fewer game interactions. Our method is lightweight and easy
to implement on a single machine. For reproducibility, our code is available at
github.com/gabrieledcjr/DeepRL/tree/A3C-ALA2019
Comment: Accepted at the Adaptive and Learning Agents (ALA) Workshop at AAMAS
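A hedged sketch of the joint objective described above: one encoder feeding a supervised action-classification head, a reconstruction (autoencoder) head, and an expected-return head, combined as a weighted sum. Layer sizes and loss weights are assumptions, not the paper's values.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class PretrainNet(nn.Module):
        def __init__(self, obs_dim, n_actions, latent=64):
            super().__init__()
            self.encoder = nn.Sequential(nn.Linear(obs_dim, latent), nn.ReLU())
            self.decoder = nn.Linear(latent, obs_dim)        # reconstruction head
            self.policy_head = nn.Linear(latent, n_actions)  # supervised head
            self.value_head = nn.Linear(latent, 1)           # expected-return head

        def forward(self, obs):
            z = self.encoder(obs)
            return self.policy_head(z), self.decoder(z), self.value_head(z).squeeze(-1)

    def joint_pretrain_loss(net, obs, demo_actions, demo_returns,
                            w_sup=1.0, w_rec=1.0, w_val=0.5):
        logits, recon, value = net(obs)
        sup = F.cross_entropy(logits, demo_actions)  # imitate the non-expert actions
        rec = F.mse_loss(recon, obs)                 # unsupervised reconstruction
        val = F.mse_loss(value, demo_returns)        # regress Monte-Carlo returns
        return w_sup * sup + w_rec * rec + w_val * val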
Goal-conditioned Imitation Learning
Designing rewards for Reinforcement Learning (RL) is challenging because the
reward needs to convey the desired task, be efficient to optimize, and be easy to
compute. The latter is particularly problematic when applying RL to robotics,
where detecting whether the desired configuration is reached might require
considerable supervision and instrumentation. Furthermore, we are often
interested in being able to reach a wide range of configurations, so setting
up a different reward every time may be impractical. Methods like Hindsight
Experience Replay (HER) have recently shown promise for learning policies able
to reach many goals without the need for a reward. Unfortunately, without tricks
like resetting to points along the trajectory, HER might require many samples
to discover how to reach certain areas of the state-space. In this work we
investigate different approaches to incorporating demonstrations in order to
drastically speed up convergence to a policy able to reach any goal, while also
surpassing the performance of an agent trained with other Imitation Learning
algorithms. Furthermore, we show our method can also be used when the available
expert trajectories do not contain the actions, which allows it to leverage
kinesthetic or third-person demonstrations. The code is available at
https://sites.google.com/view/goalconditioned-il/.
Comment: Published at NeurIPS 201
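A small sketch of the goal-relabeling mechanism this line of work builds on, under simplifying assumptions (final-state relabeling and a sparse 0/-1 reward): states actually reached, by the agent or by a demonstrator, are reused as goals, and state-only demonstrations simply carry None actions.

    import numpy as np

    def relabel_with_hindsight(trajectory, goal_from_state):
        # trajectory: list of (state, action, next_state); works the same for agent
        # rollouts and for demonstration trajectories (action may be None for
        # state-only, e.g. kinesthetic or third-person, demonstrations)
        achieved_goal = goal_from_state(trajectory[-1][2])  # last state reached
        relabeled = []
        for state, action, next_state in trajectory:
            reached = np.allclose(goal_from_state(next_state), achieved_goal)
            reward = 0.0 if reached else -1.0               # sparse, HER-style reward
            relabeled.append((state, action, achieved_goal, reward, next_state))
        return relabeled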
PRIMAL: Pathfinding via Reinforcement and Imitation Multi-Agent Learning
Multi-agent path finding (MAPF) is an essential component of many
large-scale, real-world robot deployments, from aerial swarms to warehouse
automation. However, despite the community's continued efforts, most
state-of-the-art MAPF planners still rely on centralized planning and scale
poorly past a few hundred agents. Such planning approaches are ill-suited to
real-world deployments, where noise and uncertainty often require paths to be
recomputed online, which is impossible when planning times range from seconds to
minutes. We present PRIMAL, a novel framework for MAPF that combines
reinforcement and imitation learning to teach fully-decentralized policies,
where agents reactively plan paths online in a partially-observable world while
exhibiting implicit coordination. This framework extends our previous work on
distributed learning of collaborative policies by introducing demonstrations of
an expert MAPF planner during training, as well as careful reward shaping and
environment sampling. Once learned, the resulting policy can be copied onto any
number of agents and naturally scales to different team sizes and world
dimensions. We present results on randomized worlds with up to 1024 agents and
compare success rates against state-of-the-art MAPF planners. Finally, we
experimentally validate the learned policies in a hybrid simulation of a
factory mockup, involving both real-world and simulated robots.
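An illustrative sketch (not the released PRIMAL code) of how reinforcement and imitation can be mixed during training: episodes driven by an expert MAPF planner are trained with a behavior-cloning loss, and the remaining episodes with an ordinary advantage actor-critic loss on the shaped rewards. The `episode` container and the policy signature are assumptions.

    import torch.nn.functional as F

    def training_loss(policy, episode, expert_actions=None):
        # policy(observations) -> (action logits, state values) for one agent;
        # the same network is copied onto every agent at execution time
        logits, values = policy(episode.observations)
        if expert_actions is not None:
            # imitation episodes: supervised loss toward the centralized expert plan
            return F.cross_entropy(logits, expert_actions)
        # reinforcement episodes: simple advantage actor-critic loss on shaped rewards
        advantages = episode.returns - values.squeeze(-1)
        log_probs = F.log_softmax(logits, dim=-1)
        chosen = log_probs.gather(1, episode.actions.unsqueeze(1)).squeeze(1)
        actor_loss = -(chosen * advantages.detach()).mean()
        critic_loss = advantages.pow(2).mean()
        return actor_loss + 0.5 * critic_loss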
Hierarchical Deep Q-Network from Imperfect Demonstrations in Minecraft
We present Hierarchical Deep Q-Network (HDQfD) that took first place in the
MineRL competition. HDQfD works on imperfect demonstrations and utilizes the
hierarchical structure of expert trajectories. We introduce the procedure of
extracting an effective sequence of meta-actions and subgoals from
demonstration data. We present a structured task-dependent replay buffer and an
adaptive prioritizing technique that allow the HDQfD agent to gradually erase
poor-quality expert data from the buffer. In this paper, we describe the HDQfD
algorithm in detail and report experimental results in the Minecraft domain.
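A hedged sketch of the buffer behaviour described above: demonstration transitions are stored with their episode return, the share of expert data sampled per batch decays as agent experience accumulates, and the lowest-return expert data can be erased over time. The decay schedule and the return-based quality criterion are illustrative assumptions.

    import random
    from collections import deque

    class AdaptiveDemoBuffer:
        def __init__(self, capacity=100_000, demo_fraction=0.5, decay=0.9999):
            self.agent_data = deque(maxlen=capacity)
            self.demo_data = []               # (episode_return, transition), best first
            self.demo_fraction = demo_fraction
            self.decay = decay

        def add_demo(self, episode_return, transition):
            self.demo_data.append((episode_return, transition))
            self.demo_data.sort(key=lambda item: item[0], reverse=True)

        def add_agent(self, transition):
            self.agent_data.append(transition)
            self.demo_fraction *= self.decay  # rely on demonstrations less over time

        def erase_worst_demos(self, n=1):
            # drop the lowest-return expert data, mimicking the gradual removal of
            # poor-quality demonstrations
            self.demo_data = self.demo_data[:max(0, len(self.demo_data) - n)]

        def sample(self, batch_size):
            n_demo = min(int(batch_size * self.demo_fraction), len(self.demo_data))
            demo = [t for _, t in random.sample(self.demo_data, n_demo)]
            n_agent = min(batch_size - n_demo, len(self.agent_data))
            return demo + random.sample(list(self.agent_data), n_agent)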
Transfer Learning for Related Reinforcement Learning Tasks via Image-to-Image Translation
Despite the remarkable success of Deep RL in learning control policies from
raw pixels, the resulting models do not generalize. We demonstrate that a
trained agent fails completely when facing small visual changes, and that
fine-tuning---the common transfer learning paradigm---fails to adapt to these
changes, to the extent that it is faster to re-train the model from scratch. We
show that by separating the visual transfer task from the control policy we
achieve substantially better sample efficiency and transfer behavior, allowing
an agent trained on the source task to transfer well to the target tasks. The
visual mapping from the target to the source domain is performed using
unaligned GANs, resulting in a control policy that can be further improved
using imitation learning from imperfect demonstrations. We demonstrate the
approach on synthetic visual variants of the Breakout game, as well as on
transfer between subsequent levels of Road Fighter, a Nintendo car-driving
game. A visualization of our approach can be seen at
https://youtu.be/4mnkzYyXMn4 and https://youtu.be/KCGTrQi6Ogo.
Comment: Proceedings of the 36th International Conference on Machine Learning (ICML 2019)
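A minimal sketch of the deployment-time idea, assuming a trained target-to-source generator (learned with an unaligned, CycleGAN-style objective) and a frozen source-task policy are given: target observations are translated back into the source domain before the policy sees them. The tensor layout and the step() return signature are assumptions.

    import torch

    class TranslatedObservationEnv:
        def __init__(self, target_env, target_to_source_generator):
            self.env = target_env
            self.generator = target_to_source_generator.eval()

        def _translate(self, frame):
            # map a target-domain frame back into the source domain the policy knows
            with torch.no_grad():
                x = torch.as_tensor(frame, dtype=torch.float32).unsqueeze(0)
                return self.generator(x).squeeze(0).numpy()

        def reset(self):
            return self._translate(self.env.reset())

        def step(self, action):
            obs, reward, done, info = self.env.step(action)
            return self._translate(obs), reward, done, info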