28 research outputs found
Overcoming Exploration in Reinforcement Learning with Demonstrations
Exploration in environments with sparse rewards has been a persistent problem
in reinforcement learning (RL). Many tasks are natural to specify with a sparse
reward, and manually shaping a reward function can result in suboptimal
performance. However, finding a non-zero reward is exponentially more difficult
with increasing task horizon or action dimensionality. This puts many
real-world tasks out of practical reach of RL methods. In this work, we use
demonstrations to overcome the exploration problem and successfully learn to
perform long-horizon, multi-step robotics tasks with continuous control such as
stacking blocks with a robot arm. Our method, which builds on top of Deep
Deterministic Policy Gradients and Hindsight Experience Replay, provides an
order of magnitude of speedup over RL on simulated robotics tasks. It is simple
to implement and makes only the additional assumption that we can collect a
small set of demonstrations. Furthermore, our method is able to solve tasks not
solvable by either RL or behavior cloning alone, and often ends up
outperforming the demonstrator policy.Comment: 8 pages, ICRA 201
Deep Q-learning from Demonstrations
Deep reinforcement learning (RL) has achieved several high profile successes
in difficult decision-making problems. However, these algorithms typically
require a huge amount of data before they reach reasonable performance. In
fact, their performance during learning can be extremely poor. This may be
acceptable for a simulator, but it severely limits the applicability of deep RL
to many real-world tasks, where the agent must learn in the real environment.
In this paper we study a setting where the agent may access data from previous
control of the system. We present an algorithm, Deep Q-learning from
Demonstrations (DQfD), that leverages small sets of demonstration data to
massively accelerate the learning process even from relatively small amounts of
demonstration data and is able to automatically assess the necessary ratio of
demonstration data while learning thanks to a prioritized replay mechanism.
DQfD works by combining temporal difference updates with supervised
classification of the demonstrator's actions. We show that DQfD has better
initial performance than Prioritized Dueling Double Deep Q-Networks (PDD DQN)
as it starts with better scores on the first million steps on 41 of 42 games
and on average it takes PDD DQN 83 million steps to catch up to DQfD's
performance. DQfD learns to out-perform the best demonstration given in 14 of
42 games. In addition, DQfD leverages human demonstrations to achieve
state-of-the-art results for 11 games. Finally, we show that DQfD performs
better than three related algorithms for incorporating demonstration data into
DQN.Comment: Published at AAAI 2018. Previously on arxiv as "Learning from
Demonstrations for Real World Reinforcement Learning
Pattern transfer learning for reinforcement learning in order dispatching
Order dispatch is one of the central problems to ridesharing platforms. Recently, value-based reinforcement learning algorithms have shown promising performance to solve this task. However, in real-world applications, the demand-supply system is typically nonstationary over time, posing challenges to reutilizing data generated in different time periods to learn the value function. In this work, motivated by the fact that the relative relationship between the values of some states is largely stable across various environments, we propose a pattern transfer learning framework for value-based reinforcement learning in the order dispatch problem. Our method efficiently captures the value patterns by incorporating a concordance penalty. The superior performance of the proposed method is supported by experiments
Adaptive Policy Learning for Offline-to-Online Reinforcement Learning
Conventional reinforcement learning (RL) needs an environment to collect
fresh data, which is impractical when online interactions are costly. Offline
RL provides an alternative solution by directly learning from the previously
collected dataset. However, it will yield unsatisfactory performance if the
quality of the offline datasets is poor. In this paper, we consider an
offline-to-online setting where the agent is first learned from the offline
dataset and then trained online, and propose a framework called Adaptive Policy
Learning for effectively taking advantage of offline and online data.
Specifically, we explicitly consider the difference between the online and
offline data and apply an adaptive update scheme accordingly, that is, a
pessimistic update strategy for the offline dataset and an optimistic/greedy
update scheme for the online dataset. Such a simple and effective method
provides a way to mix the offline and online RL and achieve the best of both
worlds. We further provide two detailed algorithms for implementing the
framework through embedding value or policy-based RL algorithms into it.
Finally, we conduct extensive experiments on popular continuous control tasks,
and results show that our algorithm can learn the expert policy with high
sample efficiency even when the quality of offline dataset is poor, e.g.,
random dataset.Comment: AAAI202