Hindsight policy gradients
A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enable sample-efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this paper, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.
Comment: Accepted to ICLR 201
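The core idea of hindsight is that an episode collected while pursuing one goal still carries information about other goals, namely the ones the agent actually reached. A minimal sketch of this kind of goal relabeling is below; it is a simplified, HER-style illustration rather than the importance-weighted policy gradient estimator the paper itself derives, and the episode/transition layout is an assumption for the example.

```python
import random

def relabel_with_hindsight(episode):
    """Hindsight relabeling sketch: treat a goal that was actually
    achieved later in the episode as if it had been the intended goal,
    so a failed episode yields reward signal for a substitute goal.

    episode: list of (state, action, achieved_goal) tuples.
    Returns a list of (state, goal, action, reward) tuples.
    """
    relabeled = []
    for t, (state, action, achieved) in enumerate(episode):
        # Sample a substitute goal from goals achieved at or after step t.
        _, _, new_goal = random.choice(episode[t:])
        # Sparse reward: 1 only if this step achieved the substitute goal.
        reward = 1.0 if achieved == new_goal else 0.0
        relabeled.append((state, new_goal, action, reward))
    return relabeled
```

The relabeled transitions can then be fed to an ordinary goal-conditional learner, which is what makes the trick compatible with such a broad class of algorithms.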
Reinforcement Learning in Sparse-Reward Environments with Hindsight Policy Gradients
A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enabling sample-efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this letter, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.
Episodic self-imitation learning with hindsight
Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state-action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode of the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transition-based method that performs poorly in handling continuous control environments with sparse rewards. From the experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving comparable performance to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent from learning undesirable hindsight experiences. With the capability of solving sparse-reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems that have continuous action spaces, such as robot guidance and manipulation.
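The abstract does not spell out the selection criterion, but the role of a trajectory selection module can be illustrated with a simple stand-in rule: keep an episode for imitation only if its return beats the agent's own value estimate of the starting state. The function name, episode layout, and threshold rule below are all assumptions made for this sketch, not the paper's actual module.

```python
def select_trajectories(episodes, value_fn):
    """Trajectory selection sketch: retain only episodes that are
    'informative' in the sense that their observed return exceeds the
    agent's current value estimate for the initial state, so the agent
    does not self-imitate behavior worse than what it already expects.

    episodes: list of episodes; each episode is a list of
              (state, action, reward) tuples.
    value_fn: callable mapping a state to an estimated value.
    """
    selected = []
    for episode in episodes:
        episode_return = sum(reward for (_, _, reward) in episode)
        start_state = episode[0][0]
        # Filter out episodes the agent should not imitate.
        if episode_return >= value_fn(start_state):
            selected.append(episode)
    return selected
```

Filtering whole episodes rather than individual transitions matches the episodic framing above: the surviving trajectories can then be replayed (with hindsight relabeling) as self-imitation targets.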