Hindsight policy gradients
A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enable sample-efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this paper, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.
Comment: Accepted to ICLR 201
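The core idea of hindsight is that an episode collected while pursuing one goal still carries information about other goals, namely the ones the agent actually reached. A minimal sketch of this kind of goal relabeling is below; it is a simplified, HER-style illustration rather than the importance-weighted policy gradient estimator the paper itself derives, and the episode/transition layout is an assumption for the example.

```python
import random

def relabel_with_hindsight(episode):
    """Hindsight relabeling sketch: treat a goal that was actually
    achieved later in the episode as if it had been the intended goal,
    so a failed episode yields reward signal for a substitute goal.

    episode: list of (state, action, achieved_goal) tuples.
    Returns a list of (state, goal, action, reward) tuples.
    """
    relabeled = []
    for t, (state, action, achieved) in enumerate(episode):
        # Sample a substitute goal from goals achieved at or after step t.
        _, _, new_goal = random.choice(episode[t:])
        # Sparse reward: 1 only if this step achieved the substitute goal.
        reward = 1.0 if achieved == new_goal else 0.0
        relabeled.append((state, new_goal, action, reward))
    return relabeled
```

The relabeled transitions can then be fed to an ordinary goal-conditional learner, which is what makes the trick compatible with such a broad class of algorithms.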
Reinforcement Learning in Sparse-Reward Environments with Hindsight Policy Gradients
A reinforcement learning agent that needs to pursue different goals across episodes requires a goal-conditional policy. In addition to their potential to generalize desirable behavior to unseen goals, such policies may also enable higher-level planning based on subgoals. In sparse-reward environments, the capacity to exploit information about the degree to which an arbitrary goal has been achieved while another goal was intended appears crucial to enabling sample-efficient learning. However, reinforcement learning agents have only recently been endowed with such capacity for hindsight. In this letter, we demonstrate how hindsight can be introduced to policy gradient methods, generalizing this idea to a broad class of successful algorithms. Our experiments on a diverse selection of sparse-reward environments show that hindsight leads to a remarkable increase in sample efficiency.
Episodic self-imitation learning with hindsight
Episodic self-imitation learning, a novel self-imitation algorithm with a trajectory selection module and an adaptive loss function, is proposed to speed up reinforcement learning. Compared to the original self-imitation learning algorithm, which samples good state-action pairs from the experience replay buffer, our agent leverages entire episodes with hindsight to aid self-imitation learning. A selection module is introduced to filter uninformative samples from each episode of the update. The proposed method overcomes the limitations of the standard self-imitation learning algorithm, a transition-based method that performs poorly in handling continuous control environments with sparse rewards. From the experiments, episodic self-imitation learning is shown to perform better than baseline on-policy algorithms, achieving comparable performance to state-of-the-art off-policy algorithms in several simulated robot control tasks. The trajectory selection module is shown to prevent the agent from learning undesirable hindsight experiences. With the capability of solving sparse-reward problems in continuous control settings, episodic self-imitation learning has the potential to be applied to real-world problems that have continuous action spaces, such as robot guidance and manipulation.
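The abstract does not spell out the selection criterion, but the role of a trajectory selection module can be illustrated with a simple stand-in rule: keep an episode for imitation only if its return beats the agent's own value estimate of the starting state. The function name, episode layout, and threshold rule below are all assumptions made for this sketch, not the paper's actual module.

```python
def select_trajectories(episodes, value_fn):
    """Trajectory selection sketch: retain only episodes that are
    'informative' in the sense that their observed return exceeds the
    agent's current value estimate for the initial state, so the agent
    does not self-imitate behavior worse than what it already expects.

    episodes: list of episodes; each episode is a list of
              (state, action, reward) tuples.
    value_fn: callable mapping a state to an estimated value.
    """
    selected = []
    for episode in episodes:
        episode_return = sum(reward for (_, _, reward) in episode)
        start_state = episode[0][0]
        # Filter out episodes the agent should not imitate.
        if episode_return >= value_fn(start_state):
            selected.append(episode)
    return selected
```

Filtering whole episodes rather than individual transitions matches the episodic framing above: the surviving trajectories can then be replayed (with hindsight relabeling) as self-imitation targets.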