Learning with Options that Terminate Off-Policy
A temporally abstract action, or an option, is specified by a policy and a
termination condition: the policy guides option behavior, and the termination
condition roughly determines its length. Generally, learning with longer
options (like learning with multi-step returns) is known to be more efficient.
However, if the option set for the task is not ideal, and cannot express the
primitive optimal policy exactly, shorter options offer more flexibility and
can yield a better solution. Thus, the termination condition puts learning
efficiency at odds with solution quality. We propose to resolve this dilemma by
decoupling the behavior and target terminations, just as is done with policies
in off-policy learning. To this end, we give a new algorithm, Q(β), that learns
the solution with respect to any termination condition, regardless of how the
options actually terminate. We derive Q(β) by casting learning with options
into a common framework with well-studied multi-step off-policy learning. We
validate our algorithm empirically, and show that it holds up to its motivating
claims.

Comment: AAAI 201
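A minimal sketch of the decoupling idea, in a one-step tabular simplification (the paper's algorithm is multi-step; the function and variable names below are illustrative assumptions, not the authors' notation):

from collections import defaultdict

# One-step, tabular sketch of the Q(beta) idea: back up option values with
# respect to a *target* termination probability beta_target, regardless of
# whether the behavior actually terminated its option at this step.

def q_beta_update(Q, option_set, s, o, r, s_next, beta_target,
                  alpha=0.1, gamma=0.99):
    """Q: defaultdict(float) over (state, option) pairs."""
    continue_val = Q[(s_next, o)]                           # option persists
    switch_val = max(Q[(s_next, o2)] for o2 in option_set)  # option ends, re-pick
    target = r + gamma * ((1.0 - beta_target) * continue_val
                          + beta_target * switch_val)
    Q[(s, o)] += alpha * (target - Q[(s, o)])

# Usage: Q = defaultdict(float)
#        q_beta_update(Q, ["o1", "o2"], "s0", "o1", 1.0, "s1", beta_target=0.3)

Setting beta_target to 1 everywhere recovers an ordinary one-step backup that re-selects an option at every state, while beta_target = 0 keeps the option going in the target; intermediate values interpolate, independently of how the behavior terminates.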
Multi-agent Hierarchical Reinforcement Learning with Dynamic Termination
In a multi-agent system, an agent's optimal policy will typically depend on
the policies chosen by others. Therefore, a key issue in multi-agent systems
research is that of predicting the behaviours of others, and responding
promptly to changes in such behaviours. One obvious possibility is for each
agent to broadcast its current intention, for example, the currently executed
option in a hierarchical reinforcement learning framework. However, this
approach results in inflexibility when options have an extended duration and
the environment is dynamic. While adjusting the executed option at each step
improves flexibility from a single-agent perspective, frequent changes in
options can induce inconsistency between an agent's actual behaviour and its
broadcast intention. In order to balance flexibility and predictability, we
propose a dynamic termination Bellman equation that allows the agents to
flexibly terminate their options. We evaluate our model empirically on a set of
multi-agent pursuit and taxi tasks, and show that our agents learn to adapt
flexibly across scenarios that require different termination behaviours.

Comment: PRICAI 201
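A hedged sketch of the balance described above: the agent may terminate its option at any step, but pays a small switching cost so that its behaviour stays consistent with its broadcast intention. The interface and the cost term delta are illustrative assumptions, not the paper's exact formulation:

from collections import defaultdict

# Dynamic-termination backup: continue the current option, or terminate and
# re-select at a small cost `delta` that discourages erratic switching.

def dynamic_termination_backup(Q, option_set, s_next, o, r,
                               gamma=0.99, delta=0.1):
    """Q: defaultdict(float) over (state, option) pairs."""
    continue_val = Q[(s_next, o)]                               # keep option
    switch_val = (max(Q[(s_next, o2)] for o2 in option_set)
                  - delta)                                      # switching cost
    return r + gamma * max(continue_val, switch_val)

Larger delta makes agents more predictable (fewer terminations), while delta = 0 recovers fully flexible per-step re-selection.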
Composing Diverse Policies for Temporally Extended Tasks
Robot control policies for temporally extended and sequenced tasks are often
characterized by discontinuous switches between different local dynamics. These
change-points are often exploited in hierarchical motion planning to build
approximate models and to facilitate the design of local, region-specific
controllers. However, it becomes combinatorially challenging to implement such
a pipeline for complex temporally extended tasks, especially when the
sub-controllers work on different information streams, time scales and action
spaces. In this paper, we introduce a method that can compose diverse policies
comprising motion planning trajectories, dynamic motion primitives and neural
network controllers. We introduce a global goal scoring estimator that uses
local, per-motion primitive dynamics models and corresponding activation
state-space sets to sequence diverse policies in a locally optimal fashion. We
use expert demonstrations to convert what is typically viewed as a
gradient-based learning process into a planning process without explicitly
specifying pre- and post-conditions. We first illustrate the proposed framework
using an MDP benchmark to showcase robustness to action and model dynamics
mismatch, and then with a particularly complex physical gear assembly task,
solved on a PR2 robot. We show that the proposed approach successfully
discovers the optimal sequence of controllers and solves both tasks
efficiently.

Comment: arXiv admin note: substantial text overlap with arXiv:1906.1009
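A hedged sketch of the sequencing step: each candidate controller exposes an activation set (the states where it applies) and a local dynamics model that predicts where executing it from the current state would end up; the goal scoring estimator then picks the applicable controller whose predicted outcome lands closest to the goal. The Policy interface (active, predict_outcome) is an assumption for illustration:

import numpy as np

def select_next_policy(state, goal, policies):
    """Pick the active controller whose local model predicts most progress."""
    # Restrict to controllers whose activation state-space set contains
    # the current state.
    candidates = [p for p in policies if p.active(state)]
    if not candidates:
        raise ValueError("no controller is active in the current state")
    def score(p):
        predicted = p.predict_outcome(state)   # local, per-primitive model
        return -np.linalg.norm(np.asarray(predicted) - np.asarray(goal))
    return max(candidates, key=score)

Because the score only needs each primitive's local model, motion-planning trajectories, dynamic motion primitives and neural network controllers can all be compared under a single criterion.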
CRISP: Curriculum inducing Primitive Informed Subgoal Prediction
Hierarchical reinforcement learning is a promising approach that uses
temporal abstraction to solve complex long horizon problems. However,
simultaneously learning a hierarchy of policies is unstable, as it is
challenging to train the higher-level policy while the lower-level primitive is
non-stationary. In this paper, we propose a novel hierarchical algorithm, CRISP,
to generate a curriculum of achievable subgoals for evolving lower-level
primitives using reinforcement learning and imitation learning. The lower-level
primitive periodically performs data relabeling on a handful of expert
demonstrations using our primitive-informed parsing approach to handle
non-stationarity. Since our approach uses only a handful of expert
it is suitable for most robotic control tasks. Experimental evaluations on
complex robotic maze navigation and robotic manipulation environments show that
inducing hierarchical curriculum learning significantly improves sample
efficiency and results in efficient goal-conditioned policies for solving
temporally extended tasks. We perform real world robotic experiments on complex
manipulation tasks and demonstrate that CRISP consistently outperforms the
baselines.
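A hedged sketch of what primitive-informed parsing could look like: walk along an expert demonstration and emit as the next subgoal the furthest state the current lower-level primitive can still reliably reach, so the curriculum of subgoals tightens or relaxes as the primitive evolves. primitive_can_reach is an assumed oracle (e.g. a success-rate or value estimate for the current primitive), not the paper's API:

def parse_demonstration(demo_states, primitive_can_reach):
    """demo_states: one expert trajectory; primitive_can_reach(s, g) -> bool."""
    subgoals, anchor, prev = [], demo_states[0], demo_states[0]
    for s in demo_states[1:]:
        if not primitive_can_reach(anchor, s):
            subgoals.append(prev)   # furthest state still reachable from anchor
            anchor = prev
        prev = s
    subgoals.append(demo_states[-1])  # final demonstration state as last goal
    return subgoals

Re-running this parse periodically, as the abstract describes, relabels the same handful of demonstrations into progressively harder subgoals as the lower-level primitive improves.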