DAC: The Double Actor-Critic Architecture for Learning Options
We reformulate the option framework as two parallel augmented MDPs. Under
this novel formulation, all policy optimization algorithms can be used off the
shelf to learn intra-option policies, option termination conditions, and a
master policy over options. We apply an actor-critic algorithm on each
augmented MDP, yielding the Double Actor-Critic (DAC) architecture.
Furthermore, we show that, when state-value functions are used as critics, one
critic can be expressed in terms of the other, and hence only one critic is
necessary. We conduct an empirical study on challenging robot simulation tasks.
In a transfer learning setting, DAC outperforms both its hierarchy-free
counterpart and previous gradient-based option learning algorithms.
Comment: NeurIPS 2019
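To make the two-augmented-MDP view concrete, here is a minimal tabular sketch (not the authors' implementation; the toy chain environment, the always-terminate simplification, and all names below are assumptions of this sketch). One TD error on a shared state-value critic over the augmented state (s, o) drives both the high-level actor (master policy over options) and the low-level actor (intra-option policies), illustrating the abstract's point that a single critic suffices.

import numpy as np

n_states, n_options, n_actions = 5, 2, 3
gamma, lr = 0.99, 0.1
rng = np.random.default_rng(0)

master_logits = np.zeros((n_states, n_options))            # high-MDP actor: master policy
intra_logits = np.zeros((n_options, n_states, n_actions))  # low-MDP actor: intra-option policies
V = np.zeros((n_states, n_options))                        # one shared critic on augmented state (s, o)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step_env(s, a):
    # Toy chain: action 0 moves left, 2 moves right, 1 stays; reward at the right end.
    s2 = min(n_states - 1, max(0, s + {0: -1, 1: 0, 2: 1}[int(a)]))
    return s2, float(s2 == n_states - 1)

s = 0
o = int(rng.integers(n_options))
for t in range(2000):
    pi = softmax(intra_logits[o, s])
    a = int(rng.choice(n_actions, p=pi))
    s2, r = step_env(s, a)
    # Simplification for brevity: options terminate every step, so the master
    # policy re-selects each step; full DAC folds termination into the high-MDP policy.
    mu = softmax(master_logits[s2])
    o2 = int(rng.choice(n_options, p=mu))
    td = r + gamma * V[s2, o2] - V[s, o]   # one TD error feeds both actor-critics
    V[s, o] += lr * td
    grad_lo = -pi
    grad_lo[a] += 1.0                      # grad of log-softmax w.r.t. logits
    intra_logits[o, s] += lr * td * grad_lo
    grad_hi = -mu
    grad_hi[o2] += 1.0
    master_logits[s2] += lr * td * grad_hi
    s, o = s2, o2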
Natural Option Critic
The recently proposed option-critic architecture (Bacon et al.) provides a
stochastic policy gradient approach to hierarchical reinforcement learning.
Specifically, they provide a way to estimate the gradient of the expected
discounted return with respect to parameters that define a finite number of
temporally extended actions, called \textit{options}. In this paper we show how
the option-critic architecture can be extended to estimate the natural gradient
of the expected discounted return. To this end, the central questions that we
consider in this paper are: 1) what is the definition of the natural gradient
in this context, 2) what is the Fisher information matrix associated with an
option's parameterized policy, 3) what is the Fisher information matrix
associated with an option's parameterized termination function, and 4) how can
a compatible function approximation approach be leveraged to obtain natural
gradient estimates for both the parameterized policy and parameterized
termination functions of an option with per-time-step time and space complexity
linear in the total number of parameters? Based on answers to these questions,
we introduce the natural option critic algorithm. Experimental results showcase
improvement over the vanilla gradient approach.
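For reference, the standard definitions that the abstract's questions 1)–4) specialize to options (a hedged recap of Amari's natural gradient and the compatible-critic argument of Kakade; the option-specific Fisher matrices are the paper's contribution and are not reproduced here):
\[
\tilde{\nabla}_{\theta} J(\theta) = F(\theta)^{-1} \nabla_{\theta} J(\theta),
\qquad
F(\theta) = \mathbb{E}\!\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \nabla_{\theta} \log \pi_{\theta}(a \mid s)^{\top}\right].
\]
With a compatible critic $\hat{Q}_{w}(s, a) = w^{\top} \nabla_{\theta} \log \pi_{\theta}(a \mid s)$ fit by least squares, the minimizer $w^{*}$ satisfies $F(\theta)\, w^{*} = \nabla_{\theta} J(\theta)$, so $w^{*}$ is itself the natural gradient; this identity is what makes per-time-step time and space complexity linear in the number of parameters, since $F^{-1}$ never needs to be formed explicitly.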
When Waiting is not an Option: Learning Options with a Deliberation Cost
Recent work has shown that temporally extended actions (options) can be
learned fully end-to-end as opposed to being specified in advance. While the
problem of "how" to learn options is increasingly well understood, the question
of "what" good options should be has remained elusive. We formulate our answer
to what "good" options should be in the bounded rationality framework (Simon,
1957) through the notion of deliberation cost. We then derive practical
gradient-based learning algorithms to implement this objective. Our results in
the Arcade Learning Environment (ALE) show increased performance and
interpretability.
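As a concrete illustration, here is a minimal sketch of how a deliberation cost can enter the option-critic termination update (a sketch under assumptions: the margin form A + eta follows the regularized termination gradient; the function and variable names are ours, not the paper's):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def termination_update(term_logit, q_next, v_next, eta, lr=0.1):
    """One update of a sigmoid termination parameter at the next state s'.

    q_next = Q(s', omega), v_next = V(s'); adding the deliberation cost eta to
    the advantage makes terminating look worse by a fixed margin, so the option
    is cut only when switching is worth at least eta.
    """
    beta = sigmoid(term_logit)           # current termination probability
    advantage = q_next - v_next          # A(s', omega)
    dbeta_dlogit = beta * (1.0 - beta)
    # Step that lowers beta when the current option is still good
    # (advantage + margin positive), i.e. deliberation is made costly.
    return term_logit - lr * dbeta_dlogit * (advantage + eta)

With eta = 0 this reduces to the plain option-critic termination update; a larger eta yields longer-lived options, consistent with the interpretability claim above.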
Efficient Deep Reinforcement Learning via Adaptive Policy Transfer
Transfer Learning (TL) has shown great potential to accelerate Reinforcement
Learning (RL) by leveraging prior knowledge from past learned policies of
relevant tasks. Existing transfer approaches either explicitly compute the
similarity between tasks or select appropriate source policies to provide
guided exploration for the target task. However, a method that directly
optimizes the target policy by selectively drawing on knowledge from
appropriate source policies, without explicitly measuring task similarity,
has been missing. In
this paper, we propose a novel Policy Transfer Framework (PTF) to accelerate RL
by taking advantage of this idea. Our framework learns which source policy is
best to reuse for the target policy, and when to terminate it, by modeling
multi-policy transfer as an option learning problem. PTF can be
easily combined with existing deep RL approaches. Experimental results show it
significantly accelerates the learning process and surpasses state-of-the-art
policy transfer methods in terms of learning efficiency and final performance
in both discrete and continuous action spaces.
Comment: Accepted by IJCAI'2020
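To make the "which and when" decisions concrete, a minimal sketch of the option-style reuse loop the abstract describes (illustrative only: the callables, names, and re-selection rule are assumptions of this sketch; PTF additionally trains the target policy with the selected source policy as guidance, which is omitted here):

import numpy as np

rng = np.random.default_rng(0)

def ptf_rollout(env_step, s0, source_policies, master_probs, term_prob, horizon=200):
    """Reuse source policies as options (illustrative sketch, not PTF's training code).

    source_policies: list of callables state -> action
    master_probs:    callable state -> probability vector over source policies
    term_prob:       callable (state, k) -> probability of dropping source k
    """
    s = s0
    k = int(rng.choice(len(source_policies), p=master_probs(s)))  # "which" to reuse
    trajectory = []
    for _ in range(horizon):
        a = source_policies[k](s)            # act with the reused source policy
        s2, r, done = env_step(s, a)
        trajectory.append((s, k, a, r))
        if done:
            break
        if rng.random() < term_prob(s2, k):  # "when" to stop reusing it
            k = int(rng.choice(len(source_policies), p=master_probs(s2)))
        s = s2
    return trajectory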