DAC: The Double Actor-Critic Architecture for Learning Options
We reformulate the option framework as two parallel augmented MDPs. Under
this novel formulation, all policy optimization algorithms can be used off the
shelf to learn intra-option policies, option termination conditions, and a
master policy over options. We apply an actor-critic algorithm on each
augmented MDP, yielding the Double Actor-Critic (DAC) architecture.
Furthermore, we show that, when state-value functions are used as critics, one
critic can be expressed in terms of the other, and hence only one critic is
necessary. We conduct an empirical study on challenging robot simulation tasks.
In a transfer learning setting, DAC outperforms both its hierarchy-free
counterpart and previous gradient-based option learning algorithms.
Comment: NeurIPS 2019
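To make the two-augmented-MDP view concrete, here is a minimal tabular sketch (not the authors' implementation; the toy chain environment, the always-terminate simplification, and all names below are assumptions of this sketch). One TD error on a shared state-value critic over the augmented state (s, o) drives both the high-level actor (master policy over options) and the low-level actor (intra-option policies), illustrating the abstract's point that a single critic suffices.

import numpy as np

n_states, n_options, n_actions = 5, 2, 3
gamma, lr = 0.99, 0.1
rng = np.random.default_rng(0)

master_logits = np.zeros((n_states, n_options))            # high-MDP actor: master policy
intra_logits = np.zeros((n_options, n_states, n_actions))  # low-MDP actor: intra-option policies
V = np.zeros((n_states, n_options))                        # one shared critic on augmented state (s, o)

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def step_env(s, a):
    # Toy chain: action 0 moves left, 2 moves right, 1 stays; reward at the right end.
    s2 = min(n_states - 1, max(0, s + {0: -1, 1: 0, 2: 1}[int(a)]))
    return s2, float(s2 == n_states - 1)

s = 0
o = int(rng.integers(n_options))
for t in range(2000):
    pi = softmax(intra_logits[o, s])
    a = int(rng.choice(n_actions, p=pi))
    s2, r = step_env(s, a)
    # Simplification for brevity: options terminate every step, so the master
    # policy re-selects each step; full DAC folds termination into the high-MDP policy.
    mu = softmax(master_logits[s2])
    o2 = int(rng.choice(n_options, p=mu))
    td = r + gamma * V[s2, o2] - V[s, o]   # one TD error feeds both actor-critics
    V[s, o] += lr * td
    grad_lo = -pi
    grad_lo[a] += 1.0                      # grad of log-softmax w.r.t. logits
    intra_logits[o, s] += lr * td * grad_lo
    grad_hi = -mu
    grad_hi[o2] += 1.0
    master_logits[s2] += lr * td * grad_hi
    s, o = s2, o2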
Natural Option Critic
The recently proposed option-critic architecture (Bacon et al.) provides a
stochastic policy gradient approach to hierarchical reinforcement learning.
Specifically, they provide a way to estimate the gradient of the expected
discounted return with respect to parameters that define a finite number of
temporally extended actions, called \textit{options}. In this paper we show how
the option-critic architecture can be extended to estimate the natural gradient
of the expected discounted return. To this end, the central questions that we
consider in this paper are: 1) what is the definition of the natural gradient
in this context, 2) what is the Fisher information matrix associated with an
option's parameterized policy, 3) what is the Fisher information matrix
associated with an option's parameterized termination function, and 4) how can
a compatible function approximation approach be leveraged to obtain natural
gradient estimates for both the parameterized policy and parameterized
termination functions of an option with per-time-step time and space complexity
linear in the total number of parameters? Based on answers to these questions,
we introduce the natural option critic algorithm. Experimental results showcase
improvement over the vanilla gradient approach.
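For reference, the standard definitions that the abstract's questions 1)–4) specialize to options (a hedged recap of Amari's natural gradient and the compatible-critic argument of Kakade; the option-specific Fisher matrices are the paper's contribution and are not reproduced here):
\[
\tilde{\nabla}_{\theta} J(\theta) = F(\theta)^{-1} \nabla_{\theta} J(\theta),
\qquad
F(\theta) = \mathbb{E}\!\left[\nabla_{\theta} \log \pi_{\theta}(a \mid s)\, \nabla_{\theta} \log \pi_{\theta}(a \mid s)^{\top}\right].
\]
With a compatible critic $\hat{Q}_{w}(s, a) = w^{\top} \nabla_{\theta} \log \pi_{\theta}(a \mid s)$ fit by least squares, the minimizer $w^{*}$ satisfies $F(\theta)\, w^{*} = \nabla_{\theta} J(\theta)$, so $w^{*}$ is itself the natural gradient; this identity is what makes per-time-step time and space complexity linear in the number of parameters, since $F^{-1}$ never needs to be formed explicitly.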
When Waiting is not an Option: Learning Options with a Deliberation Cost
Recent work has shown that temporally extended actions (options) can be
learned fully end-to-end as opposed to being specified in advance. While the
problem of "how" to learn options is increasingly well understood, the question
of "what" good options should be has remained elusive. We formulate our answer
to what "good" options should be in the bounded rationality framework (Simon,
1957) through the notion of deliberation cost. We then derive practical
gradient-based learning algorithms to implement this objective. Our results in
the Arcade Learning Environment (ALE) show increased performance and
interpretability.
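As a concrete illustration, here is a minimal sketch of how a deliberation cost can enter the option-critic termination update (a sketch under assumptions: the margin form A + eta follows the regularized termination gradient; the function and variable names are ours, not the paper's):

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def termination_update(term_logit, q_next, v_next, eta, lr=0.1):
    """One update of a sigmoid termination parameter at the next state s'.

    q_next = Q(s', omega), v_next = V(s'); adding the deliberation cost eta to
    the advantage makes terminating look worse by a fixed margin, so the option
    is cut only when switching is worth at least eta.
    """
    beta = sigmoid(term_logit)           # current termination probability
    advantage = q_next - v_next          # A(s', omega)
    dbeta_dlogit = beta * (1.0 - beta)
    # Step that lowers beta when the current option is still good
    # (advantage + margin positive), i.e. deliberation is made costly.
    return term_logit - lr * dbeta_dlogit * (advantage + eta)

With eta = 0 this reduces to the plain option-critic termination update; a larger eta yields longer-lived options, consistent with the interpretability claim above.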
Efficient Deep Reinforcement Learning via Adaptive Policy Transfer
Transfer Learning (TL) has shown great potential to accelerate Reinforcement
Learning (RL) by leveraging prior knowledge from past learned policies of
relevant tasks. Existing transfer approaches either explicitly compute the
similarity between tasks or select appropriate source policies to provide
guided exploration for the target task. However, a method that directly
optimizes the target policy by selectively drawing on knowledge from
appropriate source policies, without explicitly measuring task similarity,
has been missing. In
this paper, we propose a novel Policy Transfer Framework (PTF) to accelerate RL
by taking advantage of this idea. Our framework learns which source policy is
best to reuse for the target policy, and when to terminate it, by modeling
multi-policy transfer as an option learning problem. PTF can be
easily combined with existing deep RL approaches. Experimental results show it
significantly accelerates the learning process and surpasses state-of-the-art
policy transfer methods in terms of learning efficiency and final performance
in both discrete and continuous action spaces.
Comment: Accepted by IJCAI'2020
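To make the "which and when" decisions concrete, a minimal sketch of the option-style reuse loop the abstract describes (illustrative only: the callables, names, and re-selection rule are assumptions of this sketch; PTF additionally trains the target policy with the selected source policy as guidance, which is omitted here):

import numpy as np

rng = np.random.default_rng(0)

def ptf_rollout(env_step, s0, source_policies, master_probs, term_prob, horizon=200):
    """Reuse source policies as options (illustrative sketch, not PTF's training code).

    source_policies: list of callables state -> action
    master_probs:    callable state -> probability vector over source policies
    term_prob:       callable (state, k) -> probability of dropping source k
    """
    s = s0
    k = int(rng.choice(len(source_policies), p=master_probs(s)))  # "which" to reuse
    trajectory = []
    for _ in range(horizon):
        a = source_policies[k](s)            # act with the reused source policy
        s2, r, done = env_step(s, a)
        trajectory.append((s, k, a, r))
        if done:
            break
        if rng.random() < term_prob(s2, k):  # "when" to stop reusing it
            k = int(rng.choice(len(source_policies), p=master_probs(s2)))
        s = s2
    return trajectory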