Learning with Options that Terminate Off-Policy
A temporally abstract action, or an option, is specified by a policy and a
termination condition: the policy guides option behavior, and the termination
condition roughly determines its length. Generally, learning with longer
options (like learning with multi-step returns) is known to be more efficient.
However, if the option set for the task is not ideal, and cannot express the
primitive optimal policy exactly, shorter options offer more flexibility and
can yield a better solution. Thus, the termination condition puts learning
efficiency at odds with solution quality. We propose to resolve this dilemma by
decoupling the behavior and target terminations, just like it is done with
policies in off-policy learning. To this end, we give a new algorithm,
Q(\beta), that learns the solution with respect to any termination condition,
regardless of how the options actually terminate. We derive Q(\beta) by casting
learning with options into a common framework with well-studied multi-step
off-policy learning. We validate our algorithm empirically, and show that it
holds up to its motivating claims. Comment: AAAI 2018
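As a rough illustration of how a termination condition enters an option-value backup, here is a minimal sketch. It is not the paper's Q(\beta) algorithm itself, which develops this into a multi-step off-policy method; it only shows a one-step target in which the target termination probability decides whether to continue with the current option's value or bootstrap from the best option, independently of how the executed option actually terminated. All names (Q, beta, options, gamma) are illustrative assumptions.

```python
# Hedged sketch, not the full Q(beta) algorithm: a one-step option-value
# backup in which the *target* termination probability beta controls
# bootstrapping, decoupled from how the behavior option terminated.

def option_backup_target(reward, next_state, option, Q, beta, options, gamma):
    """One-step target for Q[(state, option)].

    With probability beta(next_state, option) the option is treated as
    terminated and we bootstrap from the best available option; otherwise
    we continue with the current option's value.
    """
    continue_value = Q[(next_state, option)]
    terminate_value = max(Q[(next_state, o)] for o in options)
    b = beta(next_state, option)
    return reward + gamma * ((1.0 - b) * continue_value + b * terminate_value)
```

A larger target termination probability shortens the effective options used in the target, which is the flexibility-versus-efficiency trade-off the abstract describes.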
Multi-step Reinforcement Learning: A Unifying Algorithm
Unifying seemingly disparate algorithmic ideas to produce better performing
algorithms has been a longstanding goal in reinforcement learning. As a primary
example, TD(\lambda) elegantly unifies one-step TD prediction with Monte
Carlo methods through the use of eligibility traces and the trace-decay
parameter \lambda. Currently, there are a multitude of algorithms that can be
used to perform TD control, including Sarsa, Q-learning, and Expected Sarsa.
These methods are often studied in the one-step case, but they can be extended
across multiple time steps to achieve better performance. Each of these
algorithms is seemingly distinct, and no one dominates the others for all
problems. In this paper, we study a new multi-step action-value algorithm
called Q(\sigma) which unifies and generalizes these existing algorithms,
while subsuming them as special cases. A new parameter, \sigma, is introduced
to allow the degree of sampling performed by the algorithm at each step during
its backup to be continuously varied, with Sarsa existing at one extreme (full
sampling), and Expected Sarsa existing at the other (pure expectation).
Q(\sigma) is generally applicable to both on- and off-policy learning, but in
this work we focus on experiments in the on-policy case. Our results show that
an intermediate value of \sigma, which results in a mixture of the existing
algorithms, performs better than either extreme. The mixture can also be varied
dynamically, which can result in even greater performance. Comment: Appeared at the Thirty-Second AAAI Conference on Artificial
Intelligence (AAAI-18)
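To make the role of \sigma concrete, the following is a minimal sketch of the one-step backup that the multi-step algorithm composes, under assumed tabular-Q and target-policy interfaces (the function and variable names are illustrative, not from the paper): \sigma mixes the sampled next-action value (the Sarsa target) with the expectation over the target policy (the Expected Sarsa target).

```python
# Hedged sketch of a one-step Q(sigma) backup.  Assumptions: a tabular Q
# stored as a dict keyed by (state, action), a target policy
# pi(state) -> {action: probability}, and a discount factor gamma.
# sigma = 1 recovers the Sarsa target (full sampling);
# sigma = 0 recovers the Expected Sarsa target (pure expectation).

def q_sigma_target(reward, next_state, next_action, Q, pi, gamma, sigma):
    """Mix the sampled next-action value with the expectation under pi."""
    sampled = Q[(next_state, next_action)]
    expected = sum(p * Q[(next_state, a)] for a, p in pi(next_state).items())
    return reward + gamma * (sigma * sampled + (1.0 - sigma) * expected)


def q_sigma_update(Q, state, action, reward, next_state, next_action,
                   pi, gamma=0.99, sigma=0.5, alpha=0.1):
    """TD update of the visited (state, action) pair toward the mixed target."""
    target = q_sigma_target(reward, next_state, next_action, Q, pi, gamma, sigma)
    Q[(state, action)] += alpha * (target - Q[(state, action)])
    return Q
```

Choosing an intermediate \sigma, or varying it over time, interpolates between the two extremes at every backup, which is the mixture the experiments in the paper evaluate.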