Bad Habits: Policy Confounding and Out-of-Trajectory Generalization in RL

Oliehoek, Frans A.; Spaan, Matthijs T. J.; Suau, Miguel

Authors: Frans A. Oliehoek, Matthijs T. J. Spaan, Miguel Suau
Publication date: 4 June 2023

Abstract

Reinforcement learning agents may sometimes develop habits that are effective only when specific policies are followed. After an initial exploration phase in which agents try out different actions, they eventually converge toward a particular policy. When this occurs, the distribution of state-action trajectories becomes narrower, and agents start experiencing the same transitions again and again. At this point, spurious correlations may arise. Agents may then pick up on these correlations and learn state representations that do not generalize beyond the agent's trajectory distribution. In this paper, we provide a mathematical characterization of this phenomenon, which we refer to as policy confounding, and show, through a series of examples, when and how it occurs in practice.

Available Versions

arXiv.org e-Print Archive

oai:arXiv.org:2306.02419

Last time updated on 08/06/2023