1 Introduction

We assume a standard reinforcement learning setup: an agent interacts with an environment modeled as a partially observable Markov decision process. Consider the situation after a sequence of interactions. The agent has accumulated data and would like to use that data to select how it will act next. In particular, it has accumulated a sequence of observations, actions, and rewards, and it would like to select a policy, a mapping from observations to actions, for future interaction with the world. Ultimately, the goal of the agent is to find a policy that maximizes the agent's return, the sum of rewards experienced.
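The setup described above can be sketched in code. The following is an illustrative toy example only, not part of the paper: a hypothetical partially observable environment (`toy_env_step`), a policy mapping observations to actions, and a rollout that accumulates the return as the sum of rewards.

```python
def toy_env_step(state, action):
    """Hypothetical environment dynamics: reward 1.0 when the action
    matches the hidden state's parity, else 0.0. The agent observes
    only a coarse function of the state (partial observability)."""
    reward = 1.0 if action == state % 2 else 0.0
    next_state = state + 1
    observation = next_state % 3  # agent never sees the state itself
    return next_state, observation, reward

def policy(observation):
    """A policy: a (deterministic) mapping from observations to actions."""
    return observation % 2

def rollout(num_steps=5):
    """Interact for num_steps and accumulate the return (sum of rewards)."""
    state, obs = 0, 0
    total_return = 0.0
    for _ in range(num_steps):
        action = policy(obs)
        state, obs, reward = toy_env_step(state, action)
        total_return += reward
    return total_return
```

In this sketch the agent must act on the observation alone; selecting a better `policy` from accumulated (observation, action, reward) data is exactly the problem the paper addresses.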