Folklore suggests that policy gradient can be more robust to misspecification
than its relative, approximate policy iteration. This paper studies the case of
state aggregation, where the state space is partitioned and either the policy
or the value function approximation is held constant over partitions. It shows
that a policy gradient method converges to a policy whose per-period regret is
bounded by ϵ, the largest difference between two elements of the
state-action value function belonging to a common partition. With the same
representation, both approximate policy iteration and approximate value
iteration can produce policies whose per-period regret scales as
ϵ/(1−γ), where γ is the discount factor. The theoretical results
synthesize recent analyses of policy gradient methods with the insights of Van
Roy (2006) into the critical role of state-relevance weights in approximate
dynamic programming.
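
To fix notation (the symbols below are assumptions for illustration, not taken from the abstract itself): if X_1, …, X_m denote the cells of the partition and Q the state-action value function, the approximation error described above can be written as

\[
\epsilon \;=\; \max_{1 \le j \le m} \;\; \max_{(s,a),\,(s',a') \in \mathcal{X}_j} \bigl| Q(s,a) - Q(s',a') \bigr|,
\]

so the contrast is between a per-period regret of at most \(\epsilon\) for the policy gradient method and a worst-case per-period regret on the order of \(\epsilon/(1-\gamma)\) for approximate policy iteration and approximate value iteration.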