Folklore suggests that policy gradient can be more robust to misspecification
than its relative, approximate policy iteration. This paper studies the case of
state aggregation, where the state space is partitioned and either the policy
or the value function approximation is held constant over partitions. It shows
that a policy gradient method converges to a policy whose per-period regret is
bounded by ϵ, the largest difference between two elements of the
state-action value function belonging to a common partition. With the same
representation, both approximate policy iteration and approximate value
iteration can produce policies whose per-period regret scales as
ϵ/(1−γ), where γ is the discount factor. The theoretical results
synthesize recent analyses of policy gradient methods with the insights of Van
Roy (2006) into the critical role of state-relevance weights in approximate
dynamic programming.
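
To fix notation (the symbols below are assumptions for illustration, not taken from the abstract itself): if X_1, …, X_m denote the cells of the partition and Q the state-action value function, the approximation error described above can be written as

\[
\epsilon \;=\; \max_{1 \le j \le m} \;\; \max_{(s,a),\,(s',a') \in \mathcal{X}_j} \bigl| Q(s,a) - Q(s',a') \bigr|,
\]

so the contrast is between a per-period regret of at most \(\epsilon\) for the policy gradient method and a worst-case per-period regret on the order of \(\epsilon/(1-\gamma)\) for approximate policy iteration and approximate value iteration.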