When is Agnostic Reinforcement Learning Statistically Tractable?
We study the problem of agnostic PAC reinforcement learning (RL): given a
policy class $\Pi$, how many rounds of interaction with an unknown MDP (with a
potentially large state and action space) are required to learn an
$\epsilon$-suboptimal policy with respect to $\Pi$? Towards that end, we
introduce a new complexity measure, called the \emph{spanning capacity}, which
depends solely on the set $\Pi$ and is independent of the MDP dynamics. With a
generative model, we show that for any policy class $\Pi$, bounded spanning
capacity characterizes PAC learnability. However, for online RL, the situation
is more subtle. We show that there exists a policy class $\Pi$ with bounded
spanning capacity that requires a superpolynomial number of samples to learn.
This reveals a surprising separation for agnostic learnability between
generative access and online access models (as well as between
deterministic/stochastic MDPs under online access). On the positive side, we
identify an additional \emph{sunflower} structure, which in conjunction with
bounded spanning capacity enables statistically efficient online RL via a new
algorithm called POPLER, which takes inspiration from classical importance
sampling methods as well as techniques for reachable-state identification and
policy evaluation in reward-free exploration.

Comment: Accepted to NeurIPS 2023
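To give a concrete (if simplified) sense of what a dynamics-independent complexity notion for a policy class can look like, the Python sketch below counts, in a small fixed deterministic MDP, the (state, action, step) triples that some policy in the class can reach. This is only an illustrative toy in the spirit of the spanning capacity, under assumed simplifications (tabular deterministic dynamics, exhaustive enumeration of deterministic Markov policies); it is not the paper's formal definition.

    # Toy sketch (assumed simplification, not the paper's formal definition):
    # for a small deterministic MDP and a finite policy class, count the
    # (state, action, step) triples that some policy in the class reaches.
    from itertools import product

    H = 2                      # horizon
    states = ["s0", "s1", "s2"]
    actions = [0, 1]

    # Deterministic dynamics: (state, action) -> next state.
    transition = {
        ("s0", 0): "s1", ("s0", 1): "s2",
        ("s1", 0): "s1", ("s1", 1): "s2",
        ("s2", 0): "s2", ("s2", 1): "s1",
    }

    def all_policies():
        """Enumerate every deterministic Markov policy (state, step) -> action."""
        keys = [(s, h) for s in states for h in range(H)]
        for choice in product(actions, repeat=len(keys)):
            yield dict(zip(keys, choice))

    def reachable_triples(policy, start="s0"):
        """(state, action, step) triples visited when running `policy` from `start`."""
        s, visited = start, set()
        for h in range(H):
            a = policy[(s, h)]
            visited.add((s, a, h))
            s = transition[(s, a)]
        return visited

    # Union over the class: how many distinct (state, action, step) triples
    # can *some* policy in the class reach?  A class that "spans" more of a
    # deterministic MDP is, intuitively, harder to evaluate uniformly.
    policy_class = list(all_policies())
    covered = set().union(*(reachable_triples(pi) for pi in policy_class))
    print(len(covered), "reachable (state, action, step) triples")

Restricting policy_class to a smaller subset of policies shrinks the covered set, which is the intuition behind why a small, dynamics-independent quantity of this flavor can control sample complexity with generative access.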