Reducing reinforcement learning to supervised learning is a well-studied and
effective approach that leverages the benefits of compact function
approximation to deal with large-scale Markov decision processes.
Independently, the boosting methodology (e.g., AdaBoost) has proven to be
indispensable in designing efficient and accurate classification algorithms by
combining inaccurate rules-of-thumb.
In this paper, we take a further step: we reduce reinforcement learning to a
sequence of weak learning problems. Since weak learners perform only marginally
better than random guesses, such subroutines constitute a weaker assumption
than the availability of an accurate supervised learning oracle. We prove that
the sample complexity and running time bounds of the proposed method do not
explicitly depend on the number of states.
While existing results on boosting operate on convex losses, the value
function over policies is non-convex. We show how to use a non-convex variant
of the Frank-Wolfe method for boosting, which additionally improves upon the
known sample complexity and running time even for reductions to supervised
learning.
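To make the flavor of the reduction concrete, here is a minimal illustrative sketch (not the paper's actual algorithm) of boosting via a Frank-Wolfe-style update: in each round, a weak-learning oracle plays the role of the linear minimization oracle, and the ensemble moves toward the weak hypothesis best aligned with the negative functional gradient. All names (`stump_oracle`, `frank_wolfe_boost`), the decision-stump oracle, and the squared loss are assumptions for illustration; the paper's objective, the non-convex value function over policies, is not reproduced here.

```python
import numpy as np

def stump_oracle(X, target):
    """Hypothetical weak learner: a decision stump only loosely
    aligned with `target` (here, the negative functional gradient)."""
    best_corr, best = -np.inf, None
    for j in range(X.shape[1]):
        for thr in np.unique(X[:, j]):
            for sign in (1.0, -1.0):
                pred = sign * np.where(X[:, j] > thr, 1.0, -1.0)
                corr = pred @ target  # correlation with the target direction
                if corr > best_corr:
                    best_corr, best = corr, (j, thr, sign)
    j, thr, sign = best
    return lambda Z: sign * np.where(Z[:, j] > thr, 1.0, -1.0)

def frank_wolfe_boost(X, y, T=30):
    """Frank-Wolfe-style boosting sketch: each round moves the ensemble
    toward the weak hypothesis most correlated with the negative gradient
    of the loss (squared loss here, purely for illustration)."""
    F = np.zeros(len(y))               # current ensemble predictions
    for t in range(T):
        grad = F - y                   # gradient of 0.5 * ||F - y||^2 at F
        h = stump_oracle(X, -grad)     # weak learner as the LMO
        gamma = 2.0 / (t + 2)          # classic FW step size (illustrative)
        F = (1.0 - gamma) * F + gamma * h(X)
    return F

# Toy usage: recover a linear-ish labeling from stumps.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
y = np.sign(X[:, 0] + 0.5 * X[:, 1])   # binary labels in {-1, +1}
F = frank_wolfe_boost(X, y)
print("training accuracy:", np.mean(np.sign(F) == y))
```

The convex-combination update keeps the ensemble inside the convex hull of weak hypotheses; for a non-convex objective such as the policy value, the same oracle-based update can still be analyzed, which is the variant the paper develops.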