Settling the Sample Complexity of Model-Based Offline Reinforcement Learning

Abstract

This paper is concerned with offline reinforcement learning (RL), which learns using pre-collected data without further exploration. Effective offline RL would be able to accommodate distribution shift and limited data coverage. However, prior algorithms or analyses either suffer from suboptimal sample complexities or incur high burn-in cost to reach sample optimality, thus posing an impediment to efficient offline RL in sample-starved applications. We demonstrate that the model-based (or "plug-in") approach achieves minimax-optimal sample complexity without burn-in cost for tabular Markov decision processes (MDPs). Concretely, consider a finite-horizon (resp. $\gamma$-discounted infinite-horizon) MDP with $S$ states and horizon $H$ (resp. effective horizon $\frac{1}{1-\gamma}$), and suppose the distribution shift of the data is reflected by some single-policy clipped concentrability coefficient $C^{\star}_{\text{clipped}}$. We prove that model-based offline RL yields $\varepsilon$-accuracy with a sample complexity of
$$
\begin{cases}
\frac{H^{4} S C^{\star}_{\text{clipped}}}{\varepsilon^{2}} & \text{(finite-horizon MDPs)} \\
\frac{S C^{\star}_{\text{clipped}}}{(1-\gamma)^{3}\varepsilon^{2}} & \text{(infinite-horizon MDPs)}
\end{cases}
$$
up to log factors, which is minimax optimal for the entire $\varepsilon$-range. The proposed algorithms are ``pessimistic'' variants of value iteration with Bernstein-style penalties, and do not require sophisticated variance reduction. Our analysis framework is established upon delicate leave-one-out decoupling arguments in conjunction with careful self-bounding techniques tailored to MDPs.
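
For concreteness, the sketch below illustrates what a pessimistic (plug-in) value iteration with a Bernstein-style penalty might look like in the $\gamma$-discounted infinite-horizon setting, assuming known deterministic rewards in $[0,1]$ and an offline dataset summarized by transition counts. The function name, the penalty constants, and the exact form of the bonus are illustrative assumptions for exposition, not the paper's precise algorithm specification.

```python
# Illustrative sketch of pessimistic ("plug-in") value iteration for a
# gamma-discounted tabular MDP with a Bernstein-style penalty. Constants and
# the exact bonus form are assumptions, not the paper's exact specification.
import numpy as np

def pessimistic_value_iteration(counts, rewards, gamma, delta=0.05, num_iters=1000):
    """
    counts:  (S, A, S) array, N(s, a, s') transition counts from the offline dataset
    rewards: (S, A) array of known deterministic rewards in [0, 1]
    Returns a pessimistic Q-estimate and the greedy policy derived from it.
    """
    S, A, _ = counts.shape
    n_sa = counts.sum(axis=2)                      # N(s, a): visits to each pair
    # Empirical ("plug-in") transition model; unvisited pairs get a uniform row.
    p_hat = np.where(n_sa[..., None] > 0,
                     counts / np.maximum(n_sa[..., None], 1),
                     1.0 / S)
    log_term = np.log(S * A * num_iters / delta)

    q = np.zeros((S, A))
    for _ in range(num_iters):
        v = q.max(axis=1)                          # current value estimate
        ev = p_hat @ v                             # E_{s' ~ P_hat(s, a)}[V(s')]
        var = p_hat @ (v ** 2) - ev ** 2           # Var_{s' ~ P_hat(s, a)}[V(s')]
        # Bernstein-style penalty: variance-aware term plus a lower-order term.
        bonus = (np.sqrt(np.maximum(var, 0) * log_term / np.maximum(n_sa, 1))
                 + log_term / ((1 - gamma) * np.maximum(n_sa, 1)))
        bonus = np.where(n_sa > 0, bonus, 1.0 / (1 - gamma))
        # Pessimistic Bellman update, clipped to the valid value range.
        q = np.clip(rewards + gamma * ev - bonus, 0.0, 1.0 / (1 - gamma))
    return q, q.argmax(axis=1)
```

The pessimism enters solely through subtracting the data-dependent bonus before each Bellman update, which is why no sophisticated variance-reduction machinery is needed in this style of algorithm.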
