This paper is concerned with offline reinforcement learning (RL), which
learns using pre-collected data without further exploration. Effective offline
RL would be able to accommodate distribution shift and limited data coverage.
However, prior algorithms or analyses either suffer from suboptimal sample
complexities or incur a high burn-in cost to reach sample optimality, thus posing
an impediment to efficient offline RL in sample-starved applications.
We demonstrate that the model-based (or "plug-in") approach achieves
minimax-optimal sample complexity without burn-in cost for tabular Markov
decision processes (MDPs). Concretely, consider a finite-horizon (resp.
$\gamma$-discounted infinite-horizon) MDP with $S$ states and horizon $H$
(resp. effective horizon $\frac{1}{1-\gamma}$), and suppose the distribution
shift of data is reflected by some single-policy clipped concentrability
coefficient $C^{\star}_{\mathsf{clipped}}$. We prove that model-based offline RL
yields $\varepsilon$-accuracy with a sample complexity of
\[
\begin{cases}
\dfrac{H^{4} S C^{\star}_{\mathsf{clipped}}}{\varepsilon^{2}} & \text{(finite-horizon MDPs)} \\[2ex]
\dfrac{S C^{\star}_{\mathsf{clipped}}}{(1-\gamma)^{3}\varepsilon^{2}} & \text{(infinite-horizon MDPs)}
\end{cases}
\]
up to log factor, which is minimax optimal for the entire $\varepsilon$-range. The proposed algorithms are
``pessimistic'' variants of value iteration with Bernstein-style penalties, and
do not require sophisticated variance reduction. Our analysis framework is
established upon delicate leave-one-out decoupling arguments in conjunction
with careful self-bounding techniques tailored to MDPs.
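For concreteness, the sketch below illustrates what a model-based (plug-in) pessimistic value-iteration update with a Bernstein-style penalty can look like in the finite-horizon tabular setting. It is an illustrative sketch rather than the paper's exact algorithm: the penalty constants `c1` and `c2`, the specific penalty form, the uniform fallback for unvisited state-action pairs, and all function and variable names are assumptions made here for illustration.

```python
import numpy as np

def pessimistic_value_iteration(counts, rewards, H, c1=1.0, c2=1.0):
    """Illustrative sketch of model-based pessimistic value iteration with a
    Bernstein-style penalty for a finite-horizon tabular MDP (not the paper's
    exact algorithm; constants and penalty form are assumptions).

    counts:  array (H, S, A, S); counts[h, s, a, s'] is the number of offline
             transitions observed from (s, a) at step h that land in s'.
    rewards: array (H, S, A) of known deterministic rewards in [0, 1].
    Returns a greedy policy (H, S) and pessimistic value estimates (H+1, S).
    """
    H_, S, A, _ = counts.shape
    assert H_ == H
    N = counts.sum(axis=-1)                      # visit counts N_h(s, a)
    # Empirical transition kernel; unvisited (s, a) pairs fall back to uniform.
    P_hat = np.where(N[..., None] > 0,
                     counts / np.maximum(N[..., None], 1),
                     1.0 / S)

    V = np.zeros((H + 1, S))                     # terminal values V_{H+1} = 0
    policy = np.zeros((H, S), dtype=int)

    for h in reversed(range(H)):
        # One-step lookahead under the empirical model.
        PV = P_hat[h] @ V[h + 1]                 # shape (S, A)
        # Bernstein-style penalty: empirical variance term plus a lower-order
        # 1/N term; unvisited pairs are maximally penalized via n = 1.
        var = np.maximum(P_hat[h] @ (V[h + 1] ** 2) - PV ** 2, 0.0)
        n = np.maximum(N[h], 1)
        penalty = c1 * np.sqrt(var / n) + c2 * H / n
        Q = np.clip(rewards[h] + PV - penalty, 0.0, H)   # pessimistic Q-estimate
        policy[h] = Q.argmax(axis=1)
        V[h] = Q.max(axis=1)
    return policy, V
```

The pessimism enters only through the subtracted penalty, which shrinks as the visit count of a state-action pair grows; no variance-reduction machinery is used, in line with the abstract's claim that sophisticated variance reduction is unnecessary.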