Is Pessimism Provably Efficient for Offline RL?
We study offline reinforcement learning (RL), which aims to learn an optimal
policy based on a dataset collected a priori. Due to the lack of further
interactions with the environment, offline RL suffers from the insufficient
coverage of the dataset, which eludes most existing theoretical analysis. In
this paper, we propose a pessimistic variant of the value iteration algorithm
(PEVI), which incorporates an uncertainty quantifier as the penalty function.
Such a penalty function simply flips the sign of the bonus function for
promoting exploration in online RL, which makes it easily implementable and
compatible with general function approximators.
Without assuming the sufficient coverage of the dataset, we establish a
data-dependent upper bound on the suboptimality of PEVI for general Markov
decision processes (MDPs). When specialized to linear MDPs, it matches the
information-theoretic lower bound up to multiplicative factors of the dimension
and horizon. In other words, pessimism is not only provably efficient but also
minimax optimal. In particular, given the dataset, the learned policy serves as
the ``best effort'' among all policies, as no other policies can do better. Our
theoretical analysis identifies the critical role of pessimism in eliminating a
notion of spurious correlation, which emerges from the ``irrelevant''
trajectories that are less covered by the dataset and not informative for the
optimal policy.
Comment: 53 pages, 3 figures
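The idea of subtracting an uncertainty penalty inside value iteration can be illustrated with a minimal tabular sketch. This is not the paper's implementation: the count-based penalty `beta / sqrt(n)`, the function name `pevi_tabular`, the `(h, s, a, r, s_next)` dataset layout, and the assumption that rewards lie in [0, 1] are all illustrative choices made here.

```python
import math


def pevi_tabular(dataset, num_states, num_actions, horizon, beta=1.0):
    """Pessimistic value iteration on a finite-horizon tabular MDP (sketch).

    dataset: list of (h, s, a, r, s_next) with h in 0..horizon-1 and
    rewards assumed to lie in [0, 1]. The penalty beta / sqrt(n) is an
    assumed count-based uncertainty quantifier, not the paper's exact form.
    """
    H, S, A = horizon, num_states, num_actions
    counts = [[[0] * A for _ in range(S)] for _ in range(H)]
    reward_sums = [[[0.0] * A for _ in range(S)] for _ in range(H)]
    next_counts = [[[[0] * S for _ in range(A)] for _ in range(S)]
                   for _ in range(H)]

    # Empirical statistics from the fixed offline dataset.
    for h, s, a, r, s_next in dataset:
        counts[h][s][a] += 1
        reward_sums[h][s][a] += r
        next_counts[h][s][a][s_next] += 1

    V = [[0.0] * S for _ in range(H + 1)]  # V[H] is identically zero
    policy = [[0] * S for _ in range(H)]

    # Backward induction with the penalty subtracted from each Q estimate.
    for h in reversed(range(H)):
        for s in range(S):
            best_a, best_q = 0, -math.inf
            for a in range(A):
                n = counts[h][s][a]
                if n == 0:
                    q = 0.0  # unvisited pair: fall back to the lowest value
                else:
                    r_hat = reward_sums[h][s][a] / n
                    next_v = sum(next_counts[h][s][a][s2] * V[h + 1][s2]
                                 for s2 in range(S)) / n
                    penalty = beta / math.sqrt(n)
                    # Clip to the feasible value range [0, H - h].
                    q = min(max(r_hat + next_v - penalty, 0.0), H - h)
                if q > best_q:
                    best_a, best_q = a, q
            policy[h][s] = best_a
            V[h][s] = best_q
    return policy, V
```

With one state, two actions, and a horizon of one, an action observed 100 times with reward 0.6 is preferred over an action observed once with reward 1.0, because the single observation carries a much larger penalty.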
Bridging Offline Reinforcement Learning and Imitation Learning: A Tale of Pessimism
Offline (or batch) reinforcement learning (RL) algorithms seek to learn an
optimal policy from a fixed dataset without active data collection. Based on
the composition of the offline dataset, two main categories of methods are
used: imitation learning, which is suitable for expert datasets, and vanilla
offline RL, which often requires uniform coverage datasets. From a practical
standpoint, datasets often deviate from these two extremes and the exact data
composition is usually unknown a priori. To bridge this gap, we present a new
offline RL framework that smoothly interpolates between the two extremes of
data composition, hence unifying imitation learning and vanilla offline RL. The
new framework is centered around a weak version of the concentrability
coefficient that measures the deviation from the behavior policy to the expert
policy alone.
Under this new framework, we further investigate the question of algorithm
design: can one develop an algorithm that achieves a minimax optimal rate and
also adapts to unknown data composition? To address this question, we consider
a lower confidence bound (LCB) algorithm developed based on pessimism in the
face of uncertainty in offline RL. We study finite-sample properties of LCB as
well as information-theoretic limits in multi-armed bandits, contextual
bandits, and Markov decision processes (MDPs). Our analysis reveals surprising
facts about optimality rates. In particular, in all three settings, LCB
achieves a faster rate of $1/N$ for nearly-expert datasets compared to the
usual rate of $1/\sqrt{N}$ in offline RL, where $N$ is the number of samples in
the batch dataset. In the case of contextual bandits with at least two
contexts, we prove that LCB is adaptively optimal for the entire data
composition range, achieving a smooth transition from imitation learning to
offline RL. We further show that LCB is almost adaptively optimal in MDPs.
Comment: 84 pages, 6 figures
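For the multi-armed bandit setting, the lower confidence bound rule can be sketched as follows. This is only an illustration under stated assumptions: the Hoeffding-style penalty, the `delta` parameter, the function name `lcb_select_arm`, and the `(arm, reward)` dataset layout with rewards in [0, 1] are choices made here, not taken from the paper.

```python
import math


def lcb_select_arm(dataset, num_arms, delta=0.1):
    """Pick the arm with the largest lower confidence bound (sketch).

    dataset: list of (arm, reward) pairs with rewards assumed in [0, 1].
    The penalty sqrt(log(2 * num_arms / delta) / (2 * n)) is an assumed
    Hoeffding-style width, used here for illustration only.
    """
    counts = [0] * num_arms
    sums = [0.0] * num_arms
    for arm, reward in dataset:
        counts[arm] += 1
        sums[arm] += reward

    best_arm, best_lcb = 0, -math.inf
    for arm in range(num_arms):
        n = counts[arm]
        if n == 0:
            lcb = -math.inf  # never observed: treated as maximally uncertain
        else:
            mean = sums[arm] / n
            penalty = math.sqrt(math.log(2 * num_arms / delta) / (2 * n))
            lcb = mean - penalty
        if lcb > best_lcb:
            best_arm, best_lcb = arm, lcb
    return best_arm
```

On a dataset where arm 0 is pulled 100 times with mean reward 0.6 and arm 1 is pulled once with reward 1.0, choosing by empirical mean alone would select arm 1, while the penalized rule selects arm 0.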