14 research outputs found
Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning
Off-policy policy evaluation (OPE) is the problem of estimating the online performance of a policy using only pre-collected historical data generated by another policy. Given the increasing interest in deploying learning-based methods for safety-critical applications, many OPE methods have recently been proposed. Because experimental conditions vary widely across the recent literature, the relative performance of current OPE methods is not well understood. In this work, we present the first comprehensive empirical analysis of a broad suite of OPE methods. Based on thousands of experiments and detailed empirical analyses, we offer a summarized set of guidelines for effectively using OPE in practice, and suggest directions for future research.
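Since the abstract describes OPE only in general terms, the sketch below shows one of the simplest estimators in this family, ordinary per-trajectory importance sampling. The data layout, function names (target_logprob, behavior_logprob), and discount factor are illustrative assumptions for exposition, not a method or result from this study.

import numpy as np

def importance_sampling_ope(trajectories, target_logprob, behavior_logprob, gamma=0.99):
    """Ordinary importance sampling estimate of a target policy's return.

    `trajectories` is a list of trajectories, each a list of (state, action, reward)
    tuples logged under the behavior policy; `target_logprob(s, a)` and
    `behavior_logprob(s, a)` return log pi(a|s) for the target and behavior policies.
    """
    estimates = []
    for traj in trajectories:
        log_ratio = 0.0
        discounted_return = 0.0
        for t, (s, a, r) in enumerate(traj):
            log_ratio += target_logprob(s, a) - behavior_logprob(s, a)
            discounted_return += (gamma ** t) * r
        # Weight the observed return by the cumulative likelihood ratio.
        estimates.append(np.exp(log_ratio) * discounted_return)
    return float(np.mean(estimates))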
Supervised Off-Policy Ranking
Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy. Previous OPE methods mainly focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end goal of OPE is to compare two or more candidate policies and choose a good one, which is a much simpler task than evaluating their true performance; and (2) multiple policies have usually already been deployed in real-world systems, so their true performance is known from serving real users. Inspired by these two observations, we define a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of new/target policies via supervised learning by leveraging off-policy data and policies with known performance. We further propose a method for supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance rather than estimating their precise performance. Our method leverages logged states and policies to learn a Transformer-based model that maps offline interaction data, including logged states and the actions taken by a target policy on these states, to a score. Experiments on different games, datasets, training policy sets, and test policy sets show that our method outperforms strong baseline OPE methods in terms of both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies. Furthermore, our method is more stable than the baseline methods.
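The following is a minimal sketch of the general idea described above: score each policy from the actions it takes on logged states using a Transformer encoder, and train the scorer with a pairwise ranking loss on training policies whose relative performance is known. All class names, dimensions, and the specific loss are illustrative assumptions, not the authors' implementation.

import torch
import torch.nn as nn

class PolicyScorer(nn.Module):
    """Toy scoring model: encodes the (state, action) pairs a policy produces on
    logged states with a Transformer encoder and pools them into a scalar score."""

    def __init__(self, feat_dim: int, d_model: int = 64):
        super().__init__()
        self.embed = nn.Linear(feat_dim, d_model)
        layer = nn.TransformerEncoderLayer(d_model=d_model, nhead=4, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, num_layers=2)
        self.head = nn.Linear(d_model, 1)

    def forward(self, state_action: torch.Tensor) -> torch.Tensor:
        # state_action: (batch, num_logged_states, feat_dim)
        h = self.encoder(self.embed(state_action))
        return self.head(h.mean(dim=1)).squeeze(-1)  # one score per policy

def pairwise_ranking_loss(score_better: torch.Tensor, score_worse: torch.Tensor) -> torch.Tensor:
    """Encourage the policy with higher known performance to receive the higher score."""
    return torch.nn.functional.softplus(score_worse - score_better).mean()

At training time, pairs of training policies with a known performance ordering are scored and the ranking loss is backpropagated; at test time, the learned scores induce a ranking over the new/target policies.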
Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks
We study the statistical theory of offline reinforcement learning (RL) with
deep ReLU network function approximation. We analyze a variant of fitted-Q
iteration (FQI) algorithm under a new dynamic condition that we call Besov
dynamic closure, which encompasses the conditions from prior analyses for deep
neural network function approximation. Under Besov dynamic closure, we prove
that the FQI-type algorithm enjoys the sample complexity of
$\tilde{\mathcal{O}}\big( \kappa^{1 + d/\alpha} \cdot \epsilon^{-2 - 2d/\alpha} \big)$, where $\kappa$ is a distribution shift measure, $d$ is the
dimensionality of the state-action space, $\alpha$ is the (possibly fractional)
smoothness parameter of the underlying MDP, and $\epsilon$ is a user-specified
precision. This is an improvement over the sample complexity of
$\tilde{\mathcal{O}}\big( K \cdot \epsilon^{-2 - 2d/\alpha} \big)$ in the prior result [Yang et al., 2019], where $K$ is an
algorithmic iteration number which is arbitrarily large in practice.
Importantly, our sample complexity is obtained under the new general dynamic
condition and a data-dependent structure where the latter is either ignored in
prior algorithms or improperly handled by prior analyses. This is the first
comprehensive analysis for offline RL with deep ReLU network function
approximation under a general setting.
Comment: A short version was published in the ICML Workshop on Reinforcement
Learning Theory, 2021.
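To see the claimed improvement directly (treating the hidden logarithmic factors in the two $\tilde{\mathcal{O}}$ bounds above as comparable), divide the prior bound by the new one; the shared dependence on $\epsilon$ cancels:

\[
\frac{K \cdot \epsilon^{-2 - 2d/\alpha}}{\kappa^{1 + d/\alpha} \cdot \epsilon^{-2 - 2d/\alpha}} \;=\; \frac{K}{\kappa^{1 + d/\alpha}},
\]

so the gain is precisely the replacement of the arbitrarily large iteration count $K$ by the distribution-shift-dependent factor $\kappa^{1 + d/\alpha}$.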