
    Empirical Study of Off-Policy Policy Evaluation for Reinforcement Learning

    Off-policy policy evaluation (OPE) is the problem of estimating the online performance of a policy using only pre-collected historical data generated by another policy. Given the increasing interest in deploying learning-based methods in safety-critical applications, many OPE methods have recently been proposed. Due to the disparate experimental conditions used across the recent literature, the relative performance of current OPE methods is not well understood. In this work, we present the first comprehensive empirical analysis of a broad suite of OPE methods. Based on thousands of experiments and detailed empirical analyses, we offer a summarized set of guidelines for effectively using OPE in practice, and suggest directions for future research.
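
    For orientation, below is a minimal sketch of one classical OPE baseline that an empirical study like this typically includes, per-trajectory importance sampling; the function names, trajectory format, and policy-probability interface are illustrative assumptions, not the paper's benchmark code:

    # Illustrative sketch only: per-trajectory importance sampling OPE.
    # The data layout and callables below are assumptions for this example.
    import numpy as np

    def importance_sampling_ope(trajectories, behavior_prob, target_prob, gamma=0.99):
        """Estimate the target policy's value from behavior-policy trajectories.

        trajectories: list of episodes, each a list of (state, action, reward) tuples.
        behavior_prob(s, a): probability the behavior policy assigns to action a in state s.
        target_prob(s, a): probability the target policy assigns to action a in state s.
        """
        estimates = []
        for episode in trajectories:
            weight = 1.0   # cumulative importance ratio for this episode
            ret = 0.0      # discounted return of this episode
            for t, (s, a, r) in enumerate(episode):
                weight *= target_prob(s, a) / behavior_prob(s, a)
                ret += (gamma ** t) * r
            estimates.append(weight * ret)
        return float(np.mean(estimates))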

    Supervised Off-Policy Ranking

    Off-policy evaluation (OPE) leverages data generated by other policies to evaluate a target policy. Previous OPE methods mainly focus on precisely estimating the true performance of a policy. We observe that in many applications, (1) the end goal of OPE is to compare two or more candidate policies and choose a good one, which is a much simpler task than evaluating their true performance; and (2) there are usually multiple policies that have already been deployed in real-world systems and whose true performance is therefore known from serving real users. Inspired by these two observations, we define a new problem, supervised off-policy ranking (SOPR), which aims to rank a set of new/target policies via supervised learning, leveraging off-policy data together with policies of known performance. We further propose a method for supervised off-policy ranking that learns a policy scoring model by correctly ranking training policies with known performance rather than estimating their precise performance. Our method leverages logged states and policies to learn a Transformer-based model that maps offline interaction data, consisting of logged states and the actions a target policy takes on those states, to a score. Experiments on different games, datasets, training policy sets, and test policy sets show that our method outperforms strong baseline OPE methods in terms of both rank correlation and the performance gap between the truly best policy and the best of the top three ranked policies. Furthermore, our method is more stable than the baseline methods.
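
    A hedged sketch of the training idea the abstract describes, scoring policies and fitting the scorer to order policies of known performance correctly, follows; the stand-in MLP scorer, its feature interface, and the pairwise hinge loss are illustrative assumptions, not the paper's Transformer model:

    # Illustrative sketch only: a policy scoring model trained with a pairwise
    # ranking loss on policies whose true returns are known.
    import torch
    import torch.nn as nn

    class PolicyScorer(nn.Module):
        def __init__(self, feature_dim, hidden_dim=64):
            super().__init__()
            self.net = nn.Sequential(
                nn.Linear(feature_dim, hidden_dim), nn.ReLU(), nn.Linear(hidden_dim, 1)
            )

        def forward(self, policy_features):          # (batch, feature_dim) -> (batch,)
            return self.net(policy_features).squeeze(-1)

    def pairwise_ranking_loss(scores, true_returns, margin=0.1):
        """Hinge loss penalizing policy pairs whose scores disagree with the true ordering."""
        s_i, s_j = scores.unsqueeze(1), scores.unsqueeze(0)
        r_i, r_j = true_returns.unsqueeze(1), true_returns.unsqueeze(0)
        sign = torch.sign(r_i - r_j)                 # +1 where policy i is truly better than j
        loss = torch.relu(margin - sign * (s_i - s_j))
        return loss[sign != 0].mean()

    Training the scorer on ranking targets rather than return regression reflects the observation above: only the ordering of candidate policies needs to be right, not their exact values.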

    Sample Complexity of Offline Reinforcement Learning with Deep ReLU Networks

    We study the statistical theory of offline reinforcement learning (RL) with deep ReLU network function approximation. We analyze a variant of the fitted-Q iteration (FQI) algorithm under a new dynamic condition that we call Besov dynamic closure, which encompasses the conditions from prior analyses for deep neural network function approximation. Under Besov dynamic closure, we prove that the FQI-type algorithm enjoys a sample complexity of $\tilde{\mathcal{O}}\left( \kappa^{1 + d/\alpha} \cdot \epsilon^{-2 - 2d/\alpha} \right)$, where $\kappa$ is a distribution shift measure, $d$ is the dimensionality of the state-action space, $\alpha$ is the (possibly fractional) smoothness parameter of the underlying MDP, and $\epsilon$ is a user-specified precision. This is an improvement over the sample complexity of $\tilde{\mathcal{O}}\left( K \cdot \kappa^{2 + d/\alpha} \cdot \epsilon^{-2 - d/\alpha} \right)$ in the prior result [Yang et al., 2019], where $K$ is an algorithmic iteration number that can be arbitrarily large in practice. Importantly, our sample complexity is obtained under the new general dynamic condition and a data-dependent structure, where the latter is either ignored in prior algorithms or improperly handled by prior analyses. This is the first comprehensive analysis for offline RL with deep ReLU network function approximation under a general setting.
    Comment: A short version published in the ICML Workshop on Reinforcement Learning Theory, 202
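
    For context, here is a minimal sketch of a generic fitted-Q iteration loop with a ReLU network, the algorithm family this analysis covers; the dataset layout, network width, and hyperparameters are assumptions, and this is not the paper's algorithm or experimental setup:

    # Illustrative sketch only: generic FQI with a ReLU Q-network on an offline dataset.
    import copy
    import torch
    import torch.nn as nn

    def fitted_q_iteration(dataset, state_dim, num_actions, iterations=50, gamma=0.99, epochs=20):
        """dataset: tensors (states, actions, rewards, next_states) collected offline."""
        q_net = nn.Sequential(nn.Linear(state_dim, 256), nn.ReLU(), nn.Linear(256, num_actions))
        states, actions, rewards, next_states = dataset
        for _ in range(iterations):
            target_net = copy.deepcopy(q_net)        # freeze current Q to build regression targets
            with torch.no_grad():
                targets = rewards + gamma * target_net(next_states).max(dim=1).values
            optimizer = torch.optim.Adam(q_net.parameters(), lr=1e-3)
            for _ in range(epochs):                  # least-squares regression onto the targets
                preds = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)
                loss = nn.functional.mse_loss(preds, targets)
                optimizer.zero_grad()
                loss.backward()
                optimizer.step()
        return q_net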