Doubly Robust Off-policy Value Evaluation for Reinforcement Learning
We study the problem of off-policy value evaluation in reinforcement learning
(RL), where one aims to estimate the value of a new policy based on data
collected by a different policy. This problem is often a critical step when
applying RL to real-world problems. Despite its importance, existing general
methods either have uncontrolled bias or suffer high variance. In this work, we
extend the doubly robust estimator for bandits to sequential decision-making
problems, which gets the best of both worlds: it is guaranteed to be unbiased
and can have a much lower variance than the popular importance sampling
estimators. We demonstrate the estimator's accuracy in several benchmark
problems, and illustrate its use as a subroutine in safe policy improvement. We
also provide theoretical results on the hardness of the problem, and show that
our estimator can match the lower bound in certain scenarios.
Comment: 14 pages; 4 figures; ICML 2016.
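As a rough illustration of the recursion behind such an estimator, here is a
minimal Python sketch of a per-step doubly robust estimate for one logged
trajectory; the trajectory format and the pi_e, q_hat, and v_hat callables are
assumptions of the sketch, not the paper's interface.

def dr_estimate(trajectory, pi_e, q_hat, v_hat, gamma=1.0):
    """Backward-recursive doubly robust estimate of one trajectory's return.

    trajectory: list of (state, action, reward, mu_prob) tuples, where
                mu_prob is the behavior policy's probability of the action.
    pi_e(state, action): target-policy action probability.
    q_hat(state, action), v_hat(state): a fitted action-value model and its
                induced state-value model (both assumed fitted elsewhere).
    """
    v_dr = 0.0
    for state, action, reward, mu_prob in reversed(trajectory):
        rho = pi_e(state, action) / mu_prob  # per-step importance weight
        # Model baseline plus an importance-weighted TD-style correction.
        v_dr = v_hat(state) + rho * (reward + gamma * v_dr - q_hat(state, action))
    return v_dr

Averaging dr_estimate over all logged trajectories gives the value estimate:
the model term absorbs much of the importance-weight variance, while the
correction term keeps the estimate unbiased when behavior probabilities are
known.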
More Robust Doubly Robust Off-policy Evaluation
We study the problem of off-policy evaluation (OPE) in reinforcement learning
(RL), where the goal is to estimate the performance of a policy from the data
generated by one or more other policies. In particular, we focus on the doubly robust
(DR) estimators that consist of an importance sampling (IS) component and a
performance model, and utilize the low (or zero) bias of IS and low variance of
the model at the same time. Although the accuracy of the model has a huge
impact on the overall performance of DR, most of the work on using the DR
estimators in OPE has been focused on improving the IS part, and not much on
how to learn the model. In this paper, we propose alternative DR estimators,
called more robust doubly robust (MRDR), that learn the model parameter by
minimizing the variance of the DR estimator. We first present a formulation for
learning the DR model in RL. We then derive formulas for the variance of the DR
estimator in both contextual bandits and RL, such that their gradients
w.r.t. the model parameters can be estimated from the samples, and propose
methods to efficiently minimize the variance. We prove that the MRDR estimators
are strongly consistent and asymptotically optimal. Finally, we evaluate MRDR
in bandits and RL benchmark problems, and compare its performance with the
existing methods.
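A hedged sketch of the core idea in the contextual-bandit case: fit the reward
model by minimizing the empirical variance of the per-sample DR estimates
rather than a regression loss. The PyTorch model, the uniform target policy
used for the model-based term, and the precomputed importance weights rho are
all simplifying assumptions, not the paper's exact formulation.

import torch

def mrdr_loss(model, contexts, actions, rewards, rho):
    """Empirical variance of per-sample DR estimates, differentiable in the
    model parameters. contexts: (n, d); actions: (n,) long; rewards: (n,);
    rho: (n,) importance weights pi_e(a|x) / mu(a|x)."""
    q_all = model(contexts)                      # (n, K) predicted rewards
    q_taken = q_all.gather(1, actions.unsqueeze(1)).squeeze(1)
    v_model = q_all.mean(dim=1)                  # E_{pi_e}[q]; assumes uniform pi_e
    dr = v_model + rho * (rewards - q_taken)     # per-sample DR estimate
    return dr.var()                              # the variance-minimization objective

# Sketch of the training loop:
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# for _ in range(n_steps):
#     opt.zero_grad(); mrdr_loss(model, X, A, R, rho).backward(); opt.step()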
Double Reinforcement Learning for Efficient Off-Policy Evaluation in Markov Decision Processes
Off-policy evaluation (OPE) in reinforcement learning allows one to evaluate
novel decision policies without needing to conduct exploration, which is often
costly or otherwise infeasible. We consider for the first time the
semiparametric efficiency limits of OPE in Markov decision processes (MDPs),
where actions, rewards, and states are memoryless. We show existing OPE
estimators may fail to be efficient in this setting. We develop a new estimator
based on cross-fold estimation of q-functions and marginalized density
ratios, which we term double reinforcement learning (DRL). We show that DRL is
efficient when both components are estimated at fourth-root rates and is also
doubly robust when only one component is consistent. We investigate these
properties empirically and demonstrate the performance benefits due to
harnessing memorylessness.
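A heavily simplified sketch of the cross-fold construction: the nuisances (a
q-function and a marginalized density ratio) are fit on one fold and evaluated
on the held-out fold, then combined doubly robustly. The fit_q, fit_ratio, and
pi_e_value helpers are hypothetical stand-ins for any estimator, and the
1/(1 - gamma) normalization assumes w is a normalized stationary-distribution
ratio; the paper's exact estimator differs in details.

import numpy as np

def drl_estimate(transitions, initial_states, fit_q, fit_ratio, pi_e_value,
                 gamma=0.99, k_folds=2, seed=0):
    """transitions: list of (s, a, r, s_next) logged under the behavior policy.
    fit_q(train) -> q(s, a); fit_ratio(train) -> w(s, a), an (assumed)
    marginalized density ratio; pi_e_value(q, s) -> E_{a~pi_e}[q(s, a)].
    """
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(transitions))
    estimates = []
    for fold in np.array_split(idx, k_folds):
        fold_set = set(fold.tolist())
        train = [transitions[i] for i in idx if i not in fold_set]
        q, w = fit_q(train), fit_ratio(train)   # nuisances fit off the fold
        # Model-based baseline plus a density-ratio-weighted Bellman correction.
        baseline = np.mean([pi_e_value(q, s0) for s0 in initial_states])
        correction = np.mean([w(s, a) * (r + gamma * pi_e_value(q, sn) - q(s, a))
                              for s, a, r, sn in (transitions[i] for i in fold)])
        estimates.append(baseline + correction / (1.0 - gamma))
    return float(np.mean(estimates))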
Off-Policy Exploitability-Evaluation in Two-Player Zero-Sum Markov Games
Off-policy evaluation (OPE) is the problem of evaluating new policies using
historical data obtained from a different policy. In the recent OPE context,
most studies have focused on single-player cases, and not on multi-player
cases. In this study, we propose OPE estimators constructed by the doubly
robust and double reinforcement learning estimators in two-player zero-sum
Markov games. The proposed estimators target exploitability, a metric often
used to measure how close a policy profile (i.e., a tuple of policies) is to a
Nash equilibrium in two-player zero-sum games. We prove
exploitability estimation error bounds for the proposed estimators. We then
propose methods to find the best candidate policy profile by selecting the
policy profile that minimizes the estimated exploitability from a given policy
profile class. We prove the regret bounds of the policy profiles selected by
our methods. Finally, we demonstrate the effectiveness and performance of the
proposed estimators through experiments.
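For concreteness, here is what exploitability means in a tiny zero-sum matrix
game, computed exactly; the paper's estimators replace these exact values with
doubly robust estimates built from logged play.

import numpy as np

def exploitability(payoff, pi1, pi2):
    """payoff[i, j]: player 1's payoff; pi1, pi2: mixed strategies."""
    v = pi1 @ payoff @ pi2               # value of the policy profile
    best1 = np.max(payoff @ pi2)         # player 1's best-response value
    best2 = np.min(pi1 @ payoff)         # player 2 minimizes player 1's payoff
    return (best1 - v) + (v - best2)     # 0 iff (pi1, pi2) is a Nash equilibrium

# Example: matching pennies; the uniform profile has zero exploitability.
payoff = np.array([[1.0, -1.0], [-1.0, 1.0]])
uniform = np.array([0.5, 0.5])
print(exploitability(payoff, uniform, uniform))  # -> 0.0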
Horizon: Facebook's Open Source Applied Reinforcement Learning Platform
In this paper we present Horizon, Facebook's open source applied
reinforcement learning (RL) platform. Horizon is an end-to-end platform
designed to solve industry applied RL problems where datasets are large
(millions to billions of observations), the feedback loop is slow (vs. a
simulator), and experiments must be done with care because they don't run in a
simulator. Unlike other RL platforms, which are often designed for fast
prototyping and experimentation, Horizon is designed with production use cases
in mind. The platform contains workflows to train popular deep RL
algorithms and includes data preprocessing, feature transformation, distributed
training, counterfactual policy evaluation, optimized serving, and a
model-based data understanding tool. We also showcase and describe real
examples where reinforcement learning models trained with Horizon significantly
outperformed and replaced supervised learning systems at Facebook.
Comment: 10 pages.
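As a generic illustration of the counterfactual policy evaluation step such a
workflow performs on logged data (not Horizon's actual API), here is the basic
inverse-propensity-scoring estimator over logged bandit-style records:

import numpy as np

def ips_value(logged, pi_e):
    """logged: list of (context, action, reward, logging_prob) records;
    pi_e(context, action): the candidate policy's action probability."""
    weights = np.array([pi_e(c, a) / p for c, a, _, p in logged])
    rewards = np.array([r for _, _, r, _ in logged])
    # Unbiased for the candidate policy's value if logging_prob is exact.
    return float(np.mean(weights * rewards))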
Safe Policy Improvement with Baseline Bootstrapping
This paper considers Safe Policy Improvement (SPI) in Batch Reinforcement
Learning (Batch RL): from a fixed dataset and without direct access to the true
environment, train a policy that is guaranteed to perform at least as well as
the baseline policy used to collect the data. Our approach, called SPI with
Baseline Bootstrapping (SPIBB), is inspired by the knows-what-it-knows
paradigm: it bootstraps the trained policy with the baseline when the
uncertainty is high. Our first algorithm, Π_b-SPIBB, comes with SPI
theoretical guarantees. We also implement a variant, Π_{≤b}-SPIBB, that
is even more efficient in practice. We apply our algorithms to a motivational
stochastic gridworld domain and further demonstrate on randomly generated MDPs
the superiority of SPIBB with respect to existing algorithms, not only in
safety but also in mean performance. Finally, we implement a model-free version
of SPIBB and show its benefits on a navigation task with deep RL implementation
called SPIBB-DQN, which is, to the best of our knowledge, the first RL
algorithm relying on a neural network representation able to train efficiently
and reliably from batch data, without any interaction with the environment.
Comment: Accepted as a long oral at ICML 2019.
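A hedged sketch of the bootstrapping rule at the heart of the approach, in the
tabular case: where a state-action pair has been observed fewer than n_min
times, the new policy copies the baseline; elsewhere it may act greedily. The
tabular arrays and the greedy improvement step are illustrative assumptions,
not the paper's full algorithm.

import numpy as np

def spibb_policy(q, baseline, counts, n_min):
    """q, baseline, counts: (S, A) arrays; returns an (S, A) policy."""
    policy = np.zeros_like(baseline)
    for s in range(q.shape[0]):
        uncertain = counts[s] < n_min                # bootstrapped actions
        policy[s, uncertain] = baseline[s, uncertain]
        free_mass = 1.0 - policy[s].sum()            # mass left for trusted actions
        trusted = np.flatnonzero(~uncertain)
        if trusted.size:
            best = trusted[np.argmax(q[s, trusted])]
            policy[s, best] += free_mass             # greedy among trusted actions
        # else: every action is bootstrapped, so the row is already the baseline
    return policy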
Learning When-to-Treat Policies
Many applied decision-making problems have a dynamic component: The
policymaker needs not only to choose whom to treat, but also when to start
which treatment. For example, a medical doctor may choose between postponing
treatment (watchful waiting) and prescribing one of several available
treatments over the course of a patient's many visits. We develop an "advantage
doubly robust" estimator for learning such dynamic treatment rules using
observational data under the assumption of sequential ignorability. We prove
welfare regret bounds that generalize results for doubly robust learning in the
single-step setting, and show promising empirical performance in several
different contexts. Our approach is practical for policy optimization, and does
not need any structural (e.g., Markovian) assumptions.
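The building block being generalized here is the single-step doubly robust
(AIPW) score; a minimal sketch, assuming outcome models mu1/mu0 and a
propensity model e fitted elsewhere on held-out data:

import numpy as np

def aipw_scores(y, w, mu1, mu0, e):
    """y: outcomes; w: binary treatment indicators; mu1/mu0: outcome-model
    predictions under treat/no-treat; e: estimated propensity P(w=1 | x)."""
    return (mu1 - mu0
            + w * (y - mu1) / e
            - (1 - w) * (y - mu0) / (1 - e))  # per-unit treatment-effect score

Averaging these scores over units estimates the average treatment effect; the
paper's dynamic version replaces "treat vs. not" with "start now vs. wait" at
each decision point.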
Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper we present a new way of predicting the performance of a
reinforcement learning policy given historical data that may have been
generated by a different policy. The ability to evaluate a policy from
historical data is important for applications where the deployment of a bad
policy can be dangerous or costly. We show empirically that our algorithm
produces estimates that often have orders of magnitude lower mean squared error
than existing methods; that is, it makes more efficient use of the available data. Our
new estimator is based on two advances: an extension of the doubly robust
estimator (Jiang and Li, 2015), and a new way to mix between model-based
estimates and importance-sampling estimates.
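A hedged sketch of the mixing idea: weight a biased, low-variance model-based
estimate against an unbiased importance-sampling estimate so as to minimize an
estimated MSE. The squared gap between the two estimators, used below as a
bias proxy, is a simplification of the paper's construction.

import numpy as np

def blended_estimate(is_samples, model_value):
    """is_samples: per-trajectory importance-sampling estimates (1-D array);
    model_value: a scalar model-based value estimate."""
    is_mean = float(np.mean(is_samples))
    var_is = float(np.var(is_samples)) / len(is_samples)
    bias_model_sq = (is_mean - model_value) ** 2   # crude bias proxy
    # Weight on the model term that minimizes the estimated MSE of the blend:
    # argmin_w [w^2 * bias^2 + (1 - w)^2 * var] = var / (var + bias^2).
    w = var_is / (var_is + bias_model_sq + 1e-12)
    return w * model_value + (1 - w) * is_mean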
Behaviour Policy Estimation in Off-Policy Policy Evaluation: Calibration Matters
In this work, we consider the problem of estimating a behaviour policy for
use in Off-Policy Policy Evaluation (OPE) when the true behaviour policy is
unknown. Via a series of empirical studies, we demonstrate how accurate OPE is
strongly dependent on the calibration of estimated behaviour policy models: how
precisely the behaviour policy is estimated from data. We show how powerful
parametric models such as neural networks can result in highly uncalibrated
behaviour policy models on a real-world medical dataset, and illustrate how a
simple, non-parametric, k-nearest neighbours model produces better calibrated
behaviour policy estimates and can be used to obtain superior importance
sampling-based OPE estimates.
Comment: Accepted to the workshop on Machine Learning for Causal Inference,
Counterfactual Prediction, and Autonomous Action at ICML 2018.
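A minimal sketch of the kind of non-parametric behaviour-policy estimate the
paper favours: k-nearest-neighbour action frequencies over logged states. The
scikit-learn usage is an assumption of this sketch, not the paper's
implementation.

import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_behaviour_policy(states, actions, n_actions, k=50):
    """states: (n, d) array; actions: (n,) int array of logged actions.
    Returns mu_hat(state) -> length-n_actions probability vector."""
    nn = NearestNeighbors(n_neighbors=k).fit(states)

    def mu_hat(state):
        _, idx = nn.kneighbors(np.asarray(state).reshape(1, -1))
        neighbour_actions = actions[idx[0]]
        # Empirical action frequencies among the k nearest logged states.
        return np.bincount(neighbour_actions, minlength=n_actions) / k

    return mu_hat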
Data-Efficient Policy Evaluation Through Behavior Policy Search
We consider the task of evaluating a policy for a Markov decision process
(MDP). The standard unbiased technique for evaluating a policy is to deploy the
policy and observe its performance. We show that the data collected from
deploying a different policy, commonly called the behavior policy, can be used
to produce unbiased estimates with lower mean squared error than this standard
technique. We derive an analytic expression for the optimal behavior policy:
the behavior policy that minimizes the mean squared error of the resulting
estimates. Because this expression depends on terms that are unknown in
practice, we propose a novel policy evaluation sub-problem, behavior policy
search: searching for a behavior policy that reduces mean squared error. We
present a behavior policy search algorithm and empirically demonstrate its
effectiveness in lowering the mean squared error of policy performance
estimates.
Comment: Accepted to ICML 2017; extended version; 15 pages.
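To make the object of the search concrete: in the one-step (bandit) case with
known expected rewards, the variance-minimizing behavior policy has a closed
form, sampling actions proportionally to pi_e(a) * |r(a)|. The sketch below
only illustrates that target; the paper searches for such a policy from data,
where the rewards are unknown.

import numpy as np

def optimal_bandit_behaviour(pi_e, rewards):
    """pi_e: target action probabilities; rewards: expected reward per action."""
    unnormalized = pi_e * np.abs(rewards)
    return unnormalized / unnormalized.sum()

pi_e = np.array([0.7, 0.2, 0.1])
rewards = np.array([1.0, 5.0, 0.5])
mu_star = optimal_bandit_behaviour(pi_e, rewards)
# Importance sampling under mu_star remains unbiased for pi_e's value but has
# lower variance than on-policy sampling when rewards are uneven across actions.
print(mu_star)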