Data-Efficient Off-Policy Policy Evaluation for Reinforcement Learning
In this paper, we present a new way of predicting the performance of a
reinforcement learning policy given historical data that may have been
generated by a different policy. The ability to evaluate a policy from
historical data is important for applications where the deployment of a bad
policy can be dangerous or costly. We show empirically that our algorithm
produces estimates that often have orders of magnitude lower mean squared error
than existing methods---it makes more efficient use of the available data. Our
new estimator is based on two advances: an extension of the doubly robust
estimator (Jiang and Li, 2015), and a new way to mix between model-based
estimates and importance-sampling-based estimates.
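The core of a doubly robust estimator can be pictured in a few lines: a learned model of the MDP supplies baseline values, and per-step importance weights correct the model's error on the logged data. The sketch below is a minimal illustration of that recursion; the trajectory format, the q_hat/v_hat/pi_e/pi_b callables, and the toy numbers are assumptions for exposition, not the paper's implementation or its mixing scheme.

```python
def dr_estimate(trajectory, q_hat, v_hat, pi_e, pi_b, gamma=0.99):
    """trajectory: list of (state, action, reward) tuples logged under the behaviour policy.
    q_hat(s, a), v_hat(s): model-based value estimates for the evaluation policy.
    pi_e(a, s), pi_b(a, s): action probabilities under the evaluation/behaviour policies."""
    estimate = 0.0
    # Evaluate the recursion DR_t = V_hat(s_t) + rho_t * (r_t + gamma * DR_{t+1} - Q_hat(s_t, a_t))
    # backwards in a single pass over the trajectory.
    for state, action, reward in reversed(trajectory):
        rho = pi_e(action, state) / pi_b(action, state)  # per-step importance weight
        estimate = v_hat(state) + rho * (reward + gamma * estimate - q_hat(state, action))
    return estimate

# Toy usage with constant models and a two-step trajectory (purely illustrative values).
traj = [("s0", 1, 1.0), ("s1", 0, 0.0)]
print(dr_estimate(
    traj,
    q_hat=lambda s, a: 0.5, v_hat=lambda s: 0.5,
    pi_e=lambda a, s: 0.9 if a == 1 else 0.1,
    pi_b=lambda a, s: 0.5,
))
```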
Behaviour Policy Estimation in Off-Policy Policy Evaluation: Calibration Matters
In this work, we consider the problem of estimating a behaviour policy for
use in Off-Policy Policy Evaluation (OPE) when the true behaviour policy is
unknown. Via a series of empirical studies, we demonstrate how accurate OPE is
strongly dependent on the calibration of estimated behaviour policy models: how
precisely the behaviour policy is estimated from data. We show how powerful
parametric models such as neural networks can result in poorly calibrated
behaviour policy models on a real-world medical dataset, and illustrate how a
simple, non-parametric, k-nearest neighbours model produces better calibrated
behaviour policy estimates and can be used to obtain superior importance
sampling-based OPE estimates. Comment: Accepted to workshop on Machine Learning for Causal Inference, Counterfactual Prediction, and Autonomous Action at ICML 201
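The calibration point can be illustrated with a toy pipeline: estimate the behaviour policy with a k-nearest-neighbours classifier and plug the resulting propensities into a one-step importance-sampling estimate. The synthetic dataset, the choice of k, and the uniform evaluation policy below are illustrative assumptions, not the medical dataset or models from the paper.

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(0)
states = rng.normal(size=(1000, 5))        # logged patient states (toy data)
actions = rng.integers(0, 2, size=1000)    # logged clinician actions
rewards = rng.normal(size=1000)            # logged outcomes

# Estimate the behaviour policy pi_b(a | s) with kNN: the predicted class
# frequencies among neighbours tend to be well calibrated.
knn = KNeighborsClassifier(n_neighbors=25).fit(states, actions)
propensities = knn.predict_proba(states)[np.arange(1000), actions]

# Plug the calibrated propensities into a one-step importance-sampling estimate
# of a (here: uniform random) evaluation policy.
pi_e = np.full(1000, 0.5)
is_estimate = np.mean(pi_e / np.clip(propensities, 1e-3, None) * rewards)
print(is_estimate)
```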
Stochastic Doubly Robust Gradient
When training a machine learning model with observational data, it is common for
some values to be systematically missing. Learning from incomplete data in which
the missingness depends on some covariates may lead to biased parameter estimates
and may even harm the fairness of decision outcomes.
This paper proposes a way to adjust for the causal effect of covariates on the
missingness when training models with stochastic gradient descent (SGD).
Inspired by the design of doubly robust estimator and its theoretical property
of double robustness, we introduce stochastic doubly robust gradient (SDRG)
consisting of two models: weight-corrected gradients for inverse propensity
score weighting and per-covariate control variates for regression adjustment.
Also, we identify the connection between double robustness and variance
reduction in SGD by presenting the SDRG algorithm within a unifying framework
for variance-reduced SGD. The performance of our approach is tested empirically
by showing convergence when training image classifiers on several examples of
missing data. Comment: 9 pages, 2 figures
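A rough sketch of the doubly robust gradient idea follows: observed samples are reweighted by the inverse propensity of being observed, and an imputation (regression-adjustment) model supplies a control-variate term evaluated on every sample, so the update stays unbiased if either model is correct. The function names, arguments, and the exact combination below are assumptions for illustration rather than the SDRG algorithm as published.

```python
import numpy as np

def doubly_robust_gradient(theta, x, y, observed, grad_fn, impute_fn, propensity_fn):
    """theta: current model parameters.
    x: (n, p) covariates; y: (n,) labels, only valid where observed is True.
    grad_fn(theta, x, y) -> (n, d) per-sample loss gradients.
    impute_fn(x) -> (n,) imputed labels from the regression-adjustment model.
    propensity_fn(x) -> (n,) estimated probability that the label is observed."""
    p = np.clip(propensity_fn(x), 1e-3, 1.0)
    g_imputed = grad_fn(theta, x, impute_fn(x))          # control variate on every sample
    g_observed = np.zeros_like(g_imputed)
    g_observed[observed] = grad_fn(theta, x[observed], y[observed])
    # IPW-corrected residual plus the model-based term: the combination stays unbiased
    # if either the propensity model or the imputation model is correct.
    correction = (observed / p)[:, None] * (g_observed - g_imputed)
    return np.mean(g_imputed + correction, axis=0)
```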
Doubly Robust Off-Policy Actor-Critic Algorithms for Reinforcement Learning
We study the problem of off-policy critic evaluation in several variants of
value-based off-policy actor-critic algorithms. Off-policy actor-critic
algorithms require an off-policy critic evaluation step to estimate the value
of the new policy after every policy gradient update. Despite the enormous success
of off-policy policy gradients on control tasks, existing general methods
suffer from high variance and instability, partly because the policy
improvement depends on the gradient of the estimated value function. In this work,
we present a new approach to off-policy policy evaluation in actor-critic methods,
based on doubly robust estimators. We extend the doubly robust estimator from
off-policy policy evaluation (OPE) to actor-critic algorithms that include a
reward-estimator performance model. We find that doubly robust estimation of
the critic can significantly improve performance in continuous control tasks.
Furthermore, in cases where the reward function is stochastic, which can lead to
high variance, doubly robust critic estimation can improve performance under
corrupted, stochastic reward signals, indicating its usefulness for robust and
safe reinforcement learning. Comment: In Submission; Appeared at the NeurIPS 2019
Workshop on Safety and Robustness in Decision Making
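One way to picture a doubly robust critic target is as a learned reward model corrected by an importance-weighted residual against the observed (possibly corrupted) reward, so the target's variance is damped when rewards are noisy. The form of the target and the toy numbers below are assumptions sketched for intuition, not the paper's update rule.

```python
import numpy as np

def dr_td_target(reward, r_hat, rho, v_next, gamma=0.99):
    """reward: observed (possibly noisy) reward from the behaviour policy.
    r_hat: model-based reward estimate for the same (s, a).
    rho: importance weight pi_e(a|s) / pi_b(a|s).
    v_next: current critic estimate of V(s')."""
    corrected_reward = r_hat + rho * (reward - r_hat)  # doubly robust reward estimate
    return corrected_reward + gamma * v_next

# With a corrupted reward signal, the model term damps the variance of the target.
noisy_reward = 1.0 + np.random.default_rng(0).normal(scale=2.0)
print(dr_td_target(noisy_reward, r_hat=1.0, rho=1.2, v_next=5.0))
```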
Improving Sepsis Treatment Strategies by Combining Deep and Kernel-Based Reinforcement Learning
Sepsis is the leading cause of mortality in the ICU. It is challenging to
manage because individual patients respond differently to treatment. Thus,
tailoring treatment to the individual patient is essential for the best
outcomes. In this paper, we take steps toward this goal by applying a
mixture-of-experts framework to personalize sepsis treatment. The mixture model
selectively alternates between neighbor-based (kernel) and deep reinforcement
learning (DRL) experts depending on the patient's current history. On a large
retrospective cohort, this mixture-based approach outperforms physician, kernel-only,
and DRL-only experts. Comment: AMIA 2018 Annual Symposium
The Advantage of Doubling: A Deep Reinforcement Learning Approach to Studying the Double Team in the NBA
During the 2017 NBA playoffs, Celtics coach Brad Stevens was faced with a
difficult decision when defending against the Cavaliers: "Do you double and
risk giving up easy shots, or stay at home and do the best you can?" It's a
tough call, but finding a good defensive strategy that effectively incorporates
doubling can make all the difference in the NBA. In this paper, we analyze
double teaming in the NBA, quantifying the trade-off between risk and reward.
Using player trajectory data pertaining to over 643,000 possessions, we
identified when the ball handler was double teamed. Given these data and the
corresponding outcome (i.e., whether the defense was successful), we used deep
reinforcement learning to estimate the quality of the defensive actions. We
present qualitative and quantitative results summarizing our learned defensive
strategy. We show that our policy value estimates are predictive
of points per possession and win percentage. Overall, the proposed framework
represents a step toward a more comprehensive understanding of defensive
strategies in the NBA. Comment: Accepted to MIT Sloan Sports Analytics 2018. First two authors contributed equally
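One standard way to estimate action quality from logged possessions is fitted Q iteration: repeatedly regress a Bellman target onto (state, action) features. The toy features, outcomes, and the gradient-boosted regressor below are placeholders; the paper's deep RL model is only approximated in spirit here.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(0)
n = 2000
states = rng.normal(size=(n, 6))             # e.g. player/ball trajectory features
actions = rng.integers(0, 2, size=n)         # 0 = stay home, 1 = double team
next_states = states + rng.normal(scale=0.1, size=(n, 6))
rewards = rng.normal(size=n)                 # e.g. negative points allowed on the step
done = (rng.random(n) < 0.3).astype(float)   # possession ended

gamma, q = 0.95, None
X = np.column_stack([states, actions])
for _ in range(10):                          # fitted Q iteration over the logged batch
    if q is None:
        targets = rewards
    else:
        q_next = np.stack([q.predict(np.column_stack([next_states, np.full(n, a)]))
                           for a in (0, 1)], axis=1)
        targets = rewards + gamma * (1.0 - done) * q_next.max(axis=1)
    q = GradientBoostingRegressor().fit(X, targets)

print(q.predict(X[:3]))                      # estimated value of three logged defensive actions
```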
The Actor Search Tree Critic (ASTC) for Off-Policy POMDP Learning in Medical Decision Making
Off-policy reinforcement learning enables learning a near-optimal policy from
suboptimal experience, thereby providing opportunities for artificial intelligence
applications in healthcare. Previous works have mainly framed patient-clinician
interactions as Markov decision processes, while true physiological states are
not necessarily fully observable from clinical data. We capture this situation
with a partially observable Markov decision process (POMDP), in which an agent
optimises its actions over a belief represented as a distribution of patient states
inferred from individual history trajectories. A Gaussian mixture model is
fitted to the observed data. Moreover, we take into account the fact that
nuances in pharmaceutical dosage could presumably result in significantly
different effects by modelling a continuous policy through a Gaussian
approximator directly in the policy space, i.e. the actor. To address the
challenge of an infinite number of possible belief states, which renders exact
value iteration intractable, we evaluate and plan only for each encountered
belief via a heuristic search tree, tightly maintaining lower and upper
bounds on the true value of the belief. We further resort to function
approximation to update the value-bound estimates, i.e. the critic, so that the
tree search can be improved through tighter bounds at the fringe nodes,
which are back-propagated to the root. Both actor and critic parameters are
learned via gradient-based approaches. Our proposed policy, trained from real
intensive care unit data, is capable of dictating the dosing of vasopressors and
intravenous fluids for sepsis patients that leads to the best patient outcomes.
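Two of the ingredients described above can be sketched compactly: a Gaussian-mixture belief over latent patient states fitted to observed data, and a Gaussian actor that maps the belief to a continuous dose. The toy observations, the linear parameterisation of the actor, and all numbers below are assumptions for illustration, not the ASTC architecture or its tree search.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
observations = rng.normal(size=(500, 4))          # toy vitals/labs history

# Belief: posterior responsibilities of the mixture components for the latest observation.
gmm = GaussianMixture(n_components=3, random_state=0).fit(observations)
belief = gmm.predict_proba(observations[-1:])[0]  # a distribution over latent patient states

# Actor: a Gaussian policy over a continuous action (e.g. a dose), here parameterised
# by a linear function of the belief; sampling from it yields the prescribed dose.
actor_w_mean = rng.normal(size=3)
actor_log_std = -1.0
mean_dose = belief @ actor_w_mean
dose = rng.normal(mean_dose, np.exp(actor_log_std))
print(belief, dose)
```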
Truly Batch Apprenticeship Learning with Deep Successor Features
We introduce a novel apprenticeship learning algorithm to learn an expert's
underlying reward structure in off-policy, model-free batch settings.
Unlike existing methods that require a dynamics model or additional data
acquisition for on-policy evaluation, our algorithm requires only the batch
data of observed expert behavior. Such settings are common in real-world
tasks---health care, finance, or industrial processes---where accurate
simulators do not exist or data acquisition is costly. To address challenges in
batch settings, we introduce Deep Successor Feature Networks (DSFN) that
estimate feature expectations in an off-policy setting and a
transition-regularized imitation network that produces a near-expert initial
policy and an efficient feature representation. Our algorithm achieves superior
results in batch settings on both control benchmarks and a vital clinical task
of sepsis management in the Intensive Care Unit. Comment: 10 pages, 3 figures, Under Conference Review
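Feature expectations can be estimated off-policy with a successor-feature TD update, which is the idea DSFN builds on: learn psi(s) so that it approximates the discounted sum of future features under the batch policy. The linear psi, the toy transitions, and the step size below are illustrative assumptions standing in for the deep network.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 4                                         # dimension of the state features phi(s)
W = np.zeros((d, d))                          # linear successor features: psi(s) = W @ phi(s)
gamma, lr = 0.95, 0.05

def phi(s):
    return s                                  # toy case: the features are the state itself

# One TD(0) pass over a batch of logged transitions (s, s') from the batch policy,
# pushing psi(s) towards phi(s) + gamma * psi(s').
batch = [(rng.normal(size=d), rng.normal(size=d)) for _ in range(1000)]
for s, s_next in batch:
    target = phi(s) + gamma * W @ phi(s_next)
    W += lr * np.outer(target - W @ phi(s), phi(s))

# The learned feature expectations can then be matched against the expert's to
# recover a reward weight vector, as in apprenticeship learning.
print(W @ phi(rng.normal(size=d)))
```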
Off-policy Bandit and Reinforcement Learning
We develop a method for predicting the performance of reinforcement learning
and bandit algorithms, given historical data that may have been generated by a
different algorithm. Our estimator has the property that its prediction
converges in probability to the true performance of a counterfactual algorithm
at a fast rate as the sample size increases. We also show a
correct way to estimate the variance of our prediction, thus allowing the
analyst to quantify the uncertainty in the prediction. These properties hold
even when the analyst does not know which among a large number of potentially
important state variables are really important. These theoretical guarantees
make our estimator safe to use. Finally, we apply it to improve advertisement
design at a major advertising company. We find that our method produces
smaller mean squared errors than state-of-the-art methods.
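The flavour of such an estimator, together with its uncertainty quantification, can be shown with a toy contextual bandit: per-sample doubly robust scores give the point estimate as their mean and a standard error from their sample variance. The data, the outcome model, and the evaluation policy below are assumptions for illustration, not the paper's estimator or the advertising application.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000
contexts = rng.normal(size=(n, 3))
actions = rng.integers(0, 2, size=n)                        # logged actions
rewards = (actions == (contexts[:, 0] > 0)).astype(float)   # toy outcomes

pi_b = np.full(n, 0.5)                                 # behaviour propensities (known here)
pi_e_a1 = np.where(contexts[:, 0] > 0, 0.9, 0.1)       # evaluation policy's prob. of action 1
prop_e = np.where(actions == 1, pi_e_a1, 1.0 - pi_e_a1)

mu_hat = np.full(n, 0.5)      # outcome model E[r | x, a] (a toy constant here)
mu_hat_pi = 0.5               # its expectation under the evaluation policy

scores = mu_hat_pi + prop_e / pi_b * (rewards - mu_hat)     # per-sample DR scores
estimate = scores.mean()
std_error = scores.std(ddof=1) / np.sqrt(n)                 # uncertainty of the estimate
print(estimate, "+/-", 1.96 * std_error)
```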
Using Options and Covariance Testing for Long Horizon Off-Policy Policy Evaluation
Evaluating a policy by deploying it in the real world can be risky and
costly. Off-policy policy evaluation (OPE) algorithms use historical data
collected from running a previous policy to evaluate a new policy, which
provides a means for evaluating a policy without requiring it to ever be
deployed. Importance sampling is a popular OPE method because it is robust to
partial observability and works with continuous states and actions. However,
the amount of historical data required by importance sampling can scale
exponentially with the horizon of the problem: the number of sequential
decisions that are made. We propose using policies over temporally extended
actions, called options, and show that combining these policies with importance
sampling can significantly improve performance for long-horizon problems. In
addition, we can take advantage of special cases that arise due to
options-based policies to further improve the performance of importance
sampling. We further generalize these special cases to a general covariance
testing rule that can be used to decide which weights to drop in an IS
estimate, and derive a new IS algorithm called Incremental Importance Sampling
that can provide significantly more accurate estimates for a broad class of
domains.
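The variance problem and the weight-dropping remedy can be seen in a small simulation: a full-trajectory importance-sampling weight is a product over every decision (or over every option, when options are used), while an incremental variant keeps only the most recent factors, trading a little bias for much lower variance. The fixed k below stands in for the covariance test that the paper uses to decide which weights to drop; all data are synthetic.

```python
import numpy as np

def full_is(step_ratios, returns):
    # step_ratios: (n_traj, horizon) per-decision ratios pi_e/pi_b; returns: (n_traj,)
    return np.mean(np.prod(step_ratios, axis=1) * returns)

def incremental_is(step_ratios, returns, k):
    # Keep only the last k weight factors of each trajectory (bias for variance).
    return np.mean(np.prod(step_ratios[:, -k:], axis=1) * returns)

rng = np.random.default_rng(0)
ratios = rng.lognormal(mean=0.0, sigma=0.3, size=(500, 50))  # long-horizon weights
returns = rng.normal(loc=1.0, size=500)
print(full_is(ratios, returns), incremental_is(ratios, returns, k=5))
```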