5 research outputs found
From Importance Sampling to Doubly Robust Policy Gradient
We show that on-policy policy gradient (PG) and its variance reduction
variants can be derived by taking finite differences of function evaluations
supplied by estimators from the importance sampling (IS) family for off-policy
evaluation (OPE). Starting from the doubly robust (DR) estimator (Jiang & Li,
2016), we provide a simple derivation of a very general and flexible form of
PG, which subsumes the state-of-the-art variance reduction technique (Cheng et
al., 2019) as its special case and immediately hints at further variance
reduction opportunities overlooked by existing literature. We analyze the
variance of the new DR-PG estimator, compare it to existing methods as well as
the Cramér-Rao lower bound of policy gradient, and empirically show its
effectiveness.
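For a rough illustration of the starting point, the sketch below implements the step-wise doubly robust (DR) value estimator of Jiang & Li (2016) for a single logged trajectory, together with a schematic central finite-difference gradient over policy parameters. The function names, the finite action space, and the finite-difference step are assumptions made for illustration, not the paper's exact construction.

```python
import numpy as np

def dr_estimate(trajectory, pi, mu, q_hat, num_actions, gamma=0.99):
    """Step-wise doubly robust (DR) value estimate for one logged trajectory.

    trajectory: list of (s, a, r) tuples logged under behavior policy mu.
    pi(a, s), mu(a, s): target / behavior action probabilities (callables).
    q_hat(s, a): approximate Q-function used as the model-based component.
    """
    v = 0.0
    for (s, a, r) in reversed(trajectory):          # backward recursion
        rho = pi(a, s) / mu(a, s)                   # per-step importance ratio
        v_hat = sum(pi(b, s) * q_hat(s, b) for b in range(num_actions))
        v = v_hat + rho * (r + gamma * v - q_hat(s, a))   # DR correction
    return v

def fd_policy_gradient(theta, evaluate, eps=1e-4):
    """Central finite-difference gradient of an OPE estimate w.r.t. policy params.

    evaluate(theta) is assumed to average dr_estimate over a dataset for pi_theta.
    """
    g = np.zeros_like(theta)
    for i in range(theta.size):
        e = np.zeros_like(theta)
        e[i] = eps
        g[i] = (evaluate(theta + e) - evaluate(theta - e)) / (2 * eps)
    return g
```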
Doubly Robust Off-Policy Value and Gradient Estimation for Deterministic Policies
Offline reinforcement learning, wherein one uses off-policy data logged by a
fixed behavior policy to evaluate and learn new policies, is crucial in
applications where experimentation is limited, such as medicine. We study the
estimation of policy value and gradient of a deterministic policy from
off-policy data when actions are continuous. Targeting deterministic policies,
for which the action is a deterministic function of the state, is crucial since optimal
policies are always deterministic (up to ties). In this setting, standard
importance sampling and doubly robust estimators for policy value and gradient
fail because the density ratio does not exist. To circumvent this issue, we
propose several new doubly robust estimators based on different kernelization
approaches. We analyze the asymptotic mean-squared error of each of these under
mild rate conditions for nuisance estimators. Specifically, we demonstrate how
to obtain a rate that is independent of the horizon length.
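To make the kernelization idea concrete, here is a minimal one-step (bandit-style) sketch: since a deterministic target policy tau has no action density, the importance ratio is replaced by a smoothing kernel centered at tau(s). The Gaussian kernel, scalar actions, and the one-step simplification are illustrative assumptions rather than the paper's actual estimators.

```python
import numpy as np

def gaussian_kernel(u, h):
    """Gaussian smoothing kernel with bandwidth h (scalar actions assumed)."""
    return np.exp(-0.5 * (u / h) ** 2) / (np.sqrt(2.0 * np.pi) * h)

def kernelized_dr_value(data, tau, mu_density, q_hat, h=0.1):
    """One-step kernelized DR estimate of the value of a deterministic policy tau.

    data: list of (s, a, r) tuples logged under a behavior policy with action
    density mu_density(a, s); q_hat(s, a) is an approximate Q-function.
    The kernel weight stands in for the nonexistent ratio pi(a|s)/mu(a|s).
    """
    total = 0.0
    for (s, a, r) in data:
        w = gaussian_kernel(a - tau(s), h) / mu_density(a, s)  # smoothed weight
        total += q_hat(s, tau(s)) + w * (r - q_hat(s, a))      # DR correction
    return total / len(data)
```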
A Unified Off-Policy Evaluation Approach for General Value Function
General Value Function (GVF) is a powerful tool for representing both predictive
and retrospective knowledge in reinforcement learning (RL). In practice, multiple
interrelated GVFs often need to be evaluated jointly with
pre-collected off-policy samples. In the literature, the gradient temporal
difference (GTD) learning method has been adopted to evaluate GVFs in the
off-policy setting, but such an approach may suffer from a large estimation
error even if the function approximation class is sufficiently expressive.
Moreover, no previous work has formally established convergence guarantees to
the ground truth GVFs under function approximation.
In this paper, we address both issues through the lens of a class of GVFs with
causal filtering, which covers a wide range of RL applications such as reward
variance, value gradients, costs in anomaly detection, and stationary distribution
gradients. We propose a new algorithm called GenTD for off-policy GVF
evaluation and show that GenTD learns multiple interrelated multi-dimensional
GVFs as efficiently as a single canonical scalar value function. We further
show that, unlike GTD, the GVFs learned by GenTD are guaranteed to converge to
the ground truth GVFs as long as the function approximation power is
sufficiently large. To the best of our knowledge, GenTD is the first off-policy GVF
evaluation algorithm with a global optimality guarantee.
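The abstract does not spell out GenTD's update rule, so for context the sketch below shows a standard GTD2 update for off-policy evaluation with linear features, i.e. the gradient temporal difference baseline the paper contrasts GenTD against. The feature maps, step sizes, and scalar-value setting are illustrative assumptions.

```python
import numpy as np

def gtd2_step(theta, w, phi, phi_next, r, rho, gamma=0.99, alpha=0.01, beta=0.05):
    """One GTD2 update for off-policy value evaluation with linear features.

    theta: value-function weights; w: auxiliary correction weights.
    phi, phi_next: feature vectors of the current and next state.
    rho: importance ratio pi(a|s)/mu(a|s) for the logged action.
    """
    delta = r + gamma * (phi_next @ theta) - phi @ theta        # TD error
    theta = theta + alpha * rho * (phi @ w) * (phi - gamma * phi_next)
    w = w + beta * (rho * delta - phi @ w) * phi
    return theta, w
```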
Doubly Robust Off-Policy Actor-Critic: Convergence and Optimality
Designing off-policy reinforcement learning algorithms is typically a very
challenging task, because a desirable iteration update often involves an
expectation over an on-policy distribution. Prior off-policy actor-critic (AC)
algorithms have introduced a new critic that uses the density ratio for
adjusting the distribution mismatch in order to stabilize convergence, but at
the cost of potentially introducing high bias due to estimation errors in both
the density ratio and the value function. In this paper, we develop a
doubly robust off-policy AC (DR-Off-PAC) algorithm for discounted MDPs, which can take
advantage of learned nuisance functions to reduce estimation errors. Moreover,
DR-Off-PAC adopts a single-timescale structure, in which both the actor and the critics
are updated simultaneously with constant step sizes, and is thus more sample
efficient than prior algorithms that adopt either a two-timescale or a nested-loop
structure. We study the finite-time convergence rate and characterize the
sample complexity for DR-Off-PAC to attain an $\epsilon$-accurate optimal
policy. We also show that the overall convergence of DR-Off-PAC is doubly
robust to the approximation errors that depend only on the expressive power of
approximation functions. To the best of our knowledge, our study establishes
the first overall sample complexity analysis for a single time-scale off-policy
AC algorithm.
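A schematic of the single-timescale structure described above is sketched below: the actor, the value critic, and the density-ratio model are all updated in the same loop with constant step sizes. The object interfaces (sample(), gradient(), params) and the update directions are hypothetical placeholders, not the paper's actual DR-Off-PAC equations.

```python
def train_single_timescale(data, actor, critic, ratio_model, steps=10_000,
                           eta_actor=1e-3, eta_critic=1e-3, eta_ratio=1e-3):
    """Single-timescale off-policy actor-critic loop (schematic).

    data.sample() returns a minibatch of logged transitions; each component
    exposes a gradient(...) method and a params array (hypothetical interface).
    """
    for _ in range(steps):
        batch = data.sample()
        g_critic = critic.gradient(batch)                     # value nuisance
        g_ratio = ratio_model.gradient(batch)                 # density-ratio nuisance
        g_actor = actor.gradient(batch, critic, ratio_model)  # DR policy gradient
        # all parameters move at once, each with a constant step size
        critic.params -= eta_critic * g_critic
        ratio_model.params -= eta_ratio * g_ratio
        actor.params += eta_actor * g_actor
    return actor
```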
Statistically Efficient Off-Policy Policy Gradients
Policy gradient methods in reinforcement learning update policy parameters by
taking steps in the direction of an estimated gradient of policy value. In this
paper, we consider the statistically efficient estimation of policy gradients
from off-policy data, where the estimation is particularly non-trivial. We
derive the asymptotic lower bound on the feasible mean-squared error in both
Markov and non-Markov decision processes and show that existing estimators fail
to achieve it in general settings. We propose a meta-algorithm that achieves
the lower bound without any parametric assumptions and exhibits a unique 3-way
double robustness property. We discuss how to estimate nuisances that the
algorithm relies on. Finally, we establish guarantees on the rate at which we
approach a stationary point when we take steps in the direction of our new
estimated policy gradient.
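For reference, the sketch below implements the naive trajectory-wise importance-sampling policy gradient from off-policy data, i.e. one of the existing high-variance estimators this paper improves upon; the callables and the absence of nuisance functions (Q-models, density ratios) are simplifications for illustration.

```python
import numpy as np

def is_policy_gradient(trajectories, score, pi, mu, gamma=0.99):
    """Trajectory-wise importance-sampling estimate of the policy gradient.

    trajectories: list of trajectories, each a list of (s, a, r) tuples logged
    under behavior policy mu. score(a, s) returns grad_theta log pi_theta(a|s).
    """
    grads = []
    for traj in trajectories:
        rho, ret, g = 1.0, 0.0, 0.0
        for t, (s, a, r) in enumerate(traj):
            rho *= pi(a, s) / mu(a, s)        # cumulative importance ratio
            ret += (gamma ** t) * r           # discounted return
            g = g + score(a, s)               # sum of score functions
        grads.append(rho * ret * g)
    return np.mean(grads, axis=0)
```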