Efficiently Breaking the Curse of Horizon in Off-Policy Evaluation with Double Reinforcement Learning
Off-policy evaluation (OPE) in reinforcement learning is notoriously
difficult in long- and infinite-horizon settings due to diminishing overlap
between behavior and target policies. In this paper, we study the role of
Markovian and time-invariant structure in efficient OPE. We first derive the
efficiency limits for OPE when one assumes each of these structures. This
precisely characterizes the curse of horizon: in time-variant processes, OPE is
only feasible in the near-on-policy setting, where behavior and target policies
are sufficiently similar. But in time-invariant Markov decision processes, our
bounds show that truly off-policy evaluation is feasible, even with just one
dependent trajectory, and they provide the limits of how well we could hope to
do. We develop a new estimator based on Double Reinforcement Learning (DRL)
that leverages this structure for OPE. Our DRL estimator simultaneously uses
estimated stationary density ratios and $q$-functions; it remains efficient
when both are estimated at slow, nonparametric rates and remains consistent
when either is estimated consistently. We investigate these properties and the
performance benefits of leveraging the problem structure for more efficient
OPE.

Comment: In Ver 1, we defined the efficiency bound by taking the limit of the
Cramér-Rao bound, which is different from the standard definition of the
efficiency bound. In Ver 3, we significantly changed the derivation of the
efficiency bound by following standard (i.i.d.) semiparametric theory. We then
also derived the valid efficient influence function.
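To make the construction concrete, here is a minimal sketch of a doubly robust
estimator of this general kind for the (normalized) discounted policy value; it
is an illustration under simplifying assumptions (finite actions, pooled
transitions, no cross-fitting), not the paper's exact estimator. Writing
$\hat{w}$ for the estimated stationary density ratio, $\hat{q}$ for the
estimated $q$-function, and $\hat{v}(s) = E_{a \sim \pi_e}[\hat{q}(s,a)]$, one
common form is
$\hat{\rho} = (1-\gamma)\,\hat{E}[\hat{v}(s_0)] + \hat{E}[\hat{w}(s,a)(r + \gamma \hat{v}(s') - \hat{q}(s,a))]$.
All names below (w_hat, q_hat, pi_e, transitions) are hypothetical stand-ins.

```python
import numpy as np

def drl_estimate(transitions, w_hat, q_hat, pi_e, gamma, initial_states):
    """Doubly robust estimate of the (normalized) discounted policy value.

    transitions:    iterable of (s, a, r, s_next) pooled across trajectories
    w_hat(s, a):    estimated stationary density ratio d_pi_e / d_pi_b
    q_hat(s, a):    estimated q-function of the target policy pi_e
    pi_e(s):        target-policy probabilities over a finite action set
    initial_states: sample of initial states s_0
    """
    def v_hat(s):
        # v(s) = E_{a ~ pi_e}[q(s, a)] for a finite action space.
        return sum(p * q_hat(s, a) for a, p in enumerate(pi_e(s)))

    # Direct term: (1 - gamma) times the average of v_hat over initial states.
    direct = (1.0 - gamma) * np.mean([v_hat(s0) for s0 in initial_states])

    # Correction term: density-ratio-weighted Bellman residuals. It has mean
    # zero when q_hat is correct and debiases the direct term when w_hat is
    # correct -- hence the double robustness.
    correction = np.mean([
        w_hat(s, a) * (r + gamma * v_hat(s_next) - q_hat(s, a))
        for (s, a, r, s_next) in transitions
    ])
    return direct + correction
```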
Efficient Bayes-Adaptive Reinforcement Learning using Sample-Based Search
Bayesian model-based reinforcement learning is a formally elegant approach to
learning optimal behaviour under model uncertainty, trading off exploration and
exploitation in an ideal way. Unfortunately, finding the resulting
Bayes-optimal policies is notoriously taxing, since the search space becomes
enormous. In this paper we introduce a tractable, sample-based method for
approximate Bayes-optimal planning which exploits Monte-Carlo tree search. Our
approach outperformed prior Bayesian model-based RL algorithms by a significant
margin on several well-known benchmark problems -- because it avoids expensive
applications of Bayes' rule within the search tree by lazily sampling models
from the current beliefs. We illustrate the advantages of our approach by
showing that it works in an infinite state-space domain that is qualitatively
out of reach of almost all previous work in Bayesian exploration.

Comment: 14 pages, 7 figures, includes supplementary material. Advances in
Neural Information Processing Systems (NIPS) 2012.
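As a rough illustration of the lazy-sampling idea, the sketch below draws one
model from the current posterior per simulation and reuses it for that entire
simulation, so Bayes' rule is never applied inside the tree. It simplifies the
actual algorithm (which uses UCT action selection and learned rollout
policies); sample_model and model.step are hypothetical interfaces.

```python
import random
from collections import defaultdict

def root_sampling_search(root_state, sample_model, actions,
                         depth=15, num_simulations=1000, gamma=0.95, eps=0.1):
    """Plan from root_state by Monte-Carlo search with lazy model sampling.

    sample_model() draws one MDP simulator from the current posterior; the
    sample is reused for a whole simulation, so no posterior updates are
    needed inside the search tree. States must be hashable.
    """
    q = defaultdict(float)  # (state, action) -> running value estimate
    n = defaultdict(int)    # (state, action) -> visit count

    def simulate(state, model, d):
        if d == 0:
            return 0.0
        # Epsilon-greedy selection (the real algorithm would use UCT here).
        if random.random() < eps:
            a = random.choice(actions)
        else:
            a = max(actions, key=lambda act: q[(state, act)])
        next_state, reward = model.step(state, a)  # step in the sampled MDP
        ret = reward + gamma * simulate(next_state, model, d - 1)
        n[(state, a)] += 1
        q[(state, a)] += (ret - q[(state, a)]) / n[(state, a)]  # running mean
        return ret

    for _ in range(num_simulations):
        model = sample_model()  # one posterior sample per simulation
        simulate(root_state, model, depth)

    # Greedy action at the root under the averaged value estimates.
    return max(actions, key=lambda act: q[(root_state, act)])
```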