Future-Dependent Value-Based Off-Policy Evaluation in POMDPs
We study off-policy evaluation (OPE) for partially observable MDPs (POMDPs)
with general function approximation. Existing methods such as sequential
importance sampling estimators and fitted-Q evaluation suffer from the curse of
horizon in POMDPs. To circumvent this problem, we develop a novel model-free
OPE method by introducing future-dependent value functions that take future
proxies as inputs and play a role similar to that of classical value functions
in fully observable MDPs. We derive a new off-policy Bellman equation for
future-dependent value functions as conditional moment equations that use
history proxies as instrumental variables. We further propose a minimax
learning method to learn future-dependent value functions using the new
Bellman equation. We obtain a PAC result, which implies our OPE estimator is
close to the true policy value under Bellman completeness, as long as futures
and histories contain sufficient information about latent states. Finally, we
extend our methods to learning of dynamics and establish the connection
between our approach and the well-known spectral learning methods in POMDPs.
Our code is available at https://github.com/aiueola/neurips2023-future-dependent-ope
Comment: This paper was accepted at NeurIPS 202
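To make the abstract's construction concrete, the off-policy Bellman equation for a future-dependent value function can be sketched as a conditional moment restriction with the history proxy serving as the instrument, and the minimax estimator as a criterion over two function classes. The notation below (importance ratio $\mu$, value function $g$, future proxy $F_t$, history proxy $H_t$, test class $\mathcal{H}$, regularizer $\lambda$) is our illustrative shorthand, not necessarily the paper's exact operators:

```latex
% Conditional moment restriction (schematic): the Bellman residual,
% reweighted by the per-step policy ratio, has zero mean given the history proxy.
\mathbb{E}\!\left[\,\mu(A_t \mid O_t)\bigl(R_t + \gamma\, g(F_{t+1})\bigr) - g(F_t)\;\middle|\;H_t\,\right] = 0

% Minimax estimation (schematic): the test function h(H_t) plays the role
% of the instrument, as in conditional moment / IV estimation.
\hat g \;=\; \arg\min_{g \in \mathcal{G}} \; \max_{h \in \mathcal{H}} \;
\mathbb{E}_n\!\Bigl[\bigl\{\mu(A_t \mid O_t)\bigl(R_t + \gamma\, g(F_{t+1})\bigr) - g(F_t)\bigr\}\, h(H_t)\Bigr] \;-\; \lambda \,\mathbb{E}_n\!\left[h(H_t)^2\right]
```

The inner maximization searches for a history-measurable violation of the moment condition; driving it to zero for all $h \in \mathcal{H}$ is what replaces the classical fixed-point Bellman update that is unavailable when the latent state is unobserved.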
The Role of Coverage in Online Reinforcement Learning
Coverage conditions -- which assert that the data logging distribution
adequately covers the state space -- play a fundamental role in determining the
sample complexity of offline reinforcement learning. While such conditions
might seem irrelevant to online reinforcement learning at first glance, we
establish a new connection by showing -- somewhat surprisingly -- that the mere
existence of a data distribution with good coverage can enable sample-efficient
online RL. Concretely, we show that coverability -- that is, existence of a
data distribution that satisfies a ubiquitous coverage condition called
concentrability -- can be viewed as a structural property of the underlying
MDP, and can be exploited by standard algorithms for sample-efficient
exploration, even when the agent does not know said distribution. We complement
this result by proving that several weaker notions of coverage, despite being
sufficient for offline RL, are insufficient for online RL. We also show that
existing complexity measures for online RL, including Bellman rank and
Bellman-Eluder dimension, fail to optimally capture coverability, and propose a
new complexity measure, the sequential extrapolation coefficient, to provide a
unification.
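For readers unfamiliar with the terms, the coverage notions in this abstract can be sketched as follows. The symbols below ($d^\pi_h$ for the state-action occupancy of policy $\pi$ at step $h$, $\mu$ for the data distribution, $\Pi$ for the policy class) are one standard way to formalize these quantities, given here for orientation rather than as the paper's exact definitions:

```latex
% Concentrability of a fixed data distribution \mu (offline-RL style):
C_{\mathrm{conc}}(\mu) \;=\; \sup_{\pi \in \Pi}\; \max_{h}\;
\left\| \frac{d^{\pi}_{h}}{\mu_h} \right\|_{\infty}
\;=\; \sup_{\pi \in \Pi}\; \max_{h}\; \sup_{s,a}\; \frac{d^{\pi}_{h}(s,a)}{\mu_h(s,a)}

% Coverability: the best concentrability achievable by ANY data distribution,
% which makes it a property of the MDP and policy class alone --
% no dataset needs to exist, only a good covering distribution.
C_{\mathrm{cov}} \;=\; \inf_{\mu}\; C_{\mathrm{conc}}(\mu)
```

The abstract's key point is the switch of quantifiers: offline RL needs the logged $\mu$ itself to have small $C_{\mathrm{conc}}(\mu)$, whereas online RL can exploit small $C_{\mathrm{cov}}$ without the agent ever knowing the minimizing $\mu$.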
The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning
While distributional reinforcement learning (RL) has demonstrated empirical
success, the question of when and why it is beneficial has remained unanswered.
In this work, we provide one explanation for the benefits of distributional RL
through the lens of small-loss bounds, which scale with the instance-dependent
optimal cost. If the optimal cost is small, our bounds are stronger than those
from non-distributional approaches. As a warmup, we show that learning the cost
distribution leads to small-loss regret bounds in contextual bandits (CB), and
we find that distributional CB empirically outperforms the state-of-the-art on
three challenging tasks. For online RL, we propose a distributional
version-space algorithm that constructs confidence sets using maximum
likelihood estimation, and we prove that it achieves small-loss regret in
tabular MDPs and enjoys small-loss PAC bounds in latent variable models.
Building on similar insights, we propose a distributional offline RL algorithm
based on the pessimism principle and prove that it enjoys small-loss PAC
bounds, which exhibit a novel robustness property. For both online and offline
RL, our results provide the first theoretical benefits of learning
distributions even when we only need the mean for making decisions.
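The small-loss idea in the warmup (contextual bandit) case can be illustrated with a toy simulation. The sketch below is our own simplification, not the paper's algorithm: a stateless two-armed bandit with binary costs, where each arm's cost distribution is fit by maximum likelihood (empirical frequencies) and the exploration bonus scales with the estimated mean cost. Because the bonus shrinks with the optimal cost, low-cost arms are identified quickly, which is the intuition behind first-order (small-loss) regret bounds:

```python
import math
import random

def run_small_loss_bandit(cost_probs, horizon=5000, seed=0):
    """Toy stateless bandit illustrating small-loss (first-order) bonuses.

    Each arm's binary cost distribution is estimated by maximum likelihood
    (empirical frequencies). The bonus sqrt(2 * mu * log t / n) + log t / n
    scales with the estimated mean cost mu, so when the optimal cost is
    small the confidence intervals shrink faster than with a worst-case
    sqrt(log t / n) bonus. This is an illustrative sketch, not the
    distributional version-space algorithm from the paper.
    """
    rng = random.Random(seed)
    k = len(cost_probs)
    n = [0] * k           # pull counts per arm
    total = [0.0] * k     # cumulative observed cost per arm
    cum_cost = 0.0
    for t in range(1, horizon + 1):
        lcb = []
        for a in range(k):
            if n[a] == 0:
                lcb.append(-math.inf)  # force one pull of every arm
            else:
                mu = total[a] / n[a]   # MLE of the mean cost
                bonus = math.sqrt(2 * mu * math.log(t) / n[a]) + math.log(t) / n[a]
                lcb.append(mu - bonus)  # optimistic (lower) cost estimate
        a = min(range(k), key=lambda i: lcb[i])  # minimize cost optimistically
        cost = 1.0 if rng.random() < cost_probs[a] else 0.0
        n[a] += 1
        total[a] += cost
        cum_cost += cost
    return cum_cost, n

# Arm 0 has small optimal cost (0.05); arm 1 is much worse (0.5).
cost, pulls = run_small_loss_bandit([0.05, 0.5])
```

Running this, the low-cost arm accumulates the vast majority of pulls, and the cumulative cost stays close to the small optimal cost rather than the worst-case rate.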