The quintessential model-based reinforcement-learning agent iteratively
refines its estimates or prior beliefs about the true underlying model of the
environment. Recent empirical successes in model-based reinforcement learning
with function approximation, however, eschew the true model in favor of a
surrogate that, while ignoring various facets of the environment, still
facilitates effective planning over behaviors. Recently formalized as the value
equivalence principle, this algorithmic technique is perhaps unavoidable as
real-world reinforcement learning demands consideration of a simple,
computationally-bounded agent interacting with an overwhelmingly complex
environment, whose underlying dynamics likely exceed the agent's capacity for
representation. In this work, we consider the scenario where agent limitations
may entirely preclude identifying an exactly value-equivalent model,
immediately giving rise to a trade-off between identifying a model that is
simple enough to learn while only incurring bounded sub-optimality. To address
this problem, we introduce an algorithm that, using rate-distortion theory,
iteratively computes an approximately-value-equivalent, lossy compression of
the environment which an agent may feasibly target in lieu of the true model.
We prove an information-theoretic, Bayesian regret bound for our algorithm that
holds for any finite-horizon, episodic sequential decision-making problem.
Crucially, our regret bound can be expressed in one of two possible forms,
providing a performance guarantee for finding either the simplest model that
achieves a desired sub-optimality gap or, alternatively, the best model given a
limit on agent capacity.Comment: Accepted to Neural Information Processing Systems (NeurIPS) 202