We present new algorithms for computing and approximating bisimulation
metrics in Markov Decision Processes (MDPs). Bisimulation metrics are an
elegant formalism that capture behavioral equivalence between states and
provide strong theoretical guarantees on differences in optimal behaviour.
Unfortunately, their computation is expensive and requires a tabular
representation of the states, which has thus far rendered them impractical for
large problems. In this paper we present a new version of the metric that is
tied to a behavior policy in an MDP, along with an analysis of its theoretical
properties. We then present two new algorithms for approximating bisimulation
metrics in large, deterministic MDPs. The first does so via sampling and is
guaranteed to converge to the true metric. The second is a differentiable loss
which allows us to learn an approximation even for continuous state MDPs, which
prior to this work had not been possible.Comment: To appear in Proceedings of the Thirty-Fourth AAAI Conference on
Artificial Intelligence (AAAI-20