9 research outputs found

    Interpreting Distributional Reinforcement Learning: A Regularization Perspective

    Distributional reinforcement learning (RL) is a class of state-of-the-art algorithms that estimate the whole distribution of the total return rather than only its expectation. Despite the remarkable performance of distributional RL, a theoretical understanding of its advantages over expectation-based RL remains elusive. In this paper, we attribute the superiority of distributional RL to a regularization effect arising from the value distribution information beyond its expectation. First, by leveraging a variant of the gross error model from robust statistics, we decompose the value distribution into its expectation and the remaining distribution part. The extra benefit of distributional RL over expectation-based RL can then be interpreted mainly as the impact of a risk-sensitive entropy regularization within the Neural Fitted Z-Iteration framework. We also establish a bridge between the risk-sensitive entropy regularization of distributional RL and the vanilla entropy in maximum entropy RL, focusing specifically on actor-critic algorithms. This bridge reveals that distributional RL induces a corrected reward function and thus promotes risk-sensitive exploration against the intrinsic uncertainty of the environment. Finally, extensive experiments corroborate the role of the regularization effect of distributional RL and uncover the mutual impacts of different entropy regularizations. Our research paves the way towards better interpreting the efficacy of distributional RL algorithms, especially through the lens of regularization.

    Value-Distributional Model-Based Reinforcement Learning

    Quantifying uncertainty about a policy's long-term performance is important for solving sequential decision-making tasks. We study the problem from a model-based Bayesian reinforcement learning perspective, where the goal is to learn the posterior distribution over value functions induced by parameter (epistemic) uncertainty of the Markov decision process. Previous work restricts the analysis to a few moments of the distribution over values or imposes a particular distribution shape, e.g., Gaussians. Inspired by distributional reinforcement learning, we introduce a Bellman operator whose fixed point is the value distribution function. Based on our theory, we propose Epistemic Quantile-Regression (EQR), a model-based algorithm that learns a value distribution function that can be used for policy optimization. Evaluation across several continuous-control tasks shows performance benefits with respect to established model-based and model-free algorithms.
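
    The abstract above names quantile regression as the mechanism for learning a value distribution. As a rough illustration of that generic building block only (not the EQR algorithm itself), the sketch below fits a few quantiles of a sample set by subgradient descent on the pinball loss. The function names, learning rate, and step count are assumptions chosen for illustration and do not come from the paper.

```python
import numpy as np


def pinball_loss(theta, samples, tau):
    """Average quantile-regression (pinball) loss of the estimate theta at level tau."""
    u = samples - theta
    return float(np.mean(np.where(u >= 0, tau * u, (tau - 1.0) * u)))


def fit_quantiles(samples, taus, lr=0.05, steps=2000):
    """Fit one quantile estimate per level in taus by subgradient descent.

    Hypothetical helper for illustration; lr and steps are arbitrary defaults.
    """
    theta = np.zeros(len(taus))
    for _ in range(steps):
        # Subgradient of the mean pinball loss w.r.t. theta:
        # (fraction of samples below theta) - tau.
        below = (samples[None, :] < theta[:, None]).mean(axis=1)
        theta -= lr * (below - taus)
    return theta


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    returns = rng.normal(size=5000)  # stand-in for sampled returns
    levels = np.array([0.1, 0.5, 0.9])
    print(fit_quantiles(returns, levels))
    print(np.quantile(returns, levels))  # empirical quantiles for comparison
```

    The fitted values should land close to the empirical quantiles, because the pinball loss at level tau is minimized at the tau-quantile of the samples.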

    The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning

    We study the multi-step off-policy learning approach to distributional RL. Despite the apparent similarity between value-based RL and distributional RL, our study reveals intriguing and fundamental differences between the two cases in the multi-step setting. We identify a novel notion of path-dependent distributional TD error, which is indispensable for principled multi-step distributional RL. The distinction from the value-based case bears important implications for concepts such as backward-view algorithms. Our work provides the first theoretical guarantees on multi-step off-policy distributional RL algorithms, including results that apply to the small number of existing approaches to multi-step distributional RL. In addition, we derive a novel algorithm, Quantile Regression-Retrace, which leads to a deep RL agent, QR-DQN-Retrace, that shows empirical improvements over QR-DQN on the Atari-57 benchmark. Collectively, we shed light on how unique challenges in multi-step distributional RL can be addressed both in theory and practice.