Interpreting Distributional Reinforcement Learning: A Regularization Perspective
Distributional reinforcement learning (RL) is a class of state-of-the-art
algorithms that estimate the whole distribution of the total return rather than
only its expectation. Despite the remarkable performance of distributional RL,
a theoretical understanding of its advantages over expectation-based RL remains
elusive. In this paper, we attribute the superiority of distributional RL to
its regularization effect, which stems from the value-distribution information
beyond the expectation alone. First, by leveraging a variant of the gross
error model in robust statistics, we decompose the value distribution into its
expectation and the remaining distribution part. As such, the extra benefit of
distributional RL compared with expectation-based RL is mainly interpreted as
the impact of a risk-sensitive entropy regularization within the
Neural Fitted Z-Iteration framework. Meanwhile, we establish a bridge between
the risk-sensitive entropy regularization of distributional RL and the vanilla
entropy in maximum entropy RL, focusing specifically on actor-critic
algorithms. This bridge reveals that distributional RL induces a corrected reward
function and thus promotes risk-sensitive exploration against the intrinsic
uncertainty of the environment. Finally, extensive experiments corroborate the
role of the regularization effect of distributional RL and uncover the mutual
impacts of different entropy regularizations. Our research paves the way towards
better interpreting the efficacy of distributional RL algorithms, especially
through the lens of regularization.
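As a rough illustration of the decomposition described in this abstract (the mixture weight \epsilon and the residual distribution F_{\mu} below are placeholders, not necessarily the paper's exact formulation), a gross-error-style model writes the value distribution's CDF as a point mass at the expectation plus a remaining part:

F_{Z(s,a)}(x) = (1-\epsilon)\,\mathbb{1}\{x \ge \mathbb{E}[Z(s,a)]\} + \epsilon\,F_{\mu(s,a)}(x).

Expectation-based RL keeps only the first term; distributional RL additionally fits F_{\mu(s,a)}, whose entropy is what plays the role of the regularizer discussed above.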
Value-Distributional Model-Based Reinforcement Learning
Quantifying uncertainty about a policy's long-term performance is important
for solving sequential decision-making tasks. We study the problem from a
model-based Bayesian reinforcement learning perspective, where the goal is to
learn the posterior distribution over value functions induced by parameter
(epistemic) uncertainty of the Markov decision process. Previous work restricts
the analysis to a few moments of the distribution over values or imposes a
particular distribution shape, e.g., Gaussians. Inspired by distributional
reinforcement learning, we introduce a Bellman operator whose fixed point is
the value distribution function. Based on our theory, we propose Epistemic
Quantile-Regression (EQR), a model-based algorithm that learns a value
distribution function that can be used for policy optimization. Evaluation
across several continuous-control tasks shows performance benefits with respect
to established model-based and model-free algorithms.
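As a loose, self-contained sketch of the quantile-regression idea behind EQR (the one-state MDP, Gaussian posterior, and hyperparameters below are invented for illustration and are not the paper's setup), one can fit quantiles of the value distribution induced by sampling model parameters from a posterior:

import numpy as np

# Sketch only: a single-state MDP whose mean reward is epistemically uncertain.
# We learn quantiles of the value V = R / (1 - gamma) induced by the posterior
# over R via the quantile-regression (pinball) update.
rng = np.random.default_rng(0)
gamma = 0.9
taus = (np.arange(8) + 0.5) / 8      # quantile levels
theta = np.zeros(8)                  # quantile estimates of the value distribution
lr = 0.05

for _ in range(20000):
    r_mean = rng.normal(1.0, 0.5)    # sample an MDP (its mean reward) from the posterior
    target = r_mean / (1.0 - gamma)  # exact value of the sampled MDP
    # pinball-loss update: move each quantile estimate toward its target level
    theta += lr * (taus - (target < theta))

print(theta)  # roughly the quantiles of N(10, 5), the epistemic value distribution

In the paper, the fixed point of the proposed distributional Bellman operator plays the role that the exact per-model value plays as the regression target in this sketch.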
The Nature of Temporal Difference Errors in Multi-step Distributional Reinforcement Learning
We study the multi-step off-policy learning approach to distributional RL.
Despite the apparent similarity between value-based RL and distributional RL,
our study reveals intriguing and fundamental differences between the two cases
in the multi-step setting. We identify a novel notion of path-dependent
distributional TD error, which is indispensable for principled multi-step
distributional RL. The distinction from the value-based case bears important
implications for concepts such as backward-view algorithms. Our work provides
the first theoretical guarantees on multi-step off-policy distributional RL
algorithms, including results that apply to the small number of existing
approaches to multi-step distributional RL. In addition, we derive a novel
algorithm, Quantile Regression-Retrace, which leads to a deep RL agent,
QR-DQN-Retrace, that shows empirical improvements over QR-DQN on the Atari-57
benchmark. Collectively, we shed light on how unique challenges in multi-step
distributional RL can be addressed in both theory and practice.
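For orientation, here is a minimal sketch of the two standard ingredients that an algorithm like Quantile Regression-Retrace combines, namely Retrace's truncated importance weights and the quantile-regression (pinball) loss; the path-dependent distributional TD error that actually stitches them together is defined in the paper and is not reproduced here:

import numpy as np

def retrace_coeffs(pi_probs, mu_probs, lam=1.0):
    # Truncated per-step importance weights c_t = lam * min(1, pi(a_t|s_t) / mu(a_t|s_t)).
    return lam * np.minimum(1.0, np.asarray(pi_probs) / np.asarray(mu_probs))

def pinball_loss(theta, target, taus):
    # Quantile-regression loss between quantile estimates theta and a scalar target.
    u = target - theta
    return np.mean(np.abs(taus - (u < 0.0)) * np.abs(u))

taus = (np.arange(4) + 0.5) / 4
print(retrace_coeffs([0.9, 0.2, 0.5], [0.5, 0.4, 0.5]))  # -> [1.0, 0.5, 1.0]
print(pinball_loss(np.zeros(4), 1.0, taus))              # -> 0.5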