Distributional Reinforcement Learning with Quantile Regression
In reinforcement learning an agent interacts with the environment by taking
actions and observing the next state and reward. When sampled
probabilistically, these state transitions, rewards, and actions can all induce
randomness in the observed long-term return. Traditionally, reinforcement
learning algorithms average over this randomness to estimate the value
function. In this paper, we build on recent work advocating a distributional
approach to reinforcement learning in which the distribution over returns is
modeled explicitly instead of only estimating the mean. That is, we examine
methods of learning the value distribution instead of the value function. We
give results that close a number of gaps between the theoretical and
algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we
extend existing results to the approximate distribution setting. Second, we
present a novel distributional reinforcement learning algorithm consistent with
our theoretical formulation. Finally, we evaluate this new algorithm on the
Atari 2600 games, observing that it significantly outperforms many of the
recent improvements on DQN, including the related distributional algorithm C51.
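To make the quantile regression idea concrete, the following is a minimal sketch of the quantile Huber loss that a QR-DQN-style agent minimizes; it illustrates the technique rather than reproducing the authors' code, and the names `theta`, `targets`, and `kappa` are assumptions.

```python
import numpy as np

def quantile_huber_loss(theta, targets, kappa=1.0):
    """Quantile Huber loss between N quantile estimates `theta`
    (theta[i] estimates the tau_i = (2*i + 1) / (2*N) quantile of the
    return distribution) and a batch of sampled Bellman `targets`.
    Illustrative sketch, not the paper's implementation."""
    theta = np.asarray(theta, dtype=float)
    targets = np.asarray(targets, dtype=float)
    n = len(theta)
    taus = (np.arange(n) + 0.5) / n                # quantile midpoints
    u = targets[None, :] - theta[:, None]          # pairwise TD errors
    huber = np.where(np.abs(u) <= kappa,           # quadratic near zero,
                     0.5 * u ** 2,                 # linear in the tails
                     kappa * (np.abs(u) - 0.5 * kappa))
    # Asymmetric quantile weighting |tau_i - 1{u < 0}|
    weight = np.abs(taus[:, None] - (u < 0.0).astype(float))
    return (weight * huber).mean()
```

The Huber smoothing keeps the loss differentiable at zero error while the linear tails bound the gradient magnitude, which is what makes the quantile regression objective usable with deep networks.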
Quantile Constrained Reinforcement Learning: A Reinforcement Learning Framework Constraining Outage Probability
Constrained reinforcement learning (RL) is an area of RL whose objective is
to find an optimal policy that maximizes expected cumulative return while
satisfying a given constraint. Most previous constrained RL works take the
expected cumulative sum cost as the constraint. However, optimizing under this
constraint cannot guarantee a target probability for the outage event, i.e.,
the event that the cumulative sum cost exceeds a given threshold. This paper
proposes a framework, named Quantile Constrained RL (QCRL), that constrains the
quantile of the distribution of the cumulative sum cost; constraining this
quantile is a necessary and sufficient condition for satisfying the outage
constraint. This is the first work to tackle the issue of applying the policy
gradient theorem to the quantile and to provide theoretical results for
approximating the gradient of the quantile. Based on these theoretical results
and the Lagrange multiplier technique,
we construct a constrained RL algorithm named Quantile Constrained Policy
Optimization (QCPO). In the implementation of QCPO, we use distributional RL
with the Large Deviation Principle (LDP) to estimate the quantiles and the tail
probability of the cumulative sum cost. The implemented algorithm satisfies the
outage probability constraint after the training period.
Comment: NeurIPS 202
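As a hedged sketch of the Lagrange multiplier mechanics referred to above (the names `dual_ascent_step`, `lagrangian_objective`, `lmbda`, `quantile_est`, `threshold`, and `lr_dual` are illustrative assumptions, not from the paper): the multiplier grows while the estimated cost quantile violates the threshold and relaxes once the constraint holds.

```python
def dual_ascent_step(lmbda, quantile_est, threshold, lr_dual=1e-3):
    """One dual-ascent update on the Lagrange multiplier for a quantile
    constraint: the penalty increases while the estimated quantile of
    the cumulative cost exceeds the allowed threshold, and decreases
    (clipped at zero) once the constraint is satisfied. Sketch only."""
    return max(0.0, lmbda + lr_dual * (quantile_est - threshold))

def lagrangian_objective(expected_return, quantile_est, threshold, lmbda):
    """Penalized objective the policy maximizes: expected return minus
    the multiplier-weighted quantile violation. Sketch only."""
    return expected_return - lmbda * (quantile_est - threshold)
```

Alternating policy updates on the penalized objective with dual-ascent steps on the multiplier is the standard Lagrangian recipe for constrained RL; QCPO's contribution is making the quantile term differentiable via its policy gradient results.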
The Statistical Benefits of Quantile Temporal-Difference Learning for Value Estimation
We study the problem of temporal-difference-based policy evaluation in
reinforcement learning. In particular, we analyse the use of a distributional
reinforcement learning algorithm, quantile temporal-difference learning (QTD),
for this task. We reach the surprising conclusion that even if a practitioner
has no interest in the return distribution beyond the mean, QTD (which learns
predictions about the full distribution of returns) may offer performance
superior to approaches such as classical TD learning, which predict only the
mean return, even in the tabular setting.
Comment: ICML 202
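For intuition, here is a minimal tabular sketch of one quantile temporal-difference update; the names `qtd_update`, `theta`, and `theta_next` are assumptions for illustration, not the paper's code.

```python
import numpy as np

def qtd_update(theta, r, theta_next, gamma=0.99, alpha=0.01):
    """One tabular QTD update. `theta` holds N quantile estimates of the
    return distribution at the current state (theta[i] tracks the
    tau_i = (2*i + 1) / (2*N) quantile); `theta_next` holds the next
    state's estimates. Illustrative sketch."""
    theta = np.array(theta, dtype=float)       # copy before updating
    n = len(theta)
    taus = (np.arange(n) + 0.5) / n
    targets = r + gamma * theta_next           # distributional Bellman targets
    for i in range(n):
        # Quantile regression step: theta[i] moves up with weight tau_i
        # when targets lie above it, down with weight (1 - tau_i) otherwise.
        indicator = (targets < theta[i]).astype(float)
        theta[i] -= alpha * (indicator.mean() - taus[i])
    return theta
```

The mean of the learned quantiles, `theta.mean()`, then serves as the value estimate, which is how QTD can be compared head-to-head with classical TD for mean-value estimation.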