Increasing the Action Gap: New Operators for Reinforcement Learning
This paper introduces new optimality-preserving operators on Q-functions. We
first describe an operator for tabular representations, the consistent Bellman
operator, which incorporates a notion of local policy consistency. We show that
this local consistency leads to an increase in the action gap at each state;
increasing this gap, we argue, mitigates the undesirable effects of
approximation and estimation errors on the induced greedy policies. This
operator can also be applied to discretized continuous space and time problems,
and we provide empirical results evidencing superior performance in this
context. Extending the idea of a locally consistent operator, we then derive
sufficient conditions for an operator to preserve optimality, leading to a
family of operators which includes our consistent Bellman operator. As
corollaries we provide a proof of optimality for Baird's advantage learning
algorithm and derive other gap-increasing operators with interesting
properties. We conclude with an empirical study on 60 Atari 2600 games
illustrating the strong potential of these new operators.
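The consistent Bellman operator described in this abstract has a simple tabular form: it applies the standard Bellman backup and then subtracts a correction proportional to the self-transition probability, which widens the action gap. The sketch below is an illustrative reconstruction for a known tabular MDP, not the authors' code; the array shapes and function name are assumptions.

```python
import numpy as np

def consistent_bellman_backup(Q, r, P, gamma):
    # One sweep of the consistent Bellman operator on a tabular MDP.
    # Q: [S, A] action values, r: [S, A] rewards, P: [S, A, S] transitions.
    S, A = Q.shape
    V = Q.max(axis=1)                  # max_b Q(x', b)
    TQ = r + gamma * P @ V             # standard Bellman backup
    # Subtract gamma * P(x | x, a) * (V(x) - Q(x, a)): the local-consistency
    # correction, which increases the action gap at states with
    # self-transitions while preserving the optimal policy.
    self_p = P[np.arange(S)[:, None], np.arange(A)[None, :],
               np.arange(S)[:, None]]
    return TQ - gamma * self_p * (V[:, None] - Q)
```

On a two-state MDP with pure self-loops, one application starting from Q = r widens the gap at state 0 from 1.0 (standard backup) to 1.9, while the greedy action is unchanged.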
Addressing Function Approximation Error in Actor-Critic Methods
In value-based reinforcement learning methods such as deep Q-learning,
function approximation errors are known to lead to overestimated value
estimates and suboptimal policies. We show that this problem persists in an
actor-critic setting and propose novel mechanisms to minimize its effects on
both the actor and the critic. Our algorithm builds on Double Q-learning, by
taking the minimum value between a pair of critics to limit overestimation. We
draw the connection between target networks and overestimation bias, and
suggest delaying policy updates to reduce per-update error and further improve
performance. We evaluate our method on the suite of OpenAI gym tasks,
outperforming the state of the art in every environment tested.
Comment: Accepted at ICML 201
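The two mechanisms this abstract names, taking the minimum over a pair of critics and delaying policy updates, can be sketched in isolation. This is a minimal illustration of the ideas, assuming array-valued critic estimates; the function and class names are not from the paper.

```python
import numpy as np

def clipped_double_q_target(r, done, q1_next, q2_next, gamma=0.99):
    # Bootstrap target built from the elementwise minimum of two critic
    # estimates at the next state-action, so an overestimate in either
    # critic cannot inflate the target.
    return r + gamma * (1.0 - done) * np.minimum(q1_next, q2_next)

class DelayedUpdater:
    # Update the actor (and target networks) only every `policy_delay`
    # critic steps, reducing per-update error from a still-changing critic.
    def __init__(self, policy_delay=2):
        self.policy_delay = policy_delay
        self.step = 0

    def should_update_policy(self):
        self.step += 1
        return self.step % self.policy_delay == 0
```

With r = 1, a non-terminal transition, and critic estimates 2.0 and 3.0, the target is 1 + 0.99 * 2.0 = 2.98: the lower estimate always wins.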
Estimating the maximum expected value in continuous reinforcement learning problems
This paper concerns the estimation of the maximum expected value of an infinite set of random variables. This estimation problem is relevant in many fields, including Reinforcement Learning (RL). In RL it is well known that, in some stochastic environments, bias in the estimation error can accumulate step by step, compounding the approximation error and leading to large overestimates of the true action values. Several approaches have recently been proposed to reduce this bias and obtain better action-value estimates, but they are limited to finite problems. In this paper, we leverage the recently proposed weighted estimator together with Gaussian process regression to derive a new method that natively handles infinitely many random variables. We show how these techniques can be used to address both continuous-state and continuous-action RL problems. To evaluate the effectiveness of the proposed approach, we perform empirical comparisons with related approaches.
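The weighted estimator underlying this work can be illustrated in the finite case: each sample mean is weighted by the probability that its variable is the maximum, avoiding both the positive bias of max(means) and the negative bias of double estimation. The Monte Carlo sketch below is only the finite-arm building block; the paper's contribution, extending this to infinitely many variables via Gaussian process regression, is not reproduced here.

```python
import numpy as np

def weighted_max_estimate(means, stds, n_samples=10000, rng=None):
    # Finite-arm weighted estimator of max_i E[X_i]: weight each sample
    # mean by the (Monte Carlo) probability that arm i attains the maximum,
    # assuming Gaussian sampling distributions with the given means/stds.
    rng = np.random.default_rng(rng)
    draws = rng.normal(means, stds, size=(n_samples, len(means)))
    w = np.bincount(draws.argmax(axis=1), minlength=len(means)) / n_samples
    return float(w @ means)
```

When one arm clearly dominates (means 1.0 vs 0.0 with small stds), nearly all weight falls on it and the estimate approaches 1.0; with overlapping arms, the weights interpolate between the sample means instead of always taking the largest.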
Suppressing Overestimation in Q-Learning through Adversarial Behaviors
The goal of this paper is to propose a new Q-learning algorithm with a dummy
adversarial player, which is called dummy adversarial Q-learning (DAQ), that
can effectively regulate the overestimation bias in standard Q-learning. With
the dummy player, the learning can be formulated as a two-player zero-sum game.
The proposed DAQ unifies several Q-learning variations to control
overestimation biases, such as maxmin Q-learning and minmax Q-learning
(proposed in this paper) in a single framework. The proposed DAQ is a simple
but effective way to suppress the overestimation bias through dummy adversarial
behaviors and can easily be applied to off-the-shelf reinforcement learning
algorithms to improve their performance. The finite-time convergence of DAQ is
analyzed from an integrated perspective by adapting an adversarial Q-learning
analysis. The performance of the suggested DAQ is empirically demonstrated in
various benchmark environments.
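The abstract does not give DAQ's update rule, but one of the variants it unifies, maxmin Q-learning, has a well-known target: minimize over an ensemble of Q-tables before maximizing over actions. The sketch below shows that target only, as an assumed tabular illustration, not the DAQ algorithm itself.

```python
import numpy as np

def maxmin_q_target(r, s_next, Qs, gamma, done=False):
    # Maxmin Q-learning target (one of the variants DAQ unifies): take the
    # elementwise minimum over an ensemble of Q-tables before maximizing,
    # counteracting the max-operator's overestimation bias.
    # Qs: [N, S, A] ensemble of tabular Q estimates.
    q_min = np.min(Qs, axis=0)          # [S, A] pessimistic ensemble value
    return r + gamma * (0.0 if done else q_min[s_next].max())
```

With two Q-tables [[1, 2]] and [[3, 0]], the pessimistic values are [1, 0], so the bootstrap uses 1 rather than the optimistic max of 3.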
Deep Reinforcement Learning with Weighted Q-Learning
Overestimation of the maximum action-value is a well-known problem that
hinders Q-Learning performance, leading to suboptimal policies and unstable
learning. Among several Q-Learning variants proposed to address this issue,
Weighted Q-Learning (WQL) effectively reduces the bias and shows remarkable
results in stochastic environments. WQL uses a weighted sum of the estimated
action-values, where the weights correspond to the probability of each
action-value being the maximum; however, the computation of these probabilities
is only practical in tabular settings. In this work, we provide the
methodological advances to benefit from the WQL properties in Deep
Reinforcement Learning (DRL), by using neural networks with Dropout Variational
Inference as an effective approximation of deep Gaussian processes. In
particular, we adopt the Concrete Dropout variant to obtain calibrated
estimates of epistemic uncertainty in DRL. We show that model uncertainty in
DRL can be useful not only for action selection, but also action evaluation. We
analyze how the novel Weighted Deep Q-Learning algorithm reduces the bias
w.r.t. relevant baselines and provide empirical evidence of its advantages on
several representative benchmarks.
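The core of the deep WQL scheme described above is the weight computation: with dropout kept active at inference, repeated stochastic forward passes yield Monte Carlo samples of the Q-values, and each action's weight is the fraction of samples in which it is the argmax. The sketch below assumes such samples are already collected; the function names and shapes are illustrative, not the paper's implementation.

```python
import numpy as np

def wql_weights(q_samples):
    # q_samples: [n_dropout_samples, n_actions] Q-estimates from stochastic
    # forward passes. The weight of action a is the fraction of samples in
    # which a attains the maximum, approximating P(a is the argmax).
    n, n_actions = q_samples.shape
    return np.bincount(q_samples.argmax(axis=1), minlength=n_actions) / n

def weighted_q_value(q_samples):
    # Weighted estimate of max_a Q(s, a): argmax probabilities times the
    # mean action-values, rather than the (overestimating) plain max.
    return float(wql_weights(q_samples) @ q_samples.mean(axis=0))
```

With four samples where action 0 wins three times, the weights are [0.75, 0.25] and the weighted value mixes both mean action-values instead of committing to the larger one.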