Overestimation of the maximum action-value is a well-known problem that
hinders Q-Learning performance, leading to suboptimal policies and unstable
learning. Among several Q-Learning variants proposed to address this issue,
Weighted Q-Learning (WQL) effectively reduces the bias and shows remarkable
results in stochastic environments. WQL uses a weighted sum of the estimated
action-values, where the weights correspond to the probability of each
action-value being the maximum; however, computing these probabilities
is practical only in tabular settings. In this work, we provide the
methodological advances needed to bring the benefits of WQL to Deep
Reinforcement Learning (DRL), by using neural networks with Dropout Variational
Inference as an effective approximation of deep Gaussian processes. In
particular, we adopt the Concrete Dropout variant to obtain calibrated
estimates of epistemic uncertainty in DRL. We show that model uncertainty in
DRL can be useful not only for action selection but also for action evaluation. We
analyze how the novel Weighted Deep Q-Learning algorithm reduces the bias
relative to relevant baselines and provide empirical evidence of its advantages on
several representative benchmarks.
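To make the weighted estimator concrete, here is a minimal sketch in plain Python. It assumes we already have several sampled Q-value vectors for a single state, for example one per stochastic forward pass of a network with dropout kept active at evaluation time; how the samples are produced is an assumption here, and the function name `weighted_max_estimate` is hypothetical. Each weight w_a is estimated as the fraction of samples in which action a has the largest value, and the result is the sum over a of w_a times the mean sampled Q(s, a), used in place of the plain maximum.

```python
from collections import Counter
from typing import Sequence


def weighted_max_estimate(samples: Sequence[Sequence[float]]) -> float:
    """Weighted estimate of max_a Q(s, a) from sampled Q-value vectors.

    samples[i][a] is the value of action a in the i-th sample
    (hypothetically, one dropout forward pass for a fixed state s).
    """
    n_samples = len(samples)
    n_actions = len(samples[0])
    # w_a ~ P(action a is the maximum): fraction of samples where a is the
    # argmax. max() returns the first maximal index, so ties go to the lowest a.
    wins = Counter(max(range(n_actions), key=lambda a: s[a]) for s in samples)
    weights = [wins[a] / n_samples for a in range(n_actions)]
    # Point estimate of each action-value: the mean over samples.
    means = [sum(s[a] for s in samples) / n_samples for a in range(n_actions)]
    # Weighted estimator: sum_a w_a * Q(s, a) instead of max_a Q(s, a).
    return sum(w * m for w, m in zip(weights, means))


samples = [[1.0, 0.0], [0.0, 2.0], [1.0, 0.5], [0.5, 1.5]]
print(weighted_max_estimate(samples))  # 0.8125
```

In this example the mean action-values are 0.625 and 1.0, so the plain maximum estimator would return 1.0, while each action is the argmax in half of the samples and the weighted estimate is 0.8125. Because the result is a convex combination of the mean action-values, it can never exceed their maximum, which is the mechanism by which the weighted estimator tempers overestimation.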