The Potential of the Return Distribution for Exploration in RL
This paper studies the potential of the return distribution for exploration
in deterministic reinforcement learning (RL) environments. We study network
losses and propagation mechanisms for Gaussian, Categorical and Gaussian
mixture distributions. Combined with exploration policies that leverage this
return distribution, we solve, for example, a randomized Chain task of length
100, which has not been reported before when learning with neural networks.
Comment: Published at the Exploration in Reinforcement Learning Workshop at
the 35th International Conference on Machine Learning, Stockholm, Sweden.
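As a rough illustration of how a learned return distribution can drive exploration, the sketch below samples one plausible return per action from a per-action Gaussian N(mu, sigma) and acts greedily on the samples. The network that would produce mu and sigma, and all numbers shown, are assumptions for illustration rather than the paper's actual setup.

```python
import numpy as np

def return_sampling_action(mu, sigma, rng):
    """Pick an action by drawing one sample per action from a learned
    Gaussian return distribution N(mu[a], sigma[a]) and acting greedily
    on the draws, so actions with uncertain returns still get tried."""
    samples = rng.normal(mu, sigma)       # one return sample per action
    return int(np.argmax(samples))

# Toy usage: 4 actions, assumed estimated mean returns and spreads.
rng = np.random.default_rng(0)
mu = np.array([1.0, 0.9, 0.5, 0.2])       # estimated mean returns
sigma = np.array([0.1, 0.6, 0.05, 0.8])   # return-distribution spread
print(return_sampling_action(mu, sigma, rng))
```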
Distributional Policy Optimization: An Alternative Approach for Continuous Control
We identify a fundamental problem in policy gradient-based methods in
continuous control. As policy gradient methods require the agent's underlying
probability distribution, they limit policy representation to parametric
distribution classes. We show that optimizing over such sets results in local
movement in the action space and thus convergence to sub-optimal solutions. We
suggest a novel distributional framework, able to represent arbitrary
distribution functions over the continuous action space. Using this framework,
we construct a generative scheme, trained using an off-policy actor-critic
paradigm, which we call the Generative Actor Critic (GAC). Compared to policy
gradient methods, GAC does not require knowledge of the underlying probability
distribution, thereby overcoming these limitations. Empirical evaluation shows
that our approach is comparable to, and often surpasses, current
state-of-the-art baselines in continuous domains.
Comment: Accepted to NeurIPS 2019.
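The following minimal sketch illustrates the general idea of a generative (implicit) actor that is not tied to a parametric distribution class: it maps state plus noise directly to an action, so repeated forward passes trace out an arbitrarily shaped action distribution. The architecture, sizes, and names below are assumptions for illustration, not GAC's actual implementation.

```python
import torch
import torch.nn as nn

class ImplicitGenerativeActor(nn.Module):
    """Instead of outputting the parameters of a fixed distribution class
    (e.g. a Gaussian), this actor maps (state, noise) to an action, so the
    induced action distribution is not restricted to a parametric family."""
    def __init__(self, state_dim, action_dim, noise_dim=8, hidden=64):
        super().__init__()
        self.noise_dim = noise_dim
        self.net = nn.Sequential(
            nn.Linear(state_dim + noise_dim, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, action_dim), nn.Tanh(),  # bounded actions
        )

    def forward(self, state):
        z = torch.rand(state.shape[0], self.noise_dim)  # sampling noise
        return self.net(torch.cat([state, z], dim=-1))

# Usage: several action samples for the same state; their empirical
# distribution can be multi-modal or skewed, unlike a fixed Gaussian policy.
actor = ImplicitGenerativeActor(state_dim=3, action_dim=2)
state = torch.zeros(5, 3)
print(actor(state).shape)  # torch.Size([5, 2])
```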
Variational inference for the multi-armed contextual bandit
In many biomedical, scientific, and engineering problems, one must sequentially
decide which action to take next so as to maximize rewards. One general class
of algorithms for optimizing interactions with the world, while simultaneously
learning how the world operates, is the multi-armed bandit setting and, in
particular, the contextual bandit case. In this setting, for each executed
action, one observes rewards that are dependent on a given 'context', available
at each interaction with the world. The Thompson sampling algorithm has
recently been shown to enjoy provable optimality properties for this set of
problems, and to perform well in real-world settings. It facilitates generative
and interpretable modeling of the problem at hand. Nevertheless, the design and
complexity of the model limit its application, since one must both sample from
the distributions modeled and calculate their expected rewards. We here show
how these limitations can be overcome using variational inference to
approximate complex models, applying to the reinforcement learning setting
advances in approximate inference developed by the machine learning community
over the past two decades. We consider contextual multi-armed bandit
applications where the true reward distribution is unknown and complex, which
we approximate with a mixture model whose parameters are inferred via
variational inference. We show how the proposed variational Thompson sampling
approach is accurate in approximating the true distribution and attains
reduced regret even with complex reward distributions. The proposed algorithm
is valuable for practical scenarios where restrictive modeling assumptions are
undesirable.
Comment: The software used for this study is publicly available at
https://github.com/iurteaga/bandit
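A minimal sketch of one Thompson sampling round with an approximate posterior per arm is shown below. A Gaussian q over linear reward weights stands in for the fitted variational mixture model described above; all names and values are assumed for illustration.

```python
import numpy as np

def variational_thompson_step(ctx, q_means, q_covs, rng):
    """One round of Thompson sampling with an approximate (variational)
    posterior per arm: sample reward-model parameters from each arm's
    q(theta_a) = N(q_means[a], q_covs[a]), score the current context, and
    play the arm with the highest sampled expected reward."""
    scores = []
    for mean, cov in zip(q_means, q_covs):
        theta = rng.multivariate_normal(mean, cov)  # draw from q(theta_a)
        scores.append(ctx @ theta)                  # sampled expected reward
    return int(np.argmax(scores))

# Toy usage: 3 arms, 2-dimensional context (all values assumed).
rng = np.random.default_rng(1)
q_means = [np.array([0.2, 0.1]), np.array([0.0, 0.5]), np.array([0.4, -0.2])]
q_covs = [0.05 * np.eye(2)] * 3
print(variational_thompson_step(np.array([1.0, 0.3]), q_means, q_covs, rng))
```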
SENTINEL: Taming Uncertainty with Ensemble-based Distributional Reinforcement Learning
In this paper, we consider risk-sensitive sequential decision-making in
model-based reinforcement learning (RL). We introduce a novel quantification of
risk, namely \emph{composite risk}, which takes into account both aleatory and
epistemic risk during the learning process. Previous works have considered
aleatory or epistemic risk individually, or an additive combination of the two.
We demonstrate that the additive formulation is a particular case of the
composite risk, and that it underestimates the actual CVaR risk even while
learning a mixture of Gaussians. In contrast, the composite risk provides a
more accurate estimate. We propose a bootstrapping method, SENTINEL-K, for
distributional RL. SENTINEL-K uses an ensemble of K learners to estimate the
return distribution and additionally uses Follow The Regularized Leader (FTRL)
from the bandit literature to provide a better estimate of the risk on the
return distribution. Finally, we experimentally verify that SENTINEL-K
estimates the return distribution better and, when used with the composite risk
estimate, demonstrates better risk-sensitive performance than competing RL
algorithms.
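As a rough illustration of the composite-risk idea (not the paper's estimator), the sketch below pools return samples from K ensemble members according to given weights, so that both aleatory spread within a learner and epistemic disagreement across learners contribute to the tail, then computes a lower-tail CVaR. The FTRL weights and all numbers are assumed.

```python
import numpy as np

def cvar(samples, alpha=0.1):
    """Lower-tail CVaR_alpha: mean of the worst alpha-fraction of returns."""
    s = np.sort(np.asarray(samples))
    k = max(1, int(np.ceil(alpha * len(s))))
    return s[:k].mean()

def composite_cvar(ensemble_samples, weights, rng, alpha=0.1, n_pool=3000):
    """Pool return samples from the K learners as a weighted mixture
    (weights could come from FTRL; here they are assumed given), so both
    within-learner spread and across-learner disagreement enter the tail,
    then take CVaR of the pooled samples."""
    counts = rng.multinomial(n_pool, np.asarray(weights) / np.sum(weights))
    pooled = np.concatenate([
        rng.choice(s, size=c, replace=True)
        for s, c in zip(ensemble_samples, counts)
    ])
    return cvar(pooled, alpha)

# Toy usage: K = 3 learners with disagreeing return estimates (assumed data).
rng = np.random.default_rng(2)
ensemble = [rng.normal(m, 1.0, size=500) for m in (0.0, 0.5, -1.0)]
print(composite_cvar(ensemble, weights=[0.4, 0.4, 0.2], rng=rng))
```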
Model Selection in Bayesian Neural Networks via Horseshoe Priors
Bayesian Neural Networks (BNNs) have recently received increasing attention
for their ability to provide well-calibrated posterior uncertainties. However,
model selection---even choosing the number of nodes---remains an open question.
In this work, we apply a horseshoe prior over node pre-activations of a
Bayesian neural network, which effectively turns off nodes that do not help
explain the data. We demonstrate that our prior prevents the BNN from
under-fitting even when the number of nodes required is grossly over-estimated.
Moreover, this model selection over the number of nodes does not come at the
expense of predictive or computational performance; in fact, we learn smaller
networks with comparable predictive performance to current approaches.
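For intuition, the sketch below draws node-level scales from a horseshoe prior (half-Cauchy local scales times a half-Cauchy global scale) and uses them to scale a layer's incoming weights; nodes whose scales collapse toward zero are effectively switched off. Hyperparameters and shapes are assumed, and this is a prior-sampling illustration rather than the paper's inference procedure.

```python
import numpy as np

def sample_horseshoe_node_scales(n_nodes, global_scale=1.0, rng=None):
    """Horseshoe-style node scales: a local half-Cauchy scale per node,
    multiplied by a shared global half-Cauchy scale. Small scales shrink
    all weights feeding into a node, effectively turning it off."""
    rng = rng or np.random.default_rng()
    lam = np.abs(rng.standard_cauchy(n_nodes))           # local scales
    tau = np.abs(rng.standard_cauchy()) * global_scale   # global scale
    return tau * lam

# Toy usage: scale the incoming weights of a 16-node hidden layer.
rng = np.random.default_rng(3)
scales = sample_horseshoe_node_scales(16, rng=rng)
weights = rng.normal(0.0, 1.0, size=(8, 16)) * scales   # column j ~ N(0, scales[j]^2)
print(np.round(scales, 2))
```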
Controlling Overestimation Bias with Truncated Mixture of Continuous Distributional Quantile Critics
The overestimation bias is one of the major impediments to accurate
off-policy learning. This paper investigates a novel way to alleviate the
overestimation bias in a continuous control setting. Our method, Truncated
Quantile Critics (TQC), blends three ideas: distributional representation of a
critic, truncation of the critics' predictions, and ensembling of multiple
critics. Distributional representation and truncation allow for arbitrarily
granular control of overestimation, while ensembling provides additional score
improvements. TQC outperforms the current state of the art on all environments
from the continuous control benchmark suite, demonstrating a 25% improvement on
the most challenging Humanoid environment.
Comment: Under review by the International Conference on Machine Learning.
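A minimal sketch of the truncation step, with atom counts and values assumed for illustration: pool the quantile atoms predicted by several distributional critics, drop the largest ones (the most overestimation-prone), and average the remainder to form a conservative target value.

```python
import numpy as np

def truncated_critic_target(quantiles_per_critic, drop_top_per_critic=2):
    """Pool quantile atoms from all critics, sort them, drop the largest
    atoms, and return the mean of the rest. The number of dropped atoms
    controls how conservative (truncated) the value estimate is."""
    pooled = np.sort(np.concatenate(quantiles_per_critic))
    n_drop = drop_top_per_critic * len(quantiles_per_critic)
    truncated = pooled[:-n_drop] if n_drop > 0 else pooled
    return truncated.mean()

# Toy usage: 3 critics, 5 quantile atoms each (values assumed).
critics = [np.array([0.1, 0.3, 0.5, 0.9, 2.0]),
           np.array([0.0, 0.2, 0.6, 0.8, 1.5]),
           np.array([0.2, 0.4, 0.5, 0.7, 3.0])]
print(truncated_critic_target(critics))  # mean of the 9 smallest atoms
```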
Sticking the Landing: Simple, Lower-Variance Gradient Estimators for Variational Inference
We propose a simple and general variant of the standard reparameterized
gradient estimator for the variational evidence lower bound. Specifically, we
remove a part of the total derivative with respect to the variational
parameters that corresponds to the score function. Removing this term produces
an unbiased gradient estimator whose variance approaches zero as the
approximate posterior approaches the exact posterior. We analyze the behavior
of this gradient estimator theoretically and empirically, and generalize it to
more complex variational distributions such as mixtures and importance-weighted
posteriors.
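The sketch below illustrates the estimator for a diagonal Gaussian approximate posterior: the sample is reparameterized as usual, but log q is evaluated with detached copies of the variational parameters, which removes the score-function part of the total derivative. The toy target and shapes are assumed for illustration.

```python
import math
import torch

def stl_elbo_estimate(mu, log_sigma, log_joint, n_samples=1):
    """'Sticking the landing' style estimator for q(z) = N(mu, exp(log_sigma)^2):
    reparameterize z, but evaluate log q with detached (mu, log_sigma), so the
    score-function term is dropped while the path derivative is kept.
    log_joint is an assumed user-supplied log p(x, z) up to a constant."""
    eps = torch.randn(n_samples, mu.shape[-1])
    z = mu + torch.exp(log_sigma) * eps                   # reparameterized sample
    mu_d, log_sigma_d = mu.detach(), log_sigma.detach()   # stop the score term
    log_q = (-0.5 * ((z - mu_d) / torch.exp(log_sigma_d)) ** 2
             - log_sigma_d - 0.5 * math.log(2 * math.pi)).sum(-1)
    return (log_joint(z) - log_q).mean()

# Toy usage: with a standard-normal target and q already equal to it, the
# gradient of this estimator is exactly zero for any noise draw.
mu = torch.zeros(2, requires_grad=True)
log_sigma = torch.zeros(2, requires_grad=True)
elbo = stl_elbo_estimate(mu, log_sigma, lambda z: (-0.5 * z ** 2).sum(-1))
elbo.backward()
print(mu.grad, log_sigma.grad)   # both zero tensors at the optimum
```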
Distributional Reinforcement Learning via Moment Matching
We consider the problem of learning a set of probability distributions from
the empirical Bellman dynamics in distributional reinforcement learning (RL), a
class of state-of-the-art methods that estimate the distribution, as opposed to
only the expectation, of the total return. We formulate a method that learns a
finite set of statistics from each return distribution via neural networks, as
in (Bellemare, Dabney, and Munos 2017; Dabney et al. 2018b). Existing
distributional RL methods, however, constrain the learned statistics to
\emph{predefined} functional forms of the return distribution, which is both
restrictive in representation and makes the predefined statistics difficult to
maintain. Instead, we learn \emph{unrestricted} statistics, i.e.,
deterministic (pseudo-)samples, of the return distribution by leveraging a
technique from hypothesis testing known as maximum mean discrepancy (MMD),
which leads to a simpler objective amenable to backpropagation. Our method can
be interpreted as implicitly matching all orders of moments between a return
distribution and its Bellman target. We establish sufficient conditions for the
contraction of the distributional Bellman operator and provide finite-sample
analysis for the deterministic samples in distribution approximation.
Experiments on the suite of Atari games show that our method outperforms the
standard distributional RL baselines and sets a new record in the Atari games
for non-distributed agents.
Comment: To appear in AAAI'21; code available at
https://github.com/thanhnguyentang/mmdr
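A minimal sketch of the core loss, with values and kernel bandwidth assumed for illustration: the squared MMD under a Gaussian kernel between predicted pseudo-samples and Bellman-target samples, which when driven to zero implicitly matches all orders of moments between the two distributions.

```python
import numpy as np

def mmd2_gaussian(x, y, bandwidth=1.0):
    """Squared MMD between two sets of deterministic (pseudo-)samples under
    a Gaussian kernel. In practice a mixture of bandwidths is common; a
    single assumed bandwidth is used here for simplicity."""
    def k(a, b):
        d2 = (a[:, None] - b[None, :]) ** 2
        return np.exp(-d2 / (2.0 * bandwidth ** 2))
    return k(x, x).mean() + k(y, y).mean() - 2.0 * k(x, y).mean()

# Toy usage: pseudo-samples of Z(s, a) against a Bellman target
# r + gamma * Z(s', a') (values assumed for illustration).
z_pred = np.array([0.0, 0.5, 1.0, 1.5])
z_target = 0.1 + 0.99 * np.array([0.2, 0.4, 0.9, 1.6])
print(mmd2_gaussian(z_pred, z_target))
```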
A Distributional Perspective on Reinforcement Learning
In this paper we argue for the fundamental importance of the value
distribution: the distribution of the random return received by a reinforcement
learning agent. This is in contrast to the common approach to reinforcement
learning which models the expectation of this return, or value. Although there
is an established body of literature studying the value distribution, thus far
it has always been used for a specific purpose such as implementing risk-aware
behaviour. We begin with theoretical results in both the policy evaluation and
control settings, exposing a significant distributional instability in the
latter. We then use the distributional perspective to design a new algorithm
which applies Bellman's equation to the learning of approximate value
distributions. We evaluate our algorithm using the suite of games from the
Arcade Learning Environment. We obtain both state-of-the-art results and
anecdotal evidence demonstrating the importance of the value distribution in
approximate reinforcement learning. Finally, we combine theoretical and
empirical evidence to highlight the ways in which the value distribution
impacts learning in the approximate setting.
Comment: ICML 2017.
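For concreteness, the sketch below shows a categorical projection of the distributional Bellman update onto a fixed support, in the spirit of the algorithm described above; the support range, atom count, and inputs are assumed for illustration.

```python
import numpy as np

def categorical_projection(probs_next, reward, gamma, v_min, v_max):
    """Project the Bellman-shifted atoms r + gamma * z of a categorical
    return distribution back onto the fixed support: each shifted atom's
    probability is split between its two nearest support points."""
    n_atoms = probs_next.shape[0]
    z = np.linspace(v_min, v_max, n_atoms)
    dz = z[1] - z[0]
    tz = np.clip(reward + gamma * z, v_min, v_max)    # shifted, clipped atoms
    b = (tz - v_min) / dz                             # fractional support index
    lo, hi = np.floor(b).astype(int), np.ceil(b).astype(int)
    out = np.zeros(n_atoms)
    np.add.at(out, lo, probs_next * (hi - b + (lo == hi)))  # lower-neighbour share
    np.add.at(out, hi, probs_next * (b - lo))               # upper-neighbour share
    return out

# Toy usage: 11 atoms on [-1, 1], uniform next-state distribution (assumed).
p = np.full(11, 1.0 / 11)
proj = categorical_projection(p, reward=0.2, gamma=0.9, v_min=-1.0, v_max=1.0)
print(proj.sum())  # probabilities still sum to 1 after projection
```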
Critic Regularized Regression
Offline reinforcement learning (RL), also known as batch RL, offers the
prospect of policy optimization from large pre-recorded datasets without online
environment interaction. It addresses challenges with regard to the cost of
data collection and safety, both of which are particularly pertinent to
real-world applications of RL. Unfortunately, most off-policy algorithms
perform poorly when learning from a fixed dataset. In this paper, we propose a
novel offline RL algorithm to learn policies from data using a form of
critic-regularized regression (CRR). We find that CRR performs surprisingly
well and scales to tasks with high-dimensional state and action spaces --
outperforming several state-of-the-art offline RL algorithms by a significant
margin on a wide range of benchmark tasks.
Comment: 23 pages.
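As a rough illustration of the critic-regularized regression idea (details and constants assumed, not the paper's exact form), the sketch below computes per-datapoint weights from the critic's advantage estimate; the policy would then be fit by behaviour cloning on the dataset actions, weighted by these values.

```python
import numpy as np

def crr_weights(q_sa, v_s, mode="exp", beta=1.0):
    """Weights for critic-regularized behaviour cloning, increasing with
    the advantage A = Q(s, a) - V(s). Two common choices: a binary filter
    that keeps only better-than-average actions, and an exponential weight
    (beta is an assumed temperature, clipped for numerical stability)."""
    adv = q_sa - v_s
    if mode == "binary":
        return (adv > 0).astype(float)
    return np.minimum(np.exp(adv / beta), 20.0)

# Toy usage: these weights would multiply -log pi(a|s) per datapoint
# (critic values assumed for illustration).
q_sa = np.array([1.2, 0.3, 0.8])
v_s = np.array([0.5, 0.6, 0.8])
print(crr_weights(q_sa, v_s, mode="binary"), crr_weights(q_sa, v_s))
```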