Parametric Return Density Estimation for Reinforcement Learning
Most conventional Reinforcement Learning (RL) algorithms aim to optimize
decision-making rules in terms of the expected returns. However, especially for
risk management purposes, other risk-sensitive criteria such as the
value-at-risk or the expected shortfall are sometimes preferred in real
applications. Here, we describe a parametric method for estimating the density of
the returns, which allows us to handle various criteria in a unified manner. We
first extend the Bellman equation for the conditional expected return to cover
a conditional probability density of the returns. Then we derive an extension
of the TD-learning algorithm for estimating the return densities in an unknown
environment. As test instances, several parametric density estimation
algorithms are presented for the Gaussian, Laplace, and skewed Laplace
distributions. We show that these algorithms lead to risk-sensitive as well as
robust RL paradigms through numerical experiments.
Comment: Appears in Proceedings of the Twenty-Sixth Conference on Uncertainty in Artificial Intelligence (UAI 2010).
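To make the parametric TD idea concrete, here is a minimal sketch (not the authors' implementation; the state/action sizes, learning rate, and the simple moment-matching target are illustrative assumptions) of a Gaussian return-density update on a tabular problem:

import numpy as np

n_states, n_actions, gamma, alpha = 5, 2, 0.95, 0.1
mu = np.zeros((n_states, n_actions))   # estimated mean of the return
var = np.ones((n_states, n_actions))   # estimated variance of the return

def gaussian_td_update(s, a, r, s_next, a_next):
    """One TD-style update of the Gaussian return-density parameters.

    The target distribution is r + gamma * N(mu[s', a'], var[s', a']); its
    mean is r + gamma * mu[s', a'] and its variance gamma**2 * var[s', a'].
    (Reward noise is ignored in this toy version.)
    """
    target_mu = r + gamma * mu[s_next, a_next]
    target_var = gamma ** 2 * var[s_next, a_next]
    mu[s, a] += alpha * (target_mu - mu[s, a])
    var[s, a] += alpha * (target_var - var[s, a])

gaussian_td_update(s=0, a=1, r=1.0, s_next=2, a_next=0)

# With a fitted density, risk-sensitive criteria follow directly, e.g. a
# 5% value-at-risk under the Gaussian model:
var_5_percent = mu[0, 1] - 1.645 * np.sqrt(var[0, 1])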
Distributional Reinforcement Learning for Efficient Exploration
In distributional reinforcement learning (RL), the estimated distribution of
the value function models both parametric and intrinsic uncertainty. We
propose a novel and efficient exploration method for deep RL that has two
components. The first is a decaying schedule to suppress the intrinsic
uncertainty. The second is an exploration bonus calculated from the upper
quantiles of the learned distribution. In Atari 2600 games, our method
outperforms QR-DQN in 12 out of 14 hard games (achieving a 483% average gain
in cumulative rewards over QR-DQN across 49 games, with a big win in Venture).
We also compared our algorithm with QR-DQN in a challenging 3D driving
simulator (CARLA). Results show that our algorithm reaches near-optimal safety
rewards twice as fast as QR-DQN.
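As a rough illustration of the action-selection rule described above, the sketch below assumes per-action quantile estimates from a QR-DQN-style critic; the decay schedule and the number of upper quantiles used for the bonus are placeholder choices, not the paper's exact values:

import numpy as np

def select_action(quantiles, step, n_upper=10, c0=1.0, decay=1e-5):
    """quantiles: (n_actions, n_quantiles) estimates, sorted per action.

    The mean over all quantiles is the usual Q-value; the mean of the upper
    quantiles minus that Q-value serves as an optimism bonus, and its weight
    decays with the training step so that the intrinsic (return) uncertainty
    is suppressed over time.
    """
    q_mean = quantiles.mean(axis=1)
    upper_bonus = quantiles[:, -n_upper:].mean(axis=1) - q_mean
    c_t = c0 / (1.0 + decay * step)            # decaying schedule
    return int(np.argmax(q_mean + c_t * upper_bonus))

# Example: 3 actions, 20 quantile estimates per action.
rng = np.random.default_rng(0)
quantiles = np.sort(rng.normal(size=(3, 20)), axis=1)
action = select_action(quantiles, step=1000)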
Stein Variational Policy Gradient
Policy gradient methods have been successfully applied to many complex
reinforcement learning problems. However, policy gradient methods suffer from
high variance, slow convergence, and inefficient exploration. In this work, we
introduce a maximum entropy policy optimization framework which explicitly
encourages parameter exploration, and show that this framework can be reduced
to a Bayesian inference problem. We then propose a novel Stein variational
policy gradient method (SVPG) which combines existing policy gradient methods
and a repulsive functional to generate a set of diverse but well-behaved
policies. SVPG is robust to initialization and can easily be implemented in a
parallel manner. On continuous control problems, we find that implementing SVPG
on top of REINFORCE and advantage actor-critic algorithms improves both average
return and data efficiency.
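The heart of SVPG is a Stein variational gradient descent step over an ensemble of policy-parameter particles. The sketch below uses an RBF kernel and takes per-particle policy-gradient estimates as given; it only illustrates the attraction (kernel-weighted gradients) and repulsion (kernel gradients) terms, not the full algorithm:

import numpy as np

rng = np.random.default_rng(0)

def rbf_kernel(particles, bandwidth=1.0):
    """Pairwise RBF kernel values k(theta_j, theta_i) and gradients w.r.t. theta_j."""
    diff = particles[:, None, :] - particles[None, :, :]     # (n, n, d)
    k = np.exp(-(diff ** 2).sum(-1) / (2 * bandwidth ** 2))  # (n, n)
    grad_k = -diff / bandwidth ** 2 * k[:, :, None]          # (n, n, d)
    return k, grad_k

def svpg_step(particles, policy_grads, step_size=0.01, temperature=1.0):
    """particles: (n, d) policy parameter vectors; policy_grads: (n, d)
    per-particle policy-gradient estimates (the prior term is omitted here)."""
    n = particles.shape[0]
    k, grad_k = rbf_kernel(particles)
    # Attraction: kernel-smoothed policy gradients pull particles toward
    # high-return regions; repulsion: kernel gradients push particles apart,
    # encouraging a diverse but well-behaved set of policies.
    phi = (k @ policy_grads / temperature + grad_k.sum(axis=0)) / n
    return particles + step_size * phi

# Toy usage: 8 particles of a 5-dimensional parameter vector; the "policy
# gradients" here are just placeholders.
particles = rng.normal(size=(8, 5))
particles = svpg_step(particles, policy_grads=-particles)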
Efficient exploration with Double Uncertain Value Networks
This paper studies directed exploration for reinforcement learning agents by
tracking uncertainty about the value of each available action. We identify two
sources of uncertainty that are relevant for exploration. The first originates
from limited data (parametric uncertainty), while the second originates from
the distribution of the returns (return uncertainty). We identify methods to
learn these distributions with deep neural networks, where we estimate
parametric uncertainty with Bayesian dropout, while return uncertainty is
propagated through the Bellman equation as a Gaussian distribution. Then, we
identify that both can be jointly estimated in one network, which we call the
Double Uncertain Value Network. The policy is directly derived from the learned
distributions based on Thompson sampling. Experimental results show that both
types of uncertainty may vastly improve learning in domains with a strong
exploration challenge.
Comment: Deep Reinforcement Learning Symposium @ Conference on Neural Information Processing Systems (NIPS) 2017.
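A toy illustration of the two uncertainty sources and the Thompson-sampling action rule is sketched below; the tiny numpy network, dropout rate, and Gaussian output head are stand-ins for the paper's deep architecture:

import numpy as np

rng = np.random.default_rng(0)

class DoubleUncertainValueNet:
    """Toy stand-in for the value network: MC-dropout over the hidden layer
    supplies parametric uncertainty, and a Gaussian output head (mean and
    log-variance per action) supplies return uncertainty."""

    def __init__(self, n_inputs, n_hidden, n_actions, p_drop=0.1):
        self.w1 = rng.normal(scale=0.1, size=(n_inputs, n_hidden))
        self.w_mu = rng.normal(scale=0.1, size=(n_hidden, n_actions))
        self.w_logvar = rng.normal(scale=0.1, size=(n_hidden, n_actions))
        self.p_drop = p_drop

    def forward(self, state):
        h = np.maximum(0.0, state @ self.w1)
        # Dropout stays ON when acting: each forward pass is one sample from
        # the approximate posterior over network weights.
        mask = rng.random(h.shape) > self.p_drop
        h = h * mask / (1.0 - self.p_drop)
        return h @ self.w_mu, h @ self.w_logvar    # per-action mean, log-variance

    def act(self, state):
        """Thompson sampling: sample weights via dropout, then sample a return
        from the predicted Gaussian for each action, and act greedily."""
        mu, logvar = self.forward(state)
        return int(np.argmax(rng.normal(mu, np.exp(0.5 * logvar))))

net = DoubleUncertainValueNet(n_inputs=4, n_hidden=32, n_actions=2)
action = net.act(rng.normal(size=4))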
Relative Entropy Regularized Policy Iteration
We present an off-policy actor-critic algorithm for Reinforcement Learning
(RL) that combines ideas from gradient-free optimization via stochastic search
with a learned action-value function. The result is a simple procedure consisting
of three steps: i) policy evaluation by estimating a parametric action-value
function; ii) policy improvement via the estimation of a local non-parametric
policy; and iii) generalization by fitting a parametric policy. Each step can
be implemented in different ways, giving rise to several algorithm variants.
Our algorithm draws on connections to existing literature on black-box
optimization and 'RL as inference', and it can be seen either as an extension
of the Maximum a Posteriori Policy Optimisation algorithm (MPO) [Abdolmaleki et
al., 2018a], or as an extension of Trust Region Covariance Matrix Adaptation
Evolutionary Strategy (CMA-ES) [Abdolmaleki et al., 2017b; Hansen et al., 1997]
to a policy iteration scheme. Our comparison on 31 continuous control tasks from
the parkour suite [Heess et al., 2017], the DeepMind control suite [Tassa et al.,
2018], and OpenAI Gym [Brockman et al., 2016], with diverse properties, a limited
amount of compute, and a single set of hyperparameters, demonstrates the
effectiveness of our method and yields state-of-the-art results. Videos
summarizing the results can be found at goo.gl/HtvJKR.
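The three-step structure can be illustrated on a discrete-action toy problem as below; this is only a sketch in which the Q-values are taken as given and the KL/trust-region machinery of the actual algorithm is omitted:

import numpy as np

def nonparametric_improvement(q_values, policy_probs, eta=1.0):
    """Step ii): reweight the current policy by exp(Q / eta) in each state.
    q_values, policy_probs: arrays of shape (n_states, n_actions)."""
    weights = policy_probs * np.exp(q_values / eta)
    return weights / weights.sum(axis=1, keepdims=True)

def fit_parametric_policy(target_probs, n_iters=200, lr=0.5):
    """Step iii): fit softmax logits to the non-parametric policy by gradient
    descent on the cross-entropy (a weighted maximum-likelihood fit)."""
    logits = np.zeros_like(target_probs)
    for _ in range(n_iters):
        probs = np.exp(logits - logits.max(axis=1, keepdims=True))
        probs /= probs.sum(axis=1, keepdims=True)
        logits += lr * (target_probs - probs)   # negative cross-entropy gradient
    return probs

# One sweep of the three steps on made-up Q estimates (step i) would normally
# come from TD learning of a parametric action-value function):
q = np.array([[1.0, 0.5, 0.0], [0.2, 0.9, 0.4]])
pi_old = np.full_like(q, 1.0 / 3.0)
pi_target = nonparametric_improvement(q, pi_old, eta=0.5)
pi_new = fit_parametric_policy(pi_target)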
QUOTA: The Quantile Option Architecture for Reinforcement Learning
In this paper, we propose the Quantile Option Architecture (QUOTA) for
exploration based on recent advances in distributional reinforcement learning
(RL). In QUOTA, decision making is based on quantiles of a value distribution,
not only the mean. QUOTA provides a new dimension for exploration by making
use of both optimism and pessimism of a value distribution. We demonstrate the
performance advantage of QUOTA in both challenging video games and physical
robot simulators.
Comment: AAAI 2019.
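The core mechanism can be sketched as follows: each "quantile option" acts greedily with respect to one quantile of the learned return distribution, and a high-level policy picks which option to run; the epsilon-greedy option selection here is a placeholder for the option-value learning in the paper:

import numpy as np

rng = np.random.default_rng(0)

def quantile_option_action(quantiles, option):
    """Option k acts greedily w.r.t. the k-th quantile of the return
    distribution: low quantiles give pessimistic behaviour, high quantiles
    give optimistic, exploratory behaviour.
    quantiles: (n_actions, n_quantiles), sorted along the last axis."""
    return int(np.argmax(quantiles[:, option]))

def pick_option(option_values, epsilon=0.1):
    """Placeholder high-level policy over the quantile options (epsilon-greedy
    on option values, which QUOTA learns with an option-level value function)."""
    if rng.random() < epsilon:
        return int(rng.integers(len(option_values)))
    return int(np.argmax(option_values))

quantiles = np.sort(rng.normal(size=(4, 8)), axis=1)   # 4 actions, 8 quantiles
option_values = np.zeros(8)                            # one option per quantile
a = quantile_option_action(quantiles, pick_option(option_values))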
Randomized Value Functions via Multiplicative Normalizing Flows
Randomized value functions offer a promising approach towards the challenge
of efficient exploration in complex environments with high dimensional state
and action spaces. Unlike traditional point estimate methods, randomized value
functions maintain a posterior distribution over action-space values. This
prevents the agent's behavior policy from prematurely exploiting early
estimates and falling into local optima. In this work, we leverage recent
advances in variational Bayesian neural networks and combine these with
traditional Deep Q-Networks (DQN) and Deep Deterministic Policy Gradient (DDPG)
to achieve randomized value functions for high-dimensional domains. In
particular, we augment DQN and DDPG with multiplicative normalizing flows in
order to track a rich approximate posterior distribution over the parameters of
the value function. This allows the agent to perform approximate Thompson
sampling in a computationally efficient manner via stochastic gradient methods.
We demonstrate the benefits of our approach through an empirical comparison in
high-dimensional environments.
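A minimal sketch of the resulting agent behaviour is given below; for brevity it replaces the multiplicative-normalizing-flow posterior with a simple factorized Gaussian over last-layer weights, which keeps the Thompson-sampling loop (sample parameters, act greedily under the sample) but not the richness of the actual posterior:

import numpy as np

rng = np.random.default_rng(0)

class RandomizedQHead:
    """Last-layer Bayesian Q head with a factorized Gaussian posterior over its
    weights; a deliberately simple stand-in for the multiplicative-normalizing-
    flow posterior used in the paper."""

    def __init__(self, n_features, n_actions):
        self.mean = np.zeros((n_features, n_actions))
        self.log_std = np.full((n_features, n_actions), -1.0)

    def sample_weights(self):
        return self.mean + np.exp(self.log_std) * rng.normal(size=self.mean.shape)

    def act(self, features):
        """Approximate Thompson sampling: draw one posterior sample of the
        value-function weights and act greedily under that sample."""
        return int(np.argmax(features @ self.sample_weights()))

head = RandomizedQHead(n_features=16, n_actions=4)
action = head.act(rng.normal(size=16))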
Combining Parametric and Nonparametric Models for Off-Policy Evaluation
We consider a model-based approach to perform batch off-policy evaluation in
reinforcement learning. Our method takes a mixture-of-experts approach to
combine parametric and non-parametric models of the environment such that the
final value estimate has the least expected error. We do so by first estimating
the local accuracy of each model and then using a planner to select which model
to use at every time step so as to minimize the return error estimate along entire
trajectories. Across a variety of domains, our mixture-based approach
outperforms the individual models alone as well as state-of-the-art importance
sampling-based estimators.
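One way to picture the per-step model choice is the toy rollout below, where whichever model has the smaller estimated local error generates the next transition; the model and error-estimator interfaces are illustrative assumptions, not the paper's API:

def evaluate_policy(start_state, policy, parametric_model, nonparametric_model,
                    local_error, horizon=50, gamma=0.99):
    """Model-based off-policy value estimate for `policy`.

    `parametric_model` and `nonparametric_model` map (state, action) to
    (next_state, reward); `local_error(model, state, action)` returns an
    estimate of that model's one-step error, e.g. from held-out data.
    At each step the locally more accurate model is used, so the error
    accumulated along the whole trajectory stays small.
    """
    state, value, discount = start_state, 0.0, 1.0
    for _ in range(horizon):
        action = policy(state)
        models = (parametric_model, nonparametric_model)
        model = min(models, key=lambda m: local_error(m, state, action))
        state, reward = model(state, action)
        value += discount * reward
        discount *= gamma
    return value

# Toy usage with hand-written stand-ins for the two models:
param_model = lambda s, a: (s, 1.0)          # "parametric" model (pretend)
nonparam_model = lambda s, a: (s + a, 0.5)   # "non-parametric" model (pretend)
local_err = lambda m, s, a: 0.1 if m is param_model else 0.2
estimate = evaluate_policy(0.0, lambda s: 1, param_model, nonparam_model,
                           local_err, horizon=5)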
Policy Optimization via Importance Sampling
Policy optimization is an effective reinforcement learning approach to solve
continuous control tasks. Recent achievements have shown that alternating
online and offline optimization is a successful choice for efficient trajectory
reuse. However, deciding when to stop optimizing and collect new trajectories
is non-trivial, as it requires accounting for the variance of the objective
function estimate. In this paper, we propose a novel, model-free, policy search
algorithm, POIS, applicable in both action-based and parameter-based settings.
We first derive a high-confidence bound for importance sampling estimation;
then we define a surrogate objective function, which is optimized offline
whenever a new batch of trajectories is collected. Finally, the algorithm is
tested on a selection of continuous control tasks, with both linear and deep
policies, and compared with state-of-the-art policy optimization methods.
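A much-simplified sketch of such a surrogate objective is shown below: old trajectories are reweighted by the new-to-old likelihood ratio, and a penalty that grows with the spread of the importance weights stands in for the Renyi-divergence-based high-confidence bound used by POIS:

import numpy as np

def surrogate_objective(log_p_new, log_p_old, returns, delta=0.2):
    """Off-line surrogate for importance-sampling-based policy optimization.

    log_p_new / log_p_old: per-trajectory log-likelihoods under the candidate
    and behavioural policies; returns: per-trajectory returns.  The penalty
    grows with the dispersion of the importance weights and discourages moving
    so far from the behavioural policy that the reweighted estimate becomes
    unreliable (POIS uses a Renyi-divergence-based high-confidence bound here).
    """
    w = np.exp(log_p_new - log_p_old)                  # importance weights
    estimate = np.mean(w * returns)
    penalty = np.sqrt(np.mean(w ** 2) / len(returns))  # weight-dispersion term
    return estimate - delta * penalty * np.std(returns)

# The candidate policy is then improved off-line, e.g. by gradient ascent on
# this objective, until a fresh batch of trajectories is collected.
rng = np.random.default_rng(0)
logp_old = rng.normal(size=100)
logp_new = logp_old + 0.1 * rng.normal(size=100)
score = surrogate_objective(logp_new, logp_old, returns=rng.normal(size=100))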
Action-dependent Control Variates for Policy Optimization via Stein's Identity
Policy gradient methods have achieved remarkable successes in solving
challenging reinforcement learning problems. However, they still often suffer
from high variance in policy gradient estimation, which leads to
poor sample efficiency during training. In this work, we propose a control
variate method to effectively reduce variance for policy gradient methods.
Motivated by Stein's identity, our method extends the previous control
variate methods used in REINFORCE and advantage actor-critic by introducing
more general action-dependent baseline functions. Empirical studies show that
our method significantly improves the sample efficiency of the state-of-the-art
policy gradient approaches.
Comment: The first two authors contributed equally. Author ordering determined by coin flip over a Google Hangout. Accepted by ICLR 2018.
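For a linear-Gaussian policy, the resulting estimator can be sketched as follows (a toy, dependency-free version in which the quadratic baseline, finite-difference action gradient, and dimensions are illustrative assumptions): the score-function term uses Q minus the action-dependent baseline, and Stein's identity supplies the correction term that keeps the estimator unbiased:

import numpy as np

rng = np.random.default_rng(0)

def grad_a(phi, s, a, eps=1e-5):
    """Finite-difference gradient of the baseline phi(s, a) w.r.t. the action."""
    g = np.zeros_like(a)
    for i in range(a.size):
        da = np.zeros_like(a)
        da[i] = eps
        g[i] = (phi(s, a + da) - phi(s, a - da)) / (2 * eps)
    return g

def stein_cv_gradient(W, sigma, s, a, q_hat, phi):
    """One gradient sample for a linear-Gaussian policy a ~ N(W s, sigma^2 I)
    with an action-dependent baseline phi(s, a).

    Score-function term: grad_W log pi(a|s) * (Q - phi);
    Stein correction:    outer(grad_a phi, s), i.e. the reparameterized path
    dA/dW applied to grad_a phi, which keeps the estimator unbiased.
    """
    score = np.outer((a - W @ s) / sigma ** 2, s)   # grad_W log pi(a|s)
    correction = np.outer(grad_a(phi, s, a), s)
    return score * (q_hat - phi(s, a)) + correction

# Toy usage: 3-dim state, 2-dim action, quadratic baseline (illustrative only).
s = rng.normal(size=3)
W = rng.normal(size=(2, 3))
sigma = 0.5
a = W @ s + sigma * rng.normal(size=2)
phi = lambda st, ac: -0.5 * np.dot(ac, ac) + np.dot(ac, st[:2])
g = stein_cv_gradient(W, sigma, s, a, q_hat=1.0, phi=phi)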