Wasserstein Robust Reinforcement Learning
Reinforcement learning algorithms, though successful, tend to over-fit to
training environments, hampering their application to the real world. This paper
proposes Wasserstein Robust Reinforcement Learning (WR²L) -- a robust reinforcement
learning algorithm with significant robust performance on low- and high-dimensional
control tasks. Our method formalises robust reinforcement learning as a novel
min-max game with a Wasserstein constraint for a correct and convergent solver.
Apart from the formulation, we also propose an efficient and scalable solver
following a novel zero-order optimisation method that we believe can be useful
to numerical optimisation in general. We empirically demonstrate significant
gains compared to standard and robust state-of-the-art algorithms on
high-dimensional MuJoCo environments.
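As a rough illustration of what such a Wasserstein-constrained min-max formulation
looks like (the notation and the exact form of the constraint set are assumptions
here, not taken from the paper), the policy maximises return under the worst
transition model inside a Wasserstein ball around a reference model:

    \max_{\pi} \; \min_{p \,:\, W(p, p_0) \le \epsilon} \;
        \mathbb{E}_{p,\pi}\Big[ \textstyle\sum_{t \ge 0} \gamma^{t} r(s_t, a_t) \Big]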
Robust Reinforcement Learning with Wasserstein Constraint
Robust Reinforcement Learning aims to find the optimal policy with some
extent of robustness to environmental dynamics. Existing learning algorithms
usually obtain robustness by disturbing the current state or by simulating
environmental parameters in a heuristic way, an approach that lacks quantified
robustness to the system dynamics (i.e. the transition probability). To overcome
this issue, we leverage Wasserstein distance to measure the disturbance to the
reference transition kernel. With Wasserstein distance, we are able to connect
transition kernel disturbance to the state disturbance, i.e. reduce an
infinite-dimensional optimization problem to a finite-dimensional risk-aware
problem. Through the derived risk-aware optimal Bellman equation, we show the
existence of optimal robust policies, provide a sensitivity analysis for the
perturbations, and then design a novel robust learning algorithm -- the Wasserstein
Robust Advantage Actor-Critic (WRAAC) algorithm. The effectiveness of the
proposed algorithm is verified in the Cart-Pole environment.
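For intuition, a generic robust Bellman equation of this flavour (illustrative
notation, not necessarily the paper's exact risk-aware form) evaluates each action
under the worst transition kernel inside a Wasserstein ball around the reference
kernel:

    V^{*}(s) = \max_{a} \; \inf_{p \,:\, W(p,\, p_0(\cdot \mid s,a)) \le \delta}
        \Big[ r(s,a) + \gamma \, \mathbb{E}_{s' \sim p}\, V^{*}(s') \Big]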
Distributional Robustness and Regularization in Reinforcement Learning
Distributionally Robust Optimization (DRO) has made it possible to prove the
equivalence between robustness and regularization in classification and
regression, thus providing an analytical reason why regularization generalizes
well in statistical learning. Although DRO's extension to sequential
decision-making overcomes external uncertainty through the robust
Markov Decision Process (MDP) setting, the resulting formulation is hard to
solve, especially on large domains. On the other hand, existing regularization
methods in reinforcement learning only address internal uncertainty
due to stochasticity. Our study aims to facilitate robust reinforcement
learning by establishing a dual relation between robust MDPs and
regularization. We introduce Wasserstein distributionally robust MDPs and prove
that they hold out-of-sample performance guarantees. Then, we introduce a new
regularizer for empirical value functions and show that it lower bounds the
Wasserstein distributionally robust value function. We extend the result to
linear value function approximation for large state spaces. Our approach
provides an alternative formulation of robustness with guaranteed finite-sample
performance. Moreover, it suggests using regularization as a practical tool for
dealing with external uncertainty in reinforcement learning methods.
Comment: Accepted at the "Theoretical Foundations of Reinforcement Learning" Workshop - ICML 2020.
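The supervised-learning duality being transferred here can be sketched as follows:
for a 1-Wasserstein ball and a loss that is L-Lipschitz in the data, the worst-case
expected loss is controlled by the empirical loss plus a Lipschitz regularization
term (a standard DRO bound, given for intuition rather than as the paper's result):

    \sup_{Q \,:\, W_1(Q, \hat{P}) \le \epsilon} \mathbb{E}_{Q}\big[\ell(\theta; z)\big]
        \;\le\; \mathbb{E}_{\hat{P}}\big[\ell(\theta; z)\big] + \epsilon L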
Distributional Reinforcement Learning with Quantile Regression
In reinforcement learning, an agent interacts with the environment by taking
actions and observing the next state and reward. When sampled
probabilistically, these state transitions, rewards, and actions can all induce
randomness in the observed long-term return. Traditionally, reinforcement
learning algorithms average over this randomness to estimate the value
function. In this paper, we build on recent work advocating a distributional
approach to reinforcement learning in which the distribution over returns is
modeled explicitly instead of only estimating the mean. That is, we examine
methods of learning the value distribution instead of the value function. We
give results that close a number of gaps between the theoretical and
algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we
extend existing results to the approximate distribution setting. Second, we
present a novel distributional reinforcement learning algorithm consistent with
our theoretical formulation. Finally, we evaluate this new algorithm on the
Atari 2600 games, observing that it significantly outperforms many of the
recent improvements on DQN, including the related distributional algorithm C51.
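A minimal NumPy sketch of the quantile-regression Huber loss that drives this family
of methods follows; the function name, array shapes, and mean reduction are
illustrative choices, not the authors' code:

    import numpy as np

    def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
        # pred_quantiles: (N,) predicted quantile values for fractions tau_i = (2i+1)/(2N)
        # target_samples: (M,) samples of the Bellman target r + gamma * Z(s', a')
        n = len(pred_quantiles)
        taus = (2.0 * np.arange(n) + 1.0) / (2.0 * n)
        u = target_samples[:, None] - pred_quantiles[None, :]     # pairwise TD errors, (M, N)
        huber = np.where(np.abs(u) <= kappa,
                         0.5 * u ** 2,
                         kappa * (np.abs(u) - 0.5 * kappa))
        weight = np.abs(taus[None, :] - (u < 0.0).astype(float))  # asymmetric quantile weight
        return (weight * huber / kappa).mean()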
Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version
This work tackles the problem of robust zero-shot planning in non-stationary
stochastic environments. We study Markov Decision Processes (MDPs) evolving
over time and consider Model-Based Reinforcement Learning algorithms in this
setting. We make two hypotheses: 1) the environment evolves continuously with a
bounded evolution rate; 2) a current model is known at each decision epoch but
not its evolution. Our contribution can be presented in four points. 1) we
define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We
introduce the notion of regular evolution by making a hypothesis of
Lipschitz continuity on the transition and reward functions w.r.t. time; 2) we
consider a planning agent using the current model of the environment but
unaware of its future evolution. This leads us to consider a worst-case method
where the environment is seen as an adversarial agent; 3) following this
approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot
Model-Based method similar to Minimax search; 4) we illustrate the benefits
brought by RATS empirically and compare its performance with reference
Model-Based algorithms.
Comment: Published at NeurIPS 2019, 17 pages, 3 figures.
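A compact sketch of the worst-case (minimax-style) search idea described above; the
interfaces for the candidate models and the reward are assumptions made for
illustration, not the RATS algorithm itself:

    def worst_case_value(state, depth, actions, candidate_models, reward, gamma=0.95):
        # Agent maximises over actions; an adversary picks, at each step, the least
        # favourable transition model from a set of admissible candidate models.
        if depth == 0:
            return 0.0
        best = float("-inf")
        for a in actions:
            worst = float("inf")
            for model in candidate_models:
                value = 0.0
                for next_state, prob in model(state, a):   # model returns [(s', p), ...]
                    value += prob * (reward(state, a, next_state)
                                     + gamma * worst_case_value(next_state, depth - 1,
                                                                actions, candidate_models,
                                                                reward, gamma))
                worst = min(worst, value)
            best = max(best, worst)
        return best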
A Distributional Perspective on Reinforcement Learning
In this paper we argue for the fundamental importance of the value
distribution: the distribution of the random return received by a reinforcement
learning agent. This is in contrast to the common approach to reinforcement
learning which models the expectation of this return, or value. Although there
is an established body of literature studying the value distribution, thus far
it has always been used for a specific purpose such as implementing risk-aware
behaviour. We begin with theoretical results in both the policy evaluation and
control settings, exposing a significant distributional instability in the
latter. We then use the distributional perspective to design a new algorithm
which applies Bellman's equation to the learning of approximate value
distributions. We evaluate our algorithm using the suite of games from the
Arcade Learning Environment. We obtain both state-of-the-art results and
anecdotal evidence demonstrating the importance of the value distribution in
approximate reinforcement learning. Finally, we combine theoretical and
empirical evidence to highlight the ways in which the value distribution
impacts learning in the approximate setting.
Comment: ICML 2017.
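The central object in this line of work is the distributional Bellman equation, which
replaces the usual expected-value recursion with an equality in distribution:

    Z(s, a) \overset{D}{=} R(s, a) + \gamma \, Z(S', A'),
        \qquad S' \sim p(\cdot \mid s, a), \;\; A' \sim \pi(\cdot \mid S')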
Implicit Quantile Networks for Distributional Reinforcement Learning
In this work, we build on recent advances in distributional reinforcement
learning to give a generally applicable, flexible, and state-of-the-art
distributional variant of DQN. We achieve this by using quantile regression to
approximate the full quantile function for the state-action return
distribution. By reparameterizing a distribution over the sample space, this
yields an implicitly defined return distribution and gives rise to a large
class of risk-sensitive policies. We demonstrate improved performance on the 57
Atari 2600 games in the ALE, and use our algorithm's implicitly defined
distributions to study the effects of risk-sensitive policies in Atari games.
Comment: ICML 2018.
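One way a sampled quantile fraction can be fed to a network in this implicit-quantile
style is a cosine embedding of the fraction; the sketch below is illustrative (shapes
and the exact interface are assumptions, not the paper's architecture definition):

    import numpy as np

    def quantile_fraction_embedding(tau, weights, bias, n_basis=64):
        # Expand tau in [0, 1] onto cosine basis functions, then apply a learned
        # linear layer and a ReLU; the result is later combined with a state
        # embedding before the network outputs the tau-quantile of the return.
        i = np.arange(n_basis)
        basis = np.cos(np.pi * i * tau)                  # (n_basis,)
        return np.maximum(0.0, basis @ weights + bias)   # (embed_dim,)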
Policy Optimization as Wasserstein Gradient Flows
Policy optimization is a core component of reinforcement learning (RL), and
most existing RL methods directly optimize parameters of a policy based on
maximizing the expected total reward, or its surrogate. Though often achieving
encouraging empirical success, its underlying mathematical principle on
policy-distribution optimization is unclear. We place policy optimization into
the space of probability measures, and interpret it as Wasserstein gradient
flows. On the probability-measure space, under specified circumstances, policy
optimization becomes a convex problem in terms of distribution optimization. To
make optimization feasible, we develop efficient algorithms by numerically
solving the corresponding discrete gradient flows. Our technique is applicable
to several RL settings, and is related to many state-of-the-art
policy-optimization algorithms. Empirical results verify the effectiveness of
our framework, often obtaining better performance compared to related
algorithms.
Comment: Accepted by ICML 2018; initial version at the Deep Reinforcement Learning Symposium, NIPS 2017.
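The discrete gradient-flow steps referred to here are typically of the
Jordan-Kinderlehrer-Otto (JKO) form, written below in general notation (the specific
energy functional F used for policy distributions is the paper's own choice and is
not spelled out here):

    \rho_{k+1} \;=\; \arg\min_{\rho} \; F(\rho) \;+\; \frac{1}{2\tau} \, W_2^{2}(\rho, \rho_k)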
On Wasserstein Reinforcement Learning and the Fokker-Planck equation
Policy gradient methods often achieve better performance when the change in
policy is limited to a small Kullback-Leibler divergence. We derive policy
gradients where the change in policy is limited to a small Wasserstein distance
(or trust region). This is done in the discrete and continuous multi-armed
bandit settings with entropy regularisation. We show that in the small-steps
limit with respect to the Wasserstein distance, policy dynamics are
governed by the Fokker-Planck (heat) equation, following the
Jordan-Kinderlehrer-Otto result. This means that policies undergo diffusion and
advection, concentrating near actions with high reward. This helps elucidate
the nature of convergence in the probability matching setup, and provides
justification for empirical practices such as Gaussian policy priors and
additive gradient noise.
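The Jordan-Kinderlehrer-Otto connection invoked here is the classical one: the
Wasserstein-2 gradient flow of a free energy combining an expected potential and an
entropy term is a Fokker-Planck equation. In policy terms (a sketch, taking the
potential to be the negative reward over actions):

    F(\pi) = -\int r(a)\, \pi(a)\, da \;+\; \beta^{-1} \int \pi(a) \log \pi(a)\, da
    \quad\Longrightarrow\quad
    \partial_t \pi = -\nabla \cdot \big(\pi \, \nabla r\big) + \beta^{-1} \Delta \pi

The advection term pushes probability mass towards high-reward actions, and the
diffusion term comes from the entropy regularisation, matching the behaviour
described above.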
Autoregressive Quantile Networks for Generative Modeling
We introduce autoregressive implicit quantile networks (AIQN), a
fundamentally different approach to generative modeling than those commonly
used, that implicitly captures the distribution using quantile regression. AIQN
is able to achieve superior perceptual quality and improvements in evaluation
metrics, without incurring a loss of sample diversity. The method can be
applied to many existing models and architectures. In this work we extend the
PixelCNN model with AIQN and demonstrate results on CIFAR-10 and ImageNet using
Inception score, FID, non-cherry-picked samples, and inpainting results. We
consistently observe that AIQN yields a highly stable algorithm that improves
perceptual quality while maintaining a highly diverse distribution.
Comment: ICML 2018.
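A schematic of the autoregressive sampling procedure this implies; the
quantile_net(prefix, d, tau) interface is an assumption made for illustration, not
the paper's API:

    import numpy as np

    def autoregressive_quantile_sample(quantile_net, n_dims, rng=np.random):
        # Generate one sample dimension at a time: draw a quantile fraction tau
        # uniformly, then ask the model for that quantile of the current
        # dimension's distribution, conditioned on the dimensions generated so far.
        sample = []
        for d in range(n_dims):
            tau = rng.uniform(0.0, 1.0)
            x_d = quantile_net(np.array(sample), d, tau)
            sample.append(x_d)
        return np.array(sample)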