Wasserstein Robust Reinforcement Learning
Reinforcement learning algorithms, though successful, tend to over-fit to
training environments, hampering their application to the real world. This paper
proposes Wasserstein Robust Reinforcement Learning (WR²L) -- a robust reinforcement
learning algorithm with significant robust performance on low- and high-dimensional
control tasks. Our method formalises robust reinforcement learning as a novel
min-max game with a Wasserstein constraint for a correct and convergent solver.
Apart from the formulation, we also propose an efficient and scalable solver
following a novel zero-order optimisation method that we believe can be useful
to numerical optimisation in general. We empirically demonstrate significant
gains compared to standard and robust state-of-the-art algorithms on
high-dimensional MuJoCo environments.
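As a rough illustration of what such a Wasserstein-constrained min-max formulation
looks like (the notation and the exact form of the constraint set are assumptions
here, not taken from the paper), the policy maximises return under the worst
transition model inside a Wasserstein ball around a reference model:

    \max_{\pi} \; \min_{p \,:\, W(p, p_0) \le \epsilon} \;
        \mathbb{E}_{p,\pi}\Big[ \textstyle\sum_{t \ge 0} \gamma^{t} r(s_t, a_t) \Big]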
Robust Reinforcement Learning with Wasserstein Constraint
Robust Reinforcement Learning aims to find the optimal policy with some
extent of robustness to environmental dynamics. Existing learning algorithms
usually obtain robustness by disturbing the current state or by simulating
environmental parameters in a heuristic way, an approach that lacks quantified
robustness to the system dynamics (i.e. the transition probability). To overcome
this issue, we leverage Wasserstein distance to measure the disturbance to the
reference transition kernel. With Wasserstein distance, we are able to connect
transition kernel disturbance to the state disturbance, i.e. reduce an
infinite-dimensional optimization problem to a finite-dimensional risk-aware
problem. Through the derived risk-aware optimal Bellman equation, we show the
existence of optimal robust policies, provide a sensitivity analysis for the
perturbations, and then design a novel robust learning algorithm -- the Wasserstein
Robust Advantage Actor-Critic (WRAAC) algorithm. The effectiveness of the
proposed algorithm is verified in the Cart-Pole environment.
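For intuition, a generic robust Bellman equation of this flavour (illustrative
notation, not necessarily the paper's exact risk-aware form) evaluates each action
under the worst transition kernel inside a Wasserstein ball around the reference
kernel:

    V^{*}(s) = \max_{a} \; \inf_{p \,:\, W(p,\, p_0(\cdot \mid s,a)) \le \delta}
        \Big[ r(s,a) + \gamma \, \mathbb{E}_{s' \sim p}\, V^{*}(s') \Big]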
Distributional Robustness and Regularization in Reinforcement Learning
Distributionally Robust Optimization (DRO) has made it possible to prove the
equivalence between robustness and regularization in classification and
regression, thus providing an analytical reason why regularization generalizes
well in statistical learning. Although DRO's extension to sequential
decision-making overcomes external uncertainty through the robust
Markov Decision Process (MDP) setting, the resulting formulation is hard to
solve, especially on large domains. On the other hand, existing regularization
methods in reinforcement learning only address internal uncertainty
due to stochasticity. Our study aims to facilitate robust reinforcement
learning by establishing a dual relation between robust MDPs and
regularization. We introduce Wasserstein distributionally robust MDPs and prove
that they hold out-of-sample performance guarantees. Then, we introduce a new
regularizer for empirical value functions and show that it lower bounds the
Wasserstein distributionally robust value function. We extend the result to
linear value function approximation for large state spaces. Our approach
provides an alternative formulation of robustness with guaranteed finite-sample
performance. Moreover, it suggests using regularization as a practical tool for
dealing with external uncertainty in reinforcement learning methods.
Comment: Accepted at the "Theoretical Foundations of Reinforcement Learning" Workshop - ICML 2020.
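The supervised-learning duality being transferred here can be sketched as follows:
for a 1-Wasserstein ball and a loss that is L-Lipschitz in the data, the worst-case
expected loss is controlled by the empirical loss plus a Lipschitz regularization
term (a standard DRO bound, given for intuition rather than as the paper's result):

    \sup_{Q \,:\, W_1(Q, \hat{P}) \le \epsilon} \mathbb{E}_{Q}\big[\ell(\theta; z)\big]
        \;\le\; \mathbb{E}_{\hat{P}}\big[\ell(\theta; z)\big] + \epsilon L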
Distributional Reinforcement Learning with Quantile Regression
In reinforcement learning, an agent interacts with the environment by taking
actions and observing the next state and reward. When sampled
probabilistically, these state transitions, rewards, and actions can all induce
randomness in the observed long-term return. Traditionally, reinforcement
learning algorithms average over this randomness to estimate the value
function. In this paper, we build on recent work advocating a distributional
approach to reinforcement learning in which the distribution over returns is
modeled explicitly instead of only estimating the mean. That is, we examine
methods of learning the value distribution instead of the value function. We
give results that close a number of gaps between the theoretical and
algorithmic results given by Bellemare, Dabney, and Munos (2017). First, we
extend existing results to the approximate distribution setting. Second, we
present a novel distributional reinforcement learning algorithm consistent with
our theoretical formulation. Finally, we evaluate this new algorithm on the
Atari 2600 games, observing that it significantly outperforms many of the
recent improvements on DQN, including the related distributional algorithm C51.
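A minimal NumPy sketch of the quantile-regression Huber loss that drives this family
of methods follows; the function name, array shapes, and mean reduction are
illustrative choices, not the authors' code:

    import numpy as np

    def quantile_huber_loss(pred_quantiles, target_samples, kappa=1.0):
        # pred_quantiles: (N,) predicted quantile values for fractions tau_i = (2i+1)/(2N)
        # target_samples: (M,) samples of the Bellman target r + gamma * Z(s', a')
        n = len(pred_quantiles)
        taus = (2.0 * np.arange(n) + 1.0) / (2.0 * n)
        u = target_samples[:, None] - pred_quantiles[None, :]     # pairwise TD errors, (M, N)
        huber = np.where(np.abs(u) <= kappa,
                         0.5 * u ** 2,
                         kappa * (np.abs(u) - 0.5 * kappa))
        weight = np.abs(taus[None, :] - (u < 0.0).astype(float))  # asymmetric quantile weight
        return (weight * huber / kappa).mean()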
Non-Stationary Markov Decision Processes, a Worst-Case Approach using Model-Based Reinforcement Learning, Extended version
This work tackles the problem of robust zero-shot planning in non-stationary
stochastic environments. We study Markov Decision Processes (MDPs) evolving
over time and consider Model-Based Reinforcement Learning algorithms in this
setting. We make two hypotheses: 1) the environment evolves continuously with a
bounded evolution rate; 2) a current model is known at each decision epoch but
not its evolution. Our contribution can be presented in four points. 1) we
define a specific class of MDPs that we call Non-Stationary MDPs (NSMDPs). We
introduce the notion of regular evolution by making a hypothesis of
Lipschitz continuity on the transition and reward functions w.r.t. time; 2) we
consider a planning agent using the current model of the environment but
unaware of its future evolution. This leads us to consider a worst-case method
where the environment is seen as an adversarial agent; 3) following this
approach, we propose the Risk-Averse Tree-Search (RATS) algorithm, a zero-shot
Model-Based method similar to Minimax search; 4) we illustrate the benefits
brought by RATS empirically and compare its performance with reference
Model-Based algorithms.
Comment: Published at NeurIPS 2019, 17 pages, 3 figures.
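A compact sketch of the worst-case (minimax-style) search idea described above; the
interfaces for the candidate models and the reward are assumptions made for
illustration, not the RATS algorithm itself:

    def worst_case_value(state, depth, actions, candidate_models, reward, gamma=0.95):
        # Agent maximises over actions; an adversary picks, at each step, the least
        # favourable transition model from a set of admissible candidate models.
        if depth == 0:
            return 0.0
        best = float("-inf")
        for a in actions:
            worst = float("inf")
            for model in candidate_models:
                value = 0.0
                for next_state, prob in model(state, a):   # model returns [(s', p), ...]
                    value += prob * (reward(state, a, next_state)
                                     + gamma * worst_case_value(next_state, depth - 1,
                                                                actions, candidate_models,
                                                                reward, gamma))
                worst = min(worst, value)
            best = max(best, worst)
        return best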
A Distributional Perspective on Reinforcement Learning
In this paper we argue for the fundamental importance of the value
distribution: the distribution of the random return received by a reinforcement
learning agent. This is in contrast to the common approach to reinforcement
learning which models the expectation of this return, or value. Although there
is an established body of literature studying the value distribution, thus far
it has always been used for a specific purpose such as implementing risk-aware
behaviour. We begin with theoretical results in both the policy evaluation and
control settings, exposing a significant distributional instability in the
latter. We then use the distributional perspective to design a new algorithm
which applies Bellman's equation to the learning of approximate value
distributions. We evaluate our algorithm using the suite of games from the
Arcade Learning Environment. We obtain both state-of-the-art results and
anecdotal evidence demonstrating the importance of the value distribution in
approximate reinforcement learning. Finally, we combine theoretical and
empirical evidence to highlight the ways in which the value distribution
impacts learning in the approximate setting.
Comment: ICML 2017.
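The central object in this line of work is the distributional Bellman equation, which
replaces the usual expected-value recursion with an equality in distribution:

    Z(s, a) \overset{D}{=} R(s, a) + \gamma \, Z(S', A'),
        \qquad S' \sim p(\cdot \mid s, a), \;\; A' \sim \pi(\cdot \mid S')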
Implicit Quantile Networks for Distributional Reinforcement Learning
In this work, we build on recent advances in distributional reinforcement
learning to give a generally applicable, flexible, and state-of-the-art
distributional variant of DQN. We achieve this by using quantile regression to
approximate the full quantile function for the state-action return
distribution. By reparameterizing a distribution over the sample space, this
yields an implicitly defined return distribution and gives rise to a large
class of risk-sensitive policies. We demonstrate improved performance on the 57
Atari 2600 games in the ALE, and use our algorithm's implicitly defined
distributions to study the effects of risk-sensitive policies in Atari games.
Comment: ICML 2018.
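One way a sampled quantile fraction can be fed to a network in this implicit-quantile
style is a cosine embedding of the fraction; the sketch below is illustrative (shapes
and the exact interface are assumptions, not the paper's architecture definition):

    import numpy as np

    def quantile_fraction_embedding(tau, weights, bias, n_basis=64):
        # Expand tau in [0, 1] onto cosine basis functions, then apply a learned
        # linear layer and a ReLU; the result is later combined with a state
        # embedding before the network outputs the tau-quantile of the return.
        i = np.arange(n_basis)
        basis = np.cos(np.pi * i * tau)                  # (n_basis,)
        return np.maximum(0.0, basis @ weights + bias)   # (embed_dim,)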
Policy Optimization as Wasserstein Gradient Flows
Policy optimization is a core component of reinforcement learning (RL), and
most existing RL methods directly optimize parameters of a policy based on
maximizing the expected total reward, or its surrogate. Though often achieving
encouraging empirical success, its underlying mathematical principle on
policy-distribution optimization is unclear. We place policy optimization into
the space of probability measures, and interpret it as Wasserstein gradient
flows. On the probability-measure space, under specified circumstances, policy
optimization becomes a convex problem in terms of distribution optimization. To
make optimization feasible, we develop efficient algorithms by numerically
solving the corresponding discrete gradient flows. Our technique is applicable
to several RL settings, and is related to many state-of-the-art
policy-optimization algorithms. Empirical results verify the effectiveness of
our framework, often obtaining better performance compared to related
algorithms.
Comment: Accepted by ICML 2018; initial version at the Deep Reinforcement Learning Symposium, NIPS 2017.
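The discrete gradient-flow steps referred to here are typically of the
Jordan-Kinderlehrer-Otto (JKO) form, written below in general notation (the specific
energy functional F used for policy distributions is the paper's own choice and is
not spelled out here):

    \rho_{k+1} \;=\; \arg\min_{\rho} \; F(\rho) \;+\; \frac{1}{2\tau} \, W_2^{2}(\rho, \rho_k)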
On Wasserstein Reinforcement Learning and the Fokker-Planck equation
Policy gradient methods often achieve better performance when the change in
policy is limited to a small Kullback-Leibler divergence. We derive policy
gradients where the change in policy is limited to a small Wasserstein distance
(or trust region). This is done in the discrete and continuous multi-armed
bandit settings with entropy regularisation. We show that in the small-steps
limit with respect to the Wasserstein distance, policy dynamics are
governed by the Fokker-Planck (heat) equation, following the
Jordan-Kinderlehrer-Otto result. This means that policies undergo diffusion and
advection, concentrating near actions with high reward. This helps elucidate
the nature of convergence in the probability matching setup, and provides
justification for empirical practices such as Gaussian policy priors and
additive gradient noise.
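The Jordan-Kinderlehrer-Otto connection invoked here is the classical one: the
Wasserstein-2 gradient flow of a free energy combining an expected potential and an
entropy term is a Fokker-Planck equation. In policy terms (a sketch, taking the
potential to be the negative reward over actions):

    F(\pi) = -\int r(a)\, \pi(a)\, da \;+\; \beta^{-1} \int \pi(a) \log \pi(a)\, da
    \quad\Longrightarrow\quad
    \partial_t \pi = -\nabla \cdot \big(\pi \, \nabla r\big) + \beta^{-1} \Delta \pi

The advection term pushes probability mass towards high-reward actions, and the
diffusion term comes from the entropy regularisation, matching the behaviour
described above.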
Autoregressive Quantile Networks for Generative Modeling
We introduce autoregressive implicit quantile networks (AIQN), a
fundamentally different approach to generative modeling than those commonly
used, that implicitly captures the distribution using quantile regression. AIQN
is able to achieve superior perceptual quality and improvements in evaluation
metrics, without incurring a loss of sample diversity. The method can be
applied to many existing models and architectures. In this work we extend the
PixelCNN model with AIQN and demonstrate results on CIFAR-10 and ImageNet using
Inception score, FID, non-cherry-picked samples, and inpainting results. We
consistently observe that AIQN yields a highly stable algorithm that improves
perceptual quality while maintaining a highly diverse distribution.
Comment: ICML 2018.
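A schematic of the autoregressive sampling procedure this implies; the
quantile_net(prefix, d, tau) interface is an assumption made for illustration, not
the paper's API:

    import numpy as np

    def autoregressive_quantile_sample(quantile_net, n_dims, rng=np.random):
        # Generate one sample dimension at a time: draw a quantile fraction tau
        # uniformly, then ask the model for that quantile of the current
        # dimension's distribution, conditioned on the dimensions generated so far.
        sample = []
        for d in range(n_dims):
            tau = rng.uniform(0.0, 1.0)
            x_d = quantile_net(np.array(sample), d, tau)
            sample.append(x_d)
        return np.array(sample)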