Reinforcement Learning with Perturbed Rewards
Recent studies have shown that reinforcement learning (RL) models are
vulnerable in various noisy scenarios. For instance, the observed reward
channel is often subject to noise in practice (e.g., when rewards are collected
through sensors) and therefore cannot be fully trusted. In addition, for applications
such as robotics, a deep reinforcement learning (DRL) algorithm can be
manipulated to produce arbitrary errors by receiving corrupted rewards. In this
paper, we consider noisy RL problems with perturbed rewards, which can be
approximated with a confusion matrix. We develop a robust RL framework that
enables agents to learn in noisy environments where only perturbed rewards are
observed. Our solution framework builds on existing RL/DRL algorithms and is
the first to address the biased noisy reward setting without assumptions on the
noise distribution (e.g., the zero-mean Gaussian noise assumed in previous
works). The core ideas of our solution are estimating a reward confusion
matrix and defining a set of unbiased surrogate rewards. We prove the
convergence and sample complexity of our approach. Extensive experiments on
different DRL platforms show that trained policies based on our estimated
surrogate reward can achieve higher expected rewards, and converge faster than
existing baselines. For instance, the state-of-the-art PPO algorithm obtains
84.6% and 80.8% improvements in average score across five Atari games, with
error rates of 10% and 30%, respectively.
Comment: AAAI 2020 (Spotlight)
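As an illustration of the surrogate-reward idea, the following is a minimal sketch for a finite reward set, assuming the confusion matrix C (with C[i, j] = P(observe reward j | true reward i)) has already been estimated and is invertible; the variable names and toy values are hypothetical, not the authors' code.

```python
import numpy as np

# Toy setting: the true reward takes one of two values, and the observed
# (perturbed) reward follows a confusion matrix C, where
# C[i, j] = P(observe reward_values[j] | true reward is reward_values[i]).
# Both the reward values and C here are made-up illustrative numbers.
reward_values = np.array([-1.0, 1.0])
C = np.array([[0.9, 0.1],
              [0.3, 0.7]])

# Unbiased surrogate rewards solve C @ r_hat = reward_values, so that
# E[r_hat(observed) | true reward] equals the true reward.
r_hat = np.linalg.solve(C, reward_values)

def surrogate(observed_reward: float) -> float:
    """Map an observed (possibly perturbed) reward to its surrogate value."""
    j = int(np.argmin(np.abs(reward_values - observed_reward)))
    return float(r_hat[j])

# Unbiasedness check: the expected surrogate under each true reward
# recovers that true reward.
for i, r_true in enumerate(reward_values):
    print(f"true reward {r_true:+.1f} -> E[surrogate] = {C[i] @ r_hat:+.3f}")
```

An agent would then train on surrogate(observed) in place of the raw perturbed reward; this sketch shows only the construction, not the confusion-matrix estimation step.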
Double Q-learning
In some stochastic environments the well-known reinforcement learning algorithm Q-learning performs very poorly. This poor performance is caused by large overestimations of action values, which result from a positive bias that is introduced because Q-learning uses the maximum action value as an approximation for the maximum expected action value. We introduce an alternative way to approximate the maximum expected value for any set of random variables. The obtained double estimator method is shown to sometimes underestimate rather than overestimate the maximum expected value. We apply the double estimator to Q-learning to construct Double Q-learning, a new off-policy reinforcement learning algorithm. We show that the new algorithm converges to the optimal policy and that it performs well in some settings in which Q-learning performs poorly due to its overestimations.
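A minimal sketch of the tabular Double Q-learning update described above, using two action-value tables and a coin flip to decide which one is updated; the hyperparameters, table sizes, and greedy behaviour rule are illustrative choices, not prescribed by the abstract.

```python
import numpy as np

rng = np.random.default_rng(0)
n_states, n_actions = 5, 3
alpha, gamma = 0.1, 0.99        # illustrative learning rate and discount

# Two independent action-value tables, the core of Double Q-learning.
Q_A = np.zeros((n_states, n_actions))
Q_B = np.zeros((n_states, n_actions))

def double_q_update(s, a, r, s_next, done):
    """One Double Q-learning update: a coin flip picks which table to update;
    that table selects the greedy next action, the other table evaluates it."""
    if rng.random() < 0.5:
        a_star = int(np.argmax(Q_A[s_next]))
        target = r + (0.0 if done else gamma * Q_B[s_next, a_star])
        Q_A[s, a] += alpha * (target - Q_A[s, a])
    else:
        b_star = int(np.argmax(Q_B[s_next]))
        target = r + (0.0 if done else gamma * Q_A[s_next, b_star])
        Q_B[s, a] += alpha * (target - Q_B[s, a])

def greedy_action(s):
    """Behave greedily with respect to the combined estimates."""
    return int(np.argmax(Q_A[s] + Q_B[s]))

# Example update for a single made-up transition (s, a, r, s', done).
double_q_update(s=0, a=1, r=1.0, s_next=2, done=False)
```

Decoupling action selection from action evaluation across the two tables is what removes the positive bias of the single-estimator maximum.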
Estimating the maximum expected value in continuous reinforcement learning problems
This paper is about the estimation of the maximum expected value of an infinite set of random variables. This estimation problem is relevant in many fields, such as Reinforcement Learning (RL). In RL it is well known that, in some stochastic environments, a bias in the estimation can increase the approximation error step by step, leading to large overestimates of the true action values. Recently, some approaches have been proposed to reduce such bias in order to get better action-value estimates, but they are limited to finite problems. In this paper, we leverage the recently proposed weighted estimator and Gaussian process regression to derive a new method that natively handles infinitely many random variables. We show how these techniques can be used to address RL problems with both continuous states and continuous actions. To evaluate the effectiveness of the proposed approach, we perform empirical comparisons with related approaches.
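As a rough illustration of the weighted-estimator idea in the finite case (the paper's contribution is the extension to infinitely many variables via Gaussian process regression, which this sketch does not attempt), each sample mean is weighted by an estimated probability of being the maximum; the function name, Monte Carlo approximation, and toy numbers are assumptions made here for illustration.

```python
import numpy as np

rng = np.random.default_rng(1)

def weighted_max_estimate(means, std_errors, n_mc=10_000):
    """Weighted estimator of max_i E[X_i] for a *finite* set of variables:
    each sample mean is weighted by a Monte Carlo estimate of the probability
    that it is the largest, treating the sample means as roughly Gaussian."""
    means = np.asarray(means, dtype=float)
    std_errors = np.asarray(std_errors, dtype=float)
    draws = rng.normal(means, std_errors, size=(n_mc, means.size))
    win_prob = np.bincount(np.argmax(draws, axis=1), minlength=means.size) / n_mc
    return float(win_prob @ means)

# The naive max of the sample means is biased upward; the weighted estimate,
# a convex combination of the sample means, cannot exceed it.
sample_means = [0.0, 0.1, 0.05]
sample_std_errors = [0.5, 0.5, 0.5]
print("max of sample means:", max(sample_means))
print("weighted estimate  :", weighted_max_estimate(sample_means, sample_std_errors))
```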
Bounded Optimal Exploration in MDP
Within the framework of probably approximately correct Markov decision
processes (PAC-MDP), much theoretical work has focused on methods to attain
near optimality after a relatively long period of learning and exploration.
However, practical concerns require the attainment of satisfactory behavior
within a short period of time. In this paper, we relax the PAC-MDP conditions
to reconcile theoretically driven exploration methods and practical needs. We
propose simple algorithms for discrete and continuous state spaces, and
illustrate the benefits of our proposed relaxation via theoretical analyses and
numerical examples. Our algorithms also maintain anytime error bounds and
average loss bounds. Our approach accommodates both Bayesian and non-Bayesian
methods.
Comment: In Proceedings of the 30th AAAI Conference on Artificial Intelligence (AAAI), 2016