7 research outputs found

    Tightening Exploration in Upper Confidence Reinforcement Learning

    Full text link
    The upper confidence reinforcement learning (UCRL2) strategy introduced in (Jaksch et al., 2010) is a popular method for regret minimization in unknown discrete Markov Decision Processes under the average-reward criterion. Despite its nice and generic theoretical regret guarantees, this strategy and its variants have so far remained mostly theoretical, as numerical experiments on simple environments exhibit long burn-in phases before learning takes place. Motivated by practical efficiency, we present UCRL3, which follows the lines of UCRL2 but with two key modifications. First, it uses state-of-the-art time-uniform concentration inequalities to compute confidence sets on the reward and transition distributions for each state-action pair. Second, to further tighten exploration, we introduce an adaptive computation of the support of each transition distribution. This enables us to revisit the extended value iteration procedure so that it optimizes over distributions with reduced support, disregarding low-probability transitions while still ensuring near-optimism. We demonstrate, through numerical experiments on standard environments, that reducing exploration in this way yields a substantial numerical improvement over UCRL2 and its variants. On the theoretical side, these modifications enable us to derive a regret bound for UCRL3 that improves on that of UCRL2 and, thanks to variance-aware concentration bounds, involves for the first time notions of local diameter and effective support.
    Comment: Accepted to ICML 2020
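    The two ingredients above, time-uniform confidence sets and an adaptively reduced transition support, can be illustrated with a minimal sketch. The snippet below is only an assumption-laden illustration, not the construction from the paper: time_uniform_width is a generic placeholder width (UCRL3 relies on sharper, variance-aware bounds), and the support rule simply drops next states whose empirical probability falls below that width.

```python
import numpy as np

def time_uniform_width(n, delta):
    """Placeholder time-uniform confidence width for an empirical mean of
    n samples in [0, 1]; this simple form is for illustration only and is
    NOT the inequality used in UCRL3."""
    n = max(int(n), 1)
    return np.sqrt(np.log(np.sqrt(n + 1.0) / delta) / n)

def reduced_support_confidence_set(counts, delta):
    """Given next-state counts for one state-action pair, return the empirical
    transition distribution, a confidence width, and an adaptively reduced
    support that drops low-probability next states."""
    n = counts.sum()
    p_hat = counts / max(int(n), 1)
    eps = time_uniform_width(n, delta)        # one width shared by all components here
    support = np.flatnonzero(p_hat > eps)     # keep only plausibly non-negligible transitions
    return p_hat, eps, support

# Example: three next states observed 50, 2, and 0 times.
counts = np.array([50, 2, 0])
p_hat, eps, support = reduced_support_confidence_set(counts, delta=0.05)
print(p_hat, round(float(eps), 3), support)   # rarely/never visited states drop out of the support
```

    In the spirit of the abstract, extended value iteration would then optimize only over distributions supported on the returned set, which is what keeps exploration tight.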

    Model-Based Reinforcement Learning Exploiting State-Action Equivalence

    Get PDF
    Leveraging an equivalence property in the state-space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance; these sets are provably smaller than their equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and we define confidence sets using this approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of regret by a factor of √(SA/C) in any communicating MDP with S states, A actions, and C classes, which corresponds to a massive improvement when C ≪ SA. To the best of our knowledge, this is the first work providing regret bounds for RL in which an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
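    To make the effect of equivalence-aware confidence sets concrete, here is a minimal sketch under simplifying assumptions (the function names and the Hoeffding-style width are illustrative, not the paper's construction): state-action pairs in the same class pool their samples, so a confidence width built from the pooled count shrinks by roughly the square root of the class size.

```python
import numpy as np

def hoeffding_width(n, delta):
    """Illustrative Hoeffding-style confidence width for the empirical mean of
    n samples in [0, 1]; not the exact time-uniform bound used by C-UCRL."""
    return np.sqrt(np.log(2.0 / delta) / (2.0 * max(int(n), 1)))

def pooled_width(counts_per_pair, delta):
    """Equivalence-aware width: state-action pairs in the same class share
    their samples, so the width is computed from the pooled count."""
    return hoeffding_width(sum(counts_per_pair), delta)

# Example: a class of 4 state-action pairs, each visited 100 times.
counts = [100, 100, 100, 100]
w_single = hoeffding_width(counts[0], delta=0.05)   # width from one pair's data alone
w_pooled = pooled_width(counts, delta=0.05)         # width from the whole class's data
print(w_single / w_pooled)                          # = sqrt(4) = 2: pooling shrinks the width
```

    With SA state-action pairs grouped into C equally sized classes, each pooled count is SA/C times larger, so the confidence widths, and hence the resulting regret guarantee, improve by a factor of about √(SA/C).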
