Horizon-Free and Variance-Dependent Reinforcement Learning for Latent Markov Decision Processes
We study regret minimization for reinforcement learning (RL) in Latent Markov
Decision Processes (LMDPs) with context in hindsight. We design a novel
model-based algorithmic framework which can be instantiated with both a
model-optimistic and a value-optimistic solver. We prove an
$\tilde{O}(\sqrt{M \Gamma S A K})$ regret bound, where $M$ is the
number of contexts, $S$ is the number of states, $A$ is the number of actions,
$K$ is the number of episodes, and $\Gamma \le S$ is the maximum transition
degree of any state-action pair. The regret bound scales only logarithmically
with the planning horizon, thus yielding the first (nearly) horizon-free regret
bound for LMDPs. Key in our proof is an analysis of the total variance of alpha
vectors, which is carefully bounded by a recursion-based technique. We
complement our positive result with a novel $\Omega(\sqrt{M S A K})$
regret lower bound with $\Gamma = 2$, which shows our upper bound is minimax
optimal when $\Gamma$ is a constant. Our lower bound relies on new
constructions of hard instances and an argument based on the symmetrization
technique from theoretical computer science, both of which are technically
different from existing lower bound proofs for MDPs, and thus can be of
independent interest.

Comment: ICML 2023
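To make the setting concrete, here is a minimal, illustrative sketch of an LMDP interaction loop with context revealed in hindsight. The toy dynamics, the dimensions, and all names (run_episode, P, R) are our own assumptions for illustration, not constructs from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

M, S, A, H = 3, 4, 2, 5  # contexts, states, actions, horizon

# One transition kernel and one mean-reward table per latent context.
P = rng.dirichlet(np.ones(S), size=(M, S, A))  # P[m, s, a]: distribution over next states
R = rng.uniform(size=(M, S, A))                # mean rewards in [0, 1]

def run_episode(policy):
    """Play one episode: a latent context is drawn and stays hidden during
    the episode; it is revealed only in hindsight, after the episode ends."""
    m = int(rng.integers(M))     # latent context, unknown to the learner
    s, total = 0, 0.0
    for h in range(H):
        a = policy(s, h)
        total += R[m, s, a]
        s = int(rng.choice(S, p=P[m, s, a]))
    return total, m              # the context m is revealed here, in hindsight

# Example: a uniformly random policy.
ret, context = run_episode(lambda s, h: int(rng.integers(A)))
print(f"return={ret:.2f}, revealed context={context}")
```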
Model-Based Reinforcement Learning Exploiting State-Action Equivalence
Leveraging an equivalence property in the state space of a Markov Decision Process (MDP) has been investigated in several studies. This paper studies equivalence structure in the reinforcement learning (RL) setup, where transition distributions are no longer assumed to be known. We present a notion of similarity between transition probabilities of various state-action pairs of an MDP, which naturally defines an equivalence structure in the state-action space. We present equivalence-aware confidence sets for the case where the learner knows the underlying structure in advance. These sets are provably smaller than their corresponding equivalence-oblivious counterparts. In the more challenging case of an unknown equivalence structure, we present an algorithm called ApproxEquivalence that seeks to find an (approximate) equivalence structure, and define confidence sets using the approximate equivalence. To illustrate the efficacy of the presented confidence sets, we present C-UCRL, a natural modification of UCRL2 for RL in undiscounted MDPs. In the case of a known equivalence structure, we show that C-UCRL improves over UCRL2 in terms of regret by a factor of $\sqrt{SA/C}$ in any communicating MDP with $S$ states, $A$ actions, and $C$ classes, which corresponds to a massive improvement when $C \ll SA$. To the best of our knowledge, this is the first work providing regret bounds for RL when an equivalence structure in the MDP is efficiently exploited. In the case of an unknown equivalence structure, we show through numerical experiments that C-UCRL combined with ApproxEquivalence outperforms UCRL2 in ergodic MDPs.
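The core mechanism behind equivalence-aware confidence sets is sample pooling: state-action pairs in the same class share one estimate built from their aggregated counts, so the confidence width shrinks with the class-level count. A minimal sketch of that idea follows; it assumes pairs in a class share an identical transition distribution (the paper's notion also allows equivalence up to a reordering of next-state probabilities), and the Weissman-style L1 width is illustrative:

```python
import numpy as np

def l1_confidence_width(n, S, delta):
    """Weissman-style L1 deviation bound for an empirical distribution over
    S outcomes estimated from n samples (illustrative constant)."""
    return np.sqrt(2 * S * np.log(2 / delta) / max(n, 1))

def pooled_estimates(counts, classes, delta=0.05):
    """counts[s, a, s'] holds observed transition counts; classes[s, a] is
    the (known) equivalence class of pair (s, a). Pairs in one class share
    a transition distribution, so their samples are aggregated."""
    S, A, _ = counts.shape
    p_hat = np.empty((S, A, S))
    widths = np.empty((S, A))
    for c in np.unique(classes):
        mask = classes == c                   # all pairs in class c
        pooled = counts[mask].sum(axis=0)     # aggregate their counts
        n_c = pooled.sum()
        p_hat[mask] = pooled / max(n_c, 1)    # one shared estimate per class
        widths[mask] = l1_confidence_width(n_c, S, delta)
    return p_hat, widths
```

With C classes instead of SA independent pairs, each width is driven by the class-level count n_c, which is roughly SA/C times larger than a per-pair count; this is where the regret improvement comes from.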
Tightening Exploration in Upper Confidence Reinforcement Learning
The upper confidence reinforcement learning (UCRL2) strategy introduced in
(Jaksch et al., 2010) is a popular method to perform regret minimization in
unknown discrete Markov Decision Processes under the average-reward criterion.
Despite its generic theoretical regret guarantees, this strategy and its
variants have remained until now mostly theoretical, as numerical experiments
on simple environments exhibit long burn-in phases before learning takes
place. Motivated by practical efficiency, we present UCRL3, following the
lines of UCRL2, but with two key modifications. First, it uses
state-of-the-art time-uniform concentration inequalities to compute confidence
sets on the reward and transition distributions for each state-action pair.
Second, to further tighten exploration, we introduce an adaptive computation
of the support of each transition distribution. This makes it possible to
revisit the extended value iteration procedure to optimize over distributions
with reduced support, disregarding low-probability transitions while still
ensuring near-optimism. We demonstrate, through numerical experiments on
standard environments, that reducing exploration this way yields a substantial
numerical improvement compared to UCRL2 and its variants. On the theoretical
side, these key modifications enable us to derive a regret bound for UCRL3
improving on that of UCRL2, which for the first time features notions of local
diameter and local effective support, thanks to variance-aware concentration
bounds.

Comment: Accepted to ICML 2020
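A minimal sketch of the support-reduction idea follows: next states with empirical probability below a data-dependent threshold are dropped before the optimistic backup of extended value iteration shifts confidence-budget mass toward high-value next states. The threshold, the width semantics, and the function names are our own illustrative choices, not the exact UCRL3 rules:

```python
import numpy as np

def reduced_support(p_hat, n, delta=0.05):
    """Keep next states whose empirical probability exceeds the estimation
    noise level; the threshold is illustrative, not the exact UCRL3 rule."""
    support = p_hat > np.log(1 / delta) / max(n, 1)
    support[np.argmax(p_hat)] = True            # never empty
    return support

def optimistic_backup(p_hat, support, width, values):
    """Inner maximization of extended value iteration on the reduced
    support: shift up to width/2 of probability mass from the lowest-value
    next states to the highest-value one."""
    idx = np.where(support)[0]
    order = idx[np.argsort(values[idx])[::-1]]  # best next states first
    q = np.where(support, p_hat, 0.0)
    q /= q.sum()                                # renormalize on the support
    q[order[0]] = min(1.0, q[order[0]] + width / 2)
    excess = q.sum() - 1.0                      # mass to take back from the worst states
    for j in order[::-1]:
        take = min(q[j], excess)
        q[j] -= take
        excess -= take
        if excess <= 0:
            break
    return q @ values
```

Restricting the optimization to the reduced support is what lets the bound depend on an effective support size rather than the full number of states.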
Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments
We study variance-dependent regret bounds for Markov decision processes
(MDPs). Algorithms with variance-dependent regret guarantees can automatically
exploit environments with low variance (e.g., enjoying constant regret on
deterministic MDPs). The existing algorithms are either variance-independent
or suboptimal. We first propose two new environment norms to characterize the
fine-grained variance properties of the environment. For model-based methods,
we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new
analysis techniques to show that this algorithm enjoys variance-dependent
bounds with respect to our proposed norms. In particular, this bound is
simultaneously minimax optimal for both stochastic and deterministic MDPs, the
first result of its kind. We further initiate the study of model-free
algorithms with variance-dependent regret bounds by designing a
reference-function-based algorithm with a novel capped-doubling reference
update schedule. Lastly, we also provide lower bounds to complement our upper
bounds.

Comment: ICML 2023
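Why variance dependence yields constant regret on deterministic MDPs can be seen from a Bernstein-style exploration bonus, whose dominant term scales with the empirical variance of the next-state value and therefore vanishes under deterministic transitions. The sketch below is an illustrative bonus of this standard form, not the paper's MVP variant:

```python
import numpy as np

def bernstein_bonus(p_hat, values, n, delta=0.05):
    """Variance-aware exploration bonus for one state-action pair
    (illustrative Bernstein form, not the exact MVP bonus)."""
    mean = p_hat @ values
    var = p_hat @ (values - mean) ** 2   # variance of the next-state value
    log_term = np.log(1 / delta)
    # The dominant sqrt(var / n) term vanishes on deterministic transitions,
    # leaving only the fast-decaying 1/n term.
    return np.sqrt(2 * var * log_term / n) + 7 * log_term / (3 * n)

values = np.array([0.0, 1.0, 0.5])
print(bernstein_bonus(np.array([0.0, 1.0, 0.0]), values, n=10))  # deterministic: only the 1/n term
print(bernstein_bonus(np.array([0.3, 0.4, 0.3]), values, n=10))  # stochastic: variance term dominates
```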