Reinforcement Learning in the Wild with Maximum Likelihood-based Model Transfer
In this paper, we study the problem of transferring available Markov Decision Process (MDP) models to learn and plan efficiently in an unknown but similar MDP. We refer to this as the Model Transfer Reinforcement Learning (MTRL) problem. First, we formulate MTRL for discrete MDPs and Linear Quadratic Regulators (LQRs) with continuous states and actions. Then, we propose a generic two-stage algorithm, MLEMTRL, to address the MTRL problem in both discrete and continuous settings. In the first stage, MLEMTRL uses a constrained Maximum Likelihood Estimation (MLE)-based approach to estimate the target MDP model using a set of known MDP models. In the second stage, using the estimated target MDP model, MLEMTRL deploys a model-based planning algorithm appropriate for the MDP class. Theoretically, we prove worst-case regret bounds for MLEMTRL in both the realisable and non-realisable settings. We empirically demonstrate that MLEMTRL allows faster learning in new MDPs than learning from scratch and achieves near-optimal performance depending on the similarity of the available MDPs and the target MDP.
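As a rough illustration of the two-stage structure, here is a minimal Python sketch, assuming the constrained MLE is taken over simplex-weighted mixtures of the known transition models and that planning is plain value iteration; the paper's actual constraint set and planner may differ, and the function names are hypothetical.

```python
import numpy as np
from scipy.optimize import minimize

def mle_model_transfer(source_models, transitions):
    """Stage 1 (sketch): fit simplex weights w so that the mixture
    P_hat = sum_k w_k * P_k maximizes the likelihood of the observed
    (s, a, s') transitions from the target MDP. The mixture
    parameterization is an illustrative assumption."""
    P = np.stack(source_models)                  # shape (K, S, A, S)
    K = P.shape[0]

    def neg_log_lik(w):
        P_hat = np.einsum('k,ksat->sat', w, P)
        probs = np.array([P_hat[s, a, s2] for (s, a, s2) in transitions])
        return -np.sum(np.log(probs + 1e-12))    # small constant avoids log(0)

    res = minimize(neg_log_lik, np.full(K, 1.0 / K),
                   bounds=[(0.0, 1.0)] * K,
                   constraints=[{'type': 'eq',
                                 'fun': lambda w: w.sum() - 1.0}])
    return np.einsum('k,ksat->sat', res.x, P)

def value_iteration(P_hat, R, gamma=0.95, tol=1e-6):
    """Stage 2 (sketch): model-based planning in the estimated model."""
    S, A, _ = P_hat.shape
    V = np.zeros(S)
    while True:
        Q = R + gamma * np.einsum('sat,t->sa', P_hat, V)
        V_new = Q.max(axis=1)
        if np.max(np.abs(V_new - V)) < tol:
            return Q.argmax(axis=1), V_new       # greedy policy and values
        V = V_new
```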
The Benefits of Being Distributional: Small-Loss Bounds for Reinforcement Learning
While distributional reinforcement learning (RL) has demonstrated empirical success, the question of when and why it is beneficial has remained unanswered. In this work, we provide one explanation for the benefits of distributional RL through the lens of small-loss bounds, which scale with the instance-dependent optimal cost. If the optimal cost is small, our bounds are stronger than those from non-distributional approaches. As a warmup, we show that learning the cost distribution leads to small-loss regret bounds in contextual bandits (CB), and we find that distributional CB empirically outperforms the state-of-the-art on three challenging tasks. For online RL, we propose a distributional version-space algorithm that constructs confidence sets using maximum likelihood estimation, and we prove that it achieves small-loss regret in tabular MDPs and enjoys small-loss PAC bounds in latent variable models. Building on similar insights, we propose a distributional offline RL algorithm based on the pessimism principle and prove that it enjoys small-loss PAC bounds, which exhibit a novel robustness property. For both online and offline RL, our results provide the first theoretical benefits of learning distributions even when we only need the mean for making decisions.
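As a schematic of the MLE-based confidence-set (version-space) idea for cost distributions, here is a minimal Python sketch over a finite candidate class; the confidence radius `beta`, the discretized cost grid, and the function names are illustrative assumptions, since the paper works with general function classes.

```python
import numpy as np

def mle_version_space(models, data, beta):
    """Keep every candidate cost-distribution model whose log-likelihood on
    the observed (context, action, cost) triples is within beta of the
    maximizer; this set plays the role of an MLE confidence set (sketch)."""
    ll = np.array([sum(np.log(m(x, a, y) + 1e-12) for (x, a, y) in data)
                   for m in models])
    return [m for m, l in zip(models, ll) if l >= ll.max() - beta]

def optimistic_action(version_space, x, actions,
                      grid=np.linspace(0.0, 1.0, 101)):
    """Pick the action whose best-case (smallest) mean cost over the
    confidence set is lowest: optimism under cost minimization. The mean
    is computed on a discretized cost grid, an assumption for illustration."""
    def mean_cost(m, a):
        p = np.array([m(x, a, y) for y in grid])
        return float(p @ grid / p.sum())
    return min(actions,
               key=lambda a: min(mean_cost(m, a) for m in version_space))
```

Note the decision rule only uses the mean of the candidate distributions, consistent with the abstract's point that modeling the full distribution can pay off even when only the mean matters for acting.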
Sharp Variance-Dependent Bounds in Reinforcement Learning: Best of Both Worlds in Stochastic and Deterministic Environments
We study variance-dependent regret bounds for Markov decision processes (MDPs). Algorithms with variance-dependent regret guarantees can automatically exploit environments with low variance (e.g., enjoying constant regret on deterministic MDPs). The existing algorithms are either variance-independent or suboptimal. We first propose two new environment norms to characterize the fine-grained variance properties of the environment. For model-based methods, we design a variant of the MVP algorithm (Zhang et al., 2021a) and use new analysis techniques to show that this algorithm enjoys variance-dependent bounds with respect to our proposed norms. In particular, this bound is simultaneously minimax optimal for both stochastic and deterministic MDPs, the first result of its kind. We further initiate the study of model-free algorithms with variance-dependent regret bounds by designing a reference-function-based algorithm with a novel capped-doubling reference update schedule. Lastly, we also provide lower bounds to complement our upper bounds.
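To make the update schedule concrete, here is a minimal Python sketch of one plausible reading of a capped-doubling reference update rule: the reference value at a state-action pair is refreshed only when its visit count has doubled since the last refresh, and only a capped number of times in total. The class name, parameters, and details are assumptions, not the paper's exact construction.

```python
import numpy as np

class CappedDoublingSchedule:
    """Sketch of a capped-doubling reference update schedule: refresh the
    reference value at (s, a) whenever the visit count has doubled since the
    last refresh, but at most `cap` times in total (illustrative assumption)."""

    def __init__(self, n_states, n_actions, cap=10):
        self.count = np.zeros((n_states, n_actions), dtype=int)  # total visits
        self.last = np.ones((n_states, n_actions), dtype=int)    # count at last refresh
        self.refreshes = np.zeros((n_states, n_actions), dtype=int)
        self.cap = cap

    def visit(self, s, a):
        """Record a visit; return True if the caller should refresh its
        reference value estimate at (s, a) now."""
        self.count[s, a] += 1
        if (self.refreshes[s, a] < self.cap
                and self.count[s, a] >= 2 * self.last[s, a]):
            self.last[s, a] = self.count[s, a]
            self.refreshes[s, a] += 1
            return True
        return False
```

Intuitively, the doubling condition ensures each refresh is based on substantially more data than the last, while the cap keeps the reference function stable, which is the usual ingredient in reference-based variance-reduction arguments.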