4,025 research outputs found
A Theory of Regularized Markov Decision Processes
Many recent successful (deep) reinforcement learning algorithms make use of
regularization, generally based on entropy or Kullback-Leibler divergence. We
propose a general theory of regularized Markov Decision Processes that
generalizes these approaches in two directions: we consider a larger class of
regularizers, and we consider the general modified policy iteration approach,
encompassing both policy iteration and value iteration. The core building
blocks of this theory are a notion of regularized Bellman operator and the
Legendre-Fenchel transform, a classical tool of convex optimization. This
approach allows for error propagation analyses of general algorithmic schemes
of which (possibly variants of) classical algorithms such as Trust Region
Policy Optimization, Soft Q-learning, Stochastic Actor Critic or Dynamic Policy
Programming are special cases. This also draws connections to proximal convex
optimization, especially to Mirror Descent.Comment: ICML 201
Sample-Efficient Multi-Agent RL: An Optimization Perspective
We study multi-agent reinforcement learning (MARL) for the general-sum Markov
Games (MGs) under the general function approximation. In order to find the
minimum assumption for sample-efficient learning, we introduce a novel
complexity measure called the Multi-Agent Decoupling Coefficient (MADC) for
general-sum MGs. Using this measure, we propose the first unified algorithmic
framework that ensures sample efficiency in learning Nash Equilibrium, Coarse
Correlated Equilibrium, and Correlated Equilibrium for both model-based and
model-free MARL problems with low MADC. We also show that our algorithm
provides comparable sublinear regret to the existing works. Moreover, our
algorithm combines an equilibrium-solving oracle with a single objective
optimization subprocedure that solves for the regularized payoff of each
deterministic joint policy, which avoids solving constrained optimization
problems within data-dependent constraints (Jin et al. 2020; Wang et al. 2023)
or executing sampling procedures with complex multi-objective optimization
problems (Foster et al. 2023), thus being more amenable to empirical
implementation
- …