Model-Free Non-Stationary RL: Near-Optimal Regret and Applications in Multi-Agent RL and Inventory Control
We consider model-free reinforcement learning (RL) in non-stationary Markov
decision processes. Both the reward functions and the state transition
functions are allowed to vary arbitrarily over time as long as their cumulative
variations do not exceed certain variation budgets. We propose Restarted
Q-Learning with Upper Confidence Bounds (RestartQ-UCB), the first model-free
algorithm for non-stationary RL, and show that it outperforms existing
solutions in terms of dynamic regret. Specifically, RestartQ-UCB with
Freedman-type bonus terms achieves a dynamic regret bound of
$\widetilde{O}(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H T^{\frac{2}{3}})$, where $S$ and $A$ are the numbers of states and actions,
respectively, $\Delta$ is the variation budget, $H$ is the number of time
steps per episode, and $T$ is the total number of time steps. We further
present a parameter-free algorithm named Double-Restart Q-UCB that does not
require prior knowledge of the variation budget. We show that our algorithms
are \emph{nearly optimal} by establishing an information-theoretical lower
bound of $\Omega(S^{\frac{1}{3}} A^{\frac{1}{3}} \Delta^{\frac{1}{3}} H^{\frac{2}{3}} T^{\frac{2}{3}})$, the first lower bound in non-stationary RL.
Numerical experiments validate the advantages of RestartQ-UCB in terms of both
cumulative rewards and computational efficiency. We demonstrate the power of
our results in examples of multi-agent RL and inventory control across related
products.

Comment: A preliminary version of this work has appeared in ICML 2021.
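To make the restart idea concrete, below is a minimal Python sketch of restarted optimistic Q-learning. It is an illustration, not the paper's exact RestartQ-UCB: the environment interface (reset()/step() returning integer states), the restart schedule restart_every, the bonus constant c, and the Hoeffding-style bonus (where the paper uses sharper Freedman-type terms) are all assumptions made for the example.

import numpy as np

def restart_q_ucb(env, S, A, H, K, restart_every, c=1.0, delta=0.05):
    """Run K episodes of horizon H, periodically restarting the learner."""
    iota = np.log(S * A * H * K / delta)        # standard log factor
    returns = []
    for k in range(K):
        if k % restart_every == 0:
            # Restart: drop all accumulated estimates so stale data from an
            # earlier environment cannot bias the current estimates.
            Q = np.full((H, S, A), float(H))    # optimistic initialization
            N = np.zeros((H, S, A), dtype=int)
        s = env.reset()
        total = 0.0
        for h in range(H):
            a = int(np.argmax(Q[h, s]))         # greedy w.r.t. optimistic Q
            s_next, r, _ = env.step(a)
            total += r
            N[h, s, a] += 1
            t = N[h, s, a]
            alpha = (H + 1) / (H + t)           # stage-dependent learning rate
            bonus = c * np.sqrt(H ** 3 * iota / t)   # exploration bonus
            v_next = Q[h + 1, s_next].max() if h + 1 < H else 0.0
            Q[h, s, a] = (1 - alpha) * Q[h, s, a] \
                + alpha * min(r + v_next + bonus, float(H))
            s = s_next
        returns.append(total)
    return returns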
Independent Policy Gradient Methods for Competitive Reinforcement Learning
We obtain global, non-asymptotic convergence guarantees for independent
learning algorithms in competitive reinforcement learning settings with two
agents (i.e., zero-sum stochastic games). We consider an episodic setting where
in each episode, each player independently selects a policy and observes only
their own actions and rewards, along with the state. We show that if both
players run policy gradient methods in tandem, their policies will converge to
a min-max equilibrium of the game, as long as their learning rates follow a
two-timescale rule (which is necessary). To the best of our knowledge, this
constitutes the first finite-sample convergence result for independent policy
gradient methods in competitive RL; prior work has largely focused on
centralized, coordinated procedures for equilibrium computation.

Comment: Appeared at NeurIPS 2020.
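As a toy illustration of the two-timescale rule, the Python sketch below runs two independent projected-gradient learners on a zero-sum matrix game, the single-state special case of the stochastic games studied in the paper. The payoff matrix, the step sizes, and the use of exact gradients in place of the paper's sampled bandit-feedback gradients are all simplifications for the example.

import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

M = np.array([[0.0, 1.0],
              [1.0, 0.5]])        # payoff to player x; player y receives -x^T M y
x = np.ones(2) / 2
y = np.ones(2) / 2
eta_x, eta_y = 0.05, 0.005        # two-timescale rule: y learns much more slowly
for _ in range(20000):
    gx = M @ y                    # gradient of x's expected payoff x^T M y
    gy = -M.T @ x                 # gradient of y's expected payoff -x^T M y
    x = project_simplex(x + eta_x * gx)
    y = project_simplex(y + eta_y * gy)
gap = (M @ y).max() - (M.T @ x).min()
print("duality gap (0 at the min-max equilibrium):", gap)

Simultaneous gradient play with equal step sizes can cycle on such games; the separation eta_x >> eta_y mirrors the two-timescale rule that the paper shows to be necessary.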
Linear Last-iterate Convergence in Constrained Saddle-point Optimization
Optimistic Gradient Descent Ascent (OGDA) and Optimistic Multiplicative
Weights Update (OMWU) for saddle-point optimization have received growing
attention due to their favorable last-iterate convergence. However, their
behaviors in simple bilinear games over the probability simplex are still not
fully understood: previous analyses lack explicit convergence rates, apply
only to an exponentially small learning rate, or require additional
assumptions such as the uniqueness of the optimal solution. In this work, we
significantly expand the understanding of last-iterate convergence for OGDA and
OMWU in the constrained setting. Specifically, for OMWU in bilinear games over
the simplex, we show that when the equilibrium is unique, linear last-iterate
convergence is achieved with a learning rate whose value is set to a universal
constant, improving the result of (Daskalakis & Panageas, 2019b) under the same
assumption. We then significantly extend the results to more general objectives
and feasible sets for the projected OGDA algorithm, by introducing a sufficient
condition under which OGDA exhibits concrete last-iterate convergence rates
with a constant learning rate whose value only depends on the smoothness of the
objective function. We show that bilinear games over any polytope satisfy this
condition and OGDA converges exponentially fast even without the unique
equilibrium assumption. Our condition also holds for
strongly-convex-strongly-concave functions, recovering the result of (Hsieh et
al., 2019). Finally, we provide experimental results to further support our
theory.
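For intuition, here is a small Python sketch of projected OGDA with a constant learning rate on a bilinear game over the simplex; the payoff matrix and the step size are illustrative choices, not the constants from the paper's analysis.

import numpy as np

def project_simplex(v):
    """Euclidean projection onto the probability simplex."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u) - 1.0
    rho = np.nonzero(u - css / (np.arange(len(v)) + 1) > 0)[0][-1]
    return np.maximum(v - css[rho] / (rho + 1), 0.0)

G = np.array([[1.0, -1.0],
              [-1.0, 1.0]])             # min_x max_y x^T G y (matching pennies)
x = np.ones(2) / 2
y = np.ones(2) / 2
gx_prev, gy_prev = G @ y, G.T @ x       # gradients at the initial iterate
eta = 0.05                              # constant learning rate
for _ in range(2000):
    gx, gy = G @ y, G.T @ x
    # Optimistic step: move along 2 * (current gradient) - (previous gradient).
    x = project_simplex(x - eta * (2 * gx - gx_prev))   # descent for x
    y = project_simplex(y + eta * (2 * gy - gy_prev))   # ascent for y
    gx_prev, gy_prev = gx, gy
gap = (G.T @ x).max() - (G @ y).min()
print("last-iterate duality gap:", gap)

Tracking the duality gap of the last iterate, rather than of the averaged iterates, is exactly the quantity whose linear decay the paper establishes.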
A Sharp Analysis of Model-based Reinforcement Learning with Self-Play
Model-based algorithms -- algorithms that explore the environment through
building and utilizing an estimated model -- are widely used in reinforcement
learning practice and theoretically shown to achieve optimal sample efficiency
for single-agent reinforcement learning in Markov Decision Processes (MDPs).
However, for multi-agent reinforcement learning in Markov games, the current
best known sample complexity for model-based algorithms is rather suboptimal
and compares unfavorably against recent model-free approaches. In this paper,
we present a sharp analysis of model-based self-play algorithms for multi-agent
Markov games. We design an algorithm, Optimistic Nash Value Iteration
(Nash-VI), for two-player zero-sum Markov games that is able to output an
$\epsilon$-approximate Nash policy in $\widetilde{O}(H^3 S A B/\epsilon^2)$
episodes of game playing, where $S$ is the number of states, $A$ and $B$ are the
numbers of actions for the two players respectively, and $H$ is the horizon
length. This significantly improves over the best known model-based guarantee
of $\widetilde{O}(H^4 S^2 A B/\epsilon^2)$, and is the first that matches
the information-theoretic lower bound $\Omega(H^3 S(A+B)/\epsilon^2)$ except for
a $\min\{A,B\}$ factor. In addition, our guarantee compares favorably against
the best known model-free algorithm if $\min\{A,B\} = o(H^{1/3})$, and outputs a
single Markov policy while existing sample-efficient model-free algorithms
output a nested mixture of Markov policies that is in general non-Markov and
rather inconvenient to store and execute. We further adapt our analysis to
designing a provably efficient task-agnostic algorithm for zero-sum Markov
games, and designing the first line of provably sample-efficient algorithms for
multi-player general-sum Markov games.
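To sketch what one step of such a model-based algorithm looks like, the Python snippet below performs a single optimistic value-iteration backup in the spirit of Nash-VI. It assumes empirical transition counts are available, uses a generic Hoeffding-style bonus where the paper uses a sharper Bernstein-style term, omits the paper's paired upper and lower value estimates, and solves the stage game with an off-the-shelf LP; all of these are simplifications for illustration.

import numpy as np
from scipy.optimize import linprog

def solve_zero_sum(M):
    """Nash value of the matrix game max_x min_y x^T M y via linear programming."""
    a, b = M.shape
    c = np.zeros(a + 1)
    c[-1] = -1.0                                  # maximize v == minimize -v
    A_ub = np.hstack([-M.T, np.ones((b, 1))])     # v - (M^T x)_j <= 0 for all j
    b_ub = np.zeros(b)
    A_eq = np.ones((1, a + 1))
    A_eq[0, -1] = 0.0                             # sum_i x_i = 1 (v unconstrained)
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=[1.0],
                  bounds=[(0, None)] * a + [(None, None)])
    return res.x[-1]

def optimistic_backup(counts, rewards, V_next, H, h, c_bonus=1.0):
    """One backup at step h: counts[s, a, b, s'], rewards[s, a, b], V_next[s']."""
    S, A, B, _ = counts.shape
    V = np.zeros(S)
    for s in range(S):
        Q = np.zeros((A, B))
        for a in range(A):
            for b in range(B):
                n = max(counts[s, a, b].sum(), 1)
                p_hat = counts[s, a, b] / n       # empirical transition model
                bonus = c_bonus * np.sqrt(H ** 2 / n)
                Q[a, b] = min(rewards[s, a, b] + p_hat @ V_next + bonus, H - h)
        V[s] = solve_zero_sum(Q)                  # Nash value of the stage game
    return V

Repeating this backup from h = H-1 down to h = 0 yields the optimistic value estimates that drive exploration; the policy at each state is the stage-game equilibrium strategy rather than a single-agent argmax.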