Learning in Non-Cooperative Configurable Markov Decision Processes
The Configurable Markov Decision Process framework includes two entities: a Reinforcement Learning agent and a configurator that can modify some environmental parameters to improve the agent's performance. This presupposes that the two actors have the same reward function. What if the configurator does not have the same intentions as the agent? This paper introduces the Non-Cooperative Configurable Markov Decision Process, a setting that allows having two (possibly different) reward functions for the configurator and the agent. Then, we consider an online learning problem, where the configurator has to find the best among a finite set of possible configurations. We propose two learning algorithms to minimize the configurator's expected regret, which exploit the problem's structure, depending on the agent's feedback. While a naive application of the UCB algorithm yields a regret that grows indefinitely over time, we show that our approach suffers only bounded regret. Furthermore, we empirically show the performance of our algorithms in simulated domains.
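As a point of reference for the online configuration-selection problem described above, here is a minimal sketch of the naive UCB1 baseline mentioned in the abstract, treating each candidate configuration as an arm whose reward is the configurator's observed return. The reward model and helper names are illustrative assumptions, not the paper's algorithm (which instead exploits the problem's structure to obtain bounded regret).

```python
# Sketch: vanilla UCB1 over a finite set of environment configurations.
# The "reward" of a configuration is the configurator's return observed
# after the agent acts in it (synthetic model below is hypothetical).
import math
import random

def ucb_select(counts, means, t):
    """Pick the configuration with the highest UCB1 index at round t."""
    for i, n in enumerate(counts):
        if n == 0:                      # play each configuration once first
            return i
    return max(range(len(counts)),
               key=lambda i: means[i] + math.sqrt(2 * math.log(t) / counts[i]))

def run_ucb(configs, observe_return, horizon):
    """configs: list of configuration ids; observe_return(c): noisy return."""
    counts = [0] * len(configs)
    means = [0.0] * len(configs)
    for t in range(1, horizon + 1):
        i = ucb_select(counts, means, t)
        r = observe_return(configs[i])
        counts[i] += 1
        means[i] += (r - means[i]) / counts[i]   # incremental mean update
    return means, counts

if __name__ == "__main__":
    true_values = [0.2, 0.5, 0.35]               # hypothetical configuration values
    means, counts = run_ucb(
        configs=list(range(3)),
        observe_return=lambda c: true_values[c] + random.gauss(0, 0.1),
        horizon=2000,
    )
    print(counts, [round(m, 3) for m in means])
```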
A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games
This paper proposes novel, end-to-end deep reinforcement learning algorithms
for learning two-player zero-sum Markov games. Our objective is to find the
Nash Equilibrium policies, which are free from exploitation by adversarial
opponents. Distinct from prior efforts on finding Nash equilibria in
extensive-form games such as Poker, which feature tree-structured transition
dynamics and discrete state space, this paper focuses on Markov games with
general transition dynamics and continuous state space. We propose (1) Nash DQN
algorithm, which integrates DQN with a Nash finding subroutine for the joint
value functions; and (2) Nash DQN Exploiter algorithm, which additionally
adopts an exploiter for guiding the agent's exploration. Our algorithms are
practical variants of theoretical algorithms that are guaranteed to converge
to Nash equilibria in the basic tabular setting. Experimental evaluation on
both tabular examples and two-player Atari games demonstrates the robustness of
the proposed algorithms against adversarial opponents, as well as their
advantageous performance over existing methods.
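The Nash-finding subroutine operates on joint value functions; one standard way to compute a stage-game equilibrium in the zero-sum case is to solve the matrix game given by the joint Q-values with a linear program. The sketch below illustrates that generic subroutine under the assumption of a tabular payoff matrix; it is not the paper's implementation, and the function name is a placeholder.

```python
# Sketch: equilibrium of a zero-sum matrix game (e.g., the joint Q-values at
# one state) via linear programming. Q is the row player's payoff; the column
# player receives -Q.
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(Q):
    """Return the row player's maximin strategy and the game value for Q (m x n)."""
    m, n = Q.shape
    # Variables: x (row strategy, length m) and v (game value).
    # Maximize v  <=>  minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Constraints v <= (x^T Q)_j for every column j:  -Q^T x + v <= 0.
    A_ub = np.hstack([-Q.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Probability simplex: sum(x) = 1.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
    return res.x[:m], res.x[-1]

# Matching pennies: value 0, uniform equilibrium strategy.
x, v = solve_matrix_game(np.array([[1.0, -1.0], [-1.0, 1.0]]))
print(np.round(x, 3), round(v, 3))
```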
Local and adaptive mirror descents in extensive-form games
We study how to learn $\varepsilon$-optimal strategies in zero-sum imperfect
information games (IIG) with trajectory feedback. In this setting, players
update their policies sequentially based on their observations over a fixed
number of episodes, denoted by $T$. Existing procedures suffer from high
variance due to the use of importance sampling over sequences of actions
(Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we
consider a fixed sampling approach, where players still update their policies
over time, but with observations obtained through a given fixed sampling
policy. Our approach is based on an adaptive Online Mirror Descent (OMD)
algorithm that applies OMD locally to each information set, using individually
decreasing learning rates and a regularized loss. We show that this approach
guarantees a convergence rate of $\tilde{O}(T^{-1/2})$ with high
probability and has a near-optimal dependence on the game parameters when
applied with the best theoretical choices of learning rates and sampling
policies. To achieve these results, we generalize the notion of OMD
stabilization, allowing for time-varying regularization with convex increments.
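To make the local update concrete, the following sketch applies Online Mirror Descent with a negative-entropy mirror map (i.e., a multiplicative-weights update) and an individually decreasing learning rate at a single, isolated information set. The loss model is a synthetic placeholder; the full method additionally relies on a fixed sampling policy and a regularized loss over the game tree.

```python
# Sketch: one-information-set OMD update with entropic regularizer and a
# decreasing learning rate (illustration only, not the paper's LocalOMD).
import numpy as np

def omd_entropy_step(policy, loss, lr):
    """One OMD step with negative-entropy mirror map (multiplicative weights)."""
    logits = np.log(policy) - lr * loss
    logits -= logits.max()                 # numerical stability
    new = np.exp(logits)
    return new / new.sum()

actions = 3
policy = np.full(actions, 1.0 / actions)   # start from the uniform policy
rng = np.random.default_rng(0)
for t in range(1, 101):
    loss = rng.uniform(size=actions)       # stand-in for an estimated local loss
    lr = 1.0 / np.sqrt(t)                  # individually decreasing learning rate
    policy = omd_entropy_step(policy, loss, lr)
print(np.round(policy, 3))
```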
Model-Free Algorithm with Improved Sample Efficiency for Zero-Sum Markov Games
The problem of two-player zero-sum Markov games has recently attracted
increasing interest in theoretical studies of multi-agent reinforcement
learning (RL). In particular, for finite-horizon episodic Markov decision
processes (MDPs), it has been shown that model-based algorithms can find an
$\varepsilon$-optimal Nash Equilibrium (NE) with the sample complexity of $\tilde{O}(H^3SAB/\varepsilon^2)$
, which is optimal in the dependence on the horizon $H$
and the number of states $S$ (where $A$ and $B$ denote the number of actions of
the two players, respectively). However, none of the existing model-free
algorithms can achieve such an optimality. In this work, we propose a
model-free stage-based Q-learning algorithm and show that it achieves the same
sample complexity as the best model-based algorithm, and hence for the first
time demonstrate that model-free algorithms can enjoy the same optimality in
the $H$ dependence as model-based algorithms. The main improvement of the
dependency on $H$ arises by leveraging the popular variance reduction technique
based on the reference-advantage decomposition previously used only for
single-agent RL. However, such a technique relies on a critical monotonicity
property of the value function, which does not hold in Markov games due to the
update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus,
to extend such a technique to Markov games, our algorithm features a key novel
design of updating the reference value functions as the pair of optimistic and
pessimistic value functions whose value difference is the smallest in the
history, in order to achieve the desired improvement in sample efficiency.
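The reference-update rule described in the last sentence can be illustrated with a small sketch: for each state, keep the optimistic/pessimistic value pair with the smallest gap seen so far and use it as the reference for later variance-reduced updates. The function below is a hypothetical illustration of that rule only, not the paper's stage-based Q-learning algorithm.

```python
# Sketch: keep, per state, the (optimistic, pessimistic) value pair whose
# difference is the smallest observed so far and use it as the reference value.
def update_reference(ref, state, v_upper, v_lower):
    """ref: dict mapping state -> (best_upper, best_lower) pair seen so far."""
    best = ref.get(state)
    if best is None or (v_upper - v_lower) < (best[0] - best[1]):
        ref[state] = (v_upper, v_lower)    # tighter pair becomes the reference
    return ref

ref = {}
update_reference(ref, "s0", 1.0, 0.0)      # gap 1.0
update_reference(ref, "s0", 0.8, 0.3)      # gap 0.5 -> replaces the reference
update_reference(ref, "s0", 0.9, 0.2)      # gap 0.7 -> reference unchanged
print(ref)                                 # {'s0': (0.8, 0.3)}
```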