
    A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games

    This paper proposes novel, end-to-end deep reinforcement learning algorithms for learning two-player zero-sum Markov games. Our objective is to find Nash equilibrium policies, which cannot be exploited by adversarial opponents. In contrast to prior work on finding Nash equilibria in extensive-form games such as poker, which feature tree-structured transition dynamics and discrete state spaces, this paper focuses on Markov games with general transition dynamics and continuous state spaces. We propose (1) the Nash DQN algorithm, which integrates DQN with a Nash-finding subroutine for the joint value functions, and (2) the Nash DQN Exploiter algorithm, which additionally adopts an exploiter to guide the agent's exploration. Our algorithms are practical variants of theoretical algorithms that are guaranteed to converge to Nash equilibria in the basic tabular setting. Experimental evaluation on both tabular examples and two-player Atari games demonstrates the robustness of the proposed algorithms against adversarial opponents, as well as their advantageous performance over existing methods.
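    The core of a Nash-DQN-style update is a subroutine that, for each state, treats the joint Q-value matrix as a zero-sum matrix game and backs up its Nash value. The Python sketch below uses an illustrative function name and a standard linear-programming formulation; it is not the paper's implementation, only a minimal way such a subroutine could look.

    import numpy as np
    from scipy.optimize import linprog

    def solve_zero_sum_game(Q):
        """Return the Nash value and the max-player's mixed strategy for payoff matrix Q."""
        m, n = Q.shape
        # Variables: x (row player's mixed strategy, length m) and the game value v.
        # Maximize v subject to: for every column j, sum_i x_i * Q[i, j] >= v.
        c = np.zeros(m + 1)
        c[-1] = -1.0                                # linprog minimizes, so minimize -v
        A_ub = np.hstack([-Q.T, np.ones((n, 1))])   # encodes v - (Q^T x)_j <= 0
        b_ub = np.zeros(n)
        A_eq = np.zeros((1, m + 1))
        A_eq[0, :m] = 1.0                           # the strategy sums to one
        b_eq = np.array([1.0])
        bounds = [(0, None)] * m + [(None, None)]   # x >= 0, v unbounded
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq, bounds=bounds)
        return res.x[-1], res.x[:m]

    # Example: matching pennies has Nash value 0 and a uniform equilibrium strategy.
    value, strategy = solve_zero_sum_game(np.array([[1.0, -1.0], [-1.0, 1.0]]))

    In a Nash-DQN-style training loop, the returned value would play the role that the max over actions plays in the standard DQN Bellman target for the joint Q-network.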

    Context-lumpable stochastic bandits

    We consider a contextual bandit problem with $S$ contexts and $A$ actions. In each round $t = 1, 2, \dots$ the learner observes a random context and chooses an action based on its past experience. The learner then observes a random reward whose mean is a function of the context and the action for the round. Under the assumption that the contexts can be lumped into $r \le \min\{S, A\}$ groups such that the mean reward for the various actions is the same for any two contexts that are in the same group, we give an algorithm that outputs an $\epsilon$-optimal policy after using at most $\widetilde O(r(S+A)/\epsilon^2)$ samples with high probability and provide a matching $\widetilde\Omega(r(S+A)/\epsilon^2)$ lower bound. In the regret minimization setting, we give an algorithm whose cumulative regret up to time $T$ is bounded by $\widetilde O(\sqrt{r^3(S+A)T})$. To the best of our knowledge, we are the first to show the near-optimal sample complexity in the PAC setting and $\widetilde O(\sqrt{\mathrm{poly}(r)(S+K)T})$ minimax regret in the online setting for this problem. We also show our algorithms can be applied to more general low-rank bandits and get improved regret bounds in some scenarios.
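    To make the lumpability assumption concrete, the Python sketch below simulates a context-lumpable instance and estimates one reward table per group, of size $r \times A$ rather than $S \times A$. The known group assignment and the uniform-exploration loop are purely illustrative and are not the paper's algorithm, which must learn the grouping from bandit feedback.

    import numpy as np

    rng = np.random.default_rng(0)
    S, A, r = 12, 5, 3                         # contexts, actions, latent groups
    group_of = rng.integers(r, size=S)         # lumping map: context -> group
    group_means = rng.uniform(size=(r, A))     # mean reward per (group, action)

    def pull(context, action):
        """Sample a Bernoulli reward whose mean depends only on the context's group."""
        return rng.binomial(1, group_means[group_of[context], action])

    # With the grouping known, empirical means per (group, action) need only on the
    # order of r * A / eps^2 samples instead of S * A / eps^2.
    counts = np.zeros((r, A))
    sums = np.zeros((r, A))
    for _ in range(5000):
        s_t = rng.integers(S)                  # random context each round
        a_t = rng.integers(A)                  # uniform exploration, for the sketch only
        g = group_of[s_t]
        counts[g, a_t] += 1
        sums[g, a_t] += pull(s_t, a_t)
    estimated_means = np.divide(sums, np.maximum(counts, 1))
    greedy_policy = {s: int(np.argmax(estimated_means[group_of[s]])) for s in range(S)}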