A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games
This paper proposes novel, end-to-end deep reinforcement learning algorithms
for learning two-player zero-sum Markov games. Our objective is to find the
Nash Equilibrium policies, which are free from exploitation by adversarial
opponents. Distinct from prior efforts on finding Nash equilibria in
extensive-form games such as Poker, which feature tree-structured transition
dynamics and discrete state space, this paper focuses on Markov games with
general transition dynamics and continuous state space. We propose (1) the
Nash DQN algorithm, which integrates DQN with a Nash-finding subroutine for
the joint value functions; and (2) the Nash DQN Exploiter algorithm, which
additionally adopts an exploiter to guide the agent's exploration. Our
algorithms are practical variants of theoretical algorithms that are
guaranteed to converge
to Nash equilibria in the basic tabular setting. Experimental evaluation on
both tabular examples and two-player Atari games demonstrates the robustness of
the proposed algorithms against adversarial opponents, as well as their
advantageous performance over existing methods.
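
The Nash-finding subroutine referenced above can be illustrated with the
classic minimax linear program for two-player zero-sum matrix games. Below is
a minimal Python sketch, assuming the joint value network yields a payoff
matrix Q for the max player; zero_sum_nash_value and the commented
Nash-DQN-style target are illustrative, not the authors' implementation.

    import numpy as np
    from scipy.optimize import linprog

    def zero_sum_nash_value(Q):
        """Nash value and max-player strategy of a zero-sum matrix game.

        Q[i, j] is the payoff to the row (max) player when the row player
        takes action i and the column (min) player takes action j. The
        standard minimax LP is: maximize v subject to
        sum_i p[i] * Q[i, j] >= v for every column j, with p a distribution.
        """
        n, m = Q.shape
        # Decision variables x = [v, p_1, ..., p_n]; linprog minimizes, so -v.
        c = np.concatenate(([-1.0], np.zeros(n)))
        # v - p^T Q[:, j] <= 0 for each column action j.
        A_ub = np.hstack([np.ones((m, 1)), -Q.T])
        b_ub = np.zeros(m)
        # The strategy p sums to one.
        A_eq = np.concatenate(([0.0], np.ones(n))).reshape(1, -1)
        b_eq = np.array([1.0])
        bounds = [(None, None)] + [(0.0, 1.0)] * n
        res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                      bounds=bounds)
        return res.x[0], res.x[1:]  # game value, max-player mixed strategy

    # Hypothetical use inside a Nash-DQN-style update: bootstrap the TD
    # target with the Nash value of the joint Q-matrix at the next state.
    # target = reward + gamma * zero_sum_nash_value(joint_q_network(s_next))[0]

A quick sanity check: for rock-paper-scissors,
zero_sum_nash_value(np.array([[0., -1., 1.], [1., 0., -1.], [-1., 1., 0.]]))
returns value 0 and the uniform mixed strategy.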
Context-lumpable stochastic bandits
We consider a contextual bandit problem with $S$ contexts and $K$ actions.
In each round $t = 1, 2, \dots$ the learner observes a random context and
chooses an action based on its past experience. The learner then observes a
random reward whose mean is a function of the context and the action for the
round. Under the assumption that the contexts can be lumped into
$r \le \min\{S, K\}$ groups such that the mean reward for the various actions
is the same for any two contexts that are in the same group, we give an
algorithm that outputs an $\epsilon$-optimal policy after using at most
$\widetilde{O}(r(S + K)/\epsilon^2)$ samples with high probability and
provide a matching lower bound. In the regret minimization setting, we give
an algorithm whose cumulative regret up to time $T$ is bounded by
$\widetilde{O}(\sqrt{r^3 (S + K) T})$. To the best of our knowledge, we are
the first to show the near-optimal sample complexity in the PAC setting and
$\widetilde{O}(\sqrt{\mathrm{poly}(r)(S + K) T})$ minimax regret in the
online setting for this problem. We also show our algorithms can be applied
to more general low-rank bandits and get improved regret bounds in some
scenarios.
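
To make the lumpability assumption concrete, here is a minimal Python sketch
(not the paper's algorithm) that groups contexts whose estimated mean-reward
vectors coincide up to a tolerance; lump_contexts, reward_estimates, and tol
are hypothetical names.

    import numpy as np

    def lump_contexts(reward_estimates, tol):
        """Greedily group contexts with matching estimated reward vectors.

        reward_estimates is an (S, K) array whose row s holds the empirical
        mean reward of each of the K actions under context s. Two contexts
        land in the same group when their rows differ by at most tol in the
        sup norm, mirroring the assumption that lumped contexts share mean
        rewards. Illustrative only: the paper's method must lump contexts
        from noisy samples, which this naive pass does not address.
        """
        S = reward_estimates.shape[0]
        group = np.full(S, -1, dtype=int)
        n_groups = 0
        for s in range(S):
            for g in range(n_groups):
                # First member of group g serves as its representative.
                rep = np.flatnonzero(group == g)[0]
                if np.max(np.abs(reward_estimates[s]
                                 - reward_estimates[rep])) <= tol:
                    group[s] = g
                    break
            if group[s] == -1:
                group[s] = n_groups
                n_groups += 1
        return group  # group[s] in {0, ..., r - 1}

Once contexts are lumped into $r$ groups, only on the order of $rK$ mean
rewards remain to be estimated rather than $SK$, which is part of the
intuition behind the $\widetilde{O}(r(S + K)/\epsilon^2)$ sample complexity
above.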