Efficient Last-iterate Convergence Algorithms in Solving Games
No-regret algorithms are popular for learning Nash equilibrium (NE) in
two-player zero-sum normal-form games (NFGs) and extensive-form games (EFGs).
Many recent works study no-regret algorithms with last-iterate convergence.
Among them, the two most famous algorithms are Optimistic Gradient Descent
Ascent (OGDA) and Optimistic Multiplicative Weight Update (OMWU). However, OGDA
has high per-iteration complexity. OMWU exhibits a lower per-iteration
complexity but poorer empirical performance, and its convergence holds only
when the NE is unique. Recent works propose a Reward Transformation (RT) framework
for MWU, which removes the uniqueness condition and achieves competitive
performance with OMWU. Unfortunately, RT-based algorithms perform worse than
OGDA given the same number of iterations, and their convergence guarantee is
based on the continuous-time feedback assumption, which does not hold in most
scenarios. To address these issues, we provide a closer analysis of the RT
framework, which holds for both continuous and discrete-time feedback. We
demonstrate that the essence of the RT framework is to transform the problem of
learning NE in the original game into a series of strongly convex-concave
optimization problems (SCCPs). We show that the bottleneck of RT-based
algorithms is the speed of solving the SCCPs. To improve their empirical
performance, we design a novel transformation method so that the SCCPs can be
solved by Regret Matching+ (RM+), a no-regret algorithm with better empirical
performance, resulting in Reward Transformation RM+ (RTRM+). RTRM+ enjoys
last-iterate convergence under the discrete-time feedback setting. Using the
counterfactual regret decomposition framework, we propose Reward Transformation
CFR+ (RTCFR+) to extend RTRM+ to EFGs. Experimental results show that our
algorithms significantly outperform existing last-iterate convergence
algorithms and RM+ (CFR+).
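For concreteness, here is a minimal sketch of plain Regret Matching+ self-play on a random zero-sum matrix game. It illustrates only the RM+ building block named in the abstract; the reward transformation that gives RTRM+ its last-iterate guarantee is not reproduced, and the payoff matrix, iteration count, and exploitability check are illustrative assumptions rather than the paper's setup.

```python
import numpy as np

def rm_plus_step(cum_regret, strategy, utility):
    """One Regret Matching+ update: accumulate instantaneous regrets,
    clip at zero, and play proportionally to the clipped regrets."""
    inst_regret = utility - strategy @ utility          # per-action regret
    cum_regret = np.maximum(cum_regret + inst_regret, 0.0)
    total = cum_regret.sum()
    if total > 0.0:
        return cum_regret, cum_regret / total
    return cum_regret, np.full_like(strategy, 1.0 / strategy.size)

# Self-play on a hypothetical 3x3 zero-sum game (row player maximizes x^T A y).
rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
x = np.full(3, 1 / 3); Rx = np.zeros(3)
y = np.full(3, 1 / 3); Ry = np.zeros(3)
for _ in range(10_000):
    ux, uy = A @ y, -(A.T @ x)    # simultaneous utility feedback
    Rx, x = rm_plus_step(Rx, x, ux)
    Ry, y = rm_plus_step(Ry, y, uy)
# Exploitability of the final (last-iterate) strategy profile.
print("exploitability:", (A @ y).max() - (A.T @ x).min())
```

Note that plain RM+ only guarantees convergence of the averaged strategies, and its last iterate can cycle; the abstract's point is precisely that solving the transformed SCCPs with RM+ upgrades this to last-iterate convergence.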
Generalized Bandit Regret Minimizer Framework in Imperfect Information Extensive-Form Game
Regret minimization methods are a powerful tool for learning approximate Nash
equilibrium (NE) in two-player zero-sum imperfect information extensive-form
games (IIEGs). We consider the problem in the interactive bandit-feedback
setting, where we do not know the dynamics of the IIEG. In general, only the
interactive trajectory and the value of the reached terminal node are
revealed. To learn NE, the regret minimizer is required to estimate the
full-feedback loss gradient from this bandit feedback and to minimize the regret. In
this paper, we propose a generalized framework for this learning setting. It
presents a theoretical framework for the design and modular analysis of
bandit regret minimization methods. We demonstrate that the most recent bandit
regret minimization methods can be analyzed as particular cases of our
framework. Following this framework, we describe a novel method, SIX-OMD, to
learn an approximate NE. It is model-free and significantly improves on the
best existing convergence rate. Moreover, SIX-OMD is computationally
efficient, as it needs to perform the current-strategy and average-strategy
updates only along the sampled trajectory.

Comment: The proofs in this paper contain many errors, especially for SIX-OMD;
the regret bound of this algorithm cannot be correct, since it is lower than
the lowest theoretical regret bound obtained by information theory.
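To make the estimation step concrete, below is a minimal sketch of the standard importance-sampling loss estimator that bandit regret minimizers of this kind typically build on, shown for a single decision point rather than the full sequence-form game. The action count, loss vector, and sample size are illustrative assumptions, and this is not SIX-OMD's own estimator.

```python
import numpy as np

def importance_sampled_loss(strategy, observe_loss, rng):
    """Unbiased estimate of the full loss vector from one sampled action:
    E[l_hat[a]] = strategy[a] * (loss[a] / strategy[a]) = loss[a]."""
    a = rng.choice(strategy.size, p=strategy)   # one interaction/trajectory
    l_hat = np.zeros(strategy.size)
    l_hat[a] = observe_loss(a) / strategy[a]    # importance weighting
    return l_hat

# Hypothetical sanity check: the estimator averages to the true loss vector.
rng = np.random.default_rng(1)
true_loss = np.array([0.2, 0.8, 0.5, 0.1])
strategy = np.full(4, 0.25)
samples = [importance_sampled_loss(strategy, lambda a: true_loss[a], rng)
           for _ in range(100_000)]
print(np.mean(samples, axis=0))   # approaches true_loss
```

The price of unbiasedness is variance that scales with 1/strategy[a], which is the quantity convergence analyses of such bandit methods must control.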
An Efficient Deep Reinforcement Learning Algorithm for Solving Imperfect Information Extensive-Form Games
One of the most popular families of methods for learning Nash equilibrium (NE) in large-scale imperfect information extensive-form games (IIEFGs) is the neural variants of counterfactual regret minimization (CFR). CFR is a special case of Follow-The-Regularized-Leader (FTRL). At each iteration, the neural variants of CFR update the agent's strategy via the estimated counterfactual regrets. They then use neural networks to approximate the new strategy, which incurs an approximation error. These approximation errors accumulate, since the counterfactual regrets at iteration t are estimated using the agent's past approximated strategies. This accumulated approximation error causes poor performance. To address it, we propose a novel FTRL algorithm called FTRL-ORW, which does not use the agent's past strategies to pick the next iteration's strategy. More importantly, FTRL-ORW can update its strategy via trajectories sampled from the game, which makes it suitable for solving large-scale IIEFGs, since sampling multiple actions for each information set is too expensive in such games. However, it remains unclear which algorithm to use to compute the next iteration's strategy for FTRL-ORW when only such sampled trajectories are revealed at iteration t. To address this problem and scale FTRL-ORW to large-scale games, we provide a model-free method called Deep FTRL-ORW, which computes the next iteration's strategy using Maximum Entropy Deep Reinforcement Learning. Experimental results on two-player zero-sum IIEFGs show that Deep FTRL-ORW significantly outperforms existing model-free neural methods and OS-MCCFR.
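The abstract does not spell out FTRL-ORW's update rule, so as an illustrative baseline here is the generic entropy-regularized FTRL update that connects CFR-style methods to FTRL and to the maximum-entropy objective mentioned above; the learning rate and reward vector are assumptions, not values from the paper.

```python
import numpy as np

def ftrl_entropy_strategy(cum_reward, eta):
    """FTRL with a negative-entropy regularizer: the maximizer over the
    simplex of <x, cum_reward> + (1/eta) * H(x), where H is Shannon
    entropy, has the closed form softmax(eta * cum_reward)."""
    z = eta * cum_reward
    z = z - z.max()              # shift for numerical stability
    p = np.exp(z)
    return p / p.sum()

# Hypothetical usage: cumulative rewards of three actions after some iterations.
print(ftrl_entropy_strategy(np.array([1.0, 2.5, 0.3]), eta=0.5))
```

The closed form follows from the Lagrangian stationarity condition cum_reward - (1/eta)(log x + 1) = lambda, which forces x proportional to exp(eta * cum_reward).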