Mediator Interpretation and Faster Learning Algorithms for Linear Correlated Equilibria in General Extensive-Form Games
A recent paper by Farina & Pipis (2023) established the existence of
uncoupled no-linear-swap regret dynamics with polynomial-time iterations in
extensive-form games. The equilibrium points reached by these dynamics, known
as linear correlated equilibria, are currently the tightest known relaxation of
correlated equilibrium that can be learned in polynomial time in any finite
extensive-form game. However, their properties remain largely unexplored, and
their computation is onerous. In this paper, we provide several contributions
shedding light on the fundamental nature of linear-swap regret. First, we show
a connection between linear deviations and a generalization of communication
deviations in which the player can make queries to a "mediator" who replies
with action recommendations, and, critically, the player is not constrained to
match the timing of the game as would be the case for communication deviations.
We coin this latter set the untimed communication (UTC) deviations. We show
that the UTC deviations coincide precisely with the linear deviations, and
therefore that any player minimizing UTC regret also minimizes linear-swap
regret. We then leverage this connection to develop state-of-the-art no-regret
algorithms for computing linear correlated equilibria, both in theory and in
practice. In theory, our algorithms achieve polynomially better per-iteration
runtimes; in practice, our algorithms represent the state of the art by several
orders of magnitude.
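To make the notion of linear-swap regret concrete, the following is a minimal sketch of how the empirical linear-swap regret can be computed in the simpler normal-form (single-simplex) setting, where linear deviations are exactly the column-stochastic matrices. The columnwise decomposition used here is standard; the extensive-form UTC/mediator construction in the paper is substantially more involved, and the function name and interface below are our own illustration, not the paper's code.

```python
import numpy as np

def linear_swap_regret(strategies, utilities):
    """Empirical linear-swap regret in a normal-form (simplex) setting.

    strategies: (T, n) array; each row is a mixed strategy x_t on the simplex.
    utilities:  (T, n) array; each row is the utility vector g_t, so the
                realized utility at round t is <g_t, x_t>.

    A linear deviation is a column-stochastic matrix M mapping the simplex
    to itself; the regret is max_M sum_t <g_t, M x_t - x_t>. Since the
    objective equals <M, A>_F with A = sum_t g_t x_t^T, the maximum
    decomposes column by column: the best M puts all mass of column j on
    the largest entry of column j of A.
    """
    X = np.asarray(strategies, dtype=float)
    G = np.asarray(utilities, dtype=float)
    A = G.T @ X                        # A[i, j] = sum_t g_t[i] * x_t[j]
    best = A.max(axis=0).sum()         # value of the best linear deviation
    realized = np.einsum('ti,ti->', G, X)
    return best - realized
```

Driving this quantity to grow sublinearly in T is what a no-linear-swap-regret learner achieves; the paper's contribution is doing so with efficient iterations over the exponentially larger extensive-form strategy space.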
Faster Game Solving via Predictive Blackwell Approachability: Connecting Regret Matching and Mirror Descent
Blackwell approachability is a framework for reasoning about repeated games
with vector-valued payoffs. We introduce predictive Blackwell approachability,
where an estimate of the next payoff vector is given, and the decision maker
tries to achieve better performance based on the accuracy of that estimator. In
order to derive algorithms that achieve predictive Blackwell approachability,
we start by showing a powerful connection between four well-known algorithms.
Follow-the-regularized-leader (FTRL) and online mirror descent (OMD) are the
most prevalent regret minimizers in online convex optimization. In spite of
this prevalence, the regret matching (RM) and regret matching+ (RM+) algorithms
have been preferred in the practice of solving large-scale games (as the local
regret minimizers within the counterfactual regret minimization framework). We
show that RM and RM+ are the algorithms that result from running FTRL and OMD,
respectively, to select the halfspace to force at all times in the underlying
Blackwell approachability game. By applying the predictive variants of FTRL or
OMD to this connection, we obtain predictive Blackwell approachability
algorithms, as well as predictive variants of RM and RM+. In experiments across
18 common zero-sum extensive-form benchmark games, we show that predictive RM+
coupled with counterfactual regret minimization converges vastly faster than
the fastest prior algorithms (CFR+, DCFR, LCFR) across all games but two of the
poker games and Liar's Dice, sometimes by two or more orders of magnitude
Local and adaptive mirror descents in extensive-form games
We study how to learn ε-optimal strategies in zero-sum imperfect
information games (IIG) with trajectory feedback. In this setting, players
update their policies sequentially based on their observations over a fixed
number of episodes, denoted by T. Existing procedures suffer from high
variance due to the use of importance sampling over sequences of actions
(Steinberger et al., 2020; McAleer et al., 2022). To reduce this variance, we
consider a fixed sampling approach, where players still update their policies
over time, but with observations obtained through a given fixed sampling
policy. Our approach is based on an adaptive Online Mirror Descent (OMD)
algorithm that applies OMD locally to each information set, using individually
decreasing learning rates and a regularized loss. We show that this approach
guarantees a convergence rate of Õ(T^{-1/2}) with high
probability and has a near-optimal dependence on the game parameters when
applied with the best theoretical choices of learning rates and sampling
policies. To achieve these results, we generalize the notion of OMD
stabilization, allowing for time-varying regularization with convex increments.
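To illustrate the local update, here is a minimal sketch of one OMD step at a single information set with an entropic (multiplicative-weights) regularizer and an individually decreasing learning rate. It assumes a fully mixed current policy and omits the paper's loss regularization and fixed-sampling corrections; the step-size schedule is an illustrative choice, not the paper's tuned one.

```python
import numpy as np

def local_omd_step(policy, loss_estimate, t, base_lr=1.0):
    """One local OMD step at a single information set.

    policy:        current policy over actions (strictly positive entries).
    loss_estimate: estimated loss vector for the actions at this infoset.
    t:             round index (t >= 1), driving eta_t = base_lr / sqrt(t).
    """
    eta = base_lr / np.sqrt(t)
    # Entropic mirror step: x_{t+1} proportional to x_t * exp(-eta * loss).
    logits = np.log(policy) - eta * loss_estimate
    w = np.exp(logits - logits.max())   # shift for numerical stability
    return w / w.sum()                  # renormalize onto the simplex
```

Running this update independently at every information set, with per-infoset learning rates and regularized loss estimates as in the paper, is what yields the high-probability Õ(T^{-1/2}) guarantee.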