Distributed Algorithms Based on Fictitious Play for Near Optimal Sequential Decision Making.
We develop stochastic search algorithms to find optimal or close to optimal solutions for sequential decision making problems.
We specifically consider two problem classes:
1. Large-scale, discrete, deterministic, finite horizon dynamic programming problems:
We use a Sampled Fictitious Play (SFP) algorithm for solving large-scale, finite horizon, discrete dynamic programming (DP) problems. We model the DP problem as an identical interest game between multiple players. We show that the SFP algorithm converges to the equilibrium strategies of this game. In addition, we present two new algorithms, namely Repeated SFP and SFP Based Local Search, that find globally optimal solutions using SFP as a base algorithm. We present the performance of the algorithms on dynamic lot sizing problems and the Traveling Salesman Problem (TSP). Numerical experiments show that our algorithms find close to optimal solutions very quickly. We also present small modifications that improve the performance of the algorithms.
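The fictitious-play idea underlying SFP can be sketched in a few lines: each player repeatedly best-responds to a single action sampled from the opponent's empirical play history. The toy identical-interest game, function names, and payoffs below are illustrative assumptions, not taken from the thesis.

```python
import random
from collections import Counter

# Toy identical-interest game: both players receive PAYOFF[a0][a1].
PAYOFF = [[3, 0],
          [0, 2]]  # two pure equilibria; (0, 0) is the better one

def best_response(player, opponent_action):
    """Best reply of `player` to a single sampled opponent action."""
    if player == 0:
        return max(range(2), key=lambda a: PAYOFF[a][opponent_action])
    return max(range(2), key=lambda a: PAYOFF[opponent_action][a])

def sampled_fictitious_play(iterations=500, seed=0):
    rng = random.Random(seed)
    # history[p] counts how often player p has played each action.
    history = [Counter({0: 1}), Counter({0: 1})]
    for _ in range(iterations):
        # Each player samples one action from the other's empirical history
        # (the "sampled" part of SFP) and best-responds to it.
        samples = [rng.choices(list(h), weights=h.values())[0] for h in history]
        actions = [best_response(0, samples[1]), best_response(1, samples[0])]
        for p, a in enumerate(actions):
            history[p][a] += 1
    # The most frequent actions approximate the equilibrium strategies.
    return [max(h, key=h.get) for h in history]
```

Here play settles on the payoff-dominant profile (0, 0); the thesis's Repeated SFP and SFP Based Local Search variants build on this base loop.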
2. Stochastic, discounted, infinite horizon Markov Decision Problems:
Using Sampled Fictitious Play (SFP) concepts, we develop an online learning algorithm, referred to as SFP based Learning (SFPL), for solving a discounted homogeneous Markov Decision Problem (MDP) where the transition probabilities are unknown. In SFPL, we estimate and update the unknown transition probabilities, the optimal value, and the optimal action of each state, simultaneously. We prove the convergence of SFPL to the optimal solution. We compare the performance of SFPL with SARSA and Q-Learning on dynamic location and windy gridworld problems.
Ph.D., Industrial & Operations Engineering, University of Michigan, Horace H. Rackham School of Graduate Studies. http://deepblue.lib.umich.edu/bitstream/2027.42/64602/1/esisikog_1.pd
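The exact SFPL update rules are in the thesis; the sketch below only illustrates the general idea of simultaneously estimating unknown transition probabilities from observed transitions and running value updates on the learned model (a generic model-based scheme with hypothetical names, not the thesis's algorithm).

```python
import random

def model_based_learning(true_P, rewards, gamma=0.9, steps=2000, seed=1):
    """Estimate unknown transition probabilities from experience while
    updating value estimates on the current empirical model."""
    n_states, n_actions = len(rewards), len(rewards[0])
    # Transition counts with a pseudocount of 1 to avoid zero probabilities.
    counts = [[[1] * n_states for _ in range(n_actions)]
              for _ in range(n_states)]
    V = [0.0] * n_states
    rng = random.Random(seed)
    s = 0
    for _ in range(steps):
        a = rng.randrange(n_actions)                     # explore uniformly
        s2 = rng.choices(range(n_states), weights=true_P[s][a])[0]
        counts[s][a][s2] += 1                            # model update
        # One asynchronous value update at the visited state, using the
        # empirical transition estimates.
        V[s] = max(
            rewards[s][b] + gamma * sum(
                c / sum(counts[s][b]) * V[t]
                for t, c in enumerate(counts[s][b]))
            for b in range(n_actions))
        s = s2
    return V
```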
A Deep Reinforcement Learning Approach for Finding Non-Exploitable Strategies in Two-Player Atari Games
This paper proposes novel, end-to-end deep reinforcement learning algorithms
for learning two-player zero-sum Markov games. Our objective is to find the
Nash Equilibrium policies, which are free from exploitation by adversarial
opponents. Distinct from prior efforts on finding Nash equilibria in
extensive-form games such as Poker, which feature tree-structured transition
dynamics and discrete state space, this paper focuses on Markov games with
general transition dynamics and continuous state space. We propose (1) Nash DQN
algorithm, which integrates DQN with a Nash finding subroutine for the joint
value functions; and (2) Nash DQN Exploiter algorithm, which additionally
adopts an exploiter for guiding the agent's exploration. Our algorithms are the
practical variants of theoretical algorithms which are guaranteed to converge
to Nash equilibria in the basic tabular setting. Experimental evaluation on
both tabular examples and two-player Atari games demonstrates the robustness of
the proposed algorithms against adversarial opponents, as well as their
advantageous performance over existing methods.
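In the tabular zero-sum setting, the Nash-finding subroutine for a stage game reduces to a linear program over the row player's mixed strategy. A standard LP formulation (using scipy, not the authors' code; nothing here is specific to the paper) is:

```python
import numpy as np
from scipy.optimize import linprog

def zero_sum_nash(A):
    """Max-min strategy and value for the row player of payoff matrix A.

    Solves  max_v v  s.t.  (x^T A)_j >= v for every column j, x on the
    simplex, rewritten in linprog's minimization form.
    """
    m, n = A.shape
    # Variables: x_1..x_m and v.  Minimize -v.
    c = np.zeros(m + 1)
    c[-1] = -1.0
    # Inequalities: v - (x^T A)_j <= 0 for each column j.
    A_ub = np.hstack([-A.T, np.ones((n, 1))])
    b_ub = np.zeros(n)
    # Equality: sum_i x_i = 1.
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])
    b_eq = np.ones(1)
    bounds = [(0, None)] * m + [(None, None)]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=bounds)
    return res.x[:m], res.x[-1]
```

On matching pennies the routine returns the uniform strategy with value zero; Nash DQN applies such a subroutine to the joint Q-values at each state.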
Zero-sum Polymatrix Markov Games: Equilibrium Collapse and Efficient Computation of Nash Equilibria
The works of (Daskalakis et al., 2009, 2022; Jin et al., 2022; Deng et al.,
2023) indicate that computing Nash equilibria in multi-player Markov games is a
computationally hard task. This fact raises the question of whether or not
computational intractability can be circumvented if one focuses on specific
classes of Markov games. One such example is two-player zero-sum Markov games,
in which efficient ways to compute a Nash equilibrium are known. Inspired by
zero-sum polymatrix normal-form games (Cai et al., 2016), we define a class of
zero-sum multi-agent Markov games in which there are only pairwise interactions
described by a graph that changes per state. For this class of Markov games, we
show that an $\epsilon$-approximate Nash equilibrium can be found efficiently.
To do so, we generalize the techniques of (Cai et al., 2016), by showing that
the set of coarse-correlated equilibria collapses to the set of Nash
equilibria. Afterwards, it is possible to use any algorithm in the literature
that computes approximate coarse-correlated equilibria over Markovian policies
to get an approximate Nash equilibrium.
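The polymatrix structure can be made concrete: each graph edge carries a two-player game, a player's payoff is the sum over its incident edges, and "zero-sum" means the payoffs of all players cancel at every joint action profile. A minimal sketch (illustrative construction, not from the paper) builds pairwise zero-sum edge games, which makes the whole game globally zero-sum:

```python
import itertools
import random

def random_pairwise_zero_sum(n_players=3, n_actions=2, seed=0):
    """Edge payoffs A[(i, j)][a_i][a_j] with A[(j, i)] the negated
    transpose, so every edge game is zero-sum."""
    rng = random.Random(seed)
    A = {}
    for i in range(n_players):
        for j in range(i + 1, n_players):
            M = [[rng.uniform(-1, 1) for _ in range(n_actions)]
                 for _ in range(n_actions)]
            A[(i, j)] = M
            A[(j, i)] = [[-M[ai][aj] for ai in range(n_actions)]
                         for aj in range(n_actions)]
    return A

def payoff(A, profile, i):
    """Player i's payoff: sum of its edge payoffs at this profile."""
    return sum(A[(i, j)][profile[i]][profile[j]]
               for j in range(len(profile)) if j != i and (i, j) in A)
```

Summing `payoff` over all players at any profile gives zero, the defining property of the class studied in the paper.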
Episodic Logit-Q Dynamics for Efficient Learning in Stochastic Teams
We present new learning dynamics combining (independent) log-linear learning
and value iteration for stochastic games within the auxiliary stage game
framework. The dynamics presented provably attain the efficient equilibrium
(also known as optimal equilibrium) in identical-interest stochastic games,
going beyond recent progress, which has concentrated on provable convergence to some
(possibly inefficient) equilibrium. The dynamics are also independent in the
sense that agents take actions consistent with their local viewpoint to a
reasonable extent rather than seeking equilibrium. These aspects can be of
practical interest in the control applications of intelligent and autonomous
systems. The key challenges are the convergence to an inefficient equilibrium
and the non-stationarity of the environment from a single agent's viewpoint due
to the adaptation of others. The log-linear update plays an important role in
addressing the former. We address the latter through the play-in-episodes
scheme in which the agents update their Q-function estimates only at the end of
the episodes
A survey of random processes with reinforcement
The models surveyed include generalized P\'{o}lya urns, reinforced random
walks, interacting urn models, and continuous reinforced processes. Emphasis is
on methods and results, with sketches provided of some proofs. Applications are
discussed in statistics, biology, economics and a number of other areas.
Comment: Published at http://dx.doi.org/10.1214/07-PS094 in the Probability
Surveys (http://www.i-journals.org/ps/) by the Institute of Mathematical
Statistics (http://www.imstat.org)
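The basic Pólya urn illustrates the reinforcement mechanism the survey generalizes: each drawn colour is returned with extra balls of the same colour, and the colour fractions converge almost surely to a random limit (Beta-distributed in the classical case). A minimal simulation sketch:

```python
import random

def polya_urn(red=1, blue=1, reinforcement=1, draws=10_000, seed=0):
    """Basic Pólya urn: each draw returns the ball plus `reinforcement`
    extra balls of the same colour.  The red fraction is a martingale
    and converges a.s.; with reinforcement == 1 and one initial ball of
    each colour, the limit is uniform on (0, 1)."""
    rng = random.Random(seed)
    for _ in range(draws):
        if rng.random() < red / (red + blue):
            red += reinforcement
        else:
            blue += reinforcement
    return red / (red + blue)
```

With `reinforcement=0` the urn is never modified and the fraction stays at its initial value, a degenerate sanity check; positive reinforcement produces the path-dependent limits the survey studies.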
A mirror descent approach for Mean Field Control applied to Demand-Side Management
We consider a finite-horizon Mean Field Control problem for Markovian models.
The objective function is composed of a sum of convex and Lipschitz functions
taking their values on a space of state-action distributions. We introduce an
iterative algorithm which we prove to be a Mirror Descent associated with a
non-standard Bregman divergence, having a convergence rate of order
$1/\sqrt{K}$. It requires the solution of a simple dynamic programming problem at each
iteration. We compare this algorithm with learning methods for Mean Field Games
after providing a reformulation of our control problem as a game problem. These
theoretical contributions are illustrated with numerical examples applied to a
demand-side management problem for power systems aimed at controlling the
average power consumption profile of a population of flexible devices
contributing to the power system balance.
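The paper's Bregman divergence is non-standard; the familiar special case is mirror descent on the probability simplex with the negative-entropy mirror map (multiplicative weights), which also attains an O(1/√K) averaged rate for convex Lipschitz objectives. A sketch of that textbook variant (not the paper's algorithm):

```python
import math

def mirror_descent_simplex(grad, x0, steps, step_size):
    """Mirror descent with the negative-entropy mirror map: the update
    x_{k+1} is proportional to x_k * exp(-step_size * grad(x_k)), so the
    iterates stay on the probability simplex.  Returns the averaged
    iterate, which carries the O(1/sqrt(K)) guarantee."""
    x = list(x0)
    avg = [0.0] * len(x)
    for _ in range(steps):
        g = grad(x)
        w = [xi * math.exp(-step_size * gi) for xi, gi in zip(x, g)]
        z = sum(w)
        x = [wi / z for wi in w]                 # exponentiated-gradient step
        avg = [a + xi for a, xi in zip(avg, x)]
    return [a / steps for a in avg]
```

For a linear objective with costs [1, 0, 2] the averaged iterate concentrates on the cheapest coordinate, mirroring how the paper's iteration drives the state-action distribution toward the optimum.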