
    Model-Free Algorithm with Improved Sample Efficiency for Zero-Sum Markov Games

    The problem of two-player zero-sum Markov games has recently attracted increasing interest in theoretical studies of multi-agent reinforcement learning (RL). In particular, for finite-horizon episodic Markov decision processes (MDPs), it has been shown that model-based algorithms can find an $\epsilon$-optimal Nash equilibrium (NE) with a sample complexity of $O(H^3SAB/\epsilon^2)$, which is optimal in its dependence on the horizon $H$ and the number of states $S$ (where $A$ and $B$ denote the numbers of actions of the two players, respectively). However, none of the existing model-free algorithms achieves such optimality. In this work, we propose a model-free stage-based Q-learning algorithm and show that it achieves the same sample complexity as the best model-based algorithm, thereby demonstrating for the first time that model-free algorithms can match the optimal $H$ dependence of model-based algorithms. The improvement in the dependence on $H$ comes from leveraging the popular variance-reduction technique based on the reference-advantage decomposition, previously used only in single-agent RL. However, this technique relies on a critical monotonicity property of the value function, which does not hold in Markov games because the policy is updated via a coarse correlated equilibrium (CCE) oracle. To extend the technique to Markov games, our algorithm therefore features a key novel design: the reference value functions are updated as the pair of optimistic and pessimistic value functions whose value difference is the smallest in the history, which yields the desired improvement in sample efficiency.
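
    A minimal sketch of the two ideas highlighted in the abstract, assuming a tabular setting with illustrative sizes (S, H) and buffers: (i) keep, per step and state, the optimistic/pessimistic value pair with the smallest gap seen so far as the reference values, and (ii) split the learning target into a reference part (averaged over all visits) and an advantage part (averaged over recent visits). This is not the paper's full algorithm; all names here are placeholders.

    import numpy as np

    S, H = 5, 10
    V_up = np.full((H + 1, S), float(H)); V_up[H] = 0.0   # optimistic value estimates
    V_lo = np.zeros((H + 1, S))                           # pessimistic value estimates
    V_ref_up = V_up.copy()                                # reference pair with the
    V_ref_lo = V_lo.copy()                                # smallest gap seen so far
    best_gap = np.full((H + 1, S), np.inf)

    def update_reference(h, s):
        # Adopt the current (optimistic, pessimistic) pair as the reference
        # whenever its gap improves on the smallest gap observed so far.
        gap = V_up[h, s] - V_lo[h, s]
        if gap < best_gap[h, s]:
            best_gap[h, s] = gap
            V_ref_up[h, s] = V_up[h, s]
            V_ref_lo[h, s] = V_lo[h, s]

    def decomposed_target(rewards, next_states, h):
        # Reference part: averaged over all visits (low variance, slowly varying
        # V_ref). Advantage part: averaged over recent visits only (here, the
        # last half), using the fresh optimistic values.
        ref = np.mean([r + V_ref_up[h + 1, s2] for r, s2 in zip(rewards, next_states)])
        recent = list(zip(rewards, next_states))[len(rewards) // 2:]
        adv = np.mean([V_up[h + 1, s2] - V_ref_up[h + 1, s2] for _, s2 in recent])
        return ref + adv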

    Discounting in Games across Time Scales

    We introduce two-level discounted games played by two players on a perfect-information stochastic game graph. The upper-level game is a discounted game and the lower-level game is an undiscounted reachability game. Two-level games model hierarchical and sequential decision making under uncertainty across different time scales. We show the existence of pure memoryless optimal strategies for both players and an ordered field property for such games. We show that if there is only one player (Markov decision processes), then the values can be computed in polynomial time. It follows that deciding whether the value of a player equals a given rational constant in a two-level discounted game is in NP ∩ coNP. We also give an alternative strategy-improvement algorithm to compute the value.
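
    A minimal sketch of the one-player (MDP) case, assuming an ordinary discounted MDP solved by value iteration; the transition tensor P, rewards R, and discount gamma are illustrative placeholders, and the paper's two-level structure (a discounted upper level over reachability sub-games) is not modeled here.

    import numpy as np

    def value_iteration(P, R, gamma=0.9, tol=1e-8):
        # P[s, a, s'] = transition probability, R[s, a] = reward; returns state values.
        V = np.zeros(P.shape[0])
        while True:
            Q = R + gamma * P @ V        # Q[s, a] = R[s, a] + gamma * sum_s' P[s, a, s'] V[s']
            V_new = Q.max(axis=1)        # the single (maximizing) player acts greedily
            if np.max(np.abs(V_new - V)) < tol:
                return V_new
            V = V_new

    # Example usage with a random 4-state, 2-action MDP
    rng = np.random.default_rng(0)
    P = rng.dirichlet(np.ones(4), size=(4, 2))   # each (s, a) row sums to 1 over next states
    R = rng.random((4, 2))
    print(value_iteration(P, R))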

    A Learning Theoretic Approach to Energy Harvesting Communication System Optimization

    A point-to-point wireless communication system in which the transmitter is equipped with an energy harvesting device and a rechargeable battery is studied. Both the energy and the data arrivals at the transmitter are modeled as Markov processes. Delay-limited communication is considered, assuming that the underlying channel is block fading with memory and that the instantaneous channel state information is available at both the transmitter and the receiver. The expected total transmitted data during the transmitter's activation time is maximized under three different sets of assumptions regarding the information available at the transmitter about the underlying stochastic processes. A learning-theoretic approach is introduced, which does not assume any a priori information on the Markov processes governing the communication system. In addition, online and offline optimization problems are studied for the same setting. The online optimization problem assumes full statistical knowledge and causal information on the realizations of the underlying stochastic processes, while the offline optimization problem assumes non-causal knowledge of the realizations in advance. By comparing the optimal solutions in all three frameworks, the performance loss due to the transmitter's lack of information about the underlying Markov processes is quantified.
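
    A minimal sketch of the learning-theoretic idea, assuming tabular Q-learning over a composite state (battery level, channel state, data arrival) with no a priori model of the underlying Markov processes. The state and action sizes, the reward, and the environment step function are illustrative placeholders, not the paper's exact formulation.

    import numpy as np

    n_battery, n_channel, n_data = 5, 2, 3
    n_states = n_battery * n_channel * n_data
    n_actions = 4                          # e.g. how many energy units to spend on transmission

    Q = np.zeros((n_states, n_actions))
    alpha, gamma, eps = 0.1, 0.95, 0.1
    rng = np.random.default_rng(0)

    def env_step(s, a):
        # Placeholder environment: random next state and reward for illustration;
        # in the actual setting these would come from the real energy/data/channel processes.
        s_next = rng.integers(n_states)
        reward = float(a) * rng.random()   # stand-in for the amount of data transmitted
        return s_next, reward

    s = 0
    for t in range(10_000):
        a = rng.integers(n_actions) if rng.random() < eps else int(Q[s].argmax())
        s_next, r = env_step(s, a)
        Q[s, a] += alpha * (r + gamma * Q[s_next].max() - Q[s, a])   # model-free Q-learning update
        s = s_next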