Model-Free Algorithm with Improved Sample Efficiency for Zero-Sum Markov Games
The problem of two-player zero-sum Markov games has recently attracted
increasing interest in theoretical studies of multi-agent reinforcement
learning (RL). In particular, for finite-horizon episodic Markov decision
processes (MDPs), it has been shown that model-based algorithms can find an
$\epsilon$-optimal Nash Equilibrium (NE) with a sample complexity of
$\widetilde{O}(H^3SAB/\epsilon^2)$, which is optimal in the dependence on the horizon $H$
and the number of states $S$ (where $A$ and $B$ denote the numbers of actions of
the two players, respectively). However, none of the existing model-free
algorithms can achieve such an optimality. In this work, we propose a
model-free stage-based Q-learning algorithm and show that it achieves the same
sample complexity as the best model-based algorithm, and hence demonstrate for
the first time that model-free algorithms can enjoy the same optimality in the
$H$ dependence as model-based algorithms. The main improvement in the
dependency on $H$ arises by leveraging the popular variance reduction technique
based on the reference-advantage decomposition previously used only for
single-agent RL. However, such a technique relies on a critical monotonicity
property of the value function, which does not hold in Markov games due to the
update of the policy via the coarse correlated equilibrium (CCE) oracle. Thus,
to extend such a technique to Markov games, our algorithm features a key novel
design: it updates the reference value functions as the pair of optimistic and
pessimistic value functions whose value difference is the smallest seen so far,
which achieves the desired improvement in sample efficiency.
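The reference-update rule in that last sentence is concrete enough to sketch. Below is a minimal Python reading of it; the tabular shapes, variable names, and the function itself are illustrative assumptions, not the paper's code:

```python
import numpy as np

# Illustrative sizes for a tabular episodic setting (hypothetical).
H, S = 5, 10

# Optimistic (upper) and pessimistic (lower) value estimates, the current
# reference pair, and the smallest upper-lower gap observed so far.
V_up = np.full((H + 1, S), float(H))
V_lo = np.zeros((H + 1, S))
V_ref_up = np.full((H + 1, S), float(H))
V_ref_lo = np.zeros((H + 1, S))
best_gap = np.full((H + 1, S), np.inf)

def maybe_update_reference(h, s):
    """Reset the reference pair at (h, s) whenever the gap between the
    optimistic and pessimistic values is the smallest seen in the history,
    standing in for the monotonicity that CCE-based policy updates break."""
    gap = V_up[h, s] - V_lo[h, s]
    if gap < best_gap[h, s]:
        best_gap[h, s] = gap
        V_ref_up[h, s] = V_up[h, s]
        V_ref_lo[h, s] = V_lo[h, s]
```

Calling maybe_update_reference(h, s) after each Q-update keeps the reference pair at the tightest optimistic-pessimistic bracket seen so far, which is what replaces the single-agent monotone reference update.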
Discounting in Games across Time Scales
We introduce two-level discounted games played by two players on a
perfect-information stochastic game graph. The upper level game is a discounted
game and the lower level game is an undiscounted reachability game. Two-level
games model hierarchical and sequential decision making under uncertainty
across different time scales. We show the existence of pure memoryless optimal
strategies for both players and an ordered field property for such games. We
show that if there is only one player (Markov decision processes), then the
values can be computed in polynomial time. It follows that deciding whether the
value of a player in a two-level discounted game equals a given rational
constant is in NP ∩ coNP. We also give an alternate strategy improvement
algorithm to compute the values.
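The abstract does not spell out the polynomial-time procedure, so purely for orientation, here is standard value iteration for an ordinary one-level discounted MDP in Python; the paper's two-level computation (a discounted game on top of an undiscounted reachability game) is more involved, and everything below is an assumed baseline rather than the paper's algorithm:

```python
import numpy as np

def value_iteration(P, r, gamma=0.9, tol=1e-8):
    """Value iteration for a one-level discounted MDP.

    P: transition probabilities, shape (S, A, S); r: rewards, shape (S, A).
    """
    S, A, _ = P.shape
    V = np.zeros(S)
    while True:
        Q = r + gamma * (P @ V)             # expected one-step returns, (S, A)
        V_new = Q.max(axis=1)               # greedy backup
        if np.max(np.abs(V_new - V)) < tol:
            return V_new, Q.argmax(axis=1)  # values and a greedy policy
        V = V_new
```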
A Learning Theoretic Approach to Energy Harvesting Communication System Optimization
A point-to-point wireless communication system in which the transmitter is
equipped with an energy harvesting device and a rechargeable battery is
studied. Both the energy and the data arrivals at the transmitter are modeled
as Markov processes. Delay-limited communication is considered assuming that
the underlying channel is block fading with memory, and the instantaneous
channel state information is available at both the transmitter and the
receiver. The expected total transmitted data during the transmitter's
activation time is maximized under three different sets of assumptions
regarding the information available at the transmitter about the underlying
stochastic processes. A learning theoretic approach is introduced, which does
not assume any a priori information on the Markov processes governing the
communication system. In addition, online and offline optimization problems are
studied for the same setting. Full statistical knowledge and causal information
on the realizations of the underlying stochastic processes are assumed in the
online optimization problem, while the offline optimization problem assumes
non-causal knowledge of the realizations in advance. By comparing the optimal
solutions in all three frameworks, the performance loss due to the transmitter's
lack of information about the underlying Markov processes is quantified.
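The abstract only says that the learning-theoretic approach assumes no a priori knowledge of the Markov processes. A standard tabular Q-learning update is one natural fit for such a setting; the sketch below, including the (battery, buffer, channel) state encoding and the discount factor standing in for the finite activation time, is our assumption rather than the paper's stated method:

```python
import random
from collections import defaultdict

Q = defaultdict(float)  # maps (state, action) -> estimated value

def q_update(state, action, reward, next_state, actions,
             alpha=0.1, gamma=0.95):
    """One tabular Q-learning step. Here 'reward' would be the data
    transmitted in a slot and 'state' a (battery, buffer, channel) tuple
    (an illustrative encoding)."""
    best_next = max(Q[(next_state, a)] for a in actions)
    Q[(state, action)] += alpha * (reward + gamma * best_next
                                   - Q[(state, action)])

def epsilon_greedy(state, actions, eps=0.1):
    """Explore with probability eps, otherwise act greedily on Q."""
    if random.random() < eps:
        return random.choice(actions)
    return max(actions, key=lambda a: Q[(state, a)])
```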