A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games
Optimal policies in standard MDPs can be obtained using either value
iteration or policy iteration. However, in the case of zero-sum Markov games,
there is no efficient policy iteration algorithm; e.g., it has been shown that
one has to solve Omega(1/(1-alpha)) MDPs, where alpha is the discount factor,
to implement the only known convergent version of policy iteration. Another
algorithm, called naive policy iteration, is easy to implement but is only
provably convergent under very restrictive assumptions. Prior attempts to fix
the naive policy iteration algorithm have several limitations. Here, we show that a
simple variant of naive policy iteration for games converges exponentially
fast. The only addition we propose to naive policy iteration is the use of
lookahead policies, which are anyway used in practical algorithms. We further
show that lookahead can be implemented efficiently in the function
approximation setting of linear Markov games, which are the counterpart of the
much-studied linear MDPs. We illustrate the application of our algorithm by
providing bounds for policy-based RL (reinforcement learning) algorithms. We
extend the results to the function approximation setting. Comment: 41 pages
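The lookahead idea above can be sketched in code. The toy below is an illustration, not the paper's algorithm: the matrix game at each state is solved approximately by fictitious play (a stand-in solver chosen here for self-containedness), and the function names, payoff/transition layout, and parameters are all hypothetical.

```python
import numpy as np

def matrix_game_value(M, iters=2000):
    """Approximate the value of the zero-sum matrix game M (row player
    maximizes) via fictitious play; the value of the empirical mixtures
    converges to the game value (Robinson, 1951)."""
    M = np.asarray(M, dtype=float)
    m, n = M.shape
    row_counts, col_counts = np.ones(m), np.ones(n)
    for _ in range(iters):
        # each player best-responds to the opponent's empirical mixture
        row_counts[np.argmax(M @ (col_counts / col_counts.sum()))] += 1
        col_counts[np.argmin((row_counts / row_counts.sum()) @ M)] += 1
    x = row_counts / row_counts.sum()
    y = col_counts / col_counts.sum()
    return float(x @ M @ y)

def lookahead_improve(V, R, P, gamma, h):
    """h-step lookahead: apply the minimax Bellman (Shapley) operator h
    times to the value vector V.  R[s] is the stage-payoff matrix at
    state s; P[s] has shape (max_actions, min_actions, num_states) and
    gives the next-state distribution for each joint action."""
    for _ in range(h):
        V = np.array([matrix_game_value(R[s] + gamma * P[s] @ V)
                      for s in range(len(V))])
    return V
```

Because the Shapley operator is a gamma-contraction, applying it h times contracts by gamma^h, which is the mechanism behind the exponential convergence claimed for lookahead variants.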
Multi-agent Reinforcement Learning in Sequential Social Dilemmas
Matrix games like Prisoner's Dilemma have guided research on social dilemmas for decades. However, they necessarily treat the choice to cooperate or defect as an atomic action. In real-world social dilemmas these choices are temporally extended. Cooperativeness is a property that applies to policies, not elementary actions. We introduce sequential social dilemmas that share the mixed incentive structure of matrix game social dilemmas but also require agents to learn policies that implement their strategic intentions. We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network, on two Markov games we introduce here: 1. a fruit Gathering game and 2. a Wolfpack hunting game. We characterize how learned behavior in each domain changes as a function of environmental factors including resource abundance. Our experiments show how conflict can emerge from competition over shared resources and shed light on how the sequential nature of real-world social dilemmas affects cooperation.
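The independent-learning dynamics described above can be illustrated in a minimal form. The sketch below is a stateless tabular stand-in for the paper's deep Q-network agents, played on the one-shot Prisoner's Dilemma rather than the Gathering or Wolfpack environments; all names and parameters are hypothetical.

```python
import random

# Prisoner's Dilemma payoffs for (row, column) with 0 = cooperate, 1 = defect.
PAYOFF = {(0, 0): (3, 3), (0, 1): (0, 5), (1, 0): (5, 0), (1, 1): (1, 1)}

def independent_q_learning(rounds=5000, alpha=0.1, eps=0.1, seed=0):
    """Two self-interested stateless Q-learners repeatedly play the matrix
    game; each updates only its own action values from its own reward."""
    rng = random.Random(seed)
    q1, q2 = [0.0, 0.0], [0.0, 0.0]
    for _ in range(rounds):
        # epsilon-greedy action selection for each independent agent
        a1 = rng.randrange(2) if rng.random() < eps else q1.index(max(q1))
        a2 = rng.randrange(2) if rng.random() < eps else q2.index(max(q2))
        r1, r2 = PAYOFF[(a1, a2)]
        q1[a1] += alpha * (r1 - q1[a1])  # bandit-style one-step update
        q2[a2] += alpha * (r2 - q2[a2])
    return q1, q2
```

Since defection strictly dominates, both learners drift to mutual defection; the sequential dilemmas in the paper are interesting precisely because cooperation there is a property of extended policies rather than of one atomic action.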
On Reinforcement Learning for Turn-based Zero-sum Markov Games
We consider the problem of finding Nash equilibrium for two-player turn-based
zero-sum games. Inspired by the AlphaGo Zero (AGZ) algorithm, we develop a
Reinforcement Learning based approach. Specifically, we propose the
Explore-Improve-Supervise (EIS) method, which combines "exploration", "policy
improvement", and "supervised learning" to find the value function and policy
associated with Nash equilibrium. We identify sufficient conditions for
convergence and correctness for such an approach. For a concrete instance of
EIS where random policy is used for "exploration", Monte-Carlo Tree Search is
used for "policy improvement" and Nearest Neighbors is used for "supervised
learning", we establish that this method finds an epsilon-approximate
value function of Nash equilibrium in O~(epsilon^{-(d+4)}) steps when the
underlying state-space of the game is continuous and d-dimensional. This is
nearly optimal, as we establish a lower bound of Omega(epsilon^{-(d+2)})
for any policy.
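The "supervised learning" component of EIS can be sketched in isolation. The snippet below shows only that step, as 1-nearest-neighbor regression over a continuous state space; the value targets are assumed to be supplied by the improvement step (e.g. tree search), and the function names are hypothetical.

```python
import math

def fit_nn_value(samples):
    """'Supervise' step: store (state, value-target) pairs; the targets
    would come from the policy-improvement step (e.g. tree search)."""
    return list(samples)

def nn_value(model, query):
    """Predict the value at a continuous state as the target of the
    nearest stored state (Euclidean distance, 1-nearest-neighbor)."""
    _, v = min(model, key=lambda sv: math.dist(sv[0], query))
    return v
```

For a Lipschitz value function, covering a d-dimensional state space to resolution epsilon requires on the order of epsilon^{-d} stored states, which is how the dimension enters sample-complexity rates of this nearest-neighbor flavor.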
Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games
This paper makes progress towards learning Nash equilibria in two-player
zero-sum Markov games from offline data. Specifically, consider a
gamma-discounted infinite-horizon Markov game with S states, where the
max-player has A actions and the min-player has B actions. We propose a
pessimistic model-based algorithm with Bernstein-style lower confidence bounds
-- called VI-LCB-Game -- that provably finds an epsilon-approximate Nash
equilibrium with a sample complexity no larger than
C_clipped * S(A+B) / ((1-gamma)^3 * epsilon^2) (up
to some log factor). Here, C_clipped is some unilateral
clipped concentrability coefficient that reflects the coverage and distribution
shift of the available data (vis-à-vis the target data), and the target
accuracy epsilon can be any value within (0, 1/(1-gamma)]. Our sample
complexity bound strengthens prior art by a factor of min{A, B}, achieving
minimax optimality for the entire epsilon-range. An appealing feature of our
result lies in its algorithmic simplicity, which shows that variance reduction
and sample splitting are unnecessary for achieving sample
optimality. Comment: accepted to Operations Research
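The pessimism mechanism can be sketched as follows. This is a single-agent simplification for illustration, not VI-LCB-Game itself: the per-state matrix game of the two-player algorithm is replaced by a plain max over actions, and the exact constants in the Bernstein-style bonus are illustrative choices.

```python
import math
import numpy as np

def pessimistic_vi(counts, rewards, gamma=0.9, delta=0.1, iters=200):
    """Offline model-based value iteration with a Bernstein-style lower
    confidence bound (single-agent simplification).  counts[s, a, s'] are
    empirical transition counts from the offline dataset; rewards[s, a]
    lie in [0, 1]."""
    S, A, _ = counts.shape
    n = counts.sum(axis=2)                        # visits of each (s, a)
    p_hat = counts / np.maximum(n[..., None], 1)  # empirical transitions
    vmax = 1.0 / (1.0 - gamma)                    # value range
    log_term = math.log(S * A / delta)
    V = np.zeros(S)
    for _ in range(iters):
        ev = p_hat @ V                                    # expected next value, (S, A)
        var = np.maximum(p_hat @ V ** 2 - ev ** 2, 0.0)   # empirical variance of V
        # Bernstein-style penalty: variance term plus range term
        bonus = (np.sqrt(2.0 * var * log_term / np.maximum(n, 1))
                 + vmax * log_term / (3.0 * np.maximum(n, 1)))
        bonus = np.where(n == 0, vmax, bonus)     # unvisited pairs: full pessimism
        Q = np.clip(rewards + gamma * ev - bonus, 0.0, vmax)
        V = Q.max(axis=1)
    return V, Q
```

Subtracting the bonus keeps value estimates below their true counterparts with high probability, so actions that the offline data does not cover are never over-valued; this is the sense in which pessimism handles distribution shift.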
A control theoretic approach for security of cyber-physical systems
In this dissertation, several novel defense methodologies for cyber-physical systems are proposed. First, a special type of cyber-physical system, the RFID system, is considered, for which a lightweight mutual authentication and ownership management protocol is proposed to protect data confidentiality and integrity. Then, since protecting data confidentiality and integrity alone is insufficient to guarantee security in cyber-physical systems, we turn to a general framework for developing security schemes for cyber-physical systems wherein the cyber system states affect the physical system and vice versa. We then apply this general framework by selecting the traffic flow as the cyber system state, and propose a novel attack detection scheme capable of capturing abnormalities in the traffic flow on the communication links caused by a class of attacks. In addition, an attack detection scheme capable of detecting both sensor and actuator attacks is proposed for the physical system in the presence of network-induced delays and packet losses. Next, an attack detection scheme is proposed for the case where the network parameters are unknown, using an optimal Q-learning approach. Finally, this attack detection and accommodation scheme is further extended to the case where the network is modeled as a nonlinear system with unknown system dynamics. --Abstract, page iv
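The attack-detection theme above can be illustrated with the standard residual-based pattern. The sketch below is a generic observer-plus-threshold detector for a scalar linear plant, not the dissertation's specific scheme; the system matrices, observer gain, threshold, and attack model are all hypothetical.

```python
def residual_attack_detector(a=0.8, c=1.0, L=0.5, threshold=0.5, T=60):
    """Observer-based detector for a scalar linear plant x_{k+1} = a*x_k
    with output y_k = c*x_k: a Luenberger observer tracks the plant, and
    an alarm fires when the output residual exceeds a fixed threshold.
    A constant sensor-bias attack is injected starting at k = T//2."""
    x, xhat = 1.0, 1.0                      # plant and observer agree at k = 0
    alarms = []
    for k in range(T):
        bias = 2.0 if k >= T // 2 else 0.0  # injected sensor attack
        y = c * x + bias                    # (possibly corrupted) measurement
        r = y - c * xhat                    # output residual
        alarms.append(abs(r) > threshold)
        xhat = a * xhat + L * r             # observer update
        x = a * x                           # plant update
    return alarms
```

Under nominal operation the residual stays near zero, so the alarm is silent; the injected bias drives the residual above the threshold and keeps it there, because the observer can only partially absorb a persistent sensor offset.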