
    A New Policy Iteration Algorithm For Reinforcement Learning in Zero-Sum Markov Games

    Optimal policies in standard MDPs can be obtained using either value iteration or policy iteration. However, in the case of zero-sum Markov games, there is no efficient policy iteration algorithm; e.g., it has been shown that one has to solve $\Omega(1/(1-\alpha))$ MDPs, where $\alpha$ is the discount factor, to implement the only known convergent version of policy iteration. Another algorithm, called naive policy iteration, is easy to implement but is only provably convergent under very restrictive assumptions. Prior attempts to fix the naive policy iteration algorithm have several limitations. Here, we show that a simple variant of naive policy iteration for games converges exponentially fast. The only addition we propose to naive policy iteration is the use of lookahead policies, which are already used in practical algorithms. We further show that lookahead can be implemented efficiently in the function approximation setting of linear Markov games, which are the counterpart of the much-studied linear MDPs. We illustrate the application of our algorithm by providing bounds for policy-based RL (reinforcement learning) algorithms, and we extend the results to the function approximation setting.
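
    The lookahead idea can be made concrete with a toy implementation. The sketch below is our own illustration of naive policy iteration with a multi-step lookahead on a small tabular zero-sum Markov game, not the paper's exact algorithm: each iteration evaluates the current policy pair, pushes the value vector through a few extra applications of the Shapley operator, and then improves both policies by solving the per-state matrix game with linear programming. All names (solve_matrix_game, shapley_operator, lookahead) and the game encoding are our assumptions.

```python
# Minimal sketch, assuming a tabular zero-sum Markov game given by
# R[s, a, b] (reward to the max-player) and P[s, a, b, s'] (transitions).
# Illustrative only; not the paper's algorithm or tie-breaking rules.
import numpy as np
from scipy.optimize import linprog

def solve_matrix_game(M):
    """Value and optimal row (max-player) strategy of the matrix game M."""
    m, n = M.shape
    # Variables z = [x_1..x_m, v]; maximize v s.t. M^T x >= v, x in the simplex.
    c = np.zeros(m + 1); c[-1] = -1.0                       # linprog minimizes, so use -v
    A_ub = np.hstack([-M.T, np.ones((n, 1))])               # v - (M^T x)_j <= 0
    b_ub = np.zeros(n)
    A_eq = np.hstack([np.ones((1, m)), np.zeros((1, 1))])   # sum(x) = 1
    b_eq = np.array([1.0])
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, A_eq=A_eq, b_eq=b_eq,
                  bounds=[(0, None)] * m + [(None, None)])
    return res.x[-1], res.x[:m]

def shapley_operator(V, R, P, gamma):
    """One application of the game Bellman (Shapley) operator to the value vector V."""
    S = R.shape[0]
    return np.array([solve_matrix_game(R[s] + gamma * P[s] @ V)[0] for s in range(S)])

def naive_pi_with_lookahead(R, P, gamma, lookahead=3, iters=30):
    """Evaluate the current policy pair, apply extra Shapley steps, then improve."""
    S, A, B = R.shape
    pi, mu = np.full((S, A), 1 / A), np.full((S, B), 1 / B)
    for _ in range(iters):
        # Policy evaluation for the joint policy (pi, mu).
        P_joint = np.einsum('sa,sb,sabt->st', pi, mu, P)
        r_joint = np.einsum('sa,sb,sab->s', pi, mu, R)
        V = np.linalg.solve(np.eye(S) - gamma * P_joint, r_joint)
        # Lookahead: push V through the Shapley operator a few extra times.
        for _ in range(lookahead - 1):
            V = shapley_operator(V, R, P, gamma)
        # Improvement: per-state matrix-game solves at the lookahead value.
        for s in range(S):
            Q = R[s] + gamma * P[s] @ V
            _, pi[s] = solve_matrix_game(Q)       # max-player strategy
            _, mu[s] = solve_matrix_game(-Q.T)    # min-player strategy (dual game)
    return pi, mu, V
```

    With lookahead=1 the loop reduces to the naive policy iteration described in the abstract; larger values give the multi-step greedy improvement step that the abstract credits with restoring convergence.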

    Multi-agent Reinforcement Learning in Sequential Social Dilemmas

    Matrix games like Prisoner's Dilemma have guided research on social dilemmas for decades. However, they necessarily treat the choice to cooperate or defect as an atomic action. In real-world social dilemmas these choices are temporally extended. Cooperativeness is a property that applies to policies, not elementary actions. We introduce sequential social dilemmas that share the mixed incentive structure of matrix game social dilemmas but also require agents to learn policies that implement their strategic intentions. We analyze the dynamics of policies learned by multiple self-interested independent learning agents, each using its own deep Q-network, on two Markov games we introduce here: 1. a fruit Gathering game and 2. a Wolfpack hunting game. We characterize how learned behavior in each domain changes as a function of environmental factors, including resource abundance. Our experiments show how conflict can emerge from competition over shared resources and shed light on how the sequential nature of real-world social dilemmas affects cooperation.
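
    As a toy illustration of the "independent learning agents" setup (the paper itself uses deep Q-networks on the Gathering and Wolfpack gridworlds), the sketch below trains two tabular Q-learners in an iterated Prisoner's Dilemma, each treating the other agent as part of its environment. The payoff table, state encoding, and hyperparameters are our own choices.

```python
# Two independent tabular Q-learners in an iterated Prisoner's Dilemma.
# Illustrative toy, not the paper's DQN setup or environments.
import numpy as np

PAYOFF = {  # (my action, their action) -> my reward; 0 = cooperate, 1 = defect
    (0, 0): 3, (0, 1): 0, (1, 0): 5, (1, 1): 1,
}

rng = np.random.default_rng(0)
alpha, gamma, eps = 0.1, 0.9, 0.1
# State = both players' previous actions (4 states), 2 actions each.
Q = [np.zeros((4, 2)), np.zeros((4, 2))]

def act(q_row):
    """Epsilon-greedy action selection from one row of a Q-table."""
    return int(rng.integers(2)) if rng.random() < eps else int(np.argmax(q_row))

state = 0
for t in range(50_000):
    a0, a1 = act(Q[0][state]), act(Q[1][state])
    r0, r1 = PAYOFF[(a0, a1)], PAYOFF[(a1, a0)]
    next_state = 2 * a0 + a1
    for i, (a, r) in enumerate([(a0, r0), (a1, r1)]):
        # Each agent runs an ordinary single-agent Q-learning update.
        td = r + gamma * Q[i][next_state].max() - Q[i][state, a]
        Q[i][state, a] += alpha * td
    state = next_state

print("Greedy joint action per state:",
      [(int(Q[0][s].argmax()), int(Q[1][s].argmax())) for s in range(4)])
```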

    On Reinforcement Learning for Turn-based Zero-sum Markov Games

    We consider the problem of finding Nash equilibrium for two-player turn-based zero-sum games. Inspired by the AlphaGo Zero (AGZ) algorithm, we develop a Reinforcement Learning-based approach. Specifically, we propose the Explore-Improve-Supervise (EIS) method that combines "exploration", "policy improvement" and "supervised learning" to find the value function and policy associated with Nash equilibrium. We identify sufficient conditions for convergence and correctness for such an approach. For a concrete instance of EIS where a random policy is used for "exploration", Monte-Carlo Tree Search is used for "policy improvement" and Nearest Neighbors is used for "supervised learning", we establish that this method finds an $\varepsilon$-approximate value function of Nash equilibrium in $\widetilde{O}(\varepsilon^{-(d+4)})$ steps when the underlying state-space of the game is continuous and $d$-dimensional. This is nearly optimal, as we establish a lower bound of $\widetilde{\Omega}(\varepsilon^{-(d+2)})$ for any policy.
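
    The Explore-Improve-Supervise loop can be outlined in a few lines. The sketch below is our simplification: "improve" is left as a pluggable callback (in the paper it is Monte-Carlo Tree Search on the turn-based game), and nearest-neighbor regression stands in for the "supervise" step; the helper names, the toy 1-D check, and all hyperparameters are placeholders.

```python
# Outline of an Explore-Improve-Supervise (EIS) style loop -- a hedged sketch,
# not the paper's algorithm. sample_states and improve_value are placeholders.
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

def eis(sample_states, improve_value, n_rounds=10, n_states=500, k=5):
    """sample_states(n) -> (n, d) states; improve_value(s, V) -> improved value at s."""
    V = lambda s: 0.0                                   # initial value-function guess
    for _ in range(n_rounds):
        X = sample_states(n_states)                     # Explore
        y = np.array([improve_value(s, V) for s in X])  # Improve (MCTS in the paper)
        knn = KNeighborsRegressor(n_neighbors=k).fit(X, y)
        V = lambda s, knn=knn: float(knn.predict(np.atleast_2d(s))[0])  # Supervise
    return V

# Toy 1-D check: the "improved" values are sin(s), so the learned V should match it.
rng = np.random.default_rng(0)
V = eis(lambda n: rng.uniform(-3, 3, size=(n, 1)),
        lambda s, V: float(np.sin(s[0])))
print(V(np.array([1.0])))   # roughly sin(1.0) ~ 0.84
```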

    Model-Based Reinforcement Learning for Offline Zero-Sum Markov Games

    This paper makes progress towards learning Nash equilibria in two-player zero-sum Markov games from offline data. Specifically, consider a $\gamma$-discounted infinite-horizon Markov game with $S$ states, where the max-player has $A$ actions and the min-player has $B$ actions. We propose a pessimistic model-based algorithm with Bernstein-style lower confidence bounds -- called VI-LCB-Game -- that provably finds an $\varepsilon$-approximate Nash equilibrium with a sample complexity no larger than $\frac{C_{\mathsf{clipped}}^{\star}S(A+B)}{(1-\gamma)^{3}\varepsilon^{2}}$ (up to some log factor). Here, $C_{\mathsf{clipped}}^{\star}$ is a unilateral clipped concentrability coefficient that reflects the coverage and distribution shift of the available data (vis-à-vis the target data), and the target accuracy $\varepsilon$ can be any value within $\big(0,\frac{1}{1-\gamma}\big]$. Our sample complexity bound strengthens prior art by a factor of $\min\{A,B\}$, achieving minimax optimality for the entire $\varepsilon$-range. An appealing feature of our result lies in its algorithmic simplicity, which reveals that variance reduction and sample splitting are unnecessary for achieving sample optimality.
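
    To give a feel for the "Bernstein-style lower confidence bound" ingredient, the sketch below computes a pessimistic one-step backup for a single state-action pair from an empirical next-state distribution. It is our generic illustration of pessimism in offline value iteration, not the VI-LCB-Game routine itself, and the constants and confidence level are arbitrary.

```python
# Hedged sketch of a Bernstein-style pessimistic backup (ours, not VI-LCB-Game).
import numpy as np

def pessimistic_backup(r, p_hat, V, n, gamma, delta=1e-2):
    """Lower-confidence-bound Bellman backup for one (s, a, b) with n offline samples.

    r     : observed immediate reward
    p_hat : empirical next-state distribution estimated from the n samples
    V     : current value estimates over next states
    """
    v_max = 1.0 / (1.0 - gamma)                 # value range in a gamma-discounted game
    mean = p_hat @ V
    var = p_hat @ (V - mean) ** 2               # empirical variance of V under p_hat
    log_term = np.log(1.0 / delta)
    bonus = np.sqrt(2.0 * var * log_term / max(n, 1)) + 3.0 * v_max * log_term / max(n, 1)
    return max(r + gamma * (mean - bonus), 0.0) # subtract the bonus (pessimism), clip at 0

# Example: a well-covered pair (large n) is barely penalized; a rare one is pushed down.
V = np.array([2.0, 5.0, 1.0])
p_hat = np.array([0.5, 0.3, 0.2])
print(pessimistic_backup(1.0, p_hat, V, n=2000, gamma=0.9))   # close to the plain backup
print(pessimistic_backup(1.0, p_hat, V, n=5, gamma=0.9))      # clipped to 0 by the penalty
```

    In the two-player setting, the pessimistic Q-values at each state would typically be fed to a per-state matrix-game solve to extract an approximate Nash value; that step is omitted here.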

    A control theoretic approach for security of cyber-physical systems

    In this dissertation, several novel defense methodologies for cyber-physical systems are proposed. First, a special type of cyber-physical system, the RFID system, is considered, for which a lightweight mutual authentication and ownership management protocol is proposed to protect data confidentiality and integrity. Then, since protecting data confidentiality and integrity alone is insufficient to guarantee security in cyber-physical systems, we develop a general framework for designing security schemes for cyber-physical systems in which the cyber system states affect the physical system and vice versa. We then apply this framework by selecting traffic flow as the cyber system state and propose a novel attack detection scheme capable of capturing abnormalities in the traffic flow of the communication links caused by a class of attacks. For the physical system, an attack detection scheme capable of detecting both sensor and actuator attacks is proposed in the presence of network-induced delays and packet losses. Next, an attack detection scheme is proposed for the case where the network parameters are unknown, using an optimal Q-learning approach. Finally, this attack detection and accommodation scheme is extended to the case where the network is modeled as a nonlinear system with unknown dynamics.
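
    As a generic illustration of residual-based attack detection on a physical plant (not the dissertation's specific scheme), the sketch below runs a state observer on a small linear system and flags time steps where the output residual jumps after a sensor bias is injected; the system matrices, noise levels, attack, and threshold are all made up.

```python
# Generic residual-based sensor-attack detector -- an illustrative toy, not the
# dissertation's detection scheme. All matrices and thresholds are invented.
import numpy as np

A = np.array([[0.9, 0.1], [0.0, 0.8]])   # stable plant dynamics
C = np.array([[1.0, 0.0]])               # single sensor
L = np.array([[0.5], [0.3]])             # observer gain (A - L C is stable here)
threshold = 0.3

rng = np.random.default_rng(1)
x, x_hat = np.zeros((2, 1)), np.zeros((2, 1))
for k in range(200):
    x = A @ x + rng.normal(0.0, 0.01, (2, 1))        # process noise
    y = C @ x + rng.normal(0.0, 0.01, (1, 1))        # measurement
    if 100 <= k < 150:
        y += 1.0                                     # injected constant sensor bias
    r = y - C @ x_hat                                # innovation residual
    x_hat = A @ x_hat + L @ r                        # observer update
    if abs(r[0, 0]) > threshold:
        # A constant bias is gradually absorbed by the observer, so the alarm
        # fires mainly at the onset of the attack.
        print(f"step {k}: residual {r[0, 0]:+.2f} -> possible attack")
```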