    Learning with Opponent-Learning Awareness

    Multi-agent settings are quickly gathering importance in machine learning. This includes a plethora of recent work on deep multi-agent reinforcement learning, but also extends to hierarchical RL, generative adversarial networks and decentralised optimisation. In all these settings the presence of multiple learning agents renders the training problem non-stationary and often leads to unstable training or undesired final results. We present Learning with Opponent-Learning Awareness (LOLA), a method in which each agent shapes the anticipated learning of the other agents in the environment. The LOLA learning rule includes a term that accounts for the impact of one agent's policy on the anticipated parameter update of the other agents. Results show that the encounter of two LOLA agents leads to the emergence of tit-for-tat and therefore cooperation in the iterated prisoner's dilemma (IPD), while independent learning does not. In this domain, LOLA also receives higher payouts compared to a naive learner, and is robust against exploitation by higher-order gradient-based methods. Applied to repeated matching pennies, LOLA agents converge to the Nash equilibrium. In a round-robin tournament we show that LOLA agents successfully shape the learning of a range of multi-agent learning algorithms from the literature, resulting in the highest average returns on the IPD. We also show that the LOLA update rule can be efficiently calculated using an extension of the policy gradient estimator, making the method suitable for model-free RL. The method thus scales to large parameter and input spaces and nonlinear function approximators. We apply LOLA to a grid world task with an embedded social dilemma using recurrent policies and opponent modelling. By explicitly considering the learning of the other agent, LOLA agents learn to cooperate out of self-interest. The code is at github.com/alshedivat/lola.
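
    To make the tit-for-tat behaviour mentioned in this abstract concrete, here is a minimal sketch of the iterated prisoner's dilemma with fixed, hand-written strategies. It is not the LOLA learning rule and is not taken from the linked repository. The payoff values and the names PAYOFFS, tit_for_tat, always_defect and play are assumptions chosen for illustration.

```python
# Per-round payoffs (player A, player B) for the prisoner's dilemma.
# ASSUMPTION: these numbers are illustrative; the abstract does not state them.
PAYOFFS = {
    ("C", "C"): (-1, -1),  # both cooperate
    ("C", "D"): (-3, 0),   # A cooperates, B defects
    ("D", "C"): (0, -3),   # A defects, B cooperates
    ("D", "D"): (-2, -2),  # both defect
}


def tit_for_tat(opponent_history):
    """Cooperate in the first round, then copy the opponent's previous move."""
    return "C" if not opponent_history else opponent_history[-1]


def always_defect(opponent_history):
    """Defect in every round, regardless of what the opponent did."""
    return "D"


def play(strategy_a, strategy_b, rounds=10):
    """Play a fixed number of rounds and return the two total payoffs."""
    history_a, history_b = [], []
    total_a = total_b = 0
    for _ in range(rounds):
        # Each strategy only sees the moves the other player has made so far.
        move_a = strategy_a(history_b)
        move_b = strategy_b(history_a)
        reward_a, reward_b = PAYOFFS[(move_a, move_b)]
        total_a += reward_a
        total_b += reward_b
        history_a.append(move_a)
        history_b.append(move_b)
    return total_a, total_b
```

    Under these assumed payoffs, play(tit_for_tat, tit_for_tat) returns (-10, -10), whereas play(always_defect, always_defect) returns (-20, -20). Two tit-for-tat players therefore sustain cooperation and both do better than two defectors. Against always_defect, tit-for-tat loses only the first round and then retaliates, giving (-21, -18).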

    BL-WoLF: A Framework For Loss-Bounded Learnability In Zero-Sum Games

    We present BL-WoLF, a framework for learnability in repeated zero-sum games where the cost of learning is measured by the losses the learning agent accrues (rather than the number of rounds). The game is adversarially chosen from some family that the learner knows. The opponent knows the game and the learner's learning strategy. The learner tries to either not accrue losses, or to quickly learn about the game so as to avoid future losses (this is consistent with the Win or Learn Fast (WoLF) principle; BL stands for "bounded loss"). Our framework allows for both probabilistic and approximate learning. The resultant notion of BL-WoLF-learnability can be applied to any class of games, and allows us to measure the inherent disadvantage to a player that does not know which game in the class it is in. We present guaranteed BL-WoLF-learnability results for families of games with deterministic payoffs and families of games with stochastic payoffs. We demonstrate that these families are guaranteed approximately BL-WoLF-learnable with lower cost. We then demonstrate families of games (both stochastic and deterministic) that are not guaranteed BL-WoLF-learnable. We show that those families, nevertheless, are BL-WoLF-learnable. To prove these results, we use a key lemma which we derive.

    Towards Optimal Algorithms For Online Decision Making Under Practical Constraints

    Artificial Intelligence is increasingly being used in real-life applications such as driving with autonomous cars, deliveries with autonomous drones, customer support with chatbots, and personal assistance with smart speakers. An artificially intelligent agent (AI) can be trained to become expert at a task through a system of rewards and punishments, known as Reinforcement Learning (RL). However, since the AI will deal with human beings, it also has to follow some moral rules when accomplishing a task. For example, the AI should be fair to the other agents and not destroy the environment. Moreover, the AI should not leak private information from the user data it processes. Those rules represent significant challenges in designing AI, which we tackle in this thesis through mathematically rigorous solutions. More precisely, we start by considering the basic RL problem modeled as a discrete Markov Decision Process. We propose three simple algorithms (UCRL-V, BUCRL and TSUCRL) using two different paradigms: frequentist (UCRL-V) and Bayesian (BUCRL and TSUCRL). Through a unified theoretical analysis, we show that our three algorithms are near-optimal. Experiments confirm the superiority of our methods compared to existing techniques. Afterwards, we address the issue of fairness in the stateless version of reinforcement learning, also known as the multi-armed bandit. To concentrate our effort on the key challenges, we focus on the two-agent multi-armed bandit. We propose a novel objective that has been shown to be connected to fairness and justice. We derive an algorithm, UCRG, to solve this novel objective and show theoretically that it is near-optimal. Next, we tackle the issue of privacy by using the recently introduced notion of Differential Privacy. We design multi-armed bandit algorithms that preserve differential privacy. Theoretical analyses show that, for the same level of privacy, our newly developed algorithms achieve better performance than existing techniques.

    A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem

    Training multiple agents to coordinate is an important problem with applications in robotics, game theory, economics, and social sciences. However, most existing Multi-Agent Reinforcement Learning (MARL) methods are online and thus impractical for real-world applications in which collecting new interactions is costly or dangerous. While these algorithms should leverage offline data when available, doing so gives rise to the offline coordination problem. Specifically, we identify and formalize the strategy agreement (SA) and the strategy fine-tuning (SFT) challenges, two coordination issues at which current offline MARL algorithms fail. To address this setback, we propose a simple model-based approach that generates synthetic interaction data and enables agents to converge on a strategy while fine-tuning their policies accordingly. Our resulting method, Model-based Offline Multi-Agent Proximal Policy Optimization (MOMA-PPO), outperforms the prevalent learning methods in challenging offline multi-agent MuJoCo tasks, even under severe partial observability and with learned world models.

    Cooperation through communication in decentralized Markov games

    In this paper, we present a communication-integrated reinforcement-learning algorithm for a general-sum Markov game (MG) played by independent, cooperative agents. The algorithm assumes that agents can communicate but do not know the purpose (the semantics) of doing so. We model agents that have different tasks, some of which may be commonly beneficial. The objective of the agents is to determine which tasks are commonly beneficial, and to learn a sequence of actions that achieves the common tasks. In other words, the agents play a multi-stage coordination game, of which they know neither the stage-wise payoff matrix nor the stage transition matrix. Our principal interest is in imposing realistic conditions of learning on the agents. Towards this end, we assume that they operate in a strictly imperfect monitoring setting wherein they do not observe one another's actions or rewards. To our knowledge, a learning algorithm for a Markov game under this stricter condition of learning has not yet been proposed. We describe this Markov game with individual reward functions as a new formalism, the decentralized Markov game or Dec-MG, a formalism borrowed from the Dec-MDP (Markov decision process). For the communicatory aspect of the learning conditions, we propose a series of communication frameworks graduated in terms of how much they facilitate information exchange amongst the agents. We present results of testing our algorithm in a toy problem MG called a total guessing game.

    Leveraging repeated games for solving complex multiagent decision problems

    Making good decisions in multiagent environments is a difficult task insofar as the presence of several decision makers implies conflicts of interest, a lack of coordination, and a multiplicity of possible decisions. If, moreover, the decision makers interact successively over time, they must decide not only what to do now, but also how their current decisions may affect the behavior of others in the future. Game theory is a mathematical tool that aims to model this type of interaction via strategic games with several players. Consequently, multiagent decision problems are often studied using game theory. In this context, and if we restrict ourselves to dynamic games, complex multiagent decision problems can be approached algorithmically. The contribution of this thesis is threefold. First, it contributes an algorithmic framework for distributed planning in non-cooperative dynamic games. The multiplicity of possible plans gives rise to serious complications for any planning approach. We propose a novel approach based on the notion of learning in repeated games. Such an approach makes it possible to overcome these complications by means of communication between the players. We then propose a learning algorithm for repeated games in self-play. Our algorithm allows players to converge, in initially unknown repeated games, to a joint behavior that is optimal in a certain well-defined sense, and does so without any communication between the players. Finally, we propose a family of algorithms for approximately solving dynamic games and extracting the players' strategies. In this context, we first propose a method for computing a nonempty subset of the approximate subgame-perfect equilibria in repeated games. We then show how this method can be extended to approximate all subgame-perfect equilibria in repeated games, and also to solve more complex dynamic games.