18 research outputs found
Learning with Opponent-Learning Awareness
Multi-agent settings are quickly gathering importance in machine learning.
This includes a plethora of recent work on deep multi-agent reinforcement
learning, but also can be extended to hierarchical RL, generative adversarial
networks and decentralised optimisation. In all these settings the presence of
multiple learning agents renders the training problem non-stationary and often
leads to unstable training or undesired final results. We present Learning with
Opponent-Learning Awareness (LOLA), a method in which each agent shapes the
anticipated learning of the other agents in the environment. The LOLA learning
rule includes a term that accounts for the impact of one agent's policy on the
anticipated parameter update of the other agents. Results show that the
encounter of two LOLA agents leads to the emergence of tit-for-tat and
therefore cooperation in the iterated prisoners' dilemma, while independent
learning does not. In this domain, LOLA also receives higher payouts compared
to a naive learner, and is robust against exploitation by higher order
gradient-based methods. Applied to repeated matching pennies, LOLA agents
converge to the Nash equilibrium. In a round robin tournament we show that LOLA
agents successfully shape the learning of a range of multi-agent learning
algorithms from literature, resulting in the highest average returns on the
IPD. We also show that the LOLA update rule can be efficiently calculated using
an extension of the policy gradient estimator, making the method suitable for
model-free RL. The method thus scales to large parameter and input spaces and
nonlinear function approximators. We apply LOLA to a grid world task with an
embedded social dilemma using recurrent policies and opponent modelling. By
explicitly considering the learning of the other agent, LOLA agents learn to
cooperate out of self-interest. The code is at github.com/alshedivat/lola
BL-WoLF: A Framework For Loss-Bounded Learnability In Zero-Sum Games
We present BL-WoLF, a framework for learnability in repeated zero-sum games
where the cost of learning is measured by the losses the learning agent accrues
(rather than the number of rounds). The game is adversarially chosen from some
family that the learner knows. The opponent knows the game and the learner's
learning strategy. The learner tries to either not accrue losses, or to quickly
learn about the game so as to avoid future losses (this is consistent with the
Win or Learn Fast (WoLF) principle; BL stands for ``bounded loss''). Our
framework allows for both probabilistic and approximate learning. The resultant
notion of {\em BL-WoLF}-learnability can be applied to any class of games, and
allows us to measure the inherent disadvantage to a player that does not know
which game in the class it is in. We present {\em guaranteed
BL-WoLF-learnability} results for families of games with deterministic payoffs
and families of games with stochastic payoffs. We demonstrate that these
families are {\em guaranteed approximately BL-WoLF-learnable} with lower cost.
We then demonstrate families of games (both stochastic and deterministic) that
are not guaranteed BL-WoLF-learnable. We show that those families,
nevertheless, are {\em BL-WoLF-learnable}. To prove these results, we use a key
lemma which we derive
Towards Optimal Algorithms For Online Decision Making Under Practical Constraints
Artificial Intelligence is increasingly being used in real-life applications such as driving with autonomous cars; deliveries with autonomous drones; customer support with chat-bots; personal assistant with smart speakers . . . An Artificial Intelligent agent (AI) can be trained to become expert at a task through a system of rewards and punishment, also well known as Reinforcement Learning (RL). However, since the AI will deal with human beings, it also has to follow some moral rules to accomplish any task. For example, the AI should be fair to the other agents and not destroy the environment. Moreover, the AI should not leak the privacy of users’ data it processes. Those rules represent significant challenges in designing AI that we tackle in this thesis through mathematically rigorous solutions.More precisely, we start by considering the basic RL problem modeled as a discrete Markov Decision Process. We propose three simple algorithms (UCRL-V, BUCRL and TSUCRL) using two different paradigms: Frequentist (UCRL-V) and Bayesian (BUCRL and TSUCRL). Through a unified theoretical analysis, we show that our three algorithms are near-optimal. Experiments performed confirm the superiority of our methods compared to existing techniques. Afterwards, we address the issue of fairness in the stateless version of reinforcement learning also known as multi-armed bandit. To concentrate our effort on the key challenges, we focus on two-agents multi-armed bandit. We propose a novel objective that has been shown to be connected to fairness and justice. We derive an algorithm UCRG to solve this novel objective and show theoretically its near-optimality. Next, we tackle the issue of privacy by using the recently introduced notion of Differential Privacy. We design multi-armed bandit algorithms that preserve differential-privacy. Theoretical analyses show that for the same level of privacy, our newly developed algorithms achieve better performance than existing techniques
A Model-Based Solution to the Offline Multi-Agent Reinforcement Learning Coordination Problem
Training multiple agents to coordinate is an important problem with
applications in robotics, game theory, economics, and social sciences. However,
most existing Multi-Agent Reinforcement Learning (MARL) methods are online and
thus impractical for real-world applications in which collecting new
interactions is costly or dangerous. While these algorithms should leverage
offline data when available, doing so gives rise to the offline coordination
problem. Specifically, we identify and formalize the strategy agreement (SA)
and the strategy fine-tuning (SFT) challenges, two coordination issues at which
current offline MARL algorithms fail. To address this setback, we propose a
simple model-based approach that generates synthetic interaction data and
enables agents to converge on a strategy while fine-tuning their policies
accordingly. Our resulting method, Model-based Offline Multi-Agent Proximal
Policy Optimization (MOMA-PPO), outperforms the prevalent learning methods in
challenging offline multi-agent MuJoCo tasks even under severe partial
observability and with learned world models
Cooperation through communication in decentralized Markov games
In this paper, we present a comunication-integrated reinforcement-learning algorithm for a general-sum Markov game or MG played by independent, cooperative agents. The algorithm assumes that agents can communicate but do not know the purpose (the semantic) of doing so. We model agents that have different tasks, some of which may be commonly beneficial. The objective of the agents is to determine which are the commonly beneficial tasks, and learn a sequence of actions that achieves the common tasks. In other words, the agents play a multi-stage coordination game, of which they know niether the stage-wise payoff matrix nor the stage transition matrix. Our principal interest is in imposing realistic conditions of learning on the agents. Towards this end, we assume that they operate in a strictly imperfect monitoring setting wherein they do not observe one another's actions or rewards. A learning algorithm for a Markov game under this stricter condition of learning has not been proposed yet to our knowledge. We describe this Markov game with individual reward functions as a new formalism, decentralized Markov game or Dec-MG, a formalism borrowed from Dec-MDP (Markov decison process). For the communicatory aspect of the learning conditions, we propose a series of communication frameworks graduated in terms of facilitation of information exchange amongst the agents. We present results of testing our algorithm in a toy problem MG called a total guessing game
Leveraging repeated games for solving complex multiagent decision problems
Prendre de bonnes décisions dans des environnements multiagents est une tâche difficile dans la mesure où la présence de plusieurs décideurs implique des conflits d'intérêts, un manque de coordination, et une multiplicité de décisions possibles. Si de plus, les décideurs interagissent successivement à travers le temps, ils doivent non seulement décider ce qu'il faut faire actuellement, mais aussi comment leurs décisions actuelles peuvent affecter le comportement des autres dans le futur. La théorie des jeux est un outil mathématique qui vise à modéliser ce type d'interactions via des jeux stratégiques à plusieurs joueurs. Des lors, les problèmes de décision multiagent sont souvent étudiés en utilisant la théorie des jeux. Dans ce contexte, et si on se restreint aux jeux dynamiques, les problèmes de décision multiagent complexes peuvent être approchés de façon algorithmique. La contribution de cette thèse est triple. Premièrement, elle contribue à un cadre algorithmique pour la planification distribuée dans les jeux dynamiques non-coopératifs. La multiplicité des plans possibles est à l'origine de graves complications pour toute approche de planification. Nous proposons une nouvelle approche basée sur la notion d'apprentissage dans les jeux répétés. Une telle approche permet de surmonter lesdites complications par le biais de la communication entre les joueurs. Nous proposons ensuite un algorithme d'apprentissage pour les jeux répétés en ``self-play''. Notre algorithme permet aux joueurs de converger, dans les jeux répétés initialement inconnus, vers un comportement conjoint optimal dans un certain sens bien défini, et ce, sans aucune communication entre les joueurs. Finalement, nous proposons une famille d'algorithmes de résolution approximative des jeux dynamiques et d'extraction des stratégies des joueurs. Dans ce contexte, nous proposons tout d'abord une méthode pour calculer un sous-ensemble non vide des équilibres approximatifs parfaits en sous-jeu dans les jeux répétés. Nous montrons ensuite comment nous pouvons étendre cette méthode pour approximer tous les équilibres parfaits en sous-jeu dans les jeux répétés, et aussi résoudre des jeux dynamiques plus complexes.Making good decisions in multiagent environments is a hard problem in the sense that the presence of several decision makers implies conflicts of interests, a lack of coordination, and a multiplicity of possible decisions. If, then, the same decision makers interact continuously through time, they have to decide not only what to do in the present, but also how their present decisions may affect the behavior of the others in the future. Game theory is a mathematical tool that aims to model such interactions as strategic games of multiple players. Therefore, multiagent decision problems are often studied using game theory. In this context, and being restricted to dynamic games, complex multiagent decision problems can be algorithmically approached. The contribution of this thesis is three-fold. First, this thesis contributes an algorithmic framework for distributed planning in non-cooperative dynamic games. The multiplicity of possible plans is a matter of serious complications for any planning approach. We propose a novel approach based on the concept of learning in repeated games. Our approach permits overcoming the aforementioned complications by means of communication between players. We then propose a learning algorithm for repeated game self-play. Our algorithm allows players to converge, in an initially unknown repeated game, to a joint behavior optimal in a certain, well-defined sense, without communication between players. Finally, we propose a family of algorithms for approximately solving dynamic games, and for extracting equilibrium strategy profiles. In this context, we first propose a method to compute a nonempty subset of approximate subgame-perfect equilibria in repeated games. We then demonstrate how to extend this method for approximating all subgame-perfect equilibria in repeated games, and also for solving more complex dynamic games