10 research outputs found

    Efficiently detecting switches against non-stationary opponents

    Get PDF
    Interactions in multiagent systems are generally more complicated than single-agent ones. Game theory provides solutions on how to act in multiagent scenarios; however, it assumes that all agents will act rationally. Moreover, some works also assume the opponent will use a stationary strategy. These assumptions usually do not hold in real-world scenarios, where agents have limited capacities and may deviate from a perfectly rational response. Our goal is still to act optimally in these cases by learning the appropriate response, without any prior policies on how to act. Thus, we focus on the setting in which another agent in the environment switches between different stationary strategies over time. This turns the problem into learning in a non-stationary environment, which is difficult for most learning algorithms. This paper introduces DriftER, an algorithm that (1) learns a model of the opponent, (2) uses that model to obtain an optimal policy, and (3) determines when it must re-learn because the opponent's strategy has changed. We provide theoretical results showing that DriftER is guaranteed to detect switches with high probability. We also provide empirical results showing that our approach outperforms state-of-the-art algorithms in normal-form games such as the prisoner's dilemma and in a more realistic scenario, the Power TAC simulator.
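    The switch-detection mechanism described in this abstract can be illustrated with a small sketch: maintain an empirical model of the opponent's actions, track how often the model mispredicts recent actions, and flag a switch when the error rate rises. The window size, error threshold and opponent behaviour below are illustrative assumptions, not DriftER's actual parameters or analysis.

```python
# Hypothetical sketch of drift detection against a switching opponent.
# Window size and threshold are illustrative, not DriftER's actual parameters.
from collections import deque, Counter

class SwitchDetector:
    def __init__(self, window=50, threshold=0.35):
        self.counts = Counter()                     # empirical opponent model
        self.total = 0
        self.recent_errors = deque(maxlen=window)   # 1 = misprediction, 0 = hit
        self.threshold = threshold

    def predict(self):
        if self.total == 0:
            return None
        return max(self.counts, key=self.counts.get)   # most likely opponent action

    def observe(self, action):
        prediction = self.predict()
        if prediction is not None:
            self.recent_errors.append(0 if action == prediction else 1)
        self.counts[action] += 1
        self.total += 1

    def switch_detected(self):
        if len(self.recent_errors) < self.recent_errors.maxlen:
            return False
        return sum(self.recent_errors) / len(self.recent_errors) > self.threshold

    def reset(self):
        self.__init__(self.recent_errors.maxlen, self.threshold)

# Opponent plays "cooperate" for 200 rounds, then switches to "defect".
detector = SwitchDetector()
for t in range(400):
    opponent_action = "cooperate" if t < 200 else "defect"
    detector.observe(opponent_action)
    if detector.switch_detected():
        print(f"switch detected at round {t}")
        detector.reset()                            # re-learn the opponent model
```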

    An exploration strategy for non-stationary opponents

    Get PDF
    The success or failure of any learning algorithm is partially due to the exploration strategy it employs. However, most exploration strategies assume that the environment is stationary and non-strategic. In this work we shed light on how to design exploration strategies in non-stationary and adversarial environments. Our proposed adversarial drift exploration (DE) is able to efficiently explore the state space while keeping track of regions of the environment that have changed. This exploration is general enough to be applied in single-agent non-stationary environments as well as in multiagent settings where the opponent changes its strategy over time. We use a two-agent strategic interaction setting to test this new type of exploration, in which the opponent switches between different behavioral patterns to emulate a non-deterministic, stochastic and adversarial environment. The agent's objective is to learn a model of the opponent's strategy in order to act optimally. Our contribution is twofold. First, we present DE as a strategy for switch detection. Second, we propose a new algorithm called R-max# for learning and planning against non-stationary opponents. To handle such opponents, R-max# reasons and acts in terms of two objectives: (1) to maximize utility in the short term while learning, and (2) to eventually explore for opponent behavioral changes. We provide theoretical results showing that R-max# is guaranteed to detect the opponent's switch and learn a new model with finite sample complexity. R-max# makes efficient use of exploration experiences, which results in rapid adaptation and efficient drift exploration when dealing with the non-stationary nature of the opponent. We show experimentally that using DE outperforms state-of-the-art algorithms explicitly designed for modeling opponents (in terms of average rewards) in two complementary domains.
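    A minimal sketch of the drift-exploration idea in a tabular, R-max-style learner follows: state-action statistics decay over time, so pairs that were once "known" eventually look unknown again and are revisited optimistically. The known-threshold m and the decay schedule are illustrative assumptions, not the actual mechanism or guarantees of R-max#.

```python
# Hypothetical sketch of drift exploration in a tabular R-max-style learner.
# Parameters (m, decay_every) are illustrative, not those of R-max#.
from collections import defaultdict

class DriftExplorer:
    def __init__(self, actions, m=10, rmax=1.0, decay_every=500):
        self.actions = actions
        self.m = m                      # visits needed before a pair counts as "known"
        self.rmax = rmax                # optimistic value for unknown pairs
        self.decay_every = decay_every  # how often to forget old experience
        self.counts = defaultdict(int)
        self.reward_sum = defaultdict(float)
        self.step = 0

    def value(self, state, action):
        # Unknown (or stale) pairs get the optimistic value rmax, forcing re-exploration.
        if self.counts[(state, action)] < self.m:
            return self.rmax
        return self.reward_sum[(state, action)] / self.counts[(state, action)]

    def choose(self, state):
        return max(self.actions, key=lambda a: self.value(state, a))

    def update(self, state, action, reward):
        self.counts[(state, action)] += 1
        self.reward_sum[(state, action)] += reward
        self.step += 1
        if self.step % self.decay_every == 0:
            # Drift exploration: halve all statistics so stale knowledge
            # eventually drops below the "known" threshold and is revisited.
            for key in list(self.counts):
                self.counts[key] //= 2
                self.reward_sum[key] /= 2.0

# Minimal usage: pick an action for a state, then record the observed reward.
agent = DriftExplorer(actions=["cooperate", "defect"])
a = agent.choose("start")
agent.update("start", a, reward=1.0)
```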

    A Better-response Strategy for Self-interested Planning Agents

    Full text link
    When self-interested agents plan individually, interactions that prevent them from executing their actions as planned may arise. In these coordination problems, game-theoretic planning can be used to enhance the agents' strategic behavior by considering the interactions as part of the agents' utility. In this work, we define a general-sum game in which interactions such as conflicts and congestion are reflected in the agents' utility. We propose a better-response planning strategy that guarantees convergence to an equilibrium joint plan by imposing a tax on agents involved in conflicts. We apply our approach to a real-world problem in which the agents are Electric Autonomous Vehicles (EAVs). The EAVs intend to find a joint plan that ensures their individual goals are achievable in a transportation scenario where congestion and conflicting situations may arise. Although the task is computationally hard, as we theoretically prove, the experimental results show that our approach outperforms similar approaches in both performance and solution quality. This work is supported by the GLASS project TIN2014-55637-C2-2-R of the Spanish MINECO and the Prometeo project II/2013/019 funded by the Valencian Government. Jordán, J.; Torreño Lerma, A.; De Weerdt, M.; Onaindia De La Rivaherrera, E. (2018). A Better-response Strategy for Self-interested Planning Agents. Applied Intelligence 48(4):1020-1040. https://doi.org/10.1007/s10489-017-1046-5
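    A rough sketch of the better-response idea under illustrative assumptions: agents take turns replacing their plan with one that strictly improves their own utility given the others' current plans, where the utility charges a tax for every conflict the plan is involved in, and the loop stops when no agent can improve. The plan representation, utility values and tax below are hypothetical placeholders, not the paper's actual game definition.

```python
# Hypothetical better-response loop with a conflict tax; plans, utilities
# and the tax value are illustrative placeholders, not the paper's game.
def conflicts(plan, other_plans):
    """Number of resources this plan shares with any other agent's plan."""
    used_by_others = set().union(*other_plans) if other_plans else set()
    return len(set(plan) & used_by_others)

def utility(plan, other_plans, base_value, tax=5.0):
    return base_value[plan] - tax * conflicts(plan, other_plans)

def better_response(agents_plans, candidate_plans, base_value, max_rounds=100):
    """Each agent in turn switches to a strictly better plan if one exists."""
    for _ in range(max_rounds):
        improved = False
        for i, current in enumerate(agents_plans):
            others = agents_plans[:i] + agents_plans[i + 1:]
            best = max(candidate_plans[i],
                       key=lambda p: utility(p, others, base_value))
            if utility(best, others, base_value) > utility(current, others, base_value):
                agents_plans[i] = best
                improved = True
        if not improved:            # no agent can improve: equilibrium joint plan
            return agents_plans
    return agents_plans

# Two agents, plans are tuples of road segments; both initially want route r1.
base_value = {("r1",): 10.0, ("r2", "r3"): 7.0}
plans = [("r1",), ("r1",)]
candidates = [[("r1",), ("r2", "r3")], [("r1",), ("r2", "r3")]]
print(better_response(plans, candidates, base_value))
```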

    A Survey of Learning in Multiagent Environments: Dealing with Non-Stationarity

    Get PDF
    The key challenge in multiagent learning is learning a best response to the behaviour of other agents, which may be non-stationary: if the other agents adapt their strategies as well, the learning target moves. Disparate streams of research have approached non-stationarity from several angles, each making a variety of implicit assumptions, which makes it hard to keep an overview of the state of the art and to validate the innovation and significance of new work. This survey presents a coherent overview of work that addresses opponent-induced non-stationarity with tools from game theory, reinforcement learning and multi-armed bandits. Further, we reflect on the principal approaches by which algorithms model and cope with this non-stationarity, arriving at a new framework and five categories (in increasing order of sophistication): ignore, forget, respond to target models, learn models, and theory of mind. A wide range of state-of-the-art algorithms is classified into a taxonomy using these categories and key characteristics of the environment (e.g., observability) and of the opponents' adaptation behaviour (e.g., smooth, abrupt). To clarify further, we present illustrative variations of one domain, contrasting the strengths and limitations of each category. Finally, we discuss in which environments the different approaches yield the most merit, and point to promising avenues of future research.
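    The "forget" category named above can be illustrated with a small sketch, assuming a simple stateless interaction: the agent estimates the opponent's action frequencies with exponential recency weighting, so older observations fade and the estimate tracks a drifting opponent. The learning rate is an arbitrary illustrative choice, and this is only one of the five categories surveyed.

```python
# Illustrative "forget" approach: exponentially weighted opponent model.
# The learning rate alpha is an arbitrary choice for the sketch.
class ForgettingOpponentModel:
    def __init__(self, actions, alpha=0.1):
        self.alpha = alpha
        self.probs = {a: 1.0 / len(actions) for a in actions}

    def observe(self, action):
        # Move every probability toward the indicator of the observed action;
        # older observations decay geometrically, so the model tracks drift.
        for a in self.probs:
            target = 1.0 if a == action else 0.0
            self.probs[a] += self.alpha * (target - self.probs[a])

model = ForgettingOpponentModel(["cooperate", "defect"])
for _ in range(30):
    model.observe("cooperate")
for _ in range(30):
    model.observe("defect")
print(model.probs)   # probability mass has shifted toward "defect"
```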

    The Exploration-Exploitation Trade-Off in Sequential Decision Making Problems

    No full text
    Sequential decision making problems require an agent to repeatedly choose between a series of actions. Common to such problems is the exploration-exploitation trade-off, where an agent must choose between the action expected to yield the best reward (exploitation) or trying an alternative action for potential future benefit (exploration). The main focus of this thesis is to understand in more detail the role this trade-off plays in various important sequential decision making problems, in terms of maximising finite-time reward. The most common and best studied abstraction of the exploration-exploitation trade-off is the classic multi-armed bandit problem. In this thesis we study several important extensions that are more suitable than the classic problem to real-world applications. These extensions include scenarios where the rewards for actions change over time or the presence of other agents must be repeatedly considered. In these contexts, the exploration-exploitation trade-off has a more complicated role in terms of maximising finite-time performance. For example, the amount of exploration required will constantly change in a dynamic decision problem, in multiagent problems agents can explore by communication, and in repeated games, the exploration-exploitation trade-off must be jointly considered with game theoretic reasoning. Existing techniques for balancing exploration-exploitation are focused on achieving desirable asymptotic behaviour and are in general only applicable to basic decision problems. The most flexible state-of-the-art approaches, ε-greedy and ε-first, require exploration parameters to be set a priori, the optimal values of which are highly dependent on the problem faced. To overcome this, we construct a novel algorithm, ε-ADAPT, which has no exploration parameters and can adapt exploration on-line for a wide range of problems. ε-ADAPT is built on newly proven theoretical properties of the ε-first policy and we demonstrate that ε-ADAPT can accurately learn not only how much to explore, but also when and which actions to explore.
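    Since the abstract contrasts ε-greedy and ε-first, a minimal bandit sketch of those two baselines may be helpful; ε-ADAPT itself is not reproduced here. The arm means, horizon and ε value are arbitrary illustrative choices.

```python
# Minimal multi-armed bandit with the two standard baselines the thesis
# builds on: epsilon-greedy and epsilon-first. Arm means, horizon and epsilon
# are arbitrary illustrative choices; epsilon-ADAPT itself is not sketched.
import random

def pull(arm, means=(0.3, 0.5, 0.7)):
    return 1.0 if random.random() < means[arm] else 0.0

def run(policy, n_arms=3, horizon=1000, epsilon=0.1):
    counts = [0] * n_arms
    values = [0.0] * n_arms
    total = 0.0
    for t in range(horizon):
        if policy == "epsilon-first":
            explore = t < epsilon * horizon        # explore only in the first e*T steps
        else:  # epsilon-greedy
            explore = random.random() < epsilon    # explore with fixed probability each step
        arm = random.randrange(n_arms) if explore else max(range(n_arms), key=lambda a: values[a])
        reward = pull(arm)
        counts[arm] += 1
        values[arm] += (reward - values[arm]) / counts[arm]   # incremental mean estimate
        total += reward
    return total

random.seed(0)
print("epsilon-greedy:", run("epsilon-greedy"))
print("epsilon-first: ", run("epsilon-first"))
```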

    Automated Planning in Repeated Adversarial Games

    No full text
    Game theory's prescriptive power typically relies on full rationality and/or self-play interactions. In contrast, this work sets aside these fundamental premises and focuses instead on heterogeneous autonomous interactions between two or more agents. Specifically, we introduce a new and concise representation for repeated adversarial (constant-sum) games that highlights the necessary features enabling an automated planning agent to reason about how to score above the game's Nash equilibrium when facing heterogeneous adversaries. To this end, we present TeamUP, a model-based RL algorithm designed for learning and planning with such an abstraction. In essence, it is somewhat similar to R-max with a cleverly engineered reward shaping that treats exploration as an adversarial optimization problem. In practice, it attempts to find an ally with which to tacitly collude (in games with more than two players) and then collaborates on a joint plan of actions that can consistently score a high utility in adversarial repeated games. We use the inaugural Lemonade Stand Game Tournament to demonstrate the effectiveness of our approach, and find that TeamUP is the best performing agent, demoting the Tournament's actual winning strategy to second place. In our experimental analysis, we show that our strategy successfully and consistently builds collaborations with many different heterogeneous (and sometimes very sophisticated) adversaries.
