203 research outputs found

    Many-agent Reinforcement Learning

    Get PDF
    Multi-agent reinforcement learning (RL) solves the problem of how each agent should behave optimally in a stochastic environment in which multiple agents are learning simultaneously. It is an interdisciplinary domain with a long history that lies at the intersection of psychology, control theory, game theory, reinforcement learning, and deep learning. Following the remarkable success of the AlphaGo series in single-agent RL, 2019 was a booming year that witnessed significant advances in multi-agent RL techniques; impressive breakthroughs have been made in developing AIs that outperform humans on many challenging tasks, especially multi-player video games. Nonetheless, one of the key challenges of multi-agent RL techniques is scalability; it is still non-trivial to design efficient learning algorithms that can solve tasks involving far more than two agents (N ≫ 2), which I call many-agent reinforcement learning (I use "MARL" to denote multi-agent reinforcement learning with a particular focus on the case of many agents; otherwise, it is denoted "Multi-Agent RL" by default). In this thesis, I contribute to tackling MARL problems from four aspects. Firstly, I offer a self-contained overview of multi-agent RL techniques from a game-theoretical perspective. This overview fills the research gap that most existing work either fails to cover the recent advances since 2010 or does not pay adequate attention to game theory, which I believe is the cornerstone of solving many-agent learning problems. Secondly, I develop a tractable policy evaluation algorithm, α^α-Rank, for many-agent systems. The critical advantage of α^α-Rank is that it can compute the solution concept of α-Rank tractably in multi-player general-sum games without needing to store the entire pay-off matrix. This is in contrast to classic solution concepts such as the Nash equilibrium, which is known to be PPAD-hard even in two-player cases. α^α-Rank allows us, for the first time, to practically conduct large-scale multi-agent evaluations. Thirdly, I introduce a scalable policy learning algorithm, mean-field MARL, for many-agent systems. The mean-field MARL method takes advantage of the mean-field approximation from physics, and it is the first provably convergent algorithm that tries to break the curse of dimensionality for MARL tasks. With the proposed algorithm, I report the first results of solving the Ising model and multi-agent battle games through a MARL approach. Fourthly, I investigate the many-agent learning problem in open-ended meta-games (i.e., the game of a game in the policy space). Specifically, I focus on modelling behavioural diversity in meta-games and on developing algorithms that are guaranteed to enlarge diversity during training. The proposed metric, based on determinantal point processes, serves as the first mathematically rigorous definition of diversity. Importantly, the diversity-aware learning algorithms beat the existing state-of-the-art game solvers in terms of exploitability by a large margin. On top of the algorithmic developments, I also contribute two real-world applications of MARL techniques. Specifically, I demonstrate the great potential of applying MARL to study emergent population dynamics in nature and to model diverse and realistic interactions in autonomous driving. Both applications embody the prospect that MARL techniques could achieve huge impact in the real physical world, beyond video games.
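
    The mean-field idea named in this abstract is that, rather than conditioning on the joint action of all N agents, each agent's value function conditions only on its own action and the empirical mean action of its neighbours. The snippet below is a minimal sketch of that approximation for discrete actions; the tabular representation, neighbourhood structure, and hyper-parameters are assumptions made for the example, not the thesis's exact mean-field MARL algorithm.

import numpy as np

def mean_action(neighbour_actions, n_actions):
    """Empirical distribution of the neighbours' discrete actions (a one-hot average)."""
    counts = np.bincount(neighbour_actions, minlength=n_actions)
    return counts / max(len(neighbour_actions), 1)

def mean_field_q_update(Q, state, action, reward, next_state,
                        mean_act, next_mean_act, alpha=0.1, gamma=0.95):
    """One tabular mean-field Q-learning step for a single agent.

    Q has shape (n_states, n_actions, n_actions): own state, own action, and a
    neighbour-action dimension. The value under a mean action mu is the linear
    combination Q(s, a, mu) = sum_a' mu[a'] * Q[s, a, a'], so the update is a
    semi-gradient TD step on that linear form.
    """
    # value of the next state under the neighbours' next mean action
    next_values = Q[next_state] @ next_mean_act          # shape: (n_actions,)
    target = reward + gamma * next_values.max()
    # current estimate for the taken action under the current mean action
    current = Q[state, action] @ mean_act
    Q[state, action] += alpha * (target - current) * mean_act
    return Q

# Illustrative usage with assumed sizes and actions.
n_states, n_actions = 4, 3
Q = np.zeros((n_states, n_actions, n_actions))
mu = mean_action(np.array([0, 2, 2, 1]), n_actions)       # neighbours' last actions
Q = mean_field_q_update(Q, state=0, action=2, reward=1.0,
                        next_state=1, mean_act=mu, next_mean_act=mu)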

    Opponent Modelling in Multi-Agent Systems

    Get PDF
    Reinforcement Learning (RL) formalises a problem where an intelligent agent needs to learn and achieve certain goals by maximising a long-term return in an environment. Multi-agent reinforcement learning (MARL) extends traditional RL to multiple agents. Many RL algorithms lose their convergence guarantees in non-stationary environments because of adaptive opponents. Partial observation, caused by agents' differing private observations, introduces high variance during training, which exacerbates data inefficiency. In MARL, training an agent to perform well against one set of opponents often leads to poor performance against another set of opponents. Non-stationarity, partial observation, and unclear learning objectives are three critical problems in MARL that hinder agents' learning, and they all share a common cause: the lack of knowledge of the other agents. Therefore, in this thesis, we propose to solve these problems with opponent modelling methods. We tailor our solutions by combining opponent modelling with other techniques according to the characteristics of the problems we face. Specifically, we first propose ROMMEO, an algorithm inspired by Bayesian inference, as a solution to alleviate non-stationarity in cooperative games. Then we study the partial observation problem caused by agents' private observations and design an implicit communication training method named PBL. Lastly, we investigate solutions to the non-stationarity and unclear-learning-objective problems in zero-sum games. We propose a solution named EPSOM, which aims to find safe exploitation strategies for playing against non-stationary opponents. We verify our proposed methods through varied experiments and show that they achieve the desired performance. Limitations and future work are discussed in the last chapter of this thesis.
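
    A common thread in the abstract above is that the agent maintains an explicit model of the other agents and plans against it. The sketch below is a generic, minimal illustration of that idea (it is not ROMMEO, PBL, or EPSOM, whose details are not given here): a Dirichlet-categorical belief over an opponent's action frequencies is updated from observed play, and our action is chosen as a best response to that belief. The payoff numbers and interface are illustrative assumptions.

import numpy as np

class DirichletOpponentModel:
    """Belief over a single opponent's (assumed stationary) action distribution.

    A Dirichlet prior over a categorical policy: each observed action adds one
    pseudo-count, and the predictive distribution is the normalised counts.
    """

    def __init__(self, n_actions, prior=1.0):
        self.counts = np.full(n_actions, prior, dtype=float)

    def observe(self, opponent_action):
        self.counts[opponent_action] += 1.0

    def predict(self):
        return self.counts / self.counts.sum()

def best_response(payoff_matrix, opponent_belief):
    """Pick our action maximising expected payoff under the opponent belief.

    payoff_matrix[i, j] is our payoff when we play i and the opponent plays j.
    """
    expected = payoff_matrix @ opponent_belief
    return int(np.argmax(expected))

# Illustrative 2x2 game payoffs for the modelling agent (assumed numbers).
payoffs = np.array([[3.0, 0.0],
                    [5.0, 1.0]])
model = DirichletOpponentModel(n_actions=2)
for observed in [0, 0, 1, 0]:            # opponent's past actions
    model.observe(observed)
print("belief:", model.predict())
print("best response:", best_response(payoffs, model.predict()))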

    The exploration-exploitation trade-off in sequential decision making problems

    Get PDF
    Sequential decision making problems require an agent to repeatedly choose between a series of actions. Common to such problems is the exploration-exploitation trade-off, where an agent must choose between the action expected to yield the best reward (exploitation) and trying an alternative action for potential future benefit (exploration). The main focus of this thesis is to understand in more detail the role this trade-off plays in various important sequential decision making problems, in terms of maximising finite-time reward. The most common and best studied abstraction of the exploration-exploitation trade-off is the classic multi-armed bandit problem. In this thesis we study several important extensions that are more suitable than the classic problem for real-world applications. These extensions include scenarios where the rewards for actions change over time or the presence of other agents must be repeatedly considered. In these contexts, the exploration-exploitation trade-off has a more complicated role in terms of maximising finite-time performance. For example, the amount of exploration required will constantly change in a dynamic decision problem; in multi-agent problems, agents can explore by communication; and in repeated games, the exploration-exploitation trade-off must be considered jointly with game-theoretic reasoning. Existing techniques for balancing exploration and exploitation focus on achieving desirable asymptotic behaviour and are in general only applicable to basic decision problems. The most flexible state-of-the-art approaches, ε-greedy and ε-first, require exploration parameters to be set a priori, the optimal values of which are highly dependent on the problem faced. To overcome this, we construct a novel algorithm, ε-ADAPT, which has no exploration parameters and can adapt exploration online for a wide range of problems. ε-ADAPT is built on newly proven theoretical properties of the ε-first policy, and we demonstrate that ε-ADAPT can accurately learn not only how much to explore, but also when and which actions to explore.
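
    The ε-greedy and ε-first policies named in this abstract are simple enough to state directly. The sketch below shows both on a stochastic multi-armed bandit; ε-ADAPT itself is not reproduced, since its adaptive rule is not described in the abstract. The arm reward probabilities, horizon, and ε value are illustrative assumptions.

import random

def pull(arm_means, arm):
    """Bernoulli reward drawn with the arm's true mean (illustrative environment)."""
    return 1.0 if random.random() < arm_means[arm] else 0.0

def run_bandit(arm_means, horizon, epsilon, policy="greedy"):
    """Run epsilon-greedy or epsilon-first on a stochastic bandit.

    epsilon-greedy: explore a uniformly random arm with probability epsilon
                    at every step, otherwise exploit the best empirical arm.
    epsilon-first:  spend the first epsilon*horizon steps purely exploring,
                    then exploit the best empirical arm for the remainder.
    """
    n_arms = len(arm_means)
    counts = [0] * n_arms
    values = [0.0] * n_arms               # empirical mean reward per arm
    total = 0.0
    for t in range(horizon):
        if policy == "greedy":
            explore = random.random() < epsilon
        else:                             # "first"
            explore = t < epsilon * horizon
        if explore:
            arm = random.randrange(n_arms)
        else:
            arm = max(range(n_arms), key=lambda a: values[a])
        r = pull(arm_means, arm)
        counts[arm] += 1
        values[arm] += (r - values[arm]) / counts[arm]    # incremental mean
        total += r
    return total

means = [0.3, 0.5, 0.7]                   # assumed arm reward probabilities
print("eps-greedy total reward:", run_bandit(means, 1000, 0.1, "greedy"))
print("eps-first  total reward:", run_bandit(means, 1000, 0.1, "first"))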

    BNAIC 2008: Proceedings of BNAIC 2008, the twentieth Belgian-Dutch Artificial Intelligence Conference

    Get PDF