77 research outputs found

    Expert iteration

    Get PDF
    In this thesis, we study how reinforcement learning algorithms can tackle classical board games without recourse to human knowledge. Specifically, we develop a framework and algorithms which learn to play the board game Hex starting from random play. We first describe Expert Iteration (ExIt), a novel reinforcement learning framework which extends Modified Policy Iteration. ExIt explicitly decomposes the reinforcement learning problem into two parts: planning and generalisation. A planning algorithm explores possible move sequences starting from a particular position to find good strategies from that position, while a parametric function approximator is trained to predict those plans, generalising to states not yet seen. Subsequently, planning is improved by using the approximated policy to guide search, increasing the strength of new plans. This decomposition allows ExIt to combine the benefits of both planning methods and function approximation methods. We demonstrate the effectiveness of the ExIt paradigm by implementing ExIt with two different planning algorithms. First, we develop a version based on Monte Carlo Tree Search (MCTS), a search algorithm which has been successful both in specific games, such as Go, Hex and Havannah, and in general game playing competitions. We then develop a new planning algorithm, Policy Gradient Search (PGS), which uses a model-free reinforcement learning algorithm for online planning. Unlike MCTS, PGS does not require an explicit search tree. Instead PGS uses function approximation within a single search, allowing it to be applied to problems with larger branching factors. Both MCTS-ExIt and PGS-ExIt defeated MoHex 2.0 - the most recent Hex Olympiad winner to be open sourced - in 9 × 9 Hex. More importantly, whereas MoHex makes use of many Hex-specific improvements and knowledge, all our programs were trained tabula rasa using general reinforcement learning methods. This bodes well for ExIt’s applicability to both other games and real world decision making problems

    Sublinear Computation Paradigm

    Get PDF
    This open access book gives an overview of cutting-edge work on a new paradigm called the “sublinear computation paradigm,” which was proposed in the large multiyear academic research project “Foundations of Innovative Algorithms for Big Data.” That project ran from October 2014 to March 2020, in Japan. To handle the unprecedented explosion of big data sets in research, industry, and other areas of society, there is an urgent need to develop novel methods and approaches for big data analysis. To meet this need, innovative changes in algorithm theory for big data are being pursued. For example, polynomial-time algorithms have thus far been regarded as “fast,” but if a quadratic-time algorithm is applied to a petabyte-scale or larger big data set, problems are encountered in terms of computational resources or running time. To deal with this critical computational and algorithmic bottleneck, linear, sublinear, and constant time algorithms are required. The sublinear computation paradigm is proposed here in order to support innovation in the big data era. A foundation of innovative algorithms has been created by developing computational procedures, data structures, and modelling techniques for big data. The project is organized into three teams that focus on sublinear algorithms, sublinear data structures, and sublinear modelling. The work has provided high-level academic research results of strong computational and algorithmic interest, which are presented in this book. The book consists of five parts: Part I, which consists of a single chapter on the concept of the sublinear computation paradigm; Parts II, III, and IV review results on sublinear algorithms, sublinear data structures, and sublinear modelling, respectively; Part V presents application results. The information presented here will inspire the researchers who work in the field of modern algorithms

    Monte-Carlo tree search enhancements for one-player and two-player domains

    Get PDF

    Catgame: A Tool For Problem Solving In Complex Dynamic Systems Using Game Theoretic Knowledge Distribution In Cultural Algorithms, And Its Application (catneuro) To The Deep Learning Of Game Controller

    Get PDF
    Cultural Algorithms (CA) are knowledge-intensive, population-based stochastic optimization methods that are modeled after human cultures and are suited to solving problems in complex environments. The CA Belief Space stores knowledge harvested from prior generations and re-distributes it to future generations via a knowledge distribution (KD) mechanism. Each of the population individuals is then guided through the search space via the associated knowledge. Previously, CA implementations have used only competitive KD mechanisms that have performed well for problems embedded in static environments. Relatively recently, CA research has evolved to encompass dynamic problem environments. Given increasing environmental complexity, a natural question arises about whether KD mechanisms that also incorporate cooperation can perform better in such environments than purely competitive ones? Borrowing from game theory, game-based KD mechanisms are implemented and tested against the default competitive mechanism – Weighted Majority (WTD). Two different concepts of complexity are addressed – numerical optimization under dynamic environments and hierarchal, multi-objective optimization for evolving deep learning models. The former is addressed with the CATGame software system and the later with CATNeuro. CATGame implements three types of games that span both cooperation and competition for knowledge distribution, namely: Iterated Prisoner\u27s Dilemma (IPD), Stag-Hunt and Stackelberg. The performance of the three game mechanisms is compared with the aid of a dynamic problem generator called Cones World. Weighted Majority, aka “wisdom of the crowd”, the default CA competitive KD mechanism is used as the benchmark. It is shown that games that support both cooperation and competition do indeed perform better but not in all cases. The results shed light on what kinds of games are suited to problem solving in complex, dynamic environments. Specifically, games that balance exploration and exploitation using the local signal of ‘social’ rank – Stag-Hunt and IPD – perform better. Stag-Hunt which is also the most cooperative of the games tested, performed the best overall. Dynamic analysis of the ‘social’ aspects of the CA test runs shows that Stag-Hunt allocates compute resources more consistently than the others in response to environmental complexity changes. Stackelberg where the allocation decisions are centralized, like in a centrally planned economic system, is found to be the least adaptive. CATNeuro is for solving neural architecture search (NAS) problems. Contemporary ‘deep learning’ neural network models are proven effective. However, the network topologies may be complex and not immediately obvious for the problem at hand. This has given rise to the secondary field of neural architecture search. It is still nascent with many frameworks and approaches now becoming available. This paper describes a NAS method based on graph evolution pioneered by NEAT (Neuroevolution of Augmenting Topologies) but driven by the evolutionary mechanisms under Cultural Algorithms. Here CATNeuro is applied to find optimal network topologies to play a 2D fighting game called FightingICE (derived from “The Rumble Fish” video game). A policy-based, reinforcement learning method is used to create the training data for network optimization. CATNeuro is still evolving. To inform the development of CATNeuro, in this primary foray into NAS, we contrast the performance of CATNeuro with two different knowledge distribution mechanisms – the stalwart Weighted Majority and a new one based on the Stag-Hunt game from evolutionary game theory that performed the best in CATGame. The research shows that Stag-Hunt has a distinct edge over WTD in terms of game performance, model accuracy, and model size. It is therefore deemed to be the preferred mechanism for complex, hierarchical optimization tasks such as NAS and is planned to be used as the default KD mechanism in CATNeuro going forward

    Multi-player games in the era of machine learning

    Full text link
    Parmi tous les jeux de société joués par les humains au cours de l’histoire, le jeu de go était considéré comme l’un des plus difficiles à maîtriser par un programme informatique [Van Den Herik et al., 2002]; Jusqu’à ce que ce ne soit plus le cas [Silveret al., 2016]. Cette percée révolutionnaire [Müller, 2002, Van Den Herik et al., 2002] fût le fruit d’une combinaison sophistiquée de Recherche arborescente Monte-Carlo et de techniques d’apprentissage automatique pour évaluer les positions du jeu, mettant en lumière le grand potentiel de l’apprentissage automatique pour résoudre des jeux. L’apprentissage antagoniste, un cas particulier de l’optimisation multiobjective, est un outil de plus en plus utile dans l’apprentissage automatique. Par exemple, les jeux à deux joueurs et à somme nulle sont importants dans le domain des réseaux génératifs antagonistes [Goodfellow et al., 2014] ainsi que pour maîtriser des jeux comme le Go ou le Poker en s’entraînant contre lui-même [Silver et al., 2017, Brown andSandholm, 2017]. Un résultat classique de la théorie des jeux indique que les jeux convexes-concaves ont toujours un équilibre [Neumann, 1928]. Étonnamment, les praticiens en apprentissage automatique entrainent avec succès une seule paire de réseaux de neurones dont l’objectif est un problème de minimax non-convexe et non-concave alors que pour une telle fonction de gain, l’existence d’un équilibre de Nash n’est pas garantie en général. Ce travail est une tentative d'établir une solide base théorique pour l’apprentissage dans les jeux. La première contribution explore le théorème minimax pour une classe particulière de jeux non-convexes et non-concaves qui englobe les réseaux génératifs antagonistes. Cette classe correspond à un ensemble de jeux à deux joueurs et a somme nulle joués avec des réseaux de neurones. Les deuxième et troisième contributions étudient l’optimisation des problèmes minimax, et plus généralement, les inégalités variationnelles dans le cadre de l’apprentissage automatique. Bien que la méthode standard de descente de gradient ne parvienne pas à converger vers l’équilibre de Nash de jeux convexes-concaves simples, il existe des moyens d’utiliser des gradients pour obtenir des méthodes qui convergent. Nous étudierons plusieurs techniques telles que l’extrapolation, la moyenne et la quantité de mouvement à paramètre négatif. La quatrième contribution fournit une étude empirique du comportement pratique des réseaux génératifs antagonistes. Dans les deuxième et troisième contributions, nous diagnostiquons que la méthode du gradient échoue lorsque le champ de vecteur du jeu est fortement rotatif. Cependant, une telle situation peut décrire un pire des cas qui ne se produit pas dans la pratique. Nous fournissons de nouveaux outils de visualisation afin d’évaluer si nous pouvons détecter des rotations dans comportement pratique des réseaux génératifs antagonistes.Among all the historical board games played by humans, the game of go was considered one of the most difficult to master by a computer program [Van Den Heriket al., 2002]; Until it was not [Silver et al., 2016]. This odds-breaking break-through [Müller, 2002, Van Den Herik et al., 2002] came from a sophisticated combination of Monte Carlo tree search and machine learning techniques to evaluate positions, shedding light upon the high potential of machine learning to solve games. Adversarial training, a special case of multiobjective optimization, is an increasingly useful tool in machine learning. For example, two-player zero-sum games are important for generative modeling (GANs) [Goodfellow et al., 2014] and mastering games like Go or Poker via self-play [Silver et al., 2017, Brown and Sandholm,2017]. A classic result in Game Theory states that convex-concave games always have an equilibrium [Neumann, 1928]. Surprisingly, machine learning practitioners successfully train a single pair of neural networks whose objective is a nonconvex-nonconcave minimax problem while for such a payoff function, the existence of a Nash equilibrium is not guaranteed in general. This work is an attempt to put learning in games on a firm theoretical foundation. The first contribution explores minimax theorems for a particular class of nonconvex-nonconcave games that encompasses generative adversarial networks. The proposed result is an approximate minimax theorem for two-player zero-sum games played with neural networks, including WGAN, StarCrat II, and Blotto game. Our findings rely on the fact that despite being nonconcave-nonconvex with respect to the neural networks parameters, the payoff of these games are concave-convex with respect to the actual functions (or distributions) parametrized by these neural networks. The second and third contributions study the optimization of minimax problems, and more generally, variational inequalities in the context of machine learning. While the standard gradient descent-ascent method fails to converge to the Nash equilibrium of simple convex-concave games, there exist ways to use gradients to obtain methods that converge. We investigate several techniques such as extrapolation, averaging and negative momentum. We explore these techniques experimentally by proposing a state-of-the-art (at the time of publication) optimizer for GANs called ExtraAdam. We also prove new convergence results for Extrapolation from the past, originally proposed by Popov [1980], as well as for gradient method with negative momentum. The fourth contribution provides an empirical study of the practical landscape of GANs. In the second and third contributions, we diagnose that the gradient method breaks when the game’s vector field is highly rotational. However, such a situation may describe a worst-case that does not occur in practice. We provide new visualization tools in order to exhibit rotations in practical GAN landscapes. In this contribution, we show empirically that the training of GANs exhibits significant rotations around Local Stable Stationary Points (LSSP), and we provide empirical evidence that GAN training converges to a stable stationary point, which is a saddle point for the generator loss, not a minimum, while still achieving excellent performance
    corecore