31 research outputs found

    Bootstrapping Monte Carlo Tree Search with an Imperfect Heuristic

    We consider the problem of using a heuristic policy to improve the value approximation of the Upper Confidence Bounds applied to Trees (UCT) algorithm in non-adversarial settings, such as planning in Markov Decision Processes with large state spaces. Current improvements to UCT focus on changing either the action selection formula at the internal nodes or the rollout policy at the leaf nodes of the search tree. In this work, we propose adding an auxiliary arm to each internal node and always using the heuristic policy to roll out simulations at the auxiliary arms. The method aims to achieve fast convergence to optimal values at states where the heuristic policy is optimal, while retaining an approximation similar to that of the original UCT in other states. We show that the resulting algorithm, UCT-Aux, which bootstraps with the proposed method, outperforms the original UCT algorithm and its variants in two benchmark experiment settings. We also examine conditions under which UCT-Aux works well. Comment: 16 pages, accepted for presentation at ECML'1
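
    The auxiliary-arm idea can be illustrated on a single node. The sketch below is not the authors' implementation: it assumes plain UCB1 selection, a hypothetical `AUX` arm label, and fixed toy rewards that stand in for rollout returns (the auxiliary arm's reward plays the role of a rollout under the heuristic policy).

```python
import math

AUX = "aux"  # hypothetical label for the auxiliary arm (not from the paper)

def make_node(actions):
    """Create visit counts and reward totals for the ordinary arms plus one auxiliary arm."""
    arms = list(actions) + [AUX]
    return {"counts": {a: 0 for a in arms}, "totals": {a: 0.0 for a in arms}}

def ucb1_select(node, c=math.sqrt(2)):
    """Pick an untried arm first, otherwise the arm with the highest UCB1 index."""
    counts, totals = node["counts"], node["totals"]
    for arm, n in counts.items():
        if n == 0:
            return arm
    log_n = math.log(sum(counts.values()))
    return max(
        counts,
        key=lambda a: totals[a] / counts[a] + c * math.sqrt(log_n / counts[a]),
    )

def update(node, arm, reward):
    """Record one simulation result for the chosen arm."""
    node["counts"][arm] += 1
    node["totals"][arm] += reward

def run_demo(iterations=300):
    # Assumed toy returns: ordinary arms use a default rollout,
    # the auxiliary arm always follows a (here optimal) heuristic policy.
    rewards = {"left": 0.2, "right": 0.5, AUX: 1.0}
    node = make_node(["left", "right"])
    for _ in range(iterations):
        arm = ucb1_select(node)
        update(node, arm, rewards[arm])
    return node

node = run_demo()
print(node["counts"])
```

    In this toy setting the heuristic is optimal at the node, so the auxiliary arm collects most of the visits; when the heuristic is poor, UCB1 would shift visits back to the ordinary arms.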

    Beyond Monte Carlo Tree Search: Playing Go with Deep Alternative Neural Network and Long-Term Evaluation

    Monte Carlo tree search (MCTS) is extremely popular in computer Go; it determines each action by running an enormous number of simulations in a broad and deep search tree. However, human experts select most actions by pattern analysis and careful evaluation rather than by brute-force search of millions of future interactions. In this paper, we propose a computer Go system that follows experts' way of thinking and playing. Our system consists of two parts. The first part is a novel deep alternative neural network (DANN) used to generate candidates for the next move. Compared with the existing deep convolutional neural network (DCNN), DANN inserts a recurrent layer after each convolutional layer and stacks them in an alternating manner. We show that this setting can preserve more context of local features and their evolution, which is beneficial for move prediction. The second part is a long-term evaluation (LTE) module used to provide a reliable evaluation of candidates rather than a single probability from the move predictor. This is consistent with how human experts play, since they can foresee tens of steps to give an accurate estimation of candidates. In our system, for each candidate, LTE calculates a cumulative reward after several future interactions, when local variations are settled. Combining criteria from the two parts, our system determines the optimal choice of next move. For more comprehensive experiments, we introduce a new professional Go dataset (PGD), consisting of 253233 professional records. Experiments on the GoGoD and PGD datasets show that DANN can substantially improve move prediction performance over a pure DCNN. When combined with LTE, our system outperforms most relevant approaches and open engines based on MCTS. Comment: AAAI 201

    Beyond the One Step Greedy Approach in Reinforcement Learning

    The famous Policy Iteration algorithm alternates between policy improvement and policy evaluation. Implementations of this algorithm with several variants of the latter evaluation stage, e.g., n-step and trace-based returns, have been analyzed in previous works. However, the case of multiple-step lookahead policy improvement, despite the recent increase in empirical evidence of its strength, has to our knowledge not been carefully analyzed yet. In this work, we introduce the first such analysis. Namely, we formulate variants of multiple-step policy improvement, derive new algorithms using these definitions and prove their convergence. Moreover, we show that recent prominent Reinforcement Learning algorithms are, in fact, instances of our framework. We thus shed light on their empirical success and give a recipe for deriving new algorithms for future study. Comment: ICML 201
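
    As an illustration of multiple-step lookahead policy improvement, the sketch below contrasts a 1-step and a 2-step greedy choice. It does not reproduce the paper's formal definitions: it assumes the usual notion of h-step lookahead (maximize the h-step return, then bootstrap from a value estimate `v`), and the deterministic three-state MDP `T`, the discount `gamma` and all function names are made up for the example.

```python
# Toy deterministic MDP (assumption): T[state][action] = (next_state, reward).
T = {
    0: {"stay": (0, 0.1), "go": (1, 0.0)},
    1: {"stay": (1, 0.0), "go": (2, 1.0)},
    2: {"stay": (2, 0.0)},
}

def backup(s, a, v, h, gamma):
    """Return of taking action a in state s, then acting optimally for h - 1
    more steps, and finally bootstrapping from the value estimate v."""
    nxt, r = T[s][a]
    if h == 1:
        return r + gamma * v[nxt]
    return r + gamma * max(backup(nxt, b, v, h - 1, gamma) for b in T[nxt])

def h_greedy_action(s, v, h, gamma=0.9):
    """Action chosen by the h-step greedy policy with respect to v."""
    return max(T[s], key=lambda a: backup(s, a, v, h, gamma))

v = {0: 0.0, 1: 0.0, 2: 0.0}  # uninformative value estimate
one_step = h_greedy_action(0, v, h=1)
two_step = h_greedy_action(0, v, h=2)
print(one_step, two_step)
```

    With the uninformative estimate, the 1-step greedy choice at state 0 takes the small immediate reward, while the 2-step lookahead sees the larger reward one move further on and chooses to move.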

    A Dynamical Systems Approach for Static Evaluation in Go

    This paper argues that the concept of static evaluation has the potential to be a useful extension to Monte Carlo tree search. A new concept of modeling static evaluation through a dynamical system is introduced, and its strengths and weaknesses are discussed. The general suitability of this approach is demonstrated. Comment: IEEE Transactions on Computational Intelligence and AI in Games, vol 3 (2011), no

    Adding expert knowledge and exploration in Monte-Carlo Tree Search

    We present a new exploration term that is more efficient than classical UCT-like exploration terms and efficiently combines expert rules, patterns extracted from datasets, All-Moves-As-First values and classical online values. As this improved bandit formula does not solve several important situations (semeais, nakade) in computer Go, we present three other important improvements which are central to the recent progress of our program MoGo: (i) we show an expert-based improvement of Monte-Carlo simulations for nakade situations, and we also emphasize some limitations of this modification; (ii) we show a technique which preserves diversity in the Monte-Carlo simulation and greatly improves the results in 19x19; (iii) whereas the UCB-based exploration term is not efficient in MoGo, we show a new exploration term which is highly efficient in MoGo. MoGo recently won a game with handicap 7 against a 9Dan Pro player, Zhou JunXun, winner of the LG Cup 2007, and a game with handicap 6 against a 1Dan pro player, Li-Chen Chien.

    Opponent modelling in the game of tron using reinforcement learning

    In this paper we propose the use of vision grids as state representation to learn to play the game Tron using neural networks and reinforcement learning. This approach speeds up learning by significantly reducing the number of unique states. Furthermore, we introduce a novel opponent modelling technique, which is used to predict the opponent’s next move. The learned model of the opponent is subsequently used in Monte-Carlo roll-outs, in which the game is simulated n steps ahead in order to determine the expected value of taking a certain action. Finally, we compare the performance of two different activation functions in the multi-layer perceptron, namely the sigmoid and the exponential linear unit (ELU). The results show that the ELU activation function outperforms the sigmoid activation function in most cases. Furthermore, vision grids significantly increase learning speed, and in most cases they also increase the agent’s performance compared to using the full grid as state representation. Finally, the opponent modelling technique allows the agent to learn a predictive model of the opponent’s actions, which in combination with Monte-Carlo roll-outs significantly increases the agent’s performance.
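
    The roll-out step can be sketched as follows. This is a minimal illustration rather than the authors' Tron agent: the toy dynamics in `step`, the fixed callable standing in for the learned opponent model, the random own-move policy after the first step, and all names and default parameters are assumptions made for the example.

```python
import random

def step(state, my_action, opp_action):
    """Toy dynamics (assumption): reward +1 if the moves differ, -1 if they collide.
    The state is only a step counter here."""
    reward = 1.0 if my_action != opp_action else -1.0
    return state + 1, reward

def rollout_value(state, first_action, opponent_model, actions, n_steps, n_rollouts, rng):
    """Average return of taking first_action and then simulating n_steps in total,
    with opponent moves predicted by opponent_model."""
    total = 0.0
    for _ in range(n_rollouts):
        s, ret, a = state, 0.0, first_action
        for _ in range(n_steps):
            s, r = step(s, a, opponent_model(s))
            ret += r
            a = rng.choice(actions)  # own moves after the first are random
        total += ret
    return total / n_rollouts

def best_action(state, opponent_model, actions, n_steps=3, n_rollouts=200, seed=0):
    """Pick the action with the highest estimated roll-out value."""
    rng = random.Random(seed)
    return max(
        actions,
        key=lambda a: rollout_value(state, a, opponent_model, actions, n_steps, n_rollouts, rng),
    )

# Stand-in opponent model that always predicts move 0.
chosen = best_action(0, lambda s: 0, [0, 1])
print(chosen)
```

    Because the stand-in model predicts that the opponent plays 0, the roll-outs favour the non-colliding move 1; a learned model would replace the lambda with a network's prediction.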

    Multiple-Step Greedy Policies in Online and Approximate Reinforcement Learning

    Multiple-step lookahead policies have demonstrated high empirical competence in Reinforcement Learning, via the use of Monte Carlo Tree Search or Model Predictive Control. In a recent work (Efroni et al., 2018), multiple-step greedy policies and their use in vanilla Policy Iteration algorithms were proposed and analyzed. In this work, we study multiple-step greedy algorithms in more practical setups. We begin by highlighting a counter-intuitive difficulty, arising with soft-policy updates: even in the absence of approximations, and contrary to the 1-step-greedy case, monotonic policy improvement is not guaranteed unless the update stepsize is sufficiently large. Taking particular care about this difficulty, we formulate and analyze online and approximate algorithms that use such a multi-step greedy operator. Comment: NIPS 201

    Monte Carlo Tree Search Applied to a Modified Pursuit/Evasion Scotland Yard Game with Rendezvous Spaceflight Operation Applications

    This thesis takes the Scotland Yard board game and modifies its rules to mimic important aspects of space in order to facilitate the creation of artificial intelligence for space asset pursuit/evasion scenarios. Space has become a physical warfighting domain. To combat threats, an understanding of the tactics, techniques, and procedures must be captured and studied. Games and simulations are effective tools to capture data lacking historical context. Artificial intelligence and machine learning models can use simulations to develop proper defensive and offensive tactics, techniques, and procedures capable of protecting systems against potential threats. Monte Carlo Tree Search is a bandit-based reinforcement learning model known for using limited domain knowledge to produce favorable results. Monte Carlo agents have been used in a multitude of imperfect domain knowledge games. One such game, in which Monte Carlo agents were produced and studied for pursuit-evasion tactics, is Scotland Yard. This thesis builds on the Monte Carlo agents previously produced by Mark Winands and Pim Nijssen and applied to Scotland Yard. In the research presented here, the rules for Scotland Yard are analyzed and presented in an expansion that partially accounts for spaceflight dynamics in order to study the agents within a simplified model, while having some foundation for use within space environments. Results show promise for the use of Monte-Carlo agents in pursuit/evasion autonomous space scenarios while also illuminating some major challenges for future work in more realistic three-dimensional space environments.

    Monte-Carlo tree search with heuristic knowledge: A novel way in solving capturing and life and death problems in Go

    Monte-Carlo (MC) tree search is a new research field. Its effectiveness in searching large state spaces, such as the Go game tree, is well recognized in the computer Go community. Go domain-specific heuristics and techniques as well as domain-independent heuristics and techniques are systematically investigated in the context of the MC tree search in this dissertation. The search extensions based on these heuristics and techniques can significantly improve the effectiveness and efficiency of the MC tree search. Two major areas of investigation are addressed in this dissertation research: I. The identification and use of the effective heuristic knowledge in guiding the MC simulations, II. The extension of the MC tree search algorithm with heuristics. Go, the most challenging board game to the machine, serves as the test bed. The effectiveness of the MC tree search extensions is demonstrated through the performances of Go tactic problem solvers using these techniques. The main contributions of this dissertation include: 1. A heuristics based Monte-Carlo tactic tree search framework is proposed to extend the standard Monte-Carlo tree search. 2. (Go) Knowledge based heuristics are systematically investigated to improve the Monte-Carlo tactic tree search. 3. Pattern learning is demonstrated as effective in improving the Monte-Carlo tactic tree search. 4. Domain knowledge independent tree search enhancements are shown as effective in improving the Monte-Carlo tactic tree search performances. 5. A strong Go Tactic solver based on proposed algorithms outperforms traditional game tree search algorithms. The techniques developed in this dissertation research can benefit other game domains and application fields.