Pgx: Hardware-accelerated Parallel Game Simulators for Reinforcement Learning
We propose Pgx, a suite of board game reinforcement learning (RL)
environments written in JAX and optimized for GPU/TPU accelerators. By
leveraging auto-vectorization and Just-In-Time (JIT) compilation of JAX, Pgx
can efficiently scale to thousands of parallel executions over accelerators. In
our experiments on a DGX-A100 workstation, we discovered that Pgx can simulate
RL environments 10-100x faster than existing Python RL libraries. Pgx includes
RL environments commonly used as benchmarks in RL research, such as backgammon,
chess, shogi, and Go. Additionally, Pgx offers miniature game sets and baseline
models to facilitate rapid research cycles. We demonstrate the efficient
training of the Gumbel AlphaZero algorithm with Pgx environments. Overall, Pgx
provides high-performance environment simulators for researchers to accelerate
their RL experiments. Pgx is available at https://github.com/sotetsuk/pgx.
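As a rough illustration of what this enables, the sketch below rolls out a batch of 1024 Go games under a uniform-random placeholder policy. It follows the usage pattern shown in the Pgx README; the exact environment ID and signatures are assumptions to verify against the library.

```python
import jax
import jax.numpy as jnp
import pgx

env = pgx.make("go_9x9")
init = jax.jit(jax.vmap(env.init))  # vectorize over a batch of RNG keys
step = jax.jit(jax.vmap(env.step))  # vectorize over a batch of states

batch_size = 1024
key, subkey = jax.random.split(jax.random.PRNGKey(42))
state = init(jax.random.split(subkey, batch_size))

while not (state.terminated | state.truncated).all():
    # Placeholder policy: sample uniformly among each game's legal moves.
    key, subkey = jax.random.split(key)
    logits = jnp.where(state.legal_action_mask, 0.0, -jnp.inf)
    action = jax.random.categorical(subkey, logits, axis=-1)
    state = step(state, action)
```

Because init and step are vmapped and JIT-compiled, each iteration advances all 1024 games in a single fused accelerator call, which is the scaling mechanism the abstract describes.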
Learning to Search in Reinforcement Learning
In this thesis, we investigate the use of search-based algorithms with deep neural
networks to tackle a wide range of problems ranging from board games to video
games and beyond. Drawing inspiration from AlphaGo, the first computer program
to achieve superhuman performance in the game of Go, we developed a new algorithm, AlphaZero: a general reinforcement learning algorithm that combines deep neural networks with Monte Carlo tree search for planning and
learning. Starting completely from scratch, without any prior human knowledge
beyond the basic rules of the game, AlphaZero managed to achieve superhuman
performance in Go, chess and shogi. Subsequently, building upon the success of AlphaZero, we investigated ways to extend our methods to problems in which the rules
are not known or cannot be hand-coded. This line of work led to the development
of MuZero, a model-based reinforcement learning agent that builds a deterministic
internal model of the world and uses it to construct plans in its imagination. We
applied our method to Go, chess, shogi, and the classic Atari suite of video games, achieving superhuman performance. MuZero is the first RL algorithm to master both canonical challenges for high-performance planning and visually complex problems using the same principles. Finally, we describe Stochastic MuZero, a general agent that extends the applicability of MuZero to highly stochastic environments. We show that our method achieves superhuman performance in stochastic domains such as backgammon and the classic game of 2048, while matching the performance of MuZero in deterministic ones like Go.
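For context, the search at the heart of these algorithms selects actions inside the tree with the PUCT rule, trading off the current value estimate against an exploration bonus weighted by the network's policy prior:

$$
a^{*} = \arg\max_{a}\left[\, Q(s,a) + c_{\mathrm{puct}}\, P(s,a)\, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \,\right]
$$

where $Q(s,a)$ is the mean value of simulations through $(s,a)$, $P(s,a)$ is the prior from the policy network, $N(s,a)$ is the visit count, and $c_{\mathrm{puct}}$ is an exploration constant.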
Agents Explore the Environment Beyond Good Actions to Improve Their Model for Better Decisions
Improving the decision-making capabilities of agents is a key challenge on
the road to artificial intelligence. To improve the planning skills needed to
make good decisions, MuZero's agent combines predictions from a network model with planning by a tree search that uses those predictions. MuZero's learning process can fail when predictions are poor yet planning depends on them. We use this as an
impetus to get the agent to explore parts of the decision tree in the
environment that it otherwise would not explore. The agent achieves this in three steps: first, it plans normally to come up with an improved policy; second, it randomly deviates from this policy at the beginning of each training episode; and third, it switches back to the improved policy at a random time step to experience the environment rewards associated with the improved policy, which is the basis for learning the correct value expectation. The simple board game
Tic-Tac-Toe is used to illustrate how this approach can improve the agent's
decision-making ability. The source code, written entirely in Java, is
available at https://github.com/enpasos/muzero.
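A minimal sketch of this three-step scheme (the authors' implementation is in Java; this Python rendering, and every name in it, is an illustrative assumption rather than their API):

```python
import random

def play_training_episode(env, improved_policy, max_deviation_steps=5):
    """Deviate randomly early in the episode, then follow the planned policy."""
    state = env.reset()
    # Step 2: pick a random time step at which to stop deviating.
    switch_step = random.randint(0, max_deviation_steps)
    trajectory, t, done = [], 0, False
    while not done:
        if t < switch_step:
            # Random deviation visits parts of the tree planning alone would miss.
            action = random.choice(env.legal_actions(state))
        else:
            # Step 3: back on the improved policy, so the observed rewards
            # ground the value targets for the states just visited.
            action = improved_policy(state)
        next_state, reward, done = env.step(state, action)
        trajectory.append((state, action, reward))
        state, t = next_state, t + 1
    return trajectory
```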
What model does MuZero learn?
Model-based reinforcement learning has drawn considerable interest in recent
years, given its promise to improve sample efficiency. Moreover, when using
deep-learned models, it is potentially possible to learn compact models from
complex sensor data. However, the effectiveness of these learned models,
particularly their capacity to plan, i.e., to improve the current policy,
remains unclear. In this work, we study MuZero, a well-known deep model-based
reinforcement learning algorithm, and explore how far it achieves its learning
objective of a value-equivalent model and how useful the learned models are for
policy improvement. Amongst various other insights, we conclude that the model
learned by MuZero cannot effectively generalize to evaluate unseen policies,
which limits the extent to which we can additionally improve the current policy
by planning with the model.
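For reference, the value-equivalence objective the paper examines (due to Grimm et al.) asks only that the learned model $\tilde{m}$ match the true environment $m$ on the Bellman backups of a chosen set of policies $\Pi$ and value functions $\mathcal{V}$:

$$
\mathcal{T}^{\tilde{m}}_{\pi} v = \mathcal{T}^{m}_{\pi} v \quad \text{for all } \pi \in \Pi,\ v \in \mathcal{V},
\qquad
(\mathcal{T}^{m}_{\pi} v)(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim m(\cdot \mid s,a)}[v(s')] \big].
$$

The finding that MuZero's model cannot evaluate unseen policies amounts to this equality holding only near the policies encountered during training.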
MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games
This paper presents MiniZero, a zero-knowledge learning framework that
supports four state-of-the-art algorithms: AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero. While these algorithms have demonstrated
super-human performance in many games, it remains unclear which among them is
most suitable or efficient for specific tasks. Through MiniZero, we
systematically evaluate the performance of each algorithm in two board games,
9x9 Go and 8x8 Othello, as well as 57 Atari games. For the two board games, using more simulations generally results in higher performance. However, the choice between AlphaZero and MuZero may differ based on game properties. For Atari games,
both MuZero and Gumbel MuZero are worth considering. Since each game has unique
characteristics, different algorithms and simulations yield varying results. In
addition, we introduce an approach, called progressive simulation, which
progressively increases the simulation budget during training to allocate
computation more efficiently. Our empirical results demonstrate that
progressive simulation achieves significantly superior performance in the two board
games. By making our framework and trained models publicly available, this
paper contributes a benchmark for future research on zero-knowledge learning
algorithms, assisting researchers in algorithm selection and comparison against
these zero-knowledge learning baselines. Our code and data are available at
https://rlg.iis.sinica.edu.tw/papers/minizero.
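The abstract does not specify the progressive-simulation schedule; a minimal sketch, assuming a simple linear growth of the per-move MCTS budget, might look like:

```python
def simulation_budget(step: int, total_steps: int,
                      min_sims: int = 16, max_sims: int = 400) -> int:
    """Linearly grow the per-move MCTS simulation count over training."""
    frac = min(step / total_steps, 1.0)
    return int(min_sims + frac * (max_sims - min_sims))

# Early training uses cheap searches, late training spends more per move:
# simulation_budget(0, 10_000) == 16, simulation_budget(10_000, 10_000) == 400.
```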
Continuous Monte Carlo Graph Search
In many complex sequential decision-making tasks, online planning is crucial for high performance. For efficient online planning, Monte Carlo Tree Search
(MCTS) employs a principled mechanism for trading off between exploration and
exploitation. MCTS outperforms comparison methods in various discrete decision-making domains such as Go, chess, and shogi. Subsequently, extensions of MCTS to continuous domains have been proposed. However, the inherently high branching factor and the resulting explosion of the search tree size limit existing methods. To solve this problem, this paper proposes Continuous Monte Carlo
Graph Search (CMCGS), a novel extension of MCTS to online planning in
environments with continuous state and action spaces. CMCGS takes advantage of
the insight that, during planning, sharing the same action policy between
several states can yield high performance. To implement this idea, at each time
step CMCGS clusters similar states into a limited number of stochastic action
bandit nodes, which produce a layered graph instead of an MCTS search tree.
Experimental evaluation with limited sample budgets shows that CMCGS
outperforms comparison methods in several complex continuous DeepMind Control
Suite benchmarks and a 2D navigation task.
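A rough sketch of the layering idea, assuming k-means for the state clustering and Gaussian bandits for the shared action policies (both are illustrative choices, not necessarily the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_layer(states: np.ndarray, n_clusters: int, action_dim: int):
    """Cluster states reached at one planning depth; one action bandit per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(states)
    # Every state in a cluster shares the same stochastic action policy,
    # so the search grows a layered graph rather than a per-state tree.
    bandits = [{"mean": np.zeros(action_dim), "std": np.ones(action_dim)}
               for _ in range(n_clusters)]
    return labels, bandits

def sample_action(bandit: dict, rng: np.random.Generator) -> np.ndarray:
    # Draw a continuous action from the cluster's shared bandit.
    return rng.normal(bandit["mean"], bandit["std"])
```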
Machine-learning Based Automatic Formulation of Query Sequences to Improve Search
People use search engines to look up information on the Internet, using search queries related to their information needs. This disclosure describes the use of machine learning techniques, including supervised learning and reinforcement learning, to train a search agent to search deeper for more accurate, better-supported answers by interacting with the search engine. The interaction mimics strategies utilized by human experts to carry out accurate web search. The search agent can be modular; to provide answers to a user query, it performs operations such as formulating new queries in a sequence, analyzing intermediate results, and selecting results based on a chosen success metric that can take into account factors such as accuracy, diversity, and the presence of justification.
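A minimal sketch of the loop this describes; the search_engine, reformulate, and score callables are stand-ins for the learned components, not a real API:

```python
def deep_search(user_query, search_engine, reformulate, score, max_rounds=5):
    """Iteratively reformulate queries, keeping the best-scoring result set."""
    query = user_query
    best_results, best_score = None, float("-inf")
    for _ in range(max_rounds):
        results = search_engine(query)       # issue the current query
        s = score(results)                   # success metric: accuracy, diversity, ...
        if s > best_score:
            best_results, best_score = results, s
        query = reformulate(query, results)  # learned query-reformulation step
    return best_results
```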