Pgx: Hardware-accelerated Parallel Game Simulators for Reinforcement Learning
We propose Pgx, a suite of board game reinforcement learning (RL)
environments written in JAX and optimized for GPU/TPU accelerators. By
leveraging auto-vectorization and Just-In-Time (JIT) compilation of JAX, Pgx
can efficiently scale to thousands of parallel executions over accelerators. In
our experiments on a DGX-A100 workstation, we discovered that Pgx can simulate
RL environments 10-100x faster than existing Python RL libraries. Pgx includes
RL environments commonly used as benchmarks in RL research, such as backgammon,
chess, shogi, and Go. Additionally, Pgx offers miniature game sets and baseline
models to facilitate rapid research cycles. We demonstrate the efficient
training of the Gumbel AlphaZero algorithm with Pgx environments. Overall, Pgx
provides high-performance environment simulators for researchers to accelerate
their RL experiments. Pgx is available at https://github.com/sotetsuk/pgx.
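As a rough illustration of what this enables, the sketch below rolls out a batch of 1024 Go games under a uniform-random placeholder policy. It follows the usage pattern shown in the Pgx README; the exact environment ID and signatures are assumptions to verify against the library.

```python
import jax
import jax.numpy as jnp
import pgx

env = pgx.make("go_9x9")
init = jax.jit(jax.vmap(env.init))  # vectorize over a batch of RNG keys
step = jax.jit(jax.vmap(env.step))  # vectorize over a batch of states

batch_size = 1024
key, subkey = jax.random.split(jax.random.PRNGKey(42))
state = init(jax.random.split(subkey, batch_size))

while not (state.terminated | state.truncated).all():
    # Placeholder policy: sample uniformly among each game's legal moves.
    key, subkey = jax.random.split(key)
    logits = jnp.where(state.legal_action_mask, 0.0, -jnp.inf)
    action = jax.random.categorical(subkey, logits, axis=-1)
    state = step(state, action)
```

Because init and step are vmapped and JIT-compiled, each iteration advances all 1024 games in a single fused accelerator call, which is the scaling mechanism the abstract describes.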
Learning to Search in Reinforcement Learning
In this thesis, we investigate the use of search-based algorithms with deep neural
networks to tackle a wide range of problems ranging from board games to video
games and beyond. Drawing inspiration from AlphaGo, the first computer program
to achieve superhuman performance in the game of Go, we developed a new algorithm, AlphaZero: a general reinforcement learning algorithm that combines deep neural networks with Monte Carlo tree search for planning and
learning. Starting completely from scratch, without any prior human knowledge
beyond the basic rules of the game, AlphaZero managed to achieve superhuman
performance in Go, chess and shogi. Subsequently, building upon the success of AlphaZero, we investigated ways to extend our methods to problems in which the rules
are not known or cannot be hand-coded. This line of work led to the development
of MuZero, a model-based reinforcement learning agent that builds a deterministic
internal model of the world and uses it to construct plans in its imagination. We
applied our method to Go, chess, shogi, and the classic Atari suite of video games, achieving superhuman performance. MuZero is the first RL algorithm to master both canonical challenges for high-performance planning and visually complex problems using the same principles. Finally, we describe Stochastic MuZero, a general agent that extends the applicability of MuZero to highly stochastic environments. We show that our method achieves superhuman performance in stochastic domains such as backgammon and the classic game of 2048, while matching the performance of MuZero in deterministic ones like Go.
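For context, the search at the heart of these algorithms selects actions inside the tree with the PUCT rule, trading off the current value estimate against an exploration bonus weighted by the network's policy prior:

$$
a^{*} = \arg\max_{a}\left[\, Q(s,a) + c_{\mathrm{puct}}\, P(s,a)\, \frac{\sqrt{\sum_{b} N(s,b)}}{1 + N(s,a)} \,\right]
$$

where $Q(s,a)$ is the mean value of simulations through $(s,a)$, $P(s,a)$ is the prior from the policy network, $N(s,a)$ is the visit count, and $c_{\mathrm{puct}}$ is an exploration constant.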
Agents Explore the Environment Beyond Good Actions to Improve Their Model for Better Decisions
Improving the decision-making capabilities of agents is a key challenge on
the road to artificial intelligence. To improve the planning skills needed to
make good decisions, MuZero's agent combines predictions from a network model with planning by a tree search that uses those predictions. MuZero's learning process can fail when predictions are poor yet planning depends on them. We use this as an
impetus to get the agent to explore parts of the decision tree in the
environment that it otherwise would not explore. The agent achieves this in three steps: first, it plans normally to come up with an improved policy; second, it randomly deviates from this policy at the beginning of each training episode; and third, it switches back to the improved policy at a random time step to experience the environment rewards associated with the improved policy, which is the basis for learning the correct value expectation. The simple board game
Tic-Tac-Toe is used to illustrate how this approach can improve the agent's
decision-making ability. The source code, written entirely in Java, is
available at https://github.com/enpasos/muzero.
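A minimal sketch of this three-step scheme (the authors' implementation is in Java; this Python rendering, and every name in it, is an illustrative assumption rather than their API):

```python
import random

def play_training_episode(env, improved_policy, max_deviation_steps=5):
    """Deviate randomly early in the episode, then follow the planned policy."""
    state = env.reset()
    # Step 2: pick a random time step at which to stop deviating.
    switch_step = random.randint(0, max_deviation_steps)
    trajectory, t, done = [], 0, False
    while not done:
        if t < switch_step:
            # Random deviation visits parts of the tree planning alone would miss.
            action = random.choice(env.legal_actions(state))
        else:
            # Step 3: back on the improved policy, so the observed rewards
            # ground the value targets for the states just visited.
            action = improved_policy(state)
        next_state, reward, done = env.step(state, action)
        trajectory.append((state, action, reward))
        state, t = next_state, t + 1
    return trajectory
```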
What model does MuZero learn?
Model-based reinforcement learning has drawn considerable interest in recent
years, given its promise to improve sample efficiency. Moreover, when using
deep-learned models, it is potentially possible to learn compact models from
complex sensor data. However, the effectiveness of these learned models,
particularly their capacity to plan, i.e., to improve the current policy,
remains unclear. In this work, we study MuZero, a well-known deep model-based
reinforcement learning algorithm, and explore how far it achieves its learning
objective of a value-equivalent model and how useful the learned models are for
policy improvement. Amongst various other insights, we conclude that the model
learned by MuZero cannot effectively generalize to evaluate unseen policies,
which limits the extent to which we can additionally improve the current policy
by planning with the model.
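For reference, the value-equivalence objective the paper examines (due to Grimm et al.) asks only that the learned model $\tilde{m}$ match the true environment $m$ on the Bellman backups of a chosen set of policies $\Pi$ and value functions $\mathcal{V}$:

$$
\mathcal{T}^{\tilde{m}}_{\pi} v = \mathcal{T}^{m}_{\pi} v \quad \text{for all } \pi \in \Pi,\ v \in \mathcal{V},
\qquad
(\mathcal{T}^{m}_{\pi} v)(s) = \mathbb{E}_{a \sim \pi(\cdot \mid s)}\!\big[ r(s,a) + \gamma\, \mathbb{E}_{s' \sim m(\cdot \mid s,a)}[v(s')] \big].
$$

The finding that MuZero's model cannot evaluate unseen policies amounts to this equality holding only near the policies encountered during training.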
MiniZero: Comparative Analysis of AlphaZero and MuZero on Go, Othello, and Atari Games
This paper presents MiniZero, a zero-knowledge learning framework that
supports four state-of-the-art algorithms: AlphaZero, MuZero, Gumbel AlphaZero, and Gumbel MuZero. While these algorithms have demonstrated
super-human performance in many games, it remains unclear which among them is
most suitable or efficient for specific tasks. Through MiniZero, we
systematically evaluate the performance of each algorithm in two board games,
9x9 Go and 8x8 Othello, as well as 57 Atari games. For the two board games, using more simulations generally results in higher performance. However, the choice between AlphaZero and MuZero may differ based on game properties. For Atari games,
both MuZero and Gumbel MuZero are worth considering. Since each game has unique
characteristics, different algorithms and simulations yield varying results. In
addition, we introduce an approach, called progressive simulation, which
progressively increases the simulation budget during training to allocate
computation more efficiently. Our empirical results demonstrate that
progressive simulation achieves significantly superior performance in the two board
games. By making our framework and trained models publicly available, this
paper contributes a benchmark for future research on zero-knowledge learning
algorithms, assisting researchers in algorithm selection and comparison against
these zero-knowledge learning baselines. Our code and data are available at
https://rlg.iis.sinica.edu.tw/papers/minizero.
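The abstract does not specify the progressive-simulation schedule; a minimal sketch, assuming a simple linear growth of the per-move MCTS budget, might look like:

```python
def simulation_budget(step: int, total_steps: int,
                      min_sims: int = 16, max_sims: int = 400) -> int:
    """Linearly grow the per-move MCTS simulation count over training."""
    frac = min(step / total_steps, 1.0)
    return int(min_sims + frac * (max_sims - min_sims))

# Early training uses cheap searches, late training spends more per move:
# simulation_budget(0, 10_000) == 16, simulation_budget(10_000, 10_000) == 400.
```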
Continuous Monte Carlo Graph Search
In many complex sequential decision-making tasks, online planning is crucial for high performance. For efficient online planning, Monte Carlo Tree Search
(MCTS) employs a principled mechanism for trading off between exploration and
exploitation. MCTS outperforms comparison methods in various discrete decision-making domains such as Go, chess, and shogi. Subsequently, extensions of MCTS to continuous domains have been proposed. However, the inherently high branching factor and the resulting explosion of the search tree size limit existing methods. To solve this problem, this paper proposes Continuous Monte Carlo
Graph Search (CMCGS), a novel extension of MCTS to online planning in
environments with continuous state and action spaces. CMCGS takes advantage of
the insight that, during planning, sharing the same action policy between
several states can yield high performance. To implement this idea, at each time
step CMCGS clusters similar states into a limited number of stochastic action
bandit nodes, which produce a layered graph instead of an MCTS search tree.
Experimental evaluation with limited sample budgets shows that CMCGS
outperforms comparison methods in several complex continuous DeepMind Control
Suite benchmarks and a 2D navigation task.
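A rough sketch of the layering idea, assuming k-means for the state clustering and Gaussian bandits for the shared action policies (both are illustrative choices, not necessarily the paper's):

```python
import numpy as np
from sklearn.cluster import KMeans

def build_layer(states: np.ndarray, n_clusters: int, action_dim: int):
    """Cluster states reached at one planning depth; one action bandit per cluster."""
    labels = KMeans(n_clusters=n_clusters, n_init=10).fit_predict(states)
    # Every state in a cluster shares the same stochastic action policy,
    # so the search grows a layered graph rather than a per-state tree.
    bandits = [{"mean": np.zeros(action_dim), "std": np.ones(action_dim)}
               for _ in range(n_clusters)]
    return labels, bandits

def sample_action(bandit: dict, rng: np.random.Generator) -> np.ndarray:
    # Draw a continuous action from the cluster's shared bandit.
    return rng.normal(bandit["mean"], bandit["std"])
```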
Machine-learning Based Automatic Formulation of Query Sequences to Improve Search
People use search engines to look up information on the Internet, using search queries related to their information needs. This disclosure describes the use of machine learning techniques, including supervised learning and reinforcement learning, to train a search agent to search deeper for more accurate, better-supported answers by interacting with the search engine. The interaction mimics strategies utilized by human experts to carry out accurate web search. The search agent can be modular; to provide answers to a user query, it performs operations such as formulating new queries in a sequence, analyzing intermediate results, and selecting results based on a chosen success metric that can take into account factors such as accuracy, diversity, and the presence of justification.
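A minimal sketch of the loop this describes; the search_engine, reformulate, and score callables are stand-ins for the learned components, not a real API:

```python
def deep_search(user_query, search_engine, reformulate, score, max_rounds=5):
    """Iteratively reformulate queries, keeping the best-scoring result set."""
    query = user_query
    best_results, best_score = None, float("-inf")
    for _ in range(max_rounds):
        results = search_engine(query)       # issue the current query
        s = score(results)                   # success metric: accuracy, diversity, ...
        if s > best_score:
            best_results, best_score = results, s
        query = reformulate(query, results)  # learned query-reformulation step
    return best_results
```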