A Survey of Monte Carlo Tree Search Methods
Monte Carlo tree search (MCTS) is a recently proposed search method that combines the precision of tree search with the generality of random sampling. It has received considerable interest due to its spectacular success in the difficult problem of computer Go, but has also proved beneficial in a range of other domains. This paper is a survey of the literature to date, intended to provide a snapshot of the state of the art after the first five years of MCTS research. We outline the core algorithm's derivation, impart some structure on the many variations and enhancements that have been proposed, and summarize the results from the key game and non-game domains to which MCTS methods have been applied. A number of open research questions indicate that the field is ripe for future work.
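To make the surveyed algorithm concrete, the following is a minimal sketch of the four-phase loop it describes (selection, expansion, simulation, backpropagation), with UCB1 selection. The Game interface, the constants, and the single-perspective reward are illustrative assumptions, not anything prescribed by the survey; in a two-player setting the reward would be negated at alternating plies.

    import math, random

    class Node:
        def __init__(self, state, parent=None):
            self.state, self.parent = state, parent
            self.children = {}                 # move -> Node
            self.visits, self.value = 0, 0.0   # value holds summed rewards

    def ucb1(child, parent_visits, c=1.414):
        # Unvisited children are tried first; otherwise balance mean reward
        # (exploitation) against the visit-count bonus (exploration).
        if child.visits == 0:
            return float("inf")
        return (child.value / child.visits
                + c * math.sqrt(math.log(parent_visits) / child.visits))

    def mcts(root_state, game, iterations=1000):
        root = Node(root_state)
        for _ in range(iterations):
            node = root
            # 1. Selection: descend through fully expanded nodes via UCB1.
            while node.children and len(node.children) == len(game.legal_moves(node.state)):
                node = max(node.children.values(),
                           key=lambda ch: ucb1(ch, node.visits))
            # 2. Expansion: add one untried move as a new leaf.
            untried = [m for m in game.legal_moves(node.state)
                       if m not in node.children]
            if untried and not game.is_terminal(node.state):
                move = random.choice(untried)
                node.children[move] = Node(game.apply(node.state, move), node)
                node = node.children[move]
            # 3. Simulation: uniformly random playout to a terminal state.
            state = node.state
            while not game.is_terminal(state):
                state = game.apply(state, random.choice(game.legal_moves(state)))
            reward = game.reward(state)   # single perspective, for brevity
            # 4. Backpropagation: update statistics along the selected path.
            while node is not None:
                node.visits += 1
                node.value += reward
                node = node.parent
        # Recommend the most-visited root move.
        return max(root.children, key=lambda m: root.children[m].visits)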
On Monte Carlo Tree Search and Reinforcement Learning
Fuelled by successes in Computer Go, Monte Carlo tree search (MCTS) has achieved widespread adoption within the games community. Its links to traditional reinforcement learning (RL) methods have been outlined in the past; however, the use of RL techniques within tree search has not been thoroughly studied yet. In this paper we re-examine in depth this close relation between the two fields; our goal is to improve the cross-awareness between the two communities. We show that a straightforward adaptation of RL semantics within tree search can lead to a wealth of new algorithms, for which the traditional MCTS is only one of the variants. We confirm that planning methods inspired by RL in conjunction with online search demonstrate encouraging results on several classic board games and in arcade video game competitions, where our algorithm recently ranked first. Our study promotes a unified view of learning, planning, and search.
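As an illustration of the kind of adaptation the paper argues for, the sketch below contrasts the standard Monte Carlo backup used in MCTS (averaging playout returns along the path) with a TD(lambda)-style backup that bootstraps on successor estimates. The node fields, step size, and lambda blend are assumptions made for illustration, not the paper's exact algorithms.

    def backup_monte_carlo(path, playout_return):
        # Classic MCTS backup: each node on the selected path keeps a
        # running average of the returns of playouts that passed through it.
        for node in path:
            node.visits += 1
            node.value += (playout_return - node.value) / node.visits

    def backup_td_lambda(path, playout_return, alpha=0.1, lam=0.9):
        # RL-flavoured backup: node.value is updated with a constant step
        # size, and each node's target blends the observed return with
        # bootstrapped value estimates from deeper in the path.
        target = playout_return
        for node in reversed(path):
            node.visits += 1
            node.value += alpha * (target - node.value)
            target = lam * target + (1 - lam) * node.value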
Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates
In recent years, state-of-the-art game-playing agents often involve policies that are trained in self-play processes where Monte Carlo tree search (MCTS) algorithms and trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are also likely to exhibit a similar extent of exploration. In this paper, we are interested in learning policies for a project with future goals including the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that such an extent of exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value estimates rather than MCTS visit counts. We empirically evaluate various properties of the resulting policies in a variety of board games.
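The general shape of such an update can be sketched as follows: for a softmax policy over one state, ascend the gradient of the expected MCTS value estimate under the policy, so probability mass moves toward actions MCTS valued highly rather than toward actions it visited often. The objective below is a plausible reading of the abstract, and the names and learning rate are illustrative; the paper derives its own exact expression.

    import numpy as np

    def policy_gradient_step(logits, q_hat, lr=0.01):
        # logits: (A,) unnormalised action preferences for one state.
        # q_hat:  (A,) MCTS value estimates, one per legal action.
        pi = np.exp(logits - logits.max())
        pi /= pi.sum()
        # Objective J = sum_a pi(a) * q_hat(a): the policy's expected value.
        # For a softmax, dJ/dlogit_a = pi(a) * (q_hat(a) - E_pi[q_hat]).
        baseline = pi @ q_hat
        grad = pi * (q_hat - baseline)
        return logits + lr * grad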
Distributed Nested Rollout Policy for SameGame
Nested Rollout Policy Adaptation (NRPA) is a Monte Carlo search heuristic for puzzles and other optimization problems. It achieves state-of-the-art performance on several games including SameGame. In this paper, we design several parallel and distributed NRPA-based search techniques, and we provide a number of experimental insights about their execution. Finally, we use our best implementation to discover 15 better scores for 20 standard SameGame boards.
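For context, the sketch below is single-threaded NRPA, the base heuristic that the paper parallelises and distributes: each level repeatedly runs the level below and adapts a softmax playout policy toward the best move sequence found so far. The Puzzle interface (legal_moves, apply, is_terminal, score, and a code function mapping a state-move pair to a policy key) and the hyperparameters are assumed placeholders.

    import math, random

    def playout(state, policy, puzzle):
        # Sample one complete sequence under the softmax of the policy weights.
        sequence = []
        while not puzzle.is_terminal(state):
            moves = puzzle.legal_moves(state)
            weights = [math.exp(policy.get(puzzle.code(state, m), 0.0)) for m in moves]
            move = random.choices(moves, weights=weights)[0]
            sequence.append((state, move))
            state = puzzle.apply(state, move)
        return puzzle.score(state), sequence

    def adapt(policy, sequence, puzzle, alpha=1.0):
        # Shift weight toward the best sequence's moves (gradient step on
        # the log-likelihood of replaying that sequence).
        policy = dict(policy)
        for state, best_move in sequence:
            moves = puzzle.legal_moves(state)
            z = sum(math.exp(policy.get(puzzle.code(state, m), 0.0)) for m in moves)
            for m in moves:
                key = puzzle.code(state, m)
                p = math.exp(policy.get(key, 0.0)) / z
                policy[key] = policy.get(key, 0.0) + alpha * ((m == best_move) - p)
        return policy

    def nrpa(level, policy, root_state, puzzle, iterations=100):
        if level == 0:
            return playout(root_state, policy, puzzle)
        best_score, best_seq = float("-inf"), None
        for _ in range(iterations):
            score, seq = nrpa(level - 1, dict(policy), root_state, puzzle, iterations)
            if score > best_score:
                best_score, best_seq = score, seq
            policy = adapt(policy, best_seq, puzzle)
        return best_score, best_seq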
Switchable Lightweight Anti-symmetric Processing (SLAP) with CNN Outspeeds Data Augmentation by Smaller Sample -- Application in Gomoku Reinforcement Learning
As a replacement for data augmentation, this paper proposes a method called SLAP to intensify experience, speeding up machine learning and reducing the required sample size. SLAP is a model-independent protocol/function that produces the same output given different transformation variants. SLAP improved the convergence speed of convolutional neural network learning by 83% in experiments with Gomoku game states, with only one eighth of the sample size used by data augmentation. In reinforcement learning for Gomoku, using the AlphaGo Zero/AlphaZero algorithm with data augmentation as the baseline, SLAP reduced the number of training samples by a factor of 8 and achieved a similar winning rate against the same evaluator, but it was not yet evident that it could speed up reinforcement learning. The benefits should at least apply to domains that are invariant to symmetry or certain transformations. As future work, SLAP may aid more explainable learning and transfer learning for domains that are not invariant to symmetry, as a small step towards artificial general intelligence.
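One plausible way to realise a function that produces the same output given different transformation variants of a Gomoku board is to canonicalise each state over its eight rotations and reflections before it reaches the network, so all symmetric variants collapse to a single training input instead of eight augmented copies. The lexicographic-minimum rule below is an illustrative assumption, not necessarily the paper's construction, and policy targets would need to be mapped through the same transform.

    import numpy as np

    def dihedral_variants(board):
        # All eight transforms of a square board: 4 rotations x optional flip.
        for k in range(4):
            rotated = np.rot90(board, k)
            yield rotated
            yield np.fliplr(rotated)

    def canonicalize(board):
        # Pick the lexicographically smallest byte encoding, so every
        # symmetric variant of a position maps to the same representative.
        return min(dihedral_variants(board), key=lambda b: b.tobytes())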
The effect of simulation bias on action selection in Monte Carlo Tree Search
A dissertation submitted to the Faculty of Science, University of the Witwatersrand, in fulfilment of the requirements for the degree of Master of Science. August 2016.
Monte Carlo Tree Search (MCTS) is a family of directed search algorithms that has gained widespread attention in recent years. It combines a traditional tree-search approach with Monte Carlo simulations, using the outcome of these simulations (also known as playouts or rollouts) to evaluate states in a look-ahead tree. Because MCTS does not require an evaluation function, it is particularly well-suited to the game of Go, seen by many as chess's successor as a grand challenge of artificial intelligence, with MCTS-based agents recently able to achieve expert-level play on 19×19 boards. Furthermore, its domain-independent nature also makes it a focus in a variety of other fields, such as Bayesian reinforcement learning and general game playing.
Despite the vast amount of research into MCTS, the dynamics of the algorithm are not yet fully understood. In particular, the effect of using knowledge-heavy or biased simulations in MCTS remains unknown, with interesting results indicating that better-informed rollouts do not necessarily result in stronger agents. This research provides support for the notion that MCTS is well-suited to a class of domains possessing a smoothness property. In these domains, biased rollouts are more likely to produce strong agents. Conversely, any error due to incorrect bias is compounded in non-smooth domains, particularly for low-variance simulations. This is demonstrated empirically in a number of single-agent domains.
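As a concrete illustration of the kind of simulation bias under study, the sketch below contrasts a uniform random rollout policy with an epsilon-greedy one that usually follows a domain heuristic; the game interface and heuristic are placeholders. Lowering epsilon gives the low-variance, strongly biased simulations whose errors, per the thesis, compound in non-smooth domains.

    import random

    def uniform_rollout_move(state, game):
        # Unbiased baseline: every legal move is equally likely.
        return random.choice(game.legal_moves(state))

    def biased_rollout_move(state, game, heuristic, epsilon=0.1):
        # Knowledge-heavy rollout: with probability 1 - epsilon follow the
        # heuristic; small epsilon means low variance but compounded error
        # wherever the heuristic's bias is wrong.
        moves = game.legal_moves(state)
        if random.random() < epsilon:
            return random.choice(moves)
        return max(moves, key=lambda m: heuristic(game.apply(state, m)))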