Learning Policies from Self-Play with Policy Gradients and MCTS Value Estimates
In recent years, state-of-the-art game-playing agents have often involved policies trained through self-play, in which Monte Carlo tree search (MCTS) and the trained policies iteratively improve each other. The strongest results have been obtained when policies are trained to mimic the search behaviour of MCTS by minimising a cross-entropy loss. Because MCTS, by design, includes an element of exploration, policies trained in this manner are likely to exhibit a similar extent of exploration. In this paper, we are interested in learning policies for a project whose future goals include the extraction of interpretable strategies, rather than state-of-the-art game-playing performance. For these goals, we argue that such exploration is undesirable, and we propose a novel objective function for training policies that are not exploratory. We derive a policy gradient expression for maximising this objective function, which can be estimated using MCTS value estimates rather than MCTS visit counts. We empirically evaluate various properties of the resulting policies in a variety of board games.
Comment: Accepted at the IEEE Conference on Games (CoG) 2019.
On Monte-Carlo tree search for deterministic games with alternate moves and complete information
We consider a deterministic game with alternate moves and complete information, whose outcome is always the victory of one of the two opponents. We assume that this game is the realization of a random model enjoying some independence properties. We consider algorithms in the spirit of Monte-Carlo Tree Search to estimate, as well as possible, the minimax value of a given position: the approach consists in successively simulating well-chosen matches starting from this position. We build an algorithm that is step-by-step optimal in a certain sense: once the first n matches have been simulated, the algorithm decides, from the statistics furnished by those matches (and the a priori we have on the game), how to simulate the (n+1)-th match so that the gain of information concerning the minimax value of the position under study is maximal. This algorithm is remarkably quick. We prove that our step-by-step optimal algorithm is not globally optimal and that it always converges in a finite number of steps, even if the a priori we have on the game is completely irrelevant. We finally test our algorithm against MCTS on Pearl's game and, with a very simple and universal a priori, on Connect Four and some variants. The numerical results are rather disappointing. We do, however, exhibit some situations in which our algorithm seems efficient.
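As a toy illustration of the quantity being estimated (not of the authors' step-by-step optimal algorithm), the sketch below builds a Pearl-style random win/loss tree, computes its exact minimax value, and contrasts it with the naive estimate obtained from uniformly random matches; the gap between the two is what smarter simulation schemes such as MCTS, or an information-maximising choice of the next match, try to close. All names and the toy setup are our own assumptions.

# Toy sketch, not the paper's algorithm.
import random

def build_leaves(depth, p_win=0.5):
    # Random win/loss leaves of a complete binary tree (Pearl-style toy).
    return [1 if random.random() < p_win else -1 for _ in range(2 ** depth)]

def minimax(leaves, lo, hi, max_player):
    # Exact minimax value over leaves[lo:hi] of the complete binary tree,
    # with the two players moving alternately.
    if hi - lo == 1:
        return leaves[lo]
    mid = (lo + hi) // 2
    left = minimax(leaves, lo, mid, not max_player)
    right = minimax(leaves, mid, hi, not max_player)
    return max(left, right) if max_player else min(left, right)

def random_match_estimate(leaves, n_matches=1000):
    # Naive estimate: average outcome of uniformly random matches, i.e.
    # uniformly random leaves of the complete tree. This is generally a
    # poor estimate of the minimax value, which motivates choosing the
    # next match more carefully.
    return sum(random.choice(leaves) for _ in range(n_matches)) / n_matches

leaves = build_leaves(depth=6)
print("exact minimax:", minimax(leaves, 0, len(leaves), True))
print("random-match estimate:", random_match_estimate(leaves))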
Traditional Wisdom and Monte Carlo Tree Search Face-to-Face in the Card Game Scopone
We present the design of a competitive artificial intelligence for Scopone, a popular Italian card game. We compare rule-based players using the most established strategies (one for beginners and two for advanced players) against players using Monte Carlo Tree Search (MCTS) and Information Set Monte Carlo Tree Search (ISMCTS) with different reward functions and simulation strategies. MCTS requires complete information about the game state and thus implements a cheating player, while ISMCTS can deal with incomplete information and thus implements a fair player. Our results show that, as expected, the cheating MCTS outperforms all the other strategies; ISMCTS is stronger than all the rule-based players, including those implementing the most advanced well-known strategies, and it also turns out to be a challenging opponent for human players.
Comment: Preprint. Accepted for publication in the IEEE Transactions on Games.
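To illustrate the cheating-versus-fair distinction drawn above, here is a heavily simplified sketch of the determinization idea behind handling hidden information: sample hypothetical deals of the cards we cannot see and evaluate each candidate play across those samples. This is flat determinized Monte Carlo on an invented one-trick toy game, not the paper's ISMCTS agent for Scopone; the deck, payoff, and all names are our own assumptions.

# Toy sketch of determinization over hidden cards; not Scopone, not ISMCTS.
import random

DECK = list(range(1, 11))   # toy deck: card values 1..10

def trick_reward(my_card, opp_card):
    # Toy payoff: +1 if our card wins the single trick, -1 otherwise.
    return 1 if my_card > opp_card else -1

def choose_card(my_hand, n_determinizations=200):
    # A "fair" player only knows its own hand; it samples determinizations
    # of the opponent's hidden hand from the unseen cards. A "cheating"
    # player would instead read the opponent's true hand directly.
    unseen = [c for c in DECK if c not in my_hand]
    scores = {card: 0.0 for card in my_hand}
    for _ in range(n_determinizations):
        opp_hand = random.sample(unseen, len(my_hand))   # one determinization
        for card in my_hand:
            scores[card] += trick_reward(card, random.choice(opp_hand))
    return max(scores, key=scores.get)

print(choose_card(my_hand=[2, 6, 9]))

Full ISMCTS replaces this flat averaging with a search tree over information sets, but the core move of sampling determinizations consistent with what the player can observe is the same.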
