Temporal Difference Learning in Complex Domains
PhD
This thesis adapts and improves on the methods of TD(λ) (Sutton 1988) that were
successfully used for backgammon (Tesauro 1994) and applies them to other complex
games that are less amenable to simple pattern-matching approaches. The games
investigated are chess and shogi, both of which (unlike backgammon) require
significant amounts of computational effort to be expended on search in order to
achieve expert play. The improved methods are also tested in a non-game domain.
In the chess domain, the adapted TD(λ) method is shown to successfully learn the
relative values of the pieces, and matches using these learnt piece values indicate that
they perform at least as well as piece values widely quoted in elementary chess books.
The adapted TD(λ) method is also shown to work well in shogi, considered by many
researchers to be the next challenge for computer game-playing, and for which there
is no standardised set of piece values.
An original method to automatically set and adjust the major control parameters used
by TD(λ) is presented. The main performance advantage comes from the learning
rate adjustment, which is based on a new concept called temporal coherence.
Experiments in both chess and a random-walk domain show that the temporal
coherence algorithm produces faster learning and more stable values than either
human-chosen parameters or an earlier method for learning rate adjustment.
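For orientation, the following is a minimal sketch of plain tabular TD(λ) (Sutton 1988) on a small random walk, the kind of test domain mentioned above. It is not the thesis's adapted method and does not implement temporal coherence; the function name `td_lambda_random_walk`, the five-state layout, the fixed learning rate `alpha`, and all default parameter values are assumptions chosen for illustration only.

```python
import random


def td_lambda_random_walk(n_states=5, episodes=5000, alpha=0.01, lam=0.7, seed=0):
    """Tabular TD(lambda) with accumulating traces on a symmetric random walk.

    Hypothetical setup: states 0..n_states-1 are non-terminal. Stepping off the
    left end terminates with outcome 0, stepping off the right end terminates
    with outcome 1. No discounting is used.
    """
    rng = random.Random(seed)
    values = [0.5] * n_states  # initial predictions for every state

    for _ in range(episodes):
        traces = [0.0] * n_states  # eligibility traces, reset each episode
        state = n_states // 2      # every episode starts in the middle

        while True:
            next_state = state + rng.choice((-1, 1))
            if next_state < 0:
                target, done = 0.0, True           # left terminal outcome
            elif next_state >= n_states:
                target, done = 1.0, True           # right terminal outcome
            else:
                target, done = values[next_state], False

            # Temporal difference between successive predictions.
            delta = target - values[state]
            traces[state] += 1.0

            # Every recently visited state shares in the correction,
            # weighted by its decaying trace.
            for s in range(n_states):
                values[s] += alpha * delta * traces[s]
                traces[s] *= lam

            if done:
                break
            state = next_state

    return values


if __name__ == "__main__":
    for i, v in enumerate(td_lambda_random_walk()):
        print(f"state {i}: learned {v:.3f}, exact {(i + 1) / 6:.3f}")
```

In this sketch the learning rate is a single hand-chosen constant, which is exactly the kind of control parameter the abstract says the temporal coherence method sets and adjusts automatically.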
The methods presented in this thesis allow programs to learn with as little input of
external knowledge as possible, exploring the domain on their own rather than by
being taught. Further experiments show that the method is capable of handling many
hundreds of weights, and that it is not necessary to perform deep searches during the
learning phase in order to learn effective weight
Warm-Start AlphaZero Self-Play Search Enhancements
Recently, AlphaZero has achieved landmark results in deep reinforcement
learning, by providing a single self-play architecture that learned three
different games at superhuman level. AlphaZero is a large and complicated
system with many parameters, and success requires much compute power and
fine-tuning. Reproducing results in other games is a challenge, and many
researchers are looking for ways to improve results while reducing
computational demands. AlphaZero's design is purely based on self-play and
makes no use of labeled expert data or domain-specific enhancements; it is
designed to learn from scratch. We propose a novel approach to deal with this
cold-start problem by employing simple search enhancements at the beginning
phase of self-play training, namely Rollout, Rapid Action Value Estimate (RAVE)
and dynamically weighted combinations of these with the neural network, and
Rolling Horizon Evolutionary Algorithms (RHEA). Our experiments indicate that
most of these enhancements improve the performance of their baseline player in
three different (small) board games, with RAVE-based variants in particular
playing strongly
Temporal Difference Learning in Complex Domains
Submitted to the University of London for the Degree of Doctor of Philosophy in Computer Scienc