Performance analysis of a 240 thread tournament level MCTS Go program on
  the Intel Xeon Phi by Mirsoleimani, S. Ali et al.
arXiv:1409.4297v2 [cs.PF] 6 Nov 2014
Performance analysis of a 240 thread tournament level MCTS Go program on the Intel Xeon Phi
Ali Mirsoleimani∗1,2, Aske Plaat1, Jos Vermaseren2, and Jaap van den Herik1
1Leiden Centre of Data Science, Leiden University, The Netherlands
2Nikhef Theory Group, Nikhef Amsterdam, The Netherlands
KEYWORDS
Distributed and Parallel Systems Simulation, Simulation Fi-
delity and Performance Evaluation, Monte Carlo Methods,
Special Architectures, Experimental and Comparative Stud-
ies, Memory access patterns, Game playing.
ABSTRACT
In 2013 Intel introduced the Xeon Phi, a new parallel co-
processor board. The Xeon Phi is a cache-coherent many-
core shared memory architecture claiming CPU-like versa-
tility, programmability, high performance, and power effi-
ciency. The first published micro-benchmark studies indicate
that many of Intel’s claims appear to be true. The current pa-
per is the first study on the Phi of a complex artificial intel-
ligence application. It contains an open source MCTS appli-
cation for playing tournament quality Go (an oriental board
game). We report the first speedup figures for up to 240 par-
allel threads on a real machine, allowing a direct comparison
to previous simulation studies. After a substantial amount
of work, we observed that performance scales well up to 32
threads, largely confirming previous simulation results of this
Go program, although the performance surprisingly deterio-
rates between 32 and 240 threads. Furthermore, we report (1)
unexpected performance anomalies between the Xeon Phi
and Xeon CPU for small problem sizes and small numbers
of threads, and (2) that performance is sensitive to scheduling
choices. Achieving good performance on the Xeon Phi
for complex programs is not straightforward; it requires a
deep understanding of (1) search patterns, (2) scheduling,
and (3) the architecture and its many cores and caches. In
practice, the Xeon Phi is less straightforward to program for
than originally envisioned by Intel.
Introduction
In 2013 Intel introduced a new many-core architecture,
a cache-coherent shared memory co-processor architecture
claiming CPU-like versatility, programmability, high perfor-
mance, and power efficiency (in contrast to hard-to-program
energy-hungry GPU co-processor architectures from compa-
nies such as NVIDIA). When a new architecture emerges
there is a great interest in the community for analyzing and
understanding its performance. When GPGPU programming
became prevalent, a rich literature on GPU performance
modeling and simulation emerged; see, e.g., [11, 14, 15].
This trend is now starting for the Xeon Phi. The
first published micro-benchmark studies indicate that many
of Intel’s claims are true [7]. Of course, micro-benchmarks
tell only part of the story, and the performance of actual
applications may differ in practice. In this paper we have chosen
to study an important application from the domain of artificial
intelligence.
∗email: s.a.mirsoleimani@liacs.leidenuniv.nl
Ever since the victory of IBM’s DEEP BLUE over World
Chess Champion Garry Kasparov on 11 May 1997 [22],
computer Go has been the Drosophila Melanogaster of Ar-
tificial Intelligence. The complexity and depth of the game
has frustrated AI researchers trying to replicate the computer
chess successes for many years [25]. The brute-force minimax
approach, so successful in chess, turned out to be a dead
end in Go. Then, in 2006, a new probabilistic simulation algorithm
was introduced. This new algorithm, Monte Carlo Tree
Search (MCTS) [3, 4, 8, 12], was successful, beating the first human
Go-grandmaster in 2008. Not only was MCTS successful
in Go, it also proved successful in many other combinatorial
optimization and simulation problems [13, 20, 2].
Because of the successes with MCTS, much research ef-
fort has been put into improving the performance of parallel
MCTS algorithms. MCTS performance studies have become
important in their own right in recent years [3, 27, 1, 23, 21].
However, scaling studies of MCTS on shared memory many-
core machines have been limited to smaller studies, typically
8-24 core machines [3], or have been performed in simulation
[23]. The advent of the Intel Xeon Phi makes it possible, for the
first time, to replicate such simulation studies on actual hardware,
with up to 240 simultaneous threads. We have performed
this study using FUEGO [5], the same open source program
that was used in the simulation study [23], allowing a direct
comparison.
This paper has two main contributions.
• We have performed, to the best of our knowledge, the
first performance study of a non-trivial MCTS program
on the Intel Xeon Phi. We find unexpected sensitivity
to problem size and scheduling, which we attribute
to low integer performance and a complex interconnect
architecture.
• We have performed the first large scale (up to 240
threads) study of MCTS tree parallelism on a real shared
memory many-core machine. We find good performance
up to 32 threads, confirming a previous simulation
study, and deteriorating performance from 32 to
240 threads.
Figure 1: Intel Xeon Phi microarchitecture
Moreover, we have two more findings. First, Intel’s wish of
high performance at no cost to the programmer is only partly
achieved, due to the complex hardware characteristics of the
Xeon Phi architecture. Second, the scaling of the algorithm is
dependent on many details including cache hierarchy, access
latency, scheduling policy, and core architecture.
The remainder of this paper is structured as follows: in sec-
tion 2 the architecture of the Xeon Phi is briefly discussed.
Section 3 discusses related work. Section 4 gives the exper-
imental setup, and section 5 gives the experimental results.
Finally, scalability and thread affinity are discussed, and a
conclusion is given.
Architecture of Intel Xeon Phi Co-processor
We start by providing an overview of the Intel Xeon Phi co-processor
architecture (see Figure 1). A Xeon Phi co-processor
board consists of up to 61 cores based on the Intel
64-bit ISA. Each of these cores contains vector processing
units (VPU) that operate on 512 bits at a time, i.e., 8 double-precision
floating point elements, or 16 single-precision floats or 32-bit
integers. Each core also has 4-way SMT and dedicated L1 and
fully coherent L2 caches [16]. The tag directories (TD) are
used to look up cache data distributed among the cores. The
cores are connected to other functional units, such as
memory controllers (MC), through a bidirectional ring
interconnect. There are 8 distributed memory controllers that
act as the interface between the ring bus and main memory, which is
up to 16 GB.
The scheduling policy on the Xeon Phi can be influenced
manually at run time. By setting an environment variable
one can control how the threads are bound to cores. This
can be advantageous for exploiting the data locality of the
algorithm. Using the KMP_AFFINITY environment variable,
threads can be distributed among cores. Possible settings are
compact, balanced, or scatter. The compact type allocates
threads to cores in a way that maximizes cache utilization,
while the scatter type allocates threads to maximize core
utilization. Figure 2 shows the allocation of threads for the
different types. If threads access data that is stored in a cache
nearby, the balanced type is the best choice because it maximizes
cache and core utilization simultaneously.
Affinity   Core 0    Core 1    Core 2    Core 3
compact    0 1 2 3   4 5 6 7   Idle      Idle
balanced   0 1       2 3       4 5       6 7
scatter    0 4       1 5       2 6       3 7
Figure 2: Allocation of 8 threads on a 4 core system with 4
threads per core with different affinity types.
Figure 3: Tree parallelization with local lock. The curly arrows
represent threads. The shadowy nodes are locked ones.
The black nodes are newly added to the tree.
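The three affinity types can be illustrated with a small model of the thread-to-core mapping. The sketch below is a simplified approximation written for this explanation, not the algorithm used by the Intel runtime; the function name assign_threads and its parameters are hypothetical.

```python
def assign_threads(n_threads, n_cores, threads_per_core, affinity):
    """Return, for each core, the list of thread ids placed on it.

    Simplified model of the compact, balanced, and scatter affinity
    types; the real runtime may differ in details.
    """
    if n_threads > n_cores * threads_per_core:
        raise ValueError("more threads than hardware thread slots")
    cores = [[] for _ in range(n_cores)]
    if affinity == "compact":
        # Fill every slot of a core before moving on to the next core.
        for t in range(n_threads):
            cores[t // threads_per_core].append(t)
    elif affinity == "scatter":
        # Deal the threads round-robin over all cores.
        for t in range(n_threads):
            cores[t % n_cores].append(t)
    elif affinity == "balanced":
        # Spread threads evenly, keeping consecutive ids on the same core.
        base, extra = divmod(n_threads, n_cores)
        first = 0
        for core in range(n_cores):
            count = base + (1 if core < extra else 0)
            cores[core] = list(range(first, first + count))
            first += count
    else:
        raise ValueError("unknown affinity type: " + affinity)
    return cores


if __name__ == "__main__":
    for kind in ("compact", "balanced", "scatter"):
        print(kind, assign_threads(8, 4, 4, kind))
```

For 8 threads on 4 cores with 4 threads per core, this model yields the three allocations of Figure 2.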
Related Work
Below we review related work on MCTS parallelizations.
The four major parallelization methods for MCTS are leaf
parallelization, root parallelization, tree parallelization [3],
and transposition table driven work scheduling (TDS) based
approaches [27]. Of these, tree parallelization is the method
most often used on shared memory machines. It is the
method used in FUEGO. In tree parallelization one MCTS
tree is shared among several threads that are performing si-
multaneous searches [3]. The main challenge in this method
is using data locks to prevent data corruption. Figure 3 shows
the tree parallelization algorithm with local locks. A lock-
free implementation of this algorithm addressed the afore-
mentioned problem with better scaling than a locked ap-
proach [5]. There is also a case study that shows a good
performance of a (non-MCTS) Monte Carlo simulation on
the Xeon Phi co-processor [24].
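To make the idea of tree parallelization with local locks concrete, the following is a minimal sketch on a toy problem. It is not the FUEGO implementation: the toy game (choose 0 or 1 for DEPTH steps, with a win probability equal to the fraction of ones), the constants, and all function names are assumptions made for illustration. Each node carries its own lock, and a visit is counted during descent before the playout result is known, which acts as a simple virtual loss. Because of Python's global interpreter lock, the sketch shows the locking structure only, not a real speedup.

```python
import math
import random
import threading

N_ACTIONS = 2  # toy game: choose 0 or 1 at every step (assumption)
DEPTH = 6      # toy game length (assumption)


class Node:
    def __init__(self):
        self.lock = threading.Lock()  # local lock: guards this node only
        self.visits = 0
        self.wins = 0.0
        self.children = {}            # action -> Node


def ucb(parent_visits, child, c=1.4):
    visits, wins = child.visits, child.wins  # snapshot of shared counters
    if visits == 0:
        return float("inf")
    return wins / visits + c * math.sqrt(math.log(parent_visits) / visits)


def simulate(root, rng):
    node, path, actions = root, [root], []
    expanded = False
    # Selection and expansion: lock only the node currently being examined.
    while len(actions) < DEPTH and not expanded:
        with node.lock:
            node.visits += 1  # counted before the result is known (virtual loss)
            untried = [a for a in range(N_ACTIONS) if a not in node.children]
            if untried:
                action = rng.choice(untried)
                node.children[action] = Node()
                expanded = True
            else:
                action = max(node.children,
                             key=lambda a: ucb(node.visits, node.children[a]))
            child = node.children[action]
        actions.append(action)
        path.append(child)
        node = child
    with node.lock:
        node.visits += 1
    # Playout: finish the toy game with random actions.
    while len(actions) < DEPTH:
        actions.append(rng.randrange(N_ACTIONS))
    reward = 1.0 if rng.random() < sum(actions) / DEPTH else 0.0
    # Backup: add the result to every node on the path.
    for visited in path:
        with visited.lock:
            visited.wins += reward


def tree_parallel_search(n_threads=4, sims_per_thread=200, seed=0):
    root = Node()

    def worker(index):
        rng = random.Random(seed + index)  # one generator per thread
        for _ in range(sims_per_thread):
            simulate(root, rng)

    threads = [threading.Thread(target=worker, args=(i,))
               for i in range(n_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return root


if __name__ == "__main__":
    root = tree_parallel_search()
    for action, child in sorted(root.children.items()):
        print(action, child.visits, child.wins)
```

All threads share one tree, and a thread holds at most one node lock at a time, so threads working in different parts of the tree do not block each other.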
Schaefers et al. [21] propose a parallel MCTS method for
distributed memory systems called UCT-Treesplit. Yoshi-
zoe et al. [27] describe a parallelization approach based on
TDS [19, 18] for MCTS called depth-first UCT. There are
some attempts to parallelize MCTS on accelerator processors
including GPU [17].
Segal reports the scaling of tree parallelization with virtual
loss in FUEGO for different numbers of threads and time controls
on a simulated idealized shared-memory system [23].
He finds that strength of play increases asymptotically
as resources increase (more time or more threads). A near-perfect
speedup is reported for 64 threads and 60 minutes per
game. Segal suggests that speedup starts decreasing beyond
64 threads, although, with large time settings, further scaling
to 512 threads still shows performance increases.
Enzenberger et al. [6] evaluate tree parallelization with virtual
loss and local locks on a 16-core shared-memory system.
The algorithm shows an eight-fold speedup with 16 threads.
Experimental Setup: Xeon CPU against Xeon Phi
To determine the effective performance on the Xeon Phi we
have performed self-play experiments. A major problem
with parallel game playing programs is the phenomenon of
search overhead, which occurs since, in parallel, parts of the
search tree will be searched that a sequential search would
already have found to be unimportant. Therefore a simple
efficiency measure such as games per second will not tell the
whole story. In order to determine the effective speedup a
different method is needed. We have performed self-play experiments
in which a version of the program with double the
resources (2x # of threads) plays against a version with single
resources (1x # of threads) [3].
In order to generate statistically significant results in a reasonable
amount of time most works use the setting of 1 second
per move, and so did we, initially. FUEGO is an open
source, tournament level Go playing program, developed by
a team headed by Martin Mueller at the University of Alberta
[5]. The experiments were conducted with FUEGO
SVN revision 1900, on a 9x9 board, with komi 6 and Chinese
rules; alternating player color was enabled and the opening book
was disabled. The win-rate of two opponents is measured
by running a match of at least 100 games. A single game of Go
typically lasts 200 moves. The games were played using the
Gomill Python library for tournament play [26].
A statistical method based on [9] is used to calculate 95%-level
confidence lower and upper bounds on the true winning
rate of a player. Assume $p$ is the true winning probability
of a player. The value of $p$ is estimated by $0 \le w = x/n \le 1$,
which results from $x \le n$ wins in a match of
$n$ games. We may therefore treat $w$ as the sample
mean of a binary-valued random variable that counts
two draws as a loss plus a win. The expected value of $w$
is $E(w) = p$ and the variance of $w$ is $\mathrm{Var}(w) = p(1-p)/n$.
According to the central limit theorem, approximately
$w \approx \mathrm{Normal}(p,\, p(1-p)/n)$, so
$(w - p)/\sqrt{p(1-p)/n} \approx \mathrm{Normal}(0, 1)$.
Let $z_{\%}$ denote the upper critical value of
the standard $N(0, 1)$ normal distribution for any desired %-level
of statistical confidence ($z_{90\%} = 1.645$, $z_{95\%} = 1.96$).
Then the probability of
$w - 1.96\sqrt{p(1-p)/n} \le p \le w + 1.96\sqrt{p(1-p)/n}$
is about 95%. Therefore, the 95%
confidence interval on the true winning probability $p$ is
$[\,w - 1.96\sqrt{p(1-p)/n},\; w + 1.96\sqrt{p(1-p)/n}\,]$.
Substituting $w$ for the unknown $p$ gives
$[\,w - 1.96\sqrt{w(1-w)/n},\; w + 1.96\sqrt{w(1-w)/n}\,]$.
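The interval above is straightforward to compute. The following sketch shows one way to do so; the function name win_rate_interval is hypothetical, and the match result used in the example (140 wins in 200 games) is invented for illustration and is not a result from this paper. A draw is counted as half a win, in line with the convention that two draws equal a loss plus a win.

```python
import math


def win_rate_interval(wins, games, draws=0, z=1.96):
    """Return (w, lower, upper) for the normal-approximation interval.

    w is the observed win-rate; z = 1.96 gives the 95% level and
    z = 1.645 the 90% level. Each draw counts as half a win.
    """
    if games <= 0:
        raise ValueError("games must be positive")
    w = (wins + 0.5 * draws) / games
    half_width = z * math.sqrt(w * (1.0 - w) / games)
    return w, w - half_width, w + half_width


if __name__ == "__main__":
    # Illustrative numbers only: 140 wins in a 200-game match.
    w, lower, upper = win_rate_interval(140, 200)
    print(f"win-rate {w:.3f}, 95% interval [{lower:.3f}, {upper:.3f}]")
```

With these illustrative numbers the observed win-rate is 0.70 and the half-width of the interval is about 0.064, which shows why matches of a few hundred games are needed to separate win-rates that differ by only a few percent.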
Our Xeon Phi co-processor board is hosted on a machine in
which a standard 12-core Xeon CPU is present. This allows
experiments in which a conventional parallel Xeon CPU ar-
chitecture is pitted against the new parallel Xeon Phi archi-
tecture.
The results were measured on a machine with (1) an Intel
Xeon CPU E5-2695 at 2.40GHz with 12 cores and 48 hyperthreads,
and (2) an Intel Xeon Phi 7120P at 1.238GHz with 61 cores and
244 hardware threads. On the Xeon CPU, each physical core has
256KB L2 cache and the chip has a total of 30MB L3 cache; the
machine has 160GB physical memory. On the Xeon Phi, each core
has 512KB L2 cache. The co-processor
has 16GB GDDR5 memory on board with an aggregate theoretical
bandwidth of 352 GB/s. The peak turbo frequency is
1.33GHz. The theoretical peak performance of the 7120P is 2.416
TFLOPS for single-precision floating-point arithmetic (or 2.416
TIPS for integer arithmetic) and 1.208 TFLOPS for
double-precision floating-point arithmetic [10].
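The quoted peak figures can be reproduced with simple arithmetic. The sketch below assumes that each vector lane completes one fused multiply-add per cycle, counted as two operations, and that a 512-bit vector holds 8 double-precision or 16 single-precision elements; the function name peak_gops is hypothetical.

```python
def peak_gops(cores, clock_ghz, lanes, ops_per_lane_per_cycle=2):
    """Theoretical peak in giga-operations per second.

    ops_per_lane_per_cycle=2 assumes one fused multiply-add per lane
    per cycle, counted as two operations.
    """
    return cores * clock_ghz * lanes * ops_per_lane_per_cycle


# Xeon Phi 7120P: 61 cores at 1.238 GHz, 512-bit vectors.
double_peak = peak_gops(61, 1.238, lanes=8)    # 512 bits / 64-bit elements
single_peak = peak_gops(61, 1.238, lanes=16)   # 512 bits / 32-bit elements

if __name__ == "__main__":
    print(f"double precision: {double_peak / 1000:.3f} TFLOPS")
    print(f"single precision: {single_peak / 1000:.3f} TFLOPS")
```

Under these assumptions the result is about 1.208 TFLOPS for double precision and about 2.416 TFLOPS for single precision, matching the figures given above.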
Intel’s icc 14.0.1 compiler is used to compile FUEGO in na-
tive mode. A native application runs directly on the Xeon
Phi and its embedded Linux operating system.
Experimental Results
Xeon CPU: Figure 4 shows the results of the self-play experiments
for FUEGO on the conventional Xeon CPU. For
the 9x9 board the win-rate of the program with double the
number of threads is better than that of the base program, starting at
70%, decreasing to 58% at 32 threads, and then becoming flat.
(The 19x19 board has a similar performance, not shown.)
These results are entirely in line with results reported in [6]
for 16 vs 8 threads. The slightly decreasing lines are explained
by the phenomenon of search overhead: the parallel
program with double the number of threads (e.g., 16 threads)
searches more parts of the tree than the version of the program
with half the number of threads (e.g., 8 threads).
Xeon Phi: Figure 5 shows our initial results for the win-rate
on the Xeon Phi. For these experimental settings (1 second
per move) the Phi-graph differs markedly from the CPU-graph
in Figure 4. The Xeon CPU shows a smooth, slightly
decreasing line. The Xeon Phi shows a more ragged line that
first slopes up, and then slopes down. Also, the overall win-rate
is lower than on the CPU, and even dips below 50%,
implying that the version with more threads actually loses to
the version with fewer threads (all within margins of error as
indicated by the error bars).
The best win-rate on the Xeon Phi is for 8 threads, while on
the Xeon CPU it is for 2 threads. The playing strength remains
above the break-even point of 50% for the first player
up to 48 threads, then decreases sharply up to 128 threads,
and is 50 percent for 240 threads. Up to 64 threads,
these results basically confirm the simulation study by Segal
[23]. However, beyond 64 threads the performance drop
is unexpectedly large.
The most striking feature of the experiment with these settings
is the difference in performance of an identical program
in an identical setup on the Xeon CPU versus the Xeon Phi:
Figures 4 and 5 do not look alike at all. The Xeon CPU
shows a steadily decreasing performance, as expected, whereas
the Xeon Phi shows a ragged hump-like shape.
Figure 4: Performance of self-play FUEGO with n threads
against FUEGO with n/2 threads on the Xeon CPU processor.
200 games for each data point. (Plot: percentage of wins, 20 to 100,
versus number of threads, 2 to 48.)
Figure 5: Performance of self-play FUEGO on the Xeon Phi
with n threads against FUEGO with n/2 threads. The board
size is 9x9. 300 games for each data point. (Plot: percentage of
wins, 20 to 100, versus number of threads, 2 to 240.)
To study the possible causes of these results, we performed
experiments to delve deeper into the Xeon Phi architecture.
Efficiency, Thread Affinity, and Problem Size
As mentioned before, the thread scheduling policy on the
Xeon Phi can be influenced manually at run time. In order to
illustrate graphically the effect of different scheduling policies
we have performed a small experiment. Figures 6 and
7 show the effect of different thread affinities on the performance
of the Xeon Phi for double-precision and integer
arithmetic operations. The benchmark executes a loop
that performs the operation c[j] = a[j] * b[j] + c[j] many
times. The effect of thread affinities on the memory bandwidth of the
Xeon Phi when executing the same program in double precision
is shown in Figure 8. The results were measured with
turbo mode set to on.
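The kernel of this micro-benchmark and the way a throughput figure is derived from it can be sketched as follows. This is a single-threaded illustration in plain Python, not the benchmark code used for the measurements; the function names and default sizes are hypothetical, and an interpreted loop will be orders of magnitude slower than vectorized native code. Each loop iteration is counted as two operations (one multiply and one add).

```python
import time


def multiply_add(a, b, c, repeats):
    """Apply c[j] = a[j] * b[j] + c[j] to every element, `repeats` times."""
    for _ in range(repeats):
        for j in range(len(c)):
            c[j] = a[j] * b[j] + c[j]
    return c


def measure_gops(n=10_000, repeats=20):
    """Return the achieved rate in giga-operations per second."""
    a = [1.0] * n
    b = [0.5] * n
    c = [0.0] * n
    start = time.perf_counter()
    multiply_add(a, b, c, repeats)
    elapsed = time.perf_counter() - start
    operations = 2 * n * repeats  # one multiply and one add per element
    return operations / elapsed / 1e9


if __name__ == "__main__":
    print(f"{measure_gops():.4f} giga-operations per second")
```

Replacing the floating-point lists with integer lists gives the integer variant of the same kernel.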
Figure 6: Performance of double-precision operations of the
Xeon Phi for different numbers of threads. (Plot: GFLOPS, 0 to 1300,
versus number of threads, 8 to 240, for the affinity types none,
compact, balanced, and scatter.)
Figure 7: Performance of integer operations of the Xeon Phi
for different numbers of threads. (Plot: GIPS, 0 to 1500, versus
number of threads, 8 to 240, for the affinity types none, compact,
balanced, and scatter.)
In the compact mode the performance increased steadily
and the bandwidth reached a plateau. In the balanced and
scatter modes, depending on how many threads are assigned
to each core, there are 4 different regions for double-precision
and 3 different regions for integer performance. For example, as shown
in Figure 7, between 122 threads and 183 threads some cores
have 2 threads and some others have 3 threads in balanced
mode. This asymmetry degraded the performance dramatically
at the beginning of the region, after which performance started
to increase again. The memory bandwidth also has 4
regions in balanced mode. Even with more threads, the bandwidth
never reached the same level as in the previous region.
This type of performance behavior makes it really tricky to
select the best thread configuration for executing a program like
FUEGO.
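The region boundaries follow from how many threads each core receives. A minimal sketch, assuming that all 61 cores are available to the application and that threads are spread as evenly as possible (the function names are hypothetical):

```python
def threads_per_core_range(n_threads, n_cores=61):
    """Return (min, max) threads per core under an even spread."""
    base, extra = divmod(n_threads, n_cores)
    return base, base + (1 if extra else 0)


def is_asymmetric(n_threads, n_cores=61):
    """True when some cores carry one thread more than others."""
    low, high = threads_per_core_range(n_threads, n_cores)
    return low != high


if __name__ == "__main__":
    for n in (61, 122, 150, 183, 240):
        print(n, threads_per_core_range(n), is_asymmetric(n))
```

With 122 threads every core has exactly 2 threads and with 183 threads exactly 3; for any thread count in between, the load is uneven, which corresponds to the region described above.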
For the FUEGO self-play experiments the compact affinity
type has been used. To show the effect of different scheduling
policies on FUEGO, all three affinity types have been
run. Figure 9 shows the effect of different thread affinity
types on the performance of FUEGO. The percentage of wins
for balanced mode is more stable than for the two
other scheduling methods. The best win-rate is obtained for 4 threads
(1 core) in compact mode and for 16 threads (16 cores) in
scatter mode.
Figure 8: Memory bandwidth of the Xeon Phi for different
numbers of threads. (Plot: GB/s, 0 to 190, versus number of
threads, 8 to 240, for the affinity types compact and balanced.)
Figure 9: Self-play performance of FUEGO for different
thread affinity types on the Xeon Phi. The board size is 9x9.
100 games for each data point. (Plot: percentage of wins, 20 to 100,
versus number of threads, 2 to 240, for the affinity types compact,
balanced, and scatter.)
As noted before, the most striking feature of these experi-
ments is the difference in performance of an identical pro-
gram in an identical setup on the Xeon CPU versus the Xeon
Phi using the standard experimental settings of the 9x9 board
and 1 second per move. The Xeon CPU shows a steadily
decreasing performance, as expected, where the Xeon Phi
shows a ragged hump-like shape.
The win-rate graph shows an effective speedup, a speedup
measure that includes search overhead. We have also computed
basic efficiency-speedup figures to compare the parallel
speed of the Xeon CPU and the Xeon Phi. The FUEGO
program can output the number of games per second that are
performed. Figure 10 shows the number of games per second
performed by FUEGO on a 9x9 board. This is a
convenient measure of how efficiently the architecture is running
the program. Games per second is the number of games
that are played by FUEGO before making a move. The number
increases for both architectures. Due to the higher clock
speed, each core of the Xeon CPU does
much more work than a Xeon Phi core. However, the difference
in clock speed is only a factor of two, whereas in the
figure the difference is more than a factor of 5. Figure 10
shows that even using all cores of the Xeon Phi cannot match
the performance of 16 threads on the Xeon CPU. The low
games-per-second numbers of the Xeon Phi suggest inefficiencies
due to a small problem size.
Figure 10: Number of games per second for a 9x9 board
when FUEGO makes the second move. (Plot: games/sec, 0 to 120000,
versus number of threads, 1 to 240, for Phi-balanced and CPU.)
Closer inspection of the
results on which Figure 10 is based suggests that FUEGO is
not able to do enough simulations on the Xeon Phi for a small
number of threads in just 1 second. Therefore, we increased
the time limit per move to 10 seconds. Figure 11 shows
the results of the self-play experiment when FUEGO has
10 seconds per move for doing simulations on the
Xeon Phi. We see that the graph now approaches that
of the Xeon CPU. The win-rate behavior for low numbers of
threads is now much closer to that of the CPU (Figure 4), and
the counter-intuitive hump-shape has changed to the familiar
down-sloping trend. However, we still see fluctuation in the
balanced mode. Up to 32 threads performance is still reasonable
(close to 70% win-rate for the 2x thread program)
but up to 240 threads performance deteriorates. The maximum
win-rate is for 8 threads and there is still a marginal
benefit for using 128 threads.
The reason behind the difference between the results in Figures
5 and 11 is shown in Figure 12, which shows how large
the search tree is when making a move. The size of the tree
when FUEGO has 10 seconds per move on the Xeon Phi is
similar to that on the Xeon CPU with 1 second per move.
Conclusion
Intel’s Xeon Phi has been designed to offer high performance
without the associated hassle of difficult programming. To
this end, the Xeon Phi architecture has many cores, many
levels of caches, an intricate cache-coherency protocol, vector
units, and a fast and complex interconnect: complex
machinery to offer a simple programming model. Previously,
micro-benchmarks have been reported that indicate that Intel’s
published figures are essentially achieved in practice [7].
Using a tournament quality game playing program that employs
a popular Monte Carlo method [2], our results show
that porting a complex program and getting it to run correctly
on the Xeon Phi is indeed relatively straightforward. Our results
are, to our knowledge, the first performance results of
a non-trivial AI program on a shared memory architecture
with up to 240 threads, allowing comparison with a simulation
study up to 64 threads [23] and an extension beyond.
Figure 11: Self-play performance of FUEGO for 10 seconds
per move on the Xeon Phi. The board size is 9x9. 100 games
for each data point. (Plot: percentage of wins, 20 to 100, versus
number of threads, 2 to 240, for the balanced affinity type.)
Figure 12: The number of nodes in the search tree when
FUEGO makes the second move. The balanced thread affinity
type is used on the Xeon Phi. The board size is 9x9. (Plot: number
of nodes, 0 to 6e+06, versus number of threads, 1 to 240, for
Phi-timelimit=1s, Phi-timelimit=10s, and CPU-timelimit=1s.)
However, the results also show that achieving good perfor-
mance is not straightforward: on the conventional Xeon CPU
architecture our results are as predicted, a smooth, slightly
down-sloping line, while, using the standard experimental
settings of 1 second per move, on the Xeon Phi we initially
see a ragged hump-like shape that differs markedly from the
Xeon CPU. Our further experiments with scheduling poli-
cies and problem sizes were prompted by the assumption that
cache locality issues and memory access patterns caused the
unexpected performance results. We find that performance is
quite sensitive to different settings, especially problem size.
By judicious choice of scheduling strategy and problem size,
we were able to achieve reasonable results after considerable
analysis and tuning. Results appear first to be largely in line
with a previous simulation study, showing reasonable scal-
ing up to 32 threads, but deteriorating performance up to 240
threads.
From our experimental results we may conclude that achiev-
ing good performance on the Xeon Phi for complex pro-
grams is not straightforward; achieving good performance
requires a deep understanding of the search algorithm on the
one hand and the architecture and its many cores and caches
on the other hand. We suggest that in this case a complex
architecture does not automatically equal a simple perfor-
mance model. Given the industry trend towards heteroge-
neous many-core architectures, increasing our understanding
of the interplay between algorithm and architecture is vital
for achieving good performance. More research is under
way to create more accurate performance models of the Intel
Xeon Phi for different types of algorithms and MCTS-like
access patterns.
Acknowledgment
This work is supported in part by the ERC Advanced Grant
no. 320651, “HEPGAME.”
REFERENCES
[1] A. Bourki, G. Chaslot, M. Coulm, V. Danjean,
H. Doghmen, J.-B. Hoock, H. Thomas, A. Rimmel,
F. Teytaud, O. Teytaud, P. Vayssi, T. Hérault,
P. Vayssière, and Z. Yu. Scalability and Parallelization
of Monte-Carlo Tree Search. In Proceedings of
the 7th International Conference on Computers and
Games, CG’10, pages 48–58, Berlin, Heidelberg, 2011.
Springer-Verlag.
[2] C. B. Browne, E. Powley, D. Whitehouse, S. M. Lu-
cas, P. I. Cowling, P. Rohlfshagen, S. Tavener, D. Perez,
S. Samothrakis, and S. Colton. A Survey of Monte
Carlo Tree Search Methods. Computational Intelli-
gence and AI in Games, IEEE Transactions on, 4(1):1–
43, 2012.
[3] G. M. J. B. Chaslot, M. H. M. Winands, and H. J.
van den Herik. Parallel monte-carlo tree search. Com-
puters and Games, 5131:60–71, 2008.
[4] R. Coulom. Efficient Selectivity and Backup Operators
in Monte-Carlo Tree Search. In Proceedings of the 5th
International Conference on Computers and Games,
CG’06, pages 72–83, Berlin, Heidelberg, May 2006.
Springer-Verlag.
[5] M. Enzenberger and M. Müller. A lock-free multithreaded
Monte-Carlo tree search algorithm. Advances
in Computer Games, 6048:14–20, 2010.
[6] M. Enzenberger, M. Müller, B. Arneson, and R. Segal.
Fuego-An Open-Source Framework for Board Games
and Go Engine Based on Monte Carlo Tree Search.
IEEE Transactions on Computational Intelligence and
AI in Games, 2(4):259–270, Dec. 2010.
[7] J. Fang, A. A. L. Varbanescu, H. Sips, L. Zhang, C. Xu,
and Y. Che. Test-driving Intel Xeon Phi. In Proceed-
ings of the 5th ACM/SPEC international conference on
Performance engineering - ICPE ’14, number Section
III, pages 137–148, New York, New York, USA, Mar.
2014. ACM Press.
[8] S. Gelly and D. Silver. Monte-Carlo tree search and
rapid action value estimation in computer Go. Artificial
Intelligence, 175(11):1856–1875, 2011.
[9] E. Heinz. New self-play results in computer chess. In
Computers and Games, pages 262–276, 2001.
[10] Intel. Intel Xeon Phi Product Family: Highly parallel
processing to power your breakthrough innovations.
http://www.intel.com/content/www/us/en/benchmarks/server/xeon-phi/xeon-phi-theoretical-maximums.html, 2013.
[11] A. Karami, S. A. Mirsoleimani, and F. Khunjush. A
Statistical Performance Prediction Model for OpenCL
Kernels on NVIDIA GPUs. In The 17th CSI Interna-
tional Symposium on Computer Architecture & Digital
Systems (CADS’13), Tehran, Iran, 2013. IEEE.
[12] L. Kocsis and C. Szepesvári. Bandit based monte-carlo
planning. Machine Learning: ECML 2006, 2006.
[13] J. Kuipers, A. Plaat, J. Vermaseren, and H. van den
Herik. Improving multivariate Horner schemes with
Monte Carlo tree search. Computer Physics Commu-
nications, 184(11):2391–2395, Nov. 2013.
[14] U. Lopez-Novoa, A. Mendiburu, and J. Miguel-Alonso.
A Survey of Performance Modeling and Simulation
Techniques for Accelerator-based Computing. IEEE
Transactions on Parallel and Distributed Systems,
9219(c):1–1, 2014.
[15] S. A. Mirsoleimani, A. Karami, and F. Khunjush. A
Two-Tier Design Space Exploration Algorithm to Con-
struct a GPU Performance Predictor. In Architecture
of Computing Systems–ARCS 2014, pages 135–146.
Springer, 2014.
[16] R. Rahman. Intel Xeon Phi Coprocessor Architec-
ture and Tools: The Guide for Application Developers.
Apress, Sept. 2013.
[17] K. Rocki and R. Suda. Large-Scale Parallel Monte
Carlo Tree Search on GPU. In Parallel and Distributed
Processing Workshops and Phd Forum (IPDPSW),
2011 IEEE International Symposium on, pages 2034–
2037, May 2011.
[18] J. Romein, H. Bal, J. Schaeffer, and A. Plaat. A per-
formance analysis of transposition-table-driven work
scheduling in distributed search. IEEE Transactions on
Parallel and Distributed Systems, 13(5):447–459, May
2002.
[19] J. Romein, A. Plaat, H. E. Bal, and J. Schaeffer. Transposition
Table Driven Work Scheduling in Distributed
Search. In 16th National Conference on Artificial
Intelligence (AAAI’99), pages 725–731, 1999.
[20] B. Ruijl, J. Vermaseren, A. Plaat, J. Herik, and H. J.
van den Herik. Combining Simulated Annealing and
Monte Carlo Tree Search for Expression Simplification.
Proceedings of ICAART Conference 2014, 1(1):724–
731, 2014.
[21] L. Schaefers, M. Platzner, and S. Member. Distributed
Monte-Carlo Tree Search : A Novel Technique and its
Application to Computer Go. IEEE Transactions on
Computational Intelligence and AI in Games, 6(3):1–
15, 2014.
[22] J. Schaeffer and A. Plaat. Kasparov versus deep blue:
The re-match. ICCA Journal, 20(2):95–102, 1997.
[23] R. B. Segal. On the Scalability of Parallel UCT. In Pro-
ceedings of the 7th International Conference on Com-
puters and Games, CG’10, pages 36–47, Berlin, Hei-
delberg, 2011. Springer-Verlag.
[24] Shuo-li. Case Study: Achieving High Performance on
Monte Carlo European Option Using Stepwise Optimization
Framework. https://software.intel.com/en-us/articles/case-study-achieving-high-performance-on-monte-carlo-european-option-using-stepwise,
2013.
[25] H. J. van den Herik, A. Plaat, J. Kuipers, and J. A. M.
Vermaseren. Connecting Sciences. In 5th Interna-
tional Conference on Agents and Artificial Intelligence
(ICAART 2013), 1:IS–7 – IS–16, 2013.
[26] M. Woodcraft. Gomill library.
http://mjw.woodcraft.me.uk/gomill/, 2014.
[27] K. Yoshizoe, A. Kishimoto, T. Kaneko, H. Yoshimoto,
and Y. Ishikawa. Scalable Distributed Monte-Carlo
Tree Search. Fourth Annual Symposium on Combina-
torial Search, pages 180–187, May 2011.
