Block-Parallel IDA* for GPUs (Extended Manuscript) by Horie, Satoru & Fukunaga, Alex
Graduate School of Arts and Sciences
The University of Tokyo
Abstract
We investigate GPU-based parallelization of Iterative-
Deepening A* (IDA*). We show that straightforward
thread-based parallelization techniques which were pre-
viously proposed for massively parallel SIMD proces-
sors perform poorly due to warp divergence and load
imbalance. We propose Block-Parallel IDA* (BP-
IDA*), which assigns the search of a subtree to a block
(a group of threads with access to fast shared memory)
rather than a thread. On the 15-puzzle, BPIDA* on an
NVIDIA GRID K520 with 1536 CUDA cores achieves
a speedup of 4.98 compared to a highly optimized
sequential IDA* implementation on a Xeon E5-2670 core.
1 Introduction
Graphics Processing Units (GPUs) are many-core
processors which are now widely used to accelerate many
types of computation. GPUs are attractive for
combinatorial search because of their massive parallelism.
On the other hand, in many domains, search algorithms
such as A* tend to be limited by RAM rather
than runtime. A standard strategy for addressing
limited memory in sequential search is iterative deepening
(Korf 1985). We present a case study on the GPU-
parallelization of Iterative-Deepening A* (Korf 1985)
for the 15-puzzle using the Manhattan Distance heuris-
tic. We evaluate previous thread-based techniques for
parallelizing IDA* on SIMD machines, and show that
these do not scale well due to poor load balance and
warp divergence. We then propose Block-Parallel IDA*
(BPIDA*), which, instead of assigning a subtree to a
single thread, assigns a subtree to a group of threads
which share fast memory. BPIDA* achieves a speedup
of 4.98 compared to a state-of-the-art 15-puzzle solver
on a CPU, and a speedup of 659.5 compared to a single-
thread version of the code running on the GPU.
2 Background and Related Work
Note: This is an extended manuscript based on a paper
accepted to appear in SoCS 2017.
An NVIDIA CUDA architecture GPU consists of a set
of streaming multiprocessors (SMs) and a GPU main
memory (shared among all SMs). Each SM consists of
shared memory, cache, registers, arithmetic units, and
a warp scheduler. Within each SM the cores operate
in a SIMD manner. However, each SM executes inde-
pendently, so threads in different SMs can run asyn-
chronously. A thread is the smallest unit of execution.
A block is a group of threads which execute on the same
SM and share memory. A grid is a group of blocks
which execute the same function. Threads in a block
are partitioned into warps. A warp executes in a SIMD
manner (all threads in the same warp share a program
counter). Warp divergence, an instance of SIMD diver-
gence, occurs when threads belonging to the same warp
follow different execution paths, e.g., IF-THEN-ELSE
branches. Shared memory is shared by the threads of a
block and is local to an SM; access to shared memory is
much faster than access to the GPU global memory, which
is shared by all SMs.
Rao et al. (1987) parallelized each iteration of
IDA* using work-stealing on multiprocessors.
Parallel-window IDA* assigned each iteration of IDA* to
its own processor (Powley and Korf 1989). Powley et al.
(1993) and Mahanti and Daniels (1993) proposed SIMD
parallel IDA* algorithms. For each f-cost limited
iteration of IDA*, they perform an initial partition of the
workload among the processors, and then periodically
perform load balancing between IDA* iterations and
within each iteration. Hayakawa et al. (2015) proposed
a GPU-based parallelization of IDA* for the 3x3x3
Rubik’s cube which searches to a fixed depth l on the CPU,
then invokes a GPU kernel for the remaining
subproblems. Their domain-specific load balancing scheme
relies on tuning l using knowledge of “God’s number”
(optimal path length for the most difficult cube instance)
and is fragile: perturbing l by 1 results in a 10x
slowdown. Zhou and Zeng (2015) proposed a GPU-parallel
A* which partitions OPEN into thousands of priority
queues. The amount of global RAM on the GPU
(currently ≤ 24GB) poses a serious limitation for
GPU-based parallel A*. Edelkamp et al. (2010)
investigated memory-efficient GPU search. Sulewski et al.
(2011) proposed a hybrid planner which uses both the
GPU and CPU.
arXiv:1705.02843v1 [cs.AI] 8 May 2017
3 Experimental Settings and Baselines
We used the standard set of 100 15-puzzle instances by
Korf (1985). These instances are ordered in approxi-
mate order of difficulty. All solvers used the Manhattan
distance heuristic. Reported runtimes include all over-
heads such as data transfers between CPU and GPU
memories (negligible). All experiments were executed
on a non-shared, dedicated AWS EC2 g2.2xlarge in-
stance. The CPU is an Intel Xeon E5-2670. The GPU
is an NVIDIA GRID K520, with 4GiB global RAM,
48KiB shared RAM/block, 1536 CUDA cores, warp size
32, and 0.80GHz GPU clock rate.
First, we evaluated 3 baseline IDA* solvers:
Solver B: The efficient, Manhattan-Distance heuristic
based 15-puzzle IDA* solver implemented in C++ by
Burns et al. (2012). We used the current version at
https://github.com/eaburns/ssearch.
Solver C: Our own implementation of IDA* in C (code
at http://github.com/socs2017-48/anon48). This is the
basis for G1 and all of our GPU-based code.
Solver G1: A direct port of Solver C to CUDA. The
implementation is optimized so that all data structures
are in the fast shared memory (the memory which is
local to an SM). This baseline configuration uses
only 1 GPU block with 1 thread, i.e., only 1 core is
used and all other GPU cores are idle.
The total time to solve all 100 problem instances was
620 seconds for Solver B (Burns et al. 2012) and 475
seconds for our Solver C. Solver C was consistently 25%
faster on every instance. Thus, Solver C is appropriate
as a baseline for our GPU-based 15-puzzle solvers.
Next, we compare Solver C (1 CPU thread) to G1
(1 GPU thread). G1 required 62957 seconds to solve
all 100 instances, 131 times slower than Solver C. This
implies that on the GPU we used with 1536 cores, a per-
fectly efficient implementation of parallel IDA* might
be able to achieve a speedup of up to 1536/131 = 11.725
compared to Solver C.
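The bound above is simple arithmetic. A minimal sketch in Python, using only the figures quoted in this section (the variable names are ours, chosen for illustration):

```python
# Figures quoted in the text above.
gpu_cores = 1536
g1_slowdown_vs_solver_c = 131  # reported ratio of G1 runtime to Solver C runtime

# With perfect efficiency, the speedup over G1 is at most gpu_cores, so the
# speedup over Solver C is bounded by gpu_cores / slowdown.
max_speedup_vs_solver_c = gpu_cores / g1_slowdown_vs_solver_c
print(round(max_speedup_vs_solver_c, 3))  # 11.725
```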
4 Thread-Based Parallel IDA*
Most of the previous work on parallel IDA* paral-
lelizes each iteration of IDA* using a thread-based par-
allel scheme (Rao, Kumar, and Ramesh 1987; Powley,
Ferguson, and Korf 1993; Mahanti and Daniels 1993;
Hayakawa, Ishida, and Murao 2015).
We evaluated 3 thread-parallel IDA* configurations.
Since these are relatively straightforward and not novel,
we sketch the implementations below. Details are in
the Appendix.
PSimple (baseline) In this baseline configuration,
for each f-bounded iteration of IDA*, PSimple
performs A* search from the start state until OPEN
contains as many unique states as there are threads.
Each of these states becomes the root of a subtree, and
each root is assigned to a thread. No load balancing
is performed. The subtree sizes under each root state
can vary significantly, so some threads may finish their
subproblem much faster than other threads. Each
f-bounded iteration must wait for all threads to
complete, so PSimple has very poor load balance.
Therefore, load balancing mechanisms which redistribute the
work among processors are necessary.
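The root-set construction described for PSimple can be sketched as follows. This is a minimal illustration under our own assumptions, not the authors' code: the function name `create_root_set`, the `successors` and `h` callbacks, and the toy binary-tree domain are hypothetical, goal handling is omitted, and duplicate detection keeps only the first path found to each state.

```python
import heapq
from itertools import count

def create_root_set(start, successors, h, n_roots):
    """Best-first (A*-style) expansion until OPEN holds at least n_roots states.

    successors(state) yields (child, cost) pairs; h(state) is the heuristic.
    Returns a list of (g, state) pairs, one candidate root per entry.
    """
    tie = count()  # tie-breaker so the heap never has to compare states
    open_heap = [(h(start), next(tie), 0, start)]
    seen = {start}
    while open_heap and len(open_heap) < n_roots:
        _, _, g, state = heapq.heappop(open_heap)
        for child, cost in successors(state):
            if child not in seen:  # simplified duplicate detection
                seen.add(child)
                f = g + cost + h(child)
                heapq.heappush(open_heap, (f, next(tie), g + cost, child))
    return [(g, state) for _, _, g, state in open_heap]

# Toy domain (an assumption for illustration): an infinite binary tree.
def tree_successors(n):
    return [(2 * n, 1), (2 * n + 1, 1)]

roots = create_root_set(1, tree_successors, lambda s: 0, n_roots=4)
print(sorted(state for _, state in roots))  # [4, 5, 6, 7]
```

Each returned root would then be handed to one thread; the sketch stops at that point because the per-thread search is the sequential DFS of Alg. 2.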
PStaticLB (static load balancing) This
configuration adds static load balancing to PSimple.
After each f-bounded iteration, PStaticLB applies a
static load balancing mechanism somewhat similar to
that of Powley, Ferguson, and Korf (1993). In IDA*,
the i-th iteration repeats all of the work done in
iteration i−1. Thus, the number of states visited under
each root state in iteration i−1 can be used to estimate
the number of states which will be visited in the current
iteration i, and root nodes are redistributed based on
these estimates (details in the Appendix).
PFullLB (thread-parallel with dynamic load
balancing) This configuration adds dynamic load
balancing (DLB) to PStaticLB. DLB moves work to idle
threads from threads with remaining work during an
iteration. On a GPU, work can be transferred between
two threads within a single block relatively cheaply
using the shared memory within a block, while
transferring work between two threads in different blocks is
expensive because it requires access to the global memory.
When dynamic load balancing is triggered, idle threads
steal work from threads with remaining work within the
same block. We experimented with various DLB strategies,
including variants of the policies investigated by Powley
and Korf (1989) and Mahanti and Daniels (1993). The
best-performing policy for triggering DLB, which we use
here, is based on the policy by Powley and Korf. See the
Appendix for additional details.
4.1 Evaluation of Thread-Parallel IDA*
PSimple on 1536 cores required a total of 3378 seconds
to solve all 100 problems, a speedup of only 18.6 com-
pared to G1 (1 core on the GPU). This is mostly due
to extremely poor load balance. We define load balance
as maxload/averageload, where averageload is the
average number of nodes expanded among all threads, and
maxload is the number of nodes expanded by the thread
which performed the most work. The load balance for
PSimple on the 100 problems was: mean 96.46, min 14,
max 680, stddev 113.19. This is extremely unbalanced
(on average, maxload is almost 100x averageload).
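The load balance metric defined above can be computed as in the following sketch. The function name and the per-thread counts are hypothetical illustrations, not measured data.

```python
def load_balance(expanded_per_thread):
    """Return maxload / averageload for a list of per-thread node counts."""
    if not expanded_per_thread:
        raise ValueError("need at least one thread")
    average_load = sum(expanded_per_thread) / len(expanded_per_thread)
    if average_load == 0:
        raise ValueError("no nodes were expanded")
    return max(expanded_per_thread) / average_load

# Hypothetical counts for four threads.
print(load_balance([100, 100, 100, 100]))  # 1.0 (perfectly balanced)
print(load_balance([370, 10, 10, 10]))     # 3.7 (one thread does most of the work)
```

A value of 1 is the ideal; larger values mean the busiest thread holds up the whole f-bounded iteration.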
Static load balancing significantly improved load bal-
ance (PStaticLB: mean 9.96, min 3, max 56, stddev
8.96), and dynamic load balancing further improved
load balance (PFullLB: mean 6.14, min 3, max 19, stddev
3.38). These improvements resulted in speedups of 58.9
(PStaticLB) and 70.8 (PFullLB) compared to G1
(Table 1). However, the 70.8 speedup vs
G1 achieved by PFullLB is only a parallel efficiency of
70.8/1536 = 4.6%, which is extremely poor. We exper-
imented extensively but could not achieve significantly
better results with thread-parallel IDA*.
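The parallel efficiency figure is the speedup over the 1-core GPU baseline divided by the number of cores. A short sketch using the speedups reported in this section (the function name is ours):

```python
GPU_CORES = 1536

def parallel_efficiency(speedup_vs_one_core, cores=GPU_CORES):
    """Fraction of the ideal linear speedup that was actually achieved."""
    return speedup_vs_one_core / cores

# Speedups vs. G1 as reported in the text.
for name, speedup in [("PSimple", 18.6), ("PStaticLB", 58.9), ("PFullLB", 70.8)]:
    print(f"{name}: {100 * parallel_efficiency(speedup):.1f}%")
# PSimple: 1.2%
# PStaticLB: 3.8%
# PFullLB: 4.6%
```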
5 Block Parallelization
The likely causes for the poor (4.6%) efficiency of
PFullLB are: (1) SMs become idle due to poor load
balance even after our load balancing efforts, (2) threads
stall due to warp divergence, and (3) load balancing
overhead. All of these can be attributed to the
thread-based parallelization scheme in PFullLB and
PStaticLB, in which each processor/thread executes an
independent subproblem during a single f-bound iteration.
This scheme, based on parallel IDA* variants originally
designed for SIMD machines (Powley, Ferguson, and
Korf 1993; Mahanti and Daniels 1993), was appropri-
ate for those SIMD architectures where all communica-
tions between processors were very expensive – paying
the price of SIMD divergence overhead was preferable
to incurring communication costs. On the other hand,
in NVIDIA GPUs, threads in the same block (which
execute on the same SM) can access fast shared mem-
ory on the SM with relatively low overhead. We exploit
this in a block-parallel approach.
Rocki and Suda (2009) proposed a GPU-based par-
allel minimax gametree search algorithm for 8x8 Oth-
ello (without any αβ pruning) which works as follows.
Within each block a node n is selected for expansion.
If n is a leaf, it is evaluated using a parallel evalua-
tion function (32 threads, 1 thread per 2 positions in
the 8x8 board). Otherwise a parallel successor gener-
ator function is called (1 thread/position) to generate
successors of n, which are added to the node queue.
This approach greatly reduced warp divergence, since
all threads in the warp are synchronized to perform
the fetch-evaluate-expand cycle. Because there is no
αβ pruning, their search trees have uniform depth (i.e.,
fixed-depth DFS), and the number of possible moves on
the Othello board (64) is conveniently a multiple
of the CUDA warp size (32).
We now propose a generalization of this approach
for IDA*, which we call Block-Parallel IDA*, shown in
Alg. 1. In contrast to the parallel minimax of Rocki
and Suda (2009), BPIDA* handles variable-depth
subtrees (due to the heuristic, IDA* tree depths are
irregular) and does not depend on a fixed number of
applicable operators (e.g., 64).
openList is a stack shared among all threads in the
same block. It supports two key parallel operations,
parallelPop and atomicPut. parallelPop extracts
(#threads in a block/#operators) nodes from openList,
so that each extracted node is processed by #operators
threads, one per operator (Alg. 1, line 8). atomicPut
lets the threads insert their generated successor nodes t
into the shared openList concurrently. It is implemented
as a linearizable (Herlihy and Wing 1990) operation.
The BPDFS function is similar to a standard, se-
quential f -limited depth-first search, but in each iter-
ation of the repeat-until loop in lines 4-16 (Alg. 1),
a warp performs the fetch-evaluate-expand cycle on
(#threads in a block/#operators) nodes. The number
of threads per block is set to the warp size (32). This
allows the following: (1) When a warp is scheduled for
execution, all cores in the SM are active. (2) Since all
threads in the block (=warp) share a program counter,
explicit synchronizations become unnecessary.
BPIDA* applies a slightly modified version of the
static load balancing used by PStaticLB (Sec. A.1).
Algorithm 1 BlockParallel IDA*
 1: function BPDFS(root, goals, limitf)
 2:   openList ← root
 3:   fnext ← ∞
 4:   repeat
 5:     s ← parallelPop(openList)
 6:     if s ∈ goals then
 7:       return s and its parents as a shortest path
 8:     a ← (threadID mod #actions)th action
 9:     if a is applicable on s then
10:       t ← successor(a, s)
11:       fnew ← g(s) + cost(a) + h(t)
12:       if fnew ≤ limitf then
13:         atomicPut(openList, t)
14:       else
15:         fnext ← min(fnext, fnew)
16:   until openList is empty
17:   return fnext    ▷ no plan is found
18:
19: function BPIDA*(start, goals)
20:   roots ← CreateRootSet(start, goals)
21:   limitf ← DecideFirstLimit(roots)
22:   repeat
23:     parallelForByBlocks root ∈ roots do
24:       limitf, stat ← BPDFS(root, goals, limitf)
25:     end parallelForByBlocks
26:   until shortest path is found
While PStaticLB uses the number of expanded nodes to
estimate the work in the next iteration, BPIDA* uses
the number of repetitions executed in lines 4-16. BP-
IDA* does not use dynamic load balancing.
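To make the control flow of Alg. 1 concrete, the following is a sequential Python simulation of a single block on the 15-puzzle. It is a sketch under stated assumptions, not the authors' CUDA implementation: the names `bpdfs`, `bp_ida_star`, `manhattan`, and `successor` are ours, the goal layout (blank in the top-left corner) is assumed, only one root and one block are used (no root set, no load balancing), the "threads" are simulated by an ordinary loop, and the optimal cost is returned instead of the path.

```python
import math

N = 4                     # 4x4 board; tile 0 is the blank
GOAL = tuple(range(16))   # assumed goal layout: blank in the top-left corner
MOVES = (-N, N, -1, 1)    # blank moves up, down, left, right (the 4 operators)
WARP = 32                 # threads per block, equal to the warp size

def manhattan(state):
    """Sum over tiles of the grid distance to the tile's goal position."""
    return sum(abs(i // N - t // N) + abs(i % N - t % N)
               for i, t in enumerate(state) if t != 0)

def successor(state, blank, move):
    """State after sliding the blank by move, or None if not applicable."""
    target = blank + move
    if target < 0 or target >= N * N:
        return None
    if move in (-1, 1) and target // N != blank // N:
        return None       # a horizontal move must stay within its row
    s = list(state)
    s[blank], s[target] = s[target], s[blank]
    return tuple(s)

def bpdfs(root, g_root, limit):
    """One f-limited search; returns (solution cost or None, next f limit)."""
    open_list = [(root, g_root)]
    f_next = math.inf
    per_step = WARP // len(MOVES)   # nodes fetched by one parallelPop
    while open_list:
        batch = [open_list.pop() for _ in range(min(per_step, len(open_list)))]
        for state, g in batch:
            if state == GOAL:
                return g, f_next
        # One loop turn per simulated thread: thread tid handles one
        # (node, operator) pair, as in line 8 of Alg. 1.
        for tid in range(len(batch) * len(MOVES)):
            state, g = batch[tid // len(MOVES)]
            t = successor(state, state.index(0), MOVES[tid % len(MOVES)])
            if t is None:
                continue
            f_new = g + 1 + manhattan(t)
            if f_new <= limit:
                open_list.append((t, g + 1))   # stands in for atomicPut
            else:
                f_next = min(f_next, f_new)
    return None, f_next

def bp_ida_star(start):
    """Iterative deepening driver for a single root."""
    limit = manhattan(start)
    while True:
        cost, f_next = bpdfs(start, 0, limit)
        if cost is not None:
            return cost
        if f_next == math.inf:
            return None
        limit = f_next

# A state two moves away from GOAL (blank moved right, then down).
start = (1, 5, 2, 3, 4, 0, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15)
print(bp_ida_star(start))  # 2
```

With 32 threads and 4 operators, each step of the loop handles up to 8 nodes, which mirrors the (#threads in a block/#operators) batch size described above.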
6 Evaluation of BPIDA*
Runtimes Figure 1a compares the relative runtime
of BPIDA* vs. PFullLB. BPIDA* required a total of
95 seconds to solve all 100 problems, a speedup of 9.39
compared to PFullLB. Table 1 summarizes the total
runtimes and speedups for all algorithms in this paper.
Other metrics There are 3 suspected culprits for the
poor performance of thread-based parallel IDA*: (1)
dynamic load balancing overhead, (2) idle SMs (bad load
balance), and (3) thread stalls due to warp divergence.
BPIDA* does not perform dynamic load balancing, so (1)
is irrelevant. For factors (2) and (3), there are related
metrics, sm_efficiency and IPC (instructions per cycle),
which can be measured by the CUDA profiler, nvprof.
sm_efficiency is the average % of time at least one warp
is active on an SM. High sm_efficiency indicates that
the SMs are busy, and high IPC indicates there are few
NOPs due to warp divergence. The (mean, min, max,
stddev) sm_efficiency over 100 instances was (65.22, 31.8,
82.7, 7.94) for PFullLB, and (94.29, 32.3, 99.9, 9.76)
for BPIDA*, and for IPC, the results were (0.30, 0.13,
0.39, 0.048) for PFullLB and (0.97, 0.60, 1.06, 0.059)
for BPIDA*. For both metrics, the results of BPIDA*
were better than PFullLB, and close to the ideal values
(100% sm_efficiency and IPC=1.0).

configuration                    total runtime   speedup
                                 (seconds)       vs. G1
CPU-based sequential algorithms (1 CPU thread)
  Solver B (Burns et al. 2012)   620             n/a
  Solver C                       475             n/a
GPU-based sequential algorithm (1 thread)
  G1                             62957           1
GPU-based parallel algorithms (1536 threads)
  PSimple                        3378            18.6
  PStaticLB                      1069            58.9
  PFullLB                        892             70.8
  BPIDA*                         95              659.5
Table 1: Total Runtimes for 100 15-Puzzle Instances

(a) Relative runtimes: PFullLB vs. BPIDA*
(b) Relative runtimes: Solver C vs. BPIDA* (finding 1 optimal solution)
(c) Relative runtimes: Solver C vs. BPIDA* (finding all optimal solutions)
Figure 1: BPIDA* Evaluation
6.1 Comparison with Sequential Solver C
We now compare BPIDA* with the CPU-based, se-
quential Solver C (Sec. 3). Fig. 1b compares the
relative runtimes of Solver C (1 CPU core) and BP-
IDA* (1536 GPU cores). The y-axis shows Run-
time(SolverC)/Runtime(BPIDA*) for each instance.
Comparing the total time to solve all 100 instances,
BPIDA* was 4.98 times faster.
Runtime comparisons between parallel and sequential
IDA* can be obscured by the fact that they do not
necessarily expand the same set of nodes in the final
iteration, although they expand the same set of nodes
in non-final iterations. The same issue exists with
comparisons among parallel IDA* variants, but from Fig.
1a and Table 1, it is clear that BPIDA* significantly
outperforms the other parallel algorithms, so above we
simply reported the time to find a single solution, as is
standard practice in previous work.
To eliminate differences in search efficiency (node ex-
pansion order) from the comparison, the next exper-
iment compares the time required to find all optimal-
cost solutions of every problem, i.e., the search does not
terminate until all nodes with f ≤ OptimalCost have
been expanded. This eliminates node ordering effects,
allowing comparison of the wall-clock time required to
perform the same amount of search. Fig. 1c compares
the relative runtimes of Solver C (1 CPU core) and
BPIDA* (1536 GPU cores). The y-axis shows Run-
time(SolverC)/Runtime(BPIDA*) to find all optimal
solutions for each instance. Comparing the total time
to find all optimal solutions for all 100 instances, BP-
IDA* was 6.78 times faster.
7 Conclusions and Future Work
We proposed Block-Parallel IDA*, which assigns sub-
trees to GPU blocks (groups of threads with fast shared
memory). Compared to thread-parallel approaches,
this greatly reduces warp divergence and improves load
balance. BPIDA* also does not require explicit dy-
namic load balancing, making it relatively simple to
implement. On 1536 cores, BPIDA* achieves a speedup
of 659.5 vs. a 1-thread GPU baseline, i.e., 42% parallel
efficiency. Compared to a highly optimized single-CPU
IDA*, BPIDA* achieves a 6.78x speedup when compar-
ing the time to find all optimal solutions.
The successful parallelization of BPIDA* on the 15-
puzzle with Manhattan distance (MD) heuristic ex-
ploits the following factors: (1) compact states, (2) the
MD heuristic requires little memory, and (3) standard
IDA* doesn’t perform duplicate state detection. Thus,
all work could be performed in the SM local+shared
memories, without using global memory. In many do-
mains, data structures representing each state are larger
and the IDA* state stacks will not fit in local memory.
Also, some powerful memory-intensive heuristics, e.g.,
PDBs (Korf and Felner 2002), will require at least the
use of global memory. Finally, standard approaches for
reducing duplicate state expansion, e.g., transposition
tables (Reinefeld and Marsland 1994), require
significant memory. Thus, future work will focus on methods
which use GPU global memory effectively so that do-
mains with larger states, memory-intensive heuristics,
and memory-intensive duplicate pruning techniques can
be used.
References
[Burns et al. 2012] Burns, E. A.; Hatem, M.; Leighton,
M. J.; and Ruml, W. 2012. Implementing fast heuristic
search code. In SOCS.
[Edelkamp, Sulewski, and Yücel 2010] Edelkamp, S.;
Sulewski, D.; and Yücel, C. 2010. Perfect hashing for
state space exploration on the GPU. In ICAPS, 57–64.
[Hayakawa, Ishida, and Murao 2015] Hayakawa, H.;
Ishida, N.; and Murao, H. 2015. GPU-acceleration of
optimal permutation-puzzle solving. In Proceedings of
the 2015 International Workshop on Parallel Symbolic
Computation, 61–69. ACM.
[Herlihy and Wing 1990] Herlihy, M. P., and Wing,
J. M. 1990. Linearizability: A correctness condition for
concurrent objects. ACM Transactions on Program-
ming Languages and Systems (TOPLAS) 12(3):463–
492.
[Korf and Felner 2002] Korf, R. E., and Felner, A. 2002.
Disjoint pattern database heuristics. Artificial intelli-
gence 134(1):9–22.
[Korf 1985] Korf, R. E. 1985. Depth-first iterative-
deepening: An optimal admissible tree search. Artificial
intelligence 27(1):97–109.
[Mahanti and Daniels 1993] Mahanti, A., and Daniels,
C. J. 1993. A SIMD approach to parallel heuristic
search. Artificial Intelligence 60(2):243–282.
[Powley and Korf 1989] Powley, C., and Korf, R. E.
1989. Single-agent parallel window search: A summary
of results. In IJCAI, 36–41.
[Powley, Ferguson, and Korf 1993] Powley, C.; Fergu-
son, C.; and Korf, R. E. 1993. Depth-first heuris-
tic search on a SIMD machine. Artificial Intelligence
60(2):199–242.
[Rao, Kumar, and Ramesh 1987] Rao, V. N.; Kumar,
V.; and Ramesh, K. 1987. A parallel implementation
of Iterative-Deepening-A*. Artificial Intelligence
Laboratory, University of Texas at Austin.
[Reinefeld and Marsland 1994] Reinefeld, A., and Mars-
land, T. A. 1994. Enhanced iterative-deepening search.
IEEE Transactions on Pattern Analysis and Machine
Intelligence 16(7):701–710.
[Rocki and Suda 2009] Rocki, K., and Suda, R. 2009.
Parallel minimax tree searching on GPU. In Interna-
tional Conference on Parallel Processing and Applied
Mathematics, 449–456. Springer.
[Sulewski, Edelkamp, and Kissmann 2011] Sulewski,
D.; Edelkamp, S.; and Kissmann, P. 2011. Exploiting
the computational power of the graphics card: Optimal
state space planning on the GPU. In ICAPS.
[Zhou and Zeng 2015] Zhou, Y., and Zeng, J. 2015.
Massively parallel A* search on a GPU. In AAAI, 1248–
1255.
A Appendices
A.1 Thread-Based Parallel IDA* Details
We provide further details on the thread-based parallel
IDA* implementations which were sketched in Section
4. The overall algorithmic scheme is
shown in Alg. 2.
PSimple (baseline) In this baseline configuration,
for each f -bounded iteration of IDA*, CreateRootSet
(Alg. 2, line 20) performs A* search from the start state
until as many unique states as the number of threads
are in OPEN. Then, each root is assigned to a thread.
No load balancing is performed, i.e., UpdateRootSet
and DynamicLoadBalance do nothing. The subtree size
under each root state can vary significantly, so some
threads may finish their subproblem much faster than other
threads. Each f -bounded iteration must wait for all
threads to complete, so PSimple has very poor load
balance. Therefore, some load balancing mechanisms
which (re)distribute the work among processors are nec-
essary.
PStaticLB (static load balancing) This configu-
ration adds static load balancing to PSimple. In IDA*,
the i-th iteration repeats all of the work done in iter-
ation i − 1. Thus, the number of states visited under
each root state in the iteration i − 1 can be used to
estimate the number of states which will be visited in
the current iteration i.
UpdateRootSet (Alg. 2, line 26) implements the
following static load balancing mechanism. Let load(n)
be the number of nodes expanded in the previous
iteration under node n. If load(n) > averageload, where
averageload is the average number of nodes expanded
under all of the root nodes, we split n as follows. We
initialize droots to ∅ and perform an A* expansion
of the search tree starting at n, adding generated nodes
to droots, until droots has ≥ load(n)/averageload
nodes. Then, we remove n from the root set, set
load(m) ← load(n)/|droots| for each m ∈ droots, and
add each m ∈ droots to the root set.
Then, we allocate each root in the root set to threads.
For each thread t, roots are assigned to t until the
sum of the load(n) for the roots assigned to t exceeds
averageload. Roots are allocated to threads in the
order in which they were generated. We experimented with
other orders for this assignment but did not obtain
significant differences.
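The split threshold and the greedy root-to-thread allocation described above can be sketched as follows. This is an illustration under our own assumptions: the names `split_count` and `assign_roots` are hypothetical, roots are identified by their index in generation order, the threshold is passed in as `target_load` (averageload in the text), the loads are made-up numbers, and the A* expansion that actually produces the sub-roots is not shown.

```python
import math

def split_count(load, average_load):
    """Minimum number of sub-roots that a heavy root n is split into,
    i.e. the smallest integer >= load(n) / averageload."""
    return math.ceil(load / average_load)

def assign_roots(loads, target_load):
    """Assign roots (indices, in generation order) to threads.

    A thread keeps receiving roots until the sum of their estimated loads
    exceeds target_load; then the next thread starts receiving roots.
    """
    threads, current, current_load = [], [], 0
    for root, load in enumerate(loads):
        current.append(root)
        current_load += load
        if current_load > target_load:
            threads.append(current)
            current, current_load = [], 0
    if current:                      # leftover roots go to one last thread
        threads.append(current)
    return threads

print(split_count(10, 4))                        # 3
print(assign_roots([4, 1, 1, 3, 2, 1], 3))       # [[0], [1, 2, 3], [4, 5]]
```

The sketch does not cap the number of threads; a real implementation would have to reconcile the resulting groups with the fixed number of GPU threads.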
A goal node might be found during this split-
redistribute phase, but the path to such a goal node
may or may not be optimal-cost, so it needs to be re-
turned to OPEN without expanding it.
This rebalancing phase is performed on the CPU, so
duplicate detection and pruning via the CLOSED list
is performed during rebalancing.
PFullLB (thread-parallel with dynamic load
balancing) This configuration adds dynamic load
balancing to PStaticLB, i.e., the DynamicLoadBalance
function (Alg. 2, line 8) is enabled.
Dynamic load balancing moves work to idle threads
from threads with remaining work during an
iteration. On a GPU, work can be transferred between
two threads within a single block relatively cheaply
because this can be done using the shared memory within
a block, while transferring work between two threads
which are in different blocks is expensive because it
requires access to the global memory. When dynamic load
balancing is triggered, then within a block, idle threads
steal work from threads with remaining work.

Algorithm 2 Parallel IDA*
 1: function DFS(root, goals, limitf)
 2:   openList ← root
 3:   fnext ← ∞
 4:   repeat
 5:     s ← pop(openList)
 6:     if s ∈ goals then
 7:       return s and its parents as a shortest path
 8:     if dynamic load balance is triggered then DynamicLoadBalance(stat)
 9:     for all a ∈ applicable actions(s) do
10:       t ← successor(a, s)
11:       fnew ← g(s) + cost(a) + h(t)
12:       if fnew ≤ limitf then
13:         openList ← t
14:       else
15:         fnext ← min(fnext, fnew)
16:   until openList is empty
17:   return fnext, stat    ▷ no plan is found
18:
19: function ParallelIDA*(start, goals)
20:   rootSet ← CreateRootSet(start, goals)
21:   limitf ← DecideFirstLimit(rootSet)
22:   repeat
23:     parallelForByThreads root ∈ rootSet do
24:       limitf, stat ← DFS(root, goals, limitf)
25:     end parallelForByThreads
26:     UpdateRootSet(start, goals, rootSet, stat)
27:   until shortest path is found
We experimented with various dynamic load balanc-
ing strategies including variants of policies investigated
by (Powley and Korf 1989; Mahanti and Daniels 1993).
The best-performing policy for triggering dynamic load
balancing (the trigger is checked in Algorithm 2, line 8) is
based on the policy by Powley and Korf: perform load
balancing when the fraction of idle (completed) threads
within a block exceeds W(t)/(L + t), where L is the time
spent on the previous load balancing operation, t is the
time since the previous load balancing operation, and W(t)
is the amount of work (number of nodes visited) since
the previous load balancing operation. Note that time
t is not wall-clock time since the last rebalancing, but
the actual amount of GPU time consumed. In addition,
load rebalancing is constrained so that it cannot
be triggered until at least L/2 seconds have passed since
the previous rebalancing.
A.2 Evaluation of Thread-Parallel IDA*
PSimple on 1536 cores required a total of 3378 seconds
to solve all 100 problems, a speedup of only 18.6 com-
pared to G1 (1 core on the GPU).
We define load balance as maxload/averageload,
where averageload is the average number of nodes vis-
ited among all threads, and maxload is the number of
states visited by the thread which performed the most
work (visited the largest number of nodes).
Figure 2: Maximum thread workload in PSimple
Figure 2 plots, for each of the 100 problem
instances, the maximum workload (number of nodes
visited) among all threads of PSimple relative to the
average workload, for the next-to-last iteration. A value
of 1 means perfect load balance. We measure load
balance for the next-to-last iteration because in the last
iteration, only part of the tree needs to be expanded.
Figure 2 shows that some threads expand up to 600
times as many nodes as the average thread, indicating
that PSimple has extremely poor load balance.
Figure 3a compares the relative runtimes, i.e., the y-
axis shows Runtime(PSimple)/Runtime(PStaticLB) for
each instance. PStaticLB required a total of 1069 sec-
onds to solve all 100 problems, a speedup of 3.16 com-
pared to PSimple.
Figure 3b plots, for each of the 100 problem instances,
the maximum load (number of nodes visited) among
all threads of PStaticLB relative to the average load
for all threads for the next-to-last iteration. On one
hand, Figure 3b indicates that maxload/averageload
of PStaticLB is significantly smaller than that of PSim-
ple. This explains the 3.16x speedup of PStaticLB com-
pared to PSimple. On the other hand, the load bal-
ance is still quite poor, with some problems having a
maxload/averageload ratio > 50, with most problem
instances having a maxload/averageload > 5.
Figure 3d compares the relative runtimes, i.e., the
y-axis shows Runtime(PStaticLB)/Runtime(PFullLB)
for each instance. PFullLB required a total of 892 sec-
onds to solve all 100 problems, a speedup of 1.20 com-
pared to PStaticLB.
Figure 3e plots, for each of the 100 problem instances,
the maximum load (number of nodes visited) among all
threads of PFullLB relative to the average load for all
threads for the next-to-last iteration. Figure 3f overlays
the results in Figure 3e with the results in Figure 3b.
Figures 3b and 3e indicate that maxload/averageload of
PFullLB is significantly smaller than that of PStaticLB.
In particular, the worst maxload/averageload ratios
have been reduced to < 20 (compared to > 50 for
PStaticLB). Nevertheless, most of the maxload/averageload
ratios are > 4, indicating that even the combination
of static and dynamic load balancing is insufficient for
achieving good load balance.
(a) Relative runtimes: PSimple vs. PStaticLB
(b) Maximum thread workload: PStaticLB
(c) Max workload comparison: PSimple vs. PStaticLB
(d) Relative runtimes: PStaticLB vs. PFullLB
(e) Maximum thread workload: PFullLB
(f) Max workload comparison: PStaticLB vs. PFullLB
Figure 3: PStaticLB Evaluation
We experimented extensively with both static and
dynamic load balancing strategies, but so far, we have
not found any strategy/configuration that significantly
outperforms the results presented here.
