A parallel priority queue with fast updates for GPU architectures by Iacono, John et al.
A parallel priority queue with fast updates for GPU architectures
John Iacono∗ Ben Karsin† Nodari Sitchinava‡
Abstract
The high computational throughput of modern graphics processing units (GPUs) make them
the de-facto architecture for high-performance computing applications. However, to achieve
peak performance, GPUs require highly parallel workloads, as well as memory access patterns
that exhibit good locality of reference. As a result, many state-of-the-art algorithms and data
structures designed for GPUs sacrifice work-optimality to achieve the necessary parallelism.
Furthermore, some abstract data types are avoided completely due to there being no corresponding
data structure that performs well on the GPU. One such abstract data type is the priority queue.
Many well-known algorithms rely on priority queue operations as a building block. While
various priority queue structures have been developed that are parallel, cache-aware, or
cache-oblivious, none has been shown to be efficient on GPUs. In this paper, we present
the parBucketHeap, a parallel, cache-efficient data structure designed for modern GPU
architectures that supports standard priority queue operations, as well as bulk update. We
analyze the structure in several well-known computational models and show that it provides
both optimal parallelism and is cache-efficient. We implement the parBucketHeap and, using
it, we solve the single-source shortest path (SSSP) problem. Experimental results indicate that,
for sufficiently large, dense graphs with high diameter, we out-perform current state-of-the-art
SSSP algorithms on the GPU by up to a factor of 5. Unlike existing GPU SSSP algorithms,
our approach is work-optimal and places significantly less load on the GPU, reducing power
consumption.
1 Introduction
In the past decade, graphics processing units (GPUs) have emerged as the most effective hardware
architectures for solving computationally intensive problems. The high computational throughput
and relatively low power consumption have made GPUs and many-core architectures the most
frequently used hardware systems for large compute clusters and supercomputers. However, many
classical algorithms and data structures that were developed for sequential machines generate
workloads that are not well-suited for highly parallel GPUs. As a result, many state-of-the-art
GPU-efficient libraries employ brute-force techniques to more effectively leverage the hardware,
resulting in algorithms that are theoretically sub-optimal. Additionally, GPU-efficient variations of
many fundamental data structures have yet to be developed.
One fundamental abstract data type (ADT) that lacks a GPU-efficient data structure is the
priority queue. The priority queue ADT is defined as a collection of elements that supports update,
delete, and extractMin (return the smallest element in the collection). The prototypical priority
queue data structure is the heap, with many variants providing different performance tradeoffs.
∗Universite´ Libre de Bruxelles.
†Universite´ Libre de Bruxelles.
‡University of Hawaii an Manoa.
ar
X
iv
:1
90
8.
09
37
8v
1 
 [c
s.D
S]
  2
5 A
ug
 20
19
Many of the well-known abstract computational models has a corresponding priority queue data
structures that minimizes the cost of each operation. However, the complexity of GPU architectures
makes it unclear which model best captures GPU performance and if any such existing data structure
is well-suited.
GPU-efficient algorithms require a high degree of parallelism while exhibiting memory access
patterns with good locality of reference. Thus, GPU performance is best captured by computational
models designed to model parallel machines and those that focus on cache efficiency. The parallel
random access machine (PRAM) model [1] is the most well-known parallel computational model,
where the running time of an algorithm is measured by the number of parallel memory accesses
it performs. The PRAM model has several variants that restrict or allow concurrent reading or
writing to the same memory location by different processors. In this work, we consider only the
EREW (exclusive read, exclusive write) PRAM model that disallows such concurrent accesses. The
external memory (EM) model [2] (also known as the disk access model) considers a single processor
with internal memory of size M and measures the cost of an algorithm by the number of block
transfers it performs, where each block transfer loads B contiguous elements into internal memory.
The PEM model [3] seeks to combine these two models, with P processors each having M internal
memory, and cost is measured in the number of parallel block accesses, where each processor loads
B elements into internal memory.
Table 1: Time to perform priority queue operations in different models.
RAM / PRAM EM / PEM
Data structure extractMin bulkUpdate extractMin bulkUpdate
Seq. Bucket
Heap [4]
O(log n) O(d log n) O
(
1
B log
n
B
)
O
(
d
B log
n
B
)
Parallel Priority
Queue [5]
O(1) O(d) O(1) O(d)
This work O(1) O(log d) O
(
log (n/M)
PB +
1
B
)
O
(
d log (n/M)
PB +
d
B
)
1.1 Contributions In this work, we present the parallel bucket heap (parBucketHeap), a
GPU-efficient priority queue data structure. We base the parBucketHeap on the cache-oblivious
bucket heap of Brodal et al. [4], and extend it to allow for parallelism. Thus, when performing
extractMin, update, and delete, the parBucketHeap matches the performance of the current
best structures in the EM and PRAM models, and significantly reduces the cost in the PEM model
(see Table 1 for a summary of these results). Furthermore, the parBucketHeap allows for efficient
batching of updates into a single bulkUpdate operation that further reduces the cost per update.
The bulkUpdate operation is particularly useful when the parBucketHeap is used to solve the
single-source shortest path (SSSP) problem using a variant of Dijkstra’s algorithm, as batches of
update operations are necessary.
We implement the parBucketHeap using the CUDA API and experimentally evaluate its
performance on two modern GPUs. We demonstrate that the parBucketHeap can efficiently
perform large batches of updates. Using the parBucketHeap, we implement a variant of Dijkstra’s
algorithm to solve the SSSP problem. We evaluate the performance of our SSSP algorithm and
2
compare results with the SSSP algorithm available with the state-of-the-art GPU graph algorithms
library, nvGRAPH [6]. We show that, for sufficiently dense graphs with long shortest paths, our
algorithm outperforms the nvGRAPH SSSP implementation by up to 5.3×. Finally, we demonstrate
that, due to the work-efficient nature of our SSSP algorithm, using nvGRAPH results int he GPU
drawing up to 3.1× more power to solve instances of the SSSP problem with the same execution
time.
The paper is organized as follows: in Section 2 we provide background and discuss related
work, in Section 3 we present our parallel bucket heap data structure, and in Section 4 we provide
implementation details and experimental results.
2 Background and Related work
In this work we consider the priority queue abstract data type (ADT). Formally, the ADT is
defined over a collection of elements, Q, where each element consists of a value and priority, i.e.,
e ∈ Q = (val, p). For each element e, we define e.val and e.p to be the value, and priority,
respectively. The ADT supports the following operations: extractMin removes and returns the
element in Q with the smallest priority, i.e., e ∈ Q s.t. e.p = min
ei∈Q
ei.p. update(e) adds new element
e = (val, p) to Q and, if there exists an e′ = (val, p′) with the same value in Q, then remove e′ (so
that it is replaced by e).1 delete(e) removes e from Q if it is contained in Q.
There are many data structures defined that implement the priority queue ADT, including
various types of heaps [4, 5, 7–9]. We note that many other data structures, including binary search
trees or sorted arrays, can be used as priority queues, though they provide additional functionality
and are therefore not as efficient when performing only the above operations. Though not considered
part of the standard priority queue ADT, in this work we additionally consider the bulkUpdate(U)
operation that, given a set of elements U , performs update(e) for each e ∈ U .
We formally define what it means for a data structure to correctly implement the priority
queue ADT as follows. Given a series of n operations O = Op(1),Op(2), . . . ,Op(n), comprised of
extractMin, update, and delete operations, any extractMin, Op(i), performed must retrieve
the element with the smallest priority in the collection. That is, given that the data structure
contains the elements Q, if we perform k update operations, inserting elements U , followed by an
extractMin, the extractMin must return the element e ∈ Q ∪ U such that e.p = min
ei∈Q∪U
ei.p.
Similarly, if the if the data structure contains Q and we perform k delete operations to remove
the elements D, an extractMin must return element e ∈ Q \D such that e.p = min
ei∈Q\D
ei.p (where
Q \D represents all elements in Q that are not in D).
Several fundamental priority queue data structures provide tradeoffs between simplicity and
performance (e.g., binary heaps, Fibonacci heaps [10], pairing heaps [8], etc.). However, these
heaps are inherently sequential and operations on heaps cannot be easily parallelized. Thus, several
priority queues have been developed to expose parallelism [5,9,11], with Brodal et al. [5] presenting
a structure that performs all standard priority queue operations in constant time and logarithmic
work in the parallel random access machine (PRAM) model [1].
To our knowledge, all existing parallel priority queue data structures are not cache-efficient, and
as such may not perform well on GPUs or other parallel systems that rely on locality of reference to
1In this work, we assume that updates only decrease priority (i.e., p < p′). However, this assumption can easily be removed
by adding a timestamp to each element when it is inserted into the structure and, when deleting duplicate entries, the most
recent timestamp is kept.
3
achieve peak performance. While not inherently parallel, in the context of the EM or cache-oblivious
models, the cache-oblivious bucket heap [4] and buffer heap [12] structures achieve sub-constant time
operations when the block size, B, is sufficiently large. Since there are no parallel, cache-efficient
priority queue structures, few works have considered using priority queues on GPUs [13, 14]. While,
in 2012, He et al. [14] presented a priority queue that could achieve a 30x speedup over sequential
execution, Baudis et al. [13] more recently demonstrated that, for small queues of up to 500 items,
simple circular buffers out-perform tree-based queues for a range of applications.
In this work, we focus on using priority queue structures to solve the single-source shortest paths
(SSSP) problem. Given a graph consisting of n vertices, m weighted edges, and a source vertex s,
the SSSP problem asks to find, for each vertex v, the path from s to v such that sum of the weights
of all edges on the path is smallest. We assume that all edges have non-negative weight edges and,
for this work, we consider both directed and undirected graphs. In the sequential setting, Dijkstra’s
algorithm [7] using a Fibonacci heap solves the SSSP problem in O(m + n log n) on any graph with
n vertices and m edges. While this is the best sequential result for general graphs with non-negative
edge weight, several algorithms have been proposed that improve this for special cases [15–17]. For
example, Thorup et al. [15] present a algorithm that runs in O(m + n) time if edge weights are
positive integers.
In the parallel setting, the SSSP problem suffers from the transitive closure bottleneck [18]. Thus,
finding an algorithm that is work-efficient (i.e., the same work complexity as Dijkstra’s algorithm)
with o(n) runtime remains an important open problem. Nevertheless, the SSSP problem is in NC [19],
as there exist algorithms with poly-logarithmic runtime using a polynomial number of processors [20].
These algorithms, however, perform significantly more work than Dijkstra’s algorithm, making them
impractical. As a result, many alternative parallel SSSP algorithms have been proposed [16,17, 21].
Crauser et al. [16] present a method of reducing the depth of Dijkstra’s algorithm to O(n1/3) on
graphs with certain properties, with high probability. Meyer et al. [17] introduced a hybrid between
the work-inefficient Bellman-Ford algorithm [7] and Dijkstra’s algorithm, increasing parallelism
without sacrificing work efficiency. This work was further improved by Blelloch et al. [21], although
both approaches are sub-optimal in the worst case because they rely on properties of the graphs
being processed.
To perform well on modern GPU architectures, algorithms need to have both a high degree of
parallelism and memory access patterns with good locality of reference. As such, many current
state-of-the-art GPU libraries avoid work-efficient algorithms and data structures in favor of more
parallel brute-force types of approaches. In 2007, Harish et al. [22] demonstrated that a Bellman-
Ford approach is well-suited to GPUs, achieving performance gains over sequential approaches for
scale-free graphs. On more dense graphs, however, this approach does not perform as well, and in
2008, Katz et al. [23] presented a GPU-efficient Floyd-Warshall implementation to solve the all-pairs
shortest path (APSP) problem, achieving up to 6.5× speedup over the algorithm of Harish et al.
More recently, Davidson et al. [24] developed and experimentally evaluated several GPU-efficient
SSSP algorithms and, while not work-efficient, showed significant performance gains over sequential
approaches on a range of graphs. A similar approach was further optimized by Wang et al. [25]
and integrated into the Gunrock framework. The current state-of-the-art GPU graph algorithms
are in the nvGRAPH library [6] that is included with the CUDA toolkit [26]. The SSSP algorithm
employed by nvGRAPH is a highly optimized variation of Bellman-Ford with heuristics designed to
reduce overall work while maximizing parallelism. However, since the their approach is based on
the Bellman-Ford algorithm, it is not work-efficient and may perform poorly in the worst case, as
4
we demonstrate in this work.
3 Parallel bucket heap
In this section we describe the parBucketHeap, a GPU-efficient data structure that implements
the priority queue ADT. The structure is based on the cache-oblivious bucket heap structure of
Brodal et al. [4]. Thus, we begin by providing an overview of the bucket heap structure.
3.1 Sequential bucket heap The bucket heap [4] is made up of a series of buckets, Bi, and
signal buffers, Si, where the capacity of Bi is 2
2i+1 and Si is 2
2i. If the heap contains n elements,
there are a total of dlog4 n + 1e buckets, B0, . . . , Bdlog4 n+1e−1 and signal buffers S0, . . . , Sdlog4 n+1e−1.
Each bucket and buffer contains some number of elements, where each element e consists of a priority,
e.p, and value, e.val. Each bucket Bi and buffer Si contains |Bi| and |Si| elements, respectively,
that are sorted by value and the structure maintains the “heap” property that any element in Bi
has smaller priority than any element in Bi+1 or in Si+1. To accomplish this, each level maintains
a maximum priority, pi such that, for any element e ∈ Bj , if j ≤ i, then e.p ≤ pi. Similarly, if
e ∈ Bj ∪ Sj and j > i, then e.p > pi. If i = dlog4 n + 1e − 1 (i.e., the largest level), then pi = ∞.
Figure 1 illustrates the sequential bucket heap structure [4] and shows the relationship between
elements stored at each level.
· · ·
· · · · · ·
B0 S0
B1 S1
S2B2
Below are all ≥ p0
update/deleteextractMin
Fill(B1)
|Bi| = 22i+1 |Si| = 22i
· · ·
Empty(S1)
Figure 1: Illustration of the sequential bucket heap structure of Brodal et al. [4]. updates and
deletes are inserted into S0, while extractMins are removed from B0. Empty(Si) empties
Si into Si+1, Fill(Bi) fills Bi from Bi+1, while pi is maintained at each level to ensure the heap
property between levels.
The heap property of the bucket heap ensures that, if B0 ∪ S0 is non-empty, it will contain
the element with smallest priority in the structure. Thus, as long as B0 ∪ S0 is kept non-empty,
extractMin can simply remove and return the minimum priority element in B0 ∪ S0. Similarly,
as long as S0 is kept non-full, update(e) can simply insert e into S0. delete operates similar
to update using an element with a special priority value, DEL, that moves down the structure,
annihilating elements with matching values. The remainder of the bucket heap description involves
keeping B0 non-empty and S0 non-full while maintaining the heap property, which is accomplished
with the Fill and Empty operations, respectively.
At a high level, if any signal buffer Si becomes full, Empty(Si) is called, pushing elements to
the level i + 1 below and leaving Si empty. To maintain the heap property, Empty(Si) operates as
follows: i) Si and Bi are merged by values, ii) for any elements with duplicate values, remove those
with larger priority, and iii) all elements e such that e.p > pi are moved to Si+1. If the resulting Bi
is too full (i.e., there are too many elements with e.p ≤ pi), then pi is updated so that |Bi| = 22i+1.
5
Updating the priority of an existing element is accomplished when elements with duplicate
values are found and the element with larger priority is removed (elements with special delete
priorities are also applied this way, see [4] for details). Since lists are sorted by value, removing
elements with duplicate values can be performed with a scan. If the number of elements in a bucket
Bi fall under the minimum size (e.g., half full), Fill(Bi) is called, which fills Bi with elements from
the level below. This is accomplished by i) calling Empty(Si+1), ii) filling Bi with the smallest
priority elements in Bi+1, and iii) updating pi. All of these operations are performed via scans of
contiguous arrays, implying that the sequential bucket heap structure is cache-oblivious. In the
EM model, the amortized cost of extractMin, update, or delete using the sequential bucket
heap is O( 1B log
n
B ).
3.2 Parallel bucket heap definition As with the sequential bucket heap, our parallel bucket
heap, parBucketHeap, is comprised of a series of levels, each containing a bucket Bi and signal
buffer Si. We maintain the same heap property as the sequential bucket heap (described above), so
extractMin (Algorithm 1), update (Algorithm 2), and delete (Algorithm 3) only operate on
B0 and S0. We define ` to be the maximum non-empty level of the parBucketHeap.
Unlike the sequential bucket heap, we add the restriction that S0 be empty as a precondition
to extractMin, update, and delete. Furthermore, we increase the maximum capacity of every
bucket and signal buffer by the parameter d, so the capacities of Si and Bi are d2
2i and d22i+1,
respectively. This allows us to perform a bulkUpdate(U) (Algorithm 4) of up to d elements
efficiently by simply inserting them all into S0. However, increasing the capacity of S0 necessitates
that extractMin perform FindMin(B0), which involves scanning B0 to find the element with
smallest priority. Thus, while the d parameter defines the maximum number of updates that can
be performed in bulk, it also increases the cost of perform extractMin. We note that we use a
special delete signal priority (DEL in the pseudocode) that is used to handle delete operations.2
Algorithm 1 extractMin()
Precondition: if ` > 0, |B0| ≥ d
Precondition: |S0| = 0
Postcondition: if ` > 0, |B0| ≥ d
e← FindMin(B0)
B0 ← B0 \ e
S0 ← {(e.val,DEL)}
return e
Algorithm 2 update(e)
Precondition: |S0| = 0
Postcondition: |S0| = 1
S0 ← {e}
To simplify our analysis and description, we combine the Empty and Fill operations of the
sequential bucket heap of Brodal et al. [4] into a single operation, Resolve. The Resolve(i)
operation (pseudocode detailed in Algorithm 5) both fills Bi and empties Si, leaving |Bi| = 22i+1
2In our implementation, we use DEL = −1, since we only use non-negative priorities.
6
Algorithm 3 delete(val)
Precondition: |S0| = 0
Postcondition: |S0| = 1
S0 ← {(val,DEL)}
Algorithm 4 bulkUpdate(U)
Precondition: |S0| = 0
Precondition: U is sorted by value and |U | ≤ d
Postcondition: S0 = U
S0 ← U
(unless i = `) and |Si| = 0. Our description of the Resolve operation in Algorithm 5 is high-level
and the subroutines Merge, DeleteDuplicates, and Select can be implemented in different
ways, depending on the desired level of parallelism, which we discuss in our analysis in the subsequent
sections.
The sequential bucket heap described above empties signal buffers and fills buckets as needed,
resulting from extractMin, update, or delete operations. Thus, some operations can result
in large buckets or signal buffers being filled or emptied, respectively. In the sequential setting,
this is unavoidable. However, in the parallel setting, we can pro-actively resolve larger levels using
additional processors without interfering with the processor adding or removing elements from the
smallest level. For example, if we can assign 1 processor to each level of the parBucketHeap, we
can have each processor resolve it’s level when needed, allowing each operation to proceed without
having to resolve any large levels.
Theorem 3.1. If, for every level i > 0, we perform Resolve(i) after every 4th Resolve(i− 1),
then all preconditions are always satisfied.
Proof. Consider the pseudocode in Algorithm 5 and some level i > 0. Assume that, for all levels
0 < j < i, the preconditions of every call to Resolve(j) are satisfied. By the postconditions,
when Resolve(i) completes, |Si| = 0. Since we assume that the preconditions are satisfied for
Resolve(i − 1), each call adds at most d22(i−1) elements to Si, so if Resolve(i − 1) has been
performed at most 4 times, |Si| ≤ 4 · d22i−2 = d22i. If we assume that i < `, then after Resolve(i),
|Bi| = d22i+1. Each Resolve(i− 1) removes at most d22(i−1) elements from Bi, so after the 4th
Resolve(i− 1), |Bi| ≥ d22i+1 − 4 · d22(i−1) = d22i.
For the base case when i = 1, Resolve(1) completes, |B1| = 8d and |S1| = 0. Each call to
Resolve(0) removes at most d elements from B1 and adds d to S1. Thus, after the 4th Resolve(0),
|B1| ≥ 4d = d22i and |S1| ≤ 4d = d22i.
We now prove that the preconditions on Bi+1 and Si+1 are satisfied. Consider the 4th time that
Resolve(i) is being performed following a Resolve(i + 1). Since Resolve(i) has been performed
3 times since Resolve(i), |Si+1| ≤ 3 · d22i = d22(i+1)− d22i. If i+ 1 > `, then after Resolve(i+ 1),
|Bi+1| = d22(i+1)+1. Thus, before the 4th Resolve(i), |Bi+1| ≥ (d22(i+1)+1−3·d22i) = d22(i+1)+d22i.
3.2.1 Parallel execution sequence Consider a series of n operations, defined as
Op(1),Op(2), . . . ,Op(n), where each operation is extractMin, update, delete, or bulkUpdate.
7
Algorithm 5 Resolve(i)
Precondition: if i < `, then |Bi| ≥ d22i
Precondition: |Si| ≤ d22i
Precondition: if i + 1 < `, then |Bi+1| ≥ d22(i+1) + d22i
Precondition: |Si+1| ≤ d22(i+1) − d22i
Postcondition: if i < `, |Bi| = d22i+1
Postcondition: |Si| = 0
1: if |Si| > 0 then . Empty Si if needed
2: Bi ← Merge(Si, Bi); Si = ∅
3: Bi ← DeleteDuplicates(Bi)
4: num← |{e : e ∈ Bi and e.p ≤ pi}| . Count elements with small priority
5: if num > 22i+1 then . Update pi if needed
6: pi ← Select(Bi, 22i+1)
7: B′i = {e : e ∈ Bi and e.p > pi} . Move large priority elements to Si+1
8: Bi = {e : e ∈ Bi and e.p ≤ pi}
9: Si+1 ←Merge(Si+1, B′i)
10: if |Bi| < 22i+1 and i is not largest non-empty level then . Fill Bi if needed
11: Bi+1 ←Merge(Bi+1, Si+1); Si+1 ← ∅
12: Bi+1 ← DeleteDuplicates(Bi+1)
13: pi ← Select(Bi+1, 22i+1−|Bi|)
14: B′i+1 ← {e : e ∈ Bi+1 and e.p ≤ pi} . Pull elements up to fill Bi
15: Bi+1 ← {e : e ∈ Bi+1 and e.p > pi}
16: Bi ← Merge(Bi, B′i+1)
17: Si+1 ← {e : e ∈ Bi+1 and e.p > pi+1} . Move large priority elements back to Si+1
8
L
ev
el
0
1
2
3
Time →
· · · · · ·Res0(43k)
Res2(4k)
Res3(k)
Res2(4k+1)
T (R3)
Op(43k)
Res1(4
2k)
Figure 2: Illustration of the dependencies when performing a series of operations. Small green boxes
represent operations (extractMin, update, delete, or bulkUpdate) and larger boxes represent
Resolves. Dependencies are shown with arrows between boxes. Since resolving larger levels takes
more time, we represent them by wider boxes. Width scaled to show that, if Resolve(0) takes d
time, Resolve(i) can take d22i time without delaying any operations (which occur every 5d parallel
memory accesses).
Clearly, to correctly perform all operations, we must resolve different levels of the parBucketHeap
periodically. We define Resi(k) to be the k-th execution of Resolve(i) during this series of opera-
tions. Theorem 3.1 defines a set of constraints on the order of execution of each Resi(k). We use
these constraints to define a set of dependencies, which we formally define using A→ B to denote
that B depends on A being complete in order for the preconditions of B to be met.
After each extractMin, update, delete, or bulkUpdate, we must perform a Resolve(0)
to ensure that the preconditions are met for subsequent operations and resolves (|B0| ≥ d and
|S0| = 0). Thus, we combine Op(k) and Res0(k) and refer to it simply as Res0(k). For all k > 1,
each Op(k) depends on Res0(k − 1) completing, so Res0(k − 1)→ Res0(k). By Theorem 3.1, we
have that, for all i > 0 and k ≥ 1, Resi−1(4k) → Resi(k) Finally, we need to avoid concurrent
accesses to arrays Bi and Si during parallel executions of Resolve on various levels.
Lemma 3.1. Any number of non-adjacent levels can be resolved concurrently.
Proof. Consider two levels i and j such that they are non-adjacent (i.e., j < i − 1 or j > i + 1).
The Resolve(i) operation, detailed in Algorithm 5, updates the values and sizes of Bi, Si, Bi+1,
and Si+1. Thus, if j > i+ 1, then Resolve(i) and Resolve(j) do not update the values or sizes of
any of the same lists and can therefore be correctly performed concurrently. Similarly, if j < i− 1,
then j + 1 < i and Resolve(i) and Resolve(j) can be performed concurrently. Since, for any
non-adjacent i and j, Resolve(i) and Resolve(j) are independent, then resolves on any set of
non-adjacent levels are all independent and can therefore also be performed concurrently.
This lemma gives us the additional constraint that no two consecutive levels can be resolving
concurrently. This leads to every resolve having the dependency that the resolve on the next (larger)
level must complete before it can begin, i.e., Resi(k)→ Resi−1(4k + 1)
Intuitively, the dependencies defined between resolves generates a directed acyclic graph (DAG)
where each vertex represents a resolve operation and each edge is a dependency. Figure 2 illustrates
this DAG. In this figure, green boxes represent operations (extractMin, update, delete, or
bulkUpdate), other color boxes represent resolves, and the width of each box is the amount of
time needed to perform it.
9
3.3 EREW PRAM analysis We now evaluate the cost of performing n operations (where each
operation in extractMin, update, delete, or bulkUpdate in the EREW PRAM model [1]
(described in Section 1). To do so, we use the dependencies defined above to define a start and
ending time of each Resolve. We then compute the time needed to perform each Resolve to
determine the total time to perform n operations.
We define T (Ri) to be the time needed to perform Resolve(i), and we define starti(k) and
endi(k) to be the start and end times of Resi(k), respectively. With these definitions, we formalize
the start and end times of each resolve operation based on the dependencies defined above (and
illustrated in Figure 2). For all i > 0, each Resolve(i) depends on the 4th resolve of the previously
level, i.e.,
starti(k) = endi−1(4k)
And the 1st Resolve(i) after the next (larger) level starts also depends on it completing, so if
k mod 4 = 1
starti(k) = max (endi−1(4k), endi+1((k − 1)/4))
Each Resolve(i) takes T (Ri) time to complete, thus
endi(k) = starti(k) + T (Ri)
Thus, for any resolve, our dependencies define the completion time of Resi(k) as
endi(k) =

max(endi−1(4k), endi+1(k−14 ))+T (Ri) if i ≥ 1 and k mod 4 = 1
endi−1(4k) + T (Ri) if i ≥ 1 and k mod 4 6= 1
end1((k − 1)/4) + T (R0) if i = 0 and k mod 4 = 1
end0(k − 1) + T (R0) if i = 0 and k mod 4 6= 1
for all k > 1. Using these constraints, we can simply compute end0(n) to determine the time needed
to perform n operations. To compute this, we must determine T (Ri) for each level i.
Lemma 3.2. Resolve(i) takes T (Ri) = O(log d + i + 1) time and O(d2
2i) work in the EREW
PRAM model, for all i ≥ 0 and d ≥ 1.
Proof. Consider the Resolve(i) operation and we refer to line numbers in the pseudocode of
Algorithm 5. The resolve operation relies on several primitives including Merge, Select, and
DeleteDuplicates. Select and Merge each take O(log n) time and O(n) work in the PRAM
model [27]. DeleteDuplicates involves i) identifying and deleting duplicate entries, and ii)
compressing the remaining elements into contiguous space. This can be accomplished with a parallel
scan and prefix sums, taking O(log n) time and O(n) work on a list of length n. All moving of
elements into contiguous space or compressing lists with deleted elements can be done using a
parallel scan and prefix sums as well.
If |Si| > 0 (line 1 of Algorithm 5), the resolve operation empties Si with i) Merge (line 2),
ii) DeleteDuplicates (line 3), iii) count small priority elements (line 4), iv) Select (lines 5-6),
move large priority elements to Si (lines 7-8), and v) Merge (line 9). If |Bi| < d22i+1 (line 10),
then Resolve(i) refills Bi with i) Merge (line 11), ii) DeleteDuplicates (line 12), iii) Select
10
(line 13), iv) move small priority elements out of Bi+1 (lines 14-15), v) Merge (line 16), and vi)
move large priority elements back to Si+1 (line 17). Thus, in the EREW PRAM model, the time to
perform Resolve(i) is
T (Ri) = O
( Merge︷ ︸︸ ︷
log |Bi|+
DeleteDuplicates︷ ︸︸ ︷
log (|Bi|+ |Si|) +
count︷ ︸︸ ︷
log (|Bi|+ |Si|) +
move/compress︷ ︸︸ ︷
log (|Bi|+ |Si|) +
Merge︷ ︸︸ ︷
log |Si+1|+
Merge︷ ︸︸ ︷
log |Bi+1|
+
DeleteDuplicates︷ ︸︸ ︷
log (|Bi+1|+ |Si+1|) +
Select︷ ︸︸ ︷
log (|Bi+1|+ |Si+1|) +
move/compress︷ ︸︸ ︷
log (|Bi+1|+ |Si+1|) +
Merge︷ ︸︸ ︷
log |Bi+1|
+
move/compress︷ ︸︸ ︷
log |Bi+1|+ |Si+1|
)
= O(log (d22(i+1)+1))
Thus, for all i ≥ 0,
T (Ri) = O(log d + i + 1)
and the amount of work is
W (Ri) = O
(Merge︷︸︸︷
|Bi| +
DeleteDuplicates︷ ︸︸ ︷
(|Bi|+ |Si|) +
count︷ ︸︸ ︷
(|Bi|+ |Si|) +
move/compress︷ ︸︸ ︷
(|Bi|+ |Si|) +
Merge︷ ︸︸ ︷
|Si+1|+
Merge︷ ︸︸ ︷
|Bi+1|
+
DeleteDuplicates︷ ︸︸ ︷
(|Bi+1|+ |Si+1|) +
Select︷ ︸︸ ︷
(|Bi+1|+ |Si+1|) +
move/compress︷ ︸︸ ︷
(|Bi+1|+ |Si+1|) +
Merge︷ ︸︸ ︷
|Bi+1|+
move/compress︷ ︸︸ ︷
(|Bi+1|+ |Si+1|)
)
= O(d22(i+1)+1) = O(d22i)
With T (Ri) computed for all levels i, we can solve the recurrence defined above. For simplicity
of notation, we divide our PRAM execution time into a series of timesteps, where at each timestep
we perform t parallel memory accesses such that any Op(k) and Res0(k) can be performed in a
single timestep.
Lemma 3.3. With Op(k) and Res0(k) being performed at timestep 5k − 5, all dependencies can be
satisfied with starti(k) = 5k · 4i − 5 + 13(4i − 1) and T (Ri) ≤ 22i timesteps, for all i ≥ 0 and k > 0.
Proof. Recall that our dependencies define the ending time of Resi(k) to be
endi(k) =

max(endi−1(4k), endi+1(k−14 ))+T (Ri) if i ≥ 1 and k mod 4 = 1
endi−1(4k) + T (Ri) if i ≥ 1 and k mod 4 6= 1
end1((k − 1)/4) + T (R0) if i = 0 and k mod 4 = 1
end0(k − 1) + T (R0) if i = 0 and k mod 4 6= 1
Thus, for any i > 0, if we assume that k mod 4 = 1, then
starti(k) = max(endi−1(4k), endi+1((k − 1)/4))
11
Assume that, for all j 6= i and k > 0, startj(k) = 5k ·4j−5 + 13(4j−1). For some resolution Resi(k),
there is a dependence on the smaller level (i− 1), and on the larger level (i + 1), i.e.,
starti(k) = max(
smaller level︷ ︸︸ ︷
endi−1(4k),
larger level︷ ︸︸ ︷
endi+1((k − 1)/4))
We first consider the dependence on the smaller level:
endi−1(4k) = starti−1(4k) + T (Ri−1)
= 5(4k) · 4i−1 − 5 + 1
3
(4i−1 − 1) + 4i−1
= 5k · 4i − 5 + 4
3
4i−1 − 1
3
= 5k · 4i − 5 + 1
3
(4i − 1)
The dependence on the larger level is
endi+1(
k−1
4
) = starti−1((k − 1)/4) + T (Ri+1)
= 5(
k − 1
4
) · 4i+1 − 5 + 1
3
(4i+1 − 1) + 4i+1
= 5k · 4i − 5 · 4i − 5 + 1
3
4i+1 +
3
3
4i+1 − 1
3
= 5k · 4i − 5 + 4
3
4i +
12
3
4i − 15
3
4i − 1
3
= 5k · 4i − 5 + 1
3
(4i − 1)
Thus, we have that
starti(k) = max
(
5k · 4i − 5 + 1
3
(4i − 1), 5k · 4i − 5 + 1
3
(4i − 1)
)
= 5k · 4i − 5 + 1
3
(4i − 1)
Finally, for the base case at level i = 0,
start0(k) = 5k − 5
= 5k · 40 − 5 + 1
3
(40 − 1)
Theorem 3.2. Using the parBucketHeap with d = 1, extractMin, update, and delete take
O(1) time and O(log n) work in the EREW PRAM model.
Proof. From Lemma 3.3, end0(k) = 5k − 4. Since we perform Op(k) along with Resolve(k), then
Op(n) is completed at timestep 5n − 4. Recall that each timestep performs t parallel memory
accesses such that Resolve(0) and an extractMin, update, delete, or bulkUpdate can be
12
performed in a single timestep. By Lemma 3.2, with d = 1, Resolve(0) can be performed in c
memory accesses, for some constant c. With d = 1, each Op(k) also takes a constant number of
memory accesses, so t is a constant. Therefore, performing n operations takes t(5n − 4) = O(n)
time, for O(1) time each.
The structure has a total of at most dlog4 ne + 1 levels while performing n operations. By
Lemma 3.3, with d = 1, if T (Ri) ≤ 22i, then all n operations can be correctly performed in O(n)
time. By Lemma 3.2, with Pi = c processors, for some constant c, T (Ri) = 2
2i. Thus, with
c · log4 n processors, we can perform n operations in O(n) time, requiring O(n log n) work, or
O(log n) amortized work per operation.
Theorem 3.3. Using the parBucketHeap with d > 1, extractMin, update, delete, and
bulkUpdate of up to d elements take O(log d) time and O(d log nd ) work in the PRAM model.
Proof. The proof of Lemma 3.2 applies to the parBucketHeap with all d > 1, resulting in
Resolve(i) taking O(log d+ i+ 1) time and O(d22i) work. With d > 1, Resolve(0) takes O(log d)
time, so each timestep takes O(log d) time and O(d) work. There are a total of O(log nd ) levels, so
each timestep takes a total of O(d log nd ) work. The proof of Theorem 3.2 applies, so end0(n) = O(n)
timesteps, thus we can perform n operations in O(n log d) time and O(nd log nd ) work, where each
operation is extractMin, update, delete, or a bulkUpdate of up to d elements. Therefore,
each operation takes O(log d) time and O(d log nd ) work in the PRAM model.
3.4 PEM analysis To optimize the performance of the parBucketHeap for the PEM model,
we modify it as follows. First, we set the capacity of B0 and S0 to
M
2 and
M
4 , respectively, so that
one processor can maintain them in internal memory. This allows the processor assigned to level
0 to perform all extractMin and update operations without expending any I/Os. Thus, the
amortized cost per extractMin or update is determined only by the Resolve operations. By
increasing the capacity of each level by a factor of M4 , we have that Bi and Si have capacity 2
2i−1M
and 22i−2M , respectively. Thus, the number of I/Os required for a single processor to perform
Resolve(i) is O
(
22iM
B
)
.
Lemma 3.4. The parBucketHeap can perform extractMin, update, or delete in
O
(
log (n/M)
PB +
1
B
)
parallel I/Os in the PEM model.
Proof. We assign processor P0 to level 0, and it always maintains B0 and S0 in internal memory.
Recall that extractMin, update, and delete simply remove or add elements to Bi and Si,
respectively. Thus, these operations can be performed without expending any I/Os, and we need to
simply compute the cost of resolves. We apply the resolution schedule from Section 3.2, with each
timestep performing c·MB I/Os per level. Since B0 and S0 have capacities M/2 and M/4, respectively,
we can perform M/4 operations for each Resolve(0) (at each timestep). Since we are using the
previously defined resolution schedule, the proof of Lemma 3.2 applies and end0(k) = 5k − 4. At
each timestep, we perform M/4 operations (extractMin, update, or delete), so we complete
n operations by timestep end0(
n
M/4) =
20n
M . With one processor per level (i.e., P = log4
n
M ) each
timestep takes c · MB parallel I/Os, so n operations takes 20nM · cMB = 20cnB = O( nB ) parallel I/Os, or
O( 1B ) I/Os per operation, amortized.
Since Brent’s principle [28] does not apply directly to the PEM model, we consider the case
where P < log4
n
M and compute the associated cost directly. With P < log4
n
M , we assign O(
n
M /P )
13
levels to each processor. At each step of the resolution schedule, each processors performs all of the
work associated with the levels it is assigned. Since each Resolve(i) is performed with a series
of operations that require scanning Bi, Si, Bi+1, and Si+1, we do not rely on the state of internal
memory, except for level 0, which one processor can always maintain. Thus, with P < log4
n
M , for
every M4 operations we will perform O
(
M
B · log (n/M)P
)
parallel I/Os and, in the PEM model, the
amortized cost per extractMin, update, or delete is O
(
log (n/M)
PB
)
.
Corollary 3.1. The parBucketHeap can perform bulkUpdate(d) in O
(
d log (n/M)
PB +
d
B
)
parallel I/Os in the PEM model.
4 Experimental Results
In this section, we implement our parBucketHeap and evaluate its performance on our GPU
hardware platforms. We use the structure to implement a variation of Dijkstra’s algorithm and
evaluate its performance in solving the SSSP problem on several types of graphs. For all experimental
and implementation details, we use the standard GPU terminology of NVIDIA GPUs and the
CUDA API [26].
We implement the parBucketHeap and our SSSP algorithm using C++11 [29] and the CUDA
10 API [26] for NVIDIA GPUs. We compare the performance of our SSSP algorithm with the
SSSP implementation in the nvGRAPH [6] library. We also compare performance with a simple
GPU-efficient blocked Floyd-Warshall implementation to compare performance with a brute-force
approach.
We experimentally evaluate performance performance on our two platforms: RTX and Pascal.
The RTX platform has a Turing generation NVIDIA RTX 2080 Ti GPU with 11GiB of GDDR6
memory and 4352 compute cores. The Pascal platform has a Pascal generation NVIDIA GTX 1080
GPU with 8GiB GDDR5 memory and 2560 cores. We repeat all experiments 5 times, and generate
a series of synthetic graphs to evaluate SSSP performance. All synthetic graphs are generated
randomly, unless otherwise noted, and have varying characteristics. We use V to denote the number
of vertices, E the number of edges, and diameter (D) to denote the maximum number of vertices of
any shortest path from the source vertex. In this work we focus on graphs with a large diameter
(D = V ), as these types of graphs negate some of the heuristics employed by nvGRAPH, giving us
a better idea of the potential worst-case performance.
4.1 Implementation details We implement the parBucketHeap on the GPU, taking into
account many GPU-specific performance considerations. For details on these GPU-specific
optimizations and other details of modern GPU architectures, see standard documentation [26].
We allocate all buckets and signal buffers in global memory, and utilize d as a parameter that
defines the capacity of the smallest level (i.e., the capacities of S0 and B0 are d and 2d, respectively).
We allocate 1 thread-block per level of the structure, for a total of dlog4 nd + 1e thread-blocks,
and utilize 256 threads per block (we experimentally investigate the impact of this parameter
in Section 4.2). Thread-block 0 performs all extractMin, update, delete, and bulkUpdate
operations as needed, and immediately performs Resolve(0) after each operation. Every level
maintains a count, ci and each thread-block i > 0 performs Resolve(i) when ci−1 = 4 · ci and
ci+1 ≥ 4ci (thus ensuring that all preconditions are always met). Resolve(i) follows the pseudocode
outlined in Algorithm 5, and we implement the subroutines as follows:
14
• Merge – We use the Thrust [30] merge primitive using 1 thread-block to merge Si and Bi
and place the result in Bi.
• DeleteDuplicates – We perform a parallel scan (with 1 thread-block) of Bi, marking every
smallest priority unique element as a 1 in a separate array, idx. We then perform parallel
prefix sums on idx to compute the compressed index, which we use to eliminate duplicates.
We handle deletes by setting their priority to −1, so that all other elements with the same
value are removed.
• Select – We implement a simple randomized parallel quickselect algorithm that randomly
selects a pivot, marks/counts elements with a parallel scan, performs prefix sums, moves
necessary elements, and recurses.
• compress/move – As with DeleteDuplicates, we use parallel scans and prefix sums to
identify and move elements and to compress lists so they remain in contiguous space.
4.1.1 Single-source shortest path implementation We use the parBucketHeap to solve
the SSSP problem with a variant of Dijkstra’s algorithm that we call parDijkstra. Each element
contained in the parBucketHeap has value and priority corresponding to vertex ID (vid) and
tentative path distance p, respectively. parDijkstra performs V rounds where, at each round,
extractMin is called on the parBucketHeap to get the (v, p) pair corresponding to the vertex
in the structure with smallest tentative distance. We mark this extracted vertex, v, as settled and,
for each outgoing edge from v with weight w and destination vertex u, if u is not settled, we apply
update((u, p + w)) to the parBucketHeap. Since these updates are all performed together, we
group them into bulkUpdate operations. We use a single thread-block per level of the structure,
and thread-block 0 is responsible for performing all extractMin and bulkUpdate operations, as
well as Resolve(0) as needed. We experimentally determine that using 256 threads per thread-block
provides the best performance on our platforms in the following section.
4.2 Performance tuning Our implementation of the parBucketHeap includes several
parameters that may impact performance. Thus, we develop microbenchmarks to determine
the impact of these parameters on performance. We first consider the d parameter that controls
the size of S0 and the maximum number of elements that can be included in a bulkUpdate.
Increasing d increases the cost of performing Resolve(0), but allows us to more efficiently apply
many updates. Figure 3 plots the average time per Resolve(0) on each of our platforms, for varying
d. Surprisingly, we see significantly different results for each platform. However, for both platforms,
we see a small increase in the cost of resolving, relative to the increase in d. This result confirms
that the bulkUpdate operation of the parBucketHeap is efficient, especially for large batches.
Thus, for all experiments with parDijkstra, we set d to be equal to the maximum out-degree of
any vertex in the graph, allowing us to apply all updates in each round with a single bulkUpdate.
Next, we evaluate the performance impact of the number of threads per thread-block. For
each of our platforms, we perform parDijkstra on a synthetic graph while varying the number
of warps per thread-block. Figure 4 plots the results of this experiment on a graph with V = 20k,
E = 200M , and D = V . Results indicate that, while using multiple warps per thread-block improves
performance, performance gains are negligible beyond 8 warps. Thus, for all experiments hereafter,
we use 8 warps (256 threads) per thread-block.
15
0 500 1000 1500 2000
d (size of S0)
0.0
0.5
1.0
1.5
2.0
2.5
3.0
3.5
4.0
T
im
e
(m
s)
RTX
Pascal
Figure 3: The impact of d (size of S0) on
the time to perform a Resolve(0) operation.
While increasing d increases the runtime of
Resolve(0), the increase is small compared to
the number of elements being added to the par-
BucketHeap, showing that the bulkUpdate
operation is efficient for large d.
1 3 5 7 9 11
Warps per threadblock
0
50
100
T
im
e
(s
)
RTX
Pascal
Figure 4: The impact of the number of warps
per thread-block on the overall performance
of our SSSP algorithm using the parBuck-
etHeap. Results shown on both of our hard-
ware platforms for a graph with V = 20k,
E = 200M , and D = V .
0 10 20 30
Out degree (×103)
0
50
100
150
200
250
300
T
im
e
(s
)
nvGRAPH
parDijkstra (check)
parDijkstra (NO check)
(a) RTX platform
0 10 20 30
Out degree (×103)
0
100
200
300
400
500
600
700
T
im
e
(s
)
nvGRAPH
parDijkstra (check)
parDijkstra (NO check)
(b) Pascal platform
Figure 5: Average time to solve the single-source shortest path problem on a DAG of V = 30k,
D = V , and varying out-degree per vertex. Results shown with and without the random access
check, as the graph is a DAG and parDijkstra can be performed without it. As graph density
increases, the time to perform parDijkstra increases minimally, while nvGRAPH becomes much
slower.
4.3 Directed acyclic graphs (DAGs) At each round, parDijkstra checks a global array
to see if each destination vertex is settled before performing the update. This results in random
memory accesses, reducing the cache-efficiency of our algorithm. Brodal et al. [4] avoid this issue
on undirected graphs by maintaining a second bucket heap structure containing edges that are
16
10 20 30
V (×103)
0
50
100
150
200
250
300
350
400
T
im
e
(s
)
nvGRAPH
parDijkstra
Floyd-Warshall
(a) RTX platform
10 20 30
V (×103)
0
100
200
300
400
500
600
700
800
900
T
im
e
(s
)
nvGRAPH
parDijkstra
Floyd-Warshall
(b) Pascal platform
Figure 6: Average runtime for complete, undirected graphs with D = V , for varying V .
inserted when their source vertex is extracted and settled. To determine if this is necessary for
parDijkstra, we evaluate the performance impact of the random access to determine if vertices
are settled. On directed acyclic graphs (DAG)s, no settled vertex will never be visited again, so the
random access is not necessary. Thus, on a series of synthetic DAGs, we evaluate the performance
of parDijkstra both with and without this random memory access check. Figure 5 plots the
average runtime of parDijkstra with and without the random memory accesses performed, on a
synthetic graph with V = 30k, D = V , and varying out-degree (and therefore E). The performance
of nvGRAPH is included to illustrate how it compares.
4.4 Undirected graphs The results in Figure 5 illustrate that, while the random memory
accesses to determine if vertices are settled cause significant performance loss, they do not degrade
performance by more than a factor of 2. We also see that, on the RTX platform, the impact of
this is much less significant. This is due to the RTX GPU being a more recent generation that
includes hardware features that mitigate the performance loss due to uncoalesced memory accesses.
On both hardware platforms, we see that nvGRAPH performance degrades significantly faster than
parDijkstra with increased out-degree. Thus, parDijkstra out-performs nvGRAPH on DAGs
of 30k vertices with out-degree exceeding 8k and 6k on RTX and Pascal, respectively.
On undirected graphs, the parDijkstra algorithm cannot avoid the random memory accesses
needed to determine if each vertex is already settled. However, as illustrated with DAGs,
parDijkstra outperforms nvGRAPH for sufficiently dense graphs despite these random accesses.
Thus, we consider the extreme case of complete, weighted undirected graphs to compare the
performance of parDijkstra and nvGRAPH. We note that, on complete graphs, the all pairs
shortest path (APSP) Floyd-Warshall algorithm [7] is commonly used due to its simplicity. Thus, we
also measure the performance of a simple GPU-efficient Floyd-Warshall implementation. Figure 6
plots the average runtime of parDijkstra, nvGRAPH, and the Floyd-Warshall implementation
on complete graphs of varying sizes, all with D = V . Surprisingly, the results indicate that the
simple Floyd-Warshall algorithm outperforms nvGRAPH for all sizes. Additionally, we see that
parDijkstra outperforms even Floyd-Warshall when V ≥ 16k.
17
0 10 20 30 40 50 60 70 80 90
Time (s) →
0
50
100
150
200
250
300
P
ow
er
us
ag
e
(w
at
ts
)
nvGRAPH
parDijkstra
idle
peak
(a) RTX platform, V = 30k, E = 240M
0 20 40 60 80 100120140160180
Time (s) →
0
50
100
150
200
250
P
ow
er
us
ag
e
(w
at
ts
)
nvGRAPH
parDijkstra
idle
peak
(b) Pascal platform, V = 30k, E = 180M
Figure 7: GPU power usage while computing SSSP on a graph with V = 30k and D = V . Power
usage sampled every 5 seconds during execution.
4.5 Power usage A key benefit to parDijkstra is that it is work-efficient, theoretically
performing significantly less work than the algorithm used by nvGRAPH in the worst case. As a
result, our algorithm requires less computation, putting less demand on the GPU and consuming
less power. To determine the impact of this, we measure the power consumption of the GPU
while performing each algorithm. We select graphs for which each algorithm has approximately
the same execution time and use the nvprof profiling tools [31] to measure the power usage of the
GPU. We sample the power usage every 5 seconds during the execution, with results plotted in
Figure 7. As expected, results indicate that on both platforms the average power usage is much
lower when running parDijkstra than nvGRAPH. On random graphs with V = 30k, E = 240M ,
and D = 30k, the RTX GPU uses 2.7 watt-hours running parDijkstra and 6.3 (2.3× more) with
nvGRAPH. The Pascal GPU uses 2.8 watt-hours running parDijkstra and 8.9 (3.1× more)
running nvGRAPH.
References
[1] J. JaJa, Introduction to Parallel Algorithms. Reading, MA: Addison-Wesley, 1992.
[2] A. Aggarwal and J. Vitter, “The input/output coplexity of sorting and related problems,” Commun.
ACM, vol. 31, no. 11, 1988.
[3] L. Arge, M. Goodrich, M. Nelson, and N. Sitchinava, “Fundamental parallel algorithms for private-cache
chip multiprocessors,” in Proc. of SPAA, Munich, Germany, 6 2008, pp. 235–246.
[4] G. S. Brodal, R. Fagerberg, U. Meyer, and N. Zeh, “Cache-oblivious data structures and algorithms for
undirected breadth-first search and shortest paths,” in Algorithm Theory - SWAT 2004, T. Hagerup
and J. Katajainen, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2004, pp. 480–492.
[5] G. S. Brodal, J. L. Tra¨ff, and C. D. Zaroliagis, “A parallel priority queue with constant time operations,”
J. Parallel Distrib. Comput., vol. 49, no. 1, pp. 4–21, Feb. 1998.
[6] NVIDIA, “CUDA nvgraph library,” 2019. [Online]. Available: https://docs.nvidia.com/cuda/nvgraph/
index.html
[7] T. H. Cormen, C. Stein, R. L. Rivest, and C. E. Leiserson, Introduction to Algorithms, 2nd ed.
McGraw-Hill Higher Education, 2001.
18
[8] M. L. Fredman, R. Sedgewick, D. D. Sleator, and R. E. Tarjan, “The pairing heap: A new form
of self-adjusting heap,” Algorithmica, vol. 1, no. 1, pp. 111–129, Jan. 1986. [Online]. Available:
http://dx.doi.org/10.1007/BF01840439
[9] P. F. Dietz and R. Ramant, “Very fast optimal parallel algorithms for heap construction,” in Proceedings
of 1994 6th IEEE Symposium on Parallel and Distributed Processing, Oct 1994, pp. 514–521.
[10] M. L. Fredman and R. E. Tarjan, “Fibonacci heaps and their uses in improved network optimization
algorithms,” in 25th Annual Symposium on Foundations of Computer Science, Oct 1984, pp. 338–346.
[11] G. C. Hunt, M. M. Michael, S. Parthasarathy, and M. L. Scott, “An efficient algorithm for concurrent
priority queue heaps,” Inf. Process. Lett., vol. 60, no. 3, pp. 151–157, Nov. 1996. [Online]. Available:
http://dx.doi.org/10.1016/S0020-0190(96)00148-2
[12] R. A. Chowdhury and V. Ramachandran, “Cache-oblivious shortest paths in graphs using buffer
heap,” in Proceedings of the Sixteenth Annual ACM Symposium on Parallelism in Algorithms and
Architectures, ser. SPAA ’04. New York, NY, USA: ACM, 2004, pp. 245–254. [Online]. Available:
http://doi.acm.org/10.1145/1007912.1007949
[13] N. Baudis, F. Jacob, and P. Andelfinger, “Performance evaluation of priority queues for fine-grained
parallel tasks on gpus,” in 2017 IEEE 25th International Symposium on Modeling, Analysis, and
Simulation of Computer and Telecommunication Systems (MASCOTS), Sep. 2017, pp. 1–11.
[14] X. He, D. Agarwal, and S. K. Prasad, “Design and implementation of a parallel priority queue on
many-core architectures,” 2012 19th International Conference on High Performance Computing, pp.
1–10, 2012.
[15] M. Thorup, “Undirected single source shortest paths in linear time,” in Proceedings 38th Annual
Symposium on Foundations of Computer Science, Oct 1997, pp. 12–21.
[16] A. Crauser, K. Mehlhorn, U. Meyer, and P. Sanders, “A parallelization of dijkstra’s shortest path
algorithm,” in Proceedings of the 23rd International Symposium on Mathematical Foundations of
Computer Science, 1998, pp. 722–731.
[17] U. Meyer and P. Sanders, “Delta-stepping: A parallel single source shortest path algorithm,” in
Proceedings of the 6th Annual European Symposium on Algorithms, ser. ESA ’98, 1998, pp. 393–404.
[18] R. M. Karp and V. Ramachandran, “Handbook of theoretical computer science (vol. a),” J. van
Leeuwen, Ed. Cambridge, MA, USA: MIT Press, 1990, ch. Parallel Algorithms for Shared-memory
Machines, pp. 869–941. [Online]. Available: http://dl.acm.org/citation.cfm?id=114872.114889
[19] S. Arora and B. Barak, Computational Complexity: A Modern Approach, 1st ed. New York, NY, USA:
Cambridge University Press, 2009.
[20] Y. Han, V. Pan, and J. Reif, “Efficient parallel algorithms for computing all pair shortest paths in
directed graphs,” in Proceedings of the Fourth Annual ACM Symposium on Parallel Algorithms and
Architectures, ser. SPAA ’92. New York, NY, USA: ACM, 1992, pp. 353–362. [Online]. Available:
http://doi.acm.org/10.1145/140901.141913
[21] G. E. Blelloch, Y. Gu, Y. Sun, and K. Tangwongsan, “Parallel shortest paths using radius stepping,”
in Proceedings of the 28th ACM Symposium on Parallelism in Algorithms and Architectures, 2016, pp.
443–454.
[22] P. Harish and P. J. Narayanan, “Accelerating large graph algorithms on the gpu using
cuda,” in Proceedings of the 14th International Conference on High Performance Computing,
ser. HiPC’07. Berlin, Heidelberg: Springer-Verlag, 2007, pp. 197–208. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1782174.1782200
[23] G. J. Katz and J. T. Kider, Jr, “All-pairs shortest-paths for large graphs on the gpu,” in Proceedings
of the 23rd ACM SIGGRAPH/EUROGRAPHICS Symposium on Graphics Hardware, ser. GH ’08.
Aire-la-Ville, Switzerland, Switzerland: Eurographics Association, 2008, pp. 47–55. [Online]. Available:
http://dl.acm.org/citation.cfm?id=1413957.1413966
[24] A. Davidson, S. Baxter, M. Garland, and J. D. Owens, “Work-efficient parallel gpu methods for single-
source shortest paths,” in Proceedings of the 2014 IEEE 28th International Parallel and Distributed
Processing Symposium, ser. IPDPS ’14, 2014, pp. 349–359.
[25] Y. Wang, A. Davidson, Y. Pan, Y. Wu, A. Riffel, and J. D. Owens, “Gunrock: A high-performance
19
graph processing library on the gpu,” SIGPLAN Not., vol. 51, no. 8, pp. 11:1–11:12, Feb. 2016.
[26] NVIDIA, “CUDA programming guide 10.0,” 2019. [Online]. Available: http://docs.nvidia.com/cuda
[27] O. Green, S. Odeh, and Y. Birk, “Merge path - A visually intuitive approach to parallel merging,”
CoRR, vol. abs/1406.2628, 2014.
[28] J. JaJa, An Introduction to Parallel Algorithms. Reading, Mass.: Addison-Wesley Publishing Co.,
1992.
[29] ISO, ISO/IEC 14882:2011 Information technology — Programming languages — C++. Geneva,
Switzerland: International Organization for Standardization, Feb. 2012. [Online]. Available:
http://www.iso.org/iso/iso catalogue/catalogue tc/catalogue detail.htm?csnumber=50372
[30] J. Hoberock and N. Bell, “Thrust: A parallel template library,” 2010, version 1.7.0. [Online]. Available:
http://thrust.github.io/
[31] NVIDIA, “Profiler user’s guide,” 2019. [Online]. Available: http://docs.nvidia.com/cuda/
profiler-users-guide
20
