Accelerating Concurrent Heap on GPUs by Chen, Yanhao et al.
Accelerating Concurrent Heap on GPUs
YANHAO CHEN, Rutgers University
FEI HUA, Rutgers University
CHAOZHANG HUANG, Rutgers University
JEREMY BIEREMA, Rutgers University
CHI ZHANG, University of Pittsburgh
EDDY Z. ZHANG, Rutgers University
Priority queue, often implemented as a heap, is an abstract data type that has been used in many well-known
applications like Dijkstra’s shortest path algorithm, Prim’s minimum spanning tree, Huffman encoding, and
the branch-and-bound algorithm. However, it is challenging to exploit the parallelism of the heap on GPUs
since the control divergence and memory irregularity must be taken into account. In this paper, we present
a parallel generalized heap model that works effectively on GPUs. We also prove the linearizability of our
generalized heap model which enables us to reason about the expected results. We evaluate our concurrent
heap thoroughly and show a maximum 19.49X speedup compared to the sequential CPU implementation and
2.11X speedup compared with the existing GPU implementation [5]. We also apply our heap to single source
shortest path with up to 1.23X speedup and the 0/1 knapsack problem with up to 12.19X speedup.
CCS Concepts: • Software and its engineering → General programming languages; • Social and
professional topics → History of programming languages;
ACM Reference format:
Yanhao Chen, Fei Hua, Chaozhang Huang, Jeremy Bierema, Chi Zhang, and Eddy Z. Zhang. 2016. Accelerating
Concurrent Heap on GPUs. 1, 1, Article 1 (January 2016), 24 pages.
DOI: 10.1145/nnnnnnn.nnnnnnn
1 INTRODUCTION
A priority queue is an abstract data type which assigns each data element a priority, and an element
of high priority is always served before an element of low priority. A priority queue is dynamically
maintained, allowing a mix of insertion and deletion updates. Well-known applications of priority
queues include Dijkstra’s shortest path algorithm, Prim’s minimum spanning tree, Huffman encoding,
and the branch-and-bound algorithm that solves many combinatorial optimization problems.
Understanding how to accelerate priority queues on many-core architectures has a profound impact.
A comprehensive study will not only shed light on the performance benefits/limitations of running
the priority queue itself on accelerator architectures but also pave the road for future work that
parallelizes a large class of applications which build on priority queues.
A priority queue is often implemented as a heap. In this paper, we focus on the heap. The heap is a
fundamental abstract data type, but its acceleration on GPUs has not been extensively studied. A
heap is a tree data structure. Using the min-heap as an example, every node in the binary tree has a
key that is greater than or equal to that of its parent. There are two basic operations for a heap:
insertion updates and deletion updates. The deletion update always returns the minimal key. The
insertion update inserts a key at the right location in the tree. Both operations have logarithmic
complexity and leave the binary tree in a balanced state. An example of a min-heap is shown in Fig.
1.
, Vol. 1, No. 1, Article 1. Publication date: January 2016.
arXiv:1906.06504v1 [cs.DC] 15 Jun 2019
[Figure 1: three panels labeled "Insertion", "min Heap", and "Deletion", each showing a binary min-heap drawn as a tree.]
Fig. 1. An example of min-heap and irregularity in the heap operations
However, it is non-trivial to exploit the parallelism of the binary heap and to reason about the
correctness of a parallel implementation. There are two key challenges that prevent us from fully taking
advantage of the massive parallelism in many-core processors. Each update operation of the binary
heap involves a tree walk. Different tree walks exhibit different control flow paths and the memory
locality might be low. Control divergence and memory irregularity are two main performance
hazards for GPU computing [17]. These two performance hazards need to be tackled before we can
efficiently accelerate the binary heap on GPUs. Moreover, as the parallel design gets complicated, it is
not easy to reason about the correctness properties of the concrete implementation.
Existing work on parallelizing the heap neither provides correctness guarantees nor takes the GPU
performance hazards into consideration. Among the research on parallelizing the heap on CPUs, the
work by Rao and Kumar [11] avoids locking the heap in its entirety, associates each node with
a lock, and makes insertion updates top-down, such that the locking order of nodes prevents
deadlock. Hunt et al. [7] adopt the same fine-grained locking mechanism, but make insertion
updates bottom-up while maintaining a top-down lock order. The implementation by Hunt et
al. alleviates the contention at the root node. However, neither of them formally reasons about
the correctness of its implementation. Neither tackles the control divergence problem caused by
random tree walks, as it was not a problem for CPUs at the time when both works were published.
The closest related work to ours is by He et al. [5], which is a GPU implementation of the
binary heap. It is based on the idea presented by Deo and Prasad [3] in 1992, which exploits
parallelism by increasing the node capacity in the heap. One node may contain k keys (k ≥ 1).
However, while it exploits intra-node parallelism, inter-node parallelism is not well exploited. It
divides the heap into even and odd levels and uses barrier synchronization to make sure operations
on the two types of levels are never processed at the same time. It assumes all insertion/deletion updates
progress at the same rate. Moreover, between every two consecutive barrier synchronization points,
only one insertion or deletion request can be accepted, which severely limits the efficiency of its
implementation on GPUs. Our implementation is shown to be much faster than the work by He et
al. [5] (in Section 5).
In this work, we present a design of a concurrent heap that is well-suited for many-core accelerators.
Although our idea is implemented and evaluated on GPUs, it applies to other general-purpose
accelerator architectures with vector processing units. Further, we not only show that our design
outperforms the sequential CPU implementation and the existing GPU implementation, but also prove
that our concurrent heap is linearizable. Specifically, our contributions are summarized as follows:
1. We develop a generalized heap model. In our model, each node of the heap may contain
multiple keys. This is similar to the work by Deo and Prasad [3]. However, there are two key
differences. First, assuming k is the node capacity, Deo and Prasad [3] only allow inserting/deleting
exactly k keys, while it is not uncommon that an application inserts/deletes fewer than or more than
k keys. We add support for partial insertion and deletion in our generalized model. Second,
we exploit both intra-node parallelism and inter-node parallelism, the latter of which is not fully
explored by Deo and Prasad [3] or He et al. [5]. Note that the benefit of having multiple keys in one
node is multi-fold. It allows for intra-node parallelism, memory parallelism, and local caching, and can
alleviate control divergence since in a tree walk the k keys in the same node move along the same path.
2. We prove the linearizability of our implementation. We propose two types of heap
implementations and prove both are linearizable. Linearizability is a strong correctness condition.
A history of concurrent invocation and response events is linearizable if and only if some (valid)
reordering of events yields a legal sequential history. We provide a model for describing the
concurrent and sequential histories and for inserting linearization points. As far as we know,
existing heap implementations on CPU [3, 7, 11] or GPU [5] do not have a formal reasoning about
their correctness or linearizability condition.
3. We perform a comprehensive evaluation of the concurrent heap. We thoroughly evaluate
our heap implementation and provide an enhanced understanding of the interplay between heap
parameters and execution efficiency. We perform sensitivity analysis for heap node capacity, partial
operation percentage, concurrent thread number, and initial heap utilization. We explore the
difference between insertion and deletion performance. We also evaluate our implementation on
real workloads, while most previous work uses synthetic traces [3, 5, 7, 11], as far as we know.
We show that the performance improvement can be up to 19.49 times compared with the sequential
CPU implementation and 2.11 times compared with the existing GPU implementation. We speed up
the single source shortest path algorithm by up to 1.23 times and 0/1
knapsack by up to 12.19 times, which demonstrates the great potential of applying priority queues on
GPU accelerators.
2 BACKGROUND
2.1 Heap Data Structure
A heap data structure can be viewed as a binary tree in which each node stores a key.
Without loss of generality, we use the min-heap to describe our idea throughout the paper. The
minimal key is stored at the root node. A node’s key is greater than or equal to its parent’s key. A
heap is maintained using two basic operations: insert and delete-min.
During the insertion process, a key is inserted at an appropriate location such that the heap
property is maintained. A bottom-up insertion process places the key in the first empty leaf
node, then repeats the following step: compare the node’s key with its parent node’s key; if it is smaller,
swap the node with its parent, otherwise terminate. The bottom-up insertion algorithm is
shown in Fig. 2 (a).
(a) insertion
Procedure insert ( key )
     lastnode = key, r = lastnode
     while ( r != root ) do
          if ( r < Parent(r) ) then
               swap( r, Parent(r) )
               r = Parent(r)     // continue from the parent's position
          else break
          endIf
     endWhile
(b) deletion
Procedure delete ( key )
     key = root, root = lastnode, lastnode = null, p = root
     while ( LEFT(p) != null or RIGHT(p) != null ) do
          c = min(LEFT(p), RIGHT(p))
          if ( p <= c ) then break
          else swap( p, c ), p = c     // continue from the child's position
          endIf
     endWhile
Fig. 2. Insert and Delete in a sequential Heap
A delete-min procedure returns the minimal key in the heap. It removes the key at the root node
and starts a “heapify” process to restore the min-heap property. To heapify, it moves the last leaf
node to the root, and repeats the following steps: (1) compare the node p’s left child and right child
(if there is any), (2) return the smaller of the two as c, and (3) if p’s key is larger than c’s key, swap
the node c with the node p, otherwise, terminate. The algorithm is shown in Fig. 2 (b).
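The two procedures of Fig. 2 can be sketched as runnable code. The following is a minimal sketch, assuming a 0-indexed array layout (parent of index r at (r - 1) // 2); the function names `heap_insert` and `heap_delete_min` are illustrative and not taken from the paper.

```python
def heap_insert(heap, key):
    """Bottom-up insertion into an array-based binary min-heap (0-indexed)."""
    heap.append(key)                      # place the key in the first empty leaf
    r = len(heap) - 1
    while r > 0:
        parent = (r - 1) // 2
        if heap[r] < heap[parent]:
            heap[r], heap[parent] = heap[parent], heap[r]
            r = parent                    # keep walking up the tree
        else:
            break


def heap_delete_min(heap):
    """Remove and return the minimal key, then heapify from the root."""
    if not heap:
        raise IndexError("delete from an empty heap")
    min_key = heap[0]
    last = heap.pop()                     # detach the last leaf
    if heap:
        heap[0] = last                    # move the last leaf to the root
        p, n = 0, len(heap)
        while True:
            left, right = 2 * p + 1, 2 * p + 2
            if left >= n:                 # p has no child
                break
            c = left
            if right < n and heap[right] < heap[left]:
                c = right                 # c is the smaller child
            if heap[p] <= heap[c]:
                break
            heap[p], heap[c] = heap[c], heap[p]
            p = c                         # keep walking down the tree
    return min_key
```

Each call walks a single root-to-leaf path, so the path taken depends on the key values; this data dependence is the source of the irregular tree walks discussed in the introduction.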
2.2 GPU Architecture
A GPU is a type of many-core accelerator. It employs the single instruction multiple thread (SIMT)
execution model. To take advantage of a GPU, two fundamental factors need to be taken
into consideration [8, 10]: control divergence and memory locality.
In the SIMT model, threads are organized into groups, each of which executes in a lock-step manner.
Each group is called a warp. The threads in the same warp can only be issued one instruction at
a time. If threads in the same warp need to execute different instructions, the execution is
serialized. This is called control divergence.
During execution, the data must be fetched for all threads at each instruction. The warp cannot
execute until the data operands of all its threads are ready. Memory parallelism needs to be exploited
since data in physical memory is organized into large contiguous blocks. Data is fetched in
units of memory blocks. If one memory reference involves non-contiguous data accesses that span multiple
blocks, the warp needs to fetch multiple blocks. If one memory reference involves contiguous data
accesses in memory, the number of memory blocks that need to be fetched is reduced.
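The effect of contiguity on the number of fetched blocks can be illustrated with a small count. This is a sketch under assumed parameters that do not come from the paper: a 128-byte memory block, a warp of 32 threads, and 4-byte keys.

```python
BLOCK_BYTES = 128   # assumed memory block size
WARP_SIZE = 32      # assumed number of threads per warp
KEY_BYTES = 4       # assumed key size


def blocks_touched(addresses, block_bytes=BLOCK_BYTES):
    """Number of distinct memory blocks covered by one warp-wide reference."""
    return len({addr // block_bytes for addr in addresses})


# Each thread reads the key next to its neighbour's key (contiguous access).
contiguous = [lane * KEY_BYTES for lane in range(WARP_SIZE)]

# Each thread reads a key that lies in a different block (scattered access).
strided = [lane * BLOCK_BYTES for lane in range(WARP_SIZE)]
```

Under these assumptions the contiguous reference covers a single block while the strided reference covers one block per thread, which is the gap that storing a batch of keys contiguously in one heap node is meant to close.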
The SIMT model provides a limited set of synchronization primitives. Barrier synchronization
is allowed for threads within a cooperative thread array (CTA). A GPU kernel does not complete until all its threads have
completed, which can be used as an implicit barrier among all threads. Although no
lock intrinsics are provided on GPUs, the atomic compare-and-swap (CAS) function is provided and
can be used to implement synchronization primitives.
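As an illustration of how a lock can be built from CAS alone, the sketch below emulates an atomic word on the CPU. The `AtomicWord` class is a hypothetical stand-in for a memory word updated with a hardware CAS; its internal `threading.Lock` only models the atomicity of a single CAS and is not part of the technique.

```python
import threading
import time

UNLOCKED, LOCKED = 0, 1


class AtomicWord:
    """Hypothetical emulation of a word that supports compare-and-swap."""

    def __init__(self, value=UNLOCKED):
        self._value = value
        self._guard = threading.Lock()    # models the atomicity of one CAS

    def cas(self, expected, new):
        """Store `new` if the word equals `expected`; return the old value."""
        with self._guard:
            old = self._value
            if old == expected:
                self._value = new
            return old


def acquire(word):
    # Spin until this thread is the one that flips UNLOCKED to LOCKED.
    while word.cas(UNLOCKED, LOCKED) != UNLOCKED:
        time.sleep(0)                     # yield to other threads while spinning


def release(word):
    word.cas(LOCKED, UNLOCKED)


def run(num_threads=4, iters=200):
    """Increment a shared counter under the CAS-based lock."""
    lock, counter = AtomicWord(), [0]

    def worker():
        for _ in range(iters):
            acquire(lock)
            counter[0] += 1               # critical section
            release(lock)

    threads = [threading.Thread(target=worker) for _ in range(num_threads)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return counter[0]
```

Because the CAS returns the previous value, a caller learns both whether it won the lock and, if not, what state the word was in; a lock with more than two states can exploit exactly this.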
2.3 Linearizability
In concurrent programming, linearizability [6] is a strong condition which constrains the possible
output of a set of interleaved operations. It is also a safety property that enables us to reason about
the expected results from a concurrent system [13]. The execution of these operations results in a
history, an ordered sequence of invocation and response events. The invocation refers to the event
when an operation starts. The response refers to the event when an operation completes.
A sequential history is one in which an invocation is always followed by its matching response,
and a response is followed by another invocation. Alternatively, since in a sequential history
operations do not overlap, we can consider an invocation and its matching response to happen at
the same time, so that an operation takes effect immediately. There is no real concurrency in a sequential
history. However, a sequential history is easy to reason about.
We say that a history H, an ordered list of invocation and response events {e1, e2, ..., ek},
is linearizable if there exists a reordering of the events such that (1) a correct sequential history
is generated, and (2) if the response of an operation ei precedes the invocation of another
operation ej, res(ei) < inv(ej), then the operation ei precedes the operation ej in the reordered
events. Typically, a linearization point is used to denote the time, between the invocation and the
response of an operation, at which the operation takes effect. Finding the right linearization points
to construct a correct sequential history naturally meets condition (2).
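For very small histories, the definition can be checked by brute force. The sketch below is an illustration and not the proof technique of the paper: it assumes each operation is recorded as a tuple (invocation time, response time, kind, value), where the value of a "del" is the key it returned, and it tries every reordering that satisfies condition (2).

```python
import heapq
from itertools import permutations


def is_legal_sequential(ops):
    """Replay (kind, value) pairs on a sequential min-priority queue."""
    pq = []
    for kind, value in ops:
        if kind == "ins":
            heapq.heappush(pq, value)
        elif not pq or heapq.heappop(pq) != value:
            return False                  # a "del" returned the wrong key
    return True


def is_linearizable(history):
    """history: list of (inv_time, res_time, kind, value) tuples."""
    n = len(history)
    for order in permutations(range(n)):
        respects_real_time = True
        for pos_a in range(n):
            for pos_b in range(pos_a + 1, n):
                first, second = history[order[pos_a]], history[order[pos_b]]
                # `second` responded before `first` was invoked, so placing
                # `first` earlier would violate condition (2).
                if second[1] < first[0]:
                    respects_real_time = False
        if respects_real_time and is_legal_sequential(
                [(history[t][2], history[t][3]) for t in order]):
            return True
    return False


# ins(3) overlaps del: ordering ins(5), del -> 5, ins(3) is a legal sequence.
overlapping = [(0, 1, "ins", 5), (2, 5, "ins", 3), (3, 4, "del", 5)]

# Both inserts finished before del started, so del must return 3, not 5.
non_overlapping = [(0, 1, "ins", 5), (2, 3, "ins", 3), (4, 5, "del", 5)]
```

The search is exponential in the number of operations, which is why an implementation is argued correct by identifying linearization points rather than by enumerating reorderings.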
3 CONCURRENT HEAP DESIGN
We exploit the parallelism of heap operations by allowing concurrent insert operations, INS, and
delete operations, DEL, on different tree nodes. In the meantime, each node in the binary tree
is extended to contain a batch of keys instead of only one. We refer to this proposed heap as
the generalized heap throughout this paper. Our heap is well-suited for acceleration on GPUs.
Parallelism exists within a batch of keys and the control divergence is reduced because all keys in
the same batch move along the same path in the tree during tree traversals.
3.1 Generalized Heap
Each node in the generalized heap contains k keys.¹ Since the number of keys in the heap is not
always a multiple of k and an insertion or deletion may not involve exactly k keys, we use a partial
buffer. The partial buffer contains no more than k − 1 keys. All the keys in the
partial buffer must be larger than or equal to the keys in the root node, which ensures that the
smallest k keys are in the root node. We denote the k keys in one heap node as a batch.
Like the conventional heap, after each INS and DEL update on the generalized heap, the heap property
needs to be preserved. Here, we formally define the generalized heap property:
Property 1. Given any node c in the generalized heap and its parent node p, the smallest key in c
is always larger than or equal to the largest key in p:
$$\min_{i=1..k} \mathrm{node}[c][i] \ge \max_{j=1..k} \mathrm{node}[p][j]$$
Property 2. Given any node c in the generalized heap, the keys in c are sorted in ascending order:
$$\forall i \in [1,k): \mathrm{node}[c][i] \le \mathrm{node}[c][i+1]$$
Property 3. Given the partial buffer b with size s, all the keys are sorted in ascending order and
are larger than or equal to those in the root node r:
$$\forall i \in [1,s): \mathrm{node}[b][i] \le \mathrm{node}[b][i+1]$$
$$\mathrm{node}[b][1] \ge \mathrm{node}[r][k]$$
Note that the heap property for the conventional heap is a special case of the generalized heap
property with k = 1. When the batch of each node contains only one key, generalized heap
properties 1 and 2 are still satisfied. Generalized heap property 3 does not apply: the partial
buffer contains at most k − 1 = 0 keys, so there is no partial buffer in the conventional heap.
The most space-efficient way to represent a generalized heap is to use an array. Each entry of
the array represents a single key and k consecutive entries represent a node. Thus, the generalized
heap can be represented as a linear array in which the first k entries are from the root node,
the next k entries are from the second node, and so on. Therefore, counting nodes from 0, array entries in the range
[i ∗ k, (i + 1) ∗ k − 1] are from the i-th node in the generalized heap. An array representation allows
an implicit binary tree representation of the generalized heap. Fig. 3 shows an example of the
generalized heap in both array representation and binary tree representation. The partial buffer is
stored separately.
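[Figure 3, with k = 4.
Binary tree representation:
  level 0: [0 0 0 1]
  level 1: [2 2 3 4] [1 2 4 5]
  level 2: [6 8 8 9] [4 4 5 8] [6 8 9 9]
Array representation:
  Batch 1: 0 0 0 1 | Batch 2: 2 2 3 4 | Batch 3: 1 2 4 5 | Batch 4: 6 8 8 9 | Batch 5: 4 4 5 8 | Batch 6: 6 8 9 9]
Fig. 3. An example of the generalized heap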
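The array layout and Properties 1 and 2 can be checked mechanically on the heap of Fig. 3. The sketch below assumes nodes are counted from 0, matching the index range [i ∗ k, (i + 1) ∗ k − 1]; the helper names are illustrative only, and the partial buffer is left out.

```python
def node_range(i, k):
    """Inclusive array index range of the i-th node, counting nodes from 0."""
    return i * k, (i + 1) * k - 1


def parent(i):
    """Index of the parent of node i (nodes counted from 0)."""
    return (i - 1) // 2


def satisfies_properties(array, k):
    """Check generalized heap Properties 1 and 2 on a flat array of full nodes."""
    nodes = [array[i * k:(i + 1) * k] for i in range(len(array) // k)]
    for i, node in enumerate(nodes):
        # Property 2: keys inside a node are sorted in ascending order.
        if any(node[j] > node[j + 1] for j in range(k - 1)):
            return False
        # Property 1: smallest key of a node >= largest key of its parent.
        if i > 0 and min(node) < max(nodes[parent(i)]):
            return False
    return True


# The six batches of Fig. 3, stored back to back (k = 4).
fig3 = [0, 0, 0, 1,  2, 2, 3, 4,  1, 2, 4, 5,
        6, 8, 8, 9,  4, 4, 5, 8,  6, 8, 9, 9]
```

For example, `node_range(2, 4)` gives the slice holding Batch 3. Note that the pseudocode in the paper numbers nodes from 1 (the parent of node x is x >> 1), so the 0-based helpers here differ from it by one.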
¹ We use k to represent the node capacity throughout this paper.
3.2 INS and DEL operations on the Generalized Heap
There are two basic operations for the heap: the DEL operation deletes the root node, which contains the
smallest k keys, from the heap, and the INS operation inserts new keys into the heap. We describe these
two basic operations on the generalized heap.
define B[x] as heap.node[x]
define lastelem as heap.size
(a) top-down INS operation
Procedure top_down_insert ( insItems )
1.      insItems = sort( insItems )
2.      tar = ++lastelem; cur = 1; level = floor(log2(tar))
3.      while (cur != tar) do
4.           ( B[cur], insItems ) = MergeAndSort( B[cur], insItems )
5.           cur = tar >> --level
6.      B[tar] = insItems
(b) bottom-up INS operation
Procedure bottom_up_insert ( insItems )
1.      insItems = sort( insItems )
2.      cur = ++lastelem; par = cur >> 1
3.      B[cur] = insItems
4.      while (cur != 1) do
5.           if ( B[cur][0] >= B[par][K - 1] ) then break
6.           ( B[par], B[cur] ) = MergeAndSort( B[par], B[cur] )
7.           cur = par; par = cur >> 1
(c) DEL operation
Procedure delete ( deleteItems )
1.      if ( lastelem == 0 ) then return
2.      deleteItems = B[1]
3.      tar = lastelem--; cur = 1
4.      B[1] = B[tar]; B[tar] = MAX_VALUE
5.      if ( tar == 1 ) then return
6.      while (1) do
7.           l = LEFT( cur ); r = RIGHT( cur )
8.           // Suppose right child batch has the largest item
9.           ( B[l], B[r] ) = MergeAndSort( B[l], B[r] )
10.          if ( B[cur][K - 1] <= B[l][0] ) then break
11.          ( B[cur], B[l] ) = MergeAndSort( B[cur], B[l] )
12.          cur = l
Fig. 4. Pseudocode for INS and DEL operations on the generalized heap
3.2.1 DEL Operation. The DEL operation on the generalized heap retrieves the k keys from
the root node. Since the root node is left empty, the generalized heap needs to be heapified to
satisfy the generalized heap property. The pseudocode is shown in Fig. 4(c).
The DEL operation refills the root node with the keys from the last leaf node of the heap (lines 1 -
5). Note that we fill the last node with a MAX value to make sure the old keys in that node
are overwritten. Then, we propagate the new values in the root node down. During the propagation, the DEL
operation performs the MergeAndSort operation on the two child nodes l and r (line 9). Here we
formally define the MergeAndSort operation between two batches of keys a and b:
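$$(\mathrm{batch}[hi][1:k],\ \mathrm{batch}[lo][1:k]) = \mathrm{MergeAndSort}(\mathrm{batch}[a][1:k],\ \mathrm{batch}[b][1:k])$$
such that
$$\forall i \in [1,k): \mathrm{batch}[hi][i] \le \mathrm{batch}[hi][i+1]$$
$$\forall i \in [1,k): \mathrm{batch}[lo][i] \le \mathrm{batch}[lo][i+1]$$
$$\max_{i=1..k} \mathrm{batch}[hi][i] \le \min_{j=1..k} \mathrm{batch}[lo][j]$$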
The MergeAndSort operation returns two batches hi and lo, each of size k. hi stores the k smallest keys
and lo stores the k largest keys.
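A sequential version of MergeAndSort follows directly from this definition. The sketch below merges two already sorted batches with a standard two-pointer merge; it is a CPU illustration only and makes no claim about how the merge is parallelized on the GPU.

```python
def merge_and_sort(a, b):
    """Merge two sorted batches; return (hi, lo) as defined for MergeAndSort.

    hi holds the len(a) smallest keys and lo holds the remaining keys,
    both in ascending order.
    """
    merged = []
    i = j = 0
    while i < len(a) and j < len(b):
        if a[i] <= b[j]:
            merged.append(a[i])
            i += 1
        else:
            merged.append(b[j])
            j += 1
    merged.extend(a[i:])                  # at most one of these two
    merged.extend(b[j:])                  # slices is non-empty
    k = len(a)
    return merged[:k], merged[k:]
```

With the two child batches [2, 2, 3, 4] and [1, 2, 4, 5], the call yields hi = [1, 2, 2, 2] and lo = [3, 4, 4, 5].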
The DEL operation places the lo part back into the child node whose maximum key was larger
(compared with the other child) before the merge (line 9). In this example, let’s suppose it is the child node r.
It can be proved that with such a placement policy, the generalized heap property on the sub-heap
of r is maintained. The hi part is placed into the child node l. Then, another MergeAndSort
operation is applied to the current node and the child l (line 11). The k smallest keys of the merged
result stay in the current node, and the k largest propagate through the sub-heap of l. The
propagation continues until the generalized heap property is satisfied (line 10). An example of the DEL
operation is shown in Fig. 5.
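The DEL procedure can be sketched sequentially as follows. This is a simplified illustration with assumptions of its own: the heap is a Python list `B` with `B[0]` unused so that nodes are numbered from 1 as in Fig. 4, the last node is removed from the list instead of being filled with a MAX value, and missing children are detected by index rather than by a sentinel. The partial buffer is not modeled.

```python
def merge_and_sort(a, b):
    """Return (hi, lo): the len(a) smallest keys and the remaining keys."""
    merged = sorted(a + b)
    return merged[:len(a)], merged[len(a):]


def delete_batch(B):
    """Remove and return the root batch of the generalized heap B."""
    if len(B) <= 1:
        return []                          # empty heap
    deleted = B[1]
    last = B.pop()                         # detach the last node
    n = len(B) - 1                         # number of nodes that remain
    if n == 0:
        return deleted
    B[1] = last                            # refill the root
    cur = 1
    while True:
        l, r = 2 * cur, 2 * cur + 1
        if l > n:                          # cur has no child
            break
        if r <= n:
            if B[l][-1] > B[r][-1]:
                l, r = r, l                # r: child with the larger maximum
            B[l], B[r] = merge_and_sort(B[l], B[r])   # lo part goes back to r
        if B[cur][-1] <= B[l][0]:          # heap property holds: stop
            break
        B[cur], B[l] = merge_and_sort(B[cur], B[l])
        cur = l                            # continue in the sub-heap of l
    return deleted


# The heap of Fig. 3 (k = 4) with B[0] as an unused placeholder.
fig3 = [None, [0, 0, 0, 1], [2, 2, 3, 4], [1, 2, 4, 5],
        [6, 8, 8, 9], [4, 4, 5, 8], [6, 8, 9, 9]]
```

Calling `delete_batch` repeatedly on this heap returns the batches in ascending key order, which is a convenient sanity check for the placement policy described above.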
3.2.2 INS Operation. The INS operation inserts new keys into the generalized heap.² It grows
the heap by adding a new node at the first empty node position in the heap; we call this location the target
node. Given the target node, a path from the root node to the target node can be found and we call
it the insert path.
² We assume that one INS operation inserts at most k new keys. To insert more than k keys, multiple INS
operations can be invoked.
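[Figure 5, panels (a) to (c), with k = 4, connected by MergeAndSort steps: panel (a) shows root [6 8 9 9] with children [2 2 3 4] and [1 2 4 5]; panel (b) shows root [6 8 9 9] with children [1 2 2 2] and [3 4 4 5]; panel (c) shows root [1 2 2 2] with children [6 8 8 9] and [3 4 4 5].]
Fig. 5. Example: Deletion for the generalized heap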
There are two possible directions for the propagation of the INS operation, which lead to two
different types of INS operation: (1) top-down INS, which starts from the root node and propagates
down until it reaches the target node; (2) bottom-up INS, which starts at the target node and
propagates up until it reaches the root node or until the generalized heap property is satisfied in
the middle of the heap.
Top-down INS operation. The top-down INS operation starts at the root node and propagates
down to the bottom level of the heap. The propagation of the top-down INS operation follows the
insert path: the MergeAndSort operation is performed between the new insert items and each node
on the insert path until it reaches the target node. An example of the top-down INS operation is
shown in Fig. 6 and the pseudocode is provided in Fig. 4(a).
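[Figure 6, panels (a) to (c), with k = 4: (a) root [0 0 0 1], child [2 2 3 4], insert items [0 4 7 7]; (b) after MergeAndSort with the root: root [0 0 0 0], child [2 2 3 4], insert items [1 4 7 7]; (c) root [0 0 0 0] with children [2 2 3 4] and [1 4 7 7].]
Fig. 6. Example: Top-Down Insertion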
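The top-down INS operation can be sketched sequentially as below. The sketch assumes a full batch of exactly k keys, nodes numbered from 1 with `B[0]` unused, and a helper `insert_path` (an illustrative name) that derives the insert path from the binary representation of the target index.

```python
def merge_and_sort(a, b):
    """Return (hi, lo): the len(a) smallest keys and the remaining keys."""
    merged = sorted(a + b)
    return merged[:len(a)], merged[len(a):]


def insert_path(tar):
    """Node indices from the root (1) down to the target node `tar`."""
    level = tar.bit_length() - 1           # floor(log2(tar))
    return [tar >> shift for shift in range(level, -1, -1)]


def top_down_insert(B, items):
    """Insert a full batch `items` into the generalized heap B, top-down."""
    items = sorted(items)
    tar = len(B)                           # index of the first empty node
    B.append(None)
    for cur in insert_path(tar)[:-1]:      # every node above the target
        # The node keeps the k smallest keys; the rest travel further down.
        B[cur], items = merge_and_sort(B[cur], items)
    B[tar] = items


# The state of Fig. 6(a): root [0 0 0 1] with one child [2 2 3 4].
heap = [None, [0, 0, 0, 1], [2, 2, 3, 4]]
top_down_insert(heap, [0, 4, 7, 7])
```

After the call, `heap` matches Fig. 6(c). Every node on the insert path is visited, even when the new keys are larger than everything already stored along the path.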
Boom-up INS operation. e boom-up INS operation inserts new keys from the boom of
the heap to the root batch of the heap and still follows the corresponding insert path. e pseudo
code is shown in Fig. 4(b). We rst move the to-insert new keys to the target node (line 3). Since
the generalized heap property may be violated, the MergeAndSort operation is performed between
the nodes on the insert path and their parent nodes. e propagation keeps going along the insert
path until it reaches the root batch or the generalized heap property is satised in the middle (line
5). An example of boom-up INS is provided in Fig. 7.
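[Figure 7, panels (a) to (c), with k = 4: (a) root [0 0 0 1], child [2 2 3 4], insert items [0 4 7 7]; (b) root [0 0 0 1] with children [2 2 3 4] and [0 4 7 7]; (c) after MergeAndSort with the parent: root [0 0 0 0] with children [2 2 3 4] and [1 4 7 7].]
Fig. 7. Example: Bottom-Up Insertion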
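A sequential sketch of the bottom-up INS operation, under the same assumptions as before (a full batch of k keys, nodes numbered from 1, `B[0]` unused), shows the early termination that distinguishes it from the top-down variant.

```python
def merge_and_sort(a, b):
    """Return (hi, lo): the len(a) smallest keys and the remaining keys."""
    merged = sorted(a + b)
    return merged[:len(a)], merged[len(a):]


def bottom_up_insert(B, items):
    """Insert a full batch `items` into the generalized heap B, bottom-up.

    Returns the number of MergeAndSort steps that were needed.
    """
    B.append(sorted(items))                # place the keys in the target node
    cur = len(B) - 1
    merges = 0
    while cur != 1:
        par = cur >> 1
        if B[cur][0] >= B[par][-1]:        # heap property already holds
            break
        # The parent keeps the k smallest keys, the child the k largest.
        B[par], B[cur] = merge_and_sort(B[par], B[cur])
        merges += 1
        cur = par
    return merges


# The state of Fig. 7(a): root [0 0 0 1] with one child [2 2 3 4].
heap = [None, [0, 0, 0, 1], [2, 2, 3, 4]]
steps = bottom_up_insert(heap, [0, 4, 7, 7])
```

Here one merge with the root is needed and `heap` ends in the state of Fig. 7(c). Inserting a batch such as [5, 6, 7, 8] instead would stop immediately at the target node without touching the root.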
Discussion. Compared to the top-down INS operation, the bottom-up INS operation may not need to
traverse all the nodes on the insert path since the generalized heap property may be satisfied in
the middle. Moreover, when concurrent INS and DEL operations are performed on the generalized
heap, the bottom-up INS operation can reduce the contention on the top levels of the heap. However,
the bottom-up INS operation needs to pay attention to the potential deadlock problem caused by
the opposite propagation directions of INS and DEL operations. We will discuss more about these
concurrent INS and DEL operations in the following sections.
3.3 Concurrent Heap
In this section, we describe how DEL and INS operations can be performed concurrently on our
parallel generalized heap. Our algorithms are inspired by the methods discussed in [11] and [7]
which introduced concurrent INS and DEL operations on a heap with k = 1, with top-down INS and
bottom-up INS operations respectively. In this paper, we call the concurrent heap with Top-Down
INS operation the TD-INS/TD-DEL Heap and the one with Bottom-Up INS the BU-INS/TD-DEL Heap.
3.3.1 Lock Order for INS and DEL operations. In [11] and [7], to support concurrent INS and DEL
operations while ensuring correctness and avoiding deadlocks, a simple lock strategy is applied.
Instead of locking the whole heap, each node of the heap is assigned a lock and only a small portion
of the nodes is locked at one time. Our first implementation adopts this method. In Fig. 8, we
show how we handle the locking order for both INS and DEL operations. The partial buffer is
handled when the root node is locked, so that both the root node and the partial buffer are protected
by the same lock. For all other nodes, each node has only one lock.
top-down insertion
Lock(N1)
     handle_partial_buffer(P)
Lock(Nk)
for (i = 1; i < k; i++) do
     do_heapify_work()
     if (i < k - 1) then
          Lock(Ni+1)
     endIf
     unLock(Ni)
endFor
unLock(Nk)
bottom-up insertion
Lock(N1)
     handle_partial_buffer(P)
Lock(Nk)
unLock(N1)
for (i = k; i > 1; i--) do
     unLock(Ni)
     Lock(Ni-1)
     Lock(Ni)
     do_heapify_work()
     unLock(Ni)
endFor
unLock(N1)
deletion
Lock(N1)
     handle_partial_buffer(P)
for (i = 1; i < k; i++) do
     Lock(LEFT(Ni))
     Lock(RIGHT(Ni))
     do_heapify_work()
     unLock(heapified_child)
     unLock(Ni)
endFor
unLock(Nk)
[Diagram: a binary tree with nodes numbered 1, 2, 3, 4, 5, ..., m, 2m, 2m+1; the insert/delete path N1, N2, N3, ..., Nk-1, Nk is marked and the partial buffer is labeled P.]
Fig. 8. Lock order for top-down INS, bottom-up INS and DEL operations
The top-down INS operation starts at the root node N1 and propagates along the insert path
N1, N2, ..., Nk. It does the heapify work of Ni once it has locked Ni. After it finishes its work,
it locks Ni+1 before it unlocks Ni, which follows a parent-child locking order. Similarly, the
bottom-up INS operation also follows the parent-child locking order. When the bottom-up INS is at
Ni, it releases the lock on Ni first, locks Ni−1 next, and then locks Ni. Note that, in this case, the
bottom-up INS operation holds no lock between releasing Ni and locking Ni−1. The DEL operation needs to
lock more nodes during its propagation. When it is at Ni and Ni is locked, it locks the two child
nodes of Ni and does the heapify work. After the work is done, it unlocks the child node that is already
heapified and then Ni. In this way, both INS and DEL operations follow the parent-child order so
that no deadlock can happen.
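The lock orders of Fig. 8 can be written out as event traces and checked against a single global order. The sketch below rests on two modeling assumptions that are not stated in the paper: the 1-based level-order node index is used as a total order that refines the parent-child order (a parent always has a smaller index than its children, and a left child a smaller index than its right sibling), and the target node reserved by a top-down insertion is excluded from the check because it is reserved up front rather than waited on along the path.

```python
def top_down_trace(path):
    """Lock/unlock events of a top-down insertion along path = [N1, ..., Nk]."""
    k = len(path)
    events = [("lock", path[0]), ("lock", path[-1])]
    for i in range(k - 1):
        if i < k - 2:
            events.append(("lock", path[i + 1]))   # lock the child first ...
        events.append(("unlock", path[i]))         # ... then release the parent
    events.append(("unlock", path[-1]))
    return events


def bottom_up_trace(path):
    """Lock/unlock events of a bottom-up insertion along path = [N1, ..., Nk]."""
    events = [("lock", path[0]), ("lock", path[-1]), ("unlock", path[0])]
    for i in range(len(path) - 1, 0, -1):
        events += [("unlock", path[i]), ("lock", path[i - 1]),
                   ("lock", path[i]), ("unlock", path[i])]
    events.append(("unlock", path[0]))
    return events


def delete_trace(path):
    """Lock/unlock events of a deletion that descends along `path`."""
    events = [("lock", path[0])]
    for i in range(len(path) - 1):
        left, right = 2 * path[i], 2 * path[i] + 1
        finished = right if path[i + 1] == left else left
        events += [("lock", left), ("lock", right),
                   ("unlock", finished), ("unlock", path[i])]
    events.append(("unlock", path[-1]))
    return events


def respects_order(events, reserved=None):
    """True if every lock is taken while all held locks have smaller indices."""
    held = set()
    for action, node in events:
        if action == "lock":
            if any(h > node for h in held if h != reserved):
                return False
            held.add(node)
        else:
            if node not in held:
                return False
            held.remove(node)
    return not held                         # everything was released


PATH = [1, 2, 5, 11]                        # root to node 11
```

Under these assumptions all three traces pass the check for the example path. When all blocking acquisitions follow one global order, no cycle of waiting operations can form, which is the usual argument for deadlock freedom.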
Each node of the heap is associated with a multi-state lock. This multi-state lock has multiple
states which indicate the status of each node. The multi-state locks for top-down INS and
bottom-up INS operations have different states. We describe the difference in the following sections.
We implement the multi-state lock using atomicCAS. Atomic operations are well optimized on
GPUs [2], which makes atomicCAS a straightforward choice for implementing the multi-state lock.
3.3.2 TD-INS/TD-DEL Heap. Our TD-INS/TD-DEL heap implements top-down insertions
and top-down deletions, using the locking order described in Fig. 8. We let the multi-state
lock have four different states. AVAIL indicates that the node is available. INUSE means that
the node is acquired by another operation; to lock a node, the state of that node is changed from
AVAIL to INUSE. TARGET indicates that the node is the target node of an
insert operation. The state of a node is changed to MARKED only when the target node is needed
by a delete operation, for insertion and deletion cooperation. The finite state automaton (FSA) is shown in Fig.
9.
The INS and DEL operations can cooperate to speed up the propagation [11]. When the DEL
operation needs to fill the root node and the last leaf node is not ready because it is
the target node of an in-progress INS operation, the DEL operation does not need to wait until
the last leaf node is ready: it can fill the root node with the keys from the INS
operation.
The DEL operation changes the state of the last node from TARGET to MARKED to let the
INS operation know that a DEL operation is asking for the insert keys. When the INS operation
finds that the state of the target node is MARKED, it moves the insert keys to the root node
and terminates. The DEL operation can then continue and use those keys in the root node for
propagation.
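The cooperation protocol can be traced with an emulated CAS. The sketch below is a sequential illustration of one reading of the text, not the GPU implementation: `cas` stands in for an atomic compare-and-swap that returns the previous state, node states live in a dictionary, and the function names `del_request` and `ins_finish` are hypothetical.

```python
AVAIL, INUSE, TARGET, MARKED = "AVAIL", "INUSE", "TARGET", "MARKED"


def cas(states, node, expected, new):
    """Emulated compare-and-swap on a node state; returns the old state."""
    old = states[node]
    if old == expected:
        states[node] = new
    return old


def del_request(states, tar):
    """DEL asks an in-progress INS for its keys by marking the target node.

    Returns True if the node was a TARGET, i.e. the request was registered.
    """
    return cas(states, tar, TARGET, MARKED) == TARGET


def ins_finish(states, batches, tar, items):
    """Final step of an INS: store the keys and release the target node.

    If a DEL marked the target in the meantime, the keys go to the root
    (node 1) instead, where the waiting DEL picks them up.
    """
    old = cas(states, tar, TARGET, INUSE)
    dest = tar if old == TARGET else 1
    batches[dest] = items
    states[tar] = AVAIL                    # lets a waiting DEL proceed
    return dest


# Scenario: a DEL holds the root while an INS is still heading for node 3.
states = {1: INUSE, 3: TARGET}
batches = {1: None, 3: None}
registered = del_request(states, 3)        # TARGET -> MARKED
dest = ins_finish(states, batches, 3, [1, 4, 7, 7])
```

In this scenario the request is registered, the keys land in the root, and the target node returns to AVAIL. Without a preceding `del_request`, `ins_finish` stores the keys in the target node itself.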
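[Figure 9: state diagrams of the multi-state lock. TD-INS/TD-DEL Heap states: AVAIL, INUSE, TARGET, MARKED. BU-INS/TD-DEL Heap states: AVAIL, INUSE, INSHOLD, DELMOD. The legend distinguishes transitions taken by the insert process, the delete process, and both (I/D process).]
Fig. 9. FSA of TD-INS/TD-DEL Heap and BU-INS/TD-DEL Heap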
To handle partial batch insertion, we acquire the partial buffer at the time when we hold the
root node. Since only one operation can work on a node at a time, this ensures that no two
operations work on the partial buffer at the same time. Then we apply the MergeAndSort
operation between the insert keys and the partial buffer. We check whether the partial buffer has enough
space to contain the new keys. If so, we perform another MergeAndSort operation between the
partial buffer and the root node to satisfy generalized heap property 3. If not, we obtain the k
smallest keys from the MergeAndSort result as a full batch and propagate it down through the root
node, while leaving the remaining keys in the partial buffer.
For the DEL operation, we only consider deleting items from the partial buffer when the total
number of keys is less than a full batch, in other words, when all heap nodes are empty. This is because,
based on generalized heap property 3, the root node always has the smallest keys in the heap.
Although allowing partial batch insertion causes extra overhead, the inserted keys in
the partial buffer do not need to propagate into the heap until the partial buffer
overflows. In this way, we still gain the benefits of memory locality and intra-node parallelism.
We show the pseudocode of INS and DEL operations on the TD-INS/TD-DEL Heap in Fig. 10.
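The two cases of partial batch insertion can be sketched as a single function. This is a sequential illustration with assumed names: `insert_partial` takes the root batch, the partial buffer, and at most k new keys, and returns the updated root, the updated buffer, and a full batch that still has to be inserted into the heap (or None if the buffer did not overflow). An empty heap is represented by an empty root list.

```python
def insert_partial(root, pbuf, items, k):
    """Handle an insertion of at most k keys through the partial buffer.

    Returns (root, pbuf, full_batch); full_batch is None unless the
    partial buffer overflowed.
    """
    merged = sorted(pbuf + items)          # insert keys meet the buffer first
    if len(merged) < k:
        # Everything fits: merge with the root so that the buffer only
        # holds keys >= every key in the root (Property 3).
        both = sorted(root + merged)
        return both[:len(root)], both[len(root):], None
    # Overflow: the k smallest keys form a full batch for a regular INS,
    # and the remaining (at most k - 1) keys stay in the buffer.
    return root, merged[k:], merged[:k]


K = 4
fits = insert_partial([0, 0, 0, 1], [3, 5], [0], K)
overflow = insert_partial([0, 0, 0, 1], [3, 5], [2, 4, 6], K)
```

In the first call the new key 0 displaces the key 1 from the root into the buffer; in the second call the batch [2, 3, 4, 5] is handed on for a full insertion and only the key 6 stays in the buffer.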
define state(x) as heap.node[x].state
define MS_LOCK(x, stateA, stateB) as
     while (CAS(state(x), stateA, stateB) != stateA)
define MS_TRYLOCK(x, stateA, stateB) as
     return CAS(state(x), stateA, stateB)
define MS_UNLOCK(x, stateA, stateB) as
     CAS(state(x), stateA, stateB)
Procedure top_down_insert ( insItems, insSize )
1.      insItems = sort(insItems)
2.      MS_LOCK(1, AVAIL, INUSE)
3.      if (pBuffer.size + insSize >= K) then
4.           (insItems, pBuffer) = MergeAndSort(insItems, pBuffer)
5.      else
6.           (pBuffer, insItems) = MergeAndSort(insItems, pBuffer)
7.           if (lastelem != 0) then
8.                (B[1], pBuffer) = MergeAndSort(B[1], pBuffer)
9.           MS_UNLOCK(1, INUSE, AVAIL)
10.          return
11.     tar = ++lastelem; cur = 1; level = floor(log2(tar))
12.     if (tar != 1) then
13.          MS_LOCK(tar, AVAIL, TARGET)
14.     while (cur != tar) do
15.          if (state(tar) == MARKED) then break
16.          (B[cur], insItems) = MergeAndSort(B[cur], insItems)
17.          cur = tar >> --level
18.          if (cur != tar) then
19.               MS_LOCK(cur, AVAIL, INUSE)
20.          MS_UNLOCK(cur >> 1, INUSE, AVAIL)
21.     tstate = MS_TRYLOCK(tar, TARGET, INUSE)
22.     tar = tstate == TARGET ? tar : 1
23.     B[tar] = insItems
24.     if (tar != cur) then
25.          MS_UNLOCK(tar, state(tar), AVAIL)
26.     MS_UNLOCK(cur, INUSE, AVAIL)
Procedure delete ( deleteItems )
     MS_LOCK(1, AVAIL, INUSE)
     if (lastelem == 0) then
          if (pBuffer.size != 0) then
               deleteItems = pBuffer[1:pBuffer.size]
          MS_UNLOCK(1, INUSE, AVAIL)
          return
     deleteItems = B[1]
     tar = lastelem--; cur = 1
     tstate = MS_TRYLOCK(tar, TARGET, MARKED)
     if (tstate == MARKED) then
          while (state(tar) != AVAIL)   // spin until the target node is released
     else
          MS_LOCK(tar, AVAIL, INUSE)
          B[1] = B[tar]; B[tar] = MAX_VALUE
          MS_UNLOCK(tar, INUSE, AVAIL)
     (B[1], pBuffer) = MergeAndSort(B[1], pBuffer)
     while (1) do
          l = LEFT(cur); r = RIGHT(cur)
          lstate = INUSE; rstate = INUSE
          while (lstate == INUSE) do
               lstate = MS_TRYLOCK(l, AVAIL, INUSE)
          if (lstate != AVAIL) then   // cur has no child
               MS_UNLOCK(cur, INUSE, AVAIL)
               return
          while (rstate == INUSE) do
               rstate = MS_TRYLOCK(r, AVAIL, INUSE)
          if (rstate != AVAIL) then   // cur only has left child
               (B[cur], B[l]) = MergeAndSort(B[cur], B[l])
               break
          // Suppose right child batch has the largest item
          (B[l], B[r]) = MergeAndSort(B[l], B[r])
          MS_UNLOCK(r, INUSE, AVAIL)
          if (B[cur][K - 1] <= B[l][0]) then break
          (B[cur], B[l]) = MergeAndSort(B[cur], B[l])
          MS_UNLOCK(cur, INUSE, AVAIL)
          cur = l
     MS_UNLOCK(cur, INUSE, AVAIL)
     MS_UNLOCK(l, INUSE, AVAIL)

Fig. 10. Pseudo Codes for TD-INS/TD-DEL Heap Operations
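The three lock macros at the top of Fig. 10 are all thin wrappers around one compare-and-swap on the node state. A minimal CPU-side sketch is given below, assuming a `threading.Lock` as a stand-in for the hardware atomic; the class name `MultiStateLock` and its method names are hypothetical, and on the GPU the same role would be played by an atomic CAS on an integer state word.

```python
import threading

AVAIL, INUSE, TARGET, MARKED = "AVAIL", "INUSE", "TARGET", "MARKED"


class MultiStateLock:
    """Sketch of a multi-state lock built on compare-and-swap (CAS)."""

    def __init__(self, state=AVAIL):
        self._state = state
        self._guard = threading.Lock()  # emulates the atomicity of a hardware CAS

    def cas(self, expected, new):
        """Set the state to `new` only if it equals `expected`; return the old state."""
        with self._guard:
            old = self._state
            if old == expected:
                self._state = new
            return old

    def lock(self, expected, new):
        """MS_LOCK: spin until the transition expected -> new succeeds."""
        while self.cas(expected, new) != expected:
            pass

    def try_lock(self, expected, new):
        """MS_TRYLOCK: attempt the transition once and report the observed state."""
        return self.cas(expected, new)

    def unlock(self, expected, new):
        """MS_UNLOCK: release by moving the state from expected to new."""
        self.cas(expected, new)

    @property
    def state(self):
        with self._guard:
            return self._state
```

The return value of `try_lock` is what lets a caller distinguish "the node is busy" (`INUSE`) from "the node is in some other state", which is how the procedures in Fig. 10 decide whether to spin, skip, or stop.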
3.3.3 Linearizability of TD-ins/TD-del Heap. We show that the heap with top-down insertion
and top-down deletion (TD-Ins/TD-Del) is linearizable.
In order to reason about linearizability, we need to define our notation. An ins or del
operation takes a certain amount of time to complete. We denote the time an operation is invoked
as the invocation time and the time when an operation is completed as the response time. A history
includes an ordered list of invocation and response events (ordered with respect to time).
Our TD-Ins/TD-Del implementation uses fine-grained locks: each node is associated with a
lock. We denote the time a thread acquires the lock of a node as the acquire time and the time a thread
releases the lock of a node as the release time.
We denote an operation with a 4-tuple followed by two parameters, $op[s, ac, re, t](x)\,T$. The
symbol $op$ is the type of the operation, ins or del; $s$ is the invocation time and $t$ is the response time; $ac$
refers to the acquire time of a node of interest; $re$ refers to the release time of the same node; $x$ is the parameter
of the operation: if the operation is an insertion, $x$ is inserted into the heap; if the operation is a
deletion, x is returned; T refers to the thread id. Note that both ac and re lie within the
time interval [s, t], and that ac < re.
To prove linearizability, we need to show that for any given history H, we can find a correct
sequential history S based on a valid reordering of invocation and response events in H. Here
the term "valid reordering" means that, for two operations e0 and e1 in H, if the
response time of e0 is before the invocation time of e1, then e0 precedes e1 in the sequential history.
To prove that such a sequential history exists, we first prove the following lemma.
Lemma 3.1. No two threads can work on the same heap node simultaneously.
Proof. According to our implementation, if a thread T has acquired a node B, which means
T has changed B's state to INUSE, then no other thread can acquire B until T releases it.
We denote a history $H$ with $n$ operations as
$H = \{\, op^H_i(s_i, ac^R_i, re^R_i, t_i)\; x^H_i\; T^H_i \mid 1 \le i \le n \,\}$.³
Here $ac^R$ and $re^R$ respectively refer to the acquire and release of the lock for the root node
in the heap. In our notation, the history $H$ is an ordered list whose operations are ordered
with respect to the time the root node is released. Each operation in the TD-INS/TD-DEL heap
needs to acquire the root as its first step in Fig. 10 and, based on Lemma 3.1, only one thread can
successfully acquire the root node at a time. Thus for two operations
$op^H_u(s_u, ac^R_u, re^R_u, t_u)\; x^H_u\; T^H_u$ and $op^H_v(s_v, ac^R_v, re^R_v, t_v)\; x^H_v\; T^H_v$,
we have $u < v$ if and only if $re^R_u < ac^R_v$.
Theorem 3.2. The TD-INS/TD-DEL heap is linearizable.
Proof. We show that we can construct a sequential history $S$ given any $H$. To construct the
sequential history, we first construct a list of linearization points $\{ N_i \mid i = 1, \dots, n \}$ such that
$ac^R_i < N_i < re^R_i$. Simply put, $N_i$ is an arbitrary time point between the pair of events that acquire
and release the root node for the i-th operation. An example of setting linearization points is shown in Fig. 11.
An operation appears to occur instantaneously at its linearization point. A linearization point
has to lie between the invocation time and the response time of an operation. Since acquiring and
releasing the root is a step within every update in the TD-Ins/TD-Del heap, the time points $N_i$ we set here
naturally lie between the invocation and response times.
Next we construct a sequential history as
$S = \{\, op^S_i(N_i)\; x^S_i\; T^S_i \mid 1 \le i \le n \,\}$
We set a one-to-one correspondence between the i-th operation $op^H_i(s_i, ac^R_i, re^R_i, t_i)\; x^H_i\; T^H_i$
in $H$ and the i-th operation $op^S_i(N_i)\; x^S_i\; T^S_i$ in $S$. We set $T^S_i = T^H_i$ and $op^S_i = op^H_i$. For all insert
operations, we set $x^S_i = x^H_i$, which means we insert the same key items for the corresponding
operation in $S$. Next we prove that $x^S_i = x^H_i$ for every delete update, which means every delete
operation in $S$ returns the same value as its corresponding delete operation in $H$.
Assume that when the m-th operation in $H$ releases the root node at time $re^R_m$, the heap value is
$Heap^H_m$, its set of keys is denoted as $set(Heap^H_m)$, and its root node is denoted as $root(Heap^H_m)$.
Similarly, at the time $N_m$ of the m-th operation in $S$, assume the heap value is $Heap^S_m$, its set of keys
is denoted as $set(Heap^S_m)$, and its root node is denoted as $root(Heap^S_m)$. We prove two properties:
L1: $set(Heap^H_i) = set(Heap^S_i)$, $1 \le i \le m$
L2: $root(Heap^H_i) = \min(set(Heap^H_i))$ and $root(Heap^S_i) = \min(set(Heap^S_i))$, $1 \le i \le m$
We prove properties L1 and L2 by induction. Initially we have $H_0$, which is an empty history, and
a heap value $Heap^H_0$. We set $S_0$ empty and set $Heap^S_0$ to be the same as $Heap^H_0$. Properties L1 and
L2 are satisfied for the initial heap.
³ This is slightly different from the traditional notation of a history, but means the same.
Assume that at the time point $re^R_k$ in $H$ and at the time point $N_k$ in $S$, properties L1 and L2 hold. We
just need to prove that at the time point $re^R_{k+1}$ in $H$ and the time point $N_{k+1}$ in $S$, properties L1 and
L2 hold as well. There are two cases.
Case I: the (k+1)-th operation in $H$ is an insertion, $ins^H_{k+1}(s_{k+1}, ac^R_{k+1}, re^R_{k+1}, t_{k+1})\; x^H_{k+1}\; T^H_{k+1}$.
In our implementation, by the time a thread releases the root node at $re^R_{k+1}$, the new
item $x^H_{k+1}$ has already been merged with the original root; the smaller items of the merged result are
kept in the root, while the larger items may or may not have propagated down the heap yet. Therefore the root
node contains the smallest items after taking the new item $x^H_{k+1}$ into consideration. Formally,
$set(Heap^H_{k+1}) = set(Heap^H_k) \cup x^H_{k+1}$ and $root(Heap^H_{k+1}) = \min(set(Heap^H_{k+1}))$.
In the sequential history, since we also set the (k+1)-th operation as an insert update at time $N_{k+1}$, the
operation is $ins^S_{k+1}(N_{k+1})\; x^S_{k+1}\; T^H_{k+1}$, where $x^S_{k+1} = x^H_{k+1}$. In the sequential history, the insertion
occurs as if instantaneously, thus
$set(Heap^S_{k+1}) = set(Heap^S_k) \cup x^H_{k+1}$ and $root(Heap^S_{k+1}) = \min(set(Heap^S_{k+1}))$.
Since $set(Heap^H_k) = set(Heap^S_k)$ by the induction hypothesis, we have
$set(Heap^H_{k+1}) = set(Heap^S_{k+1})$, $root(Heap^H_{k+1}) = \min(set(Heap^H_{k+1}))$ and
$root(Heap^S_{k+1}) = \min(set(Heap^S_{k+1}))$. Properties L1 and L2 hold.
Case II: the (k+1)-th operation in $H$ is a deletion, $del^H_{k+1}(s_{k+1}, ac^R_{k+1}, re^R_{k+1}, t_{k+1})\; x^H_{k+1}\; T^H_{k+1}$.
In our implementation, the deletion removes the root, which is $\min(set(Heap^H_k))$, and re-heapifies
the heap. It does not release the root until the root node is updated to the smallest of the remaining items,
thus at the time point $re^R_{k+1}$, $root(Heap^H_{k+1}) = \min(set(Heap^H_k) - \min(set(Heap^H_k)))$.
In the sequential history, we set a delete operation at the time point $N_{k+1}$, as if the delete
update happens instantaneously. Then the delete update returns $\min(set(Heap^S_k))$, which is the
same as $\min(set(Heap^H_k))$, and in the meantime the root node is updated to
$\min(set(Heap^S_k) - \min(set(Heap^S_k)))$, which is also the same as the root of $Heap^H_{k+1}$ as described above.
Thus properties L1 and L2 still hold.
Thus we have successfully constructed a sequential history S from any given history H. Therefore,
the TD-INS/TD-DEL heap is linearizable.
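The construction in the proof can be exercised on a small recorded history. The sketch below is only an illustration under simplifying assumptions: each operation carries a single key rather than a batch, the record fields (`op`, `x`, `T`, `s`, `ac`, `re`, `t`) are hypothetical names for the quantities in the notation above, and Python's `heapq` plays the role of the sequential heap. It orders the operations by root release time, checks that this order respects real-time precedence, and replays it to confirm that every deletion returns the minimum.

```python
import heapq


def sequential_witness(history):
    """Return the sequential history ordered by root release time, or None
    if that order violates real-time precedence or a delete result."""
    order = sorted(history, key=lambda e: e["re"])
    rank = {id(e): pos for pos, e in enumerate(order)}

    # Valid reordering: if a responds before b is invoked, a must precede b.
    for a in history:
        for b in history:
            if a["t"] < b["s"] and rank[id(a)] > rank[id(b)]:
                return None

    # Replay sequentially; each del must return the current minimum.
    heap = []
    for e in order:
        if e["op"] == "ins":
            heapq.heappush(heap, e["x"])
        elif not heap or heapq.heappop(heap) != e["x"]:
            return None
    return [(e["op"], e["x"], e["T"]) for e in order]


# Three overlapping operations in the spirit of Fig. 11 (times are made up).
history = [
    {"op": "ins", "x": 3, "T": "T", "s": 1, "ac": 3, "re": 4, "t": 6},
    {"op": "del", "x": 3, "T": "W", "s": 2, "ac": 5, "re": 6, "t": 8},
    {"op": "ins", "x": 7, "T": "P", "s": 0, "ac": 1, "re": 2, "t": 9},
]
witness = sequential_witness(history)
```

Here the three operations overlap in time, yet ordering them by when they released the root gives a legal sequential execution in which the deletion returns the smallest key inserted so far.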

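[Figure: timeline of a history H with three overlapping operations: Ins(Z) by thread P (s1, acR1, reR1, t1), Ins(X) by thread T (s2, acR2, reR2, t2), and Del(Y) by thread W (s3, acR3, reR3, t3). Each operation consists of a Root phase followed by a Non-Root phase, and its linearization point (Nz, Nt, and Nw respectively) is placed inside the Root phase.]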
Fig. 11. Example: set up of linearization points for TD-INS/TD-DEL heap operation history
3.3.4 BU-INS/TD-DEL Heap. To reduce the contention that top-down INS operations place on the
root node, Hunt et al. [7] proposed a mechanism that performs bottom-up INS operations while resolving
the potential deadlocks that arise from the opposite propagation directions. It allows the insert thread to temporarily
release control of the insert items, and a tag is used to store the pid of the thread that modifies
the insert items. Here we use a similar method but do not need to store the pid.
The multi-state lock used in the BU-INS/TD-DEL Heap has four different states. INUSE and AVAIL
are the same as the ones used in the TD-INS/TD-DEL Heap. Since the insert operation may temporarily
release control of its node, it uses the state to tell whether the node has been modified
since it released the node. It changes the state of the node from INUSE to INSHOLD when
it releases the node. When the insert operation acquires the node again and finds that the state
is no longer INSHOLD, one or more delete operations have modified this batch, and
the insert operation can skip this batch since the delete operation makes sure the sub-heap
satisfies the generalized heap property. Conversely, a delete operation which acquires
the node from INSHOLD to INUSE will change the state to DELMOD when releasing the node.
We show the FSA of the BU-INS/TD-DEL Heap in Fig. 9. The pseudo code of the concurrent
insert and delete operations in the bottom-up manner is shown in Fig. 12.
When an insert operation reacquires the temporarily released node, there are three possible cases
for the state of that node:
(1) INSHOLD: the insert operation holds the batch successfully and the MergeAndSort operation
can be performed with the parent batch.
(2) DELMOD: the batch has been modified by one or more delete operations, and the insert
operation can move to the parent batch.
(3) INUSE: some other operation is using this batch.
The state of the parent node may also have changed. If the state is not AVAIL or INUSE, this means
a delete operation has already deleted the parent node and the insert operation can terminate.
Partial buer is handled at the beginning of each operation when it locks the root node. For both
INS and DEL operations, the part for handling the partial buer is exactly the same as the one we
showed in Section 3.3.2 so as to make sure generalized heap property 3 is satised.
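The state handling of the bottom-up insert can be summarized as a small decision table. The sketch below is an interpretation of the description above together with Fig. 12, not code from the implementation: the enum `State` and the function names are hypothetical, and the AVAIL branch (terminate) is taken from the pseudo code in Fig. 12 rather than from the three cases listed in the text.

```python
from enum import Enum


class State(Enum):
    AVAIL = 0
    INUSE = 1
    INSHOLD = 2
    DELMOD = 3


def insert_step_on_reacquire(observed):
    """What a bottom-up insert does after trying INSHOLD -> INUSE on its node."""
    if observed is State.INUSE:
        return "retry"              # another operation holds the batch
    if observed is State.INSHOLD:
        return "merge_with_parent"  # CAS succeeded; batch is untouched
    if observed is State.DELMOD:
        return "skip_to_parent"     # a delete already fixed this sub-heap
    return "terminate"              # AVAIL: nothing left to do for this insert


def delete_release_state(state_before_acquire):
    """State a delete writes back when it releases a node it had acquired."""
    if state_before_acquire is State.INSHOLD:
        return State.DELMOD         # tell the waiting insert the batch changed
    return state_before_acquire
```

The only information exchanged between an insert and a delete that cross paths is this single state word, which is why no thread id needs to be stored in the node.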
3.3.5 Linearizability of BU-ins/TD-del Heap. Now we show that the BU-INS/TD-DEL Heap
is linearizable. We use the same notation that we defined in Section
3.3.3. We write $ins^H_i(s_i, ac^L_i, re^L_i, t_i)\; x^H_i\; T^H_i$ for INS operation $i$ and
$del^H_j(s_j, ac^R_j, re^R_j, t_j)\; x^H_j\; T^H_j$
for DEL operation $j$. Here $ac^L$ and $re^L$ respectively refer to the acquire and release of the last
locked node in a bottom-up INS operation. This last locked node may or may not be the root node,
since the generalized heap property may be satisfied in the middle of a bottom-up INS update.
We denote a history $H$ with $n$ operations as $H = \{\, op^H_i(s_i, ac_i, re_i, t_i)\; x^H_i\; T^H_i \mid 1 \le i \le n \,\}$, where $ac_i$
and $re_i$ are either $ac^L_i$ and $re^L_i$ for INS, or $ac^R_i$ and $re^R_i$ (R is for root⁴) for DEL. The operations
in $H$ are ordered with respect to the time $re_i$ (either when INS releases the last locked node or when
DEL releases the root node). It is possible that the times $re_u$ and $re_v$ are the same for two operations
$op_u$ and $op_v$. In this case, an arbitrary order can be chosen. Thus for two operations
$op^H_u(s_u, ac_u, re_u, t_u)\; x^H_u\; T^H_u$ and $op^H_v(s_v, ac_v, re_v, t_v)\; x^H_v\; T^H_v$,
we have $u < v$ when $re_u < re_v$ or $re_u = re_v$.
Lemma 3.3. Given a delete operation $del^H_i(s_i, ac^R_i, re^R_i, t_i)\; x^H_i\; T^H_i$ and an insert operation
$ins^H_j(s_j, ac^L_j, re^L_j, t_j)\; x^H_j\; T^H_j$ such that $[ac^L_j, re^L_j] \cap [ac^R_i, re^R_i] \ne \emptyset$, then $x^H_i \le x^H_j$.
Proof. Based on Lemma 3.1 and the condition $[ac^L_j, re^L_j] \cap [ac^R_i, re^R_i] \ne \emptyset$, we know that the
last locked node of the insert operation is not the root node. Thus we can derive that
$\min(x^H_j) \ge \max(x^H_i)$, which indicates that $x^H_i \le x^H_j$.
Theorem 3.4. The BU-INS/TD-DEL heap is linearizable.
Proof. We show that we can construct a sequential history $S$ given any $H$. We construct a list
of linearization points $\{ N_i \mid i = 1, \dots, n \}$ such that $ac^R_i < N_i < re^R_i$ if the i-th operation is DEL, or
$ac^L_i < N_i < re^L_i$ if the i-th operation is INS. We pick $N_i$ as an arbitrary time point within the
⁴ The same notation is already used in Section 3.3.3.
Procedure bottom_up_insert ( insItems, insSize )
     insItems = sort(insItems)
     MS_LOCK(1, AVAIL, INUSE)
     if (pBuffer.size + insSize >= K) then
          (insItems, pBuffer) = MergeAndSort(insItems, pBuffer)
     else
          (pBuffer, insItems) = MergeAndSort(insItems, pBuffer)
          if (lastelem != 0) then
               (B[1], pBuffer) = MergeAndSort(B[1], pBuffer)
          MS_UNLOCK(1, INUSE, AVAIL)
          return
     cur = lastelem++; par = cur >> 1
     if (cur != 1) then
          MS_LOCK(cur, AVAIL, INUSE)
          MS_UNLOCK(1, INUSE, AVAIL)
     B[cur] = insItems
     while (cur != 1) do
          MS_UNLOCK(cur, INUSE, INSHOLD)   // temporarily release the node
          pstate = INUSE; cstate = INUSE
          while (pstate == INUSE || pstate == INSHOLD) do
               pstate = MS_TRYLOCK(par, AVAIL, INUSE)
          if (pstate != AVAIL) then return
          while (cstate == INUSE) do
               cstate = MS_TRYLOCK(cur, INSHOLD, INUSE)
          if (cstate == DELMOD) then
               MS_UNLOCK(cur, DELMOD, AVAIL)
          elif (cstate == AVAIL) then
               MS_UNLOCK(par, INUSE, AVAIL)
               return
          else
               if (B[cur][0] >= B[par][K - 1]) then
                    MS_UNLOCK(par, INUSE, AVAIL)
                    break
               (B[par], B[cur]) = MergeAndSort(B[par], B[cur])
               MS_UNLOCK(cur, INUSE, AVAIL)
          cur = par; par = cur >> 1
     MS_UNLOCK(cur, INUSE, AVAIL)

Procedure delete ( deleteItems )
     MS_LOCK(1, AVAIL, INUSE)
     if (lastelem == 0) then
          if (pBuffer.size != 0) then
               deleteItems = pBuffer[1:pBuffer.size]
          MS_UNLOCK(1, INUSE, AVAIL)
          return
     deleteItems = B[1]
     tar = lastelem--
     MS_LOCK(tar, AVAIL, INUSE)
     B[1] = B[tar]; B[tar] = MAX_VALUE
     MS_UNLOCK(tar, INUSE, AVAIL)
     (B[1], pBuffer) = MergeAndSort(B[1], pBuffer)
     cur = 1; cstate = INUSE
     while (1) do
          l = LEFT(cur); r = RIGHT(cur)
          lstate = INUSE; rstate = INUSE
          while (lstate == INUSE) do
               lstate = MS_TRYLOCK(l, state(l), INUSE)
          while (rstate == INUSE) do
               rstate = MS_TRYLOCK(r, state(r), INUSE)
          lstate = lstate == INSHOLD ? DELMOD : lstate
          rstate = rstate == INSHOLD ? DELMOD : rstate
          // Suppose right child batch has the largest item
          (B[l], B[r]) = MergeAndSort(B[l], B[r])
          MS_UNLOCK(r, INUSE, rstate)
          if (B[cur][K - 1] <= B[l][0]) then
               MS_UNLOCK(cur, INUSE, cstate)
               MS_UNLOCK(l, INUSE, lstate)
               return
          (B[cur], B[l]) = MergeAndSort(B[cur], B[l])
          MS_UNLOCK(cur, INUSE, cstate)
          cur = l; cstate = lstate

Fig. 12. Concurrent Insertion and Deletion with partial batch in Bottom-Up Mode
provided time range. We construct the sequential history $S = \{\, op^S_i(N_i)\; x^S_i\; T^S_i \mid 1 \le i \le n \,\}$ based
on the linearization points. Each operation $op^S_i$ in $S$ corresponds to an operation $op^H_i$ in $H$.
As in Section 3.3.3, we set $x^S_i = x^H_i$ if the i-th operation is INS. We will prove that
$x^S_j = x^H_j$ if the j-th operation is DEL. We prove this by induction. Initially we have an empty history
$H_0$ and the heap value $Heap^H_0$. We set $S_0$ empty and set $Heap^S_0$ to be the same as $Heap^H_0$. At the
beginning, we perform a (dummy) deletion in $H$ and also a (dummy) deletion in $S$; both DEL operations return
the same result. The dummy deletion in $H$ completes before any real operation starts. The heap
value for $S$ and for $H$ after the dummy deletion will be the same.
Assume that $x^S_k = x^H_k$ at the time $re_k$ in $H$ and at $N_k$ in $S$, where the k-th operation in $H$ is a
delete operation $del^H_k$ and the k-th operation in $S$ is also a delete operation $del^S_k$. Additionally, assume
$set(Heap^S_k) = set(Heap^H_k)$, where $Heap^S_k$ is the heap value at the time $N_k$ in $S$, while $Heap^H_k$ is not
necessarily the heap value at the time $re^R_k$ in $H$; rather, it is the set of keys contributed by
all preceding insertions and deletions in $H$'s ordered list (note that the operations in $H$ are already ordered,
see the beginning of Section 3.3.5). Let the next delete update in $H$ be the (k+m)-th operation $op^H_{k+m}$.
If we prove that at the time point $re_{k+m}$ in $H$ and $N_{k+m}$ in $S$ with $m \ge 1$, both delete operations
return the same value, that is $x^S_{k+m} = x^H_{k+m}$, and $set(Heap^S_{k+m}) = set(Heap^H_{k+m})$, then we prove
that all matching delete operations in $S$ and $H$ return the same value by induction.
In the concurrent history $H$, between the time points $re^R_k$ and $re^R_{k+m}$, there are $m - 1$ concurrent
operations and these operations are all insert operations. Among these $m - 1$ insert operations,
we let $I^H$ (written $I$ below) be the set of insert operations such that
$I^H = \{\, ins^H_i(s_i, ac^L_i, re^L_i, t_i)\; x^H_i\; T^H_i \mid k < i < k + m,\ (ac^L_i, re^L_i) \cap (ac^R_{k+m}, re^R_{k+m}) \ne \emptyset \,\}$
We let the set $I'$ be the insert operations among these $m - 1$ operations that are not in $I$. The difference
between $I$ and $I'$ is that all $I'$ operations complete before $op^H_{k+m}$, while $I$ operations might overlap
with $op^H_{k+m}$. We denote the inserted keys contributed by $I'$ as $X^{I'} = \bigcup_{i \in I'} x^H_i$. The set of keys in
the heap would be $X^{I'} \cup Heap^H_k$ if none of the operations in $I$ has taken effect, and the minimum would
be $\min(X^{I'} \cup Heap^H_k)$.
Let $X^I = \bigcup_{j \in I} x^H_j$. It is not difficult to show that
$\min(X^{I'} \cup Heap^H_k) = \min(X^{I'} \cup X^I \cup Heap^H_k)$.
According to Lemma 3.3, for any insertion $i$ in $I$, since its $(ac^L, re^L)$ interval overlaps with the
root acquiring and releasing interval of $del^H_{k+m}$, we know that the last node locked by operation $i$
cannot be the root, and thus the inserted value $x^S_i$, $i \in I$, cannot be smaller than the root node. That
is, $x^S_i \ge \min(X^{I'} \cup Heap^H_k)$ for any $i \in I$. Therefore,
$\min(X^{I'} \cup Heap^H_k) = \min(X^{I'} \cup X^I \cup Heap^H_k)$ is
proved. The implication is that regardless of whether any or all operations in the set $I$ complete,
$op^H_{k+m}$ will always return $\min(X^{I'} \cup X^I \cup Heap^H_k)$, which is
$\min(\bigcup_{k<i<k+m} x^H_i \cup Heap^H_k)$.
In the sequential history $S$, there are $m - 1$ insert operations between the time points $N_k$ and $N_{k+m}$.
The heap should include the set of keys that are in $Heap^S_k$ and also $x^S_i$ ($k < i < k + m$). Thus $op^S_{k+m}$
returns $\min(\bigcup_{k<i<k+m} x^S_i \cup Heap^S_k)$.
Since $set(Heap^S_k) = set(Heap^H_k)$ and all matching insertion operations use the same parameter
value in $H$ and $S$, both $op^H_{k+m}$ and $op^S_{k+m}$ return the same value. Note that it is trivial to prove that
$set(Heap^H_{k+m}) = set(Heap^S_{k+m})$.
Thus we have successfully constructed a sequential history S from any given history H. Therefore,
the BU-INS/TD-DEL heap is linearizable.

4 IMPLEMENTATION
The building blocks of the generalized heap include the sorting operation, the MergeAndSort
operation, and the multi-state lock that we introduced in Section 3.3.1. We use the parallel bitonic
sorting algorithm for the local sorting operation within a thread block and the merge-path [4] algorithm
for the MergeAndSort operation. We introduce an optimization that eliminates redundant MergeAndSort
operations and enable an early stop mechanism to reduce the total computation load and alleviate
the contention on the locks.
In our implementation, the threads in one thread block work together on one INS or DEL operation.
We choose thread-block-level operations since barrier synchronization is provided within a thread
block, while no built-in inter-CTA synchronization is provided and the overhead of synchronization
between all thread blocks is high.
Besides, threads within the same thread block have access to the same shared memory space,
which can increase data reuse during propagation. We load frequently used items into the shared
memory. Thread-block-level INS and DEL operations also benefit from memory coalescing.
The items in the same node are placed contiguously in memory so that threads within a
warp can achieve maximum memory coalescing. Also, the multi-state lock is a thread-block-level
lock, which is safer than a thread-level lock since a thread-level lock may cause deadlock due to
desynchronization within a warp [16].
4.1 Sorting Operation
The INS operation sorts the to-insert items before the propagation starts. To perform sorting, these
to-insert items are loaded into the shared memory for efficient data access and movement. In our
generalized heap implementation, the number of to-insert items for one insert operation is limited
by the size of the shared memory per thread block (no more than 1K pairs in our case). We choose
the parallel bitonic sorting algorithm as it adapts well to our thread-block-level operations.
Bitonic sorting is a comparison-based sorting algorithm. Other efficient non-comparison-based
GPU sorting algorithms (e.g. parallel radix sort) require types to have the same lexicographical
order as integers. This not only limits the practical use of these sorting algorithms to numeric
types like int or float, but also makes their sorting complexity depend on the size of the key
(length of the data). As we mentioned before, in our parallel generalized heap, the number of
to-insert items is usually small, which means the size of the key can dominate the sorting efficiency.
The complexity of the parallel bitonic sorting algorithm depends only on the number of input elements, which
makes it more suitable for our case.
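For readers unfamiliar with bitonic sorting, the following sequential sketch shows the compare-exchange network. It is a textbook formulation rather than the kernel used in the implementation, and it assumes the input length is a power of two. The inner loop over `i` corresponds to what one thread per element would do in parallel between two barriers.

```python
def bitonic_sort(keys):
    """Sort a list whose length is a power of two with a bitonic network."""
    a = list(keys)
    n = len(a)
    assert n > 0 and n & (n - 1) == 0, "length must be a power of two"
    k = 2
    while k <= n:              # size of the bitonic sequences being merged
        j = k // 2
        while j >= 1:          # distance between compared elements
            for i in range(n): # on a GPU: one thread per index, then a barrier
                partner = i ^ j
                if partner > i:
                    ascending = (i & k) == 0
                    # Swap when the pair is out of order for its direction.
                    if (a[i] > a[partner]) == ascending:
                        a[i], a[partner] = a[partner], a[i]
            j //= 2
        k *= 2
    return a
```

The number of compare-exchange stages is determined only by the number of elements, and each stage compares whole keys, which matches the argument above that the cost does not grow with the key length the way a digit-by-digit radix sort does.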
4.2 MergeAndSort Operation
In both INS and DEL operations, we perform the MergeAndSort operation frequently during the
heapify process. This can be optimized thanks to the generalized heap property 2, which states that the keys
in a node are already sorted. Instead of sorting the keys that need to be merged from scratch, merging
the two sorted nodes is more time efficient.
In our parallel generalized heap implementation, the number of items in a node is small and
we also load the data into the shared memory. Here we use the GPU merge-path algorithm [4],
which merges two already sorted sequences. The main advantage of the merge-path method is that
it assigns the workload evenly to threads. It has low-latency communication and high-bandwidth
shared memory usage, from which our implementation benefits considerably. A detailed description and
complexity analysis of the GPU merge-path algorithm can be found in [4].
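The idea behind merge path can be sketched sequentially: each worker binary-searches a "diagonal" of the merge grid to find where its slice of the output starts in both inputs, and then merges its slice independently. The sketch below is a simplified illustration, not the implementation from [4]; the function names and the `num_workers` parameter are assumptions, and the worker loop runs serially here whereas on a GPU each worker would be a thread.

```python
def merge_path_split(a, b, diag):
    """Number of elements taken from `a` among the first `diag` merged outputs."""
    lo = max(0, diag - len(b))
    hi = min(diag, len(a))
    while lo < hi:
        mid = (lo + hi) // 2
        if a[mid] <= b[diag - 1 - mid]:
            lo = mid + 1   # a[mid] belongs to the first `diag` outputs
        else:
            hi = mid
    return lo


def merge_path(a, b, num_workers):
    """Merge sorted lists a and b, giving each worker an equal output slice."""
    total = len(a) + len(b)
    out = [None] * total
    for w in range(num_workers):        # independent: parallel on a GPU
        start = w * total // num_workers
        end = (w + 1) * total // num_workers
        i = merge_path_split(a, b, start)
        j = start - i
        for pos in range(start, end):
            if j >= len(b) or (i < len(a) and a[i] <= b[j]):
                out[pos] = a[i]
                i += 1
            else:
                out[pos] = b[j]
                j += 1
    return out


def merge_and_sort(a, b, k, num_workers=4):
    """Return (k smallest keys, remaining keys) of two sorted batches."""
    merged = merge_path(a, b, num_workers)
    return merged[:k], merged[k:]
```

Because every worker produces the same number of output elements regardless of how the keys interleave, the load is balanced even for skewed inputs such as one batch being entirely smaller than the other.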
4.3 Optimizations
To improve the performance of concurrent INS and DEL operations, we apply the following
optimizations.
Remove Redundant MergeAndSort Operations. The MergeAndSort operation is the major overhead of
heap operations. It is used frequently to make sure that the generalized heap satisfies the generalized
heap property. In our implementation, we compare the keys in the nodes and then decide if a
MergeAndSort operation is necessary. When the largest key in one node is smaller than the smallest
key in the other, instead of performing the MergeAndSort operation, we simply swap the two nodes,
which is much more efficient. This optimization reduces the number of MergeAndSort operations
within every insert and delete operation.
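A minimal sketch of this check is given below, assuming each node is a non-empty sorted list. The function name and the boolean "merged" flag are hypothetical, and the comparison uses `<=` in line with the tests in Fig. 10 and Fig. 12.

```python
def merge_and_sort_or_swap(a, b):
    """Return (small, large, merged) for two sorted, non-empty batches.

    `merged` is False when comparing boundary keys was enough and the
    full MergeAndSort could be skipped.
    """
    if a[-1] <= b[0]:
        return a, b, False          # already ordered: nothing to do
    if b[-1] <= a[0]:
        return b, a, False          # fully out of order: swap the nodes
    merged = sorted(a + b)          # stand-in for the parallel MergeAndSort
    return merged[:len(a)], merged[len(a):], True
```

Only two key comparisons are needed to detect both shortcut cases, so the check is cheap relative to a full merge of two batches.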
Early Stop. This optimization is similar to what is done in a conventional heap. The INS and DEL
operations terminate once the generalized heap property is satisfied. For our generalized
heap, Early Stop can happen for all operations except top-down INS, which has to bring the
to-insert items to the target node located at the bottom of the heap. Thus it has to traverse all
levels of the generalized heap.
Bit-Reversal Permutation. The INS operation needs to decide the target node, and two consecutive
INS operations may select two target nodes with the same parent. In this case, the insert paths
from the root node to the target nodes overlap heavily, which can increase the contention
between the two INS operations. We apply the bit-reversal permutation [7], which makes sure that for
any two consecutive INS operations, the two insert paths have no common nodes except the root
node. Consecutive DEL operations also select the last batch in the heap following the bit-reversal
permutation, like the INS operation, but in the reversed order.
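One common way to realize such a permutation is to reverse the bits of a node's offset within its tree level. The sketch below illustrates that idea under stated assumptions: nodes are numbered from 1 as in Fig. 10, the helper names are hypothetical, and the exact bookkeeping in the implementation may differ.

```python
def bit_reversed_target(n):
    """Map the n-th node in insertion order (1-based) to a heap index."""
    level = n.bit_length() - 1          # tree level of node n (root is level 0)
    offset = n - (1 << level)           # position of n within its level
    rev = 0
    for _ in range(level):              # reverse the `level` low bits of offset
        rev = (rev << 1) | (offset & 1)
        offset >>= 1
    return (1 << level) + rev


def path_from_root(node):
    """Heap indices visited from the root down to `node`."""
    path = []
    while node >= 1:
        path.append(node)
        node >>= 1
    return path[::-1]


# Targets chosen for the 4th to 7th inserted batches (level 2 of the heap).
targets = [bit_reversed_target(n) for n in range(4, 8)]
```

Consecutive insertion counts differ in their lowest bit, so after reversal they differ in the highest bit of the offset, which places the two targets in different subtrees of the root.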
5 EVALUATION
5.1 Experiment Setup
We perform our experiments on an NVIDIA TITAN X GPU with an Intel Xeon E5-2620 CPU running at
2.1 GHz. The TITAN X GPU has 28 streaming multi-processors (SMs) with 128
cores each, for a total of 3584 cores. Every thread block has 48 KB of shared memory and 64K available
registers. The maximum number of active threads is 1K per thread block and 2K per SM.
We evaluate our parallel heap from six different perspectives:
• We compare our concurrent heap implementation with a sequential CPU heap and a
previous GPU Heap [5]. We use input workloads with different heap access patterns.
• We vary the number of thread blocks to evaluate the impact of contention
levels and the scalability. The number of thread blocks affects the number of active ins or del
operations.
• We perform sensitivity analysis with respect to heap node capacity K, the type of operation
(ins or del), and thread block size.
• We evaluate how inserting partial batches influences the heap performance by varying
the percentage of partial batch operations.
• We test the concurrent ins and del performance under different heap utilization, which
means the heap is initialized with different numbers of pre-inserted keys.
• We apply our parallel heap to two real-world applications: single source shortest
path (sssp) and the 0/1 knapsack problem.
5.2 Concurrent Heap v.s. Sequential Heap
We use the GPU parallel heap implementation by He, Deo, and Prasad [5] as our GPU baseline. We
refer to this implementation as the parallel synchronous heap or, in short, the P-Sync Heap. We use the C++
STL priority queue library as the sequential CPU heap, which we refer to as the STL Heap. Note
that the INS operation of the P-Sync Heap is top-down, while it is bottom-up for the STL Heap.
We evaluate the performance of inserting 512M keys into an empty heap and then deleting all
these 512M keys from the heap. We use different types of input keys: (1) randomized
32-bit int keys, (2) 32-bit int keys sorted in ascending order, and (3) 32-bit int keys sorted in descending
order. The results are shown in Table 1.
Our concurrent heap has an average 16.59X speedup compared to the STL Heap and an average 2.03X
speedup compared to the P-Sync Heap. We observe the best performance when the input keys are
sorted in ascending order in all cases. For the STL Heap and the BU-INS/TD-DEL Heap, this is because the INS
operations only need to place the insert keys at the target node without traversing the entire heap.
For the P-Sync Heap and the TD-INS/TD-DEL Heap, although INS operations start from the root node, we
still benefit from the keys being sorted in ascending order, as this avoids the overhead of unnecessary
merging operations along the insert path.
Both the TD-INS/TD-DEL Heap and the BU-INS/TD-DEL Heap are faster than the P-Sync Heap. This is because
we support concurrent INS or DEL operations at the same level of the heap, while in the P-Sync Heap only
one INS or DEL can work on a given level, which results in lower inter-node parallelism. In
later experiments, we use randomized 32-bit keys for all INS and DEL performance
evaluations except for the real-world applications.
Table 1. Heap performance on CPU v.s. GPU (time in ms)
Method                 random       descend      ascend
STL Heap               1,959,550    1,898,015    1,214,906
P-Sync Heap            209,648      205,761      201,66
TD-INS/TD-DEL Heap     112,090      100,163      99,082
BU-INS/TD-DEL Heap     104,417      97,593       96,247
Thread block number: 128, thread block size: 512, K: 1024, time unit: millisecond (ms), keys: 512M
[Figure: time (ms, ×10^4) versus thread block number (4, 8, 16, 32, 64, 128) for top-down insertion, bottom-up insertion, and deletion.]
Fig. 13. Heap performance w.r.t. thread block numbers
5.3 Impact of Thread Number
We evaluate the performance of top-down insertion, bottom-up insertion, and deletion
updates respectively by varying the number of thread blocks. The more thread blocks, the more
concurrency we can gain, but also the more contention on the heap. We fix all other parameters,
with thread block size = 512 and batch size = 1024. We test the performance of inserting 512M
random 32-bit keys into an empty heap for the insertion-only experiments, and deleting 512M keys
from a full heap for the deletion-only experiment.⁵ We show the results in Fig. 13.
The performance of both ins and del operations improves when the number of thread blocks
is increased, since more concurrency can be obtained. However, the benefit from concurrency
diminishes as the thread block number keeps increasing, since more thread blocks also mean
more contention on the heap nodes.
The del operation always needs much more time than the ins operations, especially when the
thread block number is large, with an average 2.6X slowdown. This is because the del operation
needs to hold both the parent node and its two child nodes and performs at most two MergeAndSort
operations when updating keys on each level of the heap, while the ins operation needs only one.
When comparing top-down ins with bottom-up ins operations, we see that bottom-up ins always
performs better, since it causes less contention on the root node of the heap and it may not
need to traverse all the nodes on the insert path (the heap property may be satisfied in the middle).
5.4 Impact of Heap Node Capacity
Fig. 14 shows how ins and del performance is influenced by heap node capacity. Due to the
limit on shared memory size per thread block, the maximum batch size we used is 1K. Also, the
maximum thread block size depends on the batch size, since it does not make sense to
have more than one thread handling one key in MergeAndSort operations. We test the performance
by inserting 512M keys into an empty heap and deleting 512M keys from a full heap.
When the thread block size is the same, for both ins and del, we observe that the performance
becomes better when we use a larger node capacity. Using a larger node capacity means that, with
the same number of keys, the depth of the heap is reduced. If the node capacity is doubled, the
number of heap levels is reduced by one, which leads to fewer MergeAndSort operations and tree walks.
Also, a larger node capacity can provide more intra-node parallelism.
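The relation between node capacity and heap depth can be checked with a few lines of arithmetic. The sketch below is only an illustration: the function name `heap_levels` is hypothetical, and it adopts the level convention used in Section 5.5, where an L-level heap holds batchSize * 2^L keys.

```python
import math

def heap_levels(num_keys, node_capacity):
    """Levels of a generalized heap, assuming an L-level heap holds
    node_capacity * 2**L keys (the convention of Section 5.5)."""
    num_nodes = math.ceil(num_keys / node_capacity)
    return math.ceil(math.log2(num_nodes))

KEYS = 512 * 2**20  # 512M keys, as in the experiments
levels = {k: heap_levels(KEYS, k) for k in (128, 256, 512, 1024)}
print(levels)  # {128: 22, 256: 21, 512: 20, 1024: 19}
```

Each doubling of the node capacity removes exactly one level for the same 512M keys.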
Fig. 14 also shows that it is not always good to increase the thread block size, because a large
thread block size can increase the overhead of synchronization within a thread block. Among
all these configurations, we choose the one with thread block size 512 and batch size 1024 for later
experiments, since it has the best performance for both ins and del operations.
⁵ In this case, a full heap is defined as a heap that has 512M keys, regardless of the batch size.
[Figure: three panels, (a) top-down ins operations, (b) bottom-up ins operations, (c) del operations. Each panel plots time (×10^5 ms) against node capacity (128, 256, 512, 1024) for thread block sizes 128, 256, 512, and 1024.]
Fig. 14. Performance of Inserting and Deleting 512M Keys w.r.t. Node Capacity and Thread Block Size
5.5 Impact of Initial Heap Utilization
We control the initial heap size by pre-inserting a certain number of keys; for instance, to achieve
an initial 10-level heap, we need to insert batchSize * 2^10 keys. In this experiment, every thread
performs one ins and one del, which we call an ins-del pair. Since the number of thread blocks is
fixed, and at most that many ins operations can happen at the same time, the heap level also stays
fixed provided that the initial heap level is higher than a certain number. In our experiment, we use
128 thread blocks, and each thread block performs 2K ins-del pairs, for a total of 256K pairs.
In Fig. 15, we show the heap performance with respect to different initial heap sizes, from a 6-level
heap with 64K items to an 18-level heap with 256M items. We observe that when the initial heap
utilization is increased, these ins-del pairs need more time to finish, since both ins and del may
traverse more levels of the heap and perform more MergeAndSort operations. Operations on the BU-Heap
perform better, since bottom-up ins still has the benefit of stopping tree traversals earlier.
[Figure: time (×10^4 ms) versus initial heap level (6 to 18) for GPU TD-Heap and GPU BU-Heap.]
Fig. 15. Performance w.r.t. Initial Heap Size
[Figure: time (×10^4 ms) versus percentage of full batch insertion (0% to 100%) for GPU TD-Heap and GPU BU-Heap.]
Fig. 16. Performance w.r.t. Partial Buffer
5.6 Impact of Partial Buffer and Partial Batch Insertion
We evaluate how partial batch updates influence the heap performance. We test the performance
by inserting 512M items into an empty heap. We control the percentage of full batch insertions and
let the remaining insertions be randomly generated partial batches. The results are shown in Fig. 16.
As we can see, the performance becomes better as the percentage of full batch insertions increases.
This is potentially because, with partial batches, more threads are needed to insert the same number
of keys, which also causes more contention on the lock that protects the root node, since inserting a
partial batch always requires locking the root node in both the BU-INS/TD-DEL heap and the
TD-INS/TD-DEL heap. The implication is that, although partial batches are supported in the heap
implementation, it would
be good to avoid partial batch insertions in real workloads when the total number of inserted
keys is the same, since the overall performance difference can be up to 4X.
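One way to see why partial batches cost more is to count how many insert operations are needed for a fixed number of keys. The sketch below is a sequential illustration only. The function name `count_insert_ops` is hypothetical, and drawing partial batch sizes uniformly at random is an assumption; the text only states that partial batches are randomly generated.

```python
import random

def count_insert_ops(total_keys, batch_size, full_fraction, seed=0):
    """Count insert operations needed to insert total_keys keys when a given
    fraction of operations carry a full batch and the rest a partial one."""
    rng = random.Random(seed)
    remaining, ops = total_keys, 0
    while remaining > 0:
        if rng.random() < full_fraction:
            size = batch_size
        else:
            # Assumed: partial batch sizes are uniform in [1, batch_size - 1].
            size = rng.randint(1, batch_size - 1)
        remaining -= min(size, remaining)
        ops += 1
    return ops

all_full = count_insert_ops(2**20, 1024, full_fraction=1.0)
all_partial = count_insert_ops(2**20, 1024, full_fraction=0.0)
```

With only full batches, `all_full` is exactly 2^20 / 1024 = 1024 operations. With only partial batches the count is strictly larger, and each of those extra operations also needs the root lock, which is consistent with the trend in Fig. 16.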
5.7 Concurrent Heap with Real World Applications
We apply our concurrent generalized heap to two real-world applications: the single source shortest
path (SSSP) algorithm and the 0/1 knapsack problem. Both applications can take advantage of
our concurrent heap by processing items with higher priority first. The purpose of this section is to
shed light on the potential of incorporating our concurrent heap with many-core accelerators to
solve real-world applications. Further optimization of our concurrent heap with application-based
asynchronous updates for insertion and deletion is possible, but we leave it as future work.
5.7.1 SSSP with Concurrent Heap. Gunrock [15] is a well-known parallel iterative graph processing
library on the GPU. It applies a compute-advance model to applications like SSSP:
at each iteration, nodes are classified as active or inactive by checking whether
their new distance updates the existing distance, after which only active nodes are
explored in the next iteration, since inactive nodes will not bring updates to the final result.
Our implementation of the parallel SSSP algorithm is similar to Gunrock's. At each iteration,
we use our heap to store the active nodes with their current distance as the key. In this way,
the nodes with the shortest distance are explored first in the next iteration. As a result, our
implementation tends to reduce the overhead of unnecessary updates and the number of active
nodes being explored.
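The effect of distance-ordered exploration on the number of node visits can be illustrated with a small sequential sketch. It is not the GPU implementation and does not use Gunrock's API: `sssp_fifo` stands in for an unordered frontier, `sssp_heap` uses Python's `heapq` as the priority queue, and the four-node graph is made up for the example.

```python
import heapq
from collections import deque

def sssp_fifo(graph, source):
    """Frontier-based relaxation that explores active nodes in FIFO order."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    frontier, visits = deque([source]), 0
    while frontier:
        u = frontier.popleft()
        visits += 1
        for v, w in graph[u]:
            if dist[u] + w < dist[v]:
                dist[v] = dist[u] + w
                frontier.append(v)  # v becomes active again
    return dist, visits

def sssp_heap(graph, source):
    """Same relaxation, but active nodes are explored by shortest distance."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    heap, visits = [(0, source)], 0
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue  # stale entry: a shorter distance was found already
        visits += 1
        for v, w in graph[u]:
            if d + w < dist[v]:
                dist[v] = d + w
                heapq.heappush(heap, (d + w, v))
    return dist, visits

# Hypothetical graph: node -> list of (neighbor, edge weight).
graph = {0: [(1, 10), (2, 1)], 1: [(3, 1)], 2: [(1, 1)], 3: []}
fifo_dist, fifo_visits = sssp_fifo(graph, 0)
heap_dist, heap_visits = sssp_heap(graph, 0)
```

Both variants compute the same distances, but on this graph the FIFO frontier visits 6 nodes while the heap-ordered one visits 4, because nodes 1 and 3 are no longer explored with a distance that is later overwritten.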
We use Gunrock [15] as the baseline for comparison, and we set a threshold N such that we
incorporate our concurrent heap into the algorithm only when the number of active nodes is larger
than N. We use N=10K in our experiments. We choose 14 different real-world graphs and
describe the properties of these graphs in Table 2.
Table 2. Graph Information
Graph Name          # Nodes     # Edges       Type of Graph
AS365               3,799,275   22,736,152    2D FE triangular meshes
bundle adj          513,351     20,721,402    Bundle adjustment problem
coPapersDBLP        540,486     30,491,458    DIMACS10 set
delaunay n22        4,194,304   25,165,738    DIMACS10 set
hollywood-2009      1,139,905   115,031,232   Graph of movie actors
Hook 1498           1,498,023   62,415,468    3D mechanical problem
kron g500 logn20    1,048,576   89,240,544    DIMACS10 set
Stanford Berkeley   685,230     7,600,595     Berkeley-Stanford web graph
Long Coup dt0       1,470,152   88,559,144    Coupled consolidation problem
M6                  3,501,776   21,003,872    2D FE triangular meshes
NLR                 4,163,763   24,975,952    2D FE triangular meshes
rgg n 2 20 s0       1,048,576   13,783,240    Undirected Random Graph
Serena              1,391,349   65,923,050    Structural Problem
We show the results of parallel SSSP in Table 3. For all the graphs we tested, we obtain an average
overall speedup of 1.13X with the threshold N = 10K compared to the baseline. The heap-based
SSSP does not perform well on the graph Stanford Berkeley since it is a small graph, which means that
the number of active nodes at each level is not large enough for the improvement brought by the
heap to cover the overhead of its own operations.
In Table 3, we also list the time in milliseconds for SSSP computation and heap operations
separately. The computation time is the SSSP computation time, which includes processing times
Table 3. Parallel Single Source Shortest Path Performance
                    Baseline                Heap Based SSSP w/ N=10K
Graphs              Compute     # Nodes     Heap        Compute     Total       # Nodes     Speedup
                    Time (ms)   Visited     Time (ms)   Time (ms)   Time (ms)   Visited
AS365               654.44      19,664,769  193.43      422.12      615.55      11,843,368  1.06
bundle adj          144.54      903,097     11          126.48      137.48      877,675     1.05
coPapersDBLP        46.13       981,876     12.52       25.46       37.98       710,794     1.21
delaunay n22        1125.93     29,832,633  283.04      647.61      930.65      18,607,590  1.21
hollywood-2009      100.17      2,007,447   14.22       74.35       88.58       1,370,459   1.13
Hook 1498           233.76      2,786,271   31.39       182.52      213.91      1,756,776   1.09
kron g500 logn20    117.79      2,590,570   28.72       73.1        101.82      860,552     1.16
Long Coup dt0       190.06      2,699,927   43.46       116.77      160.23      1,571,565   1.19
Stanford Berkeley   55.3        530,294     5.17        52.54       57.71       462,860     0.96
M6                  677.95      20,972,903  161.88      472.67      634.56      16,126,697  1.07
NLR                 894.71      29,583,224  318.35      439.61      757.96      16,123,803  1.18
rgg n 2 20 s0       920.27      7,112,685   46.44       701.78      748.22      5,871,411   1.23
Serena              124.16      2,594,858   27.98       84.94       112.93      1,498,836   1.1
for node expanding, edge filtering, and distance updating. The heap time is the time spent on the
heap operations. The number of nodes visited represents the total number of times that nodes
are explored during SSSP computation. With the incorporation of our heap, the number of node
visits is reduced remarkably compared to the baseline, which directly leads to the reduced SSSP
computation time.
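The relation between the columns of Table 3 can be spelled out with a short calculation: the total time is the heap time plus the compute time, and the speedup is the baseline compute time divided by that total. The sketch below recomputes three rows copied from Table 3; the helper name `total_and_speedup` is hypothetical.

```python
# (baseline compute ms, heap ms, heap-based compute ms), copied from Table 3.
rows = {
    "AS365": (654.44, 193.43, 422.12),
    "coPapersDBLP": (46.13, 12.52, 25.46),
    "Stanford Berkeley": (55.3, 5.17, 52.54),
}

def total_and_speedup(baseline_ms, heap_ms, compute_ms):
    """Return (total time, speedup over the baseline), rounded to 2 digits."""
    total_ms = heap_ms + compute_ms
    return round(total_ms, 2), round(baseline_ms / total_ms, 2)

results = {name: total_and_speedup(*row) for name, row in rows.items()}
print(results["Stanford Berkeley"])  # (57.71, 0.96)
```

For Stanford Berkeley the compute time drops only from 55.3 ms to 52.54 ms, so the 5.17 ms of heap time outweighs the saving and the speedup falls below 1.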
5.7.2 0/1 Knapsack with Concurrent Heap. The knapsack problem appears in real-world decision-making
processes in a wide variety of fields. It is defined as follows: given weights and benefits for
some items and a knapsack with a limited capacity W, determine the maximum total benefit that can be
obtained in the knapsack. The 0/1 knapsack problem is a variant of the knapsack problem where
one must either put the complete item in the knapsack or not pick it at all.
Branch and bound is an algorithm design paradigm that is usually used for solving combinatorial
optimization problems such as the 0/1 knapsack problem. The solution to the 0/1 knapsack
problem can be expressed as a path in a binary decision tree, where each level in the tree represents
the decision to either pick or not pick an item. Thus, with n items, there are 2^n possible solutions.
Instead of blindly checking every possible solution for the maximum benefit under a certain capacity,
we can prune the search space by comparing the bound of a node (the best possible benefit we could
gain if we choose this node) with the current maximum benefit to see if the node is worth exploring
further.
A simple sequential implementation of such an algorithm is to enqueue to-explore nodes into a
priority queue with their current benefits as the key values and to always explore the nodes
with the largest benefit first. On one hand, it is a greedy approach that still gives the optimal
solution, even though it might encounter several local optima before reaching the global optimum.
On the other hand, if nodes with larger benefit are explored first, we can skip exploring certain
nodes whose bound is smaller than the current maximum benefit. We implement a parallel GPU knapsack
algorithm based on the sequential version with our concurrent heap to show that the incorporation
of our concurrent heap accelerates knapsack computation.
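The sequential scheme described above can be sketched as follows. This is an illustration under stated assumptions rather than the implementation evaluated here: the bound is taken to be the fractional (greedy) relaxation, which the text does not specify, `heapq` stands in for the priority queue, and the names `fractional_bound` and `knapsack_best_first` are hypothetical.

```python
import heapq

def fractional_bound(items, capacity, level, weight, benefit):
    """Optimistic benefit: fill the remaining room greedily, allowing a
    fraction of the first item that does not fit (an assumed bound)."""
    room = capacity - weight
    bound = benefit
    for w, b in items[level:]:
        if w <= room:
            room -= w
            bound += b
        else:
            bound += b * room / w
            break
    return bound

def knapsack_best_first(items, capacity):
    """items: list of (weight, benefit) pairs with positive weights."""
    # Sort by benefit density so the fractional bound is a valid upper bound.
    items = sorted(items, key=lambda it: it[1] / it[0], reverse=True)
    best, explored = 0, 0
    heap = [(0, 0, 0)]  # (-benefit, level, weight); negated for a max-heap
    while heap:
        neg_benefit, level, weight = heapq.heappop(heap)
        benefit = -neg_benefit
        if level == len(items):
            continue  # leaf: its benefit was recorded when it was pushed
        if fractional_bound(items, capacity, level, weight, benefit) <= best:
            continue  # pruned: cannot beat the current maximum benefit
        explored += 1
        w, b = items[level]
        if weight + w <= capacity:  # branch 1: pick the item
            best = max(best, benefit + b)
            heapq.heappush(heap, (-(benefit + b), level + 1, weight + w))
        heapq.heappush(heap, (-benefit, level + 1, weight))  # branch 2: skip it
    return best, explored

best, explored = knapsack_best_first([(10, 60), (20, 100), (30, 120)], 50)
```

For this three-item instance the maximum benefit is 220 (the second and third items). A parallel version would pop several nodes per step instead of one, which is where a concurrent heap comes in.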
Since parallelizing node exploration might result in unnecessary growth in heap size, we also
apply a technique that filters invalid nodes when the heap size is larger than a certain threshold,
before the algorithm continues to explore more nodes. We name this version knapsack with
garbage collection (GC).
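A minimal sketch of such a filtering pass is shown below. It assumes that a node is invalid once its bound no longer exceeds the current maximum benefit, and that each heap entry stores its bound; the entry layout, the name `collect_garbage`, and the use of `heapq` are all hypothetical choices for the illustration.

```python
import heapq

def collect_garbage(heap, best_benefit, threshold):
    """Drop nodes whose bound cannot beat the best benefit found so far.

    Heap entries are assumed to be (-benefit, bound, state) tuples. The
    filter only runs once the heap has grown past `threshold` entries."""
    if len(heap) <= threshold:
        return heap
    kept = [entry for entry in heap if entry[1] > best_benefit]
    heapq.heapify(kept)  # restore the heap order after filtering
    return kept

heap = [(-5, 9, "a"), (-3, 4, "b"), (-1, 12, "c")]
heapq.heapify(heap)
small_heap = collect_garbage(heap, best_benefit=8, threshold=2)
untouched = collect_garbage(heap, best_benefit=8, threshold=10)
```

Here node "b" is removed because its bound of 4 cannot beat the best benefit of 8, while the call with the larger threshold leaves the heap as it is.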
Table 4. Datasets for 0/1 Knapsack Problem
Dataset            Type                           Size   Range
ks sc 700 18k      Strongly Correlated            700    18000
ks sc 800 18k      Strongly Correlated            800    18000
ks sc 200 7k       Strongly Correlated            200    7000
ks asc 750 16k     Almost Strongly Correlated     750    16000
ks asc 1300 6k     Almost Strongly Correlated     1300   6000
ks asc 500 7k      Almost Strongly Correlated     500    7000
ks esc 900 18k     Even-odd Strongly Correlated   900    18000
ks esc 1200 13k    Even-odd Strongly Correlated   1200   13000
ks esc 400 8k      Even-odd Strongly Correlated   400    8000
ks ss 100 18k      Subset Sum                     100    18000
ks ss 1250 12k     Subset Sum                     1250   12000
ks ss 1300 14k     Subset Sum                     1300   14000
Table 5. 0/1 Knapsack Problem with heap
                   CPU w/ PriorityQueue   GPU w/ concurrent heap          GPU w/ concurrent heap and GC
Dataset            Time (ms)  # Nodes     Time (ms)  # Nodes     SpeedUp  Time (ms)  # Nodes     SpeedUp
                              Explored               Explored                        Explored
ks sc 700 18k      825.70     782802      670.57     813089      1.23     595.58     810717      1.39
ks sc 800 18k      977.49     923255      757.06     955374      1.29     708.02     956514      1.38
ks sc 200 7k       202.40     243106      199.76     249373      1.01     205.27     249935      0.99
ks asc 750 16k     757.17     709267      722.90     389249      1.05     566.17     445231      1.34
ks asc 1300 6k     5239.97    4934552     5118.03    2832737     1.02     4115.55    2404241     1.27
ks asc 500 7k      502.37     475402      549.10     296824      0.91     499.03     295951      1.01
ks esc 900 18k     1128.4     1080182     848.65     1123920     1.33     796.58     1124880     1.42
ks esc 1200 13k    2013.06    1950260     2066.64    2002002     0.97     1357.75    1770747     1.48
ks esc 400 8k      355.25     399185      346.53     418925      1.03     348.52     421504      1.02
ks ss 100 18k      42.38      54278       3.55       55          3.48     11.92      55          12.19
ks ss 1250 12k     20.27      23886       4.30       94          4.12     4.72       94          4.92
ks ss 1300 14k     25.02      25305       19.64      9452        1.27     18.01      8528        1.39
In [9], S. Martello et al. defined and tested several types of knapsack problem instances.
Using the same data generation tool, we generated 12 knapsack datasets to demonstrate the potential
of our concurrent heap. We describe the properties of these datasets in Table 4.
We compare the running time in milliseconds and the number of explored nodes between the
sequential and GPU knapsack in Table 5. For all the datasets we tested, we obtain an average overall
speedup of 2.31X for GPU knapsack and 2.48X for GPU knapsack with garbage collection.
We nd that our GPU knapsack algorithms with concurrent heap performs particularly well
with Subset-sum(ss) instances, with a maximum speedup up to 12.19X compared to sequential
version. Also, the number of nodes explored for GPU knapsack is signicantly smaller than that of
sequential knapsack. Because of the greedy property of the branch and bound algorithm, it does
not guarantee the path its exploring will lead to a global optimal solution, it is possible, especially
with the Subset-sum instances where the benet of an item is equal to the weight of it. On the other
hand, parallelizing the branch and bound algorithm with our concurrent heap allows it to solve for
a large amount of potential solutions that are prioritized by their current benet simultaneously,
which can lead to a faster convergence of the global optimal solution.
Theoretically, parallelizing node exploration in the branch and bound knapsack problem may cause
an exponential growth in the queue size, since it performs parallel explorations of nodes in a binary
decision tree. However, in our experiments, we find that the GPU knapsack sometimes
results in less node exploration: while nodes with higher benefit are explored earlier than
other nodes, there are cases where the current maximum benefit converges fast enough that the
nodes with lower priority quickly become invalid for exploration, since their bounds become smaller
than the maximum benefit. This leads to a reduction in exploring time and correspondingly an increase
in overall performance.
6 RELATED WORK
CPU Parallel Heap Algorithms. The most popular CPU approach [1, 7, 11, 12, 14] to gaining
parallelism for a parallel heap is to support concurrent insert and/or delete operations. Rao and
Kumar [11] proposed a scheme that uses multiple mutual exclusion locks to protect each node in the
heap. They also performed insertions in a top-down manner to avoid deadlock, whereas insertions
in the conventional heap proceed in a bottom-up manner. Ayani [1] proposed the LR-algorithm, which
is an extension of Rao and Kumar's method that scatters the accesses of different operations to reduce
contention. Hunt et al. [7] presented a lock-based heap algorithm that supports insertion
and deletion in opposite directions. Deo and Prasad [3] increased the node capacity. However,
their algorithm does not support concurrent insertions/deletions.
None of these parallel heap algorithms on CPUs can be applied to GPUs directly because of the
unique SIMT execution model employed by modern GPUs. For parallel algorithms on GPUs,
optimizations for thread divergence, memory coalescing, and synchronization need to be taken into
account.
GPU Parallel Heap Algorithms. Parallel heap algorithms are less studied on GPUs. He et al. [5]
introduced a parallel heap algorithm for many-core architectures like GPUs. Their algorithm
exploits the parallelism of the parallel heap by increasing the node capacity, like the idea in
[3], and by pipelining the insert and delete operations. However, their approach does not exploit the
parallelism of concurrent operations at the same level of the heap. It also needs a global barrier
to synchronize all threads after the insert or delete updates at every level, which incurs
heavy extra overhead.
7 CONCLUSION
This work proposes a concurrent heap implementation that is friendly to many-core accelerators.
We develop a generalized heap and support both intra-node and inter-node parallelism. We also
prove that our two heap implementations are linearizable. We evaluate our concurrent heap
thoroughly and show a maximum 19.49X speedup compared to the sequential CPU implementation
and a 2.11X speedup compared with the existing GPU implementation [5].
ACKNOWLEDGMENTS
This material is based upon work supported by the National Science Foundation under Grant
No. nnnnnnn and Grant No. mmmmmmm. Any opinions, findings, and conclusions or recommendations
expressed in this material are those of the author and do not necessarily reflect the views
of the National Science Foundation.
REFERENCES
[1] R. Ayani. 1990. LR-algorithm: concurrent operations on priority queues. In Proceedings of the Second IEEE Symposium
on Parallel and Distributed Processing 1990. IEEE, Piscataway, NJ, USA, 22–25. https://doi.org/10.1109/SPDP.1990.143500
[2] NVIDIA Corporation. 2010. NVIDIA's Next Generation CUDA Compute Architecture: Kepler GK110:
The Fastest, Most Efficient HPC Architecture Ever Built.
https://www.nvidia.com/content/PDF/kepler/NVIDIA-Kepler-GK110-Architecture-Whitepaper.pdf
[3] Narsingh Deo and Sushil Prasad. 1992. Parallel heap: An optimal parallel priority queue. The Journal of Supercomputing
6, 1 (1992), 87–98.
[4] Oded Green, Robert McColl, and David A Bader. 2012. GPU merge path: a GPU merging algorithm. In Proceedings of
the 26th ACM international conference on Supercomputing. ACM, New York, NY, USA, 331–340.
[5] X. He, D. Agarwal, and S. K. Prasad. 2012. Design and implementation of a parallel priority queue on many-core
architectures. In 2012 19th International Conference on High Performance Computing. IEEE, Piscataway, NJ, USA, 1–10.
https://doi.org/10.1109/HiPC.2012.6507490
[6] Maurice P Herlihy and Jeannette M Wing. 1990. Linearizability: A correctness condition for concurrent objects. ACM
Transactions on Programming Languages and Systems (TOPLAS) 12, 3 (1990), 463–492.
[7] Galen C Hunt, Maged M Michael, Srinivasan Parthasarathy, and Michael L Scott. 1996. An efficient algorithm for
concurrent priority queue heaps. Inform. Process. Lett. 60, 3 (1996), 151–157.
[8] S. Kumar, D. Kim, M. Smelyanskiy, Y. Chen, J. Chhugani, C. J. Hughes, C. Kim, V. W. Lee, and A. D. Nguyen. 2008.
Atomic Vector Operations on Chip Multiprocessors. In 2008 International Symposium on Computer Architecture. IEEE,
Piscataway, NJ, USA, 441–452. https://doi.org/10.1109/ISCA.2008.38
[9] Silvano Martello, David Pisinger, and Paolo Toth. 1999. Dynamic programming and strong bounds for the 0-1 knapsack
problem. Management Science 45, 3 (1999), 414–424.
[10] N. Moscovici, N. Cohen, and E. Petrank. 2017. A GPU-Friendly Skiplist Algorithm. In 2017 26th International Conference
on Parallel Architectures and Compilation Techniques (PACT). IEEE, Piscataway, NJ, USA, 246–259.
https://doi.org/10.1109/PACT.2017.13
[11] V. Nageshwara Rao and Vipin Kumar. 1988. Concurrent access of priority queues. IEEE Trans. Comput. 37, 12 (1988),
1657–1665.
[12] Sushil K Prasad and Sagar I Sawant. 1995. Parallel heap: A practical priority queue for fine-to-medium-grained
applications on small multiprocessors. In Parallel and Distributed Processing, 1995. Proceedings. Seventh IEEE Symposium
on. IEEE, Piscataway, NJ, USA, 328–335.
[13] Nir Shavit and Gadi Taubenfeld. 2016. The computability of relaxed data structures: queues and stacks as examples.
Distributed Computing 29, 5 (2016), 395–407.
[14] Nir Shavit and Asaph Zemach. 1999. Scalable concurrent priority queue algorithms. In Proceedings of the eighteenth
annual ACM symposium on Principles of distributed computing. ACM, New York, NY, USA, 113–122.
[15] Yangzihao Wang, Andrew Davidson, Yuechao Pan, Yuduo Wu, Andy Riffel, and John D. Owens. 2016. Gunrock:
A High-performance Graph Processing Library on the GPU. In PPOPP. SIGPLAN Not. 51, 8, Article 11, 12 pages.
https://doi.org/10.1145/3016078.2851145
[16] Henry Wong, Misel-Myrto Papadopoulou, Maryam Sadooghi-Alvandi, and Andreas Moshovos. 2010. Demystifying
GPU microarchitecture through microbenchmarking. In 2010 IEEE International Symposium on Performance Analysis of
Systems & Software (ISPASS). IEEE, Piscataway, NJ, USA, 235–246.
[17] Eddy Z. Zhang, Yunlian Jiang, Ziyu Guo, Kai Tian, and Xipeng Shen. 2011. On-the-fly Elimination of Dynamic
Irregularities for GPU Computing. SIGPLAN Not. 46, 3 (March 2011), 369–380. https://doi.org/10.1145/1961296.1950408
