Priority queue, o en implemented as a heap, is an abstract data type that has been used in many well-known applications like Dijkstra's shortest path algorithm, Prim's minimum spanning tree, Hu man encoding,and the branch-and-bound algorithm. However, it is challenging to exploit the parallelism of the heap on GPUs since the control divergence and memory irregularity must be taken into account. In this paper, we present a parallel generalized heap model that works e ectively on GPUs. We also prove the linearizability of our generalized heap model which enables us to reason about the expected results. We evaluate our concurrent heap thoroughly and show a maximum 19.49X speedup compared to the sequential CPU implementation and 2.11X speedup compared with the existing GPU implementation [5] . We also apply our heap to single source shortest path with up to 1.23X speedup and 0/1 knapsack problem with up to 12.19X speedup.
INTRODUCTION
A priority queue is an abstract data type which assigns each data element a priority and an element of high priority is always served before an element of low priority. A priority queue is dynamically maintained, allowing a mix of insertion and deletion updates. Well-known applications of priority queue include Dijkstra's shortest path algorithm, Prim's minimum spanning tree, Hu man encoding, and the branch-and-bound algorithm that solves many combinatorial optimization problems.
Understanding how to accelerate priority queue on many-core architecture has profound impacts. A comprehensive study will not only shed light on the performance bene ts/limitation of running the priority queue itself on accelerator architecture but also pave the road for future work that parallelizes a large class of applications which build on priority queues.
Priority queue is o en implemented as heap. In this paper, we focus on heap. Heap is a fundamental abstract data type, but has not been extensively studied for its acceleration on GPUs. A heap is a tree data structure. Using min-heap as an example, every node in the binary tree has a key that is smaller than or equal to that of its parent. ere are two basic operations for heapinsertion updates and deletion updates. e deletion update always returns the minimal key. e insertion update inserts a key to the right location in the tree. Both operations allow logarithmic complexity and leave the binary tree in a balanced state. An example of min-heap is shown in Fig.  1 . However, it is non-trivial to exploit parallelism of the binary heap, and reason about the correctness given a parallel implementation. ere are two key challenges that prevent us from fully taking advantage of the massive parallelism in many-core processors. Each update operation of the binary heap involves a tree walk. Di erent tree walks exhibit di erent control ow paths and the memory locality might be low. e control divergence and memory irregularity are two main performance hazards for GPU computing [17] . e two performance hazards need to be tackled before we can e ciently accelerate binary heap on GPUs. Moreover, as the parallel design gets complicated, it is not easy to reason about the correctness properties of the concrete implementation.
Existing work for parallelizing heap neither provide correctness guarantee nor take the GPU performance hazards into consideration. Among the research for parallelizing heap on CPUs, the work by Rao and Kumar [11] avoids locking the heap in its entirety, associates each node with a lock, and makes insertion updates top-down, such that the locking order of nodes prevents deadlock. Hunt et al. [7] adopts the same ne grained locking mechanism, but makes insertion updates bo om up while maintaining a top-down lock order.
e implementation by Hunt et al. alleviates the contention at the root node. However, neither of them formally reasons about the correctness of their implementation. Neither tackles the control divergence problem caused by random tree walks as it was not a problem for CPUs at the time when both works were published. e closest related work to ours is by He and others [5] which is a GPU implementation of the binary heap. It is based on the idea presented by Deo and Prasad [3] in 1992, which exploits the parallelism by increasing the node capacity in the heap. One node may contain k keys (k ≥ 1). However, while it exploits intra-node parallelism, inter-node parallelism is not well exploited. It divides the heap into even and odd levels and uses barrier synchronization to make sure operations on two types of levels are never processed at the same time. It assumes all insertion/deletion updates progress at the same rate. Moreover, between every two consecutive barrier synchronization points, only one insertion or deletion request can be accepted, which severely limits the e ciency of its implementation on GPUs. Our implementation is shown to be much faster than the work by He et al. [5] (in Section 5) .
In this work, we present a design of concurrent heap that is well-suited for many-core accelerators. Although our idea is implemented and evaluated on GPUs, it applies to other general purpose accelerator architecture with vector processing units. Further, we not only show that our design outperforms sequential CPU implementation and existing GPU implementation, but also prove that our concurrent heap is linearizable. Speci cally, our contribution is summarized as follows: 1. We develop a generalized heap model. In our model, each node of the heap may contain multiple keys.
is similar to the work by Deo and Prasad [3] . However, there are two key di erences. First, assuming k is the node capacity, Deo and Prasad [3] only allow inserting/deleting exactly k keys, while it is not uncommon that an application inserts/deletes less than or more than k keys. We added support for partial insertion and deletion in our generalized model. Second, we exploit both intra-node parallelism and inter-node parallelism, the la er of which is not fully explored by Deo and Prasad [3] or He et al. [5] . Note that the bene t of having multiple keys in one node is multi-fold. It allows for intra-node parallelism, memory parallelism, local caching, and can alleviate control divergence since in a tree walk k keys in the same node move along the same path. 2. We prove the linearizability of our implementation. We propose two types of heap implementations and prove both are linearizable. Linearizability is a strong correctness condition. A history of concurrent invocation and response events is linearizable if and only if some (valid) reordering of events yield a legal sequential history. We provide a model for describing the concurrent and sequential histories and for inserting linearization points. As far as we know, existing heap implementations on CPU [3, 7, 11] or GPU [5] do not have a formal reasoning about their correctness or linearizability condition. 3. We perform a comprehensive evaluation of the concurrent heap. We thoroughly evaluate our heap implementation and provide an enhanced understanding of the interplay between heap parameters and execution e ciency. We perform sensitivity analysis for heap node capacity, partial operation percentage, concurrent thread number, and initial heap utilization. We explore the di erence between insertion and deletion performance. We also evaluate our implementation on real workloads, while most previous work use synthetic traces [3, 5, 7, 11] , as far as we know. We show that performance improvement could be up to 19.49 times compared with sequential CPU implementation, 2.11 times compared with the existing GPU implementation. We improve the single source shortest path algorithm by up to 123% and improve the performance of 0/1 knapsack by up to 1219%, which demonstrates the great potential of applying priority queue on GPU accelerators.
BACKGROUND

Heap Data Structure
A heap data structure can be viewed as a binary tree and each node of the binary tree stores a key. Without loss of generation, we use the min-heap to describe our idea throughout the paper. e minimal key is stored at the root node. A node's key is smaller than or equal to parent's key. A heap is maintained using two basic operations: insert and delete-min.
During the insertion process, a key is inserted to an appropriate location such that the heap property is maintained. In a bo om-up insertion process, it places the key in the rst empty leaf node, then repeats the following steps: compare a node's key with its parent node's key, if smaller, then swap itself with the parent node, otherwise, terminate. e bo om-up insertion algorithm is shown in Fig. 2 (a) .
else swap( p, c ) 6. endIf 7.
endWhile
Fig. 2. Insert and Delete in a sequential Heap
A delete-min procedure returns the minimal key in the heap. It removes the key at the root node and starts a "heapify" process to restore the min-heap property. To heapify, it moves the last leaf node to the root, and repeat the following steps: (1) compare the node p's le child and right child (if there is any), (2) return the smaller of the two as c, and (3) if p's key is larger than c's key, swap the node c with the node p, otherwise, terminate. e algorithm is shown in Fig. 2 (b) .
GPU Architecture
GPU is a type of many-core accelerator. It employs the single instruction multiple thread (SIMT) execution model. In order to take advantage of GPU, two fundamental factors need to be taken into consideration [8, 10] : control divergence and memory locality.
In SIMT model, threads are organized into groups, each of which executes in lock-step manner. Each group is called a warp. e threads in the same warp can only be issued one instruction at one time. If threads in the same warp need to execute di erent instructions, the execution will be serialized. is is called control divergence.
During execution, the data must be fetched for all threads at each instruction. e warp cannot execute until the data operands of all its threads are ready. Memory parallelism needs to be exploited since data in physical memory is organized into large contiguous blocks. Data is fetched in the unit of memory blocks. If one memory references involves non-contiguous data access in multiple blocks, the warp needs to fetch multiple blocks. If one memory reference involves contiguous data access in memory, it will reduce the number of memory blocks that need to be fetched.
e SIMT model provides a limited set of synchronization primitives. Barrier synchronization is allowed for threads within CTA. A GPU kernel does not complete until all its threads have completed, which can be used as an implicit barrier among all threads. Although there is no provided lock intrinsics on GPUs, the atomic compare and swap (CAS) function is provided ,and can be used to implement synchronization primitives.
Linearizability
In concurrent programming, linearizability [6] is a strong condition which constrains the possible output of a set of interleaved operations. It is also a safety property that enables us to reason about the expected results from a concurrent system [13] . e execution of these operations results in a history, an ordered sequence of invocation and response events. e invocation refers to the event when an operation starts. e response refers to the event when operation completes.
A sequential history is the one that an invocation is always followed by a matching response, and a response event followed by another invocation. Alternatively, since in a sequential history operations do not overlap, we can consider as if an invocation and its matching response happen at the same time, and an operation takes immediate e ect. ere is no real concurrency in a sequential history. However, a sequential history is easy to reason about. We say that the history H as an ordered list of invocation and response events { e 1 , e 2 , ..., e k } is linearizable if there exists a re-ordering of the events such that (1) a correct sequential history can be generated, and (2) if the response of an operation e i precedes the invocation of another operation e j , res(e i ) < in (e j ), then the operation e i precedes the operation e j in the reordered events. Typically linearization point is used to denote the time when an operation takes immediate e ect between the invocation and response of one operation. Finding the right linearization points to construct a correct sequential history naturally meets the condition of (2).
CONCURRENT HEAP DESIGN
We exploit the parallelism of heap operations by allowing concurrent insert operations, INS, and delete operations, DEL, on di erent tree nodes. In the meantime, each node in the binary tree is extended to contain a batch of keys instead of only one. We refer to this proposed heap as the generalized heap throughout this paper. Our heap is well-suited for acceleration on GPUs.
Parallelism exists within a batch of keys and the control divergence is reduce because all keys in the same batch move along the same path in the tree during tree traversals.
Generalized Heap
Each node in the generalized heap contains k keys. 1 Since the number of keys in the heap is not always a multiple of k and an insertion or deletion may not be exactly k keys, we use a partial bu er implementation. e partial bu er contains no more than k − 1 keys . All the keys in the partial bu er should be larger than or equal to the keys in the root node so as to make sure the smallest k keys are in the root node. We denote the k keys in one heap node as a batch.
Like conventional heap, a er each INS and DEL update on the generalized heap, the heap property needs to be preserved. Here, we formally de ne the generalized heap property: Property 1. Given any node c in the generalized heap and its parent node p, the smallest key in c is always larger than or equal to the largest key in p:
Property 2. Given any node c in the generalized heap, the keys in c are sorted in ascending order:
Property 3. Given the partial bu er b with size s, all the keys are sorted in ascending order and are larger than or equal to those in the root node r :
Note that the heap property for the conventional heap is a special case of the generalized heap property with k = 1. When the batch of each node contains only one key, the generalized heap property 1 and 2 are still satis ed. e generalized heap property 3 does not apply since the partial bu er contains at most k − 1 keys so there is no partial bu er in the conventional heap.
e most space e cient way to represent a generalized heap is using the array. Each entry of the array represents a single key and consecutive k entries represent a node. us, the generalized heap can be represented as a linear array while the rst k entries are from the root node and the next k entries are from the second node and so on. erefore, array entries in the range of [i * k, (i + 1) * k − 1] are from the i-th node in the generalized heap. An array representation allows an implicit binary tree representation of the generalized heap. Fig. 3 shows an example of the generalized heap in both array representation and binary tree representation. e partial bu er is stored separately. 3.2.1 DEL Operation. e DEL operation on the generalized heap retrieves the k keys from the root node. Since the root node is le empty, the generalized heap needs to be heapi ed to satisfy the generalized heap property. e pseudo code is shown in Fig. 4(c) .
e DEL operation re lls the root node with the keys from the last leaf node of the heap (line 1 -5). Note that we will ll the last node with a MAX value to make sure the old keys in that node are covered. en, we propagate the new values in root node down. During the propagation, DEL operation will perform the MergeAndSort operation on two child nodes l and r (line 9). Here we formally de ne the MergeAndSort operation between two batch of keys a and b:
MergeAndSort operation returns two batches hi and lo with size k. hi stores the k smallest keys and lo stores the k largest keys.
e DEL operation places the lo part back to the child node whose maximum key was larger (compared with the other child) before (line 9). In this example, let's suppose it is the child node r . It can be proved that with such a placement policy, the generalized heap property on the sub-heap of r will be maintained. e hi part is placed into the child node l. en, another MergeAndSort operation is applied to the current node and the child l (line 11). e k smallest of the merged result will stay in the current node, and the k largest will propagate through sub-heap of l. e propagation ends until the generalized heap property is satis ed (line 10). An example of the DEL operation is shown in Fig. 5. 3.2.2 INS Operation. e INS operation inserts new keys into the generalized heap. 2 It grows the heap by adding a new node to the rst empty node in the heap, we call this location the target node. Given the target node, a path from the root node to the target node can be found and we call it the insert path. ere are two possible directions for the propagation of the INS operation, which leads to two di erent types of INS operation: 1 top-down INS, which starts from the root node and propagates down until it reaches the target node; 2 bo om-up INS, which starts at the target node and propagates up until it reaches the root node or when the generalized heap property is satis ed in the middle of the heap.
Top-down INS operation. e top-down INS operation starts at the root node and propagates down to the bo om level of the heap. e propagation of the top-down INS operation follows the insert path, the MergeAndSort operation is performed between the new insert items and each node on the insert path until it reaches the target node. An example of the top-down INS operation is shown in Fig. 6 and the pseudo code is provided in Fig. 4 (a). Bo om-up INS operation. e bo om-up INS operation inserts new keys from the bo om of the heap to the root batch of the heap and still follows the corresponding insert path. e pseudo code is shown in Fig. 4 (b). We rst move the to-insert new keys to the target node (line 3). Since the generalized heap property may be violated, the MergeAndSort operation is performed between the nodes on the insert path and their parent nodes. e propagation keeps going along the insert path until it reaches the root batch or the generalized heap property is satis ed in the middle (line 5). An example of bo om-up INS is provided in Fig. 7 . 1:8 Yanhao Chen, Fei Hua, Chaozhang Huang, Jeremy Bierema, Chi Zhang, and Eddy Z. Zhang
Discussion. Compared to top-down INS operation, bo om-up INS operation may not need to traverse all the nodes on the insert path since the generalized heap property may be satis ed in the middle. Moreover, when concurrent INS and DEL operations are performed on the generalized heap, bo om-up INS operation can reduce the contention on the top levels of the heap. However, the bo om-up INS operation needs to pay a ention to the potential deadlock problem caused by the opposite propagation directions of INS and DEL operations. We will discuss more about these concurrent INS and DEL operations in the following sections.
Concurrent Heap
In this section, we describe how DEL and INS operations can be performed concurrently on our parallel generalized heap. Our algorithms are inspired by the methods discussed in [11] and [7] which introduced concurrent INS and DEL operations on a heap with k = 1, with top-down INS and bo om-up INS operations respectively. In this paper, we call the concurrent heap with Top-Down INS operation the TD-INS/TD-DEL Heap and the one with Bo om-Up INS as BU-INS/TD-DEL Heap.
Lock Order for INS and DEL operations.
In [11] and [7] , to support concurrent INS and DEL operations while ensuring correctness and avoiding deadlocks, a simple lock strategy is applied. Instead of locking the whole heap, each node of the heap is assigned a lock and only a small portion of the nodes are locked at one time. Our rst implementation adopts this method. In Fig. 8 , we show how we handle the locking order for both INS and DEL operations.
e partial bu er is handled when the root node is locked, so that both the root node and the partial bu er is protected by the same lock. For all other nodes, each node has only one lock.
Lock(N1)
handle_partial_buffer ( 
bottom-up insertion
Lock(N1)
handle_partial_buffer ( e top-down INS operation starts at the root node N 1 and propagates along the insert path N 1 , N 2 , ..., N k . It will do the heapify work of N i when it has locked N i . A er it nishes its work, it will lock N i+1 before it unlocks N i which follows a parent-child locking order. Similarly, the bo om-up INS operation also follows the parent-child locking order. When the bo om-up INS is at N i , it will release the lock on N i rst, locks N i−1 next and then locks N i . Note that, in this case, the bo om-up INS operation does not lock any node a er it releases N i . e DEL operation needs to lock more nodes during its propagation. When it is at N i and N i is locked, it then locks its two child nodes and do the heapify work. A er the work is done, it unlocks the child node that is already heapi ed and then N i . In this way, both INS and DEL operations follow the parent-child order so that no deadlock could happen.
Each node of the heap is associated with a multi-state lock. is multi-state lock has multiple states which can indicate the status of each node. e multi-state lock for top-down INS and bo om-up INS operations have di erent states. We describe the di erence in the following sections.
We implement the multi-state lock using atomicCAS. Atomic operations are well optimized on GPUs [2] which makes it a straightforward choice to implement the multi-state lock.
TD-INS/TD-DEL Heap.
Our TD-INS/TD-DEL heap implements top-down insertions and top-down deletions, using the locking order as described in Fig. 8 . We let the multi-state lock have four di erent states: AVAIL indicates that the node is available; INUSE means that the node is acquired by another operation; To lock a node, the state of that node changes from AVAIL to INUSE. A node with state TARGET represents that the node is the target node of a insert operation; e state of a node is changed to MAKRED only when the target node is needed by a delete operation for insertion and deletion cooperation. Finite State Automata is shown in Fig.  9 .
e INS and DEL operations can cooperate to speedup the propagation [11] . When the DEL operation needs to ll the root node, if there is an INS operation that is being in progress. e DEL operation does not need to wait until the last leaf node to be ready (if it is not ready and if it is the target node of an in-progress insertion), it can ll the root node with the keys from the INS operation.
e DEL operation changes the state of the last node from TARGET to MARKED. to let the INS operation know that a DEL operation is asking for the insert keys. When the INS operation nds that the state of the target node is MAKRED, it moves the insert keys to the root node and terminate.
e DEL operation can then continue and use those keys in the root node for propagation. To handle partial batch insertion, we acquire the partial bu er at the time when we hold the root node. Since only one operation can work on the same node, this can make sure that no two operations can work on the partial bu er at the same time.
en we apply the MergeAndSort operation between the insert keys and the partial bu er. We check if the partial bu er have enough space to contain those new keys. If so, we perform another MergeAndSort operation between the partial bu er and the root node to satisfy the generalized heap property 3. If not, we obtain the k smallest keys from the MergeAndSort result as a full batch and propagate it down through the root node, while leaving the rest keys in the partial bu er.
For DEL operation, we only consider deleting the items from the partial bu er when the total number of keys is less than a full batch, in another word, all heap nodes are empty. is is because based on the generalized heap property 3, the root node always have the smallest keys in the heap. Although allowing partial batch insertion will cause extra overhead, however, the inserted keys in the partial bu er do not need to propagate into the heap immediately until the partial bu er is over own. In this way, we still gain the bene t of memory locality and the intra-node parallelism.
We show the pseudo code of INS and DEL operations on the TD-INS/TD-DEL Heap in Fig. 10 . In order to reason about the linearizability, we need to de ne our notations. An ins or del operation takes a certain amount of time to complete. We denote the time an operation is invoked as the invocation time, the time when an operation is completed as the response time. A history includes an ordered list of invocation and response events (ordered with respect to time).
Our TD-Ins/TD-Del implementation uses ne-grained locks that each node is associated with a lock. We denote the time a thread acquires the lock of a node as acquire time and the time a thread release the lock of a node as release time.
We denote an operation with a 4-tuple followed by two parameters op[s, ac, re, t](x)T . e symbol op is the type of the operation, ins or del; s is the invocation time, t is the response time; ac refers to acquire time of a node of interest; re refers to release of the same node; x is the parameter of the operation, if the operation is insertion, it means insert x into the heap, if the operation is deletion, it means x is returned; T refers to the thread id. Note that both ac and re are within the time interval s and t, and that ac < re.
To prove linearizability, we need to show that for any given history H, we can nd a correct sequential history S based on a valid reordering of invocation and response events in H. Here the term "valid reordering" refers to the case when there are two operations e 0 and e 1 in H, if the response time of e 0 is before the invocation time of e 1 , e 0 will proceed e 1 in the sequential history.
To prove such a sequential history exist, we prove the following lemma rst. L 3.1. No two threads can work on the same heap node simultaneously.
P . According to our implementation, if a thread T has acquired the node B which means the T has changed B's state to INUSE, then no other thread can acquire B until T releases it.
We denote a history H with n operations as
Here acR and reR respectively refer to the acquire and release of the lock for the root node in the heap. In our notation, the history H is an ordered list such that its operations are ordered with respect to the time the root node is released. Since each operation in TD-INS/TD-DEL heap needs to acquire the root at its rst step in g. 10, and based on Lemma 3.1, only one thread can successfully acquire root node at one time. P . We show that we can construct a sequential history S given any H. To construct the sequential history, we rst construct a list of linearization points { N i -i = 1 to n } such that acR i < N i < reR i . Simply put, N i is an arbitrary time point between every pair of events that acquire and release the root node. An example of se ing linearization point is shown in Fig. 11 .
An operation appear to occur instantaneously at its linearization point. A linearization point has to be between the invocation time and response time for an operation, since acquiring and releasing root is a step within every update in TD-Ins/TD-Del heap, the N i time points we set here is naturally between invocation and response time.
Next we construct a sequential history as
We set a one-to-one correspondence between the i-th operation
For all insert operations, we set x S i = x H i , which means we insert the same key items for the corresponding operation in S. Next we prove that x S i = x H i for every delete update, which means every delete operation in S returns the same value as its corresponding delete operation in H.
Assume that when the m-th operation in H releases root node at time reR m , the heap value is Heap H m , its set of keys are denoted as set(Heap H m ), and its root node is denoted as root(Heap H m ). Similarly, at the time N m of the m-th operation in S, assume the heap value is Heap S m , its set of keys are denoted as set(Heap S m ), and its root node is denoted as root(Heap S m ). We prove two properties:
) and root(Heap S i ) = min(set(Heap S i )), 1 ≤ i ≤ m We prove properties L1 and L2 by induction. Initially we have H 0 which is an empty history, and a heap value Heap H 0 . We set S 0 empty and set Heap S 0 to be the same as Heap H 0 . Properties L1 and L2 are satis ed for the initial heap.
Assume at the time point reR k in H and at the time point N k in S, properties L1 and L2 hold. We just need to prove for the time point reR k+1 in H and the time point N k +1 in S, properties L1 and L2 hold as well. ere are two cases.
Case I -the (k+1)-th operation in H is an insertion:
In the sequential history, since we also set (k+1)-th operation as insert update at time N k +1 , the op-
, where
. In sequential history, the insertion incurs as if instantaneously, thus set(
)). Properties L1 and L2 hold.
In our implementation, the deletion removes the root which is min(set(Heap H k )) and re-heapi es the heap. It does not release root until root node is updated to the smallest of the remaining items, thus at the time point
In the sequential history, we set a delete operation at the time point N k +1 , as if the delete update happens instantaneously.
en the delete update returns min(set(Heap S k )) which is the same as min(set(Heap H k ))), and in the meantime, root node will be updated to min(set(Heap S k ) − min(set(Heap S k ))) which is also the same as the root of Heap H k+1 as described above. us properties L1 and L2 still hold.
us we have successfully constructd a sequential history S from any given history H. erefore, the TD-INS/TD-DEL heap is linearizable. [7] proposed a mechanism that does bo om-up INS operations while solving the potential deadlocks from opposite propagation direction. It allows the insert thread temporarily releases the control of the insert items and a tag is used to store the pid of the thread that modi es the insert items. Here we use a similar method but does not need to store the pid.
e multi-state lock used in BU-INS/TD-DEL Heap have four di erent states. INUSE and AVAIL are the same as the ones used in TD-INS/TD-DEL Heap. Since the insert operation may temporarily release the control of its node, so it uses the state to tell whether the node has been modi ed by the time it releases the node. It changes the state of the node from INUSE to INSHOLD when it releases the node. When the insert operation acquire the node again, if it nds that the state is no longer INSHOLD, this means one or more delete operations have modi ed this batch and the insert operation can skip this batch since the delete operation makes sure the sub-heap has satis ed the generalized heap property. On the contrary, the delete operation which acquires the node from INSHOLD to INUSE will change the state to DELMOD when releasing the node.
We show the FSA of the BU-INS/TD-DEL Heap in Fig. 9 . e pseudo codes of the concurrent insert and delete operations in bo om-up manner are shown in Fig. 12 .
When an insert operation acquires the temporarily released node, there are three possible cases for the new state with that node:
(1) INSHOLD: the insert operation holds the batch successfully and the MergeAndSort operation can be performed with the parent batch. (2) DELMOD: the batch has been modi ed by one or more delete operations, the insert operation can move to the parent batch. (3) INUSE: some other operations are using this batch e state of the parent node may also be changed. If the state is not AVAIL or INUSE, this means a delete operation has already deleted the parent node and the insert operation can terminate.
Partial bu er is handled at the beginning of each operation when it locks the root node. For both INS and DEL operations, the part for handling the partial bu er is exactly the same as the one we showed in Section 3.3.2 so as to make sure generalized heap property 3 is satis ed.
Linearizability of BU-ins/TD-del Heap. Now we show that the BU-INS/TD-DEL Heap
is linearizable. Note that we will use the same notations that we have de ned previously in Section 3.
We use the notation ins
for DEL operation j. Here acL and reL respectively refer to the acquire and release for the last locked node in a bo om-up INS operation. is last locked node may or may not be the root node, since the generalized heap property may be satis ed in the middle of a bo om-up I N S update. We denote a history H with n operations as H = {op H i (s i , ac i , re i , t i )x H i T H i |1 ≤ i ≤ n} while ac i and re i can be either acL i and reL i for INS, or acR i and reR i (R is for root 4 ) for DEL. e operations in H are ordered with respect to the time re i ( either when INS release the last locked node or when DEL release the root node ). It is possible that the time re u and re are the same for two operations op u and op . In this case, an arbitrary order can be chosen. us for two operations op H u (s u , ac u , re u , t u ) x H u T H u , and op H (s , ac , re , t ) x H T H , we have u < when re u < re or re u == re . P . We show that we can construct a sequential history S given any H. We construct a list of linearization points { N i -i = 1 to n } such that acR i < N i < reR i if the i-th operation is DEL, or acL i < N i < reL i if the i-th operation is INS. We pick N i as an arbitrary time point between the Procedure bottom_up_insert ( insItems, insSize ) 1 . insItems = sort(insItems) 2.
MS_LOCK(1, AVAIL, INUSE) 3.
if (pBuffer.size + insSize >= K) then 4.
(insItems, pBuffer) = MergeAndSort(insItems, pBuffer) 5. else 6.
(pBuffer, insItems) = MergeAndSort(insItems, pBuffer) 7.
if (lastelem != 0) then 8.
( We will prove that x S j = x H j if the j-th operation is DEL. We prove by induction. Initially we have an empty history H 0 and the heap value Heap H 0 . We set S 0 empty and set Heap S 0 to be the same as Heap H 0 . At the beginning, we perform a (dummy) deletion in H and also a (dummy) deletion in S, both DEL return the same result. e dummy deletion in H completes before any real operation starts. e heap value for S and for H a er the dummy deletion will be be the same.
Assume that ), then we prove that all matching delete operations in S and H return the same value by induction.
In the concurrent history H , between time point reR k and reR k+m , there are m − 1 concurrent operations and these operations are all insert operations. Among all these m -1 insert operations, we let I H be a set of insert operations such that
We let the set I be the insert operations from these m − 1 operations but not in I . e di erence between I and I is that all I operations complete before op H k +m , while I operations might overlap with op H k +m . If we consider the inserted keys contributed by I' as X I = i ∈I x H i . e set of keys in the heap would be X I ∪ Heap H k if none of the operations in I has taken e ect, the minimum would be min(X I ∪ Heap H k ).
). According to Lemma 3.3, for any insertion i in I , since its (acL, reL) interval overlaps with the root acquiring and releasing interval for del H k +m , we know that the last node locked by operation i cannot be the root, and thus the inserted value x S i , i ∈ I cannot be smaller than the root node. at is
) is proved. e implication is that regardless if any operation or all operations in the set I complete, op H k +m will always return min(
). In the sequential history S, there are m − 1 insert operations between time point N k and N k +m . e heap should include the set of keys that are in Heap S k and also
Since set(Heap 
IMPLEMENTATION
e building blocks of the generalized heap include the sorting operation, the MergeAndSort operation, and the multi-state lock that we introduced in Section 3.3.1. We use parallel bitonic sorting algorithm for local sorting operation within a thread block and merge-path [4] algorithm for the MergeAndSort operation. We introduce optimization to eliminate redundant MergeAndSort operations and enable an early stop mechanism to reduce the total computation load and alleviate the contention on the locks.
In our implementation, threads in one thread block work together for one INS and DEL operation. We choose thread-block-level operation since barrier synchronization is provided within a thread block while no built-in inter-CTA synchronization is provided and the overhead for synchronization between all thread blocks is high.
Besides, threads within the same thread block have access to the same shared memory space, which can increase data reuse during propagation. We load frequently used items into the shared memory. Using thread-block-level INS and DEL operations can also bene t from memory coalescing.
e items in the same node are placed continuously in the memory so that threads within the warp can achieve maximum memory coalescing. Also, the multi-state lock is a thread-block-level lock which is safer than a thread-level lock since a thread-level lock may cause deadlock due to desynchronization within a warp [16] .
Sorting Operation
e INS operations sorts the to-insert items before the propagation starts. To perform sorting, these to-insert items are loaded to the shared memory for e cient data access and movement. In our generalized heap implementation, the number of to-insert items for one insert operation is limited by the size of the shared memory per thread block (no more than 1K pairs in our case). We choose parallel bitonic sorting algorithm as it can be adopted for our thread-block-level operations well.
Bitonic sorting is a comparison-based sorting algorithm. Other e cient non-comparison based GPU sorting algorithms (e.g. parallel radix sort) require types to have the same lexicographical order as integers. is not only limits the practical use of the sorting algorithms to only numeric types like int or f loat but also the sorting complexity of which is based on the size of the key (length of the data). As we mentioned before, in our parallel generalized heap, the number of the to-insert items is usually small, which means the size of the key can dominant the sorting e ciency. Parallel bitonic sorting algorithm's complexity depends on the number of input elements which makes it more suitable for our case.
MergeAndSort Operation
In both INS and DEL operations, we perform the MergeAndSort operations frequently during the heapify process. is can be optimized thanks to the generalized heap property 2 that the keys in a node are already sorted. Instead of directly sorting the keys that need to be merged, performing a MergeAndSort operation on those sorted nodes will be more time e cient.
In our parallel generalized heap implementation, the number of items in a node is small and we also load the data into the shared memory. Here we use the GPU merge-path algorithm [4] , which merges two already sorted sequences. e main advantage of the merge-path method is that it can assign workload evenly to threads. It has low-latency communication and high-bandwidth shared memory usage which our implementation can bene t from a lot. Detailed description and complexity analysis of the GPU merge-path algorithm can be found in [4] .
Optimizations
To improve the performance of concurrent INS and DEL operations, we apply the following optimizations.
Remove Redundant MergeAndSort Operations. MergeAndSort operation is the major overhead of heap operations. It is used frequently to make sure that the generalized heap satis es generalized heap property. In our implementation, we compare the keys in the nodes and then decide if a MergeAndSort operation is necessary. When the largest keys in a node is smaller than the smallest key in the other, instead of performing the MergeAndSort operation, we simply swap the two nodes which is much more e cient. is optimization reduces the number of MergeAndSort operations within every insert and delete operation.
Early Stop. is optimization is similar to what we do in a conventional heap. e INS and DEL operations will terminate once the generalized heap property is satis ed. For our generalized heap, Early Stop can happen for all operations except for top-down INS which has to bring the to-insert items to the target node that locates at the bo om of the heap. us it has to traverse all levels of the generalized heap.
Bit-Reversal Permutation. e INS operation needs to decide the target node and two consecutive INS operations may select the two target nodes with the same parent. In this case, the insert path from the root node to the target nodes are highly overlapped which can increase the contention between the two INS operations. We apply the bit-reversal permutation [7] that makes sure for any two consecutive INS operations, the two insert paths have no common nodes except the root node. Consecutive DEL operation also select the last batch in the heap following the bit-reversal permutation like INS operation, but in the reversed order.
EVALUATION
Experiment Setup
We perform our experiments on an NVIDIA TITAN X GPU with an Intel Xeon E5-2620 CPU with 2.1GHz working frequency. e TITAN X GPU has 28 streaming multi-processors (SMs) with 128 cores, for a total of 3584 cores. Every thread block has 48 KB of shared memory and 64K available registers. e maximal number of active threads is 1K per thread block and 2K per SM.
We evaluate our parallel heap from six di erent perspectives:
• We compare our concurrent heap implementation with a sequential CPU heap and a previous GPU Heap [5] . We use input workloads with di erent heap access pa erns.
• We vary the number of the number of thread blocks to evaluate the impact of contention levels and the scalabiltiy. e number of threads a ects the number of active ins or del operations.
• We perform sensitivity analysis with respect to heap node capacity K, the type of operation ins or del, and thread block size.
• We evaluate how inserting partial batches would in uence the heap performance by varying the percentage of partial batch operations.
• We test the concurrent ins and del performance under di erent heap utilization which means the heap is initialized with di erent number of pre-inserted keys.
• We apply our parallel heap to two real world applications which are single source shortest path (sssp) and 0/1 knapsack problem.
Concurrent Heap v.s. Sequential Heap
We use the GPU parallel heap implementation by He, Deo, and Prasad [5] as our GPU baseline. We refer to this implementation as parallel synchronous heap or in short, P-Sync Heap. We use the C++ STL priority queue library as the sequential CPU heap which we refer to as the STL Heap. Note that INS operation of the P-Sync Heap is top-down, while it is bo om-up for STL Heap.
We evaluate the performance of inserting 512M keys into an empty heap and then deleting all these 512M keys from the heap. We use di erent types of input keys which are 1 randomized 32-bit int keys 2 32-bit int keys sorted in ascending order 3 32-bit int keys sorted in descending order. e results are shown in Table 1 .
Our concurrent heap has an average 16.59X speedup compared to the STL Heap and 2.03X speedup compared to P-Sync Heap. We observe the best performance when the input keys are sorted in ascending order in all cases. For STL Heap and BU-INS/TD-DEL Heap, it is because the INS operations only need to place the insert keys at the target node without traversing the entire heap. For P-Sync Heap and TD-INS/TD-DEL Heap, although INS operations start from the root node, we still gain the bene t of the keys sorted in ascending order as it avoids the overhead of unnecessary merging operations along the insert path.
Both TD-INS/TD-DEL Heap and BU-INS/TD-DEL Heap are faster than P-Sync Heap. is is because we can support concurrent INS or DEL at the same level of the heap while for P-Sync Heap, only one INS or DEL can work on the same level which exhibits with a lower inter-node parallelism. In later experiments, we will use randomized 32-bit keys for all INS and DEL operations performance evaluation except for the real world applications. 
Impact of Thread Number
We evaluate the performance of top-down insertion update, bo om-up insertion update, and deletion update respectively by varying the number of thread blocks. e more thread blocks, the more concurrency we can gain, and also the more contention on the heap. We x all other parameters, with thread block size = 512 and batch size = 1024. We test the performance of inserting 512M random 32-bit keys into an empty heap for insertion-only experiments, and deleting 512M keys from a full heap for deletion-only experiemnt. 5 We show the results in Fig. 13 . e performance of both ins and del operations become be er when the number of thread blocks is increased since more concurrency can be obtained. However, the bene t from concurrency is restricted when the thread block number keeps increasing since more thread blocks also means more contention on the heap nodes.
e del operation always needs much more time than the ins operations especially when the thread block number is large which with an average 2.6X slow down. is is because del operation needs to hold both parent node and its two child nodes and performing at most two MergeAndSort operations when updating keys on each level of the heap while ins operation needs only one.
When comparing top-down ins with bo om-up ins operations. We see that bo om-up ins always has a be er performance since it causes less contention on the root node of the heap and it may not need to traverse all the nodes on the insert path (the heap property may be satis ed in the middle).
Impact of Heap Node Capacity
Fig. 14 shows how ins and del performance is in uenced by heap node capacity. Due to the limits in shared memory size per thread block, the maximum batch size we used is 1K. Also, the maximum number of thread block size depends on the batch size, since it does not make sense to have more than one thread handling one key in MergeAndSort operations. We test the performance by inserting 512M keys to an empty heap and deleting 512M keys from a full heap.
When thread block size is the same, for both ins and del, we can observe that the performance becomes be er when we use a larger node capacity. Using a larger node capacity means that with the same number of keys, the depth of the heap is reduced. If the node capacity is doubled then the level of the heap is reduced by one which leads to fewer MergeAndSort operations and tree walks. Also a larger node capacity can provide more intra-node parallelism.
In Fig. 14 , it also shows that it is not always good to increase the thread block size because large thread block size can increase the overhead of synchronization within a thread block. Among all these con gurations, we choose one with thread block size 512 and batch size 1024 for later experiments since it has the best performance for both ins and del operations. 
Impact of Initial Heap Utilization
We control the initial heap size by pre-inserting a certain number of keys, for instance, to achieve a initial 10-level heap, we need to insert batchSize * 2 10 keys. In this experiment, every thread performs one ins and one del, which we call an ins-del pair. Since the number of thread blocks is xed and at most such number of ins could happens at the same time, so the heap level is also xed only if the initial heap level is higher than a certain number. In our experiment, we use 128 thread blocks and each thread block will perform 2K ins-del pairs with a total 256K pairs.
In Fig. 15 , we show the heap performance with respect to di erent initial heap size from a 6-level heap with 64K items to a 18-level heap with 256M items. We can observe that when the initial heap utilization is increased, these ins-del pairs need more time to nish. Both ins and del may traverse more levels of the heap and perform more MergeAndSort operations. Operations on BU-Heap have a be er performance since bo om-up ins still has the bene t of stopping tree traversals earlier. 
Impact of Partial Bu er and Partial Batch Insertion
We evaluate how partial batch updates in uence the heap performance. We test the performance by inserting 512M items into an empty heap. We control the percentage of full batch insertions and let rest insertions be randomly generated partial batches. e results are shown in Fig. 16 . As we can see, with the increase in the percentage of full batch insertions, the performance of becomes be er. is is potentially because more threads are needed to insert the same number of keys, and it also cause more contention on the lock that protects the root node since inserting a partial batch will always require to lock the root node in both BU-INS/TD-DEL heap and TD-INS/TD-DEL heap. e implication is that, although partial batch is supported in the heap implementation, it would be good to avoid using partial batch insertions in real workloads if the total number of inserted keys is the same, since the overall performance di erence could be up to 4X.
Concurrent Heap with Real World Applications
We apply our concurrent generalized heap to two real world applications: the single source shortest path (SSSP) algorithm and the 0/1 knapsack problem. Both applications can take the advantage of our concurrent heap by processing items with higher priority rst. e purpose of this section is to shed light on the potential of incorporating our concurrent heap with many-core accelerators to solve real world applications. Further optimization to our concurrent heap with application-based asynchronous updates for insertion and deletion is possible, but we will leave it as our future work.
5.7.1 SSSP with Concurrent Heap. Gunrock [15] is a well known parallel iterative graph processing library on the GPU. It applies a compute-advance model to solve for applications like the SSSP such that at each iteration, nodes are classi ed as active nodes and inactive nodes by checking if their new distance bring an update to existing distance, a er which only active nodes would be explored in the next iteration since inactive nodes will not bring updates to the nal result.
Our implementation of the parallel SSSP algorithm is similar to Gunrock's. At each iteration, we use our heap to store those active nodes with their current distance as the key. In this way, those nodes with the shortest distance would be explored rst in the next iteration. As a result, our implementation tends to reduce the overhead of unnecessary updates and the number of active nodes being explored.
We use gunrock [15] as the baseline for comparison and we set a threshold N such that only when the number of active nodes is larger than N , will we incorporate the algorithm with our concurrent heap. We use N=10K in our experiments. We choose 14 di erent real world graphs and describe the properties of these graphs in Table 2 . We show the result of parallel SSSP in Table 3 . For all the graphs we tested, we have an average of 1.13X overall speedup with the threshold N = 10K compared to the baseline. e heap based sssp does not perform well on graph Stanford Berkeley since it is a small graph, which means that the number of active nodes at each level is not large enough for the improvement brought by the heap to cover the overhead of it's own operations.
In Table 3 , we also list the time in milliseconds for sssp computation and heap operations separately. e computation time is the SSSP computation time, which includes processing times for node expanding, edge ltering and distance updating. e heap time is the time spent on the heap operations. e number of nodes visited represents the total number of times that nodes being explored during SSSP computation. With incorporation of our heap, the number of node visits is reduced remarkably compared to the baseline, which directly leads to the reduced sssp computation time.
0/1 Knapsack with Concurrent
Heap. e knapsack problem appears in real-world decisionmaking processes in a wide variety of elds. It de nes as follows: given weights and bene ts for some items and a knapsack with a limited capacity W, determine the maximum total bene t can be obtained in the knapsack. e 0/1 knapsack problem is a branch of the knapsack problem where one must either put the complete item in the knapsack or don't pick it at all. Branch and bound is an algorithm design paradigm, which is usually used for solving combinatorial optimization problems such as the 0/1 knapsack problem. e solution to the 0/1 knapsack problem can be expressed as a path in a binary decision tree where each level in the tree represents we either pick or do not pick an item. us, with n items, there are 2 n possible solutions. Instead of blindly checking for every possible solution for the maximum bene t under certain capacity, we can prune the search space by comparing the bound (the best possible bene t we could gain if we choose this node) of a node with the current maximum bene t to see if it is worth continue exploring.
A simple sequential implementation of such algorithm is to enqueue to-explore nodes into a priority queue with their current bene ts as the key values and always choose to explore nodes with largest bene t rst. On one hand, it is a greedy approach that would give optimal solution despite that it might encounter several local optimal before get to the global optima. On the other hand, if nodes with larger bene t are explored rst, we can skip exploring certain nodes with a bound that is smaller than the max current bene t. We implement a parallel GPU knapsack algorithm based on the sequential version with our concurrent heap to show that the incorporation of our concurrent heap accelerates knapsack computation.
Since parallelizing node exploration might result in unnecessary growth in heap size, we also apply a technique which lters invalid nodes when heap size is larger than a certain threshold before the algorithm continues to explore more nodes. We name this version as knapsack with garbage collection (GC). In [9] , S. Martello et al. de ned and tested with several types of instances of knapsack problems. Using the same data generation tool, we generated 12 knapsack datasets to demonstrate the potential of our concurrent heap. We describe the properties of these datasets in Table 4 .
We compared the running time in milliseconds and the number of explored nodes between sequential and GPU knapsack in Table 5 . For all the datasets we tested we obtain an average overall speedup of 2.31X for GPU knapsack and 2.48X for GPU knapsack with garbage collection.
We nd that our GPU knapsack algorithms with concurrent heap performs particularly well with Subset-sum(ss) instances, with a maximum speedup up to 12.19X compared to sequential version. Also, the number of nodes explored for GPU knapsack is signi cantly smaller than that of sequential knapsack. Because of the greedy property of the branch and bound algorithm, it does not guarantee the path its exploring will lead to a global optimal solution, it is possible, especially with the Subset-sum instances where the bene t of an item is equal to the weight of it. On the other hand, parallelizing the branch and bound algorithm with our concurrent heap allows it to solve for a large amount of potential solutions that are prioritized by their current bene t simultaneously, which can lead to a faster convergence of the global optimal solution. eoretically, parallelizing node exploration in branch and bound knapsack problem may cause an exponential growth in the queue size since it performs parallel explorations for nodes in a binary decision tree. However, according to our experiments, we nd that the GPU knapsack sometimes results in less node exploration because while nodes with higher bene t are explored earlier than other nodes, there are chances where the current max bene t converges fast enough so that the nodes with lower priority quickly become invalid for exploration since it's bound become smaller than the max bene t, which leads to a reduction in exploring time and correspondingly an increase of overall performance.
6 RELATED WORK CPU Parallel Heap Algorithms e most popular CPU approach [1, 7, 11, 12, 14] to gain parallelism for parallel heap is by supporting concurrent insert or/and delete operations. Rao and Kumar [11] proposed a scheme that used multiple mutual exclusion locks to protect each node in the heap. ey also proceeded insertions in the top-down manner to avoid deadlock while insertions for the conventional heap follow a bo om-up manner. Rassul [1] proposed LR-algorithm which was an extension to Rao and Kumar's method that sca er the accesses of di erent operations to reduce the contention. Hunt and others [7] present a lock-based heap algorithm that supports insertion and deletion in an opposite directions. Deo and Prasad [3] increased the node capacity. However, their algorithm does not support concurrent insertions/deletions.
All these parallel heap algorithms on CPUs cannot be applied to GPUs directly because of the unique SIMT execution model employed by moder GPUs. For parallel algorithms on GPUs, the optimization for thread divergence, memory coalescing and synchronization need to be taken into account. GPU Parallel Heap Algorithms Parallel Heap algorithms are less studied on GPUs. He [5] introduced a parallel heap algorithms for many-fore architectures like GPUs.
eir algorithm exploited the parallelism of the parallel heap by increasing the node capacity, like the idea in [3] , and pipelining the insert and delete operations. However, their approach did not exploit the parallelism for concurrent operations at the same level of the heap. Also they need a global barrier to synchronize all threads a er the insert or delete updates at every levels which brought extra heavy overhead.
CONCLUSION
is work proposes a concurrent heap implementation that is friendly to many-core accelerators. We develop a generalized heap and support both intra-node and inter-node parallelism. We also prove that our two heap implementations are linearizable. We evaluate our concurrent heap thoroughly and show a maximum 19.49X speedup compared to the sequential CPU implementation and 2.11X speedup compared with the existing GPU implementation [5] .
