Dynamic parallelism allows GPU kernels to launch additional kernels at runtime directly from the GPU. In this paper we show that dynamic parallelism enables relatively simple high-performance graph algorithms for GPUs. We present breadth-first search (BFS) and single-source shortest paths (SSSP) algorithms that use dynamic parallelism to adapt to the irregular and data-driven nature of these problems. Our approach results in simple code that closely follows the high-level description of the algorithms but yields performance competitive with the current state of the art.
INTRODUCTION
Although graphics processing units (GPUs) are best known for their impressive floating-point performance, their high memory bandwidth means GPUs can also accelerate memory-bound applications such as graph algorithms. The irregular nature of graph algorithms, however, makes it difficult to effectively utilize parallel hardware. NVIDIA's Kepler-class and newer GPUs support dynamic parallelism (DP), which allows GPU kernels to launch additional kernels from the GPU in order to accommodate parallelism discovered at runtime without involving the CPU. To date, DP has been largely under-utilized in the literature. Instead, efficient GPU graph algorithms are often implemented using multiple kernels, with each specialized for certain degrees or forms of parallelism (e.g., [4, 6]). There is a disadvantage to needing so much specialization, however, as it increases the complexity of the algorithm.
Using breadth-first search (BFS) as an illustrative example, we show that DP is well-suited to graph algorithms, as the resulting implementation of the BFS algorithm closely follows the abstract algorithm definition. The basic algorithm does not perform as well as the current state of the art, but by applying a small set of modular improvements (which expose additional aspects and advantages of DP), we show that we can improve our algorithm's performance to be competitive with the current state of the art.
Specifically, our paper makes the following contributions:
• We demonstrate the applicability of DP to graph algorithms through a novel implementation of breadth-first search, which is significantly simpler in its expression than existing GPU graph algorithms (Section 3).
• We identify performance bottlenecks in our naive DP implementation and show that only minor modifications are necessary to overcome these bottlenecks (Section 3.1).
• We show that DP is generally applicable to graph algorithms through an implementation of single-source shortest paths (SSSP) (Section 4).
BACKGROUND

Graph 500
The Graph 500 benchmark [7] is the leading graph benchmark in HPC. It measures the performance of breadth-first search, and an SSSP kernel is in development. Graph 500 utilizes a performance measure of Traversed Edges Per Second (TEPS), which is calculated as follows:

TEPS = m / t,

where m is the number of edges in the component traversed by the search and t is the time taken by the search. The performance of benchmark implementations is usually reported in GTEPS, where 1 GTEPS = 10^9 TEPS.
CUDA and Dynamic Parallelism
CUDA programs feature a number of kernels that run on the GPU with a large number of threads. These threads are divided into a grid of blocks, and within a block threads are further grouped into warps. Before dynamic parallelism was introduced, CUDA required kernels to be launched by the host program, which can lead to underutilizing the GPU in inherently dynamic algorithms like graph problems. Dynamic parallelism allows the launch of child kernels directly from within a kernel [1]. For our purposes, this allows a kernel to spawn a new kernel to process a vertex's adjacent vertices without incurring a round trip to the CPU.
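As a minimal illustration of the mechanism (a sketch, not code from our implementation; the CSR arrays and kernel names are ours for exposition), a kernel can launch a child grid sized to a vertex's degree directly from the device. Dynamic parallelism requires compiling with relocatable device code (nvcc -rdc=true) and linking against cudadevrt.

```cuda
// Child kernel: one thread per neighbor of a single vertex.
__global__ void visit_neighbors(const int *col_idx, int first, int degree,
                                int *visited)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < degree)
        visited[col_idx[first + i]] = 1;   // illustrative per-neighbor work
}

// Parent kernel: one thread per vertex; each vertex spawns a child grid
// sized to its own degree, with no round trip to the CPU.
__global__ void process_vertices(const int *row_off, const int *col_idx,
                                 int n, int *visited)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    int first  = row_off[v];
    int degree = row_off[v + 1] - first;
    if (degree > 0) {
        int threads = 128;
        int blocks  = (degree + threads - 1) / threads;
        visit_neighbors<<<blocks, threads>>>(col_idx, first, degree, visited);
    }
}
```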
BFS ALGORITHM DESCRIPTION
We use a level-synchronized algorithm for BFS. The algorithm has two loops: the outer loop iterates over frontier queues and the inner loop iterates over each of a vertex's outgoing edges, adding the edges' targets to the next iteration's frontier queue. Both the outer and the inner loops can be executed in parallel; they are represented by kernel1 and kernel2 in our algorithm, respectively. kernel1 performs one level of the search, calling kernel2, which processes one vertex from the current queue, adds each of its adjacent vertices to the queue for the next iteration, and sets the parent of the adjacent vertices to the current vertex in parallel. This is the essence of our dynamic-parallelism-based parallel breadth-first search algorithm (BFS-DP). The basic form of our algorithm is implemented in a mere 29 non-empty lines of code.
Kernels kernel1 and kernel2 make use of two additional data structures: the parent map and the queue. The parent map is an array with one entry per vertex that stores the parent discovered for each vertex. There are several choices for how to represent the queue, such as a dynamically sized queue, a fixed-size boolean vector, or a fixed-size bitmap. We chose a bitmap for its simplicity. The bitmap has one bit per vertex, and each vertex's corresponding bit is set to true if that vertex is in the current frontier.
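A condensed sketch of the two kernels is shown below. It assumes a CSR graph representation and elides the host driver loop; variable names are illustrative rather than the exact code of our implementation.

```cuda
// kernel2 (inner loop): a child grid processes one frontier vertex's adjacency
// list, recording parents and building the next frontier bitmap.
__global__ void kernel2(const int *col_idx, int first, int degree, int v,
                        int *parent, unsigned *next_frontier)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= degree) return;
    int u = col_idx[first + i];
    if (parent[u] == -1) {                 // undiscovered (benign race: any parent
        parent[u] = v;                     // found at this level is a valid BFS parent)
        atomicOr(&next_frontier[u >> 5], 1u << (u & 31));   // add u to next frontier
    }
}

// kernel1 (outer loop): one thread per vertex; frontier vertices launch kernel2.
__global__ void kernel1(const int *row_off, const int *col_idx, int n,
                        const unsigned *frontier, unsigned *next_frontier,
                        int *parent)
{
    int v = blockIdx.x * blockDim.x + threadIdx.x;
    if (v >= n) return;
    if (!(frontier[v >> 5] & (1u << (v & 31)))) return;     // v not in current frontier
    int first  = row_off[v];
    int degree = row_off[v + 1] - first;
    if (degree > 0)
        kernel2<<<(degree + 127) / 128, 128>>>(col_idx, first, degree, v,
                                               parent, next_frontier);
}
```

The host clears the next-frontier bitmap, launches kernel1 once per level, and stops when the next frontier is empty; since a parent grid is not considered complete until all of its child grids finish, no extra synchronization between kernel1 and its children is needed.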
This algorithm is very straightforward and also quite readable, but it does not perform well (0.06 GTEPS on scale 24), since launching kernels for nodes with only a few adjacent vertices hurts performance.
Improvements
Due to the simplicity of our code, our algorithm can be extended with a number of modular improvements. Here we discuss several changes that improved performance; later, in Section 3.2, we describe attempted improvements that actually hurt performance. See Table 1 for a summary of the effect of each of these changes. In each case we report the impact of these changes on Graph 500 graphs of scales 15, 20 and 24. We chose these three scales because they are representative of three categories of behavior, dominated by the L1 cache, the L2 cache and global memory, respectively.
At scale 24, most of the changes show smaller effects on performance. This shows that the lack of locality for higher-scale graphs becomes the dominating factor that impacts performance. This trend holds true for changes that did not improve performance as well; their effects are also smaller at higher scales.
Warp-Level Thread Cooperation
Table 2: The effect of changes that resulted in a performance decrease (GTEPS mean of 10 runs on a Graph 500 graph, and change relative to the best-performing version from Table 1).

To increase the performance of our implementation, we adapted an optimization mentioned in [6] that allows threads within the same warp to cooperatively process the adjacencies of a node. With only 42 lines of additional code, including comments, warp-level thread cooperation increased our performance by about 16×. We ran our benchmark on scales 15, 20 and 24 with only the warp-level thread cooperation optimization turned on, processing all adjacencies without launching child kernels, and achieved 0.185, 0.686 and 0.564 GTEPS respectively, roughly a 15% decrease in performance compared to using both warp-level thread cooperation and child kernels (Table 1). This result shows that launching dynamic kernels for large adjacency lists does improve the performance of our algorithm.
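A sketch of the warp-cooperative path is shown below; the degree threshold mentioned in the comment and the function names are assumptions for illustration, and adjacency lists above the threshold are handed to a child kernel instead.

```cuda
// Warp-level thread cooperation (sketch): the 32 lanes of a warp strip-mine one
// frontier vertex's adjacency list instead of a single thread walking it alone.
// Adjacency lists longer than a tuning threshold fall through to kernel2.
__device__ void process_with_warp(const int *col_idx, int first, int degree, int v,
                                  int *parent, unsigned *next_frontier)
{
    int lane = threadIdx.x & 31;                 // lane id within the warp
    for (int i = lane; i < degree; i += 32) {    // each lane takes every 32nd edge
        int u = col_idx[first + i];
        if (parent[u] == -1) {
            parent[u] = v;
            atomicOr(&next_frontier[u >> 5], 1u << (u & 31));
        }
    }
}
```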
Edge Sorting
When analyzing the profiles of our code, we noticed poor memory behavior. This is to be expected given the sparse and irregular nature of graph algorithms, but we can still hope to make small improvements. Many vertices may point to the same vertex v, meaning the threads responsible for each of these vertices will each have to access v's entry in the parent map. We get better performance if these accesses happen close together in time, and the likelihood of this occurring increases if we ensure threads process adjacent vertices in sequential order.
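A minimal host-side sketch of this preprocessing step, assuming the same CSR layout as above (array names are illustrative):

```cuda
// Sort each vertex's CSR adjacency segment so that threads walk neighbors in
// ascending vertex order, making concurrent accesses to the same parent-map
// entries more likely to fall within the same memory transactions.
#include <algorithm>

void sort_adjacencies(const int *row_off, int *col_idx, int num_vertices)
{
    for (int v = 0; v < num_vertices; ++v)
        std::sort(col_idx + row_off[v], col_idx + row_off[v + 1]);
}
```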
Sorting each vertex's edges before doing the search improves the memory transactions-per-access ratio, which in turn leads to about an 8% overall improvement in performance at scale 20. Smaller graphs see a greater improvement, which is to be expected as they have better locality simply by virtue of being smaller.
Failed Improvements
Here we discuss our experience with some changes that did not help. Table 2 summarizes the performance impact of each of these changes.
Wide Bitmaps
Our implementation represents the frontier using a single bit per vertex that indicates whether the vertex is in the frontier. Updating this bitmap requires atomic operations as multiple threads can set bits in the same word at once. We had hoped that replacing the bitmap with an array of 32-bit values could gain some performance by removing the need for atomic operations.
This was not the case. We saw a performance decrease of 2%–4%, meaning the reduced overhead for atomic operations did not overcome the increased memory traffic caused by expanding the size of the bitmap.
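The two frontier-update variants, sketched below for illustration (function names are ours), show the trade-off: the bitmap needs an atomic OR because threads may set different bits of the same 32-bit word, while the wide representation uses a plain store at 32× the memory footprint.

```cuda
// One bit per vertex: concurrent updates to the same word require atomicOr.
__device__ void mark_frontier_bitmap(unsigned *frontier_bits, int u)
{
    atomicOr(&frontier_bits[u >> 5], 1u << (u & 31));
}

// One 32-bit word per vertex: a plain store suffices, but the frontier array
// is 32x larger, increasing memory traffic.
__device__ void mark_frontier_wide(unsigned *frontier_words, int u)
{
    frontier_words[u] = 1;
}
```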
Multi-Vertices per Child Kernel Thread
To increase throughput and decrease the number of blocks launched per child kernel, we experimented with increasing the number of adjacent vertices processed per thread (NAPT) in the child kernel. In our warp-level thread cooperation processing stage, we were able to increase the performance of our algorithm by over 100% after we increased the number of adjacent vertices processed by each warp-level thread. We believed, therefore, that we might be able to achieve an even higher level of parallelism and further improve our performance by allowing a thread in the child kernel to process more than one adjacency at a time.
We were able to reduce the total number of threads launched in the child kernel by setting NAPT to a value greater than 1, but at the cost of a performance decrease. The results in Table 2 show the performance impact of processing 2 vertices per thread.
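A sketch of the child kernel with NAPT adjacencies per thread, mirroring the kernel2 sketch above (names and the contiguous-run layout are illustrative assumptions):

```cuda
// Each thread covers a run of napt edges, so a child grid needs roughly
// degree / napt threads instead of one thread per edge.
__global__ void kernel2_napt(const int *col_idx, int first, int degree, int v,
                             int *parent, unsigned *next_frontier, int napt)
{
    int t = blockIdx.x * blockDim.x + threadIdx.x;
    for (int k = 0; k < napt; ++k) {
        int i = t * napt + k;
        if (i >= degree) return;
        int u = col_idx[first + i];
        if (parent[u] == -1) {
            parent[u] = v;
            atomicOr(&next_frontier[u >> 5], 1u << (u & 31));
        }
    }
}
```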
Single Driver Kernel
Our kernels discussed so far are able to perform one level of the BFS traversal without any input from the CPU. We tried to reduce the latency introduced by round trips between the CPU and GPU at each iteration by running the driver loop on the GPU as well. This led to a performance decrease of 44% on scale 15 and 1% on scale 24. It might be possible to combine the driver kernel and kernel1 together to improve performance, but we have not explored this.
SINGLE SOURCE SHORTEST PATHS
Since both BFS and SSSP can utilize a similar level-synchronized strategy for their traversals, it was easy to modify our BFS algorithm to perform SSSP calculations using dynamic parallelism (SSSP-DP). Instead of marking a neighbor as visited in each iteration, we compare the sum of the edge weight and the tentative distance of the parent to the previously recorded distance of the neighbor. If the new distance via the current parent is smaller than the recorded distance of the neighbor, the recorded distance is updated and the neighbor is marked to be visited in the next frontier.
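The relaxation step can be sketched as follows; this is one possible realization of the compare-and-update described above (atomicMin over unsigned distances is our assumption here, and the names are illustrative).

```cuda
// Edge relaxation for SSSP-DP: dist_v is the tentative distance of the parent
// vertex whose adjacency list this child grid processes.
__global__ void relax_edges(const int *col_idx, const unsigned *edge_weight,
                            int first, int degree, unsigned dist_v,
                            unsigned *dist, unsigned *next_frontier)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= degree) return;
    int u = col_idx[first + i];
    unsigned cand = dist_v + edge_weight[first + i];   // distance via the parent
    unsigned old  = atomicMin(&dist[u], cand);         // keep the smaller distance
    if (cand < old)                                    // improved: revisit u next level
        atomicOr(&next_frontier[u >> 5], 1u << (u & 31));
}
```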
EVALUATION
In addition to the results we have presented so far, we tested the performance of our algorithm on a number of graphs and compared it with existing implementations from the literature. Despite our code's simplicity, it performs competitively against other single-GPU BFS implementations.
Experimental Setup
We performed our experiments on an Intel Xeon E5-2670 v3 12-core processor machine with 32 GB RAM and an NVIDIA Tesla K40 GPU, which includes 12 GB of RAM and 2880 CUDA cores.
We ran different BFS and SSSP implementations on both Graph 500 graphs and the USA road network graphs, including different algorithms from the Lonestar GPU package as well as the implementation of Ueno et al. [9] on Graph 500 graphs. Our implementation performs the best of the three through scale 19, at which point [6] overtakes ours.
Breadth-first search Performance
As the graph gets larger, the spatial locality in memory decreases. We believe that after scale 19 the locality has decreased to the point where more TLB misses are occurring. The overhead introduced by launching more dynamic kernels could also have impacted performance, which explains our decrease in performance at larger scales. Adopting more of the techniques in [6] could alleviate this performance decrease. Figure 2 shows our performance on a selection of road networks, where lonestar-wla, lonestar-wlc and lonestar-wlw represent different worklists used in the Lonestar BFS algorithm collection. Note that none of the nodes in the road networks has a high enough degree to take advantage of dynamic parallelism, which explains some of our relatively decreased performance.
Single source shortest paths Performance
Since atomic operations are required to compare and update the distances of each node in the graph for SSSP, SSSP algorithms are naturally more time-consuming than traversing the same graph using BFS. We compared our timing against a collection of SSSP algorithms using different topologies and data structures in Lonestar GPU, using graphs of the US road networks from the 9th DIMACS challenge [3]. The performance results can be seen in Fig. 3, where lonestar-wlc and lonestar-wln represent different worklists used in the Lonestar SSSP algorithm collection.
The results suggest that our SSSP implementation lags behind some of the implementations in Lonestar that utilize advanced work queuing techniques. Due to the nature of the road network graphs, however, all the nodes are processed in the warp-level thread cooperation stage since none of the nodes in a road network graph has enough adjacent vertices to trigger child kernel processing. Therefore, our algorithm might not be able to achieve as much parallelism on road network graphs as on a Graph 500 graph.
RELATED WORK
We have already referred extensively in this paper to [6] as an exemplar of the best-in-class single-GPU BFS implementation; our implementation adopts several techniques from this work. Wang and Yalamanchili [10] examine the efficiency of DP on a selection of problems, including the overheads of launching dynamic kernels. To decrease the cost of dynamic parallelism, Wang et al. [11] propose a more lightweight mechanism of dynamically adding blocks to the current kernel. We also compared our performance against [9], which focuses on using GPUs in a distributed setting with MPI. They use a vertex sorting technique, as in their earlier work [8], that is similar to our edge sorting optimization (Section 3.1.2).
The authors of [5] developed a warp-centric programming model that allocates tasks at warp granularity to minimize divergence within warps, reporting a performance improvement. A recently published work by Davidson et al. [2] proposes an SSSP algorithm that builds on the model and optimizations mentioned in [6] and reports up to a 14× improvement on low-degree graphs and a 20–60× improvement against a serial implementation of SSSP using Bellman-Ford.
CONCLUSION
We have shown that dynamic parallelism is a useful tool in simplifying the development of high-performance GPU graph algorithms. Starting with a very simple algorithm, we can progressively apply additional improvements to yield better performance without increasing code complexity. At some scales our implementation even outperforms one of the fastest single-GPU BFS algorithms published to date (Lonestar GPU's adaptation of [6]). The progressive manner in which we have built our algorithm leads to simpler, modular and highly maintainable code. In addition to simplifying code, CUDA's dynamic parallelism serves as a tool to increase performance at a higher level of abstraction.
