Load balancing graph analytics workloads on GPUs is difficult because of the irregular nature of graph applications and the high variability in vertex degrees, particularly in powerlaw graphs. In this paper, we describe a novel load balancing scheme that aims to address this problem. Our scheme is implemented in the IrGL compiler to allow users to generate efficient load-balanced code for GPUs from high-level sequential programs. We evaluated several IrGL-generated load-balanced programs on up to 16 GPUs. Our experiments show that this scheme can achieve an average speed-up of 1.5× on inputs that suffer from severe load imbalance problems when previous state-of-the-art load-balancing schemes are used.
Introduction
Graphics processing units (GPUs) have become popular platforms for processing graph analytical applications [3, 8, 12, 14, 15, 29, 38] . In spite of the computational advantages provided by the GPUs, improving the performance of graph analytical applications remains a challenge even for simple vertex programs. Vertex programs execute in rounds, and in each round, they apply an operator to certain active vertices in the graph. Operators update labels of the active vertex and its immediate neighbors, and must appear to have been executed atomically. In a large graph, there may be many active vertices that can be processed in parallel. However, there are several complications in ensuring load balance across GPU threads. The first is that the set of active vertices in each round is statically unpredictable and may vary dramatically from round to round. Therefore, static load balancing techniques do not work well. Another complication is that most large graphs today are power-law graphs in which vertex degrees follow a power-law distribution (i.e., a few vertices . have orders of magnitude more neighbors than the rest of the vertices). Therefore, simple load balancing schemes that assign active vertices to threads do not perform well. Finally, good load balancing schemes must account for the architecture of modern GPUs and the presence of the hierarchy of threads consisting of threads, warps, and thread blocks. In multi-GPU systems, the computational load imbalance in a single GPU impacts the efficiency of execution. As many distributed GPU frameworks adopt the bulk-synchronous parallel execution model [8, 15] , load imbalance within one GPU may exacerbate load imbalance across the machine.
Several graph processing frameworks have proposed different methods of resolving the load balancing problem for graph analytics on GPU [3, 8, 15, 28, 29] . Most of these strategies involve dynamically dividing vertices and/or edges evenly across the thread blocks, warps, and threads of the GPU. However, most of these techniques have the limitations that include high memory overheads, load imbalance across thread blocks, an inability to efficiently handle irregular workloads, load-balancing overheads, or high programming effort.
In this paper, we present an adaptive load balancing strategy that addresses the load imbalance problem at runtime. It is based on a low-overhead inspector that estimates load imbalance among thread blocks at runtime. If it determines that load imbalance exists, computation is split by assigning edges evenly across all the thread blocks. We propose a novel cyclic edge distribution strategy that takes into account the memory access patterns of the GPU threads. Furthermore, we propose a method that determines if the load balancing is not beneficial in a round of computation to avoid the overhead of splitting edges across the thread blocks (intra thread block load balancing continues to happen).
We implemented our strategy in the IrGL compiler [28] , which permits users to write simple sequential graph analytics programs without detailed knowledge of GPU architectures. Our compiler automatically generates efficient load-balanced parallel CUDA code. The generated code interoperates with the Gluon communication substrate [8] , enabling it to run on multiple GPUs in a distributed-memory cluster.
We evaluated the benefits of our approach on a single machine with up to 6 GPUs and on a distributed GPU cluster with up to 16 GPUs. We compare our approach with other frameworks that support different load balancing strategies. Our experiments show that our load balanced code achieves an average speedup of (1) 1.7× on a single GPU and (2) 2.1× on multiple GPUs over many graph applications. Our loadbalanced code achieves an average speedup of 1.5× when compared to other third-party frameworks, on power-law graphs while incurring negligible overhead in inputs that do not suffer from heavy load imbalance.
The rest of the paper is organized as follows. Section 2 gives the context of our work. Section 3 details existing load balancing strategies and discusses the trade-offs they make. Section 4 presents our proposed adaptive load balancing strategy, implemented in the IrGL [28] compiler. Sections 5 and 6 discuss our experimental setup and evaluation respectively. Section 7 describes related work, and Section 8 concludes the paper.
Background on Graph Analytics and GPUs
This section gives a high-level introduction to implementing graph analytics applications on GPUs.
Graph Analytics
A graph consists of a set of vertices and a set of edges. Labels may be associated with both vertices and edges. Node labels are initialized at the start of the computation and updated repeatedly during execution until some global quiescence condition is reached. Updates to node labels are performed by applying an operator to active vertices in the graph [30] . A push-style operator reads the label of the active vertex and updates labels of its immediate neighbors. A pull-style operator reads the labels of the immediate neighbors of the active vertex and updates the label of the active vertex. A vertex becomes inactive once the operator has been applied to it, but it may become active again later in the computation. Several approaches exist for finding active vertices in the graph. Topology-driven algorithms execute in rounds, and in each round, the operator is applied to all vertices. The Bellman-Ford single-source shortest-path (sssp) algorithm is an example [7] . These algorithms are simple to implement, but they may be inefficient if there are few active vertices in each round, as is the case in sssp and bfs computation in road networks. Data-driven algorithms maintain explicit worklists of active vertices, and the operator is applied only to those vertices. The worklist is initialized at the start of the computation, and execution terminates when the worklist is empty. Dijkstra and delta-stepping algorithms for sssp are examples: the source vertex is initially placed in the worklist, and during the computation, any vertex whose label is updated becomes active, and it is placed on the worklist [7] . Data-driven algorithms can be executed in rounds by maintaining two worklists that we call current and next worklists: in each round, active vertices from the current worklist are processed and newly activated vertices are added to the next worklist. The round terminates when the current worklist is empty.
To process a graph on a distributed cluster, the input graph is first partitioned among the hosts in a cluster. There are many partitioning policies such as outgoing edge cut (OEC), incoming edge cut (IEC), and vertex cuts [13] . Execution occurs in rounds. In each round, hosts compute on their local partitions and then participate in a global synchronization step in which the labels of boundary vertices are reconciled.
GPU Execution
An NVIDIA GPU [26] consists of a set of streaming multiprocessors (SMs). Each SM is associated with resources such as streaming processors (SPs), registers, scratchpad memory, and L1 cache. The GPU has an L2 cache and a global memory which can be accessed from any of the SMs.
From the perspective of a CUDA programmer 1 , the GPU appears as a device that executes multithreaded programs called kernels. GPUs support the SPMD programming model for kernels, so each thread executes the same code but may follow an independent control path. A kernel is launched on the GPU with a fixed number of threads which cannot be altered during execution. The CUDA programming model for kernels is hierarchical -each kernel executes as a collection of thread blocks that in turn contain the actual threads. Threads within the same thread block can synchronize and communicate with each other using fast on-chip shared memory. Thread blocks can be launched in any order on the GPU. The number of thread blocks that can be launched in an SM depends on the number of available resources in the SM as well as the resource requirement of each thread block. The threads in the thread block are divided into sets of consecutive 32 threads called warps. The threads in a warp execute program instructions in a SPMD manner. Once a thread block finishes its execution, another thread block can be launched. Finally, a kernel finishes its execution once all its thread blocks complete their execution.
Writing irregular programs using CUDA involves a higher degree of architectural awareness than is required for regular data-parallel programs. Communication and mutual exclusion, which are building blocks of optimistically parallelized irregular programs, are especially tricky to get right in a portable fashion. The inability to spawn new threads for newly discovered work or even reschedule existing threads (e.g. using some application-level priority) prevents the implementation of sophisticated schedulers as on the CPU. Despite these limitations, GPU offer higher memory bandwidth and an order of magnitude more concurrency than most CPUs, both of which can be exploited to obtain highperformance in graph analytics applications.
Existing Load Balancing Strategies and Challenges
In this paper, we address the following challenge: given the active vertices in a partition in a round, how should the work of processing those vertices be distributed among the threads of the GPU owning the partition? We start by describing the pros and cons of existing GPU load balancing strategies for graph analytics applications.
Vertex-Based and Edge-Based Load Distribution
Vertex-based load balancing schemes balance work by assigning roughly equal numbers of vertices to GPU threads [23, 24, 31] . Each thread is responsible for processing the outgoing and/or incoming edges of the vertices assigned to it. This strategy works well if the number of edges connected to each vertex is roughly the same. In power-law graphs, some vertices have a much higher degree than other vertices, so this strategy will result in severe load imbalance both at an inter-thread block level as well as at an intra-thread block level.
Edge-based load balancing, on the other hand, assigns roughly equal numbers of edges to each GPU thread, and each thread updates the endpoints of its assigned edges during computation. This works well for graphs that have unbalanced degree distributions [32] since each thread will have the same amount of edges to processes. However, for efficiency, this scheme requires a graph representation on the GPU that allows a thread to quickly access the endpoints and data of an edge from the edge ID such as coordinate format [32] or an edge list format. This may require prohibitively more space than formats such as compressed sparse row (CSR) or compressed sparse column (CSC) that many graph analytics systems [3, 8, 15, 28, 29] use. major problem for GPUs which already have limited memory. Alternatively, this space overhead can be avoided at the cost of increased computation by requiring threads to search for edge endpoints in a format like CSR or CSC [36] .
Thread-Warp-CTA (TWC) Distribution
Thread-Warp-CTA (TWC) load balancing assigns a vertex and its edges to a thread, a warp, or a thread block (CTA) based on the degree of that vertex [8, 22, 28, 36] . Each vertex in the graph is assigned to one of three bins: small, medium, and large, based on its degree. The vertices in the large bin are processed at a thread block level: edges of the vertex are partitioned among the threads of a thread block for processing. Similarly, vertices in the medium bin are processed at the warp level: their edges are partitioned among the threads of a warp. Finally, the vertices in the small bin are processed at the thread level. The three bins can be processed sequentially [36] , or they can be processed concurrently [28] . This strategy is good for load balance among all threads as well as for locality of access, and it does not need to store edge endpoints as edge-based balancing does.
One problem with this strategy is that the unit of work assignment to threads is still done by distributing the vertices equally: this can still cause load imbalance at the thread block, warp, and thread level if the degree distributions within a bin varies significantly. In particular, since there is no upper bound on the degree of the vertices assigned to the large bin, thread block imbalance may be significant. To illustrate this, we compute the work done (number of processed edges) by the each thread block for D-IrGL [8] which uses the TWC policy. Figure 1a shows the load distribution on different thread blocks in the first three rounds of executing sssp (single-source-shortest-path) using the rmat25 input. The first two rounds consume a significant portion of the total execution time, and the figure shows that computational load across thread blocks is imbalanced: some thread blocks process a large number of edges compared to other thread blocks. The third round, however, is load-balanced across the thread blocks. We observe similar behavior with many other applications on power-law graphs.
Thread block imbalance can also vary across different applications for the same input. We use Figure 1c to illustrates this: bfs (push-style) suffers from thread block load imbalance, but pagerank (pull-style) does not. Finally, in a given application, computational load can be distributed differently across thread blocks for different inputs. Figure 1b shows the thread block distribution for bfs on road-USA and rmat23 inputs. Here, bfs exhibits thread block load imbalance on rmat23 but not on road-USA.
Other Strategies
Other strategies exist to circumvent some of the weaknesses of the previously discussed strategies. For example, Enterprise [18] adds another bin to the three that TWC maintains to processes vertices with extremely large degree using all CTAs on the GPU to alleviate possible thread block imbalance. However, it only uses it in breadth-first search. Gunrock [36] provides an edge-based load-balance strategy (LB) that assigns edges equally among the thread blocks in addition to a dynamic binning work assignment strategy similar to TWC). They use the average degree of graph to choose between TWC and LB. However, the chosen strategy is used in all rounds, which may not be the best policy to follow if the computation pattern varies substantially from round to round. 
Discussion
Ideally, these issues would be addressed by a load balancing scheme that can be incorporated into a compiler so application developers can use it without having a deep knowledge of the underlying GPU architecture. We describe such a scheme next.
Adaptive Load Balancer
To explain the adaptive load balancer introduced in this paper, we use sssp as an example program. Figure 2 shows a snippet of a simple sssp program written using IrGL [28] programming constructs. The outer loop at Line 3 iterates over the active vertices in each round, and the inner loop (at Line 5) processes the neighbors of the active vertex. The sssp operator (known as the relaxation operator [7] ) is applied in each iteration of the inner loop, and it corresponds to a unit of work (load) for this program. Other vertex programs for graph analytics applications follow a similar pattern: an outer loop that iterates over active vertices and an inner loop that processes the immediate neighbors of a given active vertex. Figure 3 shows the code generated by the modified IrGL compiler for the sssp code of Figure 2 . We explain our adaptive load balancing scheme using this code. Lines 30-36 is the main outer loop that iteratively calls the functions SSSP and SSSP_LB.
Approach
Our approach extends the TWC scheme discussed in Section 3.2 with a bin called huge. During the preprocessing phase in each round, a vertex is assigned to this bin if its degree exceeds a threshold value, which we denote by THRESH-OLD. Otherwise, it is assigned to one of the TWC bins. This is shown in lines 3-9 of the code in Figure 3 .
At the end of this inspection phase, the implementation checks to see if the huge bin has any vertices. If so, the edges of these vertices are distributed evenly among threads in all thread blocks. To accomplish this, we first perform a prefix sum of the number of edges connected to the huge vertices (shown at line 31). The final value in the prefix sum is the total number of edges connected to the set of huge vertices (this is called total_edges in line 14. Dividing this value by the total number of threads gives the number of edges assigned to each thread (line 15).
At this point, we know how many edges must be assigned to each thread (call this w), but we still need to determine which edges are assigned to a given thread. There are many choices, but we used two policies: cyclic and blocked. They are illustrated in Figure 4 . In the cyclic distribution, edges are assigned to threads in a round-robin manner, whereas in the blocked distribution, each thread is assigned a contiguous set of edges. More precisely, if the number of threads is p, thread T i is assigned edges e i , e p+i . . . e (w −1) * p+i in the cyclic scheme Figure 3 . A snippet of compiler generated CUDA code for source shortest path (sssp) appliation.
whereas it is assigned edges e (i * w ) , e (i * (w +1)) . . . e (i+1) * w −1 in the blocked scheme.
Regardless of the distribution, a thread that needs to process an edge needs to know the source and destination vertices of that edge. If the graph is stored in COO format, this information is readily available, but like most systems in this space, our system uses a CSR representation of the graph to save space. Therefore, we use a binary search on the prefix sum array computed earlier (line 31). For example, in Figure 4 , thread T 4 needs to process edge e 4 in the cyclic distribution, so it makes a binary search in the prefix sum array to identify the source vertex v 0 (which has first 40 edges). However, the performance of the binary search depends on the distribution; in particular, the cyclic distribution has better locality than the blocked distribution. This is illustrated in Figure 4 . In the cyclic distribution, consecutive threads (T 0 ..T 19 ) process consecutive edges (E 0 ..E 19 ) whose binary search computation follows the same trajectory in the binary tree, which is good for cache performance. In the blocked distribution, threads T 0 ..T 19 compute source vertices by following different paths in the binary search tree and this has worse locality. The experimental results in Section 6 show that the cyclic distribution performs better in practice, so we use it in the rest of the experimental study. In Figure 3 , the computation of the source and destination vertices is used in line 21. The relaxation operator is then applied to this edge (line 22).
Analysis
The value of the degree threshold for classifying a vertex as huge is currently a static parameter in our system. Setting this value to 0 will put all vertices in the huge bin; this may be good for load balancing, but there will be a lot of overhead from the binary search. Conversely, setting the threshold to a value larger than the max degree of vertices in the graph ensures that no vertex will be placed in the huge bin; this eliminates the overhead of binary search but hurts load balancing. Therefore, there is a sweet spot that yields the best performance. We found experimentally that a threshold value equal to the number of threads launched in a kernel gives good performance, so we use this value in the experiments reported in Section 6. In each round of computation, our scheme requires to maintain two worklists to keep track of high degree vertices and their prefix sums. This has a computational advantage to reduce the binary search space. However, it incurs (O(V )) space overhead.
Evaluation Methodology
We implemented our Adaptive Load Balancer (ALB) in the IrGL compiler [28] . The IrGL compiler generates CUDA code for a single GPU. To support multiple GPU platforms, we plugged the generated code with the graph partitioner, CuSP [13] (with cartesian vertex cut policy) and the Gluon [8] communication substrate. CuSP partitions the input graph among GPUs and supports various (customizable) partitioning policies that can be chosen at runtime, while Gluon manages the synchronizations of vertex labels among GPUs. We denote the resulting system as D-IrGL(ALB). We evaluated our implementation on two hardware platforms: the first is a machine we call Momentum, and the second is the Bridges [27] cluster located at the Pittsburgh Supercomputing Center [2] . Momentum is used for singlehost multi-GPU experiments. It has 2 Intel Xeon E5-2650 v4 CPUs with 12 cores each and 96 GB of DRAM on each CPU. It has 6 GPUs: 4 NVIDIA Tesla K80 with 12 GB of memory each and 2 NVIDIA GeForce GTX 1080 with 8GB of memory each. Bridges is used for multi-host multi-GPU experiments. Each machine in the cluster contains 2 Intel Broadwell E5-2683 v4 CPUs with 16 cores each, 128GB of RAM, and 2 NVIDIA Tesla P100 GPUs with 16GB of memory each. The network interconnect used to connect the machines in the cluster is the Intel Omni-Path Architecture. We run on up to 16 GPUs on our experiments (8 machines). All codes were compiled in Momentum using cuda 9.0 and gcc 6.1.0 and in Bridges using cuda 9.0, gcc 6.3.0 and Intel MPI. Table 1 lists the input graphs that we used for our experiments; the table splits the graphs into ones we ran on a single host (Momentum) and ones we ran on multiple hosts (Bridges). All the rmat graphs are randomly generated graphs created with an RMAT generator [5] . road-USA [1] is a road network. orkut [17] and twitter40 [16] are social networks.
We evaluate five applications: breadth-first search (bfs), connected components (cc), k-core decomposition (kcore), pagerank (pr), and single-source shortest path (sssp). We use push-style implementation for bfs, cc, and sssp, and pull-style implementation for pr and kcore. The source node for bfs and sssp is the highest out-degree node of the graph except for the road networks, for which it is 0. The k value for kcore is 100. The tolerance value in pr is 10 −6 . All algorithms are run until convergence.
We compare the performance of our approach with the following frameworks: D-IrGL [8] , Gunrock [29] , and Lux [15] . D-IrGL, is a distributed multi-GPU graph analytical framework. It does not support a dynamic thread block load balancing mechanism. However, it can balance the load within a thread block using TWC. D-IrGL supports all 5 applications that we use for the evaluation. Gunrock is a single host multi-GPU graph analytical framework. It supports two load balancing schemes: TWC and LB. TWC can balance the load within a thread block, and LB can balance the load across the thread blocks. For Gunrock, we use bfs, sssp, and cc; they do not have kcore; we omit pr as it did not produce correct output. Lux is a distributed multi-GPU graph analytical framework. It does not support thread block load imbalance. However, it supports a variant of TWC scheme to balance the load within a thread block. For Lux, we used only cc and pr as other applications are either not available or not correct. pr for Lux does not provide a convergence criteria, so we executed it for the same number of iterations as that of D-IrGL. Execution time of all applications using all frameworks are reported as an average of three runs, excluding graph loading, partitioning, and construction time.
Experimental Results
In this section, we evaluate and analyze our adaptive load balancing approach using a single GPU, multiple GPUs on a single host, and multiple GPUs on a distributed cluster. Table 2 compares the performance of our adaptive load balancing approach, denoted as D-IrGL (ALB), with that of D-IrGL, Gunrock (TWC), and Gunrock (LB). Note that D-IrGL does not support thread block load balancing policy but supports TWC. Gunrock (TWC) denotes the Gunrock framework that uses TWC policy and Gunrock (LB) denotes Gunrock with their thread block load balancing strategy. Gunrock provides two variants for bfs: with and without direction optimization. Currently, both D-IrGL and D-IrGL (ALB) do not support direction optimization for bfs, so we evaluate bfs in Gunrock without direction optimization (for reference, bfs with direction optimization results are in parenthesis). From the table, we observe that D-IrGL (ALB) shows up to 4× speed up when compared to D-IrGL for bfs, sssp, cc, and kcore on rmat23 and rmat25. These application and input configurations, as shown in Section 3, suffer from severe load imbalance across thread blocks in some rounds. It illustrates that our adaptive load balancing scheme is effective in detecting the presence of such load imbalance and in improving their performance. D-IrGL (ALB) does not show improvement for pr on rmat23 and rmat25 because pr visits the incoming neighbors of a vertex as opposed to other applications, like bfs, which visits outgoing neighbors. These graphs have low max D in when compared to max D out , as shown in Table 1 , and their in-degree distribution is not as skewed as their out-degree distribution. Consequently, pr does not suffer from thread block load imbalance in any round of computation (as shown in Section 3). D-IrGL (ALB) performs similar to D-IrGL on road-USA and orkut. D-IrGL does not suffer from load imbalance in any round of computation for road-USA (max D out of road-USA is only 9). As a result, D-IrGL (ALB) does not apply its thread block load balancing strategy at runtime. Similarly, D-IrGL does not suffer from severe load imbalance for orkut as orkut has very low max D out when compared to rmat23 and rmat25. These results show the adaptiveness of our approach: if an application suffers from the thread block load imbalance at runtime, it benefits from our approach; otherwise, it incurs minimal overhead. Table 2 shows that Gunrock (LB) performs better than Gunrock (TWC) due to thread block load imbalance in TWC. However, D-IrGL (ALB) outperforms both Gunrock (TWC) and Gunrock (LB) for most of the applications on rmat23, rmat25, and orkut. D-IrGL (ALB), unlike Gunrock (LB), is adaptive, i.e., if an application does not suffer from thread block load imbalance in a round of computation, it can minimize the overhead of load balancing. Gunrock (LB) performs better than D-IrGL (ALB) only for bfs and cc on road-USA. Gunrock uses an explicit sparse work-list to maintain the active vertices, whereas both D-IrGL and D-IrGL (ALB) use an implicit dense work-list to identify the active vertices.
Single GPU
In the implicit work-list, D-IrGL and D-IrGL (ALB) traverse all the vertices to determine the presence of the active vertices, whereas in the explicit work-list, Gunrock needs to traverse only the active vertices themselves. As there are few active vertices in any round of computation for bfs and cc on road-USA, D-IrGL and D-IrGL (ALB) incur additional overhead. However, we do not observe this behavior for sssp on road-USA with D-IrGL and D-IrGL (ALB) as they have a significant number of active vertices in most rounds of the computation.
To understand the strength of our approach, we show the thread block load distribution for several input and application configurations with and without our approach. Figure 5a and 5b show the thread block work distribution for D-IrGL and D-IrGL (ALB) schemes for bfs on rmat23 in 0 th round (it spends 84% of the total execution time in D-IrGL). The figure shows that D-IrGL suffers from huge load imbalance: thread block 0 processes all 34,941,924 edges of a high degree vertex, whereas the rest of thread blocks do not process any edges. Figure 5b shows the load distribution for two kernels that are launched in our approach: (1) LB, which distributes the load of high degree vertices equally among all the thread blocks and (2) TWC, which distributes the load of all other active vertices to threads/warp/CTA based on their degrees. The figure also shows the total load for each thread block, which is the sum of the respective loads in TWC and LB kernels. Figure 5b shows that D-IrGL (ALB) has a much more balanced load distribution than D-IrGL because ALB detects the presence of high degree vertices at runtime and distributes their load equally among all the thread blocks. The TWC kernel launched in D-IrGL (ALB) does not process any edge in this configuration as it does not have any low-degree active vertices in this round. Figure 5c and Figure 5d show similar load distributions for sssp in 1 st round. D-IrGL suffers from thread block load imbalance in this case as well. However, the difference lies in the distribution of load for LB and TWC kernels in D-IrGL (ALB). In this case, TWC kernel also processes few edges (these edges corresponds to that of low-degree active vertices) whose load also get balanced. Figure 5e and Figure 5f show the load distributions for cc on road-USA. D-IrGL does not suffer from load imbalance in any round. As a result, our executor phase does not initiate the LB kernel. Hence, none of the thread blocks of the LB kernel need to process edges, as shown in Figure 5f .
Finally, we show the load distributions for pr on rmat23 in Figure 5g and Figure 5h . Although rmat23 has a power-law degree distribution, it has lower max D in and skew in indegree distribution than max D out and skew in out-degree distribution, respectively. As pr traverses the incoming neighbors of active vertices, pr does not suffer from thread block load imbalance in D-IrGL. Consequently, ALB does not instantiate the thread block load balancer, i.e., the LB kernel does not process any edges. Figure 9 . D-IrGL (ALB) with IEC and OEC partitioning. Figure 6 compares the performance of D-IrGL, D-IrGL (ALB), Gunrock (TWC), and Gunrock (LB) schemes on Momentum using up to 6 GPUs. D-IrGL (ALB) performs better than D-IrGL, Gnrock (TWC), and Gunrock (LB) on 1 to 6 GPUs for rmat23 and rmat25 on all applications except for pr. As discussed, the performance of pr is comparable to D-IrGL since it does not suffer from thread block load imbalance. To understand the benefits in a single-host multi-GPU system, we show the breakdown of the total execution time into computation time and the non-overlapping communication time for 6 GPUs in Figure 7 . The computation time of an application in a GPU is the time spent in executing the kernels of the application in the GPU. We report computation time as the maximum of that among all GPUs. The rest of the execution time is the non-overlapping communication time, including the time to synchronize the labels of vertices among the GPUs. Figure 7 shows that D-IrGL spends a significant amount of time in computation time on 6 GPUs, and D-IrGL (ALB) is effective in reducing the overall computation time, even for multiple GPUs. Even though the input graph is partitioned among the multiple GPUs, thread block load imbalance even in a single GPU affects the entire multi-GPU system. In these cases, ALB is effective in detecting and correcting the load imbalance on each GPU, showing a reduction in the total computation time. As a result, it improves the performance in the entire multi-GPU system. Figure 8 compares the performance of two distribution strategies discussed in Section 4, i.e., cyclic and blocked. The figure shows the results for rmat23 and rmat25, which has thread block load imbalance for all applications except pr. We observe that D-IrGL (ALB) with cyclic distribution performs much better than that with blocked distribution for all configurations (except pr), and it is up to 4× faster. As explained in Section 4, cyclic distribution is faster than blocked distribution because cyclic has better memory access pattern that benefits from cache hits.
Single-Host Multi-GPUs
To further understand the benefits of ALB on multiple GPUs, we study the performance using two different graph partitioning policies (IEC and OEC), which is shown in Figure 9 . D-IrGL provides different graph partitioning policies to minimize the synchronization time, achieve load balance among multiple GPUs, and to improve the overall performance. Though these policies may address the load imbalance among the multiple hosts/GPUs, each GPU can still suffer from thread block load imbalance. The figure shows that our approach is effective in addressing the load imbalance present in each GPU, which leads to a reduction in the computation time, thereby improving the total execution time. Figure 11 . Execution time breakdown on 16 GPUs of Bridges. Figure 10 compares the performance of D-IrGL (ALB) with that of D-IrGL and Lux frameworks, which support multihost multi-GPU graph analytical applications, on up to 16 GPU of Bridges. It is clear that D-IrGL, even without thread block load balancing support, performs better than Lux for all configurations. D-IrGL and D-IrGL (ALB) perform similary on uk2007 but D-IrGL (ALB) is slightly faster than D-IrGL on twitter40. On rmat26 and rmat27, D-IrGL (ALB) is faster than D-IrGL by an average speed up of 4×.
Multi-Host Multi-GPUs
To analyze the performance improvement in rmat26 and rmat27, we show the breakdown of the total execution time into computation and non-overlapping communication time on 16 GPUs as described in Section 6.2. The results show that most applications in D-IrGL spend most of the execution time performing computation. These inputs have a very large max D out (shown in Table 1 ) which makes them suffer from much more from thread block load imbalance on one of the GPUs, even when used on 16 GPUs. D-IrGL (ALB) is effective in reducing the computation time on the GPUs by balancing load across the thread blocks on each GPU to improve the performance of the application in the multi-host multi-GPU system.
We also observe that D-IrGL (ALB) performs similar to uk2007 for all configurations. This graph has a max D out of 15,402, which is less than the number of threads launched in our system (i.e, 26,624), which is our threshold to identify high-degree vertices. Hence, the graph does not have any high-degree active vertex in any round of the computation. D-IrGL (ALB) helps in minimizing the overhead at runtime by not applying thread block balancing strategy in any round.
In summary, our adaptive load balancer (ALB) is effective in improving performance of applications significantly on the configurations that suffer from thread block load imbalance in both single GPU and multi-GPU systems with minimal overhead on configurations that have balanced thread block load.
7 Related Work
Graph Processing on GPUs
Many prior works deal with graph processing on GPUs, spanning single-host single-GPU systems [21, 21, 28, 34, 35] , CPU-GPU heterogeneous systems [10, 11] , single-host multi-GPU systems [3, 12, 14, 29, 38] , and multi-host multi-GPU systems [8, 15] . In this paper, we present an inter-thread block load balancer that is implemented as an extension to the D-IrGL (Gluon [8] enabled IRGL [28] ) compiler framework and uses CuSP [13] graph partitioner for efficiency.
Load Balancing Strategies for Graph Processing
There have been two main directions in which prior work has focused on load balancing: node-based task-distribution [23] and edge-based task-distribution [32] . Considering the architecture of the GPUs, prior work [22, 28] has considered distributing the nodes using a hierarchical load balancing strategy to to threads, warps or CTA (cooperative thread array), depending on the size of the neighbors -this scheme is called the TWC scheme. As discussed in Section 3, these schemes do not have good mechanisms for load balancing at the thread block level, leading to performance loss, especially for power-law graphs where the imbalance among the degrees of the nodes is very high and the number of active nodes keep changing. Gunrock [29] decides to uses intra thread block (TWC) or the inter-thread block based scheme to do load-balancing, but this decision is made at the pre-processing stage and is not dynamic. Similarly, Enterprise [18] processes vertices with extremely large degree using all CTAs on the GPU to alleviate possible thread block imbalance. However, Enterprise is explored only for the BFS kernel and launches different kernels depending on the degree of the vertex. In contrast, we present a dynamic scheme to adaptively perform load-balancing across thread blocks. Chen et al. [6] present a scheme to do task based dynamic load-balancing on both single-and multi-GPU systems. In contrast, we present a scheme to balance the load for programs in IrGL.
Load Balancing Graph Applications
There have been many works that try to speedup and load balance individual applications, both for distributed and shared memory architectures [4, 9, 19, 20, 25, 33, 37] . The main point is that they present many techniques that are specific to the application under considerations. In contrast, in this paper, we present techniques to do dynamic load balancing that are general in nature and can be applied on any graph analytical application on GPUs.
Conclusion
This paper describes an adaptive load balancing mechanism to address the thread block load imbalance in GPUs. It follows the inspector-executor model to detect the presence of thread block load imbalance and distribute the load equally among the thread blocks efficiently at runtime. We implemented our strategy in the IrGL compiler and evaluated its effectiveness on several graph applications and inputs using up to 16 GPUs. The results show that our approach helps in improving performance of applications by 1.5× on average for inputs that suffer from load imbalance when compared to the previous state-of-the-art load-balancing schemes.
