Abstract-We present the results obtained by using an evolution of our CUDA-based solution for the exploration, via a breadth first search, of large graphs. This latest version exploits at its best the features of the Kepler architecture and relies on a combination of techniques to reduce both the number of communications among the GPUs and the amount of exchanged data. The final result is a code that can visit more than 800 billion edges in a second by using a cluster equipped with 4,096 Tesla K20X GPUs.
Ç

INTRODUCTION
M ANY graph algorithms perform poorly running in parallel, due to their lack of arithmetic intensity and spatial locality. A large fraction of the running time is spent accessing memory and, on distributed systems like high performance computing clusters, in communication among computational nodes.
A number of studies have focused on the breadth first search (BFS) algorithm. The BFS is a fundamental step for solving basic problems on graphs, such as finding the diameter or testing a graph for bi-partiteness, as well as more sophisticated ones, such as finding community structures in networks or computing the maximum-flow/minimum-cut, problems that may have an immediate practical utility. Strategies for the optimization of the BFS strongly depend on the properties of the analyzed graph. For example, graphs with large diameters and regular structures, like those used in physical simulations, present fewer difficulties compared to "real-world" graphs that can have short diameters and skewed degree distributions. In the latter case, a small fraction of vertices have a very high degree, whereas most have only a few connections. This characteristic complicates the load-balancing among tasks in a parallel implementation. Nevertheless, many authors have demonstrated that it is possible to realize high performance parallel and distributed BFS [1] , [2] , [3] , [4] , [5] , [6] , [7] .
In the recent past we presented two codes [8] , [9] able to explore very large graphs by using a cluster of GPUs. In the beginning, we proposed a new method to map CUDA threads to data by means of a prefix-sum and a binary search operation. Such mapping achieves a perfect loadbalancing: at each level of a BFS one thread is associated with one child of the vertices in the current BFS queue or, in other words, one thread is associated to one vertex to visit. Then we modified the compressed sparse row data structure by adding a new array that allows for a faster and complete filtering of already visited edges. Moreover, we reduced the data exchanged by sending the predecessors of the visited vertices only in the end of the BFS.
In the present work we further extend our work in three directions: i) a new approach for the local part of the search that exploits the improvement in the efficiency of the atomic operations provided by the Nvidia Kepler architecture; ii) a different partitioning of the graph among GPUs that is now based on a 2D decomposition of the adjacency matrix representing the graph [10] ; iii) an optimization to reduce the size of exchanged messages that relies on the use of a bitmap [11] . The first item improves the performance on each GPU whereas the second one improves the scalability by leveraging a communication pattern that does not require an all-toall data exchange. The last optimization makes possibile a dramatic reduction in the size of exchanged messages. This change halves, by itself, the total execution time. The combination of the three enhancements provides a significant advantage with respect to our previous solution: the number of traversed edges per second (TEPS) eight times greater for large graphs that require at least 1,024 GPUs.
The paper is organized as follows: in Section 2, we provide background information on distributed BFS and briefly discuss our previous solutions presented in [8] and [9] . Our current work is presented in Section 3; Section 4 reports a comparison with our previous work and the results obtained by the new implementation. In Section 5 we describe the bitmap-based solution that allows the reduction of the size of messages. In Section 6 we briefly review some of the techniques presented in related works. Finally, Section 7 concludes the work with a perspective on future activities.
BACKGROUND ON PARALLEL DISTRIBUTED BFS 2.1 Parallel Distributed BFS on Multi-GPU Systems
In a distributed memory implementation, the graph is partitioned among the computing nodes by assigning to each one a subset of the original vertices and edges. The search is performed in parallel, starting from the processor owning the root vertex. At each step, processors handling one or more frontier vertices follow the edges connected to them to identify unvisited neighbors. Newly discovered vertices are then exchanged in order to notify their owners and the next iteration begins. The search stops when the connected component containing the root vertex has been completely visited.
The partitioning strategy used to distribute the graph is a crucial part of the process because it determines the load balancing among computing nodes and their communication pattern. As reported by many authors, communications represent the major bottleneck of a distributed BFS [5] , [6] , [10] . For what concerns computations, using parallel architectures such as GPUs, introduces a second level of parallelism that exploits a shared memory approach for the local processing of the graph.
For a better understanding of the implementation proposed in the present work, we briefly review the core features of our previous work (hereafter referred to as the original work/code). We implemented [8] a parallel distributed BFS for multi-GPU systems based on a simple partitioning of the input graph where vertices were assigned to processors by using a modulo rule. Such partitioning resulted in a good load balancing among the processors but required the processors to exchange data with, potentially, every other processor in the pool. This aspect limited the scalability of the code beyond 1; 024 GPUs. As to the local graph processing, in our original code we parallelized the frontier expansion by using a GPU thread per edge connected to the frontier. In that way each thread is in charge of only one edge and the whole next level frontier set can be processed in parallel. We achieve an ideal load-balancing, by devising a technique to map threads to data that combines an exclusive scan operation and a binary search function.
The distributed implementation requires the data to be correctly arranged before messages can be exchanged among computing nodes: vertices reached from the frontier must be grouped by their owners. Moreover, many vertices are usually reached from different frontier vertices [3] , [4] , [8] , therefore it is important to remove duplicates before data transfers in order to reduce communication overhead. The removal of duplicates and the grouping of vertices can be implemented in a straightforward way by using the atomic operations offered by CUDA. However, at the time of our original work, atomics were quite penalizing so we implemented the two operations by supporting benign race conditions using an integer map and a parallel compact primitive [9] . The scalability of the code, however, was still limited to 1,024 nodes.
2D Partitioning
Different partitioning strategies can be employed to reduce the number of communications such as those presented in [5] , [10] , [11] . Hereafter we recall the 2D partitioning scheme introduced by Yoo et al. [10] where the computing nodes are arranged as a logical grid with R rows and C columns and mapped onto the adjacency matrix A NÂN (partitioning it into blocks of edges). The processor grid is mapped once horizontally and C times vertically thus dividing the columns in C blocks and the rows in RC blocks, as shown in Fig. 1 .
Processor P ij handles all the edges in the blocks ðmR þ i; jÞ, with m ¼ 0; . . . ; C À 1. Vertices are divided into RC blocks and processor P ij handles the block jR þ i. Considering the edge lists represented along the columns of the adjacency matrix, this partitioning is such that: (i) the edge lists of the vertices handled by each processor are partitioned among the processors in the same grid column; (ii) for each edge, the processor in charge of the destination vertex is in the same grid row of the edge owner. With such decomposition, each step of the BFS requires two communication phases, called expand and fold. The first one involves the processors in the same grid column whereas the second those in the same grid row. Algorithm 1 shows a pseudo code for a parallel BFS with 2D decomposition. At the beginning of each step, each processor has its own subset of the frontier set of vertices (initially only the root vertex). The search entails the scanning of the edge lists of all the frontier vertices so, due to property (i), each processor gathers the frontier sets of vertices from the other processors in the same processorcolumn (expand exchange, line 13). For each gathered vertex u, all outgoing edges ðu; v i Þ are identified and sent to the processors in charge of each v i (lines [14] [15] [16] [17] [18] [19] . For property (ii) this exchange only involves processors in the same processor-row (fold exchange). On the receiving side, edges whose target vertex has already been visited are filtered out (line 22). Those that remain have their data updated (lines [23] [24] [25] and are added to the frontier Fig. 1 . Two-dimensional partitioning of an adjacency matrix A with an R Â C processor grid. The matrix is divided into C consecutive groups of R Â C blocks of edges along the vertical direction. Each block is a N=ðRCÞ Â N=C sub-matrix of A. Different groups of blocks are colored with different yellow gradients. For each block, the column of processors owning the corresponding vertices (row indexes) are shown in blue. On the left part, it is shown the sequence of blocks, from top to bottom, assigned to the generic processor ðp i ; p j Þ. The colors correspond to the blocks assigned to the processor in each group of blocks. set used for the next level expansion (line 26). The search ends when the frontier of each processor is empty, meaning that the whole connected component containing the root vertex has been visited. The main advantage of the 2D partitioning is a reduction of the number of communications. If P is the number of processors, our first implementation required OðP Þ data transfers at each step whereas the 2D partitioning only requires 2 Â Oð ffiffiffiffi P p Þ communications.
BFS ON A MULTI-GPU SYSTEM WITH 2D PARTITIONING
Our work loosely follows the Graph500 [12] benchmark specifications. The benchmark requires to perform multiple BFS operations on the same R-MAT [13] graph, starting each search from a random root vertex. It poses no constraint about the kind of data structures used to represent the graph but it requires the vertices to be represented using at least 48 bits. Since the focus of the present work is the evaluation of the new local part of the search and the 2D partitioning with respect to our original code, we do not strictly adhere to the Graph500 specifications and represent vertices with 32 bits 1 (more than enough to index the highest number of vertices storable in the memory of current GPUs).
Local Graph Data Structure
Each processor stores its part of the adjacency matrix as a ðN=RÞ Â ðN=CÞ local matrix (Fig. 1) where blocks of edges are stored in the same row order as in the global matrix. This allows us to map global indexes to local indexes so that global row v is mapped to the same local row for every processor in the same processor-row of the owner of vertex v. In a similar way, global column u is mapped to the same local column for every processor in the same processorcolumn handling the adjacency list of vertex u.
Local adjacency matrices are sparse as is the global matrix, so they are stored in compressed form. Since they are accessed by reading an entire column for each vertex of the frontier, we used a representation that is efficient for column access, the compressed sparse column (CSC) format. As a consequence, processors may scan adjacency lists by accessing blocks of consecutive memory locations during the frontier expansion (one block for each vertex in the frontier of the same processor-column). Since the non-zeroes entries of an adjacency matrix have all the same value, the CSC is represented with only two arrays: the column offset array col and the row index array row.
Predecessors and BFS levels are stored in arrays of size N=R (number of rows of the local matrix). For the frontier, since it can only contain local vertices, we use an array of size N=ðRCÞ and the information about visited vertices is stored in a bitmap with d N=R 32 e 32-bit words.
Parallel Multi-GPU Implementation
The code generates a graph by using the R-MAT generator provided by the reference code available from the Graph500 website 2 . Then the graph is partitioned as described in Section 2.2. The BFS phase is performed entirely by the GPUs with the CPUs acting as network coprocessors to assist in data transfers.
Algorithm 2 shows the code scheme of the BFS. Every processor starts with the level, predecessor and bitmap arrays set to default values (lines 1-4). The owner of the root vertex copies its local column index into the frontier array and sets its level, predecessor and visited bit (lines 5-10). All the data structures are indexed by using local indexes. At the beginning of each step of the visit, the expand communication is performed (line 13). Every processor exchanges its frontier with the other processors in the same processor-column and stores the received vertices in the all_front array. Note that in this phase, since processors send subsets of their own vertices, only disjoint sets are exchanged. After the exchange, in the expand_frontier routine, processors in each column collectively scan the whole adjacency lists of the vertices in 1. This choice does not limit the total size of graphs. They are generated or read by using 64 bits per vertex. The 32-bit representation is used to store local partitions.
2. http://www.graph500.org/referencecode all_front (line 14). For each vertex, its unvisited neighbors are set as visited and the vertex itself is set as their predecessor. For neighbors owned locally the level is also set. The routine returns the unvisited neighbors, divided by the processor-column of the owner, inside an array with C blocks, dst_verts. After the frontier expansion, neighbors local to the processor are moved from the jth block of the dst_verts array into the front array (lines [15] [16] and the fold communication phase is performed. Unvisited neighbors just discovered are sent to their owners, located in the same processor-row, and received vertices are returned in the int_verts array (line 17). In order to reduce the amount of exchanged data, in the fold phase we send only destination vertices, instead of edges. Moreover, to avoid sending more than once the same vertex to its owner, we use a bitmap to keep track of local vertices that have been visited.
Finally, the update_frontier routine selects, among the received vertices, those that have not been visited yet and sets their level and bitmap bit (line 18). Returned vertices are then appended to the new frontier and the cycle exit condition is evaluated (line 19). The BFS continues as long as at least one processor has a non-empty frontier at the end of the cycle.
Communications
The expand and fold communication phases are implemented with point-to-point MPI primitives and make use of the following scheme: start send operations; wait for completion of receive operations posted in the previous round; post non-blocking receive operations for the next round that hides the latency of the receive operations in the BFS round and avoids possible deadlocks due to the lack of receive buffers.
Since the communication involves only local indexes, the expand and fold phases are carried out by exchanging 32-bit words.
Frontier Expansion
After the local frontiers have been gathered, the frontier expansion phase starts. This phase has a workload proportional to the sum of the degrees of the vertices in the current frontier and is memory bandwidth bound with an irregular memory access pattern. It is characterized by a high degree of intrinsic parallelism, as each edge originating from the frontier can be processed independently from the others. There could be, however, groups of edges leading to the same vertex. In those cases, only one edge per group should be processed. In our code we use a thread for each edge connected to the frontier and this allows the selection of the single edge to be followed in, at least, two ways: either by taking advantage of benign race conditions or by using synchronization mechanisms such as atomic operations.
In our previous work, we chose to avoid atomic operations because they were quite penalizing with the Nvidia Fermi architecture, available at that time. The current architecture, code-named Kepler, introduced many enhancements including a significant performance improvement of the atomic operations (almost an order of magnitude with respect to Fermi GPUs). As a consequence, we decided to use atomics in the new 2D code. The resulting code is much simpler because there is no longer the need to support the benign race conditions on which the original code relied [8] .
Atomics are used to access the bitmap in both the frontier expansion and update phases and to group outgoing vertices based on destination processors.
At the beginning of the expansion phase, the vertices in the blocks of the all_front array are copied in a contiguous block of device memory and then the corresponding columns of the CSC matrix are processed. We employ a mapping between data and threads similar to the one used in our previous code [8] where a CUDA thread is used for every edge originating from the current frontier. The mapping allows to achieve an optimal load balancing as edge data are evenly distributed among the GPU threads. Given the frontier vertices u 0 ; u 1 ; u 2 ; . . . ; u nÀ1 with degrees d 0 ; d 1 ; d 2 ; . . . ; d nÀ1 and adjacency lists:
thread i is mapped onto the jth edge connected to vertex u k :
After vertices in all_front are copied to device memory, we use their degrees to compute a cumulative degree array (cumul[k]) by means of an exclusive scan operation. The frontier expansion kernel is then launched with r ¼ P nÀ1 s¼0 d s threads. Each thread processes one local edge ðu; vÞ originating from the frontier, i.e., one element of column u of the CSC (see Fig. 2 ). Algorithm 3 shows the pseudo code for the frontier expansion kernel. The source vertex u is found by performing a binary search for the greatest index k such that cumul[k] is less than or equal to the global thread id (line 2) and by accessing the kth location of the frontier array (line 3) 3 . The column offset of vertex u is found in the uth element of the CSC column index array col. The row offset of the destination vertex is computed by subtracting from the thread id the degrees of the vertices preceding u in the frontier array. Finally, the local id v of the destination vertex is found by accessing the CSC row index array at the location corresponding to the sum of both the column and the row offsets (line 4).
The status of the neighbor v is then checked (lines [5] [6] . If the vertex has already been visited, then the thread returns. Otherwise, it is marked as visited with an atomicOr operation (line 7). That instruction returns the value of the input word before the assignment. By checking the vertex bit in the return value, it is possible to identify the first of different threads handling edges leading to the same vertex (assigned to different columns) that set the visited bit (line 8). Only the first thread proceeds, the others return. The bitmap is used not only for the vertices owned locally but also for those, belonging to other processors (in the same processor-row), that are reachable from local edges. In other words, the bitmap has an entry for each row of the CSC matrix. This makes it possible to send external vertices just once in the fold exchanges because those vertices are sent (and marked in the bitmap) only when they are reached for the first time in the frontier expansion phase.
Each unvisited neighbor is added to the array associated to the processor-column of its owner processor in preparation for the subsequent fold exchange phase (lines 12 and 14). The processor-column is computed based on the local index of the neighbor (line 9) and the counter for such column is atomically incremented to account for multiple threads appending for the same processor (line 10). If the neighbor belongs to the local processor its level is also set (line 15). Finally, regardless of the owner, the neighbor predecessor is set (line 17).
After the kernel has completed, the array containing the discovered vertices (grouped by owner) is copied to host memory in preparation for the fold communication phase.
The bitmap, level and predecessor arrays are never copied back to the host because they are used only by GPU kernels.
Optimizations
In the current CUDA implementation, we introduced an optimization to mitigate the cost of the binary searches Threads to data mapping assigning one thread per edge. From top to bottom, the first array represents the local column indexes corresponding to frontier vertices. The second contains the local degrees of each vertex, i.e., the number of non-zeroes per column (23 in total), and the third one is the cumulative degree array. The CUDA grid is launched with (at least) 23 threads. By searching the cumulative array for the greatest entry less than its global id, each thread finds its frontier vertex. Finally, threads mapped to the same frontier vertex process its edge list (a CSC column).
3. Since consecutive threads search for consecutive values, sequences of threads mapped onto the same adjacency lists access the same memory locations during the binary search. Different sequences share an initial part of the search path whose length depends on their relative distance. As a consequence, the memory access pattern is quite efficient since threads in each sequence perform broadcast memory accesses that are served with single memory transactions, one for each search jump.
performed at the beginning of the frontier expansion kernel. Since threads search for their global index in the cumulative array, non-decreasing indexes are returned to consecutive threads: binsearch maxleðcumul; gidÞ binsearch maxleðcumul; gid þ 1Þ where gid is the global thread identifier that is equal to threadIdx:x þ blockIdx:x Ã blockDim:x:
That relation makes possible to scan the columns with fewer threads than the number of edges originating from the frontier without increasing the number of binary searches performed per thread. This is done by assigning each thread to a group of consecutive elements in the CSC columns. If not enough elements are available in the column, the group overlaps on the next column. Then, each thread performs a binary search only for its first edge to obtain a base column index. The indexes for subsequent edges tid þ i are found by incrementing the base index until ðtid þ iÞ ! cumul½base þ 1. The overhead introduced by these linear searches is very limited because the majority of edge groups are contained in a single column and so the base column index is never incremented.
Among other optimizations, there is the passage of readonly arrays to kernels with the modifiers const/restrict. In this way the compiler uses the read-only data cache load functions to access them. Moreover, in order to increase instruction-level parallelism, edges assigned to a thread are not processed sequentially, one after the other. Instead, they are processed together in an interleaved way by replacing each kernel instruction with a sequence of independent instructions, one for each edge. In this way, since threads do not stall on independent instructions, each group is executed in a way that hides arithmetic and memory latencies. Fig. 3 reports the total time spent in the frontier expansion kernel to perform single-GPU BFS operations with a variable number of edges per thread on an R-MAT graph with scale¼21 . The maximum performance is obtained by assigning four edges per thread with a performance gain of $40 percent with respect to the case with a thread per neighbor.
Frontier Update
In the frontier update phase, local vertices remotely discovered and received during the fold exchange, are processed to find those that are locally unvisited, i.e., whose visited bit is not set.
Since this phase processes the vertices received, instead of their adjacency lists, fewer computational resources are required compared to those necessary in the frontier expansion phase.
As in the expand communication phase, vertices returned by the fold exchange are grouped by sender in different blocks and thus, before starting the processing, they are copied to device memory in contiguous locations.
After the copy, the frontier update kernel is launched with a thread per vertex. Each unvisited vertex has its level and predecessor set and it is added to the output array. As in the frontier expansion kernel, atomic operations are used to synchronize both the accesses to the bitmap and the write operations in the output buffer. For what concerns the predecessor, since in the fold phase we do not transmit source vertices, the thread, lacking this information, stores the sender processor-column in the predecessor array (the sender saved the predecessor).
After the kernel has completed, the output array is copied to host memory and it is appended to the frontier array.
RESULTS
The performances hereafter reported have been obtained on the Piz Daint system belonging to the Centro Svizzero di Calcolo Scientifico. Piz Daint is a hybrid Cray XC30 system with 5,272 computing nodes interconnected by an Aries network with Dragonfly topology. Each node is powered by both an Intel Xeon E5-2670 CPU and a NVIDIA Tesla K20X GPU and is equipped with 32 GB of DDR3 host memory and 6 GB of GDDR5 device (GPU) memory. The code has been built with the GNU C compiler version 4.8.2, CUDA C compiler version 6.5 and Cray MPICH version 7.2.2. For the GPU implementation of the exclusive scan we used the CUDA Thrust library [14] , a well-known, high performance template library implementing a rich collection of data parallel primitives.
The code uses 32-bit data structures to represent the graph because the memory available on a single GPU is not sufficient to hold 2 32 or more edges and it transfers 32-bit words during the communications because the frontier expansion/update and expand/fold communications work exclusively on local vertices.
We use 64-bit data only for graph generation/read and partitioning. For the generation of R-MAT graphs we use the make_graph routine found in the reference code for the Graph500 benchmark. The routine returns a directed graph with 2 scale vertices and $edge factor Â 2 scale edges. We turn the graph undirected by adding, for each edge, its opposite. Most of the following results are expressed in TEPS, a performance metric defined by the Graph500 group as the number of input edge tuples within the component traversed by the search, divided by the time required for the BFS to complete. The tests have been performed with both R-MAT generated and real-world graphs. Table 1 reports the processor-grid configurations, scales and edge factors used for R-MAT graphs.
In the following we report the times required by i) the expand exchange; ii) frontier expansion; iii) fold exchange; and iv) frontier update phases. We do not measure the time required by any data exchange carried out after the completion of the BFS (e.g., the exchange of predecessors of the vertices that have been visited).
Comparison between the 2D Code and the Original Code
In Fig. 4 To highlight the impact of improvements, in Table 2 , we report the performance of the 2D code obtained on a single GPU with both the Fermi and Kepler architectures. The code runs more than twice as fast on the Kepler GPU. Fig. 5 shows a comparison of frontier expansion times. Starting from 16 GPUs, the frontier expansion performed with atomic operations is two times faster than our original approach supporting benign race conditions. These results are quite interesting because in the recent past several authors (including us) proposed different techniques aimed at reducing the use of GPU atomic operations that were blamed for poor performance ( [2] , [3] , [15] , [16] ). Fig. 6 shows the weak scaling plot obtained by using a number of GPUs ranging from 1 to 4,096. In order to maintain a fixed problem size per GPU, the graph scale has been increased by 1 for each doubling of the number of processors, starting from scale 21 whereas the edge factor had been set equal to 16 (we set the scale equal to 21 because it is the maximum value that can be used in the 1D implementation, against which we compare many results). For each scale, we report the harmonic mean of the TEPS measured in 64 consecutive BFS operations (a different root vertex is chosen at random for each search). The code scales up to 4,096 GPUs where the performance reaches $400 GTEPS on a graph with 2 33 vertices and $280 billions of directed edges. Fig. 7 shows the strong scaling plot obtained by visiting a fixed R-MAT graph with scale 25 and edge factor 16 with an increasing number of GPUs. Here we used the maximum scale that fits on a single GPU, to have more work to be done when the number of processes increases. The code scales linearly up to 16 GPUs. With 32 processors, the cost of data transfers becomes comparable to the cost of computations and with more GPUs the advantage becomes marginal and the efficiency drops. Fig. 8 reports the compute and transfer times of the code. The compute time is the aggregate time required by the frontier expansion and update phases. Since the problem size per processor is fixed, the amount of data received by any process (remote frontiers and discovered vertices) grows with the row and column sizes of the processor grid. For that reason, when the number of processors grows also the amount of data to be exchanged increases, and so it does the execution time. Transfer time includes the time spent in the expand and fold exchanges and in the allreduce operation at the end of the BFS loop. Up to 2,048 GPUs, data transfers require less than half of the total BFS time. With 4,096 GPUs the communications become dominant and take almost 60 percent of the total search time. Above 4,096 GPUs, we do not expect the code to scale anymore but we did not test it directly.
Performance Analysis of the 2D Code
In Fig. 9 it is reported the timing breakdown of the four steps performed in the BFS round: expand exchange, frontier expansion, fold exchange and frontier update. Times are summed across the BFS steps and averaged across 64 operations. By increasing the number of processors, the time required by the frontier expansion phase reduces and, as expected, the weight of the communications increases, becoming dominant with 4,096 GPUs. The time required by the frontier update phase is always much smaller than the time spent in the frontier expansion. With any configuration, it accounts for less than 10 percent of the total time required by the four steps. Finally, Table 3 reports the results obtained traversing real-world graphs obtained from the Stanford Large Network Dataset Collection [17] .
Among them we selected undirected graphs with the highest number of edges.
LATEST COMMUNICATION OPTIMIZATION
The results presented in Section 4 (Figs. 8, 9) show that the communication times steadily grow and eventually exceed the compute time using 4,096 nodes. In order to reduce communications cost we reduced the number of vertices exchanged by using a bitmap to keep track of the visited status of both local and non-local vertices. To further reduce the size of the messages we decided to fit to our code an approach proposed in [11] , [18] consisting in using bitmaps also for data transfers. The idea is that when the size of vertices lists to be sent exceeds, in bits, the number of indices local to the receiving process, then it is more convenient to send a bitmap with the bits corresponding to the outgoing vertices set. This technique reduces the communication times by limiting the data transmitted to a fixed amount whenever he number of vertices to be transferred would grow over a given threshold. Obviously, that happens in the most expensive steps of the visit. Assuming that the indices to be transferred are 32-bit words and that the size of the subgraph per processor is fixed and equal to SCALE ¼ 21 we have that the bitmap size is, regardless of the number of processors:
With respect to the number of vertices, the threshold T over which it is more convenient to use a bitmap is
Whenever a list with more than T vertices needs to be sent, just 256 KB of data are actually sent, independently of the size of the list (that can grow up to 8 MB in the example above). This chance applies to both the frontier expansion and fold phases.
In order to maximize the benefits, it is necessary to use the bitmaps only when the number of vertices in the list exceeds the threshold T .
We implemented a simple mechanism to dynamically switch between the two representations, using the most convenient alternative for the two transfer procedures at each step of the visit. It is worth noting that with R-MAT graphs the switching point happens during the BFS step, between the expand and the fold phases. This is in accordance to the analysis of the expansion and contraction of the frontier reported by previous studies [3] , [4] , [19] .
For what concerns the packing/unpacking of the vertices lists into/from bitmaps, we used two different approaches. The packing has been implemented into specialized versions of the frontier expansion and update kernels that directly produce bitmaps instead of lists. For the unpacking, we developed specific warp-centric kernels based on the Kepler's shuffle instructions. The overhead introduced by these kernels is negligible with respect to the frontier expansion/ update kernels and largely compensated by the dramatic reduction of the data-exchange times that it makes possible.
The impact of this optimization on the overall performance is remarkable and leads to a net 2x speed-up in the number of TEPS as shown in Fig. 10 . Fig. 11 reports the aggregated computation and communication times of the code using the bitmaps. The drop of communication times from $200 to $40 ms with 4,096 GPUs is an indication that the code may scale even on larger GPU clusters.
RELATED WORKS
In recent years high performance implementations of graph algorithms have attracted much attention. Several works tackled the issues related to both shared memory and distributed memory systems. On the single GPU different solutions have been proposed to address workload imbalance among threads. The naive assignment, in which each thread is assigned to one element of the BFS queue, may determine a dramatic unbalance and poor performance [8] . It is also possible to assign one thread to each vertex of the graph but, as showed by Harish and Narayanan [20] , the overhead of having a large number of unutilized threads results in poor performance. To solve this problem, Hong et al. [2] , [21] proposed a warp centric programming model. In their implementation each warp is responsible of a subset of the vertices in the BFS queue. Another solution has been proposed in the work of Merrill et al. [3] . They assigned a chunk of data to a CTA (a CUDA block). The CTA works in parallel to inspect the vertices in its chunk.
We devised an original data mapping described in [8] . Then, to reduce the work, we used an integer map to keep track of visited vertices. Agarwal et al. [1] , first introduced this optimization using a bitmap that has been used in almost all subsequent works.
On distributed memory systems many recent works rely on a linear algebra based representation of graph algorithms [5] , [18] , [22] , [23] .
Ueno and Suzumura [22] presented a hybrid CPU-GPU implementation of the Graph500 benchmark, using the 2D partitioning proposed by Yoo et al. [10] . Their implementation uses the technique introduced by Merrill et al. [3] to create the edge frontier and resort to a novel compression technique to shrink the size of messages. They also implemented a sophisticated method to overlap communication and computation in order to reduce the working memory size of the GPUs.
A completely different algorithm that uses a directionoptimizing approach has been proposed by Beamer et al. [4] and subsequently extended for a cluster of CPUs [23] . The direction-optimizing method switches between the top-down and the bottom-up traversal. The bottom-up search dramatically reduces the number of traversed edges during the most expensive computational levels of the BFS by searching the parent in the frontier starting from the subset of unvisited vertices. The parent search implies a serialization in order to minimize the required work. However, on shared memory systems (including single GPUs [19] , [24] ) the BFS performance increases significantly.
In a recent work, Checconi and Petrini [7] demonstrate the chance of having an effective implementation of a distributed direction-optimizing approach on the BlueGene/P by using a 1D partitioning. That partitioning simplifies the parallelization of the bottom-up algorithm but it may require a significant increase in the number of communications. Their results show that the combination of the underlaying architecture and the SPI interface is well suited to the purpose. The authors report that replacing the SPI with MPI incurs a loss of performance by a factor of nearly 5 although the MPI-based implementation cannot be considered optimal. This suggests that the scalability of a distributed implementation may be worse on different network architectures.
The work presented by Hiragushi and Takahashi [25] achieves an outstanding speed-up for the implementation of a direction-optimizing BFS on a single GPU, nearly six times faster than more recent implementations [19] , [24] . However, it also shows a very poor scalability on a multiGPUs system. Satish et al. [11] implemented a distributed BFS with 1D partitioning that shows remarkable scaling properties up to 1,024 nodes. They devised a technique to delay the exchange of predecessors. We implemented the same technique independently [9] and use it in the present work.
Comparisons with State-of-the-Art Implementations on Shared Memory and Distributed Memory Systems
In Tables 4 and 5 we report our results along with some state-of-the-art implementations on both shared and distributed memory systems (respectively). The purpose is to put our work in perspective with state-of-the-art BFS implementations more than provide a detailed and exact performance comparison. It is apparent from Table 4 that our implementation has good performance on the single GPU even when compared to CPU shared memory implementations. This is mainly due to the efficiency of our data mapping and atomic primitives of the Kepler architecture. Our results with 4,096 GPUs (830 GTEPS) outperform most distributed implementations and, as far as we know, our code performs better than any other distributed BFS implementation running on a cluster of GPUs.
CONCLUSIONS
We presented the performance results of our new parallel code for distributed BFS operations on large graphs. The code employs a 2D partitioning of the adjacency matrix for efficient communications and uses CUDA to accelerate local computations. The computational core is characterized by optimal load balancing among GPU threads, taking advantage of the efficient atomic operations of the Kepler architecture.
We further enhanced the code by using a bitmap to reduce the data exchanged among processors during the most expensive BFS steps. This optimization, improved the performance up to a factor 2 with 4,096 GPUs.
The result is a code that shows good scalability up to 4; 096 Nvidia K20X GPUs, visiting 830 billion edges per second of an R-MAT graph with 2 33 vertices and $280 billions of directed edges. We compared the performance of the new code with that of the original one, which relied on a combination of parallel primitives in place of atomic operations, on the same cluster of GPUs. The 2D code is up to eight times faster on R-MAT graphs of the same size. For all the implementations the graph is of type R-MAT.
Mauro Bisson received the PhD degree in computer science from the "Sapienza" University of Rome in 2011. Since then, he has been postdoctoral fellow at various research institutions in the United States and in Italy where he applied GPU computing to address problems in fluid dynamics, cryptography, and graph processing. He has recently joined NVIDIA as a developer technology engineer.
Massimo Bernaschi has been for 10 years with IBM working in the field of parallel and distributed computing. He is currently with the National Research Council of Italy (CNR) as a chief technology officer of the Institute for Computing Applications. He is also an adjunct professor of computer science at the "Sapienza" University of Rome.
Enrico Mastrostefano received the degree in physics and the PhD degree in computer science from the "Sapienza" University of Rome. His research interests include numerical simulation and high-performance computing.
" For more information on this or any other computing topic, please visit our Digital Library at www.computer.org/publications/dlib.
