This paper presents a blocked united algorithm for the allpairs shortest paths (APSP) problem. This algorithm simultaneously computes both the shortest-path distance matrix and the shortest-path construction matrix for a graph. It is designed for a high-speed APSP solution on hybrid CPU-GPU systems. In our implementation, two most compute intensive parts of the algorithm are performed on the GPU. The first part is to solve the APSP sub-problem for a block of sub-matrices, and the other part is a matrix-matrix "multiplication" for the APSP problem. Moreover, the amount of data communication between CPU (host) memory and GPU memory is reduced by reusing blocks once sent to the GPU. When a problem size (the number of vertices in a graph) is large enough compared to a block size, our implementation of the blocked algorithm requires CPU GPU exchanging of three blocks during a block computation on the GPU. We measured the performance of the algorithm implementation on two different CPU-GPU systems. A system containing an Intel Sandy Bridge CPU (Core i7 2600K) and an AMD Cayman GPU (Radeon HD 6970) achieves the performance up to 1.1 TFlop/s in a single precision.
Introduction
The all-pairs shortest paths (APSP) problem is to find the shortest paths between all-pairs of vertices in a weighted graph [1] , [2] . The problem is one of the most fundamental graph problems, and there are applications of the APSP problem in bioinformatics, social networking, traffic routing, etc. The APSP problem is an instance of a general framework in the so called algebraic path problem [3] , [4] which covers several graph, matrix, and language processing problems. It means that acceleration of the APSP problem leads to acceleration of all these problems.
A well-known solution of the APSP problem is to apply the classical Floyd-Warshall (FW) algorithm [5] , [6] . The FW algorithm requires O(n 3 ) operations on O(n 2 ) memory space, where n is the number of vertices in a graph. Blocked algorithms of the FW algorithm were proposed to efficiently utilize hierarchical memory structures of current processors [7] - [9] . For a high-speed solution of the APSP problem, different variants of the blocked algorithms have been implemented on CPUs [10] - [12] , GPUs [13] - [17] , and an FPGA [18] .
The previous studies, however, presented implementations of blocked APSP algorithms only for computing the shortest-path distance matrix. In the APSP problem, construction of a path with the shortest distance is required in many cases. Buluç et al. [14] mentioned a possibility to construct the shortest paths by using a matrix which keeps an intermediate vertex between every pair of vertices. To our best knowledge, there have been no report on implementations of a blocked APSP algorithm for computing both the shortestpath distance matrix and the shortest-path construction matrix. We consider high-speed solutions on GPUs as the most prominent because the massively parallel architecture and the high memory bandwidth of GPUs meet requirements for a fast implementation of blocked APSP algorithms. This paper introduces a blocked united APSP algorithm for computing both matrices at the same time. The presented united algorithm is an extension of the blocked algorithm for a hybrid CPU-GPU system designed in our previous study [17] . The proposed united algorithm can solve the APSP problem of a graph whose required memory size is larger than the capacity of GPU memory. In the algorithm, there are two most compute intensive parts (kernels), both of which run on the GPU. The first kernel is to solve the APSP sub-problem for a block of sub-matrices. The other kernel is a matrix-matrix "multiplication" for the APSP problem. For high utilization of the GPU kernels, we optimized data communication between CPU (host) and GPU. We have evaluated the performance of our APSP implementation on two different systems containing an Intel CPU and an AMD GPU. We show comparison results among our APSP implementations on the CPU-GPU systems and implementations only on a CPU or a GPU.
The rest of this paper is organized as follows. Section 2 describes the APSP problem and the Floyd-Warshall algorithm. Section 3 presents our blocked united algorithm for the APSP problem. Section 4 shows how we implemented the blocked algorithm on the hybrid CPU-GPU systems and benchmark results of the implementation. Section 5 concludes this paper with a mention of future work.
All-Pairs Shortest Paths Problem
Let G = (V, E) be a weighted directed graph with a edgeweight function w : E → R that assigns a real-valued weight to each edge, where V = {1, 2, . . . , n} is a set of n vertices, E ⊆ V × V is a set of edges, and R is a set of real numbers. We use an adjacency matrix representation for the graph G. In an adjacency matrix A = [a i, j ], a i, j represents the corresponding value of an edge (i, j).
Copyright c 2012 The Institute of Electronics, Information and Communication Engineers
The input of the APSP problem is an n × n matrix W = [w i, j ] which contains edge weights of a given graph G, where
Given a weight matrix W, we can compute a weight (or distance) of the shortest paths between all-pairs of vertices. The computed result is an
, where d i, j denotes the shortest-path distance from i to j. We can define each d (k) i, j by the following recurrence equation [1] :
Note that algorithms discussed in this paper allow existence of negative-weight edges in a graph G but do not allow negative-weight cycles.
To find the all-pairs shortest paths, it is required to compute not only the shortest-path distance matrix D but additionally a shortest-path construction matrix C = [c i, j ]. If at least one shortest path exists between a vertex i and a vertex j, then c i, j indicates the highest-numbered intermediate vertex on the shortest path, and otherwise, it is undefined (NULL). Initial values of a construction matrix are all undefined, i.e., all c i, j is defined as the following equation [2] :
The Eqs. (1) and (2) lead to a dynamic-programming solution known as Floyd-Warshall algorithm, which is shown as Algorithm 1. In the following algorithm descriptions, we eliminate superscripts used in the Eqs. (1) and (2) . The FW algorithm contains a three-nested loop which is similar to the standard matrix-matrix multiply-add algorithm, except the FW algorithm has stricter data dependencies such that the outermost k-loop cannot be interchanged with the other two inner loops. This FW Algorithm 1 requires 4n 3 operations (addition, comparison, and two conditional assignments) on 2n 2 data of the two matrices, and our proposed blocked Algorithm 3 described later asymptotically requires the same number of operations.
After computing the shortest-path construction matrix C by solving the APSP problem, the shortest paths between all-pairs can be built. between a given pair of vertices i and j.
To construct the shortest paths, there are other methods such as computing the predecessor matrix described in [1] . However, we consider that computing the construction matrix is more suitable for a high-speed solution on CPU-GPU systems because the computation for construction matrix requires less memory references.
Blocked United Algorithm
This section presents a blocked united algorithm for the APSP problem. This algorithm is based on the blocked algorithm discussed in our previous paper [17] , to which we added the computation of shortest-path construction matrix. The blocked algorithm is designed for a fast APSP solution with high GPU utilization on hybrid CPU-GPU systems.
In the blocked united algorithm, we partition both the n × n distance matrix and construction matrix into blocks of b × b sub-matrices, where b is a blocking factor. For simplicity, we suppose that n is in multiples of b in the following algorithm description. Let us identify a sub-matrix of block index (I, J) by A I,J = [a i, j ], where 1 ≤ I, J ≤ n/b and 1 ≤ i, j ≤ b. Figure 1 shows an example of 12 × 12 matrix with the blocking factor b = 3.
Algorithm 3 shows our blocked united APSP algorithm. In this algorithm, each iteration of the outermost loop essentially performs the same number of operations as b iterations of the Floyd-Warshall Algorithm 1; however, the two algorithms may find different shortest paths with an equal minimum distance because the two algorithms have differ- 
// Phase 2: updating the pivot column blocks
// Phase 3: updating the pivot row blocks
/* Phase 4: updating the remaining (non-pivot) blocks 
the blocked united algorithm itself recursively with a small modification. We use the latter approach and Algorithm 4 shows the blocked algorithm modified for the sub-problem. Like Algorithm 3, Algorithm 4 also has four phases but is followed by a process for merging the computed submatrices. The scalar Floyd-Warshall Algorithm 1 is used in Phase 1 of Algorithm 4. In Phases 2-4 of both blocked Algorithms 3 and 4, every block can be updated by using a matrix-matrix "multiply-add" (MMA) operation designed for the APSP problem. If the computation of path construction matrix is not taken into consideration, the MMA operation is called matrix-matrix multiply-add in min-plus algebra [19] , [20] (or tropical semiring [21] ). The difference between matrix multiply-add in linear algebra and in min-plus algebra is in the type of operations to obtain corresponding result. In linear algebra, we need arithmetic multiplication and addition and an element z i, j of sub-matrix Z is updated with the formula of z i, j ← z i, j + b k=1 x i,k · y k, j . In the minplus algebra, the addition is replaced by minimum operation and the multiplication is replaced by arithmetic addition; thus, an element z i, j is updated with the formula of
The MMA in min-plus algebra can be extended to simultaneously compute the path construction matrix and the distance matrix, and this extended MMA is described in Algorithm 5, which is invoked from the blocked Algorithm 4 for APSP sub-problem. The MMA Algorithm 5 requires four input matrices X, Y, Z, C to produce two output matrices Z, C. In the blocked Algorithm 3, we use another method for the MMA shown in Algorithm 6. In the MMA Algorithm 6, the matrix-matrix "multiplication" in the extended min-plus algebra (MINPLUS) and the matrix-matrix "addition" are separately executed. The MINPLUS operation is performed on a GPU while the matrix-matrix "addition" (minimum) operation is carried out on a CPU. This separation reduces a data sending of two input matrices Z, C to the GPU. As a result, it makes possible to alleviate a bandwidth requirement for data communication between the CPU (host) memory and the GPU memory. 
Implementation and Experimentation

Experimental Environment
We port our program of the blocked APSP Algorithm 3 to two kinds of hybrid systems containing an Intel CPU and an AMD GPU. Table 1 shows the configurations of the two used systems. The prominent difference between the two GPU architectures is that the Cayman GPU uses a four-wide symmetric VLIW (Very Long Instruction Word) instead of a five-wide asymmetric VLIW on the Cypress GPU [22] , in addition to a number of other enhancements for both graphics and more general workloads. The singleprecision peak performance of the Cayman is a little lower than that of the Cypress; nevertheless, the Cayman is considered to deliver higher performance in most of applications because of the higher memory bandwidth. Both GPUs also support double-precision floating-point instructions with the peak performance of 676 GFlop/s in the Cayman and 544 GFlop/s in the Cypress though, in this study, we only deal with a single-precision. The CPU (host) and the GPU are connected through 16 lanes of PCI-Express 2.0 buses whose aggregated peak bandwidth is 8 GB/s. Both systems run on Ubuntu 10.04, and the Linux kernel version is 2.6.32-37 in System A and 2.6.32-35 in System B (see Table 1 ). The installed display driver is AMD Catalyst 11.11. We use gcc 4.6.2 compiler with -O2 optimization option for program compilation and AMD Accelerated Parallel Processing (APP) SDK v2.5 for GPU programming. Input graph data were generated by using R-mat random graph generator in GTgraph [23] with the average vertex degree of 20. Note that the performance of our im-plementations is not sensitive to graph sparsity. All performance values presented in this paper are average of ten times measurements.
GPU Kernels
The blocked APSP Algorithm 3 requires two different kernels which run on the GPU. The first one is a kernel for the sub-blocked APSP Algorithm 4 (SubBlockedAPSP). The second one is a kernel for the matrix-matrix "multiplication" (MINPLUS in Algorithm 6).
Note that the performance in Flop/s is calculated by the formula of
where n is the problem size and the running time does not include the time for GPU initialization. In addition, the running time of the GPU kernels does not include the communication time for matrix data between the host and the GPU. The GPU kernels for the APSP problem cannot be implemented with high-performance FMA (fused multiplyadd) instructions which perform two floating-point operations (multiplication and addition) per clock cycle. In our case of the APSP problem, four instructions are needed for four operations. Thus, although the peak FMA-performance of both GPUs is around 2700 GFlop/s, we can expect at most half of the peak performance for the APSP problem, i.e., 1352 GFlop/s on the Cayman and 1360 GFlop/s on the Cypress.
Kernel for Sub-Blocked APSP Algorithm
The kernel for the sub-blocked APSP Algorithm 4 is to compute the most intensive part (lines 4-16) in Algorithm 4 on the GPU. We have developed the kernel in OpenCL (Open Computing Language) 1.1 [22] , [24] . Note that the merge operation (lines [17] [18] [19] [20] is performed on the CPU. Separating the merging computation from the whole computation of the algorithm has a positive effect for increasing performance. The separation reduces data communication time because two b × b matrices D and C are not needed to be sent to the GPU. Algorithm 4 introduces another blocking factor b s (≤ b). The size of this blocking factor highly influences the performance of the kernel. Phase 1 of the algorithm is to apply the Floyd-Warshall Algorithm 1 for a b s × b s sub-matrix. To implement the FW algorithm on the GPU, a synchronization among GPU threads is required in the beginning of every outermost iteration (immediately after line 3 in Algorithm 1). It means that the upper blocking factor b s is limited by the size of shared memory called Local Data Share (LDS). We have selected 64 as the size of blocking factor b s because the required size (32 kB = 2 · 64 2 · 4 Bytes) for the two blocks D K s ,K s , C K s ,K s is equal to the LDS size and the blocking factor should be a power-of-two for easy usage of the kernel invoked in the implementation of the main 
Kernel for Matrix-Matrix "Multiplication"
In the blocked APSP Algorithm 3, the matrix-matrix "multiplication" (MINPLUS in Algorithm 6) is the most frequent and compute intensive part, which strongly affects the total performance; thus, a fast computation of MINPLUS kernel on the GPU is essential for a high performance implementation of the blocked algorithm. For the MINPLUS kernel, we have modified an SGEMM (single-precision general matrix multiply) kernel which was developed in our previous studies [25] , [26] . The SGEMM kernel was written in an assembly-like language called Intermediate Language (IL) [27] . The measured performance of the SGEMM kernel is up to 2432 GFlop/s (90% of 2703 GFlop/s) on the Cayman and 2137 GFlop/s (79% of 2720 GFlop/s) on the Cypress.
In the SGEMM kernel, a basic multiply-add operation of z i, j ← x i,k · y k, j + z i, j can be implemented with a single FMA instruction on the GPU. However, in the MINPLUS kernel, four different instructions are required to implement a single "multiply-add" operation. This indicates that, for the same matrix size, the MINPLUS kernel takes approximately four times longer running time than the SGEMM kernel. The four required instructions are an addition in- struction (corresponding to line 15 in Algorithm 6), a comparison instruction (line 16), and two conditional move instructions (lines 17, 18) . Figure 4 shows the performance of the MINPLUS kernel on the Cayman and the Cypress as a function of matrix size in multiples of 256. Currently, the kernel does not work if a matrix size is not in multiples of 256 (= 64 · 4). It is because a width of global buffer used within a kernel in IL has to be a multiple of 64 elements and each element consists of four single-precision floating point values [28] . The maximum performance of the MINPLUS kernel is 1248 GFlop/s (92% of 1352 GFlop/s) on the Cayman and 995 GFlop/s (73% of 1360 GFlop/s) on the Cypress. A performance variability of the kernel is seen and it is particularly large for small problems sizes (b ≤ 768).
Implementation of the Blocked United APSP Algorithm on Hybrid CPU-GPU Systems
We have implemented the blocked APSP Algorithm 3 on the hybrid CPU-GPU systems. Strategies for performance optimization are almost identical to our previous study [17] . In this paper, we show the way how to increase a GPU utilization of the APSP implementation by optimizing data communication between CPU (host) and GPU. Note that we cannot apply the present optimization if n/b ≤ 2. Communication latencies of a matrix data transferred between CPU and GPU are the bottleneck for the high GPU utilization. To hide these latencies, we first consider a way to reduce the amount of data communication. The discussed MINPLUS kernel produces two b × b output matrices Z , C from two b × b input matrices X, Y. This means that four b × b matrices are needed for an execution of the kernel. An input matrix can be reused if a block to be computed is in the same row block or column block as the previously computed block. Our APSP implementation updates blocks (b×b matrices) on each outermost iteration K (= 1 to n/b) of the blocked united Algorithm 3 in the order shown in Fig. 5 . Following the order, an input block D K,K is reused in Phases 2 and 3 of the algorithm and either an input block D I,K or Fig. 5 Order of block computation of our APSP implementation on each outermost iteration (identical to Fig. 8 in [17] ). D K,J is reused in Phase 4, after sending this block to the GPU.
As it can be seen from Algorithm 3 and the order of computing (Fig. 5) Since Phase 1, which requires 4 · 3b 2 data communication, is exceptional (i.e. it needs its own kernel), the amount of an overhead related to one block data sending for each first block update in three Phases 2-4 (blocks associated with order 2, 5, 8 in Fig. 5 ) is totally 3·4b 2 , and the amount of whole data communication in Phases 2-4 is 3 · 4n 2 . Therefore, the ratio of the overhead amount to the whole communication amount equals to b 2 /n 2 . Figure 6 depicts the overhead ratio as a function of n/b. A staircase pattern is drawn in the figure because our implementation uses a padding technique to solve the APSP problem on a graph whose problem size (number of vertices) n is not in multiples of a blocking factor b. In the padding technique, the values of padded portion in the distance matrix are initialized as infinity ∞ (sufficiently larger value) such that the implementation actually solves the APSP problem on a graph whose size is the next multiple of the blocking factor (n pad = (n + b − 1)/b · b). As it can be seen from Fig. 6 , if n/b > 4, the overhead ratio is less than 1%. Moreover, if n/b > 10, the ratio is less than 0.1%. Because of it, we can view the implementation of each block update in Phases 2-4 as equal to send one b × b matrix to the GPU and receives two b × b matrices from the GPU.
We next consider an overlapping of the data communication and the GPU computation, to hide the communication latencies. The required memory bandwidth for the overlapping is computed by
The value of peak performance is set with the maximum performance of the MINPLUS kernel. The arithmetic intensity [29] is the ratio of total floating-point operations to total data communication amount. As our APSP implementation requires three b × b matrices in single precision for 4b 3 operations, the arithmetic intensity can be estimated as
The measured bandwidth between the CPU and the GPU is around 3 GByte/s, and the computed re- Note that the upper size of blocking factor is limited by a capacity of pinned memory (memory mapped to the PCIExpress memory space), and we cannot use an extremely large blocking factor (up to 2816 in the implementation). Figure 7 shows the performance of the APSP implementation on the CPU-GPU systems for different blocking factors. The performance is saturated on System B which contains a Cypress GPU when the blocking factor b ≥ 1536. On the other hand, a performance saturation is not seen on System A containing a Cayman GPU. This difference comes from the fact that the overlapping of the GPU computation and the communication cannot be implemented perfectly in the Cayman (an overlapping of the GPU computation and the CPU computation works). The maximum performance Fig. 8 Performance of the APSP implementation with a near optimal blocking factor as a function of the problem size on the CPU-GPU systems. Fig. 9 Performance of the APSP implementation with the near optimal blocking factor as a function of the problem size n and with a predefined blocking factor b = 2048 on System A.
is 1116 GFlop/s on System A and 960 GFlop/s on System B, and the efficiency is 89% (1116 of 1248) and 96% (960 of 995), respectively.
As stated above, our implementation uses a padding technique for solving the APSP problem whose size n is not in multiples of a blocking factor b. The disadvantage of the padding technique is that an extra computation is needed for the padded portion. Especially, cases where n = i · b + 1 are worst (i is a positive integer). In these cases, a big amount of extra computation is implemented though the ratio of extra computation to whole computation is small for relatively large graphs.
To avoid a drastic performance degradation for such worst cases, our APSP implementation chooses a near optimal blocking factor as a function of a problem size, instead of using a fixed blocking factor. The optimal blocking factors have been found empirically by computer experiments. Figure 8 shows the performance of the APSP implementation with this optimization. Figure 9 compares the performance using the optimal blocking factor with the one using a predefined blocking factor b = 2048 on System A for 4097 ≤ n ≤ 8192. Choosing the optimal blocking factor prevents the drastic performance degradation. For example, when n = 6145 (= 3 · 2048 + 1), the performance with the optimization is 752 GFlop/s while the one with b = 2048 is 389 GFlop/s only. A performance variability is larger on System B than System A, and big differences are seen when n ≤ 4992 on System B and n ≤ 1408 on System A.
Performance Comparison
Finally, we compare the performance of our APSP implementations on the CPU-GPU systems with APSP implementations on a GPU or a CPU (see Table 1 for the specification of the processors). The implementation on the GPU is the SubBlockedAPSP kernel (Algorithm 4), which is described in Sect. 4.2.1, with including the communication of matrix data between the host and the GPU. The implementation on the CPU is also based on the sub-blocked Algorithm 4 without the merging operation. Moreover, it utilizes OpenMP (Open Multi-Processing) directives and eight-way Intel AVX (Advanced Vector Extensions) instructions on the Sandy Bridge CPU (Core i7 2600K) or four-way SSE (Streaming SIMD Extensions) instructions on the Bloomfield CPU (Core i7 970).
To implement a single "multiply-add" operation for computing both shortest-path distance and construction matrices on the CPUs, we need four AVX/SSE instructions (addition, comparison, and two conditional move instructions), which are same as for the GPU. Since the CPUs do not support a dual issue of any combination of the four instructions, we can expect at most half the peak performance, i.e., 108.8 GFlop/s on the Sandy Bridge and 76.8 GFlop/s on the Bloomfield. When we apply the Eq. (3) to calculate the CPU performance, we obtain, as a maximum, 60 GFlop/s on the Sandy Bridge and 46 GFlop/s on the Bloomfield. On the other hand, to implement computing of a shortest-path distance matrix only, two AVX/SSE instructions (addition and minimum) are needed. Therefore, the total number of operations for the computation is 2n 3 , and the performance of the implementation is up to 105 GFlop/s on the Sandy Bridge and 65 GFlop/s on the Bloomfield. The CPU implementation for computing both matrices requires more number of vector registers by price of worsening memory access regularity. It leads to the lower performance than computing a distance matrix only. Table 2 shows the running time of the different implementations for computing both distance and construction matrices. As it can be seen from the results, the APSP implementation on System A is the fastest when the problem size n ≥ 4096. When the problem size is relatively small (128-256), the Sandy Bridge outperforms the other implementations.
We also compare APSP implementations for computing a shortest-path distance matrix only (see Table 3 ). The running time of the implementation on System B is shorter than the one presented in our previous paper [17] . In this work, for updating the pivot block D K,K in Algorithm 3, we use the sub-blocked APSP Algorithm 4 without construction matrix. This algorithm requires 2b 3 operations while the algorithm used in our previous paper requires 2b 3 log 2 b operations. Table 3 shows System B runs faster than System A when n ≥ 4096. The maximum performance of the MIN-PLUS kernel for distance matrix is 1303 GFlop/s on the Cayman and it is higher than 1068 GFlop/s on the Cypress. As mentioned above, the GPU computation cannot be overlapped with the data communication on System A while it is possible on System B. The computing of distance matrix requires two b × b matrices in single precision for 2b 3 operations, and thus, the arithmetic intensity can be estimated as 
Conclusion
This paper has presented a blocked united all-pairs shortest paths (APSP) algorithm for hybrid CPU-GPU systems. The blocked algorithm computes the shortest-path distance matrix and the shortest-path construction matrix at the same time. We have designed the algorithm so that the amount of data communication between CPU (host) and GPU is reduced. In the blocked algorithm, the matrix-matrix "multiplication" for the APSP problem plays the key role for the high performance. The APSP algorithm has been implemented on the two kinds of CPU-GPU systems. The implementation runs close to the peak performance when a problem size (the number of vertices in a graph) is larger than a few thousands. The sustained performance is 1.1 TFlop/s in single precision on a system containing an Intel Sandy Bridge CPU (Core i7 2600K) and an AMD Cayman GPU (Radeon HD 6970). We guess that the proposed blocked algorithm is applicable to other hybrid systems containing GPU-like accelerators. Future work includes a further extension of the blocked APSP algorithm to cluster systems consisting of multiple CPU-GPU nodes. 
