We present a new implementation of the Floyd-Warshall AllPairs Shortest Paths algorithm on CUDA.
INTRODUCTION
Graphics Processing Units are parallel processors offering high FLOPS/sec for low cost. NVIDIA's CUDA framework makes it practical to take advantage of this resource for high performance computing by providing a scalable programming model and a stable and comprehensible machine model to program against. Even with the tools provided by CUDA, making full use of the inherent processing power of a GPU can be a challenge, since the speed of a computation may not be limited by the speed of the processing units, but can also be limited by memory bus bandwidth or problems with thread scheduling.
Algorithms that find the length of the shortest path between every pair of vertices in a weighted and directed graph solve the all pairs shortest paths problem. This class of algorithms, including the Floyd-Warshall algorithm (e.g. [2] ), have a wide variety of applications in areas as diverse as bioinformatics, routing, and network analysis. The FloydWarshall algorithm is a standard approach for all pairs shortest paths problem that works with negative edge weights, doesn't suffer performance degradation for dense graphs, and has predictable execution regardless of the underlying data. It has can be decomposed into n 3 atomic tasks and runs in Θ(n 3 ) time, where n is the number of vertices in the graph.
Implementations of the Floyd-Warshall algorithm for CUDA have been presented in the past. Harish and Narayanan [3] presented an implementation assigning a single thread for each atomic task. This execution time of this approach was limited by the time necessary to pass data over the bus between global memory and the multiprocessors. Katz and Kider [4] achieved a speedup of 5 − 6× over Harish and Narayanan's simple implementation by using a blocked approach that performs several tasks for each data element passed over the global memory bus. It requires three 32 by 32 tiles of data elements from the adjacency matrix to be stored in on chip shared memory for the duration of the execution of a thread block, which limits the flexibility of the GPU thread scheduler, resulting in significant exposed instruction latency.
Our implementation achieves a further speedup of over 5× over the implementation of Katz and Kider. This includes a 2.1 − 2.3× speedup from reducing the instruction count and using less expensive instructions, and a 2.3 − 2.5× speedup from reducing the shared memory required by each thread block, allowing the thread scheduler to more effectively hide latency. Our improvements to shared memory usage consist of making efficient use of the on chip registers and staging the algorithm so that not all shared data dependencies for a block need to be in shared memory at any one time. On the NVIDIA Tesla C1060, our implementation is able to solve APSP for any graph with single precision edge weights containing 16,384 vertices in 53.06 seconds.
2 This is over 150x as fast as a basic Floyd-Warshall implementation running on our CPU. This paper is organized as follows. In Section 2, we discuss previous work in optimizing the Floyd-Warshall algorithm.
1: {Assume all wxy are initialized to the weight of edge (x, y)} 2: for all 1 k n do 3:
for all 1 i n do 4:
for all 1 j n do 5:
wij ← min(wij , w ik + w kj ) 6: end for 7:
end for 8: end for In Section 3, we discuss the blocked Floyd-Warshall algorithm used by Katz and Kider ([4] ) in more depth. We discuss the improvements we made to the basic blocked FloydWarshall in Section 4. In Section 5, we present our results and show that they are close to the best possible on our hardware.
PREVIOUS WORK
An important improvement in cache utilization for the FloydWarshall algorithm was found by Venkataraman et al. [5] . (Ferencz and Diament presented a similar improvement a few years earlier in [6] .) The method of Venkataraman et al. first partitions the incidence matrix into multiple tiles. From this, they noted that each tile of the matrix may be processed more extensively (i.e., one may iterate the "outer loop" multiple times over the tile) before proceeding to another tile. This approach allows for a significant improvement in cache utilization. They reported a speedup of between 1.6× and 1.9× over the standard implementation on their targeted architecture.
Further work on improving cache utilization was done by Penner and Prasanna [7] , and also Han et al. [8] . The work of Han et al. is particularly useful. They created a program that generates an implementation of the Floyd-Warshall algorithm that is optimized for a specific architecture, including the use of SIMD vectorization that is available on modern PC processors.
Bondhugula et al. [9] consider the efficient use of FPGAs (Field Programmable Gate Arrays) to improve performance of Floyd-Warshall. They noted (published in 2006) that their result was "an improvement by a factor of tens over the CPU implementation".
Harish and Narayanan demonstrated the use of CUDA to solve APSP (along with several other graph algorithms) [3] . Soon after, a significant improvement was made by Katz and Kider [4] . Their implementation utilized the blocking algorithm described by Venkataraman et al. [5] . Their approach resulted in a 5×-6.5× improvement over that of Harish and Narayanan.
BLOCKED FLOYD-WARSHALL ALGO-RITHM ON CUDA 3.1 The Floyd-Warshall Algorithm
The Floyd-Warshall algorithm is a familiar algorithm to solve the all pairs shortest paths problem. Let G = (V, E) be a weighted directed graph, where V = {v1, v2, . . . , vn}. Let wij be the weight assigned to the edge (vi, vj ). If (vi, vj) / ∈ E, then wi,j = ∞, and for all vi ∈ V , wi,i = 0. Otherwise, wi,j ∈ R, such that no negative cycles exist. Let d i,j by iterating over k and updating wi,j to be the shortest path from i to j using only vertices 0 through k. (see Figure 1) At completion, the incidence matrix W contains d (n) i,j for all i and j.
Harish and Narayanan [3] implemented Floyd-Warshall on CUDA in the simplest and most direct way possible. They created a kernel containing a single iteration of the innermost loop, and repeatedly assigned as many threads running this kernel as possible. Since the GPU is designed to efficiently schedule a large number of threads, this approach works reasonably well and typically offers some speedup over the same algorithm on a CPU, but it suffers from global memory bus saturation.
It is easy to see that the core algorithm requires 3 words to be sent from global memory to the multiprocessors for the calculation and 1 word to be sent back to global for storage. This is 16 bytes sent across the global memory bus for each task performed. On our setup, the measured speed for device-to-device memcpy operations for the NVIDIA Tesla C1060 is 77 GB/sec. Based on this, we can calculate that we can expect to perform fewer than 4.8 * 10
9 tasks per second, and in fact our measured speed for this algorithm is approximately 2.6 * 10 9 tasks/sec (see section 5). On the other hand, the advertised computing speed for the NVIDIA Tesla C1060 is 933 GFLOPs/sec. Even unoptimized, the basic calculation of finding the correct index to process, adding, and taking a minimum requires fewer than the 359 FLOPs the Tesla is capable of performing in the measured time for a single task.
Blocked Floyd-Warshall
Katz and Kider [4] recognized this problem and applied existing research (e.g. [5] ) on improving cache efficiency of the Floyd-Warshall algorithm to make practical use of the shared memory associated with each multiprocessor in the GPU (see [10] and [11] for details on the CUDA machine and programing models). The algorithm they used is a blocked Floyd-Warshall that performs several tasks for each data element transferred across the memory bus. Figure 2 shows a basic sequential blocked Floyd-Warshall algorithm.
In the blocked algorithm, the matrix is partitioned into tiles and the tasks are performed in stages. Katz and Kider use 32 × 32 tiles, and 32 tasks are performed for each data element in the matrix during each stage of the algorithm. Each stage requires the execution of three kernels. The first kernel calculates 32 tasks for each data element in a single tile on the diagonal of the adjacency matrix, termed the "independent block." Tasks in the independent block depend only on other tasks in the independent block, or on tasks that were computed in a previous stage. The second kernel calculates 32 tasks for each data element in each tile aligned with the independent block in either the i-or j-direction. These are the "singly dependent blocks." Tasks in the singly dependent blocks have one data dependency within the block, and one dependency in the already calculated independent block. There are Θ(n) singly dependent blocks in each stage. Finally, the third kernel calculates 32 tasks for each of the remaining tiles. These are the "doubly dependent blocks," and each of these blocks depends entirely on two singly dependent blocks. Since there are no dependencies internal to a doubly dependent block, these tasks may be performed in any order. There are Θ(n 2 ) doubly dependent blocks in each stage, and it is the efficiency with which this stage is performed that determines the speed of the algorithm.
In Katz and Kider's implementation of the doubly dependent kernel for this algorithm, two singly dependent tiles containing dependencies and the doubly dependent tile being processed are loaded into shared memory for each thread block. The thread block then performs 32 tasks for each data element in the doubly dependent tile before copying the doubly dependent tile back to global memory. Clearly, the total data sent through the global memory bus is reduced by a factor of 32 in comparison with the implementation of Harish and Narayanan. Performing a similar bandwidth calculation as before, we would expect that, if the speed of the algorithm were limited by the memory bus, it would be capable of performing 154 * 10 9 tasks/second on the NVIDIA Tesla C1060 in the ideal case, or perhaps half that if the performance were comparable to the previous algorithm. In fact, our results (see section 5) indicate that this algorithm can perform 14.9 * 10 9 tasks/second, which indicates that it is limited by the rate it performs the tasks on the multiprocessors and not by the rate it can send data between the multiprocessors and global memory. This is not a surprise, since, in the ideal case, we would need to be able to perform a task with just 6 FLOPs before bandwidth would be the determining factor.
Thread Blocks and Multiprocessors in the Blocked Floyd-Warshall
Each multiprocessor on a device of compute capability 1.3 like the NVIDIA Tesla C1060 can manage up to 1024 threads and has 16384 bytes of shared memory and 16384 one-word registers. A program will assign a large number of thread blocks, and multiple thread blocks may be processed simultaneously on each multiprocessor, subject to the limitations just mentioned.
The blocked Floyd-Warshall algorithm described above uses a tile size of 32. Each thread block processing a doubly dependent block copies three tiles to shared memory -the doubly dependent tile to process and two singly dependent tiles with the dependencies for doubly dependent tile. The total shared memory consumed per block is 3 tiles * 32 2 words per tile * 4 bytes per word +32 bytes for parameters = 12320 bytes. Since this is more than half of the total shared memory available per multiprocessor, only one block can be assigned to a multiprocessor at any one time.
196 threads assigned to a single multiprocessor can hide latency from register dependencies, and 512 threads is enough to hide latency of global memory access for most applications [12] . This assumes that a significant portion of the threads to be scheduled are available when needed. If a thread is waiting at a synchronization point, it can't be scheduled.
1: {Assume all wxy are initialized to the weight of edge (x, y)} 2: for all 0 b < n/s do 3:
{Process the independent block first.} 4:
for all b * s k < (b +
{Tasks in a singly dependent block depend on the independent block.} 12:
{i-aligned singly dependent blocks} 13:
for
{j-aligned singly dependent blocks } 23:
for all 0 jb n/s do 24:
for all b * s k < (b + 1) * s do 25:
for all jb * s i < (jb + 1) * s do 26:
for all b * s j < (b + 1) * s do 27:
wij ← min(wij , w ik + w kj ) 28:
end
{Most blocks are doubly dependent. Notice that k is now innermost.} 33:
for all 0 ib n/s do 34:
for all 0 jb n/s do 35:
for all jb * s i < (jb + 1) * s do 36:
for all ib * s j < (ib + 1) * s do 37:
for all block * size k < (block + In this application, the kernel first copies part of the tiles it needs from global memory, and then waits at a synchronization point. Since only one block can be assigned to each multiprocessor, when the last thread to copy data from global memory executes its instruction, there are no threads available to schedule. Accessing global memory has significant latency (hundreds of cycles) which will be exposed when the scheduler is starved for threads.
STAGED BLOCKED FLOYD-WARSHALL
We applied two rounds of improvements to a blocked implementation matching the performance reported in Katz and Kider. First, we performed standard optimizations to reduce the instruction count and use less expensive instructions. Primarily, this was using bit shifts instead of division or modulus operators where possible and unrolling loops. These basic optimizations yielded a speedup of 2.1 − 2.3×
Once we had a well optimized basic blocked implementation, we applied two techniques to reduce the shared memory consumed by each thread block. First, we made more efficient use of the multiprocessor registers. Second, we staged the data load so that only a fraction of the singly dependent tiles containing the calculated dependencies for the current doubly dependent tile being processed were in shared memory at any one time. The net effect of these optimizations was to reduce the shared memory used by a thread block by a factor of nearly 12 without changing the amount of data flowing between global memory and the multiprocessors. Enabling multiple thread blocks to be assigned to each multiprocessor enables the thread scheduler to effectively hide the latency of accessing global memory and allows more tasks to be assigned to each thread which reduces the computational overhead of calculating initial indexes. We changed from using 256 threads per thread block to 64. These changes resulted in an additional 2.3−2.4× speedup, for a total improvement of approximately 5.2×.
Efficient Use of Registers
One critical observation is that each multiprocessor has more memory available in registers than in shared memory. For a device of compute capability 1.3, there are 16384 4-byte registers and a total of 16384 bytes of shared memory. By reading data that does not need to be shared among threads in a thread block directly to registers, the total shared memory consumed can be reduced. This optimization will be useful when thread occupancy or the number of blocks assigned to a multiprocessor is limited by shared memory and additional registers are still available.
In this case, all of the tasks that involve updating a specific data element in the doubly dependent matrix are assigned to a single thread, and no thread needs to access the data elements assigned to another thread. As a result, the doubly dependent tile can be read directly into registers. Instead of reading the doubly dependent tile into an array of size t * t in shared memory, each thread reads its assigned elements into a local array of size t * t/h where h is the number of threads per thread block. In general, the compiler will create an array with constant indexes in registers. An array that doesn't have size determined at compile time will be created in local memory, which is vastly slower. Figure 3 shows the memory layout in shared memory when the doubly dependent tile is stored in the registers. This optimization reduces the shared memory consumed per block to 2 * 32 2 + 32 = 8224, which is more than half of the available 16384. It is still only possible to assign a single thread block to each multiprocessor, so this improvement does not by itself affect performance.
Staged Load of Singly Dependent Tiles
To further reduce shared memory usage, we read the singly dependent blocks in stages. All of the tasks in the doubly dependent block in a certain k-range will depend a specific subset of the data elements in the singly dependent tiles. Specifically, tasks in the range (i, j) k to (i, j) k+m will depend on data elements in the i-aligned tile in the range (i, k) to (i, k + m) and in the j-aligned matrix in the range (k, j) to (k + m, j). Since the data elements of the doubly dependent tile are evenly assigned to the threads in a thread block, each thread will have the same number of tasks in a given k-range. As a result, we can break the algorithm into stages, each separated by a synchronization point. In each stage, the data dependencies for the tasks in the next k-range of size m are loaded into shared memory arrays, the threads are synchronized, and the next m tasks for each data element are performed. Using this approach, only t * m data elements for each singly dependent tile need to be loaded into shared memory at any one time. Figure 4 shows the usage of shared memory during a single stage of the update of the doubly dependent tile. In this case, the red areas of the i-aligned and j-aligned tiles contain all the dependencies for tasks in the doubly dependent tile over some range of k. When this range is updated, the next slice of each singly dependent tile will be loaded into shared memory, and the doubly dependent tile will be updated over the next k-range.
Our specific implementation uses a tile size of 32 and stages the algorithm over 4 iterations. The total shared memory used in this implementation is 2 * 32 * 4 * 4 + 32 = 1056, so as many as 15 blocks could be run on each multiprocessor given the shared memory usage. The limiting factors are now the total threads assigned to a multiprocessor and the registers used for each thread block.
Implementation
There are two significant hurdles to implementing the staged load of the singly dependent tiles. First, it is necessary to retain coalesced access to global memory. A single half warp that reads from global memory should access aligned and contiguous words. Using a row-major data structure, a slice of the j-aligned tile will be composed of contiguous 16-word blocks but a slice of the i-aligned tile will not. Figure 5 shows this situation. A set of 4 adjacent columns from a row major matrix involves accessing only 4 adjacent data elements. By tiling the data in 4 by 4 blocks, we can take 4 rows or 4 columns with 16 or more adjacent elements in either direction without increasing the load on the global memory bus.
In order to be able to read both a set of 4 columns and a set of 4 rows in contiguous 16-word blocks, we use a doubly [8] . In this ordering, each 32 by 32 tile and each 4 by 4 tile is contiguous in memory. Within the entire matrix, the 32 by 32 tiles are arranged in rowmajor order, and within each 32 by 32 tile the 4 by 4 tiles are arranged in row-major order.
Second, it is necessary to avoid shared memory bank conflicts when accessing the singly dependent tiles. Shared memory is arranged into 16 banks. There are two patterns of memory access that allow a read from shared memory to be completed in a single cycle. Either all of the threads in a half warp must read the same data element, or each thread must read from a different memory bank. With all of the data elements assigned to a thread in a half warp adjacent in row-major order as they are in the Katz-Kider implementation, the shared memory access is very naturally good. In the j-aligned tile, each thread accesses adjacent memory locations, and in the i-aligned tile, each thread in the half warp accesses the same memory location. When the data is arranged into 4 by 4 tiles, however, this simple memory access pattern breaks down. Threads 0, 4, 8, and 12 all access the same data element in the j-aligned tile and threads 0, 1, 2, and 3 access the same element in the i-aligned tile, resulting in 4-way data conflicts. This data access pattern causes each shared memory access to take 4 processor cycles. Since each task involves two reads from shared memory, this is a significant hit to performance.
To ameliorate this problem, it is important to remember that, although all of the tasks in a doubly dependent block must be performed, the order in which they are performed doesn't matter. As a result, different threads in a single half warp can process the tasks for their data elements in any order; they are not restricted to performing them in the most natural order, with k proceeding from 0 to t. Instead, the first task performed by a thread in a given halfwarp is the task identified by the sum of the i and j indexes within the tile modulo 4. This results in conflict free shared memory access. Figure 6 shows different patterns of shared memory access that result from using a 4 by 4 tiled memory pattern. The top figure shows the memory access pattern when the matrix is in row-major order. In this case, the data elements processed by threads in a half warp are in a row 16 elements long. Each thread will access an element in the j-aligned tile with the same j-index as it has and k-index determined by the k-iteration it processes. Each thread will access the same element in the i-aligned tile, since they all have the same i-index. This results in a conflict free pattern of memory access. Each element accessed in the j-aligned tile is in a different memory bank since they fall on 16 contiguous words. The single element in the i-aligned can be broadcast to all of the threads in a single cycle. data elements updated by threads in a half warp are arranged in a row of 16. Although the same indexes will be accessed as in the case using a row-major data ordering, these data elements will share data banks. For example, the data elements in the j-aligned tile having relative indexes (0, 0) and (4, 0) will be in the same bank.
The bottom figure shows the approach we used to solve this problem. Instead of iterating over tasks (i, j) k to (i, j) k+4 in order, the task chosen depends on the position of the thread in it's half warp as described above. This results in conflict free shared memory data access.
RESULTS AND ANALYSIS
The tests were run on a computer having 1300 MHz AMD Phenom TM 9950 Quad-core processors and NVIDIA Tesla C1060 GPUs. No program used more than 1 processor or GPU. All programs were compiled as 32-bit applications. Table 5 and figure 7 show the performance of our various implementations on different problem sizes.
The CPU implementation is a basic implementation on the CPU. The Harish and Narayanan algorithm assigns a single thread for each task performed. The Katz and Kider algorithm overcomes the bandwidth limitations faced by Harish and Narayanan with blocking. The blocking size used is 32. The Optimized and Blocked algorithm is a version of the Katz and Kider algorithm that applies a number of optimizations to reduce the amount of time spent calculating results. The Staged Load algorithm reduces the latency exposed in the Katz and Kider algorithm by reducing the amount of shared memory used.
The Harish and Narayanan implementation is limited by the bandwidth of the global memory bus. Since each task requires 16 bytes of total traffic over the bus, the bandwidth achieved by this algorithm is 42 GB/sec, somewhat short of the 77 GB/sec bandwidth measured on our device for device-to-device memcpy.
The Katz and Kider implementation is limited by the time spent on calculations. It can perform 14.9 * 10 9 tasks per second. Based on the advertised speed of 933 GFLOP/sec for a NVIDIA Tesla C1060, this is equivalent to using 62.7 FLOPs for each task. Since each basic task consists of a single add, a minimum, and the accessory operations of locating and loading at least one dependency in shared memory and assigning the result of the calculation to a register, we could reasonably hope to get the actual total processing to under 10 FLOPs per task.
Our staged load implementation performs approximately 73.6 * 10 9 tasks per second. If it is limited by bandwidth, it achieves 46 GB/sec, which is less than the 70 GB/sec or so we could reasonably hope for. If it is limited by the processing speed, it is using the equivalent of 12.7 FLOPs per task. Since small changes to the code executed on the multiprocessor affect performance, we believe that it is still limited by the processing speed. In either case, it is possible that further optimizations could improve the performance of this blocked algorithm by a factor of 1.5 or less.
ACKNOWLEDGMENTS
We would like to thank George Purdy for suggesting the topic and Fred Annexstein for his insight and advice. 
