We present an experimental study of parallel algorithms for solving the single source shortest path problem with non-negative edge weights (NSSP) on large-scale graphs. We implement Meyer and Sander's ∆-stepping algorithm and report performance results on the Cray MTA-2, a multithreaded parallel architecture. The MTA-2 is a high-end shared memory system offering two unique features that aid the efficient implementation of irregular parallel graph algorithms: the ability to exploit fine-grained parallelism, and low-overhead synchronization primitives. Our implementation exhibits remarkable parallel speedup when compared with a competitive sequential algorithm, for low-diameter sparse graphs. For instance, ∆-stepping on a directed scale-free graph of 100 million vertices and 1 billion edges takes less than ten seconds on 40 processors of the MTA-2, with a relative speedup of close to 30. To our knowledge, these are the first performance results of a parallel NSSP problem on realistic graph instances in the order of billions of vertices and edges.
Introduction
This paper primarily discusses parallel algorithms and implementations for solving the single source shortest path problem on large-scale graph instances. In addition to applications in combinatorial optimization problems, shortest path algorithms are finding increasing relevance in the domain of complex network analysis. Popular graph theoretic analysis metrics such as betweenness centrality [27, 9, 41, 43, 34] are based on shortest path algorithms. Our parallel implementations target the scale-free graph family, a well-studied model [7, 24, 12, 53, 52] for real-world large-scale graphs that captures features such as a low diameter, heavy-tailed degree distributions modeled by power laws, and self similarity. We also conduct an experimental study of performance on several other graph families, and this work is our submission to the 9th DIMACS Implementation Challenge [18] on Shortest Paths.
Sequential algorithms for the single source shortest path problem with non-negative edge weights (NSSP) are studied extensively, both theoretically [22, 20, 25, 26, 56, 58, 35, 32, 48] and experimentally [21, 30, 29, 15, 61, 31] . Let m and n denote the number of edges and vertices in the graph respectively. Nearly all NSSP algorithms are based on the classical Dijkstra's [22] algorithm. Using Fibonacci heaps [25] , Dijkstra's algorithm can be implemented in O(m + n log n) time. Thorup [58] presents an O(m + n) RAM algorithm for undirected graphs that differs significantly different from Dijkstra's approach. Instead of visiting vertices in the order of increasing distance, it traverses a component tree. Meyer [49] and Goldberg [31] propose simple algorithms with linear average time for uniformly distributed edge weights.
In this paper, we primarily focus on parallel implementations of NSSP. Prior parallel NSSP algorithms have been reviewed in detail by Meyer and Sanders [48, 51] . There are no known PRAM algorithms that run in sub-linear time and O(m + n log n) work. Parallel priority queues [23, 11] for implementing Dijkstra's algorithm have been developed, but these linear work algorithms have a worst-case time bound of Ω(n), as they only perform edge relaxations in parallel. Several matrix-multiplication based algorithms [36, 28] , proposed for the parallel All-Pairs Shortest Paths (APSP), involve running time and efficiency tradeoffs. Parallel approximate NSSP algorithms [42, 16, 57] based on the randomized BreadthFirst search algorithm of Ullman and Yannakakis [60] run in sub-linear time. However, it is not known how to use the Ullman-Yannakakis randomized approach for exact NSSP computations in sub-linear time.
Meyer and Sanders give the ∆-stepping [51] NSSP algorithm that divides Dijkstra's algorithm into a number of phases, each of which could be executed in parallel. For random graphs with uniformly distributed edge weights, this algorithm runs in sub-linear time with linear average case work. Several theoretical improvements [50, 46, 47] are given for ∆-stepping.
The literature contains few experimental studies on parallel NSSP algorithms [37, 54, 39, 59] . Prior implementation results on distributed memory machines resorted to graph partitioning [14, 1, 33] , and then running a sequential NSSP algorithm on the sub-graph. Heuristics are used for load balancing and termination detection [38, 40] . The implementations perform well for certain graph families and problem sizes, but no speedup may be possible in the worst case.
Implementations of PRAM graph algorithms for arbitrary sparse graphs are typically memory intensive, and the memory accesses are fine-grained and highly irregular. This often leads to poor performance on cache-based systems. On distributed memory clusters, few parallel graph algorithms outperform the best sequential implementations due to long memory latencies and high synchronization costs [4, 3] . Parallel shared memory systems are a more supportive platform. They offer higher memory bandwidth and lower latency than clusters, and the global shared memory greatly improves developer productivity. However, parallelism is dependent on the cache performance of the algorithm and scalability is limited in most cases. While it may be possible to improve the cache performance to a certain degree for some classes of graphs [55] , there are no known general techniques for cache optimization because the memory access pattern is largely dependent on the structure of the graph.
We present our shortest path implementation results on the Cray MTA-2, a massively multithreaded parallel architecture. The MTA-2 is a high-end shared memory system offering two unique features that aid considerably in the design of irregular algorithms: fine-grained parallelism and low-overhead synchronization. The MTA-2 has no data cache; rather than using a memory hierarchy to reduce latency, the MTA-2 processors use hardware multithreading to tolerate the latency. The low-overhead synchronization support complements multithreading and makes performance primarily a function of parallelism. Since graph algorithms often have an abundance of parallelism, these architectural features lead to superior performance and scalability. Our recent results highlight the exceptional performance of the MTA-2 for implementations of key combinatorial optimization and graph theoretic problems such as list ranking [3] , connected components and subgraph isomorphism [8] , Breadth-First Search and st-connectivity [5] .
Our main contributions in this paper are as follows:
• We conduct an experimental study of parallel shortest path algorithms designed for shared memory architectures. Prior studies have predominantly focused on running sequential SSSP algorithms on graph families that can be easily partitioned, whereas we also consider several arbitrary, sparse graph instances.
• We present parallel performance results of the ∆-stepping algorithm on the Cray MTA-2, a multithreaded architecture. We attain impressive performance on low-diameter graphs. We also analyze performance using machine-independent algorithmic operation counts.
• On the MTA-2, we are able to solve NSSP for large-scale realistic graph instances in the order of billions of edges. For instance, ∆-stepping on a synthetic directed scale-free graph of 100 million vertices and 1 billion edges takes 9.73 seconds on 40 processors of the MTA-2, with a relative speedup of approximately 31. Also, the sequential performance of our implementation is comparable to competitive NSSP implementations (such as the DIMACS reference solver).
This paper is organized as follows. Section 2 provides a brief overview of ∆-stepping. Our parallel implementation of ∆-stepping is discussed in Section 3. Section 4 and 5 describe our experimental setup, performance results and analysis. We conclude with a discussion on implementation improvements and future plans in Section 6. Applendix A describes the MTA-2 architecture.
2 Review of the ∆-stepping Algorithm
Preliminaries
Let G = (V, E) be a graph with n vertices and m edges. Let s ∈ V denote the source vertex. Each edge e ∈ E is assigned a non-negative real weight by the length function l : E → R. Define the weight of a path as the sum of the weights of its edges. The single source shortest paths problem with non-negative edge weights (NSSP) computes δ(v), the weight of the shortest (minimum-weighted) path from s to v. δ(v) = ∞ if v is unreachable from s. We set δ(s) = 0.
Shortest path algorithms maintain a tentative distance value for each vertex, which are updated by edge relaxations. Let d(v) denote the tentative distance of a vertex v. d(v) is initially set to ∞, and is an upper bound on δ(v). Relaxing an edge v, w ∈ E sets d(w) to the minimum of d(w) and d(v) + l(v, w). Based on the manner in which the tentative distance values are updated, most shortest path algorithms can be classified into two types: label-setting or label-correcting. Label-setting algorithms (for instance, Dijkstra's algorithm) perform relaxations only from settled (d(v) = δ(v)) vertices, and compute the shortest path from s to all vertices in exactly m edge relaxations. Based on the values of d(v) and δ(v), at each iteration of a shortest path algorithm, vertices can be classified into unreached (d(v) = ∞), queued (d(v) is finite, but v is not settled) or settled. Label-correcting algorithms (e.g., Bellman-Ford) relax edges from unsettled vertices also, and may perform more than m relaxations. Also, all vertices remain in a queued state until the final step of the algorithm. ∆-stepping belongs to the label-correcting type of shortest path algorithms.
Algorithmic Details
The ∆-stepping algorithm (see Alg. 1) is an "approximate bucket implementation of Dijkstra's algorithm" [51] . It maintains an array of buckets B such that B[i] stores the set of vertices {v ∈ V : v is queued and d(v) ∈ [i∆, (i + 1)∆]}. ∆ is a positive real number that denotes the "bucket width".
In each phase of the algorithm (the inner while loop in Alg. 1, lines 9-13, when bucket B[i] is not empty), all vertices are removed from the current bucket, added to the set S, and light edges (l(e) ≤ ∆, e ∈ E) adjacent to these vertices are relaxed (see Alg. 2). This may result in new vertices being added to the current bucket, which are deleted in the next phase. It is also possible that vertices previously deleted from the current bucket may be reinserted, if their tentative distance is improved. Heavy edges (l(e) > ∆, e ∈ E) are not Input: G(V, E), source vertex s, length function l : E → R Output: δ(v), v ∈ V , the weight of the shortest path from s to v
Algorithm 1: ∆-stepping algorithm Input: v, weight request x Output: Assignment of v to appropriate bucket
Algorithm 2: The relax routine in the ∆-stepping algorithm relaxed in a phase, as they result in tentative values outside the current bucket. Once the current bucket remains empty after relaxations, all heavy edges out of the vertices in S are relaxed at once (lines 14-16 in Alg. 1). The algorithm continues until all the buckets are empty.
Observe that edge relaxations in each phase can be done in parallel, as long as individual tentative distance values are updated atomically. The number of phases bounds the parallel running time, and the number of reinsertions (insertions of vertices previously deleted) and rerelaxations (relaxation of their out-going edges) costs an overhead over Dijkstra's algorithm. The performance of the algorithm also depends on the value of the bucket-width ∆. For ∆ = ∞, the algorithm is similar to the Belman-Ford algorithm. It has a high degree of parallelism, but is inefficient compared to Dijkstra's algorithm. ∆-stepping tries to find a good compromise between the number of parallel phases and the number of re-insertions. Theoretical bounds on the number of phases and re-insertions, and the average case analysis of the parallel algorithm are presented in [51] . We summarize the salient results.
Let d c denote the maximum shortest path weight, and P ∆ denote the set of paths with weight at most ∆. Define a parameter l max , an upper bound on the maximum number of edges in any path in P ∆ . The following results hold true for any graph family.
• The number of buckets in B is ⌈d c /∆⌉.
• The total number of reinsertions is bounded by |P ∆ |, and the total number of rerelaxations is bounded by |P 2∆ |.
• The number of phases is bounded by dc ∆ l max , i.e., no bucket is expanded more than l max times.
For graph families with random edge weights and a maximum degree of d, Meyer and Sanders [51] theoretically show that ∆ = θ(1/d) is a good compromise between work efficiency and parallelism. The sequential algorithm performs O(dn) expected work divided between O( dc ∆ · log n log log n ) phases with high probability. In practice, in case of graph families for which d c is O(log n) or O(1), the parallel implementation of ∆-stepping yields sufficient parallelism for our parallel system.
Parallel Implementation of ∆-stepping
See Appendix A for details of the MTA-2 architecture and parallelization primitives.
The bucket array B is the primary data structure used by the parallel ∆-stepping algorithm. We implement individual buckets as dynamic arrays that can be resized when needed and iterated over easily. To support constant time insertions and deletions, we maintain two auxiliary arrays of size n: a mapping of the vertex ID to its current bucket, and a mapping from the vertex ID to the position of the vertex in the current bucket. All new vertices are added to the end of the array, and deletions of vertices are done by setting the corresponding locations in the bucket and the mapping arrays to −1. Note that once bucket i is finally empty after a light edge relaxation phase, there will be no more insertions into the bucket in subsequent phases. Thus, the memory can be reused once we are done relaxing the light edges in the current bucket. Also observe that all the insertions are done in the relax routine, which is called once in each phase, and once for relaxing the heavy edges.
We implement a timed pre-processing step to semi-sort the edges based on the value of ∆. All the light edges adjacent to a vertex are identified in parallel and stored in contiguous locations, and so we visit only light edges in a phase. The O(n) work pre-processing step scales well in parallel on the MTA-2.
We also support fast parallel insertions into the request set R. R stores v, x pairs, where v ∈ V and x is the requested tentative distance for v. We only add a vertex v to R if it satisfies the condition x < d(v). We do not store duplicates in R. We use a sparse set representation similar to one used by Briggs and Torczon [10] for storing vertices in R. This sparse data structure uses two arrays of size n: a dense array that contiguously stores the elements of the set, and a sparse array that indicates whether the vertex is a member of the set. Thus, it is easy to iterate over the request set, and membership queries and insertions are constant time. Unlike other Dijkstra-based algorithms, we do not relax edges in one step. Instead, we inspect adjacencies (light edges) in each phase, construct a request set of vertices, and then relax vertices in the relax step.
All vertices in the request set R are relaxed in parallel in the relax routine. In this step, we first delete a vertex from the old bucket, and then insert it into the new bucket. Instead of performing individual insertions, we first determine the expansion factor of each bucket, expand the buckets, and add then all vertices into their new buckets in one step. Since there are no duplicates in the request set, no synchronization is involved for updating the tentative distance values.
On the MTA-2, accessing the same memory location concurrently by several threads incurs a performance penalty. We call these high-contention memory locations hot spots, and need to minimize these to ensure good scalability. For instance, in the relax routine, the bucket size counter may become a hot spot if a significant number of vertices in the current request set are inserted into the same bucket. This is particularly true for low-diameter graph families such as random and scale-free graphs. However, this leads to a performance penalty only in the case of very large problem instances (random graphs with 500 million to 2 billion edges) using over 30 processors.
To saturate the MTA-2 processors with work and to attain high system utilization, we need to minimize the number of phases and non-empty buckets, and maximize the request set sizes. Entering and exiting a parallel phase involves a negligible running time overhead in practice. However, if the number of phases is O(n), this overhead dominates the actual running time of the implementation. Also, we enter the relax routine once every phase. There are several implicit barrier synchronizations in the algorithm that are proportional to the number of phases. Our implementation reduces the number of barriers. Our source code for the ∆-stepping implementation, along with the MTA-2 graph generator ports, is freely available online [44] .
Experimental Setup

Platforms
We report parallel performance results on a 40-processor Cray MTA-2 system with 160 GB uniform shared memory. Each processor has a clock speed of 220 MHz and support for 128 hardware threads. The ∆-stepping code is written in C with MTA-2 specific pragmas and directives for parallelization. We compile it using the MTA-2 C compiler (Cray Programming Environment (PE) 2.0.3) with -O3 and -par flags.
The MTA-2 code also compiles and runs on sequential processors without any modifications. Our test platform for the sequential performance results is one processor of a dual-core 3.2 GHz 64-bit Intel Xeon machine with 6GB memory, 1MB cache and running RedHat Enterprise Linux 4 (linux kernel 2.6.9). We compare the sequential performance of our implementation with the DIMACS reference solver [19] . Both the codes are compiled with the Intel C compiler (icc) Version 9.0, with the flags -O3. The source code is freely available online [44] .
Problem Instances
We evaluate the sequential and parallel performance on the core graph families provided in the DIMACS benchmark package [19] . The core package includes the following graph families:
• Random graphs: Random graphs are generated by first constructing a Hamiltonian cycle, and then adding m − n edges to the graph at random. The generator may produce parallel edges as well as self-loops. Two random graph families are defined: Random4-n (n grows, and the maximum weight is set to n. m is set to 4n) and Random4-C (n is fixed, and C grows).
• Grid graphs: This synthetic generator produces grid-like graphs. The grid contains x · y vertices, (i, j) for 0 ≤ i < x and 0 ≤ j < y. A vertex (i, j) is connected to adjacent vertices in the same layer. If i < x − 1, each vertex (i, j) is also connected to the vertex
, y = 16) and Square grid (x = y = √ n) families are defined, similar to random graphs.
• Road graphs: Road graph families with transit time (USA-road-t) and distance (USAroad-t) as the length function.
In addition to the core families, we study graph instances from the scale-free graph family (denoted by ScaleFree4-n). These are low-diameter graphs with an unbalanced degree distribution, and the structural properties modeled by these graphs are been observed in several real-world complex networks [7, 2, 24] . We use the R-MAT graph model [13] to generate Scalefree4-n graph instances. The R-MAT scale-free generator is included in our GTgraph [45] synthetic graph generator package.
All the synthetic core graph generators assume randomly distributed edge weights. We report results for an additional log-uniform distribution also. The generated integer edge weights are of the form 2 i , where i is chosen from the uniform random distribution [1, log C] (C denotes the maximum integer edge weight). We define Random4logUnif-n and ScaleFree4 logUnif-n families for this weight distribution.
Methodology
For sequential runs, we report the execution time of the reference DIMACS NSSP solver and the baseline Breadth-First Search (BFS) on every core graph family. The BFS running time is a natural lower bound for NSSP codes and is a good indicator of how optimized the shortest path implementations are. It is reasonable to directly compare the execution times of the reference code and our implementation: both use a similar adjacency array representation for the graph, written in C, and compiled and run in identical experimental settings. Note that our implementation is optimized for the MTA-2 and we make no modifications to the code before running on a sequential machine. The time taken for semi-sorting and mechanisms to reduce memory contention on the MTA-2 both constitute overhead on a sequential processor. Also, our implementation assumes real-weighted edges, and we cannot use fast bitwise operations. Unless otherwise specified, assume that the value of ∆ is set to n m for all graph instances. We will show that this choice of ∆ may not be optimal for all graph classes and weight distributions.
On the MTA-2, it would be meaningless to compare the performance of our optimized parallel implementation with the sequential DIMACS solver. Parallelizing the reference solver to run on a massively multithreaded machine such as the MTA-2 is a non-trivial task. So, we report the execution time of a multithreaded level-synchronized breadth-first search [5] , optimized for low-diameter graphs. The multithreaded BFS scales as well as δ-stepping for the core graph families, and the execution time serves as a lower bound for the shortest path running time.
On a sequential processor, we execute the BFS and shortest path codes on all the core graph families, for the recommended problem sizes. However, for parallel runs, we only report results for sufficiently large graph instances in case of the synthetic graph families. We parallelize the synthetic core graph generators and port them to run on the MTA-2.
Our implementations accept both directed and undirected graphs. For all the synthetic graph instances, we report execution times on directed graphs in this paper. The road networks are undirected graphs. We also assume the edge weights to be distributed in [0, 1] in the ∆-stepping implementation. So we have a pre-processing step to scale the integer edge weights in the core problem families to the interval [0, 1], dividing the integer weights by the maximum edge weight.
The first run on the MTA-2 is usually slower than subsequent ones (by about 10% for a typical ∆-stepping run). So we report the average running time for 10 successive runs. We run the code from three randomly chosen source vertices and average the running time. We found that using three sources consistently gave us execution time results with little variation on both the MTA-2 and the reference sequential platform. We tabulate the sequential and parallel performance metrics in Appendix B, and report execution time in seconds. If the execution time is less than 10 −3 seconds, we round the time to four decimal digits. If it is less than 10 −2 seconds, we round it to three digits. In all other cases, the reported running time is rounded to two decimal digits.
Results and Analysis
Sequential Performance
First we present the performance results of our implementation on the reference sequential platform for the core graph families. The BFS, ∆-stepping, and reference DIMACS implementation execution times on the recommended core graph instances are given in Section B.1. We observe that the ratio of the ∆-stepping execution time to the Breadth-First Search time varies between 3 and 10 across different problem instances. Also, the DIMACS reference code is about 1.5 to 2 times faster than our implementation for large problem instances in each family. Table 1 summarizes the performance for random graph instances. For the Random4-n family, n is varied from 2 11 to 2 21 , the maximum edge weight is set to n, and the graph density is constant. For the largest instance, ∆-stepping execution time is 1.7 times slower than the reference implementation and 5.4 times the BFS execution time. For the Random4-C family, we normalize the weights to the maximum integer weight. We do not observe any trend similar to the reference implementation, where the execution time gradually rises as the maximum weight increases.
The sequential performance of ∆-stepping on Long grid graphs (Table 2 ) is similar to that on Random graphs. However, the reference implementation is slightly faster on long grids. For square grids and road networks, the ∆-stepping to BFS ratio is comparatively higher (for e.g., BFS to ∆-stepping ratio is 4.71 for the largest Square-n graph, and 3.74 for the largest Random4-n graph) than the Random and Long grid families.
Figs. 1 and 2 summarize the key observations from the tables in Section B.1. Comparing execution time across graphs of the same size in Fig. 1 , we find that the ∆-stepping running time for the Random4-n graph instance is slightly higher than the rest of the families. The ∆-stepping running time is also comparable to the execution time of the reference implementation for all graph families. Fig. 2 plots the execution time normalized to the problem size (or the running time per edge) for Random4-n and Long-n families. Observe that the ∆-stepping implementation execution time scales with problem size at a faster rate compared to BFS or the DIMACS reference implementation.
∆-stepping analysis
To better understand the algorithm performance across graph families, we study machineindependent algorithm operation counts. The parallel performance is dependent on the value of ∆, the number of phases, the size of the request set in each phase. the light request set in each phase, for each core graph family. ∆ is set to 0.25 for all runs.
If the request set size is less than 10, it is not plotted. Consider the random graph family ( Fig. 3(a) ). It executes in 84 phases, and the request set sizes vary from 0 to 27,000. Observe the recurring pattern of three bars stacked together in the plot. This indicates that all the light edges in a bucket are relaxed in roughly three phases, and the bucket then becomes empty. The size of the relax set is relatively high for several phases, which provides scope for exploiting multithreaded parallelism. The relax set size plot of a similar problem instance from the Long grid family (Fig. 3(b) ) stands in stark contrast to the random graph plot. It takes about 200,000 phases to execute, and the maximum request size is only 15. Both of these values indicate that our implementation would fare poorly on long grid graphs. On square grids (Fig. 3(c) ), ∆-stepping takes fewer phases, and the request set sizes go up to 500. For a road network instance (NE USA-road-d, Fig. 3(d) ), the algorithm takes 23,000 phases to execute, and only a few phases (about 30) have request set counts greater than 1000. Fig. 4 plots several key ∆-stepping operation counts for various graph classes. Along with the core graph families, we include ScaleFree4-n, RandomlogUnif4-n, and LonglogUnif4-n graph classes. All synthetic graphs are roughly of the same size. Fig. 4(a) plots the average shortest path weight for various graph classes. Scale-free and Long grid graphs are on the two extremes. A log-uniform edge weight distribution also results in low average edge weight. The number of phases (see Fig. 4(b) ) is highest for Long grid graphs. The number of buckets shows a similar trend as the average shortest path weight. Fig. 4(d) plots the total number of insertions for each graph family. The number of vertices is 2 20 for all graph families (slightly higher for the road network), and so ∆-stepping results in roughly 20% overhead in insertions for all the graph families with random edge weights. Note the number of insertions for graphs with log-uniform weight distributions. ∆-stepping performs a lot of excess work for these families, because the value of ∆ is quite high for this particular distribution.
We next evaluate the performance of the algorithm as ∆ is varied (tables in Section B.2). Fig. 5 plots the execution time of various graph instances on a sequential machine, and one processor of the MTA-2. ∆ is varied from 0.1 to 10 in each case. We find that the absolute running times on a 3.2 GHz Xeon processor and the MTA-2 are comparable for random, square grid and road network instances. However, on long grid graphs (Fig. 5(b) ), the MTA-2 execution time is two orders of magnitude greater than the sequential time. The number of phases and the total number of relaxations vary as ∆ is varied (Tables 5 and 6 ). On the MTA-2, the running time is not only dependent on the work done, but also on the number of phases and the average number of relax requests in a phase. For instance, in the case of long grids (see Fig. 5(b) , with execution time plotted on a log scale), the running time decreases significantly as the value of ∆ is decreased, as the number of phases reduce. On a sequential processor, however, the running time is only dependent on the work done (number of insertions). If the value of ∆ is greater than the average shortest path weight, we perform excess work and the running time noticeably increases (observe the execution time for ∆ = 5, 10 on the random graph and the road network). The optimal value of ∆ (and the execution time on the MTA-2) is also dependent on the number of processors. For 
Parallel Performance
We present the parallel scaling of the ∆-stepping algorithm in detail (see Section B.3). We ran ∆-stepping and the level-synchronous parallel BFS on graph instances from the core families, scale-free graphs and graphs with log-uniform edge weight distributions. Define the speedup on p processors of the MTA-2 as the ratio of the execution time on 1 processor to the execution time on p processors. Since the computation on the MTA-2 is thread-centric rather than processor-centric, note that the single processor run is also parallel. In all graph classes except long grids, there is sufficient parallelism to saturate a single processor of the MTA-2 for reasonably large problem instances.
As expected from the discussion in the previous section, ∆-stepping performs best for lowdiameter random and scale-free graphs with randomly distributed edge weights (see Fig. 6  and 7 ). We attain a speedup of approximately 31 on 40 processors for a directed random graph of nearly a billion edges, and the ratio of the BFS and ∆-stepping execution time is a constant factor (about 3-5) throughout. The implementation performs equally well for scale-free graphs, that are more difficult to handle due to the irregular degree distribution. The execution time on 40 processors of the MTA-2 for the scale-free graph instance is only 1 second slower than the running time for a random graph and the speedup is approximately 30 on 40 processors. We have already shown that the execution time for smaller graph instances on a sequential machine is comparable to the DIMACS reference implementation, a competitive NSSP algorithm. Thus, attaining a speedup of 30 for a realistic scale-free graph instance of one billion edges (Fig. 7) is a remarkable result. Table 7 gives the execution time of ∆-stepping on the Random4-n family, as the number of vertices is increased from 2 21 to 2 28 , and the number of processors is varied from 1 to 40. Observe that the relative speedup increases as the problem size is increased (for e.g., on 40 processors, the speedup for n = 2 21 is just 3.96, whereas it is 31.04 for 2 28 vertices). This is because there is insufficient parallelism in a problem instance of size 2 21 to saturate 40 processors of the MTA-2. As the problem size increases, the ratio of ∆-stepping execution time to multithreaded BFS running time decreases. On an average, ∆-stepping is 5 times slower than BFS for this graph family. Table 8 gives the execution time for random graphs with a log-uniform weight distribution. With ∆ set to n m , we do a lot of additional work. The ∆-stepping to BFS ratio is typically 40 in this case, about 8 times higher than the corresponding ratio for random graphs with random edge weights. However, the execution time scales well with the number of processors for large problem sizes. Table 9 summarizes the execution time for the Random4-C family. The maximum edge weight is varied from 4 0 to 4 15 while keeping m and n constant. We do not notice any trend in the execution time in this case, as we normalize the edge weights to fall in the interval [0, 1]. Similarly, there is no noticeable trend in case of the Long-C family (Table 11) . Tables 10 and 12 give the execution time for ∆-stepping for the long and square grid graphs respectively, as the problem size and number of processors are varied. For Long-n graphs with ∆ set to n m , there is insufficient parallelism to fully utilize even a single processor of the MTA-2. The execution time of the level-synchronous BFS also does not scale with the number of processors. In fact, the running time goes up in case of multiprocessor runs, as the parallelization overhead becomes significant. Note that the execution time on a single processor of the MTA-2 is two orders of magnitude slower than the reference sequential processor (Fig. 5(b) ). In case of square grid graphs, there is sufficient parallelism to utilize up to 4 processors for a graph instance of 2 24 vertices. For all other instances, the running time does not scale for multiprocessor runs. The ratio of the running time to BFS is about 5 in this case, and the ∆-stepping MTA-2 single processor time is comparable to the sequential reference platform running time for smaller instances. Tables 13 and 14 summarize the running times on the USA road networks. The execution time does not scale well with problem size, as the problem instances are small. We observe better performance (lower execution time, better speedup) on USA-road-d graphs than on USA-road-t graphs.
Conclusions and Future Work
In this paper, we experimentally evaluate the parallel ∆-stepping NSSP algorithm for the 9th DIMACS Shortest Paths Challenge. We study the algorithm performance for core challenge graph instances on the Cray MTA-2, and observe that our implementation execution time scales impressively with number of processors for low-diameter sparse graphs. We also analyze the performance using platform-independent ∆-stepping algorithm operation counts such as the number of phases, and the request set sizes, to explain performance across graph families.
We would like to further study the dependence of the bucket-width ∆ on the parallel performance of the algorithm. For high diameter graphs, there is a trade-off between the number of phases and the amount of work done (proportional to the number of bucket insertions). The execution time is dependent on the value of ∆ as well as the number of processors. We need to reduce the number of phases for parallel runs and increase the system utilization by choosing an appropriate value of ∆.
We are currently optimizing parallel implementations of Thorup's algorithm for NSSP, and the Bellman-Ford algorithm for solving the single-source shortest paths allowing negative edge weights. For all-pairs shortest path computations, we expect our optimized Thorup implementation to outperform ∆-stepping. We intend to repeat this NSSP experimental study on the MTA-2 with Thorup's implementation in the near future.
Our parallel performance studies have been restricted to the Cray MTA-2 in this paper. We have a preliminary implementation of ∆-stepping designed for multi-core processors and symmetric multiprocessors (SMPs) that we wish to add to this study. We expect the SMP implementation would also perform well on low-diameter graphs.
A The Cray MTA-2
This section is excepted from [6] .
A.1 Architecture
The Cray MTA-2 [17] is a novel multithreaded architecture with no data cache, and hardware support for synchronization. The computational model for the MTA-2 is thread-centric, not processor-centric. A thread is a logical entity comprised of a sequence of instructions that are issued in order. An MTA-2 processor consists of 128 hardware streams and one instruction pipeline. A stream is a physical resource (a set of 32 registers, a status word, and space in the instruction cache) that hold the state of one thread. An instruction is three-wide: a memory operation, a fused multiply-add, and a floating point add or control operation. Each stream can have up to 8 outstanding memory operations. Threads from the same or different programs are mapped to the streams by the runtime system. A processor switches among its streams every cycle, executing instructions from non-blocked streams. As long as one stream has a ready instruction, the processor remains fully utilized. No thread is bound to any particular processor. System memory size and the inherent degree of parallelism within the program are the only limits on the number of threads used by a program.
The interconnection network is a partially connected 3-D torus capable of delivering one word per processor per cycle. The system has 4 GBytes of memory per processor. Logical memory addresses are hashed across physical memory to avoid stride-induced hot spots. Each memory word is 68 bits: 64 data bits and 4 tag bits. One tag bit (the full-empty bit) is used to implement synchronous load and store operations. A thread that issues a synchronous load or store remains blocked until the operation completes; but the processor that issued the operation continues to issue instructions from non-blocked streams.
The MTA-2 is closer to a theoretical PRAM machine than a shared memory symmetric multiprocessor system. Since the MTA-2 uses parallelism to tolerate latency, algorithms must often be parallelized at very fine levels to expose sufficient parallelism. However, it is not necessary that all parallelism in the program be expressed such that the system can exploit it; the goal is simply to saturate the processors. The programs that make the most effective use of the MTA-2 are those which express the parallelism of the problem in a way that allows the compiler to best exploit it.
A.2 Expressing Parallelism on the MTA-2
The MTA-2 compiler automatically parallelizes inductive loops of three types: parallel loops, linear recurrences and reductions. A loop is inductive if it is controlled by a variable that is incremented by a loop-invariant stride during each iteration, and the loop-exit test compares this variable with a loop-invariant expression. An inductive loop has only one exit test and can only be entered from the top. If each iteration of an inductive loop can be executed completely independently of the others, then the loop is termed parallel. To attain the best performance, we need to write code (and thus design algorithms) such that most of the loops are implicitly parallelized.
There are several compiler directives that can be used to parallelize various sections of a program. The three major types of parallelization schemes available are
• single-processor (fray) parallelism: The code is parallelized in such a way that just the 128 streams on the processor are utilized.
• multi-processor (crew ) parallelism: This has higher overhead than single-processor parallelism. However, the number of streams available is much larger, bounded by the size of the whole machine rather than the size of a single processor. Iterations can be statically or dynamically scheduled.
• future parallelism: The future construct (detailed below) is used in this form of parallelism. This does not require that all processor resources used during the loop be available at the beginning of the loop. The runtime growth manager increases the number of physical processors as needed. Iterations are always dynamically scheduled.
A future is a powerful construct to express user-specified explicit parallelism. It packages a sequence of code that can be executed by a newly created thread running concurrently with other threads in the program. Futures include efficient mechanisms for delaying the execution of code that depends on the computation within the future, until the future completes. The thread that spawns the future can pass information to the thread that executes the future via parameters. Futures are best used to implement task-level parallelism and the parallelism in recursive computations.
A.3 Synchronization support on the MTA-2
Synchronization is a major limiting factor to scalability in the case of practical shared memory implementations. The software mechanisms commonly available on conventional architectures for achieving synchronization are often inefficient. However, the MTA-2 provides hardware support for fine-grained synchronization through the full-empty bit associated with every memory word. The compiler provides a number of generic routines that operate atomically on scalar variables. We list a few useful constructs that appear in the algorithm pseudo-codes in subsequent sections.
• The int_fetch_add routine (int_fetch_add(&v, i)) atomically adds integer i to the value at address v, stores the sum at v, and returns the original value at v (setting the full-empty bit to full). If v is an empty sync or future variable, the operation blocks until v becomes full.
• readfe(&v) returns the value of variable v when v is full and sets v empty. This allows threads waiting for v to become empty to resume execution. If v is empty, the read blocks until v becomes full.
• writeef(&v, i) writes the value i to v when v is empty, and sets v back to full. The thread waits until v is set empty.
• purge(&v) sets the state of the full-empty bit of v to empty.
B Tables
B.1 Sequential performance of ∆-stepping implementation on the reference platform (a) Random4-n core family. Problem instance denotes the log of the number of vertices. A directed random graph of n vertices, m = 4n edges, and maximum weight C = n. Table 1 : Sequential performance (execution time in seconds, and normalized performance with reference to the baseline BFS) of our implementation for the core random graph families. Table 2 : Sequential performance (execution time in seconds, and normalized performance with reference to the baseline BFS) of our implementation for the core long grid graph families.
(a) Square-n core family. Problem instance denotes the log of the grid x dimension. x = y and m n ≃ 4. Table 3 : Sequential performance (execution time in seconds, and normalized performance with reference to the baseline BFS) of our implementation for the core square grid graph families. Table 7 : MTA-2 performance (execution time in seconds, normalized performance with reference to the baseline BFS, relative speedup) of our implementation on Random4-n graphs. Problem instance denotes log of the number of vertices. p denotes the number of processors. m = 4n edges, and maximum weight C = n. Table 8 : MTA-2 performance (execution time in seconds, normalized performance with reference to the baseline BFS, relative speedup) of our implementation for RandomLogUnif4-n graphs. Problem instance denotes the log of the number of vertices. p denotes the number of processors. m = 4n edges and maximum weight C = n. Table 10 : MTA-2 performance (execution time in seconds, normalized performance with reference to the baseline BFS) of our implementation on Long-n graphs. Problem instance denotes the log of the rectangular grid x dimension. p denotes the number of processors, y = 16, n = xy, m ≃ 4n edges, and maximum weight C = n. Table 11 : MTA-2 performance (execution time in seconds, normalized performance with reference to the baseline BFS) of our implementation on Long-C graphs. Problem instance denotes the log of the maximum edge weight. p denotes the number of processors. The grid dimensions are given by x = 2 14 and y = 16. Table 12 : MTA-2 performance (execution time in seconds, normalized performance with reference to the baseline BFS) of our implementation on Square-n graphs. Problem instance denotes the log of the number of the grid dimension x. p denotes the number of processors. x = y, n = xy, m ≃ 4n edges, and maximum weight C = n. Table 13 : MTA-2 performance (execution time in seconds, normalized performance with reference to the baseline BFS) of our implementation on USA core road graphs (distance is the length function). p denotes the number of processors. Table 14 : MTA-2 performance (execution time in seconds, normalized performance with reference to the baseline BFS) of our implementation on USA core road graphs (transit time is the length function).
Problem
