Shortest path algorithms are key elements of many graph problems. They are used in such applications as online direction finding and navigation, and modeling of traffic for large scale simulations of major metropolitan areas. As shortest path algorithm are execution bottlenecks, it is beneficial to move their execution to parallel hardware such as FieldProgrammable Gate Arrays (FPGAs).
Introduction
Transportation infrastructure is a highly complex system, where hundreds of thousands of vehicles traverse vast road networks. The ability to understand and predict responses of the network to perturbations of the traffic flow is of great value to planners. Exploring the effects of various changes to the infrastructure provides understanding of what to expect when similar changes happen in reality. TranSIMS-LANL is a Los Alamos National Laboratory project aimed at understanding many aspects of infrastructure, including traffic, energy consumption, land use planning, traffic safety, and emergency evacuation [2] .
Part of the TranSIMS data is based on long-form census reports. These reports provide detailed information about the behaviors of large segments of a population. Given data on where a person lives, works, goes after school, etc., we can know where in the city grid they will be at any given time. As they have to get from one place to another, we also know roughly what routes they will travel. Thus, we can model the transportation network at any given point of the day, given certain assumptions and an appropriate choice of routing model.
The current routing algorithm we are using in hardware is a basic shortest path approach. This does not take into account loading on the freeways, but is less data intensive and easier to calculate in parallel. However, it is sufficient to model many types of travel plans. Some of the questions that can be answered are:
time of the day to divert traffic onto side roads, and how will it affect those roads?
• Given a natural disaster, what is the best way to evacuate a city?
The computational cost of these simulations is high, ranging into days. This reduces the amount of science that can be completed. Hardware acceleration provides a potential path for improving the productivity of infrastructure scientists. The traffic "micro-simulator" of TranSIMS have already been implemented [11] . However, the routing problem remains unaddressed. Road and epidemiological graph data is highly sparse, providing a challenging problem to hardware designers. Sparse graphs have very little non-zero data, allowing for highly efficient compressed representations. However, determining shortest paths from a sparse graph is difficult because of the irregular data access patterns of algorithms and the tendency of some solution techniques to fill the connection matrix.
The sparseness of the Los Angeles infrastructure graph in question has a connectivity of 2.75 in 560,000 intersections.
The use of a parallel single-source shortest path approach to solving the all-pairs or many-pairs shortest path problem can leverage a sparse representation of a graph. Because each shortest path problem is only concerned with one set of edges, it is does not damage the sparsity of the graph during the solution. The approach we take is similar to the A* strategy [7] :
In this formulation, distance(v, dest) is the Euclidean distance from the endpoint of the edge (u, v) to the destination. This is calculated from the longitude and latitude stored for each node. In the A* algorithm, a "heuristic estimation" is used instead of the weight of the edge. In this case, w(u, v) = path distance + d(v) is distance traveled so far plus the Euclidean distance d(u, v) to the objective.
The interest in using A* over Dijkstra's [6] is that it can utilize the knowledge of the graph we already have -that is, latitude and longitude, to predict a node's distance from an objective point. In a general graph G(v, e, w) this sort of short-cutting is impossible, but we have a map graph (G(v1, e, w, lat, long)), which can reduce unnecessary graph exploration.
In order for this approach to be as efficient as possible, several assumptions are required:
• total number of paths > 1000
• extract-min should require ≈ #units * time for memory fetch
The cost of extract-min, in cycles, is roughly the number of paths in the set Q of possible paths. This number can increase in an exponential manner (depending on the graph) during the execution of the shortest path.
Equivalently, #units ≈ time(extract-min) / time for memory fetch, meaning that if the number of units able to fit on the device is too small, the memory bandwidth will not be used to its maximum capacity.
The objective of this work is to maximize the possible throughput ( The objective is for this to maximize the usage of the available random access bandwidth as well, Because Dijkstra's algorithm and A* are only concerned with the new nodes, and the structure of the graph determines how new nodes present themselves, we must simulate to develop an understanding of the memory access patterns of the system. For instance, a tree that is explored using Dijkstra's from the root will always present new nodes. A lattice (as in a regularly laid-out road network) is easily predictable in terms of the A* traversal. At each node, the choice of the next node to explore is generally the edge that leads in the direction of the destination. However, few cities are that simple. Freeways, bridges, rivers, and undeveloped areas can make route finding, and the data access patterns associated with them, more complex.
In this work, hardware implementation is accomplished through the use of a small A* core replicated on the order of 20 times on an FPGA device. The objective is to maximize the use of on-board randomaccess memory bandwidth through the use of multithreaded latency tolerance. Each shortest path core is responsible for one shortest path calculation, and when it is finished it outputs its result and requests the next source from a queue.
Because the graph database size is measured in 3 the hundreds of megabytes, the XD-1's SRAM memory is insufficient. In addition to the A* cores, we have developed a memory management system allowing round-robin servicing of the nodes as well as an SRAM cache memory managed over the Hypertransport bus.
Related Work
Some other work has been done in exploring some of the hardware components used in this work. One, [8] , deals with a fast heap implementation, and an implementation of Dijkstra's that uses a full hardware priority encoder [10] . In both situations, the priority encoder requires a large percentage of the slices available. This prevents the designers from using a larger number of units to process shortest paths in parallel. Because we are feasibly prioritizing thousands of distance measures, it is infeasible to implement this in an O(1) full-parallel hardware solution. By spreading the extract-min operation over O(n) cycles with O(1) hardware, we take advantage of the round-robin memory scheduler's graph request latency while simultaneously fitting far more hardware on the device.
The Floyd-Warshall architectures, such as [3] , are only appropriate for dense matrices. Because the all-pairs problem demands a solution from every node to every other node, the benefits of the sparse matrix are lost as the algorithm runs. The dynamic programming solution fills in the entire cost matrix D, as follows:
Because of the matrix-filling property of FloydWarshall's, it is pointless to use a sparse-matrix data structure. While the matrix starts out sparse, it becomes completely filled. The sparse structure has high overheads in terms of random-access to elements. Thus, the combination of an All-pairs solution and a sparse matrix is not ideal. The parallel single-source problem, run over all required paths is able to leverage the sparse graph more effectively.
Introduction to Our Approach
Determining the next edge to explore is a key element of the A* algorithm, and the running time of the algorithm is largely determined by the data structure used to provide this information. Software implementations use a Fibonacci Heap to provide O(log(n)) average performance. However, this is an amortized performance achieved through a complicated data structure that would be difficult to implement well in hardware. There have been implementations of priority queues [8] but they require many device resources.
The Fibonacci heap is the key to an efficient software implementation. However, based on our simulations it seemed that the Fibonacci heap might be providing more features than is needed, at a cost that is undesirable. The main observation is that the heap can grow without bound during execution of the A* algorithm. Edges are potentially added whenever a new node is explored, but queue elements are only removed when they are explored. This leads to a queue that can grow into the thousands of elements for a large graph. However, after simulation, we noticed that the size of the queue -within limits -does not affect the correctness of results.
For any route, given a start, destination, and graph of the road network, there is a certain size of queue that is required. The purpose of the queue, going back to the A* algorithm, is to provide some ability to backtrack. The A* algorithm repeatedly performs the extract-min() operation on the heap, allowing the best greedy route to be followed. This is fine for a regular grid, but in a real city, with complications such as housing developments, rivers, and freeways, the best greedy route is often not able to be completed. For instance, see Figure 2 , which is a close-up of the example route shown in Figure 1 . The best greedy route leads the algorithm into a dead-end. The heap is then consulted for the next best option. This may mean that all of the nodes leading into the dead-end will be explored. Because the algorithm only knows about the nodes in the queue, it cannot backtrack if the queue is not big enough. A subdivision, for instance, with one entry point but many internal streets, can prevent the routing from completing. Our tests of the Los Angeles network, over 150 randomly selected start/destination pairs, show that a queue size of 16 elements is suffi-4 
Simulation Results
In tests of the Los Angeles dataset using A*, the maximum heap size required was 4,926 over a set of randomly selected origin/destination pairs. With a 20 bit street identifier (there are 560,000 some street segments in Los Angeles. There may be lots of "slip" roads that can be removed via optimization) and a 16 bit distance, we will require about 10 block RAMs. This results means we can fit at least 20 shortest path units on a chip, depending on the device. Figure 3 shows the number of completions and accuracy versus an infinite buffer as the queue size is reduced. Even while the accuracy of the returned route is reduced, the algorithm still completes the majority of the cases until the queue size is reduced to 8, which is the equivalent of 3-4 intersections worth of streets.
These simulation tests have led us to a novel modification to the A* algorithm that makes it ideal for hardware implementation. Rather than an unnecessarily complex hardware Fibonacci heap implementation, we have developed a simple bubble sort core that provides one extract-min() operation every k cycles, where k is the depth of the queue. Bubble sort would not normally be an efficient method for providing min-heap functionality. The average behavior is worse than a randomized sorting method, like quick-sort. However, the hardware resources required to implement it are very low, consisting of a single register and comparator.
The algorithm only requires one minimum element per exploration time step. Another option would be to just remove the minimum element from the queue, which would not require any write-back to the queue. While this would be somewhat simpler, as the "bubbling" action does require a write-back port on the memory, it would cause gaps in the queue where elements have been removed. Additionally, the bubbling action has the additional benefit of pushing large elements toward the tail of the queue. As large elements are less likely to be used, there is little problem if they are overwritten as the queue loops. Figure 4 illustrates the basic behavior of the queue system. As a node is explored, its edges are added to the queue at the head end. If the queue is full, the head pointer can force the tail pointer to move backwards, thereby pushing the oldest elements out of the queue. Not only are the elements near the tail pointer old, they are also the most likely to be farther from the destination. This is partly due to the bubbling action of the extract-min(), and partly due to the behavior of the A* algorithm itself. In each exploration phase, the algorithm attempts to explore nodes closer to the destination. Thus, the most recently added edges are both close to the head pointer and more likely to be selected as the next minimum element extracted from the queue. These behaviors combine to reduce the chance of useful element being overwritten when the queue overwrites data.
The infrastructure graphs used in our application are highly sparse, averaging 2.75 edges per node. Earlier work in handling sparse-graph algorithms [5] has attempted to mitigate the problems of the memory wall by distributing BRAM-stored data with a lightweight on-FPGA network. However, due to the size of the infrastructure graphs, BRAM does not provide sufficient storage. Thus, the data for our application must be paged in from DRAM to an SRAM cache.
Architecture
The efficient bubble sorting technique allows for more units per device. The delay of the min-bubble operation is mitigated through the same multi-threading technique that allows the SRAM (and, occasionally, the DRAM-paged cache memory manager) latency to be ignored. By having several dozen A* units working in parallel, an SRAM request for one unit is handled while another unit is busy sorting its queue. This allows the SRAM bandwidth to be maximized -the round-robin memory controller can provide a new data request on every SRAM data cycle, because the A* units immediately sort their queue as soon as a data request is completed and then wait for their next access graph memory.
The Cray XD-1 supercomputer [4] combines high performance microprocessors, a high speed, low latency interconnect network, and reconfigurable computing elements. This provides an environment where data transfer latencies and bandwidth associated with I/O busses is greatly reduced. The tight integration of processors and reconfigurable computing changes the meaning of reconfigurable super-computing. A supercomputing problem can be split between the CPU and FPGA with close synchronization and fast communication between software and hardware. Despite the large amount of bandwidth on the RapidArray network between the FPGA and the Opteron CPUs, FPGA to SRAM bandwidth is still higher than the DRAM to FPGA bandwidth. Thus, it is beneficial to attempt reduce the required amount of data traffic.
The method for accessing the FPGA's SRAMs is shown in Figure 5 . The multiplexor (mux) between the traffic engine and the RapidArray Transport (RT) core allows the host to write to the SRAM. Reads do not require a mux for the data, but both read and write require muxes for the address lines (not shown). Using this setup we benchmarked reads and writes from the host to the FPGA.
The Los Angeles road graph has roughly 1,421,000 edges. In our data structure representation, this occupies about 21 MB. Thus, the 16 MB QDRII SRAM banks on the XD-1 will not be able to hold the entire graph at once. While the Los Angeles graph is only slightly larger than the available SRAM, larger graphs are always available. Whenever the SRAM does not have the particular graph block required, it starts the software-controlled cache memory system. This is illustrated in Figure 6 . Because the XD-1 is much faster at softwarecontrolled block DRAM transfers to the SRAM than FPGA-initiated DRAM requests, we control the SRAM-cached memory system using the CPU. This allows the software to implement more complex page replacement policies and handle the file I/O. The round-robin memory controller switches to a new A* unit and determines if there is an outstanding request, requiring one cycle. If a memory request is ready, the memory controller looks at the high-order bits and compares it against a set of Content-Addressable Memories (CAM) entries. The CAM is a common memory type for full-associative caches because it allows memory blocks to be placed anywhere in memory without being restricted by a hard mapping between virtual and physical addresses.
The comparison is accomplished during the next cycle after the outstanding address request is valid. If the data is already in memory, the CAM returns the appropriate offset into the SRAM banks and puts the request on the memory address bus. The the SRAM has a guaranteed latency of 20 cycles (this includes some pipeline stages to meet timing across the FPGA). Thus, the memory controller can signal to the A* unit that it can expect its data on the bus after 20 cycles. Because the latency is guaranteed, the memory controller is no longer required to maintain control over the requesting unit and can go on to process the next unit. Given the necessary blocks already in memory, the 24 units can be serviced in 48 cycles.
If the data block is not already in SRAM memory, the memory controller switches into the DRAM mem- ory request mode. Because there is only one memory bus to the SRAM, the A* units can continue sorting, but outstanding requests will block until the requested data is paged in from DRAM. The controller writes a register detailing that it requires memory service and the block required. The CPU then initiates a DMA transfer from DRAM to the SRAM and tells the memory controller which cache block was written.
As in all systems, DMA transfers to and from DRAM are expensive operations compared to normal SRAM-based memory fetches. Fortunately, the graph data is laid out in blocks such that there is temporal and spatial locality in the data accesses. This makes the cache behavior more efficient than if the data was laid out randomly in memory.
In order to reduce the overall data transfers, we do not guarantee completely accurate behavior is all situations. In case of failure, for whatever reason, the CPU handles the shortest path calculation for the failing unit. This is detected by unit timeout, essentially a heartbeat. As the software is responsible for offloading the completed shortest path, it is aware of a unit that is taking too long to complete a path. One of the situa- tions in which non-deterministic behavior can result is when the pages are swapped. While cache pages are written to in order to provide the "explored" flag, we do not provide support for writing SRAM blocks back to DRAM. With 64 cache blocks and the round robin memory scheduling, the last two blocks any given unit has accessed are likely to stay in memory. The spatial locality of the graph data means that accesses to memory areas from far in the past are unlikely, especially outside of the last two blocks accessed. The probability of an "explored" flag being relevant from a paged out block is highly unlikely. Regardless, in the case these assumptions fail, the CPU can handle the hardware failure. The Cray XD-1 has 16 MB of SRAM available to the FPGA. We address this as a unified single bank with many cache blocks addressed from a single memory manager.
Each edge request (two 64-bit words) coming out of the QDRII SRAM is laid out as illustrated in Table 1: Each memory word is contains enough information to provide the data inputs to the queue. Latitude and longitude is used to compute the distance to the objective. This is necessary to provide the queue's sorting ability. The address provides the location in memory for the node in question. At that address, the Num Edges following memory words are the edges connected to the node. The 5-bit Explored field allows each node to be marked with a unit ID concatenated with a completed-route counter. As the nodes are explored, the field is marked. If a A* unit sees its own ID, it will assume it has already seen the node and will not add the edge to the queue. The completed-route counter serves in case the same SRAM block is used for multiple consecutive routes on the same unit; there is a possibility that the node would be marked and thus skipped because the previous routing had used it. Figure 7 illustrates the effect of SRAM cache size on the number of page replacements for 1000 random Table 2 illustrates the effect of various SRAM block sizes on the total memory words transfered. Because the XD-1 only has 16 MB of 64 bit words (each edge requires two words), there are limited options for the block sizes. The V2Pro 60 device can support 24 A* processing elements. As the memory manager services the A* units in a round-robin fashion, there must be at least 24 blocks. We tested 16, 32, 64, 128, and 256 blocks, corresponding to block sizes of 65313, 32656, 16328, 8164, and 4082. This assumes, of course, that the whole memory is used for caching the graph; in practice a few pages are left for storing routing results. For each size, we tested two replacement policies, a round robin approach that is very easy to implement, and a last-used policy. The last-used approach uses more data to make its replacement decision than the round-robin policy. If the block of data isn't currently being used by any of the A* units, it is a sensible block to replace.
After experimentation with a set of random origin/destination pairs, we found that the last-used replacement policy with 64 blocks is the most effective methodology. The cost of starting a transfer is fairly high as it requires software reads of FPGA registers, thus one large transfer costs less than two transfers of one-half the size. As is clear from 
Performance Results
Using the Cray XD-1 with a Virtex 2-Pro 50 -7, we placed and routed the design with 24 A* units at 130 MHz. This clock rate is less due to the design of the A* units and more to the SRAM interfaces and arrangement of the pin constraints for the FPGA board.
Area is 18165 in slices, 48 of the 232 multipliers, and 112 of the 232 BRAMs. Table 3 contains the performance results for a CPU implementation and the DRAM-based SRAM-cached implementation. The results are based on a small set of randomly selected origin/destination pairs, with roughly 2 million accesses to the graph memory. The software implements the same approach as used in hardware, including the limited size queue. This allows the software to use the buffer size optimization. However, the code is not multi-threaded, which reduces the ability of the software to hide the memory access latency. One of our future goals is a multi-threaded software implementation for a massively multi-threaded system such as the Cray XMT platform.
The DRAM-based results are based on the full system implementation, including the memory manager that caches graph memory to the SRAM. The load time of the SRAM blocks is the main bottleneck for this implementation, and largely determines the running time and overall performance. The speedup of 50x over the CPU is lower than we had expected at the beginning of the project, but the data access costs are significantly higher with the addition of the caching SRAM design, that the overall system performance was severely impacted.
The speedup is largely due to the FPGA system's ability to run many threads simultaneously with their own private memories, and support a very large SRAM cache. Each of the A* units performs the sorting in its own private memory store that is not shared with the other units. The only shared memory accesses are the requests for new graph edges. The CPU must rely on cache memory for all of its operations, severely reducing its ability to execute multiple threads without waiting for memory accesses.
A multi-core CPU can simultaneously perform the sorting operations, but modern devices cannot provide the parallelism available on the FPGA. The SRAM cache on the FPGA is never polluted by operating system requirements. As well, it is significantly larger than most CPU caches.
While the architecture could easily be implemented in a custom ASIC -in fact, the simple units that make up the systolic array are designed explicitly for ease of ASIC implementation -the use of FPGA allows the user to utilize parameterized designs which allow for variable queue depth as well as optimized memory sizes for a particular problem. As well, FPGAs allow the design to be scaled upward easily as process technology allows for ever-larger gate counts.
Conclusion
This paper contributes two main advances to the field of shortest-path computation. One, the bubblesorted queue, is a restriction that is only possible because of the limited branching structure of transportation grids. However, it enables a much smaller hardware implementation of the A* algorithm than a Fibonacci heap, allowing a much higher level of parallelism. This parallelism allows for the memory bandwidth of the XD-1 to be utilized more effectively than if shortest paths were calculated sequentially.
The second contribution is the DRAM-paged SRAM-based cache memory round-robin manager. This is obviously not novel in terms of terms of computer architecture, but its implementation on the XD-1 does provide a capability that is useful to many ap- plications where data access is random over a large, sparse database. The next step in improving the modeling capabilities of this architecture is to add support for congestion. By keeping track of the load on any given road segment during a particular time-step, the system can weight edges such that heavily trafficked arteries are less desirable. A* normally sorts on the distance to the destination, but that can easily be changed to a weighted distance based on the congestion and average rates of a particular road link. This will provide more accurate modeling, as cars will be attracted to freeways until they too become congested, and otherwise route toward fast routes and away from traffic. Additionally, improving the cache performance through other graph representations and algorithmic strategies( [9] ) may improve overall application performance.
In [11] the traffic "micro-simulator" of TranSIMS was implemented. The ultimate goal of this work is to integrated the two components into a seamless whole that makes use of the high-speed interconnect of the Cray box to provide communication between the FPGA components. This will allow the agents to actually make decisions based on "real-time" data within the traffic system.
Another interesting component of research may be to implement the algorithm on the Cray XMT Platform ("Eldorado") [1] . A single device on the XMT can support up to 128 simultaneous threads, allowing the implementation of a very similar multi-threaded approach to what is implemented in this work on FPGA.
