In this paper, we present a technique to reduce the run-time memory footprint of FPGA routing algorithms. These algorithms require a representation of the physical routing resources and programmable connections on the device; this representation dominates the storage requirements of FPGA routers. We show that by taking advantage of the tile-based nature of FPGAs, we can reduce the amount of information that must be explicitly represented, leading to significant memory savings.
To make our proposal concrete, we applied it to the routing algorithm in VPR and quantified the impact on run-time memory footprint and on place-and-route compile time. We found that a memory reduction of 5X to 13X could be achieved at a routing runtime penalty of 2.26X and an overall place-and-route runtime penalty of 1.28X.
Introduction
Advancements in process technology have allowed for a dramatic increase in the capacity of commercial FPGA devices and this scaling is continuing at a steady pace. This places increasing demands on the accompanying FPGA CAD tools, resulting in larger run-time memory requirements, and longer compile times.
Reducing the compile time for CAD tools is an active area of research. Approaches include reducing the solution space explored [1] [2], parallelization [3] [4], and hardware acceleration [5]. These works generally do not consider the memory footprint of the CAD tools. Unlike a CAD tool targeting an ASIC, a CAD tool targeting an FPGA must be aware of every programmable switch and routing resource on the device. As devices grow, the memory footprint needed to store this information grows. Already, CAD tools from some vendors require 64-bit machines to map circuits to large devices due to the memory requirements of the placement and routing algorithms. Other vendor tools run on 32-bit machines, but increasing FPGA size will mean even these will soon require larger, more expensive workstations.
Although the memory capacity of workstations is constantly increasing, reducing the memory footprint of FPGA CAD tools is important for a number of reasons.
One of the most memory-intensive steps in the FPGA CAD flow is routing [6]. The goal of the router is to use the prefabricated routing resources to implement each connection in a circuit. This requires a map of all routing resources available on the FPGA. This map is commonly represented using a directed graph called the Routing Resource Graph (RRG). Typically, the nodes in the graph represent wires and pins on the FPGA, and edges represent programmable connections. As the number of programmable elements and wiring tracks increases, this graph grows very quickly in size, and tends to be a dominating factor in the overall memory footprint of the FPGA CAD flow.
In this paper, we use this insight to reduce the memory footprint of an FPGA routing algorithm. The key is a new representation of the physical routing resources on an FPGA. FPGAs typically consist of a number of similar tiles, and the resources within each tile are the same. Our approach is to explicitly represent the connections within each tile type using a graph similar to an RRG, adding edges to represent connections between tile types. This allows us to significantly reduce the run-time memory footprint of the routing step, which in turn reduces the overall memory requirements of the FPGA CAD flow. This representation, however, requires changes to the search algorithm at the heart of the FPGA router; these changes increase the runtime of the routing algorithm. Thus, there is a trade-off between memory footprint and run-time.
To make our ideas concrete, we have implemented them in the commonly used academic place and route tool VPR [7], and quantified the impact on memory footprint and runtime. Our modified version of the VPR tool is available to the research community.
This paper is organized as follows. Section 2 gives background information on how physical routing information is represented in a routing resource graph. Section 3 then examines how the various detailed routing architecture parameters affect the RRG size. Section 4 presents the new routing resource graph representation and Section 5 then describes necessary changes to the routing algorithm. Section 6 reports the impact on memory footprint. Section 7 reports the impact on compile time. Finally, Section 8 comments on the impact on the routing solution.
Background
In this section, we describe how physical data is represented in an FPGA router, and show how previous researchers have attempted to reduce the storage required for this data. Although our discussions focus on the VPR router, other FPGA routers would have similar storage requirements.
Routing Resource Graphs
Two types of data must be maintained during routing. The first is a map of the physical segments and programmable switches in the architecture. The second is temporary data that is maintained during the routing process (in the Pathfinder algorithm, for example, this would include the occupancies, current cost, history cost, and base cost of each routing segment [8]). Some routers, such as VPR, combine this information into a single structure.
The map of the physical segments and programmable switches can be represented by a Routing Resource Graph (RRG). In the VPR router, the RRG is a directed graph where each wire and logic pin on the FPGA is represented by a node. In addition to wire and pin nodes, there are also source and sink nodes to model pins that are logically equivalent. Programmable connections between the resources are represented by directed edges. Directional connections such as tristate buffers are modeled by a single edge, and bidirectional connections such as pass transistors are modeled by a pair of edges. Figure 1 shows an example of the routing resource graph for one logic block tile containing a 2-input lookup table (LUT) and a channel width of 2. In the VPR router, each node of the routing resource graph includes not only connectivity information, but also coordinates for mapping each node to its corresponding physical resource, information for the timing model (capacitance and resistance of each physical resource), and the capacity of each node. In addition, in VPR, each node contains temporary information for the router, including the current cost, history cost, base cost, and current occupancy of each node.
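The node layout described above can be sketched as follows. This is a minimal illustration rather than VPR's actual data structure: the class names (`RRNode`, `RoutingResourceGraph`), the field names, and the helper `build_example` are all assumptions made for this example. It only shows how static architecture data and dynamic router data can share one node record, and how directional and bidirectional switches map to one edge and a pair of edges respectively.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class RRNode:
    # Static data: describes the physical resource (hypothetical field names).
    kind: str                 # "WIRE", "PIN", "SOURCE" or "SINK"
    x: int                    # coordinates of the physical resource
    y: int
    capacity: int = 1
    resistance: float = 0.0   # timing-model data
    capacitance: float = 0.0
    fanout: List[int] = field(default_factory=list)  # indices of destination nodes
    # Dynamic data: updated by the router while it runs.
    occupancy: int = 0
    current_cost: float = 0.0
    history_cost: float = 0.0
    base_cost: float = 1.0


class RoutingResourceGraph:
    def __init__(self) -> None:
        self.nodes: List[RRNode] = []

    def add_node(self, kind: str, x: int, y: int) -> int:
        self.nodes.append(RRNode(kind, x, y))
        return len(self.nodes) - 1

    def add_buffer(self, src: int, dst: int) -> None:
        """Directional switch (e.g. tristate buffer): a single edge."""
        self.nodes[src].fanout.append(dst)

    def add_pass_transistor(self, a: int, b: int) -> None:
        """Bidirectional switch: modelled as a pair of directed edges."""
        self.add_buffer(a, b)
        self.add_buffer(b, a)

    def edge_count(self) -> int:
        return sum(len(n.fanout) for n in self.nodes)


def build_example():
    """Tiny graph: an output pin driving a wire, and two wires joined by a pass transistor."""
    g = RoutingResourceGraph()
    opin = g.add_node("PIN", 1, 1)
    w0 = g.add_node("WIRE", 1, 1)
    w1 = g.add_node("WIRE", 2, 1)
    g.add_buffer(opin, w0)
    g.add_pass_transistor(w0, w1)
    return g, (opin, w0, w1)
```

In this flat form every physical wire and pin carries its own copy of the static fields, which is why the static portion dominates the storage.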
In the VPR router, the routing resource graph for a 200x200 FPGA with 150 tracks per channel and segment length of four requires 1.6GB of storage. Of this, the majority (1.58GB) is due to the static information that describes the connectivity, while only a small portion (0.02GB) is due to the dynamic information created by the router. This motivates us to focus on reducing the storage requirements for the static information; future work will investigate whether it is possible to improve the efficiency of the dynamic information storage.
Previous Work
There have been only a few papers focusing on memory reduction. The most relevant proposes a Just-in-Time (JIT) compiler for FPGAs, which configures an FPGA when it is about to execute [6]. Their routing algorithm uses a simplified graph where each node represents a switch box, and edges represent routing resources between the switch boxes. Although their technique produces a significant reduction in memory footprint, it only applies to simplified architectures aimed at fast JIT compilation.
Programmable connections are modeled differently depending on whether they are between nodes in the same tile, or between nodes in different tiles. In the following three subsections, we describe each of these separately.
Connections Within the Same Tile Type
For programmable connections between resources that lie within the same tile instance, the edges are modeled in the traditional manner, as shown in Figure 4. (deltax,deltay) values correctly ((-1,0) in this example) the connectivity information is completely preserved.
Connections Between Different Tile Types
Programmable connections between tiles belonging to different tile types can be handled in the same way. However, sometimes the type of an edge's destination tile depends on the coordinates of the tile in the FPGA array. For example, consider a logic block tile at the centre of an FPGA. The neighbouring tiles in this case will be other logic block tiles. On the other hand, a logic block tile on the periphery of the FPGA would have some neighbours that are I/O block tiles. In this case, edges are added from the source node to all potential destination nodes in all potential tile types. This is shown graphically in Figure 6, in which two edges are sourced from the horizontal track in the right-most logic block tile. One edge wraps around to the same logic block tile, while the other edge connects to the corresponding I/O block tile type. The two edges are given the same (deltax,deltay) labels ((-1,0) in this case).
Clearly, in this situation, only one of the two edges corresponds to a physical routing switch for any given tile; which edge depends on the coordinates of the tile. It is up to the routing algorithm to keep track of the current x,y location during maze routing, and only use those edges which correspond to physical switches. More details regarding the router are in Section 5.
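The selection between such alternative edges can be sketched as below. This is an illustration under stated assumptions, not the paper's implementation: the floorplan function `tile_type_at` (an I/O ring around a logic core, with corners also treated as I/O for simplicity), the `TileEdge` record, the node names, and `physical_edges` are all hypothetical. The idea shown is only that an edge is kept when the tile found at (x + deltax, y + deltay) has the tile type the edge points to.

```python
from typing import List, NamedTuple, Optional


class TileEdge(NamedTuple):
    dst_tile_type: str   # tile type that owns the destination node
    dst_node: str        # destination node name within that tile type
    dx: int              # deltax label
    dy: int              # deltay label


def tile_type_at(x: int, y: int, width: int, height: int) -> Optional[str]:
    """Assumed floorplan: I/O tiles on the perimeter, logic tiles inside."""
    if not (0 <= x < width and 0 <= y < height):
        return None
    on_perimeter = x in (0, width - 1) or y in (0, height - 1)
    return "IO" if on_perimeter else "LOGIC"


def physical_edges(edges: List[TileEdge], x: int, y: int,
                   width: int, height: int) -> List[TileEdge]:
    """Keep only the edges that correspond to a physical switch at tile (x, y)."""
    return [e for e in edges
            if tile_type_at(x + e.dx, y + e.dy, width, height) == e.dst_tile_type]


# Two candidate edges leaving a horizontal track in the logic tile type,
# both labelled (-1, 0): one wraps to the logic tile type, one goes to I/O.
HTRACK_EDGES = [
    TileEdge("LOGIC", "htrack", -1, 0),
    TileEdge("IO", "io_pin", -1, 0),
]
```

For a 5x5 array under this assumed floorplan, the logic tile at (2, 2) keeps only the wrap-around edge, while the logic tile at (1, 2), whose left neighbour is an I/O tile, keeps only the edge into the I/O tile type.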
Connections Within Different Tiles of the Same Type
When a programmable connection exists between routing resources that belong to different instances of the same tile type, we use an edge that "wraps around" as shown in Figure 5. The upper connection in this example is a connection between two horizontal wire segments. In our graph of the tile, both of these wire segments are represented by the same node. When this happens, the node emits an edge that returns to itself. The edge is labeled with two quantities: deltax and deltay. These values specify to which adjacent tile instance the edge connects. In this example, the delta values for the edge are (+1, 0). All edges are labeled in this way. In the lower connection in Figure 5,
Figure 7 shows the pseudo-code for the Pathfinder-based VPR router. As long as the sink has not been found, the node with the best score is removed from the priority queue (PQ) and its neighbouring nodes are added back into the PQ. Since the router always knows the (x,y) coordinates of the source and sinks being routed, the (x,y) coordinates of every subsequent node that is added to the PQ can be calculated using the (x,y) coordinate of the source node as the starting point, and the relative positioning information stored in the (deltax, deltay) edge labels.
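The wrap-around edge and the coordinate bookkeeping can be illustrated together with a small sketch. It assumes a single tile type and a dictionary-based tile graph; `TILE_GRAPH`, `expand`, and `trace` are hypothetical names introduced here. `expand` rebuilds the flat edge set from the per-tile graph, which shows that one self-edge labelled (+1, 0) stands in for every wire-to-wire connection along a row, and `trace` computes absolute coordinates from a starting point and a sequence of (deltax, deltay) labels.

```python
from typing import Dict, List, Set, Tuple

# Compressed form: node name -> list of (destination node, deltax, deltay).
# A single horizontal-track node with one wrap-around edge labelled (+1, 0).
TILE_GRAPH: Dict[str, List[Tuple[str, int, int]]] = {
    "htrack": [("htrack", +1, 0)],
}

FlatNode = Tuple[str, int, int]   # (node name, x, y)


def expand(tile_graph: Dict[str, List[Tuple[str, int, int]]],
           width: int, height: int) -> Set[Tuple[FlatNode, FlatNode]]:
    """Rebuild the flat edge set by instantiating the tile graph at every (x, y)."""
    flat = set()
    for y in range(height):
        for x in range(width):
            for src, fanout in tile_graph.items():
                for dst, dx, dy in fanout:
                    nx, ny = x + dx, y + dy
                    # Edges that would leave the array have no physical switch.
                    if 0 <= nx < width and 0 <= ny < height:
                        flat.add(((src, x, y), (dst, nx, ny)))
    return flat


def trace(start: Tuple[int, int],
          deltas: List[Tuple[int, int]]) -> List[Tuple[int, int]]:
    """Absolute coordinates reached by following edges with the given labels."""
    x, y = start
    coords = [(x, y)]
    for dx, dy in deltas:
        x, y = x + dx, y + dy
        coords.append((x, y))
    return coords
```

For a row of four tiles, `expand(TILE_GRAPH, 4, 1)` yields three flat edges from one stored edge, and `trace((0, 0), [(1, 0)] * 3)` walks from x = 0 to x = 3.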
When expanding the neighbours of a node, the router iterates across the node's list of fanout edges and adds each and every node to the PQ. We modify this to handle nodes representing wire segments that are longer than unit length (i.e. L>1) and to check which edges should be used for expansion. The modified pseudo-code is shown in Figure 8. This code will be further illustrated in an example in Section 5.3.
As described in Sections 4.3 to 4.5, not all edges in the graph correspond to physical switches. The routine fanout_exists uses the (x,y) coordinates of the tile being explored to compute whether each edge corresponds to a physical switch. The details of these calculations are not shown, but they are straightforward given the detailed routing architecture of the fabric being modeled.
Figure 8. Pseudo-code highlighting changes to the steps performed during neighbour expansion.
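The role of fanout_exists during neighbour expansion can be sketched as follows. This is not the pseudo-code of Figure 8: it is a plain uniform-cost search without Pathfinder's congestion costs, it handles only unit-length wires (no L>1 segments), and the body of `fanout_exists` is a stand-in that merely checks array bounds, since the real check depends on the routing architecture. The names `route` and `EXAMPLE_TILE` and the tile graph itself are assumptions for illustration.

```python
import heapq
from typing import Dict, List, Optional, Tuple

State = Tuple[str, int, int]   # (node name in the tile graph, x, y)

# Hypothetical tile graph: node -> list of (destination node, deltax, deltay).
EXAMPLE_TILE: Dict[str, List[Tuple[str, int, int]]] = {
    "htrack": [("htrack", +1, 0), ("htrack", -1, 0), ("vtrack", 0, 0)],
    "vtrack": [("vtrack", 0, +1), ("vtrack", 0, -1), ("htrack", 0, 0)],
}


def fanout_exists(x: int, y: int, dx: int, dy: int,
                  width: int, height: int) -> bool:
    """Stand-in check: the edge is physical only if its destination tile is on the array."""
    return 0 <= x + dx < width and 0 <= y + dy < height


def route(tile_graph: Dict[str, List[Tuple[str, int, int]]],
          source: State, sink: State,
          width: int, height: int) -> Optional[List[State]]:
    """Uniform-cost search over (node, x, y) states; returns a path or None."""
    pq = [(0, source, [source])]
    visited = set()
    while pq:
        cost, state, path = heapq.heappop(pq)
        if state == sink:
            return path
        if state in visited:
            continue
        visited.add(state)
        node, x, y = state
        for dst, dx, dy in tile_graph[node]:
            # Skip edges that have no physical switch at this tile location.
            if not fanout_exists(x, y, dx, dy, width, height):
                continue
            # The neighbour's coordinates follow from the edge's delta labels.
            nxt = (dst, x + dx, y + dy)
            if nxt not in visited:
                heapq.heappush(pq, (cost + 1, nxt, path + [nxt]))
    return None
```

The search state carries the current (x, y) alongside the tile-graph node, so the same stored node can be visited at many locations without being duplicated in memory; the extra per-edge check is the kind of additional work that trades run-time for footprint.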
Example Illustrating Long Wires
This subsection presents an example to illustrate how long wires are handled by the algorithm. Figure 9 shows a
Table 2. Compile time results for synthetic benchmark circuits
number of nets and sinks affect the amount of time spent on routing. We use two suites of benchmark circuits. The first suite consists of the twenty largest MCNC circuits. The second suite consists of seven synthetic circuits created by stitching together a number of MCNC circuits using the technique described in [11]. Although these circuits are much larger than the twenty largest MCNC circuits, they are still not as big as some of the architectures that we investigated in Section 5. However, we found the impact on compile time to be generally independent of circuit size. We assume an architecture that uses length 4 wires, and logic blocks with 22 inputs and ten 4-input LUTs. Each circuit is placed and routed to the minimum array size and channel width. Table 2 reports the compile time results for the suite of synthetic circuits. Columns 2-3 show the minimum array size and channel width required to place and route the circuit. Columns 4-6 show the placement, routing, and overall run-times when using the original version of VPR. Columns 7-8 show the new routing and overall run-times, and Columns 9-10 show the ratios. The average increase in routing time is 2.26x and the average increase in overall compile time is 1.28x. An important observation is that the overhead for each circuit is close to the average, meaning that the compile time overhead is generally independent of the FPGA size.
The average increases in routing time and overall compile time for the MCNC suite were found to be 2.16x and 1.29x respectively. The detailed results for this suite have been omitted due to space constraints, but they yielded the same conclusions.
Impact on Routing Solution
An important property of this technique is that all of the information in the routing resource graph is maintained exactly. This means that there is no change to the routing solution. This was verified by comparing the routing serial number generated by VPR.
Conclusion
This paper proposed a method to compress the connectivity information within the routing resource graph by taking advantage of the regularity in the FPGA. By doing so, we showed that the memory footprint of the routing step, one of the most memory intensive steps in the CAD flow, can be dramatically reduced. Due to the extra steps carried out by the router, we found that the overall place and route compile time increased on average by 1 
