This paper describes an FPGA-based accelerator for maze routing applications such as integrated circuit detailed routing. The accelerator efficiently supports multiple layers, multi-terminal nets, and rip up and reroute. By time-multiplexing multiple layers over a two-dimensional array of processing elements, this approach can support multi-layer grids large enough for detailed routing while providing a speedup of 1-2 orders of magnitude over software running on a modern desktop computer. The current implementation supports a 32 X 32 routing grid with up to 16 layers in a single Xilinx XC2V6000 FPGA. Up to 64 X 64 routing grids are feasible in larger commercially available FPGAs. Performance measurements (including interface overhead) show a speedup of 29X-93X over the classic Lee Algorithm and 5X-19X over the A* Algorithm. An improved interface design could yield significantly larger speedups.
INTRODUCTION
Routing is a key part of the physical design of electronic systems. While many different algorithms are used for routing, the classic Lee Algorithm for maze routing [1] remains a popular approach because it is guaranteed to find a shortest-path connection between two terminals.
The Lee Algorithm models the routing problem as a grid. It finds a connection between a source and a destination terminal in three phases: 1) expansion, where gridpoints are labeled breadth-first in order of increasing distance from the source until the destination is encountered; 2) backtrace, where gridpoints are selected for a connection by following labels of decreasing distance from the target back to the source and marking the selected gridpoints as obstacles; and 3) cleanup, where expansion labels are removed.
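The three phases can be illustrated with a minimal single-layer software sketch. The cell encodings `EMPTY` and `BLOCKED`, the function name `lee_route`, and the use of a dictionary to hold expansion labels are assumptions made for illustration; they are not taken from the accelerator or from any particular router.

```python
from collections import deque

EMPTY, BLOCKED = 0, 1  # hypothetical cell encodings


def lee_route(grid, source, target):
    """Route source -> target on a 2-D grid (list of lists).

    Returns the list of gridpoints on the path, or None if no path exists.
    Both terminals are assumed to be EMPTY cells.
    """
    rows, cols = len(grid), len(grid[0])

    def neighbors(r, c):
        for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if 0 <= nr < rows and 0 <= nc < cols:
                yield nr, nc

    # 1) Expansion: label gridpoints breadth-first by distance from the source.
    dist = {source: 0}
    frontier = deque([source])
    while frontier and target not in dist:
        cell = frontier.popleft()
        for nb in neighbors(*cell):
            if nb not in dist and grid[nb[0]][nb[1]] == EMPTY:
                dist[nb] = dist[cell] + 1
                frontier.append(nb)
    if target not in dist:
        return None

    # 2) Backtrace: follow labels of decreasing distance back to the source
    #    and mark the selected gridpoints as obstacles.
    path = [target]
    while path[-1] != source:
        r, c = path[-1]
        want = dist[path[-1]] - 1
        path.append(next(nb for nb in neighbors(r, c) if dist.get(nb) == want))
    for r, c in path:
        grid[r][c] = BLOCKED

    # 3) Cleanup: the labels live only in `dist`, so discarding it removes them.
    path.reverse()
    return path
```

In this sketch the expansion loop visits one gridpoint at a time, which is the cost that hardware acceleration aims to remove.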
The biggest disadvantage of the Lee Algorithm is that it is computationally expensive. The problems are often subdivided into separate global and detailed routing steps [2], with detailed routing focusing on subregions of the overall routing surface consisting of "a few tens of tracks in either direction" [3]. However, even at this reduced problem size, routing consumes a large amount of execution time.
Because of this performance issue, several designs have been proposed for hardware acceleration of the Lee Algorithm. During expansion, all gridpoints at an equal distance from the source terminal can ideally be labeled in parallel, reducing the expansion time to O(d). During cleanup, all gridpoints not representing routed wires can be cleared in one step, reducing cleanup time to O(1).
The most straightforward approach to exploiting this parallelism is to create an array of small processing elements (PEs) that directly represent the points in the grid. Connections between neighboring PEs allow the expansion phase to proceed in parallel. This full-grid [4] [5] [6] approach has the best potential for speedup but comes at a high cost in hardware, since a total of LN² PEs are required for an N X N grid with L layers.
An alternative virtual-grid [7] [8] [9] approach uses a smaller array of PEs and maps multiple gridpoints onto each PE. This reduces the total number of PEs but complicates the design of each individual PE, which must now keep track of each gridpoint that it represents and communicate that information with neighboring PEs. Other approaches have proposed the use of systolic arrays [10] or special-purpose hardware that accelerates a portion of the algorithm without addressing its full potential parallelism [11].
More recent work on routing acceleration has focused on FPGAs. In FPGA routing the grid graph is replaced by a directed graph that models the FPGA's programmable interconnect. For example, [12] reported an accelerator for the well-known PathFinder [13] FPGA routing algorithm. This approach combined an FPGA-based priority queue implementation with a network of workstations which took advantage of coarse-grained parallelism in the algorithm for a reported 15X speedup over contemporary (1997) workstations.
Other work [14] proposed an aggressive acceleration approach in which the interconnection structure of a FatTree topology FPGA was augmented with hardware to directly support connection search, allocation of routed connections, and ripup ("victimization") of routed connections when no connection can be found. Simulations of this design predict a speedup of up to three orders of magnitude over PathFinder with design quality within 5%-25% of software. The estimated hardware overhead varies from 1.5X for support of path search alone to 2.5X when path allocation and victimization are included. Further work [15] added support for multiple-terminal nets, improved routing quality by considering congestion, added support for mesh-style routing topologies, and explored implementing the routing hardware on an existing FPGA. While simulations predict dramatic speedups, the hardware cost is high: each PE requires an estimated 155 4-input lookup tables (LUTs) for a single-layer mesh topology.
In contrast to these approaches, L3 [16] focuses on classic multi-layer maze routing. It is intended for detailed routing of integrated circuits rather than FPGAs. L3 uses a modified full-grid approach with a two-dimensional array of PEs that supports multi-layer routing. Each PE stores the state of all layers at a particular (x, y) coordinate and time-multiplexes each layer's gridpoint (x,y,0), (x,y,1), ..., (x,y,L-1) through a common state sequencer. This allows efficient processing of multi-layer grids using compact hardware.
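The hardware saving of this organization can be made concrete with a small calculation. The sketch below simply compares one PE per gridpoint (full-grid) with one PE per (x, y) position (time-multiplexed layers); the function name `pe_counts` is hypothetical.

```python
def pe_counts(n, layers):
    """Return (full_grid_pes, time_multiplexed_pes) for an n x n grid."""
    full_grid = layers * n * n   # one PE per gridpoint on every layer
    time_multiplexed = n * n     # one PE per (x, y) position, layers in sequence
    return full_grid, time_multiplexed


print(pe_counts(32, 16))  # (16384, 1024)
```

For a 32 X 32 grid with 16 layers the PE count drops by a factor of 16, at the price of each PE visiting its layers one after another.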
L3 showed the potential of this approach, but was implemented in a relatively small FPGA that supported a 4-layer 16 X 16 grid.
L4 ORGANIZATION
Fig. 1 shows the general organization of the accelerator. The heart of the design is a two-dimensional array of PEs attached to a control unit.
Each PE corresponds to a horizontal (x,y) position on the routing grid and represents the gridpoints on every layer at that horizontal position. Fig. 2 shows how a PE is connected to neighboring PEs for local communication during expansion. PEs can also be selected for specific operations using a row decoder and column decoder. The decoders can select either a single row/column or a range of rows and columns. This allows the selection of either a single PE or an arbitrary rectangle of PEs for parallel operation.
Every PE has CMD and STATE IN inputs that are broadcast from the control unit to all PEs simultaneously.
It also has a STATE OUT output; the control unit reads the logical AND of this output from all PEs.
As gridpoints from each successive layer circulate through the state sequencer of each PE from bottom to top, the function of the PE and the next state (NS) of the current gridpoint are determined by the current state (CS), the CMD input, and the row/column select inputs RSEL and CSEL. Because multiple PEs can be selected for an operation, WRITE can be used to set the state of all PEs in an arbitrary rectangular region in parallel. This is useful for initializing an obstacle, ripping up a rectangular wire segment, and clearing all gridpoints on the current layer. Similarly, the READ operation can read all gridpoints in a rectangular region and deliver the logical AND of these results to the controller. This can be used as a form of "line probe" that determines whether all selected gridpoints are in the EMPTY state (i.e., they are free for a new connection).
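A minimal software model of rectangular selection is sketched below. The class `PEArray`, its method names, and the string state encodings are hypothetical; the loops run sequentially here, whereas the hardware applies the operation to every selected PE at once.

```python
EMPTY, BLOCKED = "EMPTY", "BLOCKED"  # hypothetical state encodings


class PEArray:
    """Toy model: state[y][x][layer] holds the state of one gridpoint."""

    def __init__(self, n, layers):
        self.state = [[[EMPTY] * layers for _ in range(n)] for _ in range(n)]

    def write(self, rows, cols, layer, value):
        """Set every gridpoint in the inclusive row/column ranges on one layer."""
        for y in range(rows[0], rows[1] + 1):
            for x in range(cols[0], cols[1] + 1):
                self.state[y][x][layer] = value

    def read_all_empty(self, rows, cols, layer):
        """Line probe: logical AND of 'is EMPTY' over the selected rectangle."""
        return all(
            self.state[y][x][layer] == EMPTY
            for y in range(rows[0], rows[1] + 1)
            for x in range(cols[0], cols[1] + 1)
        )
```

Selecting a single row range and a single column range is enough to address one PE, one wire segment, or the whole layer with the same two operations.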
The CLEARX command is used during the cleanup phase to return "expanded" gridpoints to the EMPTY state.
The control unit uses one status output to determine when a connection target is reached. STATE OUT[1] is asserted low by any PE that is currently entering an expanded state. The control unit uses this signal as a watchdog: when it reads high, no PE is entering an expanded state, so no further expansion is possible.
The control unit is a finite state machine that accepts high-level routing commands from a host processor and broadcasts low-level commands to the PE array. Table 2 summarizes the commands implemented by the control unit. Each command is a 32-bit word that is passed from the host processor to the control unit using a FIFO. 32-bit result words are passed back to the host processor in the form of endpoints of wire segments and status words indicating either successful completion of the command or failure.
MULTI-TERMINAL NETS
The basic Lee Algorithm supports only two-terminal nets. However, multi-terminal nets can be routed as follows [2]: First, two terminals are routed, resulting in a connection between the two terminals. Next, the gridpoints that represent this connection are relabeled as source nodes and used to start an expansion search to the third terminal. This process is repeated until all terminals are connected.
The L4 accelerator implements this approach with an additional gridpoint state called TRACED. During multi-terminal routing, the backtrace phase labels connections with the TRACED state instead of the BLOCKED state. During subsequent expansions, gridpoints in the TRACED state are treated as "expanded" by neighboring PEs, allowing expansion to proceed from all gridpoints in the partial connection. When all terminals in the net are connected, a final cleanup phase returns all gridpoints that are in the TRACED state to the BLOCKED state in parallel.
The control unit implements two new commands to support multi-terminal routing: ROUTE_EXTEND_INIT, which initializes multi-terminal routing and routes the first two terminals, and ROUTE_EXTEND, which is used to add connections to each additional terminal.
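The role of the TRACED state can be sketched in software as a breadth-first search that is seeded from every gridpoint of the partial connection. This is a single-layer illustration only: the function name `route_net`, the integer state encodings, and the choice to leave a partial net in the TRACED state on failure are assumptions, not the behavior of the L4 commands.

```python
from collections import deque

EMPTY, BLOCKED, TRACED = 0, 1, 2  # hypothetical state encodings


def route_net(grid, terminals):
    """Connect all terminals of one net; returns True on success.

    Terminals are assumed to be EMPTY cells. On failure the partial net is
    left in the TRACED state for the caller to handle.
    """
    rows, cols = len(grid), len(grid[0])
    r0, c0 = terminals[0]
    grid[r0][c0] = TRACED
    for target in terminals[1:]:
        # Every TRACED gridpoint acts as a source for the next expansion.
        parent = {(r, c): None
                  for r in range(rows) for c in range(cols)
                  if grid[r][c] == TRACED}
        frontier = deque(parent)
        while frontier and target not in parent:
            r, c = frontier.popleft()
            for nr, nc in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nr < rows and 0 <= nc < cols
                        and (nr, nc) not in parent
                        and grid[nr][nc] == EMPTY):
                    parent[(nr, nc)] = (r, c)
                    frontier.append((nr, nc))
        if target not in parent:
            return False
        # Backtrace labels the new branch TRACED rather than BLOCKED.
        cell = target
        while cell is not None:
            grid[cell[0]][cell[1]] = TRACED
            cell = parent[cell]
    # Final cleanup: the finished net becomes an obstacle for later nets.
    for row in grid:
        for c, value in enumerate(row):
            if value == TRACED:
                row[c] = BLOCKED
    return True
```

Because each new branch may attach anywhere along the existing connection, later terminals tend to need shorter branches than a search from a single source would give.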
RIPUP AND REROUTE SUPPORT
A major drawback of maze routing is that it only considers one connection at a time. A shortest path connection for one net can block connections for nets that have not yet been routed. To overcome this problem, nets are usually routed in varying order using a rip up and reroute strategy [2]. In this approach, nets are routed first in an arbitrary order. When a net cannot be connected, "blocking" nets that prevent this connection must be removed. The blocked net is then routed, followed by the other nets.
L4 supports net ripup using the ability to select and write a rectangular region simultaneously. This allows each horizontal (east/west or north/south) segment to be removed by the controller using one WRITE command for each segment (instead of one command for each gridpoint).
A second and more important problem involves the identification of nets which must be removed to complete a blocked connection. The expansion search implemented by L4 cannot help here, since when it fails it only indicates that a path cannot be found.
Fig. 3. Etching: support for rip up and reroute.
Some software routers (e.g., [17]) attack this problem using a penalty function where expansion is allowed through gridpoints containing prior connections at a much higher cost than through empty gridpoints. If no obstacle-free path is found, the minimum cost path identifies the minimum number of occupied gridpoints that must be ripped up to complete the routing of the current net.
Supporting a penalty function in a software implementation requires a priority queue. In a full-grid accelerator, the penalty function can be realized by adding a counter to each PE [5] . This is not feasible in L4 because each PE represents multiple layers; the cost of counter storage for each layer would be excessive.
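The software form of the penalty function can be sketched as a shortest-path search driven by a priority queue. The function name `penalty_route`, the state encodings, and the default penalty of 100 are hypothetical choices for illustration; they are not taken from [17].

```python
import heapq

EMPTY, OCCUPIED = 0, 1  # hypothetical state encodings


def penalty_route(grid, source, target, penalty=100):
    """Return (path, ripup): the cheapest path and the occupied cells on it.

    Entering an EMPTY cell costs 1; entering an OCCUPIED cell costs `penalty`.
    Both terminals are assumed to be EMPTY cells.
    """
    rows, cols = len(grid), len(grid[0])
    best = {source: 0}
    parent = {source: None}
    heap = [(0, source)]
    while heap:
        cost, cell = heapq.heappop(heap)
        if cell == target:
            break
        if cost > best[cell]:
            continue  # stale queue entry
        r, c = cell
        for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
            if not (0 <= nb[0] < rows and 0 <= nb[1] < cols):
                continue
            step = 1 if grid[nb[0]][nb[1]] == EMPTY else penalty
            if cost + step < best.get(nb, float("inf")):
                best[nb] = cost + step
                parent[nb] = cell
                heapq.heappush(heap, (cost + step, nb))
    if target not in parent:
        return None
    path = []
    cell = target
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    path.reverse()
    ripup = [p for p in path if grid[p[0]][p[1]] == OCCUPIED]
    return path, ripup
```

With a penalty much larger than any detour length, the search crosses an occupied gridpoint only when no obstacle-free path exists.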
Instead, L4 uses an alternative approach called etching [18]. The idea of etching is analogous to the use of a solvent to remove a physical barrier in a manufacturing process. Etching is performed by each PE under control of an "etch enable" signal from the control unit. When etching is disabled, the PE operates as before. When etching is enabled, each PE allows expansion to occur through a gridpoint representing an obstacle as well as an empty gridpoint. The control unit normally operates with etching disabled, but if expansion fails, it enables etching for one expansion cycle, allowing expansion through routed gridpoints immediately adjacent to the limits of the previous expansion. It then continues the search with etching disabled again. If routing fails again, etching is again enabled for an additional expansion cycle.
While etching can be useful for removing wires that block a connection, it cannot be applied indiscriminately: connection terminals must not be removed, and routing problems may contain objects which cannot be removed. These gridpoints are marked with a new UNETCHABLE state, which cannot be removed by etching.
Fig. 3 illustrates the application of etching on a single-layer 4 X 4 grid with two nets. The terminals of each net are set to the UNETCHABLE state so that they cannot be removed during etching. Net 1 from source S1 to target T1 is routed first; this blocks the connection for Net 2 (a). During normal expansion of Net 2, all gridpoints that can be reached from source terminal S2 are labeled before expansion fails after two expansion steps (b). At this point, the control unit enables etching for the third expansion step, expanding two obstacle gridpoints representing the current wire for Net 1 (c). Normal expansion then resumes and the target T2 is reached in the fourth expansion step (d). During backtrace, the control unit marks gridpoints on the backtrace path as obstacles. It also reports etched gridpoints to the host processor, which must then identify which nets have been cut using an intersection check with previously routed nets. During cleanup, all normally expanded gridpoints are returned to the EMPTY state (e). Etched cells, on the other hand, are returned to the BLOCKED state. Finally, any cut nets must be completely ripped up and rerouted (f).
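The alternation between normal and etch-enabled expansion steps can be sketched in software as follows. This is one reading of the scheme for a single layer: the function name, the character encodings, the example grid in the usage note, and the choice to take the etch step from every labeled gridpoint are assumptions, and the sketch does not reproduce the PE logic.

```python
EMPTY, BLOCKED, UNETCHABLE = ".", "#", "U"  # hypothetical encodings


def expand_with_etching(grid, source, target):
    """Return (path, etched_on_path), or None if even etching cannot connect.

    `grid` is a list of equal-length strings; terminals are UNETCHABLE cells.
    """
    rows, cols = len(grid), len(grid[0])
    parent = {source: None}
    etched = set()

    def step(cells, allowed):
        """One expansion step from `cells` into unlabeled cells in `allowed`."""
        new = []
        for r, c in cells:
            for nb in ((r + 1, c), (r - 1, c), (r, c + 1), (r, c - 1)):
                if (0 <= nb[0] < rows and 0 <= nb[1] < cols
                        and nb not in parent
                        and (grid[nb[0]][nb[1]] in allowed or nb == target)):
                    parent[nb] = (r, c)
                    new.append(nb)
        return new

    wave = [source]
    while target not in parent:
        nxt = step(wave, {EMPTY})            # etching disabled
        if not nxt:
            # Normal expansion failed: one etch-enabled step through
            # BLOCKED cells that border the labeled region.
            nxt = step(list(parent), {BLOCKED})
            etched.update(n for n in nxt if grid[n[0]][n[1]] == BLOCKED)
            if not nxt:
                return None                  # only UNETCHABLE cells remain
        wave = nxt

    path = []
    cell = target
    while cell is not None:
        path.append(cell)
        cell = parent[cell]
    path.reverse()
    return path, [p for p in path if p in etched]
```

On a hypothetical 4 X 4 grid `[".U..", "U##U", "....", ".U.."]`, routing from (0, 1) to (3, 1) fails after the top row is labeled, etches the two wire gridpoints in row 1, and then reaches the target; only the etched gridpoint (1, 1) lies on the backtrace path and would be reported for ripup.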
The PE implementation supports etching with a new ETCH ENABLE control input and an added state bit ETCHED that represents whether a cell was originally empty when labeled (false) or originally an obstacle (true). During backtrace, this bit identifies gridpoints that must be reported to the host processor as etched. During cleanup, the ETCHED bit allows gridpoints to be simultaneously returned to the EMPTY or BLOCKED state as appropriate.
IMPLEMENTATION AND RESULTS
The L4 implementation is coded in Verilog. The basic PE design consists of 116-147 lines of Verilog depending on which features are used. When compiled into hardware with the Xilinx ISE synthesis tool, the basic PE without etching or multi-terminal routing requires 38 Look-Up Tables (LUTs).
Table 3 provides implementation details for a complete 32 X 32 routing array that supports etching and multi-terminal routing for up to 16 layers. It is implemented using a Xilinx Virtex-II XC2V6000 FPGA [19] on a Dini Group 3000K10S [20] development board. The board includes a PCI interface that plugs into a standard PCI slot on a desktop PC host.
The clock rate of the accelerator is limited by the delay path between the control unit and the routing array. To increase the clock rate over the previous L3 design, registers are included in the row and column decoders and the status output of the array. Using a relatively low (-4) speed grade FPGA, the routing array can operate at a frequency of 30 MHz, a significant improvement over the 24 MHz L3 accelerator despite the fact that the L4 array is four times larger than the 16 X 16 L3 array.
The current host is a 1.79 GHz Pentium 4 desktop PC running RedHat Linux 7.2. Host software sends routing commands and receives routing results using I/O ports that are memory mapped through a Linux device driver.
Table 4 summarizes performance measurements that compare L4 to software implementations of the Lee Algorithm and the A* Algorithm [23] that are written in C. Hardware measurements were performed using the configuration described in the previous paragraph. Software measurements were performed using a lightly loaded desktop PC with a 3.79 GHz EM64T Pentium 4 and 3.5 GB of RAM running 64-bit Ubuntu Linux v5.05. Both sets of measurements used the Pentium 4 cycle counter to measure the elapsed time between the start and end of each command and the equivalent software.
Measurements were performed for accelerators instantiated to support 6-, 8-, and 16-layer grids on four sets of benchmarks. The first two sets of benchmarks are simple problems that are intended to show the range of speedups between a worst case (routing between adjacent gridpoints (0,0,0) and (0,0,1), which requires almost no expansion) and a best case (routing between the grid "corners" at (0,0,0) and (31,31,L-1), which requires expansion of the entire grid by the Lee Algorithm). The third set measures routing time for 90 successive randomly selected two-terminal nets, while the fourth set measures 40 successive randomly selected multi-terminal nets, each with 2-5 terminals. Routing quality was comparable for each approach (total wirelength varied by less than 2.5%).
L4's speedup compared to the classic Lee Algorithm is dramatic (5X-94X), but the speedup over the A* algorithm is lower (4X-14X) because the A* algorithm biases its search toward the target and reduces the cost of the expansion phase. However, even the worst case shows a 4X speedup due to acceleration of the cleanup phase.
Much of the accelerator's time is consumed by interface overhead. To quantify this, a separate cycle counter was added to the accelerator hardware that counts accelerator clock cycles from beginning to end of each routing command. Subtracting this time from the time measured in the host processor gives the interface overhead. On the 6-layer 40 random multi-terminal net problem, interface overhead per net varies between 12.5 µs and 117.2 µs, with an average of 56.2 µs (75% of the average routing time). The wide variation is due to the fact that routes containing several segments require a separate word transfer for each segment. Using block transfers could greatly improve performance, as would replacing the PCI interface with a faster interface.
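The quoted average implies a rough split between interface time and accelerator time. The short calculation below assumes the quoted times are in microseconds and that the 75% figure applies exactly to the 56.2 average; the function name is hypothetical and the results are only as precise as those rounded inputs.

```python
def split_routing_time(overhead_us, overhead_fraction):
    """Return (total_us, accelerator_us) implied by an overhead and its share."""
    total_us = overhead_us / overhead_fraction
    return total_us, total_us - overhead_us


total, accel = split_routing_time(56.2, 0.75)
print(f"{total:.1f} {accel:.1f}")  # 74.9 18.7
```

Under these assumptions the average net takes about 75 µs end to end, of which only about 19 µs is spent in the accelerator itself.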
These accelerator clock measurements also allow us to estimate an upper bound on the best-case speedup that could be achieved if interface overhead could somehow be completely removed. On individual nets, speedup ranges from 4.6X-240X over the Lee Algorithm and 12X-43X over the A* algorithm. While some interface overhead is inevitable, a better design could significantly improve performance.
A final experiment was performed to evaluate the effectiveness of etching compared to software routing with a penalty function. In this experiment, a set of 90 random multi-terminal nets were generated for an 8 layer 32 X 32 grid. The resulting problem is too congested to route completely, but can be used to evaluate how different routing algorithms select gridpoints for ripup.
L4 was compared to both the classic Lee Algorithm and the A* algorithm with a cost penalty for crossing obstacles. The result of this experiment is equivalent to one pass of a ripup-and-reroute algorithm in which "violations" have occurred but will be corrected in a later pass. Table 5 summarizes the results in terms of the number of rectilinear wire segments in the final routing, total wire length, the number of gridpoints selected for ripup during etching ("EtchPts"), the required execution time, and the speedup of L4 compared to the software approaches.
The "EtchPts" metric is important because it identifies the number of gridpoints assigned to more than one net; these nets must be identified and ripped up in a later pass. A smaller number here implies less work to be done during the ripup stage. The hardware etching approach results in a significantly smaller number of ripped up gridpoints, but does so at the expense of about 6.5% in overall wirelength. The segment count is significant because it corresponds to the number of bends found in each connection. L4 falls between the Lee and A* approaches on this metric. Finally, L4 displays higher speedups during etching for both the Lee and A* algorithms compared to Table 4; this is because all gridpoints that are reachable from the source gridpoint must be expanded before obstacle gridpoints can be considered.
CONCLUSION
This paper has described the design of an FPGA-based hardware accelerator for maze routing that is intended for detailed routing of integrated circuits. Experiments show promising speedups over software implementations that could be further enhanced by an improved interface design.
The 32 X 32 X 16 grid size approaches the size needed to support detailed routing, and larger commercially available FPGAs could support even larger arrays. For example, the Xilinx Virtex-4 XC4VLX200 could support up to 64 X 64 X 16 grids for basic routing and 50 X 50 X 16 grids for the full design described in this paper.
There are a number of areas for future work in this project. First, the basic net routing capability here must be exploited by developing a ripup and reroute algorithm that uses the routing accelerator to search for multiple connections. This effort is currently underway. Second, modern routing tools must deal with many issues that go beyond simple path search, such as non-uniform spacing, preferential routing directions on different layers, via restrictions, and other physical and electrical considerations such as cross-talk and signal integrity. Further research must address these issues. Finally, the L4 approach could be extended to FPGA routing by modifying the PE design to represent FPGA routing resources.
