This paper presents a performance-oriented placement and routing tool for field-programmable gate arrays. Using recursive geometric partitioning for simultaneous placement and global routing, and a graph-based strategy for detailed routing, our tool optimizes source-sink pathlengths, channel width and total wirelength. Our results compare favorably with other FPGA layout tools, as measured by the maximum channel width required to place and route several benchmarks.
Introduction
Field-programmable gate arrays, or FPGAs, provide a versatile and inexpensive way to implement and test VLSI designs [7, 16]. FPGAs are available in a number of styles and configurations [40]. One of the most common FPGA architectures [9, 43] consists of a matrix of user-configurable logic blocks interconnected by a set of programmable routing resources (Figure 1). FPGA reprogrammability is achieved at the expense of performance, as there may be long signal delays through the reconfigurable routing resources [39]. To increase FPGA performance, partitioning and technology mapping have been extensively studied [11, 20, 27, 35]. However, the observation that circuit performance is impacted more by routing delays than by device delays [6, 26] has focused recent attention on routing [8, 15, 31, 32, 42].
This paper presents a performance-oriented FPGA Placement and Routing (FPR) tool. FPR is based on a recursive geometric strategy for simultaneous placement and global routing, followed by a graph-based detailed-routing phase. FPR heuristically minimizes both wirelength and source-sink pathlengths. Thus, FPR optimizes the number of FPGAs required to implement a given design, as well as the performance of the implementation. In particular, FPR successfully lays out a number of large industrial benchmark circuits using smaller channel widths than other FPGA layout tools, and also optimizes source-sink pathlengths as a secondary criterion.

This work is supported by NSF grants CCR-9224789 and MIP-9107717 (Cohoon), a Virginia Space Grant Fellowship (Ganley), and NSF Young Investigator Award MIP-9457412 (Robins). Our benchmarks and additional related papers may be found at WWW URL http://www.cs.
The rest of the paper is organized as follows. Section 2 provides an overview of our methodology. Section 3, Section 4 and Section 5 detail the main phases of FPR, namely placement, global routing and detailed routing, respectively. Section 6 establishes the efficacy of our implementation on industrial benchmark designs, and we conclude in Section 7. The Appendix develops some theoretical results for multi-weighted graphs used in the multi-objective optimization phase of detailed routing. Preliminary versions of this work have appeared in [1, 2, 3].
Overview
FPGA logic blocks typically contain a programmable look-up table, which enables arbitrary combinational-logic functions of up to four variables to be implemented. Each logic block thus contains a small portion of the overall circuit logic. The logic blocks are interconnected by channel segments, which are linked together by switch blocks. The switch blocks contain programmable internal connections among certain subsets of incident channel segments. Switch-block edges are often implemented as pass transistors, which can be "turned on" to interconnect incident channel edges. Finally, connection edges allow logic-block pins to latch onto adjacent channel segments.
During the FPGA design process, placement and routing are performed following the technology mapping phase. Technology mapping decomposes the circuit design into units of logic, which are then assigned to specific logic blocks during placement. Thus, the input to FPR consists of unplaced logic blocks and a set of nets (a net is a set of logic-block I/O pins that must be interconnected). FPR performs simultaneous placement and global routing using a recursive geometric technique called thumbnail partitioning, which decomposes the circuit area into an $m \times n$ grid, for some small fixed $m$ and $n$. This grid is called the partitioning template. The placement is then optimized and a global routing is determined relative to the partitioning template using optimal rectilinear Steiner arborescences (RSAs) [34] (i.e., minimum-weight shortest-path trees). Since $m$ and $n$ are small and fixed, these optimal RSAs (called thumbnails) may be precomputed for efficient lookup during execution. Setting $m = n = 3$ yields the basic $3 \times 3$ partitioning template that is used in our implementation (Figure 2(a)). Thumbnail partitioning is a generalization of sharp partitioning [5], which in turn is a generalization of quadrisection [38].

Our strategy consists of placement and global routing, followed by detailed routing. During placement and global routing, a partitioning heuristic is used to assign the logic blocks to regions in the partitioning template, minimizing source-sink pathlengths as well as the total length of the thumbnails. When the circuit area is divided according to the partitioning template, each logic block lies in one of the $m \times n$ regions. For each net, we construct a pointset in the $m \times n$ grid, where a point is present in a region if some logic block associated with the net lies in that region (Figure 2(b)). A thumbnail over this pointset is then determined (Figure 2(c)).
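The construction of a net's pointset can be illustrated with a small sketch. It assumes, purely for illustration, that logic blocks have integer coordinates and that the template divides the array uniformly; the names `region_of`, `pointset`, `positions`, `width` and `height` are hypothetical and do not come from FPR.

```python
def region_of(x, y, width, height, m=3, n=3):
    """Map logic-block coordinates to a (column, row) region of an m x n template.

    Assumption: the template splits a width x height array into equal strips.
    """
    col = min(x * m // width, m - 1)
    row = min(y * n // height, n - 1)
    return (col, row)


def pointset(net, positions, width, height, m=3, n=3):
    """Regions of the template that contain at least one logic block of the net."""
    return {region_of(*positions[b], width, height, m, n) for b in net}
```

With a 9 x 9 array, blocks at (0, 0) and (1, 2) fall into the same region, so a three-pin net may produce a pointset with only two points; a thumbnail is then looked up for that pointset.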
To reduce overall routing congestion, alternative thumbnails are selected in order to balance the number of thumbnail edges that cross each edge of the partitioning template. "Virtual" pins are then created at the intersections of thumbnails and partitioning-template edges (Figure 2(d)), and the algorithm is then applied recursively to each subregion of the partitioning template. This scheme simultaneously produces both a placement and a global routing in which source-sink pathlengths, total wirelength, and maximum channel congestion are all heuristically minimized. The resulting placement and global routing is then used in the detailed-routing phase to produce a complete routing solution.
During the detailed-routing phase, nets are assigned specific routing resources based on global routes. By modeling the FPGA routing architecture as a graph, efficient graph-based algorithms may be used to produce detailed-routing solutions. Nets are routed one at a time; as resources are committed to nets, the corresponding edges in the underlying graph are made unavailable to subsequent nets.
The next three sections detail the main phases of FPR, namely: (1) logic-block placement and thumbnail selection for balancing congestion, (2) global routing, and (3) detailed routing.
Placement
The placement phase overlays the FPGA with the partitioning template and initially partitions the design logic into $m \times n$ regions. Cut lines of the partitioning template go through switch blocks so that each logic block lies entirely within a single region of the partitioning template. The distribution of logic blocks among regions of the partitioning template is then improved using simulated annealing [28], where a move consists of swapping two logic blocks that lie in different regions of the partitioning template. The simulated annealing objective is to minimize (1) the sum of the maximum source-sink pathlengths in the thumbnails over the nets, and (2) the total length of the thumbnails for all nets. Note that the I/O blocks on the perimeter of the FPGA are not moved during these iterative refinement steps.
Routability is a primary concern during the FPGA design process [6, 10]. An important measure of the quality of a placement and global routing is maximum congestion, which in our case is the number of thumbnail edges that cross any given partitioning-template edge. Thus, once logic blocks have been assigned to regions in the partitioning template, a congestion-balancing step is undertaken as follows.
A typical pointset can have many thumbnails; for example, Figure 3 illustrates a pointset and its eight thumbnails. The objective of the congestion-balancing step is to assign one of the precomputed thumbnail alternatives to each net in a manner that minimizes the maximum thumbnail congestion. This task is accomplished using the following greedy heuristic:
(1) Sort the nets in ascending order of the number of distinct thumbnails for each net; and
(2) for each net on this list, choose the thumbnail that minimizes the maximum congestion induced by all previously processed nets.
Intuitively, this scheme postpones the global routing of nets for which there are a greater number of thumbnail choices; this enables FPR to better compensate for the less avoidable congestion incurred earlier by nets with fewer thumbnail choices. 
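The greedy congestion-balancing heuristic can be sketched as follows. This is a minimal illustration rather than FPR's implementation: it assumes each thumbnail is represented simply by the collection of partitioning-template edges it crosses, and the names `balance_congestion`, `candidates` and `usage` are hypothetical.

```python
from collections import Counter


def balance_congestion(candidates):
    """Greedily pick one thumbnail per net to keep the maximum congestion low.

    candidates maps each net to a list of alternative thumbnails; each
    thumbnail is an iterable of the template-edge ids that it crosses
    (an assumed representation).
    Returns (choice, usage): the chosen thumbnail index per net and the
    resulting number of crossings per template edge.
    """
    usage = Counter()
    choice = {}
    # Nets with the fewest alternatives are processed first.
    for net in sorted(candidates, key=lambda name: len(candidates[name])):
        options = candidates[net]

        def resulting_max(thumbnail):
            trial = usage.copy()
            trial.update(thumbnail)
            return max(trial.values(), default=0)

        best = min(range(len(options)), key=lambda i: resulting_max(options[i]))
        choice[net] = best
        usage.update(options[best])
    return choice, usage
```

In this sketch ties between equally congested alternatives are broken by taking the first one; the source does not specify a tie-breaking rule.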
Global Routing
After FPR has mapped the logic blocks to regions in the partitioning template and each net has been assigned a thumbnail, every edge in each thumbnail is then assigned to a specific switch block along the crossed cut-line of the partitioning template. Each such switch block is then conceptually added as a new "virtual" pin in the net. The portion of each net within each region of the partitioning template is then passed on to a lower level of the recursion (this is similar to the virtual terminal [5] and terminal propagation [14] techniques). Thus, the global routing computed for a net corresponds to the topology of its thumbnail.
Assignment of nets to switch blocks is accomplished in a manner similar to PHIroute [37]. The number of nets that can be assigned to each switch block is bounded by the number of nets crossing the cut, divided by the number of switch blocks on the cut. This construction induces a structure that may be represented by a complete bipartite graph with nets in one partition and switch blocks in the other. Edge weights in this graph model the cost of assigning a net to the corresponding switch block. Assignments are then determined by computing a minimum-cost matching [33].
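One way to picture the capacity-bounded assignment is to replicate every switch block into as many "slots" as its capacity and then solve an ordinary minimum-cost assignment. The sketch below does this with a textbook Hungarian-method routine; it is an illustration under stated assumptions, not the matching code used by FPR. The capacity is assumed to be rounded up, and the names `min_cost_assignment`, `assign_nets` and `cost` are hypothetical.

```python
import math

INF = float("inf")


def min_cost_assignment(cost):
    """Hungarian method for a cost matrix with rows <= columns.

    Returns a list giving the column matched to each row.
    """
    n, m = len(cost), len(cost[0])
    u = [0] * (n + 1)
    v = [0] * (m + 1)
    p = [0] * (m + 1)      # p[j]: 1-based row currently matched to column j
    way = [0] * (m + 1)    # back-pointers along the augmenting path
    for i in range(1, n + 1):
        p[0] = i
        j0 = 0
        minv = [INF] * (m + 1)
        used = [False] * (m + 1)
        while True:
            used[j0] = True
            i0, delta, j1 = p[j0], INF, 0
            for j in range(1, m + 1):
                if not used[j]:
                    cur = cost[i0 - 1][j - 1] - u[i0] - v[j]
                    if cur < minv[j]:
                        minv[j], way[j] = cur, j0
                    if minv[j] < delta:
                        delta, j1 = minv[j], j
            for j in range(m + 1):
                if used[j]:
                    u[p[j]] += delta
                    v[j] -= delta
                else:
                    minv[j] -= delta
            j0 = j1
            if p[j0] == 0:
                break
        while j0:              # flip the matching along the augmenting path
            j1 = way[j0]
            p[j0] = p[j1]
            j0 = j1
    match = [None] * n
    for j in range(1, m + 1):
        if p[j]:
            match[p[j] - 1] = j - 1
    return match


def assign_nets(cost):
    """cost[i][b]: cost of sending net i through switch block b on the cut."""
    n_nets, n_blocks = len(cost), len(cost[0])
    capacity = math.ceil(n_nets / n_blocks)   # assumed rounding of the bound
    slots = [b for b in range(n_blocks) for _ in range(capacity)]
    expanded = [[row[b] for b in slots] for row in cost]
    return [slots[j] for j in min_cost_assignment(expanded)]
```

For three nets and two switch blocks the capacity is two, so the cheapest block cannot absorb all three nets and one of them is pushed to the other block.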
Recursion terminates when a region contains at most one logic block, along with the adjacent channel segments and switch blocks. We then route nets within the channels surrounding the logic block (if it exists) while minimizing the maximum channel congestion. In our implementation, an optimal solution is computed using integer programming [30]. This is efficient in practice since the number of nets involving any single logic block is small [17].
Detailed Routing
After placement and global routing, FPR performs detailed routing by assigning specific channel and switch-block edges to each net. The placement and global-routing phase passes the following information to the detailed router: (1) locations of relevant logic-block pins (i.e., the net to be routed), (2) a "loose" route for the net (leaving unspecified the edges within channel segments and switch blocks), and (3) switch blocks that are likely to serve as Steiner nodes in the detailed routing (Figure 4).

A design goal for FPR has been the ability to handle a wide variety of FPGA architectures. Towards this goal, we have adopted a graph-based approach to detailed routing. Each switch block contains internal switch-block edges that may be programmed to connect incoming channel edges. The routing structure of the entire FPGA is captured by a routing graph: detailed routes on the FPGA correspond to paths in the routing graph, and vice versa (Figure 5). In a routing graph, vertices model logic-block and switch-block nodes, while the edges correspond to connection, channel, and switch-block edges. This strategy enables the detailed router to employ generic graph algorithms in order to produce detailed-routing solutions.
Using the routing-graph approach, detailed routing entails interconnecting the logic-block vertices using edges and vertices inside the corresponding global-route region. This goal is modeled by the graph Steiner tree (GST) problem: given a graph $G = (V, E)$, where $V$ is the vertex set and $E \subseteq V \times V$ is a set of weighted edges, find a minimum-weight tree in $G$ that spans a subset of the vertices $N \subseteq V$ (the logic-block vertices in a net), using switch-block vertices as possible Steiner nodes. The cost of a tree $T$ is the sum of the costs of its edges. Since the GST problem is NP-complete [24], we utilize the heuristic of Kou, Markowsky and Berman [29] (KMB), which approximately solves the GST problem in polynomial time, and is guaranteed to yield solutions with cost less than twice the optimal. While the KMB heuristic always finds a feasible detailed routing if one exists, it often does not "branch" at the appropriate Steiner nodes (Figure 6(a)). This potential drawback is effectively ameliorated using the greedy strategy described below.
Our detailed-routing algorithm is based on combining a greedy, iterated heuristic [21, 25] with the KMB algorithm; we refer to this hybrid method as the Iterated-KMB (IKMB) algorithm [1].
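The hybrid can be sketched compactly. The code below is an illustrative sketch, not the FPR implementation: it assumes a connected, undirected routing graph stored as a dictionary of neighbor-weight dictionaries with string node names, and it takes the savings of a candidate Steiner node to be the drop in KMB tree cost when that node is added to the set being spanned. The function names (`dijkstra`, `mst`, `kmb`, `tree_cost`, `ikmb`) are hypothetical.

```python
import heapq
from itertools import combinations


def dijkstra(g, src):
    """Shortest-path distances and predecessors from src in graph g."""
    dist, prev, heap = {src: 0}, {}, [(0, src)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in g[u].items():
            if d + w < dist.get(v, float("inf")):
                dist[v], prev[v] = d + w, u
                heapq.heappush(heap, (d + w, v))
    return dist, prev


def mst(nodes, edges):
    """Kruskal's algorithm over (weight, u, v) tuples."""
    parent = {x: x for x in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    tree = []
    for w, u, v in sorted(edges):
        ru, rv = find(u), find(v)
        if ru != rv:
            parent[ru] = rv
            tree.append((w, u, v))
    return tree


def kmb(g, terminals):
    """KMB heuristic: returns a Steiner tree as a list of (weight, u, v) edges."""
    terminals = sorted(set(terminals))
    sp = {t: dijkstra(g, t) for t in terminals}
    # Step 1: MST of the complete distance graph over the terminals.
    closure = [(sp[a][0][b], a, b) for a, b in combinations(terminals, 2)]
    # Step 2: expand each chosen distance edge into its shortest path.
    sub = set()
    for _, a, b in mst(terminals, closure):
        prev, v = sp[a][1], b
        while v != a:
            u = prev[v]
            sub.add((g[u][v], min(u, v), max(u, v)))
            v = u
    # Step 3: MST of the expanded subgraph.
    tree = mst({x for _, u, v in sub for x in (u, v)}, sub)
    # Step 4: repeatedly delete leaves that are not terminals.
    while True:
        degree = {}
        for _, u, v in tree:
            degree[u] = degree.get(u, 0) + 1
            degree[v] = degree.get(v, 0) + 1
        leaves = {x for x, d in degree.items() if d == 1 and x not in terminals}
        if not leaves:
            return tree
        tree = [e for e in tree if e[1] not in leaves and e[2] not in leaves]


def tree_cost(tree):
    return sum(w for w, _, _ in tree)


def ikmb(g, net, candidates):
    """Greedily add the candidate Steiner node with the largest positive savings."""
    net, steiner = set(net), set()
    best = tree_cost(kmb(g, net))
    while True:
        pool = sorted(set(candidates) - net - steiner)
        gains = {x: best - tree_cost(kmb(g, net | steiner | {x})) for x in pool}
        if not gains or max(gains.values()) <= 0:
            break
        x = max(pool, key=gains.get)
        steiner.add(x)
        best -= gains[x]
    return kmb(g, net | steiner), steiner
```

On a small graph with three pins joined pairwise by edges of weight 5 and to a central node by edges of weight 3, plain KMB in this sketch returns a tree of cost 10, whereas the iterated version adopts the central node as a Steiner node and reaches cost 9.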
Given a routing graph $G = (V, E)$, a net $N \subseteq V$, and a set $S$ of potential Steiner nodes, we define the savings of $S$ with respect to $N$ as $\Delta\mathrm{KMB}_G(N, S) = \mathrm{KMB}_G(N) - \mathrm{KMB}_G(N \cup S)$, where $\mathrm{KMB}_G(N)$ denotes the cost of the KMB tree over $N$ in $G$. Intuitively, $\Delta\mathrm{KMB}_G(N, S)$ represents the interconnect savings incurred by KMB when Steiner nodes in $S$ are included into the node set $N$ to be spanned. This is illustrated in Figure 6(b), where using a candidate Steiner node from the shaded switch block results in an optimal solution. In order to efficiently find such Steiner nodes, a set of candidate Steiner nodes is determined for each net. Candidate Steiner nodes are switch-block nodes that correspond to Steiner switch blocks (Figure 4). The IKMB method operates by repeatedly finding candidate Steiner nodes that reduce the overall KMB cost by the largest amount, and then including them into a growing set $S$ of Steiner nodes. The cost of the KMB tree over $N \cup S$ decreases with each added node, and the construction terminates when there is no $x \in V$ with $\Delta\mathrm{KMB}_G(N \cup S, \{x\}) > 0$. The final topology is obtained by computing the KMB construction using $N \cup S$ as the pins and the remaining $V - (N \cup S)$ nodes as potential Steiner nodes. The overall IKMB method is more formally described in Figure 7.

The placement and global-routing phases seek to minimize congestion, thereby enabling the detailed router to find a feasible (and high-quality) solution more easily. However, since it is NP-complete to determine whether there exists a feasible detailed-routing solution for all nets [41], we use a deterministic net-ordering scheme to route nets one at a time. When a detailed-routing solution for a net is found, the corresponding routing resources are committed to that net and are made unavailable for subsequent nets (i.e., they are removed from the underlying graph). If infeasibility is encountered during the detailed routing of a net (i.e., some logic-block pin is unreachable in the routing graph from the other pins of the net), the following two heuristics are employed.
First, an incremental "wavefront-expansion" technique is used to gradually "loosen" the global route, allowing the detailed route to detour around local blockages caused by previously routed nets (Figure 8). Note that wavefront expansion determines the region searched by the routing algorithm, as opposed to the order in which graph edges are explored [22]. Second, we strive to minimize congestion, which is a measure of resource utilization. To gauge congestion, we divide routing resources into disjoint groups according to functional similarity and physical proximity. For example, all channel edges interconnecting the same two switch blocks form a group, as do all edges inside a particular switch block. As nets are routed, the detailed router updates each group's congestion information (i.e., the number of edges in each group taken by all previously routed nets). Multi-objective optimization is used in the IKMB graph searches to heuristically minimize a combination of wirelength and congestion (see the Appendix for additional details). Thus, within the region specified by the global route, our detailed router searches for a feasible solution minimizing both congestion and wirelength.
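A simple way to combine two edge criteria, shown here only as an assumed example, is to attach a (wirelength, congestion) pair to every edge and collapse it into a single weight by a weighted sum before running an ordinary graph search. The text does not state which combination FPR uses; the weighted sum, the name `scalarize` and the coefficient values are hypothetical.

```python
def scalarize(kgraph, coeffs):
    """Collapse vector-valued edge weights into scalars by a weighted sum.

    kgraph maps node -> {neighbor: (w1, ..., wk)}; coeffs holds one
    coefficient per criterion. The weighted sum is an assumed combination rule.
    """
    return {
        u: {v: sum(c * w for c, w in zip(coeffs, ws)) for v, ws in nbrs.items()}
        for u, nbrs in kgraph.items()
    }
```

For instance, an edge of length 1 in a group where 4 edges are already taken, combined with coefficients (1.0, 0.5), receives the scalar weight 3.0, so a shortest-path search on the result is steered away from heavily used groups.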
We found that in practice, the majority of those nets that fail to route using the initial global route become routable after only a single loosening operation. In cases where wavefront expansion fails to produce a routing solution, we employ a "move-to-front" heuristic [36], where unroutable nets are moved to the beginning of the net-routing order and the new routing order is attempted.
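The move-to-front reordering can be sketched as a retry loop. This is a hedged illustration: the callback `try_route`, the pass limit `max_passes`, and the choice to restart as soon as the first net fails are all assumptions made for the example, not details given in the text.

```python
def route_with_move_to_front(nets, try_route, max_passes=10):
    """Retry routing, moving each failing net to the front of the order.

    try_route(net, committed) is a hypothetical callback that returns True
    if the net can be routed given the nets already committed in this pass.
    Returns the successful order, or None if max_passes is exhausted.
    """
    order = list(nets)
    for _ in range(max_passes):
        committed = []          # routing resources are released between passes
        failed = None
        for net in order:
            if try_route(net, committed):
                committed.append(net)
            else:
                failed = net
                break
        if failed is None:
            return order
        order.remove(failed)    # move the unroutable net to the front
        order.insert(0, failed)
    return None
```

A net that can only be routed while the array is still empty, for example, fails in its original late position and succeeds once it has been moved to the head of the order.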
Experimental Results
Our algorithms have been implemented using C++ in the Sun/UNIX environment and incorporated into FPR. Two FPGA architectures, corresponding to Xilinx 3000-series and 4000-series parts, were modeled [7, 43] (these architectures are identical to the ones used by CGE [8], SEGA [32] and GBP [42], respectively). We compared the performance of these tools on fourteen large benchmark circuits: the suite of five 3000-series benchmarks used by [8], and the suite of nine 4000-series benchmarks used by [32] and [42]. The 3000-series benchmarks were routed on FPGAs with switch-block flexibility $F_s = 6$ and connection flexibility $F_c = \lceil 0.6W \rceil$, where $W$ is the channel width.

During FPGA physical design, a common objective is to minimize maximum channel width. (Smaller channel width implies the ability to route larger designs on a fixed-size part.) Table 1 shows the maximum channel widths of actual complete placement and routing solutions produced by FPR; these compare favorably with CGE [8] for the 3000-series benchmarks, and with SEGA [32] and GBP [42] for the 4000-series benchmarks. The channel width required by FPR is smaller than that required by CGE, SEGA, and GBP in 8 of the 14 benchmark circuits, and is equal on all but one of the remaining 6 benchmark circuits (further improvements have been recently obtained in [4]).
We also measured how well FPR optimizes total wirelength and maximum source-sink pathlength (radius). Since previous works do not report these statistics, we have implemented a modified version of FPR, called FPR-S, that uses unrooted Steiner trees as thumbnails [17], instead of the preferred arborescence thumbnails described in Section 3. We compared the solutions produced by FPR-S against performance-oriented solutions produced by the unmodified FPR tool. We observe that the additional 1.0% in wirelength used by FPR yields a 6.7% decrease in radius (Table 2). We believe the 1.0% total wirelength difference is insignificant but the 6.7% difference in average radius is significant. Therefore we recommend the use of FPR with its use of RSAs over FPR-S and other similar tree-based tools. The time to run FPR is comparable to other tools: CPU times to completely lay out the circuits on a Sun SparcServer 10/514 workstation ranged from several minutes for the smallest circuit to several hours for the largest. Figure 9 shows the solution produced by FPR for the smallest of the benchmark circuits.

Conclusion
We have developed FPR, a placement and routing tool for FPGAs that combines a recursive geometric strategy for simultaneous placement and global routing with a general graph-based detailed-routing algorithm. FPR addresses performance issues by minimizing source-sink pathlengths as well as total wirelength and maximum channel width. FPR compares favorably to existing tools on both 3000-series and 4000-series Xilinx-type parts, as measured by the maximum channel width required for complete layout of a number of industrial benchmarks.
Acknowledgements
We thank Matt Saltzman for the use of his matching code, and Steve Brown and Jonathan Rose for supplying the benchmark circuits. We are grateful to Dr. Bob Grafton of the National Science Foundation for his support and advice.

              Wirelength                Radius
              FPR-S    FPR      %       FPR-S    FPR      %
Overall       12.2     12.3     1.0     8.2      7.6      -6.7

Table 2: Comparison of arborescence-based FPR against Steiner-tree-based FPR-S. Wirelength statistics reflect average number of channel segments used by nets in the circuit; radius statistics reflect average number of channel segments encountered on longest source-sink path for each net. The % column gives the percent change from FPR-S to FPR.

Appendix
This technique is flexible in that new criteria are easily incorporated into the model by introducing additional weight sets into the graph. Such a framework subsumes, e.g., "alpha-beta" routing (which has been used for jog minimization in IC design [12, 23]), and also has practical application in non-VLSI domains [13].
Let $V = \{v_1, v_2, \ldots, v_n\}$ be a set of nodes, and let $E \subseteq V \times V$ be a set of edges. We define a $k$-weighted graph $G = (V, E)$ to be a weighted graph with a vector-valued weight function $\tilde{w} : E \to \mathbb{R}^k$. In other words, associated with each edge $e_{ij} \in E$ is a vector of $k$ real-valued weights $\tilde{w}_{ij} = (w_{ij1}, w_{ij2}, \ldots, w_{ijk})$. Note that ordinary weighted graphs are a special case of $k$-weighted graphs, with $k = 1$.

Proof: Let $G = (V, E)$ be a 3-node 2-weighted graph, with edge weights $(a, x)$, $(b, y)$, and $(c, z)$.

We therefore conjecture that our proven bounds can be made considerably tighter, and leave this as an open problem (recently, tighter bounds were indeed derived for MSTs over multi-weighted graphs [19]).
