Routing for FPGAs has been a very challenging problem due to limited routing resources. Although the FPGA routing problem has been researched extensively, most algorithms route one net at a time, which can cause the net ordering problem.
INTRODUCTION
Due to their low manufacturing cost and time, field programmable gate arrays (FPGAs) have been very popular for rapid system prototyping, logic emulation, and reconfigurable computing. Figure 1 shows a typical array-based FPGA architecture. An array-based FPGA consists of a two-dimensional array of logic modules, vertical and horizontal routing channels, and switch modules. The logic modules contain combinational and sequential circuits to realize logic functions. The routing channels and the switch modules comprise the routing resources of an FPGA. The routing channels usually have wire segments of various lengths to improve circuit performance and maintain reasonable routability at the same time. Routing of an FPGA is performed by programming the switches to connect the wire segments. Due to their high RC delays and large area, the routability of switch modules is usually limited.
*This work was partially supported by the National Science Foundation under grant CCR-0244236.
It is widely known that the feasibility of FPGA design is most constrained by routing resources, and routing delays dominate the performance of FPGAs. The routing problem of array-based FPGAs has been extensively studied by researchers [4, 6, 9, 12, 13, 14, 16, 18]. Most FPGA routers route one net at a time, and they suffer from the net ordering problem, in which routing results may vary significantly depending on the order in which the nets are routed. The PathFinder algorithm [14] alleviates this problem with a well-designed rip-up and reroute scheme. The VPR router [16], based on a careful implementation of the PathFinder algorithm, is a very successful placement and routing tool.
In this paper, we present an effective routability-driven detailed routing algorithm, which also considers routing delay. Our algorithm first considers the problem of routing all the nets connected to one common logic module through logically equivalent input pins. We assume LUT-based logic modules for our target architecture. Our algorithm exploits the fact that the input pins of a LUT are logically equivalent, and it routes all the nets connected to one LUT simultaneously. Our algorithm is based on min-cost flow computations, and it guarantees to find a congestion-free wire type assignment solution if one exists. Furthermore, it can find a solution with minimum total delay at the same time. Although network flow frameworks have been used for various routing algorithms [3, 7, 11, 15], most of those algorithms need to solve multicommodity flow problems, which cannot guarantee integer flow solutions. Although [15] proposed an algorithm based on a min-cost flow algorithm, it can only handle the nets connected to a common node.
Copyright 2003 ACM 1-58113-762-1/03/0011 ... $5.00.
ICCAD'03, November 11-13, 2003
Figure 2: A routing graph and the routing trees for two nets. Black nodes represent input pin nodes, white nodes correspond to output pin nodes, and the gray nodes correspond to the wire segments. Edge pairs between the nodes are represented by bidirectional edges for simplicity.
To alleviate the possible ordering problem in LUT selection, and to further improve the routing results, we adopt an iterative refinement scheme based on Lagrangian relaxation. In the Lagrangian relaxation framework, the routing problem is transformed into a sequence of subproblems called the Lagrangian subproblems. Each subproblem corresponds to the routing problem for all the nets connected to a LUT, and it is solved by a min-cost flow computation. At each iteration of our algorithm, violations of the congestion constraints are reflected in the values of the corresponding Lagrangian multipliers, and the multipliers guide the router.
The rest of the paper is organized as follows. The FPGA detailed routing problem is defined in section 2. In section 3, we present the flow network graph construction scheme and our network flow based algorithm for routing nets, together with the Lagrangian relaxation framework for iterative refinement. Experimental results are shown in section 4, and we conclude the paper in section 5.
PROBLEM FORMULATION
Routing of an FPGA is performed by programming the switches to connect the wire segments. Unlike the interconnection tracks in custom ICs, a wire segment in an FPGA cannot be shared by different nets. Together with performance constraints, these congestion constraints make FPGA routing a very challenging problem. The problem of routing FPGAs is to assign nets to routing resources such that all nets are routed successfully. The routing architecture of an FPGA can be modeled with a routing graph $G(V, E)$. The set of nodes $V$ in this directed graph represents the input pins and output pins of logic modules, and the wire segments. The set of edges $E$ corresponds to feasible connections between the routing resources represented by nodes. We associate a capacity from a set $R$ and a cost from a set $C$ with each node and edge in $G$. The capacity of a node (or an edge) denotes the available number of the pin or wire segment (switch), and the cost of a node (or an edge) represents the routing delay through the corresponding routing resource. For detailed routing, capacities for all the routing resources can be set to 1. A route of a net corresponds to a subtree in $G$. The root of a routing tree is the source of the net, and all the leaf nodes are the sinks of the net. Because no resource can be shared by different nets, the routing trees for the nets are vertex disjoint. Figure 2 shows an example of a routing graph and routing trees. Nodes and edges are superimposed on the corresponding routing resources. The problem addressed in this paper is stated below:

FPGA detailed routing problem: Given a routing resource graph $G$ for an FPGA architecture, find vertex-disjoint routing trees in $G$ for all the nets.
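The routing-graph model and the vertex-disjointness requirement can be illustrated with a small sketch. The class `RoutingGraph`, the helper `overused_nodes`, and the node names are hypothetical and chosen only for illustration; unit capacities follow the detailed-routing assumption stated above.

```python
from dataclasses import dataclass, field


@dataclass
class RoutingGraph:
    """Toy routing-resource graph (illustrative names, not from the paper)."""
    cost: dict = field(default_factory=dict)      # node -> delay cost
    capacity: dict = field(default_factory=dict)  # node -> capacity (1 for detailed routing)
    edges: dict = field(default_factory=dict)     # node -> set of successor nodes

    def add_node(self, name, cost=1, capacity=1):
        self.cost[name] = cost
        self.capacity[name] = capacity
        self.edges.setdefault(name, set())

    def add_edge(self, u, v, bidirectional=True):
        # A bidirectional wire connection is stored as a pair of directed edges.
        self.edges[u].add(v)
        if bidirectional:
            self.edges[v].add(u)


def overused_nodes(graph, routing_trees):
    """Return the nodes used by more nets than their capacity allows."""
    usage = {}
    for nodes in routing_trees.values():
        for node in set(nodes):
            usage[node] = usage.get(node, 0) + 1
    return {n for n, count in usage.items() if count > graph.capacity[n]}


g = RoutingGraph()
for name in ["out1", "out2", "a", "b", "c", "in1", "in2"]:
    g.add_node(name)
for u, v in [("out1", "a"), ("a", "b"), ("b", "in1"), ("b", "in2"),
             ("out2", "c"), ("c", "b"), ("c", "in2")]:
    g.add_edge(u, v)

# Two candidate routings: the first is vertex disjoint, the second shares wire "b".
disjoint = {"net1": ["out1", "a", "b", "in1"], "net2": ["out2", "c", "in2"]}
conflict = {"net1": ["out1", "a", "b", "in1"], "net2": ["out2", "c", "b", "in2"]}
print(overused_nodes(g, disjoint), overused_nodes(g, conflict))
```

A routing is feasible in this model exactly when `overused_nodes` returns an empty set.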
To alleviate the net ordering problem, we route several nets connected to a common logic module simultaneously rather than routing nets one by one. As shown in Figure 1, we assume a widely used model of a logic module, which consists of several look-up tables (LUTs) and flip-flops. In this paper, we assume that a logic module contains one LUT for simplicity of presentation unless stated otherwise. Because all the inputs of a LUT are logically equivalent, they are permutable when they are assigned to the routes for the nets connected to the LUT. In our routing algorithm, we route all the nets connected to a LUT simultaneously. For multiple-pin nets, we route a portion of the net (a net segment) branching from a partial routing tree previously constructed for the net. In the example shown in Figure 2, both net1 and net2 are connected to logic module $L_2$. Net1 is a multiple-pin net. Suppose the routing tree for net1 is already partially constructed (from $L_1$ to $L_4$ using wire segments a and g) when logic module $L_2$ is processed. Then, only the portion of the tree branching from this partially constructed tree needs to be routed when module $L_2$ is handled. Before stating the overall routing problem, we define a smaller problem as follows:
The Routing for One LUT (ROL) Problem: Given a routing resource graph $G(V, E)$ and a LUT, find routes for all the net segments connected to the LUT such that each edge and node is used at most once.
Because we will use min-cost flow computation to solve the ROL problem, we can also minimize the total cost of the routes for the nets. By solving the ROL problem for each LUT, we can solve the overall routing problem for an FPGA.
As the ROL problem for each LUT is solved, branches of the routing trees for the nets are gradually constructed, and after solving ROL for all the LUTs, we obtain the routing trees for all the nets in an FPGA. In the example of Figure 2, although only a portion of net1 is routed when $L_4$ is processed, by the time routing for $L_2$ is finished, the routing trees for both net1 and net2 are fully constructed. The FPGA detailed routing problem can be formulated as:
where $x_{ik}$ is a decision variable for each node (or edge) defined as

$$x_{ik} = \begin{cases} 1, & \text{if the routing of net } k \text{ uses node (or edge) } i \\ 0, & \text{otherwise,} \end{cases}$$

$c_i$ is the cost of node (or edge) $i$, $x$ is the vector of $x_{ik}$, and $X$ is the set of all possible routes of each net.
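Read together with the unit capacities assumed for detailed routing, these definitions suggest an integer program of the following shape. This is a sketch under that assumption, written only from the symbols $x_{ik}$, $c_i$, and $X$ defined here, and it may differ in detail from the authors' exact formulation.

```latex
\begin{aligned}
\text{minimize}   \quad & \sum_{k} \sum_{i} c_i \, x_{ik} \\
\text{subject to} \quad & \sum_{k} x_{ik} \le 1 \quad \text{for every node (or edge) } i, \\
                        & x \in X .
\end{aligned}
```

The first constraint family expresses congestion (no resource is used by more than one net), and $x \in X$ requires each net's selected resources to form a valid route.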
ALGORITHM DESCRIPTION
In this section, we describe the algorithms to solve the problems introduced in the previous section. By performing min-cost max-flow computations on the flow network constructed from $G$, our algorithm for the ROL problem, ROL-NF, can solve the ROL problem in polynomial time. Because each routing resource can be accessed by several nets, there can be dependencies between the flows computed for each LUT. To avoid this ordering problem among the LUTs processed by ROL-NF, we adopt an iterative refinement scheme based on Lagrangian relaxation.
Lagrangian relaxation is a general technique for solving optimization problems with difficult constraints. In Lagrangian relaxation, constraints are relaxed and added to the objective function after being multiplied by coefficients called Lagrangian multipliers. In our scheme, the congestion constraints are relaxed, and the Lagrangian multipliers used to relax the congestion constraints are added to the cost for each node to guide the router to find less congested routes for the nets.
Min-cost flow algorithm to route nets for one LUT
To solve the ROL problem for a LUT, we construct a flow network from the routing graph. We will apply a min-cost flow algorithm on the constructed flow network. Let $L = \{l_1, l_2, \ldots, l_{|L|}\}$ be the set of all LUTs in an FPGA. Given a routing graph $G(V, E)$ with capacity $R$ and cost $C$ and a LUT $l_k$, we construct the flow network $G_f(V_f, E_f)$ as follows:
1. $V_f = V \cup \{s, s_1, s_2, \ldots, s_n, t\}$, where $s$ is a source node, $s_i$ is a subsource node, and $t$ is a sink node of $G_f(V_f, E_f)$. $n$ is the number of nets connected to $l_k$.
2. $E_f = E \cup E_s \cup E_{ss} \cup E_t$, where $E_s = \{(s, s_i) \mid i = 1, 2, \ldots, n\}$, $E_{ss} = \{(s_i, v) \mid i = 1, 2, \ldots, n,\ v \in T_i\}$, and $E_t = \{(p_i, t) \mid p_i \in S_p\}$. $T_i$ denotes a partially constructed routing tree for net $i$, and $S_p$ is the set of nodes corresponding to the input pins of $l_k$.

The constructed flow network for a ROL problem is illustrated in Figure 4. Note that the subsource nodes and the edge set are adaptively updated for each $l_k$ depending on the number of nets connected to $l_k$ and the partial routing tree of each net constructed in the process of solving ROL problems for other LUTs. By connecting a subsource node to all the nodes belonging to a partial routing tree for a net, we can find the best branching point for the Steiner tree.

To make our problem conform to the classical network flow framework, we transform $G_f(V_f, E_f)$ into a directed graph in which only edges have capacities and costs. Note that any undirected edge in $G_f(V_f, E_f)$, which can arise from bidirectional wire segments, can be transformed into a pair of directed edges with the cost and the capacity of the original undirected edge [2]. By the node splitting transformation, any node $i$ with nonzero cost and capacity is transformed into two nodes $i'$ and $i''$. This transformation replaces each of the original edges $(j, i)$ and $(i, k)$ with $(j, i')$ and $(i'', k)$, respectively. It also adds an edge $(i', i'')$ with the cost and the capacity of node $i$. Figure 3 shows an example of the network transformations.

It can be shown that any flow in $G_f$ is a routing solution for a subset of the given nets to route. Each unit of flow from $s$ to $t$ through a subsource corresponds to a route for a net segment connected to $l_k$. A node occupied by a flow corresponds to a wire segment used for the route, and flow along an edge denotes that the corresponding switch is used for the route. If a flow $f$ exists with $|f| = n$, then we can find a feasible solution for all the net segments connected to the LUT, and the cost of the flow is the cost of a solution to the ROL problem. Since we assigned capacity 1 to all the edges, and the number of edges connected to $s$ is $n$, which is the same as the number of edges connected to $t$,

$$|f'| \le n,$$

where $f'$ is the maximum flow in $G_f$. If $|f'| < n$, there is no feasible solution to the ROL problem, and the min-cost maximum flow assigns routing resources to the routes for as many net segments connected to the LUT as possible with minimum total delay cost. The following theorem shows that the ROL problem can be exactly solved by a network flow computation on $G_f$.

From the computed flow in $G_f$, a solution to the ROL problem can be derived by a depth-first search from each input pin node of the LUT to each subsource node in $G_f$. Because each subsource node $s_i$ is associated with a net segment and is connected to the nodes belonging to the partial routing tree of that net, the tree node adjacent to the subsource node in the flow becomes the branching point for the branch from the routing tree to the LUT. Figure 4 shows a flow $f$ corresponding to a ROL solution for 4 net segments connected to a LUT. In this example, the net from $l_1$ is a multiple-pin net, and a portion of this net is routed while $l_{14}$ is processed. Hence, the subsource $s_2$ is connected to all the nodes along the partially constructed routing tree. Because only a flow of size one can pass through $s_2$, only one node is selected as a branching point.

There are several polynomial-time optimal algorithms available.

THEOREM 2. The ROL-NF algorithm exactly solves the ROL problem in $O(VE \log\log R_{\max} \log(V C_{\max}))$ time.
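The construction above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' ROL-NF implementation: the names `MinCostFlow` and `route_one_lut` are hypothetical, the solver is a plain successive-shortest-path method rather than the faster algorithm behind the stated complexity bound, and one simplifying assumption is made for partial-tree nodes (they are not split, and only the owning subsource may inject flow at them, so no other net can pass through them and their cost is not charged twice).

```python
from collections import deque


class MinCostFlow:
    """Successive-shortest-path min-cost max-flow for unit-capacity sketches."""

    def __init__(self):
        self.graph = {}  # node -> list of [to, residual capacity, cost, reverse index]

    def add_edge(self, u, v, cap, cost):
        self.graph.setdefault(u, [])
        self.graph.setdefault(v, [])
        self.graph[u].append([v, cap, cost, len(self.graph[v])])
        self.graph[v].append([u, 0, -cost, len(self.graph[u]) - 1])

    def solve(self, s, t):
        flow = total = 0
        while True:
            # Bellman-Ford style search for a cheapest augmenting path.
            dist, parent = {s: 0}, {}
            queue, queued = deque([s]), {s}
            while queue:
                u = queue.popleft()
                queued.discard(u)
                for idx, (v, cap, cost, _) in enumerate(self.graph[u]):
                    if cap > 0 and dist[u] + cost < dist.get(v, float("inf")):
                        dist[v] = dist[u] + cost
                        parent[v] = (u, idx)
                        if v not in queued:
                            queue.append(v)
                            queued.add(v)
            if t not in dist:
                return flow, total
            v = t
            while v != s:  # push one unit of flow along the path
                u, idx = parent[v]
                edge = self.graph[u][idx]
                edge[1] -= 1
                self.graph[v][edge[3]][1] += 1
                v = u
            flow += 1
            total += dist[t]


def route_one_lut(node_cost, edges, partial_trees, input_pins):
    """Build the flow network for one LUT and return (flow value, total cost)."""
    net = MinCostFlow()
    tree_nodes = set().union(*partial_trees)
    for v, c in node_cost.items():
        if v not in tree_nodes:            # assumption: tree nodes are not split
            net.add_edge((v, "in"), (v, "out"), 1, c)   # node splitting i' -> i''
    for u, v in edges:
        net.add_edge((u, "out"), (v, "in"), 1, 0)
    for i, tree in enumerate(partial_trees):
        net.add_edge("s", ("sub", i), 1, 0)             # source -> subsource s_i
        for v in tree:
            net.add_edge(("sub", i), (v, "out"), 1, 0)  # subsource -> tree node
    for p in input_pins:
        net.add_edge((p, "out"), "t", 1, 0)             # input pin -> sink
    return net.solve("s", "t")


# net 0 already owns {o1, a}; net 1 starts at o2 and can only reach wire b.
node_cost = {"o1": 0, "o2": 0, "a": 1, "b": 1, "c": 3, "p1": 0, "p2": 0}
edges = [("o1", "a"), ("a", "b"), ("a", "c"), ("o2", "b"),
         ("b", "p1"), ("b", "p2"), ("c", "p1"), ("c", "p2")]
flow, cost = route_one_lut(node_cost, edges, [{"o1", "a"}, {"o2"}], ["p1", "p2"])
print(flow, cost)
```

In this toy instance a net-at-a-time router that handles net 0 first would take the cheap wire `b` and leave net 1 unroutable, whereas the joint flow sends net 0 through `c` and net 1 through `b`. A flow value smaller than the number of nets signals that no feasible solution exists for that LUT.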
Iterative refinement using Lagrangian relaxation
In this section, we solve the FPGA detailed routing problem. We apply the ROL-NF algorithm successively to all the LUTs in an FPGA. To avoid the ordering problem of selecting LUTs, we adopt an iterative refinement scheme based on Lagrangian relaxation. We relax the congestion constraints in the FPGA detailed routing problem formulated in the previous section. Each of the constraints is multiplied by the corresponding Lagrangian multiplier $\lambda_i$ and added to the objective function. We solve the resulting reduced Lagrangian subproblem $LS'_\lambda$ by solving the ROL problems for all LUTs in $L$ after assigning $(c_i + \lambda_i)$ to each node (edge) $i$ as a cost. After solving the ROL problem, we reset the capacity of the edges to allow the relaxation of the congestion constraints. To discourage using routing resources already used in routing nets to other LUTs, we define the $c_i$ term as follows,
where $d_i$ is the delay cost of node (edge) $i$, and $q_i$ is a penalty term proportional to the number of nets currently using node (edge) $i$.
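To see why relaxing the congestion constraints leads to the modified cost $c_i + \lambda_i$, consider the following worked step. It assumes the original problem has the form: minimize $\sum_k \sum_i c_i x_{ik}$ subject to $\sum_k x_{ik} \le 1$ for every resource $i$ and $x \in X$. That form is an assumption consistent with the definitions of $x_{ik}$, $c_i$, and $X$, and the symbol $\mathcal{L}$ is introduced here only for illustration.

```latex
\begin{aligned}
\mathcal{L}(x, \lambda)
  &= \sum_{k} \sum_{i} c_i \, x_{ik}
     + \sum_{i} \lambda_i \Bigl( \sum_{k} x_{ik} - 1 \Bigr),
     \qquad \lambda_i \ge 0 \\
  &= \sum_{k} \sum_{i} (c_i + \lambda_i) \, x_{ik} \;-\; \sum_{i} \lambda_i .
\end{aligned}
```

For a fixed $\lambda$ the term $\sum_i \lambda_i$ is constant, so minimizing over $x \in X$ amounts to routing with cost $c_i + \lambda_i$ on each resource and no congestion constraint, which is the role of the reduced subproblem.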
It is known that the minimum value of the Lagrangian subproblem for any given vector $\lambda$ is a lower bound on the optimal objective function value of the original optimization problem. Hence, the tightest lower bound on the optimal objective function value is obtained by solving

$$\max_{\lambda \ge 0} LS_\lambda,$$

which is known as the Lagrangian dual problem. To solve the Lagrangian dual problem, an iterative approach is used.
At each iteration, we solve $LS_\lambda$ by solving $LS'_\lambda$ for a given $\lambda$, and then update the Lagrangian multipliers for the next iteration using the solution of the current iteration.
...
3. Rip up all the nets connected to $l_k$
4. Call ROL-NF
5. Update costs and reset capacities
end
6. Update $\lambda$
7. Repeat Steps 2-6 until no shared resource exists
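The refinement loop can be illustrated with a toy sketch. Several parts are assumptions made only for illustration: each net chooses among a fixed list of candidate routes instead of calling ROL-NF, the present-usage penalty is a simple multiple of the current usage, and the multiplier update is a plain additive step on overused resources, since the update rule is not given in the text. The names `iterative_refinement`, `count_usage`, `step`, and `penalty` are hypothetical.

```python
def count_usage(choice):
    """Count how many nets currently use each resource."""
    usage = {}
    for route in choice.values():
        for r in route:
            usage[r] = usage.get(r, 0) + 1
    return usage


def iterative_refinement(delay, candidates, step=1.0, penalty=0.5, max_iters=50):
    lam = {r: 0.0 for r in delay}   # one Lagrangian multiplier per resource
    choice = {}                     # net -> currently selected route
    for iteration in range(1, max_iters + 1):
        for net, routes in candidates.items():
            choice.pop(net, None)           # rip up this net
            usage = count_usage(choice)     # resources held by the other nets

            def cost(route):
                # delay + penalty for present sharing + Lagrangian multiplier
                return sum(delay[r] + penalty * usage.get(r, 0) + lam[r] for r in route)

            choice[net] = min(routes, key=cost)   # stand-in for the ROL-NF call
        usage = count_usage(choice)
        shared = {r for r, u in usage.items() if u > 1}
        if not shared:                      # stop when no resource is shared
            return choice, iteration
        for r in shared:                    # assumed additive multiplier update
            lam[r] += step * (usage[r] - 1)
    return choice, max_iters


delay = {"w1": 1, "w2": 2, "w3": 4}
candidates = {
    "netA": [("w1",), ("w2",)],
    "netB": [("w1",), ("w3",)],
}
result = iterative_refinement(delay, candidates)
print(result)
```

In this instance both nets select `w1` in the first pass, the multiplier on `w1` is raised, and in the second pass `netA` moves to `w2`, which removes the conflict with a lower total delay than moving `netB` to `w3`.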
EXPERIMENTAL RESULTS
We have implemented our algorithms in the C programming language on a SUN Sparc Ultra 5 (360 MHz) with 128 MB of memory. The experiments are performed on 9 circuits from the MCNC benchmark suite [17]. The placed netlists were generated using the placer in VPR [16]. We assumed a symmetrical array-based FPGA, where each logic block contains four 4-input lookup tables and four flip-flops. We set $F_s = 3$ and $F_c = W$, where $W$ is the number of wire segments in each channel. $F_s$ denotes the number of connections for each wiring segment entering the switch box. $F_c$ denotes the number of tracks to which each logic block pin can connect. For the purpose of comparison, we used the same intrinsic delay values and timing models as VPR.
We compared the minimum number of tracks per channel needed to achieve feasible routings for all nets, the critical path delay, and the total wire length from FlowRoute with those from the VPR router. Because FlowRoute is basically a congestion-driven detailed router, we compared our results with the ones obtained by running the VPR router in congestion-driven mode. Results are shown in Table 1. FlowRoute used a smaller number of tracks per channel for 3 circuits. Although FlowRoute is a congestion-driven router, because it is based on min-cost flow computation, it shows an improvement in critical path delay of up to 28.9% (average 14.1%).
Because the objective function of the problem in this paper is the total sum of delays of routing resources, we also compared total wire length. The wire lengths in the table are represented as integer multiples of one logic module length. The total wire length used to route all nets is reduced by up to 13.3% (average 8.3%).
CONCLUSIONS
In this paper, we proposed a congestion-driven detailed router for FPGAs. In our algorithm, we route all the net segments connected to a LUT simultaneously rather than routing one net at a time. By routing several net segments simultaneously using min-cost flow computation, our algorithm can alleviate the net ordering problem. To avoid the ordering problem in selecting LUTs, we adopted an iterative refinement scheme based on Lagrangian relaxation. Each of the Lagrangian subproblems is solved by successive application of the min-cost flow based routing algorithm to all the net segments connected to each LUT.
We found feasible routings for the benchmark circuits with fewer or an equal number of routing tracks per channel compared to the VPR router. Furthermore, the total delays of the nets are also reduced, which can contribute to reducing the critical path delay of the circuit. We compared our algorithm with the VPR router, and the experimental results show that our algorithm is very effective.
