Short Papers

TRACER-fpga: A Router for RAM-Based FPGA's
Ching-Dong Chen, Yuh-Sheng Lee, Allen C.-H. Wu, and Youn-Long Lin

Abstract: We describe a routing method for the design of a class of RAM-based field programmable gate arrays (FPGA's). We model the interconnect resources as a graph. A routing solution is represented as a set of disjoint trees on the graph, each connecting all terminals of a net. An expansion router is used for connecting a net. Initially, nets are connected independently of one another. Conflicts among nets over the usage of interconnect resources are resolved iteratively by a rip-up and rerouter, which is guided by a simulated evolution-based optimization technique. The proposed approach has been implemented in a program called TRACER-fpga. Compared with CGE [2] and SEGA [3], TRACER-fpga in general requires fewer routing tracks at the expense of longer wiring delay. It is suitable for low-speed applications such as hardware emulation.
I. INTRODUCTION
Because of their low nonrecurring engineering (NRE) cost and short manufacturing time, field programmable gate arrays (FPGA's) have become the most popular application-specific integrated circuits (ASIC's) for fast system prototyping. Hence, it is important to develop good computer-aided design (CAD) software to aid the design of FPGA's. A typical FPGA design process consists of partitioning, logic synthesis, technology mapping, placement, and routing. In this paper, we focus on the last problem.
There are many commercial FPGA's. RAM-based FPGA's, such as Xilinx's XC3000 and XC4000 series [1], are a widely used class. A RAM-based FPGA consists of a two-dimensional array of configurable logic blocks (CLB's), a number of rows and columns of predefined wiring channels, and many programmable switches. Each CLB has a look-up table (LUT), which can implement any Boolean function of up to K variables (K is usually 4 or 5 in the devices available to date). Each wiring channel consists of a number of prefabricated wiring segments. Wiring segments may be of variable lengths. A programmable switch may be turned on to connect either two wiring segments, or a CLB's pin and a wiring segment.
Since the wiring segments are prefabricated, the routing problem can be viewed as selecting a subset of the programmable switches to turn on. This is very different from semicustom layouts, such as standard cells or mask-programmable gate arrays, in which wiring segments and vias can be placed virtually anywhere on the grid.
Because of the limited interconnect resources, routing an FPGA seems difficult. However, if we view routing as a search problem, limited resources usually imply a small solution space. This observation motivates us to take a graph search approach to the FPGA routing problem. We model the interconnect resources as a graph, where each node represents either a CLB pin or a wiring segment, and each edge represents a programmable switch. The edges become directed if the switches are unidirectional. We further model the routing problem as finding on the graph a set of disjoint trees (subgraphs), in which each tree connects all terminals of a net.

The routing is done in two stages: initial routing, and rip-up and rerouting. In both stages, a net is connected using an expansion-routing-based algorithm. During the first stage, nets are routed sequentially and independently of one another. That is, for every net we ignore the existence of any previously routed nets. There is no ordering imposed on the nets. Inevitably, there will be conflicts among nets over the usage of routing resources. During the second stage, conflicts are resolved iteratively. Within an iteration, some nets are ripped up and rerouted. The selection of nets for ripping up is guided by a simulated evolution-based optimization technique. The rerouting is done again with the expansion router, except that the presence of other already routed nets is no longer ignored.

The rest of this paper is organized as follows. In the next section, we define the routing problem for the RAM-based FPGA architecture. In Section III, we briefly describe previous work on the problem. In Section IV, we propose our approach to the problem and describe TRACER-fpga, a software program implementing the proposed approach. Section V presents a series of experimental results and compares TRACER-fpga with CGE/SEGA, a well-known tool for the same problem. Finally, Section VI concludes and points to some areas for further research.

Manuscript received July 6, 1993; revised February 7, 1994 and September 20, 1994. This work was supported in part by the National Science Council, ROC, Contracts NSC-82-0404-E-007-143 and NSC-83-0404-E-007-020. This paper was recommended by Associate Editor G. Zimmermann.
The authors are with the Department of Computer Science, Tsing Hua University, Hsin-Chu, Taiwan 30043.
IEEE Log Number 9406935.
II. PROBLEM FORMULATION
A typical RAM-based FPGA consists of three types of objects: 1) the configurable I/O blocks (IOB's); 2) the two-dimensional array of configurable logic blocks (CLB's); and 3) the interconnect resources consisting of wiring segments and programmable switches.
Each programmable switch is a pass transistor controlled by a static RAM cell. The content of the RAM cell determines whether the pass transistor is on or off: when the RAM cell is high, the pass transistor is on; when it is low, the pass transistor is off.
We model the routing resources as a graph. Each node in the graph represents either a wiring segment or a CLB pin. Each edge represents a programmable switch in a connection box or a switch matrix. A connection box allows CLB pins to be connected to the routing channel, while a switch matrix provides routing paths from one channel to another. A programmable switch between two wiring segments W1 and W2, which are represented in the graph by nodes V1 and V2, respectively, is represented by the edge linking nodes V1 and V2.

Based on the graph model, we view the problem of routing a net as finding on the graph a tree that spans all nodes corresponding to the terminals of the net. In a feasible solution to the routing problem, all trees must be disjoint.
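To make the graph model concrete, here is a minimal Python sketch. It is an illustration only and not taken from TRACER-fpga: the names `RoutingGraph`, `connects`, and `shared_resources`, as well as the pin and segment labels, are hypothetical. It assumes bidirectional switches (undirected edges) and represents the route of a net simply as the set of nodes it uses.

```python
from collections import defaultdict, deque


class RoutingGraph:
    """Nodes are CLB pins or wiring segments; edges are programmable switches."""

    def __init__(self):
        self.adj = defaultdict(set)

    def add_switch(self, u, v):
        # A bidirectional switch joins two nodes (assumption: undirected edges).
        self.adj[u].add(v)
        self.adj[v].add(u)


def connects(graph, used_nodes, terminals):
    """True if the nodes used by a net link all of its terminals together."""
    used_nodes, terminals = set(used_nodes), set(terminals)
    if not terminals or not terminals <= used_nodes:
        return False
    start = next(iter(terminals))
    seen, queue = {start}, deque([start])
    while queue:
        u = queue.popleft()
        for v in graph.adj[u]:
            if v in used_nodes and v not in seen:
                seen.add(v)
                queue.append(v)
    return terminals <= seen


def shared_resources(routes):
    """Map each node used by more than one net to the set of nets using it."""
    users = defaultdict(set)
    for net, nodes in routes.items():
        for node in nodes:
            users[node].add(net)
    return {node: nets for node, nets in users.items() if len(nets) > 1}


# Hypothetical example: two nets that both use wiring segment "w1".
g = RoutingGraph()
for u, v in [("A.out", "w1"), ("w1", "w2"), ("w2", "B.in"), ("w1", "C.in")]:
    g.add_switch(u, v)
routes = {"n1": {"A.out", "w1", "w2", "B.in"}, "n2": {"C.in", "w1"}}
conflicts = shared_resources(routes)  # empty for a feasible routing
```

In this representation, a routing is feasible exactly when `shared_resources` returns an empty dictionary, which corresponds to the requirement that the trees be disjoint.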
III. PREVIOUS WORK
To the best of our knowledge, there are only two published routers targeted at the routing problem of RAM-based FPGA's: CGE [2] and SEGA [3], both from the University of Toronto.
CGE (coarse graph expansion) [2] first uses a global routing technique previously developed for standard-cell-based layout to decompose each net into a number of two-terminal connections and route them along minimum-distance coarse paths. Its primary goal is to distribute the connections among the channels so that the channel densities are balanced. It then chooses, for each two-terminal connection, the exact wiring segments that implement the coarse path assigned during global routing. Each coarse path is expanded into a number of exact paths. A cost function is used to iteratively select, among all exact paths, the best one to realize the corresponding two-terminal connection. Once an exact path has been selected, all exact paths expanded from the same coarse path and all other exact paths competing with it for routing resources are eliminated from further consideration. The iteration halts when no unrealized exact paths are left. The cost function helps achieve two goals: 1) to identify a path that is essential for a connection, and 2) to select a path that has a relatively small negative effect on the remaining unrealized connections.

0278-0070/95$04.00 © 1995 IEEE
SEGA (SEGment Allocator) [3] is intended for RAM-based FPGA's with variable-length wiring segments. It addresses not only the issue of 100% routing completion but also the allocation of wiring segments to connections in a way that matches the lengths of the wiring segments to the lengths of the connections. Therefore, long connections do not suffer from long propagation delay through multiple programmable switches. SEGA employs the same strategy as CGE. The main difference lies in the cost function. SEGA's cost function helps to minimize 1) the wastage due to allocating long wiring segments to short connections and 2) the number of wiring segments used during the selection of an exact path for a connection.
Conventional physical design approaches divide the routing problem into two subproblems: global routing and detailed routing. The primary reason for this division is problem complexity. However, the division leads to suboptimality even if both subproblems are solved optimally. Therefore, such a division should be avoided whenever possible.
Decomposition of a multiple-terminal net into two-terminal subnets also results in poor routing quality. This has been demonstrated through research in Steiner tree construction. Therefore, the decomposition of multiple-terminal nets should be avoided whenever it is possible.
In an instance of the routing problem, nets compete with one another for interconnect resources. What is good for a net is not necessarily good for the whole layout. That is, in a globally good routing, some nets may need to be routed using suboptimal patterns. Any strategy without rerouting will be inferior no matter how the nets (or subnets) are ordered for routing. Therefore, it is essential to have a good rerouting strategy.
IV. THE TRACER-fpga
TRACER-fpga is an extension of TRACER, a global router for building-block layouts [4], to FPGA routing. It uses a graph to represent the routing architecture and views the routing problem as a search on the graph. TRACER-fpga consists of two components: an initial router and a rip-up and rerouter. The initial router generates a routing which serves as the seed for the rip-up and rerouting process. The initial routing result may be, and usually is, infeasible due to conflicts among nets over the usage of routing resources. The rip-up and rerouter performs a simulated evolution process until a feasible solution is obtained.
The multiple-component growth algorithm used by the initial router is an extension of the classical Lee algorithm [5] for routing on a maze. A component is a connected subgraph of the routing graph. At the beginning, every component is just a pin vertex. A component is expanded by adding to it one or more adjacent vertices. A vertex is adjacent to a component if there exists an edge between it and a vertex of the component. The choice of which vertices to include during each expansion is made in a breadth-first fashion. That is, expansions are made from the expanded vertices with the smallest distance value.
As the expansion proceeds, a value is associated with each newly included vertex to indicate its distance from the original unexpanded component. Whenever two components are expanded into each other, a spanning connection between the two original components is found by backtracking. S(Gm, Gn) in the algorithm denotes such a connection. The two components, along with their connection, are merged and treated as a single component for later expansion. The process repeats until only one component is left, that is, until all pins have been connected together.
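The component growth described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the TRACER-fpga implementation: the function name `route_net` and the adjacency-dictionary input are hypothetical, every expansion step has unit cost, the FIFO order of the queue plays the role of the distance labels, and the wave is restarted from scratch after each merge.

```python
from collections import deque


def route_net(adj, terminals):
    """Return the set of vertices connecting all terminals, or None if impossible.

    adj maps each vertex to an iterable of its neighbouring vertices.
    """
    # At the beginning, every component is a single pin vertex.
    components = [{t} for t in dict.fromkeys(terminals)]
    if not components:
        return set()
    while len(components) > 1:
        owner = {}   # vertex -> index of the component whose wave reached it
        parent = {}  # vertex -> predecessor on the way back to that component
        queue = deque()
        for idx, comp in enumerate(components):
            for v in comp:
                owner[v], parent[v] = idx, None
                queue.append(v)
        meeting = None
        # Breadth-first expansion out of all components at once.
        while queue and meeting is None:
            u = queue.popleft()
            for v in adj.get(u, ()):
                if v not in owner:
                    owner[v], parent[v] = owner[u], u
                    queue.append(v)
                elif owner[v] != owner[u]:
                    meeting = (u, v)  # two components expanded into each other
                    break
        if meeting is None:
            return None  # the remaining components cannot be connected
        # Backtrack from both sides of the meeting edge to recover the connection.
        path = set()
        for x in meeting:
            while x is not None:
                path.add(x)
                x = parent[x]
        a, b = owner[meeting[0]], owner[meeting[1]]
        merged = components[a] | components[b] | path
        components = [c for i, c in enumerate(components) if i not in (a, b)]
        components.append(merged)
    return components[0]
```

Restarting the wave after every merge keeps the sketch short; an implementation concerned with run time would more likely keep the existing wavefronts and relabel only the merged component.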
While connecting a net, TRACER-fpga ignores the existence of all other nets. That is, it makes use of all routing resources. Inevitably, some wiring segments and/or switches will be occupied by more than one net. This is called a violation and is passed to the rip-up and rerouter for resolution.
Because the initial router routes nets independently of one another, there usually are violations of the FPGA routing constraints. The rip-up and rerouter resolves the violations using a simulated evolution-based optimization technique. Simulated evolution has been applied to various areas of CAD including placement, channel/switchbox routing, partitioning, and scheduling.
The most essential issue here is how to select the nets to be ripped up. Intuitively, a net causing many violations (i.e., one that competes with other nets for the same interconnect resources) is a good candidate for ripping up. However, such a straightforward approach may lead to a locally, instead of globally, optimal solution. Simulated evolution provides a randomized scheme that allows both good and bad nets to be ripped up. A bad net has a high chance of being ripped up, while a good net has a small but nonzero chance.

The tail of the rip-up and reroute procedure is:

            Rip up net n_i;
        for each ripped up net n_j
            Reroute net n_j using the multiple-component growth algorithm;
    endwhile
    if (Time-out) return(FAILURE);
    return(SUCCESS);
end

A net is scored according to its connection length and the number of violations in which it is involved, using the following formula:
$$\mathrm{score}(n_i) = \alpha \cdot \frac{\mathit{actual\_length}_i}{\mathit{estimated\_min\_length}_i} + (1 - \alpha) \cdot \mathit{num\_violations}_i,$$

where $\alpha$ is a parameter with which the user expresses his/her preference for one term over the other. The worse net $n_i$ is routed, the higher $\mathrm{score}(n_i)$ is. The scores are normalized so that all normalized scores lie between 0.05 and 0.95, with the best net scored 0.05 and the worst 0.95. The following formula is used:

$$\mathrm{normalized\_score}(n_i) = 0.05 + 0.9 \cdot \frac{\mathrm{score}(n_i) - \mathit{lowest\_score}}{\mathit{highest\_score} - \mathit{lowest\_score}}.$$

To determine whether a net should be ripped up, a (pseudo) random number between 0 and 1 is generated and compared with the net's normalized score. If the random number is smaller, the net is ripped up. The ripped-up nets are then rerouted one at a time, starting with the worst one. The rerouting uses the same multiple-component growth algorithm as the initial router, except that the presence of already routed nets is no longer ignored. This is accomplished by taking into account, in addition to the distance, the conflicts over the usage of interconnect resources when calculating the expansion cost. That is, an expansion into a vertex occupied by some other net is possible only at a very high cost. In our present implementation of the algorithm, the cost of expanding into an occupied vertex is 10 times that of expanding into an empty one. Expanding into an occupied vertex does not resolve the violation. Instead, we hope that the occupying nets will be ripped up and rerouted during later iterations.
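The scoring, normalization, and random selection described above can be sketched in Python as follows. The function names, the default value of `alpha`, and the handling of the case where all nets have the same score are assumptions made for this illustration; the text does not specify them.

```python
import random

OCCUPIED_COST_FACTOR = 10  # from the text: an occupied vertex costs 10x an empty one


def raw_score(actual_length, estimated_min_length, num_violations, alpha=0.5):
    """Weighted sum of relative wire length and violation count (alpha is assumed)."""
    return alpha * actual_length / estimated_min_length + (1 - alpha) * num_violations


def normalize(scores):
    """Map raw scores linearly onto [0.05, 0.95]; best net 0.05, worst 0.95."""
    lowest, highest = min(scores.values()), max(scores.values())
    if highest == lowest:
        # Assumption: the text does not cover ties, so use the midpoint.
        return {net: 0.5 for net in scores}
    return {
        net: 0.05 + 0.9 * (s - lowest) / (highest - lowest)
        for net, s in scores.items()
    }


def select_for_ripup(normalized, rng=random):
    """Rip up a net when a uniform random number falls below its normalized score.

    The result is ordered worst net first, the order in which nets are rerouted.
    """
    chosen = [net for net, s in normalized.items() if rng.random() < s]
    return sorted(chosen, key=lambda net: normalized[net], reverse=True)


def expansion_cost(base_cost, occupied):
    """Cost of expanding into a vertex during rerouting."""
    return base_cost * OCCUPIED_COST_FACTOR if occupied else base_cost
```

Because the normalized scores never reach 0 or 1, even the best net can occasionally be ripped up and even the worst net can occasionally survive an iteration, which is what lets the search escape local optima.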
There is no guarantee that all violations will be resolved. To prevent TRACER-fpga from looping indefinitely, we set a limit on the CPU time using the time-out function. If the routing is still infeasible after a user-specified period of time, TRACER-fpga reports a FAILURE and exits.
After a feasible solution is found, the post-optimizer tries to find a better solution with a shorter total wire length. It sequentially rips up and reroutes one net at a time while maintaining the feasibility of the routing. This process is repeated until no further improvement can be obtained. Note that the number of tracks used will not be reduced.
V. EXPERIMENTAL RESULTS
We have implemented TRACER-fpga in the C programming language on a SUN Sparc2 workstation. To test TRACER-fpga, we have obtained from Professor Stephen Brown of the University of Toronto the CGE/SEGA program and two suites of benchmark circuits. We added an interface routine to CGE so that it can build the graph and netlist files for TRACER-fpga, call TRACER-fpga, and read and display TRACER-fpga's routing results.

TABLE I. ROUTING RESULTS OF THE FIRST SET OF BENCHMARKS
The first suite of benchmarks was used in the paper describing CGE [2]. The targeted routing architecture is similar to that of the Xilinx XC3000 series FPGA's. That is, Fs is 6 and Fc is 0.6W [2], where W is the number of tracks (wiring segments) in each channel. Fs denotes the flexibility of a switch matrix (S block) and is defined as the number of connections available to each wiring segment entering the S block. Similarly, Fc denotes the flexibility of the connection box (C block) and is defined as the number of tracks to which each CLB (L block) pin can connect. For each benchmark circuit, TRACER-fpga tried several values of W, starting with the channel density determined by the global routing of CGE. If it succeeded with a particular W, it reduced W by 1 and tried again; otherwise, it stopped.
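The track-reduction procedure just described amounts to a simple downward search on W. The Python sketch below illustrates it; the name `minimum_tracks` and the callable `route`, which stands in for one complete routing attempt with W tracks per channel, are hypothetical.

```python
def minimum_tracks(route, start_w):
    """Decrease W from start_w while routing succeeds; return the last success.

    route(w) is a hypothetical callable returning True when the circuit can be
    routed with w tracks per channel. Returns None if routing fails at start_w.
    """
    best = None
    w = start_w
    while w >= 1 and route(w):
        best = w
        w -= 1
    return best
```

For example, `minimum_tracks(lambda w: w >= 7, 10)` returns 7 for a stand-in router that succeeds only with seven or more tracks.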
The routing results for the first set of benchmarks are summarized in Table I. The channel density is calculated by the global router of CGE and is a lower bound on the number of tracks needed by CGE. The table shows that TRACER-fpga needs significantly fewer tracks in each channel to complete the routing. TRACER-fpga was able to do better than CGE's lower bound because it does not decompose multiple-terminal nets and does not constrain the routing length of a net. Also shown in the table is the CPU time for each of the benchmark circuits. For the largest benchmark (i.e., Z03), with 586 CLB's, the routing is done in about 10 minutes, which is acceptable in practice.
The second set of benchmarks was used in the paper describing SEGA [3]. The routing architecture is similar to that of the Xilinx XC4000 series FPGA's. That is, the wiring segments can be of variable length. Here we set Fs = 3 and Fc = W, the same values as in [3]. The routing results are summarized in Table II. Again, TRACER-fpga needs considerably fewer tracks than SEGA. Also shown in the table is the maximum net delay, calculated using the delay model of SEGA. It is clear that TRACER-fpga's wiring delay is much longer than SEGA's. A detailed net-by-net comparison reveals that the long delay was caused by the detouring of a few nets in each benchmark.
VI. CONCLUSION
We have proposed a routing method for the design of RAM-based FPGA's. We model the interconnect resources of an FPGA as a graph. To route a design, we find a set of disjoint trees, one for each net. Routing of an individual net is done with an expansion router on the graph. Conflicts over the usage of interconnect resources are resolved by a rip-up-and-reroute strategy, which is guided by the simulated evolution heuristic. TRACER-fpga was able to achieve very dense routing for two suites of benchmark circuits using a reasonable amount of CPU time. The long wiring delay makes TRACER-fpga unsuitable for high-performance design. However, the track efficiency makes it suitable for low-speed applications such as hardware emulation, where the clock rate ranges from 1 to 10 MHz.
The largest circuit in our benchmark suites has 586 CLB's. As the technology advances, the number of CLB's in a circuit will increase. TRACER-fpga will slow down as the routing graph grows. It is important to improve the programming and data structuring efficiency so that TRACER-fpga can keep up with the technology progress. Furthermore, to improve the wiring delay of TRACER-fpga, we have to integrate the delay model into the expansion router. That is, instead of searching for the shortest connection, it should look for the fastest connection.
