A new algorithm for technology mapping of Lookup Table-based Field-Programmable Gate Arrays is presented. It can produce slightly more compact designs than some existing mappers and, more significantly, offers the flexibility of trading routability against compactness of a design. We have implemented the algorithm in the Rmap program and compared its routability with that of two other mappers. Rmap can produce mappings with better routability characteristics and, more significantly, Rmap produces routable mappings when other mappers do not.
Introduction
Field-Programmable Gate Arrays (FPGAs) are integrated circuits consisting of arrays of gates that can be configured, and reconfigured, by the system designer through software, rather than by the chip manufacturer in the fabrication line. The success of FPGA technologies is supported by a set of CAD tools to aid the design process. The two final steps in the implementation of a prototype on an FPGA are 1) its mapping to a network of logic cells (technology mapping or logic partitioning), and 2) the assignment of the network cells to physical cells on the device and the configuration of routing structures to interconnect them (placement and routing).
RAM-based reprogrammable FPGAs such as the Xilinx LCA use a Lookup
We present a technology mapper for LUT-based cell arrays, Rmap, which can balance cell utilization with the goal of producing routable mappings by dilating a design. This provides the designer with the flexibility of trading number of CLBs for routability. Tools which use logic and routing resources effectively admit architectures with proportionally more area devoted to logic cells.
To analyze the routability of mapped designs we place and route them on the Xilinx XC3000 LCA series.
Routability is a function of the FPGA architecture and the place-and-route tools as well as the design. The XC3000 has a channel-based routing architecture and uses a simulated annealing place-and-route tool, both of which are typical of gate array architectures and tools in use today.
LUT technology mapping
The major activities of logic synthesis for LUT-based FPGAs are:
Logic minimization: Technology independent multilevel logic minimization alters the network to reduce the literal count, while technology dependent minimization techniques take into account the number of inputs of the LUTs in simplifying and rearranging the network.
Node decomposition/splitting: Node-splitting (or decomposition) can be part of the logic optimization and/or the covering phase (below) where the effects of the various alternatives can be better evaluated. In an AND/OR network, the children of a node can always be regrouped arbitrarily into subtrees.
LUT covering: This phase can be viewed as a graph theoretic problem in which a covering of the network with LUTs is constructed. The final set of LUTs must implement the primary outputs of the network.
Cell packing: To increase cell utilization, some architectures provide cells whose lookup tables can be sub-divided to implement multiple output functions. Cell packing consists of combining the LUTs generated by the covering phase into groups which can be accommodated in a single cell. The XC3000 family accommodates one
All of these tools (with the exception of hydra and possibly xnfmap) have the common trait of considering cell packing separately from covering. The approach to mapping taken in Rmap is a greedy one with respect to covering the network as quickly as possible with packed LUTs. All feasible LUTs are considered and the largest ones which can be paired together are selected.
Routability in gate arrays
Routability in gate arrays has been analyzed using two-dimensional stochastic models [11, 12]. El Gamal estimates the expected number of tracks required in each channel at about half the product of the average number of pins per cell and the average wire length. The XC3000 series violates this model by not providing full switch matrix connectivity as well as by providing long lines (heterogeneous routing resources). However, the model does support the following two observations about the XC3000 series:
1) Routing larger designs and packages is more difficult since the average wire length is expected to increase while the routing resources (channel width) remain constant.
2) The pins-to-cell ratio affects the routability since the average number of pins per cell varies while the channel width remains constant.
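The estimate quoted above can be written as a one-line calculation. The following is a minimal sketch of that rule of thumb only; the function name and the sample values are illustrative assumptions and are not taken from the cited model or from the XC3000 data.

```python
def estimated_tracks_per_channel(avg_pins_per_cell: float, avg_wire_length: float) -> float:
    """Rule of thumb from the text: about half the product of the average
    number of pins per cell and the average wire length."""
    return 0.5 * avg_pins_per_cell * avg_wire_length


# Hypothetical values: with a fixed channel width, raising either factor
# raises the estimated track demand.
baseline = estimated_tracks_per_channel(avg_pins_per_cell=4.0, avg_wire_length=3.0)
longer_wires = estimated_tracks_per_channel(avg_pins_per_cell=4.0, avg_wire_length=4.0)
```

Both observations follow directly: the estimate grows with wire length (larger designs) and with pins per cell, while the channel width available on the device does not.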
Based on practical experience and supported by El Gamal's results, two factors which affect the routability of a mapped design can be identified:
Ratio of pins to cells:
This ratio is a measure of the amount of traffic into, out of, and around a cell.
Ratio of pins to nets:
This ratio is a measure of the amount of traffic on the chip. This is more of a global measure, whereas the pins to cells ratio is a local measure.
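As an illustration, both ratios can be computed from a mapped netlist. This is a sketch under stated assumptions: the netlist representation (a dictionary from cell name to its input and output nets), the function name, and the sample cells are hypothetical, and the text does not specify exactly which pins and nets the measurements include.

```python
def routability_ratios(cells):
    """Return (pins per cell, pins per net) for a mapped design.

    cells maps a cell name to a pair (input_nets, output_nets).
    Assumption: every cell pin counts once and every distinct signal is a net.
    """
    pins = sum(len(ins) + len(outs) for ins, outs in cells.values())
    nets = set()
    for ins, outs in cells.values():
        nets |= set(ins) | set(outs)
    return pins / len(cells), pins / len(nets)


# Hypothetical two-cell mapping; net "c" and net "n1" each reach two pins.
example = {
    "clb0": ({"a", "b", "c"}, {"n1"}),
    "clb1": ({"n1", "c", "d"}, {"n2", "n3"}),
}
pins_per_cell, pins_per_net = routability_ratios(example)
```

Here the design has 9 pins on 2 cells and 7 distinct nets, so the local measure is 4.5 pins per cell and the global measure is 9/7 pins per net.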
These observations lead to the following goals for a mapper: 1) minimize the number of cells, 2) reduce net fanout by creating LUTs which cover multiple occurrences of the same input, 3) reduce net fanout by pairing LUTs which share inputs, and 4) reduce the number of pins per CLB. The first three goals are compatible; however, the fourth is in conflict with both the goal of minimizing the number of cells and the goal of maximizing LUT pairing (two output pins will be used rather than one). Also, attempting to minimize the number of cells may favor 5-input LUTs, which conflicts with the goal of reducing net fanout by pairing multiple LUTs within single cells.
The Rmap algorithm
The input of the LUT technology mapping problem is an acyclic directed Boolean network whose sinks are the primary outputs of the circuit and whose roots are the primary inputs. Each node of the graph has a Boolean function associated with it. We assume that technology independent logic minimization has been performed on the network and that the nodes of the network have been decomposed so that the only Boolean functions represented in the network are ANDs and ORs, possibly with negated inputs: an AND/OR network.
A set of nodes which induces a subgraph of the network having only one sink (a node with no fanout) can be implemented by a LUT. We call such a set of nodes a block and associate it with its sink node (output). The dependency set of a block is the set of nodes outside the block which have edges directed into the block. A block is feasible if the size of its dependency set does not exceed the maximum lookup-table width (e.g., k = 5 in the XC3000 series).
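These definitions translate directly into set operations. The sketch below assumes a network stored as a dictionary from each node to its list of fanin nodes; that representation, the function names, and the small sample network are illustrative assumptions (the sample is not the network of Figure 1). The single-sink condition is not checked here.

```python
K = 5  # maximum lookup-table width, e.g. k = 5 in the XC3000 series


def dependency_set(fanin, block):
    """Nodes outside `block` that have an edge directed into `block`."""
    return {src for node in block for src in fanin.get(node, ()) if src not in block}


def is_feasible(fanin, block, k=K):
    """A block is feasible if its dependency set fits in one k-input LUT."""
    return len(dependency_set(fanin, block)) <= k


# Hypothetical AND/OR network: node -> fanin nodes (a, b, c, d are primary inputs).
fanin = {"y": ["a", "b"], "z": ["y", "c"], "w": ["y", "z", "d"]}
deps = dependency_set(fanin, {"w", "y", "z"})
```

For the block {w, y, z} the dependency set is {a, b, c, d}, so the block is feasible for k = 5 but would not be for a 3-input LUT.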
In Figure 1, the block {w, y, U} has dependency set
The output of the mapper consists of a covering of the network by feasible blocks and an assignment of the blocks to cells. A covering of the network is a collection of feasible blocks which satisfies the following constraints:
1. each primary output is the output of some block,
2. no primary input is in any block, and
3. each non-primary input node is the output of some block, or it belongs to every block containing one or more of its parents.
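The three constraints can be checked mechanically. The following sketch assumes the same hypothetical representation as before (a dictionary from node to fanin list, with primary inputs being the nodes that never appear as keys) and a covering given as a dictionary from block output to the set of nodes in the block. It does not test feasibility of the individual blocks.

```python
def is_covering(fanin, primary_outputs, blocks):
    """Check the three covering constraints for blocks: output -> set of nodes."""
    primary_inputs = {s for srcs in fanin.values() for s in srcs} - set(fanin)
    # 1. Each primary output is the output of some block.
    if not set(primary_outputs) <= set(blocks):
        return False
    # 2. No primary input is in any block.
    if any(block & primary_inputs for block in blocks.values()):
        return False
    # 3. Each other node is a block output, or lies in every block that
    #    contains one or more of its parents (the nodes it feeds).
    for node in fanin:
        if node in blocks:
            continue
        parents = {p for p, srcs in fanin.items() if node in srcs}
        for block in blocks.values():
            if block & parents and node not in block:
                return False
    return True


# Hypothetical network with primary inputs a, b, c and primary output w.
fanin = {"y": ["a", "b"], "z": ["y", "c"], "w": ["y", "z"]}
one_lut = is_covering(fanin, {"w"}, {"w": {"w", "y", "z"}})
missing_y = is_covering(fanin, {"w"}, {"w": {"w", "z"}})
two_luts = is_covering(fanin, {"w"}, {"w": {"w", "z"}, "y": {"y"}})
```

The second candidate fails constraint 3 because y is neither a block output nor a member of the block that contains its parents z and w; adding a separate block for y repairs it.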
The three main steps in Rmap's covering and packing algorithm are described below.
Create all feasible blocks
For each node x we find the dependency sets for all possible feasible blocks for which x is the output. These sets (and their blocks) are obtained from the dependency sets of x's children in a bottom-up traversal of the network. Blocks of x's children are combined to form blocks for x, and any of these new blocks which contains a node in its own dependency set is discarded. This is the case for node w in Figure 1 when combining the blocks for its two children with dependency sets {x, c, d} and {y, v}, respectively. The resulting block would depend on y and contain y. We discard this block rather than repair it, because the repaired block will always be generated from another pair of blocks. In this case, combining blocks for the dependency sets {x, c, d} and {x, c, d, v} yields the block for dependency set {x, c, d, v} with nodes y, z, and w. Figure 2 gives all of the dependency sets for the nodes of the network in Figure 1.
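The bottom-up enumeration can be sketched as follows. This is an illustrative reading of the step, not the Rmap implementation: the data representation, the function name, and the sample network (which is not the network of Figure 1) are assumptions. For each child, the sketch either leaves the child outside the block, making it a dependency, or absorbs one of the child's own blocks; it is exponential in the worst case and makes no attempt at pruning beyond the two tests described in the text.

```python
from itertools import product


def feasible_blocks(fanin, order, k=5):
    """Return {node: set of (block, deps)} where both members are frozensets.

    fanin: node -> list of fanin nodes; primary inputs are not keys.
    order: the non-input nodes in topological order (children first).
    """
    blocks = {}
    for x in order:
        choices = []
        for child in fanin[x]:
            # Option 1: the child stays outside and becomes a dependency.
            outside = (frozenset(), frozenset([child]))
            # Option 2: absorb any block already found for the child.
            choices.append([outside] + list(blocks.get(child, ())))
        found = set()
        for combo in product(*choices):
            block = frozenset([x]).union(*(b for b, _ in combo))
            deps = frozenset().union(*(d for _, d in combo))
            # Discard blocks containing a node of their own dependency set,
            # and blocks whose dependency set is wider than one LUT.
            if block & deps or len(deps) > k:
                continue
            found.add((block, deps))
        blocks[x] = found
    return blocks


# Hypothetical network: w has children y and z, and z also depends on y.
fanin = {"y": ["a", "b"], "z": ["y", "c"], "w": ["y", "z"]}
blocks = feasible_blocks(fanin, order=["y", "z", "w"])
```

In this sample, combining the block {y} (dependency set {a, b}) with the block {z} (dependency set {y, c}) would yield a block that both contains y and depends on it, so it is discarded; the repaired block {w, y, z} with dependency set {a, b, c} still appears, generated from {y} and {z, y}.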
Create all potential pairings
A potential pairing is created for each pair of blocks b1 and b2, of nodes x1 and x2, if b1 and b2 can be packed together in a single cell and x1 ∉ b2 and x2 ∉ b1. Each block can also be paired with the empty block, which provides the mechanism for packing a single block into a cell. The number of edges covered by a pairing is defined to be the sum of the edges covered by the individual blocks; edges common to both blocks are counted twice.
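A sketch of this step is given below. The packing test is an assumption: it models a cell that holds two functions of at most four inputs each with at most five distinct inputs in total, which is the commonly described XC3000 CLB arrangement but is not stated in the text above. The tuple layout, function names, and sample blocks are likewise hypothetical.

```python
from itertools import combinations


def can_pack(deps1, deps2, max_each=4, max_total=5):
    """Assumed cell rule: two LUTs of <= 4 inputs sharing <= 5 distinct inputs."""
    return (len(deps1) <= max_each and len(deps2) <= max_each
            and len(deps1 | deps2) <= max_total)


def potential_pairings(blocks):
    """blocks: list of (output, nodes, deps). None stands for the empty block."""
    pairings = [(b, None) for b in blocks]  # every block may occupy a cell alone
    for b1, b2 in combinations(blocks, 2):
        (x1, nodes1, deps1), (x2, nodes2, deps2) = b1, b2
        # Neither output may lie inside the other block.
        if x1 not in nodes2 and x2 not in nodes1 and can_pack(deps1, deps2):
            pairings.append((b1, b2))
    return pairings


# Hypothetical blocks for three outputs y, w and u.
sample = [
    ("y", frozenset({"y"}), frozenset({"a", "b"})),
    ("w", frozenset({"w", "z"}), frozenset({"y", "c"})),
    ("u", frozenset({"u"}), frozenset({"c", "d", "e", "f"})),
]
pairings = potential_pairings(sample)
```

Of the three possible two-block combinations, (y, u) is rejected because the cell would need six distinct inputs, so the sample yields three single-block pairings and two two-block pairings.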
Select pairings
From the list of all potential pairings, Rmap uses a greedy method to choose a pair of blocks to pack into one cell. A pairing which maximizes the following cost function:
is chosen. The number of input pins is the size of the union of the dependency sets of the two blocks, and the number of output pins is 1 or 2. The parameters c_I and c_O are varied to affect the pins-to-cell ratio. Ties are resolved by looking ahead to see how the selections affect the remaining available pairings. After selecting the pairing, the network is modified and any invalid blocks (and their pairings) are removed. This is accomplished by a traversal originating at the block output which decrements the outdegree of each block node visited and visits its children only when its outdegree reaches zero. All nodes in the block whose outdegrees reach zero are removed from the network. After committing a pair including the block {w,y} from the network in Figure 1, node w becomes a primary input while node y remains, and nodes x and v become primary outputs.
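The cost function itself is not reproduced in this text, so the following sketch uses an assumed form purely to illustrate how the two parameters could steer the greedy choice: edges covered, minus a penalty of c_I per input pin and c_O per output pin. The linear form, the names, and the candidate values are all hypothetical and should not be read as the Rmap cost function.

```python
def pairing_score(edges_covered, input_pins, output_pins, c_i, c_o):
    """ASSUMED linear cost: reward covered edges, penalize pins used."""
    return edges_covered - c_i * input_pins - c_o * output_pins


def select_pairing(candidates, c_i, c_o):
    """Greedy step: return the candidate with the highest score.

    candidates: list of (name, edges_covered, input_pins, output_pins).
    Tie-breaking by look-ahead, as described in the text, is omitted.
    """
    return max(candidates, key=lambda c: pairing_score(c[1], c[2], c[3], c_i, c_o))


# Hypothetical candidates: a dense two-output cell and a sparse one-output cell.
candidates = [("dense", 8, 5, 2), ("sparse", 6, 3, 1)]
compact_choice = select_pairing(candidates, c_i=0, c_o=0)[0]
routable_choice = select_pairing(candidates, c_i=1, c_o=1)[0]
```

With both parameters at zero the selection favors the pairing that covers the most edges, which tends toward fewer cells; with positive parameters the pairing that uses fewer pins wins, which lowers the pins-to-cell ratio at the cost of compactness.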
Repeatedly selecting pairings (step 3) will produce a covering and cell packing if the initial network can be covered by feasible blocks. However, if the dependency set of every block of a node x has size greater than k, then x might not be included in any feasible block. To ensure that each node can be covered, three techniques described in [16] are first used to structurally decompose nodes with fanin higher than 3. In Rmap, node decomposition is a preprocessing step which attempts to increase the chances of finding pairable blocks. Although potentially large, 
Experimental results
The experiments are conducted using three different mappers on the subset of the standard MCNC combinational circuit benchmarks having numbers of CLBs and I/O pins feasible for realization with the
With c_I = c_O = 0, Rmap produces comparatively lower CLB counts and performs very well in reducing global congestion. However, it has the poorest local congestion grades. With c_I = c_O = 1, Rmap produces comparatively higher CLB counts; however, it scores well in reducing both global and local congestion. Chortle's performance is competitive on CLB counts, but poor on global congestion. To place and route the designs, they are converted to .map format, translated to .lca format by map2lca, which uses a mincut approach to obtain an initial placement, and finally placed and routed by apr. Small designs were placed and routed with the default settings without difficulty. The unrouted designs are marked with asterisks in Table 1.
To measure the routability of the mappers, we ran apr ten times on these four designs to record the average and the minimum numbers of unrouted pins and nets.
Table 2: Average and minimum numbers of unrouted pins (P) and nets (N) after 10 APR runs; alu2 is routed on a 3042PG132, duke2 is routed on a 3042PQ100, misex3 is routed on a 3064PG132 and alu4 is routed on a 3090PC84.
A partially routed design with one or two unrouted nets or pins is considered to be acceptable since the routing can often be completed with additional routing iterations or human intervention. The results in Tables 1 and 2 demonstrate that by simply setting the parameters of the cost function to c_I = c_O = 1 in Rmap, we can trade compactness of a design for routability. A routed mapping for the circuit alu4 was obtained with c_I = 2.2, c_O = 1, despite the discouraging initial result of 99 unrouted pins obtained with c_I = c_O = 0. Several observations can be made from Tables 1 and 2: positive c_I and c_O values deliver less compact but more routable mappings, increasing c_I or c_O generates more routable mappings, and continuing to increase the values of c_I or c_O increases utilization of the devices, but does not impair routability.
