Technology mapping is the process of taking a fine-grain network describing a multiple-output logic function, and covering it with cells to get a network that is legal in a given technology. The goals of the mapping are to produce small, fast, and testable circuits. This paper introduces Xtmap, a new technology mapper for f-input 
an output, and generates many ways for that output to be implemented in a single cell, covering some portion of the fine-grain network. Each of the candidate implementations is evaluated (tested) using a simple heuristic, and the best one chosen. The inputs of the new cell then are added to the set of functions that need to be mapped. Figure 1 gives simplified pseudocode for the algorithm. In the actual implementation, mappings are chosen recursively for the inputs of the new cell, and the ToMap set is not explicitly represented.
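The generate-and-test loop described above can be sketched roughly as follows. This is only an illustration with an explicit ToMap list, not the pseudocode of Figure 1: the names map_network, generate_cells, and cell_value are hypothetical, a cell is represented simply by the set of its input nodes, and the toy generator and value function at the bottom are stand-ins for the real ones.

```python
def map_network(principal_outputs, principal_inputs, generate_cells, cell_value):
    """Greedy generate-and-test mapping (illustrative sketch).

    generate_cells(node) -> iterable of candidate cells (frozensets of input nodes)
    cell_value(node, cell, mapped) -> number; larger is better
    Returns a dict: node -> chosen cell.
    """
    mapped = {}
    to_map = list(principal_outputs)          # explicit ToMap set
    while to_map:
        node = to_map.pop()
        if node in mapped or node in principal_inputs:
            continue                          # nothing left to do for this node
        # Generate candidates, test each one, keep the best.
        best = max(generate_cells(node),
                   key=lambda cell: cell_value(node, cell, mapped))
        mapped[node] = best
        to_map.extend(best)                   # inputs of the new cell still need mapping
    return mapped


# Toy network (assumed for illustration): y = f(g1, g2), g1 = f(a, b), g2 = f(b, c).
fanin = {"y": ["g1", "g2"], "g1": ["a", "b"], "g2": ["b", "c"]}
pis = {"a", "b", "c"}

def toy_generator(node, f=4):
    """Two candidates: the direct inputs, and the inputs expanded one level."""
    direct = frozenset(fanin[node])
    expanded = frozenset(x for i in direct for x in fanin.get(i, [i]))
    return [direct] + ([expanded] if len(expanded) <= f else [])

def toy_value(node, cell, mapped):
    """Penalize each input that would still have to be mapped."""
    return -sum(1 for i in cell if i not in pis and i not in mapped)

result = map_network(["y"], pis, toy_generator, toy_value)
```

In this toy case the single cell with inputs a, b, and c wins, because it leaves no further nodes to map.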
To complete the description of a generate-and-test mapper, three subroutines have to be described:
• the method for choosing which node of the fine-grain network to map next (Section 2.1),
• the heuristics for determining the value of a cell (Section 2.2), and
• the generator for single cells (Section 2.3).
2.1 Choosing the next node to map
Xtmap has a top-level loop that iterates over the principal outputs, and uses a recursive call to the mapper for each input to a newly created cell. Because the ToMap set is implicitly represented on the recursion stack, it need not be explicitly represented. No attempt is made to choose the order in which the principal outputs or the inputs to a new cell are mapped.
2.2 Value of an FPGA cell
The basic assumption of Xtmap, as with any greedy mapper, is that a series of locally optimal choices (with respect to the heuristic functions) will result in a globally good solution. The heuristic functions need to capture any interaction between cells that is likely to be important in the final solution.
One of the biggest advantages in mapping to FPGAs, rather than to cell library technologies, is that all cells cost the same, and so we only need to look at how well we cover the fine-grain network.
One key simplification of the generate-and-test paradigm is that only the generator needs to know about the function of a cell; the function itself does not matter in determining the value of the cell. Instead, the cell can be evaluated only on the basis of which nodes in the fine-grain network are inputs to the cell, and which are hidden, that is, which nodes are covered by the cell but are neither inputs nor outputs of the cell.
The value of a cell is computed as the sum of weight functions computed on the inputs and hidden set, as shown in Figure 2.
One would expect it to be more useful to use already mapped nodes in the input set, than to have to map new nodes, particularly new ones with low fanout.
This means that we would expect to set the weights to meet the constraints a1 > a2, a2 < 0, and a3 < 0. Fanout appears in a denominator because the difference between a fanout of 1 and 2 is much more significant than the difference between 11 and 12. A negative value for a3 therefore gives a fairly large penalty to a mapping with low-fanout inputs, and a much smaller penalty to one with high-fanout inputs. Furthermore, a mapping for a cell is generally better if the portion of the network remaining to be mapped is smaller. If we use the sum of the area estimates for the inputs as an estimate of how much remains to be mapped, we would expect a4 < 0.
There seems to be little advantage to using up part of a cell to re-implement already mapped nodes, and so we would expect to have b1 < 0. For hidden nodes, one would expect it to be best to hide nodes that haven't been implemented, particularly if they have low fanout.
Thus one would expect to have b2 > 0 and b3 > 0.
To minimize delay in the circuit, the Delayweight should be set to a negative number. The expected constraints on the weights were usually met by sets of parameters found by the learning algorithms described in Section 3, but all were violated by at least one of the weight sets found by the random neighborhood search learning technique. The values used in the results section (a1 = 0, a2 = -30, a3 = -14, a4 = -11, b1 = 1, b2 = 5, b3 = 5, Delayweight = -256) were found by learning on one network, and discovered to work reasonably well on many networks. The only expected constraint that is violated is that b1 is positive, instead of negative. No attempt has been made to see if the results are improved by reducing b1.
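As a rough illustration of how such a weighted sum might be evaluated, the sketch below uses one plausible form that is consistent with the constraints discussed above. The exact formula of Figure 2 is not reproduced here, so the way the terms are combined is an assumption: a1 is charged per already mapped input, a2, a3/fanout, and a4 times the area estimate per unmapped input, b1 per already mapped hidden node, b2 and b3/fanout per unmapped hidden node, and Delayweight times the latest arrival estimate among the inputs. The names Weights and cell_value and the dictionary arguments are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Weights:
    # Defaults are the values quoted in the text for the results section.
    a1: float = 0.0
    a2: float = -30.0
    a3: float = -14.0
    a4: float = -11.0
    b1: float = 1.0
    b2: float = 5.0
    b3: float = 5.0
    delay_weight: float = -256.0


def cell_value(inputs, hidden, mapped, fanout, area_est, arrival, w=Weights()):
    """Assumed form of the cell value; larger is better."""
    value = 0.0
    for n in inputs:
        if n in mapped:
            value += w.a1                       # reusing a mapped node
        else:
            # new node to map: penalize, more so for low fanout and large area
            value += w.a2 + w.a3 / fanout[n] + w.a4 * area_est[n]
    for n in hidden:
        if n in mapped:
            value += w.b1                       # re-implementing a mapped node
        else:
            value += w.b2 + w.b3 / fanout[n]    # hiding an unmapped node
    latest = max((arrival[n] for n in inputs), default=0)
    return value + w.delay_weight * latest


# Small worked example with made-up numbers.
example = cell_value(
    inputs={"p", "q"}, hidden={"h"}, mapped={"p"},
    fanout={"p": 2, "q": 1, "h": 1},
    area_est={"q": 2.0},
    arrival={"p": 1, "q": 0},
)
```

With these made-up numbers the unmapped input q contributes -30 - 14 - 22 = -66, the hidden node h contributes 5 + 5 = 10, and the delay term contributes -256, for a total of -312.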
The quality of the delay estimate, arrival, is important for producing good mappings. At first, the algorithm just used the longest distance from a principal input to the node in the fine-grain network, but this did not prove to be a particularly good predictor of delay, and the circuits produced had larger delays than those produced by Chortle and Dagmap.
An improved delay estimate was obtained by doing an initial mapping of the fine-grain network using Xcmap (a re-implementation of the main algorithm of Dagmap), and recording the lutheight: the height in the mapped circuit. By setting the delay weight to a sufficiently large negative number, the generate-and-test algorithm can be forced to produce circuits whose delay (as measured by lutheight) is as small as that produced by Xcmap, as long as the generator for the node is guaranteed to try the set of inputs used by Xcmap (Xtmap does try that set of inputs, as described in Section 2.3).
2.3 Generator for table-lookup cells
The generator needs to produce a number of feasible single-cell implementations for a node in the fine-grain network. The approach I have taken is to write a simple generator for each different FPGA technology, rather than a general-purpose generator to use with a variety of technologies. However, this is not essential to the concept, and a general-purpose generator could be written, perhaps along the lines of Proserpine [1]. A generator has been written for f-input table-lookup functions, and other generators are being written for other target FPGA technologies.
For table-lookup cells, the generator can be fairly simple, as the function of the cell is irrelevant: all that matters is how many inputs are needed. The generator for Xtmap tries to produce all possible vertex cut sets of size f or less separating the node being mapped from the primary inputs using a simple depth-first search algorithm.
Simplified pseudocode for the generator is given in Figure 3. The recursive procedure takes two parameters: a set of nodes to consider as inputs to the cell and a set of nodes that are not to be expanded in the search. The nodes in the input set are replaced one at a time by their own inputs, until the complete input set gets too large. After a node has been expanded in this way, it is marked so that it will not be expanded again, to avoid repeated enumeration of the same cut sets. The set of hidden nodes (hidset) keeps track of the nodes that have been removed from the input set, for use in determining the value of the cell (see Section 2.2).
The actual algorithm used in Xtmap is slightly more complicated. One simple but important modification is that unnecessary inputs are removed from inset before inset is counted and the cell is evaluated. An input i is considered to be unnecessary if inputs(i) ⊆ inset. This optimization is inefficiently implemented, and accounts for a large part of Xtmap's running time.
Note that not all vertex cut sets with |inset| ≤ f are enumerated by this simple algorithm, as some vertex cut sets may require intermediate cut sets with more than f nodes, and these would be missed.
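The depth-first enumeration with removal of unnecessary inputs might look roughly like the following sketch. It is not the pseudocode of Figure 3 or Figure 4. The names try_expansions, prune_unnecessary, fanin, and found are hypothetical, principal inputs are assumed to be the nodes with no entry in fanin, and the choice to continue the search from the pruned input set is my assumption.

```python
def prune_unnecessary(inset, hidset, fanin):
    """Move input i to the hidden set when all of inputs(i) are already in inset."""
    inset, hidset = set(inset), set(hidset)
    for i in sorted(inset):
        ins = fanin.get(i)                    # None for principal inputs
        if ins and set(ins) <= inset - {i}:
            inset.discard(i)
            hidset.add(i)
    return inset, hidset - inset              # a node cannot be both input and hidden


def try_expansions(inset, noexpand, hidset, fanin, f, found):
    """Record every cut of size <= f reachable by expanding one input at a time."""
    inset, hidset = prune_unnecessary(inset, hidset, fanin)
    if len(inset) > f:
        return                                # too large: abandon this branch
    found.add((frozenset(inset), frozenset(hidset)))
    noexpand = set(noexpand)                  # marks persist for later siblings
    for node in sorted(inset):
        if node in noexpand or node not in fanin:
            continue
        noexpand.add(node)                    # never expand this node again below here
        new_inset = (inset - {node}) | set(fanin[node])
        try_expansions(new_inset, noexpand, hidset | {node}, fanin, f, found)


# Toy network (assumed for illustration): y = f(g1, g2), g1 = f(a, b), g2 = f(b, c).
fanin = {"y": ["g1", "g2"], "g1": ["a", "b"], "g2": ["b", "c"]}
found = set()
try_expansions(set(fanin["y"]), set(), set(), fanin, 3, found)
cuts = {cut for cut, _hidden in found}
```

For this toy network with f = 3, the search records four cuts for y: {g1, g2}, {a, b, g2}, {b, c, g1}, and {a, b, c}.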
Another, more important improvement can be made to the generator by adding an alternative definition of the inputs of a node. Both the Xmap and Xcmap greedy mappers compute legal (|inset| ≤ f) vertex cut sets for every node in the fine-grain network. These cut sets can be used as alternatives to inputs(i) both in the TryExpansions algorithm and in defining the unnecessary nodes of inset. Updating the set of hidden nodes requires a little more work than before, as all the nodes between i and the stored vertex cut are hidden, not just i.
Even with the improvements, the algorithm misses some f-cuts, and these missed cuts are often the most interesting ones for high-quality mapping, since they are generally lower in the network than the cuts that are enumerated. The algorithm could be improved by using the techniques of FlowMap [4] or Rmap [18] to ensure that all f-cuts are enumerated.
The pseudocode for the version of TryExpansions used to produce numbers for the benchmarks is shown in Figure 4. The function lutinputs(i) looks up a legal cut set for i found by the algorithm for Xcmap (unless Xmap found a smaller delay for the node, in which case the cut set found by Xmap is used).
Because the Xcmap (Dagmap) algorithm guarantees minimum delay implementation for tree circuits and for nodes that can be implemented in a single cell directly from the principal inputs, adding the lutinputs expansions causes the generator to catch many of the interesting possible cells that it otherwise would have missed.
3 Learning the weights for the heuristics
My original hope was that a good set of weights could be chosen and fixed for all circuits, but I was unsuccessful in finding a good set of weights manually, and so I implemented learning algorithms to try to find good weights. Two different learning algorithms have been implemented: random neighborhood search and learning from an existing mapping.
The random neighborhood search algorithm is slow. After tuning the weights on a few circuits, I added a further feature that kept a list of good settings for weights. For each new circuit, all the settings already recorded were tried first, and then some number of random searches were made, adding a new setting to the list if a better one was found by the random search. I had hoped that pruning the list would leave me with a few good settings that would work for many circuits.
Unfortunately, running 45 benchmark circuits resulted in 25 different weight settings, each of which was better than any of the others for at least one circuit.
The second learning approach was based on a very different philosophy. Instead of blindly searching for a setting by looking at the results of the mapping, I tried to derive a setting from a good mapping. I started with an area-efficient mapping from Xmap, then ran the cell-generation procedure for each node that was used in that mapping. Instead of evaluating each potential cell, it was compared with the cell found by Xmap. Weights were increased or decreased so as to favor the cell found by Xmap over the candidate one. For example, if the generated cell had more hidden nodes that were already mapped, then b1 would be decreased, but if it had fewer, b1 would be increased.
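One way to picture this kind of update is the sketch below, which nudges each weight by a fixed step according to whether the reference cell (the one from Xmap) has more or less of the corresponding feature than the generated candidate. The fixed step size, the dictionary representation, and the name update_weights are assumptions made for illustration; the text does not say how large the adjustments were or whether they depended on the size of the difference.

```python
def update_weights(weights, ref_features, cand_features, step=1.0):
    """Shift each weight so that the reference cell scores better than the candidate.

    weights, ref_features, cand_features: dicts keyed by weight name (e.g. "b1"),
    where a feature is the quantity that the weight multiplies.
    """
    new = dict(weights)
    for name in weights:
        diff = ref_features[name] - cand_features[name]
        if diff > 0:
            new[name] += step        # reference has more of this feature: reward it
        elif diff < 0:
            new[name] -= step        # candidate has more of this feature: penalize it
    return new


# The candidate hides 3 already mapped nodes, the reference cell only 1,
# so b1 is decreased; b2 is unchanged because the counts agree.
updated = update_weights(
    {"b1": 0.0, "b2": 0.0},
    ref_features={"b1": 1, "b2": 2},
    cand_features={"b1": 3, "b2": 2},
)
```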
Running Xtmap with the parameters learned in this way produced circuits similar to those found by Xmap (small, but not fast), but was, of course, much more expensive to run. Similarly, running Xtmap with the delay weight set to a large negative number duplicated the mapping found by Xcmap (fast, but not small). The advantage of Xtmap is that the delay weight can be set to intermediate values, getting circuits that are both small and fast.
In order to create repeatable results, the benchmark results in this paper did not use learning. Instead, a single parameter set was used, and the penalty for delay was increased by factors of two until the unit delay from Xtmap was as small as that from the Xcmap mapping (see Table 1). This single parameter set seems to produce results very similar to (but slightly better than) doing the deterministic learning procedure for each benchmark. Based on earlier benchmark runs, following this with 10 repetitions of random search would probably have improved the results slightly, but increased the running time substantially.
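The doubling of the delay penalty can be pictured with the following sketch. The starting weight of -1, the safety limit, and the names tune_delay_weight and run_mapper are assumptions for illustration; run_mapper stands for a full Xtmap run that returns the unit delay of the resulting circuit.

```python
def tune_delay_weight(run_mapper, target_delay, start=-1.0, limit=-2.0 ** 20):
    """Double the (negative) delay weight until the mapped delay reaches the target.

    run_mapper(delay_weight) -> unit delay of the mapping produced with that weight.
    Returns the final weight and the delay it achieved.
    """
    weight = start
    while True:
        delay = run_mapper(weight)
        if delay <= target_delay or weight <= limit:
            return weight, delay
        weight *= 2                  # increase the penalty by a factor of two


# Stub mapper (made up): reaches the target delay of 3 once the penalty is large enough.
def stub_mapper(delay_weight):
    return 3 if delay_weight <= -256 else 5

final_weight, final_delay = tune_delay_weight(stub_mapper, target_delay=3)
```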
One interesting observation was that running Xtmap twice for the des benchmark with the same parameters produced different results. The explanation is that the arrival time estimate lutheight is not just the result of running the Xcmap algorithm, but is updated after every mapping, so that it represents the lowest known height. For some nodes in des, Xtmap finds a better mapping, and updating the lutheight for these nodes results in an improved mapping when Xtmap is re-run.
Table 1 gives some benchmark results for some of the recently published technology mappers, all being run on identical networks. Xtmap was run with the parameters a1 = 0, a2 = -30, a3 = -14, a4 = -11, b1 = 1, b2 = 5, b3 = 5, Delayweight = -256. For each mapper, the number of 5-input lookup tables and the unit delay are given. Although Xtmap appears to be about as good as the best previously published mapper (FlowMap), the unpublished results from FlowMap-r seem to be better.
Conclusions and Future Work
For delay minimization, the best previous results were reported for FlowMap [4], and for area minimization, for mis-pga [15], but mis-pga gets much of its area optimization from high-level optimization and decomposition, rather than from mapping per se. Although the Xcmap (Dagmap) algorithm produces optimal delay for trees, on one highly reconvergent example (not in the table) Xmap produces a circuit with fewer levels of lookup tables. FlowMap produces optimal delay on arbitrary networks.
Xtmap usually gets delay values as good as FlowMap, and with about the same area penalty.
The CPU time reported in Table 1 is for reading the BLIF or EQN input file and doing all three mappings on a Sparcstation SLC. The slowest operation is Xtmap itself, which is much slower than simple greedy mappers, primarily because of the cost of evaluating all the potential cells that are not used. Xtmap is significantly slower than Xcmap or Xmap, but not as slow as chortle-d. (I do not have timing information for FlowMap, and so cannot compare the execution times of Xtmap and FlowMap.) Improving Xtmap's cut-enumeration algorithm should improve both Xtmap's speed and performance. Improving the arrival time estimates (perhaps using FlowMap to compute lutheight) would also improve performance.
I believe that technology mapping efforts should focus on producing faster mappers, rather than better mappers, because it is probably more useful to spend the CPU time on improving higher-level optimization.
To illustrate this, some results are reported for techniques that apply high-level optimization before (or during) mapping. Table 2 reports results for the same benchmarks as Table 1. The CPU time reported in Table 2 does not include the optimization time, just the mapping time. The three "X" mappers were run on networks created by running two iterations of the ITEM script shown in Figure 5 on the benchmark files used for the FlowMap results, which is a better starting point than that used for the comparisons in Table 1. The ITEM optimization script does not have any notion of critical paths, and attempts to reduce delay on all outputs, even non-critical ones. An optimization technique that concentrated on area reduction off the critical paths, and delay reduction near the critical path, would probably produce better area-delay products. Even so, this technique did produce better mappings for several benchmarks than any previously reported results.
Although simple greedy algorithms have been getting fairly good results, other paradigms for technology mapping remain to be examined. I have a student now investigating using ratio-cut partitioning to map to f-input lookup tables, so that more global information can be used in deciding which nodes in the fine-grain network are worth implementing as cell outputs. We hope to get more area-efficient mappings with this technique.
Mapping for routability
Most mapper research has focussed on area or delay minimization, but on the Xilinx chips, the routing resources are usually the limiting factor on any design. There has been some previous work on mapping for routability (notably Rmap [18]), but the available place-and-route tools have random enough behavior that it is difficult to compare the qualities of different mappers just by placing and routing the circuits. One measure that I believe will be useful is the total pin count (number of connections to cells) of a mapping (after merging into two-output cells).

Figure 5: Optimization script used with ITEM to create the networks mapped by the "X" mappers in Table 2. Two iterations of this optimization script were run on each network. The block-covering technique is described in [19], and a rough description of local factor is in [13], but the variable ordering heuristic has not been published yet [20].

and Xtmap results are from optimizing the same starting points as used for the FlowMap results using an ITEM script that optimizes for the product of edge count and lutheight. Results for des are for running the optimization script on the same starting point as Table 1, because the minimizer ran out of memory minimizing on the DMIG-optimized network. Results for C499 are from a less expensive script that did not require canonical forms, because the script of Figure 5 was unable to complete on C499. The CPU time column (seconds on a Sparcstation SLC) counts only the mapping, not the optimization, which took anywhere from seconds to 2 hours. FlowMap and mis-pga were run with different starting points, and so the quality of the optimization is really being compared, rather than the quality of the mappers. Italics are used to mark the best area-delay product in each row. Note that for rot the best result in Table 1 is better. The result for count can be improved to 39:3 and C880 to 145:7 by running the optimization script on the original benchmark network, rather than an already optimized and decomposed one; apex7 to 100:3 by optimizing the network used in Table 1; and vg2 to 28:3 by using the cheap optimization script used for C499, rather than the expensive one.
