Abstract: Technology mapping based on directed acyclic graph covering suffers from the problem of structural bias: the structure of the mapped netlist depends strongly on the subject graph. In this paper, the authors present a new mapper aimed at mitigating structural bias. It is based on a simplified cut-based Boolean-matching algorithm, and using the speed afforded by this simplification, they explore two ideas to reduce structural bias. The first, called lossless synthesis, leverages recent advances in structure-based combinational-equivalence checking to combine the different networks seen during technology-independent synthesis into a single network with choices in a scalable manner. They show how cut-based mapping extends naturally to handle such networks with choices. The second idea is to combine several library gates into a single gate (called a supergate) in order to make the matching process less local. They show how supergates help address the structural-bias problem and how they fit naturally into the cut-based Boolean-matching scheme. An implementation based on these ideas significantly outperforms state-of-the-art mappers in terms of delay, area, and run-time on academic and industrial benchmarks.
I. INTRODUCTION
THE TASK of technology mapping in standard-cell logic synthesis is to express a given Boolean function as a network of gates chosen from a given standard-cell library to optimize some objective function such as total area or delay. In these general terms, technology mapping is intractable. The problem is usually simplified by first representing the Boolean function as a good initial multilevel network of simple gates called the subject graph. The subject graph is then transformed into a multilevel network of library gates by means of local substitutions. This simplification means that the structure of the subject graph dictates, to a large extent, the structure of the mapped network; this is known as structural bias.
In this paper, we present a new Boolean technology mapper aimed at mitigating the effects of structural bias. At the core of the mapper is a simplified Boolean-matching algorithm that is faster than structural matching and produces better results. With the speed afforded by this matching technique, we propose two complementary techniques to reduce structural bias: lossless synthesis and supergates. Lossless Synthesis: To obtain a good structure for the subject graph, a number of technology-independent synthesis steps are usually performed. An example of this is the SIS script.rugged shown in Fig. 1(a) . Each step in the script is heuristic, and the subject graph produced at the end of the script is not necessarily optimal; indeed, it is possible that an intermediate network is better in some respects than the final network.
We explore the idea of combining these intermediate networks into a single subject graph with choices and using that to derive the mapped netlist. This is shown schematically in Fig. 1(b). Following the work of Lehman et al. [13], the mapping is done optimally over all the networks encoded in the subject graph. The mapper is not constrained to use any one network but can pick and choose the best parts of each network. We call this lossless synthesis since no network seen during the synthesis process is ever lost. By including the initial network in the choice network, we can be sure that the heuristic logic-synthesis operations never make things worse. Furthermore, as in [13], multiple scripts (with possibly different objectives) could be used to accumulate more choices.
The challenging part of this proposal is the efficient creation of the choice network. Toward this end, we propose to leverage recent advances in combinational equivalence checking. State-of-the-art equivalence checkers depend on finding functionally equivalent internal points in the networks being checked in order to reduce the complexity of the decision procedure. By using a combination of random simulation and SAT, it is possible to quickly determine equivalent points in two circuits. These equivalent points provide the choices for mapping in our proposal.
Fig. 2. Simple example illustrating the utility of supergates. Consider the subject graph shown in (a) where each node represents a 2-input AND gate, and a dashed edge indicates the presence of an inverter. The match in (b) is easy to find since both inputs of the XOR are present in the subject graph. In contrast, the match in (c) is hard to find, since the MUX input labeled x is not present in the subject graph. The match in (c) can be found by combining some gates into a supergate. Note that all the inputs of the supergate are still required to be present in the subject graph.
Supergates: A supergate is a single-output combinational network of a few library gates, which is treated as a single library gate by our algorithms. The advantage of doing this may not be immediately obvious: one might expect that if a supergate matches at a node, the conventional matching algorithm would return the same result, except that it would match library gate by library gate rather than all the gates at once. This is not true, and Fig. 2 provides a simple counterexample.
The subject graph is shown in Fig. 2(a) . Conventional mapping would find the mapping consisting of the XOR and AOI gates as shown in Fig. 2(b) but would fail to find the mapping with the MUX as shown in Fig. 2(c) . To see this, observe that one of the inputs of the MUX [labeled x in Fig. 2(c) ] is not present in the subject graph. Consequently, the MUX by itself is not a valid match for f. In contrast, the inputs of the XOR (nets p and ¬d) in Fig. 2(b) are both present in the subject graph, thus making the XOR a valid match at f. Now, if we connect the MUX, the XOR, and the inverter together to form a new gate [in the manner shown in Fig. 2(c) ], it is easily verified that this new gate is a match at f in the subject graph. This new gate is a supergate built from the library gates expressly for the purpose of finding better matches.
This example illustrates the main idea behind supergates. Using bigger gates allows the matching procedure to be less local and, thus, less affected by the structure of the subject graph. Furthermore, as this example illustrates, supergates are useful even with standard-cell libraries that are functionally rich.
The three ideas presented in this paper (the faster matching algorithm, lossless synthesis, and supergates) are logically independent. However, one of the contributions of this paper is to show how these ideas come together naturally in the context of cut-based Boolean mapping.
In Section II, we present related work. Section III has an overview of the mapper and details of the matching algorithm. In Section IV, we present practical techniques for constructing the choice network for lossless synthesis and extend the basic mapping to handle choices. In Section V, we present details of supergate construction. In Section VI, we present the results of experiments designed to determine the run-time/quality tradeoff of the techniques presented in this paper. We also present comparisons with other industrial and academic mappers. We conclude in Section VII by pointing out some limitations of these techniques and suggesting future work.
Finally, we note that the focus of this paper is on mitigating structural bias, that is, on the logical aspects of the technology-mapping problem. Therefore, to simplify exposition, we present the algorithms in the context of a simple gain-based delay model, which does not take capacitive loading into account. Such a model is useful in practice in a flow that uses technology mapping for topology selection and does sizing and buffering iteratively with placement. Furthermore, the techniques in this paper continue to be applicable in a more traditional load-based flow since they are techniques to increase the number of topologies considered during mapping and are orthogonal to the usual techniques of considering loads during technology mapping.
II. RELATED WORK
The literature on technology mapping provides a spectrum of techniques that trade structural bias for computational complexity. The classical structural approaches, such as tree and directed acyclic graph (DAG) covering [8], [12], lie at one end of this spectrum. They have relatively short run time but provide suboptimal results since their mapping choices are completely constrained by the given subject graph. Constructive Boolean approaches [9], [19] lie closer to the other end of the spectrum. Although they do not depend as much on the structure of the subject graph, they are limited by the choice of their (heuristic) decomposition schemes and by their local view, since they decompose one node at a time. These limitations, combined with long run time, make them useful mostly in a resynthesis flow after a mapped network has been obtained by some other means.
The approach described by Lehman et al. [13] lies between these two extremes: A number of different local algebraic decompositions are encoded into the given subject graph as choices. The subject graph itself is constructed by combining the results of different technology-mapping scripts. The ideas relating to lossless synthesis presented in this paper are variations on this basic approach. Beyond the obvious difference of using intermediate networks from one synthesis script, there are significant algorithmic differences. First, the use of structural-equivalence checking to detect choices [as opposed to binary decision diagrams (BDDs)] allows for lossless synthesis on large circuits. Second, the extension of Boolean matching to handle choices presented in this paper overcomes the run-time limitations of the structural-matching methods used by Lehman et al. Furthermore, there is an improvement in quality since matching is done directly on general DAGs, and complex multi-fanout gates are handled more naturally.
Wavefront mapping [21] is a practical enhancement of [13] that maps the circuit in stages, thus reducing the amount of match data that is stored at one time, although at the cost of optimality. An extension of this algorithm allows for dynamic decomposition based on a partially mapped circuit. Compared to the approach of Lehman et al. [13] , it reduces the number of decompositions that need to be explored. The idea of wavefront can be used in conjunction with the techniques in this paper to reduce the amount of match data that needs to be stored in the memory at one time.
In addition, this paper puts forth the idea of combining library gates into supergates to make the matching more Boolean. By squinting a little, one can view supergates as being an alternate source of choices. The choices induced by supergates come from the library rather than the network. In the literature, similar combinations of gates have been used for other purposes such as constructive decomposition [19] and rewiring [2]. We note here that supergates allow library-aware Boolean decompositions to be explored, as opposed to the algebraic decompositions in [13], and we defer the discussion to Section V-E.
The work related to the simplified Boolean matching is presented in Section III-B.
III. BOOLEAN MATCHING
We introduce the simplified Boolean-matching algorithm in the context of the overall mapping flow. The mapping procedure is a Boolean cut-based one [4] , [6] , [7] on a DAG using dynamic programming [3] , [12] to guarantee delay optimality.
A. Overview of Boolean Mapping
The input to the mapping procedure is an AND-Inverter graph (AIG) [11] . An AIG is a DAG, whose nodes represent either AND gates or primary inputs (PIs). Its edges represent wires. Inverters are represented by bubbles on the edges. Given an AIG, the mapping is done in five steps.
Step 1) Compute k-Feasible Cuts: A feasible cut of a node N in the AIG is a set of nodes {X i } in the transitive fan-in cone of N such that an arbitrary assignment of values to X i completely determines the value of N . A feasible cut is redundant, if the value of a node in the cut is completely determined by an assignment of values to the other nodes in the cut. A k-feasible cut is a feasible cut of size at most k that is not redundant. The cut {N } is always a k-feasible cut of node N (for any k) and is called the trivial cut.
Let Φ(N) denote the set of k-feasible cuts of node N. If N is a PI, then Φ(N) = {{N}}. If N is an AND node with inputs A and B, then
$$\Phi(N) = \{\{N\}\} \cup \{\, u \cup v \mid u \in \Phi(A),\ v \in \Phi(B),\ |u \cup v| \le k \,\}.$$
We compute all 5-feasible cuts of every node in the network by a simple bottom-up traversal based on the above recursion. Although in general a graph may have O(n^5) 5-feasible cuts, we found that most test cases have between 20 and 30 5-feasible cuts per node. We restrict our attention to 5-feasible cuts, since our experiments show that the total number of cuts increases very quickly with k. For larger k, pruning techniques have to be applied, and the mapping results are not significantly better (since the pruning is quite arbitrary).
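The bottom-up cut enumeration can be sketched as follows. This is a minimal illustration, assuming the AIG is given as a Python dictionary listed in topological order (a representation chosen here for brevity, not the mapper's actual data structure). Each AND node merges one cut from each fanin and keeps the unions with at most k leaves. Edge inversions are ignored because they do not affect which cuts exist, and the removal of redundant cuts is omitted.

```python
from itertools import product

def enumerate_cuts(aig, k):
    """aig maps node -> None for a PI, or (a, b) for an AND node with
    fanins a and b; nodes must be listed in topological order.
    Returns node -> set of cuts, each cut a frozenset of nodes."""
    cuts = {}
    for node, fanins in aig.items():
        result = {frozenset([node])}            # the trivial cut {N}
        if fanins is not None:
            a, b = fanins
            for u, v in product(cuts[a], cuts[b]):
                merged = u | v
                if len(merged) <= k:            # keep only cuts of size <= k
                    result.add(merged)
        cuts[node] = result
    return cuts

# Hypothetical example: n1 = a AND b, n2 = c AND d, f = n1 AND n2
aig = {"a": None, "b": None, "c": None, "d": None,
       "n1": ("a", "b"), "n2": ("c", "d"), "f": ("n1", "n2")}
cuts3 = enumerate_cuts(aig, k=3)
cuts5 = enumerate_cuts(aig, k=5)
```

With k = 3 the cut {a, b, c, d} of f is rejected because it has four leaves; with k = 5 it is kept.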
Step 2) Compute Truth-Tables of Cuts: The next step is to compute the local function of a node in terms of its cut. This is done for every nontrivial k-feasible cut of every node in the network. Given a node N and a cut {X i } of that node, formal variables are assigned to each cut node (in no particular order). Using these variables, the functionality of the node is computed symbolically. We note here that BDDs are not necessary for this computation: Since usually only 5-feasible cuts are considered, the symbolic-function computation can be performed efficiently using 32-bit machine words to represent truth-tables. In what follows, we use the words "function" and "truth-table" interchangeably.
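As an illustration of the word-level computation, the sketch below evaluates a node in terms of a cut using 32-bit truth tables. The AIG encoding (each AND node stores two fanins with an inversion flag) and the function name are assumptions made for this example. Bit m of a word holds the function value on minterm m.

```python
MASK32 = 0xFFFFFFFF
# Truth tables of the five formal variables: bit m of VAR[i] is the
# value of variable i in minterm m.
VAR = [0xAAAAAAAA, 0xCCCCCCCC, 0xF0F0F0F0, 0xFF00FF00, 0xFFFF0000]

def cut_truth_table(aig, root, cut):
    """aig maps an AND node to ((fanin, inverted), (fanin, inverted)).
    cut is an ordered list of at most five leaf nodes."""
    table = {leaf: VAR[i] for i, leaf in enumerate(cut)}

    def eval_node(n):
        if n not in table:
            (a, inv_a), (b, inv_b) = aig[n]
            ta = eval_node(a) ^ (MASK32 if inv_a else 0)   # apply inverter
            tb = eval_node(b) ^ (MASK32 if inv_b else 0)
            table[n] = ta & tb                             # AND node
        return table[n]

    return eval_node(root)

# Hypothetical example: f = NOT(a AND NOT b) AND NOT(NOT a AND b) = XNOR(a, b)
aig = {"n1": (("a", False), ("b", True)),
       "n2": (("a", True), ("b", False)),
       "f": (("n1", True), ("n2", True))}
tt = cut_truth_table(aig, "f", ["a", "b"])
```

Since the function depends only on the first two variables, the 4-bit pattern for XNOR repeats across the whole word.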
Step 3) Boolean Matching: For each node in the network, for every cut, an appropriate gate (if one exists) is chosen from the library. Each gate, thus chosen, is called a match for the node. Our matching procedure differs from the traditional approach, and we present this in greater detail in the following section.
Step 4) Compute Best Arrival Time at Each Node: Starting from the PIs and working in topological order toward the outputs, the best arrival time is computed for each node from among all its matches.
Step 5) Choose the Best Cover: In the reverse topological order, the best gate for each primary output is chosen. Next, the best gates implementing the inputs of these gates are chosen and so on until all primary inputs have been reached.
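Steps 4 and 5 can be sketched as a forward dynamic-programming pass followed by a backward traversal. The sketch below assumes a load-independent model with a single delay per gate and represents each match as a (gate, delay, leaves) tuple; these structures and the gate names are hypothetical and chosen only for illustration.

```python
def best_arrival(order, matches):
    """order: nodes in topological order. matches: node -> list of
    (gate name, gate delay, leaf nodes); PIs have no entry."""
    arrival, best = {}, {}
    for node in order:
        if node not in matches:              # primary input
            arrival[node] = 0.0
            continue

        def arrival_of(match):
            _, delay, leaves = match
            return delay + max(arrival[leaf] for leaf in leaves)

        best[node] = min(matches[node], key=arrival_of)
        arrival[node] = arrival_of(best[node])
    return arrival, best

def choose_cover(outputs, best):
    """Walk back from the outputs, keeping only the matches actually used."""
    cover, stack = {}, list(outputs)
    while stack:
        node = stack.pop()
        if node in best and node not in cover:
            cover[node] = best[node]
            stack.extend(best[node][2])
    return cover

# Hypothetical example: f can use an AND2 on (n1, n2) or an AND4 on the PIs.
order = ["a", "b", "c", "d", "n1", "n2", "f"]
matches = {"n1": [("AND2", 1.0, ("a", "b"))],
           "n2": [("AND2", 1.0, ("c", "d"))],
           "f": [("AND2", 1.0, ("n1", "n2")), ("AND4", 1.5, ("a", "b", "c", "d"))]}
arrival, best = best_arrival(order, matches)
cover = choose_cover(["f"], best)
```

In this example the 4-input gate gives the earlier arrival time at f, so n1 and n2 do not appear in the final cover.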
B. Simplified Boolean Matching
In the matching step, we wish to choose the appropriate library gate to implement a given function. The traditional solution is to use NPN-equivalence classes (two Boolean functions are NPN equivalent if one can be transformed into the other by permuting or negating the inputs or negating the output). First, the NPN-canonical representatives of the library gates are computed. The gates are stored in a hash table indexed by the representatives. During matching, the NPN-canonical representative is computed for every cut function. It is used to look up the hash table to find the appropriate gate.
However, this approach is slow since it requires the computation of the NPN representative for every cut function. Furthermore, once the appropriate gate is found, the appropriate variable correspondence must be found between the library gate and the cut function. Since these computations are done in the inner body of the mapper, there has been a lot of research on speeding them up [1] , [5] , [7] , [22] .
Our simplified matching procedure is motivated by the fact that in the mapping procedure described, the functions have only five variables. It is therefore advantageous to precompute all functions obtained by permuting the inputs to library gates and to add those to the hash table. Thus, during the actual matching, there is no need to compute a canonical representative. The cut function can be used directly to look up the hash table. In addition to avoiding the NPN-equivalence checking, it also avoids the need to establish input correspondence after the gate is found.
Fig. 3. Node x must be mapped in both polarities in order to map p and q optimally for delay.
Example: For an AOI gate having the function ¬(
are kept although they are the same function. This is because the input-to-output delays are often different even for functionally symmetric pins.
Once again, all functional computations are done with truthtables. In Section V-B, on supergate generation, we describe this precomputation procedure in more detail (the precomputation can be thought of as generation of supergates consisting of single-library gates).
So far, we have not considered the phase assignment at the inputs. This is because the mapping is done "dual rail" as explained in the following section on optimal phase selection.
The problem with this simplified matching procedure is that it is not scalable. Indeed, this would not work if the functions we were interested in had, for example, ten variables. But, in the framework of Boolean mapping, it suffices since we deal with functions having only a few variables. This means that in the worst case (a library gate with five inputs and no symmetries), 5! = 120 functions are added to the hash table (instead of one as in the NPN-representative case). However, since each match is now only a hash-table lookup (versus an NPN-representative computation, lookup, and input correspondence), the matching procedure runs faster.
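The precomputation can be illustrated with a short sketch. It assumes a one-gate library containing an AOI gate with the function ¬(a·b + c) (an assumed pin function used only for this example) and stores every input permutation in a dictionary keyed by truth table. The helper names are hypothetical.

```python
from itertools import permutations

def permute_inputs(tt, perm, n):
    """Truth table of g(x[perm[0]], ..., x[perm[n-1]]), given the truth
    table tt of g(x[0], ..., x[n-1]). Bit m holds the value on minterm m."""
    out = 0
    for m in range(1 << n):
        pins = 0
        for j in range(n):                  # gate pin j reads variable perm[j]
            if (m >> perm[j]) & 1:
                pins |= 1 << j
        if (tt >> pins) & 1:
            out |= 1 << m
    return out

def build_match_table(library):
    """library: gate name -> (truth table, number of inputs)."""
    table = {}
    for name, (tt, n) in library.items():
        for perm in permutations(range(n)):
            key = (n, permute_inputs(tt, perm, n))
            # Permutations with equal functions are all kept, because the
            # pin-to-pin delays of the two assignments may differ.
            table.setdefault(key, []).append((name, perm))
    return table

# NOT(a*b + c) with a = bit 0, b = bit 1, c = bit 2 is 1 on minterms 0, 1, 2.
library = {"AOI21": (0x07, 3)}
table = build_match_table(library)
matches = table.get((3, 0x13), [])          # lookup for NOT(x0*x2 + x1)
```

Matching a cut function is then a single dictionary lookup, and the stored permutation gives the pin assignment directly.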
C. Optimal Phase Selection
The mapping quality can be improved by exploiting the additional flexibility of mapping each node in both polarities: positive (as above) and negative. When the final mapping is selected (Section III-A, Step 5), the appropriate polarity is chosen to guarantee the shortest delay on each path. This may lead to an increase in area since the same node may be required in both polarities and may have to be duplicated.
Example: Consider the AIG fragment in Fig. 3 . Node x is required in both polarities. If it is mapped in only one polarity then the arrival time at either p or q increases by an additional inverter delay.
Observe that this "dual-rail" mapping means that the inputs of a cut are available in both polarities. Consequently, the function of a cut is not precisely determined. It belongs to a class of functions which differs only by complementation of inputs. This class is called the N-equivalence class. There is greater flexibility, since any gate belonging to the N-equivalence class may be used to implement the cut.
The simplified matching procedure above is extended to work with N-equivalence classes. During library preprocessing, for every permutation of a gate, the N-equivalence representative is computed. This is defined as the function with the lexicographically smallest truth table. Similarly, during matching for every node function, the N-equivalence representative is computed and used for looking up in the hash table.
The alert reader will have noticed that this extension negates some of the speed benefit of the simplified matching procedure presented in Section III-B. However, this may be justified with the following arguments. First, computing the N-equivalence representative is less expensive than computing the NPN-equivalence representative, since the class is smaller (2^n versus 2^(n+1) n!). Second, this is a necessary price to pay for exploring this larger search space of decompositions (cf. the inverter transform in Lehman et al. [13]). In structural mappers, this search space is explored by adding a pair of inverters between two nodes and by adding a wire (with zero cost) to the library. That technique is not well suited for Boolean mapping since it significantly increases the number of cuts in the network.
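A minimal sketch of the N-equivalence representative follows. It enumerates all 2^n input complementations and, as an assumption of this sketch, uses the smallest truth table read as an integer as a stand-in for the lexicographically smallest one. The function name is hypothetical.

```python
import math

def n_canonical(tt, n):
    """Smallest truth table (as an integer) among the 2**n functions
    obtained from tt by complementing any subset of its n inputs."""
    best = tt
    for mask in range(1 << n):              # mask selects inputs to complement
        out = 0
        for m in range(1 << n):
            if (tt >> (m ^ mask)) & 1:
                out |= 1 << m
        best = min(best, out)
    return best

# Two-input examples, bit m = value on minterm m (a = bit 0, b = bit 1).
AND, OR, XOR, XNOR = 0x8, 0xE, 0x6, 0x9
rep_and = n_canonical(AND, 2)     # same class as a AND NOT b, NOR, ...
rep_xnor = n_canonical(XNOR, 2)   # XNOR is XOR with one input complemented

# Class sizes for n = 5: N-equivalence versus NPN-equivalence.
n_class_size = 2 ** 5
npn_class_size = 2 ** (5 + 1) * math.factorial(5)
```

The same routine would be applied both to the permuted library functions during preprocessing and to each cut function during matching.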
D. Using Don't Cares
An advantage of the Boolean-matching technique outlined above is the ability to handle local satisfiability don't cares during the matching process. If the matching is done with k-feasible cuts, then for every node, a cut larger than k is constructed. This can be done by a simple depth-first search starting from the node (for example, if matching is done with 5-feasible cuts, a larger cut of size, say, 10 is considered). For every k-feasible cut of the node, it is possible to compute the satisfiability don't cares symbolically. This can be done either with exhaustive simulation using truth-tables or with BDDs. Once the satisfiability don't cares of a k-feasible cut are identified, each don't care minterm can be set to one or zero to obtain a completely specified-cut function. This completely specified function is used to look up the hash table as before. Thus, each k-feasible cut gives rise to a set of cut functions when satisfiability don't cares are considered.
The main drawback of this method is the exponential blowup in the number of cut functions, since each don't care gives rise to two possible functions. Thus, in practice, only an arbitrary subset of the cut functions can be used for matching. Our preliminary experiments confirmed that using don't cares in this manner improved the quality of mapping. However, in our application, the increase in quality was not justified by the increase in run time.
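The blowup can be made concrete with a small sketch that enumerates every completely specified function compatible with a given don't-care set. The truth-table encoding (bit m is the value on minterm m) and the function name are assumptions of this example.

```python
def completions(onset, dont_care):
    """Return all completely specified truth tables that agree with
    onset on the care minterms; dont_care marks the free minterms."""
    dc_bits = [b for b in range(dont_care.bit_length()) if (dont_care >> b) & 1]
    base = onset & ~dont_care               # value fixed on care minterms
    result = []
    for choice in range(1 << len(dc_bits)):
        tt = base
        for j, b in enumerate(dc_bits):
            if (choice >> j) & 1:           # set this don't-care minterm to one
                tt |= 1 << b
        result.append(tt)
    return result

# Two-input AND (0b1000) with minterms 1 and 2 as don't cares.
functions = completions(0b1000, 0b0110)
```

With d don't-care minterms the list has 2^d entries, which is why only a subset can be tried in practice.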
IV. LOSSLESS-LOGIC SYNTHESIS
The idea behind lossless-logic synthesis is to "remember" every network seen during a synthesis flow (or a set of flows), and to select the best parts of each network during technology mapping. This is useful for two reasons. First, technologyindependent synthesis algorithms are usually heuristic, and so, there is no guarantee that the final network is optimal. By mapping using only the final network, we might miss out on a better result that could be obtained from an intermediate network in the optimization flow.
Second, synthesis operations usually apply to the network as a whole. Therefore, a flow to optimize delay might significantly increase the area, since the whole network is optimized for delay. By combining such a delay optimized network with another network that has been optimized for area, it is possible to get the best of both. On the critical path, the mapper can choose from the delay-optimized network, whereas off the critical path, the mapper chooses from the area-optimized network.
The main problem is constructing the choice network efficiently. In Section IV-A, we give an overview of how this is done. In Section IV-B, we extend the Boolean-mapping procedure of Section III-A to handle choices.
A. Constructing the Choice Network
The choice network is constructed from a collection of networks that are functionally equivalent. The key idea is to use recent advances in equivalence checking that are based on identifying functionally equivalent internal points in the networks being checked.
Conceptually, the procedure is as follows: One can imagine each network to be decomposed into AND gates and inverters to form an AIG. Now, for every node in the network, the global function is computed, for example, by building BDDs. All the nodes that have the same global function are collected in equivalence classes. Thus, the choice network is an AIG, which has multiple functionally equivalent points collected in equivalence classes.
However, for large circuits, computing global BDDs is not feasible. Note that in the procedure outlined previously, it is not necessary to actually compute BDDs. One can use random simulation to identify potentially equivalent nodes and then use a SAT engine to verify equivalence and construct the equivalence classes. We have implemented a package called functionally reduced AIGs (FRAIGs) that exposes the same API as a BDD package but internally uses simulation and SAT (the details of this package are in the technical report [18], and a discussion of the issues involved can also be found in recent work on structure-based equivalence checking [10], [11], [14], [15]).
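The simulation half of this procedure can be sketched as follows. The code is an illustration only and does not reflect the FRAIG package's API: it simulates random bit-parallel patterns and groups nodes whose signatures agree up to complementation. The resulting groups are only candidates; a SAT check, omitted here, would be needed to confirm each equivalence.

```python
import random

def candidate_classes(aig, num_bits=64, seed=0):
    """aig maps node -> None (PI) or ((fanin, inverted), (fanin, inverted)),
    in topological order. Returns lists of nodes with matching signatures."""
    rng = random.Random(seed)
    mask = (1 << num_bits) - 1
    sig = {}
    for node, fanins in aig.items():
        if fanins is None:
            sig[node] = rng.getrandbits(num_bits)     # random PI patterns
        else:
            (a, inv_a), (b, inv_b) = fanins
            sa = sig[a] ^ (mask if inv_a else 0)
            sb = sig[b] ^ (mask if inv_b else 0)
            sig[node] = sa & sb
    classes = {}
    for node, s in sig.items():
        key = min(s, s ^ mask)       # identify a signature with its complement
        classes.setdefault(key, []).append(node)
    return [nodes for nodes in classes.values() if len(nodes) > 1]

# Hypothetical example: x1 = (a AND b) AND c and x2 = a AND (b AND c).
aig = {"a": None, "b": None, "c": None,
       "t1": (("a", False), ("b", False)),
       "x1": (("t1", False), ("c", False)),
       "t2": (("b", False), ("c", False)),
       "x2": (("a", False), ("t2", False))}
classes = candidate_classes(aig)
```

Truly equivalent nodes such as x1 and x2 always receive the same signature; nodes that are not equivalent may collide by chance, which is why the verification step is required.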
Example: Fig. 4 illustrates the creation of a network with choices. Networks 1 and 2 show the subject graphs obtained from two networks that are functionally equivalent but structurally different. The nodes x 1 and x 2 are functionally equivalent (up to complementation) in the two subject graphs. They are collected in an equivalence class in the choice network, and an arbitrary member (x 1 in this case) is used as a representative for the class in the choice network. Note that there is no choice corresponding to node o, since the procedure described above detects the maximal commonality between the two networks.
Algebraic Rewriting: A different way to generate choices is by iteratively applying the Λ and ∆-transformations described by Lehman et al. [13] . Given an AIG, we use the associativity of AND to locally rewrite the graph (the Λ-transformation), i.e., whenever the structure AND(AND(x 1 , x 2 ), x 3 ) is seen in the AIG, it is replaced by the equivalent structures AND(AND(x 1 , x 3 ), x 2 ) and AND(x 1 , AND(x 2 , x 3 )). If this process is done until no new AND nodes are created, it is equivalent to identifying the maximal multi-input AND gates in the AIG and adding all possible tree decompositions. Similarly, the distributivity of AND over OR (the ∆-transformation) provides another source of choices. In Section VI, we present experimental results that compare the flexibility provided by these choices with those from lossless synthesis.
Note that this leads to a new way of thinking about logic synthesis: One can use arbitrary transformations to rewrite the network and create choices. The best combination of these choices is selected during mapping.
B. Mapping With Choices
The cut-based Boolean-mapping procedure of Section III-A can be extended naturally to handle equivalence classes of nodes. Only the cut-computation step needs modification. Given a node N, let $\overline{N}$ denote the equivalence class it belongs to. Let $\overline{\Phi}(\overline{N})$ denote the set of cuts of the equivalence class $\overline{N}$. Then
$$\overline{\Phi}(\overline{N}) = \bigcup_{N' \in \overline{N}} \Phi(N')$$
where, if A and B are the two inputs of N, Φ(N) is given by
$$\Phi(N) = \{\{N\}\} \cup \{\, u \cup v \mid u \in \overline{\Phi}(\overline{A}),\ v \in \overline{\Phi}(\overline{B}),\ |u \cup v| \le k \,\}.$$
This expression for Φ(N) is a slight modification of the one in Section III-A. The cuts of N are obtained from the cuts of the equivalence classes of its inputs (instead of the cuts of just its inputs). The reader may verify that in the absence of choices (which corresponds to the situation when all equivalence classes have only one node), this computation is essentially the same as the one presented in Section III-A.
As before, the cut computation can be done in a bottom-up manner from PIs to outputs in a single pass.
Example: Consider the computation of the 3-feasible cuts of the equivalence class {o} in Fig. 4. Let x represent the equivalence class {x_1, x_2}. Now
$$\overline{\Phi}(\{o\}) = \Phi(o) = \{\{o\}\} \cup \{\, u \cup v \mid u \in \overline{\Phi}(\overline{a}),\ v \in \overline{\Phi}(\overline{x_1}),\ |u \cup v| \le 3 \,\}.$$
Now, since $\overline{a} = \{a\}$ and $\overline{x_1} = x$, we get
$$\overline{\Phi}(\{o\}) = \{\{o\}, \{a, x_1\}, \{a, x_2\}, \{a, q, r\}, \{a, p, s\}\}.$$
Observe that the set of cuts of o involves nodes from the two choices x 1 and x 2 , i.e., o may be implemented using either of the two structures. The subsequent steps of the mapping process (Section III-A, steps 2-5) remain unchanged, except now, the operations are done for equivalence classes of nodes rather than for individual nodes.
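The cut computation with choices can be sketched as below. The example network is hypothetical (it is not the one in Fig. 4) but has the same shape: two structurally different implementations x1 and x2 of the same function feed a node o. The sketch assumes that equivalence classes are given as a map from each non-representative node to its representative and that every member of a class is listed before any fanout of that class.

```python
from itertools import product

def cuts_with_choices(aig, rep, k):
    """aig maps node -> None (PI) or (a, b), in topological order.
    rep maps a node to its class representative (default: itself)."""
    node_cuts, class_cuts = {}, {}
    for node, fanins in aig.items():
        result = {frozenset([node])}
        if fanins is not None:
            a, b = (rep.get(f, f) for f in fanins)
            # Merge cuts of the fanin classes, not just of the fanin nodes.
            for u, v in product(class_cuts[a], class_cuts[b]):
                if len(u | v) <= k:
                    result.add(u | v)
        node_cuts[node] = result
        class_cuts.setdefault(rep.get(node, node), set()).update(result)
    return node_cuts, class_cuts

# x1 = (b AND c) AND d and x2 = b AND (c AND d) are equivalent choices.
aig = {"a": None, "b": None, "c": None, "d": None,
       "q": ("b", "c"), "s": ("c", "d"),
       "x1": ("q", "d"), "x2": ("b", "s"),
       "o": ("a", "x1")}
rep = {"x2": "x1"}
node_cuts, class_cuts = cuts_with_choices(aig, rep, k=3)
```

The cuts of o then include leaves drawn from both x1 and x2, so the mapper is free to implement o through either structure.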
V. SUPERGATES
A supergate is a single-output combinational network of a few library gates, which is treated as a single-library gate by the mapping procedure. As the example in the introduction shows (see Fig. 2 ), supergates match larger portions of the subject graph than library gates. This makes the matching more Boolean and less dependent on the structure of the subject graph. This greatly increases the number of matches seen by the mapper and leads to better results.
In what follows, we use the term simple gates to mean the gates in the original library.
A. Use in Mapping
Supergates require no change to the mapping procedure since they are no different from simple gates for the mapper. Supergates are generated in a preprocessing step as described in Section V-B. The supergate library is generated once, stored compactly in a file, and used when technology mapping is invoked. The supergates are recomputed only if changes are made to the original library. This is why supergate generation has an additional advantage of reducing the total run time of mapping by precomputing and reusing the mapping information, which depends on the library but not on the network to be mapped.
After the network is mapped using supergates, each supergate in the mapped netlist is replaced by its constituent simple gates; thus, the final netlist consists of only the library gates.
B. Supergate Generation
Supergates are generated recursively in a number of rounds. In each round, we generate a new set of supergates and compute their functions. These supergates are used in subsequent rounds to generate new supergates. As usual, we represent functions by truth tables.
Let k be the maximum number of inputs in the support of the supergates we consider. For example, when mapping with 5-feasible cuts, k would be 5.
Let G i be the set of supergates at round i. Let L be the set of simple-library gates. Let J k denote {1, . . . , k}, π(X) the set of permutations of a set X, and |g| the number of inputs of a simple gate g ∈ L.
The supergate generation is as follows. Initially, G 0 = {x i |i ∈ J k }, where x i is an elementary function. Each G i+1 is the union of G i , and an additional set of functions of the form g (x i 1 , x i 2 , . . . , x i n ), where g ∈ G i and (i 1 , i 2 
The total number of rounds corresponds to the maximum number of logic levels in a supergate and is a user-specified parameter. If the generation is stopped after the first round (i.e., at G 1 ), then the set of supergates contains only the library gates with all permutations of input variables (cf. Section III-B).
Example: Let k = 3, i.e., we are interested in supergates with at most 3 inputs. Let L = {AND, OR}.
G 0 = {x 1 , x 2 , x 3 }, where the truth-table of x 1 is 0101 0101, x 2 is 0011 0011, and x 3 is 0000 1111.
Each G i+1 contains, in addition to the functions in G i , functions of the form AND(y 1 , y 2 ) and OR(y 1 , y 2 ), where y 1 , y 2 are functions in G i .
Thus, x 1 , AND(x 1 , x 2 ) (whose truth-table is 0001 0001), AND(x 2 , x 1 ), AND(x 1 , x 3 ), etc., are some functions in G 1 . Similarly, AND(AND(x 1 , x 2 ), x 3 ), AND(OR(x 2 , x 1 ), AND(x 1 , x 3 )), and AND(AND(x 1 , x 2 ), AND(x 1 , x 3 )) are some functions in G 2 .
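The example can be reproduced with a short sketch of the exhaustive generation, without any pruning. The representation of a supergate as an expression string paired with a truth table, and the function name, are assumptions of this illustration. Truth tables are stored as integers with bit m holding the value on minterm m, so the table written 0001 0001 in the text (minterm 0 first) is the integer 0x88.

```python
from itertools import product

def generate_supergates(library, k, rounds):
    """library: gate name -> (function on truth tables, arity).
    Returns a dict mapping expression string -> truth table on k inputs."""
    mask = (1 << (1 << k)) - 1
    gates = {}
    for i in range(k):                      # G_0: the elementary variables
        tt = 0
        for m in range(1 << k):
            if (m >> i) & 1:
                tt |= 1 << m
        gates["x%d" % (i + 1)] = tt
    for _ in range(rounds):                 # G_{i+1} from G_i
        new = dict(gates)
        for name, (fn, arity) in library.items():
            for args in product(list(gates.items()), repeat=arity):
                inputs = ", ".join(expr for expr, _ in args)
                new["%s(%s)" % (name, inputs)] = fn(*(t for _, t in args)) & mask
        gates = new
    return gates

library = {"AND": (lambda a, b: a & b, 2), "OR": (lambda a, b: a | b, 2)}
g1 = generate_supergates(library, k=3, rounds=1)
g2 = generate_supergates(library, k=3, rounds=2)
```

Even for this two-gate library the number of expressions grows from 21 after one round to 885 after two, which motivates the pruning techniques that follow.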
C. Pruning by Dominance
Since the above procedure is exhaustive, a large number of supergates are generated. However, some of the supergates generated above are suboptimal, i.e., they are dominated by other gates.
Example: Consider the gate AND(AND(x 1 , x 2 ), AND(x 1 , x 3 )) from the example. It has worse delay and area than the functionally equivalent supergate AND (AND(x 1 , x 2 ), x 3 ) .
Whenever a new supergate is created, it is checked against existing supergates that implement the same function (by means of a hash table). If the new supergate is worse in terms of area and delay than an existing one, then it is not added to the set of supergates.
Also, note that, usually in industrial libraries, functional symmetry of the underlying gate is not very useful for pruning. For example, both AND(x 1 , x 2 ) and AND(x 2 , x 1 ) have to be retained since the pin-to-pin delays are different in the two cases.
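A dominance check of this kind can be sketched with a hash table keyed by function. As a simplifying assumption, the sketch summarizes each supergate by one area and one delay number; with pin-to-pin delays, as noted above, the comparison would have to be made per pin, so that symmetric variants such as AND(x 1 , x 2 ) and AND(x 2 , x 1 ) are not pruned against each other. The function name and the unit costs are hypothetical.

```python
def add_if_not_dominated(table, tt, name, area, delay):
    """table maps truth table -> list of (name, area, delay).
    The candidate is rejected if an existing supergate with the same
    function is no worse in both area and delay."""
    entries = table.setdefault(tt, [])
    for _, a, d in entries:
        if a <= area and d <= delay:
            return False
    entries.append((name, area, delay))
    return True

table = {}
# Unit area per gate and unit delay per level, for the function x1*x2*x3.
kept_first = add_if_not_dominated(table, 0x80, "AND(AND(x1, x2), x3)", 2, 2)
kept_second = add_if_not_dominated(table, 0x80, "AND(AND(x1, x2), AND(x1, x3))", 3, 2)
# A hypothetical larger but faster implementation is a genuine trade-off.
kept_third = add_if_not_dominated(table, 0x80, "AND3(x1, x2, x3)", 4, 1)
```

A fuller implementation might also evict existing entries that a new candidate dominates; that step is left out here.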
D. Pruning by Resource Limits
Another technique to reduce the number of supergates is by pruning based on resource limits. This is important because of a combinatorial explosion inherent in the previous formulation.
Example: If at one point there are 1000 supergates and a 4-input NAND is used as a root gate, there would be 1000^4 supergates to consider. Many of these would be added (since the pin-to-pin delays are different), and the number of supergates would increase significantly, making the next round impossible to complete.
We experimented with a number of heuristics to reduce the combinatorial explosion. The simplest heuristic is to use only small-support gates (say, with at most three inputs) as root gates. A second heuristic is to set area and delay limits on the supergates. These limits can be handled efficiently by sorting the supergates and using only those supergates as inputs to the root gate such that the resulting supergate would be within the area and delay limits.
The supergate-generation technique presented above is rather basic and inefficient because of the bottom-up nature of the generation process. Improving the generation process is an interesting research problem. One idea is to study the cut functions actually encountered during mapping and, then, to employ constructive-decomposition techniques to generate good supergates for the commonly occurring functions. This leads to the exciting possibility of a mapper that learns from the circuits it processes!
E. Comparison With Algebraic Rewriting
Recall from Section IV-A that algebraic rewriting includes the addition of choices by adding alternative decompositions based on associative and distributive transforms in the manner of Lehman et al. [13] .
If supergates are generated exhaustively without resource limits, it is possible to do exact logic synthesis (the entire circuit would map to a single supergate). In this theoretical setting, mapping with supergates is the most general form of logic synthesis. In practice, as discussed, only a limited set of supergates can be used.
Practically, the search space explored by supergates is different from the search space explored by rewriting in two ways. First, since supergates are built by combining gates, supergates capture different Boolean decompositions of a function. Therefore, they are more general than algebraic rewriting, which by definition is limited to algebraic decompositions. Second, since the set of decompositions explored depends on the library and the pruning techniques described above prune away suboptimal decompositions, supergates explore a space of "good" decompositions relative to the gates in the library.
In summary, the decompositions due to supergates are a targeted, although sparse, sampling of the large space of Boolean decompositions. In contrast, algebraic rewriting is a dense sampling of the smaller space of algebraic decompositions.
VI. EXPERIMENTAL RESULTS
The techniques described in this paper have been implemented in the MVSIS system [17] . The implementation is freely available. We performed a number of experiments to characterize the performance of the mapper in its various modes and to benchmark it against other state-of-the-art mappers.
Area Recovery: In the experiments reported here, delay is the parameter of interest, and area is not directly controlled during mapping. However, unnecessary duplication of logic during DAG-mapping can lead to poor area. The area of the mapped netlist is improved by making two passes after the mapping procedure described in Section III-A. Before each pass, the required time is computed at every node. The nodes are then processed in topological order (from inputs to outputs). For nodes having positive slack, matches that minimize area without violating required times are chosen. The two passes differ in the metric used to measure area. The first pass uses area-flow [16], a global metric that accounts for fan-out. The second pass uses exact area, a local metric that captures the area contribution due to gates in the maximal fan-out-free cone of a match (in exact area, the area cost of a match is the sum of the areas of all gates used exclusively by the match). Our experiments indicate that the combination of these heuristics (in this order) is very effective: post-sizing area is reduced by about 30% without increasing delay.
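The selection step of the first pass can be sketched as follows. This uses one common formulation of area-flow, in which the flow of each leaf is divided by its fan-out; the exact definition in [16] may differ in detail. The dictionary layout, the names `area_flow` and `pick_match`, and all numbers are hypothetical, and the sketch assumes that at least one match (the delay-optimal one) always meets the required time.

```python
def area_flow(match, flow_of, fanout):
    """Gate area plus each leaf's area-flow shared among that leaf's fan-outs."""
    return match["area"] + sum(
        flow_of[leaf] / max(1, fanout[leaf]) for leaf in match["leaves"]
    )


def pick_match(matches, required, flow_of, fanout):
    """Choose the smallest-area-flow match that still meets the required time."""
    feasible = [m for m in matches if m["arrival"] <= required]
    return min(feasible, key=lambda m: area_flow(m, flow_of, fanout))


# Hypothetical data: primary inputs a and b have zero area-flow,
# and n1 is an internal node that was mapped earlier.
flow_of = {"a": 0.0, "b": 0.0, "n1": 2.0}
fanout = {"a": 1, "b": 2, "n1": 2}

m_fast = {"name": "fast", "area": 3.0, "arrival": 2.0, "leaves": ["a", "b"]}
m_small = {"name": "small", "area": 1.0, "arrival": 3.0, "leaves": ["n1", "b"]}

print(area_flow(m_fast, flow_of, fanout))   # 3.0
print(area_flow(m_small, flow_of, fanout))  # 2.0
print(pick_match([m_fast, m_small], 3.0, flow_of, fanout)["name"])  # small
print(pick_match([m_fast, m_small], 2.5, flow_of, fanout)["name"])  # fast
```

When the node has slack (required time 3.0), the smaller match is chosen; when the required time tightens to 2.5, only the faster match is feasible and area is not recovered at that node.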
A. Basic Evaluation
The first two experiments were done in an academic setting, using mcnc.genlib and a simple load-independent delay model. The benchmarks chosen were the 15 largest publicly available ones (listed in Table II), and they were preprocessed with the script shown in Fig. 1(a) followed by balancing for delay. For lossless synthesis, the choices were generated using the scheme shown in Fig. 1(b).
Characterizing the Quality-Run-Time Tradeoff: Table I summarizes the relative delay, area, and run-times of the mapper in its various modes. As might be expected, the fastest run-time is obtained when neither supergates nor lossless synthesis is used (this mode is called baseline), and the best quality (32% improvement in delay over baseline) is obtained when both techniques are used (this mode is called B + L + S).
In addition to these extreme cases, Table I also shows the intermediate situations when either lossless synthesis or supergates are used alone, giving a range of quality-run-time tradeoffs. We note that since the absolute run-time of the baseline mapper is very small (less than 4 s for the largest public benchmarks), the order of magnitude increase in run-time when using both lossless synthesis and supergates is acceptable for better quality.
Experimental Comparison With Algebraic Rewriting: Since a direct comparison with the implementation described by Lehman et al. [13] was not possible, we performed a simple experiment to estimate the effect of algebraic rewriting. As per Section IV-A, a number of associative and distributive decompositions were added through local rewriting. Decompositions were iteratively added until the number of nodes in the AIG tripled.
Compared to baseline, algebraic rewriting led to a 9% reduction in delay (cf. the 32% reduction with B + L + S). When used in conjunction with either lossless synthesis or supergates, rewriting led to a smaller improvement in delay (about 4% in both cases). When used in conjunction with both supergates and lossless synthesis, there was an improvement of only 2% in delay. This confirms the analysis in Section V-E, that in practice, supergates and algebraic rewriting explore different search spaces.
In the subsequent experiments, we use the baseline and the B + L + S modes (supergates and lossless synthesis) for comparisons with other mappers.
Comparison With SIS: Table II shows the performance of the mapper on the benchmark circuits in comparison with the mapper in SIS (used in delay-optimal mode). In the baseline mode, the mapper runs five times faster than the tree mapper in SIS [20] and produces 33% better delay without degrading area. In the B + L + S mode, the mapper produces the best results with 30% reduction in delay over baseline, and 54% over SIS.
B. Comparison With Industrial Mappers
The next set of experiments was conducted in an industrial setting. The examples are timing-critical combinational blocks extracted from a high-performance microprocessor design, which were optimized for delay during technology-independent synthesis. After technology mapping, buffering and sizing are done separately in accordance with a gain-based flow. As part of the mapping, an attempt is made to prefer those gates that can drive the estimated fan-out loads (this is done iteratively, like area recovery, using fan-out-estimation techniques similar to those used in the area-flow algorithm [16]).
Table III shows a comparison of the mapper with two other state-of-the-art mappers: DAG mapper [12] and GraphMap, which is an independent implementation of the Lehman-Watanabe mapper [13] that uses Boolean matching. Neither of these mappers has area recovery. Using supergates and choices, the mapper outperforms both GraphMap and DAG mapper in delay and area and has a significantly shorter run-time.
Table IV shows the performance of the mapper on some larger blocks from the microprocessor, in comparison with DAG mapper. Delay is reduced by 12%, while area (measured after sizing) is reduced by 24%. Thus, with larger blocks, the improvement in area is greater. It was pointed out in [12] that DAG mapper can produce significantly faster circuits compared to the traditional tree-mapping approach [8]. However, the area increase for DAG mapper can sometimes be quite significant. The significant area reduction achieved by the new mapper makes the DAG-mapping approach much more practical, especially as leakage-power consumption becomes an increasingly important consideration in high-performance designs.
VII. CONCLUSION AND FUTURE WORK
This paper demonstrates that Boolean mapping, based on the simplified matching algorithm and optimal phase selection, is a better alternative to structural matching, since it produces superior results with shorter run-time. Supergates and choices fit nicely into this framework and greatly improve the quality of mapping by mitigating structural bias. Furthermore, the intermediate networks seen during technology-independent synthesis are a useful source of choices for the final mapping. Supergates, although generated by brute-force enumeration, improve the quality of mapping even with industrial libraries.
To give a balanced view of the techniques presented, we should point out their limitations. The exhaustive-cut computation that works very well in baseline mode (when no choices are used) becomes a computational bottleneck when many choices are added. We have developed pruning heuristics to restrict the number of cuts considered for each node, but extensions of the techniques proposed in [4] to handle choices would be useful.
A general limitation of cut-based matching methods is that library gates with many inputs cannot be handled. In practice, this is solved by using a structural matcher just for those gates during the matching phase (as is done in the IBM system [21]). We are currently exploring more seamless techniques to extend Boolean matching to gates with many inputs.
The exhaustive nature of supergate generation (as presented in Section V-B) is inefficient since: 1) the generated functions may not correlate well with the actual cut functions in the circuits and 2) the same function may be generated multiple times. It would be interesting to explore methods for guided supergate generation, where more computational effort is invested in finding supergates for the frequently occurring cut functions. This suggests the possibility of a mapping procedure that learns from previous runs how to guide supergate generation.
For our current prototype within the gain-based methodology, sizing and buffering are performed after mapping during physical synthesis. We plan to extend our mapper for use in a flow that combines logical and physical synthesis.
