Introduction
The goal of logic synthesis is to produce a circuit that satisfies a set of logic equations, occupies minimal area, and meets the timing constraints. Most logic synthesis systems currently available split this task into two phases: a technology-independent phase and a technology-dependent phase.
In the first phase, transformations are applied on a Boolean network to find a representation with the least number of literals in the factored form. Additional timing optimization transformations are applied on this minimal area network to improve circuit performance. The role of the technology-dependent phase is to finish the synthesis of the circuit by performing the final gate selection from a target library. The technology-dependent phase is, to a large extent, constrained by the structure of the optimized Boolean network.
Prior Work
The technology mapping problem can be stated as follows: Given a Boolean network representing a combinational logic circuit optimized by technology-independent synthesis procedures and a target library, we bind nodes in the network to gates in the library such that the area of the final implementation is minimized and the timing constraints are satisfied.
A successful and efficient solution to the minimum area mapping problem was suggested by K. Keutzer and implemented in programs such as DAGON [10] and MIS [7] . The idea is to reduce technology mapping to DAG covering and to approximate DAG covering by a sequence of tree coverings which can be performed optimally using dynamic programming as follows. A set of base functions is chosen, such as a 2-input nand gate and an inverter. The optimized logic equations (obtained from technology independent optimization) are converted into a graph where each node is one of the base functions. This graph is called the subject graph. Each library gate is also represented by a graph consisting of only base functions. Each such graph is called a pattern graph. (Each library gate may have many different pattern graphs.) The technology mapping problem is then defined as the problem of finding a minimum cost covering of the subject graph by choosing from the collection of pattern graphs for all gates in the library. For area optimization, the cost of a cover is defined as the sum of gate areas. For minimum delay optimization [18] , the cost of a cover is defined as the critical path delay of the resulting circuit.
This approach is extended in [24] to solve the technology mapping problem minimizing area under delay constraints as follows. The authors first compute a range of "interesting" values for the required times at each node (by finding the minimum area and the minimum delay mapping solutions) and then divide this range into equal intervals. The best mapping solution for each of the required times is generated and stored at the node during a postorder traversal (from primary inputs to primary outputs) of the tree. The final mapping solution is generated during a preorder traversal (from primary outputs toward primary inputs) of the tree. In order to obtain high-quality mapping solutions, this method requires a small time step, resulting in a large number of delay-area points. In contrast, our method [4] works with the arrival times (as opposed to the required times), keeps all (and only) non-inferior delay-area points, and does not need an a priori range of interest for arrival times.
Technology decomposition (the procedure for converting an optimized Boolean network into a NAND-decomposed network) is the precursor to the technology mapping step. It is an open problem to determine which of the possible NAND-decomposed networks yields an optimum solution when an optimum covering algorithm is applied [3] . Different decomposition schemes for minimizing area [18] , minimizing delay [14, 26] , or reducing routing complexity [16] have been introduced by various authors. In this paper, we assume that the DAG has already been decomposed into two-input NAND and inverter gates.
Other technology mapping programs based on rules [6, 2] , heuristic gate merging and sizing [11] , algebraic identity [14] , and Boolean matching [12] have also been proposed in the literature.
Overview and Organization of the Paper
In this paper, we present an efficient algorithm for generating a technology mapping solution with minimum gate area subject to given delay constraints. Our approach consists of two steps. In the first step, we compute delay functions (which capture the tradeoffs between arrival time and gate area) at all nodes of a NAND-decomposed network. In the second step, we generate the mapping solution based on the computed delay functions and the required times at the primary outputs.
The paper is organized as follows. In Section 2, we introduce some terminology and describe the timing model. Section 3 presents details of our algorithm. Sections 4 and 5 are devoted to the extension from trees to general DAGs and the complexity analysis. Section 6 describes an extension to account for the wire delays during the delay function computation step. We present our results and concluding remarks in Sections 7 and 8.
Terminology and Timing Model
Consider a match g at node n of a NAND-decomposed tree. The inputs to node n consist of the nodes $n_i$ that fan out to node n (that is, $n = n_1' + n_2'$ if n has two inputs, or $n = n_1'$ if n has a single input).
The nodes which are covered by match g are denoted by merged(n, g). The nodes which are not in merged(n, g) but fan in to merged(n, g) are denoted by inputs(n, g). The mapped-parent($n_i$) is the set of nodes $n_j$ for which there exists a matching gate g such that $n_i \in$ inputs($n_j$, g). Note that given node n and gate g matching at n, inputs(n, g) are uniquely determined. However, n may have many distinct mapped parents (Figure 1).
Figure 1 goes here.
With each node in the network we store a delay curve (also referred to as a delay function). A point on the curve represents the arrival time at the output of the node and the total gate area which is required to map its transitive fanin cone up to (and including) the node. In addition to the area and delay value, the matching gate and input bindings for the match are also stored with each point on the curve. Points on the curve represent various mapping solutions with different tradeoffs between area and speed. We are interested in a mapping with minimum area satisfying delay requirements.
Consequently, we can drop point $P_1$ on the curve if there exists another point $P_2$ on the curve with lower area and equal or lower delay. This is possible because the solution associated with $P_2$ is superior to the solution associated with $P_1$ in terms of area, delay, or both. By dropping points, the delay curve can always be made monotonically non-increasing without loss of optimality. We refer to $P_1$ as an inferior point. Point $P^* = (t^*, a^*)$ is a non-inferior point if and only if there does not exist a point $P = (t, a)$ such that either $t \le t^*$, $a < a^*$ or $t < t^*$, $a \le a^*$.
Lemma 2.1 The delay function for a node contains the set of all non-inferior points and is monotonically non-increasing.
In addition, if the difference in delay between two points is small (according to some user-specified parameter $\epsilon$), we drop the point with higher area without any noticeable impact on the quality of the result. Similarly, points which are close in terms of their areas are merged together.
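The pruning rule above can be sketched in a few lines. This is a minimal illustration, not the ADmap implementation: the function name `prune_inferior` and the representation of a curve as a list of `(t, a)` tuples are assumptions made for the example, and the tolerance-based merging of nearby points is left out.

```python
from typing import List, Tuple

Point = Tuple[float, float]  # (arrival time t, accumulated area a); hypothetical representation


def prune_inferior(points: List[Point]) -> List[Point]:
    """Return the non-inferior points, sorted by increasing arrival time."""
    kept: List[Point] = []
    best_area = float("inf")
    # Sorting by (time, area) puts the cheapest point at each arrival time first.
    for t, a in sorted(set(points)):
        # A point survives only if it is strictly cheaper than every point
        # that is at least as fast.
        if a < best_area:
            kept.append((t, a))
            best_area = a
    return kept


curve = prune_inferior([(3.0, 10.0), (4.0, 12.0), (5.0, 8.0), (3.0, 11.0), (6.0, 8.0)])
print(curve)  # [(3.0, 10.0), (5.0, 8.0)]
```

The surviving list is monotonically non-increasing in area as the arrival time grows, which is the property stated in Lemma 2.1.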
For the delay computation, we have adopted the pin-dependent MIS library delay model as described next. Suppose that gate g has matched at node n. Then the output arrival time at n is given by:

$$\mathrm{arrival}(n, g, C_n) = \max_{n_i \in \mathrm{inputs}(n, g)} \left( \tau_{i,g} + R_{i,g} C_n + \mathrm{arrival}(n_i, g_i, C_{n_i}) \right)$$

where $\tau_{i,g}$ is the intrinsic gate delay from input $n_i$ to the output of g, $R_{i,g}$ is the drive resistance of g corresponding to a signal transition at input $n_i$, $C_n$ is the load capacitance seen at n, $\mathrm{arrival}(n_i, g_i, C_{n_i})$ is the arrival time at input $n_i$ corresponding to the load $C_{n_i}$ seen at that input, and $g_i$ is the best match found at input $n_i$.
The above equation can be easily modified to calculate the rise or fall delays based on the phase of the gate (INVERTING, NON-INVERTING or UNKNOWN), the corresponding rise and fall delay parameters for the inputs, and the output load. To simplify the exposition however, we will use the above generic delay equation.
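As a small numerical illustration of the generic delay equation, the sketch below evaluates it for one gate. The function name `arrival_time`, its argument layout, and the numbers are hypothetical; the units are chosen as ns, kOhm, and pF so that a resistance times a capacitance is already in ns.

```python
from typing import Sequence


def arrival_time(intrinsic: Sequence[float],
                 drive_res: Sequence[float],
                 input_arrivals: Sequence[float],
                 load: float) -> float:
    """Pin-dependent arrival time: max over inputs of tau_i + R_i * C_n + arrival_i."""
    return max(tau + r * load + arr
               for tau, r, arr in zip(intrinsic, drive_res, input_arrivals))


# Hypothetical two-input gate driving a 0.1 pF load.
t_out = arrival_time(intrinsic=[0.3, 0.4],       # ns, per input pin
                     drive_res=[1.0, 1.0],       # kOhm, per input pin
                     input_arrivals=[2.0, 1.5],  # ns, arrival at each input
                     load=0.1)                   # pF
print(round(t_out, 3))  # 2.4
```

The first input determines the result here because it arrives later; the second pin has the larger intrinsic delay but still settles earlier.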
This delay equation is based on a static timing analysis model which ignores the false path problem. Doing true timing analysis [13, 5] during technology mapping would significantly increase the computational and space complexities of our proposed algorithm as follows: 1) Timing analysis would have to be performed in an (either explicit or implicit) path-based manner rather than the block-oriented manner in which it is currently performed; 2) The new delay equation would not satisfy the principle of dynamic programming, and hence the number of area-delay tradeoff points that would have to be stored in the delay functions of nodes would become exponential (as the points would have to be annotated with the sub-path that generates them and the sensitization conditions along that sub-path). In this paper, we will not consider this issue further.
Tree Mapping
In this section, we focus on tree mapping. Later, we shall describe extensions to DAG mapping.
In particular, we describe two tree-traversal operations which are applied to a NAND-decomposed tree in order to obtain a technology mapping solution which minimizes the total gate area while satisfying the timing constraints.
First, a postorder traversal is used to determine a set of possible arrival times at the root of the tree. Once the user specifies a single required time, a second, preorder traversal is performed to determine a specific technology mapping solution. This scheme is similar to that proposed in [23] in order to solve the optimal orientation problem for a slicing tree of macro-cells.
We begin by stating that the possible accumulated gate areas at each node can be described as a function of the arrival times at the node. The accumulated gate area is the total area used by the gates which have matched nodes in the transitive fanin cone of the node. The arrival time is the earliest time at which the signal at the output of the node settles within 50% of its final value (due to a signal transition at some primary input). The delay function is therefore represented by a set of ordered pairs of real positive numbers (t, a), where a piecewise linear function a = f(t) can be constructed which describes the graph of all possible accumulated gate areas. This function describes all possible arrival time-area tradeoffs at a given node. The delay function at an input node of the NAND-decomposed tree consists of ordered pairs (t, a) where t and a have been specified by the user (in the case of primary inputs of the network) or have been previously computed (in the case that inputs to this tree are outputs of other trees).
Postorder Traversal
On the first traversal, we begin at the leaf nodes of the NAND-decomposed tree. Since each leaf node possesses a set of possible arrival time-area points which are reflected in its delay function, the delay function at any mapped-parent(n) must also reflect these possible arrival time-area tradeoffs.
A postorder traversal of the NAND-decomposed tree is performed, where for each node n and for each gate g matching at n, a new delay function is produced by appropriately merging the delay functions at inputs(n, g). Merging must occur in the common region among all delay functions in order to ensure that the resulting merged function reflects feasible matches at inputs(n, g). The delay functions for successive gates g matching at n are then merged by applying a lower-bound merge operation on the corresponding delay functions. At a given node n, the resulting delay function describes the arrival time-area tradeoffs in propagating a signal from the tree inputs to the output of n.

To illustrate the delay function computation procedure, consider the example in Figure 2. It shows the computation of the delay function for match g at node D. The inputs to the match are nodes A and B, whose delay functions are known at this time. To compute a point on the delay function for node D, we select a point from the delay function of one of the inputs, say point a on the delay curve of node A. The delay of point a is 3 units, so we look for the minimum-area point on the delay function of node B with delay less than or equal to 3. In this example, d is the desired point. Similarly, we generate all other points on the curve. Note that there is no point on delay-curve(D) corresponding to the point e on delay-curve(B), as there is no point on delay-curve(A) which has delay less than or equal to delay(e).
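The curve computation for a single match can be sketched as follows. This is an illustrative simplification under stated assumptions: each input pin contributes a fixed, load-independent pin delay, curves are lists of `(t, a)` tuples, and a linear scan replaces the binary search a real implementation would use on sorted curves. The names `cheapest_by` and `match_curve` are hypothetical.

```python
from typing import List, Optional, Tuple

Point = Tuple[float, float]  # (arrival time, accumulated area)


def cheapest_by(curve: List[Point], pin_delay: float, deadline: float) -> Optional[Point]:
    """Cheapest point whose arrival time plus pin delay meets the deadline."""
    feasible = [p for p in curve if p[0] + pin_delay <= deadline]
    return min(feasible, key=lambda p: p[1]) if feasible else None


def match_curve(input_curves: List[List[Point]],
                pin_delays: List[float],
                gate_area: float) -> List[Point]:
    """Delay curve of one match, built from the curves at its inputs."""
    # Every input point, pushed through its pin, is a candidate output time.
    candidates = sorted({t + d for curve, d in zip(input_curves, pin_delays)
                         for t, _ in curve})
    result: List[Point] = []
    for t_out in candidates:
        picks = [cheapest_by(c, d, t_out) for c, d in zip(input_curves, pin_delays)]
        if any(p is None for p in picks):
            continue  # outside the common region: some input cannot be this fast
        area = gate_area + sum(p[1] for p in picks)
        if not result or area < result[-1][1]:  # keep only non-inferior points
            result.append((t_out, area))
    return result


curve_a = [(3.0, 10.0), (5.0, 6.0)]
curve_b = [(2.0, 9.0), (4.0, 5.0), (7.0, 3.0)]
print(match_curve([curve_a, curve_b], pin_delays=[1.0, 1.0], gate_area=2.0))
# [(4.0, 21.0), (5.0, 17.0), (6.0, 13.0), (8.0, 11.0)]
```

In this made-up example no output point is generated at time 3.0, the candidate produced by the fastest point of the second input, because the first input has no point that fast. This mirrors the remark about point e above.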
Figure 3 goes here.
To illustrate the lower-bound merging procedure, consider the example in Figure 3. Here, we have already generated the delay curves for the matching gates $g_1$ and $g_2$ at node C. In order to obtain the composite delay curve at this node, we must merge the two delay curves into one. This operation is simple since we only need to keep the non-inferior points on either curve. The minimum of the two delay curves is computed, and information is attached to each point on the resulting delay curve indicating which gate alternative yields that point.
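A lower-bound merge of per-gate curves can be sketched as below. The dictionary layout, the gate labels, and the function name `lower_bound_merge` are assumptions for illustration only; each output point carries the label of the gate alternative that produced it.

```python
from typing import Dict, List, Tuple

Point = Tuple[float, float]          # (arrival time, area)
Tagged = Tuple[float, float, str]    # (arrival time, area, gate label)


def lower_bound_merge(curves: Dict[str, List[Point]]) -> List[Tagged]:
    """Pointwise minimum of several delay curves, tagged with the winning gate."""
    tagged = sorted((t, a, gate) for gate, pts in curves.items() for t, a in pts)
    merged: List[Tagged] = []
    best_area = float("inf")
    for t, a, gate in tagged:
        if a < best_area:  # non-inferior with respect to everything faster
            merged.append((t, a, gate))
            best_area = a
    return merged


composite = lower_bound_merge({
    "g1": [(3.0, 12.0), (6.0, 7.0)],
    "g2": [(4.0, 9.0), (8.0, 8.0)],
})
print(composite)  # [(3.0, 12.0, 'g1'), (4.0, 9.0, 'g2'), (6.0, 7.0, 'g1')]
```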
The delay function computation and merging are performed recursively until the root of the tree is reached. The resulting function is saved in the tree at its corresponding node. Thus, each node of the tree will have an associated delay function. The set of (t, a) pairs corresponding to the composite delay function at the root node defines a set of arrival time-area tradeoffs for the user to choose from.
Preorder Traversal
The user is allowed to select the arrival time-area tradeoff which is most suitable for the application.
Given the required time t at the root of the tree, a suitable (t, a) point on the delay function for the root node is chosen. The gate g matching at the root that corresponds to this point and inputs(root, g) are thus identified. The required times $t_i$ at inputs(root, g) are computed from t, g, and the observation that inputs(root, g) must now drive gate g. The preorder traversal resumes at inputs(root, g), where $t_i$ is the constraining factor and a matching gate $g_i$ with minimum $a_i$ satisfying $t_i$ is sought.
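The preorder gate selection can be sketched as a short recursion. Everything named here (`CurvePoint`, `assign_best_gate`, the gate labels, and the fixed pin delays stored with each point) is a hypothetical simplification; in particular, the load-dependent timing recalculation is ignored, so the required time at an input is simply the required time at the output minus the pin delay.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class CurvePoint:
    time: float
    area: float
    gate: str
    # (input node name, pin-to-output delay of `gate` for that input)
    inputs: List[Tuple[str, float]] = field(default_factory=list)


def assign_best_gate(curves: Dict[str, List[CurvePoint]], node: str,
                     required: float, mapping: Dict[str, str]) -> None:
    """Pick the minimum-area point meeting `required`, then recurse on its inputs."""
    if node not in curves:
        return  # tree input: nothing to map
    feasible = [p for p in curves[node] if p.time <= required]
    if not feasible:
        raise ValueError(f"required time {required} cannot be met at {node}")
    best = min(feasible, key=lambda p: p.area)
    mapping[node] = best.gate
    for child, pin_delay in best.inputs:
        assign_best_gate(curves, child, required - pin_delay, mapping)


curves = {
    "root": [CurvePoint(3.0, 20.0, "nand2_fast", [("x", 1.0)]),
             CurvePoint(4.0, 15.0, "nand2_small", [("x", 2.0)]),
             CurvePoint(6.0, 12.0, "nand2_small", [("x", 2.0)])],
    "x": [CurvePoint(2.0, 8.0, "inv_fast"), CurvePoint(4.0, 5.0, "inv_small")],
}
mapping: Dict[str, str] = {}
assign_best_gate(curves, "root", required=4.5, mapping=mapping)
print(mapping)  # {'root': 'nand2_small', 'x': 'inv_fast'}
```

With a looser required time of 6.0 the same call would select the slower, smaller inverter at `x`, which illustrates how slack at the root is traded for area further down the tree.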
Timing Recalculation
The gate delay is a function of the load it is driving. During the postorder tree traversal, the output of the current node $n_i$ is not yet mapped; hence, the load capacitance is unknown (unless all the gates in the library have identical pin capacitances). This load cannot be taken to be zero, as that would introduce excessive error during the postorder traversal. Since the average number of pins per gate in a mapped circuit is between 2 and 3, we heuristically choose the load value to be that offered by the smallest two-input NAND gate in the library. When we come to a node $n \in$ mapped-parent($n_i$) with matching gate g, we know the exact load seen by $n_i$. This load is equal to the input capacitance of g and is, in general, different from the default load. Therefore, in order to calculate the arrival time at node n, the output arrival times for all nodes in inputs(n, g) must be adjusted to account for the change in the load capacitance [15]. Similarly, during the preorder tree traversal, when a gate g is selected to match at n, the load seen by inputs(n, g) must be recalculated.
In order to account for this load change ($\Delta_i$), the delay curves at the inputs have to be appropriately shifted. In particular, since the drive resistance of the gate matching at $n_i$ and giving rise to a point $p_j$ on the delay curve of $n_i$ is stored with that point, the delay shift is computed as $\Delta_i \cdot p_j$.gate.drive.
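The per-point shift can be sketched as follows. The tuple layout `(time, area, drive)` and the function name `shift_for_load` are assumptions for this example; the drive resistance is taken to be stored with each point, as the text describes.

```python
from typing import List, Tuple

DrivePoint = Tuple[float, float, float]  # (arrival time, area, drive resistance of matching gate)


def shift_for_load(curve: List[DrivePoint], default_load: float,
                   actual_load: float) -> List[DrivePoint]:
    """Shift each point by (load change) * (drive resistance of its gate)."""
    delta = actual_load - default_load
    return [(t + delta * drive, a, drive) for t, a, drive in curve]


# Hypothetical curve computed with a default load of 0.5; the actual load is 1.5.
curve = [(3.0, 10.0, 2.0), (4.0, 8.0, 0.5)]
print(shift_for_load(curve, default_load=0.5, actual_load=1.5))
# [(5.0, 10.0, 2.0), (4.5, 8.0, 0.5)]
```

In this made-up case the weak-drive point moves from 3.0 to 5.0 while the strong-drive point moves only from 4.0 to 4.5, so the first point is now both slower and larger than the second. Points shift by different amounts, and a curve that was non-inferior before the shift need not remain so afterwards.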
Accounting for the Unknown Load Values
The shift in delay for a point is a function of the change in the load and of the matching gate's drive resistance at that point. Different points on a curve may shift by different amounts depending on the matching gate, and this differential shift may make the curve non-monotonic. In the worst case, a previously inferior point on the curve might have become non-inferior had it not been dropped earlier, which may cause an optimal mapping to be rejected. One possible solution is not to drop inferior points from the delay curves until we reach the output node. This requires storing a large number of points for each curve without much gain.

Proof. The maximum delay shift among the points on the curve is given by $(R_{max} - R_{min})(C_{max} - C_{min})$. If this quantity is less than $\epsilon$, then the error will be within the specified tolerance.

The other possible solution is to use a load bin method similar to that of MIS2.2. For each load bin, we store a delay curve. If the load bins are separated by less than $\epsilon / (R_{max} - R_{min})$, then the timing error will be less than $\epsilon$. In practice, most libraries have a small number of gate series (e.g., performance-optimized versus area-optimized, low-power versus high-power series). Within each series, the gates tend to have almost the same pin capacitances. Therefore, the use of one load bin per gate series should be sufficient.
Note that during delay estimation we ignore the wire load (or alternatively, approximate it based on the expected average interconnect length and the capacitance per unit length of interconnect).
In fact, wire load can vary by a large amount (compared to the variance in pin capacitance) depending on the placement and routing. Therefore, it does not pay much to improve the accuracy of computing gate loads while ignoring (or only roughly capturing) the wire loads.
DAG Mapping
Most real circuits are not trees, but general DAGs. The problem of mapping a DAG, even for the constant load model, is NP-hard [3]. Therefore, we resort to heuristics. One heuristic is to decompose the DAG into a number of trees such that the inputs for each tree come from other tree outputs or the primary inputs. During the delay curve computation step, entire trees are processed in postorder and delay curves are computed for each primary output of the DAG. During the gate assignment step, entire trees are mapped in preorder. This heuristic, which does not allow mapping across tree boundaries, is similar to that used by DAGON.
Alternatively, we could avoid decomposing the DAG into trees as follows. During the delay curve computation step, nodes are visited in postorder. For each node, we compute the delay curve as in the case of trees. However, if the input for a candidate match at the node comes from a multiple-fanout node, we divide the area contribution of that input by the fanout count of the input node. By reducing the area contribution, we tend to favor a solution in which multiple-fanout nodes are preserved after mapping, which reduces logic duplication and improves the final mapped area.
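The fanout-based area discount amounts to one line of arithmetic. The helper below is a hypothetical illustration (the name `match_area` and the `(area, fanout_count)` pairs are not from the paper) of how the area of a candidate match would be accumulated under this heuristic.

```python
from typing import List, Tuple


def match_area(gate_area: float, inputs: List[Tuple[float, int]]) -> float:
    """Gate area plus each input's accumulated area divided by its fanout count."""
    return gate_area + sum(area / max(1, fanout) for area, fanout in inputs)


# One single-fanout input (area 10) and one input shared by three fanouts (area 12).
print(match_area(2.0, [(10.0, 1), (12.0, 3)]))  # 16.0
```

Charging only a third of the shared cone to this match makes solutions that reuse the multiple-fanout node look cheaper than solutions that duplicate its logic.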
This heuristic which permits tree boundary crossing was also implemented in the MIS mappers [7, 24] . This approach leads to smaller circuits, and is the one which we adopted for our mapper.
During the gate selection step, if we come to a node which has already been mapped, we check if the mapped solution at the node satisfies the timing requirement. If so, we keep the mapping; otherwise, we replace it with another solution from the delay curve which satisfies the current timing requirement and has minimum area. The new solution may have higher area compared to the previous solution. Note that satisfying the current timing can only decrease the delay for the previous cones, although it may increase the total gate area.
The solution for circuits with multiple outputs also depends on the order in which the output cones are processed. During the delay curve generation step, when we are computing the signal arrival time for a match g at node n, we need to recalculate the load seen at inputs(n, g). For $n_i \in$ inputs(n, g), some of the fanouts of node $n_i$ (other than g) may have already been mapped (because they are part of a logic cone which has been processed), and hence, the contribution of these fanouts to the load can be calculated exactly. This incremental load recalculation results in a more accurate arrival time calculation at the output of n. Similar incremental load recalculation is applied during the gate assignment step.
function technology_map(G, T)
    G is a NAND-decomposed Boolean network
    T is a vector of required times at primary outputs
begin
    for each node n in G (in postorder) do
        compute_delay_curve(n)
    end
    for each primary output po in G do
        assign_best_gate(po, t_po)
    end
end
Complexity Analysis
Consider a gate g (with k inputs) matching at node n where input i has $N_i$ points on its delay curve. The delay curve corresponding to match g at node n has $N = \sum_{i=1}^{k} N_i$ points in the worst case. The time required to generate each point, assuming that the delay curve for each input is sorted, is $O(k \log N_{max})$ (time for binary search), where $N_{max}$ is the maximum $N_i$. Thus, the total time for generating the delay curve per candidate match is $O(k N \log N_{max})$.
For a finite size library, the maximum number of gates that can match at a node n is bounded, which means that the number of points on delay-curve(n) will remain linear in the total number of points on the delay curves of inputs(n, g) for the various matching gates. Therefore, the number of points increases linearly from one level to another. Despite this, the number of points could still grow exponentially in terms of the number of levels in the tree. However, if the tree is node-balanced (its height is logarithmic in the number of its leaf nodes), then the number of points will remain polynomial. In practice, the increase in the number of points is even lower because a large number of the generated points are inferior points, which are dropped and not propagated to higher levels.
It is observed that the range of areas generated for various solutions varies only by a factor of two, which means that if we use only 50 points at each node, the solutions produced will be at most 2% poorer in the area compared to the case where unbounded number of points are allowed.
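One way to read this bound is as a back-of-the-envelope calculation. It assumes that the stored points are spaced evenly in area between the smallest area $A_{min}$ and the largest area $A_{max}$ on the curve; this spacing is an assumption made here for illustration and is not stated in the text.

```latex
A_{\max} \le 2 A_{\min}
\quad\Longrightarrow\quad
\Delta a = \frac{A_{\max} - A_{\min}}{50} \le \frac{A_{\min}}{50} = 0.02\, A_{\min}
```

Under that assumption, substituting the nearest stored point for any discarded solution changes the area by at most 2% of the minimum area.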
With a fixed upper bound P on the number of points, the time to generate a delay curve becomes $O(k^2 P \log P)$, which is a constant since the number of inputs of any gate in the library is bounded.
Placement-Driven Mapping
Motivation
Interconnections are becoming a major concern in today's high-performance, high-density ASIC designs because the distributed RC time delay of these lines increases rapidly as chip sizes grow and minimum feature sizes shrink [1] . With recent studies [20, 9] indicating that interconnections occupy more than half the total chip area and account for a significant part of the chip delay, it is appropriate that wiring is integrated into the cost function for logic synthesis.
To elaborate on the importance of the wire load, consider a two-input NAND gate driving an inverter gate through 0.2 cm of aluminum interconnect (2 µm wide, 0.5 µm thick, with a 1.0 µm thick field oxide beneath it). 0.2 cm is the expected length of a local interconnect line on a 2 cm × 2 cm chip [1]. We calculate the rise time (to 50% of its final value) at the input of the inverter gate using two methods. One method ignores the capacitance of the interconnect line and uses

$$delay = \tau_g + R_g C_g = 0.4\ \mathrm{ns}$$

where $\tau_g$ is the intrinsic gate delay, $R_g$ is the on-resistance of the driver gate, and $C_g$ is the input capacitance of the fanout gate. The second method [19] uses

$$delay = \tau_g + R_g (C_g + C_{unit} \cdot length) = 1.0\ \mathrm{ns}$$

where $C_{unit}$ is the interconnect capacitance per unit length and length is the interconnect length (see footnote 1). Gate and interconnect parameters are taken from data sheets for an industrial 1.0 micron ASIC library: $\tau_g = 0.3$ ns, $R_g = 1$ kΩ, $C_g = 0.1$ pF, $C_{unit} = 3$ pF/cm. The delay calculations clearly show that the interconnect capacitance dominates the gate input capacitance.
In summary, with the existing technology, the capacitive term is dominated by the capacitance between the interconnection and the substrate. For local aluminum lines, the resistive term is dominated by the on-resistance of the MOS transistor. As the chip dimension increases and the minimum feature size decreases, the interconnection capacitance bottoms out at about 1-2 pF/cm while the input gate capacitance decreases. Therefore, the RC delay of interconnect lines will become even more dominant in the future.
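The two delay figures can be reproduced directly from the quoted library parameters. The variable names below are chosen for this sketch; with resistance in kΩ and capacitance in pF, the product is already in ns.

```python
# Parameters quoted in the text (1.0 micron ASIC library).
tau_g = 0.3      # intrinsic gate delay, ns
r_g = 1.0        # driver on-resistance, kOhm
c_g = 0.1        # input capacitance of the fanout gate, pF
c_unit = 3.0     # interconnect capacitance per unit length, pF/cm
length = 0.2     # interconnect length, cm

delay_without_wire = tau_g + r_g * c_g
delay_with_wire = tau_g + r_g * (c_g + c_unit * length)

print(round(delay_without_wire, 3))  # 0.4
print(round(delay_with_wire, 3))     # 1.0
```

The wire contributes 0.6 pF against 0.1 pF of gate input capacitance, which is where the extra 0.6 ns comes from.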
In [15] an attempt is made to increase the interaction between logic synthesis and technology mapping. The idea is to generate a "companion" placement solution for the circuit before it is mapped. This placement is then used to evaluate the cost of a matching gate during the mapping process. The placement is dynamically updated in order to maintain the correspondence between the logic and layout representations. In the end, a mapped network and a corresponding placement solution are generated. The placement solution is then globally relaxed in order to produce a feasible placement according to the target layout style (e.g., standard-cell or sea-of-gates). Using these techniques, circuits with smaller area and higher performance have been synthesized [15] .
Accounting for the Wire Delays
For submicron technologies, the effect of interconnect on circuit delay is of more importance than its effect on the circuit area. Therefore, we only consider the former effect here. The latter effect can be easily captured in a similar fashion. It is straight-forward to incorporate the wire cost into the area-delay mapper as described next.
The delay function at each node now consists of a set of non-inferior points P = (ť, a), where ť is a number representing the gate and wire delay, and a is the area. The load at the output of a node consists of two components: the gate capacitance of fanout nodes and the wire load. The latter is calculated as the product of the wire length and the capacitance per unit length of interconnect.
Node positions are needed to compute the wire length. These positions are, however, known after the initial placement of the unmapped network. Consequently, the wire length can be calculated using a number of different models. These models include the star connection model, the enclosing rectangle approximation model, and the single trunk Steiner tree model. We use the last model as it is more accurate, yet can be computed efficiently. A single trunk Steiner tree consists of a single horizontal (vertical) trunk to which all nodes are joined by short vertical (horizontal) line segments.

Footnote 1: Interconnect resistance may be ignored without introducing much error. In addition, the transmission line properties of interconnect lines are ignored for on-chip connections.
In order to compute the wire length, first the direction of the trunk is determined by considering the x and y direction spans of all the locations for nodes and picking the direction with larger total span. Then the location of the trunk is found by finding the median of the locations for nodes in the appropriate direction.
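A single trunk Steiner tree length estimate can be sketched as below. This is one plausible reading of the description, not the ADmap code: the trunk runs along the direction of larger span, it is placed at the median of the perpendicular coordinates, and each node connects to it with a straight stub. The function name `single_trunk_length` and the pin coordinates are hypothetical.

```python
from statistics import median
from typing import List, Tuple


def single_trunk_length(pins: List[Tuple[float, float]]) -> float:
    """Trunk length plus the stubs joining every pin to the trunk."""
    xs = [x for x, _ in pins]
    ys = [y for _, y in pins]
    x_span = max(xs) - min(xs)
    y_span = max(ys) - min(ys)
    if x_span >= y_span:
        trunk_y = median(ys)  # horizontal trunk, vertical stubs
        return x_span + sum(abs(y - trunk_y) for y in ys)
    trunk_x = median(xs)      # vertical trunk, horizontal stubs
    return y_span + sum(abs(x - trunk_x) for x in xs)


length = single_trunk_length([(0.0, 0.0), (4.0, 1.0), (2.0, 3.0)])
print(length)  # 7.0
```

The wire load would then follow as this length multiplied by the capacitance per unit length, as described earlier in this section.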
It is desirable to incrementally update the position of matched gates while the delay functions are being calculated. This operation results in positions which more accurately reflect the position of the gates after the mapping procedure. For this purpose, once a gate g is mapped at a node n, the position of g is updated by placing g at the center of the positions for its (mapped) fanin gates and fanout nodes. Each point on the area-delay curve of a node is thereby annotated with the position of the gate matching at the node.
The network is then mapped one logic cone at a time during the preorder traversal as described in section 3.2.
Experimental Results
These procedures have been implemented in a C program called ADmap. We ran the recommended set of the MCNC benchmarks [27] (except for the C6288 benchmark, where the detailed routing step aborts) using the ADmap and compared the results to the MIS2.2 technology mapper [24]. The same blif files, optimized by technology-independent procedures, were used as input in both cases. The circuits were first optimized using script.rugged [21]. They were then decomposed into NAND gates and mapped using MIS2.2 [24] and the ADmap. Finally, the circuits were placed using GORDIANL [22] plus DOMINO [8] and routed using YACR [17]. All results are reported after layout is completed. We used the lib2 library of the MIS2.2 package, assumed a value of 3 pF/cm for the interconnect capacitance per unit length, and allowed a maximum of 16 points on each delay curve (see section 5).

Table 1 presents the total gate area and the longest path delay after technology mapping, and the total chip area (including gate and wire areas) and the circuit delay (including gate and wire delays), in the area mode of the MIS2.2 mapper (map -s). Table 2 shows the results of the ADmap in the area mode. All entries in this table are normalized with respect to the corresponding entries in Table 1. On average, the ADmap produces post-mapping results with 5% less area and 4% less delay. The ADmap produces post-layout results with the same area, but with 3% less delay.

Table 3 shows the post-mapping and post-layout results for the MIS2.2 mapper in the timing mode (map -n 1 -s), while Table 4 contains the normalized results for the ADmap. Entries in this table are normalized with respect to the corresponding entries in Table 3. On average, the ADmap produces circuits with 17% less area and 18% less delay (after mapping) or alternatively with 10% less area and 17% less delay (after layout). Table 5 shows the ADmap results for a different input parameter C (increasing C tends to increase area and reduce delay).
In this case, the ADmap produces circuits with 4% less area and 28% less delay (after mapping) or alternatively with 8% more area, but 26% less delay (after layout).
The normalized placement-driven ADmap (PLmap) results are shown in Table 6 for the same input parameter as that used for generating the data in Table 5. Entries in this table are again normalized with respect to the corresponding entries in Table 3. The PLmap produces circuits with 12% less area and 24% less delay (after mapping) or alternatively with 4% less area and 22% less delay (after layout).

Table 7 Tables 8 and 9 which were generated as follows. After technology mapping by either the ADmap or the MIS2.2 mapper, we performed fanout optimization using the "-AFG" option of the MIS2.2 mapper, which does fanout optimization followed by area recovery and a second fanout optimization pass (see [24] for details). The resulting circuits are then placed and routed. Table 8 presents the total gate area and the longest path delay after technology mapping and fanout optimization, and the total chip area (including gate and wire areas) and the circuit delay (including gate and wire delays), in the timing mode of the MIS2.2 mapper (map -n 1 -AFG -s). Table 9 shows the results of the ADmap (C = 1.0) in the timing mode. As before, all entries in this
Conclusions
We have presented a powerful technique for technology mapping which generates solutions with different area/delay tradeoffs. Our technique unifies techniques for technology mapping with different objectives (minimum area, maximum performance, and minimum area under delay constraints) and is based on principles of dynamic programming and computation of delay curves. For a node-balanced NAND-decomposed tree, our algorithm finds the optimum area solution under delay constraints (subject to error due to unknown loads during delay computation step) in polynomial time and space. For the general problem of mapping DAGs, the algorithm retains its efficiency and produces results which are superior to those produced by other mappers.
The ADmap program has been extended to generate circuits with minimum average power consumption under delay constraints by constructing optimal (under a non-glitch delay model) power-delay curves during the postorder phase [25]. Extension to combine pin permutation with technology mapping is straightforward, but increases the complexity of the procedure.

Figure 5: The delay curve for the dalu benchmark circuit (axis label: Area (mm^2)).
