Abstract-In most libraries, gate parameters such as the pin-to-pin intrinsic delays, load-dependent coefficients, and input pin capacitances have different values for rising and falling signals. Most performance optimization algorithms, however, assume a single value for each parameter.
Section III defines the gate assignment problem. Section IV presents the complexity analysis, while Section V describes the new dynamic programming algorithm, along with the implementation details and the data structures used. Section VI presents the experimental results. Finally, we conclude with directions for future work in Section VII.
II. GATE DELAY MODELS
The model used to calculate the delay through a gate (or cell) is of central importance in timing analysis and optimization.
Given a single-output gate (or cell) $g$, let $\delta(j, g)$ denote the delay from an input pin $j$ of the gate $g$ to the output of $g$. We will use $g$ to denote the output of $g$ as well. Two delay models are popular: load-independent and load-dependent. The load $c_g$ refers to the cumulative capacitance seen at the output of $g$. It is the sum of the input pin capacitances $\gamma(p)$ of all the fanout pins $p$ of $g$ and the wire capacitances of the fanout nets.
In the load-independent delay model, the delay from an input pin $j$ of a gate $g$ to the output of $g$ is the intrinsic delay: $\delta(j, g) = \alpha(j, g)$.
In the load-dependent delay model, $\delta(j, g) = \alpha(j, g) + \beta(j, g)\, c_g$.
Here, $\alpha(j, g)$ represents the intrinsic delay from $j$ to $g$, $\beta(j, g)$ the drive capability or load coefficient of the path from $j$ to $g$, and $c_g$ the load capacitance at the output of the gate $g$. The gate library specifies $\alpha$ and $\beta$ parameters for all input-pin to output-pin paths within each gate and $\gamma$ values for all the input pins. In general, $\alpha(j, g)$ and $\beta(j, g)$ are different for different input pins $j$. If $g$ has a single input pin (e.g., buffers, inverters) or if the $\alpha$ and $\beta$ values are identical for all input pins, we will drop the argument $j$.
The above description assumes a single value for each parameter $\alpha$, $\beta$, and $\gamma$. However, it is well known that delays for the rising and the falling transitions can be quite different. In fact, every gate in the industrial gate libraries we have access to has different rise and fall delay parameter values. Quite often, these values differ substantially. For instance, in one Fujitsu technology, the rise and fall $\alpha$ values for a path in a simple gate differed by 45% and the $\beta$ values, by 100%! To handle this scenario, we use the subscripts $r$ and $f$ to denote rise and fall. For instance, $\alpha_r(j, g)$ denotes the intrinsic delay from input pin $j$ to the output $g$ when $g$ switches from 0 to 1. Similarly, $\alpha_f(j, g)$ is the intrinsic delay when $g$ makes a falling transition. We write these values as pairs: $(\alpha_r, \alpha_f)$ and $(\beta_r, \beta_f)$.
The gate delay parameters $(\alpha_r, \alpha_f)$ and $(\beta_r, \beta_f)$ are used to compute the arrival times at various gates and the delay through the circuit as follows. At each gate, both rise and fall arrival times are stored. The rise (fall) arrival time at a gate $g$ denotes the maximum possible time it takes for a transition to travel from a primary input to the output of $g$ such that $g$ makes a rising (falling) transition as a result. A topological traversal of the circuit from primary inputs to outputs is used to compute the rise and fall arrival times at each gate $g$ using the rise and fall arrival times already computed at the fanin gates and the gate delays through $g$. An inversion through the gate should be considered appropriately while computing the arrival times. For instance, since a falling transition at the input of an inverter generates a rising transition at its output, the fall arrival time at the inverter's input should be used to compute the rise arrival time at its output. The rise (fall) delay of the circuit is the maximum rise (fall) arrival time over all primary outputs.
0278-0070/03$17.00 © 2003 IEEE
The arrival time at a primary output is the maximum of the rise and fall arrival times at that output. The delay of the circuit is the maximum arrival time at primary outputs. Equivalently, the delay of the circuit is the maximum of its rise and fall delays.
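The delay trace described above can be sketched in a few lines of Python. This is a minimal illustration under the load-independent model, not the implementation used in this work: the dictionary format for `gates`, the function names, and the two-inverter example are all assumptions made for the sketch, and only unate (inverting or noninverting) gates are handled.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+


def arrival_times(gates, primary_inputs):
    """Return {node: (rise_arrival, fall_arrival)}.

    gates: {name: (inverting, [(fanin, alpha_r, alpha_f), ...])}  (hypothetical format)
    primary_inputs: {name: (rise_arrival, fall_arrival)}
    """
    deps = {g: [fanin for fanin, _, _ in pins] for g, (_, pins) in gates.items()}
    at = dict(primary_inputs)
    for node in TopologicalSorter(deps).static_order():
        if node in at:  # primary input: arrival times are given
            continue
        inverting, pins = gates[node]
        # Through an inverting gate, an input fall causes the output rise,
        # so the output rise time is computed from the input fall time.
        rise_src, fall_src = (1, 0) if inverting else (0, 1)
        rise = max(at[j][rise_src] + a_r for j, a_r, _ in pins)
        fall = max(at[j][fall_src] + a_f for j, _, a_f in pins)
        at[node] = (rise, fall)
    return at


def circuit_delay(at, primary_outputs):
    """Maximum of the rise and fall arrival times over all primary outputs."""
    return max(max(at[o]) for o in primary_outputs)


# Illustrative chain of two inverters with made-up (alpha_r, alpha_f) values.
gates = {
    "inv1": (True, [("a", 3, 1)]),
    "inv2": (True, [("inv1", 2, 4)]),
}
at = arrival_times(gates, {"a": (0, 0)})
```

In this example the first inverter has arrival times (3, 1) and the second (3, 7), so the circuit delay is 7, which is determined by the fall transition at the output.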
III. GATE ASSIGNMENT UNDER RISE AND FALL DELAYS
Gate assignment for minimizing the circuit delay is a fundamental problem in performance optimization of gate-level circuits. Ideally, each gate should be optimally sized during technology mapping. However, exact technology mapping is expensive in practice due to the large size of the technology library and the complex interaction between the gate being mapped and the unmapped portion of the logic. In addition, wire loads often cannot be estimated with sufficient accuracy during technology mapping to make the best choices for gate sizes. As a result, heuristics are used, which, among other things, may not select the best sizes for gates from a delay perspective [3]. This leaves scope for improving the circuit delay by resizing gates after technology mapping. Being an in-place optimization technique, gate assignment is also layout-friendly (i.e., it does not disturb the placement and routing of cells) and can be used during or after layout, when more accurate wire load and wire delay information is available. Thus, gate assignment has become an important optimization problem in its own right.
The gate assignment problem can be stated as follows. We are given a circuit composed of single-output cells from a cell library. For each cell $C_i$, many different sizes $1, \ldots, k, \ldots$ are available in the library, each size having possibly different area, input pin capacitances $\gamma$, intrinsic delays $\alpha$, and load coefficients $\beta$. Let $C_i^k$ denote assigning size $k$ to cell $C_i$. The gate assignment problem is to select the size of each cell in the circuit such that the circuit delay is minimized. We assume the load-independent pin-to-pin delay model, in which the delay through a path within a cell is just the intrinsic delay $\alpha$. Although simplistic, this delay model is gaining popularity with the advent of gain-based synthesis [14] in the presence of large, almost continuous-sized libraries.
If only one value were to be used for each cell parameter (i.e., a single $\alpha$), the problem can be solved optimally by a dynamic programming algorithm [6]. The algorithm traverses the network gates in topological order from primary inputs toward primary outputs. When a cell $C_i$ is reached, the minimum possible arrival times at all its input pins are known. For each available size $k$ of the cell $C_i$, the algorithm computes the arrival time at the output of $C_i^k$ using the arrival times at its input pins $j$ and the pin-to-pin delays $\alpha(j, C_i^k)$ for the cell size $k$. The algorithm then selects the size $k^*$ that minimizes the arrival time at the output of $C_i$ and replaces $C_i$ with $C_i^{k^*}$. It then continues the traversal and size selection until the primary outputs are reached.
If both $\alpha_r$ and $\alpha_f$ are specified, the circuit delay is given by max{circuit rise delay, circuit fall delay}, which is what we wish to minimize. In one strategy typically used in industry and academia for optimizing with different rise and fall delays, each pin-to-pin delay within a gate is approximated by a single delay value: $\alpha = \max\{\alpha_r, \alpha_f\}$.
Under this approximation, the dynamic programming algorithm of [6], described above, is exact. Using these single pin-to-pin delay approximations, we apply the algorithm on the circuit to obtain the new gate sizes. Then, we do a delay trace on the modified circuit using the actual rise and fall delays $(\alpha_r, \alpha_f)$ to determine the true circuit delay. Let us call this strategy same rise-fall. We will show that this strategy is suboptimal. Consider the circuit shown in Fig. 1. Let us assume rise and fall arrival times of 0 for all the primary inputs $I_1$ through $I_4$. Suppose there are the following two sizes for the AND gate: 1) with $(\alpha_r, \alpha_f) = (7, 3)$, and 2) with $(\alpha_r, \alpha_f) = (4, 6)$.
Similarly, suppose there are two sizes for the OR gate: 1) with $(\alpha_r, \alpha_f) = (4, 4)$, and 2) with $(\alpha_r, \alpha_f) = (2, 5)$. With same rise-fall, the gate delays assigned to the two sizes of the AND gate are $\max\{7, 3\} = 7$ and $\max\{4, 6\} = 6$, respectively, and to the two sizes of the OR gate, 4 and 5, respectively. Using these delay numbers, we can see that the sizes $C_1^2$, $C_2^1$, and $C_3^1$ are selected. From Table I, row 5, it can be verified that the circuit delay under this assignment is 10. However, from Table I, it is easy to see that the optimum selection is obtained by choosing gates $C_1^1$, $C_2^1$, and $C_3^2$, which yields a circuit delay of 9.

Another natural strategy for using the dynamic programming paradigm is the following (we will call it greedy):
Maintain both rise and fall arrival times at each cell. For each cell, select the size that minimizes the maximum of rise and fall arrival times at that cell.
However, as shown in [10], greedy is also a suboptimal strategy. Consider once again the circuit of Fig. 1 and the gate sizes as shown above.
By following this greedy strategy, one would select gate $C_1^2$ (with delays $(4, 6)$, since $\max\{4, 6\} < \max\{7, 3\}$), gate $C_2^1$ (with delays $(4, 4)$), and gate $C_3^1$ (also with delays $(4, 4)$; this can be seen from the fifth and sixth rows of Table I). This leads to a circuit with a delay of 10, once again suboptimal.
As this example shows, we cannot decide locally at a gate the best size for it. We need to examine the fanouts as well. However, that may generate an exponential number of solutions by essentially enumerating all possible size selection choices in the circuit.
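The suboptimality of both strategies on this example can be checked with a short script. The sketch below assumes a topology for Fig. 1 that is consistent with the sizes quoted above: $C_1$ is an AND gate and $C_2$ an OR gate, both fed by primary inputs arriving at time 0, and $C_3$ is an OR gate driven by $C_1$ and $C_2$. All names are illustrative, and exhaustive enumeration is used only because the example has eight assignments.

```python
from itertools import product

# (alpha_r, alpha_f) per size, as quoted in the text.
AND_SIZES = {1: (7, 3), 2: (4, 6)}
OR_SIZES = {1: (4, 4), 2: (2, 5)}


def circuit_delay(k1, k2, k3):
    """True delay of the assumed circuit C3 = OR(C1, C2), all gates noninverting."""
    r1, f1 = AND_SIZES[k1]
    r2, f2 = OR_SIZES[k2]
    a_r, a_f = OR_SIZES[k3]
    return max(max(r1, r2) + a_r, max(f1, f2) + a_f)


def same_rise_fall():
    """Each size is scored by the single value max(alpha_r, alpha_f)."""
    def pick(sizes):
        return min(sizes, key=lambda k: max(sizes[k]))
    return pick(AND_SIZES), pick(OR_SIZES), pick(OR_SIZES)


def greedy():
    """At each gate, minimize the max of the rise and fall arrival times."""
    k1 = min(AND_SIZES, key=lambda k: max(AND_SIZES[k]))
    k2 = min(OR_SIZES, key=lambda k: max(OR_SIZES[k]))
    k3 = min(OR_SIZES, key=lambda k: circuit_delay(k1, k2, k))
    return k1, k2, k3


# Exhaustive search over all 2^3 assignments gives the true optimum.
best = min(product((1, 2), repeat=3), key=lambda ks: circuit_delay(*ks))
```

Under these assumptions, both heuristics return the assignment (2, 1, 1) with delay 10, whereas the exhaustive search finds (1, 1, 2) with delay 9.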
IV. PROBLEM COMPLEXITY
As seen in the previous section, finding the optimum solution to the gate assignment problem is nontrivial. It turns out that deriving an efficient optimum solution is difficult because the problem itself is intractable. We prove that the problem of gate assignment with different rise and fall parameter values is NP-complete even under the load-independent delay model, which is the simplest possible model.
Theorem 1: The gate assignment problem is NP-complete under the load-independent delay model.
Proof: The problem is in NP since, given an assignment, checking that the circuit delay meets the requirement can easily be done by performing a delay trace on the circuit. To prove NP-hardness, we use a transformation from the PARTITION problem. PARTITION, stated as a decision problem, is as follows [4]:
Given a finite set $A$ and a weight $w(a) \in \mathbb{Z}^+$ for each $a \in A$, is there a subset $A' \subseteq A$ such that
$\sum_{a \in A'} w(a) = \sum_{a \in A - A'} w(a) = W(A)/2$, (3)
where $W(A) = \sum_{a \in A} w(a)$?
Given an instance of PARTITION, we construct a circuit with $|A|$ single-input single-output noninverting cells $C_1, C_2, \ldots, C_{|A|}$, which are connected in a chain, with the output of $C_i$ connected to the input of $C_{i+1}$ (Fig. 2). The circuit has a single input and a single output.¹ The cell $C_i$ corresponds to the item $a_i$ of $A$. Each cell $C_i$ comes in two sizes, $C_i^R$ and $C_i^F$, and they have the following delay parameters:
$(\alpha_r(C_i^R), \alpha_f(C_i^R)) = (w(a_i), 0)$,
$(\alpha_r(C_i^F), \alpha_f(C_i^F)) = (0, w(a_i))$.
The size $C_i^R$ contributes only to the rise delay at its output, and the size $C_i^F$ only to the fall delay.
We show that there exists $A' \subseteq A$ such that (3) is satisfied if and only if the circuit delay is at most $W(A)/2$ (i.e., rise and fall delays through the circuit are each at most $W(A)/2$).
Only If: Suppose we are given $A'$ such that (3) is satisfied. If $a_i \in A'$, select the size $C_i^R$ for the cell $C_i$; otherwise, select $C_i^F$. Since each cell is noninverting, the rise delay of the circuit is the sum of the rise delays of all the cells. Since the rise delay through a cell that corresponds to an item not in $A'$ is zero, and through a cell that corresponds to an item in $A'$ is the weight of the corresponding item, the rise delay of the circuit is precisely $\sum_{a \in A'} w(a) = W(A)/2$ [from (3)]. Similarly, the fall delay is $\sum_{a \in A - A'} w(a) = W(A)/2$. Hence, the circuit delay, which is the maximum of rise and fall delays, is $W(A)/2$.
If: Assume there exists a size for each cell such that both rise and fall delays are at most $W(A)/2$. In fact, both must be exactly equal to $W(A)/2$, since each cell $C_i$ contributes $w(a_i)$ either to the rise delay or to the fall delay through the circuit, and hence the rise and fall delays of the circuit sum to $W(A)$. Create the set $A'$ as follows. If the size $C_i^R$ is selected for the cell $C_i$, place the corresponding item $a_i$ in $A'$. It is easy to see that (3) is satisfied.
¹ The argument in the proof remains the same if each gate has other fanins that are primary inputs.
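The equivalence used in the proof can be checked by brute force on small instances. The following sketch builds the chain circuit implicitly: choosing size "R" for a cell adds its weight to the rise delay, and choosing "F" adds it to the fall delay. The function names are illustrative, and the exhaustive enumeration is exponential; it serves only to illustrate the construction.

```python
from itertools import product


def chain_delay(weights, choice):
    """Delay of the chain when cell i uses size choice[i] in {'R', 'F'}."""
    rise = sum(w for w, c in zip(weights, choice) if c == "R")
    fall = sum(w for w, c in zip(weights, choice) if c == "F")
    return max(rise, fall)


def min_chain_delay(weights):
    """Optimum gate assignment for the chain, by exhaustive enumeration."""
    return min(chain_delay(weights, c)
               for c in product("RF", repeat=len(weights)))


def has_partition(weights):
    """PARTITION has a solution iff the optimum chain delay equals W(A)/2."""
    return 2 * min_chain_delay(weights) == sum(weights)
```

For example, the weights [3, 1, 1, 2, 2, 1] admit a partition (3 + 2 = 5 = 10/2), so the optimum chain delay is 5, whereas for [1, 1, 4] the optimum chain delay is 4, which exceeds $W(A)/2 = 3$.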
Note that, although some of the values in the proof are zero, an alternate proof that uses strictly positive values can also be constructed by adding a constant to all the cell delays. 2 It has been proved [7] that the problem of gate resizing for minimizing the circuit delay under area constraints is NP-complete. Our proof (of Theorem 1) can be obtained by replacing the cell delay and area parameters in the proof of [7] with the rise and fall delay parameters.
We used the simplest load-independent delay model to prove the complexity result. Clearly, the gate assignment problem remains NP-complete for the more realistic load-dependent delay model as well. Additionally, since gate assignment is a special case of technology mapping, the previous theorem also establishes that the problem of technology mapping for minimum circuit delay given separate rise and fall delay parameters under the load-independent delay model is NP-complete.
However, the PARTITION problem is not NP-complete in the strong sense. This distinction between strong and weak NP-complete problems is somewhat subtle, but important. Informally, a problem is NP-complete in the strong sense if its difficulty is not directly related to the values of the numbers used to describe an instance of the problem. Strong NP-complete problems are hard to solve even if the numbers that describe the instances of the problem are small. In contrast, the complexity of weak NP-complete problems is directly related to the presence of large numbers in the instance description. NP-complete problems that are not strong can be solved in pseudopolynomial time, i.e., in time polynomial in the length of the instance description and in the magnitude of the largest number involved.
PARTITION can be solved in pseudopolynomial time using a simple dynamic programming technique [4] that takes time polynomial in the sum of the numbers in A.
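As a reminder of how such a pseudopolynomial algorithm works, the following is a minimal sketch of the standard subset-sum dynamic program for PARTITION. It runs in time proportional to $|A| \cdot W(A)$; the function name is illustrative.

```python
def partition_exists(weights):
    """Decide PARTITION in O(len(weights) * sum(weights)) time."""
    total = sum(weights)
    if total % 2:  # an odd total can never be split evenly
        return False
    target = total // 2
    # reachable[s] is True if some subset of the items seen so far sums to s.
    reachable = [True] + [False] * target
    for w in weights:
        # Iterate downward so that each item is used at most once.
        for s in range(target, w - 1, -1):
            if reachable[s - w]:
                reachable[s] = True
    return reachable[target]
```

The running time grows with the numeric value of the weights rather than with the number of bits needed to write them, which is exactly what makes the algorithm pseudopolynomial rather than polynomial.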
Since the gate assignment problem was proved NP-complete by transforming a weak NP-complete problem, there remains the possibility that the gate assignment problem is not, itself, NP-complete in the strong sense. In the next section, we partially settle this open problem by showing that a pseudopolynomial time algorithm for solving the gate assignment problem exactly for tree circuits exists, thereby proving that the problem for tree circuits is not NP-complete in the strong sense. For the general directed acyclic circuits, the complexity of the problem still remains an open question.
V. PROPOSED METHOD
This section describes a dynamic programming approach that solves the gate assignment problem exactly in pseudopolynomial time for a tree circuit. The algorithm is based on traversing the nodes of the tree circuit topologically from primary inputs to the output, computing the permissible fall and rise times for each node in the circuit, selecting that choice at the primary output which yields the minimum value of the circuit delay (which is the maximum of the rise and fall delays) from all permissible delays, and propagating this value backward through the circuit and selecting the gate sizes. For simplicity, in the following exposition, we will consider the case where all the delays are integers, in some chosen unit.
Assume that, for each cell $C_i$ in the circuit, several different sizes $C_i^k$ are available in the library, each having different pin-to-pin delays.
A. Delay Feasibility Table
The basic idea underlying the approach is to use a dynamic programming algorithm to compute, for each cell, the ranges of possible delays obtainable. For each cell $C_i$ in the circuit, the permissible fall and rise times will be kept in a table $T_i$, the delay feasibility table. $T_i(r, f)$ will contain a value $k > 0$ if it is possible to achieve a rise arrival time of $r$ and a fall arrival time of $f$ by selecting the gate size $k$ for that cell. If no gate size can meet both $r$ and $f$, then $T_i(r, f)$ will contain the value 0.
Upper limits on the required size of the table can be obtained from the circuit delay obtained with the greedy strategy described in Section III. For the purposes of this analysis, we will assume the maximum circuit delay obtained with the greedy approach is $m = D$. This number represents an upper bound on the achievable circuit delay. For our example, as shown earlier, $m = 10$.
As an example, consider again the circuit in Fig. 1. For gate $C_1$ in that figure, it is possible to achieve a rise time of 7 and a fall time of 3 by choosing gate $C_1^1$, and a rise time of 4 and a fall time of 6 by choosing gate $C_1^2$. According to the definition, $T_1(r, f)$ should take the value 1 for any pair $(r, f)$ such that $(r \ge 7 \wedge f \ge 3)$. It should also take the value 2 for any pair $(r, f)$ respecting $(r \ge 4 \wedge f \ge 6)$. Note that, for values of $r$ and $f$ satisfying both conditions (e.g., $r = 8$, $f = 8$), $T_1(r, f)$ can take either the value of 1 or 2, since both choices of gates achieve the desired arrival times. For gate $C_1$, the delay feasibility table is shown in Fig. 3. In this and the following figures, the value of $r$ is used to index the column of the table and the value of $f$ to index the row of the table. Similarly, for gate $C_2$, the table of Fig. 4 is obtained.
For each cell $C_i$, the entries in table $T_i$ are computed using a simple recursive relation. Let $S_i = \{C_j\}$ denote the set of gates that are the direct fanins of $C_i$. Let us also assume that the output of $C_j$ is connected to the input pin $j$ of $C_i$. We are interested in determining whether a rise arrival time of $r$ and a fall arrival time of $f$ are possible at the output of $C_i$ for some size assignment for all the gates in the subcircuit rooted at $C_i$. We answer this by trying all gate sizes $k$ at $C_i$ one by one and then, for each $k$, checking the feasibility of appropriate rise and fall delay values at the fanin gates $C_j$, as shown in the following relation:
$T_i(r, f) = k \iff \forall j \text{ s.t. } C_j \in S_i:\; T_j(r_{j,k}, f_{j,k}) \neq 0$ (6)
where
• $r_{j,k} = r - \alpha_r(j, C_i^k)$ and $f_{j,k} = f - \alpha_f(j, C_i^k)$, if $C_i$ is positive unate in the pin $j$, i.e., a rising (falling) transition at $j$ causes a rising (falling) transition at $C_i$ (e.g., $C_i = C_j + C_z$).
• $r_{j,k} = f - \alpha_f(j, C_i^k)$ and $f_{j,k} = r - \alpha_r(j, C_i^k)$, if $C_i$ is negative unate in the pin $j$, i.e., a rising (falling) transition at $j$ causes a falling (rising) transition at $C_i$ (e.g., $C_i = C_j' + C_z$).
• $r_{j,k} = f_{j,k} = \min\{r - \alpha_r(j, C_i^k),\, f - \alpha_f(j, C_i^k)\}$, if $C_i$ is binate in the pin $j$, i.e., a rising (falling) transition at $j$ can cause either a rising or a falling transition at $C_i$ (e.g., an exclusive-OR gate).
Here, $\alpha_r(j, C_i^k)$ is the pin-to-pin delay from the input pin $j$ to the output pin of $C_i^k$ that corresponds to a rise transition on the output of the cell $C_i^k$ caused by a transition on the input pin $j$, i.e., the input pin connected to the cell $C_j$. Similarly, $\alpha_f(j, C_i^k)$ is the delay that corresponds to a fall transition on the output of cell $C_i^k$ caused by a transition on the pin $j$. Expression (6) states that there is a gate assignment for the subcircuit rooted at the cell $C_i$, with the size $k$ for the cell $C_i$, which can result in the rise and fall arrival times of $r$ and $f$, respectively, at the output of $C_i$ if and only if for each fanin gate $C_j$ of $C_i$, there is a gate assignment for the subcircuit rooted at $C_j$ that can yield $r_{j,k}$ and $f_{j,k}$ as the rise and fall arrival times, respectively, at the output of $C_j$. The computation of $r_{j,k}$ and $f_{j,k}$ for the positive and negative unate cases is straightforward. For a binate pin $j$, consider a gate assignment rooted at $C_j$, which along with the size $k$ at $C_i$ results in a rise arrival time of $r$ and a fall arrival time of $f$ at $C_i$. Then, it must be the case that this assignment yields at the output of $C_j$, the
4) For all values of $r$ in $\{1, \ldots, m\}$ and $f$ in $\{1, \ldots, m\}$, do: if $\forall j\; T_j(r_{j,k}, f_{j,k}) \neq 0$, then set $T_i(r, f)$ to $k$.
It is assumed that references to $T_i(r, f)$ with $r < 0$ or $f < 0$ will yield the value of 0. To proceed with the example used above, we can now compute table $T_3$ for gate $C_3$ by applying the recursion in (6). $T_3$ is shown in Fig. 5. On the other hand, $T_3(8, 9)$ is 0, because choosing either the size 1 or the size 2 for $C_3$ leads to infeasible delay assignments at the gate $C_1$. For instance, selecting $C_3^2$ implies that $T_1(8 - 2, 9 - 5)$ must be nonzero. But $T_1(6, 4)$ is 0. Similarly, selecting $C_3^1$ implies that $T_1(8 - 4, 9 - 4)$ must be nonzero, which is not the case. Note that in this example, for simplicity, we used only positive unate gates.
Since $C_3$ is the circuit output, we stop the forward traversal of the circuit. Next, we select that entry $(r, f)$ in $T_3$ which minimizes $\max\{r_i, f_i\}$ over all nonzero entries $T_3(r_i, f_i)$. In our example, $T_3(9, 9) = 2$ is the best entry, corresponding to the best achievable delay of $\max\{9, 9\} = 9$, obtainable by selecting the gate $C_3^2$. Note that in this example we used equal rise delays and equal fall delays for all input pins to the gate output only for the sake of clarity. In general, different pins will have different delays, and that is taken into account in (6).
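The forward table computation and the backward size selection can be sketched as follows for the example above. This is a simplified illustration rather than the implementation evaluated in Section VI: it fills the full tables (no boundary optimization), handles only positive unate gates with identical delays on all pins, treats gates without fanin gates as driven by primary inputs arriving at time 0, and assumes that $C_3$ is an OR gate driven by the AND gate $C_1$ and the OR gate $C_2$. The data layout and function names are hypothetical.

```python
def feasibility_table(gate, circuit, m, tables):
    """Fill tables[gate][r][f] with a size k > 0 if arrival times (r, f)
    are achievable at the gate output, else 0. Recurses over the fanin tree."""
    if gate in tables:
        return tables[gate]
    sizes, fanins = circuit[gate]
    fanin_tables = [feasibility_table(j, circuit, m, tables) for j in fanins]
    T = [[0] * (m + 1) for _ in range(m + 1)]
    for k, (a_r, a_f) in sizes.items():
        for r in range(a_r, m + 1):      # r - a_r < 0 is infeasible
            for f in range(a_f, m + 1):  # f - a_f < 0 is infeasible
                # Positive unate pins: every fanin must meet (r - a_r, f - a_f).
                if all(Tj[r - a_r][f - a_f] for Tj in fanin_tables):
                    T[r][f] = k
    tables[gate] = T
    return T


def assign_sizes(gate, r, f, circuit, tables, chosen):
    """Backward pass: read the size at (r, f) and propagate to the fanins."""
    k = tables[gate][r][f]
    a_r, a_f = circuit[gate][0][k]
    chosen[gate] = k
    for j in circuit[gate][1]:
        assign_sizes(j, r - a_r, f - a_f, circuit, tables, chosen)
    return chosen


AND_SIZES = {1: (7, 3), 2: (4, 6)}
OR_SIZES = {1: (4, 4), 2: (2, 5)}
# gate -> (available sizes, fanin gates); assumed topology of Fig. 1.
circuit = {"C1": (AND_SIZES, []), "C2": (OR_SIZES, []),
           "C3": (OR_SIZES, ["C1", "C2"])}
m = 10  # table dimension, from the greedy delay of 10
tables = {}
T3 = feasibility_table("C3", circuit, m, tables)
best_r, best_f = min(((r, f) for r in range(m + 1) for f in range(m + 1)
                      if T3[r][f]), key=lambda rf: max(rf))
chosen = assign_sizes("C3", best_r, best_f, circuit, tables, {})
```

Under these assumptions the best entry is $(9, 9)$, and the backward pass selects size 1 for $C_1$ and $C_2$ and size 2 for $C_3$, in agreement with the discussion above.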
B. Complexity Analysis and Optimizations
From (6) and the algorithm proposed, a first analysis shows that the creation of table $T_i$ requires time proportional to $n_i p_i m^2$, where $m$ is, as before, the dimension of the table, $n_i$ is the number of possible choices for gate $C_i$, and $p_i$ is the number of input pins for gate $C_i$.
It turns out, however, that computing the entire table is not necessary. In fact, for the purpose of delay optimization, only the boundary of the filled area in the table needs to be computed, since no optimal solution will ever use cells properly inside the filled area. In Fig. 4, the boundary is shown in bold. Computing the cells in the boundary of a gate $C_i$ can be performed by following the boundary of each gate in the direct fanin of $C_i$ and filling in the corresponding entries in $T_i$. This procedure is illustrated in Fig. 6. Since the maximum length of each boundary is no larger than $2m$, the computation of the boundary requires a time proportional to
$n_i p_i m$. (7)
With appropriate data structures, the memory requirements will also be proportional to this expression.
Consider now a more realistic library wherein the gate delays are not integers, but floating point numbers. Assume that the precision of the gate delays is $g$ (the granularity). For example, if the delays are given with a precision down to the hundredth of a nanosecond, then $g = 0.01$. Additionally, assume that the value obtained for the circuit delay using the greedy approach described in Section III is $D$. Then, a table of size $m = D/g$ will be required.
Summing (7) for all gates in the circuit, we obtain that the complexity of the algorithm is
$O(RPGD/g)$ (8)
where $R$ is the maximum number of choices available for any gate in the library, $P$ is the maximum number of input pins in the gates, $G$ is the total number of gates in the circuit, $D$ is the circuit delay, and $g$ is the granularity.
Since a linear dependence on R, P , and G is unavoidable for any gate sizing algorithm, the extra complexity is paid by the term D=g. This result was to be expected since in pseudopolynomial time algorithms, the complexity is necessarily dependent on the size of the numbers involved, or, equivalently, on their precision.
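For concreteness, the summation that leads to (8) can be written out using the bounds $n_i \le R$ and $p_i \le P$ and the table dimension $m = D/g$. The numeric instance on the right is purely illustrative: $D = 10$ ns is a hypothetical circuit delay, paired with a granularity of $g = 0.01$ ns.

```latex
\sum_{i=1}^{G} n_i\, p_i\, m
  \;\le\; \sum_{i=1}^{G} R\, P\, \frac{D}{g}
  \;=\; \frac{R\,P\,G\,D}{g},
\qquad
m = \frac{D}{g} = \frac{10\ \text{ns}}{0.01\ \text{ns}} = 1000 .
```

In this hypothetical setting, each boundary has at most $2m = 2000$ cells, whereas a full table would have $m^2 = 10^6$ entries.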
C. Data Structures
Given the description of the algorithm and the observation that only the boundaries of the filled part of the tables need to be kept, there are several possibilities for maintaining the data stored in the tables $T_i$.
One possibility is simply to use a matrix for each table $T_i$. This matrix should be initialized to 0, and then only the boundaries need to be filled in. This solution is simple, but has the significant disadvantage that the memory requirements become proportional to $G D^2 / g^2$ and that the initialization step may actually become dominant for small values of $g$, leading to a significant inefficiency.
An interesting alternative is to use a sparse matrix data structure. If accesses to the matrix elements are based on a hash table method, the memory requirements are only proportional to $GD/g$, thus yielding a much better asymptotic behavior. However, this improvement may become noticeable only for higher values of $D/g$, since the hash table accesses impose a significant constant overhead.
A more radical approach can be taken for reducing the memory requirements even further. Note that the boundary of each table Ti is totally defined by the exterior corners of the boundary of the filled area.
For instance, the contents of the table $T_1$ shown in Fig. 3 are totally defined by the values of the two corner cells $T_1(7, 3) = 1$ and $T_1(4, 6) = 2$. Using appropriate data structures, it may be possible to avoid the need to store the entire boundary. Note that, in the worst case, there may exist $O(m)$ external corners of the boundary, but, on average, the number of corners will be much smaller than $m$, leading to very significant savings in memory usage. However, whether it is possible to exploit this property by using appropriate data structures remains an open question.
There are other significant details which can speed up the algorithm and reduce the memory usage, but they have not been implemented. One such optimization is based on the fact that the table size does not need to be the same for each node in the circuit, since the interesting part of the table $T_i$ may not cover the whole range of indices from 1 to $m$. In fact, positions in the table indexed by coordinates smaller than the smallest possible delay in the direct fanins of $C_i$ are useless, since the table contains only zeros in that region. On the other hand, positions in the table indexed by coordinates larger than the largest possible delay value for gate $C_i$ are also useless, since they will never be used in the optimum solution. We believe these and related optimizations, when implemented, will reduce the memory and CPU usage by at least an order of magnitude, making the approach very competitive with the simple greedy strategy described in Section III.
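To make the corner representation concrete, the sketch below stores a table as a list of exterior corners $(r, f, k)$ and answers lookups by scanning that list. It only illustrates storage and lookup for a single table; it does not address how corners would be propagated efficiently through the recursion, which is the part left open above. The function names are illustrative.

```python
def exterior_corners(points):
    """Keep only the corners (r, f, k) not dominated by another corner,
    i.e., those with no other point having both smaller-or-equal r and f."""
    corners = []
    best_f = float("inf")
    for r, f, k in sorted(points):  # increasing r, then increasing f
        if f < best_f:              # strictly better fall time than all before
            corners.append((r, f, k))
            best_f = f
    return corners


def lookup(corners, r, f):
    """Return a size k achieving arrival times (r, f), or 0 if none exists."""
    for cr, cf, k in corners:
        if cr <= r and cf <= f:
            return k
    return 0


# Table T_1 of the example: two corners define the whole filled area.
t1 = exterior_corners([(7, 3, 1), (4, 6, 2), (8, 7, 1)])
```

Here the point $(8, 7)$ is dropped because it is dominated by $(4, 6)$, and `lookup(t1, 6, 4)` returns 0, matching $T_1(6, 4) = 0$ in the example of Section V-A.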
D. Incorporating Wire Delays
In deep submicron technologies, wire delays are becoming a dominant component of the circuit delay. Any wire delay model in which the wire delay is independent of the characteristics of the driving and driven cells can be easily incorporated in our algorithm. If the delay of the wire connecting the driving cell $C_j$ to $C_i$ is $w(C_j, C_i)$, then $r_{j,k}$ and $f_{j,k}$ in (6) can be appropriately modified to include $w(C_j, C_i)$.
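As a sketch of this modification for a pin $j$ in which $C_i$ is positive unate, and assuming for illustration that the wire delay $w(C_j, C_i)$ is the same for rising and falling transitions, the arrival times required at the output of $C_j$ would become the following, where $\alpha_r$ and $\alpha_f$ denote the rise and fall pin-to-pin delays of $C_i^k$:

```latex
r_{j,k} = r - \alpha_r(j, C_i^k) - w(C_j, C_i), \qquad
f_{j,k} = f - \alpha_f(j, C_i^k) - w(C_j, C_i).
```

The negative unate and binate cases would change in the same way, with $w(C_j, C_i)$ subtracted from each term.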
E. Extension to Directed Acyclic Graphs
The algorithm proposed above is provably optimum only for tree circuits. For general combinational circuits, i.e., directed acyclic graphs (DAGs), it may not even yield a feasible solution. The reason is as follows. Consider the circuit of Fig. 7 with the gate $C_1$ fanning out to two gates $C_2$ and $C_3$. Let $C_2$ and $C_3$ be the circuit outputs. Let $C_1$ have two sizes: 1) with $(\alpha_r, \alpha_f) = (3, 5)$, and 2) with $(\alpha_r, \alpha_f) = (5, 3)$. Let $C_2$ have one size with $(\alpha_r, \alpha_f) = (3, 1)$ and $C_3$ have one size with $(\alpha_r, \alpha_f) = (1, 3)$. Once again, for simplicity, we assume identical delay values for all input pins of a gate. All inputs arrive at time $(0, 0)$. Our goal is to minimize the circuit delay. The delay feasible regions at the cell outputs are:
• $C_1$: $(r \ge 3 \wedge f \ge 5) \vee (r \ge 5 \wedge f \ge 3)$.
• $C_2$: $(r \ge 6 \wedge f \ge 6) \vee (r \ge 8 \wedge f \ge 4)$.
• $C_3$: $(r \ge 4 \wedge f \ge 8) \vee (r \ge 6 \wedge f \ge 6)$.
If we use our algorithm, it will compute the best delay at the outputs $C_2$ and $C_3$, and hence of the circuit, to be 6, corresponding to $(r, f) = (6, 6)$. Since there is a single size for $C_2$ and $C_3$, to obtain these values, we need $(3, 5)$ at the input of $C_2$ and $(5, 3)$ at the input of $C_3$. Both these inputs are connected to the output of $C_1$. The constraint from $C_2$ requires that we select size 1 for $C_1$, whereas the constraint from $C_3$ mandates that we select size 2. Thus, $(6, 6)$ is not realizable! The problem is that $C_1$ is a multiple-fanout point, and the two fanout gates require selection of different sizes at $C_1$. Both sizes, although possible, cannot be selected at the same time.
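The gap between the value computed per output and the delay that is actually realizable can be verified directly for this small example. The sketch below assumes, consistently with the feasible regions listed above, that $C_2$ and $C_3$ are positive unate in the pin driven by $C_1$; all names are illustrative.

```python
# (alpha_r, alpha_f) values from the example.
C1_SIZES = {1: (3, 5), 2: (5, 3)}
C2_DELAY = (3, 1)
C3_DELAY = (1, 3)


def output_arrivals(k):
    """(rise, fall) arrival times at C2 and C3 when C1 uses size k."""
    r1, f1 = C1_SIZES[k]
    c2 = (r1 + C2_DELAY[0], f1 + C2_DELAY[1])
    c3 = (r1 + C3_DELAY[0], f1 + C3_DELAY[1])
    return c2, c3


def circuit_delay(k):
    """True circuit delay when the single shared gate C1 uses size k."""
    c2, c3 = output_arrivals(k)
    return max(*c2, *c3)


# Per-output optimum, as if each output could pick its own size for C1.
lower_bound = max(min(max(output_arrivals(k)[i]) for k in C1_SIZES)
                  for i in (0, 1))
# Realizable optimum: one size for C1 shared by both fanouts.
true_optimum = min(circuit_delay(k) for k in C1_SIZES)
```

Each output in isolation can reach a delay of 6, but either size of $C_1$ yields a true circuit delay of 8, so the per-output value of 6 is not achievable by any single assignment.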
We can modify our algorithm for general DAGs in several ways. However, none of them is provably exact.
1) For a DAG, the best delay value $\ell$ computed at the circuit outputs from the delay feasibility tables is a lower bound on the true minimum delay possible by gate assignment. In this method, we compute the delay feasibility tables for each gate as before, and select the best delay values at the outputs. Then, we traverse the gates backward (from primary outputs to primary inputs), selecting the size for each gate as dictated by the rise and fall delays propagated from the outputs, and propagating the $(r, f)$ constraints to the fanins. When we hit a multiple-fanout gate $C$, each fanout propagates different $(r, f)$ constraints to $C$, requiring possibly different sizes for $C$. We pick the size that corresponds to the minimum value of $\max\{r, f\}$. After all the gates have thus been assigned sizes, we perform a delay trace on the modified circuit to obtain the true delay $D$ of the resized circuit. If $D = \ell$, we know this algorithm has yielded the best possible delay.
2) Partition the DAG into trees by cutting it at multiple-fanout points. Order the trees topologically from inputs to outputs. For each tree in the order, apply the dynamic programming algorithm and select the gate sizes that minimize the delay at the output of the tree. Use these delay values while selecting the sizes for the gates in the trees later in the order. This heuristic is also widely used in technology mapping.
Currently, we have implemented only the first heuristic. Note that if arbitrary gate replication is allowed in addition to gate assignment, the dynamic programming algorithm (resulting in the delay value $\ell$) is provably optimum for general DAGs. To achieve this delay $\ell$, when traversing the circuit backward, for each multiple-fanout gate, create as many copies as the number of different sizes required by the fanouts.
VI. EXPERIMENTAL RESULTS
To evaluate the applicability and practical impact of our dynamic programming-based solution of the gate assignment problem, a preliminary implementation of the algorithm was developed and integrated with the SIS system [12]. In this section, we present the results obtained, both in terms of the final delay obtained for the circuit and the CPU times required to compute the solutions. We tested the algorithm on the set of circuits in the MCNC 91 benchmark, using Fujitsu's 0.25-μm library. For these experiments, we set the granularity $g$ to 0.01 ns.
We implemented three gate assignment algorithms.
• Same Rise-Fall method, in which the pin-to-pin gate delay is approximated by a single delay value, and then the dynamic programming algorithm of [6] is applied. Finally, a delay trace on the resized circuit using the actual rise and fall delays yields the true circuit delay.
• Greedy method of Section III: Under a topological traversal of the circuit, it selects for each gate the size that minimizes the maximum of rise and fall arrival times at that gate.
• Dynamic Programming (DP) method of Section V, as extended to general DAGs (see method 1 in Section V-E).
The experiment was run on an Ultrasparc-60 with 2 GB of memory. Out of the 73 circuits, the DP method provably achieved the minimum delay on 69, whereas same rise-fall and greedy did so on 43 and 57 circuits, respectively. We know that a method generated the optimum delay value on a benchmark if it yielded a delay value identical to the corresponding lower bound $\ell$ generated by the forward pass of the DP method. To the best of our knowledge, this is the first time it has been shown that the delays yielded by the popularly used same rise-fall and greedy algorithms, as well as by the DP method, are exact for most benchmarks. This empirical evidence of the optimality of the DP method on almost all the circuits is very encouraging.
On 44 of the 73 circuits, the same rise-fall, greedy, and DP methods all obtained identical delays, 42 of which were provably optimum. Statistics for the remaining 29 circuits are shown in Table II. The numbers of circuit inputs, outputs, and literals are shown in columns 2 through 4. The table also describes the results obtained on these circuits. Column 5 lists the lower bound $\ell$ on the optimum delay computed by the forward pass of the DP algorithm. This is the minimum delay any gate assignment algorithm can hope to achieve. Columns 6, 7, and 8 list the final delay values for the circuit where gate assignment was performed by the same rise-fall (S), greedy (G), and DP algorithms, respectively. Columns 9 and 10 list the percentage improvements in delay by DP over S and G, respectively. The last column shows the combined CPU times taken to solve the gate assignment problem with all three methods.
The DP method resulted in provably minimum delay on 27 of the 29 circuits, whereas same rise-fall did so on only 1 and greedy on 15. DP was better than same rise-fall on 27 circuits and better than greedy on 13, and was never worse than either of the two. The maximum delay improvement of DP over same rise-fall is 0.62% and that over greedy is 0.43%.
Greedy generally performed better than same rise-fall: of the 29 circuits, greedy was better on 22 and worse on 2.
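The optimality test and the percentage-improvement columns discussed above can be sketched as follows. This is only an illustration: the text does not state which delay serves as the denominator of the percentages, so normalizing by the baseline delay is an assumption, as are the function names and the numerical tolerance.

```python
def is_provably_optimal(delay: float, lower_bound: float,
                        tol: float = 1e-9) -> bool:
    """A delay is provably optimum when it matches the lower bound
    produced by the forward pass of the DP method (tol is an assumption)."""
    return abs(delay - lower_bound) <= tol


def improvement_pct(baseline_delay: float, dp_delay: float) -> float:
    """Percentage improvement of DP over a baseline (S or G).
    Assumes the improvement is normalized by the baseline delay."""
    return 100.0 * (baseline_delay - dp_delay) / baseline_delay
```

Under this reading, a baseline delay of 200.0 and a DP delay of 199.0 (made-up numbers) would correspond to a 0.5% improvement, which is the order of magnitude of the maximum improvements reported above.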
VII. CONCLUSION AND FUTURE WORK
In this work, we analyzed the complexity of the gate-assignment problem in the presence of different rise and fall delays and presented a dynamic programming algorithm that solves it exactly in pseudopolynomial time for tree circuits under the load-independent delay model. This is the first efficient exact algorithm for the gate assignment problem for tree circuits under these conditions. We also presented two simple extensions of the algorithm to general circuits.
A preliminary implementation of the algorithm was used to evaluate the performance of the commonly-used same rise-fall and greedy algorithms against our proposed dynamic programming solution. Although this implementation uses relatively simple data structures, we were able to solve the gate assignment problem exactly for almost all circuits (69 out of 73) in a well-known benchmark set using the proposed dynamic programming technique, thus proving that this problem can be solved exactly for circuits of significant size.
These experiments have also shown that, in general, both same rise-fall and greedy approaches obtain results very close to the optimum, at least for the library used. This happens despite the fact that the library contained gates with significantly different rise and fall delays. We believe this is due to the fact that even for circuits with a moderate number of levels (say 5), the imbalances in the individual rise and fall gate delays cancel out by the time the primary outputs are reached. Nevertheless, we were able to show for the first time that for the load-independent delay model in the presence of rise and fall delays, same rise-fall and greedy approaches work quite well, achieving exact results on 43 and 57 circuits, respectively, out of 73.
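The cancellation effect conjectured above can be illustrated with a toy example. The sketch below assumes a chain of identical inverters under the load-independent model, where a rising output is caused by a falling input and vice versa. The function name and the delay values are hypothetical, and the example covers only this one scenario: a chain of non-inverting gates would accumulate the imbalance instead of cancelling it.

```python
def chain_arrivals(levels: int, d_rise: float, d_fall: float):
    """Rise and fall arrival times at the output of an inverter chain.

    Assumes both transitions arrive at time 0 at the chain input and
    every inverter has the same (d_rise, d_fall) intrinsic delays.
    """
    rise, fall = 0.0, 0.0
    for _ in range(levels):
        # Output rises after the input falls; output falls after it rises.
        rise, fall = fall + d_rise, rise + d_fall
    return rise, fall


# Fall delay twice the rise delay (made-up values).
for n in range(1, 6):
    r, f = chain_arrivals(n, 1.0, 2.0)
    print(n, r, f, abs(r - f) / max(r, f))
```

With these values the gap between rise and fall arrival times never exceeds one gate's imbalance (1.0), while the total delay keeps growing. After five levels the arrivals are 7.0 and 8.0, so the relative gap has shrunk from 50% at the first level to 12.5%.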
There are several interesting directions for future work. One obvious direction is the search for an exact pseudopolynomial algorithm for general combinational circuits. Another is the application of this method to the technology mapping problem. This should represent a relatively simple extension, as long as the load-independent delay model is assumed. Note that a generalization of the algorithm to the load-dependent delay model is not likely, since it has recently been proved that the gate assignment problem under the load-dependent delay model is NP-complete in the strong sense [9].
Memory usage is currently the most significant bottleneck of the algorithm. It limits how well the implementation handles circuits with tens of thousands of gates and libraries with very fine delay granularities. Additional work is needed on the data structures used to manipulate the tables.
