In this paper, we propose a new method to optimize the performance of a very large circuit. We find the best set of local transformations to apply to the circuit by inserting "padding nodes" on non-critical edges of the circuit and deriving the selection set from separator sets of the padded circuit. Our method is robust for very large circuits, because its memory usage is linear, and its computation time polynomial, in the size of the circuit.
Introduction
To optimize the performance of a very large circuit, it is impossible to re-synthesize the whole circuit at once. The best strategy known to date is to repeat local transformations: extract a delay-critical portion of the large circuit as a sub-circuit, re-synthesize the sub-circuit, and restore it to its original position.

[Figure: original sub-circuit and re-synthesized sub-circuit]

To obtain the maximum improvement in circuit performance, it is important to find a "good" set of local transformations. We call a set of local transformations for re-synthesis a "selection set". However, finding the optimum selection set, i.e., the one with the minimum area and/or power that satisfies all timing constraints, is conjectured to be NP-hard [1]. Singh et al. have proposed a very efficient framework for performance optimization using iterative local transformations [1]. They use a selection function to represent a set of selection sets. A selection function is a characteristic function over a set of variables {x_i}, each corresponding to a local transformation i. The selection function holds true for every subset of {x_i} whose corresponding local transformations improve the delay at all outputs of the circuit.
They implement a selection function with a BDD [2]. One selection set corresponds to a one-path of the BDD (a path from the root node to the constant "one" node). The optimum selection set corresponds to the minimum-weight path among all one-paths, and can be calculated in time linear in the size of the BDD. BDDs, however, suffer from memory explosion. In the worst case, both memory usage and calculation time grow exponentially with the size of the circuit. Thus the selection function is not robust for very large circuits.
Another method of finding a selection set uses a separator set [3] [4]. A separator set is a set of nodes that cuts all paths from inputs to outputs. Using a network flow algorithm, the separator set with the minimum cost can be calculated in time polynomial, and memory linear, in the size of the circuit.
A separator set, however, does not always yield the optimum selection set. It does so only for single-output tree-structured circuits, not for multi-output DAG circuits [1].
We show an example of delay optimization in Fig. 2(a). We want to find the optimum selection set, which reduces the slack at the output O1 by 3. We first calculate three variables for each node i: s(i) (the slack at i), D(i) (the amount of node delay reduction achieved by the local transformation at i), and A(i) (the cost of the local transformation at i, e.g., the increase in area and/or power).
In this example, only three nodes, T, W, and X, are re-synthesizable. The optimum selection set is {T, X} with a cost of 9. The separator set, however, is {W, X} with a cost of 11. Thus the separator set is not the optimum selection set.
The selection function at O1 is F(O1) = X · (W + T). According to the on-set of the function, the selection sets are {W, X} and {T, X}. The latter has the lower cost and is chosen as the optimum selection set.

We propose a new method to find the optimum selection set using separator sets. By inserting "padding nodes" on non-critical edges of the circuit, the optimum selection set can be derived from separator sets even for multi-output DAG circuits. Our method is robust for very large circuits, because its memory usage is linear, and its computation time polynomial, in the size of the circuit.
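As a small illustration of the selection function F(O1) = X · (W + T) discussed above, the following Python sketch enumerates its on-set by brute force and picks the cheapest selection set. The per-node costs in `COST` are assumptions chosen only so that the totals match the costs 9 and 11 quoted in the text; the individual values are not given in the source. Brute-force enumeration is exponential and is used here only because the example has three variables; it is not the BDD-based procedure of [1].

```python
from itertools import combinations

# Hypothetical per-node costs A(i); only the totals 9 ({T, X}) and
# 11 ({W, X}) come from the text, the split is an assumption.
COST = {"T": 4, "W": 6, "X": 5}


def selection_function(chosen):
    """F(O1) = X . (W + T): true when the chosen set meets the delay goal."""
    return "X" in chosen and ("W" in chosen or "T" in chosen)


def optimum_selection_set(cost, func):
    """Enumerate every subset of transformations and keep the cheapest one
    for which the selection function holds."""
    best = None
    names = sorted(cost)
    for k in range(len(names) + 1):
        for subset in combinations(names, k):
            if func(set(subset)):
                total = sum(cost[n] for n in subset)
                if best is None or total < best[1]:
                    best = (set(subset), total)
    return best


best_set, best_cost = optimum_selection_set(COST, selection_function)
print(sorted(best_set), best_cost)
```

Under these assumed costs the sketch selects {T, X} with cost 9, in agreement with the example.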
Algorithm
We assume that the circuit is a combinational Boolean network, and that s(i), D(i), and A(i) have been calculated for every node i. If D(i) is equal to or less than 0, A(i) should be set to +∞.
First, we transform a multi-output circuit into a single-output one by adding a "virtual output". We replace all output nodes with buffers with a node delay of 0, and connect those buffers' outputs to the inputs of the virtual output. As a result, the slack at the virtual output [...]

0-7803-5832-X/99/$10.00 ©1999 IEEE.

For every edge e, we calculate ds(e), the difference between the node slacks at the two ends of e:
$$ds(e) = s(\mathrm{tail\_node}(e)) - s(\mathrm{head\_node}(e)) \tag{1}$$

If ds(e) > 0, we insert a "padding node" at e whose node delay is ds(e). The delay reduction and the cost of its local transformation are set to ds(e) and 0, respectively (i.e., D(e) = ds(e) and A(e) = 0). Note that padding nodes are inserted at all non-critical fanin edges of all nodes (including the virtual output).
From the viewpoint of delay analysis, slacks of all nodes become equal. Padding nodes eliminate all differences in arrival times at all inputs, required times at all outputs, and path delays among all reconvergent paths.
From the viewpoint of delay optimization, we treat padding nodes as re-synthesizable, such that their node delays can be reduced to 0 at no cost. Thus, padding nodes are more likely to be chosen in a selection set than any ordinary node with positive cost. Note that if a padding node is chosen in a selection set, we do not need to apply any local transformation to it.
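The padding-node insertion described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions, not the authors' implementation: the edge-list representation, the function name `insert_padding_nodes`, the `pad_<tail>_<head>` naming scheme, and the slack values in the example are all hypothetical.

```python
def insert_padding_nodes(edges, slack):
    """Insert a padding node on every edge e with ds(e) > 0.

    edges: list of (tail, head) pairs of the circuit DAG.
    slack: dict mapping node name -> slack s(i).
    Returns the new edge list and a dict describing each padding node.
    """
    new_edges = []
    padding = {}
    for tail, head in edges:
        ds = slack[tail] - slack[head]          # equation (1)
        if ds > 0:
            p = f"pad_{tail}_{head}"            # hypothetical naming scheme
            # Node delay ds(e), delay reduction D = ds(e), cost A = 0.
            padding[p] = {"delay": ds, "D": ds, "A": 0}
            new_edges += [(tail, p), (p, head)]
        else:
            new_edges.append((tail, head))      # critical edge: left as is
    return new_edges, padding


# Hypothetical example: edge b -> c is non-critical with ds = 2.
example_edges = [("a", "c"), ("b", "c")]
example_slack = {"a": 0, "b": 2, "c": 0}
padded_edges, padding_nodes = insert_padding_nodes(example_edges, example_slack)
print(padded_edges)
print(padding_nodes)
```

In this example only the edge from b to c receives a padding node, since it is the only edge whose two ends differ in slack.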
A separator set of a circuit with padding nodes has a very important property as follows:
Lemma 1 (Delay Reduction by A Separator Set)
If s is any separator set of a circuit with padding nodes, and s has finite cost, then s has at least one ordinary node (i.e., non-padding node), and it reduces the delay of the circuit by:
$$D(s) = \min_{i \in s} D(i)$$ □
The first half of the lemma is proven by the facts that a separator set must include at least one node on the most critical path, and padding nodes are not inserted on critical paths. The latter half of the lemma is trivial.
An example of padding nodes is shown in Fig. 2(b). There are two padding nodes, p1 and p2, with delay reductions of 2 and 1, respectively. First, we find the separator set with the minimum cost: {T, p1}. But the delay reduction of this separator set is 2, which is less than the objective delay reduction (3). Then we find the separator set with the second minimum cost: {X, p2}. The combination of those separator sets results in a delay reduction of 3 and a cost of 9, which is the same result as that of the selection function.
As this example shows, combining more than one separator set is effective. In this situation, the important question is how to select, at one time, a set of separator sets with the minimum total cost that achieves the objective delay reduction. We call such a set of separator sets a "multi-separator-set".
Definition 1 (Multi-Separator-Set)
A multi-separator-set is a set of one or more separator sets of the same circuit. Those separator sets can share the same node in the circuit, if the sum of the delay reductions of the separator sets sharing the node is equal to or less than the delay reduction of the node.
We call a multi-separator-set optimum if the total cost of the nodes belonging to it is the minimum. □
According to the following theorem, a multi-separator-set and a selection set are equivalent in terms of delay reduction and cost. Thus if we find the optimum multi-separator-set, we can get the optimum selection set by removing the padding nodes from the multi-separator-set.
Theorem 2 (Selection Set and Multi-Separator Set)
Let C be a circuit, and C′ be the circuit C with padding nodes inserted. Any selection set t in C, which reduces the circuit delay by D(t) > 0, has its corresponding multi-separator-set m in C′, such that:
m consists of nodes in t and padding nodes.
Delay reductions of t and m are the same: D(t) = D(m). □

Proof
First, we prove that t and the padding nodes contain at least one separator set in C′. Suppose that t and the padding nodes do not contain any separator set in C′. Then there exists a path p from one input to one output in C′, none of whose nodes is a padding node or included in t. This means that p is the most critical path in C, and its path delay cannot be reduced by t. This contradicts D(t) > 0.
Therefore, if D(t) > 0, we can find a separator set s that consists of nodes in t and padding nodes. When D(t) > D(s), we update the delay reduction D(i) for each node i in s: D(i) ← D(i) − D(s). Note that this process also updates D(t): D(t) ← D(t) − D(s).
Until D(t) = 0, we can iterate finding a separator set in C′ and updating the node delay reductions. This iteration must terminate, since each iteration finds a different separator set and C′ has finitely many separator sets.
It is trivial that all separator sets found contain only nodes in t and padding nodes, and that their total circuit delay reduction is equal to the original D(t).
□
According to Theorem 2, finding the optimum multi-separator-set has the same time complexity as finding the optimum selection set, i.e., it is NP-hard. Thus, we approximate it by finding separator sets one by one, as in the proof of Theorem 2 (Fig. 3).
Step 1: Set a variable m_separator_set, which represents a multi-separator-set, to empty.
Step 2: Make a flow network from the circuit with padding nodes. Note that each node n in the circuit has its corresponding edge in the flow network, whose flow capacity is equal to the node cost (A(n)). We give each node (including each padding node) a variable called "rest of delay reduction". Its value is initially set to the delay reduction of the node, and decreases each time the node is chosen as a member of a separator set.
Step 3: Calculate the maximum flow on the flow network by a flow algorithm. If there is no flow with finite cost, the procedure ends and returns m_separator_set.
Step 4: Find the separator set that corresponds to the cutset (set of edges) of the maximum flow, and merge the separator set into m_separator_set.
Step 5: For each node in the separator set obtained at Step 4, reduce its rest of delay reduction by the delay reduction of the separator set. If the rest of delay reduction becomes 0, we set the flow capacity of the corresponding edge to +∞, so that the node can no longer be chosen as a member of a separator set.
Step 6: Return to Step 3.

But this is an overestimate in practice, because Dinic's algorithm works by repeatedly finding an augmenting path on a flow network. In Fig. 3 we raise or keep the flow capacity of every edge at Step 5. Then, on returning to Step 3, we do not need to cancel the previously calculated maximum flow; we obtain the new maximum flow by finding augmenting paths along the edges whose capacity we have raised. Thus, we conclude that there is no big difference between the times for calculating a single separator set and a multi-separator-set.
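Steps 1 to 6 above can be sketched in Python as follows. This is a minimal sketch under several assumptions and not the authors' implementation: it uses a plain Edmonds-Karp (BFS augmenting path) routine instead of Dinic's algorithm, it rebuilds the flow network and recomputes the maximum flow from scratch in every iteration instead of reusing the previous flow, it represents +∞ by the large integer `INF`, and it adds an optional `target` parameter to stop once the objective delay reduction is reached. The function names, the example topology, and the node costs are hypothetical.

```python
from collections import deque

INF = 10**9  # stands in for +infinity; assumes all finite costs sum to less


def max_flow(cap, adj, src, snk):
    """Edmonds-Karp. Returns (flow, vertices reachable from src at the end)."""
    res = dict(cap)
    flow = 0
    while True:
        parent = {src: None}
        queue = deque([src])
        while queue and snk not in parent:
            u = queue.popleft()
            for v in adj[u]:
                if v not in parent and res[(u, v)] > 0:
                    parent[v] = u
                    queue.append(v)
        if snk not in parent:
            return flow, set(parent)      # no augmenting path is left
        path, v = [], snk
        while parent[v] is not None:
            path.append((parent[v], v))
            v = parent[v]
        bottleneck = min(res[e] for e in path)
        for u, v in path:
            res[(u, v)] -= bottleneck
            res[(v, u)] += bottleneck
        flow += bottleneck
        if flow >= INF:
            return flow, set()            # no finite-cost cut exists


def build_network(cost, edges, inputs, outputs):
    """Step 2: each circuit node becomes an edge with capacity A(n)."""
    cap, adj = {}, {}

    def add(u, v, c):
        cap[(u, v)] = c
        cap.setdefault((v, u), 0)
        adj.setdefault(u, []).append(v)
        adj.setdefault(v, []).append(u)

    for n, a in cost.items():
        add((n, "in"), (n, "out"), a)
    for u, v in edges:
        add((u, "out"), (v, "in"), INF)
    for n in inputs:
        add("SRC", (n, "in"), INF)
    for n in outputs:
        add((n, "out"), "SNK", INF)
    return cap, adj


def multi_separator_set(nodes, edges, inputs, outputs, target=None):
    """nodes: name -> (D, A). Returns (m_separator_set, total delay reduction)."""
    rest = {n: d for n, (d, a) in nodes.items()}       # rest of delay reduction
    cost = {n: (a if d > 0 else INF) for n, (d, a) in nodes.items()}
    m_separator_set, reduction = set(), 0              # Step 1
    while target is None or reduction < target:
        cap, adj = build_network(cost, edges, inputs, outputs)
        flow, reach = max_flow(cap, adj, "SRC", "SNK")  # Step 3
        if flow >= INF:
            break
        # Step 4: nodes whose own edge crosses the minimum cut.
        sep = {n for n in nodes
               if (n, "in") in reach and (n, "out") not in reach}
        if not sep:
            break
        d = min(rest[n] for n in sep)                  # reduction of this set
        for n in sep:                                  # Step 5
            rest[n] -= d
            if rest[n] == 0:
                cost[n] = INF
        m_separator_set |= sep
        reduction += d                                 # Step 6: loop again
    return m_separator_set, reduction


# Hypothetical circuit: a -> T -> X -> O and b -> p1 -> Y -> p2 -> O.
nodes = {"a": (0, 0), "b": (0, 0), "Y": (0, 0), "O": (0, 0),
         "T": (2, 4), "X": (1, 5), "p1": (2, 0), "p2": (1, 0)}
edges = [("a", "T"), ("T", "X"), ("X", "O"),
         ("b", "p1"), ("p1", "Y"), ("Y", "p2"), ("p2", "O")]
chosen, reduction = multi_separator_set(nodes, edges, ["a", "b"], ["O"], target=3)
ordinary = sorted(n for n in chosen if not n.startswith("p"))
total_cost = sum(nodes[n][1] for n in ordinary)
print(ordinary, reduction, total_cost)
```

With these assumed values the sketch first finds {T, p1} (delay reduction 2) and then {X, p2} (delay reduction 1), so that removing the padding nodes leaves {T, X} with a total delay reduction of 3 and a cost of 9. A node shared by several separator sets is charged its full capacity in each minimum cut, which is one reason the procedure is an approximation.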
Experimental Results
We have compared the delay optimization capability of our method with that of Singh's separator set [3] and selection function [1] methods. Singh's methods are implemented as the "speed_up -f" and "speed_up" commands of SIS-1.2, respectively. We have implemented our method on our logic synthesis platform "Magus", which is written in C++ and Tcl/Tk.
We have generated the initial circuits by area optimization and timing-driven 2-input gate decomposition on SIS (script.rugged and the "speed_up -i" command).
For all three methods, we have used a unit-gate-delay model for delay estimation and the sum of literals as the optimization cost. The local transformation has been collapse-and-decomposition: collapsing nodes in three levels and decomposing with two-cube kernels in timing-driven mode [6] [3].
Our experimental results are shown in Table 1. The platform is a Linux PC with a 400 MHz Pentium II and 1 GB of memory. In the table, "−ΔDelay" and "ΔLit" mean the decrease in circuit delay and the increase in the number of literals, respectively.
For all circuits, our method has terminated successfully with reasonable CPU time and delay reduction. But Singh's separator set method (speed_up -f) has resulted in zero or very small delay reductions, and the selection function method (speed_up) has aborted on three large circuits: s13207, s38417, and s38584. The BDDs for the selection functions cannot be constructed even with 1 GB of memory. We have found that speed_up (SIS-1.2) does not use a good BDD variable ordering for the selection function. Thus we have modified speed_up to use a depth-first BDD variable ordering [7], and obtained the new results of Table 2. However, speed_up still aborts on s13207. A dynamic variable ordering might complete this circuit, but it would take much more CPU time.
The maximum memory usage of our method has been 85 MB, for s38417. Since the memory usage of our method grows linearly, our method can handle circuits more than ten times larger than speed_up can with the same memory. This shows the robustness of our method for finding a selection set.
For almost all circuits, our delay reduction is the same as or slightly better than that of speed_up. We conclude that the delay reduction capability of our method is comparable to that of speed_up.
But our method tends to take more CPU time than speed_up. We see two reasons. One reason is an implementation issue: our method runs many iteration loops so that it can obtain the maximal delay reduction. The other reason is our poor resolution procedure for conflicting selection sets. The local transformation "collapse-and-decomposition" may cause a conflicting selection set, i.e., two or more local transformations in the same selection set cannot be applied simultaneously because their collapsed node sets overlap each other.
An algorithm for finding a selection set (both ours and speed_up) assumes that all local transformations can be applied independently. speed_up, however, has a smarter resolution procedure: when it finds a conflicting selection set, it makes a new selection function by AND-ing the original selection function with the negation of the conflicting selection set (the function of a selection set is the AND of all positive literals, each corresponding to a local transformation in the selection set). The next time a selection set is sought, the optimum non-conflicting selection set is obtained, since the previously found conflicting selection sets are disabled in the selection function. But this resolution procedure makes the BDD of the selection function much larger, and more likely to result in memory overflow.
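The resolution step described above, AND-ing the selection function with the negation of a conflicting selection set, can be illustrated with a small Python sketch. It reuses the selection function F(O1) = X · (W + T) from the introduction, represents functions as plain Python predicates rather than BDDs, and assumes purely for illustration that T and X conflict. The per-node costs and the helper names `block` and `cheapest` are hypothetical.

```python
from itertools import combinations

# Hypothetical per-node costs; only the totals 9 and 11 appear in the text.
COST = {"T": 4, "W": 6, "X": 5}


def f(chosen):
    """Selection function F(O1) = X . (W + T)."""
    return "X" in chosen and ("W" in chosen or "T" in chosen)


def block(func, conflicting):
    """Return func AND NOT(AND of all literals in `conflicting`).

    Every set that contains all conflicting transformations is disabled.
    """
    return lambda chosen: func(chosen) and not conflicting <= chosen


def cheapest(func):
    """Brute-force search for the minimum-cost set in the on-set of func."""
    candidates = [set(c)
                  for k in range(len(COST) + 1)
                  for c in combinations(sorted(COST), k)
                  if func(set(c))]
    return min(candidates, key=lambda s: sum(COST[n] for n in s), default=None)


before = cheapest(f)
after = cheapest(block(f, {"T", "X"}))   # assume T and X cannot both be applied
print(sorted(before), sorted(after))
```

Under these assumptions the search first returns {T, X}; once that set is disabled it falls back to {W, X}. In a BDD-based implementation each such blocking term is an additional conjunction, which is consistent with the growth in BDD size noted above.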
Table 2: speed_up with BDD variable ordering
Circuit | −ΔDelay | ΔLit | CPU
s13207 | 3 | Memory Overflow | 1684
s38417 |
Conclusions
In this paper we have proposed a new method of finding the optimum selection set for re-synthesizing a large circuit. Our idea of a "padding node" is simple, but its combination with a network flow algorithm is very effective and robust. Compared with Singh's methods, our method has reasonable delay optimization capability and is much more robust.
In future work, we will investigate a smarter resolution procedure for conflicting selection sets. We will also evaluate our method on technology-dependent circuits.
