Buffer insertion is a technique used either to increase the driving power of a path in a circuit, or to isolate large capacitive loads that lie on noncritical or less critical paths. Gate sizing sets the sizes of gates within a circuit to achieve a given timing specification. Traditional design techniques perform gate sizing and buffer insertion as two separate and independent steps during synthesis. However, until sizing is performed, any information on capacitive loads is incomplete, and a buffer insertion algorithm must therefore operate with incomplete information, leading to suboptimal results. Moreover, the insertion of buffers can change the structure of the circuit sufficiently that it leads to a different sizing solution from that of the unbuffered circuit. These techniques of buffer insertion and sizing are therefore intimately linked, and it makes sense to integrate them into a single optimization.
given by merely using a TILOS-like gate sizing algorithm alone, as is illustrated by several area-delay tradeoff curves shown in this paper.
Introduction
While a combinational CMOS circuit with minimum-sized transistors has a small area, its delay may not be acceptable. It is often possible to reduce the delay of such a circuit, at the expense of increased area, by increasing the sizes of certain transistors in the circuit. The well-studied optimization problem that deals with this area-delay tradeoff is known as the sizing problem [1, 2, 3], and it is often formulated as

$$\text{minimize } \mathit{Area} \quad \text{subject to } \mathit{Delay} \le T_{spec}. \tag{1}$$

In some formulations, it is the power that is minimized instead of the area. For edge-triggered circuits, we need consider only one combinational subcircuit at a time, minimizing its area while meeting the timing requirement that the delay of each combinational segment should satisfy the clock period. Therefore, the problem of sizing even a very large circuit can be decomposed into individual problems of sizing individual combinational blocks to meet the clock period, and the problem complexity is considerably reduced. For the remainder of this paper, we will therefore assume that the circuit is purely combinational.

For a given combinational circuit, the nature of the area-delay tradeoff curve for gate sizing is as shown in Figure 1. Typically, a small amount of sizing is adequate to reduce the delay corresponding to the unsized circuit, $d_{unsized}$, in order to meet a loose delay specification. However, as the specification is tightened, the circuit has to be sized by greater degrees, until we reach the knee of the curve, where it must be sized tremendously to achieve further delay reduction. Further, it is impossible to reduce the delay of a circuit indefinitely through sizing, and there is a minimum achievable delay, $d_{min}$, that cannot be bettered through sizing.
Note that gate sizing does not change the topology of the circuit, but merely changes the sizes of individual transistors within gates. We note that some gates in a circuit can be sized excessively because of the large loads that they drive. The appropriate insertion of buffers in a circuit can be used to prevent excessive sizing while meeting delay specifications. In fact, as we will see, buffer insertion in conjunction with sizing often permits greater circuit delay reductions than sizing alone.
Traditionally, gate sizing and buffer insertion (the "fanout problem" [4, 5]) have been carried out separately and at different stages of the design process.¹ However, as sizing changes the capacitances driven by various gates, the locations of high-capacitance nodes are accurately established only during sizing, and any optimizations performed before sizing are necessarily based only on educated guesses. Therefore, it is useful to combine the two optimizations into a single step, and this is the objective of this research.
In this paper, we first present the delay model used here, and then list the situations in which it is advantageous to insert buffers. Next, we present an algorithm to combine sizing with buffer insertion, and show that the application of these two transformations in unison can provide significant benefits.
Delay and area modeling
2.1 Transistor level modeling
We first show how an n-transistor of width $w_{n,i}$ is modeled by a set of capacitances and resistors. A p-transistor of width $w_{p,i}$ is similarly modeled. Since all the transistors are set to minimum length, the capacitances can be modeled in terms of only the transistor widths. For an n-transistor, we can write the source/drain capacitance as $C_{sdn}^{i} = C_{d,n1} w_{n,i} + C_{d,n2}$, and the gate capacitance as $C_{gn}^{i} = C_{g,n1} w_{n,i} + C_{g,n2}$, where $C_{d,n1}$, $C_{d,n2}$, $C_{g,n1}$ and $C_{g,n2}$ are constants. The on-resistance, $R_{i,n}$, of an n-transistor is given by $R_{i,n} = R_n / w_{n,i}$. As in previous work (for example, [1, 2]), the circuit area is modeled as the sum of all transistor sizes.

¹ The fanout problem, however, only tackles what we will later refer to as Type B buffer insertion.
At the gate level, each gate $G_i$ is parameterized with all n- (p-) transistor sizes set to $w_{n,i}$ ($w_{p,i}$) and is modeled by an equivalent inverter. In this implementation, only static CMOS gates are considered. All transistors of the same type in a gate are assumed to have a uniform size. The ideas presented in this work are also applicable to the case where every transistor is allowed to have a different size. The pull-up (pull-down) structure is represented by an equivalent inverter with a p-transistor (n-transistor) size of $S_{p,i}$ ($S_{n,i}$) that corresponds to the worst-case situation; this number is referred to as the gate size. The relation between the gate sizes in the equivalent inverter and the transistor widths in the gate can easily be computed for various types of gates. For example, for a $k$-input NAND gate, $S_{n,i} = w_{n,i}/k$ and $S_{p,i} = w_{p,i}$.²
The capacitive loading, $C_L$, of gate $G_i$ can be calculated from the transistor sizes of its fanouts as follows:
where $C_{intrinsic}$ corresponds to the source and drain capacitance connected to the output node of $G_i$. The wire capacitance values are based on the placement.
Delay computations
We first demonstrate the calculation of the step delay, i.e., the delay under the assumption that the input to each gate is a step transition with zero transition time. Next, we will show how this assumption is relaxed to allow for the realistic case where nonzero transition times are possible.
The Elmore fall step delay, $t_f^i$, of gate $G_i$ can then be obtained from $C_L$ and $S_{n,i}$ as [6, 7]

$$t_{f,step}^{i} = \frac{R_n C_L}{S_{n,i}}. \tag{3}$$

The rise delay is similarly obtained as $t_{r,step}^{i} = R_p C_L / S_{p,i}$. To allow for the effect of nonstep input transitions, we use the inverter delay model presented in [8]. The effects of the input-to-output coupling capacitance and of the input slope are considered in this model. Consider the CMOS inverter structure driving a load $C_L$, as shown in Figure 2, where $C_M$ is the coupling capacitance between the input and the output nodes. When the applied input [...] A similar expression is used for the rise transition. The value of the input transition time is taken to be twice the Elmore delay of the preceding gate, as in [2].

² Notice that $S_{p,i}$ is $w_{p,i}$ and not $k\,w_{p,i}$ since, in the worst case, only one of the $k$ transistors in parallel will be on.
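The equivalent-inverter sizes and the step-delay expressions above can be illustrated with a minimal sketch. The function names and the numeric values are assumptions made here for illustration; only the relations $S_n = w_n/k$, $S_p = w_p$, and delay $= R\,C_L/S$ come from the text.

```python
def nand_equivalent_sizes(w_n, w_p, k):
    """Equivalent-inverter sizes (S_n, S_p) of a k-input NAND gate.

    The k series n-transistors behave like one transistor of width w_n / k.
    In the worst case only one of the parallel p-transistors conducts,
    so the equivalent p size is simply w_p.
    """
    if k < 1:
        raise ValueError("k must be at least 1")
    return w_n / k, w_p


def elmore_step_delays(r_n, r_p, c_load, s_n, s_p):
    """Return (fall, rise) step delays: R_n * C_L / S_n and R_p * C_L / S_p."""
    return r_n * c_load / s_n, r_p * c_load / s_p


# Illustrative numbers only (not taken from the paper).
s_n, s_p = nand_equivalent_sizes(w_n=6.0, w_p=4.0, k=3)
fall, rise = elmore_step_delays(r_n=2.0, r_p=4.0, c_load=10.0, s_n=s_n, s_p=s_p)
```

Doubling both transistor widths in this sketch halves both step delays for a fixed load, which is the basic lever that sizing exploits.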
Using this method to calculate the delays of individual gates, the PERT procedure is used to find the critical path in the circuit, as in [1].
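A PERT-style traversal of this kind amounts to a longest-path computation on the circuit graph. The following is a minimal sketch, assuming the circuit is given as a dictionary of fanouts and a dictionary of gate delays (both hypothetical data structures, not the authors' implementation).

```python
from collections import deque


def critical_path(fanouts, gate_delay):
    """Longest-delay path in a combinational DAG (PERT-style traversal).

    fanouts: dict mapping every gate name to a list of its fanout gates.
    gate_delay: dict mapping every gate name to its delay.
    Returns (circuit_delay, list_of_gates_on_the_critical_path).
    """
    indegree = {g: 0 for g in fanouts}
    for g in fanouts:
        for f in fanouts[g]:
            indegree[f] += 1

    # Arrival time at the output of each gate, and the fanin that set it.
    arrival = {g: gate_delay[g] for g in fanouts}
    pred = {g: None for g in fanouts}

    # Process gates in topological order, starting from the primary inputs.
    ready = deque(g for g in fanouts if indegree[g] == 0)
    while ready:
        g = ready.popleft()
        for f in fanouts[g]:
            if arrival[g] + gate_delay[f] > arrival[f]:
                arrival[f] = arrival[g] + gate_delay[f]
                pred[f] = g
            indegree[f] -= 1
            if indegree[f] == 0:
                ready.append(f)

    # Trace back from the latest-arriving gate to recover the path.
    end = max(arrival, key=arrival.get)
    path = []
    node = end
    while node is not None:
        path.append(node)
        node = pred[node]
    return arrival[end], path[::-1]
```

Each gate and each edge is visited once, so the traversal runs in time linear in the size of the circuit graph.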
The proposed algorithm also requires the computation of the sensitivity of the gate delay with respect to a gate size. It is well known [1] that the step delay sensitivity to a gate size can be computed by considering only that gate, whose resistance is affected by the gate size, and its fanin gates, whose load capacitances are affected by the size of that gate. Therefore, the delay sensitivity computation under step inputs is a very local computation.
Under the improved delay model above that considers input transition times, the size of a gate affects not only the delay of that gate and its fanin gates, but also the delay of all gates in the transitive fanout. The delays of the gates in the transitive fanout depend on their input slew rates, which, in turn, depend on the delay of the current gate. However, it can easily be shown from the application of Equation (5) that, for real parameter values, the effect of changing a gate size is vastly diluted as one moves further away from the gate along its transitive fanout. For real circuits, we found that the size of a gate affects only the current gate, its fanin gates, its immediate fanout gates, and their fanouts. Therefore, for all practical purposes, the sensitivity computation remains an inexpensive local computation, even under the improved delay model that considers input transition times.
Buffer insertion
The essential idea of buffer insertion is to reduce the delay at high capacitance nodes by reducing the load on the driving gate. To maintain signal polarities, we assume that each buffer consists of a pair of inverters that may be sized appropriately. Thus, the addition of each buffer implies the addition of four new transistors to the circuit.
Notions of criticality
As a preliminary step, we define a critical path as any path that violates the timing specification. We also introduce a nonquantitative and somewhat fuzzy term, the criticality of a path. Roughly speaking, the criticality of a path depends on the magnitude of the violation, so that paths with large violations are identified as being highly critical, and those with small violations are only mildly critical. This notion is important since we observe that the greater the criticality of the path, the larger the amount of sizing required for the path to meet specifications. Later in this paper, we will work towards developing measures to quantify the criticality of a path. Generally speaking, it has been our experience that buffer insertion is useful only for highly critical paths. This experience is based on our experimental results, which use a measure of criticality, developed later in this paper, to quantify the criticality of a path. For mildly critical paths, it may be more advantageous to use sizing than buffer insertion. The intuition behind this is that mildly critical paths can be made to meet timing specifications through a small amount of sizing; inserting a buffer implies an increase in area corresponding to the four new transistors that constitute the buffer, which is likely to be larger. Moreover, the addition of an excessive number of buffers can actually increase the delay of some paths of the circuit, and therefore we add them only where we must, namely, to reduce the delays on the highly critical paths.
Types of buffer insertion strategies
We identify two situations in which the insertion of buffers is advantageous, which we will refer to as Type A and Type B buffer insertion scenarios, respectively. As shorthand notation, we will refer to an output H of a gate G as being highly critical if some highly critical path passes through gates G and H; similarly, we also refer to mildly critical and noncritical outputs.
Type A: If a gate whose outputs are all highly critical drives a large capacitive fanout, buffer insertion can help in reducing the delays of these paths. Figure 3 shows the situation of Type A buffer insertion.³ By choosing an appropriate size of buffer, the fanout capacitance of gate G may become smaller, and the sum of the delays of the buffer and gate G may be smaller than the delay of gate G in the unbuffered circuit.
Type B: If a gate has some highly critical outputs and some mildly critical and noncritical outputs, then one may isolate the capacitance of the noncritical outputs from the highly critical path by inserting a buffer, as shown in Figure 4. The mildly critical paths constitute a gray area and must be assigned to be either critical or noncritical, based on measures that we will develop later in this paper. Since the fanout capacitance of gate G becomes smaller, the RC delay of G is reduced, and therefore the delay along the highly critical paths is reduced.

³ The essential idea here is not dissimilar to the Mead-Conway idea of using chains of inverters to drive a large load, with a ratio of $e$ minimizing the delay. However, we differ in the following ways: (a) our objective is not to minimize the delay but to meet a specification; (b) if the circuit as a whole is anything other than a chain of inverters, it is not possible to use the constant-ratio idea to minimize the delay of the circuit.
As a side effect, the delay along the noncritical paths may be increased. The additional delay introduced along noncritical paths that became critical after buffer insertion can be made to meet specifications through a small amount of sizing.
The challenge here is to quantify measures of criticality, and to use them to determine appropriate locations for buffer insertion.
Interestingly, the work in [9], which was performed independently, also uses similar terminology for Type A and Type B buffers. However, that work concentrates on reducing wire delay at the post-layout phase, an issue that we do not address here.
Examining the effect of buffer insertion
We will examine the effect of buffer insertion through a simple example. Consider the gate $G$ shown in Figure 5 driving gates $G_a, G_b, G_c, \ldots, G_f$. If all of the outputs are on highly critical paths, then we would insert a Type A buffer immediately after $G$, driving all fanouts. The insertion of this buffer would change the fanout capacitance of $G$, and therefore its delay would change from $D_{G,old}$ to $D_{G,new}$. If the delay of the buffer is $D_{buf}$, then the change in the delay to the most critical output would be $D_{G,new} + D_{buf} - D_{G,old}$, since all other gate delays in the circuit would be unaffected.
For the buffer insertion to be advantageous, this value must be negative.
Note also that this transformation would change the delay of any path passing through $G$ by the same amount, and would therefore decrease it; for any path that does not pass through $G$, the delay remains unaffected. Therefore, this transformation either reduces the delay at each primary output, or leaves it unaffected.
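The sign test on $D_{G,new} + D_{buf} - D_{G,old}$ can be tried out with a small sketch. It assumes a simplified RC model in which one resistance value serves both transitions and intrinsic capacitances are ignored; the function name, parameters, and numbers are hypothetical and are not the delay model used in the paper.

```python
def type_a_delay_change(r, s_gate, c_load, c_gate, s1, s2):
    """Change in delay to the fanouts of gate G after a Type A buffer.

    r: on-resistance of a unit-size device (assumed equal for both edges)
    s_gate: size of the driving gate G
    c_load: total fanout capacitance originally driven by G
    c_gate: input capacitance per unit of inverter size
    s1, s2: sizes of the two inverters forming the buffer
    A negative return value means the buffer is advantageous.
    """
    d_g_old = r * c_load / s_gate            # G drives the full load
    d_g_new = r * (c_gate * s1) / s_gate     # G drives only the first inverter
    d_buf = r * (c_gate * s2) / s1 + r * c_load / s2
    return d_g_new + d_buf - d_g_old


# A heavy load benefits from a sized buffer; a light load does not.
heavy = type_a_delay_change(r=1.0, s_gate=1.0, c_load=100.0, c_gate=1.0, s1=3.0, s2=10.0)
light = type_a_delay_change(r=1.0, s_gate=1.0, c_load=2.0, c_gate=1.0, s1=1.0, s2=1.0)
```

In this toy model the minimum-sized buffer on the light load makes the path slower, which is consistent with the observation that the buffer size must be chosen appropriately.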
Now consider Type B buffer insertion in Figure 5, and assume that $G_a$ and $G_b$ are highly critical, $G_e$ and $G_f$ are noncritical, and $G_c$ and $G_d$ are critical, but not highly critical. We will refer to a gate $G_i$ as being buffered if the Type B buffer is placed between the output of $G$ and $G_i$, and we will consider it unbuffered otherwise. We can now place a single Type B buffer using the following ideas:
$G_a$ and $G_b$ must certainly be unbuffered.
$G_e$ and $G_f$ should be buffered, since they only add to the capacitance being driven by gate $G$. Although it is possible that $G_e$ and $G_f$ may become critical outputs after buffer insertion, they would, at worst, probably be very mildly critical, since they were noncritical before buffer insertion.
The key issue is the status of $G_c$ and $G_d$. If they are buffered, then they may become highly critical after buffer insertion. On the other hand, if they are not buffered, the capacitance at $G$ may be too high and the delay of gate $G$ may not be reduced sufficiently by buffer insertion. Therefore, the best solution may buffer off none, one, or both of the gates $G_c$ and $G_d$, and a good criterion is required to determine which of these should be chosen.
On complexity and convexity issues
The transistor sizing problem is well known to be equivalent to a convex programming problem [1, 2] when the topology of the circuit is fixed, since the area objective and the circuit path delays can be represented as posynomial [10] functions of the transistor sizes.⁴
However, when the structure of the circuit is allowed to change, this is no longer true. If there are $b$ possible buffer locations associated with a path, then there are $b$ possible delay functions $f_1, f_2, \ldots, f_b$, each a posynomial, of which one (or at most a few) is optimal. Note that an optimal circuit is the circuit with the minimum area for the given delay specification. The path delay is thus $f_1$ or $f_2$ or ... or $f_b$, which cannot be represented as a convex programming problem; it may, however, be written as a mixed integer nonlinear programming problem, and its solution is not easy to find. A second pointer to its difficulty is that even a special restriction of the problem, that of finding the optimal locations for Type B buffers in an unsized circuit, is NP-complete [5]. Therefore, we resort to heuristic methods for solving the problem.
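The convexity claim for a fixed topology follows from a short, standard derivation, sketched here. The symbols $\gamma_j$ (coefficients) and $\alpha_{ij}$ (exponents) are names assumed for illustration. Substituting $w_i = e^{x_i}$ into a posynomial gives a sum of exponentials of linear functions:

```latex
g(w) = \sum_j \gamma_j \prod_{i=1}^{n} w_i^{\alpha_{ij}}, \quad \gamma_j > 0
\qquad \xrightarrow{\; w_i = e^{x_i} \;} \qquad
h(x) = \sum_j \gamma_j \exp\!\Big( \sum_{i=1}^{n} \alpha_{ij}\, x_i \Big)
```

Each term is a positive multiple of the exponential of a linear function of $x$ and is therefore convex, and a sum of convex functions is convex. A constraint of the form $h(x) \le T_{spec}$ therefore defines a convex feasible set. Once the choice among $f_1, \ldots, f_b$ becomes part of the problem, the delay is no longer a single such function, which is why this argument does not carry over to buffer insertion.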
Outline of the algorithm
The procedure developed here enhances the TILOS algorithm [1], which operates iteratively, identifying the most critical path in every iteration. The sensitivity of the path delay, $D$, to the area, $A$, given by $\partial D / \partial A$, is computed for all of the transistors along the critical path, and the transistor with the most negative sensitivity is bumped up by a factor, Bumpsize. Bumpsize is typically set to a value that is just larger than one, and values between 1.1 and 1.5 have been seen to work well. The procedure continues until all timing specifications are met.

As in the TILOS algorithm, we begin with the unsized circuit as provided to us. We continue optimizing the circuit until all the delay constraints are met at every circuit output. Until that is achieved, in each iteration, we identify the most critical path, i.e., the path with the largest violation of the timing specification. We attempt to improve the delay along this path by one of several possible transformations: (a) bumping up the size of some transistor along the path; (b) inserting a Type A buffer along the critical path; (c) inserting a Type B buffer to isolate noncritical paths from critical paths.

⁴ A posynomial is a function $g$ of a positive variable $w \in R^n$ that has the form $g(w) = \sum_j \gamma_j \prod_{i=1}^{n} w_i^{\alpha_{ij}}$, where the exponents $\alpha_{ij} \in R$ and the coefficients $\gamma_j > 0$. Roughly speaking, a posynomial is a function that is similar to a polynomial, except that (a) the coefficients $\gamma_j$ must be positive, and (b) an exponent $\alpha_{ij}$ could be any real number, and not necessarily a positive integer, unlike the case of polynomials. A posynomial has the useful property that it can be mapped onto a convex function through an elementary variable transformation [10], $w_i = e^{x_i}$.
The general philosophy behind the algorithm is shown below. It should be stressed that although this is the general philosophy, the actual implementation is somewhat different and will be elaborated on in subsequent sections. The iterations end if further sizing does not result in delay reduction and, in fact, increases the circuit delay by a significant amount.
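As a rough illustration of the TILOS-style sizing loop that this procedure builds on, the sketch below sizes a chain of inverters. It is not the authors' pseudocode: the chain delay model, the finite-difference sensitivity, and all names and parameter values are assumptions made for illustration, and the buffer insertion moves are omitted.

```python
def chain_delay(sizes, c_load, r=1.0, c_gate=1.0, c_par=1.0):
    """Elmore-style delay of an inverter chain (toy model, not the paper's)."""
    total = 0.0
    for i, x in enumerate(sizes):
        # Each stage drives its own parasitic plus the next stage (or the load).
        load = c_gate * sizes[i + 1] if i + 1 < len(sizes) else c_load
        total += r * (c_par * x + load) / x
    return total


def tilos_like_sizing(sizes, c_load, t_spec, bumpsize=1.2, max_iters=1000):
    """Greedily bump the stage with the most negative dD/dA until t_spec is met."""
    sizes = list(sizes)
    for _ in range(max_iters):
        delay = chain_delay(sizes, c_load)
        if delay <= t_spec:
            break
        best_index, best_sens = None, 0.0
        for i, x in enumerate(sizes):
            trial = sizes[:i] + [x * bumpsize] + sizes[i + 1:]
            d_delay = chain_delay(trial, c_load) - delay
            d_area = x * bumpsize - x
            sens = d_delay / d_area          # finite-difference dD/dA
            if sens < best_sens:
                best_index, best_sens = i, sens
        if best_index is None:               # no bump reduces the delay
            break
        sizes[best_index] *= bumpsize
    return sizes, chain_delay(sizes, c_load)


final_sizes, final_delay = tilos_like_sizing([1.0, 1.0, 1.0], c_load=20.0, t_spec=15.0)
```

In the combined algorithm, each iteration would also weigh a Type A or Type B buffer insertion against the best sizing move before committing to one of them.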
In the following sections, we will consider the problems of developing figures of merit for bumping up a transistor and for inserting a Type A or a Type B buffer. Since we have followed the TILOS template, we will also use the most negative sensitivity $S_T = \partial D / \partial A$ of a transistor to compare the relative figures of merit. Note that $S_T$ corresponds to the delay reduction caused by bumping up the transistor size; this fact will be used when we develop comparable figures of merit for inserting Type A and Type B buffers.
Type B buffer insertion
The purpose of using Type B buffers is to insulate the noncritical paths from the highly critical paths, thereby enabling greater amounts of delay reduction for the circuit as a whole.
As stated in the outline of the algorithm, the objective is to determine a figure of merit that can reasonably be compared to the figure of merit for sizing, namely, the sensitivity of the most sensitive gate. Let us temporarily assume that we have developed a way of measuring the criticality of a gate output, and that we can recognize the highly critical outputs. We will show the precise method by which this is achieved in Section 4.1.3.
While considering a candidate Type B buffer location at one of the outputs of a gate, we first consider the delay along its highly critical fanouts. By definition, a Type B buffer will always reduce the delay to a highly critical fanout, and this is achieved at the expense of an increase in the area; the area increase corresponds to the area of the inserted minimum-sized buffer.
Therefore, a reduction in the delay by an amount $\Delta D$ can be effected by an area increase of $\Delta A$. We must now estimate the amount of area, $\Delta A_T$, required by the sizing procedure to achieve the same delay reduction. If $\Delta A \le \Delta A_T$, then we insert the Type B buffer.
To fairly compare the effects of sizing and Type B buffer insertion, we must estimate the area that sizing alone would need to achieve the same delay reduction. It is tempting to estimate $\Delta A_T$ as $-\Delta D / S_T$, recalling that $S_T$, the figure of merit for transistor sizing, is the most negative sensitivity of a critical path transistor, as shown by curve (a) in Figure 6. However, the corresponding change in area, $\Delta A_a$, is only a lower bound on the value of $\Delta A_T$ and is typically not a very tight bound. This is because the sensitivity $S_T$ corresponds to a small perturbation, whereas the change $\Delta D$ is large, so a linear approximation would provide optimistic estimates of $\Delta A_T$. Moreover, for a transistor with size $x$, it has been shown that
where $K_1$, $K_2$ are independent of $x$ [1]. The above equation is accurate for the delay of the current critical path, but not for the delay of the entire circuit, which is the maximum of the path delays; note that the critical path may change when a transistor size is altered. A second idea would be to use Equation (6) to estimate $\Delta A_T$, as shown by curve (b) in Figure 6. However, this estimate, $\Delta A_b$, is accurate only if this same transistor is critical in every sizing step involved in reducing the delay by $\Delta D$. This is typically not true, and therefore, such an expression would provide an upper bound on the area.
In most cases, the actual area-delay curve would lie between the two bounds, as shown by curve (c) in Figure 6, and our problem is to determine the shape of this curve and the value of $\Delta A_T$. An additional complication is as follows. Consider, for a moment, the TILOS algorithm for transistor sizing, and let $D$ be the delay of the circuit during the current iteration. Then, bumping up the size of an individual transistor causes a delay reduction of $\Delta D$ on the current most critical path and an area increase of $\Delta A$. However, the circuit delay is not necessarily reduced by $\Delta D$, since the bumping operation could cause a different path to become the most critical path. This approximation is justifiable in TILOS because the area and delay changes in each iteration are very small. Note that if we wanted to be exact, we should have considered the area increase required to constrain all path delays to $D - \Delta D$.
However, if the delay is changed from $D$ by a large amount, $\Delta D$, as is the case in our situation, such an approximation is invalid, and we must find the area increase required by sizing to ensure that the delays of all paths (and not just the current most critical path) are less than $D - \Delta D$.
In other words, all of these paths must be sized appropriately, and $\Delta A$ must be computed by considering the effect of all of these paths, and not just the most critical path as in TILOS.
To take care of this problem, we consider all primary outputs, and find the area increase required to ensure that the maximum delay over all outputs (and not just at the critical output) is no larger than $D - \Delta D$. To estimate the value of $\Delta A_T$, given a specific buffer insertion point, we first calculate the change in the delay of the circuit due to the insertion of a minimum-sized Type B buffer. At each primary output $i$, we use an extrapolation method to estimate the area increase, $\Delta a_i$, required to match the circuit delay reduction. We then calculate the figure of merit for sizing as

$$\Delta A_T = \sum_{i} \Delta a_i, \tag{7}$$

where the sum is taken over all primary outputs.
Details of the procedure
The extrapolation procedure is implemented as follows. For each primary output, we store the effect of the most recent sizing steps as a table of delay versus circuit area. We use those data to extrapolate the change in area corresponding to $\Delta D$ at every output; these values are summed up to give $\Delta A_T$, as described in the last section. Specifically, we use Lagrangian extrapolation [11] to estimate $\Delta A_T$ for $\Delta D$. We found that a fourth order polynomial approximation was adequate.
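The extrapolation step can be sketched as follows, assuming each primary output keeps a list of (delay, area) pairs with distinct delay values. The function names and the table layout are hypothetical; only the use of Lagrange polynomials, the fourth-order default, and the summation over outputs come from the text.

```python
def lagrange_extrapolate(xs, ys, x):
    """Evaluate at x the unique polynomial through the points (xs[i], ys[i])."""
    total = 0.0
    for i, (xi, yi) in enumerate(zip(xs, ys)):
        term = yi
        for j, xj in enumerate(xs):
            if j != i:
                term *= (x - xj) / (xi - xj)   # requires distinct xs
        total += term
    return total


def estimate_sizing_area(history, delta_d, order=4):
    """Sum, over primary outputs, the extrapolated area increase for delta_d.

    history: dict mapping each primary output to a list of (delay, area)
             pairs from the most recent sizing steps, oldest first.
    delta_d: the delay reduction that sizing alone would have to match.
    """
    total = 0.0
    for points in history.values():
        recent = points[-(order + 1):]        # order + 1 points fix the polynomial
        delays = [d for d, _ in recent]
        areas = [a for _, a in recent]
        target = delays[-1] - delta_d         # delay after the desired reduction
        total += lagrange_extrapolate(delays, areas, target) - areas[-1]
    return total
```

Polynomial extrapolation can behave poorly far outside the tabulated range, so a sketch like this is only reasonable when $\Delta D$ is comparable to the delay range covered by the stored sizing steps.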
If $\Delta A_T$, the estimated area required to achieve the delay reduction through sizing alone, is no smaller than $\Delta A_B$, the area of a minimum-sized Type B buffer to be inserted at the chosen point, then the buffer is inserted. If not, the algorithm abandons the Type B buffer insertion in the current iteration, and then chooses either a Type A buffer insertion or a sizing step.
Finding an appropriate location for buffer insertion
The criterion for Type B buffer insertion is to isolate the less critical paths from the more critical ones; if the total capacitance of the less critical paths is substantial, then significant delay improvements are possible. Therefore, our first challenge is to develop a measure for criticality, which is key to the success of this algorithm and is required to partition the fanouts of a gate into a "critical" and a "noncritical" set, as shown in Figure 7. We will also quantify the criticality of the mildly critical paths, which constitute a gray area, and develop measures to decide whether they should be considered critical or noncritical.

We first define a quantity $\epsilon_i$ for each gate $i$, based on the delay sensitivity of the gate and on $\Delta x_i$, the amount by which the gate size would be increased if it were to be bumped up. Therefore, $\epsilon_i$ estimates the reduction in the gate delay through a possible bumping up operation. Note that gates with a positive sensitivity are assigned an $\epsilon_i$ of zero, since the gate size would be left unchanged if the bumping operation were to increase the delay. We define a measure for the criticality that we call $\delta$, associated with each gate fanout. This measure is related to the amount by which the delay of a circuit can be reduced and to the delay along a path. Fanouts with larger values of $\delta$ are less critical than those with smaller values.
A backward PERT traversal⁵ is performed from the primary outputs towards the primary inputs (PIs) to calculate the value of $\delta$ for each gate. The value of $\delta$ at each primary output is set to be the difference between the maximum delay at the primary output and the actual delay to that point. Therefore, increasing the path delay to that primary output by $\delta$ will leave the circuit delay unchanged.
If we know the value of $\delta$ for all the fanouts of a given gate $i$, its own value is calculated as

$$\delta_i = \min_{j \in fanouts(i)} \left( \delta_j + slack_j \right) + |\epsilon_i|, \tag{9}$$

where $slack_j$ represents the slack at fanout $j$. The slack is defined as the amount by which the delay along this path may be increased before it becomes the longest delay path in the circuit. Note that all elements in this equation have dimensions of delay. Therefore, $\delta_i$ is a measure of the amount of delay increase along a path from the gate to any primary output that can be absorbed "easily," either by the slack or by a small amount of sizing. This is consistent with the idea that a high value of $\delta$ at the fanout $j$ of a gate $i$ implies that the maximum delay path from $i$ to a primary output through $j$ is not very critical.
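The backward traversal can be sketched as below. The sketch assumes one reading of the recurrence, in which the criticality measure of a gate is the minimum over its fanouts of (fanout measure plus fanout slack), plus the magnitude of the gate's own estimated delay reduction. The names `delta` and `eps`, the dictionary-based circuit representation, and this reading of the equation are all assumptions made for illustration.

```python
def criticality_measures(fanouts, slack, eps, po_value):
    """Backward traversal computing the criticality measure of every gate.

    fanouts: dict gate -> list of fanout gates (empty list at primary outputs)
    slack: dict gate -> slack at that gate
    eps: dict gate -> estimated delay change from bumping (0 if bumping hurts)
    po_value: dict primary-output gate -> its starting value
    """
    delta = {}

    def visit(g):
        if g in delta:                      # already computed (shared fanout)
            return delta[g]
        if not fanouts[g]:                  # primary output: use the given value
            delta[g] = po_value[g]
        else:
            delta[g] = min(visit(j) + slack[j] for j in fanouts[g]) + abs(eps[g])
        return delta[g]

    for g in fanouts:
        visit(g)
    return delta
```

The recursion memoizes each gate, so every gate and edge is handled once; for very deep circuits an explicit reverse topological order would avoid Python's recursion limit.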
The next challenge is to use this measure of criticality to determine the best location at which a Type B buffer should be inserted to improve the current most critical path. The steps involved in determining the buffer location can now be summarized as follows:
1. Find the gate $i$ with the maximum fanout capacitance along the most critical path of the circuit. We will consider inserting a Type B buffer at the output of this gate to partition the critical and noncritical fanouts.⁶
2. Find the maximum value of $\delta_j$ over all fanouts of gate $i$; let $\delta_{max}$ be this maximum value. All fanouts $j$ whose $\delta_j$ is at least $c_1 \delta_{max}$, where $c_1 \le 1$ is an empirically tuned number, are placed in the noncritical set.⁷ This has the effect of placing all fanout gates with high values of $\delta$ in the noncritical set. However, if too many gates are placed in the noncritical set, the capacitance to be driven by the Type B buffer may become too high. This could cause its delay to be very large, so that the paths through $i$ and some of the gates in the noncritical set may become very critical.
Therefore, the above classification may be too optimistic, and we apply the next criterion described below.
3. Having determined the noncritical set, we next estimate the effect of inserting a minimum-sized buffer. When a buffer is inserted, the delay of gate $i$ is reduced by an amount $D_{dec}$, which is the delay reduction along the critical paths. Along a noncritical fanout $j$, the delay is increased by $D_{inc} - D_{dec}$, where $D_{inc}$ is the increased delay due to the insertion of a buffer.
Recall that $\delta_j$ is an estimate of the amount by which the delay from $j$ to the primary outputs may be increased before $j$ lies on the most critical path of the circuit. Of this amount, $D_{inc} - D_{dec}$ is consumed by the insertion of a buffer. Therefore, with the insertion of the buffer, we may say that the delay from $j$ to the primary outputs may be increased by $\delta_j - (D_{inc} - D_{dec})$ before $j$ would lie on the most critical path.
The larger this amount, the less critical the path would be after buffer insertion. Therefore, we calculate this quantity for each fanout, and if its value is small, then we remove the fanout $j$ from the noncritical set.
4. For any fanout $j$, if $\delta_j - (D_{inc} - D_{dec}) \le \theta$ for some empirically determined $\theta$, then the gate is moved from the noncritical set to the critical set.
The value of $\theta$ was chosen as $c_2 d_{minsize}$, where $d_{minsize}$ is the delay of a minimum-size inverter driving a minimum-size load. The use of $d_{minsize}$ is purely for normalization purposes, to ensure that the value of $\theta$ is of the correct order of magnitude. The value of $c_2$ is then determined empirically.
It was experimentally found that approximate values of $c_1 = 0.8$ and $c_2 = -0.5$ work well on the circuits that we tested. Note that a negative value of $c_2$ causes a negative value for $\theta$.
Recall that a mildly critical path (using the terminology of Figure 4) may also be buffered off. The negative value for $\theta$ corresponds to a mildly critical path where the insertion of a buffer may actually increase the path delay enough to cause the value of $\delta_j - (D_{inc} - D_{dec})$ to be negative. When this value is negative, it might seem that the path delay would increase after buffer insertion. However, in our calculations, we assumed a minimum-sized buffer, and if the value of $\delta_j - (D_{inc} - D_{dec})$ is very slightly negative, we may recover from this easily by sizing the buffer by a small amount. Therefore, in practice, we found that a small negative value for $c_2$ gave good results.
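Steps 2 through 4 above can be sketched as a small partitioning routine. It assumes a nonnegative maximum criticality measure and treats $D_{inc}$ and $D_{dec}$ as fixed inputs, whereas in practice $D_{inc}$ would depend on which fanouts end up behind the buffer. The function name, argument names, and threshold name `theta` are hypothetical; the default constants are the approximate values reported in the text.

```python
def partition_fanouts(delta, d_inc, d_dec, d_minsize, c1=0.8, c2=-0.5):
    """Split the fanouts of a gate into (critical, noncritical) sets.

    delta: dict fanout -> criticality measure (larger means less critical)
    d_inc: delay added on a buffered fanout by the minimum-sized buffer
    d_dec: delay reduction of the driving gate after buffer insertion
    d_minsize: delay of a minimum-size inverter driving a minimum-size load
    """
    delta_max = max(delta.values())
    theta = c2 * d_minsize                       # threshold used in step 4

    # Step 2: fanouts with a large enough measure are tentatively noncritical.
    noncritical = {j for j, v in delta.items() if v >= c1 * delta_max}

    # Steps 3-4: move back any fanout whose remaining margin is too small.
    for j in list(noncritical):
        if delta[j] - (d_inc - d_dec) <= theta:
            noncritical.discard(j)

    critical = set(delta) - noncritical
    return critical, noncritical
```

An empty noncritical set would indicate that no Type B buffer is worthwhile at this gate in the current iteration.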
Type A buffer insertion
During the iterative procedure, we observed that inserting a minimum-sized Type A bu er almost always caused the path delay to increase. However, by appropriately choosing the size of the Type A bu er, delay reductions can be e ected. The following procedure is used to estimate the potential delay reduction through Type A bu er insertion at each gate output:
1. Find the minimum most negative sensitivity among the gates along the most critical path, denoted as @D @x best .
2. For each gate on the most critical path, we calculate the values of D rise and D f a l l , the changes in the rise and fall delays, respectively, i f a T ype A bu er were to be inserted at that gate output. For the bu er insertion step to achieve a useful purpose, it must be ensured that both D rise 0 and D f a l l 0. Keeping the topology of the rest of the circuit constant in particular, keeping the sizes of the gates fanning into and out of the proposed bu er constant, the sizes of transistors in the bu er are estimated 8 .
Only those gates at which both the rise and fall delays can be reduced are considered as candidates for bu er insertion. For these gates, the sensitivity of the bu er, @D @x b u f f e r , is determined for the calculated size. If @D @x b u f f e r @D @x best , then this location is designated as a permitted bu er insertion location. The rationale behind this is that at this point, after bu er insertion, a the delays of all the paths driven by the bu er are smaller than those of prior to bu er insertion, and b the most negative sensitivity v alue on that critical path is made even more negative than before, implying that the increase in area due to bu er insertion could berecouped in future steps through sizing. In other words, the potential for reducing the path delays with bu er insertion is better than the ability without bu er insertion.
3. Among the permitted buffer insertion points in Step 2, the output of the gate $k$ with the best delay reduction is chosen to be the best Type A buffer insertion location. The value of $(D_{rise} + D_{fall})/2$ is used to estimate the effect of buffer insertion on the delay.
4. Having performed a Type A buffer insertion, the buffer and its predecessor gate $k$ are reset to the minimum size to correct for any past over-sizing of $k$. The sizing procedure is permitted to size these gates back up again in subsequent iterations to their optimal sizes, so that the solution is not unduly bound by any incorrect sizing choices that were made before the buffer was added. During this process, we prohibit further Type A buffer insertion until the iterations reach the point where the circuit delay becomes smaller than it was before the insertion of this Type A buffer.
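The selection rule in Steps 1 through 3 can be sketched in a few lines of Python. This is only an illustrative sketch, not the authors' implementation: the `Candidate` record, its field names, and the function `pick_type_a_location` are hypothetical, and the per-gate delay changes and buffer sensitivities are assumed to have been computed elsewhere. The sketch also assumes that a location is permitted when the buffer's sensitivity is strictly more negative than the best sensitivity on the critical path, as the rationale in Step 2 suggests.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Candidate:
    """Hypothetical record for one gate output on the most critical path."""
    name: str
    d_rise: float              # change in rise delay if a Type A buffer is inserted
    d_fall: float              # change in fall delay if a Type A buffer is inserted
    buffer_sensitivity: float  # dD/dx of the buffer at its estimated size


def pick_type_a_location(candidates: List[Candidate],
                         best_sensitivity: float) -> Optional[Candidate]:
    """Return the permitted location with the best estimated delay reduction.

    best_sensitivity is the most negative gate sensitivity on the critical
    path (Step 1). Returns None when no location is permitted.
    """
    permitted = [
        c for c in candidates
        # Step 2: both rise and fall delays must decrease ...
        if c.d_rise < 0 and c.d_fall < 0
        # ... and the buffer must be more sensitive than the best gate
        # (assumed strict inequality).
        and c.buffer_sensitivity < best_sensitivity
    ]
    if not permitted:
        return None
    # Step 3: the average of the two delay changes estimates the effect;
    # the most negative average is the best reduction.
    return min(permitted, key=lambda c: (c.d_rise + c.d_fall) / 2)
```

Returning `None` when nothing is permitted lets the caller fall through to another move, such as ordinary sizing.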
The final algorithm
The pseudocode shown in Section 4 was only a general outline of our procedure, and we may now describe the algorithm more accurately as follows. The algorithm first considers the option of inserting a Type B buffer, then considers a Type A buffer, and finally defaults to transistor sizing if neither is viable. It is possible to consider these in any order, but this ordering was found to work best for the circuit examples that we tried.
We now attempt to provide an estimate of the amount of computation involved in each iteration. While a detailed complexity analysis is unrealistic due to the unpredictability of the number of iterations, it is useful to count the number of computations involved in each iteration of this algorithm.
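The ordering of the three options can be illustrated with a minimal Python sketch. It is not the paper's pseudocode: the function name `next_move`, its boolean arguments, and the returned labels are hypothetical, and the viability tests themselves are assumed to be evaluated elsewhere. The `type_a_locked` flag models the rule that further Type A insertion is prohibited until the circuit delay drops below its value before the last Type A insertion.

```python
def next_move(type_b_viable: bool,
              type_a_viable: bool,
              type_a_locked: bool = False) -> str:
    """Pick the move for one iteration: Type B, then Type A, then sizing."""
    if type_b_viable:
        return "insert_type_b"
    # Type A is skipped while the lock from a previous Type A insertion holds.
    if type_a_viable and not type_a_locked:
        return "insert_type_a"
    # Default: size the most sensitive transistor, as in a TILOS-like step.
    return "size_transistor"
```

For instance, `next_move(False, True, type_a_locked=True)` falls through to `"size_transistor"`, which matches the behavior described for the iterations that follow a Type A insertion.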
We assume that the numbers of gate fanins and fanouts are bounded by a constant, which implies that the delay and sensitivity calculation for each gate can be carried out in constant time. The timing analysis required to calculate the circuit delay is $O(|V| + |E|)$, where $|V|$ is the number of vertices in the circuit graph, corresponding to the number of gates in the circuit, and $|E|$ is the number of edges in the circuit graph, where each edge corresponds to an interconnection from one gate to one of its fanouts. After the first time, however, the computation is significantly reduced since incremental techniques are used. In the worst case, we only process all edges in the fanout cone of the predecessor of the gate that is sized or the buffer that is inserted. The worst-case complexity of this step is also $O(|V| + |E|)$, but the typical update is empirically seen to take much less time. During delay calculation, the slack at each node is also calculated at no additional increase in the computational complexity. The next step involves the calculation of gate sensitivities along the critical path. If $D_c$ is the depth of the circuit (the largest number of gates on any path), then the number of gates on the critical path is bounded above by $D_c$, and the amount of time required to compute the sensitivities and to find the maximum sensitivity is $O(D_c)$. This step is the only computation required for the sizing operation, and is also required by the criteria for Type A and Type B insertion.
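To make the $O(|V| + |E|)$ bound concrete, the following Python sketch performs a full (non-incremental) timing analysis on a circuit graph, computing the circuit delay and the slack at each gate in one forward and one backward pass over a topological order. It is a simplified illustration rather than the authors' code: the function name `timing_analysis` and its dictionary-based inputs are hypothetical, and each gate is assumed to have a single fixed delay, whereas the paper's timing model also accounts for rise and fall delays and input slew.

```python
from collections import deque
from typing import Dict, List, Tuple


def timing_analysis(delays: Dict[str, float],
                    fanouts: Dict[str, List[str]],
                    t_spec: float) -> Tuple[float, Dict[str, float]]:
    """Return (circuit delay, slack per gate) for an acyclic circuit graph."""
    # Topological order by Kahn's algorithm: each vertex and edge is visited once.
    indegree = {g: 0 for g in delays}
    for g in delays:
        for f in fanouts.get(g, []):
            indegree[f] += 1
    queue = deque(g for g, d in indegree.items() if d == 0)
    order = []
    while queue:
        g = queue.popleft()
        order.append(g)
        for f in fanouts.get(g, []):
            indegree[f] -= 1
            if indegree[f] == 0:
                queue.append(f)

    # Forward pass: latest arrival time at each gate output.
    latest_input = {g: 0.0 for g in delays}
    arrival = {}
    for g in order:
        arrival[g] = latest_input[g] + delays[g]
        for f in fanouts.get(g, []):
            latest_input[f] = max(latest_input[f], arrival[g])

    # Backward pass: required time at each gate output, then slack.
    required = {g: t_spec for g in delays}
    for g in reversed(order):
        for f in fanouts.get(g, []):
            required[g] = min(required[g], required[f] - delays[f])
    slack = {g: required[g] - arrival[g] for g in delays}
    return max(arrival.values()), slack
```

On a four-gate example where gate `a` drives `b` and `c`, which both drive `d`, with delays 1, 2, 3, and 1 and a specification of 6, the sketch reports a circuit delay of 5 and the smallest slack along the path through `c`, which is the critical path.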
For Type B buffer insertion, the calculation of values is required in the fanout cone of gates in the critical path. This can be carried out in $O(|V|)$ time. This is followed by a PERT procedure that uses the slacks to compute the values in $O(|V| + |E|)$ time. The gate on the critical path with the highest capacitance is found in $O(D_c)$ time. Since the number of fanouts is bounded, the use of the values to partition the fanouts into critical and noncritical fanouts is completed in constant time. Note that the comparison with sizing, illustrated in Figure 6, is performed in constant time.
For Type A buffer insertion, the amount of time required for steps 1 through 4 in Section 4.2 is $O(D_c)$, assuming, as is seen in practice, that the iterations of step 2 take constant time. Therefore, in summary, each iteration requires $O(|V| + |E|)$ time for timing analysis and slack calculation, $O(D_c)$ time for sensitivity calculation, $O(|V| + |E|)$ time to evaluate Type B buffer insertion, and $O(D_c)$ time to evaluate Type A buffer insertion. Since $D_c \le |V|$, the overall complexity of each step is $O(|V| + |E|)$. We emphasize that due to the incremental techniques used, this is a pessimistic estimate of the complexity.
A brief note on unsuccessful strategies
For completeness, and since a negative result is sometimes also a worthwhile result, we point out a few strategies that seem sound on the surface but were found to be unsuccessful in our experiments.
Incorporating rollback
Since the insertion of a buffer is a drastic step, we considered including rollback, where after each buffer insertion, a certain number of prior sizing steps were nullified. The idea behind rollback is that any sizing steps performed immediately prior to the buffer insertion step may have been suboptimal since they were performed under the assumption that no buffer would be inserted. Therefore, it was thought to be a good idea to consider rolling back to an earlier iteration and resuming the process from there.
However, in practice, we tried several criteria for incorporating rollback and found that the results using rollback were seldom better, while the memory requirements and execution times were phenomenally large. Therefore, we abandoned the idea of incorporating rollback into the optimization process.
Gate cloning
An alternative to buffer insertion would be to clone gates to perform Type B buffer insertion. The primary idea is that instead of creating a new buffer, the use of cloned versions of a gate could be useful. For example, if a Type B buffer is to be inserted at the output of an inverter, then cloning the inverter amounts to an additional expense of two transistors, while inserting a buffer, which consists of two inverters, amounts to an expense of four transistors. Secondly, buffer insertion increases the number of levels in a circuit and therefore may cause unnecessary delay increases along paths that lead to outputs of moderate criticality. The use of cloning may resolve this problem by avoiding the insertion of an additional level of logic in the form of a buffer.
In our implementations, we found that when we added gate cloning to the list of strategies used here, we never obtained better results on any of the benchmark circuits. This can be attributed to the fact that situations such as the above do not occur sufficiently often in these circuits. However, one could certainly generate an artificial example where gate cloning could be useful.
Experimental results
The algorithms described above have been implemented in C on an HP 735 workstation. In Table 1, we present the results on some circuits from the ISCAS85 [13] and LgSynth91 [14] benchmark suites.
For each circuit, the number of gates $|G|$, the unsized delay $D_u$, and the unsized area $A_u$ are shown. For a given moderate timing specification $T_{spec}$, the area of our approach is compared with the area from our implementation of TILOS, which is a direct implementation from [1]. The differences between our implementation of TILOS and [1] are that we replace each gate by an equivalent inverter, characterized by gate sizes $W_n$ and $W_p$, and solve the circuit to find the optimal $W_n$ and $W_p$ (and, in the case of our algorithm, the optimum buffer locations too). Moreover, we use a timing model that takes input slew times into consideration. Next to the area numbers in the table, the numbers of Type A and Type B buffers are shown in brackets. It is seen here that most of the buffers in this table are of Type B. Although not shown in this table, it was observed that for tighter specifications, a larger number of Type A buffers were added. The CPU times for both methods are very similar. The area ratio in the last column is the ratio of the area required by sizing alone to the area required by our method. Therefore, this number should be at least equal to 1 (as it always is here), and a larger magnitude implies a better improvement over sizing alone. Significant improvements are possible in most cases. We point out that as the delay specification is tightened further, larger area savings are possible for each circuit.
For example, for circuit c5315 with a timing specification of 190ns, our approach provided an area of 13619, while the area of the circuit using sizing alone was 15000. This corresponds to a savings of about 10%. Our approach requires the insertion of 1 Type A buffer and 125 Type B buffers to meet this specification.
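The arithmetic behind the c5315 example can be checked with a short Python sketch. The helper names `area_ratio` and `savings_fraction` are hypothetical and are introduced only for illustration. The text does not state which area the savings is measured against, so both readings are noted: the saving is roughly 9.2% of the sizing-alone area, or roughly 10.1% of the area of the combined approach.

```python
def area_ratio(area_sizing_only: float, area_combined: float) -> float:
    """Ratio of the sizing-alone area to the area of the combined approach."""
    return area_sizing_only / area_combined


def savings_fraction(area_sizing_only: float, area_combined: float) -> float:
    """Area saved by the combined approach, as a fraction of the sizing-alone area."""
    return (area_sizing_only - area_combined) / area_sizing_only


# c5315 at the 190ns specification, using the areas quoted in the text.
ratio = area_ratio(15000, 13619)          # roughly 1.10
savings = savings_fraction(15000, 13619)  # roughly 0.092
```

A ratio above 1 means that the combined approach met the specification with less area than sizing alone.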
The entire area-delay tradeoff for this algorithm for three different benchmark circuits is shown in Figures 8 through 10. In each case, it is seen that significant improvements are possible from the use of our approach, particularly for tighter specifications. The reader is cautioned that although some curves, such as the one in Figure 9, seem to be close to each other, in the steep region of the curves, even small differences are greatly magnified on the y-axis, and our approach gives significant cost savings in that region. For some circuits, such as c499 (Figure 9) and c1355 (Figure 10), the area from our approach for loose delay specifications is very slightly worse than that from sizing alone. The explanation for this can be seen by examining our approach in a different light. In each step, the approach attempts to reduce the delay of the circuit, going along an area-delay tradeoff curve that is similar in nature to that shown in Figure 1, with a smaller value of $d_{min}$. As the algorithm progresses, each iteration represents a motion to the left along this tradeoff curve. The method typically "looks ahead" to determine if a buffer will be required to meet a delay specification with the best area several iterations in the future. Therefore, for a few iterations after the buffer is inserted, the results are likely to be slightly suboptimal, and some of this is manifested in the results. Since we are primarily interested in sizing circuits to meet tight delay specifications, which is a region where our algorithm works well, we have not taken any steps to remedy the occasional minor problem with loose delay specifications.
Conclusion
In this paper, we have aimed to support the basic idea that buffer insertion can help to improve the area-delay tradeoff curve, and have presented heuristic algorithms for the purpose. The gate sizing procedure is known to be power-conscious since it sizes gates only when necessary and reduces the dynamic power; in this work, the efficacy of this is further improved by considering buffer insertion to achieve the delay goal for the circuit with a smaller area (power) cost. Additionally, it is ensured that buffers are added only as needed so as to minimize the area and the power dissipation, and the process of buffer insertion is targeted towards meeting a given specification, rather than towards minimizing the circuit delay. The techniques developed herein are supported by experimental results that demonstrate that improvements can be achieved both in the area and the minimum achievable delay in comparison with an algorithm that performs sizing alone.
