Abstract: Digital designs can be mapped to different implementations using diverse approaches, with varying cost criteria. Post-processing transforms, such as transistor sizing, can significantly improve circuit performance by optimizing critical paths to meet timing specifications. However, most transistor sizing tools have high execution times, and neither the delay gains achievable through sizing nor the associated costs are known before sizing is performed. In this paper, we present two metrics for comparing different implementations (the minimum achievable delay and the cost of achieving a target delay) and show how these can be estimated without running a sizing tool. Using these fast and accurate performance estimators, a designer can determine the tradeoffs between multiple functionally identical implementations, and size only the selected implementation.
I. INTRODUCTION AND MOTIVATION

Implementing a design involves synthesis (technology independent optimizations and technology mapping), placement, and routing. In a final timing correction step, transistors in logic gates are appropriately sized to speed up critical paths, thus incurring a cost overhead (which may be area or power) for gains in circuit speed. Although recent approaches have tried to combine sizing with technology mapping [1], [2], exact wire loads are determined only after placement and routing, and it is difficult to estimate them accurately at the technology mapping stage. Therefore, gate size selection is still performed heuristically, which leaves a large scope for improving the circuit delay at later stages by sizing. The importance of this step can be judged by the amount of research carried out both in academia [3]-[6] and in industry [7], [8]. A major drawback of these optimization tools is their large running times; it can take up to a few hours to calculate the appropriate solution for an industry-sized circuit. In this scenario, it is difficult for a designer to determine whether an implementation will be able to meet performance goals after transistor sizing, or which circuit out of multiple different implementations of the same functionality should be chosen for further detailed optimization.
In this paper, we present an approach that estimates the benefits of sizing without incurring the overhead of running a sizing tool. This directly addresses the problem stated previously, since a designer can use our approach to compare a large number of implementations. We evaluate implementations based on two metrics, each of which is useful in different contexts. First, we consider the problem of estimating the minimum delay that can be achieved by an implementation if sizing is applied to it. This metric allows a designer to determine if an implementation can meet a given delay specification. The delay of a circuit is the maximum delay over all PI-to-PO (primary input to primary output) paths of the circuit. In order to meet design goals, transistor sizing is applied to the circuit to reduce this delay. The smallest value of delay that can be obtained in this manner is referred to as the minimum achievable delay. Circuits are rarely sized to meet this minimum delay value, due to the associated high area overheads. However, those that are on the critical path may be. Additionally, the minimum achievable delay, along with the unsized circuit delay, helps determine the range of delay values over which an implementation can be used.
In this work, we assume that the input to the sizing tool is a circuit that has been placed and routed, with all device sizes set to the minimum available value. During synthesis, device sizes may have been selected, but in the absence of physical information, these sizes are sub-optimal, and may take the design to a state that is far from best. Rather than taking such an arbitrarily sized design, we reset all sizes to the minimum so that all implementations have a similar initial state.
It may seem that the delay of such an unsized circuit can be used as an approximation for the minimum achievable delay. However, this is not the case, as can be seen from the situation shown in Fig. 1. This figure shows the normalized delays, before and after sizing, for different implementations of two benchmark circuits. In terms of the unsized delays, implementation A of the first circuit is the fastest, implementations B and C are a few percent slower, while D and E are about 10% slower. However, once these implementations have been sized, we see that implementations D and E are actually the fastest. A similar situation is seen for the second circuit. Sizing all implementations is impractical, and since the unsized delay cannot be used, our estimator can be a useful tool to determine the best implementation.
The second aspect of making comparisons between different implementations is to determine, for a certain target delay, which implementation will have the least cost overhead after sizing. For convenience, we use the area of the implementation as a measure of the cost by which implementations can be selected. There is a direct correlation of area with other measures of cost, such as power dissipation, sub-threshold leakage and gate leakage, and a similar approach can be used when the cost function is power, or a weighted combination of area and power. This metric is applicable to circuits whose target delay of operation is greater than their minimum achievable delay, and which hence need not be sized to the minimum delay value. Rather, the focus for these circuits is to minimize the cost while achieving the target delay. Currently, determining the cost is possible only after sizing has been performed, and as before, evaluating a large number of implementations is infeasible, due to the running time of current sizing tools. Simply using the delay and area of an unsized circuit can be misleading, since different implementations are superior at different delay points. This happens because the shape of the area-delay tradeoff curve can vary by implementation. Consider Fig. 2(a), which shows the area-delay curves of multiple implementations of benchmark circuit C7552, with the area of each implementation shown on the y-axis, and delay on the x-axis. The extreme right point of each curve corresponds to the unsized circuit; this has the maximum delay and the smallest area, and successively smaller delay values require larger areas. Note that the curves have a characteristic point (called the "knee"), at which the rate of change of area with respect to delay changes drastically.
Each curve is bounded by the maximum delay (i.e., the unsized circuit delay) and the minimum achievable delay. However, as can be seen, the shape of each curve can vary significantly. For example, in the curves shown in Fig. 2(a), the knee of each curve can either be closer to one of the end points or in the center. This property varies between different circuits, as can be expected, but it also varies between implementations of the same circuit. For some implementations of C7552, the knee is closer to the minimum delay point. Hence, for these implementations, we initially observe large improvements in delay for a relatively small area cost, but further delay improvement comes at the cost of large increases in area. The situation is reversed for other implementations, where the knee is closer to the maximum delay point. In this scenario, trying to determine which implementation is the best at some intermediate delay point without having knowledge of the entire area-delay curve is difficult.
Suppose a designer wants to determine the best implementation among those available for some relaxed target delay. By calculating the minimum achievable delay and the unsized circuit delay of all implementations, the designer can determine that all implementations meet this target delay (some do so trivially, without any sizing, since their unsized delay already satisfies the target). At a second, tighter target delay, only a subset of the implementations has to be considered; the others can be discarded, since their minimum achievable delay is larger than this value. However, this information is not sufficient, since it is still not known which of the remaining circuits should be selected. Ideally, the designer would like an ordering of these implementations based on the cost, which in this case is the area. This ordering differs between the two target delays. Simply ranking implementations based on the unsized delays and areas is not enough: at one delay point one implementation has the lower area, and at the other a different implementation is better. This situation, of different implementations being the best at different delay points, is also seen in implementations of other benchmark circuits. As a drastic example, consider Fig. 3, which shows the area-delay curves of two implementations of a benchmark circuit. Depending on the target delay selected, either one implementation or the other will be preferred. Without an estimate of these area-delay curves, the more area-efficient implementation can be determined only by generating the exact curves with a sizing tool.

To summarize, in this paper we present algorithms that provide an estimate of the minimum achievable delay of a given implementation, and an estimate of the complete area-delay tradeoff curve. The algorithms do not run a sizing tool and are therefore fast, but at the same time they enable accurate comparisons between different implementations, as we will show in the results section.
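The selection procedure described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: each implementation is represented by a hypothetical list of (delay, area) points on its estimated area-delay curve, and the function name `rank_implementations` and all numbers are invented for the example.

```python
def rank_implementations(curves, target):
    """Order implementations by the area needed to meet a target delay.

    curves maps an implementation name to a list of (delay, area) points
    on its area-delay curve. Implementations whose minimum achievable
    delay exceeds the target are dropped.
    """
    feasible = {}
    for name, points in curves.items():
        areas = [area for delay, area in points if delay <= target]
        if areas:  # at least one point meets the target delay
            feasible[name] = min(areas)
    return sorted(feasible, key=feasible.get)


# Hypothetical curves: the best implementation changes with the target.
curves = {
    "A": [(8, 100), (10, 60), (12, 50)],
    "B": [(9, 70), (11, 65), (13, 40)],
    "C": [(11, 45), (14, 35)],
}
tight = rank_implementations(curves, target=10)
relaxed = rank_implementations(curves, target=13)
```

With these made-up numbers, implementation C cannot meet the tighter target at all, while at the relaxed target the ordering of A and B is reversed, which mirrors the behavior discussed for Fig. 2(a) and Fig. 3.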
Our approach is based on the method of logical effort [9], [10], which is well suited for estimating the minimum achievable delay of a single path in a circuit, with a heuristic branching factor used to account for multiple fanouts. However, the critical path of a circuit changes dynamically according to the choice of distribution of capacitance over multiple fanouts. An important contribution and differentiator of our algorithm is a means of accurately determining the minimum achievable delay of a circuit by simultaneously considering all paths of the circuit. Logical effort and its associated drawbacks are described in the following section. We then show how the drawbacks of logical effort can be overcome, in particular, how multiple fanouts are handled in our approach. This is integrated into two algorithms, the first for estimating the minimum achievable delay, and the second for estimating the area-delay curve of an implementation. This work has been published in preliminary form in [11] and [12].
II. LOGICAL EFFORT
The starting point of our approach is the method of logical effort, which has been widely used in a variety of application domains [1], [13]-[15] as well as in industry standard EDA synthesis tools [16], [17]. Using logical effort, the delay $d$ of a gate is estimated by modeling it as a linear function of the load being driven as

$$d = g h + p \tag{1}$$

where:
• Logical Effort ($g$) is the complexity of the gate, relative to an inverter. It measures how much worse the gate is at driving a specified load than an inverter. The base case of an inverter is taken to have unit logical effort, and complex gates such as NAND, NOR, and XOR have successively higher values of logical effort.
• Electrical Effort, or Gain ($h = C_{out}/C_{in}$) describes how the electrical environment of the logic gate affects performance and how the size of the transistors in the gate determines its load-driving capability. $C_{out}$ is the load being driven and $C_{in}$ is the input capacitance of the gate under consideration.
• Effort Delay ($f = g h$) is the product of the logical and the electrical efforts of the gate.
• Parasitic Delay ($p$) expresses the intrinsic delay of the gate due to its own internal capacitance, and is largely independent of the size of the transistors in the logic gate.

This formulation separates the different components that contribute to the delay of a gate. It also provides the user with a means of sizing the gate: since the logical effort of a gate is fixed, if a particular effort delay is assigned to a gate, the input capacitance that meets this effort delay can be calculated as

$$C_{in} = \frac{g\, C_{out}}{f} \tag{2}$$

As shown in [10], (1) can be extended to estimate the minimum delay of a path of logic as

$$D = N F^{1/N} + P \tag{3}$$

where $F = G H$ is the path effort, $P$ is the path parasitic delay, and $N$ is the number of gates on the path under consideration. The path logical effort, $G$, is the product of the logical efforts of the gates on the path. The path electrical effort, $H$, is obtained as the product of the gate electrical efforts, or equivalently, as the ratio of the load being driven by the last gate to the input capacitance of the first gate. The minimum delay of (3) is obtained by distributing the path effort equally to each gate on the path. Thus, each gate is assigned a gate effort of $F^{1/N}$. Starting with the gate at the output, which drives a fixed load of $C_{out}$, the size of each gate can be successively determined by using (2).

Equation (3) can be used for determining the minimum delay (and the corresponding gate sizes) of a simple path of logic, in which each gate only drives the next gate on the path. However, realistic circuits have gates that drive multiple fanouts. In order to address this situation, [10] introduces the concept of a branching effort, $b = C_{total}/C_{useful}$, where $C_{total}$ is the total load being driven by the gate, and $C_{useful}$ is the load contributed by the fanout on the path of interest. The gate effort is now defined to be $f = g\, b\, h$. In Fig. 4, gate X drives two gates, Y and Z, which have input capacitances $C_Y$ and $C_Z$, respectively. The total load being driven by gate X is then $C_{total} = C_Y + C_Z$, as shown, and $C_{useful}$ is either $C_Y$ or $C_Z$, depending on whether path P1 or path P2 is being analyzed.

The path effort in (3) is modified to $F = G B H$, where $B$, the path branching effort, is the product of the gate branching efforts of all gates on the path being analyzed. A similar methodology is followed in order to obtain the minimum delay and the corresponding gate sizes, i.e., each gate on the path is assigned an effort of $F^{1/N}$, and (2), used to calculate the gate sizes, is modified to include the effect of the branching factor as

$$C_{in} = \frac{g\, b\, C_{useful}}{f} \tag{4}$$

In this manner, the branching factor tries to capture the effect of fanouts that are not on the path of interest. This approach, however, has a few serious flaws. First, paths are analyzed individually, and the interactions between the sizes required by each path are not taken into account. More importantly, the branching factor is assumed to be fixed, and it is calculated using the initial values of the fanout capacitances (in the example presented in [10], all fanouts are shown to have the same size, both before and after sizing). Hence, when a path is sized using (4), the gate sizes of fanouts not on the path under consideration have to be scaled according to the branching factor initially selected.

For example, in Fig. 4, when path P1 is being analyzed, the branching factor of gate X is $b_X = (C_Y + C_Z)/C_Y$. Once the path effort of P1, $F_{P1}$, has been calculated and used to size gates X and Y, the size of gate Z will have to be scaled by an appropriate amount in order to keep the value of $b_X$ constant. If path P2 is not critical, increasing the size of Z is unnecessary, and only increases the load on gate X. If path P2 is analyzed separately, its path effort, $F_{P2}$, may require completely different sizes of gates X and Z, and gate Y will have to be scaled according to the branching factor $(C_Y + C_Z)/C_Z$. Thus, in the case of multiple fanouts, the optimal sizes of each fanout (gates Y and Z in Fig. 4), and the corresponding size of the gate driving the fanout (gate X in Fig. 4), cannot be easily determined using the branching factor.
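As a concrete illustration of the single-path procedure described in this section, the following Python sketch computes the path effort, the equal per-stage effort, the minimum path delay, and the gate input capacitances by walking back from the output. It is a minimal sketch, not a sizing tool: the function name `size_path`, the argument layout, and the numeric values are assumptions made for the example, and delays are in normalized units.

```python
from math import prod


def size_path(g, b, c_load, c_in_first, p):
    """Size one path by the method of logical effort.

    g[i], b[i], p[i] are the logical effort, (fixed) branching effort and
    parasitic delay of stage i, with stage 0 at the path input. c_load is
    the fixed load on the last gate and c_in_first is the input
    capacitance of the first gate.
    """
    n = len(g)
    path_effort = prod(g) * prod(b) * (c_load / c_in_first)  # F = G * B * H
    stage_effort = path_effort ** (1.0 / n)                  # equal effort per gate
    delay = n * stage_effort + sum(p)                        # minimum path delay

    # Walk back from the output: C_in = g * b * C_on_path / f for each stage.
    c_in = [0.0] * n
    c_on_path = c_load
    for i in reversed(range(n)):
        c_in[i] = g[i] * b[i] * c_on_path / stage_effort
        c_on_path = c_in[i]
    return delay, c_in


# Two inverters; the first one also drives an equal off-path load (b = 2).
delay, c_in = size_path(g=[1, 1], b=[2, 1], c_load=8.0, c_in_first=1.0, p=[1, 1])
```

Note that the branching efforts are inputs that stay fixed during sizing, which is exactly the limitation discussed above: the off-path fanout is implicitly scaled along with the on-path gate.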
Analyzing every path in a circuit separately, and the interactions among all paths is not feasible, because of the exponential number of such paths in a circuit. Thus, while the method of logical effort is well suited to analyze single path delays, it cannot be used directly when critical paths are not well defined, or can change. In the following section, we present an approach that can handle such scenarios.
III. DELAY CALCULATION INTEGRATING GATE SIZING
The advantage of logical effort is that it provides the user with a means of determining the delay of a path of logic while simultaneously determining the gate sizes required for achieving that delay. In this section, we present an approach that is equivalent to logical effort in the degenerate case (single fanouts and no routing capacitances being considered). Even in the degenerate case, our approach has higher accuracy, since only discrete gate sizes that are available in the technology library are used in our calculations. As mentioned in the previous section, logical effort has severe shortcomings when circuits with multiple fanouts are being analyzed. Similar drawbacks exist when routing capacitance is taken into consideration. Our approach overcomes these drawbacks by simultaneously considering gate sizes of all paths in the circuit. A key concept of our approach is calculating and propagating Delay-C curves for all gates, which capture the effect of changing delay for different gate sizes. We first present this approach for simple paths, including the effects of routing capacitance. This is then extended to handle multiple fanouts.
A. Simple Paths
We define $\mathcal{C}_G$ to be the set of all possible values of input capacitance, $C_G$, of gate G, corresponding to different sizes of G. Consider the situation shown in Fig. 5, where gate G drives fanout gate F. The input capacitances of G and F are $C_G$ and $C_F$, respectively. The interconnect between the output of G and the input of F has a routing capacitance of $C_R$. Thus, the load capacitance that G drives is the sum of the input capacitance of F and the routing capacitance, or

$$C_L = C_F + C_R \tag{5}$$

Hence, if F has $k$ different sizes, G will have $k$ corresponding values of load capacitance. Now consider the delay from the input of a gate G to any primary output. This delay, $D_G$, has two components: the delay of G itself, $d_G$, and the delay from the output of G to a primary output, $D_{out}$. Thus,

$$D_G = d_G + D_{out} \tag{6}$$

This equation is incomplete, since it does not take into account the sizes of G or its outputs. These can be incorporated as follows. The term on the left hand side of (6) depends on the size of G that is under consideration, and is therefore correctly represented by $D_G(C_G)$. We know that the delay of G depends on its load, $C_L$, as well as its size, $C_G$, and is given by (1). Hence, we represent the gate delay as $d_G(C_G, C_L)$. An improved version of (6) is, therefore,

$$D_G(C_G) = d_G(C_G, C_L) + D_{out} \tag{7}$$

The last term corresponds to the delay from the input of F to a primary output. Hence, (7) can be rewritten as

$$D_G(C_G) = d_G(C_G, C_L) + D_F(C_F) \tag{8}$$

For a simple path of logic, (8) is a recursive definition of delay. For a gate driving a primary output (and hence driving a fixed load), the delay to the primary output is simply the delay of the gate itself. The delay for any other gate is defined in terms of the delay from its output to the primary output.

Recall that the value of $C_L$ also depends on the input capacitance of gate F, $C_F$, by (5). As mentioned before, gate F may have different sizes available. Hence, there are a corresponding number of different values of $C_L$. We are interested in the minimum delay from the input of gate G to a primary output. However, the effect of different values of $C_F$ on each component of (8) is opposite: as $C_F$ increases, so does the delay of gate G, but the delay to the primary output of gate F reduces. Thus, in order to obtain the minimum delay from the input of gate G to a primary output for a selected value of input capacitance of G, we need to examine all values of $C_F$:

$$D_G(C_G) = \min_{C_F \in \mathcal{C}_F} \left[ d_G(C_G, C_F + C_R) + D_F(C_F) \right] \tag{9}$$

The value of $D_G(C_G)$ as defined in (9), for different values of $C_G \in \mathcal{C}_G$, constitutes the Delay-C curve of gate G. It captures the minimum delay from the input of G to the primary output, for different sizes of G. Note that the sizes of all gates on the path from G to the primary output are implicit in the Delay-C curve. The formulation of (9) leads directly to a dynamic programming based algorithm, presented in the following section.
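The recursion for a simple path can be illustrated with a short Python sketch. It is a hedged example rather than the authors' implementation: it assumes the linear gate delay model of (1), that is, logical effort times load over input capacitance plus parasitic delay, and the names `gate_delay` and `delay_c_curves` and all numeric values are hypothetical.

```python
def gate_delay(g, p, c_in, c_load):
    """Linear logical-effort delay model: g * (C_load / C_in) + p."""
    return g * c_load / c_in + p


def delay_c_curves(chain, sizes, c_po):
    """Compute the Delay-C curve of every gate on a simple path.

    chain lists (g, p, c_route) for each gate from path input to path
    output, where c_route is the routing capacitance on the gate's output
    net. sizes holds the available input capacitances, and c_po is the
    fixed load at the primary output. curves[i][c] is the minimum delay
    from the input of gate i to the primary output when gate i has input
    capacitance c.
    """
    curves = [None] * len(chain)
    g, p, c_route = chain[-1]
    # Base case: the last gate drives a fixed load.
    curves[-1] = {c: gate_delay(g, p, c, c_po + c_route) for c in sizes}
    # Traverse from the output towards the input, reusing fanout curves.
    for i in range(len(chain) - 2, -1, -1):
        g, p, c_route = chain[i]
        fanout = curves[i + 1]
        curves[i] = {
            c_g: min(
                gate_delay(g, p, c_g, c_f + c_route) + fanout[c_f]
                for c_f in sizes
            )
            for c_g in sizes
        }
    return curves


# Two inverters (g = 1, p = 1), no routing capacitance, output load of 16.
curves = delay_c_curves([(1, 1, 0), (1, 1, 0)], sizes=[1, 2, 4], c_po=16)
```

In this small example a larger fanout gate always pays off for the first gate, because the fixed output load dominates; with a smaller output load the minimum would shift to an intermediate fanout size, which is the tradeoff the minimization over fanout sizes resolves.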
B. Multiple Fanouts
In the case that G drives multiple fanouts, the load capacitance, $C_L$, is calculated as follows. Say G has $n$ fanouts, $F_1, \ldots, F_n$, as shown in Fig. 6. We denote the possible values of the sum of the input capacitances of the fanouts by the set $\mathcal{C}_{FO}$. Then the load capacitance of G is the sum of the input capacitances of these fanouts (an element $C_{FO}$ from $\mathcal{C}_{FO}$), combined with the routing capacitance, $C_R$:

$$C_L = C_{FO} + C_R, \qquad C_{FO} = \sum_{i=1}^{n} C_{F_i} \tag{10}$$

Fig. 7. Combining Delay-C curves at multiple fanouts.

If we assume that each of the $n$ fanouts has $k$ sizes, then the number of values that we obtain for $C_{FO}$ is $k^n$. However, we show later that the number of useful values to be considered is actually linear.

The delay calculation of (7) also changes in the case of multiple fanouts. In (7), since there was only one fanout, we could use its delay to a primary output in order to obtain (8). However, in the case of multiple fanouts, we are interested in the maximum delay from the input of gate G to any primary output. Thus, the correct value to use for $D_{out}$ is the maximum delay to a primary output over all fanouts:

$$D_G(C_G) = d_G(C_G, C_L) + \max_{1 \le i \le n} D_{F_i}(C_{F_i}) \tag{11}$$

As before, we can have multiple values of $D_G(C_G)$, for different combinations of input capacitances of the fanout gates. Since we are interested in the minimum of these, we obtain

$$D_G(C_G) = \min_{(C_{F_1}, \ldots, C_{F_n})} \left[ d_G(C_G, C_L) + \max_{1 \le i \le n} D_{F_i}(C_{F_i}) \right] \tag{12}$$

Note that a selection of sizes of the fanouts (which determines the corresponding input capacitances $C_{F_i}$, and therefore $C_{FO}$) also fixes the value of the load that gate G has to drive, $C_L$, by (10).

It may seem that the size of the set $\mathcal{C}_{FO}$ for a gate G with multiple fanouts is proportional to the product of the number of sizes of the fanout gates. Assume gate G drives four outputs, whose Delay-C curves are shown in Fig. 7. If each fanout has $k$ sizes, each curve has $k$ points, and the size of $\mathcal{C}_{FO}$ is $k^4$. However, we can show that most of the values in $\mathcal{C}_{FO}$ are redundant. For example, consider the tuple $T_1$ formed by the first (maximum-delay) point of each of the curves in Fig. 7, and suppose that the largest delay among these points belongs to the first curve. A tuple $T_2$ that keeps this point of the first curve but takes any later points from the other three curves is inferior to $T_1$ for the following reason. There are two values that are extracted from $T_1$ and $T_2$: the maximum delay to a primary output, and the sum of the input capacitances represented by these combinations, which is used as the load in the delay calculation of gate G. The maximum delay is the same in tuples $T_1$ and $T_2$, but the load presented by $T_2$ is greater than that of $T_1$. Hence, the delay of G, and therefore its delay to a primary output, is larger in this case. Since we are interested in minimizing the delay to a primary output, the solution offered by tuple $T_2$ will never replace that calculated using $T_1$.

The above discussion directly leads to a strategy for efficiently selecting useful values of $C_L$ from the Delay-C curves of the outputs. First, these curves are stored in order of nonincreasing delay (and hence increasing sizes). The first value of $C_L$ is the routing capacitance plus the capacitance corresponding to the maximum-delay points from each curve, as in tuple $T_1$. The next value is obtained by replacing the point with maximum delay in $T_1$ with the next point from the same curve. This effectively ignores the combination of the replaced point with the remaining points from the other curves. This process is continued until the maximum-delay point is the last point on its curve. Thus, in the worst case, the total number of combinations is of the order of the sum of the number of points on each curve, rather than the product. This worst case occurs when the Delay-C curves of all outputs are identical. In other situations, the number of combinations is smaller than the sum of the number of points on each curve.
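The selection strategy above can be sketched as follows. This is a minimal Python illustration, assuming each fanout curve is given as a list of (input capacitance, delay) pairs already sorted by nonincreasing delay; the function names `useful_combinations` and `min_delay_for_size`, the linear gate delay model, and the numbers are assumptions made for the example.

```python
def useful_combinations(curves):
    """Enumerate only the useful fanout size combinations.

    curves is a list of Delay-C curves, one per fanout, each a list of
    (c_in, delay) pairs sorted by nonincreasing delay (increasing size).
    Returns a list of (total fanout input capacitance, maximum delay).
    """
    index = [0] * len(curves)  # current point on every curve
    combos = []
    while True:
        points = [curve[i] for curve, i in zip(curves, index)]
        total_c = sum(c for c, _ in points)
        worst = max(range(len(points)), key=lambda j: points[j][1])
        combos.append((total_c, points[worst][1]))
        if index[worst] == len(curves[worst]) - 1:
            break  # the maximum-delay point is the last one on its curve
        index[worst] += 1  # advance only the curve that limits the delay
    return combos


def min_delay_for_size(g, p, c_g, c_route, combos):
    """Minimum delay to a primary output for one size of the driving gate."""
    return min(g * (c + c_route) / c_g + p + d for c, d in combos)


curve_a = [(1, 10), (2, 6), (4, 4)]
curve_b = [(1, 7), (2, 5)]
combos = useful_combinations([curve_a, curve_b])
best = min_delay_for_size(g=1, p=1, c_g=1, c_route=0, combos=combos)
```

Here four combinations are examined instead of the six in the full product, and never more than the five points the two curves contain in total.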
Recall the drawbacks of the traditional branching effort mentioned in the previous section. A fixed branching effort that does not consider the interactions between fanout branches can lead to suboptimal circuits, and considering the interactions is impractical, due to the number of combinations involved. Using Delay-C curves allows us to efficiently assign sizes to multiple fanouts according to the criticality of each branch.
IV. ALGORITHMS FOR ESTIMATING THE BENEFITS OF SIZING
Based on the formulation of Delay-C curves presented in the previous section, we now present two algorithms that determine the metrics mentioned in Section I. Calculating the Delay-C curve of a circuit allows us to estimate its minimum achievable delay. Modifications to this approach can be used to determine the cost of sizing a circuit to a target delay, rather than sizing to the minimum achievable delay. Our formulation for calculating the Delay-C curve of a gate has been presented in the previous section, with the most general form being represented by (12). As mentioned before, this is a recursive definition, each value on the curve being defined in terms of the curves of the fanouts. Algorithm 1 exhibits the hallmarks of dynamic programming: the optimal solution for the current gate size is defined in terms of the optimal solutions of its outputs, which are calculated once and used as needed. A single traversal of the circuit, from primary outputs to primary inputs, is sufficient to calculate the Delay-C curves of all gates in the circuit. Processing gates in such a topological manner ensures that when the Delay-C curve of a gate is being calculated, the Delay-C curves of all of its fanouts have already been determined.
A. Minimum Delay Estimation
Algorithm 1 Minimum Delay Estimation
We assume that primary inputs are inverters of fixed size. The Delay-C curves at the primary inputs include the delay of these inverters, so that the loading effects of the gates being driven by the primary inputs are taken into account. Once we have calculated the Delay-C curves at the primary inputs, determining the minimum achievable delay of the circuit is straightforward. For each primary input, we can calculate the minimum delay to any primary output, and the largest such value over all the primary inputs is the minimum achievable delay of the circuit. Algorithm 1 presents the complete algorithm, called Minimum_Delay_Estimation (MDE).
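The overall flow described above (a traversal from primary outputs to primary inputs, followed by a maximum over the primary inputs) can be sketched in Python as shown below. This is not the paper's Algorithm 1: for brevity it enumerates all fanout size combinations with `itertools.product` instead of using the linear pruning of Section III-B, it assumes the linear gate delay model of (1), and the function name, dictionary layout, and values are hypothetical.

```python
from graphlib import TopologicalSorter  # Python 3.9+
from itertools import product

SIZES = (1.0, 2.0, 4.0)  # hypothetical discrete input capacitances


def minimum_delay_estimate(gates, po_load):
    """Estimate the minimum achievable delay of a small circuit.

    gates maps a gate name to a dict with keys g, p, c_route, fanouts and,
    optionally, sizes (defaults to SIZES). A gate without fanouts drives a
    primary output with the fixed load po_load.
    """
    # graphlib emits a node only after its declared predecessors, so
    # declaring fanouts as predecessors visits outputs before inputs.
    order = TopologicalSorter(
        {name: gate["fanouts"] for name, gate in gates.items()}
    ).static_order()

    curves = {}  # gate name -> {input capacitance: min delay to any output}
    for name in order:
        gate = gates[name]
        fanouts = gate["fanouts"]
        curve = {}
        for c_g in gate.get("sizes", SIZES):
            if not fanouts:
                load = po_load + gate["c_route"]
                curve[c_g] = gate["g"] * load / c_g + gate["p"]
                continue
            best = float("inf")
            # Exhaustive enumeration of fanout sizes, for clarity only.
            for choice in product(*(curves[f] for f in fanouts)):
                load = sum(choice) + gate["c_route"]
                tail = max(curves[f][c] for f, c in zip(fanouts, choice))
                best = min(best, gate["g"] * load / c_g + gate["p"] + tail)
            curve[c_g] = best
        curves[name] = curve

    driven = {f for gate in gates.values() for f in gate["fanouts"]}
    inputs = [name for name in gates if name not in driven]
    return max(min(curves[name].values()) for name in inputs), curves


inv = {"g": 1.0, "p": 1.0, "c_route": 0.0}
tree = {
    "x": {**inv, "fanouts": ["y", "z"], "sizes": (1.0,)},  # fixed-size input driver
    "y": {**inv, "fanouts": []},
    "z": {**inv, "fanouts": []},
}
delay, curves = minimum_delay_estimate(tree, po_load=4.0)
```

Gate "x" plays the role of the fixed-size inverter at a primary input, so its curve has a single point that already includes the loading effect of the gates it drives.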
Run Time Analysis: Assume that there are $k$ sizes for each gate in a circuit with $N$ gates, and the maximum fanout on any gate is $m$. The innermost loop is executed $O(mk)$ times, as shown previously, and each iteration also incurs the cost of determining the maximum-delay point among the fanouts. The second loop is executed $k$ times, since we assume $k$ sizes for each gate. Finally, since there are $N$ gates in the circuit, the outermost for loop is executed $N$ times. Thus, the running time of Algorithm MDE is linear in the number of gates $N$ and polynomial in $k$ and $m$. However, note that this is a very loose upper bound, since very few gates actually have $m$ fanouts.

Algorithm MDE is optimal for trees. However, most circuits are DAGs, with reconvergent fanouts. The main problem with DAGs is that there are multiple paths from a particular gate to primary outputs, or between two gates. An implicit assumption of our algorithm is that the Delay-C curves at multiple fanout points are independent, and that we are free to choose the combination of output delays and capacitances that best suits the current gate. However, with reconvergent fanouts, these choices are not independent of each other. Selecting a data point on one output restricts the choices on the other, and determining the relation between different outputs is intractable for general circuits. However, assuming independence is not unreasonable for an estimator such as ours. If the reconvergent paths are completely unbalanced, i.e., their structure and logic are such that one always has smaller delay than the other, no errors are introduced due to the manner in which their Delay-C curves are combined. The smallest value will consistently be selected for the path with smaller delay. An example of this situation is when the paths correspond to two well-separated curves in Fig. 7. On the other hand, if the delays of the two paths are roughly of the same order (e.g., if they correspond to two curves in Fig. 7 that lie close together), our approach selects approximately similar values of input capacitances. This may lead to small inaccuracies, since the actual values of input capacitance may be slightly different. However, the error in delay estimation is limited, as shown by the results in Section V.
Our approach can also be used to obtain actual sizes of all gates in the circuit. In Algorithm MDE, we can store the value of the load of each output that induces the minimum delay (corresponding to the fanout input capacitances and load selected by the minimization in (12)). This information can be used in a forward traversal of the circuit, in order to generate sizes for every gate. A gate with multiple fanins has multiple choices for its size, which can be resolved by selecting the size imposed by the critical input. The effect on the noncritical inputs is that they now have a load different from what was initially assumed. However, the difference in the delays from the primary inputs to the critical and noncritical inputs can be used to compensate for this. In fact, this difference can usually be used to reduce the sizes of the transitive fanin cone of the noncritical inputs, as long as their delay does not become larger than that of the critical input. Gate sizes determined in this manner correspond to a circuit sized for minimum delay. These sizes can be used as an initial feasible solution for an exact sizing tool, instead of using the original unsized circuit. This can lead to a large improvement in the running time of the transistor sizing tool, since a circuit sized using our approach is closer to the final solution than the initial, unsized circuit. Note, however, that the area of a circuit sized in this manner cannot be used as an estimate for the area of the final circuit, due to the nature of the sizing problem. This issue is discussed in the following subsection.
B. Area-Delay Curve Estimation
As mentioned in Section I, not all circuits need to be sized to operate at the minimum achievable delay. For these circuits, we can trade area for delay, also achieving a reduction in power. In a scenario where a target delay is known, and multiple implementations of a given circuit are available, we would like to determine the cost (in terms of area) for achieving the given delay, which entails estimating the entire area-delay curve of each implementation. As before, determining the exact area-delay curve is expensive. In this subsection, we modify Algorithm 1 in order to quickly estimate the area-delay curve of a given implementation.
Since we are interested in determining the entire area-delay curve of the implementation, a natural approach would be to calculate the area of the transitive fanouts along with the delays during the Delay-C curve calculation of Algorithm 1. However, there are a few problems with this approach. There are multiple configurations of gate sizes that can achieve the same delay value, and hence multiple solutions for each delay value have to be stored. Unlike in Algorithm MDE, these solutions cannot be pruned. Finally, every combination of points in the enhanced Delay-C curves of multiple fanouts has to be considered, which further increases the complexity.
We therefore need another approach to estimating the area-delay curve. Recall that the Delay-C curves calculated in Algorithm MDE implicitly store the sizes of gates in the transitive fanout cone required for achieving the minimum delay for each value of input capacitance. Hence, we can size the circuit using different points on the Delay-C curves of the primary inputs, and calculate the corresponding area. However, these points may not be optimal, i.e., the area calculated using the above approach may not be the smallest area for a particular delay. For example, say we have a minimum delay of $D_1$ for input capacitance $C_1$ and $D_2$ for $C_2$, with corresponding circuit areas of $A_1$ and $A_2$. It is possible that there is a nonminimum delay $D_1'$ for an input capacitance of $C_1$, with $D_1 < D_1' \le D_2$, that has a corresponding circuit area $A_1'$ that is less than $A_2$. The solution $(D_1', A_1')$ (corresponding to an input capacitance of $C_1$) is clearly better than the solution $(D_2, A_2)$, but since only minimum delay points are considered, the superior solution is ignored.
Consider the circuit shown in Fig. 8, with two branches of the circuit driving different loads. For some value of input capacitance, we obtain a number of delay values, the minimum of which is stored in the Delay-C curve, and the other delay values are discarded. However, we can size the circuit using the minimum as well as the discarded delay values (for the same value of input capacitance), and calculate the corresponding areas. These points are shown in Fig. 9, and the best points from an area-delay curve perspective are the ones marked by a line. This procedure can be repeated for other values of input capacitance, and the union of the solutions obtained gives us the desired area-delay curve. This is shown in Fig. 10 for three values of input capacitance. The intersection between two of these curves is an example of the sub-optimality that would result if only the minimum delay points were considered.
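Taking the union of the sized solutions and keeping only the best points amounts to extracting the nondominated (delay, area) pairs. The following Python sketch shows one way to do this; the function name `area_delay_curve` and the sample points are assumptions for illustration, not data from the experiments.

```python
def area_delay_curve(points):
    """Keep only the nondominated (delay, area) points.

    points is an iterable of (delay, area) pairs collected from all sized
    solutions. A point is kept if no other point has both smaller or equal
    delay and smaller or equal area. The result is sorted by increasing
    delay, so area decreases along the returned curve.
    """
    curve = []
    best_area = float("inf")
    for delay, area in sorted(points):
        if area < best_area:  # strictly better area than every faster point
            curve.append((delay, area))
            best_area = area
    return curve


# Hypothetical solutions pooled from several input capacitance values.
pooled = [(10, 50), (12, 40), (11, 60), (15, 30), (14, 45)]
curve = area_delay_curve(pooled)
```

Points such as (11, 60) and (14, 45) are dropped because a faster solution with smaller area already exists.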
Thus, we estimate the area-delay curve of a circuit by sizing it for different values of delay, for every value of the input capacitance, and measuring the area. In order to keep the run time low, rather than sizing for all delay values, we size the circuit for a limited number of values (in our experiments, we found that selecting 10 sub-optimal delay points was sufficient). This has an impact on the accuracy of our results, but the effect is limited.
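The selection of the best points and the union over several input capacitance values can be illustrated with a minimal sketch. It is not Algorithm ADC itself: it assumes the (delay, area) pairs have already been obtained by sizing, and the names `pareto_front` and `area_delay_curve` are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Point = Tuple[float, float]  # (delay, area); representation is an assumption


def pareto_front(points: Iterable[Point]) -> List[Point]:
    """Keep only points that no other point beats in both delay and area."""
    front: List[Point] = []
    best_area = float("inf")
    # Sort by delay (ties broken by area), then sweep: a point survives
    # only if its area is strictly smaller than every faster point's area.
    for delay, area in sorted(set(points)):
        if area < best_area:
            front.append((delay, area))
            best_area = area
    return front


def area_delay_curve(points_by_cin: Dict[float, List[Point]]) -> List[Point]:
    """Merge the point sets of all input capacitance values and prune."""
    merged = [p for pts in points_by_cin.values() for p in pts]
    return pareto_front(merged)
```

Because the pruning is applied to the merged set, a non-minimum-delay point obtained for one input capacitance can displace a minimum-delay point obtained for another, which is the situation described above.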
Our heuristic, called Algorithm ADC, is shown in Algorithm 2. At the primary inputs, we store multiple Delaycurves. Each time is updated to a new value, we store the replaced value as an entry in a set of secondary curves. The minimum delay values from these secondary curves are then used to size the circuit, and obtain other points on the area-delay curve. Circuits sized in this manner have greater delay than the minimum achievable delay, and after area recovery, they have smaller area as well. The solution obtained using this approach is naturally not exact. However, as discussed above, since the auxiliary data of points on the secondary curve encode sizes of the outputs (and, in particular, the sizes of multiple fanouts), these solutions still provide a good representation of the area behavior of the circuit at different delay points. That is, though we cannot use the area-delay curves to make absolute judgments, we can still make comparative judgments between different circuits.
Once the circuit has been sized, we determine the arrival and required times at each gate, and use the slack to reduce the sizes of the gates. This step can drastically reduce the area of a circuit, since the noncritical parts of the circuit are usually sized to be unnecessarily fast. After the Delaycurves have been calculated, the arrival and required times of each gate can be determined in two traversals of the circuit. This calculation is performed for each set of Delaycurves available, and hence the running time is dominated by that of Algorithm 1.
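The two traversals used to obtain arrival times, required times, and slack can be sketched as follows. This is an illustration under simplifying assumptions rather than the implementation used here: each gate has a single fixed delay, wire delays are ignored, every gate appears as a key of `fanouts`, and the function name `compute_slacks` is hypothetical.

```python
from collections import deque
from typing import Dict, List


def compute_slacks(fanouts: Dict[str, List[str]],
                   gate_delay: Dict[str, float],
                   target_delay: float) -> Dict[str, float]:
    """Return slack (required time minus arrival time) at each gate output."""
    # Count fan-ins so gates can be visited in topological order.
    indegree = {g: 0 for g in fanouts}
    for outs in fanouts.values():
        for o in outs:
            indegree[o] += 1

    # Forward traversal: arrival time is the latest fan-in arrival
    # plus the gate's own delay.
    arrival = {g: gate_delay[g] for g in fanouts}
    queue = deque(g for g, d in indegree.items() if d == 0)
    order: List[str] = []
    while queue:
        g = queue.popleft()
        order.append(g)
        for o in fanouts[g]:
            arrival[o] = max(arrival[o], arrival[g] + gate_delay[o])
            indegree[o] -= 1
            if indegree[o] == 0:
                queue.append(o)

    # Backward traversal: required time is the earliest deadline
    # imposed by any fanout.
    required = {g: target_delay for g in fanouts}
    for g in reversed(order):
        for o in fanouts[g]:
            required[g] = min(required[g], required[o] - gate_delay[o])

    return {g: required[g] - arrival[g] for g in fanouts}
```

Gates with positive slack are the candidates for downsizing during area recovery; gates with zero slack lie on a critical path for the chosen target delay.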
V. RESULTS
In the previous sections, we have presented two metrics for determining the benefits obtainable from sizing. In order to validate our approach to calculating these metrics, we proceed as follows. We use a library consisting of multiple sizes of an inverter and two-input NAND, NOR and XOR gates, at the technology node, characterized using the Berkeley Predictive Technology Model [18] (available from http://www-device.eecs.berkeley.edu/~ptm). In all, we have 10 discrete sizes of each gate type. The calculated gate sizes are rounded off to the nearest size available in this library. We use SIS [19] to map ISCAS and MCNC benchmark circuits, with varying optimization criteria (for area, delay and combinations of area and delay), and obtain 7 implementations. We then add random capacitances at all interconnects of these implementations twice, in order to simulate the effect of different placement and routing solutions. We thus obtain 14 different implementations for each benchmark circuit. The library is also characterized in order to obtain cor- . Comparisons are made with our implementation of TILOS [3].
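The rounding of a calculated gate size to the nearest available library size can be sketched as below. The function name `snap_to_library` and the size values in the usage note are hypothetical; the actual sizes in the library are not listed in the text.

```python
from typing import Sequence


def snap_to_library(size: float, library_sizes: Sequence[float]) -> float:
    """Return the library size closest to the calculated continuous size.

    On an exact tie, the size that appears first in library_sizes wins.
    """
    return min(library_sizes, key=lambda s: abs(s - size))
```

For instance, with an assumed library of sizes `[1, 2, 4, 8]`, a calculated size of 3.5 would be mapped to 4, and any size above 8 would be clamped to 8.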
A. Algorithm MDE
For all implementations of every benchmark circuit (obtained as described above), we apply our implementations of TILOS and Algorithm MDE, in order to determine the minimum achievable delay. Our goal is to measure the error between the two delays obtained (exact, from TILOS, and estimated, from Algorithm MDE).
Figs. 11 and 12 present the comparison of Algorithm MDE with our implementation of TILOS for a few of the benchmark circuits. For each implementation, the first bar represents the delay of the unsized circuit. The second bar is the minimum delay obtained when the mapped circuit is sized using our implementation of TILOS, and the last bar is the minimum achievable delay estimated using Algorithm MDE. As can be seen from the correspondence between the last two bars for each implementation, our results agree with those obtained via TILOS. In every case, the execution time for our algorithm was less than a second, while our implementation of TILOS took from a few seconds for C17 up to more than 1500 seconds for C6288. The average error for each circuit over all implementations is presented in Table I. Over all the benchmark circuits (59 in total), the average error is 6.01%. This error is due to the fact that most circuits are not trees, but have reconvergent fanouts. Our assumption in Section IV-A, that the Delaycurves of reconvergent fanouts are independent, is borne out by the small magnitude of the error.
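A per-circuit average error of this kind can be computed as in the sketch below. The exact error definition is not spelled out in the text; the sketch assumes the absolute difference between the estimated and TILOS delays, normalized by the TILOS delay, and the function name is hypothetical.

```python
from typing import Sequence


def average_percent_error(estimated: Sequence[float],
                          reference: Sequence[float]) -> float:
    """Mean of |estimate - reference| / reference, in percent.

    Assumes one entry per implementation and strictly positive reference
    delays; this normalization is an assumption, not taken from the paper.
    """
    if len(estimated) != len(reference) or not reference:
        raise ValueError("need two non-empty sequences of equal length")
    errors = [abs(e - r) / r * 100.0 for e, r in zip(estimated, reference)]
    return sum(errors) / len(errors)
```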
B. Algorithm ADC
We next present the results of generating estimated area-delay curves of all implementations of a circuit, using Algorithm ADC. The first goal of this approach is to correctly predict which implementation has the lowest cost at different delay points. In order to measure this, we generate the area-delay curves for all implementations, using TILOS and Algorithm ADC. Over the entire range of available delay values, we select ten equally spaced delay points. Note that the number of implementations that can be sized to meet a particular delay value varies by circuit, as can be seen from the area-delay curves shown in Fig. 2. We make pairwise comparisons between all implementations available at the selected delay point, and determine which implementation is better in each pair. In Table II, for each benchmark circuit, the number of comparisons made is shown in the second column. Next, we make the same comparison using the area-delay curves obtained from our implementation of TILOS. An incorrect comparison is one in which the ranking according to Algorithm ADC differs from that obtained from TILOS. As shown in the next column, incorrect comparisons occur only 6.71% of the time.
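The pairwise comparison at a single delay point can be sketched as follows. The data layout is an assumption made for illustration: each dictionary maps an implementation name to its area at the chosen delay, with `None` marking an implementation that cannot meet that delay, and `count_mispredictions` is a hypothetical name.

```python
from itertools import combinations
from typing import Dict, Optional, Tuple


def count_mispredictions(estimated: Dict[str, Optional[float]],
                         reference: Dict[str, Optional[float]]) -> Tuple[int, int]:
    """Return (comparisons made, incorrect comparisons) at one delay point."""
    # Only implementations that meet the delay on both curves are compared.
    feasible = [name for name in estimated
                if estimated[name] is not None
                and reference.get(name) is not None]
    comparisons = incorrect = 0
    for a, b in combinations(feasible, 2):
        comparisons += 1
        # The better implementation of a pair is the one with smaller area.
        est_better = a if estimated[a] <= estimated[b] else b
        ref_better = a if reference[a] <= reference[b] else b
        if est_better != ref_better:
            incorrect += 1
    return comparisons, incorrect
```

Summing both counts over the ten delay points of a circuit would give the number of comparisons and the misprediction rate for that circuit.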
Next, we consider the error in the predicted area difference. When comparing implementations and , the better implementation is the one with smaller area. Let the corresponding areas be and , and assume , so that is the better implementation. The difference between the estimated areas of and , is calculated as . The corresponding difference between the areas from the actual area-delay curves of and , and is calculated as . The maximum and average difference between and are presented in columns 4 and 5 of Table II. For example, for circuit C2670, Algorithm ADC overestimates the difference in areas between two implementations by an average of 3.49%, while the maximum error is 18.87%. The maximum error occurs rarely; over all circuits, the average error is 5.07%, while the maximum error is 28.61%. The last two columns present the maximum and average errors in area estimation for comparisons that were mis-predicted. While the maximum is large, it is rare; the average error in this case is 5.71%. As mentioned previously, incorrect predictions themselves are very infrequent.
VI. CONCLUSION AND FUTURE DIRECTIONS
In this paper, we identify and address two metrics that can be used to evaluate different implementations of a circuit. Using the algorithms presented in this paper, designers can quickly determine the minimum delay achievable by an implementation. We also present an algorithm for estimating the entire area-delay curve of all available implementations, so that given a target delay, the best implementation (in terms of area) can be selected. Both of these metrics are calculated without running an exact sizing tool, and are therefore fast, but do not sacrifice accuracy, as shown by the results.
The concept of calculating and propagating Delaycurves is general, and can be applied to different areas as well. For example, current placement tools try to provide a solution that is delay-optimal, among other objectives. However, they ignore the gains that may be obtained via sizing. Our approach can be used to guide the placement tool, in effect making it "transistor-sizing aware," so that the final solution is globally optimal. Another area of application is in technology mapping, where circuits are broken into trees which are mapped individually, simply estimating the load values at the output of each tree. In [21] , Delaycurves are used to determine the optimal assignment of loads at tree outputs, leading to superior mapped solutions.
