Abstract
Introduction and previous work
Circuit optimization by transistor and wire sizing is an important part of designing custom high-performance digital circuitry. There are two well-known methods of circuit optimization: dynamic tuning and static tuning. In dynamic tuning [1, 2, 3], the user must specify input patterns for simulation, and only those measurements that are actuated during the simulation can be optimized. Hence there is a heavy burden on the user to correctly pose the tuning problem. In static tuning, the optimization is on a static-timing basis wherein all paths through the digital logic are considered simultaneously. The requirement of providing input patterns and the burden of stating a meaningful optimization problem are thus removed from the designer.
One of the best-known static tuners is TILOS [4], in which transistors are modeled by RC equivalent circuits, and the delay of a gate is represented by an Elmore [5] or Penfield-Rubinstein [6] delay model. The resulting sizing problem is shown to be posynomial in transistor and wire widths, and is converted to a convex problem by a simple mapping of variables. A heuristic method of solving this convex problem is employed to obtain the entire delay-area tradeoff curve. Subsequently, an exact solution to the convex problem was proposed in [7]. These methods work quickly and can handle large circuits. However, they suffer from the inaccuracy of approximating a logic gate by an RC circuit and the concomitantly crude delay model, making them unsuitable for custom, high-performance design.
Simulation-based static timing analysis, wherein each subcircuit is analyzed by time-domain simulation, is an ideal framework for path-independent optimization. In such a framework, custom circuitry can be accommodated, and all the benefits of static timing analysis are preserved. This paper presents an optimization technique that formulates the static tuning problem in a unique manner, is based on nonlinear optimization, uses fast transient simulation to evaluate custom circuitry, and uses incremental time-domain gradient computation by the adjoint method.
This paper focuses on circuits composed of parameterized library cells. Both datapath and control circuits, whether synthesized or custom-designed, can be optimized. Employing simulation to evaluate circuitry allows extension to general custom circuitry at the transistor level. The simulation is combined with efficient, incremental, time-domain sensitivity computation in order to provide gradients to the nonlinear optimizer. It is well known that solving large, nonlinear optimization problems is impossible without gradient information or at least some method of approximating gradients.
Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, to republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee. DAC 99, New Orleans, Louisiana. ©1999 ACM 1-58113-092-9/99/0006..$5.00
11400 Burnet Road, M. S. 9460, Austin, TX 78758
Section 2 demonstrates the novel formulation by means of a simple example. Details are provided in Section 3. A prototype program called EinsTuner which tunes combinational circuits consisting of parameterized library cells is described in Section 4, and numerical results are discussed in Section 5. Section 6 is devoted to special considerations for custom transistor-level and sequential circuits. Finally, future work and conclusions are presented. 
Problem formulation in detail
While the previous section described the concept in relation to a simplified situation, this section will lay out the full details of the problem formulation, including the handling of slews, slew dependencies, and separate rise and fall arrival times.
Variables
Each node of the circuit has four variables associated with it. For node i, the four variables are the rising arrival time $AT_i^r$, the falling arrival time $AT_i^f$, the rising slew $S_i^r$ and the falling slew $S_i^f$. Slew is considered to be a 0%-100% measurement (100%-0% for falling signals) of an idealized ramp waveform, but the definition can easily be changed to suit other timing methodologies.
Problem formulation by example
Consider the network of Figure 1, consisting of three simple gates G1, G2 and G3. Let w be the n-vector of transistor widths to be optimized. Wires are ignored here for simplicity. Assume that we wish to minimize the worst arrival time at the primary outputs subject to an area constraint and simple bounds. The problem can then be stated as follows.
minimize $\max(AT_7, AT_8)$
In problem (1), the AT variables are the worst-case arrival times at each node of the circuit, the $d_{ij}$ functions are the delays along the signal paths from each input pin to the output pin of each gate, A is an area target, and $L_i$ and $U_i$ are the lower and upper bounds on transistor widths (known as simple bounds). Area is modeled by a weighted sum of tunable transistor widths. Unfortunately, problem (1)
In reference to problem (2), the following points are to be noted.
Assume that we are dealing with parameterized gates in which each channel-connected component (CCC) (i.e., each set of FETs that are source-drain connected) has two design variables, $w_n$ and $w_p$, denoting the width of all the NFETs and all the PFETs in the CCC, respectively. Again, the formulation can easily be generalized to a full-custom situation. Finally, we have an auxiliary timing variable z. Figure 2 shows a generic multi-input multi-output CCC.
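To make the role of the arrival-time variables and of the auxiliary variable z concrete, the following minimal sketch propagates worst-case arrival times through a small gate network and then picks the smallest z that bounds every primary-output arrival time. The network, node names and delay values are hypothetical (they are not those of Figure 1), and each delay is treated as a constant rather than as a function of the widths w.

```python
def arrival_times(arcs, input_at):
    """Propagate worst-case (latest) arrival times through a DAG of timing arcs.

    arcs maps (source, sink) to a pin-to-pin delay; input_at gives the
    asserted arrival time of every primary input.
    """
    at = dict(input_at)
    pending = dict(arcs)
    while pending:
        progressed = False
        for sink in {j for (_, j) in pending}:
            fanin = [(i, d) for (i, j), d in pending.items() if j == sink]
            # A node can be timed only once all of its fanins are timed.
            if all(i in at for i, _ in fanin):
                at[sink] = max(at[i] + d for i, d in fanin)
                for i, _ in fanin:
                    del pending[(i, sink)]
                progressed = True
        if not progressed:
            raise ValueError("timing graph has a cycle or an undriven node")
    return at


# Hypothetical network: G1 (in1, in2 -> n5), G2 (n5, in3 -> out7), G3 (n5, in4 -> out8).
arcs = {
    ("in1", "n5"): 10.0, ("in2", "n5"): 12.0,
    ("n5", "out7"): 8.0, ("in3", "out7"): 15.0,
    ("n5", "out8"): 5.0, ("in4", "out8"): 6.0,
}
at = arrival_times(arcs, {"in1": 0.0, "in2": 0.0, "in3": 0.0, "in4": 0.0})
outputs = ["out7", "out8"]

# max-based objective: the worst primary-output arrival time
worst = max(at[o] for o in outputs)

# Inequality-based view: any z with z >= AT at every output is feasible,
# and the smallest such z coincides with the max.
z = worst
```

The point of the second view is that "z is at least every output arrival time" uses only smooth inequality constraints, which is the kind of statement a gradient-based optimizer can work with.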
Timing constraints
For the propagate segment (i.e., the signal arc in the CCC timing graph) from pin i to pin j, the constraints are shown below, assuming that the segment is an inverting segment. Depending on the type of gate, each propagate segment is classified as either an inverting segment, a non-inverting segment or both, and the constraints are listed appropriately. The entire network is traversed and constraints such as the ones listed below are gathered for every propagate segment.
Note that a superscript of r implies a rising signal, delay or slew, while a superscript of f refers to the corresponding falling quantities. The nonlinear delay functions are denoted by $d_{ij}$, and the $s_{ij}$ terms represent the nonlinear slew functions. The fanout capacitance at pin j is $cout_j$, which is a function of the sizes of the fanouts of pin j augmented by any wire capacitances on that net. As in most static timers, fanout capacitance is approximated by a lumped, linear capacitance which is a function of the transistor and wire sizes of the immediate fanouts. The gate itself is modeled at the transistor level. The choice of the slew propagation inequalities in (3) is compatible with the conservative nature of timing analysis. Moreover, unlike other choices, it guarantees the continuity of the optimization problem.
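For orientation, the four inequalities for an inverting propagate segment from pin i to pin j plausibly take the following form. This is a sketch inferred from the notation defined in this section, not a formula quoted from it. In particular, the argument lists of $d_{ij}$ and $s_{ij}$ (the widths w, the input slew and the fanout capacitance) are an assumption based on the quantities the text says the delays and slews depend on.

```latex
\begin{align*}
AT_j^f &\ge AT_i^r + d_{ij}^f\bigl(w, S_i^r, cout_j\bigr), \\
AT_j^r &\ge AT_i^f + d_{ij}^r\bigl(w, S_i^f, cout_j\bigr), \\
S_j^f  &\ge s_{ij}^f\bigl(w, S_i^r, cout_j\bigr), \\
S_j^r  &\ge s_{ij}^r\bigl(w, S_i^f, cout_j\bigr).
\end{align*}
```

Under this reading, a rising input produces a falling output for an inverting segment, and writing the slew relations as inequalities rather than equalities lets the output slew variable take the worst value over all segments that drive pin j.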
Additional constraints and the objective function
There are a number of additional constraints required for successful optimization. If we are interested in minimizing critical delay, for each primary output j , two additional constraints
are required, where RAT indicates the required arrival time of rising and falling signals at the primary outputs. Of course, if all the required arrival times are uniform, then those terms can be dropped from the constraints in equation (4) and z interpreted accordingly, without any change in the resulting solution. In addition, an objective function is included. If we are not interested in minimizing critical delay, but rather minimizing area subject to system timing requirements, then for each primary output, two additional constraints
$AT_j^r \le RAT_j^r$
$AT_j^f \le RAT_j^f$
are required, and the z variable is unnecessary. Area is typically expressed as the weighted sum of some or all of the tunable transistor and wire sizes. Area can either be constrained or minimized. Keeping the area in check generally (but not always) keeps the power consumption at reasonable levels. Arrival times and slews on primary inputs are set equal to the assertions provided by the designer. Required slews are expressed as simple inequality constraints involving a single variable or simple bounds. Input loading constraints are expressed as some (usually simple) function of the fanout widths of the primary input being less than a required maximum loading.
The objective function can consist of just area, or just the quantity z or some weighted combination thereof. The weighting can be varied and the problem re-run to determine a tradeoff curve. Additional constraints for dynamic logic and sequential circuits are discussed in Section 6.
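As a compact summary of the modes described above, the following sketch writes them out in the notation of this section. The form given for constraints (4), the area weights $a_k$, the trade-off weight $\alpha$ and the symbol $PO$ for the set of primary outputs are assumptions inferred from the surrounding text rather than formulas quoted from it. In every case the timing constraints of all propagate segments and the simple bounds on the widths are understood to be included as well.

```latex
% Delay minimization at bounded area (assumed form of constraints (4))
\begin{align*}
\min \quad & z \\
\text{s.t.} \quad & z \ge AT_j^r - RAT_j^r, \quad z \ge AT_j^f - RAT_j^f, && j \in PO, \\
& \textstyle\sum_k a_k w_k \le A .
\end{align*}

% Area minimization subject to timing requirements (z not needed)
\begin{align*}
\min \quad & \textstyle\sum_k a_k w_k \\
\text{s.t.} \quad & AT_j^r \le RAT_j^r, \quad AT_j^f \le RAT_j^f, && j \in PO .
\end{align*}

% Weighted combination, re-solved for several values of alpha to trace a tradeoff curve
\[
\min \quad \alpha\, z + (1 - \alpha) \textstyle\sum_k a_k w_k , \qquad 0 \le \alpha \le 1 .
\]
```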
Implementation details
Overview
The circuit optimization formulation of the previous sections was implemented in a prototype tool called EinsTuner. The software architecture of the program is shown in Figure 3. The netlist is fed to the "outer layer" (lightly shaded box in Figure 3) of the EinsTuner program, which performs an initial timing run. Then the tuning problem is expressed in the SIF (Standard Input Format) nonlinear optimization language [9]. The resulting SIF file is decoded to produce several problem-specific FORTRAN files. These files are compiled and the resulting objects linked with the optimizer objects and the "back-end" of EinsTuner to create a custom executable. The back-end consists of routines to evaluate nonlinear functions such as delays and slews and the gradients thereof and supply them to the optimizer on demand. The evaluation is carried out by invoking a circuit simulator through an application programming interface (API). The custom executable is then invoked to carry out the actual inner optimization. Upon completion, transistor sizes are snapped to a technology-imposed grid, a final timing run is performed, and the required files for back-annotation of the new sizes are generated by the outer layer.
Problem formulation
Several aspects of the problem formulation are noteworthy. While any reasonable start point and any reasonable simple bounds should theoretically converge to the same tuned circuit, we should keep in perspective that we are asking the optimizer to solve problems in 1,000- or even 10,000-dimensional space in a few hundred iterations. The situation is further complicated by the presence of numerical noise in the computed delays and slews. In fact, any simulation-based data is inherently noisy. Hence many choices were made in the problem formulation to make the situation conducive to the optimizer being as aggressive as possible, and converging in as few iterations as possible.
Units, scale factors and weight factors were chosen with great care so that the resulting problem was well-scaled. A concerted effort was made to keep the optimizer within a "physical range" of variables. Slews at non-primary-inputs were constrained to be within technology-specific lower and upper slew bounds. To increase optimization efficiency, a special method was used to determine the bounds and start points of arrival times and slews. A quick "mock timing" run is conducted before formulating the problem to determine a lower bound on all arrival times. 
$z \ge c_i(x), \quad i = 1, 2, \ldots, n.$
We know from the Kuhn-Tucker optimality conditions that the Lagrange multipliers corresponding to the constraints must sum to -1 at the solution. Hence Lagrange multipliers were initialized to -1/n, where n is twice the number of primary outputs of the network in our case.
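The statement about the multipliers follows from stationarity with respect to z. A short derivation is sketched below for the problem of minimizing z subject to n constraints of the form $z \ge c_i$. The sign convention for the Lagrangian (multiplier terms added to the objective, constraints written as $z - c_i \ge 0$) is an assumption, chosen so that the result matches the value of -1 quoted in the text.

```latex
\begin{align*}
\mathcal{L}(x, z, \lambda) &= z + \sum_{i=1}^{n} \lambda_i \bigl( z - c_i(x) \bigr), \\
\frac{\partial \mathcal{L}}{\partial z} &= 1 + \sum_{i=1}^{n} \lambda_i = 0
\quad \Longrightarrow \quad \sum_{i=1}^{n} \lambda_i = -1 .
\end{align*}
```

With no prior knowledge of which of the n constraints will be active at the solution, spreading this total evenly gives the starting value $\lambda_i = -1/n$.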
Simulation and gradient computation
EinsTuner is based on simulation and gradient computation of each CCC by the fast event-driven simulator SPECS [10, 11]. Whenever a transistor size, input slew or fanout capacitance of a CCC is updated, the optimizer automatically calls SPECS to re-compute the delays, slews, and gradients thereof. SPECS uses table models for device i-v characteristics. Specialized integration techniques, event-driven simulation and simplified device models enable SPECS to be 70x faster than AS/X, an IBM-internal SPICE-like simulator, at a relative stage timing accuracy of 5%. Path delays, however, are predicted more accurately. We note that the saturated-ramp signal approximation is standard in most static timing analyzers. Our use of transient simulation rather than analytic formulas or delay tables leads to improved accuracy.
The key feature of SPECS exploited in EinsTuner is a mature gradient computation capability that has been extensively used in a dynamic tuner [12, 2, 3]. SPECS computes incremental time-domain sensitivity information by both the adjoint and direct methods. The adjoint method is used in EinsTuner. The sensitivity of a time-domain measurement (such as delay, slew, power or noise) can be computed with respect to any number of parameters (such as transistor widths, input slews or fanout capacitances) in a single adjoint analysis. Each adjoint analysis is a small incremental overhead on the nominal simulation. Gradients with respect to transistor widths include chain ruling and combining gradients with respect to diffusion capacitances whose values depend on those widths. "Group adjoints" are employed efficiently to compute gradients of linear combinations of measurements such as slews. In tuning the benchmark circuit c2670, for example, an estimated 15 million time-domain gradients were computed. Without a fast, accurate and reliable time-domain sensitivity engine, it would not have been possible to create a tool such as EinsTuner.
For each channel-connected component, the circuit is "constructed" by means of calls to a simulation application programming interface (API). The simulation conditions for all the propagate segments are concatenated in time and a single simulation of the CCC is performed to compute all the delays and slews. Simultaneously, the gradients of the delays and slews are computed with respect to each transistor size, input slew and fanout capacitance. All these gradients are cached in local arrays until requested by the optimizer. Much care was expended in the memory management of the API. Finally, various measures were taken to minimize the noisiness of the data provided by the simulator. Reduced noisiness in the data was crucial to achieving convergence on some of the larger benchmark circuits.
Nonlinear optimization
We use the large-scale, general-purpose nonlinear optimization package LANCELOT [9, 13, 14] with several special modifications for EinsTuner. LANCELOT uses an augmented Lagrangian merit function and employs a trust-region based algorithm. The merit function consists of a Lagrangian and a penalty term consisting of a weighted sum-of-squares of the constraints. Simple bounds are accommodated easily and efficiently by means of projections. The optimizer can be configured to use different preconditioners, to solve the inner bounded quadratic problem approximately or accurately, and so on.
LANCELOT allows one to exploit group partial separability [9] in the problem structure. In the EinsTuner problem formulation, the nonlinear contribution to each delay or slew constraint depends on only a few variables. This sparsity is communicated to LANCELOT via the SIF file and exploited by the optimizer, which is a key enabler of being able to solve large problems in relatively few iterations.
Since simulation and gradient computation are expensive compared to the optimization algorithms, it is worth going to great lengths in the optimizer to try to reduce the number of iterations required to achieve convergence. This principle was applied in several ways to speed up the optimization. All the slack variables internally introduced by LANCELOT to convert inequalities to equality constraints and the z variable occur exclusively linearly in the objective function and constraints. Thus they occur at most quadratically in the merit function, since the penalty term in LANCELOT's merit function squares the constraints. After each regular step of LANCELOT, a second step [15] can be computed analytically that updates these variables so as to further minimize the merit function. This two-step updating leads to fewer iterations.
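The reason a second step can be computed analytically is that a variable entering linearly makes the merit function a one-dimensional quadratic in that variable. The sketch below illustrates this for z alone. It assumes a merit function of the form objective plus multiplier terms plus a sum of squared constraints scaled by 1/(2*mu), with each equality constraint written as c_i = a_i*z + b_i, where b_i collects everything held fixed during the second step. The function names, this particular scaling and the numbers in the example are illustrative assumptions, not LANCELOT's actual interface.

```python
def merit(z, a, b, lam, mu):
    """Assumed augmented Lagrangian in z: z + sum(lam_i*c_i) + sum(c_i^2)/(2*mu)."""
    c = [ai * z + bi for ai, bi in zip(a, b)]
    lagrangian = z + sum(li * ci for li, ci in zip(lam, c))
    penalty = sum(ci * ci for ci in c) / (2.0 * mu)
    return lagrangian + penalty


def second_step_z(a, b, lam, mu):
    """Closed-form minimizer of merit() over z, all other quantities held fixed.

    Setting d(merit)/dz = 1 + sum(lam_i*a_i) + sum(a_i*(a_i*z + b_i))/mu = 0
    and solving for z gives the expression returned below.
    """
    saa = sum(ai * ai for ai in a)
    if saa == 0.0:
        raise ValueError("z does not appear in any constraint")
    linear_slope = 1.0 + sum(li * ai for li, ai in zip(lam, a))
    sab = sum(ai * bi for ai, bi in zip(a, b))
    return -(mu * linear_slope + sab) / saa


# Two constraints c_1 = z - 3 and c_2 = z - 5 with multipliers of -1/2 each.
a, b, lam, mu = [1.0, 1.0], [-3.0, -5.0], [-0.5, -0.5], 0.1
z_star = second_step_z(a, b, lam, mu)
```

Because the update is a closed-form expression, it costs no additional simulations, which is what makes it attractive when function evaluations dominate the run time.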
Several steps were taken to encourage the optimizer to be aggressive, such as forcing a large initial trust-region radius and revising the criteria for trust-region management. As a result, the optimizer often takes large steps that cause the circuit to "fail," meaning that one of the measured signals at the output of a CCC fails to switch in a reasonable time. In such a situation, the simulator sends a special return code to the optimizer. The optimizer skips the rest of the iteration, reduces the trust-region radius and tries again. In our opinion, failure recovery is a necessary ingredient of efficient circuit tuning.
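The failure-recovery loop described above can be sketched in a few lines. The example below is a one-variable toy, not EinsTuner's algorithm: the delay model, the failure threshold, the exception used in place of the simulator's return code, and the shrink and grow factors are all hypothetical. It only illustrates the control flow of skipping the iteration, reducing the trust-region radius and trying again.

```python
import math


class CircuitFailure(Exception):
    """Hypothetical stand-in for the simulator's special 'failed to switch' code."""


def simulate_delay(w):
    # Toy model: the gate is assumed to stop switching below a width of 0.5.
    if w < 0.5:
        raise CircuitFailure(f"width {w:.3f} too small, output did not switch")
    return 1.0 / w + 0.2 * w   # delay first falls, then rises, with width


def delay_gradient(w):
    return -1.0 / (w * w) + 0.2


def tune(w, radius=4.0, shrink=0.25, grow=2.0, iterations=50):
    best = simulate_delay(w)
    failures = 0
    for _ in range(iterations):
        # Aggressive step: go to the trust-region boundary in the descent direction.
        step = -math.copysign(radius, delay_gradient(w))
        try:
            trial = simulate_delay(w + step)
        except CircuitFailure:
            failures += 1
            radius *= shrink       # skip the rest of the iteration and retry
            continue
        if trial < best:
            w, best = w + step, trial
            radius *= grow         # successful step: allow larger moves
        else:
            radius *= shrink       # no improvement: be more cautious
    return w, best, failures


w_opt, d_opt, n_fail = tune(3.0)
```

Starting from a width of 3.0 with a deliberately large initial radius, the first step overshoots into the failing region, after which the radius is reduced and the search proceeds toward the minimum of the toy delay model.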
Prior to the simulation-based version of EinsTuner, a version that models delays and slews of gates by means of analytic equations was developed. In this environment, exact gradients can be provided to the optimizer and the data is not noisy. This software prototype was shown to consistently converge to within arbitrarily small gradient and constraint tolerances with default initializations and stopping criteria, thus validating the formulation of the problem. Therefore, if accurate and convex analytic delay models are available, the EinsTuner formulation can easily obtain the global optimum. Further, the success of this prototype bodes well for being able in the future to mix transistor-level modeling with analytic delay rules.
Simulation data is inherently noisy. In analytic problems, if the optimizer takes a small step, one expects a good match between the optimizer's model of the n-dimensional space and reality. However, this safety net does not exist in the case of simulation-based data. Several optimization choices were made to deal with noise. In particular, a special stopping criterion was developed to detect that no further significant improvement is readily available because the step size has become so small that the change in the data is dominated by noise.
Numerical results
EinsTuner was tested on a number of combinational benchmark circuits, including two actual circuit designs from a high-performance microprocessor. Tests were conducted on a pool of IBM RS6000 machines.
Other than for the actual designs (adder, i f t i , incrmntr and ioperdf) and the two artificially generated problems (inv3 and a 3 3 ), the testing procedure was as follows.
The design was synthesized from the ISCAS-85 suite of combinational benchmarks into an implementation consisting of restricted library cells using an internal logic synthesis tool. Each gate was treated as a parameterized cell with one variable controlling the width of the NFETs and one variable controlling the width of the PFETs. In some gates (such as OAI21) not all NFETs or PFETs were identical. Nonetheless, all NFETs and PFETs were each ratio-ed to a single parameter. The schematic was sized by employing a simple gain-based heuristic that involves traversing the graph from the primary outputs to the primary inputs. A gain factor of 4.0 was used to convert the fanout capacitance seen at each node of the network into a fanin capacitance. The fanin capacitance was then converted into total equivalent fanin gate width by means of a technology factor. A $\beta$ (PFET to NFET width parameter) ratio specific to the type of gate was used to apportion the equivalent transistor width between the NFETs and PFETs. The area of the resulting heuristically-sized schematic was computed. EinsTuner was then configured to minimize critical delay, subject to staying within the same area as the heuristic sizing. The purpose of applying the heuristic sizing was so that EinsTuner results could be compared to reasonably tuned circuits. More importantly, however, EinsTuner was able to solve the optimization problems, as indicated by the smallness of the projected gradient and infeasibilities at the solution.
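The gain-based heuristic described above can be sketched as a single outputs-to-inputs pass. The code below is an illustration under simplifying assumptions: every gate presents one input capacitance to its drivers, wire capacitance is ignored, and the technology factor `CAP_PER_WIDTH`, the load values and the gate names are hypothetical. Only the gain factor of 4.0 comes from the text.

```python
GAIN = 4.0            # gain factor quoted in the text
CAP_PER_WIDTH = 2.0   # hypothetical technology factor: gate capacitance per unit width


def size_by_gain(fanouts, output_load, beta):
    """Size gates by walking from the primary outputs back to the primary inputs.

    fanouts:     dict gate -> list of gates it drives, inserted in topological
                 order (gates nearest the inputs first)
    output_load: dict gate -> fixed load capacitance for gates driving outputs
    beta:        dict gate -> PFET-to-NFET width ratio for that gate type
    Returns dict gate -> (NFET width, PFET width).
    """
    cin = {}
    sizes = {}
    for gate in reversed(list(fanouts)):
        # Fanout capacitance: fixed load plus input capacitance of driven gates.
        cout = output_load.get(gate, 0.0) + sum(cin[f] for f in fanouts[gate])
        cin[gate] = cout / GAIN                    # fanout cap -> fanin cap
        total_width = cin[gate] / CAP_PER_WIDTH    # fanin cap -> equivalent width
        wn = total_width / (1.0 + beta[gate])      # split so that wp = beta * wn
        sizes[gate] = (wn, beta[gate] * wn)
    return sizes


# Three-stage chain g1 -> g2 -> g3, with g3 driving a load of 64 capacitance units.
fanouts = {"g1": ["g2"], "g2": ["g3"], "g3": []}
sizes = size_by_gain(fanouts, {"g3": 64.0}, {"g1": 1.0, "g2": 1.0, "g3": 1.0})
```

In this chain each stage ends up four times wider than the stage driving it, which is the tapering one would expect from a constant gain of 4.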
In the case of the remaining benchmarks, identical steps to the above were followed, but instead of a heuristic initial sizing, the real initial sizes were used. The optimization was then constrained to tune at constant area and constant input loading. These designs had already been well-tuned prior to the application of EinsTuner.
Table 1 shows the size of the various benchmarks. Note that the largest problem, while still being a modest-sized 2,796-transistor circuit, had over 5,600 variables and over 5,600 constraints, a moderately large problem by nonlinear optimization standards. Table 2 shows the actual numerical results. The critical path delay obtained by the heuristic method and the formal optimization, and the percentage improvement are shown in the third major column. In the case of the four actual designs, the original delay is shown in the heuristic column for convenience, but no heuristics were employed. An important measure of the success of the optimization is revealed by the smallness of the infeasibilities. The worst ("W") and average ("A") infeasibilities of the arrival time ("AT") and slew constraints are shown in the table. Every single constraint of every single problem was satisfied to within 1.5 ps or less. The number of iterations never exceeds 160, even for the largest problems. The projected gradient is reduced substantially from the start point. While it is possible to try to bring down the projected gradient further, our stopping criteria kicked in and terminated the optimization in the interests of efficiency. We have indications that any further progress would be small relative to the cost of carrying it out. Finally, the CPU time is shown in the right-most column. The adder design was the most time-consuming. It ran for over three days! Several methods of reducing the long run times of EinsTuner are enumerated in Section 7.
Profiling results on some smaller problems indicate that about 20% of the CPU time is spent in transient simulation, while the remaining 80% is consumed in the nonlinear optimization routines. On c2670, SPECS simulated 0.25 million CCCs and computed 15 million gradients, and LANCELOT executed almost half a million conjugate gradient iterations! In order to make additional comparisons, we have recently implemented a static tuning method similar to TILOS [4]. Instead of using a crude RC model, our version accurately simulates CCCs in the time-domain, but otherwise it has the same basic algorithm as TILOS, i.e., start with minimum area, and then iteratively add area in small increments to the most sensitive and timing-critical portions of the circuit. Our results indicate that our formulation using nonlinear optimization can achieve significantly better solutions than currently available heuristic tuning methods.
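The basic TILOS-like loop (start at minimum area, then repeatedly add a small amount of area where it helps the critical delay most) can be sketched as follows. This is only an illustration of the greedy structure: it uses a toy delay model on a single chain of gates, whereas the version described above simulates CCCs in the time domain. The function names, the delay model, the increment size and the area budget are hypothetical.

```python
def path_delay(widths, c_load):
    """Toy model: each stage's delay is the capacitance it drives over its own width."""
    loads = widths[1:] + [c_load]   # a stage drives the next stage, the last one the load
    return sum(load / w for load, w in zip(loads, widths))


def tilos_like(n_gates, c_load, area_budget, w_min=1.0, bump=0.1):
    widths = [w_min] * n_gates      # start from the minimum-area solution
    while sum(widths) + bump <= area_budget + 1e-9:
        base = path_delay(widths, c_load)
        best_gain, best_gate = 0.0, None
        for i in range(n_gates):
            trial = widths.copy()
            trial[i] += bump
            gain = base - path_delay(trial, c_load)   # delay reduction for this bump
            if gain > best_gain:
                best_gain, best_gate = gain, i
        if best_gate is None:       # no single increment improves the delay
            break
        widths[best_gate] += bump   # add area to the most sensitive gate
    return widths


initial_delay = path_delay([1.0, 1.0, 1.0], 27.0)
tuned = tilos_like(3, 27.0, area_budget=13.0)
tuned_delay = path_delay(tuned, 27.0)
```

Each pass changes exactly one width by a fixed increment, so the method is simple and monotone, but it never trades area between gates the way a simultaneous nonlinear optimization over all widths can.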
Indeed, we first ran both tuners, heuristic and optimal, on a small 22-gate macro iqia (not shown in Tables 1 and 2). The heuristic TILOS-like algorithm resulted in a 7% reduction of the delay along the critical path. The formal optimization algorithm resulted in an 11% reduction. We then ran both algorithms on a much larger design: ioperdf (see Table 2), a 64-bit comparator with 559 gates. Here also, the optimization algorithm resulted in a significantly larger improvement (16% reduction in the critical-path delay) than the heuristic algorithm (12% reduction).
Circuit generalizations
The EinsTuner implementation described in this paper only accommodates combinational circuits consisting of parameterized library cells. Because the approach is simulation-based and includes a general-purpose nonlinear optimizer, it is readily extended to arbitrary transistor-level custom designs and sequential circuits. The software prototype will gain in generality by incorporation into a transistor-level timer and use of the timer's graph to generate the list of constraints. This section describes the additional considerations necessary.
[Table 2 column headings: AT infeasibility (W, A); slew infeasibility (W, A); number of iterations; projected gradient (beginning, end); CPU time]
Arbitrary transistor-level circuits
Two existing techniques will allow the extension of EinsTuner to arbitrary custom circuits. First, pattern-matching of the transistor topology in a CCC using graph isomorphism algorithms allows the recognition of a wide variety of gates [16], including dynamic logic such as self-resetting CMOS or domino gates. Once the gate type is recognized, a pre-stored set of timing constraints is added to the SIF file for each gate of a particular type. The constraints for dynamic logic can include special timing requirements relating the arrival of the precharge signal to the data signals or special relationships between the forward and reset paths in the case of self-resetting CMOS.
Second, for topologies that cannot be recognized by pattern-matching, a state-traversal algorithm can be used to set up the propagate segments for the CCC [17]. The end result of either pattern-matching or state-traversal is a list of propagate segments (based on which a list of constraints can be generated), and the rules for simulating the CCC. Side-path loading and initialization of internal nodes of the CCC to actuate the worst-case pin-to-pin delay for each propagate segment are an important part of this analysis.
Sequential circuits
Sequential elements such as latches must first be recognized either by attribution of the netlist, or by pattern-matching. Depending on the type of sequential element, pre-compiled rules are followed to generate timing constraints and to simulate the element. For example, a latch will generate an additional set-up constraint. Special considerations are required for edge-triggered and transparent latches. For any type of latch, the required additional constraints must be added to the SIF file and the "back-end" configured to simulate the sequential element to evaluate each propagate segment.
Future work
In addition to the extensions described in the previous section, several avenues of future work suggest themselves. The long run times of EinsTuner can be ameliorated on many fronts. The direct method of sensitivity analysis may prove to be more efficient for the smaller (and most commonly encountered) CCCs. Employing a programming interface [18] to communicate with the optimizer will improve efficiency. The "adjoint Lagrangian" [3, 2] mode of gradient computation can be employed to compute all the gradients needed for a CCC by means of a single adjoint analysis, which would dramatically reduce the CPU time for gradient computation. Automatic criticality- and topology-based pruning can be used to reduce the number of timing and slew constraints without loss of accuracy. Failure recovery can be implemented more efficiently to avoid the overhead of possible repeated failures. Much can be done to make the nonlinear optimization more effective. Dealing with noisy data and determining good stopping criteria are topics of ongoing, but difficult, research. Two-step updating [15] can be applied to all arrival time variables, since they appear only linearly in the timing constraints.
In the future, one could envision taking noise constraints into account during optimization, using the mapping of semi-infinite noise constraints into equality constraints as in [19]. Simultaneous early and late-mode optimization could be employed to "fix" any fast-path problems. Tuning of wires along with transistors dovetails nicely into the formulation. Inclusion of timing constraints at several process corners and minimizing the worst negative slack across all the process corners can be used to improve parametric manufacturability. Where analytic delay rules exist as a function of transistor widths, they can be mixed with custom circuitry for the purposes of optimization.
Conclusions
This paper presented a unique formulation of the circuit optimization problem based on static timing analysis. By using large-scale, nonlinear optimization and fast transistor-level simulation and gradient computation, a wide range of circuits can be accurately optimized. Tuning of a number of datapath and control benchmark circuits has been demonstrated.
