This paper presents a modular optimization framework for custom digital circuits in the power-performance space. The method uses a static timer and a nonlinear optimizer to maximize the performance of digital circuits within a limited power budget by tuning various variables such as gate sizes, supply, and threshold voltages. It can employ different models to characterize the components. Analytical models usually lead to convex optimization problems where the optimality of the results is guaranteed. Tabulated models or an arbitrary timing signoff tool can be used if better accuracy is desired; although the optimality of the results then cannot be guaranteed, it can be verified against a near-optimality boundary. The optimization examples are presented on 64-bit carry-lookahead adders. By achieving the power optimality of the underlying circuit fabric, this framework can be used by logic designers and system architects to make optimal decisions at the microarchitecture level.
INTRODUCTION
Integrated circuit design has seamlessly entered the power-limited scaling regime, where the traditional goal of achieving the highest performance has been displaced by optimization for both performance and power. Achieving the optimal performance under power limits is a challenging task and is commonly achieved through architecture and logic design, adjustments in the transistor/gate sizing, supply voltages, or selection of the transistor thresholds.
Solving this problem is challenging because it involves a hierarchical optimization over a number of discrete and continuous variables, with a combination of discrete and continuous constraints.
Various optimization techniques have traditionally been employed in digital circuit design, ranging from simple heuristics to fully automated CAD tools. At the circuit level, custom integrated circuits can be manually sized for minimum delay using the method of logical effort. The technology mapping step in logic synthesis commonly employs delay minimization using gates with different sizes from a library of standard cells. TILOS [4] was the first tool that realized that the delay of logic gates expressed using Elmore's formula presents a convex optimization problem that can be efficiently minimized using geometric programming. While the convex delay models used by TILOS are rather inaccurate because of their simplicity, the result is guaranteed to be globally optimal. Circuit delay optimization under constraints has been automated in the past as well. IBM's EinsTuner uses a static timing formulation and tunes transistor sizes for minimal delay under total transistor width constraints. The delay models are obtained through simulation for better accuracy; however, this guarantees only local optimality.
The conventional delay minimization techniques can be extended to account for energy as well. For example, a combination of both energy and delay, such as the energy-delay product (EDP), has been used as an objective function for minimization. A circuit designed to have the minimum EDP, however, may not achieve the desired performance or could exceed the given energy budget. As a consequence, a number of alternate optimization metrics have been used that generally attempt to minimize an E^m D^n product. By choosing parameters n and m, a desired tradeoff between energy and delay can be achieved, but the result is difficult to propagate to higher layers of design abstraction. In the area of circuit design, this approach has traditionally been restricted to the evaluation of several different block topologies, rather than being used to drive the optimization. In contrast, a systematic solution to this problem is to minimize the delay for a given energy constraint. Note that minimizing the energy subject to a delay constraint yields the same solution. Two solutions to this problem for sizing at circuit level are well known. The minimum energy of the fixed logic topology block corresponds to all devices being minimum sized. Similarly, the minimum delay point is well defined: at that point further upsizing of transistors yields no delay improvement.
Custom datapaths are an example of power-constrained designs where the designers traditionally iterate in sizing between schematics and layouts. The initial design is sized using wireload estimates and is iterated through the layout phase until a set delay goal is achieved. The sizing is refined manually using the updated wireload estimates. Finally, after minimizing the delay of critical paths, the non-critical paths are balanced to attempt to save some power, or in the case of domino logic to adjust the timing of fast paths. This is a tedious and often lengthy process that relies on the designer's experience and has no proof of achieving optimality. Furthermore, the optimal sizing depends on the chosen supply and transistor thresholds. An optimal design would be able to minimize the delay under power constraints by choosing supply and threshold voltages, gate sizes or individual transistor sizes, logic style (static, domino, pass-gate), block topology, degree of parallelism, pipeline depth, layout style, wire widths, etc. This paper builds on the ideas of convex or gradient-based delay optimization techniques under constraints. The average energy per computation is used as a constraint for the delay minimization method. The ideas presented here constitute a modular design optimization framework for custom digital circuits in the power-performance space that:
• Formulates the design as a mathematical optimization problem;
• Uses a static timer to perform all circuit-related computations;
• Uses a mathematical optimizer to solve the optimization problem numerically;
• Adjusts various design variables at different levels of abstraction;
• Can employ different models in the timer in order to balance accuracy and convergence speed;
• Handles various logic families (static, dynamic, pass-gate) due to the flexibility of the modeling step;
• Guarantees the global optimality of the solution for certain families of analytical models that result in the optimization problem being convex;
• Verifies a near-optimality condition if global optimality cannot be guaranteed.
DESIGN OPTIMIZATION FRAMEWORK
The framework is built around a versatile optimization core consisting of a static timer in the loop of a mathematical optimizer, as shown in Figure 1.
The optimizer passes a set of specified design variables to the timer and gets the resulting cycle time (as a measure of performance) and power of the circuit, as well as other quantities of interest such as signal slopes, capacitive loads and, if needed, design variable gradients. The process is repeated until it converges to the optimal values of the design parameters that achieve the desired optimization goal. The circuit is defined using a SPICE-like netlist and the static timer employs user-specified models in order to compute delays, cycle times, power, signal slopes, etc. The choice of models depends on the tradeoffs between the desired accuracy and convergence speed and is discussed in Section 3.
Since the static timer is in the main speed-critical optimization loop, it is implemented in C++ to accelerate computation. It is based on the conventional longest path algorithm. The custom-written timer does not account for false paths or simultaneous arrivals, but it can be easily substituted with a more sophisticated one because of the modularity of the optimization framework.
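As an illustration of the longest path computation mentioned above, the following minimal Python sketch propagates arrival times through a combinational netlist in topological order. The netlist representation, the function name `longest_path_delay`, and the fixed per-gate delays are assumptions made for this example only; the timer described here is written in C++ and computes delays from the models of Section 3.

```python
from collections import defaultdict, deque


def longest_path_delay(gate_delay, edges, primary_inputs):
    """Return (worst arrival time, arrival time per node) for a DAG.

    gate_delay: dict node -> delay of the gate driving that node
                (hypothetical fixed values; primary inputs use 0)
    edges: list of (driver, receiver) pairs
    primary_inputs: nodes whose arrival time is 0
    """
    fanout = defaultdict(list)
    indegree = {node: 0 for node in gate_delay}
    for src, dst in edges:
        fanout[src].append(dst)
        indegree[dst] += 1

    arrival = {node: 0.0 for node in gate_delay}
    ready = deque(primary_inputs)
    visited = 0
    while ready:
        node = ready.popleft()
        visited += 1
        for nxt in fanout[node]:
            # Latest arrival among all fan-ins, plus the receiving gate's delay.
            arrival[nxt] = max(arrival[nxt], arrival[node] + gate_delay[nxt])
            indegree[nxt] -= 1
            if indegree[nxt] == 0:
                ready.append(nxt)
    if visited != len(gate_delay):
        raise ValueError("netlist has a cycle or unreachable nodes")
    return max(arrival.values()), arrival
```

Like the timer described above, this sketch ignores false paths and simultaneous arrivals; it only reports the structurally longest path.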
The optimization core can be configured to perform various tasks for different types of circuits. For instance, if the circuit to be optimized is combinational, the framework can be configured to solve the following optimization problem:
Adjust GATE SIZES in order to Minimize DELAY subject to a maximum ENERGY constraint, with the following additional constraints (in order to ensure manufacturability and correct circuit operation):
Maximum internal slopes
Maximum output slopes
Maximum input capacitances
Minimum gate sizes
By solving this optimization problem for different values of the energy constraint, the optimal energy-delay tradeoff curve for that circuit is obtained, as shown in Figure 2.
The optimal tradeoff curve has two well-defined end points: point 1 represents the fastest circuit that can be designed; point 2 represents the circuit with the lowest energy per transition, primarily limited by minimum gate sizes and signal slope constraints. The points in between the two extremes (marked "3" on the graph) correspond to minimizing various E^m D^n design goals (such as the EDP).
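The way a tradeoff curve arises from repeated energy-constrained delay minimization can be sketched on a toy problem. The Python example below is not the framework's optimizer: it assumes a hypothetical two-stage inverter chain with a single free size `w`, a normalized delay `w + C_LOAD / w`, and an energy proportional to total gate size, so that the constrained optimum has a closed form. All names and numbers are illustrative assumptions.

```python
import math

C_LOAD = 16.0  # hypothetical load, in units of a minimum inverter's input capacitance
W_MIN = 1.0    # minimum gate size constraint


def delay(w):
    # Stage 1 (size 1) drives stage 2 (size w); stage 2 drives C_LOAD.
    return w / 1.0 + C_LOAD / w


def energy(w):
    # Switched capacitance taken as proportional to the total gate size.
    return 1.0 + w


def min_delay_under_energy(e_max):
    """Minimize delay(w) subject to energy(w) <= e_max and w >= W_MIN.

    Returns (w, delay, energy), or None if the budget is infeasible.
    """
    w_cap = e_max - 1.0  # largest size allowed by the energy budget
    if w_cap < W_MIN:
        return None      # budget is below the minimum-size design
    w_unconstrained = math.sqrt(C_LOAD)  # where d(delay)/dw = 0
    # delay(w) is convex, so the constrained optimum is the clipped stationary point.
    w_opt = min(max(w_unconstrained, W_MIN), w_cap)
    return w_opt, delay(w_opt), energy(w_opt)


def tradeoff_curve(budgets):
    """Solve the constrained problem once per energy budget."""
    return [(e, min_delay_under_energy(e)) for e in budgets]
```

In this toy model the minimum-size design plays the role of point 2 and the design at `w = sqrt(C_LOAD)` plays the role of point 1: raising the energy budget beyond that size buys no further delay reduction.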
MODELS
Arbitrary optimization problems are very difficult to solve and the global optimality of the result usually cannot be guaranteed. If the functions involved in the optimization have certain mathematical properties, the problem becomes easier and certain statements can be made about the optimality of the results. In particular, convex optimization problems (where the objective and inequality constraint functions are convex) can be solved reliably by commercial optimizers while guaranteeing the global optimality of the result.
For the circuit optimization framework from Figure 1, the properties of the objective and constraint functions are given by the models used in the static timer. Therefore, the choice of models in the static timer greatly influences the convergence speed and robustness of the optimizer. Analytical or tabulated models can be used in the optimization framework, depending on the desired accuracy and speed targets. Table I shows a comparison between the two main choices of models. Closed-form analytical models can usually be forced into a convex form using various mathematical operations such as changes of variables and the introduction of additional (slack) variables. Tabulated models provide excellent accuracy at the points of characterization, but sacrifice the convexity property.
Table I. Comparison of analytical and tabulated models. Analytical models: + can exploit mathematical properties to formulate a convex optimization problem. Tabulated models: - can't guarantee convexity; optimization is "blind".
Analytical Models
In our initial optimizations we use a simple, yet fairly accurate analytical model. This model allows for a convex formulation of the resulting optimization problem, where the gate sizes are the optimization variables. The model has three components: a delay equation (1), a signal slope equation (2), and an energy equation (3). Equations (1) and (2) are a straightforward first-order extension of simpler delay models that accounts for signal slopes.
The capacitance of a node is computed using (4), where W_i are the corresponding gate sizes. Each input of each gate is characterized for each transition by a set of seven parameters: p, g, η for the delay, λ, μ, ν for the slope, and k for the capacitance. Each gate is also characterized by an average leakage power P_leak measured when its relative size is W = 1. Each node of the circuit has an activity factor α, which is computed through logic simulation for a set of representative input patterns.
All the above equations can be written as posynomials in the gate sizes W_j, as in (5) and (6). If t_slope_in is a posynomial, then t_D and t_slope_out are also posynomials in W_j. By specifying fixed signal slopes at the primary inputs of the circuit, the resulting slopes and arrival times at all the nodes will also be posynomials in W_j. The maximum delay across all paths in the circuit will be the maximum of several posynomials, hence a generalized posynomial. A function f is a generalized posynomial if it can be formed using addition, multiplication, positive power, and maximum selection starting from posynomials.
The energy equation is also a generalized posynomial: the first term is just a linear combination of the gate sizes, while the second term is another linear combination of the gate sizes multiplied by the cycle time, which in turn is related to the delay through the critical path and is hence also a generalized posynomial.
The optimization problem described in Section 2 using the above models has generalized posynomial objective and constraint functions. Such an optimization problem with generalized posynomials is a generalized geometric program (GGP) [7]. It can be converted to a convex optimization problem using the simple change of variables (7).
With this change of variables the problem is tractable and can be easily and reliably solved by generic commercial optimizers. Moreover, since in convex optimization any local minimum is also global, the optimality of the result is guaranteed.
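For reference, the standard change of variables for geometric programs, assumed here to be the form intended by (7), substitutes the logarithm of each gate size and takes the logarithm of each posynomial. The symbols z_j, c_k, and a_kj are introduced only for this sketch.

```latex
W_j = e^{z_j}, \qquad
f(W) = \sum_{k} c_k \prod_{j} W_j^{a_{kj}}, \quad c_k > 0
\;\;\Longrightarrow\;\;
\log f(e^{z}) = \log \sum_{k} \exp\!\Big( \log c_k + \sum_{j} a_{kj}\, z_j \Big)
```

The right-hand side is a log-sum-exp of affine functions of z and is therefore convex. A maximum of posynomials becomes a maximum of such convex functions, which is again convex, so the generalized posynomials of the GGP are covered as well.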
This delay model applies to any logic family where a gate can be represented through channel-connected components, as in the case of complementary CMOS or domino logic. The limitation of this approach is that it uses linear approximations for the delay, signal slopes, and capacitances. Figure 3 shows a comparison of the actual and predicted delay for the rising transition of a gate for a fixed input slope and variable fanout. Since the actual delay is slightly concave in the fanout, the linear model is pessimistic at low and high fanouts and optimistic in the mid-range. The accuracy of the models can be increased by fitting them to higher order posynomials (hence maintaining the convexity of the optimization problem), but this results in exponentially increased time for characterization.
Tabulated Models
If the accuracy of linear, analytical models is not satisfactory, tabulated models can be used instead. For instance, (1), (2) and their respective parameters can be replaced with the look-up table shown in Table II. The table can have as many entries as needed for the desired accuracy and density of the characterization grid. Actual delays and slopes used in the optimization procedure are obtained through linear interpolation between the points in the table. The grid is non-uniform, with more points in the mid-range fanouts and slopes, where most designs are likely to operate. Additional columns can be added to the tables for different logic families; for instance, if a dynamic gate is characterized this way, the relative size of the keeper to the pull-down network needs to be included, too.
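The table lookup with linear interpolation described above can be sketched as follows. The table layout (a delay value indexed by fanout and input slope), the function names, and the grid values are assumptions for illustration; the contents of Table II are not reproduced here.

```python
from bisect import bisect_right


def _bracket(grid, x):
    """Return (i, t) such that x = grid[i] + t * (grid[i + 1] - grid[i])."""
    if not grid[0] <= x <= grid[-1]:
        raise ValueError("point outside the characterized range")
    # Clamp so that the last grid point falls into the last interval.
    i = min(bisect_right(grid, x) - 1, len(grid) - 2)
    t = (x - grid[i]) / (grid[i + 1] - grid[i])
    return i, t


def lookup(fanouts, slopes, table, fanout, slope):
    """Bilinear interpolation on a non-uniform grid.

    table[i][j] holds the characterized value at (fanouts[i], slopes[j]).
    """
    i, u = _bracket(fanouts, fanout)
    j, v = _bracket(slopes, slope)
    low = (1 - v) * table[i][j] + v * table[i][j + 1]
    high = (1 - v) * table[i + 1][j] + v * table[i + 1][j + 1]
    return (1 - u) * low + u * high
```

The lookup refuses to extrapolate. This is a design choice of the sketch, made because values outside the characterized range would not be backed by simulation data.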
The resulting optimization problem, even when using the change of variables from (7), cannot be proven to be convex. However, although not absolutely accurate, the analytical models that describe the behavior of the circuits closely approximate the tabulated models. Thus, the resulting optimization problem is nearly convex and can still be solved with very good accuracy and reliability by the same optimizers as before. The result of the nearly convex problem can be checked against a near-optimality boundary. The example in Figure 4 shows a comparison of the analytical and tabulated models and the corresponding near-optimality boundary.
The figure shows the energy-delay tradeoff curves for an example 64-bit Kogge-Stone carry tree in static CMOS using a 130 nm process. The same circuit is optimized using each of the two model choices discussed in this section. Both models show that the fastest static 64-bit carry tree can achieve a delay of approximately 560 ps, while the lowest achievable energy is 19 pJ per transition. The analytical models are slightly optimistic because the optimal designs exhibit mid-range gate fanouts, where the analytical models tend to underestimate the delays (Fig. 3).
The near-optimality boundary is obtained by using tabulated models to compute the delay and energy of the designs that resulted from the optimization with analytical models. This curve represents a set of designs optimized using analytical models, but evaluated with tabulated models. Since those designs are guaranteed to be optimal for analytical models, the boundary is within those models' error of the actual global optimum. However, if an optimization using the correct (tabulated) models converges to the correct solution, it will always yield a better result than a re-evaluation, using the same models, of the results of a different optimization. Therefore, if the optimization with tabulated models is to converge correctly, the result must be within the near-optimality boundary, i.e., it will have a smaller delay for the same energy.
If a solution obtained using tabulated models is within the near-optimality boundary, it will be deemed "near-optimal" and hence acceptable.
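The acceptance test described above amounts to comparing a candidate design against the boundary at equal energy. The sketch below assumes the boundary is available as a list of (energy, delay) points with strictly increasing energies and interpolates linearly between them; the function names and the optional tolerance are hypothetical, not part of the framework.

```python
def boundary_delay(boundary, energy):
    """Linearly interpolate the boundary delay at a given energy.

    boundary: at least two (energy, delay) points, sorted by strictly
    increasing energy. Each point is a design optimized with analytical
    models and re-evaluated with tabulated models.
    """
    if not boundary[0][0] <= energy <= boundary[-1][0]:
        raise ValueError("energy outside the range covered by the boundary")
    for (e0, d0), (e1, d1) in zip(boundary, boundary[1:]):
        if e0 <= energy <= e1:
            t = (energy - e0) / (e1 - e0)
            return d0 + t * (d1 - d0)


def is_near_optimal(boundary, energy, delay, tol=0.0):
    """True if the candidate is no slower than the boundary at the same energy."""
    return delay <= boundary_delay(boundary, energy) + tol
```

A candidate that fails this check suggests that the optimization with tabulated models did not converge correctly, since a re-evaluated analytical design already achieves a smaller delay at that energy.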
In a more general interpretation, optimizing using tabulated models is equivalent to optimizing using a trusted timing signoff tool whose main feature is very good accuracy. The result of such an optimization is not guaranteed to be globally optimal. The near-optimality boundary is obtained by running the timing signoff tool on a design obtained from an optimization that can guarantee the global optimality of the solution. The comparison is fair because the power and performance figures on both curves are evaluated using the same (trusted and accurate) timing signoff tool.
Power-Performance Optimization for Custom Digital Circuits. J. Low Power Electronics 2, 113-120, 2006
Model Generation and Accuracy
Tabulated models are generated through simulation. The gate to be modeled is placed in a simple test circuit, and the fanout, input slope, and relative keeper size (for dynamic gates) are adjusted using automated Perl scripts. The simulator is invoked iteratively for all the points in the table and the relevant output data (delay, output slope) is stored. This can be lengthy (although parallelizable) if the grid is very fine and the number of points large. This characterization is similar to the one performed for standard-cell libraries, and yields satisfactory accuracy. Alternatively, or in addition, characterization points for static gates can be taken from the tabulated entries in the standard-cell library.
Analytical models are obtained through data fitting. Data points are obtained through simulation in the same manner as for tabulated models. Least squares fitting is used to obtain the parameters of the models. The number of points required for a good fit (50-100, depending on the model) is less than the number of points needed for tabulated models (of the order of 1000), and thus the characterization time for analytical models is one order of magnitude shorter.
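A least squares fit of this kind can be sketched in a few lines. The model form used below, delay = p + g * fanout + eta * input_slope, is an assumption chosen to match the linear character of the analytical model, not a reproduction of (1). The function names are hypothetical, and the normal equations are solved with a small hand-written Gaussian elimination to keep the example free of dependencies.

```python
def solve_linear(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]  # augmented matrix
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            f = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= f * m[col][c]
    x = [0.0] * n
    for r in reversed(range(n)):
        s = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - s) / m[r][r]
    return x


def fit_delay_model(samples):
    """Fit delay = p + g * fanout + eta * input_slope by least squares.

    samples: list of (fanout, input_slope, delay) from simulation.
    Returns (p, g, eta).
    """
    rows = [(1.0, h, t) for h, t, _ in samples]
    ys = [d for _, _, d in samples]
    # Normal equations: (A^T A) x = A^T y
    ata = [[sum(r[i] * r[j] for r in rows) for j in range(3)] for i in range(3)]
    aty = [sum(r[i] * y for r, y in zip(rows, ys)) for i in range(3)]
    return tuple(solve_linear(ata, aty))
```

With real simulation data the fit will not be exact, and the residuals give a direct estimate of the model error over the characterized range.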
The error of the analytical models depends on their complexity and on the desired data range. The models in (1) and (2) are accurate within 10% of the actual (simulated) delays and slopes for the range specified in Table II. The energy equation (3) is accurate within 5% for fast slopes, but its accuracy degrades to 12% underestimation at slow input slopes due to the crowbar current (which is not included in the equation). The maximum slope constraints for output and internal nodes ensure such worst cases do not occur in usual designs.
RESULTS
We use the presented optimization framework to optimize a 64-bit adder, which is a very common component of custom datapaths. The critical path of the adder consists of the carry computation tree and the sum select. Tradeoffs between performance and power can be made through the selection of circuit style, logic design of carry equations, and selection of a tree that calculates the carries, as well as through sizing and choices of supply voltages and transistor thresholds.
Carry-lookahead adders are frequently used in high-performance microprocessor datapaths. Although adder design is a well-documented research area, fundamental understanding of their energy-delay performance at the circuit level is still largely invisible to the microarchitects. The optimization framework presented in this paper provides a means of finding the energy budget breakpoint where the architects should change the underlying circuit design.
Datapath adders are a good example for the optimization because their layout is often bit-sliced. Therefore, the critical wire lengths can be estimated pre-design and are a weak function of gate sizing. The optimization is performed on two examples:
(1) A 64-bit carry tree of a carry-lookahead adder implemented in standard static CMOS, using analytical models to tune gate sizes, supply, and threshold voltages;
(2) 64-bit carry lookahead adders implemented in domino and static CMOS, using tabulated models.
Tuning Sizes, Supply, and Threshold Using Analytical Models
In order to tune supply and threshold voltages, the models must include their dependencies. A gate equivalent resistance can be computed, as in (8), from analytical saturation current models (a reduced form of the BSIM3v3 model).
Using (8), supply and threshold dependencies can be included in the delay model. For instance, (1) becomes (9), with (2) having a very similar expression.
The model is accurate within 8% of the actual (simulated) delays and slopes around nominal supply and threshold, over a reasonable yet limited range of fanouts (2.5-6). For a ±30% range in supply and threshold voltages the accuracy is 15%. Figure 5 shows the optimal energy-delay tradeoff curves of a 64-bit Kogge-Stone carry tree implemented in static CMOS in three cases; case 2 uses the optimal V_DD with the nominal V_TH. A few interesting conclusions can be drawn from the above figures:
• The nominal supply voltage is optimal in exactly one point, where the V_DD = 1.2 V curve is tangent to the optimal V_DD curve. In that point, the sensitivities of the design to both supply and sizing are equal;
• Power can be reduced by increasing V_DD and downsizing if the V_DD sensitivity is less than the sizing sensitivity;
• Achieving the last few picoseconds of the delay reduction is very expensive in energy because of the large sizing sensitivity (curves are very steep at low delays);
• The optimal threshold is well below the nominal threshold. For such a high-activity circuit, the power lost through increased leakage is recuperated by the downsizing afforded by the faster transistors with lower threshold. Markovic et al. came to a similar conclusion using an analytical approach.
The logic structure of the adders can be found in (Ref. [17]). Figure 6 shows the energy-delay tradeoff curves for a few representative adder configurations in a general-purpose 130 nm process. Radix-2 (R2) adders merge 2 carries at each node of the carry tree. For 64 bits, the tree has 6 stages of relatively simple gates. Radix-4 (R4) adders merge 4 carries at each stage, and therefore a 64-bit tree has only 3 stages, but the gates are more complex. In the notation used in Figure 8, classical domino adders use only (skewed) inverters after a dynamic gate, whereas compound domino adders use more complex static gates, performing actual radix-2 carry-merge operations [18]. Based on these tradeoff curves, microarchitects can clearly determine that under these loading conditions radix-4 domino adders are always preferred to radix-2 domino adders. For delays longer than 12.5 FO4 inverter delays, a static adder is the preferred choice because of its lower energy.
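The stage counts quoted above for radix-2 and radix-4 trees follow from repeatedly multiplying the span of merged carries by the radix. The short sketch below makes that arithmetic explicit; the function name is hypothetical and it assumes a full (non-sparse) carry tree.

```python
def carry_tree_stages(bits, radix):
    """Number of merge stages needed to combine `bits` carries, `radix` at a time."""
    stages, span = 0, 1
    while span < bits:
        span *= radix  # each stage widens the merged group by the radix
        stages += 1
    return stages
```

For 64 bits this gives 6 stages at radix 2 and 3 stages at radix 4, matching the figures in the text.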
The fastest adder implements Ling's pseudo-carry equations in a domino radix-4 tree with a sparseness factor of 2 [17]. An implementation of the fastest adder in a general-purpose 90 nm process is described in (Ref. [19]), and measured results are in good agreement with the optimizer.
Runtime Analysis
The complexity and runtime of the framework depend on the size of the circuit. Small circuits are optimized almost instantaneously. A 64-bit domino adder with 1344 gates (a fairly large combinational block) is optimized on a 900 MHz P3 notebook computer with 256 MB of RAM in 30 seconds to 1 minute if the constraints are rather lax. When the constraints are particularly tight and the optimizer struggles to keep the optimization problem feasible, the time increases to about 3 minutes. A full power-performance tradeoff curve with 100 points can be obtained in about 90 minutes on such a machine. For grossly infeasible problems the optimizer provides a "certificate of infeasibility" in a matter of seconds.
For large designs the framework allows gate grouping. By keeping the same relative aspect ratio for certain groups of gates, the number of variables can be reduced and the runtime kept reasonable. Gate grouping is a natural solution for circuits with regular structure. For instance, in an adder, gates can be grouped at various levels of the carry tree, which simplifies the layout. All the adders optimized in Sections 4.1 and 4.2 use gate grouping for identical gates in the same stage.
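Gate grouping can be pictured as replacing one optimization variable per gate with one scale factor per group. The sketch below is only an illustration of that bookkeeping; the data layout and the name `expand_group_sizes` are assumptions, not the framework's interface.

```python
def expand_group_sizes(base_size, group_of, group_scale):
    """Return per-gate sizes when every gate in a group shares one scale factor.

    base_size:   dict gate -> reference size (fixes the ratios inside a group)
    group_of:    dict gate -> group name (e.g. a stage of the carry tree)
    group_scale: dict group name -> the optimization variable for that group
    """
    return {gate: base_size[gate] * group_scale[group_of[gate]]
            for gate in base_size}
```

The optimizer then sees only `len(group_scale)` variables, while the timer still receives a size for every gate.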
CONCLUSIONS
This paper presents a design optimization framework that tunes custom digital circuits based on a static timing formulation. The framework can use a wide variety of models and tune different design variables. The problem solved is generally an energy-constrained delay minimization. Due to the flexibility in choosing models, the framework can easily handle various logic families.
If analytical models are used, the optimization is convex, can be easily and reliably solved, and its results are guaranteed to be optimal. The accuracy of the modeling can be improved by using look-up tables, at the cost of the optimality guarantee as well as increased characterization time and complexity. More generally, the optimization can be run on any trusted and accurate timing signoff tool, with the same tradeoffs and limitations as for tabulated models. Results obtained using tabulated models (or with the said "trusted and accurate timing signoff tool") can be verified against a near-optimality boundary computed from results guaranteed optimal in their class. If the results fall within that boundary, they are considered near-optimal and therefore acceptable.
The framework was demonstrated on 64-bit carry-lookahead adders in 130 nm CMOS. A static Kogge-Stone tree was tuned using analytical models by adjusting gate sizes, supply voltage, and threshold voltage. Complete domino and static 64-bit adders were also tuned in a typical high-performance microprocessor environment using tabulated models by adjusting gate sizes.
The framework can be extended to optimize sequential blocks as well. One aspect of this optimization could involve the placement of the latch positions in a pipelined datapath. By building on the combinational circuit optimization, this tool would allow microarchitects a larger freedom in trading off cycle time for latency. Another interesting extension of this framework is to optimize the energy-delay of a block in the presence of uncertainty. The convex delay models can be extended to include the parameter uncertainty due to process or environment variations. By using these models, the GGP translates into a robust GP.
