Abstract
Introduction
Leakage power has become one of the most critical design concerns. While lowered supplies (and consequently lowered Vth) and aggressive clock gating can achieve dynamic power reduction, these techniques increase leakage power and therefore cause its share of total power to increase. Leakage has become a significant contributor to total power and its contribution is projected to increase from 18% at 130nm to 54% at the 65nm node [12]. Leakage is composed of three major components: (1) subthreshold leakage, (2) gate leakage, and (3) reverse-biased drain-substrate and source-substrate junction band-to-band-tunneling leakage [1]. Subthreshold leakage is the dominant contributor to total leakage at 130nm and is forecast to remain so in the future [1]. In this work we target subthreshold leakage reduction.
Leakage reduction methodologies can be divided into two classes depending on whether they reduce standby leakage or runtime leakage. Standby techniques reduce leakage of devices that are known not to be in operation, while runtime techniques reduce leakage of active devices. Several techniques have been proposed for standby leakage reduction. Body biasing or VTMOS based approaches [6] dynamically adjust the device Vth by biasing the body terminal (footnote 1). Multi-threshold CMOS (MTCMOS) techniques [10, 7, 11, 15] use high-Vth CMOS (or NMOS or PMOS) devices to disconnect Vdd or Vss or both from logic circuits implemented using low-Vth devices in standby mode. Source biasing, where a positive bias is applied in standby state to source terminals of off devices, was proposed in [5]. Other techniques such as use of transistor stacks [23] and input-vector control [4] have also been proposed.
Fewer techniques have been proposed for runtime leakage reduction. Recently, [3] proposed a runtime leakage reduction approach that increases the gate lengths of transistors not on critical paths. However, the only mainstream approach to runtime leakage reduction is the multi-Vth manufacturing process. In this approach, cells on non-critical timing paths are manufactured to have higher threshold voltages, while cells on critical timing paths have lower threshold voltages. [20] presented a heuristic algorithm for selection and assignment of optimal high Vth to cells on non-critical paths. The multi-Vth approach has also been combined with several other power reduction techniques [9, 22, 16]. The primary drawback of this technique has been the rise of process costs due to additional manufacturing steps and masks required for each extra Vth. However, the substantial leakage savings provided by this approach have outweighed cost increases, and dual-Vth processes are now standard and used together with other power reduction techniques.
Today's standard cell library based flows use cell-level Vth assignment (CLVA) techniques in which all PMOS and NMOS transistors in a cell are assigned the same threshold voltage. When two threshold voltages are available, the library required for CLVA is only twice the size of a single-Vth library since there are two variants of each cell. In this paper, we investigate the benefits and costs associated with transistor-level Vth assignment (TLVA). Since different transistors control different timing arcs, TLVA can modify the delays of individual timing arcs, unlike CLVA. Asymmetry in timing criticality of different timing arcs of a cell instance in a circuit, as well as in rise and fall transitions, can be utilized by TLVA to yield significant leakage savings. Unfortunately, TLVA requires a larger number of cell variants, which translates to a larger library and higher library characterization effort.
Approaches for TLVA have previously been proposed in [17, 21, 8]. [17] proposed a sensitivity-based upsizing algorithm (i.e., one that begins with nominal Vth assigned to all transistors and iteratively assigns low Vth to timing-critical transistors) which combined transistor sizing and Vth assignment. [8] later proposed an enumeration based technique with better quality of results and reduced runtime. However, the enumeration based approach proposed in [8] quickly grows in space and runtime requirements as the input size increases. The approach neglects the effect of Vth assignment on capacitance and is inclined to assign low Vth more to the transistors near primary inputs. A sensitivity-based downsizing approach (i.e., one that begins with low Vth assigned to all transistors and assigns nominal Vth to non-critical transistors) was proposed in [21]. However, the technique of [21] has the following shortcomings.
• Impractical. The presented technique does not use precharacterized cell delay values but relies on analytical expressions to estimate delays of different timing arcs. Despite the high library characterization costs, analytical delay models are never used in practice due to their unacceptably large inaccuracy. The equations used in [21] assume that each input is controlled by one NMOS and one PMOS transistor, and that there are no transistors that do not control an input. These assumptions fail for all but the simplest cells. The impact of Vth on capacitance is ignored, and the transistor delay models and timing analysis used ignore the dependence of delay on slew (transition delay). Also, the impact of switching time of NMOS (PMOS) transistors on rise (fall) transition is ignored.
• Large library size. A standard cell library is required for timing closure in successive design steps. The proposed approach allows freedom to assign Vth to individual transistors in a cell separately and hence may require up to 2^T variants of any given cell, where T is the number of transistors in the cell. A cell library with such a large number of cell variants may not be practical.
• Large runtime. The proposed algorithm (called PS in [21]; see footnote 2) is extremely time-consuming since all sensitivities are recomputed and full static timing analysis is run after each downsizing move.
We present an effective, accurate and scalable transistor-level Vth assignment technique which is sensitivity-based and performs downsizing. In our studies, we have found downsizing to be significantly more effective for leakage reduction than upsizing irrespective of the delay constraints. An intuitive rationale is that upsizing approaches have dual objectives of delay and leakage while performing the upsizing moves. Downsizing approaches, on the other hand, are bound to meet timing constraints since they only perform downsizing moves if no timing violations are caused, and they hence have the sole objective of leakage minimization (footnote 3). We apply our dual-Vth leakage reduction approach first at cell level and then at transistor level to reduce the runtime of TLVA. The following subsections introduce our ideas in detail.
Delay Asymmetries in Timing Arcs within Cell
We use the term timing arc to indicate an intra-cell path from an input transition to a resulting rise (or fall) output transition (footnote 4). For an n-input gate there are 2n timing arcs (footnote 5). Due to different parasitics as well as PMOS/NMOS asymmetries, these timing arcs can have different delay values associated with them. For instance, Table 1 shows the delay values for the same input slew and load capacitance pair for different timing arcs of a NAND2X2 cell from the Artisan TSMC 130nm library.

Footnote 2: [21] proposed two other algorithms, BT and PB, which are inferior in solution quality but significantly faster. We compare with PS due to the similarity of our algorithm with PS and demonstrate the runtime savings due to our engineering optimizations such as the use of incremental timing analysis.
Footnote 3: An upsizing approach, however, may be faster when loose delay constraints are to be met since very few transistors have to be upsized. However, delay is almost always the primary design goal and loose delay constraints are rare.
Footnote 4: We assume all cells have one output.
Footnote 5: There may be four timing arcs corresponding to non-unate inputs (e.g., select input of MUX).
Use of Asymmetry
Pin swapping is a common post-synthesis timing optimization step to make use of the asymmetry in delays of different input pins. To make use of asymmetry in rise-fall delays, techniques such as P/N ratio perturbations have been previously proposed to decrease circuit delay [2]. We propose to exploit these asymmetries using TLVA to "recover" leakage from non-critical timing arcs within a cell. The other technique for runtime leakage reduction, gate-length biasing [3], may also be developed for leakage reduction using the available asymmetry. We use TLVA since it is more mature and appears to yield a more favorable tradeoff between delay penalty and leakage reduction.
The remainder of this paper is organized as follows. In Section 2, we describe the proposed TLVA methodology. Section 3 describes our experiments and presents the results. Finally, Section 4 concludes and gives a brief description of ongoing and future research.
Methodology
In this section we describe our methodology for transistor-level Vth assignment. Our flow to exploit non-criticality (or presence of slack) of timing arcs involves the following steps:
1. Cell-variant creation
2. Library generation
3. Optimization for leakage
Cell-Variant Creation
For each cell, our library contains variants corresponding to all subsets of the set of timing arcs. A gate with n inputs has 2n timing arcs and therefore 2^(2n) variants (including the original cell). Given a set of critical timing arcs, our goal is to assign nominal Vth to some transistors in the cell and low Vth to the remaining transistors to meet two criteria: (1) critical timing arcs have a delay penalty of under 1% with respect to the cell in which all transistors are assigned low Vth, and (2) cell leakage power is minimized. TLVA in a cell, given a set of critical timing arcs, can be done for simple cells by analyzing the cell topology. However, we automate the process of Vth assignment to transistors in a cell in the following manner. For each cell, we enumerate all configurations in which low Vth is assigned to some transistors and nominal Vth to the others. For each configuration we find the delay and leakage under a canonical load of an inverter (INVX1) using SPICE simulations. Then, for each possible subset of timing arcs that can be simultaneously critical, one Vth assignment configuration is chosen based on the two criteria above. We note that approaches proposed in [8, 21] allow freedom to assign Vth to individual transistors in a cell separately and hence may require up to 2^T cell variants of each cell, where T is the number of transistors in the cell. Since even one- or two-input cells can have many transistors, the number of variants required by previous approaches can be much larger than the number of variants created with our approach. Figure 1 shows Vth assignment to transistors of the simplest NAND cell (NAND2X1) when only the rise and fall timing arcs from input A to the output are critical.
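The enumerate-then-select procedure described above can be illustrated with a minimal Python sketch. It is not the characterization flow used in this work: `toy_characterize` is a hypothetical stand-in for the SPICE simulations, and the transistor names, delay values and leakage values are invented purely for illustration of a two-input NAND (parallel PMOS, series NMOS).

```python
from itertools import product

# Hypothetical NAND2: PMOS PA/PB in parallel, NMOS NA/NB in series.
TRANSISTORS = ("PA", "PB", "NA", "NB")

def toy_characterize(vth):
    """Hypothetical stand-in for SPICE: return (arc delays, leakage)."""
    assign = dict(zip(TRANSISTORS, vth))
    delay = {}
    for pin in ("A", "B"):
        # Assumed model: a rise arc is slowed only by its own PMOS.
        delay[(pin, "rise")] = 1.0 if assign["P" + pin] == "low" else 1.2
        # Assumed model: fall arcs are slowed by any nominal-Vth NMOS in the stack.
        slow_nmos = sum(assign[t] == "nom" for t in ("NA", "NB"))
        delay[(pin, "fall")] = 1.0 + 0.1 * slow_nmos
    # Assumed leakage: low-Vth devices leak ten times more than nominal ones.
    leakage = sum(10.0 if v == "low" else 1.0 for v in vth)
    return delay, leakage

def choose_variant(critical_arcs, max_penalty=0.01):
    """Pick the minimum-leakage configuration whose critical arcs stay
    within max_penalty of the all-low-Vth delays."""
    base_delay, _ = toy_characterize(("low",) * len(TRANSISTORS))
    best = None
    for vth in product(("low", "nom"), repeat=len(TRANSISTORS)):
        delay, leakage = toy_characterize(vth)
        meets_timing = all(
            delay[arc] <= base_delay[arc] * (1 + max_penalty)
            for arc in critical_arcs
        )
        if meets_timing and (best is None or leakage < best[1]):
            best = (dict(zip(TRANSISTORS, vth)), leakage)
    return best

variant, leakage = choose_variant([("A", "rise"), ("A", "fall")])
```

Under these assumed models, making only the arcs from input A critical leaves PB as the single transistor that can take nominal Vth; with real SPICE data the chosen configuration may differ.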
Library Generation
We generate a restricted library comprising variants of the 25 most frequently used cells in our test cases. To identify the most frequently used cells, we synthesize our test cases with the complete TSMC 130nm library and pick the 25 most frequently used cells. Variants for each cell are created as described in the previous subsection by threshold voltage assignment to transistors in the cell. We use Cadence SignalStorm and Synopsys HSPICE for library characterization (delays and power). We use TSMC 130nm SPICE netlists and IBM 130nm SPICE device models for library characterization. We do not assume or optimize the nominal and low threshold voltages but use the voltages specified with the IBM 130nm SPICE device models. Our implementation does not require all 2^(2n) variants to be present in the library; any variants that cannot be manufactured due to process restrictions may be omitted from the library.
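To make the library-size tradeoff concrete, the following Python sketch tabulates the upper bounds on variants per cell under the three schemes discussed in this paper. It assumes two threshold voltages and unate inputs (2n timing arcs); the function name and the transistor counts for the example cells (textbook static CMOS topologies) are assumptions for illustration, not values taken from the library used here.

```python
def variant_counts(n_inputs, n_transistors):
    """Upper bounds on dual-Vth variants per cell (unate inputs assumed)."""
    return {
        "cell-level": 2,                              # all-low or all-nominal
        "timing-arc-based": 2 ** (2 * n_inputs),      # one per subset of 2n arcs
        "per-transistor": 2 ** n_transistors,         # one per transistor subset
    }

inverter = variant_counts(n_inputs=1, n_transistors=2)
buffer_ = variant_counts(n_inputs=1, n_transistors=4)   # two cascaded inverters
nand2 = variant_counts(n_inputs=2, n_transistors=4)
```

For the simplest cells the two transistor-level bounds coincide; the gap opens for cells such as buffers whose transistor count grows faster than their input count.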
Optimization for Leakage
Our sensitivity-based downsizing approach begins with a circuit in which all transistors are assigned low Vth. [8] concludes that transistor sizing should be performed separately and before Vth assignment. Also, several works [19, 21] perform transistor sizing and Vth assignment separately. We use Synopsys Design Compiler v2003.06-SP1 for transistor sizing prior to Vth assignment. Since delay is almost always the primary design goal, we use the smallest-delay sizing solution for Vth assignment.
Timing Analyzer Implementation
A timing analyzer is an essential component of any delay-aware power optimization approach; it is used to compute delay sensitivity to Vth assignment. For an accurate yet scalable implementation, we use three types of timers that vary in speed and accuracy.
• Standard static timing analysis (SSTA). Slews and actual arrival times (AATs) are propagated forward after a topological ordering of the circuit. Required arrival times (RATs) are back-propagated and slacks are then computed. Slew, delay and slack values of our timer match exactly with Synopsys PrimeTime vU-2003.03-SP2, and our timer can handle unate and non-unate cells (footnote 6).
• Exact incremental STA (EISTA). We begin with the fan-in nodes of the node that has been modified. From all these nodes, slews and AATs are propagated in the forward direction until the values stop changing. RATs are propagated in the backward direction from only those nodes for which the slew, AAT or RAT has changed. Slews, delays and slacks match exactly with SSTA.
• Constrained incremental STA (CISTA). Sensitivity computation involves temporary modifications to a cell to find changes in its slack and leakage. To make this step faster, we restrict the incremental timing calculation to only one stage before and after the gate being modified. The next stage is affected by drive strength change and the previous stage is affected by pin capacitance change of the modified gate. The ripple effect on other stages farther away from the gate (primarily due to slew changes; see footnote 7) is neglected since high accuracy is not critical for sensitivity computation.
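The forward and backward propagation that underlies all three timers can be sketched in a few lines of Python. This is a deliberately simplified illustration and not our timer: it assumes one fixed delay per gate, ignores slew and separate rise/fall arcs, and the function and variable names are hypothetical.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

def static_timing(fanin, delay, required):
    """fanin: node -> list of predecessor nodes
    delay: node -> gate delay (assumed independent of slew and load)
    required: primary output -> required arrival time
    Returns the slack of every node."""
    order = list(TopologicalSorter(fanin).static_order())

    # Forward pass: actual arrival time is own delay plus latest fan-in arrival.
    aat = {}
    for node in order:
        latest_input = max((aat[p] for p in fanin.get(node, [])), default=0.0)
        aat[node] = latest_input + delay[node]

    # Backward pass: required arrival time is the tightest fan-out requirement.
    rat = {node: required.get(node, float("inf")) for node in order}
    for node in reversed(order):
        for pred in fanin.get(node, []):
            rat[pred] = min(rat[pred], rat[node] - delay[node])

    return {node: rat[node] - aat[node] for node in order}

fanin = {"g1": ["a"], "g2": ["a", "b"], "out": ["g1", "g2"]}
delay = {"a": 0.0, "b": 0.0, "g1": 1.0, "g2": 2.0, "out": 1.0}
slack = static_timing(fanin, delay, required={"out": 4.0})
```

An incremental timer would rerun the same two passes starting only from the modified node's neighborhood; the constrained variant would additionally stop after one stage in each direction.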
We use the phrase downsizing a timing arc to mean substitution of a cell instance with a variant that has Vth assignment such that the timing arc is slowed down. In our terminology, $s_p^i$ represents the slack on the $i$th timing arc of cell instance $p$ and $s_p'^i$ represents the slack on the arc after it is downsized. $\ell_p$ and $\ell_p'$ indicate the leakages of cell instance $p$ before and after downsizing, respectively. $P_p^i$ represents the sensitivity associated with downsizing timing arc $i$ on cell instance $p$ and is defined as:
The pseudocode of our leakage optimization implementation is given in Figure 2. The algorithm begins with SSTA and initializes slack values $s_p^i$ for all timing arcs of all cells in Line 1. Sensitivities $P_p^i$ are computed for all cell instances $p$ and all of their timing arcs $i$, and put into the set $L$ in Lines 3-5. In Lines 7-8, we select the highest sensitivity $P_{p^*}^{i}$, corresponding to the $i$th timing arc of cell instance $p^*$, and remove it from the set $L$. If the highest sensitivity is negative, we exit the loop in Lines 9-10. In Line 11, the function SaveState saves the Vth assignment of all transistors in the circuit and the delay, slew and slack values for all timing arcs. The timing arc $i$ of cell instance $p^*$ is downsized and EISTA is run from $p^*$ to update the delay, slew and slack values in Lines 12-13. In Line 14, we check for a timing violation (negative slack on any timing arc) due to downsizing, and if timing is violated we restore the state in Line 15. If, however, there is no timing violation, then this move is accepted and sensitivities of node $p^*$, its fan-in nodes, and its fan-out nodes are updated in Lines 17-21. The loop continues until sensitivities become negative or $L$ becomes empty.
Function ComputeSensitivity(q, i) temporarily downsizes the $i$th timing arc of cell instance $q$ and finds its slack using CISTA. Since high accuracy is not critical for sensitivity computation, we choose to use CISTA, which is faster but less accurate than EISTA. Table 2 shows a comparison of leakage and runtime when EISTA and CISTA are used for sensitivity computation.
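The greedy loop described above (select the highest-sensitivity move, save state, apply it, check timing, and roll back on a violation) can be sketched in Python as follows. Several simplifications are assumptions made only for illustration: each cell has a single downsizing move, timing is a plain sum of cell delays along explicitly listed paths, all sensitivities are recomputed every iteration rather than updated only for neighbors, and the sensitivity is taken to be the leakage saving multiplied by the slack after the move, so that a negative value signals a move that would violate timing. That sensitivity definition and all names below are hypothetical.

```python
def path_delays(paths, delay):
    """Delay of each path as the sum of its cell delays (toy timing model)."""
    return [sum(delay[c] for c in path) for path in paths]

def slack_of(cell, paths, delay, t_max):
    """Worst slack over all paths passing through the given cell."""
    return min(t_max - d
               for path, d in zip(paths, path_delays(paths, delay))
               if cell in path)

def sensitivity(cell, moves, paths, delay, t_max):
    """ASSUMED definition: leakage saving times slack after the trial move."""
    extra_delay, saving = moves[cell]
    trial = dict(delay)
    trial[cell] += extra_delay
    return saving * slack_of(cell, paths, trial, t_max)

def downsize(paths, delay, moves, t_max):
    delay = dict(delay)
    pending = set(moves)
    accepted = []
    while pending:
        scores = {c: sensitivity(c, moves, paths, delay, t_max) for c in pending}
        best = max(sorted(pending), key=scores.get)
        pending.remove(best)
        if scores[best] < 0:              # no remaining move is worthwhile
            break
        saved = dict(delay)               # save state before the move
        delay[best] += moves[best][0]     # downsize the selected cell
        if min(t_max - d for d in path_delays(paths, delay)) < 0:
            delay = saved                 # timing violated: restore state
        else:
            accepted.append(best)         # move accepted
    return accepted, delay

paths = [["g1", "g2"], ["g1", "g3"]]
delay = {"g1": 1.0, "g2": 2.0, "g3": 1.0}
moves = {"g1": (0.5, 4.0), "g2": (0.5, 3.0), "g3": (0.5, 2.0)}  # (extra delay, leakage saved)
accepted, final_delay = downsize(paths, delay, moves, t_max=3.75)
```

Because this toy computes exact slack when scoring moves, the rollback branch never fires here; it is kept to mirror the structure described above, where sensitivities come from an approximate timer and an accepted move can still turn out to violate timing.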
Experiments and Results
In this section we describe the experimental flows for validation of our TLVA approach, and then present experimental results. We use ISCAS'85 combinational and ISCAS'89 sequential test cases (footnote 8). The restricted library contains variants of the following 25 cells: CLKINVX1, INVX12, INVX1, INVX3, INVX4, INVX8, INVXL, MXI2X1, MXI2X4, NAND2BX4, NAND2X1, NAND2X2, NAND2X4, NAND2X6, NAND2X8, NAND2XL, NOR2X1, NOR2X2, NOR2X4, NOR2X6, NOR2X8, OAI21X4, XNOR2X1, XNOR2X4, and XOR2X4. The delay constraint is kept very tight so that the post-synthesis delay is very close to the minimum achievable delay. STMicroelectronics 130nm device models are [...] .06-SP1 is used to measure circuit delay, dynamic power and leakage power. We assume an activity factor of 0.02 for dynamic power calculation in all our experiments.

Footnote 6: [...] time checks, false paths, multiple clocks, 3-pin SDFs, etc.
Footnote 7: There may be some impact due to coupling induced delay also, as the arrival time windows can change; we ignore this effect.
Footnote 8: To handle sequential test cases, we convert them to combinational circuits by treating all flip-flops as primary inputs and primary outputs.

Power Reduction
Table 3 shows the reduction in leakage, dynamic and total power achieved by our sensitivity-based downsizing (SBD) algorithm for TLVA. A 62%-89% reduction in leakage and a 23%-63% reduction in total power are achieved in comparison to when all transistors in the circuit are assigned low Vth. Dynamic power decreases because nominal-Vth assignment reduces gate capacitance. We allow no delay penalty, i.e., circuit delay (as reported by Synopsys PrimeTime) does not increase after SBD. We observe larger leakage reductions in sequential circuits; this is because circuit delay is determined by the slowest pipeline stage and the percentage of non-critical paths is typically higher in a sequential circuit.

Comparison with Previous Work
We compare the quality of results and runtime of SBD with the latest previous TLVA approach [8] (footnote 9). Our implementation of the SATVA algorithm proposed in [8] sets the pruning factor to allow up to 50 tuples at gate outputs (footnote 10). Reduced pruning improves solution quality but increases runtime and memory requirements. We run experiments on a dual Xeon 1.4GHz computer with 2GB RAM. We use the same library (as described in Subsection 2.2) for both approaches. Table 4 presents the power, runtime and actual delay (as reported by Synopsys PrimeTime) after performing SATVA and SBD. Delay penalty is set to 0%, 5% and 10%.
SBD consistently performs better than SATVA and is significantly faster. Since SATVA ignores the impact of Vth assignment on capacitance, it generates circuits which do not satisfy tight delay constraints. Runtime of SBD increases with circuit size and when there is less slack on nodes. With less slack, the cell-level pass assigns nominal Vth to a smaller number of cells, and the transistor-level pass must then optimize a larger number of remaining low-Vth cells. Since SATVA is an enumeration based technique, its runtime and memory requirements increase with the presence of gates having three or more inputs.
CLVA vs. TLVA
Power reduction achieved using TLVA with respect to CLVA is presented in Table 5. The same algorithm is used for both CLVA and TLVA; however, for CLVA the cell library is constrained to have only the cells in which all transistors have the same Vth.
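Constraining the library for CLVA amounts to a simple filter over the cell variants, as in the following Python sketch. The library representation (variant name mapped to a per-transistor Vth tuple) and the variant names are hypothetical and chosen only for illustration.

```python
def restrict_to_clva(library):
    """Keep only variants in which every transistor has the same Vth."""
    return {name: vth for name, vth in library.items() if len(set(vth)) == 1}

# Hypothetical variants of one cell: per-transistor Vth assignments.
library = {
    "NAND2X1_LLLL": ("low", "low", "low", "low"),
    "NAND2X1_LNLL": ("low", "nom", "low", "low"),
    "NAND2X1_NNNN": ("nom", "nom", "nom", "nom"),
}
clva_library = restrict_to_clva(library)
```

Running the same optimizer on the filtered library and on the full library keeps the comparison between CLVA and TLVA free of algorithmic differences.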
For sequential circuits and when delay penalties are large, we find insignificant additional leakage savings for TLVA. The primary reason is the availability of slack on a larger number of paths; CLVA assigns nominal Vth to most cells, leaving only a few low-Vth cells on which TLVA can do optimization. High-performance circuits are desired to operate at the highest possible frequency, and techniques like retiming and clock skew scheduling are used for balancing pipelines. For such circuits, large leakage savings (15%-30%) can be expected from TLVA with reference to CLVA.

Table 5: Leakage, dynamic and total power reduction from transistor-level Vth assignment (TLVA) with respect to cell-level Vth assignment (CLVA). Delay penalty is set to 0%, 5% and 10% for each of the test cases.
Conclusions
We have presented a sensitivity-based downsizing approach for TLVA. The approach is effective, accurate, scalable, and easily usable with today's design flows. We compare our approach with a recently proposed TLVA approach and find our approach superior in terms of quality of results and runtime.
A comparison between CLVA and TLVA shows a 5%-27% reduction in leakage for tight delay constraints. Even though our approach uses fewer cell variants than previous approaches, the number of cell variants is significantly larger than those required for CLVA. This leads to large cell libraries and increased characterization effort. Therefore, TLVA should be performed for only the most frequently used cells, such as inverters, buffers, NAND and NOR gates. Fortunately, the most frequently used cells typically have one or two inputs, and hence only a small number of variants need to be characterized for any cell. To further reduce library size, only one of the cell variants in which different logically equivalent inputs are fast may be retained, and pin-swapping techniques can be used during leakage optimization.
Our ongoing work includes development of better approaches to reduce cell library size. To increase scalability, we plan to investigate "batched" moves (along the lines of the PB algorithm in [21]) in which several independent transistors are assigned nominal Vth in every iteration. Recent works conclude that sizing and Vth assignment should be performed together [13, 18]. This is in contrast to the previously preferred approach of keeping sizing and Vth assignment separate; we plan to combine our approach with other power-reduction techniques such as sizing and dual-Vdd.
