Dual threshold technique has been proposed to reduce leakage power in low-voltage and low-power circuits by applying a high threshold voltage to some transistors in non-critical paths, while a low threshold is used in critical paths to maintain performance. The mixed-Vth (MVT) static CMOS design technique allows different thresholds within a logic gate, thereby increasing the number of high-threshold transistors compared to the gate-level dual threshold technique. In this paper, a methodology for MVT CMOS circuit design is presented. Different MVT CMOS circuit schemes are considered, and three algorithms are proposed for transistor-level threshold assignment under performance constraints. Results indicate that the MVT CMOS design technique can provide about 20% more leakage reduction than the corresponding gate-level dual threshold technique.
Introduction
The increasing need for low power in portable computing and wireless communication systems is making design communities accept low-voltage CMOS processes [1, 2]. With the lowering of supply voltage, the transistor threshold voltage (Vth) has to be scaled down to meet performance requirements. Unfortunately, such scaling increases the subthreshold leakage current, thereby increasing leakage power.
Multiple-Vth design techniques can be used to deal with the leakage problem in low-power and high-performance applications. Multi-threshold-voltage CMOS (MTCMOS) circuit technology was proposed, which inserts high-threshold devices in series with the normal circuitry [3]. This technique is very effective for standby leakage power reduction, but the large inserted MOSFETs increase area and delay. In the dual threshold design technique, a high threshold voltage is assigned to some transistors in non-critical paths to reduce leakage current, while performance is maintained by the low-threshold transistors in the critical paths. Therefore, high performance and low power can be achieved simultaneously. It has been demonstrated that this technique reduces leakage power during both active and standby modes without any delay or area overhead [5]. Recently, a dual-Vth MOSFET process was developed [4], making the implementation of dual-Vth logic circuits more feasible.
However, due to the complexity of the circuits, not all the transistors in non-critical paths can be assigned a higher threshold voltage; otherwise, some non-critical paths may become critical. To achieve the best leakage savings under performance constraints, algorithms for dual threshold assignment were presented in [5, 7]. However, these algorithms only dealt with circuits at the gate level: the transistors within a gate were assumed to have the same threshold voltage.
In mixed-Vth (MVT) CMOS circuits, the transistors within a gate can have different threshold voltages, subject to certain process constraints. Therefore, more transistors can be assigned a high Vth, and a larger leakage current reduction can be achieved. In this paper, different MVT CMOS circuit schemes are introduced and several algorithms for MVT CMOS circuit design are presented. The efficiency of each algorithm is demonstrated by experiments on a 32-bit adder and several ISCAS benchmark circuits.
The paper is organized as follows. In Section 2, necessary definitions are introduced. Different MVT CMOS circuit schemes are proposed in Section 3. Section 4 describes three algorithms for MVT CMOS circuit design. Section 5 presents the implementation details and experimental results. Finally, conclusions are given in Section 6.
Preliminaries
Consider Figure 1, in which the logic gates are marked with circles. Suppose gate G is the one being analyzed. GIi and GOj are the fanin and fanout gates of G, where i ranges from 1 to the number of fanins FI and j ranges from 1 to the number of fanouts FO. In a standard CMOS implementation, each fanin gate GIi connects to a pair of transistors (pi, ni) in gate G. Similarly, each fanout gate contains a pair of transistors (pj, nj) driven by gate G.
Transistor-level static timing analysis
Transistor-level static timing analysis is used in our algorithms. Each transistor has a propagation delay, which can
Transistor delay slack
The slack is the amount by which a gate or a transistor can be slowed down without affecting circuit performance. For a logic gate, the slacks of the p (pull-up) tree and the n (pull-down) tree are denoted by Sp and Sn, respectively.
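As a rough illustration of how slacks follow from a forward pass (departure times) and a backward pass (required times), the sketch below works at the gate level with one fixed delay per gate. It is a simplification under stated assumptions: the function name `compute_slacks`, the dictionary-based netlist, and the single-delay-per-gate model are hypothetical, and the paper's transistor-level equations, which track the pull-up and pull-down trees separately, are not reproduced here.

```python
import math
from collections import defaultdict
from typing import Dict, List


def compute_slacks(
    delay: Dict[str, float],        # hypothetical: one fixed delay per gate
    fanins: Dict[str, List[str]],   # gate -> driving gates (primary inputs omitted)
    outputs: List[str],             # gates that drive primary outputs
    t_req: float,                   # required time at every primary output
) -> Dict[str, float]:
    """Return slack = required time - departure time for every gate."""
    fanouts: Dict[str, List[str]] = defaultdict(list)
    for gate, drivers in fanins.items():
        for d in drivers:
            fanouts[d].append(gate)

    # Levelize with Kahn's algorithm (the netlist is assumed to be acyclic).
    indegree = {g: len(fanins.get(g, [])) for g in delay}
    ready = [g for g, n in indegree.items() if n == 0]
    order: List[str] = []
    while ready:
        g = ready.pop()
        order.append(g)
        for h in fanouts[g]:
            indegree[h] -= 1
            if indegree[h] == 0:
                ready.append(h)

    # Forward pass: primary inputs depart at time 0 (simultaneous arrival).
    departure: Dict[str, float] = {}
    for g in order:
        latest_input = max((departure[f] for f in fanins.get(g, [])), default=0.0)
        departure[g] = latest_input + delay[g]

    # Backward pass: a gate must depart early enough for every fanout.
    required: Dict[str, float] = {}
    for g in reversed(order):
        bounds = [required[h] - delay[h] for h in fanouts[g]]
        if g in outputs:
            bounds.append(t_req)
        required[g] = min(bounds, default=math.inf)  # unobserved gate: unbounded

    return {g: required[g] - departure[g] for g in order}
```

Setting `t_req` to the critical-path delay of the all-low-Vth circuit gives zero slack along the critical path and positive slack elsewhere, which is the slack that high-Vth assignment can consume.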
For a primary output (PO), the slack is determined by the difference between the required time and the departure time.

In the subthreshold leakage current expression, Cox is the gate oxide capacitance per unit area, Weff and Leff are the effective channel width and effective channel length, respectively, μ0 is the zero-bias mobility, and n0 is the subthreshold swing coefficient of the transistor.
For a low-Vth transistor, if its threshold voltage is increased to the high-Vth value, the leakage reduction is proportional to the effective channel width and the mobility. Therefore, we define the leakage reduction measure for transistor i as

$$leak_i = W_{eff,i} \cdot \mu_i \qquad (9)$$

where μi is the normalized mobility of transistor i. For a dual-Vth circuit, the overall measure is

$$M_{leak} = \sum_{i \in \text{high-}V_{th}} leak_i$$

where the summation is taken over all the high-Vth transistors in the dual-Vth circuit. The larger the value of Mleak, the greater the leakage reduction achieved by the dual-Vth circuit compared to the corresponding single low-threshold circuit.
For each transistor, a larger leak_i is preferable since it yields a larger leakage reduction. Now consider the difference td between the high-Vth delay and the low-Vth delay of a transistor. If td is small, a large number of transistors can be assigned the high threshold under performance constraints, leading to greater savings in leakage power. In our analysis, we define the priority of transistor i as

$$priority_i = \frac{leak_i}{t_{d,i}}$$
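The leakage measure and the priority can be computed directly once each transistor's effective width, normalized mobility, and delay difference td are known. The sketch below assumes the ratio form priority_i = leak_i / td_i implied by the text (a larger leak_i and a smaller td both raise the priority); the `Transistor` record, its field names, and the numbers in the check are illustrative, not data from the paper.

```python
from dataclasses import dataclass
from typing import Iterable, List


@dataclass
class Transistor:
    name: str
    w_eff: float    # effective channel width Weff
    mu_norm: float  # normalized mobility (illustrative value supplied by the caller)
    t_d: float      # high-Vth delay minus low-Vth delay (must be positive)


def leak(t: Transistor) -> float:
    """Leakage reduction measure leak_i = Weff_i * mu_i (Eq. 9)."""
    return t.w_eff * t.mu_norm


def priority(t: Transistor) -> float:
    """Assumed ratio form: leakage saved per unit of added delay."""
    if t.t_d <= 0:
        raise ValueError(f"{t.name}: delay difference must be positive")
    return leak(t) / t.t_d


def m_leak(high_vth: Iterable[Transistor]) -> float:
    """Sum of leak_i over the transistors assigned high Vth."""
    return sum(leak(t) for t in high_vth)


def by_priority(ts: List[Transistor]) -> List[Transistor]:
    """Visiting order for priority-driven assignment: largest priority first."""
    return sorted(ts, key=priority, reverse=True)
```

Two transistors with the same leak_i are ordered by their td: the one that slows down less is visited first.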
Clearly, assigning the high threshold to a transistor with a larger priority results in more leakage reduction. In mixed-Vth CMOS circuits, the transistors within a gate can have different threshold voltages, subject to certain process constraints. We consider two types of mixed-Vth CMOS circuit schemes. In the type I scheme (MVT1), there is no mixed Vth within the p (pull-up) or n (pull-down) tree. Figure 4 shows the example circuit in the MVT1 scheme. In the type II scheme (MVT2), mixed Vth is allowed anywhere except among series-connected transistors. The example circuit in the MVT2 scheme is illustrated in Figure 5. Transistors in a stack must have the same threshold voltage because of process considerations. Suppose the transistor thresholds are controlled by channel doping. The channels of the transistors in a stack are too close to each other to achieve distinct channel doping, so it is hard to obtain different thresholds for the transistors in a stack.
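The gate-level dual-Vth (DVT) scheme and the two mixed schemes differ only in which group of transistors must share a threshold: the whole gate (DVT), each tree (MVT1), or each series stack (MVT2). The checker below encodes those rules under a simplifying assumption: each tree is given as a hypothetical list of series stacks, which suffices for gates such as NAND and NOR but does not model arbitrary series-parallel networks. The names `is_legal`, `Stack`, and `Tree` are illustrative.

```python
from typing import Dict, List

Stack = List[str]   # transistors connected in series
Tree = List[Stack]  # hypothetical: a pull-up or pull-down tree as parallel stacks


def is_legal(pull_up: Tree, pull_down: Tree, vth: Dict[str, str], scheme: str) -> bool:
    """Check a per-transistor Vth assignment against DVT, MVT1, or MVT2 rules."""

    def uniform(names: List[str]) -> bool:
        return len({vth[n] for n in names}) <= 1

    up = [n for stack in pull_up for n in stack]
    down = [n for stack in pull_down for n in stack]
    if scheme == "DVT":   # one threshold for the whole gate
        return uniform(up + down)
    if scheme == "MVT1":  # one threshold per tree
        return uniform(up) and uniform(down)
    if scheme == "MVT2":  # one threshold per series stack
        return all(uniform(stack) for stack in pull_up + pull_down)
    raise ValueError(f"unknown scheme: {scheme}")
```

For a 2-input NAND, the pull-up tree is two parallel pMOS stacks `[["p1"], ["p2"]]` and the pull-down tree is one series stack `[["n1", "n2"]]`, so MVT2 may split p1 and p2 but never n1 and n2.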
Obviously, MVT CMOS offers more opportunities for high-Vth assignment than the gate-level dual threshold circuit.

be evaluated using equation 1, and the corresponding delay difference td can be easily calculated. The next step is to assign dual threshold voltages to the transistors under performance constraints. All the transistors in the circuit are initially assumed to have the low threshold voltage. We forward-trace the circuit level by level from the primary inputs to calculate the departure time of each gate using equations 2 and 3. Next, we back-trace the circuit level by level from the primary outputs to explore every gate G. The pull-up tree slack Sp(G), the pull-down tree slack Sn(G), and the slack of each transistor within G can be calculated using equations 4-7. For the gate-level dual-Vth (DVT) scheme, if the td values of all the transistors within G are no larger than their slack values, G is a high-Vth gate.
For the mixed-Vth type I scheme (MVT1), if all the transistors in the pull-up (pull-down) tree of gate G satisfy the requirement that their td values are no larger than their slack values, the pull-up (pull-down) tree can be assigned VtH. Let
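For a single gate whose transistor slacks have already been computed, the DVT and MVT1 decisions reduce to an "every td fits within its slack" test applied to the whole gate or to each tree separately. The function below is a hypothetical sketch of that per-gate test only; in the full back-trace, assigning VtH to a gate also consumes slack in its fanin cone, and that update is not modeled here.

```python
from typing import Dict, Tuple

# transistor name -> (td, slack), both in the same time unit
Timing = Dict[str, Tuple[float, float]]


def fits(tree: Timing) -> bool:
    """True if every transistor's delay increase fits within its slack."""
    return all(t_d <= slack for t_d, slack in tree.values())


def assign_gate(pull_up: Timing, pull_down: Timing, scheme: str) -> Dict[str, str]:
    if scheme == "DVT":
        # Gate level: high Vth only if the whole gate can absorb it.
        level = "VtH" if fits(pull_up) and fits(pull_down) else "VtL"
        return {name: level for name in [*pull_up, *pull_down]}
    if scheme == "MVT1":
        # Type I: the pull-up and pull-down trees are decided independently.
        result: Dict[str, str] = {}
        for tree in (pull_up, pull_down):
            level = "VtH" if fits(tree) else "VtL"
            result.update({name: level for name in tree})
        return result
    raise ValueError(f"unsupported scheme: {scheme}")
```

When one pMOS lacks slack, DVT keeps the entire gate at VtL, whereas MVT1 still moves the pull-down tree to VtH, which is where the extra leakage savings of the mixed scheme come from.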
Priority Selection (PS) Algorithm
The priority selection algorithm is an exhaustive priority-based algorithm. The transistors are visited in order of their priority values, and the transistor slacks are recalculated after each visit. The procedure for the mixed-Vth type II scheme (MVT2) is as follows. The first step is to levelize the circuit and calculate the delay and priority of each transistor. All the transistors are initially assigned VtL and marked unvisited. The second step is to explore all the transistors in the circuit in order of their priority values. For each visit, the departure time of each gate is evaluated by forward-tracing the circuit level by level from the primary inputs, and the slack of each transistor is calculated by back-tracing the circuit level by level from the primary outputs. The unvisited transistor with the maximal priority is then selected. Its threshold voltage is determined by comparing its td with its slack value while taking the series-connected transistors into account. To avoid repeated assignment, this transistor is marked as visited; when it belongs to a series stack, all the transistors in that stack are marked as visited.
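A compact sketch of the priority selection loop is given below. To stay self-contained, it replaces transistor-level static timing analysis with an explicit list of paths whose delay is the sum of their transistors' delays; a real implementation would re-run the forward and backward traces instead. The parameter names, the `stack_of` grouping (every transistor maps to a stack id, singletons included), and the path list are assumptions for illustration. The whole stack is switched together and every path is rechecked after each visit, mirroring the recalculation step that makes PS the slowest of the three algorithms.

```python
from typing import Dict, List


def priority_selection(
    delay_low: Dict[str, float],  # low-Vth delay of each transistor
    t_d: Dict[str, float],        # delay increase when switched to VtH
    priority: Dict[str, float],   # precomputed priority values
    stack_of: Dict[str, str],     # transistor -> series-stack id (hypothetical)
    paths: List[List[str]],       # simplified timing model: explicit paths
    t_max: float,                 # timing constraint for every path
) -> Dict[str, str]:
    vth = {name: "VtL" for name in delay_low}  # start with all low Vth
    visited = set()

    stacks: Dict[str, List[str]] = {}
    for name, sid in stack_of.items():
        stacks.setdefault(sid, []).append(name)

    def path_delay(path: List[str]) -> float:
        return sum(delay_low[x] + (t_d[x] if vth[x] == "VtH" else 0.0) for x in path)

    while len(visited) < len(vth):
        # Select the unvisited transistor with the maximal priority.
        chosen = max((x for x in vth if x not in visited), key=lambda x: priority[x])
        group = stacks[stack_of[chosen]]  # MVT2: series transistors share Vth
        for x in group:
            vth[x] = "VtH"
        # Recheck timing after the tentative change (the "recalculate" step).
        if any(path_delay(p) > t_max for p in paths):
            for x in group:
                vth[x] = "VtL"
        visited.update(group)  # mark the whole stack as visited
    return vth
```

With paths [a, b] and [c, d] of four unit-delay transistors and t_max = 2.5, transistor c alone could absorb its 0.25 delay increase, but its series partner d cannot absorb 1.0, so the MVT2 stack constraint keeps both at VtL.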
For the priority selection (PS) algorithm, the circuit must be updated to recalculate the transistor slack values after each transistor is visited. Therefore, the worst-case run-time ... Hence, the worst-case run-time is O(mn).
Implementation and Results
The three algorithms described in Section 4 have been implemented in C within the Berkeley SIS environment. In this section, results for a number of combinational circuits are presented. In our analysis, the threshold voltage and supply voltage of the original single low-Vth circuits are assumed to be around 0.2 V and 1 V, respectively. The primary inputs are assumed to arrive simultaneously, and the timing constraints for the primary outputs are determined by the critical path delay of the single low-Vth circuit.
Results for a 32-bit Adder
A well-designed 32-bit static CMOS Kogge-Stone adder was investigated based on PathMill static timing analysis. The normalized active leakage power and standby leakage power at different VtH are given in Figure 6 and Figure 7, respectively. The circuit temperature is assumed to be 110 °C for active mode and 25 °C for standby mode. Results show that there is an optimal VtH at which the mixed-Vth design technique provides nearly 20% more leakage power savings than the corresponding gate-level dual threshold technique. The group number m is set to 10 for the priority-based backtracing (PB) algorithm. On an HP workstation, the run-times of the backtracing (BT) algorithm, the PB algorithm, and the priority selection (PS) algorithm are 3.8 s, 4 s, and 18 s, respectively. Results indicate that the PB algorithm gives almost the same leakage savings as the PS algorithm, while its run-time is close to that of the BT algorithm. Vth 32-bit adder at different VtH and different primary input activities. The total power can be reduced by about 9% and 22% at maximum activity and at 0.1 times maximum activity, respectively. Figure 9 shows the path distributions of the 32-bit adder under single high-Vth, single low-Vth, and mixed dual-Vth conditions. The single high-Vth circuit certainly has less leakage power, but its critical delay is 30% larger than that of the single low-Vth circuit. The dual-Vth circuit has the same critical delay as the single low-Vth circuit; however, the delays of the non-critical paths increase because some transistors in those paths are assigned the high threshold voltage.

Using the SIS command "map", circuits are mapped to a library targeting minimal area, where gates with minimal width are preferred. Technology mapping can also be performed with the SIS command "map -n 1 -AFG" to target minimal delay.
In the critical path, gates with larger width are chosen, while gates in the non-critical paths may have smaller width. Obviously, the circuit mapped for delay is more balanced than the circuit mapped for area. Table 1 and Table 2 report the leakage power savings for ISCAS benchmark circuits mapped for area and delay, respectively. The backtracing algorithm is used, and the circuit schemes DVT, MVT1, and MVT2 are compared. More leakage reduction can be achieved for the circuits mapped for area because of their larger slack imbalance. The leakage savings of the MVT2 scheme are larger than those of the MVT1 scheme, and both mixed-Vth schemes provide more leakage savings than the corresponding gate-level dual threshold technique. For some benchmark circuits, the additional leakage savings exceed 20%. Table 3 shows the leakage power savings of the different algorithms: the backtracing (BT), priority selection (PS), and priority-based backtracing (PB) algorithms. The MVT2 scheme is used, and the circuits are mapped targeting minimal delay. CPU times are measured on a SUN UltraSPARC-II. Results indicate that the PS algorithm achieves the largest leakage savings but also takes the most CPU time. The BT algorithm is the fastest, but it gives less leakage savings than the other two algorithms. For the PB algorithm, with the group number m set to 10, the leakage savings are close to those of the PS algorithm and the run-time is similar to that of the BT algorithm.
Summary & Conclusion
In this paper, a mixed-Vth CMOS circuit design technique is presented and different mixed-Vth circuit schemes are introduced. Several algorithms for transistor-level threshold assignment in the mixed-Vth static CMOS design style are proposed. A 32-bit adder was simulated based
