In this paper we address the problem of delay-constrained minimization of leakage power of CMOS digital circuits for dual VT technology. A novel and efficient heuristic algorithm based on circuit graph enumeration is proposed. The experimental results on the MCNC91 benchmark circuits show that up to an order of magnitude power reduction can be achieved without any increase in delay.
Introduction
CMOS has long been considered the technology of choice for low power applications. The continuous shrinking of feature sizes has made it possible to achieve ever greater integration of complex functions on a single chip. This capability has also fueled an explosive growth in the market for high performance portable computing and communications systems. However, the higher chip densities have resulted in a one to two orders of magnitude increase in the power consumption of many high-end processors. The point is rapidly being reached where reduction of power consumption becomes the single most important hurdle that designers and manufacturers must face.
Power consumption in CMOS circuits can be expressed as the sum of the (average) switching power (P_sw), the short-circuit power (P_sc) and the leakage power (P_leak).
P_sw is due to the charging and discharging of load capacitances as logic gates transition between 0 and 1. It is typically expressed as $C_L V_{dd}^2 E(t)$, where C_L is the load capacitance, V_dd is the supply voltage and E(t) is the expected number of times that the gate switches. P_sc is due to the existence of a conducting path between V_dd and ground during the brief period when a gate switches, and P_leak is due to the leakage current caused by the stored charge in the drain junctions leaking away and by devices that conduct while in the off-state (subthreshold conduction). With relatively larger devices, i.e., significantly larger than 1 µm, P_sw is the dominant component. The quadratic dependence of P_sw on V_dd indicates that reducing the supply voltage will have the greatest impact on reducing P_sw. Additionally, C_L is reduced by the reduced dimensions of the devices and by circuit design and layout techniques. Finally, a significant number of results have been reported on techniques to reduce the switching activity (E(t)).
ICCAD98, San Jose, CA, USA. © 1998 ACM 1-58113-008-2/98/0011...$5.00
Scaling down the supply voltage has the most significant impact on the power dissipation. It also avoids hot-carrier effects in short channel devices. However, the threshold voltage VT also has to be scaled down, because otherwise it has a much greater detrimental impact on the delay when small geometry devices are used [4]. Thus scaling VT by the same factor as V_dd is needed so as not to adversely impact delay. However, reducing VT in small geometry MOSFETs results in an exponential increase in the stand-by current [1]. Simulation results given in [5] show that the power dissipation due to the standby current dominates the switching power at low threshold voltages. The standby current increases as the subthreshold swing increases, and the subthreshold swing increases with increased doping density and reduced gate length, both of which happen for small geometry devices.
It is now clear that optimal design of CMOS circuits that employ submicron devices requiring operation at low voltages involves a number of complex tradeoffs, involving device dimensions, the supply voltage, and the threshold voltage. One relatively recent development is the use of multiple threshold voltage CMOS (MTCMOS) [10], which is relatively easy to implement. If only two threshold voltages are considered (dual threshold CMOS, DTCMOS), then the threshold voltage of an appropriate subset of the devices can be assigned the higher threshold voltage by including an extra implant step [10]. In [9], an approach to simultaneously optimize the supply voltage, threshold voltage and transistor sizes assuming MTCMOS is presented. Their approach is based on first assigning delays to all the gates without violating the cycle time constraint. Then the optimal supply and threshold voltage and transistor width of each gate are determined so as to minimize power consumption. Although their algorithm has the flexibility of assigning different threshold voltages, the results reported were based on a single threshold voltage. Their results indicate significant reduction in total power dissipation and energy. It must be pointed out that their estimation of the leakage power does not take into account the signal probabilities, which could result in significant errors in the estimates.
In [10], an MTCMOS circuit structure is proposed and analyzed. The circuit consists of a network of transistors that have low VT. The network is connected to ground through a high VT gating transistor that is off during the inactive period and on during the active period. In this way the standby current is reduced.
This approach introduces some complications for circuit design. For example, reverse conduction paths may exist which tend to reduce the noise margins or, in the worst case, result in complete failure of the gate. Additionally, the leakage power dissipated when the system is in the active mode will not be reduced.
Finally, extra chip area is required for the high VT gating transistors and the associated routing of wires.
The availability of two or more threshold voltages on the same chip provides a new opportunity for circuit designers to make tradeoffs between power and delay. In this paper the problem of optimal assignment of threshold voltages to transistors in a CMOS logic circuit is defined, and an efficient algorithm for its solution is given.
This problem will be referred to as the Dual VT Selection problem since only two threshold voltages, a low VT and a high VT, are considered. The rest of the paper is organized as follows. Section 2 contains a summary of the power and delay models recently investigated by other researchers. In Section 3, a formal statement of the Dual VT Selection problem is given. A new algorithm to solve this problem is described in Section 4. Some implementation issues are discussed in Section 5. The effectiveness of the proposed algorithm is examined by carrying out extensive experiments on MCNC91 benchmark circuits. The results of these experiments are given in Section 6. Finally, conclusions and directions for future work are discussed in Section 7.
Preliminaries
In this section, background material on the models used for estimating power dissipation for short channel MOSFETs and the models used to compute delay is presented.
Leakage Power Model of MOSFETs
For a single NMOS (PMOS) device, the Berkeley Short-Channel IGFET model (BSIM) [1] is used to estimate the leakage power dissipation. In the BSIM model, the threshold voltage is expressed as:
(1)
where V_FB is the flatband voltage, φ_S is two times the Fermi potential, V_BS is the substrate reverse bias, K_1 is the body-effect factor, and K_2 and η model the threshold lowering effects of short channel MOSFETs. The leakage current for an NMOS transistor working in the weak inversion region, i.e. V_gs = 0, is given by

where V_t is the thermal voltage (≈ 25 mV at room temperature), n is the subthreshold slope coefficient, and $I_0 = \mu_0 C_{ox} (W/L) V_t^2 e^{1.8}$.
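To make the dependence of leakage on the threshold voltage concrete, the following is a minimal sketch. It assumes the commonly used weak-inversion expression I = I_0 exp(-V_T / (n V_t)) (1 - exp(-V_ds / V_t)) at V_gs = 0; that closed form, the function name `subthreshold_leakage`, and the values chosen for `i0`, `n` and `v_ds` are assumptions made for illustration, not taken from the paper. The two thresholds (0.25 V and 0.7 V) are the typical dual VT values quoted in the experimental section.

```python
import math

def subthreshold_leakage(v_th, i0=1.0, n=1.5, v_thermal=0.025, v_ds=1.0):
    """Leakage current at V_gs = 0 (weak inversion), in units of i0.

    Assumed form: I = I0 * exp(-V_T / (n * V_t)) * (1 - exp(-V_ds / V_t)).
    i0, n and v_ds are illustrative placeholders, not values from the paper.
    """
    drain_term = 1.0 - math.exp(-v_ds / v_thermal)
    return i0 * math.exp(-v_th / (n * v_thermal)) * drain_term

leak_low_vt = subthreshold_leakage(0.25)   # low threshold voltage
leak_high_vt = subthreshold_leakage(0.70)  # high threshold voltage

# The ratio depends only on the threshold difference: exp(0.45 / (n * V_t)).
ratio = leak_low_vt / leak_high_vt
print(f"low-VT / high-VT leakage ratio: {ratio:.3g}")
```

Under these assumptions the ratio is exp(0.45 / (n V_t)), so the exact figure is very sensitive to the assumed slope coefficient n; the sketch only illustrates why raising VT on non-critical devices is attractive.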
The leakage current formula for a PMOS device is similar. Equation (3) gives a simple formula for the leakage current for a single NMOS device. In CMOS logic gates consisting of series-parallel networks of PMOS and NMOS devices, the leakage current through devices in parallel can be taken to be the sum of the individual leakage currents. However, the leakage current through a series of MOSFETs requires careful analysis of different combinations of the on and off devices. In [5], simple analytical formulas for the leakage current through a stack of one, two and three MOSFETs are given. In addition, the leakage current for stacked NMOS devices is related to the single NMOS leakage current as follows:

where I_si (i = 1, 2, 3) is the leakage current for i stacked MOSFETs. I_s1 is given by (3). Equation (4) shows that the leakage power of a CMOS gate depends on the state of the inputs and the threshold value of the corresponding transistor.

With this in mind, consider the 2-input CMOS NAND gate shown in Figure 1. For simplicity the PMOS and NMOS transistors driven by the same input are assumed to have identical threshold voltages, although different NMOS transistors can have different threshold voltages. The leakage power of the gate under different input combinations is summarized in Table 1. I_A,n and I_B,n are the leakage currents for the single NMOS device of input A and B respectively. These may be different due to their different threshold voltages. I_A,p is the leakage current of the single PMOS device of input A. Its value can also be different from I_A,n since the two transistors may have different sizes. I_AB,n is the leakage current when both NMOS devices A and B are off and the output C is high. Although A and B may have different threshold voltages, for simplicity, I_AB,n is taken to be the smaller of I_A,n and I_B,n. This is a conservative approximation but will not lead to significant errors, since from (4) the leakage power of two series connected MOSFETs is much less than that of a single MOSFET. All these quantities can be obtained directly from (4).
The overall average leakage power dissipation can be expressed as follows:
where p(·) are the signal probabilities for the different input combinations. To accurately estimate the leakage power, the exact probabilities for each combination have to be found. This may be achieved using BDDs. However, in most practical cases, the signal probabilities at the gate inputs and outputs are obtained by either local probability propagation or by logic simulation.
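The probability-weighted average can be sketched for the 2-input NAND gate as follows. This is an illustration under stated assumptions: the per-state leakage table in `nand2_leakage_current` is the usual one for a NAND2 and is not copied from the paper's Table 1, the inputs are treated as statistically independent (the text notes that exact probabilities would need BDDs), and all names and current values are hypothetical.

```python
from itertools import product

def nand2_leakage_current(a, b, i_an, i_bn, i_ap, i_bp, i_abn):
    """Leakage current of a 2-input NAND gate for input state (a, b).

    Assumed state table (illustrative, not the paper's Table 1):
      00 -> both series NMOS off (stacked leakage i_abn)
      01 -> only NMOS A off; 10 -> only NMOS B off
      11 -> output low, both parallel PMOS off (currents add)
    """
    if a == 0 and b == 0:
        return i_abn
    if a == 0 and b == 1:
        return i_an
    if a == 1 and b == 0:
        return i_bn
    return i_ap + i_bp

def average_leakage_power(vdd, p_a, p_b, currents):
    """Sum over input states of P(state) * Vdd * I_leak(state).

    p_a and p_b are the probabilities that inputs A and B are 1; the
    inputs are assumed independent, which is a simplification.
    """
    total = 0.0
    for a, b in product((0, 1), repeat=2):
        prob = (p_a if a else 1.0 - p_a) * (p_b if b else 1.0 - p_b)
        total += prob * vdd * nand2_leakage_current(a, b, **currents)
    return total

# Hypothetical currents in arbitrary units.
example_currents = dict(i_an=1.0, i_bn=2.0, i_ap=0.5, i_bp=0.5, i_abn=0.1)
avg_power = average_leakage_power(1.0, 0.5, 0.5, example_currents)
```

With equiprobable inputs every state has weight 0.25, so the result is simply the mean of the four per-state powers.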
Delay Model

Gate Delay Model
An accurate and computationally efficient model for a short channel MOSFET is described in [4]. The model, called the nth power law, is an extension of the alpha-power law model [3], but is much more accurate. The nth power law model has been shown to accurately represent the I-V characteristics of short channel MOSFETs down to 0.25-µm channel length. The CMOS inverter propagation delay and output transition delay formulas derived from the MOSFET model predict the circuit behavior for modern submicrometer designs very well. For CMOS gate delays, it was found that N series-connected MOSFETs (SCMS) would show less than N times the delay compared to a single MOSFET for submicrometer designs [3]. That is,
where ζ is a technology dependent parameter and 0 < ζ < 1 for most current submicrometer technologies. To compute the circuit delay, standard static timing analysis is used. For each gate n of a circuit we define three values: AT(n), RT(n) and S(n), which are the arrival time, required time and slack for the gate n. The arrival time AT(n) is the worst delay from the primary inputs to gate n. Given the arrival time at primary inputs, the arrival time of gate n is obtained by

$$AT(n) = \max_{i \,\in\, \text{fanins of } n} \{ AT(i) + d_i(n) \}$$

where d_i(n) is the pin-to-pin delay from the input i of gate n to the output of gate n. This quantity is computed using the nth power law model. Note that the interconnect delay is not considered in the delay computation. The required time is the latest time the signal has to arrive at the output of gate n. Given the required time at each primary output, the required time at the output of gate n is obtained by

$$RT(n) = \min_{j \,\in\, \text{fanouts of } n} \{ RT(j) - d_n(j) \}$$

where d_n(j) is the pin-to-pin delay from the input of gate j that is fed from gate n to the output of gate j. This delay is also computed using the nth power law model. The slack is defined as S(n) = RT(n) - AT(n). The set of gates that has the minimal slack value constitutes the critical path of the circuit. If no gate in the circuit has a negative slack then the timing constraints are satisfied [11].
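The arrival time, required time and slack computation described here can be sketched as one forward and one backward pass over a topologically ordered circuit. This is a minimal illustration, not the paper's implementation: the function name, the data layout (`fanins`, `delay` keyed by edge) and the example circuit with its delays are all hypothetical, and a single delay per pin pair is used (no separate rise and fall delays).

```python
from collections import defaultdict

def static_timing(order, fanins, delay, pi_arrival, po_required):
    """Compute arrival time, required time and slack for every node.

    order:       nodes in topological order (primary inputs first)
    fanins[n]:   list of fanin nodes of n (missing or empty for primary inputs)
    delay[(i, n)]: pin-to-pin delay from input i of gate n to its output
    """
    fanouts = defaultdict(list)
    for n in order:
        for i in fanins.get(n, []):
            fanouts[i].append(n)

    at = {}
    for n in order:  # forward pass: worst-case arrival
        ins = fanins.get(n, [])
        if not ins:
            at[n] = pi_arrival.get(n, 0.0)
        else:
            at[n] = max(at[i] + delay[(i, n)] for i in ins)

    rt = {}
    for n in reversed(order):  # backward pass: latest allowed arrival
        if not fanouts[n]:
            rt[n] = po_required[n]
        else:
            rt[n] = min(rt[j] - delay[(n, j)] for j in fanouts[n])

    slack = {n: rt[n] - at[n] for n in order}
    return at, rt, slack

# Hypothetical circuit: inputs a, b; gates g1, g2; output gate "out".
order = ["a", "b", "g1", "g2", "out"]
fanins = {"g1": ["a", "b"], "g2": ["b"], "out": ["g1", "g2"]}
delay = {("a", "g1"): 2, ("b", "g1"): 1, ("b", "g2"): 1,
         ("g1", "out"): 3, ("g2", "out"): 1}
# The required time at the output is set to the worst-case delay (5).
at, rt, slack = static_timing(order, fanins, delay, {}, {"out": 5})
critical = [n for n in order if slack[n] == min(slack.values())]
```

In this example the zero-slack nodes a, g1 and out form the critical path, while g2 has slack 3 and is therefore a natural candidate for a slower, lower-leakage device.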
Problem Definition
A combinational circuit is represented by a directed acyclic graph (DAG) G = (V, E). Each node and edge in G corresponds to a gate and a connection in the circuit respectively. The general form of the dual-VT optimization problem is to assign one of two threshold voltages, V_T,high and V_T,low, to each transistor such that some cost function is optimized subject to constraints. Since the PMOS and NMOS transistors that are connected to the same signal have the same threshold voltage, the different threshold voltages of the transistors are represented by labeling each connection e_ij ∈ E by x_ij, where x_ij = 0 (x_ij = 1) means that the PMOS and NMOS transistors that are driven by edge e_ij have V_T = V_T,high (V_T = V_T,low).
The dual-VT selection problem can be viewed in one of two ways: either delay can be optimized subject to constraints on power, or vice versa. The algorithm presented here attempts to reduce the standby power subject to the constraint of not increasing the delay. Thus, the procedure starts with a combinational circuit where all the devices are assumed to have their threshold voltage set to V_T,low and selects a subset of devices whose threshold voltage will be changed to V_T,high, without increasing the delay. This is formally expressed as follows: Given a combinational circuit, represented as a DAG G = (V, E), and with all the devices having their threshold voltage set to V_T,low, select the edges whose threshold voltage is changed to V_T,high so that the total leakage power is minimized while no gate has a negative slack. Here the required time for each primary output is defined to be the worst case delay of the original circuit where all the devices have their threshold voltage set to V_T,low. Thus, the initial circuit is the fastest implementation with all the other parameters being fixed. This is a constrained 0-1 programming problem with non-linear constraint functions. In the following section an efficient heuristic procedure to solve this problem is described. The effectiveness of the algorithm is demonstrated through experiments on the MCNC91 benchmark circuits. Note: The proofs of the lemmas are simple and are omitted here in the interest of brevity.
The Algorithm
Definition 1 Let e_ij be an edge of G = (V, E). e_ij is said to be feasible if changing the threshold voltage of e_ij from V_T,low to V_T,high does not result in making the slack of gates i and j negative.
Given the delay information of the gates, the feasibility of an edge is determined as follows. An edge e_ij is feasible if its threshold voltage is V_T,low and δ_f < S(j) and δ_b < S(i), where δ_f = AT(i)
is the pin-to-pin delay when the threshold voltage of edge e_ij is V_T,high. Since non-feasible connections are guaranteed not to be included in any selection, only feasible connections need to be considered. (Note that, for simplicity, although the output rise and fall delays have not been differentiated so far in the presentation, they are accounted for in the implementation.)

A solution to the dual-VT problem is to identify the largest subset S_H ⊆ E (S_L = E - S_H), such that changing the threshold voltage of the edges in S_H to V_T,high will not violate the delay constraint. The heuristic procedure to be described consists of two steps. First, instead of finding the largest feasible subset of edges, a maximal feasible subset is determined. That is, a subset that has the property that if another edge is added to it, it is no longer feasible, i.e., it violates the delay constraint. A maximal feasible subset is a locally optimal solution. To escape from this with the intent of finding a possibly better one, a second step, called swapping, is carried out. It swaps elements between S_L and S_H to increase the weight of the maximal set. The weight of a set of edges is the total power reduction when the threshold voltage of all edges in the set is changed from V_T,low to V_T,high.
Construction of an Initial Solution
Definition 3 A cut C of a directed acyclic graph G = (V, E) is a partition of the nodes of V into two disjoint sets, i.e. C = (S, S̄).
The forward cut edges are those edges e_ij ∈ E such that i ∈ S and j ∈ S̄. Similarly, the backward cut edges are all those edges e_ij ∈ E such that i ∈ S̄ and j ∈ S. A forward cut C_f of G is a cut where all cut edges are forward cut edges.

Lemma 1 Let C_f be a forward cut of a DAG G = (V, E). Let e_1 and e_2 be any two edges in the set of forward cut edges. Then e_1 is neither in the transitive fanin cone nor the transitive fanout cone of e_2.
Lemma 2 Given a forward cut of the circuit graph G, changing the threshold voltage of all the forward cut edges that are feasible from V_T,low to V_T,high will not increase the circuit delay.

In the rest of the paper, a forward cut will be referred to simply as a cut. A simple algorithm to find a good initial solution can be obtained by iteratively finding the maximum weighted cut of the circuit graph until the weight of the cut becomes zero. Note that after changing the threshold voltages of all edges in a cut to V_T,high, the timing information of the circuit has to be updated and the edge weights have to be re-evaluated, since their feasibility may have changed. The problem of finding a maximum weighted cut of a graph is NP-complete. A heuristic employed here is to define a special type of cut which can be easily identified and where the total number of such cuts is sufficiently small that they can be enumerated. One such class of cuts is based on the topological level of the gates.
Given a combinational circuit, the level of a primary input is zero, and the level of a gate is one more than the maximum of the levels of all its fanin gates.
Definition 4 Given a circuit graph G = (V, E) corresponding to a levelized combinational circuit, the level k partition of G is a partition of V into (S, S̄) such that (1) ∀ i ∈ S, level(i) ≤ k; (2) ∀ j ∈ S̄, level(j) > k; and (3) 0 ≤ k < max(level(n)), ∀ n ∈ V.
Clearly, the level k partition is a forward cut. The procedure outlined in Figure 2 finds an initial solution by iteratively finding the maximum level-k cut of the current weighted circuit graph until no other cut with a non-zero weight can be found. Figure 3 shows an example of how the procedure works. Assume that the circuit graph shown in Figure 3(a) is the current state.
The circuit has three level cuts, and steps 6 through 9 of the procedure initSolution will result in a level i - 1 cut with the maximum weight of 21. Recall that an edge weight of zero means that the edge is either non-feasible or its threshold voltage is already V_T,high. Therefore all the edges in the level i - 1 cut with positive weight (three edges, among them e_bd and e_be) will be included in the initial solution and their threshold voltages will be changed to V_T,high (steps 12 and 13). At this point a static timing analysis is performed and the weights of the edges are updated (steps 14, 15). For example, after changing the edges in the level i - 1 cut to V_T,high, the weight of the edge e_gk changes from 14 to 0. Thus edge e_gk is now unfeasible. The new weighted circuit graph is shown in Figure 3(b). The loop containing steps 3 through 16 will be repeated on the new circuit graph until the weights of all edges become zero.
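One selection step of the level-cut enumeration can be sketched as follows: levelize the circuit, enumerate every level-k cut, and keep the heaviest one. This is only an illustration of a single pass; the full procedure described above repeats it after re-running timing analysis and updating the weights. The function names, the data layout and the example weights are hypothetical.

```python
def levelize(order, fanins):
    """Level 0 for primary inputs, else 1 + max level of the fanins."""
    level = {}
    for n in order:  # order must be topological
        ins = fanins.get(n, [])
        level[n] = 0 if not ins else 1 + max(level[i] for i in ins)
    return level

def best_level_cut(order, fanins, weight):
    """Return (k, total weight, positive-weight edges) of the heaviest level-k cut.

    weight[(i, j)] is the power saving of raising edge (i, j) to high VT;
    a weight of 0 marks an edge that is infeasible or already high VT.
    """
    level = levelize(order, fanins)
    best_k, best_w, best_edges = None, 0, []
    for k in range(max(level.values())):  # k = 0 .. max_level - 1
        # An edge crosses cut k if its source is at level <= k and its sink above k.
        cut = [(i, j) for j in order for i in fanins.get(j, [])
               if level[i] <= k < level[j]]
        total = sum(weight.get(e, 0) for e in cut)
        if total > best_w:
            best_k, best_w = k, total
            best_edges = [e for e in cut if weight.get(e, 0) > 0]
    return best_k, best_w, best_edges

# Hypothetical circuit and weights (edges on the critical path get weight 0).
order = ["a", "b", "g1", "g2", "out"]
fanins = {"g1": ["a", "b"], "g2": ["b"], "out": ["g1", "g2"]}
weight = {("a", "g1"): 0, ("b", "g1"): 4, ("b", "g2"): 6,
          ("g1", "out"): 0, ("g2", "out"): 9}
k, cut_weight, chosen = best_level_cut(order, fanins, weight)
```

Because the number of level cuts is bounded by the circuit depth, the enumeration is linear in depth times edge count, which is what makes this restricted class of cuts tractable where the general maximum weighted cut is not.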
Improving the Initial Maximal Set by Swapping
The result of procedure initSolution (Figure 2) is a partition of the edges of the circuit graph into two disjoint sets, S_H and S_L, where S_H is the set of edges whose threshold voltages will be V_T,high and S_L is the set of edges whose threshold voltages will be V_T,low.

Lemma 3 states that the set of edges with V_T,high is maximal in the sense that no more edges can be added to S_H (from S_L) without increasing the delay. Since the objective function is non-negative, this set is a locally optimal solution. To escape from a locally optimal solution, a swapping procedure is carried out. An outline of this procedure is shown in Figure 4. The basic idea here is to move edges from S_H with a total weight that is as small as possible into the set S_L, and to move edges from S_L to S_H whose total weight is as large as possible, without increasing the delay. Note that Corollary 1 states that none of the edges are feasible in the initial solution. If some edge is moved from the set S_H into S_L, there may be some previously unfeasible nets that become feasible. These are potential candidates for being moved from S_L into S_H.

The swapping is performed one edge at a time, with the edge having the smallest weight being moved from S_H to S_L (line 2). This is done so as to minimize the cost of the swap out operation. After setting the threshold voltage of the edge to be swapped out to V_T,low, an incremental timing analysis (refer to Section 5) is performed to identify all edges that have become feasible (line 5). Note that, unlike the situation with a cut, it may not be possible to swap in all the feasible nets because they may not be simultaneously feasible. A conservative approach is to define a gain associated with the edges that might be swapped in. This gain is the maximum weight among all the feasible edges (line 6). If the gain is greater than the cost, the swap in is performed. Note that to guarantee that the delay is not increased, the feasible nets will have to be swapped in one by one, each followed by an incremental timing analysis.
A good heuristic is to swap in the order of decreasing weight, so that the edge with a larger weight (gain) will be swapped in first. This is carried out by the procedure applySwapInNets, the details of which are omitted. The value returned by the procedure applySwapInNets is the set of edges that will be moved into S_H. Finally, if the gain is less than the cost, the swap is not performed.
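A single swap attempt can be sketched on a toy timing model. The assumptions are stated plainly: instead of incremental static timing analysis, each path is given a delay budget and each edge a delay penalty when raised to high VT, so a set of high-VT edges is feasible when no path exceeds its budget. This path-based model, the function names and the example data are hypothetical simplifications, and the swapped-out edge is not reconsidered for re-insertion.

```python
def feasible_edges(high, low, penalty, paths, budget):
    """Edges in `low` that can each be raised to high VT on their own
    without any path exceeding its delay budget (toy timing model)."""
    result = []
    for e in low:
        ok = all(
            sum(penalty[x] for x in p if x in high or x == e) <= budget[idx]
            for idx, p in enumerate(paths) if e in p
        )
        if ok:
            result.append(e)
    return result

def try_swap(high, low, weight, penalty, paths, budget):
    """Swap out the cheapest high-VT edge; swap in newly feasible edges,
    heaviest first, only if the best gain exceeds the swap-out cost."""
    if not high:
        return set(high), set(low)
    out = min(high, key=lambda e: weight[e])
    cost = weight[out]
    new_high, new_low = set(high) - {out}, set(low) | {out}
    candidates = [e for e in feasible_edges(new_high, new_low, penalty, paths, budget)
                  if e != out]
    if not candidates or max(weight[e] for e in candidates) <= cost:
        return set(high), set(low)  # swap rejected, keep the old solution
    for e in sorted(candidates, key=lambda e: weight[e], reverse=True):
        # Re-check feasibility before each insertion: candidates may conflict.
        if e in feasible_edges(new_high, new_low, penalty, paths, budget):
            new_high.add(e)
            new_low.discard(e)
    return new_high, new_low

# Two edges share one path that tolerates only one high-VT edge.
paths = [["e1", "e2"]]
budget = [1.0]
penalty = {"e1": 1.0, "e2": 1.0}
weight = {"e1": 3, "e2": 8}
new_high, new_low = try_swap({"e1"}, {"e2"}, weight, penalty, paths, budget)
```

Here the maximal set {e1} has weight 3; swapping e1 out makes e2 feasible with gain 8, so the swap is accepted and the total weight rises to 8.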
The overall dual VT power minimization algorithm is shown in Figure
Implementation
The bottleneck in the proposed method is the swapping operation. For each candidate edge to be swapped out, an incremental timing analysis is performed to find the new edges that become feasible. In the worst case, even the incremental timing analysis may have to traverse backward to the primary inputs and forward to the primary outputs in order to compute the changes in the slack values of all the affected gates. However, in practice, the effect of changing the threshold voltage of a single edge on the slack values of other gates in the circuit diminishes geometrically with the depth of the fanin and fanout cones [7]. Consequently, in the current implementation, timing analysis is performed with a window (number of levels around the node in consideration) of a specified size. Experimental results confirm that this simple heuristic significantly improves the running time for large circuits with little degradation in the reduction of the leakage power.
Experimental Results
The dual VT power optimization algorithm shown in Figure 5 was implemented in SmWW.
All experiments were run on a Sun Sparc4 machine with 64MB memory with the circuits from the MCNC91 benchmark suites. The typical V_T,high and V_T,low for current dual VT digital CMOS processes are 0.7 and 0.25 volts respectively [12, 13]. All the technology parameters for the power and delay model come from [1, 2, 3]. The signal probabilities are obtained by logic simulations with randomly generated input patterns. The leakage power model (4, 5) is used to compute the leakage power for each gate. Since the signal probabilities and gate sizes will not be changed during the optimization, the leakage power reduction of each edge (by changing it from V_T,low to V_T,high) need only be computed once. The timing constraint for a circuit is the worst case delay of the original circuit implementation, i.e. all transistors have low threshold voltages. Since accurate delay computation and timing analysis are used in the algorithm, the algorithm is guaranteed to produce a new implementation with a worst case delay that is the same as the original one. The experimental results are shown in Table 2.
From Table 2, it can be seen that significant power reduction can be achieved using the proposed dual VT power optimization algorithm. The power reduction can be up to an order of magnitude. By setting a window on the static timing analysis, the computation time for large circuits is reduced significantly with little penalty on the power reduction. For example, by setting the window size to 5, the CPU time for the circuit C6288 is reduced by almost 50% with only a 3% decrease in the power reduction. Considering the limited computation power we have, the algorithm finished in reasonable time for all circuits and should be suitable for real large designs. Finally, if the circuits are implemented with all high threshold voltage devices, the reduction in leakage power will be at least 2 to 3 orders of magnitude. But the last column shows that the average delay increase will be about 30%. The proposed algorithm can achieve significant power reduction without any delay penalties.
Conclusion and Future Work
Modern advanced digital CMOS dual VT technology allows transistors with two different threshold voltages on the same chip. This provides another dimension for circuit optimization. In this paper we addressed the problem of leakage power reduction under delay constraints, given a circuit implemented with all low VT devices for dual VT technology. A simple and efficient algorithm was presented, and experimental results show that significant leakage power reduction can be achieved without any delay penalty.
The increasing use of dual and multiple VT CMOS technology provides other opportunities for circuit optimization.
Currently we are looking at the following problems: (1) delay optimization with minimum leakage power penalty given dual VT technology; (2) simultaneous gate sizing and VT selection for power-delay trade-offs; (3) including the effects of dual VT on the short circuit power, which is not considered in the power model in this paper.
Acknowledgement
This work was carried out at the Center for Low Power Electronics, which is supported by the National Science Foundation, the Department of Commerce of the State of Arizona, and various companies in the microelectronics industry, including Analog Devices, Analogy, Burr Brown, Hughes, Intel, Microchip, Motorola, National Semiconductor, Rockwell, Sicou S~, Texas Instruments, and Western Design.
References
[1] B. Sheu, D. L. Scharfetter, P. K. Ko and M. C. Jeng, "BSIM: Berkeley Short-Channel IGFET Model".
[10] J. Kao, A. Chandrakasan, and D. Antoniadis, "Transistor Sizing Issues and Tool For Multi-Threshold CMOS Technology," Proc. of DAC 97, Las Vegas, NV, June 1997.
[11] S. Devadas, A. Ghosh, and K. Keutzer, Logic Synthesis, McGraw-Hill, 1994.
[12] H. Y. Xie, Motorola Inc., personal communications, 1997.
[13] T. DWger, Rockwell Inc., personal communications, 1998.
Figure 1: A 2-input CMOS NAND gate.
Overall algorithm listing (partially recovered): 2. compute the weights for each edge; 3. static timing analysis; 4. set the timing constraints as the worst case delay; 5. I = initSolution(G); 6. S = swap(G, I); 7. re-compute the total leakage power.
Table 2: Experimental results (cont'd).
