There is a growing need to analyze and optimize the stand-by component of power in digital circuits designed for portable and battery-powered applications. Since these circuits remain in stand-by (or sleep) mode significantly longer than in active mode, their stand-by current, and not their active switching current, determines their battery life. Hence, stringent specifications are being placed on the stand-by (or leakage) current drawn by such devices. As the power supply voltage is reduced, the threshold voltage of transistors is scaled down to maintain a constant switching speed. Since reducing the threshold voltage increases the leakage of a device exponentially, leakage current has become a dominant factor in the design of VLSI circuits. In this paper, we describe a method that uses simultaneous dynamic voltage scaling (DVS) and adaptive body biasing (ABB) to reduce the total power consumption of a processor under dynamic computational workloads. Analytical models of the leakage current, dynamic power, and frequency as a function of supply voltage and body bias are derived and verified with SPICE simulation. Given these models, we show how to derive an analytical expression for the optimal trade-off between supply voltage and body bias, given a required clock frequency and duration of operation. The proposed method is then applied to a processor and is compared with DVS alone for workloads obtained using real-time monitoring of processor utilization for four typical applications.
Introduction
Power consumption has become an overriding constraint for microprocessor designs, not only in mobile environments, but in desktop and server applications as well. Traditionally, the supply voltage of microprocessors has been set at the maximum allowable voltage based on device breakdown potentials, and the processor is run at maximum clock frequency. During typical use, however, the dynamic load of applications running on a processor results in substantial periods of operation where the maximum performance of the processor is not required. A number of methods have been proposed that take advantage of these periods of low utilization by scaling the supply voltage and clock frequency, resulting in a reduction in dynamic power consumption [1] [2] [3] .
While these dynamic voltage scaling (DVS) methods are effective in addressing the dynamic power consumption, they are not as effective in reducing the leakage or static power. As minimum feature size continues to shrink, the scaling of the maximum supply voltage requires a reduction in the threshold voltage which results in an exponential increase of leakage current with each new technology generation. Even in today's technologies, it is not uncommon for leakage power to comprise as much as 20% of the total power consumption [4] . As technologies continue to scale, it is expected that leakage power consumption will become comparable to dynamic power consumption [5] . Furthermore, during periods of low utilization, the lower clock frequency and lower corresponding voltage level result in reduced dynamic power consumption, thereby causing leakage power to dominate.
Leakage Current Reduction in VLSI Systems
David Blaauw, Steve Martin, *Krisztian Flautner, Trevor Mudge
University of Michigan, Ann Arbor, MI
*ARM Ltd, Cambridge, UK
For processors operating under dynamic computational loads, it is therefore essential that both leakage and dynamic power are addressed effectively. Previously, adaptive reverse body biasing (ABB) has been proposed to control the leakage current during standby mode [6] [7] [8] . Recently, methods using forward body biasing have also been proposed [9, 10] . Adaptive body biasing has the advantage that it reduces the leakage current exponentially, whereas dynamic voltage scaling reduces leakage current linearly. In this paper, we propose the use of simultaneous dynamic voltage scaling and adaptive reverse body biasing to control both dynamic and leakage power.
The difficulty in employing simultaneous DVS and ABB is in determining the optimal trade-off between supply voltage and reverse body bias voltage, such that the total power consumption at a particular frequency of operation is minimized. The optimal trade-off between supply voltage and body voltage depends on the frequency of operation, since this determines the relative magnitude of the dynamic and static power consumption.
The energy required for switching the supply voltage and the body voltage is amortized over the period of operation at a particular frequency. Since the switching energies for the body voltage and the supply voltage differ, the optimal trade-off also depends on the length of the operation at a particular frequency. Finally, the possible combinations of supply voltage and body bias are constrained by the requirement that the circuit delay meets the specified clock frequency.
In this paper, we derive an analytical expression for the optimal supply voltage and body voltage for a given processor frequency and duration of operation. We present analytical models that express the power consumption and processor performance as a function of the body voltage and supply voltage and show how to fit these functions to SPICE simulation results with good accuracy. By using the performance of the processor as a constraint, the resulting two-dimensional optimization task is reduced to a one-dimensional task and is solved through differentiation. The analytical expression for the optimal supply voltage and body bias was verified through SPICE simulations.
The proposed simultaneous DVS and ABB method was then applied to a processor and was compared with using DVS alone. The dynamic processor loads were obtained through measurements on a 600MHz Crusoe processor for four different applications. Expected gains from using simultaneous DVS and ABB were evaluated at a current 0.18µm technology as well as for a projected 0.07µm technology. The simulations show that the proposed method improves the total power consumption by an average of 23% for the 0.18µm process and by an average of 49% for the 0.07µm process. These substantial savings in total power result from the fact that for many applications, the processor spends substantial periods of time at very low performance levels where leakage current dominates the total power consumption. The results also indicate that for future technologies with higher leakage power, the effectiveness of the proposed method will increase.
The remainder of this paper is organized as follows. In Section 2, the models for dynamic power, leakage power, and performance as a function of supply voltage and body bias are presented. In Section 3, we derive the optimal trade-off point for the supply voltage and body bias. In Section 4 we apply these optimization techniques to the Crusoe processor and in Section 5 we draw our conclusions.
Power and Performance Models
We first derive the threshold voltage as a function of the supply and bias voltages and then express the total power consumption as a function of these voltages. We then derive the performance as a function of the supply and body bias voltages. In all cases, we compare the analytical model to SPICE simulation results.
Threshold Voltage
The threshold voltage of a short-channel MOSFET transistor in the BSIM model [11, 12] is given by, (1) where V th0 is the zero-bias threshold voltage, Φ s and γ are constants for a given technology, V bs is the voltage applied between the body and source of the transistor, ∆V sce is the dependence of the threshold voltage on short channel effects and drain induced barrier lowering, and ∆V NW is a constant for a given transistor size that models narrow width effects. The change in threshold voltage due to short channel effects is given by, (2) where θ(DIBL) is a function dependent only on process dependent fitting parameters and effective length of the transistor, θ(SCE) is approximately a constant, and V dd is the supply voltage. Combining (1) and (2) shows that V th has a linear dependence on V dd . If [...] then [...] can be linearized as [...] which yields, (3)
where K 1 , K 2 and V th1 are constant fitting parameters. An R 2 value of 1 is a perfect linear fit.
Figure 1 . V th vs. both V dd and V bs as generated by SPICE simulation for a deep-submicron process.
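The linearization step can be sketched as follows. This is a hedged reconstruction rather than the original equations: the first-order expansion of the body-effect square root is standard, but the condition under which it is applied and the signs and grouping of K 1 , K 2 and V th1 in the final line are assumptions, since only the names of the fitting parameters survive in the text.

```latex
% First-order Taylor expansion of the body-effect term,
% valid when |V_{bs}| is small compared with \Phi_s (assumed condition)
\sqrt{\Phi_s - V_{bs}} \;\approx\; \sqrt{\Phi_s} - \frac{V_{bs}}{2\sqrt{\Phi_s}}

% Assumed resulting linear model (signs are an assumption):
% V_th falls as V_dd rises (DIBL) and rises under reverse body bias (V_bs < 0)
V_{th} \;\approx\; V_{th1} - K_1 V_{dd} - K_2 V_{bs}
```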
Power Consumption
The power consumed in a processor consists of three components as given by,
$$P = P_{AC} + P_{DC} + P_{SC} \qquad (4)$$
where P AC is the dynamic power, P DC is the static power due to leakage, and P SC is the negligible power due to short circuits when both PMOS and NMOS devices are on during signal transitions [14] . The dynamic power is given by,
$$P_{AC} = C_{eff} V_{dd}^{2} f \qquad (5)$$
where C eff is the average switched capacitance per cycle, and f is the clock frequency. Figure 2 shows the major components of static current in a standard inverter. Although [15, 16] consider only the leakage due to I subn and I subp , as [7, 17] point out, the contributions of I j and I b can be significant. Thus, the static power consumption is
given by, (6) where I subn is the subthreshold leakage current, I jn is the drain to body junction leakage current, and I bn is the source to body junction leakage current through the NMOS device. Equation (6) is given for the inverter when its output state is high (i.e. the NMOS device is leaking). A similar equation can be derived for the inverter when the output state is low and the PMOS device is leaking.
The subthreshold leakage current through a transistor with V ds =V dd and V gs =0 is modeled by,
where W and L are the device geometries, I s , n, and V off are empirically determined constants for a given process, and V T is the thermal voltage [11] . Since V off is typically small and the V dd -dependent factor in (7) is nearly 1 for all V dd , (7) can be approximated as, (8) where K 3 and K 4 are constant fitting parameters. Substitution of (3) into (8) yields, (9) where K 5 and K 6 are constant fitting parameters.
As |V bs | is increased, the current due to junction leakage, I j , increases and counteracts the savings achieved by lowering I subn . The maximum value of |V bs | before junction leakage overrides subthreshold current reduction is process dependent and has been shown to range from -0.6V to -2.5V [8, 18] . This crossover point is also highly dependent on temperature: at higher operating temperatures, transistors exhibit a larger reduction in I subn (and thus tolerate a larger |V bs |) before I j increases [8] . SPICE simulations for the 0.07µm process show that the crossover point is about -1.2V. Therefore, to be safe, V bs was constrained between 0 and -1V, although in a different process, a lower cutoff point might be needed. I j can be approximated as a constant, (10) and the total static current, I stat , becomes, (11)
Figure 3 is a SPICE generated plot of I stat vs. simultaneous changes in V dd and V bs . A comparison between I stat as generated by SPICE and I stat as generated by (11) yields an average percent error of 2.09% and a maximum percent error of only 5.63% for 0.3 < V dd < 1V and -1 < V bs < 0V.
Substitution of (9) and (10) into (6) yields, (12) and finally, (13)
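The structure of the power model described in this section can be illustrated with a short sketch. It assumes the standard dynamic power expression, a subthreshold current that is exponential in both V dd and V bs , and a static power of the form V dd ·I subn + |V bs |·I j ; these forms are reconstructions from the surrounding definitions, and every numeric constant is a hypothetical placeholder rather than a fitted value from the paper. Short-circuit power is neglected, as in the text.

```python
import math

# Hypothetical constants, chosen only for illustration (not fitted values).
C_EFF = 1.0e-9   # average switched capacitance per cycle (F)
K_SUB = 1.0e-4   # subthreshold current prefactor (A), assumed
K_VDD = 1.5      # exponent coefficient for V_dd (1/V), assumed
K_VBS = 4.0      # exponent coefficient for V_bs (1/V), assumed
I_J = 1.0e-6     # junction leakage, approximated as a constant (A)


def dynamic_power(v_dd, f):
    """Dynamic power C_eff * V_dd^2 * f."""
    return C_EFF * v_dd ** 2 * f


def subthreshold_current(v_dd, v_bs):
    """Assumed exponential dependence on both V_dd and V_bs."""
    return K_SUB * math.exp(K_VDD * v_dd) * math.exp(K_VBS * v_bs)


def static_power(v_dd, v_bs):
    """Assumed form: V_dd * I_sub + |V_bs| * I_j."""
    return v_dd * subthreshold_current(v_dd, v_bs) + abs(v_bs) * I_J


def total_power(v_dd, v_bs, f):
    """Dynamic plus static power; short-circuit power is neglected."""
    return dynamic_power(v_dd, f) + static_power(v_dd, v_bs)
```

With these placeholder constants, applying reverse body bias (a more negative V bs ) lowers the static term sharply while leaving the dynamic term unchanged, which is the behavior the model is meant to capture.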
Delay
The delay of a gate is a function of both the power supply and the threshold voltage of the internal transistors. Since the delay of complex gates remains proportional to the delay of a standard inverter, the circuit delay can be modeled similarly to the alpha-power model of an inverter [19, 20] as, (14) where K 7 is a constant for a given process technology, and α is an indication of the amount of velocity saturation occurring in the device (α is typically 1.3-1.5 for short channel devices). The critical path delay in a circuit can be modeled as, (15) where t crit is the delay of the critical path and L d is the so-called logic depth of the path [15] . Substitution of (3) and (14) into (15) yields, (16) Figure 4 shows the plot of delay vs. V dd and V bs as determined using SPICE. A comparison between the SPICE data and the operating frequencies calculated using (16) yields an average percent error of 9.8% and a maximum percent error of 33.2% for 0.5 < V dd < 1V, -1 < V bs < 0V, and α=1. While this maximum percent error is large, (16) produces worst-case frequencies which guarantee that the circuit will meet timing. The optimal power consumption, however, is not fully realized. Additionally, V dd was limited to greater than 0.5V to ensure proper circuit operation and noise margins since the V th of a transistor approaches 0.38 V when scaling.
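As a rough illustration of the delay model, the sketch below combines a linear threshold-voltage model with the common alpha-power form t = K·V dd /(V dd - V th )^α, scaled by the logic depth. Both forms are assumptions standing in for (3), (14) and (15), whose exact expressions are not recoverable here, and all constants are hypothetical.

```python
# Hypothetical constants for illustration only.
V_TH1 = 0.35    # fitted threshold intercept (V), assumed
K1 = 0.05       # sensitivity of V_th to V_dd, assumed
K2 = 0.15       # sensitivity of V_th to V_bs, assumed
K7 = 2.0e-11    # process-dependent delay constant (s), assumed
L_D = 20        # logic depth of the critical path, assumed
ALPHA = 1.0     # velocity-saturation index


def threshold_voltage(v_dd, v_bs):
    """Assumed linear model: V_th = V_th1 - K1*V_dd - K2*V_bs."""
    return V_TH1 - K1 * v_dd - K2 * v_bs


def critical_path_delay(v_dd, v_bs):
    """Logic depth times an alpha-power-law gate delay (assumed form)."""
    overdrive = v_dd - threshold_voltage(v_dd, v_bs)
    if overdrive <= 0:
        raise ValueError("supply voltage does not exceed threshold voltage")
    return L_D * K7 * v_dd / overdrive ** ALPHA


def max_frequency(v_dd, v_bs):
    """Highest clock frequency that still meets the critical path delay."""
    return 1.0 / critical_path_delay(v_dd, v_bs)
```

Under these assumptions, both lowering V dd and applying reverse body bias reduce the achievable frequency, which is why the two knobs have to be traded off against a required clock rate.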
Optimization
Now that the necessary models have been developed, the technique for finding optimal settings for implementing both DVS and ABB is presented. With three possible variables to control, V dd , V bs , and f, the optimization first begins with limiting the number of free variables. 
Variable Reduction
The processor's algorithm for determining utilization based on workload generates a value for the required frequency, eliminating one free variable. In order to eliminate a second variable, this frequency is treated as a constant for a given optimization point and (16) can be solved to find V dd as a function of V bs . If α =1 then (16) yields, (17) which may be rewritten as, (18) where, for a given frequency, (19) This leaves V bs as the only free variable.
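The variable reduction can be sketched as a short derivation. The forms below are assumptions: a linear threshold model and an alpha-power delay with V dd in the numerator. They are not the paper's exact (16) to (19), but they show how fixing f and setting α = 1 leaves V dd as a linear function of V bs .

```latex
% Assumed forms (reconstructions, not the original equations):
%   V_{th} = V_{th1} - K_1 V_{dd} - K_2 V_{bs}
%   t_{crit} = L_d K_7 V_{dd} / (V_{dd} - V_{th})^{\alpha}, \qquad f = 1 / t_{crit}

% With \alpha = 1, the frequency constraint becomes
f \, L_d K_7 \, V_{dd} = (1 + K_1) V_{dd} + K_2 V_{bs} - V_{th1}

% Solving for V_{dd} at a fixed f (requires (1 + K_1) > f L_d K_7):
V_{dd} = \frac{V_{th1} - K_2 V_{bs}}{(1 + K_1) - f L_d K_7}
```

For a fixed frequency the denominator is a constant, so a more negative V bs requires a proportionally higher V dd to hold the same clock rate.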
Energy Minimization
The energy consumed per cycle is defined as, (20) By substituting in (13), the total energy consumed per cycle for an entire circuit is given by, (21) where L g is the number of logic gates in the circuit. Unfortunately, there is also energy required in switching the circuit between varying power modes. This switching energy, E s , is given by, (22) where ∆V dd is the change in V dd , ∆V bs is the change in V bs , C r is the capacitance of the power rail, and C s is the total capacitance of the substrate and wells of the device. Let t be the duration of time in a given power mode. The total energy consumed in a particular mode is then given by, (23) Differentiating (23) with respect to V bs yields, (24) where by substituting in (18) ,
and (26) In (26), k 1 -k 6 are constants derived from the other process variables, K 1 -K 8 . Their values for a 10 inverter chain are presented in Table 1 . Figure 5 shows the derivative of total energy vs. V bs . The zero crossing indicates the V bs for minimum energy consumption. Figure 6 shows the optimal V bs and V dd for minimum energy consumption for different required frequencies, given a 50µs duration of operation. For any duration t > 50µs, the V dd and V bs values are independent of t while for a duration t < 50µs, V dd and V bs scale with t. The shorter duration cycles do not lend themselves to large voltage changes because the energy required to switch V dd and V bs cannot be amortized over as many cycles as during the longer duration cycles. Figure 7 shows the energy savings achievable by using both DVS and ABB. The average energy reduction over all frequencies by simultaneous DVS and ABB as opposed to just DVS is 54% while the savings over a circuit with no scaling is 74%. SPICE simulated values for total energy and the expected values based on (23) agree with an average percent error of 12.7% and a maximum percent error of 28.8%.
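The one-dimensional minimization can be illustrated numerically. The sketch below is not the closed-form derivative of (24) to (26); it swaps in a plain grid search over V bs in the range [-1V, 0V], with V dd set by an assumed frequency constraint (α = 1) and an assumed quadratic switching energy. All model forms and constants are hypothetical and chosen only so that leakage is significant at low frequency.

```python
import math

# Hypothetical constants for illustration only (not the paper's fitted values).
V_TH1, K1, K2 = 0.35, 0.05, 0.15          # assumed linear V_th model
K7, L_D = 2.0e-11, 20                     # assumed delay constant, logic depth
C_EFF = 1.0e-9                            # switched capacitance per cycle (F)
K_SUB, K_VDD, K_VBS = 0.2, 1.5, 4.0       # assumed leakage fit constants
I_J = 1.0e-6                              # junction leakage (A)
C_R, C_S = 1.0e-8, 1.0e-8                 # rail and substrate capacitance (F)


def vdd_for_frequency(f, v_bs):
    """V_dd that just meets frequency f at body bias v_bs (alpha = 1, assumed)."""
    denom = (1.0 + K1) - f * L_D * K7
    if denom <= 0:
        raise ValueError("frequency not reachable in this model")
    return (V_TH1 - K2 * v_bs) / denom


def total_energy(f, v_bs, duration, vdd_prev=1.0, vbs_prev=0.0):
    """Energy over one operating period plus the energy to switch into it."""
    v_dd = vdd_for_frequency(f, v_bs)
    p_dynamic = C_EFF * v_dd ** 2 * f
    i_sub = K_SUB * math.exp(K_VDD * v_dd + K_VBS * v_bs)
    p_static = v_dd * i_sub + abs(v_bs) * I_J
    # Assumed switching-energy form: C * (delta V)^2 for each rail.
    e_switch = C_R * (v_dd - vdd_prev) ** 2 + C_S * (v_bs - vbs_prev) ** 2
    return (p_dynamic + p_static) * duration + e_switch


def optimal_body_bias(f, duration, lo=-1.0, hi=0.0, steps=1000):
    """Grid search for the V_bs in [lo, hi] that minimizes total energy."""
    candidates = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    best = min(candidates, key=lambda v: total_energy(f, v, duration))
    return best, vdd_for_frequency(f, best)
```

For example, `optimal_body_bias(3.0e8, 1.0e-3)` returns an interior body bias for a long operating period, while a very short period such as `1.0e-9` pulls the optimum toward smaller voltage changes because the switching energy can no longer be amortized.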
Microprocessor Results
The proposed method of simultaneous DVS and ABB was applied to a mobile processor using the derived optimal trade-off between supply voltage and body-bias. The dynamic processor load was obtained through hardware monitoring as explained in the following section. The application of the simultaneous DVS and ABB method and the resulting energy savings are discussed in Sections 4.2 and 4.3.
Workload
Performance-setting algorithms dynamically adjust the processor's performance level while ensuring that the software running on the processor meets its deadlines. For some applications, there is a very clear notion of what these deadlines are. During video playback, for example, the performance-setting algorithm must ensure that the desired framerate (usually 30 frames/second) is achieved. Setting the performance too low would leave the application unable to decode and display frames at the proper time, causing jerkier playback.
Decoding a frame too soon, on the other hand, unnecessarily increases power consumption since finishing a task before its deadline implies that the performance level was set too high. The goal of the performance-setting strategy is to stretch the execution of each task exactly to its deadline and scale the supply voltage and body bias voltage to their optimal values for the required performance. The only difficulty is in knowing exactly what the deadlines are. Our algorithm is implemented in the Linux kernel and relies on monitoring system calls and inter-task communication to derive deadlines automatically and without modification of user programs [2] . Unlike many similar algorithms, ours is equally effective for interactive and real-time (periodic) workloads.
The traces for this paper were collected on a Sony Picturebook PCG-C1VN which uses the Transmeta Crusoe 5600 processor whose performance level can be varied between 300-600MHz in 100MHz steps (or frequency scaled between 50-100% in 16% steps). While this processor has its own algorithm for controlling the processor performance levels, we have overridden it with our own performance-setting algorithm. During the benchmark runs, the processor's frequency was varied between 300MHz and 600MHz and the measured performance levels were used to compute the expected energy using either DVS alone or using simultaneous DVS and ABB. Moreover, we noticed that for many of our benchmarks, even the minimum speed of this processor was unnecessarily fast. Some of our applications would meet their deadlines with a performance level of 10% of peak.
Therefore, we also estimated the effects on energy consumption for a conceptual processor that could run over a wider range of frequency values. The frequency values ranged from 10-100% in 5% steps. We compare these energy results with those from the more restricted range where the minimum frequency was restricted to 50% of maximum performance. The four benchmarks in this paper are:
• xmms-mp3: mp3 audio playback using the xmms player, which includes visual effects (See Figure 8 ).
• mpeg: video playback of Red's Nightmare.
• emacs: record of an editing session in emacs.
• os: record of user doing miscellaneous UNIX operations (e.g. grep, ls, vi, find, awk, perl, etc.).
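The mapping from a task deadline to one of the discrete performance levels described above can be sketched as follows. This is only an illustration: the function names are hypothetical, and rounding up to the next available level is an assumed policy for meeting deadlines, not a description of the kernel algorithm in [2].

```python
def required_fraction(busy_time_at_peak, time_to_deadline):
    """Fraction of peak performance that stretches a task exactly to its deadline."""
    if time_to_deadline <= 0:
        raise ValueError("deadline must lie in the future")
    return min(1.0, busy_time_at_peak / time_to_deadline)


def performance_levels(lowest, highest, step):
    """Discrete performance levels from lowest to highest, inclusive."""
    return list(range(lowest, highest + 1, step))


def select_level(required, levels):
    """Smallest available level that meets the requirement (assumed policy)."""
    for level in sorted(levels):
        if level >= required:
            return level
    return max(levels)


# Levels taken from the text: 300-600MHz in 100MHz steps for the measured
# processor, and 10-100% in 5% steps for the conceptual processor.
CRUSOE_MHZ = performance_levels(300, 600, 100)
WIDE_PERCENT = performance_levels(10, 100, 5)
```

A task that needs only 10% of peak is forced up to 300MHz on the measured processor (`select_level(60, CRUSOE_MHZ)`), whereas the wider range can run it at the 10% level (`select_level(10, WIDE_PERCENT)`).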
Optimization
The constant values for the Crusoe processor were calculated using published data on the processor [21]. Since the processor is fabricated in a 0.18µm process, the fitting parameters were adapted from the Berkeley predicted models for a 0.18µm process [13] . Table 2 shows the constants for the Crusoe processor in 0.18µm technology. It is recognized that as technology continues to scale, processes incur higher leakage [22] . In fact, the static power in current 0.18µm high-performance processors comprises 20% of total power [4] . Conservatively, our 0.18µm simulations have only 10% leakage power. To forecast for future generations, constant values were also calculated for the higher-leakage 0.07µm predicted process. The 0.07µm process's leakage power is 30% of [...] by up to 0.5V and V bs by up to -1V. The minimum duration at any utilization was set at 200µs which is a conservative estimate of V dd and V bs switching times based on previous published data [6] . During these switching periods, the higher-power state was used as an estimate of total energy. Figure 8 shows a sub-section of the trace for the xmms-mp3 player and the required V dd and V bs values for energy optimization in the 0.07µm technology.
Table 3 shows the energy reduction achieved by employing both DVS and ABB in the 0.18µm process using scaling between 50-100% in 16% steps. The values are shown for the four different workloads. In the 0.18µm process the average energy savings over DVS-only schemes is 23%. The more aggressive performance scaling (10-100% scaling) does not yield any benefits in the 0.18µm process because the longer run times during active cycles override the benefits achieved during the idle states when the clock is halted and only static power is consumed. This is due to the relatively low-leakage nature of the 0.18µm process (10% of total power). Table 4 shows the energy reduction achieved by simultaneous DVS and ABB scaling in the 0.07µm process.
Energy Savings
The performance scaling between 50-100% of peak shows an average energy reduction of 39% over DVS alone while the scaling between 10-100% of peak has an average energy reduction of 48%. The most benefit is achieved in applications like emacs where the processor spends a lot of time idling and consumes mostly static power. This static power is reduced by the body biasing.
Conclusion
We examined an energy reduction technique through simultaneous implementation of DVS and ABB and presented an analytical expression for power consumption and processor performance as functions of three control parameters (frequency, supply voltage, and body bias voltage). A closed-form method for finding the proper V dd and V bs for optimal power consumption was also presented. Furthermore, this optimal solution was easily obtained using process parameters from SPICE simulation and design specifications. The optimal parameters were applied to both actual and simulated workloads for a 600MHz, 0.18µm mobile processor. The results show that the simultaneous implementation of DVS and ABB power scaling techniques produces an average energy reduction of 23% in a 0.18µm process and 39% in a predicted 0.07µm process over DVS alone when scaling performance between 50-100%. Energy reductions of nearly 50% were achieved through more aggressive performance scaling (10-100%) in the 0.07µm process. The results also suggest that as technology scales and leakage power increases, simultaneous DVS and ABB scaling will become more effective.
