Abstract
Introduction
Energy dissipation has become a critical design constraint in high performance microprocessors. Until recently, the focus has been on the dynamic energy dissipated in CMOS circuits. In older technologies, the majority of the energy is dissipated when transistors switch (transient power dissipation). When the circuits are not active the current is extremely low relative to switching, and thus, the static energy consumed is negligible. This exaggerated relationship between dynamic and static energy will experience a marked shift in the near future.
Static energy dissipation is a result of leakage current due to the finite resistance of the off transistors between power and ground that exist whenever power is applied to a CMOS circuit. The magnitude of the leakage current is highly dependent on the threshold voltage ($V_t$) characteristics. As integrated circuit technology scales to ever smaller dimensions, supply voltage levels are likewise scaled. To improve circuit speed, the threshold voltages are also decreased. This decrease in threshold voltage results in an exponential increase in the subthreshold leakage current [3]. The International Technology Roadmap for Semiconductors [2] projects dimensions of 70 nm to be in production by the year 2006. At these dimensions, the leakage energy is estimated to be on par with the dynamic switching energy if novel circuit techniques are not developed [3].
Since most of the transistors in a microprocessor reside in the storage structures (the caches and buffers), the RAMs are responsible for a large portion of the leakage power [7, 9, 11, 14, 20]. The functional units, in contrast, account for a much smaller fraction of the transistors. However, the model developed by Butts and Sohi [6] for estimating leakage current in various logic structures reveals an order of magnitude larger leakage current for combinational logic relative to cache RAM transistors. While precise estimates for static power require detailed circuit knowledge of the processor, which is not readily available, this model indicates that the integer and floating point functional units contribute a noticeable fraction of the overall static power despite their smaller transistor count relative to the caches.
In this paper, we present the benefits of employing a dual threshold voltage domino logic circuit technique [16] (dual-$V_t$) to reduce subthreshold leakage current in the integer functional units (FUs) of a general-purpose processor. We focus on domino logic dual-$V_t$ circuits because domino logic has both superior speed and area characteristics as compared to static CMOS logic circuits [1, 10, 13, 16]. We restrict the analysis to the integer FUs because it is these units that are most heavily utilized. Some domino logic designs have a sleep mode in which the circuit expends very little static energy. However, due to the energy cost of entering this mode, it has thus far been proven useful only to reduce leakage during long-term standby mode. Idle times in the functional units can often be relatively short, from one to a few hundred cycles. We develop an energy model appropriate for the architecture-level analysis of logic circuits and explore strategies to employ the sleep mode in the dual-$V_t$ circuits so as to minimize the overall energy when idle times are short. We use this energy model to develop insight into the dependencies among the application behavior, activation of the idle mode, and the underlying technology characteristics of the circuit.
We study both analytically and empirically (by determining the effects on the performance and energy of a set of integer benchmarks) the benefits and costs of aggressively enabling the sleep mode at every opportunity (MaxSleep) relative to never enabling the sleep mode (AlwaysActive). These two extreme sleep mode management policies, MaxSleep and AlwaysActive, are the two simplest policies possible and provide bounds on the energy savings to which other sleep management methods should be compared. Our results show that with idle intervals as short as ten cycles, the MaxSleep policy performs well across a broad range of parameters. We also propose a circuit-based scheme we call GradualSleep that blends the best behaviors of MaxSleep and AlwaysActive and reduces the energy impact of using the sleep mode for even smaller idle periods. We show that GradualSleep performs well across a wide range of conditions. The simple GradualSleep design achieves most of the potential energy savings, indicating that more complex control strategies may not be warranted.
The rest of the paper is organized as follows. The low-leakage domino circuit and its behavior are described in Section 2. A static energy model appropriate for architectural energy studies of functional units is developed in Section 3. Our experimental methodology is described in Section 4. The use of the sleep mode to reduce overall energy in integer functional units is evaluated in Section 5. Related work is discussed in Section 6. Finally, concluding remarks are made in Section 7.
Low-leakage logic-based circuit design
Dynamic domino logic gates are frequently used in critical paths within the functional units of high speed processors. The structures of a static CMOS AND-gate and its counterpart implemented as dynamic domino logic are contrasted in Figures 1a and 1b. In static CMOS, the inputs are loaded by both the PMOS and NMOS transistors. In domino logic, the inputs have only the NMOS device as a load and thus are inherently faster. The operation of the domino AND-gate is shown in Figure 1c. The internal node Dynamic is precharged during the low phase of the clock. Note that the path to ground is cut off by an NMOS transistor during this time. When the clock transitions high, the path to ground is enabled and the inputs are evaluated. When both inputs are high, the dynamic node is discharged and the output goes high. When either input is low, the dynamic node remains charged and the output is low. The state of the dynamic node is preserved against coupling noise, charge sharing, and charge leakage by the keeper transistor. In contrast to static CMOS, the dynamic nodes are precharged and the inputs evaluated every clock cycle regardless of whether the inputs change state. When the circuit is not required, useless re-evaluation (and its energy cost) can be avoided by gating the clock such that the clock input is forced high.
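The precharge/evaluate behavior described above can be illustrated with a small behavioral sketch in Python. This is only a logical model of the two-input domino AND-gate, not a circuit simulation; the function names `domino_and_phase` and `run_cycle` are hypothetical, and the keeper is modeled simply as "the node holds its value".

```python
def domino_and_phase(clock_high, in1, in2, dynamic_node):
    """Advance a 2-input domino AND-gate by one clock phase.

    Returns (dynamic_node, out). Purely behavioral: True means a high voltage.
    """
    if not clock_high:
        dynamic_node = True       # low clock phase: the node is precharged
    elif in1 and in2:
        dynamic_node = False      # evaluation: pull-down path conducts, node discharges
    # otherwise the keeper holds the precharged value
    out = not dynamic_node        # the output inverter drives Out
    return dynamic_node, out


def run_cycle(in1, in2):
    """One full clock cycle: precharge (clock low), then evaluate (clock high)."""
    node, _ = domino_and_phase(False, in1, in2, dynamic_node=False)
    return domino_and_phase(True, in1, in2, node)
```

With both inputs high the node discharges and the output rises; with any input low the node stays charged, which in the dual-threshold circuits discussed later corresponds to the high leakage state.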
As described in [1, 13, 16], domino circuits permit the use of dual threshold voltage techniques to reduce subthreshold leakage current without sacrificing active mode circuit performance. The key to achieving this balance is to place low-$V_t$ transistors only along the critical evaluation path as shown in Figure 2a, in which the shaded transistors are the slower high-$V_t$ devices. The leakage current of a dual-$V_t$ domino circuit is asymmetric and depends on the voltage level at the internal dynamic node. If either In1 or In2 is low, the dynamic node will remain high. In this state, the voltage across the high leakage transistors N1, N2, N3, as well as N4 results in a large subthreshold leakage current. Alternatively, when both inputs and the clock are high, the dynamic node is discharged and the low leakage transistors P1, P2, and N5 are strongly cut off. When the dynamic node is discharged, the voltage drop is across the high-$V_t$ devices, which act as high resistance switches, and not across the low-$V_t$ transistors. In this state, the static energy of the circuit is dramatically reduced.
Dual-$V_t$ domino circuits that incorporate a low leakage sleep mode do so by adding the ability to force the internal nodes into the low leakage state. Many circuits incorporating a sleep mode have been proposed [1, 12, 13, 16]. For the purpose of this paper, the essential behavior of all these circuits is similar. The differences are in the complexity and energy overhead of the sleep mode function. For the ensuing discussion, we select a circuit from [16] that is simple and incurs minimal energy overhead.
The proposed method for incorporating the low leakage sleep (idle) mode into a dual-$V_t$ domino circuit is shown in Figure 2b. A high-$V_t$ transistor is added to discharge the dynamic node when the Sleep signal is asserted, regardless of the input vector. Only the first stage in a sequence of domino circuits requires this additional transistor. Asserting the Sleep signal drives the Out signal high, which turns off the keeper and forces any subsequent domino gates to evaluate to the low leakage state in a domino fashion. Not shown is the standard gating of the clock when Sleep is asserted to disable the precharge phase. An important aspect of this design is that the activation energy overhead of the sleep transistor is negligible relative to the switching energy of the gate, 0.14 fJ versus 22.2 fJ.
The delay and energy parameters of an 8-input OR (OR8) gate are listed in Table 1 for low-$V_t$, dual-$V_t$ without the sleep mode, and dual-$V_t$ with the sleep transistor. The parameters are $V_{dd} = 1.0$ V and $V_{tn} = 0.10$ V. Since leakage energy in dual-$V_t$ domino circuits depends on the state of the circuit, Vector LO Lkg is the input 10000000, which discharges the dynamic node to the low leakage state, and Vector HI Lkg is the input 00000000, which does not discharge the dynamic node. In the latter case the keeper maintains the dynamic node at the high voltage level, which is also the high leakage state.
The lower gate overdrive of the high-$V_t$ keeper transistor in dual-$V_t$ domino circuits reduces the contention current when switching the output and improves the propagation delay and dynamic power characteristics as compared to the low-$V_t$ domino circuit. In the dual-$V_t$ domino circuit, the difference in leakage energy between the LO Lkg and HI Lkg vectors is a factor of 2,000. Our method of incorporating a sleep mode is not in the evaluation path of the gate, so there is no impact on the propagation speed of the circuit. The sleep transistor is minimally sized and introduces negligible additional loading on the dynamic node of a domino gate. With the sleep mode capability, we can force the internal state of all of the gates to the low leakage state, drastically reducing the leakage energy regardless of the input vector. Enabling the sleep transistor, however, requires some additional energy (0.14 fJ), which must be accounted for. The delay in discharging the gate via the sleep transistor, 16 ps, is comparable to the delay of the evaluation phase, 15 ps, so the circuit can transition to the sleep state in one cycle. The measurements assume a 4 GHz clock.
The overhead of enabling the sleep mode depends upon the state of the circuit from the previous evaluation phase. The contributors to the dynamic energy dissipated during an evaluation are the circuits whose input vectors cause the dynamic nodes to discharge. In a complex circuit such as an ALU, on average not all dynamic nodes will discharge during an evaluation cycle. An activity factor is the probability that a domino logic gate will evaluate and place the dynamic node into a low voltage state at any given clock cycle. The average activity factor ($\alpha$), therefore, determines the fraction of the dynamic nodes that are discharged during each evaluation period, $0 \le \alpha \le 1$. Activating the sleep switch leaves the circuit in the same state as if the activity factor were 1.0 in the last evaluation; thus, activating the sleep mode discharges the dynamic nodes of the rest of the gates in the circuit. This portion is the $(1 - \alpha)$ fraction of the gates that were not discharged during the previous evaluation period before the sleep mode. To return to the active mode, the clock is again enabled and one precharge phase readies the circuit for evaluation; thus, activation also occurs within a single clock cycle.
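As a rough illustration of how the activity factor sets the cost of entering the sleep mode, the following Python sketch estimates the energy attributable to one sleep transition. It rests on simplifying assumptions that are not taken from the circuit measurements: each node that sleep discharges is charged the full 22.2 fJ gate switching energy, and each first-stage gate pays the 0.14 fJ sleep-transistor activation energy. The function name and its parameters are hypothetical.

```python
def sleep_entry_energy_fj(alpha, gates, first_stage_gates,
                          gate_energy_fj=22.2, sleep_switch_fj=0.14):
    """Estimate the energy (fJ) attributable to one transition into sleep.

    alpha: fraction of dynamic nodes discharged by the last evaluation.
    gates: total number of domino gates in the circuit.
    first_stage_gates: gates that carry a sleep transistor (assumption).
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    # Nodes left charged by the last evaluation are discharged by sleep
    # and must be precharged again when the circuit is re-activated.
    extra_discharged = (1.0 - alpha) * gates
    return extra_discharged * gate_energy_fj + first_stage_gates * sleep_switch_fj
```

Under these assumptions the overhead shrinks linearly as the activity factor grows, and at an activity factor of 1.0 only the sleep-transistor activation energy remains.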
Tradeoffs between active versus sleep modes
Enabling the sleep mode reduces the static energy dissipated. However, this mode is entered by discharging all of the dynamic nodes within the circuit that did not discharge in the evaluation phase. Thus, there is a tradeoff between the energy saved due to lower leakage current and the additional energy expended in the next active cycle from precharging these dynamic nodes that would have remained charged had the sleep mode not been entered. The activity factor $\alpha$ affects both the leakage energy and the energy overhead in transitioning to the sleep mode. As previously mentioned, activating the sleep mode is equivalent to an evaluation with a maximum activity factor of $\alpha = 1.0$.
We approximate a generic functional unit (FU) by a circuit consisting of 500 OR8-gates arranged as 100 rows of five cascaded domino circuits. The circuit contains the drivers that distribute the Sleep signal throughout the FU, and this energy is accounted for. The energy expenditure for this circuit relative to the idle interval is shown in Figure 3. The graph shows that for a low activity factor there is a considerable expenditure of energy to transition to the sleep mode, after which the additional energy is minimal. If the circuit is not idle for at least 17 cycles, then more energy is used than is saved by shifting to the low leakage sleep state. This extra energy decreases as the activity factor increases since more nodes enter the low leakage state during the previous evaluation phase before the idle mode. Interestingly, the time to break even is relatively insensitive across this range of activity factor. The reason is that as the activity factor increases, both the sleep transition overhead and the uncontrolled idle circuit leakage energy decrease at a similar rate, roughly proportional to $(1 - \alpha)$.
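The insensitivity of the break-even time to the activity factor can be seen in a simplified, normalized sketch. The Python function below is an illustrative assumption, not the model used to produce Figure 3: energies are expressed relative to the maximum dynamic energy per cycle, the transition overhead is taken as $(1 - \alpha)$ plus a small driver term `e_sleep`, and the leakage saved per sleep cycle is taken as $(1 - \alpha)\,p\,(1 - s)$, where `p` is a relative leakage factor and `s` the ratio of low to high leakage. All names and default values are hypothetical, so the result is not expected to reproduce the 17-cycle figure obtained from the circuit data.

```python
import math


def breakeven_cycles(alpha, p, s=0.001, e_sleep=0.01):
    """Smallest idle interval (cycles) for which sleeping saves energy.

    All energies are normalized to the maximum dynamic energy per cycle.
    """
    if not 0.0 <= alpha < 1.0:
        raise ValueError("alpha must lie in [0, 1)")
    if p <= 0.0:
        raise ValueError("p must be positive")
    overhead = (1.0 - alpha) + e_sleep               # one-time cost of sleeping
    saving_per_cycle = (1.0 - alpha) * p * (1.0 - s)  # leakage avoided per cycle
    return math.ceil(overhead / saving_per_cycle)
```

Because both the overhead and the per-cycle saving scale with $(1 - \alpha)$, the ratio is dominated by $1/p$; for example, at `p = 0.05` this sketch yields 21 cycles for both `alpha = 0.1` and `alpha = 0.5`.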
Static energy model
A precise energy model depends heavily on the details of the logic design and the circuit design. General circuit methods to reduce static power include combining high-$V_t$ devices (slow, low leakage) with low-$V_t$ devices (fast, high leakage) and placing the high-$V_t$ transistors along the noncritical paths throughout a functional unit. We develop a simple energy model that is parameterized and can represent the energy characteristics across a wide range of logic and circuit designs at a level useful for architectural studies. The model parameterizes the contribution of the low leakage and high leakage transistors in the overall energy dissipation of the circuit. This parameterization of the fraction of high leakage transistors abstracts the circuit specifics into a single primary parameter that can be varied.
The total energy of a circuit is shown in equation (1). The total energy consists of the dynamic and leakage energy during active cycles plus the leakage energy when the circuit is idle. We divide the total run-time into three categories of operation. The cycles of actual computation are called active cycles and their number is denoted as $n_A$.
The cycles when the circuit is clock-gated (no computation) but the sleep mode is not enabled are called uncontrolled idle cycles and denoted by $n_{UI}$. Cycles when the circuit is forced into the low-leakage state of the sleep mode are called sleep cycles.
The dynamic energy is the number of active cycles $n_A$ times the maximum dynamic energy per cycle, $E_A$, prorated by an activity factor $\alpha$, which is the fraction of the internal dynamic nodes that are actually discharged during the evaluation phase. Recall that the dynamic nodes are precharged prior to evaluation. Since we are using circuits based on dual-$V_t$ domino logic, we can simplify (1). In dual-$V_t$ circuits, the static energy $E_{S0}$ is much less than $E_{S1}$ [16]. We define a relationship between the two as $E_{S0} = s \cdot E_{S1}$, where $s$ is typically in the range $0.0001 \le s \le 0.01$. Furthermore, for a given technology, we can define the leakage energy as a fraction of the dynamic energy for a device, $E_{S1} = p \cdot E_A$ where $p \ge 0$. To elaborate, for a single gate the factor $p$ is the ratio of the maximum leakage energy expended to the maximum energy for evaluation per time unit (1 cycle). For circuits in a 130 nm technology, the value will be small, on the order of $p = 0.01$. This leakage factor $p$ is a versatile parameter.
Functional units may be designed using all domino logic or a mix of dynamic domino logic and static logic. In the latter case, where there is a mix of low-$V_t$ devices along the critical paths and high-$V_t$ devices along the non-critical paths, we can consider the circuit as a whole and use the ratio of its leakage energy to its evaluation energy as the factor $p$. This value of $p$ is lower than that of a single low-$V_t$ gate but greater than that of a high-$V_t$ gate. Thus, the factor $p$ abstracts the details of the circuit into a single value that models the worst-case leakage behavior relative to the dynamic energy. The factor $p$ becomes a key parameter that we vary to explore the technology design space. Applying the above relationships results in equation (2).
In this architectural study we focus on the relative energy between control policies. We can further simplify (2) by normalizing to the active energy as in (3).
Equation (3) represents the total energy of a circuit in terms of three factors: the technology, the control policy, and the application. The technology-defined parameters include $p$, $s$, and $E_{Sleep}$. Together, the control policy and application determine the numbers of active cycles ($n_A$), uncontrolled idle cycles ($n_{UI}$), and sleep cycles. The application determines the activity factor $\alpha$.
To give perspective to the magnitude of the technology variables, we calculate the values for the circuit characterization described in Section 2 from the data listed in Table 1. The relative magnitudes of these values mean that the leakage factor $p$ has the greatest impact. We note here that from a similar circuit characterization by Heo and Asanovic [10], we estimate from the data in the paper that their implementation of a Han-Carlson adder circuit in 70 nm technology has a leakage factor that is comparable to our result, with $p \ge 0.03$.
Analysis
The analytical model permits quick exploration of the parameter space to find interesting regions that might not be evident from simulating individual data points. We choose values for $s$ and $E_{Sleep}$ that are in agreement with the circuit measurements but somewhat pessimistic (higher). Specifically, we set $s = 0.001$ and $E_{Sleep} = 0.01$. We vary the leakage factor $p$ over the range $0 \le p \le 1$ to cover a broad range of technology points that include relatively extreme points in terms of the energy contribution caused by subthreshold leakage current. In some of the results we select specific values for $p$. In these cases, we restrict $p$ to be either 0.05 (motivated by the values calculated from the circuit characterization) or 0.50 (a convenient number to demonstrate contrasting behavior). These two technology points act as representatives for two distinct behavior regions that, as we shall see, require very different methods for reducing leakage energy. In the rest of this paper, we assume a fixed clock duty cycle of 50%.
Breakeven idle interval. The break-even idle interval is the length of time that a circuit must be idle in order that the energy saved in the sleep mode offsets the additional energy required for the transition. Let us parameterize $E_{total}$ from (3) by the cycle counts, the activity factor $\alpha$, and the leakage factor $p$, and calculate the break-even point for a single idle interval. Thus, the break-even interval is the idle interval $n_{idle}$ that satisfies the following relationship:
Modeling control strategies of the sleep mode. An advantage of a mathematical model is that it permits exploration of the parameter space before any simulations are run. For this section, we explore three basic methods for controlling the Sleep signal. These methods are distinguished by being easily modeled and defining the boundary cases of managing the sleep mode. The first method, AlwaysActive, represents the case of doing nothing other than clock gating. We never enable the sleep mode, so all idle cycles are uncontrolled idle cycles and the circuit expends greater leakage energy. The second method, MaxSleep, aggressively enables the sleep mode whenever the circuit has no useful calculation to perform in the following cycle. The MaxSleep method incurs the maximum energy transition overhead. The third method, NoOverhead, provides an upper bound on energy savings. This method is the same as MaxSleep but we omit the energy overhead for transitioning into the sleep mode. Thus, the NoOverhead strategy represents an unachievable lower bound on total energy and, therefore, is an upper bound on energy savings for any control method. Formally, the energies for each of the strategies are defined in (6)-(8).
$E_{max}$ of (9) is the maximum dynamic energy that the circuit can expend by performing a calculation on every cycle assuming an activity factor $\alpha$, and $N$ is the total number of cycles for the simulation. We normalize the graphs to $E_{max}$ as a useful baseline for the magnitude of the energy differences.
Here, we are exploring how the relative energy costs vary across the parameter space.
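To make the comparison of the three boundary policies concrete, the following Python sketch evaluates them on a list of idle-interval lengths. It is an illustrative model built from the prose descriptions only and should not be read as equations (3) and (6)-(9). The assumptions, all hypothetical, are: energies are in units of the maximum dynamic energy per cycle; a clock-gated cycle leaks `p * (alpha * s + 1 - alpha)`; a sleep cycle leaks `p * s`; during an active cycle the nodes are in the evaluated state for the `duty` fraction of the cycle and precharged (high leakage) otherwise; and each sleep entry costs `(1 - alpha) + e_sleep`.

```python
def policy_energies(idle_intervals, n_active, alpha, p,
                    s=0.001, e_sleep=0.01, duty=0.5):
    """Return the total energy of each policy, normalized to E_max = N * alpha."""
    if not 0.0 < alpha <= 1.0:
        raise ValueError("alpha must lie in (0, 1]")
    # Leakage per cycle, in units of the maximum dynamic energy per cycle.
    ui_leak = p * (alpha * s + (1.0 - alpha))     # clock-gated, not asleep
    sleep_leak = p * s                            # every node discharged
    active_leak = p * (duty * (alpha * s + 1.0 - alpha) + (1.0 - duty))
    transition = (1.0 - alpha) + e_sleep          # cost of one sleep entry

    n_idle = sum(idle_intervals)
    active = n_active * (alpha + active_leak)     # identical for all policies
    e_max = (n_active + n_idle) * alpha           # N cycles of computation

    no_overhead = active + n_idle * sleep_leak
    return {
        "AlwaysActive": (active + n_idle * ui_leak) / e_max,
        "MaxSleep": (no_overhead + len(idle_intervals) * transition) / e_max,
        "NoOverhead": no_overhead / e_max,
    }
```

For instance, `policy_energies([100] * 10, 1000, alpha=0.5, p=0.5)` ranks MaxSleep below AlwaysActive, whereas `policy_energies([1] * 500, 500, alpha=0.5, p=0.05)` ranks AlwaysActive below MaxSleep, which mirrors the qualitative dependence on idle-interval length and leakage factor.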
To limit the degrees of freedom, we link the four parameters through a usage factor and an idle interval, and consider the extremes of the usage factor and idle intervals (100 cycles happens to be a long idle interval). In Figure 4b, the lower grouping of three lines is for a 10% usage factor. The lowest energy line is the NoOverhead policy. The slope is relatively flat since 90% of the time the circuit is in the low leakage sleep mode. The AlwaysActive line shows a sharp rise as the leakage factor increases. The line for the MaxSleep policy runs parallel to that of the NoOverhead policy. The difference between the two is the energy overhead of transitioning into the sleep mode (Figure 4a).
The relative behavior of the policies at the 90% usage factor is similar but the differences are compressed. Since all three policies have identical energy in the active phase, which accounts for 90% of the time, differentiation between the policies can occur only in the remaining 10% of the cycles. Figure 4c is a similar plot with $n_{idle} = 100$ cycles.
With the larger idle interval the MaxSleep policy is nearly identical to the NoOverhead policy at the 10% usage level. The difference between Figures 4b and 4c is that in the latter figure the transition energy is amortized over 100 cycles as compared to only 10 cycles. The worst case at the 50% usage level is shown in Figure 4d where $n_{idle} = 1$ cycle, meaning the circuit alternates between one active and one sleep cycle to incur the maximum transition overhead.
The GradualSleep design
The brief exploration of the energy model space in Section 3.1 showed that the preferred policy for managing the sleep mode depends on parameters for the technology ($p$) and the control policy/application behavior (embodied by the usage factor and $n_{idle}$ in the discussion). The MaxSleep policy works well if the average idle interval $n_{idle}$ is longer than the break-even interval, but the AlwaysActive policy performs better when the idle interval is shorter than the break-even interval.
A policy that selects the minimum energy between the two options, $\min(E_{MaxSleep}, E_{AlwaysActive})$, is the best combination of the two policies.
Here we propose a method that is a hybrid of the MaxSleep and AlwaysActive control schemes. By dividing the circuit into slices and staggering the Sleep enable signal, we can incrementally place the circuit into the sleep mode and avoid the large initial energy dissipation that the MaxSleep policy incurs in the first idle cycle. This method also protects against the excessive static energy consumption that the AlwaysActive policy would incur in the event of a long idle interval. A block diagram of a circuit divided into four slices is shown in Figure 5a. The timing of the Sleep signal is shown in Figure 5b. The Sleep signal feeds one end of a shift register whose bits supply the Sleep signal to the different slices of the circuit. The AND gates ensure simultaneous re-activation of the circuit. All of the register bits are simultaneously cleared upon de-assertion of the Sleep signal. While any level of granularity can be used, we assume the number of slices matches the number of cycles in the break-even interval for the technology, so that in every idle cycle one additional slice of the circuit enters the sleep mode. Using fewer slices makes the GradualSleep curve more similar to the MaxSleep behavior. Adding more slices results in a shift towards the AlwaysActive behavior. We hide assertion and de-assertion of the Sleep signal behind the register read stage of the pipeline. The basic pipeline of the Alpha 21264 [15] is shown in Figure 6, as is a single, generic Sleep signal to one of the FUs. At the end of the issue stage the number of integer instructions to be executed is known and the appropriate FUs are activated before the instructions reach the execute stage. Since the Sleep signal is staged and is not along a critical path, the shift register and AND gates can be constructed from slower, high-$V_t$ transistors with very low subthreshold leakage current. We do not include the small additional dynamic energy in the analysis.
The energy costs of transitioning to the sleep mode for the three policies are compared in Figure 5c.
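The staggered Sleep signal can be sketched behaviorally in Python. The first function models the shift register and AND gates as described (one more slice asleep per idle cycle, all slices re-activated together); the second accumulates the energy of a single idle interval under the same simplified, normalized leakage assumptions used for illustration elsewhere (clock-gated leakage `p * (alpha * s + 1 - alpha)`, sleep leakage `p * s`, transition cost `(1 - alpha) + e_sleep`, shared equally among slices). The function names, parameters, and the energy accounting are hypothetical and are not the model behind Figure 5c.

```python
def slices_asleep(sleep_trace, slices):
    """Number of slices asleep in each cycle for a per-cycle Sleep signal."""
    register = [False] * slices
    asleep = []
    for sleep in sleep_trace:
        if sleep:
            register = [True] + register[:-1]   # shift one more bit in
        else:
            register = [False] * slices         # cleared on de-assertion
        # AND with Sleep: de-assertion wakes every slice simultaneously
        asleep.append(sum(sleep and bit for bit in register))
    return asleep


def gradual_idle_energy(n_idle, slices, alpha, p, s=0.001, e_sleep=0.01):
    """Normalized energy spent during one idle interval of n_idle cycles."""
    ui_leak = p * (alpha * s + (1.0 - alpha))
    sleep_leak = p * s
    transition = (1.0 - alpha) + e_sleep
    energy = 0.0
    previous = 0
    for count in slices_asleep([True] * n_idle, slices):
        newly_asleep = count - previous
        energy += (newly_asleep * transition
                   + count * sleep_leak
                   + (slices - count) * ui_leak) / slices
        previous = count
    return energy
```

With a single slice the sketch reduces to MaxSleep behavior, while for a one-cycle idle interval and four slices the energy falls between the AlwaysActive and MaxSleep costs, which is the blending effect the design aims for.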
Experimental methodology
We use the SimpleScalar simulator [5] to verify the preceding analysis in Section 3. The processor is modeled after the Alpha 21264 and the configuration parameters are given in Table 2. We have modified the simulator to have individual structures for the reorder buffer, integer queue, floating point queue, and load store queue as in the Alpha 21264. We restrict the study to the integer units since integer operations are generally the dominant type of instructions executed and, thus, these functional units are heavily utilized. The integer benchmarks are listed in Table 3.
The goal of this study is to explore the potential for improving energy efficiency with fine-grained control of static energy in large logic circuits. To ensure the results are not inflated by excess resources that can be trivially put to sleep, we limit the number of functional units. Our processor configuration supports up to four integer functional units. For each application, we determine the minimum number of functional units required to achieve at least 95% of the peak performance from using four functional units. Restricting the number of functional units makes it more difficult for a control method to successfully exploit the sleep mode and, thus, makes differences between control methods more meaningful. Implicit in this methodology is the assumption that some technique of profiling [19] or compiler analysis [18] can be used to identify a priori when functional units are not needed. Such an analysis could be used to signal the run-time system that some functional units are unnecessary and can be disabled at the start of an application. The second to last column in Table 3 shows the number of integer units used for each benchmark in all of the simulations. The fourth column lists the maximum IPC with four functional units, while the fifth column lists the achieved IPC for a given number of functional units. In the simulations, we allocate operations to the set of functional units in round robin fashion and record precise statistics on the idle times for each functional unit. From this data, we calculate the total energy used by each functional unit by summing the energies for active cycles, uncontrolled idle cycles, and sleep mode cycles as given in equation (3). The total energy of the integer unit is the sum of the energies of the individual functional units. Values of the equation parameters are listed in Table 4.
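The bookkeeping described above (round robin allocation followed by collection of per-unit idle intervals) can be sketched as follows. This Python fragment is a simplified stand-in for the simulator instrumentation, not the actual SimpleScalar modification: it assumes single-cycle operations and takes the number of integer operations issued per cycle as input. The function names are hypothetical.

```python
from itertools import groupby


def round_robin_busy_traces(ops_per_cycle, num_units):
    """Assign each cycle's operations to units in round robin order.

    Returns one list of booleans per unit (True means busy in that cycle).
    """
    traces = [[] for _ in range(num_units)]
    next_unit = 0
    for ops in ops_per_cycle:
        busy = [False] * num_units
        for _ in range(min(ops, num_units)):   # at most one op per unit per cycle
            busy[next_unit] = True
            next_unit = (next_unit + 1) % num_units
        for unit, trace in enumerate(traces):
            trace.append(busy[unit])
    return traces


def idle_intervals(busy_trace):
    """Lengths of the maximal runs of idle cycles in a busy trace."""
    return [len(list(run)) for busy, run in groupby(busy_trace) if not busy]
```

For example, the issue pattern `[1, 0, 0, 2, 1]` on two units leaves the first unit with idle intervals of 2 and 1 cycles and the second unit with a single 3-cycle idle interval; these lengths are the inputs a per-unit energy calculation would consume.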
We present results for three values of the activity factor, 
Results
The distribution of idle intervals across the benchmarks is plotted in Figure 7. The x-axis is a $\log_2$ scale in cycles of the length of the idle interval and the y-axis is the fraction of the total time that the ALUs are idle. The data for each of the functional units from the different applications are combined as fractions to give the data equal weight regardless of the instruction window size of the application. To improve readability, idle intervals longer than 8192 cycles have the total idle time accumulated at the 8192 cycle marker; hence the short but sharp step at the right of the graph. The graph shows that across the suite of benchmarks, any given integer ALU is idle 46.8% of the time when the L2 access latency is 12 cycles. Furthermore, nearly all of the idle intervals are shorter than 128 cycles and a large fraction, 75%, occur within the L2 access latency time. To highlight the influence that the L2 access latency has on the distribution, also plotted is the idle interval distribution using a 32 cycle L2 access latency. The increased overall idle time reflects the additional time to access the L2 cache. As demonstrated in Figure 7, extremely large idle intervals are rare and relatively short intervals are common.
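A distribution of this kind can be tabulated with a few lines of Python. The sketch below is an assumption about one reasonable binning, not the script used for Figure 7: each idle interval is assigned to the smallest power of two that is at least its length, intervals beyond the cap are accumulated at the cap, and each bucket reports idle cycles as a fraction of total run time. The function name and the `cap` parameter are hypothetical.

```python
from collections import Counter


def idle_time_by_bucket(intervals, total_cycles, cap=8192):
    """Fraction of total time spent idle, grouped into power-of-two buckets."""
    buckets = Counter()
    for length in intervals:
        if length < 1:
            raise ValueError("idle intervals must be at least one cycle long")
        # Smallest power of two >= length, saturating at the cap.
        bucket = min(1 << (length - 1).bit_length(), cap)
        buckets[bucket] += length          # weight by idle time, not by count
    return {bucket: cycles / total_cycles
            for bucket, cycles in sorted(buckets.items())}
```

Summing the returned fractions gives the overall idle fraction of the unit, and a running sum over the sorted buckets gives a cumulative view of how much idle time lies in intervals up to a given length.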
The relative energies of the three policies presented in Section 3 are compared in Figure 8. The points corresponding to Figures 8a and 8b are marked on the graph. As described before, when the leakage energy of the circuit is small the AlwaysActive policy outperforms the MaxSleep policy, but the reverse is true when the leakage energy becomes large. The GradualSleep design, however, behaves well across the complete technology range, and performs better near the break-even point for the distribution of idle intervals of the benchmarks. Thus, the ability to blend both policies has little negative impact and can actually improve the overall energy efficiency when the distribution of idle times centers around the break-even point. The GradualSleep design avoids the extreme behaviors of the other two policies.
The problem of leakage energy is often reported as the fraction of the total energy due to leakage. This view of the data is plotted in Figure 9b. At $p = 0.05$, the leakage energy is 13% of the total energy for the AlwaysActive policy, but increases to 60% at $p = 0.50$.
The results shown in Figure 9b are best appreciated in the context of the processor as a whole. Borkar [3] indicates that at 70 nm dimensions and beyond ($p = 0.05$, approximately), leakage will comprise 30% or more of the total power. Our results showing only 13% at $p = 0.05$ do not conflict with this conclusion for the following reasons. The primary factor producing the lower than projected fraction of leakage energy is our methodology of eliminating unnecessary functional units that would contribute significantly to leakage but not to dynamic energy. For example, in our simulations mcf utilizes only 31% of the two functional units and the fraction of leakage energy is 15%. The fraction increases to 25% for a microarchitecture with four functional units. Second, we do not include the non-integer functional units in our analysis because they are mostly idle in this benchmark suite (and, thus, trivially controlled). In the integer benchmarks, these non-integer functional units add disproportionately to the leakage portion of the total energy. This effect would further increase the overall fraction of leakage energy relative to the total energy. Also depicted in Figure 9b is the plot for the NoOverhead policy. This policy represents a lower bound on the fraction of static energy since all the idle cycles are at the lowest leakage state and there is no additional energy cost to transition to that state. Thus, for this policy, the static energy is almost entirely due to leakage during computation cycles. The active mode leakage energy is a significant fraction of the overall leakage energy, and becomes the dominant fraction as $p$ becomes larger. Circuit techniques are required to reduce this portion of the leakage energy.
Related work
Dual-$V_t$ domino logic circuits with a sleep mode have been proposed in [1, 10, 13, 16]. While all of these circuits limit leakage energy by forcing the dynamic nodes into the low leakage state, the overhead of this sleep mechanism varies. We selected the circuit from [16] because the technique has no delay penalty and a low energy overhead. However, the energy model parameters can be adjusted to reflect many other circuit techniques.
Heo and Asanovic [10] introduce the technique of controlling the sleep mode of dual-Vt circuits for fine-grained reduction of leakage energy. Their focus is on the circuit itself, and the work ends with an analysis of the breakeven interval for an adder. We extend this work by introducing an analytical energy model for a logic functional unit and by performing a detailed study of how to implement fine-grained control of the sleep mode in heavily used functional units of a microprocessor. Our results reveal the interdependencies among the circuit technology, the application, and the control strategy.
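The notion of a breakeven interval can be made concrete with a short sketch. It uses the generic definition (the shortest idle period for which the leakage saved by sleeping covers the cost of entering sleep) and is not the specific analysis of [10]. The function name `breakeven_interval` and all three parameters are hypothetical, with energies expressed in arbitrary but consistent per-cycle units.

```python
import math


def breakeven_interval(transition_energy, active_leak, sleep_leak):
    """Smallest whole number of idle cycles for which sleeping is no worse.

    Staying awake for n idle cycles costs n * active_leak.
    Sleeping costs transition_energy + n * sleep_leak.
    Sleeping wins once n >= transition_energy / (active_leak - sleep_leak).
    """
    saving_per_cycle = active_leak - sleep_leak
    if saving_per_cycle <= 0:
        raise ValueError("sleep state must leak less than the active state")
    return math.ceil(transition_energy / saving_per_cycle)
```

For example, under these assumptions `breakeven_interval(1.0, 0.25, 0.0)` evaluates to 4: an idle period shorter than four cycles would cost more energy asleep than awake.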
Butts and Sohi [6] introduce a static energy model for estimating static power consumption early in the design process at the architectural level. This static energy model can be parameterized to provide steady-state estimates of various types of circuits, e.g., RAM cells, CAM cells, and logic gates. To relate this work to our own, the Butts and Sohi model is appropriate for estimating the technology parameters of our model, including the leakage factor p. In contrast, our model is specialized for logic but estimates the total energy of the functional units, both dynamic and static, based on the behavior of the application. The ability to consider the dynamic behavior of a circuit is essential in analyzing the tradeoffs between schemes that manage the sleep mode of the circuit.
Rele et al. [18] use the compiler to identify when functional units will be idle for long periods of time and can be power gated, thus reducing the static power. The basis of our study presumes a technique such as [18] has already been applied. By limiting the number of functional units, our study explores how to manage resources that are critical to performance and, consequently, have short idle times.
Both Brooks and Martonosi [4] and Ghose et al. [8] demonstrate that many operands do not require the full width of the datapath. To save dynamic energy, datapath hardware detects the unneeded bytes and gates the logic to avoid performing unnecessary work. In the context of this paper, this phenomenon could be exploited in the GradualSleep policy by placing the high order bytes to sleep first and, upon re-activation, waking only those bytes that are also enabled by the datapath hardware.
Pyreddy and Tyson [17] use dual speed pipelines to save dynamic energy by scheduling non-critical instructions on the slow pipeline. A slow pipeline could have a higher threshold voltage and lower leakage current. Off-loading the non-critical instructions from the fast pipeline will increase the average idle duration in the fast pipeline. This strategy may offer additional opportunities to enable the sleep mode of the fast pipeline.
At the architectural level, the study of leakage reduction has centered on the storage structures in the microprocessor. Yang et al. [20] gate the power supply voltage to the L1 instruction cache RAM cells to turn off power to the storage cells and essentially eliminate the leakage energy. The state of the cell is lost. Kaxiras et al. [14] present a control scheme that dynamically adjusts when to place the cache lines into the sleep mode to minimize leakage energy. Flautner et al. [7] propose a drowsy cache design for the L1 data cache that maintains the cell state in the sleep mode but at the cost of higher leakage energy than if power to the cell were completely turned off. Their study concluded that a simple control scheme sufficed to achieve most of the energy savings. Hanson et al. [9] compare these two techniques and a third method in an extensive study that includes the L1 instruction cache, the L1 data cache, and the L2 unified cache. Heo et al. [11] take a novel approach to reduce the static energy associated with the bitlines in a RAM by simply tristating the drivers to the lines. The floating bitlines settle naturally at the voltage level that minimizes the leakage energy.
Conclusion
In this study, we evaluate the circuit technology of dual-Vt domino logic along with the sleep mode as a promising technique for reducing subthreshold leakage energy at a fine-grained time scale, from one to a few hundred cycles. Taking the energy cost of entering the low-leakage sleep state into account, we introduce an analytical energy model to characterize the energy behavior of functional logic units at the architectural level. We use this model to characterize the interaction of the application with the technology as well as evaluate the effects on performance and energy of a set of integer benchmarks as technology parameters are varied. We show that the simple GradualSleep design works well across a range of technology and application parameters by amortizing the energy cost of entering the sleep mode across several cycles. Our results indicate that a more complex control strategy to determine when to enter the sleep state may not be warranted and that the leakage energy lost during the active cycles of the functional units may eventually become the dominant component of the overall leakage energy.
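The amortization argument can be illustrated with a small sketch that compares the energy spent over a single idle interval under three policies. The sketch rests on assumptions not stated in this section: the functional unit is divided into equal slices, a gradual policy puts one additional slice to sleep in each consecutive idle cycle, the transition cost is split evenly among the slices, and wake-up cost is ignored. The function names and parameter values are hypothetical, and energies are normalized per cycle.

```python
def idle_energy_always_active(n, p):
    """Unit never sleeps: it leaks p in each of the n idle cycles."""
    return n * p


def idle_energy_immediate_sleep(n, sleep_leak, transition):
    """Whole unit sleeps in the first idle cycle, paying the full transition cost."""
    if n == 0:
        return 0.0
    return transition + n * sleep_leak


def idle_energy_gradual_sleep(n, p, sleep_leak, transition, slices):
    """One more slice is put to sleep in each idle cycle (assumed schedule)."""
    energy = 0.0
    for cycle in range(1, n + 1):
        asleep = min(cycle, slices)
        if cycle <= slices:
            # Pay the transition cost of the slice put to sleep this cycle.
            energy += transition / slices
        # Sleeping slices leak sleep_leak; the remaining slices still leak p.
        energy += (asleep * sleep_leak + (slices - asleep) * p) / slices
    return energy
```

With, say, p = 0.05, zero sleep leakage, a transition cost of 1.0, and four slices, a one-cycle idle interval costs far less under the gradual schedule than under immediate sleep, while a 100-cycle interval costs only slightly more than immediate sleep and much less than staying awake. With a single slice, the gradual schedule reduces to immediate sleep.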
