This paper explores the effectiveness of pipelining as a power-saving tool, where the reduction in logic depth per stage is used to reduce supply voltage at a fixed clock frequency. We examine power-optimal pipelining in deep submicron technology, both analytically and by simulation. Simulation uses a 70 nm predictive process with a fanout-of-four (FO4) inverter chain model including input/output flip-flops, and results are shown to match theory well. The simulation results show that the power-optimal logic depth is 6 to 8 FO4 and the optimal power saving varies from 55 to 80% compared to a 24 FO4 logic depth, depending on threshold voltage, activity factor, and presence of clock-gating.
INTRODUCTION
Pipelining reduces the number of logic levels between registers and is usually employed by digital systems designers to increase achievable clock frequency. But the time slack obtained from pipelining can also be used to reduce power consumption by lowering supply voltage at a fixed clock frequency. This technique can be very effective for digital systems with fixed throughput requirements and highly parallel computations. Supply voltage scaling is one of the most effective techniques for trading time slack for power. Supply voltage reduction leads to a quadratic reduction in active power and also a super-linear reduction in leakage power, as leakage current has a strong dependency on drain voltage in deep submicron processes. A parallel architecture could also be used to provide excess performance to trade for lower power, but pipelining has the advantage of a lower area penalty. Power reductions from pipelining are eventually limited by the power overhead of the additional pipeline latches or flip-flops required for each additional pipe stage, leading to a power-optimal level of pipelining.
In this paper, we show how power-optimal pipelining varies for different operating regimes in deep submicron technology. We examine the tradeoffs between pipeline depth, supply voltage, threshold voltage, and total power using circuit-level simulations and analytical models. We also explore the effect of activity factor and clock gating.
RELATED WORK
The trend towards deeper pipelines in microprocessors is clearly seen in the evolution of the Intel x86 family, with a factor of 7 reduction in logic depth per stage over the last decade [9]. This reduction in logic depth has combined with improvements in transistor speed from technology scaling to yield an even larger increase in processor clock frequency. Increasing the number of pipeline stages for an operation increases its latency in clock cycles, which in turn increases the number of pipeline stalls experienced by dependent operations. The resulting reduction in instructions completed per cycle (IPC) reduces the performance advantage from greater clock frequency, with greater impact on codes with lower instruction-level parallelism (ILP).
Processor architects have explored this tradeoff between increased clock frequency and reduced IPC to determine performance-optimal pipelining depth. Early work by Kunkel and Smith [10] considered pipelining in vector supercomputers and found that 8-10 ECL gate levels was performance-optimal for scalar code, and as little as 4 gate levels for more parallel vector code. Recently, several authors have investigated the performance-optimal pipeline depth for superscalar microprocessors [5, 9, 11], with a consensus in the range of 8-11 FO4 delays for SPEC integer codes and around 6 FO4 delays for SPEC floating-point codes, which generally have higher ILP. These performance-optimal numbers ignore power as well as the design and verification complexity that would accompany such high-frequency designs (roughly twice the clock rate of existing systems [11]).
Several authors have extended superscalar performance models with power models that include the power overhead of additional pipeline latches [12, 6]. Srinivasan et al. [12] found that power-performance optimal logic depth increases to about 18 FO4 for SPEC benchmarks and around 24-28 FO4 for TPC-C, a commercial application. Hartstein and Puzak [6] found that 22.5 FO4 is the power-performance optimum according to their power-performance metric. They also found that clock gating pushes the optimum back to deeper pipelines [6], which agrees with our results. This previous work focuses on processor performance, where limited instruction parallelism reduces the benefits of deep pipelines, and these studies limit power optimization to the selection of the correct number of additional pipeline stages. Other types of digital systems, including digital signal processors, network processors, and graphics engines, have much greater levels of parallelism and often have fixed throughput requirements. For these systems, pipelining can be used together with voltage and threshold scaling to reduce total energy consumption while maintaining a fixed clock rate. The use of pipelining for power reduction was proposed by Chandrakasan et al. [2] but without an attempt to determine the power-optimal pipelining strategy.
METHODOLOGY
In this paper, our main target is a logic-dominant pipeline stage. We make several simplifying assumptions in our analysis. We are interested in fixed-throughput designs for highly parallel computations and so do not include any performance loss from an increased frequency of pipeline stalls as pipeline depths increase. Global wire delay does not scale as fast as gate delay as feature size is reduced, and some modern microprocessors have so-called drive stages which include only wires and repeaters [8]. We leave wire-dominant pipeline stages for future work but note that wire RC delay becomes relatively less important as supply voltage is scaled down in a fixed technology, because wire resistance remains constant while effective transistor resistance increases. We do not include local wire capacitance due to the absence of detailed circuit layouts, but note that wire capacitance can be an important component of total load in deep submicron technology even for a logic-dominant stage.
Figure 1 shows the baseline pipeline stage model assumed in our study. To model a well-designed path in a circuit, we use a simple static inverter chain with each inverter driving four copies of itself to yield an FO4 load. We use 24 FO4 delays as a baseline clock period, representing a current high-performance processor circuit (the high-frequency Pentium-4 has a 20 FO4 cycle time [11], and most other designs have somewhat shallower pipelines).
Even though different circuit styles and logic gates might lead to different power-optimal pipelining results, we assume that our FO4 inverter chain model is fairly representative and that insights gathered from our simulation results can be applied to other cases. Flip-flops were chosen as the timing elements rather than latches due to their simplicity of usage, and the PowerPC transmission-gate flip-flop was chosen because it is a popular choice due to its robustness and energy-efficiency [7]. While the transistor sizes of inverters and flip-flops were fixed, the sizes of clock buffers were varied to ensure the appropriate clock rise and fall times when varying the depth of pipelining.
We used the BPTM 70 nm transistor models with different threshold voltages [4] and HSPICE for circuit simulation. Throughout the paper, clock frequency was fixed at 2 GHz and temperature was constant at 100 °C. We only considered subthreshold leakage; although gate leakage might become significant at some point in these technology generations, it is also likely that new gate dielectrics will make gate leakage insignificant again.
PIPELINING AND SUPPLY VOLTAGE
We begin by showing the effect of pipeline depth on supply voltage. With delay fixed, supply voltage scales down as pipelining deepens because the amount of logic per pipeline stage decreases. Synchronous circuit delay is approximately given by
where $N$ is the logic depth per pipeline stage in terms of FO4 delay (or the number of FO4 inverters per pipeline stage), the timing-element delay is normalized by FO4 delay, $\alpha$ is a velocity saturation effect factor, and $V_{dd}$ and $V_{th}$ are the supply and threshold voltages respectively. Assuming $\alpha$ is 2 (the actual value of $\alpha$ in deep submicron technology is close to 1.5 due to the short-channel effect),
Now assuming the residual term involving $V_{dd}$ is close to zero, we get a simple linear equation between $V_{dd}$ and $N$, where $k_0$ is a constant:
Figure 2 shows the simulated supply voltages when varying the number of FO4 inverters per stage for different threshold voltages. LVT, MVT, and HVT represent low, medium, and high threshold voltages respectively, and their values are shown in Table 1. A low threshold voltage results in a low supply voltage for the same delay. $k_0$ and $k_1$ were calculated using the least-squares method and are shown in Table 1. We can see that $k_0$ is proportional to $V_{th}$ as well as $k_1$ (our simplified equations fail to explain the effect).
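The relationship between logic depth and supply voltage described above can be explored numerically. The following is a minimal sketch, assuming an alpha-power-law delay of the form (N + TAU) * Vdd / (Vdd - Vth)^ALPHA, which is consistent with the quantities named in the text but is not confirmed as the paper's exact expression. The names `TAU`, `ALPHA`, `VDD_REF` and all numeric values, including the three threshold voltages, are hypothetical and do not come from Table 1.

```python
# Sketch: supply voltage needed at each logic depth for a fixed cycle time,
# followed by a least-squares line fit. All constants are hypothetical.

ALPHA = 1.5     # velocity-saturation exponent (assumed)
TAU = 3.0       # timing-element delay in FO4 units (assumed)
N_REF = 24      # baseline logic depth per stage, in FO4
VDD_REF = 1.0   # supply voltage at the baseline depth, in volts (assumed)


def stage_delay(n, vdd, vth):
    """Alpha-power-law stage delay, up to a constant factor."""
    return (n + TAU) * vdd / (vdd - vth) ** ALPHA


def vdd_for_depth(n, vth):
    """Bisect for the supply voltage matching the baseline cycle time.

    Valid for n <= N_REF, where the answer lies between vth and VDD_REF.
    """
    target = stage_delay(N_REF, VDD_REF, vth)
    lo, hi = vth + 1e-6, VDD_REF  # delay falls monotonically as vdd rises
    for _ in range(60):
        mid = 0.5 * (lo + hi)
        if stage_delay(n, mid, vth) > target:
            lo = mid  # too slow: raise the voltage
        else:
            hi = mid  # fast enough: try a lower voltage
    return hi


def fit_line(xs, ys):
    """Ordinary least-squares fit ys ~ k0 * xs + k1; returns (k0, k1)."""
    count = len(xs)
    mean_x = sum(xs) / count
    mean_y = sum(ys) / count
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    k0 = sxy / sxx
    k1 = mean_y - k0 * mean_x
    return k0, k1


if __name__ == "__main__":
    depths = list(range(4, 25, 2))
    for vth in (0.15, 0.20, 0.25):  # hypothetical low/medium/high thresholds
        volts = [vdd_for_depth(n, vth) for n in depths]
        k0, k1 = fit_line(depths, volts)
        print(f"Vth={vth:.2f} V: k0={k0:.4f} V/FO4, k1={k1:.3f} V")
```

Solving for the voltage numerically avoids committing to a particular simplification of the delay model; the fitted slope and intercept play the role of the constants $k_0$ and $k_1$ in the text.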
PIPELINING POWER COMPONENTS
In this section, we explore the impact of pipelining on the components of total power consumption when delay is fixed. We use the supply voltage scaling results shown above in Section 4 and investigate switching, leakage, and idle components of power consumption assuming no clock-gating mechanism.
Pipelining and Switching Power
Switching power remains the dominant component of total power consumption when the activity factor is high, even in leaky deep submicron technology. Switching power is the power consumed while charging and discharging load capacitances. The load capacitances include transistor parasitic and wire capacitances. Because we assume our pipeline stage is logic-dominant, wire capacitances are not included in our simulation.
The switching power of a pipelined logic stage can be divided between that due to logic gates and that due to timing elements, and can be modeled as:
The overhead includes clock and switching power of timing elements and it is inversely proportional to $N$, the number of logic gates per stage. We assume that the number of latches increases linearly with the number of pipeline stages. In line with Table 1, a higher $V_{th}$ process gets less switching power saving from pipelining.
The optimal logic depth $N^*$ is given by:
The equation indicates that the capacitance ratio of a timing element and an FO4 inverter is one of the factors that set the optimal logic depth.
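One way to see where a switching-power optimum comes from is the following worked sketch. It assumes the linear supply-voltage relation $V_{dd} = k_0 N + k_1$ and a switching power per unit of logic proportional to $(1 + c/N)\,V_{dd}^2$ at fixed clock frequency. The symbol $c$ (switched capacitance of a timing element, including its clock load, relative to one FO4 inverter) is an assumed name, and the result is not claimed to be the paper's own expression.

```latex
\begin{align*}
P_{sw}(N) &\propto \left(1 + \frac{c}{N}\right) V_{dd}^2,
  \qquad V_{dd} = k_0 N + k_1 \\
\frac{d \ln P_{sw}}{dN} &= -\frac{c}{N (N + c)} + \frac{2 k_0}{k_0 N + k_1} = 0 \\
&\Rightarrow\; 2 k_0 N^2 + c\,k_0 N - c\,k_1 = 0 \\
N^* &= \frac{-c + \sqrt{c^2 + 8\,c\,k_1 / k_0}}{4}
\end{align*}
```

In this sketch the optimum depends only on the overhead ratio $c$ and on $k_1 / k_0$. With the purely illustrative values $c = 4$ and $k_1 / k_0 = 24$, it gives $N^* = 6$.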
Pipelining and Leakage Power
The rapid reduction in gate length and accompanying downscaling of threshold voltages over the last few process generations has led to an exponential growth in leakage power. Within a few process generations, it is predicted that power dissipation from static leakage current could be comparable to dynamic switching power [1, 3].
The leakage power of our pipelined circuit can be given by the following equations:
where $n v_T$ is a constant representing the leakage current slope, the drain-voltage dependence is captured by a Drain-Induced Barrier Lowering (DIBL) coefficient, and a ratio parameter relates the leakage power of one FO4 inverter to that of one timing element. As in the switching power model, the leakage power in a stage can be divided into logic gate leakage and timing element leakage, with timing element leakage inversely proportional to $N$.
When $N$ is much greater than the timing-element overhead term, logic gate leakage dominates. The exponential term in the model represents the dependence of leakage current on the drain voltage (from DIBL). In modern deep submicron technology, for an appropriate supply voltage range, this term is larger than $O(1)$ but smaller than $O(N)$; therefore leakage power is reduced in a super-linear fashion as $N$ decreases, though by less than the quadratic reduction for switching power. The exponential term also scales down faster as $N$ decreases when $k_0$ is larger. A higher $V_{th}$ process has a higher $k_0$, as shown in Table 1, and so a higher $V_{th}$ process is expected to see a greater leakage power saving from pipelining. This is the opposite of the switching power case, although higher $V_{th}$ processes have less absolute leakage to begin with.
Figure 4 shows the simulated leakage power while varying the number of logic gates per stage. The optimal logic depth $N^*$ was around six and the optimal power saving was around 70-75%. The graphs show that lower threshold voltages give less leakage power saving and a slightly greater optimal logic depth.
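The leakage description above matches the standard subthreshold current expression with a DIBL term. The following is a hedged sketch rather than the paper's own equations: $I_0$, $\eta$ (the DIBL coefficient) and $c_{lk}$ (leakage of one timing element relative to one FO4 inverter, the inverse of the ratio named in the text) are assumed names, and the gate-source voltage of an off transistor is taken as zero.

```latex
\begin{align*}
I_{leak} &= I_0 \exp\!\left(\frac{-V_{th} + \eta V_{dd}}{n v_T}\right) \\
P_{leak}(N) &\propto \left(1 + \frac{c_{lk}}{N}\right) V_{dd}\, I_{leak}
  = \left(1 + \frac{c_{lk}}{N}\right) I_0\, V_{dd}
    \exp\!\left(\frac{-V_{th}}{n v_T}\right)
    \exp\!\left(\frac{\eta V_{dd}}{n v_T}\right) \\
P_{leak}(N) &\approx I_0\, V_{dd}
    \exp\!\left(\frac{-V_{th}}{n v_T}\right)
    \exp\!\left(\frac{\eta V_{dd}}{n v_T}\right)
  \quad \text{for } N \gg c_{lk}
\end{align*}
```

Substituting a linear relation $V_{dd} = k_0 N + k_1$ turns the last factor into $\exp\!\left(\eta (k_0 N + k_1) / (n v_T)\right)$, which shrinks faster with decreasing $N$ when $k_0$ is larger.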
Idle Power without Clock-Gating
Clock-gating is a popular switching power reduction technique which disables the clock signal to the timing elements of a circuit block when that block is idle. But clock gating is not always possible due to the increased control complexity or the insufficient setup time of the clock enable signal. This section focuses on the impact of pipelining on an idle pipeline stage without clock-gating. The following section discusses the effects of clock-gating.
The following equations model idle power with no clock-gating mechanism as simply the sum of the switching power of the timing elements and the total leakage power. Because of the exponential dependency of leakage current on $V_{th}$, as represented in the term containing $V_{th} / (n v_T)$, $P_{idle}$ approximately follows the switching power of the timing elements when $V_{th}$ is high and follows the total leakage power when $V_{th}$ is low.
Figure 5 shows the simulated idle power without clock-gating, varying the number of FO4 inverters per pipeline stage. The optimal logic depth $N^*$ was 8, which is greater than the optimal logic depths for switching and leakage power. Also, the optimal power saving was smaller (50 to 70%) compared to the switching and leakage power cases. For idle stages, the overhead of timing elements is more significant than for active stages. The graphs show that lower threshold voltages give more idle power saving and a slightly lower optimal logic depth.
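The idle-power model described above (timing-element switching power plus total leakage power) can be sketched as follows. This is an illustration under stated assumptions, not the paper's simulation: the linear supply-voltage fit, the subthreshold leakage form with a DIBL term, the relative scale between switching and leakage, and every constant below are hypothetical.

```python
import math

# Hypothetical constants for illustration only (not taken from the paper).
K0, K1 = 0.03, 0.28   # assumed linear fit V_dd = K0*N + K1, in volts
C_TE = 4.0            # timing-element switched capacitance / FO4 inverter
L_TE = 2.0            # timing-element leakage / FO4 inverter leakage
ETA = 0.1             # DIBL coefficient
N_VT = 0.04           # n * v_T, in volts
VTH = 0.2             # threshold voltage, in volts


def vdd(n):
    """Supply voltage at logic depth n under the assumed linear fit."""
    return K0 * n + K1


def te_switching(n):
    """Switching power of ungated timing elements per unit of logic."""
    return (C_TE / n) * vdd(n) ** 2


def leakage(n):
    """Subthreshold leakage of logic plus timing elements per unit of logic."""
    v = vdd(n)
    return (1 + L_TE / n) * v * math.exp((ETA * v - VTH) / N_VT)


def idle_power(n):
    """Idle power without clock gating: timing-element switching + leakage."""
    return te_switching(n) + leakage(n)


def best_depth(power, depths=range(2, 25)):
    """Integer logic depth in `depths` that minimizes `power`."""
    return min(depths, key=power)


if __name__ == "__main__":
    components = [
        ("timing-element switching", te_switching),
        ("leakage", leakage),
        ("idle", idle_power),
    ]
    for name, fn in components:
        n_opt = best_depth(fn)
        saving = 1 - fn(n_opt) / fn(24)
        print(f"{name}: optimum at {n_opt} FO4, saving {saving:.0%} vs 24 FO4")
```

Raising or lowering `VTH` changes which component dominates `idle_power`, mirroring the high-threshold and low-threshold regimes discussed in the text.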
RESULTS
In this section, we combine the results for the individual power components to calculate optimal logic depths and optimal power savings for different operating regimes including threshold voltage, activity factor, and presence of clock-gating. Power-optimal pipelining varies depending on activity factor and $V_{th}$ because these change the proportion of switching power and leakage power (or idle power with no clock-gating mechanism), and each impacts pipelining power differently as seen in Section 5. This section is divided into two parts: the first part details the case when there is a clock-gating mechanism for pipeline stages and the second part considers the case without clock-gating.
Case 1: Clock-Gating Present
Figure 6 shows the simulated total power when a clock-gating mechanism is present for different activity factors. With a low activity factor, total power curves follow leakage power curves and high $V_{th}$ leads to more power saving by pipelining. As the activity factor increases, total power curves follow switching power curves and high $V_{th}$ leads to less power saving by pipelining. Figure 7 shows the simulated optimal total power saving when a clock-gating mechanism is present. With zero activity factor, optimal power savings compared to a 24 FO4 design vary from 70 to 75% depending on $V_{th}$. Since switching power savings from pipelining are less dependent upon $V_{th}$, optimal power savings reach around 80% regardless of $V_{th}$ as activity factor increases.
Because both switching power and leakage power are minimized when $N$ is 6, as seen in Section 5.1 and Section 5.2, the optimal logic depth was found to be 6 regardless of activity factor or threshold voltage when a clock-gating mechanism is present. However, as seen in Figure 3 and Figure 4, both switching and leakage power curves are quite flat around the optimum, and power saving by pipelining is quite insensitive to modest deviations from the optimum. Therefore, 8 FO4 delays per stage might be a better choice since it reduces design complexity with a small loss of power saving.
Case 2: No Clock-Gating Present
Figure 8 shows the simulated total power without clock-gating for different activity factors. With a low activity factor, total power curves follow idle power curves and low $V_{th}$ leads to more power saving (Section 5.3). As the activity factor increases, total power curves follow switching power curves. Figure 9 shows the simulated optimal total power saving when there is no clock-gating mechanism. With zero activity factor, optimal power savings are around 5 to 15% less than in the clock-gated case because of the timing element switching power overhead, which is not present when there is a clock-gating scheme. Optimal power savings approach 80% more slowly as activity factor increases than in the clock-gated case. Low $V_{th}$ gets the most power saving regardless of activity factor. Figure 10 shows the optimal logic depths when the clock is not gated for different threshold voltages. Because the idle power is minimized when $N$ is 8 (Section 5.3), the optimal logic depth remains at 8 until the activity factor reaches around 0.2 (0.3 at high $V_{th}$), after which it falls to 6.
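The interaction between activity factor and clock gating can be sketched with a simple total-power model. It assumes the activity factor is the fraction of cycles in which a stage is active, that logic switches only when active, that timing elements switch only when active if clock-gated and on every cycle otherwise, and that leakage is always present. These assumptions, the model forms, and all constants are hypothetical and are not taken from the paper's simulations.

```python
import math

# Hypothetical constants for illustration only (not taken from the paper).
K0, K1 = 0.03, 0.28      # assumed linear fit V_dd = K0*N + K1, in volts
C_TE, L_TE = 4.0, 2.0    # timing-element capacitance and leakage ratios
ETA, N_VT, VTH = 0.1, 0.04, 0.2  # DIBL coefficient, n*v_T (V), threshold (V)


def total_power(n, activity, clock_gated):
    """Total power per unit of logic at depth n (arbitrary units)."""
    v = K0 * n + K1
    logic_sw = activity * v ** 2
    # Gated timing elements toggle only in active cycles; ungated always.
    te_duty = activity if clock_gated else 1.0
    te_sw = te_duty * (C_TE / n) * v ** 2
    leak = (1 + L_TE / n) * v * math.exp((ETA * v - VTH) / N_VT)
    return logic_sw + te_sw + leak


def optimum(activity, clock_gated, depths=range(2, 25)):
    """Return (best depth, fractional saving versus a 24 FO4 design)."""
    best = min(depths, key=lambda n: total_power(n, activity, clock_gated))
    baseline = total_power(24, activity, clock_gated)
    saving = 1 - total_power(best, activity, clock_gated) / baseline
    return best, saving


if __name__ == "__main__":
    for gated in (True, False):
        label = "clock-gated" if gated else "no clock gating"
        for activity in (0.0, 0.1, 0.3, 1.0):
            depth, saving = optimum(activity, gated)
            print(f"{label}, activity {activity:.1f}: "
                  f"{depth} FO4, saving {saving:.0%}")
```

At an activity factor of 1.0 the two cases coincide in this sketch, since the timing elements toggle every cycle either way; the difference between them grows as the activity factor falls.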

DISCUSSION
Our study has a number of limitations. The number of latches was assumed to grow linearly with the number of pipeline stages, whereas previous authors have used a superlinear latch-count scaling formula, a power law with an exponent greater than 1 [12, 6]. It is not clear how latch counts scale in highly parallel architectures, but larger values of the exponent would increase the optimal logic depth.
Depending on the computation being parallelized, additional state in the form of larger memory arrays might be required to track the increased number of operations in flight. A growth in the size of these memory structures would tend to increase energy per operation and hence increase optimal logic depth per stage, though we expect this effect to be minor as memories are generally lower power than processing units.
Our study did not include the effects of glitching on power. Others have noted that glitching activity reduces linearly with pipeline depth as it becomes less likely that inputs to a gate would have very different path lengths [12] . This effect would tend to push the optimum towards shallower pipeline stages.
We did not include parasitic wire capacitance. Adding wire load capacitance to our model will increase total switching power, and so will again push the optimum towards shallower pipeline stages.
For deeply pipelined circuits, fast path problems are more likely, as there will be an increase in the number of short logic paths between timing elements and a decrease in the relative wire delay. Because clock frequency is not increased, clock skew and jitter problems are not as apparent as in a frequency-scaled design, but clock jitter might increase as the power supply to the clock drivers is reduced.
One benefit of supply scale-down is that wire delay becomes relatively less significant as gates slow down. This helps reduce some of the design effort of building a highly pipelined circuit compared with pipelining for increased clock frequency.
CONCLUSIONS
Pipelining can be an effective power-reduction tool when used to support voltage scaling in digital systems implementing highly parallel computations. Simulation results show that power-optimal logic depth is 6 to 8 FO4 and optimal power saving varies from 55 to 80% compared with a 24 FO4 design depending on threshold voltage, activity factor, and the presence of clock-gating.
Even though the exact power-optimal pipelining is technology-dependent, we can gain some important insights from the simulation results. First, higher activity factors decrease the power-optimal logic depth and increase the optimal power saving because pipelining is most effective at saving the additional switching power. Second, pipelining is more effective with lower threshold voltages, resulting in lower logic depths and lowest power, except for low activity factors when leakage power is dominant. Third, clock-gating enables deeper pipelining and more power saving because it reduces timing element overhead when activity factor is low.
Therefore, power-optimal pipelining with clock gating should be an efficient low-power technique for high throughput blocks in systems implementing highly parallel computations.
