Abstract
Introduction
Decreasing CMOS transistor feature sizes have enabled higher processing speeds and more components on chip. However, this comes at the expense of increased static power dissipation in the form of transistor leakage current. For next-generation technology, static power accounts for up to 40% of the total power consumed [11], and this percentage is projected to increase exponentially as feature sizes continue to decrease [12]. The static power consumption of functional units (FUs) at different technology parameters is shown in Table 1, taken from [6].
Detailed microarchitecture simulation shows that average functional unit utilization rates are typically low, with some periods of execution during which certain FUs are idle [10]. Our work takes advantage of these idle periods by turning the FUs OFF (power gating), thereby reducing the total energy consumed via a decrease in static leakage energy. To maximize the energy savings, if the FU idle period is short, such that the overhead energy required to power cycle the unit is greater than the energy consumed by leaving it ON during this period, then the unit is left ON. We use an annotated Control Flow Graph (CFG) extracted from an application to detect idle periods. These idle periods are then translated into compiler instructions that turn FUs OFF/ON while hardware enables the actual OFF/ON operation. Note that although Energy = Power * Time, we use the terms power and energy interchangeably to suit the context.

Table 1. FU Static Power Dissipation
Related Work
Approaches for reducing the dynamic power dissipated by functional units during idle periods using clock gating are described in [4, 7, 13]. Techniques that reduce FU static power using power gating include [3, 9, 10, 5]. The work most similar to ours is presented in [10], where FU static power dissipation is optimized by power gating these units during long idle periods. Their method detects these periods and generates compiler instructions to turn FUs OFF/ON. However, the long FU idle periods are detected based on utilization rates within a basic block, and the overhead energy required for FU power cycling is not considered. In our approach, we quantify the energy consumed by the FUs and the overhead energy for power cycling and use this information to detect FU idle periods that drive FU shutdown.
Implementation
Our method generates a Control Flow Graph (CFG) from the application execution that is subsequently annotated with initial FU requirements based on the instructions within each program basic block (BB). Using this annotated CFG, we analyze the tradeoff between leaving FUs ON and turning FUs OFF when not utilized by consecutive basic blocks. Through this analysis, idle periods are detected and FU requirements are optimized for minimum energy consumption. The optimized functional unit requirements are then translated into compiler-generated instructions that are used during program execution to physically switch the FUs OFF/ON. We describe the generation and annotation of the CFG, the energy estimation, and the FU requirement optimization algorithm in the following sections.
Control Flow Graph Generation
The control flow graph (CFG) is generated by dynamically profiling an application during which each branch instruction and its target address(es) is identified and represented as a node in the CFG. Although we generate transition counts between nodes (shown on arcs), we use this information only to guide CFG optimization. Next, we annotate each node of the CFG with initial functional unit requirements based on the number and type of instructions in each basic block of the static code.
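The profiling step above can be illustrated with a minimal Python sketch. It assumes the profiler has already reduced execution to a trace of basic-block identifiers; the function name `build_cfg` and the trace itself are hypothetical and are not taken from the paper's tooling.

```python
from collections import Counter

def build_cfg(block_trace):
    """Return (nodes, arcs) for a trace of executed basic-block IDs.

    nodes: the set of basic blocks seen during profiling.
    arcs:  Counter mapping (source, target) to its transition count.
    """
    nodes = set(block_trace)
    # Each consecutive pair in the trace is one control-flow transition.
    arcs = Counter(zip(block_trace, block_trace[1:]))
    return nodes, arcs

# Hypothetical trace: BB2 -> BB3 is taken twice because of a loop.
trace = ["BB1", "BB2", "BB3", "BB2", "BB3", "BB4"]
nodes, arcs = build_cfg(trace)
```

The transition counts in `arcs` correspond to the values shown on the CFG arcs.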
Due to dependences between basic block instructions, an accurate determination of FU requirements may only be done through dynamic analysis. Therefore, we estimate basic block FU requirements based on a static read-after-write (RAW) dependence analysis of instructions. The instruction dependences are represented in a tree structure, where nodes represent BB instructions and edges between nodes indicate a dependence. Nodes that are at the same level in the tree contain independent instructions. The level or depth of the tree that contains the maximum number of instructions is used to estimate the final FU requirement. Consider the code sequence for a basic block and its dependence tree shown in Figure 1. Here, instructions 3 and 6 are dependent on instruction 1; instructions 3, 4, and 5 are independent. Therefore, the INT Add unit requirement for this block is 3, which corresponds to the number of instructions in Level 2. If the maximum number of independent instructions of a particular type exceeds the number of FUs of that type, the FU requirement is set to the defined number of FUs. Figure 2 shows a CFG where each node is annotated with its functional unit requirements.
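The level-based estimate can be sketched in Python as follows. This is a minimal sketch under stated assumptions: each instruction is a hypothetical `(fu_type, dest, sources)` tuple, only RAW dependences on registers are tracked, and the example block is invented for illustration rather than copied from Figure 1.

```python
from collections import defaultdict

def fu_requirements(block, available):
    """Estimate per-type FU requirements from static RAW dependence levels."""
    last_writer = {}   # register -> index of the instruction that last wrote it
    level_of = []      # dependence-tree level of each instruction
    per_level = defaultdict(lambda: defaultdict(int))  # type -> level -> count
    for fu_type, dest, sources in block:
        producers = [last_writer[s] for s in sources if s in last_writer]
        # An instruction sits one level below its deepest producer.
        level = 1 + max((level_of[p] for p in producers), default=0)
        level_of.append(level)
        per_level[fu_type][level] += 1
        last_writer[dest] = len(level_of) - 1
    # Requirement = widest level, capped at the number of physical units.
    return {t: min(max(counts.values()), available[t])
            for t, counts in per_level.items()}

# Hypothetical block: levels are {1, 2}, {3, 4, 5}, {6}.
block = [
    ("int_add", "r1", ("r2", "r3")),
    ("int_add", "r4", ("r5", "r6")),
    ("int_add", "r7", ("r1", "r8")),
    ("int_add", "r9", ("r4", "r10")),
    ("int_add", "r11", ("r4", "r12")),
    ("int_add", "r13", ("r7", "r1")),
]
requirement = fu_requirements(block, {"int_add": 4})
```

With four INT Add units available, the widest level (three independent adds) sets the requirement; with only two units available, the requirement is capped at two.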
Energy Estimation
The energy consumed by basic block instructions comprises dynamic and static (leakage) energy dissipated every execution cycle by the individual functional units and the overhead energy associated with power cycling a FU if necessary. Note that the energy is an estimate, since the final FU requirements are based on a static dependence analysis and the number of cycles that FUs are OFF/ON is based on an average IPC (instructions committed per cycle) obtained by application execution profiling. The values for INT and FP FU dynamic power and cycle time are taken directly from Wattch [1] and are shown in Table 3. Wattch is a simulator based on SimpleScalar [2] that implements microarchitecture component power models used to generate energy dissipation data for an application's execution. Although the Wattch power models are based on previous-generation technology (130nm), we assume that static power accounts for up to 40% of the total power consumed, which is valid for 45nm technology according to [11]. Since our energy savings is computed and reported as a relative percentage, the magnitude of the actual power saved will be smaller in future technologies than that for 130nm, but the percentage of energy saved is a valid projection for future transistor feature sizes. Therefore, the leakage power or Leak Factor is expressed as:
Wattch computes the power consumed by a functional unit, P_FU, as:
where D_FU is the dynamic or instantaneous power consumed by a FU. Power cycling the functional units incurs an energy overhead, which is discussed in more detail in Section 3.2.1. The compiler-generated instructions (see Section 3.4) that work in conjunction with the hardware to physically turn FUs OFF/ON also incur an energy overhead for their execution. Therefore, these instructions must be counted and their energy included in the determination of total overhead energy, E_OH, which is computed as:
where I_ON and I_OFF are the number of compiler-generated FU ON and FU OFF instructions, respectively; E_OH_OFF and E_OH_ON are the energy overhead to turn a FU OFF and ON, respectively. The total energy consumed is expressed as:
where time = clock cycles * clock cycle time.
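The definitions of the instruction counts and per-switch overheads above suggest the following form for the total overhead energy. This is a reconstruction under the assumption that each compiler-generated instruction triggers exactly one switching event of the corresponding kind; it should not be read as the equation from the original text.

```latex
E_{OH} = I_{ON} \cdot E_{OH\_ON} + I_{OFF} \cdot E_{OH\_OFF}
```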
Overhead Energy and BreakEven Cycles.
Power cycling functional units incurs an energy overhead since power gating requires a circuit (a header device [5]) to perform the physical switching. We initially assume a BreakEven point of 20 cycles [5]. This means that a FU must be powered OFF for more than 20 cycles for the aggregate leakage energy savings to be greater than the total energy overhead cost. Conversely, if a FU is powered OFF for fewer than 20 cycles, the overhead energy cost for power cycling is greater than the aggregate leakage energy saved. Therefore, the total energy consumed is minimized when the FU is left ON during this period.
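The break-even reasoning can be made concrete with a small Python sketch. It assumes that the combined OFF plus ON overhead equals the leakage energy a unit would dissipate over the BreakEven cycles, which is how a break-even point is usually defined; the function name and the numeric inputs are hypothetical.

```python
def net_energy_saved(idle_cycles, leak_power_w, cycle_time_s, breakeven_cycles=20):
    """Leakage energy saved by gating one FU, minus the power-cycling overhead.

    Assumes the OFF+ON overhead equals the leakage over `breakeven_cycles`.
    A positive result means gating pays off; a negative one means leave it ON.
    """
    return leak_power_w * cycle_time_s * (idle_cycles - breakeven_cycles)

# Hypothetical values: 0.1 W of leakage and a 1 ns clock period.
short_idle = net_energy_saved(6, 0.1, 1e-9)    # shorter than break-even
at_breakeven = net_energy_saved(20, 0.1, 1e-9)
long_idle = net_energy_saved(40, 0.1, 1e-9)    # longer than break-even
```

Under this model a 6-cycle idle period yields a net loss, so the unit should stay ON, whereas a 40-cycle idle period yields a net saving.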
To show that 20 cycles is a reasonable initial assumption, we perform an analysis that quantifies the sensitivity of the aggregate energy saved to the value of BreakEven cycles in Section 5.2. The energy overhead attributed to FU power cycling can be expressed in terms of leakage energy and BreakEven cycles, as:
where BE Cycles is the number of BreakEven cycles and Cycle time is the clock cycle time (Table 3). The other variables are defined in Equation (3). We assume that E_OH_ON = E_OH_OFF, and therefore,
Functional Unit Requirement Optimization
To maximize the energy savings, our algorithm optimizes basic block functional unit requirements. The optimization depends on accurately detecting short FU idle periods where the energy overhead for power cycling a FU is greater than the aggregate energy saved while the FU is OFF. Short idle periods occur as shown in Figure 3. The FU requirement of basic block 1 (BB1) is 2 INT Add, 1 INT Mult, 1 FP Add, and 1 FP Mult units. BB2 only requires 2 INT Adders for its execution, while BB3 has the same FU requirements as BB1. Assume that BB2 executes for 6 cycles (i.e., fewer than the BE cycles). In this example, the overhead energy required to power cycle the FUs (INT Mult, FP Add, and FP Mult) OFF for the execution of BB2 and back ON for BB3 is greater than the static leakage energy saved by turning these FUs OFF during BB2's execution. Our algorithm detects these cases. The complexity of an exhaustive analysis is O(N^B), where N is the number of FUs and B is the number of basic blocks in the CFG. Because exhaustive analysis of the CFG is computationally infeasible, we implement sub-optimal but computationally feasible solutions called the Local and Global Optimizers.
Local Optimizer.
This optimization is performed one node at a time. For example, if the INT Add unit requirement in sequential nodes is (4-3-2-4), the optimizer sets the requirement of the second node to 4 and calculates the energy consumed. Setting the FU requirement to 4 removes the overhead of switching a FU OFF from node 1 to node 2 but consumes energy to leave it ON. If the total energy consumed is less when the requirement is set to 4 rather than 3, then the requirement for node 2 is set to 4. Since we assume 4 INT Add units, the optimization of node 2 is complete. We optimize nodes 3 and 4 similarly. Table 2 shows the total energy and each of its components for all INT FU configurations that are examined by the local optimizer. The FU configuration selected by the optimizer is the one with the lowest total energy, (4-3-3-4).

Table 2
FU Config    E_ON Used    E_ON Unused    E_OH_OFF    E_OH_ON    E_Total
(4-3-2-4)    2.33E-6      2.79E-7        2.85E-7     2.85E-7    3.17E-6
(4-4-2-4)    2.33E-6      3.72E-7        2.85E-7     2.85E-7    3.27E-6
(4-3-3-4)    2.33E-6      3.72E-7        1.43E-7     1.43E-7    2.98E-6
(4-3-4-4)    2.33E-6      4.65E-7        1.43E-7     1.43E-7    3.08E-6

Figure 4

The primary advantage of the local optimizer is its relatively low complexity, which leads to a linear increase in optimization/computation time with an increase in the number of CFG basic blocks. The main disadvantage is that, since it only optimizes using one node at a time, higher order combinations that may result in increased energy savings are not analyzed.

Global Optimizer.
To take advantage of higher order optimizations in a computationally feasible method, we divide the CFG into smaller sub-CFGs of a specified depth and perform an exhaustive search for optimal FU requirements on each sub-CFG. Sub-CFGs of depth one for the CFG in Figure 2 are shown in dashed boxes in Figure 5. The depth chosen for sub-CFGs exhibits a tradeoff between energy reduction and optimization time.
The optimizer works as follows: if the INT Add unit requirement in sequential nodes is (4-3-2-4) and we assume a depth of 2, the optimization is performed over two sub-CFGs, (4-3-2) and (3-2-4). For the (4-3-2) sub-CFG, the FU requirements are changed from (4-3-2) through (4-4-4) and the energy is computed for each combination in a manner similar to that shown in Table 2. The FU configuration with the least energy consumption is chosen. Comparing global with local optimization for this example, the FU configurations evaluated by the global algorithm are (4-3-3), (4-3-4), (4-4-2), (4-4-3), and (4-4-4). Those examined by the local algorithm include (4-4-2), (4-3-3), and (4-3-4), which are a subset of the combinations analyzed by the global optimizer.
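The two search strategies can be contrasted on the (4-3-2-4) example with a minimal Python sketch. The energy model is an assumption made only for illustration: a fixed leakage cost per powered unit per block plus a fixed overhead per unit switched between consecutive blocks, with hypothetical constants. Dynamic energy is omitted because it does not vary with the configuration, and the way results from overlapping sub-CFGs are merged is not specified in the text, so the exhaustive search is shown on a single sub-CFG.

```python
from itertools import product

LEAK_PER_UNIT_BLOCK = 0.93e-7   # hypothetical leakage energy per ON unit per block
SWITCH_OVERHEAD = 1.425e-7      # hypothetical overhead per unit turned OFF or ON
MAX_UNITS = 4                   # number of INT Add units assumed in the text

def energy(config):
    """Toy energy of a sequence of per-block FU counts (assumed model)."""
    leak = LEAK_PER_UNIT_BLOCK * sum(config)
    switched = sum(abs(b - a) for a, b in zip(config, config[1:]))
    return leak + SWITCH_OVERHEAD * switched

def local_optimize(reqs, max_units=MAX_UNITS):
    """Raise one node at a time, keeping a change only if energy drops."""
    best = list(reqs)
    for i in range(len(best)):
        for candidate in range(reqs[i] + 1, max_units + 1):
            trial = best[:i] + [candidate] + best[i + 1:]
            if energy(trial) < energy(best):
                best = trial
    return tuple(best)

def candidate_configs(reqs, max_units=MAX_UNITS):
    """All configurations from the original requirements up to max_units."""
    return list(product(*(range(r, max_units + 1) for r in reqs)))

def exhaustive_optimize(reqs, max_units=MAX_UNITS):
    """Exhaustive search over one sub-CFG, as in the global optimizer."""
    return min(candidate_configs(reqs, max_units), key=energy)

local_result = local_optimize((4, 3, 2, 4))
sub_cfg_configs = candidate_configs((4, 3, 2))
sub_cfg_result = exhaustive_optimize((4, 3, 2))
```

For the (4-3-2) sub-CFG, `candidate_configs` enumerates the original configuration plus the five alternatives listed above, which shows why the search space grows quickly with depth.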
Processor Support
In our implementation we assume that the compiler inserts additional instructions to support the FU switching operations. At the start of a basic block, the current FU configuration and the new block requirements are compared to determine if any units are to be turned OFF/ON. The compiler inserts additional instructions accordingly. FUs are turned ON at the first instruction of each basic block. If control flows into a block after the first instruction, the FU requirement of the block is unknown. In such a case, based on the FU requirement of the current instruction, if no unit of the required type is ON, all of the available units for that instruction type are turned ON with a Hardware ON processor signal. For example, if control flows to a block at its third instruction, the FU requirement is unknown. If the third instruction of the block is an Integer Add, then all INT Add units are turned ON.
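The comparison performed at the start of each basic block can be sketched as follows. The instruction mnemonics `FU_ON` and `FU_OFF`, the function names, and the dictionary representation of a FU configuration are all hypothetical; the paper does not specify the encoding of the inserted instructions.

```python
def switch_instructions(current, required):
    """List the ON/OFF switches needed to move from `current` to `required`.

    Both arguments map a FU type to the number of units that are ON.
    """
    ops = []
    for fu_type in sorted(set(current) | set(required)):
        delta = required.get(fu_type, 0) - current.get(fu_type, 0)
        if delta > 0:
            ops.append(("FU_ON", fu_type, delta))
        elif delta < 0:
            ops.append(("FU_OFF", fu_type, -delta))
    return ops

def hardware_on(current, fu_type, available):
    """Mid-block entry: if no unit of the needed type is ON, turn all of them ON."""
    updated = dict(current)
    if updated.get(fu_type, 0) == 0:
        updated[fu_type] = available[fu_type]
    return updated

# Requirements of BB1 and BB2 as described for Figure 3.
bb1 = {"int_add": 2, "int_mult": 1, "fp_add": 1, "fp_mult": 1}
bb2 = {"int_add": 2}
ops_entering_bb2 = switch_instructions(bb1, bb2)
after_signal = hardware_on({"int_add": 0}, "int_add", {"int_add": 4})
```

Here the transition from BB1 to BB2 produces three OFF switches and no change for the INT Add units, while the mid-block case turns ON all four INT Add units.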
Performance Penalty
Performance penalties are incurred in cases where a functional unit cannot be assigned either because (1) it is not fully ON (i.e. it is in the process of being turned ON), (2) the required FU is currently OFF, or (3) all ON FUs of the required type are busy. In (2) and (3), a performance penalty is incurred if a ready-to-issue instruction waits until a FU is available or if an additional FU is switched ON. An additional performance penalty is incurred by the required FU OFF/ON instructions. We have accounted for the performance penalty as an increase in execution time (clock cycles) when our technique is implemented.
Experimental Platform
For implementing and validating the FU shutdown technique, we use Wattch [1] which supports the simulation of a MIPS-like superscalar, out-of-order, speculative pipeline. Our functional unit configuration consists of 4 We modified Wattch to include the overhead energy computations described in Section 3.2.1 and shown in the table. We assume a FU Turn-On latency of 3 cycles based on [5] . We use five FP and five INT benchmarks from the SPEC CPU2000 suite that we chose based on overall program characteristics given in [8] to validate our technique.
Experimental Results
The effectiveness of our methodology is shown in Figure 9, where the percentage of power savings realized by each benchmark is shown for various optimizations: no-opt computes the total energy dissipation prior to any CFG optimization; local uses the Local CFG Optimizer; global2 uses the Global CFG Optimizer with depth 2; FP ON global2 leaves all FP FUs ON (only powering OFF INT FUs when appropriate) and uses the Global CFG Optimizer with depth 2. We chose a depth of 2 for global optimization since it demonstrated an acceptable tradeoff between optimization/computation time and energy reduction. At this depth, optimization time ranged from a few minutes to a few days, the latter for the benchmarks with the largest number of basic blocks (eon and fma3d). Table 4 gives the average and maximum percentage of total energy saved due to each of these four FU shutdown optimizations. The power savings is computed with respect to the base case where functional unit shutdown is not implemented.

Figure 9. Total Energy Savings (%)

With no optimization, FU shutdown saves a maximum of approximately 18% and an average of 0.60% of the total energy. The low average percentage energy savings is caused by the energy increase for FP benchmarks (e.g., the average energy increase for FP applications is 3.68%). Using global2 CFG optimization, an average of around 4% of the total energy is saved across all benchmarks, with an average of around 9.5% savings for integer applications. For all cases, integer benchmarks realize the largest reduction in total energy consumed using FU shutdown. The negative energy savings noted in some of the benchmarks for the various levels of CFG optimization may be primarily attributed to two characteristics: the energy overhead to power cycle FP FUs and the inaccuracy of FU requirements set in the CFG. The benchmarks that exhibit an increase in total energy are those that execute a relatively high percentage of FP operations. Table 5 shows the instruction mix for a subset of the benchmarks used. Eon and swim have between 18 and 20% FP ops, a very small amount of INT computation, and exhibit an increase in total energy when implementing FU shutdown. Gzip, mcf, and vortex show the largest decrease in total energy using FU shutdown and are all characterized by a large percentage of integer operations and few to no FP operations. FP functional units have a relatively large energy overhead for power cycling due to their size/capacitance. Additionally, CFG FU requirements are determined based on the instruction types within basic blocks and instruction dependences, which are determined statically.
If a basic block FU configuration is set incorrectly based on the static analysis and dynamically this basic block accounts for a large percentage of the total execution time (as is often the case for benchmarks classified as FP), then a large energy overhead, including energy due to performance degradation, will be incurred due to power cycling the FU unnecessarily. We are currently investigating this issue further.
Performance Degradation
The performance degradation is presented in Figure 10. The execution time in cycles with no FU shutdown is compared to that with various CFG optimizations for FU shutdown. For most of the benchmarks, the degradation in performance of the optimized execution compared to the actual execution time is about 1%.
Sensitivity of BreakEven Cycles
BreakEven (BE) Cycles are used to compute the energy overhead to power cycle FUs. A smaller number of BE Cycles results in a lower energy for power cycling FUs which implies that FUs can be turned OFF/ON more frequently. From Figure 11 , we see that the total energy savings decreases as BreakEven cycles increase. This is primarily due to the increase in overhead energy as the number of BreakEven cycles increases, although the change in the number of idle FU periods of length greater than or equal to BreakEven cycles also affects this. The data shown is for art, but the total energy versus BreakEven cycle trends are similar for the other benchmarks.
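The trend described above can be reproduced qualitatively with a short Python sketch. It assumes that one OFF/ON pair costs the leakage of BreakEven cycles and that only idle periods longer than the BreakEven point are gated; the idle-period lengths are hypothetical and are not measurements from art or any other benchmark.

```python
def leakage_cycles_saved(idle_periods, breakeven_cycles):
    """Net savings in units of one FU's per-cycle leakage energy.

    Each gated idle period saves its length in leakage cycles and pays
    `breakeven_cycles` of overhead; shorter periods are left ON.
    """
    return sum(p - breakeven_cycles for p in idle_periods if p > breakeven_cycles)

idle_periods = [6, 15, 25, 40, 120]   # hypothetical idle-period lengths (cycles)
sweep = {be: leakage_cycles_saved(idle_periods, be) for be in (10, 20, 30, 40)}
```

Both effects named in the text appear in this model: a larger BreakEven value raises the overhead paid per gated period and also reduces the number of idle periods long enough to be gated.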
Conclusions and Future Work
Figure 11. BreakEven Cycles Sensitivity, art

In this paper we propose a technique to reduce microarchitecture energy consumption by turning functional units OFF during periods of execution when they are idle. This technique attempts to maximize energy savings by detecting short idle periods during which more energy is consumed by power cycling FUs than by leaving FUs ON. We show that for certain applications, this method saves up to 18% of the total energy with a performance degradation of 1%. However, for some applications, particularly those classified as FP benchmarks, functional unit shutdown (of FP units) results in increased energy dissipation. For these benchmarks, leaving the FP functional units ON (while power cycling INT FUs as appropriate) results in an average of 3.66% energy savings.
In the future we intend to investigate alternate methods of CFG optimization, methods of compressing the CFG to decrease optimization time and make increased depth optimization computationally feasible, and using a more accurate number of BreakEven cycles for FP functional units to improve the energy reduction of functional unit shutdown. We also plan to examine implementing FP FU shutdown of the back-end (i.e., later stages) of the FP pipeline rather than power cycling the entire unit. Finally, we will examine methods for improving the accuracy of CFG FU requirements for FP applications by including branch frequency information in setting FU configurations.
Acknowledgments
This work is funded in part by an IBM Faculty Award and is supported by NSF-MRI award CNS0421456 and NSF-funded ADVANCE Institutional Transformation Program at NMSU, award NSF0123690.
