Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies due to the continuing size reductions and increasing speeds of transistors. Recent studies have attempted to reduce leakage power using integrated architecture and compiler power-gating mechanisms. This approach involves compilers inserting instructions into programs to shut down and wake up components, as appropriate. While early studies showed this approach to be effective, there are concerns about the large amount of power-control instructions being added to programs due to the increasing amount of components equipped with power-gating controls in SoC design platforms. In this article we present a sink-n-hoist framework for a compiler to generate balanced scheduling of power-gating instructions. Our solution attempts to merge several power-gating instructions into a single compound instruction, thereby reducing the amount of power-gating instructions issued. We performed experiments by incorporating our compiler analysis and scheduling policies into SUIF compiler tools and by simulating the energy consumption using Wattch toolkits. The experimental results demonstrate that our mechanisms are effective in reducing the amount of power-gating instructions while further reducing leakage power compared to previous methods.
INTRODUCTION
Minimizing power dissipation can be considered at algorithmic, architectural, logic, and circuit levels [Chandrakasan et al. 1992]. Numerous studies in the literature on low-power design have proposed various techniques for synthesizing designs with reduced transitional activities. Recently, the prospect of combining architecture design and software arrangement at the instruction level has been addressed to help reduce power consumption [Bellas et al. 2000; Chang and Pedram 1995; Horowitz et al. 1994; Lee et al. 2003; Su and Despain 1995; Tiwari et al. 1997, 1998]. For example, several types of software rearrangement have been used to reduce the dynamic power, such as utilizing the value locality of registers [Chang and Pedram 1995], swapping operands for Booth multipliers [Lee et al. 1997], scheduling VLIW instructions to reduce the power consumption on the instruction bus [Lee et al. 2003], gating the clock to reduce workloads [Horowitz et al. 1994; Tiwari et al. 1997, 1998], utilizing cache subbanking mechanisms [Su and Despain 1995], and using an instruction cache for loops [Bellas et al. 2000].
Leakage power is coming to represent a greater proportion of total power dissipation as the feature size of semiconductor technology continues to shrink, as shown in Figure 1. It is predicted that leakage power will become comparable to dynamic power within only a few generations [Doyle et al. 2002; Karnik et al. 2002; Kim et al. 2003; Semiconductor Industry 2004; Jones 2004]. Therefore, power gating to reduce leakage power should be used in addition to clock gating, which is only able to reduce the dynamic power [Kao and Chandrakasan 2000; Butts and Sohi 2000; Hu et al. 2004]. Recent studies have attempted to reduce leakage power using integrated architecture and compiler power-gating mechanisms [Dropsho et al. 2002; Yang et al. 2002; You et al. 2002, 2006; Rele et al. 2002; Zhang et al. 2003]. This approach involves compilers inserting instructions into programs to shut down and wake up components whenever appropriate, based on a data-flow analysis or profiling analysis. While early studies showed this approach to be effective, there are concerns about the amount of power-control instructions being added to programs as increasing numbers of components are equipped with power-gating controls in system-on-a-chip (SoC) design platforms for embedded systems. Note that architecture designers can customize the processor with unique operation functions [Ip et al. 2002; Gonzalez 2000; Tsutsui et al. 2002]. For example, one may have extensible instructions for modules of cryptography, 3D graphics, and motion estimation, as well as a variety of wireless communication modules.
In this article we present a sink-n-hoist framework for a compiler to generate balanced scheduling of power-gating instructions. Our framework attempts to merge several power-gating instructions into a single compound instruction, thereby reducing the amount of power-gating instructions issued. Note that while power-gating instructions can significantly reduce leakage power, they incur recovery penalties and increase the execution time and code size of programs. Figure 2 illustrates an example of power-gating control. The left-hand panel of the figure shows two different components in use, the center panel illustrates the current practice of issuing power-on and power-off instructions for these two hardware components separately, and the right-hand panel shows our scheme, which attempts to merge these instructions. In this article we provide a cost model and software foundation to guide this process. Our solution includes a set of data-flow equations for code motion of power-gating instructions. Our work combines a theoretical foundation and a step-by-step framework for moving, grouping, and merging power-gating instructions. We have performed experiments that incorporate our compiler analysis and scheduling policies into SUIF compiler tools, and simulate the energy consumption using Wattch toolkits [Brooks et al. 2000]. Experimental results obtained using the DSPstone benchmark suite demonstrate that our mechanisms are effective in reducing both the amount of power-gating instructions and the power consumption relative to previous methods. Our sink-n-hoist framework for merging power-gating instructions reduces the code size by an average of 47.8%, and also further reduces the energy consumption, because the block version of power-gating instructions gives better power and performance than the pointwise version.
The remainder of this article is organized as follows. Section 2 describes a machine architecture for the target platform, Section 3 overviews the leakage-power-reduction framework, Section 4 presents our analysis and merging techniques for reducing the amount of power-gating instructions, Section 5 gives the experimental results of our study, Section 6 describes related work, and Section 7 concludes.
MACHINE ARCHITECTURE
The architecture model in our design has an instruction set that supports power-gating control at the component level. We focus on reducing the power consumption of certain components by invoking power-gating technology. Power gating is analogous to clock gating, except that devices are powered off by switching off their supply voltage, rather than the clock. This can be implemented by forcing transistors to be off or using MTCMOS (multithreshold voltage CMOS technology) to increase the threshold voltage [Butts and Sohi 2000; Kao and Chandrakasan 2000; Roy and Prasad 1992; Hu et al. 2004]. Figure 3 illustrates an example of our target machine architecture based on a DEC Alpha 21264 processor with an instruction fetch, issue, and retire unit (Ibox), a block of integer-function units (Ebox), a block of floating-point-function units (Fbox), a memory reference unit (Mbox), and an external cache and system interface unit (Cbox) [Compaq 1999]. In the adapted DEC Alpha 21264 architecture model, Ebox and Fbox were equipped with power-gated functions. The power state of each unit is controlled by the 64-bit integer power-gating control register (PGCR). In this case, 1 bit is used for the integer multiplier unit and 3 bits for the floating-point function units. Setting a power-gating bit to 1 powers on the corresponding module, and clearing the bit to 0 powers off the corresponding module in the following clock cycle. A new instruction was implemented to control units with the power-gated function by moving the appropriate value from a general-purpose register to the PGCR. The integer ALU unit is always powered on, since it is responsible for moving data to the PGCR.
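The PGCR encoding can be pictured as a bit mask. The following is a minimal sketch in Python, not the authors' implementation: the bit positions and the unit names (`int_mul`, `fp_alu`, `fp_mul`, `fp_div`) are hypothetical, since the text states only that one bit covers the integer multiplier and three bits cover the floating-point units.

```python
# Hypothetical bit layout: the text gives the bit counts, not the positions.
PGCR_BITS = {"int_mul": 0, "fp_alu": 1, "fp_mul": 2, "fp_div": 3}
PGCR_WIDTH = 64  # the PGCR is described as a 64-bit integer register


def pgcr_value(powered_on):
    """Build the PGCR word: bit = 1 powers a unit on, bit = 0 powers it off."""
    value = 0
    for unit in powered_on:
        value |= 1 << PGCR_BITS[unit]
    return value & ((1 << PGCR_WIDTH) - 1)


def powered_units(value):
    """Decode a PGCR word back into the set of units that are powered on."""
    return {unit for unit, bit in PGCR_BITS.items() if (value >> bit) & 1}
```

Under this assumed layout, a single register move can switch several units at once, which is the property the merging scheme relies on.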
LEAKAGE-POWER-REDUCTION FRAMEWORK
This section presents the compiler framework for implementing power-gating mechanisms to reduce leakage-power dissipation. We have previously presented a data-flow analysis framework, called component-activity data-flow analysis (CADFA), to estimate the component activities on a microprocessor within a given program [You et al. 2002, 2006]. The analysis collects information on the utilization of components at each point in the program. Power-gating-instruction scheduling is then performed to determine whether, where, and when power-gating controls should be employed so as to produce power reduction. Finally, power-gating instructions are inserted into the program accordingly. In the current study, we present a sink-n-hoist framework, applied in the phase immediately before power-gating instructions are inserted, to generate balanced scheduling of power-gating instructions. Our solution attempts to merge several power-gating instructions into a single compound instruction. Figure 4 presents the compiler flow of the leakage-power-reduction framework. In the figure, steps I, II, and III are conventional [You et al. 2002, 2006], and steps IV and V are proposed in this article to merge power-gating instructions. Steps I and II involve performing a component-activity data-flow analysis, step III decides if and where power-gating instructions should be inserted, step IV attempts to merge the power-gating instructions with our proposed sink-n-hoist framework, and step V produces the power-gating instructions. A motivating example of power-gating control in three floating-point units (ALU, multiplier, and divider) with this framework is illustrated in Figure 5, where each item shows the status of a component on a timeline, and a shaded item represents one that is in use.
Three scenarios are considered: the leftmost items show the case without power-gating controls; the middle items show the case when steps I, II, III, and V in the framework are applied; and the rightmost items show the case when all phases in the framework are applied. The number of power-gating instructions inserted can be decreased from six to two when the sink-n-hoist analysis is applied. Sections 3.1 and 3.2 describe the methods used in steps II and III, and Section 4 describes steps IV and V, in which sink-n-hoist analysis performs the code motion of power-gating instructions.
Component-Activity Data-Flow Analysis
The goal of CADFA is to determine the utilization of components at each point in a program using a set of data-flow equations. We say a component activity c is generated at a block b if a component is required for execution, represented by COMPONENT loc (b), and that it is killed if the component is released by the request, represented by COMPONENT blk (b). The predicates of the data-flow equations for collecting component-activity information are given as follows:
-COMPONENT loc (b) is a set of components that are required for the first cycle of execution.
-COMPONENT blk (b) is a set of components that are released by the execution at block b.
-COMPONENT in (b) is a set of components that are required for execution at the beginning of block b:
$$\mathrm{COMPONENT}_{in}(b) = \bigcup_{p \in Pred(b)} \mathrm{COMPONENT}_{out}(p),$$
where Pred(b) is the set of predecessor program blocks of block b.
-COMPONENT out (b) is a set of components that are required for execution at the end of block b:
$$\mathrm{COMPONENT}_{out}(b) = \mathrm{COMPONENT}_{loc}(b) \cup \left(\mathrm{COMPONENT}_{in}(b) - \mathrm{COMPONENT}_{blk}(b)\right).$$
COMPONENT out (b) can be interpreted as the information at the end of a statement, being either generated within the statement or entering at the beginning and not being killed as control flows through the statement.
-INACTIVITY(b) is a set of components that are not active at block b. In fact, INACTIVITY(b) is the complementary set to COMPONENT out (b), that is,
$$\mathrm{INACTIVITY}(b) = \Omega - \mathrm{COMPONENT}_{out}(b),$$
where $\Omega$ is the universal set.
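The predicates above can be evaluated with a standard iterative fixed-point solver. The sketch below is an illustration rather than the authors' code: it assumes that COMPONENT in (b) is the union of COMPONENT out over the predecessors of b, that COMPONENT out (b) is COMPONENT loc (b) joined with the incoming components not in COMPONENT blk (b), and that the universal set is the set of all components mentioned in the program. The function name `cadfa` and the dictionary-based graph encoding are assumptions.

```python
def cadfa(blocks, preds, loc, blk):
    """Forward data-flow iteration for component activity (assumed equations)."""
    universe = set().union(*loc.values(), *blk.values())
    comp_in = {b: set() for b in blocks}
    comp_out = {b: set() for b in blocks}
    changed = True
    while changed:  # repeat until no set changes (fixed point)
        changed = False
        for b in blocks:
            # IN(b): union of OUT(p) over all predecessors p of b
            new_in = set().union(*(comp_out[p] for p in preds.get(b, [])))
            # OUT(b): generated here, or entering and not released here
            new_out = loc[b] | (new_in - blk[b])
            if new_in != comp_in[b] or new_out != comp_out[b]:
                comp_in[b], comp_out[b] = new_in, new_out
                changed = True
    # INACTIVITY(b): complement of OUT(b) with respect to the universe
    inactivity = {b: universe - comp_out[b] for b in blocks}
    return comp_in, comp_out, inactivity


# Usage: a three-block straight-line program with hypothetical unit names.
blocks = ["B1", "B2", "B3"]
preds = {"B2": ["B1"], "B3": ["B2"]}
loc = {"B1": {"fp_mul"}, "B2": set(), "B3": {"fp_alu"}}
blk = {"B1": set(), "B2": {"fp_mul"}, "B3": set()}
comp_in, comp_out, inactivity = cadfa(blocks, preds, loc, blk)
```

In this small example the multiplier is requested in B1 and released in B2, so it appears in the inactivity sets of B2 and B3.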
Power-Gating-Instruction Scheduling
Once the utilization information of components has been obtained, we can insert power-gating instructions into programs at the appropriate points (i.e., beginning and end of an inactive block) to power off and on unused components so as to reduce the leakage power. However, both shut-down and wake-up procedures are associated with an additional penalty, especially the latter due to peak voltage requirements. The following inequality represents a cost model for deciding whether the insertion of power-gating instructions will provide energy-consumption benefits:
$$E_{off}(C) + E_{on}(C) + P_{rleak}(C) \cdot ITVL_{idle} \le P_{leak}(C) \cdot ITVL_{idle},$$
where functions E and P return the value of energy and power consumption, respectively; E off (C) and E on (C) represent the energy consumption of issuing a power-off and a power-on instruction for component C, respectively; P leak (C) represents the leakage power consumption of component C in a cycle; P rleak (C) represents the leakage power consumption of component C in a reduced level in a cycle; and ITVL idle is the length of the idle interval. Accordingly, we have a break-even length of idle intervals for each component C, called BE−ITVL idle C , that sustains the aforementioned inequality:
$$BE\text{-}ITVL_{idle}^{C} = \frac{E_{off}(C) + E_{on}(C)}{P_{leak}(C) - P_{rleak}(C)}.$$
Hence, the compiler should employ power-gating control of a component C only when the component exhibits a continuous idle interval longer than BE−ITVL idle C . Moreover, the latency associated with powering a component on should also be considered.
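The break-even test can be sketched numerically as follows. This is a minimal illustration that takes the cost model in the form E off + E on + P rleak · ITVL ≤ P leak · ITVL; the function names and the sample numbers (expressed in units of one cycle of leakage energy) are hypothetical, and the wake-up latency mentioned above is not modeled.

```python
import math


def break_even_interval(e_off, e_on, p_leak, p_rleak=0.0):
    """Idle length at which gating overhead equals the leakage saved."""
    saving_per_cycle = p_leak - p_rleak
    if saving_per_cycle <= 0:
        return math.inf  # gating can never pay off
    return (e_off + e_on) / saving_per_cycle


def should_gate(idle_cycles, e_off, e_on, p_leak, p_rleak=0.0):
    """Gate only when the idle interval is longer than the break-even length."""
    return idle_cycles > break_even_interval(e_off, e_on, p_leak, p_rleak)
```

For instance, with a power-off cost of 4 units, a power-on cost of 10 units, and negligible residual leakage, the break-even length is 14 cycles, so a 20-cycle idle interval qualifies while a 10-cycle one does not.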
The obtained component-activity information and the cost model for deciding whether power-gating instructions should be employed allow us to consider scheduling mechanisms when inserting the power-gating instructions into given programs. Since the time required to instigate power-gating controls on components is influenced by conditional branches in programs, we propose the following set of scheduling policies for power-gating instructions: Basic Blk Sched, MIN Path Sched, and AVG Path Sched. A naive mechanism to control the power-gating instructions sets the on and off instructions at each basic block according to the component activities gathered by the data-flow equations. We call this scheme Basic Blk Sched. Another case to consider is that of an inactive block containing conditional branches, since the lengths of the (say) two inactive blocks that follow the branch targets may be different. For example, only one of the branches may benefit from power gating, in which case instigating power-gating control in one branch when the other is instead taken may not reduce the power requirements. In other words, the path lengths of the taken and not-taken paths of a branch may not be equal, and therefore one may satisfy the cost model and the other may not. Hence, we propose a MIN Path Sched policy to ensure that power-gating control is activated only when the inactive lengths of both branching paths exceed the power-gating threshold; that is, the minimum length of those paths reaches the criterion for power gating. Finally, since the behavior of program branches depends on both the structure of and the input data to programs, some branches may be followed rarely, or even never. To accommodate this, we propose an eclectic policy, called AVG Path Sched, to schedule power-gating instructions. AVG Path Sched uses the average length of the two branch paths, rather than the minimum length. These three scheduling policies have been described in detail previously [You et al. 2002].
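The difference between the two path-based policies can be illustrated with a small sketch. The function names, the two-way-branch restriction, and the numbers are assumptions for illustration; Basic Blk Sched is omitted because it decides per basic block rather than per path.

```python
def policy_length(taken_len, not_taken_len, policy):
    """Inactive length that a policy compares against the break-even length."""
    if policy == "MIN":
        return min(taken_len, not_taken_len)
    if policy == "AVG":
        return (taken_len + not_taken_len) / 2
    raise ValueError(f"unknown policy: {policy}")


def gate_across_branch(taken_len, not_taken_len, break_even, policy):
    """True if power gating is applied across the branch under the policy."""
    return policy_length(taken_len, not_taken_len, policy) > break_even
```

With inactive lengths of 30 and 8 cycles and a break-even length of 14 cycles, the MIN policy declines to gate (8 is too short), whereas the AVG policy gates (the average is 19).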
SINK-N-HOIST ANALYSIS
The main idea of sink-n-hoist analysis is to use code-motion techniques to mitigate the excessive addition of instructions. The approach attempts to merge several power-gating instructions into one compound instruction by "sinking" power-off instructions and "hoisting" power-on instructions; that is, postponing the issuing of power-off instructions and bringing forward the issuing of power-on instructions. This results mainly in improvements to code size, but also in performance and energy via grouping effects. For instance, a power-off instruction can be postponed for several cycles to be merged with adjacent power-off instructions. Nevertheless, a maximum number of cycles to be sunk or hoisted should be set, since sinking or hoisting a power-gating instruction increases leakage dissipation. A cost model is given next to determine the feasibility. For a component C, we have
where SINK−SLK is the number of cycles for which a power-off statement (or instruction) is sunk (i.e., the power-off statement is delayed for SINK−SLK cycles), E fet−dec−off (C) returns the energy of fetching and decoding a power-off instruction, E exe−off (C) returns the energy of executing a power-off instruction, and N is the number of power-gated components. Note that the sum of E fet−dec−off (C) and E exe−off (C) is equal to E off (C). The right-hand side of the inequality represents the energy consumed when the power-off statement is delayed for SINK−SLK cycles and merged with the other (N − 1) power-off statements, while the left-hand side represents the energy consumed when the power-off statement is issued immediately after the end of an active interval. Consequently, we have a maximum sinkable slack for each component C, called MAX−SINK−SLK C , that sustains the previous inequality.
Similarly, we have a maximum hoistable slack for each component.
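As a rough illustration of how a maximum sinkable slack could be derived, the sketch below assumes that merging N power-off statements shares one fetch-and-decode cost among them, while each sunk cycle costs the difference between full and reduced leakage. Under that assumption the slack bound is E fet−dec−off (C) · (1 − 1/N) / (P leak (C) − P rleak (C)), rounded down. This specific form, the function name, and the numbers are assumptions made for illustration, not the model's exact definition.

```python
import math


def max_sink_slack(e_fet_dec, n, p_leak, p_rleak=0.0):
    """Largest whole number of cycles a power-off statement may be delayed.

    Assumed trade-off: sinking saves e_fet_dec * (1 - 1/n) by sharing one
    instruction among n components, and costs (p_leak - p_rleak) per cycle.
    """
    extra_leak_per_cycle = p_leak - p_rleak
    if n <= 1 or extra_leak_per_cycle <= 0:
        return 0  # nothing to merge with, or no meaningful bound
    return math.floor(e_fet_dec * (n - 1) / (n * extra_leak_per_cycle))
```

The same calculation, applied to power-on statements, would give an analogous bound on hoisting.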
With such cost constraints as the basis, we now present a set of data-flow equations to collect information for the code motion of power-gating instructions. Figure 6 shows the algorithm for sink-n-hoist analysis. The complete set of equations used is presented in Figure 7 . Sink-n-hoist analysis consists of three main phases: (1) sinkable analysis and hoistable analysis, which compute the information of possible positions for each power-gating instruction; (2) grouping-off analysis, grouping-on analysis, and grouping-switch analysis, which group together the power-gating instructions that can be merged; and (3) power-gating-instruction placement, which determines appropriate positions for power-gating instructions.
Sinkable Analysis and Grouping-Off Analysis
The predicates for collecting SINKABLE and GROUP−OFF information are given as follows. The SINKABLE predicates collect the information required to determine how far the power-off instructions of component activities can be sunk, and the GROUP−OFF predicates partition power-off instructions into groups. We can then use this information to group them by selecting the produced instructions:
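-SINKABLE loc (b) is a set of power-off statements that occur within block b and which can be safely moved to the end of the block. Each statement is associated with an integer number SINK−SLK b C , which is the slack time for component C that indicates how many cycles the power-off statement can be sunk at block b. The initial value of SINK−SLK b C is set as MAX−SINK−SLK C .
-SINKABLE in (b) is a set of power-off statements that can be safely moved to the beginning of block b.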
The value of SINK−SLK b C would be the minimum one among the predecessors of block b if the values of SINK−SLK p C for each p are inconsistent with each other, where p is a predecessor of block b. This means that the sinkable slack from one predecessor would be reduced if other predecessors have a smaller sinkable slack. This implements the consideration that a power-off statement should not be sunk to a position that may cause a reverse effect. Moreover, the value of each SINK−SLK b C is decreased by one in accordance with the following definition.
-SINKABLE out (b) is a set of power-off statements that can be safely moved to the end of block b.
The value of SINK−SLK otherwise, it is given from the one in SINKABLE in (b). In fact, SINKABLE out (b) presents the set of power-off statements (whether sunk or not) that can be issued at block b.
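The slack bookkeeping described above can be sketched for a single block as follows. This is an illustration under stated assumptions, not the authors' equations: a statement reaching the block from any predecessor is kept, its slack is the minimum over the predecessors that carry it, passing through a block costs one cycle of slack (as in an example where every block holds one statement), a statement with zero slack cannot move further, and a statement originating in the block starts with its own initial slack. The function name and dictionary encoding are hypothetical.

```python
def propagate_sink_slack(pred_out, loc):
    """Compute (sink_in, sink_out) slack maps for one block.

    pred_out: list of {component: slack} maps, one per predecessor.
    loc: {component: initial slack} for power-off statements in this block.
    """
    sink_in = {}
    for out in pred_out:
        for comp, slack in out.items():
            # inconsistent predecessors: keep the smallest slack
            sink_in[comp] = min(slack, sink_in.get(comp, slack))
    # crossing the block uses one cycle; zero-slack statements stop here
    sink_out = {comp: slack - 1 for comp, slack in sink_in.items() if slack > 0}
    # statements issued in this block restart with their own slack
    sink_out.update(loc)
    return sink_in, sink_out
```

For example, if one predecessor offers a slack of 4 for component A and another offers 2, the block sees a slack of 2 on entry and passes on a slack of 1.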
We now give the data-flow equations for GROUP−OFF, whose main concept is to partition power-off instructions into groups in which the possible positions of each such instruction (information that can be derived from SINKABLE out ) overlaps with at least one of those of the other instructions. In other words, it clusters together power-off instructions that might be merged. The predicates for computing GROUP−OFF are as follows:
-GROUP−OFF loc (b) is a set with at most one element (i.e., a singleton or empty set) in which the element (if it exists) is an integer representing a group number that never appears in other sets of GROUP−OFF loc . Block b belongs to the group it enumerates and is the beginning block of a set of successive blocks if GROUP−OFF loc (b) is not empty. The GROUP−OFF loc (b) set is not empty only when
A simple way to ensure that all numbers in the sets of GROUP−OFF loc of all blocks are unique is to assign each element to the value of an integer counter, and increment the counter once an element is assigned.
-GROUP−OFF blk (b) is a universal set of integers, namely , or an empty set.
The set is not empty (i.e., flagged to be a set with an value) only when
In all other cases, it will be an empty set.
-GROUP−OFF in (b) is an integer singleton (a group number) that can be assigned to the start of block b or an empty set.
where returns the value of the element of its parameter and returns infinity if the parameter is an empty set. In addition, all GROUP−OFF out sets of its predecessors in the same group can be replaced by GROUP−OFF in (b) if the GROUP−OFF out set of the predecessor of b is not empty. This provides opportunity for further grouping.
-GROUP−OFF out (b) is an integer singleton (a group number) that can be assigned to the end of block b or an empty set.
In fact, the element in GROUP−OFF out (b) gives the group number to which block b belongs.
We now give a running example to illustrate how the analysis works. Suppose that two components, A and B, are considered for analysis. Given a control-flow graph as shown in Figure 8(a), where each block in the graph contains only one statement, we can determine where power-gating statements should be located by performing steps I, II, III, and V in Figure 4. This includes CADFA and power-gating-instruction scheduling.
In this example, it is found that components A and B should be powered off at blocks B m+2 and B n+2 , and at blocks B m+5 , B n+3 , and B n+5 , respectively. To reduce the amount of power-gating instructions issued, we apply sinkable analysis. By the definition of SINKABLE loc (b), a set of power-off statements that occur within block b, we have SINKABLE loc (B m+2 ) = {PowerOff A(4)}, SINKABLE loc (B m+5 ) = {PowerOff B(2)}, SINKABLE loc (B n+2 ) = {PowerOff A(4)}, SINKABLE loc (B n+3 ) = {PowerOff B(2)}, and SINKABLE loc (B n+5 ) = {PowerOff B(2)}, where the numbers in parentheses indicate the value of the associated SINK−SLK C (in fact, the values come from MAX−SINK−SLK A and MAX−SINK−SLK B ), and SINKABLE loc for the other blocks is an empty set. To simplify representation, the word "PowerOff " is removed and the value of the associated SINK−SLK C is superscripted (e.g., SINKABLE loc (B m+2 ) = {A 4 }). 
Compilation for Compact Power-Gating Controls
Table I. SINKABLE Predicates for the Example in Figure 8 (recoverable cells: {A, B}, {A 0 , B 0 }). ‡ The superscript represents the value of the associated SINK−SLK b C .
Table II. GROUP−OFF Predicates for the Example in Figure 8.
Hoistable and Grouping-On Analysis
Hoistable and grouping-on analyses are similar to sinkable and grouping-off analyses, except that hoistable analysis is a backward data-flow analysis. Similarly, we can define a set of predicates for collecting HOISTABLE and GROUP−ON information as follows:
-HOISTABLE loc (b) is a set of power-on statements that occur within block b and which can be safely moved to the start of the block. Each statement is associated with an integer number HOIST−SLK b C , which is the slack time for component C that indicates how many cycles the power-on statement can be hoisted at block b. The initial value of HOIST−SLK b C is set as MAX−HOIST−SLK C .
-HOISTABLE blk (b) is a set of power-on statements that cannot be safely moved from the end to the start of block b; that is, the set of power-on statements whose value of the associated HOIST−SLK b C is zero.
-HOISTABLE out (b) is a set of power-on statements that can be safely moved to the end of block b.
The value of HOIST−SLK 
HOIST-SLK
-HOISTABLE in (b) is a set of power-on statements that can be safely moved to the start of block b.
A simple way to ensure that all numbers (in the sets of GROUP−ON loc of all blocks) are unique is to assign each element to the value of an integer counter, and increment the counter once an element is assigned.
-GROUP−ON blk (b) is a universal set of integers, namely , or an empty set.
Block b is one (or the only) of the end blocks of a set of successive blocks if GROUP−ON blk (b) is not empty, which is the case when
-GROUP−ON in (b) is an integer singleton (a group number) that can be assigned to the start of block b or to an empty set.
In addition, we can replace all of the GROUP−ON out set of its predecessors by GROUP−ON in (b) if the GROUP−ON out set of the predecessor of b is not empty.
Note that this provides opportunity for further grouping.
-GROUP−ON out (b) is an integer singleton (a group number) that can be assigned to the end of block b or to an empty set.
In fact, the element in GROUP−ON out (b) gives the group number to which block b belongs.
Grouping-Switch Analysis
In order to collect more grouping information for later analysis, we introduce grouping-switch analysis, which groups together all power-on and power-off instructions that might be merged. The analysis is similar to grouping-off and grouping-on analyses. The predicates for computing GROUP−SWH are as follows:
-GROUP−SWH loc (b) is a set with at most one element (i.e., a singleton or empty set) in which the element (if it exists) is an integer representing a group number and never appears in other sets of GROUP−SWH loc . Block b belongs to the group it enumerates and is the beginning block of a set of successive blocks if GROUP−SWH loc (b) is not empty. The GROUP−SWH loc (b) set is not empty only when
A simple way to ensure that all numbers in the sets of GROUP−SWH loc of all blocks are unique is to assign each element to the value of an integer counter, and increment the counter once an element is assigned. HOISTABLE in ( p) = ∅.
-GROUP−SWH in (b) is an integer singleton (a group number) that can be assigned to the start of block b or to an empty set.
In addition, we can also replace all of the GROUP−SWH out set of its predecessors by GROUP−SWH in (b) if the GROUP−SWH out set of the predecessor of b is not empty. Note that this provides opportunity for further grouping.
-GROUP−SWH out (b) is an integer singleton (a group number) that can be assigned to the end of block b or to an empty set.
In fact, the element in GROUP−SWH out (b) gives the group number to which block b belongs.
Power-Gating-Instruction Placement
We use information from the SINKABLE out , HOISTABLE in , GROUP−OFF out , GROUP−ON out , and GROUP−SWH out predicates described in Sections 4.1, 4.2, and 4.3 to determine how to place power-gating instructions, that is, whether power-gating instructions should be combined or issued separately. Figure 9 outlines an algorithm for placing power-gating instructions in a group-by-group manner. It first determines all possible policies for issuing power-gating instructions; a legitimate policy is one in which all power-gating instructions are issued at a block b in which SINKABLE out (b) or HOISTABLE in (b) is not empty, and in which each type of power-gating instruction appearing within a group is issued exactly once. It then uses an energy-cost model (including leakage energy, the energy associated with issuing power-off instructions, etc.) to determine which policy results in the lowest energy consumption. The algorithm for power-gating-instruction placement is essentially an exhaustive search, yet it is simple and valid. In our experiments this process accounted for only a very small fraction of the time spent: less than 0.6% of that of our proposed framework.
In the following, we elaborate the idea by continuing the example presented in Section 4.1. An energy-cost model is established with the information of SINKABLE out and GROUP−OFF out , and evaluated for each policy of issuing power-off instructions under the guideline that power-off instructions must be issued at a block in which SINKABLE out is not empty, and each type of power-gating instruction appearing within a group must be issued exactly once. For example, the policy could be "powering off A at B m+2 and powering off B at B m+5 " or "powering off A and B at B m+5 " in group 1. The policy with minimum energy cost as evaluated by the model is chosen, since this should give the lowest power consumption. Finally, power-off instructions are inserted at appropriate points, as shown in Figure 8(b): The power-off statements within each group are merged.
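The exhaustive search over policies for one group can be sketched as follows. The energy-cost model used here is a simplified assumption, not the one in Figure 9: each component pays its leakage for every cycle its power-off is delayed, each distinct issuing block pays one fetch-and-decode cost, and each component pays one execution cost. The function name, block labels, and numbers are hypothetical.

```python
from itertools import product


def best_placement(candidates, p_leak, e_fet_dec, e_exe):
    """Enumerate every policy for a group and return (cost, policy).

    candidates: {component: {block: delay in cycles if issued at that block}}
    p_leak: {component: leakage energy per cycle}
    """
    comps = sorted(candidates)
    best = None
    # one block per component: each instruction type is issued exactly once
    for choice in product(*(sorted(candidates[c]) for c in comps)):
        policy = dict(zip(comps, choice))
        leak = sum(p_leak[c] * candidates[c][b] for c, b in policy.items())
        # statements sharing a block are merged into one compound instruction
        issue = e_fet_dec * len(set(choice)) + e_exe * len(comps)
        cost = leak + issue
        if best is None or cost < best[0]:
            best = (cost, policy)
    return best


# Usage: A may be powered off at b1, or sunk to b2 or b3; B only at b3.
candidates = {"A": {"b1": 0, "b2": 1, "b3": 3}, "B": {"b3": 0}}
cost, policy = best_placement(candidates, {"A": 1.0, "B": 1.0}, 4.0, 1.0)
```

With these numbers, merging both power-off statements at b3 costs less than issuing them separately, because the saved fetch-and-decode energy outweighs three extra cycles of leakage in A.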
EXPERIMENTAL RESULTS

Platform
We used a DEC-Alpha-compatible architecture with the power-gating controls and instruction sets as described in Figure 3 as the target architecture for our experiments. The proposed leakage-power-reduction framework was incorporated into the compiler tool with SUIF [Stanford Compiler Group 1995] and Machine-SUIF [Smith 1998], and evaluated by the Wattch simulator with a 0.10-μm process parameter and a 1.9-V supply voltage [Brooks et al. 2000]. Table III summarizes the baseline configuration of the simulator in our experiment. By default, the simulator performed out-of-order executions. We used the "-issue:inorder" option in the configuration so that instructions would be executed in order, ensuring the correctness of power-gating controls. Nevertheless, our approach can also be applied to out-of-order issue machines if the additional hardware support proposed in You et al. [2006] is employed. The benchmarks used in our experiments were from the floating-point version of the DSPstone benchmark suite [Zivojnovic et al. 1994]. The average IPC (instructions per cycle) of the benchmarks is 0.36 with the configuration in Table III. Figure 10 illustrates the phases in the compilation and simulation framework. We incorporated the low-power optimization phase just before code generation; that is, after all traditional performance optimizations are performed. Hence, the additional phase has little or no influence on performance; it only inserts power-gating instructions and thus barely affects execution behavior. The implementation was based on SUIF2 and the Control Flow Graph (CFG) and Machine libraries from Machine-SUIF. Programs were first transformed from high-SUIF to low-SUIF format with SUIF, and then translated to the machine-level or instruction-level CFG form with Machine-SUIF.
The proposed four components of the low-power optimization phase (implemented as a Machine-SUIF pass) were then performed, and finally, the compiler generated DEC Alpha assembly code with power-gating controls. We also examined the breakdown of the overall compile time, as shown in Figure 11. It is observed that the proposed approach, CADFA with sink-n-hoist, contributes an average of 19.2% of overall compile time.
In addition, the power-gating mechanism is absent in the original DEC Alpha processor, and thus there are no power-gating instructions in its instruction set. We therefore encoded power-gating instructions as a set of special instructions that are recognized by the DEC Alpha assembler and linker: "stl $24, negative offset($31)", where negative offset is a negative integer that indicates the functional unit to be powered on or off. The instruction stores the value of register $24 into a memory address below zero, which is an invalid memory address ($31 is a constant zero register) and should never be generated by standard compilers. To prevent processors from accessing the invalid memory addresses, we made a small modification in Wattch: When the instruction decoder decodes such an instruction, it extracts the power-gating information and converts the instruction to an NOP (no-operation) instruction. Furthermore, since Wattch does not model leakage at the component level per se, we assumed that leakage power contributes 10% of the total power consumption. We also assumed that wake-up operations of power-gating controls have a 3-cycle latency and that it took 4 and 10 times the leakage energy per cycle to power a component off and on, respectively. The energy consumption of fetching and decoding a power-gating instruction was assumed to be 2 times the leakage power. Also, the baseline data was provided by the power estimation of Wattch cc3 with a clock-gating mechanism, which gates the clocks of unused resources in multiported hardware to reduce the dynamic power; however, leakage power is still dissipated.
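The decoder modification can be pictured with the following sketch, which recognizes the special store form in assembly text and turns it into a NOP carrying the offset. It is an illustration only: the actual change was made inside the Wattch decoder, and the function name, the returned tuples, and the regular expression are hypothetical. How a given offset maps to a functional unit is not specified here, so the offset is returned unchanged.

```python
import re

# Matches "stl $24, <negative integer>($31)"; $31 is the constant zero register.
PG_PATTERN = re.compile(r"^\s*stl\s+\$24\s*,\s*(-\d+)\(\$31\)\s*$")


def decode(instr):
    """Classify one assembly line as a power-gating NOP or a normal instruction."""
    match = PG_PATTERN.match(instr)
    if match is None:
        return ("normal", instr.strip())
    # Power-gating form: extract the offset and suppress the memory access.
    return ("nop", int(match.group(1)))
```

A store with a non-negative offset, or one that uses other registers, falls through as a normal instruction.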
Results and Discussion
The results from three types of experiment are compared: (1) no power-gating mechanism (baseline); (2) CADFA as from a previous work [You et al. 2002, 2006], in which only steps I, II, and III of Figure 4 were performed; and (3) sink-n-hoist analysis involving all phases in Figure 4. In addition, three policies for power-gating-instruction scheduling were proposed in step III of Figure 4 to deal with conditional branches in programs. Without loss of generality, we used the Min Path Sched policy to schedule power-gating instructions in this experiment.
Figures 12-14 give the compilation and simulation results of two approaches, CADFA and CADFA with sink-n-hoist, when the integer multiplier, floating-point adder, and floating-point multiplier are considered for power gating; the comparison baseline in these figures is the one without power-gating controls. Figure 12 presents the code-size growth due to power-gating instructions, which shows that sink-n-hoist reduces the code-size growth by about 47.8% on average (from 60.3% to 25.4%) compared with the method without the sink-n-hoist framework, namely, CADFA. Moreover, our scheme further reduces total energy consumption compared to that without the sink-n-hoist framework, because the block version of the power-gating instructions gives better power and performance characteristics than the pointwise version.
Figure 13 illustrates the normalized energy breakdown with conventional, CADFA, and CADFA with sink-n-hoist compilation strategies. The energy consumption was measured in 5 categories: the dynamic energy dissipated by clock circuits and that by the whole processor except for clock circuits, the leakage energy dissipated by power-gatable units and that by the whole processor except for power-gatable units, and the overhead energy consumption due to extra power-gating instructions. The overhead includes not only the energy dissipated by the power-gating instructions themselves, but also their impact on memory, buses, etc. The average overheads of CADFA and CADFA with sink-n-hoist are 0.98% and 0.20%, respectively, of which about 20% is attributable to power-gating operations and the other 80% to dissipation in the caches, fetch and decode units, buses, etc. Figure 13 shows that our scheme reduces average power by 11.9% compared with the conventional method.
Note that the average reduction in total energy does not seem high, but this is attributable to the fact that only 3 types of functional units (the integer multiplier, floating-point adder, and floating-point multiplier) are under power-gating control in this experiment. In fact, the CADFA method has already achieved average energy reductions in combined dynamic and leakage power of 70.4% and 72.6% for the adder and multiplier, respectively [You et al. 2002, 2006]. Figure 13 also shows that our scheme is superior to CADFA in terms of energy reduction, which is again due to the block version of power-gating instructions improving power consumption more than the pointwise version. In addition, we compiled the breakdown of the execution cycles in terms of functional unit activities. For the integer multiplier, floating-point adder, and floating-point multiplier, 76.4%, 76.2%, and 77.0% of idle cycles, respectively, were controlled with the power-gating mechanism by CADFA with the sink-n-hoist approach. Figure 14 shows that the performance impact of power-gating mechanisms is less than 5% for most of the benchmarks for both CADFA and CADFA with sink-n-hoist. The only exceptions are fir2dim and matrix, in which the number of power-gating instructions placed within loops is much greater than for the other benchmarks. Therefore, fir2dim and matrix execute more power-gating operations, and thus consume more execution cycles. The performance degradation is reduced by an average of about 64.81% over the CADFA method. Our method exhibits an advantage over the one without the sink-n-hoist framework due to the reduction in the number of power-gating instructions. Note that the performance penalty is less than the increase in the number of instructions, since most instructions are added outside the loop kernel. Nevertheless, the reduction in the number of power-gating instructions still yields a performance advantage.
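The gap between code-size growth and performance penalty can be illustrated by comparing static and dynamic instruction counts. The following sketch is not part of our toolchain: the block sizes, execution counts, and the function name `instruction_growth` are hypothetical values chosen only to show the arithmetic.

```python
def instruction_growth(blocks):
    """Return (static_growth, dynamic_growth) as fractions of the baseline.

    Each block is a dict with:
      size  - number of original instructions in the block
      added - number of power-gating instructions inserted into the block
      count - number of times the block executes
    """
    static_base = sum(b["size"] for b in blocks)
    static_added = sum(b["added"] for b in blocks)
    dynamic_base = sum(b["size"] * b["count"] for b in blocks)
    dynamic_added = sum(b["added"] * b["count"] for b in blocks)
    return static_added / static_base, dynamic_added / dynamic_base


# Hypothetical program: power-gating instructions sit outside the loop kernel.
outside_loop = [
    {"size": 10, "added": 3, "count": 1},     # prologue
    {"size": 20, "added": 0, "count": 1000},  # loop kernel
    {"size": 10, "added": 3, "count": 1},     # epilogue
]

# Same program, but two power-gating instructions land inside the kernel.
inside_loop = [
    {"size": 10, "added": 3, "count": 1},
    {"size": 20, "added": 2, "count": 1000},
    {"size": 10, "added": 3, "count": 1},
]

static_out, dynamic_out = instruction_growth(outside_loop)
static_in, dynamic_in = instruction_growth(inside_loop)
```

In the first hypothetical program the static growth is 15% while the dynamic growth is well below 1%, which mirrors the behavior of most benchmarks. In the second, the dynamic growth rises to roughly 10%, which is the situation described for fir2dim and matrix.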
In addition, Figure 15 gives the normalized energy breakdown in four categories (dynamic energy dissipated by the whole processor, leakage energy dissipated by power-gatable units, leakage energy dissipated by the whole processor except the power-gatable units, and overhead energy consumption due to extra power-gating instructions) with different configurations of the leakage contribution (from 10% to 90%). It shows that our technique is effective in helping leakage control at and beyond new technology generations. Generally, the effectiveness of our approach becomes greater as the leakage contribution rises. However, this trend stabilizes when the leakage contribution becomes greater than 70%, due to the dominating leakage and the growing impact of overheads in memory, buses, and other units that are not under power-gating control. Recall that we only attempt power-gating control on the integer multiplier, floating-point adder, and floating-point multiplier in this experimental setup.
RELATED WORK
Recent studies have attempted to reduce leakage power using integrated architecture and compiler power-gating mechanisms [Dropsho et al. 2002; Rele et al. 2002; You et al. 2002, 2006; Zhang et al. 2003, 2004]. Dropsho et al. [2002] proposed an analytical energy model for architecture-level analysis, and described the benefits of employing a dual-threshold-voltage technique to reduce subthreshold leakage current in the integer functional units of a processor. They also proposed a simple architecture design, called gradual sleep, to reduce the overhead of activating the sleep mode for smaller idle periods. The work of Rele et al. [2002] is based on a profiling approach that identifies those blocks in which functional units are expected to be idle (based on the execution frequencies of each basic block), and then inserts off and on instructions at the entry and exit points of such blocks, respectively. You et al. [2002] proposed a more formal compiler methodology that uses data-flow analysis to collect information on the activities of each functional unit at each point of a program, and then inserts power-gating instructions using a scheduling algorithm that deals with the uncertainty of idle periods due to conditional branches. They also proposed an architecture to make power-gating controls applicable to out-of-order issue processors [You et al. 2006]. Aside from controlling leakage energy of functional units, Zhang et al. [2004] presented a compiler-directed approach that inserts power mode instructions for cache lines to control leakage energy consumed in the instruction cache.
The previously described approaches have shown that leakage power can be effectively suppressed with help from compilers. However, there are concerns about the number of power-control instructions being added to programs as increasing numbers of components are equipped with power-gating controls in SoC design platforms. While power-gating instructions can significantly reduce leakage power, they incur recovery penalties and increase the execution time and code size of programs. Our sink-n-hoist compiler framework attempts to merge several power-gating instructions into a single compound instruction so as to reduce the number of power-gating instructions.
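To make the notion of a compound instruction concrete, the following sketch shows only the final merging step on a straight-line instruction sequence, once power-gating instructions of the same kind have become adjacent. It is an illustration under simplifying assumptions rather than the sink-n-hoist analysis itself: the tuple representation, the unit names, and the function `merge_adjacent` are hypothetical.

```python
def merge_adjacent(instrs):
    """Merge runs of adjacent power-gating instructions with the same action.

    Ordinary instructions are plain strings. A power-gating instruction is a
    tuple (action, units), where action is "on" or "off" and units is a
    frozenset of functional-unit names (a hypothetical representation).
    """
    merged = []
    for ins in instrs:
        is_pg = isinstance(ins, tuple)
        prev_is_pg = bool(merged) and isinstance(merged[-1], tuple)
        if is_pg and prev_is_pg and merged[-1][0] == ins[0]:
            # Same action as the previous power-gating instruction:
            # fold this unit set into one compound instruction.
            merged[-1] = (ins[0], merged[-1][1] | ins[1])
        else:
            merged.append(ins)
    return merged


sequence = [
    "mull",
    ("off", frozenset({"imul"})),
    ("off", frozenset({"fadd"})),
    "addq",
    ("on", frozenset({"fmul"})),
]
compound = merge_adjacent(sequence)
```

Here the two adjacent power-off instructions collapse into one compound instruction covering both units, so the sequence shrinks from five instructions to four.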
CONCLUSION
In summary, our experiments have demonstrated that the sink-n-hoist analysis framework proposed in this article improves code size, energy consumption, and performance. It reduces the overall energy consumption and code-size growth by an average of about 0.9% and 47.8%, respectively, compared with the CADFA scheme without our sink-n-hoist approach, and impacts performance by an average of less than 1%. Because the compiler phases are performed one after another, our framework provides a sound theoretical foundation capable of working with other improvements, such as adding more slackness for low power. We are currently in the process of incorporating more components (such as cryptography modules) into our architecture and simulator. We expect that our scheme will be even more beneficial as more extensible modules are equipped with power-gating controls in SoC design platforms.
