Power leakage constitutes an increasing fraction of the total power consumption in modern semiconductor technologies. Recent research efforts also indicate architecture, compiler, and software participations can help reduce the switching activities (also known as dynamic power) on microprocessors. This raises interests on the issues to employ architecture and compiler efforts to reduce leakage power (also known as static power) on microprocessors. In this paper, we investigate the compiler analysis techniques related to reducing leakage power. The architecture model in our design is a system with an instruction set to support the control of power gating in the component levels. Our compiler gives an analysis framework to utilize the instruction to reduce the leakage power. We present a data flow analysis framework to estimate the component activities at fixed points of programs with the consideration of pipelines of architectures. We also give the equation for the compiler to decide if the employment of the power gating instructions on given program blocks will benefit the total energy reductions. As the duration of power gating on components on given program routines is related to program branches, we propose a set of scheduling policy include Basic Blk Sched, MIN Path Sched, and AVG Path Sched mechanisms and evaluate the effectiveness of those schemes. Our experiment is done by incorporating our compiler analysis and scheduling policy into SUIF compiler tools [32] and by simulating the energy consumptions on Wattch toolkits [6] . Experimental results show our mechanisms are effective in reducing leakage powers on microprocessors.
Introduction
The demands of power-constrained mobile and embedded computing applications increase rapidly. Reducing power consumption hence becomes a crucial challenge for today's software and hardware developers. While maximization of
The work was supported in part by NSC-90-2218-E-007-042, NSC-90-2213-E-007-074, NSC-90-2213-E-007-075, MOE research excellent project under grant no. 89-E-FA04-1-4, and MOEA research project under grant no. 91-EC-17-A-03-S1-0002 of Taiwan.
battery life is an obvious goal, the reduction of heat dissipations is important as well. The reduction of power consumptions is also with the similar objective as the reduction of heat dissipations. Minimization of power dissipation can be considered at algorithmic, architectural, logic and circuit levels [11] . Studies on low power design are abundant in the literature [3, 4, 15, 19, 27, 29, 37] in which various techniques were proposed to synthesize designs with low transitional activities.
Recently, new research directions in reducing power consumptions have begun to address the issues on the aspect of architecture designs and on software arrangements at instruction-level to help reduce power consumptions [5, 12, 20, 24, 25, 33, 34, 35] . The architecture and software efforts to reduce energy consumptions in the recent attempt have been primarily on the dynamic component of power dissipation (also known as dynamic power). The energy, E, consumed by a program, is given by E = P × T , where T is the number of execution cycles of the program [25] and P the average power. The average power P is given by P =
where C is the load capacitance, V dd the supply voltage, f the clock frequency, and E the transition count. In order to reduce the dynamic power, several research works have been proposed to reduce the dissipations. For example, software re-arrangements to utilize the value locality of registers [12] , the swapping of operands for booth multiplier [25] , the scheduling of VLIW instructions to reduce the power consumption on the instruction bus [24] , gating clock to reduce workloads [20, 34, 35] , cache sub-banking mechanism [33] , the utilization of instruction cache [5] , etc.
As transistors become smaller and faster, another mode of power dissipation has become important. This is static power dissipation or the leakage current in the absence of any switching activities. For example, consider two Intel's Pentium III processors manufactured on 0.18µm process, the Pentium III 1.0 GHz and the Pentium III 1.13 GHz [21] . The datasheet lists the 1.0 GHz processor has a total power dissipation of 33.0 Watts and a deep sleep (i.e., static) power of 3.74 Watts while the maximum power dissipation at 41.4 Watts and the static power at 5.40 Watts for the 1.13 GHz one. The static power is up by 44% and comprises 13% of the total power dissipation while the total power is increased by only 25%. Figure 1 and Figure 2 show the growing ratio of static power among the total power [14, 36] . Figure 1 gives the growing trend of static power. It's growing in a fast paste. Figure 2 again shows the the growing trend of static power in terms of temperatures in the hardware devices. This raises the importance of reducing static power dissipations. Recently, academic results have tried to characterize the engineering equation and cost model for analyzing static powers [14, 36] . This is important, as the architecture designers and system developers can then deploy the architecture and software designs to reduce the static power according to the cost model. Previously, the availability of the cost equation for dynamic powers have prompted fruitful research results in the efforts to reduce dynamic power. The recent result in characterizing static power has the following equation. an average device,Î leak is the technology parameter describing the per device subthreshold leakage [8] .
In this paper, we investigate the compiler analysis techniques to reduce the number of devices, N , in the static power equation above to try to ease the problem of leakage powers. The architecture model in our design is a system with an instruction set to support the control of power gating in the component levels. From the viewpoints of engineering equation for static power, we attempt to reduce the number of devices by turning them off when they are unused. Our work provides compiler solutions in giving analysis and scheduling for the power gating control at component levels. Our compiler gives an analysis framework to utilize the instruction to reduce the leakage power. A data-flow analysis framework is given to estimate the component activities at fixed points of programs with the consideration of pipelines of architectures. We also give the equation for the compiler to decide if the employment of the power gating instructions on given program blocks will benefit the total energy reductions. As the duration of power gating on components on given program routines is related to program branches, we propose a set of scheduling policy including Basic Blk sched, MIN Path Sched, AVG Path Sched mechanisms and evaluate the effectiveness of those schemes. Basic Blk Sched mechanism schedules the power gating instructions according to the component activities in a given basic block. MIN Path Sched mechanism schedules the power gating instructions by assuming the minimum length among plausible program paths. AVG Path Sched schedules the power gating instructions by assuming the average length among plausible program paths. MIN Path Sched and AVG Path Sched mechanisms proposed in our work are based-on a depth-first traversal scheme to look up the interval in inserting power gating intsructions for components to reduce static power. Our experiment is done by incorporating our compiler analysis and scheduling policy into SUIF compiler tools [31, 32] and by simulating the energy consumptions on Wattch [6] toolkits. Experimental results show our mechanisms are very effective in reducing leakage powers on microprocessors. This work is also a part of our efforts in DTC (design technology center) of our university to develop compiler toolkits [24, 17, 39, 18, 38, 10, 9] for high-performance and low-power micro-processors and SoC designs. 
Machine Architecture
The architecture model in our design is a system with an instruction set to support the control of power gating in the component levels. Figure 3 shows an example of our target machine architecture on which our optimization is based. We focus on the reduction of the power consumption of the certain function units by invoking the "Power Gating" technology. Power gating is analogous to clock gating; power gating turns off devices by switching off their supply voltage rather than the clock. It can be done by forcing transistors to be off or using multithreshold voltage CMOS technology (MTCMOS) to increase threshold voltage [8, 22, 30] . We build the experimental architecture within the Wattch simulation environment [6] . In the simulation environment, we can measure every CPU components's power consumption of the whole experimental program. Basically, this architecture is compatible with the DEC Alpha 21264 processor [13] . The length of an instruction in our experimental architecture is 32 bits. Memory addressing is byte address. As Alpha is a 64-bit processor and uses 64-bit data bus, our experimental architecture has 32 integer registers (R0 through R31), and each is 64 bits wide. The major difference of these two architectures is the additional "power gating" design in our experimental architecture.
Those power-gated function units in our experimental architecture are Integer Multiplier, Floating Point Adder, Floating Point Multiplier, and Floating Point Divider. The power gating of each function unit can be controlled by the "Power Gating Control Register" ("PGCR" for short). The PGCR is a 64-bit integer register. In this case, the only lowest 4 bits of this register can affect the power gating status. The 0th bit of the lowest 4 bits of the PGCR controls the power gating of the Integer Multiplier. Setting of the bit will cause the Integer Multiplier to be turned on. Clearing of the bit will turn off the corresponding function unit in the immediately following clock. The 1st bit of the 4 bits is for Floating Point Adder, the 2nd bit is for Floating Point Multiplier, and the 3rd bit of the 4 bits is for the Floating Point Divider. Worth to mention, the Integer ALU unit within architecture also takes response to execute general operation. And, it performs the data movement to the PGCR, too. As a result of the Integer ALU is always required, this function unit is always turned on. In addition, we invoke a new instruction in the simulation environment to specify the access direction of PGCR. This instruction can operate those 4 power gated function units at once by move a proper value from a general purpose register to the PGCR. Figure 3 is also the architecture model on which we carry out our experiments later in Section 5.
Component-Activity Data-Flow Analysis
In this section, we investigate the compiler analysis techniques to ease the problem of leakage powers. We present a data-flow analysis framework [2] for a compiler to analyze the inactive states of components on a microprocessor. The process collects the information of the utilization of components at various points in a program. We first construct basic blocks and control flow graphs of given programs. We then try to develop a data flow equation for the summary of component usages at given program points. To gather the data-flow information, we define comp gen [B] , comp kill [B] , comp in [B] , and comp out [B] for each block B.
We say a component-activity c is generated at a block B if a component is required for this execution, symbolized as comp gen [B] , and it is killed if the component is released by the last request, symbolized as comp kill [B] . We then create two groups of equations shown below. The first group of equations follows from the observation that comp in [B] is the union of activities arriving from all predecessors of B. The second group is the activities at the end of a block that are either generated within the block, or those entering at the beginning but not killed as control flows through the block. We have the data flow equation for these two groups below,
We use an iterative approach to compute the desired results of comp in and comp out after comp gen have been computed for each block. The algorithm is Figure 4 . This is an iterative algorithm for data flow equations [2] with the additions of resource management strcutures. A two-dimension array, called RemainingCycle, is used to maintain the number of required cycles to achieve requests for each component and block. In addtion, a resource utilization table is adopted to give the resource requirement for each instruction of given processors. The resource utilization table can be used to give the initial values of RemainingCycle. The remaining cycles of a component will be decreased by one for each propagation. Initially, both comp in and com kill are set to be empty. The iteration goes until the comp in (and hence the comp out) converges. As comp out [B] never decreases in size for any B, the algorithm will eventually halt while all comp out are steady. Intuitively, the algorithm propagates activities of components as far as they will go by simulating all possible executing paths of the program. This algorithm gives the state of utilization of components for each point of programs.
Leakage Power Reduction
In this section, we present a cost model for decisions if power gating control should be applied and a set of scheduling policies to place power-gating instruction sets for given programs.
Cost Model
With the utilization of components obtained from last section, we can insert power gating instructions into programs at proper points, the head/tail of an inactive block, to turn off/on useless components. This can reduce the leakage power. However, both shutdown and wakeup procedures take additional penalty, especially for wakeup process due to peak voltage. The following gives our cost model for deciding if the insertions of energy instructions will profit in energy consumptions.
where E turn of f (C) is the penalty of energy for shutting down component C, E turn on is the penalty of energy for waking up component C, BreakEven C is the break-even cycle for component C, and P leak saving (C) is the leakage power saving of component C per cycle by employing power gating controls. The lefthand side of the equation shows the energy consumed by shutdown and wakeup procedures, and the right-hand side shows the leakage energy consumed for a certain cycles. It will save power for power gating control only if the amount of power of shutdown and wakeup is less than the one at RHS.
While employing power gating, there is another thing we should note. It's the latency to turn a component on. Due to the high capacitance on the circuits, several clock cycles will be needed to bring the component back to the normal operating state. Butts et al. also illustrated that it takes about 7.5 cycles at 1 GHz to charge 5 nF to 1.5 V with 1A [8] . With this consideration, we enforce power gating on a component only when the size of its inactive block, i.e. the idle region, is larger than its break-even cycle and its latency to recover. Our cost model after incorporating latency issue is now as follows.
where Latency C is the power gating latency of component C. In addition, we will try to insert the on operations of the power-gating control to be ahead of the time for the use of the corresponding components to avoid the delay in the programs due to wakeup latency.
Scheduling Policies for Power Gating
With the component activity information gathered and the cost model for deciding if the power-gating instructions should be employed, we now give the scheduling mechanisms to place the power gating instructions for given programs. As the duration of power gating on components is related to conditional branches in programs, we propose a set of scheduling policy including Basic Blk Sched, MIN Path Sched, and AVG Path Sched to schedule power gating instructions. The details are given below.
A naive mechanism to control the power-gating instruction set will set the on and off instructions at each basic block according to the component activities gathered by the data flow equation in the previous section. We call this scheme as Basic Blk Sched.
Next, an inactive block of a component may cross over more than two adjacent basic blocks. We use a depth-first-traveling algorithm to traverse all possible executing paths. In general, an inactive block will be turned off while the criteria reached. In the case of conditional branches occurred in the inactive block, there should be an another consideration to take action of power gating. This is because the size of the two inactive blocks, which are targets that the branch instruction points to, may be different. There may be a situation that one of the branchings benefits for power gating while the other doesn't. It will be against power reductions if we take control of power gating considering only one branch but the other branch is taken. Hence, we propose a MIN Path Sched policy to ensure that power gating control would be activated only if the inactive lengths of both branching paths exceed the power gating threshold, that is, the minimum length of those paths reaches the criteria for power gating. Figure 5 presents the details for MIN Path Sched algorithm. Given a control flow graph annotated with component utilizations, we define a integer variable, Count, to maintain the inactive length so far. It is passed around from parent blocks to successors and increased for each passing. The algorithm is recursive to guarantee the accuracy of Count and to ensure all paths being traversed. The algorithm is divided into four parts to handle conditions when encountering/non-encountering a conditional branch while the analyzing component is active/inactive.
1)
A conditional branch reaches and the component is inactive: under this condition, current traveling will halt until both two branches having done their travelings. And then, it makes a judgement on power gating when no branch encountered before and return the minimum inactive length of two branchings. 2) A conditional branch reaches and the component is active: under this condition, it takes control of power gating if necessary, starts two new travelings for both branchings, and finally returns the current inactive length. 3) Any statement except conditional branches reaches and the component is inactive: under this condition, it only continues the current traveling, that is, it only increases Count for passing and returns. 4) Any statement except conditional branches reaches and the component is active: like condition 2, it takes control of power gating if necessary and starts a new traveling for its successor. And finally, it returns Count.
Note that cares have to be taken for recursive boundaries to reach the backward edges for a loop. As a depth-first search algorithm can find out the loop, the cycle situation can be known in our algorithm. In a cycle situation, if the the whole instructions used in the cycle of a program fragment does not use the component in the search, we will assume the loop cycle is executed for once with the minimum path scheduling policy. If some instructions in the backward edge of a program fragment does use the component in the search, the the backward edge extend to that instruction will be counted for the program path. In addition, since our proposed algorithm is based on the depth-first-traveling, the complexity of our approach is O(N ) where N is the size of nodes in a control flow graph. Next, as the behavior of program branches depends on the structure and the input data of programs, some branches may be taken rarely or even not taken. To accomodate this issue, we propose an eclectic policy, called AV G P ath Sched, to schedule power gating instructions. The only difference between AV G P ath Sched and M IN P ath Sched is the judgements made in condition 1 above. AV G P ath Sched returns the average length of two branchings instead of the minimums. With this scheme, it will take advantage of power reduction if a unusual-taken branch returns a small value of Count which causes power gating mechanism inactivated. AV G P ath Sched mechanism can be approximately done by assuming the probabilities of all branches are 50%, by assinging branch probabilities at compiler time by programmers or compilers or by incorporating path profiling schemes to examine the probabilities of all branches.
Experimental Results

Platform
We use an Alpha-compatible architecture with power gating control and instruction sets described in Figure 3 of Section 2 as the target architecture for our experiments. The proposed data-flow analysis and scheduling policies are incorporated into the compiler tool with SUIF [32] and MachSUIF Library [31] and evaluated by Wattch simulator [6] . Figure 6 shows the structure of the framework including compilation and simulation parts. In the compilation part, we use the SUIF library to perform compiler optimization for performances and the MachSUIF Library to perform machine-dependent optimizations for Alpha processors. After those optimizations having been done, we then analyze component activities with our proposed data-flow equation and schedule power gating instructions to reduce leakage dissipation. Finally, the compiler generates the Alpha assembly code with power gating instructions. To be recognized by Alpha assembler and linker, power gating instructions are replaced by an instruction sequence within the Alpha instruction set with annotation information to simulators. We then use the Wattch power estimator, which is based on the SimpleScalar [7] architectural simulator, to simulate power dissipation and to evaluate our approach. The SimpleScalar is a simulator that provides execution-driven and cycle-accurate simulations for various instruction set architectures (include our target architecture, Alpha ISA). Both SimpleScalar and Wattch are now widely used for simulations to evaluate performance and power dissipation [6] . We also do refinements on the Wattch estimator to catch the instruction sequences for power gating control.
Results
The test suits in our experiment are the common benchmarks listed in FAQ of comp.benchmarks [1] . Figure 7 and Figure 8 Multiplier for nsieve application, respectively. In these figures, the X-axis represents the break-even cycle for our scheduling criteria and the Y-axis represents the power consumption. The leftest bar shows the power dissipated by function units while no power gating control being employed. This is the results of traditional clock gating mechanism provided by the Wattch power estimator. This is the version we use as the base version for comparison. The clock gating mechanism gates the clocks of those unused resources in multi-ported hardware to reduce the dynamic power. However, there is still static power leaked. Wattch assumes that clock-gated units dissipate 10% of their maximum power, rather than drawing zero power. For example, the clock gating mechanism reduces about 30% of total power consumption against the one without clock gating for several SPECint95 and SPECfp95 benchmarks [6] . The rest bars of the figures give the power gating results for the proposed scheduling policies with different breakeven cycle. The results show that the power gating mechanism reduces a large amount of leakage power even if the penalty of power gating control is high (i.e., large break-even cycle). Note that we have incorporated the penalty of inserting power gating instructions into our power simulator, Wattch. Figure 9 and Figure 10 give the power consumption of the Floating Point Adder and the Floating Point Multiplier for various benchmarks while employing the power gating mechanism with break-even cycle 32. Once again, it is observed that the AV G P ath Sched benefited the most power reduction while 
Related Work
Several research groups have proposed and developed hardware techniques to reduce dynamic and static power dissipation in recent. Recent work by Powell et al. architectural and circuit-level techniques to reduce the power consumption in instruction caches [26] . The cache miss rate is used to determine the working set size of the application relative to that of the cache. Leakage power is then removed from the unused SRAM cells using gated-V dd transistors. Kaxiras et al. also attacks leakage power in cache memories. Policies and implementations for reducing cache leakage by invalidating and turning off cache lines when they enter a dead period are4 discussed [23] . This leads to power savings in the cache. Recently, we found the research work done by Zhang et al. giving a compiler approach which exploits schedule slacks in VLIW architectures to optimize leakage and dynamic energy consumption [40] . Gupta et. al gave experimental results in using software for power gating [28] . Those two work are concurrent work to our research work in compiler solutions for leakage power reduction. Note that a key part of our solution in analyzing the component activities filed a patent in Taiwan by us dated back to the year of 2000. Comparing with those two concurrent work, we have speciality in providing data flow analysis for component activities of programs. Our analysis crosses the boundary of basic blocks. In addition, we provide a family of scheduling policies in inserting energy instructions.
Conclusions
In this paper, we investigated the compiler analysis techniques related to reducing leakage power. The architecture model in our design is a system with an instruction set to support the control of power gating in the component levels. We presented a data flow analysis framework to estimate the component activities at fixed points of programs with the consideration of pipelines of architectures. A set of scheduling policy including Basic Blk Sched, MIN Path Sched, and AVG Path Sched mechanisms were proposed and evaluated. Experimental results show our mechanisms are effective in reducing leakage powers on microprocessors.
