Abstract
Introduction
The embedded microprocessor domain is a rapidly growing segment of the microprocessor industry. Although the range of embedded applications is diverse, the microcontrollers and microprocessors for these applications share one design factor: power efficiency is treated as being as important as performance. Consequently, design considerations for embedded processors focus on both performance and power rather than on performance alone, as has been the case in traditional systems. Techniques to reduce dynamic power in circuits have been studied for a long time, but with shrinking technologies the leakage power dissipation in circuits has become substantial. It has been established that a chip's leakage power increases 5 times each generation whereas the active power remains constant [1]. For 70 nm process technology, the leakage power component is about 40% of the total power dissipated by circuits. Since the total energy consumed by a circuit is indicative of its battery life, various subthreshold leakage current reduction techniques have been employed to reduce the total leakage energy consumption of circuits in embedded microprocessors. A taxonomy of work on architectural-level leakage reduction for microprocessors is given in Figure 1. Several earlier works on architectural-level leakage reduction concentrated primarily on the memory subsystems [3, 4]. Subsequently, there have been attempts at investigating leakage reduction in the functional units of a datapath. These attempts include detection and activation of power-gating [2] periods for functional units at the microarchitectural level as well as at the compiler level. Among the microarchitectural techniques, reverse body-bias was investigated for leakage reduction in the Intel XScale microprocessor by Clark et al. [5].
An analytical energy model was developed for dual-threshold domino circuits to reduce leakage in functional units by Dropsho et al. [6]. Microarchitectural techniques for power-gating of execution units were investigated by Hu et al. [7], in which activation and deactivation of the functional units are guided by branch prediction techniques. Among the compiler-based techniques, Rele et al. [8], Talli et al. [9], Zhang et al. [10], and You et al. [11] employ dynamic code profiling and component usage analysis to identify portions of the program with low functional unit requirements and insert sleep instructions to apply power-gating. In [10], when power-gating was not possible, input vector control was suggested for further leakage reduction.
According to our understanding, the techniques proposed in the works cited above, although reported only for superscalar [8, 9, 11] and VLIW architectures [10], should in principle also work for embedded processors. However, in these works, the various energy components associated with applying power-gating have been assumed from architectural-level power models rather than computed precisely. To the best of our knowledge, no architectural leakage reduction work exists for embedded processors.
In this paper, we propose a compiler-based technique for reducing the leakage energy consumed by the functional units in an embedded microprocessor core. We investigate the program behavior of a set of benchmark applications for embedded systems. It is well known that applications for embedded processors have relatively small code sizes compared with those for general-purpose processors, and that most of this code is executed as part of iterative loops. Based on these observations, we focus on the iterative code structures in the program to detect long idle regions. Switches for power-gating the functional units are provided at the circuit level and are controlled by special instructions inserted into the code at compile time based on idle time behavior analysis. Extensive simulations were performed with the ARM processor as the target architecture model, using a synthesized library of functional units with power-gating capability built with 70 nm MOSIS technology files. Experimental results are presented for two benchmarks from each category in the MiBench suite [18] and indicate a leakage energy reduction of 34% on average.
The remainder of the paper is organized as follows. Section 2 describes the framework for power-gating adopted in this work. The target architectural model with power-gating capability, and the modifications required to the compiler are discussed in detail. Section 3 presents the design of functional units with power-gating capability, the exper-
Framework for Power-Gating
The framework for power-gating used in this work is shown in Figure 2. The application source is first statically analyzed for functional unit requirements. The application program is also dynamically profiled using the inputs supplied with the benchmarks. At the idle functional unit subgraph identification stage, for each functional unit, the subgraphs of the source program in which it is not used are identified. At the same stage, the power-gating details of the functional units are used to decide where to insert the sleep instructions. Finally, a cycle-accurate simulator for the power-gated architecture is used to calculate the leakage energy savings. In this section, the power-gated architecture and the components of the compiler extensions are discussed. Figure 3 shows the block diagram of a generic ARM architecture [12], which is modified to enable power-gating of the functional units. A floating point unit has also been added for the floating point instructions supported by the ARM instruction set. The functional units are designed such that their activation takes one clock cycle. The details of the functional units are discussed in Section 3. The instruction pipeline is considered to be a generic five-stage pipeline with a diversified execution pipe. By a diversified execution pipe, we mean that there are parallel sub-pipelines employing different functional units in the execute stage(s).
Power-gated Architecture Model
Deactivating the functional units is carried out using the sleep instruction. A Sleep Control Register (SCR) is added to the instruction decode logic. Each functional unit that needs to be deactivated is supplied as an operand to the sleep instruction. Since each functional unit can be required to be either activated or deactivated, one bit is required to specify each operand. A '1' indicates that the functional unit has to be deactivated. A '0' indicates the functional unit should not be deactivated. When a sleep instruction is decoded, a '1' is written into the corresponding bit location in the sleep control register and a '0' is ignored.
The activation of the functional units is triggered at the decode stage itself. The functional unit gets activated during the cycle following the one in which the instruction enters the operand fetch stage or the dispatch stage in the pipeline. Therefore, by the time the instruction enters the execute stage, the functional unit is active and can perform useful computation. This avoids the need for separate wakeup instructions.
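The register behavior described above can be sketched as a small software model. This is only an illustration: the bit positions assigned to the units, the class name `SleepControlRegister`, and its method names are assumptions made for the example and are not taken from the architecture itself.

```python
# Hypothetical one-bit-per-unit operand encoding; the bit order is an assumption.
ALU, SHIFTER, MUL, FP_ADD, FP_MUL, FP_DIV_SQRT = (1 << i for i in range(6))


class SleepControlRegister:
    """Toy model of the SCR: a set bit means the unit is power-gated."""

    def __init__(self):
        self.bits = 0  # all functional units start active

    def sleep(self, operand):
        # A '1' in the operand is written into the SCR; a '0' is ignored,
        # so units that are already asleep stay asleep.
        self.bits |= operand

    def decode(self, needed):
        # Decoding an instruction that needs a gated unit clears its bit,
        # which stands in for the automatic one-cycle activation.
        self.bits &= ~needed

    def is_gated(self, unit):
        return bool(self.bits & unit)


scr = SleepControlRegister()
scr.sleep(MUL | FP_MUL)   # gate both multipliers
scr.sleep(FP_ADD)         # zeros for MUL and FP_MUL leave them gated
scr.decode(MUL)           # an integer multiply reaches decode
print(scr.is_gated(MUL), scr.is_gated(FP_MUL), scr.is_gated(FP_ADD))
# prints: False True True
```

Because a '0' operand bit is ignored rather than treated as a wake-up request, successive sleep instructions only ever add units to the gated set; units leave it only when an instruction that needs them is decoded.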
Compiler Extensions
The main task of the compiler is to analyze the program behavior and predict regions where certain functional units are not expected to be required, so that power-gating can be applied. Thus, given a control flow graph (CFG) for the program, the objective is to find maximal subgraphs of the CFG in which the program is expected to spend a large amount of time and to insert sleep instructions for those functional units that are not required in those subgraphs.
Subgraphs enclosed within loops
In this work, we use the loops in the source program to identify maximal subgraphs for applying power-gating effectively. We introduce the notion of loop hierarchy trees (LHTs) to capture the nesting structure of the program segments. During the static code analysis phase, for each function in the source program, we create a forest of LHTs. Each vertex of an LHT denotes a loop in the source program and its descendants denote the loops that are nested within that loop. Each vertex is annotated with the functional unit requirement of the loop corresponding to that vertex. Figure 4(a) shows the control flow graph of the basic blocks in a sample function with the loops as indicated on the back edges. Figure 4(b) shows the corresponding loop hierarchy trees for the CFG shown in Figure 4(a). Loop $l_1$ has 3 nested child loops, $l_2$, $l_3$, and $l_4$. Among these, $l_4$ further has $l_5$ as a nested loop. Similarly, $l_7$ is a loop nested within loop $l_6$. For the sake of simplicity, the functional unit requirements for the basic blocks in the CFG and the loops in the LHTs are not shown in the figure. An essential property of an LHT is that it gives a partial ordering of the subgraphs to be considered in the CFG for the program. In Figure 4(b), loop $l_4$ encloses a bigger subgraph than $l_5$ does. However, a definite ordering of the sizes of the subgraphs enclosed by $l_3$ and $l_4$ cannot be established.
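As a rough illustration of the nesting just described, the forest for the sample function can be built with a small tree type. The `Loop` class, its `units` annotation field, and the `encloses` helper are hypothetical names chosen for this sketch; only the nesting of the seven loops comes from the text.

```python
from dataclasses import dataclass, field


@dataclass
class Loop:
    name: str
    units: frozenset = frozenset()  # functional units the loop requires (annotation)
    children: list = field(default_factory=list)

    def add(self, child):
        self.children.append(child)
        return child

    def encloses(self, other):
        """True if `other` is nested, at any depth, inside this loop."""
        return any(c is other or c.encloses(other) for c in self.children)


# Nesting taken from the description of the sample function.
l1, l6 = Loop("l1"), Loop("l6")
l2, l3, l4 = (l1.add(Loop(n)) for n in ("l2", "l3", "l4"))
l5 = l4.add(Loop("l5"))
l7 = l6.add(Loop("l7"))
forest = [l1, l6]

print(l4.encloses(l5))                     # True: l4 is ordered above l5
print(l3.encloses(l4), l4.encloses(l3))    # False False: l3 and l4 are incomparable
```

The `encloses` relation is the partial order mentioned above: it relates a loop to everything nested inside it and says nothing about sibling loops.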
During the dynamic profiling of the source program, the following information is gathered: (1) the vertices in the CFG of the program are annotated with the corresponding basic block execution frequencies, and (2) the vertices in the loop hierarchy trees are annotated with the corresponding loop execution frequencies.
We define the average length of a loop in terms of the average number of instructions that are executed from within the loop during the dynamic profiling stage. More formally, let $G_i = (V_i, E_i)$ denote the control flow graph of the $i$-th function in the source program such that each vertex in $G_i$ is a basic block. Let $l_i$ be any loop in $G_i$, let $S(l_i) \subseteq V_i$ denote the set of vertices in $l_i$, and let $C_i$ denote the set of children loops of $l_i$. Then the average length of the loop $l_i$, $L_{avg}(l_i)$, is defined as the average number of instructions executed as part of that loop during the dynamic profiling stage and is given by the recursive relation in equation (1). Thus, the $L_{avg}$ values for all the loops in the CFG can be calculated by running a breadth first search from the root of each tree in the LHTs and applying memoization (see footnote 5). It can be noted from the definition of a nested loop that $f(l_i) \le f(c_i)$ for all $c_i \in C_i$, i.e., a nested loop executes at least as frequently as the loop that encloses it. For the functions that are called from within a loop, the entire function is considered as a basic block in equation (1). Also, each function in the source program is separately analyzed for insertion of sleep instructions.
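The memoized computation over an LHT might look like the sketch below. The recursion used here is an assumption made for illustration and is not claimed to be equation (1): it takes the average length of a loop to be the profiled instruction count of its own basic blocks plus the total contribution of its nested loops, divided by the loop's execution frequency. The dictionary layout and the function name `average_length` are likewise hypothetical.

```python
def average_length(loop, memo):
    """Assumed recursion: average instructions per iteration of `loop`.

    loop["blocks"] holds (execution_count, instruction_count) pairs for the
    basic blocks that belong to the loop but not to any nested loop.
    """
    name = loop["name"]
    if name not in memo:  # memoization: each loop is evaluated only once
        own = sum(count * size for count, size in loop["blocks"])
        nested = sum(c["freq"] * average_length(c, memo) for c in loop["children"])
        memo[name] = (own + nested) / loop["freq"]
    return memo[name]


# Illustrative profile data, not measurements from the benchmarks.
inner = {"name": "inner", "freq": 100, "blocks": [(100, 5)], "children": []}
outer = {"name": "outer", "freq": 10, "blocks": [(10, 4)], "children": [inner]}
memo = {}
print(average_length(outer, memo))  # 54.0
```

In this toy profile each outer iteration executes 4 of its own instructions and 10 inner iterations of 5 instructions each, which gives the 54 instructions per iteration reported by the sketch.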
Insertion of the sleep instructions
We define the threshold number of instructions, $L_{th}$ (see footnote 6), as the number of instructions whose execution period is long enough that keeping a functional unit $r$ deactivated compensates for the energy overhead of switching it off and on again.
where $\Delta_r$ = dynamic energy overhead in activation and deactivation of $r$, $\delta_r$ = leakage energy saved per unit time by keeping the functional unit $r$ deactivated, and $t_{clk}$ = clock period. Calculation of $\Delta_r$ for each functional unit is described in Section III B.
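A breakeven calculation consistent with these three quantities can be sketched as follows. The formula in the code is an assumption derived from the definitions only: the overhead is recovered after $\Delta_r / \delta_r$ seconds, which is converted to cycles with $t_{clk}$ and to instructions with an assumed cycles-per-instruction factor. The function name, the `cpi` parameter, and the numeric values are illustrative and are not taken from the measurements.

```python
import math


def threshold_instructions(delta_r, leak_saved, t_clk, cpi=1.0):
    """Breakeven length in instructions (assumed form, see the lead-in).

    delta_r    -- energy overhead of one deactivate/activate pair, in joules
    leak_saved -- leakage energy saved per unit time while gated, in watts
    t_clk      -- clock period, in seconds
    cpi        -- assumed average cycles per instruction
    """
    breakeven_time = delta_r / leak_saved   # seconds needed to repay the overhead
    breakeven_cycles = breakeven_time / t_clk
    return breakeven_cycles / cpi


# Illustrative numbers only: 1 nJ overhead, 1 mW saved, 10 ns clock period.
l_th = threshold_instructions(1e-9, 1e-3, 10e-9)
print(math.isclose(l_th, 100.0))  # True
```

A unit with a larger switching overhead or a smaller leakage saving gets a larger threshold, so it is gated only across longer idle regions.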
To find the locations to insert sleep instructions, we perform a depth first traversal in each LHT starting at its root. This traversal is done once for each functional unit and is terminated as soon as it is found that the entire loop corresponding to the vertex does not use the functional unit. Let the normalized average length of loop $x$, $L_{norm}(x)$, be defined as,
where $f(x)$ = execution frequency of loop $x$ and $h(x)$ = the number of times loop $x$ is entered from a basic block outside of $x$. We perform the normalization to distinguish loops that iterate many times from those that iterate only a few times. If $L_{norm}(x)$ is greater than $L_{th}$, the basic block leading to the loop, $e(x)$, along with the functional unit, $r$, is added to the set of sleep instructions, $S$. After all the LHTs are inspected, the entries of $S$ are merged so that the sleep instructions to be inserted at the end of a particular basic block for different functional units are combined into one sleep instruction with multiple functional units as operands. Algorithms 1 and 2 present the pseudocode for the routines INSERT SLEEP and INSPECT.
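The traversal and the final merging step can be sketched in a few lines. This is a reading of the prose description, not a transcription of the pseudocode: the order of the tests inside `inspect`, the dictionary layout of a loop, and the precomputed `l_norm` values are assumptions made for the example.

```python
from collections import defaultdict


def inspect(x, r, l_th, S):
    """Depth-first search for loops that never use functional unit r."""
    if r not in x["units"]:
        # The whole loop, including nested loops, leaves r idle: stop here.
        if x["l_norm"] > l_th[r]:
            S.add((x["entry"], r))
        return
    for y in x["children"]:
        inspect(y, r, l_th, S)


def insert_sleep(forest, units, l_th):
    S = set()
    for root in forest:
        for r in units:
            inspect(root, r, l_th, S)
    # Merge entries so that each basic block gets a single sleep instruction.
    merged = defaultdict(set)
    for block, r in S:
        merged[block].add(r)
    return dict(merged)


# Illustrative loop tree; "units" of a loop includes those of its nested loops.
la = {"entry": "B3", "units": {"ALU"}, "l_norm": 500, "children": []}
lb = {"entry": "B7", "units": {"ALU", "MUL"}, "l_norm": 800, "children": []}
outer = {"entry": "B1", "units": {"ALU", "MUL"}, "l_norm": 5000, "children": [la, lb]}
thresholds = {"ALU": 50, "MUL": 100, "FPU": 100}

plan = insert_sleep([outer], ["ALU", "MUL", "FPU"], thresholds)
print(plan == {"B1": {"FPU"}, "B3": {"MUL"}})  # True
```

In this example the floating point unit is idle in the whole outer loop, so it is gated once before the loop is entered, while the multiplier can only be gated before the nested loop that does not use it.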
Experimental Results
In this section, we describe the experimental setup and the results. First, we present the specifications of the functional units created in this work. Then we describe the calculations of various components of energy which are required by the compiler for insertion of sleep instructions.

Algorithm 1 INSERT SLEEP(F)
for all tree T ∈ F do
    for all functional unit r ∈ R do
        INSPECT(root(T))
    end for
end for

Algorithm 2 INSPECT(x)
    ...
        S ← S ∪ {e(x), r}
    ...
        for all y ∈ C_x do
            INSPECT(y)
        end for
    end if
end if

Footnote 5: Memoization is a top-down dynamic programming strategy.
Footnote 6: This is also termed as the breakeven period in literature [6].
Design Details of Power-Gated Functional Units
We created functional units in compliance with the functional specifications of the functional units of the ARM processor. To estimate the overhead and steady-state energy components, these functional units are described structurally using lower-level components such as a 4-bit carry lookahead (CLA) module, 8-bit registers, 8-bit shift registers, and 8-bit multiplexers, which are constructed at the circuit level and characterized for power using 70 nm MOSIS technology model files. The power-gated versions of all these components employ an appropriately sized footer sleep transistor. For example, the 32-bit adder in the integer ALU is constructed out of 8 stages of the 4-bit CLA module, and the Booth multiplier consists primarily of a 32-bit adder, 32-bit multiplexers, and 32-bit shift registers. The power-gated functional units consist of the power-gated components, while the regular functional units consist of components without any sleep transistors. The integer functional units comprise the ALU, a Barrel Shifter, and a Booth Multiplier. The floating point functional units have been modeled to support the IEEE standard 754 single-precision scalar operations supported by the VFP9-S Vector Floating-point Coprocessor [14]. They have been implemented separately as an arithmetic unit, a multiply unit, and a divide and square root unit. We construct the single-precision functional units using the parameterized floating point unit designs presented in [15, 16]. Table 1 describes the latency specifications of these components for a clock period of 10 ns. These latency values are used during cycle-accurate simulation with the embedded benchmarks.
Energy Component Calculations
Figure 5(a) shows the footer sleep transistor configuration and Figure 5(b) illustrates the significant time intervals for the calculations of the various components of energy for the power-gated structural components of the functional units. $V_{vrgnd}$ refers to the voltage at the virtual ground. $P_{inst}$ refers to the instantaneous power dissipated by the circuit. At time $t_0$, when the sleep transistor is switched OFF, $V_{vrgnd}$ starts to rise and reaches $V_{dd}$ by time $t_1$. From time $t_1$ to $t_2$, all the capacitances in the circuit reach their final charge. During this interval the circuit still dissipates instantaneous power, which approaches the steady-state leakage power in its OFF state, $P_{off}$. Thus, the overhead energy in deactivating the circuit is given by the area under the curve for $P_{inst}$ during the interval $t_0$ to $t_2$:

$$\int_{t_0}^{t_2} P_{inst}(t)\,dt$$

Similarly, the overhead energy in activating the circuit is the total energy dissipated during the interval $t_3$ to $t_5$:

$$\int_{t_3}^{t_5} P_{inst}(t)\,dt$$

At time $t_5$, the circuit starts to dissipate $P_{on}$. The energy components of the structural components are calculated using HSPICE simulations on the 70 nm model files available from MOSIS. Since the leakage power is proportional to the number of transistors in a circuit, the energy components of each functional unit are estimated as the sum of the energy components of its constituent components.
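The area-under-the-curve calculation can be illustrated numerically on sampled power values. The sketch below uses the trapezoidal rule, which is an assumption about how a sampled waveform might be integrated and not a description of the HSPICE flow; the function name and the sample values are made up for the example.

```python
def overhead_energy(times, power):
    """Trapezoidal estimate of the area under sampled P_inst(t).

    times -- sample instants in seconds, in increasing order
    power -- instantaneous power at each instant, in watts
    """
    return sum(
        (t_next - t_prev) * (p_prev + p_next) / 2
        for t_prev, t_next, p_prev, p_next in zip(times, times[1:], power, power[1:])
    )


# Illustrative samples: power decays from 4 W to a 2 W plateau over 2 s.
deactivation = overhead_energy([0, 1, 2], [4, 2, 2])
print(deactivation)  # 5.0
```

Applying the same routine to the samples of the activation interval and adding the two results gives one possible estimate of the combined switching overhead that the breakeven threshold has to cover.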
Cycle-Accurate Simulation Results
We used the SimpleScalar-ARM distribution [17] for our experiments. The object code for the program is disassembled using the gcc tools for ARM available with the distribution. Static code analysis is performed on the CFG generated for the functions in the source program. We do not inspect standard library functions for insertion of sleep instructions. We used sim-profile to perform dynamic profiling of the program and, finally, used sim-outorder to perform cycle-accurate simulation after inserting the sleep instructions. The configuration used for sim-outorder is that of the Intel SA-1 microarchitecture [13], which has a fetch and decode width of 1. Two benchmarks from each category in the MiBench test suite were selected for the experiments.
We compare the leakage energy consumed by a processor with power-gated functional units to that of a processor in which the functional units are not power-gated. The results are tabulated in Table III. Only some of the benchmarks require floating point computations, and embedded processor architectures are often designed with floating point units only if the targeted applications require them. Therefore, for benchmarks that do not have floating point computations, we performed simulations and report savings for a processor core both with and without the floating point units. The deactivation of functional units using power-gating results in leakage energy savings without incurring any significant performance degradation. Since the performance degradation was much less than 0.1%, we did not include those numbers in the table. As can be seen in Table III, the proposed methodology yields average energy savings of 34% for the benchmark programs.
Conclusions
In this paper, we presented a detailed description of the proposed compiler-based technique for applying power-gating in embedded processors. In this work, the various energy components involved in the application of power-gating are calculated with a high degree of accuracy for each functional unit in the processor core. Thus, the final experimental results reported, which are based on HSPICE calculations, are more accurate than those reported earlier. We also discuss how power-gating can be accomplished without any significant overhead or loss of performance by providing a robust mathematical apparatus. An important aspect of our approach is that the design of the functional units is based on the assumption that the activation of a functional unit can be accomplished in a single clock cycle prior to the cycle in which the unit is needed. The advantage is a reduction in overhead, since this assumption removes the need for a separate instruction to activate the functional units. Since embedded processors have much less hardware complexity than superscalar processors, this is a reasonable assumption. If the processor has an extremely fine-grained pipeline, more than a single cycle may be needed for activation.
