Power is becoming a critical constraint for designing embedded applications. Current power analysis techniques based on circuit-level or architectural-level simulation are either impractical or inaccurate to estimate the power cost for a given piece of application software. In this paper, an instruction-level power analysis model is developed for an embedded DSP processor based on physical current measurements. Significant points of difference have been observed between the software power model for this custom DSP processor and the power models that have been developed earlier for some general-purpose commercial microprocessors [1, 2]. In particular, the effect of circuit state on the power cost of an instruction stream is more marked in the case of this DSP processor. In addition, the processor has special architectural features that allow dual-memory accesses and packing of instructions into pairs. The energy reduction possible through the use of these features is studied. The on-chip Booth multiplier on the processor is a major source of energy consumption for DSP programs. A microarchitectural power model for the multiplier is developed and analyzed for further power minimization. In order to exploit all of the above effects, a scheduling technique based on the new instruction-level power model is proposed. Several example programs are provided to illustrate the effectiveness of this approach. Energy reductions varying from 26% to 73% have been observed. These energy savings are real and have been verified through physical measurement. It should be noted that the energy reduction essentially comes for free. It is obtained through software modification, and thus entails no hardware overhead. In addition, there is no loss of performance since the running times of the modified programs either improve or remain unchanged.
Introduction
Embedded computing systems are characterized by the presence of application-specific software running on specialized processors. These processors may be off-the-shelf digital signal processors or application-specific instruction-set processors (ASIPs) that have been specially designed for a certain class of algorithms. In most embedded applications, such as cellular phones or portable electronic devices, power becomes an important constraint in the design specification. However, there is very little available in the form of design tools to help embedded system designers evaluate their designs in terms of the power metric. At present, accurate power estimation tools are available only for the lower levels of the design: at the circuit level and, to a limited extent, at the gate level. For an embedded processor, circuit-level or gate-level simulation is too slow to be practical for evaluating the power consumption of software. These techniques often cannot even be applied due to the lack of circuit-level and gate-level information about the embedded processor. In [3, 4, 5], recourse is taken to architectural-level power simulation. This takes the software instructions as stimulus and sums up the current of the active modules in the processor for each simulation cycle. Here, the current assigned to each module is the cost of being active, and is usually constant, regardless of the operand value, the circuit state, and the correlation with the activities of other modules. In [6], architectural power analysis based on stochastic data modeling is proposed for greater accuracy for ASIC DSP applications. However, due to the higher levels of abstraction in these works, the power estimates are not very accurate. Further, lower-level internal details of the processors are still needed in order to assign power costs to modules.
All of the above problems can be overcome if the current drawn by the CPU during the execution of programs is physically measured. An instruction-level power analysis technique based on physical measurements has recently been developed by Tiwari et al. [1]. This paper discusses the application of this technique for the power analysis of a Fujitsu embedded DSP processor. This processor, referred to as the target DSP processor from here on, is used in several Fujitsu embedded applications, and is representative of a large class of DSP processors. The analysis results are used to derive an instruction-level power model that makes it possible to evaluate the power cost of programs that run on the target DSP processor. The results of the analysis also motivate several software power minimization techniques that exploit the power consumption characteristics of the processor. These are also described in the paper.
The instruction-level analysis technique had earlier been applied to two large general-purpose commercial microprocessors [1, 2]. Some points of significant difference between the power models of these two general-purpose processors and the target DSP processor have been identified. In particular, the effect of circuit state change is found to be more marked in terms of power consumption for the target DSP processor than for the previous two processors. This effect had limited impact for the other processors, and it was felt that this may be characteristic of large general-purpose processors, as opposed to smaller, specialized processors like DSPs. The significant impact of circuit state in the case of the target DSP processor suggests that an appropriate scheduling of instructions can lead to a reduction in the power cost of programs [7]. Furthermore, it is seen that for certain instructions that involve the ALU datapath, even non-adjacent instructions can cause changes in the circuit state that contribute to significant power consumption. This effect was not observed in the previous work. If this effect is not considered for the target DSP processor, power for certain programs can be underestimated, e.g., by about 13.6% for an example described later.
The issue of scheduling for power minimization has been explored to a limited extent in earlier works [8, 9]. The study in [8] shows that faster programs consume less energy, so optimizing software performance through instruction scheduling can minimize the energy consumption. In [9], only the power consumed by the controller of a processor is targeted for minimization by instruction scheduling. The power cost of different instruction schedules is estimated by counting transitions on an RTL-level model of the control path. This is a rough measure of the actual power cost. Furthermore, the increase in energy cost due to longer schedules is not considered. The instruction scheduling technique that we propose overcomes these limitations since it is based on actual energy costs obtained through physical measurements, and is therefore more effective.
The DSP processor has some special architectural features that were originally provided to reduce the number of cycles for programs that can utilize them. However, our analysis shows that these features are also very effective for reducing the energy cost of programs. The first feature allows double data transfers from different memory banks to registers in one cycle [10], and the other packs two instructions into a single codeword. Automated techniques for effectively exploiting these features for energy reduction are presented. Another avenue for power reduction is based on the observation that the on-chip Booth multiplier is a major source of energy consumption for DSP programs. This motivates the development and analysis of a microarchitectural power model for the multiplier. Based on this model, an effective technique for local code modification by operand swapping is proposed to further reduce power consumption.
Finally, an energy minimization methodology is presented to collectively apply the above techniques to any given piece of code. Experimental results on several example DSP programs show energy reductions ranging from 26% to 73%. The experimental setup used allows for immediate verification of results. Thus, the energy savings reported in the paper have been physically validated. It should also be noted that the energy reduction essentially comes for free. It is obtained through software modification, and thus entails no hardware overhead. In addition, there is no loss of performance since the running times of the modified programs either improve or remain unchanged.
This paper is organized as follows. Section 2 highlights the architectural features of the target processor relevant to our study on power analysis and optimization. Section 3 explains the experimental setup for current measurement. Section 4 develops an instruction-level power model for the target DSP processor based on physical current measurement, and highlights its major differences from previous models. Section 5 discusses the proposed energy optimization techniques based on memory bank assignment, instruction packing, instruction scheduling, and operand swapping for the Booth multiplier. Section 6 presents the experimental results, and Section 7 summarizes the contributions and future directions of this work.
Architecture of the Target DSP Processor
The target DSP processor used for our study is a Fujitsu 3.3V, 0.5um, 40MHz CMOS processor. It implements several special architectural features to support embedded DSP applications. Some of these features are usually not seen in general-purpose microprocessors. The basic architectural features of the processor that are relevant to the rest of the paper are listed below. The impact on power consumption of some of these will be studied in detail when the features are discussed further in the following sections.
Reduced number of pipeline stages: 2. In the first stage, the instruction is fetched and decoded; in the second stage, the necessary operands are accessed, the result is computed and finally written back.
Limited number of data registers: 4. These are 24-bit registers A, B, C, and D.
8 index registers to support sequential access to data arrays in the memory banks, a data structure commonly found in DSP applications.
A fast MAC (Multiply-and-ACcumulate) unit. A MAC operation is completed in one cycle by using a Booth multiplier. Power analysis and optimization for the MAC unit is discussed in Section 5.4.
Simple instruction set restricted by the special DSP architecture. For example, the Booth multiplier always takes the operands from registers A and B. Explanation and classification of the instruction set will be given in Section 4.2.
Power management for the Booth multiplier. The two operands of the Booth multiplier are latched to retain their old values to reduce unnecessary switching in the multiplier. Its effect on the proposed instruction-level power model is discussed in Section 4.1.
Two on-chip memory banks: RAMA and RAMB. If two data operands are in different banks, they can be fetched simultaneously into registers by a double-transfer instruction. A low-energy memory bank assignment technique is proposed in Section 5.1 to exploit this feature.
Packed instruction. In general, an ALU-type instruction and a data transfer instruction can be packed into a single instruction codeword for simultaneous execution. Energy optimization through the use of this feature is discussed in Section 5.2. 
Current Measurement
The average power $P$ consumed by a processor while running a certain program is given by $P = I \times V_{dd}$, where $I$ is the average current and $V_{dd}$ is the supply voltage. The energy consumed by a program, $E$, is given by $E = P \times T$, where $T$ is the execution time of the program. This in turn is given by $T = N \times \tau$, where $N$ is the number of clock cycles and $\tau$ is the clock period.
Since the applications of the target DSP processor run on the limited energy available in a battery, energy consumption is the focus of attention. While we will attempt to retain the distinction between the terms "power" and "energy", the term "power" is often used to refer to "energy", in adherence to common usage. Now, $V_{dd}$ and $\tau$ are known and fixed. Therefore, $E$ is proportional to the product of $I$ and $N$. Given the number of execution cycles, $N$, for a program, we just need to measure the average current, $I$, to calculate $E$. The product $I \times N$ is the measure used to compare the energy cost of programs in this paper.
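As an illustration of these relations, the following minimal Python sketch computes both the relative measure (average current times cycle count) and the absolute energy. The function names are hypothetical, and the default supply voltage and clock frequency are simply the 3.3V and 40MHz figures quoted for the target processor; the clock period is taken as the reciprocal of the clock frequency.

```python
def relative_energy(avg_current_ma, cycles):
    """Relative energy measure I x N, in mA-cycles (hypothetical helper)."""
    return avg_current_ma * cycles


def energy_joules(avg_current_ma, cycles, vdd_volts=3.3, clock_hz=40e6):
    """Absolute energy E = I x Vdd x N x (clock period), in joules.

    The defaults are the supply voltage and clock rate quoted for the
    target processor; pass other values for a different device.
    """
    clock_period_s = 1.0 / clock_hz
    avg_current_a = avg_current_ma / 1000.0
    return avg_current_a * vdd_volts * cycles * clock_period_s


# Illustrative numbers only (not measurements): two programs with the same
# average current, one of which needs half the cycles, differ by 2x in energy.
slow = relative_energy(40.0, 200)
fast = relative_energy(40.0, 100)
```

Because the supply voltage and clock period are fixed, comparing two programs by `relative_energy` ranks them exactly as `energy_joules` would.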
The current measurement setup is illustrated in Figure 1. This processor is part of a personal computer evaluation board with several 3.3V/5V level converters and instruction memory operating at 5V. The DSP chip can be programmed through a monitor program running on a personal computer. Using the monitor, the DSP instructions can be downloaded to the off-chip instruction memory, while the input data can be stored in the two on-chip memory banks of the DSP processor. The current drawn by the DSP processor is measured through a standard off-the-shelf, dual-slope integrating digital ammeter (indicated by the circle A in Figure 1), which is connected between a 3.3V power supply and the Vdd pins of the target DSP chip. The layout of the particular evaluation board makes it difficult to isolate the power connections to the external memory chips, and thus the measurement results do not include this current.
If a program completes execution in a short time, a current reading cannot be visually obtained from the ammeter. To overcome this, the programs being considered are put in infinite loops and current readings are taken. The current consumption in the CPU varies in time depending on what instructions are being executed. But since the chosen ammeter averages current over a window of time (100ms), if the execution time of the program is much less than the width of this window, a stable reading will be obtained. This is illustrated in Figure 2.
The main limitation of this approach is that it will not work for longer programs, since the ammeter may not show a stable reading. However, this is not a limitation for the development of an instruction-level power model, since short instruction sequences are all that is needed. To determine the base cost of a given instruction, a loop consisting of a sufficient number of instances of the instruction, for example, 200, is needed. Similarly, to assign energy costs to specific inter-instruction effects like circuit state, pipeline stalls, etc., appropriate short instruction sequences are needed. This is discussed in greater detail in [1]. For these purposes, the inexpensive approach of using a digital ammeter works very well. It should be stressed that the main concepts described in this paper are independent of the actual method used to measure average current. The results of the above approach have been validated by comparisons with other current measurement setups. But if sophisticated data-acquisition-based measurement instruments are available, the measurement method can be based on them, if so desired.
Power Analysis for the Target DSP Processor
A comprehensive instruction-level power analysis of the target DSP processor has been performed using the current measurement technique discussed in the previous section. The salient results of the analysis are described in this section. We begin by describing an inter-instruction effect that was not evident in the case of previous power models [1, 2]. This has to do with the effect of circuit state in the special design of the on-chip multiplier. An enhanced power model is proposed to account for this effect. Subsequently, the other parameters of the complete instruction-level power model of the processor are described.
Effect of Circuit State Overhead
Instruction-level power analysis models for the Intel 486DX2 and the Fujitsu SPARClite have been developed earlier as described in [1, 2]. The primary components of these models are base costs of instructions and overhead costs between adjacent instructions. The base current of an instruction is measured by putting several instances of the target instruction in an infinite loop. If a pair of different instructions, say i and j, is put into an infinite loop for measurement, the current is always larger than the average of the base costs of i and j. The difference is called the overhead cost of i and j, and is considered as a measure of the change in circuit state from instruction i to j, and vice versa. So the total energy consumed by a program is the sum of the total base costs and the total overhead costs, over all the instructions executed.
However, the overhead cost for the above processors only considers the circuit state change caused by adjacent instructions. The results for our target DSP processor show that this model can underestimate current, especially for multiply instructions. Table 1 gives an example program consisting of a sequence of packed instructions (MUL:LAB) followed by NOPs. The packed instruction MUL:LAB multiplies registers A and B while in the same cycle transferring two new data values from memory to registers A and B, respectively. The associated base cost, the overhead cost of instruction pairs, and the number of execution cycles are listed in Table 1. The difference, 27.8, in the two numbers actually comes from the circuit state overhead between the non-adjacent instructions 1&3. This is due to a special design feature at the inputs of the multiplier, as illustrated in Figure 3. A latch for each input operand is placed between the multiplier and the operand bus to retain the old values until the next multiply instruction is executed. Therefore, the state change at such input latches cannot be accounted for by the overhead of adjacent instructions 1&2 or 2&3. It is given by the overhead of instructions 1&3. So 2 times the overhead of 1&3 (2 × 13.9mA) can compensate for the above difference, leading to a more accurate estimate. (Because these four instructions are put in an infinite loop for measurement, the overhead occurs twice: between 1 and 3 as well as between 3 and 1.)
As a result, the new power model needs to include the overhead caused by non-adjacent multiply instructions. Now, this overhead is dependent on the previous and current values of the input latches for each multiply operation. But these values are typically unknown until runtime. So for the purpose of program energy evaluation, the state of the input latches is considered unknown, and an average overhead current penalty is added to the base cost of each multiply instruction. This average value was determined to be 12.5mA for MUL instructions. So in a way, the above effect is handled by using an enhanced form of base cost for multiply instructions. The enhanced base cost is the base cost as defined earlier, plus an average overhead penalty. While most instructions in the instruction set did not show the same effect, some other instructions involving the ALU datapath did. It can also be expected that other processors that utilize similar designs may exhibit a similar effect.
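The model described above (base costs, overheads between adjacent instructions, and an enhanced base cost for multiply instructions) can be sketched as follows. This is a minimal illustration rather than the authors' tool: the function name, the table layout, and every cost value in the example except the 12.5mA multiply penalty are hypothetical, and each instruction is assumed to take one cycle.

```python
def estimate_average_current(loop_body, base_cost, pair_overhead,
                             multiply_penalty=None):
    """Estimate the average current (mA) of an instruction loop.

    loop_body        -- list of instruction names, executed in an infinite loop
    base_cost        -- dict: instruction name -> base current (mA)
    pair_overhead    -- dict: frozenset({i, j}) -> overhead current (mA)
    multiply_penalty -- dict: instruction name -> average penalty (mA) that
                        stands in for the non-adjacent latch overhead
    Assumes one cycle per instruction (a simplification for illustration).
    """
    penalty = multiply_penalty or {}
    n = len(loop_body)
    total = 0.0
    for k, instr in enumerate(loop_body):
        # Enhanced base cost: base cost plus the average overhead penalty.
        total += base_cost[instr] + penalty.get(instr, 0.0)
        # Overhead to the next instruction; the loop wraps around.
        nxt = loop_body[(k + 1) % n]
        total += pair_overhead.get(frozenset((instr, nxt)), 0.0)
    return total / n


# Hypothetical base and overhead values; only 12.5 mA comes from the text.
base = {"MUL:LAB": 40.0, "NOP": 20.0}
overhead = {frozenset(("MUL:LAB", "NOP")): 10.0}
penalty = {"MUL:LAB": 12.5}
estimate = estimate_average_current(
    ["MUL:LAB", "NOP", "MUL:LAB", "NOP"], base, overhead, penalty)
```

Keying the overhead table by an unordered pair mirrors the assumption that the overhead from i to j equals that from j to i.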
Instruction-Level Power Model
The instruction-level power modeling technique described in Section 4.1 suggests that accurate current estimates for a program can be obtained if a table that gives the base cost for each instruction, and a table that gives the overhead cost for each instruction pair, can be derived. Such tables can be empirically constructed through appropriate experiments using the measurement-based power analysis technique. However, there are some practical issues to be considered in this regard. First, the power cost of some instructions can vary depending on the operand value. Extensive experimentation can lead to the development of accurate models for this variation. A practical approximation in this case is to use average costs for these instructions. The average costs are then tabulated. The other issue is one of table size. For processors with rich instruction sets, assigning power costs to all instructions and instruction pairs can lead to large tables. Creation of these tables may require a lot of work. However, it has been observed that instructions can be arranged in classes such that the instructions in a given class have very similar power costs.
For the target DSP processor, the instructions most commonly used in DSP applications were categorized into classes. Six classes were used for unpacked instructions. The principal packed instructions are similarly classified. The instructions in the same class have similar functionality and activate similar parts of the CPU. Hence they have similar characteristics with regard to the current drawn. These six instruction classes are listed in Table 2, where the addressing mode and the corresponding functionality are also provided. Extensive current measuring experiments were then conducted to verify, for each class, the characteristics of current consumption. Furthermore, the effect of different operand values on the variation of current consumption was studied for each class.
The average base and overhead costs were also assigned. All these analysis results are discussed in detail in the remainder of this section. Packed and unpacked instructions are discussed separately. A scheduling algorithm that has been developed to use this information for energy reduction will be described in Section 5.3.
Base/Overhead Cost of Unpacked Instruction
Table 3 gives, for each unpacked instruction class, the range of base costs for different operand values. The exact operand values are often unknown until runtime. Thus, average values are used during program energy evaluation. These are also shown in Table 3. Since the range of variation in the base costs is reasonably small (less than 10%) for most classes (LDI being the exception), any inaccuracy resulting from the use of averages is limited. The overhead costs between instructions belonging to different classes are shown in Table 4.
The entry in row i and column j gives the overhead cost when an instruction belonging to class j occurs after an instruction belonging to class i, or an instruction belonging to class i occurs after an instruction belonging to class j. This table is symmetric, since the method used for calculating overhead costs assumes that the costs in these two cases are the same. There is a variation in the value of each entry for different operands and for the choice of instructions in each class. This variation is again limited, and it is reasonable to use average values. The entries in Table 4 represent the determined average values. The value in the (MAC, MAC) entry represents the overhead that can occur even if the two instructions are non-adjacent, as described in Section 4.1. An alternative way to look at this case is to use the enhanced base costs of Section 4.1: the base cost for MAC in Table 3 can be increased by 12.5mA and the (MAC, MAC) entry in Table 4 can be changed to 0.
An important observation from Table 4 is that there is significant variation across the various entries in the table. This is in contrast to previous power models. In the power models for the large general-purpose processors, the variation in the circuit state overhead for different instruction pairs was limited.
Base/Overhead Cost of Packed Instruction
Table 5 shows, for each packed instruction class, the range of base cost variation caused by all possible operand values. Again, the variation is reasonably small (less than 10%) for most classes. An average value is assigned as the base cost, which is also shown in Table 5. An interesting observation is that the base cost of a packed instruction that has a data transfer instruction as a component is very close to the base cost of the unpacked data transfer instruction alone (cf. Tables 3 and 5). For example, the base cost of ASL:LAB is 36.6mA, very close to the base cost of LAB alone.
For the overhead cost, experiments showed that, except for instructions that have a packed MAC, most packed instructions have small ranges of variation. So an average value can be assigned as the overhead cost for these packed instructions. The overhead costs of a few packed instructions commonly found in our DSP software are listed in Table 6.
As to the overhead cost of MAC instructions, when MAC is packed with a data transfer instruction, especially LAB, which changes the data values in registers A and B used by MAC as inputs, a significantly wide current variation is observed. Such wide variation is mainly due to the complex Booth multiplier implemented in the MAC unit. Table 7 gives the pairwise overhead costs observed in this case. For a typical DSP application, MAC:LAB instructions are usually applied to a sequence of data for filter operations, such as $\sum_i c_i X_i$. Ideally, the pairwise overhead cost given in Table 7 can be used to arrange the data ordering such that the total overhead cost, or the sum of individual pairwise overhead costs, is minimized. But the problem is that $X_i$ is usually not available until execution time. Hence, for our estimation purpose, the average value, 17.2mA, is used as the overhead cost for MAC:LAB instructions, due to the unavailability of execution-time operands.
However, for the purpose of minimization, this single overhead cost value cannot guide the search procedure to a better schedule for a sequence of MAC:LAB instructions. In any case, for filter applications such as $\sum_i c_i X_i$, instruction scheduling of existing code may not be the best alternative. The reason is that the arrival order of the operands $X_i$ is determined by the environment of the embedded processor, and is not under the control of a scheduler. Thus, the overall design of the system or algorithm may have to be changed to produce more favorable signal statistics. This may not always be possible. Therefore, under such environmental constraints, in order to still reduce the energy consumption due to MACs, a more effective technique of local code modification is proposed in the next section, based on exploiting the architecture of the Booth multiplier.
Energy Minimization for the Target DSP Processor
Based on the power analysis of the target processor, this section proposes several effective energy minimization techniques for embedded DSP software. The first two techniques, memory bank assignment and instruction packing, exploit the architectural features of dual-memory transfers and packed instructions. These minimize energy by reducing the program execution cycles. Then an instruction scheduling algorithm is introduced to reduce the circuit state overhead cost. In addition, since the on-chip Booth multiplier is a major source of energy consumption, a microarchitectural power model is developed and an energy minimization technique based on swapping the operands is proposed. The above two techniques achieve energy reduction by reducing the average current, without affecting the number of execution cycles. All reported current values are obtained by the current measurement technique described in Section 3.
Memory Bank Assignment for Low Energy
Section 2 mentioned that the target DSP processor has two on-chip data memory banks, RAMA and RAMB, as depicted in Figure 4. Each of these can supply data to the register file for an ALU operation, in the same cycle, by a double-transfer instruction, LAB. If the two operands are stored in the same memory bank, two single-transfer MOV instructions are needed instead. This takes two cycles, one for each transfer. Table 8 shows the average current for these alternatives in Columns 3 and 2, respectively. It is possible to save software energy through the use of LAB, which takes half of the execution time of two MOVs. As seen from Table 8, the average current for a LAB is the same or marginally higher than that for a sequence of two MOVs. However, since it takes only half the cycles, the double-transfer instruction takes about half the energy, as shown in Column 4. Figure 5 depicts this graphically for the third entry in Table 8. The area under the solid and dotted curves is proportional to the energy cost of the 2 MOVs and 1 LAB, respectively, since energy is proportional to $I \times N$.
Although both types of instructions perform the same function, the energy consumed by 2 MOVs is always larger than that by a single LAB. This may have to do with the fact that the large power cost associated with the clock, instruction fetch, control, etc., gets shared by the two data transfers when they execute together in one cycle. In addition, the power cost associated with the change in circuit state between the two MOVs is also eliminated. Therefore, in order to reduce energy consumption, variables in an embedded program should be assigned to memory to allow maximal use of double-transfer instructions. We formulate this memory allocation problem as a variable partitioning problem. A simulated-annealing algorithm is proposed next as an efficient solution for this problem.
Given a 2-way partition of the access graph, we assign a cost value to the partition, which is the number of cycles to transfer data from memory banks to registers. That is, for each edge $\langle v_i, v_j \rangle$, if variables i and j are assigned to different memory banks, 1 cycle is needed by a double-transfer instruction; otherwise, 2 cycles are needed by two single-transfer instructions.
An algorithm SA is proposed in Figure 7 to find the least-cost partition. The algorithm is based on the standard formulation for simulated annealing [12]. Given an initial random partition of G, SA iteratively generates a new partition until the stopping criteria frozen and equilibrium are met. The cost of any given partition is determined by summing up the costs over all edges. The function next state generates a new partition by allowing movement of a single node as well as swapping of two nodes. The function accept determines, at each stage of the algorithm, whether the new partition should be accepted or not.
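The partitioning step can be illustrated with a short simulated-annealing sketch in Python. It is not the algorithm of Figure 7: the cooling schedule, the parameter values, the Metropolis acceptance rule, and the edge weights (taken here as the number of times two variables are fetched together) are all assumptions made for illustration, and the names `partition_cost` and `anneal_banks` are hypothetical.

```python
import math
import random


def partition_cost(edges, bank):
    """Transfer cycles for a bank assignment.

    edges -- dict: (u, v) -> number of times u and v are fetched together
             (the weighting is an assumption of this sketch)
    bank  -- dict: variable -> 0 (RAMA) or 1 (RAMB)
    An edge costs 1 cycle if its variables are in different banks
    (double transfer), otherwise 2 cycles (two single transfers).
    """
    return sum(count * (1 if bank[u] != bank[v] else 2)
               for (u, v), count in edges.items())


def anneal_banks(variables, edges, seed=0, t_start=10.0, t_end=0.01,
                 alpha=0.9, moves_per_temp=50):
    """Search for a low-cost 2-way partition by simulated annealing."""
    variables = list(variables)
    rng = random.Random(seed)
    bank = {v: rng.randint(0, 1) for v in variables}  # random initial partition
    cost = partition_cost(edges, bank)
    best, best_cost = dict(bank), cost
    t = t_start
    while t > t_end:                        # stands in for the "frozen" test
        for _ in range(moves_per_temp):     # stands in for "equilibrium"
            candidate = dict(bank)
            if len(variables) >= 2 and rng.random() < 0.5:
                u, v = rng.sample(variables, 2)   # swap two nodes
                candidate[u], candidate[v] = candidate[v], candidate[u]
            else:
                u = rng.choice(variables)         # move a single node
                candidate[u] = 1 - candidate[u]
            new_cost = partition_cost(edges, candidate)
            delta = new_cost - cost
            # Metropolis rule: always accept improvements, sometimes accept
            # uphill moves, with probability falling as t decreases.
            if delta <= 0 or rng.random() < math.exp(-delta / t):
                bank, cost = candidate, new_cost
                if cost < best_cost:
                    best, best_cost = dict(bank), cost
        t *= alpha
    return best, best_cost


# Hypothetical access graph with three variables.
example_edges = {("x", "y"): 3, ("y", "z"): 2, ("x", "z"): 1}
assignment, cycles = anneal_banks(["x", "y", "z"], example_edges, seed=2)
```

The sketch keeps the best partition seen so far, so the returned cost is never worse than that of the initial random partition.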
It should be noted that the dual-memory transfer feature is not unique to the Fujitsu DSP, but is also provided by several other popular DSP processors, e.g., the Motorola 56000 series. The above observations are likely to be valid for these other processors too. In addition, recent memory 
Instruction Packing for Low Energy
As discussed in Section 2, the target DSP processor provides the capability of packing an ALU-type instruction and a data transfer instruction into a single instruction codeword for simultaneous execution. This feature is called "instruction packing". Therefore, if data dependency and the packing rules allow, two instructions can be assigned to the same execution cycle and packed into a single instruction. The single packed instruction represents the same functionality as the sequence of two unpacked instructions. But, interestingly, we found that using the packed instruction always leads to a reduction in energy. The reason for this is that the average current for the packed instruction is only slightly more than the average current for the sequence of the two unpacked instructions. Thus, the reduction of one execution cycle more than offsets the slight current increase, leading to a large overall energy reduction. This observation is illustrated by the following example, where MSPC, a multiply instruction, and LAB, a load instruction, can either execute as unpacked instructions for a total of two cycles, or as a single packed instruction that executes in just one cycle. The current drawn by the instructions in packed and unpacked format is compared for two sets of data operands in Table 9. As can be seen from the results, the average current drawn for the packed instructions is only marginally higher than for the unpacked instructions. However, the unpacked instructions complete in twice the number of cycles as the packed instructions, so the total energy consumed by the unpacked instructions is much larger, about twice as much as the packed instructions. Figure 8 illustrates this comparison graphically for the first set of operands.
The area under the graph, which is given by $I \times N$, is proportional to the total energy consumption.
The explanation for the above observations may lie in the fact that there is a certain underlying current cost associated with the execution of any instruction, which is independent of the functionality of the instruction and independent of whether the instruction is packed or not. This is the cost associated with fetching the instruction, pipeline control, clocks, etc. This cost gets shared by two instructions when they are packed. In addition, the circuit-state overhead current between the two adjacent unpacked instructions (LAB and MSPC) is eliminated. Since minimizing the total energy consumption is our objective, instructions should be packed under the packing rules, as much as possible. A greedy as-soon-as-possible (ASAP) packing algorithm has been implemented, which selects the next available instruction in a program as a candidate for packing, if data dependency and packing rules allow.
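A greedy packing pass of this kind can be sketched as follows. The sketch is only an approximation of the idea: the real packing rules of the processor are not reproduced here, so `can_pack` uses a simplified, hypothetical rule (one ALU-type and one transfer instruction, with no read-after-write or write-after-write conflict), it only considers adjacent instructions, and the register name `P` for the product is invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Instr:
    name: str
    kind: str                      # "alu" or "xfer"
    reads: frozenset = frozenset()
    writes: frozenset = frozenset()


def can_pack(first, second):
    """Simplified, hypothetical packing rule for two adjacent instructions."""
    if {first.kind, second.kind} != {"alu", "xfer"}:
        return False               # need one ALU-type and one data transfer
    if first.writes & second.reads:
        return False               # second needs a result of the first
    if first.writes & second.writes:
        return False               # both write the same register
    return True


def pack_asap(program):
    """Greedily pack each instruction with its successor when allowed.

    Returns a list of tuples; each tuple is one codeword (one cycle).
    """
    packed, i = [], 0
    while i < len(program):
        if i + 1 < len(program) and can_pack(program[i], program[i + 1]):
            packed.append((program[i], program[i + 1]))
            i += 2
        else:
            packed.append((program[i],))
            i += 1
    return packed


mul = Instr("MUL", "alu", frozenset("AB"), frozenset("P"))
lab = Instr("LAB", "xfer", frozenset(), frozenset("AB"))
codewords = pack_asap([mul, lab, mul, lab])   # two packed codewords
```

In this example the multiply reads the old values of A and B while the packed transfer loads the new ones, which matches the behaviour described for MUL:LAB; the reverse order (transfer first, then multiply) is left unpacked because the multiply depends on the loaded values.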
Instruction Scheduling for Low Power
As can be seen from the results in Section 4, the circuit state overhead cost varies significantly across different instruction pairs. Thus, different instruction schedules for the same program can consume different amounts of power. This suggests that it is possible to reduce the power cost of a program by an appropriate scheduling of instructions. An automated instruction scheduler has been designed to minimize the total circuit state overhead cost for a program. It looks up the overhead cost tables and chooses a good instruction schedule without violating data dependencies. The implementation of the scheduler is based on the popular list scheduling algorithm [14], with the overhead cost as the objective function to be minimized. The list scheduling algorithm for overhead cost minimization is shown in Figure 9. At each cycle, a ready list is maintained for the instructions whose operands have become available. The instruction in the ready list with the lowest overhead cost is then selected to be scheduled at the current cycle. Pipeline stall conditions due to resource or data hazards can also be checked during instruction selection, to avoid penalties due to extra cycles. The scheduled instruction is then deleted from the ready list, and the instructions that become ready as a result of the new schedule are added to the ready list.
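The selection loop described above can be sketched as follows. This is a simplified illustration rather than the algorithm of Figure 9: the function names, the representation of dependencies as predecessor sets, and the overhead values are all hypothetical, instruction names double as opcodes, and pipeline stall checks are omitted.

```python
def list_schedule(instrs, deps, overhead, default_cost=0.0):
    """Order `instrs` so that each step adds the least overhead cost.

    instrs   : instruction names in original program order
    deps     : name -> set of names that must be scheduled first
    overhead : (previous, next) -> circuit state overhead cost
    """
    pending = {i: set(deps.get(i, ())) for i in instrs}
    schedule, prev = [], None
    while pending:
        # Ready list: instructions whose predecessors are all scheduled.
        ready = [i for i in instrs if i in pending and not pending[i]]
        if not ready:
            raise ValueError("cyclic dependency")
        if prev is None:
            chosen = ready[0]
        else:
            chosen = min(ready,
                         key=lambda i: overhead.get((prev, i), default_cost))
        schedule.append(chosen)
        del pending[chosen]
        for preds in pending.values():
            preds.discard(chosen)      # may make new instructions ready
        prev = chosen
    return schedule


def total_overhead(schedule, overhead, default_cost=0.0):
    return sum(overhead.get(pair, default_cost)
               for pair in zip(schedule, schedule[1:]))


# Hypothetical overhead table, for illustration only.
cost = {("LD", "ADD"): 5, ("LD", "MUL"): 2, ("MUL", "ADD"): 1,
        ("ADD", "MUL"): 4, ("ADD", "ST"): 3, ("MUL", "ST"): 3}
order = ["LD", "ADD", "MUL", "ST"]
deps = {"ADD": {"LD"}, "MUL": {"LD"}, "ST": {"ADD", "MUL"}}
best = list_schedule(order, deps, cost)
print(best, total_overhead(best, cost), total_overhead(order, cost))
```

Because the choice at each step is greedy, the schedule is good but not guaranteed to be optimal, which is consistent with the heuristic nature of list scheduling.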
Operand Swapping for the Booth Multiplier
We have found that in typical DSP applications the multiplier in the MAC unit is usually a major source of power consumption. This is because of its complex design, which leads to a large current cost. This is compounded by the frequent use of filter operations such as $\sum c_i X_i$ in DSP applications. This section focuses on power analysis of the on-chip Booth multiplier, and proposes a power reduction technique based on swapping the multiplication operands.
Power Model of the Booth Multiplier
The Booth multiplier implemented in the MAC unit takes the data in registers A and B as operands for fast multiplication. Without going into the details of the Booth multiplication algorithm, the fundamental idea behind it is to recode B by a so-called "skipping over 1s" technique [15]. Table 10 shows the basic recoding scheme employed by the Booth algorithm. The motivation for recoding B is that in cases where B has its 1s grouped into a few contiguous blocks, only a few versions of A need to be added or subtracted to generate the product. For instance, the 7-digit B value 0011110, which would need four additions of shifted A, can be recoded by Table 10 to $0100 0\bar{1}0$ (where $\bar{1}$ denotes -1, for simplicity), which requires only one addition and one subtraction. However, in the worst case, B may have alternating 1s and 0s, and each bit in B selects a shifted version of A to add or subtract. In order to determine how many additions and subtractions are needed by the Booth multiplier, we define the weight of a B value as the number of non-zero digits in its representation. For instance, the weight of 0011110 is 4, while the weight of $0100 0\bar{1}0$ is 2. A simple model of the microarchitecture of the Booth multiplier is depicted in Figure 10. The Booth multiplier does not treat A and B symmetrically. The weight of the recoded B determines the number of times A is added or subtracted while generating the product. So if the recoding weight of A is smaller than that of B, we can reduce the number of additions and subtractions simply by swapping the operands in registers A and B, which can potentially result in current reduction. Table 11 gives three experiments where swapping the operands of the Booth multiplier reduces current significantly. This observation shows that an effective way to reduce current for MAC instructions is simply to swap the operands in A and B.

Table 10: Booth multiplier recoding table.
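The recoding and the weight measure can be illustrated with a short sketch. It assumes the standard radix-2 Booth rule, in which digit i of the recoded value equals b(i-1) - b(i) with b(-1) = 0. The function names are ours, and the sketch models only the recoding, not the multiplier hardware.

```python
def booth_recode(b, width):
    """Radix-2 Booth digits of a `width`-bit value, least significant first.

    Each digit is -1, 0 or +1 and equals b[i-1] - b[i], with b[-1] = 0.
    """
    digits, prev = [], 0
    for i in range(width):
        bit = (b >> i) & 1
        digits.append(prev - bit)
        prev = bit
    return digits


def booth_weight(b, width):
    """Number of non-zero Booth digits, i.e. additions plus subtractions."""
    return sum(1 for d in booth_recode(b, width) if d != 0)


def should_swap(a, b, width):
    """Swap when A would need fewer add/subtract steps than B as multiplier."""
    return booth_weight(a, width) < booth_weight(b, width)


print(booth_recode(0b0011110, 7))      # digits listed from bit 0 upwards
print(booth_weight(0b0011110, 7))      # contiguous block of 1s
print(booth_weight(0b0101010, 7))      # alternating 1s and 0s
```

Read from the most significant digit, the first result is 0, 1, 0, 0, 0, -1, 0, which is the recoding of 0011110 given in the text, with weight 2. The alternating pattern has weight 6, illustrating the worst case.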
A simple power consumption model based on the microarchitecture model of the Booth multiplier in Figure 10 was empirically derived and validated through extensive current measurement experiments. In this power model, the switching activity of the multiplier is characterized mainly by the contents of registers A and B. Since circuit state is a significant factor for the multiplier, pairs of consecutive values in the registers are considered. For register A, the bit switching between consecutive values is considered, which determines the complete switching activity in register A and part of the activity in the shift/add array. For register B, two factors are considered: first, the bit switching between consecutive values, and second, the weight of the Booth recodings of the values, which determines the number of additions and subtractions in the shift/add array. Table 12 shows the average current drawn by MAC:LAB for different characteristics of the pair of consecutive values in A and B. An index (1 to 10, shown in square brackets) is assigned to each entry to identify the data characteristics of A and B that the entry represents. For example, entry 8 represents the case where there is high switching between the pair of consecutive values in A, and low switching between the values in B. In addition, both values in B have high Booth recoding weights. In Table 11 the first pair of data is an example with such characteristics.
Power Reduction by Operand Swapping
It can be seen from Table 12 that the average current for the entries where B has high recoding weights is consistently higher than that in the other corresponding entries. Moreover, entry 9 incurs the highest average current. This is the case where both A and B switch significantly and B has high recoding weights. The second pair of data in Table 11 is an example of such a case. If we swap the two sets of operands in A and B, the characteristics of A and B change. One of the new possibilities is that A still has high switching, while B, which now takes the values originally stored in A, has high switching but low Booth recoding weights. So it is possible that after swapping, the values of the operands fall under the case represented by entry 7 in Table 12. Thus, the current drawn may be sharply reduced.
For filter operations such as $\sum c_i X_i$, the values of the constants $c_i$ are usually known at the time of instruction scheduling. So the scheduler can calculate the weight of the Booth recoding of $c_i$, and then decide to load $c_i$ into register A if the recoding weight is high, and into register B if the recoding weight is low. This decision about the placement of operands is made based on the value of just one of the operands, so the wrong decision may sometimes be made. However, on average, determining the placement of operands based on the knowledge of even one operand will lead to current reduction. A systematic investigation was conducted to determine the possible improvements, and the results are shown in Table 13. The known operands are initially assumed to be in register B. If the recoding weight of the value in B is high, the operands are swapped. This means that if the initial data characteristics fall under the entries in the last 3 columns of Table 12, the operands will be swapped. Table 13 gives the average current reduction when swapping changes the operand characteristics from one entry to another. The columns under the heading "before" show the entries in Table 12 that will result in an operand swap. The column "after" shows the new cases that can arise when the operands in the A and B registers are swapped. The average percentage reduction in the current after operand swapping is shown under the column labeled "% saving". In a few cases there is either no current reduction or a minor increase. But in the great majority of cases, operand swapping significantly reduces the current. Thus, on average, the current drawn by MAC:LAB instructions can be reduced, even though only one operand, for instance $c_i$, is known at schedule time. Operand swapping is easily achieved by locally modifying the given instruction, e.g., from MSPC:LAB (X0+1),(X1+1) to MSPC:LAB (X1+1),(X0+1).
In addition, there is no performance or code size penalty associated with it.
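The placement decision for a known coefficient can be sketched as follows. This is an illustration under assumptions: the text does not state where the boundary between "high" and "low" recoding weight lies, so the `high_weight` threshold is a hypothetical parameter, and the function names are ours. The weight is computed with a bitwise shortcut: a radix-2 Booth digit is non-zero exactly where a bit differs from its lower neighbour.

```python
def booth_weight(value, width):
    """Count non-zero radix-2 Booth digits of a `width`-bit value.

    Bit i of value ^ (value << 1) is set exactly when bit i differs from
    bit i-1 (with an implicit 0 below bit 0), which is where the Booth
    digit is non-zero.
    """
    mask = (1 << width) - 1
    return bin((value ^ (value << 1)) & mask).count("1")


def place_coefficient(c, width, high_weight):
    """Choose the register for a known coefficient c.

    `high_weight` is an assumed threshold, not a value from the paper:
    coefficients at or above it go to A, so that the other operand is the
    one that gets Booth-recoded in B.
    """
    return "A" if booth_weight(c, width) >= high_weight else "B"


for coeff in (0b0101010, 0b0011110):
    print(format(coeff, "07b"), place_coefficient(coeff, 7, high_weight=4))
```

Since only the coefficient is inspected, the choice can be wrong for a particular data value, which is why the benefit is stated above as an average over cases.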
Overall Software Energy Minimization Methodology
Based on the previous discussion, an overall energy optimization methodology is summarized in Figure 11. It takes one basic block of a DSP program at a time and assumes that instruction selection and register allocation have already been performed, so that a data flow graph (DFG) can be constructed for each basic block. A node in this graph represents the operation performed by the associated instruction, and an arc from node k to node l implies that the data produced by k is used by l. The methodology then optimizes energy by sequentially performing four steps: simulated annealing based memory bank assignment, ASAP instruction packing, list scheduling, and finally, operand swapping if the basic block has filter operations. List scheduling consults Tables 4 and 6 to reduce overhead cost, while Table 12 is used to check whether operand swapping is beneficial.
Experimental Results
Four DSP programs were tested to demonstrate the energy reductions possible with our minimization methodology. Table 14 shows the experimental results, where all comparisons are made in terms of the product of measured current and the number of cycles. These values are proportional to total energy. Multiplying them by $8.25 \times 10^{-8}$ gives the total energy in Joules. Column 1 lists the name of each benchmark program. The remaining columns show the energy comparisons obtained by applying different minimization techniques: Column 2 (un p) for the original unpacked code, Column 3 (m) for memory bank assignment alone, Column 4 (m+p) for the combined application of memory bank assignment and packing, Column 5 (m+p+o) for the combined application of memory bank assignment, packing, and overhead cost reduction by list scheduling, and Column 6 (m+p+o+s) for the combined application of all the techniques including operand swapping.
The first program ex is a real Fujitsu application for vector preprocessing. No MAC instructions are used in this program, so operand swapping is not applicable. The second program LP FIR60 is a length-60 linear phase FIR filter; the third program IIR4 is a fourth-order direct form IIR filter; and the fourth program FFT2 is a radix-2 decimation-in-time FFT butterfly. The last three programs are taken from the TMS320 embedded DSP examples in [11] and translated into native code for our target processor. In the case of LP FIR60, the same MAC:LAB instruction is used repeatedly to perform the filter operation on the data sequence, so the execution order does not change under different schedules. List scheduling is therefore not applicable to LP FIR60. For each benchmark program, the product I × N (which is proportional to energy) is given, where I is the measured average current and N is the number of execution cycles. The values in parentheses are the relative values when the products in column un p are normalized to 1. Thus, these values represent the energy reductions possible with the corresponding software modification technique (see Figure 12).

The results show that about 9% to 39% energy reduction can be achieved by memory bank assignment alone. Instruction packing can then reduce energy by another 4% to 46%. FFT2 shows only a 4% reduction because a certain data dependency in the unpacked code prevents effective instruction packing.
After packing, the list scheduling algorithm for overhead cost reduction can further reduce energy by 4% to 14% for the packed codes. In the cases of LP FIR60, IIR4, and FFT2, operand swapping is applicable, and an additional 18%, 7%, and 4% energy can still be saved, respectively. So, the overall energy reduction is seen to be 26% to 73% if the source code is originally unpacked and the memory banks are not assigned. Energy reductions of 8% to 17% are observed even if the original code is already pre-optimized through the use of instruction packing and memory bank assignment.
Conclusions and Future Work
In recent years, power/energy consumption has become one of the primary constraints in the design of embedded applications. In order to study the problem of energy consumption in these systems, one has to examine the variation in energy consumption due to variations in software, since software constitutes a large part of the functionality of embedded systems. This work describes the results of a study of the energy consumption due to software for a DSP processor that is used in several commercial applications. An empirical instruction-level power model has been developed for this processor. Due to architectural differences between this processor and others studied earlier, the power model has some important points of difference. These differences were exploited through a low-power instruction scheduling algorithm. Special architectural features that could be used for generating low energy code were identified and quantitatively analyzed, e.g., dual-memory accesses, instruction packing, and operand swapping for the Booth multiplier. Techniques for effectively utilizing these features were also described. Several examples were presented to illustrate the potential of all these techniques. Significant energy savings were observed. Energy savings obtained by these methods come for free: there is no hardware cost or software performance penalty. Future work will involve the refinement of the power model and optimization algorithms, and further exploration of code generation for low energy.
