We propose a new DSP programming scheme that optimizes a program for power consumption. The optimization performs accumulations in an interlaced manner, exploiting our finding that the power saved by keeping a constant value at one of the inputs of a multiplier is as much as the power consumption of an entire adder or register file of a programmable DSP. We demonstrate the power saving capability of our method by showing power simulation results for a conventional and a power-optimized FIR filter routine.
Power saving by software is one of the questions that we should address [1].
It has not been known which characteristics can be exploited to save power efficiently by optimizing software. This paper proposes an interlaced programming scheme and demonstrates a significant power saving resulting from it.
Some scheduling methods that minimize input data transitions were reported for multi-processors [2][3]. These methods are not purely software approaches because they are hardware synthesis techniques to minimize power. In contrast, it was reported that compilation techniques can minimize power for a programmable processor [4]. Though different compilation techniques resulted in different power consumption values, it is not yet clear whether the technique that provides the smallest number of dynamic steps also provides the lowest power for a single programmable processor. It is even more doubtful whether the same compilation techniques can be applied to low-power DSP programs, because most DSP programs are already highly optimized in terms of dynamic steps, and any modification seems to increase the dynamic steps, which eventually increases power. We showed that a data operation unit dominates the power consumption of a programmable DSP and that the power is highly data dependent [5].
We investigated the data dependency of the data operation unit power, and came up with a new programming method to save power. This paper describes two issues. 1) Significant chip power is saved by maintaining one of the inputs of a multiplier at a constant value. 2) A programming scheme that exploits this feature can be realized by optimizing the program with an interlaced accumulation scheme. The optimization scheme was applied to an FIR filter program, and the power consumption of the data operation unit was reduced by 46% without increasing the execution cycles.
Power Analysis Results
Fig. 1 shows the power consumption distribution among the modules of a programmable DSP [5]. Each point on the plot represents the power consumption with one of the test programs, and the variations in the contribution of each module are due to the differences between the test programs. In particular, the variation of the data operation module power is large because the power consumed by this module is highly data dependent. Data operation takes up to 34% of the worst-case chip power. This is the dominant factor in the chip power, and it is difficult to reduce by improving circuits.
In contrast, peripheral modules also consumed much power, but this can be reduced by using known techniques such as stopping the clock or lowering the clock frequency when the module is not in use. We focused on the data operation power and observed its data dependency. Fig. 2 shows the data dependency of the power consumption of an ALU and a register file, both of which correlate well with the number of input transitions. Here, the number of transitions is the Hamming distance between two consecutive input data. The power consumption itself is at most 2 nJ/cycle for each of the ALU and the register file. Fig. 3 shows the data dependency of the multiplier power when two sequences of random data were fed to the multiplier. There is no correlation between the number of input data transitions and the power consumption, and the power consumption is approximately 7 nJ for any number of input transitions. Fig. 4 shows the data dependency of the multiplier power when one of the inputs is instead fed with a constant value. Here the power correlates well with the number of input transitions, and the power consumption differs drastically depending on which side is kept constant. The difference is about 3.5 nJ, 2.1 nJ and 5.5 nJ when one of the inputs is kept at the constant values 0x800000, 0x555555, and 0xFFFFFF, respectively.
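The transition count used on the horizontal axes above can be sketched as follows. This is a minimal illustration, assuming 24-bit words (suggested by the constants such as 0x800000, but not stated explicitly); the function names are hypothetical.

```python
def hamming_distance(a, b, width=24):
    """Number of bit positions that differ between two width-bit words."""
    mask = (1 << width) - 1          # assumed word width, not given in the text
    return bin((a ^ b) & mask).count("1")


def transitions(sequence, width=24):
    """Hamming distance between each pair of consecutive input words."""
    return [hamming_distance(p, c, width) for p, c in zip(sequence, sequence[1:])]


# A held input contributes zero transitions (last pair below).
data = [0x800000, 0x555555, 0xFFFFFF, 0xFFFFFF]
print(transitions(data))  # [13, 12, 0]
```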
The multiplier employs the Booth algorithm, where the input 'B' is encoded into Booth codes. It sounds obvious that power consumption can be reduced by keeping one of the multiplier inputs unchanged, especially the one on the Booth encoder side. However, what we found surprising is that the amount of power saved by keeping the Booth encoder input fixed is as high as or higher than the power consumption of the ALU and the register file. Recall that data operation units dominate DSP chip power; hence the power saving can be significant even for the chip power.
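To illustrate why a fixed 'B' input keeps the encoder side quiet, the sketch below recodes a word into Booth digits. The text does not state which Booth variant the multiplier uses; radix-4 (modified Booth) recoding and a 24-bit word are assumptions here, and the function names are hypothetical.

```python
def to_signed(b, width=24):
    """Interpret a width-bit word as a two's-complement integer."""
    return b - (1 << width) if b & (1 << (width - 1)) else b


def booth_radix4_digits(b, width=24):
    """Radix-4 Booth digits of b, least significant first, each in {-2..2}."""
    bits = [(b >> i) & 1 for i in range(width)]
    digits = []
    prev = 0                          # implicit bit to the right of bit 0
    for i in range(0, width, 2):
        digits.append(-2 * bits[i + 1] + bits[i] + prev)
        prev = bits[i + 1]
    return digits


def changed_digits(b_old, b_new, width=24):
    """How many Booth digits differ between two consecutive B inputs."""
    old = booth_radix4_digits(b_old, width)
    new = booth_radix4_digits(b_new, width)
    return sum(1 for o, n in zip(old, new) if o != n)


for b in (0x800000, 0x555555, 0xFFFFFF):
    digits = booth_radix4_digits(b)
    # The digits reconstruct the signed value of B.
    assert sum(d * 4 ** i for i, d in enumerate(digits)) == to_signed(b)
    # Holding B constant changes no Booth digit from one cycle to the next.
    assert changed_digits(b, b) == 0
```

When B is held, every Booth digit is unchanged from one cycle to the next, so under this model only the other operand causes switching in the partial-product array.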
We have also found that surplus power is consumed due to surplus transitions of the precharge bus.
[Fig. 4 (a): 0x800000 x (random); horizontal axis: number of transitions at input B]
Low-Power Programming Scheme
We propose a new programming scheme that performs accumulations in an interlaced manner and exploits the characteristics of Booth multipliers described in the previous section. The new programming scheme tries to keep the same value at the Booth encoder side input of the multiplier for as long as possible. The programming style is best explained by the example shown in Fig. 6. In the conventional way shown in Fig. 6 (a), we calculate the three multiplications consecutively for a single output, where both inputs of the multiplier change in every cycle. The multiplier consumes about 7 nJ per multiplication according to the characteristics shown in Fig. 3. In our proposal shown in Fig. 6 (b), we calculate the multiplications in an interlaced order across the sampling period, and try to minimize the change of the input data of the multiplier. The two inputs are held alternately: after one input has been kept unchanged, the other input is kept unchanged. We estimate that 2 to 3 nJ of energy is saved per multiplication, which is as much as the energy consumed by the adder. This example assumes two accumulation registers, which is the same number as in the HX24 [5].
If we have more accumulation registers, which can be a destination of the multiplier and a source of the adder, C2X(k) can be calculated at the fourth place instead of returning to C1X(k-1). This implies that if we have more accumulation registers to keep the intermediate data for the accumulation, we can further reduce the multiplier power.
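The effect of the ordering can be sketched in a few lines. This is not a reproduction of Fig. 6: it is one possible zig-zag order for two accumulators, with hypothetical function names, and it counts multiplications in which both multiplier operands change compared with the previous one.

```python
def conventional_schedule(num_taps, first, count):
    """(output index, tap index, sample index), one output at a time."""
    for n in range(first, first + count):
        for j in range(num_taps):
            yield n, j, n - j


def interlaced_schedule(num_taps, first, count):
    """Same products, zig-zagging between outputs n and n+1 (count must be even)."""
    for n in range(first, first + count, 2):
        for j in range(num_taps):
            yield n + 1, j, n + 1 - j   # new coefficient; sample reused when j > 0
            yield n, j, n - j           # coefficient held; new sample


def run(schedule, c, x):
    """Accumulate outputs and count steps where both operands change."""
    y, both_changed, prev = {}, 0, None
    for n, j, i in schedule:
        y[n] = y.get(n, 0) + c[j] * x[i]
        if prev is not None and prev[0] != j and prev[1] != i:
            both_changed += 1
        prev = (j, i)
    return y, both_changed


c = [1, 10, 100]                 # three taps, as in the example above
x = [1, 2, 3, 4, 5, 6]
y_conv, n_conv = run(conventional_schedule(3, 2, 4), c, x)
y_int, n_int = run(interlaced_schedule(3, 2, 4), c, x)
print(y_conv == y_int, n_conv, n_int)  # True 11 1
```

Both orders produce the same outputs, but in the interlaced order only the step that moves on to the next pair of outputs changes both operands at once.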
The innermost loop of the optimized program is a two-instruction repeat loop, while the innermost loop of the conventional program is a single-instruction repeat loop. On the other hand, the outer loop of the optimized program has half the number of iterations of the conventional one. In addition to the power saving from keeping the same value at the multiplier input, the optimized program has two advantages. First, the number of data loads from memory is half that of the conventional program. Since we use the same input twice, we need to load only one data item per cycle in the optimized program, while the conventional one requires two data loads. Second, the loop overhead is reduced. In programmable DSPs, the repeat instruction has no overhead for branching to the start of the repeat loop, but the loop instruction spends a couple of cycles branching back on every iteration. Because the optimized program executes half as many loop iterations as the conventional one, the loop overhead is reduced in terms of power consumption as well as the number of dynamic steps.
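The two advantages can be put into numbers with a simple counting model. This is a rough sketch under stated assumptions: `branch_cycles=2` stands in for "a couple of cycles", the tap and output counts are arbitrary illustration values, and the function name is hypothetical.

```python
def loop_costs(num_outputs, num_taps, interlaced, branch_cycles=2):
    """Count outer-loop iterations, operand loads and branch-back cycles."""
    outputs_per_pass = 2 if interlaced else 1   # two accumulators when interlaced
    loads_per_mac = 1 if interlaced else 2      # one operand is reused when interlaced
    outer_iterations = num_outputs // outputs_per_pass
    macs = num_outputs * num_taps               # same multiply count either way
    return {
        "outer_iterations": outer_iterations,
        "operand_loads": macs * loads_per_mac,
        "branch_overhead_cycles": outer_iterations * branch_cycles,
    }


conventional = loop_costs(num_outputs=24, num_taps=18, interlaced=False)
optimized = loop_costs(num_outputs=24, num_taps=18, interlaced=True)
print(conventional)
print(optimized)
```

In this model the number of multiply-accumulate operations is unchanged, while loads, outer iterations and branch-back cycles are all halved.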
We can also apply the interlaced accumulation programming scheme to matrix multiplication. The equation of a matrix multiplication is as follows.
$$c_{ij} = \sum_{k=1}^{N} a_{ik}\, b_{kj}, \qquad 1 \le i \le M,\ 1 \le j \le P$$
This is an example of multiplying an MxN matrix A by an NxP matrix B to get an MxP matrix C. Fig. 7 shows the interlaced accumulation programming scheme as applied to a 3x3 matrix multiplication. The order of the calculation is indicated by circled numbers. In the conventional way shown in Fig. 7 (a), we calculate the three multiplications consecutively for a single output, and both inputs of the multiplier change in every cycle. By calculating the multiplications in an interlaced order across the columns, the change of the input data of the multiplier is reduced, as shown in Fig. 7 (b).
As shown above, the power-saving matrix multiplication keeps one of the inputs unchanged because two accumulations are done at a time using two accumulation registers, c and d.
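A minimal sketch of a two-accumulator matrix multiplication follows. It is one possible ordering rather than the exact sequence of Fig. 7: each element of A is held while it is multiplied into two adjacent columns, and the function name is hypothetical.

```python
def matmul_interlaced(a, b):
    """Multiply MxN a by NxP b, accumulating two outputs (c and d) at a time."""
    m, n, p = len(a), len(b), len(b[0])
    out = [[0] * p for _ in range(m)]
    for i in range(m):
        for j in range(0, p - 1, 2):
            c = d = 0                         # the two accumulation registers
            for k in range(n):
                held = a[i][k]                # loaded once, used for two products
                c += held * b[k][j]
                d += held * b[k][j + 1]
            out[i][j], out[i][j + 1] = c, d
        if p % 2:                             # odd column count: last column alone
            j = p - 1
            out[i][j] = sum(a[i][k] * b[k][j] for k in range(n))
    return out


A = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
B = [[9, 8, 7], [6, 5, 4], [3, 2, 1]]
print(matmul_interlaced(A, B))
```

With an odd number of columns, as in the 3x3 case, the last column falls back to a single accumulator; more accumulation registers would allow the held operand to be reused across more outputs.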
Simulation Results
We simulated four cases at the logic level, as we reported previously [5]: the conventional and the optimized FIR programs, each with a static bus and with a precharge bus. The simulated FIR filter had 18 taps and the number of signal data was 24.
The results are shown in Fig. 8 .
The low-power program saves 8% of the chip power with the precharge bus for the operators, and 13% of the chip power with the static bus for the operators. If we compare the conventional program with the precharge bus against the low-power program with the static bus, the power savings in the multiplier and the ALU are 51% and 58%, respectively, and the power consumption of the data operation unit is reduced by as much as 46%.
[Fig. 8: Simulation results for the conventional (conv.) and low-power (low P.) programs, grouped by data transfer type]
In addition, we can halve the number of data reads because we use the same data twice in the low-power program instead of loading it twice as in the conventional one, as we mentioned in the previous section. This reduces the memory access power by 34%.
Note that we can reduce power consumption very drastically without sacrificing any other features of the DSP such as chip size and execution cycles except for increasing latency by one sample, which does not usually cause any problems.
Conclusion
We showed that the power saving obtained by keeping a constant value at one input of a multiplier is as much as the power consumption of an adder or a register file. Since the contribution of the data operation unit to the chip power is so large, an optimization that tries to keep a constant value at the multiplier's input is promising.
We proposed an interlaced accumulation programming scheme that optimizes the order of the operations to minimize the data operation power. The scheme exploits the data-dependent nature of the multiplier's power consumption. Through power simulation of an FIR filter routine, we demonstrated a significant power saving from low-power programming combined with static data transfer from the register file to the multiplier and adder. The proposed scheme saved as much as 46% of the data operation power, which is almost impossible to achieve by any circuit technology alone without sacrificing other features. This implies that far more power can be saved by combining techniques properly than by improving techniques individually.
