Most research to date on energy minimization in DSP processors has focused on hardware solutions. This paper examines the software-based factors affecting performance and energy consumption for architecture-aware compilation. We focus on providing support for one architectural feature of DSPs that makes code generation difficult, namely the use of multiple data memory banks. This feature increases memory bandwidth by permitting multiple data memory accesses to occur in parallel when the referenced variables belong to different data memory banks and the registers involved conform to a strict set of conditions. We present novel instruction scheduling algorithms that attempt to maximize the performance, minimize the energy, and therefore maximize the benefit of this architectural feature.
INTRODUCTION
Currently, there is a high demand for DSP processors with low power/energy consumption in many areas, such as the telecommunications, information technology, and automotive industries. This demand stems from the fact that low power consumption is important for reliability and low-cost production as well as for device portability and miniaturization.
In the last decade we have seen the proliferation of electronic equipment like never before. As these systems become increasingly portable, the minimization of power consumption has become an important criterion in system design. In order to design a system with low energy and high performance, it is important to analyze all the components of the system platform. Since a large portion of the functionality of today's systems is in the form of software, it is important to estimate and minimize the software component of the energy cost and to maximize the software component of the performance [1] [2].
Although dedicated hardware can provide significant speed and power consumption advantages for signal processing applications, extensive programmability is becoming an increasingly desirable feature of implementation platforms for VLSI signal processing. Increasingly shorter life cycles for consumer products have fueled the trend toward tighter time-to-market windows, which in turn has caused intense competition among DSP product vendors and forced the rapid evolution of embedded technology. As a consequence, designers are often forced to begin architecture design and system implementation before the specification of a product is fully completed. For example, a portable communication product is often designed before the signal transmission standards under which it will operate are finalized, or before the full range of standards that will be supported by the product is agreed upon. In such an environment, late changes in the design cycle are inevitable. The need to make such late changes quickly requires the use of software.
Although the flexibility offered by software is critical in DSP applications, the implementation of production quality DSP software is an extremely complex task. The complexity arises from the diversity of critical constraints that must be satisfied. Typically these constraints involve stringent requirements on metrics such as latency, throughput, power consumption, code size, and data storage requirements [3] .
DSPs are a special kind of processor designed primarily to implement signal-processing algorithms efficiently. Code generation for DSPs is more involved than for general-purpose processors, because DSP processors have nonhomogeneous register sets, a number of specialized functional units, restricted connectivity, limited addressing, and highly irregular datapaths. It is well known that the quality of compilers for embedded DSP systems is generally unacceptable with respect to code density, performance, and power consumption. This is because the compilation techniques developed for general-purpose architectures do not adapt well to the irregularity of DSP architectures.
We address the problem of code generation for DSP systems on a chip. In such systems, the amount of silicon devoted to program ROM is limited, so the application software must be sufficiently dense. In addition, the software must be written to meet various high-performance and low-energy constraints. Unfortunately, current compiler technologies are unable to generate high-quality code for DSPs, whose architectures are highly irregular. Thus, designers often resort to programming application software in assembly, a labor-intensive task.
In this paper, we present a novel instruction scheduling method for one particular architectural feature, namely multiple data memory banks. This feature increases memory bandwidth by permitting multiple data memory accesses to occur in parallel. This happens when the referenced variables belong to different banks and the registers involved conform to a strict set of conditions. Furthermore, the instruction set architecture (ISA) of these DSPs requires the programmer to encode, in a limited number of long instruction words, all the data memory accesses that are to be performed in parallel, thus assisting in the generation of dense code.
Instruction scheduling techniques that use a list-based method have been around since the mid-1980s [4], and list scheduling is the most popular method of scheduling basic blocks. Trace scheduling is an optimization technique that selects a sequence of basic blocks as a trace and schedules the operations from the trace together [5].
Percolation scheduling [6] looks at the whole program and tries to improve the parallelism of the code. The idea that register allocation can be viewed as a graph-coloring problem has been around since the early 1970s, but Chaitin et al. [7] were the first to actually implement it in a compiler. Briggs [8] proposed some modifications to Chaitin-style allocation, the most important being the optimization of variable selection for register spilling.
Most of the previous work on reducing power and energy consumption in DSP processors has focused on hardware solutions to the problem. However, embedded systems designers frequently have no control over the hardware aspects of the predesigned processors with which they work, and so software-based power and/or energy minimization techniques play a useful role in meeting design constraints. Recently, new research directions in
The rest of the paper is organized as follows. Section 2 describes the DSP architectural features with multiple data memory banks. Section 3 describes the novel instruction scheduling algorithms that reduce the number of cycles and the register pressure. Section 4 presents examples that illustrate our algorithms. Section 5 presents the benchmark results. Section 6 concludes the paper.
DSP ARCHITECTURAL FEATURES WITH MULTIPLE DATA MEMORY BANKS
Our approach for increasing the packing efficiency has been tested on DSP architectures with multiple data memory banks, which can be characterized as Dual-Load-Execute (DLE) architectures. Examples of DLE processors include Analog Devices' ADSP-21xx family, NEC's µPD7701x family, Motorola's 56xxx family, and Fujitsu's Elixir family. These processors support parallel execution of an ALU operation and two data move (data load or data store) operations in the same cycle.
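The packing rule just described can be written as a small check. The following is only a minimal sketch under stated assumptions: the `Op` record, the `can_pack` function, and the bank labels "X" and "Y" are hypothetical names introduced for illustration, and the strict register conditions mentioned earlier are not modeled.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass(frozen=True)
class Op:
    """Hypothetical operation record used only for this illustration."""
    kind: str                   # "ALU" (e.g. ADD, MPY) or "MOVE" (load/store)
    bank: Optional[str] = None  # "X" or "Y" for MOVE ops, None for ALU ops


def can_pack(ops: Sequence[Op]) -> bool:
    """Return True if ops could share one cycle under the assumed DLE rule:
    at most one ALU operation, at most two data moves, and two moves must
    reference different data memory banks. Register conditions are ignored."""
    alu = [op for op in ops if op.kind == "ALU"]
    moves = [op for op in ops if op.kind == "MOVE"]
    if len(alu) + len(moves) != len(ops):
        return False  # unknown operation kind
    if len(alu) > 1 or len(moves) > 2:
        return False
    if len(moves) == 2 and moves[0].bank == moves[1].bank:
        return False  # both moves hit the same bank, so no parallel access
    return True
```

For instance, one ALU operation with one X-bank move and one Y-bank move passes the check, while two ALU operations, or two moves to the same bank, do not.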
DSP Architectures
The DSP architectural units of interest are the data arithmetic logic unit (Data ALU), the address generation unit (AGU), and the X/Y data memory banks [9].
INSTRUCTION SCHEDULING
Our instruction-scheduling algorithm, which is based on list scheduling, directly supports packing, because the above DSP supports simultaneous execution of multiple operations. Packing is efficient in terms of performance, because it reduces the number of cycles a program takes. Another important feature of packing is that it also tends to reduce the amount of energy consumed during program execution. In practice, packing has the potential to reduce energy consumption by more than half.
Instruction Level Energy Model
The average power P consumed by a processor while running a certain program is given by P = I*Vdd, where I is the average current and Vdd is the supply voltage. The energy consumed by a program, E, is given by E = P*T, where T is the execution time of the program. This, in turn, is given by T = N*delta, where N is the number of cycles and delta is the cycle period. Since DSP embedded systems are commonly used in portable devices, where power is supplied by a battery, energy consumption is the focus of our attention. Now, Vdd and delta are known and fixed. Therefore, E is proportional to the product of I and N. Given the number of execution cycles, N, for a program, we only need to measure the average current, I, in order to calculate E. The product of I and N is, therefore, the measure used to compare the energy cost of programs in this analysis. The energy model is taken from [2].
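The relations above translate directly into a few lines of code. This is a minimal sketch: the function names are hypothetical, and the numeric values in the usage note are made-up illustrations rather than measurements from the paper.

```python
def energy_cost(current_ma: float, cycles: int) -> float:
    """Relative energy cost I*N (in mA x cycles).

    Valid for comparing programs only when Vdd and delta are fixed."""
    return current_ma * cycles


def energy_joules(current_a: float, vdd_v: float,
                  cycles: int, cycle_period_s: float) -> float:
    """Absolute energy E = P*T with P = I*Vdd and T = N*delta."""
    power_w = current_a * vdd_v        # P = I*Vdd
    time_s = cycles * cycle_period_s   # T = N*delta
    return power_w * time_s


def reduction_percent(before: float, after: float) -> float:
    """Relative saving of `after` with respect to `before`, in percent."""
    return 100.0 * (before - after) / before
```

As a usage illustration with hypothetical numbers, `energy_cost(500, 10)` gives a cost of 5000, and a second schedule costing 2000 would represent a `reduction_percent(5000, 2000)` of 60 percent. Because Vdd and delta cancel in such ratios, the relative cost I*N is sufficient for comparing two schedules on the same processor.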
EXAMPLES AND ALGORITHMS
Example One
Consider the C code and the corresponding uncompacted symbolic assembly code shown in Figure 1. The data dependence graph (DDG) carries two weighting factors on each node. The first weighting factor in the tuple is the depth, which is counted from the bottom node of each branch. A node with a higher depth value has higher priority. For instance, nodes V0 and V1 have the highest priority to be selected in the ready set.
[Figure: DDG with a (depth, lifetime) tuple on each node.]
Note that the boldface nodes in the ready set are ALU operations such as ADD and MPY. One ALU operation can be executed with one or two MOVE operations in the same cycle, but two ALU operations cannot execute in the same cycle. The second weighting factor in the tuple is the lifetime of the register variables. For instance, node V0 depends on register r0, which is live during cycles 1 through 3 (see Figure 2); thus, its lifetime is 3 - 1 = 2. In Figure 2, the initial code executes in 11 cycles and, according to the lifetime analysis, requires 2 registers.
Next, we construct the ready set based on the as-soon-as-possible (ASAP) scheduling scheme for each node. The unscheduled nodes in the ready set now take 6 cycles in total. Figure 4 shows the nodes after they have been scheduled by our algorithm to exploit the DSP architecture. The number of registers is 3. Note that there is a tradeoff between the number of cycles and the number of registers. It is cost efficient to reduce the number of cycles rather than to increase the number of registers. However, for DSP processors the number of registers is limited, so we need a scheduling algorithm that minimizes the number of registers (i.e., reduces the register pressure) during the scheduling process. Figure 5 shows that with random choice, the number of cycles is 7 and the number of registers is 4. This demonstrates that the choice of algorithm is very important at this stage.
We borrowed the energy model from [2] to compute the energy consumption. In Figure 4, the total current, I, is 780 mA and the total number of cycles, N, is 6, so the energy cost is I*N = 780*6 = 4,680. In Figure 5, the schedule based on random choice has an energy cost of I*N = 850*7 = 5,950.
Before scheduling, in Figure 1, the total current is 1080 mA and the number of cycles is 11, so the energy cost is I*N = 1080*11 = 11,880. After scheduling with our algorithm (Figure 4), the number of cycles is reduced from 11 to 6 (a 45% reduction in cycles) and the energy cost is reduced from 11,880 to 4,680 (a 60% reduction in energy consumption). This demonstrates that our scheduling algorithm generates high-performance, low-energy code with low register pressure for DSP processors.
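The lifetime analysis used in this example can be sketched in code. The sketch below is an illustration under stated assumptions, not the paper's implementation: live intervals are taken as inclusive (first cycle, last cycle) pairs, the helper names are hypothetical, and apart from r0 (live during cycles 1 through 3, as in the text) the intervals are made up.

```python
from typing import Dict, Tuple

Interval = Tuple[int, int]  # (first live cycle, last live cycle), inclusive


def lifetime(interval: Interval) -> int:
    """Lifetime as defined in the text: last cycle minus first cycle."""
    first, last = interval
    return last - first


def register_pressure(intervals: Dict[str, Interval]) -> int:
    """Largest number of values that are live in the same cycle.

    Assumes one register per live value and no register sharing within
    a cycle in which one value dies and another is born."""
    if not intervals:
        return 0
    first = min(start for start, _ in intervals.values())
    last = max(end for _, end in intervals.values())
    return max(
        sum(1 for start, end in intervals.values() if start <= cycle <= end)
        for cycle in range(first, last + 1)
    )


# r0 follows the text; r1 and r2 are hypothetical intervals.
live = {"r0": (1, 3), "r1": (2, 4), "r2": (5, 6)}
```

With these intervals, `lifetime(live["r0"])` is 2, matching the 3 - 1 = 2 computation in the text, and `register_pressure(live)` is 2 because r0 and r1 overlap in cycles 2 and 3 while r2 overlaps with neither.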
[Figure: scheduled code for Example One (needs 6 cycles and 3 registers).]
Example Two
Again, consider the C code and the corresponding uncompacted symbolic assembly code shown in Figure 6(a) and Figure 6(b). Our algorithm focuses on reducing the number of cycles and the register pressure at the same time. This helps with register labeling in the next stage. In addition, because of the code compaction, the number of cycles is minimized, which in turn reduces the energy consumption.
Before scheduling, in Figure 6, the total current is 1040 mA and the number of cycles is 10, giving an energy cost of I*N = 10,400. After scheduling with our algorithm (Figure 7), the number of cycles is reduced from 10 to 5 (a 50% reduction in cycles) and the energy cost is reduced from 10,400 to 3,400 (a 67.3% reduction in energy consumption). The number of registers is only 4. With random choice, the number of cycles increases from 5 to 6 and the number of registers increases from 4 to 6 (see Figure 8). Random choice therefore degrades the performance, increases the energy consumption, and needs more registers. In Figure 7, the total current, I, is 680 mA and the total number of cycles, N, is 5, so the energy cost is I*N = 680*5 = 3,400. In Figure 8, the schedule based on random choice has an energy cost of I*N = 780*6 = 4,680.
[Figure: instruction nodes and lifetime analysis for Example 2.]
Our Algorithms
Our algorithm is based on list scheduling, but the priority function is modified so that it not only reduces the number of cycles (i.e., improves the performance) but also minimizes the register pressure. The algorithm is as follows.
1. Construct the Data Dependence Graph (DDG).
2. Make the tuple (P, L), where P is the depth value of the node and L is the lifetime of the node.
3. Construct the initial ready set R.
4. While not isempty(R) do
Before running the algorithm, we need to construct the data dependence graph (DDG) and calculate the depth value and the lifetime value for each node. We then construct the initial ready set R, after which list scheduling comes into play.
The first priority in our algorithm is the depth value, the first weighting factor in the tuple. If several nodes are in the ready set in the same cycle, the node with the higher depth value has higher priority. This aims to minimize the number of cycles. Recall that the energy cost is a function of the number of cycles, so this also reduces the energy cost considerably.
Besides reducing the number of cycles, the other important issue is how to reduce the number of registers, because the number of registers in a DSP processor is limited. Hence, the next priority is the second weighting factor in the tuple. If nodes have the same depth value, a pair of nodes with the same lifetime has higher priority. Otherwise, the node with the lower lifetime value has higher priority. This reduces the number of overlapping lifetimes among the nodes and hence the number of registers.
If nodes have the same depth value and the same lifetime value, the nodes in the same tree have higher priority. This increases the chance of placing ALU nodes and MOVE nodes in the same cycle. Note that an ALU node can be processed in parallel with one or two MOVEs, which helps reduce the current. For example, an ADD with two parallel MOVEs consumes only 150 mA, whereas an ADD with two serial MOVEs consumes 280 mA.
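The priority rules above can be illustrated with a small list scheduler. This is a simplified sketch under stated assumptions, not the authors' implementation: every operation is assumed to take one cycle, each cycle holds at most one ALU operation and two MOVEs, bank and register conditions are ignored, and the priority key uses only depth (higher first) and lifetime (lower first), with the node name as a deterministic tie-break in place of the same-lifetime-pair and same-tree rules. The `Node` record and the example graph are hypothetical.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple


@dataclass(frozen=True)
class Node:
    name: str
    kind: str                     # "ALU" or "MOVE"
    depth: int                    # P in the (P, L) tuple
    lifetime: int                 # L in the (P, L) tuple
    preds: Tuple[str, ...] = ()   # names of DDG predecessors


def priority_key(node: Node) -> Tuple[int, int, str]:
    """Higher depth first, then lower lifetime, then name (tie-break)."""
    return (-node.depth, node.lifetime, node.name)


def list_schedule(nodes: List[Node]) -> List[List[str]]:
    """Pack nodes into cycles; returns one list of node names per cycle."""
    remaining: Dict[str, Node] = {n.name: n for n in nodes}
    done: Set[str] = set()
    schedule: List[List[str]] = []
    while remaining:
        # Ready set R: nodes whose predecessors finished in earlier cycles.
        ready = sorted(
            (n for n in remaining.values() if all(p in done for p in n.preds)),
            key=priority_key,
        )
        slots = {"ALU": 1, "MOVE": 2}  # assumed per-cycle capacity
        packed: List[str] = []
        for node in ready:
            if slots.get(node.kind, 0) > 0:
                slots[node.kind] -= 1
                packed.append(node.name)
        if not packed:
            raise ValueError("cyclic dependences or unknown operation kind")
        for name in packed:
            done.add(name)
            del remaining[name]
        schedule.append(packed)
    return schedule


# Hypothetical DDG for a = x * y + z (depths and lifetimes are made up).
example = [
    Node("V0", "MOVE", 4, 2),                # load x
    Node("V1", "MOVE", 4, 2),                # load y
    Node("V2", "ALU", 3, 1, ("V0", "V1")),   # MPY
    Node("V3", "MOVE", 3, 1),                # load z
    Node("V4", "ALU", 2, 1, ("V2", "V3")),   # ADD
    Node("V5", "MOVE", 1, 0, ("V4",)),       # store a
]
```

On this hypothetical graph, `list_schedule(example)` packs the six operations into four cycles: the two deepest loads first, then the MPY together with the remaining load, then the ADD, then the store. Sorting the ready set with a tuple key keeps the rule order explicit, so further tie-breaks could be added as extra tuple fields.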
BENCHMARK RESULTS
In our benchmark results, performance improves by 48.3% on average and energy consumption is reduced by 66.6% on average (see Figure 9). Figure 9 shows the unscheduled assembly code, the scheduled assembly code, and the total required current, the number of cycles, and the energy cost for both versions.
CONCLUSION
In recent years, power/energy consumption has become one of the primary constraints in the design of embedded applications. Current compiler technology is unable to take advantage of the potential increase in parallelism offered by multiple data memory banks. Consequently, compiler-generated code is far inferior to hand-written code. While optimizing compilers have proved effective for general-purpose processors such as the PowerPC (RISC), Intel, AMD, and 680x0 CPUs (CISC), and Intel EPIC CPUs, the irregular datapaths and small number of registers found in embedded processors, especially fixed-point DSPs with multiple data memory banks, remain a challenge for compilers.
In this paper, we have developed a high-performance, low-energy compiler design for DSPs. Such a compiler should not compromise on performance or code size when reducing energy consumption.
Our benchmark shows that performance improves by 48.3% on average and energy consumption is reduced by 66.6% on average.
We assume that instruction selection is performed by another compiler and that a sequence of (unpacked) instructions is given to our procedure. In future work, we will develop an algorithm that combines instruction scheduling, register allocation, and memory assignment. This involves building an extensible retargetable compiler and simulator toolkit; developing program transformations for automatically exploiting all of the useful parallelism in a given program; developing "architecture-aware" and "memory-aware" compiler optimizations; and finally exploring the interaction between compilers and architectures.
