A systematic methodology is proposed for reducing the energy dissipation of multimedia applications realized on architectures based on embedded cores and an application-specific data memory organization. Performance and area are explicitly taken into account. The proposed methodology includes two major steps: a high-level code transformation step that reorganizes the original description of the target application, and a lower-level step in which detailed decisions on processor, data memory and bus organization are made.
INTRODUCTION
Portability as well as packaging and cooling issues have made power consumption an important design consideration [1]. Realizations of data-dominated signal processing applications such as multimedia require a large amount of memory to store the large multi-dimensional array-type data structures present in such applications.
Advances in memory technology keep decreasing the memory power cost [3]; however, the ever-increasing storage requirements of data-dominated applications offset these gains and keep memory-related power consumption the dominant contribution to the total system power cost [5]. This holds for different architecture platforms [6] [7] [8] [9] [10] [11].
Rapid advances in the area of programmable embedded processor cores have made them an attractive solution for the realization of real-time multimedia applications, due to the flexibility they offer but also for time-to-market reasons. The combination of processor cores with an application-specific memory and bus organization is especially promising, since it combines the processor cores' advantages described above with the optimization freedom offered by the application-specific memory and bus organization.
In the area of data transfer and storage exploration for power and area costs, the ATOMIUM [4] and the PHIDEO [12] methodologies have been proposed. Both these methodologies mainly target uni-processor (single thread of control) custom hardware architectures and cannot be applied to a programmable processor context in a straightforward manner. This is because extra issues like power consumed due to instruction storage and transfers, performance (in terms of number of cycles) and code size must be taken into account in such systems. No systematic methodologies currently exist for high-level data transfer and storage energy optimization for systems realized on programmable processors.
Some initial results for programmable processors with a predefined, fixed memory hierarchy have recently been presented [13, 14]. For applications realized on systems consisting of embedded cores and an application-specific data memory and bus organization, the application of ATOMIUM is described only in [15]. However, no systematic methodology for data transfer and storage optimization is proposed there, and only the power-performance trade-off is evaluated. Furthermore, issues such as the power consumption due to transfer and storage of instructions and the effect of code size on the total system power have not been explored.
Our previous work [26] has shown that the power consumed on instruction storage and transfers can be heavily affected by the application of code transformations that target the minimization of the power consumed on data storage and transfers. In the same work it was also pointed out that the instruction memory-related power consumption of such systems can be orders of magnitude greater than the power consumed due to data storage and transfers. These observations lead to the following conclusions: First, the cost function that drives a methodology for the application of high-level power-optimizing code transformations must take into account the power consumed on storage and transfers of both instructions and data.
Second in order for such a methodology to be power efficient, it must also include transformations that aim at optimizing the power consumed by instruction accesses and transfers. Clearly these are not considered as the main contribution of this work, but they are distinguishing points between our approach and the approaches of [13, 14, 15] , as far as the application of high-level code transformations is concerned.
<< Insert figure 1 >>
The aim of the proposed research, part of which is described in this paper, is the development of a systematic methodology for reducing the power consumption related to storage and transfers of both instructions and data in realizations of multimedia applications on systems including embedded cores and an application-specific data memory and bus organization. This is illustrated in figure 1. Inputs to the proposed methodology are an original high-level description of the target application and a set of area and performance constraints. The proposed methodology produces an optimized description of the application and a detailed processor, memory and bus organization. The optimized code produced can either be passed to the target processor compiler or be used as input for the manual development of assembly code. This paper focuses on the high-level part of the proposed methodology and specifically on a systematic approach for the application of a number of high-level code transformations. There are two main categories of transformations included:
The first category aims at optimizing, at a high level, the power consumed in data memory accesses and data transfers, while the second category aims at optimizing the power consumed due to storage and transfers of instructions.
The rest of the paper is organized as follows: In section 2 the target architecture is described. The energy model and the cost functions are discussed in section 3. The proposed methodology is presented in section 4. In section 5 experimental results from several demonstrators are presented while in section 6 conclusions are offered.
TARGET ARCHITECTURE
The general view of the target architecture is shown in figure 2 . The main points of the target architecture are discussed below:
Processing units: Processing is performed on a number of instruction-set processor cores. The number of cores allocated is determined based on the performance requirements of the target application. The registers present in the cores can be exploited by the proposed methodology. Each processor core present is programmed by its private program memory.
Program memory and related bus organization:
The application code is stored in on-chip program memories. On-chip storage of the application code is the case in most embedded systems, since it reduces the total system power consumption and improves performance as well. The size of the program memory is fixed in most cases. However, there are cases in which the size of the program memory is assumed to be application dependent, i.e. determined by the application's code size. Each processor core is directly coupled through a dedicated bus to a program memory that holds the code executed by that core.
Data memory organization:
Application data are stored in an application-specific data memory hierarchy. Most levels lie on-chip, while off-chip levels are also possible. Storing the major part of the data on-chip significantly favors the reduction of power consumption. Each level of the data memory hierarchy may be divided into a number of different blocks (for power and performance reasons). Data organization is fully determined at compile time.
Bus organization:
The processor cores communicate with the different blocks of the data memory hierarchy over a number of global data, address and control buses. For simplicity the buses are merged into one in figure 2. The address space of each core is divided into parts assigned to the memory blocks used by that processor. The fact that each memory block has a single address range must be taken into consideration during the assignment of the address spaces to memory blocks when multiple cores use the same memory. All memory blocks are directly connected to the data bus. Transfers between memory blocks belonging to different levels of the hierarchy are performed through dedicated buses connecting the corresponding levels. Data transfers from memory blocks to the cores, and also between blocks, are performed in a fully deterministic way controlled by the cores (no hardware control mechanism is present, as is the case for instruction-set processor caches).
<< Insert figure 2 >>

ENERGY MODELS - COST FUNCTIONS
The main cost function targeted for optimization by the proposed methodology is the energy dissipation related to the storage and transfer of data and instructions. For multimedia applications, this energy dissipation component dominates the total energy budget of the system. This cost is heavily related to the (external) bus traffic, which is also a crucial performance measure of the global system. In the proposed context only the energy consumption in background memories and in interconnect buses is taken into account. The energy consumed in the functional units, in glue logic and in foreground storage (registers, register files) is much smaller [1] and is therefore neglected.
As far as the storage-related energy dissipation is concerned, the energy model used for the on-chip memories depends upon the number of accesses, the memory size, the number of read/write ports and the number of bits that can be accessed in every access. The model is valid for any type of memory (SRAM, DRAM etc.). The model is linear with respect to the number of accesses, while the dependence on the memory size is determined by a sub-linear polynomial function f. This function depends entirely on the technology and the specific vendor used. Thus the power consumed by an on-chip memory is given by the following equation:
The function f that has been used to produce the energy figures presented in the rest of the paper is described in [2]. For the estimation of the energy consumption of the off-chip memories, the low-power 1 Mb SRAM presented in [3] is assumed, as in [4], since no other figures for the power consumption of off-chip memories are currently available from the vendors. This assumption leads to an energy dissipation of 2.6 nJ per off-chip memory access. In this way it is assumed that the off-chip energy dissipation depends only on the number of accesses and not on the memory size. For both on-chip and off-chip memories it is assumed that the memory is in power-down mode when not accessed, since this is the case for the state-of-the-art memories currently available.
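The memory model described above can be sketched as follows. This is a minimal illustration only: the function `f_placeholder` is a hypothetical stand-in chosen just to be sub-linear in the memory size, and is not the function of [2]; the names and the scaling constant are assumptions. Only the linear dependence on the number of accesses and the 2.6 nJ off-chip figure come from the text.

```python
import math

# Figure quoted in the text: energy per off-chip access, in nJ.
OFF_CHIP_ENERGY_PER_ACCESS_NJ = 2.6


def f_placeholder(size_bits: float) -> float:
    """Hypothetical per-access energy (nJ) as a sub-linear function of size.

    Assumption for illustration only; the real f is technology and
    vendor dependent and is described in [2].
    """
    return 0.001 * math.sqrt(size_bits)


def on_chip_energy_nj(accesses: int, size_bits: float, f=f_placeholder) -> float:
    """On-chip memory energy: linear in accesses, sub-linear in size via f."""
    return accesses * f(size_bits)


def off_chip_energy_nj(accesses: int) -> float:
    """Off-chip memory energy: depends on the number of accesses only."""
    return accesses * OFF_CHIP_ENERGY_PER_ACCESS_NJ
```

With this placeholder, quadrupling the memory size only doubles the per-access energy, while doubling the number of accesses doubles the total energy.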
In terms of interconnect energy, continued technology scaling makes the energy dissipated in the on-chip buses a more important contributor to the total system energy dissipation than the energy dissipated in the functional units. The energy dissipation per line of on-chip interconnect during a transfer is given by equation 2 and depends on the energy dissipated in the related driver and the energy dissipated in the wire capacitance driven.
The energy dissipated per line of off-chip interconnect during a data transfer to/from an off-chip memory is given by eqn. 3. It is approximated as the sum of the energy dissipated in the I/O pins involved (in both the memory and the processing chip), the energy dissipated in the corresponding I/O drivers (pads) and the energy dissipation in the wire capacitance driven.
For the results presented in this paper the energy dissipated per line of on-chip interconnect has been estimated assuming a total load of 6 pF (including all the components described above). The energy dissipated in one line of an off-chip bus has been estimated assuming a total load of 30 pF (including all the related components). Both the load values have been obtained using the data presented in [4] .
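As a rough illustration of how the two load values translate into energy per bus line, the sketch below uses the standard switching-energy approximation of 0.5 * C * V^2 per transition of a capacitive load. This is an assumption, not necessarily the exact form of equations 2 and 3, and the 3.3 V supply voltage is a hypothetical value that does not appear in the text.

```python
# Load values quoted in the text, in farads.
ON_CHIP_LOAD_F = 6e-12
OFF_CHIP_LOAD_F = 30e-12

# Hypothetical supply voltage, assumed for illustration only.
ASSUMED_VDD_V = 3.3


def line_energy_j(load_f: float, vdd_v: float = ASSUMED_VDD_V,
                  transitions: int = 1) -> float:
    """Energy (J) dissipated on one bus line for a number of transitions.

    Uses the standard 0.5 * C * V^2 approximation per transition.
    """
    return 0.5 * load_f * vdd_v ** 2 * transitions


on_chip = line_energy_j(ON_CHIP_LOAD_F)
off_chip = line_energy_j(OFF_CHIP_LOAD_F)
ratio = off_chip / on_chip  # independent of the assumed supply voltage
```

Under this approximation the off-chip to on-chip energy ratio per line follows the load ratio (30 pF / 6 pF), whatever supply voltage is assumed.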
As already mentioned the proposed methodology aims at reducing the energy dissipation due to data and instruction storage and transfers. Assuming a data memory hierarchy with M blocks on-chip, N blocks off-chip, K on-chip bus lines and L off-chip bus lines the cost function that is optimized by the proposed methodology is given by eqn. 4.
Energy dissipation due to accesses to the program memory and due to the corresponding transfers is also a major contributor (in addition to the energy consumed by data storage and transfers) to the global system energy dissipation. The energy dissipation due to accesses to the program memory can be evaluated in the same way as described above. The number of executed instructions (directly proportional to the number of execution cycles) is used to approximate the number of accesses to the program memory. Thus the energy dissipation in the program memory is given by the following equation:
The size of the program memory can either be fixed (by the processor core used) or be adapted to the target application, offering more freedom (design parameter). In the latter case the program memory should be large enough to store the complete application code. It is assumed that in all cases the application code fits in the on-chip program memory. The program-related energy dissipation is given by equation 6.
The (major part of) global energy dissipation of the system is given by the following equation:
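The structure of the cost functions described in this section can be sketched as below, assuming that the data-related cost is a plain sum over the M on-chip blocks, N off-chip blocks, K on-chip bus lines and L off-chip bus lines, and that the global figure is the sum of the data-related and program-related parts. The function names are hypothetical and the additive form is an assumption based on the surrounding description, not a transcription of the paper's equations.

```python
from typing import Sequence


def data_energy(on_chip_blocks: Sequence[float],
                off_chip_blocks: Sequence[float],
                on_chip_lines: Sequence[float],
                off_chip_lines: Sequence[float]) -> float:
    """Data storage and transfer energy, assumed additive over components.

    Each sequence holds the energy already computed for one component:
    M on-chip blocks, N off-chip blocks, K on-chip lines, L off-chip lines.
    """
    return (sum(on_chip_blocks) + sum(off_chip_blocks)
            + sum(on_chip_lines) + sum(off_chip_lines))


def global_energy(data_part: float, program_part: float) -> float:
    """Major part of the system energy: data-related plus program-related."""
    return data_part + program_part


# Example with made-up component energies (arbitrary units).
e_data = data_energy([1.0, 2.0], [5.0], [0.5, 0.5], [1.0])
e_total = global_energy(e_data, 20.0)
```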
PROPOSED METHODOLOGY
In this section the complete methodology for the reduction of the power consumption due to data and instructions storage and transfers under performance and area constraints is briefly described. The high level part of the methodology is described in detail.
Global Approach
The proposed methodology is presented in figure 3 . The inputs to the methodology are an original high-level description of the target application and a set of area and performance constraints. A flow in which the processor's compiler will be used in the lower level is assumed. The methodology consists of two major steps: a) A high level step that includes the application of code transformations and b) A lower step in which detailed decisions on processor, data memory and bus organization are made.
<< Insert figure 3 >>
In the first step (that is described in detail in the following sub-section) power optimizing code transformations are applied. A transformed description of the application is produced at the end of this step. The performance of this description is evaluated on the cores selected for the realization. Taking into consideration the performance constraints a number of cores that will be used for the realization is allocated. This ensures that performance constraints are met.
In the next step the area of the implementation is estimated using the procedure described in [...]
i) Transformations that introduce performance penalties (in comparison to the original description). Such transformations negatively affect area, since more cores and related buses will be required to execute the code within the performance constraints.
ii) Transformations that introduce a large number of new data signals with increased size. Such transformations increase the data memory area and the area of the related buses.
iii) Transformations that increase code size, thus increasing the program storage requirements.
<< Insert figure 4 >>
This procedure is iterated until both performance and area constraints are met. If both types of constraints cannot be satisfied, priority is given to the performance constraints (the overriding issue in the target domain) while area is kept as small as possible. The output of the first step includes an optimized description of the application, which will be compiled in the end, as well as a set of constraints on the number of levels of the data memory hierarchy and some initial decisions on the assignment of signals to levels of the hierarchy. As far as the processor organization is concerned, the high-level step determines the number of processor cores that will be used for the realization of the application.
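The selection logic of the high-level step can be sketched as follows. This is a simplified illustration under stated assumptions: the `Candidate` record, the area model (cores times a per-core area plus memory area) and the ideal-speedup core allocation are all hypothetical and are not taken from the paper. The sketch only mirrors the priorities stated above: performance is met by allocating cores, area is then checked, and if the area budget cannot be met the smallest-area candidate is kept.

```python
from dataclasses import dataclass
from math import ceil
from typing import List


@dataclass
class Candidate:
    """One transformed description of the application (hypothetical record)."""
    name: str
    energy: float        # estimated storage/transfer energy
    cycles: int          # execution cycles on a single core
    memory_area: float   # data plus program memory area


def cores_needed(cycles: int, cycle_budget: int) -> int:
    """Cores required to meet the cycle budget, assuming ideal speedup."""
    return max(1, ceil(cycles / cycle_budget))


def total_area(c: Candidate, cycle_budget: int, core_area: float) -> float:
    """Assumed area model: allocated cores plus memories."""
    return cores_needed(c.cycles, cycle_budget) * core_area + c.memory_area


def select(cands: List[Candidate], cycle_budget: int,
           area_budget: float, core_area: float) -> Candidate:
    """Pick the lowest-energy candidate that meets both constraints.

    If no candidate fits the area budget, performance keeps priority
    (cores are still allocated to meet it) and area is minimized.
    """
    feasible = [c for c in cands
                if total_area(c, cycle_budget, core_area) <= area_budget]
    if feasible:
        return min(feasible, key=lambda c: c.energy)
    return min(cands, key=lambda c: total_area(c, cycle_budget, core_area))
```

For example, a low-energy candidate that needs three cores loses to a higher-energy single-core candidate when the area budget is tight, and wins when the budget is relaxed.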
The second step of the proposed methodology includes three main sub-steps, all of which are power oriented and take performance and area constraints into account.
After the application of the second step, performance and area constraints are evaluated and, if they are not met, local (over the second step) and global (over both steps) feedback loops can be performed. The output of the second step is a detailed processor, memory and bus organization of the system.
The Memory Management Graph
To better explain the energy minimization procedure and the effect of the power-optimizing code transformations on power, the concept of the Memory Management Graph (MMG) [27] is used. The memory management graph $G_{MM} = \{V, E\}$ is a directed graph whose vertex set $V = \{v_i \mid i = 1, 2, \dots, N\}$ is in one-to-one correspondence with the set of tasks. The directed edge set $E = \{(i, j)\}$ is in correspondence with the data arrays passed from one task to another.
The edges of the graph are annotated with two costs:
a) The size of the data arrays $S_{ij}$ ($i, j = 1, 2, \dots, N$) passed from the source task ($i$) to the sink task ($j$) of the edge.
b) The number of accesses $A_{ij}$ ($i, j = 1, 2, \dots, N$) to the arrays implied by the edges.
It must be noted that the edges connecting primary inputs (PIs) or primary outputs (POs) to tasks of the graph correspond to input or output arrays of the algorithm. These edges are annotated with the same information as explained above. An edge connecting the $k$-th primary input to task $i$ is denoted as $(PI_k, i)$. In the same way, an edge connecting task $j$ to the $l$-th primary output is denoted as $(j, PO_l)$.
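The edge annotations described above can be captured in a small data structure. The sketch below is illustrative only: the class and method names are hypothetical, nodes are plain strings, and primary inputs and outputs are assumed to be recognizable by the prefixes "PI" and "PO".

```python
from dataclasses import dataclass, field
from typing import Dict, Set, Tuple


@dataclass
class MemoryManagementGraph:
    """Directed graph of tasks; edges carry array size and access count."""
    tasks: Set[str] = field(default_factory=set)
    # (source, sink) -> (size S_ij, number of accesses A_ij)
    edges: Dict[Tuple[str, str], Tuple[int, int]] = field(default_factory=dict)

    @staticmethod
    def is_primary(node: str) -> bool:
        """Assumed naming convention for primary inputs and outputs."""
        return node.startswith(("PI", "PO"))

    def add_edge(self, src: str, dst: str, size: int, accesses: int) -> None:
        """Add an annotated edge; non-primary endpoints become tasks."""
        for node in (src, dst):
            if not self.is_primary(node):
                self.tasks.add(node)
        self.edges[(src, dst)] = (size, accesses)

    def total_size(self) -> int:
        """Sum of the array sizes over all edges."""
        return sum(size for size, _ in self.edges.values())

    def total_accesses(self) -> int:
        """Sum of the access counts over all edges."""
        return sum(acc for _, acc in self.edges.values())


g = MemoryManagementGraph()
g.add_edge("PI1", "t1", size=1024, accesses=4096)   # input array
g.add_edge("t1", "t2", size=256, accesses=512)      # intermediate array
g.add_edge("t2", "PO1", size=1024, accesses=1024)   # output array
```

Summing the edge annotations in this way gives a quick view of how a transformation changes the storage size and access counts of a description.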
Each vertex $v_i$ is annotated with two costs:
a) The number of instructions [...]
2) Loop/control flow transformations [18]: Increase the locality and regularity of the accesses, enabling a reduction of the number of accesses to the larger background memories in the memory hierarchy.
3) Data reuse transformations [19]: Introduce smaller array signals that hold copies of the parts of larger signals that exhibit data reuse.
In-place mapping [20] 
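The idea behind the data reuse transformations listed above can be illustrated with a small example. The sketch below is hypothetical (the function names and the access counters are assumptions made for illustration): a row of a large array that is read repeatedly is first copied into a small buffer, so that most accesses move from the large memory to the small one while the computed result stays the same.

```python
from typing import List, Tuple


def original(frame: List[List[int]], row: int, width: int,
             repeats: int) -> Tuple[int, int]:
    """Read the same row from the large array on every repetition."""
    big_reads = 0
    acc = 0
    for _ in range(repeats):
        for x in range(width):
            acc += frame[row][x]
            big_reads += 1
    return acc, big_reads


def with_reuse_buffer(frame: List[List[int]], row: int, width: int,
                      repeats: int) -> Tuple[int, int, int]:
    """Copy the reused row once into a small buffer, then read the buffer."""
    big_reads = 0
    small_accesses = 0
    buf = [0] * width
    for x in range(width):          # one-time copy from the large array
        buf[x] = frame[row][x]
        big_reads += 1
        small_accesses += 1         # write into the small buffer
    acc = 0
    for _ in range(repeats):
        for x in range(width):
            acc += buf[x]
            small_accesses += 1     # read from the small buffer
    return acc, big_reads, small_accesses
```

The total number of accesses grows slightly, but the accesses to the large array drop from repeats * width to width, which is the trade the energy model rewards when the small buffer is much cheaper per access.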
Additionally our approach includes a group of transformations that targets the reduction of the energy dissipated due to instruction accesses and transfers. These transformations do not affect the way data are accessed or stored.
d) Instructions reduction transformations:
Mainly target the reduction of the number of executed instructions but may also affect the code size. The number of instructions is given higher priority than code size, since energy is linearly proportional to the number of instructions and sub-linearly proportional to code size, according to the energy model described in section 3.
Common sub-expression elimination, constant expression elimination, loop invariant code motion and algebraic transformations belong to this category. The effect of the instruction reduction transformations can be described by the following equations using the memory graph notation:
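Two of the transformations named above, common sub-expression elimination and loop invariant code motion, can be illustrated with a small before/after pair. The example is a hypothetical one written in Python for readability; in the target flow such transformations would be applied to the application's source description before it is passed to the processor compiler.

```python
from typing import List


def before(a: int, b: int, c: int, n: int) -> List[int]:
    """Original form: a * b + c is recomputed twice in every iteration."""
    out = []
    for i in range(n):
        out.append((a * b + c) * i + (a * b + c))
    return out


def after(a: int, b: int, c: int, n: int) -> List[int]:
    """Transformed form: the shared, loop-invariant expression is hoisted."""
    t = a * b + c                  # computed once, outside the loop
    out = []
    for i in range(n):
        out.append(t * i + t)
    return out
```

Both versions produce the same values, but the transformed loop body executes one multiplication and one addition per iteration instead of three multiplications and three additions.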
The claim here is that these categories should be applied in the specific order above in order to better control the energy-area trade-off. Transformations that enable the application of all the above-described energy optimizing transformations may also be applied. Data flow transformations that remove data dependencies and loop transformations that improve locality and regularity are the most important enabling transformations.
Another important issue in our approach is that the array signals in an algorithm description are classified, based on their functionality, into the following categories [13, 27]: [...]
The detailed order of application, which is the main contribution of this paper, is described in figure 6. In the first step, data flow transformations are applied to reduce accesses to the array signals of the description (see [17]). Then the size reduction step is applied. The long-lifetime signals are optimized mainly by applying in-place mapping, mostly intra-signal but also inter-signal, especially between input and output signals. Intermediate signals can be optimized mainly by applying loop transformations and in-place mapping [20]. At the same time, enabling transformations are also applied. In the next step, access-moving transformations are applied 
EXPERIMENTAL RESULTS
In this section experimental results from the application of the high-level part of the proposed methodology to a number of real-life demonstrators are presented. The demonstrators include the full search motion estimation kernel [21], the QSDPCM algorithm for video compression [22], the SGLDM texture analysis algorithm [23] and a voice coder application [24]. The ARM 7 processor core has been used for the realization of the demonstrators. For the QSDPCM application, realization on 14 cores was assumed, using the hybrid (task-data level) partitioning of QSDPCM described in [16] that is best in terms of power. For the remaining applications, single-core realizations were assumed.
As far as the data memory organization is concerned, the number of levels is assumed to be the minimum required to make all the access-moving transformations meaningful in terms of power, i.e. to ensure that no data transfers beginning and ending in the same level of the data hierarchy exist in the transformed code (for example, if the maximum number of new levels introduced by all the access-moving transformations in one application is three, then the data memory hierarchy should include at least three levels). It is also assumed that the levels of the data memory hierarchy are not divided into blocks (centralized architecture). These assumptions will be used for the high-level evaluation of the data storage and transfer related power consumption and of the data memory area.
<< Insert figure 7 >> << Insert figure 8 >>
For the program memory it is assumed that its size is not fixed i.e. it can be adapted to the size of a specific code and thus it is an important design parameter. No other implementation specific assumptions (number of memory ports, bus organization) are made. An important point is that the ARM debugger that has been used for the simulations does not evaluate cache related effects i.e. it assumes that all the accesses to the array signals present in the application code have the latency required to access the (off-chip) main memory. The proposed methodology moves certain signals in levels of the data memory hierarchy lying on-chip and closer to the cores where accesses have shorter latencies than those to the main memory. Thus the performance results presented in this paper are underestimates of the real performance that can be achieved if the proposed memory organization is adopted.
The effect of the proposed methodology on the data transfer and storage related power 
LIST OF FIGURES
