Abstract: Application studies in the areas of image and video processing indicate that between 50 and 80% of the power cost in these systems is due to data storage and transfers. This is especially true for multi-processor realizations, because conventional parallelization methods ignore the power cost and focus only on performance. However, the power consumption also depends heavily on the way a system is parallelized. To reduce this dominant cost, we propose to address the system-level storage organization for the multi-dimensional signals as a first step in mapping these applications, before the parallelization or partitioning decisions (in particular before the SW/HW partitioning, which is traditionally done too early in the design trajectory). Our methodology is illustrated on a parallel QSDPCM video codec.
I. INTRODUCTION AND RELATED WORK
In multi-media applications and others that make use of large multi-dimensional array-type data structures, a very large amount of memory is required. Although the cost of memory is decreasing continuously, the ever increasing storage requirements in these data-dominant applications usually make the memory cost one of the dominant contributions to the total system cost [1]. This is especially true for embedded systems [2], [3].
Most research efforts in the parallel processing community address the problem of parallelization and load balancing [4], [5], [6], [7], [8]. However, these approaches focus only on speed and not on power consumption. They do not take into account the background storage related cost when applied to data-dominated applications such as multi-media applications. A first approach for more global memory optimization in a parallel processor context was described by us in [9].
The same gap exists for heterogeneous hardware/software systems. Most of the research effort in HW/SW co-design so far has targeted issues like modeling and simulation [10], interfaces [11], [12], [13], [14] and especially partitioning (manual [15], [16], [17] and automatic [18], [19], [20]). However, all of these approaches ignore the heavy impact of the data storage cost when they are applied to data-dominated applications such as video or image processing. An approach for system level memory management in a HW/SW co-design context was first described by us in [21].
These conventional parallelization and HW/SW co-design approaches treat partitioning and load balancing as the only key issues, so they perform these first in the overall methodology. For the typical image processing system in figure 1, this means that all the submodules are first assigned to the best matched processors, and afterwards they are treated fully separately and compiled in an optimized way onto the corresponding HW or SW processor. This strategy leads to a good load balancing solution, but unfortunately it will typically give rise to a significant buffer overhead between the different submodules.
In section II, we will explain our methodology to remedy this situation. In section III a representative complex application from our target domain i.e. the QSDPCM video codec is described. In section IV, the design of the QSDPCM codec is described for the conventional methodology, and in section V, the same is done for our new methodology. Final conclusions are drawn in section VI.
II. METHODOLOGY
The conventional design methodology is illustrated in figure 2. An initial specification typically consists of a number of submodules, and the boundaries between these submodules are most often determined by issues like design reuse and top-down design (as in figure 1). In the design trajectory, these submodules are from then on treated fully separately: during the parallelization and SW/HW partitioning, they are assigned to the best matched processors, and afterwards they are optimized only locally.
This will typically lead to large buffers between the submodules, because of the mismatch between the data produced and consumed by them. A simple example is when one submodule produces an image line by line, while the next submodule consumes it column by column.
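The line-by-line versus column-by-column mismatch can be made concrete with a small C sketch. All names and sizes here (`W`, `H`, `produce_pixel`, `mismatched`, `matched`) are hypothetical and chosen only for illustration; they do not come from the QSDPCM specification. When the producer runs row by row and the consumer column by column, the first column is complete only after the last row has been produced, so a full image buffer is needed. When both orders match, a single word suffices.

```c
#include <assert.h>
#include <stddef.h>

enum { W = 4, H = 3 };  /* hypothetical tiny image: W columns, H rows */

/* Hypothetical producer: pixel value as a function of its position. */
static int produce_pixel(int row, int col) { return row * W + col; }

/* Mismatched orders: the producer is row-major, the consumer column-major.
   A full W*H buffer sits between the two submodules.
   out[c] receives the sum of column c; the buffer size in words is returned. */
size_t mismatched(int out[W]) {
    int buffer[H][W];
    for (int r = 0; r < H; r++)          /* producer: line by line */
        for (int c = 0; c < W; c++)
            buffer[r][c] = produce_pixel(r, c);
    for (int c = 0; c < W; c++) {        /* consumer: column by column */
        out[c] = 0;
        for (int r = 0; r < H; r++)
            out[c] += buffer[r][c];
    }
    return sizeof buffer / sizeof buffer[0][0];
}

/* Matched orders: the producer loop nest is interchanged to column-major,
   so every pixel is consumed as soon as it is produced. */
size_t matched(int out[W]) {
    for (int c = 0; c < W; c++) {
        out[c] = 0;
        for (int r = 0; r < H; r++) {
            int pixel = produce_pixel(r, c);  /* the only word in flight */
            assert(pixel >= 0);
            out[c] += pixel;
        }
    }
    return 1;
}
```

Both functions compute the same column sums; only the storage between producer and consumer differs (W*H words against one word).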
Our methodology is illustrated in figure 3. We propose to first apply storage and transfer oriented optimizations between the different systems. Initially, all the submodules containing multi-dimensional processing are combined into one global specification model, and then optimized as a whole in terms of data storage and transfers. In this way, global transformations over the initial submodule boundaries are performed. After these optimizations, a different set of subsystems with different boundaries is passed to the parallelization or SW/HW partitioning step. The boundaries between the submodules are now no longer determined by design reuse or top-down design, but by storage and transfer oriented decisions. In this way, the final system will require much smaller buffers, which results in reduced area and power consumption.
This work on data storage and transfer optimization in a parallel processor context extends our previous research on memory management for customized single-processors architectures, for which we have proposed the ATOMIUM methodology [22] . A number of data storage and transfer optimization strategies are included in this methodology:
Memory oriented data-flow analysis and pruning of the initial system specification.
Global data-flow and loop transformations of the system's description for reduction of the required background memories and accesses but also to enable further optimization steps.
Data reuse transformations to exploit a distributed memory hierarchy more effectively in the algorithm.
Storage cycle budget distribution to determine the bandwidth requirements and the balancing of the available cycle budget over the different memory accesses in the algorithmic specification.
Memory allocation and signal to memory/port assignment, to determine the necessary number of memory units and their type (if freedom is left, e.g. in the main memory organisation). The goal of this step is to produce a netlist of memory units from a memory library as well as an assignment of signals to the memory units.
In-place mapping. The aim is to reduce the size of memory required by storing signals with non-overlapping lifetimes in the same physical memory location (using aggressive data transformations).
In this paper, we will mainly show that these system-level memory management steps are still valid in the context of parallel systems, and that they have to be applied before the parallelization or SW/HW partitioning decisions. Therefore, the next sections will show the mapping of the QSDPCM demonstrator application, without applying our methodology (section IV) and after applying our methodology (section V).
III. THE QSDPCM ALGORITHM

A. Introduction
The QSDPCM (Quadtree Structured Difference Pulse Code Modulation) technique is an interframe compression technique for video images [23]. It involves a motion estimation step, and a quadtree based encoding of the motion compensated frame-to-frame difference signal.
A fundamental property of the QSDPCM technique is the joint optimization of both the displacement vector and the quadtree mean decomposition in a sense such that the total frame-to-frame update information can be coded with a minimum number of bits at a given distortion.
B. Detailed algorithm description
A global view of the QSDPCM coding algorithm is given in figure 4. In a first step, the actual image is 4 × 4 mean subsampled, and matched (using motion estimation) with the reconstructed previous image. The resulting displacement vector is used as an initial displacement in the second stage, where the same is done for 2 × 2 subsampled versions of the images. The refined (but not full resolution) displacement vector obtained in this way provides the initial guess for the quadtree decomposition.
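As an illustration of the mean subsampling used in both stages, the following C sketch averages each f × f block of an 8-bit image into one output pixel (f = 4 for the first stage, f = 2 for the second). The function name `mean_subsample`, the row-major layout and the round-to-nearest rule are assumptions made for this sketch; the source does not specify them.

```c
#include <assert.h>
#include <stdint.h>

/* Mean-subsample a row-major 8-bit image by a factor f in both dimensions.
   width and height must be multiples of f. dst must hold
   (width / f) * (height / f) pixels. Rounding to nearest is an assumption. */
void mean_subsample(const uint8_t *src, int width, int height, int f,
                    uint8_t *dst) {
    assert(f > 0 && width % f == 0 && height % f == 0);
    int out_w = width / f;
    unsigned area = (unsigned)(f * f);
    for (int by = 0; by < height / f; by++) {
        for (int bx = 0; bx < out_w; bx++) {
            unsigned sum = 0;
            for (int y = 0; y < f; y++)          /* accumulate one f x f block */
                for (int x = 0; x < f; x++)
                    sum += src[(by * f + y) * width + (bx * f + x)];
            dst[by * out_w + bx] = (uint8_t)((sum + area / 2) / area);
        }
    }
}
```

Calling it with f = 4 and then f = 2 on the same frame yields the two subsampled versions on which the coarse and the refined displacement search operate.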
The final displacement and the quadtree decomposition are determined in a joint optimization procedure. For each displacement in a +/-1 interval around the initial guess, the 16
difference signal between the current frame and the previous frame is computed and 2 The last step in the algorithm is the generation of the decoded frame (using motion compensation). Finally, the decoded frames are upsampled by a factor of 2.
Most of the submodules of the algorithm operate in a critical cycle with an iteration period bound (IPB) [24], [25] of 1 (one frame). Indeed, to apply the first step of the algorithm to a new frame, the previous frame must already have been reconstructed and 4 × 4 subsampled. However, the pipeline interleaving manipulation [26] can easily break this cycle, by processing each frame in parts instead of as a whole.
The main procedure (coding of one frame) is shown in figure 5.
C. Profiling of the initial algorithm
In all the experiments described in this paper it is assumed that the QSDPCM algorithm operates on frames in CIF format (288x528 pixels for the luminance and chrominance pixels together).
The QSDPCM algorithm was simulated using the "Coast Guard" test sequence. The total number of operations per frame is 12988 K. The most computation intensive tasks are the motion estimation with pixel accuracy and the adaptive upsam- Assuming an operation frequency of 50 MHz (taking into account some instruction level parallelism) and an arithmetic operation efficiency of 50% (the rest is lost in overhead for condition and address handling) for programmable (software) digital signal processors, as well as a frame rate of 25 frames/s, about 13 software processors would be required for the algorithm.
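The processor estimate can be reproduced with a few lines of C. This is a sketch under stated assumptions: "K" is read as 1000, a processor is taken to issue one operation per clock cycle before the efficiency factor is applied, and the function name `processors_needed` is hypothetical.

```c
#include <assert.h>

/* Number of software processors needed to sustain the frame rate:
   required operations per second divided by the useful operations per
   second of one processor, rounded up. */
long long processors_needed(long long ops_per_frame, long long frames_per_s,
                            long long clock_hz, long long efficiency_percent) {
    assert(clock_hz >= 100 && efficiency_percent > 0);
    long long required = ops_per_frame * frames_per_s;        /* ops/s needed  */
    long long per_proc = clock_hz / 100 * efficiency_percent; /* useful ops/s  */
    return (required + per_proc - 1) / per_proc;              /* ceiling       */
}
```

With 12988 K operations per frame, 25 frames/s, 50 MHz and 50% efficiency, the required rate is 324.7 M operations/s against 25 M useful operations/s per processor, which rounds up to 13 processors.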
The total size of the array signals is 532 K words of 8 bits and 479 K words of 12 bits. The arrays for the previous, current and reconstructed frame (456 K 8-bit words) as well as the array for the difference blocks (342 K 12-bit words) dominate this figure. The total number of accesses to these signals per frame is 9800 K.
The array signals of the initial description of the QSDPCM algorithm can be divided in two categories:
Category A: those which are either inputs or outputs of the algorithm on a per frame basis (coding of one frame). The following signals belong to this category: previous_frame, frame, rec_frame, previous_rec_sub2 and rec_sub2.
Category B: those which are intermediate results during the processing of one frame. All the other array signals belong to this category.
The memory requirements of the two categories (based on the initial description) are shown in table I.
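The 8-bit storage figures quoted in this section can be cross-checked from the frame dimensions. The sketch below assumes that "K" means 1000 words and that the two sub2 arrays of category A hold frames subsampled by 2 in each dimension; neither assumption is stated explicitly in the text, and the function names are hypothetical.

```c
#include <assert.h>

enum { FRAME_ROWS = 288, FRAME_COLS = 528 };  /* CIF, luminance and chrominance together */

/* Words in one full-resolution frame array. */
long full_frame_words(void) { return (long)FRAME_ROWS * FRAME_COLS; }

/* Words in one frame subsampled by 2 in both dimensions (assumption). */
long sub2_frame_words(void) {
    assert(FRAME_ROWS % 2 == 0 && FRAME_COLS % 2 == 0);
    return (long)(FRAME_ROWS / 2) * (FRAME_COLS / 2);
}

/* Three full frames plus two half-resolution frames. */
long category_a_words(void) {
    return 3 * full_frame_words() + 2 * sub2_frame_words();
}
```

Under these assumptions one frame is 152064 words (about 152 K), three frames give about 456 K, and adding the two subsampled arrays gives 532224 words, in line with the 532 K 8-bit words reported.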
D. Architecture model
The target architecture model assumed for the partitioning of the QSDPCM application is shown in figure 6 .
Communication between two processors is initially accomplished by means of double buffers which is the traditional solution to decrease the design complexity by decoupling the tasks on the processors.
IV. DESIGN BASED ON THE CONVENTIONAL METHODOLOGY
Using the conventional methodology, the processor partitioning is done based on the initial description of the algorithm (i.e. the initial division in submodules). We will describe two alternatives: data level and task level partitioning. They are only discussed briefly, as the focus of this paper is on the methodology of the next section and not on the partitionings themselves.
Fig. 7. Task level partitioning
A. Data level partitioning
In data level partitioning (exploiting data-parallelism), each of the 13 processors will perform the whole algorithm on its own part of the current frame (so each processor will work on approximately 46 blocks). The results (the required storage and accesses and the corresponding area and power) are shown in table II. It is also indicated whether the signals will be stored on-chip or off-chip.
Estimated area and power figures were obtained using a Motorola model for embedded SRAM 1 . The parameters of the model are the bit-width, word depth, access frequency and the number of ports. Except for section V-E, we have assumed that all signals (within one category) are stored in one single-port memory. This model has also been used for our off-chip area and power figures 2 , but we have indicated in the tables whether off-chip or on-chip memory is used.
B. Task level partitioning
In task level partitioning, the different functions (or parts of them) of the algorithm are assigned in parts of about 1 million operations over the 13 processors. This partitioning is combined with the pipeline interleaving manipulation [26] applied at the system level and each frame is processed in parts equal to its 1/13th. tioning these buffers are present in each processor (although not double-buffered) requiring 342 K words in total.
C. Conclusions
The description of task level partitioning and data level partitioning as well as their comparison proves that, when power is an issue, data and storage related costs must be taken into consideration during processor partitioning. Otherwise (i.e. when only performance is taken into account), partitionings that are very bad for power can be chosen. Indeed, for QSDPCM the data level partitioning would be chosen in that case because the algorithm is very regular and load balancing is clearly better for the data-parallel solution.
However, the next section will show that even much better results can be obtained when a system-level memory management step is done before the partitioning.
V. DESIGN BASED ON OUR METHODOLOGY
In this section the processor partitioning takes place after applying our system-level memory management methodology. First, in-place optimization will be applied to the category A signals. Next, a global loop reorganization will be performed, mainly affecting the category B signals. The task level partitioning is then performed to demonstrate the effect. We will also show the results of a crude data reuse exploration and memory allocation and assignment step. It must be noted that no memory hierarchy is introduced yet and also the cycle budget distribution is left out at the moment (see section II and [22] ).
A. In-place optimization
The category A signals occupy 532 K in the original algorithm. However using aggressive in-place mapping [27] this memory size can be heavily reduced, as explained below.
After performing a lifetime analysis, it is clear that the signals previous_frame and rec_frame have non-overlapping lifetimes, so they can be mapped to the same memory space (152 K savings). Furthermore, the previous_frame and frame signals can be in-place mapped to 166 K (instead of 2 × 152 K), because the motion estimation has a limited search area. So as we advance through the current frame, a part of the previous frame can already be overwritten [28].
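The overwrite argument can be illustrated with a one-dimensional analog in C. This is a simplified sketch, not the actual QSDPCM mapping: the signal length `N`, the search range `R`, the windowed-maximum operation and both function names are hypothetical. Each new element depends only on old elements within a range R, so old and new signals can share one buffer of N + R words instead of two arrays of N words each.

```c
#include <assert.h>

enum { N = 16, R = 2 };  /* hypothetical signal length and search range */

/* Reference version with two separate arrays (2 * N words):
   new_sig[i] is the maximum of old_sig[i-R .. i+R], clipped at the borders. */
void windowed_max(const int old_sig[N], int new_sig[N]) {
    for (int i = 0; i < N; i++) {
        int best = old_sig[i];
        for (int d = -R; d <= R; d++) {
            int j = i + d;
            if (j >= 0 && j < N && old_sig[j] > best) best = old_sig[j];
        }
        new_sig[i] = best;
    }
}

/* In-place version with one buffer of N + R words.
   On entry the old signal occupies buf[R .. N+R-1];
   on exit the new signal occupies buf[0 .. N-1].
   Writing buf[i] destroys old element i-R, which lies outside the search
   window of every element that is still to be computed. */
void windowed_max_inplace(int buf[N + R]) {
    for (int i = 0; i < N; i++) {
        int best = buf[i + R];                 /* old element i */
        for (int d = -R; d <= R; d++) {
            int j = i + d;
            if (j >= 0 && j < N && buf[j + R] > best) best = buf[j + R];
        }
        assert(i < N + R);
        buf[i] = best;                         /* overwrite a dead old element */
    }
}
```

The extra R words play the role of the 14 K that remain on top of one 152 K frame in the 166 K figure: the overhead is set by the search range, not by the frame size.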
The total memory space required for the shared array signals is now 206 K (factor 2.6 improvement). This leads to important area and power savings (see the values for the category A signals in table IV). However, this memory space is still large enough to impose off-chip memory. The power can be further reduced by introducing a memory hierarchy combined with loop transformations that increase locality, as described in the full ATOMIUM methodology [22]. However, in this paper it is assumed that only one level of memory hierarchy is available. 3
B. Global loop reorganization
To apply system level memory optimizations, all the procedures in the initial description of the QSDPCM application are combined in one big procedure. This enables application

3 Note that we have both off-chip and on-chip memory, but these are not organized in a hierarchy (see figure 6).

    int main()
    {
        /* two merged outer loops over the block indices */
        for (x = 0; x < number_of_blocks_x; x++)
            for (y = 0; y < number_of_blocks_y; y++)
            {
                SubSamp4(previous_frame, prev_sub4_region, x, y);
                SubSamp4(frame, sub4_block, x, y);
                SubSamp2(previous_frame, prev_sub2_region, x, y);
                SubSamp2(frame, sub2_block, x, y);
                V4(sub4_block, prev_sub4_region, x, y, v4x, v4y);
                V2(sub2_block, prev_sub2_region, x, y, v4x, v4y, v2x, v2y);
                V1Diff(block, previous_frame, x, y, v2x, v2y, diff_blocks);
                QuadDecomp(diff_blocks, x, y, v1x, v1y);
                QuadConstruct(block, previous_frame, x, y, v1x, v1y, rec_block);
                Reconstruct(rec_block, previous_rec_sub2, x, y, v1x, v1y, rec_sub2_block);
                UpSamp2(rec_sub2_block, rec_frame, x, y);
            }
    }

A global loop merging operation was applied to the initial description so that there are two outer loops that iterate over the block indices. Indeed, there are no dependences at all between two blocks of the same frame, so it is easy to see that this is a valid transformation. In this way only buffers for one block are required between submodules instead of frame buffers. The global loop reorganization is illustrated in figure 8 (compare this with figure 5). Note that to apply this global loop merging, a strip mining (loop tiling) transformation [29] had to be applied first in some procedures that were initially not block-based (such as SubSamp2 and SubSamp4).
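The effect of global loop merging on the buffer between two submodules can be sketched in C with two hypothetical stages. The names `stage1`, `stage2`, `frame_based`, `block_based` and the sizes are illustrative only and are not part of the QSDPCM description; the sketch relies on the same property as above, namely that there are no dependences between blocks.

```c
#include <assert.h>

enum { BLOCKS = 6, BLOCK_WORDS = 4 };  /* hypothetical sizes */

/* Hypothetical first submodule: produces word w of block b. */
static int stage1(int b, int w) { return 3 * b + w; }

/* Hypothetical second submodule: reduces one block to a single value. */
static int stage2(const int block[BLOCK_WORDS]) {
    int s = 0;
    for (int w = 0; w < BLOCK_WORDS; w++) s += block[w];
    return s;
}

/* Initial description: each submodule loops over the whole frame, so a
   frame-sized buffer sits between them. Returns the buffer size in words. */
int frame_based(int out[BLOCKS]) {
    int frame_buf[BLOCKS][BLOCK_WORDS];
    for (int b = 0; b < BLOCKS; b++)
        for (int w = 0; w < BLOCK_WORDS; w++)
            frame_buf[b][w] = stage1(b, w);
    for (int b = 0; b < BLOCKS; b++)
        out[b] = stage2(frame_buf[b]);
    return BLOCKS * BLOCK_WORDS;
}

/* After loop merging: one outer loop over the blocks, and only a
   one-block buffer between the two submodules. */
int block_based(int out[BLOCKS]) {
    int block_buf[BLOCK_WORDS];
    for (int b = 0; b < BLOCKS; b++) {
        for (int w = 0; w < BLOCK_WORDS; w++)
            block_buf[w] = stage1(b, w);
        out[b] = stage2(block_buf);
    }
    assert(BLOCK_WORDS < BLOCKS * BLOCK_WORDS);
    return BLOCK_WORDS;
}
```

Both variants produce identical outputs; the buffer shrinks by a factor equal to the number of blocks per frame.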
The task level partitioning, described in subsection IV-B, was performed again. The memory size requirements for the category B signals are now only 4128 words. This memory size is extremely small in comparison to the 287 K words required initially. This is true because only buffers for one block are now present between modules. Another important point is that these buffers can be stored on-chip because of their small size.
C. Additional loop transformations between submodules
C.1 Loop merging between V1Diff and QuadDecomp
After the global loop reorganization, the main part of the memory required for the category B signals is used to buffer the nine 8 × 8 difference blocks produced by the V1Diff procedure. These difference blocks are quadtree decomposed by the QuadDecomp procedure. Both these procedures iterate over the nine possible displacements.
These buffers can be reduced if the two procedures are merged in a common loop that iterates over the nine possible motion vectors. In this way only one 8 × 8 difference block needs to be stored between the V1Diff and QuadDecomp procedures. The required memory size is now reduced from 4128 to 1568 words.
C.2 Strip mining and merging in V1Diff and QuadDecomp
Further reduction of the size of the buffer required between the procedures V1Diff and QuadDecomp can be achieved, since the minimum amount of data required to begin the bottom-up quadtree decomposition is two 8-pixel lines of a difference block produced by the V1Diff procedure. So only two 8-pixel lines need to be stored between these two procedures. The size of the memory required for the category B signals is now further reduced from 1568 to 1268 words.
C.3 Loop interchange in motion estimation with 2 pixels accuracy
In the initial QSDPCM description, in all motion estimation procedures there is an outer loop iterating over all possible displacements and an inner loop iterating over the pixels of the block under coding. So first all pixels of the current block are compared with a candidate matching block, then with another candidate matching block, and so on. For each candidate, a mean absolute difference (MAD) can be computed and immediately compared with the best MAD so far. This means that the current block needs to be buffered as a whole, and that only one register is required for the MAD, as illustrated in the upper part of figure 9.
If these two loops are interchanged, the first pixel of the current block is compared with all candidate matching positions, and the resulting contributions to the MAD for all these positions are stored in an array of MAD values. Then the same is done for the second pixel of the current block (the contributions to the MAD are now added to the array entries instead of simply stored), and so on. As shown in the lower part of figure 9, this means that the current block need not be buffered; only one register for the current pixel is required. On the other hand, buffer space for all MAD values is now required, because

After the application of this transformation, the number of accesses to the shared array signals (category A) stored in the off-chip buffers is reduced by 155 K, i.e. it becomes 4900 K. The number of accesses to the category B signals increases from 8520 K to 8558 K. In total, however, this transformation is beneficial in terms of power consumption, since it eliminates 155 K off-chip accesses and introduces 38 K on-chip accesses, which are less costly in terms of power.
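The two loop orders can be compared in a one-dimensional C sketch. The block size, the number of candidate displacements and both function names are hypothetical. The sketch accumulates the sum of absolute differences rather than the mean; the two differ only by the constant block size, so the best displacement is the same.

```c
#include <assert.h>
#include <stdlib.h>

enum { BLOCK = 8, CANDIDATES = 5 };  /* hypothetical 1-D block size and search positions */

/* Original order: outer loop over displacements, inner loop over pixels.
   The whole current block is re-read for every candidate, so it must be
   buffered, but a single accumulator register is enough. */
int best_displacement_original(const int cur[BLOCK],
                               const int ref[BLOCK + CANDIDATES - 1]) {
    int best_d = 0, best_sad = -1;
    for (int d = 0; d < CANDIDATES; d++) {
        int sad = 0;                              /* single accumulator */
        for (int p = 0; p < BLOCK; p++)
            sad += abs(cur[p] - ref[p + d]);
        if (best_sad < 0 || sad < best_sad) { best_sad = sad; best_d = d; }
    }
    return best_d;
}

/* Interchanged order: outer loop over pixels, inner loop over displacements.
   Each current pixel is read once into a register, but one accumulator per
   candidate is needed. */
int best_displacement_interchanged(const int cur[BLOCK],
                                   const int ref[BLOCK + CANDIDATES - 1]) {
    int sad[CANDIDATES] = {0};                    /* one entry per candidate */
    for (int p = 0; p < BLOCK; p++) {
        int pixel = cur[p];                       /* single pixel register */
        for (int d = 0; d < CANDIDATES; d++)
            sad[d] += abs(pixel - ref[p + d]);
    }
    int best_d = 0;
    for (int d = 1; d < CANDIDATES; d++)
        if (sad[d] < sad[best_d]) best_d = d;
    assert(best_d >= 0 && best_d < CANDIDATES);
    return best_d;
}
```

Both functions return the same displacement. The interchange trades a block-sized pixel buffer for an accumulator array with one entry per candidate, which is the storage trade-off described in the text.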
D. Results
The two versions of the task level partitioning (the initial and reorganized descriptions) are compared in table IV.
The impact of the global loop reorganization before partitioning is obvious. The category B signals require 287 K words in the partitioning based on the initial description. Almost all these buffers are eliminated when an extensive loop reorganization takes place first. Only 1190 words now need to be stored. An important point is that this small amount of words can be stored on-chip. This is crucial in terms of power consumption and leads to very large power savings.
E. Distributed memory organization
So far it has been assumed that the category B signals are stored in one single-port on-chip memory. This is a high level assumption that guides initial decisions in the design flow. The envisioned situation is a distributed memory architecture. To generate such an architecture, memory allocation and assignment steps (see section II) are required. These steps have not been explored in depth here, but enough to give a fairly good estimate of the effect (w.r.t. the area and power costs) of the distribution of the array signals to different memories.
The main issue for the assignment of the array signals to different memories is the available bandwidth i.e. the frequency with which the memory can be accessed. For the exploration under consideration single port memories with 50 MHz bandwidth were assumed.
The generated distributed memory architecture is shown in figure 10. The effect on the power consumption is given in the last line of table IV: a further factor of 20 is gained (for the power model used).
VI. CONCLUSIONS
The aim of this research has been to demonstrate the impact of memory management for multi-media applications in a parallel processor context. Partitioning is conventionally performed without taking into consideration data storage implications. Since in data dominated systems storage and transfers heavily affect the area and power cost, this approach leads to very costly designs. The research described in this paper clearly demonstrates that the storage and transfer cost should be taken into account before any partitioning decision is made, in particular before the SW/HW partitioning decision which is traditionally taken too early in the design trajectory. It has been shown that a preliminary global reorganization step reduces both area and power costs of the system by orders of magnitude.
Although the main goal of our optimizations is area and power cost reduction, the performance is also positively affected. This is because on-chip storage is made feasible for many signals, and on-chip accesses are much faster than off-chip accesses. Moreover, our optimizations do not impose a detailed instruction schedule, so a lot of freedom is still left for the subsequent design stages.
