Introduction
In video applications large amounts of data have to be handled in real time. This usually results in high power consumption, both in data transfers over communication channels and in data storage in large background memories. It is therefore important to reduce the power consumption and the required memory storage as much as possible. Our power exploration methodology is based on the observation that in this type of data-dominated application, the system power consumption is dominated by the power consumed in the transfers and storage related to the main memory organisation [20]. The first stage in our power exploration methodology is therefore to derive an optimized memory architecture.
The derivation of an optimal memory architecture is done in a number of steps. The first step is the optimization of the control-flow to increase the regularity and locality in the algorithm. The next step is to decide on the memory hierarchy, to allocate the memories and to assign every signal to one of the allocated memories. Finally, there is an in-place mapping step that minimizes the size of each memory by calculating a storage scheme in which data that is no longer alive is overwritten as much as possible.
These basic steps have been described by us before for an area-oriented system exploration (see [14] and its references). The systematic approach for the combination of power and area is however new. This script and its effects will be discussed further in section 5.
All of these steps are included in our High-level Memory Management methodology, which is partly supported in our ATOMIUM environment [14]. In this paper we will illustrate this methodology on a 2D motion estimation kernel.
In the experiments it has been assumed that the application works with frames of W x H pixels, processed at F frames/s. For example, in the 2D motion estimation kernel for the QCIF standard, which we use as a test-vehicle to illustrate the general methodology, this means 176 x 144 pixel frames in a video sequence of 30 frames/s. This results in an incoming pixel frequency of about 0.76 MHz.
This paper is organized as follows. Section 2 describes the related work. Section 3 introduces the test-vehicle on which the methodology is illustrated. Section 4 describes the power models we have used for the power estimation. Section 5 explains and illustrates the different steps in the power exploration experiments. Section 6 summarizes the conclusions of the paper.
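The incoming pixel frequency quoted for QCIF follows directly from the frame parameters. The short sketch below reproduces that arithmetic; the variable names are ours and serve only as illustration.

```python
# Incoming pixel rate for W x H frames at F frames/s (QCIF values from the text).
W, H, F = 176, 144, 30

pixels_per_frame = W * H               # 25,344 pixels per frame
pixel_rate_hz = pixels_per_frame * F   # 760,320 pixels per second

print(f"{pixels_per_frame} pixels/frame, {pixel_rate_hz / 1e6:.2f} MHz")
```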
Related Work
Most designs which have already been published on motion estimation in related MPEG video coders [5, 13, 18] are based on a systolic array type approach because of the relatively large frame sizes involved, leading to a large computational requirement on the DCT. However, in the video conferencing case, this is not needed. An example of this is discussed in [4]. As a result, it will be shown that a power and area optimized architecture is not so parallel (even partly multiplexed). Hence, the multi-dimensional (M-D) signals should also be stored in a more centralized way and not fully distributed over a huge number of local registers. This storage organisation then becomes the bottle-neck¹.
As we have shown earlier, in principle much power can be gained by reducing the number of accesses to large frames or buffers [20]. Other groups have also made similar observations [12] for video applications. Up to now, however, no systematic approach has been published to target this important field. Indeed, most effort up to now has been spent either on data-path oriented work (e.g. [2]), on control-dominated logic or on programmable processors (see [17] for a good overview).
¹ Note that the transfer between the required frame memories and the systolic array is also quite power hungry and is usually not incorporated in the analysis in previous work.
ISLPED 1996 Monterey CA USA 0-7803-3571-8/96/$5.00 © 1996
Test-vehicle
The motion estimation algorithm [11] is used in moving image compression algorithms. It estimates the motion vector of small blocks in successive image frames. We will assume that the images are gray-scaled (in practice, for color images only the luminance is considered). The version we consider here is the kernel of what is commonly referred to as the "full-search full-pixel" implementation [9]. The algorithm is typically executed in 6 nested loops, not counting the implicit frame loop. The choice of the nesting for these loops is partially open, and there is quite a lot of room for parallelisation and (loop) reordering. The basic operation in the inner loop consists of an accumulation of pixel differences, while the basic operation two levels higher in the loop hierarchy consists of the calculation of the new minimum and its location.
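As a minimal sketch of the loop structure described above, the Python code below implements a full-search, full-pixel block matcher with six nested loops (two over block positions, two over the search window, two over the pixels of a block). It is not the kernel of figure 1: the function names, the displacement range [-m, m) and the treatment of out-of-frame reference pixels as zero are assumptions made for illustration.

```python
def full_search(new, old, bx, by, n=8, m=8):
    """Return (best_error, (dx, dy)) for the n x n block at (bx, by) in `new`.

    `new` and `old` are lists of rows of gray-scale pixel values.
    Candidate displacements cover [-m, m) in both directions; reference
    pixels outside `old` are read as 0 (an assumption of this sketch).
    """
    height, width = len(old), len(old[0])
    best_err, best_vec = None, None
    for dy in range(-m, m):                  # window loops
        for dx in range(-m, m):
            err = 0                          # accumulated pixel differences
            for y in range(n):               # block loops
                for x in range(n):
                    oy, ox = by + y + dy, bx + x + dx
                    inside = 0 <= oy < height and 0 <= ox < width
                    ref = old[oy][ox] if inside else 0
                    err += abs(new[by + y][bx + x] - ref)
            # Two loop levels above the accumulation: update the minimum.
            if best_err is None or err < best_err:
                best_err, best_vec = err, (dx, dy)
    return best_err, best_vec


def motion_vectors(new, old, n=8, m=8):
    """Map every block origin (bx, by) to its estimated motion vector."""
    height, width = len(new), len(new[0])
    return {(bx, by): full_search(new, old, bx, by, n, m)[1]
            for by in range(0, height - n + 1, n)    # block-row loop
            for bx in range(0, width - n + 1, n)}    # block-column loop
```

Calling `motion_vectors(new, old)` on two QCIF frames would visit 22 x 18 = 396 blocks per frame.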
Power Models
The libraries used in the power models have been uniformly adapted for a 0.7 µm CMOS technology, operating at 5 V. If some figures were not available in that specific technology, they were scaled with experimental weights. Note that if a lower supply voltage can be allowed by the process technology, the appropriate scaling has to be taken into account. It will however be (realistically) assumed that Vdd is fixed in advance within the process constraints, and that it cannot be lowered any further by architectural considerations.
For the data-paths and address generation units (which were realized as custom data-paths), a standard cell technology was assumed where the cells were NOT adapted to low power operation. As a result, the power figures for these data-paths are very high compared to a macro-cell design with power-optimized custom cells. The power estimation itself however has been accurately done with the PowerMill tool of EPIC, based on gate-level circuits which have been obtained from behavioural specifications using IMEC's Clash!Dolphin custom data-path synthesis environment followed by the Synopsys RT-synthesis Design Compiler.
The resulting VHDL standard cell net-list was supplied with reasonable input stimuli to measure average power.
For the memories two power models were used: [...] and not the maximum frequency Fcl at which the RAM can be accessed. The maximal rate is only needed to determine whether enough bandwidth is available for the investigated array signal access. This maximal frequency will be assumed to be 100 MHz⁴. It should be stressed, however, that the results further on will show that in practice (after optimisation) a much lower access frequency is required, so this maximum is never reached and hence a single-port memory would suffice for all the allocated frame memories. If the background memory is not accessed, it will be in power-down mode⁵.
² Currently, vendors do not supply much open information, so there are no better power consumption models available to us for off-chip memories.
³ Values ranging from 1.6 to 2.3 were experimentally found for different types and parameters.
⁴ Most commercial RAMs have a maximal operating frequency between 50 and 100 MHz.
⁵ This statement is true for any modern low-power RAM [6].
The real access frequency Freal is the number of read or write accesses per frame multiplied by the maximal number of frames per second (which is 30 frames/s for many video conferencing applications). This is a very important consideration, because it means that the maximal clock frequency is not that crucial in memory related power optimizations.
A similar reasoning applies to the data-paths, if we carefully investigate the power formula. Here too, the maximal clock frequency is not needed in most cases. Instead, the actual number of activations Freal should be applied, in contrast with the common belief, which is based on an oversimplification of the power model. During the cycles for which the data-path is idle, all power consumption can then be easily avoided by any power-down strategy. A simple way to achieve this is the cheap gated-clock approach, for which several realizations exist (see e.g. [19]). In order to obtain a good power estimate, it is crucial however to obtain a good estimate of the average energy per activation by taking into account the accurately modeled weights between the occurrence of the different modes on the components. For instance, when a data-path can operate in two different modes, the relative occurrence and the order in which these modes are applied should be taken into account, especially to incorporate correlation effects. Once this is done, the maximal frequency Fcl is again only needed afterwards to compute the minimal number of parallel data-paths of a certain type (given that Vdd is fixed initially).
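The reasoning above can be sketched as a small calculation: average power is the average energy per activation multiplied by the real activation frequency Freal, with the energy weighted by how often each mode occurs. The sketch below is a simplification under stated assumptions: the function names, the mode names and all energy values are hypothetical, and the order-dependent correlation effects mentioned in the text are not modeled.

```python
def average_energy_per_activation(mode_energy_nj, mode_share):
    """Weighted mean energy per activation (in nJ) over the operating modes.

    mode_energy_nj: energy per activation for each mode (hypothetical values).
    mode_share: relative occurrence of each mode; the shares must sum to 1.
    """
    if abs(sum(mode_share.values()) - 1.0) > 1e-9:
        raise ValueError("mode shares must sum to 1")
    return sum(mode_energy_nj[mode] * share for mode, share in mode_share.items())


def average_power_mw(energy_nj, activations_per_frame, frames_per_s):
    """Average power in mW, based on Freal rather than the maximal clock rate."""
    f_real_hz = activations_per_frame * frames_per_s
    return energy_nj * 1e-9 * f_real_hz * 1e3


# Hypothetical example: two modes, one million activations per frame, 30 frames/s.
energy = average_energy_per_activation(
    {"accumulate": 0.4, "compare": 0.8},
    {"accumulate": 0.75, "compare": 0.25},
)
power = average_power_mw(energy, 1_000_000, 30)
```

The same two-step form applies to a memory if the energy per access is substituted for the energy per activation.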
Power Experiments
For the combined power and area exploration approach, we consider a target architecture as in Fig. 2. Depending on the parameters, a number of parallel data-paths are needed. In particular, for the 2D motion estimation this is 2m x 2m x W x H x F / Fcl processors for a given clock rate Fcl. However, this number is not really important for us because we consider an architecture in which the parallel data-paths with their local buffers are combined into one large data-path which communicates with the distributed frame memory. This is only allowed if the parallelism is not too large (as is the case for the motion estimator for the QCIF format). Otherwise, more systolic organisations, with memory architectures tuned to that approach, would lead to better results. In practice, we will assume that a maximal Fcl of 48.66 MHz is feasible for the on-chip components, which means that we need 4 parallel data-path processors.
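The relation between the clock rate and the number of parallel data-paths can be checked numerically. The sketch below assumes m = 8, which matches the 16 x 16 search positions and the motion vector span of 8 used elsewhere in the paper; the function name is ours. It shows that 4 data-paths require a clock of 48.66048 MHz, which the text rounds to 48.66 MHz.

```python
W, H, F = 176, 144, 30     # QCIF frame size and frame rate
m = 8                      # assumed: 2m x 2m = 16 x 16 candidate positions

# Pixel-difference operations per second over all blocks of a frame.
ops_per_s = (2 * m) * (2 * m) * W * H * F      # 194,641,920


def datapaths_needed(f_cl_hz):
    """Minimal number of parallel data-paths for an integer clock rate in Hz."""
    return (ops_per_s + f_cl_hz - 1) // f_cl_hz    # integer ceiling division


f_cl_for_4_hz = ops_per_s // 4                 # 48,660,480 Hz
```

Because the quoted 48.66 MHz is a rounded value, feeding exactly 48,660,000 Hz into `datapaths_needed` would report 5; the unrounded 48,660,480 Hz gives 4.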
[Figure 2: Target architecture (label: Single Virtual Data path).]
We will now discuss a power optimized architecture exploration for the motion estimation, as an illustration of the more general methodology.
Memory Organisation
For background memories, experiments have been performed to go from a non-optimized applicative description of the kernel in figure 1 to an optimized one for power, tuned to an optimized allocation and internal storage organisation. In the latter case, the accesses are heavily reduced. These accesses take up the majority of the power as we will see later.
Control-flow optimization. The first optimisation step in our methodology [14] is related to data-flow and loop transformations. For the 2D motion estimation, we will focus on the effect of loop transformations. It is clear that reordering of the loops in the kernel will affect the order of accesses and hence the regularity and locality of the frame accesses. In order to improve this, it is vital to group related accesses in the same loop scope. This means that all important accesses have to be collected in one inner loop in the 2D motion estimation. The latter is usually done if one starts from a C specification for one mode of the motion estimation, but it is usually not the case if several modes are present. Indeed, most descriptions will then partition the quite distinct functionality over different functions which are not easily combined. Here is a first option to improve the access locality, by reorganizing the loop nest order and function hierarchy amongst the different modes.
Another important class of loop transformations is related to reversing and interchanging the loop iterators in one loop nest. For instance, the 4 loops corresponding to the window and block traversal in figure 1 can be ordered either with the window based ones as the outer loops or with the block based ones as the outer loops. In this relatively simple case, a straightforward analysis of the required signal storage and the related number of transfers shows that we have a trade-off. If the traversal over the block is put in the outer loops, the advantage is that each pixel in the block can be used directly to compute all related contributions for all block locations in the window. This avoids a large amount of redundant frame accesses. However, we then need to store the resulting intermediate accumulation for the motion error for the different locations. This buffer will be quite large (16 x 16 words) and hence this is not a good option. The best alternative is to put the block traversal in the inner loops, surrounded by the window loops. In that case, the motion error can be directly accumulated in foreground registers, eliminating the costly background (buffer) access.
Such experiments on loop transformations are supported by our interactive loop transformation environment SynGuide in ATOMIUM. It removes the tedious and error-prone steps in the transformation, while the designer can still fully control the desired manipulations.
Memory architecture decision.
In a second step, we have to decide on the memory hierarchy, allocation and signal-to-memory assignment. Here, the search space for possible memory configurations meeting the cycle budget is quite large. Important considerations here are the frequency of access and the size of each resulting memory. Obviously, the most frequently accessed memories should be the smallest. This can however only be fully optimized if we introduce extra memory hierarchy. The steering for this is driven by estimates on bandwidth and high-level in-place cost. Based on this, the background transfers are partitioned over several hierarchical memory levels (within a range of 1 to a maximal memory depth MaxDepth), to reduce the power and/or area cost. A simple illustration of this is shown in Figure 3, but more than 3 layers may also be present. An important task at this step is to perform transformations which introduce extra transfers between the different memory levels and which mainly reduce the power cost. In particular, these involve adding temporary values (to be assigned to a "lower level") wherever a signal in a "higher" level is read more than once. This clearly involves a trade-off between the power lost by adding these transfers and the power gained by having less frequent access to the larger memories in the higher layer.
Based on these considerations, an optimized memory organization has been obtained for the frame memories and the different data-path processors for 2D motion estimation. In many applications, this memory organization can be assumed to be identical for each of the parallel processors (data-paths) because the parallelism is usually created by "unrolling" one or more of the loops and letting them operate on different parts of the image data. In order to obtain a good overall memory organization, the number of processors required should however also be relatively low. Otherwise inter-processor memory sharing and optimization has to take place, which is not currently supported in ATOMIUM. For the QCIF standard, the number of processors is relatively low when a reasonable clock rate is assumed. For larger search neighbourhoods or image frames this is however not true. For larger parameters, it will probably be better to go to a systolic array type solution [3, 8, 9, 15], even though much power is then spent on letting all the data "flow" through the array and on accessing the still required frame memories.
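The trade-off described above can be illustrated with a simple energy count for one signal value that is read several times. The sketch compares reading it every time from the large memory with copying it once into a smaller buffer and reading it from there. The function names and all energy values are hypothetical, and the model ignores static power and addressing cost.

```python
def energy_direct(reads, e_large_read):
    """All reads go to the large (higher level) memory."""
    return reads * e_large_read


def energy_buffered(reads, e_large_read, e_small_write, e_small_read):
    """One read from the large memory, one write into the small buffer,
    then every read is served by the small (lower level) buffer."""
    return e_large_read + e_small_write + reads * e_small_read


def buffering_pays_off(reads, e_large_read, e_small_write, e_small_read):
    """True if the extra transfer costs less than the accesses it saves."""
    buffered = energy_buffered(reads, e_large_read, e_small_write, e_small_read)
    return buffered < energy_direct(reads, e_large_read)


# Hypothetical energies in arbitrary units: large read 10, small write 1, small read 1.
single_read = buffering_pays_off(1, 10, 1, 1)    # copy costs more than it saves
double_read = buffering_pays_off(2, 10, 1, 1)    # copy already pays off
```

With these illustrative numbers the copy is a loss for a value read once and a gain as soon as the value is read twice, which mirrors the "read more than once" condition in the text.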
Figure 4: Data routing for the straightforward memory architecture. The formulas at the arrows indicate the number of words that are read (Rd) from the memories or written (Wr) to the memories per frame.
The original architecture for the optimized loop order, if we assume that only 1 layer of background memories is present, is shown in Fig. 4. Note that the required access rate of about 396 blocks x 8 x 8 x 16 x 16 x 30 frames/s = 195 MHz in this case is too high for the available frame memories, so two of them should be accessed in parallel (for both frames, so 4 in total). Each of these can then however be half the size. The memory access power budget related to this is 4 x 0.26 = 1.04 W.
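The access rate and memory count for this single-layer architecture can be reproduced as follows. The 100 MHz limit and the 0.26 W per memory are taken from the text; the variable names and the block and search parameters n = m = 8 are assumptions consistent with the quoted product.

```python
W, H, F = 176, 144, 30
n, m = 8, 8                          # block size and motion vector span

blocks = (W // n) * (H // n)         # 22 x 18 = 396 blocks per frame
access_rate_hz = blocks * n * n * (2 * m) * (2 * m) * F    # 194,641,920

F_MAX_HZ = 100_000_000               # assumed maximal RAM access frequency
memories_per_frame = -(-access_rate_hz // F_MAX_HZ)        # ceiling division
total_memories = 2 * memories_per_frame                    # new and old frame

POWER_PER_MEMORY_W = 0.26            # figure quoted in the text
power_w = total_memories * POWER_PER_MEMORY_W

print(f"{access_rate_hz / 1e6:.0f} MHz, {total_memories} memories, {power_w:.2f} W")
```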
After introduction of 1 extra layer of buffers, both for the block and the window accesses, we arrive at Fig. 5. A direct implementation leads to a switched frame memory of 2 x H x W x 8 bit, a neighbourhood buffer of (2m + n - 1) x (2m + n - 1) x 8 bit and a block buffer of n x n x 8 bit.
The power budget then becomes 580 mW. In principle, the buffers need to have two ports because they have to supply data every cycle to the data-path and a second port is needed for writing the updates. As the latter are however performed at a much lower rate and as the two-port memory is very area and power hungry, it is better to increase the cycle budget per data-path a little to use 1-port memories instead. The best way to achieve this is by providing a slightly larger maximal clock frequency, i.e. 48.86 MHz instead of 48.66 MHz. A more costly alternative would be to add 1 more parallel data-path. This further optimization leads to a reduced power budget of 300 mW.
We can do even better, however, if we realize that a large number of the window pixels can be reused from the "previous" block processing. By exploiting this overlap, we can reduce the number of write transfers per block for the window buffer to (2m + n - 1) x n x 8 bit, with a corresponding reduction also in read accesses to the old frame memory. The result is shown in Fig. 6. Because of the operation on an 8 x 8 block basis and assuming the maximal span of the motion vectors to be 8, the overhead in terms of extra rows in the combined frame buffer is then only 8 + 8 = 16 lines, by using a careful in-place compaction. This leads to a common frame memory of about (H + 16) x W x 8 bit, in addition to the already minimal neighbourhood buffer of (2m + n - 1) x n x 8 bit and a block buffer of n x n x 8 bit. For the parameters used in the example, the frame memory becomes about 0.225 Mbit. The result of the optimized memory architecture is shown in Fig. 7.
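The sizes quoted above can be checked with a few lines of arithmetic. The sketch assumes n = m = 8 and 8-bit pixels, as in the QCIF example; the variable names are ours.

```python
W, H = 176, 144
n, m = 8, 8
BITS_PER_PIXEL = 8

side = 2 * m + n - 1                         # 23 pixels per window side
window_words_direct = side * side            # 529 words, direct implementation
block_words = n * n                          # 64 words in the block buffer

writes_per_block_no_reuse = side * side      # 529 words written per block
writes_per_block_with_reuse = side * n       # 184 words written per block

frame_memory_bits = (H + 16) * W * BITS_PER_PIXEL    # 225,280 bit

print(f"window writes per block: {writes_per_block_no_reuse} -> "
      f"{writes_per_block_with_reuse}, frame memory: {frame_memory_bits / 1e6:.3f} Mbit")
```

Exploiting the overlap therefore reduces the window buffer writes per block to about a third of the direct implementation (184 instead of 529 words).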
In practice, however, the window around the block position in the "old active" frame is already buffered in the window buffer, so the mostly unused line of blocks on the boundary between "new" and "active old" (indicated with hashed shading in Figure 7) can also be removed. This leads to an overhead of only 8 lines in the combined new/old frame (the maximal span of the motion vectors), namely 1408 words, with a total of 26752 instead of 2 x 25344 = 50688 words (47% storage reduction). A very small power penalty is paid for this large storage area saving because the access count is the same but the storage size accessed is a little higher. Because of the relatively small frame size in QCIF, we can now however consider putting this combined frame memory of 214 kbit on chip. If low-power embedded RAMs are used⁶, this will reduce the power budget further because the expensive off-chip communication is totally avoided.
Taking into account all the memory transfers to the frame memory, from the combined frame memory to the two buffers, and from these two buffers to the data-paths, leads to a total power budget for the memory architecture of about 260 mW using the memory power models discussed in section 4. A breakdown of this final power figure over the different buffers is shown in Fig. 8. This corresponds to a substantial saving compared to the initial situation of Fig. 4 (factor 4).
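The storage figures for the combined new/old frame memory follow from the frame size and the 8-line overhead. A short check, with variable names of our own choosing:

```python
W, H = 176, 144
SPAN = 8                       # maximal span of the motion vectors, in lines
BITS_PER_WORD = 8

frame_words = W * H                            # 25,344 words per frame
overhead_words = SPAN * W                      # 1,408 words of overhead
combined_words = frame_words + overhead_words  # 26,752 words
separate_words = 2 * frame_words               # 50,688 words for two full frames

reduction = 1 - combined_words / separate_words
combined_bits = combined_words * BITS_PER_WORD # 214,016 bit

print(f"{combined_words} words, {reduction:.0%} reduction, "
      f"{combined_bits / 1000:.0f} kbit")
```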
Address Generators
In addition to the memory architecture optimization, we have also explored the address generation for the original architecture of Figure 4. The results will be discussed in another paper. The main conclusion is that the best solution is neither the fully parallel one (N ACUs) nor the fully sequential one (1 ACU) when a combined power-area trade-off is taken into account. Hence exploration supported by a design environment is definitely needed. Moreover, we have shown that the range of accumulated power is of the same magnitude as the data-path power, which is discussed next.
Data-path and Local Control
In order to compare the power consumed in the memory architecture and the address generators with the power consumed in the data-path, we have estimated the power of the data-paths as well. However, not very much optimization was done here.
In our experiments with the 2D motion estimation kernel we have used 4 parallel data-paths, corresponding to an Fcl of 48.66 MHz. A block diagram of one of the data-paths can be seen in Fig. 9. The widths of the blocks in the data-paths have been assumed to be 8 bit in the compare and absolute value operators and 12 bit in the accumulator. This block has been defined in VHDL and synthesized using Synopsys' Design Analyzer. The power estimation itself is done using PowerMill with random input vectors.
This has led to 22.4 mW per data-path, including some local control but excluding the connections and external buffering. The total power consumed by the unoptimized data-paths is therefore about 90 mW.
Conclusion
It can be concluded from these experiments that after optimization of the memories, data-paths and address generation units, the power which goes into the memory accesses (260 mW) dominates the other contributions, which are both comparable (less than 90 mW for all the data-paths and about 140 mW for the optimized address generators). This is true even when low-power circuits are used in the memories and when power hungry standard cells are used in the data-paths and address units. Moreover, the current figures do not yet include the power of the data transfers themselves which will also consume much power (especially the off-chip ones).
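The comparison above can be summarized numerically. The sketch below uses only figures quoted in the paper; the total and the percentage share are simple sums derived here for illustration and are not results reported by the authors, and the data-path figure is an upper bound.

```python
# Power figures quoted in the text, in mW.
budget_mw = {
    "memory accesses": 260,
    "data-paths": 90,            # upper bound: 4 x 22.4 mW = 89.6 mW
    "address generators": 140,
}

total_mw = sum(budget_mw.values())
memory_share = budget_mw["memory accesses"] / total_mw

INITIAL_MEMORY_MW = 1040         # single-layer architecture of Fig. 4
saving_factor = INITIAL_MEMORY_MW / budget_mw["memory accesses"]

print(f"memory share: {memory_share:.0%}, saving factor: {saving_factor:.1f}")
```

Even after a fourfold reduction, the memory accesses still account for roughly half of the summed budget in this sketch.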
These transfers are also within the focus of ATOMIUM and they are equally optimized by reducing the background accesses.
