Abstract: We describe a power exploration methodology for data-dominated applications using an H.263 video decoding demonstrator application. The starting point for our exploration is a C specification of the video decoder, available in the public domain from Telenor Research. We have transformed the data transfer scheme in the specification and have optimised the distributed memory organisation. This results in a memory architecture with significantly reduced power consumption. For the worst-case mode using Predicted (P) frames, memory power consumption is reduced by a factor of 7 when compared to the reference design. For the worst-case mode using Predicted and Bidirectional (PB) frames, memory power consumption is reduced by a factor of 9. To achieve these results, we make use of our formalised high-level memory management methodology, partly supported in our ATOMIUM environment.
I. Introduction
The video coding algorithm of Draft Recommendation H.263 is based on motion-compensated hybrid predictive and transform coding, with improvements to fit bit rates of less than 64 kbit/s. It is a complex and relevant example of a data-dominated application. A hardware realisation of such a decoder has to be power efficient in order to reduce the size of the chip packages in which it is embedded, or of the battery if it is used in a mobile application. It is well known by now that any future complex chip realisation has to take power reduction into account [1]. Our previous research has clearly shown that the dominant power contribution in data-dominated designs lies in the data transfer and storage of multi-dimensional array signals and other complex data types [2], [3]. In this paper we exploit this feature to achieve large savings in the system power without having to worry about the detailed data-path, foreground registers, and controller architecture.
The main contributions of this paper are the evaluation of the applicability and effectiveness of our power-oriented methodology for data-dominated applications [4], [5], [3] (see Fig. 1), a study of the effect of the possible optimisations, and the application of the most promising alternatives in the correct sequence on the H.263 decoder algorithm. In addition, we have substantiated our earlier claims [2] that the cost of the background storage and related transfers is dominant during the system exploration. This will be shown in section VI by investigating the power in a representative data-path in H.263, including its corresponding local memories. In the rest of this paper, we have concentrated on the main storage (memory) and transfer related parts of the H.263 decoder architecture. This exploration has been done based on a power model described in section III. The final results for the different steps are illustrated in Fig. 12. A brief version of this paper has been published in [6].
This research was partly sponsored by a co-operation with Texas Instruments Incorporated.
L. Nachtergaele and F. Catthoor are with IMEC, Kapeldreef 75, B3001 Heverlee, Belgium.
F. Catthoor is also Professor at the Katholieke Universiteit, Leuven, Belgium.
B. Kapoor was a resident at IMEC from the Corporate R&D labs of Texas Instruments Incorporated, Dallas, Texas.
S. Janssens was a student from Erasmus Hogeschool and is now with IMEC.
D. Moolenaar was a student from Delft Univ. of Technology and is now with IMEC.
The numerous pointers and variables in the C code, which are used in the reference implementation, have been removed by rewriting the specification into a mixed applicative-procedural DFL description [7]. As a result, more indices and some extra signal copies and accesses are present in the code, but the dependencies are much more transparent. This allows for systematic identification of the sources for potential optimisation. Moreover, this step is essential in applying a number of data storage and transfer related analysis and exploration/optimisation techniques which are collected in our high-level memory management methodology/script, partly supported by the prototype ATOMIUM environment [4].
Our strategy to obtain area and power figures is based on selecting the worst-case parameters and modes in the H.263 specification. This is valid for computing the maximal power budget and for finding the component size, which affects mainly area, but not directly for the average power consumption. Still, we believe that the maximal power consumption gives a good view of the relative importance of the different components in the power budget and of the savings which can be obtained. A good view of the absolute average power consumption would require accurate statistics on the occurrence of the different cases. In the sequel, we will only give some relative indication of this.
The following major algorithmic transformations and memory organisation optimisations have been performed on the DFL specification, incorporating mainly the power budget related to the access to/from the frame memories:
1. First, the code was pruned to retain the operations relevant to the overall complexity of the description with respect to the number of cycles, area and power consumption. This boils down to keeping the relevant storage and accesses of the arrays storing the picture information explicit and hiding the details of arithmetic operations in function calls. As a result, the potential overhead of transfers and storage in the applicative writing style is removed when interpreted effectively.
2. Several data-flow transformations have been performed. The methodology for carrying out these transformations and their effects are described in [8]. One of the major transformations results in the removal of all the border accesses used in the H.263 decoder, as discussed in subsection V-A.
3. Advanced transformations on the global function hierarchy and loop nests have been performed. These transformations have a significant effect and will be partly discussed in subsection V-B. They are also essential to enable the application of the further exploration steps.
4. In order to further exploit locality of access and data reuse, extra memory hierarchy levels have been incorporated (see subsection V-C). For the P mode, this step has been especially effective in the "overlapped-block motion compensation" (OBMC) mode, which has the largest power consumption. We will only show the principle involved in this optimisation, as depicted in Fig. 6. For the B pictures and for the combination with IDCT computation, we will show more details.
5. Finally, we have performed actual memory allocation and in-place mapping to determine the detailed memory organisation for the frame memories and some of the smaller intermediate memories. This step will be discussed in subsection V-D. It has a large effect on the required area, which is reduced by almost a factor of 2, with only a very limited increase in the power budget.
Fig. 1. ATOMIUM script for storage/communication optimisation in the specification to be used for simulation and hardware/software synthesis.
II. H.263 video decoding
H.263 is a draft recommendation for video coding for narrow telecommunication channels at < 64 kbit/s [9]. The coding/decoding is a block-based algorithm that exploits spatial and temporal redundancy. Three standard video formats are used in conjunction with H.263, called QCIF, Sub-QCIF, and CIF. A QCIF picture has 144 × 176 pixels, represented by 9 × 11 macroblocks. Each macroblock has six blocks of 8 × 8 pixels. This is due to the (4:2:0) decimation of chrominance values. The picture that serves as the reference for prediction is called the P-picture. From the past P-picture, a future P-picture is predicted. This is called the forward P prediction. Interpolation between past and future P-pictures yields bidirectional B-pictures (see Fig. 2).
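The QCIF figures quoted above can be cross-checked with a few lines of C. This is a minimal sketch and not part of the Telenor code: the function names are illustrative assumptions, and the 16 × 16 luminance extent of a macroblock follows from its four 8 × 8 luminance blocks.

```c
/* Illustrative helpers; names are assumptions, not taken from the Telenor code. */

enum { MB_SIZE = 16, BLOCK_SIZE = 8, BLOCKS_PER_MB = 6 };

/* Number of 16x16 macroblocks covering a luminance picture. */
int macroblock_count(int width, int height)
{
    return (width / MB_SIZE) * (height / MB_SIZE);
}

/* Stored samples per picture with 4:2:0 chrominance decimation:
 * one full-size luminance plane plus two quarter-size chrominance planes. */
int samples_per_picture(int width, int height)
{
    return width * height + 2 * (width / 2) * (height / 2);
}

/* Samples carried by one macroblock: 4 luminance + 2 chrominance blocks. */
int samples_per_macroblock(void)
{
    return BLOCKS_PER_MB * BLOCK_SIZE * BLOCK_SIZE;
}
```

For QCIF (176 pixels wide, 144 pixels high), macroblock_count(176, 144) gives 99, i.e. 9 × 11, and both samples_per_picture(176, 144) and 99 × samples_per_macroblock() give 38016.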
A PB-frame consists of two pictures: a P-picture, which is predicted from the last decoded P-picture, and a B-picture, which is predicted from the last decoded P-picture and the P-picture currently being decoded. Parts of the B-picture may be bidirectionally predicted from the past and future P-pictures. For PB-frames, the coding mode intra (I) implies that the P-blocks are intra coded, and the B-blocks are inter coded with prediction as for an inter block.
A decoder can be in one of three modes: I, P, or PB. Two extensions are orthogonal to the P and the PB modes: the unrestricted motion vector extension allows motion vectors pointing outside the frame, whereas in overlapped block motion compensation (OBMC), 4 extra motion vectors are used to compensate motion. When we refer in this paper to the P or PB mode, we assume that both extensions are in use. Hence, the P and the PB mode refer to the two most energy-consuming modes.
III. Power model
For data-intensive applications, such as video decoding, data transfers dominate the power consumption. Therefore the primary design goal is to reduce memory transfers between large frame memories and data-paths. The cost of a data transfer is a function of the memory size, memory type, and the access frequency F_real. F_real is defined as the real number of accesses per second and not the clock frequency. When there is a clock tick and the memory is not accessed, it is assumed that the memory is in power-down mode. This assumption holds for most modern low-power RAMs [10]. The memory itself is characterised by the number of ports, words, bits, and the aspect ratio of the layout. We make use of an accurate but proprietary power model from Texas Instruments for the power exploration. In this paper, only the number of transfers per frame/picture (directly related to F_real) will be discussed.
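The relation between transfer counts and F_real can be made concrete with a small sketch. The Texas Instruments model itself is proprietary and is not reproduced here: the energy-per-access parameter below is a hypothetical placeholder, and the function names are assumptions introduced only for illustration.

```c
/* Accesses per second for one memory: transfers per picture times pictures
 * per second. Idle clock ticks do not count, matching the power-down
 * assumption stated in the text. */
double f_real(long transfers_per_picture, double pictures_per_second)
{
    return (double)transfers_per_picture * pictures_per_second;
}

/* Hypothetical first-order estimate: power = F_real * energy per access.
 * energy_per_access_joule stands in for the proprietary model, which depends
 * on memory size, type, ports and layout; no real value is implied. */
double memory_power_watt(double accesses_per_second,
                         double energy_per_access_joule)
{
    return accesses_per_second * energy_per_access_joule;
}
```

As an illustration only: if each of the 38016 samples of a 4:2:0 QCIF picture were transferred once per picture at 30 pictures per second, f_real(38016, 30.0) would amount to 1140480 accesses per second. Because the estimate is linear in F_real, comparing transfer counts is enough to rank design alternatives that use the same memory.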
IV. The reference design
To obtain an acceptable reference, we have counted the number of transfers to the arrays that hold the past P, future P, and B pictures in the Telenor C implementation [11]. These numbers depend on the mode of reconstruction. The flow of data using all extensions is depicted in Fig. 3 using thin lines. The order of computation of pictures P_{T-1}, Pext_{T-1}, Pnew_{T-1}, P_T, B1_T, B2_T, and B_T is shown in the figure. The dashed lines indicate that pictures Pnew_{T-1} and P_T are stored in array signal newframe, whereas the pictures B1_T, B2_T and B_T are stored in Bframe. The rectangles with a bold border are the final pictures after decoding. The thick line indicates that oldframe and newframe are interchanged after each decoded frame. In the C code, this is done by swapping the pointers to oldframe and newframe. This reflects that main memory is not being wasted in the C implementation, because the simulation speed is also affected by this. The corresponding abstract organisation for the continuous P mode is shown in Fig. 4.
Table I lists the worst-case and average number of transfers to the frame memories per picture. The worst-case numbers are obtained analytically and not by simulation. This means that whenever code is executed conditionally, the conditions are assumed to branch to the most energy-consuming option. For example, it is assumed that every macroblock is motion compensated. This is clearly a worst-case assumption. Mode 1 uses prediction with overlapped motion compensation and unrestricted motion vectors. In Mode 2, bidirectional prediction is also included, introducing the extra transfers to the Bframe [...] Table II together with the compression ratio.
The C code used as a reference is optimised to run as fast as possible on a given workstation. It is not optimised for efficient hardware implementation, but it is typical of the documentation that implementation groups start with.
In most cases, the algorithm and the data structures are mapped directly onto a block diagram, and each block is then optimised locally and implemented efficiently. This is why most video decoders have a large external memory (with high bandwidth) that holds 3 complete images. Also, the memory interface typically becomes a big component of the design.
When [...] [19], which also include memory for three pictures, we believe that the access numbers to these memories will be comparable if the bi-directional mode is considered. If the bi-directional mode is not used, the accesses will be comparable to the accesses corresponding to the P-picture.
We will now give a summary for each of the main optimisations listed in section I. They have been applied starting from the initial mixed applicative and procedural DFL description of the video decoder. The high-level memory management methodology/script, partly supported by the prototype ATOMIUM environment [4], has been applied here.
A. Removal of the border
In order to accommodate unrestricted motion vectors, a complete border consisting of 44 macroblocks is added to the oldframe. It is not just filled with zeroes but with real data copies in a non-trivial way [9]. To simplify the control flow in the original C code, these data are duplicated in the frame signals (cf. edgeframe in Fig. 3) prior to the actual image manipulations, resulting in storage and transfer overhead both for reading and writing. In fact, this requires an extra 16896 pixels to be stored. To reduce this overhead, the dependences on the border data can be checked by (manifest) conditions on the position of the pixels to be read. Now, instead of storing and accessing duplicate data, the original pixels are read at the boundary rows/columns of the image frame. These guarding conditions have to be implemented in the controller and will steer the data-path. Usually, some local buffering is also necessary. Several stages of optimisation are possible here, starting from a simple context-independent caching of the border data (which is apparently selected in most industrial designs) up to a heavily optimised context-dependent checking and reduced local buffering. All of these alternatives make the storage for the extra borders superfluous, but only the latter option allows all redundant picture accesses to be removed. If we assume that on average this reduction is about a factor of 8, we have an extra reduction in read accesses of about 16128. This is, however, data-dependent. In terms of power consumption, our detailed models show that we obtain a saving of between 24% and 27% by the combined effect of fewer transfers and a reduced frame size.
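The guarding conditions can be illustrated with a minimal sketch. It assumes that a reference outside the picture maps to the nearest pixel on the boundary row/column, which simplifies the "non-trivial" border filling mentioned above; the function names are hypothetical and do not come from the Telenor code or the DFL description.

```c
/* Clamp v to the inclusive range [lo, hi]. */
static int clamp_int(int v, int lo, int hi)
{
    return v < lo ? lo : (v > hi ? hi : v);
}

/* Guarded read (assumed edge replication): coordinates outside the picture
 * are redirected to the nearest boundary row/column, so no border copy has
 * to be stored or written. */
unsigned char read_guarded(const unsigned char *frame, int width, int height,
                           int x, int y)
{
    int cx = clamp_int(x, 0, width - 1);
    int cy = clamp_int(y, 0, height - 1);
    return frame[cy * width + cx];
}

/* Macroblocks in a one-macroblock-wide border around the picture. */
int border_macroblocks(int mb_cols, int mb_rows)
{
    return (mb_cols + 2) * (mb_rows + 2) - mb_cols * mb_rows;
}
```

For QCIF, border_macroblocks(11, 9) gives the 44 macroblocks quoted above, and 44 macroblocks of 6 × 64 samples each account for the 16896 extra pixels. In hardware, the two clamps correspond to comparisons in the controller rather than to extra memory.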
The gain in power comes at the price of an increased complexity of the code and a larger controller, though. Still, as the power consumed in the controller is quite small, the trade-off for power is clearly in favour of transforming all the border accesses. The resulting data flow without the border is depicted in Fig. 5.
B. Loop and function restructuring to combine backward and forward P and B predictions, and IDCT
In the Telenor C code, decoding a PB-frame starts with decoding the incoming bit stream and results in a P and a B macroblock containing differential errors in the frequency domain and motion information (Task 1 in Fig. 5). Next, the forward P and B predictions are performed based on the motion information (Tasks 2 and 3). This yields a forward predicted B and P block. These blocks are stored directly in pictures called B1_T and Pnew_T, respectively. Then, the decoded P macroblock is transformed to the spatial domain by means of an IDCT (Task 4). This P macroblock is added to the macroblock read back from picture Pnew_T, and the result is stored in picture P_T (Task 5). This picture, together with the macroblock stored in B1_T, is needed to do the backward B prediction (Task 6). The result is stored in picture B2_T. This picture is also corrected with differential errors, in the same way as the P-picture (Tasks 7 and 8). The figure also illustrates that, instead of just producing a P and a B picture once, the pictures are read and written several times in the original description. More precisely, since B1_T, B2_T and B_T are stored in Bframe, every pixel in it is written three times and read twice. Pnew_T and P_T are stored in newframe, hence this picture memory is written twice and read once. The reason for this was probably to simplify the algorithmic description effort for the system designers.
As an illustrative example, we will now explain how global loop transformations and complex restructuring of the hierarchy in the code create more locality of access, based on the pseudo code for Tasks 2 and 3 in Fig. 5. The recon_comp, recon_comp_new and recon_comp_obmc functions perform different kinds of motion compensation depending on the motion vectors. Moreover, they are not embedded in the same loop scopes. However, with complex code restructuring it is possible to combine them.
This class of optimisations is crucial because they enable further optimisation on the memory hierarchy, which is discussed hereafter in subsection V-C.
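The effect of merging loop scopes can be shown on a deliberately simplified model. In the sketch below each macroblock is reduced to a single integer and predict() is a stand-in for motion compensation; all names are hypothetical, and the real restructuring of the motion compensation functions is considerably more involved.

```c
enum { NUM_MB = 99 };  /* macroblocks in a QCIF picture */

/* Stand-in for motion-compensated prediction of one macroblock. */
static int predict(int old_mb)
{
    return old_mb / 2 + 1;
}

/* Original structure: two loop nests that communicate through a full
 * intermediate picture (pnew), which is written and later read back. */
void decode_separate(const int *oldframe, const int *residual,
                     int *pnew, int *newframe)
{
    for (int mb = 0; mb < NUM_MB; mb++)
        pnew[mb] = predict(oldframe[mb]);
    for (int mb = 0; mb < NUM_MB; mb++)
        newframe[mb] = pnew[mb] + residual[mb];
}

/* After merging: producer and consumer sit in the same iteration, so the
 * intermediate value can stay in a local variable. */
void decode_merged(const int *oldframe, const int *residual, int *newframe)
{
    for (int mb = 0; mb < NUM_MB; mb++) {
        int predicted = predict(oldframe[mb]);
        newframe[mb] = predicted + residual[mb];
    }
}
```

Both versions produce the same newframe, but only the merged loop allows the intermediate picture to be replaced by a macroblock-sized value, which is the precondition for the memory hierarchy step.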
C. Memory hierarchy related optimisations
This step involves data-flow transformations which introduce extra transfers between the different memory levels and which are used mainly to reduce the power cost. In particular, temporary values, to be assigned to a "lower" level, are added wherever a signal in a "higher" level is read more than once. The duplicate read is then performed on the lower level temporary signal. The same can happen in the other direction for writes. If a signal assigned to a higher level is composed of several contributions, it does not make sense to always update the final result in the higher level memory. Instead, it is usually better to perform the composition from the contributions consecutively (or in a close ordering) in a lower level (or several levels in more complex situations) and then transfer the final result directly to the higher level. The principle of this buffering process on the macroblock access is shown in Fig. 6. In this code, a buffer, called buffer, is created. The task Forward B&P prediction now reads several times from buffer instead of from picture P_{T-1}. This results in large power savings. Also, an extra block, called new_Pblock, is introduced. Therefore extra copies from this block to picture P_T are necessary. Since these extra copies are situated at a lower memory hierarchy level, the global power consumption due to the memory transfers will still be reduced.
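The buffering principle for repeated reads can be sketched by counting accesses per memory level. This is an illustrative model only: the reuse factor, the summation that stands in for the prediction, and all names are assumptions rather than values or code from the decoder.

```c
enum { BLOCK_PIXELS = 256 };  /* one 16x16 luminance macroblock */

typedef struct {
    long frame_reads;   /* accesses to the large picture memory */
    long buffer_reads;  /* accesses to the small local buffer */
    long sum;           /* stand-in for the prediction result */
} read_stats;

/* Every reuse of a pixel goes to the picture memory. */
read_stats predict_direct(const unsigned char *picture_block, int reuse)
{
    read_stats r = {0, 0, 0};
    for (int pass = 0; pass < reuse; pass++)
        for (int i = 0; i < BLOCK_PIXELS; i++) {
            r.sum += picture_block[i];
            r.frame_reads++;
        }
    return r;
}

/* The block is copied once to a lower-level buffer; reuses hit the buffer. */
read_stats predict_buffered(const unsigned char *picture_block, int reuse)
{
    unsigned char buffer[BLOCK_PIXELS];
    read_stats r = {0, 0, 0};
    for (int i = 0; i < BLOCK_PIXELS; i++) {
        buffer[i] = picture_block[i];
        r.frame_reads++;
    }
    for (int pass = 0; pass < reuse; pass++)
        for (int i = 0; i < BLOCK_PIXELS; i++) {
            r.sum += buffer[i];
            r.buffer_reads++;
        }
    return r;
}
```

With a reuse factor of 4, the direct version performs 1024 picture-memory reads, while the buffered version performs 256 picture-memory reads plus 1024 buffer reads. The total number of accesses goes up, but most of them move to the memory with the lower cost per transfer.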
We now apply this principle to the decoding of a PB-frame, as depicted in Fig. 5. Instead of storing the forward predicted macroblock in Pnew_T, the result is stored in a buffer called reconf_Pblock. This buffer is corrected with the differential errors that result from the IDCT, called IDCT_Pblock, to yield the final forward P prediction new_Pblock. This final block, together with the forward predicted B macroblock, called reconf_Bblock, and the motion information, is required for the backward reconstruction. Instead of reading from B1_T and P_T, the backward reconstruction is based on the buffers reconf_Bblock and new_Pblock. The result is stored in buffer reconb_Bblock instead of B2_T. As for the P macroblock, it is corrected with the differential errors in IDCT_Bblock to yield the final B prediction block new_Bblock. Extra transfers are introduced to transfer the final block to the picture B_T stored at the highest level. The resulting data flow, when decoding a PB-frame after introducing extra memory hierarchy, is shown in Fig. 7. The pictures with a bold border are to be stored at the "highest" level of hierarchy. This level corresponds to the memory with the biggest transfer cost. Other smaller buffers, such as buffer, Pblock, Bblock, reconf_Pblock, reconf_Bblock, new_Pblock, reconb_Bblock and new_Bblock, are stored at "lower" levels. In addition to this, many other similar optimisations have been performed for the different decoder modes (especially in the "overlapped-block motion compensation" mode).
D. In-place storage of past and future P-pictures
In Fig. 8 (Left), the light gray area covers the portion of oldframe that is still needed for reconstruction. In Fig. 8 (Middle), the gray area covers pixels that have already been calculated. Array signals oldframe and newframe can be stored in-place if the shaded area in Fig. 8 (Right) is stored in a buffer. Decoding the macroblock in row y and column x uses data that is stored in blocks with coordinates (y ± 1, x ± 1) in oldframe. The worst-case dependence, the one that needs most memory, corresponds to the dependence in the following pseudo code:
for (y=1; y <= 11; y++) {
  for (x=1; x <= 9; x++) {
    Read from block (y-1,x-1) from P_T
    ...
  }
}
The buffer mechanism can be implemented by calculating the block addresses modulo 13 [20]. This results in a snake-like operation of the buffer, as illustrated in Fig. 9. The resulting data flow is depicted in Fig. 10, where the 13 macroblocks are shown in the pipeline of the snake. Implementing this data flow, taking into account extra possibilities of memory hierarchy optimisations, leads to the detailed organisation depicted in Fig. 11.
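One possible reading of this in-place scheme can be checked with a small simulation. The sketch below is an assumption-laden model, not the addressing of [20]: it stores old and new picture in one circular array of 99 + extra macroblock slots, shifts the base address by extra slots per picture, reads the (y ± 1, x ± 1) neighbourhood before writing each new macroblock, and skips neighbours outside the picture. All names are hypothetical.

```c
#include <stdlib.h>

enum { ROWS = 9, COLS = 11, BLOCKS = ROWS * COLS };  /* QCIF macroblocks */

/* Counts reads of old macroblocks that were already overwritten by the new
 * picture. Returns -1 on invalid input or allocation failure. */
int count_violations(int extra, int frames)
{
    if (extra < 0 || frames < 1)
        return -1;
    int slots = BLOCKS + extra;
    int *tag = malloc((size_t)slots * sizeof *tag);
    if (!tag)
        return -1;

    int violations = 0;
    int base_old = 0;
    for (int s = 0; s < slots; s++)
        tag[s] = -1;                      /* empty slot */
    for (int i = 0; i < BLOCKS; i++)
        tag[i] = i;                       /* picture 0, block i */

    for (int f = 1; f <= frames; f++) {
        /* The new picture starts 'extra' slots before the old one. */
        int base_new = (base_old + slots - extra) % slots;
        for (int i = 0; i < BLOCKS; i++) {
            int y = i / COLS, x = i % COLS;
            for (int dy = -1; dy <= 1; dy++)
                for (int dx = -1; dx <= 1; dx++) {
                    int ny = y + dy, nx = x + dx;
                    if (ny < 0 || ny >= ROWS || nx < 0 || nx >= COLS)
                        continue;
                    int j = ny * COLS + nx;
                    if (tag[(base_old + j) % slots] != (f - 1) * BLOCKS + j)
                        violations++;
                }
            tag[(base_new + i) % slots] = f * BLOCKS + i;
        }
        base_old = base_new;
    }
    free(tag);
    return violations;
}
```

In this model, count_violations(13, 3) reports no overwritten dependences, whereas a shift of only 11 slots (one macroblock row) does, because block (y-1, x-1) of the old picture is destroyed before it is read. The model says nothing about pipelining, so it should not be taken as a proof that 13 is the minimum for the actual architecture.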
This in-place optimisation does not affect the number of background transfers but significantly reduces the total size of the background memories. This will result in a smaller area cost. The combined picture is only 13 macroblocks larger than one of the two pictures required initially.
The quantity shown in Fig. 12 is the power consumed by the picture memories when decoding bidirectional B frames with unrestricted motion vectors and overlapped motion compensation. The power consumption is normalised with respect to the power consumption for the reference design. The power figures are based on worst-case assumptions. The bar chart shows that when all optimisations discussed in this paper are applied, the power consumption is reduced by a factor of 9. Optimisations similar to those reported in this article have been applied to the H.263 decoder running in the P mode. They reduced the worst-case power consumption by a factor of 7. The main difference with the optimisations for the PB mode is the absence of optimisations related to bidirectional coding.
The optimisations described in this paper have been partially applied to the public domain C code from Telenor Research [11]. Simulation of the resulting C code, while decoding stream suz14.263 for all the different decoding modes, shows that the average power consumption of the memories is reduced to 57%. Note that in these simulations, the extra transfers due to an extra layer of hierarchy are taken into account.
VI. Power consumption of IDCT
A DFL specification of the IDCT algorithm [21] was simulated and verified using Mentor's DSP Station. This specification was synthesised using our data-path synthesis tool Dolphin [22]. Dolphin synthesis resulted in a VHDL netlist, which was mapped to the TI TGC2000 library using Synopsys' Design Analyzer and converted to a Verilog netlist. A net capacitance file for the design was generated using Synopsys' Design Analyzer tool. The Verilog netlist has been simulated for toggle counts using Cadence's Verilog-XL simulator. The average power consumption for the data-path was then computed using the net capacitance and the toggle count files. The computation of the power consumption of the memory unit in the IDCT uses the power modelling described in section III. Table III lists the average power consumption of the IDCT for the 3 video formats used in conjunction with H.263. The computation uses a frame rate of 30 frames per second to derive the smallest possible frequency of operation for the data-path and memory units.
This IDCT module is the most arithmetic-dominated module in the entire H.263 specification. Still, it has been shown that the power for a direct IDCT realisation with commercial logic synthesis and gate array circuits is about 2 orders of magnitude smaller than the power in the combined unoptimised frame accesses. Initially ignoring this arithmetic in the system exploration is therefore justified.
VII. Conclusion
We believe that the results described in this paper clearly substantiate the validity of the proposed high-level memory [...]
In the future, we will also explore the possibilities of these optimisations on a mixed software-hardware platform, as provided e.g. by the TI cDSP approach, which supports a single-chip heterogeneous design consisting of embedded cores, sea-of-gate logic and embedded memories.
