Abstract-In this paper, we propose optimization techniques, at both the algorithmic and the implementation level, to achieve a low-power design of a transcoder's Motion Compensation module operating in the DCT domain on an MPEG-4 compressed video stream. At the algorithmic level, low power is achieved by employing the 3-2-1 partial information scheme together with a reduction of the bit precision of the constant transform matrix to two bits after the binary point. The simulation results show that the reduction of bit precision induces virtually no loss in PSNR except in the most significant digits of the DCT-CM constants. At the implementation level, the proposed design outperforms the conventional scheme with reductions of 12.26%, 9.9% and 12.13% in power, time, and area, respectively.
I. INTRODUCTION
Many existing networks, such as POTS, ISDN, DSL over PSTN, ATM over SONET or SDH, and 3G wireless, are interconnected, resulting in a heterogeneous network environment. ATM over SONET via optical fiber, for example, allows data and/or voice transfers at 51.840 Mbit/s and higher. When data is transmitted from a SONET network to a PSTN network, dynamic bit-rate adaptation (reduction) is required at the gateways because the transmission medium has a lower capacity than the bitstream requires [1].
One solution is to use the scalable coding feature of the MPEG-2 standard, in which the video is coded as two or more layers: a base layer and one or more enhancement layers. The base layer stores the most critical information, while the enhancement layers supplement the video quality and data. The enhancement layers may be skipped (dropped) when the network environment changes. This approach is nonetheless not flexible enough to provide finer scaling, since scalability in MPEG-2 offers only a limited number of possible transmission bit rates.
A more robust scheme is to implement a transcoder module that dynamically adjusts the bit rate of the coded video bitstream to the desired transmission rate. The simplest transcoder architecture is an open-loop structure in which the encoded bitstream is first inversely quantized by means of the finer quantizer Q1 and then heavily compressed by the coarser quantizer Q2 (Q2 ≥ Q1) to achieve the low target bit rate. While any transcoder scheme introduces more error because of the coarser quantization, this scheme additionally suffers from drift error. Drift error arises because the encoded DCT coefficients represent a residue signal that was based on an estimate of the current frame formed from previously sent frames. If those previously sent frames are requantized, the residue should be recalculated. If it is not recalculated, drift error results [2] [3]. To recalculate the residue (and thus alleviate drift error), motion compensation must be performed. This paper proposes to perform that compensation in the DCT domain with a low-power design. Various motion compensation techniques are described in [2] [3] [4] [5] [6].
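The requantization step of the open-loop structure can be illustrated with a minimal sketch. It assumes a plain uniform quantizer (round half away from zero) rather than the H.263 quantizer used later in the paper, and the function names are hypothetical. The returned value is the requantization error that would have to be motion compensated to avoid drift.

```python
def quantize(coeff, step):
    """Uniform quantizer (assumed): map a coefficient to an integer level."""
    sign = -1 if coeff < 0 else 1
    return sign * int(abs(coeff) / step + 0.5)


def dequantize(level, step):
    """Reconstruct a coefficient from its quantization level."""
    return level * step


def requantization_error(level_q1, q1, q2):
    """Error introduced when a level coded with step q1 is requantized with step q2."""
    recon_q1 = dequantize(level_q1, q1)   # what the original decoder would see
    level_q2 = quantize(recon_q1, q2)     # coarser level sent by the transcoder
    recon_q2 = dequantize(level_q2, q2)   # what the final decoder actually sees
    return recon_q1 - recon_q2


# A coefficient already quantized to 0 stays 0, so its error is 0.
print(requantization_error(0, 4, 10))
# A non-zero level generally picks up an error under the coarser step.
print(requantization_error(3, 4, 10))
```

In this simplified model a level of 0 always yields an error of 0, which is the property the partial information scheme in Section II relies on.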
II. ALGORITHMIC OPTIMIZATION FOR MC-DCT MODULE
Chang et al. [6] proposed that Motion Compensation (MC) can be performed in the DCT domain by matrix multiplication with an appropriate 8 x 8 prematrix $p_1$ and postmatrix $p_2$, eliminating the conversion back and forth between the DCT and spatial domains. In this work, unlike the algorithm in [5], we approach the rate-conversion problem based on the DCTTTM algorithm in [4]. DCTTTM motion compensation emphasizes the contribution of each coefficient in the four DCT-Requantization Error Blocks (REBs) of the reference frame to the one estimated current DCT-REB, via the linear operation of matrix multiplication. Consider a DCT-REB called E. It is an 8 x 8 matrix of DCT Requantization Errors (DCT-RE's) whose elements are $e_{ij}$, for 0 ≤ i, j ≤ 7. It can be written as:
$$E = \sum_{i=0}^{7} \sum_{j=0}^{7} e_{ij}\, Q_{ij} \qquad (2.1)$$
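where $Q_{ij}$ is an 8 x 8 matrix whose only non-zero coefficient, '1', is at the position given by the indices ij; all other coefficients are '0'. By linearity, the IDCT of Equation 2.1 can be written as:
$$\mathrm{IDCT}(E) = \sum_{i=0}^{7} \sum_{j=0}^{7} e_{ij}\, \mathrm{IDCT}(Q_{ij}) \qquad (2.2)$$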
The DCT of the $Q_{ij}$ matrices can be pre-computed at design time. In transcoding, four 8 x 8 blocks of DCT-RE's in the reference frame contribute to the DCT-REB of the current motion-compensated block. Call these DCT-RE's $e_{ij}(n)$, where n indicates which of the four blocks the DCT-RE originated from and i and j indicate which DCT-RE within that block. Given a specific motion vector $MV_k$, the motion-compensated DCT-REB in the current frame can be written as a linear combination of the 256 DCT-RE's in the four straddled blocks of the reference frame. That is:
$$E_{MC} = \sum_{n=1}^{4} \sum_{i=0}^{7} \sum_{j=0}^{7} e_{ij}(n)\, \hat{Q}_{ij}(n) \qquad (2.3)$$
where $\hat{Q}_{ij}(n)$ is the DCT-Correction Matrix (DCT-CM).
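The linearity argument behind this decomposition can be checked numerically. The sketch below is illustrative only: it assumes an orthonormal 8 x 8 DCT-II, uses hypothetical helper names, and verifies just the block-level superposition (the inverse transform of a block equals the weighted sum of the inverse transforms of its unit matrices). It does not model the motion-vector-dependent shift that produces the DCT-CMs.

```python
import math

N = 8


def dct_matrix(n=N):
    """Orthonormal DCT-II matrix C, so that DCT2(X) = C X C^T (assumed convention)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * m + 1) * k * math.pi / (2 * n))
                     for m in range(n)])
    return rows


def matmul(a, b):
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]


def transpose(a):
    return [list(row) for row in zip(*a)]


def idct2(block, c):
    """2-D inverse DCT: X = C^T Y C."""
    return matmul(matmul(transpose(c), block), c)


def unit_block(i, j):
    """Matrix with a single '1' at position (i, j) and zeros elsewhere."""
    q = [[0.0] * N for _ in range(N)]
    q[i][j] = 1.0
    return q


def linearity_gap(errors):
    """Max difference between IDCT(E) and the sum of e_ij * IDCT(Q_ij)."""
    c = dct_matrix()
    block = [[0.0] * N for _ in range(N)]
    for (i, j), e in errors.items():
        block[i][j] = e
    direct = idct2(block, c)
    summed = [[0.0] * N for _ in range(N)]
    for (i, j), e in errors.items():
        basis = idct2(unit_block(i, j), c)  # could be pre-computed at design time
        for r in range(N):
            for s in range(N):
                summed[r][s] += e * basis[r][s]
    return max(abs(direct[r][s] - summed[r][s])
               for r in range(N) for s in range(N))


print(linearity_gap({(0, 0): 5.0, (0, 1): -3.0, (1, 0): 2.0}))
```

The printed gap should be at the level of floating-point rounding, which is what allows the per-coefficient matrices to be tabulated once and reused.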
In deriving $\hat{Q}_{ij}(n)$, three steps are followed. It should be noted that the $e_{ij}$'s are formed by the requantization of already quantized DCT coefficients. Many of these already quantized DCT coefficients are in fact 0, and if a DCT coefficient was quantized to 0 in the first quantization, its $e_{ij}$ is also 0. For example, if it is known that all of the DCT coefficients in the 7th row and all of the DCT coefficients in the 7th column are always (or almost always) 0, there is no need to calculate the corresponding terms in Equation 2.4. Thus we could replace the 256*64 composite matrix by a smaller 256*49 composite matrix, which reduces the computation.
Several DCT coefficient selection schemes were investigated. The 3-2-1 scheme assumes that the non-zero DCT coefficients are located only in the top-left corner of an 8 x 8 motion block: three DCT coefficients from the first row, two from the second and one from the third, counted from left to right. Compared with the other schemes, it offers a fair compromise between the conflicting interests of computational saving and degradation of video quality.
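The coefficient positions selected by the 3-2-1 scheme can be enumerated with a short sketch. The function name and the block numbering 1 to 4 are assumptions made for illustration; only the positions themselves follow from the description above.

```python
def scheme_321_positions():
    """(row, column) positions kept by the 3-2-1 scheme in an 8 x 8 block."""
    per_row = (3, 2, 1)  # three from row 0, two from row 1, one from row 2
    return [(i, j) for i, count in enumerate(per_row) for j in range(count)]


positions = scheme_321_positions()

# With four overlapped reference blocks, each contributing the same six
# positions, the number of inputs per motion-compensated block is 4 * 6.
inputs = [(n, i, j) for n in range(1, 5) for (i, j) in positions]

print(positions)
print(len(inputs))
```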
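Selecting the 3-2-1 scheme implies that computing one motion-compensated REB requires six RE's from each of the four overlapped REBs in the reference frame. Thus, a total of twenty-four RE's are used. Applying this scheme to Equation 2.3, $E_{MC}$ (now called $E_{MC\_321}$) becomes:
$$E_{MC\_321} = \sum_{n=1}^{4} \sum_{(i,j) \in S} e_{ij}(n)\, \hat{Q}_{ij}(n), \quad S = \{(0,0), (0,1), (0,2), (1,0), (1,1), (2,0)\} \qquad (2.5)$$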
Each RE has an associated $\hat{Q}$ matrix for the transformation. Therefore, instead of a composite matrix W of size 64 x 256 that takes all positions in the four REBs into account, W shrinks to 64 x 24. Moreover, the elements of $E_{MC}$ are added to existing DCT coefficients and then heavily quantized, so many of these elements have no effect. Further computational reduction can thus be achieved by applying the partial information scheme to the motion-compensated REB as well: among its 64 RE coefficients, only the six located in the top-left corner designated by the 3-2-1 scheme are assumed to be non-zero. The composite matrix W is thereby scaled down from the original 64 x 256 to 6 x 24, and the computation of one motion-compensated REB reduces to a simple matrix-vector multiplication of a 6 x 24 matrix by a 24 x 1 vector. Equation (2.5) then becomes Equation (2.6).

For hardware implementation, the precision in bits is a very important parameter, since each additional bit of precision costs extra resources. To further reduce computation, we consider heavily quantizing the constants in the DCT-CMs $\hat{Q}$. In the following, if there are n bits to the right of the binary point, we say that we are using n-bit precision; n-bit precision requires n+1 bits. For example, if 2-bit precision after the binary point (i.e., a step size of 0.25) is used to quantize the DCT-CM constants, $7 \le 2^{(1+2)} = 8$ distinct values result: 0, ±0.25, ±0.5, 0.75 and 1 (an examination of all DCT-CMs shows that -0.75 and -1 never appear). In other words, these quantized values can be encoded using only 3 bits.

Different bit precisions were simulated against the PSNR measure for five video test sets, including Table Tennis. Surprisingly, the reduction of bit precision for the DCT-CMs does not severely affect the PSNR measure for any of the five test sets. In fact, the only noticeable gain occurs in the most significant digits of the DCT-CM constants; further increases in bit precision do not yield any significant improvement in PSNR. These observations provide solid grounds for achieving considerable computational saving in hardware by heavily quantizing the DCT-CM constants to only their most significant digits.
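The quantization of a DCT-CM constant to n-bit precision can be sketched as follows. The rounding rule (round half away from zero) and the function name are assumptions, since the text does not state how the constants are rounded.

```python
import math


def quantize_constant(value, frac_bits=2):
    """Round a constant to frac_bits bits after the binary point (assumed rule)."""
    step = 1.0 / (1 << frac_bits)            # 0.25 for 2-bit precision
    magnitude = math.floor(abs(value) / step + 0.5) * step
    if magnitude == 0.0:
        return 0.0                            # avoid returning -0.0
    return math.copysign(magnitude, value)


# Example constants chosen only for illustration, not taken from the DCT-CMs.
for constant in (0.49, -0.3, 0.1, 0.9):
    print(constant, "->", quantize_constant(constant))
```

With a step size of 0.25 every result is a multiple of 0.25, which is what makes the shift-based multiplication in Section III possible.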
III. IMPLEMENTATION OPTIMIZATION FOR MC-DCT MODULE
The implementation can now be modeled mathematically by Equation (2.6). Alternatively, Equation (2.6) can be expressed in vector form as Equation (3.1), in which Z is an MC-DCT vector of size 6 x 1 corresponding to the six motion-compensated DCT-RE elements from the top-left corner of a single block designated by the 3-2-1 scheme. Essentially, Equation (3.1) is a matrix-vector multiplication. Since the multiplications in a given row do not depend on the results of any other row, the first step, the multiplications for all rows, can be carried out in parallel. The second step is to generate the resulting vector by summing the intermediate products previously computed for each row. In our case, the multiplication yields a vector of size 6 x 1. It should be noted that this vector is essentially a single DCT-RE block whose six top-left coefficients are non-zero and whose other coefficients are all zero.
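The two execution steps can be sketched in software as follows. This is a sequential illustration under assumed names (`row_products`, `mc_dct_multiply`); in hardware the six rows would be evaluated concurrently, which is possible because no row reads another row's result.

```python
def row_products(row, vector):
    """Step 1: element-wise products for one row (independent of other rows)."""
    return [w * x for w, x in zip(row, vector)]


def mc_dct_multiply(w, vector):
    """Step 2: sum each row's products to form the 6 x 1 result vector."""
    if any(len(row) != len(vector) for row in w):
        raise ValueError("each row of W must match the input vector length")
    return [sum(row_products(row, vector)) for row in w]


# Illustrative data only: a 6 x 24 matrix of quantized constants and a
# 24 x 1 vector of requantization errors.
w = [[0.25] * 24 for _ in range(6)]
x = [4] * 24
print(mc_dct_multiply(w, x))
```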
The hardware design of the MC-DCT module closely follows these two execution steps. Figure 3.1 shows the block diagram of such a design. The MC-DCT hardware design can be divided into three major units:
A. Encoded DCT-CM Storage Unit.
B. Logic Multiplication Unit.
C. Logic Addition Unit.
These three units are discussed in more detail in [7].
A. Encoded DCT-CM Storage Unit
The three primary units are designed to reduce power consumption. The width of the data bus for sub-module interconnection is also carefully chosen to save resources. By exhaustively going through all possible input values in the range from 0 to 2048 and extracting the largest value for each step size QP from 1 to 31 under the H.263 quantization scheme, the value of a DCT-RE is found to be upper-bounded by 76.
B. Logic Multiplication Unit
Multiplication is one of the most basic yet most power-demanding arithmetic operations. In this paper, the multiplier unit is tailored specifically to the nature of its operands, which either lie in a finite range or contain a substantial number of zero elements. This data-dependent design concept is applied to the multiplier unit by means of simple logical data routing, as opposed to conventional shift-and-add multiplication, and thus attains the goal of power saving.
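A minimal software sketch of such a data-dependent multiplier is given below. It assumes the constant operand is one of the seven 2-bit-precision levels from Section II, represented here by its integer numerator over 4 (a hypothetical encoding), and returns the product scaled by 4 so that only integer bypass, shift, negate and add operations are needed.

```python
def routed_multiply_x4(re_value, level_x4):
    """Return 4 * (re_value * level), where level = level_x4 / 4.

    level_x4 is assumed to be one of -2, -1, 0, 1, 2, 3, 4, i.e. the
    levels -0.5, -0.25, 0, 0.25, 0.5, 0.75 and 1.
    """
    if re_value == 0 or level_x4 == 0:
        return 0                              # zero bypass: no arithmetic at all
    magnitude = abs(level_x4)
    if magnitude == 1:                        # x 0.25: route the value through
        product = re_value
    elif magnitude == 2:                      # x 0.5: one shift
        product = re_value << 1
    elif magnitude == 3:                      # x 0.75: one shift and one add
        product = (re_value << 1) + re_value
    elif magnitude == 4:                      # x 1: shift by two
        product = re_value << 2
    else:
        raise ValueError("unsupported constant level")
    return -product if level_x4 < 0 else product


print(routed_multiply_x4(12, 3))   # 12 * 0.75, scaled by 4
print(routed_multiply_x4(-5, -2))  # -5 * -0.5, scaled by 4
```

Keeping the result scaled by 4 defers the binary-point alignment to a single place after accumulation, which is one possible design choice rather than the one documented for the actual hardware.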
Anticipating many zeros within a DCT-REB as well as in the DCT-CMs, we ran several simulations to quantify the percentage of zero coefficients. The results indicate that 67.16% of the DCT-CM constants are zero, while the percentages of zero DCT-RE coefficients in the intra frames and the inter frames of the Table Tennis video are 92.013% and 42.057%, respectively. Thus, the first power-saving optimization of the Logic Multiplier sub-module is zero-bypassing logic that detects zero coefficients in either operand.

The second optimization of the Logic Multiplier sub-module, as its name suggests, is a multiplication-free, logic-based multiplier. The incoming DCT-RE coefficients (multiplier) are multiplied by one of the seven DCT-CM elements (multiplicand) fed from the storage unit. The seven elements are the reconstruction levels 0, 0.25, 0.5, 0.75, -0.25, -0.5 and 1, and multiplication by such operands requires no more than three basic operations: shift, two's complement and add. No real operation is required except to "wire" the value properly to the output register.

C. Logic Addition Unit
For power efficiency, an effective strategy is to include as little hardware as possible. To reduce power, we opted for the carry-save methodology for arithmetic addition, together with a Wallace tree. The method delivers a very short critical path delay. The adder design is described in detail in [7].
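The carry-save idea can be illustrated with a small sketch. It chains 3:2 compressors sequentially instead of arranging them as a Wallace tree, so it shows only the principle (carries are propagated once, in the final addition) and not the structure of the adder in [7]. Function names are hypothetical and inputs are assumed to be non-negative integers.

```python
def carry_save_step(a, b, c):
    """3:2 compressor: reduce three addends to a (sum, carry) pair.

    The bitwise sum and the majority-based carry are computed without
    propagating any carry between bit positions.
    """
    partial_sum = a ^ b ^ c
    carry = ((a & b) | (a & c) | (b & c)) << 1
    return partial_sum, carry


def carry_save_sum(values):
    """Add non-negative integers with a single carry-propagating addition."""
    partial_sum, carry = 0, 0
    for value in values:
        partial_sum, carry = carry_save_step(partial_sum, carry, value)
    return partial_sum + carry                # the only carry-propagate add


print(carry_save_sum([3, 5, 7, 9]))
```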
D. Synthesis Result of MC-DCT Components
The VHDL code of the proposed MC-DCT module is synthesized using Synopsys in the Xilinx 5.2 source environment, with a Xilinx Virtex-II 3000 series device as the target. To obtain a comparative measurement, the MC-DCT sub-modules are also implemented with a conventional approach. The synthesis results indicate that the proposed MC-DCT module consumes 545.34 mW at 200 MHz.
IV. CONCLUSION
In this paper, a data-dependent low-power MC-DCT design is presented. Low power is achieved by performing optimization at both the algorithmic and the architectural level.
The computational complexity of the MC-DCT operation is reduced by using only partial DCT-RE information via the DCTTTM-based Motion Compensation algorithm. Employing the 3-2-1 partial information scheme for both the input and the output appears to offer a fair compromise between the sacrifice in quality and the saving in computation.
The reduction of the bit precision for the constant transform matrix DCT-CM does not severely degrade the PSNR measure. Simulations on the five video test sets show virtually no loss in PSNR. Considerable computational saving is achieved by heavily quantizing the DCT-CM constants to only their two most significant bits.
Low-power design of the MC-DCT module is achieved through optimization at both design time and run time. At design time, integrating the 3-2-1 partial information scheme and the 2-bit precision for the quantized DCT-CM constants into the DCTTTM-based algorithm for the MC-DCT operation yields a fair reduction in power consumption. At run time, data-dependent bypassing logic coupled with the customized logic module reduces the number of operations.
Compared with the standard conventional approach to implementing the MC-DCT module, the synthesized data-dependent design achieves reductions of 12.26%, 9.9% and 12.13% in power, time, and area, respectively.
