This paper proposes a practical content-dependent low-power DCT design with a tolerable quality drop. Low power consumption has become increasingly important, especially for portable devices. Unfortunately, low-power design usually brings a significant quality drop with it. This work achieves ultra-low power dissipation while keeping the quality drop tolerable (about 0.1 dB) in most cases. The proposed architecture is based on the distributed arithmetic architecture [1] combined with a more reliable PPA classification algorithm [2]. It accurately controls bit-level calculation and skips unnecessary calculation to save power. This property is particularly useful in video encoder systems, since coefficients after DCT and Q become zero with high probability, so the power spent on computing them can be saved without causing an undesired quality drop.
INTRODUCTION
Since its first appearance in [3], the discrete cosine transform (DCT) has become the most widely used transform coding technique for image and video coding algorithms. It has also been adopted by most image and video coding standards, including JPEG, MPEG-1, MPEG-2, MPEG-4, H.261, H.263, and H.264/AVC.
DCT is one of the dominant computation-intensive tasks in video encoder systems, second only to motion estimation (ME). For an encoder with full search ME, DCT, quantization (Q), and their inverses (IQ and IDCT) together account for 16% of the overall complexity. For an encoder with fast search ME, DCT/Q/IQ/IDCT account for a more significant 29% of the overall complexity. Since video encoder systems for mobile video applications generally adopt fast search ME, the share of DCT/Q/IQ/IDCT is considerable. Minimizing the power dissipation of DCT is therefore indispensable for a low-power video encoder system for mobile video applications.
In addition to power dissipation, quality is an essential factor for video encoder systems. Too large a quality drop makes a DCT implementation impractical, since accumulated error propagation can degrade the encoded quality dramatically.
Many low-power DCT architectures have been proposed in the literature. The adopted techniques fall into two categories. One is the lossless approach, such as coefficient scaling, gated registers, and architectures based on distributed arithmetic (DA) [1]. The other is the lossy approach based on a content-dependent algorithm, such as PPA classification [2]. Although the lossy approach lowers power dissipation very effectively, the resulting quality drop is so significant that it becomes impractical for video encoder systems.
The low-power DCT design proposed in this paper combines a content-dependent algorithm with all of the lossless techniques, so that it meets the low-power requirement while keeping the quality drop tolerable. This makes it practical and useful for building low-power video encoder systems for mobile video applications.
BACKGROUND
DA is a bit-serial operation that computes the inner product of two vectors (one of which is a constant) in parallel without any multiplication. It uses the ROM and accumulator (RAC) structure in place of the constant-coefficient multiplier, as shown in Fig. 2. The advantage of DA is its low power dissipation, but it may encounter the mismatch condition. For the 1-D 8-point DCT, if the bitwidth of the input data is larger than 8 bits, the mismatch condition of DA leads to precision loss under direct truncation.
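The ROM and accumulator operation of DA described in this paper can be illustrated with a minimal sketch. It assumes integer (fixed-point) coefficients and two's-complement inputs processed LSB first; the function names `build_rom` and `da_inner_product` are hypothetical and are not taken from [1].

```python
def build_rom(coeffs):
    """Precompute every partial sum of the constant coefficients (the DA ROM)."""
    n = len(coeffs)
    rom = []
    for address in range(1 << n):
        # Bit i of the address selects whether coefficient i joins the sum.
        rom.append(sum(c for i, c in enumerate(coeffs) if (address >> i) & 1))
    return rom


def da_inner_product(rom, inputs, bitwidth):
    """Bit-serial inner product: one ROM lookup and one accumulation per bit."""
    acc = 0
    for bit in range(bitwidth):
        # Gather the current bit of every input into a ROM address.
        address = 0
        for i, x in enumerate(inputs):
            address |= ((x >> bit) & 1) << i
        partial = rom[address]
        if bit == bitwidth - 1:
            acc -= partial << bit  # the sign bit carries a negative weight
        else:
            acc += partial << bit
    return acc


coeffs = [3, -5, 7, 2]    # illustrative constants, not DCT coefficients
inputs = [-4, 3, 0, -1]   # each value fits in 4-bit two's complement
result = da_inner_product(build_rom(coeffs), inputs, bitwidth=4)
```

The loop runs once per input bit, which is why reducing the number of bits fed to a RAC directly reduces its activity.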

LOSSLESS APPROACH
There are some techniques that can simplify the DCT computation without any quality drop.
Coefficient Scaling
The adopted flow graph scales the DCT coefficient matrix by a constant a and then applies all-level even/odd decomposition specifically to the DC and AC4 frequencies. After the scaling, the matrix-vector multiplications for the DC and AC4 frequencies involve only additions and subtractions. The all-level even/odd decomposition then yields the DC/AC4 butterfly shown in Fig. 3, so the DC and AC4 frequencies can be implemented with only four adders/subtractors instead of two RACs. This technique has two advantages. First, the mismatch condition is alleviated: since the DC frequency is implemented with bit-parallel adders, there is no mismatch condition for the DC frequency, which is originally the most critical frequency component. Second, the computational complexity is reduced: since the RAC of the DC frequency usually has more bits to compute, implementing the DC and AC4 frequencies with only four adders/subtractors reduces the required arithmetic operations.
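The following sketch shows why the scaled DC and AC4 outputs need only additions and subtractions. It assumes the orthonormal 8-point DCT-II and a scaling constant a = 2√2, which makes the DC and AC4 rows of the coefficient matrix equal to ±1; the value of a actually used in the design is not stated here, so this choice is an assumption for illustration only.

```python
import math


def dc_ac4_butterfly(x):
    """Scaled DC and AC4 of an 8-point DCT using adders/subtractors only."""
    # First-level even part of the even/odd decomposition.
    e0, e1, e2, e3 = x[0] + x[7], x[1] + x[6], x[2] + x[5], x[3] + x[4]
    # Second-level decomposition and outputs: four adders/subtractors.
    ee0 = e0 + e3
    ee1 = e1 + e2
    return ee0 + ee1, ee0 - ee1


def dct8(x, k):
    """Reference orthonormal 8-point DCT-II coefficient k (floating point)."""
    ck = 1 / math.sqrt(2) if k == 0 else 1.0
    return 0.5 * ck * sum(
        x[n] * math.cos((2 * n + 1) * k * math.pi / 16) for n in range(8)
    )


A = 2 * math.sqrt(2)  # assumed scaling constant
x = [10, -3, 7, 0, 5, 5, -8, 2]
dc, ac4 = dc_ac4_butterfly(x)
```

Under this assumption, `dc` and `ac4` agree with `A * dct8(x, 0)` and `A * dct8(x, 4)` up to floating-point rounding.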
Selective Gated Registers
Rather than using shift registers to shift the input data and the data after the butterfly structure as in [2], the proposed design adopts selective gated registers (SGR) for registers D0 to D7 as well as for the even registers E0 to E3 and odd registers O1 to O3 after the butterfly structure. In every clock cycle only one of the eight registers D0 to D7 is selected, and the others are powered down by clock gating. Compared with shift registers, selective gated registers for D0 to D7 can reduce the power dissipation to one eighth in the ideal case. In addition, the four even registers E0 to E3 and four odd registers O1 to O3 are activated for only one cycle out of every eight, and all eight registers are powered down by clock gating during the other seven cycles. This significantly lowers the power dissipation compared with the shift registers originally used for the data after the butterfly structure. A selective gated register array also replaces the transpose memory. Because of the parallel architecture, the transpose register array is active for only one cycle out of every eight, which avoids much unnecessary power dissipation.
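The one-eighth figure for registers D0 to D7 can be reproduced with a simple count of register clock events. This is only an idealized activity model under the assumption that power is proportional to the number of clocked registers; the function names are hypothetical, and the model ignores the overhead of the gating logic.

```python
def shift_register_clock_events(num_registers=8, cycles=8):
    """In a shift chain every register is clocked on every cycle."""
    return num_registers * cycles


def selective_gated_clock_events(samples):
    """Load one sample per cycle; only the selected register is clocked."""
    registers = [None] * len(samples)
    clocked = 0
    for cycle, sample in enumerate(samples):
        registers[cycle] = sample  # register `cycle` is enabled, others gated
        clocked += 1
    return registers, clocked


samples = [12, 7, -3, 0, 5, 9, -1, 4]
registers, sgr_events = selective_gated_clock_events(samples)
ratio = sgr_events / shift_register_clock_events(len(samples), len(samples))
```

For eight inputs the ratio is 8/64, which matches the ideal-case reduction to one eighth.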
CONTENT-DEPENDENT ALGORITHM
The content-dependent PPA (peak-to-peak pixel amplitude) algorithm for DCT was first proposed in [2]. This algorithm approximates the AC frequencies whose full computation is unnecessary, with good performance. Based on PPA, an advanced input classifier (AIC) is developed. Together with the dynamic effective bitwidth extraction (DEBE) technique, the proposed content-dependent algorithm further improves the performance of PPA and makes it more suitable for video encoder systems.
Advanced Input Classifier
The advanced input classifier predicts which output data will be zero or need only low precision, based on the input data, the macroblock (MB) mode, and QP. For a given MB mode, the classifier reduces the number of bits to be computed according to the signal content variation of the input data (using the PPA criterion) as well as the chosen QP. Four threshold classes are defined to decide the maximum bitwidth to be computed by each RAC, and there are two sets of these four threshold classes, one for intra MB mode and the other for inter MB mode. Rather than being a function of the input data only, the thresholding criterion of the advanced input classifier is a function of both the input data and QP. The threshold values are determined by exhaustive simulations, and the H.263 quantization method as defined in [4] is adopted. The threshold values of the second-stage 1-D 8x1 DCT are twice those of the first stage. To reduce the control overhead, the same threshold values are used for the RACs of the odd AC frequencies (RAC1/3/5/7).
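The classification step can be sketched as follows. The threshold values, the per-class bitwidths, the use of three boundaries to separate four classes, and the rule of comparing PPA against a threshold multiplied by QP are all hypothetical placeholders, since the actual values come from exhaustive simulation and are not listed; only the dependence on input data, MB mode, and QP, and the doubling for the second stage, follow the description above.

```python
# Hypothetical class boundaries per MB mode (NOT the values used in the design).
THRESHOLDS = {"intra": (2, 4, 8), "inter": (4, 8, 16)}
# Hypothetical maximum RAC bitwidth for each of the four classes.
CLASS_BITWIDTHS = (0, 2, 4, 8)


def ppa(data):
    """Peak-to-peak amplitude of the input vector."""
    return max(data) - min(data)


def max_bitwidth(data, qp, mb_mode, stage=1):
    """Return the maximum bitwidth a RAC should compute for this input."""
    scale = 2 if stage == 2 else 1  # second-stage thresholds are twice as large
    amplitude = ppa(data)
    for cls, threshold in enumerate(THRESHOLDS[mb_mode]):
        # Assumed criterion: a larger QP tolerates a larger amplitude.
        if amplitude < threshold * qp * scale:
            return CLASS_BITWIDTHS[cls]
    return CLASS_BITWIDTHS[-1]


flat = max_bitwidth([100] * 8, qp=12, mb_mode="inter")
busy = max_bitwidth([0, 255, 0, 255, 0, 255, 0, 255], qp=4, mb_mode="intra")
```

A flat block falls into the lowest class and a high-contrast block at low QP into the highest, which is the qualitative behaviour the classifier is meant to provide.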
Dynamic Effective Bitwidth Extraction
Rather than using direct truncation of the bitwidth for the mismatch condition as in [2], the dynamic effective bitwidth extraction technique, working together with the above three techniques, rejects the sign-extension bits of the data after the butterfly structure while also carefully handling the mismatch condition. The data after the butterfly structure are stored in the even registers E0 to E3 and odd registers O1 to O3 without truncation of bitwidth. These data are then processed by the dynamic effective bitwidth extraction using the information from the advanced input classifier, as shown in Fig. 3. The dynamic effective bitwidth extraction first identifies the effective bitwidth by rejecting the sign-extension bits and applying the bit reduction decided by the advanced input classifier. Since the most critical DC frequency is implemented with bit-parallel adders, the effective bitwidths for the other AC frequencies (AC1/2/3/5/6/7) are usually less than eight. The dynamic effective bitwidth extraction then dynamically extracts bits from the effective bitwidth range, one bit per cycle, to RAC1/3/5/7 and RAC2/6. Truncation of bitwidth can occur only when the effective bitwidth is larger than eight. This approach yields much more accurate output data than the direct truncation adopted in [2], and thus the quality drop is effectively reduced. After all bits in the effective bitwidth range have been extracted, clock gating is applied to power down the dynamically idle circuits. Since the same threshold values are used for the RACs of the odd frequencies (RAC1/3/5/7), only a single pair of control signals is needed for these four RACs.
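The rejection of sign-extension bits can be illustrated with a short sketch. It assumes two's-complement data and LSB-first extraction, and it models only the sign-extension part of the technique, not the additional reduction driven by the advanced input classifier or the truncation beyond eight bits; the function names are hypothetical.

```python
def signed_bitwidth(value):
    """Smallest two's-complement width that still represents `value`."""
    magnitude = value if value >= 0 else ~value
    return magnitude.bit_length() + 1


def extract_bit_planes(values):
    """Return the effective bitwidth and one bit plane per cycle (LSB first)."""
    width = max(signed_bitwidth(v) for v in values)
    planes = [[(v >> bit) & 1 for v in values] for bit in range(width)]
    return width, planes


def reconstruct(planes):
    """Rebuild the values from the planes; the last plane is the sign bit."""
    width = len(planes)
    values = [0] * len(planes[0])
    for bit, plane in enumerate(planes):
        weight = -(1 << bit) if bit == width - 1 else (1 << bit)
        for i, b in enumerate(plane):
            values[i] += b * weight
    return values


data = [5, -3, 0, 12]  # e.g. held in wider registers with sign extension
width, planes = extract_bit_planes(data)
```

Here only five bit planes are needed regardless of the register width, and `reconstruct(planes)` returns the original data, so dropping the sign-extension bits loses no precision.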
With the above four techniques, the three drawbacks of [2], namely the significant output quality drop caused by the mismatch condition, an input classifier that can only handle a pre-defined QP, and power-inefficient shift registers, are effectively overcome, yielding a more practical and efficient content-dependent DCT design.
SIMULATION RESULT
Fig. 4 shows the simulation result of quality drop. The test sequences are stefan, weather, mobile, and foreman. The targeted video encoder system is MPEG-4 simple profile (SP) with predictive four-step search ME, GOP = 30, and IPPP format. When QP is larger than 8, the quality drops are all within 0.1 dB compared with floating-point DCT. Fig. 5 shows the simulation result of computation cost. Because DC and AC4 are computed without RACs, this simulation only measures the number of bits that must be processed in the RACs of the other AC frequencies. The proposed algorithm reduces the computation by 50% on average. As Fig. 4 and Fig. 5 show, when QP becomes larger, the quality drop remains tolerable and the computation is reduced. The algorithm therefore saves unneeded power without sacrificing quality.
The proposed content-dependent low-power DCT design has been implemented with a front-end cell-based design flow and synthesized with the Artisan standard cell library for the UMC 0.18 µm 1P6M CMOS process. Table I shows the gate-level implementation result. The area is estimated in terms of synthesized gate count, and the power dissipation is estimated by Synopsys PrimePower gate-level power estimation in mW at 1.8 V and 33 MHz. Since the proposed DCT design adopts a content-dependent algorithm, different signal content variations of the input data result in different power dissipations. Therefore, two kinds of input data have been used for the power estimation. One is the worse case, in which the 8x8 blocks are of intra MB mode at QP = 4; the other is the normal case, in which the 8x8 blocks are of inter MB mode at QP = 12. As can be seen from Table I, the content-dependent DCT design dissipates less power in the normal case (63%) than in the worse case. For CIF at 30 fps, the proposed design only needs to operate at 4.56 MHz. As a result, static voltage/frequency scaling can be applied, and the power dissipation after scaling to 1.2 V and 4.56 MHz is estimated to be 910 µW in the worse case or 473 µW in the normal case.
Fig. 6 shows the power breakdown in the worse case and in the normal case. Fig. 7 shows a further power analysis of the 1-D DCT core. It clearly shows that the power of the RACs is dramatically reduced. In the normal case, only unavoidable contributions such as the transpose register array and the input/output registers dominate the DCT power dissipation. Table II lists the comparison with prior art. It shows that this work achieves a better balance between power and quality. With only a 0.1 dB quality drop, this work achieves the best power efficiency among the compared designs. This makes it a more practical and efficient content-dependent DCT design, well suited to power-constrained devices for mobile video applications.
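The 4.56 MHz operating frequency and the effect of voltage/frequency scaling quoted in this paper can be checked with a short calculation. The sketch assumes that the design processes one sample per clock cycle, that CIF is 352x288 in 4:2:0 format (1.5 samples per pixel), and that dynamic power scales with the square of the supply voltage times the frequency; these assumptions are consistent with the quoted frequency but are not stated explicitly in the text.

```python
def required_frequency_hz(width=352, height=288, fps=30, samples_per_pixel=1.5):
    """Clock rate needed when one sample is processed per cycle (assumed)."""
    return width * height * samples_per_pixel * fps


def dynamic_power_scale(v_old, f_old, v_new, f_new):
    """Scaling factor of dynamic power under the P ~ V^2 * f approximation."""
    return (v_new / v_old) ** 2 * (f_new / f_old)


freq_mhz = required_frequency_hz() / 1e6
scale = dynamic_power_scale(v_old=1.8, f_old=33.0, v_new=1.2, f_new=4.56)
```

Under these assumptions the required frequency is about 4.56 MHz, and moving from 1.8 V at 33 MHz to 1.2 V at 4.56 MHz reduces dynamic power to roughly 6% of its original value.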
CONCLUSION
