Abstract: A multiplierless discrete cosine transform (DCT) architecture is proposed to improve the power efficiency of image/video coders. Power reduction is achieved by minimizing both the number of arithmetic operations and their bit width. To minimize arithmetic-operation redundancy, our DCT design focuses on Chen's factorization approach and the constant matrix multiplication (CMM) problem. The 8×1 DCT is decomposed using six two-input butterfly networks. Each butterfly is for 2×2 matrix multiplication and requires a maximum of eight adders/subtractors with 13-bit cosine coefficients. Consequently, the proposed 8×1 DCT architecture is composed of 56 adders and subtractors, which represents a reduction of 61.9% and 46.1% in arithmetic operations compared to the conventional NEDA and CORDIC architectures, respectively. To further improve the power efficiency, an adaptive companding scheme is proposed. The proposed DCT architecture was implemented on a Xilinx FPGA. The results from power estimation show that our architecture can reduce the power dissipation by up to 90% compared to conventional multiplierless DCT architectures.
I. INTRODUCTION
The discrete cosine transform is one of the most compute-intensive parts of various image/video coding standards, such as JPEG, MPEG and H.264. Its computational burden is due to the large number of multiplications and additions. Multiplierless architectures, such as distributed arithmetic (DA), new DA (NEDA), coordinate rotation digital computer (CORDIC) and integer DCT (intDCT), are widely used for its VLSI realization because of improvements in speed, area overhead and power consumption [1]-[4]. These advantages arise because the multiple constant multiplications are realized with optimized shift-and-add operations instead of generic array multipliers. The iterative or overlapped additions in DCT multiplications improve the power efficiency by reducing the number of arithmetic operations. CORDIC executes shift-and-add operations iteratively to compute angle rotation and magnitude compensation. DA uses ROMs in which the partial sums of inputs are stored; the result is obtained by shifting and adding the values stored in the ROMs. NEDA utilizes an adder array instead of ROMs to improve the area efficiency.
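To make the CORDIC description above concrete, the following Python sketch rotates a vector using only the iterative shift-and-add structure followed by magnitude compensation. It is a generic textbook CORDIC in floating point, not the architecture of [3]; the function name `cordic_rotate` and the default of 16 iterations are assumptions made for illustration. In hardware, each multiplication by 2**-i would be a right shift.

```python
import math

def cordic_rotate(x, y, angle, iterations=16):
    """Rotate (x, y) counterclockwise by `angle` radians (|angle| < ~1.74).

    Hypothetical helper for illustration only. Floating point is used for
    readability; a hardware CORDIC would use fixed-point shifts and adds.
    """
    # Accumulated gain of the micro-rotations; dividing by it at the end is
    # the "magnitude compensation" step.
    gain = 1.0
    for i in range(iterations):
        gain *= math.sqrt(1.0 + 2.0 ** (-2 * i))

    z = angle  # residual angle still to be rotated
    for i in range(iterations):
        d = 1 if z >= 0 else -1
        # Micro-rotation by atan(2**-i): two shifted additions/subtractions.
        x, y = x - d * y * 2.0 ** -i, y + d * x * 2.0 ** -i
        z -= d * math.atan(2.0 ** -i)

    return x / gain, y / gain
```

Calling `cordic_rotate(1.0, 0.0, math.pi / 4)` should return values close to (cos π/4, sin π/4); the accuracy improves by roughly one bit per iteration.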
Common subexpression elimination (CSE) is another efficient scheme to reduce the number of additions required to realize multiple multiplications [5]. It is mainly applied to single-input/multiple-output operations such as finite impulse response (FIR) filters. Constant matrix multiplication (CMM) is being researched for the efficient hardware design of multiple-input/multiple-output operations, targeting area and delay optimization. Benchmarking of CMM-based DCT architectures has shown their area efficiency [6], [7].
Schemes that reduce the computations based on visual irrelevance have been built on conventional multiplierless DCT architectures; they target low-power image/video coding. These architectures offer highly controllable precision, e.g. by changing the accumulation cycles [2], [11] or skipping magnitude compensation [3]. However, architectures that share common sub-expressions have a low degree of freedom for precision control. This paper proposes a low-power DCT architecture for image/video coders. The proposed architecture is designed to reduce the required additions/subtractions by involving CMM, as well as the bit-width of the arithmetic operations by using an adaptive companding scheme, while minimizing quality loss. The proposed and the conventional multiplierless DCT architectures are implemented on a Xilinx FPGA; their power consumption is estimated using the Xilinx XPower tool.
II. PROPOSED DCT ARCHITECTURE
The proposed DCT architecture targets power efficiency by minimizing the number of arithmetic operations as well as the bit-width of the arithmetic logic. For the 2-D DCT, the row-column decomposition technique can be used [1]. It reduces the complexity of the DCT by a factor of four. Our 1-D DCT design that minimizes the number of arithmetic operations is presented in Section A; the companding scheme for bit-width reduction in each butterfly is described in Section B.
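As a rough illustration of the row-column technique, the Python sketch below computes an 8×8 2-D DCT by applying a 1-D DCT to every row and then to every column, and also evaluates the 2-D definition directly for comparison. It is a floating-point reference model, not the proposed hardware; the function names and the orthonormal DCT-II scaling are assumptions. If each 1-D transform is evaluated as a plain matrix-vector product, the direct form needs N^4 = 4096 products for N = 8, while the row-column form needs 2·N·N^2 = 1024, which is consistent with the factor of four stated above.

```python
import math

N = 8  # block size used throughout the paper (8x1 and 8x8 DCT)

def _scale(k):
    # Orthonormal DCT-II scaling (an assumption; other scalings are common).
    return math.sqrt(1.0 / N) if k == 0 else math.sqrt(2.0 / N)

def dct_1d(x):
    """8-point DCT-II evaluated directly from its definition."""
    return [
        _scale(k) * sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                        for n in range(N))
        for k in range(N)
    ]

def dct_2d_row_column(block):
    """2-D DCT as 1-D DCTs over rows, then 1-D DCTs over columns."""
    rows = [dct_1d(row) for row in block]
    cols = [dct_1d([rows[i][j] for i in range(N)]) for j in range(N)]
    # cols[j][u] holds coefficient (u, j); transpose back to out[u][v].
    return [[cols[v][u] for v in range(N)] for u in range(N)]

def dct_2d_direct(block):
    """2-D DCT evaluated from the separable double sum (reference only)."""
    out = [[0.0] * N for _ in range(N)]
    for u in range(N):
        for v in range(N):
            acc = 0.0
            for i in range(N):
                for j in range(N):
                    acc += (block[i][j]
                            * math.cos((2 * i + 1) * u * math.pi / (2 * N))
                            * math.cos((2 * j + 1) * v * math.pi / (2 * N)))
            out[u][v] = _scale(u) * _scale(v) * acc
    return out
```

Both functions should agree to within floating-point rounding for any 8×8 input block.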
A. 1-D DCT design to minimize arithmetic operations
Our 1-D DCT for minimum arithmetic operations is based on Chen's factorization [8], which decomposes the 8×1 DCT into twelve 2-input butterflies, as shown in Fig. 1. Each butterfly represents a 2×2 matrix multiplication; six butterflies, without all constants being unity, are generally computed with four multiplications, one addition and one subtraction. Although Wang's factorization [9] reduces these operations to three multiplications, two additions and one subtraction, Chen's algorithm is expected to eliminate more redundant additions among the four multiplications due to the symmetry of the constant pair.
Figure 1. Butterfly designs using constant matrix multiplication
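The two operation counts quoted above can be checked on a single rotation butterfly. The Python sketch below evaluates the same 2×2 rotation once with four multiplications, one addition and one subtraction, and once with a well-known three-multiplication rearrangement that uses two additions and one subtraction. The three-multiplication form shown here is a generic identity and is not claimed to be the exact arrangement of [9]; the function names and the sign convention of the rotation are assumptions.

```python
import math

def butterfly_4mul(x0, x1, a, b):
    """Rotation butterfly: 4 multiplications, 1 addition, 1 subtraction."""
    y0 = a * x0 + b * x1
    y1 = a * x1 - b * x0
    return y0, y1

def butterfly_3mul(x0, x1, a, b):
    """Same butterfly with 3 multiplications, 2 additions, 1 subtraction.

    (b - a) and (a + b) are constants that would be precomputed offline,
    so they are not counted as run-time operations.
    """
    t = a * (x0 + x1)          # addition 1, multiplication 1
    y0 = t + (b - a) * x1      # multiplication 2, addition 2
    y1 = t - (a + b) * x0      # multiplication 3, subtraction
    return y0, y1
```

The four-multiplication form keeps the symmetric constant pair (a, b) visible in both outputs, which is the property the text relies on when looking for shared shift-and-add terms.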
In most 2-D DCT implementations for image/video coders, 12-bit precision has been used in order to conform to the IEEE 1180-1990 accuracy specifications [1], [2], [10]. Unsigned 11-bit expressions of the cosine coefficients for the 8×1 DCT are shown in Table I. An extended canonic signed digit (CSD) format [5] is used for expressing the binary constant coefficients in Table I. This format can minimize the required arithmetic operations by improving the possibility of common additions and subtractions among the four constant multiplications in the 2-point butterflies. In standard multiplication, the number of additions depends directly on the number of '1' bits when the constant multiplicand is represented in 2's complement. The CSD number format, which represents a sequence of consecutive '1' bits with an appropriately placed '1' and '-1', is widely used with CSE and CMM [5]-[7]. However, since CMM has a better chance of involving common sub-expressions, extending the nonzero digit set from {±1} in CSD to {±1, ±2} is expected to further reduce the number of additions/subtractions. While the conventional CSD form of a binary constant is unique, the extended CSD may map to multiple expressions. Therefore, common arithmetic terms should be searched for in all forms of the extended CSD.
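The link between signed-digit recoding and adder count can be sketched in a few lines of Python. The code below converts a non-negative integer constant to conventional CSD (digits in {-1, 0, +1}) and multiplies by it using only shifts, additions and subtractions. It covers only the conventional CSD form; the extended digit set {±1, ±2} and the search for shared sub-expressions described above are not implemented. All function names are hypothetical.

```python
def to_csd(value):
    """Return the CSD digits of a non-negative integer, least significant first."""
    digits = []
    while value != 0:
        if value & 1:
            # +1 when value mod 4 == 1, -1 when value mod 4 == 3; this choice
            # guarantees that no two adjacent digits are nonzero.
            d = 2 - (value & 3)
            value -= d
        else:
            d = 0
        digits.append(d)
        value >>= 1
    return digits

def nonzero_digits(digits):
    """Number of nonzero digits; a lone constant needs one fewer add/subtract."""
    return sum(1 for d in digits if d != 0)

def multiply_shift_add(x, digits):
    """Multiply x by the constant encoded in `digits` using shifts and adds only."""
    acc = 0
    for shift, d in enumerate(digits):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc
```

For example, the constant 7 has three '1' bits in plain binary, whereas `to_csd(7)` encodes it as 8 - 1 with two nonzero digits, so one subtraction replaces two additions.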
Appropriately arranging the common sub-expressions and choosing their signs can reduce the required bit-width in each addition/subtraction. The CSE scheme saves operations because it is free from the sequential accumulations needed in the multiplier-accumulator of DA architectures. However, the precision of each prior arithmetic operation must be kept until the whole computation is done, which implies a large bit-width in the arithmetic operations. Therefore, arranging a common pattern with a smaller shift value is deemed effective in reducing the bit-width as well as the power consumption. When a common sub-expression is included in another one, the signs of the patterns, which can be exchanged, should also be considered to minimize the effective bit-width of the arithmetic operations. To design a 1-D DCT with 12-bit coefficient precision, multiplierless butterfly modules are hand-optimized using the above schemes, as shown in Fig. 2.
Table II lists the adders required by the proposed 8×1 DCT architecture and its performance compared to previous multiplierless DCT architectures. All architectures are pipelined; the number of NEDA accumulation cycles depends on the precision of the cosine coefficients (the large number of operations degrades the throughput). The proposed architecture reduces the arithmetic operations by 61.9% compared to the conventional NEDA.
B. Adaptive companding scheme for bit-width reduction
An adaptive companding scheme is proposed to improve the power efficiency by reducing the bit-width of the arithmetic operations in the CMM-based butterfly modules. The effective bit-width at the input of the butterflies is lower than the designed bit-width, which is chosen to protect against arithmetic overflow/underflow. The practical distributions for two real images, Lena and Mandrill, are shown in Fig. 3. Minimizing the sign-extension bits that result from the difference between the effective bit-width and the designed bit-width reduces the power consumption by removing unnecessary arithmetic operations.
The proposed companding scheme is shown in Fig. 4; it reduces or restores the bit-size of the input/output values by means of a compressor/expander pair. It provides the butterflies with reduced bit-widths while minimizing image/video quality degradation. Conventional operation-reducing schemes usually sacrifice LSBs (least-significant bits) by controlling the accumulation cycles [2], [11]. When the maximum dynamic range (DR) of the input is larger than the reduced bit-width of the butterfly input, the compressor connects only a restricted set of MSBs (most-significant bits) to the butterfly module. If the DR of the input is smaller, then the compressor excludes the sign-extension bits. The loss of output accuracy caused by the lower-bit truncation of a large input may be insignificant. For a practical design, DR control is based on a fixed value related to the application. Fine-grained DR control increases not only the design complexity but also the power consumption, due to the additional circuitry involved.
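One possible reading of the compressor/expander behavior described above is sketched below in Python. Since Fig. 4 is not reproduced here, this is an assumption-laden model rather than the authors' circuit: the function names, the use of a common shift for both butterfly inputs, and the example bit-widths are all hypothetical. Small inputs pass through unchanged (only sign-extension bits are dropped), while large inputs keep their MSBs and lose a fixed number of LSBs, which the expander restores as zeros.

```python
def effective_bits(v):
    """Bits needed to hold v in two's complement, including the sign bit."""
    magnitude_bits = v.bit_length() if v >= 0 else (~v).bit_length()
    return magnitude_bits + 1

def compress_pair(x0, x1, designed_bits, reduced_bits):
    """Fit both butterfly inputs into `reduced_bits` (hypothetical model).

    Returns the compressed inputs and the shift the expander must undo.
    """
    if max(effective_bits(x0), effective_bits(x1)) <= reduced_bits:
        # Dynamic range is small: drop sign-extension bits only (lossless).
        return x0, x1, 0
    # Dynamic range is large: keep the MSBs, truncate a fixed number of LSBs.
    shift = designed_bits - reduced_bits
    return x0 >> shift, x1 >> shift, shift

def expand(y, shift):
    """Restore the magnitude of a butterfly output; truncated LSBs return as zeros."""
    return y << shift
```

Because a butterfly is linear, applying the same shift to both inputs lets the expander undo it on each output; the price is a bounded truncation error on large inputs, which matches the remark that the accuracy loss for large values may be insignificant.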
III. IMPLEMENTATION AND RESULTS
We implemented the proposed design and the conventional multiplierless DCT architectures NEDA and CORDIC on the Xilinx XC2VP50 FPGA; we estimated and compared their power consumption with the XPower tool. The required area of the proposed design without the companding scheme is shown in Table III; the area was measured in number of slices. Two popular test images, Lena and Mandrill, were used to estimate the power consumption. All designs were operated at 50 MHz and 1.5 V. Each of these images has 512×512 pixels, with each pixel represented by 8 bits for a total of 256 gray levels. The energy consumption is summarized in Table IV. The results show that the proposed architecture reduces the power dissipation by up to 90.0% compared to the conventional NEDA. The power savings with the proposed companding scheme are shown in Table V. The reduced bits are determined by considering the bit distribution in Fig. 3. Our scheme can reduce the power consumption by up to 15% while PSNR degrades by less than 12 dB. More power savings are expected for ASIC implementations of our scheme. This is because our results come from an FPGA implementation using lookup tables without signal transition during an arithmetic operation.
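For readers who want to reproduce a quality figure of this kind, the Python sketch below computes PSNR for 8-bit grayscale images using the standard definition with a peak value of 255. The text does not state which reference image the reported PSNR was measured against, so the choice of `reference` is left to the caller; the function name is hypothetical.

```python
import math

def psnr(reference, test, peak=255):
    """PSNR in dB between two equally sized images given as lists of rows."""
    squared_error = 0
    count = 0
    for ref_row, test_row in zip(reference, test):
        for r, t in zip(ref_row, test_row):
            squared_error += (r - t) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(peak * peak / mse)
```

A PSNR degradation is then the difference between the PSNR obtained with the full bit-width datapath and the PSNR obtained with the reduced bit-width datapath, both measured against the same reference.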
IV. CONCLUSIONS
This paper proposed a low-power DCT architecture for image/video coders. Power reduction was realized by minimizing the number of arithmetic operations and the bit-width of the involved operands. In order to minimize the operations, the 1-D DCT was decomposed into 2-input butterflies using Chen's factorization; each butterfly was designed by involving CMM. The total required number of operations for the 8×1 DCT was 56, which represents a reduction of 61.9% compared to the conventional NEDA. An adaptive companding scheme was proposed to effectively reduce the bit-width of the butterfly arithmetic. The power consumption of the proposed and the conventional multiplierless DCT architectures was estimated using the Xilinx XPower tool for a Xilinx FPGA. The results showed that the proposed DCT architecture, without companding, requires 10% of the energy consumed by the conventional NEDA. The proposed architecture is expected to be useful in mobile multimedia applications.
