Low power consumption is a key requirement for computing devices such as cell phones and cameras. In this paper we present a low-power DCT implementation using a highly scalable multiplier. The focus is on the IDCT for playback applications such as digital photo displays. The proposed solution exploits the fact that the size of the multiplications varies per stage in a multistage IDCT implementation; configuring the multipliers to match the needs of each stage saves power. Results are compared with Wallace and Array multipliers. We show that using a scalable multiplier and dynamically reconfiguring its width leads to significant power savings (over 72%) with negligible degradation in decoded image quality.
INTRODUCTION
Advances in technology have brought digital signal processing to a wide variety of application areas. Multimedia applications, for example, use algorithms to code video and images in order to reduce storage needs and meet transmission requirements. The DCT is widely used in image and video compression to decorrelate the signal and increase compression efficiency. The DCT is a complex operation and uses a significant amount of computing resources. Fast transforms such as the fast DCT are therefore often used to meet real-time constraints [1], [2], [5].
Because they rely on multipliers, the direct and inverse transform operations (DCT/IDCT) in image and video coding require a significant amount of power and computation. A multiplication is computation intensive and consumes a large amount of power. Over the years many fast algorithms have been proposed for computing the DCT, optimizing speed, throughput, latency or turnaround time. Design for low power consumption was not an issue until the late eighties. It has since become an increasingly active topic as the demand for mobile computation power, portable devices and portable multimedia applications grows.
THE DISCRETE COSINE TRANSFORM

One Dimensional DCT-IDCT
A one-dimensional N-point DCT of a given data sequence X is defined by equation (1), and the inverse DCT is given by equation (2).
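For reference, one widely used orthonormal form of the N-point DCT (type II) and its inverse is written out below. The symbols x(n), X(k) and c(k), and the choice of normalization, are assumptions made here for illustration; they may differ from the notation used in equations (1) and (2).

```latex
X(k) = \sqrt{\frac{2}{N}}\; c(k) \sum_{n=0}^{N-1} x(n)\,
       \cos\!\left[\frac{(2n+1)\,k\,\pi}{2N}\right], \qquad k = 0, \dots, N-1

x(n) = \sqrt{\frac{2}{N}} \sum_{k=0}^{N-1} c(k)\, X(k)\,
       \cos\!\left[\frac{(2n+1)\,k\,\pi}{2N}\right], \qquad n = 0, \dots, N-1

c(k) = \begin{cases} 1/\sqrt{2}, & k = 0 \\ 1, & k \neq 0 \end{cases}
```

With this normalization the transform is orthogonal, so the inverse uses the same cosine kernel.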
Transform Matrix
The 1-D DCT can also be expressed as a matrix-vector product in which [T] is known as the transform matrix. The 1-D IDCT can be written in the same form.
The 2-D DCT is obtained by row-column decomposition: a row-wise 1-D DCT is applied, followed by the column-wise 1-D DCT.
The same operation is applied for the 2-D IDCT.
Figure 1 shows the flow graph (butterfly diagram) for the DCT algorithm, which uses 29 additions and 13 multiplications. Each stage corresponds to a single pass through the butterfly diagram and is represented by one of 4 different sparse matrices. Fast DCT implementations partition the 8x8 transform matrix into four different stages [3], [4], [6]. The transform matrix can then be expressed as the product of 4 sparse matrices.
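The row-column decomposition described in this section can be sketched in a few lines of plain Python. This is a minimal illustration that assumes an orthonormal DCT-II transform matrix and uses the dense matrix rather than the four sparse factors; the function names `dct_matrix`, `dct2` and `idct2` are hypothetical and not taken from the paper.

```python
import math

def dct_matrix(n=8):
    """Build the n x n orthonormal DCT-II matrix T (row k = k-th basis function)."""
    rows = []
    for k in range(n):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / n)
        rows.append([scale * math.cos((2 * i + 1) * k * math.pi / (2 * n))
                     for i in range(n)])
    return rows

def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b)))
             for j in range(len(b[0]))] for i in range(len(a))]

def transpose(a):
    return [list(col) for col in zip(*a)]

def dct2(block):
    """2-D DCT by row-column decomposition: Y = T * X * T^T."""
    t = dct_matrix(len(block))
    rows_done = matmul(block, transpose(t))   # 1-D DCT of every row
    return matmul(t, rows_done)               # 1-D DCT of every column

def idct2(coeffs):
    """2-D IDCT: X = T^T * Y * T (T is orthogonal, so its inverse is T^T)."""
    t = dct_matrix(len(coeffs))
    cols_done = matmul(transpose(t), coeffs)  # inverse transform along columns
    return matmul(cols_done, t)               # inverse transform along rows
```

Applying `idct2(dct2(block))` to an 8x8 block should return the original values up to floating point rounding. A fast implementation would replace the dense product with the sparse stage matrices.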
MULTISTAGE REPRESENTATION OF DCT
Each single pass results in intermediate values of varying magnitude. The magnitude of these values depends on the input data size. For an n-bit wide input, the output of each stage has a specific bitwidth, which can be determined from the transform matrix coefficients C_i. The values in the transform matrix are not shown due to space constraints.
Based on the previous analysis, the output bitwidth for the 2-D inverse DCT can be determined the same way as the forward DCT. Tables 3 and 4 show the bitwidth of values in each stage for row wise and column wise operations. 
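One way to reason about the per-stage bitwidth is a worst-case bound: if a stage computes y = M·x, then each output is at most the largest row sum of absolute coefficients times the largest input magnitude. The sketch below applies that bound stage by stage. It is an assumption-laden illustration, not the analysis used to produce the tables: the helper names and the two example matrices are hypothetical, and stacking per-stage ceilings can overestimate the true growth.

```python
import math

def extra_bits(matrix):
    """Worst-case bit growth of y = matrix * x: ceil(log2(max row sum of |entries|))."""
    gain = max(sum(abs(c) for c in row) for row in matrix)
    if gain <= 1.0:
        return 0          # the stage cannot increase the magnitude
    return math.ceil(math.log2(gain))

def stage_bitwidths(stages, n_bits):
    """Bitwidth after each stage, starting from an n_bits-wide input."""
    widths = []
    bits = n_bits
    for m in stages:
        bits += extra_bits(m)
        widths.append(bits)
    return widths

# Hypothetical 2x2 stages, used only to exercise the functions:
# an add/subtract butterfly and a rotation by pi/8.
butterfly = [[1, 1], [1, -1]]
c, s = math.cos(math.pi / 8), math.sin(math.pi / 8)
rotation = [[c, s], [-s, c]]
widths = stage_bitwidths([butterfly, rotation], 8)
```

The butterfly has a row gain of 2 and therefore adds one bit; the rotation has a row gain of about 1.31, which also rounds up to one extra bit under this bound.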
Table entry: the output of F[T4][T3][T2][T1] has a bitwidth of n+3.
From Tables 1-4 it is clear that each stage produces numbers of different magnitude depending on the input. The magnitude of the input coefficients depends on the content being encoded and the quality at which it is encoded. The power consumed by a multiplier depends on its width. Traditional multipliers such as Wallace and Array multipliers cannot be scaled dynamically, so a 16-bit multiplier is used even with 8-bit operands. A low power multiplier based on operand truncation was proposed in [8]. Operand truncation reduces switching activity and hence power consumption. The downside is that truncation leads to quality loss due to reduced precision. Power consumption can thus be reduced if a different multiplier width can be used for each stage. A Highly Scalable Multiplier (HSM) is well suited for these applications [8]. An HSM allows dynamic configuration of the multipliers for each block and each stage of the IDCT.
TRANSFORM COEFFICIENT PRECISION
The DCT/IDCT implementations use a floating point representation of the transform matrix at each stage. As the multiplier is scaled for each block/stage, the transform matrix coefficients should also be scaled down to match the multiplier size used. For example, if an 8-bit multiplier is used, an 8-bit floating point representation is used for the coefficients. This reduced precision of the transform coefficients affects the quality of the decoded images. The accuracy of a number in floating point representation depends on the bit length used and on how those bits are divided between exponent and mantissa. A reduced precision floating point representation was determined by evaluating which split minimizes the loss in precision. The test was made for number lengths of 8, 10, 12 and 14 bits, with 3, 4, 5 and 6 bits for the exponent. For every bit length tested, the smallest distance to the original coefficient values is obtained with a 3-bit exponent field.
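The trade-off between exponent and mantissa bits can be explored with a small experiment. The sketch below rounds a value to a toy floating point format and is only an illustration under stated assumptions: one sign bit, a biased exponent with no reserved codes, no subnormals, and deliberately simplistic handling of out-of-range values. The paper does not specify its format at this level of detail, and the function names are hypothetical.

```python
import math

def quantize(value, total_bits, exp_bits):
    """Round value to a toy float: 1 sign bit, exp_bits exponent bits, rest mantissa."""
    man_bits = total_bits - 1 - exp_bits
    if value == 0.0:
        return 0.0
    bias = 2 ** (exp_bits - 1) - 1
    e_min, e_max = -bias, 2 ** exp_bits - 1 - bias
    _, exp = math.frexp(abs(value))          # abs(value) = frac * 2**exp, 0.5 <= frac < 1
    e = min(max(exp - 1, e_min), e_max)      # exponent of the 1.xxx form, clamped
    mant = abs(value) / 2.0 ** e             # lies in [1, 2) when no clamping occurred
    step = 2.0 ** -man_bits
    mant_q = round(mant / step) * step       # round mantissa to man_bits fractional bits
    return math.copysign(mant_q * 2.0 ** e, value)

def max_error(total_bits, exp_bits):
    """Largest absolute error over the cosine values cos(k*pi/16), k = 1..7."""
    coeffs = [math.cos(k * math.pi / 16) for k in range(1, 8)]
    return max(abs(c - quantize(c, total_bits, exp_bits)) for c in coeffs)
```

Calling `max_error(8, e)` for `e` in 3..6 mirrors the kind of comparison described above. Because these cosine values span a narrow dynamic range, a small exponent field already covers them and the remaining bits go to the mantissa, which is consistent with the reported preference for a 3-bit exponent.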
SIMULATION AND RESULTS
The performance of the proposed method was evaluated by simulating JPEG encoding/decoding on images: performing DCT, Quantization, Inverse Quantization, and then IDCT. The experiments focused on the estimated power consumption of the IDCT and on the PSNR and SSIM of a set of images when scalable multipliers are used. A scalable multiplier adapts its size (bit length) according to the input resolution; therefore the transform matrix coefficients are truncated numbers that depend on the floating point resolution used by the multiplier. The IDCT algorithm uses the product of 4 sparse matrices to represent the DCT transform matrix with an 8x8 block size, and the 2D-IDCT is expressed in terms of this transform matrix. Four different quantization modes were used in the simulation.
The first is Q=1, which means no quantization. The second and third are Q=8 and Q=16, which mean that a constant value is used to quantize the DCT block and de-quantize the IDCT block. The last is Q=M, which means the Quantization Matrix defined by the JPEG standard.
Three different cases are considered to perform the IDCT block algorithm explained before:
• Fixed Size Multiplier (FSM)
• Scalable Multiplier with Fixed Size for Block (SFM)
• Scalable Multiplier with Variable Size for Block (SVM)
The power consumption for every case is analyzed based on three different algorithms for multiplication: a Highly Scalable Multiplier (HSM) [8], a Wallace Multiplier (WM) [8], [9], [10] and an Array Multiplier (AM) [8], [9], [10]. Power consumption in CMOS digital circuits is proportional to the switching activity in logic gates; the total number of gates that switch is used to calculate the approximate switching power consumption. Table 5 shows the Average Toggle Count (ATC) used to calculate the switching power with the three different multipliers. The ATC for 10 and 14 bits was calculated using 12 and 16 bit multipliers with the 2 LSBs set to zero. HSM scales better than WM and AM. The increase in switching activity of an HSM is proportional to the square of the rate at which the bit-width increases (going from 16 to 32 bits increases the bits by 2 and hence the ATC (power consumption) by 4).
Fixed Size Multiplier (FSM)
The FSM always uses the same number of bits to perform operations. Even if the input bitwidth is short, the FSM uses all of its resources as if multiplying larger numbers. The power consumption in this case is considerable, because circuitry that is not needed for the operations is still used, but the output quality is always better because full precision is used for the floating point representation. The power consumption will be proportional to the number of multiplications for each stage (26) and the ATC of the fixed multiplier (X) (Table 5): PC ∝ 26·ATC(X). Values for power consumption are shown in Table 6.
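The FSM relation and the quadratic scaling of the HSM toggle count can be written down directly. The sketch below is illustrative only: the function names are hypothetical, the base ATC value in the example is a placeholder rather than a figure from Table 5, and the quadratic model is simply the 16-to-32-bit scaling rule quoted in the text applied to arbitrary widths.

```python
MULTS_PER_STAGE = 26  # multiplication count quoted in the text

def atc_scaled(base_atc, base_bits, bits):
    """Assumed ATC model: toggle count grows with the square of the width ratio."""
    return base_atc * (bits / base_bits) ** 2

def fsm_power(atc):
    """Relative power of a fixed-size multiplier, PC proportional to 26 * ATC(X)."""
    return MULTS_PER_STAGE * atc

# Placeholder ATC of 100.0 for a 16-bit multiplier (not a measured value).
atc_16 = 100.0
atc_32 = atc_scaled(atc_16, 16, 32)
ratio = fsm_power(atc_32) / fsm_power(atc_16)
```

Under this model, doubling the width from 16 to 32 bits quadruples both the ATC and the relative FSM power, matching the example given in the text.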
Scalable Multiplier - Fixed Size for Block (SFM)
The SFM uses a scalable multiplier that adapts its size to the maximum length needed in the most critical stage; that is, for each 2D-IDCT block the multiplier has a fixed size for all stages (sparse matrices). For a specific input bitwidth "n", taken as the maximum precision number in the input block, the size of the multiplier will be "n+5". The power consumption is proportional to the number of multiplications for each stage (26) and the ATC determined by the scalable multiplier (X=n+5) (Table 5).
Scalable Multiplier - Variable Size for Block (SVM)
The SVM adapts its size for every stage (sparse matrix) based on the input bitwidth. Each stage adapts the multiplier to its input, and the next stage adapts its multiplier to the output of the preceding stage within the block. The power consumption is the sum of the multiplications for each stage weighted by the ATC of the multiplier resolution used. A set of 5 images with 512x512 resolution was used to analyze the power consumption and quality of FSM, SFM and SVM.
Test images shown: Baboon, Barbara, Goldhill.
The power consumption is estimated from the number of times a specific multiplier is used ([...] 14-bits and 16-bits) when performing the IDCT. The approximate power consumption is the sum of the products of every single multiplier's usage count multiplied by its corresponding ATC. The average power is reported for FSM using 8, 16 [...] bits, and for SVM and SFM.
The original input image is compared with the decoded image after the whole process (DCT-Quantization-Inverse Quantization-IDCT) to evaluate the quality using the PSNR and SSIM [7] algorithms. The performance of fixed size, fixed-size per block, and variable-size multipliers is summarized in Tables 7-9. The results show that a significant amount of power is saved when the size of a multiplier is adjusted per block [...] overhead in reconfiguring the multiplier [...]. The results also show that the power consumption can be further reduced by changing the width of the multiplier at each stage of the IDCT. The reduced precision of the floating point operations has minimal impact on the quality of the decoded images. For video coding, reduced precision causes a drift [...] and distortion can accumulate for long [...].
Analyzing the power consumption in ATC, the Scalable Multiplier with Variable Size (SVM) reduces power consumption by 64% when compared with a fixed 16-bit HSM and by 91% when compared with a fixed 32-bit HSM. Table 10 shows the reduction in power consumption for SVM and SFM compared with FSM-16 and FSM-32 using WM and AM.
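The usage-count estimate used in this section (how often each multiplier width is used, times the ATC of that width) can be sketched as follows. All numbers in the example are placeholders chosen for illustration and are not taken from Tables 5-10; the function names are likewise hypothetical.

```python
def weighted_power(usage_counts, atc_table):
    """Sum over multiplier widths of (times used) * ATC(width)."""
    return sum(count * atc_table[bits] for bits, count in usage_counts.items())

def reduction_percent(reference, proposed):
    """Power saving of `proposed` relative to `reference`, in percent."""
    return 100.0 * (reference - proposed) / reference

# Placeholder ATC values per multiplier width (not measured data).
atc_table = {8: 25.0, 16: 100.0}

# Hypothetical usage for one block: the fixed multiplier performs all 26
# multiplications at 16 bits, the variable-size one uses 8 bits for 20 of them.
fsm = weighted_power({16: 26}, atc_table)
svm = weighted_power({8: 20, 16: 6}, atc_table)
saving = reduction_percent(fsm, svm)
```

With real usage histograms and the measured ATC values, the same two functions would reproduce the kind of comparison reported for SVM and SFM against FSM-16 and FSM-32.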
CONCLUSION
A low power DCT using a scalable multiplier was presented. Performance analysis for an IDCT implementation aimed at playback applications was reported. The proposed solution, which depends on scalable multiplier support in hardware, exploits the fact that the sizes of the multiplications vary per stage in a multistage IDCT implementation; configuring the multipliers to match the needs of each stage saves power. The solution affects quality because of the lower-precision floating point representation it requires, but we show that this impact is negligible. The power consumed by an HSM can be reduced by more than 72% when the multistage implementation is used.
