We propose a pipelined implementation of the eight-point Loeffler discrete cosine transform (DCT) for portable applications. The pipelined structure produces one DCT coefficient per clock cycle, which matches the limited memory bandwidth of many portable devices. Two-dimensional algebraic integer (AI) encoding and the shift-and-add approach are used to make the implementation multiplication-free. A hardware cost reduction of approximately 40% is achieved by reducing the precision of the adders, at the cost of a negligible amount of error in the reconstructed images.
Introduction
The discrete cosine transform (DCT) is a core component of many signal processing applications, including JPEG, MPEG, and H.26x (see [1, 2] and the references therein). For a given data sequence $x_i$, $i = 0, 1, \ldots, N-1$, its one-dimensional (1-D) $N$-point DCT, $y_m$, $m = 0, 1, \ldots, N-1$, is defined as follows [3]:

$$y_m = K_0\,\alpha_m \sum_{i=0}^{N-1} x_i \cos\frac{(2i+1)m\pi}{2N} \qquad (1)$$
where $\alpha_0 = 1/\sqrt{2}$ and $\alpha_m = 1$ for $m = 1, 2, \ldots, N-1$. In this paper, $K_0$ is selected as $\sqrt{2}$, as in [3]. The efficiency of a DCT implementation can be evaluated by its ability to produce DCT coefficients with satisfactory precision as quickly as possible, using as little hardware as possible. Methods for implementing a DCT with less hardware and simpler arithmetic operations have therefore been studied for a long time. Traditionally, hardware-saving methods have focused on minimizing the number of multiplication operations. The DCT algorithm introduced by Loeffler et al. [3], hereafter called the Loeffler algorithm, is recognized as the most computationally efficient algorithm because it requires the theoretical minimum number of multiplication operations. However, the 11 multiplications required by the Loeffler algorithm to calculate a 1-D 8-point DCT are still a burden for many applications.
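As a point of reference for the fast algorithm, the definition can be evaluated directly. The sketch below is only an illustration: it assumes the defining sum takes the form $y_m = K_0\,\alpha_m \sum_i x_i \cos((2i+1)m\pi/(2N))$ with $K_0 = \sqrt{2}$, and the function name `dct_1d` and the sample values are hypothetical.

```python
import math

def dct_1d(x, k0=math.sqrt(2.0)):
    """Direct O(N^2) evaluation of the assumed definition
    y_m = K0 * alpha_m * sum_i x_i * cos((2i + 1) * m * pi / (2N))."""
    n = len(x)
    y = []
    for m in range(n):
        alpha = 1.0 / math.sqrt(2.0) if m == 0 else 1.0
        acc = sum(x[i] * math.cos((2 * i + 1) * m * math.pi / (2 * n))
                  for i in range(n))
        y.append(k0 * alpha * acc)
    return y

samples = [52, 55, 61, 66, 70, 61, 64, 73]  # arbitrary 8-bit test values
coeffs = dct_1d(samples)
```

Under this scaling the 0th coefficient equals the plain sum of the inputs and the 4th coefficient is a signed sum of the inputs, so both are integers for integer inputs.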
The use of algebraic integers (AI) has been proposed for multiplication-free and error-free DCT calculations [4]. Previous studies based on AI encoding have focused on scaled DCT algorithms such as the Arai algorithm [5, 6, 7]. Scaled DCT algorithms can reduce the amount of computation significantly when a quantization step is applied after the transform. However, the Loeffler algorithm is the most efficient if quantization is not required, and it has been adopted widely [2].
A few implementations of the Loeffler DCT algorithm based on approximation approaches have been proposed. A previous study [8] presented a scaled 8-point Loeffler DCT architecture with 48 additions, which is based on a shift-and-add approach. Another study [1] proposed the relaxed cosine transform (RCT), which is very close to the DCT, and reported an 8-point RCT fast algorithm using 36 additions. These designs produce eight DCT coefficients at a time and do not consider a pipelined implementation for portable applications with limited resources.
In this letter, we propose a pipelined DCT processor based on an implementation of the 8-point Loeffler algorithm. The processor implements the multiplication operations using AI encoding and shift-and-add schemes. The proposed processor uses 25 adders and provides a throughput which is sufficiently high for low-end applications, while satisfying the requirements for portable devices, such as a narrow memory bandwidth and low cost.
Two-dimensional (2-D) AI encoding
The 2-D AI encoding scheme uses polynomials in two variables, which provide a flexible way of encoding that results in a cost-effective hardware implementation [4].
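In this letter, we use $z_1 = \sqrt{2+\sqrt{2}}/2$ and $z_2 = \sqrt{2}$. We use the 2-D polynomial expansion shown in Eq. (2) to represent the trigonometric values required by the Loeffler algorithm:

$$k \cdot f\!\left(\frac{n\pi}{16}\right) = \sum_{i=0}^{K}\sum_{j=0}^{L} a_{ij}\, z_1^{i} z_2^{j} \qquad (2)$$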
where $f$ is cos or sin, and $k = 1$ for $n = 1, 3$ whereas $k = \sqrt{2}$ for $n = 6$. The rest of the AI encoding process can be viewed as a mapping from a set of possibly irrational numbers to a set of arrays of integers, which are the coefficients $a_{ij}$ in Eq. (2). The hardware cost of the final implementation is closely related to the number of $a_{ij}$ that are neither zero nor a power of two, as also noted in [4]. The AI encoding proposed in this letter and the error incurred are summarized in Table I, using a selection of $K = L = 1$.
We could make the encoding error-free, but we pursued a further reduction in the hardware cost by allowing nonzero errors. The amount of error was controlled such that there was no loss in the image data for display during the processing of DCT and IDCT transforms.
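The idea of mapping a constant to a small array of integers can be sketched in a few lines. The example assumes the basis is $z_1 = \sqrt{2+\sqrt{2}}/2$ and $z_2 = \sqrt{2}$ with $K = L = 1$. The two coefficient arrays are illustrative identities derived for this sketch, not entries of Table I, and the helper names `ai_value` and `ai_scale` are hypothetical.

```python
import math

# Assumed basis: z1 = sqrt(2 + sqrt(2)) / 2 (= cos(pi/8)), z2 = sqrt(2).
Z1 = math.sqrt(2.0 + math.sqrt(2.0)) / 2.0
Z2 = math.sqrt(2.0)

def ai_value(a00, a01, a10, a11):
    """Evaluate a00 + a01*z2 + a10*z1 + a11*z1*z2 (the K = L = 1 expansion)."""
    return a00 + a01 * Z2 + a10 * Z1 + a11 * Z1 * Z2

def ai_scale(x, coeffs):
    """Multiply an integer sample by an AI-encoded constant.
    The result stays in AI form: only integer scaling of the array is needed."""
    return tuple(x * a for a in coeffs)

# Illustrative encodings (derived for this sketch, not taken from Table I):
SQRT2_SIN6 = (0, 0, 0, 1)    # sqrt(2)*sin(6*pi/16) = z1*z2
SQRT2_COS6 = (0, 0, 2, -1)   # sqrt(2)*cos(6*pi/16) = 2*z1 - z1*z2

err_sin = abs(ai_value(*SQRT2_SIN6) - math.sqrt(2.0) * math.sin(6 * math.pi / 16))
err_cos = abs(ai_value(*SQRT2_COS6) - math.sqrt(2.0) * math.cos(6 * math.pi / 16))
scaled = ai_scale(5, SQRT2_COS6)
```

Because every coefficient here is zero or a (signed) power of two, multiplying a sample by such a constant needs only shifts and sign changes, which is the property the encoding tries to preserve.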
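Table I. AI encoding of the trigonometric constants and the incurred error. Columns: $a_{00}$, $a_{01}$, $a_{10}$, $a_{11}$, Error. First row: $\cos(\pi/16)$, with $a_{00} = 0$.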
Loeffler DCT algorithm
In this letter, the algorithm is partitioned into three functional parts: decimation calculation (DEC), AI calculation (AIcal), and the final reconstruction step (FRS). In Fig. 1 (a), DEC performs the butterfly operations, which use common binary encoding. AIcal produces the DCT coefficients in the form of AI coefficients. Fig. 1 (b) shows the operations performed in the rectangular blocks in Fig. 1 (a), which are labeled as kCn blocks, where $k = 1$ or $\sqrt{2}$, and $n = 1$, 3, or 6 [3]. The complex multiplication operations included in these kCn blocks are simplified using AI encoding, as explained in Section 2. FRS converts the DCT results from the AI format into the common binary format, as described in Section 4.
Final reconstruction step
The AI-encoded DCT coefficients are converted to their real values in the FRS. Two output DCT coefficients, the 0th and 4th coefficients, bypass the FRS because AI-to-binary conversion is not required for these coefficients.
The remaining six output DCT coefficients are represented by twenty-four AI coefficients, where each output DCT coefficient, $y_m$, is encoded by four AI coefficients, denoted below as $C_{m,i}$, $i = 0, 1, 2, 3$. The real value of each DCT coefficient is calculated by multiplying each AI coefficient by the corresponding constant and accumulating the results. For the 1st, 2nd, 6th, and 7th DCT coefficients, the real values can be calculated using Eq. (3).
For the 3rd and 5th output DCT coefficients, the multiplication by $\sqrt{2}$ can be processed by changing the constants during FRS without increasing the calculation overheads, as shown in Eq. (4).
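The reason the extra factor costs nothing can be seen from $z_2^2 = 2$. As a sketch, write the four AI coefficients of one output as $c_0, c_1, c_2, c_3$, attached to the basis terms $1$, $z_2$, $z_1$, and $z_1 z_2$ respectively. This ordering is an assumption made here for illustration and need not match the indexing of $C_{m,i}$.

```latex
\sqrt{2}\,\bigl(c_0 + c_1 z_2 + c_2 z_1 + c_3 z_1 z_2\bigr)
  = c_0\,(\sqrt{2}) + c_1\,(2) + c_2\,(z_1 z_2) + c_3\,(2 z_1)
```

The same four multiply-accumulate steps are performed, only with the constant set $\{\sqrt{2},\, 2,\, z_1 z_2,\, 2 z_1\}$ in place of $\{1,\, z_2,\, z_1,\, z_1 z_2\}$.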
By approximating the constants, as shown in Table II , the multiplications in FRS can be implemented using shift-and-add operations [4] . Again, the amount of error was controlled such that there was no loss in the image data for display.
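A shift-and-add multiplication by a fixed constant can be sketched as follows. The example assumes a canonical signed-digit (non-adjacent form) recoding and 14 fractional bits; the constant $\sqrt{2}$, the helper names, and the Q14 format are illustrative choices and are not the entries of Table II.

```python
import math

FRAC_BITS = 14  # assumed fractional precision (resolution 2**-14)

def to_csd(n):
    """Recode integer n into signed digits (-1, 0, +1), least significant
    first, such that no two adjacent digits are nonzero."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits

def shift_add_multiply(x, digits):
    """Multiply integer x by the recoded constant using only shifts,
    additions, and subtractions (one adder per nonzero digit)."""
    acc = 0
    for k, d in enumerate(digits):
        if d == 1:
            acc += x << k
        elif d == -1:
            acc -= x << k
    return acc

const_q = round(math.sqrt(2.0) * (1 << FRAC_BITS))  # sqrt(2) in Q14
digits = to_csd(const_q)
product_q = shift_add_multiply(100, digits)          # 100 * sqrt(2) in Q14
approx = product_q / (1 << FRAC_BITS)
```

Each nonzero digit corresponds to one adder or subtractor in hardware, so a recoding with fewer nonzero digits translates directly into lower cost.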
Table II. Encoding of the constants in the FRS. Columns: constant, 16-bit signed-digit encoding, error.
Proposed Loeffler DCT processor
Fig. 2 (a) shows the architecture of the proposed DCT processor, which is an implementation of the Loeffler algorithm. The processor comprises three modules, which correspond to the three functional parts in Fig. 1 (a). The processor is targeted at portable devices in which the memory bandwidth allocated for the DCT processor is narrow, e.g., as narrow as 60 bits: 24 bits for reading the R (red), G (green), and B (blue) data of a pixel, and 36 bits for writing the three corresponding 12-bit DCT coefficients.
Eight input data, $In_0, \ldots, In_7$, are read into a buffer during eight clock cycles. Then, the DEC module receives those eight input data from the buffer in four successive clock cycles, as shown in Fig. 2 (b). There are seven butterfly operations in DEC, so two adders are used in the DEC module to process these operations during seven cycles. In order to produce one DCT coefficient in every cycle, the 0th and 4th output DCT coefficients, $Out_0$ and $Out_4$, are produced in the ninth and tenth clock cycles, respectively, assuming that the input data sequence begins arriving during the first clock cycle.
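The seven butterflies can be written out explicitly. The sketch below follows one common way of drawing the Loeffler flow graph (four butterflies, then two, then one); the signal ordering in Fig. 1 (a) and the function name `dec_butterflies` are assumptions made for illustration.

```python
def dec_butterflies(x):
    """Integer-only decimation part: seven butterflies (add/subtract pairs)."""
    if len(x) != 8:
        raise ValueError("expected eight input samples")
    # Stage 1: four butterflies pairing x[i] with x[7 - i].
    s = [x[i] + x[7 - i] for i in range(4)]   # even-part operands
    d = [x[i] - x[7 - i] for i in range(4)]   # odd-part operands for the rotations
    # Stage 2: two butterflies on the even part.
    e0, e3 = s[0] + s[3], s[0] - s[3]
    e1, e2 = s[1] + s[2], s[1] - s[2]
    # Stage 3: one butterfly yields the two outputs that need no AI decoding.
    out0, out4 = e0 + e1, e0 - e1
    return out0, out4, (e2, e3), d

samples = [52, 55, 61, 66, 70, 61, 64, 73]  # arbitrary 8-bit test values
out0, out4, even_rot_in, odd_in = dec_butterflies(samples)
```

With one butterfly (one addition and one subtraction) per cycle, two adders complete these seven operations in seven cycles, and `out0` and `out4` are available without any constant multiplication.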
After the ninth clock cycle, all three modules (DEC, AIcal, and FRS) operate concurrently: DEC starts processing the next set of eight input data, while AIcal, using seven adders that operate in parallel, continuously produces the AI coefficients for the remaining DCT coefficients and sends them to FRS, starting from the tenth cycle. In this way, FRS produces all the remaining DCT coefficients continuously. The five-addition paths in the FRS module are partitioned into two pipeline stages, which bounds the critical path to three adder delays. Fig. 2 (b) illustrates the parallel operation of the proposed processor and its I/O timing.
Each input datum is an 8-bit integer, which is typical for video (image) processing applications. Since $z_1 z_2$ is defined with a precision of $2^{-14}$, as shown in Table II, the intermediate operands are organized in 24 bits that represent real values in a fixed-point format with 14-bit fractions. However, by aligning the operands correctly, we find that 12-bit precision (ten bits for the integral part and two bits for the fractional part) is sufficient for all 16 adders in the FRS part. For the remaining nine adders in the DEC and AIcal modules, we use nine 10-bit integer adders. This reduction in precision is the only source of error in the proposed DCT processor, and it can be eliminated by using full-precision adders.
Experimental results and analysis
To verify the proposed DCT processor, we modelled the processor in Verilog and collected the output DCT coefficients produced by the processor for six test images: Lenna, Barbara, Boat, Cameraman, House, and Peppers. Then, the test images were reconstructed using a C-program version of the Loeffler IDCT, with full-precision floating-point computations and rounding-to-integer operations in the final step.
The proposed encodings in Tables I and II approximate the constant multiplicands with errors, but these errors do not affect the quality of the reconstructed images within the 8-bit image range. In the results, the only source of error is the use of adders with reduced precision, which yields finite Peak Signal-to-Noise Ratio (PSNR) values of at least 48.8 dB for all the test images. A PSNR of infinity can be achieved if we use adders with full precision.
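For readers who want to reproduce this kind of comparison, the PSNR of an 8-bit image can be computed as in the sketch below. It uses the standard definition $10\log_{10}(255^2/\mathrm{MSE})$; the function name `psnr` and the flat-list pixel representation are illustrative choices, not the evaluation code used for the reported results.

```python
import math

def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally sized sequences of pixel values.
    Returns infinity when the reconstruction is exact (zero MSE)."""
    if len(original) != len(reconstructed) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf
    return 10.0 * math.log10(peak * peak / mse)

exact = psnr([10, 20, 30], [10, 20, 30])   # identical images
off_by_one = psnr([10, 20], [11, 19])      # every pixel differs by 1, MSE = 1
```

An error of exactly one gray level in every pixel corresponds to about 48.13 dB, so a PSNR of 48.8 dB or more implies a mean squared error below one.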
The Verilog model of the proposed DCT processor was synthesized with Synplify, targeting a Xilinx Virtex-5 xc5vsx95tff1136 FPGA device. The proposed processor for 8-bit inputs occupied 893 LUTs and 525 register bits and operated at 121.2 MHz in the worst case, which suggests a frame rate of approximately 58 frames per second for an image size of 1920 × 1080. The operation speed could be increased by increasing the number of pipeline stages. With full-precision adders, the hardware cost increases significantly (67.6% more LUTs and 66.3% more register bits).
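The frame-rate and cost figures can be checked with simple arithmetic. The sketch below assumes that each pixel position costs one clock cycle (one coefficient per cycle per processor instance) and that the relative saving is measured against the full-precision design; the variable names are illustrative.

```python
CLOCK_HZ = 121.2e6            # worst-case clock frequency reported above
WIDTH, HEIGHT = 1920, 1080    # frame size

# Assumption: one clock cycle per pixel position.
cycles_per_frame = WIDTH * HEIGHT
frames_per_second = CLOCK_HZ / cycles_per_frame

# Full precision needs 67.6% more LUTs and 66.3% more register bits,
# so the reduced-precision design saves 1 - 1/1.676 and 1 - 1/1.663.
lut_reduction = 1.0 - 1.0 / 1.676
reg_reduction = 1.0 - 1.0 / 1.663
```

Under these assumptions the frame rate comes out slightly above 58 frames per second and both savings are close to 40%, consistent with the figures quoted in the text.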
Conclusion
This letter proposed a pipelined processor for calculating the 8-point Loeffler DCT in portable applications. Multiplications are avoided by exploiting 2-D AI encoding and the shift-and-add approach.
Previous studies based on AI encoding have focused on scaled DCTs and did not consider the Loeffler DCT because the scaled DCT has lower hardware complexity at the algorithm level [7] if a quantization step is required after the transform. However, the hardware complexity of the final implementation does not reflect that of the original algorithm directly in our proposed pipelined design for portable devices with a narrow memory bandwidth. In particular, our proposed implementation may be a good alternative when the memory bandwidth of the system allows only one pixel datum to be read and written per cycle, which is the case in many portable devices.
