Abstract: This letter proposes an efficient hardware accelerator for the one-dimensional (1D) eight-point Loeffler discrete cosine transform (DCT), targeting small portable devices. For continuous 1D input data streams, the accelerator uses only 13 adders and calculates one DCT coefficient per clock cycle, which is the optimal throughput for the considered applications. Implementation results show that the accelerator can support real-time encoding of practical video sequences and can be a good alternative for small portable applications.
Introduction
The discrete cosine transform (DCT) is a main component of many signal processing applications, including JPEG, MPEG, and H.26x (see [1, 2, 3, 4] and the references therein). For a given data sequence $x_i$, $i = 0, 1, \cdots, N-1$, its one-dimensional (1D) N-point DCT coefficients, $y_m$, $m = 0, 1, \cdots, N-1$, are defined as follows [5]:
where the scaling factor for $m = 0$ is $1/\sqrt{2}$ and that for $m = 1, 2, \cdots, N-1$ is 1. In this paper, $K_0$ is chosen to be $\sqrt{2}$ as in [5].
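For reference, a direct evaluation of an N-point DCT can be sketched as follows. This is a minimal sketch, assuming the common DCT-II form with a $1/\sqrt{2}$ factor for $m = 0$. The function name `dct_1d`, the `scale` parameter, and its orthonormal default $\sqrt{2/N}$ are illustrative assumptions and may differ from the scaling fixed by $K_0$ in [5].

```python
import math


def dct_1d(x, scale=None):
    """Direct N-point DCT-II of the sequence x (O(N^2), for reference only)."""
    n = len(x)
    if scale is None:
        # Assumption: orthonormal scaling; the paper's K_0-based scaling may differ.
        scale = math.sqrt(2.0 / n)
    y = []
    for m in range(n):
        # The m = 0 coefficient carries the extra 1/sqrt(2) factor.
        factor = 1.0 / math.sqrt(2.0) if m == 0 else 1.0
        acc = sum(
            x[i] * math.cos((2 * i + 1) * m * math.pi / (2 * n))
            for i in range(n)
        )
        y.append(scale * factor * acc)
    return y
```

Such a direct form needs on the order of N multiplications per coefficient, which is the cost that fast algorithms such as Loeffler's are designed to avoid.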
DCT is computationally demanding. Thus, numerous works have implemented efficient DCT hardware accelerators that are fast enough for high-end applications. However, the hardware complexity of DCT is a significant concern in small portable devices such as internal disease data collectors.
The Loeffler DCT algorithm [5] is recognized as the most computationally efficient algorithm, in the sense that it requires the theoretical minimum number of multiplications. The algorithm has been adopted in practical applications [2]. As shown in Fig. 1(a), Loeffler DCT can be divided into two functional parts: decimation calculation (DEC) and trigonometric multiplication (TRIMUL). The DEC part consists of butterfly operations, whereas the TRIMUL part involves multiplications by trigonometric values. Because the trigonometric values in Fig. 1(b) are irrational, the 11 multiplications required by Loeffler DCT can still account for a large portion of the hardware complexity of the DCT accelerator.
To implement the multiplications using adders, a considerable number of studies have developed various design methods, including the use of distributed arithmetic [6] and shift-and-add algorithms [7] . In this letter, we are interested in a Loeffler DCT accelerator that uses as few adders as possible and achieves the optimal throughput for small portable devices. With the distributed arithmetic scheme, implementing the multiplications using one adder is possible, as in [8] ; however, the resultant design has long latency and low throughput.
In small portable devices, the allowed data memory bandwidth for the DCT accelerator is often narrow. The data bandwidth considered in this letter is 60 bits per clock cycle: 24 bits to read the red (R), green (G), and blue (B) data of a pixel; and 36 bits to write the three corresponding 12-bit DCT coefficients. Thus, the 1D eight-point DCT accelerator is allowed to have one 12-bit input port and one 12-bit output port for each color component (R, G, or B). The accelerator can then be used to calculate the two-dimensional (2D) DCT coefficients using the row-column decomposition method.
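The row-column decomposition mentioned above can be sketched in software as follows. This is a minimal sketch, assuming an orthonormal 1D DCT-II as the building block; the helper names `dct_1d`, `transpose`, and `dct_2d` are hypothetical and do not model the accelerator's fixed-point arithmetic or port widths.

```python
import math


def dct_1d(x):
    """Orthonormal 1D DCT-II (assumed scaling, for illustration only)."""
    n = len(x)
    out = []
    for m in range(n):
        factor = math.sqrt(1.0 / n) if m == 0 else math.sqrt(2.0 / n)
        acc = sum(
            x[i] * math.cos((2 * i + 1) * m * math.pi / (2 * n))
            for i in range(n)
        )
        out.append(factor * acc)
    return out


def transpose(block):
    """Swap rows and columns (the role a transpose memory plays in hardware)."""
    return [list(col) for col in zip(*block)]


def dct_2d(block):
    """2D DCT via row-column decomposition: 1D DCT on rows, then on columns."""
    rows_done = [dct_1d(row) for row in block]
    cols_done = [dct_1d(col) for col in transpose(rows_done)]
    return transpose(cols_done)
```

Because the 2D transform is separable, the order of the two passes (rows first or columns first) does not change the result, so a single 1D engine can serve both passes.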
A few implementations have been proposed for the Loeffler DCT accelerator with no multiplier. A previous study [9] presented an implementation of a scaled eight-point Loeffler DCT based on the coordinate rotation digital computer (CORDIC) algorithm, which is basically a shift-and-add algorithm. The implementation uses 56 adders (for 12-bit precision), excluding the hardware for compensating for the gain accumulated during the transform. Another study [10] proposed a new shift-and-add algorithm and presented an improved implementation with 48 adders. Recently, the authors in [11] used the canonical signed digit (CSD) encoding [7] to minimize the number of additions in the shift-and-add algorithm for implementing the multiplications. The authors reduced the number of additions further by applying the common subexpression elimination technique and showed that $y_1$ in Fig. 1(a) can be calculated using 16 additions.
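To illustrate how CSD encoding turns a constant multiplication into a small number of shifted additions and subtractions, the following sketch may help. It is not the design of [11]: the 10-bit fractional precision, the choice of $\cos(\pi/16)$ as the example constant, and the names `to_csd` and `shift_add_multiply` are assumptions made for illustration.

```python
import math


def to_csd(n):
    """CSD digits of a non-negative integer, LSB first, each in {-1, 0, 1}."""
    digits = []
    while n != 0:
        if n & 1:
            d = 2 - (n & 3)  # +1 if n % 4 == 1, -1 if n % 4 == 3
            n -= d
        else:
            d = 0
        digits.append(d)
        n >>= 1
    return digits


def shift_add_multiply(x, digits):
    """Multiply x by the constant encoded in digits using shifts and adds only."""
    acc = 0
    for shift, d in enumerate(digits):
        if d == 1:
            acc += x << shift
        elif d == -1:
            acc -= x << shift
    return acc


FRAC_BITS = 10  # assumed fixed-point precision, chosen for illustration
coeff = round(math.cos(math.pi / 16) * (1 << FRAC_BITS))
digits = to_csd(coeff)
nonzero_terms = sum(1 for d in digits if d != 0)
```

Each nonzero digit costs one shifted operand, so `nonzero_terms` gives the number of operands that must be summed for this constant; CSD guarantees that no two adjacent digits are nonzero.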
The aforementioned previous studies calculated eight 1D DCT coefficients per clock cycle. Although the authors in [11] showed that $y_1$ could be calculated with 16 additions, they did not show how many adders were needed or how to schedule the adders to calculate one 1D DCT coefficient at a time. For applications that allow the calculation of only one 1D DCT coefficient per cycle, a previous study [12] using algebraic integer encoding [13] proposed a Loeffler DCT accelerator with 25 adders. For integer inverse cosine transforms, which are not Loeffler DCTs, the authors in [14] presented hardware accelerators with at least 24 adders.
In this letter, we propose an efficient eight-point Loeffler DCT accelerator that uses significantly fewer adders than previous works while achieving a throughput of one 1D DCT coefficient per cycle. The proposed accelerator is designed by directly applying the shift-and-add operations in an efficient pipeline optimized for the considered applications.
Trigonometric values
Table I summarizes the proposed encoding for the trigonometric values required in Fig. 1(b). The encoding is not minimal in the number of nonzero terms. Instead, we limit each value to at most six terms, of which at most one is negative, to efficiently exploit the proposed architecture in Section 3.
Proposed accelerator
Fig. 2 describes the architecture of the proposed accelerator. The accelerator consists of two modules that correspond to the two functional parts in Fig. 1(a). The DEC module uses two common two-operand (2-op) adders (add0 and add1), whereas the TRIMUL module includes one six-operand (6-op) adder, one five-operand (5-op) adder, and two 2-op adders (add2 and add3). The 6-op adder consists of a six-input carry-save-adder (CSA) tree and a carry-propagation adder (CPA). It is used for the multiplications by the three trigonometric values listed at the top of Table I. The encoding of each of these three values includes only one negative term, so that a single hardware module for negation can be shared. The 5-op adder is used for the multiplications by the remaining four values in Table I and consists of a five-input CSA tree and a CPA.
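The structure of a multi-operand adder (a CSA tree followed by a single CPA) can be modeled in software as follows. This is a behavioral sketch only, under the assumption of non-negative integer operands and a simple sequential reduction order; the names `carry_save_add` and `multi_operand_add` are hypothetical, and the actual CSA arrangement in the accelerator is not specified here.

```python
def carry_save_add(a, b, c):
    """3:2 compression: return (s, carry) such that s + carry == a + b + c."""
    s = a ^ b ^ c                                 # bitwise sum without carries
    carry = ((a & b) | (a & c) | (b & c)) << 1    # majority bits, shifted left
    return s, carry


def multi_operand_add(operands):
    """Reduce operands with CSAs until two remain, then do one carry-propagate add.

    Returns (total, csa_count). The reduction order here is sequential; a real
    tree would arrange the same number of CSAs to shorten the critical path.
    """
    pending = list(operands)
    csa_count = 0
    while len(pending) > 2:
        a, b, c = pending[:3]
        s, carry = carry_save_add(a, b, c)
        pending = pending[3:] + [s, carry]
        csa_count += 1
    total = sum(pending)  # stands in for the final CPA
    return total, csa_count
```

Each CSA reduces the operand count by one, so six operands need four CSAs and five operands need three, which is consistent with a total of seven CSAs for one 6-op and one 5-op adder. Only the final two-operand addition requires carry propagation.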
The timing diagram in Fig. 3 illustrates the operation of the proposed accelerator. The accelerator has an input buffer of 96 bits, which is omitted from the figures for brevity. The first and second rows in Fig. 3 show which signals are read in from the buffer or written out to the memory, and denote when these signals appear after being latched at the positive edges of the clock. The other rows show the timing for calculating the signals whose names are denoted in Fig. 1.
The first data set, $x_0, \cdots, x_7$, is read into the buffer in eight clock cycles. Then, the DEC module reads those eight data from the buffer in four successive clock cycles, two at a time: $x_0$ and $x_7$, $x_3$ and $x_4$, and so forth. In the 9th cycle, while the 5-op adder is still busy with the current data set, DEC starts latching in $x_8$ and $x_{15}$ of the next data set, and the 6-op adder in TRIMUL starts processing the next data set by calculating $(a_{15})_C$ in the 10th cycle.
Evaluation
For each 1D input data stream, the proposed architecture uses 13 adders: six CPAs and seven CSAs. For continuous 1D input data streams, such as each color component (R, G, or B) stream of video images, the accelerator produces one DCT coefficient per clock cycle after an initial latency of 16 clock cycles. This throughput is optimal because the accelerator can output a maximum of one DCT coefficient per cycle in the considered applications. In a 2D image processing configuration that consists of the 1D column DCT, a transpose memory, and the 1D row DCT, as in [3] , the accelerator can process one pixel per clock cycle after an initial latency of 80 clock cycles.
The proposed accelerator has been modeled in Verilog. In this letter, the CPA in a multi-operand adder is a carry-lookahead adder (CLA) with groups of up to four bits and a ripple carry between groups [7]. The 2-op adders are ripple-carry adders (RCAs): two 13-bit adders (add0 and add1), one 17-bit adder (add2), and one 18-bit adder (add3). The highest precision in the accelerator is 18 bits. However, each result is truncated to 12-bit precision before being sent out from the accelerator.
The model has been verified using four benchmark images: Lenna, Pepper, House, and Cameraman. After the test images are transformed using the Verilog model, they are reconstructed with inverse DCT software that uses full-precision floating-point computations with rounding-to-integer operations in the final step. The 12-bit limitation in the I/O precision limits the reconstruction quality: peak signal-to-noise ratio (PSNR) values of at least 39.1 dB are obtained for all the test images. If the I/O precision is allowed to be 13 bits, then the PSNR value becomes at least 44.8 dB. The precision of the internal units can also be reduced further to decrease hardware usage at the cost of the PSNR values, as in [8].
The Verilog model of the proposed accelerator has been synthesized with Synplify, targeting a Xilinx XC7VX485T-2 FPGA. The proposed accelerator for one color component, including its control part, occupies 795 LUTs with 674 register bits and operates at 200 MHz in the worst case, suggesting that, at one pixel per clock cycle, the frame rate can be approximately 96 frames per second for a frame size of 1920 × 1080 pixels.
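The frame-rate estimate and the PSNR metric used above can be reproduced with a few lines of arithmetic. This is a minimal sketch: the function names `frames_per_second` and `psnr` are hypothetical, the frame-rate formula assumes one pixel per clock cycle with no stalls, and the default peak value of 255 assumes 8-bit color components.

```python
import math


def frames_per_second(clock_hz, width, height, pixels_per_cycle=1):
    """Upper bound on the frame rate when the pipeline never stalls."""
    return clock_hz * pixels_per_cycle / (width * height)


def psnr(original, reconstructed, peak=255):
    """PSNR in dB between two equally long sequences of pixel values."""
    mse = sum((a - b) ** 2 for a, b in zip(original, reconstructed)) / len(original)
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


# 200 MHz clock, 1920 x 1080 frame: roughly 96 frames per second.
rate = frames_per_second(200e6, 1920, 1080)
```

The estimate ignores the initial pipeline latency, which is negligible relative to the roughly two million cycles needed per frame.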
The cost and performance of the proposed accelerator have been compared with those of a previous design [12], which is based on a well-scheduled pipeline and uses the 2D algebraic-integer (AI) encoding method. The DCT accelerator in [12] is the only published work that implements the Loeffler DCT for similar I/O limitations and achieves the same throughput. The design of the accelerator in [12] has been adjusted and evaluated under the same evaluation conditions considered in this letter. The evaluation results are summarized in Table II. All the adders in [12] are implemented as RCAs. Even when each CSA is counted as a CPA, the proposed architecture uses significantly fewer adders than previous works on Loeffler DCT [12] and on other integer cosine transforms with the same throughput (24 adders) [14]. However, the hardware cost is not reduced as much as the number of adders because more control circuitry is needed to exploit the reduced number of adders.
Conclusion
This letter proposes a hardware accelerator that calculates 1D eight-point Loeffler DCT using significantly fewer adders compared with previous works with comparable performance.
In terms of hardware usage, the proposed accelerator can be a good alternative for small portable applications. However, the number of adders is reduced at the cost of control overhead. Hence, the reduction in hardware usage is not as marked as the reduction in the number of adders. A more carefully chosen AI encoding [15] could also be considered to improve the pipeline of the accelerator in [12].
