The discrete cosine transform (DCT) and inverse DCT (IDCT) are widely used in image processing systems and in real-time computation of nonlinear time series. In this paper, a novel linear-array DCT/IDCT architecture is derived from the data flow of subband decompositions, which represent the factorized coefficient matrices in the matrix formulation of the recursive algorithm. To increase the throughput and decrease the hardware cost, the input and output data are reordered. The proposed 8-point DCT/IDCT processor, which uses four multipliers, simple adders, and fewer registers and less ROM to store the intermediate results and coefficients, respectively, has been implemented on a field-programmable gate array (FPGA) and as a system on chip (SoC). The linear-array DCT/IDCT processor, with computation complexity $O(5N/8)$ and hardware complexity $O(5N/8)$, is fully pipelined and scalable for variable-length DCT/IDCT computations.
Introduction
With the rapid growth of modern communication applications and computer technologies, image compression and real-time computation of nonlinear time series continue to be in great demand. The discrete cosine transform (DCT) is one of the major operations in various image/video compression standards [1] and nonlinear time series applications [2-8]. Although the fast Fourier transform (FFT) can be used to implement the DCT, it requires complex-valued computations; moreover, an N-point DCT computed by FFT contains $O(\log_2 N + 1)$ stages. The conventional DCT architectures using distributed arithmetic involve complex hardware with a great number of registers [9-19]. Other commonly used DCT architectures with matrix formulation and distributed memory [20-27] are, however, not suited for VLSI implementation because the hardware complexity is proportional to the length of the DCT, which leads to a scalability problem for variable-length DCT computations. In this paper, we propose a novel linear-array architecture for scalable DCT/IDCT implementation.
Mathematical Problems in Engineering
The remainder of this paper proceeds as follows. In Section 2, we propose the fast DCT/IDCT computation based on the subband decomposition algorithm. In Section 3, reconfigurable FPGA-based and programmable SoC implementations with low hardware cost are proposed for the fast DCT/IDCT computation. The performance comparison and conclusions are given in Section 4.
Proposed Fast DCT/IDCT Computation
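For an N-point signal $x(n)$, the discrete cosine transform (DCT) [28] is defined as in (2.1), and the subband signals $x_L(n)$ and $x_H(n)$ are defined as in (2.2), where $n = 0, 1, 2, \ldots, N/2 - 1$. The original signal $x(n)$ can be obtained from $x_L(n)$ and $x_H(n)$ as follows:

$$x(2n) = x_L(n) + x_H(n), \qquad x(2n+1) = x_L(n) - x_H(n). \tag{2.3}$$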
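The reconstruction in (2.3) fixes the forward split: solving the two relations for the subband signals gives $x_L(n) = (x(2n) + x(2n+1))/2$ and $x_H(n) = (x(2n) - x(2n+1))/2$. A minimal Python sketch of this pair follows; the function names `subband_split` and `subband_merge` and the sample data are illustrative assumptions, not taken from the paper.

```python
def subband_split(x):
    """Split an even-length sequence into the subband signals x_L and x_H.

    Inverting (2.3) gives
        x_L(n) = (x(2n) + x(2n+1)) / 2
        x_H(n) = (x(2n) - x(2n+1)) / 2
    """
    if len(x) % 2 != 0:
        raise ValueError("input length must be even")
    half = len(x) // 2
    x_low = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(half)]
    x_high = [(x[2 * n] - x[2 * n + 1]) / 2 for n in range(half)]
    return x_low, x_high


def subband_merge(x_low, x_high):
    """Rebuild x from x_L and x_H using (2.3)."""
    x = []
    for low, high in zip(x_low, x_high):
        x.append(low + high)  # x(2n)   = x_L(n) + x_H(n)
        x.append(low - high)  # x(2n+1) = x_L(n) - x_H(n)
    return x


# Hypothetical 8-point test signal.
x = [4.0, 2.0, 7.0, 1.0, 0.0, 6.0, 3.0, 5.0]
x_low, x_high = subband_split(x)
x_rebuilt = subband_merge(x_low, x_high)
```

Splitting and then merging returns the input unchanged, which is the property the subband decomposition relies on.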
As one can see, the DCT of $x(n)$ can be rewritten as (2.4), where $C_L(k)$ and $S_H(k)$ are the subband DCT and the subband DST (discrete sine transform) of $x(n)$, respectively.
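The general shape of this identity can be sketched directly from the even/odd split in (2.3). The derivation below assumes an unnormalized DCT-II kernel and unnormalized subband transforms, so the scale factors are an assumption and may differ from the definition used in [28]; the shorthand $\theta_n$ and $\phi$ is introduced here only for readability.

```latex
\begin{align*}
C(k) &= \sum_{n=0}^{N-1} x(n)\cos\frac{(2n+1)k\pi}{2N} \\
     &= \sum_{n=0}^{N/2-1}\Bigl[x(2n)\cos(\theta_n-\phi) + x(2n+1)\cos(\theta_n+\phi)\Bigr],
        \qquad \theta_n = \frac{(2n+1)k\pi}{N},\quad \phi = \frac{k\pi}{2N} \\
     &= \sum_{n=0}^{N/2-1}\Bigl[\bigl(x_L(n)+x_H(n)\bigr)\cos(\theta_n-\phi)
        + \bigl(x_L(n)-x_H(n)\bigr)\cos(\theta_n+\phi)\Bigr] \\
     &= 2\cos\phi\sum_{n=0}^{N/2-1} x_L(n)\cos\theta_n
        \;+\; 2\sin\phi\sum_{n=0}^{N/2-1} x_H(n)\sin\theta_n \\
     &= 2\cos\frac{k\pi}{2N}\,C_L(k) \;+\; 2\sin\frac{k\pi}{2N}\,S_H(k),
\end{align*}
\begin{align*}
C_L(k) = \sum_{n=0}^{N/2-1} x_L(n)\cos\frac{(2n+1)k\pi}{N}, \qquad
S_H(k) = \sum_{n=0}^{N/2-1} x_H(n)\sin\frac{(2n+1)k\pi}{N}.
\end{align*}
```

The fourth line uses $\cos(\theta-\phi)+\cos(\theta+\phi) = 2\cos\theta\cos\phi$ and $\cos(\theta-\phi)-\cos(\theta+\phi) = 2\sin\theta\sin\phi$. Under these assumptions the low subband contributes through a half-length cosine transform and the high subband through a half-length sine transform.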
Fast DCT Computation Based on Subband Decomposition Algorithm
Without loss of generality, the 8-point fast DCT based on the subband decomposition algorithm is presented here for the widely used JPEG and MPEG-1/2 standards; it can be easily extended to variable-length DCT computations. The vector form of the 8-point DCT can be written as (2.5), where $T_{\text{SB-DCT},8}$ and $T_{\text{SB-DST},8}$ denote the 8 × 4 matrices of the subband DCT and the subband DST, respectively, which form orthonormal bases for two orthogonal subspaces of $\mathbb{R}^8$. Notice that, due to the orthogonality between $T_{\text{SB-DCT},8}$ and $T_{\text{SB-DST},8}$, $x_L(n)$ and $x_H(n)$ can be obtained from $C(k)$ as in (2.6),
where $n = 0, 1, 2, \ldots, N/2 - 1$ and $N = 8$. The proposed fast DCT algorithm is a multistage algorithm based on subband decomposition. Specifically, let the subband signals be decomposed further as in (2.7), where $n = 0, 1$, and again as in (2.8), where $n = 0$; signals such as $x_{LHL,1}$ result from these stages. Based on the subband decompositions in (2.2), (2.7), and (2.8), the data flow for computing the 2-point subband DCT $C_{LL,2}$ and subband DST $C_{LH,2}$ of the 8-point DCT is shown in Figure 1. The data flow for computing $C_{HL,2}$ and $C_{HH,2}$ is obtained in a similar way and is therefore not shown in Figure 1. All of the 2-point subband DCTs and DSTs are given by (2.9).
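The orthogonality between the subband DCT and subband DST matrices noted earlier can be checked numerically. The sketch below assumes an orthonormal DCT-II matrix and builds the two 8 × 4 matrices by substituting (2.3) into $C = T x$; under this assumed scaling the columns are mutually orthogonal with squared norm 2, so dividing by $\sqrt{2}$ would make them orthonormal. All function names are illustrative, and the paper's own normalization may differ.

```python
import math


def dct_matrix(size):
    """Orthonormal DCT-II matrix T[k][n]; this scaling is an assumption."""
    matrix = []
    for k in range(size):
        scale = math.sqrt((1.0 if k == 0 else 2.0) / size)
        matrix.append([scale * math.cos((2 * n + 1) * k * math.pi / (2 * size))
                       for n in range(size)])
    return matrix


def subband_matrices(size):
    """Return the size x size/2 subband DCT and DST matrices implied by (2.3)."""
    t = dct_matrix(size)
    half = size // 2
    # Substitute x(2n) = x_L(n) + x_H(n) and x(2n+1) = x_L(n) - x_H(n) into C = T x.
    t_dct = [[t[k][2 * n] + t[k][2 * n + 1] for n in range(half)] for k in range(size)]
    t_dst = [[t[k][2 * n] - t[k][2 * n + 1] for n in range(half)] for k in range(size)]
    return t_dct, t_dst


def gram(a, b):
    """Compute a^T b for matrices stored as lists of rows."""
    return [[sum(a[k][i] * b[k][j] for k in range(len(a))) for j in range(len(b[0]))]
            for i in range(len(a[0]))]


t_dct, t_dst = subband_matrices(8)
cross = gram(t_dct, t_dst)  # zero matrix: the two column spaces are orthogonal
same = gram(t_dct, t_dct)   # 2 * identity under the assumed scaling

# Recover x_L from the DCT coefficients alone: x_L = T_dct^T C / 2.
x = [4.0, 2.0, 7.0, 1.0, 0.0, 6.0, 3.0, 5.0]
coeffs = [sum(row[n] * x[n] for n in range(8)) for row in dct_matrix(8)]
x_low = [sum(t_dct[k][n] * coeffs[k] for k in range(8)) / 2 for n in range(4)]
```

Because the cross-Gram matrix vanishes, projecting the coefficient vector onto the subband DCT columns isolates $x_L(n)$ with no contribution from $x_H(n)$.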
Thus, we have (2.10) and (2.11), where $x_8 = [x(0), x(1), \ldots, x(7)]^T$ is the original signal. Similarly, we have (2.12).
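The multistage decomposition described above, in which each subband signal is split again until 2-point and then 1-point signals remain, can be sketched as follows. The band labels built by appending `L` or `H` at each stage, the dictionary layout, and the function names are illustrative assumptions; the split itself is the averaging form obtained by inverting (2.3).

```python
def subband_split(x):
    """One decomposition stage: x_L(n) and x_H(n) from inverting (2.3)."""
    half = len(x) // 2
    x_low = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(half)]
    x_high = [(x[2 * n] - x[2 * n + 1]) / 2 for n in range(half)]
    return x_low, x_high


def multistage_split(x, levels):
    """Apply the subband split `levels` times to every band.

    Returns a dict mapping a label such as "LH" or "LHL" to its signal.
    The length of x must be divisible by 2 ** levels.
    """
    if len(x) % (2 ** levels) != 0:
        raise ValueError("length must be divisible by 2 ** levels")
    bands = {"": list(x)}
    for _ in range(levels):
        next_bands = {}
        for label, signal in bands.items():
            low, high = subband_split(signal)
            next_bands[label + "L"] = low
            next_bands[label + "H"] = high
        bands = next_bands
    return bands


# Hypothetical 8-point signal: two stages give four 2-point bands,
# three stages give eight 1-point bands.
x = [4.0, 2.0, 7.0, 1.0, 0.0, 6.0, 3.0, 5.0]
two_stage = multistage_split(x, 2)
three_stage = multistage_split(x, 3)
```

Every stage uses only additions, subtractions, and halvings, which is consistent with the text's use of adders alone for the decomposition part of the data flow.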
Figure 2 depicts the relationship between $C_{LL,4}$ and $C_{LL,2}$, which is given by (2.13), where $T_2$ is the 2 × 2 transform matrix of the conventional 2-point DCT. Hence, (2.13) can be rewritten as
The relationship between $S_{LH,4}$ and $C_{LH,2}$ shown in Figure 3 is based on the following:
Similarly, based on (2.5) and (2.18), where $T_4$ is the 4 × 4 transform matrix of the conventional 4-point DCT, we have the result shown in Figure 6. The data flow of computing $C_8$ using the 8-point subband DCT and DST is shown in Figure 7. In other words, $C_8$ can be obtained by (2.19).
Based on (2.12), (2.15), (2.17), (2.19), and (2.20), we have a set of relations whose terms are defined through (2.27). According to (2.24)-(2.27), we have (2.28).
Finally, the proposed 8-point DCT computation based on subband decomposition is given by (2.29) and (2.30). Figure 8 shows the block diagram of the proposed DCT computation; one of its advantages is that $R_8$ is orthogonal and all of the submatrices of $F_8$ are orthonormal.
Fast IDCT Computation Based on Subband Decomposition Algorithm
According to (2.29), the IDCT can be obtained by (2.32). As $R_8$ is orthogonal and all of the submatrices of $F_8$ are orthonormal, the inverses of $R_8$ and $F_8$ can be obtained easily. In addition, only twenty multiplication operations are needed for each of the DCT and the IDCT.
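The way orthogonality makes the inverse cheap can be illustrated with stand-in matrices. In the sketch below, `R` is a Sylvester-type Hadamard matrix (entries of ±1, so it needs additions only) and `F` is defined so that `F R` equals an orthonormal DCT-II matrix. Both are assumptions chosen for illustration and are not the paper's $R_8$ and $F_8$, whose entries are not reproduced here.

```python
import math

N = 8


def dct_matrix(size):
    """Orthonormal DCT-II matrix; this scaling is an assumption."""
    return [[math.sqrt((1.0 if k == 0 else 2.0) / size)
             * math.cos((2 * n + 1) * k * math.pi / (2 * size))
             for n in range(size)] for k in range(size)]


def hadamard(size):
    """Sylvester-type Hadamard matrix with entries +1/-1 (size a power of two)."""
    h = [[1]]
    while len(h) < size:
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    return h


def matmul(a, b):
    """Plain matrix product for matrices stored as lists of rows."""
    return [[sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
            for i in range(len(a))]


def transpose(a):
    return [list(col) for col in zip(*a)]


def column(values):
    return [[v] for v in values]


T = dct_matrix(N)
R = hadamard(N)                                              # stand-in: R R^T = N I
F = [[v / N for v in row] for row in matmul(T, transpose(R))]  # stand-in: F R = T

x = [4.0, 2.0, 7.0, 1.0, 0.0, 6.0, 3.0, 5.0]  # hypothetical input
y = matmul(R, column(x))   # stage 1: additions and subtractions only
C = matmul(F, y)           # stage 2: multiplications, gives the DCT coefficients

# Inverse uses transposes only: F^{-1} = N F^T and R^{-1} = R^T / N.
y_back = matmul([[N * v for v in row] for row in transpose(F)], C)
x_back = [row[0] / N for row in matmul(transpose(R), y_back)]
```

No matrix inversion routine is needed at any point: each inverse is a transpose with a scalar factor, which is the practical benefit of the orthogonality property stated in the text.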
VLSI Implementation of an Efficient Linear-Array DCT/IDCT Processor
Based on the proposed approach to fast DCT computation shown in Figure 8 , an efficient architecture for implementing the fast DCT/IDCT processor is thus presented in this section.
Recall that the DCT of a signal $x_8$ is computed in two steps, $y_8 = R_8 \cdot x_8$ followed by $C_8 = F_8 \cdot y_8$. Figure 9 shows the matrix-vector multiplication $R_8 \cdot x_8$, in which six CSA(3,2)s (carry-save adders (3,2)) and one CSA (carry-save adder) [29, 30] are utilized; therefore, four simple-addition times and one CSA computation time are required to compute each element of $y_8$. Figures 10 and 11 show the multiplier array (MA), consisting of four multipliers, and the CSA array (CA), consisting of eight CSAs, respectively, which are used to compute the matrix-vector product $F_8 \cdot y_8$; thus, only one multiplication time and one CSA computation time are needed to compute each element of $C_8$, that is, each DCT coefficient.

Figure 12 shows the so-called full CSA(4,2) (FCSA(4,2)), consisting of two CSA(3,2)s and one CSA, for the computation of $z_8$ [29, 30]. The CSA array of eight CSAs shown in Figure 11 can also be used for the computation of $x_8$. As shown in Table 4, only five multiplication cycles and three addition cycles are needed to compute the 8-point IDCT. The computation time and hardware complexity of the proposed fast IDCT architecture are therefore the same as those of the proposed fast DCT architecture. In addition, only 16 words of RAM/registers and 10 words of ROM are required to store the intermediate results and constants, respectively, and the latency is only five multiplication cycles.

Figure 13 shows the system block diagram of the proposed fast DCT/IDCT architecture. A platform for architecture development and verification has been designed and implemented in order to evaluate the development cost. Figure 14 depicts the block diagram of the platform, in which the 8051 microcontroller reads data from the PC via a DMA channel and writes the result back to the PC over a USB 2.0 bus; the Xilinx XC2V6000 FPGA chip implements the proposed DCT processor [32]. The architecture development and verification board shown in Figure 15 is used to verify and evaluate the proposed DCT/IDCT architecture.
Moreover, the reusable intellectual property (IP) DCT/IDCT core has also been implemented in Matlab for functional simulations. The hardware code, written in Verilog, runs on a workstation with the ModelSim simulation tool and the Xilinx ISE smart compiler. In addition, the FPGA platform shown in Figure 14 is used to verify and evaluate the proposed DCT architecture. The throughput can be improved by using the proposed architecture, while the computation accuracy is the same as that of the conventional one with the same word length.

The SoC is synthesized with the TSMC 0.18 μm 1P6M CMOS cell libraries [33]. The physical circuit is synthesized by the Astro tool and evaluated by DRC, LVS, and PVS [34]. Figure 16 shows the cell-based design flow. The layout view of the 8-point DCT/IDCT processor with 32-bit operands is shown in Figure 17. The core areas are obtained with the Synopsys Design Analyzer, and the power consumption is obtained with PrimePower. The reported core size of the implemented processor is 1520 × 1520 μm², and the power dissipation is 102.2 mW at 1.8 V with a clock rate of 1 GHz. Thus, the proposed programmable DCT/IDCT architecture is able to improve the power consumption and computation speed significantly. All the control signals are generated on-chip. The proposed DCT/IDCT processor provides both high throughput and a low gate count.
The proposed reconfigurable DCT/IDCT processor, used to compute 8/16/32/64-point DCT/IDCT on FPGA, is composed mainly of the 8-point DCT/IDCT core; the computation complexity using a single 8-point DCT/IDCT core is $O(5N/8)$ when extended to N-point DCT/IDCT computation. Note that the transform matrices used for the proposed linear array with the 8-point DCT core can be extended to a variety of different sizes. Thus, the proposed architecture is highly scalable.
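The figures quoted in the text fit together arithmetically: twenty multiplications shared among four multipliers take five multiplication cycles per 8-point transform, and scaling that by $N/8$ gives the stated $5N/8$. The sketch below only reproduces this arithmetic; the constant and function names are assumptions, and how the larger transforms are actually scheduled on the single core is not modeled.

```python
import math

MULTIPLICATIONS_PER_8_POINT = 20  # stated for both the DCT and the IDCT
MULTIPLIERS = 4                   # size of the multiplier array
CORE_SIZE = 8                     # points handled by the DCT/IDCT core


def cycles_per_core_block():
    """Multiplication cycles for one 8-point transform on the core."""
    return math.ceil(MULTIPLICATIONS_PER_8_POINT / MULTIPLIERS)


def multiplication_cycles(n_points):
    """Multiplication cycles for an N-point transform, assuming 5N/8 growth."""
    if n_points % CORE_SIZE != 0:
        raise ValueError("N must be a multiple of the core size")
    return cycles_per_core_block() * n_points // CORE_SIZE


# Cycle counts for the transform lengths mentioned in the text.
cycle_table = {n: multiplication_cycles(n) for n in (8, 16, 32, 64)}
```

The 8-point entry matches the five-multiplication-cycle latency reported for the processor.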
The linear-array architecture makes efficient use of hardware resources to trade off performance, chip area, and power consumption. As a result, it balances the need for power saving against computation speed.
Conclusion
By taking advantage of subband decomposition, a high-efficiency architecture with pipelined structures is proposed for fast DCT/IDCT computation. Specifically, the proposed DCT/IDCT architecture not only improves throughput by more than a factor of two over the conventional architectures [9-11, 15-19], but also saves memory space significantly [1, 9-22]. Table 1 compares the proposed architecture with the conventional architectures [1, 9-14] (with dual memory banks) and [15-19]. Table 2 shows comparisons with other commonly used architectures [1, 12-14, 20-24]. For the 8 × 8 DCT, the algorithm proposed by Feig requires 54 multiplications and 462 additions [27]; the proposed method requires 25 multiplications and 100 additions. Thus, in terms of operation count, this work outperforms the Feig algorithm. In addition, the proposed fast DCT/IDCT architecture is highly regular, scalable, and flexible. The DCT/IDCT processor, designed in portable and reusable Verilog, is a reusable IP that can be implemented in various processes; combined with efficient use of hardware resources to trade off performance, area, and power consumption, it is well suited to JPEG and MPEG-1/2 applications.
