The discrete cosine transform (DCT) and inverse DCT (IDCT) have been widely used in many image processing systems and in the real-time computation of nonlinear time series. In this paper, a unified DCT/IDCT algorithm based on the subband decomposition of a signal is proposed. It is derived recursively from the data flow of subband decompositions with factorized coefficient matrices. The proposed algorithm requires only $(4^{\log_2 n - 1} - 1)$ and $(4^{\log_2 n - 1} - 1)/3$ multiplication times for the n-point DCT and IDCT with a single multiplier and a single processor, respectively. Moreover, the peak signal-to-noise ratio (PSNR) of the proposed algorithm is higher than that of the conventional DCT/IDCT. As a result, the subband-based approach to DCT/IDCT is preferable to the conventional approach in terms of computational complexity and system performance. The proposed reconfigurable architecture of the linear-array DCT/IDCT processor has been implemented on an FPGA.
Introduction
The discrete cosine transform (DCT), first proposed by Ahmed et al. [1], is a Fourier-like transform. While the Fourier transform decomposes a signal into sine and cosine functions, the DCT uses only cosine functions and has the property of high energy compaction. As the DCT offers a good trade-off between the optimal decorrelation of the Karhunen-Loève transform and computational simplicity [2], it has been extensively used in many applications [3-14]. In particular, the two-dimensional (2D) DCT, such as the 8 × 8 DCT, has been adopted in international standards such as JPEG, MPEG, and H.264 [15]. In the MP3 audio codec, the subband analysis and synthesis filter banks require the use of a 32-point DCT/integer DCT to expedite computation [16]. Other audio compression standards, for example, the Dolby Digital AC-3 codec, utilize a modified DCT with 256 or 512 data points.
Many algorithms have been proposed for DCT/IDCT [17-21], in which the transformation matrix is factorized into products of simpler matrices. However, the factorized matrices are not as regular as those of the fast Fourier transform (FFT); thus, these algorithms achieve only moderate computational speed. Specifically, the dedicated data paths deduced from the signal flow graphs (SFGs) of these algorithms need to be optimized for performance, which is computationally intensive, and a custom-designed DCT is often complicated and cannot easily be scaled to a variable number of data points.
In this paper, we propose a novel linear-array architecture based on the subband decomposition of a signal for scalable DCT/IDCT. The remainder of this paper proceeds as follows. First, the subband-based 8-point DCT/IDCT algorithm [22] is reviewed in Section 2. Its extension to the n-point DCT/IDCT, called the unified subband-based algorithm, is proposed in Section 3. Section 4 presents the analysis of system complexity. The reconfigurable architecture of the linear-array DCT/IDCT processor, implemented on a field-programmable gate array (FPGA), is proposed in Section 5, and the conclusion is given in Section 6.
The Subband-Based 8-Point DCT/IDCT Algorithm
The discrete cosine transform (DCT) of an 8-point signal, $x_8$, is defined as
Let $x_L[n]$ and $x_H[n]$ denote the low-frequency and high-frequency subband signals of $x[n]$, respectively [22], which can be obtained by
where $n = 0, 1, 2, 3$. As one can see, the DCT of $x_8$ can be rewritten as
where $c_L[k]$ and $s_H[k]$ are the subband DCT and DST (discrete sine transform) of $x[n]$, respectively. Its vector form is as follows:
where $T_{\mathrm{SB\,DCT},8}$ and $T_{\mathrm{SB\,DST},8}$ denote the 8 × 4 matrices of the subband DCT and DST, respectively.
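The subband relation described above can be checked numerically. The following is a minimal sketch, assuming the common half-sum and half-difference definition of the subband signals and unnormalized cosine and sine sums; these scaling choices, and all function names, are assumptions for illustration and may differ from the normalization used in (2.1)-(2.4). Note that the subband transforms map 4 inputs to 8 outputs, which matches the 8 × 4 shape of the subband DCT and DST matrices.

```python
import math

def dct_unnormalized(x, num_outputs=None):
    """Unnormalized cosine sum: C[k] = sum_n x[n] * cos((2n+1) k pi / (2N))."""
    N = len(x)
    K = N if num_outputs is None else num_outputs
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
            for k in range(K)]

def dst_unnormalized(x, num_outputs):
    """Matching sine sum: S[k] = sum_n x[n] * sin((2n+1) k pi / (2N))."""
    N = len(x)
    return [sum(x[n] * math.sin((2 * n + 1) * k * math.pi / (2 * N)) for n in range(N))
            for k in range(num_outputs)]

def subband_dct(x):
    """DCT of x assembled from its low- and high-frequency subband signals."""
    N = len(x)
    # Assumed subband definition: half-sum and half-difference of sample pairs.
    x_low = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(N // 2)]
    x_high = [(x[2 * n] - x[2 * n + 1]) / 2 for n in range(N // 2)]
    c_low = dct_unnormalized(x_low, N)    # N outputs from N/2 inputs
    s_high = dst_unnormalized(x_high, N)  # N outputs from N/2 inputs
    return [2 * math.cos(k * math.pi / (2 * N)) * c_low[k]
            + 2 * math.sin(k * math.pi / (2 * N)) * s_high[k]
            for k in range(N)]

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
direct = dct_unnormalized(x)
via_subbands = subband_dct(x)
max_error = max(abs(a - b) for a, b in zip(direct, via_subbands))
```

Under these assumptions the two results agree to floating-point precision, since each pair of cosine terms collapses into one cosine-weighted and one sine-weighted term through the sum-to-product identities.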
Mathematical Problems in Engineering
According to (2.4), the 8-point matrix $M_8$ can be written as
(2.7)
Due to the orthogonality between $T_{\mathrm{SB\,DCT},8}$
We have
(2.10)
Similarly, the 4-point DCT computations of $c_{L,4}$ and $c_{H,4}$ can be obtained by
where
$[x_{HH}[0] \;\; x_{HH}[1]]^T$, and
for $n = 0, 1$. The 4-point distributed matrix $M_4$ can be defined as
(2.17)
Let $c_{LL,2}$, $c_{LH,2}$, $c_{HL,2}$, and $c_{HH,2}$ be the 2-point DCTs of $x_{LL}$, $x_{LH}$, $x_{HL}$, and $x_{HH}$, respectively, which can be computed by using the following 2-point transformation matrix, $D_2$:
(2.37)
The following matrix, $M_8$, can be derived from (2.12)-(2.15).
(2.38)
Similarly, the following matrix, $M_8$, can be derived from (2.21)-(2.28)
The decomposition matrix $R_8$ can be defined as
and the coefficient matrix $F_8$ can be defined as
(2.41)
According to (2.34), (2.38), and (2.39), we have
(2.42)
The coefficient matrix $F_8$ can be represented by the reordered coefficient matrix $F_8$, the pre-permutation matrix $T_8$, and the post-permutation matrix $T_8$, and can be written as
where the matrices $F_8$, $T_8$, and $T_8$ can be defined as
(2.44)
The reordered coefficient matrix $F_8$ can be represented as
The computation of the sub-coefficient matrix $A$ can be written as
where $a = 0.9239$ and $b = 0.3827$. The above can be rewritten as [20]
Thus, the number of multiplications can be reduced to 3 for matrix $A$; this technique can also be applied to matrices $B$, $C$, $D$, and $E$. As each of these five sub-coefficient matrices then needs 3 multiplications, the total number of multiplications of the subband-based 8-point DCT is only 15. Based on (2.40) and (2.41), the corresponding subband-based IDCT can be obtained by
where $R_8^{-1}$ is the inverse decomposition matrix. As the decomposition matrix $R_8$ is orthonormal, $R_8^{-1}$ can be derived from the transpose of $R_8$
(2.50)
(2.53)
The inverse reordered coefficient matrix $F_8^{-1}$ can be represented as
(2.54)
As a result, the total number of multiplications of the subband-based 8-point IDCT is also only 15.
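The three-multiplication evaluation of a 2 × 2 sub-coefficient matrix can be sketched as follows. This is an illustration only: it assumes that $A$ has the plane-rotation form [[a, b], [-b, a]], which is not confirmed by the text above, and it uses the observation that $a = 0.9239$ and $b = 0.3827$ agree with $\cos(\pi/8)$ and $\sin(\pi/8)$ to four decimals. The helper `multiplication_time` evaluates the single-multiplier expression quoted in the abstract; the function names are hypothetical.

```python
import math

A_COS = math.cos(math.pi / 8)   # a, approximately 0.9239
B_SIN = math.sin(math.pi / 8)   # b, approximately 0.3827

def rotate_4mul(x0, x1, a=A_COS, b=B_SIN):
    """Direct product [[a, b], [-b, a]] @ [x0, x1]: four multiplications."""
    return a * x0 + b * x1, -b * x0 + a * x1

def rotate_3mul(x0, x1, a=A_COS, b=B_SIN):
    """Same product with three multiplications.

    (a - b) and (a + b) are constants, so they would be precomputed
    and do not count as run-time multiplications.
    """
    t = a * (x0 + x1)        # multiplication 1
    y0 = t - (a - b) * x1    # multiplication 2: a*x0 + b*x1
    y1 = t - (a + b) * x0    # multiplication 3: -b*x0 + a*x1
    return y0, y1

def multiplication_time(n):
    """Abstract's single-multiplier expression 4**(log2(n) - 1) - 1, n a power of two."""
    return 4 ** (n.bit_length() - 2) - 1

reference = rotate_4mul(1.5, -2.0)
reduced = rotate_3mul(1.5, -2.0)
```

For $n = 8$ the expression gives 15, which matches five sub-coefficient matrices at 3 multiplications each.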

The Unified Subband-Based n-Point DCT/IDCT Algorithm
The subband-based DCT algorithm [22] can be unified for the n-point DCT/IDCT owing to its inherently regular pattern. For an n-point signal, $x_n$, the unified subband-based discrete cosine transform can be defined as
where $n \in \{2^m \mid m = 3, 4, 5, \ldots\}$, $R_n$ is the decomposition matrix, $F_n$ is the reordered coefficient matrix, $T_n$ is the pre-permutation matrix, and $T_n$ is the post-permutation matrix. The unified decomposition matrix $R_n$ can be written as
According to (3.1) and (3.2), the unified distributed matrix $M_n$ can be derived as
According to (2.2) and (2.40), we have
The 4-point reordered coefficient matrix $F_4$ can be derived as
According to (3.5), the 8-point reordered coefficient matrix $F_8$ can be derived as
Hence, the unified reordered coefficient matrix can be written as
The 4-point pre-permutation matrix $T_4$ and post-permutation matrix $T_4$ can be written as
(3.10)
According to (2.50), (2.51), (3.8), and (3.9), the 8-point pre- and post-permutation matrices can be written as
(3.11)
Hence, the unified pre- and post-permutation matrices can be represented as
(3.12)
where $A$ denotes $n/2 \times n/8$. According to (2.47) and (2.53), the unified subband-based IDCT can be obtained.
Figure 1: Data flow of the subband-based 8-point DCT (stages: addition, pre-shuffle, multiplication, addition, post-shuffle, shift operation).
where the decomposition matrix, the reordered coefficient matrix, and the pre- and post-permutation matrices are all orthonormal. Hence, we have
(3.14)
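The role of orthonormality in obtaining the inverse transform can be illustrated with a short sketch. It does not use the factorized matrices of this section; instead it assumes the standard textbook orthonormal DCT-II matrix as a stand-in, and shows that the inverse of an orthonormal matrix is its transpose, so the IDCT needs no separate matrix inversion. All names are hypothetical.

```python
import math

def dct_matrix(n):
    """Standard orthonormal n-point DCT-II matrix (stand-in, not the factorized form)."""
    rows = []
    for k in range(n):
        scale = math.sqrt(1.0 / n) if k == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def transpose(m):
    """Return the transpose of a list-of-lists matrix."""
    return [list(col) for col in zip(*m)]

def matvec(m, v):
    """Matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

C = dct_matrix(8)
x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
c = matvec(C, x)                     # forward DCT
x_rec = matvec(transpose(C), c)      # inverse DCT via the transpose
round_trip_error = max(abs(a - b) for a, b in zip(x, x_rec))
```

The same argument applies factor by factor: when every factor of a product is orthonormal, the inverse is the product of the transposes taken in reverse order.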
Analysis of Computation Complexity and System Performance
Based on the 8-point subband-based DCT and IDCT algorithms, the data flows of parallel-pipelined processing for the 8-point DCT and IDCT are described as follows. The data flow of the subband-based 8-point DCT with six pipelined stages is shown in Figure 1. In the first stage, the matrix-vector multiplication $y_8 = R_8 \cdot x_8$ takes one simple-addition time for each element of $y_8$. In the second stage, the preshuffle performs the pre-permutation matrix $T_8$ operation. The matrix-vector multiplication $F_8 \cdot y_8$ is computed in the third and fourth stages. In the fifth stage, the postshuffle performs the post-permutation matrix $T_8$ operation. The final stage computes $\sqrt{2}/4 \cdot z_8$ by using a simple shift operation with the Booth recoded algorithm.
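The final scaling stage can be sketched in software as a shift-and-add multiplication by a Booth-recoded constant. This is a minimal sketch under stated assumptions: radix-2 Booth recoding, a 16-bit fractional fixed-point format (`FRAC_BITS`), and truncation of the result are all choices made here for illustration and are not taken from the hardware design.

```python
import math

FRAC_BITS = 16  # assumed fixed-point precision, not taken from the paper

def booth_digits(constant, width):
    """Radix-2 Booth recoding: digit i is b[i-1] - b[i], each in {-1, 0, 1}.

    `constant` must be non-negative and fit in width - 1 bits, so that the
    top bit is zero and the signed digits sum back to the constant.
    """
    digits, previous_bit = [], 0
    for i in range(width):
        bit = (constant >> i) & 1
        digits.append(previous_bit - bit)
        previous_bit = bit
    return digits

def shift_add_multiply(value, digits):
    """Multiply an integer by the recoded constant using shifts and adds only."""
    acc = 0
    for i, d in enumerate(digits):
        if d == 1:
            acc += value << i
        elif d == -1:
            acc -= value << i
    return acc

SCALE = round(math.sqrt(2) / 4 * (1 << FRAC_BITS))   # sqrt(2)/4 in fixed point
DIGITS = booth_digits(SCALE, FRAC_BITS + 1)

def scale_by_sqrt2_over_4(z):
    """Return floor(z * SCALE / 2**FRAC_BITS) for an integer z."""
    return shift_add_multiply(z, DIGITS) >> FRAC_BITS
```

Because the constant is fixed, the nonzero digits identify which shifted copies of the input are added or subtracted, which is why the operation can be hardwired rather than passed through a general multiplier.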
Similarly, Figure 2 shows the data flow of the subband-based 8-point IDCT with seven pipelined stages.
Recall that the DCT of a signal, $x_n$, can be represented as $c_n = (\sqrt{n}/n) \cdot T_n \cdot F_n \cdot T_n \cdot R_n \cdot x_n$, where the left and right $T_n$ are the post- and pre-permutation matrices, respectively. The multiplication time of the unified subband-based algorithm can be derived as
The PSNR curves of the Lena, Baboon, Barbara, Peppers, and Boat images obtained by using the conventional 8-point DCT and the proposed subband-based 8-point DCT at various word lengths are shown in Figure 5. Figures 6(a)-6(e) show the PSNR curves of the same images obtained by using the conventional DCT and the proposed subband-based DCT with 32-bit operands at various numbers of DCT points. As one can see, the subband-based DCT is preferable in terms of PSNR.
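For reference, the PSNR figure of merit used in these comparisons can be computed as in the following minimal sketch. It assumes 8-bit images (peak value 255) given as flat sequences of pixel values; the function name and the choice of peak value are assumptions for illustration, not details taken from the experiments.

```python
import math

def psnr(reference, test, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) != len(test):
        raise ValueError("inputs must have the same number of pixels")
    mse = sum((r - t) ** 2 for r, t in zip(reference, test)) / len(reference)
    if mse == 0:
        return math.inf  # identical images: no reconstruction error
    return 10.0 * math.log10(peak ** 2 / mse)

original = [52, 55, 61, 66, 70, 61, 64, 73]
reconstructed = [53, 54, 61, 67, 69, 61, 65, 72]   # each nonzero error is +/-1
value_db = psnr(original, reconstructed)
```

A higher value means the reconstruction after DCT and IDCT at a given word length is closer to the original image.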
FPGA Implementation of the Reconfigurable Linear-Array DCT/IDCT Processor
The reconfigurable architecture of the fast 8-, 16-, 32-, and 64-point DCT and IDCT processors based on the subband-based 8-point DCT is proposed in this section.
The Proposed 8-Point DCT/IDCT Processor
According to the data flow of the subband-based 8-point DCT with six pipelined stages (Figure 1), the architecture of the proposed 8-point DCT processor is shown in Figure 7. The adder array (AA), with three CSA(4,2)s performing the matrix-vector multiplication $R_8 \cdot x_8$, is shown in Figure 8. Figure 9 shows the multiplier array (MA), which performs the three types of operation needed to compute the sub-coefficient matrices of $F_8$. The control signals swap and inv determine the type of operation; the functions they select are shown in Table 1. Figure 10 shows the hardwired shifters used for performing $\sqrt{2}/4 \cdot z_8$ by the Booth recoded algorithm [23]. Figure 11 shows the proposed 8-point IDCT processor with seven pipelined stages, in which the fast adder arrays, shuffle, multiplier array, CLA, and hardwired shifters of the DCT architecture can also be used for performing the IDCT. The latch array for retiming the input data is shown in Figure 12.
The hardware complexity of the proposed subband-based IDCT architecture is the same as that of the proposed subband-based DCT architecture. Figure 13 shows the proposed integrated 8-point DCT/IDCT processor. 
The Proposed Reconfigurable DCT/IDCT Processor
According to the integrated 8-point DCT/IDCT processor (Figure 13), the proposed reconfigurable 8-, 16-, 32-, and 64-point DCT/IDCT processor is shown in Figure 14. The integrated adder array (IAA) for the fast computation of the 8-, 16-, 32-, and 64-point DCT/IDCT is shown in Figure 15. The modified hardwired shifter (MHS) for multiplication by $\sqrt{n}/n$, where $n = 8, 16, 32, 64$, using the Booth recoded algorithm is shown in Figure 16. The computation efficiency can be improved by increasing the number of multiplier arrays. The log plot of computation cycles versus the number of multiplier arrays is shown in Figure 17.
FPGA Implementation of the Reconfigurable 2D DCT/IDCT Processor
The N × N DCT is defined as [29]
where $\alpha(k) = 1/\sqrt{2}$ for $k = 0$, and $\alpha(k) = 1$ for $k > 0$. It can be rewritten as
Thus, the separable 2-D DCT can be obtained by using 1-D DCT as follows:
Similarly, the separable 2-D IDCT can be obtained by using 1-D IDCT as follows:
As a result, the architecture of the 2-D DCT/IDCT can be implemented by using two successive 1-D DCT/IDCT processors with only one transpose memory [29]. The proposed architecture of the 2-D DCT and IDCT is shown in Figure 18, in which the control signals provided by the finite state machine (FSM) controller are used to manage the data flow.
The microcontroller reads data and commands from the PC and writes the results back to the PC via USB 2.0; the Xilinx Spartan-3 FPGA implements the proposed 2-D DCT/IDCT processor. The hardware code, written in Verilog, is developed on a PC with the ModelSim simulation tool [31] and the Xilinx ISE smart compiler [32]. It is noted that the throughput can be improved by using the proposed architecture, while the computation accuracy is the same as that obtained by using the conventional one with the same word length. Thus, the proposed programmable DCT/IDCT architecture is able to improve the power consumption and computation speed significantly. The proposed processor for the 8-, 16-, 32-, and 64-point DCT/IDCT is an extension of the 8-point DCT/IDCT processor. Moreover, the reusable intellectual property (IP) DCT/IDCT core has also been implemented in Verilog for the hardware realization. All the control signals are generated internally on chip. The proposed DCT/IDCT processor provides both high throughput and low gate count.
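The row-column decomposition described at the start of this section can be sketched in software as follows. This is an illustration under stated assumptions: the 1-D transform is the standard orthonormal DCT-II with $\alpha(0) = 1/\sqrt{2}$, the `transpose` function stands in for the transpose memory, and all names are hypothetical. A direct double-sum evaluation is included only as a reference for comparison.

```python
import math

def dct_1d(x):
    """Orthonormal 1-D DCT-II with alpha(0) = 1/sqrt(2), alpha(k) = 1 otherwise."""
    n = len(x)
    out = []
    for k in range(n):
        alpha = 1.0 / math.sqrt(2.0) if k == 0 else 1.0
        acc = sum(x[j] * math.cos((2 * j + 1) * k * math.pi / (2 * n)) for j in range(n))
        out.append(math.sqrt(2.0 / n) * alpha * acc)
    return out

def transpose(block):
    """Software stand-in for the transpose memory."""
    return [list(col) for col in zip(*block)]

def dct_2d_separable(block):
    """First 1-D pass over rows, transpose, second 1-D pass, transpose back."""
    row_pass = [dct_1d(row) for row in block]
    col_pass = [dct_1d(row) for row in transpose(row_pass)]
    return transpose(col_pass)

def dct_2d_direct(block):
    """Direct double-sum N x N DCT, used only as a reference."""
    n = len(block)
    def alpha(k):
        return 1.0 / math.sqrt(2.0) if k == 0 else 1.0
    out = [[0.0] * n for _ in range(n)]
    for u in range(n):
        for v in range(n):
            acc = 0.0
            for i in range(n):
                for j in range(n):
                    acc += (block[i][j]
                            * math.cos((2 * i + 1) * u * math.pi / (2 * n))
                            * math.cos((2 * j + 1) * v * math.pi / (2 * n)))
            out[u][v] = (2.0 / n) * alpha(u) * alpha(v) * acc
    return out

block = [[float(4 * i + j) for j in range(4)] for i in range(4)]
separable = dct_2d_separable(block)
direct = dct_2d_direct(block)
```

The separable form replaces the four nested loops of the direct evaluation by 2N one-dimensional transforms, which is what allows two 1-D processors and a single transpose memory to realize the 2-D transform.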
Conclusion
With the advantages of the subband decomposition of a signal, a high-efficiency algorithm with pipelined stages has been proposed for fast DCT/IDCT computation. The proposed DCT/IDCT algorithm not only reduces computational complexity but also improves system performance. The PSNR and system complexity of the proposed algorithm are better than those of the previous algorithms [33-36]. Table 2 shows comparisons between the proposed algorithm and architecture and other commonly used algorithms and architectures [24-28]. Thus, the proposed subband-based DCT/IDCT algorithm is suitable for real-time signal processing applications. The proposed DCT/IDCT processor provides both high throughput and low gate count and has been applied to various images with satisfactory results.
