Abstract: A coordinate rotation digital computer (CORDIC) based variable length reconfigurable DCT/IDCT algorithm and its corresponding architecture are proposed. The proposed algorithm extends easily to the 2^n-point DCT/IDCT. Furthermore, an N-point DCT/IDCT can be constructed from two N/2-point DCTs/IDCTs based on the proposed algorithm. The architecture based on the proposed algorithm supports several power-of-two transform sizes. To speed up the computation of the DCT/IDCT without losing accuracy, we develop a modified unfolded CORDIC with efficient carry save adders (CSAs). The rotation angles of the CORDICs used in the proposed algorithm form an arithmetic sequence. For convenience, we derive the architecture of the N-point IDCT from the orthogonal property of the DCT and IDCT transforms. The proposed architectures are modeled in MATLAB and evaluated in a DCT-based JPEG process. The experimental results show that the peak signal-to-noise ratio (PSNR) values of the proposed architectures are higher than those of existing CORDIC-based architectures across different quantization factors and different test images. Furthermore, the proposed architectures have higher regularity, modularity, and computation accuracy, and are suitable for VLSI implementation.
INTRODUCTION
The discrete cosine transform (DCT) and its inverse (IDCT) [1] are the most widely used transforms in image and signal processing because of their near-optimal performance in compressing highly correlated data. Commonly used DCT algorithms aim at speeding up the computation [2] [3] [4] or reducing the complexity of the architecture and computation [5, 6]. Loeffler-based architectures [7] [8] [9] use a flow-graph algorithm to reduce the computational complexity and make the computation more efficient. However, existing algorithms often lack scalability and are hard to extend to the variable length DCT/IDCT. With the development of signal processing systems, variable length reconfigurable architectures for the DCT/IDCT are highly desirable for low-power applications and for multi-standard, multi-mode environments. Some reconfigurable DCT/IDCT architectures [10] [11] [12] have been proposed in the literature. However, all of them have drawbacks, such as switching between different reconfigurable modules [10], using different pre-processing and post-processing stages [12] to realize different processors, or relying on a greedy algorithm [11]. Furthermore, they do not scale to power-of-two transform sizes.
CORDIC-based architectures are suitable for VLSI implementation because of their regularity and simple hardware architecture. However, because of the recursive nature of CORDIC, it is difficult to pipeline [12]. The unfolding technique can overcome this problem [13] [14] [15], but it introduces new problems, such as numerical inaccuracy and poor scalability for variable length DCT computations.

*Address correspondence to this author at the Microelectronics Center, Harbin Institute of Technology, 150000, Harbin, P. R. China; Tel: 13258676498; Fax: 86397100; E-mail: husthh@yahoo.com.cn
In this paper, we propose a computationally efficient, variable length reconfigurable DCT/IDCT algorithm and the corresponding architecture. Based on the proposed algorithm, we can easily obtain the 2^n-point DCT/IDCT and construct the N-point DCT/IDCT from two N/2-point DCTs/IDCTs. The architecture of the one-dimensional (1-D) 8-point DCT is developed, and the architecture of the 1-D 8-point IDCT is derived from it using the orthogonal property of the DCT and IDCT transforms. To enhance the computation accuracy of the unfolded CORDIC and improve its performance, a modified unfolded CORDIC is proposed. The modified unfolded CORDIC has better accuracy than conventional ones, and it is computationally efficient because its structure allows the use of carry save adders (CSAs). Furthermore, based on the row-column decomposition algorithm, the 2-D 8×8 DCT/IDCT architecture with the modified unfolded CORDIC is implemented and verified in a DCT-based JPEG process model. The experimental results show that the proposed architectures achieve better transformation quality, in terms of PSNR, than existing CORDIC-based DCT architectures.
The paper is organized as follows. In Section 2, we derive the variable length reconfigurable DCT/IDCT algorithm based on the CORDIC. In Section 3, the signal flows of the proposed algorithm are depicted in detail. In Section 4, the architecture design of the 4/8-point DCT/IDCT based on the modified unfolded CORDIC is presented. The experimental results and the conclusion are given in Sections 5 and 6, respectively.
PROPOSED VARIABLE LENGTH RECONFIGURABLE DCT/IDCT ALGORITHM
For an N-point signal x[n], the type-II DCT and IDCT are defined as [6]:
where
The type-II discrete sine transform (DST) is defined as:
According to (1) and (3), neglecting the post-scaling factor without loss of generality, the main operations of an N-point DCT and DST, denoted as T C D and T S D, can be written as:
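As a minimal sketch of the kernels named above, the following Python code assumes the standard type-II forms: the unscaled DCT kernel sums x[n]·cos((2n+1)kπ/(2N)), the DST kernel uses the sine instead, and the orthonormal DCT applies a post-scaling factor sqrt(2/N)·c[k] with c[0] = 1/sqrt(2). The function names and the DST index range k = 1..N are assumptions made for illustration and may differ from the normalization used in (1) to (5).

```python
import math

def dct_kernel(x):
    """Unscaled type-II DCT kernel: sum_n x[n] * cos((2n+1) k pi / (2N))."""
    N = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N)) for k in range(N)]

def dst_kernel(x):
    """Unscaled type-II DST kernel for k = 1..N (assumed index range)."""
    N = len(x)
    return [sum(x[n] * math.sin((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N)) for k in range(1, N + 1)]

def dct2(x):
    """Orthonormal DCT-II: the kernel post-scaled by sqrt(2/N) * c[k]."""
    N = len(x)
    scale = math.sqrt(2.0 / N)
    return [scale * (1 / math.sqrt(2) if k == 0 else 1.0) * v
            for k, v in enumerate(dct_kernel(x))]

def idct2(X):
    """Inverse of dct2 (same scaling, sum taken over k instead of n)."""
    N = len(X)
    scale = math.sqrt(2.0 / N)
    return [scale * sum((1 / math.sqrt(2) if k == 0 else 1.0) * X[k]
                        * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                        for k in range(N)) for n in range(N)]

x = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
roundtrip = idct2(dct2(x))  # should reproduce x up to rounding error
```

Separating the kernel from the post-scaling mirrors the text: the decomposition that follows operates on the unscaled kernel only.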
A length-N input sequence x[n], with N being a power of two, can be decomposed into x_L[n] and x_H[n], which denote the low-frequency and high-frequency sub-band signals of x[n], respectively [16], and are defined as:
where n = 0, 1, 2, ..., (N/2) - 1.
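The sub-band split can be sketched as follows, assuming the common averaging form x_L[n] = (x[2n] + x[2n+1])/2 and x_H[n] = (x[2n] - x[2n+1])/2. The factor 1/2 and the function names are assumptions for illustration, so the scaling may differ from (6) and (7).

```python
def subband_split(x):
    """Split x into low- and high-frequency sub-bands (assumed averaging form)."""
    if len(x) % 2:
        raise ValueError("length must be even")
    half = len(x) // 2
    x_low = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(half)]
    x_high = [(x[2 * n] - x[2 * n + 1]) / 2 for n in range(half)]
    return x_low, x_high

def subband_merge(x_low, x_high):
    """Inverse split: x[2n] = x_L[n] + x_H[n], x[2n+1] = x_L[n] - x_H[n]."""
    x = []
    for lo, hi in zip(x_low, x_high):
        x.extend([lo + hi, lo - hi])
    return x

x_low, x_high = subband_split([1, 2, 3, 4])
restored = subband_merge(x_low, x_high)
```

The merge step expresses the even and odd samples in terms of the two sub-bands, which is the kind of substitution needed to rewrite the N-point kernel.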
Substituting (8) and (9) into (4), (4) can be rewritten as:
We get
Let
According to (12) and (13), (10) can be rewritten as:
Letting l = N - k, we get
In (15), all the points except k = 0 and k = N/2 can be evaluated by the (N/2)-point T C D. Therefore, we separate (15) into four parts:
where
As one can see, we can evaluate the N-point T C D with two (N/2)-point T C D computations based on the CORDIC algorithm. Therefore, the proposed algorithm is a variable length reconfigurable algorithm for the DCT and is easily extended to the 2^n-point DCT. In addition, the rotation angles of the CORDICs form an arithmetic sequence with a common difference of π/(2N). Similarly, the variable length reconfigurable algorithm for the N-point IDCT can be deduced. Alternatively, the orthogonal property of the DCT and IDCT transforms can be used to obtain the reconfigurable algorithm for the N-point IDCT more easily, as described in detail in Section 3.
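The recursion described in this section can be checked numerically with the sketch below. It assumes the unscaled type-II kernel and the averaging form of the sub-band split, and it uses exact sines and cosines where the architecture would use CORDIC cells. The function names and the test vector are hypothetical. The high sub-band is passed through an alternating sign change so that its sine transform can be read from a DCT kernel in reversed index order.

```python
import math

def dct_kernel_direct(x):
    """Reference: sum_n x[n] * cos((2n+1) k pi / (2N)) for k = 0..N-1."""
    N = len(x)
    return [sum(x[n] * math.cos((2 * n + 1) * k * math.pi / (2 * N))
                for n in range(N)) for k in range(N)]

def dct_kernel_recursive(x):
    """N-point kernel (N a power of two) built from two N/2-point kernels."""
    N = len(x)
    if N == 1:
        return [x[0]]
    M = N // 2
    # Sub-band decomposition (assumed averaging form).
    x_low = [(x[2 * n] + x[2 * n + 1]) / 2 for n in range(M)]
    x_high = [(x[2 * n] - x[2 * n + 1]) / 2 for n in range(M)]
    # Odd sign change: the DST of x_high equals the index-reversed DCT
    # kernel of (-1)^n * x_high[n], so only DCT kernels are needed.
    A = dct_kernel_recursive(x_low)
    B = dct_kernel_recursive([(-1) ** n * v for n, v in enumerate(x_high)])
    X = [0.0] * N
    X[0] = 2 * A[0]
    X[M] = 2 * math.sin(math.pi / 4) * B[0]
    for k in range(1, M):
        theta = k * math.pi / (2 * N)   # arithmetic sequence, step pi/(2N)
        a, b = A[k], B[M - k]
        c, s = math.cos(theta), math.sin(theta)
        X[k] = 2 * (c * a + s * b)      # one plane rotation of (a, b) by -theta
        X[N - k] = 2 * (-s * a + c * b) # yields the outputs k and N - k together
    return X

x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, 6.0]
max_diff = max(abs(p - q) for p, q in
               zip(dct_kernel_recursive(x), dct_kernel_direct(x)))
```

Each pair of outputs (k, N - k) comes from a single rotation, which is why N/2 - 1 rotation cells suffice at each level of this sketch.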
SIGNAL FLOW OF THE PROPOSED DCT/IDCT ALGORITHM
In this section, we describe in detail the variable length reconfigurable DCT/IDCT signal flows based on the proposed algorithm.
According to (4), the 2-point T C D computation is as follows:
Thus, the signal flow graph for computing the 2-point T C D is depicted in Fig. (1).
According to (16), the signal flow graph for computing the 4-point T C D can be constructed from two 2-point T C D blocks, as depicted in Fig. (2).
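For N = 2, the unscaled type-II kernel reduces to one addition and one scaled subtraction. The sketch below assumes that kernel form; the function name is hypothetical and the placement of the cos(π/4) factor in Fig. (1) may differ.

```python
import math

def dct_kernel_2pt(x0, x1):
    """2-point unscaled DCT kernel: a sum and a difference scaled by cos(pi/4)."""
    return x0 + x1, (x0 - x1) * math.cos(math.pi / 4)

y0, y1 = dct_kernel_2pt(3.0, 1.0)
```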
According to (6) and (7), the sub-band decomposition matrix is shown as:
According to (13), the odd sign change matrix is shown as:

Similarly, the signal flow graph for computing the 8-point T C D can be constructed from two 4-point DCTs, as depicted in Fig. (3).
In Fig. (3), the 4-point T C D blocks are bordered by dashed lines, and the sub-band decomposition matrix is shown as:
The odd sign change matrix of 8-point DCT is shown as:
As described above, according to (1) and (16), the signal flow graph for computing the N-point DCT with two N/2-point T C D blocks can be generalized, as depicted in Fig. (4).
In Fig. (4), the N-point T C D blocks are bordered by dashed lines, and the post-scaling factor of N-point DCT
According to (16), the CORDIC array contains (N/2 - 1) CORDIC cells with the rota-
The sub-band decomposition matrix of the N-point DCT is shown as:

The odd sign change matrix of the N-point DCT is shown as:

The N-point DCT can be obtained by multiplying the N-point T C D by the post-scaling factor and permuting the outputs according to (16). Furthermore, we can combine the constant

According to (1) and (2), the DCT and IDCT are orthogonal transforms, so the signal flow of the N-point IDCT can be easily obtained by inverting the transfer function of each building block shown in Table 1 and reversing the signal flow direction [17, 18]. Fig. (5) depicts the signal flow of the N-point IDCT.
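The orthogonality that the IDCT derivation relies on can be checked numerically. The sketch below assumes the orthonormal scaling sqrt(2/N)·c[k] for the type-II DCT matrix and verifies that the matrix times its transpose is the identity, so the inverse transform is the transpose. The helper names are hypothetical.

```python
import math

def dct_matrix(N):
    """Orthonormal type-II DCT matrix (assumed scaling sqrt(2/N) * c[k])."""
    return [[math.sqrt(2.0 / N) * (1 / math.sqrt(2) if k == 0 else 1.0)
             * math.cos((2 * n + 1) * k * math.pi / (2 * N))
             for n in range(N)] for k in range(N)]

def transpose(m):
    return [list(row) for row in zip(*m)]

def matmul(a, b):
    """Plain matrix product of two lists of rows."""
    bt = transpose(b)
    return [[sum(p * q for p, q in zip(row, col)) for col in bt] for row in a]

C = dct_matrix(8)
product = matmul(C, transpose(C))  # should be the 8x8 identity
```

Because the whole matrix is orthogonal, a factorization into orthogonal building blocks can be inverted block by block in reverse order, which matches the reversed signal flow described above.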
ARCHITECTURE DESIGN FOR THE RECONFIGURABLE 4/8-POINT DCT/IDCT BASED ON MODIFIED UNFOLDED CORDIC
Without loss of generality, based on the proposed variable length reconfigurable DCT/IDCT algorithm, we develop an efficient 4/8-point DCT/IDCT architecture in this section. A modified unfolded CORDIC architecture is proposed to improve the performance and reduce the hardware complexity.
In the CORDIC algorithm, to rotate a vector (x, y) by an angle θ, the circular rotation angle is decomposed as:
Then, the vector rotation can be performed iteratively as follows [10, 13, 15]:
Furthermore, the results of the rotation iterations need to be scaled by a compensation factor s.
Alternatively, (26) can be replaced with the following iterative method:
In (25) and (27), only shift and add operations are required to perform the operation.
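A minimal floating-point sketch of the conventional circular-rotation CORDIC is given below. It assumes the textbook recurrence with elementary angles atan(2^-i) and divides by the accumulated gain at the end; the function name and the default iteration count are hypothetical. In hardware, each multiplication by 2^-i is a right shift, and the gain compensation is a fixed constant.

```python
import math

def cordic_rotate(x, y, theta, iterations=20):
    """Rotate (x, y) by theta (|theta| below about 1.74 rad) with micro-rotations."""
    z = theta        # residual angle still to be rotated
    gain = 1.0       # accumulated magnitude growth of the micro-rotations
    for i in range(iterations):
        sigma = 1.0 if z >= 0.0 else -1.0   # rotation direction
        shift = 2.0 ** -i                   # a right shift by i bits in hardware
        x, y = x - sigma * y * shift, y + sigma * x * shift
        z -= sigma * math.atan(shift)
        gain *= math.sqrt(1.0 + shift * shift)
    return x / gain, y / gain               # scale compensation

xr, yr = cordic_rotate(1.0, 0.0, -math.pi / 16)
```

Flipping the sign of every direction value reverses the rotation, which is the property later used to turn the DCT rotators into IDCT rotators.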
Since the rotation angles of the CORDICs are fixed in the proposed DCT/IDCT algorithm, we can skip unnecessary CORDIC iterations. According to (16), the 8-point DCT requires rotations by three different fixed angles. The iterations used and the corresponding compensation factors are given in Table 2.
According to Table 2, the unfolded CORDIC flow graph for the -π/16 rotation is shown in Fig. (6).
In Fig. (6), six shift and six add operations are needed to evaluate the -π/16 rotation. To evaluate the rotation more efficiently without losing accuracy, when the iteration indices i and j are large enough, two iterations can be combined into one according to the following equations:
Similarly, (27) can be approximated as follows:
According to (28) and (29), we obtain the modified unfolded CORDIC, which needs fewer shift and add operations without losing accuracy under certain conditions. The modified unfolded CORDIC flow graph for the -π/16 rotation is shown in Fig. (7). To gain higher accuracy, the shift amounts are optimized by numerical simulation in MATLAB, and a subtle change (7 to 6) is made in the third shift stage. As Fig. (7) shows, three shifts and three additions are needed to evaluate the -π/16 rotation. Furthermore, we can implement the CORDIC architecture with the more efficient CSAs, which makes the computation faster.
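One plausible reading of combining two iterations is sketched below, as an assumption rather than a reproduction of (28). The product of two micro-rotation matrices contains a cross term of size 2^-(i+j); when i and j are large this term is negligible, and the two steps collapse into a single step whose off-diagonal entry is the sum of the two shifted terms. The indices 7 and 9 and the function names are illustrative only.

```python
def micro_rotation(i, sigma):
    """Unscaled CORDIC micro-rotation matrix for iteration index i."""
    t = sigma * 2.0 ** -i
    return [[1.0, -t], [t, 1.0]]

def matmul2(a, b):
    """Product of two 2x2 matrices given as lists of rows."""
    return [[a[r][0] * b[0][c] + a[r][1] * b[1][c] for c in range(2)]
            for r in range(2)]

def merged_rotation(i, j, sigma_i, sigma_j):
    """Single-step approximation: the 2^-(i+j) cross term is dropped."""
    t = sigma_i * 2.0 ** -i + sigma_j * 2.0 ** -j
    return [[1.0, -t], [t, 1.0]]

exact = matmul2(micro_rotation(7, -1), micro_rotation(9, -1))
approx = merged_rotation(7, 9, -1, -1)
max_error = max(abs(exact[r][c] - approx[r][c])
                for r in range(2) for c in range(2))
```

In this sketch the only discrepancy is the dropped diagonal term, 2^-16 for the indices chosen, which is below the resolution of a 12-bit word.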
To evaluate the accuracy of the modified unfolded CORDIC, we take two types of unfolded CORDIC as references. One is the conventional unfolded CORDIC, which meets the precision requirements of IEEE Std. 1180-1990 [15]; the other is an approximate unfolded CORDIC [9]. The word length of both the inputs and the outputs is 12 bits, and the test data are 1000 uniformly distributed random values. The relative computation errors of the three CORDICs are evaluated, and the results for the two CORDIC outputs (x, y) are compared in Fig. (8).
In Fig. (8), it can be seen that the modified unfolded CORDIC has a much smaller relative error than the approximate unfolded CORDIC [9] and nearly the same relative error as the conventional unfolded CORDIC. In addition, most of the relative error values of the modified unfolded CORDIC are below 0.4%.
Since the iteration numbers for the -π/8 and -3π/16 rotations do not meet the accuracy requirements if (28) and (29) are used directly, the double angle formula is used instead. According to the double angle formula, a rotation by 2θ can be replaced by two sequential rotations by θ, and a rotation by 3θ by three sequential rotations by θ. Moreover, the computation accuracy of the -π/8 and -3π/16 architectures meets the requirements of most signal processing applications, such as JPEG, which will be verified in Section 5.
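The cascading idea can be illustrated with exact rotations standing in for the -π/16 CORDIC cell. The helper names and the test vector below are hypothetical.

```python
import math

def rotate(x, y, theta):
    """Exact plane rotation of (x, y) by theta."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x - s * y, s * x + c * y

def rotate_repeated(x, y, theta, times):
    """Apply the same rotation several times in sequence."""
    for _ in range(times):
        x, y = rotate(x, y, theta)
    return x, y

base = -math.pi / 16
twice = rotate_repeated(1.0, 0.5, base, 2)    # stands in for the -pi/8 rotation
thrice = rotate_repeated(1.0, 0.5, base, 3)   # stands in for the -3pi/16 rotation
direct_2 = rotate(1.0, 0.5, 2 * base)
direct_3 = rotate(1.0, 0.5, 3 * base)
```

With an approximate cell, the errors of the individual stages accumulate across the cascade, which is why the text defers the accuracy check to the JPEG experiment.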
We take a reconfigurable 4/8-point DCT architecture as an example to implement the proposed algorithm. According to Fig. (3), we separate the signal flow into two parts: the upper 4-point DCT and the lower 4-point DCT. Since the two 4-point DCTs are independent, we can reuse the 4-point DCT processor and implement the reconfigurable 4/8-point DCT architecture in a serial manner, as depicted in Fig. (9). Furthermore, we can implement the three rotations with only three identical CORDIC processors, so the proposed architecture has high reusability and modularity. Compared with existing reconfigurable architectures, our architecture does not need any structural change to switch from a 4-point to an 8-point DCT processor, and it is suitable for VLSI implementation.
In Fig. (9), a demultiplexer is used to switch dynamically from the 4-point to the 8-point DCT without changing the architecture. The FIFO stores the intermediate results generated by the 4-point DCT, which is implemented with two -π/16 CORDIC processors. The architecture of the pre-processing element is shown in Fig. (10). In Fig. (10), we use two's complement to implement the subtract operation and a multiplexer (MUX) to choose between the add and subtract operations.
We use the transposing property depicted in Table 1 to implement the 4/8-point reconfigurable IDCT architecture. To change a CORDIC from clockwise rotation to anticlockwise rotation by the same angle, the only thing required is to change the sign of each rotation direction coefficient or, equivalently, to swap the adders and subtractors in the rotation iteration stage.
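A bit-level sketch of an add/subtract element of this kind is shown below, assuming 12-bit words as in the accuracy experiments. Subtraction is performed as a + (~b) + 1, with a select signal choosing between b and its bitwise inverse and also driving the carry-in. The names and the exact structure are illustrative and not taken from Fig. (10).

```python
WIDTH = 12
MASK = (1 << WIDTH) - 1

def to_word(value):
    """Encode a signed integer as a WIDTH-bit two's-complement word."""
    return value & MASK

def to_signed(word):
    """Decode a WIDTH-bit two's-complement word to a signed integer."""
    return word - (1 << WIDTH) if word & (1 << (WIDTH - 1)) else word

def add_sub(a, b, subtract):
    """Adder with a MUX on the second operand: a + b, or a + ~b + 1."""
    operand = (~b & MASK) if subtract else b   # MUX: b or its bitwise inverse
    carry_in = 1 if subtract else 0            # completes the two's complement
    return (a + operand + carry_in) & MASK

diff = to_signed(add_sub(to_word(100), to_word(250), True))
total = to_signed(add_sub(to_word(100), to_word(250), False))
```

Sharing one adder between the two operations is what lets the same pre-processing element produce both the sums and the differences of the sub-band decomposition.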
EXPERIMENTAL RESULTS
To verify the performance of the proposed architecture, we design and evaluate three different architectures, Loeffler [9], Jeong [15], and the proposed architecture, with 12-bit accuracy in MATLAB. After modeling these architectures, we use the DCT-based JPEG process model to verify their performance. Fig. (11) depicts the block diagram of the DCT-based JPEG process model. The 8×8 2-D DCT/IDCT architecture is implemented using two cascaded 1-D DCT/IDCT architectures with one transpose memory [16]. In this paper, we use the peak signal-to-noise ratio (PSNR) to measure the reconstructed image quality. The PSNR values are measured by sending image data through the JPEG process, including the DCT architecture and the IDCT architecture.
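For reference, PSNR for 8-bit images is commonly computed as 10·log10(255² / MSE). The sketch below assumes that definition and images stored as equal-sized 2-D lists; the function name and the peak value are assumptions, since the text does not state them.

```python
import math

def psnr(original, reconstructed, peak=255.0):
    """PSNR in dB between two equal-sized 2-D lists of pixel values."""
    total, count = 0.0, 0
    for row_o, row_r in zip(original, reconstructed):
        for a, b in zip(row_o, row_r):
            total += (a - b) ** 2
            count += 1
    mse = total / count
    if mse == 0:
        return math.inf          # identical images
    return 10.0 * math.log10(peak ** 2 / mse)

value = psnr([[10, 20], [30, 40]], [[11, 19], [31, 39]])  # MSE = 1
```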
The quantization factor q is used to trade off image quality against compression ratio (CR). Huffman coding is used for entropy coding, and the quantization matrix in Fig. (11), as specified in the original JPEG standard, is as follows:

Table 3.
From Table 3, one may observe that the PSNR values of the proposed DCT architecture are better than those of the other DCT architectures at most values of q. In addition, the CR values of the proposed architecture are better than those of the Loeffler [9] and matrix multiplication architectures at all values of q, and nearly the same as those of the Jeong architecture [15]. This implies that the proposed architecture not only makes the computation faster, because it uses the modified unfolded CORDIC architecture, but also improves the quality of the results in terms of PSNR.
Experiments have also been carried out with four well-known grey-scale test images, 'Lena', 'Baboon', 'Peppers', and 'Board', of size 256 × 256, with quantization factor q = 1. The original images and the reconstructed images, with their PSNR values, are shown in Fig. (12).
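The role of q can be sketched as follows, under two assumptions that the text does not confirm: the matrix meant is the example luminance table from Annex K of the JPEG standard, and q scales that table multiplicatively. The function names and the sample coefficients are hypothetical.

```python
import math

# Example luminance quantization table from Annex K of the JPEG standard
# (assumed to be the matrix referred to in the text).
JPEG_LUMA_Q = [
    [16, 11, 10, 16, 24, 40, 51, 61],
    [12, 12, 14, 19, 26, 58, 60, 55],
    [14, 13, 16, 24, 40, 57, 69, 56],
    [14, 17, 22, 29, 51, 87, 80, 62],
    [18, 22, 37, 56, 68, 109, 103, 77],
    [24, 35, 55, 64, 81, 104, 113, 92],
    [49, 64, 78, 87, 103, 121, 120, 101],
    [72, 92, 95, 98, 112, 100, 103, 99],
]

def quantize(coeffs, q=1.0):
    """Divide each DCT coefficient by q * table entry and round to nearest."""
    return [[int(math.floor(coeffs[r][c] / (q * JPEG_LUMA_Q[r][c]) + 0.5))
             for c in range(8)] for r in range(8)]

def dequantize(levels, q=1.0):
    """Rescale quantized levels back to coefficient values."""
    return [[levels[r][c] * q * JPEG_LUMA_Q[r][c] for c in range(8)]
            for r in range(8)]

coeffs = [[0.0] * 8 for _ in range(8)]
coeffs[0][0] = 160.0
coeffs[0][1] = 17.0
levels_q1 = quantize(coeffs, q=1.0)
levels_q2 = quantize(coeffs, q=2.0)
restored_q2 = dequantize(levels_q2, q=2.0)
```

Under these assumptions, a larger q gives coarser steps, so more coefficients round to small values (higher CR) at the cost of a larger reconstruction error (lower PSNR).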
CONCLUSION
In this paper, we derive a variable length reconfigurable DCT/IDCT algorithm and develop the reconfigurable 4/8-point DCT architecture based on the modified unfolded CORDIC. Specifically, the rotation angles used in the proposed architecture form an arithmetic sequence. Consequently, we can use the double angle formula to implement the larger angle rotations with the smaller angle rotation, which gives the architecture higher modularity. In addition, it has higher computation accuracy than the existing architectures [9, 15], and higher regularity and modularity than the existing architectures [10] [11] [12]. Future work includes a VLSI implementation of the proposed architecture using systolic arrays and applications of the proposed architecture to data-dependent compression.
