In this paper, a minimum adder-delay Discrete Cosine Transform (DCT) architecture is proposed using the Adaptive CORDIC (ACor) algorithm with fixed-rotation implementations. The proposed method comes in six versions that differ in the number of DCT points, i.e., 8-point (8p), 16-point (16p), and 32-point (32p), and in the number of ACor stages, i.e., 2-Stage (2S) and 3-Stage (3S). Altera Stratix IV and Stratix II FPGAs were used to build and verify the implementations. The 2S designs of the 8p, 16p, and 32p DCT achieved critical paths of four, five, and six adder-delays, respectively. In comparison with other recent works, the proposed method showed the best timing performance, good accuracy, and adequate resource cost.
Introduction
Almost 50 years after it was first proposed in 1974 by N. Ahmed et al. [1], the Discrete Cosine Transform (DCT) still plays a vital role in digital signal processing research. The DCT is an irreplaceable step of many compression algorithms such as JPEG [2], MPEG1 [3], H.264 [4], and the latest video coding standard, High-Efficiency Video Coding (HEVC) [5]. With the recent development of video coding, the requirements on a DCT implementation are increasing rapidly, especially in timing performance. Therefore, DCT architecture and implementation still attract many researchers. The primary goal of this paper is to present a DCT architecture that has a minimum delay while maintaining good output accuracy and adequate resource cost.
The performance of a DCT implementation mainly comes from its Signal-Flow-Graph (SFG). There are two classic SFGs, given by W.-H. Chen et al. in 1977 [6] and C. Loeffler et al. in 1989 [7]. The classic SFGs use real-valued multipliers, which leads to various difficulties in implementation. As a result, in 2001, J. Liang and T. D. Tran [8] proposed a multiplierless approach called binDCT to overcome the issue. The binDCT method has become a well-known architecture due to its high accuracy and low resource utilization. However, its main drawback is the large number of adders on its critical path, which leads to poor timing performance. Thus, many attempts have been made to shorten the adder-delay of the critical path of the DCT SFG. One of the most effective approaches is the approximate DCT, such as the approximation model by M. Parfieniuk et al. [10] and the Integer DCT (IntDCT) by P. K. Meher et al. [9].
Another popular way to implement the DCT is to use the COordinate Rotation DIgital Computer (CORDIC) algorithm. Many CORDIC-based DCT designs have been proposed lately [11] [12] [13]. A CORDIC-based DCT architecture can be seen as an approximate implementation. In CORDIC design, the state-of-the-art method is the Adaptive Angle Recoding CORDIC (AARC) method, which was first introduced by Y. H. Hu et al. in 1993 [14]. The key idea of the AARC algorithm is to perform only a few selected micro-rotations instead of all micro-rotations as in the conventional method. The latest development of AARC is the Adaptive CORDIC (ACor) algorithm, which was proposed by Hong-Thu Nguyen et al. [15] in 2015. An ACor system was also implemented and shown experimentally to have low latency, low resource usage, and high accuracy [15].
In this paper, the proposed DCT architecture is developed based on the ACor algorithm [15] and extends the previous work [16]. Six versions were implemented that differ in the number of DCT points, i.e., 8-point (8p), 16-point (16p), and 32-point (32p), and in the number of ACor stages, i.e., 2-Stage (2S) and 3-Stage (3S). The six proposed implementations were built on Altera Stratix IV and Stratix II Field-Programmable Gate Arrays (FPGAs). In comparison with recent works, the proposed SFGs had the best timing performance in each DCT-point category. For specific, for 8point, 16
The remainder of this paper is organized as follows. Section II briefly reviews the ACor method. Section III proposes the architectures of the 8/16/32-point DCT. Section IV presents the experimental results. Finally, Section V gives the conclusion and future work.
The algorithm
The ultimate goal of the ACor method [15] is to reduce the number of iterations by performing only several selected micro-rotations instead of all of the angles as in the conventional algorithm. By doing so, the latency and the resource cost can be reduced sharply. In terms of accuracy, the ACor architecture was shown to be equivalent to or even better than the conventional CORDIC [15]. The pseudo-code of the algorithm is given in Fig. 1. The ACor computation is an iterative process that selects an appropriate micro-rotation θ i in each step i so that the residual angle z i quickly converges to zero. Eq. 1 shows the calculation of the micro-rotation θ i. For decision-making in each step, the pseudo-code uses the thresholds C given by Eq. 2, which delimit the range around each micro-rotation constant θ i. If the residual angle z i is in the range (c i+1 ; c i ], then the rotation by the angle θ i is performed. In the ACor algorithm, the wanted values of (x, y) come out multiplied by a length-factor K. Therefore, there is a final multiplication in the procedure after all of the micro-rotations, as shown in the pseudo-code. The length-factor K is the product of the k i in Eq. 3. Another notable improvement of the ACor method is the angle normalization. The trigonometric circle is divided into eight segments as shown in Fig. 2. The ACor method computes only residual angles inside the zeroth segment; angles from the other segments are converted to the zeroth segment by simple trigonometric identities.
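The selection idea can be illustrated with a minimal Python sketch. It is not the pseudo-code of Fig. 1: it assumes the elementary angles are θ i = arctan(2^-i), it replaces the thresholds of Eq. 2 with a simple nearest-angle rule in the spirit of angle recoding [14], and it assumes the final length-factor is the product of cos θ i over the rotations actually performed. The names `recode_angle` and `rotate` are hypothetical, and the input angle is assumed to lie in the zeroth segment already.

```python
import math

N_ANGLES = 16
# Assumed elementary angles: theta_i = atan(2^-i), i = 0 .. N_ANGLES-1.
THETAS = [math.atan(2.0 ** -i) for i in range(N_ANGLES)]


def recode_angle(z, tol=THETAS[-1] / 2):
    """Greedily pick the elementary angle nearest to |z| until z is small.

    Returns the list of (direction, index) micro-rotations and the residual.
    This nearest-angle rule is a stand-in for the threshold test of Eq. 2.
    """
    steps = []
    while abs(z) > tol and len(steps) < N_ANGLES:
        i = min(range(N_ANGLES), key=lambda j: abs(abs(z) - THETAS[j]))
        d = 1 if z > 0 else -1
        steps.append((d, i))
        z -= d * THETAS[i]
    return steps, z


def rotate(x, y, z):
    """Rotate (x, y) by z using only the selected shift-and-add steps."""
    steps, _ = recode_angle(z)
    k = 1.0
    for d, i in steps:
        t = 2.0 ** -i                      # a right shift by i bits in hardware
        x, y = x - d * t * y, y + d * t * x
        k *= math.cos(THETAS[i])           # assumed per-step length-factor
    return x * k, y * k                    # single final scaling


steps, residual = recode_angle(math.pi / 8)
print(steps, residual)
print(rotate(1.0, 0.0, math.pi / 8))
```

Under these assumptions only a handful of micro-rotations are executed for a given angle, instead of one per elementary angle as in the conventional CORDIC iteration.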
Fixed-rotation ACor implementation
In a DCT SFG, a fixed-rotation ACor deployment is usually found where the input angle z is a known value determined by the SFG. As a result, several optimizations can be made for a fixed-rotation ACor implementation. First of all, when the input z is a constant, the set of z i becomes constant as well, which removes the z i calculation and fixes the number of iteration steps. Moreover, with a fixed set of micro-rotations, the length-factor K also becomes a constant. Take the 3π/8-ACor implementation as an example: a full design of the 3π/8-ACor contains six micro-rotations, {θ 1 − θ 4 − θ 7 − θ 10 + θ 11 + θ 12 }. However, without sacrificing too much accuracy, some later stages of the ACor module can be dropped to save resources and shorten the adder-delay. The proposed DCT designs in the following section are implemented with two options of ACor design, i.e., 2-Stage (2S) and 3-Stage (3S). Fig. 3 shows example designs of the 2S and 3S 3π/8-ACor implementations.
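The trade-off between stage count and accuracy can be checked numerically with a short Python sketch. It rests on two assumptions that the text does not state explicitly: the elementary angles are θ i = arctan(2^-i), and the 3π/8 rotation is folded into the zeroth segment as π/2 − 3π/8 = π/8 by the normalization of Section 2.1. Only the first three micro-rotations of the set quoted above are used; `fixed_rotation` and `angle_error` are hypothetical helper names.

```python
import math

TARGET = math.pi / 8                 # assumed folded angle of the 3*pi/8 module
STEPS_2S = [(+1, 1), (-1, 4)]        # theta_1 - theta_4
STEPS_3S = STEPS_2S + [(-1, 7)]      # theta_1 - theta_4 - theta_7


def angle_error(steps, target=TARGET):
    """Residual angle left after the given fixed micro-rotations."""
    return sum(d * math.atan(2.0 ** -i) for d, i in steps) - target


def fixed_rotation(x, y, steps):
    """Hard-wired shift-and-add rotation; returns the unscaled outputs
    and the constant length-factor that would be merged into the scaling."""
    k = 1.0
    for d, i in steps:
        t = 2.0 ** -i
        x, y = x - d * t * y, y + d * t * x
        k *= 1.0 / math.sqrt(1.0 + t * t)   # cos(atan(2^-i))
    return x, y, k


for name, steps in (("2S", STEPS_2S), ("3S", STEPS_3S)):
    x, y, k = fixed_rotation(1.0, 0.0, steps)
    print(name, "angle error:", angle_error(steps), "K:", k,
          "scaled output:", (x * k, y * k))
```

With these assumptions the 2S truncation leaves an angle error of roughly 8.5e-3 rad and the 3S truncation roughly 7e-4 rad, which gives a feel for how much precision each dropped stage costs.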
DCT formulas
IEICE Electronics Express, Vol.*, No.*, 1-12
The forward and inverse DCT formulas are described by Eq. 4 and Eq. 5, respectively. The forward DCT transforms the data from the spatial domain to the frequency domain, and the inverse DCT (IDCT) transforms the data from the frequency domain back to the spatial domain. Both the DCT and IDCT formulas require a scale factor of $\sqrt{2/N}\,C_k$, where $C_k$ is given by Eq. 6.
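As a reference point, the following Python sketch implements a direct forward and inverse DCT. It assumes that Eq. 4 to Eq. 6 are the standard orthonormal DCT-II pair, with $C_0 = 1/\sqrt{2}$ and $C_k = 1$ otherwise; the function names are illustrative and the O(N^2) loops are meant for checking hardware outputs, not for speed.

```python
import math


def c(k):
    """Assumed C_k of Eq. 6: 1/sqrt(2) for k = 0, 1 otherwise."""
    return 1.0 / math.sqrt(2.0) if k == 0 else 1.0


def dct(x):
    """Direct forward DCT with the sqrt(2/N) * C_k scale factor."""
    n = len(x)
    return [
        math.sqrt(2.0 / n) * c(k)
        * sum(x[m] * math.cos((2 * m + 1) * k * math.pi / (2 * n))
              for m in range(n))
        for k in range(n)
    ]


def idct(y):
    """Direct inverse DCT, the transpose of the forward transform."""
    n = len(y)
    return [
        math.sqrt(2.0 / n)
        * sum(c(k) * y[k] * math.cos((2 * m + 1) * k * math.pi / (2 * n))
              for k in range(n))
        for m in range(n)
    ]


samples = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
coeffs = dct(samples)
restored = idct(coeffs)
print(coeffs)
print(restored)
```

A round trip through `dct` and `idct` should reproduce the input up to floating-point error, which makes the pair usable as the "true DCT" when measuring the accuracy of an approximate design.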
ACor series
Based on the notation of the fixed-rotation ACor module in section 2.2, the notation $C^{2k+1}_{n}$ is used to express the series of ACor modules $\{C^{1}_{n}, C^{3}_{n}, \ldots, C^{2K+1}_{n}\}$. Several examples of ACor series are given in Fig. 5.
DCT signal-flow-graphs
There are scale factors at the end of the DCT graphs, or at the beginning of the IDCT graphs. Those factors are $\sqrt{2/N}\,C_k$ as shown in Eq. 4, Eq. 5, and Eq. 6. In image/video compression procedures, the scale factors are usually merged with the quantization matrices during the encoding/decoding processes. Hence, the factor multiplications are usually done by a codec in software rather than implemented in the DCT/IDCT hardware circuit.
In an ACor-based DCT architecture, the ACor modules also generate a length-factor K in addition to the former scale factors. Therefore, the final scale factors of an ACor-based DCT SFG are the product of $\sqrt{2/N}\,C_k$ and the length-factor K. For an ACor series, the length-factor is the mean of the K values of all its ACor submodules. The proposed SFGs of the 8p-DCT and 8p-IDCT are given in Fig. 6. The difference between the 2S and 3S designs is the number of ACor stages in each design. Thus, the 2S and 3S implementations of the 8p-DCT/IDCT share the same SFGs as shown in Fig. 6. However, they have different ACor submodules, which leads to different final scale factors. Similarly, the SFG of the 16p-DCT, shared by the 2S and 3S designs, is given in Fig. 7. In the 16p-DCT SFG, the 8p-DCT graph is used as a submodule.
With the same method, the proposed SFG of the 32p-DCT is shown in Fig. 8. In the graph, the 16p-DCT is reused as a submodule. Six further submodules, M1 to M6, are used in the 32p-DCT graph, and they are described in Fig. 9.
Evaluation criteria
Adder-delay
Adder-delay is the number of adders on the critical path of the SFG. Take the 2S-8p-DCT SFG in Fig. 6(a) as an example. The Y(0) and Y(4) paths have the same adder-delay of three because their signals have to propagate through three adders in the 8p-TA, 4p-TA, and 2p-TA modules, with one adder per TA module. The Y(2) and Y(6) paths, besides the two adders in the 8p-TA and 4p-TA modules, have to propagate through a π/8-ACor module as seen in Fig. 6(a). The π/8-ACor module costs two or three adder-delays depending on the number of stages, as shown in Fig. 3. Because the 2S-8p-DCT design uses the 2-Stage setting of ACor, its π/8-ACor module costs two adder-delays. As a result, the Y(2) and Y(6) paths of the 2S-8p-DCT SFG have four adder-delays in total. Similarly, the Y(1), Y(7), Y(5), and Y(3) paths of the 2S-8p-DCT SFG have four adder-delays in total: one for the 8p-TA module, two for the ACor series, and one for the sum after the ACor series. To conclude, the longest path of the 2S-8p-DCT SFG has four adder-delays. In other words, the delay of the 2S-8p-DCT SFG is approximately four times the propagation delay of one adder.
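The path accounting above can be written down as a small Python sketch. The path table is transcribed from the description of the 2S-8p-DCT SFG; treating an ACor module or series as costing exactly one adder-delay per stage is an assumption used to parameterize the function, and `adder_delay_8p` is a hypothetical name.

```python
def adder_delay_8p(acor_stages):
    """Adder-delay per output group of the 8p-DCT SFG and the critical path.

    Each TA module and the final sum contribute one adder; an ACor module
    or series is assumed to contribute one adder per stage.
    """
    ta = 1
    paths = {
        "Y(0), Y(4)": [ta, ta, ta],                       # 8p-TA, 4p-TA, 2p-TA
        "Y(2), Y(6)": [ta, ta, acor_stages],              # 8p-TA, 4p-TA, pi/8-ACor
        "Y(1), Y(3), Y(5), Y(7)": [ta, acor_stages, 1],   # 8p-TA, ACor series, sum
    }
    per_path = {name: sum(delays) for name, delays in paths.items()}
    return per_path, max(per_path.values())


per_path, critical = adder_delay_8p(acor_stages=2)
print(per_path)
print("critical path:", critical, "adder-delays")
```

For the 2-Stage setting this reproduces the three, four, and four adder-delays of the three path groups and a critical path of four.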
Mean-square-error (MSE)
The MSE computation is given by Eq. 7, where n is the number of samples, and a i and b i are the elements of the two vectors being compared, e.g., the hardware outputs and the true DCT outputs. A lower MSE means that the implementation is closer to the true DCT, so the MSE values should be minimized.
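A minimal Python sketch of this metric, assuming Eq. 7 is the usual mean of squared differences, is given below; `mse` is an illustrative name and the sample vectors are made up for the example.

```python
def mse(a, b):
    """Mean-square-error between two equally long sequences."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same length")
    return sum((u - v) ** 2 for u, v in zip(a, b)) / len(a)


# Illustrative values only: a reference vector and a slightly perturbed copy.
reference = [1.0, 2.0, 3.0]
measured = [1.0, 2.0, 5.0]
print(mse(reference, measured))
```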
Coding gain (Cg)
For a transform used in compression research, the Cg is among the most important factors. A higher Cg means that the transform can compact more energy into fewer coefficients. Therefore, an approach with a higher Cg is the better approach for compression applications. The formula of Cg is given by Eq. 8, where N is the number of DCT points and σ 2 i is the variance of the i-th coefficient. The input data should be a first-order Gaussian-Markov process with zero mean, unit variance, and correlation coefficient ρ = 0.95 (a good approximation for natural images).
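The following Python sketch evaluates the coding gain of an orthonormal transform under the input model just described. It assumes that Eq. 8 is the ratio of the arithmetic mean to the geometric mean of the coefficient variances, expressed in dB, which is only valid as written for orthonormal transforms; a scaled or approximate SFG would need its basis norms taken into account. The exact DCT matrix is used here purely as a test input, and the function names are hypothetical.

```python
import math


def dct_matrix(n):
    """Rows of the exact orthonormal n-point DCT."""
    return [
        [
            math.sqrt(2.0 / n) * (1.0 / math.sqrt(2.0) if k == 0 else 1.0)
            * math.cos((2 * m + 1) * k * math.pi / (2 * n))
            for m in range(n)
        ]
        for k in range(n)
    ]


def coding_gain_db(transform, rho=0.95):
    """Coding gain for a unit-variance first-order Markov input.

    The input covariance is R[i][j] = rho^|i-j|; the variance of the
    k-th coefficient is t_k R t_k^T for row t_k of the transform.
    """
    n = len(transform)
    variances = [
        sum(row[i] * rho ** abs(i - j) * row[j]
            for i in range(n) for j in range(n))
        for row in transform
    ]
    arithmetic_mean = sum(variances) / n
    log10_geometric_mean = sum(math.log10(v) for v in variances) / n
    return 10.0 * (math.log10(arithmetic_mean) - log10_geometric_mean)


cg_8 = coding_gain_db(dct_matrix(8))
print("8-point exact DCT coding gain (dB):", cg_8)
```

Under these assumptions the exact 8-point DCT comes out close to the roughly 8.83 dB commonly quoted for it, which serves as the ceiling against which approximate 8-point designs are judged.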
Results and comparison
Based on the proposed DCT architecture, six versions were implemented in this paper that differ in DCT points (8p, 16p, and 32p) and ACor stages (2S and 3S). The proposed DCT SFGs were evaluated in terms of the number of adders, adder-delay, MSE, and Cg. Furthermore, the six implementations were built and verified on Altera Stratix IV and Stratix II FPGAs with the chipsets EP4SGX530KH40C2 and EP2SGX90EF115C3, respectively. Table I gives the SFG results in comparison with other works, including our previous work in [16]. Table II and Table III give the FPGA results on Stratix IV and Stratix II, respectively. In Table III, a fair comparison was made between Altera Stratix II and Xilinx Virtex-4. In the tables, only combinational logic resources are reported, because the implementations use no other FPGA resources such as registers or multipliers. The ACor algorithm itself is a tool that transforms trigonometric multiplications into shift-and-add functions, as shown in Fig. 5. Thus, there are only adders in the design, which results in combinational resources only. The Stratix IV and Stratix II results were built using the Quartus Prime 17.1 and Quartus II 13.0 service pack (sp) 1 compiler tools, respectively. Quartus II 13.0 sp1 was used because it is the latest version of Quartus that supports the Stratix II chip series. The Stratix IV was chosen for this paper because it is a typical modern high-end Altera FPGA, and the Stratix II was chosen to allow a fair comparison with the Xilinx Virtex-4 FPGAs used in other papers. According to Altera's white paper [17], Stratix II and Virtex-4 can be fairly compared because they use the same 90nm process. The combinational logic usage of Altera FPGAs is counted in Adaptive LUTs (ALUTs), while that of Xilinx FPGAs is counted in LUTs. Altera's ALUT is a LUT that can be configured into various settings depending on the requirements of the implementation.
For Altera FPGAs, an ALUT represents what can be implemented by the combinational logic. The details of the ALUT implementation can be found in Altera's white paper on FPGA architecture [18]. Furthermore, according to the white paper [17], an Altera ALUT can be roughly compared to a Xilinx LUT. As shown in Table I to Table III
The 2S-8p-DCT is comparable with the binDCT-C5 model [8] in terms of the number of adders, MSE, and Cg. However, the binDCT-C5 approach has a three times longer critical path, 12 adder-delays, compared to the four adder-delays of the 2S-8p-DCT model. As for the binDCT-L4 model [8], the 2S-8p-DCT design achieves a similar MSE with a few more adders, while giving a slightly better Cg and a far better adder-delay. In comparison with the approximation models in [10], the scheme with the best Cg, Scheme 2, has a similar resource cost and Cg to the 2S-8p-DCT, but it has far worse timing performance in terms of both adder-delay and maximum operating frequency. Moreover, compared with the fastest scheme, Scheme 30, the 2S-8p-DCT still has a much better adder-delay, maximum operating frequency, and Cg, while requiring slightly more adders.
Comparing the 2S and 3S ACor designs, their Cg results are nearly the same, as can be seen in Table I. Therefore, the 2S approach is the best ACor option for an ACor-based DCT implementation, because it provides the best adder-delay and resource cost while producing outputs as accurate as those of modules with more ACor stages.
Conclusion and future work
A minimum adder-delay ACor-based DCT architecture has been proposed in this paper, and it was implemented in six versions corresponding to the DCT points and the ACor design. The 2S ACor designs proved to be the best ACor designs because they had the best timing performance with similar output accuracy and equivalent resource costs. The FPGAs used for the implementations are Stratix IV and Stratix II. The 2S models of the 8p-DCT, 16p-DCT, and 32p-DCT achieved four, five, and six adder-delays; 216.68 MHz, 153.87 MHz, and 116.13 MHz of F max on Stratix IV; 180.90 MHz, 119.42 MHz, and 98.80 MHz of F max on Stratix II; MSE results of 1.403e-4, 2.029e-2, and 7.663e-2; and Cg results of 8.8108 dB, 9.0984 dB, and 9.2170 dB, respectively.
For future work, the next step is to apply the proposed SFGs to an image/video application such as JPEG or the latest video coding standard, HEVC. After that, combining the proposed 1-D DCT SFGs with the Directional DCT (D-DCT) framework will make a strong transformation tool for image/video coding research. The D-DCT is a 2D-DCT framework that takes into account the direction of objects in the image. It exploits the directional aspect to maximize the coding energy and thereby improve both the compression ratio and the image quality. To conclude, the ultimate goal of the future work is a new ACor-based D-DCT architecture applied to JPEG and HEVC.
