Abstract-This paper proposes a computational method for 2D 8Â8 DCT based on algebraic integers. The proposed algorithm is based on the Loeffler 1D DCTalgorithm, and it is shown to operate with exact computation-i.e., error-free arithmetic-up to the final reconstruction step (FRS). The proposed algebraic integer architecture maintains error-free computations until an entire block of DCTcoefficients having size 8Â8 is computed, unlike algorithms in the literature which claim to be error-free but in fact introduce arithmetic errors between the column-and row-wise 1D DCTstages in a 2D DCToperation. Fast algorithms are proposed for the final reconstruction step employing two approaches, namely, the expansion factor and dyadic approximation. A digital architecture is also proposed for a particular FRS algorithm, and is implemented on an FPGA platform for on-chip verification. The FPGA implementation operates at 360 MHz, and is capable of a real-time throughput of 3:6 Á 10 8 2D DCTs of size 8Â8 every second, with corresponding pixel rate of 2:3 Á 10 10 pixels per second. The digital architecture is synthesized using 180 nm CMOS standard cells and shows a chip area of 7.41 mm 2 . The CMOS design is predicted to operate at 893 MHz clock frequency, at a dynamic power consumption 13.22 mW/MHz Á V 2 sup .
stationary Markov signals of type I [3] . This is relevant because images often follow such model [3] .
The 8-point DCT of type II, hereafter referred only as DCT [3] , has been employed in different image and video compression standards [8] , including JPEG [9] , MPEG-1 [10] , H.264 [11] , and HEVC [12] . Due to such wide acceptance, several fast algorithms were proposed for the 8-point DCT [3] . A particularly relevant fast algorithm is the one proposed by Loeffler et al. described in [13] , which is capable of computing the 8-point DCT with the minimum possible number of multiplications [13] , [14] , [15] . Because of this, the Loeffler factorization for the 8-point DCT is considered to be a reference method for comparing DCT algorithms.
The theory of algebraic integers (AI) was first introduced in the context of digital signal processing in 1985 by Cozzens and Finkelstein [16] , [17] aiming at the computation of the discrete Fourier transform (DFT). The method included the use of residue number systems in order to reduce the dynamic range of the quantities involved in the DFT computation. In [17] , it was shown that it is possible to numerically evaluate the DFT in exact format and without error propagation, achieving arbitrary precision according to a final reconstruction step (FRS). The FRS is responsible for mapping back the quantities from algebraic integer representation into usual fixed-point representation. The irrational quantities required in the FRS are approximated by rational quantities that can be efficiently implemented in hardware.
Several fast algorithms based on algebraic integer theory have been proposed for the computation of the 1D and 2D DCT [18] , [19] , [20] , [21] . These architectures are able to compute the 1D DCT without multipliers within an errorfree structure.
Usual computation of the 2D DCT is accomplished by column-and row-wise calls of the 1D DCT. However simply computing the 1D DCT by means of AI-based algorithm does not result in a bona fide AI-based 2D DCT computation. Indeed, from the 2D DCT point-of-view, the FRS blocks from AI-based 1D DCT appear as an intermediate computation. Such intermediate reconstruction precludes error-free computation and undermines one of the purposes of employing algebraic integers. This drawback was identified in [22] , [23] where an error-free computation of 2D DCT was proposed for the Arai algorithm [24] . Therefore, the output of the algorithm used both in [22] and [23] is a nonuniform scaled version of the 2D DCT spectrum.
The error-free characteristic of the methods proposed in [16] , [17] , [18] , [19] , [20] , [21] , [22] , [23] is a by-product of the algebraic integer encoding, possibly not the most important. Additional advantages of AI-based fast algorithms include (i) parallelization and (ii) low latency due to the accumulation of multiplicative complexity at the FRS.
This paper aims at introducing a 2D DCT algorithm based on the AI representation that combines (i) high throughput; (ii) low latency; (iii) parallelization; and (iv) error-free architecture. This is achieved by means of the Loeffler fast algorithm for the 1D 8-point DCT using the encoding proposed in [25] . The error-free architecture is possible only because we propose new fast algorithms for the 1D DCT tailored for the inputs required by the 2D architecture. The use of the proposed dedicated algorithms allows the removal of the FRS at the end of each 1D DCT when applied to the columns of the 8Â8 blocks. A digital circuit capable of computing the 2D DCT with the above properties becomes an attractive tool according to several metrics [2] .
This article unfolds as follows. Section 2 reviews the algebraic integer representation proposed in [25] and employed in the present work. Section 2 summarizes the 1D DCT for real quantized input sequences and the Loeffler DCT fast algorithm. Section 3 reviews the 2D DCT and presents the fast algorithms needed to compute the 2D DCT using algebraic integer theory in a error-free fashion. A comparison between the introduced scheme with previous works that propose AI-based error-free 2D DCT computation is supplied. Section 4 details the FRS for the 2D DCT computation and introduces two different methods for the efficient decoding of algebraic integer quantities. In Section 5, an FPGA implementation for the proposed 2D DCT architecture is proposed. Section 6 concludes the paper. The 8-point 1D DCT is a linear orthogonal transformation given by [3] , [4] 
where
In [25] , the authors characterized the ring spanned by the set Z whose elements are 1 and c k , where c k ¼ 2 cos ðkp=16Þ, for k ¼ 1; 2; . . . ; 7. The vector space spanðZÞ generated by linear combination of the elements of Z is suitable for the computation of the 8-point DCT. This is due to the fact that the 8-point DCT requires the quantities cos pkð2n þ 1Þ=16 ½ , n; k ¼ 0; 1; . . . ; 7 [3] . Hereafter, we denote z z ¼ 1 c 1 c 2 ½ c 3 c 4 c 5 c 6 c 7 > as the basis element vector.
Encoding and Decoding
The encoding of a given real number x over the considered AI basis is denoted by f enc ðx; z zÞ ¼ x, where
> is the encoded integer vector, a k 2 Z, k ¼ 0; 1; . . . ; 7, and > denotes transposition. The decoding operation is given directly by the dot product operation [22] 
In [25] , it was shown that the above representation is dense and can provide arbitrary precision, i.e., it is always possible to determine a vector x such that x Àx j j < ", for any " > 0. Authors have also pointed out that in usual applications, such as in the context of image compression, the input data are real, discrete, and quantized [26] in the form of an integer [3] . In such conditions, a real quantized input m, the AI-encoded data can be trivially obtained according to f enc ðm; z zÞ ¼ m 0 0 0 0 0 0 ½ > .
Arithmetic Operations
AI-based addition and multiplication operations over the considered basis were defined in [25] , being the only elementary operations required by the Loeffler DCT algorithm. Since AI quantities are represented by arrays of integers, the addition and subtraction operations obey the usual vector addition and subtraction rule. For the multiplication operation, the product of an arbitrary algebraic integer in the proposed representation by one of the basis elements obey the relations described in Table 1 . Such multiplications are trivial in the sense that only additions, subtractions, and permutations of the input coefficients are needed. In hardware implementation, these operations are performed by simple wiring and adders/subtractors.
Loeffler 1D DCT Multiplicands
The Loeffler DCT algorithm has a signal flow graph (SFG) with four stages (cf. Fig. 1 in [25] ) and it requires the [25] .
In [25] , it was shown that the ring implied by the AI formalism for the 1D DCT case is over-complete; and thus the coefficients linked to the basis element c 4 are not required. On the other hand, the 2D DCT demands the coefficients associated to c 4 .
1D AI-Based Fast Algorithm
The representation proposed in [25] when applied to the 1D 8-point DCT furnishes a multiplierless algorithm that requires a total of 20 additions (cf. Fig. 3 in [25] ). Indeed, due to Loeffler DCT definition, the algorithm output is a scaled DCT with scaling factor of 2. If required, the output can be re-scaled by a simple bit-shifting or it can be inserted into the decoding stage. Therefore, the scaling by 2 does not add to the increase of arithmetic complexity when performing the processing.
THE 2D DCT AND THE AI BASIS REPRESENTATION

The 2D DCT
Let x m;n be a 2D array for m; n ¼ 0; 1; . . . ; 7. The 2D 8-point DCT is a linear transformation defined as [3] , [4] 
As adopted by several image encoding schemes [3] , [27] , [28] , [29] , the 2D DCT computation is performed by successive calls of the 1D DCT applied to the columns of the input 2D data, then to the rows of the resulting matrix. For blocks of size 8 Â 8, sixteen calls of the 1D DCTs are required to furnish the 2D DCT.
2D AI-Based Fast Algorithm
When the 2D input array is real and quantized, several simplications arise. These simplifications can be exploited to provide efficient fast algorithms for the 2D DCT over the AI basis representation proposed in [25] without the need of FRS for each 1D DCT between the computation over the columns and rows.
Considering the 2D DCT computation by means of column-and row-wise calls of the 1D DCT, we notice the following structure. If the 2D input data consists of integer elements, then the AI-encoded quantities resulting from the columnwise calls of the 1D DCT have the following configuration: (i) the elements in the 0th and 4th rows have always non-null first coefficient in its AI-based representation; (ii) the elements in 1st, 3rd, 5th, and 7th rows exhibit non-null odd-index coefficients; and (iii) the 2nd and 6th rows have non-null coefficients only in the 3rd and 7th coefficients on its respective AIbased representation. Such fixed patterns are due to the algorithm for class A input (cf. Fig. 3 in [25] ).
In view of their patterns, we categorize the AI quantities into the five classes as shown in Table 2 , where non-null coefficients locations are represented by the cross symbol. If we represent the two-dimensional input sequence in graphical format as in Fig. 1a , we have the configuration in Fig. 1b after the application of 1D DCT over the columns. Letters A, B, C, D, and E represent the class to which the quantity belongs according to Table 2 .
For an error-free realization of the 2D DCT without FRS blocks between the column-and row-wise 1D DCT calls, we need to derive tailored AI-based DCT algorithms considering input data in Class B and C. For input in Class A, the DCT algorithm collapses to the method detailed in [25] . For such, we consider the multiplication rules in Table 1 . The obtained procedures are detailed in the algorithms in Fig. 2 , for Class B data; and in Fig. 3 , for Class C data. The algorithm in Fig. 2 requires 136 additions and 14 bit-shifting operations; whereas the procedure in Fig. 3 demands 74 additions and 8 bit-shifting operations.
The outputs of the algorithms in Figs. 2 and 3 also follow a fixed pattern that determines the class of each output element in the AI representation. The class of each element of the output sequence is shown in Fig. 1c .
For a given 8 Â 8 block x, let x m;Á and let x Á;n denote the mth row and the nth column of x, respectively. Let also the operators DCT A ðÁÞ, DCT B ðÁÞ, and DCT C ðÁÞ be instantiations of the algorithms for Class A input (cf. Fig. 3 in [25] ), Class B as in Fig. 2 , and Class C as in Fig. 3 , respectively. For example, DCT A ðx Á;n Þ denotes the computation of the 1D DCT over the nth column of the block x whose coefficients belong to Class A in the AI-based representation. Let also Z X be the set of 8-point integer vectors belonging to the Class X, where X 2 fA; B; Cg. Thus, a complete fast 
Cross symbols correspond to non-null coefficients.
algorithm for the computation of the 2D DCT using AI representation can be expressed in terms of DCT A ðÁÞ, DCT B ðÁÞ, and DCT C ðÁÞ, as shown in Fig. 4 . Notice that the output are in AI-based representation. The algorithm in Fig. 4 along with the FRS is graphically depicted in Fig. 5 . The computational cost of the algorithm shown in Fig. 4 can be derived by noticing that it demands 10, 4, and 2 instantiations of DCT A , DCT B , and DCT C , respectively. Considering the number of additions and bit-shifting operations required by DCT A , DCT B , and DCT C , the computation of the 2D DCT using AI-based representation requires a total of 892 additions and 92 bit-shifting operations. Arithmetic operation count and comparison are show in Table 3 .
Comparison with Previous Works
Different works [18] , [30] , [31] , [32] proposed fast algorithms for the AI-based 2D DCT computation claiming to provide error-free computation. However, in previous works, the computation of 2D DCT was based in several instantiations of the 1D DCT with its FRS blocks placed in-between the column-and row-wise DCT computations. Such intermediate reconstruction step introduces numerical inaccuracies, which makes the implementation incapable of error-free computation [18] , [30] , [31] , [32] . This also reflects on deterioration of performance metrics such as throughput, maximum operating frequency, area, and latency as the FRS blocks are computationally intensive and hardware demanding.
Nonetheless, there is one algorithm for full error-free computation of the 2D DCT using the Arai factorization [24] . The Arai algorithm over algebraic integers was employed in two different architectures: a row-parallel 8Â8 2D DCT architecture using algebraic integer-based exact computation [22] and a single-channel architecture for algebraic integer-based 8Â8 2D DCT computation [23] . The work in [22] is a extension of the work in [33] , where the first error-free 2D DCT was proposed. Since these architectures are based on the Arai algorithm [24] , they are capable of providing the non-uniformly scaled 2D DCT spectrum. This is due to the fact that Arai 1D DCT is capable of only provide a scaled spectrum, thus eliminating some multiplicands required for the non-scaled 1D DCT. The implementations in [22] and [23] can be modified in order to provide exact 2D DCT spectrum at the cost of changes on the FRS. However, our proposed architecture is error-free up to the FRS after the application of the 1D DCT in both dimensions and does not share the intricate details of non-uniform scale as in [22] , [23] . Table 3 summarizes the comparison between the proposed method and previous works.
FINAL RECONSTRUCTION STEP
The final reconstruction step block performs the AI decoding described in (2) . It maps AI quantities back to fixedpoint representation. In the proposed design implementation of the 2D DCT, the FRS is performed only at the very final stage after all the computations required by the 2D DCT are completed over the AI representation. No intermediate reconstructions are required.
In this section, we consider two methods for AI decoding: (i) dyadic approximation [3] and (ii) the expansion factor method [3] , [36] with a modified cost function for optimized results. The dyadic approximation method is suitable for scenarios where the exact spectrum is required, whereas the expansion factor is applicable when a scaled version is acceptable.
In both cases, the FRS is reduced to the evaluation of the product of a few integers by known integer constants. This operation can be understood as an instance of the multiple constant multiplication (MCM) problem with a small number of constants-no more than four, as will be clear in the next sections. Several methods for MCM evaluation with different constants have been developed by means of optimization and graph theory [37] , [38] , [39] , [40] . For solving the present MCM problems, we employ the method described in [41] , which is based on number recoding and subexpression factorization and has not been applied to the design of FRS block in previous works [18] , [22] , [23] , [30] , [31] , [32] , [34] , [35] . MCM evaluation can save up to 40 percent of area in FPGA implementation and provide faster computation when compared to routine methods [41] .
Dyadic Approximation Method
The irrational quantities required by the FRS can be approximated with arbitrary precision by dyadic integers [3] . Dyadic integers are of the form p=2 k , where p is an odd integer and k 2 N. Thus, they can be efficiently implemented in hardware [2] , [3] , [42] .
In order to implement the FRS with a minimum arithmetic cost, we first approximate each of the involved irrational constants by a dyadic integer. The accuracy of its approximation is determined by the wordlength employed, which can vary according to specific applications. For the sake of clarity, adopting the 11-bit wordlength, we have 
For the decoding of 2D DCT output coefficients into fixed-point representation, we need to consider the quantities in each AI number class and design specific algorithms for each class. Considering classes shown in Table 2 , we have the algorithms shown in Table 4 for the approximate z z with 11-bit wordlength. The operation x ( k denotes the left shift of k bits over the integer quantity x (i.e., x Á 2 k ); whereas x ) k denotes the right shift of k bits.
Expansion Factor
The expansion factor method returns a scaled version of DCT spectrum and is based on finding an appropriate real constant a Ã > 1 such that a Ã Á z z is as close as possible to a vector of integers. This provides means of performing the decoding operation with multiplications by known integer constants that can be efficiently performed with
where roundðÁÞ operates over each component of its vector argument. Previous works have considered an expansion factor a Ã satisfying the following optimization problem
In this context, all the components of the basis vector z z are taken into account with the same weight. However, this is not suitable for this problem. In fact, the required number of multiplications by each of the components of z z is not uniform. This can be seen by the output pattern of the 2D DCT coefficients in Fig. 1c . Therefore, in order to obtain a more precise estimation for a Ã , we must take into account the relative frequency of occurrence of multiplications of the coefficients of z z. This results in the following optimization problem
where f f represents the vector with the relative frequency of the occurrence of multiplications by each coefficient of z z. 
The problem in (7) is non-linear and has no closed solution in terms of simple algebraic functions. In order to solve (7) we employ exhaustive search methods. Although exhaustive search methods are not considered to be efficient for solving optimization problems in general, the search space for finding suitable expansion factors can be made small enough without imposing prohibitive limitations.
For instance, considering the search space ½0; 2048 (11-bit wordlength) and a step size of 10 À2 , we obtain the optimal value of a Ã ¼ (9) Table 5 shows the optimal expansion factors for some wordlengths N for searches with steps of 10 À2 . Minimum and maximum relative errors are also shown.
For the decoding of 2D DCT output coefficient into fixed-point representation, we need to consider the different number classes shown in Table 2 and design specific algorithms for them. We adopted the MCM method described in [41] . We derived the algorithms shown in Table 6 for the optimal constant a Ã ¼ 1844:95 with 11-bit wordlength. A fully-parallel architecture for the real-time implementation of the proposed 2D DCT using AI encoding has been designed, simulated and implemented using field programmable gate array (FPGA) technology. The architecture assumes 64 parallel input channels pertaining to the 64 locations of an 8 Â 8 matrix of pixel values, which are assumed to be of 8-bit signed values using twos complement format. The inputs are assumed to be normalized in the range À1 to 127=128. The AI encoded architectures corresponding to DCT A , DCT B , and DCT C are realized in parallelized digital hardware. The 64 output coefficients are maintained in the AIencoded infinite precision format up to the FRS block for conversion to fixed-point representation for subsequent processing. The algorithms for multiple constant multiplication described in Tables 4 and 6 were used in the FRS design. The FRS was also made fully-parallel. Both the AIencoded DCT architecture as well as the FRS are fine-grain pipelined for low critical path delay.
The resulting digital design was simulated using bit-true and cycle accurate models using 1:5 Á 10 4 number of randomly generated 8 Â 8 input vectors to verify correct operation. The verified design was thereafter targeted to a Xilinx Virtex-6 XC6VLX240T-1FFG1156 FPGA device installed on a Xilinx ML605 evaluation platform. The design was subjected to physical implementation and test using 1:5 Á 10 4 test matrices provided to the implementation using stepped hardware co-simulation on the JTAG port. The FPGA 
10 10 resources consumption and metrics are shown in Table 7 along with competitors designs available metrics in Table 3 .
1
The throughput is calculated as the number of output coefficients in each cycle and the designs in the works in Table 7 are classified into fully parallel 64 coefficients per clock cycle (FPar64); row parallel 8 coefficients per clock cycle (RPar8); fully serial 1 coefficient per clock cycle (FSer1). The block and pixel rate represent the number of 8 Â 8 blocks and pixels processed per second. The maximum clock frequency for potential real-time operation is 360 MHz. This implies a throughput of 360 million 2D DCT computations of size 8 Â 8 every second, if this core is used as part of a larger image/video processing system designed on the same FPGA technology. This throughput is equivalent to a pixel rate of 23,040 billion pixels/second, and a sustained data processing rate of 184.32 Gbps (internal to the core). Getting such a high data rate into the processing core is a challenging problem itself. The intention here is to give the reader a sense of the capabilities of the FPGA realization, if a suitable data source and algorithm is in fact available to feed it. Given that the proposed core is expected to be part of a larger ultra-high definition video processing system, the obtained throughput from the FPGA implementation is quite sufficient for today's most challenging UHD video applications.
Note that the proposed design achieves the highest maximum frequency among all the competitors designs. The work in [31] proposes a 2D DCT algorithms, but only the FPGA implementations results for 1D DCT are presented and therefore used in Table 7 . The only works containing complete error-free implementation of 2D DCT are [22] , [23] and [34] . The proposed design demands a total of 26,000 registers and 30,200 look-up tables (LUTs) used for logic against 10,282 registers and 12,007 LUTs required by the design in [22] . Although the number of registers and LUTs used are around three times larger, the proposed design offers 100 times higher block processing rate and 1,000 times higher pixel rate compared to [22] .
ASIC Synthesis
Apart from the FPGA implementation, the design is subjected to the application specific integrated circuit (ASIC) synthesis. The employed ASIC technology is the AMS 180 nm with the software Genus version 15.23. The supply voltage (V sup ) for the ASIC synthesis is 1.8 V. Table 8 results shows the ASIC metrics.
Note that because the proposed architecture is capable of computing the DCT coefficients in a parallel fashion reducing the critical path, the maximum operating frequency achieved is very high compared to its competitors. The proposed scheme requires around 3-4 times the area in the design in [35] . However, this area requirement is translated into a thousandfold pixel rate improvement.
CONCLUSIONS
In this paper we proposed a new error-free fast algorithm for the computation of the 2D DCT. The algorithm is based on algebraic integer representation proposed in [25] . The proposed fast algorithm does not require any intermediate FRS between the the column-and row-wise 1D DCT calls. 1. Note, however, that we do not include the work [18] in Table 7 . This is because the work in [18] does not present any FPGA implementation metric.
This work provided FPGA and ASIC implementation results. The FPGA results shows that the proposed fast algorithm provided higher maximum operating frequency when compared to competitors. This is important in applications requiring high data throughput and real time processing of images or videos. Future works may include the design of 3D DCT algorithms using the algebraic integer based numerical representation employed in this work. Vassil S. Dimitrov is a professor with the Department of Electrical and Computer Engineering, University of Calgary, Canada, and member of the Management Board of the Centre for Information Security and Cryptography. He has vast experience in the domain of cryptography and efficient implementation of cryptographic protocols. He has published three books, more than 100 papers in peer-reviewed journals and holds three patents. He has extensively taught courses on digital signal processing, cryptography, information theory, and computational complexity. Since 1997, he is a member of the New York Academy of Sciences. Prior to his professorship with the University of Calgary, he held academic position with the University of Windsor, Canada and Helsinki University of Technology, Finland. He is doing extensive consulting work in the domains of cryptography, big data analysis, and high performance computing. Arnaud Tisserand received the PhD degree, in 1997. He is senior researcher with the CNRS (French National Center for Scientific Research) in computer science in Lab-STICC laboratory. His research interests include computer arithmetic, computer architecture, digital security, VLSI and FPGA design, design automation, low-power design and applications in applied cryptography, scientific computing, digital signal processing. He is an associate editor of the IEEE Transactions on Computers and a senior member of the IEEE (SSCS, CAS).
Arjuna Madanayake
(M'03) received the BSc degree in electronic and telecommunication engineering (with First Class Honors) from the Univer
