An algebraic integer (AI) based time-multiplexed row-parallel architecture and two final-reconstruction step (FRS) algorithms are proposed for the implementation of the bivariate AI-encoded 2-D discrete cosine transform (DCT). The architecture directly realizes an error-free 2-D DCT without using FRSs between the row and column transforms, leading to an 8×8 2-D DCT that is entirely free of quantization errors in the AI basis. As a result, the user-selectable accuracy of the FRS allows each of the 64 coefficients to have its precision set independently of the others, avoiding the leakage of quantization noise between channels that occurs in published DCT designs. The proposed FRS uses two approaches based on (i) optimized Dempster-Macleod multipliers and (ii) expansion factor scaling. This architecture enables low-noise high-dynamic range applications in digital video processing that require full control of the finite-precision computation of the 2-D DCT. The proposed architectures and FRS techniques are experimentally verified and validated using hardware implementations that are physically realized and verified on an FPGA chip. Six designs, for 4- and 8-bit input word sizes, using the two proposed FRS schemes, have been designed, simulated, physically implemented and measured. The maximum clock rate and block rate achieved among the 8-bit input designs are 307.787 MHz and 38.47 MHz, respectively, implying a pixel rate of 8×307.787 MHz ≈ 2.462 GHz if eventually embedded in a real-time video-processing system. The equivalent frame rate is about 1187.35 Hz for an image size of 1920×1080. All implementations are functional on a Xilinx Virtex-6 XC6VLX240T FPGA device.
INTRODUCTION
High-quality digital video in multimedia devices and video-over-IP networks connected to the Internet is growing exponentially, and the demand for applications capable of high dynamic range (HDR) video is increasing accordingly. Some HDR imaging applications include automatic surveillance [1] [2] [3] [4], geospatial remote sensing [5], traffic cameras [6], homeland security [4], satellite based imaging [7] [8] [9], unmanned aerial vehicles [10] [11] [12], automotive industry [13], and multimedia wireless sensor networks [14]. Such HDR video systems operating at high resolutions require associated hardware capable of significant throughput at allowable area-power complexity.
into fixed-point representations [36] . This procedure allows the selection of individual levels of precision for each of the 64 DCT spectral components at the FRS. At the same time, such flexibility does not affect noise levels or speed of other sections of the 2-D DCT.
This work extends the 8-point 1-D AI-based DCT architecture [37, 41, 42] into a fully-parallel time-multiplexed 2-D architecture for 8×8 data blocks. The fundamental differences are (i) the absence of any intermediate reconstruction step; (ii) a new doubly AI encoding scheme; and (iii) the utilization of a single FRS. The proposed 2-D 8×8 architecture has the following characteristics: (i) independently selectable precision levels for the 2-D DCT coefficients; (ii) total absence of multiplication operations; and (iii) absence of leakage of quantization noise between coefficient channels. The proposed architectures aim at performing the FRS operation directly in the bivariate encoded 2-D AI basis. We introduce designs based on (i) optimized Dempster-Macleod multipliers and on (ii) the expansion factor approach [44]. All hardware implementations are designed to be realized on field programmable gate arrays (FPGAs) from Xilinx [45].
This paper unfolds as follows. In Section 2 we review existing designs and the main theoretical points of number representation based on AI. We keep our focus on the core results needed for our design. Section 3 brings a description of the proposed circuitry and hardware architecture in block level detail. In Section 4 strategies for obtaining the FRS block are proposed and described. Simulation results and actual test measurements are reported in Section 5.
Concluding remarks are drawn in Section 6.
REVIEW
The AI encoding was originally proposed for digital signal processing systems by Cozzens and Finkelstein [46]. Since then it has been adapted for the VLSI implementation of the 1-D DCT and other trigonometric transforms by Julien et al. in [47] [48] [49] [50] [51], leading to a 1-D bivariate encoded Arai DCT algorithm by Wahid and Dimitrov [37, 41, 42, 52]. Recently, subsequent contributions by Wahid et al. (using bivariate encoded 1-D Arai DCT blocks for row and column transforms of the 2-D DCT) have led to practical area-efficient VLSI video processing circuits with low power consumption [53] [54] [55]. We now briefly summarize the state of the art in both 1-D and 2-D DCT VLSI cores based on conventional fixed-point arithmetic as well as on AI encoding.
SUMMARY AND COMPARISON WITH LITERATURE

FIXED-POINT DCT VLSI CIRCUITS
A unified distributed-arithmetic parallel architecture for the computation of the DCT and the discrete sine transform (DST) was proposed in [24]. A direct-connected 3-D VLSI architecture for the 2-D prime-factor DCT that does not need a transpose memory (buffer) is available in [25]. A pioneering implementation at a clock of 100 MHz on 0.8 µm CMOS technology for the 2-D DCT with block size 8×8, which is suitable for HDTV applications, is available in [17]. An efficient VLSI linear array for both N-point DCT and IDCT using a subband decomposition algorithm that results in computational and hardware complexity of O(5N/8), with FPGA realization, is reported in [20]. Recently, VLSI linear-array 2-D architectures and FPGA realizations having computation complexity O(5N/8) (for the forward DCT) were reported in [21]. An efficient adder-based 2-D DCT core on 0.35 µm CMOS using cyclic convolution is described in [29]. A high-performance video transform engine employing a space-time scheduling scheme for computing the 2-D DCT in real time has been proposed and implemented in 0.18 µm CMOS [22]. A systolic-array algorithm using a memory-based design for both the DCT and the DST, which is suitable for real-time VLSI realization, was proposed in [18]. An FPGA-based system-on-chip realization of the 2-D DCT for 8×8 block size that operates at 107 MHz with a latency of 80 cycles is available in [28]. A low-complexity IP core for quantized 8×8/4×4 DCT combined with MPEG4 codecs and FPGA synthesis is available in [30]. A "new distributed-arithmetic (NEDA)" based low-power 8×8 2-D DCT is reported in [31]. A reconfigurable processor on TSMC 0.13 µm CMOS technology operating at 100 MHz is described in [32] for the calculation of the fast Fourier transform and the 2-D DCT. A high-speed 2-D transform architecture based on the NEDA technique and having a unique kernel for multi-standard video processing is described in [33].
AI-BASED DCT VLSI CIRCUITS
The following AI-based realizations of 2-D DCT computation rely on the row- and column-wise application of 1-D DCT cores that employ AI quantization [47] [48] [49] [50] [51]. The architectures proposed by Wahid et al. rely on the low-complexity Arai algorithm and lead to low-power realizations [41, 42, [52] [53] [54]. However, these realizations are also based on the repeated application, along rows and columns, of a fundamental 1-D DCT building block having an FRS section at the output stage. Here, 8×8 2-D DCT refers to the use of bivariate encoding in the AI basis and not to a true AI-based 2-D DCT operation.
A 4 × 4 approximate 2-D-DCT using AI quantization is reported in [56] . Both FPGA implementation and ASIC synthesis on 90 nm CMOS results are provided. Although [56] employs AI encoding, it is not an error-free architecture.
The low complexity of this architecture makes it suitable for H.264 realizations.
PRELIMINARIES FOR ALGEBRAIC INTEGER ENCODING AND DECODING
In order to prevent quantization noise, we adopt the AI representation. Such a representation is based on a mapping function that links input numbers to integer arrays. This topic is a major and classic field in number theory. A famous exposition is due to Hardy and Wright [57, Chap. XI and XIV], which is widely regarded as a masterpiece on this subject for its clarity and depth. Pohst also brings a didactic explanation in [58] with emphasis on computational realization. In [59, p. 79], Pollard and Diamond devote an entire chapter to the connections between algebraic integers and integral bases. In the following, we furnish an overview focused on the practical aspects of AI, which may be useful for circuit designers.
Definition 1 A real or complex number is called an algebraic integer if it is a root of a monic polynomial with integer coefficients [38, 57].
The set of algebraic integers has useful mathematical properties. For instance, it forms a commutative ring, which means that addition and multiplication are commutative and that multiplication distributes over addition.
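Definition 1 can be checked numerically for simple cases. The sketch below is only illustrative: the helper name `eval_monic` and the two test numbers are assumptions chosen for this example, not quantities taken from the text. It evaluates a monic integer polynomial at a candidate root in floating point, so a residual near zero is evidence, not a proof.

```python
import math

def eval_monic(lower_coeffs, x):
    """Evaluate x^n + c[n-1]*x^(n-1) + ... + c[0] by Horner's rule.

    lower_coeffs lists the integer coefficients from x^(n-1) down to the
    constant term; the leading coefficient is fixed to 1 (monic).
    """
    acc = 1.0
    for c in lower_coeffs:
        acc = acc * x + c
    return acc

# sqrt(2) is a root of x^2 - 2.
residual_sqrt2 = eval_monic([0, -2], math.sqrt(2.0))

# 2*cos(pi/8) = sqrt(2 + sqrt(2)) is a root of x^4 - 4x^2 + 2.
residual_cos = eval_monic([0, -4, 0, 2], 2.0 * math.cos(math.pi / 8))

print(residual_sqrt2, residual_cos)  # both are zero up to rounding error
```

Both numbers are therefore algebraic integers, which is why cosine-related constants are natural candidates for exact integer encoding.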
A general AI encoding mapping has the following format
where a is a multidimensional array of integers and z is a fixed multidimensional array of algebraic integers. It can be shown that there always exist integers such that any real number can be represented with arbitrary precision [46] . Also there are real numbers that can be represented without error. 
which is an exact representation. In principle, any number can be represented in an arbitrarily high precision [46, 60] . However, within a limited dynamic range for the employed integers, not all numbers can be exactly encoded. For instance, considering the real
, where integers were limited to be 8-bit long. Although very close, the representation is not exact:
In a similar way, the multipliers required by the DCT could be encoded into 2-point integer vectors:
Given that the DCT constants are algebraic integers [38], an exact AI representation can be derived [61]. Thus, the integer sequences $a_0[n]$ and $a_1[n]$ can be easily realized in VLSI hardware. The multiplication between two numbers represented over an AI basis may be interpreted as a modular polynomial multiplication with respect to the monic polynomial that defines the AI basis. In the above particular illustrative example, consider the multiplication of the following pair of numbers $a_0 + a_1 z_1$ with $b_0 + b_1 z_1$, where $b_0$ and $b_1$ are integers. This operation is equivalent to the computation of the following expression:
Thus, existing algorithms for fast polynomial multiplication may be of consideration [62, p. 311] .
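The interpretation of AI multiplication as a modular polynomial product can be sketched in a few lines. The example below assumes, purely for illustration, a single basis element $z = \sqrt{2}$ defined by the monic polynomial $z^2 - 2$; this basis and the function names are hypothetical and may differ from the illustrative example used in the text.

```python
import math

def ai_mul_sqrt2(a, b):
    """Multiply a0 + a1*z by b0 + b1*z modulo z^2 - 2 (assumed basis z = sqrt(2)).

    The plain polynomial product is a0*b0 + (a0*b1 + a1*b0)*z + a1*b1*z^2;
    reducing modulo z^2 - 2 replaces z^2 by the integer 2.
    """
    a0, a1 = a
    b0, b1 = b
    return (a0 * b0 + 2 * a1 * b1, a0 * b1 + a1 * b0)

def decode(a, z=math.sqrt(2.0)):
    """Map the integer pair back to a real number (the only inexact step)."""
    return a[0] + a[1] * z

product = ai_mul_sqrt2((3, -1), (2, 5))
print(product)  # (-4, 13)
```

All arithmetic on the encoded pairs stays in the integers; rounding enters only when `decode` is applied.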
In practical terms, a good AI representation possesses a basis such that: (i) the required constants can be represented without error; (ii) the integer elements provided by the representation are sufficiently small to allow a simple architecture design and fast signal processing; and (iii) the basis itself contains few elements to facilitate simple encoding-decoding operations.
Other AI procedures allow the constants to be approximated, yielding much better options for encoding, at the cost of introducing error within the transform (before the FRS) [38]. These particular values can be conveniently encoded as follows. Considering $z_1 = \sqrt{2 + \sqrt{2}} + \sqrt{2 - \sqrt{2}}$ and
we adopt the following 2-D array for AI encoding:
This leads to 2-D encoded coefficients of the form (scaled by 4):
Such encoding is referred to as bivariate. For this specific AI basis, the required cosine values possess an error-free and sparse representation as given in Table 1 [37, 41, 42] . Also we note that this representation utilizes very small integers and therefore is suitable for fast arithmetic computation. Moreover, these employed integers are powers of two, which require no hardware components other than wired-shifts, being cost-free. Encoding an arbitrary real number can be a sophisticated operation requiring the usage of look-up tables and greedy algorithms [63] . Essentially, an exhaustive search is required to obtain the most accurate representation. However, integer numbers can be encoded effortlessly:
where m is an integer. In this case, the encoding step is unnecessary. Our proposed design takes advantage of this property. For a given encoded number a, the decoding operation is simply expressed by: In terms of circuitry design, this operation is usually performed by the FRS.
In order to reduce and simplify the employed notation, hereafter a superscript notation is used for identifying the bivariate AI encoded coefficients. For a given real x, we have the following representation
$$x = x^{(a)} + x^{(b)} z_1 + x^{(c)} z_2 + x^{(d)} z_1 z_2,$$
where superscripts (a), (b), (c), and (d) indicate the encoded integers associated with the basis elements $1$, $z_1$, $z_2$, and $z_1 z_2$, respectively. We denote this basis as
It is worth emphasizing that in the 2-D AI encoding the equivalence between the algebraic integer multiplication and the polynomial modular multiplication does not hold true. Thus, a tailored computational technique to handle this operation must be developed.
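A behavioural sketch of the bivariate representation over $\{1, z_1, z_2, z_1 z_2\}$ is given below. It assumes $z_1 = \sqrt{2+\sqrt{2}} + \sqrt{2-\sqrt{2}}$ and $z_2 = \sqrt{2+\sqrt{2}} - \sqrt{2-\sqrt{2}}$; the definition of $z_2$ is an assumption of this example, chosen because it is consistent with the identity $z_1^2 z_2^2 = 8$ quoted in the FRS discussion. The function names are hypothetical, and the sample encodings are derived from these assumed definitions, not copied from Table 1.

```python
import math

S_PLUS = math.sqrt(2.0 + math.sqrt(2.0))
S_MINUS = math.sqrt(2.0 - math.sqrt(2.0))
Z1 = S_PLUS + S_MINUS
Z2 = S_PLUS - S_MINUS  # assumed definition, see the lead-in

def encode_int(m):
    """An integer needs no computation: it occupies the '1' channel only."""
    return (m, 0, 0, 0)

def add(x, y):
    """Addition is channel-wise integer addition."""
    return tuple(p + q for p, q in zip(x, y))

def decode(x):
    """Final reconstruction: the only place where irrational numbers appear."""
    a, b, c, d = x
    return a + b * Z1 + c * Z2 + d * Z1 * Z2

# Under the assumed basis, cosine constants scaled by 4 use tiny integers:
four_cos_pi_8 = (0, 1, 1, 0)     # 4*cos(pi/8)  = z1 + z2
four_cos_3pi_8 = (0, 1, -1, 0)   # 4*cos(3pi/8) = z1 - z2
four_cos_pi_4 = (0, 0, 0, 1)     # 4*cos(pi/4)  = z1*z2

print(decode(four_cos_pi_8) / 4, math.cos(math.pi / 8))
```

Because encoded values are integer 4-tuples, additions inside the transform introduce no rounding; only `decode`, which plays the role of the FRS here, works with finite-precision constants.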
2-D AI DCT ARCHITECTURE
An 8×8 image block A has its 2-D DCT transform mathematically expressed by [16]:
$$\mathbf{C} \cdot \mathbf{A} \cdot \mathbf{C}^{\top},$$
where C is the usual DCT matrix [44]. It is straightforward to notice that this operation corresponds to the column-wise application of the 1-D DCT to the input image A, followed by a transposition, and then the row-wise application of the 1-D DCT to the resulting matrix.
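The separability described above can be verified with a short floating-point model. This sketch assumes the orthonormal DCT-II matrix as "the usual DCT matrix" and the matrix form C·A·Cᵀ for the 2-D transform; the helper names and the test block are hypothetical, and plain Python lists are used to keep the example self-contained.

```python
import math

N = 8

def dct_matrix(n=N):
    """Orthonormal DCT-II matrix (one common convention, assumed here)."""
    rows = []
    for u in range(n):
        scale = math.sqrt(1.0 / n) if u == 0 else math.sqrt(2.0 / n)
        rows.append([scale * math.cos((2 * j + 1) * u * math.pi / (2 * n))
                     for j in range(n)])
    return rows

def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]

def transpose(m):
    return [list(r) for r in zip(*m)]

C = dct_matrix()
A = [[(3 * i + 5 * j) % 16 for j in range(N)] for i in range(N)]  # sample block

# Direct matrix form.
direct = matmul(matmul(C, A), transpose(C))

# Separable form: 1-D DCT on columns, transpose, 1-D DCT again, transpose back.
stage1 = matmul(C, A)
stage2 = matmul(C, transpose(stage1))
separable = transpose(stage2)

max_diff = max(abs(direct[i][j] - separable[i][j])
               for i in range(N) for j in range(N))
print(max_diff)  # zero up to rounding error
```

The transposition between the two 1-D passes is the step that the transpose buffer realizes in hardware.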
The 2-D DCT realizations in [41, 42, 64, 65] use the AI encoding scheme with decoding sections placed in between the row-and column-wise 1-D DCT operations. This intermediate reconstruction step leads to the introduction of quantization noise and cross-coupling of correlated noise components. In contrast, we employ a bivariate AI encoding, maintaining the computation over AI arithmetic to completely avoid arithmetic errors within the algorithm [61] .
The proposed architecture consists of five sub-circuits [61] : (i) an input decimator circuit; (ii) an 8-point AI- Our implementation covers items (ii)-(v) listed above. We now describe in detail each of the system blocks.
BIT SERIAL DATA INPUT, SERDES, AND DECIMATION
We assume that the input video data, in raster-scanned format, has already been split into 8×8 pixel blocks. We further assume that these blocks can be stacked to form an 8-column and (8 × (number of blocks))-row data structure. This leads to so-called "blocked" video frames, each of size 8×8 pixels. The blocking procedure leads to a raster-scanned sequence of pixel intensity (or color) values $x_{i,n}$, $i = 0, 1, \ldots, 7$, $n = 0, 1, \ldots, 8 \times (\text{number of blocks}) - 1$, from an 8×8 blocked image. Notice that we use column-row order for the indexes, instead of row-column. Due to the 8×8 size of the 2-D DCT computation, we find it convenient to consider the time index n after a modular operation $k \equiv n \pmod 8$. Hereafter, we will refer to the time index as a modular quantity $k = 0, 1, \ldots, 7, 0, 1, \ldots, 7, 0, 1, \ldots, 7, \ldots$.
The video signal is serially streamed through the input port of the architecture at a rate of $F_s$. A bit serial port connected to a serializer/deserializer (SerDes) is required to be fed using a bit rate of $8 \times F_s$ without considering overheads. As an aside, we note that this input bit stream may typically be derived from optical fiber transmission or high-throughput Ethernet ports driven at 9.6 Gbps. Following the SerDes, a decimation block converts the input byte sequence into a row structure by means of delaying and downsampling by eight as shown in Fig. 3.
Therefore, the raster-scanned input is decimated in time into eight parallel streams operating at a rate of $F_\text{clock} = F_s / 8$, resulting in eight columns of the input block. It is important to emphasize that such input data consist of integer values. Thus, they are AI coded without any computation as shown in (1). The obtained column data is submitted to the column-wise application of the AI-based 1-D DCT.
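The delay-and-downsample decimation can be modelled behaviourally as a polyphase split of the raster-scanned stream. The sketch below is a software model only, not a description of the actual circuit; the function name and the synthetic sample values are assumptions of this example.

```python
def decimate_by_eight(stream):
    """Split a raster-scanned stream into eight parallel column streams.

    Stream i collects samples i, i+8, i+16, ..., i.e. the input delayed by
    i samples and then downsampled by eight.
    """
    return [stream[i::8] for i in range(8)]

# Two stacked 8x8 blocks: 16 rows of 8 pixels in raster order.
# Sample value 8*n + i marks column i at time index n.
rows = 16
stream = [8 * n + i for n in range(rows) for i in range(8)]
columns = decimate_by_eight(stream)

# Each parallel stream carries one sample per row, one eighth of the input rate.
n = 10
k = n % 8  # modular time index inside the current 8x8 block
print(columns[3][n], k)  # 83 2
```

Each output stream advances once for every eight input samples, which is the reduced rate at which the 1-D DCT cores operate.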
AN 8-POINT AI-ENCODED ARAI DCT CORE
The column-wise transform operation is performed according to the 8-point AI-based Arai DCT hardware cores as designed in [41, 42] and shown in Fig. 1. Here, this scheme is employed with the removal of its original FRS. The proposed 2-D architecture employs an integer arithmetic entirely defined over the AI basis $\mathbf{z}_4$. This transformation step operates at the reduced clock rate of $F_\text{clock}$. Indeed, the resulting AI encoded data components are split into four channels according to their $\mathbf{z}_4$ basis representation [61]. Such outputs are time-multiplexed mixed-domain partially computed spectral components. We denote them as
, where $i = 0, 1, \ldots, 7$ is the column index and k is the modular time index containing the information of the row number. In hardware, this means that the AI representation is contained in at most four parallel integer channels [61]. Some quantities are known beforehand to require fewer than four AI encoded integers (cf. (2)). Thus, in some cases, fewer than four connections are required. These channels are routed to the proposed AI-based transpose buffer (AI-TB) shown in Fig. 2, as a necessary pre-processing for the subsequent row-wise DCT calculation.
Hard-wired cross-connections are used that physically realize the required transpose matrix for the next row-wise DCT section. These physical connections are encapsulated in the cross-connection block in Fig. 3 for brevity. The AI-TB is clocked at a rate of $F_\text{clock}$ and yields a new 8×8 block of transposed data every 64 clock periods of the master clock $F_s$. Subsequently, the transposed AI-encoded elements are submitted to four 1-D AI DCT cores operating in parallel.
ROW-WISE DCT COMPUTATION
After the cross-connection routing, the output taps from the transposition operation are connected to 32 parallel 8:1 multiplexers. Each multiplexer commutes continuously and routes each partially computed DCT component by cycling through its 3-bit control codes such that the q channel inputs of each of the four row-wise AI-based DCT cores are provided with a new set of valid input vectors at rate $F_\text{clock}$. The cores are set in parallel, being able to compute an 8-point DCT every eight clock cycles of the master clock signal. This operation performs the required row-wise DCT computation in order to complete the 2-D DCT evaluation, resulting in a doubly encoded AI representation $X_{i,k}^{(q)(p)}$, $p, q \in \{a, b, c, d\}$. Fig. 4 shows the above described block.
FINAL RECONSTRUCTION STEP
The output channels for the 64 2-D DCT coefficients are passed through the proposed FRS for decoding the AI-encoded numbers back into their fixed-point, binary representation, in 2's complement format. Two different architectures are proposed for the FRS.
The proposed FRS architectures differ from the one in [64] by having individualized circuits to compute each output value at possibly different precisions.
Indeed, no FRS circuits are employed in any intermediate 1-D DCT block. This prevents quantization noise cross-coupling between DCT channels. Any quantization noise is injected only at the final output. Therefore, noise signals are uncorrelated, which further allows the noise for each output to be independently adjustable and made as low as required.
FRS BASED ON DEMPSTER-MACLEOD METHOD
In this method the doubly encoded elements can be decoded according to:
which are then submitted to (2). The result is the kth row of the final 2-D DCT data $X_{i,k}$, $i = 0, 1, \ldots, 7$.
Therefore, for each q, (4) unfolds into a particular mathematical expression as shown below:
The summation of the above quantities returns $X_{i,k}$ (cf. (2)). Terms depending on $z_1$ and $z_2$ may not be rational numbers.
Indeed, they are given by
The multiplier $z_1^2 z_2^2 = 8$ is a power of two and can be represented exactly. The remaining constants require a binary approximation.
The closest signed 12-bit approximations can be employed to approximate the above listed numbers. Such an approach furnishes the quantities below: Consequently, the 12-bit approximation expressions related to $X_{i,k}^{(q)}$ are given by: (10) and (11), respectively.
Finally, considering the above quantities and applying (2), the sought fixed-point representations are fully recovered. Hardware implementation of the multiplier circuits, required by the 12-bit approximations above, is accomplished by using the method of Dempster and Macleod [66, 67] . This method is known to be optimal for constant integer multiplier circuits. In this multiplierless method, the minimum number of 2-input adders are used for each constant integer multiplier.
Wired shifts that perform "costless" multiplications by powers of two are used in each constant integer multiplier. Here, an enhancement to the Dempster-Macleod method is made for the constant integer multiplier circuits: the number of adder-bits is minimized, rather than the number of 2-input adders, yielding a smaller overall design. Accordingly, the multiplications by non-powers of two shown in expressions (10)-(13) can be algorithmically implemented as described in Table 2.
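The idea of a multiplierless constant multiplier built from wired shifts and 2-input adders can be illustrated in software. The constant 181 and the particular adder graph below are assumptions chosen for this example; they are not taken from Table 2, and no claim is made that this graph is the one a Dempster-Macleod search would select.

```python
def times_181(x):
    """Multiply an integer by 181 using three additions and shifts only."""
    x5 = (x << 2) + x        # 5x   = 4x + x
    x45 = (x5 << 3) + x5     # 45x  = 8*(5x) + 5x
    return (x45 << 2) + x    # 181x = 4*(45x) + x

def times_181_binary(x):
    """Direct sum of the five nonzero binary digits of 181: four additions."""
    return (x << 7) + (x << 5) + (x << 4) + (x << 2) + x

print(times_181(7), times_181_binary(7))  # 1267 1267
```

Reusing the intermediate products 5x and 45x saves one adder relative to summing the binary digits directly, which is the kind of sharing that adder-graph methods exploit. In hardware the shifts are wiring only, so the cost is dominated by the adders and their bit widths.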
FRS BASED ON EXPANSION FACTOR SCALING
The set of exact values given in (9) suggests further relations among those quantities. Indeed, the following relations may be established:
These identities indicate that a new design can be fostered. In fact, by substituting the above relations into (5)- (8), we have the following expressions:
Notice that the output value $X_{i,k}$ is the summation of the above quantities. Therefore, by grouping the terms on $\{1, z_1, z_2, z_1 z_2\}$, we can express $X_{i,k}$ by the following summation:
where stages after the DCT operation. Typically, it is absorbed into the quantizer. This approach has been employed in several DCT architectures [69] [70] [71]. Fig. 7 depicts the full block diagram of the discussed computing scheme. Eight separate instances of this block are necessary to compute coefficients $X_{i,0}$ to $X_{i,7}$, for each i.
ON-FPGA TEST AND MEASUREMENT
Six designs were implemented on the Xilinx ML605 evaluation kit, which is populated with a Xilinx Virtex-6 XC6VLX240T device. The designs included the three implementations of the 2-D 8×8 Arai AI DCT architecture with the two types of FRS described in Section 4 for fixed-point 4- and 8-bit wordlengths. Two versions of the expansion factor FRS are provided, corresponding to expansion factors α′ = 4.5941 and α* = 167.2309, resulting in six designs in total. The proposed designs are listed in Table 4.
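The two expansion factors can be related to the integer triples {12, 5, 13} and {437, 181, 473} that label the corresponding designs. The sketch below rests on two assumptions that the text here does not spell out: that the scaled constants are $z_1$, $z_2$, and $z_1 z_2$, and that $z_2 = \sqrt{2+\sqrt{2}} - \sqrt{2-\sqrt{2}}$. Under this reading, multiplying the three constants by α and rounding reproduces both triples.

```python
import math

S_PLUS = math.sqrt(2.0 + math.sqrt(2.0))
S_MINUS = math.sqrt(2.0 - math.sqrt(2.0))
Z1 = S_PLUS + S_MINUS
Z2 = S_PLUS - S_MINUS          # assumed definition
CONSTANTS = (Z1, Z2, Z1 * Z2)  # assumed set of constants to be scaled

def scaled_integers(alpha):
    """Round alpha times each constant to the nearest integer."""
    return [round(alpha * c) for c in CONSTANTS]

def worst_relative_error(alpha):
    """Largest relative deviation between alpha*c and its rounded integer."""
    return max(abs(round(alpha * c) - alpha * c) / (alpha * c)
               for c in CONSTANTS)

for alpha in (4.5941, 167.2309):
    print(alpha, scaled_integers(alpha), worst_relative_error(alpha))
```

The larger factor leaves a much smaller relative rounding error, which is in line with the higher accuracy reported for the designs based on {437, 181, 473}.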
The JTAG interface was used to input the test 8×8 2-D DCT arrays to the device from the MATLAB workspace.
Then the measured outputs were returned to the MATLAB workspace via the same interface. Hardware computed coefficients were compared to their numerical evaluation furnished by the MATLAB signal processing toolbox.
Figure 8: Resource utilization, speed of operation, and power consumption of the DCT designs given in Table 4 on Xilinx Virtex-6 XC6VLX240T FPGA for input fixed-point wordlength L = 4.
Figure 9: Resource utilization, speed of operation, and power consumption of the DCT designs given in Table 4 on Xilinx Virtex-6 XC6VLX240T FPGA for input fixed-point wordlength L = 8.
As a figure of merit, we considered the success rate, defined as the percentage of coefficients which are within the error limit of ±e %. For e ∈ {0.005, 0.01, 0.05, 0.1, 1, 5, 10}, the success rates were measured as given in Table 4. The input wordlength L was set to 4 or 8 bits. The 8-bit size is the typical video processing configuration. The proposed AI architectures enjoy overflow-free bit growth at each stage throughout the AI encoded structure, thereby ensuring that all sources of error are at the FRS and there only. Results show that the FRS based on the expansion factor approach for {437, 181, 473} (Designs 5 and 6) offers a significant improvement in accuracy when compared to the remaining FRS architectures.
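The success-rate figure of merit can be expressed compactly. The sketch below assumes that the ±e % limit refers to the relative error with respect to the reference coefficient, and that a zero reference counts as a success only when the measured value is also zero; both choices, the function name, and the sample data are assumptions of this example.

```python
def success_rate(measured, reference, e_percent):
    """Percentage of coefficients whose relative error is within +/- e_percent."""
    hits = 0
    for m, r in zip(measured, reference):
        if r == 0:
            within = (m == 0)  # assumed convention for zero references
        else:
            within = abs(m - r) / abs(r) * 100.0 <= e_percent
        hits += within
    return 100.0 * hits / len(reference)

# Hypothetical data: relative errors of 0.4 %, 0 %, 1.5 %, and 0 %.
reference = [100.0, -50.0, 20.0, 8.0]
measured = [100.4, -50.0, 20.3, 8.0]

for e in (0.1, 1, 5):
    print(e, success_rate(measured, reference, e))
```

Sweeping e over the listed limits gives one success-rate value per limit for each design.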
FPGA RESOURCE CONSUMPTION
The resource consumption of the proposed architectures on the Xilinx Virtex-6 XC6VLX240T device is shown in Fig. 8 for L = 4 bits. Fig. 9 brings analogous information for L = 8 bits. Here, FPGA resources are measured in terms of slices, slice registers, and slice look-up tables (LUTs). Designs 3 and 4, which use the FRS based on the expansion factor approach for {12, 5, 13}, consume the least resources in the device and have the worst accuracy of the three design types (Table 4). Moreover, even though Designs 5 and 6 (FRS based on the expansion factor approach for {437, 181, 473}) possess superior accuracy when compared to Designs 1 and 2 (FRS based on the Dempster-Macleod method), they consume fewer hardware resources. Overall, the FRS step of the proposed architectures requires a considerable amount of area when compared to the AI steps of the architecture.
CLOCK SPEED, BLOCK RATE, FRAME RATE
Frame rates and block rates achieved by the implemented designs for video at resolution 1920×1080 are shown in fold the clock frequency of the DCT core (due to the downsampling by eight in the signal flow graph). For example, potential pixel rates of ≈2.499 GHz and ≈2.462 GHz, for Designs 5 and 6, respectively, may be possible.
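The relation between core clock, block rate, pixel rate, and frame rate can be reproduced with simple arithmetic. The sketch below assumes one 8×8 block is completed every eight core clock cycles and that a 1920×1080 frame is tiled by 240×135 blocks; the function name is hypothetical.

```python
def throughput(clock_hz, width=1920, height=1080):
    """Return (block rate, pixel rate, frame rate) for a given core clock."""
    block_rate = clock_hz / 8            # one 8x8 block per eight clock cycles
    pixel_rate = 64 * block_rate         # 64 pixels per block = 8 * clock
    blocks_per_frame = (width // 8) * (height // 8)
    frame_rate = block_rate / blocks_per_frame
    return block_rate, pixel_rate, frame_rate

block_rate, pixel_rate, frame_rate = throughput(307.787e6)
print(block_rate / 1e6, pixel_rate / 1e9, frame_rate)
```

For a 307.787 MHz clock this gives a block rate of about 38.47 MHz and a pixel rate of about 2.462 GHz. The unrounded frame rate evaluates to roughly 1187.45 Hz; the quoted 1187.35 Hz is what results when the rounded block rate of 38.47 MHz is divided by 32400 blocks per frame.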
XILINX POWER CONSUMPTION AND CRITICAL PATH
The total power consumption of FPGA circuits consists of the sum of dynamic and quiescent power consumptions.
Both estimated dynamic and quiescent power consumptions obtained from the design tools for the Xilinx Virtex-6 XC6VLX240T device are provided in Fig. 8 and Fig. 9 .
AREA-TIME COMPLEXITY METRICS
Estimates for VLSI area-time complexity metrics are provided for all designs in Fig. 8 (L = 4) and Fig. 9 (L = 8), respectively. In general, the area-time metric measures the complexity of VLSI circuits where chip real estate is more important than speed, while the area-time$^2$ metric is often used for VLSI circuits where speed is of paramount concern. We provide both metrics to offer a broad overview of the area-time complexity levels present in the proposed architectures as a function of input size and choice of FRS algorithm.
The architectures are free of general purpose multipliers.
OVERALL COMPARISON WITH EXISTING ARCHITECTURES
Fixed point VLSI implementations that are directly comparable to the proposed architecture are compared in detail in Table 6 . Table 7 Tables 6 and 7 was provided in Section 2.
CONCLUSIONS
A time-multiplexed systolic-array hardware architecture is proposed for the real-time computation of the bivariate AI encoded 2-D Arai DCT. The architecture is the first 2-D AI encoded DCT hardware that operates completely in the AI domain. This not only makes the proposed system completely multiplier-free, but also quantization-free up to the final output channels.
Our architecture employs a novel AI-TB, which facilitates real-time data transposition. The 2-D separable DCT operation is entirely performed in the AI domain. Indeed, the architecture does not have intermediate FRS sections between the column-and row-wise AI-based Arai DCT operations. This makes the quantization noise only appear at the final output stage of the architecture: the single FRS section.
The location of the FRS at the final output stage results in the complete decoupling of quantization noise between the 64 parallel coefficient channels of the 2-D DCT. This fact is noteworthy because it enables the independent selection of precision for each of the 64 channels without having any effect on the speed, power, complexity, or noise level of the remaining channels.
Two algorithms for the FRS are proposed, numerically optimized, analyzed, hardware implemented, and tested with the proposed 2-D AI encoded section. The architectures are physically implemented for input precision of 4 and 8 bits, and fully verified on-chip. Of particular relevance is the commonly required 8-bit realization, which is operational at a clock frequency of 307.787 MHz on a Xilinx Virtex-6 XC6VLX240T FPGA device (see Design 6).
This implies an 8×8 block rate of 38.47 MHz and a potential pixel rate of ≈2.462 GHz if the proposed 2-D DCT core is embedded in a real-time video processing system. The frame rate for standard HD video at 1920×1080 resolution is ≈1187.35 Hz assuming 8-bit input words and a core clock frequency of 307.787 MHz.
