In this paper, we define a low-complexity 2D-DCT architecture that transforms spatial pixels into spectral coefficients while taking into account the constraints of the considered compression standard. This work is our first attempt to obtain a single reconfigurable multistandard DCT. Thanks to a new matrix decomposition, we define one common 2D-DCT architecture. The constant multipliers can be configured to handle the case of RealDCT and/or IntDCT (multiplication by 2). Our optimized algorithm not only reduces the computational complexity, but also leads to a scalable pipelined design in systolic arrays. Indeed, the 8 × 8 StdDCT can be computed by using the 4 × 4 StdDCT, which can in turn be obtained by calculating the 2 × 2 StdDCT. Besides, the proposed structure can be extended to larger values of N (i.e. 16 × 16 and 32 × 32). The performance of the proposed architecture is better than that of conventional designs. In particular, for N = 4, the proposed design has nearly one third of the area-time complexity of the existing DCT structures. This gain is expected to be higher for larger 2D-DCT sizes.
INTRODUCTION
The use of multimedia data (image and video) and the adoption of embedded devices have increased significantly in recent years. As a result, numerous compression standards have been proposed and validated for different applications: MPEG, 1 H264 2 and HEVC 3 for video compression, and JPEG and JPEG XR for image compression. In order to reduce the heterogeneity of decoding devices and converge to a universal decoder, it is useful to avoid designing image processing algorithms for one specific standard and to promote multistandard designs instead. In particular, the Discrete Cosine Transform (DCT) is used in JPEG and MPEG to reduce spatial redundancies by transforming the spatial domain into the spectral domain. The direct realization (as opposed to the line-column separation) of the 2D-DCT requires 2N^3 multiplications and 2N^2(N − 1) additions, where N × N is the pixel size of the elemental block to be transformed. In order to alleviate the hardware requirements, newer standards such as JPEG XR and H264 have adopted the Integer DCT (IntDCT), which has a multiplierless architecture based on the Hadamard transform.
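The operation counts quoted above can be checked with a short sketch. It assumes that the direct realization is evaluated as two dense N × N matrix products; the function name `direct_2d_dct_op_count` is hypothetical and only serves this illustration.

```python
def direct_2d_dct_op_count(n):
    """Operation count for a 2D-DCT evaluated as two dense N x N matrix products.

    Assumption: no symmetry of the cosine matrix is exploited.
    """
    # One dense product has N^2 outputs; each output needs N multiplications
    # and N - 1 additions.
    mults_per_product = n ** 3
    adds_per_product = n ** 2 * (n - 1)
    return 2 * mults_per_product, 2 * adds_per_product


for n in (4, 8, 16):
    mults, adds = direct_2d_dct_op_count(n)
    print(f"N={n}: {mults} multiplications, {adds} additions")
```

For N = 4 this gives 128 multiplications and 96 additions, and the count grows cubically with N.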
In this paper, in order to keep a certain degree of interoperability between different standards, we propose a first common hardware architecture for the 2D-DCT. We see this work as the first of a series dedicated to obtaining one multistandard DCT circuit. The matrix multiplication used to compute the 2D-DCT coefficients is reformulated in order to expose similarities with the matrices used in the Int-2D-DCT. We conclude that it is possible to move from DCT to IntDCT by maintaining a scale factor and by changing one constant in the matrix representation of the IntDCT. Following these decompositions, we define a new 2D-DCT algorithm, named StdDCT. It is an invertible, standard-adaptive DCT with a butterfly-based architecture that is efficient in terms of Area-Delay product. Moreover, StdDCT is a multiplierless architecture when it is used as IntDCT. When it is used as DCT (Real DCT), it computes the 2D-DCT coefficients with a reduced number of multipliers: 16 constant multipliers for N = 4 instead of 128 with the direct realization. Finally, the above-mentioned scale factor can be folded into the quantization matrices to obtain the exact values of the transformed and quantized coefficients.
The proposed design is scalable and is validated with a Xilinx FPGA implementation for block sizes of 2 × 2, 4 × 4 and 8 × 8. The proposed design offers the same performance in terms of latency and throughput as the IntDCT. Compared with the direct realization of the RealDCT, the proposed architecture has the same latency and saves nearly 87% of the hardware, estimated by the number of slices. In addition, compared with the architecture based on the line-column separation, the proposed design offers a significant gain in maximum operating frequency. These comparisons are performed with N = 4; for greater N (N > 4), the gains in frequency and area are higher.
The remainder of the paper is organized as follows: an overview of the DCT and of fundamental design issues is given in Section 2. The extension of the low-complexity DCT and the functional validation of the proposed DCT are described in Section 3. Finally, a signal flow graph of the proposed DCT is presented in Section 4 before the conclusion.
REVIEW OF THE DISCRETE COSINE TRANSFORM

Definition
The DCT is commonly used in data compression applications due to its high reconstruction capabilities. 4 Indeed, when applied to image and video, the DCT decorrelates each block of input pixels. The energy of correlated images is packed into the low-frequency region (i.e., the top-left region). Consequently, the Inverse DCT (IDCT) can use this region to reconstruct the original image. Given an input sequence {X (n)}, n ∈ [0, N − 1], the N-point DCT is defined as:
where C (0) = 1/ √ 2 and C (n) = 1 if n ≠ 0. The 2D-DCT can be represented by matrix multiplication as given in (2):
where Y are the output coefficients, X is a block of the input image and COS is the cosine matrix used to compute the DCT coefficients. The entries of COS are a = 8 . In the same way, the IDCT can be obtained by:
Note that, for N = 4, the matrix multiplication forms presented in (2) and (3) each require 128 multiplications and 96 additions.
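The matrix form of the forward and inverse transforms can be illustrated with a minimal sketch. It assumes the orthonormal DCT-II normalization (factor √(2/N) with C(0) = 1/√2), which may differ from the exact scaling used in (1). The helper names `dct_matrix`, `dct_2d` and `idct_2d` are hypothetical.

```python
import math


def dct_matrix(n):
    """Orthonormal N-point DCT-II matrix (assumed normalization)."""
    rows = []
    for k in range(n):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        rows.append([
            math.sqrt(2 / n) * ck * math.cos((2 * j + 1) * k * math.pi / (2 * n))
            for j in range(n)
        ])
    return rows


def matmul(p, q):
    """Dense product of two matrices stored as lists of rows."""
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct_2d(x):
    """Forward transform: Y = COS * X * COS^T."""
    cos = dct_matrix(len(x))
    return matmul(matmul(cos, x), transpose(cos))


def idct_2d(y):
    """Inverse transform: X = COS^T * Y * COS (COS is orthogonal here)."""
    cos = dct_matrix(len(y))
    return matmul(matmul(transpose(cos), y), cos)


# Arbitrary 4 x 4 test block, chosen only for illustration.
block = [[52, 55, 61, 66], [70, 61, 64, 73], [63, 59, 55, 90], [67, 61, 68, 104]]
coeffs = dct_2d(block)
restored = idct_2d(coeffs)
print(max(abs(restored[i][j] - block[i][j]) for i in range(4) for j in range(4)))
```

The printed value is the largest reconstruction error, which should be at the level of floating-point rounding.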
Existing implementations
Reducing the computational complexity of the DCT/IDCT is an active topic in both academia and industry. 5 The optimization of the DCT/IDCT has generally focused on reducing the number of required arithmetic operators, and especially the number of multipliers. Indeed, multipliers are the most power- and area-consuming circuits. Moreover, in digital electronic design, multipliers are characterized by a higher latency and are considered the bottleneck for meeting real-time requirements. It has been demonstrated that the theoretical lower limit for an 8-point DCT algorithm is 11 multiplications. In the literature, many fast DCT algorithms are reported, and all of them use the symmetry of the cosine function to reduce the number of multipliers. A summary of these algorithms is presented in 6 . In Table 1 , we list the number of multipliers and adders involved in different DCT algorithms.
As shown in Table 1 , the number of required arithmetic operators remains high. Therefore, many multiplierless DCT algorithms have been introduced for efficient implementation of constant multiplications. These methods can be classified as: the Distributed Arithmetic (DA)-based design 13 and 14 , the New Distributed

As mentioned before, for image and video processing, the DCT is used in its 2D form. One efficient way to calculate the 2D-DCT is the row/column decomposition. For a block size of 4 × 4 pixels, the decomposition consists in running the 1D-DCT 4 times on the rows and then running it 4 times on the columns of the first-pass output. Consequently, the row/column strategy requires an additional transpose memory to save the 1D-DCT outputs. Moreover, the absolute latency, measured as the number of clock cycles needed to obtain the output coefficients, is relatively high. Indeed, for an input block of 4 × 4 pixels, the absolute latency is equal to 15 clock cycles when using the Loeffler or Chen algorithms.
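The row/column strategy described above can be sketched in software as follows. This is only an illustration: it uses a plain 1D-DCT with an assumed orthonormal scaling rather than the Loeffler or Chen fast algorithms, and the intermediate `buffer` merely stands in for the transpose memory.

```python
import math


def dct_1d(x):
    """Plain N-point DCT-II of a sequence (orthonormal scaling assumed)."""
    n = len(x)
    out = []
    for k in range(n):
        ck = 1 / math.sqrt(2) if k == 0 else 1.0
        acc = sum(x[j] * math.cos((2 * j + 1) * k * math.pi / (2 * n))
                  for j in range(n))
        out.append(math.sqrt(2 / n) * ck * acc)
    return out


def transpose(m):
    return [list(col) for col in zip(*m)]


def dct_2d_row_column(block):
    """2D-DCT by row/column decomposition."""
    # Pass 1: one 1D-DCT per row (4 runs for a 4 x 4 block).
    rows_done = [dct_1d(row) for row in block]
    # The transposed intermediate result models the transpose memory.
    buffer = transpose(rows_done)
    # Pass 2: one 1D-DCT per column of the first-pass output.
    cols_done = [dct_1d(col) for col in buffer]
    # Transpose back so that entry [u][v] is the coefficient Y(u, v).
    return transpose(cols_done)


block = [[52, 55, 61, 66], [70, 61, 64, 73], [63, 59, 55, 90], [67, 61, 68, 104]]
for row in dct_2d_row_column(block):
    print([round(value, 2) for value in row])
```

The second pass cannot start on a column until every row has produced the corresponding output, which is why the intermediate storage and the extra latency appear in hardware.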
Low-complexity direct realization of DCT
To eliminate the transpose memory and to reduce the absolute latency, Hallapuro et al. introduced in 19 a low-complexity 2D-DCT design based on the direct realization. Indeed, by means of matrix manipulations, equation (2) can be rewritten as:
where ⊗ denotes element by element multiplication and Y std is expressed by:
Matrix E can be used as scale factor and can be combined with the quantization matrix at the encoder or with the dequantization table at the decoder. Note that d = c/b and E is expressed by:
Compared to the more traditional formula of the DCT presented in (2), Y std calculated with (5) has many trivial operations, such as multiplications by ±1. Hence, the number of multiplications used in (5) is equal to 16, which is far fewer than the 128 multiplications required by (2) . Moreover, Y std can be calculated without a transpose memory. Indeed, matrix C in (5) exhibits many symmetries, which facilitate the representation of Y std by a butterfly-based signal flow graph. Another advantage of the representation given in (5) is the possibility of proposing a generalized DCT for multiple standards. Indeed, matrix C of (5) can be used in the MPEG or HEVC standards. This matrix is composed of ±1 and d = √ 2 − 1 = 0.4142 entries. Coefficient d can be substituted by 1/2. In terms of hardware complexity, the multiplication by 1/2 can be implemented by means of a shifter.
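The factorization discussed above can be checked numerically with the following sketch. The values of a, b and c are an assumption (the entries of the standard orthonormal 4-point DCT matrix), and E is built here as the outer product of the row scale factors (a, b, a, b), which is consistent with d = c/b but is not copied from the source equations.

```python
import math

# Assumed entries of the 4-point cosine matrix (orthonormal DCT-II).
a = 0.5
b = math.sqrt(0.5) * math.cos(math.pi / 8)
c = math.sqrt(0.5) * math.cos(3 * math.pi / 8)
d = c / b  # numerically equal to sqrt(2) - 1

COS = [[a, a, a, a], [b, c, -c, -b], [a, -a, -a, a], [c, -b, b, -c]]
C_STD = [[1, 1, 1, 1], [1, d, -d, -1], [1, -1, -1, 1], [d, -1, 1, -d]]
# Row k of COS equals SCALE[k] times row k of C_STD.
SCALE = [a, b, a, b]
E = [[SCALE[u] * SCALE[v] for v in range(4)] for u in range(4)]


def matmul(p, q):
    return [[sum(p[i][k] * q[k][j] for k in range(len(q)))
             for j in range(len(q[0]))] for i in range(len(p))]


def transpose(m):
    return [list(col) for col in zip(*m)]


def triple_product(m, x):
    """Compute M * X * M^T."""
    return matmul(matmul(m, x), transpose(m))


def dct_via_std(x):
    """Y = (C * X * C^T) element-wise multiplied by E."""
    y_std = triple_product(C_STD, x)
    return [[y_std[u][v] * E[u][v] for v in range(4)] for u in range(4)]


block = [[52, 55, 61, 66], [70, 61, 64, 73], [63, 59, 55, 90], [67, 61, 68, 104]]
y_ref = triple_product(COS, block)
y_fact = dct_via_std(block)
max_err = max(abs(y_fact[u][v] - y_ref[u][v]) for u in range(4) for v in range(4))
print(f"d = {d:.4f}, largest difference = {max_err:.2e}")
```

Because E only rescales each coefficient, it can be deferred to the quantization stage, leaving ±1 and d as the only constants in the transform itself.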
Then, in order to avoid truncation errors, the authors of 19 proposed to scale the C matrix by 2. The forward transform becomes:
Under these conditions, matrix E is expressed by:
EXTENSION AND FUNCTIONAL VALIDATION
Extension
The DCT presented in 19 is defined for a block size of 4 × 4 pixels. In order to support all video coding standards on a single platform, it becomes necessary to develop a generalized DCT architecture. Equation (7) is devoted to computing the DCT for the H264 video coding standard as well as the JPEG XR standard. However, for the HEVC and JPEG standards, the pixel block to be transformed has a size of 8 × 8. Under these conditions, matrix COS presented in (2) is updated according to (9) :
where a = cos(π/16), b = cos(2π/16), c = cos(3π/16), d = cos(4π/16), e = cos(5π/16), f = cos(6π/16) and g = cos(7π/16). As in (5), we calculate matrix C in order to compute Y for block size of 8 × 8.
where Y std = C × X × C^T and matrices B, C and E are defined by:

[Matrices B, C and E: only fragments of two rows survive, "e/c a/c g/c −1" and "c/e −c/e −g/e a/e −1".]

To obtain a low-complexity architecture, we calculate Y std instead of Y . Note that the element by element multiplication with matrix E will be performed on the quantization side. Also, we would like to underline that C × C^T is proportional to the identity matrix, which means that the IDCT can be obtained by the same equation as in (3):
Functional validation
In this section, we analyse the effects of the proposed matrix rewriting on the quality of reconstructed images. Simplified block diagrams of the proposed compression schemes are presented in Figure 1 . Figure 1.(a) shows the data flow graph (DFG) with 64-bit floating-point precision; it is considered as the theoretical DFG. On the other hand, Figures 1.(b) and 1.(c) show DFGs with fixed-point precision, respectively for the theoretical matrix representation (equations (2) and (9)) and the optimized matrix representations (equations (4) and (10)).
Accordingly, the 2D-DCT of 4 × 4 or 8 × 8 blocks of the image is performed to decorrelate each block of input pixels. The DCT coefficients are then quantized to represent them in a reduced range of values using a quantization matrix. Finally, the quantized components are scanned in a zigzag order, and the encoder employs run-length encoding (RLE) and Huffman coding for entropy coding. Remember that the quantized 
where ⌊x⌋ denotes the nearest integer less than or equal to x, Q s (u, v) are the elements of the quantization matrix given by the standard, Q p ∈ [1 : 31] is the quantization parameter and Q (u, v) = Q s (u, v) × Q p is the quantization matrix used for a given value of Q p . Note that quantizer scaling does not affect the quantization of the DC coefficient.
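The quantization step can be sketched as below. Several points are assumptions made for illustration: the quantized value is taken as a plain floor of Y(u, v)/Q(u, v) with no rounding offset, the DC coefficient is divided by Q s (0, 0) alone, and the optional argument `e` folds a scale matrix E into the same step, following the earlier remark that E can be combined with the quantization matrix. The function name and the numeric values are hypothetical.

```python
import math


def quantize(y, qs, qp, e=None):
    """Quantize coefficients y with matrix qs and parameter qp.

    Assumptions: floor quantization; the DC step is qs[0][0] (not scaled by
    qp); if e is given, each coefficient is first multiplied by e[u][v].
    """
    if not 1 <= qp <= 31:
        raise ValueError("qp must lie in [1, 31]")
    out = []
    for u, row in enumerate(y):
        out_row = []
        for v, coeff in enumerate(row):
            step = qs[u][v] if (u, v) == (0, 0) else qs[u][v] * qp
            scaled = coeff * e[u][v] if e is not None else coeff
            out_row.append(math.floor(scaled / step))
        out.append(out_row)
    return out


# Hypothetical 2 x 2 values, chosen only to exercise the function.
coeffs = [[100, -33], [40, 7]]
flat_qs = [[16, 16], [16, 16]]
print(quantize(coeffs, flat_qs, qp=2))
```

Passing `e` lets the transform stage output unscaled values while the quantizer still produces correctly scaled results.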
For our case, when we use Y std to compute DCT coefficients, (15) is updated to include E:
Simulation results are obtained using Matlab and are reported in Figures 2 and 3. The PSNR decreases when the Q p parameter increases, which is consistent with the quantization process. Moreover, the deterioration in PSNR caused by the architectural optimization is less than 1 dB. In some cases, the PSNR obtained with the optimized DCT is higher than that obtained with the Matlab DCT. To sum up, the proposed DCT has no significant effect on the image quality.
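For reference, the PSNR metric used in this comparison can be computed as in the sketch below. It assumes 8-bit images (peak value 255) stored as lists of rows; the function name `psnr` is chosen here and is not taken from the Matlab scripts used in the experiments.

```python
import math


def psnr(original, reconstructed, peak=255.0):
    """Peak signal-to-noise ratio in dB between two equally sized images."""
    squared_error = 0.0
    count = 0
    for row_o, row_r in zip(original, reconstructed):
        for pix_o, pix_r in zip(row_o, row_r):
            squared_error += (pix_o - pix_r) ** 2
            count += 1
    mse = squared_error / count
    if mse == 0:
        return math.inf  # identical images
    return 10 * math.log10(peak ** 2 / mse)


reference = [[10, 20], [30, 40]]
degraded = [[11, 21], [31, 41]]  # every pixel off by one
print(f"{psnr(reference, degraded):.2f} dB")
```

A coarser quantization step increases the mean squared error and therefore lowers the PSNR, which matches the trend reported above.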
DCT DESIGN
The low-complexity architecture of the DCT can be readily represented by a butterfly structure. As shown in Figure 5 , the 4 × 4 2D-DCT is transpose-memory free and has a regular structure which fits well with FPGA implementation. The 2D-DCT is computed by means of two types of butterflies. The first one uses one addition and one subtraction, while the second butterfly uses the same resources along with two constant multipliers.
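The two butterfly types can be illustrated in software with a 4-point, one-dimensional sketch. This is an assumption-laden illustration rather than the structure of Figure 5: it evaluates the product of the 4 × 4 matrix C (entries ±1 and d) with one input vector, whereas the figure describes a direct 2D structure. The function names are hypothetical.

```python
import math

D_REAL = math.sqrt(2) - 1  # constant d for the RealDCT case
D_INT = 0.5                # substituted value; a shift in hardware


def butterfly_add_sub(p, q):
    """First butterfly type: one addition and one subtraction."""
    return p + q, p - q


def butterfly_scaled(p, q, d):
    """Second butterfly type: one addition, one subtraction and two
    multiplications by the constant d."""
    return p + d * q, d * p - q


def std_dct_1d_4(x, d=D_REAL):
    """Unscaled 4-point transform, i.e. the product C * x."""
    s0, d0 = butterfly_add_sub(x[0], x[3])
    s1, d1 = butterfly_add_sub(x[1], x[2])
    y0, y2 = butterfly_add_sub(s0, s1)
    y1, y3 = butterfly_scaled(d0, d1, d)
    return [y0, y1, y2, y3]


print(std_dct_1d_4([1, 2, 3, 4], d=D_INT))
```

Each 4-point pass uses three add/sub butterflies and one scaled butterfly, so it needs only two constant multiplications. Eight such passes cover a 4 × 4 block, which agrees with the 16 constant multipliers quoted earlier.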
Hence, when we use the matrix multiplication according to equation (2) we consume 128 multiplications and 
CONCLUSION
In this paper, we have presented a low-complexity DCT for image and video compression. We extended an existing work to a block size of 8 × 8 so that it can be used for multiple standards. We showed that the proposed DCT is hardware-implementation friendly. Then, we demonstrated that the proposed design consumes fewer hardware resources. Moreover, we showed that the modification of the matrix representation does not affect the image quality. Nevertheless, this work is our first attempt in the domain of multistandard encoders. In our future work, we aim to present an automatic methodology to extend the matrix reformulation to larger block sizes. We also expect to present a generalized DCT design for many encoding standards. FPGA implementation results will be given in our future work.
