Abstract-This work describes an approximate DCT architecture for the High Efficiency Video Coding (HEVC) standard. Since the standard requires support for multiple block sizes, architectures based on an exact implementation require a significant amount of hardware resources, namely multipliers and adders. This work aims to reduce the amount of hardware resources while keeping the rate-distortion performance nearly optimal. To achieve this goal, it exploits an exact factorization of the DCT of size N = 8, which is then extended to obtain approximate DCTs of size N = 16 and N = 32. Simulation and implementation results show that the proposed approximate solution achieves a complexity reduction of more than 43% with respect to the exact one, with an average rate-distortion performance loss of 4.74% for the worst-case (all-intra) configuration.
I. INTRODUCTION
Transform coding is an important feature of the High Efficiency Video Coding (HEVC) standard [1], which improves coding efficiency by removing redundancy between residuals after the intra/inter prediction stage. To achieve higher coding efficiency, transform coding in HEVC has been improved at the expense of much higher complexity with respect to H.264/AVC [2]. Indeed, the HEVC standard specifies Discrete Cosine Transform (DCT) block sizes from 4 × 4 up to 32 × 32 [3]. Moreover, the rate-distortion optimization leads to an increased complexity at the encoder side [4], which puts severe throughput requirements on the design of the DCT module of an HEVC encoder.
Therefore, several hardware architectures to compute the variable-size DCT in HEVC have been proposed in recent years. Dias et al. [5] exploited a 2D systolic array to implement the DCT as a matrix-vector multiplication, thus supporting multiple standards. On the other hand, Meher et al. [6] designed an efficient integer DCT architecture for HEVC by relying on the odd-even decomposition of the DCT matrix and by reusing the core N/2-point DCT for the even computation of the N-point DCT. Moreover, to achieve high throughput, such an architecture includes an additional N/2-point DCT unit, so that it computes 32/N N-point DCTs concurrently. However, these approaches require a large amount of hardware resources, as they implement exactly the DCT matrix specified by the HEVC standard [3].
For this reason, approximation has been introduced as a new paradigm to efficiently compute the DCT in video coding applications, by trading complexity for rate-distortion performance loss [7]. Several approximations of the 8-point DCT have been derived by manipulating the coefficients and by simplifying the DCT matrix. A collection of these methods is available in [8]. To extend the transform size from 8 to 32, Jridi et al. [9] proposed a generalized algorithm and a reconfigurable hardware architecture. In particular, their solution relies on factorizing the DCT matrix by using the odd-even decomposition and processing both the even and the odd parts with the approximate 8-point DCT proposed by Cintra and Bayer [10], which requires only 22 additions. However, this approach results in poor rate-distortion performance because of the rough approximation of the core 8-point DCT.
The aim of this work is to explore the design space generated by the adoption of an exact low-complexity factorization of the 8-point DCT to be used as the core module in the generalized algorithm proposed in [9]. Among the different DCT factorizations described in [11], the one proposed by Arai et al. [12] has been chosen in this work, because it needs only 5 multiplications and 29 additions. Thus, this work implements the 8-point Arai-based DCT and exploits it as the core unit for the reconfigurable architecture proposed in [9]. This solution makes it possible to investigate different trade-offs between rate-distortion performance and hardware cost by changing the number of bits used to represent the internal multiplication constants.
The paper is organized as follows. Section II briefly overviews the generalized DCT algorithm for HEVC proposed in [9] and shows the proposed low-complexity fixed-point DCT based on the Arai factorization [12] . The proposed architecture is shown in Section III while Section IV shows the results of the rate-distortion performance analysis and the hardware details. Finally, Section V concludes the paper.
II. APPROXIMATE DCT ALGORITHM
This Section recalls the generalized algorithm proposed in [9] , which is used to approximate the 16-point and the 32-point DCT by employing the inner core 8-point DCT. Moreover, the DCT factorization proposed in [12] is briefly summarized.
A. Generalized DCT Algorithm
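According to [11], the DCT matrix $C_N$ is defined as:

$$[C_N]_{i,j} = \sqrt{\frac{2}{N}}\,\epsilon_i \cos\left(\frac{(2j+1)\,i\,\pi}{2N}\right), \quad i, j = 0, \ldots, N-1 \tag{1}$$

where $\epsilon_0 = 1/\sqrt{2}$ and $\epsilon_i = 1$ for $i > 0$. By applying the odd-even decomposition, the DCT matrix in (1) can be rewritten in the following form:

$$C_N = \frac{1}{\sqrt{2}}\, P_N \begin{bmatrix} C_{N/2} & 0 \\ 0 & S_{N/2} \end{bmatrix} B_N \tag{2}$$

where $C_{N/2}$ is the N/2-order DCT matrix and $S_{N/2}$ is composed of the first N/2 coefficients of the odd rows of $\sqrt{2} \cdot C_N$. The $P_N$ matrix is the alternating permutation matrix defined by $\Phi_N$, which assigns the $\phi_N(k)$-th input to the k-th output (with indices starting from 0):

$$\phi_N(k) = \begin{cases} k/2, & k \text{ even} \\ (N + k - 1)/2, & k \text{ odd} \end{cases} \tag{3}$$

and $B_N$ is the input butterfly matrix, which is defined using $I_N$ and $J_N$, the N-order identity and anti-diagonal identity matrices, respectively:

$$B_N = \begin{bmatrix} I_{N/2} & J_{N/2} \\ I_{N/2} & -J_{N/2} \end{bmatrix} \tag{4}$$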
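The DCT approximation proposed in [9] modifies (2) by replacing $S_{N/2}$ with the approximate N/2-point DCT $\hat{C}_{N/2}$, so that both the even and the odd parts are processed by the same core:

$$\hat{C}_N = \frac{1}{\sqrt{2}}\, P_N \begin{bmatrix} \hat{C}_{N/2} & 0 \\ 0 & \hat{C}_{N/2} \end{bmatrix} B_N \tag{5}$$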
This recursion applies for N = 32 and 16, while for N = 8, $\hat{C}_8$ is the one proposed in [10]. This matrix has been generated by rounding the original DCT matrix in (1) as $\hat{C}_8 = \mathrm{round}(2 \cdot C_8)$, thus showing null multiplicative complexity, since its only constituent elements are 0, 1 or -1. Moreover, it is worth noting that $\hat{C}_N$ is orthogonalizable. Therefore, for each $\hat{C}_N$ it is possible to compute $D_N$ as:
which can be integrated in the quantization process of the video encoder, thus not introducing any additional computation when calculating the DCT.
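As a numerical illustration of this subsection, the following minimal Python sketch checks one level of the odd-even decomposition against the direct DCT and then builds the approximate 16-point and 32-point transforms recursively from the rounded 8-point core. It assumes the orthonormal DCT definition with 0-based indices, a butterfly that produces sums followed by differences, and a 1/√2 scaling at each recursion level; all function names are hypothetical and are not taken from [9] or [10].

```python
import math

def dct_matrix(n):
    """Orthonormal N-point DCT matrix (assumed form of the exact C_N)."""
    return [[math.sqrt(2.0 / n) * (1.0 / math.sqrt(2.0) if i == 0 else 1.0)
             * math.cos((2 * j + 1) * i * math.pi / (2 * n))
             for j in range(n)] for i in range(n)]

def matvec(m, v):
    """Plain matrix-vector product."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]

def butterfly(x):
    """Input butterfly: sums x(j)+x(N-1-j) and differences x(j)-x(N-1-j)."""
    n, h = len(x), len(x) // 2
    sums = [x[j] + x[n - 1 - j] for j in range(h)]
    diffs = [x[j] - x[n - 1 - j] for j in range(h)]
    return sums, diffs

def interleave(even, odd):
    """Output permutation: even-indexed outputs from 'even', odd-indexed from 'odd'."""
    out = []
    for e, o in zip(even, odd):
        out += [e, o]
    return out

def exact_dct_one_level(x):
    """One level of the exact odd-even decomposition."""
    n, h = len(x), len(x) // 2
    c_n = dct_matrix(n)
    # S_{N/2}: first N/2 coefficients of the odd rows of sqrt(2) * C_N
    s_half = [[math.sqrt(2.0) * c_n[2 * i + 1][j] for j in range(h)]
              for i in range(h)]
    sums, diffs = butterfly(x)
    k = 1.0 / math.sqrt(2.0)
    return interleave([k * v for v in matvec(dct_matrix(h), sums)],
                      [k * v for v in matvec(s_half, diffs)])

# Rounded 8-point core: every entry is 0, 1 or -1
C8_HAT = [[round(2.0 * c) for c in row] for row in dct_matrix(8)]

def approx_dct(x):
    """Approximate DCT: the same core processes both the even and the odd part."""
    if len(x) == 8:
        return matvec(C8_HAT, x)
    sums, diffs = butterfly(x)
    k = 1.0 / math.sqrt(2.0)  # assumed per-level scaling
    return interleave([k * v for v in approx_dct(sums)],
                      [k * v for v in approx_dct(diffs)])

def transform_matrix(f, n):
    """Recover the matrix of a linear transform f by applying it to unit vectors."""
    cols = [f([1.0 if j == c else 0.0 for j in range(n)]) for c in range(n)]
    return [[cols[c][r] for c in range(n)] for r in range(n)]

def max_off_diagonal_gram(m):
    """Largest |row_r . row_s| for r != s; zero means the rows are orthogonal."""
    n = len(m)
    return max(abs(sum(a * b for a, b in zip(m[r], m[s])))
               for r in range(n) for s in range(n) if r != s)

if __name__ == "__main__":
    x = [float((7 * k * k + 3 * k) % 23 - 11) for k in range(16)]
    ref = matvec(dct_matrix(16), x)
    err = max(abs(a - b) for a, b in zip(ref, exact_dct_one_level(x)))
    print("exact odd-even decomposition, max error:", err)
    for n in (8, 16, 32):
        m = transform_matrix(approx_dct, n)
        print(n, "max off-diagonal Gram entry:", max_off_diagonal_gram(m))
```

In this sketch the rows of each approximate matrix are mutually orthogonal up to floating-point error, which is the property that allows a diagonal normalization to be moved into the quantizer.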
B. Low-Complexity Arai-Based Factorization
The 8-point DCT factorization proposed by Arai et al. [12] has been considered in this work because of its very low complexity. It is derived by computing only the real part of the first eight output coefficients of the 16-point Discrete Fourier Transform (DFT) factorized using the Winograd algorithm [13]. The DCT inputs x(k) are connected to the DFT inputs X(k) according to the following mapping:
while the output DCT coefficients y(n) are calculated by applying the final normalization:
where Y(n) is the DFT output, $\epsilon_0 = 1/\sqrt{2}$ and $\epsilon_n = 1$ for $n > 0$.
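The link between the 8-point DCT and the real part of a 16-point DFT can be checked numerically. The sketch below assumes one standard input mapping, X(k) = X(15-k) = x(k), and the normalization that follows from it; these are assumptions made for illustration and may differ from the exact mapping and normalization used in [12]. A plain O(N²) DFT is used instead of the Winograd factorization, and the function names are hypothetical.

```python
import cmath
import math

def dct8_direct(x):
    """Orthonormal 8-point DCT computed from its definition."""
    out = []
    for n in range(8):
        eps = 1.0 / math.sqrt(2.0) if n == 0 else 1.0
        acc = sum(x[k] * math.cos((2 * k + 1) * n * math.pi / 16)
                  for k in range(8))
        out.append(0.5 * eps * acc)
    return out

def dct8_via_dft16(x):
    """8-point DCT from the real part of the first eight 16-point DFT outputs."""
    # Assumed mapping: symmetric extension, X(k) = X(15 - k) = x(k)
    big_x = list(x) + list(reversed(x))
    out = []
    for n in range(8):
        # Plain DFT sum (not the Winograd factorization)
        big_y = sum(big_x[m] * cmath.exp(-2j * math.pi * m * n / 16)
                    for m in range(16))
        eps = 1.0 / math.sqrt(2.0) if n == 0 else 1.0
        # Re{Y(n)} = 2 cos(n*pi/16) * sum_k x(k) cos((2k+1) n pi / 16)
        out.append(eps * big_y.real / (4.0 * math.cos(n * math.pi / 16)))
    return out

if __name__ == "__main__":
    x = [3.0, -1.0, 4.0, 1.0, -5.0, 9.0, 2.0, -6.0]
    a, b = dct8_direct(x), dct8_via_dft16(x)
    print(max(abs(u - v) for u, v in zip(a, b)))
```

Under this mapping the per-coefficient factor depends only on n, which is why such a normalization can be kept outside the factorized flow graph.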
In this work, we propose a hardware-oriented implementation of the Arai-based DCT, where all the internal multiplications have been replaced with add-and-shift blocks [14]. Specifically, the sinusoidal factors ($\alpha_i$) defined by the Winograd algorithm (see Table I) have been quantized on $N_q$ fractional bits, thus generating a space of DCT approximations which trade accuracy for hardware complexity. Since $0 < \alpha_i < 2$ for i = 1, ..., 5, only one bit is required to represent the integer part.
In order to analyze the effectiveness of each approximation, the matrix proximity metrics and the transform-related measures defined in [8], [11], namely the error energy (ε), the Mean Square Error (MSE), the transform coding gain ($C_g$) and the transform efficiency (η), have been calculated and compared with those of the exact 8-point DCT, the integer DCT used in HEVC [15] and the low-complexity approximation proposed in [10] and exploited in [9]. Table II reports these accuracy measures together with the arithmetic complexity in terms of number of multiplications and additions. As shown in the table, when more than $N_q = 4$ bits are used to represent the internal coefficients, the fixed-point DCT based on the Arai factorization closely approximates both the exact DCT and the one employed in the reference software of the HEVC standard [15]. Moreover, the CB-2011 approximation, which was adopted in [9], shows the minimum arithmetic complexity while providing worse accuracy measures.
The integer values of the sinusoidal coefficients of the Arai-based DCT have been calculated as $\alpha_i \cdot 2^{N_q}$. Table I lists the real and the integer values of each coefficient, as well as the add-and-shift implementation with $N_q = 8$ [14]. Moreover, it is worth noting that the final normalization in (9) does not affect the structure of the factorization. Therefore, it has been integrated in the quantization step, thus not requiring any additional multiplication in the transform stage.
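The fixed-point scaling and the add-and-shift replacement of a constant multiplier can be sketched as follows. The constant cos(π/4) is used only as an illustrative value in the range (0, 2) and is not taken from Table I; the decomposition shown is a plain binary expansion, which is an assumption, so the add-and-shift structures of Table I and [14] may use fewer adders. The function names are hypothetical.

```python
import math

def quantize(alpha, nq):
    """Integer value of a constant on nq fractional bits: round(alpha * 2^nq)."""
    return round(alpha * (1 << nq))

def shift_add_terms(c):
    """Bit positions set in c; each one becomes a wired left shift plus an adder input."""
    return [b for b in range(c.bit_length()) if (c >> b) & 1]

def multiply_shift_add(x, c, nq):
    """Compute (x * c) >> nq using only shifts and additions."""
    acc = sum(x << b for b in shift_add_terms(c))
    return acc >> nq  # the final right shift is plain wiring in hardware

if __name__ == "__main__":
    alpha = math.cos(math.pi / 4)  # illustrative constant, not a Table I entry
    for nq in (4, 8):
        c = quantize(alpha, nq)
        print(nq, c, shift_add_terms(c), multiply_shift_add(100, c, nq))
```

Lowering the number of fractional bits shortens the list of shifted terms, and therefore the adder count, at the price of a larger error on the product.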
[Fig. 1: proposed 8-point DCT architecture based on the Arai factorization, with inputs x(k) and outputs Y(n).]
III. PROPOSED ARCHITECTURE
The proposed 8-point DCT architecture based on the Arai factorization is depicted in Fig. 1. It is composed of 29 adders/subtracters and 5 integer multipliers, each followed by a right shift of $N_q$ bits. It is worth noting that $N_q$ is fixed; therefore, the shift operation is implemented by simple wiring, thus not incurring additional hardware overhead. Also, since the $\alpha_i$ are constants, the multipliers are simplified into adders and wired shift operations (see Table I).
The adopted reconfigurable architecture, which implements the generalized algorithm proposed in [9], is reported in Fig. 2. This recursive structure applies for $\hat{C}_{32}$ and $\hat{C}_{16}$, where the inner $\hat{C}_8$ is the 8-point DCT architecture of Fig. 1. By adopting this strategy, the architecture is able to concurrently process 32 samples, which are grouped according to the transform size. To this purpose, the architecture makes use of two approximate N/2-point DCT units plus the additional hardware of the N-point butterfly unit, which implements $B_N$, and of the banks of multiplexers used to reconfigure the architecture to support variable-size computation. Specifically, the first bank, placed between the butterfly unit and the approximate DCTs, is used to skip the computation of the butterflies when sel=0, i.e. when two N/2-point DCTs have to be executed in parallel. When sel=1, the results produced by the butterfly unit become the inputs of the two approximate N/2-point DCT cores. On the other hand, the multiplexers at the output implement the permutation defined by the $P_N$ matrix when sel=1; otherwise, they select the outputs of the two N/2-point DCTs without reordering.
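The role of the sel signal can be modelled behaviourally. The following sketch is a software model under stated assumptions, not a description of the actual datapath: the function name and the `core` argument (any callable implementing the N/2-point transform) are hypothetical, the butterfly is assumed to output sums followed by differences, and any per-level scaling is omitted on the assumption that it is absorbed elsewhere.

```python
def reconfigurable_stage(x, sel, core):
    """Behavioural model of one N-point stage built from two N/2-point cores.

    sel = 1: butterfly -> two cores -> output permutation (one N-point transform).
    sel = 0: butterfly and permutation bypassed (two independent N/2-point transforms).
    """
    n, h = len(x), len(x) // 2
    if sel:
        top = [x[j] + x[n - 1 - j] for j in range(h)]     # butterfly sums
        bottom = [x[j] - x[n - 1 - j] for j in range(h)]  # butterfly differences
    else:
        top, bottom = list(x[:h]), list(x[h:])            # input multiplexers bypass
    u, v = core(top), core(bottom)
    if sel:
        out = []
        for e, o in zip(u, v):                            # alternating permutation
            out += [e, o]
        return out
    return u + v                                          # no reordering

if __name__ == "__main__":
    identity_core = lambda v: list(v)  # placeholder core, for illustration only
    print(reconfigurable_stage([1, 2, 3, 4], 1, identity_core))
    print(reconfigurable_stage([1, 2, 3, 4], 0, identity_core))
    # Nesting: a 32-point stage whose cores are 16-point stages
    stage16 = lambda v: reconfigurable_stage(v, 1, identity_core)
    print(len(reconfigurable_stage(list(range(32)), 1, stage16)))
```

Passing a 16-point stage as the core of a 32-point stage mirrors the recursive structure, with an independent sel value at each level selecting the transform size.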
This architecture has been designed to work in a folded structure. As proposed in [6], the folded implementation reuses the one-dimensional DCT unit for computing the two-dimensional DCT. Indeed, the DCT block is fed with 32 samples taken row-wise from the 32/N input data blocks of size N × N in the first N cycles. Then, the output of the DCT block is scaled according to [3] and stored row-wise in a transposition buffer. During the following N cycles, successive columns are read from the transposition buffer and fed to the DCT unit, which produces the final output DCT coefficients. The whole computation needs 2N clock cycles to produce the results of the 32/N blocks of size N × N, independently of the DCT size N, thus resulting in a constant throughput of 16 samples/cycle. Since the DCT block in Fig. 2 is used for both the row-wise and column-wise computations, its inputs and outputs are represented with 16 bits, as specified in [3], while the internal signals have been sized in order to avoid overflow.
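The constant throughput follows from simple counting, as the short sketch below illustrates. It assumes that the 32/N blocks of size N × N are processed together over the 2N cycles of the row and column passes; the function name and the `width` parameter are hypothetical.

```python
def folded_throughput(n, width=32):
    """Samples per clock cycle of a folded 2D DCT with a 'width'-sample 1D unit."""
    blocks = width // n      # N x N blocks processed concurrently
    samples = blocks * n * n # coefficients produced per pass pair
    cycles = 2 * n           # N cycles for the rows plus N cycles for the columns
    return samples / cycles

if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, folded_throughput(n))
```

The ratio simplifies to width/2, so it does not depend on N.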
IV. IMPLEMENTATION RESULTS
As observed in Section II, different $N_q$ values define a design space, which has been explored both in terms of coding efficiency and hardware complexity. In order to assess the rate-distortion performance of the modified encoder [16], the Arai-based factorization of the DCT has been integrated into the HEVC reference software model HM-16.12 [15]. Only the forward transform of sizes from 8 to 32 has been changed, whereas the 4-point DCT and the decoder implement the original HEVC transform, as specified in [3]. Simulations have been performed on all the sequences taken from classes A, B, C, D, E and F according to three encoding configurations, namely All-Intra, Low-Delay and Random-Access main. Quantization parameters equal to 22, 27, 32 and 37 have been used according to [17]. The computational resources were provided by HPC@POLITO (http://www.hpc.polito.it).
Table III reports the Bjøntegaard Delta rate (BD-rate) metric [18], averaged over all the sequences of the same class and measured between the curves obtained by encoding the video sequences with the modified algorithm and with the original partial-butterfly approach used in the HM. The table compares the proposed solutions for $4 \le N_q \le 8$ with the method employed in [9]. As expected, the coding efficiency degrades when lowering the number of bits used to represent $\alpha_i$, thus showing an average loss of about 4.74% in the worst-case configuration (All-Intra). However, the rate-distortion loss is lower than the one achieved in [9] independently of $N_q$.
On the other hand, Table IV lists the clock frequency ($f_{CK}$), the gate count and the power consumption (P) of the proposed architectures when synthesized using a 90-nm standard cell library. As expected, architectures with small $N_q$ achieve higher clock frequencies and reduced power consumption and gate count, thus showing a hardware complexity similar to that of the work in [9] while providing improved rate-distortion performance.
When used in the folded structure and synthesized at the same frequency as [6] ($f_{CK}$ = 187 MHz), the proposed architectures show gate counts of 116.9 K and 107.7 K for $N_q$ equal to 8 and 4, respectively. Thus, they feature a complexity reduction ranging from 43% to 48% with respect to the implementation of the HEVC non-approximated DCT (208 K) [6].
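The quoted range can be reproduced directly from the gate counts reported above; the helper name below is hypothetical.

```python
def complexity_reduction(gates, reference_gates):
    """Relative gate-count saving, in percent, with respect to a reference design."""
    return 100.0 * (1.0 - gates / reference_gates)

if __name__ == "__main__":
    # Gate counts in thousands of gates, as reported in the text
    print(round(complexity_reduction(116.9, 208.0), 1))
    print(round(complexity_reduction(107.7, 208.0), 1))
```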
V. CONCLUSION
This paper has proposed a reconfigurable approximate DCT architecture for HEVC, which exploits the Arai factorization to reduce the hardware cost of the core 8-point DCT. The rate-distortion analysis and the hardware synthesis results show that the proposed implementations outperform the one presented in [9], providing a similar complexity reduction with better rate-distortion performance, and that they reduce the complexity with respect to the exact DCT in [6] at the cost of a small quality loss.
