I. INTRODUCTION
The High Efficiency Video Coding (HEVC) standard is the latest video coding standard jointly developed by the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group (MPEG) [1]. According to [2], the HEVC standard achieves a bitrate reduction of about 50% while maintaining the same visual quality as the previous Advanced Video Coding (AVC) standard. To achieve this saving, the standard exploits a large number of new features and tools, such as new structures for recursive partitioning, new intra- and inter-prediction modes, and larger transform unit sizes. However, the resulting improvement in video compression comes at the expense of an increase in encoder complexity of about 40%-70% with respect to AVC, due to the exploration of a large space of possible encoder decisions [3], [4]. Moreover, the challenge of designing area- and power-efficient hardware modules becomes more evident for devices such as mobile phones and cameras, where the chip has to incorporate many functions and the battery lifetime is limited. In particular, one of the key features of HEVC is the variable-size transform computation [5]. The standard exploits the discrete cosine transform (DCT) [6], which can be applied to blocks of N × N samples, where N can be 4, 8, 16, or 32. Some existing works have proposed hardware architectures for the DCT computation in the context of image and video compression, including the HEVC standard [7]. These works provide both exact and approximated DCT computations, where quality is traded for a reduction of the computational complexity. Among the architectures proposed in the literature for HEVC transforms, the one introduced by Shen et al. [8] uses multiple constant multiplication (MCM) for the four-point and eight-point DCT, while it adopts shared multipliers for DCTs of larger size. Park et al.
[9] have exploited Chen's factorization [10] of the DCT, implementing each butterfly operation by means of multiplierless processing elements. Zhu et al. [11] proposed a pipelined unit that is capable of computing the forward and inverse DCT as well as the Hadamard Transform (HT) by reusing small-size transform hardware for larger-size ones. Similarly, Budagavi et al. [12] exploited symmetry properties of the forward and inverse HEVC transform matrices to allow resource sharing in a unified architecture. Meher et al. [13] proposed a flexible architecture that is capable of computing the DCT for any of the four different N values with the same throughput. Their work exploits the partial butterfly approach, where the 1D N-point DCT can be calculated recursively by means of an N/2-point DCT and an N/2 × N/2 matrix multiplication. Another type of hardware architecture is the one proposed by Ahmed et al. [14], which is inspired by the factorization proposed in [15], [16], i.e., the Walsh-Hadamard transform (WHT) followed by a set of Givens rotations. In [17]-[19], an algebraic integer-based scheme, which exploits Arai's factorization [20], is proposed for the exact computation of the 8×8 DCT. Among the approximated solutions, Bouguezel et al. [21], [22] provided a parametric eight-point DCT matrix, which allows different low-complexity transforms to be generated, and also defined approximated DCT matrices for N > 4 [23]. Cintra and Bayer [24] obtained the eight-point transformation matrix by rounding off each entry of the original DCT matrix to 0 or 1. In [25], some entries of the DCT matrix in [24] have been set to 0, thus producing an approximated DCT that requires only 14 additions. Starting from the matrix of [25], Cintra et al. [26] applied frequency-domain pruning to obtain an approximated DCT.
In particular, they removed the signal flows related to those DCT coefficients that are likely to be discarded by the quantization step of the video compression. All the previous 8×8 DCT approximations have been implemented and compared by Potluri et al. in [27], where a novel eight-point DCT matrix has also been introduced.
As pointed out in [13] , since the HEVC standard supports DCT of different sizes, a flexible and reusable architecture is required. Furthermore, the challenge to design optimized architectures for area and power has encouraged us to define a new method to trade dynamic power for small losses on the reconstructed video quality. Unlike [21] - [27] , where only fixed approximations of the transform matrix of size 8 × 8 are shown, we propose a content-adaptive DCT scheme, which is applied to all the DCT sizes defined in HEVC. Moreover, four operating modes are defined in order to allow the designer to select a different tradeoff between video quality and power consumption.
The contributions of this work are: 1) the statistical analysis of the DCT usage during the encoding process; 2) the description of a new algorithm to dynamically choose which rotations will be performed; 3) the design of two flexible architectures, which have been sized resorting to the calculated statistics and a practical method to select a proper folding degree for a target application.
The rest of this paper is organized as follows. In Section II, the adopted DCT factorization over different lengths is briefly reported. In Section III, the results of the statistical analysis are shown and the proposed operating modes are defined. Section IV reports the results of the architectural space exploration achieved by resorting to the folding technique [28] and proposes two hardware implementations of the DCT: the first one is designed to achieve the highest throughput, while the second one improves resource utilization and reduces the required area. Finally, Section V reports the synthesis and power estimation results, while Section VI concludes this paper.
II. WHT-BASED DCT

A. DCT Factorization
According to the property of separability, the 2D DCT of a matrix of size N × N pixels (2D-DCTN) can be decomposed into two 1D DCTs of length N, which are performed row-wise and column-wise (or vice versa). Therefore, in the following, only the 1D DCT (1D-DCTN) is addressed. According to [29], the 1D-DCTN can be computed as

$$X = C_N \, x \qquad (1)$$

where $X = (X_0, \ldots, X_{N-1})$ is the column vector of output results, $x = (x_0, \ldots, x_{N-1})$ is the column vector containing input samples, and $C_N$ is the DCT matrix. As suggested in [14], the DCT matrix can be factorized as
where $W_N$ is the Walsh-ordered WHT matrix, which is generated by applying the bit reverse and Gray coding ordering to the row indices of the $N$th-order Hadamard matrix
where $H_1 = 1$. The other two matrices in (2) are the bit-reversal matrix $B_N$ and a block diagonal matrix $T_N$, which contains the Givens rotations. The latter can be defined through the following recursion:
where $T_2$ is the identity matrix of size 2, and $U_{N/2}$ is the product of two permutation matrices and the Givens rotation matrices, namely
with $m = \log_2 N + 1$, $3 \le q \le m$ and
The Givens rotation matrices in (5) are defined for $3 \le q \le m$, and they are composed of $r = m - q + 1$ submatrices placed on the diagonal
where
and $p$ is an odd positive integer lower than $N/2^r$. The coefficients $c_{p,q}$ and $s_{p,q}$, which are placed in a concentric square way, perform plane rotations and the rotation angle is identified by the couple of indices $(p, q)$ as
Noticeably, each rotation can be decomposed into three lifting steps [30] to reduce the computational complexity of the algorithm
where $P_\theta = (1 - \cos\theta)/\sin\theta$ and $U_\theta = -\sin\theta$.
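As a numerical sanity check of the lifting decomposition, the following Python sketch applies three lifting steps built from the two coefficients defined above and compares the result with a direct plane rotation. The step ordering (P, U, P), the sign convention of the rotation matrix, and the function names are assumptions made for illustration only; they are not taken from the paper.

```python
import math

def rotate_direct(x1, x2, theta):
    """Plane rotation, assuming the convention [[c, s], [-s, c]]."""
    c, s = math.cos(theta), math.sin(theta)
    return c * x1 + s * x2, -s * x1 + c * x2

def rotate_lifting(x1, x2, theta):
    """Same rotation as three lifting steps (one multiply and one add each).

    Not valid when sin(theta) == 0, since P_theta is then undefined.
    """
    p = (1.0 - math.cos(theta)) / math.sin(theta)  # P_theta
    u = -math.sin(theta)                           # U_theta
    x1 = x1 + p * x2  # first lifting step
    x2 = x2 + u * x1  # second lifting step
    x1 = x1 + p * x2  # third lifting step
    return x1, x2

if __name__ == "__main__":
    print(rotate_direct(3.0, -5.0, 0.3))
    print(rotate_lifting(3.0, -5.0, 0.3))
```

Under these assumptions, the two functions agree up to floating-point rounding, while the lifting form needs three multiplications instead of four.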
B. Hardware Oriented Optimization
The computational complexity of a 1D-DCTN, factorized by means of the WHT, is determined by
butterfly operators, for the HT, and by
Givens rotations. One butterfly is composed of two adders, while one rotation is implemented by means of the lifting scheme, which is composed of three stages, as shown in (10), each of which requires one multiplication and one addition. According to the approach suggested in [33] , lifting coefficients are expressed as
The first architecture proposed in this paper relies on an unfolded 1D-DCT data flow, which takes advantage of the $a$ and $b$ values to simplify the multipliers required in the lifting scheme by exploiting the reduced adder graph (RAG-n) technique [34]. By representing the coefficients with $n = 8$ bits, we first evaluated the matrix proximity metrics and the transform-related measures defined in [24], [29], namely, the error energy ($\epsilon$), the mean square error (MSE), the coding gain ($C_g$), and the transform efficiency ($\eta$), for the 1D-DCT with $N = 8$. As shown in the first part of Table I, the proposed fixed-point DCT closely approximates both the exact DCT and the one adopted in the HEVC reference software [31]. Moreover, as expected, the WHT is less accurate than the proposed DCT. The values assumed by $a$ and $b$, as well as the number of adders and shifters required to implement each coefficient through the RAG-n representation, are reported in Table II.
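To illustrate how a lifting coefficient can be mapped to a pair of integers, the following Python sketch quantizes a real coefficient to n fractional bits and reduces it to an odd numerator over a power of two. The dyadic form a / 2^b is an assumption made here for illustration, since the exact expression used in the paper is not reproduced in this text; the function name `to_dyadic` is hypothetical, and the values in Table II may have been derived differently.

```python
import math

def to_dyadic(value, n_bits=8):
    """Approximate value by a / 2**b (assumed form), with at most n_bits fractional bits."""
    b = n_bits
    a = round(value * (1 << b))
    # Strip common factors of two so that a is odd (or zero).
    while a != 0 and a % 2 == 0 and b > 0:
        a //= 2
        b -= 1
    return a, b

if __name__ == "__main__":
    theta = math.pi / 4
    p_theta = (1.0 - math.cos(theta)) / math.sin(theta)
    a, b = to_dyadic(p_theta)
    print(a, b, a / (1 << b), p_theta)
```

With this form, each multiplication reduces to a multiplication by the integer a, which a technique such as RAG-n can map to adders and shifts, followed by a right shift of b bits.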
C. DCT Algorithm Comparison
Since this paper deals with DCT complexity reduction and performance tradeoffs, it is important to compare the WHT-based DCT with other exact and approximated algorithms. To the best of our knowledge, the architecture proposed in [13] is the best performing one for the HEVC standard, being able to support all the DCT sizes. Table III compares the computational complexity, in terms of multiplications, additions and shifts, of the 1D-DCTN algorithm in [13] with the WHT-based one, proposed in [14] , and its multiplierless version, obtained by applying the RAG-n technique to the lifting scheme coefficients. As it can be observed, the WHT-based factorization requires less multiplications than the partial butterfly implementation for every DCT size, especially for large sizes. On the other hand, considering multiplierless implementations, the MCM-based one is better than the WHT-based one for all the DCT sizes, excepting the case N = 32. However, since the WHT can be seen as a very simple approximation of the DCT, the WHT-based DCT features the possibility to adapt the accuracy of the computation to current data, being an interesting alternative for hardware implementation. This property is exploited in this paper to design content-based approximated DCTs.
It is known that approximated DCT algorithms trade hardware complexity for accuracy. As discussed in Section II-B and shown in the first part of Table I, the proposed fixed-point implementation of the WHT-based DCT features excellent matrix proximity and transform-related accuracy. As a consequence, it is important to assess the complexity-accuracy tradeoff of approximated solutions to make a fair comparison. The second part of Table I extends the matrix proximity metrics and transform-related measures to the solutions proposed in [21], [22], [24], [25], [27]. As can be observed, the solutions in [21], [22], [24], [25], [27] obtain a lower arithmetic complexity than the proposed WHT-based DCT at the cost of lower accuracy. Thus, in the following sections, only the most accurate and the two least complex ones, namely [24], [25], and [27], will be further considered.
III. DYNAMIC DCT APPROXIMATION
In this Section, we propose a modified algorithm that leads to relevant rotation reduction. Since the DCT factorization, described in Section II-A, allows to compute the DCT results by means of the WHT and the following rotation scheme, the key-idea of this modified algorithm is to explore different approximations of the DCT by adaptively reducing the number of rotations to be computed. This space goes from the WHT, where all the Givens rotations are skipped, to the complete DCT where all the computation is performed. It is worth pointing out that the results obtained by removing rotations are approximated. However, in HEVC, the coefficients produced by the DCT are quantized, so the injected quantization noise partially hides the accuracy degradation introduced at the transform step. Therefore, if a small quality loss is considered acceptable, then it is unnecessary to compensate the DCT approximation.
In order to determine whether a rotation has to be applied, a precomputation mechanism is adopted. The idea is to compare the two inputs of one rotation unit with a threshold. Then, the rotation is skipped when the magnitude of both inputs is lower than the threshold or the special signal SKIP is asserted. The effectiveness of this method strongly depends on a proper choice of the thresholds. In this paper, we analyze the general approach, which allows to identify appropriate thresholds. Then, four different tradeoffs between computation saving and rate-distortion performance have been determined on the basis of the results of such analysis.
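The precomputation mechanism described above can be sketched in a few lines of Python. This is a behavioral model only, under the assumption that a skipped rotation passes its two inputs through unchanged; the function name `rotation_unit` and the `rotate` callback are hypothetical.

```python
def rotation_unit(x1, x2, threshold, skip, rotate):
    """Return (y1, y2, computed) for one rotation unit.

    The rotation is skipped when SKIP is asserted or when the magnitude
    of both inputs is lower than the threshold. When skipped, the inputs
    are assumed to be forwarded unchanged.
    """
    if skip or (abs(x1) < threshold and abs(x2) < threshold):
        return x1, x2, False
    y1, y2 = rotate(x1, x2)
    return y1, y2, True

if __name__ == "__main__":
    # Stand-in rotation used only to make the bypass visible.
    swap = lambda a, b: (b, a)
    print(rotation_unit(3, -5, 16, False, swap))   # both inputs below 16
    print(rotation_unit(3, -20, 16, False, swap))  # one input above 16
    print(rotation_unit(3, -20, 16, True, swap))   # SKIP asserted
```

Note that a threshold of 0 never triggers the bypass, because no magnitude is lower than 0, so every rotation is computed.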
A. Experimental Setup
All the simulations have been performed by encoding 12 high-resolution video sequences taken from the set of sequences employed during the HEVC standardization process and referred to as common test conditions (CTC) [35]. These sequences belong to three classes, which differ in terms of resolution, characteristics of the content, and application, as reported in Table IV. According to the CTC [35], each class can be encoded with different configurations: all intra (AI), low delay (LD), and random access (RA). The AI configuration encodes the video as a sequence of intra frames. The LD configuration is used for interactive applications such as videoconferencing. It is worth noting that it uses only B frames that reference previous pictures, in order to avoid the delay due to the encoding computation. The RA configuration is related to entertainment applications and allows decoding to start from different points in the sequence. This feature is achieved by using a hierarchical Group of Pictures (GOP) structure made of both I-frames and B-frames.
All the simulations shown in this paper have been performed with the HEVC reference software HM 8.0 [31], which has been modified with the introduction of our DCT and other approximations taken from the literature [32]. Four quantization parameters (QP) were fixed, namely, 22, 27, 32, and 37, as suggested in [35]. Finally, rate-distortion curves, which use the combined peak signal-to-noise ratio $\mathrm{PSNR}_{YUV}$, as defined in [2], were used as the quality measure. The Bjøntegaard method [36] for calculating objective differences ($\Delta$PSNR and $\Delta$Rate) between rate-distortion curves has been used as the metric for evaluating quality loss.
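For reference, the combined PSNR is commonly computed as a weighted average of the per-component PSNR values, with the luma component weighted six times more than each chroma component. The following Python sketch assumes that this is the definition adopted from [2]; the function name is hypothetical.

```python
def psnr_yuv(psnr_y, psnr_u, psnr_v):
    """Combined PSNR in dB, assuming the (6*Y + U + V) / 8 weighting."""
    return (6.0 * psnr_y + psnr_u + psnr_v) / 8.0

if __name__ == "__main__":
    print(psnr_yuv(40.0, 42.0, 44.0))
```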
B. DCT and Rotation Statistics
From now on, the 2D-DCTN will be referred to as DCTN for brevity. In order to limit quality loss, a statistical analysis of which DCTN are used is required. This information is crucial to understand which DCT's rotations mainly contribute to the quality degradation, as well as to calculate the average throughput of the proposed architectures. As an example, Table V reports the usage statistics of each DCT and the corresponding percentage of rotations for three sequences, taken from different classes, encoded with the configurations specified in the CTC [35]. As can be observed, simulation results with the LD or RA configurations show that all the sequences exhibit similar usage percentages: the most used DCT is the DCT4, with almost 70% of the total count, followed by the DCT8 with about 20% and the DCT16 with 5%. The least used one is the DCT32, with a percentage below 1%. The values are slightly different when the AI configuration is employed; in this case, the count for the DCT4 decreases to about 58%, while the share of larger transforms increases, especially the DCT8, which passes from about 20% to about 34%. The mismatch between the statistics of the AI configuration and those of the LD and RA configurations is due to the different performance of intra and inter prediction, which can remove some TU partitioning from the exhaustive search set.
On the other hand, the highest number of rotations belongs to the DCTs of size 16 and 32, for which it grows more than linearly, as indicated in (12) . Together, they cover approximately 70% of the total, and the remaining 30% is due to the DCT4 (about 7%) and the DCT8 (about 23%). Therefore, since large-size DCTs require higher computational effort than small-size ones, it is likely that most of saved operations and quality loss, are due to DCT16 and DCT32. However, some rotations are used across more than one DCT, so the effect on the PSNR of one Givens rotation depends on both the DCT size and the rotation angle. Thus, Table VI shows the number of rotations per angle to compute each DCTN, where p is an odd integer lower than N/2.
Results in Tables V and VI highlight that the contribution of each rotation to both computational complexity and quality loss depends on the angle and DCT size. Therefore, different thresholds can be assigned to the same rotation module depending on the working conditions.

Fig. 1. $\Delta$PSNR and saved-rotation curves over the threshold value, averaged on all the sequences encoded with AI.
C. Operating Mode Definition
The proposed method relies on the values assigned to the thresholds. Since the proposed transform module supports four DCT sizes, with many rotation angles, the threshold set $T$ contains up to 26 elements, each of which is associated with one rotation angle in one DCT size. The general optimization problem, which determines the optimal set of thresholds, can be written as
where $T^*$ is the optimal set of thresholds, and the two functions of $T$ model the HEVC encoding process and represent the computational saving and the quality loss, respectively. The maximum allowable quality loss $L$ is measured using the Bjøntegaard difference on the $\mathrm{PSNR}_{YUV}$.
In addition, we constrain the threshold values to be powers of 2, in order to simplify the hardware implementation. Since an exhaustive search for a solution is computationally too complex, we reduced the design space by using one threshold $t$ for all the rotation angles, i.e., $T = t$. Fig. 1 shows the PSNR difference and the percentage of saved rotations, averaged over all the test sequences encoded with the AI configuration, for threshold values $t$ in the range from 0 to 2048 and in the case of SKIP signal assertion. As expected, both the quality loss and the rotation saving increase with the threshold value up to the limit fixed by asserting the SKIP signal, where all the rotations are skipped and the DCT is approximated by the WHT only. As can be observed, by choosing values smaller than 8 or larger than 128, the quality loss and the rotation saving are close to those of either the full DCT or the WHT, respectively. Therefore, only the results for thresholds within this range are reported in the left part of Table VII, which shows the quality loss and the rotation reduction, averaged over the video sequences reported in Table IV, taken from each class (A, B, and E) separately or together (All), and encoded with the experimental setup described in Section III-A. Notably, the approximations with low threshold values exhibit negligible performance loss with respect to the complete DCT. Nevertheless, on average, they reduce the computational effort of the lifting scheme by more than 40%. Then, the quality loss and the rotation saving grow up to $t = 64$. From that point, a further increase of the threshold leads to large quality degradation without a significant reduction in terms of complexity. This behavior is observed for all the encoding configurations. Focusing on the quality loss, the AI configuration, which employs only I-frames, is more sensitive to DCT approximation than the LD and RA configurations.
This effect is due to the fact that inter prediction provides better performance than intra prediction, by producing smaller residuals and DCT coefficients, which are likely to be quantized near zero. Therefore, the approximation of the larger residuals generated by intra prediction leads to larger quality loss. Stemming from the results reported in Fig. 1 and Table VII, we have selected two of the proposed tradeoffs as operating modes for our DCT modules, in addition to the full DCT and the WHT. The four operating modes are defined as follows.
1) MODE0 is the complete DCT. It computes all the rotations ($t = 0$) and no power saving is achieved.
2) MODE1 is the first approximation. The maximum PSNR loss is fixed to 0.02 dB. It leads to a minimum reduction in the rotation computation of 37% with $t = 16$.
3) MODE2 is the second approximation. The maximum PSNR loss is 0.1 dB, corresponding to a minimum rotation reduction of about 55% with $t = 32$.
4) MODE3 is the WHT computation only. It leads to the maximum quality loss (less than 0.8 dB), but also to the maximum saving (100%) using the SKIP signal.
It is worth noting that the operating modes can be redefined by the designers depending on the application by selecting different tradeoffs among the ones reported in Table VII. Moreover, the proposed method can also be used in conjunction with other techniques, such as zero block detection algorithms [37], to further reduce the complexity of the encoder at the cost of a small quality degradation.
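The four operating modes can be summarized as a small configuration table that maps each mode to a threshold and a SKIP flag. The following Python sketch is one possible software model of that mapping; the names `Mode`, `MODE_SETTINGS`, and `rotation_is_computed` are hypothetical, and the threshold stored for MODE3 is a placeholder, since SKIP overrides it.

```python
from enum import Enum

class Mode(Enum):
    MODE0 = 0  # complete DCT
    MODE1 = 1  # first approximation
    MODE2 = 2  # second approximation
    MODE3 = 3  # WHT only

# (threshold t, SKIP asserted); thresholds are the powers of two quoted above.
MODE_SETTINGS = {
    Mode.MODE0: (0, False),
    Mode.MODE1: (16, False),
    Mode.MODE2: (32, False),
    Mode.MODE3: (0, True),  # threshold is irrelevant when SKIP is asserted
}

def rotation_is_computed(mode, x1, x2):
    """True when a rotation with inputs (x1, x2) is computed in the given mode."""
    threshold, skip = MODE_SETTINGS[mode]
    if skip:
        return False
    return not (abs(x1) < threshold and abs(x2) < threshold)

if __name__ == "__main__":
    for mode in Mode:
        print(mode.name, rotation_is_computed(mode, 20, -3))
```

A designer who picks a different tradeoff from Table VII would only need to change the threshold entries of this table.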
D. DCT Approximation Comparison in HEVC
To compare our DCT approximations with other ones proposed in the literature, we implemented in the HEVC reference software [31] the CB-2011 [24], the Modified CB-2011 [25], and the Improved Modified CB-2011 [27] 8×8 DCT algorithms, which are, respectively, the most accurate and the two least complex among the previous approximations presented in Table I. Table VIII reports the average $\Delta$PSNR and $\Delta$Rate with reference to the complete DCT, calculated on the rate-distortion curves of all the video sequences encoded with two configurations. The first one is the default configuration, while the second one is derived from the default configuration by limiting the maximum transform size to 8 × 8. As shown in Table VIII, with the default configurations, the rate-distortion performance of our MODE1 approximation is the best one, whereas that of MODE3 is the worst one. Indeed, in this configuration, the results achieved with the methods in [24], [25], and [27] are affected by DCT8 approximations only. In order to make a fair comparison, we analyzed the custom configuration with the maximum TU size limited to 8×8. As can be observed, the behavior of our proposed DCT is independent of the transform size, due to its inherent adaptivity. On the other hand, the methods taken from the literature show some degradation of rate-distortion performance when the percentage of usage of the 8×8 DCT grows. In particular, the reduction of complexity provided in [25], [27] and shown in Table I, leads to significant
IV. PROPOSED ARCHITECTURES
In this section, we show the top level of the proposed architectures and two possible implementations of the 1D-DCT module. The first one is a completely unfolded architecture derived from the one proposed in [14], where all the required operations are mapped to different resources, thus achieving the highest possible throughput for the DCT factorization presented in Section II. The second architecture is a folded one, where the folding technique [28] is exploited to reuse hardware resources during the computation of the different DCTs.
A. 2D-DCT Architecture
Fig. 2 reports the proposed 2D-DCT architecture. It is composed of two main blocks: the 1D-DCT module and the transposition memory, which has the role of transposing the intermediate results. Due to its flexibility, this architecture is able to concurrently perform the DCT computation on multiple blocks, depending on the DCT size. Since the number of input samples is fixed to 32, this module can compute $32/N$ DCTN, i.e., one DCT32, two DCT16, four DCT8, or eight DCT4, thus leading to an efficient usage of the hardware resources. For this reason, in this work the transposition memory is designed in the same way as presented in [13]. It is made of an array of 32 × 32 registers, required to support the DCT32, and it is able to transpose blocks of different sizes. The whole system computation is scheduled as follows. The rows of the input block pixels are fed into the 1D-DCT module, which computes the 1D-DCTN. Input pixels are represented with 9 bits, because they are the result of the difference between the current and the predicted frames. Then, according to HM 8.0, the produced results are scaled to 16 bits and stored row-wise in the transposition memory, as shown by the $\log_2 N - 1$ right-shift block in the feedback path in Fig. 2. Once all the rows have been processed, the multiplexers feed the 1D-DCT with the columns of the intermediate data, which are stored in the transposition memory. Finally, the results are scaled back to 16 bits for compliance with HM 8.0 (see the $\log_2 N + 6$ shift block in Fig. 2). All the operations are managed by a control unit (CU), which generates the signals for data selection and storage as well as those for the 1D-DCT block.
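The row-column schedule of the 2D-DCT architecture, in which a single 1D transform module is used twice with a transposition in between, can be modeled with the following Python sketch. It is a behavioral illustration only: the intermediate and final scaling shifts are omitted, and an unnormalized 4-point Hadamard transform (`wht4`, a hypothetical helper) stands in for the 1D-DCT module.

```python
def transpose(block):
    """Transpose a square block stored as a list of rows."""
    return [list(col) for col in zip(*block)]

def transform_2d(block, transform_1d):
    """Apply a separable 2D transform: rows first, then columns."""
    first_pass = [transform_1d(row) for row in block]
    # The transposition memory turns columns into rows for the second pass.
    second_pass = [transform_1d(row) for row in transpose(first_pass)]
    return transpose(second_pass)

def wht4(v):
    """Unnormalized 4-point Hadamard transform in natural order (stand-in)."""
    a, b, c, d = v
    s0, s1, d0, d1 = a + c, b + d, a - c, b - d
    return [s0 + s1, s0 - s1, d0 + d1, d0 - d1]

def blocks_per_pass(n, inputs=32):
    """Number of N-point transforms processed concurrently with 32 inputs."""
    return inputs // n

if __name__ == "__main__":
    flat = [[1, 1, 1, 1] for _ in range(4)]
    print(transform_2d(flat, wht4))
    print([blocks_per_pass(n) for n in (4, 8, 16, 32)])
```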
B. Unfolded 1D-DCT
The first proposed architecture is the unfolded one. Its data flow is reported in Fig. 3 . It is a four-stage pipelined data path composed of two main computational entities: the HT (left side of Fig. 3 ) and the rotation scheme (right part of Fig. 3 ). The former block receives 32 samples at each clock cycle and computes the HT by means of the butterfly stages (BUT), which perform the B butterflies indicated in (11) . The HT block is followed by a network, which implements: 1) bit reverse and Gray coding, to obtain Walsh-ordered data, and 2) bit reverse and permutation to reorder the signals for the rotation block.
According to (12) , the rotation scheme contains R = 49 modules, depicted as circles and ovals in Fig. 3 , to support the worst case, which is the DCT32. Each rotation receives as input the threshold for the related angle and it is equipped with some logic to implement the precomputation mechanism introduced in Section III. Fig. 4 reports the block scheme of a generic rotation module. Two comparators are used to determine whether the incoming signals are smaller than the threshold and the enable signal activates only the register bank, which is used in the following clock cycle. If one rotation is not calculated, then the data are sampled by register bank 1 and they are not propagated to the rotation logic, thus saving dynamic power. Otherwise, the path that passes through register bank 2 is enabled. A final multiplexer, driven by the same condition signal, chooses the correct output. Moreover, the special SKIP signal (which has been introduced in Section III) is also used to avoid rotations when the DCT size is smaller than 32 or to bypass all the rotations when the module works in MODE3 (WHT only). According to (10) , the rotation is computed by means of three lifting steps, each composed of a multiplication, a shift and an addition, as illustrated in Fig. 5 , where x 1 and x 2 are the input values and y 1 and y 2 the output results. The values of the multiplier coefficients (a, b) are the ones reported in Table II . The output of the rotation block is then propagated to a network, which integrates permutation and bit reverse reordering. The flexibility of the architecture is given by a custom set of connections, which arranges the paths between operators as required for the DCTN.
Since different DCT types require a variable number of rotation stages, the throughput of this architecture, defined as the number of produced results over the time required to produce them, varies with the DCT size. The throughput of the complete 2D-DCT architecture employing the 1D-DCT is
where $32/N$ represents the number of $N \times N$ blocks processed concurrently, the term $2 \cdot (P + N)/f_{CK}$ is the time required to compute the results, $P = \log_2 N - 1$ is the number of pipeline stages required for the computation of a DCTN, and $f_{CK}$ is the clock frequency.
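The throughput of the unfolded architecture can be evaluated numerically with the following Python sketch. It assumes that the number of produced results is $(32/N) \cdot N^2$ samples, i.e., all the samples of the blocks processed concurrently, and divides it by the computation time $2 \cdot (P + N)/f_{CK}$; this reading of the throughput expression and the function name are assumptions, not a transcription of the paper's equation.

```python
import math

def unfolded_throughput(n, f_ck_hz):
    """Samples per second for N x N blocks (assumed: results / time)."""
    p = int(math.log2(n)) - 1          # pipeline stages for a DCTN
    blocks = 32 // n                   # N x N blocks processed concurrently
    time_s = 2 * (p + n) / f_ck_hz     # time required to compute the results
    return blocks * n * n / time_s

if __name__ == "__main__":
    for n in (4, 8, 16, 32):
        print(n, unfolded_throughput(n, 250e6) / 1e9, "Gsamples/s")
```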
C. Folded 1D-DCT
The second implementation is the folded 1D-DCT, where a set of resources is shared among DCTs of different sizes. The reuse of such operators can be exploited either to support DCT computations of different lengths or to increase the throughput of small-size DCTs. Therefore, the amount of resources and the folding degree define a large design space. The exploration of such a space is detailed in the following paragraphs. The technique is applied to both the HT and the rotations. Let $K_B$ and $K_R$ be the number of butterfly and rotation resources, respectively. Assuming that the architecture works in pipeline and that the number of clock cycles required to compute one 1D-DCTN is $M_N$, then the time to compute $N^2$ results is
where $L = 2 \cdot M_N$ is the latency of the architecture. Since the architecture is folded, $M_N = \max\{\alpha, \beta\}$, where $\alpha$ and $\beta$ depend on the number of available resources, $K_B$ and $K_R$. Assuming perfect scheduling, $\alpha$ and $\beta$ can be obtained for each DCT size as $\alpha = B/K_B$ and $\beta = R/K_R$. In order to satisfy the data dependency between the computational stages (see Fig. 3), we assume that each gray shaded block (HT and Givens rotations) is computed in a minimum number of cycles, namely $\log_2 N$ for the HT and $\log_2 N - 1$ for the rotations. Thanks to the concurrent execution of different DCTs, a feasible perfect scheduling of the resources can always be identified. From a detailed analysis of the proposed folded architecture, we found that data dependencies occur with $N = 32$ in the HT computation only when $8 < K_B < 16$. Moreover, the rotation block requires two cycles: the first one to evaluate the enable condition and the second one to compute the rotation, if needed. The plots of the throughput as a function of $K_B$ and $K_R$ are depicted in Fig. 6. In this example, the throughput is calculated as in (15) and (16) with a reference clock frequency $f_{CK}$ equal to 250 MHz. As expected, when $K_B$ and $K_R$ are increased, the throughput grows up to a maximum value, which depends only on the data dependencies and no longer on the available resources. As can be observed, there is a relevant increase of throughput when $K_B$ reaches 8, 11, and 16 and when $K_R$ becomes equal to 4 and 8, which correspond to a more efficient usage of the resources. Thus, Fig. 6 is also intended for design purposes. Indeed, depending on the application, the designer can set the throughput and find the minimum number of resources required to achieve it. This work targets a throughput of 1.2 Gsamples/s (7680 × 4320 × 24 × 1.5), i.e., the one required for the encoding of 8K ultrahigh definition (UHD) video sequences at 24 fps with 4:2:0 YUV subsampling, which is one of the HEVC applications.
Therefore, the solution with $K_B = 16$ and $K_R = 8$ is selected. Such a solution requires $\alpha = 2, 3, 4, 5$ and $\beta = 2, 6, 10, 14$ clock cycles to compute the 1D-DCT of size 4, 8, 16, and 32, respectively. The proposed architecture for the 1D-DCT is depicted in Fig. 7. It is composed of two main blocks: the HT computation (left part of Fig. 7) and the rotation scheme (right side of Fig. 7). Resorting to the folding technique [28], each module shows two selection blocks, used to implement the time multiplexing of the resources, and a bank of temporary registers to store intermediate results.
A pipeline stage separates the HT computation from the rotation scheme, thus allowing concurrent computation of the two parts on successive samples. Two networks for data reordering complete the folded implementation. The first network implements the Walsh ordering and arranges the data for the Givens rotations. The second one applies the permutation and bit-reverse ordering to the results.
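The design figures quoted in this section can be reproduced with simple arithmetic. The following Python sketch computes the sample rate required by a given video format, using the factor of 1.5 samples per pixel that corresponds to 4:2:0 subsampling, and derives the cycle count $M_N = \max\{\alpha, \beta\}$ and the latency $2 \cdot M_N$ from the $\alpha$ and $\beta$ values reported for $K_B = 16$ and $K_R = 8$. The function and variable names are hypothetical.

```python
def required_throughput(width, height, fps, samples_per_pixel=1.5):
    """Samples per second needed to encode a video format (4:2:0 -> 1.5)."""
    return width * height * samples_per_pixel * fps

# Clock cycles reported for K_B = 16 and K_R = 8, indexed by DCT size.
ALPHA = {4: 2, 8: 3, 16: 4, 32: 5}    # HT cycles
BETA = {4: 2, 8: 6, 16: 10, 32: 14}   # rotation cycles

def cycles_per_1d_dct(n):
    """M_N = max(alpha, beta) for the selected folded solution."""
    return max(ALPHA[n], BETA[n])

def latency_cycles(n):
    """L = 2 * M_N, as stated for the folded architecture."""
    return 2 * cycles_per_1d_dct(n)

if __name__ == "__main__":
    target = required_throughput(7680, 4320, 24)
    print(target / 1e9, "Gsamples/s")
    print({n: cycles_per_1d_dct(n) for n in (4, 8, 16, 32)})
```

For the selected solution, the rotation scheme sets the cycle count for every DCT size, since $\beta \ge \alpha$ in all four cases.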
V. IMPLEMENTATION RESULTS
A. Synthesis Results for 1D-DCT
The proposed architectures have been coded in VHDL, and synthesized with Synopsys Design Compiler using a 90-nm standard cell library for an operating clock frequency equal to 250 MHz. In Table IX the Unfolded 1D-DCT and the Folded 1D-DCT are compared in terms of gate count, frequency ( f CK ) and throughput (T ) with other existing 1D-DCT architectures. It is important to note that the throughput is calculated as the average of the different DCT sizes weighted with the statistics reported in Section III-B, and it is determined considering the 2D folded structure in Fig. 2 . The proposed Unfolded 1D-DCT architecture shows the highest throughput at the expense of larger gate count with respect to the other designs. In particular, the Hadamard Transform and the Rotation Scheme block in Fig. 3 requires 22 K and 135 K gates respectively. It is worth noting that the proposed architecture features some hardware overhead compared with the solution provided in [13] . This figure depends on the fact that [13] supports only exact DCT computation, whereas the proposed one includes some logic and registers to support the four operating modes defined in Section III. On the other hand, the proposed Folded 1D-DCT shows a very reduced gate count for the computation of the 1D-DCT, even if it supports the four operating modes as well as the unfolded one. Only 51 K gates are needed to implement the Rotation Scheme, while 15 K gates are used for the Hadamard Transform. When compared with [9] , where only transforms of sizes 16 and 32 are implemented, the proposed Folded 1D-DCT architecture shows a similar gate count, but it can achieve double throughput.
B. Synthesis Results for 2D-DCT
The two proposed 2D-DCT architectures, based on the unfolded and the folded 1D-DCT modules, have been synthesized as well. In the following, they are referred to as Architecture 1 and Architecture 2, respectively. Table X lists the technology, gate count, operating frequency (f_CK), throughput (T), power consumption (P), energy-per-sample (EPS), and frequency-normalized dynamic power (P_d) that characterize the two designs and other existing 2D-DCT architectures for HEVC. As can be observed, all the implementations reported in [27] address the design of DCTs of size 8 × 8 only. They provide very high throughput at the cost of very high power consumption. Besides, the operating frequency of 250 MHz allows Architecture 1 to support 8K UHD applications up to 64 fps with 4:2:0 YUV subsampling. For such applications, the Full-parallel architecture proposed in [13] achieves the highest throughput with a large gate count, as it relies on two 1D-DCT modules. In contrast, the Folded architecture described in [13] contains one 1D-DCT module, like the Architecture 1 proposed in this work. The proposed Architecture 1 achieves higher throughput than the folded implementation proposed in [13], but it shows a slightly larger gate count and power consumption when the operating mode is set to MODE0, because additional logic is required to support the proposed power reduction algorithm.
On the other hand, Architecture 2 provides the smallest absolute power consumption, equal to 28.98 mW, with nearly the same gate count as [14] but about five times higher throughput. Indeed, the proposed folded architecture supports 8K UHD applications with a maximum frame rate of 26 fps. Moreover, the folded architecture can be properly sized by choosing the target throughput with the methodology proposed in Section IV-C, thus optimizing area occupation and power consumption.
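The relation between throughput and supported frame rate can be sketched as follows. This is a rough estimate under stated assumptions: an 8K UHD frame of 7680 × 4320 luma samples, 4:2:0 subsampling contributing 1.5 samples per pixel, and every sample passing through the 2D transform exactly once per frame. The function names are hypothetical, and the throughput values of Table X are not reproduced here.

```python
def samples_per_frame(width=7680, height=4320, samples_per_pixel=1.5):
    """Samples in one frame; 4:2:0 adds two quarter-size chroma planes (1.5x luma)."""
    return width * height * samples_per_pixel

def required_throughput(fps, **frame):
    """Throughput in samples/s needed to sustain `fps` (assumes one pass per sample)."""
    return samples_per_frame(**frame) * fps

def max_frame_rate(throughput, **frame):
    """Highest frame rate that a throughput in samples/s can sustain."""
    return throughput / samples_per_frame(**frame)

# Under these assumptions, 64 fps needs about 3.19 Gsample/s
# and 26 fps needs about 1.29 Gsample/s.
print(required_throughput(64) / 1e9)
print(required_throughput(26) / 1e9)
```

Other resolutions or chroma formats can be explored by passing `width`, `height`, or `samples_per_pixel` as keyword arguments.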
Finally, it is worth noting that both proposed architectures outperform the others in terms of frequency-normalized dynamic power (P_d), while they show a slightly higher EPS than the implementations in [13]. However, both figures can be reduced by operating in one of the low-power modes defined in Section III-C, which lower the power consumption, as shown in the following section.
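The two figures of merit discussed above can be sketched as simple ratios. This sketch assumes the usual definitions, which are not spelled out in this section: EPS as power divided by throughput, and P_d as power divided by clock frequency. The function names and the throughput value are hypothetical; only the 28.98 mW and 250 MHz figures come from the text, and using the total power in place of the dynamic power is a further assumption.

```python
def energy_per_sample_pj(power_mw, throughput_gsps):
    """Energy per sample in pJ: mW divided by Gsample/s gives pJ/sample."""
    if throughput_gsps <= 0:
        raise ValueError("throughput must be positive")
    return power_mw / throughput_gsps

def freq_normalized_power(power_mw, f_ck_mhz):
    """Frequency-normalized power in mW/MHz."""
    if f_ck_mhz <= 0:
        raise ValueError("clock frequency must be positive")
    return power_mw / f_ck_mhz

# 28.98 mW and 250 MHz are quoted in the text for Architecture 2;
# treating 28.98 mW as the dynamic power is an assumption of this sketch.
print(freq_normalized_power(28.98, 250.0))

# The throughput below is a hypothetical placeholder, not a value from Table X.
print(energy_per_sample_pj(28.98, 1.0))
```

Because both metrics are ratios of the same power figure, any mode that lowers the power at a fixed clock frequency and throughput lowers EPS and P_d by the same factor.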
C. Power Consumption Reduction
The power consumption of the proposed architectures can be further reduced by using the proposed operating modes. To compute the power reduction achieved by each mode, power estimation has been performed by simulating the gate-level netlist generated by the synthesis tool and annotating the switching activity of each node. A specific testbench has been used to apply real sample values to the input ports of the designed modules. These samples have been extracted by recording the DCT inputs during encoding simulations of sequences taken from Table IV, and they comply with the DCT usage statistics. Simulations have been performed for each of the four operating modes defined in Section III.
The last two rows of Table X show the power consumption, energy-per-sample, and frequency-normalized dynamic power of the two proposed architectures. Table XI summarizes the power saving calculated with reference to MODE0 (complete DCT). As expected, the power saving in Architecture 1 increases as the computation is reduced, namely, when passing from MODE1 to MODE3 (WHT only), where a power reduction of about 56% can be achieved. The same trend is observed for Architecture 2, even though the saving is slightly lower. This is due to the folding technique, which uses the resources more effectively than the unfolded architecture, and to the instantiation of real multipliers in the lifting scheme instead of custom add-shift multipliers.
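The saving relative to MODE0 can be sketched as a relative difference. The function name and the power values below are hypothetical placeholders, chosen only so that the result matches the roughly 56% saving quoted above; they are not entries of Table X or Table XI.

```python
def power_saving_percent(p_mode0_mw, p_mode_mw):
    """Power saving of a mode relative to MODE0, in percent."""
    if p_mode0_mw <= 0:
        raise ValueError("reference power must be positive")
    return 100.0 * (p_mode0_mw - p_mode_mw) / p_mode0_mw

# Hypothetical placeholder powers in mW, NOT taken from the paper's tables.
p_mode0 = 100.0
p_mode3 = 44.0
print(power_saving_percent(p_mode0, p_mode3))
```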
VI. CONCLUSION
In this paper, two novel DCT architectures for the HEVC standard have been proposed. The proposed N-point 2D-DCT computation is based on an N-point 1D-DCT core, where the complete DCT matrix is factorized as the cascade of the WHT and Givens rotations. In order to reduce the number of operations, the algorithm has been modified by introducing a precomputation mechanism, which saves rotations and dynamic power at the expense of a very small PSNR loss. Then, two flexible and HEVC-compliant architectures, able to support DCTs of size N equal to 4, 8, 16, and 32, have been proposed. The first one implements the N-point 1D-DCT in a completely unfolded fashion, while the second one has been obtained by identifying the proper folding degree. Moreover, the proposed architectural space exploration provides a method to design such systems based on the throughput required by the application. The implementation results show that the architectures employing the unfolded and the folded 1D-DCT modules, respectively, achieve competitive throughput and gate count with respect to existing architectures. Finally, the power consumption results show the advantages offered by the proposed operating modes, namely, MODE1, MODE2, and MODE3. In particular, MODE1 reduces the power consumption by roughly 30% with negligible quality loss. According to the complexity analysis provided in [3], a power saving of up to 10% can be achieved for the entire HEVC encoder and decoder by operating with the proposed approximations.
