I. INTRODUCTION

The Joint Collaborative Team on Video Coding (JCT-VC) recently developed a new international video compression standard called High Efficiency Video Coding (HEVC) [1]-[4]. HEVC has 36% better compression efficiency than H.264, which is the current state-of-the-art video compression standard. The video compression efficiency achieved in HEVC is not the result of any single feature but rather of a combination of encoding tools. One of these tools is the intra prediction algorithm. HEVC provides a 17% bit rate reduction for the intra prediction only case [5].
The HEVC intra prediction algorithm predicts the pixels in the prediction units (PU) of a coding unit (CU), which is similar to a macroblock (MB) in H.264, from the pixels of its already coded and reconstructed neighboring PUs. In H.264, there are 9 intra prediction modes for 4x4 luminance blocks and 4 intra prediction modes for 16x16 luminance blocks [6], [7]. In HEVC, there are 18 modes for 4x4, 35 modes for 8x8, 35 modes for 16x16, 35 modes for 32x32 and 4 modes for 64x64 luminance PUs [1]. The number of HEVC intra prediction modes for a 64x64 luminance CU is approximately 3.2 times larger than the number of H.264 intra prediction modes for the same area. In order to determine the best HEVC intra prediction mode for the luminance component of a 64x64 CU, intra predictions for 7552 prediction modes are calculated.
This work was supported in part by the Scientific and Technological Research Council of Turkey (TUBITAK).
E. Ozcan, E. Kalali, Y. Adibelli, and I. Hamzaoglu are with Faculty of Engineering and Natural Sciences, Sabanci University 34956 Tuzla, Istanbul, Turkey (e-mail: {eozcan, ercankalali, yadibelli, hamzaoglu} @sabanciuniv.edu).
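The mode counts quoted in the Introduction (7552 HEVC intra predictions per 64x64 luminance CU, roughly 3.2 times the H.264 count) can be reproduced with a short calculation. The sketch below is an illustration only: the function names are hypothetical, and the H.264 figure assumes the comparison covers the same 64x64 area as sixteen 16x16 MBs, using only the 4x4 and 16x16 mode counts listed in the text.

```python
# Luminance intra prediction modes per PU size, as listed in the text.
HEVC_MODES = {4: 18, 8: 35, 16: 35, 32: 35, 64: 4}
# H.264 modes per block size, as listed in the text (4x4 and 16x16 only).
H264_MODES = {4: 9, 16: 4}


def count_predictions(modes_per_size, area_size=64):
    """Sum modes over every block of every size that tiles an area_size x area_size region."""
    total = 0
    for block_size, modes in modes_per_size.items():
        blocks = (area_size // block_size) ** 2  # blocks of this size in the region
        total += blocks * modes
    return total


hevc_total = count_predictions(HEVC_MODES)   # 256*18 + 64*35 + 16*35 + 4*35 + 1*4
h264_total = count_predictions(H264_MODES)   # 256*9 + 16*4 (assumed 64x64 comparison area)

print(hevc_total)                          # 7552
print(h264_total)                          # 2368
print(round(hevc_total / h264_total, 1))   # 3.2
```

Under these assumptions the calculation matches both figures in the text.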
Fig. 1. Addition Amounts in HEVC and H.264 SATD Calculations
The intra mode decision algorithm compares the intra predictions of all intra prediction modes and determines the best intra prediction mode. The intra mode decision algorithms implemented in both the H.264 JM reference software encoder and the HEVC HM reference software encoder use a cost function based on the Sum of Absolute Transformed Differences (SATD). Fig. 1 shows the number of additions performed by SATD calculations in these HEVC and H.264 intra mode decision algorithms. Because of the larger PU sizes and the larger number of intra prediction modes, 24 times more additions are performed for SATD calculations in HEVC intra mode decision than in H.264 intra mode decision. SATD calculations performed by the HEVC intra mode decision algorithm implemented in the HEVC HM reference software encoder constitute 9% of the computational complexity of an HEVC encoder [5].
Therefore, in this paper, a computation and energy reduction technique is proposed for reducing the amount of computations performed by SATD calculations in HEVC intra mode decision, and therefore reducing the energy consumption of HEVC SATD calculation hardware without any PSNR loss or bit rate increase. The proposed technique significantly reduces the number of additions performed by SATD calculations in the HEVC intra mode decision algorithm implemented in the HEVC HM reference software encoder [8] for 4x4 and 8x8 luminance intra prediction modes. Since 94% of the intra predicted blocks are predicted with 4x4 and 8x8 PU sizes [9], in this paper, the proposed technique is used for 4x4 and 8x8 PUs. However, it can also be used for 16x16, 32x32 and 64x64 PUs.
In this paper, efficient hardware architectures are also designed for both the original HEVC SATD calculation and the HEVC SATD calculation with the proposed technique for 4x4 and 8x8 PUs. The proposed hardware architectures are implemented in Verilog HDL. The Verilog RTL code is verified to work at 116 MHz in an FPGA implemented in 40nm CMOS technology. The FPGA implementations of the original HEVC SATD calculation and the HEVC SATD calculation with the proposed technique can process 9 and 21 HD (1280x720) frames per second, respectively. The proposed technique reduced the energy consumption of the original HEVC SATD calculation hardware by up to 64.6%. Therefore, it can be used in portable consumer electronics products that require a real-time HEVC encoder.
A similar energy reduction technique is proposed for H.264 intra mode decision in [10]. However, the technique proposed in this paper includes more optimizations, and it is applied to HEVC intra mode decision. In the literature, there are fast H.264 intra mode decision algorithms [11], [12] and fast HEVC intra mode decision algorithms [13]-[15]. These algorithms reduce the amount of computations performed by intra mode decision at the expense of PSNR loss and bit rate increase. In contrast, the proposed technique reduces the amount of computations performed by intra mode decision without any PSNR loss or bit rate increase. In addition, the proposed technique can be used together with these fast intra mode decision algorithms to further reduce the amount of computations performed by intra mode decision.
As shown in Fig. 2 , HEVC intra prediction hardware calculates the intra predictions of all intra prediction modes, and HEVC intra mode decision hardware determines the best intra prediction mode by comparing these intra predictions.
There are a few HEVC intra prediction hardware implementations in the literature [16], [17]. Since these hardware implementations only perform HEVC intra prediction, the HEVC intra mode decision hardware proposed in this paper cannot be compared with them.
An HEVC intra mode decision hardware implementation only for 4x4 PU size is proposed in [18] . However, no energy reduction technique is used in this hardware, and its power consumption is not reported. Since the performance and area results of this hardware are reported together with a 4x4 intra prediction hardware and an entropy coder hardware, the HEVC intra mode decision hardware proposed in this paper for 4x4 and 8x8 PU sizes cannot be compared with it.
The rest of the paper is organized as follows. In Section II, HEVC intra prediction and intra mode decision algorithms are explained. Section III describes the proposed computation and energy reduction technique for HEVC intra mode decision. The proposed HEVC SATD calculation hardware is explained and its implementation results are given in Section IV. Section V presents the conclusions.
II. HEVC INTRA PREDICTION AND INTRA MODE DECISION ALGORITHMS
The HEVC intra prediction algorithm predicts the pixels in the PUs of a CU using the pixels in the available neighboring PUs. For the luminance component of a frame, 4x4, 8x8, 16x16, 32x32 and 64x64 PU sizes are available. There are 16 angular prediction modes for the 4x4 PU size, 33 angular prediction modes for the 8x8, 16x16 and 32x32 PU sizes, and 2 angular prediction modes for the 64x64 PU size. In addition to the angular prediction modes, there are DC and planar prediction modes for all PU sizes [1]. Fig. 3 shows the intra prediction angles and the intra prediction modes corresponding to these angles.
The intra mode decision algorithm implemented in the HEVC HM reference software encoder uses two cost functions, the Hadamard cost function in (1) and the RDO cost function in (2).
\[ J_{Hadamard} = SATD + \lambda \cdot R \tag{1} \]
\[ J_{RDO} = SSD + \lambda \cdot R \tag{2} \]
The Hadamard cost function estimates distortion as SATD and rate (R) as the number of bits used for encoding the intra prediction mode. λ is the Lagrangian multiplier. SATD is computed as the sum of absolute differences between the Hadamard Transform of the current block and the Hadamard Transform of the intra predicted block. The RDO cost function calculates the actual distortion after coding based on SSD, and rate (R) as the actual bit rate used after coding. SSD is computed as the sum of squared differences between the current block and the reconstructed block.
This mode decision algorithm determines the best PU size, transform unit (TU) size and intra prediction mode of a CU as follows. First, the SATD value of each intra prediction mode of each PU is calculated for the largest PU size: the residue block is found by subtracting the intra predicted block from the current block, the Hadamard Transform (HT) is applied to the residue block, and the absolute values of the transformed residues are added. Then, the 8 modes for 4x4 and 8x8 PUs, and the 3 modes for 16x16, 32x32 and 64x64 PUs, with the minimum Hadamard cost function values are selected as candidate modes for each PU. After that, for each PU, the most selected candidate modes of the neighboring PUs are compared with the candidate modes selected for the current PU, and up to 3 additional modes from the neighboring PUs are added to the candidate modes of the current PU. Then, the RD cost of each candidate mode of each PU is calculated using the cost function in (2), and the mode with the minimum RD cost is selected as the best mode for each PU. After that, for each PU, the RD cost of its best mode is calculated with TU sizes from 4x4 to 32x32, and the TU size with the minimum RD cost is selected. This process is repeated for each PU size of the CU from largest to smallest, and the PU size, TU size and intra prediction mode with the minimum RD cost are selected for the CU.
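The first two steps of this procedure (SATD calculation and candidate selection by Hadamard cost) can be sketched as follows. This is a minimal illustration, not the HM implementation: the function names are hypothetical, the Hadamard matrix uses the Sylvester ordering as an assumption, no SATD normalization is applied, and the mode bit counts and λ are supplied by the caller.

```python
def hadamard(n):
    """Sylvester-ordered n x n Hadamard matrix (n a power of two); ordering is an assumption."""
    h = [[1]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    return h


def matmul(a, b):
    """Plain integer matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def satd(current, predicted):
    """Residue -> 2-D Hadamard Transform (H * residue * H') -> sum of absolute values."""
    h = hadamard(len(current))
    h_t = [list(r) for r in zip(*h)]
    residue = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(current, predicted)]
    transformed = matmul(matmul(h, residue), h_t)
    return sum(abs(v) for row in transformed for v in row)


def select_candidates(current, predictions, mode_bits, lam, count):
    """Return the `count` modes with the smallest Hadamard cost SATD + lam * R.

    predictions: dict mapping mode id -> predicted block
    mode_bits:   dict mapping mode id -> bits needed to code that mode (R)
    """
    costs = {m: satd(current, p) + lam * mode_bits[m] for m, p in predictions.items()}
    return sorted(costs, key=costs.get)[:count]


# Tiny usage example with two made-up 4x4 predictions.
cur = [[1] * 4 for _ in range(4)]
preds = {0: [[1] * 4 for _ in range(4)], 1: [[0] * 4 for _ in range(4)]}
print(satd(cur, preds[1]))                                      # 16
print(select_candidates(cur, preds, {0: 1, 1: 1}, lam=1, count=1))  # [0]
```

For 4x4 and 8x8 PUs the text uses `count = 8`, and for larger PUs `count = 3`; the RD cost stage that follows needs a full encode and reconstruction and is outside this sketch.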
III. PROPOSED COMPUTATION AND ENERGY REDUCTION TECHNIQUE
HT is a linear operation, so it can be applied before the subtraction operation, as shown in (3). H, C and P in (3) are the Hadamard matrix, the current block and the predicted block, respectively. The 8x8 Hadamard matrix is shown in (4). Instead of applying HT after the subtraction operation, the proposed technique applies HT before the subtraction operation. Applying HT before subtraction requires performing two HTs instead of one. However, this reduces the amount of computations performed by SATD calculations in HEVC intra mode decision. Since the intra predicted blocks have regular patterns, the HTs of the predicted blocks (H*P*H') can be calculated with a small amount of computation. In addition, since the HT of the current block (H*C*H') is common to all intra prediction modes, it needs to be calculated only once.
The predicted block pattern of the horizontal mode and the result of performing HT on this predicted block pattern are shown in Fig. 5 for the 8x8 PU size. The SATD of an 8x8 block, including HT, can be calculated with 959 additions. However, the SATD of an 8x8 block predicted by the horizontal mode, including HT, can be calculated with 95 additions and 8 shifts, as shown in Fig. 5. Similarly, the SATD of an 8x8 block predicted by the vertical mode and all angle 2 modes, including HT, can be calculated with 95 additions and 8 shifts. Therefore, the proposed technique significantly reduces the number of additions performed by SATD calculation.
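The linearity argument can be written out explicitly. The derivation below is a sketch under stated assumptions: the column vector \ell (the eight left-neighbor pixels), the all-ones vector \mathbf{1} and the unit vector e_1 are notation introduced here, H^T denotes the transpose written H' in the text, and H is assumed to have an all-ones first row, so that every other row sums to zero.

```latex
% Linearity of the 2-D Hadamard Transform, and the resulting SATD expression.
\begin{align*}
H\,(C - P)\,H^{T} &= H\,C\,H^{T} - H\,P\,H^{T} \\
\mathrm{SATD} &= \sum_{i,j} \left| \left( H C H^{T} \right)_{ij} - \left( H P H^{T} \right)_{ij} \right|
\end{align*}
% Horizontal mode, 8x8 PU: every row of P repeats one left-neighbor pixel.
\begin{align*}
P &= \ell\,\mathbf{1}^{T}, \qquad H\,\mathbf{1} = 8\,e_{1} \\
H\,P\,H^{T} &= (H\ell)\,(H\mathbf{1})^{T} = 8\,(H\ell)\,e_{1}^{T}
\end{align*}
```

Under these assumptions only the first column of H*P*H' is non-zero, so 56 of its 64 values are zero, and the factor 8 is a left shift by 3. One accounting that reproduces the quoted 95 additions and 8 shifts, offered as an interpretation rather than the authors' stated breakdown, is 24 additions for a fast 8-point transform of \ell, 8 shifts for the factor 8, 8 subtractions from the first column of H*C*H', and 63 additions to accumulate the 64 absolute values.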
In this paper, the proposed technique is applied to all 4x4 intra prediction modes except the planar and DC modes, and to the 8x8 intra prediction modes of angles 2, 5, 13 and 17 and the vertical and horizontal modes. Therefore, the proposed technique is applied to 16 4x4 modes and 18 8x8 modes. Since the other modes have relatively irregular prediction patterns, the proposed technique achieves only a small amount of computation reduction for them. In order to have a less complex and smaller SATD calculation hardware, the proposed technique is not applied to these prediction modes. Instead, the original HT operation, which applies HT after the subtraction operation, is used for these prediction modes.
The proposed technique reduces the amount of computations for two reasons. First, as shown in Fig. 5, most of the values in the HT of intra predicted blocks are zero. Therefore, there is no need to calculate these values. Second, since intra predicted blocks have regular patterns, some of the values in the HT of intra predicted blocks are the same. Therefore, these values are calculated only once. For example, the values in the sixth row of the HT of an 8x8 block predicted by an 8x8 intra prediction mode of angle 17 are shown in Fig. 6. The first line gives the first value in the row, and the last line gives the last value in the row. Since some of the values are the same, they are calculated only once.
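The first of these two reasons can be checked numerically for the horizontal mode. The sketch below is illustrative only: the pixel values are made up, the helper names are hypothetical, and the Sylvester-ordered Hadamard matrix is an assumption (a different row ordering permutes the coefficients but does not change the SATD value or the number of zeros).

```python
def hadamard(n):
    """Sylvester-ordered n x n Hadamard matrix (assumed ordering)."""
    h = [[1]]
    while len(h) < n:
        h = [r + r for r in h] + [r + [-v for v in r] for r in h]
    return h


def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def ht2d(block):
    """2-D Hadamard Transform H * block * H'."""
    h = hadamard(len(block))
    h_t = [list(r) for r in zip(*h)]
    return matmul(matmul(h, block), h_t)


def satd_subtract_first(cur, pred):
    """Original order: subtract, transform, then sum absolute values."""
    residue = [[c - p for c, p in zip(cr, pr)] for cr, pr in zip(cur, pred)]
    return sum(abs(v) for row in ht2d(residue) for v in row)


def satd_transform_first(ht_cur, ht_pred):
    """Proposed order: both blocks are already transformed; subtract and sum."""
    return sum(abs(c - p) for cr, pr in zip(ht_cur, ht_pred) for c, p in zip(cr, pr))


left = [10, 12, 15, 20, 26, 33, 41, 50]           # made-up left-neighbor pixels
pred = [[p] * 8 for p in left]                    # horizontal mode: each row repeats its neighbor
cur = [[(3 * r + 5 * c + r * c) % 256 for c in range(8)] for r in range(8)]  # made-up block

ht_pred = ht2d(pred)
ht_cur = ht2d(cur)                                # computed once, reusable for every mode

zero_count = sum(v == 0 for row in ht_pred for v in row)
print(zero_count)                                                      # 56
print(satd_subtract_first(cur, pred) == satd_transform_first(ht_cur, ht_pred))  # True
```

Only the first column of `ht_pred` is non-zero here, which is why a hardware implementation can skip the other 56 subtractions for this mode.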
The computation reductions achieved by the proposed technique are presented in Table I . The columns labeled I show the amount of computations performed by the original HT operation and the columns labeled II show the amount of computations performed by the HT operation using the proposed technique. The proposed technique reduced the number of additions performed by HT operation for 4x4 and 8x8 luminance intra prediction modes by 54% and 70% respectively without any PSNR loss. The results show that the proposed technique significantly reduces the amount of computations performed by SATD calculations in HEVC intra mode decision.
IV. PROPOSED HEVC SATD CALCULATION HARDWARE
In this paper, two different hardware architectures are designed for SATD calculation in HEVC intra mode decision for 4x4 and 8x8 PU sizes. The first hardware implements original SATD calculation. Therefore, it first subtracts predicted block from current block, and then performs HT. The second hardware implements SATD calculation with the proposed technique. Therefore, it first performs HT for predicted block and current block, and then performs subtraction.
The hardware architecture implementing the original SATD calculation has two 8 parallel datapaths in order to increase its throughput. The hardware architecture with 8 parallel datapaths is shown in Fig. 7. One of these datapaths is shown in Fig. 8. Input pixels are stored in the IBUF input buffer. First, predicted block pixels are subtracted from current block pixels. Then, an addition or subtraction operation is performed depending on the HT matrix. Since the HT matrix is multiplied with the residue block from both the left and the right side, as shown in (3), the results of the left side multiplication are stored in the transpose memory, as shown in Fig. 7. For the 8x8 PU size, in each clock cycle, the values in one column of H*(C-P) are calculated by the 8 parallel datapaths. Therefore, H*(C-P) is calculated in 8 clock cycles. Then, the right side multiplication is performed. In each clock cycle, the values in one row of H*(C-P)*H' are calculated by the same 8 parallel datapaths. Therefore, H*(C-P)*H' is also calculated in 8 clock cycles. Then, absolute values are calculated and stored in the transpose memory. Finally, the SATD value is calculated by adding the absolute values using the last datapath. The original SATD calculation hardware calculates the SATD values of all 4x4 and 8x8 intra prediction modes in 879 clock cycles.
The hardware architecture implementing the SATD calculation with the proposed technique is shown in Fig. 9. Parallel processing elements (PEs) are used in the hardware in order to increase its throughput. As shown in Fig. 10, each PE has only 3 adders and 4 multiplexers. HTs for 4x4 intra prediction modes are calculated using 4 PEs. HTs for the 8x8 intra prediction modes of angles 2 and 5 and the vertical and horizontal modes are calculated using one PE. HTs for the 8x8 intra prediction modes of angles 13 and 17 are calculated using 4 PEs. Since the proposed technique is not applied to some intra prediction modes, as shown in Fig. 9, the hardware also includes the original SATD calculation hardware for these modes.
The architecture of the 4 PEs is shown in Fig. 11. Since there is no matrix multiplication in the proposed technique, there is no transpose memory in this hardware. First, predicted pixels are stored in the IBUF input buffer. Then, each PE reads 4 pixels from IBUF and performs the operations of HT. The outputs of the PEs are stored either in SPAD, for performing further operations of HT, or in the OBUF output buffer. IBUF, SPAD and OBUF are implemented as BlockRAMs.
The 4 PEs used for performing HTs of 4x4 intra predicted blocks perform the HTs of the four 4x4 blocks in an 8x8 block sequentially. These 4 PEs are divided into two groups. Each group has 2 PEs, and the PEs in a group perform HTs of 4x4 blocks predicted by the same intra prediction modes. The HTs of the 4x4 and 8x8 current blocks are calculated once by the original SATD calculation hardware with 8 parallel datapaths and stored. Then, SATD values are calculated by subtracting the HT of the intra predicted blocks from the HT of the current block and adding the absolute values of the results using the adder tree shown in Fig. 12.
Since 56 of the 64 values in the HT of 8x8 blocks predicted by the 8x8 intra prediction modes of angle 2 and the horizontal mode are zero, these zero values are not subtracted from the HT of the current block in order to reduce the power consumption. Since only one adder tree is used to reduce hardware area, the adder tree operations are scheduled to use it efficiently. The HT flow and adder tree scheduling for an 8x8 PU, for the 4x4 intra prediction modes and the 8x8 intra prediction modes of angles 2, 5, 13 and 17 and the vertical and horizontal modes, are shown in Fig. 13.
Adder tree calculates SATD value for each 4x4 intra prediction mode and 8x8 intra prediction mode in 5 and 9 clock cycles respectively. Therefore, it takes 330 clock cycles to calculate SATD values of 4x4 and 8x8 intra prediction modes for which the proposed technique is applied. It takes 400 clock cycles to calculate SATD values of the intra prediction modes for which the proposed technique is not applied. Therefore, SATD calculation hardware with the proposed technique calculates SATD values of all 4x4 and 8x8 intra prediction modes in 400 clock cycles. Since PEs and adder tree wait for 70 clock cycles before processing the next 8x8 block, they are clock gated in order to reduce power consumption.
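The cycle counts above translate into frame rates through a simple relation, sketched below. This is an illustration under stated assumptions: the function names are hypothetical, one pass of the quoted cycle count is assumed per 8x8 block, the two paths of the proposed hardware are assumed to run in parallel (so the longer one sets the total), and 116 MHz is the frequency quoted in the Introduction; the clock frequency of each implementation is left as a parameter.

```python
def blocks_per_frame(width, height, block=8):
    """Number of block x block luminance blocks in one frame."""
    return (width // block) * (height // block)


def frames_per_second(clock_hz, cycles_per_block, width=1280, height=720):
    """Frame rate if every 8x8 block costs `cycles_per_block` clock cycles."""
    return clock_hz / (cycles_per_block * blocks_per_frame(width, height))


ORIGINAL_CYCLES = 879             # all 4x4 and 8x8 modes, original hardware
TECHNIQUE_PATH_CYCLES = 330       # modes handled with the proposed technique
ORIGINAL_PATH_CYCLES = 400        # modes still handled by the original HT

# Assumed parallel operation: the slower path determines the total.
proposed_total = max(TECHNIQUE_PATH_CYCLES, ORIGINAL_PATH_CYCLES)
idle_cycles = proposed_total - TECHNIQUE_PATH_CYCLES   # window available for clock gating

print(blocks_per_frame(1280, 720))                      # 14400
print(proposed_total, idle_cycles)                      # 400 70
print(int(frames_per_second(116e6, ORIGINAL_CYCLES)))   # 9
```

The 70 idle cycles are the interval during which the PEs and the adder tree are clock gated.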
Both BRAMs are implemented as dual-port block SelectRAMs. Therefore, the proposed technique reduces the FPGA resources used by SATD calculation hardware except BRAMs.
As shown in Table II , in order to increase the performance of the original HEVC SATD calculation hardware, the number of parallel datapaths can be increased at the expense of using more FPGA resources. For example, 48 parallel datapaths can be used to process 27 HD frames per second.
The Verilog RTL code of the HEVC SATD calculation hardware with the proposed technique is also synthesized to a 90nm standard cell library, and the resulting netlist is placed & routed. The resulting ASIC implementation works at 115 MHz, and its gate count is calculated as 49.8K according to NAND (2x1) gate area excluding on-chip memory.
The power consumptions of both FPGA implementations are estimated using a gate level power estimation tool. Post place & route timing simulations are performed for the Vidyo1 (1280x720), Vidyo3 (1280x720), Johnny (1280x720) and KristenAndSara (1280x720) video sequences [19] at 100 MHz, and signal activities are stored in VCD files. These VCD files are used for estimating the power consumptions of both FPGA implementations. The power and energy consumption results for one frame of each video sequence are shown in Table III. The results show that the proposed technique reduced the power and energy consumptions of the original SATD calculation hardware by up to 24.2% and 64.6%, respectively. Since the HEVC SATD calculation hardware is used as part of an HEVC video encoder, only internal power consumption is considered, and input and output power consumptions are ignored. Therefore, the power consumption of the HEVC SATD calculation hardware can be divided into four main categories: clock power, logic power, signal power and BRAM power.
V. CONCLUSIONS
In this paper, a computation and energy reduction technique is proposed for reducing the amount of computations performed by SATD calculations in HEVC intra mode decision, and therefore reducing the energy consumption of HEVC SATD calculation hardware without any PSNR loss or bit rate increase. Efficient hardware architectures are also designed for both the original HEVC SATD calculation and the HEVC SATD calculation with the proposed technique for 4x4 and 8x8 PUs. The proposed hardware architectures are implemented in Verilog HDL. The Verilog RTL code is mapped to an FPGA implemented in 40nm CMOS technology. The proposed technique reduced the energy consumption of the FPGA implementation of the original HEVC SATD calculation hardware by up to 64.6%. Therefore, it can be used in portable consumer electronics products that require a real-time HEVC encoder.
