In this paper, an approximate High Efficiency Video Coding (HEVC) intra angular prediction technique is proposed for reducing area of HEVC intra prediction hardware. The proposed approximation technique uses closer neighboring pixels instead of distant neighboring pixels in intra angular prediction equations. It causes 0.0569% bit rate increase and 0.0028 dB PSNR loss on average. In this paper, an approximate HEVC intra angular prediction hardware implementing the proposed approximation technique is also proposed. FPGA and ASIC implementations of the proposed approximate hardware can process 24 and 40 quad full HD (3840x2160) video frames per second, respectively. The proposed approximate HEVC intra angular prediction hardware is the smallest and the second fastest HEVC intra prediction hardware in the literature. It is ten times smaller and 20% slower than the fastest HEVC intra prediction hardware in the literature.
I. INTRODUCTION
High efficiency video coding (HEVC) provides 50% better video compression efficiency than H.264 [1] - [9] . However, it has higher computational complexity than H.264. HEVC intra prediction has higher computational complexity than H.264 intra prediction as well. In HEVC intra prediction algorithm, pixels of a prediction unit (PU) are predicted using the neighboring pixels in already coded and reconstructed neighboring PUs. 4x4, 8x8, 16x16 and 32x32 PU sizes are used in HEVC intra prediction algorithm. There are 35 intra prediction modes for each PU size [1] .
Approximate computing allows designing faster, smaller area and lower power consuming hardware than exact optimized hardware designs by trading off speed, area and power consumption with quality. Therefore, it is used for error tolerant applications with high computational complexity such as video compression [10] - [18] .
In this paper, an approximate HEVC intra angular prediction technique is proposed for reducing area of HEVC intra prediction hardware. The proposed approximation technique uses closer neighboring pixels instead of distant neighboring pixels in an intra angular prediction equation if the distance between the neighboring pixels used in this intra angular prediction equation is larger than 2. Otherwise, it uses the original intra angular prediction equations.
The associate editor coordinating the review of this manuscript and approving it for publication was Zhaoqing Pan.
The proposed approximate HEVC intra angular prediction technique causes 0.0569% bit rate increase and 0.0028 dB PSNR loss on average. The proposed approximate HEVC intra angular prediction technique is used in search and mode decision stage of an HEVC encoder. Original HEVC intra angular prediction is used in coding stage of the HEVC encoder. Therefore, the proposed approximation technique does not cause encoder-decoder mismatch.
In this paper, an approximate HEVC intra angular prediction hardware implementing the proposed approximation technique is also proposed. The proposed hardware implements angular prediction modes for all PU sizes (4x4 to 32x32). The proposed approximation technique significantly reduces area of the proposed hardware by enabling efficient use of one multiple constant multiplication (MCM) datapath to implement all constant multiplications using add and shift operations and by reducing amount of on-chip memory. It also reduces amount of computations and amount of onchip memory accesses.
The proposed approximate HEVC intra angular prediction hardware is implemented with Verilog. Verilog RTL codes are mapped to a Xilinx Virtex 6 FPGA. The FPGA implementation is verified on an FPGA board. It can work at 200 MHz, and it can process 24 quad full HD (3840x2160) video frames per second (fps).
VOLUME 8, 2020. This work is licensed under a Creative Commons Attribution 4.0 License. For more information, see http://creativecommons.org/licenses/by/4.0/
Approximate hardware implementations of several HEVC algorithms such as discrete cosine transform, fractional interpolation and motion estimation are proposed in the literature [13] - [18].
Approximate reconfigurable adder/subtractor blocks (RABs) are proposed in [13]. They can change their approximation level dynamically to achieve low power with high accuracy. They are used in the motion estimation and DCT modules of an MPEG encoder. HEVC DCT coefficients are replaced with the closest 2^N integer (power of two) in [14]. This reduces area and energy consumption of the HEVC DCT hardware. Several algorithmic approximate computing techniques are applied to HEVC decoder software in [15]. This reduces energy consumption. Different approximate adder circuits are used for HEVC DCT and HEVC motion estimation hardware in [16] and [17], respectively. 3 and 4 tap FIR filters are proposed for HEVC fractional interpolation in [18]. Using these FIR filters instead of the original 7 and 8 tap FIR filters reduces area and energy consumption of HEVC fractional interpolation hardware.
There is no approximate HEVC intra prediction hardware in the literature. Since the proposed approximate HEVC intra angular prediction hardware is the first approximate hardware implementation of HEVC intra angular prediction, we could not compare its PSNR and bit rate results with literature.
There are several HEVC intra prediction hardware implementations in the literature [19] - [29] . Some of these hardware implementations use separate datapaths for each PU size. Some of them use multipliers to implement multiplication with constants instead of using adders and shifters. Some of them use novel memory management techniques to read neighboring pixels efficiently.
The proposed approximate HEVC intra angular prediction hardware is the smallest and the second fastest HEVC intra prediction hardware in the literature. It is ten times smaller and 20% slower than the fastest HEVC intra prediction hardware in the literature.
This paper is organized as follows. HEVC intra prediction algorithm is described in Section II. The proposed approximate HEVC intra angular prediction technique is explained in Section III. In Section IV, the proposed approximate HEVC intra angular prediction hardware is explained. In Section V, its implementation results are given. Conclusions are presented in Section VI.
II. HEVC INTRA PREDICTION ALGORITHM
In HEVC intra prediction algorithm, pixels of a PU are predicted using weighted average of neighboring pixels in neighboring PUs in the same frame. 4x4, 8x8, 16x16 and 32x32 PU sizes are used for luminance components of frames. For each PU size, DC prediction mode, planar prediction mode and 33 angular prediction modes are used.
As shown in Fig. 1 , each angular prediction mode has a direction and an angle. Angles of some prediction modes are the same. However, these prediction modes use different reference pixels. For example, angle of mode 10 is 0, and angle of mode 26 is 0. Angle of mode 9 is 2, and angle of mode 27 is 2. Angle of mode 11 is −2, and angle of mode 25 is −2.
As shown in Fig. 2 , HEVC intra angular prediction algorithm, first, selects some of the neighboring pixels as reference pixels (Ref) based on intra angular prediction mode and its angle. There are four different cases. As shown in Fig. 1 , if intra angular prediction mode is in region 0 (modes 26, 27, . . . , 34), all reference pixels are selected from top neighboring pixels. If intra angular prediction mode is in region 1 (modes 18, 19, . . . , 25), up to four reference pixels are selected from left neighboring pixels and the other reference pixels are selected from top neighboring pixels. If intra angular prediction mode is in region 2 (modes 11, 12, . . . , 17), up to four reference pixels are selected from top neighboring pixels and the other reference pixels are selected from left neighboring pixels. If intra angular prediction mode is in region 3 (modes 2, 3, . . . , 10), all reference pixels are selected from left neighboring pixels.
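The mode angles and the four regions described above can be sketched in a few lines of Python. The angle values are those of the HEVC standard for modes 2 to 34; the names `INTRA_PRED_ANGLE` and `region_of` are illustrative assumptions and do not come from the proposed design.

```python
# Angles of HEVC intra angular prediction modes 2..34 (HEVC standard table).
INTRA_PRED_ANGLE = dict(zip(range(2, 35), [
    32, 26, 21, 17, 13, 9, 5, 2, 0,          # modes 2..10
    -2, -5, -9, -13, -17, -21, -26,          # modes 11..17
    -32, -26, -21, -17, -13, -9, -5, -2,     # modes 18..25
    0, 2, 5, 9, 13, 17, 21, 26, 32,          # modes 26..34
]))


def region_of(mode):
    """Return the region (0..3) of an angular mode, as listed in the text."""
    if 26 <= mode <= 34:
        return 0  # all reference pixels from top neighbors
    if 18 <= mode <= 25:
        return 1  # mostly top neighbors, a few left neighbors
    if 11 <= mode <= 17:
        return 2  # mostly left neighbors, a few top neighbors
    if 2 <= mode <= 10:
        return 3  # all reference pixels from left neighbors
    raise ValueError("not an angular mode: %d" % mode)
```

For instance, modes 6 and 30 share the angle 13 in this table but fall into regions 3 and 0, so they read left and top neighboring pixels, respectively.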
Then, it determines the positions of the pixels in Ref that will be used in the intra prediction equation. Coefficients of these pixels are calculated using Coeff. Finally, predicted pixels are calculated using the pred[x, y] equation.
Neighboring pixels of a 4x4 PU, and directions of intra angular prediction modes 6 and 30 are shown in Fig. 3 . Prediction equations of mode 6 are shown in Fig. 4 . Prediction equations of mode 30 are shown in Fig. 5 . Although angles of these two prediction modes are the same (13), they use different reference pixels. Reference pixels of mode 6 are selected from left neighboring pixels. Reference pixels of mode 30 are selected from top neighboring pixels.
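As an illustration of how the position, Coeff and pred[x, y] steps fit together, the following minimal sketch computes a region 0 prediction (top reference pixels, non-negative angle) using the standard HEVC angular interpolation. The function names and the layout of `top_ref` (corner pixel at index 0, followed by 2N top and top-right pixels) are assumptions made for this example; negative angles, which need left pixels projected into the reference, are not covered.

```python
def row_params(angle, size):
    """Per-row (index, coeff) pairs: integer offset and 1/32 fraction."""
    return [(((y + 1) * angle) >> 5, ((y + 1) * angle) & 31)
            for y in range(size)]


def predict_region0(top_ref, angle, size):
    """Predict a size x size PU from top neighbors (angle >= 0 only).

    top_ref[0] is the top-left corner pixel and top_ref[1..2*size]
    are the top and top-right neighboring pixels.
    """
    if angle < 0:
        raise ValueError("this sketch handles non-negative angles only")
    pred = [[0] * size for _ in range(size)]
    for y, (index, coeff) in enumerate(row_params(angle, size)):
        for x in range(size):
            a = top_ref[x + index + 1]
            if coeff == 0:
                pred[y][x] = a  # exact position, no interpolation
            else:
                b = top_ref[x + index + 2]
                # Weighted average with weights summing to 32, rounded.
                pred[y][x] = ((32 - coeff) * a + coeff * b + 16) >> 5
    return pred


# Mode 30 (angle 13) on a 4x4 PU with a hypothetical linear ramp of pixels.
ramp = [10 * i for i in range(9)]
pred_mode30 = predict_region0(ramp, 13, 4)
```

In this sketch, the four rows of a 4x4 PU at angle 13 use the weight pairs (19, 13), (6, 26), (25, 7) and (12, 20), and the last two rows start one reference pixel further along than the first two.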
III. PROPOSED APPROXIMATE HEVC INTRA ANGULAR PREDICTION TECHNIQUE
In this paper, first, data reuse technique is used to reduce amount of computations [28]. Since some of the HEVC intra angular prediction equations use the same Coeff and reference pixels, there are identical luminance angular prediction equations for each PU size. Since different PU sizes may use the same neighboring pixels, there are also identical luminance angular prediction equations between different PU sizes. Data reuse technique calculates the common prediction equations for all luminance angular prediction modes only once and uses the result for the corresponding modes. As shown in Table 1, this reduces the number of prediction equations that should be calculated for a 32x32 coding unit (CU), which includes one 32x32 PU, four 16x16 PUs, sixteen 8x8 PUs and sixty-four 4x4 PUs, from 135168 to 14848.
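The figure of 135168 equations can be reproduced with a short calculation: one equation per predicted pixel, for 33 angular modes, over every PU in the 32x32 CU. The sketch below does this arithmetic; the value 14848 is taken from the text, and the variable names are illustrative.

```python
ANGULAR_MODES = 33
# PU size -> number of PUs of that size inside one 32x32 CU.
PU_COUNTS = {32: 1, 16: 4, 8: 16, 4: 64}

# One prediction equation per pixel, per mode, per PU.
total_equations = sum(ANGULAR_MODES * count * size * size
                      for size, count in PU_COUNTS.items())

EQUATIONS_WITH_REUSE = 14848  # value reported in the text
reduction = 1 - EQUATIONS_WITH_REUSE / total_equations
```

Each PU size contributes 33 x 1024 equations, since the PUs of every size tile the same 32x32 area, and data reuse removes roughly 89% of them.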
Since we use data reuse technique, instead of calculating intra prediction equations of different prediction modes and PUs separately, we calculate all necessary intra prediction equations together and use the results for the corresponding prediction modes and PUs.
As shown in Fig. 6, there are many more intra prediction equations using closer neighboring pixels than intra prediction equations using distant neighboring pixels. HEVC intra angular prediction equations that use neighboring pixels more than 2 positions apart are only 4% of all HEVC intra angular prediction equations. None of the prediction equations for 4x4 PUs use neighboring pixels more than 2 positions apart. Therefore, in this paper, an approximate HEVC intra angular prediction technique is proposed. If the distance between the neighboring pixels used in an intra angular prediction equation is larger than 2, the neighboring pixel at distance 2 from the first neighboring pixel is used instead of the second neighboring pixel. Otherwise, the original neighboring pixels are used.
For example, in Fig. 6 , neighboring pixel vC will be used instead of neighboring pixel vD in the intra prediction equations using neighboring pixels vA and vD. Similarly, neighboring pixel vC will be used instead of neighboring pixel vE in the intra prediction equation using neighboring pixels vA and vE. Original neighboring pixels (vA and vB) will be used in the intra prediction equations using neighboring pixels vA and vB. Similarly, original neighboring pixels (vA and vC) will be used in the intra prediction equations using neighboring pixels vA and vC.
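The replacement rule illustrated by these examples can be written as a small function. This is a minimal sketch under the assumption that neighboring pixels are numbered along one line (vA = 0, vB = 1, vC = 2, and so on); the function name `approximate_pair` and the `max_dist` parameter are hypothetical, with `max_dist=2` corresponding to the rule described here.

```python
def approximate_pair(first, second, max_dist=2):
    """Return the pixel positions actually used by an equation.

    If the two positions are more than max_dist apart, the second
    position is pulled in to max_dist from the first; otherwise the
    original pair is kept.
    """
    dist = second - first
    if abs(dist) > max_dist:
        step = max_dist if dist > 0 else -max_dist
        return first, first + step
    return first, second


NAMES = "ABCDE"
# (vA, vD) and (vA, vE) both become (vA, vC); closer pairs are unchanged.
examples = {
    "vA,v" + NAMES[j]: "vA,v" + NAMES[approximate_pair(0, j)[1]]
    for j in range(1, 5)
}
```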
Original and proposed approximate prediction equations of intra angular prediction mode 23 for 16x16 PU are shown in Fig. 7 . One prediction equation is shown in each box.
The notation 9vA, 23vB denotes the intra angular prediction equation (9 × vA + 23 × vB + 16) >> 5.
Original prediction equations are shown on the right in red boxes. Identical prediction equations are shown only once. For example, prediction equations for the pixels in rows 1-3 of 16x16 PU are identical. Similarly, prediction equations for the pixels in rows 4-7 of 16x16 PU are identical.
Six original prediction equations in Fig. 7 use neighboring pixels that have larger than 2 distance between them. The proposed approximate prediction equations for these six original prediction equations are shown on the left in blue boxes. The other original prediction equations do not use neighboring pixels that have larger than 2 distance between them.
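To give a feel for the size of the error, the sketch below evaluates one equation in its original and approximate forms. The weights 9 and 23 follow the notation used for Fig. 7, but pairing them with vA and vD, and the pixel values themselves, are hypothetical choices made only for this example.

```python
def angular_eq(w1, p1, w2, p2):
    """Evaluate (w1*p1 + w2*p2 + 16) >> 5; the two weights must sum to 32."""
    if w1 + w2 != 32:
        raise ValueError("weights must sum to 32")
    return (w1 * p1 + w2 * p2 + 16) >> 5


# Hypothetical, smoothly varying neighboring pixel values.
pixels = {"vA": 100, "vB": 102, "vC": 104, "vD": 106}

original = angular_eq(9, pixels["vA"], 23, pixels["vD"])     # uses vA and vD
approximate = angular_eq(9, pixels["vA"], 23, pixels["vC"])  # vD replaced by vC
error = original - approximate
```

With these values the two results differ by one gray level, which is consistent with the observation that nearby neighboring pixels tend to have similar values.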
The proposed approximate HEVC intra angular prediction technique is integrated into intra angular prediction in HEVC HM software encoder 15.0 [30]. If the distance between the neighboring pixels used in an intra angular prediction equation is larger than 2, the neighboring pixel at distance 2 from the first neighboring pixel is used instead of the second neighboring pixel in the intra angular prediction function of HM software.
First ten frames of some of the HEVC test videos [31] are coded with all intra (AI) test configuration and four different quantization parameters (QP) using HEVC HM 15.0 with three different HEVC intra angular predictions; original, the proposed approximate HEVC intra angular prediction using neighboring pixels that have at most 1 distance between them (D1), and the proposed approximate HEVC intra angular prediction using neighboring pixels that have at most 2 distance between them (D2).
The resulting rate-distortion performances are shown in Table 2 . D2 causes negligible PSNR loss and bit rate increase because replaced neighboring pixel values are similar as they are close to each other in the video frame. Since D2 has a negligible impact on PSNR and bit rate, it is implemented in the proposed approximate HEVC intra angular prediction hardware instead of D1.
IV. PROPOSED APPROXIMATE HEVC INTRA ANGULAR PREDICTION HARDWARE
The proposed approximate HEVC intra angular prediction hardware for all PU sizes (4x4 to 32x32) implementing data reuse and the proposed approximation technique is shown in Fig. 8 . The proposed approximation technique significantly reduces area of the proposed hardware by enabling efficient use of one multiple constant multiplication (MCM) datapath to implement all constant multiplications using add and shift operations and by reducing amount of on-chip memory. It also reduces amount of computations and amount of on-chip memory accesses.
As shown in Fig. 6 , one neighboring pixel is multiplied with different constants in different intra prediction equations. Therefore, in the proposed hardware, one MCM datapath is used to efficiently implement all constant multiplications using add and shift operations. In the proposed MCM datapath, Hcub MCM algorithm is used to reduce number and size of adders, and adder tree depth [32] .
As shown in Fig. 8, the proposed MCM datapath multiplies an input pixel with constants 1, 2, 3, . . . , 31 by calculating common parts in these constant multiplications once and using them to perform all constant multiplications. It takes only one neighboring pixel in every two cycles and performs multiplications with constants 1, 3, 5, 7, 9, 11, 13, 15. Multiplications with constants 2, 4, 6, 8, 10, 12, 14, 16 are performed by applying shift operations to these multiplication results. Multiplications with constants 17, 18, 19, . . . , 31 are performed by adding 16 times the input pixel (the pixel shifted left by 4 bits) to these multiplication results.
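The three groups of constants can be modeled in software as follows. This is a behavioral sketch only: the particular shift-and-add decompositions of the odd constants are a plausible hand-written choice, not the adder graph produced by the Hcub algorithm, and the function name `mcm_multiples` is an assumption.

```python
def mcm_multiples(pixel):
    """Return {c: c * pixel} for c = 1..31 using only shifts and adds."""
    p = pixel
    # Odd constants 1..15, each built from shifted copies of the pixel.
    odd = {
        1: p,
        3: (p << 1) + p,
        5: (p << 2) + p,
        7: (p << 3) - p,
        9: (p << 3) + p,
        15: (p << 4) - p,
    }
    odd[11] = odd[3] + (p << 3)  # 3p + 8p, reuses the 3p result
    odd[13] = odd[5] + (p << 3)  # 5p + 8p, reuses the 5p result

    products = dict(odd)
    # Even constants 2..16: shift an odd multiple left.
    for c in range(2, 17, 2):
        base, shift = c, 0
        while base % 2 == 0:
            base //= 2
            shift += 1
        products[c] = odd[base] << shift

    # Constants 17..31: add 16 * pixel to the result for (c - 16).
    p16 = products[16]
    for c in range(17, 32):
        products[c] = products[c - 16] + p16
    return products
```

For an 8-bit pixel, every product computed this way equals the ordinary multiplication result, so the model can serve as a reference when checking an RTL datapath.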
As shown in Fig. 6, the number of HEVC intra angular prediction equations using distant neighboring pixels is small. Since the MCM hardware multiplies an input pixel with all constants 1, 2, 3, . . . , 31, it would perform many unnecessary constant multiplications for distant neighboring pixels. Since the number of HEVC intra angular prediction equations using closer neighboring pixels is large and the proposed approximate intra angular prediction technique uses closer neighboring pixels, the proposed hardware performs few unnecessary computations.
Three different on-chip buffers are used to store neighboring pixels in the neighboring PUs. After neighboring PUs of the current PU are coded and reconstructed, the neighboring pixels in these neighboring PUs are stored in the corresponding buffers. These on-chip buffers are used to decrease necessary off-chip memory accesses.
More on-chip memory accesses are required when intra angular prediction equations use distant neighboring pixels. Since the proposed approximate intra angular prediction technique uses closer neighboring pixels, it reduces amount of on-chip memory accesses.
As shown in Fig. 8 , three rotational buffers are used in the proposed hardware. As shown in Fig. 9 , first, constant multiplication results of neighboring pixels vA and vB are stored to rotational buffers 1 and 2, respectively. While the intra prediction equations using both neighboring pixels vA and vB are calculated, constant multiplication results of neighboring pixel vC are stored to rotational buffer 3. After the intra prediction equations using neighboring pixel vA are calculated, there is no need to store the constant multiplication results of neighboring pixel vA in rotational buffer 1. Therefore, while the intra prediction equations using both neighboring pixels vB and vC are calculated, constant multiplication results of neighboring pixel vD are stored to rotational buffer 1. After the intra prediction equations using neighboring pixel vB are calculated, there is no need to store the constant multiplication results of neighboring pixel vB in rotational buffer 2. Therefore, while the intra prediction equations using both neighboring pixels vC and vD are calculated, constant multiplication results of neighboring pixel vE are stored to rotational buffer 2. This process repeats rotationally. Therefore, constant multiplication results of a neighboring pixel are stored in a rotational buffer for 6 clock cycles.
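The rotation of the three buffers can be sketched as a simple schedule. The model below assumes that a new neighboring pixel is written every two clock cycles, as stated for the MCM datapath, and that a buffer is overwritten when its turn comes around again; the exact cycle alignment in the actual hardware is not given in the text, and the function name `buffer_schedule` is hypothetical.

```python
def buffer_schedule(num_pixels, num_buffers=3, cycles_per_pixel=2):
    """List (pixel, buffer, write_cycle, overwrite_cycle) for each pixel."""
    schedule = []
    for k in range(num_pixels):
        name = "v" + chr(ord("A") + k)
        buf = k % num_buffers + 1            # buffers are numbered 1..3
        write_cycle = k * cycles_per_pixel
        # The same buffer is reused num_buffers pixels later.
        overwrite_cycle = write_cycle + num_buffers * cycles_per_pixel
        schedule.append((name, buf, write_cycle, overwrite_cycle))
    return schedule


schedule = buffer_schedule(5)
lifetimes = [overwrite - write for _, _, write, overwrite in schedule]
```

Under these assumptions vA, vB and vC go to buffers 1, 2 and 3, vD and vE reuse buffers 1 and 2, and each set of results stays in its buffer for 3 x 2 = 6 clock cycles.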
If the original HEVC intra angular prediction equations using distant neighboring pixels were used, more rotational buffers would be needed to store constant multiplication results of more neighboring pixels. Since the proposed approximate intra angular prediction technique uses closer neighboring pixels instead of distant neighboring pixels, it reduces the number of on-chip rotational buffers.
If the original HEVC intra angular prediction equations using distant neighboring pixels were used, additional clock cycles would be needed to calculate the intra prediction equations using distant neighboring pixels. For example, in Fig. 9, additional clock cycles would be needed to calculate the intra prediction equations using both neighboring pixels vA and vD.
Since the proposed approximate intra angular prediction technique uses closer neighboring pixels instead of distant neighboring pixels, it reduces amount of computations.
V. IMPLEMENTATION RESULTS
The proposed approximate HEVC intra angular prediction hardware is implemented with Verilog. Simulation results of the Verilog RTL codes matched results of a software implementation of the proposed approximate intra angular prediction technique.
The Verilog RTL codes are synthesized and mapped to a Xilinx Virtex 6 FPGA. FPGA implementation results are shown in Table 3 . The proposed approximate HEVC intra angular prediction hardware uses 318 LUTs, 1068 DFFs, 8 BRAMs. The proposed FPGA implementation is verified to work at 200 MHz by post place and route simulations using reconstructed video frames taken from HM software as input. It can process 24 quad full HD (3840x2160) video frames per second.
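As a rough sanity check on the reported throughput, the clock frequency can be converted into a cycle budget per frame and per luminance pixel. This back-of-the-envelope sketch counts luminance pixels only and is not the authors' performance model, which the text does not describe; the variable names are illustrative.

```python
CLOCK_HZ = 200_000_000          # reported FPGA clock frequency
WIDTH, HEIGHT = 3840, 2160      # quad full HD
FRAMES_PER_SECOND = 24          # reported FPGA throughput

luma_pixels_per_frame = WIDTH * HEIGHT
cycles_per_frame = CLOCK_HZ / FRAMES_PER_SECOND
cycles_per_luma_pixel = cycles_per_frame / luma_pixels_per_frame
```

The budget comes out at just over one clock cycle per luminance pixel, which gives an idea of how much work has to be done in parallel in each cycle.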
As shown in Fig. 10 , a Xilinx Zynq FPGA board, which has a Xilinx FPGA and a dual-core ARM microprocessor, is also used to verify the proposed FPGA implementation. The microprocessor reads video frames from SD card and sends them to the FPGA using a high speed AXI bus. The proposed hardware performs intra prediction. Then, the microprocessor displays intra predicted frames on HDMI monitor and stores them to SD card.
The Verilog RTL codes are synthesized, placed and routed to a TSMC 90nm standard cell library as well. 2x1 NAND gate area is used to calculate gate count of the proposed ASIC implementation. ASIC implementation results are shown in Table 4 .
There are several HEVC intra prediction hardware implementations in the literature [19] - [29]. Efficient neighboring pixels management buffers are proposed in [19]. They eliminate extra clock cycles for neighboring pixels padding by using individual buffers for each depth of quad-tree. Therefore, the hardware in [19] can process 30 quad full HD fps. Intra prediction hardware for HEVC decoder is proposed in [20]. This hardware can predict 4 pixels per clock cycle. Different datapaths with many pipeline stages are used for each PU size in the intra prediction hardware proposed in [21]. Therefore, this hardware has high performance at the expense of very large area. Intra prediction datapaths that can each predict 8 pixels per clock cycle are proposed in [22]. One row of pixels is predicted in each clock cycle. Clock gating is used to disable switching activities in some of the datapaths.
Two different intra angular prediction datapaths are proposed in [23] . The first datapath can only process 4x4 PUs. The second datapath can process 8x8 to 32x32 PUs. Neighboring pixels are stored in registers. Therefore, this hardware has large area. Intra prediction hardware for HEVC decoder is proposed in [24] . Resource sharing and mode-adaptive scheduling are proposed in this hardware, because many modules are not active simultaneously. 64 parallel reconfigurable processing elements are proposed in [25] . Reconfigurable intra angular prediction datapath is proposed in [26] . This hardware has very large area.
Pixel equality-based computation and energy reduction technique is proposed to reduce power consumption of the HEVC intra prediction hardware in [27] . This hardware implements only 4x4 and 8x8 PUs. HEVC intra angular prediction equations are manipulated in [28] . This reduced number of adders used in the intra prediction datapath. HEVC intra prediction datapath is implemented using DSP blocks available in FPGA in [29] .
The proposed approximate HEVC intra angular prediction FPGA and ASIC implementations are compared with the HEVC intra prediction FPGA and ASIC implementations in the literature in Table 3 and Table 4 , respectively [19] - [29] . Since some of the results of the hardware in the literature are not available, we compared the proposed hardware with available results. The results in Table 3 and Table 4 show that the proposed approximate HEVC intra angular prediction hardware is the smallest and the second fastest HEVC intra prediction hardware in the literature. It is ten times smaller and 20% slower than the fastest HEVC intra prediction hardware in the literature.
Xilinx XPower Analyzer tool is used to estimate power consumption of the proposed approximate HEVC intra angular prediction FPGA implementation. All switching activities are stored in VCD files during post place and route timing simulation of the proposed hardware at 100 MHz clock frequency. Xilinx XPower Analyzer tool uses these VCD files to estimate power consumption of the proposed FPGA implementation.
Power and energy consumptions of the HEVC intra prediction FPGA implementations proposed in [19] - [22] are not reported. As shown in Fig. 11 , the proposed FPGA implementation consumes less energy than the FPGA implementations proposed in [28] and [29] . Since the HEVC intra prediction FPGA implementation proposed in [27] implements only 4x4 and 8x8 PUs, it consumes less energy than the proposed FPGA implementation. It consumes 1221 uJ for Tennis (1920x1080) and 1322 uJ for Kimono (1920x1080).
VI. CONCLUSION
In this paper, an approximate HEVC intra angular prediction technique and an approximate HEVC intra angular prediction hardware implementing the proposed approximation technique are proposed. The proposed approximation technique causes negligible PSNR loss and bit rate increase. It significantly reduces area of the proposed approximate hardware by enabling efficient use of one MCM datapath to implement all constant multiplications using add and shift operations and by reducing amount of on-chip memory. The proposed approximate hardware is the smallest and the second fastest HEVC intra prediction hardware in the literature. It is ten times smaller and 20% slower than the fastest HEVC intra prediction hardware in the literature.
