Coding (VVC) standard has much higher computational complexity than fractional interpolation in previous video compression standards. In this paper, a low power VVC fractional interpolation hardware is designed and implemented using Verilog HDL. The proposed hardware is the first VVC fractional interpolation hardware in the literature. It interpolates the fractional pixels necessary for 1/16 pixel accuracy for all prediction unit sizes. In the worst case, the proposed hardware can process 40 full HD (1920x1080) frames per second. It has up to 17% less power consumption than the original VVC fractional interpolation hardware.
I. INTRODUCTION
ITU and ISO are developing a new international video compression standard called Versatile Video Coding (VVC) [1]-[6]. VVC will have higher compression efficiency than the High Efficiency Video Coding (HEVC) standard at the expense of much higher computational complexity [7]-[11].
The HEVC standard uses 3 different 8-tap FIR filters for fractional interpolation (FI) and provides 1/4 fractional pixel accuracy. The VVC standard, however, uses 15 different 8-tap FIR filters for fractional interpolation and provides 1/16 fractional pixel accuracy. Therefore, VVC fractional interpolation has much higher computational complexity than HEVC fractional interpolation.
In this paper, a low power VVC fractional interpolation hardware for all prediction unit (PU) sizes is proposed. The proposed hardware interpolates all necessary fractional pixels for an 8x8 PU. For larger PU sizes, the PU is divided into 8x8 blocks, and the blocks are interpolated separately.
The proposed hardware calculates a common offset for the 15 different FIR filter equations that use the same input pixels, in order to reduce the number of constant coefficient multiplications necessary for fractional interpolation. It also calculates the common sub-expressions in different FIR filter equations once and uses the results in the equations that need them. The Hcub multiplierless constant multiplication (MCM) algorithm [12] is also used in the proposed hardware in order to reduce the number and size of the adders.
The proposed VVC fractional interpolation hardware is implemented in Verilog HDL. The Verilog RTL code is verified to work at 200 MHz in a Xilinx Virtex 7 FPGA. In the worst case, the proposed hardware can process 40 full HD (1920x1080) frames per second. It has up to 17% less power consumption than the original VVC fractional interpolation hardware.
The proposed hardware is the first VVC fractional interpolation hardware in the literature. Several HEVC fractional interpolation hardware implementations are proposed in the literature [13]-[15]. In [13], common sub-expressions in the FIR filters are calculated once and used in all equations, and the Hcub MCM algorithm is used to implement constant multiplications. The implementation in [14] uses coarse-grained reconfigurable datapaths to implement the filter equations. A high-throughput FI hardware for an HEVC encoder is proposed in [15]. In Section III, the VVC fractional interpolation hardware proposed in this paper is compared with them.
The rest of the paper is organized as follows. In Section II, VVC fractional interpolation algorithm is explained. In Section III, the proposed low power VVC fractional interpolation hardware is presented, and its implementation results are given. Finally, Section IV presents the conclusions.
II. VVC FRACTIONAL INTERPOLATION ALGORITHM
The VVC standard uses 15 different 8-tap FIR filters for fractional pixel interpolation. The coefficients of these 15 FIR filters are shown in Table I.
Integer pixels, fractional pixels and the FIR filters used to interpolate these fractional pixels are shown in Fig. 1. There are 255 fractional (half and quarter) pixels for one integer pixel. The 15 fractional pixels between two neighboring horizontal integer pixels are called horizontal half-pixels, and the 15 fractional pixels between two neighboring vertical integer pixels are called vertical half-pixels. These 15 horizontal and 15 vertical half-pixels are interpolated from the nearest integer pixels in the horizontal and vertical directions, respectively, using the 15 different 8-tap FIR filters. There are 15x15=225 quarter-pixels between the 15 horizontal and 15 vertical half-pixels. These quarter-pixels are interpolated from the nearest horizontal half-pixels using the 15 different 8-tap FIR filters.
Table II shows the number of addition and shift operations required for interpolating fractional pixels using the filters F1 to F8. Since the filters F9 to F15 are symmetric to the filters F1 to F7, they require the same number of addition and shift operations as F1 to F7. The number of addition and shift operations required for interpolating fractional pixels in HEVC using F1 and F2 is also shown in Table II. Since the HEVC filter F3 is symmetric to F1, it requires the same number of addition and shift operations as F1. These operation counts show that VVC FI has much higher computational complexity than HEVC FI.
The total number of addition and shift operations required for interpolating all fractional pixels for an 8x8 PU in HEVC and VVC is shown in Table III. Since VVC fractional interpolation calculates more fractional pixels than HEVC fractional interpolation, it has much higher computational complexity.
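The two-step interpolation of a quarter-pixel described above can be sketched in software as follows. This is a minimal illustration rather than the reference algorithm: since the coefficients of Table I are not reproduced here, the widely documented HEVC half-sample luma filter is used as a stand-in for one of the 15 filters, and the rounding, normalization and 8-bit clipping are simplifying assumptions. The function names are hypothetical.

```python
# Stand-in 8-tap filter (HEVC half-sample luma filter); assumption, not Table I.
HALF_TAPS = (-1, 4, -11, 40, 40, -11, 4, -1)  # coefficients sum to 64


def fir8(samples, taps=HALF_TAPS):
    """Unnormalized 8-tap FIR result for one window of 8 samples."""
    if len(samples) != 8 or len(taps) != 8:
        raise ValueError("an 8-tap filter needs exactly 8 samples")
    return sum(c * s for c, s in zip(taps, samples))


def interpolate_quarter(window, taps_h=HALF_TAPS, taps_v=HALF_TAPS):
    """Interpolate one quarter-pixel from an 8x8 window of integer pixels.

    Each row is filtered horizontally first (giving horizontal half-pixels),
    then the column of row results is filtered vertically.
    """
    row_results = [fir8(row, taps_h) for row in window]  # horizontal pass
    acc = fir8(row_results, taps_v)                      # vertical pass
    value = (acc + (1 << 11)) >> 12                      # divide by 64*64 with rounding
    return max(0, min(255, value))                       # clip to the 8-bit range
```

Because the coefficients of each filter sum to 64, a flat region is reproduced unchanged, which is a convenient sanity check for any choice of filter pair.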
III. PROPOSED VVC FRACTIONAL INTERPOLATION HARDWARE
The proposed VVC fractional interpolation hardware for all PU sizes is shown in Fig. 2. The proposed hardware interpolates all fractional pixels for the luma component of a PU using integer pixels or horizontal half-pixels. It interpolates all necessary fractional pixels for an 8x8 PU. For larger PU sizes, the PU is divided into 8x8 blocks, and the blocks are interpolated separately. For example, a 16x16 PU is divided into four 8x8 blocks and each 8x8 block is interpolated separately. In each clock cycle, the proposed hardware takes 15 input pixels (integer pixels or horizontal half-pixels) and interpolates 8x15 fractional pixels in parallel using the 15 different FIR filters. The proposed hardware calculates a common offset, as shown in Table IV, for the 15 different FIR filter equations in order to reduce the number of constant coefficient multiplications necessary for fractional interpolation. Offset values are calculated in the Offset datapath using the input pixels as shown in (2).
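The division of a larger PU into 8x8 blocks can be sketched as below. The function name and the raster processing order are assumptions made for illustration; PU dimensions that are not multiples of 8 are outside the scope of this sketch.

```python
def split_into_8x8(width, height):
    """Return the (x, y) top-left corners of the 8x8 blocks covering a PU.

    Assumes both dimensions are multiples of 8; the raster order used here
    is an illustrative choice, not taken from the paper.
    """
    if width <= 0 or height <= 0 or width % 8 or height % 8:
        raise ValueError("this sketch only handles PU sizes that are multiples of 8")
    return [(x, y) for y in range(0, height, 8) for x in range(0, width, 8)]
```

For a 16x16 PU this yields four block origins, each of which would be interpolated as an independent 8x8 block.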
Since a common offset value is calculated, each FIR filter equation is calculated using the filter coefficients in Table IV, and the result is then added to the common offset value. The F5 filter equation with the offset value is shown in (3) as an example. Since filters F9 to F15 are symmetric to filters F1 to F7, only the coefficients for filters F1 to F8 are shown in Table IV.
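The idea of a common offset can be illustrated with a small sketch. The offset coefficients below are hypothetical and chosen only for demonstration; they are not the values of Table IV, and the two filters are the widely documented HEVC half-sample and quarter-sample luma filters used as stand-ins. The point is only that a term shared by all filters is computed once per input window, after which each filter needs a smaller set of residual coefficients.

```python
# Stand-in filters (HEVC luma half- and quarter-sample filters); assumption.
F_HALF = (-1, 4, -11, 40, 40, -11, 4, -1)
F_QUARTER = (-1, 4, -10, 58, 17, -5, 1, 0)

# Hypothetical common-offset coefficients, chosen only for illustration.
OFFSET_TAPS = (-1, 4, -10, 40, 17, -5, 1, 0)


def dot(taps, window):
    """Plain FIR sum of products over one 8-sample window."""
    return sum(c * s for c, s in zip(taps, window))


def residual(taps):
    """Coefficients left over once the common offset has been factored out."""
    return tuple(c - o for c, o in zip(taps, OFFSET_TAPS))


def filter_with_offset(window, filters):
    """Evaluate every filter as (shared offset) + (filter-specific residual)."""
    offset = dot(OFFSET_TAPS, window)  # computed once for all filters
    return [offset + dot(residual(f), window) for f in filters]
```

With these illustrative values, the residual of the quarter-sample filter has a single non-zero coefficient, which shows how factoring out a shared term can reduce the number of constant multiplications per filter.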
Each of the 15 input pixels (integer pixels or horizontal half-pixels) is multiplied by multiple constant coefficients, as explained in [13]. Table V shows the constant coefficient multiplications necessary for each pixel when the FIR filter equations are calculated with and without the common offset value. In Table V, A-6 to A8 denote the 15 input pixels, where the sub-indices represent the indices of the coefficients. As shown in Table V, since the constant coefficients of input pixels (A-4, A6) and (A-3 … A5) are different, two different datapaths, M1 and M2, are used. When the common offset value is used, the number of calculated products in M1 is reduced from 4 to 2 and the number of calculated products in M2 is reduced from 12 to 7. The M1, M2 and Offset datapaths are shown in Fig. 3. Pixels A-6, A-5, A7 and A8 are used to calculate common sub-expressions in different equations.
Multiplications with constant coefficients are performed using adders and shifters in the M1 and M2 datapaths. In order to reduce the number and size of the adders, the Hcub MCM algorithm [12] is used. It minimizes the number and size of the adders in a multiplier block that multiplies a single input by multiple constants using addition and shift operations. The M1 datapath takes pixel A as input and calculates 3xA and 5xA. The M2 datapath takes pixel A as input and calculates 3xA, 5xA, 7xA, 13xA, 15xA, 19xA and 31xA. Since 8x15 fractional pixels are calculated in parallel and one offset value is used for calculating 15 fractional pixels, the Offset datapath calculates eight common offset values using adders and shifters. After the constant coefficient multiplications and common offset calculations are performed, the fractional pixels are calculated using adder trees. As shown in Table IV, there are common sub-expressions in different equations. The expression (A-3 - 3 * A-2) is common to FIR filters 1, 12, 13, 14 and 15. The expression (A4 - 3 * A3) is common to FIR filters 1, 2, 3, 4 and 15. These common sub-expressions are calculated once in the C1 datapath, and the results are used in the equations that need them.
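The multiplierless products of the M2 datapath can be illustrated with the shift-and-add sketch below. The adder graph shown is derived by hand for illustration and is not necessarily the one produced by the Hcub algorithm; the function name is hypothetical. It uses one adder or subtractor per constant and reuses 3xA to form 13xA and 19xA.

```python
def m2_products(a):
    """Compute 3a, 5a, 7a, 13a, 15a, 19a and 31a with shifts and adds only.

    Hand-derived adder graph for illustration; the actual Hcub output
    used in the hardware may differ.
    """
    x3 = (a << 1) + a    # 2a + a
    x5 = (a << 2) + a    # 4a + a
    x7 = (a << 3) - a    # 8a - a
    x15 = (a << 4) - a   # 16a - a
    x31 = (a << 5) - a   # 32a - a
    x13 = (a << 4) - x3  # 16a - 3a, reuses x3
    x19 = (a << 4) + x3  # 16a + 3a, reuses x3
    return {3: x3, 5: x5, 7: x7, 13: x13, 15: x15, 19: x19, 31: x31}
```

Sharing the intermediate 3xA is the kind of reuse that an MCM algorithm searches for automatically across all required constants.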
As shown in Fig. 4, 30 block RAMs (BRAM) are used in the proposed hardware. 15 BRAMs are used as output memories to store fractional pixels. The other 15 BRAMs are used as a transpose memory to store the horizontal half-pixels necessary for interpolating quarter-pixels. Each BRAM address can store eight pixels. Horizontal half-pixels are interpolated in 15 clock cycles. In each clock cycle, 8x15 horizontal half-pixels are interpolated, and each group of 8 horizontal half-pixels is stored in a different one of the 15 BRAMs, as shown in Fig. 4.
The transpose memory uses a rotating addressing scheme, and the boxes with the same color show the horizontal half-pixels stored in the same clock cycle. After all horizontal half-pixels are stored in the transpose memory in 15 clock cycles, the 15 pixels necessary for interpolating quarter-pixels can always be read in one clock cycle from 15 different BRAMs.
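One way such a rotating addressing scheme could work is sketched below. The exact mapping of Fig. 4 is not reproduced here, so the bank and address formulas are assumptions: the 8-pixel word produced for row `row` (0 to 14) by filter index `frac` (0 to 14) is placed in bank `(row + frac) mod 15` at address `frac`.

```python
NUM_BRAMS = 15  # transpose-memory banks, one 8-pixel word per address


def write_location(row, frac):
    """(bank, address) of the 8-pixel word for a given row and filter index.

    Assumed rotation: the bank index is shifted by one for each new row.
    """
    return (row + frac) % NUM_BRAMS, frac


def read_locations(frac):
    """Locations of the 15 rows needed to filter one fractional column."""
    return [write_location(row, frac) for row in range(NUM_BRAMS)]
```

Under this assumed mapping, the 15 words written in one cycle (one row, all filters) go to 15 different banks, and the 15 words read for one fractional position (all rows, one filter) also come from 15 different banks, so neither access pattern causes a bank conflict.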
Since 255 fractional pixels should be interpolated for each integer pixel, 64x255 fractional pixels should be interpolated for an 8x8 PU. In addition, 8x7x15 extra horizontal half-pixels are necessary for interpolating the quarter-pixels.
First, the 8x15x15 horizontal half-pixels necessary for interpolating quarter-pixels are interpolated in 15 clock cycles and stored in the transpose memory. Then, 8x8x15 vertical half-pixels are interpolated in 8 clock cycles. Finally, 8x8x225 quarter-pixels are interpolated in 8x15 clock cycles using the horizontal half-pixels. There are four pipeline stages in the proposed hardware. Therefore, the proposed hardware interpolates the fractional pixels for an 8x8 PU in 15 + 8 + 120 + 4 = 147 clock cycles.
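The cycle and pixel counts above can be checked with a short calculation. The function names are hypothetical; the sketch assumes 8x15 fractional pixels per clock cycle and four cycles of pipeline latency, as stated in the text.

```python
PIXELS_PER_CYCLE = 8 * 15  # fractional pixels interpolated in each clock cycle


def cycles_per_8x8_block(pipeline_stages=4):
    """Clock cycles needed to interpolate one 8x8 block."""
    horizontal = (8 * 15 * 15) // PIXELS_PER_CYCLE  # horizontal half-pixels, incl. extra rows
    vertical = (8 * 8 * 15) // PIXELS_PER_CYCLE     # vertical half-pixels
    quarter = (8 * 8 * 225) // PIXELS_PER_CYCLE     # quarter-pixels
    return horizontal + vertical + quarter + pipeline_stages


def interpolated_pixels_per_block():
    """Fractional pixels computed per 8x8 block, including the extra rows."""
    return 8 * 15 * 15 + 8 * 8 * 15 + 8 * 8 * 225
```

The pixel total agrees with the 64x255 fractional pixels of the block plus the 8x7x15 extra horizontal half-pixels.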
In this paper, an original VVC fractional interpolation hardware is also designed for comparison. This hardware implements 15 different FIR filter equations separately.
The original and proposed VVC fractional interpolation hardware for all PU sizes are implemented using Verilog HDL. The Verilog RTL codes are verified with RTL simulations, and the RTL simulation results matched the results of a software implementation of the VVC fractional interpolation algorithm. The Verilog RTL codes are synthesized and mapped to a Xilinx XC7VX330T-3FFG1157 FPGA using Xilinx ISE 14.7. The FPGA implementations are verified with post place and route simulations. As shown in Fig. 5, the FPGA implementations are also verified to work correctly on an FPGA board which includes an FPGA, a dual-core microprocessor, 1 GB DRAM and interfaces such as UART and HDMI.
The Verilog RTL codes of the original and proposed VVC fractional interpolation hardware are also synthesized to a TSMC 90 nm standard cell library, and the resulting netlists are placed and routed. The ASIC implementations of the original and proposed hardware use 64.2K and 37.6K gates, respectively, based on NAND (2x1) gate area and excluding on-chip memory. They can work at 333 and 435 MHz, respectively, and can process 67 and 88 full HD frames per second, respectively. The implementation results are shown in Table VI.
Power consumptions of the original and proposed hardware are estimated using the Xilinx XPower Analyzer tool. Post place and route timing simulations are performed for Tennis and Kimono (1920x1080) video frames at 100 MHz [16]. The signal activities of the timing simulations are stored in VCD files, and they are used for estimating the power consumptions of the FPGA implementations. The power consumption results for one frame of each video are shown in Table VII. The proposed VVC fractional interpolation hardware has up to 17% less power consumption than the original VVC fractional interpolation hardware.
A comparison of the proposed VVC fractional interpolation hardware with the HEVC fractional interpolation hardware in the literature is shown in Table VIII.
Since VVC fractional interpolation has much higher computational complexity than HEVC fractional interpolation, the proposed hardware has larger area and lower performance than HEVC fractional interpolation hardware.
IV. CONCLUSION
In this paper, a low power VVC fractional interpolation hardware for all PU sizes is proposed. It is the first VVC fractional interpolation hardware in the literature. The proposed hardware can process 40 full HD (1920x1080) frames per second on a Xilinx Virtex 7 FPGA. It has up to 17% less power consumption than the original VVC fractional interpolation hardware on the same FPGA.
