In this paper, we present a high performance and low cost hardware architecture for real-time implementation of an SAD reuse based hierarchical motion estimation algorithm for H.264 / MPEG4 Part 10 video coding. This hardware is designed to be used as part of a complete H.264 video coding system for portable applications. The proposed architecture is implemented in Verilog HDL. The Verilog RTL code is verified to work at 68 MHz in a Xilinx Virtex II FPGA. The FPGA implementation can process 27 VGA frames (640x480) or 82 CIF frames (352x288) per second.
INTRODUCTION
Video compression systems are used in many commercial products, from consumer electronic devices such as digital camcorders and cellular phones to video teleconferencing systems. These applications make video compression hardware an inevitable part of many commercial products. To improve the performance of existing applications and to enable the use of video compression in new real-time applications, a new international standard for video compression has recently been developed. This new standard, which offers significantly better video compression efficiency than previous video compression standards, was developed through the collaboration of the ITU and ISO standardization organizations. Hence it is known by two different names, H.264 and MPEG4 Part 10.
The video compression efficiency achieved in the H.264 standard is not the result of any single feature but rather of a combination of encoding tools. As shown in the top-level block diagram of an H.264 encoder in Figure 1, one of these tools is the variable block size motion estimation used in the baseline profile of the H.264 standard [1, 2, 3]. Motion estimation is the most computationally demanding part of the encoders implementing the previous video compression standards. Variable block size motion estimation achieves better coding results than the fixed block size motion estimation used in the previous video compression standards. However, it requires even more computation than fixed block size motion estimation. This coding gain therefore comes with an increase in encoding complexity, which makes a real-time implementation of motion estimation for H.264 video coding an exciting challenge.
In this paper, we present a high performance and low cost hardware architecture for real-time implementation of an SAD reuse based hierarchical motion estimation algorithm for H.264 / MPEG4 Part 10 video coding. This hardware is designed to be used as part of a complete H.264 video coding system for portable applications. The proposed architecture is implemented in Verilog HDL. The Verilog RTL code is verified to work at 68 MHz in a Xilinx Virtex II FPGA. The FPGA implementation can process 27 VGA frames (640x480) or 82 CIF frames (352x288) per second.
A hardware architecture for real-time implementation of a variable block size motion estimation algorithm for H.264 video coding is presented in [4]. That hardware achieves higher performance than our hardware design at the expense of a much higher hardware cost: it uses 256 processing elements in its datapath as opposed to 36 processing elements in our datapath. Our hardware design is therefore a more cost-effective solution for portable applications.
The rest of the paper is organized as follows. Section II explains the hierarchical motion estimation algorithm. Section III describes the proposed architecture in detail. The implementation results are given in Section IV. Finally, Section V presents the conclusions.
SAD REUSE BASED HIERARCHICAL MOTION ESTIMATION ALGORITHM
The amount of computation required by the full-search method (FSM) is not practical for real-time implementation even for fixed block size motion estimation (ME) [5, 6].
In this paper, we propose to use an SAD reuse based hierarchical ME algorithm similar to the algorithm presented in [6] . The simulation results show that even though this algorithm has a much lower computational cost than FSM, it provides almost as good coding efficiency as FSM.
Fig. 2. Hierarchical Motion Estimation Algorithm
The algorithm is illustrated in Figure 2. It consists of the following four steps:
The SAD reuse based hierarchical ME algorithm is integrated into the Joint Model (JM) Reference Software Version 7.4 [7]. The updated software is then used to simulate the hierarchical ME algorithm for R=16 using the video sequences carphone (QCIF), foreman (CIF), mobile (SIF) and flowergarden (SIF) at 30 fps. All frames except the first one are coded as P-frames. One reference frame is allowed. The CAVLC entropy coder is used, with quantization parameter values QP = 24, 28, 32, 36. For comparison to FSM, the average PSNR loss in dB and the percentage change in bitrate are reported in Table 1. In addition, at equal bitrates, the PSNR loss is observed to be less than 0.2 dB for all the tested sequences. These results confirm that even though our algorithm has a much lower computational cost than FSM, it provides almost as good coding efficiency as FSM.
PROPOSED HARDWARE ARCHITECTURE
In this section, we explain the proposed hardware architecture for real-time implementation of the SAD reuse based hierarchical motion estimation algorithm described in Section II. The proposed hardware implements the algorithm for the case where R=16, and therefore the search ranges used in all 3 levels l0, l1 and l2 are [-4, 4]. The search window for a [-4, 4] search range contains 9x9 = 81 search locations: 2*4+1 = 9 rows and 2*4+1 = 9 search locations in each row. The current MB (16x16 pixels) and search window (64x64 pixels) are stored in block RAMs in the FPGA.
The proposed hardware first constructs a 3-level pyramid by using the averaging datapath shown in Figure 3. The datapath is used to generate the current block and search window values in levels l1 and l2 by calculating the average of the corresponding pixels in the current MB and search range in level l0. Each averaging unit calculates the average of 4 pixels in level l0. The resulting values are stored in registers and they are used to perform full search for the 8x8 block in level l1 within a search range of [-4, 4]. The averaging unit A5 calculates the average of the results produced by A1-A4, which corresponds to the average of 16 pixels in level l0.
The proposed hardware then performs both the hierarchical MV prediction in levels l2 and l1, and motion estimation with SAD reuse in level l0, using the datapath shown in Figure 4. The datapath uses 36 PEs divided into four separate groups. Each group has an array of 9 PEs. The architecture of a PE and the organization of PEs in a group are shown in Figure 5. As we explain in this section, the reason for using 36 PEs divided into four separate groups is to have an efficient real-time implementation of the motion estimation with SAD reuse in level l0. The hierarchical MV prediction in levels l2 and l1 is implemented by utilizing the hardware resources used for the motion estimation with SAD reuse in level l0.
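The pyramid construction described above can be sketched in software as follows. This is a minimal Python model, not the Verilog datapath of Figure 3: the function names are illustrative, and the rounding rule (add 2, then integer-divide by 4) is an assumption, since the text does not state how the averaging units round.

```python
def downsample_2x2(img):
    """Average each non-overlapping 2x2 block of `img` (assumed rounding: +2 then // 4)."""
    h, w = len(img), len(img[0])
    return [[(img[y][x] + img[y][x + 1] + img[y + 1][x] + img[y + 1][x + 1] + 2) // 4
             for x in range(0, w, 2)]
            for y in range(0, h, 2)]

def build_pyramid(level0):
    """Return (l0, l1, l2): each level halves the width and height of the previous one."""
    level1 = downsample_2x2(level0)   # one value per 4 level-l0 pixels
    level2 = downsample_2x2(level1)   # one value per 16 level-l0 pixels
    return level0, level1, level2

# Number of search locations in a [-4, 4] search range.
search_locations = (2 * 4 + 1) ** 2

# A synthetic 16x16 macroblock, used only to exercise the sketch.
mb = [[(x + y) % 256 for x in range(16)] for y in range(16)]
l0, l1, l2 = build_pyramid(mb)
```

Under these assumptions a 16x16 MB yields an 8x8 block in level l1 and a 4x4 block in level l2, which matches the block sizes searched in those levels.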
The datapath is first used for the hierarchical MV prediction in level l2 by performing full search for the 4x4 block in level l2 within a search range of [-4, 4]. All 36 PEs in the datapath are used to perform the full search as follows. Each PE is used to calculate the SAD value for one search location in the search window. Since there are 9 search locations in one row of the search window, a PE group is used to calculate the SAD values for the search locations in one row of the search window. After each PE group finishes calculating the SAD values for the search locations in one row of the search window, it starts calculating the SAD values in another row of the search window. Therefore, each PE group together with a multiplexer and comparator is used to find the minimum SAD in two rows of the search window. All 4 PE groups are, therefore, utilized to find the motion vector with the minimum SAD in the search window. This process takes 42 clock cycles.
The datapath is then used for the hierarchical MV prediction in level l1 by performing full search for the 8x8 block in level l1 within a search range of [-4, 4]. Since there are four 4x4 partitions (a, b, c, and d) of the 8x8 block and there are 9 search locations in one row of the search window, each PE group is used to calculate the SAD values for a 4x4 partition for the search locations in one row of the search window. Each PE in a group calculates the SAD value for its 4x4 partition for one search location in one row of the search window. PE groups 0, 1, 2, and 3 are used for partitions a, b, c, and d respectively. After each PE group finishes calculating the SAD values for its 4x4 partition for the search locations in the current row of the search window, it starts calculating the SAD values for its 4x4 partition in the next row of the search window. After the corresponding processing elements in each PE group, e.g. processing element 0 in each PE group, calculate the SAD value for a search location for their 4x4 partitions, the 4x8SAD and 8x8SAD adders in the datapath are used to calculate the SAD value for that search location for the 8x8 partition. The multiplexer and comparator at the outputs of the 8x8SAD adders are used to find the minimum SAD for the 8x8 partition and the corresponding motion vector in the search window. This process takes 156 clock cycles.
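The full search performed in levels l2 and l1 can be illustrated with a sequential software model. The sketch below is an assumption-laden Python stand-in for the parallel PE array: `sad` and `full_search` are hypothetical names, the search visits the 81 locations one at a time, and ties are resolved in favor of the first location scanned, which the text does not specify.

```python
def sad(cur, ref, top, left):
    """Sum of absolute differences between `cur` and the same-size block of `ref` at (top, left)."""
    n = len(cur)
    return sum(abs(cur[y][x] - ref[top + y][left + x])
               for y in range(n) for x in range(n))

def full_search(cur, ref, center_y, center_x, r=4):
    """Scan all (2r+1)^2 locations around the center; return (min_sad, dy, dx)."""
    best = None
    for dy in range(-r, r + 1):          # 9 rows of the search window
        for dx in range(-r, r + 1):      # 9 search locations per row
            cost = sad(cur, ref, center_y + dy, center_x + dx)
            if best is None or cost < best[0]:
                best = (cost, dy, dx)
    return best

# Synthetic 12x12 reference area in which every 4x4 block is distinct.
ref = [[y * 16 + x for x in range(12)] for y in range(12)]
# Current 4x4 block copied from the reference at top=6, left=3.
cur = [row[3:7] for row in ref[6:10]]
best = full_search(cur, ref, center_y=4, center_x=4)
```

In the hardware each of these 81 SAD computations is assigned to a PE, so the loop nest above corresponds to work that the four PE groups carry out row by row.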
The datapath is finally used for the motion estimation with SAD reuse in level l0. It is used to perform full search based on minimizing the Lagrangian cost for the 16x16 current MB and for all of its partitions, at both the location pointed to by the motion vector and location (0,0), within a search range of [-4, 4] to determine the 41 best motion vectors for all partitions of the MB. The datapath is designed to use the SAD reuse technique for performing full search for a 16x16 MB and for all of its partitions within a search range of [-4, 4]. Each PE group in the datapath together with a multiplexer and comparator is used to perform full search for a 4x4 partition of the 16x16 MB within a search range of [-4, 4]. Since there are 9 search locations in one row of the search window, 9 PEs are grouped together to calculate the SAD values for a 4x4 partition for the search locations in one row of the search window. Each processing element in a group calculates the SAD value for a 4x4 partition for one search location in one row of the search window.
As shown in Figure 6, in order to reduce the number of current block and search window register ports and the number of accesses to these registers, each PE in a group starts calculating its SAD value one cycle later than the previous PE in that group, so that PEs can reuse the current block value accessed by the first PE in the group and several PEs can use the same search window value in the same cycle. Since PE0 starts working in cycle 0, it finishes calculating its first SAD in cycle 15. The last PE in that group, PE8, finishes calculating its SAD in cycle 8 + 15 = 23. After each PE finishes calculating an SAD value for a 4x4 partition in the current row of the search window, it starts calculating an SAD value for the same 4x4 partition in the next row of the search window.
Since there are 9 rows in the search window, the minimum SAD for a 4x4 partition and the corresponding motion vector is found in 8 + 9x16 = 152 cycles.
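The cycle count above follows from the staggered start of the PEs. A small Python sketch of that schedule is given below; the function names are illustrative, and the model assumes that each PE accumulates one pixel difference per cycle (16 cycles per 4x4 SAD) and moves to the next search window row without idle cycles, as the text describes.

```python
def pe_finish_cycle(pe_index, row, cycles_per_sad=16):
    """0-based cycle in which PE `pe_index` finishes its SAD for search window row `row`.

    PE k starts k cycles after PE0 and spends `cycles_per_sad` cycles on each row.
    """
    return pe_index + (row + 1) * cycles_per_sad - 1

def cycles_for_partition(num_pes=9, rows=9, cycles_per_sad=16):
    """Total cycles until the last PE finishes the last row of the search window."""
    return pe_finish_cycle(num_pes - 1, rows - 1, cycles_per_sad) + 1

first_pe_done = pe_finish_cycle(0, 0)    # PE0, first row
last_pe_done = pe_finish_cycle(8, 0)     # PE8, first row
total_cycles = cycles_for_partition()    # all 9 rows, all 9 PEs
```

With the default parameters the model reproduces the figures in the text: cycle 15 for PE0, cycle 23 for PE8, and 8 + 9x16 = 152 cycles in total.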
Since the full searches for a 16x16 MB and for all of its partitions are performed starting at the same location in level l0 (the location pointed to by the motion vector or location (0,0)) within the same size search range ([-4, 4]), the search windows of two neighboring 4x4 partitions (a, b) of the MB overlap as shown in Figure 7. The search window regions s1, s2 and s3 are used for partition a, and the search window regions s2, s3 and s4 are used for partition b. Therefore, the search window regions s2 and s3 are shared by both a and b partitions. In order to exploit this to reduce the number of search window register ports (from 3+3=6 to 4) and the number of accesses to search window registers, the full searches for partitions a and b are performed simultaneously by using PE group 0 for partition a and PE group 1 for partition b. As shown in Figure 6, the processing elements in PE group 1 start calculating their SADs 4 cycles later than the corresponding processing elements in PE group 0, so that several PEs in group 0 and group 1 can use the same search window value (in regions s2 or s3) in the same cycle. Therefore, the minimum SAD for partition b and the corresponding motion vector are found in 4+152=156 cycles.
As the PE groups 0 and 1 perform the full search for partitions a and b, PE groups 2 and 3 perform the full search for partitions c and d simultaneously based on the same data flow shown in Figure 6 . Therefore, the minimum SADs for 4x4 partitions a, b, c and d and the corresponding motion vectors are found in 156 cycles.
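The overlap between the search windows of neighboring partitions can be made concrete with a short Python sketch. It assumes, since Figure 7 is not reproduced here, that partitions a and b are horizontally adjacent 4x4 blocks and that the regions s1 to s4 are consecutive 4-column strips of the search window; the function names and the region indexing are hypothetical.

```python
def window_columns(part_left, width=4, r=4):
    """Columns of the search window needed by a partition whose left edge is `part_left`."""
    return range(part_left - r, part_left + width + r)

def regions(cols, region_width=4):
    """Indices of the 4-column strips touched by `cols` (assumed region layout)."""
    return sorted({c // region_width for c in cols})

regions_a = regions(window_columns(0))   # partition a at columns 0..3
regions_b = regions(window_columns(4))   # partition b at columns 4..7

shared = sorted(set(regions_a) & set(regions_b))
ports_separate = len(regions_a) + len(regions_b)     # searching a and b independently
ports_shared = len(set(regions_a) | set(regions_b))  # searching a and b together
```

Under these assumptions each partition touches three strips, two of which are common, so handling a and b together needs 4 region ports instead of 3+3=6, in line with the reduction stated in the text.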
After the corresponding processing elements in each PE group, e.g. processing element 0 in each PE group, calculate the SAD value for a search location for their 4x4 partitions, the 4x8SAD, 8x4SAD and 8x8SAD adders in the datapath are used to calculate the SAD values for that search location for the 4x8 (a+b and c+d), 8x4 (a+c and b+d), and 8x8 (a+b+c+d) partitions by reusing the SAD values of the 4x4 partitions. In other words, as the full searches for 4x4 partitions a, b, c, and d are performed, the full searches for two 4x8 (a+b and c+d), two 8x4 (a+c and b+d), and one 8x8 (a+b+c+d) partition are also performed in parallel by using the 4x8SAD, 8x4SAD and 8x8SAD adders and the multiplexers and comparators at their outputs in the datapath. Therefore, by using the SAD reuse technique, the minimum SADs for two 4x8, two 8x4 and one 8x8 partition and the corresponding motion vectors are found as well in the same 156 cycles.
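The SAD reuse step for one search location can be sketched as follows. This Python model is illustrative only: `sad_region` and `reuse_sads` are hypothetical names, the layout of a, b, c, d (top-left, top-right, bottom-left, bottom-right of the 8x8 block) is an assumption, and the partition labels simply follow the text (4x8 for a+b and c+d, 8x4 for a+c and b+d).

```python
def sad_region(cur, ref, y0, x0, h, w):
    """SAD between `cur` and an already aligned candidate `ref` over an h-row by w-column region."""
    return sum(abs(cur[y][x] - ref[y][x])
               for y in range(y0, y0 + h) for x in range(x0, x0 + w))

def reuse_sads(sad_a, sad_b, sad_c, sad_d):
    """Derive the larger-partition SADs from the four 4x4 SADs using additions only."""
    return {
        "4x8": (sad_a + sad_b, sad_c + sad_d),
        "8x4": (sad_a + sad_c, sad_b + sad_d),
        "8x8": sad_a + sad_b + sad_c + sad_d,
    }

# Synthetic 8x8 current block and candidate block, used only to exercise the sketch.
cur = [[(3 * y + 5 * x) % 256 for x in range(8)] for y in range(8)]
ref = [[(7 * y + 2 * x + 1) % 256 for x in range(8)] for y in range(8)]

sad_a = sad_region(cur, ref, 0, 0, 4, 4)   # assumed top-left
sad_b = sad_region(cur, ref, 0, 4, 4, 4)   # assumed top-right
sad_c = sad_region(cur, ref, 4, 0, 4, 4)   # assumed bottom-left
sad_d = sad_region(cur, ref, 4, 4, 4, 4)   # assumed bottom-right
combined = reuse_sads(sad_a, sad_b, sad_c, sad_d)
```

Because SAD is a sum over disjoint pixels, each combined value equals the SAD that a direct computation over the larger partition would give, which is why the adders can replace additional PEs.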
After the full search for the first four 4x4 partitions is performed, the four PE groups are used to perform the full search for the next four 4x4 partitions of the MB. Again, by using the SAD reuse technique, the full searches for the corresponding two 4x8, two 8x4, and one 8x8 partition are performed in parallel. Since there are four 8x8 partitions in an MB, this process is repeated 4 times. Therefore, the full searches for all 4x4, 4x8, 8x4 and 8x8 partitions are performed in 4*156 = 624 clock cycles.
As the full searches for the 8x8 partitions are performed, the full searches for the 8x16, 16x8 and 16x16 partitions are also performed in parallel by using the 8x16SAD, 16x8SAD and 16x16SAD registers, adders, multiplexers and comparators in the datapath. Therefore, by using the SAD reuse technique, the minimum SADs for the 8x16, 16x8 and 16x16 partitions and the corresponding motion vectors are found as well in the same 624 clock cycles.
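The partition count and the cycle budget of one level l0 search pass can be checked with a few lines of Python. The dictionary and variable names below are illustrative; the per-size counts are the standard number of blocks of each size in a 16x16 MB, and the cycle figures are taken from the text.

```python
# Number of partitions of each size in one 16x16 macroblock.
PARTITIONS_PER_MB = {
    "16x16": 1,
    "16x8": 2,
    "8x16": 2,
    "8x8": 4,
    "8x4": 8,
    "4x8": 8,
    "4x4": 16,
}
total_motion_vectors = sum(PARTITIONS_PER_MB.values())

cycles_one_group = 8 + 9 * 16             # one PE group, one 4x4 partition
cycles_four_groups = 4 + cycles_one_group # groups 1 and 3 start 4 cycles late
cycles_one_pass = 4 * cycles_four_groups  # four 8x8 quadrants per MB
```

The sum gives the 41 motion vectors mentioned earlier, and the last value is the 624 clock cycles needed for one full search pass in level l0.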
After the full search for the 16x16 current MB and for all of its partitions at the location pointed by the motion vector within a search range of [-4, 4] 
IMPLEMENTATION RESULTS
The proposed architecture is implemented in Verilog HDL. The implementation is verified with RTL simulations using Mentor Graphics ModelSim SE. The Verilog RTL is then synthesized to a 2V8000ff1152 Xilinx Virtex II FPGA with speed grade 5 using Mentor Graphics Leonardo Spectrum. The resulting netlist is placed and routed to the same FPGA using Xilinx ISE Series 5.2i. The FPGA implementation is verified to work at 68 MHz under worst-case PVT conditions with post place and route simulations. The FPGA implementation can process a VGA frame in 36.8 msec (1200 MB * 2086 clock cycles per MB * 14.7 ns clock cycle = 36.8 msec). Therefore, it can process 1000/36.8 = 27 VGA frames (640x480) per second. The FPGA implementation can process a CIF frame in 12.2 msec (396 MB * 2086 clock cycles per MB * 14.7 ns clock cycle = 12.2 msec). Therefore, it can process 1000/12.2 = 82 CIF frames (352x288) per second.
The FPGA implementation, including input, output and internal RAMs and register files, uses the following FPGA resources: 14505 Function Generators, 7253 CLB Slices, 5227 Dffs/Latches, 13 Block RAMs, and 7 Block Multipliers (used for calculating M * R), i.e. 15.5% of Function Generators, 15.5% of CLB Slices, 5.4% of Dffs/Latches, 7.7% of Block RAMs, and 4.1% of Block Multipliers.
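The frame rate arithmetic above can be reproduced with a short Python sketch. The function name and its defaults are illustrative; the 2086 cycles per MB and the 14.7 ns clock period are the values quoted in the text, and the frame dimensions are assumed to be multiples of 16 so that the MB count is exact.

```python
def frame_timing(width, height, cycles_per_mb=2086, clock_ns=14.7):
    """Return (macroblocks, frame time in msec, frames per second) for one frame size."""
    macroblocks = (width // 16) * (height // 16)
    frame_ms = macroblocks * cycles_per_mb * clock_ns * 1e-6  # ns -> msec
    return macroblocks, frame_ms, 1000.0 / frame_ms

vga_mbs, vga_ms, vga_fps = frame_timing(640, 480)
cif_mbs, cif_ms, cif_fps = frame_timing(352, 288)
```

Truncating the resulting frame rates to whole frames gives the per-second figures reported for VGA and CIF.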
CONCLUSION
In this paper, we presented a high performance and low cost hardware architecture for real-time implementation of an SAD reuse based hierarchical motion estimation algorithm for H.264 / MPEG4 Part 10 video coding. This hardware is designed to be used as part of a complete H.264 video coding system for portable applications. The proposed architecture is implemented in Verilog HDL. The Verilog RTL code is verified to work at 68 MHz in a Xilinx Virtex II FPGA. The FPGA implementation can process 27 VGA frames (640x480) or 82 CIF frames (352x288) per second.
