We propose a high-performance architecture for fractional motion estimation and Lagrange mode decision in H.264/AVC. Instead of time-consuming fractional-pixel interpolation and a secondary search, our fractional motion estimator employs a mathematical model to estimate SADs at quarter-pixel positions. Both computation time and memory access requirements are greatly reduced without significant quality degradation. We also propose a novel cost function for mode decision that performs much better than the traditional low-complexity method. Synthesized in a TSMC 0.13 µm CMOS technology, our design occupies 56K gates at 100 MHz and is sufficient to process QUXGA (3200x2400) video sequences at 30 frames per second (fps). Compared with a state-of-the-art design operating at the same frequency, ours is 30% smaller and has 18 times higher throughput, at the expense of only a 0.05 dB difference in PSNR.
INTRODUCTION
H.264 advanced video coding [1] is the latest video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). Its new features include variable block-size motion estimation (ME) with multiple reference frames, an integer 4x4 discrete cosine transform, an in-loop deblocking filter and context-adaptive binary arithmetic coding (CABAC). H.264/AVC can save up to 50% of the bit-rate compared to MPEG-4 simple profile at the same video quality level. However, a large amount of computation is required. Profiling shows that motion estimation (5 reference frames, search range 16) consumes more than 90% of the encoding time. Many fast algorithms and efficient architectures have been proposed for integer motion estimation (IME), but few for fractional motion estimation (FME), which accounts for about 40% of motion estimation time. Therefore, an efficient hardware accelerator for fractional motion estimation is necessary in real-time applications.
After motion estimation, mode decision determines the encoding cost of each mode, as shown below, and chooses the mode with the minimal cost.
The rest of this paper is organized as follows. In Section 2, we briefly survey related work. In Section 3, we present our approach for fractional motion estimation and mode decision. In Section 4, we propose our VLSI architecture. Experimental results are presented in Section 5. Finally, we draw conclusions and point to possible directions for future research in Section 6.
RELATED WORK
In conventional motion estimation, integer ME (IME) is performed first, and a half-pixel precision search is then applied to the eight neighboring half-pixel positions around the best full-pixel position. The quarter-pixel precision search is then performed similarly. This method requires time-consuming half-pixel and quarter-pixel interpolation in advance. The large amount of memory access and computation becomes an obstacle to real-time applications. Although several fast fractional-pixel search methods reduce the complexity caused by interpolation, they usually introduce large estimation errors and result in poor video quality.
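To give a sense of the interpolation work that the conventional search performs for every candidate sample, the following minimal sketch computes one luma half-pixel sample with the 6-tap filter (1, -5, 20, 20, -5, 1) used by H.264 and one quarter-pixel sample as the rounded average of two neighbors. The function names and the 1-D row layout are illustrative assumptions and are not taken from any particular encoder.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 luma half-sample filter taps


def clip8(value):
    """Clamp a value to the 8-bit sample range 0..255."""
    return max(0, min(255, value))


def half_pel(row, i):
    """Half-pixel sample between row[i] and row[i + 1].

    Needs the six integer samples row[i - 2] .. row[i + 3].
    """
    acc = sum(tap * row[i - 2 + k] for k, tap in enumerate(TAPS))
    return clip8((acc + 16) >> 5)  # divide by 32 with rounding


def quarter_pel(a, b):
    """Quarter-pixel sample as the rounded average of two neighbors."""
    return (a + b + 1) >> 1


row = [0, 0, 0, 255, 255, 255]       # a hard edge between row[2] and row[3]
h = half_pel(row, 2)                 # half-pixel sample on the edge
q = quarter_pel(row[2], h)           # quarter-pixel sample next to it
```

Every half-pixel sample costs six multiply-accumulates plus six sample reads, and this is repeated for each pixel of each candidate block, which is the overhead a model-based estimator avoids.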
There are two popular mode decision algorithms. Rate-Distortion Optimized (RDO) mode decision [2] considers both distortion and bit-rate by carrying out the entire encoding loop. It is computationally expensive but results in better quality and compression rate. Low-complexity mode decision, on the other hand, considers only the sum of absolute differences (SAD) and the estimated bit-rate for encoding motion information. By avoiding the whole encoding loop, this approach speeds up the decision process at the expense of poorer quality and compression rate compared with the RDO approach.
PROPOSED ALGORITHM

Fractional Motion Estimation
In Reference [3], the authors proposed a mathematical model that estimates SADs at half-pixel precision from the neighboring integer-pixel precision SADs. This method avoids half-pixel interpolation and reduces computation time. We adopt this method and extend it to quarter-pixel precision.
In Figure 1, squares denote integer pixels (f1, f2, …, f9) and triangles denote half pixels (h1, h2, …, h9). Let (0, 0) be the position of the integer motion vector (MV) determined by IME.
Fig.1. Integer and Half-pixel
The mathematical model used to approximate the surface defined by the nine integer pixels is as follows. Writing down the nine SADs, we obtain Eq. (2) below:
The nine coefficients of the model are determined by the nine integer-pixel precision SADs around (0, 0). We obtain them by inverting the matrix in Eq. (2), as shown in Eq. (3).
We substitute these nine coefficients into the original mathematical model (Eq. (2)). In the next step, the SADs at the neighboring half-pixel positions (h1, h2, …, h9) are obtained by replacing x and y in Eq. (2) with the coordinates of each position. The position that yields the minimum SAD gives the half-pixel precision MV. In addition, to reduce computation time, each polynomial of Eq. (2) can be calculated in advance as below.
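The half-pixel step can be sketched in a few lines. The sketch assumes that the nine-coefficient model is the biquadratic polynomial in x and y that passes exactly through the nine integer-position SADs, and it evaluates that polynomial through its Lagrange form instead of an explicit inverse matrix. The function names, the `sads[row][column]` layout and the sample values are hypothetical.

```python
from fractions import Fraction


def basis(t):
    """Quadratic Lagrange weights for the nodes -1, 0, +1, evaluated at t."""
    t = Fraction(t)
    return (t * (t - 1) / 2, 1 - t * t, t * (t + 1) / 2)


def model_sad(sads, x, y):
    """Estimated SAD at offset (x, y) from the best integer position.

    sads[j][i] holds the integer-position SAD at offset (i - 1, j - 1).
    """
    wx, wy = basis(x), basis(y)
    return sum(wy[j] * wx[i] * sads[j][i] for j in range(3) for i in range(3))


def best_half_pel(sads):
    """Offset in pixels of the half-pixel candidate with the smallest estimate."""
    half = Fraction(1, 2)
    candidates = [(dx * half, dy * half) for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return min(candidates, key=lambda p: model_sad(sads, p[0], p[1]))


# Hypothetical SADs around the best integer position (center value 200).
sads = [[1080, 360, 440],
        [920, 200, 280],
        [1560, 840, 920]]
half_mv = best_half_pel(sads)
```

At an offset of half a pixel the three weights are -1/8, 3/4 and 3/8, so each estimate is a fixed linear combination of the nine integer SADs with power-of-two denominators; no pixel data is read at this stage.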
After finding the minimum SAD at the half-pixel level, we reset the origin (0, 0) to the position pointed to by the half-pixel MV and find the minimum SAD at quarter-pixel precision in the same way.
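One possible reading of the two-stage refinement is sketched below. It again assumes a biquadratic model through the nine integer-position SADs, and it evaluates the same fitted surface at quarter-pixel offsets around the half-pixel winner. For a biquadratic surface this gives the same result as refitting the model to nine model-estimated half-pixel SADs around the new origin, because biquadratic interpolation reproduces a biquadratic exactly. Whether the hardware is organized this way is an assumption, and all names and sample values are hypothetical.

```python
from fractions import Fraction


def basis(t):
    """Quadratic Lagrange weights for the nodes -1, 0, +1, evaluated at t."""
    t = Fraction(t)
    return (t * (t - 1) / 2, 1 - t * t, t * (t + 1) / 2)


def model_sad(sads, x, y):
    """Estimated SAD at offset (x, y); sads[j][i] is the SAD at (i - 1, j - 1)."""
    wx, wy = basis(x), basis(y)
    return sum(wy[j] * wx[i] * sads[j][i] for j in range(3) for i in range(3))


def refine(sads, center, step):
    """Best of the nine candidates spaced `step` pixels around `center`."""
    cx, cy = center
    candidates = [(cx + dx * step, cy + dy * step)
                  for dy in (-1, 0, 1) for dx in (-1, 0, 1)]
    return min(candidates, key=lambda p: model_sad(sads, p[0], p[1]))


def quarter_pel_mv(sads):
    """Half-pixel search around (0, 0), then quarter-pixel search around its winner."""
    half_mv = refine(sads, (Fraction(0), Fraction(0)), Fraction(1, 2))
    return refine(sads, half_mv, Fraction(1, 4))


# Hypothetical SADs around the best integer position (center value 200).
sads = [[1080, 360, 440],
        [920, 200, 280],
        [1560, 840, 920]]
fractional_mv = quarter_pel_mv(sads)
```

The returned offset is added to the integer MV; both stages reuse the same evaluation routine, which matches the idea of repeating the half-pixel procedure at a finer step.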
Our mathematical-model-based approach greatly reduces the computation time. Table 1 compares the computation cost of our mathematical approach with that of traditional sub-pixel interpolation for processing one 16x16 macroblock. Another advantage of this model is that the same equations can be used to refine the MVs of variable-size blocks, which further reduces hardware cost. With this mathematical model and a pipelined architecture, we can refine 41 variable block-size integer MVs to quarter-pixel precision in 45 clock cycles.
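The figure of 41 motion vectors follows from the seven partition sizes that H.264 allows inside one 16x16 macroblock. The short check below enumerates them; the variable names are illustrative.

```python
MB = 16  # macroblock width and height in pixels

# The seven H.264 partition sizes as (width, height).
SIZES = [(16, 16), (16, 8), (8, 16), (8, 8), (8, 4), (4, 8), (4, 4)]


def blocks_per_macroblock(width, height):
    """Number of blocks of the given size that tile one macroblock."""
    return (MB // width) * (MB // height)


counts = {f"{w}x{h}": blocks_per_macroblock(w, h) for w, h in SIZES}
total_mvs = sum(counts.values())  # 1 + 2 + 2 + 4 + 8 + 8 + 16
```

Each of these blocks carries its own integer MV and its own nine SADs, so one pass of the refinement per block covers the whole macroblock.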
Mode Decision
Although RDO mode decision yields better performance, it is very complicated to implement in either hardware or software, and we keep it on our road map as a long-term goal. For now, we propose to enhance the low-complexity mode decision with a more accurate Lagrange cost function, given in Eq. (5), in which SAD is the sum of absolute differences between the current and reference blocks, λ is a weight parameter, and Bit-Usage represents the bit-rate used to encode the motion information. Instead of looking up the MVBITS table as in traditional low-complexity mode decision, we formulate λ as a function of the quantization parameter (QP) for better rate-distortion estimation. This is similar to the RDO approach [4]. Bit-Usage is a function of the reference index (Ref_idx) and the motion vector difference (MVD). The reference index points to the referenced frame among all reference frames. We take the MVD rather than the MV into account because the entropy coder (CAVLC or CABAC) encodes the MVD instead of the MV to save bit-rate. The functions for Bit-Usage and λ are shown below.
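A Lagrange cost of this kind can be sketched as follows. The cost form SAD + λ × Bit-Usage, the Exp-Golomb code lengths used for Bit-Usage, and the λ(QP) expression (the square root of 0.85 · 2^((QP − 12)/3), a form widely used in rate-constrained H.264 encoder control) are assumptions standing in for the paper's own functions. All function names and sample numbers are hypothetical.

```python
import math


def ue_bits(code_num):
    """Length in bits of the unsigned Exp-Golomb code for code_num >= 0."""
    return 2 * (code_num + 1).bit_length() - 1


def se_bits(value):
    """Length in bits of the signed Exp-Golomb code for an integer value."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return ue_bits(code_num)


def bit_usage(ref_idx, mvd):
    """Assumed Bit-Usage: code lengths of Ref_idx and both MVD components."""
    return ue_bits(ref_idx) + se_bits(mvd[0]) + se_bits(mvd[1])


def lagrange_weight(qp):
    """Assumed lambda(QP); a placeholder, not the paper's function."""
    return math.sqrt(0.85 * 2.0 ** ((qp - 12) / 3.0))


def mode_cost(partitions, qp):
    """Sum of SAD + lambda * Bit-Usage over the partitions of one mode.

    Each partition is (sad, ref_idx, (mvd_x, mvd_y)), MVD in quarter pixels.
    """
    lam = lagrange_weight(qp)
    return sum(sad + lam * bit_usage(ref, mvd) for sad, ref, mvd in partitions)


def choose_mode(modes, qp):
    """Name of the mode with the smallest total cost."""
    return min(modes, key=lambda name: mode_cost(modes[name], qp))


modes = {
    "16x16": [(1000, 0, (0, 0))],
    "8x8": [(200, 0, (4, -3))] * 4,
}
```

With these sample numbers the 8x8 mode wins at a very low QP, where motion bits are cheap, while the 16x16 mode wins at QP 28, where the extra MVD bits outweigh the lower SAD. This is the behavior a QP-dependent λ is meant to capture.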
ARCHITECTURE
Figure 2 depicts our proposed top-level architecture. It consists of two parts: FME and mode decision. For each variable-size block, the FME receives from IME an integer MV and nine associated SADs (one for the best integer position and eight around it) and outputs a fractional MV and a minimum SAD at quarter-pixel precision. The mode decision engine receives the SAD and fractional MV from FME and the reference index from IME. It produces the chosen modes and the associated MVs of the macroblock and sub-macroblocks. Figure 3 shows the architecture of our FME, which costs only 85 adders and 16 shifters. The architecture is a direct implementation of Eq. (3) and Eq. (4). In the first step, FME receives nine SADs from IME and computes the nine corresponding SADs at half-pixel precision. Then, the comparator finds the minimum half-pixel SAD, and the MV Refiner adjusts the integer MV to half-pixel precision according to the comparison result. The engine then takes the half-pixel precision SADs and MV as its inputs and refines them to quarter-pixel precision.
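To illustrate why a datapath of this kind needs only adders and shifters, the sketch below evaluates nine half-pixel SAD estimates with integer shift-and-add operations. It assumes a biquadratic model through the nine integer-position SADs, for which the weights at a half-pixel offset are -1/8, 6/8 and 3/8 along each axis. It is not the authors' datapath and makes no attempt to match their adder count; the names and sample values are hypothetical.

```python
def times3(a):
    """3 * a using one shift and one add."""
    return (a << 1) + a


def times6(a):
    """6 * a using two shifts and one add."""
    return (a << 2) + (a << 1)


def axis8(a, b, c, d):
    """8 x the quadratic through (a, b, c) at offset d / 2, for d in {-1, 0, 1}."""
    if d == 0:
        return b << 3
    near, far = (c, a) if d > 0 else (a, c)
    return times3(near) + times6(b) - far


def half_pel_sads(sads):
    """Nine half-pixel estimates keyed by (dx, dy) in half-pixel units."""
    out = {}
    for dy in (-1, 0, 1):
        for dx in (-1, 0, 1):
            rows = [axis8(r[0], r[1], r[2], dx) for r in sads]  # scaled by 8
            v64 = axis8(rows[0], rows[1], rows[2], dy)          # scaled by 64
            out[(dx, dy)] = (v64 + 32) >> 6                     # rounded divide
    return out


# Hypothetical SADs; sads[j][i] is the SAD at integer offset (i - 1, j - 1).
sads = [[1080, 360, 440],
        [920, 200, 280],
        [1560, 840, 920]]
estimates = half_pel_sads(sads)
best = min(estimates, key=estimates.get)  # role of the comparator stage
```

The horizontal pass is shared by the three candidates in each column, so a hardware version would compute it once per column and reuse it, which is one way resource counts of this order can arise.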
EXPERIMENTAL RESULTS
We use six video sequences, 'Foreman', 'Akiyo', 'News', 'Mobile', 'Tempete' and 'Carphone', to test the quality of our design. Each sequence consists of 100 frames in CIF (352x288) format. Table 2 shows the results in comparison with the H.264 reference software JM9.0 [5] (GOP: IPPPP; 5 reference frames; search range 16). We observe that our mathematical-model-based FME causes a 0.2 to 0.4 dB degradation in PSNR. Using the proposed mode decision, we reduce this degradation to 0.03 to 0.21 dB. We have implemented the proposed FME and mode decision in synthesizable Verilog HDL. We synthesize our design with a TSMC 0.13 µm cell library under the worst-case operating environment (WCCOM). Table 3 shows the synthesis result and the comparison with a previous design [6]. Our design consumes 56K gates (28K each for FME and MD), which is 30% smaller than the previous work.
Under the same operating frequency (100 MHz), it delivers 18 times higher throughput. The difference in PSNR drop is only 0.05 dB.
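The throughput target can be checked with simple arithmetic. The sketch below computes the macroblock rate needed for QUXGA at 30 fps and the resulting cycle budget per macroblock at the 100 MHz clock quoted above. The function and variable names are illustrative.

```python
def macroblocks_per_second(width, height, fps, mb_size=16):
    """Macroblock rate for a frame size, rounding partial macroblocks up."""
    mbs_wide = (width + mb_size - 1) // mb_size
    mbs_high = (height + mb_size - 1) // mb_size
    return mbs_wide * mbs_high * fps


CLOCK_HZ = 100_000_000                                   # 100 MHz clock
required_rate = macroblocks_per_second(3200, 2400, 30)   # QUXGA at 30 fps
cycle_budget = CLOCK_HZ // required_rate                 # whole cycles per MB
```

QUXGA at 30 fps amounts to 900,000 macroblocks per second, which leaves a budget of 111 clock cycles per macroblock at 100 MHz for FME and mode decision combined.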
CONCLUSION
We have presented a very high-performance integrated FME and mode decision design for H.264/AVC. Experimental results show that our design is 30% smaller and has 18 times higher throughput than a state-of-the-art previous work, at the expense of only a 0.05 dB difference in PSNR. This design is well suited to high-end real-time applications.
In the future, we would like to extend the architecture to multi-frame systems and explore the possibility of resource sharing across frames. For applications that require more than 909k MB/s throughput, multiple copies of the design can be used, with partial sharing in the MD part.
