Abstract-This paper presents a fast multi-reference frame integer motion estimator for H.264/AVC. The proposed system is based on our previously proposed fast multi-reference frame algorithm, which executes full search area motion estimation in reference frames 0 and 1. The search areas of motion estimation in reference frames 2, 3 and 4 are then minimized by using a linear relationship between the motion vector and the distances from the current frame to the reference frames. For hardware implementation, the modified algorithm optimizes the search area size, removes the overlapping search area and modifies a division equation. Because the search area is reduced, the amount of computation is reduced by 58.7%. In experimental results, the modified algorithm shows a bit-rate increase of 0.36% when compared with the five reference frame standard. A pipeline structure and a memory controller are also adopted for real-time video encoding. The proposed system is implemented using 0.13 um CMOS technology, and the gate count is 1089K with 6.50 KB of internal SRAM. It can encode a Full HD video (1920x1080P@30Hz) in real time at a 135 MHz clock speed with 5 reference frames.
I. INTRODUCTION
H.264/AVC is the latest video coding standard of the ITU-T Video Coding Experts Group (VCEG) and the ISO/IEC Moving Picture Experts Group (MPEG). This standard provides much higher compression than earlier standards such as MPEG-2 or MPEG-4. The higher coding efficiency comes from several new features such as the multi-reference frame, variable sub block size, quarter-pixel accuracy motion vector, intra prediction, integer transformation, adaptive in-loop deblocking filter and enhanced entropy coding methods [1]. Although the coding efficiency of H.264/AVC is much higher than that of previous standards, these new features increase the computational complexity and make it difficult to implement in mobile devices [2] [3] [4] [5].
Multi-reference frame motion estimation is one of the new features in H.264/AVC. It increases the performance of the video encoder by using up to five reference frames, but this feature greatly increases the amount of computation. The number of computations grows linearly with the number of reference frames, so H.264/AVC, which uses multi-reference frames, takes more time than earlier standards such as MPEG-2 or MPEG-4. In particular, the integer pixel motion estimation module is one of the most time-consuming blocks, consuming 74.25% of the execution time in the H.264/AVC encoder [6]. Because the number of computations required for multi-reference frame motion estimation is very large, implementing a real-time multi-reference integer motion estimator is difficult, and most of the previous integer motion estimators used only one or two reference frames [7] [8] [9] [10] [11].
In this paper we design a fast multi-reference frame integer motion estimator for H.264/AVC. For this work, we use and modify the fast multi-reference frame motion estimation algorithm that was proposed in our previous work [12]. The previously proposed algorithm reduces the search area by using a linear feature of object motion and does not use early termination by a threshold value. This feature gives us a fixed processing time schedule and a small hardware area, which makes hardware implementation easy. The rest of this paper is organized as follows: Section 2 introduces multi-reference frame motion estimation and the previously proposed algorithm. In Section 3, we modify the previously proposed algorithm for hardware implementation. Section 4 provides the experimental results. In Section 5, we design the fast multi-reference integer motion estimator for H.264/AVC. Section 6 provides the implementation results and a comparison of the proposed system with previous systems. Finally, we conclude this paper in Section 7.
II. MULTI-REFERENCE FRAME MOTION ESTIMATION AND PREVIOUS FAST ALGORITHMS

The Multi-Reference Frame Motion Estimation
In the latest video standard, H.264/AVC, several features are added to the motion estimation module, and they make the amount of computation required for motion estimation huge. In particular, multi-reference frame motion estimation, which uses up to five reference frames, provides highly efficient coding, but requires a great deal of computation [13, 14]. To investigate the number of computations required for multi-reference frame motion estimation, we measured the motion estimation time and bit-rate. Fig. 1 shows the relationship between the bit-rate and the processing time of motion estimation. The x-axis shows the number of reference frames, the dashed line shows the motion estimation time and the solid line shows the percentage increment of the bit-rate. This graph shows that the amount of computation for multi-reference frame motion estimation has a linear relationship with the number of reference frames. For this reason, motion estimation that uses five reference frames requires five times as much computation as single reference frame motion estimation. With five reference frames, the bit-rate is low but the motion estimation computation time is very large. Because of this computational complexity, the previous integer motion estimators use only one or two reference frames. To implement a multi-reference frame motion estimator, a fast multi-reference frame motion estimation algorithm is required, and several such algorithms have been developed.
Fast Multi-frame Motion Estimation Algorithm with Adaptive Search Strategies (FMASS) [16]
The algorithm of FMASS [16] uses two features of multi-reference frame motion estimation. The first feature is that the probability of selecting ref 3 and ref 4 is lower than 3%. The second feature is that ref 3 and ref 4 will be skipped.
FMASS uses a feature such that ref 3 and ref 4 will be more likely to be selected under conditions of vibration or smooth moving, but it uses threshold values. It does not provide any gain regarding implementation on hardware.
Fig. 1. Relationship between the number of reference frames, bit-rate and ME time.
Adaptive and Fast Multi-frame Selection Algorithm (AFMFSA) [17]
AFMFSA [17] consists of three skip methods. The first method uses a relationship between the current block and its neighboring blocks, which are located above and to the left side of the current block. If the current block is not located at object boundaries and its neighboring blocks have the same optimal reference frame, ref n, its ultimate reference frame has a very large probability (equal to 95%) of being ref n. The second method uses the magnitudes of the ref 0 AFMFSA decides whether to execute motion estimation for each reference frame at every reference frame. This gives non-fixed timing and introduces data dependency, which makes it difficult to implement a pipeline structure or a parallel structure.
Fast Reference Frame Selection Algorithm (FRFSA) [18]
FRFSA [18] uses a temporal correlation. It classifies the reference regions into four types using the reference frame index of the anchor block. The anchor block is the macro block that has the same position in the previous encoded frame. Region0 4 ).
To implement the motion estimator hardware for H.264/AVC, FRFSA requires additional memory to store the reference frame indices of the previously encoded frame. This increases the hardware size and cost.
The Previously Proposed Algorithm [12]
As explained in the previous subsections, the previous fast multi-reference frame motion estimation algorithms, such as FMASS [16], AFMFSA [17], and FRFSA [18], use threshold values, have a large data dependency, or require additional memory. These properties make hardware implementation difficult and provide no gains for it. To resolve these problems, we proposed a new fast multi-reference motion estimation algorithm. Our fast multi-reference algorithm uses a linear relationship between the motion vector and the distances from the current frame to the reference frames. To obtain the estimated motion vector, we find the center positions of the reduced search areas by using the motion vector of the previous reference frame and the Picture Order Counts (POC) of each frame. The center positions of the reduced search areas are obtained by using Eq. (1). In ref 2, ref 3, and ref 4, the search area of each frame consists of two reduced search areas, and the size of each reduced search area is 1/16 of the full search area. One of the two reduced search areas is determined by the motion vector in ref 0; the other is determined by the motion vector in ref 1. Consequently, the previously proposed algorithm searches the equivalent of 2.375 reference frames (2 + 3 x 2 x 1/16), and the number of computations is reduced by 52.5% as compared to the standard algorithm, which uses five reference frames. Because the algorithm does not use a threshold value, it provides fixed timing and gives large gains for hardware implementation, such as easy time scheduling and reduced calculations. In addition, there is no data dependency among ref 2, ref 3, and ref 4. Therefore, a pipeline structure or a parallel structure can be implemented.
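As a minimal sketch of the linear-motion idea behind Eq. (1), the snippet below scales a motion vector found in ref 0 or ref 1 by the ratio of POC distances to estimate the center of a reduced search area in a farther reference frame. The function name, the rounding, and the exact form of the scaling are assumptions made for illustration; they are not taken from Eq. (1) itself.

```python
def estimate_center(mv, poc_cur, poc_src, poc_dst):
    """Scale mv, found in the frame with POC poc_src, to the frame with POC poc_dst.

    Assumes constant-velocity (linear) motion, so the displacement is
    proportional to the temporal distance. Rounding to the nearest
    integer pel is an assumption of this sketch.
    """
    d_src = poc_cur - poc_src  # distance to the frame where mv was found
    d_dst = poc_cur - poc_dst  # distance to the frame whose center is wanted
    if d_src == 0:
        raise ValueError("source reference frame must differ from the current frame")
    return tuple(round(c * d_dst / d_src) for c in mv)


# Hypothetical POC values: current frame 10, ref 0 at 8, ref 1 at 6, ref 2 at 4.
center_from_ref0 = estimate_center((3, -2), poc_cur=10, poc_src=8, poc_dst=4)
center_from_ref1 = estimate_center((6, -4), poc_cur=10, poc_src=6, poc_dst=4)
print(center_from_ref0, center_from_ref1)
```

When the motion is truly linear, both estimates coincide, which is the case that the overlap handling in Section 3 targets.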
III. ALGORITHM MODIFICATION
To design hardware for the previously proposed fast multi-reference frame motion estimation algorithm, some modifications of the algorithm are required. The first modification concerns the division in Eq. (1), which solves for the center positions of the reduced search areas in ref 2, ref 3, and ref 4. The second modification is the optimization of the reduced search area size. Because the full search area size is increased to 64x64, the reduced search area size has to be optimized again. The third modification concerns the overlap of the reduced search areas in ref 2, ref 3, and ref 4. The overlap of the search areas creates overhead in terms of computation and memory bandwidth. We modify the previously proposed algorithm to solve these three problems.
The Modification of the Central Point Solving Equation
The previously proposed fast multi-reference frame motion estimation algorithm uses Eq. (1) to solve for the reduced search areas in ref 2, ref 3, and ref 4. Eq. (1) contains a division operation, which makes hardware implementation difficult. For this reason, Eq. (1) has to be modified.
The H.264/AVC standard uses the temporal direct mode. The temporal direct mode predicts the reference frame index and motion vector by using the motion information of the anchor block, which is the block at the same position in the previously encoded frame. One reference frame of the current block is the same as that of the anchor block, and the other reference frame is the anchor frame. The motion vectors of the current block are obtained by using the motion vectors and time distances of each frame. Fig. 3 shows the temporal direct mode in the H.264/AVC standard. MV L0 is the L0 motion vector of the current block, and MV L1 is the L1 motion vector of the current block. MV Col is the motion vector of the anchor block. tb is the distance from the reference frame to the current frame. td is the distance from the reference frame to the anchor frame. MV L0 and MV L1 are obtained by using MV Col, tb, td, and Eq. (2).
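For illustration, the temporal direct scaling can be sketched in fixed-point form. The code below follows the derivation as we recall it from the H.264/AVC specification (MV L0 is approximately tb/td times MV Col, and MV L1 = MV L0 - MV Col); it is a sketch and is not claimed to reproduce Eq. (2) literally. The helper names are hypothetical.

```python
def clip3(lo, hi, x):
    """Clamp x to the range [lo, hi]."""
    return max(lo, min(hi, x))


def div_trunc(a, b):
    """Integer division truncating toward zero, as used in the standard."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def temporal_direct_mv(mv_col, tb, td):
    """Return (MV_L0, MV_L1) derived from the co-located vector mv_col.

    tb and td are POC distances; td must be non-zero.
    """
    tb = clip3(-128, 127, tb)
    td = clip3(-128, 127, td)
    tx = div_trunc(16384 + abs(div_trunc(td, 2)), td)  # about 2^14 / td
    dist_scale = clip3(-1024, 1023, (tb * tx + 32) >> 6)  # about 256 * tb / td
    mv_l0 = tuple((dist_scale * c + 128) >> 8 for c in mv_col)
    mv_l1 = tuple(a - c for a, c in zip(mv_l0, mv_col))
    return mv_l0, mv_l1


# Anchor vector (8, -4) with the current frame halfway between the two references.
print(temporal_direct_mv((8, -4), tb=1, td=2))
```

The only true division is the one that produces tx, and tx depends on td alone, which is what makes a lookup table a natural replacement for the divider.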
We can show that Eq. (2) 
The Optimization of the Reduced Search Area Sizes
In previous work [12], the size of the full search area was 32x32. In this work, the full search area size is increased to 64x64 for higher performance of the designed motion estimator. Therefore, the reduced search area size has to be adjusted. To find the optimum size of the reduced search areas, the performance of 8x8 reduced search areas is compared with that of 16x16 reduced search areas. Table 1 displays the differences in PSNR and bit-rate of 8x8 reduced search areas when compared with those of 16x16 reduced search areas. The quantization parameter is 25. Simulation results show that the PSNR drop is only 0.006 dB and that the bit-rate is actually reduced by 0.04%. In other words, the performance drop is negligible when the size of the reduced search areas is decreased. For this reason, the reduced search areas are decreased to 8x8, which is 1/64 of the 64x64 full search area.
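The effect of the reduced search area size on the workload can be checked with a short calculation. The sketch below counts the equivalent number of fully searched reference frames, assuming two full search areas (ref 0 and ref 1) and two reduced search areas in each of ref 2, ref 3, and ref 4, before any overlap removal. The function name is hypothetical.

```python
from fractions import Fraction


def equivalent_reference_frames(reduced_ratio, full_frames=2,
                                reduced_frames=3, areas_per_frame=2):
    """Workload expressed as a number of fully searched reference frames."""
    return full_frames + reduced_frames * areas_per_frame * reduced_ratio


for label, ratio in (("1/16", Fraction(1, 16)), ("1/64", Fraction(1, 64))):
    frames = equivalent_reference_frames(ratio)
    saving_percent = (1 - frames / 5) * 100  # relative to five full searches
    print(label, float(frames), float(saving_percent))
```

With a ratio of 1/16 this gives 2.375 frames and a 52.5% reduction; with 1/64 it gives 2.09375 frames, which rounds to the 2.0938 used in the pipeline analysis.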
The Removal of Overlap Operations
The previously proposed fast multi-reference frame motion estimation algorithm decreases the number of computations by reducing the search areas in ref 2, ref 3, and ref 4. The search areas of ref 0 and ref 1 are full-sized search areas. In ref 2, ref 3, and ref 4, the search area of each frame consists of two reduced search areas, and the size of each reduced search area is 1/64 of the full search area. Because there are two search areas in each of ref 2, ref 3, and ref 4, the search areas can overlap. Table 2 shows the proportion of cases in which CP n,0 is the same as CP n,1.
Simulation results show that the proportion of cases in which CP n,0 is the same as CP n,1 in ref 2, ref 3 and ref 4 is more than 38%. This result indicates that overlapped search areas must not be neglected. Overlapped search areas cause unnecessary motion estimation computation, which greatly increases the power consumption of the motion estimation system. Power consumption is very important when designing a system for mobile devices. For this reason, although the hardware time scheduling becomes more complex, we modify the previously proposed fast multi-reference frame motion estimation algorithm to reduce the number of computations by eliminating overlapping search areas. Fig. 4 shows the overlapping search area and the reduced search areas in ref 2, ref 3, and ref 4 obtained by using MV 0 and MV 1. ref n,i is the reduced search area of ref n determined by MV i. If CP n,0 is the same as CP n,1, the motion estimation of the search area determined by MV 1 is skipped. This avoids unnecessary memory bandwidth. If CP n,0 is not the same as CP n,1 and the absolute values of both dx and dy are smaller than R, the motion estimation of the overlapping search points is skipped in order to remove the redundant computation. The difference between the maximum and minimum amounts of computation is much smaller than that of the previous fast algorithms. For this reason, the modified algorithm is more advantageous than the previous fast algorithms when designing a hardware architecture. The proposed modified multi-reference frame motion estimation algorithm is described as follows:
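As a rough illustration of the overlap-skipping rule only, the sketch below enumerates the search points of two 8x8 reduced search areas and drops the points of the second area that already lie in the first. It assumes that dx and dy denote the offsets between the two center positions and that R is the side length of a reduced search area; these interpretations, and all names, are assumptions of this sketch.

```python
def area_points(center, size=8):
    """All integer search points of a size x size area around center."""
    cx, cy = center
    half = size // 2
    return [(cx + ox, cy + oy)
            for oy in range(-half, half)
            for ox in range(-half, half)]


def in_area(point, center, size=8):
    """True if point lies inside the area around center."""
    half = size // 2
    return (-half <= point[0] - center[0] < half
            and -half <= point[1] - center[1] < half)


def search_points(cp0, cp1, size=8):
    """Points to evaluate in one of ref 2, ref 3, ref 4 after overlap removal."""
    points = area_points(cp0, size)
    if cp0 == cp1:
        return points  # the area given by MV 1 is skipped entirely
    # |dx| < R and |dy| < R means the areas overlap; skip the shared points.
    points += [p for p in area_points(cp1, size) if not in_area(p, cp0, size)]
    return points


print(len(search_points((0, 0), (0, 0))),    # identical centers
      len(search_points((0, 0), (4, 0))),    # partial overlap
      len(search_points((0, 0), (20, 20))))  # no overlap
```

The number of evaluated points stays between 64 and 128 per reference frame, so the worst case remains bounded, in line with the small spread between maximum and minimum computation noted above.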
IV. SIMULATION RESULTS
In Section 3, we modified the previously proposed fast multi-reference motion estimation algorithm. To compare the modified algorithm with the previously proposed algorithm and the previous fast algorithms, we simulate sequences of various sizes. We use the JM v9.6 reference software. The quantization parameter Table 3 shows the simulation results for the modified algorithm, the previous fast algorithms, and the standard, which uses five reference frames. △PSNR and △Bit refer to the differences in PSNR and bit-rate when compared with the standard, which uses five reference frames. According to these results, the maximum PSNR drop of FMASS is 0.094 dB, and the maximum bit-rate increment is 7.92%. The number of reference frames is reduced by a maximum of 1.986, but reduced only by 1.475 in the extreme case. The maximum PSNR drop of AFMFSA is 0.182 dB, and the maximum bit-rate increment is 13.16%. In particular, it has low performance in the 'Akiyo' QCIF sequence and the 'Mobcal' 720p sequence. AFMFSA uses 1.211 reference frames in the 'Parkrun' 720p sequence and uses 2.938 reference frames in the 'Foreman' QCIF sequence.
The maximum bit-rate increment of FRFSA is 20.01% and the maximum PSNR drop is 0.248 dB. In particular, it has low performance in the 'Mobcal' 720p sequence. The bit-rate increment of the modified algorithm is less than 2% in every sequence, and the PSNR drop is at most 0.046 dB, which is much smaller than those of the previous fast algorithms. Fig. 5 compares the rate-distortion curves of various video sequences. The 720p-size sample sequences are 'Mobcal' and 'Shields,' the CIF-size sample sequence is 'Mobile,' and the QCIF-size sample sequence is 'Container'. The performance of the modified algorithm is much better than that of the previous fast algorithms. To compare performance over various quantization parameters, we use the rate-distortion curves, the Bjontegaard Delta PSNR (BDPSNR) and the Bjontegaard Delta Bit-rate (BD-bit-rate) [19]. BDPSNR and BD-bit-rate express performance by combining the changes in PSNR and bit-rate. We simulate the same sample sequences in order to obtain BDPSNR and BD-bit-rate. The quantization parameter values are 20, 25, 30, 35, 40, 45, and 50. The other simulation environments are the same as previously described. Tables 4 and 5 present the differences in BDPSNR and BD-bit-rate when compared with the standard, which uses five reference frames. The results show that the modified algorithm always performs better than the previous algorithms, such as FMASS, AFMFSA, and FRFSA. In particular, the BDPSNR drop of the modified algorithm is always less than 0.035 dB when compared with the standard. This means that the performance of the modified algorithm is almost the same as that of the standard.
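For readers unfamiliar with the metric, BDPSNR can be sketched as follows: fit a cubic polynomial to PSNR as a function of log10(bit-rate) for each encoder, integrate both fits over the common bit-rate range, and report the difference of the averages. The pure-Python least-squares fit and the sample points below are illustrative assumptions; they are not the tool or the data used for Tables 4 and 5.

```python
import math


def polyfit3(xs, ys):
    """Least-squares cubic fit; returns coefficients, lowest degree first."""
    n = 4
    a = [[sum(x ** (i + j) for x in xs) for j in range(n)] for i in range(n)]
    b = [sum(y * x ** i for x, y in zip(xs, ys)) for i in range(n)]
    for col in range(n):  # Gaussian elimination with partial pivoting
        piv = max(range(col, n), key=lambda r: abs(a[r][col]))
        a[col], a[piv] = a[piv], a[col]
        b[col], b[piv] = b[piv], b[col]
        for r in range(col + 1, n):
            f = a[r][col] / a[col][col]
            for c in range(col, n):
                a[r][c] -= f * a[col][c]
            b[r] -= f * b[col]
    coef = [0.0] * n
    for i in reversed(range(n)):  # back substitution
        s = b[i] - sum(a[i][j] * coef[j] for j in range(i + 1, n))
        coef[i] = s / a[i][i]
    return coef


def integral(coef, lo, hi):
    """Definite integral of the polynomial between lo and hi."""
    def antiderivative(x):
        return sum(c * x ** (k + 1) / (k + 1) for k, c in enumerate(coef))
    return antiderivative(hi) - antiderivative(lo)


def bd_psnr(rate_ref, psnr_ref, rate_test, psnr_test):
    """Average PSNR difference (test minus reference) in dB."""
    lr_ref = [math.log10(r) for r in rate_ref]
    lr_test = [math.log10(r) for r in rate_test]
    x0 = sum(lr_ref + lr_test) / (len(lr_ref) + len(lr_test))  # centering
    xr = [x - x0 for x in lr_ref]
    xt = [x - x0 for x in lr_test]
    lo, hi = max(min(xr), min(xt)), min(max(xr), max(xt))
    diff = (integral(polyfit3(xt, psnr_test), lo, hi)
            - integral(polyfit3(xr, psnr_ref), lo, hi))
    return diff / (hi - lo)


# Hypothetical rate-distortion points (kbps, dB), for illustration only.
rates = [400, 800, 1600, 3200]
ref = [30.0, 33.0, 35.5, 37.5]
test = [p - 0.03 for p in ref]
print(round(bd_psnr(rates, ref, rates, test), 4))
```

A negative value means the tested encoder loses quality at equal bit-rate; BD-bit-rate is obtained in the same way with the roles of the two axes exchanged.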
V. ARCHITECTURE DESIGN
We designed an architecture that uses the modified fast multi-reference frame motion estimation algorithm for the H.264/AVC integer motion estimation system. As shown in the results of Section 3 above, we modified the previously proposed algorithm to design a hardware system.
Previous Systems
Huang's system [9] uses five reference frames, but Chen's system [8], Youn's system [10] and Kao's system [11] use only one or two reference frames and require additional control to use the modified algorithm. Huang's system uses only 16x16 or 8x8 block sizes. Chen's system uses a four step search algorithm. Youn's system uses downsampling and pixel truncation, which reduce the hardware complexity but cause a large performance drop. To compare the modified algorithm with the previous systems, we built software simulators of the previous systems. All simulators are based on the JM v9.6 reference software. Tables 6 and 7 present the differences in BDPSNR and BD-bit-rate when compared with the standard, which uses five reference frames and the 64x64 search area. The EX-Youn's system is an extended version of Youn's system that uses 2 reference frames. The simulation environments are the same as previously described. The results show that the proposed system, which uses the modified algorithm, usually performs better than the other systems.
Fig. 5. Rate-distortion curves, PSNR (dB) versus bit-rate (kbps), with panels for 'Shields' 720p and 'Mobile' CIF, comparing the standard, FMASS [16], AFMFSA [17], FRFSA [18], and the modified algorithm.
The Overall Architecture
The current macro block memory stores the current macro block pixels, which occupy 256 (16x16) bytes. Because the search area of the proposed system is 64x64, the size of one line memory is 79 bytes (64 + 16 - 1) and the number of line memories is 79.
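The memory sizes quoted for the architecture can be reproduced with a short calculation. The sketch below assumes one byte per pixel and that a 64x64 search area means 64 candidate positions per axis for a 16x16 macro block; the function name is hypothetical.

```python
def search_window_side(search_range=64, mb_size=16):
    """Pixels per axis needed to cover every candidate block position."""
    return search_range + mb_size - 1


side = search_window_side()            # bytes per line memory, and line count
search_memory_bytes = side * side      # all line memories together
current_mb_bytes = 16 * 16             # current macro block memory
print(side, search_memory_bytes, search_memory_bytes + current_mb_bytes)
```

Under these assumptions the two memories add up to 6497 bytes, which appears consistent with the 6.50 KB of internal SRAM reported for the design.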
The Memory Read Controller
To minimize the memory read time, we propose a processing area shaped as a vertically long rectangle. The processing area is the region of the search area that is calculated immediately. To reduce the external memory bandwidth, the motion estimation of the two reduced search areas is performed at the same time in ref 2, ref 3 and ref 4. Fig. 7 shows the scan order of the search area memories. The scan direction of the reduced search area is only (a), but that of the full search area can be (a), (b) and (c). If the processing area is a horizontally long rectangle, as in most cases, the size of the processing area is 31x16. In this case, each line memory has to output only one byte per clock cycle when the scan order is direction (a) or (c), but the last search area line memory has to output 31 bytes per clock cycle when the scan order is direction (b). Most CMOS technologies do not support an SRAM with a bit-width of 31 bytes. Our system instead uses the vertically long rectangular read area. With it, each line memory also has to output only 1 byte per clock cycle when the scan order is direction (a) or (c), and the last search area line memory has to output only 16 bytes per clock cycle when the scan order is direction (b).
Processing Units (PU)
The PU calculates the 4x4 unit SAD values of the 16x16 macro block. One PU calculates sixteen 4x4 unit SAD values between the current block pixels and the reference block pixels. For real-time Full HD video encoding, we use 16 PUs in the proposed integer motion estimation system.
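What one PU computes for a single candidate position can be sketched in software as follows. The sketch assumes that the current and reference blocks are given as 16x16 lists of pixel values; the function name is hypothetical, and the hardware evaluates these sums in parallel rather than in loops.

```python
def sad_4x4_grid(cur, ref):
    """Return a 4x4 grid holding the SAD of each 4x4 sub-block.

    cur and ref are 16x16 lists of integer pixel values.
    """
    return [[sum(abs(cur[4 * by + y][4 * bx + x] - ref[4 * by + y][4 * bx + x])
                 for y in range(4) for x in range(4))
             for bx in range(4)]
            for by in range(4)]


# Flat test blocks: every pixel differs by 3, so each 4x4 SAD is 16 * 3.
cur_block = [[10] * 16 for _ in range(16)]
ref_block = [[7] * 16 for _ in range(16)]
grid = sad_4x4_grid(cur_block, ref_block)
print(grid[0][0], len(grid), len(grid[0]))
```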
SAD Summation Blocks and the Comparison Block
The SAD summation blocks calculate the SAD values of each mode by using the 4x4 unit SAD values, which are calculated in the PUs. The inputs of one SAD summation block are sixteen 4x4 SAD values, and its outputs are 41 SAD values for the seven block sizes in H.264/AVC: one 16x16, two 16x8, two 8x16, four 8x8, eight 8x4, eight 4x8, and sixteen 4x4. For real-time Full HD video encoding, we use 16 SAD summation blocks in the proposed integer motion estimation system.
The comparison block calculates the cost value of each mode and determines the final mode, which has the minimum cost value. The comparison block consists of 41 cost calculators and 41 comparison units.
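The way sixteen 4x4 SAD values yield the 41 SAD values can be sketched as below. Block sizes are written as width x height, and each partition is described by its extent in 4x4 units; the dictionary layout and the function name are assumptions of this sketch.

```python
# Partition name (width x height in pixels) -> (height, width) in 4x4 units.
PARTITIONS = {
    "16x16": (4, 4), "16x8": (2, 4), "8x16": (4, 2), "8x8": (2, 2),
    "8x4": (1, 2), "4x8": (2, 1), "4x4": (1, 1),
}


def merge_sads(sad4):
    """Combine a 4x4 grid of 4x4-block SADs into the SADs of every partition."""
    out = {}
    for name, (h, w) in PARTITIONS.items():
        out[name] = [sum(sad4[y][x]
                         for y in range(y0, y0 + h)
                         for x in range(x0, x0 + w))
                     for y0 in range(0, 4, h)
                     for x0 in range(0, 4, w)]
    return out


sad4 = [[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12], [13, 14, 15, 16]]
merged = merge_sads(sad4)
print(sum(len(v) for v in merged.values()), merged["16x16"], merged["16x8"])
```

Reusing the 4x4 sums in this way means that the pixel-level absolute differences are computed only once per candidate position, regardless of how many block sizes are evaluated.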
The Center Point Generator
To design the center point generator, we modify Eq. (1) in Section 2. The modified Eq. (4) can be implemented by using a ROM, several multipliers, and adders. Additionally, because Eq. (4) is similar to Eq. (3), the proposed reduced search area center position generator can be shared with the temporal direct mode predictor. Fig. 8 shows the proposed search area center position generator. It consists of two multipliers, three adders, and a ROM with a depth of 128 and a width of 15 bits. It is also designed to solve for the motion vector in the temporal direct mode.
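One plausible reading of the division-free datapath is a reciprocal lookup followed by multiplications and shifts. The sketch below is not Eq. (4) itself; it assumes a ROM holding tx = (16384 + td/2)/td for td from 1 to 127, modeled on the fixed-point scaling of the H.264/AVC temporal direct mode, and it handles only positive distances. All names are hypothetical.

```python
def build_reciprocal_rom(depth=128):
    """Table of rounded 2^14 / td values, indexed directly by td.

    Entry 0 is a placeholder, since a distance of zero never occurs.
    """
    return [0] + [(16384 + td // 2) // td for td in range(1, depth)]


ROM = build_reciprocal_rom()


def scale_component(mv, tb, td, rom=ROM):
    """Approximate mv * tb / td using one ROM read, two multiplies and shifts."""
    dist_scale = (tb * rom[td] + 32) >> 6   # about 256 * tb / td
    return (dist_scale * mv + 128) >> 8     # rounded product, scaled back


print(len(ROM), max(ROM).bit_length(),
      scale_component(8, 3, 1), scale_component(-6, 5, 2))
```

Under this assumption the table has 128 entries and its largest value, 16384, needs 15 bits, which matches the ROM dimensions given for the center point generator.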
The Pipeline Process
For the pipeline process, the proposed system is divided into six stages: memory reading, PU, SAD summation 1, SAD summation 2, solving cost 1 and solving cost 2. All stages are designed to take one clock cycle. The CU uses two clock cycles when the calculations of the whole search area are finished. Fig. 9 shows the pipeline process of the proposed system. It requires 263 clock cycles to process the motion estimation of one macro block in one reference frame. Because the maximum equivalent number of reference frames of the proposed algorithm is 2.0938, motion estimation with the modified algorithm uses 550 clock cycles per macro block. The minimum clock frequency required for Full HD (1920x1080P@30Hz) video encoding is 135 MHz.
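The clock requirement follows from the cycle count per macro block. The sketch below assumes that the frame height is padded to a multiple of 16 (1088 lines for 1080p), which gives 8160 macro blocks per frame; the function names are hypothetical.

```python
import math


def macroblocks_per_frame(width, height, mb=16):
    """Number of 16x16 macro blocks, with partial blocks rounded up."""
    return math.ceil(width / mb) * math.ceil(height / mb)


def required_clock_hz(width, height, fps, cycles_per_mb):
    """Clock frequency needed to process every macro block in real time."""
    return macroblocks_per_frame(width, height) * fps * cycles_per_mb


cycles_exact = 263 * (2 + 3 * 2 / 64)  # 263 cycles x 2.09375 equivalent frames
clock = required_clock_hz(1920, 1080, 30, 550)
print(cycles_exact, macroblocks_per_frame(1920, 1080), clock / 1e6)
```

With 550 cycles per macro block this gives 134.64 MHz, in line with the 135 MHz operating frequency.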
VI. SYNTHESIS RESULTS
The proposed system is implemented in Verilog HDL and synthesized by using 0.13 um CMOS technology. In ref 2, ref 3, and ref 4, the search area of each frame consists of two reduced search areas, and the size of each reduced search area is 8x8. The search algorithm in each search area is a full search algorithm. The operating frequency is 135 MHz when encoding a Full HD (1920x1080p@30Hz) video. Table 9 compares the proposed system with previous integer motion estimation systems. The performance of Youn's system with 5 reference frames is almost the same as that of the standard with 5 reference frames. However, when the number of reference frames is extended to 5 without increasing the operating clock, the controller is the only hardware resource that can be shared. Therefore, the gate count of Youn's system with 5 reference frames is 4 times more than that of Youn's system with 1 reference frame. Otherwise, the
Operating Frequency: 135 MHz
Search Area: 64x64
Reference Frames: 5
