Fractional motion estimation (FME) is widely used in video compression standards. In H.264/AVC, motion vector precision extends down to quarter pixels to improve coding efficiency. However, FME accounts for over 45% of the computational complexity of an H.264 encoder, and this high complexity limits the processing capability. In this paper, a single-iteration full-search FME is proposed. Through algorithm and architecture co-optimization, the bandwidth to the frame buffer is reduced by 31%. Furthermore, 82% of the circuit area for the Hadamard transform and subtraction is saved compared with a direct implementation. Compared with prior art, the proposed design supports 3.39× higher throughput with only a 0.02 dB PSNR drop. Thus, the specification of 4096×2160 quad full high definition (QFHD) H.264/AVC FME processing can be achieved.
INTRODUCTION
Sub-pixel motion-compensated prediction is widely used in current video compression standards. By interpolating the reference frame and performing a fractional motion estimation (FME) search, the rate-distortion (R-D) performance can be significantly improved. Different video coding standards use different finite impulse response (FIR) filters to generate the sub-pixel values. For example, H.264/AVC uses a 6-tap filter with coefficients (1, -5, 20, 20, -5, 1) for half-pixel samples, and MPEG-4 requires an 8-tap filter with coefficients (-1, 3, -6, 20, 20, -6, 3, -1). In H.264/AVC, the motion vector (MV) precision extends further down to quarter pixels, and a 2-tap filter is used to generate the quarter-pixel samples. Because there are two different types of sub-pixel interpolation, a 2-iteration search is often used in H.264/AVC FME. In the 2-iteration search, the half-pixel search is performed first on the 8 candidates around the best-matching integer MV. After the best half-pixel candidate is found, the quarter pixels are generated by the bilinear filter and the search is performed on the 8 quarter-pixel candidates around the best half-pixel candidate. With FME at quarter-pixel accuracy, the R-D performance of H.264/AVC improves by about 2 dB compared with encoding without it. Therefore, the higher quality needed by large video formats such as quad full HD (QFHD) can be provided.
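The half-then-quarter refinement described above can be sketched as follows. This is a minimal illustration rather than a reference implementation: the function name `two_iteration_search`, the representation of MVs in quarter-pixel units, and the caller-supplied `cost` function are all assumptions made for the example.

```python
def two_iteration_search(cost, center=(0, 0)):
    """Half-then-quarter refinement around the best integer MV.

    MVs are (x, y) tuples in quarter-pixel units (hypothetical convention).
    `cost` is any caller-supplied function mapping an MV to a matching cost.
    Returns the best MV and the number of candidates evaluated.
    """
    best_mv, best_cost = center, cost(center)
    evaluated = 1
    for step in (2, 1):  # 2 = half-pixel step, 1 = quarter-pixel step
        origin = best_mv  # each iteration searches around the previous winner
        for dy in (-1, 0, 1):
            for dx in (-1, 0, 1):
                if dx == 0 and dy == 0:
                    continue  # the origin was already evaluated
                mv = (origin[0] + dx * step, origin[1] + dy * step)
                c = cost(mv)
                evaluated += 1
                if c < best_cost:
                    best_mv, best_cost = mv, c
    return best_mv, evaluated


# Toy cost: distance to a made-up true displacement of (-3/4, +2/4) pixel.
toy_cost = lambda mv: abs(mv[0] + 3) + abs(mv[1] - 2)
best, count = two_iteration_search(toy_cost)
```

The second iteration depends on the winner of the first, so the two passes cannot run in parallel, and the candidate count is 1 + 8 + 8 = 17.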
However, FME accounts for over 45% of the computational complexity of the H.264 encoding process [1]. Such a heavy computational load makes FME a critical part even in a hardware-accelerated H.264 encoder. For instance, at QFHD resolution, fewer than 350 cycles are available for each macroblock (MB) even if the operating frequency is higher than 300 MHz. Therefore, for QFHD resolution, the throughput should be at least three times higher than that of the prior art in [2]. To provide higher processing capability, a single-iteration full-search algorithm and architecture are proposed in this paper. Bandwidth optimization reduces the bandwidth for accessing the reference frame buffer by 31%. Then, by simplifying the Hadamard transform, the single-iteration full-search algorithm is achieved. Compared with a direct implementation, 82% of the area of the transformation and difference circuits is saved. As a result, the proposed design supports 3.39× higher throughput than the work in [2], with only a 0.02 dB quality drop in R-D performance compared with the original 2-iteration search.
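The per-macroblock cycle budget mentioned above can be checked with a short calculation. The sketch below assumes 16×16 macroblocks and a frame rate of 30 frames per second; the frame rate is not stated in the text and is chosen here only for illustration, as is the function name `cycles_per_macroblock`.

```python
def cycles_per_macroblock(width, height, fps, clock_hz, mb_size=16):
    """Clock cycles available per macroblock for real-time processing."""
    # Round partial macroblocks up (ceiling division on each dimension).
    mbs_wide = -(-width // mb_size)
    mbs_high = -(-height // mb_size)
    mbs_per_frame = mbs_wide * mbs_high
    return clock_hz / (mbs_per_frame * fps), mbs_per_frame


# Assumed operating point: QFHD, 30 frames/s (assumption), 300 MHz clock.
budget, mbs_per_frame = cycles_per_macroblock(4096, 2160, fps=30, clock_hz=300e6)
```

Under these assumptions a QFHD frame holds 256 × 135 = 34560 macroblocks, and the budget comes out below the 350-cycle figure quoted in the text.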
The remainder of this paper is organized as follows. Section 2 describes the preliminaries and the problem statement. Section 3 introduces the proposed single-iteration full-search algorithm and architecture, and Section 4 gives the simulation results. Finally, Section 5 summarizes this paper.
PROBLEM STATEMENT
In H.264/AVC, the MV precision extends down to quarter pixels. Many previous works have proposed hardware solutions for FME in HDTV specifications [2][3][4][5]. Their contributions can be roughly classified into two approaches. One approach eases the computational complexity by reducing the number of search candidates. For example, the number of search candidates is reduced to only 6 in [2]. However, such improvements are limited, since the original algorithm needs only 17 search candidates. Furthermore, the required access pattern of the search range remains the same even though the number of search candidates is reduced. The other approach modifies the search algorithm to enable a single-iteration search instead of the original half-then-quarter 2-iteration search. In [4][5], single-iteration search is achieved by shrinking the search range to 5×5 quarter pixels. Nevertheless, the access pattern of the search range is determined by the 6-tap filter of the half-pixel interpolation, so the overall window remains the same. Since the throughput is affected by both the processing time and the data loading time, it is also limited by the access pattern and the corresponding memory bandwidth and latency.
PROPOSED SINGLE-ITERATION FULL-SEARCH FME SCHEME
Bandwidth Reduction
In H.264/AVC, the sub-pixel interpolation algorithm for motion compensation (MC) is defined by the coding standard. However, how the fractional MV is produced, including the interpolation scheme on the reference frame and the FME search algorithm, can be decided by the designer. Since the bottleneck in FME comes mainly from the access requirement of the half pixels, modifying the filter setting is the most straightforward way to relieve it. In MC, the half-pixel interpolation is done by a 6-tap filter with coefficients (1, -5, 20, 20, -5, 1). With more FIR taps, a wider region is needed to generate the fractional pixels. Moreover, the access pattern is determined by the FIR filter used for the half pixels. Thus, the access pattern and the corresponding bandwidth requirement cannot be reduced even if the search range is limited to within 5×5 quarter pixels [4][5]. If this 6-tap filter is replaced by a 2-tap filter with coefficients (1, 1), the supporting region for an 8×8 block is reduced from 14×14 to 10×10, as shown in Fig. 1. With this simplification, the area of the supporting region and the corresponding memory bandwidth decrease by 31% for an 8×8 block.
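The window sizes quoted above follow from how far a symmetric FIR filter reaches beyond the block. The sketch below covers only the side length of the supporting region. It assumes that half-pixel samples are needed just outside both block edges and that a T-tap filter reads T/2 integer pixels on each side of a half-pixel position. The function name `support_window_side` is introduced here for illustration.

```python
def support_window_side(block_size, taps):
    """Side length, in integer pixels, of the reference window for FME.

    Assumes half-pixel samples are needed at -0.5 and block_size - 0.5,
    i.e. just outside both edges of the block, and that a symmetric
    `taps`-tap filter reads taps // 2 integer pixels on each side of a
    half-pixel position.
    """
    reach = taps // 2  # integer pixels beyond each block edge
    return block_size + 2 * reach


six_tap_side = support_window_side(8, taps=6)  # window for the 6-tap filter
two_tap_side = support_window_side(8, taps=2)  # window for the 2-tap filter
```

For an 8×8 block this gives side lengths of 14 and 10, matching the 14×14 and 10×10 regions in the text.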
Algorithm Optimization
In FME, the sum of absolute transformed difference (SATD) is commonly used as the cost function. Since the Hadamard transform is a linear operator, it follows the distributive law:
\[ H(A + B) = H(A) + H(B) \tag{1} \]
Here, H denotes the Hadamard transform operator, and A and B are pixel arrays. Based on the distributive law, the residue of a quarter-pixel FME candidate can be estimated once the half-pixel residues are known. For example, candidate G in Fig. 2 is a quarter-pixel candidate generated by averaging the half pixels C and D. Let U denote the pixels of the current MB. The residue on candidate G can be computed by:
\[ H(G - U) = H\left(\frac{C + D}{2} - U\right) = H\left(\frac{(C - U) + (D - U)}{2}\right) = \frac{H(C - U) + H(D - U)}{2} \tag{2} \]
The above equation shows that the transformed residue on candidate G equals the average of the transformed residues on candidates C and D. Therefore, once the residues on the half pixels are computed, the residues on the quarter pixels between them are also known. In the proposed single-iteration full-search FME, the bilinear filter rather than the 6-tap filter is used for half-pixel interpolation. In addition, since all the non-integer locations are bilinearly interpolated, this linearity enables further simplification. That is, the residues on not only the quarter pixels but also the half pixels can be directly computed by Eq. (2). Thus, single-iteration FME can be achieved. Furthermore, the search range need not be limited. As Fig. 3 shows, in the original algorithm the half pixels are 6-tap filtered and the quarter pixels are linearly interpolated, so the Hadamard distributive property can only be applied to the quarter pixels. When the half pixels are 2-tap filtered, bilinear interpolation yields all the non-integer pixels, so the distributive property can be applied to all of them.
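The averaging property of Eq. (2) can be checked numerically. The sketch below uses a 4×4 unnormalized Hadamard matrix and made-up pixel blocks, and it assumes the quarter pixel is an exact average with no rounding. All names (`hadamard`, `sub`, `avg`) and the block contents are illustrative choices.

```python
H4 = [[1, 1, 1, 1],
      [1, 1, -1, -1],
      [1, -1, -1, 1],
      [1, -1, 1, -1]]


def hadamard(block):
    """Unnormalized 2-D 4x4 Hadamard transform, H * X * H^T."""
    tmp = [[sum(H4[i][k] * block[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
    return [[sum(tmp[i][k] * H4[j][k] for k in range(4)) for j in range(4)]
            for i in range(4)]


def sub(a, b):
    """Element-wise difference of two 4x4 blocks."""
    return [[a[r][c] - b[r][c] for c in range(4)] for r in range(4)]


def avg(a, b):
    """Element-wise average of two 4x4 blocks (no rounding, an assumption)."""
    return [[(a[r][c] + b[r][c]) / 2 for c in range(4)] for r in range(4)]


# Made-up half-pixel blocks C and D and current block U.
C = [[(3 * r + 5 * c) % 17 for c in range(4)] for r in range(4)]
D = [[(7 * r + 2 * c + 1) % 19 for c in range(4)] for r in range(4)]
U = [[(4 * r + 9 * c + 3) % 23 for c in range(4)] for r in range(4)]

G = avg(C, D)                                        # quarter-pixel candidate
lhs = hadamard(sub(G, U))                            # H(G - U)
rhs = avg(hadamard(sub(C, U)), hadamard(sub(D, U)))  # (H(C-U) + H(D-U)) / 2
```

Here `lhs` and `rhs` agree element by element. The equality holds for the transformed residues themselves; the absolute values that form the SATD are taken after the averaging, so the SATD of G is in general not the average of the SATDs of C and D.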
Architecture and Data Flow Optimization
According to Eq. (2), the Hadamard transform need not operate on every candidate. In the proposed algorithm, only the 9 integer-pixel candidates need the Hadamard transform and difference operation to generate the cost. The cost of the other candidates, including the half pixels and quarter pixels, can be computed by interpolating the cost on the integer pixels. The simplification in interpolation further leads to a simpler hardware architecture. Since the Hadamard transform is linear, it can be distributed, which saves redundant hardware resources. Figure 4 compares the original and optimized FME data paths. The full search needs 49 difference and transformation units in the original data flow. After the proposed data-flow rescheduling, the requirement drops to 9. Therefore, 82% of the area of the transformation and difference circuits is saved. Figure 5 shows the proposed FME architecture based on the single-iteration full-search scheme. The ME-oriented cache memory architecture proposed in the previous work [6] is used here. By prefetching the search area, accesses in the FME stage can be performed without bubble cycles. The luminance MC is not included here because it requires the 6-tap filter and must be handled separately.
Table 1. Average PSNR drop (dB) of different FME algorithms compared with the 6-tap filter.
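The reduction from 49 to 9 difference-and-transform operations can be illustrated in software. The sketch below is an assumption-laden model, not the chip's data path: it uses 4×4 blocks, a search range of ±3/4 pixel in each direction, exact bilinear weights without rounding, and made-up helper names. It transforms the residue once per integer candidate, blends the transformed residues for every quarter-pixel offset, and cross-checks against transforming each candidate directly.

```python
H4 = [[1, 1, 1, 1], [1, 1, -1, -1], [1, -1, -1, 1], [1, -1, 1, -1]]


def hadamard(block):
    """Unnormalized 2-D 4x4 Hadamard transform H * X * H (H4 is symmetric)."""
    tmp = [[sum(H4[i][k] * block[k][j] for k in range(4)) for j in range(4)]
           for i in range(4)]
    return [[sum(tmp[i][k] * H4[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def split(q):
    """Split a quarter-pixel offset into integer part and fraction (0..3)."""
    i = q // 4
    return i, q - 4 * i


def blend(tl, tr, bl, br, fx, fy):
    """Bilinear blend of four values with weights in sixteenths."""
    return ((4 - fx) * (4 - fy) * tl + fx * (4 - fy) * tr
            + (4 - fx) * fy * bl + fx * fy * br) / 16


def single_iteration_costs(ref, cur, x, y):
    """SATD of all 49 candidates using only 9 difference-and-transform steps."""
    t = {}
    for iy in (-1, 0, 1):          # 3x3 integer candidates
        for ix in (-1, 0, 1):
            res = [[ref[y + iy + r][x + ix + c] - cur[r][c] for c in range(4)]
                   for r in range(4)]
            t[(ix, iy)] = hadamard(res)
    costs = {}
    for qy in range(-3, 4):        # 7x7 quarter-pixel candidates
        for qx in range(-3, 4):
            (ix, fx), (iy, fy) = split(qx), split(qy)
            costs[(qx, qy)] = sum(
                abs(blend(t[(ix, iy)][r][c], t[(ix + 1, iy)][r][c],
                          t[(ix, iy + 1)][r][c], t[(ix + 1, iy + 1)][r][c],
                          fx, fy))
                for r in range(4) for c in range(4))
    return costs


def direct_satd(ref, cur, x, y, qx, qy):
    """Reference path: interpolate pixels first, then subtract and transform."""
    (ix, fx), (iy, fy) = split(qx), split(qy)
    res = [[blend(ref[y + iy + r][x + ix + c], ref[y + iy + r][x + ix + c + 1],
                  ref[y + iy + r + 1][x + ix + c],
                  ref[y + iy + r + 1][x + ix + c + 1], fx, fy) - cur[r][c]
            for c in range(4)] for r in range(4)]
    return sum(abs(v) for row in hadamard(res) for v in row)


# Made-up 8x8 reference area and 4x4 current block.
ref = [[(7 * r + 13 * c + r * c) % 32 for c in range(8)] for r in range(8)]
cur = [[(5 * r + 3 * c) % 32 for c in range(4)] for r in range(4)]
costs = single_iteration_costs(ref, cur, 2, 2)
```

Because the bilinear weights sum to one, blending the transformed residues gives the same SATD as interpolating the reference pixels for each candidate and transforming 49 times, while the transform itself runs only 9 times.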
SIMULATION RESULT AND COMPARISON

Software Implementation & Comparison
The proposed single-iteration full-search FME is implemented and analyzed by modifying the Joint Multiview Video Model (JMVM) 4.4. JMVM is released by the MPEG 3DAV Group as the reference software and the research platform for multiview video coding (MVC) [7]. In JMVM, H.264/AVC is adopted as the base layer, so it can also be used as a platform for H.264/AVC. In this paper, the FME algorithm used in [2] is also implemented as a reference. Figure 6 shows the R-D performance comparison. As Fig. 6 shows, the proposed algorithm has only about a 0.02 dB quality drop in R-D performance. All the HD test sequences are also tested in this paper. The results are shown in Fig. 7 and Table 1. To increase the stability of the simulation results, some special video clips are tested independently. For example, "tractor_f300" denotes the simulation on the test sequence "tractor" starting from frame number 300, since it is a high-motion and blurry case. As Fig. 7 shows, the proposed algorithm has similar PSNR quality across all the test sequences because of its full-search characteristic. On the other hand, the reference algorithm [2] shows a large variance in quality drop, since it searches only 6 candidates. Table 1 shows the average PSNR drop of the different FME algorithms at resolutions above HD 720p. According to the analysis, the proposed algorithm is suitable for video above HD resolution.
Table 2. Detailed specifications of the proposed FME.
Fig. 8. Chip photo.
Hardware Implementation & Comparison
The proposed algorithm and architecture are also implemented in an H.264/AVC encoder as a chip design [8]. Figure 8 and Table 2 show the die photo of the whole encoder and the detailed specifications of the FME part, respectively. The chip is implemented using TSMC 90 nm 1P9M technology. Besides H.264/AVC encoding, the proposed design also supports fractional motion/disparity estimation in multiview video coding. The processing capability supports FME for 4096×2160 QFHD single-view H.264 coding, HDTV 1080p 3-view MVC, and HDTV 720p 7-view MVC. The throughput comparison is shown in Fig. 9. The throughput is computed as (number of MBs per second) × (number of reference frames). The proposed design supports a throughput at least three times higher than the prior art in [2] and [3].
CONCLUSION
In this paper, a single-iteration full-search FME algorithm and architecture are proposed. By bandwidth reduction, the required access pattern in FME is reduced by 31%. With algorithm and architecture optimization, 82% of the area of the transformation and difference circuits is saved. As a result, an FME design with only a 0.02 dB PSNR drop and a throughput of 1659K MB/sec is achieved. Compared with the references, the proposed design supports 3.39× higher throughput. Therefore, 4096×2160 QFHD H.264/AVC encoding can be achieved.
