Abstract: Motion estimation (ME) is a computation- and data-intensive process in a video coding system. In MPEG-2 to H.264 transcoding, ME at the H.264 encoder end is simplified by reusing the MPEG-2 motion vector (MV) to shrink the search window. However, reusing the MPEG-2 MV makes it difficult to apply search window reuse methods, which play a critical role in bandwidth reduction. Based on the method proposed in [1], a hardware architecture is proposed for integer motion estimation (IME) in MPEG-2 to H.264 transcoding. Experimental results show that the bandwidth of the proposed architecture is 70.6% of that of regular H.264 IME (Level C+ scheme, 2 MB stitched), while the on-chip memory is 11.7% of that.
I. INTRODUCTION
Video transcoding performs operations to transform one compressed video stream into another [2]. The operations include bit rate, frame rate, spatial resolution, coding syntax and content transforms. In [3], transcoding between different standards is defined as heterogeneous transcoding, while transcoding within the same standard is defined as homogeneous transcoding. Four transcoder architectures are often used: closed-loop, open-loop, spatial domain and frequency domain. Considering the syntax differences between MPEG-2 and H.264 as well as video quality, the closed-loop architecture is applied in this paper to handle heterogeneous transcoding between the MPEG-2 and H.264 video coding standards. Only format change is taken into consideration.
Much work has been devoted to developing efficient MPEG-2 to H.264 transcoding algorithms. MPEG-2 motion vector reuse is a commonly employed method to reduce ME computation. In [4], the decoded MPEG-2 motion vector is used as one of the motion vector predictors (MVP) in the EPZS algorithm. In [5], the MPEG-2 motion vector is reused as an MVP. Then MVP selection, motion vector refinement and a top-down splitting strategy for sub-block motion vector re-estimation are applied. In MVP selection, the motion costs of the MPEG-2 motion vector and of the motion vectors from neighboring blocks are compared. The motion vector with the smallest cost is chosen as the search center.
The way the MPEG-2 MV is reused in hardware must be carefully considered. In MPEG-2 inter motion estimation, only the 16 × 16 block mode is available, while there are seven block modes in H.264. Direct motion vector mapping or composition is therefore impossible. Our hardware solution is to use a search window centered on the MPEG-2 MV at the H.264 encoder end. ME for all seven block modes is performed in this search window.
A critical issue in video coding system design is bandwidth reduction, because external bandwidth is a limited resource in hardware. Several methods have been proposed for full search block matching algorithms (FSBMA). In [6], four search region data reuse methods, from Level A to Level D, are discussed. In [7], a Level C+ algorithm and its associated n-stitched zig-zag scan method are introduced. This method exploits the horizontal and part of the vertical overlap between search windows by stitching several MBs vertically.
The existing search window reuse methods are based on regular search positions in the reference frame. In MPEG-2 to H.264 transcoding, it is difficult to apply these methods directly because using the MPEG-2 motion vector as the search center leads to irregular overlap between successive search windows. To address this problem, a Level C search window reuse method is proposed in [1], which exploits the similarity between successive MVs to regularize the positions of the search windows.
In this paper, an IME hardware architecture combined with the Level C scheme is proposed for MPEG-2 to H.264 transcoding. In Section II, the Level C search window reuse scheme for transcoding is briefly introduced. The proposed IME hardware architecture is presented in Section III. Experimental results are shown in Section IV. Section V draws conclusions.
II. LEVEL C SEARCH

A. Level C Search Window Reuse Scheme
1) Overall Algorithm:
The Level C scheme is based on the fact that neighboring MVs often have similar values. If the difference between successive MVs is less than a threshold, they are assumed to have a fixed interval; that is, the two neighboring MBs differ by 16 pixels in the x-coordinate or y-coordinate. After this regularization, search window reuse is available for most macroblocks. The Level C algorithm is as follows.
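A minimal Python sketch of this regularization step may help fix ideas. It is an illustration under stated assumptions, not the listing from [1]: the function name `regularize_search_centers`, the `mbs_per_row` parameter, the choice to compare each MV against the search center actually used for the previous MB, and the rule that the first MB of a row never reuses a window are all hypothetical. The threshold default of 6 is the value given in the text.

```python
from typing import List, Optional, Tuple

MV = Tuple[int, int]  # (x, y) motion vector in integer pixels


def regularize_search_centers(
    mvs: List[MV], mbs_per_row: int, t: int = 6
) -> List[Tuple[MV, bool]]:
    """Return (search-center MV, reuse flag) for each MB in raster order."""
    result: List[Tuple[MV, bool]] = []
    prev: Optional[MV] = None  # MV used as search center for the previous MB
    for i, (x, y) in enumerate(mvs):
        row_start = i % mbs_per_row == 0  # assumption: no reuse across MB rows
        smooth = (
            prev is not None
            and not row_start
            and abs(x - prev[0]) < t
            and abs(y - prev[1]) < t
        )
        if smooth:
            # Keep the previous MV: the new window is the old one shifted by 16 pixels.
            result.append((prev, True))
        else:
            # Irregular jump: center on the MPEG-2 MV and reload the whole window.
            prev = (x, y)
            result.append((prev, False))
    return result


centers = regularize_search_centers(
    [(0, 0), (2, -3), (10, 0), (11, 1)], mbs_per_row=4
)
```

In this example the second and fourth MBs inherit the search center of their left neighbor, so only the first and third MBs would need a full search window load.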
The raster scan is assumed to be the processing order for H.264 ME. $\vec{mv}_i$ and $\vec{mv}_{i-1}$ are the current and previous MPEG-2 MVs; $(x_i, y_i)$ and $(x_{i-1}, y_{i-1})$ are their coordinates; $t$ is a predefined threshold, which is set to 6; MotionEstimation represents the function of motion estimation performed within $SW_i$; SearchWin is the function that determines the search window based on $\vec{mv}_i$.
2) Performance Evaluation:
Two factors must be taken into consideration to evaluate the performance of a data reuse scheme: the on-chip memory size for the reference frame and the redundancy access factor [7]. The on-chip memory size represents the buffer required to hold candidate block data for reuse. The redundancy access factor $R_\alpha$ evaluates external bandwidth and is defined as the number of reference pixels loaded for each MB pixel. $R_\alpha$ of the proposed Level C method is calculated as the expectation of $R_\alpha$ over all MBs, which is given by
$$R_\alpha = p_{reuse} \cdot R_{\alpha(Level\ C)} + (1 - p_{reuse}) \cdot R_{\alpha(No\ Reuse)} \qquad (1)$$
where $p_{reuse}$ is the probability that an MB can reuse the search window. $R_{\alpha(Level\ C)}$ and $R_{\alpha(No\ Reuse)}$ are calculated as follows.
The on-chip memory is equal to 4/3 times the size of the search window, which is $(sr_h + N - 1)(sr_v + N - 1)$. This is because the reference buffer must be flushed when the coordinate difference is larger than the threshold, and the next MB must then employ a different schedule to load reference pixels.
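The quantities in this subsection can be evaluated numerically with a short Python sketch. The per-MB load counts are assumptions made for illustration: a reusing MB is taken to load only a new strip of $N \times (sr_v + N - 1)$ reference pixels, and a non-reusing MB the whole $(sr_h + N - 1)(sr_v + N - 1)$ window. The function name and return layout are hypothetical; $p_{reuse} = 0.9$ and the [−16, 16) transcoder search range (32 positions per axis) are the values used in Section II-B.

```python
from typing import Dict


def redundancy_access_factors(
    sr_h: int, sr_v: int, n: int = 16, p_reuse: float = 0.9
) -> Dict[str, float]:
    """Evaluate R_alpha and on-chip memory for the transcoder search window."""
    window_pixels = (sr_h + n - 1) * (sr_v + n - 1)
    mb_pixels = n * n
    # Assumption: without reuse, the whole search window is loaded for each MB.
    ra_no_reuse = window_pixels / mb_pixels
    # Assumption: with Level C reuse, only a new N-wide strip is loaded.
    ra_level_c = n * (sr_v + n - 1) / mb_pixels
    # Expectation over all MBs, weighted by the reuse probability.
    ra_expected = p_reuse * ra_level_c + (1 - p_reuse) * ra_no_reuse
    return {
        "ra_no_reuse": ra_no_reuse,
        "ra_level_c": ra_level_c,
        "ra_expected": ra_expected,
        "ratio_to_no_reuse": ra_expected / ra_no_reuse,
        "on_chip_pixels": 4 * window_pixels / 3,  # 4/3 of the search window
    }


factors = redundancy_access_factors(sr_h=32, sr_v=32)
```

Under these assumptions the expected factor is about 3.51 against about 8.63 without reuse, a ratio of roughly 0.406, which agrees with the 40.6% figure quoted in Section II-B.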
B. Bandwidth and On-chip Memory Comparison
The bandwidth and on-chip memory of four IME modules are compared based on [1] , [6] and [7] . The test sequence is HDTV720p. The search range is set to [−64, 64) for H.264 and [−16, 16) for transcoder.
• H.264 (Level C): a regular H.264 integer motion estimation (IME) module with the Level C scheme.
• H.264 (Level C+): a regular H.264 IME module with Level C+, with 2 and 4 MBs stitched vertically.
• Transcoder (No Reuse): a transcoding IME module without any data reuse scheme.
• Transcoder (Level C): a transcoding IME module with Level C. $p_{reuse}$ is set to 0.9 [1].
It is observed from Table I that the redundancy access factor $R_\alpha$ of the proposed Level C scheme for transcoding is at the same level as that of regular H.264 IME with the Level C+ scheme, but the on-chip memory is much smaller (11.7% of the 2 MB stitched case and 8.1% of the 4 MB stitched case). The proposed method requires 40.6% of the bandwidth of the transcoder without any data reuse scheme, while its on-chip memory is 4/3 times that of the no-reuse transcoder.
III. HARDWARE ARCHITECTURE FOR LEVEL C SCHEME
A. Overall Architecture
Figure 1 shows the overall architecture of the proposed transcoder IME module for the HDTV720p application. There are four reference pixel memories, each of which is a 47 × 16 single-port memory. The search window itself is 47 × 47; the memory size is chosen to ease hardware implementation. Memory update and output are controlled by the memory input and output control unit, which is in turn controlled by the MV smoothness decision unit. The IME module is implemented with the Partial SAD architecture.
B. Performance Analysis
Given the working frequency ($f$) of the IME module, the extra frequency ($f_{extra}$) for initialization latency and the number of PEGs ($m$), the processing ability of the IME is expressed as $(f - f_{extra}) \times m$. The number of reference points to be processed is expressed as
where $r$ is the frame rate, $n_{ref}$ is the number of reference frames, and $W$ and $H$ are the frame width and height. Therefore Equation 3 must be satisfied to process a specific video sequence.
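The throughput requirement can be sketched in Python as follows. This is an illustration under assumptions rather than the paper's equations: the workload is taken to be $r \times n_{ref} \times (W/N)(H/N) \times sr_h \times sr_v$ reference points per second, with one PEG evaluating one reference position per clock. The function name and parameters are hypothetical, and the 30 frames/s rate is an assumption chosen because it reproduces the 9.04 ns limit stated for the HDTV720p case; the value of 16 for $f_{extra}$ and the [−16, 16) search range are the ones used for that case.

```python
def required_frequency_hz(
    width: int,
    height: int,
    frame_rate: int,
    n_ref: int,
    sr_h: int,
    sr_v: int,
    n_peg: int,
    f_extra: float,
    n: int = 16,
) -> float:
    """Smallest f satisfying (f - f_extra) * n_peg >= reference points per second."""
    mbs_per_frame = (width // n) * (height // n)
    # Assumed workload: every MB evaluates sr_h * sr_v positions per reference frame.
    ref_points_per_sec = frame_rate * n_ref * mbs_per_frame * sr_h * sr_v
    return ref_points_per_sec / n_peg + f_extra


# HDTV720p, one reference frame, search range [-16, 16), one PEG.
f_min = required_frequency_hz(1280, 720, 30, 1, 32, 32, n_peg=1, f_extra=16)
max_period_ns = 1e9 / f_min
```

With these inputs `f_min` is about 110.6 MHz, so the clock period, and hence the critical path, has to stay at or below roughly 9.04 ns.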
In the discussed HDTV720p transcoding application, $f_{extra}$ is set to 16 because this number of latency clocks is needed to produce the SAD of the first reference position. If one PEG is used to process an HDTV720p video stream (1 reference frame, search range [−16, 16)), the working frequency $f$ must satisfy Equation 4. Therefore the longest critical path delay must be less than 9.04 ns.
C. IME Architecture
The SAD Tree and Partial SAD architectures are proposed in [8] to realize variable block size motion estimation (VBSME). The SAD Tree architecture is suitable for highly parallelized applications and can share the reference buffer between parallel PEGs. However, it has a long critical path delay, which is 14.1 ns in our implementation. This delay cannot meet the performance requirement of Section III-B. To reduce the delay, sixteen 12-bit registers can be inserted between the 4 × 4 SADs and the larger blocks' SAD additions to form a 2-stage pipeline. When snake scan is applied, one PEG (256 PEs) of the SAD Tree needs 16 × 12 + 16 × 17 × 8 = 2368 bits of registers.
The Partial SAD architecture has the smallest gate count and is suitable for medium- and small-resolution videos. Another advantage is that it has a shorter critical path than the SAD Tree. If one PEG of Partial SAD is used, it needs 1872 bits of registers, which saves 496 bits compared with the SAD Tree architecture. In this paper, Partial SAD is chosen to implement the IME architecture, as shown in Figure 2.
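The register and timing figures quoted for the two architectures can be cross-checked with a few lines of Python. The constant names are hypothetical, and reading 16 × 17 × 8 as a 16 × 17 array of 8-bit reference pixels is an interpretation, not something the text states.

```python
# Register budget for one PEG (256 PEs), using the figures quoted in the text.
SAD_TREE_PIPELINE_BITS = 16 * 12       # sixteen 12-bit pipeline registers
SAD_TREE_REFERENCE_BITS = 16 * 17 * 8  # read as 16 x 17 pixels of 8 bits (assumption)
SAD_TREE_BITS = SAD_TREE_PIPELINE_BITS + SAD_TREE_REFERENCE_BITS
PARTIAL_SAD_BITS = 1872                # quoted in the text, not derived here
SAVED_BITS = SAD_TREE_BITS - PARTIAL_SAD_BITS

# Timing: the unpipelined SAD Tree path against the Section III-B limit.
SAD_TREE_DELAY_NS = 14.1
MAX_DELAY_NS = 9.04
SAD_TREE_MEETS_TIMING = SAD_TREE_DELAY_NS < MAX_DELAY_NS
```

The totals come out to 2368 bits for the SAD Tree and a saving of 496 bits for Partial SAD, and the 14.1 ns SAD Tree path exceeds the 9.04 ns limit.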
D. Memory Input and Output Control Unit
The memory input and output control units must achieve two primary goals: 1) avoid conflicts between memory input and output; 2) keep the IME module fully utilized, which means the ME operation must never stall.
The proposed architecture contains four memory banks (Mem 0-3) for storing reference pixels. Two memories are involved in performing ME on 47 × 16 reference pixels. The reference pixels for each MB are stored in three memory banks. Mem 0-2 are accessed circularly when the MV field is smooth; Mem 3 is used when the MV field is non-smooth.
An example of memory transition is presented in Figure 3 to show how the control signals for memory input and output are generated. The finite state machine (FSM) is shown in Figure 4. The FSM is composed of nine states, whose labels indicate which memories are output and in what order. The decision on MV smoothness is made in S01, S12 and S20, and the next state is chosen accordingly. For example, if the MV field is smooth in S20, the next state is S20a. This means that Mem 2 and Mem 0 can be reused. In S20a, Mem 1 must be updated since it will be accessed in the next state, S01. If the MV field is non-smooth in S20, the next two states are S31 and S12. Mem 3 and Mem 1 are updated in S20; Mem 2 is updated in S31.
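The state transitions can be sketched as a small table-driven FSM in Python. Only the transitions out of S20 (to S20a then S01, or to S31 then S12) come from the description above; the remaining entries, including the state names S01a, S12a, S30 and S32, are assumptions obtained by rotating the bank indices so that nine states result. The helper names are hypothetical.

```python
from typing import Dict, List

# Decision states: the next state depends on MV smoothness.
SMOOTH_NEXT: Dict[str, str] = {"S01": "S01a", "S12": "S12a", "S20": "S20a"}
NON_SMOOTH_NEXT: Dict[str, str] = {"S01": "S32", "S12": "S30", "S20": "S31"}
# Follow-up states: a single successor, no decision taken.
FOLLOW: Dict[str, str] = {
    "S01a": "S12", "S12a": "S20", "S20a": "S01",
    "S32": "S20", "S30": "S01", "S31": "S12",
}
ALL_STATES = set(SMOOTH_NEXT) | set(FOLLOW)


def next_state(state: str, smooth: bool) -> str:
    """Advance one step; `smooth` is only consulted in S01, S12 and S20."""
    if state in FOLLOW:
        return FOLLOW[state]
    return SMOOTH_NEXT[state] if smooth else NON_SMOOTH_NEXT[state]


def trace(start: str, smooth_flags: List[bool]) -> List[str]:
    """Visit two states per MB: the one chosen by the decision, then its successor."""
    states, state = [], start
    for smooth in smooth_flags:
        state = next_state(state, smooth)
        states.append(state)
        state = next_state(state, smooth)
        states.append(state)
    return states


path = trace("S20", [True, False])
```

Starting from S20, a smooth MB followed by a non-smooth MB visits S20a, S01, S32 and S20 in this sketch; Mem 3 appears only in the non-smooth branch.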
Processing 47 × 16 reference pixels with Partial SAD takes 32 × 16 + 15 = 527 clocks, whereas updating one memory bank takes only 47 clocks. Therefore memory input and output can operate concurrently without any memory access conflict. In the HDTV720p application, more than 90% of MBs [1] use only Mem 0-2, so Mem 3 can be disabled to save power in this situation.
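The timing margin behind this argument can be checked with a short Python sketch. Reading 32 × 16 as the number of search positions per state and 15 as pipeline latency is an interpretation, and the sketch further assumes that banks are updated one after another within a state; the names are hypothetical.

```python
SEARCH_POSITIONS_PER_STATE = 32 * 16  # interpretation of the 32 x 16 term
PIPELINE_LATENCY_CLOCKS = 15          # interpretation of the +15 term
CLOCKS_PER_STATE = SEARCH_POSITIONS_PER_STATE + PIPELINE_LATENCY_CLOCKS
CLOCKS_PER_BANK_UPDATE = 47           # from the text


def updates_fit(banks_to_update: int) -> bool:
    """True if sequential bank updates finish within one processing state."""
    return banks_to_update * CLOCKS_PER_BANK_UPDATE <= CLOCKS_PER_STATE


# Smooth case: one bank per state. Non-smooth case: two banks (e.g. Mem 3 and Mem 1).
smooth_ok = updates_fit(1)
non_smooth_ok = updates_fit(2)
```

Even the two-bank update of the non-smooth case needs only 94 of the 527 clocks available in a state.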
IV. EXPERIMENT RESULT
A. VLSI Implementation Result
The implementation result of the proposed transcoding architecture is summarized in Table II. It is implemented with TSMC 0.18 µm 1P6M technology. Logic synthesis is done with Synopsys Design Compiler. The operating condition is set to the worst case (1.62 V, 125°C). It is observed from Table II that the proposed design can achieve 110.6 MHz with 92.2K NAND gates. Table III shows a hardware cost comparison with a regular IME module. It is observed that the number of PEs and the search window size are reduced, which results from the precise search center indicated by the MPEG-2 MV.
V. CONCLUSION
An IME architecture for MPEG-2 to H.264 transcoding is proposed. The IME architecture and memory control logic are discussed in this paper. Combined with the Level C search window reuse method, the proposed architecture can reach the bandwidth level of a regular H.264 IME module with the Level C+ scheme, while the on-chip memory is at most 11.7% of that.
