Abstract
This paper describes a sub-mW motion estimation processor core for MPEG-4 video encoding. It features a Gradient Descent Search algorithm combined with a Sub Block search method, which reduces the required computation power to 8 MOPS while maintaining picture quality equivalent to that of a Full Search method, and an optimized SIMD datapath architecture that lowers the clock frequency and the operating voltage. It has been designed with a CMOS 5-metal 0.18 um rule. The estimated power consumption to process QCIF 15 fps video is less than 1 mW at 1.70 MHz and 1.0 V.
Introduction
A mobile terminal by which people can visually communicate with others continues to gain popularity. To realize an ultra-low-power, high-quality real-time MPEG-4 video codec in such a terminal, a highly efficient motion estimation processor is essential, because a motion estimator using the conventional Full Search (FS) method with integer-pel accuracy and mean absolute difference accounts for more than 70% of the total computational complexity and requires about 200 MOPS of computation power for an MPEG-4 video encoder at QCIF. The power consumption of a 0.18 um motion estimation processor using the conventional method is more than about 20 mW, which is prohibitively large for an IP core in a mobile terminal. A low-power motion estimation circuit using the Greedy Search (GRS) method has already been reported [2]. Unfortunately, the quality of its predicted picture is degraded for high-motion videos because of a local minimum problem. This paper describes a sub-mW motion estimation processor core (ME core) for MPEG-4 video encoding which solves these problems. Features of the ME core are as follows:
The Gradient Descent Search (GDS) algorithm [3] combined with a Sub Block (SB) search method is introduced for motion estimation. The GDS algorithm reduces the computing power to approximately 4% of that of the FS technique, gives picture quality almost equivalent to that of the FS technique, and gives higher picture quality than the GRS technique.
A SIMD datapath architecture optimized for the GDS algorithm is designed. It consists of 16 processing elements (PEs), so that it can execute a calculation for 16 pixels per cycle. The PE is dedicated to calculating both the Mean Square Error (MSE) and the differential coefficients. The SIMD datapath, an Address Generator (AG) and the data caches are optimized for the GDS algorithm to sustain 16-way operation.

In the GDS algorithm, a search direction is calculated from the differential coefficients of the distortion function at the point indicated by a start vector. The differential coefficients are defined in terms of TB, the pixels of the current frame, and SW, the pixels of the reference frame.
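The following is a minimal sketch of these quantities, not the authors' exact definitions. It assumes the distortion is the MSE between the TB block and the SW block displaced by the start vector, and it approximates the differential coefficients by central differences of the SW pixels, which is consistent with the left/right and upper/lower pixel inputs described for the PE in Section 3.3. The function names, the argument layout and the scaling of the coefficients are hypothetical.

```python
def mse(tb, sw, top, left):
    """Mean square error between block tb and the block of sw whose
    upper-left corner is (top, left), i.e. block origin plus start vector."""
    n = len(tb)
    total = 0
    for j in range(n):
        for i in range(n):
            diff = tb[j][i] - sw[top + j][left + i]
            total += diff * diff
    return total / (n * n)


def differential_coefficients(tb, sw, top, left):
    """Assumed form of the differential coefficients of the summed squared
    error with respect to a horizontal (dx) and vertical (dy) displacement.

    With err = TB - SW(center), the derivative of sum(err^2) is
    -2 * sum(err * dSW), and the central difference dSW = (next - prev) / 2
    cancels the factor 2, so no multiplication by a constant is needed.
    The block must lie at least one pixel inside sw on every side.
    """
    n = len(tb)
    dx = 0
    dy = 0
    for j in range(n):
        for i in range(n):
            y, x = top + j, left + i
            err = tb[j][i] - sw[y][x]
            dx -= err * (sw[y][x + 1] - sw[y][x - 1])  # right minus left
            dy -= err * (sw[y + 1][x] - sw[y - 1][x])  # lower minus upper
    return dx, dy
```

A negative dx means the distortion decreases toward larger x, so the descent direction is the negated pair (-dx, -dy).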
The MSEs of motion vectors along the search direction are calculated step by step at integer-pixel intervals. This search process is called a 1-Dimensional Search (1-DS). The 1-DS is repeated several times to reach the minimum distortion. The integer-pel motion vector whose MSE is lowest among the search points of the repeated 1-DSs is taken as a temporary solution. This is followed by a 3x3 Neighbor Half-pel Search (3x3-NHS) at the position indicated by the temporary solution. Hence, a final solution with half-pel accuracy is reached.
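The integer-pel part of this procedure can be sketched as follows. This is an illustration under assumptions, not the processor's control flow: the stopping rule (stop at the first step that does not lower the distortion) is assumed, and `cost`, `direction_of`, `max_steps` and `repeats` are hypothetical names for the distortion measure, the direction decision, the search range limit and the repeat count.

```python
def one_dimensional_search(cost, start, direction, max_steps):
    """Walk from start along direction in integer-pixel steps and return
    (best_vector, best_cost) among the visited points."""
    best_v, best_c = start, cost(start)
    v = start
    for _ in range(max_steps):
        v = (v[0] + direction[0], v[1] + direction[1])
        c = cost(v)
        if c >= best_c:  # assumed stopping rule: first non-improving step
            break
        best_v, best_c = v, c
    return best_v, best_c


def repeated_search(cost, direction_of, start, repeats, max_steps):
    """Repeat the 1-DS, recomputing the search direction at each new start."""
    v, c = start, cost(start)
    for _ in range(repeats):
        d = direction_of(v)
        if d == (0, 0):  # flat gradient: nothing left to descend
            break
        v, c = one_dimensional_search(cost, v, d, max_steps)
    return v, c
```

The vector returned by `repeated_search` plays the role of the temporary solution that is then refined by the half-pel search.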
Optimization for VLSI Implementation
Optimization techniques for VLSI-implementation are described as follows: 
A. Search Direction Rounding (8 directions)
If the 1-DS is executed toward arbitrary directions, it requires quite complex calculations. By limiting the search to 8 directions, the calculation becomes simpler, so that the multiplier circuits can be eliminated.
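One way such a rounding could work without multipliers is sketched below. The thresholds are an assumption: the comparison uses a one-bit shift (a factor of 2) to separate axis-aligned from diagonal directions, which only approximates equal 45-degree sectors. The paper does not state the exact rule, and the function names are hypothetical.

```python
def sign(a):
    """Return -1, 0 or 1 according to the sign of a."""
    return (a > 0) - (a < 0)


def round_to_8_directions(dx, dy):
    """Map differential coefficients (dx, dy) to one of 8 unit steps.

    The step points along the descent direction (-dx, -dy). Only absolute
    values, comparisons and one-bit shifts are used, so no multiplier is
    required. Returns (0, 0) when both coefficients are zero.
    """
    gx, gy = -dx, -dy
    ax, ay = abs(gx), abs(gy)
    if ax == 0 and ay == 0:
        return (0, 0)
    if (ay << 1) < ax:      # mostly horizontal
        return (sign(gx), 0)
    if (ax << 1) < ay:      # mostly vertical
        return (0, sign(gy))
    return (sign(gx), sign(gy))  # diagonal
```

For example, coefficients of (-10, 4) are rounded to the horizontal step (1, 0), while (-10, 6) are rounded to the diagonal step (1, -1).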
B. Search Range Minimization
The search range should be minimized because the Search Window (SW) RAM accounts for a significant amount of power consumption. A simulation result for picture quality when varying the search range is shown in Fig. 2. Figure 2 indicates that the optimum search range is ±16 x 16 pixels.
C. Repeat Number Minimization for 1-Dimensional Search
The number of 1-DSs affects the required computing power. A simulation result obtained by varying this number indicates that the maximum repeat number of 1-DSs can be set to two.
2.3. Sub Block Search Method
The GDS algorithm was combined with a Sub Block (SB) search method to enhance picture quality. The SB search method is described as follows:
Step 1. Divide one MB into four SBs
An MB (16 x 16 pixels) indicated by the start vector is divided into 4 SBs (8 x 8 pixels).
Step 2. Search for each SB
The differential coefficients are calculated for each SB in the starting MB. Then, the 1-DS for each SB is executed toward the search direction indicated by its differential coefficients. As a result of the 1-DS, four SBs indicated by SB vectors are obtained as temporary solutions.
Step 3. Expand into MB size
The four SBs indicated by the SB vectors are expanded into MB size in the way shown in Fig. 3.
Step 4. Decide a motion vector
A final motion vector is decided from the 5 vectors obtained by one MB search and four SB searches.

The GDS algorithm with the SB search selects the motion vector whose MSE is the smallest among those found by the MB search and the SB searches. Therefore, the algorithm always attains picture quality higher than or equal to that of the original algorithm.
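The steps above can be sketched as follows. The sketch assumes that "expanding an SB into MB size" means evaluating the distortion of the whole MB at the vector found for that SB; Fig. 3 may define the expansion differently. The names `split_into_sub_blocks`, `select_motion_vector` and `mb_cost` are hypothetical.

```python
def split_into_sub_blocks(mb):
    """Step 1: split a square MB (e.g. 16 x 16) into four SBs (e.g. 8 x 8),
    ordered upper-left, upper-right, lower-left, lower-right."""
    half = len(mb) // 2
    return [
        [row[x:x + half] for row in mb[y:y + half]]
        for y in (0, half)
        for x in (0, half)
    ]


def select_motion_vector(mb_cost, mb_vector, sb_vectors):
    """Steps 3 and 4: evaluate the MB distortion at the MB-search vector and
    at each of the four SB vectors, and keep the best of the 5 candidates.

    The MB vector is listed first, so it wins ties and the result is never
    worse than the MB search alone.
    """
    candidates = [mb_vector] + list(sb_vectors)
    return min(candidates, key=mb_cost)
```

Because the MB-search vector is always among the candidates, the selected distortion cannot exceed that of the search without SBs, which matches the "higher or equal picture quality" property stated above.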
Results of GDS Optimization
The PSNR between the predicted picture and the original picture was measured through simulations. The result for "BUS", which is a typical high-motion sequence, is shown in Fig. 4. The average PSNR obtained by the GDS algorithm with the SB method is approximately 0.5 dB higher than that of the GDS algorithm without the SB search. Figure 5 shows an advantage of the algorithm with the SB search. It indicates that the picture quality of the GDS algorithm with the SB search is better than that of the 1:4-subsampling Search, almost the same as that of the FS, and 0.5 dB higher than that of the GRS. On the other hand, the GDS algorithm without the SB search dramatically reduces the computing power, by approximately 98% compared with the FS algorithm.
In addition, it is notable that the picture quality of the algorithm without the SB search is higher than that of the GRS algorithm, although its computing power is almost equal to that of the GRS.

The ME core architecture has the following features:
A SIMD datapath architecture constructed from 16 PEs
A datapath optimized for the SB search
Gated clock and operand isolation

The above features enable the ME core to operate at a low frequency and a low operating voltage, resulting in very low power consumption. Figure 6 shows the block diagram of the ME core. The ME core is connected to a 32-bit CPU Bus (CPU-BUS) and a 32-bit Memory Bus (MEM-BUS).

Figure 6. Block diagram of the ME core.

SIMD Datapath Architecture
The SIMD datapath architecture contains 2 Template Buffers (TBs), 8 SW Buffers and a Processor Unit (PU). The PU contains 16 PEs, an Adder Tree (AT) and an Accumulator (ACC). The SW has 3-port access capability (2 read / 1 write) and a 512-word by 8-bit configuration. The TB has 3-port access capability and a 64-word by 64-bit configuration. The TBs, SWs and PEs are connected by a Crosspath to sort pixel data. Each PE executes a calculation for 1 pixel in one cycle. The PEs are followed by the AT, which completes the summation. The control part consists of a sequencer (SEQ), an address generator (AG) and a vector generator (VG). The VG is a circuit that decides a search direction and a motion vector.

3.3. Processing Element
A PE was newly developed to efficiently perform the GDS algorithm optimized for the SB search and VLSI implementation. The PE can calculate the MSE and the differential coefficients, and it can execute the 3x3-NHS by using a Half Pixel Blender (HPB) for both an MB and an SB. The HPB generates half-pel data by a filtering operation among integer-pel data. Figure 7 shows the block diagram of the PE. Figure 7(a) describes the PE operation for an MSE calculation. The PE receives each pixel data from the TBs and SWs. Figure 7(b) describes the PE operation for calculating a differential coefficient in the x direction. In this case, the PE receives one pixel data from the TBs, and the center, right and left pixel data from the SWs. Figure 7(c) describes the PE operation for calculating a differential coefficient in the y direction. The PE receives one pixel data from the TBs, and the center, upper and lower pixel data from the SWs. Figure 7(d) describes the PE operation for the 3x3-NHS. In this case, the PE serially receives the 8 surrounding pixel data from the TBs and SWs. Here, the PPP (Pre Pixels Processor) in Fig. 7 consists of MUXs and DEMUXs that distribute the input pixel data to a subtracter. AND gates are inserted at the input stage for the operand isolation described later.
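As a rough software illustration of the 3x3-NHS, the sketch below generates half-pel samples by bilinear averaging of neighbouring integer-pel samples with rounding, which is the form commonly used for half-sample prediction in MPEG-4. The text only states that the HPB performs "a filtering operation", so this filter, the `rounding_control` parameter, the use of the sum of squared errors as distortion and all function names are assumptions.

```python
def half_pel_sample(frame, y2, x2, rounding_control=0):
    """Sample frame at half-pel coordinates (y2 / 2, x2 / 2).

    Coordinates must be non-negative and leave one integer pixel of margin
    to the right and below; no bounds handling is done in this sketch.
    """
    y, x = y2 >> 1, x2 >> 1
    frac_y, frac_x = y2 & 1, x2 & 1
    a = frame[y][x]
    if not frac_y and not frac_x:
        return a  # integer-pel position
    if not frac_y:
        return (a + frame[y][x + 1] + 1 - rounding_control) >> 1
    if not frac_x:
        return (a + frame[y + 1][x] + 1 - rounding_control) >> 1
    total = a + frame[y][x + 1] + frame[y + 1][x] + frame[y + 1][x + 1]
    return (total + 2 - rounding_control) >> 2


def neighbor_half_pel_search(tb, ref, top, left):
    """Evaluate the integer-pel position (top, left) and its 8 half-pel
    neighbours; return ((dx2, dy2), sse) for the best half-pel offset."""
    n = len(tb)
    best = None
    for dy2 in (-1, 0, 1):
        for dx2 in (-1, 0, 1):
            sse = 0
            for j in range(n):
                for i in range(n):
                    p = half_pel_sample(ref, 2 * (top + j) + dy2,
                                        2 * (left + i) + dx2)
                    sse += (tb[j][i] - p) ** 2
            if best is None or sse < best[0]:
                best = (sse, (dx2, dy2))
    return best[1], best[0]
```

The offsets are expressed in half-pel units, so (1, 0) means half a pixel to the right of the integer-pel solution.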
3.4. Timing Chart
Power Consumption Estimation
The power consumption was estimated by circuit simulations. The estimated power consumption of the RAM part and of the other parts, which comprise the logic portion and interconnections, is 0.60 mW and 0.22 mW, respectively, at 1.70 MHz and 1.0 V. The power consumption without the SB method is 0.41 mW. Therefore, a power consumption of less than 1 mW is attained for the ME core.
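The budget behind this conclusion can be checked with simple arithmetic. The sketch assumes that the 0.60 mW and 0.22 mW figures both refer to operation with the SB method, so that their sum is the total for the ME core. The variable names are hypothetical.

```python
ram_mw = 0.60                # RAM part
logic_and_wiring_mw = 0.22   # logic portion and interconnections
without_sb_mw = 0.41         # reported total without the SB method

total_mw = ram_mw + logic_and_wiring_mw
sb_overhead_mw = total_mw - without_sb_mw

print(f"total with SB search: {total_mw:.2f} mW")
print(f"cost of SB search:    {sb_overhead_mw:.2f} mW")
print(f"below 1 mW:           {total_mw < 1.0}")
```

Under this assumption the total is 0.82 mW, so the SB search roughly doubles the power relative to the 0.41 mW configuration while remaining under the 1 mW target.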
5. Conclusions
A motion estimation processor for MPEG-4 video encoding has been newly designed. The GDS algorithm combined with the Sub Block search reduces the computing power by approximately 96% and produces picture quality almost equal to that of the conventional FS technique. The ME core contains a 16-way SIMD datapath architecture and 3-port SRAMs for highly parallel operation. The address generator and the memory data mapping are also optimized for the SB search. The clock frequency and the operating voltage were reduced by the above techniques.
Therefore, the ME core attains an ultra-low power of less than 1 mW at QCIF 15 fps with high picture quality. The ME core also supports a wide range of video formats, from QCIF 15 fps to CIF 30 fps.
