This paper presents a cost-effective platform architecture design for MPEG-4 video coding. A fast motion estimator architecture supporting predictive diamond search and spiral full search with halfway termination is implemented to strike a good compromise between compression performance and design cost. An efficient block-level schedule for the texture coding engine is employed to reduce the hardware cost. Both key modules are integrated into an efficient platform in a hardware/software co-design fashion. With a high degree of optimization at both the algorithm and architecture levels, a cost-efficient video encoder is implemented. It consumes 256.8 mW at 40 MHz and achieves real-time encoding of 30 CIF (352x288) frames per second.
INTRODUCTION
The emerging MPEG-4 standard is becoming the main technique for mobile devices and streaming video applications such as smart phones and handheld PDAs. In these applications, low power, low cost, high flexibility, and high performance are the four key requirements for a video coding system that must meet real-time specifications and support future applications.
Several MPEG-4 video chips have been reported in the past. To support the rich functionality of future multimedia, some are implemented in software [1] [2] on a low-power DSP platform. They offer the highest flexibility, but to achieve real-time performance under limited resources, fast algorithms for motion estimation (ME) and the discrete cosine transform (DCT) are applied, and the compression quality degrades as a result. Some [3] use a dedicated hardware methodology to achieve low power and low area cost. Their disadvantages are the lack of potential for future modification with advanced algorithms and the higher design effort. Hence, some [4] [5] adopt hybrid software/hardware co-design to balance performance and flexibility for the complex coding flow.
In this paper, a RISC-based platform with hardware accelerators is presented to implement MPEG-4 video encoding algorithms. Optimization is applied at both the algorithm and architecture levels, and not only the key components but also the optimization of their interconnection is discussed. First, the coding system is divided into three main subsystems, motion, texture, and bitstream, which are optimized by observing the relationship between the algorithm and architecture levels. In the motion subsystem, a hybrid motion estimator supporting both predictive diamond search and spiral full search with halfway termination, for real-time or high-compression-quality applications, is proposed to reduce the dominant cost in a typical coding system. In the texture subsystem, an efficient interleaving schedule and a substructure-sharing technique among quantization and DC/AC prediction are proposed [6] to reduce the cost further. In the bitstream subsystem, to handle the complex bitstream syntax and avoid inefficient bit-level storage, a hardware/software cooperation scheme is applied for bitstream generation. After the optimization described above, a compact MPEG-4 video encoder chip is implemented; it occupies 5.02x5.13 mm² in a 4-layer-metal, 0.35 µm CMOS standard-cell process. It is much smaller than any MPEG-4 video encoder previously reported and achieves the same performance. It consumes 257 mW at 40 MHz and achieves real-time encoding of 30 CIF (352x288) frames per second.
0-7803-7750-8/03/$17.00 ©2003 IEEE
Fig. 1 depicts the proposed platform-based system with hardware accelerators that provides the MPEG-4 video coding functionality. The RISC is responsible for macroblock-level hardware scheduling, coding mode decision, motion vector coding, and other high-level procedures. The hardware accelerators improve system performance through parallel processing that exploits the parallelism of the algorithms.
The motion estimator (ME) carries out motion estimation with a search range of -16.0 to +15.5 pixels. The motion compensator (MC) interpolates pixels in reference frames into compensated blocks according to the specified motion vectors. The texture block engine (TBE) carries out the discrete cosine transform (DCT), inverse discrete cosine transform (IDCT), quantization (Q), inverse quantization (IQ), and AC/DC prediction on texture pixels in block units. The bitstream generator (BTS) produces headers, motion information, and texture information in the form of variable-length codes. In addition, shared memory builds direct channels from MC to TBE and from TBE to BTS to decrease traffic on the data bus. The sequencer (SEQ) handles the pixel-by-pixel scheduling of these shared memories without involving the RISC. The DMA, driven by dedicated commands, efficiently generates the proper addresses for requests issued by the RISC or SEQ. Four global bus channels are used in this system. First, the RISC bus broadcasts control information to each hardware module. After applying the operations issued by the RISC, the hardware modules return processed side information, on which the RISC depends to decide the coding modes for macroblocks. At the same time, the source, reference, and reconstructed frames required by the hardware modules are passed through the DMA and provided on the DATA bus. Hardware modules access this data automatically according to pre-determined scheduling. These parts are integrated into a single chip, with the firmware stored off-chip and supplied through the PROGRAM bus so that the chip remains programmable after tape-out. The SHARE bus can transfer DCT coefficients, quantized coefficients, or other intermediate information in testing mode, which reduces development time and effort.
Fig. 1. System architecture.
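The half-pixel search range quoted above implies that the motion compensator must interpolate between integer pixels. The following Python sketch illustrates one common way to do this, bilinear half-sample interpolation with a rounding-control bit. The paper does not describe the interpolation datapath, so the rounding rule, the half-pel integer representation of motion vectors (-32 to +31 for -16.0 to +15.5), and the function names `half_pel_sample` and `compensate_block` are assumptions made for illustration.

```python
def half_pel_sample(ref, x2, y2, rounding=0):
    """Sample `ref` at position (x2 / 2, y2 / 2), given in half-pel units.

    Assumes the position and its right/lower neighbours lie inside `ref`
    (a real encoder would pad the reference frame first).
    """
    x, y = x2 >> 1, y2 >> 1          # integer part of the position
    frac_x, frac_y = x2 & 1, y2 & 1  # 1 if the position is at a half sample
    a = ref[y][x]
    if frac_x and frac_y:            # centre of four integer pixels
        total = a + ref[y][x + 1] + ref[y + 1][x] + ref[y + 1][x + 1]
        return (total + 2 - rounding) >> 2
    if frac_x:                       # halfway between two horizontal pixels
        return (a + ref[y][x + 1] + 1 - rounding) >> 1
    if frac_y:                       # halfway between two vertical pixels
        return (a + ref[y + 1][x] + 1 - rounding) >> 1
    return a                         # integer position, no interpolation


def compensate_block(ref, bx, by, mv_x2, mv_y2, size=8, rounding=0):
    """Build a size x size predicted block for the block at (bx, by).

    (mv_x2, mv_y2) is the motion vector in half-pel units (hypothetical
    representation chosen for this sketch).
    """
    return [
        [
            half_pel_sample(ref, 2 * (bx + col) + mv_x2,
                            2 * (by + row) + mv_y2, rounding)
            for col in range(size)
        ]
        for row in range(size)
    ]
```

Storing vectors in half-pel units keeps the whole computation in integer arithmetic, which is the kind of operation that maps naturally onto a small hardware datapath.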
MPEG-4 VIDEO ENCODER ARCHITECTURE

MOTION ESTIMATION
Algorithm
Motion estimation is the key technique of video coding; it reduces the temporal redundancy of sequences to make compression efficient. Among motion estimation algorithms, the full search block matching (FSBM) algorithm is well known and commonly used in video coding systems because of its good performance and regularity. However, it requires huge computational power to meet real-time requirements. Dedicated hardware with parallel processing is usually employed, and it results in a costly design. Besides, in the MPEG-4 standard the encoder should decide the optimal prediction blocks among various block sizes and at finer pixel precision. This makes it difficult for the system to handle these operations at acceptable cost while maintaining the same compression quality. To meet the requirements of various applications at acceptable cost, we adopt two algorithms for the motion estimation of 16x16 blocks at integer-pixel precision. One is the spiral full search with halfway termination (called fast full search, FFS), which achieves the same compression efficiency as the full search algorithm. The other is the diamond search starting from a predictor derived from neighboring macroblocks (called predictive diamond search, PDS), which meets the real-time specification at the cost of some visual quality degradation. Afterwards, a hierarchical scheme is applied for the motion estimation of the four 8x8 pixel blocks in a macroblock, within -2 to +2 positions of the previous best motion vector. Half-pixel refinement is also applied to all integer-pixel motion vectors found. Fig. 2 depicts all stages of motion estimation, which are as follows. First, the predictor is determined from neighboring macroblocks. Second, the PDS mode or FFS mode is employed to find the integer-pixel motion vector. Third, half-pixel refinement is applied around the motion vector found in the second stage.
Next, for the four 8x8 pixel blocks in a macroblock, a spiral search within -2 to +2 positions is applied to obtain four optimal motion vectors. Finally, half-pixel refinement is applied four times, once around each of these 8x8 motion vectors.
Architecture
Fig. 3 depicts the hardware architecture of the motion estimator supporting PDS and FFS. This architecture mainly comprises three processing stages and two buffers that store the current MB and the search window. Before performing motion estimation, the video coding system transfers data from external memory into these buffers so that the subsequent sum of absolute differences (SAD) calculation requires no bus bandwidth. Meanwhile, the adder tree accumulates the sum of the pixels in the current MB and saves it in a register for the later mode decision. To speed up data loading and reduce bus traffic, the search window buffer can be loaded using a column-by-column data-reuse scheme.
Many different algorithms can be adopted alternatively under different conditions of cost, bit-rate, and picture quality. In our view, a novel motion estimator supporting both the PDS and FFS algorithms offers a compromise between compression performance and design cost. The PDS mode satisfies the real-time specification, while the FFS mode achieves the same compression quality as the MPEG-4 software verification model (VM) [7]. To explore the degradation in the PDS mode, four sequences with different features are used as test patterns. The average PSNR difference between PDS and VM is only 0.136 dB, and the maximum PSNR drop across the test sequences is only 0.618 dB. Even in the frames where the PSNR difference is largest, the two are still subjectively indistinguishable. When encoding in the FFS mode, the PSNR and bit-rate of the reconstructed frames are almost the same as those encoded by the VM. The average PSNR is even better, by 0.00625 dB. The R-D curves for the test sequences are simulated and shown in Fig. 4.
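The halfway termination used by the FFS mode can be illustrated with a minimal Python sketch. It is not the hardware described in this paper: the ring-by-ring scan (standing in for the spiral order), the row-level granularity of the termination check, the symmetric search range, and the function names are all assumptions made for illustration.

```python
def spiral_offsets(search_range):
    """Yield candidate (dx, dy) offsets ring by ring, outward from (0, 0)."""
    yield (0, 0)
    for r in range(1, search_range + 1):
        for dx in range(-r, r + 1):      # top and bottom edges of ring r
            yield (dx, -r)
            yield (dx, r)
        for dy in range(-r + 1, r):      # left and right edges, corners excluded
            yield (-r, dy)
            yield (r, dy)


def sad_with_termination(cur, ref, cx, cy, dx, dy, n, limit):
    """Accumulate the SAD row by row; give up once it reaches `limit`."""
    total = 0
    for row in range(n):
        cur_row = cur[cy + row]
        ref_row = ref[cy + dy + row]
        for col in range(n):
            total += abs(cur_row[cx + col] - ref_row[cx + dx + col])
        if total >= limit:
            return None                  # halfway termination: cannot beat the best
    return total


def fast_full_search(cur, ref, cx, cy, n=16, search_range=16):
    """Return ((dx, dy), sad) of the best integer-pel match for the block at (cx, cy)."""
    height, width = len(ref), len(ref[0])
    best_mv, best_sad = (0, 0), float("inf")
    for dx, dy in spiral_offsets(search_range):
        x, y = cx + dx, cy + dy
        if x < 0 or y < 0 or x + n > width or y + n > height:
            continue                     # candidate block falls outside the frame
        sad = sad_with_termination(cur, ref, cx, cy, dx, dy, n, best_sad)
        if sad is not None:
            best_mv, best_sad = (dx, dy), sad
    return best_mv, best_sad
```

Because a candidate is abandoned only when its partial SAD can no longer beat the current best, the minimum SAD found is the same as that of an exhaustive search over the same candidates. Scanning outward from the centre tends to find a good match early, which makes later terminations happen sooner.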
CONFIGURABLE PLATFORM PROTOTYPING
A configurable platform is used to verify the functionality of our architecture design. The prototyping board is connected to the host computer through the PCI interface. Four separate memories with DMA modules handle the PROGRAM, DATA, SHARE, and BITSTREAM buses of our design. An arbiter is responsible for memory access through the PCI interface and the memories. Raw image data is transferred from the host computer to the frame memory on the prototyping board, and video encoding is processed concurrently. Afterwards, the bitstream data are stored in the bitstream memory and then read by the host computer. In addition, the shared memory can record intermediate information for debugging in testing mode.
IMPLEMENTATION
Fig. 6 shows a micrograph of the encoder LSI and Table 1 lists its characteristics. The LSI contains 828K transistors and is fabricated on a 5.02 x 5.13 mm² die in a 0.35 µm single-poly quadruple-metal CMOS process. The chip has been tested and works successfully. It operates from a 3.3 V supply and consumes 256.8 mW at a 40 MHz working frequency. Table 2 shows the number of transistors, the area, and the size ratio to the LSI of each unit.
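To put the reported figures in perspective, the following Python sketch derives the average cycle budget per macroblock and the energy per frame from the numbers quoted in this paper (40 MHz, 30 CIF frames per second, 256.8 mW). It assumes that every clock cycle is available for encoding and that the quoted power is the average during encoding; the function names are hypothetical.

```python
def macroblocks_per_frame(width, height, mb_size=16):
    """Number of 16x16 macroblocks in a frame whose sides are multiples of 16."""
    return (width // mb_size) * (height // mb_size)


def cycles_per_macroblock(clock_hz, fps, width, height):
    """Average clock cycles available to encode one macroblock in real time."""
    return clock_hz / (fps * macroblocks_per_frame(width, height))


def energy_per_frame_mj(power_mw, fps):
    """Energy per encoded frame in millijoules (mW divided by frames/s)."""
    return power_mw / fps
```

For CIF, `macroblocks_per_frame(352, 288)` gives 396, so the encoder has roughly 3367 cycles per macroblock at 40 MHz and 30 frames per second, and spends about 8.56 mJ per frame.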
CONCLUSION
In this paper, an efficient platform architecture design with hardware accelerators for an MPEG-4 Simple Profile@Level 3 video encoder is proposed. The hardware modules are written in Verilog and verified in a modular fashion, while the firmware is written in assembly. Co-design and co-simulation are employed to reduce the development time. In addition, an efficient reconfigurable FPGA prototyping system is used to verify the functionality. With cost-effective hybrid motion estimation and interleaved DCT/IDCT hardware modules, the system is implemented on a 5.02x5.13 mm² die in a 0.35 µm CMOS process. It works at 40 MHz and consumes 256.8 mW while meeting the real-time encoding specification.
SIPS, vol. 23, pp. 2 7~9, 2002.
[4] M. Takahashi et al.,
