We propose a high-performance hardware accelerator for intra prediction and mode decision in H.264/AVC video encoding. We use two intra prediction units to increase the performance. 
INTRODUCTION
H.264/AVC is the latest video coding standard of ITU-T (VCEG) and ISO/IEC (MPEG) [1]. It can save up to 39%, 49%, and 64% of bit-rate compared with MPEG-4, H.263, and MPEG-2, respectively [2]. Intra prediction in H.264/AVC predicts pixel data by exploiting spatial redundancy and has 13 modes for a luma macroblock and 4 modes for a chroma macroblock. It is an important technique in H.264/AVC for achieving better compression performance. However, it is very computationally intensive. Therefore, hardware acceleration is essential for real-time encoding of high-resolution video. Intra prediction predicts one block by referring to its reconstructed neighboring pixels. After intra prediction generates the results of all modes, mode decision chooses the best prediction mode with the minimum rate-distortion cost. In this paper we propose a hardwired unit for both intra prediction and mode decision functions in real-time H.264/AVC encoding. The rest of this paper is organized as follows. In Section II we describe the intra prediction and mode decision algorithms. In Section III we present our hardware design. Finally, we show our experimental results in Section IV and draw a conclusion in Section V.
INTRA PREDICTION ALGORITHM
A. Intra Prediction

Intra prediction generates prediction pixels for each block from reconstructed neighboring pixels. Unlike the decoder, the encoder must produce prediction pixels for all possible modes, so its computational complexity is higher. In the 4:2:0 format, the 16x16 luma component can be predicted either as a single 16x16 block or as 16 4x4 blocks, while each of the two 8x8 chroma components is predicted as one 8x8 block. Fig. 1 illustrates all possible intra prediction modes. Each 4x4 luma block has 9 possible prediction modes, the 16x16 luma block has 4, and each 8x8 chroma block has 4. In many modes, the plane mode excepted, intra prediction produces pixels by weighted summation of neighboring reconstructed pixels.

There are two mode decision algorithms used in the H.264/AVC reference software. The Rate-Distortion Optimized (RDO) mode decision [3] evaluates distortion and bit-rate by carrying out the entire encoding process. The low-complexity mode decision, on the other hand, uses either the sum of absolute differences (SAD) or the sum of absolute transformed differences (SATD) to evaluate distortion and estimate bit-rate. Although RDO mode decision achieves better performance, it is very complicated to implement in either hardware or software. Therefore, we propose a modified low-complexity mode decision algorithm. The cost function of our proposed algorithm is defined in (1).
$$\text{Mode-Cost} = \text{SATD} + \lambda \cdot \text{BitUsage} \tag{1}$$

SATD is the sum of absolute transformed differences between the original and prediction pixels, as defined in (2).
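As a rough illustration of the cost in (1), the sketch below computes a 4x4 SATD with an unnormalized Hadamard transform and adds a weighted bit estimate. The function names (`satd4x4`, `mode_cost`), the absence of any scaling factor on the SATD, and the example values for the Lagrangian weight and BitUsage are assumptions made for illustration; the exact definitions in (2) and (3) may differ.

```python
# Minimal sketch of Mode-Cost = SATD + lambda * BitUsage for one 4x4 block.
# All names and numeric inputs here are illustrative assumptions.

# 4x4 Hadamard matrix (symmetric, entries +1/-1).
H4 = [
    [1,  1,  1,  1],
    [1,  1, -1, -1],
    [1, -1, -1,  1],
    [1, -1,  1, -1],
]


def matmul4(a, b):
    """Multiply two 4x4 integer matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def satd4x4(orig, pred):
    """Sum of absolute Hadamard-transformed differences (no normalization)."""
    diff = [[orig[i][j] - pred[i][j] for j in range(4)] for i in range(4)]
    # H4 is symmetric, so H * D * H^T equals H * D * H.
    diff_t = matmul4(matmul4(H4, diff), H4)
    return sum(abs(v) for row in diff_t for v in row)


def mode_cost(orig, pred, lam, bit_usage):
    """Cost of one candidate mode; lam and bit_usage are supplied by the caller."""
    return satd4x4(orig, pred) + lam * bit_usage


def best_mode(orig, candidates, lam):
    """Pick the mode name with minimum cost.

    candidates maps a mode name to (prediction block, estimated bit usage).
    """
    return min(candidates,
               key=lambda m: mode_cost(orig, candidates[m][0], lam, candidates[m][1]))
```

A flat residual concentrates all of its energy in the DC coefficient, which is why the transformed difference is a better proxy for coded cost than a plain SAD.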
DiffT denotes the 4x4 residual error values after the Hadamard transform. The Lagrangian parameter λ (the multiplier on BitUsage in (1)) is a quantization parameter (QP) dependent variable [4] and is defined in (3). BitUsage represents the estimated bit-rate cost of the coding mode.

The intra prediction unit reads 4 ...

2) Ipred2 Unit

All modes assigned to the Ipred2 unit can be computed by adding together 2 or 3 pixel values and right-shifting the sum by 1 or 2 bits. We design the processing element (PE) according to this observation. Each PE has three 3-to-1 MUXes for selecting inputs, two adders for adding pixels, and clippers for limiting the final results. If we use only 1 PE, we need 51 cycles to complete all computation. Balancing area cost and performance, we employ 4 PEs in the Ipred2 unit. In order to reduce the amount of computation, we merge all ...

The mode decision unit first reads the original block pixels and the prediction pixels to calculate 4x4 residuals. It then performs the Hadamard transform on the residuals to obtain the SATD. The BitUsage Generator and the λ Generator are responsible for calculating BitUsage and λ, respectively. The cost of each mode is then calculated according to (1). To encode the chroma components, we compute the cost of the 4 intra chroma prediction modes and accumulate them into the Plane, DC, Horizontal, and Vertical registers. The best prediction mode for the chroma components is chosen by the Macroblock Level Comparator as well.

... The first one takes 40+7=47 cycles and the remaining 15 each take 20+7=27 cycles. Therefore, 452 cycles are needed. For each of the two chroma components, the first 4x4 block takes 22+7=29 cycles while the remaining three each take 20+7=27 cycles. Therefore, we need 220 cycles for the chroma components. Finally, six cycles are needed for flushing the pipeline registers.

We have synthesized our design using Design Compiler, targeting a TSMC 0.13um CMOS cell library. It consumes 36k gates (12k for intra prediction, 11k for mode decision, and 13k for the Hadamard transform) when running at 75MHz.
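The cycle figures quoted above can be checked with a short calculation. The sketch below simply re-adds the numbers from the text; it assumes that "the first one" and "the remaining 15" refer to the sixteen 4x4 luma blocks of a macroblock, and the function name `macroblock_cycles` is hypothetical.

```python
# Re-derive the per-macroblock cycle budget from the figures in the text.
# Assumption: the 47-cycle and 27-cycle figures apply to the 16 luma 4x4 blocks.

def macroblock_cycles():
    """Return (luma, chroma, total) cycle counts for one macroblock."""
    first_luma = 40 + 7            # first 4x4 block
    other_luma = 20 + 7            # each of the remaining 15 blocks
    luma = first_luma + 15 * other_luma

    first_chroma = 22 + 7          # first 4x4 block of one chroma component
    other_chroma = 20 + 7          # each of the remaining three blocks
    chroma = 2 * (first_chroma + 3 * other_chroma)   # two chroma components

    flush = 6                      # pipeline flush
    return luma, chroma, luma + chroma + flush


luma, chroma, total = macroblock_cycles()
print(luma, chroma, total)
```

Under these assumptions the luma and chroma subtotals match the 452 and 220 cycles stated above, and the sum including the flush comes to 678 cycles per macroblock.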
Table II compares our work against a previous work [6]. The proposed design delivers much higher performance.
We use three video sequences, 'Foreman', 'News', and 'Carphone', to test the video quality and compression ratio of our design. Each sequence consists of 100 frames in CIF (352x288) format. Fig. 6 shows the results in comparison with the H.264 reference software JM9.0 [5]. On average, our proposed design incurs a PSNR drop of only 0.01 dB. Our design can encode 720p HD (1280x720) video sequences in real time at 30 frames per second (fps). It consumes only 3.7 mW of power at this level of performance.
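The real-time claim can be sanity-checked against the cycle counts and the 75MHz clock reported earlier. The sketch below is a back-of-the-envelope estimate only: it assumes 452 + 220 + 6 = 678 cycles per macroblock, one macroblock processed at a time, and no other overhead; the function name `required_mhz` is hypothetical.

```python
# Estimate the clock frequency needed for a given resolution and frame rate.
# Assumptions: 678 cycles per 16x16 macroblock, no stalls or memory overhead.

CYCLES_PER_MB = 452 + 220 + 6   # luma + chroma + pipeline flush


def required_mhz(width, height, fps, cycles_per_mb=CYCLES_PER_MB):
    """Clock frequency in MHz needed to process every macroblock in time."""
    mbs_wide = -(-width // 16)    # ceiling division
    mbs_high = -(-height // 16)
    return mbs_wide * mbs_high * fps * cycles_per_mb / 1e6


needed = required_mhz(1280, 720, 30)
print(f"720p @ 30 fps needs about {needed:.3f} MHz")
```

A 720p frame holds 80 x 45 = 3600 macroblocks, so this estimate comes to roughly 73.2 MHz, which fits within the 75MHz operating frequency.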
V. CONCLUSION
We have presented a VLSI architecture and its HDL implementation for high-performance intra prediction and mode decision in H.264/AVC video encoding. Experimental results show that our design exhibits very small quality degradation compared with a pure software implementation. It is capable of real-time encoding of high-definition video (720p HD at 30 fps) with very low power consumption.
In the future, we would like to develop an H.264/AVC encoder system and integrate the proposed intra prediction and mode decision unit into the whole system. We would also like to improve our design to support 1080p HD (1920x1080) video.
