Direct VLSI implementation of context-based adaptive variable length coding (CAVLC) for residues, as a modification of conventional run-length coding, leads to low throughput and utilization. In this paper, an efficient CAVLC design is proposed. The main concept is a two-stage block pipelining scheme for parallel processing of two 4x4 blocks. While one block is processed by the scanning engine to collect the required symbols, the previous block is handled by the coding engine to translate symbols into bitstream. Our dual-block-pipelined architecture doubles the throughput and utilization of CAVLC at high bitrates. Moreover, a zero skipping technique is adopted to reduce up to 90% of cycles at low bitrates. Last but not least, exponential-Golomb coding for other general symbols and bitstream encapsulation for the network abstraction layer are integrated with the CAVLC engine as a complete entropy coder for the H.264/AVC baseline profile. Simulation results show that our design is capable of real-time processing of 1920x1088 30fps videos with 23.6K logic gates at 100MHz.
The rest of this paper is organized as follows. In Section 2, the background and the profiling analysis are presented. In Section 3, the architecture design of the H.264/AVC entropy coding engine is described. Section 4 shows the simulation and VLSI implementation results of our entropy coding design. Finally, Section 5 concludes the paper.
INTRODUCTION
Digital video compression has played an important role in enabling efficient transmission and storage of multimedia data where bandwidth and storage space are limited. The new video coding standard, H.264/AVC [1] [2], developed by the Joint Video Team (JVT), significantly outperforms previous standards in compression performance [3]. The higher coding efficiency comes from new features and functionalities, including the entropy coding tool of context-based adaptive variable length coding (CAVLC). In this paper, the first published entropy coding engine for an H.264/AVC baseline profile encoder is proposed.
Using the instruction profile and the symbol-count analysis, we describe why entropy coding needs to be accelerated by hardware. Afterwards, the architecture of the entropy coding engine is designed based on the hardware/software (HW/SW) partition. Some important issues of our prototype entropy coding engine are as follows. The dependency caused by content adaptability confines the hardware utilization and decreases the throughput. A dual-buffer architecture with a block pipeline schedule is proposed to improve the hardware utilization.
The zero skipping technique, based on the coded block pattern (CBP) in the macroblock (MB) header, is used to save redundant computation, especially in low bit-rate situations. To reduce the system load in a platform-based system, the network abstraction layer (NAL) encapsulation is implemented in the bitstream packer. The analysis of the bitstream buffer size is used to favor burst data transmission from the entropy coding engine onto the system bus.
0-7803-9060-1/05/$20.00 ©2005 IEEE
FUNDAMENTALS
There are two entropy coding schemes adopted by H.264/AVC. One is VLC-based coding. The other is binary arithmetic coding. In this paper, only the VLC-based coding is described. Two VLC-based coding techniques, Exp-Golomb code and CAVLC, cooperate in H.264/AVC. CAVLC encodes the transform residues, while Exp-Golomb is responsible for the remaining symbols, such as prediction modes, block types, and motion vectors. In this section, the bitstream hierarchy and the analysis of the instruction set profile are presented. For more details, please refer to the standard [1] [2].
Bitstream Hierarchy
Figure 1 shows the hierarchical structure of the bitstream. The whole sequence can be categorized into four layers: sequence layer, slice layer, MB layer, and block layer. The first three layers begin with their corresponding headers. The sequence parameter set and picture parameter set (SPS/PPS) define the coding tools and sequence information such as profile, level, frame size, frame number, etc. The slice header represents slice information such as slice mode and initial quantization parameter, while the MB header represents MB information such as block types, prediction modes, and motion vectors (MVs). After the MB header, the transformed coefficients of each MB are coded by CAVLC. In CAVLC, each category of symbols has several context-based adaptive VLC tables.
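As a rough illustration of the Exp-Golomb code that handles the non-residue symbols (prediction modes, block types, motion vectors), the following Python sketch builds the unsigned and signed codewords as bit strings. The function names `exp_golomb_ue` and `exp_golomb_se` are hypothetical and chosen only for this example; a hardware unit would typically use table look-up rather than this software formulation.

```python
def exp_golomb_ue(code_num: int) -> str:
    """Unsigned Exp-Golomb codeword for code_num >= 0, as a bit string."""
    if code_num < 0:
        raise ValueError("code_num must be non-negative")
    value = code_num + 1
    # The codeword is (bit_length - 1) leading zeros followed by value in binary.
    return "0" * (value.bit_length() - 1) + format(value, "b")


def exp_golomb_se(value: int) -> str:
    """Signed Exp-Golomb codeword: positive v maps to 2v-1, non-positive v to -2v."""
    code_num = 2 * value - 1 if value > 0 else -2 * value
    return exp_golomb_ue(code_num)


if __name__ == "__main__":
    for n in range(5):
        print(n, exp_golomb_ue(n))
    for v in (0, 1, -1, 2, -2):
        print(v, exp_golomb_se(v))
```

Small values receive short codewords, which suits header symbols whose most likely values are mapped to small code numbers.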

Profiling
We use iprof, a software instruction-level analyzer, for profiling on a processor-based platform. The profiling condition is CIF format, 30fps, a quantization parameter of 20, and CAVLC for residue coding. The entropy coder requires 115.4 million instructions per second (MIPS) of computation. A software implementation would exhaust the system RISC resources in a platform-based VLSI system. Besides, entropy coding is a bit-level operation and cannot be efficiently handled by general purpose processors (GPPs). Therefore, dedicated hardware for entropy coding is a must.
Table 1 shows the symbol rate of the Foreman sequence in CIF format at 30fps. The symbol rates of the SPS, PPS, and slice header in the bitstream (Fig. 1) are very low. Besides, these symbols are almost fixed for a specified profile and level. The symbol rates of the MB header and coefficients are much higher. There are 396x30 macroblocks per second and up to [...] million symbols/s for CIF format. As for the HW/SW partition, the SPS, PPS, and slice header are generated by the system processor. The remaining part, including scanning, coding, and bitstream packing, is mapped into hardware as an MB engine.
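The macroblock rate quoted above follows directly from the CIF frame size. The short Python calculation below reproduces it and, as a derived figure that is not stated in the profile itself, estimates the average software instruction count per macroblock from the 115.4 MIPS load. The variable names are illustrative only.

```python
# CIF frame size and the profiling figures quoted in the text.
WIDTH, HEIGHT = 352, 288   # CIF luma resolution in pixels
FPS = 30                   # frame rate used in the profile
MB_SIZE = 16               # a macroblock covers 16x16 luma pixels
ENTROPY_MIPS = 115.4       # software entropy coder load from the profile

mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)
mbs_per_second = mbs_per_frame * FPS
# Derived estimate: average instructions spent on entropy coding per macroblock.
instructions_per_mb = ENTROPY_MIPS * 1e6 / mbs_per_second

print(mbs_per_frame, mbs_per_second, round(instructions_per_mb))
```

Under these numbers the processor would spend on the order of ten thousand instructions per macroblock on entropy coding alone, which is consistent with the decision to map the MB-level work into hardware.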
ARCHITECTURE DESIGN
In this section, we introduce the architecture design of the entropy coding engine for the H.264 baseline profile. Figure 2 shows the block diagram. The bitstreams of the SPS, PPS, and slice header are generated by the system processor because their symbol rates are very low. The MB header information and the transformed and quantized residues are assumed to be input from the prediction and reconstruction engine of the encoder. The final bitstream is output to the system buffer in network abstraction layer (NAL) format. This entropy coding engine is divided into three stages. At the symbol level, the Exp-Golomb code unit and the CAVLC unit take the MB information and transformed coefficients in proper order and translate them into codewords by table look-up. At the codeword level, the bitstream packer concatenates the generated codewords. This compressed result is then stored in the bitstream buffer and output via the bus interface. The architecture design of each stage is described in the following subsections.
Figure 3 shows the basic architecture, which has one degree of parallelism in terms of reading one coefficient from the residue buffer or coding one symbol. When an MB starts to be processed, the MB information is translated into codewords by using the Exp-Golomb code table. Afterwards, the transformed coefficients are coded by using CAVLC. The macroblock (MB) is divided into several 4x4 blocks, and those 4x4 blocks are processed one after another in double zigzag order. Each 4x4 block is processed through two phases, the scan phase and the coding phase. In the scan phase, the transformed coefficients are read from the residue buffer in reverse zig-zag order. Then, the run-level symbols and required statistics are extracted by the level detector and stored in the statistic buffer. In the coding phase, the symbols are translated into codewords by using the corresponding class of tables. The selection of VLC tables within a class depends on the related statistics and the previously coded symbol.
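The scan phase described above can be sketched in software as follows. This is a minimal behavioral model, not the level detector circuit: it reads a 4x4 block in reverse zig-zag order and collects the run-level symbols together with the total coefficients, trailing ones, and total zeros. The function name `scan_block` and the dictionary layout are assumptions made for this example.

```python
# Zig-zag scan order for a 4x4 block, given as raster-order indices.
ZIGZAG_4X4 = [0, 1, 4, 8, 5, 2, 3, 6, 9, 12, 13, 10, 7, 11, 14, 15]


def scan_block(block):
    """Extract run-level symbols and statistics from 16 raster-order coefficients."""
    scanned = [block[i] for i in ZIGZAG_4X4]
    levels, runs = [], []
    run = 0
    seen_nonzero = False
    # Walk from the highest-frequency coefficient back to DC.
    for coeff in reversed(scanned):
        if coeff != 0:
            if seen_nonzero:
                runs.append(run)  # zeros preceding the previously found level
            levels.append(coeff)
            run = 0
            seen_nonzero = True
        elif seen_nonzero:
            run += 1
    if seen_nonzero:
        runs.append(run)  # zeros preceding the lowest-frequency level
    # Trailing ones: up to three consecutive +/-1 levels at the high-frequency end.
    trailing_ones = 0
    for level in levels:
        if abs(level) == 1 and trailing_ones < 3:
            trailing_ones += 1
        else:
            break
    return {
        "levels": levels,
        "runs": runs,
        "total_coeff": len(levels),
        "trailing_ones": trailing_ones,
        "total_zeros": sum(runs),
    }


if __name__ == "__main__":
    example = [0, 3, -1, 0,
               0, -1, 1, 0,
               1, 0, 0, 0,
               0, 0, 0, 0]
    print(scan_block(example))
```

Note that none of the three statistics is final until the last coefficient has been read, which is why the coding phase of a block cannot begin before its scan phase completes.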
Dual-Buffer Architecture with Block Pipeline Scheme
Compared with traditional VLC tables that use a fixed static probability distribution model, CAVLC utilizes the inter-symbol correlation to further reduce the statistical redundancy. Not until the scanning of a 4x4 block is finished can we know the statistics of total coefficients, trailing ones, and total zeros, which are the adaptive factors of most tables. Thus, the scan and coding phases of each block must be processed in sequential order. Though the structure of this basic architecture is similar to those used for JPEG and MPEG-1/2/4, its utilization and throughput are only half.
To deal with this problem, we propose an advanced dual-buffer architecture and the corresponding block pipeline scheme, as shown in Fig. 4. There is a pair of ping-pong mode statistic buffers (Fig. 4(a)). After the scan phase of the first 4x4 block, the run-level symbols and statistics are stored in the first buffer, and the coding phase is processed. At the same time, the scan phase of the second 4x4 block is processed in parallel by using the second buffer. As shown in Fig. 4(b), by switching the ping-pong mode buffers, scanning and coding of the 4x4 blocks within a macroblock can be processed simultaneously in an interleaved manner. In this way, both the throughput and utilization are doubled.
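The effect of the ping-pong buffers on cycle count can be approximated with a simple model. The Python sketch below compares sequential processing with the two-stage block pipeline, assuming that the scan of one block overlaps the coding of the previous block and that the buffers swap only when both engines are done. The per-block cycle counts are hypothetical values for illustration, not measurements from the design.

```python
def sequential_cycles(scan, code):
    """Basic architecture: each block is scanned, then coded, one after another."""
    return sum(scan) + sum(code)


def pipelined_cycles(scan, code):
    """Dual-buffer pipeline: scan of block i overlaps coding of block i-1."""
    if not scan:
        return 0
    total = scan[0]  # pipeline fill: the first scan has nothing to overlap with
    for i in range(1, len(scan)):
        # Buffers swap only when both the scanner and the coder have finished.
        total += max(scan[i], code[i - 1])
    total += code[-1]  # pipeline drain: the last block still has to be coded
    return total


if __name__ == "__main__":
    # Hypothetical high bit-rate case: 16 blocks, 16 cycles per phase.
    scan = [16] * 16
    code = [16] * 16
    print(sequential_cycles(scan, code), pipelined_cycles(scan, code))
    # Hypothetical low bit-rate case: few symbols, so coding is short.
    sparse_code = [2] * 16
    print(sequential_cycles(scan, sparse_code), pipelined_cycles(scan, sparse_code))
```

When the two phases take similar time the pipelined count approaches half of the sequential one. When coding is short, the total is bounded by the scan phase, so the pipeline alone brings little gain.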
Zero Skipping by CBP Look-Ahead
The symbol count of transformed coefficients decreases as the quantization parameter increases because of the larger percentage of zero coefficients. In this situation, the throughput of the dual-buffer architecture is confined by the scan phase. To further improve our design, a zero skipping technique is applied. When the coefficients within an 8x8 block are all zero, the 4x4 blocks inside need not be coded. We can save operation cycles and power by skipping the redundant scan process, including the memory access to the residue buffer. In this method, the coded block pattern (CBP) in the macroblock header is used for the skipping decision. This scheme is useful for well-predicted MBs or in low bitrate situations.
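A minimal sketch of the skipping decision is given below for the luma blocks only. It assumes that bit b of a 4-bit luma CBP flags 8x8 block b, that 8x8 block b contains the 4x4 blocks 4b to 4b+3 in coding order, and that scanning one 4x4 block costs 16 cycles (one coefficient per cycle). The function names and the cycle cost are assumptions for illustration. Chroma handling is omitted.

```python
def blocks_to_scan(cbp_luma: int):
    """Return indices of the 4x4 luma blocks that still need scanning."""
    blocks = []
    for b8 in range(4):
        if (cbp_luma >> b8) & 1:
            # This 8x8 block has nonzero coefficients: scan its four 4x4 blocks.
            blocks.extend(range(4 * b8, 4 * b8 + 4))
        # Otherwise all four 4x4 blocks are skipped without reading the buffer.
    return blocks


def scan_cycles(cbp_luma: int, cycles_per_block: int = 16) -> int:
    """Scan-phase cycles for one macroblock under the assumed per-block cost."""
    return len(blocks_to_scan(cbp_luma)) * cycles_per_block


if __name__ == "__main__":
    for cbp in (0b1111, 0b0101, 0b0000):
        print(format(cbp, "04b"), blocks_to_scan(cbp), scan_cycles(cbp))
```

Because the CBP is already part of the MB header, the decision needs no extra pass over the coefficients.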
Bitstream Packer
H.264/AVC defines a byte-stream format to transmit a sequence as an ordered stream of bytes in the network abstraction layer (NAL). In this system, the use of emulation prevention bytes guarantees that the start code prefix can be uniquely identified. Emulation prevention on a byte basis requires format translation from raw byte sequence payloads (RBSPs) to encapsulated byte sequence payloads (EBSPs). When a three-byte sequence of "0x000000", "0x000001", "0x000002", or "0x000003" is found in the original bitstream, the byte "0x03" is inserted before the third byte. Large amount of memory access and [...] all passed. At the same time, the first 32-bit word is compatible with the EBSP format and is output to the bitstream buffer. Otherwise, the dummy byte insertion is performed in serial, and the circuit of the coding core must be paused via a feedback stall signal.
By simulation of various sequences, the occurrence probability of dummy byte insertion is found to be very small (less than 0.001%). Therefore, the backward stall seldom occurs, and the throughput of the coding core remains very high.
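The RBSP-to-EBSP translation can be expressed as a byte-serial software routine, shown below as a sketch. It corresponds to the serial insertion path rather than the word-parallel check of the hardware packer, and the function name `rbsp_to_ebsp` is hypothetical.

```python
def rbsp_to_ebsp(rbsp: bytes) -> bytes:
    """Insert emulation prevention bytes (0x03) into a raw byte sequence payload."""
    out = bytearray()
    zero_run = 0  # number of consecutive 0x00 bytes already written
    for byte in rbsp:
        if zero_run >= 2 and byte <= 0x03:
            # 0x0000 followed by 0x00..0x03 would emulate a start code prefix.
            out.append(0x03)
            zero_run = 0
        out.append(byte)
        zero_run = zero_run + 1 if byte == 0x00 else 0
    return bytes(out)


if __name__ == "__main__":
    print(rbsp_to_ebsp(bytes([0x00, 0x00, 0x01])).hex())
    print(rbsp_to_ebsp(bytes([0x00, 0x00, 0x04])).hex())
```

Since the condition needs two consecutive zero bytes followed by a byte no larger than 0x03, it is rarely met in compressed data, in line with the low insertion probability reported above.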
Analysis of Bitstream Buffer
The entropy coding engine acts as the output interface of the encoder. After the bitstream packer, the concatenated bitstream in EBSP format is stored temporarily and then transmitted to the system buffer via the system bus. A bitstream buffer is used to favor burst transmission from the core engine onto the system bus. If the bitstream buffer is full while the coding of one MB is not finished, the entropy coding engine must be halted immediately, and the system bus is requested to handle this exceptional condition. If the bitstream buffer is too small, this exceptional condition occurs too frequently, which decreases the utilization of both the system bus and the coding engine. To decide the minimum size of the bitstream buffer and to guarantee that buffer fullness seldom takes place, we collect statistics of various sequences and analyze the bit usage of each MB. Almost all MBs (over 99.9%), in either I frames or P frames, produce less than 2K bits of bitstream. Therefore, a bitstream buffer with 2K-bit storage capacity is enough.
Figure 6 shows the cycle count requirements of three entropy coding engines: the basic architecture, the dual-buffer architecture with block pipeline scheme, and the dual-buffer architecture with the zero skipping technique. Compared with the basic architecture, the dual-buffer architecture with block pipeline scheme can process the scan phase and coding phase of two neighboring 4x4 blocks in parallel and thus enhance the hardware utilization. It can almost halve the processing cycles when the quantized residue energy is large in high bit-rate situations. However, when prediction is good or in low bit-rate situations, most residuals are zero, and the scan phase dominates the processing cycles. The zero skipping technique based on the CBP can further improve the design by passing over the redundant scan process in this situation.
Implementation Result
The proposed entropy coding engine, with dual-block-pipelined architecture, zero skipping technique, NAL encapsulation bitstream packer, and 2K-bit bitstream buffer, is implemented using a cell-based design flow and the 0.18um UMC/Artisan cell library. Table 2 shows the gate count profile. To achieve full hardware utilization with the dual-buffer architecture, two block statistic buffers are required, which costs an additional 6000 gates. Table 3 shows the local memory requirement. Three types of memories are required. The coefficient memory and bitstream memory are used as input and output buffers for system consideration. The upper 4x4-block total coefficient memory is used to store the 4x4-block total coefficients required by following blocks. The entropy coding engine requires about 500 cycles for high-quality applications and about 200 cycles for low-bitrate applications (QP=30-45).
CONCLUSION
In this paper, an entropy coding design for the H.264/AVC baseline profile encoder is described. We consider the feature of context adaptation and propose the dual-buffer architecture. The scan and coding procedures are interleaved in a block pipeline, which enhances the hardware utilization and processing throughput. The zero skipping technique further skips redundant computations in low bit-rate situations. Besides, for system consideration, the RBSP-to-EBSP conversion is integrated in the bitstream packer, and the bitstream buffer is used to reduce the interaction with the system. The design can encode 1920x1088 30fps videos in real time with 23.6K logic gates at 100MHz.
