Abstract. This paper presents an analysis of bitstream parsing and an efficient, flexible bitstream parsing processor. The analysis identifies the critical part of bitstream parsing. Based on the result, novel approaches to parsing data partitioned bitstreams are presented. An efficient instruction set optimized for bitstream processing, especially for DCT coefficient decoding, is designed, and the processor architecture can be programmed for various video standards. The processor has been successfully integrated into an MPEG-4 video decoding system and achieves real-time decoding of bitstreams coded at 4CIF frame size, 30 fps, and 8 Mbps, which is the specification of MPEG-4 Advanced Simple Profile Level 5.
Introduction
As the development of video coding standards continues, for both the MPEG and the H.26x series, more and more coding tools are added to a general video coding structure to provide more functionality and better compression performance. Thus, the coded data format of each newer standard must change to accommodate better compression techniques or advanced coding options. Moreover, commercial video products are now beginning to support several coding standards simultaneously [1]. The standards differ in their bitstream formats. The operation that extracts the data hidden in the bitstream is called parsing. A versatile bitstream parser is therefore becoming a necessary part of a video coding system implementation. A hardwired parser is not a good choice for keeping up with the rapid transition of video coding standards because it lacks flexibility. On the other hand, adopting an embedded processor reduces the time-to-market. By re-programming the firmware, one can change the specification to match a new coding standard in a short time. Moreover, a firmware upgrade can be accomplished by replacing the layout of the ROM in fabrication. To date, implementations of video decoders have usually embedded microprocessors on-chip as the parsing unit [1] [2] [3] [4].
Processing of the bitstream often requires bit-level operations such as bit extraction and variable-length decoding. Processors designed for 16-bit or 32-bit operations spend many cycles on a single bit-level operation. Therefore, a general-purpose processor is not efficient for the bitstream parsing task. Since bitstream parsing is a bit-serial operation and is the first stage of the whole decoding task, the overall performance of the decoding system is determined by the parser's throughput and efficiency. Introducing a processor therefore requires further analysis and optimization. For MPEG-4 video decoding systems, both [5] and [6] propose such a solution. In [6], an instruction set extension for MPEG-4 bitstream parsing is proposed for a general-purpose RISC. In [5], a processor with an instruction set dedicated to bitstream parsing is proposed. These previous designs emphasize the importance of variable-length code decoding (VLD) and fixed-length code decoding (FLD) operations, and propose an enhanced datapath for VLD/FLD. In addition, a bitstream processor for the object-based MPEG-4 profile is proposed in [7, 8].
In this paper, we propose an embedded bitstream processor for an MPEG-4 video decoder. Based on our previous work [5] and a codeword type distribution analysis, efficient parsing algorithms supporting data partitioned bitstreams are proposed and realized with the proposed bitstream processor. The proposed design achieves real-time decoding at MPEG-4 Advanced Simple Profile Level 5 (720 × 576, 30 fps), excluding Global Motion Compensation (GMC) and Quarter-pixel Motion Compensation (QMC).
The paper is organized as follows. The analysis of the MPEG-4 video bitstream structure is given in Section 2. Based on the analysis, efficient parsing algorithms supporting data partitioned bitstreams are presented. The proposed architecture is described in Section 3. The implementation results are presented in Section 4. The conclusion is given in Section 5.
Bitstream Parsing Analysis and Proposed Algorithms
A detailed analysis of the MPEG-4 video bitstream structure is given in [5], which shows that six classes of operations occur very often in bitstream parsing. Based on that result, we focus on finding the most critical part of parsing and accelerating it. In addition, to support the error resilience decoding function, a parsing approach for data partitioned bitstreams is also illustrated.
Codeword Type Distribution
A software model for bitstream parsing is applied for MPEG-4 video decoding. The codeword distribution among several bitstreams is acquired during parsing.
Since the computational load of parsing is proportional to the bit-rate of the encoded bitstream, we perform the analysis only on high bit-rate bitstreams. Most of the bits are DCT coefficient codewords, which occupy about 70% of the bitstream. The DCT coefficients not only occupy a large portion of the bitstream but also occur successively in it. Besides, if the DCT coefficients cannot be decoded in time for the MPEG-4 video decoder, the decoding system has to be paused and the overall decoding performance decreases. So, the operations for DCT coefficient decoding have to be optimized.
Proposed DCT Coefficients Parsing Approach
From hardware implementation viewpoint, the operations for the DCT coefficient decoding consist of four parts:
The conventional processor-based implementations [5, 6] of DCT coefficient parsing focus on the first three parts but ignore the fourth, the branch decision. If it is taken into account, the number of cycles required for VLD in [5] is 4 rather than 1, and that in [6] is about 6 to 10 rather than 4. So, if this essential branch decision operation can be merged with the first three parts, the decoding performance can be improved.
We evaluate this idea by simulation. A processor emulator with an instruction architecture similar to a RISC, with slight modifications, is set up. Then, a firmware program for bitstream parsing is written for the simulation. In one case, it is assumed that the VLD operation can be finished in a single instruction, but a branch decision is required after each DCT coefficient is decoded. In the other case, the essential branch operation is merged with the VLD instruction so that codeword decoding and branch condition checking are accomplished within one cycle. The average numbers of cycles required to parse an I-VOP and a P-VOP are compared in Tables 1 and 2. The term "enhanced" means that the branch operation is merged with VLD. The new merged instruction reduces the number of processing cycles by 20 to 50%. Thus, it is desirable to merge the essential branch decision operation with the VLD operation in a processor-based parser.
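The effect of merging the branch with VLD can be illustrated with a toy cycle model. The sketch below is not the emulator used for Tables 1 and 2; the function name, the one-extra-cycle branch cost, and the coefficient and overhead counts are all illustrative assumptions.

```python
def parse_cycles(num_coeffs, other_cycles, merged):
    """Toy cycle model for parsing one unit of bitstream.

    Assumption: each DCT coefficient costs 1 cycle for VLD, plus 1 extra
    cycle for a separate branch decision unless the branch is merged
    into the VLD instruction.
    """
    per_coeff = 1 if merged else 2
    return other_cycles + num_coeffs * per_coeff


# Hypothetical workload: 300 coefficients and 300 cycles of other parsing work.
baseline = parse_cycles(num_coeffs=300, other_cycles=300, merged=False)
enhanced = parse_cycles(num_coeffs=300, other_cycles=300, merged=True)
saving = 1 - enhanced / baseline  # fraction of cycles removed by merging
```

Under these assumed numbers the saving depends only on how large the coefficient decoding share is relative to the remaining parsing work, which is why high bit-rate streams benefit most.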
Proposed Data Partitioned Parsing Approach
In the MPEG-4 video standard, the bitstream structure of a P-VOP can be data partitioned or combined. In combined mode, the data is arranged in MB order, and all data for a specific MB is put together. In data partitioned mode, a P-VOP video packet is composed of three parts: the motion part, which holds the motion data of all MBs; the DC and low-frequency DCT related data, which contains the DC values and the AC prediction flag; and the DCT coefficients. Within each part, the data is ordered one MB after another. Since the data of one MB is divided among three different locations, there are traditionally two approaches. One is to allocate a large storage space in either external or internal memory to store all the previously decoded motion data and DC data. These data can be read out only when the corresponding DCT coefficients for the block are decoded. The other is to parse a video packet several times to obtain the necessary codewords for each MB. However, the former approach is too costly, while the latter is inefficient.
We propose a cost-effective algorithm to parse the data partitioned bitstream efficiently, as shown in Fig. 1. The parsing is composed of two stages. In the first stage, the whole video packet is only scanned and stored in order to find the starting positions of the three parts described above. After the starting positions are found, the second stage of parsing starts. First, the motion data of the first MB are parsed. Then the DC/low-frequency data of the first MB, followed by the DCT coefficients of the first MB, are decoded. The data in the three parts are decoded alternately until all MBs in the video packet are parsed. With the proposed approach, the required storage size is reduced greatly. When decoding one frame of CIF size, only one packet buffer of the maximum packet size, which is 8 K bits at MPEG-4 Advanced Simple Profile Level 5, and one side buffer of about 700 bits are required. Compared with the conventional implementation, which may demand 43 K bits, the proposed algorithm is more cost-effective. The cycle overhead of the proposed algorithm is shown in Table 3. We encode the sequence in either data partitioned mode or non-data-partitioned mode, and use the emulator to parse it and count the required cycles. The overhead is tolerable with respect to the total required cycles.
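The two-stage idea can be sketched in a few lines. This is a simplified model under stated assumptions, not the firmware of the proposed processor: the packet is a list of tokens rather than bits, each MB contributes exactly one token per part, and the separator `MARKER` and both function names are hypothetical.

```python
MARKER = "MARK"  # hypothetical separator token between the three parts


def find_part_starts(packet):
    """Stage 1: scan the stored packet once and record where each part begins."""
    starts = [0]
    for i, token in enumerate(packet):
        if token == MARKER:
            starts.append(i + 1)
    return starts


def parse_partitioned(packet, num_mbs):
    """Stage 2: read the three parts alternately, one MB at a time."""
    ptr = find_part_starts(packet)  # one read pointer per part
    if len(ptr) != 3:
        raise ValueError("expected exactly three parts in the packet")
    mbs = []
    for _ in range(num_mbs):
        mb = []
        for part in range(3):  # motion, DC/low-frequency, DCT coefficients
            mb.append(packet[ptr[part]])
            ptr[part] += 1
        mbs.append(tuple(mb))
    return mbs


packet = ["mv0", "mv1", MARKER, "dc0", "dc1", MARKER, "coef0", "coef1"]
result = parse_partitioned(packet, num_mbs=2)
```

Only the packet itself and three pointers are kept, which mirrors why the approach needs a packet buffer rather than storage for all decoded motion and DC data.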
Architecture Design

Proposed Instruction Set
From the analysis results in Tables 1 and 2, the most critical part of bitstream parsing is DCT coefficient decoding. DCT coefficient decoding involves two memory access operations in one decoding cycle. One is the read operation for symbol lookup, and the other is the write operation for data output. So, we target single-cycle DCT coefficient decoding. Besides, the analysis of the parsing operations in [5] shows that branch instructions account for a large proportion of the operations. Branches occur quite often, but the target of the jump usually consists of a single operation such as a VLD or an FLD. In order to eliminate the branch overhead, we adopt the conditional execution found in modern DSPs and micro-controllers [9]. By conditioning its execution on a flag (a one-bit register), every instruction can be controlled more freely than in a branch-based architecture. Moreover, to provide more flexibility and to support other video coding standards in the future, the VLC tables are programmable.
In order to meet the above requirements, the instruction set is designed and divided into five categories according to functionality. The bitstream operation instructions contain a set of enhanced bitstream operations, including fixed-length decoding and variable-length decoding. To optimize DCT coefficient decoding, as mentioned above, one special variable-length decoding instruction called 'REP.VLDS' is provided for repeated decoding. With the help of conditional execution, it repeats DCT coefficient parsing automatically until a series of DCT coefficient codewords has been parsed, since most of the RISC branch instructions are replaced by conditional execution. The code for DCT coefficient parsing is shown below. Register r1 stores the leading 16 bits of the bitstream.
DCT decoding: REP.VLDS r1, AD DCT COEF
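The behavior described for REP.VLDS can be approximated in software as a loop that decodes one symbol per iteration and stops when the decoded symbol carries an exception flag. The following is a sketch only: `rep_vlds`, `TOY_TABLE`, and its three codewords are hypothetical and are not the MPEG-4 TCOEF table or the actual instruction semantics.

```python
# Hypothetical prefix-free table: codeword -> (symbol, exception flag).
TOY_TABLE = {
    "1": ("run0_level1", False),
    "01": ("run1_level1", False),
    "001": ("last_coeff", True),  # exception flag set: ends the repeat
}


def rep_vlds(bits, table):
    """Decode symbols repeatedly until one with its exception flag is found.

    Each loop iteration stands for one cycle: identify the codeword,
    look up the symbol, output it, and check the stop condition.
    Returns (symbols, bits_consumed, cycles).
    """
    max_len = max(len(code) for code in table)
    symbols, pos, cycles = [], 0, 0
    while True:
        for length in range(1, max_len + 1):
            code = bits[pos:pos + length]
            if code in table:
                break
        else:
            raise ValueError(f"no valid codeword at bit {pos}")
        symbol, stop = table[code]
        symbols.append(symbol)
        pos += len(code)
        cycles += 1
        if stop:
            return symbols, pos, cycles


symbols, used, cycles = rep_vlds("1011001" + "111", TOY_TABLE)
```

In this model the stop check costs no extra iteration, which is the software analogue of merging the branch decision into the decoding instruction.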
The arithmetic instructions contain Boolean logic operations and 16-bit addition and subtraction. An 8-bit multiplication is also included. Special functions such as absolute value, conversion from sign-magnitude to 2's complement, and bit-field extraction are also available. The branch instructions contain only jumps with or without linking the return address to a register, and jumps to an address held in a register. In parsing applications, generating a branch condition often involves several data comparisons combined by Boolean operations. To make the comparison more efficient, the comparison instructions apply a logic operation such as AND or OR to a conditional flag register and the current comparison result, and write the result back to the conditional flag register. The memory access instructions contain 16-bit, 32-bit, and single-bit load/store pairs. The single-bit load simply masks the unnecessary bits of the loaded data word, while the single-bit store is realized by placing the input bit at the correct position and writing the word back. Bit-array operations, including one-bit signal comparison and memory access, are thus supported by the proposed instruction set.
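The flag-accumulating comparisons and the single-bit load/store described above can be modeled as follows. This is a minimal sketch: the function names, the choice of equality as the comparison, and the bit numbering (bit 0 is the least significant bit of a 16-bit word) are assumptions, not the processor's actual encoding.

```python
def cmp_and(flag, a, b):
    """Compare a == b and AND the result into the condition flag."""
    return flag and (a == b)


def cmp_or(flag, a, b):
    """Compare a == b and OR the result into the condition flag."""
    return flag or (a == b)


def load_bit(word, pos):
    """Single-bit load: mask away every bit of the word except bit `pos`."""
    return (word >> pos) & 1


def store_bit(word, pos, bit):
    """Single-bit store: place `bit` at position `pos` and return the 16-bit word."""
    cleared = word & ~(1 << pos) & 0xFFFF
    return cleared | ((bit & 1) << pos)


# A compound condition (x == 3 and y == 5) built without any branch instruction.
x, y = 3, 5
flag = cmp_and(True, x, 3)
flag = cmp_and(flag, y, 5)
```

Chaining comparisons through one flag lets a following conditionally executed instruction depend on the whole compound condition.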
Bitstream Processor Architecture
The block diagram of the bitstream processor is shown in Fig. 2. The processor is composed of four stages: Instruction Fetch (IF), Execution (EXE), Memory Load (MEML), and Write Back/Memory Store (WBMS).
The instruction addressed by the program counter is fetched from the program memory and buffered in a register. At the EXE stage, a bit sequencer provides the basic bitstream functions such as show-bit and flush-bit operations. Meanwhile, the ALU provides arithmetic functions such as addition, subtraction, 8-bit multiplication, and logic operations. The multiplexer at the end of the EXE stage selects between the outputs of the group detector and the ALU. At the MEML stage, the data read address generated at the previous stage appears at the read address port of the data memory. Once the data is read, it passes through the bit placing and extracting block for bit replacement or bit extraction instructions. After the bit placement, some instructions need to write the data back to the memory, and the data write address is applied at the write address port. Meanwhile, the data is written back through the write port to the register file, which contains 32 16-bit registers. A pipeline controller monitors the execution of each stage. It is responsible for clearing the pipeline registers if a bubble has to be inserted, and for stalling the pipeline when necessary.
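The show-bit and flush-bit functions of the bit sequencer can be illustrated with a small software model. This is only a sketch: the class and method names are our own, and padding with zeros past the end of the data is an assumption made to keep the example short.

```python
class BitSequencer:
    """Software model of a bit sequencer with show-bit and flush-bit operations."""

    def __init__(self, data: bytes):
        # Keep the stream as a string of '0'/'1' characters for clarity, not speed.
        self.bits = "".join(f"{byte:08b}" for byte in data)
        self.pos = 0

    def show_bits(self, n):
        """Return the next n bits as an integer without consuming them."""
        if n == 0:
            return 0
        chunk = self.bits[self.pos:self.pos + n].ljust(n, "0")  # zero-pad at the end
        return int(chunk, 2)

    def flush_bits(self, n):
        """Discard the next n bits."""
        self.pos += n

    def get_bits(self, n):
        """Fixed-length decode: show n bits, then flush them."""
        value = self.show_bits(n)
        self.flush_bits(n)
        return value


seq = BitSequencer(bytes([0b10110000, 0xFF]))
first = seq.show_bits(4)   # peek at the leading 4 bits
seq.flush_bits(4)          # consume them
second = seq.get_bits(8)   # read the next 8 bits
```

Separating show from flush is what lets a variable-length decoder inspect a fixed window of leading bits and then discard only the length of the codeword it recognized.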
The group detector [10] works as an address generator: it takes the leading bits of the bitstream from the sequencer and calculates the symbol address. It is composed of 16 group detector cells since the MPEG-4 TCOEF table has 15 groups of codewords. Each cell compares the bitstream against its group to check whether the codeword is valid in that group. The group detector determines the valid group and calculates the symbol address. Simultaneously, it sends the number of bits to be discarded back to the sequencer to update the bitstream. So, the group detector is applied here to perform codeword identification and to provide VLC table programmability. Next, the symbol lookup and the branch decision are executed when the data is loaded from memory. Finally, the symbol output is accomplished by writing the data back. When the REP.VLDS instruction is executed, the PC stalls. REP.VLDS outputs one symbol per clock cycle. At the same time, it checks the branch condition. If the branch is taken, REP.VLDS terminates. If the branch is not taken, REP.VLDS continues. The branch decision is made by checking the exception code field of the loaded symbol.
To integrate the proposed design into a video decoding system, an interface module is designed, as shown in Fig. 3. It handles parameter control by acquiring the following decoding units or the bitstream processor according to whether the FIFO is empty or full.
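The address generation performed by a group detector can be sketched as follows. This is an illustration of group-based VLC decoding in general, under our own assumptions: the three groups in `TOY_GROUPS`, the rule that the address is a base plus the bits after the prefix, and the function name are hypothetical and do not reproduce the design in [10] or the TCOEF table. The hardware cells compare in parallel, whereas this model checks the groups one after another.

```python
# Each hypothetical group: (prefix, total codeword length, base symbol address).
TOY_GROUPS = [
    ("1", 2, 0),    # codewords 10, 11           -> addresses 0..1
    ("01", 4, 2),   # codewords 0100 .. 0111     -> addresses 2..5
    ("001", 4, 6),  # codewords 0010, 0011       -> addresses 6..7
]


def group_detect(front_bits, groups):
    """Return (symbol_address, bits_to_discard) for the leading bits.

    Each loop iteration plays the role of one detector cell checking
    whether the bitstream is valid in its group.
    """
    for prefix, length, base in groups:
        if front_bits.startswith(prefix) and len(front_bits) >= length:
            suffix = front_bits[len(prefix):length]
            offset = int(suffix, 2) if suffix else 0
            return base + offset, length
    raise ValueError("no group matches the leading bits")


address, discard = group_detect("0110" + "0" * 12, TOY_GROUPS)
```

Because the table is described by group parameters rather than fixed logic, changing the parameters changes the code table, which is the sense in which a group-based detector supports programmable VLC tables.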
Stream Handler
To support the compressed domain data partitioning parsing discussed in Section 2.3, a video packet buffer to store the packet data and three addressing pointers to locate the start positions are required. In our design, both components are included in the stream handler, which is an interface between the bitstream sequencer and the external bitstream data input. Its block diagram is shown in Fig. 4. The video packet buffer is addressed by three addressing registers, which correspond to Pointers 0, 1, and 2, respectively. The processor accesses the external bitstream by using Pointer 0, in which case the bitstream input data is bypassed to the processor. Meanwhile, the stream handler writes each bitstream input data word into the video packet buffer to save the packet data. When Pointer 1 or 2 is used, addressing register 1 or 2 is activated to address the packet buffer and send the data to the processor. Once the whole packet has been parsed, a signal is passed to the stream handler to reset the addressing registers in preparation for parsing the next packet.
Implementation and Comparison
The proposed bitstream processor is successfully integrated with the decoding unit to form an MPEG-4 video decoding system. In TSMC 0.35 um 1P4M technology, it operates at 33 MHz under 3.3 V to achieve MPEG-4 Advanced Simple Profile Level 5 (4CIF, 30 fps) real-time decoding. Its gate count is 32,603. The chip implementation is shown in Fig. 5. The overhead of parsing a data partitioned bitstream twice, obtained from hardware simulation, is shown in Table 4. As discussed in Section 2.3, the overhead is negligible. The comparison with other implementations is shown in Table 5. We compare the proposed design with [5, 6, 11] and the TI C6X DSP [12, 13]. Among them, [5, 6] and the C6X are programmable architectures, while [11] is a dedicated design. Chang et al. [5] is programmable at compile time. As discussed in Section 2, DCT coefficient parsing accounts for most of the work in bitstream parsing. Only the proposed architecture achieves single-cycle DCT coefficient decoding, which matches the performance of the dedicated state-machine implementation. Besides, the proposed design achieves high performance at a low operating frequency. So, the proposed design achieves the highest programmability with the fewest cycles required for DCT coefficient decoding and a small memory requirement.
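A rough cycle budget follows from the figures quoted in this paper (33 MHz clock, 720 × 576 frames at 30 fps, 8 Mbps). The short calculation below is our own back-of-the-envelope check, not a result reported in the tables; it assumes 16 × 16 macroblocks and ignores any stalls or idle cycles.

```python
CLOCK_HZ = 33_000_000          # operating frequency quoted above
WIDTH, HEIGHT, FPS = 720, 576, 30
BITRATE_BPS = 8_000_000        # 8 Mbps

mbs_per_frame = (WIDTH // 16) * (HEIGHT // 16)   # 16x16 macroblocks
mbs_per_second = mbs_per_frame * FPS
cycles_per_mb = CLOCK_HZ / mbs_per_second        # average budget per macroblock
cycles_per_bit = CLOCK_HZ / BITRATE_BPS          # average budget per coded bit

print(f"{mbs_per_frame} MBs/frame, {cycles_per_mb:.1f} cycles/MB, "
      f"{cycles_per_bit:.3f} cycles/bit")
```

With only a few cycles available per coded bit on average, spending an extra cycle per DCT coefficient on a branch consumes a noticeable share of the budget, which is consistent with the emphasis on single-cycle coefficient decoding.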
Conclusions
This paper presents an analysis of MPEG-4 video bitstream parsing and an efficient, flexible bitstream parsing processor. The analysis shows that the most critical part of bitstream parsing is DCT coefficient codeword decoding. We propose parsing approaches for DCT coefficients and for data partitioned bitstreams. Based on the analysis results and the proposed approaches, an efficient instruction set optimized for bitstream parsing is presented, and the processor architecture is proposed and implemented. It has been successfully integrated into an MPEG-4 video decoding system and achieves real-time decoding of bitstreams coded at 4CIF frame size, 30 fps, and 8 Mbps, which is the specification of MPEG-4 Advanced Simple Profile Level 5.
VLSI Signal Processing-Systems for
