An application-specific processor for an H.264 decoder is designed in this research using a configurable embedded processor. The motion compensation, inverse integer transform, inverse quantization, and entropy decoding algorithms of the H.264 decoder software are optimized. We improve the performance of the processor with instruction-level hardware optimization tailored to the configurable embedded processor architecture. The optimized instructions for video processing can also be used in other video compression standards such as MPEG 1, 2, and 4. A significant performance improvement is achieved with high flexibility. Experimental results show that we could achieve 300% performance for the H.264 baseline profile level 2 decoder.
I. Introduction
Demand for multimedia communications in mobile and portable applications is growing. To realize multimedia communications, implementing a video compression standard is essential in any multimedia processing system-on-a-chip (SoC). There have recently been reports on the very large-scale integration (VLSI) implementation of MPEG-4 video. The emerging H.264, or MPEG-4 Part 10, standard is more efficient and can greatly reduce the bandwidth and storage requirements for multimedia data. The VLSI implementation of H.264 is a challenge, since an H.264 baseline decoder is approximately three times more complex than an H.263 baseline decoder [1].
With experience from the design of previous video compression standards such as MPEG 1, 2, and 4, we can reuse the designed IPs to reduce the design time. Implementation flexibility is an important concern for SoC designs. VLSI implementations can be categorized into three types: hardwired, digital-signal-processor-based, and hybrid. Since the traditional hardwired design is less flexible, the processor-based implementation is a preferred choice. To achieve higher performance with flexibility, the hybrid architecture has been proposed, where operation-intensive functions are implemented with hardwired blocks, while other functions of less complexity are implemented with software on an application-specific instruction processor [2], [3]. H.264 on OMAP Solution, which combined an ARM processor with TI's digital signal processor, implemented every function with software using an accelerated instruction set for multimedia processing while keeping the flexible software structure [4]. However, it is worthwhile to point out that OMAP is not designed solely as an embedded processor for video compression, so many of its instructions are unused in H.264 video processing. Its cost is too high to be attractive, so the advent of newly standardized video decoding/encoding such as H.264 has created the need for an application-specific embedded processor that accelerates the specific functions. In this paper, we describe the implementation of an H.264 video decoder using a configurable embedded processor.
Application Specific Processor Design for H.264 Decoder with a Configurable Embedded Processor
Jin Ho Han, Mi Young Lee, Younghwan Bae, and Hanjin Cho
This paper is organized as follows. In section II, we summarize the new processor design methodology using a configurable embedded processor. In section III, we give an overview of the H.264 decoder software. In section IV, we describe the instruction set and processor architecture specific to H.264 decoder application. Finally, we summarize the implementation results.
II. Design Methodology
The proposed processor is implemented using a configurable embedded processor, Xtensa, recently developed by Tensilica. Figure 1 shows the architecture of the Xtensa processor. It consists of a base instruction set architecture (ISA) that provides the basic instruction set, an extensible part that allows a user-defined instruction set to be added, and configurable and optional functions that adapt the processor architecture, such as interface options, memory subsystem options, and OS support. The Tensilica Instruction Extension (TIE) language is used to describe the user-defined instructions.
Figure 2 shows the five pipeline stages: instruction fetch, instruction decoding, execution, memory access, and write back. When a new instruction is added, new decode, pipe control, coprocessor register, and coprocessor ALU blocks are added to the processor data path, and the address generation, exception resolution, and write-back blocks are modified. Based on the processor architecture described above, we can generate the instruction set simulator, compiler, and register transfer level (RTL) code of a processor that includes the new instruction set. Using the simulator, compiler, and RTL code, we can then analyze the run cycles of the application software on the processor and estimate the die area and power consumption of the processor.
Because of this functionality, we can use the new design methodology shown in Fig. 3. First, the application software, programmed in C/C++, is compiled and analyzed using the compiler and simulator generated for a processor with only the basic instruction set. Next, a new instruction that can reduce the run cycles of the application software is described in TIE. After the compiler and simulator are regenerated for the processor with the extended instructions, the application software is compiled and analyzed again. This flow repeats until the processor performance is satisfactory for the application software.
This design methodology lets us trade off the performance and cost of the proposed instruction set. It also lets us estimate the exact number of run cycles of the application software.
III. Overview of H.264 Decoder
The H.264/AVC video coding standard has been introduced with significant enhancements in both video coding efficiency and flexibility over a variety of network domains. In a video coding layer (VCL), some of the important enhancements are the use of a small block-size (4 × 4) exact-match transform, adaptive in-loop de-blocking filter, and motion-prediction capability.
As shown in Fig. 4, after the data is received from the network abstraction layer (NAL), it is entropy decoded, inverse quantized, and inverse transformed, and the result is added to the data predicted from the previous frames according to the header information. The original block is then obtained after the de-blocking filter [2]. H.264 defines three profiles. The baseline profile is the simplest and supports intra- and inter-coding, as well as entropy coding with context-adaptive variable-length coding (CAVLC). An analysis of the distribution of time complexity among the major subsystems shows that, in order of average time consumed, the most expensive parts are loop filtering, interpolation, inverse transform, bit stream parsing, and entropy decoding.
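The reconstruction step described above, adding the decoded residual to the predicted data, can be sketched in C as follows. This is a minimal illustration rather than the decoder's actual code: the function names `clip_u8` and `reconstruct_4x4` and the row-major 16-element layout of a 4 × 4 block are assumptions made for the example.

```c
#include <stdint.h>

/* Clamp an intermediate value to the 8-bit sample range [0, 255]. */
static uint8_t clip_u8(int v)
{
    if (v < 0)
        return 0;
    if (v > 255)
        return 255;
    return (uint8_t)v;
}

/* Reconstruct one 4x4 block: predicted sample plus decoded residual,
 * clipped to 8 bits. The three arrays are hypothetical row-major 4x4
 * buffers; the real decoder's data layout is not given in the text. */
void reconstruct_4x4(const uint8_t pred[16], const int16_t resid[16],
                     uint8_t recon[16])
{
    for (int i = 0; i < 16; i++)
        recon[i] = clip_u8((int)pred[i] + (int)resid[i]);
}
```

The clipping matters because the sum of a prediction and a residual can fall outside the 8-bit sample range; the de-blocking filter would then operate on the clipped samples.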
IV. Processor Architecture
The processor architecture consists of many options in addition to the instruction set, such as OS support, memory map, cache policy, interrupts, and debug interface. Here we discuss the memory system and the instruction set architecture, on which the processor performance depends.
We previously proposed an embedded processor architecture, based on the MIPS basic instruction set and specific to MPEG-4 codec application software, using a SimpleScalar simulation environment [5]. The memory system for the H.264 decoder software has the same parameters as in [6] and [7] using the proposed instruction set. Table 1 shows the cache size and policy, memory interface width, and performance.
Research on instruction sets for application-specific processors has focused on reducing the run time of MPEG application algorithms, such as motion compensation/estimation, discrete cosine transform, variable-length decoding, and color space conversion [6], [8]. We propose an extended instruction set for the H.264 video compression standard, which was announced in 2004. First, we developed an H.264 video decoder for the baseline profile at level 2 in C. The video format is CIF at 15 frames/sec. We then analyzed the execution times of the functions of the H.264 video decoder.
Based on the profiling result with general example data, we noticed that motion compensation, entropy decoding, and 
Motion Compensation
Motion compensation reconstructs the current macroblock from a macroblock of the previous frame, which is located by a motion vector giving the displacement between the two. The difference value (residual) is inverse quantized and inverse integer transformed before being added to the prediction. The major arithmetic operations involved are multiply and accumulate operations. Motion compensation also performs interpolation operations for chrominance components. Table 2 shows the profiling results of the motion compensation function based on simulation cycles.
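To illustrate why multiply and accumulate operations dominate, the chrominance interpolation can be sketched as the eighth-sample bilinear filter defined by the H.264 standard. The function name `chroma_interp` and its argument layout are assumptions for this example; the text does not show how the decoder software organizes this computation.

```c
#include <stdint.h>

/* Bilinear chroma interpolation at eighth-sample precision.
 * a, b are the top-left and top-right reference samples; c, d are the
 * bottom-left and bottom-right ones. dx and dy are the fractional parts
 * of the chroma motion vector, each in the range 0..7.
 * The four weights always sum to 64, so the result fits in 8 bits. */
uint8_t chroma_interp(uint8_t a, uint8_t b, uint8_t c, uint8_t d,
                      int dx, int dy)
{
    int sum = (8 - dx) * (8 - dy) * a
            + dx       * (8 - dy) * b
            + (8 - dx) * dy       * c
            + dx       * dy       * d;
    return (uint8_t)((sum + 32) >> 6);   /* round and divide by 64 */
}
```

Each output sample costs several multiplications and additions on a base instruction set, which is the kind of inner loop a fused custom instruction could shorten.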
The main functions of motion compensation are get_block, mc_main, _mulsi3, and _divsi3. The get_block function calculates the block value by interpolating two 8-bit integer values. The mc_main function calculates the current block value by adding the reference block value to the value multiplied by the coefficient. The _mulsi3 and _divsi3 functions perform 32-bit signed multiplication and 32-bit signed division, respectively. The compiler inserts them because the base processor has no 32-bit multiply or divide instructions. Table 3 shows the proposed instructions for motion compensation.
Entropy Decoding
H.264 supports two different methods for the final entropy coding step. CAVLC is the standard method, using simple variable-length Huffman-like codes and codebooks; the other method, context-adaptive binary arithmetic coding (CABAC), is not part of the baseline profile. Table 4 shows the profiling results of the entropy decoding function based on simulation cycles.
The showbits function calculates the bit position in any frame. Each call takes few run cycles, but the function is called very often, so we propose instructions written for the showbits function. We also propose instructions for a mod operation and a 32-bit unsigned division operation to reduce the run cycles of _umodsi3 and _udivsi3, as well as instructions for code_from_bitstream and read_coeff_4×4, which perform coefficient calculations, as shown in Table 5.
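As a rough illustration of why a function like showbits is cheap per call yet called constantly, the following C sketch peeks at the next bits of a bitstream without consuming them. It assumes the common peek-then-skip convention used by variable-length decoders; the `bitreader` structure and the `show_bits` and `skip_bits` names are hypothetical and are not taken from the decoder described here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical bitstream reader state: a byte buffer and a bit cursor. */
typedef struct {
    const uint8_t *buf;   /* start of the coded data */
    size_t size;          /* buffer length in bytes */
    size_t bitpos;        /* index of the next unread bit, MSB first */
} bitreader;

/* Return the next n bits (1..32) without advancing the cursor.
 * Bits past the end of the buffer read as zero. */
uint32_t show_bits(const bitreader *br, unsigned n)
{
    assert(n >= 1 && n <= 32);
    uint32_t value = 0;
    for (unsigned i = 0; i < n; i++) {
        size_t pos = br->bitpos + i;
        size_t byte = pos >> 3;
        uint32_t bit = 0;
        if (byte < br->size)
            bit = (uint32_t)(br->buf[byte] >> (7u - (unsigned)(pos & 7u))) & 1u;
        value = (value << 1) | bit;
    }
    return value;
}

/* Advance the cursor by n bits once a code word has been matched. */
void skip_bits(bitreader *br, unsigned n)
{
    br->bitpos += n;
}
```

A CAVLC table lookup would typically call `show_bits`, match a code word, and then call `skip_bits` with the matched length, so the shift-and-mask work repeats for every decoded symbol.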
Inverse Quantization and Integer Transform
The basic algorithm of H.264 uses a separable transformation. The mode of operation is similar to that of JPEG and MPEG, but the transformation used is not an 8 × 8 discrete cosine transform (DCT) but a 4 × 4 integer transformation derived from the DCT. It can be computed using only additions, subtractions, and binary shifts. At the encoder, the prediction is subtracted from the actual video data, the resulting residual is transformed, and the coefficients of this transform are quantized, that is, divided by a constant integer number. The decoder reverses these steps with inverse quantization followed by the inverse integer transform. Table 6 shows the profiling results of the inverse quantization and integer transform (ITIQ) function based on simulation cycles.
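The claim that the transform needs only additions, subtractions, and shifts can be seen in a C sketch of the 4 × 4 inverse integer transform as specified by the H.264 standard. The function names `inv_butterfly` and `inverse_transform_4x4` are chosen for this example and do not correspond to the itrans function of the decoder software, whose source is not shown.

```c
/* One-dimensional 4-point inverse butterfly using only adds, subtracts,
 * and shifts. Right shifts of negative values are assumed to be
 * arithmetic, as on mainstream compilers. */
static void inv_butterfly(const int in[4], int out[4])
{
    int e0 = in[0] + in[2];
    int e1 = in[0] - in[2];
    int e2 = (in[1] >> 1) - in[3];
    int e3 = in[1] + (in[3] >> 1);
    out[0] = e0 + e3;
    out[1] = e1 + e2;
    out[2] = e1 - e2;
    out[3] = e0 - e3;
}

/* In-place 4x4 inverse integer transform: rows first, then columns,
 * followed by the final rounding step (x + 32) >> 6. */
void inverse_transform_4x4(int blk[4][4])
{
    int tmp[4][4];
    for (int r = 0; r < 4; r++)
        inv_butterfly(blk[r], tmp[r]);
    for (int c = 0; c < 4; c++) {
        int col[4], res[4];
        for (int r = 0; r < 4; r++)
            col[r] = tmp[r][c];
        inv_butterfly(col, res);
        for (int r = 0; r < 4; r++)
            blk[r][c] = (res[r] + 32) >> 6;
    }
}
```

Because the same butterfly runs eight times per block, it is a natural candidate for a single custom instruction operating on packed coefficients.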
The itrans and itiq_main functions perform the inverse integer transformation. The adopted instructions are the itrans instruction for the itrans function, the ih_luma instruction for the ih_lumadc function, and the ih_chromadc instruction for the ih_chromadc function. The iquant function contains many 32-bit multiplications. Table 7 shows the proposed instructions for ITIQ.
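The multiplications in inverse quantization, together with the division and mod operations on the quantization parameter, can be sketched in C using the rescaling factors of the H.264 standard. This is a simplified illustration that ignores the special handling of DC coefficients; the names `dequant_v`, `pos_class`, and `dequant_coeff` are assumptions and are not taken from the iquant function.

```c
#include <assert.h>

/* Rescaling factors from the H.264 standard, indexed by
 * [qp % 6][position class]. */
static const int dequant_v[6][3] = {
    { 10, 16, 13 }, { 11, 18, 14 }, { 13, 20, 16 },
    { 14, 23, 18 }, { 16, 25, 20 }, { 18, 29, 23 }
};

/* Position class of coefficient (row, col) in a 4x4 block:
 * 0 when both indices are even, 1 when both are odd, 2 otherwise. */
static int pos_class(int row, int col)
{
    if ((row & 1) == 0 && (col & 1) == 0)
        return 0;
    if ((row & 1) == 1 && (col & 1) == 1)
        return 1;
    return 2;
}

/* Rescale one quantized coefficient: level * V * 2^(qp / 6).
 * The power of two is applied by multiplication so that negative
 * levels never pass through a signed left shift. */
int dequant_coeff(int level, int qp, int row, int col)
{
    assert(qp >= 0 && qp <= 51);
    int scale = dequant_v[qp % 6][pos_class(row, col)];
    return level * scale * (1 << (qp / 6));
}
```

On a processor without hardware multiply, divide, or mod, each call would expand into library routines, which is consistent with the motivation for a 32-bit multiply instruction.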
V. Implementation Result
We implemented processors with various instruction sets to trade off area and power. Table 8 summarizes the comparison results for the processors.
Processor I includes only the basic instruction set. Processor II adds to processor I the instructions proposed for motion compensation and entropy decoding, as well as the iTRANS instruction. Processor III adds the MUL32 instruction to processor II. Although the gate count of processor III is increased, it achieves the greatest reduction in run cycles, so we finally implemented processor III. The processor runs the motion compensation, ITIQ, and entropy decoding (ENT) of an H.264 baseline@L2 decoder at 10 frames/sec with a 115.3 MHz clock frequency. Table 9 compares the software run cycles of ARM9TDMI and the proposed processor. The implemented processor is verified on an XT2000 board, an FPGA board dedicated to the Xtensa processor, as shown in Fig. 5. The application software is cross-compiled into program and data code on the host PC, and the compiled binary code is transferred to SDRAM through the debugging interface. We can monitor the results of the program through the serial interface on the host PC.
VI. Conclusion
We described an embedded processor design for an H.264 video decoder for wireless multimedia communication, based on a configurable embedded processor. We designed the processor architecture, which has a new instruction set and a configured memory system. The extended instruction set was designed from the profiling results and is specific to the H.264 video decoder application software. The proposed processor is implemented with the configurable processor Xtensa of Tensilica, Inc. Our improvement was achieved through a special design of instructions and the proper processor configuration. The overall improvement allows the software implementation of the motion compensation, ITIQ, and entropy decoding algorithms to run at over 10 fps for a CIF sequence with low power, low cost, and great software flexibility.
