In this paper, a baseline-profile decoder for H.264/AVC, the newest video coding standard, was implemented using Intel MMX instructions. Both control- and data-level parallel processing approaches were applied to the kernels of the baseline subsystems to utilize the SIMD (Single Instruction Multiple Data) instructions efficiently. The data-level parallel approach processes multiple pixels at a time to fully utilize the SIMD instructions. The data-level approach performs better even when loop unrolling is additionally applied to the control-level approach. The resulting implementations are also compared with the Intel Performance Primitives.
INTRODUCTION
SIMD extensions to general-purpose processors have been adopted pervasively to support multimedia applications, which usually demand a large number of fixed-point operations on multiple small data types. The most recent Intel architecture supports a word-length of up to 128 bits, which means that eight 16-bit sub-words can be processed by a single instruction. Increasing the total word-length further would not be a problem as long as it helps the speedup, since the number of gates available on a chip is almost unlimited these days. However, it has been noticed that the partitioned data-path architecture is not always very effective, partly because it requires a regular data structure to reduce the overhead of packing and unpacking. Most previous implementations using the partitioned data-path are based on the control-level parallel approach, where adjacent repetitive instructions are combined to reduce the number of cycles. In our approach, we use the data-level parallel approach, a multi-pixel processing method, to reduce the packing-unpacking overhead as well as to increase the degree of parallelism.
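As a minimal sketch of what "eight 16-bit sub-words per instruction" means in practice, the following C fragment adds two groups of eight 16-bit samples with a single SSE2 packed add. The helper name add8_i16 is hypothetical and is not taken from the decoder described here; a scalar fallback is included, as an assumption, for compilers that do not target SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = a[i] + b[i] for eight 16-bit samples.
 * With SSE2, the eight additions are issued as one packed instruction. */
void add8_i16(const int16_t *a, const int16_t *b, int16_t *out)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* 8 x 16-bit */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    _mm_storeu_si128((__m128i *)out, _mm_add_epi16(va, vb));
#else
    /* Scalar fallback: same result, one sample per iteration. */
    for (int i = 0; i < 8; ++i)
        out[i] = (int16_t)(a[i] + b[i]);
#endif
}
```

The unaligned load and store intrinsics are used so that the sketch makes no assumption about buffer alignment.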
H.264/AVC is the newest video coding standard of the ITU-T Video Coding Experts Group and the ISO/IEC Moving Picture Experts Group. The main goals of the standard are to achieve enhanced compression performance and to provide more network friendliness for efficient communications.
In this paper, an H.264/AVC baseline profile decoder was implemented on an Intel Pentium 4 processor using the MMX extensions. Saturation arithmetic instructions, which are very useful for signal processing applications, are also supported. The Intel SSE instructions provide the programmer with eight additional 128-bit registers (XMM0-XMM7), each of which holds four 32-bit single-precision floating-point numbers. The SSE2 instruction set adds 128-bit double-precision floating point and 128-bit packed byte integers, and provides some enhanced instructions such as cacheability control. The detailed features of the SSE extensions can be found in [4].
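To illustrate the saturation arithmetic mentioned above, the sketch below adds sixteen unsigned 8-bit pixels with saturation, so that a sum above 255 is clamped to 255 instead of wrapping around. The function name sat_add16_u8 is hypothetical, and the scalar branch is only an assumed fallback for builds without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: out[i] = min(a[i] + b[i], 255) for 16 pixels. */
void sat_add16_u8(const uint8_t *a, const uint8_t *b, uint8_t *out)
{
#if defined(__SSE2__)
    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* 16 x 8-bit */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    /* Packed unsigned add with saturation. */
    _mm_storeu_si128((__m128i *)out, _mm_adds_epu8(va, vb));
#else
    /* Scalar fallback with an explicit clamp. */
    for (int i = 0; i < 16; ++i) {
        int s = a[i] + b[i];
        out[i] = (uint8_t)(s > 255 ? 255 : s);
    }
#endif
}
```

Without saturation, each sum would need a separate compare-and-clamp step, which is why such instructions suit pixel processing.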
INTEL MMX

H.264 DECODER OVERVIEW
The H.264/AVC video coding standard introduces significant enhancements in both coding efficiency and flexibility over a variety of network domains [5]. In the video coding layer (VCL), some of the important enhancements are the use of a small block-size (4x4) exact-match transform, an adaptive in-loop deblocking filter, and enhanced motion-prediction capability. Figure 2 shows the block diagram of the H.264/AVC video decoder [6]. H.264 defines three profiles. The baseline profile is the simplest and supports intra- and inter-frame coding, and entropy coding with context-adaptive variable-length coding (CAVLC). The H.264 baseline decoder, the JM 7.2 C code, was compiled and initially profiled to determine how the execution time is distributed over each part, such as memory allocation, get block, integer transform, loop filter, variable length decoder (VLD), and I/O. Among the subsystems, interpolation, inverse transform, and loop filter are the most compute-intensive [7]. Our initial implementation shows that memory allocation consumes about 41% of the execution time, since the allocate-and-free operations are repeated at every frame. Thus, we revised the code so that memory is allocated just once at the beginning of program execution and freed after the execution completes. This simple modification reduces the memory allocation-and-free overhead to 19.5%. Table 1 shows the initial profiling results with only the dynamic memory allocation code revised.
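The allocate-once idea described above can be sketched as follows. This is not the JM code: the FrameBuf type and the framebuf_* function names are hypothetical, and a single luma plane stands in for the full set of per-frame buffers.

```c
#include <stdlib.h>
#include <string.h>

/* Hypothetical frame buffer that is allocated once and then reused. */
typedef struct {
    unsigned char *luma;
    size_t size;
} FrameBuf;

/* Called once at program start. Returns 0 on success, -1 on failure. */
int framebuf_init(FrameBuf *fb, int width, int height)
{
    fb->size = (size_t)width * (size_t)height;
    fb->luma = malloc(fb->size);
    return fb->luma ? 0 : -1;
}

/* Called per frame: clears the existing storage instead of
 * freeing and reallocating it. */
void framebuf_reset(FrameBuf *fb)
{
    memset(fb->luma, 0, fb->size);
}

/* Called once after decoding completes. */
void framebuf_free(FrameBuf *fb)
{
    free(fb->luma);
    fb->luma = NULL;
    fb->size = 0;
}
```

In such a scheme the per-frame cost is reduced to a memory clear (or nothing at all, if every sample is overwritten), while the allocator is invoked only at start-up and shutdown.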
SIMD OPTIMIZATION
In this paper, the key subsystems of H.264/AVC baseline decoder are implemented using SIMD instructions for speedup. The use of partitioned data-path requires some alignment of partitioned data because the allowed packing-unpacking patterns are limited [SI.
A straightforward approach is to combine adjacent, usually identical, operations or instructions for SIMD processing. This approach is often called "control-level parallelism based" and is useful for the automatic parallelization of inner loops. Another approach is to process multiple output samples concurrently in order to obtain the needed parallelism. This approach is similar to the outer-product method in matrix-vector multiplication, and is often called "data-level parallelism based" since multiple data are computed concurrently. The latter approach usually yields more parallelism, but requires a different data arrangement and program modification because most conventional programs are written to compute the output samples sequentially.
In this study, both methods are implemented and compared.
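The matrix-vector analogy can be made concrete with a small sketch. Both functions below compute y = A x for a row-major matrix; the names matvec_inner and matvec_outer are hypothetical. The first finishes one output at a time (the order in which most programs are written), while the second updates all outputs for each input element, which is the ordering that maps onto packed registers holding several outputs.

```c
#include <stddef.h>

/* Inner-product order: each output y[r] is completed before the next. */
void matvec_inner(const int *a, const int *x, int *y,
                  size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r) {
        int sum = 0;
        for (size_t c = 0; c < cols; ++c)
            sum += a[r * cols + c] * x[c];
        y[r] = sum;
    }
}

/* Outer-product order: for each input x[c], every output is updated.
 * The inner loop touches independent outputs, so it can be packed. */
void matvec_outer(const int *a, const int *x, int *y,
                  size_t rows, size_t cols)
{
    for (size_t r = 0; r < rows; ++r)
        y[r] = 0;
    for (size_t c = 0; c < cols; ++c)
        for (size_t r = 0; r < rows; ++r)
            y[r] += a[r * cols + c] * x[c];
}
```

Both orderings produce the same result; they differ in which loop carries the independent operations and in how the data must be laid out.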
Control-Level Parallel Method
The interpolation kernel employs 6-tap FIR filtering. Figure 3(a) shows the reference C-code implementation of the 6-tap FIR filter used in the quarter-pixel interpolation kernel. The SIMD-based implementation using the control-level parallel method is depicted in Fig. 3(b).
The procedure requires three MAC and three PACK operations and three memory accesses for each output pixel. 
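A minimal sketch of the control-level idea for a single output pixel is given below. It is not a reproduction of Fig. 3(b): the function name fir6_one_pixel is hypothetical, the taps (1, -5, 20, 20, -5, 1) with rounding offset 16 and a right shift by 5 are those of the H.264 half-sample luma filter, and the six multiplications of one output are combined into one packed multiply-add after the pixels are packed into 16-bit lanes. A scalar path is included as an assumed fallback for builds without SSE2.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Six filter taps, padded with zeros to fill eight 16-bit lanes. */
static const int16_t kTaps[8] = { 1, -5, 20, 20, -5, 1, 0, 0 };

/* Round, scale by 1/32, and clip to the 8-bit pixel range. */
static uint8_t round_clip(int sum)
{
    int v = sum + 16;
    if (v < 0)
        return 0;
    v >>= 5;
    return (uint8_t)(v > 255 ? 255 : v);
}

/* Hypothetical helper: filters six consecutive pixels starting at src
 * into one output pixel. */
uint8_t fir6_one_pixel(const uint8_t *src)
{
    int sum;
#if defined(__SSE2__)
    /* Packing step: widen the six 8-bit pixels to 16-bit lanes. */
    int16_t px[8] = { src[0], src[1], src[2], src[3],
                      src[4], src[5], 0, 0 };
    __m128i p = _mm_loadu_si128((const __m128i *)px);
    __m128i c = _mm_loadu_si128((const __m128i *)kTaps);
    /* One packed multiply-add yields four 32-bit pair sums. */
    __m128i prod = _mm_madd_epi16(p, c);
    int32_t part[4];
    _mm_storeu_si128((__m128i *)part, prod);
    sum = part[0] + part[1] + part[2];   /* part[3] is always zero */
#else
    sum = 0;
    for (int i = 0; i < 6; ++i)
        sum += kTaps[i] * src[i];
#endif
    return round_clip(sum);
}
```

The packing before the multiply-add and the horizontal reduction after it are repeated for every output pixel, which is the kind of overhead this method incurs.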
Data-Level Parallel Approach
This method computes multiple data at a time, so its latency is longer than that of the "control-level parallel method." Note that most programs do not compute multiple output samples concurrently, in order to save storage as well as reduce latency. However, this approach can be quite efficient for parallel processing in many cases because the packed data can be independent of one another. Figure 3( The data-level based approach shows performance similar to Intel IPP, mainly because integer instructions in the Pentium can be executed at twice the speed of MMX operations. In addition, 128-bit MMX operations are executed using a 64-bit data-path in the CPU we used. These architectural characteristics seem reasonable considering that Intel CPUs are mostly used for general-purpose, rather than multimedia embedded, applications.
However, since many demanding multimedia applications are being developed, we consider that Intel CPUs need some enhancements in the data-paths for MMX instructions.
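The multi-pixel processing of the data-level approach can be sketched as follows, under stated assumptions: the function name fir6_eight_pixels is hypothetical, the H.264 half-sample taps (1, -5, 20, 20, -5, 1) with rounding offset 16 and shift 5 are assumed, and SSE2 intrinsics are used with a scalar fallback. Each 16-bit lane holds a different output pixel, and each tap is broadcast to all lanes, so eight outputs are accumulated together.

```c
#include <stdint.h>
#if defined(__SSE2__)
#include <emmintrin.h>   /* SSE2 intrinsics */
#endif

/* Hypothetical helper: computes eight adjacent output pixels.
 * Reads src[0..12] and writes dst[0..7]. */
void fir6_eight_pixels(const uint8_t *src, uint8_t *dst)
{
    static const int16_t taps[6] = { 1, -5, 20, 20, -5, 1 };
#if defined(__SSE2__)
    const __m128i zero = _mm_setzero_si128();
    __m128i acc = _mm_set1_epi16(16);          /* rounding offset */
    for (int k = 0; k < 6; ++k) {
        /* Load eight pixels shifted by k and widen them to 16 bits. */
        __m128i px = _mm_loadl_epi64((const __m128i *)(src + k));
        px = _mm_unpacklo_epi8(px, zero);
        /* Lane i accumulates taps[k] * src[i + k]. */
        acc = _mm_add_epi16(acc,
                  _mm_mullo_epi16(px, _mm_set1_epi16(taps[k])));
    }
    acc = _mm_srai_epi16(acc, 5);              /* divide by 32 */
    /* Pack to 8 bits with unsigned saturation (clips to 0..255). */
    _mm_storel_epi64((__m128i *)dst, _mm_packus_epi16(acc, acc));
#else
    for (int i = 0; i < 8; ++i) {
        int v = 16;
        for (int k = 0; k < 6; ++k)
            v += taps[k] * src[i + k];
        v = v < 0 ? 0 : v >> 5;
        dst[i] = (uint8_t)(v > 255 ? 255 : v);
    }
#endif
}
```

In this arrangement the widening and the final pack are shared by eight outputs, and no horizontal reduction across lanes is needed. The intermediate sums stay within the 16-bit range for 8-bit input pixels, so the 16-bit lanes do not overflow.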
CONCLUDING REMARKS
In this paper, a baseline profile decoder for the newest video coding standard, H.264/AVC, is implemented using Intel SIMD instructions. Both control- and data-level based approaches were used and compared to the Intel Performance Primitives library. Although the data-level based approach does not outperform the IPP, there is a good opportunity to increase multimedia performance by enhancing the current ALU architecture, for example by providing 128-bit hardware or increasing the speed of the MMX data-path. Future work needs to focus on memory access optimizations, such as prefetching and cache miss reduction.
