A key enabling technology for the proliferation of multimedia PCs is the availability of fast video codecs, which are the basic building blocks of many new multimedia applications. Since most industrial video coding standards (e.g., MPEG1, MPEG2, H.261, H.263) only specify the decoder syntax, there is a lot of room for optimization in a practical implementation. When considering a specific hardware platform like the PC, the algorithmic optimization must be considered in tandem with the architecture of the PC. Specifically, an algorithm that is optimal in terms of the number of operations needed may not yield the fastest implementation on the PC. This is because special instructions are available which can perform several operations at once under special circumstances. In this work, we describe a fast implementation of an H.263 video encoder for the Pentium processor with MMX technology.
INTRODUCTION
Recent advances in the personal computer industry have provided the necessary computation power and storage required by many multimedia applications. These tremendous technological advances have enabled PCs to perform image/video compression and decompression efficiently in software only. Some advantages of implementing the video codec in software for the PC are the elimination of expensive hardware, the ease of upgrade through replacement of software modules, and the wide availability of PCs.
Video coding requires a tremendous amount of computation. Many fast algorithms have been proposed in the literature to ease the computational load of various components in a video codec. Typically these fast algorithms are proposed and compared with each other using the total number of operations as the criterion, assuming a general-purpose processor and without considering the target hardware platform. However, such comparisons can be misleading when we consider a software implementation on a specific hardware platform. This is because each microprocessor has its own strengths and weaknesses, which bias it toward certain operations. For example, some microprocessors have dedicated hardware to execute the multiply-accumulate operation in one cycle. It is then advantageous to arrange an algorithm such that the multiply-accumulate operation occurs frequently. Thus, we see that the design of a fast software-only video codec is highly dependent on the hardware platform. Each component of the video codec must be properly selected to take maximum advantage of the underlying hardware.
In this paper, we present a fast software implementation of an H.263 video encoder on the Intel Pentium processor with MMX technology, which powers a vast majority of the computers in the world. The optimization of the encoder is performed iteratively through profiling and recoding to speed up the inner loops. Traditional optimization techniques were used along with the MMX instructions to achieve speed-up. Optimization techniques such as removal of loop-invariant computation, strength reduction, loop jamming, loop unrolling, and table lookup were used. Loop unrolling was used often in tight loops with MMX instructions to achieve speed-up through software pipelining.
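As a small illustration of two of the techniques named above, the following C sketch computes a 16x16 sum of absolute differences twice: once in a straightforward form, and once with strength reduction (a pointer increment replaces the per-pixel `y * stride` multiplication) and the inner loop unrolled by four. The function names and the 16x16 block size are assumptions made for this example; it is not the encoder's actual code.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Straightforward version: the index y * stride + x is recomputed per pixel. */
unsigned sad16x16_plain(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    assert(stride >= 16);
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += (unsigned)abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Same result with strength reduction (row pointers are advanced by
   addition instead of multiplying) and the inner loop unrolled by four. */
unsigned sad16x16_unrolled(const uint8_t *cur, const uint8_t *ref, int stride)
{
    unsigned sad = 0;
    assert(stride >= 16);
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x += 4) {
            sad += (unsigned)abs(cur[x]     - ref[x]);
            sad += (unsigned)abs(cur[x + 1] - ref[x + 1]);
            sad += (unsigned)abs(cur[x + 2] - ref[x + 2]);
            sad += (unsigned)abs(cur[x + 3] - ref[x + 3]);
        }
        cur += stride;   /* strength reduction: no multiplication per row */
        ref += stride;
    }
    return sad;
}
```

Both functions return the same value for the same inputs; whether the second form is actually faster depends on the compiler and processor, which is why profiling guides the process.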
MMX TECHNOLOGY
MMX technology is an extension to the Intel Architecture whose aim is to improve the performance of multimedia and communications algorithms. MMX technology adds fifty-seven new instructions and eight new 64-bit registers. The MMX instruction set was designed by analyzing a broad range of software applications in the field of multimedia and communications. In the analysis of these applications from different domains, it was found that certain common characteristics exist for a majority of the core time-consuming code sequences. From these observations, it was found that a salient feature of many multimedia algorithms is the execution of the same set of operations on a large number of small data elements. Therefore, MMX technology adopted the SIMD (Single Instruction, Multiple Data) ... smaller packed data type to a bigger packed data type. In addition, the unpack instruction can perform an interleaved merge operation, which can be used efficiently to perform insertion, transposition, and other data manipulation operations.
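To make the interleaved merge concrete, here is a portable C model of a low-half byte unpack operating on 64-bit words. It is a sketch only: the function name `unpack_low_bytes` is hypothetical, the lane numbering (byte 0 is the least significant byte) is an assumption of this example, and the behavior is modeled on the MMX byte unpack instruction (PUNPCKLBW) rather than taken from the paper.

```c
#include <assert.h>
#include <stdint.h>

/* Interleave the four low bytes of a and b:
   result bytes (from the least significant end) are a0, b0, a1, b1, a2, b2, a3, b3. */
uint64_t unpack_low_bytes(uint64_t a, uint64_t b)
{
    uint64_t r = 0;
    for (int i = 0; i < 4; i++) {
        uint64_t a_byte = (a >> (8 * i)) & 0xFF;
        uint64_t b_byte = (b >> (8 * i)) & 0xFF;
        r |= a_byte << (16 * i);        /* even byte positions come from a */
        r |= b_byte << (16 * i + 8);    /* odd byte positions come from b  */
    }
    return r;
}

/* Two uses of the same operation. */
void unpack_example(void)
{
    /* Merging with zero widens four packed bytes into four 16-bit words. */
    assert(unpack_low_bytes(UINT64_C(0x04030201), 0)
           == UINT64_C(0x0004000300020001));
    /* Merging two data words interleaves their bytes. */
    assert(unpack_low_bytes(UINT64_C(0x04030201), UINT64_C(0xD4C3B2A1))
           == UINT64_C(0xD404C303B202A101));
}
```

Unpacking against a zero operand is one way to convert a smaller packed type to a bigger one; repeated unpacks at byte, word, and doubleword granularity are a common way to build a transposition.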
H.263 ENCODER IMPLEMENTATION DETAIL
The major computational blocks in an H.263 video encoder are motion estimation, motion compensation, DCT/IDCT, quantization/dequantization, entropy coding, and inter/intra coding. Among these functional blocks, motion estimation and DCT/IDCT are typically the most computationally intensive portions of the encoder. To get an idea of how the computational load is distributed among the functional blocks of a typical H.263 encoder, we encoded a video sequence using the ITU TMN H.263 video encoder provided by Telenor to obtain a profile of the encoding computational load. Intel's V-Tune software package, which is a visual optimization/profiling tool, was used to monitor the encoding process. Fig. 1 shows the distribution of CPU load obtained by the profiling. Indeed, we can see that motion estimation and the DCT/IDCT are the most time-consuming portions, with motion estimation occupying a majority of the CPU time.
H.263 uses block matching motion estimation and compensation to exploit the temporal correlation between adjacent frames. Various block matching algorithms have been proposed in the literature; basically they differ in the matching criterion, search strategy, or block size. The method that the TMN H.263 encoder employs is the full search block matching algorithm using the sum of absolute differences (SAD) as the matching criterion. This method guarantees a global minimum by exhaustively comparing all possible candidates in the search space. However, the complexity of such a search is prohibitively high, as we can see from ... These techniques can be divided into two categories, namely fast matching and fast search. In fast matching, different matching criteria that require fewer computations [6] than the sum of absolute differences (SAD) or the mean square error (MSE) are used. In fast search, the SAD or MSE criterion is typically still used, but the average number of points searched is smaller than the total number of points in the entire search space.
In our video encoder, we employed a fast search block matching scheme that is based on the three step search [5]. In this scheme, we start our search at the center of the search region. From the starting point, we search its eight surrounding neighbors to find the best match among all nine points. If the starting point is found to be the best match, we stop the process and declare it as the motion vector. Otherwise, we set the newly found best match as the new starting point and repeat the process. We note that the matching scores computed for the eight neighbors of a starting point might be needed again in the search around future starting points due to overlap. Therefore, the computed matching scores are stored so that they can be looked up instead of recomputed later if needed. Along with the search strategy, we tried several different matching criteria, including the MSE, the MAD, and the error variance.
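The search procedure described above can be sketched in C as follows. This is a minimal illustration under stated assumptions, not the encoder's implementation: the search range of ±15 pixels, the callback type `cost_fn`, the result struct `mv_result`, and the function names are all hypothetical, and the matching criterion is abstracted behind a callback so that any of the criteria mentioned in the text could be plugged in. The helper `toy_cost` is a stand-in cost surface used only to exercise the search.

```c
#include <assert.h>
#include <limits.h>
#include <stdlib.h>

#define RANGE 15              /* assumed search range of +/-15 pixels */
#define SPAN  (2 * RANGE + 1)

/* Matching criterion for a candidate displacement (dx, dy); lower is better. */
typedef unsigned (*cost_fn)(int dx, int dy, void *ctx);

typedef struct {
    int dx, dy;        /* best displacement found          */
    unsigned cost;     /* its matching score               */
    int evaluations;   /* number of calls made to cost()   */
} mv_result;

mv_result neighbor_search(cost_fn cost, void *ctx)
{
    static const int off[8][2] = {
        {-1, -1}, {0, -1}, {1, -1}, {-1, 0}, {1, 0}, {-1, 1}, {0, 1}, {1, 1}
    };
    unsigned cache[SPAN][SPAN];   /* UINT_MAX marks "not computed yet" */
    mv_result best;

    assert(cost != NULL);
    for (int y = 0; y < SPAN; y++)
        for (int x = 0; x < SPAN; x++)
            cache[y][x] = UINT_MAX;

    best.dx = 0;
    best.dy = 0;
    best.cost = cost(0, 0, ctx);
    best.evaluations = 1;
    cache[RANGE][RANGE] = best.cost;

    for (;;) {
        const int cx = best.dx, cy = best.dy;   /* current starting point */
        int moved = 0;
        for (int i = 0; i < 8; i++) {
            const int nx = cx + off[i][0], ny = cy + off[i][1];
            unsigned *slot;
            if (nx < -RANGE || nx > RANGE || ny < -RANGE || ny > RANGE)
                continue;                        /* outside the search region */
            slot = &cache[ny + RANGE][nx + RANGE];
            if (*slot == UINT_MAX) {             /* reuse stored scores */
                *slot = cost(nx, ny, ctx);
                best.evaluations++;
            }
            if (*slot < best.cost) {
                best.dx = nx;
                best.dy = ny;
                best.cost = *slot;
                moved = 1;
            }
        }
        if (!moved)          /* the starting point is the best of the nine */
            return best;
    }
}

/* Hypothetical cost surface for testing: city-block distance to a target
   displacement stored in ctx as int[2]. */
unsigned toy_cost(int dx, int dy, void *ctx)
{
    const int *target = (const int *)ctx;
    return (unsigned)(abs(dx - target[0]) + abs(dy - target[1]));
}
```

Because the score strictly decreases each time the starting point moves, the loop terminates. Like any local search, it can stop at a local minimum on a real matching surface, which is consistent with the trade-off between speed and compression efficiency noted later in the paper.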
In terms of computational complexity, the MAD matching criterion required the least amount of computation. However, the error variance matching criterion, which is the variance of the difference between the block and its prediction, resulted in the best prediction among the three. We have implemented the MAD and error variance matching measures using MMX instructions, which significantly improved the speed of these operations. Consider the absolute difference between corresponding elements of two blocks x and y, where each element of x and y is an eight-bit quantity. Since the same arithmetic operation is applied to each element independently of the other elements, we can take advantage of the inherent parallelism through MMX instructions. We note that the dynamic range of the difference between two eight-bit unsigned numbers is nine bits. Therefore, if we perform full-precision subtraction with x and y, we must work with 16-bit quantities, which reduces the parallelism to four instead of eight. However, the absolute difference between two 8-bit numbers x and y can be computed in 8-bit precision using saturating arithmetic as follows. We first compute x - y and y - x using unsigned saturating arithmetic, and then we logically OR the two differences together to form the absolute difference. If x equals y, both differences are zero and the computation produces the correct result. If x does not equal y, one of the two quantities x - y and y - x is the absolute difference while the other one is saturated to zero. Thus the correct result is obtained by logically ORing the two differences together. Therefore, we can compute the absolute difference using eight-bit precision, which allows us to work on eight elements at a time.
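The saturating-subtract-and-OR trick can be checked with a portable C model in which each loop iteration stands for one of the eight byte lanes. This is a sketch under assumptions: the function names are hypothetical, and the scalar helper `subs_u8` merely imitates what a packed unsigned saturating byte subtraction (such as the MMX PSUBUSB instruction) does to each lane; the final OR corresponds to a 64-bit logical OR.

```c
#include <assert.h>
#include <stdint.h>

/* Unsigned saturating subtraction of one byte lane: max(a - b, 0). */
static uint8_t subs_u8(uint8_t a, uint8_t b)
{
    return (uint8_t)(a > b ? a - b : 0);
}

/* |x - y| for eight byte lanes using only 8-bit saturating arithmetic. */
void abs_diff8(const uint8_t *x, const uint8_t *y, uint8_t *out)
{
    for (int lane = 0; lane < 8; lane++) {
        /* One of the two differences is the answer, the other saturates to 0. */
        out[lane] = (uint8_t)(subs_u8(x[lane], y[lane]) | subs_u8(y[lane], x[lane]));
    }
}

/* Sum of absolute differences over n bytes, eight lanes at a time. */
unsigned sad_bytes(const uint8_t *x, const uint8_t *y, int n)
{
    unsigned sad = 0;
    uint8_t d[8];
    assert(n % 8 == 0);
    for (int i = 0; i < n; i += 8) {
        abs_diff8(x + i, y + i, d);
        for (int lane = 0; lane < 8; lane++)
            sad += d[lane];   /* accumulate in a wider type to avoid overflow */
    }
    return sad;
}
```

Note that the accumulation of the absolute differences still needs a wider type than eight bits; only the difference step itself stays at eight-lane parallelism.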
In order to perform motion estimation, we must generate each candidate block through motion compensation and then compute the matching criterion. Typically, the motion vector search is done with full integer accuracy until the best match is found. Then, a half-pixel (i.e., 0.5) motion vector search is done centered on the best integer motion vector. Thus, at the end of the integer motion vector search, we must generate the eight candidate half-pixel blocks using bilinear interpolation to find the best half-pixel-accurate motion vector. This process involves averaging two pixels or four pixels to find the missing pixels, and special attention must be paid to ensure that proper rounding is performed. We can organize the computation into three cases according to the motion vector. In the first case, only the vertical component of the motion vector contains a half-pixel component. In the second case, only the horizontal component of the motion vector contains a half-pixel component. In the third case, both the horizontal and vertical components of the motion vector contain half-pixel components. For the first case, we need to average across adjacent rows to find the predicted value. For the second case, we need to average across adjacent columns to find the predicted value. For the third case, we need to average across both the adjacent rows and columns to find the predicted value.
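The three cases can be written down as a scalar reference routine, which is useful for checking any optimized version against. The sketch below assumes the usual H.263 rounding, (A + B + 1) / 2 for two pixels and (A + B + C + D + 2) / 4 for four pixels; the function name, parameter layout, and the use of flags `hx` and `hy` for the half-pixel components are choices made for this example, not taken from the paper.

```c
#include <assert.h>
#include <stdint.h>

/* Bilinear half-pixel prediction of a w x h block.
   ref points at the integer-pixel position of the block in the reference
   frame; hx and hy are 1 when the horizontal or vertical motion vector
   component has a half-pixel part. The caller must make sure that
   (w + hx) columns and (h + hy) rows are readable from ref. */
void predict_half_pel(const uint8_t *ref, int stride, int hx, int hy,
                      uint8_t *pred, int w, int h)
{
    assert(hx == 0 || hx == 1);
    assert(hy == 0 || hy == 1);
    for (int y = 0; y < h; y++) {
        for (int x = 0; x < w; x++) {
            const uint8_t *p = ref + y * stride + x;
            const int a = p[0];                  /* current pixel        */
            const int b = p[hx];                 /* right neighbor       */
            const int c = p[hy * stride];        /* neighbor below       */
            const int d = p[hy * stride + hx];   /* diagonal neighbor    */
            int v;
            if (hx && hy)
                v = (a + b + c + d + 2) >> 2;    /* case 3: rows and columns  */
            else if (hx)
                v = (a + b + 1) >> 1;            /* case 2: adjacent columns  */
            else if (hy)
                v = (a + c + 1) >> 1;            /* case 1: adjacent rows     */
            else
                v = a;                           /* integer position          */
            pred[y * w + x] = (uint8_t)v;
        }
    }
}
```

For example, with a 3x3 reference patch and a 2x2 block, setting only `hy` averages each pixel with the one below it, and setting both flags averages each 2x2 neighborhood.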
We take advantage of MMX technology to perform the bilinear interpolation in the motion compensation process. Let us consider the first case, where we average across the rows to obtain the predicted value. Ideally, we want to exploit the parallelism by averaging eight pixels from two adjacent rows at a time. However, we first note that the additions cannot be performed on the eight byte elements in parallel because of possible overflow. Furthermore, MMX technology does not support parallel shifts on byte elements (the smallest element size it supports for shifts is the word). Thus, in a straightforward implementation, we would have to convert each pixel from an eight-bit quantity to a sixteen-bit quantity, which reduces the parallelism from eight to four.
Fortunately, there is another way to perform this operation that preserves the parallelism of eight and achieves proper rounding at the same time. Let us consider the case of averaging two integers X1 and X2 to form the rounded average Y. Suppose we simply perform the following operation to form Y1: Y1 = (X1 >> 1) + (X2 >> 1), where >> indicates a right shift. Comparing Y and Y1 reveals that the following relationship holds: Y = (X1 >> 1) + (X2 >> 1) + Z, where Z is the logical OR of the least significant bits of X1 and X2. Based on this observation, we can average two arrays of eight pixels in the following way to preserve the maximum parallelism of eight. First, we construct a new array whose elements contain the logical OR of the least significant bits of the corresponding elements of the two arrays, using the 64-bit logical OR and 64-bit logical AND instructions provided by MMX technology. In essence, this step generates an array Z. Next, we need to perform the shift operation on each byte element of the two input arrays. As we have pointed out, we cannot perform a parallel shift on byte elements directly. However, this can be done in two steps since we are working on eight bytes at a time. First, we zero out the least significant bit of each byte element in the two input arrays. Then, we simply regard the eight bytes as one 64-bit quantity and perform a 64-bit logical right shift by one to obtain the desired result. Afterwards, the three arrays are added in parallel to get the final result. The averaging process for the second case can be done in the same way by first transposing the block of data. Furthermore, the third case can be computed in a similar manner by separating the computation into the two least significant bits and the rest.
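The byte-parallel averaging steps above can be modeled in portable C by treating a `uint64_t` as eight packed byte lanes. This is a sketch under assumptions rather than the paper's MMX code: the function and macro names are hypothetical, ordinary 64-bit integer operations stand in for the MMX logical, shift, and packed-add instructions, and the four-pixel routine is only one possible reading of "separating the computation into the two least significant bits and the rest", assuming the rounding (A + B + C + D + 2) / 4.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define LSB_MASK   UINT64_C(0x0101010101010101)  /* bit 0 of every byte      */
#define HIGH7_MASK UINT64_C(0xFEFEFEFEFEFEFEFE)  /* bits 1..7 of every byte  */
#define LOW2_MASK  UINT64_C(0x0303030303030303)  /* bits 0..1 of every byte  */
#define HIGH6_MASK UINT64_C(0xFCFCFCFCFCFCFCFC)  /* bits 2..7 of every byte  */
#define ROUND2     UINT64_C(0x0202020202020202)  /* +2 rounding in each byte */

/* Per byte lane: (x1 + x2 + 1) >> 1, computed without widening. */
uint64_t avg8_round(uint64_t x1, uint64_t x2)
{
    uint64_t z  = (x1 | x2) & LSB_MASK;       /* array Z: OR of the LSBs        */
    uint64_t h1 = (x1 & HIGH7_MASK) >> 1;     /* clear LSBs, then 64-bit shift  */
    uint64_t h2 = (x2 & HIGH7_MASK) >> 1;
    /* Each lane holds at most 127 + 127 + 1 = 255, so no carry crosses lanes. */
    return h1 + h2 + z;
}

/* Per byte lane: (a + b + c + d + 2) >> 2, split into low 2 bits and the rest. */
uint64_t avg8x4_round(uint64_t a, uint64_t b, uint64_t c, uint64_t d)
{
    /* Low parts sum to at most 4 * 3 + 2 = 14 per lane, which fits in a byte. */
    uint64_t low  = (a & LOW2_MASK) + (b & LOW2_MASK)
                  + (c & LOW2_MASK) + (d & LOW2_MASK) + ROUND2;
    uint64_t high = ((a & HIGH6_MASK) >> 2) + ((b & HIGH6_MASK) >> 2)
                  + ((c & HIGH6_MASK) >> 2) + ((d & HIGH6_MASK) >> 2);
    /* The mask removes bits shifted in from the neighboring lane. */
    return high + ((low >> 2) & LOW2_MASK);
}

/* First case from the text: average two adjacent rows, eight pixels at a time. */
void average_rows(const uint8_t *row0, const uint8_t *row1, uint8_t *out, int n)
{
    assert(n % 8 == 0);
    for (int i = 0; i < n; i += 8) {
        uint64_t a, b, r;
        memcpy(&a, row0 + i, 8);
        memcpy(&b, row1 + i, 8);
        r = avg8_round(a, b);
        memcpy(out + i, &r, 8);
    }
}
```

The identity holds because X1 + X2 + 1 = 2 * ((X1 >> 1) + (X2 >> 1)) + (x1_lsb + x2_lsb + 1), and the last term divided by two (rounding down) equals the OR of the two least significant bits. Since the lanes are independent and loads and stores are symmetric, the byte order of the machine does not affect the result.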
The H.263 encoder uses the DCT to reduce the spatial redundancy of the video sequence. The DCT is popular in image compression because it achieves good energy compaction and has many fast algorithms available. ... for the same platform, which is about 10 times faster than the TMN encoder. The distribution of the CPU load for our optimized H.263 encoder is shown in Fig. 2.
As we can see, the optimized encoder is fast enough that the CPU can be shared with other processes while still obtaining a good frame rate. The drawback of our optimized encoder is lowered compression efficiency. This is because we traded compression efficiency for reduced computational complexity.
CONCLUSION
In this paper, we considered the problem of software optimization of video codecs on the Pentium with MMX plat- 
