Electric [4], the DVxpert chips from C-Cube Microsystems [5], and the MPEGE422 encoder chips from IBM [6].
Although ASICs generally offer the best speed performance, they are limited by their inflexible hardware structures. They also increase system cost by taking additional board area and requiring additional wiring to the memory and I/O subsystems. Due to the limitations of ASIC solutions and the growing interest in computationally intensive multimedia applications, many general purpose processors now have multimedia extensions. These extensions increase the speed of computations by supporting a single instruction multiple data (SIMD) mode of execution, where an operation is performed on multiple data elements concurrently. Examples of such architectural extensions are the VIS extension of Sun's UltraSPARC [7], the MAX-2 extension of Hewlett-Packard's PA-RISC [8], and the MMX extension of Intel's Pentium [9]. These enhancements greatly help make real-time video coding in software a reality.
In this paper, we propose two methods to increase the speed of video coding, making possible interactive realtime video applications on the Pentium MMX general purpose processor. In the first method, we propose platform independent efficient video coding algorithms that reduce the number of computations by exploiting statistical properties of low bit rate video coding. More specifically, we obtain a significant reduction in Discrete Cosine Transform (DCT) and quantization computations by predicting blocks where most or all quantized DCT coefficients are equal to zero. The number of inverse DCT computations is also reduced by utilizing the zero subblock dominant structure of coarsely quantized blocks. We employ predefined look-up tables to eliminate the quantization operations. Finally, we propose algorithms to reduce the computation demands of the sum of absolute differences (SAD) and half pixel motion vector search operations. The second method employed in this paper is processor-specific, where we map several SIMD (Single Instruction Multiple Data) oriented components of H.263/H.263+ baseline video coding onto Intel's MMX architecture. Such components include the DCT, interpolation, SAD, and half pixel motion vector search. All the algorithms and techniques proposed in this paper are implemented within our public-domain H.263/H.263+ encoder/decoder software [1] . The resulting H.263 video encoder implementation is approximately 2 times faster than our unoptimized public-domain encoder implementation. Moreover, the optimized video encoder can encode QCIF video sequences at 15 frames per second on a Pentium MMX 200 MHz processor while maintaining a high video reproduction quality.
The rest of the paper is organized as follows. Section II provides an overview of the H.263 and H.263+ low bit rate video coding standards. Section III describes the platform independent algorithms. In Section IV, a brief introduction to Intel's MMX architecture is given and the mapping of several computationally intensive algorithms onto MMX and associated performance results are presented. Finally, experimental results that illustrate the resulting speed performance tradeoffs and conclusions are given in Section V and Section VI, respectively.
II. H.263 AND H.263+ VIDEO CODING
H.263 and H.263+ are both ITU-T standards that are based on the former ITU-T H.261 standard. They employ a hybrid video coding method in which inter picture prediction is used to reduce temporal redundancies and transform coding of the motion compensated prediction error data is used to reduce spatial redundancies [10] [11] .
H.263 provides some new components and methods that improve the rate-distortion performance over H.261, including half pixel motion vector compensation, modified variable length coding, more efficient motion vector prediction, and modification of the quantizer for each macroblock [12] . The baseline coding of H.263+ is the same as that of H.263 [13] .
Here, we give a brief overview of the H.263/H.263+ baseline encoding process. The H.263 and H.263+ standards, like other video coding standards, define only the bit stream syntax and the decoding process. The precise definitions of compliant video encoding algorithms are given in the Test Models. Figure 1 shows a simplified block diagram of the H.263/H.263+ baseline encoder as defined in TMN 11 [14]. Each picture in the input video sequence is divided into macroblocks, each of which consists of four 8x8 luminance blocks and two 8x8 chrominance blocks, $C_b$ and $C_r$, as shown in Figure 2. First, an integer pixel motion vector (MV) is determined by performing motion estimation (ME) for the current 16x16 luminance block. Because of its low computational complexity and good performance, the Sum of Absolute Differences (SAD) is the most commonly used criterion for determining the best matching blocks in motion estimation. The SAD is the sum, over all pixel positions of the block, of the absolute difference between each pixel of the current block and the corresponding pixel of the candidate block in the previous picture.

After motion compensation, if the difference between the current block and the predicted one is large, the four luminance and the two chrominance blocks in the macroblock are intra coded by performing the 8x8 DCT, quantization, and entropy coding using Variable Length Codes (VLCs). If the difference between the current block and the predicted one is small, then a half pixel motion search is performed around the current integer pixel motion vector. In order to find the half pixel motion vector, bilinear interpolation, which is shown in Figure 3, is performed on the previous picture. After motion compensation using the half pixel motion vector, the difference between the current block and the predicted block is DCT transformed, and the resulting coefficients are quantized and VLC coded. Information bits for the motion vectors are also added to the bit stream.
As in Differential Pulse Code Modulation (DPCM), the difference picture is decoded, and the reconstructed picture is added to the current predicted picture in order to be used during the prediction of future pictures.
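As an illustration of the SAD criterion used for integer pixel motion estimation, the following Python sketch computes the SAD of two blocks and runs an exhaustive search over a small window. It is a minimal sketch only: the function names, the list-of-rows frame representation, and the exhaustive search strategy are assumptions made for this example and do not reproduce the encoder software discussed in this paper.

```python
def sad(cur, ref):
    """Sum of absolute differences between two equally sized blocks (lists of rows)."""
    return sum(abs(c - r)
               for cur_row, ref_row in zip(cur, ref)
               for c, r in zip(cur_row, ref_row))


def block(frame, top, left, size):
    """Extract a size x size block whose top-left corner is (top, left)."""
    return [row[left:left + size] for row in frame[top:top + size]]


def best_integer_mv(cur_frame, ref_frame, top, left, size, search):
    """Exhaustive integer pixel search in a +/-search window (hypothetical helper).

    Returns (best_sad, dx, dy); candidates falling outside the reference
    frame are skipped.
    """
    cur = block(cur_frame, top, left, size)
    height, width = len(ref_frame), len(ref_frame[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            y, x = top + dy, left + dx
            if 0 <= y <= height - size and 0 <= x <= width - size:
                cost = sad(cur, block(ref_frame, y, x, size))
                if best is None or cost < best[0]:
                    best = (cost, dx, dy)
    return best
```

For a 16x16 luminance block, `size` would be 16; the fast search algorithms discussed later visit far fewer candidate positions than this exhaustive loop.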
III. EFFICIENT LOW BIT RATE VIDEO CODING ALGORITHMS
There have been many research efforts aimed at reducing the complexity and computation time of video encoding and decoding. Most of these efforts have concentrated on developing efficient ways of performing motion estimation [15] [16] [17] [18] [19] and computing the DCT/IDCT [20] [21]. Moreover, Senda et al. proposed a method for fast half pixel motion estimation in [22] within an MPEG-2 coding framework. Froitzheim et al. [23], Lengwehasatit et al. [24], and McCanne et al. [25] suggested techniques to reduce the number of computations of the IDCT. Methods for reducing the DCT computation time using the sparseness of the quantized coefficients were suggested by Girod et al. [26], Yu et al. [27], and Lengwehasatit et al. [28].
As part of our H.263/H.263+ baseline encoder implementation, we employ a fast motion estimation (ME) algorithm that is described in the H.263+ Test Model [14] . Using this algorithm, only 10-15 16x16 SAD operations are required per macroblock, yet the resulting quality is almost the same as that of the full search algorithm [15] .
We also employ the fast DCT/IDCT algorithm proposed by Arai et al. [20], which requires 144 multiplication and 464 addition operations to compute the 2-D DCT/IDCT of an 8x8 block. This algorithm performs a small number of consecutive multiplications, which is important since the algorithm is also implemented in fixed point arithmetic.
Moreover, the algorithm has a scalable structure where 64 of the multiplications can be performed after the actual transform, and therefore, they can be combined with the quantization operation. This feature can be exploited when the H.263+ optional mode that employs quantization without a dead-zone is enabled [3] . Table 1 shows the relative computational costs of some of the H.263/H.263+ baseline encoder components. These results were derived by profiling our public-domain H.263/H.263+ software with the fast ME, fast DCT, and fast IDCT algorithms enabled and without using rate control. As shown in Table 1 , further increasing the computational efficiency of integer ME, half pixel ME, DCT, IDCT, and quantization modules would decrease the overall encoding time significantly. It is possible to reduce the number of computations associated with such modules by exploiting the statistical properties of slowly varying, low resolution video sequences. In this section, we present efficient algorithms for the computation of the DCT, IDCT, and SAD. We also present more efficient ways of performing quantization and half pixel motion estimation.
A. Zero Block Prediction Prior to DCT
The DCT is one of the most computationally intensive components of H.263/H.263+ baseline encoding. Although fast algorithms greatly reduce the number of computations needed for the DCT operation, it still requires 15% of the encoding time when the fast search algorithm described in [15] is used for motion estimation.
In a typical H.263/H.263+ application, the motion-compensated prediction difference blocks are coarsely quantized, resulting in zero blocks, i.e., blocks where all coefficients are equal to zero, more than 50% of the time. If such blocks can be predicted prior to computing the DCT and quantization, then the corresponding computations can be eliminated. In this work, four simple predictors were considered to detect the zero blocks. They are

$$F_1 = \sum_{i=0}^{7}\sum_{j=0}^{7} \left| X_{i,j} \right|, \qquad F_2 = \sum_{i=0}^{7}\sum_{j=0}^{7} X_{i,j}^{2},$$

$$F_3 = \frac{1}{64}\sum_{i=0}^{7}\sum_{j=0}^{7} \left| X_{i,j} - \bar{X} \right|, \qquad F_4 = \frac{1}{64}\sum_{i=0}^{7}\sum_{j=0}^{7} \left( X_{i,j} - \bar{X} \right)^{2},$$

where $F_1$ is the sum of absolute values, $F_2$ is the sum of squares, $F_3$ is the mean absolute difference, $F_4$ is the variance, $X_{i,j}$ is the value of the pixel located in row $i$ and column $j$, and $\bar{X}$ is the mean of the block. Figure 4 shows the distribution of the zero blocks as a function of each of the above predictors. An ideal prediction function would be one that separates zero blocks from non-zero blocks completely. As can be seen from the figure, the sum of absolute values ($F_1$) and the sum of squares ($F_2$) are better predictors in distinguishing zero blocks from non-zero ones. Because of its low computation demands and slightly better performance, the sum of absolute values ($F_1$) is selected as the predictor. More specifically, if $F_1$ is smaller than an experimentally determined threshold, then the DCT and quantization operations are skipped. The threshold depends on the quantizer value. Table 2 lists the statistically determined thresholds for each quantizer. The table also shows the percentage of blocks that result in zero blocks after DCT and quantization, along with the percentage of blocks that are predicted as zero blocks. These two percentages strongly depend on the input sequence, while the threshold is very stable for a given quantizer.
The predictor function $F_1$ must be applied to every block, incurring a constant computational overhead. Also, even when a block is predicted to be zero, it may still require additional operations, such as filling the array of coefficients with zeros. Thus, the total number of operations required for the DCT and quantization is equal to

$$Y + (1 - \beta)\,X + \beta\,Z,$$

where $X$ is the number of DCT and quantization computations, $Y$ is the number of computations required for predicting zero blocks, $Z$ is the number of computations required to process a zero block after it has been predicted, and $\beta$ is the fraction of blocks that are predicted to have all zero coefficients. The overall computation gain depends strongly on the relative values of $X$, $Y$, and $Z$. The parameters $X$, $Y$, and $Z$ depend, in turn, on the implementation of each function. For example, if a system implements the DCT function in a sub-optimal way, e.g., where MMX instructions are not supported by the processor, and the cost of predicting zero blocks is very small, e.g., where the processor has a dedicated instruction for this operation, then using the proposed DCT prediction method is clearly very advantageous. On the other hand, if the DCT algorithm is implemented in a very efficient way and runs on dedicated hardware, and predicting the zero blocks takes as much time as computing the DCT, the above prediction method is not very useful. Nevertheless, in general, computing the DCT takes much more time than predicting the zero blocks. Coupled with the fact that, in low bit rate coding, more than 50% of the blocks result in zero blocks after DCT and quantization, the proposed prediction method provides very significant speed improvements in most low bit rate video encoding systems.
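The zero block test and the cost accounting described above can be sketched in a few lines of Python. This is a minimal sketch under stated assumptions: the function names, the threshold passed by the caller, and the operation counts in the usage note are hypothetical and are not the values of Table 2 or of the profiled encoder.

```python
def f1(block):
    """Sum of absolute values of a prediction difference block (list of rows)."""
    return sum(abs(v) for row in block for v in row)


def is_predicted_zero(block, threshold):
    """Predict an all-zero quantized block when F1 falls below the threshold.

    The threshold is assumed to be looked up per quantizer value by the caller.
    """
    return f1(block) < threshold


def expected_ops(x, y, z, fraction_zero):
    """Average cost per block: every block pays the predictor cost y,
    predicted-zero blocks pay z, and all remaining blocks pay the full
    DCT and quantization cost x."""
    return y + (1.0 - fraction_zero) * x + fraction_zero * z
```

For example, with hypothetical costs of 1000 operations for DCT plus quantization, 64 for the predictor, and 64 for zero filling, a 50% zero block rate gives `expected_ops(1000, 64, 64, 0.5)`, i.e., 596 operations per block instead of 1000.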
B. IDCT
The IDCT is performed both at the encoder and the decoder. If the IDCT is computed only for the non-zero blocks, which are indicated by the Coded Block Pattern (CBP) in the H.263/H.263+ bit stream, approximately 70-80% of the IDCT and dequantization operations can be avoided. It is possible to further reduce the number of IDCT computations by exploiting the sparseness of the remaining non-zero blocks. In our implementation, we reduce the number of multiplications and additions necessary to compute an IDCT by detecting zero sub-blocks in these non-zero blocks. Figure 5 shows the number of computations necessary for our adaptation of the fast IDCT from Arai et al. [20] for different zero sub-block combinations. Depending on the input sequence and the quantizer value, approximately 80% of the blocks conform to the structure shown in Figure 5.c. By detecting only this structure, and assuming a multiplication costs the same number of cycles as an addition, the IDCT computation time can be reduced by approximately 40% (50% x 80%). However, there is an overhead involved in finding which of the sub-blocks are zero. This overhead can be eliminated by extracting the required information during variable length coding of the coefficients at the encoder side and variable length decoding of the bit stream at the decoder side. Since the blocks are scanned in a zigzag order prior to VLC, it is not possible to exactly identify the structure shown in Figure 5.c, but we can instead identify the structure shown in Figure 6. In a low bit rate video coding application, approximately 70% of the 8x8 blocks conform to this structure. This corresponds to a saving of 35% in the IDCT computation time.
C. Quantization
Quantization requires a significant number of computations, representing approximately 8% of the encoding time. Quantization of some of the macroblocks can be eliminated if zero block prediction prior to DCT is employed. In this section, we employ a simple technique for faster quantization which can be used when quantization is necessary.
In an H.263/H.263+ baseline coder, a clipping operation is performed before both quantization and dequantization, and there are 31 predefined quantization levels. These properties make it possible to replace the quantization arithmetic with predefined look-up tables, although table look-ups cause many irregular memory accesses.
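The look-up idea can be illustrated with the following Python sketch, which precomputes one table per quantizer value over the clipped coefficient range and then quantizes by indexing. The clipping range, the level limit of 127, and the TMN-style quantization rule used in `quantize_direct` are assumptions made for this illustration; the table layout of the actual implementation is not described in the text above.

```python
COEF_MIN, COEF_MAX = -2048, 2047   # assumed clipping range of DCT coefficients
MAX_LEVEL = 127                    # assumed limit on the quantized level


def clip(value, low, high):
    return max(low, min(high, value))


def quantize_direct(coef, qp, intra=False):
    """Arithmetic quantization (assumed TMN-style rule with an inter dead-zone)."""
    coef = clip(coef, COEF_MIN, COEF_MAX)
    mag = abs(coef)
    if intra:
        level = mag // (2 * qp)
    else:
        level = max(0, mag - qp // 2) // (2 * qp)
    level = min(level, MAX_LEVEL)
    return -level if coef < 0 else level


def build_table(qp, intra=False):
    """Precompute the quantized level of every clipped coefficient value."""
    return [quantize_direct(c, qp, intra) for c in range(COEF_MIN, COEF_MAX + 1)]


def quantize_lut(coef, table):
    """Quantize with one clip and one table access instead of arithmetic."""
    return table[clip(coef, COEF_MIN, COEF_MAX) - COEF_MIN]
```

Because the quantizer takes only 31 values and the coefficients are clipped, the complete set of tables is small enough to build once at start-up, e.g., `tables = {qp: build_table(qp) for qp in range(1, 32)}`.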
D. Partial SAD Computation
The sum of absolute difference (SAD) computation is the most commonly used measure to determine the best matching block during the motion estimation (ME) process. Since it is computationally very intensive, most of the fast ME algorithms aim at decreasing substantially the number of SAD computations.
A common SAD computation reduction technique is to compare the partially accumulated SAD value to the minimum known SAD. The computation is terminated if the partial SAD is greater than the minimum SAD. Using this technique with full search reduces the number of absolute difference operations to approximately one fourth for a 16x16 block. We further reduce the number of operations by predicting whether the SAD of a given block will exceed the minimum SAD ahead of time, thus eliminating the need to perform actual computations. For example, during the SAD computation for a 16x16 block, if after processing the second row of the block, the partially accumulated SAD is already half of the minimum SAD, it is then highly likely that the final SAD will exceed the minimum SAD. The underlying prediction is clearly a function of the SAD value computed up to the current row, the number of rows processed, and the total number of rows. A fairly effective prediction model can be given by
where $P$ is the predicted SAD, $S$ is the partially accumulated SAD, $I$ is the number of rows processed so far, $N$ is the dimension of the ME block, and the remaining parameter is the accuracy coefficient. The number of absolute difference computations per SAD for a 16x16 block and the change in peak signal to noise ratio (PSNR) for different values of the accuracy coefficient are presented in Table 3. It can be concluded from the table that setting the accuracy coefficient to 0.5 yields a good compromise between picture quality and computation time.
The above SAD prediction is not very effective when it is used in conjunction with the fast search ME algorithms, because, in such fast search algorithms, the minimum SAD and the partial SAD values are usually very close. In this case, trying to predict the SAD may even increase the computation time due to the computation overhead incurred by prediction.
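The common early termination technique mentioned at the beginning of this subsection, in which the accumulation stops once the partial SAD exceeds the minimum known SAD, can be sketched as follows. This Python sketch is an illustration only; the function name and the row-wise check are assumptions, and the prediction model with the accuracy coefficient is deliberately not reproduced here.

```python
def sad_with_early_exit(cur, ref, best_so_far):
    """Row-wise SAD that stops as soon as the partial sum exceeds best_so_far.

    Returns (sad, rows_processed); sad is None when the candidate was
    rejected before all rows were accumulated.
    """
    total = 0
    for rows_done, (cur_row, ref_row) in enumerate(zip(cur, ref), start=1):
        total += sum(abs(c - r) for c, r in zip(cur_row, ref_row))
        if total > best_so_far:
            return None, rows_done  # cannot beat the current minimum
    return total, len(cur)
```

Checking once per row rather than once per pixel keeps the comparison overhead small relative to the work saved on rejected candidates.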
E. Fast Half Pixel Motion Estimation Based on Approximations
Half pixel motion estimation is one of the main enhancements of H.263 over H.261. It provides, in most cases, more than a 2 dB PSNR increase in picture quality. Since half pixel ME requires bilinear interpolation of the reconstructed picture and computation of 8 SAD values for each macroblock, it increases both the encoding time and complexity. The eight SAD computations needed for half pixel ME constitute a rather insignificant computational load compared to the 256 SAD computations needed for full search ME. However, in comparison to the fast search algorithm described in the H.263+ Test Model [14], which uses 10 SAD computations on average, the number of half pixel SAD computations becomes significant.
1) A Simplified half pixel ME for MPEG-2 Encoder
There have already been some efforts to reduce or eliminate the computation time for half pixel ME. Senda et al. [22] proposed a simple approximation technique to remove the need for the interpolation and the SAD computations performed for half pixel ME within an MPEG-2 coding framework. They suggest computing each half pixel SAD from the surrounding integer pixel SADs, weighted by a set of coefficients. We adapt this technique to low bit rate video coding (<64 kbit/s) by re-optimizing these coefficients. The re-optimization is done by encoding a large variety of low bit rate video sequences and selecting the coefficients that result in the best estimates of the half pixel motion vectors. The optimized coefficients are given in Table 4.
2) Fast Search ME and Half Pixel ME via Approximation
To perform the half pixel ME approximation described in the previous section, all the surrounding integer pixel SADs must be known. However, in many fast search ME algorithms, not all the surrounding SADs are available. One solution is to simply compute the SADs for the missing locations. Alternatively, the approximation equations and the VH, V, and H coefficients can be modified so that only the available integer SADs are used.

The fast ME algorithm described in the H.263+ Test Model [14] computes the SADs at locations on the vertices of a diamond. The half pixel ME equations and the coefficients are modified and re-optimized accordingly for this algorithm.

Although this technique removes the need for interpolation and 8 half pixel SAD computations, the reproduction quality may not be acceptable in some applications. We propose three methods that improve video reproduction quality while still offering significant savings in computation time. Our motivation is essentially the same for all three methods: when a half pixel motion vector is estimated to have the minimum associated SAD by using the surrounding integer pixel SADs, a better estimate is likely to be found by performing a limited half pixel motion vector search around that motion vector. The three methods are described next:
1. Method 1: The best matching block is found among the four integer pixel locations that surround the integer motion vector. Then, block matching is performed for three half pixel motion vector candidates, which are determined by the location of the best matching block. For example, in Figure 7.b, if pixel B corresponds to the best matching block, then block matching is performed for each of the pixels 1, 2, and 3, and so on. The half pixel MV is determined by selecting the best matching block among these and the block that corresponds to the center pixel.

2. Method 2: The prediction used for determining the candidate half pixel motion vectors in the first method can be improved by using the two best matching blocks out of the four surrounding integer pixel locations. Subsequently, two or three half pixel block matchings are performed. For instance, in Figure 7.b, if pixels A and B correspond to the two best matching blocks, half pixel block matching is performed for each of the pixels 1, 2, and 4. If pixels B and D correspond to the two best matching blocks, half pixel block matching is performed for the pixels 2 and 7. Then, the best matching block among these pixels and the center pixel determines the half pixel motion vector.

3. Method 3: In this method, first, the eight half pixel SADs are computed by approximation using the four surrounding integer pixels and the center pixel as described earlier. Then, block matching is performed to find more accurate SADs for the N pixels that correspond to the smallest approximated SADs. Finally, the best matching of these N blocks and the block that corresponds to the center pixel are compared to determine the half pixel motion vector.
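The candidate selection of Methods 1 and 2 can be sketched as follows. Since Figure 7.b is not reproduced here, the geometry is an assumption inferred from the examples in the text: integer pixel A is taken to lie to the left of the center, B above, C to the right, and D below, and the half pixel positions 1 to 8 are taken to be numbered row by row around the center. Offsets are expressed as (dx, dy) in half pixel units, and all names are hypothetical.

```python
# Assumed layout: integer pixel neighbours of the centre, in half pixel units.
NEIGHBOURS = {"A": (-2, 0), "B": (0, -2), "C": (2, 0), "D": (0, 2)}


def _toward(name):
    """Half pixel offset lying between the centre and the named neighbour."""
    dx, dy = NEIGHBOURS[name]
    return dx // 2, dy // 2


def method1_candidates(best):
    """Three half pixel positions on the side of the best integer neighbour."""
    hx, hy = _toward(best)
    if hx == 0:
        return [(-1, hy), (0, hy), (1, hy)]
    return [(hx, -1), (hx, 0), (hx, 1)]


def method2_candidates(first, second):
    """Two or three half pixel positions chosen from the two best neighbours.

    Adjacent neighbours add the corner between them; opposite neighbours
    contribute only their own two positions.
    """
    a, b = _toward(first), _toward(second)
    candidates = [a, b]
    corner = (a[0] + b[0], a[1] + b[1])
    if corner != (0, 0):
        candidates.append(corner)
    return sorted(candidates)
```

Under this assumed numbering, `method1_candidates("B")` yields the three positions above the center (pixels 1, 2, and 3 in the text), and `method2_candidates("B", "D")` yields only the positions directly above and below it (pixels 2 and 7). In both methods the SAD of the center position is compared against the candidates afterwards.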
The tradeoffs between PSNR and speed improvement for each of the above methods are given in Table 5.

Intel's MMX technology is a SIMD extension, where one instruction performs the same operation on multiple data elements simultaneously. MMX introduces four new data types: three packed data types (packed byte, packed word, and packed doubleword) and a 64-bit quadword, as illustrated in Figure 9. Also, the MMX extension adds 57 new instructions to Intel's Pentium instruction set. Since most multimedia applications use 16-bit data, the new instruction set is optimized mostly for the 16-bit data types. MMX also features saturation arithmetic: in the case of an overflow or an underflow, the operation result saturates to the maximum or the minimum value that the data type can hold, instead of wrapping around and setting a carry flag. In order to retain backward compatibility, the MMX registers are mapped onto the floating point registers, and therefore, MMX instructions cannot be mixed with floating point instructions.
We next illustrate several mapping techniques for the computationally intensive SIMD structured components of H.263/H.263+ video coding to improve speed performance. These video coding components include the DCT, SAD, interpolation, data interleaving for half pixel motion estimation, and motion compensation functions.
A. SAD Computations for Motion Estimation
In H.263 and H.263+, SAD computations are performed on 16x16 or 8x8 blocks that have 8-bit unsigned data elements. Since the size of each data element is 8 bits, 8 data elements can be processed concurrently using MMX, yielding a significant speed performance improvement. Intel made available an MMX algorithm [31] that finds the absolute difference of two unsigned values. We adopt this algorithm to compute SADs of 16x16 and 8x8 blocks.
Computing one SAD for two 16x16 blocks with the MMX instructions requires 64 memory loads, 64 packed subtractions, 32 packed-OR, 32 unpack, and 63 addition operations. One drawback of using MMX instructions for SAD computations is that it is inefficient in partial SAD computations. Since four of the additions are saved in the same register, unpacking these values and adding them together to compute a partial SAD would cause an unacceptable delay. This is a disadvantage for full ME because most of the SAD computations could be terminated during the early stages by using a partial SAD computation technique. However, when fast search is used, efficient use of MMX is possible since performing partial SAD computation is usually not beneficial anyway.
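The operation counts above (two packed subtractions and one packed OR per group of eight pixels) are consistent with the well-known technique of forming the absolute difference of unsigned values from two saturating subtractions. The following Python sketch emulates that idea on lists of eight bytes; it is an illustration of the technique, not a transcription of the Intel routine [31], and the helper names simply mirror the PSUBUSB and POR instruction mnemonics.

```python
def psubusb(a, b):
    """Emulate a packed unsigned saturating byte subtraction: max(a - b, 0)."""
    return [max(x - y, 0) for x, y in zip(a, b)]


def por(a, b):
    """Emulate a packed bitwise OR."""
    return [x | y for x, y in zip(a, b)]


def packed_abs_diff(a, b):
    """|a - b| for unsigned bytes: one of the two saturated differences is
    always zero, so OR-ing them yields the absolute difference."""
    return por(psubusb(a, b), psubusb(b, a))


def sad_row_pair(a, b):
    """SAD contribution of eight pixels (the final sum is scalar here)."""
    return sum(packed_abs_diff(a, b))
```

In an actual MMX routine the eight byte differences would be unpacked to 16-bit words before accumulation so that the running sums cannot overflow, which is where the unpack and addition operations in the count above come from.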
B. DCT
We have chosen to map the DCT algorithm of Arai et al. [20] onto MMX because it requires only a small number of consecutive multiplications and has a regular structure that is suitable for implementation on a SIMD architecture. Because the MMX instructions can perform only integer arithmetic, the floating point algorithm was modified so that only fixed-point arithmetic is employed. The input to the DCT is an array of signed 8-bit data and the output is an array of 16-bit signed data. In the intermediate stages of the computations, 16-bit registers are used to attain the maximum speed at the cost of a very small loss in accuracy. Small DCT errors are usually negligible, because they do not propagate through the video frames, and such a loss in accuracy is insignificant compared to the loss caused by the coarse quantization process that is common in low bit rate video coding.
In our MMX DCT implementation, we perform the DCT on four 1x8 input vectors at a time. In this implementation approach, the input array must be transposed before applying the DCT. After transposition, the DCT for the upper four rows of the array is computed, followed by the computations for the remaining four rows. After another array transposition, the DCT is applied again on four rows at a time. Operations other than multiplications are performed on four 16-bit data elements simultaneously. The multiplications are performed using 32-bit data and the corresponding results are immediately downscaled and compressed into 16 bits. As illustrated in Figure 10, two multiplication operations, one for the low significant part and the other for the high significant part, are performed first. Then, the results are interleaved so that the higher significant part and the lower significant part of the same multiplication are in the same MMX register. 32-bit additions and arithmetic right shift operations are then executed for downscaling. Last, the four 32-bit results are packed into one 64-bit MMX register using signed saturation.
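The multiply, downscale, and saturate sequence described above can be illustrated in scalar form. The Python sketch below multiplies 16-bit samples by a fixed-point constant, rounds, shifts the wide product back down, and saturates the result to the signed 16-bit range. The number of fractional bits and the rounding addition are assumptions made for this example; the scaling used in the actual MMX routine is not stated in the text.

```python
FRAC_BITS = 14  # assumed number of fractional bits in the fixed-point constants


def to_fixed(value):
    """Convert a floating point constant to the assumed fixed-point format."""
    return int(round(value * (1 << FRAC_BITS)))


def saturate16(value):
    """Signed 16-bit saturation, as performed by a packing instruction."""
    return max(-32768, min(32767, value))


def mul_downscale(samples, coeff_fixed):
    """Multiply each sample by a fixed-point constant and rescale to 16 bits."""
    out = []
    for s in samples:
        product = s * coeff_fixed            # wide (32-bit) intermediate
        product += 1 << (FRAC_BITS - 1)      # rounding before the shift
        out.append(saturate16(product >> FRAC_BITS))  # arithmetic right shift
    return out
```

For instance, `mul_downscale([100, -100, 3], to_fixed(0.5))` returns `[50, -50, 2]`, and a product that exceeds the 16-bit range is clamped to 32767 rather than wrapping around.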
The MMX DCT implementation runs approximately four times faster than the floating point implementation on the same Pentium processor and three times faster than the optimized fixed point C implementation. Since a fixed point DCT algorithm is used in the MMX implementation, the rate-distortion performance of the encoder is affected as well. Table 6 shows that the resulting picture quality in terms of PSNR for different bit rates and sequences is almost unchanged for both DCT implementations.
C. IDCT
Since the H.263 and H.263+ standards are based on inter picture prediction, any error caused by the IDCT's low accuracy propagates through the subsequent video frames. ITU-T specifies an IDCT accuracy measurement procedure within the H.263 and H.263+ standards. According to our research, it is not possible to implement an H.263/H.263+ standard compliant IDCT function using 32-bit or lower precision. Even if the IDCT is implemented with 32-bit precision using MMX, only two data elements can be processed simultaneously and, because of the large computational overhead, a significant increase in speed cannot be expected. Intel has implemented and made publicly available an MMX IDCT routine for MPEG decoding. Their implementation uses 32-bit precision for multiplication and 16-bit precision for accumulation operations, and is approximately 4 times faster than the optimized C implementation. However, it does not meet the specifications of the ITU-T standard. The ITU-T IDCT accuracy specifications require that the DCT and the IDCT, with 64-bit floating point accuracy, be applied to a certain number of blocks generated by a pseudo random number generator routine, and that the peak error, the mean square error, and the mean error (pixel-wise and overall) of the IDCT under test be less than certain predefined thresholds. Also, if the reference IDCT produces a zero output, the IDCT under test should also produce a zero output. More details on these requirements can be found in [3]. Even if an implementation does not meet the standard, it may be beneficial to compare its computational accuracy to such thresholds. Such an IDCT could still be used in a codec implementation, provided that the encoder/decoder software runs on processors having the same architecture. However, in that case, the codec would not be truly standard compliant.
D. Interpolation
Interpolation is performed on every reconstructed I and P picture in order to perform motion estimation with half pixel accuracy. Figure 3 explains the bilinear interpolation performed as specified in the H.263 and H.263+ standards. Since the interpolation is performed on 16-bit data, it is possible to achieve a significant speed improvement by processing 4 units of data simultaneously. Figure 11 shows the inner core of the interpolation function that is implemented using MMX instructions. First, the four 1x8 pixel vectors are loaded into the MMX registers and the data to be processed is separated by using unpacking instructions. Next, packed additions are performed and the results are written into 16-bit registers. Last, the 8-bit results are packed together again and written back to memory. In the outer loop, this operation is repeated for the pixels in the row below until the bottom line of the picture is reached. Then, starting from the top, the same procedure is repeated for the next 8 pixels in the horizontal direction. The implemented MMX interpolation algorithm is three times faster than the optimized C implementation.
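For reference, the bilinear interpolation itself can be sketched in scalar form. The Python sketch below uses the rounding convention commonly cited for H.263 baseline half pixel prediction, namely (A+B+1)/2 for horizontal and vertical positions and (A+B+C+D+2)/4 for diagonal positions with integer truncation. Since Figure 3 is not reproduced here, this convention and the function name are stated as assumptions, and the code makes no attempt to mirror the MMX data flow.

```python
def interpolate_half_pel(frame):
    """Upsample a frame (list of rows) to half pixel resolution.

    An H x W input yields a (2H-1) x (2W-1) output in which even
    coordinates hold the original integer pixels.
    """
    h, w = len(frame), len(frame[0])
    out = [[0] * (2 * w - 1) for _ in range(2 * h - 1)]
    for y in range(2 * h - 1):
        for x in range(2 * w - 1):
            y0, x0 = y // 2, x // 2
            y1, x1 = y0 + (y % 2), x0 + (x % 2)
            a, b = frame[y0][x0], frame[y0][x1]
            c, d = frame[y1][x0], frame[y1][x1]
            if y % 2 == 0 and x % 2 == 0:
                out[y][x] = a                          # integer position
            elif y % 2 == 0:
                out[y][x] = (a + b + 1) // 2           # horizontal half pixel
            elif x % 2 == 0:
                out[y][x] = (a + c + 1) // 2           # vertical half pixel
            else:
                out[y][x] = (a + b + c + d + 2) // 4   # diagonal half pixel
    return out
```

The additions can exceed the 8-bit range before the division, which is why the intermediate sums need 16-bit storage in a packed implementation.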
E. Other MMX optimizations
We have optimized several other computationally intensive components of our implementation and achieved additional speed improvements. These components include data interleaving for half pixel ME, motion compensation, which consists of simple addition operations on reconstructed residual blocks, and memory copy operations.
V. EXPERIMENTAL RESULTS
In this section, we summarize the speed performance improvements of the individual platform independent algorithms and MMX implementations for each compute intensive module of the software. We also present the overall H.263/H.263+ baseline video encoding system improvements. The speed performance improvement is defined in terms of $E_{UOP}$, the execution time of the unoptimized module, and $E_{OP}$, the execution time of the optimized module.
A. Test Sets and Conditions
In our experiments we used a wide variety of test sequences, mostly in QCIF (176x144) resolution. Here, however, we present performance results for two sequences that, we believe, approximately represent the two ends of the spectrum of possible input video sequences. The first is Foreman, a high motion video sequence, and the second is Akiyo, a low motion news sequence. Additionally, the performance results obtained by combining all the optimizations are presented for the Coastguard and Container video sequences as well. All of these sequences are in QCIF resolution and consist of 300 frames. The simulations presented here are performed by skipping two frames for every encoded frame. Therefore, the resulting frame rate is 10 frames per second and the total number of encoded frames is 100. Each simulation was repeated several times and the numerical results were averaged.
B. Speed Performance Improvements of the Proposed Algorithms
Table 8 summarizes the speed performance improvement achieved for each module using our proposed platform independent efficient algorithms. In these simulations, the Akiyo sequence is encoded at 8 kbps and the Foreman sequence is encoded at 28 kbps on a Pentium 200 MHz computer. A rate control scheme described in TMN-11 [14] is employed. As can be seen from Table 8, the change in compression performance in terms of PSNR is negligible in all cases. Using zero block prediction prior to the DCT, it is possible to skip approximately 60-70% of the DCT and quantization operations at low bit rates. This corresponds to computational time savings of approximately 60% in our fixed point C implementation, where the DCT computation time is relatively large compared to the computation time of predicting zero blocks, and approximately 15% in the MMX implementation.
Moreover, using our proposed fast half pixel motion estimation technique, great computational gains are achieved for both half pixel ME and interpolation.
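To illustrate why predicting zero blocks is so much cheaper than computing the DCT, the following C fragment gives a minimal sketch of one possible sufficient test. It is not the predictor used in this paper. It assumes an orthonormally scaled 8x8 DCT computed in exact arithmetic, so that every coefficient magnitude is at most one quarter of the sum of absolute values of the input block, and a truncating quantizer with step size 2*QP. The function name `is_zero_block` is hypothetical, and a fixed point DCT with rounding would call for a slightly smaller threshold.

```c
#include <stdlib.h>

/* Hypothetical helper (not the paper's predictor): returns 1 when an 8x8
 * prediction-error block is guaranteed to quantize to all zeros, so that the
 * DCT and quantization can be skipped for it.
 * Assumptions: orthonormal 8x8 DCT in exact arithmetic, hence
 * |coefficient| <= sum_abs / 4, and a truncating quantizer with step 2*qp. */
int is_zero_block(const short block[64], int qp)
{
    int sum_abs = 0;
    for (int i = 0; i < 64; i++) {
        sum_abs += abs(block[i]);
        /* Once the bound sum_abs / 4 reaches 2*qp, a nonzero level is possible. */
        if (sum_abs >= 8 * qp)
            return 0;
    }
    return 1;
}
```

The test costs at most 64 additions and comparisons per block, and it exits early on blocks with large residual energy, which is consistent with the observation above that the prediction time is small relative to the DCT time in the C implementation.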
C. Speed Performance Improvements of MMX optimizations
The MMX performance improvement of each video coding module implementation is presented in Table 9. In the table, the integer pixel ME module includes the SAD computations and memory copy functions, and the half pixel ME module includes the SAD computations and the data interleaving functions. The simulations are performed on a Pentium 200 MHz computer with MMX support. The number of data elements that are processed simultaneously in one MMX register (N) is also given in Table 9. There is no compression performance degradation in going from the C implementation to an MMX implementation unless conversion from floating point arithmetic to fixed point arithmetic is required. In the DCT case, the PSNR difference is in the (-0.02, +0.02) range, which is negligible.
The MMX implementation of the SAD function runs approximately 4-5 times faster than the optimized C implementation, while maintaining the same rate-distortion performance. A 4:1 speed improvement is achieved in the data interleaving operations of half pixel ME by processing 8 data elements in one instruction. Our MMX implementation of motion compensation, which is performed on 32 bit data, runs 1.5 times faster, and memory copy operations run approximately 1.6 times faster, than the optimized C implementations. Note that the simulation results presented above reflect only the MMX optimizations. Unlike the platform independent algorithms, the MMX mapping algorithms yield speed improvements that do not depend on the sequence or the coding bit rate.
Table 10 summarizes the speed performance improvements when all of the platform independent algorithmic optimizations are enabled. The partial SAD computation technique is not employed when the fast ME is used. The table shows the number of encoded frames per second for the unoptimized code ($F_{UOP}$) and for the optimized code.
When all of the MMX optimization algorithms are employed, approximately a 100% performance improvement is achieved, as shown in Table 11.
This might seem unexpectedly low for the full ME case since 75% of all operations correspond to SAD computations and the MMX-optimized SAD implementation is 4 to 5 times faster than the C implementation. However, this can be justified in light of the discussion of Section IV.A, which indicates that the partial SAD computation technique cannot be implemented efficiently using MMX instructions. Therefore, all the SAD computations are performed until the last row is processed, whereas in the C implementation, SAD computation is usually terminated after processing the first several rows.
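The difference between the two SAD strategies can be sketched in C as follows. This is an illustrative sketch, not the code of our encoder: the function names, the 16x16 block size, and the use of `INT_MAX` to signal a rejected candidate are assumptions made for the example. The full version always visits all 16 rows, as a SIMD implementation does, while the partial version stops as soon as the running sum can no longer beat the best SAD found so far.

```c
#include <limits.h>
#include <stdlib.h>

/* Full 16x16 SAD: every row is processed, as in a SIMD implementation. */
int sad16_full(const unsigned char *cur, const unsigned char *ref, int stride)
{
    int sad = 0;
    for (int y = 0; y < 16; y++)
        for (int x = 0; x < 16; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
    return sad;
}

/* Partial 16x16 SAD: checks the running sum after each row and gives up
 * once it reaches best_sad. Returns INT_MAX for a rejected candidate
 * (a convention assumed for this sketch). */
int sad16_partial(const unsigned char *cur, const unsigned char *ref,
                  int stride, int best_sad)
{
    int sad = 0;
    for (int y = 0; y < 16; y++) {
        for (int x = 0; x < 16; x++)
            sad += abs(cur[y * stride + x] - ref[y * stride + x]);
        if (sad >= best_sad)
            return INT_MAX; /* candidate cannot improve on the current best */
    }
    return sad;
}
```

In a motion search, `best_sad` would be updated whenever a candidate returns a smaller value, so most poor candidates are rejected after a few rows. The per-row comparison and branch is what makes this scheme awkward to combine with packed SIMD accumulation.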
D. Overall Performance Improvements
Finally, Table 12 summarizes the overall speed performance improvements when all algorithmic and MMX optimizations are employed. The resulting encoder can encode more than 15 frames per second (fps) on a Pentium MMX 200 MHz processor, whereas the unoptimized version of our encoder can only encode 6-7 fps on the same processor. The change in picture quality as a result of our speed optimizations is very small, as shown in Table 13. Moreover, even when some of the low complexity H.263 or H.263+ modes are enabled, such as the PB-frames and modified quantization modes, it is still possible to achieve a similarly high encoding rate. Finally, note that the speed advantage of our proposed algorithms would have been significantly higher had the control components of our public-domain encoder been implemented efficiently.
VI. CONCLUSIONS
In this paper, we have proposed platform independent efficient coding and MMX mapping algorithms that substantially increase the speed of low bit rate (<64 kbps) video encoding. Our algorithms are implemented using our public-domain H.263/H.263+ video coding software [1]. The resulting H.263/H.263+ baseline encoder can encode more than 15 fps on a Pentium MMX 200 MHz computer while maintaining a high video reproduction quality.
List of Figures
Figure 11. The inner core of the MMX implementation of interpolation (PUNPCKLWD, PUNPCKHWD).
List of Tables
Table 8. The speed improvements achieved in each module by using the platform independent techniques.
Table 13. Change in PSNR when all the platform independent and MMX optimizations are combined (QP=18, no rate control).

