I. INTRODUCTION

H.263 is an international standard optimized for compressing video at low bit rates. It supports efficient transmission of digital video over narrow-band telecommunication channels. Several research results on the implementation [1], [20] and improvement [4], [12], [15], [16], [21] of the H.263 video codec have been reported. The complexity of H.263 video coding makes it seemingly impossible to accomplish real-time video coding without special-purpose hardware, such as function-specific multimedia processors, parallel digital signal processing (DSP) systems, or programmable single-component mixed-media coprocessors [3]. However, a dedicated hardware implementation is not flexible enough to incorporate new algorithms and can become obsolete. Real-time performance using software, on the other hand, has only been achieved on a multiprocessor system [1]. In this work, our aim is to build a software-only real-time H.263 encoder on a general-purpose single-processor computer such as a PC or a workstation. Optimization of the codec means optimization at all of the implementation phases, including algorithmic enhancements, compiler and code optimization, and exploitation of certain architectural features of the machine. We use efficient algorithms for the various functional modules of the H.263 video encoder, such as motion estimation (ME), discrete cosine transform (DCT), and inverse DCT (IDCT), and elevate their performance through efficient implementation. We use compiler optimizations in order to exploit sophisticated scheduling algorithms that redistribute the tasks for fast processing. Performance of the implementation is enhanced by using a simplified model of floating-point arithmetic, loop parallelization, common subexpression elimination, copy propagation, automatic register allocation, tracing of the effects of pointer assignments, etc. Our code optimization includes loop unrolling, which decreases the number of iterations, and data type optimization (DTO), which chooses suitable data types for the variables in the program's critical path so as to yield the most efficient performance of basic arithmetic operations. In addition, we remove all possible redundant operations.
I. Ahmad is with the Department of Computer Science, Hong Kong University of Science and Technology, Clear Water Bay, Kowloon, Hong Kong (e-mail: iahmad@cs.ust.hk).
Publisher Item Identifier S 1051-8215(01)06745-3.
In order to exploit the architectural features of the machine, we use low-level machine primitives that extend the core instruction sets in order to support multimedia data. Extended instruction sets in existing microprocessors exploit low-level parallelism in order to enhance the performance of applications with low-precision data. Video coding deals with data streams that are regular and have independent control flow. Thus, data-level parallelism can be exploited by introducing additional logic that partitions a higher precision data path so that multiple pieces of packed lower precision data are processed with a single instruction. Typical examples of such multimedia-capable general-purpose processors include Intel's Multi Media eXtension (MMX) [17] and Sun Microsystems' Visual Instruction Set (VIS) [22]. Using these extended multimedia instruction sets, we accelerate the computation in a single instruction stream multiple data stream (SIMD) fashion, increase the utilization of the available registers in the processor, and remove register contentions between data and control variables.
Extensive benchmarking is carried out on a 167-MHz Sun UltraSPARC-1 workstation, a 233-MHz Pentium II PC, and a 600-MHz Pentium III PC to study the performance of the encoder. Based on the benchmarking results, suggestions are made for choosing the optimum coding options. The benchmarking considers various aspects of our implementation and determines the effect of each type of optimization for each coding mode of H.263. Our results indicate the tradeoffs between quality and complexity, and also provide an interesting comparison between the workstation and the PCs. The encoder achieves frame-encoding speeds of up to 45.68 frames/s on the PCs and 12.17 frames/s on the workstation for Quarter Common Intermediate Format (QCIF) video with high perceptual quality at reasonable bit rates, which is sufficient for most general switched telephone network (GSTN)-based video-telephony applications. As the speed of PCs increases further, video coding will become an integral part of their resources and a useful commodity.
The rest of the paper is organized as follows. Section II gives an overview of the H.263 video coding standard, Section III describes the extended instruction sets for multimedia enhancements, and Section IV gives a discussion of various optimizations. Experimental results are presented in Section V.
II. OVERVIEW OF THE H.263 VIDEO-CODING STANDARD
H.263 [10] , defined by ITU-T, is aimed at low-bit-rate video coding, with the objective to provide significantly better picture quality than its predecessor H.261 [9] . Conceptually, H.263 is network independent and can be used for a wide range of applications, but its target applications are visual telephony and multimedia on low bit-rate networks like the GSTN, integrated services digital network (ISDN), and wireless networks. Some of the important considerations of H.263 are small overhead, low complexity resulting in low cost, interoperability with other existing video-communication standards (e.g., H.261, H.320), robustness to channel errors, quality-of-service parameters, etc. Based on these considerations, an efficient coding scheme has been developed which gives flexibility to manufacturers to make a tradeoff between picture quality and complexity.
The generalized H.263 source encoder is shown in Fig. 1. H.263 uses a hybrid of interpicture prediction, which exploits temporal redundancy, and transform coding of the residual prediction error signal, which reduces spatial redundancy. Although H.263 is closely related to H.261, it provides the same subjective image quality at less than half the bit rate [5].
The transform coding is done by discrete cosine transform (DCT). The transformed signal is then quantized with a scalar quantizer, and the resulting symbols are variable-length coded and transmitted. At the decoder, the received signal is inverse quantized, and subsequently, inverse transformed to reconstruct the prediction error signal, which is added to the prediction, thus creating the reconstructed picture. The reconstructed picture is stored in a frame buffer and can serve as the reference picture for the prediction of the next picture. The encoder consists of an embedded decoder, where the same decoding operation is performed so that both the encoder and the decoder have the same reconstructed picture.
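The closed-loop structure described above, in which the encoder contains its own decoder, can be illustrated with a minimal sketch. It is not the H.263 algorithm: the transform and motion compensation are omitted, the quantizer is a plain uniform quantizer with a hypothetical step size, the first picture is predicted from a constant gray value, and all function names are assumptions made for illustration.

```python
def quantize(residual, step):
    # Uniform scalar quantizer with rounding to the nearest level.
    levels = []
    for r in residual:
        magnitude = (abs(r) + step // 2) // step
        levels.append(magnitude if r >= 0 else -magnitude)
    return levels


def dequantize(levels, step):
    return [level * step for level in levels]


def reconstruct(prediction, levels, step):
    # Shared by encoder and decoder: prediction + dequantized error, clipped to 8 bits.
    error = dequantize(levels, step)
    return [max(0, min(255, p + e)) for p, e in zip(prediction, error)]


def encode_sequence(frames, step):
    # The reference is always the RECONSTRUCTED picture, never the original.
    reference = [128] * len(frames[0])
    symbols, reconstructions = [], []
    for frame in frames:
        residual = [f - p for f, p in zip(frame, reference)]
        levels = quantize(residual, step)
        reference = reconstruct(reference, levels, step)  # embedded decoder
        symbols.append(levels)
        reconstructions.append(reference)
    return symbols, reconstructions


def decode_sequence(symbols, step):
    reference = [128] * len(symbols[0])
    reconstructions = []
    for levels in symbols:
        reference = reconstruct(reference, levels, step)
        reconstructions.append(reference)
    return reconstructions
```

Because the encoder predicts from its own reconstruction, the decoder output matches the encoder's reference exactly and quantization errors do not accumulate from picture to picture.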
A picture is divided into macroblocks, since such division results in more efficient coding. Each macroblock consists of four luminance blocks and two spatially aligned color difference blocks. Each of these blocks is of size 8 × 8 pixels. One or more macroblock rows are combined into a group of blocks (GOB) to enable quick resynchronization after transmission errors. The GOB structure is simpler than that adopted in H.261. The optional GOB headers may or may not be used, depending on the tradeoff between error resilience and coding efficiency [19].
For improved interpicture prediction, the H.263 decoder has the block-motion compensation capability, while its use in the encoder is optional. Using block-motion compensation, interpicture prediction can be improved when the prediction blocks can be taken from different positions in the previous picture. One motion vector is transmitted per macroblock so that the simple translational motion can be compensated for. Half-pixel precision is used for motion compensation, as opposed to H.261, where full-pixel precision and a loop filter are used. Therefore, the visual quality is better compared to H.261 [5] . The motion-vector symbols are transmitted to the decoder after variable-length coding. The bit rate of the coded video may be controlled by preprocessing or by varying the following encoder parameters: quantizer scale size, mode selections, and picture rate.
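Half-pixel precision means that prediction samples may lie between integer pixel positions and are obtained by bilinear interpolation. The following sketch assumes the commonly cited rounding rule (add 1 before halving for the two-sample case, add 2 before dividing by four for the four-sample case); the exact rule should be checked against the standard text. The function name and the half-pixel coordinate convention are hypothetical.

```python
def half_pel_sample(ref, x2, y2):
    """Return the prediction sample at (x2 / 2, y2 / 2).

    ref is a list of rows of integer pixel values; x2 and y2 are given in
    half-pixel units, so odd values address positions between pixels.
    """
    x, y = x2 >> 1, y2 >> 1          # integer pixel position
    frac_x, frac_y = x2 & 1, y2 & 1  # 1 if the position is a half-pixel one
    a = ref[y][x]
    if not frac_x and not frac_y:
        return a
    if frac_x and not frac_y:
        return (a + ref[y][x + 1] + 1) >> 1
    if not frac_x and frac_y:
        return (a + ref[y + 1][x] + 1) >> 1
    return (a + ref[y][x + 1] + ref[y + 1][x] + ref[y + 1][x + 1] + 2) >> 2
```

A motion vector with a half-pixel component therefore costs up to four loads and three additions per predicted sample, which is one reason interpolation is a natural candidate for packed arithmetic.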
Further to the core coding algorithm described above, H.263 includes four negotiable coding options: 1) unrestricted motion vectors (UMVs); 2) advanced prediction; 3) PB-frames; and 4) syntax-based arithmetic coding (SAC). The first three options are used to improve interpicture prediction. The fourth is related to lossless coding of the symbols to be transmitted, which may be used instead of Huffman coding. These coding options increase the complexity of the encoder, but improve the picture quality, thereby allowing a tradeoff between picture quality and complexity [19] .
The source coder can operate on one of the five standardized picture formats: 1) sub-QCIF (128 × 96); 2) QCIF (176 × 144); 3) CIF (352 × 288); 4) 4CIF (704 × 576); and 5) 16CIF (1408 × 1152), covering a large range of spatial resolutions. Support for both sub-QCIF and QCIF formats in the decoder is mandatory, while the encoder must support at least one of these two formats. This requirement is a compromise between high resolution and low cost.
UMV Mode: In this mode, motion vectors are allowed to point outside the coded picture area. This allows for a better prediction when a small part of the predicted macroblock is located outside the picture area and is, therefore, not available. To predict these unavailable pixels, the edge pixels are used instead. With this mode, a gain in quality is achieved, especially for the smaller picture formats, if there is motion near the picture boundaries. Note that, for the sub-QCIF picture format, about 50% of all the macroblocks are located at or near the boundary.
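The two points made above, edge-pixel substitution and the share of boundary macroblocks, can be checked with a short sketch. It assumes 16 × 16 luminance macroblocks and counts a macroblock as a boundary macroblock when it touches the picture edge; the function names are hypothetical.

```python
def fetch_unrestricted(picture, x, y):
    # Outside the picture, the nearest edge pixel is used instead.
    height, width = len(picture), len(picture[0])
    clamped_x = min(max(x, 0), width - 1)
    clamped_y = min(max(y, 0), height - 1)
    return picture[clamped_y][clamped_x]


def boundary_fraction(width, height, mb_size=16):
    # Fraction of macroblocks that touch the picture boundary.
    cols, rows = width // mb_size, height // mb_size
    total = cols * rows
    interior = max(cols - 2, 0) * max(rows - 2, 0)
    return (total - interior) / total
```

For sub-QCIF (128 × 96) this gives 8 × 6 = 48 macroblocks, of which 24 touch the boundary, that is, 50%. For QCIF (176 × 144) the share drops to 36 of 99, about 36%.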
Advanced Prediction Mode: In this optional mode, overlapped block motion compensation (OBMC) is used for the luminance component, which reduces the blocking artifacts and thereby improves the subjective video quality. For some of the macroblocks, four 8 × 8 vectors are used instead of one 16 × 16 vector, providing better prediction but at the expense of more bits.
PB-Frames Mode: The principal purpose of the PB-frames mode is to increase the frame rate without significantly increasing the bit rate. A PB-frame consists of two pictures coded as one unit. The P-picture is predicted from the last decoded P-picture and the B-picture is predicted both from the last and current P-pictures. Although the names P-picture and B-picture are adopted from MPEG [8], B-pictures in H.263 serve an entirely different purpose. The quality of the B-pictures is intentionally kept lower, so as to minimize the overhead of bidirectional prediction, which is important in low bit-rate applications. B-pictures use only 15%-20% of the allocated bit rate, but result in a better subjective impression of smooth motion.
SAC Mode: Since H.263 is optimized for very low bit rates, it uses the optional SAC mode, which replaces the variable length coding/decoding (VLC/VLD) using Huffman tables by arithmetic coding/decoding operations in order to reduce the number of bits to be transmitted. While in the normal VLC/VLD process (using Huffman coding), only a fixed integral number of bits must be used for each coded symbol, arithmetic coding removes this restriction, resulting in a reduced bit rate, while at the same time not losing the advantages offered by normal VLC/VLD.
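The restriction to an integral number of bits per symbol can be made concrete with a small worked example. The sketch below compares the average Huffman code length with the source entropy, which an ideal arithmetic coder approaches. The probabilities are illustrative assumptions and have nothing to do with the actual H.263 symbol statistics or the SAC model.

```python
import heapq
import math


def entropy_bits(probs):
    # Lower bound on the average number of bits per symbol.
    return -sum(p * math.log2(p) for p in probs if p > 0)


def huffman_lengths(probs):
    # Code length per symbol from the standard Huffman merge procedure.
    heap = [(p, i, [i]) for i, p in enumerate(probs)]
    heapq.heapify(heap)
    lengths = [0] * len(probs)
    next_id = len(probs)  # unique tie-breaker so lists are never compared
    while len(heap) > 1:
        p1, _, group1 = heapq.heappop(heap)
        p2, _, group2 = heapq.heappop(heap)
        for symbol in group1 + group2:
            lengths[symbol] += 1
        heapq.heappush(heap, (p1 + p2, next_id, group1 + group2))
        next_id += 1
    return lengths


def average_length(probs, lengths):
    return sum(p * n for p, n in zip(probs, lengths))
```

For a two-symbol source with probabilities 0.9 and 0.1, Huffman coding must spend one bit on every symbol, whereas the entropy is about 0.47 bits per symbol. The gap is largest for such highly skewed distributions.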
III. EXTENDED INSTRUCTION SETS FOR MULTIMEDIA ENHANCEMENT
In this section, we discuss two low-level machine primitives, namely the VIS for SUN UltraSPARC workstations and the MMX for Intel Pentium-based PCs. These are, in effect, extensions to the core instruction sets, specifically designed to embody special instructions suitable for multimedia applications. These instruction sets support integer data processing in single instruction stream multiple data stream (SIMD) fashion by utilizing a packed fixed-point integer, where multiple integer words are grouped into a single 64-bit quantity. These 64-bit quantities can be moved into the 64-bit (integer or floating point) registers and processed with a single instruction, providing data parallelism and thereby enhancing the performance.
A. The VIS
The VIS [22] in the Sun UltraSPARC processor is a RISC-like extension to the SPARC V9 instruction set, which provides core instructions that greatly enhance the graphics and image processing capabilities of SPARC processors [11]. We exploit the data alignment and packing capabilities of VIS, along with its various representations of data (for instance, one 64-bit datum may represent eight 8-bit, four 16-bit, or two 32-bit partitioned data). We thereby use a 64-bit register to perform eight 8-bit, four 16-bit, or two 32-bit integer arithmetic operations in parallel, providing up to 8-fold, 4-fold, or 2-fold speedup, respectively.
In image- and video-processing applications, many computations can be accelerated in a SIMD fashion. The addition of VIS is equivalent to adding a SIMD fixed-point processor [14]. We process four pixels (each represented by 16 bits) using only one VIS instruction, performing either multiplication, addition, subtraction, or logical evaluation. A few special VIS instructions, in addition to the regular instructions, enable the video coding application to speed up by a factor of four or higher [25]. We perform most VIS operations in the floating-point register file, so that substantial register space is available and there are no register contentions between data and control variables. In our experiments, we have used VIS for ME, motion-compensated prediction (MCP), DCT, and IDCT.
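The idea of a 16-bit partitioned addition on a 64-bit word can be emulated in software. The sketch below is not VIS or MMX code; it only mimics, with ordinary integer operations, what a single packed-add instruction does to four 16-bit lanes, using wraparound arithmetic per lane. The helper names are hypothetical.

```python
HIGH_BITS = 0x8000800080008000  # most significant bit of each 16-bit lane
MASK64 = (1 << 64) - 1


def pack16x4(values):
    # Pack four signed 16-bit values into one 64-bit word (lane 0 lowest).
    word = 0
    for lane, value in enumerate(values):
        word |= (value & 0xFFFF) << (16 * lane)
    return word


def unpack16x4(word):
    # Recover four signed 16-bit values from a 64-bit word.
    values = []
    for lane in range(4):
        raw = (word >> (16 * lane)) & 0xFFFF
        values.append(raw - 0x10000 if raw & 0x8000 else raw)
    return values


def padd16(a, b):
    # Add the low 15 bits of every lane, then fix up each lane's top bit
    # with XOR so that no carry crosses a lane boundary.
    low = (a & ~HIGH_BITS) + (b & ~HIGH_BITS)
    return (low ^ ((a ^ b) & HIGH_BITS)) & MASK64
```

One call to `padd16` replaces four scalar additions, which is the source of the up to 4-fold speedup quoted for 16-bit data. Packing and unpacking are the data re-arrangement overhead that the hardware version also has to pay.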
B. The MMX Instruction Set
Like VIS, Intel's MMX is an extended multimedia instruction set. MMX implements a new high-performance architectural technique and includes new instructions and data types to achieve an increased level of performance on the host CPU. Essentially, MMX exploits the parallelism inherent in many of the algorithms in video, graphics, and multimedia applications by processing several pieces of data with each instruction [17]. MMX introduces 57 new instructions and eight new virtual 64-bit registers in order to provide SIMD features. With data packed into one wide virtual register, more than one piece of data can be processed by a single instruction. For the DCT and IDCT implementation, we have performed 16 multiplications and 14 additions by using only five MMX instructions, thus reducing the number of clock cycles by 35.5%.
The gain in speed is often offset by the overhead of the nonarithmetic operations (data re-arrangement, data copying, data-type conversion, etc.) needed to suit the MMX instructions. Thus, the overall gain in performance may be restricted if the MMX instructions are not judiciously applied.
IV. OPTIMIZATIONS
The encoder is enhanced with a variety of optimization techniques.
A. Algorithmic Optimization (AO)
We start with a fast search algorithm [6] that provides a high speedup compared to full-search block matching (FSBM). The algorithm reduces the number of bits required for the motion vectors while maintaining the quality at an acceptable level. The algorithm partitions the search range into nested search zones, where the first zone is the innermost area of size 3 × 3 or 5 × 5 pixels. If the minimum MAD (mean absolute difference of a macroblock) is found at the center, or if the matching error is less than a predefined threshold, the search procedure stops. Otherwise, the procedure continues to the next consecutive zones. We use a threshold of 8 (a threshold of zero means full search) for zone 1, while our search range is . However, we use a variation of [6]: instead of making the threshold zero for zone 2, we use an even larger threshold of 12 (in the Ultra-1 implementation) or 16 (in the PC implementation) in order to maintain the obtained speedup. This way, we do not lose the advantages gained by the use of the zone-based algorithm.
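The early-termination logic of a zone-based search can be sketched as follows. This is a simplified illustration and not the algorithm of [6]: the zones are assumed to be square rings of increasing radius around the zero vector, every candidate in a ring is evaluated, and the function and parameter names are hypothetical. Only the stopping rule (minimum at the center, or error below the zone threshold) follows the description above.

```python
def mad(cur, ref, bx, by, dx, dy, n):
    # Mean absolute difference between the current n x n block at (bx, by)
    # and the reference block displaced by (dx, dy).
    total = 0
    for y in range(n):
        for x in range(n):
            total += abs(cur[by + y][bx + x] - ref[by + dy + y][bx + dx + x])
    return total / (n * n)


def zone_search(cur, ref, bx, by, n, zone_radii, thresholds):
    height, width = len(ref), len(ref[0])
    best_mv, best_err = (0, 0), mad(cur, ref, bx, by, 0, 0, n)
    searched_radius = 0
    for radius, threshold in zip(zone_radii, thresholds):
        for dy in range(-radius, radius + 1):
            for dx in range(-radius, radius + 1):
                if max(abs(dx), abs(dy)) <= searched_radius:
                    continue  # candidate belongs to an inner zone
                if not (0 <= bx + dx and bx + dx + n <= width
                        and 0 <= by + dy and by + dy + n <= height):
                    continue  # keep the reference block inside the picture
                err = mad(cur, ref, bx, by, dx, dy, n)
                if err < best_err:
                    best_mv, best_err = (dx, dy), err
        # Stop if the best match is still the center or is good enough.
        if best_mv == (0, 0) or best_err < threshold:
            break
        searched_radius = radius
    return best_mv, best_err
```

With a nonzero threshold for the outer zones, the search can stop as soon as a match is good enough, which trades a small loss in prediction accuracy for fewer block comparisons.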
In order to fully exploit the advantages offered by the extended multimedia instruction sets, we make a modification in the block-matching process to further speed up the computation. We use a pixel-vector decimation technique similar to [13] so that only one 8-pixel vector is used for each row of pixels in the macroblock. Our approach is different from [13] in that, instead of subsampling pixels, we subsample vectors of pixels in both horizontal and vertical dimensions. The advantage of using 8-pixel vectors lies in their availability for VIS or MMX, as the eight pixels are stored in a byte-aligned fashion in contiguous memory locations. This approach is illustrated in Fig. 2. If only the pixel vectors of pattern are used for block matching, then the computation is reduced by a factor of four. However, since 75% of the pixel vectors do not enter into the matching computation, the use of this subsampling pattern alone can negatively affect the accuracy of motion vectors. To reduce this drawback, we use all four subsampling patterns, but only one at each search location and in a specific alternating (cyclic) manner. Therefore, if pattern is used at the search location ( ), then it is also used at locations ( ) for , integers within the search area; pattern is used at locations ( ), pattern at ( ) and pattern at ( ). For each of the subsampling patterns, we obtain a motion vector that minimizes the sum of absolute differences (SAD) over the locations where the pattern is used. The minimum SAD, obtained from all four patterns, corresponds to the selected motion vector for the macroblock. Pixel-vector decimation yields a reduction of about 5% in overall program running time.
The overall accuracy of DCT and IDCT is not affected by rounding off and truncations, which are intrinsic to the quantization process in video-coding applications. By exploiting this fact, we have designed a fast 8 × 8 DCT and IDCT algorithm (based on [18]) using VIS and MMX. The DCT/IDCT routines take an input block of 16-bit integers and deliver an output block of 16-bit integers. Since the input to the DCT routine is usually the difference between the current block and the reference block, the difference pixel can occupy 9 bits and is, therefore, represented as a 16-bit datum. We store four such 16-bit data in a 64-bit register. We group these data elements such that the DCT computation can be viewed as a SIMD parallel process. For instance, the transformation in the first stage of Fig. 3 can be written as (1) (2). In order to perform the addition and subtraction in (1) and (2), we rearrange the input vector (residing in two registers and ) into registers , and , , as shown in Fig. 4. This rearrangement is necessary to maintain the correspondence of the data elements which are being operated on. By using 16-bit partitioned addition/subtraction of a 64-bit register on corresponding 16-bit data elements in registers and , we obtain ( , , , ) and ( , , , ), respectively, which are stored in 64-bit registers and , respectively. Similarly, the transformation of the upper part of the second stage can be written as
The values of ( , ) and ( , ) can be obtained by using 16-bit partitioned addition/subtraction on corresponding data elements of the upper and lower halves of . We have developed an efficient 32-bit × 32-bit multiplication strategy, where a pair of 16-bit × 16-bit multiplications are done in parallel. For instance, the upper part of the third stage of Fig. 3 requires three such multiplications, namely , , , and . The upper and lower halves of the results (64 bits) of these multiplications can be added or subtracted using 32-bit partitioned addition/subtraction of a 64-bit register on corresponding data elements in order to produce . Similarly, can be obtained according to the transformation depicted in the lower part of stages 2-4 in Fig. 3.
B. Compiler Optimization
Most compilers come with optimizers that take advantage of sophisticated scheduling algorithms in order to perform software pipelining for the most efficient processing. Although we try to keep the optimization categories as distinct as possible, some of the code optimizations (described in the next section) may well be performed implicitly by the compiler optimizer.
For Ultra-1 based experiments, we have used the Sun Solaris C compiler (SC4.0) with the following flag settings: (this flag defines the set of instructions the compiler should use), (it defines the cache properties for use by the optimizer), (it enables all dependence-based transformations), (it refers to the specification of a common set of performance options), and, most importantly, (it specifies that the compiler should generate optimized code at level 4). In addition, is used for preparing object code to collect data for profiling using gprof.
For PC-based experiments, we have used the Microsoft Visual C++ compiler, with the following settings: (it disables compiler optimization; we use it for the no-optimization case), (it combines optimizing options to produce the fastest possible program), (it enables the compiler to reduce some C/C++ constructs to equivalent machine code), (it helps store variables in registers and perform loop optimization), (it ensures that after each function call, pointer variables must be reloaded from memory), (it provides local and global optimizations, automatic register allocation, and loop optimization), (it replaces some function calls with intrinsic functions, to avoid the overhead of function calls), (it improves the consistency of floating-point results by disabling optimizations that could change floating-point precision), and (it omits frame pointers on the call stack and frees up one more register for storing frequently used variables and subexpressions). While using MMX instructions, we add the flag in order to exploit further compiler optimization suitable for the MMX instruction set. For the linker, the optimization option has been used.
C. Code Optimization
The following code-optimization techniques provide significant performance improvement [2] , especially when the compiler optimizer fails to efficiently use the system resources.
Loop Unrolling: The H.263 encoder accesses data structures organized in matrices using loops. Some encoding functions require nested loops with several levels. Parallelism can be exploited by using pipelined access to such data structures by unrolling loops. Loop unrolling (LU) is the transformation of a loop so as to increase the loop body size and to decrease the number of iterations. This process may reduce both the number of load/store instructions, by utilizing the CPU registers more efficiently, and the data hazards arising from inefficient scheduling of instructions by the compiler optimizer. There are two types of LU: internal LU (ILU) and external LU (ELU). ILU consists of collapsing some iterations of the innermost loop into a larger and more complex statement requiring a larger number of machine instructions, which can be more efficiently scheduled by the compiler optimizer. ELU consists of moving iterations from outer loops to inner loops, using more registers in order to minimize the number of memory accesses inside the loop.
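The transformation itself can be shown on a small example. The sketch below unrolls the innermost loop of a sum-of-absolute-differences computation by a factor of four, in the spirit of ILU. Python is used only to show the shape of the transformation; the performance benefit described above arises in compiled code, where the larger loop body gives the instruction scheduler more freedom. The function names and the unroll factor are illustrative assumptions.

```python
def sad_rolled(a, b):
    # One comparison, one increment, and one branch per element.
    total = 0
    for i in range(len(a)):
        total += abs(a[i] - b[i])
    return total


def sad_unrolled4(a, b):
    # Four elements per iteration: a quarter of the loop-control overhead
    # and one larger statement for the scheduler to work on.
    n = len(a)
    limit = n - n % 4
    total = 0
    i = 0
    while i < limit:
        total += (abs(a[i] - b[i]) + abs(a[i + 1] - b[i + 1])
                  + abs(a[i + 2] - b[i + 2]) + abs(a[i + 3] - b[i + 3]))
        i += 4
    while i < n:  # remainder when the length is not a multiple of four
        total += abs(a[i] - b[i])
        i += 1
    return total
```

For the fixed block sizes used in video coding (8 or 16 elements per row), the remainder loop is never entered and could be dropped.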
DTO: DTO is the choice of data types for the variables in the program critical path, which maximizes the performance of the different functional units, since the data types directly derived from the task definition may not yield the most efficient performance. We have used 16-bit integer values as the input and output of DCT and IDCT. In order to cope with the required floating-point operations in these functions, we have scaled the floating-point constants and allocated the precomputed constants to proper registers, instead of using the mixed-mode operations of integers and floating points.
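Scaling floating-point constants so that the whole computation stays in integer arithmetic can be sketched as follows. The scale factor of 2^13 and the helper names are assumptions chosen for illustration; the text does not state which scaling the implementation uses.

```python
import math

SHIFT = 13  # hypothetical scale: constants are stored as round(c * 2**13)


def to_fixed(constant):
    # Precompute once, outside the transform loops.
    return int(round(constant * (1 << SHIFT)))


def fixed_mul(value, fixed_constant):
    # Integer multiply, add half an LSB for rounding, then shift back.
    return (value * fixed_constant + (1 << (SHIFT - 1))) >> SHIFT


COS_PI_4 = to_fixed(math.cos(math.pi / 4))  # 0.7071... becomes 5793
```

For example, `fixed_mul(100, COS_PI_4)` gives 71, the same as rounding 100 × 0.7071. The scaled constant fits in a signed 16-bit lane, which is what makes this representation compatible with packed 16-bit multiplication.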
Reduction of Redundant Operations: Divisions and multiplications are usually considered to be the most cycle-expensive operations. However, in most RISC processors, the integer (32 bit) multiply takes more cycles compared to the double (64 bit) multiply in terms of both instruction execution latency and instruction throughput [2] . In addition, floating-point divisions are less cycle-expensive compared to mixed-integer and floating-point divisions. Therefore, it is important to minimize the number of such arithmetic operations, especially inside a loop. Possible techniques include LU and DTO, while in some cases introduction of temporary variables (stored in registers) can provide noticeable performance improvement. We have used such techniques for the quantization module of our implementation.
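One way to remove a cycle-expensive operation from a loop is to replace a per-coefficient division by a multiplication with a reciprocal that is computed once per block. The sketch below applies this to a simplified truncating quantizer with step size 2·QP. It is an illustration of the general idea, not the quantizer defined by H.263 and not necessarily the transformation used in our quantization module; the 20-bit reciprocal precision and the names are assumptions.

```python
RECIP_BITS = 20  # hypothetical precision of the scaled reciprocal


def quantize_with_division(coefficients, qp):
    # Reference version: one integer division per coefficient.
    step = 2 * qp
    return [(abs(c) // step) * (1 if c >= 0 else -1) for c in coefficients]


def quantize_with_reciprocal(coefficients, qp):
    # The division is hoisted out of the loop: one division per block,
    # then one multiply and one shift per coefficient.
    step = 2 * qp
    reciprocal = -(-(1 << RECIP_BITS) // step)  # ceil(2**20 / step)
    levels = []
    for c in coefficients:
        magnitude = (abs(c) * reciprocal) >> RECIP_BITS
        levels.append(magnitude if c >= 0 else -magnitude)
    return levels
```

With 20 fractional bits the two versions agree for coefficient magnitudes up to 2047 and QP from 1 to 31, because the accumulated reciprocal error stays below 1/(2·QP).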
V. EXPERIMENTAL RESULTS AND DISCUSSION
In this section, we present the experimental results of our implementation on two platforms, namely the PCs and the workstation, and compare the corresponding performances.
A. Test Video Streams
We have used nine video streams of QCIF resolution: Claire, Grandma, Miss America, Salesman, Mother and Daughter, Trevor, Car Phone, Suzie, and Foreman. These video sequences represent various types of motion, both in terms of motion in scene content and camera motion. Because of this variety, the complexity of the ME process differs for each video sequence, and the time to calculate the motion vectors varies widely. Since ME is the most time-consuming encoder module, its performance affects the total encoder running time. We divide the above nine video sequences into two major categories, depending on the difficulty of (and therefore the time taken for) computing the motion vectors: sequences with slow motion (SM) and sequences with fast motion (FM). The sequences Claire, Grandma, Miss America, and Salesman fall into the SM category, while Mother and Daughter, Trevor, Car Phone, Suzie, and Foreman fall into the FM category.
B. Analysis of Computational Requirements of Modules
In this section, we present an analysis of the execution profile of our H.263 encoder using the GNU gprof profiler. Our implementation of the optimized software-based H.263 encoder is based on Telenor's H.263 video encoder [23]. Since the coding of an I-frame is performed only once (for the first frame) and does not require expensive operations like motion estimation, we restrict our analysis to the coding of one video frame as a P-frame or two video frames as PB-frames. This analysis shows that 97.3% of the program running time is spent on the principal encoding function. The computational requirements of the various functional modules within this principal encoding function are analyzed below in detail. Since an insignificant amount of time is spent on performing input (about 2%) and on video quality measurement in terms of PSNR (0.6%), these functions are not probed any further. Table I shows the breakdown of the execution time of the principal encoding function into its constituent modules. Together, they require 92%-98% of the execution time of the principal encoding function. It may be observed that, for the no-optimization case, motion estimation is the most time-consuming module, followed by macroblock encoding (which involves DCT and quantization) and motion-compensated prediction. Unlike [20], which reports a high percentage of macroblock decoding time, our approach computes the IDCT for only the nonzero elements, thereby reducing the macroblock decoding time considerably.
With the application of all the optimizations discussed in Section IV, the execution profile of our program changes noticeably. With optimizations, the computational requirement of motion estimation is reduced to almost one third (in terms of percentage points). Therefore, percentage execution time of other functional modules increases. Fig. 5 illustrates this effect.
C. Performance of the H.263 Encoder
The reported results were obtained using the first 100 frames of each video sequence. The encoding rates are given in frames per second. The reference frame rate was kept at 30 frames/s while the input original sequence frame rate was assumed to be 30 frames/s. As an encoding output parameter, we used both variable bit rate (with the encoding frame rate at 30 frames/s and QP at 10) and constant bit rate (with variable QP and encoding frame rate, but with the bit rate fixed at 28.8 kbits/s). To measure the actual program running time, we used available library functions ( and ), which are accurate up to microseconds. The timing results were averaged over 100 runs. In addition to performing the experiments on a 167-MHz Sun Ultra-1, we performed the same experiments with two PCs: a 233-MHz Pentium II (PC PII) and a 600-MHz Pentium III (PC PIII).
Figs. 6-11 depict the luminance PSNR under different optional modes with variable bit rates. From these figures, it is evident that PSNR does not change noticeably due to the incorporation of various optional modes. Subjective quality, as observed, also remains the same. Therefore, with the same quality, our choice of encoder is concentrated on the encoder speed. Fig. 12 shows the luminance PSNR with no optional mode at a constant bit rate of 28.8 kbits/s. The experiment was still performed with 100 frames, but the encoding frame rate was variable, in order to meet the constraint of fixed bit rate. Therefore, 25-29 frames of the sequences were coded, depending on the complexity of (and therefore the bits spent to encode) the sequence. The mean QP is 5.68-17.88 while the mean encoded frame rate (in terms of the ratio of allocated bit rate and actual number of bits to encode the frame) is 9.44-9.88 frames/s. It is evident from this figure that, with a constant bit rate, the SM sequences yield very good quality while the quality of the FM sequences is still acceptable.
Encoding frame rate refers to the ratio of the allocated bit rate and the actual number of bits to encode the frames. It depends on the quantization parameter (QP) and the number of frames. Note that this frame rate is different from the frame encoding speed, which is a measure of the actual running time of the program in terms of frames per second, and depends on the computational and programming complexity.
Table II shows the frame encoding speed in frames/s for our H.263 video encoder. These results involve no explicit optimization. Although we disabled explicit compiler optimization for the PC by using switch for the Microsoft Visual C++ compiler, the compiler uses some intrinsic optimizations. As a result, the PC version of the encoder yields faster encoding speed compared to the Ultra-1 workstation.
It may be observed from this table that the use of optional modes considerably slows down the encoding speed. Table III shows the encoding speed with AO performed on motion estimation, DCT, and IDCT. The effect of using AO is discussed in Section V-D. The multimedia instruction sets are not used in these cases. Table IV shows the encoding speed with AO and compiler optimization. Further fine tuning is done by reducing some cycle-expensive operations and by including some code optimization, especially LU and DTO. The effect is discussed in Section V-D. Table V includes the H.263 encoding speed with all the optimizations. The encoder achieved a maximum frame-encoding speed of 18.12 frames/s using the PC PII and 12.17 frames/s using the Ultra-1 workstation. On the Ultra-1, the mean frame encoding speed with no optional mode is 11.28 frames/s. Using the PB-frames, SAC, UMV, and advanced prediction modes, the average frame encoding speed is 11.27, 10.91, 8.98, and 7.15 frames/s, respectively. Using all the optional modes, the average frame encoding speed goes down to 6.72 frames/s. On the PC PII, the mean frame encoding speed is 16.05 frames/s without optional modes. Using the SAC and the PB-frames modes, the average frame encoding speed is 15.78 and 15.04 frames/s, respectively. With the UMV mode, the advanced prediction mode, and all optional modes, the average frame encoding speeds are 14.26, 11.44, and 10.44 frames/s, respectively. Table VI shows the percentage loss in encoding speed using various optional modes compared to no optional mode. The significance of these results is discussed in the next section. Table VII shows the average luminance PSNR without optimization, which does not increase significantly with the use of optional modes.
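The percentage loss in encoding speed can be recomputed from the mean speeds quoted above. The sketch below uses only those averages, so its output may differ slightly from entries in Table VI that are computed per sequence; the dictionary layout and function name are illustrative.

```python
MEAN_SPEED = {  # frames/s, taken from the averages quoted in the text
    "Ultra-1": {"none": 11.28, "PB-frames": 11.27, "SAC": 10.91,
                "UMV": 8.98, "advanced prediction": 7.15, "all": 6.72},
    "PC PII": {"none": 16.05, "SAC": 15.78, "PB-frames": 15.04,
               "UMV": 14.26, "advanced prediction": 11.44, "all": 10.44},
}


def percentage_loss(platform, mode):
    # Loss in encoding speed relative to coding without optional modes.
    baseline = MEAN_SPEED[platform]["none"]
    return 100.0 * (baseline - MEAN_SPEED[platform][mode]) / baseline


for platform, speeds in MEAN_SPEED.items():
    for mode in speeds:
        if mode != "none":
            print(f"{platform:8s} {mode:20s} {percentage_loss(platform, mode):6.2f}%")
```

From these averages, enabling all optional modes costs roughly 40% of the encoding speed on the Ultra-1 and roughly 35% on the PC PII, whereas the PB-frames mode costs less than 0.1% on the Ultra-1 and about 6% on the PC PII.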
In Table VIII, perceptible changes in PSNR are not due to optimizations. This fact is further highlighted by the subjective judgment of visual quality. Under some test conditions, however, the use of optional modes may increase the PSNR by about 1 dB [5]. Table IX shows the average obtained bit rate for the no-optimization case, with the QP fixed at a value of 10. Values shown in parentheses represent bit rates for inter-coded pictures only, while those without parentheses represent bit rates including intra-coded pictures. Different test sequences, involving a variety of motion, require different bit rates (for the QP fixed at 10), ranging from 18 to 113 kbits/s for the various optional modes. The bit rate could be fixed to a particular value (say, 64 or 28.8 kbits/s). However, in that case, sequences with complex motion (FM sequences) would take more encoding time, and the quality would be poorer as well. This effect is shown in Fig. 12. Table X shows the average bit rate for the optimized case with the QP fixed at 10. For the SUN Ultra-1 implementation, only a small increase in bit rate is observed compared to the case without optimization. This is due to the incorporation of fast search instead of FSBM. Even for the PC-based implementation, the increase in bit rate is not significant, and the implementation is still applicable with currently available modems.
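Two quantities recur in these results: the luminance PSNR and the encoded frame rate under a fixed bit rate. The sketch below assumes 8-bit luminance samples (peak value 255) and the standard PSNR definition, which the text does not spell out. The function names and the flat-list picture representation are illustrative assumptions.

```python
import math


def luminance_psnr(original, decoded, peak=255):
    """PSNR in dB between two equally sized lists of luminance samples.

    Uses the standard definition 10 * log10(peak^2 / MSE); peak=255 assumes
    8-bit samples.
    """
    if len(original) != len(decoded) or not original:
        raise ValueError("inputs must be non-empty and of equal length")
    mse = sum((a - b) ** 2 for a, b in zip(original, decoded)) / len(original)
    if mse == 0:
        return math.inf  # identical pictures
    return 10.0 * math.log10(peak * peak / mse)


def encoded_frame_rate(bit_rate_bps, mean_bits_per_frame):
    """Encoded frame rate: allocated bit rate divided by the bits per frame."""
    return bit_rate_bps / mean_bits_per_frame
```

For example, with a hypothetical average of 3000 bits per coded frame, a 28.8-kbits/s channel supports `encoded_frame_rate(28800, 3000)`, about 9.6 frames/s, which is in the range reported for Fig. 12.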
Table XI compares the encoding speed on two different PC platforms: the 233-MHz Pentium II (PC PII) and the 600-MHz Pentium III (PC PIII). In this comparison, we only consider our H.263 encoder without optional modes. With the 2.58-times increase in clock speed, the PC PIII consistently gives a 2.49-2.57 times higher encoding speed. The bit rates are variable, but the average bit rate (with the QP fixed at 10) is almost the same for both PCs.
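The comparison in Table XI amounts to asking how closely the encoding speed follows the clock frequency. A small sketch of that arithmetic follows; the notion of "scaling efficiency" (speedup divided by clock ratio) is an illustrative assumption, not a metric used in the paper.

```python
def scaling_efficiency(speedup, clock_ratio):
    """Fraction of the clock-frequency increase that shows up as speedup."""
    return speedup / clock_ratio


# Clock frequencies in MHz, as quoted in the text.
CLOCK_RATIO = 600 / 233  # about 2.58

# Speedup range reported for the PC PIII over the PC PII.
EFFICIENCY_LOW = scaling_efficiency(2.49, CLOCK_RATIO)
EFFICIENCY_HIGH = scaling_efficiency(2.57, CLOCK_RATIO)
```

With the reported figures, the efficiency is roughly 0.97 to nearly 1.0, which suggests that the encoder speed scales almost linearly with clock frequency on these two platforms.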
1) Effect of Coding Options:
From Table VI, we observe that the use of optional modes increases the complexity of the encoder and hence the overall program running time. However, these optional modes may be useful for higher bit rates. Since we deal with QCIF-size video frames and the bit rates common to most H.263-based applications, the PB-frames mode may be the better choice, as it provides a balance between encoding speed and bit rate.
For the Ultra-1 implementation, the PB-frames mode incurs very little additional encoding time (about 2%) for the SM type of video sequences. It is interesting to note that there is an actual gain (1%-4%) in encoding speed with the PB-frames mode for the FM type of video sequences. Using only the PB-frames mode keeps almost the same quality with 13%-28% less bit rate. This finding affirms
The SAC mode is the next least expensive mode. In our experiments, its additional cost appears to be about 1%-7% compared to the no-optional mode, and the bit rate is reduced by 3%-6%. However, the unrestricted motion vector (UMV) mode accounts for much more encoding time: the additional cost is about 22%-24% for the SM type of sequences and 15%-19% for the FM type of sequences. Using the UMV mode, there is little gain in quality at almost the same bit rate. This result suggests that, if there is little or no motion at or near the boundary region (as in the case of the SM sequences), the overhead due to UMV is much higher than for sequences with faster or more complex motion (the FM sequences).
The advanced prediction mode is the most expensive mode, slowing down the encoder by 33%-39%. Using this mode, we obtained almost the same quality, while the savings in bit rate are less than 10%. The reason is that the overhead of computing four 8×8 motion vectors instead of one 16×16 motion vector is obviously higher. With this mode, 8×8 motion vectors are chosen for 65%-75% of the macroblocks. Overall, using all the modes simultaneously, the encoder runs about 36%-44% slower than with no optional mode.
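To illustrate where the extra work of the advanced prediction mode comes from, the following minimal sketch finds one 16×16 motion vector and four 8×8 motion vectors for a macroblock by exhaustive search with the sum of absolute differences (SAD). It is an assumption-laden illustration, not the encoder's method: the paper uses a fast search, the overlapped block motion compensation that H.263 also applies in this mode is omitted, and all function names, the list-of-rows frame layout, and the tiny search range are hypothetical.

```python
def sad(cur, ref, cx, cy, rx, ry, size):
    """SAD between the size x size block of cur at (cx, cy) and of ref at (rx, ry)."""
    return sum(
        abs(cur[cy + j][cx + i] - ref[ry + j][rx + i])
        for j in range(size)
        for i in range(size)
    )


def best_vector(cur, ref, x, y, size, search):
    """Exhaustive search over displacements in [-search, search] on both axes.

    Returns (sad, dx, dy) for the best match; candidates that leave the
    reference picture are skipped (i.e., no unrestricted motion vectors).
    """
    height, width = len(ref), len(ref[0])
    best = None
    for dy in range(-search, search + 1):
        for dx in range(-search, search + 1):
            rx, ry = x + dx, y + dy
            if rx < 0 or ry < 0 or rx + size > width or ry + size > height:
                continue
            cost = sad(cur, ref, x, y, rx, ry, size)
            if best is None or cost < best[0]:
                best = (cost, dx, dy)
    return best


def macroblock_vectors(cur, ref, x, y, search=2):
    """One 16x16 vector plus four 8x8 vectors for the macroblock at (x, y)."""
    one = best_vector(cur, ref, x, y, 16, search)
    four = [
        best_vector(cur, ref, x + ox, y + oy, 8, search)
        for oy in (0, 8)
        for ox in (0, 8)
    ]
    return one, four
```

In this naive form, the four 8×8 searches touch as many pixels as the 16×16 search and are performed in addition to it, so the matching work for a macroblock roughly doubles. An encoder would then compare the 16×16 SAD with the sum of the four 8×8 SADs to decide which representation to code.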
In the case of the PC PII-based implementation, the SAC mode proves to be the most efficient among the optional modes, requiring only about 0.5%-3.5% more encoding time than no optional mode at almost the same bit rate. The PB-frames mode is next, requiring about 4.5%-7.5% more encoding time but saving about 3.5% of the bits. The UMV mode is 8%-13% slower, while the advanced prediction mode needs 25%-32% more encoding time. Together, the use of all optional modes slows down the encoder by about 33%-38%.
D. Effect of Optimizations
As shown in Table III, the use of AO alone gives about a threefold speedup in encoding, compared to the no-optimization case. From Table IV, which involves the AO, compiler optimization, and reduction of some cycle-expensive operations, the additional optimizations provide a further 6-7 times improvement in encoding speed. Further inclusion of VIS provides a speedup of about 20% on the overall program running time. All in all, with all the optimizations, about 20-26 times improvement in speed is observed with no optional mode, while about 15-22 times speedup is gained with all optional modes.
In summary, we make the following observations.
1) Optimizations at the algorithmic level are the most important consideration for computation-intensive functional modules, particularly for full-search block matching, DCT, IDCT, quantization, and inverse quantization.
2) Although some loop unrolling (LU) may be performed by the compiler optimizer, LU is an effective optimization technique. It can be applied to loop-intensive functional modules such as motion estimation and motion-compensated prediction.
3) DTO provides improved performance when complicated type conversions must be handled. For instance, the DCT includes such operations in order to take advantage of the 64-bit registers, and DTO is very useful in such cases.
4) Motion estimation, motion-compensated prediction, DCT, and IDCT are the functional modules that deal with regular data structures and are amenable to VIS/MMX-based optimization.
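The overall speedup can be cross-checked against the individual stage figures reported for Tables III-V. The sketch below assumes that the stages compose multiplicatively and that "about 20%" means a factor of 1.2; neither assumption is stated explicitly in the text, and the function name is hypothetical.

```python
def combined_speedup(ao_factor, further_factor, simd_gain):
    """Overall speedup if the three stages multiply (an assumption).

    ao_factor:      speedup from algorithmic optimization alone
    further_factor: additional speedup from compiler and code optimization
    simd_gain:      fractional gain from VIS, e.g. 0.20 for about 20%
    """
    return ao_factor * further_factor * (1.0 + simd_gain)


# Stage figures quoted in the text: about 3x, then 6-7x, then about 20%.
SPEEDUP_LOW = combined_speedup(3.0, 6.0, 0.20)   # about 21.6
SPEEDUP_HIGH = combined_speedup(3.0, 7.0, 0.20)  # about 25.2
```

Under these assumptions, the product falls at roughly 22 to 25 times, which is consistent with the 20-26 times improvement reported with no optional mode.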
E. A Videophone Application
We have built a videophone using our optimized H.263 video encoder. We use QCIF-resolution video, captured via a video camera that may use the USB port of the PC. The videophone displays both the called party and the calling party in separate windows, while the PC runs both the H.263 encoder and decoder. In addition to video, we also use an audio codec (for which an additional bandwidth of about 10 kbits/s is necessary, but this issue is not further discussed in this paper). The videophone reports the frame rate and the bit rates in real time. For a typical videophone, the motion involved in the scene is usually slow, allowing a lower bit rate with good quality. An instance of the videophone, shown in Fig. 13, reveals that the bit rates of the sending and receiving bit streams are 8 and 36 kbits/s, respectively, with a display frame rate of 23 frames/s. Although the measured encoding speed in this case is 44.8 frames/s, constraints such as the available bandwidth and the capture frame rate limit the overall throughput to 23 frames/s. The bit-rate control option is set to "unlimited" (i.e., variable) with a fixed QP of 10. However, we can also keep the bit rate at a constant level (e.g., at 64 kbits/s) while allowing the QP to change. For a fixed bit rate, if fast motion is involved, the visual quality will be poorer. However, for slow motion, a bit-rate ceiling of 28.8 kbits/s is usually sufficient to yield good to excellent quality.
VI. CONCLUSION
We have presented the implementation of an optimized software-based real-time H.263 video encoder. In order to achieve enhanced performance, various software optimizations and low-level machine primitives such as VIS and MMX are exploited. We have achieved a video-encoding speed that is sufficient for most GSTN-based applications. In addition, we have presented a discussion of the optimal choice of encoder configuration. We have found that the PB-frames mode is a good choice for encoding, as it provides a balance between PSNR, bit rate, and encoding speed. Our present work focuses on the incorporation of new and improved algorithms for various encoder modules, which can easily replace the existing algorithms without altering the backbone of our implementation.
