Abstract-We present a fast motion estimation algorithm using only binary representation, which is desirable for both embedded system and hardware implementation with parallel architectures. The key algorithm distinction is that only the high-frequency spectrum is used. Our experimental results show that it provides excellent performance at both low and high bit rates. Because of its binary-only representation, the proposed algorithm offers low computational complexity and low memory bandwidth consumption. For multimedia-embedded system design, we further investigated specific implementation techniques for several well-known hardware platforms including Intel x86 processors, single-instruction multiple-data processors, and systolic array circuit design. The systolic array architecture requires only single memory access for both the reference and current frames from the on-chip memory. Such an implementation provides an optimized solution with great throughput, while the quality is maintained. Finally, we show that our binarization methods are closely coupled to the accuracy of binary motion estimation algorithms. The binarization and coding efficiencies can be improved using various filters and binarization methods.
A Novel All-Binary Motion Estimation (ABME)
With Optimized Hardware Architectures
Jeng-Hung Luo, Chung-Neng Wang, and Tihao Chiang, Senior Member, IEEE
Abstract-We present a fast motion estimation algorithm using only binary representation, which is desirable for both embedded system and hardware implementation with parallel architectures. The key algorithm distinction is that only the high-frequency spectrum is used. Our experimental results show that it provides excellent performance at both low and high bit rates. Because of its binary-only representation, the proposed algorithm offers low computational complexity and low memory bandwidth consumption. For multimedia-embedded system design, we further investigated specific implementation techniques for several well-known hardware platforms including Intel x86 processors, single-instruction multiple-data processors, and systolic array circuit design. The systolic array architecture requires only single memory access for both the reference and current frames from the on-chip memory. Such an implementation provides an optimized solution with great throughput, while the quality is maintained. Finally, we show that our binarization methods are closely coupled to the accuracy of binary motion estimation algorithms. The binarization and coding efficiencies can be improved using various filters and binarization methods.
Index Terms-All-binary motion estimation (ABME), binary motion estimation, fast motion estimation, fast motion search, multiresolution.
I. INTRODUCTION

I
N A multimedia-embedded system, the video-encoding module contains several major components, including discrete cosine transform (DCT), inverse DCT (IDCT), motion estimation (ME), motion compensation, quantization, inverse quantization, bit-rate control, and variable-length coding (VLC) encoding, where the most computationally expensive part is the motion estimation. Generally, motion estimation takes around 50% of the total computational powers for an optimized system. Thus, further optimizing estimation is critical in cost reduction for real-time video encoding in an embedded multimedia system.
Many fast-search algorithms have been proposed, including the three-step search [1] , the two-dimensional (2-D) logarithmic search [2] , the conjugate directional search [3] , the genetic search [4] , [5] , the diamond search [6] - [8] , the feature-based block ME using integral projection [9] , and subsampled motion field estimation with alternating pixel-decimation patterns [10] . These various search approaches reduce the complexity at the expense of motion vector accuracy, which leads to a selection of only local minimum of mean absolute difference (MAD) as compared to the conventional full-search (FS) algorithm. Multiresolution ME techniques [11] , [12] perform the search with a much smaller window from lower to higher resolution layers. The motion vectors are refined gradually at each layer, but the search area is identical to that of the FS with much lower complexity. To further reduce the complexity, binary ME algorithms [13] - [16] were also proposed to significantly decrease both the computational complexity and bus bandwidth by reducing the bit depth. Based on a binary pyramid structure, a fast binary ME algorithm was proposed in [16] , namley fast binary pyramid motion estimation (FBPME). The pyramidal structure of FBPME contains one integer layer at the lowest resolution (smallest picture size) and three binary layers that contain detail information. FBPME performs the tiled motion search with exclusive OR (XOR) Boolean block-matching criterion on binary layers and MAD on the integer layer. The block matching uses XOR operations that are much simpler and faster to implement than MAD operations.
However, the FBPME has open issues that are not addressed but are resolved in our approach. One of the issues is the use of the integer layer, which leads to two distortion computation modules to perform both MAD and XOR operations. It requires larger code size and more hardware complexity. To address this issue, we propose a new approach, all-binary motion estimation (ABME), which uses all-binary representation for all layers. Another issue is the significant complexity in performing the pre-processing, including filtering, decimation, binarization, and interpolation. To address that in ABME, we employ better filters and integrate several steps into a unified process by removing the redundant operations. Compared to the FBPME, the single-layer approach provides much reduction in complexity, and thus the power consumption for hardware implementation. A fast binary ME algorithm presented by Natarajan et al. is based on a simple 1-bit transform with conventional search schemes [13] . This provides single-layer motion estimation that derives the current and reference blocks. However, the binary representation does not use a hierarchical structure, as when a hierarchical structure is adopted, it is more challenging to get a more accurate binary representation at a lower resolution.
In this paper, we propose ABME based on hierarchical motion estimation and spatial-temporal correlations between blocks. ABME adopts a hierarchical layer structure that captures the spatial characteristics with binary representation, which is derived from simple filtering and decimation operations. The feature-extraction filters adopt linear and symmetric characteristics, which are amenable for software processing and hardware pipelining. For real-time acceleration, we design the software and hardware pipelining architectures to implement ABME for bandwidth reduction. This paper is organized as follows. In Section II, we describe the proposed fast motion estimation. We focus on pipelined architectures including software implementation and hardware realization to improve the speed of ABME in Section III. We also provide the complexity and bus bandwidth analyzes of the proposed architectures. Section IV shows the experimental results using MPEG-4 video reference software, and we conclude in Section V.
II. ABME
In this paper, we propose an ABME algorithm as shown in Fig. 1 . It features low computational complexity, reduced data bandwidth, simple software optimization and pipelined hardware architectures. The ABME is described with a flowchart, as shown in Fig. 1 , and is implemented with the following steps.
Step 1) Construct the binary pyramid structure.
Step 2) Perform the FS XOR Boolean block matching with a 3 pixel refinement window at the first level.
Step 3) Select the best of the six motion-vector candidates from the current and previous frames and perform the XOR Boolean block matching within neighboring locations for refinement at the second level.
Step 4) Perform the FS XOR Boolean block matching with a 2 pixel refinement window at the last level. In Step 1, we perform the following three substeps including filtering, binarization and decimation to build a conventional pyramid structure with binary levels. Each original pixel is compared to a threshold, which is computed from an average of the neighboring luminance pixels, to derive the binary representation. At the decimation stage, the filtered image is then subsampled by two in each dimension of the image to achieve the next layer iteratively. Thus, ABME provides all-binary edge information without an integer layer, as is used in FBPME [16] . In this paper, we adopt a three-layer pyramid structure to balance the computational complexity and the motion vectors precision.
In Step 2, the FS XOR Boolean block matching with a 3 pixel refinement window is performed to locate one initial motion vector candidate with a block size of 4 4. The initial motion vector candidate is projected to the next binary level with a block size of 8 8.
In
Step 3, based on the spatio-temporal dependencies that exist among blocks, ABME selects the best 8 8 motion vector from six candidates in the current and previous frames using the XOR matching criterion. The candidate with the minimum distortion is used as the starting point for further refinement, with a small window at the second level. Similarly, the resultant motion vector candidate will be passed onto the next binary level.
In Step 4, a 2 search window refined search is used to derive the final motion vector for a block size of 16 16 .
A. Construction of Binary Pyramids
The precise edge information is extracted based on the spatial variations within a small local area of an image. The spatial variations can be extracted with various filters. Assume that an 8-bit representation of the current frame is low-pass filtered to create a new frame . The construction of each pyramidal level in binary format is illustrated in Fig. 2 (1) (2) with , where is the total number of pyramidal levels used. The frame is a blurred version of the original frame with the same size at the th level.
Next, the filtered frame is used to create the binary representation of the current frame. The construction of the binary representation is based on a threshold . The major issue is how to define the threshold to precisely represent the edge information in binary format. The edge information can be found by differencing the original image and its low-pass version. To compute the binary representation, ABME adopts a novel differencing operation using a specified threshold as computed from frame , which provides the average spatial variations for the neighboring area. Thus, the binary representation of the th level is computed by the following 1-bit transformation:
Finally, the low-pass frame , which contains most of the spatial information from the original image at the current level, is used to create the input frame in the next pyramidal level. To compute the next level, the frame is by decimating every other pixel to yield the new frame , as described in the following equation: To achieve a tradeoff between the motion vector precision and the computational complexity, we select a three-level pyramid for motion estimation. The levels are denoted as Level 1, Level 2, and Level 3 (original picture size), respectively. Using a simple decimation, the pyramidal levels has three block sizes of 4 4, 8 8, and 16 16.
B. Block Matching
In ABME, the matching criterion is a bit-wise sum of difference (SoD) (5) where denotes the current binary block at the th level and denotes the reference binary block with offset ( ) from the left top corner of the current block. Since the frame data are in all-binary formats, a simple XOR operation is used to compute the difference.
Consequently, the advantages of the proposed ME approach can be analyzed as follows. At Level 1, we perform FS with a search range of 3 instead of a search range of , where is the target search range. Thus, with a smaller window we can get a motion vector covering 16 times the actual search area. At Level 2, the search algorithm contains two steps, including the coarse search and fine-tune search. As shown in Fig. 3 , the six motion vector candidates, which come from the upper right (UR), upper (U), left (L), Level 1 (Lv1), temporally previous (P) blocks, and the center (C), are considered for the Level 2 search. The center candidate denotes the zero motion vector. When all six candidates are equal to zero, we use the zero motion vector as the final candidate and further check its eight surrounding check points as shown in Fig. 4(b) . If the six motion vector candidates are different from each other, the coarse search stage in Fig. 1 is used to choose the best motion vector based on the minimal SoDs. The best candidate is passed onto the fine-tune search, where the four neighboring Fig. 4(a) are checked. The best motion vector, derived from the fine-tune stage, is the output of Level 2, and which is passed onto Level 3 for further refinement within a search range of 2. To further reduce the computational complexity, an optional counter is used to register the repeat occurrence of motion vector for each macroblock. If the motion vectors remain identical for the past four frames, we assume the current macroblock is static and skip the motion search and only do a refinement within a smaller search range of 1. With all-binary representation, XOR operation, and a smaller search window, we can improve the speedup of the motion estimation with only minor loss of the reconstructed image quality.
C. Binarization Efficiency
Since the average spatial variation is used to derive the edge information, the binarization process affects the motion vector accuracy. Thus, the issue is whether the low-pass filter can really register the average spatial characteristics, which are used to compute the threshold for the binarization process in (3). To resolve this issue, we analyze how various filters affect the motion accuracy and the reconstructed picture quality. Table I presents the low-pass filters adopted in this paper. The first set includes three 2-D filters:
, , and
The second set consists of three one-dimensional (1-D) separable 13-tap Hamming filters with distinct cutoff frequency at 20%, 25%, and 30% of Nyquist frequency. For a further reduction of the complexity used in (2), every filter of the second set is a linear-phase FIR filter with coefficients in the form of . For a 1-D filter with taps, i.e.,
, the lowpass frame is computed as (9) where (10) For instance, a five-tap filter with the coefficients of can be implemented with only SHIFT and ADD operations. The typical cost of such operations on various modern computer architectures is one cycle. Thus, this type of filter can achieve significant speedup over the original filters, although it is not designed with a specific filter design methodology. The filters of the second set are as follows in (11)- (13): (11) (12) (13) Fig . 5 shows the frequency responses of the filters compared in this paper. Basically, the low-pass filter with a small cutoff frequency removes less amount of the average spatial variation and retains more high frequency components in the binary representation of , as shown in (3). As shown in Fig. 6 (a)-(c), the frame that is extracted with the filters or retains more high frequency information than the frame that is extracted with the filter . The additional high-frequency information retained may increase the precision of the all-binary edge information. For practical implementation, there is a need to select a tradeoff between the filter tap number and coding efficiency. Additionally, the complexity of the binarization processes can be reducedwith the use of linear-phase FIR separable filters, which can be implemented in a pipelined fashion.
III. PIPELINED ARCHITECTURES: CASE STUDY FOR SEVERAL HARDWARE PLATFORMS
To demonstrate that ABME is desirable for pipelined architectures, we investigate specific implementations to achieve pipelined processing for various architectures, which include general-purpose architectures such as x86, single instruction multiple data (SIMD) architectures using Intel's MMX technology [19] and systolic arrays [20] . In all implementations, we use the filter in (6) for simplicity. To verify the effectiveness of our approach over the FS approach, we use the x86 system and C language for simulations. Finally, we will compare the computational complexity and bus bandwidths of ABME to those of the FS approach. Since the initial locations and the storage units of each level are distinct, the speedup of the block matching is accomplished level by level with our proposed method. The level-by-level implementation allows low memory bandwidth consumption with frequent data access for block matching.
A. Generic Functional Partitions and Software Implementation
Our pipelined architecture contains three major common modules including the integrated construction, compact storage and parallel block matching. For each module, the size and type of the storage are constrained by the data bins/registers defined in the individual architecture, which provides various distinct sizes of data storage devices/registers. To implement each generic module in an x86 system, the use of C language provides three basic data bins including INT (32 bits), SINT Fig. 7 . Parallel processing of the binarization process. Pixel F (x; y) indicates the pixel at the coordinates (x; y) within the processing frame of the lth pyramid level. The shadowed circle represents the current pixel to be binarized.
(16 bits), and UCHAR (8 bits). In the Intel SIMD architecture using MMX technology, for every module, the largest data bin is a 64-bit MMX register. Unless larger data bin is available, the optimization processes for this kind of architecture are similar to that of x86 architectures. For further reductions in memory bandwidth between the processing units and reductions in operation counts on both architectures, the use of the largest data bin is preferable to store the binary representation of several consecutive blocks in a packed way [21] .
1) Integrated Construction:
The first module, integrated construction of the binary pyramid, consists of three processes including filtering, binarization and subsampling processes. This module can be enhanced in two submodules including the boundary partitioning and parallel processing. The basic idea of the boundary partitioning is to skip the branching decision on whether the current data is located at the frame boundaries. The first submodule, boundary partitioning, is achieved by classifying the frame into the nine regions: four corners, two boundary rows, two boundary columns, and the interior. The second submodule, parallel processing, is achieved by loading multiple sets of pixels that are sequentially stored in registers with larger size. In Fig. 7 , we demonstrate an example of using the filter to construct the low-pass frame in (2). Interestingly, the derivation of the low-pass frame using the filter is equivalent to the computation of the average value of the neighboring pixels surrounding the check point with a distance of one.
Assume the frame data are stored in the data bin, named SINT, under C/C++ environment in x86 system. Fig. 7 demonstrates an example of processing the pair . We first load four neighboring pairs, represented as , , , and , from the reference frame into four 32-bit INT registers, which are denoted as , , , and . After loading, the summation and rounding operations can be performed directly, since no overflow or underflow occurs for the frame data because only 8-bit wide data are stored in a 16-bit data bin. After summation and rounding, the results are put back to a 32-bit INT register (14) where the value is used for the rounding purpose of the concurrently processed pixels.
We can derive the threshold for each pixel by extracting the pair of the values inside the register (15) where refers to the logical SHIFT RIGHT operation. The derived threshold is used for binarization in (3) . With this optimization, the construction stage has about a 30% improvement in speed.
If a larger bin is available, the speedup can be increased by a factor that equals to the number of the pixels that can be simultaneously loaded into a single bin. In the Intel SIMD architecture using MMX technology, the binary pyramid can be similarly constructed with the boundary partitioning and parallel processing as described previously for x86 architectures. However, the parallelism achievable based on MMX technology is almost doubled due to the 64-bit architecture. Based on (14) and (15) , the improvement in speed is 63%.
2) Compact Data Storage: The binary representation of frame data results in desirable compact storage for the pyramid layers. At Level 1, each row of a 4 4 block occupies only 4 bits. Hence, one data bin with 4 bits can store the same row for several sequential blocks. At Level 2, the six candidates may be located at several distinct and discontinuous spaces in the memory. The packing of multiple blocks into a single bin, like the blocks of Level 1, is ineffective to improve the efficiency. Thus, for both computer architectures, we store each block of Level 2 with separate data bins. Each block needs eight 8-bit data bins. Using the same concept of Level 1, we pack every row of the successive 16 16 blocks at Level 3 into a single data bin of size 16 . Note that the widest data bin in both architectures is only used for speedup of block matching.
3) Parallel Block Matching: The block diagrams for our fast block matching using all-binary edge information are illustrated in Fig. 1 , and include loading, bit alignment, and (parallel) XOR block matching. The loading module puts each group of sequential data into the corresponding bins of larger size for reducing memory access. Since the frame data in binary format has been compactly and sequentially stored, the memory access becomes a simple fetch instruction. Thus, the loading module loads the current data and the reference data in the search window into the on-chip memory. As shown in Fig. 1 , for each group of four sequential 4 4 blocks at Level 1, we load the same row of the current four blocks into a specified SINT register one by one, and put the corresponding row of the reference blocks into an INT register one by one in x86 architecture. Thus, the four SINT registers can be used for following parallel block matching, and the speedup is increased by about four times in comparison to the block-by-block matching scheme. The same processes are employed in the Intel SIMD architecture, except for the use of 64-bit data bins to handle each group of eight sequential 4 4 blocks at Level 1. Thus, the four MMX registers can be used for a factor of 8 in parallelism as compared to the block-by-block matching method.
The bit alignment module synchronizes the reference data in the search window with the current data. After bit alignment, (parallel) XOR block matching, Table Lookup , and SoD comparison submodules are adopted for finding the motion vector with the minimal SoD. The XOR operation is applied to compute the distortion. To derive the block SoD after the XOR module, we use a table lookup operation in the Table Lookup submodule, by counting the number of "1's" in this temporary register. Finally, comparing all SoDs, we can determine which motion vector is the best for each of the four blocks separately in the SoD Comparison submodule. While going through all blocks of the current frame, the resultant motion vector with the minimal SoD is selected.
The optimization processes of motion vector search in either architecture are similar, except that the 64-bit registers can handle more successive blocks simultaneously and special instructions are used to operate on the registers in SIMD architecture. Thus, only relevant steps in the x86 system are described in detail to show how to achieve such parallelism. The optimization of the (parallel) XOR block-matching modules will be explained in detail in Level 1. For the remaining levels, the relevant descriptions are skipped.
• Level 1: Since each block of Level 1 has the initial location assigned sequentially, we can simultaneously compute SoDs for four blocks in x86 architecture and derive SoDs for eight blocks with MMX technology in a row-by-row manner.
Step 1) Loading: Since we process four neighboring blocks of size 4 4 for the current frame simultaneously, 16-bit data is needed to compute the SoDs for each row. With the search range of 3, the data of the reference frame that needs to be loaded into a data bin should be larger than or equal to 22 bits for parallel block matching.
Step 2) Bit Alignment: For the initially loaded data, we align the reference block with a horizontal offset of 3 relative to the current blocks at Level 1, as shown in Fig. 1 . To move to the next check point, the overlapped area bits are reused by right shifting the bits in the register by one.
Step 3) XOR: The row-wise matching is illustrated in Fig. 1 , where each square indicates a single bit. Since the SINT contains 16 bits, only the lower 16 bits of the reference register will be compared with the current data using XOR. This XOR result is temporarily stored in another SINT register and three 4-bit SHIFT and three AND (as a mask) operations are required to get the SoDs for the four blocks, respectively. Step 4) The overall SoD for each block equals the sum of the SoDs for each row.
Step 5) SoD Comparison: Comparing all SoDs, we can determine which motion vector is the best for each of the four blocks, respectively. • Level 2: For both architectures, we store each block in separate data bins without packing. For each distinct motion vector candidate and each predefined check point, the Level 2 ME computes the block difference in a row-by-row manner. For the current row, we need not only load the required bits within the predefined search range into the registers, but also align the reference data by shifting the registers bits, which is equal to the horizontal offset indicated by the current motion vector. For the consecutive check points in Fig. 4 , extra SHIFT operations are necessary to align the reference data. The bit alignment process for the Level 2 search is shown in Fig. 1 .
• Level 3: The modules including loading and bit alignment for Level 3 blocks in x86 architecture are optimized as follows. In the loading module, one 32-bit register A stores the same row of the reference data at the block and the previous block ( ). The other 32-bit register B stores the same row at block ( ). Both registers contain partial bits of reference data within the specified search window, as demonstrated in Fig. 8 . To align the reference data, the horizontal predictive motion vector is initially set to be 10 as an example. Register A is shifted left by 10, while register B is shifted right by 6 to reach the initial search position. We combine the contents of the two registers with a simple XOR operation. In the last step, the reference bits are aligned to match the initial search location 2. When we move to the next location in the same row, an extra SHIFT operation is needed to synchronize the pair of data for matching. Fig. 9 . Spatio-temporal representation of parallel block matching for each column of check points using systolic arrays, where each bold dot denotes a processor element. The S (:; :) indicates the lth level binary representation of the current frame at time t. The S (:; :) present the reference data at the same pyramid level of the temporally previous frame. For block matching, the block dimension is set as N 2 N and the search range is 6R. The motion vector from the current block to the corresponding reference block is indicated by (u; v).
B. Systolic Arrays
The all-binary representation for each pyramid level reduced the storage from bytes to bits, which can be stored as groups of row or column vectors. Since the vectors are consecutively stored, we should be able to access the information efficiently through pipelining the binary information and overlapping the processing time for each matching process. Since the current block is fixed in the search process, the reference data can be accessed and propagated in the form of pipelines for an ASIC implementation. In short, it is advantageous to employ systolic arrays [20] to design the hardware implementation for ABME.
In Fig. 1 , the XOR block-matching module is optimized with systolic arrays for Level 1 and 3 searches. The realization based on systolic arrays for each column of check points is shown in Fig. 9 , where the spatio-temporal representation of the XOR block matching is defined in (5) . The binary data of both the current block and the reference block can be transported into the processors in an order such that the resultant SoDs can be computed by summing the XOR block-matching criterion in a row-by-row manner. For each block of the current or reference frame, the binary data of each row is stored as a 1-D vector. Each pair of vectors from the current and reference blocks is delivered to the processing element (PE) for computing the SoD, but the current block is further passed to the next PE through the systolic arrays. Consequently, for each column of check points, we will obtain ( ) final SoDs, as shown in Fig. 9 . To cover all check points of size , we compute and compare the SoDs with a pipelined approach, where each PE will handle a specified row of the reference blocks at the same column within the search window. Based on the pipelined approach, we can process each column within the search window sequentially in time. That is, we check every check point located at the first column of the search window and select from these check points the best candidate with the minimal SoD.
The pipeline scans through the subsequent 2 columns using all PEs in the array, and we obtain the final candidate with the minimal SoD among all search points, which leads to the resultant motion vector for the current block. The pipelined architecture requires ( ) PEs, cycles, and memory access to get the reference data and memory access to load the current data from the on-chip memory to compute SoDs of each block with bits, where additional 2 cycles are used for pipeline initialization and each memory access takes bits of the reference block and bits of the current block. The gate counts for constructing ( ) PEs are small, while memory access efficiency poses the challenge.
To further reduce the latency for the memory access, we propose a 2-D parallel block-matching architecture using systolic arrays. The 2-D architecture removes the overhead of loading the overlapping bits within the successive reference blocks by simultaneously fetching all the bits within the reference window into the on-chip memory. From each row of ( ) bits, we then de-multiplex each group of serial bits into the corresponding pipeline. The overall SoD for each pair of block is computed in a PE, which is implemented with detail circuits as shown in Fig. 10 . The proposed PE performs XOR operations of multiple bits in parallel, computes the number of 1's using a decoder and, finally, accumulates the total number of 1's as the SoD. The relationship between the PEs, the current block, and the reference block can be represented as the block diagram in Fig. 10 . The dimension of the input blocks to each PE is related to the block dimensions of the current pyramid level. For example, the block dimension is 4 for Level 1 and is increased to 16 for Level 3. With the 2-D parallel architecture, we require PEs, ( ) cycles, and ( ) memory access to get the reference data and memory access to load the current data from the on-chip memory to derive SoD for a search range of value and each block with bits. Each memory access fetches ( ) bits of the reference block and bits of the current block from the on-chip memory. As compared to the 1-D pipelining architecture, the speedup is ( ) times the computation of SoDs, and each reference data is fetched from the on-chip memory just once, which is the minimal memory access to load the bits into the system. Since ABME requires a small search range for each level, the increase of gate count in realizing the 2-D pipelining architecture is still within a reasonable range. Although we provide an implementation example, it is possible to be more efficient in mapping ABME onto a physical hardware like ASIC or FPGA chips.
C. Complexity Analysis and Bus Bandwidth
In this section, we provide the analysis of computational complexity of ABME on x86 architectures with and without the alternative implementation of parallel XOR block-matching module in Fig. 1 using the 2-D systolic arrays. The analysis is based on how a single block computes its motion vector and the memory access per second for a particular frame rate. The notations of , , , and (fps) denote the frame width, frame height, search range and frames per second, respectively. The block size is assumed to be 16 16 , which is the most commonly used in video compression standards. 
1) Computational Complexity Analysis:
For each block, the traditional FS using sum of absolute difference (SAD) needs to process all 4 search points within search window. Each location takes operations, where the three operations consist of one subtraction, one absolute value and one addition operation for each pair of data. Hence, the computational complexity of FS to obtain a single motion vector is approximated as (17) operations per macroblock. Based on the theoretical best-case scenario for FS using the 32-bit register in the x86 system, obtaining a single motion vector is approximated as (18) operations per macroblock. For the same search area, the total operations required for the proposed architecture without the use of 2-D systolic arrays, labeled as , to compute a motion vector are (19) per macroblock, where , , , and represent the operation counts for the pyramid construction and the motion search at each level of the binary pyramid, respectively. The details of derivation for these four counts in (19) are described in Appendix A.
The use of the 2-D systolic arrays for Parallel XOR block-matching modules at Levels 1 and 3, denoted as , reduces the total operations per macroblock to (20) Comparing the values of , , and , we find that our algorithm outperforms the FS, which is consistent with the computational complexities shown in Table II for various search ranges.
2) Memory Bandwidth Analysis:
In this section, we analyze total memory bandwidth for loading the data from the current and reference frames. For the FS, the total bandwidth consumption in bytes per second is (21) where the and is the memory bandwidth to access the data for the current and reference frames. Assume that the current block is loaded simultaneously into on-chip memory with 16 16 UCHAR bins. Thus, to access the data for the current frame of size requires (22) bytes. Assume that the reference block is loaded simultaneously into on-chip memory with UCHAR bins. When we move to the next block, the data for overlapped area are reused and the bandwidth required is . Thus, to completely load the data from the reference frame needs additional (23) bytes. The first term denotes the first search window, which takes more operations due to the memory stall in the initialization stage of pipelining. Since the rest of the search windows are overlapped with the previous one, fewer operations are needed.
For ABME, the total memory bandwidth consumptions in bytes used in x86 system and in SIMD architecture are, respectively (24) (25)
The indicates the bandwidth consumption for loading the current frame. The remaining terms , , , and denote the bandwidth required for accessing the reference frame. The memory bandwidth is derived with specific conditions and the detailed derivation is described in Appendix B.
Our approach can significantly reduce the bus bandwidth as compared to the FS. As the search range is increased, the memory bandwidth for the FS is increased dramatically, as shown in (23), and those for the ABME are increased slightly only due to the increased and . As opposed to the high sensitivity with various to the memory bandwidth for the FS, we find that ABME is insensitive to the search range variation, since the frame size at Level 1 is the smallest and the data from the consecutive blocks can be stored in a larger bin. Such superior performances are consistent with the observations, as shown in Table II .
IV. EXPERIMENTAL RESULTS
To show the performances of the proposed algorithm over the FS, we use MPEG-4 reference video encoder and employ a macroblock with size 16 16 for block matching. The performance is analyzed based on the following factors: the video sequences, the encoding conditions, various decimation filters, the ME approaches, and the visual quality of the reconstructed video.
The video sequences with CIF format including Coastguard, Foreman, and Akiyo, and the sequences with QCIF format covering Container and Mother-Daughter are used for testing. The six sequences characterize a variety of spatial and motion activities. We further test two CCIR601 sequences, including Table Tennis and Stefan, which consist of fast moving objects. The fast moving objects within a picture of larger size are adopted to examine the performance and the computational load of ABME approaches. Each source sequence consists of 300 frames.
As for the encoding conditions, each sequence is encoded under the conditions recommended by MPEG committee [17] . The target frame rate is set as 10 fps and the bit rates range from 10 kbps to 2 Mbps for various sequences. For finding the precise motion vector, the search range is 16 for each sequence and the range is increased to 32 for the CCIR-601 sequences.
As for the decimation filters, the 2-D filters and 1-D separable filters as summarized in Table I are used to understand the impact of the individual filter on the binarization and coding efficiency of ABME. ABME and FS are used for comparison based on the coding efficiency and the visual quality of the reconstructed video in PSNR values.
Based on the experimental results in Tables II and III and Fig. 11 , we found the following. 1) As compared to the FS, the proposed approach has significantly reduced the computational complexity and the memory bandwidth. The loss of the reconstructed video quality for slow moving sequences like Mother-Daughter and Akiyo is negligible, while for high motion sequences such as Foreman and Stefan, the loss is moderate. The experiments using the high motion sequences and large picture sizes show that the proposed approach has outstanding coding efficiency with a gradually increased cost for computation. 2) As the number of taps of the low-pass filter is increased, the accuracy of binary motion search is increased. This is because the long tap of the low-pass filter can capture Fig. 11 . Comparison of the reconstructed frame from the (a) FS and (b) ABME approaches.
more precise average spatial variation so that the 1-bit information left in the pyramid is more meaningful for block matching. Furthermore, for the 1-D separable filters, smaller cutoff frequency will increase the binarization efficiency by leaving more detailed information into the pyramid levels. Finally, the large taps will increase the computational complexity. Thus, there is a tradeoff between the motion accuracy and the computational cost.
V. CONCLUSIONS
According to the experimental results, the proposed ABME shows satisfactory visual quality. It not only takes the benefits of low computational complexity and low memory bandwidth consumption, but also insensitivity to search range increase. We also demonstrate that with better binarization methods, the visual quality can be further improved. With this feature, our algorithm can provide flexible configurations. System designers can choose the binarization methods depending on the available memory, computational power, display resolution, or data bus bandwidth provided by their system. For example, the characteristics of wireless mobile phones have less computational power, lower display resolution, and less available memory. Thus, the smallest filters should be used for the best execution speed, while its visual quality is still acceptable for the low-resolution display. On the other hand, for a faster machine such as today's personal computers or high-end DSP systems, a filter with better frequency response can be applied, since its computational power can afford more complexity. In addition, various optimization methods can be developed for specific platforms with different register sizes. Thus, ABME is more flexible than other motion estimation algorithms.
We also demonstrated platform-specific optimizations for several hardware architectures, including x86, SIMD using MMX, and systolic arrays. From the operation counts, we showed that ABME is very desirable for software implementation on a general-purpose processor system. We also show that ABME can be realized with a parallel-pipelined implementation for ASIC design and allows tradeoffs between Silicon area, power consumption and visual quality during the hardware design phase. Thus, we conclude that ABME is versatile and effective for multimedia systems in both software and hardware platforms.
APPENDIX
A. Complexity Analysis
With a search window of size 4 and XOR operations for matching a macroblock of size 16 16 , the proposed fast motion estimation approach needs operations in total as shown in (19) . The search range of Level 1 is reduced to and the block sizes from Levels 1 to 3 are 4 , 8 , and 16 , respectively. To construct the binary pyramid as defined by (14) and (15) , the binarization process requires 4.5 operations per pixel on the average. Thus, to build the binary pyramid requires (26) operations per block.
As for the block-matching process, because the data storage techniques are different from level to level, the operations required for each level are analyzed individually. In Level 1, each pack of four blocks stored in the four SINT bins forms a matching unit, which contains the data in the current row where the four blocks are sequentially stored. To complete the derivation of the four SoDs, it takes 15 operations for every row of the block within the search window. The 15 operations include one register shifting for XOR, three AND operations, and three SHIFT operations to extract the four bit-wise SoDs, four table lookup operations, and four addition operations to accumulate the SoDs of the four macroblocks processed concurrently. Hence, the total operations at Level 1 for each macroblock are (27) At Level 2, we analyze the worst case, which means the six candidates plus four tuning locations are nonoverlapped. Basically, there are ten locations that need to be checked in this case. Note that eight UCHAR bins store each row of a Level 2 block. To drive the final SoD, it takes four operations for each location, where the four operations including one SHIFT operation for one XOR, one table lookup and one addition operation to sum the block SoD. Thus, this level costs (28) operations per block.
The computational analysis of Level 3 is similar to Level 2, but with a larger block size and a fixed search range. The number of search locations is 25 and 16 SINT bins cover all rows of a Level 3 block of size 16 16 . Thus, the Level 3 search needs (29) operations per block.
Thus, the total operations required for finding a motion vector using the proposed algorithm is
B. Bandwidth Analysis
For the proposed algorithm in x86 systems without hardware acceleration using the 2-D systolic arrays, the memory bandwidth (bytes) to load every block of the current frame is (30) Based on the reusability of the data already in the registers, the memory bandwidth required for each pyramid level to load reference data within a search window per frame is (31) (32) (33) Thus, the total memory bandwidth consumption of the proposed algorithm is When the XOR matching modules at Levels 1 and 3 are implemented with the 2-D systolic array architecture, the memory bandwidths required for loading reference data within a search window per frame are (34) (35) respectively.
