Abstract-This paper describes a data-interlacing architecture with two-dimensional (2-D) data-reuse for the full-search block-matching algorithm. Based on a one-dimensional processing element (PE) array and two data-interlacing shift-register arrays, the proposed architecture efficiently reuses data to reduce external memory accesses and the pin count. It also achieves 100% hardware utilization and a high throughput rate. In addition, identical chips can be cascaded to support different block sizes, search ranges, and pixel rates.
I. INTRODUCTION
THE block-matching algorithm (BMA) for motion estimation is currently used in various applications. It removes the temporal redundancy within frame sequences and thus provides the corresponding coding systems with significant bit-rate reduction. A straightforward method, the full-search block-matching algorithm (FBMA), is widely used because it gives optimal performance with low control overhead. A number of very large scale integration (VLSI) motion estimators based on the FBMA have previously been reported. Most of them are based on array processors because of the inherent massive parallelism and the high speed requirement of motion estimation. In [1] and [2], a one-dimensional semisystolic architecture is proposed to perform the FBMA. In [3] and [4], two-dimensional (2-D) systolic arrays combined with on-chip line buffers for implementing the FBMA are presented. The architecture used in [3] requires a large number of register elements to store the current block data and the search area data. In [5] and [6], the procedure of mapping the FBMA onto systolic arrays is described. This paper presents a data-interlacing VLSI architecture with 2-D data-reuse to implement the FBMA. The architecture accepts serial data inputs to minimize the pin count and performs parallel processing to sustain the high-throughput requirements. In addition, it adapts to changes in the dimensions of the current block and the search area by cascading identical chips. Moreover, it is simple, regular, and modular, and is thus suitable for VLSI implementation.
II. THE DATA-INTERLACING VLSI ARCHITECTURE
In the FBMA, the sum of absolute differences (SAD) is calculated for each candidate location to find the best match. The proposed architecture uses 16 PE's to perform parallel processing. The E0-E15 registers and the O0-O15 registers are parallel-in parallel-out shift registers. The outputs are the horizontal and vertical components of the estimated motion vector, together with MSAD, the final minimum SAD for the current block.
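As a software reference for the SAD criterion described above, the following minimal sketch performs an exhaustive search over all candidate locations. It is not a model of the proposed hardware. The function names `sad` and `full_search`, the parameter `p`, the displacement range of -p to p-1, and the first-found tie-breaking rule are all assumptions made for illustration.

```python
def sad(cur, area, top, left):
    """Sum of absolute differences between cur and the window of area at (top, left)."""
    return sum(
        abs(cur[i][j] - area[top + i][left + j])
        for i in range(len(cur))
        for j in range(len(cur[0]))
    )


def full_search(cur, area, p):
    """Return (u, v, msad): horizontal and vertical displacement and the minimum SAD.

    Assumption: area has side len(cur) + 2*p - 1 and displacements run over
    -p .. p-1 in both directions; ties keep the first candidate found.
    """
    best = None
    for v in range(-p, p):            # vertical displacement
        for u in range(-p, p):        # horizontal displacement
            d = sad(cur, area, v + p, u + p)
            if best is None or d < best[2]:
                best = (u, v, d)
    return best


# Usage: hide a 2x2 block inside a 5x5 search area (block side 2, p = 2).
area = [[5 * r + c for c in range(5)] for r in range(5)]
cur = [row[3:5] for row in area[1:3]]     # window at top = 1, left = 3
u, v, msad = full_search(cur, area, 2)
print(u, v, msad)
```

Because the current block is copied from the search area at row 1, column 3, the search returns a zero MSAD at the displacement that corresponds to that window.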
In the first 16 cycles, the even-column pixel data and the odd-column pixel data within the search area are stored in the E0-E15 registers and the O0-O15 registers, respectively, as shown in Fig. 2(a). The PE's and the comparator (CMP) are also properly initialized. After this initialization stage, the current block pixels are sequentially input and broadcast to all PE's according to the block-scan mode and the column-scan order [3]. At each cycle, the even-column and odd-column search area data sequences are also sequentially shifted into the E-registers and the O-registers, respectively, in a column-scan order. Every four cycles, the PE's alternately select data from either the E-registers or the O-registers through multiplexers, calculate the absolute pixel differences, and accumulate the results. After 16 cycles, the PE's contain the accumulated block differences for all the possible candidate blocks within the search area. These results are loaded in parallel into the latches and then sent to the CMP one by one. The CMP compares the latched results and outputs the optimum motion vector and the corresponding SAD of the current block. During the comparisons, the PE's continue to perform the block-matching operations for the next current block. The data flow in Table I is continuous for the subsequent blocks.
The same motion estimation chips can be cascaded to operate on a variety of current block sizes and search area sizes. Assume that each motion estimation chip contains a fixed number of PE's. For a given block size and search range, we partition the search area into four subsearch areas, each containing the same number of candidate blocks. Then, each of the chips is designated to handle one subsearch area.
By connecting the last outputs of the E and O registers of one chip to the corresponding inputs of another, the proposed architecture can easily be cascaded to handle a large search area. Fig. 3 shows the connection of four chips. Chip A processes subsearch area A, chip B processes subsearch area B, and so on. These four motion estimation chips work in parallel to estimate the motion vectors within their respective subsearch areas. Then, the results are shifted out of each chip for the final comparison. Since the speed requirement of the final comparison is not critical, a programmable microprocessor is sufficient.
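The four-chip partitioning can be illustrated functionally with the sketch below, which ignores the register-level cascading and only shows that comparing the four per-chip minima yields the same result as a single search over the whole area. The names `chip_search` and `cascaded_search`, the quadrant assignment of candidate positions to chips, and the tie-breaking by position are hypothetical choices, not taken from Fig. 3.

```python
def sad(cur, area, top, left):
    """Sum of absolute differences between cur and the window of area at (top, left)."""
    return sum(
        abs(cur[i][j] - area[top + i][left + j])
        for i in range(len(cur))
        for j in range(len(cur[0]))
    )


def chip_search(cur, area, tops, lefts):
    """Model one chip: best (sad, top, left) over its own candidate positions."""
    return min((sad(cur, area, t, l), t, l) for t in tops for l in lefts)


def cascaded_search(cur, area):
    """Split the candidate positions into four quadrants, one per chip."""
    count = len(area) - len(cur) + 1      # candidate positions per dimension
    half = count // 2
    low, high = range(0, half), range(half, count)
    # Assumed assignment: chips A-D take the four quadrants in raster order.
    results = [chip_search(cur, area, ts, ls)
               for ts in (low, high) for ls in (low, high)]
    return min(results)                   # final comparison outside the chips


# Usage: a 3x3 block inside a 6x6 search area gives 4x4 candidate positions.
area = [[(7 * r + 13 * c) % 31 for c in range(6)] for r in range(6)]
cur = [row[1:4] for row in area[2:5]]
best = cascaded_search(cur, area)
print(best)
```

The final `min` plays the role of the comparison performed after the per-chip results are shifted out. It touches only four values per block, which is consistent with the remark that its speed requirement is not critical.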
The proposed architecture can also handle block matching with flexible block sizes. For instance, block matching for a current block that is divided into two subblocks can be performed by cascading two motion estimation chips. The connection is shown in Fig. 4. Chip A performs the block matching of the upper subblock, and chip B performs the block matching of the lower subblock. Finally, the partial SAD results of the two subblocks at each pair of corresponding search positions are added and compared in the ADD/CMP to obtain the motion vector with the minimum SAD. The operations in the ADD/CMP can also be executed by a programmable microprocessor without any idle cycle.
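The subblock scheme relies on the SAD being additive over rows: the SAD of the whole block equals the sum of the partial SADs of its upper and lower halves at corresponding positions. The sketch below shows this behavior under assumed names (`split_block_search`, `part_a`, `part_b`). It is a functional illustration only and does not model the connection in Fig. 4.

```python
def sad(cur, area, top, left):
    """Sum of absolute differences between cur and the window of area at (top, left)."""
    return sum(
        abs(cur[i][j] - area[top + i][left + j])
        for i in range(len(cur))
        for j in range(len(cur[0]))
    )


def split_block_search(cur, area):
    """Return (msad, top, left) using two partial SADs per candidate position."""
    half = len(cur) // 2
    upper, lower = cur[:half], cur[half:]
    rows = len(area) - len(cur) + 1           # candidate rows
    cols = len(area[0]) - len(cur[0]) + 1     # candidate columns
    best = None
    for top in range(rows):
        for left in range(cols):
            part_a = sad(upper, area, top, left)           # chip A: upper subblock
            part_b = sad(lower, area, top + half, left)    # chip B: lower subblock
            total = part_a + part_b                        # ADD
            if best is None or total < best[0]:            # CMP
                best = (total, top, left)
    return best


# Usage: a 4x4 block inside a 7x7 search area.
area = [[(5 * r + 11 * c) % 29 for c in range(7)] for r in range(7)]
cur = [row[2:6] for row in area[1:5]]
best = split_block_search(cur, area)
print(best)
```

The lower subblock is matched at a row offset of `half`, so the two partial results always refer to the same candidate position of the full block.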
IV. PERFORMANCE ANALYSIS
The comparison of the proposed architecture with other architectures for the FBMA is presented in Tables II and III, which list the compared features for two cases. There are two types of data dependency: 1) the overlap among the adjacent candidate blocks within the search area and 2) the overlap between the search areas of adjacent current blocks. The proposed architecture fully exploits the 2-D data-reuse to perform parallel processing. This leads to a significant reduction in I/O bandwidth and in the pin count. In these two tables, the number of input data pins and the total number of data accesses per block include both the current block data and the corresponding search area data. Fewer data accesses imply a lower demand for I/O bandwidth and pins. Furthermore, since the data flow is continuous, the PE's are busy 100% of the time. Compared to the previously proposed FBMA architectures, this architecture achieves the highest throughput. From the viewpoint of VLSI implementation, the proposed architecture is simple, modular, regular, and cascadable.
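To give a rough sense of how much the two types of overlap matter, the following back-of-the-envelope sketch counts search area pixel reads per current block under three idealized reuse levels. The block side `n`, the range parameter `p`, the displacement range of -p to p-1, and the restriction to horizontally adjacent blocks are assumptions. The numbers are illustrative only and are not the figures reported in Tables II and III.

```python
def search_pixel_reads(n, p):
    """Idealized search area reads per n x n block for displacements -p .. p-1."""
    candidates = (2 * p) ** 2          # candidate blocks in the search area
    side = n + 2 * p - 1               # side of the square search area
    return {
        # Every candidate block fetches all of its pixels again.
        "no_reuse": candidates * n * n,
        # Overlap among candidate blocks: each search area pixel is read once.
        "reuse_within_area": side * side,
        # Overlap between adjacent search areas: only n new columns are read.
        "reuse_across_blocks": n * side,
    }


reads = search_pixel_reads(16, 8)
print(reads)
```

For a block side of 16 and p = 8, the count drops from 65536 reads without reuse to 961 when the overlap among candidate blocks is exploited, and to 496 when the overlap between horizontally adjacent search areas is exploited as well.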
