Abstract-A dedicated cost-effective chip of a three-step hierarchical search (3SHS) motion estimator to support the NTSC resolution video in real time is proposed. The memory interleaving technique is developed to overcome the 3SHS's inherent problem of complicated data addressing and interconnection due to the variable distance between candidate locations and unpredictable data requirements. Based on a cyclic-pipeline utilization of memory, the memory size and bandwidth requirements can be reduced significantly. With 0.8 µm CMOS technology, the proposed chip requires a die size of 6.9 × 5.9 mm² with 120K transistors, and is able to operate at a clock rate of more than 50 MHz.
I. INTRODUCTION
Motion estimation (ME) based on the block matching algorithm (BMA) is the most popular compression technique associated with motion detection and has been shown to be efficient at improving video quality and compression rates. Among BMA's, the full search (FS) method gives the optimal solution, and its regular data flow and low control overhead make it very suitable for VLSI implementation. Many reported architecture designs and chips [1]-[3] focus on efficiently reusing data to decrease external memory access, improving performance by massively parallel processing and pipelining, and providing a flexible scheme with modular processing elements. However, the inherently massive computation of FS has motivated the development of many fast BMA's that reduce computational complexity. The three-step hierarchical search (3SHS) [4], [7] has been experimentally shown to be robust and valid among such fast BMA's, and is recommended by the video coding standards of CCITT RM8 [5] and MPEG SM3 [6]. The modified 3SHS [7], [8] can provide a low-complexity realization and high-accuracy decoding quality for various video applications. The computation is still intensive when real-time operation or higher data rate coding is considered, such as MPEG2, video-on-demand (VOD), and high-definition television (HDTV).
To achieve high performance and real-time operation, dedicated designs realizing the 3SHS block-matching algorithm [8]-[11] are attractive for low-cost implementation in various video applications, compared to the high-cost FS motion-estimation processors [1], [2]. For systems ranging from low bit-rate video to HDTV, a family of parallel architectures for the 3SHS block-matching algorithm was developed [9], which can be realized with 3, 9, or 27 processing elements (PE's) based on intelligent data arrangement and memory configuration. By reducing interconnections and external memory accesses, these designs achieve low cost, high speed, and low memory bandwidth. To support real-time videoconferencing standards, Costa et al. proposed a 3SHS-based motion estimation unit with a 3-PE realization that minimizes the necessary hardware functional units and reduces implementation costs [10]. Yeo and Hu proposed a modular linear array of nine PE's that simplifies the interconnection between the PE's and the memory modules by exploiting the search area data dependency between two neighboring candidate blocks [11]. For practical VLSI implementation, however, the intractable problem of data addressing, caused by the variable distances between candidate locations and the unpredictable data requirements, complicates the control circuit significantly. In addition, two difficulties remain: 1) simplifying the interconnection network in the presence of the various data flows among candidate locations; and 2) reducing the pipeline computation latency between steps for better performance. To address these problems, the proposed addressing strategy uses pseudoresidual permutation, which can be realized directly with low-cost random logic.
A memory interleaving technique is introduced into the PE allocation to derive a regular structure and hence reduce the interconnection complexity. To keep the pin requirements of the I/O system moderate and to improve the performance of the step-pipelined computation, the memory is used in a cyclic-pipeline manner. This brief develops a low-cost 3SHS-based motion estimator chip with a 9-PE realization that processes ITU-R 601 (720 pixels × 480 lines, 30 frames/s) video in real time with a search range of ±7 × ±7 pixels. The next section describes the 3SHS algorithm. Section III describes the architectural features and functionality of the proposed 3SHS ME chip, focusing on the interleaving technique between memory and PE's and on the random-logic control circuits, including the practical cyclic-pipeline memory and the pseudoresidual addressing. The extension to large search ranges is also discussed. Chip characteristics and a brief comparison with two other chips [3], [8] are given in Section IV, and Section V concludes the brief.
II. ALGORITHM
The block-matching process is performed on the basis of the minimum distortion, measured as the mean absolute difference (MAD) between the pixel values of any two blocks:
$$\mathrm{MAD}(u, v) = \frac{1}{N^2} \sum_{i=0}^{N-1} \sum_{j=0}^{N-1} \left| S_c(x+i,\, y+j) - S_p(x+i+u,\, y+j+v) \right|, \qquad -p \le u, v \le p$$
where $N$ is the block size, $S_c$ is the current frame block whose left-top pixel is at the coordinate $(x, y)$, $S_p$ is the previous frame block with a displacement of $(u, v)$, and $p$ is the maximum displacement in pixels along both the horizontal and vertical directions. This procedure is repeated for all the blocks within the search area ($-p \le u, v \le p$); the motion vector is then the $(u, v)$ at which the MAD has the minimum value.
0018-9200/98$10.00 © 1998 IEEE
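The matching criterion and the exhaustive search it implies can be sketched in a few lines of Python. This is only an illustration under stated assumptions: frames are taken to be lists of rows of integer pixel values, and the names `mad`, `full_search`, `cur`, `prev`, and the arguments `x`, `y`, `u`, `v`, `n`, `p` are chosen here for readability rather than taken from the brief. No boundary handling is included, so the block plus its search range must lie inside the frame.

```python
def mad(cur, prev, x, y, u, v, n=16):
    """Mean absolute difference between the n-by-n block of `cur` whose
    top-left pixel is at column x, row y, and the block of `prev`
    displaced by (u, v). Frames are indexed as frame[row][column]."""
    total = 0
    for j in range(n):          # rows of the block
        for i in range(n):      # columns of the block
            total += abs(cur[y + j][x + i] - prev[y + j + v][x + i + u])
    return total / (n * n)


def full_search(cur, prev, x, y, p=7, n=16):
    """Evaluate all (2p + 1)**2 displacements and return the (u, v)
    with the smallest MAD (illustrative full-search baseline)."""
    candidates = [(u, v) for v in range(-p, p + 1) for u in range(-p, p + 1)]
    return min(candidates, key=lambda d: mad(cur, prev, x, y, d[0], d[1], n))
```

For a maximum displacement of 7 pixels the candidate list holds 15 × 15 = 225 displacements, which is the computational load that the fast algorithms aim to reduce.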
To find the best motion vector, the 3SHS algorithm uses a coarse-to-fine strategy to reduce the heavy computational cost resulting from the massive number of candidate locations. The algorithm limits the number of checking points in a search area. Fig. 1 illustrates the basic procedure of 3SHS for a maximum displacement of 7 pixels, with an example motion vector of (2, 7). The first step computes the MAD's of the blocks corresponding to the nine positions, and the one producing the minimum MAD is taken as the minimum distortion position. In the second step, the search is focused on the area centered at the point selected in the previous step, but the distance between any two adjacent candidate locations is halved. This procedure continues until the distance converges to one pixel, so the final motion vector is derived in the third step. For a search range of ±7 × ±7 pixels, the hierarchical search procedure reduces the number of search locations to 1/9 of that of the FS approach [5], [6].
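The coarse-to-fine procedure described above can be sketched as follows. The sketch is an assumption-laden illustration, not the chip's data path: frames are lists of rows of integers, the function and argument names are hypothetical, and the sum of absolute differences is used in place of the MAD because the constant divisor does not change which candidate wins. Nine candidates are evaluated per step, as the nine PE's would; the centre of steps 2 and 3 repeats an already visited location, so the number of distinct locations is 9 + 8 + 8 = 25.

```python
def sad(cur, prev, x, y, u, v, n=16):
    """Sum of absolute differences for displacement (u, v); ranks
    candidates exactly as the MAD does."""
    return sum(abs(cur[y + j][x + i] - prev[y + j + v][x + i + u])
               for j in range(n) for i in range(n))


def three_step_search(cur, prev, x, y, n=16, first_distance=4):
    """Return ((u, v), distinct_locations) for a three-step search whose
    candidate spacing is 4, 2, then 1 pixel (search range of +/-7)."""
    centre_u, centre_v = 0, 0
    distance = first_distance
    visited = set()
    while distance >= 1:
        best = None
        # Nine candidates on a 3 x 3 grid around the current centre.
        for dv in (-distance, 0, distance):
            for du in (-distance, 0, distance):
                u, v = centre_u + du, centre_v + dv
                visited.add((u, v))
                cost = sad(cur, prev, x, y, u, v, n)
                if best is None or cost < best[0]:
                    best = (cost, u, v)
        _, centre_u, centre_v = best   # winner becomes the next centre
        distance //= 2                 # halve the spacing: 4 -> 2 -> 1
    return (centre_u, centre_v), len(visited)
```

The 25 distinct locations compare with 15 × 15 = 225 for the full search, which is the 1/9 ratio quoted in the text.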
III. THE PROPOSED 3SHS ARCHITECTURE

A. Overview
The proposed chip is designed to perform motion estimation with a block size of 16 × 16 pixels and a search range of ±7 × ±7 pixels in real time for NTSC resolution video (720 × 480 pixels, 30 frames/s). The architecture of the proposed 3SHS ME chip is based on a 9-PE realization, as shown in Fig. 2, and is mainly composed of nine memory banks, nine PE's, and the control logic. The nine PE's evaluate the nine candidate locations simultaneously at each step. The memory banks store the reference image data (the required size is 48 × 30 bytes) for prediction.
To assure that the nine PE's can read the appropriate data from nine different memory banks concurrently, the corresponding reference pixels must be stored into the memory banks beforehand in an interleaved way. The three-step interleaving process for storing is illustrated in Fig. 3, in which the index number denotes the memory bank accessed by the pixel data, and a number marked with a small circle indicates the memory bank loaded at step 1. Numbers marked with a small rectangle and a small triangle indicate the memory banks loaded at steps 2 and 3, respectively. Each step needs 256 clock cycles to evaluate a candidate location, and this computation period is partitioned into nine stages. At each stage, each of the nine PE's computes a partial sum for a different candidate location, and each PE moves to a different candidate location at every stage. To achieve this, the predicted data are not input in raster-scan order but by pseudoresidual addressing. With this scheme, the interconnection complexity between the memory banks and the PE's is reduced. At the end of each step, the ACC unit (in Fig. 2) spends nine clock cycles accumulating the PE outputs to obtain the proper sums, and the MIN unit needs one clock cycle to find the minimum among the nine sums and generate the winner index of the current step. The address generator in the CONTROL unit then uses this information to select the base address of the locations of interest in the next step. The other 11 clock cycles are spent in the CONTROL unit on various control signals and address generation. Therefore, a block is processed every 834 clock cycles, except for the first block in the slice, which also requires the initial loading of reference data. The number of clock cycles per frame follows from these two figures.
For a frame rate of 30 frames/s, the clock count per frame sets the minimum running speed. To increase memory utilization, the overlap between the search ranges of successive blocks is exploited, as shown in Fig. 3. In the figure, three blocks of size 16 × 16 (the left block with a bold outline, the central block with a dashed outline, and the right block with a dotted outline) denote the order in which the memory is used, from left to right. First, when the bold-outline block is processed, SEC0 and SEC1 are read, while SEC2 is written with search data for the next reference. Second, when the dashed-outline block is processed, SEC1 and SEC2 are read while SEC0 is loaded with the next set of data. Finally, when the dotted-outline block is processed, SEC2 and SEC0 are read while SEC1 is fed with new data. With this cyclic-pipeline procedure, the required memory size for each bank is 160 bytes [i.e., (30 × 16 × 3)/9] with two-dimensional addressing and dual-port access, and the memory bandwidth requirement is reduced significantly.
Fig. 2. Architecture of the proposed 9-PE 3SHS ME chip.
Fig. 3. The scheme for memory interleaving and cyclic-pipeline utilization, where the block size is 16 × 16 and the search range is ±7 × ±7 pixels.
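The rotation of the three memory sectors and the resulting bank size can be written out as a short Python sketch. The sector numbering (0 for SEC0, and so on) and the bank-size arithmetic follow the text; the function name `sector_schedule` and the constant names are hypothetical and introduced only for this illustration.

```python
SECTOR_COUNT = 3      # SEC0, SEC1, SEC2
SECTOR_WIDTH = 16     # pixels per sector row (one block width)
SEARCH_ROWS = 30      # 16-pixel block height plus 7 rows above and below
BANK_COUNT = 9        # one memory bank per PE


def sector_schedule(block_index):
    """Return ((read_a, read_b), write) sector numbers for the
    block_index-th block processed in a slice (0-based)."""
    read_a = block_index % SECTOR_COUNT
    read_b = (block_index + 1) % SECTOR_COUNT
    write = (block_index + 2) % SECTOR_COUNT   # filled for the next block
    return (read_a, read_b), write


# Total reference storage spread evenly over the nine banks.
bytes_per_bank = SEARCH_ROWS * SECTOR_WIDTH * SECTOR_COUNT // BANK_COUNT
```

Calling `sector_schedule(0)`, `sector_schedule(1)`, and `sector_schedule(2)` reproduces the three cases listed in the text, after which the pattern repeats. Because the sector being written is never one of the two being read, loading of new reference data can overlap with the search of the current block.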
To accumulate the mean absolute difference (MAD) of the pixels, each PE first calculates the partial sum for those pixels stored in its corresponding memory bank. The PE structure is shown in Fig. 4. Since all PE's write their results to one bus line, a tristate latch is introduced into each PE. The output-enable signal, with nine bit lines, is supplied by the CONTROL unit (refer to Fig. 2), and only one of these bit lines is active at any time. Because the data of one candidate must be processed by 9-PE multitasking, each step has nine active states (states 0-8), corresponding to the nine stages, and one transient state. Each PE thus calculates the partial sums for the nine candidates during the nine states. The output of the MIN (minimum extractor) unit is in row-column notation and indicates the winner of the current step.
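To illustrate why an interleaved storage order allows nine concurrent reads, the following Python sketch assumes a simple 3 × 3 modular mapping from pixel position to bank. This mapping is an assumption made here for illustration only; the assignment actually used by the chip is the one defined in Fig. 3. Under the assumed mapping, the nine candidates of any step (spacing 4, 2, or 1) always fall into nine different banks, because none of these spacings is a multiple of 3. The helper `acc_and_min` is likewise a hypothetical model of the ACC and MIN units, in which the table `partials[pe][cand]` holds the partial sum that PE `pe` produced for candidate `cand`.

```python
from collections import Counter


def bank_of(row, col):
    """Assumed interleaving: bank index 0..8 from position modulo 3."""
    return 3 * (row % 3) + (col % 3)


def banks_hit(row, col, d):
    """Banks addressed when all nine candidates with spacing d read the
    pixel at the same relative position (row, col)."""
    return sorted(bank_of(row + dv, col + du)
                  for dv in (-d, 0, d) for du in (-d, 0, d))


def conflict_free(d, rows=30, cols=48):
    """True if every simultaneous nine-way read uses nine distinct banks."""
    return all(banks_hit(r, c, d) == list(range(9))
               for r in range(rows) for c in range(cols))


def acc_and_min(partials):
    """Accumulate per-PE partial sums per candidate, then return the
    totals and the winner in (row, column) notation on the 3 x 3 grid."""
    totals = [sum(partials[pe][cand] for pe in range(9)) for cand in range(9)]
    winner = min(range(9), key=totals.__getitem__)
    return totals, divmod(winner, 3)


# Pixels per bank for the 48 x 30 reference area under the assumed mapping.
pixels_per_bank = Counter(bank_of(r, c) for r in range(30) for c in range(48))
```

With this mapping each bank receives 160 of the 48 × 30 reference pixels, which agrees with the bank size given in the text, whereas a spacing of 3 would place all nine candidates in the same bank.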
B. Control Circuits
CONTROL (unit) can be partitioned into four submodules: read address generator (RAG), write address generator (WAG), control signal generator (CSG), and motion vector generator (MVG). As mentioned in Section III-A, the addressing calculation is a computation of pseudoresidual permutation. RAG is responsible for signal decoding and for addressing the reference data, and its work must be completed before the next state or step. Within each step, given the current state index (i.e., 0 to 8) and the location index (i.e., the index of the candidate), the next memory bank index and the address for each location are given by (1), a modulo computation, and (2). To compute these values for the next step, the corresponding memory bank and address are given by (3) and (4), which also involve modulo computations and depend on the winner index and on a parameter that takes a different value for steps 1, 2, and 3. Fig. 5 describes the structure of RAG. Two queue buffers store the old and new module indexes for the nine locations, while two address buffers store the old and new addresses for the corresponding modules. When the corresponding enable signal is asserted, the new address values are calculated one module at a time and are then written into the new buffers. The current address values cannot be updated until the update signal (up-add) is enabled. The addressing for the next step depends on the winner of the current step, and there are nine values to be computed.
WAG generates the write address and the write-enable signal for each memory bank, as shown in Fig. 2. During the initial loading of reference data, WAG is always active, while RAG is idle. RAG then begins to work, and WAG is active only during the first 540 of the 834 clocks for each block. CSG, which is mainly composed of various counters, sends signals to control all the data paths, including RAG and WAG. MVG takes the winner index of each step and updates the value of the motion vector. The motion vector is initialized to zero, and after each step the displacement of the winner is accumulated into it.
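The figure of 834 clocks per block allows a rough lower bound on the clock rate to be worked out. The following Python sketch is a back-of-the-envelope estimate under a stated assumption: the extra clocks needed for the initial loading of reference data are not counted, so the true requirement is somewhat higher than the value computed here. All constant names are hypothetical.

```python
BLOCK_SIZE = 16
FRAME_WIDTH, FRAME_HEIGHT = 720, 480   # NTSC resolution used in the text
FRAME_RATE = 30                        # frames per second
CLOCKS_PER_BLOCK = 834                 # per-block figure from the text
WAG_ACTIVE_CLOCKS = 540                # clocks per block in which WAG writes

blocks_per_frame = (FRAME_WIDTH // BLOCK_SIZE) * (FRAME_HEIGHT // BLOCK_SIZE)
clocks_per_frame = blocks_per_frame * CLOCKS_PER_BLOCK   # excludes initial loading
min_clock_hz = clocks_per_frame * FRAME_RATE             # lower bound only

# Fraction of each block period during which new reference data is written.
wag_duty = WAG_ACTIVE_CLOCKS / CLOCKS_PER_BLOCK
```

Under this assumption there are 1350 blocks per frame and the bound is about 33.8 MHz, which leaves a margin below the 50 MHz clock rate reported for the chip.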
C. Extension to Large Search Ranges
Although the 3SHS was originally proposed for low bit rate video and inherently covers a search range of ±7 × ±7 pixels, larger search ranges can be obtained by simply increasing the number of steps, as suggested by MPEG SM3 [6]. However, the distance between checked points at the early steps grows exponentially as the number of steps increases. As a result, the probability of being trapped in a local minimum rises significantly, and the accuracy is dramatically reduced. To improve the accuracy, Jong et al. [7] proposed a scalable overlapping strategy which uses several independent 3SHS's, each with a search range of ±7 × ±7 pixels, to cover the required large search range. In a similar way, 5 or 17 of the proposed 3SHS ME's can be integrated into a new motion estimator with a search range of ±15 × ±15 pixels or ±30 × ±30 pixels, respectively, as shown in Fig. 6. The experimental results of [7] reveal that the central 3SHS detects small motion vectors lying near the boundaries of the neighboring 3SHS's, and thus can significantly improve performance. To achieve half-pel precision, some additional control circuits are required for the proposed architecture to perform an extra step of hierarchical search over a ±0.5 × ±0.5 pixel search range centered on the motion vector obtained from the third step of the 3SHS. With this modification, the proposed 3SHS ME architecture can be applied to MPEG2 video encoding.
IV. CHIP IMPLEMENTATION
Fig. 7 shows the chip microphotograph. The SRAM occupies about one third of the core area, and RAG is the largest macrocomponent. The chip is fabricated in a 0.8 µm CMOS, double-metal technology. The die measures 6.9 × 5.9 mm² and contains about 120K transistors, and the power dissipation is 0.35 W at a clock rate of 50 MHz. The pin count is 48, composed of two pairs of internal power pads, four pairs of external power pads, 18 input pads, and 18 output pads.
Table I lists the features of the proposed chip and of other recent ME chips [3], [8]. In [3], a host processor is designed to control sophisticated processes for multiple motion estimation algorithms and various video coding standards, and a PE design with a carry-skip adder structure is used to reduce the transistor count and to enhance speed, but the total number of PE's (72) is comparatively high. To achieve real-time motion estimation and compensation for the MPEG2 standard, Suguri et al. [8] used a three-step hierarchical telescopic search algorithm with an embedded RISC processor to control the operation schedule. Its wide search range of ±32.5 × ±32.5 pixels demands a high memory requirement of 20 Mb and a high pin count of 340. The proposed 3SHS ME chip uses low-cost random logic to realize the controller, and needs only nine PE's and 48 pins.
V. CONCLUSION
This brief develops a dedicated cost-effective 3SHS motion estimator chip. Based on a 9-PE realization, the chip can process video data of NTSC resolution in real time with a search range of ±7 × ±7 pixels. The memory interleaving technique is introduced to enhance the regularity of the structure. The memory size and bandwidth requirements are both reduced significantly by using the memory in a cyclic-pipeline manner. The inherently complicated addressing problem of the 3SHS ME is solved by pseudoresidual permutation, which is realized with random logic. The experimental results indicate that the proposed 3SHS ME chip is attractive for low-cost designs in various video applications, such as VOD and MPEG2.
