Abstract-This paper presents an architectural enhancement that reduces the power consumption of full-search block-matching (FSBM) motion estimation. Our approach is based on eliminating unnecessary computation using conservative approximation. Adding the estimation technique to a conventional systolic-architecture-based VLSI motion estimator reduces the power consumption by a factor of 2, while still preserving the optimal solution and the throughput. A register-transfer level implementation and simulation results on benchmark video clips are presented.
I. INTRODUCTION

Recently, the market for portable multimedia applications, such as MPEG video cameras, wireless video phones, and portable wireless multimedia terminals, has been on the rise. Consequently, low-power VLSI video compression processors are in demand. Typical video compression processors today include VLSI motion estimators that implement the full-search block-matching (FSBM) algorithm.
In the block-matching motion estimation, the motion vector is the displacement of a macroblock with the minimum distortion from the reference macroblock. The full-search block-matching algorithm determines the motion vector by identifying a macroblock with the minimum distortion from a pool of all possible candidate blocks in the search area. The FSBM algorithm thus offers the optimal solution; however, existing implementations of this algorithm are computationally expensive and power hungry because they typically compute the distortion values of all possible candidate macroblocks.
To reduce the computational complexity and the power consumption of motion estimation, several fast block-matching algorithms, such as two-dimensional (2-D) logarithmic search [2], three-step search [3], and conjugate direction search [4], have been proposed. Although these approaches reduce power consumption, they result in suboptimal solutions because the search spaces are necessarily reduced. References [5]-[7] have suggested ways of lowering power consumption by implementing the binary level matching criterion. Lowering the supply voltage in the motion estimator to save power at the circuit level has been proposed in [8]. Reference [9] presents a single-chip MPEG2 design featuring low-power motion estimation, which results from search window management and demand clocking. A low-power chipset for portable multimedia was introduced for one-way full-motion video in [10]. Reference [11] gives an overview of video coding VLSI's, focusing on the power consumption reduction. However, none of the above addresses the issue of lowering power consumption of the block-matching algorithm at the architectural level without sacrificing the optimality of the solution.
Manuscript received July 20, 1997; revised February 20, 1998. This paper was recommended by Associate Editor N. Ranganathan.
The authors are with the Department of Electrical and Computer Engineering, University of California at San Diego, La Jolla, CA 92093-0407 USA.
Publisher Item Identifier S 1051-8215(98)05760-7.
In this paper, we introduce an architectural enhancement to reduce the power consumption of full-search block-matching motion estimation. Our approach to combating the high power consumption of FSBM motion estimation is based on eliminating unnecessary computation using conservative approximation. Finding a macroblock with the minimum distortion is typically a sequential process. Our approach computes a conservative estimate of the exact distortion value for each candidate macroblock before computing the exact distortion. If the conservative estimate of the distortion is larger than the minimum distortion found so far, this candidate is removed from consideration in finding the minimum, i.e., its exact distortion need not be computed. As long as the power consumed in computing the estimate is negligible, the percentage of skipped distortion computations turns into net power savings. We show that adding this conservative approximation technique to a conventional systolic-architecture-based VLSI motion estimator [1] reduces the power consumption by a factor of 2, while preserving the optimal solution and the throughput.
II. FULL-SEARCH BLOCK-MATCHING PROCESS
In full-search block-matching motion estimation, each reference macroblock of size $N \times N$ pels is compared to all of the candidate macroblocks in search area $S$ to determine the best match, as shown in Fig. 1. We use a common match criterion: the candidate block with the minimum amount of distortion (the sum of absolute differences in luminance values) is considered the best match. The distortion $D(m,n)$ for the candidate macroblock at position $(m,n)$, assuming that the size of a macroblock is $N \times N$, is defined as
$$D(m,n) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| x(i,j) - y(i+m,\, j+n) \right| \tag{1}$$
where $x(i,j)$ and $y(i+m, j+n)$ are luminance values at position $(i,j)$ of the reference macroblock and at position $(i,j)$ in the candidate macroblock $(m,n)$ in search area $S$, respectively. The motion vector $\mathbf{v}$ is defined as the displacement of the candidate macroblock with the minimum distortion $D_{\min}$, relative to the reference macroblock:
$$\mathbf{v} = \arg\min_{(m,n) \in S} D(m,n), \qquad D_{\min} = \min_{(m,n) \in S} D(m,n) \tag{2}$$
To be precise, $S$ is the set of candidate motion vectors $(m,n)$ in the search area. Informally, we will also use $S$ to refer to the search area itself, which includes all of the pels of all of the candidate macroblocks.
A block-matching process with a search range has a search area of pels and candidate macroblocks in each horizontal and vertical direction, or a total of candidate macroblocks for each reference macroblock. The distortion value is computed for each candidate macroblock, and the minimum value is found from the pool of candidates. The block-matching process generates a motion vector and the corresponding distortion value .
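The exhaustive search described in this section can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described later. The function names `sad` and `full_search` are hypothetical, and candidate offsets are assumed to be measured from the top-left corner of the search area (0-based) rather than centered on the reference macroblock.

```python
def sad(ref, search, m, n):
    """Exact distortion: sum of absolute luminance differences between the
    reference macroblock and the candidate whose top-left pel is at (m, n)."""
    size = len(ref)
    return sum(
        abs(ref[i][j] - search[i + m][j + n])
        for i in range(size)
        for j in range(size)
    )


def full_search(ref, search):
    """Evaluate every candidate and return (motion_vector, min_distortion).

    Assumes a square reference macroblock and a square search area; ties are
    resolved in favor of the first candidate met in raster-scan order.
    """
    size = len(ref)
    span = len(search) - size + 1  # number of candidates per direction
    best_vec, best_d = None, None
    for m in range(span):
        for n in range(span):
            d = sad(ref, search, m, n)
            if best_d is None or d < best_d:
                best_vec, best_d = (m, n), d
    return best_vec, best_d
```

Every candidate costs one full distortion evaluation here, which is the cost the conservative approximation aims to avoid.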
III. CONSERVATIVE APPROXIMATION
A. Conservative Estimate of Exact Distortion
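In order to reduce power consumption, our approach replaces (1) with a simpler function (a conservative estimate), and uses the estimate in the process of finding $D_{\min}$ while still ensuring that the exact $D_{\min}$ is found. By removing one summation term from (1), we have
$$\sum_{j=1}^{N} \left| x(i,j) - y(i+m,\, j+n) \right| \tag{3}$$
Using the triangle inequality, (3) becomes
$$\sum_{j=1}^{N} \left| x(i,j) - y(i+m,\, j+n) \right| \ge \left| \sum_{j=1}^{N} x(i,j) - \sum_{j=1}^{N} y(i+m,\, j+n) \right| \tag{4}$$
Based on this result, we define the conservative estimate of $D(m,n)$ as
$$\hat{D}(m,n) = \sum_{i=1}^{N} \left| \sum_{j=1}^{N} x(i,j) - \sum_{j=1}^{N} y(i+m,\, j+n) \right| \tag{5}$$
$\hat{D}(m,n)$ can be rewritten as
$$\hat{D}(m,n) = \sum_{i=1}^{N} \hat{d}_i(m,n) \tag{6}$$
using the partial distortion estimate for the $i$th row of candidate macroblock $(m,n)$:
$$\hat{d}_i(m,n) = \left| \sum_{j=1}^{N} x(i,j) - \sum_{j=1}^{N} y(i+m,\, j+n) \right| \tag{7}$$
From (1) and (3)-(5),
$$\hat{D}(m,n) \le D(m,n) \tag{8}$$
i.e., the estimate $\hat{D}(m,n)$ is never greater than the exact distortion $D(m,n)$. Moreover, computing $\hat{D}(m,n)$ can be made considerably simpler than computing $D(m,n)$. In other words, computing $\hat{D}(m,n)$ can be made to consume less power than computing $D(m,n)$.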
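The bound can be checked numerically. The sketch below assumes that the estimate is formed row by row: the absolute difference between the sum of a reference row and the sum of the matching candidate row, accumulated over all rows. The names `row_sum_estimate`, `exact_distortion`, and `bound_holds` are hypothetical, and the random data merely stand in for luminance values.

```python
import random


def exact_distortion(ref, search, m, n):
    """Sum of absolute differences for the candidate at offset (m, n)."""
    size = len(ref)
    return sum(
        abs(ref[i][j] - search[i + m][j + n])
        for i in range(size)
        for j in range(size)
    )


def row_sum_estimate(ref, search, m, n):
    """Conservative estimate: per-row |sum(reference) - sum(candidate)|."""
    size = len(ref)
    total = 0
    for i in range(size):
        ref_row = sum(ref[i])
        cand_row = sum(search[i + m][n:n + size])
        total += abs(ref_row - cand_row)
    return total


def bound_holds(trials=100, size=4, span=3, seed=0):
    """Return True if estimate <= exact distortion on every random trial."""
    rng = random.Random(seed)
    side = size + span - 1  # search-area side length for `span` offsets
    for _ in range(trials):
        ref = [[rng.randrange(256) for _ in range(size)] for _ in range(size)]
        search = [[rng.randrange(256) for _ in range(side)] for _ in range(side)]
        for m in range(span):
            for n in range(span):
                if row_sum_estimate(ref, search, m, n) > exact_distortion(ref, search, m, n):
                    return False
    return True
```

The estimate can be far below the exact value: a row `[0, 10]` matched against `[10, 0]` has a row-sum difference of 0 but an exact distortion of 20. Such cases cost extra exact computations but never an incorrect result.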
B. Power Consumption Reduction
Our approach to reducing power consumption is to reduce the total computational requirement of the block-matching process. Specifically, we avoid computing $D(m,n)$ when it is unnecessary, i.e., when the estimate $\hat{D}(m,n)$ is greater than or equal to the current minimum distortion found so far. As long as the power savings obtained by eliminating distortion function computations are greater than the extra power consumption introduced by computing the estimate function, there are net savings in power consumption. Hardware requirements for computing $D(m,n)$ and $\hat{D}(m,n)$ will be described in detail in Section IV. Fig. 2 illustrates a situation in which the exact distortion function computation is avoided.
We define the current minimum distortion $D_{\min}(S')$ as the minimum distortion value found in the area searched so far, $S'$:
$$D_{\min}(S') = \min_{(m,n) \in S'} D(m,n) \tag{9}$$
where $S' \subseteq S$. Clearly, from (2) and (9),
$$D_{\min}(S') \ge D_{\min} \tag{10}$$
If $\hat{D}(m,n) \ge D_{\min}(S')$ for $(m,n) \in S - S'$, then there is no need to compute $D(m,n)$ because $D(m,n) \ge \hat{D}(m,n) \ge D_{\min}(S') \ge D_{\min}$.
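The skipping rule can be illustrated with a software model of the search loop. This is a sketch under assumptions, not the authors' hardware: the estimate is taken to be the row-sum form, offsets are 0-based from the top-left corner of the search area, and the names `conservative_estimate`, `exact_distortion`, and `pruned_search` are hypothetical.

```python
def exact_distortion(ref, search, m, n):
    """Sum of absolute differences for the candidate at offset (m, n)."""
    size = len(ref)
    return sum(
        abs(ref[i][j] - search[i + m][j + n])
        for i in range(size)
        for j in range(size)
    )


def conservative_estimate(ref, search, m, n):
    """Lower bound on the exact distortion, built from row sums."""
    size = len(ref)
    return sum(
        abs(sum(ref[i]) - sum(search[i + m][n:n + size]))
        for i in range(size)
    )


def pruned_search(ref, search):
    """Full search that skips the exact distortion whenever the estimate is
    already >= the current minimum.

    Returns (motion_vector, min_distortion, skipped_count).
    """
    size = len(ref)
    span = len(search) - size + 1
    best_vec, best_d, skipped = None, None, 0
    for m in range(span):
        for n in range(span):
            if best_d is not None and conservative_estimate(ref, search, m, n) >= best_d:
                skipped += 1  # exact distortion cannot beat the current minimum
                continue
            d = exact_distortion(ref, search, m, n)
            if best_d is None or d < best_d:
                best_vec, best_d = (m, n), d
    return best_vec, best_d, skipped
```

Because a candidate is skipped only when its lower bound already rules it out, the minimum distortion returned is the same as that of the exhaustive search. The fraction of skipped candidates depends on the scan order and on the video content.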
IV. LOW-POWER VLSI ARCHITECTURE
In this section, we present a VLSI architecture that reduces power consumption in performing motion estimation. As described in Section III-B, we avoid computing $D(m,n)$ if the estimate $\hat{D}(m,n)$ is greater than or equal to the current minimum distortion $D_{\min}(S')$. The motivation is as follows. If the reduction in switching activity enabled by avoiding the $D(m,n)$ computation outweighs the increase in switching activity due to the $\hat{D}(m,n)$ computation, then the net power consumption is lower.
In our architecture, the motion estimator computes the estimate $\hat{D}$ of a candidate's distortion, using the data available for computing its exact distortion $D$, while the motion estimator is computing the exact distortion of a preceding candidate. It is possible to do this because the set of data required for computing $\hat{D}$, according to (5), is a subset of the data used for computing $D$. The computation of $\hat{D}(m,n)$ is completed just before the computation of $D(m,n)$ begins so that the disable signal can be asserted if $\hat{D}(m,n) \ge D_{\min}(S')$. If the disable signal is asserted, no new data needed for computing $D(m,n)$ are issued to the processing elements so that unnecessary power consumption can be avoided. The processing array [for computing $D(m,n)$] used in our architecture is a conventional systolic array [1].
Our VLSI architecture for motion estimation includes two main blocks as depicted in Fig. 3 . One is a conventional motion estimator for computing the motion vector based on the minimum distortion criterion, and the other is a distortion approximation unit. The conventional motion estimator consists of a processing element array and a block-matching unit.
A. Processing Element Array
The processing element array shown in Fig. 4 is a systolic array in which each processing element (PE) computes the absolute difference (11) between luminance values of a pel in the search area and a pel in the reference macroblock, and forwards the sum of (11) and the partial sum of AD's from the row below to the row above. Each PE is assigned to compute the AD for a fixed position , e.g., the PE's in the bottom right and top left corners of the array compute the AD's for the positions and respectively. The pel data in the search area are serially shifted into the PE array in the following order: After clock cycles, the pel in the position of the first candidate macroblock arrives at the bottom right corner of the PE array, and the pel in the position of the first candidate macroblock arrives at the top left corner of the PE array. In fact, the first column of the first candidate macroblock arrives at the main diagonal, the second column arrives at the diagonal just left of the main one, etc., at the same time.
As depicted in Fig. 5 , the PE's in the bottom row of the array then compute for . In the next clock cycle, the PE's in the bottom row compute , and the PE's in the second row (from the bottom) compute . This process continues until all of the candidate macroblocks are exhausted. Note that this process requires clock cycles to compute AD's for candidate macroblocks in each row of the search area and clock cycles to flush out the last candidate macroblock of each row.
In the meantime, the computed AD value from each PE is added to the partial sum from the row below, as shown on the right side of Fig. 5, and the result is forwarded to the row above. Thus, each column of the PE array computes . This operation is pipelined so that the top row of the array generates in every clock cycle for .
To avoid computing unnecessary distortions, our implementation includes a blocking latch (BL) in each PE cell, as depicted in Fig. 4. This latch is normally transparent, but becomes opaque when the distortion calculation is deemed unnecessary, i.e., when the disable signal is asserted by the approximation unit. When the latch is closed, the old values are kept, preventing the AD circuit from switching. Because no AD values change, neither do the partial sums. The disable signal is pipelined (see the left side of Fig. 4) to match the pipelining of the distortion computation. The disable mechanism effectively blocks a new candidate macroblock from being introduced to the internal PE circuits, preventing the circuits from switching (and hence from consuming power).
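The effect of the blocking latch on switching activity can be pictured with a toy model. The sketch below is an assumption-laden illustration, not the circuit of Fig. 4: it treats the number of bit flips seen behind the latch as a proxy for dynamic power, and the name `count_toggles` is hypothetical.

```python
def count_toggles(inputs, disable):
    """Count bit flips seen by the logic behind a blocking latch.

    inputs  : sequence of non-negative integers (e.g., 8-bit pel values)
              presented to the latch, one per clock cycle
    disable : sequence of booleans, one per cycle; True means the latch is
              opaque and keeps its previous value
    """
    held = 0      # value currently visible behind the latch
    toggles = 0
    for value, blocked in zip(inputs, disable):
        if blocked:
            continue  # latch holds the old value, so nothing switches
        toggles += bin(held ^ value).count("1")  # Hamming distance
        held = value
    return toggles
```

With inputs `[0b1111, 0b0000, 0b1010]`, the model counts 10 bit flips when the latch is always transparent and 4 when it is opaque for the last two cycles.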
B. Block-Matching Unit
The block-matching unit (Fig. 6) generates the distortion value for each candidate macroblock, and compares it to the current minimum distortion value. The current minimum distortion and the motion vector are updated according to the result of the comparison.
C. Distortion Approximation Unit
The distortion approximation unit is the key addition to the systolic architecture first presented in [1] to reduce the power consumption of full-search block-matching motion estimation. The savings in power consumption are directly attributable to the simplicity of computing the estimate. For the purpose of power saving, (7) can be reformulated as
$$\hat{d}_i(m,n) = X_i - Y_i(m,n) \quad \text{if } X_i \ge Y_i(m,n) \tag{12}$$
$$\hat{d}_i(m,n) = Y_i(m,n) - X_i \quad \text{if } X_i < Y_i(m,n) \tag{13}$$
where $X_i = \sum_{j=1}^{N} x(i,j)$ and $Y_i(m,n) = \sum_{j=1}^{N} y(i+m,\, j+n)$. A hardware implementation of (12) and (13) is shown in Fig. 7.
To simplify the power estimation, we define a new measure of energy consumption: one unit of (absolute difference equivalent) is the amount of energy consumed in computing one AD. Calculating in the systolic PE array requires AD's and additions, assuming no correlation of data. Since one AD calculation is roughly equivalent to two additions, we estimate that one addition consumes . Thus, calculating consumes . On the other 
V. SIMULATION RESULTS
The program that simulates our low-power architecture has been tested on the following video sequences: Susie, French Garden, Caltrain, Football, Trevor, and Salesman. In each video sequence, the reference frame is the first frame (001); the three search frames are 002, 003, and 004. Table I lists the average percentage of disabled calculations over all three search frames of these sequences.
The systolic array used in our architecture dictates that each PE computes the AD for a fixed position as candidate blocks are shifted in serially. Therefore, if the pixels in the same relative positions of two consecutive candidate blocks have the same luminance values, then the AD values do not change, and no power is dissipated. For more accurate power estimation, we need to take this correlation into account. However, our low-power architecture implementation requires for computing the estimate for every candidate macroblock and, for 42.6% of the candidate macroblocks (for case on average), an additional for computing the exact distortion, according to the simulation results in Table I. Thus, our implementation consumes . Therefore, the power consumption of our low-power architecture is only % of that of the conventional systolic architecture.
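The structure of this estimate can be written as a small formula: every candidate pays for the estimate, and only the fraction that is not disabled pays for the exact distortion. The sketch below is a hedged illustration of that arithmetic. The function name `relative_energy` and its parameters are hypothetical, and the per-candidate energy costs must be supplied by the reader because they are not reproduced here; only the 42.6% figure comes from the text above.

```python
def relative_energy(fraction_computed, estimate_cost, exact_cost):
    """Energy of the augmented design relative to the conventional one.

    fraction_computed : fraction of candidates whose exact distortion is
                        still computed (e.g., 0.426 from the text)
    estimate_cost     : energy per candidate for the conservative estimate
    exact_cost        : energy per candidate for the exact distortion
    Both costs must be in the same (arbitrary) unit.
    """
    if exact_cost <= 0:
        raise ValueError("exact_cost must be positive")
    if not 0.0 <= fraction_computed <= 1.0:
        raise ValueError("fraction_computed must lie in [0, 1]")
    return (estimate_cost + fraction_computed * exact_cost) / exact_cost
```

With a free estimate, `relative_energy(0.426, 0.0, 1.0)` gives a ratio of 0.426. Any nonzero estimate cost raises the ratio above that floor, which is why the estimate has to be cheap relative to the exact distortion.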
In order to validate our analytical results, we simulated both the conventional systolic version and our low-power version using a custom-designed power simulator that counts the total switching activity of logic components (flip-flops, adders, and multiplexers). The power simulation results are shown in Table II. For and , the analytical and simulation results are remarkably similar.
VI. CONCLUSION
We presented a conservative approximation method that reduces power consumption in full-search block-matching motion estimation, without sacrificing the accuracy of the results or the performance. Simulation results show that the proposed low-power VLSI architecture consumes half as much power as the conventional systolic-array-based architecture.
