motion estimation for video processing. A novel architecture using binary data is proposed, which attempts to reduce power consumption. The solution exploits redundant operations in the sum of absolute differences (SAD) calculation, by a mechanism known as early termination. Further data redundancies are exploited by using a run length coding addressing scheme, where access to pixels which do not contribute to the final SAD value is minimised. By using these two techniques, operations and memory accesses are reduced by 93.29% and 69.17% respectively relative to a systolic array implementation.

I. INTRODUCTION

The ongoing global trend to shift multimedia applications from desktop to mobile platforms has encountered several technical hurdles: demanding real-time applications, low bandwidth mobile networks, and mobile device hardware (HW) limitations. The latter include low computational power, low memory capacity, short battery life and strict miniaturisation requirements. Therefore the computational complexity associated with modern video codecs such as H.264 is highly undesirable on mobile devices from a power consumption perspective. The greatest scope for power savings occurs at the algorithmic level, by using such techniques as exploiting the nature of the media processing operations to be accelerated (e.g. regularity, redundancy) [1].

Motion estimation (ME) is the most computationally demanding task within all video codecs. It is used to exploit the temporal redundancies in video sequences, by (typically) employing a block matching algorithm (BMA) to find the best match for a block of pixels in the current frame by searching in a reference frame. The similarity of a block match (BM) is evaluated using a distortion metric, of which the sum of absolute differences (SAD) is the one considered here, in its general form (eqn. 1) and its binary XOR form (eqn. 2):

\[ SAD(B_{cur}, B_{ref}) = \sum_{i} \sum_{j} \left| B_{cur}(i,j) - B_{ref}(i,j) \right| \tag{1} \]

\[ SAD(B_{cur}, B_{ref}) = \sum_{i} \sum_{j} B_{cur}(i,j) \oplus B_{ref}(i,j) \tag{2} \]

where \(B_{cur}\) is the block under consideration in the current frame and \(B_{ref}\) is the block at the current search location in the search frame. The reference block with the lowest SAD value is chosen for further processing.

This paper proposes an efficient low complexity HW architecture for motion estimation. To reduce the complexity overhead, binary block matching is employed in conjunction with a one-bit pixel preprocessing filter.

The rest of this paper is organised as follows: section II details related prior research. Section III proposes a new binary motion estimation routine which exploits early termination properties in the distortion metric calculation and exploits redundancies in the binary data with a run length coding (RLC) addressing scheme. Section IV details an associated hardware architecture. Section V details hardware synthesis results and power consumption estimates, whilst section VI draws conclusions about the work presented.

0-7803-9390-2/06/$20.00 © 2006 IEEE

II. RELATED RESEARCH

There are numerous ways to reduce the complexity of the full search BMA. Fast heuristic search strategies such as the 3-step search, logarithmic search, diamond search and block-based gradient descent have all been used to reduce the number of search locations [2]. From a hardware implementation perspective the generation of non-regular addresses increases the control logic considerably. Also, optimal motion vectors are not guaranteed. On the other hand, fast exhaustive search strategies that employ such techniques as conservative SAD estimations [3] or early exit mechanisms [4] achieve the same results as the full-search ones, but reduce computation by skipping irrelevant candidate blocks [2]. Another option to reduce complexity is to use binary motion estimation (BME) approaches, which reduce the complexity contribution of the distortion metric by quantising 8 bit pixels to a binary representation [5] [6] [7] [8]. This greatly simplifies the SAD operation (eqn. 1) since the subtraction between the two processed binary-valued pixels reduces to a simple XOR calculation (eqn. 2) with the absolute function inherent.

Edge filtering is used in [6] to binarise the input pixels prior to doing a full search BME. However, in sequences with an absence of distinct edges, this approach can result in poor motion vectors. Natarajan et al. present a 2D systolic array BME hardware architecture, which employs a 17x17 convolution-based 1-bit transform [7]. A BME architecture is proposed in [8], which uses a hierarchical search strategy. In previous BME research no attempts have been made to optimise the processing element (PE) datapath. We will present two redundancies within the datapath and propose solutions to
exploit them. This work assumes binarisation of the texture has already been completed.

III. EXPLOITING BME REDUNDANCIES

A. Early SAD Termination

By employing early termination techniques the processing overhead can be reduced. Early SAD termination means that in certain block matches it is possible to cancel all further operations for that block because the accumulated partial SAD result is larger than the minimum SAD found so far within the search window. Further processing of that particular reference macroblock (MB) will only make the SAD result larger. Therefore the final SAD result will also be greater than the minimum. To exploit this feature, we propose that during each SAD processing operation, the partial SAD calculated to date is subtracted from a deaccumulation register, which initially holds the value of the best SAD value calculated thus far. If a sign change occurs during the deaccumulation step, there is no need to continue further processing since the current minimum SAD has already been exceeded. In order to allow cancellation, a partial SAD must be available. This presents a challenge for typical systolic array hardware architectures, due to the granularity of the calculation. The problem is overcome in [4] and our proposed architecture further extends the granularity of early termination through a pixel subsampling technique. This will be described in Section IV.

B. Exploiting Data Addressing Redundancies

Another characteristic of binary data that can be exploited to reduce computational overhead becomes apparent by observing that there are unnecessary memory accesses and operations when both \(B_{cur}\) and \(B_{ref}\) pixels have the same value. This happens because the XOR in eqn. 2 gives a zero result when both \(B_{cur}(i,j)\) and \(B_{ref}(i,j)\) have the same value. To minimise this effect, we propose using a RLC addressing scheme. However, to use the RLC addressing the SAD calculation must be reformulated to the form given in eqn. 3 [9].

Fig. 1. Regular and Inverse RLC pixel addressing. The locations of the white pixels in the current macroblock are given by the following run length codes (RL), which are in the form RLi(x, y), where x is the relative offset from the last white pixel and y is the number of consecutive white pixels: RL1(1,1), RL2(15,3), RL3(13,4), RL4(12,5). Similarly, the locations of the black pixels are given by: RL0(0,1), RL1(1,15), RL3(3,13), RL4(4,12), RL5(5,11), RL6(32,160).

The first match always takes N × N cycles (where N is the block size) to complete and this provides ample time for the run length encoding process to operate in parallel. After the RLC encoding, the logic would be powered down until the next current block is processed. In situations where there are fewer black pixels than white pixels in the current MB, it is possible to use the black pixels instead to calculate the SAD with eqn. 4. Fewer pixels translate into fewer operations to be completed, which has associated throughput and switching benefits.

\[ SAD = TOT_{cur} - TOT_{ref} + 2 \times DIFF_{cur}^{BLACK} \tag{4} \]

The location of the black pixels can be automatically derived from the RLC for the white pixels. Thus, by reusing the white pixels' RLC, additional memory is not required and furthermore the same SAD datapath can be reused with minimal additional logic. The choice of which mode to use is decided by the MSB of \(TOT_{cur}\). To further minimise memory accesses when using the inverse run length mode, we propose decrementing a copy of the \(TOT_{ref}\) register (see fig. 2(a)) each time a white pixel in the reference block is accessed. If the copy of the \(TOT_{ref}\) register decrements to zero, no further contributions to the SAD are possible, since all the white pixels have been examined and early termination is possible.
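As a rough software illustration of the two ideas in this section (not the hardware datapath itself), the following Python sketch computes the binary SAD by visiting only the run-length-coded pixels of the current block. It uses the eqn. 3 form when white pixels are addressed and the eqn. 4 form when black pixels are addressed, and it abandons a candidate as soon as a deaccumulation register changes sign. The function names (`sad_xor`, `run_lengths`, `sad_rlc`), the raster-order list representation of a block, and the rule for selecting the inverse mode (strictly more white than black pixels, standing in for the MSB test on \(TOT_{cur}\)) are assumptions made for the example.

```python
from typing import List, Optional, Tuple

Block = List[int]  # binary pixels in raster order: 1 = white, 0 = black


def sad_xor(cur: Block, ref: Block) -> int:
    """Reference result: the XOR form of the SAD (eqn. 2)."""
    return sum(c ^ r for c, r in zip(cur, ref))


def run_lengths(block: Block, colour: int = 1) -> List[Tuple[int, int]]:
    """Encode the runs of `colour` as (offset, length) pairs.

    offset: pixels of the other colour since the previous run ended.
    length: number of consecutive pixels of `colour` (the RLi(x, y) form).
    """
    codes: List[Tuple[int, int]] = []
    offset, i = 0, 0
    while i < len(block):
        if block[i] == colour:
            start = i
            while i < len(block) and block[i] == colour:
                i += 1
            codes.append((offset, i - start))
            offset = 0
        else:
            offset += 1
            i += 1
    return codes


def sad_rlc(cur: Block, ref: Block, best: Optional[int] = None) -> Optional[int]:
    """SAD via RLC addressing; returns None if the candidate is terminated early.

    Regular mode (eqn. 3): SAD = TOT_ref - TOT_cur + 2 * DIFF_cur
    Inverse mode (eqn. 4): SAD = TOT_cur - TOT_ref + 2 * DIFF_cur^BLACK
    """
    tot_cur, tot_ref = sum(cur), sum(ref)
    inverse = tot_cur > len(cur) - tot_cur   # assumption: fewer black than white
    colour = 0 if inverse else 1             # colour of the pixels we address
    sad = tot_cur - tot_ref if inverse else tot_ref - tot_cur
    # Deaccumulation register: best SAD so far minus the constant part of the SAD.
    dereg = None if best is None else best - sad
    if dereg is not None and dereg < 0:
        return None
    pos = 0
    for offset, length in run_lengths(cur, colour):
        pos += offset                        # skip pixels that are never fetched
        for p in range(pos, pos + length):
            if ref[p] != colour:             # pixel counted by the DIFF term
                sad += 2
                if dereg is not None:
                    dereg -= 2
                    if dereg < 0:            # sign change: best SAD exceeded
                        return None
        pos += length
    return sad
```

For a raster sequence built from the four white codes of Fig. 1, `run_lengths` returns (1,1), (15,3), (13,4), (12,5) for the white pixels and (0,1), (1,15), (3,13), (4,12) for the black pixels. When `best` is omitted, `sad_rlc` returns the same value as `sad_xor`; when it is given, the function returns `None` exactly for those candidates whose SAD exceeds it.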
\[ SAD = TOT_{ref} - TOT_{cur} + 2 \times DIFF_{cur} \tag{3} \]

where \(TOT_{cur}\) is the total number of white pixels in the current MB, \(DIFF_{cur}\) is the number of white pixels in the current MB but not in the reference MB and \(TOT_{ref}\) is the total number of white pixels in the reference MB. Equation 3 is beneficial from a low power hardware perspective because:

* \(TOT_{cur}\) is calculated only once per search

IV. ARCHITECTURE DESIGN

The proposed architecture can be implemented with varying degrees of parallelism depending on the critical requirements (area, power, throughput, technology) of the final system. The basic PE will now be described, followed by a parallel architecture which uses 4 processing elements.

A. Basic RLC SAD Processing Element

The run length code is generated in parallel with the first match of the search step; an example of a typical RLC is illustrated in fig. 1

at this point the minimum SAD has already been exceeded and no further processing is required. If a sign change has not

[Table: Sequence | Clock Cycles | 2D SA [7] vs. BMEA4xPE (three column groups)]

V. EXPERIMENTAL RESULTS

The motion compensated PSNR is dependent predominantly on the choice of the binarisation filter; consequently PSNR will not be considered further in these results. The 4xPE design was captured using Verilog HDL. The design was targeted to a Xilinx Virtex 2 FPGA and also synthesised using a 90nm TSMC library characterised for low power. The results for the datapath can be seen in table II. Synplicity Pro and Synopsys Design Compiler were used for synthesis, whilst Xilinx

One concern with using BME is that for small BM sizes the quality of the motion vectors degrades. This, along with more accurate benchmarking and research into binarisation filters, which have not been discussed, will form the basis of future work. Overall this paper has presented an efficient BME architecture, which reduces computational complexity through the use of a novel binary early termination SAD architecture which uses a RLC addressing scheme. Reducing the number of computations and memory accesses is of considerable benefit since it reduces dynamic power consumption in the datapath.
