A low complexity hardware architecture for motion estimation by Larkin, Daniel et al.
A Low Complexity Hardware Architecture for
Motion Estimation
Daniel Larkin, Valentin Muresan and Noel O'Connor
Centre for Digital Video Processing, Dublin City University, Dublin, Ireland
Email: {larkind, muresanv, oconnomr}@eeng.dcu.ie
Abstract- This paper tackles the problem of accelerating motion estimation for video processing. A novel architecture using binary data is proposed, which attempts to reduce power consumption. The solution exploits redundant operations in the sum of absolute differences (SAD) calculation by a mechanism known as early termination. Further data redundancies are exploited by using a run length coding addressing scheme, where access to pixels which do not contribute to the final SAD value is minimised. By using these two techniques, operations and memory accesses are reduced by 93.29% and 69.17% respectively relative to a systolic array implementation.
I. INTRODUCTION
The ongoing global trend to shift multimedia applications from desktop to mobile platforms has encountered several technical hurdles: demanding real-time applications, low bandwidth mobile networks, and mobile device hardware (HW) limitations. The latter include low computational power, low memory capacity, short battery life and strict miniaturisation requirements. Therefore the computational complexity associated with modern video codecs such as H.264 is highly undesirable on mobile devices from a power consumption perspective. The greatest scope for power savings (10x-20x) occurs at the algorithmic level, by using such techniques as exploiting the nature of the media processing operations to be accelerated (e.g. regularity, redundancy) [1].
Motion estimation (ME) is the most computationally demanding task within all video codecs. It is used to exploit the temporal redundancies in video sequences, by (typically) employing a block matching algorithm (BMA) to find the best match for a block of pixels in the current frame by searching in a reference frame. The similarity of a block match (BM) is evaluated using a distortion metric, of which the sum of absolute differences (SAD) is the most popular, due to its optimum trade-off between complexity and quality [2]. The SAD formula for a 16 x 16 pixel macroblock (MB) is:
$$ SAD(B_{cur}, B_{ref}) = \sum_{i=1}^{16} \sum_{j=1}^{16} \left| B_{cur}(i,j) - B_{ref}(i,j) \right| \qquad (1) $$
Where B_cur is the block under consideration in the current frame and B_ref is the block at the current search location in the search frame. The reference block with the lowest SAD value is chosen for further processing.
This paper proposes an efficient low complexity HW architecture for motion estimation. To reduce the complexity overhead, binary block matching is employed in conjunction with a one-bit pixel preprocessing filter. The rest of this paper is organised as follows: section II details related prior research. Section III proposes a new binary motion estimation routine which exploits early termination properties in the distortion metric calculation and exploits redundancies in the binary data with a run length coding (RLC) addressing scheme. Section IV details an associated hardware architecture. Section V details hardware synthesis results and power consumption estimates, whilst section VI draws conclusions about the work presented.
II. RELATED RESEARCH
There are numerous ways to reduce the complexity of the full search BMA. Fast heuristic search strategies such as the 3 step search, logarithmic search, diamond search and block based gradient descent have all been used to reduce the number of search locations [2]. From a hardware implementation perspective the generation of non regular addresses increases the control logic considerably. Also optimal motion vectors are not guaranteed. On the other hand, fast exhaustive search strategies that employ such techniques as conservative SAD estimations [3] or early exit mechanisms [4] achieve the same results as the full-search ones, but reduce computation by skipping irrelevant candidate blocks [2]. Another option to reduce complexity is to use binary motion estimation (BME) approaches, which reduce the complexity contribution of the distortion metric by quantising 8 bit pixels to a binary representation [5] [6] [7] [8]. This greatly simplifies the SAD operation (eqn. 1) since the subtraction between the two processed binary valued pixels reduces to a simple XOR calculation (eqn. 2) with the absolute function inherent.
$$ SAD(B_{cur}, B_{ref}) = \sum_{i=1}^{16} \sum_{j=1}^{16} B_{cur}(i,j) \oplus B_{ref}(i,j) \qquad (2) $$
Using BME as a preprocessing stage is proposed in [5] to discount poor BMs prior to a full resolution ME. A pixel is quantised to a binary value based upon the value of the pixel relative to the mean of an NxN surrounding pixel block. Edge filtering is used in [6] to binarise the input pixels prior to doing a full search BME. However, in sequences with an absence of distinct edges, this approach can result in poor motion vectors. Natarajan et al. present a 2D systolic array BME hardware architecture, which employs a 17x17 convolution-based 1-bit transform [7]. A BME architecture is proposed in [8], which uses a hierarchical search strategy.
0-7803-9390-2/06/$20.00 ©2006 IEEE 2677 ISCAS 2006
In previous BME research no attempts have been made to optimise the processing element (PE) datapath. We will present
two redundancies within the datapath and propose solutions to exploit them. This work assumes binarisation of the texture has already been completed.
III. EXPLOITING BME REDUNDANCIES
A. Early SAD Termination
By employing early termination techniques the processing overhead can be reduced. Early SAD termination means that in certain block matches it is possible to cancel all further operations for that block because the accumulated partial SAD result is larger than the minimum SAD found so far within the search window. Further processing of that particular reference MB will only make the SAD result larger. Therefore the final SAD result will also be greater than the minimum. To exploit this feature, we propose that during each SAD processing operation, the partial SAD calculated to date is subtracted from a deaccumulation register, which initially holds the best SAD value calculated thus far. If a sign change occurs during the deaccumulation step, there is no need to continue further processing since the current minimum SAD has already been exceeded. In order to allow cancellation, a partial SAD must be available. This presents a challenge for typical systolic array hardware architectures, due to the granularity of the calculation. The problem is overcome in [4] and our proposed architecture further extends the granularity of early termination through a pixel subsampling technique. This will be described in Section IV.
B. Exploiting Data Addressing Redundancies
Another characteristic of binary data that can be exploited to reduce computational overhead becomes apparent by observing that there are unnecessary memory accesses and operations when both B_cur and B_ref pixels have the same value. This happens because the XOR in eqn. 2 gives a zero result when both B_cur(i,j) and B_ref(i,j) have the same value. To minimise this effect, we propose using a RLC addressing scheme. However, to use the RLC addressing the SAD calculation must be reformulated to the form given in eqn. 3 [9].
$$ SAD = TOT_{ref} - TOT_{cur} + 2 \times DIFF_{cur} \qquad (3) $$
Where TOT_cur is the total number of white pixels in the current MB, DIFF_cur is the number of white pixels in the current MB but not in the reference MB and TOT_ref is the total number of white pixels in the reference MB. Equation 3 is beneficial from a low power hardware perspective because:
* TOT_cur is calculated only once per search
* TOT_ref can be updated in 1 clock cycle
* Incremental addition of DIFF_cur allows early termination if the current minimum SAD is exceeded
* By using run length coding to address the DIFF_cur pixels, irrelevant data access is minimised.
The run length code is generated in parallel with the first match of the search step; an example of typical RLC is illustrated in fig. 1. It is possible to do this during the first match because SAD early termination is not possible at this point. The first match always takes N x N (where N is the block size) cycles to complete and this provides ample time for the run length encoding process to operate in parallel. After the RLC encoding, the logic would be powered down until the next current block is processed.
Fig. 1. Regular and Inverse RLC pixel addressing. Current macroblock: the location of the white pixels is given by the following run length codes (RL), which are in the form RLi(x,y), where x is the relative offset from the last white pixel and y is the number of consecutive white pixels: RL1(1,1) RL2(15,3) RL3(13,4) RL4(12,5). Similarly, the location of the black pixels is given by: RL0(0,1) RL1(1,15) RL3(3,13) RL4(4,12) RL5(5,11) RL6(32,160).
In situations where there are fewer black pixels than white pixels in the current MB, it is possible to use the black pixels instead to calculate the SAD with eqn. 4. Fewer pixels translates into fewer operations to be completed, which has associated throughput and switching benefits.
$$ SAD = TOT_{cur} - TOT_{ref} + 2 \times DIFF_{cur}^{BLACK} \qquad (4) $$
Here DIFF_cur^BLACK counts the black pixels in the current MB that are white in the reference MB. The location of the black pixels can be automatically derived from the RLC for the white pixels. Thus, by reusing the white pixels' RLC, additional memory is not required and furthermore the same SAD datapath can be reused with minimal additional logic. The choice of which mode to use is decided by the MSB of TOT_cur. To further minimise memory accesses when using the inverse run length mode, we propose decrementing a copy of the TOT_ref register (see fig. 2(a)) each time a white pixel in the reference block is accessed. If the copy of the TOT_ref register decrements to zero, no further contributions to the SAD are possible, since all the white pixels have been examined, and early termination is possible.
IV. ARCHITECTURE DESIGN
The proposed architecture can be implemented with varying degrees of parallelism depending on the critical requirements (area, power, throughput, technology) of the final system. The basic PE will now be described, followed by a parallel architecture which uses 4 processing elements.
Fig. 2. RLC SAD processing element and Pixel Subsampling: (a) RLC SAD Processing Element, (b) Pixel Subsampling.
A. Basic RLC SAD Processing Element
Fig. 2(a) shows a simplified view of the SAD PE. At the first clock cycle the minimum SAD encountered so far is loaded into DACC_REG. During the next cycle TOT_cur or TOT_ref is added to DACC_REG (depending on whether TOT_cur[MSB] is 0 or 1 respectively). On the next clock cycle DACC_REG is deaccumulated by TOT_ref or TOT_cur. If a sign change occurs at this point the minimum SAD has already been exceeded and no further processing is required. If a sign change has not occurred the address generation unit (AGU) retrieves the next run length code from memory. If TOT_cur[MSB] = 0 the run
length pair code is processed unmodified. On the other hand, if TOT_cur[MSB] = 1 the inverse run length code is processed. In either case the processing results in a macroblock pixel address. This address is used to retrieve the relevant pixel value [...] current run length code are processed. If the SAD calculation is not cancelled, subsequent run length codes for the current MB are fetched from memory and the processing repeats. If a circular or full search is used, TOT_ref can be updated in one clock cycle. This is done by subtracting the previous row or column (depending on search window movement) from TOT_ref and adding the new row or column.
B. 4xPE Block Matching Architecture
In order to exploit early termination, an intermediate partial SAD must be generated. This requires the SAD calculation to proceed in a sequential manner, but this reduces throughput and is not desirable for real time applications. To increase throughput, parallelism must be exploited. Our proposed parallel architecture can be seen in fig. 3. The architecture carries out motion estimation on one MB at a time. [...] In a block match, the MB is split into four pixel blocks (PB) by using the simple pixel subsampling technique shown in fig. 2(b). Each of the four PE operates on one PB. The four parallel PE generate partial SAD values. The decision making unit then uses these accumulated SAD values to make an early termination decision. The early termination can occur at any point during the PB processing. If early termination does not occur in all 4 PB, it is necessary to examine the 4 PB SAD values [...]. Rather than reducing throughput by stalling the PE while the update stage operates, the update stage can run in parallel with the next match. [...] If, after accumulating the pixel subsampled SAD values, the result is negative, a new minimum SAD has not been found. The update can then stop processing and the logic can be powered down. However, if the accumulated block level SAD value is positive, this means that a new minimum SAD has been found. [...] The block level minimum SAD and total minimum SAD values now need to be updated. In addition, an adjustment must be made to DACC_REG [...]. This arises because the next block match started deaccumulating from the old minimum SAD (MIN_SAD_OLD) rather than the new minimum SAD (MIN_SAD_NEW). [...] An adjustment of DACC_REG is required [...]. This value is stored in special registers [...] the update can occur in parallel with the next block match, which is of considerable benefit since the PE do not need to be stalled until the update has finished.
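The reformulated SAD of Section III and the deaccumulation sequence of the basic PE can be sketched in software as follows. This is a minimal behavioural sketch under stated assumptions, not the authors' hardware: the names `runs`, `xor_sad` and `rlc_sad` are hypothetical, runs are stored as (start, length) pairs rather than the relative offset form of fig. 1, and a simple majority test stands in for the TOT_cur MSB check. It uses eqn. 3 when white pixels are in the minority and eqn. 4 otherwise, and cancels the match as soon as the deaccumulator changes sign.

```python
def runs(bits, target):
    """Return (start, length) runs of `target` in a flat list of 0/1 pixels."""
    out, i, n = [], 0, len(bits)
    while i < n:
        if bits[i] == target:
            j = i
            while j < n and bits[j] == target:
                j += 1
            out.append((i, j - i))
            i = j
        else:
            i += 1
    return out


def xor_sad(cur, ref):
    """Reference binary SAD (eqn. 2): number of positions where pixels differ."""
    return sum(c ^ r for c, r in zip(cur, ref))


def rlc_sad(cur, ref, min_sad):
    """SAD via eqn. 3 or eqn. 4 with early termination.

    Returns the SAD, or None when the match is cancelled because the
    partial SAD already exceeds min_sad.
    """
    tot_cur, tot_ref = sum(cur), sum(ref)
    # Assumption: a majority test stands in for the TOT_cur MSB check.
    inverse = tot_cur * 2 >= len(cur)
    target = 0 if inverse else 1  # inverse mode addresses the black pixels
    # The deaccumulator starts at the best SAD so far, then absorbs the TOT terms.
    dacc = min_sad + (tot_ref - tot_cur if inverse else tot_cur - tot_ref)
    if dacc < 0:
        return None
    for start, length in runs(cur, target):
        for i in range(start, start + length):
            if ref[i] != target:  # this pixel contributes to DIFF
                dacc -= 2
                if dacc < 0:  # sign change: the minimum has been exceeded
                    return None
    return min_sad - dacc
```

Only pixels addressed by the run length codes of the current MB are read from the reference MB, which is the source of the memory access saving; the result agrees with the plain XOR count whenever the match is not cancelled.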
TABLE I
RLC BME 4xPE VERSUS CONVENTIONAL SYSTOLIC ARRAY BME
             | Memory Accesses (1 bit pixels) | Operations (1 bit XOR & addition) | Clock Cycles
Sequence     | 2D SA [7]      | BME 4xPE      | 2D SA [7]      | BME 4xPE         | 2D SA [7]      | BME 4xPE
Akiyo        | 1.5206 x 10^9  | 4.5298 x 10^8 | 4.0170 x 10^9  | 2.6105 x 10^8    | 3.2252 x 10^7  | 6.8858 x 10^7
Hall Monitor | 1.5206 x 10^9  | 4.8202 x 10^8 | 4.0170 x 10^9  | 2.7847 x 10^8    | 3.2252 x 10^7  | 7.3362 x 10^7
Foreman      | 1.5206 x 10^9  | 4.7137 x 10^8 | 4.0170 x 10^9  | 2.6942 x 10^8    | 3.2252 x 10^7  | 7.0999 x 10^7
Average      | 69.17% reduction               | 93.29% reduction                  | 220.37% of 2D SA cycles (120.37% increase)
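The averages in Table I can be reproduced from the per-sequence figures. The short check below is a sketch with hypothetical variable names; the numbers are transcribed from the table. It gives a 69.17% reduction in memory accesses and a 93.29% reduction in operations, while the clock cycle figure comes out as a ratio: the BME 4xPE design takes about 220.37% of the systolic array's cycles, i.e. roughly 2.2 times as many.

```python
# Constant 2D systolic array [7] figures from Table I.
SA_MEM, SA_OPS, SA_CYC = 1.5206e9, 4.0170e9, 3.2252e7

# BME 4xPE figures per sequence: (memory accesses, operations, clock cycles).
bme_4xpe = {
    "Akiyo":        (4.5298e8, 2.6105e8, 6.8858e7),
    "Hall Monitor": (4.8202e8, 2.7847e8, 7.3362e7),
    "Foreman":      (4.7137e8, 2.6942e8, 7.0999e7),
}


def mean(values):
    """Arithmetic mean of a sequence of numbers."""
    values = list(values)
    return sum(values) / len(values)


avg_mem = mean(row[0] for row in bme_4xpe.values())
avg_ops = mean(row[1] for row in bme_4xpe.values())
avg_cyc = mean(row[2] for row in bme_4xpe.values())

mem_reduction_pct = 100 * (1 - avg_mem / SA_MEM)
ops_reduction_pct = 100 * (1 - avg_ops / SA_OPS)
cycle_ratio_pct = 100 * avg_cyc / SA_CYC

print(f"memory accesses: {mem_reduction_pct:.2f}% reduction")
print(f"operations:      {ops_reduction_pct:.2f}% reduction")
print(f"clock cycles:    {cycle_ratio_pct:.2f}% of the systolic array count")
```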
TABLE II
BME 4xPE SYNTHESIS RESULTS
Technology     | Area         | Max Freq. | Power
Virtex 2 FPGA  | 14,615 gates | 120 MHz   | 12.951 mW
TSMC 90nm      | 10,117 gates | 250 MHz   | 1.220 mW
V. EXPERIMENTAL RESULTS
The motion compensated PSNR is dependent predominantly on the choice of the binarisation filter; consequently PSNR will not be considered further in these results. The 4xPE design was captured using Verilog HDL. The design was targeted to a Xilinx Virtex 2 FPGA and also synthesised using a 90nm TSMC library characterised for low power. The results for the datapath can be seen in table II. Synplicity Pro and Synopsys Design Compiler were used for synthesis, whilst Xilinx XPower and Synopsys Prime Power were used to generate the power consumption figures. The ASIC implementation, due to the dedicated logic rather than configurable logic, has smaller area and lower power than the FPGA implementation. From an area and power perspective, direct comparisons with previous hardware implementations of BME are difficult since they do not quote these figures [7] [8]. For this reason, and because of issues with normalisation across different semiconductor processes, we have chosen to benchmark our implementation in terms of 1 bit pixel memory accesses, 1 bit operations and number of clock cycles (see table I). Using standard MPEG-4 test sequences with a BM size of 16x16 and a search window of -8/+7, experiments showed we achieved a 93% average reduction in the number of operations compared to a SA implementation [7]. This figure also compares favourably to the early SAD termination HW design presented in [4], which achieves a 45% to 51% reduction in operations. Our improvement can be attributed to the subsampling and the use of RLC addressing for the binary data. A systolic array implementation cannot take advantage of RLC addressing and this leads to the concern that our architecture could suffer from input/output bandwidth inefficiencies. However this is not the case, as can be seen in table I: due to the early termination, on average 69.17% fewer 1 bit pixel memory accesses are required. The reduction in operations and memory accesses comes at the expense of reduced throughput compared to [7], which requires a constant 271 cycles per macroblock. This compares to our design, which after allowing for early termination requires on average 598 clock cycles per MB. If throughput is essential our design scales to 16 PE, however the effectiveness of early termination is reduced.
VI. CONCLUSIONS AND FUTURE WORK
One concern with using BME is that for small BM sizes the quality of the motion vectors degrades. This, along with more accurate benchmarking and research into binarisation filters, which have not been discussed, will form the basis of future work. Overall this paper has presented an efficient BME architecture, which reduces computational complexity through the use of a novel binary early termination SAD architecture which uses a RLC addressing scheme. Reducing the number of computations and memory accesses is of considerable benefit since it reduces dynamic power consumption in the datapath.
REFERENCES
[1] M. Pedram and J. M. Rabaey, Power Aware Design Methodologies. Kluwer Academic Publishers, 2002.
[2] P. M. Kuhn, Algorithms, Complexity Analysis and VLSI Architectures for MPEG-4 Motion Estimation. Springer, June 1999.
[3] V. Do et al., "A Low-Power VLSI Architecture for Full-Search Block-Matching Motion Estimation," IEEE Trans. Circuits Syst. Video Technol., vol. 8, no. 14, pp. 393-398, Aug. 1998.
[4] M. Takahashi et al., "A 60-MHz 240-mW MPEG-4 Videophone LSI with 16-Mb Embedded RAM," IEEE J. Solid-State Circuits, vol. 35, no. 11, pp. 1713-1721, Nov. 2000.
[5] Feng, Lo, Mehrpour, and Karkowiak, "Adaptive block matching motion estimation algorithm using bit plane matching," in IEEE Int Conf Image Processing, Washington D.C., USA, 1995, pp. 496-499.
[6] M. Mizuki, U. Desai, I. Masaki, and A. Chandrakasan, "A binary block matching architecture with reduced power consumption and silicon area," in Proc. IEEE ICASSP-96, vol. 6, Atlanta, USA, 1996, pp. 3248-3251.
[7] B. Natarajan, V. Bhaskaran, and K. Konstantinides, "Low complexity block based motion estimation via one bit transform," IEEE Trans. Circuits Syst. Video Technol., vol. 7, no. 4, pp. 702-706, Aug. 1997.
[8] J. H. Luo et al., "A Novel All-Binary Motion Estimation (ABME) with Optimized Hardware Architectures," IEEE Trans. Circuits Syst. Video Technol., vol. 12, no. 8, pp. 700-712, Aug. 2002.
[9] D. Larkin, V. Muresan, and N. O'Connor, "An Efficient Motion Estimation Hardware Architecture for MPEG-4 Binary Shape Coding," in Irish Signals and Systems Conference, Dublin, Ireland, Sept. 1-2, 2005.
