Abstract: In this paper, we investigate hardware implementation of block matching algorithms (BMAs) for motion estimation of moving sequences. Using systolic arrays, we propose VLSI architectures for the two-stage BMA and the full search (FS) BMA. The two-stage BMA using integral projections greatly reduces computational complexity while its performance remains comparable to that of the FS BMA. The proposed hardware architectures for the two-stage BMA and FS BMA are faster than the conventional hardware architectures, with lower hardware complexity. In addition, the proposed architecture of the first stage of the two-stage BMA is modeled in VHDL and simulated. Simulation results show the functional validity of the proposed architecture.
I. Introduction
Digital communication technology has made transmission of video images possible and, in the modern information age, its advance is closely related to the growth of the telecommunication industries. To provide digital video services at various quality levels, video data must be compressed with a high compression ratio and effective memory utilization.
Digital signal processing (DSP) of video signals, the major part of a video transmission system, includes digital image acquisition, motion estimation and compensation, prediction error coding, entropy coding/decoding, and so on. Motion estimation is one of the most important processes, because its capability determines the final quality of a reconstructed video signal. For practical realization of a video communication system, both the development of effective video compression algorithms and their VLSI implementation techniques for real-time processing are required.
Motion compensation coding [1]-[4] is composed of several steps. First, the temporal redundancy of the image sequence is removed by means of motion detection; then the motion vector and the prediction error are encoded, where the prediction error is defined as the difference between the current frame and the motion-compensated previous frame. There are two approaches to finding the motion information: the block matching algorithm (BMA) [1] and the pel recursive algorithm (PRA) [2], [5]. The BMA has been used for motion estimation in video conference systems, moving picture experts group (MPEG) coding, and high definition television (HDTV) because of its simplicity and easy hardware realization.
The full search (FS) or brute force search algorithm, the most basic BMA, finds the best matching block, in terms of a predefined error measure, among all candidate blocks in the search area of the previous frame. It gives the optimal performance, but its computational overhead is too large for real-time processing. Many algorithms have been proposed to reduce the computational complexity at the cost of a small performance degradation compared to the FS BMA. The two-stage BMA using integral projections [6] is one of the fast BMAs.
However, most fast BMAs are difficult to implement in hardware for real-time applications, so a number of hardware architectures using parallel processing have been presented. Recently, in consideration of the cost, complexity, and system load of a general-purpose computer, the development of special-purpose visual processors using systolic/wavefront array processors dedicated to a specific job has accelerated. Systolic arrays [7], [8] have useful characteristics such as modularity, regularity, and local communication capability. They also control and schedule pipelining, and they have been used in various DSP applications requiring high throughput.
In this paper, we propose hardware architectures for the two-stage BMA and FS BMA. In Section II, we briefly describe BMAs, especially the two-stage BMA. In Section III, we propose hardware architectures for the two-stage BMA and FS BMA, and we show experimental results of the first stage of the two-stage BMA using the very high speed integrated circuit hardware description language (VHDL) [9]-[11]. In Section IV, we analyze the performance of the proposed VLSI architectures of the two-stage BMA and the FS BMA along with various conventional FS BMA architectures, and conclusions are given in Section V.
II. BMAs
A BMA estimates the amount of motion on a block basis between two successive frames. In a typical BMA, the current frame is divided into a number of N × N blocks. A block of pixels in the current frame, called the reference block, is compared with the corresponding blocks, called candidate blocks, within an (N + 2p) × (N + 2p) search area in the previous frame, where p represents the maximum displacement assumed. The displacement giving the best match is referred to as the motion vector. Common matching criteria include the mean absolute difference (MAD) and the mean square error (MSE). The MAD measure gives good performance and its calculation requires only a few simple instructions. Thus the MAD matching criterion has been preferred in VLSI realization, and the proposed architectures also employ it.
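The MAD criterion and the exhaustive search over the (2p + 1)^2 candidate displacements can be sketched in a few lines of Python. This is only an illustrative software model, not the hardware described later. The function names `mad` and `full_search`, the list-of-rows frame layout, and the decision to skip candidates that fall outside the frame are assumptions made for this example.

```python
def mad(cur, prev, r, c, dy, dx, n):
    """Mean absolute difference between the n x n reference block at (r, c)
    in `cur` and the candidate block displaced by (dy, dx) in `prev`."""
    total = 0
    for i in range(n):
        for j in range(n):
            total += abs(cur[r + i][c + j] - prev[r + dy + i][c + dx + j])
    return total / (n * n)


def full_search(cur, prev, r, c, n, p):
    """Return (best_mad, (dy, dx)) over all displacements in [-p, p] x [-p, p].
    Candidates that leave the frame are skipped (an assumption of this sketch)."""
    rows, cols = len(prev), len(prev[0])
    best = None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            inside = (0 <= r + dy and r + dy + n <= rows
                      and 0 <= c + dx and c + dx + n <= cols)
            if not inside:
                continue
            cost = mad(cur, prev, r, c, dy, dx, n)
            if best is None or cost < best[0]:  # ties keep the earlier candidate
                best = (cost, (dy, dx))
    return best
```

The double loop over (dy, dx) is what makes the FS BMA expensive: every one of the (2p + 1)^2 candidates costs N^2 absolute differences.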
The FS BMA finds the best matching block among all candidate blocks in the search area [1]. Conventional fast BMAs [1] include the three step search (TSS), the direction of the minimum distortion (DMD) search, the menu vector search, and the one at a time search (OTS). The TSS uses nine search points including the center point, and the search distance decreases as the stage proceeds. The DMD search is based on the assumption that the MSE increases monotonically as the search moves away from the direction of minimum distortion. The menu vector method finds the best matching point among predefined candidate points. The OTS first finds the minimum MSE along the horizontal direction, then searches along the vertical direction.
The numbers of search points of these algorithms are considerably reduced compared to that of the FS BMA, whereas the two-stage BMA using the integral projection concept reduces the computational complexity by employing a 1D matching function in the first stage. Image processing using the projection concept has been widely used in medical image processing [12], [13]; for example, a 2D image in computerized tomography is reconstructed from 1D projection data. It has also been applied to video [6], [14], [15] and image compression [16], progressive image transmission [17], and so on.
The two-stage BMA consists of two stages. As a distortion measure between two blocks, it adopts a 1D matching measure to find a coarse motion vector in the first stage, whereas it employs a conventional 2D measure to find the final motion vector in the second stage. It greatly reduces the computational complexity of conventional BMAs while its performance is almost comparable to that of the conventional ones [6].
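A minimal sketch of a 1D matching measure based on integral projections is given below. It assumes that the horizontal projections are the row sums of a block, the vertical projections are the column sums, and the measure is the sum of absolute differences over both sets. The exact normalization used in [6] is not reproduced in this text, and the function names are hypothetical.

```python
def projections(frame, r, c, n):
    """Horizontal (row-sum) and vertical (column-sum) integral projections
    of the n x n block whose top-left corner is (r, c)."""
    rows = [sum(frame[r + i][c + j] for j in range(n)) for i in range(n)]
    cols = [sum(frame[r + i][c + j] for i in range(n)) for j in range(n)]
    return rows, cols


def projection_distortion(cur, prev, r, c, dy, dx, n):
    """1D matching measure: absolute differences of the 2n projections,
    instead of the n * n pixel differences used by the 2D measure."""
    ch, cv = projections(cur, r, c, n)
    ph, pv = projections(prev, r + dy, c + dx, n)
    return (sum(abs(a - b) for a, b in zip(ch, ph))
            + sum(abs(a - b) for a, b in zip(cv, pv)))
```

Once the projections are available, each candidate needs 2N absolute differences rather than N^2, which is the source of the first-stage saving.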
For example, combined with the FS BMA with 16 × 16 blocks, the computational complexity of the two-stage BMA is reduced by a factor of about four. The larger the subblock size, the greater the performance improvement by the two-stage BMA. Also, for noisy sequences the two-stage BMA gives performance comparable to that of the FS. The picture quality obtained by the two-stage BMA combined with the TSS is similar to that of the TSS, with much lower computational complexity. Its procedure is summarized as follows [6].

to calculate horizontal projections in the H block and they are passed to the V/AD block, where vertical projections are computed. The V/AD block also computes the 1D distortion measure between the horizontal/vertical integral projections computed from the reference block in the current frame and the candidate search block in the previous frame. The output k of the V/AD block is the MAD. The final step is to find the coarse motion vector, with the SR and M blocks, generating the minimum distortion among all candidate motion vectors. The SR block transfers the calculated 1D distortion measure to the bottom and right, whereas the M block finds the coarse motion vector giving the minimum distortion. Note that the possible vertical displacements with p = 2 are -2, -1, 0, +1, and +2. The MAD with vertical displacement of 0 is obtained at o2. The MAD with vertical displacement of -1 (+1) is generated at o1 (o3). Similarly, the MAD with vertical displacement of -2 (+2) is generated at o0 (o4). The final vertical displacement is determined by finding the smallest o_i, 0 ≤ i ≤ 4. Similarly, the possible horizontal displacements are -2, -1, 0, +1, and +2. Note that the last M block stores the smallest MAD. For example, if the output of the last M block changes at the third clock, the horizontal displacement is 0.
The output of the last M block changes at the second (fourth) clock when the horizontal displacement is -1 (+1), and at the first (fifth) clock when the horizontal displacement is -2 (+2). Note that, because image frames have finite size, we need to extend data across the image boundaries. Motion vectors near image boundaries can be detected by using any of the conventional schemes such as zero padding, repetition, and symmetrical extension.

Function definitions of the H, V/AD, SR, and M blocks are shown in Figs. 3(a), 3(b), 3(c), and 3(d), respectively. The H block calculates horizontal projections, stores the projection value in an internal register, and transfers it to the output o0. The H block also transfers the input pixel value i0 to the output o1. The V/AD block calculates vertical projections and computes the 1D distortion measure with the horizontal/vertical projections calculated from the current and previous frames. The horizontal (vertical) projections of the reference block are stored in the H0, H1, and H2 (V0, V1, and V2) blocks, where N = 3 is assumed. The H0, H1, and H2 blocks shown in Fig. 3(b) calculate the absolute difference between the stored horizontal projections of the reference block and the horizontal projections h00, h10, and h20 of the candidate block that are computed by the H blocks. The output of the H1 block is the sum of the absolute difference computed by the H1 block itself and the output of the H0 block. Similarly, the V0, V1, and V2 blocks compute the absolute difference between the stored vertical projections of the reference block and the vertical projections of the candidate block calculated by the S block using the pixel values passed through the H block. The output k of the V/AD block is obtained as the sum of the horizontal and vertical MADs. Also, the V/AD block transfers the input data h10, h20, 1,0, and 2,0 to the outputs h0, h1, a, and b, respectively. The SR block shown in Fig. 3(c) transfers input data i0, i1, and i2 to output data o0, o1, and o2, respectively, with a unit delay. The output m of the M block shown in Fig. 3(d) is defined by min(m0, m1), where min(x, y) represents the smaller value between x and y. Note that the output of the M block in the bottom row in Fig. 2 is used to find the coarse motion vector.
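The way the horizontal displacement is read off from the timing of the last M block can be illustrated with a small behavioral sketch. It assumes that one candidate distortion arrives per clock in order of horizontal displacement from -p to +p, so that a change at clock t (counted from 1) corresponds to displacement t - (p + 1), and that ties keep the earlier value. The tie rule and the function name `track_minimum` are assumptions, not taken from the figures.

```python
def track_minimum(distortions, p):
    """Model of a running-minimum (M-type) element fed one distortion per clock.
    Returns (minimum, displacement), where displacement = clock - (p + 1) and
    `clock` is the 1-based clock at which the stored minimum last changed."""
    best = None
    best_clock = None
    for clock, value in enumerate(distortions, start=1):
        if best is None or value < best:  # m = min(m0, m1)
            best, best_clock = value, clock
    return best, best_clock - (p + 1)
```

With p = 2, a minimum that settles at the third clock gives displacement 0, and one that settles at the second (fourth) clock gives -1 (+1), matching the description above.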
The second matching procedure can be realized in several ways: e.g., by using VLSI architectures for the FS BMA such as Komarek and Pirsch's motion estimator [18], the full tree and 16-cut tree architectures of Jehng et al. [19], and the proposed systolic array for the FS BMA (see next section) with p = 1. The proposed systolic array for the second matching stage is shown in Fig. 4, where nine AD blocks are employed for calculation of the 2D absolute difference. The 2D absolute difference is computed between the reference block in the current frame and the candidate block in the search area of the previous frame. In Fig. 4, pixel values of the current and previous frames are fed into the AD blocks from left to right and top to bottom. The output of each AD block is stored in the block itself. The required number of PE's for the second stage is (2p + 1)^2 + 1, i.e., 10 PE's with p = 1. Fig. 5(a) shows the function definition of the AD block, in which the absolute difference between i0 and i1 is computed. The accumulated absolute difference is stored in an internal register, and inputs i0 and i1 are transferred to outputs o0 and o1, respectively. Fig. 5(b) shows the function definition of the N block, in which the output n is defined by min(n0, n1, n2), where min(x, y, z) represents the smallest value of x, y, and z. The calculated MAD is stored in the AD block and is passed to the right when the MAD computation has been completed. Thus, the final motion vector is detected by finding the motion vector giving the smallest MAD among the rightmost AD blocks.
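The behavior of the two second-stage elements can be sketched as follows. This is a software model only: the class name `ADBlock`, the attribute `acc`, and the method `clock` are hypothetical, and the handshake that passes the finished MAD to the right is omitted.

```python
class ADBlock:
    """Behavioral model of an accumulate-absolute-difference element."""

    def __init__(self):
        self.acc = 0  # accumulated absolute difference (register name is hypothetical)

    def clock(self, i0, i1):
        """Consume one pixel pair, accumulate |i0 - i1|, pass the inputs through."""
        self.acc += abs(i0 - i1)
        return i0, i1  # outputs o0 and o1


def n_block(n0, n1, n2):
    """N block: the smallest of three accumulated distortions."""
    return min(n0, n1, n2)
```

Nine such accumulators, one per candidate displacement with p = 1, plus one N element give the 10 PE's quoted for the second stage.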
Note that the possible horizontal/vertical displacements for the second matching stage are -1, 0, and +1. The MAD with vertical displacement of 0 is obtained at n1. The MAD with vertical displacement of -1 (+1) is generated at n0 (n2). The final vertical displacement is determined by finding the smallest n_i, 0 ≤ i ≤ 2. If the output of the N block changes at the second clock, the horizontal displacement is 0. If the output of the N block changes at the first (third) clock, the horizontal displacement is +1 (-1).
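Putting the two stages together in software gives the sketch below. It assumes that the second stage examines the 3 × 3 neighborhood of the coarse vector, as the p = 1 array suggests, and that candidates outside the frame are skipped. All function names are hypothetical and the code models the arithmetic only, not the array timing.

```python
def block(frame, r, c, n):
    """Extract the n x n block with top-left corner (r, c)."""
    return [row[c:c + n] for row in frame[r:r + n]]


def sad_2d(a, b):
    """2D measure: sum of absolute pixel differences."""
    return sum(abs(x - y) for ra, rb in zip(a, b) for x, y in zip(ra, rb))


def sad_1d(a, b):
    """1D measure: absolute differences of row sums plus column sums."""
    rows = sum(abs(sum(ra) - sum(rb)) for ra, rb in zip(a, b))
    cols = sum(abs(sum(ca) - sum(cb)) for ca, cb in zip(zip(*a), zip(*b)))
    return rows + cols


def two_stage(cur, prev, r, c, n, p):
    """Coarse vector from the 1D measure over [-p, p]^2, then a 2D
    refinement over the 3 x 3 neighborhood of the coarse vector."""
    ref = block(cur, r, c, n)
    rows, cols = len(prev), len(prev[0])

    def inside(dy, dx):
        return 0 <= r + dy <= rows - n and 0 <= c + dx <= cols - n

    def best(cost, centre, radius):
        cands = [(centre[0] + dy, centre[1] + dx)
                 for dy in range(-radius, radius + 1)
                 for dx in range(-radius, radius + 1)]
        cands = [v for v in cands if inside(*v)]
        return min(cands, key=lambda v: cost(ref, block(prev, r + v[0], c + v[1], n)))

    coarse = best(sad_1d, (0, 0), p)
    return best(sad_2d, coarse, 1)
```

Only nine candidates are evaluated with the expensive 2D measure, regardless of p.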
The required number of clocks of the two-stage BMA is N(N + 3) + 6p + 1 (see Table I): 2(N + 1) + 6p clocks for the first matching stage and N^2 + N - 1 clocks for the second matching stage. The required number of PE's for the two-stage BMA is N + 2p^2 + 7p + 13 (see Table II): N + 2p^2 + 7p + 3 PE's for the first stage ((N + 2p) H's, (2p + 1) V/AD's, p(2p + 1) SR's, and (2p + 2) M's) and 10 PE's (9 AD's and one N) for the second stage.

Fig. 6 shows VHDL [9]-[11] synthesis results of the first matching stage with H, V/AD, SR, and M blocks. VHDL descriptions of the designs are structured hierarchically from the behavioral or data-flow descriptions of the basic functional modules. Synthesis results are obtained using technology independent generic gates. We obtain optimization results using the VGC 450 library. In the first stage, the numbers of gates required for the H, V/AD, SR, and M blocks are 448, 1969, 198, and 187, respectively. Thus, the estimated total number of gates required for the first matching stage is 66757. Fig. 7 shows its VHDL simulation results, where rst, clk, row, and col represent the reset signal, clock signal, and row and column positions of the reference block, respectively. The outputs of the absolute difference at each row (see Fig. 2) are denoted by o_i, 0 ≤ i ≤ 6. The coarse motion vector is denoted by mvx and mvy. In Fig. 7, the coarse motion vector detected is (3, 1). VHDL simulation results in Figs. 6 and 7 show the functional validity of the proposed architecture.
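The clock and PE counts can be checked with a short calculation. The sketch below reads the per-stage clock counts as 2(N + 1) + 6p and N^2 + N - 1, which is the reading that sums to the stated total N(N + 3) + 6p + 1, and takes the PE breakdown from the text. The function names are hypothetical.

```python
def two_stage_clocks(n, p):
    """Return (first-stage, second-stage, total) clock counts."""
    first = 2 * (n + 1) + 6 * p   # first matching stage
    second = n * n + n - 1        # second matching stage
    return first, second, first + second


def two_stage_pes(n, p):
    """Return (first-stage PE breakdown, second-stage PE breakdown, total)."""
    first = {"H": n + 2 * p, "V/AD": 2 * p + 1,
             "SR": p * (2 * p + 1), "M": 2 * p + 2}
    second = {"AD": 9, "N": 1}
    return first, second, sum(first.values()) + sum(second.values())
```

For (N, p) = (16, 7) these formulas give 76 + 271 = 347 clocks and 166 + 10 = 176 PE's; the values are arithmetic from the formulas rather than entries copied from Tables I and II.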
B. Array Architecture for the FS BMA
The FS BMA is preferred in implementing a motion estimator since it gives optimal results. Its implementation also shows useful features such as regularity and fixed-step operation [18]-[23]. Most conventional BMA architectures, such as Komarek and Pirsch's AB1, AB2, AS1, and AS2 types [18], use systolic arrays, especially the systolic mesh connected array (SMCA). In general, the SMCA requires a large number of memory accesses per unit time for high throughput, while its concurrent memory access is limited; thus a memory bandwidth problem may arise. To solve this problem, Hsieh and Lin [20] added some shift registers to the AB2 type of Komarek and Pirsch, so that their architecture requires a single memory access per processing time. With a few exceptions, the SMCA has input skew, in which input data to the arrays are delayed sequentially for valid computation, causing data latency. Jehng et al. [19] proposed a tree architecture having no data skew. This tree architecture has high throughput and is suitable for some algorithms such as the TSS [4], in which previous results of an adjacent candidate block are utilized and a large amount of simultaneous memory access is required.

Fig. 8 shows the proposed systolic array for the FS BMA, with N = 3 and p = 2, which is a modified structure of the AB2 type of Komarek and Pirsch. The current frame is stored in the Q block and the previous frame is fed into the Q block in a predefined order as shown in Fig. 8. The T block transfers data with a unit delay. The absolute difference between pixel values of the reference and candidate blocks computed in each Q block is transferred from top to bottom. The P block in the last row computes the MAD between the reference block and candidate blocks, where a dot represents a unit delay. Finally, the M block finds the motion vector. Note that the proposed architecture computes the absolute difference using the Q block in two directions: i.e., left to right and right to left.
The proposed systolic array for the FS BMA is faster than the conventional architectures, such as Komarek and Pirsch's architectures and the tree architectures of Jehng et al., with a similar number of PE's. Fig. 9 shows the function definition of the basic PE's. The Q block shown in Fig. 9(a) stores the pixel values of the reference block into two internal registers, then calculates the absolute difference. The inputs i2 and i3 to the Q block are transferred, in the reverse direction, to outputs o2 and o3, respectively, and the output o0 (o1) is the sum of i0 (i1) and the calculated absolute difference. The output m of the M block shown in Fig. 9(b) is given by min(m0, m1), whose function is the same as that of the M block shown in Fig. 3(d). The P block shown in Fig. 9(c) transfers the input i2 to o0, and the output o1 (o2) is the sum of i1 and i3 (i0 and i4). Also, the T block in Fig. 9(d) transfers the inputs i0 and i1 to o0 and o1, respectively, with a unit delay.
The required number of clocks of the proposed architecture of the FS BMA is (N + 2p)(p + 3) (see Table I). The required number of PE's for the proposed architecture is N^2 + 2N + 1 (see Table II): N^2 Q's, N P's, N T's, and one M.
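The same kind of check can be made for the FS array. The sketch below takes the clock count to be (N + 2p)(p + 3), which is an assumed reading of the formula, and the PE count to be N^2 + 2N + 1 with the breakdown given in the text. The function names are hypothetical, and the results are arithmetic from these formulas rather than entries copied from Tables I and II.

```python
def fs_clocks(n, p):
    """Clock count of the FS array under the assumed reading (N + 2p)(p + 3)."""
    return (n + 2 * p) * (p + 3)


def fs_pes(n):
    """Return (PE breakdown, total) for the FS array."""
    counts = {"Q": n * n, "P": n, "T": n, "M": 1}
    return counts, sum(counts.values())


if __name__ == "__main__":
    for n, p in [(8, 7), (8, 15), (16, 7), (16, 15)]:
        print(n, p, fs_clocks(n, p), fs_pes(n)[1])
```

The PE total equals (N + 1)^2, so the array size depends on the block size only, while the clock count grows with the search range through both factors.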
IV. Performance Analysis
In this section, we analyze the performance of the proposed VLSI architectures for the two-stage BMA and FS BMA. Table I shows the required processing time in number of clocks, as a function of N and p, of the AB1 type, AB2 type, AS2 type, full tree, 16-cut tree, proposed FS BMA architecture, and proposed two-stage BMA, where N represents the block size and p denotes the maximum displacement assumed. One processing time is defined as the time required for a single PE operation. The AB1 type using 1D systolic arrays is slower than the AB2 type employing 2D systolic arrays, whereas the required number of PE's for the AB1 type is smaller than that for the AB2 type. The AS2 type requires a processing time similar to that of the AB2 type. The full tree architecture has too high a hardware cost, thus a 16-cut tree architecture is used. The first matching stage of the two-stage BMA is realized by the proposed architecture in Fig. 2, and the second matching stage can be implemented by various architectures such as the AB1 type, AB2 type, AS2 type, full tree, 16-cut tree, and the proposed architecture in Fig. 4. The AB2 type, AS2 type, and full tree are fast; however, their hardware cost is high.
The processing time required for each method with four parameter sets (N, p) = (8, 7), (8, 15), (16, 7), and (16, 15) is also listed in Table I. From Table I, we can see that the proposed motion estimator is faster than the conventional motion estimators except for the full tree architecture.
In each case, the processing time largely depends on the search area size, N + 2p. Therefore, the proposed motion estimator is efficient when the search area is large. In recent video communication systems, coding schemes for moving sequences tend to use a large search area in order to find more accurate motion, which is desirable for finding large and abrupt motions. Note that the number of PE's is related to the complexity of the PE's and the processing time.
With (N, p) = (16, 7), the conventional FS BMA architectures, i.e., the AB1 type, AB2 type, AS2 type, full tree, and 16-cut tree, and the proposed architecture require 34, 529, 511, 512, 32, and 545 adders, respectively. With (N, p) = (16, 7), the number of adders for the first stage of the two-stage BMA is 286, and the numbers of adders for the second stage using the AB1 type, AB2 type, AS2 type, full tree, 16-cut tree, and the proposed architecture are 34, 529, 103, 512, 32, and 20, respectively.
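The adder figures quoted above can be combined to obtain the total for a complete two-stage estimator. The sketch below only adds the first-stage count to the chosen second-stage count; the dictionary and function names are hypothetical, and the numbers are those stated in the text for the (16, 7) parameter set.

```python
FIRST_STAGE_ADDERS = 286

# Adders for the second stage, by choice of second-stage architecture.
SECOND_STAGE_ADDERS = {"AB1": 34, "AB2": 529, "AS2": 103,
                       "full tree": 512, "16-cut tree": 32, "proposed": 20}

# Adders for a complete FS BMA estimator, by architecture.
FS_ADDERS = {"AB1": 34, "AB2": 529, "AS2": 511,
             "full tree": 512, "16-cut tree": 32, "proposed": 545}


def two_stage_adders(second_stage):
    """Total adders of the two-stage BMA for a given second-stage architecture."""
    return FIRST_STAGE_ADDERS + SECOND_STAGE_ADDERS[second_stage]
```

With the proposed second stage, the two-stage estimator needs 286 + 20 = 306 adders, compared with 545 for the proposed FS array.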
The TSS is not a sequential process, thus it is not suitable for hardware realization using systolic arrays. The conventional TSS architectures were based on full tree and 16-cut tree structures. Note that tree architectures require a large number of pins. The required numbers of PE's with (N, p) = (16, 7) of the full tree and 16-cut tree structures are 512 and 32, respectively. The processing times with (N, p) = (16, 7) of the full tree and 16-cut tree structures are 54 and 864, respectively. The proposed architectures for the two-stage BMA are comparable to those for the conventional TSS.
V. Conclusions
The FS BMA gives the optimal performance for blockwise motion estimation, but its computational overhead is too large for real-time processing. Recently, the development of specialized visual processors using systolic/wavefront array processors dedicated to a specific job has accelerated.
In this paper, we propose VLSI architectures using systolic arrays for the two-stage BMA and FS BMA. For the parameter sets commonly adopted in conventional motion estimation, the proposed architectures are faster than the conventional architectures. We show VHDL synthesis and simulation results of the first stage of the two-stage BMA. Simulation results show the functional validity of the proposed architecture. Further research will focus on the development of efficient algorithms and their possible hardware implementation techniques for real-time processing.
