Hierarchical block matching is an efficient motion estimation technique which provides an adaptation of the block size and the search area to the properties of the image. In this work, we propose two novel special-purpose architectures to implement hierarchical block matching for real-time applications. The first architecture is memory-efficient, but requires a large external memory bandwidth and a large number of processors. The second architecture requires significantly fewer processors, but additional on-chip memory. We describe in detail the processor architecture, the memory organization and the scheduling for both architectures. We also show how the second architecture can be modified to handle full-search and 3-step hierarchical search block matching algorithms, with a significant reduction in hardware complexity compared to existing architectures.
Introduction
In many applications of visual communication, moving images are transmitted over channels with low transmission rates. Since the amount of image data to be transmitted is large, video compression techniques need to be employed. Video compression techniques reduce two types of redundancy in the image: (i) spatial and (ii) temporal. Spatial redundancy is the redundancy in a single frame, while temporal redundancy is the redundancy between successive frames. Spatial redundancy is exploited by a combination of differential pulse coded modulation (DPCM) and transform coding [1]. Temporal redundancy, on the other hand, is exploited by motion compensated prediction [2]. An important component in motion compensated prediction is estimating motion between successive frames [3]. In fact, in video compression coders, motion estimation takes up more than 50% of the computational needs of the entire compression process [4].
There are two main techniques for motion estimation [2]: pel recursive and block matching. Pel recursive algorithms estimate motion between successive frames on a pixel by pixel basis, whereas block matching algorithms (BMA) estimate motion on a block by block basis. In BMA, the image frame is divided into blocks (generally of size $8 \times 8$ or $16 \times 16$), and a motion vector is estimated for each block. BMAs are becoming more popular due to their relative computational simplicity, and have been adopted for motion estimation in the MPEG standards [5].
There are two classes of BMAs: fixed block size BMA [2], [3], and hierarchical BMA [6]. In a fixed block size BMA, a fixed size reference block is compared with candidate blocks in a fixed size search area in the previous frame. The matching criterion most commonly used for finding the best match is the mean of the absolute differences (MAD) [3]. A full search technique or a selective search technique is used to find the block with the minimum value of MAD in the search area. Fixed size BMAs give unreliable motion estimates for small block sizes and large displacements. They also fail to give accurate estimates if different parts within a block move in different directions. Hierarchical BMAs, on the other hand, provide an adaptation of block size and search area to the properties of the image. In hierarchical BMAs, the search is carried out in a hierarchical fashion. The size of the block and the search area vary at different levels of the hierarchy. The computation starts out with large blocks and large search areas, and in subsequent levels the sizes are reduced. The search in each level is continued around the best fitting candidate block in the previous level. At each level of the hierarchy, either a full search or a selective search technique is employed. Bierling has proposed a hierarchical BMA that uses a selective search scheme and has shown it to give more

2 Hierarchical Block Matching Algorithm
In hierarchical block matching (HBMA), the size of the block and the search area vary at different levels of the hierarchy. At lower levels of the hierarchy, larger block sizes are used in order to estimate the motion of larger parts of the image, while at higher levels smaller block sizes are used in order to estimate the motion of the smaller blocks. Fig. 1 shows the principle of hierarchical block matching for two levels of hierarchy. At the lowest level, a displacement estimate $\tilde{x}$ is made for the block marked I. At the next level of the hierarchy, the block size as well as the maximum possible displacement is reduced, and a displacement estimate $\tilde{z}$ is made for the block marked II around the estimated displacement of block I. The displacement for the reference block is then the sum of the two displacements, $\tilde{y} = \tilde{x} + \tilde{z}$. Table 1 gives the parameters of a simplified hierarchical block matching algorithm for two levels of hierarchy [6]. The parameters apply in both dimensions.

Table 1:

    Parameter at level                     1     2
    Max. update displacement ($d_l$)       7     3
    Measurement window size ($W_l$)       64    16
    Step size ($S_l$)                     32    16
    Filter window size ($F_l$)             5     5
    Subsampling ($U_l$)                    8     2

The level 1 displacement is calculated

(1)

where $X_{t-T}$ is the previous frame at time $t - T$, and $d$ is the maximum possible displacement at that step. The $(i, j)$ vector corresponding to the minimum MAD value for block $B_{m,n}$ is the estimated displacement vector or motion vector in that step. Since motion vectors are computed for blocks corresponding to every $S_2$ pixels, we say that the image is effectively divided into blocks of size $S_2 \times S_2$.
Parameters
In Table 1, the number of operations in a MAD computation is $W_l^2 / U_l^2 = 64$, $l = 1, 2$. At each level of the hierarchy, the $\log_2 D$ step search technique (modified three step search) is used, where $D$ = max. displacement + 1 [16]. In level 1, the maximum possible displacement is $d_1 = 7$ pixels, and thus a $\log_2(7 + 1) = 3$ step search is carried out. In level 2, the maximum possible displacement is $d_2 = 3$ pixels, and a 2-step search is performed. In each step, the reference block is matched with 9 candidate blocks in the search area. Thus $3 \times 9 = 27$ MAD computations are carried out in level 1, and $2 \times 9 = 18$ MAD computations are carried out in level 2. A low pass mean value filtering on an $F_l \times F_l = 5 \times 5$ block is carried out on the original image frames to improve the accuracy of the motion estimates [6]. Figure 2 shows part of the image frame for the case when $S_1 = 2S_2$. The level 1 displacement is estimated using MAD computations for blocks $B_{i,j}$, $B_{i,j+2S_2}$, $B_{i+2S_2,j}$, $B_{i+2S_2,j+2S_2}$, etc., and the level 2 displacement is estimated for all the blocks. Since the level 1 displacement is required for the estimation of the level 2 displacement, the blocks which do not compute the level 1 displacement using MAD estimate it by bilinear interpolation of the level 1 displacements of the four closest neighbors [6].
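The step search described above can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function names, the step sizes $(d+1)/2, \ldots, 2, 1$, the tie-breaking rule and the synthetic ramp frames are all assumptions made here. It only shows how 9 candidates per step lead to 27 MAD evaluations for a maximum displacement of 7 and 18 for a maximum displacement of 3.

```python
def mad(cur, prev, top, left, dy, dx, W, U):
    """Mean absolute difference over a W x W window, keeping every U-th pixel."""
    total = 0
    count = 0
    for r in range(0, W, U):
        for c in range(0, W, U):
            a = cur[top + r][left + c]
            b = prev[top + r + dy][left + c + dx]
            total += abs(a - b)
            count += 1
    return total / count


def log_step_search(cur, prev, top, left, W, U, d, start=(0, 0)):
    """Return (best displacement, number of MAD evaluations).

    Assumed step sizes: (d + 1) / 2, ..., 2, 1, i.e. log2(d + 1) steps.
    """
    best = start
    n_mad = 0
    step = (d + 1) // 2
    while step >= 1:
        cy, cx = best
        scored = []
        for sy in (-1, 0, 1):
            for sx in (-1, 0, 1):
                dy, dx = cy + sy * step, cx + sx * step
                scored.append((mad(cur, prev, top, left, dy, dx, W, U), (dy, dx)))
                n_mad += 1
        best = min(scored)[1]  # ties are broken by the smaller displacement tuple
        step //= 2
    return best, n_mad


# Synthetic frames (hypothetical): a linear ramp, current frame shifted by (4, -2).
prev = [[100 * r + c for c in range(64)] for r in range(64)]
cur = [[100 * (r + 4) + (c - 2) for c in range(64)] for r in range(64)]

# Level-1-style search (d = 7), then a refinement (d = 3) around its result.
vec, n = log_step_search(cur, prev, top=24, left=24, W=16, U=2, d=7)
vec2, n2 = log_step_search(cur, prev, top=24, left=24, W=16, U=2, d=3, start=vec)
```

With a 16-pixel window subsampled by 2, each call to `mad` performs 64 operations, which matches the operation count quoted for Table 1. The search is greedy, so on real image data it is not guaranteed to find the global minimum of MAD.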
In the rest of the paper, we use the following set of parameters for blocks of size $S_2 \times S_2 = 8 \times 8$. The step size is $S_1 = 16$, and the rest of the parameters, namely $d_l$, $W_l$, $F_l$ and $U_l$, are the same as in Table 1. We use these parameters to calculate the hardware resources, memory bandwidth, etc. for the proposed architectures.
Number of Processors
The minimum number of processors that would be required for real-time implementation of the two level hierarchical BMA is calculated as follows. Let the image size be $M \times N$. For

3 Architectures for HBMA
In this section we present two architectures for two-level hierarchical BMA. We first describe the common architectural features such as processor configuration, memory organization, scheduling, etc., and then give the details of each architecture. We describe the architectures in terms of the general parameters ($W_l$, $U_l$, $S_l$, etc.) and then give details using the specific values from Table 1.
We assume that the parameters in the two levels are related as follows:
e. total number of MAD computations for blocks in levels 1 and 2 are the same.
2. $U_1$ and $U_2$ are either both odd or both even.
While the above assumptions are mildly constraining, they enable a very efficient architecture design. For instance, assumption 1 simplifies the scheduling, while assumption 2 enables efficient memory organization.
Processor Architecture
Let $p$ be the minimum number of processors required for real-time computation. Then from equation (2), we have $p = \lceil 9MNfc\,(3(Q/S_1)^2 + 2(Q/S_2)^2) \rceil$. The time required to process a block is $n_c = \lceil Q^2/p \rceil$ cycles (assuming one operation is performed every cycle by each processor). If $Q^2/p$ is not an integer, then in the last cycle of processing some processors will be idle. This implies that additional processors will be required to meet the computation rate. The number of processors is then chosen to be $p'$, where $p'$ is the smallest integral factor or multiple of $Q$ that is greater than or equal to $p$, such that the computation rate is met.
Example: For the parameters in Table 1, if the image is of size $576 \times 720$, $f = 25$ Hz and $c = 200$ ns, then the minimum number of processors is $p = 13$. Since every block match involves 64 computations, one processor will be idle in the last cycle. The number of processors is chosen to be $p' = 16$, since $p'$ is a multiple of $Q = 8$ and the computation rate is met.

The proposed architecture consists of $p'$ processors which are interconnected by a tree structure as shown in Fig. 3. The tree architecture is adopted to achieve small latency and fast computation times [10]. The $p'$ processors of type $P_D$ form the leaves of the tree. These processors compute the absolute differences between the pixels of the reference block ($X$) and the pixels of the candidate block ($Y$). The differences are added using a tree of adders. Processing element $P_M$ keeps track of the minimum value of MAD. Each level of the tree can be considered as a pipeline stage and the latency is $\lceil \log_2 p' \rceil + 3$. Fig. 3 shows the processor architecture for the case when $p' = 8$. Note that the tree of processors can be replaced by a systolic array at the expense of an increase in the latency and, consequently, an increase in the pipeline refill time.
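The sizing example above can be checked numerically. The sketch below is an illustration only: the function names are hypothetical, the step sizes 32 and 16 are taken from Table 1, and exact rational arithmetic is used so that the ceiling is not affected by floating-point rounding.

```python
from fractions import Fraction
from math import ceil


def min_processors(M, N, f, c, Q, S1, S2):
    """p = ceil(9*M*N*f*c*(3*(Q/S1)^2 + 2*(Q/S2)^2)), with c in seconds."""
    load = 9 * M * N * f * c * (3 * Fraction(Q, S1) ** 2 + 2 * Fraction(Q, S2) ** 2)
    return ceil(load)


def round_up_processors(p, Q):
    """Smallest count >= p that is a factor or a multiple of Q."""
    n = p
    while Q % n != 0 and n % Q != 0:
        n += 1
    return n


c = Fraction(200, 10**9)            # 200 ns cycle time
p = min_processors(576, 720, 25, c, Q=8, S1=32, S2=16)
n_c = ceil(Fraction(8 * 8, p))      # cycles per block match
idle = n_c * p - 8 * 8              # idle processor slots in the last cycle
p_rounded = round_up_processors(p, Q=8)
```

Under these assumptions the computed load is about 12.83, so 13 processors are needed, a block match takes 5 cycles with one idle slot, and the rounded-up count is 16, in agreement with the example.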
Memory Organization
The memory organization has to be such that multiple processors can access data simultaneously from the image memories. Figure 4 shows the memory organization for the case when $p' = 16$ ($r = 3$, $e = 9$). The module width is the same as the pixel width.
The image frames are distributed among the memory banks and modules in the following way.
Pixel $I_{i,j}$ is allocated to memory bank $b$, and within bank $b$ to memory module $k$. is the address of pixel $I_{i,j}$ in module $k$.
The above distribution is such that each memory bank contains an entire row of pixels (i.e. $N$ pixels) of the image frame, and each line is distributed in the cyclic mode among the memory modules. In Appendix 1 we prove that the distribution function of equation (3) supports simultaneous access of $p'$ elements which are $U_1$ apart for the level 1 search and $U_2$ apart for the level 2 search.
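The conflict-free property claimed above can be illustrated with a small check. Since equation (3) is not reproduced here, the sketch assumes the simplest cyclic mapping consistent with the description: column $j$ goes to module $j \bmod e$ and row $i$ goes to bank $i \bmod r$. The function names, the 8 pixels per row access and the 2 rows per access are assumptions chosen to match the Figure 4 case.

```python
def units_hit(start, stride, count, num_units):
    """Units (modules or banks) touched by `count` accesses `stride` apart."""
    return [(start + v * stride) % num_units for v in range(count)]


def conflict_free(stride, count, num_units):
    """True if, for every starting index, all accesses land in distinct units."""
    return all(
        len(set(units_hit(s, stride, count, num_units))) == count
        for s in range(num_units)
    )


# Figure 4 case: 16 processors, 8 pixels per block row, r = 3 banks, e = 9 modules.
module_ok = [conflict_free(U, 8, 9) for U in (8, 2)]   # 8 pixels of one row
bank_ok = [conflict_free(U, 2, 3) for U in (8, 2)]     # 2 rows accessed at once
bad = conflict_free(2, 8, 8)                           # 8 modules would collide
```

For both subsampling factors (8 and 2) the accesses fall in distinct modules and banks, whereas a hypothetical choice of 8 modules with a stride of 2 would produce collisions.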
Interpolator
Bilinear interpolation is required for all blocks for which level 1 displacements are not estimated by MAD computations. In particular, it is required for all blocks $\{B_{i,j}\}$ such that ($i \bmod S_2 = 0$ and $j \bmod S_2 = 0$) but ($i \bmod S_1 \neq 0$ or $j \bmod S_1 \neq 0$), for $0 \le i \le M - 1$, $0 \le j \le N - 1$. In Figure 2, where $S_1 = 2S_2$, bilinear interpolation is required for blocks $B_{i,j+S_2}$

The bilinear interpolator consists of $\frac{N}{2S_2} + 2$ shift registers to hold the level 1 displacement vectors and four add-shift units as shown in Figure 5. Each add-shift unit adds two displacement vectors and right shifts the sum vector by 1 bit (equivalent to division by 2). Since the maximum estimated displacement for each vector component is 10 pixels, each displacement vector is housed in two 5-bit registers. Although the above scheme is for a two level HBMA, a similar scheme can be designed for HBMA with 3 or more levels. Recently, an interpolator has been proposed for 3 level HBMA in [19].
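The add-shift operation lends itself to a short sketch. The wiring below is an assumption, since Figure 5 is not reproduced here: it uses four add-shift units to derive the three missing level 1 vectors of a group from the four surrounding MAD-estimated vectors. All names and the sample vectors are hypothetical.

```python
def add_shift(a, b):
    """One add-shift unit: component-wise (a + b) >> 1."""
    return ((a[0] + b[0]) >> 1, (a[1] + b[1]) >> 1)


def interpolate_group(v00, v01, v10, v11):
    """Interpolate the three blocks lying between four level 1 estimates.

    v00, v01 are the estimates on the upper row (left, right);
    v10, v11 are the estimates on the lower row (left, right).
    """
    top = add_shift(v00, v01)         # block to the right of v00's block
    left = add_shift(v00, v10)        # block below v00's block
    bottom = add_shift(v10, v11)      # intermediate value only
    centre = add_shift(top, bottom)   # diagonal block, average of all four
    return top, left, centre


right, below, diag = interpolate_group((4, -2), (6, 2), (0, -2), (2, 2))
```

Because each unit truncates with a right shift, the two-stage average for the diagonal block can differ by one from an exact division by 4; Python's `>>` is an arithmetic shift, so negative components round toward minus infinity.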
Scheduling
Hierarchical BMA has two types of computational dependencies: inter-level and intra-level. Inter-level dependency is the dependency between computations in different levels of search, while intra-level dependency is the dependency between computations in the same level.
Inter-level Dependency : For all the blocks, computations of the level 2 displacement cannot begin unless the level 1 displacements are available. For some blocks, the level 1 displacement is not computed by MAD, and the level 2 displacement computations have to wait until bilinear interpolation is done on the level 1 displacements of the neighboring blocks.
Intra-level Dependency: At each level of search, the search for the next step is made around the estimated displacement of the previous step. For instance, in the level 1 search, the step 3 search cannot proceed before the result from the step 2 search is available, and similarly, the step 2 search cannot proceed before the result from step 1 is available.
The scheduling schemes of both Architectures 1 and 2 employ interleaving to avoid the performance degradation due to these dependencies.
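The benefit of interleaving can be seen in a toy timing model. This is a sketch under simplifying assumptions, not the schedule of either architecture: one pipeline, a fixed number of issue cycles per search step, a fixed pipeline latency, and each step of a block waiting for the result of that block's previous step. The numbers used (36 issue cycles for 9 block matches at 4 cycles each, latency 7) are borrowed from the 16-processor example and are illustrative only.

```python
def total_cycles(num_blocks, steps, issue, latency, interleave):
    """Cycles needed to finish `steps` dependent search steps for each block."""
    ready = [0] * num_blocks   # earliest cycle at which each block's next step may issue
    t = 0                      # cycle at which the pipeline input becomes free
    if interleave:
        schedule = [b for _ in range(steps) for b in range(num_blocks)]
    else:
        schedule = [b for b in range(num_blocks) for _ in range(steps)]
    for b in schedule:
        t = max(t, ready[b]) + issue   # wait for the previous result, then issue
        ready[b] = t + latency         # result emerges `latency` cycles later
    return t + latency


plain = total_cycles(2, 3, issue=36, latency=7, interleave=False)
mixed = total_cycles(2, 3, issue=36, latency=7, interleave=True)
```

In this model, processing two blocks one after the other stalls four times, while alternating between the two blocks hides the latency completely because the other block's step is longer than the pipeline depth.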
Architecture 1
Architecture 1 has an on-chip module which consists of a set of processors, an address generating circuit and an interpolator, and an off-chip module which consists of two image memories to store the current frame and the previous frame. The processors are organized in a tree structure as described in Section 3.1. The off-chip memories are organized into memory banks and memory modules as described in Section 3.2. The off-chip memory is the slowest unit, and hence the system operates at the rate of the off-chip memory access time. For bit parallel operation, if $b$ bits are used to represent every pixel and $p'$ pixels are accessed simultaneously, the processor chip pin count requirement ($> bp'$) will be very high. Since the memory access time is fairly high, accesses to modules within a bank can be pipelined and interleaved to reduce the pin count as in [10]. The large memory access time and the interleaved accesses can also be used to route a pixel to the 'right' processor. Routing is required since the same pixel is required by different leaf processors at different instants of time while processing different blocks. Figure 6 shows the block diagram of this architecture for the case when $p' \geq Q$. The image memories contain data that has been filtered by a $5 \times 5$ low pass filter. The computation schedule for the case when $S_1 = 2S_2$ (see Figure 2) is given in Table 2.
The computations of multiple blocks are interleaved to avoid pipeline stalls that would otherwise be caused by the inter-level and intra-level dependencies. For instance (see Table 2), after the level 1-step 1 search of block $B_{i+2S_2,j+2S_2}$, instead of waiting for the search results to become available (this is equivalent to emptying the pipeline in the tree of processors) and then proceeding with the level 1-step 2 search of $B_{i+2S_2,j+2S_2}$, the level 1-step 1 search of the next block, $B_{i+2S_2,j+4S_2}$, is computed. After the level 1 vectors for blocks $B_{i+2S_2,j+2S_2}$ and $B_{i+2S_2,j+4S_2}$ are available, the level 2 searches for all adjacent blocks are scheduled. The computations for the two steps in the level 2 search can once again be interleaved among these blocks to avoid any pipeline stalls.

Table 4: Memory bandwidth requirement in Mbytes/s for the three algorithms without and with on-chip memory.
Architecture 2
One major disadvantage of Architecture 1 is that it has very high memory bandwidth requirements (see Table 4). This is because the same pixels are used for multiple computations, and are consequently accessed as many times from the external memory. In this section we propose an architecture that, on the one hand, reduces the external memory bandwidth and, on the other hand, requires fewer processors. This is achieved by storing the search area on-chip in a local memory, and by operating the processors at the faster local memory access rate. Since the processors are essentially word-level subtractors, they can easily be clocked at local memory access times of (say) 70 ns.
As in Architecture 1, the computation unit of Architecture 2 consists of a tree of processors working in parallel. To support simultaneous data accesses, the on-chip memory is distributed among multiple memory modules. Moreover, since the same pixel is accessed by different leaf processors at different times (for different computations), a data shuffler unit is required to reorder the $p'$ data elements accessed from the local memory before they are forwarded to the leaf processors. The local memory, the data shuffler, the processor tree, and the address generator units may be housed on the same chip. Each of these units is pipelined so that it can be clocked at the local memory access rate. Fig. 7 shows the proposed system architecture.
Processor Architecture
The computation unit consists of a tree with $p'$ leaf processors as described in Section 3.1. The processors as well as the adders are clocked at the rate of the on-chip memory access time. As a result, the number of processors in Architecture 2 is significantly smaller than that of Architecture 1 (where the processors are clocked at the rate of the off-chip memory access time).
Memory Organization
The memory system consists of two external image memories to store the current and previous image frames. In addition, it consists of two on-chip memories to store the data that would be otherwise accessed multiple times from the external memories.
On-chip Memory Organization

On-Chip Memory 1: On-chip Memory 1 is used to store the search area from the previous image frame. For a block whose top leftmost pixel is $I_{i,j}$, the level $l$ search area includes pixels $\{I_{i+m,j+n}\}$ where $0 \le m, n \le W_l - U_l + 2\sum_{k=1}^{l} d_k$ and $d_l$, $U_l$, $W_l$ are the parameters for level $l$. Thus the search area at level $l$ is of size $(W_l + 1 - U_l + 2\sum_{k=1}^{l} d_k)^2$.

In HBMA, not only do the level 1 and level 2 search areas of a block overlap, the search areas of neighboring blocks also overlap. This fact can be exploited to reduce the memory bandwidth. The nonoverlapping part of the search area for the next block can be loaded while the current block is being matched. The amount of overlap between search areas of neighboring blocks at level $l$ is $(W_l + 1 - U_l + 2\sum_{k=1}^{l} d_k - S_l) \times (W_l + 1 - U_l + 2\sum_{k=1}^{l} d_k)$.
The exact amount of overlap between the level 1 and level 2 search areas as well as the overlap between the search areas of neighboring blocks, depends on the actual values of the various parameters.
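As a worked illustration of these expressions, the sketch below evaluates the search area side and the overlap for the Table 1 parameters. The function names are hypothetical, and the "fresh" figure is simply the search area minus the overlap given above.

```python
def search_area_side(level, W, U, d):
    """Side length W_l + 1 - U_l + 2 * (d_1 + ... + d_l); level is 1-based."""
    l = level - 1
    return W[l] + 1 - U[l] + 2 * sum(d[:level])


def fresh_and_overlap(level, W, U, d, S):
    """Pixels newly loaded vs. reused when moving to the neighbouring block."""
    side = search_area_side(level, W, U, d)
    step = S[level - 1]
    return step * side, (side - step) * side


# Table 1 parameters, indexed by level.
W, U, d, S = (64, 16), (8, 2), (7, 3), (32, 16)
sides = [search_area_side(l, W, U, d) for l in (1, 2)]
fresh, overlap = fresh_and_overlap(1, W, U, d, S)
```

For these values the level 1 search area is 71 pixels on a side and the level 2 search area is 35 pixels on a side. At level 1, a strip of 32 by 71 pixels is new for each neighbouring block and the remaining 39 by 71 pixels are reused.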
Example : For the parameter values in Section 2, the on-chip memories store the data that is required to process a group of 4 neighboring blocks. In each group, level 1 and level 2 displacements are required for one block, and level 2 displacements are required for the remaining three blocks.
For instance, in Figure 2,

The organization of on-chip Memory 1 is the same as that described in Section 3.2. The search area is distributed among memory banks and modules as per the distribution function of equation (3). The only difference is that for a search area of size (say) $K \times L$, the parameters $M$ and $N$ in equation (3) have to be replaced by $K$ and $L$. Since the system now operates at a much faster clock rate, it may not be possible to pipeline on-chip memory accesses within memory banks. Hence an address generating circuit is required for every on-chip memory module.
On-Chip Memory 2: On-chip Memory 2 is used to store the reference blocks which are being matched. For the parameter values in Section 2, Memory 2 stores the five reference blocks (of the current frame) corresponding to a group. These 5 blocks are the blocks used for calculating the level 1 and level 2 displacements of one block, and the level 2 displacements of the remaining 3 blocks in a group. Thus the size of on-chip Memory 2 is $5Q^2$ bytes, since $Q^2$ pixels need to be stored per block. While the displacements of one group are being computed, the five blocks of the next group are loaded. For instance (see Figure 2), the level 1 block for $B_{i+4S_2,j+4S_2}$ is loaded only after the level 1 displacement for $B_{i+2S_2,j+2S_2}$ is computed, the level 2 block for $B_{i+4S_2,j+3S_2}$ is loaded only after the level 2 displacement for $B_{i+2S_2,j+S_2}$ is computed, and so on. If we define the time required to carry out one step search of a hierarchy level (9 block matches) as one time unit, then it takes a total of 11 time units to estimate the vectors of a group of 4 blocks. The 5 blocks for the next group are loaded in 10 (out of 11) time units.
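One way to account for the 11 time units and the memory size, assuming the step counts given in Section 2 (3 steps at level 1, 2 steps at level 2), one time unit per step, one byte per pixel and the value 8 used for Q elsewhere in the paper:

```latex
\underbrace{1 \times 3}_{\text{level 1, one block}}
+ \underbrace{4 \times 2}_{\text{level 2, four blocks}}
= 11 \ \text{time units},
\qquad
5Q^2 = 5 \times 8^2 = 320 \ \text{bytes}.
```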
Since the number of processors is $p'$, on-chip Memory 2 consists of $p'$ memory modules. The data can be distributed in the cyclic or in the consecutive mode. Since successive reference blocks overlap, the size of on-chip Memory 2 can also be reduced if the common pixels are stored only once.

Table 1, $32 \times 71$ pixels need to be accessed during the computation of 4 blocks, and thus $A = 8 \times 71$. Therefore the number of external memory modules is $\lceil mMNf \cdot \frac{71}{32} \rceil$. Table 4 shows how the external memory bandwidth reduces after on-chip memory is added to the system. The image data needs to be distributed in the modules so as to facilitate multiple accesses. In fact, the distribution function (equation (3)) of Section 3.2 can be used here too. To reduce the number of memory ports, image memory modules can be grouped into banks and accesses can be pipelined as in [10].
External Memory Organization
Current Frame: Earlier we described how, for the parameters in Section 2, 5 blocks are fetched from the current frame during 10 time units (each time unit = 9 block matches). Since each block consists of $Q^2$ pixels, $5Q^2$ pixels need to be accessed in 10
Data Shuffler
The mapping between the memory modules of on-chip Memory 1 and the leaf processors is different at different time instants. This is because the same pixel occupies different (relative) positions in different candidate blocks. Thus while a pixel is accessed from the same memory module every time, it has to be routed to different leaf processors at different times. In Architecture 1, the routing was achieved by reordering the pipelined accesses. Such a method cannot be applied to Architecture 2 since it operates at a higher speed. Note that the reference block pixels always occupy the same position, and hence, once brought into on-chip Memory 2, do not require any subsequent reordering. In this section we describe a dynamic routing network called the data shuffler, which routes $p'$ pixels in parallel from the memory modules to the right leaf processors. While the routing required in Architecture 2 could be implemented by a general purpose interconnection network such as the Omega network, we choose to implement it using the specialized data shuffler unit. This is because the characteristics of the memory organization can be exploited to design a router that is simpler and more area efficient than the general purpose networks.
In the memory organization described in Section 3.2, if $p' \le Q$, only 1 bank of $e$ memory modules is required, and each row of the block being matched is distributed among the modules. The data shuffler then reorders the $p'$ (out of $e$) elements that are accessed in parallel from the memory modules. If $p' > Q$, there is more than 1 bank ($r > 1$) and each bank contains an entire row of the block being matched. In this case, $p'/Q$ banks are connected to the processor tree at any given time, and the data shuffler reorders the $p'/Q$ (out of $r$) bank level accesses as well. In any cycle, the memory modules and the banks are accessed in an order that is a function of $e$, $r$ and $U$. Moreover, the orders of accesses of different blocks are closely related. Specifically, the order of accesses during the computation of $B_{i,j}$ is a circularly shifted version of the order of accesses during the computation of $B_{i_1,j_1}$, where $i \ne i_1$ and $j \ne j_1$. This property enables us to implement the data shuffler unit as a dynamic routing network.
Without loss of generality, consider the case when $p' \le Q$. If $I_{i,j}$ is the top-left pixel of block $B_{i,j}$, and if $U$ is the subsampling, then the $Q$ pixels of a row that get accessed are from modules $(j + vU) \bmod e$, where $0 \le v \le Q - 1$. Let $m(q_1)$ be the module from which $I_{i,j_1}$ is accessed, where $q_1 = j_1 \bmod e$. Let $SEQ(m(q_1))$ be the sequence of modules that are accessed when the leftmost pixel of the row is in module $m(q_1)$. $SEQ(m(q_1))$ is obtained by varying $v$ from 0 to $Q - 1$ in the expression $(j_1 + vU) \bmod e$. The elements of $SEQ(m(q_1))$ need not be distinct. However, $SEQ(m(q_1))$ contains a sequence of size $e$ whose elements are distinct. We refer to this sequence, $S(m(q_1))$, as the kernel sequence ($S(m(q_1))$ is obtained by varying $v$ from 0 to $e - 1$ in the expression $(j_1 + vU) \bmod e$). For instance, in an example with $p' = 4$, $Q = 8$ and $e = 5$, $SEQ(m(q_1)) = \{0, 3, 1, 4, 2, 0, 3, 1\}$ and $S(m(q_1)) = \{0, 3, 1, 4, 2\}$. It can be shown that for $j_1 \ne j_2$, the kernel sequence $S(m(q_2))$ for $B_{i,j_2}$, where $q_2 = j_2 \bmod e$, is a circularly shifted version of $S(m(q_1))$. If we denote one of the kernel sequences as the master sequence, then all the other kernel sequences can be obtained by circularly shifting the master sequence.
The same idea can be extended to the case when $p' > Q$. In this case, both the bank level accesses and the row level accesses have to be reordered. If pixel $I_{i_1,j}$ is the top-left pixel of block $B_{i_1,j}$, then the $Q$ rows that get accessed are from banks $(i_1 + vU) \bmod r$, where $0 \le v \le Q - 1$. Let $b(q_1)$ be the bank that contains the topmost row (row $i_1$) of the block, where $q_1 = i_1 \bmod r$. Let $SEQ(b(q_1))$ be the sequence of banks that are accessed when the topmost row is in $b(q_1)$. Then $S(b(q_1))$ is a kernel sequence of size $r$ whose elements are distinct. It can be shown that for $i_1 \ne i_2$, the sequence $S(b(q_2))$ for block $B_{i_2,j}$, where $q_2 = i_2 \bmod r$, is a circularly shifted version of $S(b(q_1))$.
The proofs for both cases are included in Appendix 2.
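The access and kernel sequences defined above are easy to enumerate. The sketch below is illustrative only: the function names are hypothetical, and the choice of starting column 0 with a subsampling of 8 is an assumption, picked because it reproduces the sequence quoted in the text (the text itself states only the processor count, Q and e for that example).

```python
def access_sequence(j, U, e, Q):
    """Modules touched by the Q subsampled pixels of a row starting at column j."""
    return [(j + v * U) % e for v in range(Q)]


def kernel_sequence(j, U, e):
    """First e entries of the access sequence."""
    return [(j + v * U) % e for v in range(e)]


def is_rotation(a, b):
    """True if list b is a circular shift of list a."""
    return len(a) == len(b) and any(a[k:] + a[:k] == b for k in range(len(a)))


seq = access_sequence(0, 8, 5, 8)
master = kernel_sequence(0, 8, 5)
all_rotations = all(is_rotation(master, kernel_sequence(j, 8, 5)) for j in range(5))
```

The kernel entries are distinct here because the subsampling factor and the number of modules share no common factor; every other starting column then yields a circular shift of the same sequence.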
For $p' \le Q$, the sequences corresponding to different values of $m(q)$ can be obtained by rotating the master sequence by a maximum of $\lfloor e/2 \rfloor$ positions to the left or right. This procedure is implemented by the data shuffler unit as follows. There are $\lceil e/2 \rceil$ stages of multiplexers which are used to route the data through this network. Each stage consists of $e$ 3:1 multiplexers which shift the data up or down or not at all. Since the amount of shift along with the direction of shift with respect to the master sequence is known a priori, a finite state machine can be designed to control each stage of switches. Fig. 8 shows the 3 types of multiplexers that are used for the implementation of the data shuffler. Switch type A is a 2 to 1 multiplexer, whereas switch type B is a 3 to 1 multiplexer. Both switches have 3 outputs which always hold the same values. Switch type C is the same as switch B except for the fact that it has only one output. A similar procedure for reordering the data is applied for the case when $p' > Q$. Here

From Table 5 it is possible to derive the mapping between the processors and the memory modules. Since $p' = 4$, in the first cycle pixels 0 through 3 are processed, and in the second cycle pixels 4 through 7 are processed. If the leftmost pixel is in $m(1)$, then in the first cycle processor 0 gets its pixel from $m(1)$, processor 1 gets its pixel from $m(3)$, and so on. In the second cycle, processor 0 gets its pixel from $m(4)$, processor 1 gets its pixel from $m(1)$, and so on. Table 6 describes the mapping between the processors and the memory modules.
P(3) P(1) P(2) P(0) P(2) P(0) P(3) P(1)
Table 6: Example where $p' = 4$, $e = 5$, $r = 1$, $Q = 8$ and $U_2 = 2$. The entries indicate which processors get pixels from which memory modules, given the module number of the leftmost pixel in a row. m(#) is the memory module number and P(#) is the processor number.

Fig. 9 shows a 3 stage data shuffler for the above example. The type A switches latch the data corresponding to the master sequences: 0-3-1-4-2 for $U_2 = 2$ and 0-2-4-1-3 for $U_1 = 8$. The control signal $L$ is reset to 0 for blocks subsampled by $U_2$ and is set to 1 for blocks subsampled by $U_1$. The multiplexers of type B and C then shift the data up or down by a maximum of $\lfloor e/2 \rfloor = 2$ positions. The control signals $S_1$, $S_2$, $S_3$ and $S_4$ are set by a finite state machine. VLSI implementation of the data shuffler unit shows that the delay introduced by the multiplexer switches is very small, and hence the entire unit can be considered as one pipeline stage.
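The processor-to-module mapping of this example can be reproduced from the module expression alone. The sketch below is an illustration with hypothetical function names. It takes the kernel sequence for a leftmost pixel in module 0 as the reference, which is an assumption; the master sequences wired into Fig. 9 may be listed in a different order.

```python
def modules_per_cycle(q, U, e, p, Q):
    """For each cycle, the module feeding P(0)..P(p-1); q is the leftmost pixel's module."""
    seq = [(q + v * U) % e for v in range(Q)]
    return [seq[c:c + p] for c in range(0, Q, p)]


def signed_rotation(reference, target):
    """Smallest-magnitude left (+) or right (-) rotation turning reference into target."""
    e = len(reference)
    for s in range(e):
        if reference[s:] + reference[:s] == target:
            return s if s <= e // 2 else s - e
    raise ValueError("target is not a rotation of reference")


# Example parameters: 4 processors, 5 modules, 8 pixels per row, subsampling 2.
cycles = modules_per_cycle(q=1, U=2, e=5, p=4, Q=8)
reference = [(v * 2) % 5 for v in range(5)]
shifts = [
    signed_rotation(reference, [(q + v * 2) % 5 for v in range(5)])
    for q in range(5)
]
```

For a leftmost pixel in module 1, the first cycle reads modules 1, 3, 0, 2 and the second reads modules 4, 1, 3, 0, as stated in the text. Every starting module is reachable from the reference by a rotation of at most 2 positions in either direction, which is why two shift stages suffice for 5 modules.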
Scheduling
Since on-chip Memory 1 contains the search area for only 4 neighbouring blocks, the data hazards caused by inter-level and intra-level dependencies cannot be avoided altogether. Extra clock cycles have to be added to refill the pipeline during the intra-level computations of a group of blocks. Table 7 describes this schedule. The number of pipeline stages in Architecture 2 (see Fig. 7) is $\lceil \log_2 p \rceil + 6$. Thus the time required to refill the pipeline is $\lceil \log_2 p \rceil + 6$ cycles.
The minimum number of processors ($p$) now required for the two level HBMA of Table 1

Table 7: Time schedule at which displacements for the blocks get computed in Architecture 2 when $S_1 = 2S_2$. T is the time required for 9 block matches and is the time required to refill the pipeline.
If the on-chip memory size is increased to accommodate the level 1 search area of the next group of 4 blocks, the above data hazard can be totally avoided. In that case the schedule of Architecture 1 (Table 2) can be used here too.

(The processor tree has $\lceil \log_2 p \rceil + 3$ pipeline stages. The address generator, the data shuffler and Memory 1 each contribute 1 pipeline stage.)

Table 8 shows the number of processors, the number of main memory modules and the extra amount of on-chip memory needed for Architecture 2. We assume a conservative on-chip memory access time of 70 ns and a main memory access time of 200 ns. Note that there is a drastic reduction in the number of processors and memory modules at the cost of a few kbytes of on-chip memory.
An Example
Implementation
A processing unit that can handle block sizes of $16 \times 16$ for the teleconferencing video format for 8-bit parallel data has been implemented using Berkeley CAD Tools. The technology used was 1 μm CMOS. The unit consists of a processor tree, a data shuffler, an address generating circuit, an address shuffler and the necessary glue logic. The address generating circuit employs a table lookup ROM to carry out the division. It generates multiple addresses which are routed to the memory modules using an address shuffler that is similar to the data shuffler. All the subunits are clocked at 70 ns. The processor tree with four leaf processors is of size 6.83 sq. mm. The data shuffler which routes data from 5 memory modules to 4 leaf processors is of size 2.45 sq. mm. The layouts of these subunits have not been compacted. In this implementation, the on-chip memories are required to support simultaneous read and write operations. This can be achieved by using either dual port memories or single port memories with a modified organization. In our design, the two on-chip memories were implemented by single port memories divided into multiple pages to allow simultaneous read and write operations in different pages. On-chip memory 1 was split into 5 modules, where each module consisted of 22
The tree architecture with p leaf processors (see section 3.1) can be used here too.
Memory Organization:
The memory organization for FBMA is similar to the one used by Architecture 2 for HBMA. It consists of two local on-chip memories to house the search area and the reference block, and two external image frame memories to store the current and the previous image frames. The on-chip memories are organized into p modules since p pixels have to be accessed in parallel.
For FBMA, the size of the search area is $(2d + W)^2$. Since consecutive search areas for FBMA overlap, the fresh amount of data to be loaded for the next search area is $(2d + W)W$. Therefore the total size of on-chip memory 1 is $(2d + W)^2 + (2d + W)W = 2(d + W)(2d + W)$.

The data is distributed in the on-chip memory and in the external memory as described in Section 3.2 (see equation (3)). A data shuffler unit similar to the one described for HBMA can be used for FBMA as well. Since FBMA does not have any computational dependencies, the scheduling is very straightforward. Table 9 shows the hardware resources that are required for the FBMA algorithm for the three standard video formats (assuming an on-chip memory access time of $c = 70$ ns and $d = 7$). Note that the number of processors is significantly lower than that of [10] (compare with
Scheduling:
The intra-level computational dependencies in HBMA are also present in 3HSA. For instance, step 3 of 3HSA can proceed only after results from step 2 are available, and step 2 cannot proceed until step 1 is over. Therefore the pipeline needs to be stalled between step 1 and step 2, and between step 2 and step 3 of every block match. Table 10 shows the time schedule for the blocks centered around the pixels shown in Fig. 2. Time interval T is equivalent to the time required for 9 block matches, and is the time required to refill the pipeline.
As in the HBMA, if the on-chip memory size is increased to accommodate search areas for two consecutive blocks, the pipeline stalls can be eliminated by employing pipeline interleaving. The proposed architecture requires fewer processors and drastically fewer external memory modules for 3HSA compared to [10] (see Table 3).
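The step-to-step dependency discussed above can be made concrete with a minimal software sketch of a three-step search. This is not the hardware schedule of the paper. The step sizes (4, 2, 1), which cover a displacement of ±7, the function names, and the displacement sign convention are all assumptions for illustration. Each step evaluates 9 candidate block matches, and the candidates of a step are centered on the best vector of the previous step, which is why one step cannot start before the previous one finishes.

```python
def mad(cur, prev, top, left, dy, dx, w):
    """Mean absolute difference between a w x w block of `cur` at (top, left)
    and the block of `prev` displaced by (dy, dx). Frames are lists of rows."""
    y0, x0 = top + dy, left + dx
    if y0 < 0 or x0 < 0 or y0 + w > len(prev) or x0 + w > len(prev[0]):
        return float("inf")  # candidate falls outside the previous frame
    total = 0
    for r in range(w):
        for c in range(w):
            total += abs(cur[top + r][left + c] - prev[y0 + r][x0 + c])
    return total / (w * w)


def three_step_search(cur, prev, top, left, w=16, steps=(4, 2, 1)):
    """Return the (dy, dx) found by a three-step search (assumed step sizes)."""
    best = (0, 0)
    for s in steps:
        # 9 candidates per step, centered on the best vector so far.
        # This use of `best` is the dependency between consecutive steps.
        candidates = [(best[0] + a * s, best[1] + b * s)
                      for a in (-1, 0, 1) for b in (-1, 0, 1)]
        best = min(candidates,
                   key=lambda v: mad(cur, prev, top, left, v[0], v[1], w))
    return best
```

Because every step keeps the current best vector among its candidates, the matching error never increases from one step to the next, and the final vector always lies within ±7 for the assumed step sizes.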
Conclusion
We have presented two area-efficient, high-throughput special-purpose architectures for hierarchical BMA. Architecture 1 uses no on-chip memory, but has a very high memory bandwidth. Architecture 2, on the other hand, has a small on-chip memory which helps reduce the memory bandwidth. It also requires fewer processors compared to Architecture 1. We have suggested a new memory organization scheme for both the architectures that permits parallel access of multiple pixels that are equidistant. We have also proposed a new dynamic data routing network for Architecture 2 which routes data in parallel from the memory modules to the right processors. Finally, we have shown how an architecture similar to Architecture 2 can be used to implement full search and 3-step hierarchical search BMA, with significant reduction in the hardware complexity.
Assume that for some $m = m_1$, $I_{i,j}$ and $I_{i+U_q m_1,\,j}$ lie in the same bank. Therefore, $(i + U_q m_1) \bmod r = i \bmod r \Rightarrow (U_q m_1) \bmod r = 0$. This implies that $m_1 = \frac{r}{U_q}\, l$, where $l$ is an integer. But $r$ is not a multiple or factor of $U_q$, and there exists no $l$ other than 0 that satisfies the above equation.
We next prove that within a bank, the $Q$ pixels along a row (of a block) can be accessed in parallel. This proof is the same as that of Case I when $p' = Q$. □
Appendix 2: Data Shuffler
In this Appendix, we show that for any $U$, $r$ and $e$, the sequences of modules and banks that are accessed during different computations are related. Let $I_{i,j}$ be the top-left pixel of block $B_{i,j}$. Then the sequence of modules and banks accessed by block $B_{i_1,j_1}$ is a rotated version of the sequence of modules and banks accessed by block $B_{i_2,j_2}$, where $i_1 \neq i_2$ and $j_1 \neq j_2$. We consider two cases: (i) $p' \leq Q$ and (ii) $p' > Q$.
Case I ($p' \leq Q$)
In this case, the $Q$ pixels of a row are accessed from modules $(j + vU) \bmod e$, $0 \leq v \leq Q - 1$. Let $q = j \bmod e$. Then $SEQ(m(q))$ is a sequence of size $Q$ corresponding to the modules that are accessed when the leftmost pixel of the row is in module $m(q)$. Let $S(m(q))$ be the kernel sequence of $SEQ(m(q))$. We need to prove that for any $j = j_1$ and $j = j_2$, the sequence $S(m(q_1))$ can be obtained by rotating the sequence $S(m(q_2))$, where $q_1 = j_1 \bmod e$ and $q_2 = j_2 \bmod e$. Since both the sequences consist of $e$ distinct elements, without loss of generality we assume that the element corresponding to $v = v_1$ in $S(m(q_1))$ matches the element corresponding to $v = v_2$ in $S(m(q_2))$. Thus it is sufficient to prove that the element corresponding to $v = v_1 + a$ of $S(m(q_1))$ is the same as the element corresponding to $v = v_2 + a$ of $S(m(q_2))$ for $0 \leq a \leq e - 1$.
Figure 2: A part of the image frame when $S_1 = 2S_2$. The level 1 displacement is estimated using MAD computations only for the marked blocks. The level 2 displacement is estimated for all the blocks.
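The rotation property claimed for Case I can be checked numerically with a minimal sketch. The module index $(j + vU) \bmod e$ is taken from the text. The values U = 2 and e = 5, the function names, and the reading of the kernel sequence as one full period of $e$ accesses are assumptions for illustration. The check also assumes that U and e are coprime, so that the kernel sequence has $e$ distinct elements as the proof requires.

```python
from math import gcd


def kernel_sequence(j, U, e):
    """Modules (j + v*U) mod e for v = 0 .. e-1.

    Treating the kernel sequence as one full period of e accesses is an
    assumption made for this sketch.
    """
    return [(j + v * U) % e for v in range(e)]


def is_rotation(a, b):
    """True if list b equals list a rotated by some offset k."""
    if len(a) != len(b):
        return False
    return any(a[k:] + a[:k] == b for k in range(len(a)))


U, e = 2, 5            # assumed example values with gcd(U, e) == 1
assert gcd(U, e) == 1  # needed for e distinct modules per kernel sequence

s0 = kernel_sequence(0, U, e)  # [0, 2, 4, 1, 3]
s3 = kernel_sequence(3, U, e)  # [3, 0, 2, 4, 1]
print(s0, s3, is_rotation(s0, s3))
```

Here the sequence starting in module 3 is the sequence starting in module 0 rotated by four positions, which is the relationship a shuffler based on rotations would rely on.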
