Abstract
1: Introduction
Block matching algorithms (BMA) are widely used in image and video applications such as pattern analysis, motion detection, and data compression. The inherent computational complexity of these algorithms often demands special hardware to meet real-time performance. Owing to their regularity and modularity, such algorithms are well suited to VLSI implementation. For practical system design, however, not only must the computational requirements be met, but memory bandwidth must also be minimized to reduce the I/O pin-count and hence the realization cost. In other words, the desired hardware has to provide sufficient computational power to meet the algorithm complexity, and the large volumes of image/video data have to be managed carefully to enhance on-chip data reusability. From these viewpoints, a good VLSI architecture for BMA is often judged by the following issues: (1) internal storage space, which determines on-chip memory cost and bandwidth, (2) I/O pin-count, which determines I/O bandwidth and packaging cost, and (3) processor element (PE) efficiency, which determines PE structure and its utilization ratio for a given period.
Motion estimation (ME) based on full-search BMA (FSBMA), one of the key techniques for removing redundancy in video coding, is a good example of such design complexity. Many VLSI architectures for FSBMA have been proposed in the past [1, 2, 3, 4, 5, 6, 7, 8, 10], where research efforts have moved from meeting computational requirements to reducing memory bandwidth as well as providing the multiple functions specified in the standard [14]. Yeo and Hu [9] proved that when the search range (P) is set to half of the macroblock size (N), a significant reduction in I/O bandwidth can be achieved without sacrificing performance. In the case of N=P, however, four cascaded chips are needed, which increases the implementation cost for a large search area. For a large search area, data reuse becomes an important issue, and some internal buffering is needed to reduce the I/O bandwidth. Our goal is to find the optimized buffer size for an ME architecture with a sufficiently large search area under a minimum I/O bandwidth constraint in a single chip.
Analyzing the dependence graphs (DGs), we can extract the dependency arcs of all variables. Table 1 shows the dependency arcs that are necessary to evaluate the buffer size for a given mapping strategy. R is the reference block data, Dtotal is the distortion value of a candidate block, MV is the motion vector of one macroblock, and S is the search area data. The data dependencies in the search area are more complicated than those of the other variables. Observing the DG of type I in Figure 4, there are three types of relation among the S-variables, which will be discussed in detail in the next section. From this table we know that, for different mapping directions, we can freely choose among the dependency arcs as long as they do not violate the mapping constraints. With this table, we can evaluate the optimized memory size for different mapping directions under the I/O bandwidth constraint. Here we exclude type III and type IV from our discussion because they cannot be used to generate cost-effective architecture mappings.
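The roles of R, S, Dtotal and MV can be illustrated with a minimal software sketch of full search. This is only an illustration under stated assumptions: the sum of absolute differences is assumed as the distortion measure (the text speaks only of a distortion value), the function names `sad` and `full_search` are hypothetical, and the returned offset is measured from the top-left corner of the search area rather than being a signed motion vector.

```python
def sad(ref_block, search_area, dy, dx):
    """Distortion of one candidate block (assumed: sum of absolute differences).

    ref_block   -- N x N reference block R (list of lists)
    search_area -- search area data S (list of lists)
    (dy, dx)    -- top-left corner of the candidate block inside S
    """
    n = len(ref_block)
    return sum(
        abs(ref_block[i][j] - search_area[dy + i][dx + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(ref_block, search_area):
    """Visit every candidate position and keep the one with minimum distortion."""
    n = len(ref_block)
    positions = len(search_area) - n + 1  # candidates per dimension
    best_offset, best_dist = None, None
    for dy in range(positions):
        for dx in range(positions):
            d = sad(ref_block, search_area, dy, dx)
            if best_dist is None or d < best_dist:
                best_offset, best_dist = (dy, dx), d
    return best_offset, best_dist


# Toy case with N = 2 and P = 1: the search area has N + 2P - 1 = 3 rows,
# which gives 2P = 2 candidate positions per dimension.
R = [[1, 2],
     [3, 4]]
S = [[9, 9, 9],
     [9, 1, 2],
     [9, 3, 4]]
print(full_search(R, S))
```

Every candidate reads N*N pixels of S, and neighbouring candidates overlap heavily, which is why the reuse of search area data dominates the buffer discussion.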
where D(e) is the delay on edge e in the PE array, S is the schedule vector, and P is the processor basis. From the delay of each dependency edge e, we can derive the total memory size for a given mapping scheme.
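How a dependency edge turns into a delay can be sketched as follows. The sketch assumes the usual linear space-time mapping, in which the delay of an edge is the inner product of the schedule vector with the edge and its displacement in the PE array is obtained from the processor basis. The function `map_edge` and the example vectors are hypothetical and are not taken from Table 1 or Table 2.

```python
def dot(u, v):
    """Inner product of two equally long integer vectors."""
    return sum(a * b for a, b in zip(u, v))


def map_edge(edge, schedule, processor_basis):
    """Map one dependency edge of the DG onto the PE array.

    Assumed linear mapping:
      delay        = schedule . edge
      displacement = (row . edge) for each row of the processor basis
    """
    delay = dot(schedule, edge)
    displacement = tuple(dot(row, edge) for row in processor_basis)
    return delay, displacement


# Hypothetical 3-D index space projected onto a 2-D PE array.
schedule = (1, 1, 1)
processor_basis = ((1, 0, 0), (0, 1, 0))

for edge in [(0, 0, 1), (1, 0, 0), (0, -1, 0)]:
    delay, displacement = map_edge(edge, schedule, processor_basis)
    status = "ok" if delay >= 0 else "negative delay"
    print(edge, "-> delay", delay, "displacement", displacement, status)
```

Under this assumption, an edge with zero displacement and positive delay corresponds to registers inside a PE, while an edge with a negative delay cannot be scheduled directly.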
3: Buffer size estimation
Memory elements can be classified into registers inside each PE and memory banks outside the PE array.
The (2P)^2 term indicates the effective latency of one search area. The particular negative delay value will be described later.
From Table 2 we know that, for the above four conditions, two of the arc delays D(e) are equal to 1. These arcs cause one or zero delay in each PE, depending on the projection direction. For example, in type I(a), one of them results in no delay in each PE, while the other causes one delay in each PE. These delays are inevitable, so the buffer minimization problem rests on exploiting the delays of the two remaining search-area arcs. In the following sub-sections we describe this problem for each of the four conditions, with emphasis on the condition N=P.
3.1: Type I(a)
First, the relevant arc delay is read from Table 2. Then we apply a procedure called "merging" to greatly reduce the register count.
Before presenting this procedure, we first define the "boundary" and "non-boundary" regions in one search area, as shown in Fig. 5(a). The sizes of these two regions are (N-1)*(N+2P-1) and (2P)*(N+2P-1) respectively. From the DG we know that "non-boundary" and "boundary" data must sometimes be passed to the same row, like (S21, S15) or (S31, S25). If we assume N=P, two inequalities are obtained: (2P-N) > (N-1) and (2P-N) < (2P).
The first, (2P-N) > (N-1), means that (N-1) registers are sufficient to buffer the "boundary" data. The second, (2P-N) < (2P), means that at least 2P-N registers are needed to buffer the "non-boundary" data; this value equals the corresponding arc delay D(e), which is the lower bound on the size needed to buffer "non-boundary" data. After applying the "merging" procedure, the buffer size is reduced from N*(2P-N) to (N-1)+(2P-N) = (2P-1), as shown in Figure 5(b). Next, another key point in reducing the I/O bandwidth is to consider dependency arc S3, which represents the common data of two adjacent search areas, as depicted in Figure 3. According to the above discussion, a formula for the buffer size estimation of DG type I(a) is given as follows:
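The region sizes and the effect of "merging" in this subsection can be checked numerically with a short sketch. It only evaluates the expressions quoted in the text for type I(a); the function name `type_1a_merging` and the dictionary keys are hypothetical, and the sketch does not compute the total buffer size.

```python
def type_1a_merging(n, p):
    """Evaluate the type I(a) quantities quoted in the text.

    n -- macroblock size N
    p -- search range P
    """
    side = n + 2 * p - 1                  # side length of one search area
    boundary_region = (n - 1) * side      # "boundary" region size
    non_boundary_region = (2 * p) * side  # "non-boundary" region size
    # The two regions together cover the whole search area.
    assert boundary_region + non_boundary_region == side * side

    boundary_regs = n - 1          # registers sufficient for "boundary" data
    non_boundary_regs = 2 * p - n  # lower bound for "non-boundary" data
    # The two inequalities stated for the case N = P.
    assert non_boundary_regs > boundary_regs
    assert non_boundary_regs < 2 * p

    return {
        "before_merging": n * (2 * p - n),
        "after_merging": boundary_regs + non_boundary_regs,  # equals 2P-1
    }


result = type_1a_merging(16, 16)  # N = P = 16
print(result)
```

For N = P = 16 the count drops from 256 to 31, which matches 2P-1.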
3.2: Type I(b)
This case yields D(e) = N-2P < 0 for the corresponding arc, which implies that this arc violates the constraint S^T e ≥ 0. However, it is a special case for N=P, as shown in Fig. 7, where two adjacent search areas are appended together.
The (2P)^2 PEs can be arranged in a snake-like form, as shown in Fig. 8. At time t, three pixels (S35, S43 and S51) are sent to row n concurrently. After the "merging" procedure, the buffer size is reduced from 2P*(2P-N) to (4P-N-1) per row. The total buffer is given as:
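To give a feeling for the per-row saving in type I(b), the two expressions quoted above can be tabulated for a few block sizes. This is a sketch that only evaluates 2P*(2P-N) and (4P-N-1) under the condition N=P; the function name `type_1b_row_buffer` is hypothetical.

```python
def type_1b_row_buffer(n, p):
    """Per-row buffer size for DG type I(b), before and after "merging"."""
    before = 2 * p * (2 * p - n)  # 2P*(2P-N), as quoted in the text
    after = 4 * p - n - 1         # (4P-N-1), as quoted in the text
    return before, after


for size in (4, 8, 16):
    before, after = type_1b_row_buffer(size, size)  # condition N = P
    print(f"N=P={size}: {before} -> {after} registers per row")
```

The unmerged size grows quadratically with P while the merged size grows linearly, so the benefit increases with the search range.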
3.3: Type II(a)
In this case, the buffer size for the arc handled by the "merging" procedure is the same as for DG type I(a). From Table 2 we know that the entries for the common-data arc are (2P)^2-N and -(N-1). Applying the concept described in section 3.1, all of the common data between two adjacent search areas should be buffered, so the buffer size for this arc is (2P-1)*(N+2P-1). The total buffer for DG type II(a) is:
3.4: Type II(b)
In this case, the buffer size for the arc handled by the "merging" procedure is the same as for DG type I(b), described in section 3.2. Consider the example shown in Figure 9. From time t+1 to t+N+1, the common data between two adjacent search areas (i.e. S31, S41 and S51, of size 2P-1) should be stored in the buffer every N cycles. With a delay of N^2-N, the buffer size for the common-data arc is ((N^2-N)/N)*(2P-1) = (N-1)*(2P-1). So, the total buffer size is given as:
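The common-data buffers of the two type II mappings can be compared with a small sketch. It evaluates only the two expressions stated in sections 3.3 and 3.4, (2P-1)*(N+2P-1) and ((N^2-N)/N)*(2P-1), and not the total buffer sizes; the function names are hypothetical.

```python
def common_data_buffer_2a(n, p):
    """Type II(a): all common data of two adjacent search areas is buffered."""
    return (2 * p - 1) * (n + 2 * p - 1)


def common_data_buffer_2b(n, p):
    """Type II(b): 2P-1 common pixels are stored every N cycles."""
    delay = n * n - n           # delay N^2-N quoted in the text
    groups = delay // n         # one group of common data per N cycles
    assert groups == n - 1      # (N^2-N)/N simplifies to N-1
    return groups * (2 * p - 1)


n = p = 16
print("type II(a):", common_data_buffer_2a(n, p))
print("type II(b):", common_data_buffer_2b(n, p))
```

For N = P = 16 this gives 1457 entries for type II(a) and 465 for type II(b).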
4: Comparison and mapped ME architectures
The comparison of the proposed buffer size estimation with other existing papers is presented in Table 3 . In the following we present two ME architectures for type I(a) and type I(b) respectively.
According to the discussion in section 3.1, besides the registers in each PE, two extra buffers are used to hold the data for two of the search-area dependency arcs. These two buffers are illustrated in Figure 10 as buffer I and buffer II respectively. The buffers can be implemented as pointer address memory, described in our previous work [12]. Two buses pass through each row of the PE array, supplying "boundary" and "non-boundary" data concurrently. The penalty paid for the reduced I/O bandwidth is complex routing and a data broadcasting problem when changing to the next search area. In Figure 11, the size of the PE array is (2P)^2. Three buses pass through each row of the PE array, because in the worst case three pixels must be consumed at the same time.
5: Conclusion
In this paper we have presented how to find the optimized buffer size derived from the DG under a minimal I/O bandwidth constraint. An optimized buffer size greatly helps to reduce the I/O bandwidth with little penalty in implementation cost for ASIC design, especially when the search range is large. We have derived four basic types of DG and discussed the buffer size estimation corresponding to various mapping strategies. The buffer can easily be implemented by pointer address memory (PAM). Finally, two ME architectures with 100% PE efficiency have been demonstrated to show the proposed mapping schemes.
