This paper presents a new VLSI architecture for the full-search block matching algorithm. The proposed architecture has two specific features: (1) it has a processor element (PE) array which provides sufficient computational power, where the PEs work in a semi-systolic style, and (2) it contains stream memory banks which provide a scheduled data flow to reduce idle operations within the PE array. By exploiting broadcasting and local data communications, the hardware efficiency of the proposed architecture can be up to 100%, which outperforms the systolic-array solutions found in the literature.
INTRODUCTION
Many architectural solutions for implementing the block matching algorithm (BMA) can be found in the literature [1, 2, 3, 4, 5, 6]. Most of these solutions focus on the data flow within the processor element (PE) array, so the systolic array approach has been heavily exploited in VLSI implementations. However, this approach causes problems in the data flow outside the PE array: too much memory bandwidth is required to provide a scheduled data sequence that meets the needs of the PE array. A large number of I/O pins is therefore needed, resulting in higher packaging cost. In addition, due to pipeline filling at the boundary of the search area, the hardware efficiency, which can be expressed by $E_h = \left(\frac{2P+1}{2P+N}\right)^2$, is degraded considerably. For example, if P=N=8, only about 50% of the processors are working on candidate blocks. Although the authors of [5, 6] proposed a snake-like data stream format which reduces the I/O bandwidth problem, the hardware efficiency still remains very low.
In this paper, we propose a semi-systolic array to overcome the low efficiency found in systolic array solutions. Instead of local connections for the search data flow, we use a global distribution of search data connected to each PE row (or column). The partial sums are locally connected. With this style, it has been proved that hardware efficiency of up to 100% can be achieved if a dedicated memory management unit is provided. Section 2 describes how the general BMA can be mapped onto the proposed semi-systolic array (SSA) architecture. Section 3 presents the memory management strategy that offers the scheduled data sequence so that 100% efficiency can be achieved in the PE array. Finally, a demonstrator design of a motion estimation processor for N=P=16 is described.
*Work supported by the National Science Council of Taiwan, ROC, under Grant NSC84-2213-E009-115.
MAPPING BMA ONTO SEMI-SYSTOLIC ARRAY ARCHITECTURE
The basic structure of the SSA is shown in Figure 1. In this structure, the connections are divided into two types: a broadcasting (global distribution) type and a local type. For the broadcasting type, input data is fed in from the stream memory and connected to all PEs of the same column (row). For the local type, results obtained from the upper (left) PEs are pumped into the next lower (right) PEs for further processing.
To illustrate how full-search motion estimation can be mapped onto the SSA architecture, we use an example of a 3x3 reference block with a 7x7 search area.
First we assume that the reference data have been stored in each PE; the search data are then pumped out from the stream memory and broadcast to the PEs, which perform absolute difference calculation and partial sum accumulation. With a latency of 6 cycles, the first distortion comes out from the bottom right cell (ACC). The distortion values of the remaining candidates are then obtained sequentially. However, when a boundary is detected, all PEs become idle since data of the next row (column) have to be fed into the pipeline.
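The broadcast-plus-local-partial-sum idea can be sketched in software. The following is a minimal one-dimensional behavioural model, not the authors' design: it assumes a single row of PEs, a sum of absolute differences as the distortion, and hypothetical function names (`ssa_row`, `direct_sad`). Each cycle one search sample is broadcast to every PE, and each PE adds its absolute difference to the partial sum registered by its left neighbour in the previous cycle.

```python
def ssa_row(reference, stream):
    """Cycle-by-cycle model of one row of PEs (1-D simplification).

    reference[j] is the value held in PE j. At every cycle one sample of
    `stream` is broadcast to all PEs; PE j adds |sample - reference[j]|
    to the partial sum that PE j-1 registered in the previous cycle.
    """
    n = len(reference)
    regs = [0] * n                      # partial-sum register of each PE
    outputs = []
    for t, sample in enumerate(stream):
        incoming = [0] + regs[:-1]      # values latched by the left neighbours
        regs = [incoming[j] + abs(sample - reference[j]) for j in range(n)]
        if t >= n - 1:                  # pipeline is full after n-1 cycles
            outputs.append(regs[-1])    # one distortion value per cycle
    return outputs


def direct_sad(reference, stream):
    """Reference result: distortion of every 1-D candidate position."""
    n = len(reference)
    return [
        sum(abs(stream[k + j] - reference[j]) for j in range(n))
        for k in range(len(stream) - n + 1)
    ]


reference = [3, 1, 4]
stream = [2, 7, 1, 8, 2, 8, 1]
print(ssa_row(reference, stream))
print(direct_sad(reference, stream))
```

In this model the last PE delivers one candidate distortion per cycle once the pipeline is filled, which is the behaviour the array relies on between boundaries.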
This low efficiency can be overcome by preloading the data of the next row before the boundary is detected. As shown in Figure 2(a), when the distortion calculation is done on the boundary, data of the next row should be pumped into the PE array at the next cycle. The masked region indicates the data that should be simultaneously pumped into the PE array.
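As a software reference for the 3x3 block and 7x7 search area used in this section, the sketch below performs a plain full search. It is an illustration only: the sum of absolute differences is assumed as the distortion measure, the test data are synthetic, and the names `sad` and `full_search` are hypothetical.

```python
def sad(block, area, dy, dx):
    """Sum of absolute differences between block and the candidate at (dy, dx)."""
    n = len(block)
    return sum(
        abs(block[i][j] - area[dy + i][dx + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(block, area):
    """Evaluate every candidate position and return (best_offset, best_distortion)."""
    n, m = len(block), len(area)
    best = None
    for dy in range(m - n + 1):          # candidate rows
        for dx in range(m - n + 1):      # candidate columns
            d = sad(block, area, dy, dx)
            if best is None or d < best[1]:
                best = ((dy, dx), d)
    return best


# Synthetic 7x7 search area; the 3x3 reference block is cut out at offset (2, 3).
area = [[(7 * r + 3 * c * c) % 31 for c in range(7)] for r in range(7)]
block = [row[3:6] for row in area[2:5]]
print(full_search(block, area))
```

For a 3x3 block in a 7x7 area there are 5x5 = 25 candidate positions, each of which a hardware array would have to evaluate without idle cycles to reach full efficiency.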
MEMORY MANAGEMENT
When candidate blocks are not within the boundary area, only a single data stream is needed for all PEs on the same row. However, when the boundary condition is detected, two data streams are needed, which implies that a two-read-port memory is required. In addition, the data items fetched from the current stream memory have to be loaded into the next stream memory, so a two-write-port memory is also needed. As a result of these read/write considerations, it is necessary to provide a 2-port memory of size (N-1)×(2P+1) for non-boundary data and a 4-port memory of size (N-1)×(N-1) for boundary data. However, since each storage location is only activated once in a given time interval, the 4-port memory devices can be reduced to 2-port memory, with the constraint that the (N-1)×(N-1) boundary data use read/write ports different from those used for non-boundary data.
We still have to consider the problem of data initialization, since it may cause idle operations within the PE array and hence prevent 100% efficiency from being achieved. In the previous discussion, we first assumed that the reference data (N×N) and part of the search data, (N-1)×(2P+N), are preloaded into the PE array. Moreover, to ensure that reference data are already available when motion estimation is activated for another reference block, we need one N×N shift register array (SRA). However, an SRA is also needed for matching the partial sum sequence to get the final distortion value. Since one SRA cannot be shared for delay management and storing reference data, we use two SRAs which are interleaved as shown in Figure 4.
Fig. 2. (a) Improving hardware efficiency using multiple inputs so that search data can be filled into the PE array; (b) an example of how hardware efficiency can be improved by parallel ports when a boundary is detected.
Fig. 3. Organization of the stream memory banks. Note that these two identical banks work in an interleaved fashion to reach 100% efficiency.
Based on this organization, the total memory space needed is 2(N-1)×(N+2P) + 2(N×N), where the former term is for search data and the latter is for both delay management and reference data.
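The storage figures can be checked with a few lines of arithmetic. The sketch below reads the stream-memory sizes as (N-1)×(2P+1) for non-boundary data and (N-1)×(N-1) for boundary data, and the total as 2(N-1)×(N+2P) + 2(N×N); the function names are hypothetical and sizes are counted in data items rather than bits.

```python
def stream_bank_items(n, p):
    """Data items in one stream memory bank: non-boundary plus boundary part."""
    non_boundary = (n - 1) * (2 * p + 1)
    boundary = (n - 1) * (n - 1)
    return non_boundary + boundary


def total_memory_items(n, p):
    """Two interleaved stream banks plus two N x N shift register arrays."""
    search = 2 * (n - 1) * (n + 2 * p)
    sra = 2 * n * n
    return search + sra


for n, p in ((8, 8), (16, 16)):
    # The two parts of one bank add up to (N-1) x (N+2P).
    assert stream_bank_items(n, p) == (n - 1) * (n + 2 * p)
    print(n, p, stream_bank_items(n, p), total_memory_items(n, p))
```

For N=P=16 this gives 720 items per stream bank and 1952 items in total, of which 512 belong to the two shift register arrays.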
THE DEMONSTRATOR DESIGN
The floorplan of the ME processor is shown in Figure 5. Based on the proposed architecture, the area of a motion estimation processor with N=P=16 is about 9.5×7.2 mm² in a 0.8 µm double-metal CMOS technology [7]. The critical path has been limited to 10 ns to meet the requirements of MPEG2 Main Profile at Main Level [8]. The chip layout has been assembled and is currently under fabrication. Figure 6 shows the final layout of the ME processor. Table 1 shows the performance comparison with other available architectures. A reference block size of 8x8 is selected as the platform for comparison, where the search area is 23x23 and the PE count is 8x8. It can be seen that our proposed architecture produces a motion vector every 256 cycles, which is the minimum among the four architectures. Although the memory size used in our proposal is about 4 times that of [5], it is still less than that of the other two [9, 10]. Besides, we have found that the PE array occupies more than 70% of the area in the physical design. This implies that the memory size is not a key issue in the VLSI implementation of our proposed architecture.
In addition, the SSA architecture also has the following features:
• Simple control flow: the required control signals for each module are rather simple and can easily be derived. For example, the load signal used in the PE array is only needed when the start of a new motion vector is requested.
• Selection of different displacements (P): this can be done by adjusting the read/write pointers of the stream memory banks, where the cycle count for calculating each motion vector is (2P+1)^2. In our demonstrator design, 3 different displacement codes (4, 8, 16) are allowed.
• Selection of different sizes of reference block: this can be done by adjusting the position of the input port of the stream memory banks.
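To give a feeling for the displacement options, the sketch below evaluates the cycle count per motion vector, read here as (2P+1)^2, for the three displacement codes. The conversion to time is an assumption: it takes one clock cycle to equal the 10 ns critical path, and the names `cycles_per_vector` and `CLOCK_NS` are hypothetical.

```python
CLOCK_NS = 10  # assumed clock period, taken from the 10 ns critical path


def cycles_per_vector(p):
    """Cycle count for one motion vector with maximum displacement p."""
    return (2 * p + 1) ** 2


for p in (4, 8, 16):
    cycles = cycles_per_vector(p)
    micros = cycles * CLOCK_NS / 1000   # time per motion vector in microseconds
    print(f"P={p}: {cycles} cycles, {micros} us per motion vector")
```

Under these assumptions the three settings need 81, 289 and 1089 cycles per motion vector respectively.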
CONCLUSION
In this paper, we have presented a novel VLSI architecture for optimally implementing the full-search motion estimation algorithm. The proposed architecture mainly consists of (1) a PE array, which is a semi-systolic array structure offering computation power, and (2) a memory management unit, which offers a scheduled data flow so that 100% hardware efficiency within the PE array can be achieved. In addition, the proposed architecture is flexible in selecting the sizes of the reference and search blocks. The architecture has also been demonstrated by a full-search motion estimation processor for N=P=16.
We are currently investigating the possibility of applying this novel architecture to other motion estimation algorithms, such as 3-step search and telescopic search methods.
