This paper presents two VLSI architectures for the full-search block matching motion estimation (ME) algorithm based on an overlapped search data flow. The proposed VLSI architectures have three specific features: (1) they contain a processor element (PE) array which provides sufficient computational power and achieves 100% hardware efficiency; (2) they contain stream memory banks which provide the scheduled data flow requested by the PEs for computing the mean absolute distortion (MAD); and (3) they both have minimum memory bandwidth to save I/O pin count.
INTRODUCTION
As digital video services become more widely accepted, it is necessary to develop low-cost hardware solutions. Motion estimation (ME) is one of the major techniques defined in standards for encoding moving pictures, and many research works on ME hardware can therefore be found in the literature [1, 2, 3, 4]. These results show that the trend in ME hardware is moving from meeting computational requirements to reducing I/O bandwidth and to providing scalability, which enhances flexibility in system-level designs. In this paper, we present two scalable VLSI architectures for an ME processor based on full-search block matching algorithms. Both architectures are derived from the same search data flow; however, they differ in how search data is passed within the PE array. The mapping process and the specific features of these two architectures are discussed in the following sections.
SCALABLE VLSI ARCHITECTURES
To provide a scalable solution, we propose the system block diagram shown in Figure 1(a). Here the modules can be directly connected to handle different sizes of reference block (N×N) and search range [-P, P-1]; that is, both N and P are allowed to change according to the application. Each module basically contains 3 units, namely a stream memory bank, a processor element array, and a post-processing unit, as shown in Figure 1(b). Before discussing the functionality and specific features of each unit, we first briefly describe the data flow used to implement the full-search block matching algorithm.
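As a software reference point for the hardware discussion, full-search block matching can be sketched as below. This is a minimal illustration, not the paper's implementation: the function names, the list-of-lists image layout, and the convention that displacement (-P, -P) maps to index (0, 0) of the search area are all assumptions. The sum of absolute differences is minimized instead of the MAD, which selects the same candidate because the two differ only by the constant factor 1/N².

```python
def block_distortion(ref, search, row, col, n):
    """Sum of absolute differences between the n x n reference block
    and the candidate block whose top-left corner is (row, col)."""
    return sum(
        abs(ref[i][j] - search[row + i][col + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(ref, search, n, p):
    """Test every displacement in [-p, p-1] x [-p, p-1].

    `search` is assumed to be a (2p + n - 1) x (2p + n - 1) area stored
    so that displacement (-p, -p) sits at index (0, 0).
    Returns (distortion, (vy, vx)) of the first minimum found.
    """
    best = None
    for vy in range(-p, p):
        for vx in range(-p, p):
            d = block_distortion(ref, search, vy + p, vx + p, n)
            if best is None or d < best[0]:
                best = (d, (vy, vx))
    return best


# Toy data: N = 2, P = 2, so the search area is 5 x 5.
search = [[10 * r + c for c in range(5)] for r in range(5)]
ref = [[31, 32], [41, 42]]          # block located at displacement (1, -1)
print(full_search(ref, search, 2, 2))
```

The double loop visits (2P)² candidates, which is the number of cycles per motion vector quoted later in the text when one candidate is completed per cycle.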
Overlapped Search Data Flow
Very often PE efficiency is degraded when a boundary candidate is encountered. This is due to the fact that search data for the boundary candidate are not always needed in the PEs. If we merge search data on neighbouring rows and send them simultaneously to the PE array, we can ensure that all PEs work on correct search data, as shown in Figure 2.
Under this data flow strategy, each PE only has to select the correct search data from two input buses. The remaining task is to develop an ME architecture that meets the requirements identified above.
Stream Memory Bank
To provide the required data flow for the PE array, we have recently developed a dedicated stream memory [5] for this purpose. We first divide the search area into two regions, namely boundary (B-region) and non-boundary (NB-region). Both are buffered in separate stream memory banks, each directly connected to its own input bus. The NB input bus is always occupied by the NB stream memory. To reduce idle time during data initialization, we use a double-buffering method for the next motion vector to preload the new reference block and the new search area, which should be buffered in advance before the PE array is activated. As a result, we need 3 input buses connected to the stream memory banks.
The required storage space is 2 × N × (2P + N - 1).
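To make the storage expression concrete, the short sketch below simply evaluates 2 × N × (2P + N - 1) for given N and P. The function name is hypothetical, and the result is assumed to be counted in search-data samples, since the text does not state the unit.

```python
def stream_memory_size(n, p):
    """Evaluate the storage expression 2 * N * (2P + N - 1) from the text.

    The unit is assumed to be one search-data sample per location.
    """
    return 2 * n * (2 * p + n - 1)


# For the N = P = 16 configuration: 2 * 16 * 47 = 1504
print(stream_memory_size(16, 16))
```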
Processor Element Array
To meet the computational requirement of the full-search block matching algorithm, we propose a 2-D array structure for the ME processor. The basic function of a PE is to perform absolute distortion calculation and accumulation. Since the N×N 2-D distortion summation can be decomposed into N 1-D distortion summations, the PEs of each 1-D array can be designed independently.
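The decomposition can be checked with a few lines of code. The sketch below is illustrative only; the function names are hypothetical, and treating each row as one 1-D slice is an assumption (columns would work equally well).

```python
def distortion_1d(ref_line, cand_line):
    """Absolute distortion accumulated along one 1-D line of pixels."""
    return sum(abs(r - c) for r, c in zip(ref_line, cand_line))


def distortion_2d(ref, cand):
    """Direct N x N summation of absolute differences."""
    return sum(
        abs(ref[i][j] - cand[i][j])
        for i in range(len(ref))
        for j in range(len(ref[0]))
    )


def distortion_from_lines(ref, cand):
    """Same total, obtained as the sum of N independent 1-D distortions."""
    return sum(distortion_1d(r, c) for r, c in zip(ref, cand))


ref = [[1, 2], [3, 4]]
cand = [[2, 2], [0, 7]]
print(distortion_2d(ref, cand), distortion_from_lines(ref, cand))
```

Because the 1-D sums do not depend on each other, each one can be assigned to its own 1-D PE array, with the final addition deferred to a later stage.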
However, the search data flow determines the connection style among the PEs, as discussed below:
Orthogonal Style:
In this case, the search data and the partial sums flow orthogonally to each other within the PE array. The corresponding PE array structure is shown in Figure 3(a), which is also known as a systolic-array structure. Each PE receives two inputs, i.e. B and NB, performs the distortion calculation, and then passes the search data to its neighbouring PE horizontally. The partial distortion is accumulated and passed to the neighbouring PE vertically. Selection between B and NB is controlled by sel, which also runs vertically. Data initialization is also done through both the B and NB buses, where the B-bus also carries reference data since it has sufficient time slots.
Parallel Style:
In this case, the search data and the partial sums run in parallel within the PE array. The corresponding PE array structure is shown in Figure 3(b), which is also known as a semi-systolic array structure [6]. The functionality of each PE remains the same; however, search data are delivered to each 1-D PE array through 2 global buses, one for B data and one for NB data. Selection between B and NB is also controlled by sel, which in this case is vertically connected to PEs of different dimensions. Only reference data has to be initialized in this case because search data are buffered in the stream memory bank.
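In both styles the per-PE operation is the same: select B or NB, take the absolute difference against the stored reference pixel, and add it to the incoming partial sum. A minimal behavioural sketch follows. It ignores all timing and pipelining, the function names are hypothetical, and the polarity of sel (a true value selecting the B bus) is an assumption.

```python
def pe_step(ref_pixel, b_data, nb_data, sel, partial_in):
    """One PE operation: choose the search pixel, accumulate |ref - search|.

    Assumption: sel == 1 selects the B bus, sel == 0 selects the NB bus.
    """
    search_pixel = b_data if sel else nb_data
    return partial_in + abs(ref_pixel - search_pixel)


def pe_line(ref_line, b_line, nb_line, sel_line):
    """Chain PEs so that the partial sum ripples through a 1-D array."""
    partial = 0
    for ref_pixel, b, nb, sel in zip(ref_line, b_line, nb_line, sel_line):
        partial = pe_step(ref_pixel, b, nb, sel, partial)
    return partial


# Three PEs: the first two take NB data, the last one takes B data.
print(pe_line([5, 6, 7], [9, 9, 7], [5, 4, 0], [0, 0, 1]))
```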
Post-Processing Unit
The N 1-D distortions from the PE array have to be summed and compared to find the minimum among all candidates and hence to obtain the motion vector. In addition, to allow scalability, the unit has to deal with the distortion obtained from another ME module (as shown in Figure 1) in order to identify a temporary motion vector with its distortion and pass them to the neighbouring ME module for further estimation. Figure 4 shows the structure of this unit. It mainly contains one parallel adder tree, one motion vector generator, and one comparator. Note that a delay matching unit is also included to allow alignment in data transfer between ME modules.
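The behaviour of this unit can be sketched in software as follows. This is a simplified model under stated assumptions: the function and parameter names are hypothetical, the adder tree is modelled as a plain sum, ties are resolved in favour of the earlier candidate, and the delay matching unit is not modelled at all.

```python
def post_process(candidates, upstream_partials=None, upstream_best=None):
    """Model of the adder tree, MV generator and comparator.

    candidates:        list of (mv, [N one-dimensional distortions])
    upstream_partials: optional per-candidate distortions from another module
    upstream_best:     optional (mv, distortion) received from another module
    Returns the temporary best (mv, distortion) to pass on.
    """
    best = upstream_best
    for k, (mv, partials) in enumerate(candidates):
        total = sum(partials)                    # parallel adder tree
        if upstream_partials is not None:
            total += upstream_partials[k]        # add neighbouring module's part
        if best is None or total < best[1]:      # comparator keeps the minimum
            best = (mv, total)
    return best


cands = [((-1, -1), [3, 4]), ((-1, 0), [1, 1]),
         ((0, -1), [2, 5]), ((0, 0), [0, 6])]
print(post_process(cands))
print(post_process(cands, upstream_partials=[0, 9, 0, 0]))
```

The two optional inputs correspond to the two ways a module may depend on its neighbour: receiving a partial distortion for every candidate, or receiving only a temporary minimum.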
Scalable Solutions
The above description focuses on the internal structure of each ME module. Here we describe how several ME modules can be configured for large search area and reference block.
Case I: Increasing Reference Block Size
We assume that each ME processor can handle an N×N reference block and a search range of [-P, P-1], and produces one MV every (2P)^2 cycles. If, for example, the reference block becomes 2N×N and the search range remains the same, we need two ME processors to produce one MV every (2P)^2 cycles. Figure 5(a) shows the scalable design, where each ME processor computes the partial MAD for one N×N block. The partial MAD from ME1 is then sent to ME2 to find the total MAD of each candidate. The motion vector and the corresponding MAD are then obtained from the compare-select unit in ME2 every (2P)^2 cycles, without counting latency. Note that to reach this performance, search data has to be arranged as shown in Figure 5.
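The idea behind Case I can be illustrated with a small software model in which two modules each evaluate one N×N half of the block and the second module adds the partial distortions per candidate. The names are hypothetical, and the 2N×N block is assumed to be 2N rows by N columns, split into a top and a bottom half; the text does not specify the orientation.

```python
def candidate_distortions(ref, search, p):
    """Distortion of every displacement in [-p, p-1]^2 for one block.

    Displacement (-p, -p) is assumed to map to index (0, 0) of `search`.
    """
    rows, cols = len(ref), len(ref[0])
    result = {}
    for vy in range(-p, p):
        for vx in range(-p, p):
            result[(vy, vx)] = sum(
                abs(ref[i][j] - search[vy + p + i][vx + p + j])
                for i in range(rows)
                for j in range(cols)
            )
    return result


def two_module_search(ref, search, n, p):
    """ME1 handles the top N x N half, ME2 the bottom half plus the merge."""
    me1 = candidate_distortions(ref[:n], search[:2 * p + n - 1], p)
    me2 = candidate_distortions(ref[n:], search[n:], p)
    total = {mv: me1[mv] + me2[mv] for mv in me1}   # ME2 adds ME1's partials
    best = min(total, key=total.get)                # compare-select in ME2
    return best, total[best]


# N = 2, P = 1: a 4 x 2 block inside a 5 x 3 search area.
search = [[10 * r + c for c in range(3)] for r in range(5)]
ref = [[10, 11], [20, 21], [30, 31], [40, 41]]     # displacement (0, -1)
print(two_module_search(ref, search, 2, 1))
```

Both modules scan the same (2P)^2 candidates in the same order, so the merged result is available at the same rate as for a single N×N block.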
Case II: Increasing Search Range
We assume that the search range is extended to [-2P, 2P-1] and that the motion vector still has to be located in (2P)^2 cycles. Without using multiple ME processors, it takes (4P)^2 cycles to find one MV. We thus partition the search area into 4 regions, each of which is handled by one ME processor. For example, the search range [-2P, -1] is handled by ME1, as shown in Figure 6(a). The search area is partitioned into two regions with overlapped (N-1) rows and columns, as shown in Figure 6(b). These 4 ME processors execute in parallel to compute local minimal MADs, which are then further processed to find the final MV.
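A small model of Case II is sketched below: the displacement range is split into a negative half [-2P, -1] and a non-negative half [0, 2P-1] in each direction, four modules search one quadrant each, and the local minima are compared. The names are hypothetical, the assignment of quadrants to ME1..ME4 is an assumption, and ties are resolved by ordinary tuple ordering purely for convenience.

```python
def local_search(ref, search, n, vy_range, vx_range, origin):
    """Best (distortion, (vy, vx)) over one quadrant of displacements."""
    best = None
    for vy in vy_range:
        for vx in vx_range:
            d = sum(
                abs(ref[i][j] - search[vy + origin + i][vx + origin + j])
                for i in range(n)
                for j in range(n)
            )
            if best is None or d < best[0]:
                best = (d, (vy, vx))
    return best


def four_module_search(ref, search, n, p):
    """Four modules, one per quadrant of [-2p, 2p-1]^2, then a final compare."""
    lo, hi = range(-2 * p, 0), range(0, 2 * p)
    quadrants = [(lo, lo), (lo, hi), (hi, lo), (hi, hi)]
    local = [local_search(ref, search, n, ry, rx, 2 * p) for ry, rx in quadrants]
    return min(local)


def sub_area_rows(n, p):
    """Row spans read by the two halves and the number of rows they share."""
    first = (0, 2 * p + n - 2)
    second = (2 * p, 4 * p + n - 2)
    return first, second, first[1] - second[0] + 1


# N = 2, P = 1: the search area is (4P + N - 1) = 5 samples on each side.
search = [[10 * r + c for c in range(5)] for r in range(5)]
ref = [[30, 31], [40, 41]]                      # displacement (1, -2)
print(four_module_search(ref, search, 2, 1))
print(sub_area_rows(2, 1))
```

The helper `sub_area_rows` shows where the overlap comes from: each half needs 2P + N - 1 rows, the halves start 2P rows apart, and so they share N - 1 rows (and, by the same argument, N - 1 columns).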
EVALUATION AND DISCUSSION
It can be found that the total storage space is the same for both styles. However, each solution has its own advantages and disadvantages. For example, the orthogonal style has locality in PE design, making it well-suited for VLSI implementation. However, the orthogonal flow also complicates physical routing. In addition, data initialization between two motion vectors becomes more difficult since part of the storage space is distributed among the PEs. On the other hand, the parallel style has the drawback of broadcasting the search data and control signals, which makes timing constraints difficult to handle. Its advantages are that (1) the routing area in physical design can be reduced and (2) data initialization becomes easier since only the stream memory bank, instead of both the stream memory and the PE array, has to be handled. Both solutions can reach 100% hardware efficiency of the PE array during MV search.
A demonstrator chip based on the parallel style was fabricated and tested. Results show that, for N=P=16, more than 48,000 MVs per second can be obtained with a core area of less than 70 mm^2 in a 0.8 µm double-metal CMOS process. Compared to other ME architectural solutions found in the literature [1, 2, 4, 7], our proposals not only achieve optimal hardware efficiency within the PE array but also require minimum memory bandwidth. Also, the two proposed architectures, which combine a 2-D array and stream memory, are very modular and hence very suitable for high-sample-rate video applications.
In summary, our scalable ME proposals are very regular, modular and well-suited for VLSI implementation. Below we highlight their features:
• minimal I/O to reach 100% PE efficiency: 2 input ports are needed for both B and NB search area data, where the input port for B is also used for loading reference data during the initialization phase;
• optimal generation of motion vector (MV): each MV can be obtained within (2P)^2 cycles;
• ... (2P + N - 1) memory space is needed;
• control design is very straightforward: control signals can be easily generated by ring-counter-based logic [8];
• clock speed can be enhanced to meet design specs: the ME processors are based on semi-systolic/systolic array structures and can easily be pipelined to meet the speed requirement.
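As a rough consistency check on the figures quoted in this section, the cycle count per motion vector can be combined with the reported throughput to estimate a lower bound on the clock rate. This is back-of-the-envelope arithmetic only: the function names are hypothetical, the clock frequency is not stated in the text, and latency and initialization overhead are ignored.

```python
def cycles_per_mv(p):
    """One MV every (2P)^2 cycles, as stated for a single ME processor."""
    return (2 * p) ** 2


def min_clock_hz(p, mvs_per_second):
    """Clock rate implied by a given MV throughput, ignoring all overhead."""
    return cycles_per_mv(p) * mvs_per_second


# P = 16 gives 1024 cycles per MV; 48,000 MV/s then implies about 49.2 MHz.
print(cycles_per_mv(16), min_clock_hz(16, 48_000))
```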
