Abstract-Efficient VLSI architectures for motion estimation using the full-search block-matching algorithm are proposed in this paper. These structures are based on an improved and more efficient two-dimensional single-array architecture with minimum latency, maximum throughput, and full utilization of the hardware resources. This optimized architecture is extended to a class of fully parameterizable multiple-array architectures that combine both pipelining and parallel processing techniques and provide the ability to configure the processors according to the setup parameters and the specified limits on processing time and circuit area. The development of a single-array processor on a single chip, based on a 0.25-µm CMOS technology process, proves the practical interest of the proposed architecture for implementing real-time motion estimators.
I. INTRODUCTION

MOTION COMPENSATION is a fundamental technique to obtain data compression in video coding by exploiting interframe prediction of temporal redundancies in video sequences [1] . Several block matching algorithms have been proposed. Among them, the full-search block-matching (FSBM) algorithm exhaustively compares each block of the current frame with all candidate blocks of the search window defined within the previously processed frame. Although this algorithm provides the optimal solution, it is the most computationally expensive.
Motion estimation requires a huge amount of computations [1] , which justifies the great research effort that has been made to develop efficient dedicated architectures and specialized processors [2] - [4] . Besides the FSBM algorithm, which is extremely regular and suitable for implementations based on array structures [5] , other faster block-matching algorithms have been also proposed [6] . Most of them consider only a reduced set of candidate motion vectors, simpler matching or distortion computations, or even a subset of the block motion field. However, not only do these algorithms provide suboptimal solutions, since the considered search spaces are necessarily reduced, but most of them apply nonregular processing schemes. Consequently, only a few architectures have been proposed to implement fast motion estimation algorithms, due to their less regular structures and higher control overheads [6] .
The main goal of the research presented in this paper is the analysis and development of efficient array architectures and dedicated processors for motion estimation based on the FSBM algorithm. From a comparative analysis of the main array architectures that have been proposed over the last few years [7] - [10] , [4] , one concludes that most of them do not provide a maximum and constant throughput or a full utilization of the hardware resources. Due to its peculiar processing scheme, the AB2 architecture [7] proposed by Vos and Stegherr [8] was selected as the basis for the presented research. It is shown that this architecture can be significantly improved in what concerns the hardware requirements, by reducing the amount of memory required to store the search area, thus achieving a full utilization of the hardware resources. It is also shown that the improved single-array architecture can be extended to multiple-array architectures by exploiting the parallelism and lack of data dependencies provided by the motion-estimation algorithm. A new class of parameterizable array architectures is derived, integrating the proposed single-and multiple-array architectures.
The proposed class of architectures was described using fully parameterizable VHSIC Hardware Description Language (VHDL) code and its functionality was thoroughly tested. An integrated circuit for motion estimation was developed, by making use of this class of architectures and using a standard cell library of a 0.25-µm CMOS technology. Experimental results show that the implemented configuration is able to estimate motion vectors in 4CIF video sequences in real-time.
II. FSBM ARCHITECTURES
The FSBM algorithm using the sum of absolute differences (SAD) distortion measure can be described using the four nested loops presented in Fig. 1 . The specific characteristics of a given FSBM architecture are defined by the set of these loops that are executed in parallel. Assuming, for example, that the variable in the algorithm of Fig. 1 is set to a fixed value and that each processor element (PE) performs the primitive operation SAD, a 3-D dependence graph (DG) can be derived [11] , [12] . Systolic array structures are obtained by applying the usual index projection, time scheduling and graph folding [11] , [12] spaces. FSBM architectures are usually classified according to the set of performed projections, giving rise to 1-D structures if multiprojection techniques are applied. Their execution time is dependent on the specific arrangement of the data supply and on the number of projections performed in the retiming procedure.
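The four nested loops mentioned above can be sketched in software as follows. This is only an illustrative reference model, not the code of Fig. 1: the function names `sad` and `fsbm`, the argument names, and the half-open search range `[-p, p-1]` are assumptions made for the example, and frames are plain lists of rows.

```python
def sad(cur, ref, by, bx, y0, x0, n):
    """Sum of absolute differences between the n x n block of `cur` at
    (by, bx) and the n x n candidate block of `ref` at (y0, x0)."""
    total = 0
    for i in range(n):              # inner loop: rows of the block
        for j in range(n):          # inner loop: columns of the block
            total += abs(cur[by + i][bx + j] - ref[y0 + i][x0 + j])
    return total


def fsbm(cur, ref, by, bx, n, p):
    """Exhaustively search displacements in [-p, p-1] (assumed range) and
    return ((dy, dx), sad) for the candidate with the smallest SAD."""
    height, width = len(ref), len(ref[0])
    best_mv, best_sad = None, None
    for dy in range(-p, p):         # outer loop: vertical displacement
        for dx in range(-p, p):     # outer loop: horizontal displacement
            y0, x0 = by + dy, bx + dx
            # Skip candidates that fall outside the previous frame.
            if y0 < 0 or x0 < 0 or y0 + n > height or x0 + n > width:
                continue
            d = sad(cur, ref, by, bx, y0, x0, n)
            if best_sad is None or d < best_sad:
                best_mv, best_sad = (dy, dx), d
    return best_mv, best_sad
```

The two inner loops correspond to the SAD primitive assigned to the PEs, while the two outer loops enumerate the candidate macroblocks; which of these loops run in parallel is what distinguishes one array architecture from another.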
One of the first discussions about FSBM architectures was presented by Komarek and Pirsch [7] . They discussed the characteristics of a set of 2-D and 1-D processor arrays that were obtained by reducing the dimensions of the initial DG using the referred operations. The main difference between these arrays is the exploited processing concurrency, which implies different structures and different number of PEs (#PE) and is dependent on the desired performance versus circuit area trade-off. The so-called type I-AB2 bidimensional structure requires # , while the AS2 type array uses # . By projecting the initial DG twice, 1-D arrays are obtained, such as the AB1 structure, with # , or the AS1, with # . Vos and Stegherr [8] proposed an improved version of the type I-AB2 2-D structure which presents some significant advantages in what concerns the processing time. Hsieh [9] presented a type I architecture with some improvements in what concerns the transfer of data into the processing circuit. In his proposal, the candidate macroblock is supplied as a series of one-dimensional data through a set of delay elements. Chang [10] proposed an alternative notation for the DG representation in order to improve the four loop based model: instead of nodes and links, he repeatedly allocated a 2-D projection (slice) in the 2-D space (tiling). If some conditions are met, Chang's model provides a hardware utilization rate very close to the optimal (100%). However, this is only possible by using multiple data-input lines. Very recently, Kittitornkun and Hu [4] also proposed a different mapping of the original DG in order to improve the throughput of the algorithm, by eliminating all idle clock cycles in the transitions between consecutive frames.
One of the most important figures of merit that is often used to compare the performance of the architectures is the required number of clock cycles to estimate the motion vectors. The values of #PE and of some of the referred architectures are presented in Table I . It is worth noting that, in practice, the real values of can be significantly greater than those presented in Table I . In fact, extra clock cycles are frequently necessary to fill the pipeline and dummy results are often computed to preserve a regular data flow. For comparison purposes, the limit situation corresponding to a processor array with a single PE, designated by SinglePE architecture, was also considered.
The circuit area and the processing time were estimated by parameterizing the set of expressions presented in Table I in terms of , which represents the relation between the reference macroblock width and the maximum considered displacement of the candidate macroblocks in each direction in the search area . The obtained results are presented in Figs. 2 and 3 . In AB1 and type I architectures, the circuit area is independent of the search window size ( and processing elements, respectively), while in AS1, AS2 and type II structures, it increases significantly with the dimension of this window. Therefore, these last structures are usually advantageous for small sized search windows , while the former offer advantages for greater search areas. In what concerns the processing time, while for most architectures it increases with the search window size, its value is constant for the type II structure, since one PE is used to compute the SAD measure of each candidate macroblock.
Among all these array structures, the type I architecture proposed by Vos has been recognized as being one of the most efficient structures [4] , [13] . Its main advantages are the short processing time and the limited amount of hardware requirements, when compared with other bidimensional structures. However, this architecture still has some nonexploited features, which can be used to significantly improve its efficiency in terms of hardware resources and exploited parallelism level.
III. AN IMPROVED SINGLE-ARRAY ARCHITECTURE
The proposed single-array architecture is based on Vos architecture but presents significant improvements in what concerns its structure. Due to the similarities between the processing schemes of these two architectures, the description of the proposed structure will be done by contrasting its optimized characteristics with those presented by Vos and Stegherr [8] , [14] . Therefore, references to Vos architecture will be done whenever it shows to be convenient.
The diagram shown in Fig. 4 illustrates the main differences between the architecture proposed by Vos, represented using solid and dotted style lines , and the proposed architecture, represented with solid and dotted-dashed style lines --. Like any other type I bidimensional structure, each pixel of the reference macroblock is assigned to one of the PEs that compute the SAD similarity function (designated by active PEs). Besides this active block, the processor proposed by Vos is also composed by two passive blocks with passive PEs, which are appended to each side of the active block (see Fig. 4 ). Each passive PE is mainly composed by data registers for the displacement and storage of search-area pixels. Both the reference macroblock and the search-area pixels are transferred into the processor through two vertical input register chains, with length and , respectively. Within the PE array, search-area pixels can be displaced in three directions: upwards, downwards, and to the left. If, at a given clock cycle, one column with pixels of the search area is fed into the structure through the set of upper inputs, all search-area pixels within the PE array are simultaneously shifted one position to the left. During the following clock cycles, search-area data is shifted downwards one position per cycle. Meanwhile, the pixels corresponding to different candidate macroblocks are processed by the several active PEs, obtaining one similarity value in each clock cycle. After shift-down operations, another left shift of the search area is performed and a new column of pixels is fed in the right side of the array. However, this column is now loaded through the lower inputs. This alternation of input positions in the input register chain is repeated along the search process. During the next clock cycles, search-area data is shifted upwards in a similar manner as described above, being shifted to the left after the th clock cycles. 
Contrasting with other architectures [7] , [9] , this zig-zag processing scheme provides fast processing capabilities, preventing the need for dummy clock cycles between any two adjacent rows of the search area. To ease the discussion, a 90° rotation of the processor diagram presented in Fig. 4 will be considered in the description.
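The benefit of the zig-zag scheme can be illustrated with a small abstract model of the order in which candidate positions are visited. This sketch does not model the register structure of the array; the function names and the idea of counting single-position shifts between consecutive candidates are assumptions introduced only for the illustration.

```python
def zigzag_order(rows, cols):
    """Candidate positions visited with the direction reversed on
    alternate rows (zig-zag scan)."""
    order = []
    for r in range(rows):
        cs = range(cols) if r % 2 == 0 else range(cols - 1, -1, -1)
        order.extend((r, c) for c in cs)
    return order


def raster_order(rows, cols):
    """Candidate positions visited always in the same direction."""
    return [(r, c) for r in range(rows) for c in range(cols)]


def shifts_between(order):
    """Number of single-position shifts needed to move from each
    candidate to the next one (Manhattan distance)."""
    return [abs(r1 - r0) + abs(c1 - c0)
            for (r0, c0), (r1, c1) in zip(order, order[1:])]
```

In this model every step of the zig-zag scan costs exactly one shift, whereas a scan that always moves in the same direction needs extra shifts at the end of each row to return to the opposite side, which corresponds to the dummy clock cycles mentioned above.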
The processing scheme of Vos' architecture can be represented, in a simplified way, by the sequence of states shown in Fig. 5 . The fraction of the search area being processed at a given clock cycle was represented using a solid-line rectangle, whereas those leaving or entering the processor were represented using a dashed-line rectangle. The bottom dashed-line rectangles represent search-area fractions entering the processor in the next clock cycle, while the top dashed-line ones represent already processed search fractions leaving the array. Fig. 5 shows that in an array composed of active PEs and of passive PEs, used to process search fractions with pixels, half of the total amount of passive PEs, , are not being used at any clock cycle. However, these passive PEs are required whenever search-area pixels are displaced into their registers. The proposed solution to overcome this drawback consists in laying out Vos' planar structure over a cylindrical surface, as shown in Fig. 6 . By doing so, since the pair of passive blocks is superimposed, one can naturally discard one of them, using the other to displace the search-area pixels. The zig-zag processing scheme can still be applied to this modified structure, preserving the properties of Vos' architecture while keeping all PEs busy at any instant.
A simplified block diagram of the proposed structure is shown in Fig. 6(c) . The cylindrical structure of Fig. 6(b) is obtained by connecting the passive PEs located in the right margin of the passive block with the active PEs of the left margin of the active block, as it was shown in Fig. 4 . The new processing scheme of the proposed architecture is shown in Fig. 7 , for the same setup of Fig. 5 .
In contrast with Vos' architecture, this structure does not require passive PEs not carrying useful data at some clock cycles. Moreover, the zig-zag processing scheme of Vos' architecture is preserved, thus maintaining its recognized efficiency. However, while in Vos' architecture registers are required, in the proposed architecture only registers are necessary. The chart presented in Fig. 8 illustrates the relation between the number of registers required by both structures to perform the displacement of search-area pixels. This relation is about 60% for and 55% for .
IV. NEW CLASS OF MULTIPLE-ARRAY ARCHITECTURES
The proposed motion-estimation architecture can be further sped up by computing in parallel several distortion measures using multiple active blocks (hereafter designated by "cores"): since both the passive and the active PEs perform the same displacement task of the search-area pixels, an passive PE array can be replaced by an array of active PEs with the same size. However, as shown in Fig. 9 , such a processor, composed of processing cores, will require adder-tree blocks to compute the distortion values of all candidate macroblocks being processed at each clock cycle. Likewise, the former passive block, composed of processing elements [see Fig. 6(c) ], is now fragmented into several smaller passive blocks (attached to the corresponding active blocks). The fraction of the last passive block composed by the set of columns of PEs located next to the right margin of the structure is designated by connection block. The last column of this block is connected with the leftmost column of the first active block, thus obtaining the previously described cylindrical structure [see Fig. 6(c) ].
The usage of several active blocks makes it also possible to split the computation of each distortion value into more than one active block or time unit. This feature provides the ability to use active blocks composed by only columns of PEs and rows of PEs , thus adapting the processor structure to the characteristics of the implementation technology (see Fig. 9 ). In such schemes, each distortion measure is only obtained after multiple complete excursions of the processing cores in the corresponding search areas, in order to process each of the fractions of the reference macroblock. Consequently, reference pixels, corresponding to each of these excursions, will have to be stored in the standing-data registers of the PEs, as shown in Fig. 10 . Meanwhile, search-area data must be displaced in both directions of the processing array.
Fig. 10. ⌈N/h⌉ × ⌈N/ℓ⌉ reference pixels must be stored in the standing-data registers of the active PE.
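The idea of obtaining one distortion value from several partial computations can be sketched as follows. This is a software illustration only: the names `tile_rows` and `tile_cols` are hypothetical stand-ins for the dimensions of a reduced active block, and each tile of the reference macroblock plays the role of one fraction processed in a separate excursion.

```python
def full_sad(ref_block, cand_block):
    """SAD computed over the whole macroblock at once."""
    return sum(abs(a - b)
               for row_a, row_b in zip(ref_block, cand_block)
               for a, b in zip(row_a, row_b))


def split_sad(ref_block, cand_block, tile_rows, tile_cols):
    """Accumulate the SAD tile by tile and also return how many tiles
    (passes) were needed to cover the macroblock."""
    n_rows, n_cols = len(ref_block), len(ref_block[0])
    total, passes = 0, 0
    for r0 in range(0, n_rows, tile_rows):
        for c0 in range(0, n_cols, tile_cols):
            # Partial SAD of one fraction of the reference macroblock.
            for i in range(r0, min(r0 + tile_rows, n_rows)):
                for j in range(c0, min(c0 + tile_cols, n_cols)):
                    total += abs(ref_block[i][j] - cand_block[i][j])
            passes += 1
    return total, passes
```

Because the SAD is a plain sum, the accumulated partial results equal the value computed in a single pass; the price of the smaller array is the larger number of passes.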
To maximize the effectiveness of the processor, it is assumed that all PEs process the same number of search-area pixels. Consequently, the number of candidate macroblocks processed by each of the active blocks in a given row of the search area should be fixed to . This restriction will imply the usage of a slightly smaller search area, composed by only candidate macroblocks in each row. In fact, for an array processor composed by processing cores (each one with active PEs) and a search range initially defined in the interval , corresponding to a total of candidate macroblocks defined in a pixel search window, the effective number of candidate macroblocks in each row of the search area is given by (1) with a maximum deviation from the initial value given by (2) The number of pixels that compose the effective search area is (3) and the number of passive PEs between each of the active blocks is given by , with the rightmost passive block accommodating the columns of passive PEs of the connection block (4) The number of clock cycles required to estimate the motion vector for a given reference macroblock in a processor composed by processing cores is given by (5) In fact, the search area can be split in three different regions (A, B, and C), as illustrated in Fig. 11 . While in region B, all candidate macroblocks are processed using all the fractions of the reference macroblock, in regions A and C, not every fraction is used in each excursion of the active blocks [see (5) is obtained by multiplying the sum of (6c), (7), and (8c) by the number of clock cycles required to process each horizontal excursion . Hence, by adjusting a restricted set of implementation parameters ( , , and ), processing structures with distinct performances, as well as different hardware requirements, are obtained. 
This feature can be particularly interesting in implementations with limited hardware resources (such as FPGAs), providing the means to adjust the used resources to the target technology.
A. Typical Structures
In the remaining part of Section IV, some examples of different configurations obtained from the proposed class of processors will be presented. Each processor was denominated as , where and . The structure, shown in Fig. 12(a) , makes use of a single active block and coincides with the previously presented single-array processor. The number of clock cycles required to process one reference macroblock is given by (9) In contrast, the structure, shown in Fig. 12(b) , makes use of two active blocks , thus increasing the amount of required hardware by a factor of two. Therefore, the number of clock cycles required to process one reference macroblock is halved as follows:
(10) Fig. 12(c) illustrates the structure where the total amount of required hardware was reduced by using only rows of PEs. Two active blocks are used in this structure (11) The number of clock cycles required by this structure is exactly the same used by a type I structure. This fact can be easily understood if one realizes that although the number of registers used by this processor array is only half of the required by the structure, it uses exactly the same number of active PEs. Consequently, not only does this structure provide advantages in what concerns the amount of required hardware resources, but it is also characterized by a shorter pre-fetch time of the search-area data. This feature can be significantly advantageous when the target implementation provides limited hardware resources (e.g., FPGA devices).
The structure , illustrated in Fig. 12(d) , is characterized by the same hardware resources as the structure. However, active blocks were implemented using . As before, the required number of clock cycles is also .
Fig. 13. Relation between the number of clock cycles required by the considered structures of the proposed class and by the traditional nonreversive structure, using both the transparent and nontransparent transfer modes.
To compare the performance of the presented structures, we also considered an architecture using a single active block and not making use of the reversive zig-zag processing scheme. In this nonreversive (nR) structure, search-area pixels are always displaced in the same direction, which presents some analogies with the architecture proposed by Hsieh [9] . Consequently, additional clock cycles are required to process each row of candidate macroblocks in order to move the processed pixels to the opposite side of the array (12) The number of clock cycles required by each of these structures is summarized in Table II . In the previous analysis, the extra clock cycles required between the processing of consecutive reference macroblocks to transfer or remove unused or already processed search data from the array were not taken into account. These extra clock cycles can be avoided if a transparent transfer mechanism based on the pre-fetch layer described in [15] is used. With such a structure, it is possible to pre-load both the reference and part of the search data corresponding to the next reference macroblock while the current macroblock is being processed. Therefore, any idle clock cycles in the transitions between consecutive reference macroblocks can be avoided, giving rise to the values presented in the first column of Table II. Fig. 13 shows the relation between these values and the number of cycles required by the traditional nonreversive architecture. This chart shows the significant increase of the number of clock cycles required by those structures using the nontransparent transfer mode for small search areas (small values of ). In such situations, the share of the time spent reading the search data cannot be disregarded. For greater values of , these relations approach those obtained for the structures with transparent pre-fetching.
V. DATA-FLOW CONTROL
The desired data-flow between the several blocks of the processor can only be guaranteed through careful design of the various control units that provide the set of signals required by the several blocks of the processor. To simplify the design of these units, they were decomposed into local and simpler control circuits and synchronized through a restricted set of signals. The number of states of each controller was optimized in order to guarantee the tradeoff between the efficiency and the flexibility levels of each circuit and the required amount of hardware resources. The main control circuits are the central control unit (decomposed into the main state machine, the data flow controller and the line and column counters), the clock generator, and the reference and search-area input controllers.
As an example, the search-area input controller is responsible for loading search-area pixels into the processor array by means of a serial-in-parallel-out (SIPO) input buffer. It reads fractions of the previous image, composed by the set of pixels corresponding to each row of the search area, and transfers them in parallel to the array as soon as all the fractions of the reference macroblock have been processed. Its implementation is carried out by means of cascaded registers with their outputs connected to the processor array, as shown in Fig. 14(a) .
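The behavior of a serial-in-parallel-out buffer of this kind can be sketched with a small behavioral model. The class name `SipoBuffer`, its methods, and the `width` parameter are hypothetical and serve only to illustrate the principle; the model ignores the alignment circuit and the timing of the actual controller.

```python
class SipoBuffer:
    """Behavioral model of a chain of cascaded registers that is filled
    one pixel per clock cycle and read out in parallel."""

    def __init__(self, width):
        self.width = width
        self.regs = [0] * width     # register contents, oldest first
        self.count = 0              # pixels loaded since the last transfer

    def shift_in(self, pixel):
        """One clock cycle: every register takes its neighbor's value and
        the new pixel enters at the end of the chain."""
        self.regs = self.regs[1:] + [pixel]
        self.count = min(self.count + 1, self.width)

    def full(self):
        """True once a complete line of pixels has been loaded."""
        return self.count == self.width

    def parallel_out(self):
        """Transfer all register outputs to the array at once."""
        line = list(self.regs)
        self.count = 0
        return line
```

A controller built around such a buffer would keep shifting pixels in while the array is busy and trigger the parallel transfer only when the array is ready to accept a new search line.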
However, with the introduction of the new processing scheme based on the cylindrical structure, some measures have to be taken into account in order to provide the correct data to each PE. Fig. 14(a) shows an example of a processor array with and . To simplify the explanation, the processor array was schematically represented in Fig. 14(b) , where each column was conveniently numbered according to the corresponding active block A or B. Since some of these columns are processed by more than one active block, they were marked with both representations. The situation illustrated in this figure corresponds to one of the two possible extreme positions of the data being processed in the array and coincides with the transfer of a new search line. In this case, the position of the pixels in the input buffer is the same as in the array. Consequently, it is only necessary to transfer them in parallel to the array.
However, in the situation corresponding to the other extreme position, the spatial distribution of the pixels loaded into the input registers is not aligned with those in the processing array. In fact, their alignment can only be guaranteed if the registers are connected as it is illustrated in Fig. 14(c) . In this case, the set of input registers is split into two smaller shift registers: one with registers and other with registers.
The complete search-area input buffer and its alignment circuit is presented in Fig. 14(d) . This circuit is characterized by a variable topology of the connections between the two smaller input registers, which depends on the direction of the flow of data being processed. This variable topology is accomplished by means of two multiplexers, making it possible to transfer in parallel all the input pixels to the correct position in the processing array.
VI. IMPLEMENTATIONS AND EXPERIMENTAL RESULTS
A configuration software tool was developed to determine the several possible configurations of the proposed class of architectures that fulfill the requisites of a given video coder. This configuration tool, whose graphical interface is illustrated in Fig. 15 , receives the following set of parameters: the image resolution and the frame rate; the dimension of the reference macroblock and the maximum allowed displacement in each direction ; the processor maximum operating frequency and a flag indicating if the pre-fetching transparent transfer mode is to be used. These parameters are used to compute the number of PEs to be implemented in each line and column of each active block and the number of processing cores that will compute the several distortion measures in parallel . For each possible configuration, the application outputs the effective maximum frame rate and may also supply the required semiconductor area to implement the processor, calculated with a priori information about the used technology. As soon as the selection of the desired configuration has been carried out, the configuration tool outputs the complete description of the FSBM processor using IEEE-VHDL description language [15] .
To achieve the required characteristics in what concerns the processor performance and configurability, the description of the several blocks of the proposed class of processors was carried out using a fully parameterizable VHDL description, by making extensive use of "generic" type configuration inputs. Furthermore, a fully structural description of these circuits was also carried out, using only the most elementary logic operations provided by the implementation library to achieve the required optimization levels of the several processing blocks.
An FSBM integrated circuit based on the proposed class of architectures was designed with Synopsys synthesis tools and Cadence design tools. This chip was implemented using the Diplomat-25 standard cell library, based on the UMC 0.25-µm CMOS technology process from Virtual Silicon Technology Inc. [16] . The implemented processor is composed of a single active block with 16 × 16 PEs and it is characterized by a search range from to pixels , corresponding to the set of parameters typically adopted by ITU-T H.26x and ISO MPEG standards [1] . The total area of the implemented chip is about 16.07 mm², with a total pin-count of 56. The main characteristics of the chip are summarized in Table III . The chip is able to deliver 28.6 GOPs at 36.5 MHz over typical voltage and temperature ranges, giving rise to a total of 1.78 GOPs/mm². With such a configuration, it is possible to estimate motion vectors in 4CIF video sequences at a rate of 16 frames/s.
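As a quick consistency check of the reported figures, the area efficiency follows from dividing the quoted throughput by the quoted chip area. The second line is only a rough cycle budget, assuming 16 × 16 macroblocks and the standard 4CIF resolution of 704 × 576 pixels; it is not a value stated in the paper.

```latex
\frac{28.6\ \text{GOPs}}{16.07\ \text{mm}^2} \approx 1.78\ \text{GOPs/mm}^2
\qquad
\frac{36.5 \times 10^{6}\ \text{cycles/s}}
     {\dfrac{704 \times 576}{16 \times 16}\ \text{macroblocks/frame} \times 16\ \text{frames/s}}
= \frac{36.5 \times 10^{6}}{1584 \times 16}
\approx 1440\ \text{cycles per macroblock}
```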
VII. CONCLUSION
A new class of fully parameterizable multiple-array architectures for motion estimation in video sequences was proposed in this paper. This class is the result of the introduction of significant improvements in the AB2 single-array architecture for the FSBM algorithm, which led to the minimization of its latency, the maximization of its throughput and a full utilization of the hardware resources.
The proposed class of processors provides the capability of further speeding up the estimation of motion vectors, by computing in parallel several distortion measures using multiple and independent processing cores. Optimized implementations of these processing structures can be obtained by means of a software configuration tool that was developed to provide the ability to automatically configure the target processors according to the setup parameters, the processing time, and the circuit area specified limits.
The obtained fully parameterizable VHDL description of this class of architectures provides the possibility to implement efficient processors that are able to perform motion estimation according to the existing ITU-T H.26x and ISO MPEG video coding standards with configurable search ranges and video quality tradeoffs. Experimental results obtained from the implementation of a single-array configuration have shown that it is possible to estimate motion vectors, for example in 4CIF video sequences, at a rate of 16 frames/s with a single chip.
