Abstract: In this paper, a high-throughput modular architecture for a logarithmic search block-matching algorithm is presented. The design efforts are focused on exploiting the search area data dependencies using special data input ordering constraints. The input bandwidth problem has been solved by a random access on-chip memory, and a simple address generation procedure has been described. Furthermore, this architecture can handle a large search range with unequal horizontal and vertical spans using a technique called pipeline interleaving. Compared to the existing architectures for the three-step search BMA, this architecture delivers a high throughput rate with fewer input lines, and is linearly scalable.
I. INTRODUCTION

Video compression standards such as MPEG [1] have fundamentally impacted the future direction of modern information technology. These standards made it possible to develop low-cost, high-performance, real-time visual communication systems to support emerging applications such as multimedia, digital libraries, video-on-demand (VoD), and high-definition television (HDTV).
A common approach to video compression is to exploit both spatial (intraframe) and temporal (interframe) redundancies. Transform coding is often used for intraframe coding, and predictive coding with motion estimation is often used for interframe coding. In the context of video coding, motion estimation is less concerned with the movement of a specific object in each frame. Rather, within a predefined search area of the reference frame, the goal of motion estimation is to search for a block or a region which best matches, under certain matching criteria, a given block (or region) in the current frame. The displacement between the coordinate of the block in the current frame and the matched block in the reference frame is called a motion vector. Given the block in the reference frame and the motion vector, as well as the difference between these two corresponding blocks, the block in the current frame can be recovered perfectly. Assuming the reference frame is available, only the difference image between the two matched blocks and the corresponding motion vector need to be transmitted to fully recover the current frame. This often leads to a significant reduction of the video data to be transmitted. Motion estimation can be performed with different granularities such as a pixel, a block, or an (irregular) region.
Among them, the block-based approach is considered the most mature and practically useful. A block refers to a small square of pixels (e.g., 8 × 8, 16 × 16) in a frame. During motion estimation, a distance measure, defined by the matching criterion, between the target block in the current frame and a candidate block within the search area of the reference frame is computed. The candidate block with the smallest distance to the target block is selected as the best matched block, and the displacement between these two blocks is computed and transmitted as the motion vector.
Motion estimation is a computationally intensive task. The amount of computation is proportional to the number of candidate blocks in the search area and the size of the block. A full-search block-matching algorithm (FBMA) evaluates the distance between the target block and every candidate block within the search region. Hence, it is able to find the best matched block to give the highest peak signal-to-noise ratio (PSNR) and reconstructed image quality. On the other hand, it also demands an enormous amount of computation. For example, for HDTV applications, it is estimated that 8 billion additions will need to be performed each second in order to sustain real-time compression of the video signals. Numerous special-purpose hardware designs, many featuring systolic-array structures, have been proposed [8]-[13] to alleviate this problem. On the software side, many fast block-matching algorithms have also been proposed which seek to reduce the computation time by searching only a subset of the eligible candidate blocks. These fast block motion estimation algorithms include the two-dimensional logarithmic search algorithm [4], the three-step search algorithm [5], the conjugate direction search algorithm [6], and many variations. These algorithms search only a small fraction of the available candidate blocks at the cost of moderate performance degradation in terms of PSNR and subjective picture quality. Even for this family of fast motion estimation algorithms, a hardware implementation is beneficial in that a special-purpose motion estimation unit frees the host processor to handle more complicated, but less computation-intensive, tasks. However, unlike the FBMA, which is extremely regular and suitable for array structure implementation, these fast motion estimation algorithms have a much less regular structure, and hence are more difficult to implement.
So far, there are two different architectures proposed for the implementation of the three-step search block-matching motion estimation algorithm.
Jehng et al. [14] proposed a low-latency and high-throughput tree architecture for the FBMA and the three-step search BMA. This tree architecture avoids the long data path length, data skewing, and hazards caused by data dependency. Nevertheless, it requires a massive number of input ports, i.e., it requires input ports for the block of size image pixels. To alleviate the problem of large input pin count, they proposed a tree-cut technique which folds a whole tree into a subtree with a reduced number of processing elements and input pin count at the cost of a lower throughput rate. In [15], Jong et al. suggested a fully pipelined parallel architecture for the three-step search BMA. They map the three-step search BMA onto a one-dimensional array of nine processing elements, and let each PE evaluate one candidate location. For data reusability, half search area buffers have been proposed to utilize the overlapped search area region between two neighboring target blocks. Both architectures, [14] and [15], introduced a technique called memory interleaving which distributes search area data to and nine memory modules, respectively, and allows parallel data accesses. Nonetheless, there is still room for further improvement. In particular, the overlap of search regions within the search area corresponding to a particular target block is not fully exploited in these architectures, resulting in excessive memory access.
In this paper, we discuss our effort to develop an array structure to implement a logarithmic search block-matching algorithm of which the three-step search algorithm is a special case. A key step in this work is to exploit the patterns of overlapped search area in the reference frame so as to derive an efficient architecture which reduces as much as possible the need to reload or redistribute the same reference data over and over. As a consequence, our architectures offer more efficient processor utilization while achieving a very high throughput rate. The irregularity of the logarithmic search method is hidden behind the memory access pattern, that is, how different parts of the memory (frame buffer) are accessed at different times. Previous implementations of the three-step search method did not address this issue explicitly, nor are their results conclusive. In this paper, we derive a formula explicitly describing the memory access pattern, and propose an application-specific architecture to implement the memory access process. With these efforts, we estimate that the architecture we propose is capable of handling the progressive-scan HDTV format with a clock rate as low as 50 MHz.
The rest of this paper is organized as follows. The logarithmic search BMA is reviewed in Section II. The reference area overlapping pattern analysis is discussed in Section III. Furthermore, high-throughput architectures with reduced input pin count and memory bandwidth using special data input schemes are discussed in Section IV. Finally, comparisons with other existing architectures and conclusions are given in Section V.
II. REVIEW OF THE LOGARITHMIC SEARCH BMA
By default, the current frame and the reference frame have exactly the same coordinate for each pair of corresponding pixels. Depicted in Fig. 1 is the target block (shaded region) of the current frame overlaid on the search area (the clear region in the background) of the reference frame. Starting from the current frame position, the search region spans pixels in each direction. Thus, each entry of the two dimensional motion vector (MV) has a range of In a full search BMA, all positions are to be searched. Each search requires the calculation of the total absolute difference (TAD) between the pixels in the target block, denoted by and the candidate block, denoted by (1) where is the displacement between these two blocks. Thus, each search would require integer subtractions and additions.
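The TAD criterion and the full search described above can be illustrated with a minimal Python sketch. The names `tad`, `full_search`, `n` (block side), and `p` (maximum displacement in each direction) are our own assumptions for illustration and are not taken from the original notation; the sketch assumes that the search area lies entirely inside the reference frame.

```python
import numpy as np

def tad(target, reference, top, left, dy, dx):
    """Total absolute difference between the target block and the candidate
    block displaced by (dy, dx) from position (top, left) in the reference."""
    n = target.shape[0]
    cand = reference[top + dy: top + dy + n, left + dx: left + dx + n]
    # Widen to int64 so that uint8 subtraction cannot wrap around.
    return int(np.abs(target.astype(np.int64) - cand.astype(np.int64)).sum())

def full_search(target, reference, top, left, p):
    """Evaluate every displacement in [-p, p] x [-p, p]; return (mv, tad)."""
    best_mv, best_tad = None, None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            d = tad(target, reference, top, left, dy, dx)
            if best_tad is None or d < best_tad:
                best_mv, best_tad = (dy, dx), d
    return best_mv, best_tad

# Usage: a random reference frame and a target block cut out at a known shift.
rng = np.random.default_rng(0)
reference = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)
top, left, n, p = 16, 16, 16, 7
target = reference[top + 3: top + 3 + n, left - 2: left - 2 + n].copy()
mv, best = full_search(target, reference, top, left, p)
```

With `p = 7` the two outer loops visit 15 × 15 = 225 displacements, which is the cost that the logarithmic search reduces.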
In a logarithmic search BMA, the search is accomplished hierarchically in steps. 1 During step a partial motion vector is determined by comparing the TAD's evaluated at exactly nine displacement vectors: (2) where with the local step size 2 and the shifted origin for and (0, 0) for The local displacement which yields the smallest TAD will be chosen as
The net motion vector can then be found as (3) Let us now consider an example where and Then and This corresponds to the case of the three-step search BMA method [5]. This is illustrated in Fig. 2, where the 225 candidate motion vectors correspond to the 225 grid points. In the first step, the TAD's at the nine clear circles labeled with 1 are evaluated. After comparison, it is decided
In the second step, TAD's on the eight shaded circles as well as are compared, and the minimum is selected. This yields in this example. Consequently, Finally, in the third step, and Note that the total number of distortion measure evaluations in a logarithmic search BMA method is -orders of magnitude less than that of the full search BMA method. With the above example, only 25 out of the possible 225 (a ratio of nine to one reduction) distortion measures need to be evaluated for each target block in the current frame. On the other hand, since not all candidate displacement vectors are evaluated, the logarithmic BMA method often yields a suboptimal motion vector whose corresponding TAD may not be the global minimum among all candidate displacement vectors. Some proposals to modify the three-step search method to alleviate this problem can be found in [16], [17].
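The evaluation counts quoted above (25 versus 225) can be reproduced with a short calculation. The sketch below assumes that the number of search steps is the ceiling of log2(p + 1) for a maximum displacement `p`, and that every step after the first reuses the TAD of its center point; both the symbol `p` and the step-count formula are our assumptions, chosen to be consistent with the three-step example.

```python
import math

def full_search_evaluations(p):
    """Number of candidate displacements in a (2p+1) x (2p+1) search window."""
    return (2 * p + 1) ** 2

def log_search_evaluations(p):
    """Nine TADs in the first step, eight new ones in each later step
    (the center of each later step was already evaluated)."""
    steps = math.ceil(math.log2(p + 1))  # assumed step-count formula
    return 9 + 8 * (steps - 1)

full_count = full_search_evaluations(7)
log_count = log_search_evaluations(7)
```

For `p = 7` this gives three steps, 225 candidates for the full search, and 9 + 8 + 8 = 25 evaluations for the logarithmic search.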
The logarithmic search BMA corresponding to a single target block in the reference frame can be written as the five-level nested Do loop depicted in Fig. 3. The outermost loop with index performs the search steps. Loop indexes and refer to the nine displacement vectors to be searched during each step, and and loops compute the TAD for a given displacement vector.
1 ⌈x⌉ is the ceiling function.
Given a nested Do loop, it is possible to map its corresponding dependence graph onto an array structure with processing nodes occupying a lower dimension index space [18]. However, due to the dependency of on the corresponding five-dimensional dependence graph is not regular. Thus, an existing systolic design procedure cannot be applied directly. On the other hand, the inner four-level nested Do loop with being fixed indeed forms a regular dependence graph, and systolic mapping procedures can be applied.
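The five-level loop structure described above can be sketched as follows. This is an illustrative reading of the text, not a transcription of Fig. 3: the index names `k`, `u`, `v`, `i`, `j` and the function name `log_search` are hypothetical, and for simplicity the center point is re-evaluated in every step instead of being reused.

```python
import numpy as np

def log_search(target, reference, top, left, n_steps):
    """Logarithmic search written as an explicit five-level nested loop."""
    n = target.shape[0]
    origin, best_tad = (0, 0), None          # shifted origin is (0, 0) at step 0
    for k in range(n_steps):                 # level 1: search steps
        s = 1 << (n_steps - 1 - k)           # local step size, e.g. 4, 2, 1
        best_tad, best_mv = None, origin
        for u in (-1, 0, 1):                 # levels 2-3: nine displacements
            for v in (-1, 0, 1):
                dy, dx = origin[0] + u * s, origin[1] + v * s
                acc = 0
                for i in range(n):           # levels 4-5: accumulate the TAD
                    for j in range(n):
                        ref_pixel = int(reference[top + dy + i, left + dx + j])
                        acc += abs(int(target[i, j]) - ref_pixel)
                if best_tad is None or acc < best_tad:
                    best_tad, best_mv = acc, (dy, dx)
        origin = best_mv                     # the next step depends on this result
    return origin, best_tad

# Usage: the true displacement lies on the first-step grid, so it is found exactly.
rng = np.random.default_rng(1)
reference = rng.integers(0, 256, size=(48, 48), dtype=np.uint8)
target = reference[16 + 4: 16 + 4 + 16, 16 - 4: 16 - 4 + 16].copy()
mv, best = log_search(target, reference, 16, 16, 3)
```

The assignment `origin = best_mv` at the end of each step is the data dependency that makes the five-dimensional dependence graph irregular, whereas the four inner loops for a fixed `k` have fixed bounds and a regular structure.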
III. REFERENCE AREA OVERLAPPING PATTERN ANALYSIS
Given a fixed an area in the reference frame will be searched to compute the nine This is illustrated in Fig. 4(a). For example, the search area for corresponds to the shaded area in this region. Clearly, the search areas of the nine possible search directions overlap significantly. Under the assumption that is an integral multiple of we may partition this overall search area into segments, each of size pixels pixels. Some of these segments will be used only once by one particular search direction and others will be used many more times. In Fig. 4(b), we list the number of search directions for which the segment will be accessed for the case of This figure suggests an interesting memory loading strategy: if each of these segments is loaded simultaneously pixel by pixel, then the entire search area will need I/O channels (8 bits wide each), and the loading of data into the array can be accomplished in clock cycles. Once the pixels are loaded into the array, they will be routed to the proper processing element for computation. As such, each pixel will be loaded only once regardless of whether it is to be used once or nine times during the computation of the nine TAD's. This is the motivation behind the architecture proposed below.
Before we present the formal derivation, let us consider a heuristic architecture. Assume that the target area is also decomposed into subblocks (in this example, 16/4 × 16/4 = 16 subblocks), and assign one processing element (PE) to compute the 16 ADA (absolute difference and accumulation) operations within a subblock as depicted in Fig. 5(a) and (b). Then each PE will need to access a 3 × 3 subblock array in the reference frame. For example, the PE (0, 3) will process pixels in reference subblocks shaded in Fig. 4(b). By analyzing this pattern, an interconnection pattern within the array structure can be devised for a given search step Obviously, this is only a basic idea. The actual implementation will be more complicated.
First, recall that changes its value for different search steps But reconfiguring the array during the program execution may not be the best idea. Also, when becomes small (i.e., assigning pixels to a single PE will no longer be economical. In this case, a larger subblock size will be used, and of course, the internal control for each PE will be somewhat complicated as a result.
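The segment usage pattern discussed in this section can be recomputed with a small sketch. It assumes a block side `n` that is an integral multiple of the step size `s`, so that the search area of one step splits into (n/s + 2) × (n/s + 2) segments of s × s pixels; `segment_usage`, `n`, and `s` are hypothetical names, and the result is our own tabulation, not a copy of Fig. 4(b).

```python
def segment_usage(n=16, s=4):
    """For each s x s segment of the search area, count how many of the nine
    candidate blocks (displacements -s, 0, +s in each direction) cover it."""
    blocks = n // s                  # segments along one side of a block
    side = blocks + 2                # segments along one side of the search area
    counts = [[0] * side for _ in range(side)]
    for u in range(3):               # vertical displacement index
        for v in range(3):           # horizontal displacement index
            for r in range(u, u + blocks):
                for c in range(v, v + blocks):
                    counts[r][c] += 1
    return counts

usage = segment_usage()
```

For `n = 16` and `s = 4` the grid is 6 × 6: corner segments are used once, the four central segments are used by all nine search directions, and the counts sum to 9 × 16 = 144 segment accesses, compared with 36 segments when each one is loaded only once.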
IV. ALGORITHM TRANSFORMATION AND ARCHITECTURE
A. Algorithm Transformation
Based on Fig. 4(a) above, we can summarize the overlap between search areas in the following lemma.
Lemma 1: Let and be integers satisfying If then the data dependency of can be expressed as follows.
Case I: For and Case II: For and
Proof: The proof is given in the Appendix.
To illustrate, consider the case of and Depicted in Fig. 6 are the three search areas corresponding to and The overlapped area is the shaded region. It can be verified that the conditions stated in (4)-(5) in Lemma 1 are correct.
As mentioned in Section III, the inner four-level nested Do loop with the being fixed indeed forms a regular dependence graph, and systolic mapping procedures can be applied. Assuming that the 3 × 3 array of search points is to be searched column by column, the pair of indexes and in the original formulation in Fig. 3 can be combined into a composite index using the following transformation formula: Moreover, the pair of indexes and can be transformed into and as summarized in the following lemma, and the search area data dependencies in Lemma 1 can be exploited.
Proof: The proof is given in the Appendix.
This index transformation implies that when pixels of an image block are read into the array structure, they are loaded sequentially according to the following data input ordering scheme. The target block is decomposed into by subregions 3 according to the search step as depicted in Fig. 7, and the individual pixels within each subregion are loaded column by column. Depicted in Fig. 8 is an example when and The number within each subregion depicts which is the input order of image pixels within each subregion.
With the above data input ordering scheme, we proceed to replace in Fig. 3 yields the three-level nested Do loop formulation as depicted in Fig. 9 , where
We use to emphasize that this transformed variable depends on all three indexes: and
B. Systolic Mapping
In order to apply systolic mapping, a computing algorithm must satisfy two conditions, namely, single assignment form and locally recursive dependency. In single assignment form, each variable is assigned a value only once during the execution of the entire algorithm. With locally recursive dependency, the data dependency cannot be a function of the loop iteration bounds. To satisfy these two constraints, we change the variables and respectively, into three-dimensional variables and and impose the following data transmission rules:
Compared to and TAD, the data propagation pattern of is much more complicated. We note that the same pixel in the reference frame may be used several times during the logarithmic search as summarized in Lemma 1. If we can carefully design the input ordering of the pixel stream of the reference frame, much input/output bandwidth may be saved, resulting in more efficient and less costly implementation. This can be accomplished by propagating the reference frame pixels along the proper direction so that the data can be made available without requesting external input/output operations. After careful analysis of the overlaps of the search region for different values of we summarize the result in Corollary 1.
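The potential saving from reusing reference pixels can be illustrated with a small count. The sketch below compares a naive scheme, in which each of the nine candidate blocks fetches its own n × n pixels, with the number of distinct reference pixels touched in one step. The step sizes 4, 2, 1 and the block side 16 follow the three-step example; the function name is hypothetical, and the resulting numbers are only an idealized bound, not the load count of the proposed architecture.

```python
def reference_loads(n=16, steps=(4, 2, 1)):
    """Return (step size, naive loads, distinct pixels) for each search step."""
    rows = []
    for s in steps:
        naive = 9 * n * n                       # nine blocks fetched separately
        distinct = set()
        for dy in (-s, 0, s):
            for dx in (-s, 0, s):
                for i in range(n):
                    for j in range(n):
                        distinct.add((dy + i, dx + j))
        rows.append((s, naive, len(distinct)))  # distinct equals (n + 2s)^2
    return rows

loads = reference_loads()
```

Summed over the three steps, the naive scheme fetches 6912 pixels, whereas only 1300 distinct pixels are involved, which indicates how much bandwidth a suitable input ordering could in principle recover.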
Corollary 1: Let and be integers satisfying Then the data dependency of can be expressed as follows:
Case I: For and (15)
Case II: For and (16)
Proof: The proof is given in the Appendix.
Incorporating (13), (14), and Corollary 1, the logarithmic search BMA can be transformed into a localized, regular iterative formulation as depicted in Fig. 10.
In the above formulation, there are five dependence vectors, and these dependence vectors can be represented in a matrix form with each column representing each dependence vector:
(17)
The first column denotes the dependence vector for variables MV, and while the second and third columns denote dependence vectors for the variable TAD. The next two columns denote the dependence vectors of indicated by (15) and (16) and depicted in Fig. 11. By projecting the three-dimensional dependence graph (Fig. 11) along a projection vector with the so-called default schedule direction we obtain a two-dimensional array structure, as depicted in Fig. 12, consisting of 9 × 16 processing elements, each responsible for a subregion depicted in Fig. 7 for a particular search direction (the index) during the step search.
The two-dimensional array structure requires 16 input ports for loading the 16 subregions within the target area depicted in Fig. 7, and each of these subregions is loaded simultaneously, pixel by pixel. Each pixel loaded is broadcast to the nine processing elements. The search area data are loaded through the 36 input ports simultaneously, and they are routed to the proper processing element for computation. The search area of size in the reference frame is partitioned into segments, each of size pixels pixels as depicted in Fig. 13. For each step the number of segments composes a subregion of image pixels according to the target area subregion pattern depicted in Fig. 7, and each of these subregions is loaded simultaneously pixel by pixel through each input port. At search step 0, since the search area is composed of 6 × 6 subregions, each of size pixels pixels. As such, each pixel within the search area for search step 0 will be loaded only once regardless of whether it is to be used once or nine times. However, at search steps the subregions overlap since the search area size is smaller than Fig. 13 depicts the search area for each search step when and The black dots indicate the base of each subregion, and the area shaded with black depicts the subregion which will be loaded through input port 23 of the two-dimensional array depicted in Fig. 12. As such, the data within the shaded area will be loaded more than once. This brings the total number of pixels to be loaded to the array, for each target block in the current frame, to
Compared to the number this novel approach can alleviate the I/O bandwidth problem.
C. Architectures with Reduced Input Pin Count
In this section, we propose two different data input ordering schemes (Types II and III), and the corresponding architectures Proof: The proof is given in the Appendix. The target block is decomposed into subregions according to the search step as depicted in Figs. 14 and 15 . Again, the individual pixels within each subregion are loaded column by column. Each index transformation procedure exploits the search area data dependencies (4) and (5) in Lemma 1, respectively. The two types of data input ordering schemes are depicted in Fig. 16 where and have been considered. The number within each block depicts which is the input order of image pixels within each subregion.
After careful analysis of the overlaps of the search region for different values of the data propagation pattern of is summarized in the following corollary.
Corollary 2: Let and be integers satisfying Then the data dependency of can be expressed as follows.
For and
Proof: The proof is given in the Appendix. Corollary 3: Let and be integers satisfying Then the data dependency of can be expressed as follows.
For and (27)
Proof: The proof is given in the Appendix.
For each formulation, there are four dependency vectors which can be represented in matrix forms and where the first column denotes the dependency vector for variables MV, and The next two columns denote the dependency vectors for TAD, and the last column denotes the dependency. Furthermore, a one-dimensional systolic array with nine processing elements as in Fig. 19 can be obtained by using a technique called multiprojection, which projects the two-dimensional DG along a projection vector with a default scheduling vector Compared to the Type I architecture, the throughput rate decreases by a factor of 4 while the input pin count for the search area data decreases by a factor of 2. Since the Types II and III architectures exploit only one of the two search area data dependencies in Lemma 1, multiple accesses of the data are inevitable. The numbers of search area data accesses for search step 0 when are compared in Fig. 20. The number within each area indicates the number of data accesses, and the shaded area indicates the subregion which will be loaded through input port 5 for each architecture. An example of the Type III architecture with an input data path for search step 0 when is depicted in Fig. 21.
D. Address Generation
Compared to the FBMA, the data flow of the logarithmic search BMA is less regular. Random-access on-chip local memory [19] can be a feasible solution to overcome the irregular data flow and high input/output memory bandwidth. Fig. 22 depicts a block diagram of the proposed architecture with the random access on-chip local memories when High throughput has been achieved by pipelining the three stages using three modules, each of which handles different target blocks in the current frame in parallel. As such, a throughput rate as high as block per clock cycle can be achieved with the Type I architecture. There are three 8 × 256 RAMs (2 kbits) for the target area data (RRAM) and three 8 × 1024 RAMs (8 kbits) for the search area data (SRAM) in the reference frame when As such, each module imports target area data and search area data from the different RRAMs and SRAMs in a cyclic manner.
Assuming that the image pixels are stored column by column in a raster scan order in both local memories, the target area data addresses (REF) and search area data addresses (SA) of the proposed architectures are generated automatically by Corollary 4. are the input order of image pixels within each subregion as discussed in Lemmas 2-4.
Proof: The proof is given in the Appendix.
Fortunately, the target area data addresses (28), (30), and (32), and the base addresses in (29), (31), and (33) are predictable. Therefore, when the target area data address table can be stored in an 8 × 256 ROM (2 kbits), and the search area data address table can be stored in advance in a 10 × 1152 ROM (12 kbits) and a 10 × 576 ROM (6 kbits) for Types I-III architectures. As such, the address generator calculates the search area data address locally with the base address and motion vector of the previous step. Because is a power of two, at each step, can be calculated recursively using shift operations.
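The idea of forming a search area address from a stored base address plus the motion vector of the previous step can be sketched as follows. This is only an illustration under assumptions that are not stated in the text: pixels are stored column by column with a column stride of 32 entries (so that a 1024-entry memory holds a 32 × 32 window), and the names `COL_SHIFT`, `pixel_address`, and `step_addresses` are hypothetical. It does not reproduce the address formulas (28)-(33).

```python
COL_SHIFT = 5  # assumed column stride of 2**5 = 32 entries

def pixel_address(x, y):
    """Column-major address of pixel (column x, row y), using a shift
    instead of a multiplication because the stride is a power of two."""
    return (x << COL_SHIFT) + y

def step_addresses(base_addresses, mv):
    """Add the displacement found in the previous step to fixed base
    addresses, as a stand-in for the local address calculation."""
    mv_x, mv_y = mv
    delta = (mv_x << COL_SHIFT) + mv_y
    return [base + delta for base in base_addresses]

# Step sizes for a three-step search, each obtained by a right shift.
step_sizes = [4 >> k for k in range(3)]

# Usage: a base address at (7, 7) shifted by the motion vector (4, -4).
shifted = step_addresses([pixel_address(7, 7)], (4, -4))
```

Under these assumptions, the base addresses can be read from a table prepared in advance, and only one addition per address is needed once the previous motion vector is known.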
E. Pipeline Interleaving
In practice, the search range is larger than for a reference block of size For example, according to the Grand Alliance HDTV specification, the architecture should support the larger search range with unequal vertical and horizontal spans in the search area. An example of the search range of to pixels horizontally and to pixels vertically is depicted in Fig. 23(a). The problem of a large search range can be solved by partitioning the search points into six subregions of search points, and each subregion of search points can be handled by combining the logarithmic search and the telescopic search [7] as depicted in Fig. 23(a). Fig. 23(b) depicts the data flow diagram of the pipelined computation using three modules of our proposed architecture. Because both the logarithmic search and the telescopic search are based on the results of the previous search, delays must be inserted to prevent data hazards, which may cause performance degradation. The performance degradation caused by these idle clock cycles can be eliminated by pipeline interleaving because each subregion can be processed independently. Fig. 23(c) depicts the pipeline interleaved data flow diagram of the Type I architecture. Since the latency for the logarithmic search using the Type I architecture is clock cycles when the tasks of the six subregions can be interleaved so that the PE's are utilized 100% of the time. By doing so, each search step can be performed every clock cycles, and the throughput rate of the Type I architecture becomes blocks/clock cycle. Hence, with a clock rate as low as 50 MHz and a search range of to pixels horizontally and to pixels vertically, the
In this paper, a high-throughput modular architecture for the logarithmic search BMA has been proposed. Depending on the data input ordering schemes, we proposed a one-dimensional linear array, two-dimensional Type I, and Type II (Type III) architectures which feature throughput rates of and blocks/clock cycle, respectively.
A comparison of our proposed architecture with the other existing architectures for the two different search methods is presented in Tables I and II. Table II compares the total number of data accesses within the reference area for a target block when and The comparison shows that with the same number of PE's, Jong's and our proposed architecture have an identical throughput rate. However, input pin count and memory bandwidth have been reduced dramatically by exploiting the reference area overlapping pattern. Special data input ordering constraints corresponding to the reference area overlapping pattern have been proposed.
The proposed architecture can handle the large search range with unequal vertical and horizontal spans in the search area with 100% processor utilization by using the technique called pipeline interleaving. As discussed in Section IV-E, the proposed Type I architecture can handle real-time high-volume video processing such as HDTV. However, our proposed architecture requires a rather large number of input ports, which may render this architecture impractical. Architectures with a reduced number of input ports have been proposed in Section IV-C to lessen the burden of large input pin count. Moreover, different matching criteria other than the TAD are being applied to reduce the number of input ports and hardware complexity for low-power applications.
2) (8): Because each subblock is composed of pixels, this can be proved by Fig. 24(b), (c) and Fig. 25(b), (c). is larger than the target block of size As such, the search area data addresses can be calculated by adding the displacement to the fixed offsets recursively.
