Abstract
Introduction
Rapid growth in High-Definition (HD) digital video applications has led to increased interest in portable HD-quality encoder design. An HD-compatible MPEG-2 MP@HL encoder uses Full Search Block Matching Algorithm (FSBMA) based Motion Estimation (ME). The ME module accounts for more than 80% of the computational complexity of a typical video encoder. Moreover, the power consumption of an FSBM-based encoder is prohibitively high, particularly for portable implementations. Hence, efficient ME processor cores need to be designed to realize portable HDTV video encoders.
A parameterizable FSBM ASIC design that solves the input bandwidth problem by using on-chip line buffers was proposed in [15]. [18] proposed a family of modular VLSI architectures which allow sequential inputs but perform parallel processing with 100 percent efficiency. A systolic mapping procedure to derive FSBM architectures was proposed in [4]. The designs of ([2], [20]) and [5] focused on the reduction of pin counts by sharing memory units and by 2-dimensional data reuse, respectively. [19] improved the memory bandwidth by using an overlapped data flow of the search area, which increased the processing element (PE) utilization. A low-latency high-throughput tree architecture for FSBM was proposed in [3]. Both [13] and [1] proposed low-power architectures based on the removal of unnecessary computations. Finally, a novel low-power parallel tree FSBM architecture was proposed in [6], which exploited the spatial data correlations within parallel candidate block searches for data sharing and thus effectively reduced data access bandwidth and power consumption.
[7] proposed an FPGA architecture to implement parallel computation of FSBM. Systolic array and novel OnLine Arithmetic (OLA) based designs for FSBM were proposed in [8] and [9], respectively. Customizable low-power FPGA cores were proposed by [10]. [11] evaluated the performance of the FSBM hardware architectures of [4] implemented on a Xilinx FPGA. The results show that real-time motion estimation for CIF (352 × 288) sequences can be achieved with 2-D systolic arrays and a moderate-capacity (250 k gates) FPGA chip. An adder-tree based 16 × 1 SAD FPGA hardware was implemented by [17].
The aforementioned FSBM architectures can be divided into two categories, namely, FPGA [7, 8, 9, 10, 11, 17] and ASIC [4, 15, 18, 2, 3, 20, 5, 19, 13, 1, 6]. This work uses FPGA technology to implement high-performance ME hardware with due consideration to (a) processing speed and (b) silicon area. Almost all of the aforementioned VLSI architectures optimize only one of these parameters. The novelty of the proposed architecture lies in its combined optimization of these conflicting design requirements. The proposed hardware uses an initially-split pipeline to reduce the processing cycles for each macroblock (MB) and thus increases the throughput. In addition, this design requires fewer adders and only one Absolute Difference (AD) PE, which drastically reduces the silicon area compared to existing designs. The pixels of the search regions have been organized in memory banks such that two sets of 128-bit (sixteen 8-bit pixels) data can be accessed in each clock cycle. Section 2 gives an overview of FSBM-based motion estimation. Section 3 presents a brief discussion on SAD modifications and describes the proposed FSBM hardware. The implementation and comparative results are presented in Section 4. Section 5 presents a reconfigurable address generator. Finally, Section 6 concludes this paper.
FSBM-based Motion Estimation
Motion-compensated video compression models the pixel motion within the current picture as a translation of those within a previous picture. The motion vector is obtained by minimizing a cost function measuring the mismatch between the current MB in the current frame and the candidate block in the reference frame. The Sum of Absolute Differences (SAD), the most popular cost function, between the pixels of the current MB x(i, j) and the search region y(i, j) can be expressed as,
$$\mathrm{SAD}(u, v) = \sum_{i=1}^{N} \sum_{j=1}^{N} \left| x(i, j) - y(i + u, j + v) \right| \qquad (1)$$
where (u, v) is the displacement between these two blocks. Thus, each search location requires $N^2$ absolute differences and $(N^2 - 1)$ additions. The FSBMA exhaustively evaluates all possible search locations and hence is optimal in terms of reconstructed video quality and compression ratio. Its high computational requirements, regular processing scheme and simple control structures make a hardware implementation of FSBM a preferred choice. The execution profile of a standard video encoder, obtained using the GNU gprof tool, is shown in Table 2. The table shows that motion estimation is the most computationally expensive module in a typical video encoder. In addition, SAD computations take the most time because of the absolute-value operation and the many additions that follow it.
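As a software reference for the SAD cost function and the exhaustive search described above, the following minimal sketch may be helpful. The function names, the row-then-column meaning of (u, v), and the storage of the search region with the zero-displacement block at offset (p, p) are assumptions made for illustration; they are not taken from the proposed hardware.

```python
def sad(cur, region, top, left, n):
    """Sum of absolute differences between an n x n block and one candidate."""
    return sum(
        abs(cur[i][j] - region[top + i][left + j])
        for i in range(n)
        for j in range(n)
    )


def full_search(cur, region, n, p):
    """Evaluate all (2p + 1)^2 candidates and return (best_sad, (u, v))."""
    best = None
    for u in range(-p, p + 1):          # assumed: u is the vertical displacement
        for v in range(-p, p + 1):      # assumed: v is the horizontal displacement
            # The region is (n + 2p) pixels wide; displacement (0, 0) sits at (p, p).
            cost = sad(cur, region, p + u, p + v, n)
            if best is None or cost < best[0]:
                best = (cost, (u, v))
    return best


# Tiny example: a 2 x 2 block inside a 4 x 4 search region (n = 2, p = 1).
region = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
block = [[11, 12], [15, 16]]
best_sad, best_mv = full_search(block, region, n=2, p=1)
```

In this toy case the block matches the bottom-right corner of the region exactly, so the search returns a SAD of 0 at displacement (1, 1). Every candidate costs the full set of absolute differences, which is the workload the hardware is designed to reduce.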
Proposed FSBM Architecture
In this section we delineate our proposed speed-area optimized FSBM architecture. The first subsection briefly explains the SAD modification and the MB searching technique. The subsequent subsections describe the proposed hardware and the memory organization.
SAD modification
This section presents a modification to SAD computation. The SAD expression in Eq. 1 can be re-written as,
The detailed proof of the above derivation can be found in [12] . Again, it can be posited that, if,
where $\mathrm{SAD}_{\min}$ denotes the current minimum SAD value. Thus, if Eq. 3 is satisfied, the SAD computation at the (u, v)th location may be skipped. In addition, if X(u, v) is the sum of pixel intensities at the (u, v)th MB location, then this sum can be derived from X(u − 1, v) by subtracting and adding the intensity sums of the columns at specific positions. Based on this fact, [12] proposes a search strategy to efficiently derive and compute the MB sums at successive locations. The MB search technique used in our proposed design adopts this particular approach.
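The incremental derivation of block sums can be illustrated in software. The sketch below assumes a horizontal step of one pixel and uses hypothetical helper names; the exact scan order of [12] is not reproduced here. Only the first location needs a full sum, and each later sum costs one subtraction and one addition of column sums.

```python
def column_sums(region, top, n):
    """Sum of n vertically adjacent pixels in every column, starting at row `top`."""
    width = len(region[0])
    return [sum(region[top + i][c] for i in range(n)) for c in range(width)]


def block_sums_along_row(region, top, n):
    """Block sums for every horizontal position of an n x n window at row `top`."""
    cols = column_sums(region, top, n)
    total = sum(cols[:n])               # full sum only for the first location
    sums = [total]
    for left in range(1, len(cols) - n + 1):
        # Slide right by one: add the entering column, drop the leaving one.
        total += cols[left + n - 1] - cols[left - 1]
        sums.append(total)
    return sums


region = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
row_sums = block_sums_along_row(region, top=0, n=2)
```

For the 4 × 4 example, the three 2 × 2 windows in the top strip have sums 14, 18 and 22, which matches a direct summation of each window.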
Pipelined SAD Operator
The SAD hardware for FSBMA has been divided into eight independent sequential steps. It computes the initial full SAD for the first Search Location (SL) and derives the SAD sums for subsequent SLs. 
Memory Organization
Our design adopts the MB scanning technique proposed in [12]. The pixels in the p = 16 search region are represented by $P_{i,j}$, where 0 ≤ i ≤ 48 and 0 ≤ j ≤ 48 (shown in Fig. 3). This search region has (2p + 1) (Fig. 3), e.g., [$P_{1,1}$, $P_{2,1}$, $P_{3,1}$, ..., $P_{16,1}$] is one such 128-bit data, which belongs to column 1 of the search region. It is observed that one of the columns numbered 17 to 32 is accessed concurrently with another column from the remaining columns, i.e., 1 to 16 and 33 to 48, of the pre-defined search region. Therefore, the pixels have been organized in two different memory banks, as shown in Fig. 2. The data in these memory banks are organized in column-major format so that a whole column can be read in a single memory access. The memory controller generates the correct address at every clock cycle for both memory banks. The selected 384 bits (48 pixels of a single column of Fig. 3) of each bank are then multiplexed and the correct 16 pixels are passed on to the SAD processing unit.
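The two-bank split can be checked with a short sketch. It assumes 1-based column numbers and that a horizontal step of a 16-column window drops column s and brings in column s + 16; the bank indices and function names are hypothetical. Under these assumptions the two columns needed in one step never fall in the same bank, so both can be read in the same clock cycle.

```python
def bank_of(column):
    """Bank index for a 1-based search-region column (1..48)."""
    if not 1 <= column <= 48:
        raise ValueError("column outside the 48-column search region")
    # Columns 17..32 go to one bank; columns 1..16 and 33..48 go to the other.
    return 0 if 17 <= column <= 32 else 1


def columns_for_step(start):
    """Columns leaving and entering a 16-wide window moving from `start` to `start + 1`."""
    return start, start + 16


# A 16-wide window can start at columns 1..33, so there are 32 horizontal steps.
conflict_free = all(
    bank_of(leaving) != bank_of(entering)
    for leaving, entering in map(columns_for_step, range(1, 33))
)
```

The check holds for all 32 steps: a leaving column in 1..16 pairs with an entering column in 17..32, and a leaving column in 17..32 pairs with an entering column in 33..48.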
When the search location is moved down from the previous position, two sets of row pixels need to be accessed. This is not possible in one clock cycle with the memory banks organized as above. It is easily observed from Fig. 3 that either the first 16 pixels or the last 16 pixels of a single row have to be accessed for this purpose. It is also to be observed that, for the even row number, the first 16 Fig. 2 .
In order to reduce the total number of memory accesses in an FSBM-based architecture, data reuse can be performed at four different levels [14]. Our on-chip memory bank organization adopts the data reuse defined as Level A and Level B. Level A describes the locality of data within a candidate block strip, where the search locations move within the block strip. Level B describes the locality among candidate block strips, as vertically adjacent candidate block strips overlap. In our design, this memory organization is primarily based on the use of Look-Up Tables (LUTs) in the FPGA implementation.
Performance Analysis
This section presents the implementation results of the proposed hardware. Subsequently, it compares the obtained results with other existing FPGA-based designs.
Implementation Results
The proposed design has been implemented in Verilog HDL and verified with RTL simulations using Mentor Graphics ModelSim SE. The Verilog RTL has been synthesized on a Xilinx Virtex IV 4vlx100ff1513 FPGA. The synthesis results show that the design requires 333 CLB slices, 416 DFFs/latches and a total of 278 input/output pins. The area of the implementation is 380 look-up tables (LUTs) and the highest achievable frequency is 221.322 MHz.
The pipelined design takes 23 clock cycles to produce the first SAD value. Thereafter, one SAD value is generated in every cycle. A search range of p = 16 has $(2p + 1)^2 = 1089$ search locations. So for a search range of p = 16, the number of cycles required by our hardware to find the best matching block is 23 (for the first search location) + (1089 − 1) (for the remaining search locations) = 1111 cycles.
Our FPGA implementation works at a maximum frequency of 221.322 MHz (4.52 ns clock cycle). Hence, the FPGA implementation can process an MB (16 × 16) in 5.022 µs (1111 clock cycles per MB × 4.52 ns per clock cycle = 5.022 µs) and a 720p HDTV (1280 × 720) frame in 18.078 ms (3600 MBs per frame × 5.022 µs per MB = 18.078 ms). At this speed, the proposed hardware can process 55.33 720p HDTV frames per second. This is a substantial improvement over the other approaches, which process far fewer frames per second, as is evident from Table 2. The high speed and throughput of our design are mainly due to the modified SAD operation and the split pipeline design of the proposed architecture.
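The timing figures above follow from a few lines of arithmetic, reproduced here as a sketch. The constant names are illustrative; the values are the ones quoted in this section.

```python
P = 16                     # search range
MB_SIZE = 16               # macroblock is 16 x 16 pixels
PIPELINE_LATENCY = 23      # cycles until the first SAD value appears
F_MAX_HZ = 221.322e6       # maximum operating frequency
WIDTH, HEIGHT = 1280, 720  # 720p HDTV frame

search_locations = (2 * P + 1) ** 2                         # 1089
cycles_per_mb = PIPELINE_LATENCY + (search_locations - 1)   # 1111
mbs_per_frame = (WIDTH // MB_SIZE) * (HEIGHT // MB_SIZE)    # 3600

seconds_per_mb = cycles_per_mb / F_MAX_HZ
seconds_per_frame = seconds_per_mb * mbs_per_frame
frames_per_second = 1.0 / seconds_per_frame
```

Using the unrounded clock period gives about 5.02 µs per MB and roughly 55.3 frames per second, in line with the figures quoted above; the small differences in the last digit come from rounding the clock period to 4.52 ns.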
Performance Comparison
This subsection compares the hardware features and performance of the proposed design with existing FPGA architectures. No comparison has been made with available ASIC solutions. Table 4.1 compares the hardware features of the proposed and existing FPGA solutions for a macroblock (MB) of size 16 × 16 and a search range of p = 16. As can be seen, our design consumes fewer cycles per MB and has the highest maximum operating frequency. The splitting of the initial stage of the pipeline facilitates this high speed. The area required in terms of CLB slices and the hardware complexity in terms of AD PEs (Absolute Difference Processing Elements), adders and comparators are much lower for the proposed architecture. The modification of the SAD operation contributes to the high speed, the smaller area and the lower hardware complexity. The use of memory banks has led to higher on-chip bandwidth. However, it has also led to the only drawback of our design, which is the high number of input/output pins.
A performance comparison of the various architectures is also shown in Table 4.1. In order to compare the speed-area optimized performance of the different architectures, a new performance criterion, throughput/area, has been used. The higher the throughput/area value of a design, the better the speed-area optimization of the architecture. The architectures have been compared in terms of (a) the number of HDTV 720p (1280 × 720) frames that can be processed per second, (b) the throughput, i.e., MBs processed per second, (c) throughput/area, and (d) the I/O bandwidth. As can be seen, the proposed design has a very high throughput and can process the maximum number of HDTV 720p frames per second (fps). Moreover, the superior speed-area optimization of the proposed design is exhibited by its substantially higher throughput/area value of 598.2.
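The throughput/area figure can be reproduced from the numbers reported in the previous subsection. The sketch below assumes that area is counted in CLB slices; this is an inference that happens to match the quoted value of 598.2, and the units are not stated explicitly in the text.

```python
F_MAX_HZ = 221.322e6   # maximum operating frequency
CYCLES_PER_MB = 1111   # cycles to search one macroblock at p = 16
CLB_SLICES = 333       # assumed area unit for the throughput/area metric

mbs_per_second = F_MAX_HZ / CYCLES_PER_MB             # throughput in MBs per second
throughput_per_slice = mbs_per_second / CLB_SLICES    # throughput/area
```

With these inputs the throughput is roughly 199,000 MBs per second and the ratio rounds to 598.2 MBs per second per slice.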
Reconfigurable Block Matching Hardware
Apart from using the full pattern, block matching can also be performed by using N-queen decimation patterns. It has been shown [16] that the N-queen patterns have similar PSNR drop but yield much faster encoding performance as compared to the full pattern, particularly for N = 4 and N = 8. This section presents a reconfigurable hardware design to find the minimum SAD value by selecting any one of the full-search, 8-queen or 4-queen decimation techniques. To the best of our knowledge, no similar hardware design exists in the literature.
For both the 4-queen and 8-queen decimation techniques, the pixels being processed for two consecutive SAD-based block matching operations are mutually independent. This fact can be utilized to further enhance the performance of the SAD operator discussed in Section 3. Only the memory organization and the address generation at each clock cycle differ among the three decimation patterns. It has been observed that the reconfigurable address generator and SAD operator require only 40% and 2% extra hardware cost, respectively, compared to the full-pixel architecture proposed above.
The reconfigurable address generator uses a common datapath. Two consecutive addresses are represented by their respective bit value differences. For each decimation technique, the bit values are toggled following predefined patterns. Bit toggling of the 8-bit address lines is controlled by their respective enable signals, which are generated by dedicated controller logic. This state-machine-based controller generates the enable signals depending on a 2-bit decimation mode select input. The pipelined datapath shown in Fig. 1 can also be reconfigured according to the user-specified decimation mode. In the case of 8-queen decimation on a 16 × 16 block, 32 pixel values are added at every clock cycle by both halves of pipeline stages one to five. The resultant value is directly used to perform the absolute difference with the MB to calculate the current SAD value. The same datapath of the pipelined SAD operator also performs the SAD calculation for 4-queen decimation. This technique requires 64 pixels for each SAD value for a 16 × 16 block. So, the pipeline is reconfigured such that both of its halves from stage one to five, together with stage six, are used to perform the addition of these 64 pixel values. Subsequently, it performs the sum of absolute differences to obtain the new SAD.
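The pixel counts quoted above (64 for 4-queen and 32 for 8-queen on a 16 × 16 block) can be checked with a small sketch of N-queen decimation. The two placements below are merely one valid solution each, chosen for illustration; they are not necessarily the patterns used in [16] or in the proposed hardware. The SAD shown is the textbook per-pixel form on the sampled positions, not the pipelined datapath.

```python
def is_queen_pattern(cols):
    """True if cols[r] places one non-attacking queen in each row r."""
    n = len(cols)
    return (
        sorted(cols) == list(range(n))
        and len({r - c for r, c in enumerate(cols)}) == n
        and len({r + c for r, c in enumerate(cols)}) == n
    )


def decimation_mask(cols, block=16):
    """Tile an N-queen pattern over a block x block macroblock."""
    n = len(cols)
    return [
        (r, c)
        for r in range(block)
        for c in range(block)
        if cols[r % n] == c % n
    ]


def decimated_sad(cur, cand, mask):
    """SAD restricted to the sampled pixel positions."""
    return sum(abs(cur[r][c] - cand[r][c]) for r, c in mask)


FOUR_QUEEN = [1, 3, 0, 2]               # one valid 4-queen placement (assumed)
EIGHT_QUEEN = [0, 4, 7, 5, 2, 6, 1, 3]  # one valid 8-queen placement (assumed)

mask4 = decimation_mask(FOUR_QUEEN)
mask8 = decimation_mask(EIGHT_QUEEN)

# Flat test blocks: every sampled pixel differs by 3.
cur = [[10] * 16 for _ in range(16)]
cand = [[7] * 16 for _ in range(16)]
sad4 = decimated_sad(cur, cand, mask4)
sad8 = decimated_sad(cur, cand, mask8)
```

The masks contain 64 and 32 positions respectively, so the flat test blocks give decimated SAD values of 192 and 96, compared with 768 for the full 256-pixel pattern.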
Conclusions
This paper has presented an FPGA-based design for the Full Search Block Matching Algorithm. The novelty of this design lies in its modified SAD calculation and in its split-pipelined design for parallel processing in the initial stages of the hardware. The macroblock search scan has also been suitably altered to facilitate the derivation of SAD sums from previously computed results. Compared to existing FPGA architectures, the proposed design exhibits superior performance in terms of high throughput and low hardware complexity. The high frame processing rate of 55.33 fps makes this design particularly useful in both frame and field processing for HDTV-based applications. Finally, the paper outlines reconfigurable block matching hardware that could be useful in a general-purpose real-time video processing unit.
