Abstract-This paper presents a flexible and scalable motion estimation processor capable of supporting the processing requirements for high-definition (HD) video using the H.264 Advanced Video Codec, which is suited for FPGA implementation. Unlike most previous work, our core is optimized to execute all existing fast block matching algorithms, which we show to match or exceed the inter-frame prediction performance of traditional full-search approaches at the HD resolutions commonly in use today. Using our development tools, such algorithms can be described using a C-style syntax which is compiled into our custom instruction set. We show that different HD sequences exhibit different characteristics which necessitate a flexible and configurable solution when targeting embedded applications. This is supported in our core and toolset by allowing designers to modify the number of functional units to be instantiated. All processor instances remain binary compatible so recompilation of the motion estimation algorithm is not required. Due to this optimization process, it is possible to match the processing requirements of the selected motion estimation algorithm to the hardware microarchitecture leading to a very efficient implementation.
A flexible, reconfigurable, and programmable motion estimation processor, such as the one proposed in this work, is well poised to address these challenges by allowing the core microarchitecture to be optimized alongside the estimation algorithm. This concept was briefly introduced in [2] and has been further developed and improved upon in this work to consider the advanced features of H.264: fractional-pixel search, variable partition sizes and rate-distortion optimization. The contributions of this paper are given here:
1) An analysis of full search and fast motion estimation algorithm performance to motivate the need for configurable and programmable hardware.
2) A flexible and configurable fast motion estimation processor and instruction-set architecture to scale the hardware to the processing needs of the search algorithm.
3) An open-source toolset composed of a cycle-accurate model, compiler and analysis software to guide the designer to an optimal implementation.
4) An evaluation of processor performance over a combination of algorithm and hardware configurations.
5) A physical measurement of the processor power consumption when implemented on modern state-of-the-art reconfigurable FPGAs.
This paper is organized as follows. Section II reviews relevant work in the field of hardware architectures for motion estimation, with a focus on reconfigurable and programmable solutions. Section III motivates the work presented by showing the importance of different motion estimation options and algorithms in modern high-definition video sequences. Section IV presents the programming model and tools developed to explore the software/hardware design space for advanced motion estimation. Section V describes the processor microarchitecture details and Section VI analyzes the complexity/performance/power of the proposed solution. Finally, Section VII concludes this paper.
II. MOTION ESTIMATION HARDWARE REVIEW
Full-search exhaustively searches for the motion vector that minimizes a criterion, such as the Sum of Absolute Differences (SAD), within a predefined search range. The SAD is the most common quality metric used in inter-frame prediction because it is simple to calculate and efficient to implement in hardware. Full search has traditionally been preferred for hardware implementations due to its regular dataflow, which makes it well suited to architectures using 1-D or 2-D systolic array principles; these require only simple control and can achieve high hardware utilization, as seen in [3].
Full-search architectures also have the added benefit of being able to implement SAD reuse strategies that make them especially suited to support the variable block sizes used in H.264 [4]. By combining the SADs for smaller block sizes into the larger sizes, only small increases in gate count are required over their conventional fixed-block counterparts, with little bearing on their throughput, critical path, and memory bandwidth. On the other hand, the hardware requirements needed to obtain enough parallelism to check all the possible search points in real time are very large [5]. Serial architectures can be used to reduce these requirements at the expense of throughput, as shown in [6]. This becomes even more challenging if large search ranges, rate-distortion optimization (RDO), and fractional-pel search are considered. Previous work [1] has shown that these options are required to obtain high-quality motion estimation and optimal rate-distortion performance.
A recent example of a high-performance, integer-only, full-search architecture is presented in [7]. This work considers a relatively large search range of 63 × 48 pixels and can be scaled by varying the number of pixel processing units. A configuration using 16 pixel processing units can support 62 frames per second (fps) at a video resolution of 1920 × 1080, when clocked at 200 MHz. Each pixel processing unit is assigned to a different macroblock and obtains 41 motion vectors (all block sizes) in parallel. By working on 16 adjacent macroblocks of 16 × 16 pixels in parallel, data reuse can be exploited. This architecture reported a usage of around 154 K LUTs when implemented in a Virtex-5 FPGA. However, this work misses important motion estimation options such as Lagrangian-based RDO [8], and does not support fractional-pixel search. In Lagrangian-based RDO, the cost of each possible motion vector is added to the SAD using Lagrangian multipliers, to form the deciding cost. This technique has the potential of improving image quality considerably, as will be illustrated in Section III. The hardware in [7] calculates 41 × 16 motion vectors in parallel, so the hardware complexity required to obtain the additional motion vector costs needed for RDO will be considerable in this architecture.
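The Lagrangian deciding cost described above can be sketched in a few lines. This is a minimal illustration only: the rate model `mv_bits` (signed Exp-Golomb code length of each component of the vector difference) and the multiplier `lam` are assumptions made for the example, and are not taken from [7] or [8].

```python
def sad(cur, ref):
    """Sum of absolute differences between two equally sized pixel blocks."""
    return sum(abs(c - r)
               for row_c, row_r in zip(cur, ref)
               for c, r in zip(row_c, row_r))

def se_golomb_bits(v):
    """Bit length of the signed Exp-Golomb code for v (assumed rate model)."""
    code_num = 2 * abs(v) - (1 if v > 0 else 0)
    return 2 * (code_num + 1).bit_length() - 1

def mv_bits(mv, pred):
    """Approximate bits needed to code mv relative to the predicted vector."""
    return se_golomb_bits(mv[0] - pred[0]) + se_golomb_bits(mv[1] - pred[1])

def rd_cost(sad_value, mv, pred, lam):
    """Deciding cost: distortion plus lambda-weighted motion vector rate."""
    return sad_value + lam * mv_bits(mv, pred)

# A vector close to the prediction can win despite a slightly larger SAD.
near = rd_cost(105, mv=(1, 0), pred=(0, 0), lam=4)
far = rd_cost(100, mv=(9, -7), pred=(0, 0), lam=4)
```

With these illustrative numbers, `near` is lower than `far`, which is the smoothing effect on the motion field that the Lagrangian term is meant to produce.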
In an effort to reduce the complexity of inter-frame prediction, "fast" ME algorithms have been proposed as seen in [9] . The challenges a designer faces for these algorithms include unpredictable data flow, irregular memory access, low hardware utilization, and sequential processing. Fast ME approaches use a number of techniques to reduce the number of search positions and this inevitably affects the regularity of the data flow, eliminating one of the key advantages that systolic arrays have: their inherent ability to exploit data locality for reuse. This is evident in the work done by [10] that compares a number of fast-motion algorithms when mapped onto systolic arrays, discovering that the required memory bandwidth scales down much more slowly than the number of operations executed.
A number of architectures have also been proposed that follow the programmable approach by deferring the algorithm definition from design-time to run-time. The application-specific instruction-set processor (ASIP) presented in [11] uses a specialized data path and a minimum instruction set similar to our own work. The instruction set consists of only eight instructions operating on a RISC-like, register-register architecture designed for low-power devices. There is the flexibility to execute arbitrary block matching algorithms, and the basic SAD16 instruction computes the difference between two sets of sixteen pixels, which in the proposed microarchitecture takes sixteen clock cycles to complete using a single 8-bit SAD unit. An implementation using a standard-cell 0.13-µm ASIC technology shows that this processor enables real-time motion estimation for QCIF (176 × 144) at 12.5 MHz to achieve low power consumption. An FPGA implementation using a Virtex-II Pro device is also presented with a complexity of 2052 slices and a clock of 67 MHz. In this work, scaling can only be achieved by varying the width of the SADU (ALU-equivalent for calculating SADs) but, due to its design, the maximum parallelism that can be achieved in one clock cycle is limited to the width of one macroblock at 16 pixels, by using 256-b single-instruction multiple-data (SIMD).
This programmable concept is taken one step further in [12]. This core is also oriented to fast motion estimation, supporting subpixel interpolation and variable block sizes. Interpolation is done on demand using a simplified nonstandard filter, which will cause a mismatch between its output and any standard-compliant decoder. The core implements an early-termination optimization technique (described later) to save on computation, but again does not support Lagrangian-based RDO. Scalability is limited since only a single functional unit is available, although the processor can implement algorithm-specific instructions to improve performance. Their SAD instruction is also comparable to our own pattern instruction and operates on 16 pixel pairs simultaneously, requiring 16 instructions to compute each macroblock search point, taking up to 20 clock cycles to complete all 16 points. The processor uses 2127 slices in an implementation targeting a Virtex-II device with a maximum clock rate of 50 MHz. This implementation can sustain processing of 1024 × 768 frames at 30 fps. Xilinx has also developed a processor capable of supporting high-definition 720-p video at 50 fps [13], operating at 225 MHz in a Virtex-4 device for a throughput of 200 000 macroblocks per second. Their implementation uses a total of around 3000 LUTs, 30 DSP48 embedded blocks, and 19 BlockRAM resources. The search algorithm is fixed and based on a full search of a 4 × 3 region around ten user-supplied initial predictors for a total of 120 candidate positions, chosen from a search area of 112 × 128 pixels. The core contains a total of 32 SAD engines which, for each given motion vector candidate, continuously compute the 12 search positions that surround it.
Limitations identified in these previous FPGA-based fast ME solutions include the lack of support for full high-definition 1080p due to limited architecture scalability; no true fractional-pel search, which is either not supported or is based on nonstandard interpolation features; and no support for Lagrangian-based RDO.
III. CASE FOR FAST MOTION ESTIMATION HARDWARE
Most of the available literature indicates that full-search algorithms deliver the best performance in terms of PSNR and bit rate when compared with fast motion estimation algorithms. However, the research in papers such as [14], [15] suggests that a well-designed fast block matching algorithm can not only speed up the motion estimation process, but also improve the rate-distortion performance in state-of-the-art video codecs such as H.264. This is because the introduction of motion vector candidates obtained from neighboring macroblocks as initial search positions, and the use of early termination techniques, tend to produce smoother motion vectors with smaller deltas between the predicted and the selected vectors, in turn translating into fewer bits needed to code them. This can produce better results than full-search algorithms that simply check all possible motion vectors available in the search range. Fig. 1(a)-(c) explore the rate-distortion performance of three 1080-p high-definition sequences obtained from [16] with varying degrees of motion complexity (high in Crowdrun, medium in Pedestrian, and low in Sunflower) collected using a modified version of x264 (a high-quality open-source coding implementation for H.264) [17]. The fast ME algorithm selected is the popular hexagon-based fast search strategy for integer search followed by a diamond-based search for fractional search, as available in x264. The search area has been increased to 112 × 128 pixels as used in our own core, and full-search at the integer-pel level is shown for comparison. The full-search algorithm considered, as in the established case, works by checking all points in the search space and selects the vector with the lowest SAD without applying Lagrangian RDO. It can be seen that fast integer search without RDO (int) performs as well as, if not slightly better than, full-search (full) over all three HD sequences.
The Pedestrian and Sunflower sequences also show that enabling the Lagrangian optimization for fast motion integer-pel int(rdo) is beneficial, with approximately 1 dB gain for the same bit rate. This, however, is not the case for the Crowdrun sequence that contains more local motion components and generates "noisier" motion vectors. Algorithm frac(rdo) adds a fractional-pel refinement based on a diamond search to the int(rdo) algorithm, and it can be seen that this also outperforms the integer-pel mode in all sequences, improving PSNR by up to 2 dB in the Crowdrun sequence. Finally, algorithm frac(rdo,4 × 4) adds subblocks to algorithm frac(rdo). For the Pedestrian sequence, using subblocks offers negligible benefit, and actually degrades performance for Sunflower. The reason for this is that, since the motion complexity is lower in these sequences, it can be captured sufficiently using a single large block, whilst any gains made by subblocks are cancelled out by the extra overhead of coding more motion vectors. However, in high activity sequences such as Crowdrun, these finer subblocks can provide a benefit of 0.5 dB at the same bit rate.
From this analysis, it can be concluded that different video sequences benefit differently from the various options available during motion estimation. This presents a strong case for using reconfigurable and programmable hardware to better adapt to the requirements of motion estimation in advanced video coding.
IV. INSTRUCTION SET AND PROGRAMMING MODEL
Fast motion estimation algorithms have not been standardized, and multiple tradeoffs between algorithm complexity and video quality exist, making a programmable architecture beneficial. The following sections present the microarchitecture and programming model of the hardware/software solution developed according to the principles of configurability and programmability, which we have named LiquidMotion. 
A. Implementation Overview
Our implementation flow is separated into two stages: compile-time and run-time, which are outlined as follows.
At compile-time, the configuration is driven by a pre-existing set of constraints in terms of power, area, and performance. For example, given a power (mW available) or area budget (in terms of available logic resources) for motion estimation only, the designer can first create a processor configuration that meets this constraint. Once this architecture has been generated, a software algorithm can then be adapted to meet the performance constraint (required frame rate) with respect to the available hardware. Several iterations of this optimization process may be required to converge on the most efficient hardware tradeoffs to support the required software algorithm.
During run-time, the hardware architecture can be adapted on-the-fly to the amount of motion in the scene, for example, by reducing the clock frequency to reduce dynamic power consumption, or by exploiting the advanced partial reconfiguration features available in modern FPGA devices, such as the Xilinx Virtex-5 and Virtex-6 series, to activate and deactivate processor features for power savings. As the amount of motion in the sequence changes (detectable by analysing the length of the motion vectors) the hardware can be dynamically reconfigured, for example, during the intra-coding of a keyframe in which the motion processor is idle.
B. LiquidMotion Instruction Set Architecture (ISA)
The instruction set should be able to express the inherent parallelism available in the motion estimation algorithm in a simple way to minimize the overheads for instruction fetch and decode and to keep the execution units of the core as busy as possible. The number of execution units available in the proposed processor varies depending on the implementation, so it is important that binary compatibility between different hardware implementations is achieved, meaning a program only needs to be compiled once to be executable on any configuration. The instruction set architecture consists of a total of nine different instructions, as illustrated in Fig. 2.
There are two arithmetic instructions for integer and fractional pattern searches, a total of six control instructions that change the program flow and one mode instruction that sets the partition mode (subblocks) and reference frame to be used for the arithmetic instructions. The arithmetic instructions exploit the most obvious form of parallelism available, data-parallelism at the search point level. For example, in a simple diamond pattern, there are four points that can be calculated in parallel if enough execution resources have been implemented. The arithmetic instructions express this parallelism with two fields that identify the number of points used by the pattern and the position in the point memory where the offsets for that pattern are defined. The control unit can then execute the instruction with a parallelism level that ranges from issuing each of these points to a different execution unit in a fully parallel hardware configuration, to issuing each point to the same execution unit in the base hardware configuration. The same approach applies to fractional instructions. The set mode instruction is used to change the active partition mode and reference frame of the core and configures the internal control logic to operate with different address boundaries and data sources.
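The way a single pattern instruction can run unchanged on different hardware configurations can be sketched as follows. The function and parameter names are hypothetical, and the assumption that points are handed out in consecutive batches of point-memory addresses is ours; it is meant only to illustrate the range from fully parallel to fully serial issue described above.

```python
def issue_rounds(first_point, num_points, num_eus):
    """Split the point-memory addresses of one pattern instruction into rounds.

    first_point: point-memory position of the pattern (instruction field)
    num_points:  number of points in the pattern (instruction field)
    num_eus:     execution units instantiated in this hardware configuration
    Each round holds at most num_eus addresses, one per execution unit.
    """
    addrs = list(range(first_point, first_point + num_points))
    return [addrs[i:i + num_eus] for i in range(0, num_points, num_eus)]

# The same four-point diamond instruction on two configurations.
fully_parallel = issue_rounds(0, 4, num_eus=4)  # one round of four points
base_config = issue_rounds(0, 4, num_eus=1)     # four rounds of one point
```

Because only the number of rounds changes with `num_eus`, the encoded instruction itself does not need to change, which is the property that makes binary compatibility possible.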
There are a total of thirty-two 32-b registers available. These registers include the command register, motion vector candidate registers, results registers, and profiling registers. The motion vector candidate registers are used to store motion vectors supplied by the user from surrounding macroblocks or from macroblocks in different frames. If the core is configured with motion vector candidate support, these vectors are loaded into the current motion vector automatically and their costs calculated before the main algorithm runs. The set mode instruction can be used to set the total number of candidate vectors available. This is necessary because when the core processes the corners and sides of a frame, some motion vector candidates may be unavailable.
The core has no instructions to access external memory, relying instead on an external DMA engine to move the reference frame data and current macroblock data into its internal memory before processing starts. At the beginning of each row of macroblocks, this external DMA engine moves the full 7 × 8 macroblock search area into the internal memories, for a search range of 112 × 128 pixels. For each following macroblock in the same row, only the rightmost 1 × 8 column needs to be loaded and this can take place in parallel with data processing as explained in Section V-C. Once the input data is ready, processing can be started by writing to the command register. Advanced motion estimation techniques, such as those using adaptive thresholds, can be implemented by modifying the program memory contents directly by inserting modified immediate field contents into the compare instructions.
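The saving obtained by reloading only one macroblock column can be quantified with a short calculation. The sketch assumes one byte per pixel (8-bit luma samples only) and 64-b transfers, as used by the prototype bus in Section V-C; the function name is illustrative.

```python
MB = 16             # macroblock edge in pixels
BYTES_PER_WORD = 8  # one 64-b transfer

def words_to_load(mb_cols, mb_rows):
    """64-b words needed for a region of mb_cols x mb_rows macroblocks,
    assuming one byte per pixel."""
    pixels = (mb_cols * MB) * (mb_rows * MB)
    return pixels // BYTES_PER_WORD

full_area = words_to_load(7, 8)   # start of each macroblock row
one_column = words_to_load(1, 8)  # every following macroblock in the row
```

Under these assumptions the full search area needs 1792 word transfers, whereas each subsequent macroblock needs only 256, a sevenfold reduction in reference traffic.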
C. LiquidMotion Programming Model and Design Flow
The processor offers a simple programming model so that a motion estimation algorithm programmer can access the functionality of the hardware without detailed knowledge of the microarchitecture. Our toolset is composed of a compiler, a cycle accurate simulator, and analysis software to enable the programmer to test different motion search techniques before deciding on the one that obtains the required quality of results in terms of rate-distortion performance and macroblock or frames per second throughput. At this point the programmer can instruct the tools to generate an RTL configuration file for the processor. Commercial synthesis tools such as Xilinx ISE or Synplify can then be used to process this configuration file together with the LiquidMotion RTL hardware library and generate a hardware netlist and FPGA bitstream with the correct number and type of execution units to match the software requirements.
This design flow is illustrated in Fig. 3 . This scalable architecture can be easily programmed using an intuitive C-style language we have termed EstimoC. EstimoC is a high-level language that is powerful enough to express a broad range of motion estimation algorithms in a natural way. EstimoC code can be written in the embedded editor, or any other compatible editor, and is interpreted by the EstimoC compiler. The language has a natural syntax with elements from C and special structures for the development of motion estimation algorithms. Typical constructs such as for and while loops and if-else statements are supported. The algorithm designer can use these constructs to create arbitrary block-matching motion estimation algorithms ranging from the classical full search to advanced algorithms such as unsymmetric-cross multihexagon-grid (UMH) [18] . Part of the language is dedicated to the preprocessor and other parts are for the core decoding unit.
The preprocessor is a crucial part of the compiler because it provides syntax facilities for the development of sophisticated algorithms. For example, EstimoC provides two ways to specify the search patterns: using a static pattern specification as in pattern(hexagon) {pattern instructions} or through using dynamic pattern generation. In this second case, the programmer writes a sequence of simple check instructions in the form check(x,y); followed by the update; syntax. The algorithm in the sample code corresponds to a four-point diamond pattern followed by a full-search fractional-pel refinement, which also illustrates that it is possible to implement exhaustive search approaches if they are required.
The example starts by setting an initial step size of 8 that defines the size of the diamond. An initial check is done at the center point (defined by the motion vector loaded in the motion vector candidate register or zero if none available) and the 4-point diamond surrounding it. This will result in a single integer-pattern instruction with 5 points (instruction zero in the sample code), after which a number of diamond steps are conducted reducing the step size until it is smaller than 1, corresponding to fractional search. Each four-point diamond will generate a single check instruction. Finally, a small full search is conducted with the two for loops that will result in a single fractional instruction with a total of 25 points (-0.5, -0.25, 0, 0.25, 0.5) for the x and y indexes (instruction 31 in the sample code). The example program also shows a specific if-break syntax that is used to terminate the search early as described in the following subsection, and corresponds to instruction opcode 2 in the sample code.
The compiler processes this source code and generates two binary files. The first file program_memory contains the program instructions themselves, and a second file point_memory contains the x and y offsets from the basic search pattern (e.g., (-1,0), (1,0), (0,1), (0,-1) for the diamond search) that will be applied to the current motion vector candidate to identify the location of each new search point to be checked. All of the check statements between every update statement compile into a single integer or fractional pattern instruction. Additional if constructs that compare the current value of the motion vector and SAD registers with predefined threshold values are also supported. These can be used to change the program flow depending on the amount of motion in the current macroblock, and the desired search accuracy.
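The search strategy described in this example can be mirrored in plain Python to make the control flow explicit. This is not EstimoC and not the sample program itself: the loop structure (repeat a diamond until it stops improving, then halve the step) and the caller-supplied `cost` function are assumptions made for illustration.

```python
def diamond_then_fractional(cost, start=(0.0, 0.0)):
    """Diamond search with shrinking step, then a 5x5 fractional full search.

    cost: function mapping a motion vector (x, y) to a cost value.
    Returns (best_vector, best_cost).
    """
    best_mv, best_cost = start, cost(start)

    def run_pattern(offsets):
        # Check all points around the current winner, then "update".
        nonlocal best_mv, best_cost
        center, improved = best_mv, False
        for dx, dy in offsets:
            cand = (center[0] + dx, center[1] + dy)
            c = cost(cand)
            if c < best_cost:
                best_mv, best_cost, improved = cand, c, True
        return improved

    step = 8.0
    while step >= 1:
        diamond = [(-step, 0), (step, 0), (0, -step), (0, step)]
        while True:
            if not run_pattern(diamond):
                break  # early termination: the pattern did not improve
        step /= 2

    frac = [-0.5, -0.25, 0.0, 0.25, 0.5]
    run_pattern([(dx, dy) for dx in frac for dy in frac])  # 25 points
    return best_mv, best_cost

# Toy cost surface with a single minimum at (5.25, -3.5).
result = diamond_then_fractional(
    lambda mv: (mv[0] - 5.25) ** 2 + (mv[1] + 3.5) ** 2)
```

On this smooth toy surface the search lands exactly on the minimum; real SAD surfaces are not convex, so a fast search of this kind offers no such guarantee.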
D. Early Termination and Search-Point Duplication Avoidance
Early termination is a very important feature used to speed up execution in fast motion estimation algorithms. Typically if a pattern fails to improve upon the SAD of the previous iteration, the algorithm terminates the current search loop. To implement this technique, each completing check pattern instruction sets a best_eu register indicating which search point has improved upon the current cost. This register is set to zero before each instruction starts, so the value of the best_eu register at the end of execution indicates if the instruction has improved the cost value (best_eu no longer zero) and if so which search point has achieved this improvement. The conditional jump instruction checks this register and changes the execution flow as required. The same hardware can be used to support a technique to avoid searching duplicate points by coding optimized subpatterns in memory.
For example, in a hexagon search pattern the first pattern contains six different points but subsequent patterns will only add three new points to the search sequence. To avoid computing the same point more than once, the best_eu register can be checked to identify the winning search point and this information used by the hardware to decide which instruction to execute next. For this optimization to work, the program needs to be extended to contain a total of one full pattern sequence and six short pattern sequences. The complexity of identifying duplicate search points, and avoiding them, is taken care of automatically by the compiler.
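The count of three new points per step can be checked with a small script. The hexagon offsets below are the commonly used large-hexagon shape and are an assumption, since the exact pattern coordinates are not listed here; the script also assumes the center point has already been evaluated.

```python
HEXAGON = [(-2, 0), (-1, 2), (1, 2), (2, 0), (1, -2), (-1, -2)]

def new_points_after_move(winner):
    """Points of the hexagon re-centered on `winner` that the first hexagon
    (center plus six vertices) has not already evaluated."""
    already = {(0, 0), *HEXAGON}
    moved = [(winner[0] + dx, winner[1] + dy) for dx, dy in HEXAGON]
    return [p for p in moved if p not in already]

# One short pattern per possible winning vertex of the full pattern.
short_patterns = {w: new_points_after_move(w) for w in HEXAGON}
```

Each of the six entries in `short_patterns` holds exactly three points, which matches the one full sequence plus six short sequences described above.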
V. PROCESSOR MICROARCHITECTURE
The microarchitectures of two sample configurations are illustrated in Figs. 5 and 6. Fig. 5 corresponds to the base configuration with a single integer-pel execution unit whilst Fig. 6 corresponds to a more complex configuration with four integer-pel execution units (IPEUs), two fractional-pel execution units (FPEUs), and one interpolation execution unit (IEU). One integer-pel pipeline must always be present as shown in Fig. 5 to generate a valid processor configuration but all other units are optional, and are configured at compile time. In addition to the number of fractional and integer execution units, the hardware includes support for other motion estimation options, as shown in Table I. Partitions are calculated sequentially since SAD reuse is not possible in fast motion search as explained in Section VI-A.
A. IPEUs
Each functional unit uses a 64-b word-length and is heavily pipelined to achieve a high throughput. All accesses to reference and macroblock memory are done through 64-b-wide data buses and the SAD engine also operates on 64-b data in parallel. The memory is organized into 64-b words and typically all accesses are unaligned, since they refer to macroblocks that can start at any position inside each word. By performing 64-b read accesses to two memory blocks in parallel, the desired 64 b inside both words can be selected inside the vector alignment unit. The number of IPEUs is configurable from a minimum of one to For example, a typical diamond search pattern with a radius of 1 will use four positions in the point memory with values (-1,0), (0,-1), (1,0), (0,1). Any pattern can be specified in this way and multiple instructions specifying the same pattern can point to the same position in the point memory to save resources. Each execution unit has a local copy of the point memory and processes 64 b of data in parallel with the rest of the execution units. Each point memory is 256 × 16 b in size and contains the x and y offsets of the basic search patterns. Each integer-pel execution unit receives an incremented address for the point memory so each of them can compute the SAD for a different search point in the same pattern. This means that the optimal number of IPEUs for a diamond search pattern is four, and for the hexagon pattern six. However, by avoiding duplicate points, the number of search points for many regular patterns can be halved. In algorithms which combine different search patterns, such as UMH, a compromise can be found to optimize the hardware and software components. This illustrates the idea that the hardware configuration and the software motion estimation algorithm can be co-optimized to generate different processors depending on the software algorithm to be deployed.
Motion vector candidates and additional subblocks are supported with specific state machine extensions in the fetch, decode and issue control unit. These state machines change Fig. 6 . This module takes as inputs the predicted motion vector, the quantization parameter, and the current motion vector and computes the motion vector cost using a hardware multiplier and Lagrangian multiplication factors stored in small LUTs. The motion vector costs are added to the SAD costs in the COST Accumulator and control unit. This hardware approach replicates the software approach used in the x264 coder.
B. FPEU and IEU
The engine supports both half and quarter-pel motion estimation, thanks to a half-pel IEU and specialized FPEUs. The number of half-pel IEUs is limited to one, but the number of FPEUs can also be configured at compile time. The IEU interpolates the 20 × 20 pixel area that surrounds the 16 × 16 macroblock corresponding to the winning integer motion vector. The area interpolated is reduced when a subblock mode is active: 10 × 10 area for the 8 × 8 mode, 6 × 6 area for the 4 × 4 mode, etc. The interpolation hardware is cycled three times to first calculate the horizontal pixels, then the vertical pixels, and finally the diagonal pixels.
The IEU calculates these half-pels using a six-tap filter as defined in the H.264 standard. The IEU has a total of eight systolic 1-D interpolation processors, each with six processing elements. This design choice was made to balance internal memory bandwidth with processing power so that in each cycle, a total of eight valid pixels are presented to a different interpolator, where each can produce one new half-pel sample every clock cycle. In this way, by cycle 9, the first interpolator will have processed all eight pixels and can start on the next eight. This approach obtains high hardware utilization since no idle cycles are introduced but can still be used with memory blocks limited to 64-b interfaces. A total of 24 rows with 24 bytes each are read so that enough pixels are available for the six-tap filter to operate correctly. Each interpolator is enabled nine times so that a total of 72 8-byte vectors are processed. Due to the effects of filling and emptying the systolic pipeline before the half-pel samples are available, a total of 141 clock cycles are needed to complete half-pel horizontal interpolation. During this time, the integer pipeline is stalled, since the memory ports for the reference memory are in use.
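For reference, the standard six-tap half-pel filter can be written as a short software model. This sketch covers only the 1-D case (horizontal or vertical half-pels computed from integer samples) with the rounding and clipping defined in H.264; it does not model the systolic organization of the IEU or the diagonal positions, and the function name is illustrative.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 six-tap half-pel filter coefficients

def half_pel_row(pixels):
    """Half-pel samples for a 1-D run of 8-bit integer samples.

    Output sample i lies halfway between pixels[i + 2] and pixels[i + 3],
    so a run of n pixels yields n - 5 half-pel samples.
    """
    out = []
    for i in range(len(pixels) - 5):
        acc = sum(t * p for t, p in zip(TAPS, pixels[i:i + 6]))
        out.append(min(255, max(0, (acc + 16) >> 5)))  # round, scale, clip
    return out

# Read budget quoted in the text: 24 rows of 24 bytes in 8-byte vectors,
# shared between eight interpolators enabled nine times each.
assert 24 * 24 // 8 == 8 * 9 == 72
```

The five extra samples consumed by the filter are the reason a region larger than the block itself must be read before interpolation can start.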
Once horizontal interpolation is completed, and in parallel with computing the vertical interpolation, the diagonal interpolation and any fractional-pel motion estimation, the processing of the next macroblock or partition can start in the IPEUs. Motion estimation is only stalled during the cycles where the interpolation area is read from reference memory, but at all other times the integer and fractional execution units can be run in parallel. Completion of the vertical and diagonal pixel interpolation takes an additional 170 clock cycles after which the motion estimation using the fractional pixels can start. Quarter-pel interpolation is done on demand simply by reading the data from two of the four memories containing the half- and full-pel positions and averaging according to the H.264 standard. The fractional pipeline is as fast as the integer pipeline, requiring the same number of cycles to compute each search position as explained in Section VI.
C. Reference Memory Organization
The implemented reference memory can accommodate a search area width of 128 pixels, but the horizontal search range is intentionally limited to 112 pixels. This leaves a 16-pixel-wide memory area which can be reloaded with a new column for the next macroblock using a shifting window technique, in parallel with the processing the current macroblock continues. Using this technique, the reference addresses are offset gradually as macroblocks are processed so that reads are not performed on the memory area being loaded. The implementation of the reference area in the Xilinx Virtex-5 is very compact and uses a total of four BlockRAM resources. Each BlockRAM is organized with 1024 words and 4 bytes per word in a dual-port configuration. Fig. 7 shows a simplified view of how the reference memory is organized. The key feature is that the eight-pixel words that form the reference area are stored in an interleaved configuration. For example, the first row of the first 16 16 macroblock is formed by words 0 and 1. Word 0 is stored in BlockRAMs 1 and 2, whilst word 1 is stored in BlockRAMs 3 and 4 as shown on the left of Fig. 7 . The least significant bit of the address is used to determine BlockRAM selection. Since a motion vector can point to any location in this reference window, memory accesses are generally misaligned; for example, the last 3 bytes from the word read in BlockRAMs 1/2 must be concatenated with the first 5 bytes from the word read in BlockRAMs 3/4 to form 64 b of valid data. Notice that, if the motion vector points to the middle of memory word 1, then a few bytes from memory word 2 will also be needed to formed 64 b of valid data. In this case, the address must be incremented by one to access the right location for memory word 2 (second position in BlockRAMs 1/2).
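The interleaved organization and the misaligned read can be modelled in software as follows. This is a behavioural sketch under simplifying assumptions: each pair of 4-byte-wide BlockRAMs is merged into one 8-byte-wide bank, and the class and method names are hypothetical.

```python
WORD = 8  # bytes per 64-b word

class InterleavedMemory:
    """Even-numbered words live in bank 0, odd-numbered words in bank 1."""

    def __init__(self, data):
        # len(data) is assumed to be a multiple of 2 * WORD.
        words = [data[i:i + WORD] for i in range(0, len(data), WORD)]
        self.banks = (words[0::2], words[1::2])

    def read_word(self, index):
        # The least significant bit of the word address selects the bank.
        return self.banks[index & 1][index >> 1]

    def read_unaligned(self, byte_addr):
        """Return 8 bytes starting at any byte address.

        The two words involved always sit in different banks, so both reads
        could be issued in the same cycle. The last word of the memory has
        no successor and cannot be the start of a read in this model.
        """
        first, offset = divmod(byte_addr, WORD)
        both = self.read_word(first) + self.read_word(first + 1)
        return both[offset:offset + WORD]

mem = InterleavedMemory(bytes(range(64)))
window = mem.read_unaligned(5)  # last 3 bytes of word 0, first 5 of word 1
```

A read starting inside word 1 takes its second half from word 2, which sits at the next address of bank 0; this corresponds to the address increment mentioned above.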
The effect of this memory-interleaving technique is that the BlockRAMs always have at least one memory port free. This free port can be used to load new reference data for the next macroblock in parallel with the processing of the current macroblock. This is very important since restricting the processing and loading of new data to be sequential would typically halve performance. This simultaneous reading and writing means that the overhead of limited bus bandwidth can be masked. In our prototype, the bus width is 64 b, so the DMA engine can load a new 64-b word in each clock cycle. A new column of 
VI. HARDWARE PERFORMANCE EVALUATION AND IMPLEMENTATION
For the implementation, we have selected the Virtex-5 LX110T device included in the XUPV5 development platform. This device offers a high level of logic density and is fabricated using 65-nm CMOS technology.
A. Performance Analysis
The fractional and integer execution units have been carefully pipelined, and all configurations can achieve a clock rate of 200 MHz. Obtaining a performance value in terms of macroblocks per second is not as straightforward as in full-search hardware, in which the same number of SADs is computed for each macroblock. In our case, the amount of motion in the video sequence, the type of algorithm, and the hardware implementation all affect the number of macroblocks per second that the engine can process. The cycle-accurate simulator of the toolset has been used to measure the performance of the core processing the same high-definition files introduced in Section III. The performance values obtained from this simulator have been verified against a prototype implementation of the system using the XUPV5 board. Overall, the microarchitecture always uses 33 cycles per search point, although there is an overhead of 11 clock cycles needed to empty the integer pipeline between finding the best motion vector in each pattern iteration and starting the next pattern from the current winning position. The microarchitecture stops an execution unit if the current SAD calculation becomes larger than the cost obtained during a previous calculation (early termination) to save power, but will not try to start the next search point early. The main reason for this is that, since the core uses multiple execution units, it is very important that all execution units are synchronized so that a single controller can issue the same control signals to all units. Execution units starting at different clock cycles would violate this requirement.
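The early-termination behaviour can be illustrated with a short software sketch. It assumes, purely for illustration, that the running SAD is compared against the best previous cost after every row of the block; the granularity used by the hardware is not stated here. The function name is hypothetical.

```python
def sad_early_exit(cur_rows, ref_rows, best_cost):
    """Accumulate the sum of absolute differences row by row.

    Returns the SAD, or None if the running total exceeded best_cost
    and the candidate was abandoned (early termination).
    """
    total = 0
    for cur_row, ref_row in zip(cur_rows, ref_rows):
        total += sum(abs(c - r) for c, r in zip(cur_row, ref_row))
        if total > best_cost:
            # This candidate cannot win, so the remaining rows are skipped.
            return None
    return total
```

In the hardware described above the abandoned unit simply idles until the 33-cycle slot ends, so early termination saves power but not cycles.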
1) Integer-pel Performance Analysis:
Integer-pel performance is evaluated using three different fast motion estimation algorithms: diamond, hexagon and UMH, all of them followed by an eight-point square refinement as implemented in x264. Fig. 8(a)-(c) shows the performance, in terms of frames per second, as the number of integer execution units and the minimum subblock size change. It is clear that the performance of the processor is highly dependent on the motion estimation algorithm and the prediction features selected. The 8 × 8 mode considers the 16 × 16, 8 × 16, 16 × 8, and 8 × 8 partitions, while the 4 × 4 mode considers all partition sizes down to 4 × 4. As the number of partitions increases, performance decreases since the core can only compute one partition at a time. Unlike full-search, it is not possible to reuse partition results for fast motion estimation algorithms since each partition may follow a potentially different search direction. It is also important to note that not all the partitions are checked: this decision is made by the inter-mode selection algorithm within x264. For example, if the 8 × 8 partition has not improved over the 16 × 16 partition, then 4 × 4 will not be considered.
These figures show that more complex algorithms scale more gracefully with the number of available execution units. For example, the UMH algorithm includes a multihexagon search pattern that generates a single instruction with 64 search points. This instruction can fully utilize a hardware configuration with 16 execution units; consequently, performance increases linearly when moving from a configuration with 8 IPEUs to 16 IPEUs. On the other hand, a diamond pattern with four search points will only be able to use four of these execution units. In this latter case, any performance increase for configurations with more than four IPEUs is due to the final square refinement step, which contains eight search points. Furthermore, a diamond-search configuration with three IPEUs will need the same number of cycles as one with two IPEUs, because while the first iteration will enable all three IPEUs, a second iteration will still be required to complete the pattern instruction, during which only one IPEU will be utilized. Fig. 8(a)-(c) also shows that the simpler motion found in the Sunflower and Pedestrian video sequences results in higher frame rates. This could be exploited by lowering the clock frequency to the minimum level of performance necessary to meet the frame rate requirement, for dynamic power savings. In addition, the complex motion present in Crowdrun makes the probability of selecting the smaller subblocks much higher, and doing so increases the impact on performance. For example, to maintain a frame rate of 30 frames per second over the Crowdrun sequence when all the block sizes are used, 16 IPEUs are needed, as shown in Fig. 8(c). Another form of parallelism not described in this paper, but certainly possible, would be a multicore implementation. In this case, some ME cores can be dedicated to processing certain subblocks and only activated if needed. This would enable the scaling of the presented architecture to even higher frame rates for complex algorithms.
Alternatively, multicore implementations could also be used to support multiple reference frame prediction.
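The scaling behaviour discussed in this subsection can be approximated with a simple cycle model. The sketch below uses the 33 cycles per search point and the 11-cycle pipeline flush quoted in Section VI-A. It assumes that the search points of a pattern instruction are issued in passes of one point per IPEU, and that the flush is paid once per pattern instruction. The function name is hypothetical, and the model ignores memory stalls and mode decision.

```python
from math import ceil

CYCLES_PER_POINT = 33   # cycles per search point
PIPELINE_FLUSH = 11     # cycles to empty the integer pipeline


def pattern_cycles(search_points: int, ipeus: int) -> int:
    """Estimated cycles for one pattern instruction on a given number of IPEUs."""
    passes = ceil(search_points / ipeus)  # one search point per IPEU per pass
    return passes * CYCLES_PER_POINT + PIPELINE_FLUSH
```

Under these assumptions a four-point diamond costs the same with two or three IPEUs (two passes in both cases), whereas a 64-point multihexagon instruction needs eight passes on 8 IPEUs and four passes on 16 IPEUs.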
2) Fractional-pel Performance Analysis: The current microarchitecture can run both integer-pel and fractional-pel searches in parallel. To be able to obtain the same level of integer and fractional performance, each fractional-pel execution unit needs two alignment units because, in quarter-pel interpolation, two half-pel data words need to be read and aligned. The complex part of executing the fractional-pel refinement involves the half-pel interpolation using the standard six-tap filter. In the current microarchitecture, this interpolation needs to complete before the fractional-pel search can start. The interpolator needs around 300 clock cycles to compute all horizontal, vertical, and diagonal pixels. Fig. 9(a)-(c) evaluates just the fractional-pel performance in parallel with integer processing, and shows the performance of both the fractional-pel execution units and the interpolation unit. The three fractional motion estimation algorithms explored are diamond, hexagon and square search. Fractional search does not require overly complex algorithms since the search area is limited to 20 × 20 pixels; this being the original 16 × 16 pixel area corresponding to the winning integer macroblock, extended by two pixels on each side. In all cases, we consider a search loop formed by two half-pel checks followed by two quarter-pel checks. This follows the same approach as used in x264. Additionally, subpartitions are processed in a similar way to the low-complexity mode of x264: the fractional search refinement is only done on the best partition after the integer search completes. This option is taken to keep interpolation complexity low. The alternative of performing a fractional refinement over each possible partition would need a multicore implementation since the single interpolator available in the microarchitecture would not be sufficient. As with integer search, these results show that simpler motion sequences translate to higher fractional performance, as expected.
In this case, we can also observe that performance scales more slowly with the number of FPEUs. The reason for this is that half-pel interpolation, which requires a constant number of clock cycles independent of how many FPEUs are available, must be completed before fractional search can start.
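For reference, the standard six-tap half-pel filter mentioned in this subsection can be sketched as below. This is a minimal software illustration for a single horizontal or vertical half-pel sample from 8-bit full-pel neighbours. It is not a description of the interpolation unit, and it does not cover the diagonal positions, which the standard derives from unrounded intermediate values. The function name is hypothetical.

```python
TAPS = (1, -5, 20, 20, -5, 1)  # H.264 luma half-pel filter coefficients


def half_pel(samples) -> int:
    """Half-pel value between samples[2] and samples[3].

    samples: six consecutive full-pel values along one row or column.
    """
    acc = sum(t * s for t, s in zip(TAPS, samples))
    # Round, divide by 32, and clip to the 8-bit range.
    return min(255, max(0, (acc + 16) >> 5))
```

Each output sample depends on six inputs, which is why the interpolation pass costs a fixed number of cycles regardless of how many FPEUs are instantiated.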
B. Power Analysis
Power is a major consideration in hardware design, so it is important to investigate how effective the core is from a power efficiency point of view.
FPGAs are orders of magnitude less power-efficient than dedicated ASICs, so introducing power as a new design constraint when targeting reconfigurable chips is important. Thus far, most of the available research only uses complexity, performance, and area as the optimization criteria for FPGA designs. Most of the literature that does report power consumption relies on vendor-supplied tools. Typically, the standard approach has been to use a tool such as Xilinx XPower, together with a VCD activity file obtained from simulating the netlist which has been back-annotated with timing information. This method accurately captures the logic glitches largely responsible for dynamic power consumption, together with the switching behaviour of flip-flops and LUTs. This flow, however, implies that the whole FPGA device is dedicated to the ME core and, for example, considers that all the power consumed by the clock distribution network in the device is added to the power of the ME core. Since the basic ME core consumes between 2% (6% memory) and 3% (8% memory) of the logic resources available in the Xilinx XUP V5, these types of estimations are overly pessimistic. In a real system, the ME core will be just one IP block in the FPGA, consuming extra dynamic power only when it is activated.
Instead of estimating power using the vendor tools, we chose to deploy our core as part of a system-on-chip (SoC) using a modified Xilinx XUP V5 board with an isolated Vcore power supply connected to a custom-designed power supply module. A current sensing resistor is incorporated into this custom power supply, and its terminals exposed to allow the voltage drop to be measured using the Virtex-5 System Monitor primitive. This primitive is a hard block available in the Virtex-5 and Virtex-6 range of FPGAs that allows the monitoring of various physical operational parameters. It can be configured to monitor the internal voltages used to provide power to the core of the FPGA, and the die temperature. This block can also be linked to a number of auxiliary channels that allow the monitoring of signals external to the FPGA. The current sensing resistor can therefore be connected in series to the FPGA and the power supply, and these auxiliary channels used to determine the current load of the FPGA by monitoring the voltage drop across the resistor. Using this current measurement, coupled with the supply voltage of the device, the power dissipation of the system can be accurately determined.
For our measurements, we have clock-gated the ME processor so that we were able to isolate the power it consumed from the power of the rest of the SoC. The SoC uses a soft-core processor to move data from external DDR memories to the internal ME memories over an AMBA bus. Power measurements were taken for two runs of the SoC, first with the ME processor clock-gated and then with the ME clock enabled and the core executing the loaded motion estimation algorithm (hexagonal search). The difference between the two measurements corresponds to the dynamic power of the ME core. Fig. 10 shows the dynamic power of the ME cores with one IPEU, two IPEUs, and two IPEUs with an additional FPEU. As expected, power consumption increases linearly with core frequency, and proportionally with the number of execution units since they are of similar complexity and perform a similar function. The power consumption in the FPEU case also includes the power of the two IPEUs as they are run in parallel.
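The arithmetic behind this measurement procedure can be summarized in a short sketch. The function names are hypothetical, and the numbers in the usage example are illustrative placeholders rather than measured values.

```python
def supply_power(v_drop: float, r_sense: float, v_core: float) -> float:
    """Power drawn from the core supply, in watts.

    v_drop:  voltage measured across the current-sensing resistor (V)
    r_sense: resistance of the sensing resistor (ohm)
    v_core:  core supply voltage of the device (V)
    """
    current = v_drop / r_sense  # Ohm's law gives the load current
    return current * v_core


def dynamic_power(p_enabled: float, p_gated: float) -> float:
    """ME-core dynamic power: run with the ME clock enabled minus the clock-gated run."""
    return p_enabled - p_gated


# Illustrative values only (assumed, not taken from the measurements).
p_gated = supply_power(v_drop=0.0050, r_sense=0.01, v_core=1.0)
p_enabled = supply_power(v_drop=0.0062, r_sense=0.01, v_core=1.0)
me_dynamic = dynamic_power(p_enabled, p_gated)
```

Taking the difference of two runs removes the static power and the dynamic power of the rest of the SoC, provided that the rest of the system behaves the same way in both runs.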
The static power measured for the whole FPGA device is around 360 mW in the configured state and 345 mW in the unconfigured state. However, only a fraction of this corresponds to the ME core, which uses a small portion of the entire chip's resources.
Unfortunately, no power results have been presented in [11] or [12], so a direct comparison with these two programmable fast motion estimation processors, implemented on similar FPGA technology, was not possible. Power results based on the XPower tool are presented for the full-search architecture described in [19]. This work targets the Xilinx Virtex-II family of FPGAs and is limited to integer-pel search only. It uses a number of techniques to reduce power, including pipelining, clock gating and using only a small search range of 32 × 30 pixels for CIF formats. Additionally, a pixel truncation technique that reduces the precision of the algorithm, and hence the PSNR video quality, is also proposed to further reduce power. In order to make a fair comparison, we used XPower on our own processor configured with one or two IPEUs and retargeted for the Virtex-II family. The work presented in [19], with 256 processing elements, reports a minimum dynamic power of 1826 mW at 50 MHz. In contrast, for our work, XPower reported a power consumption of 480 mW with one IPEU, and 587 mW with two IPEUs. Clocking our solution at 50 MHz is more than sufficient to maintain real-time operation at CIF resolutions, even with a single IPEU.
C. Hardware Comparison
The results of implementing the processor with different numbers and types of execution units are illustrated in Table II. The basic configuration is small, using only 2% of the available logic resources and 6% of the memory blocks. Each new execution unit adds around 1000 LUTs and four embedded memory blocks. Table III compares the performance and area figures of our LiquidMotion processor against the ASIP cores proposed in [11] and [12] for integer-pel search only. To obtain a fair comparison, our core was retargeted to a Virtex-II device for these experiments since this is the technology used in [11] and [12]. An estimate for a general-purpose P4 processor, with all assembly optimizations enabled, is also presented as a reference, although the power consumption and cost of this general-purpose processor are not suited to the embedded applications that our work targets. These types of comparisons are difficult since the features supported by each design vary. Although our core can support fractional-pel searches, this feature is shared only by the work presented in [12], which uses a nonstandard interpolator and in which integer and fractional searches must run sequentially. For these reasons, this comparison has been limited to integer search only.
Overall, Table III shows that our core offers a similar level of diamond-search performance to the ASIP developed in [12] using one execution unit; this can be almost doubled if two execution units are instantiated, as shown in the final row. The pipeline of our proposed solution can clock at double the frequency, as shown in the table, which explains why our solution, using only a single execution unit, can support 1080-pixel HD formats while the solution presented in [12] is limited to 720-pixel HD formats. The measurements of cycles per macroblock were obtained by processing the same CIF sequences (352 × 288) as used in [12]. The diamond search corresponds to the implementation available in x264 that includes up to eight diamond iterations followed by a square refinement, using a single reference frame and a single macroblock size (16 × 16). A direct comparison that includes fractional search has not been possible since standard-compliant H.264 interpolation is not supported in these cores. Although our core consumes more memory blocks (but still fewer logic slices than in previous work) with a single IPEU, we are able to use these resources more effectively and can offer a search reference area over 6 times larger than the ASIP in [12].
VII. CONCLUSION
The main features of the presented processor are the support of arbitrary fast motion estimation algorithms for real-time HD video, the seamless integration of fractional and integer-pel support, the availability of a software toolset to ease the development of new motion estimation algorithms and processors, and a scalable, configurable architecture with a variable number of execution units determined by algorithm and throughput requirements. The combination of these features constitutes a significant advancement compared with the work reviewed in Section II. When compared to traditional full-search hardware, the presented core scales well to large search ranges without linear increases in hardware resources and, consequently, power consumption. When compared to other ASIP processors, our work is faster, more scalable, and fully supports the advanced features introduced with the H.264 standard. The measured power values have been added to the cycle-accurate simulator that forms part of the toolset, which can then be used to configure the processor according to power, performance, and complexity constraints.
