Abstract: A Full Search Motion Estimation architecture is proposed, fully elaborated, and tested in this paper. The proposed architecture reuses the data fetched from the main memory for the search area, which reduces the required memory I/O bandwidth. It guarantees full utilization of all resources with no stalls during the Motion Estimation process. It achieves high speed by completing the Motion Estimation process in an adequate number of clock cycles, and it delivers high video quality. Both the high speed and the high video quality are achieved by an efficient algorithm for loading the search area into a local memory. The local memory efficiently feeds the processing array with the required search area and achieves two levels of data reuse. We concentrate on elaborating and functionally testing the whole Motion Estimation architecture using VHDL and provide a proof of the high accuracy of the designed architecture. The local memory is implemented using only registers and a simple counter, which avoids complicated addressing for reading from or writing to the local memory. The proposed architecture has a regular data flow, which leads to a simple VLSI implementation. It is flexible and can be used for low and high definition video sequences. Owing to its high speed, it can be used for many real time video applications such as video phones, video conferencing, and HDTV broadcasting.
INTRODUCTION
HD-DVD, video conferencing, HDTV broadcasting, video-on-demand, multimedia messaging, and ultra frequency video transmission are real time video applications that are now widespread. H.264/AVC (Advanced Video Coding) and H.265/HEVC (High Efficiency Video Coding) are recent standards used for such applications [1] [2] [3] [4]. These standards maintain a very low bit-rate together with high video quality. This is achieved at the cost of added complexity in the encoder design. Multiple reference frames, half-pel and quarter-pel accurate Motion Estimation, parallel processing, and variable block sizes are examples of such added complexity.
Full Search Motion Estimation (FSME) is the well-known algorithm used in both the H.264/AVC and H.265/HEVC standards for removing the temporal redundancy of the transmitted video signal. Consequently, the encoders of these standards can achieve high compression of the transmitted bit-rate. FSME guarantees high video quality and high compression; however, it consumes most of the video encoding time [5]. Consequently, many fast Motion Estimation algorithms were developed to tackle the high complexity of the FSME process. Examples include Three Step Search (TSS) [6, 7], New Three Step Search (NTSS) [8], Four Step Search (FSS) [9], Diamond Search (DS) [10], Cross Diamond Search (CDS) [11], Successive Elimination Algorithm (SEA) [12, 13], and Adaptive Search Window Size (ASWS) [14, 15].
Most of the previous fast Motion Estimation algorithms are not implemented in VLSI because of their irregular data flow. Although some of them are well implemented in VLSI, the resulting video accuracy is low [16, 17]. As a result, Full Search Motion Estimation is still used for video transmission. Due to its regular data flow, the FSME algorithm is well suited to VLSI implementation. In this paper, a Full Search Motion Estimation architecture design is presented, fully elaborated, and tested. The main concerns of this paper are regularity of data flow, reducing the I/O bandwidth required for video transmission, reusing data fetched from the main memory, and fully utilizing the resources of the proposed design. We use VHDL to verify the functionality and accuracy of all components of the proposed Motion Estimation architecture.
http://dx.doi.org/10.12785/ijcds/040401 http://journals.uob.edu.bh
The paper is organized as follows. Section 2 presents the problem formulation. The proposed Motion Estimation architecture is discussed in detail in Section 3. The whole data flow of the Motion Estimation architecture is discussed in Section 4. Section 5 discusses the simulation results. Finally, conclusions and future work are presented in Section 6.
PROBLEM FORMULATION
H.264/AVC (Advanced Video Coding) and H.265/HEVC (High Efficiency Video Coding) are the most recent video coding standards, developed jointly by ITU-T VCEG and ISO/IEC MPEG [1] [2] [3] [4]. Figure 1
A. Motion Estimation Process
Motion Estimation (ME) is the process of finding the Motion Vector (MV) that describes the displacement of the current block relative to its best-matching block in the reference frame. Full Search Block Matching Motion Estimation (FSBM-ME) is the most popular ME algorithm [18]. In the FSBM-ME algorithm, the current frame is divided into blocks of N×N pixels, where N=16. Each block searches for its best match candidate block in the search area located in the reference frame. As seen in Figure 2, the best match candidate block is found by evaluating every point in the 2Pmax×2Pmax search area, where 2Pmax is the range of the selected search area. The point with the smallest cost is selected as the best match candidate block. The cost can be measured using the Sum of Absolute Differences (SAD) metric. The displacement between the center of the search area and the best match reference block is the Actual Motion Vector (AMV). Compared to other multimedia sources such as speech and text, video consumes much more data. Table 1 lists several video data formats. For SIF video sequences, a 32×32 search area is needed, while for SDTV and HDTV video sequences a 64×64 search area is required [5, 19]. Two main observations follow from the data in Table 1: 1- More search areas must be fetched from memory as the frame size increases, driven by consumer demand for higher resolution [5, 20]. For example, UHDTV broadcasting requires far more data to be fetched from memory than video conferencing, which uses the SIF video format. Since the memory I/O bandwidth is limited, this work proposes and elaborates an architecture that makes better use of the available memory I/O bandwidth. In this paper, an architecture is proposed for performing the FSBM-ME process.
[Figure labels: Current Frame n; Current block]
The elaborated architecture reuses data already held inside the ME co-processor. Consequently, large amounts of data need not be fetched again from the main memory. 2- The higher the resolution of a video sequence, the more computations are required to perform the FSBM-ME algorithm. These computations consume much of the encoding time. The proposed architecture allows parallel processing; consequently, a higher video transmission speed is obtained. Additionally, 100% utilization of the resources of the proposed architecture is achieved. The following subsection briefly describes the data reuse principle used.
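The full search procedure described in this subsection can be illustrated with a minimal software sketch. It is a plain behavioural reference in Python, not the proposed hardware; the function names `sad` and `full_search` are hypothetical, and the choice of keeping the first minimum in raster order is an assumption, since the text does not state how ties are resolved.

```python
def sad(cur, ref, top, left, n):
    """Sum of absolute differences between cur and the n x n block of ref at (top, left)."""
    return sum(
        abs(cur[r][c] - ref[top + r][left + c])
        for r in range(n)
        for c in range(n)
    )


def full_search(cur, ref, n):
    """Return (best_row, best_col, best_sad) over every candidate position in ref."""
    rows = len(ref) - n + 1      # number of candidate positions vertically
    cols = len(ref[0]) - n + 1   # number of candidate positions horizontally
    best = None
    for top in range(rows):
        for left in range(cols):
            cost = sad(cur, ref, top, left, n)
            # Strict '<' keeps the first minimum in raster order (assumed tie rule).
            if best is None or cost < best[2]:
                best = (top, left, cost)
    return best


# Tiny usage example: a 2x2 block hidden inside a 4x4 search area.
search_area = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
current_block = [[10, 11], [14, 15]]
best = full_search(current_block, search_area, 2)
```

Every candidate position is visited, which is what makes the data flow regular and also what makes the computation expensive.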
B. Data Reuse Principle
Compared to the H.264/AVC standard [3], H.265/HEVC achieves up to 50% savings in the transmitted bit-rate. Consequently, 4K and Ultra High Definition TV (UHD-TV) resolutions can be supported [19]. Both standards share two main problems [1, 3, 5, 21, 22]. The first is the huge amount of pixel data required from the external memory [5]. For a current block of N×N pixels, a search area of 2Pmax×(2Pmax+N-1) pixels is required from the external memory. The second is the huge number of computations required to perform the full search Motion Estimation process: 2Pmax×(2Pmax+N-1) absolute difference operations are required for a full Motion Estimation process per current block. The data volume problem can be addressed using data reuse techniques [23] [24] [25]. In this work we use two data reuse levels, Level A and Level B, as follows:
Data reuse level A: Within a single strip of the 2Pmax×2Pmax search area, consecutive candidate blocks overlap in N×(N-1) pixels, as seen in Figure 3. As a result, the overlapped area can be reused for the next candidate block (candidate block #2), and only one new column is needed from the external memory for that block.
Data reuse level B: Pixels also overlap between two consecutive strips (i.e., strip #1 and strip #2), as seen in Figure 3. Consequently, most of the pixels used in strip #1 can be reused while processing strip #2. That is, only one new row of pixels is needed from the external memory to complete strip #2.
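The overlap behind both reuse levels can be checked by simple counting. The sketch below is an illustration only: it represents a block as a set of pixel coordinates (the helper name `block_pixels` is hypothetical) and compares a block with its right-hand neighbour (level A) and with the block one row lower (level B), for N=16.

```python
def block_pixels(top, left, n):
    """Set of (row, col) coordinates covered by an n x n block."""
    return {(top + r, left + c) for r in range(n) for c in range(n)}


N = 16
first = block_pixels(0, 0, N)

# Level A: the next candidate block in the same strip, one column to the right.
right = block_pixels(0, 1, N)
shared_a = first & right   # pixels that can be reused
new_a = right - first      # pixels that must be fetched: one column

# Level B: the corresponding block in the next strip, one row lower.
lower = block_pixels(1, 0, N)
shared_b = first & lower   # pixels that can be reused
new_b = lower - first      # pixels that must be fetched: one row of the block
```

In both cases 240 of the 256 pixels are shared and only 16 are new, which is why a single column (level A) or a single row (level B) per step is sufficient.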
PROPOSED MOTION ESTIMATION ARCHITECTURE
The whole proposed ME architecture is shown in Figure 4. This architecture is mainly intended for the H.264/AVC standard. The search area fetched from memory is 2Pmax×(2Pmax+N-1) and the current block size is N×N. N and Pmax are chosen to be 16. The ME operation starts when the De-multiplexer (Demux) receives the pixels of both the Current Block (CB) and the search area from the external memory. The Demux distributes the data to either the Local Memory or the PE Array. The Local Memory consists of three sub-memories. The Local Memory sends candidate blocks to the Processing Array, which holds the data of both the current and the candidate blocks. After the absolute differences are calculated inside the PE array, they are sent to the Adder Tree to obtain the Sum of Absolute Differences (SAD). The SAD value is then sent to the Compare Unit to find the minimum SAD between the CB and all candidates in the search area. After the comparison, the position of the final minimum SAD is stored in the motion vector memory. The motion vector memory sends all the stored actual motion vectors to the main processor. The Control Unit controls the activities of all components.
It is worth mentioning that this architecture is scalable, so it can easily be used for the H.265/HEVC standard. The Local Memory will have the same size, but the PE array will be extended to 32×32 to suit the ME of the H.265/HEVC standard.
A. PE Array
The Processing Element (PE) array computes the Absolute Difference (AD) values between the current block and the candidate block in the search area. It consists of 16 PE rows, each as seen in Figure 5, which together form the PE array in Figure 6. The current block pixels and the candidate block pixels enter the 16 rows in parallel via the terminals CBR_in and RBR_in, respectively. Every clock cycle, one pixel enters the least significant PE of each row of Figure 5. Since pixel values range from 0 to 255 gray levels, 8 bits per pixel are used. As a result, each PE row carries 128 bits for all the ADs in one row. It is worth mentioning that data enters the first PE and each PE passes its stored data to the next PE. The exception is the last PE, which does not need to send data to any following PE. All of the PEs calculate the absolute difference in parallel.
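The shifting behaviour of one PE row can be modelled in a few lines of Python. This is a behavioural sketch under stated assumptions, not the authors' circuit: the class name `PERow` is hypothetical, each PE is assumed to hold one current-block register and one candidate-block register, and the extra pipeline latency of the hardware is not modelled.

```python
class PERow:
    """Behavioural model of one PE row: two 16-stage shift registers."""

    def __init__(self, width=16):
        self.cb = [0] * width  # current-block registers, one per PE
        self.rb = [0] * width  # candidate (reference) block registers, one per PE

    def clock(self, cb_in, rb_in):
        """One clock cycle: a new pixel enters PE 0 and older pixels move to the next PE."""
        self.cb = [cb_in & 0xFF] + self.cb[:-1]  # the last PE's value is dropped
        self.rb = [rb_in & 0xFF] + self.rb[:-1]

    def abs_diffs(self):
        """All PEs compute |CB - RB| in parallel."""
        return [abs(c - r) for c, r in zip(self.cb, self.rb)]


row = PERow()
cb_pixels = list(range(16))              # 0, 1, ..., 15
rb_pixels = [2 * p for p in range(16)]   # 0, 2, ..., 30
for cb_px, rb_px in zip(cb_pixels, rb_pixels):
    row.clock(cb_px, rb_px)
ads = row.abs_diffs()
row_output_bits = len(ads) * 8           # 16 ADs of 8 bits each
```

After 16 clock cycles the row is full, and the 16 eight-bit ADs account for the 128 bits per row mentioned in the text.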
B. Adder Tree
The output of the PE Array is 256 AD values that must be summed very quickly. Using ordinary adders results in a large delay that may prevent the proposed architecture from being used in real time video applications. An adder tree is a good choice because it uses parallel processing to add many values in one clock cycle [5, 26]. The main unit of the adder tree is the 4-2 compressor shown in Figure 7, which adds 4 bits at a time. Four bytes at a time enter the 4-2 compressors as seen in Figure 8. The carry out of the current stage i becomes the C_in of the next stage ii. The final result is obtained using a 9-bit adder that adds the outputs of the adder tree in Figure 8.
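As an illustration of the two ideas in this subsection, the sketch below models a 4-2 compressor and a pairwise tree reduction in Python. The two-full-adder construction is a common textbook realization and is an assumption here; the circuit in Figure 7 may be wired differently. The function names are hypothetical.

```python
def full_adder(a, b, c):
    """One-bit full adder: returns (sum, carry)."""
    return a ^ b ^ c, (a & b) | (a & c) | (b & c)


def compressor_4_2(x1, x2, x3, x4, c_in):
    """4-2 compressor built from two full adders (an assumed realization).

    Satisfies x1 + x2 + x3 + x4 + c_in == s + 2 * (carry + c_out).
    """
    s1, c_out = full_adder(x1, x2, x3)   # c_out does not depend on c_in
    s, carry = full_adder(s1, x4, c_in)
    return s, carry, c_out


def tree_sum(values):
    """Pairwise (tree) reduction; each loop pass models one level of adders."""
    level = list(values)
    while len(level) > 1:
        if len(level) % 2:
            level.append(0)  # pad odd levels with a zero operand
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
    return level[0]


worst_case_sad = tree_sum([255] * 256)  # every AD at its 8-bit maximum
```

The worst-case sum of 256 eight-bit ADs is 65280, which fits in 16 bits.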
D. Local Memory
The data reuse principle is implemented by the Local Memory unit. It stores the search area data as well as data that may be reused later, so such data need not be fetched again from the main memory. The Local Memory unit consists of two main units, the De-multiplexer (Demux) and the sub-memory units, as seen in Figure 11. Figure 10 shows the required search area for a 16×16 current block. The last pixels in part 2 and part 4 require an additional 16×15 pixels to complete the search process. This is the reason for using sub-memory 3 in Figure 11. The additional pixels (dashed area in Figure 10), which are required for searching the pixels in part 3 and part 4, can be fetched using the three sub-memories 1, 2, and 3 in sequence. Each sub-memory contains a 16×16 register array as seen in Figure 12. Each register is eight bits long and holds the value of one pixel of the search area. Data enter one 16-pixel row at a time, from bottom to top: each clock cycle, one row enters at the bottom and the existing rows shift up by one register row. Data leave the sub-memory column by column, starting from the left column and moving to the right. A counter selects the column to be read.
The Demux acts as the interface between the external memory and both the PE array and the local memory. Data is transferred from the external memory over a 128-bit data bus (16 pixels wide). The PE array fills its registers with the Current Block data fetched from the external memory once per ME search operation, when the select terminal of the Demux is set to 0. The search area is loaded into sub-memories 1, 2, and 3 when the select terminal is in positions 1, 2, and 3, respectively.
Returning to the local memory architecture in Figure 11 and the search area in Figure 10, the whole operation is as follows. During the first 16 clock cycles, the select terminal of the Demux is 0, and the PE array receives the current block row by row, 16 pixels (128 bits) at a time, in the upward direction. In the next 16 clock cycles, the select terminal is 1 and sub-memory 1 is filled in the upward direction with part 1 of the search area. In clock cycle number 33, the counter points to the most significant column of sub-memory 1 and the select terminal becomes 2; sub-memory 2 then starts to be filled with part 2 of the search area. The counter keeps increasing until clock cycle # 48. At clock cycle # 48, all of part 1 of the search area has been moved to the PE array and part 2 has been loaded into sub-memory 2. The PE array delivers 256 Absolute Difference (AD) values at clock cycle # 49. The adder tree adds the AD values to produce the final SAD value at clock cycle # 50. Level A data reuse is achieved by moving the counter to the first (leftmost) column of part 2 of the search area. Once the counter selects this column, it enters the left column of the PE array to produce another 256 AD values. The process continues until the first strip of level A data reuse is done. It is worth mentioning that at clock cycle # 49 the select terminal becomes 3 to start filling sub-memory 3 with the dashed area of the first strip (level A) of the search area in Figure 10. Level B data reuse [21] is achieved by loading only one row from part 3 and part 4 into sub-memories 1, 2, and 3, respectively. The counter is updated to cover all points in the search area in Figure 10. It is worth mentioning that the SAD value is 16 bits long.
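The register-only sub-memory described in this subsection (rows written at the bottom and shifted upward, columns read out under counter control) can be sketched as follows. This is a behavioural Python model with a hypothetical class name `SubMemory`; a 4×4 array is used in the usage example only to keep it short, whereas the paper uses 16×16.

```python
class SubMemory:
    """Register array written row by row (shifting up) and read column by column."""

    def __init__(self, size=16):
        self.size = size
        self.rows = [[0] * size for _ in range(size)]  # rows[0] is the top row

    def write_row(self, pixels):
        """Shift every row up by one and load a new row at the bottom."""
        assert len(pixels) == self.size
        self.rows = self.rows[1:] + [list(pixels)]

    def read_column(self, counter):
        """Return the column selected by the counter (0 is the leftmost column)."""
        return [row[counter] for row in self.rows]


mem = SubMemory(size=4)
for base in (0, 10, 20, 30):
    mem.write_row([base + c for c in range(4)])
first_strip_col0 = mem.read_column(0)   # column 0 of the first strip

# Level B reuse: one new row is enough to obtain the next strip.
mem.write_row([40, 41, 42, 43])
second_strip_col0 = mem.read_column(0)
```

No address decoding is involved: writing is a shift, and reading needs only the column counter, which matches the simplification claimed for the local memory design.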
E. Motion Vector Memory
The output of the adder tree is the SAD value between the current block and the candidate block (SAD_current). The compare unit stores the minimum SAD found so far and its corresponding position. It compares SAD_current with the minimum SAD. If SAD_current is less than the minimum SAD, the compare unit updates its minimum SAD value to SAD_current and records the new position. After all candidate blocks in the search area are processed, the final position is sent to the motion vector memory in Figure 13.
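The update rule of the compare unit amounts to a running minimum. A minimal sketch follows; the class name `CompareUnit` is hypothetical, and initialising the stored minimum to the largest 16-bit value is an assumption, since the text does not state the reset value.

```python
class CompareUnit:
    """Keeps the minimum SAD seen so far and the position where it occurred."""

    def __init__(self, sad_bits=16):
        self.min_sad = (1 << sad_bits) - 1  # assumed reset value: all ones
        self.position = None

    def update(self, sad_current, position):
        """Replace the stored minimum only when SAD_current is strictly smaller."""
        if sad_current < self.min_sad:
            self.min_sad = sad_current
            self.position = position


unit = CompareUnit()
for position, sad_current in enumerate([500, 200, 200, 350]):
    unit.update(sad_current, position)
```

Because the comparison is strict, a later candidate with an equal SAD does not displace the earlier one; here the unit ends with a minimum of 200 at position 1.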
The proposed ME architecture is flexible: it can perform the ME process for many video formats, for example QSIF, SIF, and SDTV video sequences [5]. For Motion Estimation, the current frame is divided into 16×16 current blocks, and each current block has an actual motion vector (the position of the minimum SAD). These actual motion vectors (AMVs) are stored in the motion vector memory shown in Figure 13.
An SDTV frame is 720×486 pixels. Divided into 16×16 current blocks, it needs 1395 AMVs. The motion vector memory is simply a FIFO of 1395 registers. We used a 32×32 search area in our simulation; consequently, the input to the motion vector memory is 11 bits wide. The first position is stored in the bottom register and shifts upward with every new AMV. The reset terminal (Rst) is asserted once per current frame. The enable terminal (En) is asserted at the end of each Motion Estimation process to store the AMV of a current block.
Figure 13: Motion Vector Memory (11-bit registers #0 to #1394; inputs Rst and En; AMV output to the main processor).
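The sizing figures quoted for the motion vector memory can be reproduced with a short calculation. The sketch assumes that partial blocks at the frame border are rounded up to a whole block, which is the reading that yields 1395, and that the 11-bit width follows from a position counter that runs up to 400H for the 32×32 search area; both are interpretations rather than statements from the text. The function name `amv_count` is hypothetical.

```python
import math


def amv_count(width, height, block=16):
    """Number of block x block current blocks needed to cover a frame."""
    return math.ceil(width / block) * math.ceil(height / block)


sdtv_amvs = amv_count(720, 486)        # 45 block columns x 31 block rows
counter_bits = (0x400).bit_length()    # bits needed to represent 400H
```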
F. Control Unit
The Control Unit is the most important and complex part of the design. It produces all the control signals required by the components of the ME architecture. The control unit consists of two parts: the up counter and the control signals controller. The Control Unit has three inputs: enable, reset, and the system clock. Its outputs are all the needed control signals.
The up counter counts the clock cycles needed for each ME process, starting from the top left pixel and ending at the bottom right one in the search area. For example, for a 32×32 search area, the up counter runs from 000H to 400H. To start counting, enable, reset, and system clock inputs are needed; the output of the up counter is the clock count. The up counter is reset at the start of every new ME process. The counter output represents the position of the candidate block inside the search area. This value must be mapped to the coordinates of the whole frame before the best match candidate position is stored in the motion vector memory.
The control signals controller takes the output of the up counter as its input. Its outputs are the control signals that drive all components of the ME architecture.
THE DATA FLOW OF THE ME ARCHITECTURE
The ME process starts when the start control signal and the system clock are received from the main processor. The PE array is filled with the current block pixel values in the first 16 clock cycles, with the select terminal of the Demux set to 0. In the second 16 clock cycles, sub-memory 1 is filled with 16×16 pixels of the search area (group 1), as seen in Figure 14, with the select terminal of the Demux set to 1. In clock cycle number 33, the PE array starts reading the data of group 1, and sub-memory 2 starts loading the 16×16 search area group 2 with the select terminal set to 2. At clock cycle number 48, the PE array delivers 256 absolute differences to the adder tree, and sub-memory 3 starts receiving its 16×16 search area pixels with the select terminal set to 3. In clock cycle number 49, the adder tree delivers the SAD value to the compare unit and the PE array receives the first column of group 2 in Figure 14, which achieves data reuse level A. In clock cycle number 50, the compare unit completes its update operation. It is worth mentioning that sub-memory 3 finishes loading its pixel values at clock cycle number 64; that is, each sub-memory requires 16 clock cycles to be filled. After sub-memory 3 is filled, in clock cycle number 65, only 16 pixels (group 4) are loaded into the bottom row of sub-memory 1, and all values in sub-memory 1 are shifted upward to achieve level B data reuse. After the contents of group 3 have been fed into the PE array, the new candidate values of group 1 start to enter the PE array. Groups 5 and 6 are loaded in clock cycles 66 and 67, respectively. The operations are repeated by loading the remaining search area values into the sub-memories and reading them into the PE array accordingly. It is worth mentioning that sub-memories 1 and 2 each require 16 clock cycles to be read column by column, whereas sub-memory 3 supplies only 15 columns to the PE array.
It is clear from the previous discussion that data enters the sub-memories row by row to achieve level B data reuse. Level A data reuse is achieved by switching the read operations between sub-memories. Thus, 16 clock cycles are needed to read from sub-memory 1 while sub-memory 2 is written row by row, and another 16 clock cycles to read from sub-memory 2 while sub-memory 3 is written row by row. Finally, 15 clock cycles are needed to read from sub-memory 3 while the bottom rows of sub-memory 1 and sub-memory 2 are written (level B data reuse). That is a total of 47 clock cycles. These 47 clock cycles are repeated for the 32 rows of the search area, with the counter of the local memory reset for each row. The setup takes 36 clock cycles in total: 16 clock cycles for loading the current block into the PE array, another 16 clock cycles for loading the first candidate block into the PE array, two clock cycles for obtaining the first SAD, one clock cycle for the compare unit, and one clock cycle for resetting all registers at the beginning of the ME process. The number of clock cycles required for the whole ME operation is therefore (32 rows of the search area) × (47 clock cycles for reading one slice using sub-memories 1, 2, and 3) + 36 (setup clock cycles), which is 1540 clock cycles.
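The cycle budget derived in this section can be written as a short calculation. The function name `me_clock_cycles` and its parameters are hypothetical; the breakdown follows the figures given in the text for N=16 and 32 rows of the search area.

```python
def me_clock_cycles(search_rows=32, n=16):
    """Total clock cycles for one ME process, following the breakdown in the text."""
    # Reading one strip: 16 columns from sub-memory 1, 16 from sub-memory 2,
    # and 15 from sub-memory 3.
    read_cycles_per_strip = n + n + (n - 1)
    # Setup: load current block, load first candidate block, first SAD (2),
    # compare unit (1), register reset (1).
    setup_cycles = n + n + 2 + 1 + 1
    return search_rows * read_cycles_per_strip + setup_cycles


total_cycles = me_clock_cycles()
```

With the default values this gives 32 × 47 + 36 = 1540 clock cycles, in agreement with the total stated in the text.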
SIMULATION RESULTS
The frames are divided into blocks of size 16×16. The search area used has a size of 32×32 pixels. The proposed Motion Estimation architecture is tested using VHDL, and full functional verification was performed using the ModelSim tool.
In our simulations, we simulate each part of the whole ME architecture in Figure 4. Figure 15 and Figure 16 show the test benches for the current block (CB) and the candidate or reference block (RB) used to test the PE array. If both the CB and the RB are loaded into the PE array at the same time, it takes 16 clock cycles to load them into the registers of the PE array. Since each PE has two registers, two more clock cycles are needed. The expected absolute difference results shown in Figure 17 can be obtained after clock cycle number 18. The simulation result in Figure 20 confirms the absolute difference values in Figure 17 at the PE array output at clock cycle number 18.
The second part that we checked joins the PE array with the adder tree to calculate the SAD value. The CB and RB used in this case are more complex and are shown in Figure 18 and Figure 19, respectively. The expected SAD value is 200H after 18 clock cycles. Figure 21 confirms this result.
Figure 18: The CB used in SAD calculation.
Figure 19: The RB used in SAD calculation.
The whole architecture is checked using a search area that starts at (R0,C0) and has a size of 32×32. The whole Motion Estimation process consumes 1540 clock cycles (16 clks for loading the CB into the PE array, 16 clks for loading sub-memory 1, 47 clks for loading each strip into the PE array, 2 clks for obtaining the SAD, 1 clk for resetting all registers, and 1 clk for the compare unit to store into the MV memory). It is worth mentioning that 32 strips of the search area must be loaded into the PE array. As seen in Figure 22, the best match is the block in red located at position (R30, C27). Using the calculation above, we expect the Actual Motion Vector (AMV) of the best match candidate block to be found at clock cycle # 1487. Figure 23 shows the simulation results of performing the ME process and finding the best match at clock cycle # 1478. This is taken as the AMV, which is stored in the MV memory after the whole ME process is done at clock cycle # 1540.
CONCLUSION AND FUTURE WORK
A Full Search Motion Estimation architecture is proposed and fully functionally tested using VHDL. Simulation results show that the whole Motion Estimation process can be performed in 1540 clock cycles (it means the transmission throughput is ) with 100% utilization of all resources. Data reuse is achieved using a smart data flow as well as a small internal local memory. Simulation results show that the proposed architecture finds the exact AMV with a 100% success rate. Future work will include calculating the hardware cost and comparing the proposed design with state of the art ME architectures. Since the proposed architecture uses fewer components and has a regular data flow, it is expected to perform favorably in speed, area, and power consumption. A comprehensive comparison between the architecture in this paper and other architectures is planned as future work.
