Abstract-It is well known that image processing requires a huge amount of computation, mainly in low-level processing, where the algorithms deal with a great number of data (pixels). One of the solutions for estimating motion involves detecting the correspondences between two images. For normalised correlation criteria, previous experiments have shown that the result is not altered in the presence of nonuniform illumination. Usually, hardware for motion estimation has been limited to simple correlation criteria. The main goal of this paper is to propose a VLSI architecture for motion estimation using a matching criterion more complex than the Sum of Absolute Differences (SAD) criterion. Today's hardware devices provide many facilities for the integration of increasingly complex designs, as well as the possibility of communicating easily with general-purpose processors.
I. INTRODUCTION
Presently, different methods for motion estimation and localisation of an underwater vehicle exist, mainly based on acoustic sensor networks. This strategy is relatively expensive, since transponders have to be deployed from a ship, calibrated and recovered after the mission. Therefore this procedure is not adequate for low-cost, small-size underwater vehicles such as URIS (Underwater Robotic Intelligent System), developed at the University of Girona (see figure 1). One cost-effective alternative is to equip the vehicle with a down-looking camera, which acquires seafloor images while the robot is performing its mission. This down-looking camera provides rich visual information which can be used for vehicle motion estimation. The sequence of images acquired by the camera mounted on the robot can also be used to construct a map (mosaic) of the images. An example of a part of an underwater mosaic image is presented in figure 2.
In most cases the process involves recovering the motion of the vehicle by means of grey level correlation [1] or optical flow [2]. Unfortunately, underwater images are difficult to process due to the characteristics of the transmission medium. Blurring of elements of the image, clutter and non-uniform illumination are some of the problems present in underwater imaging.
Our aim in the present work is, starting from an analysis of VLSI architectures for motion estimation, to develop efficient Field Programmable Gate Array (FPGA) hardware implementations of a motion estimation algorithm for the particular case of underwater images. In order to enable hardware implementation, a number of transformations of the algorithm may be required. The algorithm and the required transformations are introduced in section II. Section III describes our proposed architectures for real-time motion estimation of underwater images. The implementation would allow motion to be estimated at video rate, as described in section IV. The paper ends with conclusions and further work.
II. REAL TIME MOTION

A. Matching Criteria
The motion estimation algorithm considers two images: the current image $I_c$, acquired by the camera, and a reference image $I_r$, coming from the control system. Point correspondences between the current image and a previous reference image have to be found in order to compute a motion estimation matrix. Often this requires detecting features in one image and matching them in the other. The selection of features may depend on the application, although points are commonly used because they can be easily extracted and are quite robust to noise [3]. A correlation algorithm provides, for each interest point $(x_c, y_c)$ of the current image, its corresponding match $(x_r, y_r)$ in the reference image (see figure 3). The correlation score is defined as the covariance between the grey levels of a region defined by the correlation window in the current image and the same region defined in the reference image. The algorithm searches for all similar patches inside the corresponding search window. A normalised correlation criterion C, which ensures that the result is not altered in the presence of nonuniform illumination, is shown in equation (1). In our previous work we successfully applied this criterion to underwater images in the presence of nonuniform illumination [4]. The correlation score between a point $(x_c, y_c)$ in the first image $I_c$ and a point $(x_r, y_r)$ in the second image $I_r$ is defined as:

$$C = \frac{\sum_{i=-a}^{a}\sum_{j=-a}^{a}\left[I_c(x_c+i,\,y_c+j)-\overline{I_c}(x_c,y_c)\right]\left[I_r(x_r+i,\,y_r+j)-\overline{I_r}(x_r,y_r)\right]}{(2a+1)^2\sqrt{\sigma^2(I_c)\,\sigma^2(I_r)}} \qquad (1)$$

where $(2a+1)\times(2a+1)$ is the size of the correlation window, $\overline{I_c}(x_c,y_c)$ and $\overline{I_r}(x_r,y_r)$ are the average intensities, and $\sigma^2(\cdot)$ denotes the variance of each correlation window. The correlation algorithm compares the correlation scores of all pixels within the search window and selects the highest one.
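The search described above can be sketched in software as follows. This is a minimal illustration, not the authors' implementation: it assumes that C is the covariance of the two windows divided by the product of their standard deviations, that images are stored as lists of rows, and that the search window is centred on the interest point with a half-size `s`. The names `window`, `normalised_correlation` and `best_match` are hypothetical.

```python
from math import sqrt

def window(img, x, y, a):
    """Return the (2a+1) x (2a+1) grey levels centred on (x, y) as a flat list."""
    return [img[y + j][x + i] for j in range(-a, a + 1) for i in range(-a, a + 1)]

def normalised_correlation(wc, wr):
    """Covariance of two windows divided by the product of their standard deviations."""
    n = len(wc)
    mean_c = sum(wc) / n
    mean_r = sum(wr) / n
    cov = sum((c - mean_c) * (r - mean_r) for c, r in zip(wc, wr)) / n
    var_c = sum((c - mean_c) ** 2 for c in wc) / n
    var_r = sum((r - mean_r) ** 2 for r in wr) / n
    if var_c == 0 or var_r == 0:
        return 0.0  # flat window: score undefined, treated as "no match" (assumption)
    return cov / sqrt(var_c * var_r)

def best_match(cur, ref, xc, yc, a, s):
    """Scan a (2s+1) x (2s+1) search window of ref centred on (xc, yc).

    Returns (score, (xr, yr)) for the candidate with the highest score.
    The caller must ensure that all windows lie inside both images.
    """
    wc = window(cur, xc, yc, a)
    best_score, best_pos = -2.0, None  # scores lie in [-1, 1]
    for dy in range(-s, s + 1):
        for dx in range(-s, s + 1):
            score = normalised_correlation(wc, window(ref, xc + dx, yc + dy, a))
            if score > best_score:
                best_score, best_pos = score, (xc + dx, yc + dy)
    return best_score, best_pos
```

Because the mean is subtracted and the result is divided by the standard deviations, a gain or offset applied to one of the images leaves the score unchanged, which is the invariance to illumination changes that the text relies on.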
Correlation algorithms have important properties such as regularity and modularity. Thus, they can be broken down into computational blocks which can be processed in parallel. We propose a decomposition of criterion C for its parallelisation.
We can observe that there are five sums to be computed in equation (1): sum1, sum2, sum3, sum4 and sum5.
B. Linear Arrays for Motion Estimation
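One way to see where five sums come from is to expand the covariance and the two variances. The derivation below is a sketch under two assumptions: that C is the covariance of the two windows divided by the product of their standard deviations, and that the five sums are the ones named $S_1$ to $S_5$ here. The assignment of these symbols to sum1 through sum5 is illustrative and not taken from the original text. All sums run over the $N = (2a+1)^2$ pixels of the correlation window.

```latex
\begin{aligned}
S_1 &= \sum I_c, \quad S_2 = \sum I_r, \quad
S_3 = \sum I_c^2, \quad S_4 = \sum I_r^2, \quad S_5 = \sum I_c I_r \\[4pt]
\operatorname{cov}(I_c, I_r) &= \frac{S_5}{N} - \frac{S_1 S_2}{N^2}
  = \frac{N S_5 - S_1 S_2}{N^2} \\[4pt]
\sigma^2(I_c) &= \frac{N S_3 - S_1^2}{N^2}, \qquad
\sigma^2(I_r) = \frac{N S_4 - S_2^2}{N^2} \\[4pt]
C &= \frac{\operatorname{cov}(I_c, I_r)}{\sqrt{\sigma^2(I_c)\,\sigma^2(I_r)}}
  = \frac{N S_5 - S_1 S_2}{\sqrt{\left(N S_3 - S_1^2\right)\left(N S_4 - S_2^2\right)}}
\end{aligned}
```

In this form the per-pixel work reduces to two accumulations ($S_1$, $S_2$) and three multiply-accumulations ($S_3$, $S_4$, $S_5$), while the multiplications, subtractions, square root and division are needed only once per candidate.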
Our approach is based on some ideas applied in VLSI architectures for motion compensation algorithms used in video coding standards. Full Search Block Matching Algorithms (FSBMA) are used for motion estimation in such applications [5]-[7]. In these algorithms the current frame is divided into blocks of n x n pixels, and for every block the algorithm searches for similar blocks in the previous frame within a search area of size (2p + n) x (2p + n). In full search the algorithm has a very regular data-flow for the search area can be used for the implementation of the algorithm. Moreover, two possible ways of computation arise from the associativity of the operations in the algorithm. The four resulting array structures are denoted AB1 and AS1 for linear arrays and AB2 and AS2 for quadratic arrays. These architectures exploit concurrency using different structures and numbers of processing elements (PEs).
The complexity of the motion estimation algorithm from equation 3 introduces some restrictions in choosing an adequate array architecture. For instance, a quadratic array is suitable only in the case of a simple PE architecture. When more computation must be done in parallel, it can consume a lot of resources, which are sometimes not available. This is the reason why we restrict our analysis to linear arrays; AB1-type and AS1-type are shown in figure 4.
One solution to reduce the latency introduced by both AB1 and AS1 structures is to increase the memory access. When every PE is supplied with data coming from external memory, idle cycles can be avoided. When access to external memory is constrained, the latency can also be reduced by controlling the data-flow through the array using registers and multiplexers. This strategy was applied by Vos 
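As a software reference for the full search described above, the following sketch evaluates every displacement in the range [-p, p] in both directions for one n x n block, using the SAD criterion. It is a plain sequential illustration, not a model of any of the array architectures; the function names and the list-of-rows image layout are assumptions.

```python
def sad(cur, ref, bx, by, dx, dy, n):
    """Sum of absolute differences between the n x n block of cur at (bx, by)
    and the block of ref displaced by (dx, dy)."""
    return sum(abs(cur[by + j][bx + i] - ref[by + dy + j][bx + dx + i])
               for j in range(n) for i in range(n))

def full_search(cur, ref, bx, by, n, p):
    """Test all (2p+1)^2 displacements, i.e. a (2p+n) x (2p+n) search area.

    Returns (motion_vector, cost) of the candidate with the smallest SAD.
    Candidates that fall outside the reference image are skipped.
    """
    height, width = len(ref), len(ref[0])
    best_cost, best_mv = None, None
    for dy in range(-p, p + 1):
        for dx in range(-p, p + 1):
            x0, y0 = bx + dx, by + dy
            if x0 < 0 or y0 < 0 or x0 + n > width or y0 + n > height:
                continue
            cost = sad(cur, ref, bx, by, dx, dy, n)
            if best_cost is None or cost < best_cost:
                best_cost, best_mv = cost, (dx, dy)
    return best_mv, best_cost
```

The two nested displacement loops and the two nested pixel loops inside `sad` are independent of the data, which is the regularity that array architectures exploit when they assign candidates or pixels to separate PEs.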
III. VLSI ARCHITECTURE FOR NORMALISED CORRELATION
While in FSBMA the image is divided into blocks and the algorithm looks for matches of every block in a search area, our approach looks for correspondences of areas surrounding interest points. These are scene features which can be reliably found when the camera moves from one location to another, even when the lighting conditions of the scene change. In our previous work we proposed a real-time implementation of interest point detection [12]. Due to its simplicity, the Sum of Absolute Differences (SAD) has been the most extensively used matching criterion in VLSI implementations. In the case of underwater imaging, our previous work [4] showed that when a normalised correlation criterion is applied to find matches in pairs of images, the result is invariant to nonuniform illumination. The complex error measurement computation is shared between two computational elements: an array of Processing Elements (PEs), each PE performing two accumulations and three multiply-accumulations, and a Post Processing Element (PPE) containing multipliers, subtractors, a square root and a divider to compute the error measurement.
The VLSI architectures for motion estimation presented above can be easily adapted to our algorithm. The complexity of the processing element led us to set aside the quadratic arrays, so that only linear arrays are analysed in this proposal. Figure 5 represents the AB1 and AS1 structures adapted to our design. The architectures correspond to an experimental correlation window size of 3 x 3 and a search window size of 7 x 7.
As we can see in figure 5 b), a reduced AS1 structure is analysed, where each PE has a search window datum input from the memory and the correlation window data is broadcast through the array. Compared with AB1, this strategy reduces the number of time cycles but increases the number of processing elements.
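The split between PE and PPE can be illustrated with a small behavioural model. This is a sketch under stated assumptions, not the authors' design: it assumes the two accumulations are the sums of the grey levels of each window, that the three multiply-accumulations are the two sums of squares and the sum of cross products, and that the PPE evaluates the covariance divided by the product of the standard deviations. The class and function names are hypothetical, and floating point is used for the final step only for brevity.

```python
from math import sqrt

class ProcessingElement:
    """Behavioural model of one PE: five running sums for one candidate match."""

    def __init__(self):
        self.n = 0       # number of pixel pairs seen so far
        self.s_c = 0     # accumulation: sum of current-window pixels
        self.s_r = 0     # accumulation: sum of reference-window pixels
        self.s_cc = 0    # multiply-accumulation: sum of c * c
        self.s_rr = 0    # multiply-accumulation: sum of r * r
        self.s_cr = 0    # multiply-accumulation: sum of c * r

    def step(self, c, r):
        """Consume one pixel pair, as the PE would in one clock cycle."""
        self.n += 1
        self.s_c += c
        self.s_r += r
        self.s_cc += c * c
        self.s_rr += r * r
        self.s_cr += c * r

def post_process(pe):
    """Model of the PPE: multipliers, subtractors, square root and division."""
    num = pe.n * pe.s_cr - pe.s_c * pe.s_r
    den = (pe.n * pe.s_cc - pe.s_c ** 2) * (pe.n * pe.s_rr - pe.s_r ** 2)
    if den <= 0:
        return 0.0  # flat window: score undefined, treated as "no match" (assumption)
    return num / sqrt(den)

def correlate(wc, wr):
    """Feed two equally sized windows through one PE, then through the PPE."""
    pe = ProcessingElement()
    for c, r in zip(wc, wr):
        pe.step(c, r)
    return post_process(pe)
```

All per-pixel operations stay in integer arithmetic; only the once-per-candidate step needs a square root and a division, which is why it makes sense to place those operators in a single shared PPE rather than in every PE.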
An AB1-type architecture is suitable for applications where large motion vectors must be estimated, which implies big search windows. When the application requires faster processing speed, an AS1-type architecture can achieve higher performance than the AB1-type for small search window sizes. The approach introduced by Yang et al. [11] is an important contribution to reducing the memory throughput compared with both AB1 and AS1. Figure 6 shows Yang's VLSI architecture applied to implement the normalised correlation algorithm. Figure 7 presents the internal structure of one PE in both cases: AB1-type and AS1-type (Yang's architecture). In the AB1 structure the PE performs three multiplications and five additions, as the accumulation is done in a separate block at the end of the PE array. In the AS1 structure, by contrast, the accumulations are done in the PE. This increases the size of the PE but simplifies the control, since each PE has the same structure. In the case of AB1, one PE applies the arithmetic operators to the outputs of the previous PE; therefore the bit size of the inputs and outputs of each PE varies through the array of processing elements.
The AB1-type PE may occupy less silicon area than the AS1-type PE, where multiply-accumulations must be performed in parallel. This analysis helps the designer to choose between a solution that saves hardware and one that reduces the memory access and increases the execution speed. As mentioned above, depending on the application, the designer may opt for the AB1-type, which can deal with large search areas but introduces large latencies, or the AS1-type when performance is a critical issue. In the case of underwater images, the motion of the vehicle is slow, so the displacement between consecutive frames is quite small. Therefore the architecture proposed by Yang can accelerate the algorithm to reach real-time performance with fair resource requirements.
2) Post Processing Element: The Post Processing Element (PPE) is one of the critical parts of our design. The results from the array of PEs are pipelined into the PPE. The PPE computes the correlation criteria defined in the equation ( reduced space occupied on the FPGA device and generates an exact result value. Figure 8 shows the computations performed by the PPE. The last step of the algorithm compares all the measurements corresponding to every candidate match. The result of the algorithm is the coordinates of the pixel with the highest correlation score.
IV. IMPLEMENTATION AND ANALYSIS
The purpose of this work is the implementation of the proposed motion estimation algorithm on a target FPGA device. This was accomplished by describing the algorithm in VHDL and then synthesising it for the FPGA device. Prior to any hardware design we chose to implement a MATLAB software version corresponding to every step of the algorithm. MATLAB is a tool which provides procedural routines that operate on images represented as matrices. The implementation must be flexible, which means that when some of the parameters of the algorithm are changed, the newly generated hardware must remain valid. The architecture was designed in such a way as to permit these parameters to be changed. Computational complexity affects the level of parallelism, since many multiplications and accumulations must be performed at the same time. When talking about flexibility we can refer either to the architecture or to the implementation. The architecture must be able to support variation of the algorithm's parameters, such as the number of corners and the correlation and search window sizes. This implies parametrisation of the implementation. As FPGA implementation allows optimisation at bit level, these parameters determine the bit size of the accumulator results, which in turn affects the computational requirements of the PPE. Table II shows the hardware requirements and the performance of Yang's architecture applied to different correlation and search window sizes. The required resources are quantified using Logic Elements (LEs) and DSP blocks. LEs are the basic logic blocks of the selected FPGA device architecture and DSP blocks are embedded multipliers of the FPGA device. The performance is given in ms and represents the latency introduced by the normalised correlation algorithm for the detection of 100 matches using a reasonable clock frequency of 10 MHz.
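The dependence of the accumulator bit size on the window size can be made concrete with a short calculation. The sketch below is an illustration only: it assumes unsigned grey levels of `pixel_bits` bits (8 by default, which is not stated in the text) and worst-case inputs, and the function name is hypothetical.

```python
def accumulator_bits(a, pixel_bits=8):
    """Worst-case accumulator widths for a (2a+1) x (2a+1) correlation window.

    Assumes unsigned pixels of pixel_bits bits (assumption, not from the paper).
    """
    n = (2 * a + 1) ** 2                      # pixels per correlation window
    max_pixel = (1 << pixel_bits) - 1         # largest grey level
    max_sum = n * max_pixel                   # bound for a plain accumulation
    max_mac = n * max_pixel * max_pixel       # bound for a multiply-accumulation
    return {
        "window_pixels": n,
        "sum_bits": max_sum.bit_length(),
        "mac_bits": max_mac.bit_length(),
    }
```

Under these assumptions a 3 x 3 window needs 12 bits for a plain accumulation and 20 bits for a multiply-accumulation, and the operands that reach the PPE grow accordingly, which is one reason the window sizes influence the resources listed for each configuration.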
Choosing an adequate description language allows migration of the design to different hardware platforms. Moreover, complexity reduction means avoiding floating-point units, which are very expensive in area. Both the corner detection and the matching algorithm make use of the division operation. These algorithms must be transformed so that the computation is based only on fixed-point arithmetic. Silicon must be used optimally to implement the computation, so the data storage and the control parts must be minimised [14].
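As an illustration of the kind of transformation that keeps the computation in integer arithmetic, the sketch below ranks two candidate matches without evaluating a square root or a division, by cross-multiplying the squared scores. This is one possible technique chosen for illustration and is not claimed to be the transformation used in the design. It assumes the score has the form num / sqrt(den) with integer num and den built from the five window sums; all names are hypothetical.

```python
def score_terms(wc, wr):
    """Integer numerator and squared denominator of the correlation score."""
    n = len(wc)
    s_c, s_r = sum(wc), sum(wr)
    s_cc = sum(c * c for c in wc)
    s_rr = sum(r * r for r in wr)
    s_cr = sum(c * r for c, r in zip(wc, wr))
    num = n * s_cr - s_c * s_r
    den = (n * s_cc - s_c * s_c) * (n * s_rr - s_r * s_r)
    return num, den

def better(cand, best):
    """True if cand scores strictly higher than best, using integers only.

    A candidate with den <= 0 (flat window) is never preferred (assumption).
    """
    (na, da), (nb, db) = cand, best
    if da <= 0:
        return False
    if db <= 0:
        return True
    if (na > 0) != (nb > 0):
        return na > 0                  # a positive score beats a non-positive one
    lhs, rhs = na * na * db, nb * nb * da
    # Same sign: compare squared scores, reversing the test for non-positive ones.
    return lhs > rhs if na > 0 else lhs < rhs
```

The price of removing the square root and the division is wider multipliers, since the cross products combine the bit sizes of both numerators and denominators; whether that trade is worthwhile depends on the multiplier resources of the target device.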
