Abstract
Introduction
A down-looking camera mounted on an underwater vehicle provides rich information for its navigation system [5]. Correct robot navigation requires real-time performance of tasks such as motion detection. The image processing tasks associated with motion detection algorithms rely on mathematical techniques dominated by convolution, correlation, filtering and least squares, among others. Given the size of the images (768×576, 416×288 pixels), these tasks have a high computational cost, so executing them in real time (25 frames per second) requires fast processing systems. On general-purpose computers, achieving this performance is a great challenge. The shortest execution time can be achieved by programming the application at gate level. However, computer vision algorithms are quite complicated and require high flexibility in their implementation. Reconfigurable computing combines the advantages of both approaches: implementation at a very low level together with high flexibility and rapid prototyping [4].
Our work focuses on low-level image processing algorithms for motion estimation, where a large amount of data has to be processed. This paper explores the possibility of implementing in hardware tasks such as interest point detection and the matching procedure. Correlation algorithms have important properties such as regularity and modularity. Thus, they can be divided into computational blocks which can be processed in parallel. There is an extensive literature on array architectures applied to image processing, especially on Block Matching Algorithms (BMA) for motion estimation [2, 8, 10]. Komarek et al. [8] described specific solutions for array architectures of full-search BMA, proposing four different alternatives for one- and two-dimensional array architectures. In the early nineties, various VLSI designs were proposed for decreasing BMA computation time [1, 2]. While in full-search BMA the image is divided into blocks and the algorithm looks for matches of every block in a frame, our approach looks for correspondences of interest points. These are scene features which can be reliably found when the camera moves from one location to another and lighting conditions change. In addition, a more complex error measurement criterion, normalised correlation [13], is applied.
The remainder of this paper is structured as follows. Section 2 presents the motion estimation algorithm and defines the real-time constraints. The detailed architecture is described in Section 3. Finally, Section 4 outlines conclusions and future work.
Analysis of the motion estimation algorithm for its parallelization
The goal of this algorithm is to estimate the motion of an underwater robot. To estimate the motion, correspondences have to be found between the current image acquired by the camera and a reference image. This often means detecting features in one image and matching them in another. The selection of features may depend on the application, although points are commonly used because they can be easily extracted and are quite robust to noise [7]. However, matching those features in the second image is normally a complex task. Underwater images are difficult to process due to the transmission properties of the medium and non-uniform illumination [6]. These aspects can produce bad correspondences (outliers), which introduce errors in the motion estimation process. Some authors have proposed a normalised correlation to reduce the influence of non-uniform illumination [13].
Corner detector
In our algorithm, motion is estimated by computing the planar homography between the current image I_c and a previous reference image I_r. The first step in solving the correspondence problem is the detection of a set of well-contrasted points in the current image. Corner detection algorithms start by computing the image gradient components I_x and I_y by convolving the current image with the Prewitt masks. Benedetti et al. [2, 3] proposed a modified version of the Tomasi-Kanade [11] algorithm which reduces the computation and avoids floating-point arithmetic. In this algorithm a matrix G is considered.
is found. Every pixel having:
is retained, where λ_t is the imposed lower bound for the solutions of equation (2). The last step of the algorithm discards any pixel which is not a local maximum of P_λt(i, j). The N interest points with the highest values of P_λt(i, j) are then selected. This approach considerably reduces the complexity and does not require any floating-point operation.
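The detection steps described above can be sketched in integer arithmetic as follows. The equations for G, P_λt and the retention condition are not reproduced in this text, so the sketch assumes their usual form: G = [[a, b], [b, c]] with a, b, c the sums of I_x², I_x·I_y and I_y² over a 3 × 3 window, P_λt = (a − λ_t)(c − λ_t) − b², and a pixel retained when P_λt > 0 and a > λ_t. The function names and the tie handling in the non-maximum suppression are illustrative choices, not taken from the paper.

```python
import numpy as np

def window_sum3(a):
    """Sum of each 3x3 neighbourhood (valid region only)."""
    h, w = a.shape
    out = np.zeros((h - 2, w - 2), dtype=np.int64)
    for di in range(3):
        for dj in range(3):
            out += a[di:di + h - 2, dj:dj + w - 2]
    return out

def prewitt(img):
    """Integer Prewitt gradients (I_x, I_y) on the valid region."""
    i = img.astype(np.int64)
    ix = (i[:-2, 2:] + i[1:-1, 2:] + i[2:, 2:]) - (i[:-2, :-2] + i[1:-1, :-2] + i[2:, :-2])
    iy = (i[2:, :-2] + i[2:, 1:-1] + i[2:, 2:]) - (i[:-2, :-2] + i[:-2, 1:-1] + i[:-2, 2:])
    return ix, iy

def detect_corners(img, lambda_t, n_points):
    """Return up to n_points tuples (row, col, P) sorted by decreasing P."""
    ix, iy = prewitt(img)
    a = window_sum3(ix * ix)
    b = window_sum3(ix * iy)
    c = window_sum3(iy * iy)
    # Assumed form of the characteristic polynomial evaluated at lambda_t.
    p = (a - lambda_t) * (c - lambda_t) - b * b
    # Assumed retention condition: P > 0 and a > lambda_t.
    p = np.where((p > 0) & (a > lambda_t), p, 0)
    # 3x3 non-maximum suppression (ties are kept, an illustrative choice).
    h, w = p.shape
    padded = np.pad(p, 1, mode="constant")
    neigh_max = np.zeros_like(p)
    for di in range(3):
        for dj in range(3):
            if di == 1 and dj == 1:
                continue
            neigh_max = np.maximum(neigh_max, padded[di:di + h, dj:dj + w])
    rows, cols = np.nonzero((p > 0) & (p >= neigh_max))
    order = np.argsort(-p[rows, cols], kind="stable")[:n_points]
    # Offset of 2: one border pixel is lost to the Prewitt masks, one to the window sum.
    return [(int(rows[k]) + 2, int(cols[k]) + 2, int(p[rows[k], cols[k]])) for k in order]
```

All intermediate values stay within 64-bit integers for 8-bit images, which mirrors the absence of floating-point operations noted above.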
Correspondence problem
Once interest points are detected in the current image, we search for their correspondences in the reference image. Local grey-level correlation is often applied to detect matches between the pair of images. A correlation algorithm provides, for each interest point p_c = (x_c, y_c) of the current image, its corresponding match p_r = (x_r, y_r) in the reference image. The correlation score is defined as the covariance between the grey levels of a region defined by the correlation window in the current image and the same region defined in the reference image. The algorithm evaluates all candidate windows inside the corresponding search window. A normalised correlation criterion C, which ensures that the result is not altered in the presence of non-uniform illumination, is shown in equation (4). This criterion was applied to underwater images [5], where non-uniform illumination is always present.
where α = (n − 1)/2 and n × n is the size of the correlation window. Ī_c(x_c, y_c) and Ī_r(x_r, y_r) are the average intensities and σ²(·) denotes the variance of each correlation window. The algorithm compares the correlation scores of all pixels within the search window and selects the highest one.
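The search described above can be illustrated with a short sketch. It assumes that the criterion C has the standard zero-mean normalised cross-correlation form (covariance divided by the product of the standard deviations), since the equation itself is not reproduced in this text. Candidate centres are restricted so that the n × n correlation window stays inside the (2p + 1) × (2p + 1) search window. The function names `ncc` and `best_match` are hypothetical.

```python
import numpy as np

def ncc(win_c, win_r):
    """Zero-mean normalised cross-correlation of two equally sized windows."""
    dc = win_c.astype(np.float64) - win_c.mean()
    dr = win_r.astype(np.float64) - win_r.mean()
    denom = np.sqrt((dc * dc).sum() * (dr * dr).sum())
    # A flat window has zero variance; treat it as uncorrelated.
    return 0.0 if denom == 0 else float((dc * dr).sum() / denom)

def best_match(cur, ref, xc, yc, alpha, p):
    """Return ((x_r, y_r), score) of the best candidate for interest point (xc, yc).

    Images are indexed [row, column] = [y, x]. The caller must keep the
    search window inside the image.
    """
    win_c = cur[yc - alpha:yc + alpha + 1, xc - alpha:xc + alpha + 1]
    reach = p - alpha  # candidate centres span (2p + 1) - 2*alpha positions per axis
    best_pos, best_score = None, -2.0
    for dy in range(-reach, reach + 1):
        for dx in range(-reach, reach + 1):
            xr, yr = xc + dx, yc + dy
            win_r = ref[yr - alpha:yr + alpha + 1, xr - alpha:xr + alpha + 1]
            score = ncc(win_c, win_r)
            if score > best_score:
                best_pos, best_score = (xr, yr), score
    return best_pos, best_score
```

Because the score subtracts the mean and divides by the standard deviations, a gain or offset change between the two images leaves the score of the true match unchanged, which is the property that motivates the normalised criterion under non-uniform illumination.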
As the number of interest points increases, the correlation approach becomes very time-consuming. For this reason we propose a decomposition of criterion C for its parallelization. We can observe that there are five sums to be computed in equation (4): sum_1, sum_2, sum_3, sum_4 and sum_5.
This decomposition simplifies the parallel implementation, since each Processing Element (PE) of the architecture computes these five sums in parallel. The Post-Processing Element (PPE) then performs the remaining computation.
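A minimal sketch of this split between PE and PPE is given below. It assumes that the five sums are the two plain accumulations Σ I_c and Σ I_r and the three multiply-accumulations Σ I_c², Σ I_r² and Σ I_c·I_r, which matches the operation count stated in the text. The ordering of the sums and the function names are illustrative, and the final formula is the standard normalised correlation rather than a transcription of equation (6).

```python
import math

def five_sums(win_c, win_r):
    """PE part: one pass over the flattened windows, integer arithmetic only."""
    s1 = s2 = s3 = s4 = s5 = 0
    for ic, ir in zip(win_c, win_r):
        s1 += ic        # accumulation of I_c
        s2 += ir        # accumulation of I_r
        s3 += ic * ic   # multiply-accumulate I_c^2
        s4 += ir * ir   # multiply-accumulate I_r^2
        s5 += ic * ir   # multiply-accumulate I_c * I_r
    return s1, s2, s3, s4, s5

def post_process(sums, n_pixels):
    """PPE part: combine the five sums into the normalised correlation score."""
    s1, s2, s3, s4, s5 = sums
    num = n_pixels * s5 - s1 * s2
    var_c = n_pixels * s3 - s1 * s1   # n^2 times the variance of the current window
    var_r = n_pixels * s4 - s2 * s2   # n^2 times the variance of the reference window
    if var_c <= 0 or var_r <= 0:
        return 0.0                    # flat window: score undefined, report 0
    return num / math.sqrt(var_c * var_r)
```

Only the post-processing step needs a square root and a division, and it runs once per candidate block rather than once per pixel.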
Finding correspondences is the most time-consuming part of our algorithm. Let us consider N interest points detected in the current image. For every interest point, we search among [(2p + 1) − 2α]^2 possible correspondences, where p = (q − 1)/2 and q × q is the size of the search window. Considering the decomposition of the correlation criterion into five sums as described in equation (5), two accumulations and three multiplication-accumulations have to be computed for every pixel of the correlation window. Combining these sums into the correlation criterion, see equation (6), requires twelve additional computation steps for every candidate block. At frame rate f_r, the complexity of the correspondence problem becomes:
For N_p = 200, α = 7, p = 14 and a frame rate of 25 frames per second, the complexity O_p reaches 2036 GOPS (Giga Operations Per Second). Our approach tries to reduce this complexity by parallelizing the correspondence problem. Real-time feature detection is also achieved.
Proposal of a Parallel Architecture

Corner detector hardware implementation
The current image is read from memory and the goal of the corner detector is to provide the memory addresses of N interest points of the image. The first step in corner detection is the computation of the image gradient components I_x and I_y by convolving the current image with a set of 3 × 3 Prewitt masks. Benedetti et al. [3] proposed an implementation based on two FIFOs and two buffers used to delay the incoming pixel. The left column of Figure 1 shows the block diagram corresponding to each step in corner detection. The Data Flow Graphs (DFG) for the image convolution with the Prewitt masks and for summing the elements inside a window, taken to be 3 × 3, are shown on the right side of Figure 1. The next step consists of computing P_λt from equation (2) and rejecting the values which do not satisfy the conditions of equation (3). Non-maximum suppression is carried out using a 3 × 3 window. In order to retain the N pixels with the highest value of P, a pipeline of N Sort-Processing Elements (SPE) is proposed. Each SPE compares the input pixel value with the one stored in its buffer and retains the larger one. An external signal can empty the SPE buffers at the end of each frame.
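The behaviour of the SPE pipeline can be modelled in software as follows. The sketch assumes that each SPE forwards the smaller of the two compared values to the next stage, which the text implies but does not state, and it processes the stages sequentially, whereas the hardware stages would operate concurrently. The function name `spe_pipeline` and the (address, value) stream format are hypothetical.

```python
def spe_pipeline(stream, n):
    """Keep the n largest values of a stream, together with their addresses.

    stream: iterable of (address, value) pairs, e.g. pixels that passed
    non-maximum suppression. Returns a list of (value, address) pairs in
    decreasing order of value.
    """
    buffers = [None] * n                  # one buffer per SPE, emptied at frame start
    for address, value in stream:
        item = (value, address)
        for k in range(n):
            if buffers[k] is None:        # empty SPE: store the item and stop
                buffers[k] = item
                item = None
                break
            if item[0] > buffers[k][0]:   # SPE keeps the larger value ...
                buffers[k], item = item, buffers[k]   # ... and forwards the smaller
        # An item that leaves the last SPE is discarded.
    return [b for b in buffers if b is not None]
```

Since each stage always holds a value at least as large as the next one, the buffers contain the N strongest corners in sorted order at the end of the frame, and their addresses can be read out directly.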
The delay introduced by the corner detector is important, since we are interested in the memory addresses of the N corners rather than in their cornerness values. Every 3 × 3 window generator introduces a latency of two lines and two pixels. The delay introduced by the computation of P_λt is shown in equation (8) and depends on the image size (M_i × N_i), the pixel sampling time (t_s) and the number of time cycles of the computational blocks from Figure 1: Prewitt (t_P), Sum (t_S) and Compute P_λt (t_C).
Parallel implementation of the correspondence problem
For every interest point we look for correspondences in the reference image. When mapping an algorithm onto an array of processors, the problem is to access multiple data to feed all the processing elements (PE) at the same time. Yang et al. [10, 12] proposed a solution which consists of a local data exchange between PEs. This approach uses two memory accesses for the reference image (r_1, r_2) and one for the current image (c), see Figure 2. Once read from memory, the data are broadcast to every PE. Buffers are used to delay data and multiplexers to switch between data. For high utilization efficiency of the architecture, the size of the search window must depend on the size of the correlation window, according to the equation p = 2α. The number of PEs is also determined by the size of the correlation window and is equal to (2α + 1). A schematic representation of the specific hardware architecture is shown in Figure 2. One PE is in charge of the parallel computation of the five sums defined in equation (5). Two accumulations and three multiplication-accumulations are executed in parallel (Figure 3(a)). For a given interest point, the time needed to compute the five sums of equation (5) is given by:
where ∆t is the time required for one computational level. After T_p seconds, the post-processing element can compute the correlation criterion from equation (6). The DFG for this computation is shown in Figure 3(b). Hardware implementation of square roots and division operations is crucial for
