In this paper we propose a specialized hardware architecture for the real-time visual navigation of a mobile robot. The adopted navigation method is based on a two-step approach. Features are extracted and matched over an image sequence captured by a video camera mounted on the mobile robot during its motion. As a result, a 2D motion field is recovered and used to extract ego-motion parameters. Our hardware implements the first step of the method, which consists of feature extraction and raw match computation by means of radiometric similarity. Real-time performance is possible because a processing rate of 40 MHz is achieved.
Introduction
Navigation is the capability of a mobile robot to move safely in the environment, avoiding obstacles. In such a context, an important role is played by artificial vision. In fact, a significant amount of useful information on three-dimensional motion can be obtained from image sequences which result from the movement both of a video camera and of 3D objects. We refer to the ability of an autonomous agent to determine its motion with respect to the environment as passive navigation. Ego-motion parameters for passive navigation are efficiently recovered by means of a displacement vector field analysis. Such a vector field represents correspondences of two-dimensional features (extracted in successive images of a sequence) with 3D features in space. A small number of such displacement vectors on the image plane is sufficient to obtain useful information on ego-motion parameters. In the literature, two frameworks approach the 'finding correspondences' (matching) problem: direct and optimization methods. Both frameworks include a low-level step consisting of feature extraction from images. Then, direct methods use local constraints in order to find correspondences [1, 2, 3]. Conversely, optimization methods use global constraints to formulate an energy or cost function and detect correspondences by minimizing this function. Such minimization is generally achieved using iterative techniques. While the direct methods are fast but more sensitive to noise, the optimization-based techniques are more reliable but require burdensome processing.
In [4] we described a two-step algorithm for passive navigation. The effectiveness of this algorithm was extensively verified by means of software-based autonomous motion tests. This method computes the heading direction from a displacement vector field determined through a feature-based approach. It extracts features in an image and performs a two-step matching in successive images of the acquired sequence: the first step detects raw correspondences using a standard correlation-based technique; the second step refines the raw matches by minimizing an energy function. A performance analysis indicated the first step as the phase requiring the most burdensome processing. In this paper, we propose specialized hardware which speeds up the first step of matching, making it possible to perform passive navigation in real time. This hardware has been designed keeping in mind three specific constraints: modularity of the hardware structures; ease of implementation of the algorithm; and a data-flow approach that avoids the need for complex software programs to exploit the hardware computational power. In fact, devices designed for high nominal performance often have many degrees of internal parallelism. On the other hand, a full and efficient exploitation of such devices requires expert software developers and becomes increasingly difficult as the number of stages, arithmetic units, and control lines grows.
As an example of this, before approaching the design of the hardware solution proposed in this paper, we verified the possibility of using commercially available hardware. We tested the performance of the MAXPCI architecture provided by Datacube Inc. This architecture consists of a set of hardware modules (convolver, histogrammer, warper, etc.), organized as a pipeline processor on the PCI bus and designed to reach a pixel processing rate of 50 MHz. Leaving aside its high cost and the lack of a friendly programming environment, the MAXPCI efficiently performs image-based operations. Nevertheless, it is not designed for pixel-based operations (e.g., searching for a local maximum). As a consequence, our approach, implemented on the MAXPCI architecture, allowed real-time performance only when the vehicle moved at low speed (10 cm/s), whereas practical applications often require higher speeds (e.g., 1 m/s) to be attractive. Therefore, setting aside sophisticated solutions, we tried to map the algorithm described in [4] onto a simple hardware architecture, based on the use of several Look Up Tables (LUTs).
Note that in our application, independently from the arithmetic unit (LUT-based or Full-Adder based), memory access represents the throughput limiting factor of the whole architecture, supposing it is pipelined. In fact, the data to be processed at any clock cycle are those grabbed in a frame memory by the video camera. Therefore, even if FA-based arithmetic units can be in general faster than the employed LUT-based units, they cannot speed up the pipelined computation of the examined case, which is bounded by the latency of these memories. In other words, replacing the proposed LUT-based units with faster units does not produce any improvement in terms of computing power, since the throughput of the pipe remains bounded by the latency of the grabbing frame memory (which is assumed not to be shorter than the latency of the LUTs).
As a consequence, we chose the LUT-based approach for the arithmetic units, since it allowed both a faster development of the whole architecture and an easier pipelining for high-speed processing [8]. In our architecture, operands are used to address the specific arithmetic LUT storing the related results, which can be obtained directly in a single access. The use of LUTs has already been exploited in DSP applications (e.g., [7]). We adopted the Residue Number System (RNS): LUTs implementing RNS-based arithmetic are smaller than LUTs based on conventional arithmetic [6, 7, 8]. A more compact FA-based device could be the object of further research.
Moreover, the goal of low complexity for the resulting hardware was also reached by means of a short dynamic range for the input data. By a software simulation carried out on a general-purpose computer, we verified that the algorithm described in [4] can also operate successfully on images with a quantization of 5 bits/pixel. Therefore, dropping a few bits from the usual image quantization of one byte/pixel brought a great benefit in terms of the amount of memory needed by the LUTs. After the minimum quantization level needed by the algorithm was derived by means of software simulation, the hardware design and its CAD implementation were carried out with the aim of answering the following questions. Is it really possible to develop a complex algorithm in its entirety by using the above-mentioned approaches? What is the processing rate achievable by LUT-based structures, using current off-the-shelf technology? How large and expensive could this specialized hardware be when it is developed under the described constraints?
The paper is organized as follows. In the following section the whole heading estimation technique is described. Next, we present a high level description of the hardware architecture. Implementation details and performance analysis are provided in the next section. Some experimental results are then presented.
F. MARINO ETAL.
Image Motion Estimation
The heading algorithm is based on the estimation of correspondences among features which are extracted from successive images. These correspondences are detected by imposing the cross-ratio invariance and are estimated only for features of ''high'' interest such as corners or edges. Such ''high interest'' features are selected using the ''interest operator'' introduced by Moravec, which isolates points having minimal autocorrelation values [2]. The method works in two stages:
1. It computes the SAD (Sum of Absolute Differences) between neighbouring pixels in four directions (horizontal, vertical and two diagonals). Such SADs are computed over a window (q × q) using Eqns (1)-(4), one for each direction: horizontal (0°), H(x, y); vertical, V(x, y); and the two diagonals, D1(x, y) and D2(x, y),
where I(x, y) is the brightness function computed at the pixel (x, y). The smallest value among V(·), H(·), D1(·), and D2(·) is called the interest operator value and is taken as the variance for pixel (x, y).
2. It chooses as high-interest features the pixels where the interest operator values are local maxima.
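The two stages above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the authors' implementation: Eqns (1)-(4) are not reproduced here, so the sketch assumes that each directional variance sums the q − 1 absolute differences between adjacent pixels along the line through the window centre, with q odd. The function names are hypothetical.

```python
def directional_variances(img, x, y, q):
    """Return the four directional SADs at pixel (x, y); img is indexed img[y][x]."""
    h = q // 2
    # Assumed step vectors (dx, dy) for the four directions.
    steps = {"H": (1, 0), "V": (0, 1), "D1": (1, 1), "D2": (1, -1)}
    out = {}
    for name, (dx, dy) in steps.items():
        total = 0
        for k in range(-h, h):  # q - 1 pairs of adjacent pixels
            a = img[y + k * dy][x + k * dx]
            b = img[y + (k + 1) * dy][x + (k + 1) * dx]
            total += abs(a - b)
        out[name] = total
    return out


def interest_value(img, x, y, q):
    """Stage 1: the smallest directional variance is the interest value."""
    return min(directional_variances(img, x, y, q).values())


def interest_map(img, q):
    """Interest value of every pixel whose q x q window fits in the image."""
    h = q // 2
    rows, cols = len(img), len(img[0])
    return {(x, y): interest_value(img, x, y, q)
            for y in range(h, rows - h)
            for x in range(h, cols - h)}


# Stage 2 (simplified to a single global maximum): an isolated bright pixel
# wins, while a straight edge scores zero because one direction has no change.
spot = [[0] * 7 for _ in range(7)]
spot[3][3] = 31
values = interest_map(spot, 3)
best = max(values, key=values.get)
```

Taking the minimum over the four directions is what rejects straight edges: along the edge direction the variance is zero, so only corner-like points keep a high value.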
Once N high-variance features $p_i = (x_i, y_i)$ are extracted in the first image, the best possible candidate ''matching features'' (matches) $q_i = (x'_i, y'_i)$ are selected in the second image using a correlation-based measure (radiometric similarity): given $p_i = (x_i, y_i)$ in the first image, we select the point $q_i = (x'_i, y'_i)$ in the second image which minimizes the SAD of Eqn (5).
In Eqn (5), w is the size of the square window on which the SAD is evaluated, and $I$ and $I'$ represent the image brightness functions associated with the first and second image, respectively. The points having the smallest SAD are considered ''raw matches''.
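A minimal software sketch of this raw matching step follows. It assumes that the SAD is the sum of absolute brightness differences over two w × w windows placed on the feature and on the candidate, and it searches the whole second image exhaustively; the window offsets and the function names are assumptions made for illustration.

```python
def sad(img1, p, img2, q, w):
    """Sum of absolute differences between the w x w windows at p and q."""
    (x1, y1), (x2, y2) = p, q
    lo = -(w // 2)  # assumed window offsets: lo .. lo + w - 1
    return sum(abs(img1[y1 + j][x1 + i] - img2[y2 + j][x2 + i])
               for j in range(lo, lo + w)
               for i in range(lo, lo + w))


def raw_match(img1, p, img2, w):
    """Return the pixel of img2 whose window has the smallest SAD with p."""
    rows, cols = len(img2), len(img2[0])
    lo = -(w // 2)
    hi = lo + w - 1
    candidates = [(x, y)
                  for y in range(-lo, rows - hi)
                  for x in range(-lo, cols - hi)]
    return min(candidates, key=lambda q: sad(img1, p, img2, q, w))


# A small blob around (3, 3) in the first image reappears shifted by (+2, +1).
first = [[0] * 8 for _ in range(8)]
first[3][3], first[3][4], first[4][3] = 31, 10, 20
second = [[0] * 8 for _ in range(8)]
second[4][5], second[4][6], second[5][5] = 31, 10, 20
match = raw_match(first, (3, 3), second, 3)
```

When several candidates share the same SAD, `min` keeps the first one found, which is one reason the raw matches are only an initial guess.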
In practice, since we compute correspondences among regions, false matches are unavoidable: the correct match point may not coincide with the center of the most highly correlated window. Also, several candidate matches can give the same SAD value. Therefore, we use the raw matches computed by means of correlation only as an initial guess to be refined by an optimization approach which is based on the cross-ratio invariance.
The cross-ratio is the most popular geometric invariant of four collinear points or five coplanar points, since it is the simplest numerical property of an object that is unchanged under perspective projection.
Five coplanar points $Q = (p_1, p_2, p_3, p_4, p_5)$ have the familiar cross ratio $CR(Q)$ as their projective invariant (Eqn (6)), where $\sin(\alpha_{ij})$ is the sine of the angle $p_i p_5 p_j$.
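The invariance can be checked numerically. The sketch below assumes one common form of the five-point cross ratio, sin(α13)·sin(α24) / (sin(α23)·sin(α14)), with p5 as the vertex of the angles; the form used in Eqn (6) may arrange the indices differently. The homography H and the point coordinates are arbitrary illustrative values.

```python
import math


def sin_angle(p_i, p_j, vertex):
    """Signed sine of the angle p_i - vertex - p_j."""
    ax, ay = p_i[0] - vertex[0], p_i[1] - vertex[1]
    bx, by = p_j[0] - vertex[0], p_j[1] - vertex[1]
    cross = ax * by - ay * bx
    return cross / (math.hypot(ax, ay) * math.hypot(bx, by))


def cross_ratio(points):
    """Cross ratio of five coplanar points, using the fifth as the vertex."""
    p1, p2, p3, p4, p5 = points
    s = lambda a, b: sin_angle(a, b, p5)
    return (s(p1, p3) * s(p2, p4)) / (s(p2, p3) * s(p1, p4))


def project(H, p):
    """Apply a 3x3 projective transformation to a 2D point."""
    x, y = p
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


P = [(0.0, 0.0), (4.0, 1.0), (3.0, 5.0), (-1.0, 3.0), (1.0, 2.0)]
H = [[1.0, 0.2, 3.0], [0.1, 0.9, -2.0], [0.001, 0.002, 1.0]]
Q = [project(H, p) for p in P]
cr_first, cr_second = cross_ratio(P), cross_ratio(Q)  # equal up to rounding
```

Signed sines are used so that the ratio stays invariant even when the projection changes the orientation of some angles.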
We use the geometric invariance of the cross ratio of coplanar points both to verify the goodness of the matches estimated by correlation similarity and to correct mismatches. The idea is based on the assumption that planar surfaces are frequent in indoor environments (e.g., walls, tables). Moreover, this assumption is not a limitation. Since this constraint can be satisfied only if the considered features are coplanar, we use a global optimization process that both tries to satisfy the cross-ratio similarity and takes into account the radiometric similarity of the features. For each subset of five coplanar points $P_{ijklm} = \{p_i, p_j, p_k, p_l, p_m\}$ in the first image, the corresponding points in the second image $Q_{ijklm} = \{q_i, q_j, q_k, q_l, q_m\}$ should have the same cross ratio. The method takes advantage of considering many intersecting subsets of five points obtained as combinations of the available sparse features. If the features lie on different planes, the minimization approach constrains only coplanar features to influence one another, producing subsets of non-coplanar features. Information is cooperatively propagated among subsets and, owing to the imposed radiometric similarity, matches among non-coplanar features are avoided.
In addition, we propose to solve the correspondence problem by minimizing the sum of all differences between the cross ratio computed for each subset of five features in the first image and the cross ratio computed for the corresponding points in the second image. The derived energy function which has to be minimized can be formalized as:
where:
• $CR(P_{ijklm})$ and $CR(Q_{ijklm})$ denote the cross ratios estimated, respectively, for the subset of five coplanar points $P_{ijklm}$ in the first image and for the subset of corresponding points $Q_{ijklm}$ in the second image.
• $D(P_{ijklm})$ denotes the Euclidean distance among the points $\{p_i, p_j, p_k, p_l, p_m\}$ in the first image. This factor is introduced because nearby features have a higher probability of being coplanar.
• The term $R_i$ imposes that corresponding features in the first and in the second image must have a radiometric similarity.
The proposed approach converges to the desired correspondence points of the subset $Q_{ijklm}$ by implementing gradient descent along the $E(Q_{ijklm})$ surface, which expresses the quadratic cost function's dependency on all of the points of $Q_{ijklm}$. The input data are correspondences estimated for sparse features (characterized by a high directional variance) extracted by Moravec's interest operator [2] in the image acquired at time t. Optimal correspondences are estimated by iteratively updating the raw corresponding features in the second image. To reduce the processing time, the expensive steps consisting of high-variance feature extraction and raw match computation are performed directly by LUT-based hardware.
Hardware Architecture High-Level Description
A logical scheme of the architecture [5] is shown in Figure 1 .
At the $i$-th step of a sequence, an image of (DIM × DIM) pixels is acquired, and:
• N features are extracted in order to be matched over the image which has to be acquired at the $(i+1)$-th step;
• N corresponding points (one for each feature which was extracted over the image acquired at the $(i-1)$-th step) are detected.
The Computing Blocks performing the above operations are the ''Interest Block'' and the ''Correspondence Block'' respectively. These blocks can operate in parallel, since feature extraction and match computation are independent tasks.
The pixels of the image, provided by an external frame grabber, flow into a pair of interleaved frame memories through a pipe of S shift registers (S = max(q, w), where q and w are the sizes of the square windows used for computing, respectively, the directional variances and the SAD). S − 1 of these shift registers have size DIM, whilst the last one has size S.
When frame memory 0 is full (filled by the $i$-th image), frame memory 1 is write-enabled in order to store the $(i+1)$-th image, and so on. Both images must be stored: while features can be extracted from the current image as its data flow through the shift registers, the match computation also needs the previously acquired image.
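The role of the shift-register pipe can be illustrated with a small software model. The sketch assumes raster-order pixel input and models the S registers as a single delay line of (S − 1)·DIM + S pixels, from which an S × S window is tapped at each clock; `stream_windows` is a hypothetical name, and the model ignores the frame memories.

```python
from collections import deque


def stream_windows(pixels, dim, s):
    """Yield (v, window) for every clock at which a full s x s window is in the pipe.

    pixels: raster-order pixel stream of a dim x dim image.
    v:      clock counter, i.e. the raster index of the newest pixel.
    """
    depth = (s - 1) * dim + s  # s - 1 registers of size dim, one of size s
    pipe = deque(maxlen=depth)
    for v, px in enumerate(pixels):
        pipe.append(px)  # shift one pixel in; the oldest one drops out
        if len(pipe) < depth:
            continue  # pipe not yet filled
        if v % dim < s - 1:
            continue  # window would wrap across two image rows
        window = [[pipe[r * dim + c] for c in range(s)] for r in range(s)]
        yield v, window


# 4 x 4 image whose pixel values equal their raster index, 2 x 2 window.
windows = list(stream_windows(range(16), 4, 2))
```

Because each window is available as soon as its last pixel arrives, feature extraction can proceed while the image is still being grabbed, which is the point made in the text.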
The Interest Block
The Interest Block computes the directional variances among neighboring pixels in four directions in a q × q window (Eqns (1)-(4)), selects the smallest value as the ''interest value'' and associates it with the central pixel. The block shown in Figure 4 compares the interest value with the current maximum in order to detect the largest one and to store it in the register MAX (at the same time it updates the register CKMAX with the new address).
Because the main purpose of the Interest Block is to estimate a set of N features to be matched over the next image, the acquired image is partitioned into N regions (each one having DIM × DIM/N pixels). For each region, a local maximum among all the interest values is evaluated, and the address of the related pixel is stored in the register $V_i$ (where i = 0, 1 depends on the current frame memory) in order to recover the feature and to match it over the next image (Figure 2). When the $k$-th ($k = 0, \ldots, N-1$) region is processed, each estimated interest value is compared with the current maximum (which is stored in the MAX register). The interest value updates the current maximum if it is higher. At the same time, the value of a ShiftCounter (i.e., v) is written in the register CKMAX. The value v is directly related to the coordinates (x, y) of the new maximum through Eqns (8) and (9). In practical cases, DIM = $2^i$, and Eqns (8) and (9) can be trivially solved by considering the binary word encoding v as the concatenation of y and x (each one having i bits). Moreover, in these cases, v represents the address of the current maximum in the frame memory.
Finally, when the $k$-th region has been completely processed, the value stored in the register CKMAX is fed into the $k$-th cell of the register $V_i$ (i = 0, 1 depending on the current frame memory), and the register MAX is reset in order to correctly begin the analysis of the $(k+1)$-th region.
The interaction described above between the computing blocks and the frame memories may seem complex, but it is really simple: when frame memory 1 is storing the $(i+1)$-th image, the Interest Block is evaluating the N features on the $(i+1)$-th image and storing their addresses in the register $V_1$. Simultaneously, the Correspondence Block is evaluating, over the $(i+1)$-th image, the candidate matches of the features which were extracted in the $i$-th image and whose addresses are stored in the register $V_0$ (for this reason, the $i$-th image must be held in frame memory 0).
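The MAX/CKMAX mechanism and the decoding of v can be sketched as follows. Since Eqns (8) and (9) are not reproduced here, the sketch assumes the raster relation v = y·DIM + x (consistent with reading v as the concatenation of y and x) and assumes that the N regions are consecutive blocks of DIM·DIM/N pixels in raster order. All names are illustrative.

```python
def split_address(v, bits):
    """Split v into (x, y) when DIM = 2**bits: low bits are x, high bits are y."""
    return v & ((1 << bits) - 1), v >> bits


def region_maxima(interest_values, n_regions):
    """Model of the MAX / CKMAX registers; returns the content of register V."""
    region_len = len(interest_values) // n_regions
    v_register = []
    for k in range(n_regions):
        max_reg, ckmax = -1, None  # MAX is reset at the start of each region
        for v in range(k * region_len, (k + 1) * region_len):
            if interest_values[v] > max_reg:  # update only if strictly higher
                max_reg, ckmax = interest_values[v], v
        v_register.append(ckmax)
    return v_register


# 4 x 4 image (DIM = 2**2), N = 2 regions of 8 pixels each.
values = [0, 1, 5, 2,
          0, 0, 0, 0,
          3, 3, 9, 1,
          0, 7, 0, 0]
addresses = region_maxima(values, 2)
coords = [split_address(v, 2) for v in addresses]
```

With DIM a power of two, no division is needed to recover the coordinates: masking and shifting the counter value are enough.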
The Correspondence Block
The Correspondence Block (Figure 5) computes the SAD (according to Eqn (5)) between the features (x, y) extracted from the previous image and the data (x′, y′) flowing in the pipe of shift registers. At each clock cycle, the pixels of the image being acquired are shifted through the pipe, and a new SAD is computed. The current SAD is compared with the current minimum (stored in a MIN register) and, if it is smaller, it is written in the MIN register. At the same time, the value v of the ShiftCounter is saved in the CKMIN register. In this way, when the image has been completely processed, the CKMIN register indicates the value v which directly gives the coordinates (see Eqns (8) and (9)) of the point having the best match (the minimum error). This value represents the output of each Correspondence Block.
Table 1 (rows 3 to 7; columns: number of LUTs | LUT size | operation).
… | … | … (1)-(4), other levels ($m_1$)
5×4 | 256 words, 4 bits/word | sum in Eqns (1)-(4), other levels ($m_2$)
3+1 | 1K words, 5 bits/word | subtraction ($m_1$), tree of comparison
3+1 | 256 words, 4 bits/word | subtraction ($m_2$), tree of comparison
3+1 | 512 words, 1 bit/word | RNS to sign conversion (comparison)
Figure 3. The ''<Block'' used to compare the variances in order to detect the minimum one (the Interest value).
Though the architecture could have N Correspondence Blocks, all of them working in parallel to match the features at the same time, only one Correspondence Block is really necessary. In fact, we can reasonably assume that two consecutive images are grabbed after a small motion of the camera. Therefore, if we consider images subdivided into regions of size (NK × DIM), with NK < DIM, the probability that the ''high interest'' feature of the $k$-th region of the first image has its match in the $k$-th region of the second image is high. As a result, a single Correspondence Block can be reused in turn to find the match for each of the N features. Note that, if the feature is located close to the boundary of the region, the above assumption might not hold; nevertheless, the loss of some features is not a problem for the heading estimation.
Implementation Details and Performance Analysis
In order to reduce the dimensions of the LUTs mapping the arithmetic tables, we looked for the lowest dynamic range for the pixels of the acquired images which can be adopted without a decrease in accuracy. The algorithm in [4] was tested using different quantizations for the input images. These experiments showed that a 5-bit quantization of the image data does not limit the performance in terms of accuracy, while also producing great benefits in hardware design as well as in computing time. In addition, the Residue Number System (RNS, Appendix A) has been chosen to reduce the memory requirement and to achieve a suitable speed-up. Table 1 shows the amount of memory required to implement the Interest Block.
This Block first computes (q − 1) absolute differences (ADs) among adjacent pixels according to Eqns (1)-(4), along four directions. In order to perform these operations by means of LUTs, we employ memories as specified in row one of Table 1. Because of the absolute value, the input as well as the output dynamic range is [0, 31]. Therefore, the two 5-bit inputs play the role of a 10-bit address in a LUT storing 5-bit words.
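The addressing scheme of such a LUT is easy to model in software. The sketch below builds a 1K-word table indexed by the concatenation of the two 5-bit operands; the order of the operands within the address is an assumption, and the list only stands in for the ROM.

```python
BITS = 5  # pixel quantization used in the text

# 1K-word table: the address is the concatenation of the two 5-bit operands.
AD_LUT = [abs((addr >> BITS) - (addr & ((1 << BITS) - 1)))
          for addr in range(1 << (2 * BITS))]


def abs_diff(a, b):
    """Absolute difference of two 5-bit pixels obtained by a single table access."""
    return AD_LUT[(a << BITS) | b]
```

Every stored word fits in 5 bits because the largest possible difference is 31, which matches the word size quoted above.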
The resulting ADs must be added according to the external sum in Eqns (1)-(4) in order to obtain V(·), H(·), D1(·), and D2(·). Therefore, the dynamic range [0, 31] is expanded to [0, 186]. Finally, these values have to be subtracted from each other to detect the minimum value in the tree of comparison. In this step, the dynamic range becomes [−186, 186], which would require 8.53 bits. Therefore, we introduce a conversion from binary to a (5+4)-bit RNS, S′. The conversion of a 5-bit binary datum x to S′ is implemented by a LUT (row two of Table 1), which maps $x \to |x|_{m_2}$ only; the mapping $x \to |x|_{m_1}$ is not necessary, because $|x|_{32} = x$.
The external sums required by Eqns (1)-(4) are performed in S′ by the LUTs described in rows 3 and 4 of Table 1, concerning $m_1$ and $m_2$, respectively.
The interest value is detected by three comparators selecting the minimum value among V(·), H(·), D1(·), and D2(·). This interest value is then compared with the current maximum as described previously. Therefore, 3+1 comparators are needed. They compute the RNS difference between the two values to be compared (rows five and six of Table 1). The sign of the result is then derived by a LUT having (5+4)-bit addresses (row seven of Table 1) and used to select the output of the comparator, as shown in Figure 3. Note that the full RNS/binary conversion is not needed.
Similarly, Table 2 shows the amount of memory required to implement the Correspondence Block. This Block computes 32 ADs among corresponding pixels belonging to the windows shown in Figure 5, according to Eqn (5) (first level). The required memories are specified in row one of Table 2. Because of the absolute value, the input as well as the output dynamic range is [0, 31]. Therefore, the two 5-bit inputs play the role of a 10-bit address in a LUT storing 5-bit words.
The computed ADs must be added according to the external sum in Eqn (5) in order to obtain the SAD. Therefore, the dynamic range [0, 31] is expanded to [0, 992]. The SAD has to be subtracted from the current minimum SAD in order to detect the ''raw match''. In this step, the dynamic range becomes [−992, 992], which would require 10.95 bits. Therefore, we introduce a conversion of the ADs computed in the first level from binary to a (5+5+2)-bit RNS, S″ = {$m_1$ = 32 (5-bit modulus), $m_2$ = 31 (5-bit modulus), $m_3$ = 3 (2-bit modulus)}.
The conversion of a 5-bit binary datum x to S″ is implemented by a LUT (row 2 of Table 2), which maps $x \to (|x|_{m_2}, |x|_{m_3})$, since $|x|_{32} = x$.
Figure 5. The Correspondence Block. It is mainly composed of: (a) two windows (w = 6) taken respectively over the shift registers and over the frame memory storing the previously acquired image; (b) a tree computing the SAD between these windows; (c) a terminal block comparing the current SAD with the current minimum in order to detect the smallest one and to store it in the register MIN (at the same time it updates the register CKMIN with the new address).
Figure 6. The images were acquired while the vehicle is translating on a rectilinear path. The distance between the two frames is 500 mm. Estimated FOE: x = 710, y = 15.
The external sums required by Eqn (5) are performed in S″ by the LUTs described in rows three (for $m_1$ and $m_2$) and four (for $m_3$) of Table 2. The smallest SAD is detected by comparing the SAD with the current minimum as described in a previous section. The required comparator computes the RNS difference (rows 5 and 6 of Table 2). The sign of the result is then derived by a LUT having (5+5+2)-bit addresses (row seven of Table 2) and used to select the output of the comparator. Note that in this Block, too, the full RNS/binary conversion is not needed.
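The comparison scheme used in the Correspondence Block can be sketched in software with the moduli 32, 31 and 3 quoted in the text. In this illustration the modular operators stand in for the per-modulus arithmetic LUTs, the sign table is filled for the range [−992, 992], and the packing of the three residues into a 12-bit address is an assumption.

```python
MODULI = (32, 31, 3)  # 5-bit, 5-bit and 2-bit moduli; their product is 2976


def to_rns(value):
    """Residues of an integer; Python's % keeps negative values in range."""
    return tuple(value % m for m in MODULI)


def rns_add(a, b):
    return tuple((x + y) % m for x, y, m in zip(a, b, MODULI))


def rns_sub(a, b):
    return tuple((x - y) % m for x, y, m in zip(a, b, MODULI))


def address(residues):
    """Pack the three residues into a (5+5+2)-bit address (assumed layout)."""
    r1, r2, r3 = residues
    return (r1 << 7) | (r2 << 2) | r3


# 4096-word, 1-bit sign table: 1 marks residue triples of negative values.
SIGN_LUT = [0] * (1 << 12)
for value in range(-992, 0):
    SIGN_LUT[address(to_rns(value))] = 1


def rns_sad(abs_diffs):
    """Accumulate the absolute differences entirely in residue form."""
    total = to_rns(0)
    for ad in abs_diffs:
        total = rns_add(total, to_rns(ad))
    return total


def is_smaller(sad_rns, min_rns):
    """True when sad < min, decided from the sign of the RNS difference."""
    return SIGN_LUT[address(rns_sub(sad_rns, min_rns))] == 1
```

Because 2976 exceeds the 1985 values in [−992, 992], each difference has a unique residue triple, so the sign can be read from a table without converting the result back to binary.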
The LUTs summarized in Table 1 and Table 2 have been implemented by means of Read Only Memories (ROMs) in a 0.7 μm CMOS Standard Cells-based technology (ES2 Library). For convenience, their data sheets are provided in Table 3 and in Table 4.
Figure 7. The images were acquired while the vehicle is moving on a curvilinear path. The distance between the two frames is 500 mm; the rotation angle is 1 degree. Estimated FOE: x = 126, y = 71.
Since the whole architecture is pipelined, it can run at the working frequency of the slowest LUT. Therefore, it is able to process sequences of images at a rate of 40 Mpixels/s. Such performance is fully satisfactory, since a 512 × 512 TV frame at 50 Hz has a rate of 13 Mpixels/s. The described hardware has not yet been built, but the whole architecture has been simulated in software, and all the assumptions have been verified. The hardware design has been performed using the CADENCE environment and all performances have been tested using CADENCE internal tools. The chip realization is at the masterization phase.
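As a quick check of the rates quoted in this section, the pixel rate of the TV signal and the resulting margin can be worked out directly from the stated frame size, frame rate and processing rate:

```latex
\[
512 \times 512 \times 50\ \text{Hz} = 13\,107\,200\ \text{pixels/s} \approx 13.1\ \text{Mpixels/s},
\qquad
\frac{40\ \text{Mpixels/s}}{13.1\ \text{Mpixels/s}} \approx 3.05 .
\]
```

The pipeline therefore has roughly a threefold margin over the input pixel rate.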
Experimental Results
Tests have been performed on image sequences acquired in our laboratory by a TV camera mounted on a pan-tilt head which is installed on our vehicle SAURO. The focal length of the TV camera was 6 mm.
The performance of the heading estimation algorithm is shown in Figures 6 and 7. Features extracted in the first image were successfully matched in the second one. Match results are shown in terms of the displacement vector field. The small black square in the figures represents the Focus of Expansion (FOE) position. The images in Figure 6 were acquired while the vehicle was performing a forward translational motion. The images in Figure 7 were produced by a curvilinear motion (forward translation combined with a rotation).
Conclusions
A VLSI architecture enabling both to select a set of features from an image and to match them over an image sequence has been described. Both extraction and matching steps are independently performed on each acquired frame.
The proposed architecture is able to process sequences of images at a rate of about 40 Mpixels/s. This computing power has been reached essentially because of the use of Look Up Tables, whose sizes are kept small by the 5-bit input quantization and the RNS arithmetic. The Interest Block (performing the extraction of the features) and the Correspondence Block (detecting the matches for the extracted features) can be integrated in a medium-size chip implemented in a 0.7 μm CMOS Standard Cells-based technology (ES2 Library). The study and design of the described hardware is motivated by the need for real-time image processing in the passive navigation tasks of our mobile robot SAURO. As soon as the hardware becomes available, it will be tested on the SAURO architecture.
