The main goal of the proposed project is to enhance a mobile robot with evolutionary optimization capabilities for tasks like egomotion estimation and/or obstacle avoidance. The robot will learn to navigate different environments and will adapt to changing conditions. This implies the implementation of vision-based robot navigation using artificial vision computed with on-board FPGAs. This paper aims to contribute to the implementation of real-time motion extraction from a video feed using embedded FPGA circuits.
General information
One of the most important choices at the design level of a vision system is the selection of the image acquisition hardware. For instance, there are alternatives to cameras modeled on the vertebrate single-lens eye, such as insect-like compound eyes. The selection of the processing hardware for implementing these artificial vision systems can be critical for a high-value outcome. A newly developed method for displacement estimation, presented in this paper and extended with the fusion of inertial and visual sensor data, will be implemented on board the processing hardware unit of the selected camera. This requires special work on optimizing the data processing in terms of computational load. One needs to make use of simulation tools as well as camera and robot prototypes for the implementation and optimization of control methods. The last phase of the research will aim at benchmarking the developed systems and at testing and verifying the implemented applications.
Scientific background of the research
Although recognition of various obstacles is easy for humans, it proves to be a very challenging task for machine vision systems. As the results and experience of known research groups [1, 2] show, in order to cope with the problem's difficulty one needs to design a complex, multi-class, multi-feature, multi-stage classification system. Building a proper environment model is an important requirement. It has to be accurate and has to incur a low computational cost. For instance, applications focusing on driving assistance use object detection algorithms that are based on grouping 2D or 3D points. The representations of the detected objects are built on geometric primitives such as 3D cuboids [4] or 2D bounding boxes [3]. Alternatively, the polyline object representation is also a very common basis for further algorithms, which can achieve higher speeds since they use more compact point subsets.
One of the most important choices at the design level of a vision system is the selection of the image acquisition hardware. For instance, there are alternatives to cameras modeled on the vertebrate single-lens eye, such as insect-like compound eyes [5] that are developed by prestigious research groups [6]. These offer a dynamically adaptable structure with panoramic field of view, low distortion and aberration, and good temporal resolution while yielding high spatial resolution alongside a reduced size [7]. These properties are highly useful for visually controlled navigation, specifically for tasks like take-off, landing, collision avoidance and other optically driven responses, which do not require high-resolution image acquisition. The local sensory adaptation capabilities of insect compound eyes can compensate for significant changes in light intensity at the photoreceptor level and distribute information in a neuronal circuitry, resulting in fast and low-power integrated signal processing.
The selection of the processing hardware for implementing these artificial vision systems can be critical for a high-value outcome. Being at the center of the group's research activity, the new generations of SRAM-based FPGA devices are a proper choice for the implementation of reconfigurable computing platforms that need accelerated processing in real-time systems. On the other hand, the hardware-software co-design problem is more complex in system development because the components need to be more advanced. The possibility of manipulating FPGA configurations at runtime using on-chip resources is investigated in [8, 9]. The requirements for runtime partial reconfiguration capability in embedded applications can be sustained by storing multiple bitstream generation choices, including direct bitstream manipulation for logic blocks and hybrid one-dimensional and two-dimensional physical area relocation control modules.
Embedded Implementation of a Resource-Efficient Optical Flow Extraction Method 165
The main goal of the proposed project is to enhance a mobile robot with evolutionary optimization capabilities for tasks like egomotion estimation and/or obstacle avoidance. The robot will learn to navigate different environments and will adapt to changing conditions. Using the run-time reconfiguration properties of modern digital reconfigurable hardware-based (FPGA) platforms, an otherwise days-long evolutionary cycle of a physical robot can be cut down to a matter of milliseconds. By implementing this technique, the most common issues that emerge when using evolutionary simulation (modeling the real-world environment as accurately as possible and modeling only those characteristics of the robot that are relevant for achieving the desired behavior) are avoided.
Drawing on the authors' experience [10, 11] in hardware/software co-design, this project undertook the development of a low-power, low resource-cost data acquisition system for a motion detection sensor. The authors dealt with the design and implementation of a new method for egomotion estimation based on visual sensor fusion. This paper pays special attention to the development of a system that extracts optical flow from test video sequences at low computational cost. We present the use of simulation tools and prototype platforms for the implementation and optimization of the proposed system.
The developed method for resource-cost-efficient optical flow extraction
If we denote the image by a matrix A, its values represent the gray levels of the image points. When a grayscale pixel is represented on 8 bits, these values vary between 0 and 255. In Figure 1 we can see a frame of the test video sequence named Grid. The images of this and other sequences were used as inputs to the algorithm for calculating the Optical Flow (OF), developed in Visual C++.
The gradient in an image is a vector indicating the direction of variation of image intensity (grayscale variation direction). This can be determined by calculating the value difference of adjacent image points. Consider a new matrix B which contains the gradient values of the matrix A. Using the values adjacent to the pixel p in the calculation of the gradient results in a properly aligned gradient. Detection of outliers in this gradient then leads to the detection of edges in images. This method, however, is sensitive to noise and luminance variations. The effect of noise can be reduced by also calculating the average values of the gradient in the orthogonal direction. The horizontal gradient used so far is obtained by calculating the difference between the values of two columns.
This can be represented as a filter matrix of the form:

$$\begin{bmatrix} -1 & 0 & 1 \end{bmatrix}$$

Vertical averaging can be obtained by adding rows to this matrix:

$$\begin{bmatrix} -1 & 0 & 1 \\ -1 & 0 & 1 \\ -1 & 0 & 1 \end{bmatrix}$$

In order to reduce the noise sensitivity of the method, we have studied the possibilities of determining the average value of the gradient calculated from the video images. Similarly, horizontal averaging may be obtained using a vertical mask of the form:

$$\begin{bmatrix} -1 & -1 & -1 \\ 0 & 0 & 0 \\ 1 & 1 & 1 \end{bmatrix}$$
The result of these operations is placed at the location indexed by the central element of these matrices. In fact, these 3x3 mask matrices are a basic form in these types of applications, but can have many variations by changing the weighting of the cells. We ran experiments with two of the well-known mask matrices from the literature: the first is the one developed by Roberts, the second is Sobel's. These methods are effective, as demonstrated by the abundance of their applications in the literature. It is also important to mention that these methods require fewer resources for implementation in digital hardware than other methods such as Canny, LoG (Laplacian of Gaussian), Prewitt or Frei-Chen.
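As an illustration of how such a 3x3 mask is applied, the following minimal Python sketch computes horizontal and vertical gradients with the standard Sobel weights from the literature. It is not the authors' Visual C++ code: the function name `apply_mask`, the list-of-lists image layout and the choice to leave border pixels at zero are assumptions made for this example.

```python
# Standard Sobel masks from the literature (weights assumed, not taken from the paper).
SOBEL_H = [[-1, 0, 1],
           [-2, 0, 2],
           [-1, 0, 1]]   # responds to horizontal intensity changes
SOBEL_V = [[-1, -2, -1],
           [ 0,  0,  0],
           [ 1,  2,  1]]  # responds to vertical intensity changes


def apply_mask(image, mask):
    """Apply a 3x3 mask to a grayscale image given as a list of rows.

    The result is stored at the position of the mask's central element;
    border pixels, where the mask does not fit, are left at 0 (assumption).
    """
    rows, cols = len(image), len(image[0])
    out = [[0] * cols for _ in range(rows)]
    for r in range(1, rows - 1):
        for c in range(1, cols - 1):
            out[r][c] = sum(
                mask[i][j] * image[r - 1 + i][c - 1 + j]
                for i in range(3)
                for j in range(3)
            )
    return out


# A 4x4 test image with a vertical step edge between columns 1 and 2.
step = [[0, 0, 255, 255] for _ in range(4)]
b_h = apply_mask(step, SOBEL_H)  # strong response along the edge
b_v = apply_mask(step, SOBEL_V)  # no response: all rows are identical
```

Because the masks contain only the weights 0, 1 and 2, each multiplication reduces to a sign change or a one-bit shift, which is one reason masks of this kind map well onto digital hardware.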
A. Optical flow computation experiments with different video sequences as input data
In order to test and validate the algorithm, which was first developed and implemented in software, we chose three different video sequences. Two of them are real and the third video is an animation.
In the top left corner of Figure 1 we can see a frame from one of the video sequences used as test data, called Grid (31 frames with a resolution of 320x240, with 8 bits/pixel). These images were used as inputs to the algorithm for calculating the optical flow (OF).
Examples of calculating the horizontal (top right) and vertical (bottom left) gradient of the Grid sequence of video frames can be seen in Figure 1. The detection of horizontal and vertical edges is the next step performed, as the bottom right section of Figure 1 shows.
In Figure 2 we can see one frame of the test video sequence called Anim (51 frames with a resolution of 200x200, with 8 bits/pixel). We have also implemented a feature of the program that calculates the gradient direction with the following trigonometric relationship:

$$\Phi = \arctan\left(\frac{B_v(j, k)}{B_h(j, k)}\right)$$

where $B_v$ and $B_h$ are the vertical and horizontal gradient matrices.
After reading the frames of the video sequence files, the first operation performed by the method testing program is a Gaussian filtering with a filter matrix of 5x5 pixels. This first step is followed by the calculation of vertical, horizontal and combined gradients, with results stored separately. The algorithm continues with the positive and negative edge detection based on the frame intercorrelations, and then proceeds to determine the optical flow.
The effort invested in writing this software without the use of existing function libraries for image processing has paid off in the next phase of the project, presented in this paper: the FPGA hardware implementation of the method using a hardware description language (VHDL) and the Xilinx ISE development environment (Design Suite 14.7).
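The gradient direction relationship used by the test program, Φ = arctan(B_v / B_h), can be sketched in Python as follows. This is only an illustration under stated assumptions: the function name `gradient_direction` is hypothetical, and `math.atan2` is swapped in for a plain arctangent so that points with a zero horizontal gradient do not cause a division by zero. The paper does not say how the original program treats that case.

```python
import math


def gradient_direction(b_v, b_h):
    """Return the per-pixel gradient direction in radians.

    b_v and b_h are the vertical and horizontal gradient matrices
    (lists of rows of equal size). atan2 is used instead of
    atan(b_v / b_h) so that b_h == 0 is handled (assumption).
    """
    return [
        [math.atan2(v, h) for v, h in zip(row_v, row_h)]
        for row_v, row_h in zip(b_v, b_h)
    ]


# One row with three sample gradients: diagonal, purely vertical, purely horizontal.
phi = gradient_direction([[1, 1, 0]], [[1, 0, 1]])
```

For positive horizontal gradients the result coincides with the arctangent of the ratio; for negative ones `atan2` additionally resolves the quadrant, which the plain ratio cannot do.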
B. Description of the system designed and built for parallelized implementation on FPGA
In Figure 3, polygons with a green background symbolize the BRAM (Block RAM) modules of the FPGA circuit. These were configured using the IP Core Generator tool of the Xilinx ISE Design Suite to store a selected video frame sequence (grayscale images of 200x200 pixels with 8-bit resolution), using 10 of 39kbits of BRAM memory. We developed a double pipe-line structure to parallelize the execution of operations. The calculation steps determined in the C++ program were implemented here in separate modules that are synchronized by a finite state machine (FSM). Observe the two parallel pipe-lines, processing data from two consecutive frames of video.
Each of these performs the following steps:
• Scanning the image to determine the minimum, maximum and average values, data needed for subsequent calculations, scaling, etc.
• Running the matrix Gaussian filtering algorithm.
• Reading consecutive pixel values, which are inserted into the pipe-line that runs several phases: vertical and horizontal gradient computation, positive and negative edge detection, and determination of the gradient direction.
• After completing these calculations, saving the results in separate BRAM modules.
• Based on these partial results, which can be computed from two consecutive images in a synchronized manner, calculating their intercorrelation.
• This yields the corresponding OF values.
• The OF values are scaled and accumulated from several pairs of images in the sequence.
• The end result is saved in the dedicated OF BRAM memory, from where it can be passed on to an application that will use it.

Figure 3 - Block diagram and sequence of operations implemented on the FPGA
The state diagram of the finite state machine (FSM) controlling one thread of the pipe-line structure is shown in Figure 4. Note the loop formed by the states 1, 2, 3 and 5, corresponding to the data input phase from the BRAM memory (Frame Buffer in Figure 4) and the computation of the image parameters. The FSM then passes to the second loop (states 4, 5, 6, 7 and 8), where it performs the calculations of the gradient, edge detection, OF, etc. The last state saves the results. The novelty consists in a method able to detect the image parameters while running the filter algorithm, thus saving an entire image scanning cycle.
The first studies on optical flow computation date back to 1980, and many alternative methods have been proposed since. They can be based on gradient, correlation, energy and phase methods, creating well-defined groups [12]. Gradient methods are based on the evaluation of spatial-temporal derivatives. The first such methods were presented by Horn and Schunck [13] and by Lucas and Kanade [14]. All these methods are difficult to implement in digital hardware due to their high resource cost. The new, resource-efficient motion estimation method developed by our team uses an OF extraction algorithm consisting of the following steps:
a) Based on the detection results of the previous stages, from each frame of the video sequence we generate a flag matrix signaling the edge positions in the image. A flag value (logical 1 bit value) is placed at the x, y coordinates of the generated matrix in the vicinity of the locations where an edge is detected, with a width of 2 to 3 pixels.
b) The next step consists in scanning the flag matrix pairs generated from consecutive frames with a 5x5 pixel window to determine the local direction of travel (motion) of the existing edges.
Figure 5 - The introduced method for determining the local displacement of edges
As can be seen in the examples in Figure 5, the evaluation windows are divided into four quadrants of 3x3 pixels, and the resulting value of the direction of movement is saved at the coordinates of the intersection of these quadrants.
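The window evaluation can be illustrated with the following Python sketch. Only the geometry comes from the text: a 5x5 window over the edge flag matrices of two consecutive frames, split into four overlapping 3x3 quadrants that intersect at the window center. The scoring rule (comparing how many edge flags each quadrant gains or loses between the frames) and all function names are assumptions introduced for this example, since the paper does not spell out the decision rule.

```python
# Top-left corners of the four 3x3 quadrants inside a 5x5 window.
# They overlap in the middle row and column and intersect at the center.
QUADRANTS = {
    "up_left": (0, 0),
    "up_right": (0, 2),
    "down_left": (2, 0),
    "down_right": (2, 2),
}


def quadrant_counts(flags, r, c):
    """Count the edge flags in each quadrant of the 5x5 window centred at (r, c)."""
    counts = {}
    for name, (dr, dc) in QUADRANTS.items():
        top, left = r - 2 + dr, c - 2 + dc
        counts[name] = sum(
            flags[top + i][left + j] for i in range(3) for j in range(3)
        )
    return counts


def local_direction(prev_flags, next_flags, r, c):
    """Return (dx, dy) in {-1, 0, 1}: the assumed direction of edge motion.

    Hypothetical rule: edges move toward the quadrants whose flag count
    grows between the two frames. dx > 0 means right, dy > 0 means down.
    """
    before = quadrant_counts(prev_flags, r, c)
    after = quadrant_counts(next_flags, r, c)
    gain = {name: after[name] - before[name] for name in before}
    dx = (gain["up_right"] + gain["down_right"]) - (gain["up_left"] + gain["down_left"])
    dy = (gain["down_left"] + gain["down_right"]) - (gain["up_left"] + gain["up_right"])
    sign = lambda v: (v > 0) - (v < 0)
    return sign(dx), sign(dy)


# A vertical edge moving from column 1 to column 3 between two frames.
frame_a = [[1 if col == 1 else 0 for col in range(5)] for _ in range(5)]
frame_b = [[1 if col == 3 else 0 for col in range(5)] for _ in range(5)]
motion = local_direction(frame_a, frame_b, 2, 2)  # window centred at (2, 2)
```

The rule uses only counting, subtraction and sign tests on 1-bit flags, in line with the resource-efficiency goal, but other decision rules are equally compatible with the description.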
Test results of the developed embedded OF extraction system
Translating (synthesizing) programs written in hardware description languages such as VHDL or Verilog does not result in a series of instructions executed sequentially, but in a design of the digital logic circuit required to perform the described algorithm.
In this respect, testing and debugging these programs is achievable through circuit simulation techniques. To simulate a digital circuit, however, we need time-varying signals fed to its inputs that cause the system to react, as expected or otherwise, as reflected by the variation of the output signals. These stimulus signals can be generated by implementing corresponding VHDL functional simulation modules.
These VHDL simulation codes can check the outputs of the module under test (UUT - Unit Under Test) for the generated inputs, and return status messages or error signals if errors are detected. The Xilinx environment provides the ISim simulator, which generates a graphical representation of the input signals, the internal signals of the UUT and the outputs in the form of timing diagrams.
In this section of the paper we present a few of these diagrams, for the implemented OF extraction project.
In Figure 6 one can follow the partial simulation of one thread of the pipeline structure in Figure 3. Note the double addressing of the dual-port BRAM memory to get two values simultaneously in the same clock cycle. The finite state automaton executes the first loop (states s1, s2, s3 and s5) to control the sequential reading of the BRAM and the calculation of the image parameters. After reaching the highest memory address (0 ... 39 999, for an image of 200x200 pixels), the FSM transitions to state s4, where the final values of the calculated parameters are available. It is important to note the time required for these operations, which is 800 µs according to the same timing diagram.
One of the steps that was difficult to implement in hardware was the calculation of the image mean values, because it requires at least one division operation, which can be implemented cheaply in a digital circuit only for divisors equal to $2^n$. To solve this problem and minimize the error introduced by dividing by the $2^n$ values closest to the actual divisor, we used the following method: division was achieved by using shift registers, and the error was reduced by averaging the results of shifting by two consecutive values. The image in Figure 7 shows a complete execution cycle of the developed algorithm by the FSM pipe-line control structure.
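The shift-and-average division used for the image mean can be sketched in Python as follows. The function name `shift_mean` and the choice of the two powers of two that bracket the divisor are assumptions of this example; the text only states that the results of two consecutive shifts are averaged.

```python
def shift_mean(total, count):
    """Approximate total / count using only shifts and one addition.

    k is the largest exponent with 2**k <= count. If count is an exact
    power of two, a single shift is exact. Otherwise the results of
    shifting by k and by k + 1 are averaged (assumed bracketing rule).
    """
    k = count.bit_length() - 1
    if count == 1 << k:
        return total >> k
    high = total >> k        # divides by 2**k, overestimates the mean
    low = total >> (k + 1)   # divides by 2**(k+1), underestimates the mean
    return (high + low) >> 1


# A 200x200 image (40 000 pixels) whose true mean gray level is 100.
pixel_sum = 100 * 40_000
approx = shift_mean(pixel_sum, 40_000)
```

For this 200x200 example the two shifts alone give 122 and 61, while their average gives 91, which is closer to the true mean of 100 than either single shift.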
One can observe in Figure 7 the progress of computing the image minimum and maximum values, followed by the second loop, with the gradient computations and scaling. It should be noted that in this case the total execution time is approximately 1.6 ms. As the timing diagram in Figure 7 shows, each partial result obtained in state s8 is saved and sent to the next component, and is finally placed in the BRAM memory labeled Combined gradient memory image in Figure 3. Since the hardware resources required to implement this computation flow occupy only about 1-2% of the capacity of the FPGA circuit used (a Xilinx Virtex 5 FX30T) (Table 1), this structure could be instantiated more than twice, thus leading to a more efficient parallelized structure with the possibility of processing multiple frames of the video sequence simultaneously. This extension, however, is restricted by the number of available FPGA BRAM memories (68). In its current form, with only two parallel pipe-line structures (two frames processed simultaneously), the project uses at least 40 BRAM modules.
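A rough bound on this extension follows from the BRAM figures above. It assumes, which the text does not state, that BRAM use scales linearly with the number of pipe-lines, i.e. about 40 / 2 = 20 modules per pipe-line:

$$N_{\max} = \left\lfloor \frac{68}{20} \right\rfloor = 3$$

Since 40 modules is given as a lower bound for two pipe-lines, three parallel pipe-lines would be the upper limit under this assumption, with BRAM rather than logic capacity as the limiting resource.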
Conclusions
Therefore, there is room for expansion in this type of project, but with certain limitations. The alternative is, however, the use of the dedicated processor module (PowerPC440 core) of the FPGA used for the execution of those tasks that require sequential steps. On the other hand, by introducing this component into the system, other problems can be solved, such as accessing the external DDR-2 RAM modules of the used OPUS FPGA development platform, as well as real-time image acquisition as input, using peripheral interfaces attached to it.
All these avenues of development will be studied and, if favorable feasibility is found, will be exploited in later stages of the research project.
The viability of the implementation results will need to be validated by demonstrating the method with a mobile robot. The final demonstration will show the collision-free, (semi-)autonomous drive of a mobile robot or even of a group of collaborating robots in a highly cluttered environment. The implemented systems will yield a new class of artificially intelligent robots that can adapt their hardware structure in order to behave better in a changing environment. Individual robots or collaborating groups of robots with these abilities could be used in a variety of reconnaissance or monitoring tasks. For instance, the capability to assimilate and share acquired knowledge about the environment can be useful in scenarios where hazardous spaces need to be explored and mapped quickly (e.g. searching earthquake-damaged buildings).
