In computer vision, and more particularly in vision processing, the impressive evolution of algorithms and the emergence of new techniques dramatically increase algorithm complexity. In this paper, a novel FPGA-based architecture dedicated to active vision (and more precisely early vision) is proposed. Active vision appears as an alternative approach to deal with artificial vision problems. The central idea is to take into account the perceptual aspects of visual tasks, inspired by biological vision systems. For this reason, we propose an original approach based on a system on programmable chip implemented in an FPGA connected to a CMOS imager and an inertial sensor set. Built on reprogrammable devices, this system offers a high degree of versatility and allows the implementation of parallel image processing algorithms.
INTRODUCTION
Field-programmable devices (FPDs), and in particular FPGAs, have achieved rapid acceptance and growth over the past decade because they can be applied to a very wide range of applications [1]. One of the most interesting applications of FPGAs is the prototyping of designs to be implemented as gate arrays. Another is the emulation of entire large hardware systems. Apart from prototyping, an emerging topic for FPGA applications is their use in custom computing machines. This involves using the programmable parts to "execute" software, rather than compiling the software for execution on a regular CPU. In this context, the notion of a soft-core CPU, or of hardware overloading of the instruction set, becomes crucial. Such approaches offer a good tradeoff between the performance of fixed-functionality hardware and the flexibility of software-programmable substrates. These aspects are a great advantage in the design of an embedded sensing system, in particular when there are several data flows. Like ASICs, the main benefit of these systems is their ability to implement specialized circuitry directly in hardware. However, fast prototyping is easier with FPGAs. Consequently, in the design of a versatile embedded system dedicated to image processing, the FPGA solution proves to be the better choice.
In computer vision and especially in vision processing, the impressive evolution of algorithms and the emergence of new techniques drastically increase the complexity of algorithms. This computational aspect is crucial for the majority of real-time applications and in most cases programmable devices are the best option. For example, FPGAs have already been used to accelerate real-time point tracking [2] , stereovision computing [3] , color-based object detection [4] , and video and image compression [5] .
In this paper, an architecture dedicated to computer vision is proposed. Our approach towards a smart camera consists in performing most of the early vision processing at the sensor level, before transmitting the information to the main processing unit. This behavior is inspired by the human vision system, where the eyes are responsible for attention and fixation tasks, sending to the brain only pertinent information about the observed scene. As a result, the amount of visual data to be transmitted and analyzed is strongly reduced and communication bottlenecks can be avoided. The adaptation of perceptual aspects from biological vision to artificial systems, known as active vision and active perception, is briefly explained in Section 2 as the principal motivation of this work. Consequently, the main originality of this work is to use the concepts developed in active vision, and more generally in bio-inspired computer vision, in order to design suitable hardware. In Section 3, the hardware of the smart camera is described. The technological choices are argued according to the objectives given in the previous section. The different modules are fully described and the different data flows are explained. Section 4 presents the core of the FPGA design, in particular the specific modules like the address generation unit or the fixed pattern noise (FPN) correction unit. Finally, we present the results of two image processing algorithms (motion detection and a high-speed template tracking implementation).
ACTIVE VISION SYSTEMS
One of the numerous objectives in artificial vision research is to build computer systems that analyze images automatically, determining what the computer "sees" or "recognizes" and "understands" from the environment.
In what follows, the problem is to perform the process of interpretation of sensorial data within an environmental model. The first ways of treating the "vision problem" used passive vision and dynamic vision approaches. Passive vision comprises the classical analysis of images. The approach that David Marr explicitly advocated [6], to which many others subscribe, has led to a thriving research field that has been dominant in visual science in recent years. In Marr's words, "Vision is a process that produces from images of the external world a description that is useful to the viewer and not cluttered with irrelevant information." Marr proposes a model of visual processing that begins by identifying the "zero-crossings" (edges) in the image, uses this edge information to provide a crude segmentation of surfaces called the 2½D sketch, and finally extracts from this sketch the three-dimensional spatial information. That spatial interpretation is expressed in terms of geometrical primitives such as generalized cylinders or cones, so that the only data which must be explicitly stored are the x, y, z locations, the alpha, beta, gamma orientations, and the aspect ratios of the cylinders, together with a symbolic code of the relations between them. In this way, the complex scene is reduced to a highly compressed set of meaningful numbers. The problem with this model is that nobody has ever been able to define how such spatial information can be reliably extracted from the scene. Moreover, the visual world contains far too many ambiguities to be handled successfully. Dynamic vision is a complementary approach which corresponds to the study of visual information, but in an unbounded sequence of views. This approach introduces time into the image processing, and movement (measured by optical flow) is used in the perception process. Some classical approaches using these strategies revolve around recovering structure from motion.
In contrast to these two approaches, [8] [9] [10] have proposed the active vision approach. Active vision techniques are derived from attempts to simulate the human visual system. In human vision, head motion, saccadic eye movement, and the eye's adaptation to lighting variations are important in the perception process. Active vision therefore aims to simulate the power of this adaptation. In other words, active vision is an alternative approach to dealing with artificial vision problems. The central idea, also known as the task-driven paradigm, is to take into account the perceptual aspect of visual tasks. Therefore, instead of building a full 3D representation of the observed scene, the system is supposed to extract only the information useful for solving a given problem, through a task-driven observation strategy (Figure 1).
An artificial active vision system uses observer-controlled input sensors. Its main goal is to actively extract the information required to solve a given task. An extensive literature describes systems built around the active vision paradigm. The majority of these systems have been driven by the "robotic" approach and are based on a robotic head. A large survey up to 1996 can be found in [11, 12].
Another trend considers algorithmic aspects and focuses on gaze control using foveated sensors with a log-polar mapping. This method can be applied at the sensor level (imager), at the image processing level or both. At the sensor level, some dedicated imagers based on logarithmic-structured space-variant pixel geometry have been implemented. The main advantage of these methods is the ability to cover wide work spaces with high acuity and a small number of pixels. Several descriptions of the advantages of using space-variant, or logmap, architectures for video sensors have been proposed [13] [14] [15] . Another logmap device consists of an emulated sensor based on a conventional CCD and an image warp algorithm embedded on a microcontroller [16] . More recently, a new trend towards smaller active vision systems comparable in size to the human head is pushing the limit of motor, gearbox, and camera design [17] [18] [19] .
However, as mentioned above, most work dedicated to active vision systems is concentrated in the robotic field. In contrast, the main motivation for the work presented here is to propose a system directly inspired by the human visual system. Consequently, our approach needs a dedicated architecture, for which the FPGA proves to be essential. This architecture is presented and discussed in the next section.
ARCHITECTURAL FEATURES
The main purpose of our architecture is to allow the implementation of early vision processes as in the human or primate visual system. In these systems, it is well known that the first neural layers (in the retina) prefilter the visual data flow in order to select only the conspicuous information. From this prefiltered information, attentional processing allows the system to focus on the selected target. In the literature, several computational models of visual attention can be found. The first representative model was proposed by Koch and Ullman in [20] and has recently been revised by Itti et al. [21]. In these models, the purpose of the saliency map is to combine the "salient" or "conspicuous" location information from each of the lower feature maps into a global measure that determines how different a given location is from its surroundings. This technique is used to guide selective attention. The design of our active vision system is based on this kind of approach, where we assume that the strategy of visual processes can be divided into the following three successive tasks.
Attention
This is the initializing step of the process. Whole images are grabbed while the saliency maps are being built. These maps are built in parallel and represent/code conspicuity within the visual field along particular dimensions (e.g., color, orientation, or motion). The result of this step is a set of ROIs (Regions Of Interest).
Focusing
This step generates the geometry of an ROI (rectangular, tilted, foveal, circular, etc.) and optimizes the signal-to-noise ratio: contrast optimization in an ROI [22], tracking of an ROI in motion, and so forth.
High-level processing
This last step comprises different kinds of tasks such as identification and classification. The attention stage needs strong parallelization, on the one hand to respect real-time constraints, and on the other hand because of the intrinsic characteristics of the algorithms. Classical algorithms used in an attention task to build an efficient saliency map include motion detection, Gabor filtering, and color segmentation. However, the characteristics of particular visual tasks may require dedicated image processing, and only an FPGA approach allows such flexibility. For architectures such as these, a Stratix EP1S60 from Altera has been chosen; this choice is detailed below. The need for strong parallelization was also what led us to connect five 2 MB synchronous SRAM memory blocks. Each 2 MB memory has private data and address buses. Consequently, in the FPGA, five attention processes (using 2 MB each) can address all the memory at the same time, and an SDRAM module socket provides an extension of the memory to 64 MB (Figure 2).
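To make the combination of feature maps into a saliency map described above more concrete, the following minimal sketch shows one way of normalizing and weighting several maps into a single measure whose maxima yield the ROIs. The number of maps, their resolution, and the weights are illustrative assumptions, not values taken from our design.

```c
/* Illustrative fusion of feature maps into a saliency map (a minimal
 * software sketch; map count, resolution, and weights are assumptions,
 * not values taken from the implementation described in this paper). */
#define MAP_W  80
#define MAP_H  60
#define N_MAPS 3          /* e.g., motion, color, orientation */

void build_saliency(float maps[N_MAPS][MAP_H][MAP_W],
                    const float weight[N_MAPS],
                    float saliency[MAP_H][MAP_W])
{
    /* Clear the output map. */
    for (int y = 0; y < MAP_H; y++)
        for (int x = 0; x < MAP_W; x++)
            saliency[y][x] = 0.0f;

    for (int m = 0; m < N_MAPS; m++) {
        /* Normalize each feature map to [0, 1] before weighting it. */
        float max = 1e-6f;
        for (int y = 0; y < MAP_H; y++)
            for (int x = 0; x < MAP_W; x++)
                if (maps[m][y][x] > max)
                    max = maps[m][y][x];

        for (int y = 0; y < MAP_H; y++)
            for (int x = 0; x < MAP_W; x++)
                saliency[y][x] += weight[m] * maps[m][y][x] / max;
    }
}
```

In the actual attention stage, each feature map (e.g., motion, Gabor energy, color segmentation) would be produced by a dedicated hardware module running in parallel on its own memory bank.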
The focusing stage must control the imaging devices in order to address only the ROI and to optimize the analog signal conversion. That is the reason why the sensing board has been designed around a CMOS imager and a set of four digital-to-analog converters. A set of inertial sensors has been added in order to estimate the movements of the camera and improve the perception (stabilization and depth estimation [23]).
In our approach, the high-level processing has to be performed on a host computer rather than on the embedded system. In order to send the data, the smart camera is connected via a high-speed communication link (USB 2.0 or FireWire).
The embedded system is integrated into a modular architecture consisting of three boards: the sensing board, the processing board, and the communication board. An overview of the smart camera is shown in Figure 3, together with a structural description of the stacked three-board structure.
System on programmable chip features
As described in the previous section, the sensor was designed around a Stratix EP1S60 manufactured by Altera. This component enables a high density of integration (57120 logic elements). It also has three further advantages which guided our choice. Firstly, the Stratix is optimized to maximize the performance benefits of SOPC integration based on an NIOS embedded processor. The NIOS processor is a user-configurable soft-core processor, allowing many implementation and optimization options. The NIOS CPU is a pipelined general-purpose RISC microprocessor which supports both 32-bit and 16-bit architectural variants; both variants use 16-bit instructions. For our sensor, the main advantage of this soft-core processor is its extensibility and adaptability. Indeed, users can incorporate custom logic directly into the NIOS arithmetic logic unit (ALU). Furthermore, thanks to a dedicated bus (the Avalon bus), users can also connect custom peripherals to the on-chip processor within the SOPC. They can thus define their own instructions and processor peripherals to optimize the system for a specific application.
Secondly, the Stratix integrates DSP blocks. These embedded DSP blocks have been optimized to implement several DSP functions with maximum performance and minimum logic resource utilization. Each DSP block offers multipliers, adders, subtractors, accumulators, and a summation unit, functions that are frequently required in typical DSP algorithms. Each DSP block can also support a variety of multiplier bit sizes (9 × 9, 18 × 18, 36 × 36) and operation modes (multiplication, complex multiplication, multiply-accumulation, and multiply-addition), and can offer a DSP throughput of 2.8 GMACS per DSP block. The EP1S60 device has 18 DSP blocks that can support up to 144 9 × 9 multipliers. These embedded DSP blocks can also be used to create DSP algorithms and complex math routines in high-performance hardware. These can then be accessed as regular software routines or implemented as custom instructions on the NIOS CPU. For example, a computationally heavy algorithm can be implemented in hardware and invoked from software as a custom instruction. This gives designers the flexibility and portability of high-level software design, while maintaining the performance benefits of parallel hardware operations in FPGAs. Lastly, the Stratix device incorporates a configurable internal memory called TriMatrix memory. The TriMatrix memory is composed of three sizes of embedded RAM blocks. The Stratix EP1S60 TriMatrix memory includes 574 M512 blocks (32 × 18 bits), 292 M4K blocks (128 × 36 bits), and 6 M-RAM blocks (4 K × 144 bits). Each of these blocks can be configured to support a wide range of features and to synthesize a wide variety of RAM (FIFOs, dual-port memories). With up to 5 Mbits of fast RAM, the TriMatrix memory structure is therefore appropriate for handling the bottlenecks arising in sensor vision algorithms.
Sensing device
This module consists of a CMOS imager manufactured by Neuricam, two 2D accelerometers from Analog Devices, and three 1D gyrometers from Murata. The imager allows full 2D addressing with a column bus and a row bus. It has a resolution of 640 × 480 (VGA) and provides a broad dynamic range (120 dB) due to the logarithmic response of its pixel structure. Four digital-to-analog converters allow modification of the four analog voltages of the imager: analog signal offset, digital conversion range, voltage reference, and pixel precharge voltage. These four converters are used to optimize the conversion range over the imager's logarithmic response.
The inertial set is composed of two 2D linear accelerometers (ADXL311, Analog Devices) and three gyrometers (ENC-03M, Murata). These sensors are soldered onto the imager PCB and aligned with the imager axes. A single 8-input analog-to-digital converter allows conversion of the different axis measurements. It is important to note that a temperature sensor is included on this board to compensate for the inertial sensors' thermal drift (see Figure 4).
ARCHITECTURAL DESIGN
The major difference between biological and artificial vision systems most probably lies in their flexibility. In order to develop adaptive capacities, the hardware architecture previously described is designed to implement specific low-level processing dedicated to early vision. This low-level processing attempts to establish an efficient interface between sensitive elements and high-level perception systems. As explained in Section 3, our strategy for efficient visual perception is based on three layers: attention, focusing, and high-level processing. This approach adopts a pyramidal method which reduces the amount of data flow. Typically, we can consider a simple system built around an attentional module based only on color segmentation and a focusing module based on template tracking (Figure 5). In the following, a detailed description of the FPGA organization is presented. All the "standard modules" (FPN correction, addressing module) are described, and designs for the focusing and attentional modules are proposed.
Implementation approach
The implementation of such an approach requires the management, sequentially and concurrently, of the execution of the routines previously described. Indeed, each task-oriented execution (attention, focusing, and identification) is controlled by the results supplied, and these three layers may have to share regions of interest. Moreover, the information bottleneck located at the imager level should be continuously optimized to ensure high performance. In our hardware architecture, these functions are carried out by what we term a "Sequencer" (Figure 6-M0) and are performed on the NIOS soft-core processor. This solution has two main advantages. Firstly, we benefit from software flexibility to define the routines' interactions; and secondly, the soft-core processor allows an efficient architectural matching with the other parts of the supervision unit.
An internal RAM (Figure 6-R0) is used to store the instruction sequences which define the sequencer behavior according to the task under consideration. The host computer which uses our embedded system communicates with it through a standard communication bus (USB 2.0 protocol) and sends requests in order to indicate to the sequencer the relevant behavior to adopt. More precisely, according to the controls (and potentially a set of parameters) passed in a dedicated stack, the sequencer chooses preestablished interactions between the modules (Figure 6-P4) which constitute a dedicated processing chain. This architectural module implements the previously described routines of environmental adaptation, attention, focusing, and low-level identification (Figure 6-P6). Some of these modules, dedicated to environmental adaptation (processing No. 1 to i), modify the pixel flow which is then used by the attention, focusing, and identification modules (processing No. j to j + 1). The different data flows (corrected windows of interest and inertial measurements) can be used by these modules to perform their computations. The results provided by these processing modules are collected in a buffer, from which the sequencer selects the results to send to the host computer. The sequencer also uses part of these results to perform visual feedback on the sensing devices (Figure 6-S0) using dedicated control modules (Figure 6-P0, P1, and P3). Note in Figure 6 the module P2, which works with the external RAM R1: this module performs the fixed pattern noise correction, which is absolutely essential with the image sensor technology we use (described in Section 4.3). Lastly, the dedicated communication module P5 is a multiplexer that synchronizes the corrected pixel flow and the sequencer result flow for sending to the host computer.
The sequencer constitutes an active interface between the sensing device, the processing chain, and the host computer. The modular processing chain is synchronized with the raw pixel flow provided by the CMOS imager. Finally, this organization allows dynamic control of the global sensor state according to the overall coherence of the visual data.
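As a purely illustrative view of the supervision software, the sequencer running on the NIOS can be pictured as a command loop of the following form. The command codes, the ROI structure, and the helper functions are hypothetical and do not describe the actual firmware interface.

```c
/* Hypothetical sketch of the sequencer loop on the NIOS soft core.
 * Command codes, the ROI structure, and the helper functions are invented
 * for illustration only; they are not the actual firmware interface. */
typedef enum { CMD_IDLE, CMD_ATTENTION, CMD_FOCUS_ROI, CMD_SEND_RESULTS } cmd_t;
typedef struct { int x, y, w, h, angle; } roi_t;

extern cmd_t read_host_request(void);     /* request received over USB 2.0      */
extern roi_t pop_roi_from_stack(void);    /* ROI parameters pushed by the host  */
extern void  configure_window(int x, int y, int w, int h, int angle);
extern void  enable_attention_chain(void);
extern void  enable_focusing_chain(void);
extern void  send_result_buffer(void);

void sequencer_loop(void)
{
    for (;;) {
        switch (read_host_request()) {
        case CMD_ATTENTION:                /* full frame grab + saliency maps    */
            configure_window(0, 0, 640, 480, 0);
            enable_attention_chain();
            break;
        case CMD_FOCUS_ROI: {              /* focus on an ROI found previously   */
            roi_t roi = pop_roi_from_stack();
            configure_window(roi.x, roi.y, roi.w, roi.h, roi.angle);
            enable_focusing_chain();
            break;
        }
        case CMD_SEND_RESULTS:             /* flush the result buffer to host    */
            send_result_buffer();
            break;
        default:
            break;
        }
    }
}
```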
Addressing module (Figure 6-P0)
The goal of the address generator device is to compute the line and column addresses of the current window of interest. The window shape is currently rectangular and is set by 5 parameters: position (X, Y), size (H, W), and orientation (α).
The address generation is based on the well-known "Bresenham" graphical algorithm [24]. For the computation of the tilted rectangular window addresses, we have implemented an architecture based on a method recently developed by Bond [25] for drawing straight lines, suitable for raster-scan displays and plotters.
The approach proposed by Bond is based on signal processing concepts related to resampling, multirate processing, and sample rate conversion. The x-coordinates of each pixel can be viewed as a uniform sample set, and the y-coordinates represent another sample set. Assuming the slope of the line is within the first octant, the y-coordinate is generated by resampling the set of x-coordinates. To control this resampling, the fractional part of the y-coordinate is stored in a control variable as an integer. The algorithm can be summarized by these few lines of code (see Algorithm 1) .
(X1, Y1) and (X2, Y2) are the coordinates of the segment represented in Figure 7. n is the number of bits used to store the fractional part of the y-coordinates. Cvar is the fractional part of the y-coordinate. Incr is an integer variable used to store the slope value. Carry is an overflow indicator for the operation Cvar += Incr, and (X, Y) are the iterative coordinates of the line represented in Figure 7.
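Since Algorithm 1 is not reproduced in this extract, the following C sketch gives one possible rendering of the resampling loop described above, restricted to the first octant. The variable names follow the text; the output of each address is simply printed here.

```c
#include <stdio.h>

/* Resampling line generator in the spirit of Bond's method (first octant):
 * the fractional part of y is kept in the integer Cvar on n bits, and y is
 * incremented only when Cvar += Incr overflows (Carry = 1). */
void line_first_octant(int X1, int Y1, int X2, int Y2, int n)
{
    int dx = X2 - X1;                                   /* assumes dx > 0              */
    int dy = Y2 - Y1;                                   /* first octant: 0 <= dy <= dx */
    unsigned Incr = ((unsigned)dy << n) / (unsigned)dx; /* slope on n fractional bits  */
    unsigned Cvar = 0;                                  /* fractional part of y        */
    int X = X1, Y = Y1;

    for (int i = 0; i <= dx; i++) {
        printf("(%d, %d)\n", X, Y);                     /* emit the pixel address      */
        unsigned sum = Cvar + Incr;
        unsigned Carry = sum >> n;                      /* overflow of the n-bit sum   */
        Cvar = sum & ((1u << n) - 1u);
        X += 1;
        Y += (int)Carry;                                /* resample: step y on carry   */
    }
}
```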
The extension of the algorithm to other octants is performed by interchanging the roles of x and y, and changing the signs of the x and y coordinates. The internal architecture of the address generator device is illustrated in Figure 8. The range of window tilt is encoded in a natural binary-coded variable named Angle. The first three MSBs of this variable define the octant (Figure 8); the remaining bits of Angle encode the slope within the octant. The implementation of this module in the FPGA is characterized by the parameters shown in Table 1.

Table 1: Implementation results of the addressing module.
  Total logic elements: 150 (< 1%)
  Total memory bits: 0 (0%)
  DSP block 9-bit elements: 0 (0%)
  System clock frequency: 190 MHz
Fixed pattern noise correction module (P4)
Due to technological limits, a classical CMOS imager (without embedded CDS correction) provides a raw pixel flow with a high FPN. Indeed, the nonuniformity of the electrical characteristics of the pixels requires an additional correction stage. In order to even out the electrical response of each pixel, the module called FPN correction module (Figure 6) subtracts the reference values (of the FPN) from the pixel flow. These reference values represent the offset differences between pixels for the same illumination (Figure 9). The set of these values constitutes a reference image. In order to carry out this correction, the sensor integrates a module that loads this reference image from an external RAM (Figure 6-R1). The implementation of this module in the FPGA is characterized by the parameters shown in Table 2.

Table 2: Implementation results of the FPN correction module.
  Total logic elements: 65 (< 1%)
  Total memory bits: 85 (< 1%)
  DSP block 9-bit elements: 0 (0%)
  System clock frequency: 241 MHz
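As a minimal illustration of the correction itself (the names and the clamping choice are ours, not the module's internals), each incoming pixel simply has its stored reference offset subtracted:

```c
/* Minimal sketch of fixed-pattern-noise correction: the per-pixel offset
 * stored in the reference image is subtracted from the raw pixel flow.
 * Clamping negative results to zero is an illustrative choice. */
static inline unsigned char fpn_correct(unsigned char raw_pixel,
                                        unsigned char reference_offset)
{
    int corrected = (int)raw_pixel - (int)reference_offset;
    return (corrected < 0) ? 0 : (unsigned char)corrected;
}
```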
EXAMPLE OF ATTENTION MODULE IMPLEMENTATION: MOTION DETECTION
Based on an image difference method (Figure 10), this algorithm looks for moving objects in a scene. In the image plane, motion is transduced into temporal and spatial gray-level changes. This module detects temporal changes and defines a rectangular window around each moving object. In the first step, a difference image is obtained through subtraction of images i and i − 1. The difference image is then thresholded, and its vertical projection is calculated (line-by-line pixel sum in each column). A peak detector is applied to the vertical projection, giving the horizontal position of the moving objects found in the scene. The horizontal projection inside each vertical zone detected previously is then calculated, and a second peak detection is applied to define the vertical position of the moving objects. In this way, it is possible to define the positions of several moving objects in the image simultaneously. This information can be used as a parameter for another algorithm, such as a tracker.
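The following C sketch models the first stages of this attention module in software (difference, thresholding, vertical projection, and a crude peak detector). The thresholds are illustrative assumptions, not values from the hardware design.

```c
/* Software model of the first stages of the motion detection module
 * (a minimal sketch; the thresholds are illustrative assumptions). */
#include <stdlib.h>
#include <string.h>

#define IMG_W 640
#define IMG_H 480
#define DIFF_THRESHOLD 20   /* assumed gray-level change threshold          */
#define PEAK_THRESHOLD 30   /* assumed minimum projection value (in pixels) */

/* Thresholded absolute difference between frames i and i-1. */
static void difference(const unsigned char *cur, const unsigned char *prev,
                       unsigned char *bin)
{
    for (int k = 0; k < IMG_W * IMG_H; k++)
        bin[k] = (abs(cur[k] - prev[k]) > DIFF_THRESHOLD) ? 1 : 0;
}

/* Vertical projection: number of changed pixels in each column. */
static void vertical_projection(const unsigned char *bin, int proj[IMG_W])
{
    memset(proj, 0, IMG_W * sizeof(int));
    for (int y = 0; y < IMG_H; y++)
        for (int x = 0; x < IMG_W; x++)
            proj[x] += bin[y * IMG_W + x];
}

/* Crude peak detector: finds the first run of columns whose projection
 * exceeds PEAK_THRESHOLD and returns its [start, end) bounds. */
static int detect_peak(const int *proj, int len, int *start, int *end)
{
    int x = 0;
    while (x < len && proj[x] <= PEAK_THRESHOLD) x++;
    if (x == len)
        return 0;               /* no moving object found */
    *start = x;
    while (x < len && proj[x] > PEAK_THRESHOLD) x++;
    *end = x;
    return 1;
}
```

The horizontal projection and the second peak detection inside each detected vertical zone follow the same pattern, applied row-wise, to obtain the vertical extent of each moving object.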
EXAMPLE OF FOCUSING MODULE: TEMPLATE TRACKING
As explained in Section 3, we assume that an active vision process can be split into three layers: attention, focusing, and interpretation. The custom module proposed as an example in this section is a solution to the focusing layer. A design dedicated to an efficient template tracking implementation is presented. The main idea of template tracking is to estimate the displacement of a focusing window W between time t and t + ∂t. This module (Figure 11) comprises two parts: a memory to store the reference template, denoted I*, and a dedicated architecture for the displacement estimation. The architecture adopted is based on a derivation of the Kanade-Lucas-Tomasi algorithm [26]. This algorithm is an iterative method to estimate the displacement between two frames (I* and I). The proposed method is based on the calculation of the dissimilarity between the two images as follows:

\[
\epsilon = \iint_{W} \bigl[\, I(\mathbf{x} + \mathbf{d}) - I^{*}(\mathbf{x}) \,\bigr]^{2}\, w(\mathbf{x})\, d\mathbf{x},
\]

where \(\mathbf{x} = (x, y)^{T}\) denotes the pixel coordinates inside the window W, \(\mathbf{d} = (d_x, d_y)^{T}\) is the unknown displacement, and \(w(\mathbf{x})\) is a weighting function (set to 1 in our case). Approximating \(I(\mathbf{x} + \mathbf{d})\) by a first-order Taylor expansion and setting the derivative of \(\epsilon\) with respect to \(\mathbf{d}\) to zero yields a linear system,

where the gradient

\[
\mathbf{g}(\mathbf{x}) = \nabla\!\left( \frac{I(\mathbf{x}) + I^{*}(\mathbf{x})}{2} \right)
\]

is computed on the sum of the two images. Finally, the displacement d can be estimated by solving the equation:

\[
Z\,\mathbf{d} = \mathbf{e},
\]

where Z is the following 2 × 2 matrix:

\[
Z = \iint_{W} \mathbf{g}(\mathbf{x})\, \mathbf{g}^{T}(\mathbf{x})\, w(\mathbf{x})\, d\mathbf{x},
\]

and e is the following 2 × 1 vector:

\[
\mathbf{e} = 2 \iint_{W} \bigl[\, I^{*}(\mathbf{x}) - I(\mathbf{x}) \,\bigr]\, \mathbf{g}(\mathbf{x})\, w(\mathbf{x})\, d\mathbf{x}.
\]

Due to the Taylor expansion, this algorithm is not exact and needs iterations to find the correct displacement. The estimated displacement at each iteration i is denoted \(\mathbf{s}_{i} = (s_x \; s_y)^{T}\) and is obtained by solving \(Z\,\mathbf{s}_{i} = \mathbf{e}\) for the current pair of windows. The final displacement \(\mathbf{d} = (d_x \; d_y)^{T}\) is the result of the iterative process such that

\[
\mathbf{d} = \sum_{i=1}^{N} \mathbf{s}_{i},
\]

where N is the maximum number of iterations allowed by the processing. In our case, because of the NIOS control, the iterative process is limited to 80 MHz and the pixel flow runs at 8 MHz (corresponding to N = 80/8 = 10 iterations at most). The architecture developed to implement this algorithm is presented in Figure 12.
The first module, called the "storage device", allows the storage of the reference frame I* and the swapping of the current pixel flow between two dual-port memories. This module performs the two functions simultaneously in order to sustain the pixel flow rate. The displacement between the reference template I* and the current window of interest I is computed during the storage of the next window of interest.
In the first step, the difference (sub) between the two images and the spatial derivatives of their sum (add) are computed and synchronized. The computation of the spatial derivatives (Gx and Gy) is based on a set of FIFOs and multiplier-accumulators which apply a 3 × 3 convolution mask to the data flow. The convolution kernel is a Gaussian derivative function.
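In software terms, this derivative computation amounts to a 3 × 3 convolution over the window, as in the sketch below. The hardware streams the pixels through line FIFOs rather than indexing a full buffer, and the actual Gaussian-derivative coefficients are not reproduced here, so the kernel is left as a parameter.

```c
/* 3x3 convolution over a w x h window (software model of the FIFO +
 * multiplier-accumulator structure; borders are left untouched). */
void convolve3x3(const int *in, int *out, int w, int h, const int k[3][3])
{
    for (int y = 1; y < h - 1; y++)
        for (int x = 1; x < w - 1; x++) {
            int acc = 0;
            for (int j = -1; j <= 1; j++)
                for (int i = -1; i <= 1; i++)
                    acc += k[j + 1][i + 1] * in[(y + j) * w + (x + i)];
            out[y * w + x] = acc;
        }
}
```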
In the second step, Gx, Gy, and the difference image D (sub) are applied to a set of multipliers in order to compute the coefficients GxGx, GxGy, GyGy, GxD, and GyD. The accumulation of each coefficient over the window allows computation of the elements of Z and e.
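A software view of this accumulation step is given below: Gx and Gy are the derivative outputs, D is the image difference, and the five running sums are exactly the entries needed for Z and e (the data types are illustrative).

```c
/* Running sums accumulated over the window of interest: the entries of the
 * 2x2 matrix Z (GxGx, GxGy, GyGy) and of the vector e (GxD, GyD). */
typedef struct {
    long long gxgx, gxgy, gygy;   /* elements of Z */
    long long gxd, gyd;           /* elements of e */
} klt_sums_t;

static void accumulate(klt_sums_t *s, int gx, int gy, int d)
{
    s->gxgx += (long long)gx * gx;
    s->gxgy += (long long)gx * gy;
    s->gygy += (long long)gy * gy;
    s->gxd  += (long long)gx * d;
    s->gyd  += (long long)gy * d;
}
```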
The solution is obtained by the evaluation of the following three determinants (Cramer's rule applied to the system Zd = e):

\[
\Lambda = \begin{vmatrix} \sum G_{x}G_{x} & \sum G_{x}G_{y} \\ \sum G_{x}G_{y} & \sum G_{y}G_{y} \end{vmatrix}, \qquad
\Lambda_{x} = \begin{vmatrix} \sum G_{x}D & \sum G_{x}G_{y} \\ \sum G_{y}D & \sum G_{y}G_{y} \end{vmatrix}, \qquad
\Lambda_{y} = \begin{vmatrix} \sum G_{x}G_{x} & \sum G_{x}D \\ \sum G_{x}G_{y} & \sum G_{y}D \end{vmatrix},
\]

so that \(s_{x} = \Lambda_{x}/\Lambda\) and \(s_{y} = \Lambda_{y}/\Lambda\).
According to the signs and the relative magnitudes of Λ, Λx, and Λy, the displacement counters CTx and CTy are updated as shown in Algorithm 2.
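Algorithm 2 is not reproduced in this extract; the sketch below is a hypothetical reconstruction of a counter-update rule consistent with the text, stepping each counter by one pixel per iteration when the corresponding sub-displacement is significant. The 0.5-pixel threshold and the exact comparison are assumptions.

```c
#include <stdlib.h>

/* Hypothetical counter update (Algorithm 2 is not reproduced here):
 * s_x = Lx / L and s_y = Ly / L are never divided explicitly; the signs and
 * magnitudes of the determinants decide whether and in which direction the
 * displacement counters CTx, CTy are stepped. */
void update_counters(long long L, long long Lx, long long Ly,
                     int *CTx, int *CTy)
{
    if (L == 0)                         /* singular system: no reliable update */
        return;
    if (llabs(2 * Lx) >= llabs(L))      /* |s_x| >= 0.5 pixel (assumed rule)   */
        *CTx += ((Lx > 0) == (L > 0)) ? 1 : -1;
    if (llabs(2 * Ly) >= llabs(L))      /* |s_y| >= 0.5 pixel (assumed rule)   */
        *CTy += ((Ly > 0) == (L > 0)) ? 1 : -1;
}
```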
The updating of the reference template position is carried out according to the counter values. This process is iteratively repeated and allows detection of the correct translation vector between the two frames. Lastly, the estimated translation vector is used to update the position of the window of interest in the CMOS imager in order to track the reference template.
The implementation of this architecture on a Stratix EP1S60 leads to the parameters shown in Table 3 .
To illustrate the algorithm, several images and simulation results are presented in Figure 13, showing the re-centering of the focusing windows from the estimated displacement, starting from the initial windows of focusing and the reference template. This simulation was performed using the ModelSim software with the VHDL description of our system.

Figure 13: Results of the template tracking architecture. To evaluate the robustness of the approach, the image is artificially moved with a given displacement, indicated under each image: (1, 1), (1, 3), (0, 6), (3, 4), (7, 2), (7, 8), (4, 0), and (6, 1).
CONCLUSION
The computation of low-level vision tasks in real time is the first and fundamental step in building an interactive vision system. This paper proposes an alternative to classical architectures with a highly versatile architecture dedicated to early image processing. The proposed embedded system attempts to define a global coordination between sensitive elements, low-level processing, and visual tasks.
The approach, based on FPGA technology and a CMOS imager, reduces the classical bottleneck between sensor and processing. The FPGA component ensures a high interaction rate between the CMOS imager and the low-level processing. This interaction is used to select useful information earlier in the acquisition chain than in more traditional systems, and thus to focus the processing resources. This capacity is used to control the sensor state according to the visual task and the evolution of the environment. Our combination of FPGA and CMOS imager technologies results in high-speed vision, vision in real environments, and the efficient design of embedded systems. Among prospective candidates, we can cite the work performed on dynamically reconfigurable components such as the ARDOISE project [27]. This evolution of FPGA technology seems attractive for performing dynamic control of the acquisition chain: rather than merely controlling the system state, the system itself can be physically changed, giving a higher level of suitability for many algorithms. We have also developed a DSP board in order to improve the computation capabilities of our system. With this device, our embedded system will evolve into a heterogeneous architecture, and new research into codesign between the FPGA and the DSP will be necessary.
Moreover, to test the validity of our approach, several visual tasks will be implemented. Our objective is to identify elementary functions in order to define a library of architectural modules. This library will provide efficient solutions to attention, focusing, and identification subtasks according to specific applications. Finally, we plan to work on the development of software tools to facilitate the implementation of complex vision tasks.
