Introduction
Advances in computer science and VLSI technology almost double the computation power of off-the-shelf processors every year. Development tools are increasingly sophisticated, which furthers the trend towards dedicated architectures. Such advances allow new applications which could not have been imagined just two decades ago. The sequential nature of conventional systems is the major obstacle to real-time vision and other perception tasks. However, the association of parallel architectures, neural networks, electronic retinas, and real-time hardware for low-level image processing seems to be adequate [1, 2]. Currently, vision and perception tasks are a real challenge for many researchers in the mobile robot area, and many problems are not completely solved (technically and formally): for example, object recognition, robustness and reliability of the result under environmental change, response time, cost, size, and energy consumption. Vision applications are diverse, and many parallel vision systems have been proposed [3-7]. An analysis of the computation needed by a mobile robot reveals four main tasks: the perception sensor (CCD camera), multi-sensor data fusion and decision-making, dynamic control, and data acquisition from different sensors such as ultrasound and laser meters. Among these tasks, the perception sensor is the most complex one and consumes most of the computation power. Furthermore, to obtain the best performance, algorithms and hardware need to be adapted to each other; thus it is very important, even vital, for a vision system to be reconfigurable and flexible.
In this article we propose a parallel reconfigurable perception sensor which makes it possible to fit image-processing algorithms to hardware. We explain our choice and motivation for the low-level and high-level image-processing architecture, and describe the embedded real-time multi-processor kernel based on LINDA concepts and its performance evaluation tool. Finally, a perception sensor configured for an application of 3D scene reconstruction based on a geometric method is presented.
System Architecture
Generally, image processing may be classified into low-, medium- and high-level processing. Single instruction, multiple data (SIMD) architecture is well adapted to low-level processing, while MIMD architecture suits medium- and high-level image processing. However, the design of a perception or vision system leads to more or less complex and flexible solutions, such as IUA, WARWICK, DMP and SPHINX [8] [9] [10]. Among these systems, it is interesting to outline the main difference between the IUA and DMP architectures. IUA fits formally with the three image-processing levels: thus it is flexible and application-independent [8], but it is expensive and cumbersome. DMP is functionally adapted to a 3D scene-reconstruction algorithm [9], and is less expensive and less cumbersome, but it is not adapted to other vision applications. Our objective is to implement a low-cost, reconfigurable, and flexible perception sensor.
Low-level image processing
The main features of the low-level image-processing algorithms are the following:
1. Independent processing without overlap (histogram, thresholding): each pixel or block of pixels is evaluated independently of other pixels or blocks of pixels.
2. Processing with overlap (spatial convolution): each pixel is evaluated in relation to the neighboring pixels.
3. Global processing (FFT): each line or the whole image is evaluated.
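The three classes above differ mainly in how much of the image each output value depends on. The following Python sketch is only an illustration of these access patterns, not the implementation used on the board. The function names are hypothetical, images are assumed to be lists of rows of gray levels, and a mean gray level stands in for a global operation such as the FFT.

```python
def threshold(image, level):
    """Class 1: each output pixel depends only on the same input pixel."""
    return [[255 if p >= level else 0 for p in row] for row in image]


def convolve3x3(image, kernel):
    """Class 2: each output pixel depends on its 3x3 neighborhood.

    The kernel is applied without flipping (correlation), which is identical
    to convolution for symmetric kernels. Border pixels are copied unchanged,
    an assumption made here for brevity.
    """
    h, w = len(image), len(image[0])
    out = [row[:] for row in image]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    acc += kernel[dy + 1][dx + 1] * image[y + dy][x + dx]
            out[y][x] = acc
    return out


def mean_gray(image):
    """Class 3: the result depends on the whole image (stand-in for an FFT)."""
    total = sum(sum(row) for row in image)
    return total / (len(image) * len(image[0]))
```

Only the second class needs neighboring pixels to be available at the same time, which is what makes buffering of image lines relevant for a hardware implementation.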
Currently most vision systems use SIMD architecture for low-level image processing: IUA, MGAP, VIP, IMAP [8, 11-13]. However, the drawback of SIMD for low-level image processing is that the execution time depends on the image size and on the latency due to image transfer before and after processing.
For low-level image processing the pixel flow is regular, each pixel is evaluated by the same operations for most of the treatment, and the image after processing has the same size. The simulation and implementation results show that spatial convolution and the Deriche filter (Figures 11, 12 and 13) may be implemented in real-time by using a pipeline architecture based on FPGAs [14, 15]. Thus, low-level image processing is well adapted to pipeline architecture. PIPE and Cytocomputer are two vision systems using pipeline architecture [16]. One of the drawbacks of PIPE and Cytocomputer is their lack of flexibility. Thanks to advances in integrated circuits, especially FPGAs, and to hardware synthesis tools based on the VHDL language, pipeline architecture becomes flexible and efficient for low-level image processing.
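To make the pipeline idea concrete, the sketch below models in software how a 3 × 3 neighborhood operator can be computed on a raster-ordered pixel stream with two line buffers, so that each pixel is read only once. It is a behavioral illustration under our own assumptions (function name, zero-initialized buffers, output for interior pixels only) and does not describe the actual FPGA design.

```python
from collections import deque


def stream_convolve3x3(pixels, width, kernel):
    """Yield (row, col, value) for interior pixels of a raster-ordered stream.

    Two line buffers of `width` entries delay the stream by one and two rows,
    playing the role that line FIFOs play in a hardware pipeline.
    The kernel is applied without flipping (correlation).
    """
    line1 = deque([0] * width)  # holds the previous row (y - 1)
    line2 = deque([0] * width)  # holds the row before that (y - 2)
    window = [[0, 0, 0] for _ in range(3)]
    for i, p in enumerate(pixels):
        y, x = divmod(i, width)
        up1 = line1.popleft()   # pixel at (y - 1, x)
        up2 = line2.popleft()   # pixel at (y - 2, x)
        line2.append(up1)
        line1.append(p)
        # Shift the 3x3 window one column to the left and insert the new column.
        for r, v in enumerate((up2, up1, p)):
            window[r][0], window[r][1], window[r][2] = window[r][1], window[r][2], v
        if y >= 2 and x >= 2:
            acc = sum(kernel[r][c] * window[r][c] for r in range(3) for c in range(3))
            yield (y - 1, x - 1, acc)
```

The output for the pixel at (y - 1, x - 1) becomes available as soon as the pixel at (y, x) arrives, so the latency is about one image line rather than one full image.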
Key features of the low-level image-processing board 'LLIB' include:
• two local video buses, which permit the connections with the image acquisition board and processing element board;
• a VME bus interface for system control;
• an interface with the TMS320C40 communication port;
• a matrix of 6 × 6 ALTERA FLEX 8000 FPGAs [17];
• 16K × 16-bit words of SRAM (12 ns access time) to store control words;
• 10 FIFOs of 1024 bytes used as image line input.
It should be observed that the output of low-level image processing can be stored back in the image acquisition board or in a shared memory of the medium- and high-level processing element through a TMS320C40 communication port.
The LLIB is reconfigurable and reprogrammable, and is therefore versatile and very suitable for development, evaluation and experimentation with new algorithms and architectures (neural networks, systolic arrays etc). In addition, the LLIB may be used to implement a medium- or high-level image-processing accelerator.
Image acquisition board
To avoid bottlenecks and pixel transfer latency, we had to develop an image acquisition board adapted to the LLIB and to the medium- and high-level image-processing boards. The image acquisition board is implemented with a 40 MHz TMS340C20 graphic system processor (GSP), which is used to generate the video control signals for image acquisition and display. It permits us to capture images in real-time (256 gray levels) with different resolutions: 1024 × 768, 512 × 512 and 256 × 256, with interlaced or noninterlaced video input. The image may be captured, displayed and simultaneously passed through the LLIB for low-level image processing and stored back in the frame buffer. This mechanism allows the LLIB, in some cases, to treat the image several times. To keep the low-level image-processing hardware implementation simple, the noninterlaced video mode is used. Key features of the acquisition board include:
• a VME bus interface for system control;
• a video input unit for image capture;
• a video output unit which permits the visualization of the image coming from the camera or the frame buffer;
• 1 MB of frame buffer (VRAM), which may be organized as a set of four 512 × 512 × 8-bit image planes, or two 512 × 512 × 8-bit and one 512 × 512 × 16-bit image planes;
• two local video buses for the connection with the LLIB.
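The two frame-buffer organizations listed above both fill exactly 1 MB. The short check below is only a worked arithmetic example; the function and constant names are ours.

```python
PLANE_PIXELS = 512 * 512  # pixels per image plane


def frame_buffer_bytes(plane_depths_bits):
    """Total size in bytes of a set of 512 x 512 planes with the given bit depths."""
    return sum(PLANE_PIXELS * bits // 8 for bits in plane_depths_bits)


ONE_MB = 1024 * 1024
four_8bit_planes = frame_buffer_bytes([8, 8, 8, 8])
two_8bit_one_16bit = frame_buffer_bytes([8, 8, 16])
```

Both `four_8bit_planes` and `two_8bit_one_16bit` evaluate to 1 048 576 bytes, that is, the full 1 MB of VRAM.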
Medium- and high-level image processing
Processing element
In spite of the performance of recent RISC processors such as the DEC21164, UltraSparc-I and PowerPC [18] [19] [20], a single-processor system cannot perform medium- and high-level image processing in real-time. On the other hand, medium- and high-level image processing is potentially parallel. However, depending on the nature of the application, the computation power needed by a perception sensor varies considerably. Generally, there are two parallel paradigms: the data paradigm (SIMD) and the task paradigm (MIMD) [21]. Medium- and high-level image-processing algorithms are not regular, and are very different from one application to another. Consequently, the MIMD 'task paradigm' seems to be the best adapted. There are three classes of MIMD architectures: shared memory, 'SMM', distributed memory, 'DMM', and distributed shared memory, 'DSM'. SMM is simple and efficient for a small number of processors (fewer than eight processors without global cache memory and fewer than 40 processors when using a global cache) [21, 22]. It should be observed that the number of processors in an SMM is physically limited by the number of bus connections (fan-in/fan-out, line reflection and capacitive load). As for DMM, the number of processors is not physically limited; the degree of each node is about six, thus permitting the building of massively parallel systems (more than 1000 processors). However, when the number of processors increases, interprocessor communication increases and depends logically on the topology. Compared with SMM and DMM, DSM gives the best performance/cost ratio, good portability and flexibility [23]. Thus we chose a DSM architecture for medium- and high-level image processing.
As for the processing element, its key features may be established after analysis of the medium- and high-level image-processing algorithms. General-purpose RISC processors such as the UltraSparc-I, PowerPC and DEC21164 are powerful, but are difficult to implement and more expensive than DSPs (digital signal processors). Furthermore, they do not contain integrated communication ports and a DMAC. Table 1 shows the key features of the TMS320C40 [24], SHARC 2106X [25] and T9000 [26].
The SHARC 2106X is the best suited to our needs, but it was not available when we developed the system. Thus we chose the TMS320C40 as the processing element. To simplify the design, the parallel processing development system 'PPDS' [24] is used as a processing element.
Key features of the TMS320C40 PPDS include:
• each TMS320C40 is supported by a local bus comprising 64K × 32-bit words of zero wait-state static RAM, and 8 KB of EPROM;
• 128K × 32-bit words of one wait-state SRAM on a shared global bus;
• an expanded bus connector that provides an external interface to the shared global memory bus;
Our main concerns in the design of the system are configurability and flexibility. Each processing element is connected locally with its neighbors (three other TMS320C40 on the same board) in full cross-bar by using four communication ports. Each TMS320C40 then has two other unused ports.
To permit the reconfiguration of the system to adapt to different algorithms and obtain optimal performance, a recursive and multi-stage interconnection network (RMN) is developed by using FPGA technology (ALTERA FLEX 8000) (Figure 1). The RMN properties such as diameter, degree, recursivity, partitionability, synchronization and fault-tolerance are demonstrated in Belloum [27, 28]. The interconnection network permits implementation of a module of 16 workers (four high-level processing boards), a super-module (SM) containing 64 workers (four modules), an S2M containing 256 workers (four SMs) (Figure 2) etc. Figure 3 shows the interconnection network features compared with different, well known networks such as hypercube, HHC [29], mesh and connected cube cycle. The RMN permits us, on the one hand, to adapt medium- and high-level image-processing algorithms to minimize communication cost, and on the other hand it is scalable to meet the computation power requirements. It also permits the comparison of the performance of a parallel application on different topologies and the choice of the best adapted topology.
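The worker counts quoted above follow from a grouping factor of four at each level of the recursion (four TMS320C40 per board, four boards per module, four modules per SM, four SMs per S2M). The sketch below only reproduces this arithmetic; the level numbering and function names are our own convention, not part of the RMN definition.

```python
WORKERS_PER_CLUSTER = 4   # four TMS320C40 on one processing board
GROUP_FACTOR = 4          # four units of one level form the next level
LEVEL_NAMES = ["cluster", "module", "super-module (SM)", "S2M"]


def workers_at_level(level):
    """Number of workers at a hierarchy level (0 = cluster, 1 = module, ...)."""
    return WORKERS_PER_CLUSTER * GROUP_FACTOR ** level


def smallest_level_for(required_workers):
    """Lowest hierarchy level that provides at least the required number of workers."""
    level = 0
    while workers_at_level(level) < required_workers:
        level += 1
    return level
```

For instance, an application needing about 100 workers would require level 3 (256 workers), since an SM offers only 64.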
Perception sensor hardware
The key concept of our perception sensor is dedicated, programmable, reconfigurable, flexible and modular hardware for image processing. The system is built around two main buses: a VME bus is used as the system control bus, and a TMS320C40 communication port bus is used as the intra-cluster or inter-cluster interprocessor communication bus. The basic configuration of the perception sensor hardware has four main elements:
• image acquisition board;
• low-level image-processing board;
• medium- and high-level image-processing board;
• interconnection network board.
Each medium- or high-level image-processing task may be performed by a processor (one TMS320C40, or one worker), a cluster (a processing element containing four TMS320C40), several clusters (fewer than three clusters), a module (a set of four clusters) or several modules, according to the computation power needed. In addition, a module may be configured to fit the algorithms (partitionability). This organization allows the architecture to be adapted to the application and simplifies the software implementation.
In the next section we show how to configure the perception system to adapt to 3D scene reconstruction using a geometric method.
Software

Real-time multi-processor kernel
One of the problems of parallel architectures is the lack of system software facilities for developing and debugging applications. LINDA concepts permit parallel programming, but real-time constraints are not included in the implementation [30, 31]. Nevertheless, a real-time multi-processor kernel is absolutely necessary for a mobile robot. For this purpose a real-time multi-processor kernel based on LINDA concepts, called hierarchical LINDA, is implemented [32]. The syntax of the operations In(), Out() and Rd() is:
1. In(type_of_access, key, time_constraint, arg_nb, arg1, ..,argn);
2. Rd(type_of_access, key, time_constraint, arg_nb, arg1, ..,argn);
3. Out(type_of_access, key, time_constraint, arg_nb, arg1, ..,argn)
Type_of_access: specifies the tuple space location in the system for a dedicated application. The first letter of the type_of_access permits the definition of the hierarchical tuple space:
• L: processor local tuple space;
• P: cluster shared tuple space;
• D_cluster_id: distributed shared tuple space out of cluster, and the cluster_id allows us to get the path to attain the destination cluster by using the system configuration map. This type of access is only used with Out() operations.
Key: identifies a tuple class in a tuple space.
Time_constraint: a time constraint associated with the tuple (duration of validity, date of validity).
Arg_nb: argument number of a tuple.
Arg1..argn: information arguments.
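To give a feel for these operations, the following Python sketch models a keyed tuple space with blocking In() and Rd() and a validity duration on Out(). It is a simplified single-machine model under several assumptions of ours: the time constraint is interpreted as a validity duration for Out() and as a maximum waiting time for In() and Rd(), arg_nb is dropped because Python argument lists carry their own length, and only the access letters 'L' and 'P' are modeled. It is not the kernel implementation described in [32].

```python
import threading
import time


class TupleSpace:
    """Minimal keyed tuple space with blocking reads (illustrative only)."""

    def __init__(self):
        self._tuples = {}                     # key -> list of (expiry, args)
        self._cond = threading.Condition()

    def out(self, key, validity_s, *args):
        """Deposit a tuple; validity_s=None means it never expires."""
        expiry = None if validity_s is None else time.monotonic() + validity_s
        with self._cond:
            self._tuples.setdefault(key, []).append((expiry, args))
            self._cond.notify_all()

    def _take(self, key, remove):
        now = time.monotonic()
        live = [t for t in self._tuples.get(key, []) if t[0] is None or t[0] > now]
        self._tuples[key] = live              # drop expired tuples
        if not live:
            return None
        return live.pop(0)[1] if remove else live[0][1]

    def _wait(self, key, timeout_s, remove):
        deadline = None if timeout_s is None else time.monotonic() + timeout_s
        with self._cond:
            while True:
                args = self._take(key, remove)
                if args is not None:
                    return args
                remaining = None if deadline is None else deadline - time.monotonic()
                if remaining is not None and remaining <= 0:
                    return None               # time constraint not met
                self._cond.wait(remaining)

    def rd(self, key, timeout_s):
        return self._wait(key, timeout_s, remove=False)

    def take(self, key, timeout_s):
        return self._wait(key, timeout_s, remove=True)


# One space per access type: 'L' (processor local) and 'P' (cluster shared).
spaces = {"L": TupleSpace(), "P": TupleSpace()}


def Out(access, key, validity_s, *args):
    spaces[access].out(key, validity_s, *args)


def Rd(access, key, timeout_s):
    return spaces[access].rd(key, timeout_s)


def In(access, key, timeout_s):
    return spaces[access].take(key, timeout_s)
```

With this model, `Out("P", "segment0", None, 3, 4)` followed by `In("P", "segment0", 0)` returns the arguments `(3, 4)` and removes the tuple, whereas `Rd()` would leave it in place.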
Key features of real-time hierarchical LINDA include:
• Local tuple space. Permits several local processes to be executed concurrently. Therefore, if a process running on a processor needs to exchange messages or synchronize with another process on the same processor, it creates a local tuple (In(L,..)) and sleeps if the tuple has not yet been created in the tuple space. In the same way, if a process has to send messages to another process, it creates a local tuple (Out(L, … )) and proceeds.
• Shared tuple space. Permits several processes to be executed in parallel in the same cluster by sharing a common tuple space. When a process i of worker x performs Out(P,'segment0', … ) or In(P,'segment0' … ), the key 'segment0' is automatically created and bound with wid (worker identification number: x) and pid (process identification number: i) in the common tuple space. Thus, when another process j of worker y performs In(P,'segment0', … ) or Out(P,'segment0', … ), a key collision appears during the tuple creation and worker y sends a message (pid of process i and argument pointer) to worker x through the communication port. If process i is sleeping on the In(P,'segment0') operation, it is moved to the process-ready queue and rescheduling is performed (Figure 5).
• Distributed tuple space. Allows processes to be executed in parallel in different clusters. Message exchange or interprocess synchronization is done through the communication port and the clusters' common shared tuple space. Suppose process i of worker x of cluster m performs In(P,'result_segment0', … ); the key 'result_segment0' is created in the common shared tuple space of cluster m and bound with pid and wid. Then, when process j of worker y from cluster n performs Out(D_cluster-m,'result_segment0', … ), the kernel uses the system configuration map to identify the path to reach cluster m, and the messages are sent through the TMS320C40 communication port. As before, a key collision appears during the tuple creation by worker z (assume that worker z is connected to worker y) and a message (pid of process i and argument pointer) is sent to worker x through the communication port. Process i is then moved to the process-ready queue and rescheduling is performed.
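The text does not specify how the kernel derives a path from the system configuration map. As one possible reading, the sketch below treats the map as an adjacency list of clusters and uses a breadth-first search to find a shortest route. Both the data layout and the choice of breadth-first search are assumptions made for illustration.

```python
from collections import deque


def route(config_map, src, dst):
    """Return a shortest list of clusters from src to dst, or None if unreachable.

    config_map is a hypothetical adjacency list: cluster id -> directly
    connected cluster ids.
    """
    parents = {src: None}
    queue = deque([src])
    while queue:
        node = queue.popleft()
        if node == dst:
            path = []
            while node is not None:
                path.append(node)
                node = parents[node]
            return path[::-1]
        for nxt in config_map.get(node, ()):
            if nxt not in parents:
                parents[nxt] = node
                queue.append(nxt)
    return None
```

For a map such as `{"n": ["k"], "k": ["n", "m"], "m": ["k"]}`, a message from cluster n to cluster m would be forwarded through cluster k.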
The hierarchical LINDA has some restrictions and is less user-friendly than classical LINDA [33] : nevertheless it is more efficient and permits real-time implementation.
Performance evaluation tool
To make H-LINDA user-friendly, we implemented a development tool which detects deadlock and evaluates the execution time of a parallel application [34].
In parallel processing, performance tuning of an application is important, because small changes in the communication patterns and load balance can improve the performance of the application significantly. A performance analysis tool is therefore necessary to achieve fine tuning of applications. It also lets us analyse kernel activities, so that the kernel itself can evolve.
To implement a performance analysis tool, the following tasks should be considered: acquisition of the tracing information in the target system, analysis of the tracing information, and display of the analysis results.
The basic means of instrumenting a system for performance measurement purposes are H/W monitors and S/W monitors. A H/W monitor is typically external to the measured system and does not interfere with the measured system or alter its performance [35]. The drawbacks of H/W monitors are their cost and the difficulty of using them. A S/W monitor consists of instructions added to the measured system to gather tracing information [36, 37]. Because the monitor's instructions run on the measured system, and hence use system resources, they alter the performance of the target system [38]. Nevertheless, a S/W monitor is more useful and easier to use than a H/W monitor, and the interference can probably be compensated for.
For a good analysis, detailed execution status information of the application has to be collected. However, in the case of massively parallel processing, the collection of detailed information increases the communication cost between the host and each node (or cluster) and interferes with the execution of the application. In general, the performance of a parallel program may be affected by the scheduling strategy, synchronization, placement, grain size of tasks, and shared variable access time. Therefore, in real-time parallel processing the range of tracing has to be defined carefully, to reduce the communication cost between the host and the parallel machine and to reduce the interference with the execution of the application.
The embedded performance analysis tool is composed of seven modules. The first three modules, namely the tracing manager, the reporting manager, and the low-level analysis (k-analysis) manager, reside in the kernel. The four other modules, namely the analysis manager, communication analyser, graphic visualization manager, and interface-to-kernel, reside in the host.
For the monitoring tool, we use three levels of tracing to give the user the choice of an adequate tracing range, to reduce the communication cost, and to minimize the interference with the application. The reporting manager resides in the kernel as a low priority process, and thus it can be scheduled. Preprocessing the tracing information in the kernel reduces the amount of information that has to be gathered by the analysis manager residing in the host. This minimizes communication cost.
For the analysis, the sum of user-level CPU time and kernel-level CPU time is used as the metric. The three levels of communication which characterize H-LINDA are analysed to give the user information about the communication overhead of each level. The loads of each process, each processor, and each cluster are analysed in the form of an ordered list and in the form of percentages, to give the user load balance information.
We presented a performance analysis tool with four concepts for the real-time multi-processor kernel H-LINDA [34]. With this tool we can detect communication deadlock before and during the execution of the application, we can reduce the communication time and the interference with the application, and we can also estimate the amount of interference. Furthermore, with the hierarchical graphical user interface, the user can concentrate on one processor or on one cluster to find the problem in the application easily (Figure 6). The user interface also permits the user to change the priority of a process during execution to avoid the idle state or deadlock state of one processor. With this interface the user can also change the communication route of one processor or of one cluster, with the help of the interconnection network in the massively parallel machine, to work around a malfunction in one processor or in one cluster.
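The load-balance part of this analysis can be illustrated with a few lines of Python. The sketch below assumes that the tool has already collected a user-level and a kernel-level CPU time for each unit (process, processor or cluster). The function name and input layout are hypothetical.

```python
def load_report(cpu_times):
    """Rank units by total CPU time and give each unit's share in percent.

    cpu_times: {name: (user_cpu_s, kernel_cpu_s)}
    Returns a list of (name, total_s, percent), heaviest load first.
    """
    totals = {name: user + kernel for name, (user, kernel) in cpu_times.items()}
    grand_total = sum(totals.values())
    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return [
        (name, total, 100.0 * total / grand_total if grand_total else 0.0)
        for name, total in ranked
    ]
```

A report in which one worker carries a much larger percentage than the others would indicate that the task placement or grain size should be revisited.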
Example of 3D Scene Reconstruction
Visual applications are diverse. To show that our perception system can adapt to different applications, we undertake the parallelization of a 3D scene-reconstruction algorithm using a geometric method, to meet the application response time (10 Hz). The algorithm is sequential and had been implemented on a conventional system: a SUN IV workstation as the host processor and an Imaging Technology 151 as the vision system [39]. This platform is not suitable for real-time 3D scene reconstruction because it lacks processing power and many real-time functions, such as simultaneous acquisition from three cameras. The total execution time is about 12 s on a SUN IV for a scene containing approximately 100 segments.
Brief description of the 3D scene reconstruction by the geometric method
Three-dimensional scene reconstruction using the geometric method is based on segment processing: nevertheless, it does not use the segment length to perform the segment matching.
The geometric method is summarized by the following steps [39]:
1. Image acquisition with three cameras.
2. Contour point extraction using the Deriche filter.
3. Segment extraction by applying a polygonal approximation method to the contour points, followed by false segment elimination.
4. 3D scene reconstruction using the maximal segment:
• (a) Segment matching. In this step each segment S1(A1B1) of image plane (1) is compared successively to every segment S2(A2B2) of image plane (2). For each comparison, the following points are processed and checked: (i) construction of planes Π1 and Π2 by using the end points of segments S1 and S2 and the camera optical centers C1 and C2; (ii) segments S1 and S2 are retained as candidates for segment matching if the intersection of planes Π1 and Π2 is a line (R), situated in front of the image planes at a distance D less than a fixed threshold, as shown in Figure 7; (iii) a segment of maximal length is constructed by using S1′(A1′B1′) and S2′(A2′B2′), which are the projections of segments S1 and S2 onto the line (R) (Figure 8); (iv) finally, to confirm the matching, segment S3 in image plane (3) must have a common collinear region with the projection of Smax(Amax,Bmax) onto image plane (3) (Figure 9).
• (b) 3D coordinate processing. From the three matched segments (S1, S2 and S3), the 3D coordinates of the initial segment may be computed by using either the least squares method or the Kalman filter. At this step, the obtained segments still contain many false segments.
5. 3D scene-reconstruction improvement. This step contains two main functions: (i) 3D scene correction, including projection of the 3D scene onto the different image planes and correction of the projected segments, and (ii) 3D scene reconstruction.
6. False segment elimination. The 3D scene reconstruction can be improved by projecting the 3D segments onto the different image planes and eliminating the false segments by comparing them with the segments of the initial image.
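Steps (i) and (ii) of the segment matching rely on two elementary geometric operations: building the plane through an optical center and the two end points of a segment, and intersecting two such planes. The sketch below shows one standard way to compute them with cross products. It assumes that all points are already expressed as 3D coordinates in a common reference frame, and the function names are ours. The tests on the distance D and on the position of the line relative to the image planes are not included.

```python
def sub(a, b):
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def dot(a, b):
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def cross(a, b):
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def plane_through(center, end_a, end_b):
    """Plane through an optical center and two segment end points.

    Returns (n, d) such that the plane is the set of points x with n . x = d.
    """
    n = cross(sub(end_a, center), sub(end_b, center))
    return n, dot(n, center)


def plane_intersection(plane1, plane2, eps=1e-9):
    """Intersection line of two planes as (point, direction), or None if parallel."""
    n1, d1 = plane1
    n2, d2 = plane2
    u = cross(n1, n2)                 # direction of the line (R)
    uu = dot(u, u)
    if uu < eps:
        return None                   # planes are parallel or identical
    # A point on the line: ((d1 * n2 - d2 * n1) x u) / |u|^2
    w = tuple(d1 * b - d2 * a for a, b in zip(n1, n2))
    p = cross(w, u)
    return tuple(c / uu for c in p), u
```

A pair of segments whose planes are parallel yields `None` and can be rejected immediately, before any distance test is evaluated.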
Parallelization of the 3D scene-reconstruction by LINDA and simulation result on PPDS
The analysis of the sequential 3D scene-reconstruction algorithm shows that there are six successive steps; the execution time of each step is summarized in Figure 10. Naturally, the 3D scene-reconstruction algorithm based on the geometric method may be implemented in parallel by dedicating a module to each image plane input.
Three images are captured simultaneously, one by each vision module. Each image is processed in parallel by the LLIB for edge detection and the output image is stored back into a frame buffer. During the next screen refresh, the image is processed again by the LLIB (edge tracking accelerator) for edge tracking coding in Freeman code (Figures 11, 12 and 13). The coordinates and values of the contour points are stored in the processing element's shared tuple space. Then edge tracking and polygonal approximation are done in parallel. This step of processing is very important, since it provides a set of segments. The algorithm used for this purpose is well known: it consists of boundary segment splitting, where a segment is subdivided into two parts until the maximum normal distance from the boundary segment to the line joining its two end points does not exceed a defined threshold. This approach has the advantage of seeking prominent inflection points. The 3D scene segments are sent to three other clusters and are stored in their common shared tuple space.
The execution time of the fourth step is 1.5 s. So that the execution time is approximately the same for each step, the three sets of segments (three image planes) are split into four parts. Then the four workers run the optimized segment matching code in parallel, which permits reduction of the execution time to 0.08 s.
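The boundary splitting scheme described above can be written as a short recursive procedure. The sketch below is a generic version of this well-known split algorithm, under the assumption that the contour is an ordered list of 2D points. It is given for illustration and is not the code that runs on the workers.

```python
import math


def point_line_distance(p, a, b):
    """Normal distance from point p to the line through a and b."""
    (px, py), (ax, ay), (bx, by) = p, a, b
    dx, dy = bx - ax, by - ay
    length = math.hypot(dx, dy)
    if length == 0.0:
        return math.hypot(px - ax, py - ay)
    return abs(dy * (px - ax) - dx * (py - ay)) / length


def split_polyline(points, threshold):
    """Return the vertices of a polygonal approximation of an ordered contour."""
    if len(points) < 3:
        return list(points)
    a, b = points[0], points[-1]
    dists = [point_line_distance(p, a, b) for p in points[1:-1]]
    worst = max(range(len(dists)), key=dists.__getitem__)
    if dists[worst] <= threshold:
        return [a, b]                      # one segment is a good enough fit
    k = worst + 1                          # index of the farthest point
    left = split_polyline(points[:k + 1], threshold)
    right = split_polyline(points[k:], threshold)
    return left[:-1] + right               # do not duplicate the split point
```

On an L-shaped contour such as `[(0, 0), (1, 0), (2, 0), (2, 1), (2, 2)]` with a small threshold, the corner `(2, 0)` is retained as a vertex, which illustrates how the method seeks prominent inflection points.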
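The partitioning and the resulting rate can be checked with a small sketch. The splitting function below is a generic way to divide a list of segments into nearly equal parts for the four workers; how the kernel actually distributes the segments is not detailed in the text, so this is an assumption. The timing figures are those quoted above.

```python
def split_into_parts(items, parts):
    """Split a list into `parts` contiguous chunks whose sizes differ by at most one."""
    base, extra = divmod(len(items), parts)
    chunks, start = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        chunks.append(items[start:start + size])
        start += size
    return chunks


STEP4_SEQUENTIAL_S = 1.5    # execution time of the fourth step before parallelization
STEP4_PARALLEL_S = 0.08     # execution time with four workers and optimized code
REQUIRED_RATE_HZ = 10.0     # application response time requirement

achieved_rate_hz = 1.0 / STEP4_PARALLEL_S
meets_requirement = achieved_rate_hz >= REQUIRED_RATE_HZ
```

Here `achieved_rate_hz` is 12.5 Hz, above the required 10 Hz. The reduction from 1.5 s to 0.08 s is larger than a factor of four, which is consistent with the text attributing part of the gain to code optimization and not only to the four workers.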
The fifth step is performed by one worker. The sixth step is performed by two workers, and the results are sent to multi-sensor fusion and decision making (Figure 14). Figure 15 shows the parallelization of the 3D scene reconstruction using H-LINDA. Therefore, for a 3D scene containing approximately 100 segments, the system can perform the 3D scene reconstruction every 0.08 s (frequency 12.5 Hz) and meets the application response time. The obtained results are encouraging, but many problems remain unsolved, such as the reconstruction of cylindrical or spherical objects. Figure 16 shows the configuration of the perception system for an application of 3D scene reconstruction using a geometric method, containing three modules. Each module contains one image acquisition board, one LLIB, four clusters of processing elements (16 workers) and an interconnection board. The host processor is a PC VME board.
Conclusion
Experimental results show that a high-performance, low-cost system may be obtained by adapting algorithms and hardware to each other, especially for the perception sensor. Thus, configurable and flexible parallel architectures are the key concepts in the design of perception sensors. In addition, the configurable and flexible parallel perception system provides a platform for developing applications and experimenting with high- and low-level image-processing algorithms. Currently the system is used to evaluate and to implement real-time edge detection using binary logic neural networks and to implement real-time motion detection operators. With the VHDL language, the system also permits the narrowing of the gap between software developer and hardware designer.
