In this article, we present a new reconfigurable parallel architecture oriented to video-rate computer vision applications. This architecture is structured as a two-dimensional (2D) array of FPGA/DSP-based reprogrammable processors P ij . These processors are interconnected by means of a systolic 2D array of FPGA-based video-addressing units, which allow video-rate links between any two processors in the net and so overcome the restrictions of classic crossbar systems, such as those that occur with butterfly connections. The architecture has been designed to deal with parallel/pipeline procedures, performing operations which handle various simultaneous input images, and to cover a wide range of real-time computer vision applications from pre-processing operations to low-level interpretation. This open architecture leaves the host free to deal with the final high-level interpretation tasks. The exchange of information between the linked processors P ij of the 2D net consists of the transfer of complete images, pixel by pixel, at video rate. Therefore, any kind of processor satisfying this requirement can be integrated. Furthermore, the whole architecture has been designed to be host-independent.
Introduction
Computer vision is a wide field of research that involves a great amount of data and, consequently, requires a high level of processing capacity, mainly in parallel computing. Stated simply, computer vision can be separated into three main levels. The first one, the lower level, deals with pixel-operation functions such as differences between images and neighborhood filtering.
At the intermediate level, we find tasks such as segmentation, motion estimation and feature extraction or matching, all of which need pipelined processes.
At the upper level of computer vision we deal with interpretation. This function usually requires Artificial Intelligence tools as well as previous knowledge of the environment.
The challenge of pipelining these three levels in real time can be undertaken using parallel architectures. Table 1 shows a well-known classification presented by Sima [1] in which the most representative groups of families are shown.
This table identifies some of the major families which represent key junctures in the evolution of parallelism oriented to image processing. To date, these premises are still quite valuable.
A great deal of work on parallelism has been carried out over the past decade. However, most approaches, even the most recent ones, have been based on a handful of historical architectures. First, we should mention the systems based on a net of interconnected modules which process the whole image by local operations on a pixel neighborhood [2-6]. These architectures are generally based on geometric parallelism operating in single instruction, multiple data (SIMD) mode. Secondly, pipelined systems and systolic nets of processors are based on multiple input processes whose outputs constitute the inputs of the next operation [7, 8], scanning the whole image and processing it piece by piece. Thirdly, pyramidal systems compute complex operations by using the divide and conquer paradigm [9, 10]. These architectures are frequently used for multi-scale or recursive operations such as image grabbing at different resolution levels. Furthermore, there are some architectures which are internally organized without taking into account the image structure or their operations [11-13]. Multiple instruction, multiple data (MIMD) architectures, such as those built from digital signal processors (DSPs), permit the design of several types of parallelism. Finally, hypercube processors combine the advantages of pyramidal structures and meshed nets [14, 15].
Since architectures for real-time image processing need to manage a large amount of data and work within real-time requirements, parallelism is a fundamental part of most of these systems. An interesting project considering specific architectures for parallelism was carried out by Raman and Clarkson [16]. Their article describes a parallel architecture composed of several specific, non-identical modules which can work concurrently with only one shared memory.
Another interesting proposal was presented by Young [17]: a DSP-based structure which computes in parallel to solve tasks with a high computational cost, as in real-time image processing. He gives a few examples using several DSPs from Texas Instruments, most specifically the C40. Srinivasan and Govindaraj [18] have also used these devices (DSPs) in multi-processor networks. To increase the processing speed, they split the original image into several independent blocks. Other projects, such as that of Chen and Jen [19], present a special processor for video signals. This processor is composed of several functional units which work in parallel, each of which specializes in one kind of computation (arithmetic unit, multiplicative unit, discrete cosine transform unit, and so forth). Another project, by Wu et al. [20], presents a co-processor built of several basic modules interconnected by means of a programmable network.
The article by Turton et al. [21] is interesting since they attempt to apply genetic algorithms to computer vision. To take advantage of this kind of algorithm they had to use parallelism and decided on parallel genetic algorithms (PGAs), specifically the fine-grained version, which suggests that it is better to work with a larger number of simple processors than with only a few complex processors. Finally, Bertozzi and Broggi [22] describe a stereo vision system for obstacle and lane detection. The kernel of this system is a parallel architecture called Paprica 3, which is composed of 256 processing elements (PEs) working in an SIMD manner. The GIOTTO system, an architecture proposed by Cucchiara [23] and used in robotics applications, is similar to the previous work. It is also a parallel computer based on an SIMD reduced-size array processor, with a novel organization of the memory sub-system. Several of these architectures are modular, which means that the system can be extended with more PEs or basic cells, depending on the application desired.
As far as our DSP/FPGA-based parallel architecture is concerned, this article is organized as follows. Firstly, in Section 3, the architecture and its performance are discussed in general. Analog I/O video signal devices as well as the systolic crossbar are briefly described in Section 4. Section 5 presents the architecture of the elemental processors and its hardware implementation. A brief overview concerning the control and synchronization of the whole architecture, as well as communications with the host, is presented in Section 6. Finally, conclusions and further work are presented.
The Proposed Architecture
Figure 1 presents our new architecture, which can cope with the majority of problems which crop up in real-time computer vision applications, mainly when dealing with more than one RGB camera simultaneously. The most important features of this proposed architecture can be summarized as follows:
• Each processor P ij operates on an input image supplied by any one of the other processors belonging to the net and supplies a new output image to be processed by the next module.
• The data BUS between processors has been designed for byte-by-byte transmission, that is, pixel by pixel.
• A set of digital video multiplexors (dv ij ) allows parallel and pipeline connections between processors depending on the final application.
• Since only video-sync information and enable/disable control signals are considered, an FPGA-based processor (Pro1) controls the architecture.
• Processors are independent and perform different image-processing functions. Each processor controls its own memory module. The same function can be replicated in parallel as many times as needed.
• RGB images can be used as input/output.
• All the processors can operate as a switch, allowing the image to pass through unchanged. In such cases, the video input is transferred straight to the output of the module.
One processor (Pro1) is in overall control of the whole cell architecture, while another processor (Pro2) is in charge of communications with the host. Finally, a matrix of elemental processors [P ij ], interconnected through a crossbar built with video-addressing processors called dv ij , deals with the initial stages of the computing process. This crossbar construction allows all sorts of interaction among the basic processors [P ij ], which can send or receive images pixel by pixel using an 8-bit digital BUS. Through Pro2, this BUS is also used to send information to the main host.
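The pixel-by-pixel exchange between linked processors can be pictured as a chain of stream transformers. The following is a minimal software sketch of that idea only, not the hardware implementation; the stage functions `invert` and `threshold` and the helper `chain` are hypothetical names chosen for illustration, and `pass_through` models a processor acting as a switch.

```python
from typing import Callable, Iterable, Iterator, List

Pixel = int  # one 8-bit gray level, as carried on the 8-bit video BUS
Stage = Callable[[Iterable[Pixel]], Iterator[Pixel]]


def pass_through(pixels: Iterable[Pixel]) -> Iterator[Pixel]:
    """A processor acting as a switch: input goes straight to output."""
    for p in pixels:
        yield p


def invert(pixels: Iterable[Pixel]) -> Iterator[Pixel]:
    """Hypothetical per-pixel function: gray-level inversion."""
    for p in pixels:
        yield 255 - p


def threshold(level: int) -> Stage:
    """Hypothetical stage factory: binarize the stream at `level`."""
    def stage(pixels: Iterable[Pixel]) -> Iterator[Pixel]:
        for p in pixels:
            yield 255 if p >= level else 0
    return stage


def chain(stages: List[Stage], pixels: Iterable[Pixel]) -> Iterator[Pixel]:
    """Link stages so that each one consumes the previous output stream."""
    stream: Iterator[Pixel] = iter(pixels)
    for stage in stages:
        stream = stage(stream)
    return stream


# Three linked processors; the middle one is configured as a switch.
row = [0x10, 0x20, 0xF0]
out = list(chain([invert, pass_through, threshold(128)], row))
```

Because each stage only needs the stream interface, any processor able to accept and emit pixels at the required rate can be placed in the chain, which mirrors the integration argument made above.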
In the following, each module of the architecture will be described in further detail, pointing out which kind of devices have been or will be used in future implementation.
Video Transmission
Analog input/output video signal
As mentioned before, multiple RGB input/output signals are possible when dealing with parallel processing of various input images. As examples of applications requiring multiple parallel input images, we tested three-dimensional (3D) image processing, tracking, and object recognition.
In this architecture, every A/D module incorporates a Philips TDA8709A converter and an LM1881 synchronism extractor. The synchronization signals are supplied to Pro1, the processor controlling the basic processors making up the net. The D/A conversion is performed using TDA8702 devices.
The systolic crossbar
The video-addressing units [dv ij ] alone constitute a 2D net of FPGAs (see Figure 2) whose purpose is to take care of video transmission between the various basic processors [P ij ] in the main architecture.
In such a net, we can differentiate between the input/output units addressing a 3 × 8 video BUS and the rest of the units addressing an 8-bit video BUS.
J. BATLLE ETAL.
For the implementation of the dv ij crossbar cells we opted for an Altera FPGA, model FLEX 10K100ARC240-2. Figure 3 shows a simple sketch giving an idea of the number of I/O pins needed. The program and control word is 10 bits in size, four of which are used to select one of the 12 possible configurations shown in Figure 4.
If an architecture with 64 elemental processors is to be connected, then the 6 remaining bits will be used to identify the position (x,y) of the selected cell [dv ij ] from among the 64 possibilities.
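The split of the 10-bit word into a 4-bit function field and a 6-bit position field can be illustrated with a small packing routine. This is a sketch under stated assumptions: the source only says that the first 4 bits select the function and the remaining 6 bits identify the cell, so placing the function in the upper bits, storing the function number (1 to 12) directly, and dividing the position into 3 bits for y and 3 bits for x on an 8 × 8 grid are all hypothetical choices made here.

```python
FUNC_BITS = 4       # selects one of the 12 video-addressing functions
POS_BITS = 6        # selects one of 64 cells
NUM_FUNCTIONS = 12
GRID = 8            # assumed 8 x 8 arrangement of the 64 cells


def encode_control_word(function: int, x: int, y: int) -> int:
    """Pack (function, x, y) into a 10-bit word (assumed bit layout)."""
    if not 1 <= function <= NUM_FUNCTIONS:
        raise ValueError("function must be in 1..12")
    if not (0 <= x < GRID and 0 <= y < GRID):
        raise ValueError("cell position out of range")
    position = (y << 3) | x          # 3 bits for y, 3 bits for x
    return (function << POS_BITS) | position


def decode_control_word(word: int):
    """Unpack a 10-bit word into (function, x, y)."""
    if not 0 <= word < (1 << (FUNC_BITS + POS_BITS)):
        raise ValueError("control word must fit in 10 bits")
    function = word >> POS_BITS
    position = word & ((1 << POS_BITS) - 1)
    return function, position & 0b111, position >> 3


word = encode_control_word(function=6, x=3, y=5)
fields = decode_control_word(word)
```

Four bits could address up to 16 functions, so the 12 configurations leave four codes unused under this layout.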
The power of the FPGA used allows control tasks to be carried out at a local level; that is, two or more grouped devices [dv ij ] could, if necessary, collaborate independently of the general architecture BUS.
Before instant T1:
• The first 4 bits of the 10-bit control word indicate that we have selected the video-addressing function-1 represented in Figure 3. This option allows the digital signal to pass through the selected dv ij cell from left to right.
• In in_port we have the gray level of the input byte; for example, 10h.
• In out_port we have the same 10h value.
• up_port and down_port are in the three-state (high-impedance) condition.
At instant T1:
• The value of the input byte changes from 10h to 20h.
At instant T2:
• The value of the input byte, 20h, is available at the output pin out_port.
FPGA/DSP-BASED PARALLEL ARCHITECTURE 349
At instant T3:
• The value of the input byte changes from 20h to 30h.
At instant T4:
• The value of the input byte, 30h, is available at the output pin out_port.
At instant T5:
• A change in the control signal switches the video-addressing cell to the function-6 configuration represented in Figure 3.
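The sequence from T1 to T4 can be reproduced with a toy model of the left-to-right pass-through function. The sketch below assumes that the cell behaves like a register with a latency of one instant, which is one plausible reading of the walk-through; the source does not state whether the delay between T1 and T2 is a clock period or a propagation delay. The function name `simulate_pass_through` is hypothetical.

```python
def simulate_pass_through(in_samples, initial_out):
    """Return (in_port, out_port) pairs, one per instant.

    Assumption: out_port shows the byte that was on in_port at the
    previous instant (a one-instant register delay).
    """
    out_port = initial_out
    trace = []
    for in_port in in_samples:
        trace.append((in_port, out_port))
        out_port = in_port  # latched, visible at the next instant
    return trace


# Instants: before T1, T1, T2, T3, T4
in_samples = [0x10, 0x20, 0x20, 0x30, 0x30]
trace = simulate_pass_through(in_samples, initial_out=0x10)
```

In this model the byte 20h placed on in_port at T1 appears on out_port at T2, and 30h placed at T3 appears at T4, matching the description above. The up_port and down_port pins are not modeled because they stay in the three-state condition for this function.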
The current prototype integrates the FPGA FLEX10-K250A.
The ease of programming such dv ij cells in VHDL can be seen from the program sample in Figure 6, which shows how to manage the first three video-addressing functions of the 12 possibilities from Figure 4.
Finally, Figures 7 and 8 present some schematic examples indicating the possibilities of interconnecting the processors using the FPGA-based crossbar.
The Basic Processors Cells P ij
As can be seen in Figures 7 and 8, the proposed architecture allows the linkage of an unlimited number of processors compatible with the I/O requirements of the video-addressing cells [dv ij ]. However, as mentioned before, and with the goal of facilitating the programming task, we suggest the use of identical processors. Figure 9 shows the architecture developed for the basic processor cell, which is composed of the following modules:
• A ping-pong memory.
• A P 1 processor mainly oriented to computational functions.
• A P 2 processor basically addressed to communications and low-level image processing tasks.
In the current prototype, P 1 is a TMS320C51 DSP, chosen for its high computational capabilities. P 2 is in charge of initial program loading, intercommunication with the rest of the architecture and memory management. The addressing needs suggest the use of an FPGA device such as one from the Altera FLEX series, which also gives computational support to the DSP.
The cell itself constitutes a powerful tool oriented to real-time image processing. For the P 2 control signals, a 3-bit BUS was used: one video-synchronism bit and two bits to select among four functions: enable, disable, program load and execution. Figures 10 and 11 show two basic examples of low-level parallelism. In Figure 10, processor P 1 computes with the data stored in M 2 while P 2 loads the input image into M 1 . In Figure 11, processor P 1 deals with the input image at video rate and provides an output image to the next cell. Figure 12 shows the first hardware prototype of the basic cell (P ij ). Figure 13 shows the first parallel architecture with two basic cells and the I/O RGB interfaces.
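The ping-pong arrangement described for Figure 10, with one bank being loaded while the other is being processed, is a form of double buffering. The following sketch models it in software under the assumption that the two banks simply exchange roles at each frame boundary; the class and method names are hypothetical and do not come from the prototype.

```python
class PingPongMemory:
    """Two frame banks (M1, M2) whose roles alternate every frame."""

    def __init__(self, frame_size: int):
        self.frame_size = frame_size
        self.banks = [[0] * frame_size, [0] * frame_size]  # M1, M2
        self.write_bank = 0  # bank currently loaded by P2

    @property
    def read_bank(self) -> int:
        """Bank currently available to P1 for computation."""
        return 1 - self.write_bank

    def load_frame(self, frame):
        """Role of P2: store the incoming frame in the write bank."""
        if len(frame) != self.frame_size:
            raise ValueError("frame has the wrong size")
        self.banks[self.write_bank] = list(frame)

    def read_frame(self):
        """Role of P1: fetch the previously stored frame."""
        return list(self.banks[self.read_bank])

    def swap(self):
        """Exchange bank roles, e.g. at each vertical sync."""
        self.write_bank = 1 - self.write_bank


mem = PingPongMemory(frame_size=4)
mem.load_frame([1, 2, 3, 4])      # frame 0 goes into M1
mem.swap()
mem.load_frame([5, 6, 7, 8])      # frame 1 goes into M2 ...
previous = mem.read_frame()       # ... while frame 0 is read from M1
mem.swap()
latest = mem.read_frame()         # now frame 1 is read from M2
```

The benefit of this scheme is that loading and computing never touch the same bank at the same time, so neither processor has to wait for the other within a frame.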
To end this section, we would like to present an example of how this basic cell (P ij ) can be programmed. The developed application consists of loading a frame into the M 1 memory and reading the previous frame from the M 2 memory. Figure 14 shows a block diagram of the VHDL program. Figure 15 shows the program code. In summary, the DIV block is a simple frequency divider, while the RAMCTRL block is in charge of read/write memory operations.
Control, Synchronization and Communication Tasks
Processor Pro1 - control and synchronization of the whole architecture
This processor is in charge of control and synchronization tasks. Its principal functions can be summarized as follows:
• to supply external video-sync signals for the video cameras,
Processor Pro2 - communication with the host
The Pro2 processor manages P ij program loading, using its own internal dual-port memory as a shared address, as well as communication with the host. In principle, this processor should provide host-independent features for any desired specification, although in the current prototype only a PCI protocol was used. An Altera FLEX10K100A FPGA was used, mainly because it can easily be programmed with a 32/64-bit PCI interface.
Application and Conclusions
We have presented a highly versatile parallel architecture which supports high-level real-time image processing routines. The hardware has been designed to work co-operatively with a host, leaving the host free to deal with the final steps concerning scene understanding and interpretation tasks. The 2D FPGA-based crossbar interconnects the basic cells [P ij ], which in turn allows free use of pipelining and parallelism with no restrictions on the number of linked processors or the number of parallel input images dealt with at the same time. The first prototype was used as an embedded computer vision system to implement real-time underwater imaging procedures for the AUV GARBI developed in our Lab (Figure 16). It is well known that images of the sea bottom suffer from poor light and high noise. As a result, computational time is the most important parameter to be optimized when dealing with autonomous navigation. Keeping underwater imaging in mind, our main purpose was to perform operations such as undersea pipe tracking in real time. As can be imagined, extracting parameters from such an image is not an easy task. The proposed architecture can perform a great deal of real-time computation, from preprocessing steps up to the final interpretation levels.
The pipeline is detected using two parallel plane laser beams and a video camera oriented towards the sea bottom. Since the aim of this application is to show how the board can be programmed, a simple example dealing with the image obtained when projecting two laser beams onto a cylindrical object is presented in Figure 17. The lengths of both lines change as the distance between the robot and the pipe changes. Left-right movements of the underwater vehicle with respect to the pipeline can be detected from the location of the lines inside the image.
In the presented application, real-time image processing tasks are performed by the FPGA, while the DSP computes parameters such as the angular displacement and the distance between the robot and the pipe. In fact, thresholding the image of the projection of the laser beams is the most important step in obtaining the parameters used in tracking control. A matrix filter then passes over the image, and the parameters are extracted from the result. From an architectural point of view, the process consists of the modules executed by the FPGA, presented in Figure 18 and described as follows:
• BINARY performs the thresholding. The binary image is shown in Figure 19.
• FILMAX3 performs a 2D matrix filter.
The system compares the information obtained by processing the image with the pre-set values for the optimum distance to be maintained between the robot and the pipeline. The DSP is used as a complementary processor for computing the displacement angle and the distance between the robot and the pipe. Figure 20 shows the software implemented in C for the DSP.
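The two FPGA stages can be mimicked in software to make the data flow concrete. The sketch below is an illustration under assumptions, not the VHDL of the prototype: the threshold level is a free parameter chosen here, and FILMAX3 is modeled as a 3 × 3 maximum filter only because its name suggests it, which the source does not confirm. The lower-case function names are hypothetical.

```python
def binary(image, level):
    """Threshold a gray-level image: 255 at or above `level`, else 0."""
    return [[255 if p >= level else 0 for p in row] for row in image]


def filmax3(image):
    """Assumed behavior: each pixel becomes the maximum of its 3 x 3
    neighborhood, with the window clipped at the image borders."""
    h, w = len(image), len(image[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            out[y][x] = max(
                image[j][i]
                for j in range(max(0, y - 1), min(h, y + 2))
                for i in range(max(0, x - 1), min(w, x + 2))
            )
    return out


# A single bright laser pixel on a dark background.
frame = [
    [10, 10, 10, 10],
    [10, 200, 10, 10],
    [10, 10, 10, 10],
    [10, 10, 10, 10],
]
mask = binary(frame, level=128)
filtered = filmax3(mask)
```

With a maximum filter, thin laser lines in the binary image are thickened, which may make the later measurement of line length and position less sensitive to single-pixel dropouts.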
This processor is capable of dealing with noise, segmentation, edge detection, correlation, FOE detection and perspective transformation at video rate with a minimal delay of a few frames. As far as further hardware work is concerned, the FPGA FLEX10-K250A will be used to implement the 2D crossbar net, since only a small number of individual dv ij cells can be integrated in a single chip. As further work, the powerful TMS320C62x DSP will be used, mainly because its facilities are oriented to mathematical computing. This processor is able to execute instructions in 5 ns, an invaluable feature when dealing with filtering, FFT operations and so forth. Furthermore, its 1-Mbit internal RAM will allow optimization of the external memory resources. Another important feature is its reduced size compared with its power, a useful characteristic when dealing with embedded systems to be located inside small autonomous underwater vehicles and other mobile robots.
