The paper discusses a parallel, multiprocessor implementation of methods based on partial differential equations (PDE). It focuses on the implementation requirements of segmentation by active contours. Here, the key task is the accurate computation of an implicit description of the contours, i.e. the distance map. This paper demonstrates the use of the proposed architecture by giving a hardware implementation of the distance transform based on the parallel Massive Marching algorithm. The extension of the non-linear filtering implementation is also proposed.
INTRODUCTION
Recent PDE-based methods, such as non-linear filtering, active-contours based segmentation or continuous mathematical morphology give promising results. However, their implementation requires computation of nonlinear functions. They are often solved by iterative and recursive algorithms characterized by a high computational cost. Therefore, only a limited number of realtime or embedded applications exists. The design of a custom chip becomes meaningful not only for the acceleration but also for the implement-ability on embedded systems (2) .
The most common, existing custom chips implement non-linear filtering, i.e. non-linear diffusion (1), (2) . One can find some experiments with super-computers (5) or some examples of the level-set segmentation implemented on graphic hardware (4) . On the other hand there are few chips dedicated to the image segmentation. The motivation of our work is to define a general type of parallel architecture fitting the needs of the abovementioned applications. This paper is structured as follows: the next section discusses the choice of a suitable architecture type. The following section shows the implementation of Massive Marching algorithm (8) , (9) on the proposed architecture type. In Conclusion the simulation results and the extension of the proposed processing unit to other applications are discussed.
PDE IMPLEMENTATION ISSUES
The filtering as well as the segmentation algorithms respond to some function of geometrical properties of a level set function u (5) such as gradient or curvature. These geometrical properties are obtained directly from the values of u or its derivatives (5), (6) . The computation of the derivatives is an elementary operation. For every point concerned, these operations are performed on the nearest neighborhood and are independent each on the other. Hence they can be executed in parallel (9) .
Since the PDE-based algorithms principally consist in repetitive computation of the elementary operations the SIMD (Single Instruction Multiple Data Stream) architecture is the natural choice to reduce the processing time. We propose a divide-and-conquer approach in order to obtain a more balanced processors' activity and to limit the space on the chip. Thus one can benefit from a quasi parallel implementation by dividing the input data into blocks.
Recall that the SIMD architecture consists in an array of processors with an interconnection network for the communication. Each processor has its private nonshared memory. A single controller broadcasts instructions to all the processors. The processors then execute the instructions simultaneously at a given time.
The choice of the algorithm to implement and the architecture type come together. Compared to the filtering, other techniques as the continuous watershed or active contours require computation of a (weighted) distance to the given markers. These algorithms use sophisticated data structures (such as a hierarchical queue or a sorted heap) which can penalize the execution time on the SIMD architecture. The processing of such data structures introduces a sequential approach. Only one point (with maximal priority) can be processed at a time. Another reason why the SIMD architecture would be less efficient for this type of algorithms is the random access to the memory.
In the next section we show the implementation of the Massive Marching algorithm used for the computation of a distance function. This algorithm is fully parallel. Hence the execution time on an architecture with P processors is t parallel =t sequential /P.
IMPLEMENTATION OF MASSIVE MARCHING
The proposition of the Massive Marching has been motivated by the search of an efficient architecture implementing the solution to the Eikonal equation (8) . Massive Marching does not use any sorted queues or heaps. Moreover Massive Marching allows to compute the distance only on the narrow-band of a curve which is advantageous for active contours segmentation. It includes two stages : curve detection (initial condition) and propagation.
Curve detection (Initialization). The current PDE-based segmentation methods use an implicit description of the curve C 0 which represents the initial condition (5) . C 0 is a closed curve and generally lies between the points of the grid Z 2 . Its accurate inter-pixel location is identified by the distance map u to C 0 . However, if it is to be described implicitly on a discrete support Z 2 , the curve C 0 may not be placed arbitrarily. Its location is only determined by
cates whether a given point x lies inside or outside C 0 (as introduced in (5)). Hence, C 0 is located somewhere between two adjacent points for which sign . )) differs. The exact distance u of these points to C 0 is obtained by linear interpolation. The other points are initialized to infinity. Consider | _ const for all x ∈ Z 2 and the 4-neighborhood. For all points adjacent to C 0 , the initialization of the distance function reduces to:
since u can only have a finite number of constant values. The initialization of u(x) can be realized efficiently as a combinatorial function of sign x i )) for all x i ∈ N 4 (x). The function result is used to retrieve the corresponding value c i from a register. The interpolation of one point takes only one clock cycle. The block implementing the interpolation is very simple and is not detailed in the text.
Propagation. The propagation computes the distance from the initialized points by using the numerical scheme below. The propagation uses the 4-neighborhood. We call active points those points the current value of which verifies:
where ε is the least possible increment allowed by the used numerical scheme ε = inf( f diff ) (see Eq.3). Since all the points are activated in parallel, with the same priority, the propagation front is not equidistant to the initial condition C 0 . The activation criterion Eq.2 allows to activate the unprocessed points as well as points already computed in an earlier iteration. The re-activation of some points happens essentially in cases where some point is activated by a propagation coming from another source rather than the closest one. The distance value of such point is detected and re-computed in the next iteration.
Numerical scheme. The computation of the distance can be done by using either the Godunov numerical scheme (5) or its approximation. The Godunov scheme requires to find the maximum solution of a quadratic equation considering all the four neighbors (8) . The Godunov scheme resumes to
where u x , u y and u min are the minimum 4-neighbours in the x, y and both directions. Generally f diff is some increasing, non-linear function. In order to reduce the computation complexity (i.e. computation of squares and roots) we propose to use a piecewise approximation of f diff . A discussion of several approximation types can be found in (8) .
The distance approximation is done in two steps. The first step, called Jacobi step, calculates the distance value in time t n+1 given the values obtained in time t n . The second step, called Gauss-Seidel step, recalculates the values obtained in t n+1 by the Jacobi step.
We discuss the implementation aspects in the text below.
GLOBAL ARCHITECTURE
For the simulation and validation we have chosen the division of the input image into the columns, i.e. one processing unit per column of the image. The processing units then process in parallel all the points in a row at the given time. As soon as one row is processed, all the processing units shift one point down to process the next row. Note that the choice of the image division affects only the communication procedure between the adjacent processing units and not their internal architecture. We give below a description of the proposed (and tested) accessing to the neighbors.
As introduced in section Algorithm Discussion all the processing units are controlled by the single Global Control block. It reads high-level instructions from the algorithm Code Memory. Every instruction is executed in parallel at a given time. The Global Control block not only controls the execution of the algorithm but also allows to change the values of the linearization registers inside each processing unit. Thus we can to implement the approximation of almost any non-linear function (see the section Piecewise Linearization).
Processing Unit
The Processing Unit (cf. Note that the input data are stored in a non-shared data memory. Therefore, all the units can access to its data simultaneously at the given time. The memory is a double-port memory; before the execution of the algorithm, the data are uploaded by using a global access port (not given in the schematics) and read after.
Recall that the Massive Marching uses the 4-neighborhood. In order to reduce the number of the interconnections, we use bi-directional buses for the communication with the adjacent PUs. The bus direction is controlled by the signals t_east and t_west derived from clk/2. The complete neighborhood of a point is obtained in two clock cycles in the following way (cf. Figure 2 ). With a rising edge of the clock a new SOUTH value is read from the local data memory whereas the old values SOUTH and CENTER are shifted upwards (e.g.
112, 113, 114).
The values EAST and WEST are read from the bus on the falling clock edges. First, the WEST value is read from the west-side adjacent PU while the SOUTH point value is being sent to the east-side adjacent PU. On the next falling edge is read the EAST point while the CEN-TER point is being sent to the west-side adjacent PU (Figure 3 ).
The complete neighborhood is ready immediately after the reading of the EAST point (indicated by the dashed line). The same communication protocol also applies to the filtering. Piecewise linearization. Instead of computing the exact value of the nonlinear function from Eq.3 (which requires the computation of square and square root) we propose to use a piecewise approximation. The approximation allows to preserve the necessary sub-pixel precision while reducing the implementation cost. The Eq.3 is rewritten in the form:
Where g diff is a function to be approximated. To calculate the piecewise approximation of g diff one only has to compute a (| u x -u y |) + b for each interval. Therefore, the number of operations to obtain u(p) reduces to three additions, one subtraction and two multiplications. The functional scheme of the Piecewise Linearization block is given by The computation process is completely pipelined. After some initial delay, the result is obtained in one clock period. The computation latency is 10 periods.
Pixel activation. Each pixel has its own flag from which is read the activity status of the processing unit. It indicates whether the new value is to be computed or not. The activation flag of the point x gets active whenever the condition (Eq.2) is verified. The active points are testing their activity for the next iteration by using the condition (Eq.2) and may activate their inactive neighbors by sending them an activation request (see Figure 5 for signals ARN, ARS, ARE, ARW -Activation Request to North, South, etc.). As soon as all the flags are inactive the propagation stops. 
CONCLUSION
This paper presents a new architecture not only accelerating the execution of PDE-based algorithms but also making them accessible in the embedded applications. The divide-and-conquer architecture has been chosen as a reasonable trade-off between the addressing complexity, the restriction of memory accesses and the balanced activity of simultaneously operating blocks. Whereas the activity distribution over the set of processing blocks for the curve-evolution based algorithms depends on the curve geometry itself for the filtering, the activity is uniformly distributed on all the processing units. The advantage consists in using the same architecture without any modifications for both types of algorithms.
The proposed implementation has been validated on FPGA. We have used FPGA-specific blocks such as fast multiplier/adders or delay-locked loop for the clock division in order to optimize the design speed and occupied surface. The estimated surface of the approximation block in one PU is 10k gates (8+8bits fixed point precision). The frequency estimation has been obtained by the implementation (after routing) on an FPGA Xilinx Virtex and it is 150MHz for the internal PU blocks, i.e. 75MHz for the memory accessing. The worst-case computation time estimation for the distance function on an image 256×256 from a point placed in the corner is 4ms.
Since the global control allows to change the values of the constant registers (piecewise linearization and interpolation) the proposed architecture of the processing unit can also be, without important changes, used for other applications. One of them is the anisotropic diffusion filter. As introduced in (2) the anisotropic diffusion can also be approximated by using the piecewise linearization approach. The same architecture only needs to rearrange the data path in order to compute the first derivatives for all neighbors and their sum.
The future work consists in the realization of the above-mentioned changes and in their evaluation. The next stage will be the definition of an appropriated approximation of the curvature and the implementation of active contours. Also, we will compare the efficiency of this architecture versus several pipelined processors processing the active points queued in a FIFO.
