Abstract. This work presents some considerations on the hardware organization of the external memory in low-cost massively parallel systems and discusses a few basic criteria for the development of efficient algorithms. These criteria are illustrated with the help of a case study.
Introduction
This work discusses the problem of data coding on low-cost SIMD massively parallel systems, namely those in which the Processor Array (PA) is composed of a set of extremely simple Processing Elements (PEs) arranged on the nodes of a 2-dimensional mesh.
Since a 1:1 mapping between the data set and the PA is not feasible for generic Image Processing tasks, a processor virtualization mechanism is needed. When each PE can only handle a small amount of memory, the computation is serialized in windows: the PA is loaded with a sub-window of the data set, then the computation is performed until a special instruction is reached, and finally the results are stored back into the memory. These steps are iterated until all the sub-windows have been processed. Then the first sub-window is reloaded into the PA and the computation is resumed until the next special instruction is found. Generally, due to the small amount of memory associated with a single PE, low-cost systems [1, 4, 5, 6] use this second solution (the so-called external virtualization [1]), which requires an external memory for data storage.
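The load, compute and store cycle described above can be sketched in a few lines. This is only an illustrative model, not the PAPRICA implementation: the function name `process_in_windows`, the `kernel` callback (standing in for the instructions executed up to the special instruction) and the list-of-lists image are all assumptions, and the sketch ignores the border overlap that neighborhood operations would need between adjacent sub-windows.

```python
from typing import Callable, List

Image = List[List[int]]


def process_in_windows(image: Image, pa_size: int,
                       kernel: Callable[[Image], Image]) -> Image:
    """Apply `kernel` to every pa_size x pa_size sub-window of `image`.

    Hypothetical model of external virtualization: `image` plays the role
    of the external memory, each sub-window is what fits on the PA.
    """
    rows, cols = len(image), len(image[0])
    result = [[0] * cols for _ in range(rows)]
    for top in range(0, rows, pa_size):
        for left in range(0, cols, pa_size):
            # Load: copy one sub-window from external memory into the PA.
            window = [row[left:left + pa_size]
                      for row in image[top:top + pa_size]]
            # Compute: run the program segment on the loaded sub-window.
            out = kernel(window)
            # Store: write the results back to external memory.
            for dy, out_row in enumerate(out):
                result[top + dy][left:left + len(out_row)] = out_row
    return result


def invert(window: Image) -> Image:
    """Toy per-pixel kernel: complement a binary sub-window."""
    return [[1 - p for p in row] for row in window]


demo = process_in_windows([[0, 1, 0], [1, 1, 0], [0, 0, 1]], 2, invert)
```

A program containing several special instructions would call such a loop once per program segment, which is why the first sub-window is reloaded after the last one has been stored.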
Moreover, the specific choice of the hardware organization of the external memory is another crucial architectural key point. Two possible solutions can be devised. In the first one, which is the one considered in this work, each memory word contains the values of different binary layers belonging to the same image pixel. In the second one, each memory word contains the values of adjacent image pixels belonging to a single binary layer. The former has the disadvantage that for each data transfer a fixed number of binary layers (equal to the bus parallelism) is moved from the image memory to the PA. As a consequence, the data bus efficiency (BUS) seldom reaches high values, because the parallelism required by the data transfer seldom matches the bus parallelism. Conversely, the latter has the advantage of moving through the data bus only the required amount of information, but it requires some additional hardware. In fact, a single datum coming in parallel from a sensor (camera, VCR, ...) carries the information of a single pixel, so this second solution would need a hardware data shuffler for the transposition of the data. The Connection Machine CM-2 [3] has a hardware extension (Sprint-chip) explicitly devoted to this purpose.
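The two memory organizations can be compared with a small sketch. The function names, the bit ordering inside a word and the definition of bus efficiency as the fraction of useful bits per transfer are assumptions made for illustration; the text does not give an explicit formula.

```python
from typing import List

Layer = List[List[int]]  # one binary layer, indexed as layer[y][x]


def pack_pixel_word(layers: List[Layer], x: int, y: int,
                    bus_width: int = 16) -> int:
    """First organization: one word holds the bits of different
    binary layers for the same pixel (bit k = layer k)."""
    word = 0
    for k, layer in enumerate(layers[:bus_width]):
        word |= (layer[y][x] & 1) << k
    return word


def pack_layer_word(layer: Layer, x0: int, y: int,
                    bus_width: int = 16) -> int:
    """Second organization: one word holds adjacent pixels of a
    single binary layer (bit k = pixel x0 + k)."""
    word = 0
    for k in range(bus_width):
        if x0 + k < len(layer[y]):
            word |= (layer[y][x0 + k] & 1) << k
    return word


def bus_efficiency(layers_needed: int, bus_width: int = 16) -> float:
    """Assumed definition: useful bits over transferred bits when every
    transfer always moves `bus_width` layers of one pixel."""
    return min(layers_needed, bus_width) / bus_width


three_layers = [[[1]], [[0]], [[1]]]           # three 1x1 layers
pixel_word = pack_pixel_word(three_layers, 0, 0)
layer_word = pack_layer_word([[1, 0, 1, 1]], 0, 0, bus_width=4)
```

Under this assumed definition, an algorithm that needs only 8 of the 16 layers moved in each transfer uses half of the bus, which is the mismatch the first organization suffers from.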
The architecture considered in this work, PAPRICA [1], is composed of a 16 × 16 PA, linked to the image memory through a 16-bit data bus, which transfers 16 bits of a single image pixel in parallel, and uses the external virtualization mechanism. The aim of this paper is the study of the classical trade-off between data packing and processing speed in the conditions described above.
The following Section presents an Image Processing case study which shows that, thanks to a plain data coding, it is possible to replace the evaluation of complex boolean functions with simple morphological operations. Section 3 ends the paper with some concluding remarks.
A morphological algorithm for slope detection
The aim of the algorithm is to label each foreground pixel of a binary image with a value representing the local slope s of the line the pixel belongs to. In the continuous case it is:
\[ s(P) = \lim_{P' \to P} \frac{y' - y}{x' - x}, \tag{1} \]
where $P$ and $P'$ represent two foreground pixels with coordinates $(x, y)$ and $(x', y')$ respectively, belonging to the same line, as shown in fig. 1.a.
Fig. 1. Slope: continuous (a) and discrete (b) cases
Due to the specific architecture, the problem must be reduced to the evaluation of the slope value through the analysis of the line morphology contained in a finite-dimension neighborhood of pixel P. A 3 × 3 neighborhood is too small to characterize the line slope, and thus it is necessary to increase the neighborhood dimension.
As an example, fig. 1.b shows a discrete case corresponding to the analysis of the line morphology along an 11-step-long monodimensional neighborhood.
Let $P^{(1)}(x^{(1)}, y^{(1)}), P^{(2)}(x^{(2)}, y^{(2)}), \ldots, P^{(n)}(x^{(n)}, y^{(n)})$ be the sequence of $n$ foreground pixels; then $s_{11}(P^{(i)})$ can be obtained as follows:
\[ s_{11}(P^{(i)}) = \frac{y^{(i+5)} - y^{(i-5)}}{x^{(i+5)} - x^{(i-5)}}, \quad \text{with } 5 < i \le n - 5, \tag{2} \]
where the subscript "11" indicates the length of the monodimensional neighborhood analyzed. As shown in Eq. (2), the most intuitive solution, directly derived from the continuous case (Eq. (1)), is based on the iterative propagation of the pixels' x and y coordinates. Unfortunately, its implementation on low-cost SIMD architectures is quite complex, due to the extremely simple internal structure of the PEs and of their interconnection network.
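As a sequential reference for the coordinate-based solution, the sketch below evaluates the slope of an ordered pixel sequence over an 11-step neighborhood. It assumes that the slope is the ratio of the coordinate differences between the two pixels five steps before and five steps after $P^{(i)}$, which is the reading suggested by the index range $5 < i \le n - 5$; the function name and the use of `None` for vertical segments are illustrative choices.

```python
from typing import List, Optional, Tuple


def slope_11(points: List[Tuple[int, int]], i: int,
             half: int = 5) -> Optional[float]:
    """Slope at the i-th pixel (1-based, as in the text) of an ordered
    sequence of foreground pixels, measured between the pixels `half`
    steps before and after it (assumed form of the discrete slope)."""
    n = len(points)
    if not (half < i <= n - half):
        raise ValueError("i must satisfy half < i <= n - half")
    x_prev, y_prev = points[i - half - 1]  # P(i-5), converted to 0-based
    x_next, y_next = points[i + half - 1]  # P(i+5), converted to 0-based
    dx = x_next - x_prev
    if dx == 0:
        return None  # vertical segment: the slope is undefined
    return (y_next - y_prev) / dx


line = [(x, 2 * x) for x in range(11)]   # 11 pixels on y = 2x
vertical = [(3, y) for y in range(11)]   # 11 pixels on x = 3
```

On a sequential machine this is trivial once the pixels are ordered; the difficulty mentioned above lies in propagating the coordinates along the line on a mesh of very simple PEs.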
On the other hand, the algorithm presented in this work is based on an iterative analysis of a 3 × 3 neighborhood. Increasing the number of iterations corresponds to widening the analyzed neighborhood.
The output quantization, namely the number of different slope values, is a basic parameter of the algorithm. In the following case study, the final quantization is limited to 16 different slope values.
2.1 Slope coding in a 3 × 3 neighborhood
Assuming that the incoming binary image comes from a thinning filter, no more than two pixels can be set in the 3 × 3 neighborhood of each foreground pixel. Thus, excluding the case of line-ending pixels, there are only $\binom{8}{2} = 28$ possible neighborhood configurations (3 × 3 patterns), which are then reduced to 16: in fact, the 3 patterns shown in fig. Thus, the number of bits needed to code the 8 different slopes is 3, but, as explained in the next subsection, even if a 1/8 slope coding is more redundant, it reduces the computational complexity of the next processing step, thanks to its uncompressed format.
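The count of neighborhood configurations and the difference between the 1/8 coding and a compact 3-bit coding can be checked with a short sketch. The neighbor offsets, the slope indices 0 to 7 and the helper names are assumptions for illustration; the text does not specify which bit is assigned to which slope.

```python
from itertools import combinations
from typing import List

# Offsets (dx, dy) of the 8 neighbors in a 3 x 3 neighborhood.
NEIGHBORS = [(-1, -1), (0, -1), (1, -1),
             (-1, 0),           (1, 0),
             (-1, 1),  (0, 1),  (1, 1)]

# Every way of choosing the 2 set neighbors of a non-ending line pixel.
patterns = list(combinations(NEIGHBORS, 2))


def one_of_eight(slope_index: int) -> int:
    """1/8 coding: exactly one of 8 bits is set (hypothetical bit order)."""
    return 1 << slope_index


def decode(code: int) -> List[int]:
    """Slope indices whose bit is set in an 8-bit code."""
    return [k for k in range(8) if (code >> k) & 1]


# OR-ing two 1/8 codes keeps both slopes recoverable ...
union_plain = one_of_eight(1) | one_of_eight(2)
# ... while OR-ing two compact 3-bit codes yields another valid code.
union_compact = 1 | 2
```

Here `decode(union_plain)` still lists slopes 1 and 2, whereas `union_compact` equals 3 and is indistinguishable from the compact code of slope 3, which is why the redundant coding makes a plain logical union usable in the next step.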
2.2 Coding a 5 × 5 neighborhood
Due to the previous uncompressed coding, the logical union of the 1/8 codings of the two neighboring pixels provides a new 2/8 coding² for each pixel, as shown in fig. 4.g. Figure 4.h shows the slope value computed on a 5-step-long monodimensional neighborhood: in this case the 2/8 coding corresponds to a 16-level quantization.
The logical union mentioned above reflects the definition of a binary morphological dilation [2]³; thus, the 2/8 coding can be performed by 8 binary 3 × 3 neighborhood-based dilations (one for each binary layer). The use of morphological dilations, naturally implemented on any SIMD cellular system, together with a plain 1/8 coding, simplifies the 2/8 coding process, since background pixels are transparent with respect to this operation. Moreover, if both neighboring pixels encode the same value, the resulting coding is 1/8, which further simplifies the final step.
The advantage of using the 1/8 coding instead of a more compact one (for example, one using 3 bits only) is mainly the possibility of replacing the synthesis of complex boolean functions with simple morphological operations.
² This means that only 2 bits out of 8 can be set at the same time.
³ In fact, a morphological dilation can be defined as the result of the logical union among the pixels belonging to a given neighborhood.
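The union step can be sketched on an image of 8-bit codes. OR-ing whole 8-bit codes is equivalent to running the 8 per-layer binary dilations at once. The sketch makes two assumptions not stated in the text: the central pixel is excluded from the union, and the result is kept only on foreground pixels. The function name and the sample codes are hypothetical.

```python
from typing import List

CodeImage = List[List[int]]  # 8-bit 1/8 slope codes, 0 = background


def union_of_neighbors(codes: CodeImage) -> CodeImage:
    """For each foreground pixel, OR the codes of its 8 neighbors.

    Background pixels hold 0, so they do not affect the union.
    Assumptions: center excluded, output masked to foreground pixels.
    """
    h, w = len(codes), len(codes[0])
    out = [[0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            if codes[y][x] == 0:
                continue  # leave background pixels empty
            acc = 0
            for dy in (-1, 0, 1):
                for dx in (-1, 0, 1):
                    if dx == 0 and dy == 0:
                        continue
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < h and 0 <= nx < w:
                        acc |= codes[ny][nx]
            out[y][x] = acc
    return out


# A thin horizontal line whose pixels carry different 1/8 codes.
line_codes = [[0, 0, 0, 0],
              [0b001, 0b010, 0b010, 0b100],
              [0, 0, 0, 0]]
coded = union_of_neighbors(line_codes)
```

In this example the second pixel receives `0b011`, a 2/8 code, from its two differently coded neighbors, while a pixel whose two neighbors share the same code would keep a 1/8 code.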
2.3 The iterative process
As shown in fig. 4, the slope values stored with a 2/8 coding are further requantized to 8 levels (the effect of the quantization can be measured by comparing the values presented in fig. 4.h and fig. 4.j) by a set of boolean functions, in order to obtain a new 1/8 coding (fig. 4.i). The process is iterated until the required neighborhood dimension has been reached.
The rationale underlying the whole process is the following: 
From Eq. (3) and (4) ) and s_3(P ), which encode the four following quantities: ).
– The 8-level quantization (fig. 4.i) is used to convert the 2/8 coding into a 1/8 coding to be used in the iterative process. It is important to note that this operation entails a loss of information due to the quantization process.
Conclusions
The case study discussed in this paper has shown that the discrete version of the solution for the continuous case is quite complex to implement on low-cost SIMD systems. Conversely, exploiting the specific features of the hardware architecture, a different algorithm has been proposed in this work, based on local computations only and on simple morphological processing. In fact, with a plain data coding, the synthesis of complex boolean functions has been reduced to the trivial application of morphological dilations. The algorithm has been implemented directly in Assembly language on the PAPRICA architecture. Two iterations of this filter on a 256 × 256 image take about 50 ms.
