Abstract - Many low-level vision tasks that are computationally intensive are easily parallelizable. The lack of parallel processing systems, or their prohibitive cost, has prevented the move of vision processing algorithms from single processor systems to multiprocessor systems. With the recent spurt in parallel processing hardware, there is a need to investigate the feasibility of using such machines for some vision algorithms. Speedup is an important factor in determining the feasibility of migration from single processor systems to parallel processors. In this work, we investigate a particular segmentation algorithm and present theoretical speedup results. Numerical speedups can be obtained from our formulas by plugging in the parameter values.
INTRODUCTION
Computer vision tasks require an enormous amount of computation, demanding high performance computers for practical, real-time applications. Parallelism appears to be the only economical way to achieve this level of performance. Most of the work in computer vision focuses on images with 2-D data, but pragmatic vision problems require 3-D data, which is now easily available.
Three dimensional data may be represented by a 3-D matrix of intensity values, $f(i, j, k)$, where each intensity value represents a property associated with the location $(i, j, k)$. A primary goal (and initial step) of computer vision is to abstract "relevant" information from an image. This may involve a process called segmentation that groups a set of homogeneous pixels into regions. Homogeneity can be defined by different criteria depending upon the image modality. Segmentation thus reduces the information content in the image to the most relevant, and by defining some features of the segmented regions, computer vision scientists hope to extract just enough information to characterize those regions.
Such an abstraction is helpful in higher level tasks like object recognition [1] and visualization. The two major approaches to segmentation are region growing and region splitting. In region growing, each pixel is considered in relation to its neighbors, and pixels that are "closer" in some distance metric are merged. In region splitting, on the other hand, the whole image is initially considered to be one single region, and this region is recursively split into smaller regions. Both approaches are in general amenable to parallel implementation.
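As an illustration of region splitting, the following Python sketch recursively quarters a square image until every block is homogeneous. The function name `split_regions`, the use of the intensity range as the homogeneity test, and the `threshold` parameter are assumptions made for this example; the text does not prescribe a particular criterion.

```python
def split_regions(image, x, y, size, threshold):
    """Recursively split a square block until each block is homogeneous.

    image     -- list of rows of intensity values (side length a power of two)
    x, y      -- column and row of the block's top-left corner
    size      -- side length of the block
    threshold -- hypothetical homogeneity bound on (max - min) intensity
    Returns a list of (x, y, size) blocks.
    """
    values = [image[r][c]
              for r in range(y, y + size)
              for c in range(x, x + size)]
    # A single pixel, or a block with a small intensity range, is not split.
    if size == 1 or max(values) - min(values) <= threshold:
        return [(x, y, size)]
    half = size // 2
    regions = []
    for dy in (0, half):
        for dx in (0, half):
            regions += split_regions(image, x + dx, y + dy, half, threshold)
    return regions


# A 4 x 4 image whose left half is dark and right half is bright.
example = [[0, 0, 10, 10] for _ in range(4)]
blocks = split_regions(example, 0, 0, 4, threshold=1)
```

Each quadrant is processed independently of the others, which is one reason why this family of methods lends itself to parallel implementation.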
In this paper we derive formulas for the theoretical time complexity and the speedups obtainable when a segmentation algorithm is implemented on the MasPar SIMD machine. We first briefly describe the segmentation algorithm. We then discuss the architecture of the SIMD machine, the MP-1, and derive the time complexity. The final section concludes the paper.
DESCRIPTION OF THE ALGORITHM
The segmentation algorithm of Sabata et al. [2] [3] was chosen as the candidate for implementation on a SIMD machine. The first stage of the two stage process involves oversegmenting the image based on zeroth- and first-order surface properties. The second stage involves merging small regions into a final segmented output, with each segmented region being specified by a bivariate polynomial.
A number of intensity images are generated from the 3-D range data, each assuming a light source at a different point. The first segmentation module uses a pyramid structure of $p + 1$ levels:
Level 0 has $2^{p+1} \times 2^{p+1}$ nodes, with each image pixel being mapped to one node. The next level contains $2^p \times 2^p$ nodes, and the $i$th level contains $2^{p+1-i} \times 2^{p+1-i}$ nodes. Performing this first stage on all the image regions yields a large number of segmented regions, resulting in oversegmentation of the image. The second stage is the merging process. Oversegmentation is followed by region merging based on a neighborhood criterion of least mean square (LMS) error of bivariate polynomial fits to adjacent regions. Adjacent regions with an error less than a threshold are merged. This stage is driven by the requirements of the higher level vision task. A segmented image is the final output.
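The pyramid geometry can be tabulated with a few lines of Python. This is a minimal sketch that assumes every level halves the side length of the level below it, so that level $i$ has side $2^{p+1-i}$; the function name `pyramid_level_sides` is introduced here for illustration only.

```python
def pyramid_level_sides(p):
    """Side length of each of the p + 1 pyramid levels, level 0 first.

    Assumes level 0 is 2**(p + 1) nodes on a side and that each higher
    level halves the side length of the level below it.
    """
    return [2 ** (p + 1 - i) for i in range(p + 1)]


sides = pyramid_level_sides(6)               # p = 6 gives a 128 x 128 base
node_counts = [side * side for side in sides]
```

Under this assumption, each level holds a quarter of the nodes of the level below it.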
PARALLEL IMPLEMENTATION
We attempt to parallelize the pyramidal node linking module and analyze its time complexity. To do so, we need to map the nodes of the pyramid to the PEs of the MP-1 efficiently, so as to minimize the idle time of each PE and achieve load balancing. Communication bottlenecks and message routing also have to be taken care of.
The particular MasPar configuration we consider, the MP-1 [4], has 16K PEs (processing elements) arranged as a $128 \times 128$ array, each with 64K of local memory. Each PE can communicate with the others by two methods. One is a "router" that allows any PE to communicate with any other PE, and the other is the faster "XNet" connections between each PE and its 8 neighbors. The front end is a DEC station 3500. The front end communicates with the array control unit (ACU), which has its own processor and memory.
The ACU primarily controls the operations in the PEs, though it can do limited computation. The front end machine (FE) is used to load programs and data into the ACU and DPU, while it can still act as a standalone workstation.
Each generated image undergoes pyramidal node linking. We assume that the time taken for each iteration is a constant, $t$. This is justifiable on a parallel machine, since each PE operates on one pixel, which makes the time taken independent of the data size. We shall use terminology similar to [5] to describe the different levels of computation. One stage is complete when all the iterations between any two levels $l_j$ and $l_{j+1}$ are completed.
Let the number of levels in the pyramid be $p + 1$.
We consider two allocation schemes of tasks to processors. The image is processed in a pyramidal fashion. The processing iterates between levels $l_0$ and $l_1$ until a certain criterion is satisfied, after which processing between levels $l_1$ and $l_2$ commences. Processing proceeds in this fashion until the topmost level of the pyramid is reached.
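The stage-by-stage processing order and the constant-time-per-iteration assumption can be written down as a small Python sketch. The function names and the example iteration counts are hypothetical; in practice the number of iterations per stage depends on when the linking criterion is satisfied.

```python
def stage_pairs(p):
    """Return the (lower, upper) level indices of the p stages of a
    pyramid with p + 1 levels, in the order they are processed."""
    return [(j, j + 1) for j in range(p)]


def linking_time(iterations_per_stage, t):
    """Total node linking time when every iteration costs the constant t,
    regardless of how many pixels the stage touches (one PE per pixel)."""
    return t * sum(iterations_per_stage)


stages = stage_pairs(3)                    # stages l0-l1, l1-l2, l2-l3
total = linking_time([4, 3, 2], t=0.5)     # hypothetical iteration counts
```

Because each PE handles one pixel, the cost of a stage in this model depends only on its iteration count, not on the size of the level.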
Mapping Scheme 1
In this mapping scheme, we divide the image into a number of sub-images, each equal to the size of the PE array. We denote by $T_{s1}$ the total time taken up by one subimage, and by $T_a$ the average idle time per PE.
Mapping Scheme 2
Let us consider yet another scheme for mapping tasks to processors. In this case, we try to maximize processor utilization by assigning the inactive PEs to other pixels. The scheme is illustrated in the accompanying figure.
In this scheme, after each stage has completed, $\frac{3}{4}$th of the PEs are freed up. These can be reconfigured immediately. In the subsequent steps, the same number of PEs become free. Thus, we can consistently assign new tasks to the same number of PEs after the computation has completed at each stage. Let us denote the total time taken for such stages by $T_{ta}$. In the last stages, however, when there are no more tasks left, an increasing number of PEs will become idle as each level of computation completes. Let the total time in this case be $T_{tb}$.
In this scheme, the PE array is split into four equal sized regions, and the image is split into sub-images matching the size of these regions. Let $T_{t2}$ denote the total time to complete the computation. Since the image size is $k \times k$, the PE array size is $n \times n$, and the PE array is split into four equal regions, the image is split into $\frac{4k^2}{n^2}$ sub-images. In the first level of computation, four sub-images are assigned to the PEs, and in the subsequent levels, three sub-images are assigned. This means that the $\frac{4k^2}{n^2}$ sub-images will have been completely allocated to the PEs after $n_s$ steps, with the $\frac{4k^2}{n^2} - 4$ sub-images that remain after the first step being allocated three at a time.
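The allocation count for this scheme can be checked with a short Python sketch. It assumes that $n$ is even and that $n/2$ divides $k$, and it counts the first step (four sub-images) together with the later steps (three sub-images each); the paper's $n_s$ may differ from this count by the first step. The function names are introduced here for illustration.

```python
def sub_image_count(k, n):
    """Number of (n/2 x n/2) sub-images in a k x k image, i.e. 4k^2/n^2."""
    if n % 2 or k % (n // 2):
        raise ValueError("assumes n is even and n/2 divides k")
    per_side = k // (n // 2)
    return per_side * per_side


def allocation_steps(k, n):
    """Steps needed to hand every sub-image to the PE array: four
    sub-images in the first step, then three per subsequent step."""
    remaining = max(sub_image_count(k, n) - 4, 0)
    # Ceiling division by three for the sub-images left after the first step.
    return 1 + (remaining + 2) // 3


steps = allocation_steps(256, 128)   # 16 sub-images: 4, then 3 + 3 + 3 + 3
```

For a $256 \times 256$ image on a $128 \times 128$ array, this gives 16 sub-images allocated in five steps.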
The time $T_{ta}$ is then expressed in terms of $T_p$, the equally spaced time interval after which PE reallocation takes place. $T_{tb}$ is the time taken for the last sub-image to iterate all the way through to completion. Also, let $t_j = 0 \;\forall\, j > p$.
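We can now formulate $T_{ta}$ in terms of $n_s$. Equation (5) is then modified into equation (6), which adds to the computation time a communication term $t_{comm}(n_s)$ and image-size-dependent overhead terms such as $t_{s2}(k)$.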
Boundary conditions
In the above mapping scheme, we conveniently ignored edge conditions. When we allocated subimages whose size was equal to the PE array, we assumed that the information needed to compute the pixel values for the next level was fully contained in the subimage. This is not the case. The pixels on the border actually require information from the adjacent subimages. In effect, we can only compute the next higher level for an image that is slightly smaller than the PE array size.
Say, from any level $l_j$ to level $l_{j+1}$, there is a neighborhood of $n_h \times n_b$ ($= n_t$) pixels at level $l_j$ that takes part in the computation of one pixel in level $l_{j+1}$. Consider the allocation of one subimage of size $n \times n$ to the PE array. Since the information for some of the border pixels is missing, they cannot be computed. As a result, instead of computing the next-level values for the whole subimage, we can only compute the pixel values for a region that is smaller by a border margin along each dimension.
Incidentally, in this problem we use $n_b = n_h = 2$.
Following the figure, say we allocate the array of PEs with the subimages as shown. There will be a certain amount of overlap in the subimages. In other words, some pixels may have to be allocated to the PEs twice. If allocation proceeds in this manner, a rectangularly shaped region, which we shall call residual pixels, will be left out at the right end of the image and at the bottom of the image. These can later be allocated to the PE array. If the residual pixels are few, one allocation to the PE array would suffice. However, if there are too many residual pixels, a number of allocations have to be done. Here, we attempt to compute the number of times these residual pixels have to be allocated.
Figure 5: There is a certain overlap in the allocation scheme to take care of the edge pixel computations. The unbroken lines indicate subimages whose size is equal to the PE array. However, since the PE array is missing some neighboring information to compute the border pixels, only the areas indicated by the broken lines are computed. Each subimage allocated to the PE array has some overlap, as indicated by the broken lines. The residual pixels occur at the right end and the bottom of the image.
As before, say we have an image of size $k \times k$ and a PE array of size $n \times n$. The residual pixels on the right hand side of the image form a strip of a certain "thickness" whose height is $k$. Similarly, the residual pixels at the bottom of the image form a strip of a certain "height" whose length is $k$. The total number of residual pixels, $P_r$, is therefore the sum of the areas of these two strips minus the area of the corner where they meet, because that corner has been counted twice. Considering the arrangement of the residual pixels as shown in figure 3, we would require more than $P_r$ pixels to be allocated to the PE array, since neighborhood information is required. This brings us to the question of how many actual pixels need to be allocated to the PE array.
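The counting argument for the residual pixels is an inclusion-exclusion over the two strips, which the following Python sketch makes explicit. The strip `thickness` and `height` are taken as inputs here because their closed-form expressions are not reproduced in this text; the function name is likewise an assumption of this example.

```python
def residual_pixel_count(k, thickness, height):
    """Residual pixels in a k x k image: a right-hand strip that is
    `thickness` wide and k tall, plus a bottom strip that is `height`
    tall and k long, minus the corner block counted in both strips."""
    right_strip = thickness * k
    bottom_strip = height * k
    shared_corner = thickness * height
    return right_strip + bottom_strip - shared_corner


p_r = residual_pixel_count(256, thickness=2, height=2)   # hypothetical strip sizes
```

As a sanity check, strips that each span the whole image give back exactly $k^2$ pixels.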
The row residues and the column residues do have some pixels in common. However, for analytical simplicity, we assume that these common pixels are allocated independently. Thus, the same pixel might be allocated twice in the PE array at the same time.
Let $P_t$ therefore be the number of pixels that remain to be allocated. If $P_t$ is smaller than the PE array size, one allocation would suffice. If not, the PEs have to work on the residual pixels a number of times, $n_r$. In our case, with $n_b = n_h = 2$, $n = 128$ and $k = 256$, we get $n_r = 1$.
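A natural reading of the allocation count is the ceiling of $P_t$ divided by the number of PEs, $n^2$. The Python sketch below implements that reading; it is an assumption consistent with the one-allocation case described above, not a transcription of the paper's formula, and the value of $P_t$ is supplied by the caller.

```python
def residual_allocations(p_t, n):
    """Number of times the n x n PE array must be loaded to process
    p_t residual pixels, at one pixel per PE per allocation."""
    capacity = n * n
    # Integer ceiling division: any partial load still costs one allocation.
    return (p_t + capacity - 1) // capacity


n_r = residual_allocations(1500, 128)   # hypothetical P_t below the 16384 PEs
```

Any $P_t$ between 1 and $n^2$ yields a single allocation, matching the statement that one allocation suffices when $P_t$ fits in the PE array.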
Thus the $\frac{k^2}{n^2}$ term is replaced by $\left(\frac{k^2}{n^2} + n_r\right)$ in equation (1), and the expression for $n_s$ changes correspondingly.
Single processor complexity
On a single processor machine, the total computation time is simpler to derive. Since the number of pixels reduces by a factor of four after each level of computation, the total time taken for the complete computation is a sum over the levels of terms that shrink geometrically by a factor of four.
As in the previous case, we can include a term, $t_{s3}(k)$, for the time taken for file I/O, which is added to this total.
One can use equations (1), (6) and (12) to obtain the speedups for the two mapping schemes.
If n = 128 and p = 6 as in our case, we get a speedup of 3640.
This speedup is the theoretical maximum, and it is important to note that the number is given for illustrative purposes only. In actual computations, we need to take communication and other factors into account. For this reason, we make no attempt to give numerical values for speedups. Instead, we simply provide the equations; if the communication and other delays are known, they can be plugged directly into the equations.
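The quoted figure can be reproduced with a simple model, sketched below in Python. The model is an assumption rather than a transcription of the paper's equations: the image is taken to fit the PE array exactly ($k = n$), every one of the $p$ stages costs one time unit on the array, and on a single processor a stage costs one unit per pixel of its lower level. The `overhead_units` parameter is a hypothetical stand-in for communication and other delays.

```python
def single_processor_units(n, p):
    """Work on one processor: the pixel count shrinks by a factor of
    four at each of the p stages, starting from n * n pixels."""
    return sum((n * n) // (4 ** j) for j in range(p))


def ideal_speedup(n, p, overhead_units=0.0):
    """Ratio of single processor work to parallel time, where the array
    spends one unit per stage plus any supplied overhead."""
    return single_processor_units(n, p) / (p + overhead_units)


work = single_processor_units(128, 6)    # 16384 + 4096 + 1024 + 256 + 64 + 16
speedup = ideal_speedup(128, 6)          # 21840 / 6 = 3640.0
```

Under these assumptions the geometric sum has the closed form $\frac{4n^2(1 - 4^{-p})}{3p}$, which also evaluates to 3640 for $n = 128$ and $p = 6$. Passing a nonzero `overhead_units` shows how quickly the ideal figure drops once delays are included.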
Speedup for mapping scheme 2
The speedup for mapping scheme 2 is derived from equations (6) and (12), as the ratio of the single processor time to the time taken under this scheme.
CONCLUSIONS AND FUTURE WORK
In this work, we have derived a formula for the theoretical speedup obtained for the pyramidal segmentation algorithm of Sabata et al. [2] when switching from a single processor to a SIMD parallel processor. Our exact case analysis will give an accurate estimate of the speedups if the values of the different parameters are known. One could also estimate these parameters approximately. However, we have made no attempt to do so.
Currently, we are working on implementing the segmentation algorithm on the MasPar MP-1 system. This will help us to compare the performances of the system in the real world to our theoretical results.
