This paper presents a hierarchical image transformation and efficient parallel algorithms for its evaluation. This transformation maps image structures onto code trees of different height, depending on the size of the structure. Thereby, important structures are effectively separated from the background. The inherent parallelism of such a hierarchical image transformation is outlined. The algorithms are domain independent and were successfully used for workpiece recognition and for traffic sign detection. A communication module for farmer-worker applications that supports specialized processors, like frame grabbers or display units, as well as the parallel recognition process are illustrated in detail. The implementation is done on a 9-node transputer image processing system. The functionality from grabbing an image and low-level filtering to transformation and high-level symbolic pattern analysis is integrated.
Introduction
The analysis of images typically consists of several stages. Firstly, on a low level enhancement operations are used to reduce noise; secondly, a segmentation process separates objects from each other and from the background; and thirdly, the objects are classified.
Various techniques have been developed by different research groups. Parallel architectures are used to accelerate these time-consuming processes. Two main types of systems have to be considered: the SIMD architecture (e.g. Connection Machine, DAP) for fine-grained systems and the MIMD architecture (e.g. Parsytec Supercluster, Intel iPSC/2 hypercube) for coarse-grained ones. Both types have been tested for image processing algorithms. Bhandakar [3], for instance, compared the Connection Machine and the Intel hypercube, implemented several recognition algorithms, and discussed the problems and advantages when mapping algorithms onto the architectures. An overview of the characteristics of several image processing and vision algorithms, including low-level algorithms such as smoothing, filtering, edge enhancement and edge thinning, transforms such as Fourier and Hadamard, statistics-based algorithms such as local and global histograms, and high-level processing involving pattern and graph matching, is given by Jamieson [15]. The relationship between these algorithms' characteristics on the one side and special image processing architectures (e.g. MPP, GAPP, Clip IV, Clip VII, Cytocomputer, CAAPP, Picap, Tospix) and more general architectures (e.g. PASM, Connection Machine, Ultracomputer) on the other side is examined. This gives a useful hint for answering questions about the range of algorithms for which a particular architecture might be applicable. Bhandakar, Chu, Jamieson and Stubbington [3, 9, 15, 24] pointed out that fine-grained massively parallel architectures are very well suited for low-level image processing, whereas high-level symbolic object recognition can be dealt with better on coarse-grained architectures in which complex processors are coupled.
Until now, most research in parallel image analysis has been done in the field of low-level and intermediate-level operations. Little [16], for instance, presents a set of solutions of computer vision problems on the Connection Machine. A sample of fundamental procedures for image understanding was implemented to demonstrate the ability of the Connection Machine to solve vision problems. These include filtering techniques, edge detection, connected component labeling, and Hough transform. The detection and segmentation of blobs has been implemented successfully by Atherton et al. [1] on the Warwick Multiple-SIMD architecture. They used a spoke filter that maps onto the SIMD array very effectively, and also employed connected component labeling algorithms and a convex hull operation.
A number of the drawbacks of the SIMD architecture are solved by the M-SIMD by dividing the SIMD array into smaller clusters, each with a controller and an associated conventional processor for numeric and symbolic computation, which has been used for the convex hull operation.
Several authors discuss the implementation of a parallel Hough transform and of filtering on transputer nets, e.g., Bison and Wilson [5, 26]. The Inmos transputer is specifically designed for multi-processor systems. Four bidirectional links are available for communication among these otherwise conventional processors in the net. Although Eghtesadi [12] mentions that transputer networks are not optimal to solve these transforms, she outlines that the implementation is easily achieved and that it is flexible and expandable. Furthermore, a higher level of decision making in the recognition hierarchy can be integrated on the same network.
Parallel model-based shape recognition is presented by Cass [8] on the Connection Machine.
Object models and image features are represented as contour features. A transformation sampling technique is used to determine the optimal model feature to image feature transformation. Dinstein [10, 11] proposes an algorithm described for an EREW PRAM architecture. The approach uses parallel techniques for contour extraction, parallel computation of normalized contour-based feature strings and parallel string matching algorithms.
Preliminary suggestions for parallel knowledge-based vision systems can be found in the papers of Moldovan [21] and Browse [6]. Both of them compare image features with object models stored in a knowledge base. Moldovan maps his algorithms onto a mesh-connected array of general purpose microprocessors. A simulated MIMD algorithm to exploit parallelism in knowledge-based systems is presented by Greenberg [13].
All of the papers mentioned above concentrate on certain aspects of parallel image processing tasks without combining them into a fast image processing system. They either ignore interprocessor communication and the distribution of the image to the different processors, or they work with relatively small images of 256x256 pixels or less, or with synthetic images rather than complex natural scenes.
The introduction of Parsytec's Transputer Image Processing system (TIP system) based on transputers solves some of the drawbacks of conventional transputer systems. In addition to their links, the processors are connected via a fast and scalable bus [22]. True color video images of 512x512 pixels can be distributed to the transputers in video real-time. Filtering operations can be computed by broadcasting the original image to every transputer, each of them processing one subimage and then gathering the subimages for further processing.
Currently, the system size is limited to 32 transputers which can be connected to the bus.
Nevertheless, an unlimited number of processors may be linked to them by their normal hardware links. This architecture allows the combination of low-level and high-level operations in one system. A TIP-system with eight bus connected transputers has been used for the implementation of our algorithms.
In this paper we will concentrate mainly on the segmentation and classification process with respect to efficient parallel computation. A unique hierarchical image transformation will be presented. This image transformation, called "Hierarchical Structure Code" (HSC), which has been developed by our group, provides a uniform hierarchical representation for regions and lines of different contrast type as well as for edges. The transformation is domain independent and has been successfully used for workpiece recognition and traffic sign detection. Since only local operations are used during the encoding and the linking procedures, strategies of parallel computation can easily be applied to produce the HSC on transputer nets of different sizes within a few seconds [23]. Owing to its regular and local operations, and in addition to the efficient parallel algorithms, a VLSI chip has already been built for our HSC-processor to encode images in real time [4].
To make the analysis of the data structure easier to follow, a short description of the HSC is given first; the analysis of the data structure and its parallel implementation are discussed afterwards.

The Hierarchical Structure Code
The process of building the HSC consists of an encoding and a linking procedure which are a transition from the signal space of an image into the space of its symbolic representation.
During the encoding procedure, the hexagonally digitized image is subdivided into overlapping islands of seven pixels. Thus, continuous structures like dark and bright lines, areas, vertices or edges are subdivided into structure elements. The information about a structure element within an island, together with its coordinate, is represented by a "code element" <t;m;ϕ|k;n=0;r;c>. The type of the structure element is encoded by t, its shape and orientation by m and ϕ, while r;c is called the "coordinate" (row; column) of the code element and |k;n;r;c> its "hierarchical coordinate". This encoding process is done on several different resolution levels k with a pixel distance of 2^k in different spatial frequency channels. This provides us with a similar code for small lines and regions as for broad lines and wide regions on all levels. Figure 1 illustrates this process of segmentation. All of these code types are used for the recognition of workpieces, while only edges are used for traffic sign recognition. During the linking process seven islands I|k;n> are combined into a larger island I|k;n+1>. In a first step, code elements of the seven islands I|k;n> are analyzed for continuity. In a second step, continuous structures are represented by a code element <t;m;ϕ|k;n+1;r;c>. After the m-th repetition of this linking process, a continuous structure is represented by just one single code element <t;m;ϕ|k;n=m;r;c> which is called the "root node" of the structure. Of course, the linking process finally terminates when one island covers the whole image.
In this way, a hierarchical data structure is built, consisting of "code trees" for all structures in the image with the detector elements of linking level n=0 as the trees' leaves. Bidirectional pointers connect the nodes in the tree. The upward directed pointers are called supercode pointers, and the downward directed ones are called subcode pointers. A more precise description of the encoding and linking process can be found in [14].
Code elements on different linking levels and an example of a code tree are shown in figure [...] shown in (4), while edge elements of code trees of size n≥4 can be seen in (5). Noise and small structures are already eliminated. Only dominant structures are able to build up code trees with root nodes on linking level n≥7 (6).
A data structure of two arrays is used to hold the code trees. The "key array" has one entry for every hierarchical coordinate |k;n;r;c>. This entry is a pointer to the code elements, which are encoded at this coordinate and their size. The code elements themselves are stored in the "data array" together with their supercode and subcode pointers.
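The two-array layout can be illustrated with a minimal Python sketch. The names `CodeElement`, `HSCStore`, `offset` and `count` are hypothetical, and a dictionary stands in for the key array; the actual memory layout is not specified here, so this is only one possible reading.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple

Coord = Tuple[int, int, int, int]  # hierarchical coordinate (k, n, r, c)


@dataclass
class CodeElement:
    t: str                           # structure type, e.g. "edge"
    m: int                           # shape
    phi: int                         # orientation
    supercode: Optional[int] = None  # data-array index of the parent, None for a root
    subcodes: List[int] = field(default_factory=list)  # data-array indices of children


class HSCStore:
    """Hypothetical stand-in: a dict plays the role of the key array."""

    def __init__(self) -> None:
        self.key: Dict[Coord, Tuple[int, int]] = {}  # coord -> (offset, count)
        self.data: List[CodeElement] = []            # the "data array"

    def add(self, coord: Coord, elements: List[CodeElement]) -> None:
        # Elements encoded at one coordinate are stored contiguously.
        self.key[coord] = (len(self.data), len(elements))
        self.data.extend(elements)

    def at(self, coord: Coord) -> List[CodeElement]:
        # An unknown coordinate yields an empty list.
        offset, count = self.key.get(coord, (0, 0))
        return self.data[offset:offset + count]
```

Storing the pointers as indices into the data array keeps the whole structure a pair of flat arrays, which is convenient when the database has to be sent to other processors.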
The preprocessing steps, the HSC-generation and the analysis algorithms are currently integrated on a 9-node transputer system. In addition to their link structure, the processors are connected to a fast data bus that is used to distribute the images and the HSC database. The system is completed by a frame grabber and a display unit which are also connected to the bus (fig. 7). The parallelization of preprocessing steps like filtering operations for noise reduction and a Laplacian operator for edge detection is directly supported by the fast bus.
For this purpose, subimages, including the overlap required by the operations, are distributed via the bus; each transputer performs its operation and returns the altered subimage via the bus. The HSC generation is either implemented in parallel with a strategy similar to the one described or in hardware on a VLSI chip [4] which is currently being integrated in the transputer system.
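The distribution of subimages with overlap can be sketched as follows. This is a minimal illustration, not the implementation used on the transputer system: the split into horizontal strips, the 3-row mean filter and all function names are assumptions, and the strips are processed one after another where the real system would hand each one to a separate transputer.

```python
def split_with_overlap(height, parts, overlap):
    """Return (lo, hi, core_lo, core_hi) row ranges; lo/hi include the overlap."""
    base, extra = divmod(height, parts)
    ranges, row = [], 0
    for i in range(parts):
        size = base + (1 if i < extra else 0)
        core_lo, core_hi = row, row + size
        ranges.append((max(0, core_lo - overlap), min(height, core_hi + overlap),
                       core_lo, core_hi))
        row = core_hi
    return ranges


def box_filter_rows(rows):
    """3-row vertical mean filter; the first and last row are copied unchanged."""
    out = [list(r) for r in rows]
    for y in range(1, len(rows) - 1):
        out[y] = [(a + b + c) / 3 for a, b, c in zip(rows[y - 1], rows[y], rows[y + 1])]
    return out


def filter_in_strips(image, parts, overlap=1):
    """Filter each padded strip independently and gather only the core rows."""
    result = []
    for lo, hi, core_lo, core_hi in split_with_overlap(len(image), parts, overlap):
        filtered = box_filter_rows(image[lo:hi])      # work done by one worker
        result.extend(filtered[core_lo - lo:core_hi - lo])
    return result
```

With an overlap of one row on each side, every core row sees the same neighbors as in the full image, so the gathered result equals the result of filtering the whole image at once.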
While the image transformation itself is a very regular operation, the recognition process is irregular, due to its data dependency, and thereby highly asynchronous. For a flexible implementation of the complete processing chain of grabbing an image, its smoothing, transformation and analysis, a communication module is necessary that meets the demands of the heterogeneous and asynchronous environment. After a description of the recognition algorithms, the communication module based on a farmer-worker approach and the parallel implementation of the algorithms are presented.
HSC-Based Domain-Independent Operations
Based upon the HSC, we implemented a set of operations which enables us to localize and analyze code trees within the HSC-database. There are three basic operations: ROOT for finding root nodes, SEQU for building the sequence of leaves of a given code tree, and SHAPE for analyzing the shape of a sequence of code elements. Some further operations, like DNEIGHBOR and CONNECT, are already implemented to deal with neighboring structures. The whole set of operations enables a two-stage image analysis system with a domain independent preprocessing and HSC-generation on the first stage and a knowledge-based evaluation of scenes on the second one [18]. The algorithms of the three most fundamental operations ROOT, SEQU, and SHAPE are described below, since they are used to build so-called "Attributed Structure Types" (ASTs). These ASTs describe simple geometric primitives, like circles or polygons, in an image and are the basis for further object recognition. Afterwards, DNEIGHBOR is presented as an example for parallelism within a single code tree.
Searching for Root Nodes
ROOT is used to search for all root nodes of code trees of a given structure type t within a set of resolution and linking levels (k_min ≤ k ≤ k_max; n_min ≤ n ≤ n_max) as shown in figure 9.
Additionally, a window defined by a polygon of coordinates can be used as a region of interest to minimize work for searching structures within larger ones. Since structures are often encoded on several resolution levels, we call the code tree of a structure in the most detailed resolution its "real" code tree and the others its "virtual" code trees. To improve the later evaluation, ROOT results in an ordered list of root nodes, each of them followed by its virtual ones. When searching on different linking levels, higher levels are analyzed first in order to start recognition with the most dominant structures which build the largest code trees.
Fig. 9: The operation ROOT; the evaluation starts at levels f=k+n=8 with the best resolution

Considering a parallel analysis of the HSC, several strategies can be applied for searching root nodes in parallel within the data structure. Within a given set of levels k;n it is possible to analyse complete levels in parallel. Additionally, for a fine-grained solution every level can be split into sublevels, like images are split into subimages. Nevertheless, ROOT is only a small part of the whole analysis process as described in chapter 5, since every root node found needs a detailed analysis.
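The level-wise search can be sketched in Python as follows. This is an illustration under several assumptions: code elements are plain dictionaries, a root node is taken to be an element without a supercode pointer, levels are visited in descending order of f=k+n with the best resolution (smallest k) first, and a thread pool stands in for the transputers. The region-of-interest window and the grouping of virtual code trees are left out.

```python
from concurrent.futures import ThreadPoolExecutor


def roots_on_level(elements, t, k, n):
    """All root nodes of type t on one level (k, n)."""
    return [e for e in elements
            if e["t"] == t and e["k"] == k and e["n"] == n and e["supercode"] is None]


def root(elements, t, k_range, n_range):
    """Search every level in parallel, report dominant structures first."""
    levels = [(k, n)
              for k in range(k_range[0], k_range[1] + 1)
              for n in range(n_range[0], n_range[1] + 1)]
    # Assumed ordering: largest f = k + n first, then best resolution.
    levels.sort(key=lambda kn: (-(kn[0] + kn[1]), kn[0]))
    with ThreadPoolExecutor() as pool:
        per_level = pool.map(lambda kn: roots_on_level(elements, t, *kn), levels)
        return [e for hits in per_level for e in hits]
```

Because `pool.map` returns results in the order of its inputs, the ordering of the result list is independent of which level happens to finish first.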
Development of Code Element Sequences
The root node of a code tree only supplies us with a rough hint about the underlying structure due to the very generalized description of a code element's form and orientation on a high linking level. With the operation SEQU, a more detailed analysis becomes possible.
Starting at the root node of a code tree, SEQU descends top-down to the leaves of the tree at linking level n=0, follows the contour of the structure at this level and delivers a sorted list of its code elements. Thus, SEQU provides an inversion of the bottom-up linking process if the structure types "edge" and "line" are analyzed. In the case of regions, the surrounding edge contour is provided. The resolution level and the length of the sequence already give us some idea about the proportions of the structure [17] . Figure 10 illustrates this process.
The top-down descent described above is not limited to linking level n=0. The code elements can be gathered at every linking level. Of course, the information of sequences at higher levels is coarser than on the detector level, but it is sufficient for the shape analysis of large objects. This fact is used for cutting computational costs and for a better load balance in the net as described in chapter 5.
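The descent with a selectable stop level can be sketched as a short recursive function. The dictionary layout and the parameter name `stop_level` are hypothetical, and the sketch assumes that the subcode pointers of each node are already stored in contour order; the contour-following step itself is not reproduced.

```python
def sequ(node, stop_level=0):
    """Collect the code elements of a code tree at linking level `stop_level`.

    Assumes node["subcodes"] lists the children in contour order.
    """
    if node["n"] <= stop_level or not node["subcodes"]:
        return [node]
    sequence = []
    for child in node["subcodes"]:
        sequence.extend(sequ(child, stop_level))
    return sequence
```

Raising `stop_level` shortens the resulting sequence, which is the effect exploited for large objects: fewer elements have to be passed to the shape analysis.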
Shape Analysis of Code Element Sequences
To make knowledge-based object recognition possible, a [...] Also, the opposite direction of the structure sequence has to be considered. Tables 1 and 2 illustrate the process of comparing a given sequence of code elements with a symbolic description of a structure.
DNEIGHBOR
The chapters above described how so-called Attributed Structure Types (ASTs) are created.
In the process of object recognition, ASTs are used to describe complex objects. For this purpose, objects are modeled in semantic networks and are broken down to those ASTs by part/part-of relations [19]. Topological operations are necessary to ensure that recognized ASTs fit together and form an object. The operation DNEIGHBOR analyzes two root nodes for a direct neighborhood. DNEIGHBOR is described here to show another kind of parallel computation technique when analyzing hierarchical image databases. Not only can a set of root nodes be analyzed in parallel, as is done for the ASTs, but also every code tree of a root node. Furthermore, it shows that the mapping of image structures onto code trees by the hierarchical transformation allows the application of well-known tree algorithms to the recognition process.
The direct neighborhood of two root nodes can be broken down into an analysis of every pair of subcode trees. Thereby, a recursive algorithm is defined which stops when a pair of neighboring leaves is found or when all possible pairs are evaluated. Obviously, pairs of subcode trees can be analyzed independently. For an efficient parallel computation it is important to ensure that these subcode trees are [...] Compared with a straightforward algorithm that simply uses SEQU to find the contour elements of both structures and then compares pairs of them until a neighboring pair is found, there is one main advantage: the hierarchy can be used to prune the search tree effectively. Since it is known that the underlying code tree of a code element t_1 lies within a certain area, the coordinates of t_1 and t_2 can be used to compute whether a neighborhood of the leaves is possible or not. Let t_1 be a code element at coordinate |k_1;n_1;r_1;c_1> and t_2 be a code element at coordinate |k_2;n_2;r_2;c_2>. Let f_1 = k_1 + n_1, f_2 = k_2 + n_2, d = 2^(max(f_1, f_2)+1), r_diff = |r_1 - r_2|, and c_diff = |c_1 - c_2|. Then leaf nodes of the trees defined by t_1 and t_2 might be neighboring if (r_diff + 2*c_diff) ≤ d. Figure 12 illustrates this fact.
During the recursive procedure of comparing two code elements by comparing all pairs of subcode elements, this pruning step is executed several times. Experiments showed that it should be applied on every second linking level n≥2. At smaller linking levels the pruning step is too time-consuming compared to the further evaluation. A speed-up by a factor of 100 or more could be achieved for large structures in the sequential version. In the parallel version, the pairs of subtrees are generated on linking level n=5. These pairs are then sent to the workers for further evaluation. Trees with root nodes on linking levels n≤4 are still analyzed sequentially.
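The recursive comparison with pruning can be sketched as follows. The bound is the condition (r_diff + 2*c_diff) ≤ d with d = 2^(max(f_1, f_2)+1) given above; everything else is an assumption made for illustration: code elements are dictionaries, "every second linking level n≥2" is read as even levels from 2 upwards, and the adjacency test for two leaves is passed in as a hypothetical `leaves_adjacent` callback because it is not spelled out in the text.

```python
def bound_ok(a, b):
    """Pruning test: neighboring leaves are only possible if this holds."""
    d = 2 ** (max(a["k"] + a["n"], b["k"] + b["n"]) + 1)
    return abs(a["r"] - b["r"]) + 2 * abs(a["c"] - b["c"]) <= d


def dneighbor(a, b, leaves_adjacent):
    """True if any leaf below `a` is adjacent to any leaf below `b`."""
    level = max(a["n"], b["n"])
    # Assumed reading: prune only on even linking levels >= 2.
    if level >= 2 and level % 2 == 0 and not bound_ok(a, b):
        return False
    if not a["subcodes"] and not b["subcodes"]:
        return leaves_adjacent(a, b)
    subs_a = a["subcodes"] or [a]
    subs_b = b["subcodes"] or [b]
    # any() stops at the first neighboring pair.
    return any(dneighbor(x, y, leaves_adjacent) for x in subs_a for y in subs_b)
```

Each recursive call replaces at least one of the two elements by its subcode elements, so the recursion ends at the leaves of both trees.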
Analogous to the parallel computation of the ASTs as described in the following chapters, a set of processes is set up at the farmer to handle the generation of subtasks and the evaluation of the subresults. In this case the subresults are boolean values answering the question whether a pair of subtrees is neighboring or not. Thereby, the whole task can be finished when either one pair of neighboring structures is found or all pairs of substructures are evaluated and none of them is neighboring.
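The termination rule for the boolean subresults can be illustrated with a small Python sketch. A thread pool stands in for the farm, and the function name `any_pair_neighboring` and the `check` callback are hypothetical; the sketch only shows the control flow on the farmer side.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def any_pair_neighboring(pairs, check, workers=4):
    """Evaluate check(a, b) for all pairs; finish as soon as one is True."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        futures = [pool.submit(check, a, b) for a, b in pairs]
        for future in as_completed(futures):
            if future.result():
                # One neighboring pair is enough: drop tasks not yet started.
                for other in futures:
                    other.cancel()
                return True
    # All pairs were evaluated and none of them is neighboring.
    return False
```

Tasks that are already running cannot be cancelled this way, so an early positive result saves only the work that has not been started yet.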
The Communication Module
For flexible communication within the system, a farmer-worker communication module has been implemented. This communication module was designed to support both "normal" worker processors and specialized workers like frame grabbers or display units, which are necessary for an image processing system. A further design goal was that low-level and high-level operators should be easy to implement in the system.
As mentioned before, the transputer is equipped with four bidirectional links. For the communication between the farmer processor and all of its workers, especially between those processors which are not directly connected, virtual links are established at program start by using routing processes. An internal channel connects the farmer process and the communication module. Incoming tasks are buffered and distributed to the workers, and each worker has its own buffer for incoming tasks to minimize delays. The buffer sizes were determined empirically and are set to a size of four tasks for the worker's task buffer.
Although buffers increase the program's complexity, they increase the farm's performance by decreasing communication costs and synchronization delays. A detailed analysis of communication in transputer farms has been done by Tregidgo [25]. Nevertheless, buffers should not be oversized, otherwise the load balance deteriorates in such a farmer-controlled net when tasks are unequally time-consuming. For our image processing environment, two kinds of worker processors are distinguished. First, there are the "normal" workers which are identical as far as the architecture and the program are concerned. The second group consists of specialized workers like the frame grabber or the display unit which run different programs. Tasks can only be sent to the second group by directly addressing them in the task header, whereas workers of the first group can be accessed by either addressing them directly or by using a DON'T CARE address. In this case, the least loaded worker is determined by the farmer and will receive the job. Thereby, the communication module supports both direct access to special processors like the frame grabber or our HSC-processor and a black-box view of the set of worker processors. Figure 13 illustrates the different processes in the farm.
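The two addressing modes can be sketched in Python. This is a simplified illustration and not the module's actual code: the class name `Dispatcher`, the `DONT_CARE` marker and the worker names are hypothetical, a lock stands in for the synchronization on the transputer, and the buffer size of four tasks is taken from the text above.

```python
import threading

DONT_CARE = None  # hypothetical marker for "any normal worker"


class Dispatcher:
    def __init__(self, normal, special, buffer_size=4):
        self.buffer_size = buffer_size
        self.normal = list(normal)  # only these are eligible for DON'T CARE
        self.jobs = {name: 0 for name in list(normal) + list(special)}
        self.lock = threading.Lock()

    def assign(self, address=DONT_CARE):
        """Return the worker that receives the task, or None if its buffer is full."""
        with self.lock:
            if address is DONT_CARE:
                if not self.normal:
                    return None
                # Least loaded normal worker; ties go to the first one listed.
                target = min(self.normal, key=lambda w: self.jobs[w])
            else:
                target = address  # direct addressing, required for special workers
            if self.jobs[target] >= self.buffer_size:
                return None
            self.jobs[target] += 1
            return target

    def result_received(self, worker):
        """A result came back, so one slot of the worker's buffer is free again."""
        with self.lock:
            self.jobs[worker] -= 1
```

For example, `Dispatcher(["w1", "w2"], ["grabber"]).assign()` picks `"w1"`, while the frame grabber is only ever chosen through `assign("grabber")`.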
Two data structures are shared among the farmer's processes. The task buffer, a FIFO buffer, is used to store the tasks which are produced by the farmer process. It is filled by the task manager, while the task distributor empties it. The worker pool is an array with important runtime information about the different workers, such as the current number of jobs stored in each worker. For every possible number, which is obviously less than the buffer size of the worker, there is a list which holds the related workers. The task distributor chooses the worker with the lowest number of jobs, increments its number and sends the job via a virtual link to the processor. The worker is then moved into its new state list. After receiving a result, the result manager decreases the number of jobs and also adjusts the state lists. All accesses to these data structures are controlled by semaphores. The worker's program is less complex.
Incoming tasks are buffered in the task buffer by the task manager and forwarded to the worker process by the task distributor. The communication library provides virtual links [...]

Obviously, every root node in the HSC can be analyzed independently. Its processing is therefore a well-defined task in a farmer-worker environment. Several processes are set up at the farmer for handling the distribution of the tasks and the incoming results. To achieve a good load balance, a reasonable number of tasks of almost equal time cost is desirable at every moment during the computation. Therefore, as soon as root nodes are delivered from one of the workers, each single root node is used to build a SEQU-SHAPE task which is then handled by a worker using the sequential operations as described in chapter 3. Figure 14 illustrates the different processes on the farmer and the dataflow between them.
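The two-stage task flow, in which delivered root nodes are turned into SEQU-SHAPE tasks while other tasks are still running, can be sketched as follows. A thread pool stands in for the farm, and `run_farm`, `find_roots` and `sequ_shape` are hypothetical names for the farmer loop and the two kinds of worker tasks.

```python
from concurrent.futures import FIRST_COMPLETED, ThreadPoolExecutor, wait


def run_farm(root_tasks, find_roots, sequ_shape, workers=8):
    """Run ROOT tasks and spawn one SEQU-SHAPE task per delivered root node."""
    results = []
    with ThreadPoolExecutor(max_workers=workers) as pool:
        pending = {pool.submit(find_roots, task): "ROOT" for task in root_tasks}
        while pending:
            done, _ = wait(pending, return_when=FIRST_COMPLETED)
            for future in done:
                kind = pending.pop(future)
                if kind == "ROOT":
                    # Each root node becomes its own task as soon as it arrives.
                    for node in future.result():
                        pending[pool.submit(sequ_shape, node)] = "SEQU-SHAPE"
                else:
                    results.append(future.result())
    return results
```

Results arrive in completion order, so a caller that needs a fixed order has to sort them afterwards.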
The main task is split into several ROOT-tasks which are routed via a multiplexer to the 
Experimental Results
The operations were tested in the fields of workpiece recognition and traffic sign detection.
Among the test pictures, there was a set of 300 traffic scenes. Experiments showed that a robust recognition of traffic signs is possible up to a distance of 30 meters using our equipment. Detailed recognition results can be found in [2]. After parallelizing the algorithms [...]

When analyzing the runtime, some points have to be considered:
1. As expected, the best improvements by parallelizing the recognition process could be realized for the most complex images with more than 200 structures to be analyzed, while a few quite simply structured images (although they were "real" traffic scenes) result in efficiency factors close to 0.6. Nevertheless, on average a good speed-up could be achieved on our test set.
2. Although some effort has been put into equalizing the complexity of the SEQU-SHAPE tasks, they differ in runtime from approximately 3 ms to 100 ms. The short times are achieved when a sequence's shape and the shape models differ extremely. In this case, the shape recognition process stops very quickly. Thereby, it happens that large code trees, which are analyzed first, are of smaller computational cost than smaller code trees, which are analyzed later. Obviously, the overall runtime is improved very much, though efficiency decreases.
3. As mentioned in chapter 5, a commercially available communication library was used which supports virtual links and the bus architecture. Its general-purpose design makes it too time-consuming. A new implementation specialized for trees will result in an improved communication behavior and thereby decrease the overall runtime.
4. Since the farmer process does not need any access to the original image or the HSC database, it does not have to be connected to the bus. Thus, a low-cost transputer is sufficient to run the farmer program. Thereby, hardware costs are determined by the bus-connected transputer modules. Their number is used for the efficiency data in column "Efficiency 2".
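The efficiency factors discussed above follow the usual definitions: speed-up is the sequential runtime divided by the parallel runtime, and efficiency is the speed-up divided by the number of processors counted. The sketch below assumes that "Efficiency 2" uses the same formula with only the bus-connected modules counted; the runtimes in the usage note are hypothetical values chosen for illustration and are not measurements from this work.

```python
def speedup(t_sequential_ms, t_parallel_ms):
    """Sequential runtime divided by parallel runtime."""
    return t_sequential_ms / t_parallel_ms


def efficiency(t_sequential_ms, t_parallel_ms, processors):
    """Speed-up per processor; 1.0 would be ideal linear scaling."""
    return speedup(t_sequential_ms, t_parallel_ms) / processors
```

With hypothetical runtimes of 4000 ms sequentially and 800 ms in parallel, `efficiency(4000, 800, 9)` counts all nine nodes, while `efficiency(4000, 800, 8)` counts only the eight bus-connected transputers and is therefore the larger of the two values.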
An alternative approach for the parallel generation of the HSC and its analysis based on subimages distributed on a net of transputers was investigated by Priese and Schwolle [23] .
In contrast to our algorithms, they use distributed subimages and HSC databases in all steps of processing, while our approach uses redundant image data and HSC databases by broadcasting the data via the fast data bus to all workers for evaluation. Their algorithms turned out to support larger transputer nets with up to 100 processors, but with less efficiency for small nets.
Our experiments show that not only the iconic level, but also high-level image analysis can be accelerated considerably by using parallel techniques. The runtime of less than one second for the analysis was made possible by the hierarchical image transformation which was used.
The unique mapping of edge structures to code trees allows the shape analysis of edge sequences on linking levels on which they are much shorter than on detector level, which is usually used by other techniques. Additionally, it supports an effective parallel recognition process. Analysis in video real-time becomes achievable with the next generation transputer image processing system based on the PowerPC for computation and, for communication, the T805 or, once it is available, the T9000. In addition to increased link rates, the T9000's hardware supports on-chip routing for increased communication abilities. A speed-up by a factor of 10 for calculation and communication compared to the current system is realistic. Thus, the analysis of approximately twenty images per second can be performed.
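The estimate of roughly twenty images per second can be checked with a short calculation. It assumes the analysis runtime of just under 500 ms per image that is reported in the conclusion and the expected speed-up factor of 10; the function name is hypothetical.

```python
def images_per_second(runtime_ms, speedup_factor):
    """Throughput if one image at a time is analyzed after the speed-up."""
    return 1000 / (runtime_ms / speedup_factor)
```

Here `images_per_second(500, 10)` gives 20.0: a runtime of 500 ms shrinks to 50 ms per image, which corresponds to twenty images per second.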
Conclusion
First, a general-purpose communication module for image processing tasks supporting different kinds of processors in the system was described, and advantages of hierarchical image transformations for parallel analysis were shown. Methods for parallel object recognition in hierarchical image databases were then presented. Three main advantages of the transformation were discussed:
1. The HSC provides a good separation of important structures from the background.
2. The image structures can be easily accessed for further parallel evaluation.
3. The mapping of structures onto code trees allows the use of well-known tree algorithms for the analysis of structures.
The algorithms were implemented and tested on a 9-node transputer image processing system with T805 processors. Functionality from grabbing an image, low-level filtering and transformation to the described hierarchical data structure as well as high-level symbolic recognition is integrated. The robustness of the algorithms was tested on a set of 300 traffic scenes. The runtime for the analysis of traffic scenes was measured to be less than 500 ms in almost all cases. Future work will deal with accelerating the HSC-operations as well as the communication processes. A big step towards video real-time recognition can be taken by using the next generation system based on the PowerPC and the T9000. A speed-up by a factor of 10 for calculation and communication compared to the T805 is realistic. Thereby, video real-time analysis in natural scenes becomes achievable.
