We are interested in running cellular automata in parallel. We present an algorithm that dynamically remaps cells in order to balance the load between the processing nodes. The parallel application runs on a cluster of PCs connected by Fast Ethernet.
INTRODUCTION
We are interested in running cellular automata in parallel. We present an algorithm that dynamically remaps cells in order to balance the load between the processing nodes. The parallel application runs on a cluster of PCs (Windows NT) connected by Fast Ethernet (100 Mbits/sec). A general cellular automaton [1, 2] can be described as a set of cells where each cell is a state machine. To compute its next state, each cell needs some information from its neighbouring cells. There are no limitations on the kind of information exchanged nor on the computation itself. Only the automaton topology, which defines the neighbours of each cell, remains unchanged during the automaton's life.
Let us describe a simple solution for the parallel execution of a cellular automaton. The cells are distributed over several threads running on different computers. Each thread is responsible for running several automaton cells. Every thread successively applies a three-step algorithm to all its cells: (1) send neighbouring information, (2) receive neighbouring information, (3) compute the next cell state. If communications are based on synchronous message passing, the whole system is synchronized at exchange time because of the neighbourhood dependencies. Because communications, computations and the multiple synchronizations are executed serially, some processors remain partly idle and the speedup does not scale with the number of processors.
Improved performance can be obtained by running communications asynchronously. One can then overlap data exchange with computation. Neighbouring information is received during the computation of the previous step and sent during the computation of the next step. This solution improves performance, but still does not achieve a linear speedup. As in the skeletonization problem discussed in section 2, the computation load may be highly data dependent and may vary considerably from cell to cell. Furthermore, the parallel application may run on heterogeneous processors, inducing a severe load balancing problem. Due to the neighbouring dependencies, cells consuming more computation time slow down the whole system. To reach an optimal solution we need a flexible load balancing scheme.
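The three-step algorithm above can be illustrated with a minimal sequential sketch. It is not the authors' implementation: the function name `run_synchronous`, the dictionary-based topology and the example rule `spread` are assumptions chosen for illustration, and the exchange is simulated in a single address space rather than over a network.

```python
def run_synchronous(states, neighbours, next_state, steps):
    """Run `steps` synchronous rounds of a cellular automaton.

    states:     dict mapping cell id -> current state
    neighbours: dict mapping cell id -> list of neighbour ids (fixed topology)
    next_state: function(own_state, neighbour_states) -> new state
    """
    for _ in range(steps):
        # Steps (1) and (2): each cell sends its state and receives
        # the states of its neighbours.
        inbox = {c: [states[n] for n in neighbours[c]] for c in states}
        # Step (3): each cell computes its next state.
        states = {c: next_state(states[c], inbox[c]) for c in states}
    return states


# Hypothetical example: a ring of four cells in which a cell becomes 1
# as soon as it or one of its neighbours is 1.
ring = {0: [3, 1], 1: [0, 2], 2: [1, 3], 3: [2, 0]}
spread = lambda own, received: max([own] + received)
after_one = run_synchronous({0: 1, 1: 0, 2: 0, 3: 0}, ring, spread, 1)
```

Because every round waits for the complete exchange before any cell is computed, the sketch also shows why the synchronous scheme serializes communication and computation.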
One solution is to allow each cell to be dynamically remapped during program execution. One or more cells may be moved from overloaded threads to partly idle threads. Once the computation of the cell to be remapped has terminated, cell remapping requires three steps: (1) notify every thread about the decision to remap a given cell, (2) wait for acknowledgement from all threads and (3) remap the cell.
Step (2) ensures that the neighbourhood information for the remapped cell is redirected towards the target thread. In the applications we consider, the overhead for remapping a cell is insignificant compared with the computation time. Section 4 presents a cell remapping strategy for load balancing.
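The three remapping steps can be sketched as follows. This is a simplified, single-process illustration under stated assumptions: the class `Thread`, its `routing` table and the function `remap_cell` are hypothetical names, and acknowledgements are modelled as return values instead of network messages.

```python
class Thread:
    """Hypothetical worker thread holding cells and a routing table."""

    def __init__(self, name):
        self.name = name
        self.cells = {}    # cell id -> cell data owned by this thread
        self.routing = {}  # cell id -> name of the thread owning the cell

    def notify_remap(self, cell_id, target):
        # From now on, neighbourhood information for cell_id goes to target.
        self.routing[cell_id] = target
        return True  # acknowledgement


def remap_cell(threads, cell_id, source, target):
    # (1) Notify every thread about the decision to remap the cell.
    acks = [t.notify_remap(cell_id, target) for t in threads.values()]
    # (2) Wait for the acknowledgement of all threads.
    if not all(acks):
        raise RuntimeError("remapping of %r was not acknowledged" % cell_id)
    # (3) Remap the cell: move its data to the target thread.
    threads[target].cells[cell_id] = threads[source].cells.pop(cell_id)


threads = {name: Thread(name) for name in ("T0", "T1", "T2")}
threads["T0"].cells["c0"] = {"step": 3}
remap_cell(threads, "c0", "T0", "T2")
```

Moving the cell only after all acknowledgements have arrived mirrors the role of step (2): no thread still addresses the old owner when the cell data leaves it.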
As a typical example of a cellular automaton, we consider the image skeletonization problem [3, 4]. Skeletonization requires spatial filtering to be applied repeatedly to the image. Each step erodes a thin part of the original image. After the last step, only the image skeleton remains. Skeletonization algorithms require vast amounts of computing power, especially when applied to large images. Therefore, skeletonization applications can potentially benefit from parallel processing.
To parallelize image skeletonization, we divide the original image into tiles. These tiles are distributed across several threads. Each thread successively applies the skeletonization algorithm to all its tiles. Threads are mapped onto several processors according to a configuration file. Tiles cannot be processed independently of their neighbouring tiles. Before each computation step, neighbouring tiles need to exchange their borders. In addition, each computation step depends on the preceding step. Section 2 presents the image skeletonization algorithm. Section 3 develops a parallelization scheme. Section 4 shows how to load balance the application by cell remapping. The performance analysis is presented in section 5.
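The tiling step can be sketched as below. The function names are hypothetical, and treating the eight surrounding tiles (including diagonals) as neighbours is an assumption, not something the text states. The 128-pixel tile size is the one used in the performance measurements.

```python
def split_into_tiles(height, width, tile_size):
    """Return a dict (row, col) -> (y0, x0, y1, x1) of tile bounds."""
    tiles = {}
    for row, y in enumerate(range(0, height, tile_size)):
        for col, x in enumerate(range(0, width, tile_size)):
            tiles[(row, col)] = (y, x,
                                 min(y + tile_size, height),
                                 min(x + tile_size, width))
    return tiles


def neighbouring_tiles(tiles, row, col):
    """Tiles with which tile (row, col) exchanges borders.

    Assumption: the eight surrounding tiles, diagonals included.
    """
    return [(row + dr, col + dc)
            for dr in (-1, 0, 1) for dc in (-1, 0, 1)
            if (dr, dc) != (0, 0) and (row + dr, col + dc) in tiles]


tiles = split_into_tiles(2048, 2048, 128)  # 16 x 16 tiles
```

A corner tile has three neighbours under this assumption and an interior tile has eight, which gives a rough idea of how many border exchanges precede each computation step.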
IMAGE SKELETONIZATION ALGORITHM
Image skeletonization consists of extracting the skeleton from a black and white input image. The algorithm repeatedly erodes the image until only the skeleton remains. The erosion is performed by applying a 5x5 thinning filter to the whole image. The thinning filter is applied repeatedly, thinning the input image pixel by pixel. The algorithm ends once the thinning process leaves the image unchanged. Figure 1 shows a skeletonized image. Several skeletonization algorithms exist; we describe one that provides excellent results [4]. Let TR(P1) be the number of white to black (0 → 1) transitions in the ordered set of pixels P2, P3, P4, ..., P9, P2 describing the neighbourhood of pixel P1 (Fig. 2). Let NZ(P1) be the number of black neighbours of P1 (black = 1).
P1 is deleted, i.e. set to background (white = 0) if:
and and
Figure 1. Original image and skeletonized image.
The process is repeated as long as changes occur. This algorithm is highly data dependent. One thinning filter step modifies only small parts of the input image and leaves the major part unchanged. In the next section we take advantage of this fact to improve the algorithm. 
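The two neighbourhood measures TR(P1) and NZ(P1) defined in this section can be computed as in the sketch below. The deletion conditions themselves are not reproduced here. The layout of P2 to P9 (clockwise, starting with the pixel above P1) is an assumption, and so is treating pixels outside the image as white.

```python
# Offsets (dy, dx) of P2..P9 around P1.
# Assumption: clockwise order starting with the pixel above P1.
OFFSETS = [(-1, 0), (-1, 1), (0, 1), (1, 1),
           (1, 0), (1, -1), (0, -1), (-1, -1)]


def neighbourhood(image, y, x):
    """Values of P2..P9; pixels outside the image count as white (0)."""
    h, w = len(image), len(image[0])
    return [image[y + dy][x + dx]
            if 0 <= y + dy < h and 0 <= x + dx < w else 0
            for dy, dx in OFFSETS]


def tr(image, y, x):
    """Number of 0 -> 1 transitions in the sequence P2, P3, ..., P9, P2."""
    p = neighbourhood(image, y, x)
    cycle = p + p[:1]
    return sum(1 for a, b in zip(cycle, cycle[1:]) if a == 0 and b == 1)


def nz(image, y, x):
    """Number of black (1) neighbours of P1."""
    return sum(neighbourhood(image, y, x))


# A vertical line, three pixels long.
line = [[0, 1, 0],
        [0, 1, 0],
        [0, 1, 0]]
```

For the middle pixel of the line both measures equal 2, whereas an end pixel has a single black neighbour and a single transition.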
In the parallel algorithm, the overhead for exchanging information between neighbouring cells increases, since communication and synchronization are needed between the processing nodes responsible for adjacent cells. The parallel program therefore requires larger cell sizes.
To develop the parallel application, we use the Computer-Aided Parallelization (CAP) framework, which manages the neighbourhood dependencies and the data flow synchronization. The CAP framework [5, 6] is especially well suited for the parallelization of applications having significant communication and I/O bandwidth requirements. Application programmers specify at a high level of abstraction the set of threads present in the application, the processing operations offered by these threads, and the flow of data and parameters between operations. Such a specification is precompiled into a C++ source program which can be compiled and run on a cluster of distributed memory PCs. A configuration file specifies the mapping between CAP threads and operating system processes, possibly located on different PCs.
The compiled application is executed in a completely asynchronous manner: each thread has a queue of input tokens (serializable C++ data structures) containing the operation execution requests and their parameters. Network I/O operations are executed asynchronously, i.e. while data is being transferred to or from the network, other operations can be executed concurrently by the corresponding processing node. If the application is compute bound, CAP hides the time taken by the network communications in a pipeline of network communication and processing operations. After initialization of the pipeline, only the processing time, i.e. the cell state computation, determines the overall processing time. Figure 4 shows a schematic view of the schedule. The CAP tool allows the programmer to specify this schedule by appropriate high-level language constructs [6].
When the program starts, all the cells are at step zero. Each time the thinning filter is applied to a cell, the cell step is incremented by one.
Because of the neighbouring dependencies, the difference between the step of a given cell and the step of any of its neighbours is at most one. Therefore, during program execution, some cells are waiting for their neighbours to perform the computation of the next step. If all the cells of a processing node are waiting, the processor becomes idle, reducing the overall performance. To avoid such a situation as much as possible, the parallel algorithm is improved by computing first the cell with the smallest step value on each processing node.
While some cells are sending their neighbouring information or waiting for the reception of neighbouring information, other cells could potentially keep the processor busy. This argument fails if there is just one cell per processing node or if the cellular automaton topology implies that every cell depends on all other cells. In order to run computations in parallel with communications, one may partially compute a cell without knowing the neighbouring information. Cell computation may start while receiving the neighbouring information from other cells [7].
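The smallest-step-first policy can be sketched as below. The readiness rule used here (a cell may advance only when none of its neighbours is behind it) is an assumption derived from the statement that neighbouring steps differ by at most one. The function name `pick_next_cell` is hypothetical.

```python
def pick_next_cell(local_cells, step, neighbours):
    """Choose the cell a processing node should compute next.

    local_cells: cell ids mapped onto this processing node
    step:        dict cell id -> current step value (all cells)
    neighbours:  dict cell id -> list of neighbour cell ids
    Returns None when every local cell is waiting (the node is idle).
    """
    # Assumed readiness rule: no neighbour may lag behind the cell.
    ready = [c for c in local_cells
             if all(step[n] >= step[c] for n in neighbours[c])]
    if not ready:
        return None
    # Compute first the ready cell with the smallest step value.
    return min(ready, key=lambda c: step[c])


# Four cells in a line: a - b - c - d.
step = {"a": 2, "b": 1, "c": 1, "d": 1}
neighbours = {"a": ["b"], "b": ["a", "c"], "c": ["b", "d"], "d": ["c"]}
```

With these values a node holding cells `a` and `b` computes `b`, while a node holding only `a` has nothing to do until `b` catches up.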
If the total computation load is evenly distributed over the processing nodes, the parallel algorithm can potentially keep all the processors busy. However, in the case of the skeletonization algorithm, the computation time is highly data dependent.
To keep all processors busy, we need to balance dynamically the computation load.
DYNAMIC LOAD BALANCED PARALLEL SCHEME
For load balancing, we need to remap the cells during program execution. In order to migrate a cell from one address space to another, we need to maintain the load (or the inverse of the load, i.e. an idle factor) for each processing node. A simple way of computing the load is presented here. Let A be the average step value of all cells:
$$A = \frac{1}{\text{NumberOfCells}} \sum_{\text{AllCells}} \text{CellStepValue}$$
For each processing node, we compute an IdleFactor by adding the signed differences between the step values of the cells of the processing node and the average step value A. When the processor is idle, the IdleFactor is set to MaxIdleFactorValue minus the number of cells in the processing node†.
$$\text{IdleFactor} = \begin{cases} \sum_{\text{NodeCells}} (\text{CellStepValue} - A), & \text{if the processor is not idle} \\ \text{MaxIdleFactorValue} - \text{NumberOfCells}, & \text{if the processor is idle} \end{cases}$$
A negative IdleFactor indicates that the cell step values of the corresponding processing node are behind those of the other processing nodes. A processing node with a strongly negative IdleFactor is overloaded and slows down the processing nodes that run its neighbouring cells. A positive IdleFactor indicates that the corresponding processing node is ahead of the others. The processor of such a processing node may soon become idle, since the neighbouring dependencies with the cells of other processing nodes will put it in a wait state. To balance the load, a cell from the processing node having the most negative IdleFactor should be remapped to the processing node having the largest positive IdleFactor. The IdleFactor is evaluated periodically, every time a new cell migration is performed. Between two cell migrations, a specific IntegrationTime allows the system to take advantage of the previous cell migration.
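The average step value A, the IdleFactor and the choice of the two processing nodes involved in a migration can be sketched as follows. The value chosen for `MAX_IDLE_FACTOR_VALUE` is a hypothetical placeholder, and the function names are illustrative assumptions.

```python
MAX_IDLE_FACTOR_VALUE = 1000  # hypothetical value, not given in the text


def average_step(nodes):
    """Average step value A over the cells of all processing nodes."""
    steps = [s for cell_steps in nodes.values() for s in cell_steps]
    return sum(steps) / len(steps)


def idle_factor(cell_steps, average, is_idle):
    """IdleFactor of one processing node."""
    if is_idle:
        return MAX_IDLE_FACTOR_VALUE - len(cell_steps)
    return sum(s - average for s in cell_steps)


def choose_migration(factors):
    """Return (emitting node, receiving node) for the next migration."""
    source = min(factors, key=factors.get)  # most negative IdleFactor
    target = max(factors, key=factors.get)  # largest IdleFactor
    return source, target


# Step values of the cells held by three processing nodes.
nodes = {"P0": [2, 2], "P1": [4, 4], "P2": [6, 6]}
A = average_step(nodes)
factors = {name: idle_factor(steps, A, False)
           for name, steps in nodes.items()}
```

In this example A is 4, node P0 lags with an IdleFactor of -4 and node P2 leads with +4, so a cell would be moved from P0 to P2.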
In order to compute the IdleFactor, one thread, called MigrationThread, is added to each processing node. Periodically, a new token is generated and traverses the MigrationThreads of all processing nodes. The token is generated in processing node P0, then visits all processing nodes in the order P1, P2, ..., PN and returns to P0. The migration token makes three full traversals in order to allow the parallel system to decide which cell to remap. During the first traversal, the migration token collects the number of living cells of each processing node and the sum of their step values. This information is distributed to all the processing nodes during the second traversal. During this same traversal, every node computes its IdleFactor. The IdleFactors are collected by the migration token and distributed to all processing nodes during the third and last traversal. Then every node decides in a distributed manner which processing nodes are involved in the migration.
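The three traversals of the migration token can be simulated in a single process as sketched below. This is an illustration under assumptions: the token is modelled as a dictionary, `run_migration_token` and `max_idle_factor_value` are hypothetical names, and at least one living cell is assumed to exist.

```python
def run_migration_token(nodes, idle, max_idle_factor_value=1000):
    """Simulate the three token traversals.

    nodes: dict node name -> step values of its living cells,
           listed in ring order P0, P1, ..., PN
    idle:  dict node name -> True if the processor is idle
    Returns, per node, the (emitting, receiving) pair it decides on.
    """
    ring = list(nodes)
    token = {"cells": 0, "step_sum": 0, "idle_factor": {}}

    # Traversal 1: collect the cell counts and the sums of step values.
    for name in ring:
        token["cells"] += len(nodes[name])
        token["step_sum"] += sum(nodes[name])

    # Traversal 2: the totals reach every node; each node derives the
    # average and its own IdleFactor, which the token collects.
    average = token["step_sum"] / token["cells"]
    for name in ring:
        if idle[name]:
            factor = max_idle_factor_value - len(nodes[name])
        else:
            factor = sum(s - average for s in nodes[name])
        token["idle_factor"][name] = factor

    # Traversal 3: every node receives all IdleFactors and takes the
    # same decision locally.
    factors = token["idle_factor"]
    return {name: (min(factors, key=factors.get),
                   max(factors, key=factors.get))
            for name in ring}


decisions = run_migration_token(
    {"P0": [1, 1, 1], "P1": [3, 3, 3], "P2": [2, 2]},
    {"P0": False, "P1": False, "P2": False})
```

Since every node evaluates the same collected values, all nodes reach the same decision without a central coordinator.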
† A processing node having no cell to process should have a higher IdleFactor than processing nodes with cells waiting for neighbouring information.
The processing node from which the migration starts migrates the cell with the smallest step value. To perform the migration, the IOThread broadcasts the destination of the migrating cell to every processing node and waits for acknowledgment. Once the IOThread has received acknowledgments from every processing node, no further information for the migrating cell will be received on the current processing node. The IOThread then sends the cell data and all the previously received neighbouring information to the destination processing node. The migration is done. The time period between two migration cycles is set by the IntegrationTime parameter. If the IntegrationTime is too short, the processing nodes waste time performing useless cell migrations. In the worst case, an IntegrationTime that is too short results in migrating all the cells of a processing node, leaving it without any cell. If the IntegrationTime is too long, the processors may become idle before receiving a migrated cell.
Experiments show that it is difficult to find a good IntegrationTime. In order to improve the cell remapping strategy, let us introduce the notion of stability. A processing node is stable if the difference between the CellStep values within the processing node is at most one:
$$\text{Processing node is stable} \iff \max(\text{CellStepValue}) - \min(\text{CellStepValue}) \le 1$$
A processing node is unstable if it is not stable. Since the ComputeNextCellStep function processes first the cells with the smallest CellStep value, the stable state is a permanent state if no cell migration occurs. Without cell migration, each unstable processing node will sooner or later reach the stable state. The processing nodes that emit and receive the migrated cell are determined by the IdleFactor. In order to improve the cell migration strategy, we add two migration rules that prevent the migration of cells in some special cases. First, we do not migrate the cell if the receiving processing node is in an unstable state. This rule avoids carrying out consecutive migrations to the same receiving processing node. Second, we do not migrate if the migrated cell would leave the receiving processing node in a stable state. This rule avoids migration if the receiving processing node is not significantly ahead of the emitting processing node. These two rules are not applied if the receiving processing node is detected to be idle. The stability information is exchanged in the same way as the IdleFactor.
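The stability test and the two migration rules can be sketched as below. The function names are hypothetical, and treating a processing node without cells as stable is an assumption (such a node is idle, so the rules do not apply to it anyway).

```python
def is_stable(cell_steps):
    """A node is stable if its CellStep values differ by at most one.

    Assumption: a node without cells is considered stable.
    """
    return not cell_steps or max(cell_steps) - min(cell_steps) <= 1


def should_migrate(receiver_steps, migrated_step, receiver_idle):
    """Apply the two migration rules to a proposed migration."""
    if receiver_idle:
        return True   # the rules are not applied to an idle receiver
    if not is_stable(receiver_steps):
        return False  # rule 1: the receiver is in an unstable state
    if is_stable(receiver_steps + [migrated_step]):
        return False  # rule 2: the receiver would remain stable
    return True


# The receiver is stable at step 5; the candidate cell is at step 2.
far_behind = should_migrate([5, 5], 2, False)
```

Under these rules a migration takes place only when the receiving node is clearly ahead of the migrated cell, or when it has nothing left to compute.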
PERFORMANCE MEASUREMENT
The performance measurements were carried out on three input images: a balanced input image, a highly unbalanced input image and a slightly unbalanced input image. The balanced input image (Fig. 5) consists of a repetitive pattern ensuring an evenly distributed computation load. In the highly unbalanced input image (Fig. 6), according to the cell distribution (eqn. 2), the non-empty cells are distributed unevenly across the processing nodes. Every second processing node receives empty cells, which require only one computation step. The slightly unbalanced input image (Fig. 7) is an intermediate case between the balanced and the highly unbalanced input images. The balanced and highly unbalanced input images are of size 2048x2048 pixels (8 bits/pixel) and are split into 16x16 cells of 128x128 pixels each. The slightly unbalanced input image is of size 1024x1536 pixels (8 bits/pixel) and is split into 8x12 cells of 128x128 pixels each.
In the case of the balanced input image, there is no significant performance difference between the two algorithms (with and without cell migration). For such an input image, cell migration is useless since the load is perfectly balanced between the processing nodes. The results show that the overhead induced by the management of the cell migration is low. The parallelization does not provide a linear speedup because the neighbouring information exchange consumes processing resources (CPU power for the TCP/IP communication protocol).
In the case of the highly unbalanced input image, performance is considerably improved by dynamic load balancing. Without cell migration, the efficiency (speedup/N) is approximately 50%, since every second processor becomes idle. With cell migration, the parallel program reaches approximately the same speedup with a balanced or an unbalanced input image. In the case of the slightly unbalanced input image, performance is also improved by the dynamically load balanced algorithm. Since in the input image, about one out of four cells is empty, the theoretically maximal performance improve-
CONCLUSIONS
We are interested in the parallelization of cellular automata. Our experiment is based on a particular image skeletonization method. We have developed a parallelization algorithm which can easily be applied to other cellular automata. We explore two parallelization methods: one with a static load distribution, which splits the cells over several processing nodes, and one with a dynamic load balancing scheme capable of remapping cells during program execution. Performance measurements show that cell migration does not reduce the speedup if the application is already load balanced, and that it improves the performance if the parallel application is not well balanced.
Cellular automata have a wide range of applications: matrix computation, state machines, Von Neumann automata, etc. Many problems can be expressed as cellular automata. Developing a custom parallel application from scratch requires a large effort. This paper shows the possibility of first developing a generic parallel cellular automaton and then, on top of it, parallel applications making use of the cellular automaton program interface. This approach reduces the programming effort without losing efficiency.
ACKNOWLEDGEMENT
We thank Gilles Ritter for having written the first version of a simple parallel skeletonization algorithm. This research is supported by the Swiss SPP-ICS research program, grant 5003-51332. 
