The GCA (Global Cellular Automata) 
Introduction
The GCA (Global Cellular Automata) model [8] is an extension of the classical CA (Cellular Automata) model [9]. In the CA model the cells are arranged in a fixed grid with fixed connections to their local neighbors. Each cell computes its next state by applying a local rule that depends on its own state and the states of its neighbors. The accesses to the neighbors' states are read-only, so no write conflicts can occur. The rule can be applied to all cells in parallel, and therefore the model is inherently massively parallel. The CA model is suited to all kinds of applications with local communication, such as physical fields, lattice-gas models, models of growth, moving particles, fluid flow, routing problems, picture processing, genetic algorithms, and cellular neural networks.
The GCA model is a generalisation of the CA model which is also massively parallel. It is not restricted to local communication because any cell can be a neighbor. Furthermore, the links to the neighbors are not fixed; they can be changed by the local rule from generation to generation. The range of parallel applications is therefore much wider for the GCA model. Typical applications besides the CA applications are graph algorithms, hypercube algorithms, logic simulation [10], numerical algorithms, communication networks, neural networks, games, and graphics.
The state of a GCA cell consists of a data part and one or more pointers (Fig. 1). The pointers are used to dynamically establish links to global neighbors. We call the GCA model one-handed if only one neighbor can be addressed, two-handed if two neighbors can be addressed, and so on. In our investigations of GCA algorithms we found that most of them can be described with only one link.
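The operation principle described above can be illustrated with a minimal sketch of a one-handed GCA. The names `Cell`, `gca_step` and `make_max_rule` are hypothetical and chosen for illustration only, and the example rule (a maximum reduction with a pointer distance that doubles in every generation) is an assumed example, not an algorithm taken from this paper. The sketch shows the two essential properties: every cell reads its neighbor's old state without modifying it, and the rule may change the pointer from generation to generation.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class Cell:
    data: int      # data part of the cell state
    pointer: int   # index of the single global neighbor (one-handed GCA)


Rule = Callable[[int, Cell, Cell], Cell]


def gca_step(cells: List[Cell], rule: Rule) -> List[Cell]:
    """Compute one generation: all cells read the old states, none is overwritten."""
    return [rule(i, cell, cells[cell.pointer]) for i, cell in enumerate(cells)]


def make_max_rule(n_cells: int) -> Rule:
    """Assumed example rule: keep the larger value and double the pointer distance."""
    def rule(index: int, own: Cell, neighbor: Cell) -> Cell:
        distance = (own.pointer - index) % n_cells
        new_pointer = (index + 2 * distance) % n_cells
        return Cell(max(own.data, neighbor.data), new_pointer)
    return rule


values = [3, 1, 4, 1, 5, 9, 2, 6]
n_cells = len(values)
# Initially every cell points to its right neighbor (cyclically).
cells = [Cell(v, (i + 1) % n_cells) for i, v in enumerate(values)]
rule = make_max_rule(n_cells)
for _ in range(3):            # log2(8) = 3 generations
    cells = gca_step(cells, rule)
result = [c.data for c in cells]
print(result)
```

Because the new generation is built as a separate list, the read-only access to neighbors is enforced by construction, which mirrors the absence of write conflicts in the model.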
The aim of our research is the hardware and software support of this model. There are three main possibilities for an implementation.
1. Fully Parallel Architecture. A specific GCA algorithm is directly mapped into the hardware using registers, operators and hardwired links which may also be switched if necessary. The advantage of such an implementation is a very high performance [3] , but the problem size is limited by the hardware resources and the flexibility to apply different rules is low.
2. Partially Parallel Architecture with Memory Banks. This architecture [5, 7] also offers high performance, is scalable and can cope with a large number of cells. The flexibility to cope with different rules is restricted.
Figure 1. The operation principle of the GCA
3. Multiprocessor Architecture. This architecture [4] is not as powerful as the two above, but it has the advantage that it can be tailored to any GCA problem by programming. It also allows standard or other computational models to be integrated.
In this contribution we present a multiprocessor architecture for the GCA model which was also implemented in FPGA logic.
A Multiprocessor Architecture

Design Goals
• The system shall consist of a master processor, p cell processors with local memories and an interconnection network (Fig. 2).
• Each cell processor can hold a part of the GCA cell field of the application.
• A cell processor can modify only its own cells in its local memory.
• Each cell processor has only read access to the other (external) cell processors; write accesses need not be implemented because the GCA model does not require them.
• The local GCA rule shall be programmable by processor instructions.
• The processor instructions shall support the accesses to the cells in the local memory and the read accesses to external cells stored in the other processors.
The tasks of the master are
• initializing the cell processors with program and data.
• central control and synchronization.
• optionally supplying the cell processors with general parameters, counters or identical instructions.
The network interconnects the master and the cell processors. Depending on the type of the GCA algorithm, the communication pattern between the cells can be simple (regular and symmetric) or complex (irregular and not symmetric). The network complexity therefore depends on the complexity of the communication patterns needed for the class of GCA algorithms to be implemented. For many GCA algorithms [2] the communication pattern is rather simple, which simplifies the design of the network. Simple or specialized networks can be implemented with multiplexers or fixed connections, whereas complex networks have to be able to manage concurrent read accesses to arbitrary external memory locations. Because a cell in the GCA model is not allowed to modify the contents of another cell, the network design is simplified: write accesses need not be implemented.
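The design goals above can be sketched as a small software model. This is only an illustration under stated assumptions: the class `CellProcessorSystem`, the function `locate` and the block-wise distribution of cells (processor k holds the global cells k·n to k·n + n − 1) are hypothetical and not taken from the prototype. The sketch shows that each processor writes only its own cells, reads external cells through a read-only path, and that a read can be classified as internal or external from the address alone.

```python
from typing import Callable, List, Tuple


def locate(cell_index: int, cells_per_processor: int) -> Tuple[int, int]:
    """Map a global cell index to (processor, local address); block distribution assumed."""
    return divmod(cell_index, cells_per_processor)


class CellProcessorSystem:
    def __init__(self, values: List[int], p: int) -> None:
        assert len(values) % p == 0, "N must be divisible by p"
        self.n = len(values) // p
        # One local memory per cell processor.
        self.memories = [values[k * self.n:(k + 1) * self.n] for k in range(p)]
        self.internal_reads = 0
        self.external_reads = 0

    def read(self, requester: int, cell_index: int) -> int:
        """Read-only access; counts whether the network had to be used."""
        owner, address = locate(cell_index, self.n)
        if owner == requester:
            self.internal_reads += 1
        else:
            self.external_reads += 1
        return self.memories[owner][address]

    def step(self, rule: Callable[[int, int], int], pointers: List[int]) -> None:
        """One generation: every processor updates only the cells it owns."""
        new_memories = []
        for proc, local in enumerate(self.memories):
            new_local = []
            for address, own in enumerate(local):
                index = proc * self.n + address
                new_local.append(rule(own, self.read(proc, pointers[index])))
            new_memories.append(new_local)
        self.memories = new_memories


system = CellProcessorSystem(list(range(8)), p=2)
system.step(max, [i ^ 4 for i in range(8)])   # neighbors on the other processor
system.step(max, [i ^ 1 for i in range(8)])   # neighbors on the same processor
print(system.external_reads, system.internal_reads, system.memories)
```

In this model a regular pattern such as "invert one address bit" produces either only internal or only external reads in a generation, which is the kind of simple communication pattern that keeps the network design simple.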
Evaluation of the General Architecture
In order to estimate the performance of such a GCA multiprocessor architecture, a mathematical model was developed. The model takes into account the probabilities of internal (local) and external memory accesses and allows predictions of the time needed to compute a cell rule depending on the number of processors.
• The number of cell values to be computed in one generation is N. Each cell has L (global) neighbors.
• The number of cell processors is p.
• Each cell processor processes n = N/p cells, where N can be divided by p without remainder.
Figure 2. General System Architecture
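The time to compute one generation (n cells per processor, all processors working in parallel) is $T = n \cdot t_{\mathrm{Rule}}$, where $t_{\mathrm{Rule}}$ is the time to compute the local rule for one cell on a cell processor. The time $t_{\mathrm{Rule}}$ consists of the following parts:
• $t_{\mathrm{ReadSelf(internal)}} = t_0$: The time to read the cell data from the local/internal memory.
• $t_{\mathrm{ReadNeighbor}}$: The time to read the cell data of a neighbor. If the neighbor cell is in the internal memory, the time is $t_{\mathrm{ReadNeighbor(internal)}} = t_0$. If the neighbor is in an external memory (the memory of another cell processor), the time is defined as $t_{\mathrm{ReadNeighbor(external)}} = e(p) \cdot t_0$, meaning that it takes $e(p)$ times longer than an internal access. The cost of an external access may also increase with the number of processors p.
• $t_{\mathrm{Compute}} = c \cdot t_0$: The time to compute the local rule.
• $t_{\mathrm{WriteSelf(internal)}} = t_0$: The time to store the resulting cell value in the internal memory.
The probability to hit a neighbor cell in the internal memory is
\[ P(\mathrm{ReadNeighbor(internal)}) = \frac{n}{N} \]
if the probability to access an arbitrary cell is equally distributed. Then the probability to access a neighbor which is located on another cell processor (external access) is
\[ P(\mathrm{ReadNeighbor(external)}) = 1 - P(\mathrm{ReadNeighbor(internal)}) \]
The average time to access a neighbor cell is then
\[ \bar{t}_{\mathrm{ReadNeighbor}} = P(\mathrm{ReadNeighbor(internal)}) \cdot t_0 + P(\mathrm{ReadNeighbor(external)}) \cdot e(p) \cdot t_0 \]
The time T to compute one generation with access to L neighbors will be
\[ T = n \left( t_{\mathrm{ReadSelf(internal)}} + L \cdot \bar{t}_{\mathrm{ReadNeighbor}} + t_{\mathrm{Compute}} + t_{\mathrm{WriteSelf(internal)}} \right) \]
With $P(\mathrm{ReadNeighbor(internal)}) = \frac{n}{N} = \frac{1}{p}$ we get the following result
\[ T(p) = \frac{N}{p} \, t_0 \left[ 2 + c + L \left( \frac{1}{p} + \left( 1 - \frac{1}{p} \right) e(p) \right) \right] \]
Now the relative speed-up shall be evaluated. If only one processor is available, no external references are necessary and the formula for T can be simplified to
\[ T(1) = N \, t_0 \, (2 + c + L) \]
In the other extreme case, where the number of cells is equal to the number of processors (p = N), the time to compute one generation is
\[ T(N) = t_0 \left[ 2 + c + L \left( \frac{1}{N} + \left( 1 - \frac{1}{N} \right) e(N) \right) \right] \]
In the normal case, where the number of cells is greater than the number of processors and the cells can be equally distributed to the processors, the relative speed-up will be
\[ S(p) = \frac{T(1)}{T(p)} = \frac{p \, (2 + c + L)}{2 + c + L \left( \frac{1}{p} + \left( 1 - \frac{1}{p} \right) e(p) \right)} \]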
Fig. 3 shows the speed-up for the common case L = 1 (one neighbor). The parameter e(p) describes the cost of an external access relative to an internal access. For the case in which the external access cost is not constant, the function e(p) = 1 + h · p (fixed cost plus incremental cost) was assumed as an example. The real cost of external accesses will depend on how efficiently the interconnection network can handle the communication pattern required by the GCA algorithm.
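The model can be evaluated numerically with a short sketch. It assumes that the rule time is the sum of one internal read of the own cell (t0), L averaged neighbor reads, the computation (c · t0) and one internal write (t0), and that a neighbor is internal with probability 1/p. The function names and the parameter values c = 1 and h = 0.1 are assumptions chosen for illustration; they are not values reported in this paper.

```python
from typing import Callable


def generation_time(N: int, p: int, L: int = 1, c: float = 1.0,
                    e: Callable[[int], float] = lambda p: 1.0,
                    t0: float = 1.0) -> float:
    """Time for one generation on p cell processors under the assumed model."""
    n = N / p                          # cells per processor
    p_internal = 1.0 / p               # uniformly distributed neighbor accesses
    t_neighbor = (p_internal + (1.0 - p_internal) * e(p)) * t0
    t_rule = t0 + L * t_neighbor + c * t0 + t0   # read self, neighbors, compute, write
    return n * t_rule


def speedup(N: int, p: int, **model) -> float:
    """Relative speed-up T(1) / T(p)."""
    return generation_time(N, 1, **model) / generation_time(N, p, **model)


def linear_cost(p: int, h: float = 0.1) -> float:
    """Example external access cost e(p) = 1 + h * p with an assumed h."""
    return 1.0 + h * p


for p in (1, 2, 4, 8, 16, 32):
    ideal = speedup(128, p)                    # e(p) = 1: external as cheap as internal
    loaded = speedup(128, p, e=linear_cost)    # e(p) grows with p
    print(f"p={p:2d}  S(e=1)={ideal:6.2f}  S(e=1+0.1p)={loaded:6.2f}")
```

With e(p) = 1 the speed-up equals p, and any external access cost that grows with p pulls the curve below this line, which is the qualitative behaviour the model is meant to expose.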
FPGA Prototype Implementation
A prototype of the multiprocessor architecture was designed and implemented for an FPGA (Altera Cyclone II) [1]. The system consists of p cell processors without a dedicated master; one of the cell processors takes over the tasks of the master. The cell processors are RISC processors. The processor design is simple compared to a standard microprocessor because the resources of the FPGA were limited. The goal was to build a prototype system in order to study the principal behaviour of such an implementation. Therefore the processors are not optimized, e.g. the execution of the instructions is not pipelined.
The components of a cell processor are:
• Program Memory. The size is 256 words of 24 bits. The program is loaded into it during the initialization phase.
• Data Memory. The size is 256 words of 16 bits. It consists of two parts, each with 128 words. The two parts are necessary because the old generation of cells has to be available while the new generation is computed. Besides the cell data, arbitrary local variables are also stored in this memory.
• Register File. The size is 16 words of 16 bits. The registers R8..RE are general purpose, the registers R0..R7 and RF are special purpose.
• Dedicated Registers:
• ALU. The ALU is connected to arbitrary registers and to the status register.
• Program Address Logic. The program address logic computes the next program address.
• Control Unit. The Control Unit interprets the master control information and the local instruction.
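The role of the two-part data memory listed above can be sketched as a double buffer. This is a simplified illustration, not the prototype's memory controller: the class `DataMemory`, its method names and the way the two halves are selected and swapped are assumptions, and the local variables that share the real memory are ignored.

```python
class DataMemory:
    """256 words of 16 bits, split into two halves of 128 words (double buffer)."""

    WORDS = 256
    HALF = 128
    MASK = 0xFFFF

    def __init__(self) -> None:
        self.words = [0] * self.WORDS
        self.old_half = 0          # half that currently holds the old generation

    def read_old(self, address: int) -> int:
        """Read a cell of the old generation."""
        return self.words[self.old_half * self.HALF + address]

    def write_new(self, address: int, value: int) -> None:
        """Write a cell of the new generation into the other half (16-bit wrap)."""
        self.words[(1 - self.old_half) * self.HALF + address] = value & self.MASK

    def swap(self) -> None:
        """After a generation is complete, the new half becomes the old one."""
        self.old_half = 1 - self.old_half


memory = DataMemory()
initial = [10, 20, 30, 40]
for address, value in enumerate(initial):
    memory.write_new(address, value)
memory.swap()

# One generation: every cell adds the old value of its right neighbor (cyclically).
n = len(initial)
for address in range(n):
    neighbor = memory.read_old((address + 1) % n)
    memory.write_new(address, memory.read_old(address) + neighbor)
memory.swap()

generation = [memory.read_old(a) for a in range(n)]
print(generation)
```

With a single buffer, the last cell in this example would read the already updated value of cell 0; keeping the old generation in a separate half avoids that.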
Instruction Set of the Cell Processor
The instructions are 24 bits wide (Tab. 1). The instructions operate only on registers; therefore we call this type of architecture VRISC (very much reduced instruction set computer). The instruction fields are:
• T: Instruction type. Instruction type 0 is a dyadic operation on registers; the operators are AND, OR, ADD and MUL. Instruction 1 is a monadic operation on registers; the monadic operations are SHIFT, NOT, MOV and logical reduction. Instruction 9 loads a 16-bit constant K into the register R7. Instruction 11 compares two registers and sets the condition bits of the status register. Instruction 2 conditionally sets the program counter to the target address L. Instruction 3 reads data from the address R6 of the local memory MEM into the register R5. Instruction 4 writes data into the local memory. Instruction 5 reads data from an external memory location R3 into the register R4. There are also some special instructions for the synchronization of the cell processors (WAIT, READY, GO).
Some Implementation Details
The prototyping platform was a Cyclone II FPGA with the Quartus synthesis software from Altera. The Cyclone II FPGA contains 68,416 logic elements (LE) and 1,152,000 RAM bits. The implementation language was Verilog HDL. The cell processor system with p = 32 processors was implemented with 56% of the available logic elements and 28% of the RAM bits (Tab. 2). The maximum clock frequency was around 85 MHz.
In the current prototype implementation every instruction needs six cycles for execution. Optimization of the cell processor is in progress, e.g. reducing the number of execution cycles, pipelining and extending the instruction set.
Table 2. Resources and clock rate
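The figures stated above allow a rough estimate of the instruction throughput and of the logic cost per processor. The following sketch only combines the numbers given in the text (85 MHz, six cycles per instruction, 32 processors, 56% of 68,416 logic elements); it treats the clock frequency as exactly 85 MHz and attributes all used logic elements, including the network, evenly to the processors, both of which are simplifying assumptions.

```python
CLOCK_HZ = 85e6                 # "around 85 MHz", taken as exact here
CYCLES_PER_INSTRUCTION = 6      # non-pipelined prototype
PROCESSORS = 32
LOGIC_ELEMENTS = 68_416         # available on the Cyclone II device
LE_UTILIZATION = 0.56           # share used by the 32-processor system

# Instructions per second of one cell processor and of the whole system.
instructions_per_second = CLOCK_HZ / CYCLES_PER_INSTRUCTION
aggregate_per_second = instructions_per_second * PROCESSORS

# Logic elements used in total and per processor (network share included).
used_logic_elements = LOGIC_ELEMENTS * LE_UTILIZATION
logic_elements_per_processor = used_logic_elements / PROCESSORS

print(f"{instructions_per_second / 1e6:.1f} M instructions/s per processor")
print(f"{aggregate_per_second / 1e6:.0f} M instructions/s in total")
print(f"{used_logic_elements:.0f} LEs used, about "
      f"{logic_elements_per_processor:.0f} per processor")
```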
The communication network was implemented as a read-only crossbar consisting of multiplexers. Each processor has direct access to any other processor. The cost of the network in terms of logic elements and time delay was not significant for p = 32 processors.
The cost and delay of the network were also investigated. The network consists of multiplexers, and a one-bit multiplexer was synthesized separately. The number of logic elements shown in Table 3 has to be multiplied by the number of processors and the width of the external data bus.
The multiprocessor system had been simulated in Java before the synthesis process was started. A cross assembler is available to facilitate the machine programming.
An Application: Merging of Bitonic Sequences
The principle of operation of the cell processor system will be demonstrated by the parallel merging of bitonic sequences. The result of merging two bitonic sequences is a sorted sequence of values. A sequence is bitonic if the values are increasing to a maximum and then decreasing. If such a sequence is cyclically shifted it remains bitonic. Bitonic sequences can be constructed from an unsorted sequence by applying nearly the same principle as merging.
In the fully parallel GCA model each cell holds its own value and compares it with the value of a neighbor cell. The distance to the neighbor is a power of two and changes from generation to generation. If its own value does not have the desired relation to the neighbor's value (e.g. ascending order), the cell replaces its own value with the value of the neighbor. If the pointer points to a higher-indexed cell the minimum is computed, otherwise the maximum. The number of generations is log2(N), where N is the number of cells.
If the GCA model is sequentially executed on one processor, n steps are necessary in each generation leading to a total number of n · log 2 (N ) steps.
The algorithm can be described in the language CDL [6], which was developed in order to facilitate the description of cellular algorithms. A cell state consists of cell.data and the pointer cell.other. The pointer to the global neighbor is denoted by other. The CDL program (Listing 1) accesses the neighbors via relative addresses (±m/2) in the first generation, (±m/4) in the second and so on. The neighbor may also be accessed via absolute addresses. In this case the neighbor's address can be derived from the cell's own address (or space index) by inverting one bit of that address. The bit to be inverted moves from the MSB to the LSB as the generation counter is incremented. In the prototype implementation absolute addressing of the neighbors was used.
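The merging rule with absolute addressing can be sketched in a few lines. This is a software illustration of the description above, not the CDL program of Listing 1 and not the assembler program of the prototype; the function name `bitonic_merge_gca` is hypothetical. In each generation the neighbor address is obtained by inverting one bit of the cell's own address, starting at the MSB, and a cell keeps the minimum if the neighbor has the higher index and the maximum otherwise.

```python
from typing import List, Tuple


def bitonic_merge_gca(values: List[int]) -> Tuple[List[int], int]:
    """Merge a bitonic sequence into ascending order; returns (data, generations)."""
    n_cells = len(values)
    assert n_cells > 0 and n_cells & (n_cells - 1) == 0, "N must be a power of two"
    data = list(values)
    bit = n_cells >> 1            # start with the most significant address bit
    generations = 0
    while bit:
        # Absolute addressing: invert one bit of the cell's own address.
        pointers = [index ^ bit for index in range(n_cells)]
        # All cells read the old generation; a new list is built for the new one.
        data = [min(data[i], data[j]) if j > i else max(data[i], data[j])
                for i, j in enumerate(pointers)]
        bit >>= 1                 # next generation uses the next lower bit
        generations += 1
    return data, generations


bitonic = [1, 4, 6, 8, 7, 5, 3, 2]     # increasing, then decreasing
merged, generations = bitonic_merge_gca(bitonic)
print(merged, generations)
```

The loop runs log2(N) times, one iteration per generation, and the list comprehension corresponds to the N cells that the fully parallel model would update simultaneously.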
The system was implemented for p = 1, 2, 4, 8, 16, 32 processors and N = 128 cells. Each processor holds n = N/p cells. In the first processing part only internal cells are compared (Fig. 5). We call these generations internal generations (G_I). In the second processing part only external cells are compared. We call these generations external generations (G_E).
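The total number of generations is
\[ G = G_I + G_E = \log_2(n) + \log_2(p) = \log_2(N) \]
In order to compute the time for the external processing, $G_E$ has to be multiplied by the number n of internal cells (because they are sequentially exchanged via the communication network) and by a factor $t_e$, representing the time for one external operation (an operation with external access):
\[ T_E = G_E \cdot n \cdot t_e = \log_2(p) \cdot \frac{N}{p} \cdot t_e \]
In order to compute the time for the internal processing, $G_I$ has to be multiplied by the number n of internal cells (because they are sequentially activated) and by a factor $t_i$, representing the time for one internal operation:
\[ T_I = G_I \cdot n \cdot t_i = \log_2\!\left(\frac{N}{p}\right) \cdot \frac{N}{p} \cdot t_i \]
Thus the total estimated time T is
\[ T = T_I + T_E = \frac{N}{p} \left( \log_2\!\left(\frac{N}{p}\right) t_i + \log_2(p) \, t_e \right) \]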
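The estimate can be tabulated for the processor counts used in the prototype. The sketch assumes that the number of internal generations is log2(n) and the number of external generations is log2(p), so that their sum is log2(N), and it uses unit costs t_i = t_e = 1 as placeholder values. These costs are assumptions; the exact instruction counts of the implementation, including the constant initialization parts, are not reproduced here.

```python
from typing import Dict


def estimated_time(N: int, p: int, t_i: float = 1.0, t_e: float = 1.0) -> float:
    """Estimated merge time: n * (G_I * t_i + G_E * t_e) with n = N / p."""
    assert N % p == 0
    n = N // p
    g_internal = n.bit_length() - 1    # log2(n) for a power of two
    g_external = p.bit_length() - 1    # log2(p) for a power of two
    return n * (g_internal * t_i + g_external * t_e)


N = 128
times: Dict[int, float] = {p: estimated_time(N, p) for p in (1, 2, 4, 8, 16, 32)}
for p, t in times.items():
    print(f"p={p:2d}  T={t:6.1f}  speed-up={times[1] / t:5.2f}")
```

With equal internal and external operation costs the estimate gives a speed-up of exactly p; a larger t_e reduces the speed-up more strongly as p grows, because the share of external generations increases.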
For our implementation (Listing 2) the time was exactly counted in number of instructions
These formulas include additional constant parts which correspond to initialization code.
The number of needed instructions and the relative speed-up for p and N = 128 is shown in the 
Conclusion
A programmable multiprocessor architecture for the massively parallel GCA model was designed and implemented as a prototype in FPGA technology. The architecture consists of p cell processors with internal memories and a read-only interconnection network.
Compared to a dedicated implementation the proposed architecture is very flexible because it can be easily adapted to different GCA algorithms by programming. The speed-up of the prototype increases linearly with the number of processors for the investigated algorithm. For other implemented algorithms (vector reduction, transitive hull) the speed-up was also linear. The implementation of the network is relatively simple because it consists only of cascaded multiplexers. If the number of processors gets very high, the cost and time delay of the network have to be taken into account.
If the external processor offers the external cell data to the demanding processor at the right moment, no synchronization overhead degrades the performance. Therefore the program should reflect the desired communication pattern of the GCA algorithm in order to minimize the synchronization overhead.
