Abstract. Genetic Algorithms (GAs) have been implemented on a number of multiprocessor machines. In many cases the GA has been adapted to the hardware structure of the system. This paper describes the implementation of a standard genetic algorithm on several MIMD multiprocessor systems. It discusses the data dependencies of the different parts of the algorithm and the changes necessary to adapt the serial version to the parallel versions. Timing measurements and speedups are given for a common problem implemented on all machines.
Introduction
In this paper we describe the implementation of a standard Genetic Algorithm [1, 2] on a number of different multiprocessor machines. After a discussion of the data dependencies in a straightforward implementation of a GA, the parallelization strategy on each machine is discussed and speedup measurements are presented. No attempt is made to optimize the algorithm for each specific architecture; instead we tried to stay as close as possible to the original and to parallelize in a way that seems natural for an application programmer on such a machine. This includes the use of standard libraries such as Pthreads where they are available and recommended by the vendor as the preferred method of parallelization.
Genetic Algorithms
In this section we briefly outline the GA we have implemented on the different machines.
Selection. The fitness of each individual in the population is evaluated. The fitness of each individual relative to the mean fitness of all individuals gives the probability with which this individual is reproduced in the next generation. The frequency $h_i$ of individual $i$ in the next generation is therefore given by

$$h_i = \frac{f_i}{\bar{f}},$$

where $f_i$ is the fitness of individual $i$ and $\bar{f}$ is the average over all fitness values.
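For illustration, here is a minimal C sketch of this fitness-proportional ("roulette wheel") selection; the function and variable names (select_parent, f, P) are ours, not from the paper:

```c
#include <stdlib.h>

/* Roulette-wheel selection: index i is returned with probability
 * f[i] / sum(f), matching h_i = f_i / f-bar above. */
int select_parent(const double *f, int P)
{
    double sum = 0.0, r, acc = 0.0;
    int i;

    for (i = 0; i < P; i++)
        sum += f[i];
    r = ((double)rand() / RAND_MAX) * sum;   /* random point on the wheel */
    for (i = 0; i < P; i++) {
        acc += f[i];
        if (acc >= r)
            return i;
    }
    return P - 1;   /* guard against floating-point rounding */
}
```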
Crossover. The crossover operator takes two individuals from the population and combines them into a new one. The most general form is uniform crossover, from which the so-called one-point and two-point crossover can be derived. First two individuals are selected, the first one according to its fitness and the second one at random. Then a crossover mask $M_i$, $i = 1, \dots, L$, where $L$ is the length of the chromosome, is generated randomly. A new individual is generated which takes its value at position $i$ from the first individual if $M_i = 1$ and from the second one if $M_i = 0$. The crossover operator is applied with probability $P_C$.
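A sketch of uniform crossover as described above, assuming the encoding mentioned later in the paper (one bit per char); the name uniform_crossover is ours:

```c
#include <stdlib.h>

/* Uniform crossover: position i is taken from the first parent if the
 * random mask bit M_i is 1, from the second parent otherwise. */
void uniform_crossover(const char *parent1, const char *parent2,
                       char *child, int L)
{
    int i;

    for (i = 0; i < L; i++) {
        int m = rand() & 1;                   /* random mask bit M_i */
        child[i] = m ? parent1[i] : parent2[i];
    }
}
```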
Mutation. Each bit of an individual is changed (e.g. inverted) with probability $P_M$. All three steps above are iterated for a given number of generations (or until one can no longer expect a better solution).
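The mutation operator is equally simple; a sketch under the same one-bit-per-char assumption:

```c
#include <stdlib.h>

/* Bitwise mutation: each bit is inverted with probability p_m (P_M). */
void mutate(char *individual, int L, double p_m)
{
    int i;

    for (i = 0; i < L; i++)
        if ((double)rand() / RAND_MAX < p_m)
            individual[i] = !individual[i];   /* invert the 0/1 value */
}
```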
Parallel Genetic Algorithms
It has long been noted that genetic algorithms are well suited for parallel execution. Parallel implementations of GAs exist for a variety of multiprocessor systems, including MIMD machines with global shared memory [11] and message passing systems like transputers [3, 4] and hypercubes [8] as well as SIMD architectures [6, 7] like the Connection Machine.
Dependencies
It is easy to see that the following steps in the algorithm can be trivially parallelized:
Evaluation of the fitness function. The fitness of each individual can be computed independently of all others. This gives us a linear speedup with the number of processing elements.
Crossover. If we choose to generate each individual of the next generation by applying the crossover operator, we can do this operation in parallel for each new individual. The alternative would be to apply crossover and to put the resulting individual in the existing population where it replaces e.g. a bad solution.
Mutation. The mutation operation can be applied to each bit of each individual independently. Apart from the bit value, the only information needed is the global parameter $P_M$.
Parallelization
It should be noted that it is usually not possible to gain a larger speedup by overlapping steps 1) and 2), because of the data dependencies between the different steps of the algorithm. This can be seen e.g. for step 2: the crossover operation selects one of the parents according to its relative fitness. However, this can only be done once the fitness values of all other individuals have been computed, so that the mean value is available.
In the following we will point out what kind of data each processing element must access to perform the different steps of the algorithm.
Fitness evaluation. Each processing element needs access only to those individuals whose fitness it will compute. In the optimal case (number of processing elements = number of individuals) this is a single individual. However, the result of this computation is needed by all other processing elements, since it enters the mean value of all function evaluations needed in step 2.
Crossover. Each processing element which creates a new individual must have access to all other individuals since each one may be selected as a parent. To make this selection the procedure needs all fitness values from step 1.
Mutation. As in step 1, each processing element needs only the individual(s) it deals with. As mentioned above, the parallelization could be even more fine-grained than in steps 1 and 2, in which case each processing element would need only one bit of each individual. This could usually only be exploited on a SIMD-style machine.
Many implementors of parallel genetic algorithms have decided to change the standard algorithm in several ways.
The most popular approach is to partition the population into several subpopulations [5, 9] or to introduce a topology, so that individuals can interact only with chromosomes in their neighborhood [3, 6, 10, 12, 13]. All these methods obviously reduce the coupling between different processing elements.
Although some authors report improvements in the performance of their algorithm, and the method can be justified on biological grounds, we consider it a drawback that the original standard GA could not be implemented efficiently. The reason is that a genetic algorithm is often a computationally intensive task which depends critically on the parameters used for the simulation (e.g. $P_M$ and $P_C$). There are some theoretical results about how to choose these parameters or the representation of a given problem, but most of them deal with the standard GA only. Even then one often has to try several possibilities to adjust the parameters optimally.
In the following we present some results of such a parallelization on a number of different multiprocessor systems. We will show that only a small number of properties are required to get an efficient parallel program which implements the standard GA. The systems are all of the MIMD type but range from a special purpose system (NERV) to global shared memory systems (SparcServer, KSR1).
We implemented the same program on all machines. Instead of using some kind of toy problem we decided to use a real application as a benchmark. The task of the GA is to optimize the placement of logic cells in a field programmable gate array (Xilinx). The input is a design file, which is usually created by an external program, as well as a library of user-defined parts and information about the specific chip layout and package. From this input the program creates an internal list of the required logic cells and their connections.
Our test design used 276 of the 320 possible logic blocks on a Xilinx 3090 and 121 I/O blocks. The number of internal connections is on the order of several thousand. Since the chromosomes represent positions for each logic or I/O block, the chromosome length is given by the sum of these two numbers, i.e. 397.
The program is completely CPU-bound until it writes the final output file. The big advantage of having a real application is that it avoids the usual problems of benchmarks, such as being so small that the whole problem fits in the processor cache or concentrating the entire computational task in a few lines, as in a matrix multiplication. Furthermore we have a simple criterion for comparing the implementations: how much the program is sped up by using multiple processors. Since the user may be waiting for the result of the placement program, measuring the real time needed for the task seems the most reasonable choice.
The NERV multiprocessor system
The Hardware.
The NERV multiprocessor [14] is a system which was originally designed for the efficient simulation of neural networks. It is based on a standard VMEbus system [15] which has been extended to support several special functions. Each processing element consists of a MC68020 processor with static local memory (currently 512 kB), and each VME board contains several processing elements. Usually the system is run in SPMD (Single Program Multiple Data) mode, which means that the same program is downloaded to each processing element while the data to be processed are distributed among the boards.
The following extensions to the VMEbus have been implemented in the NERV system:
A broadcast facility, which is not part of the standard VME protocol. It allows each processor to update the memory of all other processor boards with a single write cycle. From the programmer's point of view there exists a region in the address space where a write access initiates an implicit broadcast to all other processors, while a read from this region simply returns the data in local memory. Software written in C or C++ can take advantage of this property in the following way:
A pointer, either to a global variable or to dynamically allocated memory, can be modified by a special function called mk_global() so that it points into the broadcast region. Whenever this pointer is dereferenced by a write access, an implicit broadcast happens; a read access returns the local data.
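A hypothetical sketch of this idiom; mk_global() is the NERV primitive named above, while all other names and the allocation scheme are our assumptions:

```c
#include <stdlib.h>

extern void *mk_global(void *local_ptr);   /* remap into the broadcast region */

static double *fitness;                    /* one entry per individual */

void setup_fitness(int P)
{
    /* allocate locally, then make the pointer refer to the broadcast region */
    fitness = mk_global(malloc(P * sizeof(double)));
}

void publish_fitness(int i, double value)
{
    fitness[i] = value;    /* write access: implicit broadcast to all boards */
}

double read_fitness(int j)
{
    return fitness[j];     /* read access: served from local memory */
}
```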
Note however that there is no explicit synchronization between the processors. If two processors update the same element, the last one will win.
A second extension of the VMEbus is a hardware synchronization of all processing elements. The programmer calls a procedure synchronize() which returns only after all processors have reached the synchronization point.
Implementation of the Parallel Genetic Algorithm.
The previous sections suggest the following setup for the algorithm on the NERV system:
The same program is loaded onto each processor. Every processor has a copy of all individuals in its local memory. The current population and the population of the next generation are accessed through two pointers which have been prepared so that they both point into the broadcast region. The same holds for an array which contains the fitness values of all individuals. After each generation the two population pointers are simply exchanged. Let N be the number of processing elements in the system. The general strategy is to distribute the computational load equally among all processing elements by assigning P/N individuals to each processor, where P is the population size; a sketch of this structure follows below.
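To make the setup concrete, here is a minimal sketch of the per-generation structure; evaluate(), make_offspring() and all variable names are illustrative assumptions, and the pointers are assumed to have been prepared with mk_global() during initialization:

```c
extern double evaluate(const char *individual);                      /* assumed */
extern void make_offspring(char **pop, const double *fit, char *child); /* assumed */
extern void synchronize(void);            /* NERV hardware barrier (see above) */

void ga_generations(char **current, char **next, double *fitness,
                    int P, int N, int me, int max_gen)
{
    int first = me * (P / N);     /* this processor's slice of the population */
    int last  = first + (P / N);
    int gen, i;
    char **tmp;

    for (gen = 0; gen < max_gen; gen++) {
        for (i = first; i < last; i++)
            fitness[i] = evaluate(current[i]);   /* write = implicit broadcast */
        synchronize();         /* all fitness values now visible on every board */

        for (i = first; i < last; i++)
            make_offspring(current, fitness, next[i]);
        synchronize();

        tmp = current; current = next; next = tmp;   /* exchange the pointers */
    }
}
```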
The parallelization of each GA operator is now straightforward.
Fitness evaluation. Each processor evaluates the fitness of the individuals it has been assigned. The fitness values are simply written into the array mentioned above, which automatically initiates a broadcast. Since each processor is responsible for a different set of individuals, no overlap will occur. Note that the evaluation function uses only the local copy of the population; the access to fitness[i] is the (implicit) broadcast.
Crossover. As already mentioned, we decided to generate the next generation by looping over all individuals of the new population and either copying an individual from the old one or creating a new one by crossover from two parents. Again each processor is responsible for a part of the population; a hedged reconstruction of this loop is sketched below. After this step $P \cdot L$ elements will have been broadcast (assuming that we encode e.g. each bit in a separate character) and each processing element will have a complete copy of the new population. The broadcast of a bit changed by mutation is done by the assignment to individual[j]. Note that the right-hand side of this assignment accesses only local memory, since it is a read access.
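Since the loop itself is not shown in the text, the following is only a hypothetical reconstruction (filling in the make_offspring() placeholder from the sketch above); frand() is assumed to return a uniform double in [0,1), and select_parent() and uniform_crossover() are the sketches from the GA section:

```c
#include <stdlib.h>
#include <string.h>

extern double frand(void);   /* uniform in [0,1), assumed helper */
extern int select_parent(const double *f, int P);
extern void uniform_crossover(const char *p1, const char *p2, char *c, int L);

/* next[] points into the broadcast region, so every assignment to
 * next[i][j] is broadcast implicitly; all reads stay local. */
void create_offspring(char **current, char **next, const double *fitness,
                      int first, int last, int P, int L,
                      double p_c, double p_m)
{
    int i, j;

    for (i = first; i < last; i++) {
        if (frand() < p_c) {                     /* crossover with prob. P_C */
            int a = select_parent(fitness, P);   /* first parent: by fitness */
            int b = rand() % P;                  /* second parent: at random */
            uniform_crossover(current[a], current[b], next[i], L);
        } else {                                 /* otherwise copy a parent */
            memcpy(next[i], current[select_parent(fitness, P)], (size_t)L);
        }
        for (j = 0; j < L; j++)                  /* mutation */
            if (frand() < p_m)
                next[i][j] = !next[i][j];        /* write = implicit broadcast */
    }
}
```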
Discussion of NERV implementation
The program transfers P fitness values (from step 1) and $P \cdot L$ bits for the new population (from step 2) over the common bus. In addition it must transfer the bits changed during mutation, whose number may vary from generation to generation. This is all the communication that occurs; all other values are fetched from local memory. A broadcast facility is the most efficient way to implement this, since its cost does not depend on the number of processors: if we increase the number of processing elements, the time needed for each step decreases while the communication overhead stays constant.
SparcServer
The SparcServer 2000 is a commercially available shared memory system with up to 16 processors which supports symmetric multiprocessing. All processors have access to a global shared memory. The system runs the Solaris operating system which is responsible for load-balancing.
The normal way to take advantage of the multiple processors in the system is to use the threads library. For synchronizing access to critical regions there are a number of mechanisms, such as mutex and condition variables. The thread library is very similar to the POSIX threads interface, although not completely identical.
The same arguments we used for the NERV system apply here as well. Most parts of the algorithm can be parallelized ideally, but we need a synchronization point after each major step. As long as we follow the same programming style as in the NERV system, it is unnecessary to lock data structures at a lower level, since no two threads write concurrently into the same memory area.
The implementation strategy is therefore to start N worker threads where each one is working on part of the population. The main thread is only responsible for initialization and controlling the synchronization of the other threads.
The synchronization has been implemented on top of the threads library. The GA routine uses only two functions, thread_init() and thread_sync(). The initialization routine is called once at the beginning of the program; it starts the worker threads and initializes the global mutex and condition variables. The number of worker threads can be given as a program argument, which allows us to vary the maximum number of processors used.
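The paper does not list this code; a minimal sketch of such a barrier built from a mutex and a condition variable, written here with POSIX-style calls (the Solaris threads variant with mutex_t/cond_t is analogous), could look as follows, with nthreads assumed to be set in thread_init():

```c
#include <pthread.h>

static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  all_here = PTHREAD_COND_INITIALIZER;
static int waiting = 0;      /* threads currently blocked in the barrier */
static int generation = 0;   /* incremented each time the barrier opens */
static int nthreads;         /* number of workers, set in thread_init() */

void thread_sync(void)
{
    pthread_mutex_lock(&lock);
    int my_gen = generation;
    if (++waiting == nthreads) {            /* last thread to arrive */
        waiting = 0;
        generation++;
        pthread_cond_broadcast(&all_here);  /* release all waiting threads */
    } else {
        while (my_gen == generation)        /* guards against spurious wakeups */
            pthread_cond_wait(&all_here, &lock);
    }
    pthread_mutex_unlock(&lock);
}
```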
KSR1
The KSR1 from Kendall Square Research has some features which distinguish it from more conventional shared memory systems like the SparcServer. Although it looks like a global shared memory system from the programmer's point of view, there is no main memory in the usual sense at all. Instead each processor has a large cache memory of 32 MByte which is backed by mass storage and kept consistent by a cache coherence protocol. The interconnection network is invisible to the programmer, although the latency of memory updates may vary if two distant nodes have to communicate.
Each processor in the system runs an OSF/1 kernel, providing the usual Unix services. The machine can be split into several partitions, with a certain number of processors dedicated to a certain program.
For a C program the preferred method of parallelizing a task is to use the POSIX threads library (Pthreads). The functionality is essentially the same as with the Solaris threads library. Therefore the main GA program is exactly the same; only the thread_init() and thread_sync() routines had to be adapted.
Results
The figures show the speedups which can be achieved by the different systems depending on the number of processors or threads used.
NERV. For the NERV system the programmer has complete control over the system and can decide how many processors to use. The overhead is only marginal, since the necessary functions are directly supported by the hardware. If the number of processors is increased, the communication time stays constant. The common bus, however, sets an upper limit to the extensibility of the system; it is not reasonable to consider a system with more than 40 processors in a VME crate. The speedup is linear, although below the theoretical maximum.

SparcServer. The SparcServer and the KSR1 are very similar in that both provide a global shared memory and allow parallelization via a threads package. However, on the Sparc machine the user has no control over the processor resources. He cannot specify on how many processors his program will run but must leave this decision to the operating system. The results show that the speedup is only on the order of two for an eight-processor system. The system was in multiuser mode, although no other compute-intensive task was running during the measurements. It is difficult to give a specific reason for this behaviour. The working set of the program is quite large, since several hundred net lists must be referenced in each fitness evaluation. Therefore the local cache of each processor will usually not suffice to hold all relevant data. If the scheduler does not bind a thread to a specific processor but reschedules threads every time they become runnable, the situation may get even worse, since they have to load their working set from memory again.

KSR1. The KSR1 allows more user control over the processors. The system is usually configured in a number of partitions with a given number of processors. The program can be run in one such partition and allocate all available processors; the largest usable partition contained 20 processors. The run time decreases as the number of threads is increased, as long as there are fewer threads than processors. Beyond that point the run time starts to increase again; it seems the system overhead for scheduling the threads becomes significant. One major drawback of the KSR1 is the large increase in access time across the stages of the memory hierarchy: the first-level cache (256 kB) needs two clock cycles, the 32 MByte local cache needs 20 clock cycles, and an access to a remote cache needs 140 clock cycles. This spans nearly two orders of magnitude, and our application definitely does not fit into the first-level cache.
Several experience reports [17] about the KSR1 show that one can achieve a significant speedup if the problem is carefully adapted to the machine-specific parameters. However, our results are more in line with the report on the port of an existing application [18], whose speedup usually did not exceed a factor of 5.
