Abstract One of the essential problems in parallel computing is: can SIMD machines handle asynchronous problems? This is a di cult, unsolved problem because of the mismatch b e t ween asynchronous problems and SIMD architectures. We propose a solution to let SIMD machines handle general asynchronous problems. Our approach is to implement a runtime support system which can run MIMD-like s o f t ware on SIMD hardware. The runtime support system, named P kernel, is thread-based. There are two major advantages of the thread-based model. First, for application problems with irregular and or unpredictable features, automatic scheduling can move some threads from overloaded processors to underloaded processors. Second, and more importantly, the granularity of threads can be controlled to reduce system overhead. The P kernel is also able to handle bookkeeping and message management, as well as to make these low-level tasks transparent to users. Substantial performance has been obtained on Maspar MP-1. 1
. It is more exible for di erent problem structures and can be applied to general applications. However, the complex control structure of MIMD architecture makes the machine expensive and the system overhead large.
Application problems can be classi ed into three categories: synchronous, loosely synchronous, and asynchronous. Table I shows a few application problems in each of the three categories 13 .
The synchronous problems have a uniform problem structure. In each time step, every processor executes the same operation over di erent data, resulting in a naturally balanced load.
The loosely synchronous problems can be structured iteratively with two phases: the computation phase and the synchronization phase. In the synchronization phase, processors exchange information and synchronize with each other. The computation load can also be redistributed in this phase. In the computation phase, di erent processors can operate independently.
The asynchronous problems have n o s y n c hronous structure. Processors may communicate with each other at any time. The computation structure can be very irregular and the load imbalanced. The synchronous problems can be naturally implemented on a SIMD machine and the loosely synchronous problems on an MIMD machine. Implementation of the loosely synchronous problems on SIMD machines is not easy; computation load must be balanced and the load balance activity is essentially irregular. As an example, the simple On 2 algorithm for N-body simulation is synchronous and easy to implement on a SIMD machine 12 . But the Bernut-Hut algorithm On log n for N-body simulation is loosely synchronous and di cult to implement on a SIMD machine 3 .
Solving the asynchronous problems is more di cult. First, a direct implementation on MIMD machines is nontrivial. The user must handle the synchronization and load balance issues at the same time, which could be extremely di cult for some application problems. In general, a runtime support system, such a s L I N D A 2, 5 , reactive k ernel 27, 33 , or chare kernel 30 , is necessary for solving asynchronous problems. Implementation of the asynchronous problems on SIMD machines is even more di cult because it needs a runtime support system, and the support system itself is asynchronous. In particular, the support system must arrange the code in such a w ay that all processors execute the same instruction at the same time. Taking the N-queen problem as an example, since it is not known prior to execution time how m a n y processes will be generated and how large each computation is, a runtime system is necessary to establish balanced computation for e cient execution. We summarize the above discussion in Table II .
Various application problems require di erent programming methodologies. Two essential in an array-based methodology because of mismatch b e t ween the model and the problem structure.
The solution to asynchronous problems cannot be easily organized into aggregate operations on a data domain that is uniformly structured. It naturally demands the thread-based programming methodology, in which threads are individually executed and where information exchange can happen at any time. For the loosely synchronous problems, either the array-based or the threadbased programming methodology can be applied.
Let SIMD Machines Handle Asynchronous Problems
To make a SIMD machine serve as a general purpose machine, we m ust be able to solve asynchronous problems in addition to solving synchronous and loosely synchronous problems.
The major di culties in executing asynchronous applications on SIMD machines are:
the gap between the synchronous machines and asynchronous applications; and the gap between the array processors and thread-based programming.
One solution, called the application-oriented approach, lets the user ll the gap between application problems and architectures. With this approach, the user must study each problem and look for a speci c method to solve i t 7 , 3 6 , 3 7 , 3 9 . An alternative to the application-oriented approach is the system-oriented approach, which provides a system support to run MIMD-like software on SIMD hardware. The system-oriented approach is superior to the application-oriented approach for three reasons:
the system is usually more knowledgeable about the architecture details, as well as its dynamic states;
it is more e cient to develop a sophisticated solution in system, instead of writing similar code repeatedly in the user programs; and it is a general approach, and enhances the portability and readability of application programs.
The system-oriented approach can be carried out in two levels: instruction-level and thread- The synchronization could suspend a large number of processors. Finally, this approach i s u nable to balance load between processors and unlikely to produce good performance for general applications.
We propose a thread-based model for a runtime system which can support loosely synchronous and asynchronous problems on SIMD machines. The thread-level implementation o ers great exibility. Compared to the instruction-level approach, it has at least two a d v antages:
The execution order of threads can be exchanged to avoid processor suspension. The load can be balanced for application problems with irregular and dynamic features. System overhead will not be overwhelming, since granularity can be controlled with the thread-based approach.
The runtime support system is named as Process kernel or P k e r n e l . T h e P k ernel is threadbased as we assign computation at the thread-level. The P kernel is able to handle the bookkeeping, scheduling, and message management, as well as to make these low-level tasks transparent to users.
Related research
Existing work on solving loosely synchronous and asynchronous problems on SIMD machines was mostly application oriented 7, 36, 3 7 , 39 . The region growing algorithm is an asynchronous, irregular problem and di cult to run on SIMD machines 39 . The merge phase, a major part of the algorithm, performs two to three orders of magnitude worse than its counterpart in MIMD machines. That result is due to the communication cost and lack of a load balancing mechanism.
The authors concluded that the behavior of the region growing algorithm, like other asynchronous problems, is very di cult to characterize and further work is needed in developing a better model."
The two other implementations, the Mandelbrot Set algorithm 36 and the Molecular Dynamics on a SIMD machine 6, 20 . The major di erence between these works and our approach is that we handle general asynchronous and loosely synchronous problems instead of studying individual problems.
The instruction-level approach has been studied by a n umber of researchers 8, 9 , 2 2 , 2 5 , 4 1 , 40 . As mentioned above, the major restriction of this approach is that the entire instruction set must be cycled through to execute one instruction step for every processor. A common method to reduce the average number of instructions emanated in each execution cycle is to perform global or's to determine whether the instruction is needed by a n y processor 9, 4 1 . It might be necessary to insert barrier synchronizations at some points to limit the degree of divergence for certain applications. Having a barrier at the end of each WHERE statement as well as each requires roughly a millisecond, unacceptably long for our ne-grained application" 17 . In the P kernel, the user can control the grainsize of the process and the grainsize can be well beyond one millisecond. Thus, the P kernel implementation can obtain acceptable performance. Besides, our model is general, instead of dedicating it merely to the functional language, as in the Graphinator model.
Our model is similar to the Large-Grain Data Flow LGDF model 18 . It is a model of computation that combines sequential programming with data ow-like program activation. The
LGDF model was implemented on shared memory machines. It can be implemented on MIMD distributed memory machines as well 42, 3 0 , 1 1 , 1 4 . Now w e s h o w that the model can be implemented on a SIMD distributed memory machine, too.
The P Kernel Approach | Computation Model and Language
The computation model for the P kernel, which originated from the Chare Kernel 30 , is a message-driven, nonpreemptive, thread-based model. Here, a parallel computation will be viewed as a collection of processes, e a c h of which in turn consists of a set of threads, called atomic computations. Processes communicate with each other via messages. Each atomic computation is then the result of processing a message. During its execution, it can create new processes or generate new messages 1 . A message can trigger an atomic computation, whereas an atomic computation cannot wait for messages. All atomic computations of the same process share one common data area. Thus, a process P k consists of a set of atomic computations A k i and one common data area D k : P k = fD k ; A k 1 ; A k 2 ; :::; A kn g; n 1
Once a process has been scheduled to a processor, all of its atomic computations are executed on the same processor. There is no presumed sequence to indicate which atomic computation will be carried out rst. Instead, it depends on the order of arrival of messages. Figure 1 shows the general organization of processes, atomic computations, and common data areas. In general, the number of processes is much larger than the number of processors, so that the processes can be moved around to balance the load.
T h e P k ernel is a runtime support system on a SIMD machine built to manipulate and schedule processes, as well as messages. A program written in the P kernel language consists mainly of a process ProcName f Common Data Area Declarations entry Label1: message msg1 Code1 entry Label2: message msg2 Code2 ... g
Here, bold-face letters denote the keywords of the language. The process body, which is enclosed in braces, consists of declarations of private variables that constitute the common data area of the process, followed by a group of atomic computation de nitions. Each atomic computation de nition starts with a keyword entry and its label, followed by a declaration of the corresponding message and arbitrary user code. One of the process de nitions must be the main process.
The rst entry point in the main process is the place the user program starts.
The overall structure of the P kernel language di ers from that of the traditional programming In the following, we will illustrate how to write a program in the P kernel language using the N-queen problem as an example. The algorithm used here attempts to place queens on the board one row at a time if the particular position is valid. Once a queen is placed on the board, the other positions in its row, column, and diagonals, will be marked invalid for any further queen placement. The program is sketched in Figure 2 ResponseQueen in processes SubQueen and Main counts the total number of successful queen con gurations. It can be triggered any n umber of times until there is no more response expected from its child process. The atomic computation SequentialQueen is invoked when the rest of rows are to be manipulated sequentially. T h i s i s h o w g r a n ularity can be controlled. In this example, there are two process de nitions besides that of process Main. A tomic computations that share the same common data area should be in a single process, such a s P arallelQueen and
ResponseQueen. The atomic computation SequentialQueen does not share a common data area with other atomic computations, and therefore, is in a separate process to preserve good data encapsulation and to save memory space. In general, only the atomic computations that are logically coherent and share the same common data area should be in the same process.
Design and Implementation
The main loop of the P kernel system is shown in Figure 3 . It starts with a system phase, which includes placing processes, transferring data messages, and selecting atomic computations to execute. It is followed by a user program phase to execute the selected atomic computation. The iteration will continue until all the computations are completed. The P kernel software consists of three major components: computation selection, communication, and memory management.
Computation selection
A fundamental di erence between the MIMD and SIMD systems is the degree of synchronization required. In an MIMD system, di erent processors can execute di erent threads of code, but not in a SIMD system. When the P kernel system is implemented on an MIMD machine, which atomic computation will be executed next is the individual decision of each processor. Whereas, due to the lock-step synchronization in a SIMD machine, the same issue becomes a global decision.
Let's assume that there are K atomic computation types, corresponding to the atomic computation de nitions, represented by a 0 ; a 1 ; :::; a K,1 . During the lifetime of execution, the total number of atomic computations executed is far more than K. These atomic computations are dynamically distributed among processors. At iteration t, a computation selection function F is applied to select an atomic computation type a k , where k = Ft and 0 k K . In the following user program phase at the same iteration, a processor will be active if it has at least one atomic computation of the selected type a k . Let the function numi; p; t record the number of atomic computation with type a i at processor p in iteration t. Let We present three computation selection algorithms here. The rst one, F cyc , is a simple algorithm.
Algorithm I: Cyclic algorithm. Basically, it repeatedly cycles through all atomic computation types. However, if acti; t is equal to zero, the type i will be skipped. F cyc t = minf i j acti mod K; t 0; F cyc t , 1 i F cyc t , 1 + Kg mod K where t 1 a n d F cyc 0 = ,1. Here, it is not always necessary to carry out K reductions to compute acti; t, since as long as the rst nonzero acti; t is found, the value of function F cyc is determined.
This algorithm is similar to the method used in the instruction-level approach in which p r ocessors repeatedly cycle through the entire set of possible instructions. The global reduction is essentially similar to the global-or" method, which is used to reduce the number of instructions that are emanated for each execution iteration. However, in the instruction-level approach, each processor executes exactly one instruction per cycle; while in the thread-level approach, each processor may execute many threads per cycle. Also, the execution order of a program is xed in the instruction-level approach, but the order can be exchanged in the thread-based approach.
To complete the computation in the shortest time, the number of iterations has to be minimized. Maximizing the processor utilization at each iteration is one of the possible heuristics. If actk 1 ; t i s 1 0 0 a n d actk 2 ; t is 900, a computation selection function Ft selecting k 2 is intuitively better, leading to an immediate good processor utilization. An auction algorithm, F auc is proposed based on this observation:
Algorithm II: Auction algorithm. For each atomic computation i, calculate acti; t a t iteration t. Then, the atomic computation with the maximum value of acti; t i s c hosen to execute next: F auc t = minf j j actj; t = max 0i K acti; t; 0 j K g The cyclic algorithm is nonadaptive in the sense that the selection is made almost independent of the distribution of atomic computations. In this way, it could be the case that a few processors are executing one atomic computation type while many processors are waiting for execution of the other atomic computation types. The auction algorithm is runtime adaptive. It will maximize utilization in most cases. An adaptive algorithm is more sophisticated in general. However, experimental results show that in most cases, the cyclic algorithm performs better than the auction algorithm. It has been observed that when an auction algorithm is applied, at the near end of execution, the parallelism becomes smaller and smaller, and the program takes a long time to nish. This low parallelism phenomenon degrades performance seriously, which i s c haracterized as the tailing e ect.
We propose an improved adaptive algorithm to overcome the tailing e ect. To retain the advantage of the auction algorithm, we i n tend to maximize the processor utilization as long as there is a large pool of atomic computations available. On the other hand, when the available parallelism falls to a certain degree, we try to exploit large parallelism by assigning priorities to di erent atomic computation types. An atomic computation whose execution increases the parallelism gets a higher priority, and vice versa. The priority can either be assigned by the programmer or be automatically generated with dependency analysis.
Algorithm III: Priority auction algorithm. For simplicity, w e assume that the atomic computations a 0 ; a 1 ; : : : ; a K,1 have been presorted according to their priorities, a 0 with highest priority a n d a K,1 with the lowest. Use m = cN as a gauge of available parallelism, where c is a constant and N is the number of processors.
F pri t = minf j j actj; t = max 0i K acti; t; 0 j K g if actj; t m minf j j actj; t 0; 0 j K g otherwise When max 0i K acti; t is larger than m, indicating that the degree of available parallelism is high, the auction algorithm will be applied. Otherwise, among the atomic computation types with acti; t 0, the one with the highest priority will be executed next. The constant c is set to be 0:5. If more than half of processors are active, the auction algorithm is used to maximize the processor utilization. Otherwise, the priority is considered in favor of parallelism increase and tailing e ect prevention. This algorithm can constantly provide better performance than that provided by the cyclic algorithm. Table III shows the performance for di erent computation selection algorithms with the 12-queen problem on the 1K-processor MP-1.
Communication
There are two kinds of messages to be transferred. One is the data message, which is speci cally addressed to the existing process. The other kind is the process message, which represents the newly generated process. Where these process messages are to be transferred depends on the scheduling strategy used. The two kinds of messages are handled separately.
Transfer of data messages. The real situation is even more complicated. During each time of the message transfer, a collision may occur when two or more messages from di erent processors have the same destination processor. Therefore, we need to prevent the message loss due to the collision. Let destp be the destination of a message from processor p and srcq be the source from which processor q is going to receive a message. There is a collision if two processors p 1 and p 2 are sending messages to the same processor q, so that destp 1 Notice that for later transfers, it is most likely that only a few processors are actively sending messages, resulting in a low utilization in the SIMD system. To a void such a c a s e , w e d o n o t require all messages to be transferred. Instead, we attempt to send out only a majority of data messages. The residual messages are bu ered and will be sent again during the next iteration.
Because of the atomic execution model, the processors that fail to send messages will not be stalled. Instead, a processor can continue execution as long as there are some messages in its queue.
We use k to measure the percentage of data messages that is left after k times of data transfers, k = j f p j d k p 0; 0 p N g j j f p j d 0 p 0; 0 p N g j : Thus, a constant can be used to control the number of transfers by limiting k . The algorithm for the data message transfer is summarized in Figure 5 . By experiment, the value of has been determined to be around 0.2. Figure 6 shows an example for the 12-queen problem on MP-1 in which the minimal execution time can be reached when = 0 :2. Random Placement algorithm. It is a simple, moderate performance algorithm: destp = random mod N where N is the number of processors. Once destp has been assigned, we can follow the same procedure as in the transfer of data messages. However, the two kinds of messages are di erent from each other in that the destp of a data message is xed, whereas that of a process message can be varied. In taking advantage of this, we are able to reschedule the process message if the destination processor could not accept the newly generated process because of some resource constraint or collision. Here, rescheduling is simply a task that assigns another random number as the destination processor ID. We can repeat this rescheduling until all process messages are assigned the destination processor IDs. However, it is a better choice that we only o er one or two c hances for rescheduling, instead of repeating it until satisfaction. Thus, the unsuccessfully scheduled process messages are bu ered similar to the residual data messages, and then wait for the next communication phase.
A load balancing strategy called Runtime Incremental Parallel Scheduling has been implemented for the MIMD version of the P kernel 31 . In this scheduling strategy, the system scheduling activity alternates with the underlying computation work. Its implementation on a SIMD machine is expected to deliver a better performance than random placement.
Memory Management
Most SIMD machines are massively parallel processors. In such a system, any one of the thousands of processors could easily run out of memory, resulting in a system failure. It is certainly an undesired situation. Ideally, a system failure can be avoided unless the memory space on all processors is exhausted. Memory management p r o vides features to improve the system robustness. When the available memory space on a speci c processor becomes tight, we should restrict the new resource consumption, or release memory space by m o ving out some unprocessed processes to other processors.
In the P kernel system implemented on SIMD machines, we u s e t wo marks, m1 a n d m2, to identify the state of memory space usage. The function p is used to measure the current usage of memory space at processor p, p = the allocated memory space at processor p the total memory space at processor p :
A processor p is said to be in its normal state when 0 p m 1, in its nearly-full state when m1 p m 2, in its full state when m2 p 1 , a n d i n i t s emergency state when running out of memory, i.e. p = 1 .
The nearly-full state. W e need to limit the new resource consumption since the available memory space is getting tight. It is accomplished by p r e v enting newly generated process messages from being scheduled. Thus, when a process message is scheduled to processor p with m1 p, the rescheduling has to be performed to nd out another destination processor.
The full state. In addition to the action taken in the nearly-full state, a more restricted scheme is applied such that any data messages addressed to the processor p with m2 p are deferred. They are bu ered at the original processor and wait for change of the destination processor's state. Although these data messages are eventually sent to their destination processor, the delay in sending can help the processor relax its memory tightness. Note that these deferred data messages will be bu ered separately and not be counted when calculating k .
The emergency state. If processor p runs out of memory, several actions can be taken before we declare its failure. One is to clear up all residual data messages and process messages, if there are any. Another is to redistribute the unprocessed process messages that have been previously placed in this processor. If any memory space can be released at this time, we will rescue the processor from its emergency state, and let the system continue on.
Performance
T h e P k ernel is currently written in MPL, running on a 16K-processor Maspar MP-1 with 32K bytes memory per processor. MPL is a C-based data parallel programming language. We h a ve tested the P kernel system using two sample programs: the N-queen problem and the GROMOS Molecular Dynamics program 38, 3 7 . GROMOS is a loosely synchronous problem. The test data is the bovine superoxide dismutase molecule SOD, which has 6968 atoms 28 . The cuto radius is prede ned to 8 A, 1 2 A, and 16 A.
The total execution time of a P kernel program consists of two parts, the time to execute the system program, T sys , and the time to execute the user program, T usr . T h e system e ciency is de ned as follows: sys = T usr T = T usr T usr + T sys ;
where T is the total execution time. The system e ciency of the example shown in Figure 7 is: sys = 0:4 3 0:4 3 + 0 :1 3 = 8 0 :
The system e ciency depends on the ratio of the system overhead to the grainsize of atomic computations. Table IV In the system phase, every processor participates in the global actions. On the other hand, in the user program phase, not all processors are involved in the execution of the selected atomic Figure 7 is 75, since three out of four processors are active.
The utilization e ciency is de ned as follows: util = P m t=1 utTt P m t=1 Tt = P m t=1 TtN active t N P m t=1 Tt where Tt is the execution time of tth iteration and P m t=1 Tt = T usr . The utilization e ciency depends on the computation selection strategy and the load balancing scheme. Table V shows the utilization e ciencies for di erent problem sizes with the priority auction algorithm and random placement load balancing. In irregular problems, the grainsizes of atomic computations may v ary substantially. The grainsize variation heavily depends on how irregular the problem is and how the program is partitioned. At e a c h iteration, the execution time of the user phase depends on the largest atomic computation among all active processors. The other processors that execute the atomic computation of smaller grainsizes will not fully occupy that time period. We use an index, called fullness, to measure the grainsize variation of atomic computation for each iteration: The fullness e ciency is de ned as: full = P m t=1 ftTtN active t P m t=1 TtN active t = P m t=1 P N active t k=1 T k t P m t=1 TtN active t Table VI shows the fullness e ciencies for di erent problem sizes. The N-queen is an extremely irregular problem and has a substantial grainsize variation and a low fullness e ciency. The GROMOS program is loosely synchronous and the grainsize variation is small. Now w e de ne the overall e ciency as follows:
The system e ciency can be increased by reducing system overhead or increasing the grainsize.
It is not realistic to expect a very low system overhead because the system overhead is dominated by communication overhead. The major technique of increasing system e ciency is the grainsize control. That is, the average grainsize of atomic computations should be much larger than the system overhead. The utilization e ciency depends on the computation selection and load balancing algorithms. The algorithms discussed in this paper deliver satisfactory performance, as long as the number of processes is much larger than the number of processors. The most di cult task is to reduce the grainsize variation which determines the fullness e ciency. The grainsize variation depends on the characteristics of the application problem. However, it is still possible to reduce the grainsize variation. The rst method is to select a proper algorithm. The second method is to select a good program partition. For a given application, there may be di erent algorithms to solve it and di erent partitioning patterns. Some of them may h a ve small grainsize variation and some may not. A carefully selected algorithm and partitioning pattern can result in a higher fullness e ciency.
The overall e ciencies of the N-queen problem and the GROMOS program on the 1K-processor MP-1 are shown in Table VII . Compared to the N-queen problem, GROMOS has a m uch higher e ciency mainly because of its small grainsize variation. Tables VIII and IX show the execution times and speedups of the N-queen problem and the GROMOS program, respectively. A high speedup has been obtained from the GROMOS program.
The N-queen is a di cult problem to solve, but a fairly good performance has been achieved. 
Concluding Remarks
The motivation of this research i s t wofold: to prove whether a SIMD machine can handle asynchronous application problems, serving as a general-purpose machine, and to study the feasibility of providing a truly portable parallel programming environment b e t ween SIMD and MIMD machines.
The rst version of the P kernel was written in CM Fortran, running on a 4K-processor TMC CM-2 in 1991. The performance was reported in 32 . Fortran is not the best language for system programs. We used the multiple dimension array with indirect addressing to implement queues.
Hence, not only was the indirect addressing extremely slow, but accessing di erent addresses in CM-2 costed much more 9 . In 1993, the P kernel was rewritten in MPL and ported to Maspar MP-1. The MPL version reduces the system overhead and improves the performance substantially.
Experimental results have s h o wn that the P kernel is able to balance load fairly well on SIMD machines for nonuniform applications. System overhead can be reduced to a minimum with granularity control.
