In this paper, we present the design and implementation of a multiprocessor simulator written in the language SimCal. We use the simulator to test our scheme to partition a sequential program for parallel execution on a shared memory, asynchronous multiprocessor. The results of the simulations indicate that our partitioning scheme can provide significant speed-up by executing the program in parallel. We then execute the partitioned program on an actual multiprocessor and find a high degree of correlation between the simulations and the actual executions. This correlation serves to validate our simulator. We then use the multiprocessor simulator to hypothetically extend the actual multiprocessor and we show that adding more processors will not provide significant improvement in the parallel executions unless the communication structure is also improved to contain more parallelism.
INTRODUCTION
Over the past decade or so, changes in technology have provided the possibility for vast increases in computational speed and power through the exploitation of parallelism in program execution. However, it has been difficult to test these new developments in parallelism on a target architecture since the architecture is often not available, or too expensive for the researcher to obtain.
One approach to solving the problem of unavailability of the target architecture is to use a simulator to capture the behavior of the architectural system. A problem with the use of simulators is the possibility that the simulations do not adequately capture the system, either because an important factor in the system has been omitted or because not all of the factors in the target system are adequately known. In this event, the simulations may allow a researcher to arrive at an erroneous conclusion about the power of the parallel system under scrutiny. Also, despite the researcher's care in capturing every detail of the target system, other researchers are sometimes sceptical about the reliability of the simulation approach.
We present the design of a multiprocessor simulator. We briefly discuss our techniques to partition a sequential program into threads for parallel execution on a shared memory, asynchronous multiprocessor. We then use the simulator to execute the threads for parameters that describe an actual multiprocessor system, the Data General AViiON [DataGeneral 1990]. The correlation that we obtain between the simulations and the actual executions verifies that the multiprocessor simulator captures the important factors of the multiprocessor system. Having validated the simulator, we use it to hypothetically extend the Data General AViiON to determine the degree of speed-up that might be obtained if the multiprocessor were configured differently, for example, by adding more processors to the AViiON.
The multiprocessor simulator that we present is coded in a simulation language, SimCal [Malloy 1986], that is based on Simula. Simula is a powerful, process-oriented simulation language that possesses a high degree of expressibility.
The paper is organized as follows. In section 2, we briefly describe SimCal, the simulation language that we use to implement the multiprocessor simulator. In section 3 we discuss the computational model that captures the important features of the multiprocessor system under study. We then briefly describe the parallel threads that are executed on the multiprocessor system, followed by the design and construction of the multiprocessor simulator. In section 4, we describe our validation of the simulator through the comparison of the results of the simulations with the results obtained by executing the threads on an actual multiprocessor, the Data General AViiON. Also, in section 4, we hypothetically extend the AViiON through the use of the simulator. Finally, in section 5 we draw conclusions.
DESCRIPTION OF SimCal
SimCal is a process-driven simulation language that is based on standard Pascal. The language is extended to directly incorporate simulation primitives designed to have essentially the same syntax and semantics as those found in Simula. Therefore, a SimCal user, knowledgeable in Pascal, need only consult previous work [Malloy 1990, Malloy 1986]. We have previously described the design and implementation of SimCal [Malloy 1990], and the reader interested in the preprocessor construction may consult this work. We do not discuss the SimCal design and implementation in this paper but rather we summarize the actions of the language primitives and demonstrate how they can be used to construct a simulation model.
The discussion in this section will facilitate our discussion of the multiprocessor simulator described in the next section.
Because SimCal is a process-driven simulation language, there are language facilities to support the creation and manipulation of processes. A system clock and an event list ordered by time are included in the language. Since it is essential to express the relationships among processes in a simulation, the Simula list facility is also included.
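To make the roles of the system clock and the time-ordered event list concrete, the following is a minimal sketch in Python rather than SimCal. It is not the SimCal implementation: the names EventList, activate, run, and worker are hypothetical, and Python generators merely stand in for processes, with a yielded number playing the part of a HOLD for that many time units.

```python
import heapq


class EventList:
    """Time-ordered event list with a system clock (a sketch, not SimCal itself)."""

    def __init__(self):
        self.clock = 0
        self._events = []  # heap of (wake_time, sequence, process)
        self._seq = 0      # tie-breaker: equal times resume in insertion order

    def activate(self, process, delay=0):
        """Schedule a process (a generator) to resume `delay` time units from now."""
        heapq.heappush(self._events, (self.clock + delay, self._seq, process))
        self._seq += 1

    def run(self):
        """Resume the earliest process repeatedly until the event list is empty."""
        while self._events:
            self.clock, _, process = heapq.heappop(self._events)
            try:
                delay = next(process)  # runs until the process yields a hold time
            except StopIteration:
                continue               # the process has terminated
            self.activate(process, delay)


def worker(name, hold_time, steps, log, events):
    """Hypothetical process: record the clock, then hold, `steps` times."""
    for _ in range(steps):
        log.append((events.clock, name))
        yield hold_time


def demo():
    events = EventList()
    log = []
    events.activate(worker("A", 3, 2, log, events))
    events.activate(worker("B", 2, 2, log, events))
    events.run()
    return log, events.clock
```

In this sketch the two processes interleave purely through the event list: each resumes only when its wake-up time is the earliest entry, which is the sense in which process-driven simulations execute processes in "parallel".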
A process in SimCal is represented by a special 
DESIGN OF THE MULTIPROCESSOR MODEL
In this section we begin by presenting the computational model that forms the basis for our target architecture. We then give a brief explanation of the technique used to partition a sequential program into threads for parallel execution on a shared memory, asynchronous multiprocessor.
Finally, we present the parametrized multiprocessor simulator that we construct from the computational model. This parametrized multiprocessor simulator is used to simulate execution of our parallel threads, constructed by partitioning a sequential program.
The Computational Model
In order for us to accurately evaluate the quality of the schedules that we produce, it is necessary that we be precise about certain aspects of the asynchronous processor system that we utilize.
In particular, we assume that such a system consists of p asynchronous identical processors, shared global memory modules, and a communication structure that allows processors to communicate with other processors or with the shared memory. An example of such a communication structure is a bus that typically allows a single processor to communicate values to memory. We assume that the system includes the standard primitives send and receive, which are used for the synchronization of processors. Because of the kind of synchronization required here (i.e., based on data dependencies), we assume that the send operation does not require the invoker to wait until a corresponding receive is executed [Dinning 1989].
In conjunction with the above system, we employ three parameters that, together, describe the "speed" of the architecture.
The first is a function F.(I) that returns the number of cycles required to execute instruction I.
[Malloy 1994]. The interested reader may consult our previous work [Malloy 1992a] for a discussion of partitioning and BW, as discussed in the previous section.
Since the simulator must actually execute the statements in the schedule, the second statement in the main program in Figure 4 initializes an Interpreter that executes the instructions in the schedule. After initializing the Interpreter, the simulator then reads the parallel schedule or threads, one thread for each processor. In executing the loop in the main program of the simulator, the CREATE primitive is used to instantiate p processors (cpu) and the ACTIVATE primitive is used to begin execution of thread_i in the respective processor cpu_i. In SimCal, as in Simula, the main program is itself a process and must not be allowed to terminate before any child processes terminate, since that would cause the entire program to terminate prematurely. Thus, the final simulation primitive executed in the main program is HOLD(50000), which inserts the main program at the end of the event list, allowing each cpu_i an opportunity to execute its respective thread. Execution will resume in main after all of the cpu_i have terminated. At that time, any statistics that may have been gathered during the simulation, such as the total time spent waiting to access the bus, may be output.
In addition to the main program of the simulator, a summary of the actions of each processor (cpu_i) is also illustrated.
Each cpu_i is itself a PROCESS that, through the use of the event list, can execute in "parallel".
To illustrate the actions of the multiprocessor simulator, we will now discuss the send primitive listed in the case statement in PROCESS cpu shown in Figure 4.
The first action of the send primitive is to "wait to access the bus" as described above. Having gained access, the next action of the send primitive is to increment busCount to update the number of processors currently using the bus. Then, the synchronization bit corresponding to the data value being communicated is set to indicate to the receiving process that the value is "ready".
Having "sent" the data, the next action in implementing the send is to HOLD for the number of cycles that are required in the send operation; this execution of the HOLD will update system time appropriately.
In the early stages of the multiprocessor simulations, we executed the HOLD primitive for the send operation for a number of cycles that we felt was reasonable. Later, as we will discuss in the next section, we conducted experiments on an actual multiprocessor to provide greater accuracy for our simulations and predictions. The final action of the send primitive is to decrement busCount to indicate that this processor is now relinquishing the bus. The actions of the other operations listed in PROCESS cpu are similar to the send primitive.
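The sequence of actions in the send primitive can be sketched as follows. This is an illustrative Python sketch, not the SimCal code of Figure 4. Several points are assumptions made only for illustration: the bus admits at most bus_width simultaneous users, a waiting processor polls the bus once per cycle, and the names Simulator, send, cpu, bus_wait, and ready are hypothetical. A yielded number stands in for HOLD, and run() plays the role that HOLD(50000) plays in the main program, letting every cpu process finish before statistics are reported.

```python
import heapq


class Simulator:
    """Clock, event list, and shared bus state (hypothetical names throughout)."""

    def __init__(self, bus_width):
        self.clock = 0
        self.bus_width = bus_width  # assumed: max processors on the bus at once
        self.bus_count = 0          # processors currently using the bus
        self.bus_wait = 0           # statistic: total cycles spent waiting for the bus
        self.ready = set()          # synchronization bits that have been set
        self._events = []
        self._seq = 0

    def activate(self, proc, delay=0):
        heapq.heappush(self._events, (self.clock + delay, self._seq, proc))
        self._seq += 1

    def run(self):
        while self._events:
            self.clock, _, proc = heapq.heappop(self._events)
            try:
                delay = next(proc)
            except StopIteration:
                continue
            self.activate(proc, delay)


def send(sim, value_id, send_cycles):
    # 1. Wait to access the bus (assumed here: poll once per cycle).
    while sim.bus_count >= sim.bus_width:
        sim.bus_wait += 1
        yield 1
    # 2. Record that one more processor is using the bus.
    sim.bus_count += 1
    # 3. Set the synchronization bit so a receiver sees the value as "ready".
    sim.ready.add(value_id)
    # 4. HOLD for the cost of the send, advancing system time.
    yield send_cycles
    # 5. Relinquish the bus.
    sim.bus_count -= 1


def cpu(sim, thread, send_cycles):
    """One processor executing its thread, a list of (operation, argument) pairs."""
    for op, arg in thread:
        if op == "send":
            yield from send(sim, arg, send_cycles)
        elif op == "compute":
            yield arg  # HOLD for the cycles the instruction requires


def demo():
    sim = Simulator(bus_width=1)
    sim.activate(cpu(sim, [("send", "x")], send_cycles=4))
    sim.activate(cpu(sim, [("send", "y")], send_cycles=4))
    sim.run()
    return sim.clock, sim.bus_wait, sorted(sim.ready), sim.bus_count
```

With a single-user bus, the second processor in demo() cannot begin its send until the first has relinquished the bus, so the two sends serialize and the waiting time accumulates in bus_wait.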
VALIDATION OF THE MULTIPROCESSOR SIMULATOR
In the previous section we presented the design of the parameterized multiprocessor simulator. In this section, we validate the simulator by using it to simulate the executions of schedules produced by our algorithm to partition sequential code into threads for parallel execution. This is achieved by supplying appropriate values for the parameters p, F.(I), FC, and BW, to the simulator that we constructed using SimCal [Malloy 1990].
Performance of the Partitioning Scheme on a Data General Multiprocessor
In order to determine the performance of our partitioning scheme on a "real" multiprocessor, we executed our parallel threads on a Data General AViiON shared memory multiprocessor system [DataGeneral 1990] equipped with a unibus communication structure and two identical processors. As we will show, we obtained an excellent correlation between these "actual executions" and the simulations, thereby validating our multiprocessor simulator. The send and receive primitives were implemented on the AViiON using spin-lock operations on UNIX shared variables [Bach 1986]. In order to obtain the parameters for our simulator, we first conducted a series of experiments to determine the average cost of the send and receive primitives and the cost of using the unibus communication structure. These experiments revealed that a send primitive requires approximately the same time to execute as a floating point multiplication, and that a receive primitive requires approximately twice as long as a floating point multiplication (provided, of course, that the receive does not have to wait). These values were utilized in setting the parameter F. for the simulation studies described below.
The results summarized in Table 1 indicate a very strong correlation between the simulation results and the actual executions on the Data General multiprocessor. In Table 1, the first column lists the programs used in the experiments, the next three columns report the results of the simulations and the last three columns report the results of the actual executions. For the simulations, the second and third columns express the number of cycles required to execute the test program on 1 and 2 processors respectively. For the actual executions, the fifth and sixth columns express the number of seconds required to execute the test program 10,000 times; these experiments were conducted 1000 times and the results reported are the averages.
As a particular instance, note that the simulation indicates that 54 cycles are required to execute the sequential code, and that 60 cycles are required to execute the schedule for 2 processors with a resulting speed-up of 0.90 over the sequential execution.
Note that a speed-up of less than one indicates that the parallel execution took longer than the sequential execution, assuming machines with the same architectural configuration. For the actual execution of the Fibonacci program on the Data General multiprocessor, an average of 0.23 seconds was required for 10,000 iterations using 1 processor and 0.25 seconds was required for 10,000 iterations using 2 processors, producing a speed-up of 0.88 over the sequential execution.
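The speed-up figures quoted here are the ratio of sequential execution time to parallel execution time. As a small illustration, the sketch below recomputes the simulated figure from the cycle counts given in the text; the function name speed_up is hypothetical.

```python
def speed_up(sequential_time, parallel_time):
    """Speed-up of a parallel execution relative to the sequential execution."""
    return sequential_time / parallel_time


# Simulated instance from the text: 54 cycles sequentially, 60 cycles on 2 processors.
simulated = speed_up(54, 60)

# A value below one means the parallel schedule was slower than the sequential code.
parallel_is_slower = simulated < 1
```

Any unit may be used for the two times, cycles for the simulations or seconds for the actual executions, provided both arguments use the same one.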
The similarities in speed-up between the simulation and actual execution results are established by comparing columns 4 and 7. With the exception of the Pyramid and Livermore programs, the difference between these speed-ups is never more than 0.06. This is a remarkably small difference, and certainly validates the use of the simulation approach in most instances.
The results also support the conclusion that our partitioning scheme is able to provide good speed-up for programs containing sufficient parallelism. Sufficient parallelism implies that the sequence of code being scheduled does not contain a large number of data dependencies and has enough parallelism to support all or most of the processors.
Since the Data General AViiON multiprocessor at our installation is equipped with only two processors, we are not able to evaluate the performance of the partitioning scheme for actual executions of schedules using more than two processors. However, simulations using parameters appropriate to the Data General machine produce the results shown in Table 2 for executions on 2, 3, 4, 8 and 16 processors. These results suggest that if the AViiON were to maintain its current configuration except for the addition of more processors, no significant speed-up would be achieved by using these additional processors. The main bottleneck in the system is the unibus communication structure.
CONCLUSIONS
We have reported our experiences in developing a multiprocessor simulator using the process-oriented language SimCal. We used the simulator to test our scheme to partition a sequential program for parallel execution on a shared memory, asynchronous multiprocessor. The results of the simulations indicate that our partitioning scheme can provide significant speed-up in the parallel execution of a program. We also executed the parallelized program on a Data General multiprocessor, where the speed-ups on the actual machine correlated very closely with the simulations. This correlation served as a validation for our simulations. We then used the simulator to hypothetically extend the Data General machine by adding more processors and a communication structure that provided more parallelism. We concluded that the parallel execution of the program would not achieve any significant speed-up simply by adding processors.
