Parallel simulation of parallel programs for large datasets has been shown to offer significant reduction in the execution time of many discrete event models. This paper describes the design and implementation of MPI-SIM, a library for the execution-driven parallel simulation of task and data parallel programs. MPI-SIM can be used to predict the performance of existing programs written using MPI for message passing, or written in UC, a data parallel language, and compiled to use message passing. The simulation models can be executed sequentially or in parallel. Parallel execution of the models is synchronized using a set of asynchronous conservative protocols. This paper demonstrates how protocol performance is improved by the use of application-level, runtime analysis. The analysis targets the communication patterns of the application. We show the application-level analysis for message passing and data parallel languages. We present the validation and performance results for the simulator for a set of applications that include the NAS Parallel Benchmark suite. The application-level optimization described in this paper yielded significant performance improvements in the simulation of parallel programs, and in some cases completely eliminated the synchronizations in the parallel execution of the simulation model.
Introduction
Direct-execution simulators make use of available system resources to directly execute portions of the application code and simulate architectural features that are of specific interest, or are unavailable. For instance, direct-execution simulators can be used to study various architectural components such as the memory subsystem or the interconnection network. Specifically, if an analyst is interested in determining if a faster communication fabric for a network of workstations is of value for a given set of applications, she can run the application on the currently available machines and only simulate the projected network's behavior. The benefits of this direct-execution simulation are obvious: first, one can estimate the value of the new hardware without the expense of purchasing it; second, one can do the simulation fast, because there is no need to simulate the workstation's behavior (for example, down to the level of memory references) since that part of the hardware is readily available.
Many of the early program simulators were designed for sequential execution [BDCW91, DGH91, CDJ+91]. However, even with the use of abstract models and direct execution, sequential program simulators tended to be slow, with slowdown factors ranging from 2 to 35 for each process in the simulated program [BDCW91]. Several recent efforts have been exploring the use of parallel execution [LW96, RHL+93, DHN94, PB95, CH96] to reduce the model execution times, with varying degrees of success. Our simulator, MPI-SIM, is capable of simulating a set of core MPI [GL93] functions such as non-blocking, synchronous or buffered sends and non-blocking receives. These are the building blocks of more complex point-to-point and collective communications. We have also developed a compiler to translate data parallel programs to the level of a message-passing SPMD program.
This research was supported in part by an ARPA/CSTO Award (No. F30602-94-C-0273) and by DARPA/ITO under Contract N66001-97-C-8533. Parts of this work were previously reported in [PB95] and [PB98].
In this paper, we describe a parallel simulator, which can model the behavior of parallel programs using conservative synchronization algorithms [Mis86]. The main contributions of this paper are as follows:
We address the problem of optimizing the simulation of data parallel and task parallel programs; most previous work has addressed the simulation of task parallel programs. We simulate data parallel programs by compiling them to their task parallel equivalents. We show that using compiler techniques that result in deterministic communication patterns, we are able to simulate these programs very efficiently. The knowledge of communication patterns of task parallel applications is exploited by the simulation protocols at runtime, resulting in significant performance improvements. Global synchronizations that are implicit in a data parallel program allow for even more comprehensive optimizations.
We apply the application-level optimization to the simulation of programs using conservative parallel simulation protocols. We demonstrate the significant reduction in the frequency and strength of synchronization in the parallel simulator. In some cases all synchronizations can be eliminated.
We present the results of an experimental study in which we simulate a number of data and task parallel applications, including programs from the NAS Parallel Benchmark Suite (NPB 2), on parallel architectures. The results show that the optimizations suggested in this paper can significantly reduce the synchronization overheads for the simulator. As a result MPI-SIM is able to achieve good speedup: about 12 for the 16 processor SP benchmarks and 14 for a 64 processor FFT data parallel code, in both cases using 16 processors for the simulation.
The rest of the paper is organized as follows. The next section describes related work in the area of direct-execution simulation and the conservative simulation protocols used by many simulators. Section 3 formally describes our simulation model and its application to the simulation of MPI programs. Section 4 details the runtime optimizations that can be performed by the simulator. It also demonstrates how the optimization can be combined with compile-time analysis for data parallel programs. The validation of the simulation model and the performance of the simulation protocol are presented in Section 5. Section 6 contains the conclusions and directions for future work.
2 Related Work
Simulation Systems
Unlike MPI-SIM, which is a simulator for data and task parallel applications, previous systems have focused solely on task parallel programs. Many such simulators use sequential or parallel implementations of the quantum protocol. In order to support multiple simulation processes (possibly executing on multiple processors) and maintain accuracy, parallel simulation protocols are used to synchronize the processes. The quantum protocol lets the processes compute for a given quantum before synchronizing them. In general, synchronous simulators that use the quantum protocol must trade off simulation accuracy against speed: frequent synchronizations slow down the simulation, but synchronizing less frequently introduces errors by possibly executing statements out of order.
Among the sequential simulators are Proteus [BDCW91], a parallel architecture simulation engine, and Tango [DGH91] and MINT [VF94], two shared memory architecture simulation engines. Parallel simulators include the Wisconsin Wind Tunnel (WWT) [RHL+93, MRF+97], a shared memory architecture simulation engine, and SimOS [RBDH97], a complete system simulator (multiple programs plus operating system). SimOS, which simulates the MIPS architecture, takes into account system details such as cache and CPU models as well as device drivers. It is possible to use the emulation mode, which in part uses direct execution to characterize the program execution. In the emulation mode, the simulation is still ten times slower than real time. The main drawback to SimOS is that it does not use any synchronization protocol when running multiple simulation processes on a parallel platform [RHWG95], thus reducing the accuracy of the simulations.
Although MPI-SIM is the only simulator that identifies communication patterns and directly exploits them for the purposes of synchronization, other simulators have used techniques to reduce the synchronization overhead. Among them are LAPSE [DHN94] and Parallel Proteus [LW96]. Both LAPSE and Parallel Proteus use some form of program analysis to increase the simulation window beyond a fixed quantum, without sacrificing accuracy. LAPSE is a parallel simulation engine for programs that use the message passing library of the Intel Paragon. It uses a quantum protocol called WHOA (Window-based Halting On Appointments) and runtime analysis to determine the size of the simulation quantum. An appointment is the earliest time the message can be placed in the network. Adding the latency of the network to the appointment time gives the earliest possible arrival for the message. Processes use the minimum of their (incoming) appointment times to determine whether a message can be processed or not.
Parallel Proteus is the parallelization of the Proteus simulation engine, a system designed to simulate message passing and shared memory access instructions. The synchronization overhead caused by frequent barriers of the quantum protocol is reduced using predictive barriers and local barriers. The predictive barriers method uses runtime and compile time analysis to determine, at the beginning of a simulation quantum, the earliest simulation time at which any process will send a message to any other process. Runtime analysis involves running a process until it communicates, then using analysis performed at compile time to predict when it would have sent a message if it were instantly resumed. The method of local barriers uses statically available communication topology information (i.e. groups of processes that communicate only within the groups they belong to) to reduce the global synchronization at the end of a simulation quantum to local synchronizations between groups of processes.
Our novel approach to synchronization used in MPI-SIM is to reduce blocking time at the receive statement based on an analysis of the communication pattern in the program. Specifically, each simulation process uses this analysis to locally identify whether an incoming application message is safe to process right away or whether synchronizations with other processes are necessary.
As will be demonstrated in Section 5, the optimization may result in simulations where no synchronization is necessary. This is more effective than the approach used in Parallel Proteus, where the number of synchronizations is bounded from below by the number of communications present in the program.
Other distinguishing features of MPI-SIM are:
It supports a variety of conservative simulation protocols, and can easily incorporate new protocols to allow the study of new synchronization techniques.
It is portable: to date, it has been ported to a variety of hardware platforms such as the Intel Paragon, the IBM SP, and the SGI Origin 2000 [BDDP99]. This distinguishes it from the original WWT, which used hardware components of the Thinking Machines CM-5 to aid in the simulation.
In contrast to LAPSE, which has been designed to simulate the vendor-specific message-passing system of the Intel Paragon, a now rarely used machine, it is geared towards the MPI message-passing library, which has become a standard for a variety of high performance machines such as the IBM SP and SGI Origin, as well as networks of workstations. This makes MPI-SIM a valuable tool for many application developers [DDH+98].
Simulation Protocols
In general, parallel applications are simulated by simulating individual processes of the program. Each process in the program is simulated by a simulation process, sometimes referred to as the Logical Process (LP). Synchronization between the LPs is performed by the simulation engine.
Quantum Protocol
MPI-SIM was designed to incorporate a variety of synchronization protocols. Among them is the quantum protocol. In the synchronous version of the protocol, each LP periodically simulates its corresponding process for a previously determined interval Q, termed the simulation quantum, and then executes a global barrier. The barriers are used to ensure that messages from remote LPs will be accepted in their correct timestamp order: an LP waiting at a receive will accept a matching message from its buffer only if the receive timestamp of the message is less than the simulation time at which the current quantum terminates. If more than one such message is present, the LP will select the one with the earliest timestamp; if no such messages are present, the LP remains blocked, and its simulation time is updated to the end of the current quantum. The size of the quantum is picked to be less than the communication latency of the target architecture. MPI-SIM also supports a variety of conservative protocols [CM79]. Such protocols allow an LP to process an event only when it is safe to process it. An event with timestamp t is safe if there is no possibility that an event with a timestamp less than t can arrive. Conservative protocols might result in poor performance due to their relatively frequent synchronization demands. However, we will demonstrate that we can improve the performance of these protocols with the use of application-level analysis. The following synchronization algorithms have been implemented in MPI-SIM: the null message protocol, the conditional event protocol and the accelerated null message protocol. The protocols are defined using the following terms for each LP in a simulation model [BMT+98]:
Earliest Input Time (EIT): The EIT of an LP is defined to be the lower bound on the timestamp of all future messages that may be received by it.
Earliest Output Time (EOT): The EOT of an LP is defined to be the lower bound on the timestamp of all future messages that may be sent by the LP.
Earliest Conditional Output Time (ECOT): The ECOT of an LP is defined to be the lower bound on the timestamp of all future messages that may be sent by the LP, assuming that it will not receive any further messages; in other words, ECOT is computed as its EOT, assuming its EIT is infinity.
The synchronization algorithm used by the conservative simulations essentially computes the EIT for each LP in the model. We use the term safe message to refer to a message that is stored in the incoming message buffer (inqueue) of the LP and has a timestamp that is smaller than the EIT of the LP. The efficiency of a parallel simulation model is directly related to the efficiency with which it can advance its EIT. In the best case, if a message can be identified to be safe as soon as it arrives at an LP, the overhead of parallel simulation will be small.
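The safe-message test above can be sketched in a few lines. This is a minimal illustration and not MPI-SIM code: the `Message` class, its fields, and the `safe_messages` helper are hypothetical names introduced here.

```python
from dataclasses import dataclass


@dataclass
class Message:
    sender: int      # id of the sending LP (hypothetical field)
    recv_ts: float   # predicted receive timestamp in simulation time


def safe_messages(inqueue, eit):
    """Return buffered messages with a timestamp smaller than the EIT, earliest first."""
    return sorted((m for m in inqueue if m.recv_ts < eit), key=lambda m: m.recv_ts)


# Three buffered messages; only those stamped before the EIT of 15.0 are safe.
inqueue = [Message(1, 12.0), Message(2, 7.5), Message(3, 20.0)]
safe = safe_messages(inqueue, eit=15.0)
```

With these values the messages from LPs 2 and 1 are safe, in that order, while the message stamped 20.0 must wait until the EIT advances.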
Null Message Protocol
In the null message protocol [Mis86], each LP periodically computes its EOT and sends this value to all LPs in the model. The EOT may be transmitted aggressively or using a demand-driven scheme. EOT values can be piggybacked on regular messages or sent as special null messages. An LP updates its EIT to be the minimum of the incoming EOTs. It is generally more efficient to have each LP maintain its communication topology [BtL94]: the source set of an LP is the set of LPs from which it can receive messages, and the destination set of an LP is the set of LPs to which it can send messages. An LP needs only to communicate with other LPs in its source and destination sets to advance its EIT. The difference between the EOT and current simulation time of a process is referred to as its lookahead [Fuj88].
In our model of a parallel program, each LP executes asynchronously until it reaches a receive statement. At this point, if it does not have any safe messages, it demands an updated EOT from all LPs in its source set. On receiving a request, an LP must send a new EOT. In the worst case, the new EOT of an LP may simply be S+L, where S is the current simulation time of the LP and L is the minimum message communication latency for the target architecture. On receiving responses to its request, if the EIT of an LP advances sufficiently to identify a safe message, the LP proceeds with its simulation; if not, it initiates another round of EOT updates. If communication is relatively infrequent in a target program, this algorithm may perform reasonably well. However, in general, this algorithm has been shown to have poor performance [LW96] because the lookahead of an LP is poor and might result in many rounds of null messages being sent before an LP is able to proceed. The conditional event protocol offers a simpler solution to that problem.
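One round of the demand-driven exchange can be sketched as follows, assuming every source LP answers with its worst-case EOT of S+L. The function names and the sample clock values are illustrative assumptions, not part of MPI-SIM.

```python
INF = float("inf")


def worst_case_eot(sim_time, min_latency):
    """Worst-case EOT of an LP: current simulation time S plus minimum latency L."""
    return sim_time + min_latency


def eit_from_eots(eots):
    """The EIT is the minimum of the EOTs reported by the LPs in the source set."""
    return min(eots, default=INF)


# A blocked LP queries its source set (hypothetical LPs 1 and 2).
source_clocks = {1: 4.0, 2: 6.0}   # current simulation time S of each source LP
L = 1.0                            # minimum message latency of the target machine
eots = [worst_case_eot(s, L) for s in source_clocks.values()]
eit = eit_from_eots(eots)

# Buffered messages with a timestamp below the new EIT are now safe.
buffered = [4.5, 6.0]
safe = [ts for ts in buffered if ts < eit]
```

Here the EIT advances only to 5.0, so the message stamped 6.0 stays blocked and another round is needed once the slower source LP makes progress. This is the low-lookahead behavior that makes the protocol expensive.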
The Conditional Event Protocol
The conditional event protocol (CEP) [CS89] identifies the earliest event in a simulation model by using global information. It calculates a global lower bound on the receive timestamp of the next message that will be received by any LP. In order to be able to compute EIT it is sufficient to calculate the minimum ECOT of all LPs, and the minimum receive timestamp of all messages in transit. The conditional event protocol can be made demand driven by an LP sending its effective ECOTs to other LPs only if it reaches a receive statement and needs to compute its EIT, or if it gets an effective ECOT from another LP. In this way, the protocol gets switched on only if at least one LP is at a receive statement, and automatically gets switched off when all LPs complete their receives. The primary drawback with this algorithm is that, in the worst case, a separate global computation may be required to identify each safe event.
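The global bound described above reduces to a single minimum. The sketch below computes it centrally for clarity; how MPI-SIM distributes this reduction across LPs is not specified here, and the function name and sample values are assumptions.

```python
INF = float("inf")


def conditional_event_bound(ecots, in_transit_recv_ts):
    """Lower bound on the receive timestamp of the next message received by any LP.

    ecots: ECOT of every LP in the model.
    in_transit_recv_ts: receive timestamps of messages sent but not yet delivered.
    """
    return min(list(ecots) + list(in_transit_recv_ts), default=INF)


# LP 2 will send nothing further unless it first receives a message (ECOT = infinity).
ecots = {0: 30.0, 1: 18.0, 2: INF}
in_transit = [12.5, 40.0]
bound = conditional_event_bound(ecots.values(), in_transit)
```

In this example the in-transit message stamped 12.5 determines the bound, so every LP may treat 12.5 as a valid EIT.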
Accelerated Null Message Protocol
The Accelerated Null Message protocol (ANP) used by our simulator uses the CEP to advance the EIT of an LP rapidly in situations where a low lookahead might otherwise require many rounds of null message transmissions [BMT+98]. Using this protocol, the EIT of an LP is calculated as the maximum of the EIT reported by the null message and conditional event algorithms. The ANP can lead to significant reductions in the number of synchronization messages as described in Section 5.
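Because both algorithms report a valid lower bound on future message timestamps, the larger of the two is also a valid bound and is the tighter one. A one-function sketch, with a hypothetical name and sample values:

```python
def anp_eit(null_message_eit, conditional_event_eit):
    """EIT under the ANP: the larger of the two lower bounds."""
    return max(null_message_eit, conditional_event_eit)


# Null messages have only advanced the EIT to 5.0, but the global
# conditional-event computation shows nothing can arrive before 12.5.
eit = anp_eit(5.0, 12.5)
```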
MPI-SIM

Parallel Program Model
The following, standard, terms are used in this paper:
Target Program: The message passing program whose performance is to be predicted.
Target Machine: The machine on which the target program executes.
Host Machine: The machine on which the simulator executes.
We assume that the target program uses a shared nothing programming model and contains three types of statements: local code blocks (LCBs), send statements send(receiver id, tag), and receive statements receive(sender id, tag). An LCB is the sequence of local statements executed between two message passing statements. The send and receive statements provide a buffered communication capability where the send deposits a message in the message buffer at the named destination and the receive statement removes messages from its local buffer. We assume that the receive statement can select specific messages from its buffer based on message attributes such as sender id or tag. Here, we do not postulate a precise syntax or semantics for the preceding statement types. In subsequent sections we will apply this model to MPI.
We assume that the target program executes on the target machine as one process per processor with no dynamic process creation or termination. Messages from one process to another are assumed to arrive in FIFO order. At any point in its execution, a process in the target program may be in one of two states: running or blocked. A process is blocked if it has executed a receive statement and no matching message is available; otherwise it is said to be running. The model assumes that the execution times of local statements and the communication times of messages are non-deterministic. Note that each local statement is deterministic; however, due to interrupts, cache behavior, etc., its execution time may not be.
The goal of the simulator is to predict the execution time of a target program on a target architecture for given program inputs. Each process in the target program is modeled by a logical process. Each LP has a message queue inqueue and simulation clock, clock. The message queue is used to store incoming messages as they arrive at the corresponding LP. The simulation model of a parallel program contains three types of events: local events that correspond to execution of an LCB in the target program, and send and receive events that respectively correspond to the execution of a send or receive statement in the target program. Each of the preceding events is simulated as follows:
1. local event: The most common method for simulation of a local event is by direct execution: the LP executes the LCB on the host machine, measuring its duration (t), and advancing its clock by t. For the runtime measurement to have a reasonable degree of accuracy, the host and target processors must be the same (or an appropriate scaling factor must be determined). Even when the host and target processors are the same, other sources of error remain; these are discussed in Section 5.
2. For a send statement, lp_i computes the communication latency (l) for the message (or messages, if the send is a collective operation) using a model of the interconnection network. The message is timestamped with the send time (which is simply the value of clock_i when the send statement is executed) and its predicted receive time. We use a simple contention-free model to predict the communication latency of a message. In this model, the latency of a message is a function only of its size. This simple model yields good results for a variety of applications [PHN96]. The results presented in this paper also support this assumption.
3. For a receive statement, lp_i uses a simulation protocol to remove messages from its inqueue in their simulation timestamp order rather than the order in which messages are physically deposited in its inqueue. When lp_i accepts a message from its buffer, clock_i is set to the larger of the simulation timestamp of the receive statement and the receive timestamp of the accepted message.
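The three event types can be sketched as clock updates on a logical process. This is a minimal sketch under stated assumptions: the linear `latency` function and its constants are hypothetical (the text only says latency depends on message size), the `scale` parameter stands in for the host-to-target scaling factor, and `receive_event` assumes the synchronization protocol has already established that the earliest buffered message is safe.

```python
import time
from dataclasses import dataclass, field


@dataclass
class Message:
    send_ts: float   # value of the sender's clock at the send statement
    recv_ts: float   # predicted receive time: send_ts + latency
    size: int        # message size in bytes


def latency(size, startup=50e-6, per_byte=10e-9):
    """Hypothetical contention-free model: latency depends only on message size."""
    return startup + per_byte * size


@dataclass
class LP:
    clock: float = 0.0
    inqueue: list = field(default_factory=list)

    def local_event(self, lcb, scale=1.0):
        """Directly execute a local code block and advance the clock by its duration."""
        start = time.perf_counter()
        lcb()
        self.clock += scale * (time.perf_counter() - start)

    def send_event(self, dest, size):
        """Timestamp the message with its send time and predicted receive time."""
        msg = Message(self.clock, self.clock + latency(size), size)
        dest.inqueue.append(msg)
        return msg

    def receive_event(self):
        """Accept the earliest-timestamped message (assumed safe) and update the clock."""
        msg = min(self.inqueue, key=lambda m: m.recv_ts)
        self.inqueue.remove(msg)
        self.clock = max(self.clock, msg.recv_ts)
        return msg


sender, receiver = LP(clock=1.0), LP(clock=0.5)
sender.local_event(lambda: sum(range(1000)))     # measured LCB
msg = sender.send_event(receiver, size=1000)
receiver.receive_event()                          # receiver.clock jumps to msg.recv_ts
```

Taking the maximum in `receive_event` covers both cases: a receiver that waited for the message advances to its arrival time, while a receiver that posted the receive late keeps its own clock.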
MPI Simulation Model
MPI [For93] is a message passing library which offers a host of point-to-point and collective interprocess communication functions to a set of single threaded processes executing in parallel. All communication is performed using a communicator, which is simply an identifier associated with a group of processes. Only member processes may use a given communicator. We have developed a simulator for MPI programs that can simulate a substantial subset of MPI applications including most of the commonly used MPI functions.
The simulator can be used to simulate unmodified MPI programs. Each program is first passed through a preprocessor that implements necessary transformations as explained next. The preprocessor also replaces all MPI calls by equivalent calls to corresponding routines in the simulator. The simulator does not directly simulate every MPI call. Consistent with MPI implementations, such as MPICH [mpi], all point-to-point calls are translated in terms of simple non-blocking buffered and non-buffered sends, non-blocking receives and waits for operation completion. For example, a blocking send is simulated as a non-blocking send followed by a wait for send completion. All collective communication functions are first translated by the simulator in terms of point-to-point communication functions as depicted in Figure 1. Note that the translation of collective communication functions in the simulator must be identical to how they are implemented on the target architecture. The remainder of this section describes a library-based facility to simulate MPI programs.
Preprocessing MPI Programs
MPI programs execute as a collection of single threaded processes, and, in general, the host machine will have fewer processors than the target machine. This requires that the simulator support multithreaded execution of MPI programs. We have developed MPI-LITE, a portable library for multithreaded execution of MPI programs, for this purpose. Execution of an existing MPI program as a multithreaded program requires that the permanent variables (global variables and static variables within functions) be handled separately. If the unmodified MPI program is executed as a multithreaded program, all threads on a given host process will access a single copy of each permanent variable. To prevent this, it is necessary to localize the permanent variable such that each thread has a separate copy. Each permanent variable is redeclared with an additional dimension whose size is equal to the maximum number of threads in a host process. Each reference to the permanent variable is also modified such that each thread uses its id to access its own copy of the permanent variable. This process of adding an additional dimension to the permanent variables is referred to as localization. A preprocessor is provided with MPI-SIM to support the automatic transformation of an MPI program to an equivalent program that is compatible with the simulator. The preprocessor localizes permanent variables, converts each call to an MPI function to an equivalent MPI-LITE function, and implements miscellaneous transformations needed to link the target program with the simulator routines. Note that the programmer is not required to make any manual changes to the program to ensure compatibility.
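The effect of localization can be illustrated with a small sketch. The preprocessor described above rewrites the source of the target MPI program; the Python below only mimics the result of that rewrite, and the names `MAX_THREADS`, `calls`, `record_call` and `target_main` are invented for illustration.

```python
import threading

MAX_THREADS = 4

# Before localization the target program would hold one shared global, e.g.
#     calls = 0
# After localization it gains an extra dimension, one slot per thread.
calls = [0] * MAX_THREADS


def record_call(thread_id):
    """Every reference is rewritten to index the variable by the thread's id."""
    calls[thread_id] += 1


def target_main(thread_id, n):
    """Stand-in for one simulated MPI process running as a thread."""
    for _ in range(n):
        record_call(thread_id)


threads = [threading.Thread(target=target_main, args=(tid, tid + 1))
           for tid in range(MAX_THREADS)]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

Since each thread touches only its own slot, the threads cannot interfere with one another through the permanent variable, and `calls` ends up holding an independent count per simulated process.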
Simulation Model for Core Functions
The simulation model is very similar to the one presented earlier for simple message passing programs. The only differences are (a) at each LP, there is a message queue for each communicator of which the LP is a member, and (b) at each LP, there is an ordered list (ordered by simulation timestamp) of the pending (send and receive) operations for the LP; this list is referred to as the request list. Each LP executes the target program, and takes the following actions upon encountering any of the four core calls:
1. MPI_Issend: The message (with source, destination, tag, communicator and data) is sent to the receiver LP. It is timestamped with the send timestamp, which is the current simulation time of the LP, and the receive timestamp, which is the send timestamp plus the predicted message latency. A corresponding request is queued at the end of the LP's request list.
2. MPI_Ibsend: The same procedure is followed as for MPI_Issend, except for three differences:
(a) Initially, a buffer availability check is performed. A buffer availability check reserves space for the message in the user-provided buffer area. Note that in the simulation model, this space does not physically exist; a data structure is used to indicate the portion of the buffer that is currently occupied. If no space is available at that simulation time, the simulation is completed with the report that the program would have aborted at that point due to lack of buffer space.
(b) While calculating the receive timestamp of the message, the predicted message latency accounts for the additional copying that would occur in a buffered send.
(c) The request queued at the request list is marked as a request from a buffered send.
3. MPI_Irecv: A request is queued at the request list. It is marked as a request to receive a message. The source, tag, and communicator of the receive are included in the request, as is the pointer to the buffer where the accepted message should be deposited.
4. MPI_Wait: The action taken depends on the type of operation, which is indicated by the request id:
(a) MPI_Irecv request: The LP is blocked until a matched message is available. At this point, the LP's simulation clock is updated to the maximum of the current simulation time and the receive timestamp of the matching message. An acknowledgment is sent to the sender, and the LP is resumed.
(b) MPI_Issend request: The synchronous send completes only when the corresponding acknowledgment has been received from the destination. At this time, the simulation time of the LP is updated to the maximum of the current simulation time and the receive timestamp of the acknowledgment. The LP is blocked until the send is completed.
(c) MPI_Ibsend request: For a buffered send, the LP is not blocked. The corresponding request stays in the request list until it is satisfied, at which simulation time the corresponding buffer space in the user-provided buffer area is released.
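The timestamp and clock rules for the four core calls can be summarized in a short sketch. All names below are hypothetical, and modeling the extra copying of a buffered send as an additive `copy_overhead` term is an assumption; the text only states that the predicted latency accounts for it.

```python
class BufferExhausted(Exception):
    """Raised when a buffered send finds no room at the current simulation time."""


class SimulatedBuffer:
    """Bookkeeping only: no real storage is allocated for the user buffer area."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.used = 0

    def reserve(self, size):
        """Buffer availability check performed at a buffered send."""
        if self.used + size > self.capacity:
            raise BufferExhausted(f"need {size} bytes, {self.capacity - self.used} free")
        self.used += size

    def release(self, size):
        """Called when the buffered-send request is satisfied."""
        self.used -= size


def receive_timestamp(send_ts, latency, copy_overhead=0.0):
    """Send timestamp plus predicted latency (plus copying cost for a buffered send)."""
    return send_ts + latency + copy_overhead


def clock_after_wait(kind, clock, completion_ts=None):
    """Simulation clock of the LP once a wait on the given request type returns.

    completion_ts is the receive timestamp of the matching message ("irecv")
    or of the acknowledgment ("issend"); a buffered send never blocks the LP.
    """
    if kind in ("irecv", "issend"):
        return max(clock, completion_ts)
    if kind == "ibsend":
        return clock
    raise ValueError(f"unknown request kind: {kind}")


buf = SimulatedBuffer(capacity=100)
buf.reserve(60)                      # first buffered send fits
ts = receive_timestamp(2.0, 0.5, copy_overhead=0.25)
```

A second `buf.reserve(50)` at this point would raise `BufferExhausted`, which corresponds to the simulator reporting that the target program would have aborted for lack of buffer space.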
The simulation protocol used to synchronize the simulation models must ensure that MPI's in-order delivery rules are obeyed (in simulation time) while matching arriving messages with the request list of an LP. Matching acknowledgments with their corresponding requests requires no such effort by the simulation protocol, simply because there is only one matching request for each acknowledgment.
In the simulation model, each LP executes without synchronizing with other LPs until it gets blocked on a specific request. If the LP is in non-deterministic mode, it sends a request for EOT to all processes in its source set as explained previously. The simulator computes the EIT of a process on a per-communicator basis to reduce synchronization costs. In other words, if an LP belongs to more than one communicator, it will define separate source sets and maintain a separate EIT to identify safe messages within each communicator. An LP in the deterministic mode simply waits until a matching message is available in its inqueue, at which point it accepts the message and continues asynchronously with its execution.
Protocol Optimizations
Our optimizations are geared towards extracting determinism from applications. In particular, we have focused on the communication portion of the application. When a process posts a receive statement, that statement can be deterministic or not. A receive statement is said to be deterministic if the program contains a unique message that matches every execution of the receive statement. In general, the receive statements in a process are non-deterministic. Being able to infer whether the receive is deterministic can yield performance benefits, because the need for synchronization is reduced. Specifically, a blocked LP requests EOT or initiates minimum ECOT computation when it does not have safe messages. However, if an LP (lp_i) is blocked at a deterministic receive statement, it can identify safe messages locally. Since there is exactly one message that can match a deterministic receive, it is only necessary for the process to wait until it receives a matching message. As soon as the message is received, it is known to be safe; no null messages are necessary! Of course, if another LP in the model is blocked on a non-deterministic receive, lp_i must still respond to requests for EOT and ECOT updates, but it will not initiate any such requests, thus reducing the overall traffic and blocking time in the model. Clearly, if no LP in the model is blocked on a non-deterministic receive statement, no synchronization messages will be generated in the model and the parallel simulation can be extremely efficient. Our goal is to automatically or manually identify as many deterministic receive statements in the simulation model as possible.
Message passing programs are often programmed with some mix of determinism and non-determinism. Although some semantic analysis of the application can yield a reduction of non-determinism in the code, our approach is to exploit the determinism that is readily available. For example, in MPI programs, when a receive statement specifies a source processor, the receive statement will be deterministic. We also recognize that a big improvement can be made for data parallel programs at the compiler level, because the compiler has the necessary knowledge to extract the deterministic behavior.
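The decision an LP makes when it blocks at a receive can be sketched as follows. This is an illustrative sketch: `ANY_SOURCE` is a stand-in for MPI's wildcard source (its real value is implementation-defined), the classifier applies only the simple rule stated above, and the action names are invented.

```python
ANY_SOURCE = -1  # stand-in for MPI_ANY_SOURCE; the real value is implementation-defined


def is_deterministic(source):
    """Simple rule from the text: a receive that names its source is deterministic."""
    return source != ANY_SOURCE


def blocked_receive_action(source, matching_ts, eit):
    """Decide what a blocked LP does next.

    matching_ts: receive timestamps of buffered messages matching the receive.
    eit: the LP's current Earliest Input Time.
    """
    if is_deterministic(source):
        # The matching message is safe on arrival; never initiate synchronization.
        return "accept" if matching_ts else "wait"
    if any(ts < eit for ts in matching_ts):
        return "accept"
    # No safe message: request EOT updates (or start a minimum ECOT computation).
    return "request_eot"


action = blocked_receive_action(ANY_SOURCE, matching_ts=[9.0], eit=5.0)
```

In the example a wildcard receive holds a matching message stamped 9.0 but the EIT is only 5.0, so the LP must synchronize. The same message at a receive naming its source would be accepted immediately.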
Extracting Determinism in Data Parallel Programs
Most data parallel compilers for distributed platforms translate the data parallel program into a message passing SPMD program which uses MPI for message passing. The operations in data parallel applications (written, for example, in HPF [BCF+94]) that lead to communication and synchronization in the corresponding SPMD program may be classified into the following categories:
Data Distribution: These are operations which specify the placement and alignment of data (relative to other data or to some template) over a set of processors. Familiar HPF primitives which perform these functions are align and realign for relative alignment of data, and distribute and redistribute to partition aggregate data among processor memories.
Parallel data assignment: These allow parallel operations on sections of arrays with the same shape. In HPF such operations occur in statements like the forall, independent, and where statements, and in array assignment statements.
Parallel data combination: Data combination occurs in operations like reductions which generate a single value from aggregate data. Other operations include prefix, suffix and combining scatter. In HPF, these operations occur as intrinsic operations.

It is generally possible to compile data parallel programs such that message communications in the resulting SPMD program are deterministic. The data distributions and reductions almost always generate deterministic communications, and most commonly used forms of parallel assignments also generate deterministic communications.
For example, assume that an array a is mapped onto processors using a simple block mapping, where a[i] is stored on (or owned by) processor i. Now, we want to perform the parallel assignment a[i] = a[(i+1)%8]. In a message passing program, this can be translated into a simple send/receive pair (see Figure 2). Clearly, the message communication in the translated program is deterministic, and the form of determinism is such that it can easily be recognized automatically by the simulator.

[...] each processor knows which element of a it needs. We present two possible compilations for this fragment. The first method is perhaps less efficient but produces deterministic code: broadcast the entire array a and allow each process to select the specific element that it needs (Figure 3, lines 1-6). Broadcasting the entire array requires a sequence of broadcasts, where each process, in turn, transmits the elements that it owns to all other processes. A non-deterministic alternative is shown in lines 7-15 of Figure 3, where each process asynchronously sends its local elements to all other processes, receives the elements in a non-deterministic order, and selects the one that it needs.
Other implementation alternatives are likely for the preceding code fragments, and it is not appropriate to assume that compilers for all data parallel programs always generate deterministic code. However, data parallel code fragments with unpredictable communication patterns (like the program in Figure 3) are relatively uncommon and could, in any case, be compiled using deterministic communications for simplicity. Our simulator is designed to handle the non-deterministic case, and to exploit the deterministic case when it can be detected.
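To make the deterministic translation of the shift assignment a[i] = a[(i+1)%8] concrete, the following sketch simulates the send/receive pair in plain Python rather than MPI. It is an illustration under stated assumptions, not the code of Figure 2: the per-process mailboxes and the function name `shift_left` are hypothetical, and one array element per process is assumed.

```python
from collections import deque


def shift_left(a):
    """Simulate a[i] = a[(i+1) % P] as one send/receive pair per process."""
    p = len(a)                                # one element per process
    mailboxes = [deque() for _ in range(p)]   # one inbox per process

    # Send phase: process i owns a[i]; process (i-1) % p is the one that needs it.
    for i in range(p):
        dest = (i - 1) % p
        mailboxes[dest].append((i, a[i]))     # (source rank, value)

    # Receive phase: process i names its source explicitly, so exactly one
    # message can match the receive.
    result = list(a)
    for i in range(p):
        expected_source = (i + 1) % p
        source, value = mailboxes[i].popleft()
        assert source == expected_source
        result[i] = value
    return result
```

Because each inbox holds exactly one message and its sender is known in advance, a simulator can treat the arriving message as safe without any protocol traffic.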
Further Optimizations for Data Parallel Programs
It is also possible to optimize synchronizations when determinism cannot be readily detected. Assume that a given receive is not deterministic, in that more than one message in the system can potentially match the receive statement executed by the LP. For example, the MPI receive statement in line 14 of the translated code in Figure 3 is not deterministic, as it can match messages sent by multiple processes. Even for such fragments, it is possible to exploit knowledge of the communication patterns in the target application to avoid a global barrier. The compiler for the data parallel language can easily deduce that exactly 7 (in the case of 8 processors) messages will be received by each process in the translated code; we refer to this as the expected-message-set.
An LP may use the expected-message-set to further reduce its synchronization overheads in one of two ways. First, the LP need not initiate EIT computations to determine that a matching message is safe. Instead, it may simply wait until all messages in the expected-message-set have been received, at which point it sets its EIT to the maximum of the timestamps of all messages in this set. A second alternative is to have the LP initiate the standard EIT computation phase by sending EOT requests to the LPs in its source set and identifying safe messages. At the same time, it also monitors its expected-message-set and, as soon as the set is complete, advances its EIT as explained above. For data parallel programs in particular, the compiler may compute the expected-message-set on those occasions when it cannot generate deterministic object code. Note that the expected-message-set can always be defined for every LP: in the default case, it consists of a null message from every LP in the source set. Thus incorporating this optimization in the protocol need not introduce additional overhead.
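The first alternative described above (wait for the whole expected-message-set, then set the EIT to the maximum timestamp in the set) can be sketched as below. The class and method names are hypothetical, and the set is assumed to be identified by sender rank with one message per sender, as in the 8-processor example where each process expects 7 messages.

```python
class ExpectedMessageSet:
    """Tracks the messages an LP expects at a non-deterministic receive."""

    def __init__(self, expected_sources):
        self.pending = set(expected_sources)  # senders not yet heard from
        self.timestamps = []                  # timestamps of received messages

    def deliver(self, source, timestamp):
        """Record one message; return the new EIT once the set is complete, else None."""
        if source not in self.pending:
            raise ValueError(f"unexpected or duplicate message from {source}")
        self.pending.remove(source)
        self.timestamps.append(timestamp)
        if self.pending:
            return None                       # still waiting for more messages
        # Set complete: EIT becomes the maximum timestamp in the set.
        return max(self.timestamps)
```

For example, an LP with rank 0 on 8 processors would construct `ExpectedMessageSet(range(1, 8))` and call `deliver` for each arrival, in whatever order the messages come in.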
Results
First, we establish what it means for a parallel program simulator to be accurate. We must identify a unique timed program trace (a timestamped program trace) that the simulator is required to reproduce. The non-determinism in the execution of the target program is due to variances in its execution environment which typically arise from sharing the computing resources with other (user and system) programs. If the execution environment for the program can be held constant, a unique timed trace can be used to characterize the execution of the program on a given architecture. We propose the following two properties for a simulator:
Reproducibility Assumption: Stand-alone execution of a program on a machine yields a unique timed program trace (see footnote 1). This trace is referred to as the ideal target trace.
Fidelity Assumption: The simulator can precisely predict the execution time of every LCB and the communication latency of every message in the target program.

Based on the preceding properties, we define what is meant by an accurate simulator as follows: A simulator is said to be accurate if it can reproduce the ideal target trace under the reproducibility and fidelity assumptions.
Asynchronous PDES protocols, such as the ones we use in our simulator, guarantee that each LP will (eventually) process all events in the strict order of their global timestamps. It follows that, subject to the reproducibility and fidelity assumptions, such a simulator will be accurate. As all events are executed in the order of their timestamps, any inaccuracies in the predictions will be due only to the degree of accuracy to which a given component, like the interconnection network, is modeled in the simulation.
Broadly speaking, the simulator is composed of two parts. One simulates the local code (via direct execution), and the other simulates the communications. Hence, there are two major sources of error present in the simulator:
1. The runtime measurements to predict the execution time of local code blocks may be inaccurate due to a number of factors: first, the simulator causes additional code to be inserted in the target program that is simulated. Even though the time to execute this additional code can easily be excluded from the measurements, the inclusion of this code can have indirect effects whose impact on the measurements may be hard to isolate. For instance, insertion of the simulator will typically affect the cache behavior and register allocation. The only way to exclude these indirect effects is to use a detailed simulation model for local code blocks, as implemented in simulators like SimOS.
2. We use a simple contention-free model of the communication protocol. The model assumes a fixed transmission delay based on the message size, for each message, and hence does not account for the software and hardware queuing delays.
Footnote 1: Of course, stand-alone execution of a program does not guarantee that two executions will yield exactly the same (timed) program trace. However, our goal here is to select a reasonable execution environment that can be reproduced for the program and for its simulation; other execution environments (including synthetic ones) may also be used for this purpose.
The contribution from either of these factors is hard to estimate, because it is application dependent. For compute intensive applications, the error due to local code block execution might be more significant than the errors due to the communication modeling. However, as will be seen from the following validation figures (Figures 4 and 5), the estimates of the runtime of the application are very similar when it is simulated on a varying number of host processors. This indicates that the degree of multithreading (of the simulation threads) in the simulator does not affect the predictive ability of the simulator. This allows the simulator to be used in application scalability studies, where the performance of the application is projected as a function of system characteristics such as the number of processors.
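The contention-free communication model named as the second error source can be illustrated with a short sketch. The text only states that the delay is fixed for a given message size; the linear form used here (a constant latency plus size divided by bandwidth) is an assumption made for illustration, as are the function and parameter names.

```python
def transmission_delay(message_bytes, latency_s, bandwidth_bytes_per_s):
    """Delay of one message under a contention-free model (assumed linear form)."""
    return latency_s + message_bytes / bandwidth_bytes_per_s


def arrival_time(send_time_s, message_bytes, latency_s, bandwidth_bytes_per_s):
    """Arrival timestamp; independent of any other traffic in the network."""
    return send_time_s + transmission_delay(
        message_bytes, latency_s, bandwidth_bytes_per_s
    )
```

Since the delay depends only on the message size, two messages of equal size always see the same delay regardless of how many other messages are in flight, which is exactly why software and hardware queuing delays are not captured.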
The rest of this section is divided into four components:
1. Definition of experiments. We have simulated both data and task parallel programs. Where possible, standard benchmarks have been used.
2. Validation of the simulator. We show the validation that we can achieve with the simulator. We are able to predict the runtime of a program with an error ranging from 5% in the best case to 20% in the worst case. We address the possible sources of error.
3. Performance of simulation protocols. When no optimizations are employed, on average, the accelerated null message protocol performs the best. When runtime analysis is performed it is possible in some cases to remove all synchronizations.
4. Performance of MPI-SIM. We demonstrate the scalability of the simulator. The ability of the simulator to exploit determinism in the program leads to very good performance.

The performance of the parallel simulator is presented using the following standard metrics: slowdown, speedup, and synchronization overhead. For some applications (the NAS benchmarks described below), no speedup could be achieved using 16 processors and conventional simulation protocols. However, using the runtime optimizations, speedups ranging from just above 3 to almost 12 were reached.

Speedup and slowdown are basic metrics of a simulator's performance. Assuming that the host architecture has n processors, speedup refers to the ratio of the time taken by the simulator when executing on one processor to the time taken by the simulation using n processors. Note that the one processor implementation uses a sequential simulation algorithm to synchronize the multiple LPs. Assuming the application being simulated has sufficient parallelism, speedup measures the ability of the simulator to exploit this parallelism. Slowdown refers to the ratio of the total elapsed time taken to execute the simulator on the host architecture to the elapsed time to execute the target program on the target architecture. Slowdown is a measure of the total overhead of the simulator. The synchronization overhead is typically represented by the number of synchronization messages required in a given execution of the simulator (this measure is denoted by Sync: below).
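The two ratio metrics defined above can be written down directly. The function names and the sample timings below are hypothetical and serve only to show how the ratios are formed; they are not measurements from the paper.

```python
def speedup(sim_time_one_proc, sim_time_n_procs):
    """Simulator time on 1 host processor divided by its time on n host processors."""
    return sim_time_one_proc / sim_time_n_procs


def slowdown(sim_time_on_host, target_time_on_target):
    """Simulator elapsed time divided by the target program's elapsed time."""
    return sim_time_on_host / target_time_on_target


# Hypothetical timings in seconds, for illustration only.
example_speedup = speedup(120.0, 10.0)    # 12.0
example_slowdown = slowdown(40.0, 10.0)   # 4.0
```

Note that speedup compares the simulator with itself on different host sizes, whereas slowdown compares the simulator with the program it simulates.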
A 24-node IBM-SP2 at UCLA was selected as both the target and host architecture for the experiments reported in this paper. Each node of the IBM-SP2 is a POWER2 node with 128Kb of cache and 256Mb of main memory. Nodes are connected using a high performance switch which offers a point-to-point bandwidth of 40Mb/s, and has a hardware latency of 500ns. The execution time for each experiment reported in this paper was taken by executing the program in exclusive or stand-alone mode and taking the average of at least three different runs.
Experiments
A set of three data parallel programs was selected: Gauss-Jordan Elimination, matrix multiplication, and FFT computation. The programs were written in a C-based data parallel language called UC [BKM95], developed at UCLA. The data distribution, reduction operations, and parallel assignments used in the UC programs reported in this paper are similar to those supported by HPF. As we had ready access to the compiler, it was possible to directly insert the simulator hooks into the object code generated by the compiler.
The matrix multiplication program uses an algorithm that employs a checkerboard block decomposition of the input matrices. The algorithm alternates between the alignment and multiplication phases. In the alignment phase, parallel assignments are used to cyclically permute the multiplicand matrices such that the operands needed for the multiplication phase are available locally at each processor. The Gaussian Elimination benchmark uses the standard algorithm to reduce a coefficient matrix of a set of linear equations into an upper triangular form. The FFT program computes the discrete Fourier transform of a polynomial.
MPI-SIM was also validated for the NAS (Numerical Aerodynamic Simulation) Parallel Benchmarks (NPB 2) [BHS+95], a public-domain benchmark suite for MPI. The NPB 2 benchmarks are a set of programs designed at the NASA NAS program to evaluate supercomputers. These benchmarks are written in Fortran 77 and embedded with MPI calls for communication. Since MPI-SIM currently supports localization for only C programs, it was necessary to convert the benchmarks to C. We were able to convert four out of the five benchmarks using f2c [FGMN90], a Fortran-to-C converter. The specific configurations of the benchmarks that were used in the performance study were constrained primarily by their memory and CPU requirements.
Four different asynchronous algorithms and a synchronous quantum protocol were used to synchronize the parallel program simulator. In the first three asynchronous modes, Null Message Protocol (NMP), Conditional Event Protocol (CEP), and Accelerated Null Protocol (ANP), the simulator uses the respective asynchronous simulation algorithm described in Section 2 to advance the EIT of each LP. The fourth protocol, called the Deterministic Synchronization Protocol (DSP), tries to switch the simulator to the deterministic mode when feasible, based on the analysis of the most recent receive statement executed by the LP. In this mode, if the receive statement is deterministic, the LP does not initiate any synchronization messages. If the receive statement is not deterministic, the LP uses the ANP protocol to advance its EIT as described earlier. The above simulator modes allow us to determine the contribution of each protocol and each optimization to the performance of the simulator.

Figure 4 shows the accuracy of the prediction for the Gauss-Jordan Elimination. The curve labeled "Real Execution" shows the execution time of the target program executing on a target architecture with N host processors, N = 4, 8, 16. As seen from the figure, the predicted times are in close agreement with the measured times, with the predicted times lying within 5% of the actual execution time.

Table 1 summarizes the relevant configuration information for the NAS benchmarks. Each benchmark was executed for 3 target machine configurations. The LU and MG benchmarks were executed on 4, 8 and 16 processors, whereas the BT and SP benchmarks required 4, 9 and 16 processors. The column labeled "Loc." is the degree of localization of the simulator, i.e. the maximum number of threads that could be mapped to each host process of the simulator (this was primarily constrained by the available memory on each host processor and the memory requirements of each target process).
For each target machine configuration, the columns labeled "Host" list the size of the executable code for the target program and the simulator. They also list the number of processors of the host machine that were used to execute the simulator.
Validation
Table 1 (surviving fragment): (1,2,4) (1,2,4,9) (1,2,4,9,16) | SP 5555 S 4,9,16 16 700K/7M 500K/6M 500K/5M (1,2,4) (1,2,4,9) (1,2,4,9,16)

For each target and host processor configuration, the simulator was executed in each of the four modes described in the previous section. The NPB 2 benchmarks are self-verifying, meaning that each benchmark, after completion, compares the computed results against precomputed results to ensure that it executed correctly. All target programs and simulators were found to verify correctly.

Figure 5 plots the target program execution time (solid line) and the execution time as predicted by the simulator (dashed lines) as a function of various target machine configurations. The plots for the various simulator modes were very similar, and consequently the figure only displays the predicted time in the DSP mode. The multiple curves represent the simulator running on a different number of host processors between 1 and 16. In the best case the predicted and measured times differed by less than 5%, and in the worst case by 20%, lending reasonable credibility to the simulations. We also found that the predicted times matched the measured time very closely for long running configurations of the simulator (such as for the LU benchmark), lending additional credibility to the accuracy of the simulator.
Performance
Protocol Performance for MPI Programs
We compared the performance of the asynchronous protocols with the quantum protocol for the MPI benchmarks. The performance of the simulation protocol in each simulator mode is gauged by the number of rounds of protocol messages, Sync:, sent for each processor. The performance of the quantum protocol is gauged by the number of global synchronizations. A round of protocol messages is similar to a global synchronization, although it is frequently less expensive, since in many cases, a processor does not need to wait to receive protocol messages from all other processors in order to identify a safe message from its input queue.
Given a target processor configuration, we found that Sync: decreases only modestly on increasing the number of host processors used to simulate the configuration. The accompanying figures show the variation of Sync: with the simulator modes for two representative target and host processor configurations of each benchmark. In each graph, the number of rounds of protocol messages is normalized against the number of global synchronizations of the quantum protocol. Sync: is an important measure, because it characterizes directly how much overhead is due to synchronization protocols.

Consider only the ANP mode: the amount of improvement over the quantum protocol is strongly dependent on the average duration for which an LP (i.e. thread) executes before getting blocked. Table 2 shows this average duration for each benchmark and each target program configuration, in terms of L, the minimum message latency of the target machine. The 9 processor BT benchmark has the largest average uninterrupted execution time per thread, and in the simulation, the ANP mode is able to eliminate more than 80% of the global synchronizations of the quantum protocol (Figure 7). The 16 processor MG benchmark has the smallest average uninterrupted execution time per thread, and the ANP mode is unable to significantly reduce the number of global synchronizations of the quantum protocol (Figure 8).
The performance of the CEP mode is significantly better than the NMP mode only for the 9 processor BT benchmark. The NMP mode eliminates 40% of the global synchronizations in the quantum protocol, and the CEP mode eliminates 80%. This is because the CEP significantly improves over the NMP only when some LPs are far ahead of the others in simulation time, requiring the other LPs to exchange many rounds of null messages to update their simulation times. This situation is more likely to occur when the average duration of uninterrupted execution is long, as in the 9 processor BT benchmark. In such situations, a thread will not block until a message receive is encountered. This allows the thread to run ahead of the others in simulation time.
The NMP mode almost never performs better than the CEP mode, and the ANP mode is not significantly better than simply the CEP mode. This is because all the benchmarks predominantly use one communicator, so that any LP can communicate with any other LP in the system. This means that in the NMP, each LP needs to check all N - 1 (where N is the number of target processors) communication channels. Consequently, the null message protocol is unable to extract and use information on the communication topology, resulting in high communication overhead.
Using the DSP mode, we note that it is possible to eliminate all global synchronizations in the BT and SP benchmarks, because every receive statement names its sender and is therefore deterministic. However, the optimizations were not effective in significantly reducing the synchronizations in the MG and LU benchmarks, as discussed in the next section.
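The normalization used in this comparison (rounds of protocol messages divided by the number of global synchronizations of the quantum protocol) and the resulting "fraction eliminated" can be sketched as follows. The function names and the counts in the usage note are hypothetical; only the arithmetic follows the text.

```python
def normalized_sync(protocol_rounds, quantum_global_syncs):
    """Sync: of a mode, normalized against the quantum protocol."""
    return protocol_rounds / quantum_global_syncs


def fraction_eliminated(protocol_rounds, quantum_global_syncs):
    """Share of the quantum protocol's global synchronizations that a mode avoids."""
    return 1.0 - normalized_sync(protocol_rounds, quantum_global_syncs)
```

With hypothetical counts of 200 rounds against 1000 global synchronizations, the normalized value is 0.2, which corresponds to eliminating 80% of the synchronizations; a mode with no protocol rounds at all, as reported for DSP on BT and SP, eliminates 100%.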
Parallel Simulator Performance
The performance of MPI-SIM using the NMP, CEP, and ANP protocols for the four benchmarks is presented in Figure 10. Each graph describes the performance of the three protocols for the target machine configuration with 16 processors. In general, the parallel performance of these protocols is poor, primarily because of the relatively low computation granularity of the applications. An analysis of the parallel execution showed clearly that a very large number of null messages were used even with the ANP protocol, which effectively canceled any benefits that accrued from using more processors.
In contrast, when our optimization which exploits determinism in the application is applied, good performance can be achieved. First, determinism must be present in the application. A number of techniques may be used to enforce deterministic communications in a parallel program. In an MPI program, two commonly used techniques to define deterministic receives are to have the receive specify the source explicitly, or to have it specify an explicit tag while each source uses unique tags. Although the first type of determinism can be detected automatically by the current simulator, we have not yet implemented the second mode. Out of the four NPB benchmarks considered in this study, SP and BT use determinism of the first type, and the MG and LU benchmarks specify determinism of the second kind. As the simulator does not automatically detect the second type of determinism, we manually inserted the optimizations to evaluate the potential benefit that can be derived from exploiting this form of determinism.

The final speedups obtained from the execution of all the benchmarks are presented in Figure 11. The speedups for the LU benchmarks are relative to the smallest host processor configuration that could be used to run the simulator. For example, the 8 target processor simulator could be executed on 2, 4 or 8 host processors. Hence, the reference execution time is that of the 2 processor simulation. This understates the expected performance improvement for this application. Extrapolating from the speedup obtained by the 4 processor target program, we expect this application to also yield the excellent speedups that were obtained for the SP benchmark. Using the DSP protocol exposes the speedup of the application itself, because synchronizations between processes are no longer present. The slowdown for the DSP mode for the 16 target processor BT and SP benchmarks is presented in Figure 12.
We can see that MPI-SIM can reduce the slowdown factor from 20 for a sequential simulation down to a factor of less than 4 when running on 16 host processors. However, in scalability studies, there are usually fewer host processors than target processors. If we have only 4 host processors and want to predict the performance of the application running on 16 processors, then the slowdown we would incur would still be tolerable: just above 8 for both the BT and SP benchmarks.
The performance of the simulator for data parallel programs is presented using both the slowdown and speedup metrics. Since the data parallel programs are compiled to be deterministic, there is no need to use any protocol other than the DSP. Figures 13, 14, and 15 respectively present the simulator slowdowns and speedups for the Gauss-Jordan Elimination, FFT, and Matrix Multiplication programs. Clearly, the simulator speedups and slowdowns are very application dependent. For example, in the Gauss-Jordan elimination, when we increase the number of host processors from 1 to 2, the simulation time goes down for all problem sizes. There are two conflicting factors here: the decrease in computation time because the number of processors has increased, and the increase in communication time for communications between threads that now lie on different processors. In Gauss-Jordan the benefit due to the first factor outweighs the loss due to the second factor. In both matrix multiplication and FFT, this is not the case for the finest grained problem executions (64 processors in FFT and 32 processors in matrix multiplication), where the second factor outweighs the first. So the slowdown increases from 100 to 160 in FFT, and from 80 to 110 in matrix multiplication, on increasing the processors from 1 to 2.

Consider the comparative speedups obtained for target programs that use a varying number of processors (N) in the target architecture, for a constant data size and a constant number of processors (K) in the simulator: for Gauss-Jordan elimination, the speedup increases with N, but it decreases for FFT and matrix multiplication. Closer analysis revealed that in the latter two applications, the communication in the target program increased substantially with N, which also degrades the performance of the simulator.
Conclusion
Parallel simulation can offer significant reduction in the execution time of simulation models of parallel programs. We have developed a parallel simulator that uses asynchronous conservative synchronization protocols together with optimizations that exploit the communication characteristics of the program being simulated. We analyzed and compared the behavior of several protocols under varying synchronization demands of the applications. We demonstrated the performance of the simulator on a range of task and data parallel programs, including four of the five benchmarks that are defined in the NAS Parallel Benchmarks Suite (NPB 2). We have described a very efficient way to model data parallel programs by extracting the determinism present at the application level. We enhanced the conservative synchronization protocols used by the simulator to take advantage of determinism present in the applications, thus improving the lookahead in the simulator. The results show that the optimizations suggested in this paper can significantly reduce the synchronization overheads for the simulator. As a result, MPI-SIM was able to achieve good speedup (of about 12 for the 16 processor SP benchmarks using 16 host processors). Recently, MPI-SIM was extended to model aspects of the computer architecture beyond those of the communication system. Detailed models of I/O systems, parallel file systems, and I/O data caching and placement algorithms were added to MPI-SIM [BDK97].
Acknowledgments
All the data presented in this paper was collected on the IBM-SP2 at UCLA's Office of Academic Computing, granted to UCLA by IBM Corporation under their Shared University Research Program. Special thanks to M. Dhagat for implementing the data parallel programs and to S. Docy and A. Kahn for their role in implementing the simulator.
