Abstract
Introduction
As parallel computers have become more widely available, many tools have been developed for them. These tools examine application performance, load balancing, effective parallelism and communication network problems (contention, link utilization).
We propose a new tool that simulates a MIMD computer in order to run real parallel applications, simply by recompiling the source code, on a cluster of workstations or a small parallel machine. The simulation gives the application results together with all the timing and monitoring information corresponding to the simulated computer.
We try to be as general as possible so that we can simulate a wide range of parallel computers, and we also provide several application programming interfaces (APIs), including NX¹ and PVM².
All the code of the application is really executed, as in a native run, but all functions related to message passing or to time measurement are handled by an engine that simulates the timings and behavior of the target machine. This simulation is based on conservative discrete-event simulation.
* This work has been supported by CNRS contract PICS, GDR-PRC PRS action EXEC and CEE-EUREKA contract EUROTOPS.
¹ The interface commonly available on Intel parallel machines.
² One of the most popular message-passing interfaces.
1080-241X/96 $5.00 © 1996 IEEE. Proceedings of SIMULATION '96.
The result of the simulation can be exploited either through a trace file generated during the simulation (which can be visualized with classical tools like Paragraph), or through time measurements made inside the application, which reflect the virtual time.
We wanted to simulate a wide range of parallel computers, but we also wanted to simulate real applications, that is to say, applications that deal with a large amount of data and that are also quite demanding in CPU power. Consequently we parallelized the simulation (of a machine that is itself a parallel computer), and this is probably the main distinguishing feature of our work. Finally, we wanted the tool to be portable.
Moreover, we studied the theoretical limitations of distributed simulation under our assumptions (cf. §4). Hence, for a given simulation machine, we can predict the conditions necessary to achieve good efficiency (cf. §4.3). This is especially helpful in determining the number of nodes that should be used to achieve the maximum simulation speedup, on a LAN of workstations for instance.
This work is original for two reasons. First, until now there has been no general tool that allows the simulation of a parallel computer at the application level, that is to say, that answers the question: "how much time will this application take on this computer?". Other work done on this subject is either dedicated to a single machine, or focuses on very high accuracy of the simulation at the hardware level and is therefore restricted to simulating "toy" applications. Second, we parallelize the simulation itself. This avoids a severe limitation of traditional application-level simulators: the limited amount of memory of one workstation and the elapsed time. By using a network of workstations or a parallel computer, we can now deal with larger problems and also decrease the execution time of the simulation. Parallel simulation is commonly used for network studies but, to our knowledge, it has never been done for general MIMD computers at the application source code level (except for some work in progress for the Paragon).
The interest of the simulation tool and some background on this subject are described in section 2 and section 3, followed by the theoretical study of its efficiency in section 4. The specification and implementation are described in sections 5 and 6, and section 8 presents experiments that demonstrate the quality of the results obtained as well as the simulation speedups.
Why is simulation useful?
Simulation is useful for developing or testing applications without the target machine. In the case of modular or custom-made parallel systems, it is also useful for the design and choice of the hardware to be bought or built. Moreover, in order to develop and optimize parallel applications, the most widely used methods have been instrumentation and runtime tracing. There are many different ways to do this monitoring [19]: the instrumentation of the program can be done more or less automatically, and there are several approaches to visualization [16, 10]. The main problem with observational analysis is neutrality. It is difficult to instrument and collect data without significantly perturbing the timings of the target application. In the worst case, even the behavior of the application can be changed, because of the non-determinism introduced by parallelism. Much effort has therefore been made, in software as well as in hardware, to ensure as much neutrality as possible [5].
Hence, our parallel simulation tool provides the following new features:
It ensures neutral observation.
It allows developing without access to the real machine.
It allows designing and studying a machine without (or before) building it.
It allows testing of massive parallelism on real applications without requiring the target machine and without a huge execution time.
Sequential simulation of parallel computers at application level
Some tools already exist to simulate the execution of a parallel application on particular hardware. Generally, the application is transformed automatically into a sequential program. This program simulates all the events that would have occurred on the target machine during a real run of the application. The result of the simulation can be examined with the appropriate tools. Among them are:
Proteus [3]: This tool allows a quite realistic simulation. First, at compile time, the cost of each basic block of the application is evaluated so that it can be taken into account during the simulation. Then the application is sequentially simulated with a simulation engine, which is responsible for maintaining a virtual clock, sharing the simulator CPU among the different simulated virtual nodes and simulating the communications.
EPPP simulator [15]: EPPP is a complete programming environment, including a simulator based on Proteus. The evaluation of computation time has been improved. The compile-time analysis is done by looking at each assembly code basic block generated for the target machine.
EPG-sim [13]: This is an integrated set of tools allowing trace generation, serial execution-driven simulation, and trace-driven simulation.
Besides those cited, some other work has been done:
Parallel simulation of MIMD computers
In the case of trace-driven simulation, there are already some tools and studies that achieve an efficient parallel simulation. Some execution-driven simulations can use the same techniques as trace-driven simulation, in the case where the set of messages exchanged between processors is deterministic and the execution path of each application process is completely insensitive to timings. This is equivalent to generating and processing the trace online (for instance, EPG-sim is able to use this mode).
LAPSE [7]: LAPSE is a parallel simulator for the Paragon. It assumes that the behavior of applications is most of the time insensitive to timings. It uses "window" techniques, which work well in the previous case. If the application becomes sensitive to timings, the windows become so tight that this part of the simulation becomes sequential.
Another tool adapts its behaviour dynamically:
Simulation of parallel program by discrete events
We present a quick overview of the sequential discrete-event simulation of parallel programs, with our assumptions and point of view. The next section shows how we exploit parallelism in our simulation algorithm.
Structure of the simulation engine
Events are needed to simulate a complex system in which some parts evolve separately and interact at certain times; they are a representation of these interactions.
The global state of the simulated computer at a given time is represented by a set of variables. The simulation progresses by way of transitions: one transition modifies the variables to represent a new state of the machine and increases the time by a specific amount. Hence the global state of the machine is changed only at precise, countable points in time, which is why we speak of discrete-event simulation.
One important structure that is maintained is a queue of events. Two attributes are associated with each event: its nature and its occurrence time (called its time-stamp).
We consider all changes to the state of the system to be atomic. Actions that have a duration are therefore represented in the model by two events, one at the beginning of the action and one at the end. The evolution of the real system in between is assumed not to be meaningful in the scope of the simulation.
At a given time t, let Q be the queue of events and S be the state of the system. The simulation engine basically consists of the following algorithm:
While true
1. Remove from Q the event e with the smallest time-stamp.
2. Modify S by taking into account the occurrence of e. During this modification, events can be created that are inserted into Q.
endwhile
At each stage of the algorithm, the virtual time of the simulation is given by the time-stamp of the event e that is currently processed.
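The loop above can be illustrated with a minimal Python sketch. It is not the authors' engine: the handler names, the `state` dictionary and the 2.5 time-unit duration are hypothetical, and the loop stops when the queue is empty instead of running forever. It shows the priority queue ordered by time-stamp, the virtual clock following the processed event, and an action with a duration modeled as a begin event plus an end event.

```python
import heapq
import itertools


def run_simulation(initial_events, handlers, state, max_events=1000):
    """Process events in time-stamp order and return the final virtual time."""
    counter = itertools.count()  # tie-breaker keeps equal time-stamps in insertion order
    queue = []
    for stamp, kind in initial_events:
        heapq.heappush(queue, (stamp, next(counter), kind))
    virtual_time = 0.0
    processed = 0
    while queue and processed < max_events:
        stamp, _, kind = heapq.heappop(queue)  # event e with the smallest time-stamp
        virtual_time = stamp                   # virtual time is the time-stamp of e
        # The handler modifies S and may return new (time-stamp, kind) events.
        for new_stamp, new_kind in handlers[kind](state, stamp):
            heapq.heappush(queue, (new_stamp, next(counter), new_kind))
        processed += 1
    return virtual_time


# Hypothetical handlers: a send lasts 2.5 time units, so it is modeled by
# one event at its beginning and one event at its end.
def begin_send(state, now):
    state["link_busy"] = True
    return [(now + 2.5, "end_send")]


def end_send(state, now):
    state["link_busy"] = False
    state["delivered"] += 1
    return []


state = {"link_busy": False, "delivered": 0}
final_time = run_simulation(
    [(1.0, "begin_send"), (10.0, "begin_send")],
    {"begin_send": begin_send, "end_send": end_send},
    state,
)
```

A binary heap is only one possible choice for Q; any structure that returns the event with the smallest time-stamp would serve the same purpose.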
The representation of the state of a machine depends somewhat on its architecture, but under our assumptions (section 5.1) the state of any machine can generally be decomposed as follows:
o the state (active or idle) of each link of the communication network.
o the list of messages blocked at each router, waiting for a link to become available.
o the list of links currently monopolized by each message in transit on the network.
o the state of each application process (data, stack, program counter), which is represented by a real process (or just a thread) on the simulating host, linked with a library that redirects message-passing calls to routines that interact with the simulation engine.
o for each processor node, a list of events blocked waiting for some resources, a list of events corresponding to received messages not yet fetched by an application process, and some other structures depending on which flow control protocol is used.
This is a simple model, but notice that we can extend it easily: for instance, if the routing function is more complex, the router maintains some additional state, etc.
The fundamental events and their associated actions
In the following we describe the most representative types of events and the corresponding actions that are processed when treating them (depending on the simulated architecture). We describe the simulation of a circuit-switched network; from this example, the worm-hole case is straightforward.
Treatment of a "transmission" event.
Let s be the source site, t the sending date and d the target site. The following algorithm is implemented:
1. Acquire on s the resources needed for emission; this may lead to a "sleep" (see explanation below) if a resource is not immediately available.
2. Compute t', the date at which the resources are ready to use (there can be a "switching time" associated with some resources). Insert an event of type routing at date t' into Q (the event queue).
This algorithm is considered atomic if all resources are available. Otherwise, the information necessary to do the rest of the processing is inserted into the queue of blocked events associated with the resource. The algorithm will resume when another event frees the resource.
Treatment of a "routing" event.
Let n be the node at which the message is arriving:
If n is the final destination of the message, try to acquire on this node the resources needed for delivery, as in the case of a transmission event, compute the time at which the delivery will be completed and insert into Q an event of type "end of transmission".
Else (n is an intermediate node), compute the next node n' with the appropriate routing function, wait for the availability of the resource corresponding to the link between n and n', add a switching time to obtain the date of the event of type routing for the node n', and insert it into Q.
Treatment of an "end of transmission" event.
All the resources used for the message are freed, in particular the links along the path between the sender and the receiver (this can restart some actions that were waiting for the corresponding resources). The message is delivered to the application.
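The three event types can be put together in a small Python sketch of a circuit-switched network. Everything concrete here is an assumption made for illustration: the nodes are numbered along a line, the routing function moves one hop toward the destination, links are directional, `SWITCH_TIME` and `DELIVERY_TIME` are arbitrary constants, and emission and delivery resources are assumed to be always free, so only links can block.

```python
import heapq
from collections import defaultdict, deque

SWITCH_TIME = 1.0    # assumed router switching time
DELIVERY_TIME = 5.0  # assumed delivery time once the circuit is established


def simulate(messages):
    """messages: list of (send_time, source, destination) on a line of nodes.
    Returns {message_id: delivery_time}."""
    queue, seq = [], 0
    holder = {}                     # link -> id of the message monopolizing it
    waiting = defaultdict(deque)    # link -> blocked (message_id, node) pairs
    path_links = defaultdict(list)  # message_id -> links acquired so far
    delivered = {}

    def push(time, kind, msg, node):
        nonlocal seq
        heapq.heappush(queue, (time, seq, kind, msg, node))
        seq += 1

    for msg, (time, src, _dst) in enumerate(messages):
        push(time, "transmission", msg, src)

    while queue:
        now, _, kind, msg, node = heapq.heappop(queue)
        dst = messages[msg][2]
        if kind == "transmission":
            # Emission resources are assumed free: schedule routing at t'.
            push(now + SWITCH_TIME, "routing", msg, node)
        elif kind == "routing":
            if node == dst:
                push(now + DELIVERY_TIME, "end_of_transmission", msg, node)
            else:
                nxt = node + 1 if dst > node else node - 1  # routing function
                link = (node, nxt)                          # directional link
                if link in holder:
                    waiting[link].append((msg, node))       # blocked on the link
                else:
                    holder[link] = msg
                    path_links[msg].append(link)
                    push(now + SWITCH_TIME, "routing", msg, nxt)
        else:  # end_of_transmission: free the whole path, wake blocked messages
            delivered[msg] = now
            for link in path_links.pop(msg, []):
                del holder[link]
                if waiting[link]:
                    blocked_msg, blocked_node = waiting[link].popleft()
                    push(now, "routing", blocked_msg, blocked_node)
    return delivered


# Two messages compete for the link from node 1 to node 2.
delivered = simulate([(0.0, 0, 2), (0.0, 1, 2)])
```

In this run the second message acquires the shared link first, so the first message stays blocked at node 1 until the "end of transmission" event of the second one frees the link.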
Constraints for the parallelization of the simulation
The preceding section described our sequential simulation algorithm. In this section we present the main problems that occur when we parallelize it.
We investigate how several events can be processed in parallel. The classical problem that arises is the coherence constraint of parallel simulation: the simulation algorithm must ensure that the results of transitions on the machine state are exactly as if they had been processed sequentially in chronological order.
It is not possible to apply "Time Warp" or any other optimistic parallel simulation method because, as we have seen, the state of the machine includes real application processes, and it is not possible to save and restore efficiently the state of such processes (which includes stack, heap and static data).
We cannot use a simple conservative window algorithm either. In a window algorithm, we have to compute at a time t an interval δ during which it is assumed that no event is generated, and in our case we cannot predict the service time of an event. Moreover, the service time of components such as routers is very low compared with the typical time between application events. So the only possible window interval δ would be too small to allow any parallelism. Notice that this restriction is removed if we suppose that both the pattern of messages exchanged by the application and the execution path of the processes are completely deterministic, thus allowing an efficient look-ahead (see [14] for more details).
Using the latencies of the simulated machine
We describe here each element of the simulated machine as having a service time. In practice, the usual components have a service time that is too small compared with the inter-event intervals to allow parallelism to be exploited with a window-type strategy (see [14]).
We define, between any pair of components (p, q), the latency l(p, q) as the minimum possible time between an event that occurs at p and an event that occurs at q as a consequence of the first one.
Let h_x denote the time-stamp of an event x. At each stage of the simulation algorithm, we can choose as the next transition any event e ∈ Q verifying h_e < h_{e_0} + l(s_0, s), where e_0 is the event with the smallest time-stamp and s, s_0 are the locations where e and e_0 occurred. The simulation algorithm becomes non-deterministic and so has an inherent potential for parallelism.
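As an illustration, the selection rule can be written as a short Python function. The event representation (a `(time_stamp, site)` pair), the site names and the latency values are assumptions chosen for the example; the test applied is the one stated in the text, namely h_e < h_{e_0} + l(s_0, s).

```python
def eligible_events(queue, latency):
    """queue: list of (time_stamp, site) pairs.
    latency[(p, q)]: minimum delay between an event at p and a consequent event at q.
    Returns the events that may be chosen as the next transition."""
    stamp0, site0 = min(queue)  # e_0: the event with the smallest time-stamp
    safe = []
    for stamp, site in queue:
        is_e0 = (stamp, site) == (stamp0, site0)
        if is_e0 or stamp < stamp0 + latency[(site0, site)]:
            safe.append((stamp, site))
    return safe


# Hypothetical sites and latencies.
latency = {("A", "A"): 0.0, ("A", "B"): 5.0, ("A", "C"): 5.0}
queue = [(10.0, "A"), (12.0, "B"), (30.0, "C")]
safe = eligible_events(queue, latency)
```

Here the event at B can be processed concurrently with the earliest event at A, because nothing that happens at A at time 10.0 can influence B before time 15.0, whereas the event at C has to wait.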
In the corresponding distributed algorithm, Q and S are in fact distributed among a certain number of processes. Each one is responsible for a site and then deals with all transitions associated with this particular site. On each process the following algorithm is executed:
1. Let e be the local event with the smallest time-stamp, and let t be this time-stamp.
2. For every other site q, wait until the process associated with q has reached time t - l(q, p), where p is the local site.
3. Modify S by taking into account the occurrence of e.
During this modification, events can be created that are dispatched to the appropriate site.
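The waiting condition of step 2 can be expressed as a predicate over the virtual times reached by the other sites. The following Python sketch assumes that each process somehow knows these times (the `clocks` dictionary); how they are exchanged is left out, and the site names and latencies are hypothetical.

```python
def can_process(p, t, clocks, latency):
    """True if site p may process its local event with time-stamp t, i.e. if
    every other site q has already reached virtual time t - l(q, p)."""
    return all(clocks[q] >= t - latency[(q, p)] for q in clocks if q != p)


# Hypothetical virtual times reached by each site, and latencies toward node0.
clocks = {"node0": 8.0, "router": 9.5, "node1": 4.0}
latency = {("router", "node0"): 1.0, ("node1", "node0"): 6.0}

ready = can_process("node0", 10.0, clocks, latency)      # 9.5 >= 9.0 and 4.0 >= 4.0
not_ready = can_process("node0", 11.0, clocks, latency)  # 9.5 >= 10.0 does not hold
```

The larger the latencies toward p, the further p can run ahead of its neighbors, which is why the small latencies of routers and links leave little parallelism at this level.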
The approximate parallelism provided by this algorithm depends essentially on the ratio between the typical interval between two events on the same site and the typical information transfer time between sites. In practice, in our case, the only sites where the local state progresses independently of the other nodes are the compute processes; all the other sites (routers, links) are essentially driven by external events (i.e. events not generated on the same site). This roughly means that there is a synchronization with the neighborhood at each event. In practice, there is no exploitable parallelism at this level.
But it will be a useful optimization when used in conjunction with the algorithm described below.
Master slaves organization
Nevertheless, we can try to preserve the parallelism inherent in the application by distributing the computations of the different application processes.
With that aim, one node of the simulation machine takes charge of several nodes of the simulated machine. To take into account contention and network constraints, a master process simulates the communication hardware and delivers in order (chronologically with respect to the virtual time) acknowledgements for the messages that are exchanged on the network between the application processes, which are managed by what we call the slaves. Each slave has to inform the master of the virtual time reached by the nodes it simulates. Notice that the master is able to run on a node together with several slaves.
Cutting the computation phases
Compute processes have a local evolution of their state between two calls to the message-passing library. But if we stay with our previous model, a computation phase is considered as one atomic transition, leading to no parallelization at all. Fortunately, a computation phase is something that can be decomposed. In the middle of a computation, the other simulation processes can then be informed of which simulation point has been reached. Hence, they can possibly start or continue their own computation phases, which proceed in parallel with the rest of the current computation. The problem is that there is no obvious decomposition granularity: it would be too costly to decompose at the instruction level and, conversely, too large a granularity would cause a loss of parallelism.
Our solution to this problem is to allow an interrupt-driven decomposition. That means that when the master must wait for a slave to reach a certain point in time, for instance before allowing the delivery of a message that will start a computation phase on another node, it interrupts the slave computation phase. The slave answers whether or not it has reached the corresponding point in time and, if not, sets a timer in order to inform the master as soon as it reaches this critical point. Moreover, when several virtual nodes are simulated on a single real node, we will see later that it is essential to be able to switch between the processes (representing virtual nodes) in the middle of computation phases. So from now on we will assume that computation phases can be dynamically cut into several parts.
Algorithm on an ideal simulation host
Let V (for Virtual) be the number of nodes of the simulated machine and R (for Real) the number of nodes of the simulating machine. We assume from now on that the simulating host has the following properties:
Computation phases can be cut into "infinitely" small parts without overhead.
Communication can fully overlap with computation.
The average latency of small messages between a slave process and the master process is β.
Algorithm of the master
Our simulation model has changed a bit since §3.1 with the introduction of interruptions in computation phases, but we still have a set of events Q that is completely managed by the master.
The algorithm on the master is then:
While true
1. Let e be the first event in Q (i.e. the one with the smallest time-stamp).
2. Wait for each application process to be inactive (waiting for a message or some information from the master) or to reach a point later than the time-stamp of e (more precisely, later than the time-stamp of e minus the latencies β corresponding to the communication path of short messages, see §4.1).
3. Run the action corresponding to e (it can result in the starting of computations on slaves).
EndWhile
³ This is a consequence of §4.1, where t is large but cannot be predicted.
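The master loop can be sketched in Python as follows. This is an illustration under assumptions, not the tool's code: `NodeStatus`, `wait_for_reports` and `run_action` are hypothetical names, slave reports are replaced by a toy function that advances every active node by one time unit, and the loop ends when Q is empty.

```python
import heapq
from dataclasses import dataclass


@dataclass
class NodeStatus:
    virtual_time: float     # last virtual time reported for this virtual node
    inactive: bool = False  # True when waiting for a message or for the master


def may_run(stamp, nodes, beta):
    """Step 2: every node is inactive or later than the time-stamp of e minus beta."""
    return all(n.inactive or n.virtual_time > stamp - beta for n in nodes)


def master(events, nodes, beta, wait_for_reports, run_action):
    """events: list of (time_stamp, name). Returns the names in processing order."""
    queue = list(events)
    heapq.heapify(queue)
    order = []
    while queue:
        stamp, name = queue[0]                  # step 1: first event in Q
        while not may_run(stamp, nodes, beta):  # step 2: wait for the slaves
            wait_for_reports(nodes)
        heapq.heappop(queue)
        order.append(name)
        for new_event in run_action(name, stamp, nodes):  # step 3: run the action
            heapq.heappush(queue, new_event)
    return order


# Toy stand-ins: each "report" advances every active node by 1.0, and a
# "send" action schedules a "deliver" event 4.0 time units later.
def wait_for_reports(nodes):
    for n in nodes:
        if not n.inactive:
            n.virtual_time += 1.0


def run_action(name, stamp, nodes):
    return [(stamp + 4.0, "deliver")] if name == "send" else []


nodes = [NodeStatus(0.0), NodeStatus(2.0), NodeStatus(0.0, inactive=True)]
order = master([(3.0, "send")], nodes, beta=0.5,
               wait_for_reports=wait_for_reports, run_action=run_action)
```

The inactive node never delays the master, while the slowest active node determines when each event may be run.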

Algorithm of the slave
Let N = V/R be the average number of virtual nodes managed by a slave. We can represent the state of the slave by (t_i), 1 <= i <= N, where each t_i represents the virtual time reached by one of the managed nodes. The vector t represents the advancement of the simulation on one slave. When a virtual node reaches a communication point, it must wait for the master to inform it that all slaves have reached "this point". We will say that the virtual node is blocked.
The algorithm of a slave is composed of the following actions:
1. Let S be the set of nodes that are not blocked. Advance their computation uniformly, i.e. run the virtual node of S with the smallest t_i. As we assumed that we can switch between nodes of S with an infinitely small granularity, there will be a subset of the t_i that increase at the same time.
2. When a virtual node becomes blocked (at a receive call for instance), send the corresponding t_i value to the master. Hence the master can inform all the slaves when a condition on the t_i has been reached.
Notice that at any time a message from the master can unblock a node, for instance by guaranteeing that there will be no more deliveries of application messages with a time-stamp less than the blocking point t_i. The node can then continue its computation, running some computations based on the messages available at this point.
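A minimal Python sketch of the slave's two actions is given below. The finite `quantum` stands in for the infinitely small switching granularity assumed on the ideal host, and `next_comm` (the virtual time of each node's next communication point) and the `report` callback are hypothetical; unblocking by the master is not modeled.

```python
def slave_step(times, next_comm, blocked, quantum, report):
    """Advance the unblocked virtual node with the smallest t_i by at most
    `quantum`; block it and report t_i when it reaches its communication point.
    Returns the index of the node that ran, or None if all nodes are blocked."""
    runnable = [i for i in range(len(times)) if not blocked[i]]
    if not runnable:
        return None
    i = min(runnable, key=lambda k: times[k])  # node of S with the smallest t_i
    times[i] = min(times[i] + quantum, next_comm[i])
    if times[i] == next_comm[i]:
        blocked[i] = True
        report(i, times[i])                    # send t_i to the master
    return i


# Two virtual nodes on one slave, with hypothetical communication points.
times = [0.0, 0.0]
next_comm = [2.0, 3.0]
blocked = [False, False]
reports = []
while slave_step(times, next_comm, blocked, 1.0,
                 lambda i, t: reports.append((i, t))) is not None:
    pass
```

Because the node with the smallest t_i always runs next, the virtual times of the unblocked nodes stay within one quantum of each other, which approximates the uniform advancement described in step 1.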
Efficiency analysis
We now study the efficiency of our algorithm. For the sake of simplicity, this is done on an average virtual application defined as follows:
The compute phases have an average duration M between two message-passing calls.
All processes are busy at any time.
The execution profile of each process consists of compute phases with an average of 1/M communication points per unit of time.
Let us consider a particular slave and try to determine the conditions necessary to avoid idle states in the simulation.
We consider a "computation cycle" made of the following steps:
1. The slave receives a message from the master unblocking a node.
2. The computation on this node progresses until the next global blocking point; on average that means a $ computation (under some conditions, see below).
3. The slave returns its status to the master.
These steps are represented on the space-time diagram example of figure 1. Figure 2 represents W_a and W on a space-time diagram.
In the computation of the critical path length, we said that the time from a blocking point to the next global blocking point was $; for that we assume that, at the time the master sends the unblocking message, it also knows the point in time of the next blocking point. A sufficient condition for that assumption is that, for all slaves, the W_a are greater than M.
Let us look at what happens across several cycles. If the W_a are too small, the critical paths are longer but then, before leading to idle states, the W_a increase. When the W_a >= M, and as we must have R < $, the critical path between two cycles is smaller than M, so the W_a decrease. So on average the W_a values will stabilize somewhere below M. Moreover, if V >> 2R, the maximum value for W_a is several times M, which ensures that the reasoning with average values is valid.
Note that average values are taken among different nodes at one time and not on one node over time. If all nodes do small computations at the same time, then the efficiency will drop. Now that our average evaluation is validated, we have to prove that we can generalize our analysis of a special kind of application (see [14]). All the other reasoning remains valid and the efficiency will be optimal under the same condition R < 9. Thus, we can generalize our previous result:
Property 2
The efficiency of the simulation is directly related to the "granularity" of the application. The efficiency is close to 1 under the condition R < 9.
Simulation on a realistic machine
The preceding section studied the algorithm on an ideal machine. The strong assumption was that one CPU of the simulation host could be shared between different logical processes (representing virtual nodes of the application) with an infinitely small granularity.
In practice the CPU can be shared, but with context switching between threads or processes at a granularity g that depends on the system and the implementation (logical processes representing virtual nodes can be managed by several Unix processes or by several threads within a single Unix process).
The previous study can take g into account. The only important modification is in the computation of the critical path length, when the time necessary to reach the next global point is computed. This time was $ and it must now be replaced by max($, g).
Limitations due to the application
None of the efficiency considerations discussed so far took into account the time necessary to transfer the data messages between the different simulated nodes. We only considered the messages necessary for the coherency of the simulation. Clearly, if an application cannot be run on the simulating host because its communication is a bottleneck with a normal message-passing library such as NXlib or PVM, there is no hope of compensating for that by adding coherency constraints and sequential ordering on the simulation machine. So what we have determined are the conditions under which the simulation can run at a speed of the same order as with a simple message-passing library on the simulation machine.
Specification of the simulation tool
Architecture modeling
We deal exclusively with distributed memory machines, linked with a classical communication network: point to point, multi-stage, or crossbar.
Our tool has to be parameterizable enough to model the target machines in a fixed format. The parameters we chose are the following:
Power of computing processors: For a wide range of problems, the computation time on a given processor can be estimated by scaling the computation time measured on the simulator's processor. This gives acceptable accuracy, especially for processors of the same family (for instance RISC). Moreover, the loss of precision introduced is often no greater than that introduced by a compiler change or even just a compiler option change. So our choice is quite simple: the computing processor is modeled by a single scalar obtained from benchmarks adapted to the type of computation⁴. Notice that an approach like the one used in [15] allows more accuracy but is beyond the scope of our approach.
Communication protocol: It can be store-and-forward, circuit-switched or worm-hole. Depending on the protocol, several other parameters must be specified: the packet size if messages are split, the flit size for worm-hole communication, etc.
Communication numerical constants: These parameters specify the bandwidth of the links, the switching time of the routers, and the different software and hardware overheads involved in message transfers.
Parameters for the upper communication layer: These are options to specify the size of the communication buffers used and the type of flow control, if any.
All these parameters are given by the user in a configuration file filled either manually or with a graphical interface. The file is read at the beginning of the simulation.
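To make the parameter set concrete, here is a hypothetical Python representation of such a machine description. The field names, the numeric values and the circuit-switched cost formula are all assumptions made for illustration; they do not reproduce the tool's actual configuration file format or its timing model.

```python
from dataclasses import dataclass


@dataclass
class TargetMachine:
    power_ratio: float        # target processor speed / simulating host speed
    protocol: str             # "store_and_forward", "circuit_switched" or "wormhole"
    bandwidth: float          # link bandwidth, in bytes per second
    switching_time: float     # router switching time, in seconds
    software_overhead: float  # per-message software overhead, in seconds
    buffer_size: int          # communication buffer size, in bytes

    def compute_time(self, host_seconds):
        """Scale a computation time measured on the simulating host."""
        return host_seconds / self.power_ratio

    def transfer_time(self, size_bytes, hops):
        """Assumed cost model: overhead, one switching time per hop, then the
        message body at link bandwidth."""
        return (self.software_overhead
                + hops * self.switching_time
                + size_bytes / self.bandwidth)


# Hypothetical values, not measurements of any real machine.
machine = TargetMachine(
    power_ratio=2.0,
    protocol="circuit_switched",
    bandwidth=1e6,
    switching_time=1e-6,
    software_overhead=1e-4,
    buffer_size=4096,
)
scaled = machine.compute_time(3.0)
cost = machine.transfer_time(1000, hops=3)
```

With a single scalar for processor power, a computation phase measured at 3.0 seconds on the host is charged 1.5 seconds of virtual time on a target assumed to be twice as fast.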
Related issues
The different message-passing APIs generally provide some high-level operations such as collective operations. Our simulator tries to reproduce the behavior of the corresponding software layer of the target machine by dividing the runtime library that emulates an API into two layers. The upper layer implements the calls of the API and uses the lower layer to simulate the sending of messages on the hardware. Although we cannot guarantee that we always use the manufacturer's algorithms, we can reasonably assume that the discrepancies introduced in the upper layer are quite small. It is also in this layer that we take into account the flow control protocol used on some machines.
⁴ The only limitation is that some timing effects due to processor-specific strengths will not be reflected in the simulation (for instance cache size).
There are some operating system (OS) features that we do not take into account: on a multi-tasked node, the OS determines the scheduling policy, and on some systems it manages a virtual memory space, possibly with paging or swapping. Given the number of different operating systems, it is not reasonable to take all possibilities into account in a general tool. So we adopt a conservative approach: we make some simple choices (for instance, we assume simple round-robin scheduling if there are several threads or processes on one processor) and we assume no swapping or paging. Nevertheless, our approach is useful in many cases, for several reasons:
o Most parallel applications do not rely on paging because in order to provide acceptable performance, it is necessary that all code and data can be stored in physical memory. Hence system impact due to memory management is negligible and it appears reasonable to ignore it in simulation.
o Many parallel applications using message passing do not rely much on the operating system except at load time and for input-output operations. In particular, most of the time there is only one application process per node, which eliminates the problems related to scheduling policies. As regards input-output operations, it is true that our simulator will not give any hint when such operations are predominant over computation (there are also no benchmarks of that type).
o On a machine that allows only a single application at a time, the operating system has hardly any influence on the behavior of the application. On a multi-user machine, such as a LAN of workstations, the operating system and the load introduce "random" perturbations. It is not within the scope of our project to study this kind of perturbation. On the contrary, such perturbations make analysis and performance tuning of the application more difficult. So the fact that the machine we simulate is more deterministic than the real one is most of the time an advantage.
Implementation presentation
For portability and simplicity reasons, our environment is built on top of PVM [8] .
There are three kinds of PVM tasks: the "slaves", noted S, that manage the different virtual nodes; the "main simulation engine", noted M (for Master), that simulates the communications on the virtual hardware; and a certain number of application processes, noted T (for Thread), each of them representing a virtual node. A slave and its attached nodes are all run on the same CPU.
In a future version, we may change this implementation slightly in order to run a slave and all its attached processes within one single PVM task. This is only a technical detail to minimize the switching time between the different processes on one CPU.
Trace generation
The simulation can be exploited in two ways. On the one hand, the timing measurements made in the application are virtual times (identical to those that would have been measured on the target machine), which allows the application to be analyzed "manually". On the other hand, it is possible to generate a trace file during the simulation. This file can then be examined with existing tools to do a post-mortem analysis. Note that the trace file obtained corresponds to a perfectly neutral observation of the execution (which is almost impossible with a real machine!).
We chose the PICL [18, 20] trace file format, which allows the use of the Paragraph [10] and PIMSY/VIST [17] tools to display the results of the simulation.
The trace generation is actually done at the master site, which has all the information about circulating messages and computing processor activities.
Validation of the simulator
In this part, we present several results obtained with our simulator. For our tests, we took several programs written for the Intel iPSC860 and ran them both on the real iPSC860 and with the simulator on a LAN of SUN workstations.
The first test (figure 3) is a communication "ping-pong" test. This simple test validates the correctness of the parameters for the target machine.
We then took two algorithms that come from the SCALAPACK library package [1], which provides numerical linear algebra routines on parallel computers.
The first one (figure 4) performs an LU decomposition of a matrix, then solves several linear equations by using this decomposition. We present the execution time of these two phases for several matrix sizes.
The second program (figure 5) performs a QR decomposition, then solves a linear system. As for LU, we indicate the time of each phase.
These results show that the simulator has good accuracy. In our case we were simulating on a LAN of workstations with Sparc processors and a power ratio derived from the LINPACK benchmark. The difference between the real and simulated execution times is essentially due to the non-constant power ratio between the two processors when different vector sizes are used. Nevertheless, the bias stays relatively small (within 10%).
Conclusion
At this stage of our work, we obtain some promising results. This tool seems to be useful in several cases: for the development of parallel applications without having an account on the target machine, for the neutral analysis of an application run, and to help the design and study of a future parallel machine.
The tests that have been done with both simulation and native execution show that a good accuracy is obtained easily.
The other interest of this work is the theoretical study of the parallelization efficiency of our distributed simulator.
We are able to characterize the type of simulation that can be done in parallel as a function of the application granularity and the communication latency of the simulation host.
In practice, with the parameters of today's parallel computers, the speedup should increase with up to a dozen workstations on a wide range of linear algebra applications. Notice that finer-grained applications can be simulated using, instead of a LAN, a small parallel computer that provides a much smaller communication latency.
Some work is in progress to allow the use of the simulator with more APIs, to reduce some simulation overheads by using threads instead of processes, and to group the threads into one PVM task.
