Distributed simulation of parallel computers. by Prylli, Loïc
HAL Id: hal-02101809
https://hal-lara.archives-ouvertes.fr/hal-02101809
Submitted on 17 Apr 2019
To cite this version:
Loïc Prylli. Distributed simulation of parallel computers. [Research Report] LIP RR-1995-12,
Laboratoire de l’informatique du parallélisme. 1995, 2+26p. hal-02101809
Laboratoire de l’Informatique du Parallélisme
Ecole Normale Supérieure de Lyon
Unité de recherche associée au CNRS n°1398 
Distributed simulation of parallel computers
Loïc PRYLLI, May 1995
Research Report LIP RR-1995-12
Ecole Normale Supérieure de Lyon
E-mail: lip@lip.ens-lyon.fr
Telephone: (+33) 72.72.80.00    Fax: (+33) 72.72.80.80
46 Allée d’Italie, 69364 Lyon Cedex 07, France
Abstract
Our work deals with the simulation of distributed-memory parallel computers.
The tool we built takes an application written for, say, an Intel Paragon and
runs it on a cluster of workstations after a simple recompilation of the code.
The hardware of the target machine is simulated, so that the behavior of the
application on the workstations is identical to a native run on the simulated
computer, except for the total execution time. We present this tool here,
together with a mathematical analysis of the conditions that the simulation
host, the simulated machine and the application must satisfy for the
simulation to be distributed efficiently.
Keywords: simulation, parallel computers, performance analysis
Resume
Nous nous interessons ici a la simulation distribuee d	ordinateurs
eux m
emes paralleles Nous avons realise un outil permettant d	exe 
cuter une application developpee pour une machine parallele par
exemple le Paragon d	Intel sur un reseau de stations de travail par
le simple biais d	une recompilation Les composantes materielles de
la machine cible sont simulees de sorte que le comportement de l	ap 
plication est identique a celui obtenu par une execution native sur la
machine simulee hormis le temps total d	execution  Nous presen 
tons ici cet outil ainsi qu	une analyse mathematique des conditions
sur la machine simulante l	application et la machine simulee qui
permettent de distribuer ecacement la simulation
Motscles  simulation ordinateurs paralleles analyse de performances
Table of Contents
Backgrounds
    Why is simulation useful
    Parallel programming on workstations
    Tools to simulate a parallel computer
Specification of the simulation tool
    Modeling of an architecture
    The available application programmer interfaces
    Related issues
Simulation by discrete events
    Structure of the simulation engine
    Simple example of simulation
    Choice of a time scale
    Representing the state of the machine
    Some examples of events
        Circuit-switched case
        Wormhole case
Parallelization of the simulation
    Constraints
    Using the latencies of the simulated machine
    Master-slaves organization
    Cutting the computation phases
    Algorithm on an ideal simulation host
        Algorithm of the master
        Algorithm of the slave
        Efficiency analysis
    Simulation on a realistic machine
    Limitations due to the application
Implementation presentation
Trace generation
Validation of the simulator
Conclusion
Introduction
As parallel computers have become more widely available, many tools have been
developed for them. These tools try to fill several needs: studying application
performance (load balancing and effective exploitation of parallelism) and
problems related to the communication network (contention, link utilization).
What we propose is a new tool that hopefully has its place in this domain.
It makes it possible to simulate a MIMD computer on a cluster of workstations
and to run real parallel applications by just recompiling the source code. We
try to be as general as possible, so that we can simulate a wide range of
parallel computers and also provide several application programmer interfaces,
including Chorus, NX, PVM and MPI. A virtual clock representing the time on the
simulated computer is maintained, and the result of the simulation can be
exploited either through a trace file generated during the simulation, which
can be visualized with classical tools like Paragraph, or through time
measurements made inside the application, which reflect the virtual clock.
What is new in our approach is the parallelization of the simulation itself.
This avoids two severe limitations of traditional simulators: the limited
amount of memory of one workstation and the elapsed time. By using a network,
we can now deal with larger problems and also decrease the execution time of
the simulation.
This document presents the tool as well as a theoretical study of the
efficiency we can hope to achieve by distributing the simulation.
  Backgrounds
   Why is simulation useful
To develop and optimize parallel applications, the most widely used methods
have been instrumentation and runtime tracing. There are many ways to do this:
instrumentation can be done more or less automatically, and there are many
different approaches to visualization. The main problem with observational
analysis is neutrality. It is difficult to instrument an application and
collect data without significantly perturbing its timings. In the worst case,
even the behavior of the application can change, because of the
non-determinism introduced by parallelism. Considerable effort has therefore
been made, in software as well as in hardware, to ensure as much neutrality as
possible. Another problem that comes with instrumentation and tracing is that,
like workstations, parallel computers tend to become multi-tasked and
multi-user, so the study of the application is disturbed by events external
to it.
So although observational analysis is useful, simulation can be better suited
in some cases; moreover, both can be combined to ensure neutral observation.
Parallel simulation also offers new features:
  - It ensures neutral observation.
  - It allows developing without access to the real machine.
  - It allows a machine to be designed and studied without, or before,
    building it.
  - It allows testing massive parallelism on real applications without
    requiring a huge execution time.
  Parallel programming on workstations
Some tools already exist to ease the development of parallel applications on a
cluster of workstations. However, they aim at executing a parallel application
as fast as possible, without attempting to simulate a particular MIMD computer.
PVM [GBD]: This is a library for developing parallel applications using a
message-passing paradigm. It is available both on networks of workstations and
on real parallel machines.
NXLib [SLL]: This environment can be used to execute applications written for
the Paragon. Although it provides the application programmer interface of the
Paragon, it does not simulate the behavior of this machine.
Trollius [BDV, Bur]: Trollius is both an operating system and an API for
parallel programs. The Trollius environment can be used on workstations with
Trollius LAM, which also provides the MPI interface.
  Tools to simulate a parallel computer
Some tools already exist to simulate the execution of a parallel application on
some particular hardware. The application is transformed, either manually or
automatically, into a sequential program. This program simulates all the events
that would have occurred on the target machine during a real run of the
application. The result of the simulation can then be examined with the
appropriate tools.
Proteus [Bre, BDCW]: This tool allows a quite realistic simulation. First, at
compile time, the cost of each basic block of the application is evaluated, and
code is inserted so that this cost can be taken into account at simulation
time. Then the application is simulated sequentially by a simulation engine,
which is responsible for:
  - maintaining a virtual clock;
  - sharing the CPU of the simulating machine among the different virtual
    nodes simulated;
  - simulating the communications.
EPPP simulator [RHS]: EPPP is a complete programming environment including a
simulator based on Proteus. The evaluation of computation time has been
improved by performing the compile-time analysis of each basic block on the
assembly code generated for the target machine.
Besides those cited, some other work has been done:
EGPsim [PY], Tango [DGH], PEET [GNS].
 Specification of the simulation tool
Our work aims at providing a tool able to simulate a wide range of parallel
computers, so it has to be highly parameterizable. Moreover, we want to be able
to simulate real applications, that is to say applications that deal with a
large amount of data and that are also quite demanding in CPU power.
Consequently, we want to parallelize the simulation of a machine that is itself
a parallel computer, and this is probably the main specificity of our work.
Last, we want the tool to be portable.
Nevertheless, even though we want to deal with big applications, the main
objective is the accuracy of the simulation; the time needed to perform the
simulation comes only second. Of course, it has to remain reasonable enough to
test applications with massive parallelism.
Trace generation is a major feature of the simulator. The generated trace
files can then be analyzed with classical tools (cf. the section on trace
generation).
This work is original for two reasons. First, until now there has been no
general tool that allows simulating a parallel computer at the application
level, that is to say, answering the question: how much time will this
application take on this computer? Other work is in progress, but it is either
dedicated to a single machine or focused on very high accuracy of simulation at
the hardware level, and is therefore restricted to simulating toy applications.
The second specificity is that we are the first to parallelize the simulation,
which eliminates the need for a huge computer to deal with real applications.
In fact, there are some theoretical reasons (cf. the section on the
parallelization of the simulation) that probably prevented other people from
doing so before, but we will show how we can circumvent them in our specific
case.
  Modeling of an architecture
We deal exclusively with distributed-memory machines linked by a classical
communication network: point-to-point, multi-stage or crossbar.
The tool has to be parameterizable enough to model the target machine in a
fixed format. Here are the parameters we chose:
  - The power of the computing processors.
  - The topology of the communication network and the routing strategy.
  - The protocol used for communication: circuit-switched, store-and-forward
    or wormhole.
  - The bandwidth of the links.
  - The switching time of the routers.
  - The timings associated with the initialization of a transmission that do
    not take CPU time.
  - The CPU timings associated with a transmission.
  - The packet size, if packetization is used.
  - The flit size, in the case of wormhole routing.
  - Some ad hoc options to deal with particular buffering schemes.
All these parameters are simply given by the user in a configuration file that
is read at the beginning of the simulation.
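As an illustration only, the parameter list above could be held in a small record and read from a text file. The sketch below is in Python; the field names, the `key = value` file syntax and the `parse_config` helper are all assumptions made for the example, not the format actually used by the tool.

```python
from dataclasses import dataclass
from typing import Optional

PROTOCOLS = {"circuit-switched", "store-and-forward", "wormhole"}


@dataclass(frozen=True)
class MachineConfig:
    """Hypothetical record of the parameters listed above (names are illustrative)."""
    cpu_power: float           # scalar power of one computing processor
    topology: str              # topology and routing as one parameter, e.g. "mesh-xy"
    protocol: str              # one of PROTOCOLS
    link_bandwidth: float      # bandwidth of the links
    switch_time: float         # switching time of the routers
    startup_time: float        # transmission set-up time that uses no CPU time
    cpu_overhead: float        # CPU time charged for a transmission
    packet_size: Optional[int] = None   # None: messages are not split
    flit_size: Optional[int] = None     # only meaningful for wormhole routing


def parse_config(text: str) -> MachineConfig:
    """Parse 'key = value' lines; '#' starts a comment (assumed syntax)."""
    text_fields = {"topology", "protocol"}
    int_fields = {"packet_size", "flit_size"}
    values = {}
    for line in text.splitlines():
        line = line.split("#", 1)[0].strip()
        if not line:
            continue
        key, raw = (part.strip() for part in line.split("=", 1))
        if key in text_fields:
            values[key] = raw
        elif key in int_fields:
            values[key] = int(raw)
        else:
            values[key] = float(raw)
    config = MachineConfig(**values)   # unknown or missing keys raise TypeError
    if config.protocol not in PROTOCOLS:
        raise ValueError(f"unknown protocol: {config.protocol}")
    if config.protocol == "wormhole" and config.flit_size is None:
        raise ValueError("wormhole routing requires flit_size")
    return config
```

Reading the whole description once, before the simulation starts, keeps the engine itself independent of any particular target machine.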
Let us look more closely at these different options.
Power of computing processors: Our choice is quite simple: the computing
processor is modeled by a single scalar, for instance the Mflops rating given
by the Linpack benchmark. This choice is a limitation, since the relative
timings of different CPUs also depend on the type of computation. An approach
like the one used in the EPPP project allows more accuracy, but is beyond the
scope of our approach.
Nevertheless, for a wide range of problems, estimating the computation time on
a target processor by scaling the computation time measured on the simulator's
processor gives acceptable accuracy; this is particularly true for processors
of the same family (for instance RISC). In fact, the loss of precision is no
greater than what is observed when changing from one compiler to another for
the same processor, or when merely changing compiler options. Table 1 shows the
worst case, where we compare the efficiency of some processors in very distinct
domains with the Linpack benchmark (dense linear algebra), the Whetstone
benchmark (a mix of integer and floating-point operations) and the Dhrystone
benchmark (integer operations).

linpack whetstone dhrystone
Alpha   
Mips   
Sparc   
i   
RSK   
Table 1: processor comparison. The numbers give a power estimation (the unit
is not meaningful); Alpha is taken as the reference.
Topology and routing: Topology and routing cannot be dissociated, so they are
represented by a single parameter. Of course, in some cases several routing
strategies are possible for the same topology. The currently available
topologies are ring, mesh and hypercube, but provision has been made so that
new ones are very easy to add. Classic routing strategies have been
implemented: XY routing for the mesh and e-cube routing for the hypercube.
More complicated routing schemes, such as hot-potato routing, can also be
implemented easily.
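To make the two classic strategies concrete, here is a minimal Python sketch of the textbook versions of XY routing and e-cube routing. The node encodings (coordinate pairs for the mesh, integer labels for the hypercube) and the choice of correcting the lowest differing bit first are assumptions for the example; the simulator's own routing functions may differ.

```python
def xy_route(src, dst):
    """Mesh nodes (x, y) visited from src to dst: travel along X first, then Y."""
    x, y = src
    path = [(x, y)]
    while x != dst[0]:
        x += 1 if dst[0] > x else -1
        path.append((x, y))
    while y != dst[1]:
        y += 1 if dst[1] > y else -1
        path.append((x, y))
    return path


def ecube_route(src, dst):
    """Hypercube node labels visited, correcting differing bits from bit 0 upward."""
    path = [src]
    node = src
    diff = src ^ dst          # bits in which source and destination differ
    bit = 0
    while diff:
        if diff & 1:
            node ^= 1 << bit  # cross the link of dimension `bit`
            path.append(node)
        diff >>= 1
        bit += 1
    return path
```

Both functions are deterministic: the path depends only on the source and the destination, which is what allows topology and routing to be described by a single parameter.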
Communication network protocol: The simulator can simulate different routing
protocols: wormhole with a specified flit size, circuit-switched, or a
theoretical idealized protocol in which the physical transmission on the
network is not really simulated; instead, the transfer time is assumed to
follow a law that is linear in L and d, where L is the length of the message,
d is the number of links between the two nodes, and the coefficients (among
them $\tau$) are numerical constants characterizing the network.
The simulator also takes into account half-duplex as well as full-duplex links.
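A sketch of such an idealized law is given below in Python. It assumes an affine form with a constant start-up term, a cost per unit of message length and a cost per link crossed; the three parameter names are hypothetical and are not the symbols of the report.

```python
def idealized_transfer_time(length: float, hops: int,
                            startup: float, per_unit: float,
                            per_hop: float) -> float:
    """Transfer time of a message of `length` units crossing `hops` links.

    Assumed form: startup + length * per_unit + hops * per_hop.
    """
    if length < 0 or hops < 0:
        raise ValueError("length and hops must be non-negative")
    return startup + length * per_unit + hops * per_hop
```

With this protocol, no link is ever marked busy, so contention on the network is ignored by construction.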
Numerical constants for the network: There is little to say about them; simply
note that we account separately for the operations that require the computing
processor and for those that occur in parallel with it.
Packet size: This optional parameter signals that a message is cut into
fixed-size fragments before transmission on the network.
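The cutting of a message can be sketched as follows in Python. The function name is hypothetical, and the example assumes that the last fragment is simply shorter when the length is not a multiple of the packet size (a real machine might pad it instead).

```python
from typing import List


def split_message(length: int, packet_size: int) -> List[int]:
    """Return the sizes of the fragments of a message of `length` bytes."""
    if packet_size <= 0:
        raise ValueError("packet_size must be positive")
    if length < 0:
        raise ValueError("length must be non-negative")
    full, rest = divmod(length, packet_size)
    sizes = [packet_size] * full
    if rest:
        sizes.append(rest)    # assumed: the last fragment is not padded
    return sizes
```

Each fragment then travels on the network as an independent transmission.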
Other parameters: A compromise is necessary between having a generic tool
dealing with a wide range of parallel computers and simulating a particular
machine as accurately as possible. That is why we introduce some ad hoc
parameters that are useful to take into account features specific to one
machine, for instance:
  - The iPSC buffering protocol, which ensures that room is available on the
    destination node before a message is sent [Int].
  - The behavior of the Paragon, which also has a particular buffering
    protocol [PR].
 The available application programmer interfaces
It is important to deal with existing applications, but they are written for
different machines with different APIs (CMMD, NX, PVM, MPI, ...), so we decided
to provide several APIs.
For historical reasons, we first provided a library conforming to the Chorus
interface specification. For practical reasons, the other interfaces have been
implemented on top of the Chorus one, with the help of some functions that
ensure that the intermediate level does not introduce any discrepancy in the
simulation.
At this time, the Chorus and NX interfaces are available.
 Related issues
In this part, we do not consider the message-passing facilities as part of the
operating system. In other words, the term operating system denotes the
functionality provided to the application by the execution environment, with
the exception of the message-passing library.
We have seen so far how we model an architecture and what range of applications
we can simulate; it appears that we have left aside the modeling of the
operating system of the target machine. This matters, insofar as the operating
system can dramatically influence the performance of a machine in some cases.
On a multi-tasked node, it determines the scheduling policy. On some systems,
it manages a virtual memory space, possibly with paging or swapping. Given the
number of different operating systems, it is not reasonable to take all
possibilities into account in a general tool. So we adopt a conservative
approach and make some simple choices: for instance, we assume a simple
round-robin scheduling if there are several threads or processes on one node,
and we assume no swapping or paging. Our approach seems acceptable for several
reasons:
  - Most parallel applications do not rely on paging because, to provide
    acceptable performance, it is generally necessary that all code and data
    fit in physical memory. The system impact of memory management will
    therefore generally be negligible, and it appears reasonable to ignore it
    in the simulation.
  - Most parallel applications using message passing do not rely much on the
    operating system, except at load time and for input/output operations. In
    particular, most of the time there is only one application process per
    node, which eliminates the problems related to scheduling policies. As
    regards input/output operations, it is true that our simulator will not
    give any insight into applications where such operations dominate
    computation.
  - On machines that allow only a single application at a time, the operating
    system has hardly any influence on the behavior of the application. On
    multi-user machines, which are becoming more common, the operating system
    introduces random perturbations, but within the scope of our project we
    are not interested in reproducing this kind of perturbation. On the
    contrary, such perturbations make the analysis and performance tuning of
    the application more difficult. So the fact that the machine we simulate
    is more deterministic than the real one is, most of the time, an
    advantage.
 Simulation by discrete events
We present here the simulation method we used: an overview is first given at an
abstract level, and the notions introduced are then explained further.
  Structure of the simulation engine
First, we need a set of variables that represent the global state of the
simulated computer at a given time. The simulation progresses by way of
transitions: a transition modifies the variables to represent a new state of
the machine and increases the time by a specific amount. The global state of
the machine thus changes only at precise, countable points in time, which is
why we speak of discrete event simulation. One important structure that is
maintained is a queue of events. Two attributes are associated with each event:
its nature and the time at which it will occur, called the time stamp of the
event. Events are unavoidable when simulating a complex system whose parts
evolve separately and interact at certain times; they are in fact a
representation of these interactions, as the examples below will show.
We consider all changes to the state of the system to be atomic. Actions that
last are therefore represented in the model by two changes, one at the
beginning of the action and one at the end; the evolution of the real system in
between should not be meaningful within the scope of the simulation.
At a given time t, let Q be the queue of events and S be the state of the
system. The simulation engine basically consists of repeating the following
algorithm:
  1. Remove from Q an event e with the smallest time stamp.
  2. Modify S by taking into account the occurrence of e. During this
     modification, events can be created and inserted into Q.
At each stage of the algorithm, the virtual time of the simulation is given by
the time stamp of the event e being processed.

 Simple example of simulation
Let us take a simple example with the following situation: we have a computer
with three nodes A, B, C in a row. Node A and node B each compute something and
then send the result to C. We use an oversimplified model in this section. To
communicate, we must first acquire, one after the other, the links needed to
reach the destination node; a free link can be acquired instantaneously. Then
the communication itself takes a constant time. The computations of A and B and
the communication each last a fixed number of time units, with B finishing its
computation before A.
The different stages of the simulation are as follows:
A state  B state  A-B link  B-C link  Q
-        -        Free      Free      init computation
Busy     Busy     Free      Free      end B, end A
Busy     Idle     Free      Active    end A, end B->C
Idle     Idle     Active    Active    end B->C
Idle     Idle     Active    Active    end A->C
 Choice of a time scale
When doing discrete event simulation, one generally has to choose between a
discrete time model, where time increases by multiples of a specified time
unit, and a continuous time model, where time can take any value. In our case,
all the nodes of a parallel machine are asynchronous, so there is no
appropriate discrete time (such as a global clock signal) on which to base the
simulation. We therefore choose a continuous time model; in fact, this can be
considered an extreme case of the discrete time model, where the chosen unit is
very small compared to the typical time intervals used in the model.
Simultaneous occurrence of two events: In a continuous time model, the
probability of two events occurring simultaneously is zero, unless a strong
dependency exists between them, in which case it is taken into account in the
model. So we choose to ignore this kind of situation should it occur: in our
simulation, events can always be ordered. In a real computer, however, many
things are based on clocks, so for some hardware parts all events occurring
within one period should be considered simultaneous. We do not take this into
account, because it would be inconsistent with the level of our model.
 Representing the state of the machine
Having presented the general concepts, we now describe more precisely how we
apply them in our case.
The representation of the state of a machine depends somewhat on its
architecture, but for all machines falling within the classification of the
section on modeling an architecture, the state can be roughly decomposed as
follows:
  - For each link of the communication network, we store its state (active or
    idle) and, at the attached routers, the list of messages blocked waiting
    for this link to become available.
  - For each message in transit on the network, we must know the list of links
    it is currently monopolizing.
  - The actual code of an application process is in fact executed in a real
    process, with a library that redirects message-passing calls so that they
    interact with the simulation engine.
  - For each node, we maintain a list of requests blocked waiting for some
    resource before they can be processed, a list of received messages not yet
    picked up by an application process, and some other structures depending
    on which flow-control protocol is used.
This is a simple model, but note that it can easily be extended: for example,
if the routing function is not fixed, the router must maintain some additional
state, etc.
 Some examples of events
We now describe the most representative types of events that can occur and the
corresponding actions that must be taken when processing them; they depend on
the architecture we are simulating. We again use the notation Q to designate
the set of events managed by the simulation engine (cf. the section on the
structure of the simulation engine).
Treatment of an application send request: This event is generated when an
application process reaches a send library call. If the packet-splitting option
is on, the message is first cut and transformed into several requests, each
corresponding to a new event of type internal transmission. This is also where
we deal with special buffering protocols, if any.
 Circuit-switched case
Treatment of an internal transmission request: Let s be the source site, t the
sending date and d the target site. The following work is done:
  - Acquire on s the resources needed for emission; this can lead to sleeping
    (see the explanation below) if a resource is not immediately available.
  - Compute t', the date at which the resources are ready to use (there can be
    a switch time associated with some resources). Insert into Q an event of
    type routing at date t'.
The action described here is atomic if we do not need to sleep. If we have to
wait for some resource to become available, we insert the information needed to
do the rest of the processing into the queue of blocked requests associated
with that resource. It will be processed when another event frees the
resource.
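The "sleep" mechanism can be pictured as parking the rest of the processing on the resource. In the Python sketch below, a hypothetical `Resource` class stores one continuation per blocked request and runs it when the resource is released; the first-come, first-served order is an assumption of the example.

```python
from collections import deque


class Resource:
    """A link or buffer that a single request can hold at a time."""

    def __init__(self, name, switch_time=0.0):
        self.name = name
        self.switch_time = switch_time   # delay before the resource is usable
        self.busy = False
        self.blocked = deque()           # continuations of the blocked requests

    def acquire(self, now, continuation):
        """Run continuation(ready_date) at once if free, else park it."""
        if self.busy:
            self.blocked.append(continuation)
            return False
        self.busy = True
        continuation(now + self.switch_time)
        return True

    def release(self, now):
        """Free the resource and resume the oldest blocked request, if any."""
        self.busy = False
        if self.blocked:
            self.acquire(now, self.blocked.popleft())
```

In a simulator, the continuation would typically insert the routing event into Q at the ready date it receives.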
Treatment of a routing event: Let n be the node at which the message is
arriving.
If n is the final destination of the message, try to acquire on this node the
resources needed for delivery (as in the case of a transmission event), compute
the time at which the delivery will be completed, and insert into Q an event of
type end of transmission.
If n is an intermediate node, compute the next node n' with the appropriate
routing function. Then wait for the resource corresponding to the link between
n and n' to become available, and add a switch time to obtain the final date at
which to insert into Q an event of type routing for the node n'.
Treatment of end of transmission: All resources used for this message are
freed, in particular the links along the path between sender and receiver. This
can trigger some actions that were waiting for the corresponding resources.
Depending on the required simulation precision, the resources can be freed
successively instead of all at once; in this case, several events are in fact
generated. The state of the destination node is changed so that the new message
is taken into account; if an application process was blocked waiting for such a
message, it is resumed.
 Wormhole case
Treatment of an internal transmission request: As in the circuit-switched case,
we acquire the resources necessary to reach the second node (basically a link),
we compute the date at which these resources will be ready to use, and we
insert an event of type routing into Q. Of course, the timing constants will
generally not be the same as in the circuit-switched case.
Treatment of a routing event: If the end of the message has left the source
node, free the last link used by the tail of the message.
Then we execute the same operations as in the circuit-switched case. If we are
not at the final destination, compute the next node with the appropriate
routing function, acquire the necessary resources and insert an event of type
routing into Q. If we are at the final node, insert an event of type
transmission end phase.
Treatment of transmission end phase: If the message has entirely arrived at the
final node, the actions taken are similar to those for an end of transmission
in the circuit-switched case. Otherwise, perform one step of transmission: the
message progresses according to the flit size; if the tail of the message has
left the source node, free the last link at the tail of the message; then
insert into Q an event of type end of transmission at the date corresponding to
the completion of the previous operations.
 Parallelization of the simulation
  Constraints
We have seen that there is a simple sequential simulation algorithm. There are
several points to consider before starting with parallelization:
  - First, it is quite obvious that we cannot parallelize the evaluation of a
    single transition, because a transition consists either of a very simple
    action or of the execution of the code of an application process, which is
    sequential by nature.
  - We therefore have to investigate how several events can be processed in
    parallel. The problem that arises is one of coherence: the simulation
    algorithm must ensure that the results of the state transitions are
    exactly as if they had been processed sequentially in chronological order.
The coherence constraint implies that, in general, all sites of the simulating
machine have to be synchronized at each time jump, which in practice prevents
any actual parallelism. In the following subsections we examine different
methods to partially remove this constraint.
 Using the latencies of the simulated machine
We use some topological knowledge of the machine we are simulating. After
decomposing the machine, we can associate a location with every event. We then
use the fact that there is a minimal latency of propagation between the
different components of the simulated machine. More formally, we represent the
components of the machine by a connected graph whose vertices are the computing
nodes and the routers of the machine; we denote the components by
$s_1, s_2, \ldots$ The following property then holds, as a consequence of the
hardware latency:
  - For each edge $(s_i, s_j)$ there exists a minimal latency $l_{i,j}$ such
    that every event generated when evaluating a transition on $s_i$ and
    associated with the site $s_j$ has a time stamp greater than
    $t_i + l_{i,j}$, where $t_i$ is the time stamp of the current transition.
We can now generalize the latency to any pair of components, even if they are
not directly connected, by saying that the corresponding latency is equal to
that of a path of minimal latency between the two components (in graph terms,
the shortest path). Let $h(x)$ denote the time stamp of an event $x$. At each
stage of the simulation algorithm, we can choose as the next transition to
compute any event $e' \in Q$ satisfying $h(e') \le h(e) + l_{s,s'}$, where $e$
is the event with the smallest time stamp and $s$, $s'$ are the vertices
associated with $e$ and $e'$. The simulation algorithm becomes
non-deterministic and so has an inherent potential for parallelism.
In the corresponding distributed algorithm, Q and S are in fact distributed
among a certain number of processes. Each process is responsible for one site
and deals with all the transitions associated with this particular site. On
each process, the following algorithm is executed:
  1. Let t be the smallest time stamp among the events owned locally, and e
     the corresponding event.
  2. For every other site q, wait until the process associated with q reaches
     time $t - l_{q,p}$, where p is the local site.
  3. Modify S by taking into account the occurrence of e. During this
     modification, events can be created; they are dispatched to the
     appropriate site.
The parallelism provided by this algorithm depends essentially on two factors,
through their ratio: the typical interval between two events on the same site
and the typical latencies between sites.
In practice, in our case, the only sites whose local state progresses
independently of the other sites are the compute nodes. All other sites
(routers, links) are essentially driven by external events, that is, events not
generated on the same site, which roughly means that there is a synchronization
with the neighborhood at each event. The benefit of the latency is therefore
limited here: an upper bound for the efficiency is the diameter of the graph if
the communication costs on the simulation host are null, and in practice there
is no exploitable parallelism at this level between such sites. Although it
seems that we cannot gain much parallelism here, the features described can be
useful when used in conjunction with the algorithm described below.

Master-slave organization
There is little hope of parallelizing the simulation of the communication
network (in fact we could parallelize it using predictive actions, but that
would not be an effective solution in our case), but we can try to preserve the
parallelism inherent in the application by distributing the computations done
by the different application processes. One node of the simulation host will
take charge of several nodes of the simulated machine. There will be a master
process that simulates the communication hardware and delivers, in
chronological order with regard to virtual time, the messages exchanged on the
network to the application processes, which will be called the slaves. Each
slave informs the master of the virtual time reached by the nodes it simulates.

Cutting the computation phases
On the compute nodes, a process evolves locally between two calls to the
message-passing library. At this point, if we stay with the clean model
introduced earlier, a computation phase is just considered as one transition,
and the problems described above will not be solved. But a computation phase is
something that can be decomposed. If we do so, then in the middle of a
computation we can inform the other simulation sites of the simulation point we
have reached, so that they can possibly start other computation phases that
proceed in parallel with the rest of the current computation. The problem is
that there is no obvious decomposition: it would be too costly to decompose at
the instruction level, and that would make computation time hard to evaluate.
Conversely, if we decompose the computation with too coarse a granularity, we
will lose any parallelism. The solution to this problem is to allow an
interrupt-driven decomposition: when the master must wait for a slave to reach
a certain duration before starting a computation phase on another node, it
interrupts the computation phase; the slave answers whether or not it has
reached the critical point and, if not, sets a timer so that it can inform the
master as soon as it reaches this critical point. Moreover, when several
virtual nodes are simulated on a single real node, we will see later that it is
essential to be able to switch between the several processes representing a
virtual node in the middle of computation phases. So from now on we will assume
that computation phases can be dynamically cut into several parts.
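The interrupt-driven decomposition can be sketched as a small Python class. This is a hypothetical illustration, not the report's code: the class name, the `on_interrupt` and `advance` methods and the `notify` callback are assumptions, and the timer is modelled in virtual time rather than with a real system timer.

```python
class ComputationPhase:
    """A computation phase that can be cut at an arbitrary virtual time."""

    def __init__(self, start, duration):
        self.now = start                # virtual time reached so far
        self.end = start + duration     # next communication point
        self.pending = None             # critical point awaited by the master

    def on_interrupt(self, critical):
        """The master asks whether `critical` is reached; if not, arm a timer."""
        if self.now >= critical:
            return True
        self.pending = critical
        return False

    def advance(self, dt, notify):
        """Compute for dt more units; notify the master once the timer fires."""
        self.now = min(self.end, self.now + dt)
        if self.pending is not None and self.now >= self.pending:
            notify(self.pending)
            self.pending = None
```

For example, a phase of length 10 interrupted with a critical point of 4 answers "not yet", and the master is notified as soon as the computation crosses virtual time 4.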

Algorithm on an ideal simulation host
Let $V$ (for Virtual) be the number of nodes of the simulated machine and $R$
(for Real) the number of nodes of the simulating machine. We consider here that
the simulating host has the following properties:
  • Computation phases can be cut into infinitely small parts without overhead.
  • Communication fully overlaps with computation.
  • The average latency of small messages between a slave and the master is
    $\lambda$.

Algorithm of the master
Our simulation model has changed a bit since the earlier sections with the
introduction of interruptions in computation phases, but we still have a set of
events $Q$, which is completely managed by the master.

The algorithm on the master is then:
  1. Let $e$ be the first event in $Q$.
  2. Wait until every application process is either inactive (waiting for a
     message or for some information from the master) or known to have reached
     a point later than the time stamp of $e$ (more precisely, later than the
     time stamp of $e$ minus the latencies described earlier).
  3. Take into account the event $e$. That can result in starting a computation
     on a slave.
  4. Go to step 1.
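A minimal Python sketch of this loop is given below. It assumes, for illustration only, that events are `(time_stamp, site)` pairs in a heap, that `reached[n]` is the virtual time reported for node `n`, that `inactive[n]` tells whether node `n` is waiting, and that the latency subtracted in step 2 is the one between node `n` and the site of the event. Waiting is modelled by leaving the loop.

```python
import heapq

def can_process(t_event, site, reached, inactive, lat):
    """Step 2: each node is inactive or has reached t_event - lat[n][site]."""
    return all(inactive[n] or reached[n] >= t_event - lat[n][site]
               for n in range(len(reached)))

def master_round(queue, reached, inactive, lat, handle):
    """Process events in time-stamp order for as long as step 2 allows."""
    processed = []
    while queue:
        t_event, site = queue[0]                      # step 1: first event
        if not can_process(t_event, site, reached, inactive, lat):
            break                                     # step 2: keep waiting
        heapq.heappop(queue)
        handle(t_event, site)                         # step 3: apply the event
        processed.append((t_event, site))
    return processed                                  # step 4 is the loop
```

In a real master, `handle` would update the simulated network and possibly start a computation on a slave; here it is a hypothetical callback.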

Algorithm of the slave
Let $N = V/R$ be the number of virtual nodes managed by a slave. We can
represent the state of the slave by $(t_i)_{i \le N}$, each $t_i$ representing
the virtual time reached by one of the managed nodes. The vector $t$ represents
the progress of the simulation on one slave.

When a virtual node reaches a communication point, it must wait for the
master to confirm that all slaves have reached this point. We will say that the
node is blocked.

The algorithm is composed of the following actions:
  • Let $S$ be the set of nodes that are not blocked. Advance their computations
    uniformly, that is, run the node of $S$ with the smallest $t_i$. We assume
    that we can switch with an infinitely small granularity between nodes of
    $S$, which means that a subset of the $t_i$ will increase at the same time.
  • When a node becomes blocked, we ask the master to inform us when the
    corresponding $t_i$ has been reached globally.
  • Messages received from the master unblock a node.
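The scheduling rule of the slave can be sketched in Python as follows. The sketch replaces the infinitely small granularity by a finite `quantum` and assumes that the next blocking date of each node is known in `next_block`; both, like the function name, are assumptions made for the example.

```python
def run_slave_step(t, next_block, blocked, quantum):
    """Advance the unblocked node with the smallest t_i by at most `quantum`.

    Returns the index of the node that ran, or None if all nodes are blocked.
    """
    ready = [i for i in range(len(t)) if not blocked[i]]
    if not ready:
        return None
    i = min(ready, key=lambda k: t[k])      # node of S with the smallest t_i
    t[i] = min(t[i] + quantum, next_block[i])
    if t[i] >= next_block[i]:
        blocked[i] = True                   # communication point: ask the master
    return i
```

A message from the master would then simply reset `blocked[i]` to `False` and set a new `next_block[i]`.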
	 Eciency analysis
Now we will study the eciency of this algorithm It will be done on a virtual
application that we dene as follows
  Compute phases have average duration of time M 
  All process are busy at any time

The execution prole of each process will be compute phases with 
M
com 
munication points by unit of time
We now consider a particular slave. Let us try to determine the conditions
necessary to avoid idle states in the simulation. We consider a cycle beginning
at a time when we receive a message from the master unblocking a node. Let $t$
be the state of the slave as defined in the description of the slave algorithm.
The critical path to go to the next cycle consists of several repetitions of
the following:
  • The master unblocks a node of a slave.
  • The computation on this node progresses until the next global blocking
    point; on average that means an $M/V$ computation, under some conditions
    (see below).
  • The slave returns its status to the master.
These steps are represented in the example space-time diagram of the figure
"Critical path representation".
There will be on average $R$ steps before returning to the initial slave, which
means a critical path of length $R \times (M/V + 2\lambda) = MR/V + 2\lambda R$,
where $\lambda$ is the slave-master message latency. The amount of computation
in one cycle is $M$. We have no idle time if $M \ge MR/V + 2\lambda R$. As in
practice we will have $V \gg R$, the remaining condition is
$M \ge 2\lambda R \iff R \le M/(2\lambda)$.
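The condition can be checked numerically with a short Python sketch. It assumes that each of the $R$ steps of the critical path costs an $M/V$ computation plus two master-slave messages of latency `lam`; the function names and the numerical values used afterwards are illustrative, not measurements from the report.

```python
def critical_path(M, V, R, lam):
    """Length of the critical path of one cycle: R steps of (M/V + 2*lam)."""
    return R * (M / V + 2 * lam)

def no_idle_time(M, V, R, lam):
    """True when a cycle's computation M covers the critical path."""
    return M >= critical_path(M, V, R, lam)
```

With `M = 1000`, `V = 64` and `lam = 100` (same time unit), four slaves give a critical path of 862.5 and no idle time, whereas eight slaves give 1725 and the slaves become idle. The approximate bound `M / (2 * lam) = 5` is only reached when `M / V` is negligible.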


Figure: Critical path representation (space-time diagram with time axis t for
Host 1 and Host 2, showing the communication operations and the critical path).
Now we must justify that just average considerations lead us to a valid
result For that at a beginning of a cycle on a particular slave let Wa denote

the amount of computation already done by anticipation that means if the
node just unblocked is at time t Wa  
P
i ti  t Let W be the total amount
of work that can be done from t W  
P
i t
 
i  t where t
 
i is the next blocking
date of node i On average W  MV
R
 An example of what represent Wa and
W is given in gure 
Figure: W and Wa at beginning of a cycle on one slave (legend: Wa part, W part,
communication point).
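As a small worked example of these two quantities, the following Python sketch computes $W_a$ and $W$ for one slave. The function names and the sample times are assumptions chosen for illustration.

```python
def anticipated_work(t, t_now):
    """W_a: computation already done ahead of the node just unblocked."""
    return sum(ti - t_now for ti in t)

def available_work(t_next_block, t_now):
    """W: total work that can be done from t_now until every node blocks."""
    return sum(tb - t_now for tb in t_next_block)

t = [4.0, 6.0, 9.0]            # virtual times reached by the three nodes
t_next = [8.0, 7.0, 12.0]      # their next blocking dates
w_a = anticipated_work(t, 4.0)
w = available_work(t_next, 4.0)
```

Here the node just unblocked is at time 4, so `w_a` is 7 and `w` is 15: 8 units of computation remain before all three nodes are blocked.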
In the computation of the critical path length, we said that the time from
a blocking point to the next global blocking point was $M/V$; for that we
assumed that, at the time the master sends the unblocking message, it also
knows the point in time of the next blocking point. A sufficient condition for
that is that, for all slaves, the last $W_a$ were greater than $M$. So now we
have to look at what happens across several cycles. If the $W_a$ are too small,
the critical paths are longer, but then, before leading to idle states, the
$W_a$ will increase. When the $W_a$ are greater than $M$, and as we must have
$R \le M/(2\lambda)$, the critical path between two cycles is smaller than $M$,
so the $W_a$ decrease. So on average the $W_a$ values will stabilize somewhere
below $M$. If $V \gg R$, the maximum value for $W_a$ is several times $M$,
which ensures that it is valid to reason with average values. Note that average
values are taken among different nodes at one time, and not along time. If all
nodes do small computations at the same time, then the efficiency will drop.
Last, we have to see whether we can generalize our analysis of a special kind
of application to more general cases. The point is to see how idle times in the
virtual case influence the performance of the simulation. Let us now consider
that $M$ represents in fact the interval between the starting points of two
computations, so $M$ is decomposed into a computation part $M_c$ and an idle
part $M_i$. Of course we always consider average values. The algorithm does not
change at all; simply, some nodes are idle instead of busy, and the new formula
for $W$ is now $W = \frac{V}{R}\left(\frac{M}{2} - M_i\right)$, so we need
$M_c > M_i$ and preferably $M_c \gg M_i$ ($W$ must be several times greater
than $M$). The condition on $V$ and $R$ becomes
$\frac{V}{R} \gg \frac{M}{M/2 - M_i}$. All the other reasoning remains valid,
and the efficiency will then be optimal under the same condition
$R \le M/(2\lambda)$.


Simulation on a realistic machine
We have just studied the algorithm on an ideal machine. The strong assumption
was that we could share one CPU of the simulation host between different
logical processes representing the nodes of the application with an infinitely
small granularity. In reality, what we can do is switch between threads or
processes with a granularity $g$ depending on the system and the implementation
(logical processes representing nodes can be managed by several Unix processes
or by several threads within a single Unix process).

So we can redo the previous study with $g$. The only important modification
occurs in the calculation of the critical path length, when we take into
account the time necessary to reach the next global point. That was $M/V$,
which must now be replaced by $\max(M/V, g)$, so we have the supplementary
condition $Rg \le M \iff R \le M/g$. To ensure the validity of this limit,
$W_a$ must now be greater than $M + gR$, but this change is not very important
for the stability of $W_a$ as long as $R \le M/g$, which implies
$M + gR \le 2M$.
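The effect of the granularity can be illustrated with a short Python sketch that computes the largest number of real nodes keeping the slaves busy. It assumes that each step of the critical path costs $\max(M/V, g)$ plus two master-slave messages of latency `lam`; the function name and the sample values are hypothetical.

```python
def max_real_nodes(M, V, lam, g):
    """Largest R such that R * (max(M/V, g) + 2*lam) <= M."""
    per_step = max(M / V, g) + 2 * lam
    return int(M // per_step)
```

With `M = 1000`, `V = 64` and `lam = 100`, a granularity `g = 50` allows 4 real nodes, and `g = 300` only 2. When `lam` is negligible the bound tends to `M / g`, which is the supplementary condition of the text.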

Limitations due to the application
All the efficiency considerations discussed so far did not take into account
the time necessary to transfer the data messages between the different
simulated nodes. We only considered the messages necessary for the coherency of
the simulation. It is quite obvious that if an application cannot be run on the
simulating host because of the communication bottleneck with a normal
message-passing library like NXlib, MPICH or PVM (cf. the section on these
libraries), there is no hope of compensating for that when adding the coherency
constraints and ordering of the target machine simulation. So what we have
determined are the conditions under which the simulation can run at a speed of
the same order as with a simple message-passing library on the simulating host.

Implementation presentation
For portability and simplicity reasons, our environment is built on top of PVM.
There are three kinds of PVM tasks: the slaves (noted S), which manage the
different virtual nodes; the main simulation engine (noted M, like master),
which simulates the communications on the virtual hardware; and a certain
number of application processes (noted T, like thread), each representing a
virtual node. A slave and its attached nodes are all run on the same CPU.
These dierent entities will be interconnected cf gure  by several kinds
of communication channels

S
S
T
T
T
T
T
M
Figure  overview of the simulator organization
  The channels MS allow the slaves to cooperate with the master to allow
progression of the simulation
  The channels TS allow a slave to dispatch the CPU between the dif 
ferents threads and gathering application messages information that is
further sent to the master
  The channels TT allow raw data of the application to transit directly
between application processes Only information about such messages
transit by the master
In a future version we will perhaps change this implementation a bit so as
to run a slave and all its attached processes within one single PVM task. In
any case, this is a technical detail to minimize switching time between the
different processes on one CPU.

Trace generation
The simulation can be exploited in two ways. On the one hand, the timings
measured by the application are virtual times, identical to those that would
have been measured on the target machine, which allows a superficial analysis
of the application. On the other hand, it is possible to generate a trace file
during the simulation. This file can then be examined with existing tools to do
a post-mortem analysis. Note that the trace file obtained corresponds to a
neutral observation of the execution, which is almost impossible with a real
machine.

We chose the PICL [vRT, Wor] trace file format, which allows us to use
ParaGraph [HE] to display the results of the simulation.

The trace generation is done at the master site, which has all the information
about circulating messages and the activity of the computing processors.

Validation of the simulator
In this part we present several results obtained with the simulator. To do
these tests, we took several programs written for the iPSC/860 and ran them
both on the real iPSC/860 and with the simulator.

The first test (ping-pong figure) is a ping-pong test. It is a simple test to
verify that the parameters are correct for the target machine.

Then we took two algorithms that come from the ScaLAPACK package, which deals
with numerical linear algebra operations on parallel computers [ABD].

The first one (LU figure) does an LU decomposition of a matrix, then solves
several linear equations by using this decomposition. We report the execution
time of these two phases for several matrix sizes.

The second program (QR figure) does a QR decomposition, then also solves
a linear system. As for LU, we indicate the time of each phase.

These results show that the simulator has good accuracy. In our case we
were simulating on workstations with SPARC processors. The difference
between the real and simulated execution times is essentially due to the
non-constant power ratio between these two processors, but nevertheless we can
see that it causes only a small bias.

Figure: ping-pong simulation for the iPSC/860 (transmission time in ms versus
message size; curves: iPSC860, Simulation).
Figure: LU simulation on [...] nodes (time in s versus size; curves: LU i860,
Ax=b i860, LU simulation, Ax=b simulation).

Figure: QR simulation with [...] nodes (time in s versus size; curves: QR i860,
SOL i860, QR simulation, SOL simulation).

Conclusion
Even at this stage of our work, we obtain some promising results. This tool
seems to be useful in several cases: for the development of parallel
applications without having an account on the target machine, for the neutral
analysis of an application run, and to help the design and study of a parallel
machine.

The tests that have been done with both simulation and native execution
seem to show that good accuracy can be obtained.

Some work is in progress to allow the use of the simulator with different
APIs.

The other interest of this work is the theoretical study of the parallelization
efficiency. We are now able to characterize the type of simulation that can
be done in parallel, depending on the granularity of the application and the
parameters of the simulation host.

References
[ABD] E. Anderson, A. Benzoni, J. Dongarra, S. Moulton, S. Ostrouchov,
B. Tourancheau, and R. Van de Geijn. Lapack for distributed memory
architecture: progress report. In Fifth SIAM Conference on Parallel
Processing for Scientific Computing.
[BDCW] E. A. Brewer, C. N. Dellarocas, A. Colbrook, and W. E. Weihl. Proteus:
A high-performance parallel architecture simulator. Technical Report
MIT/LCS/TR, Massachusetts Institute of Technology, Laboratory of Computer
Science, September.
[BDV] Greg Burns, Raja Daoud, and James Vaigl. LAM: An Open Cluster
Environment for MPI.
[Bre] E. A. Brewer. Aspects of a parallel architecture simulator. Technical
Report MIT/LCS/TR, Massachusetts Institute of Technology, Laboratory of
Computer Science, February.
[Bur] Gregory Burns. Trillium operating system. In Third Conference on
Hypercube Concurrent Computers and Applications.
[DGH] H. Davis, S. Goldschmidt, and J. Hennessy. Multiprocessor simulation
and tracing using Tango. In ICPP.
[GBD] Al Geist, Adam Beguelin, Jack Dongarra, Weicheng Jiang, Robert
Mancheck, and Vaidy Sunderam. PVM: Parallel Virtual Machine. Scientific and
Engineering Computation. MIT Press.
[GNS] D. Grunwald, G. J. Nutt, A. M. Sloane, D. Wagner, and B. Zorn.
A testbed for studying parallel programs and parallel execution
architectures. Technical report, University of Colorado, April.
[HE] M. T. Heath and J. A. Etheridge. Visualizing performance of parallel
programs. Technical report, Oak Ridge National Laboratory.
[Int] Intel Corporation. iPSC and iPSC Source Code Product, Internal Product
Specification.
[PR] Paul Pierce and Greg Regnier. The Paragon implementation of the NX
message passing interface. In SHPCC.
[PY] David K. Poulsen and Pen-Chung Yew. Execution-driven tools for parallel
simulation of parallel architectures and applications. In SUPERCOMPUTING.
[RHS] Eric Reiher, Herbert H. J. Hum, and Ajit Singh. Simulating networks of
superscalar processors. Technical report, Centre de recherche informatique
de Montreal.
[SLL] Georg Stellner, Stefan Lamberts, and Thomas Ludwig. NXLIB User's Guide.
Institut für Informatik, Technische Universität München.
[vRT] M. van Riek and B. Tourancheau. The trace formats that are used in
PICL, ParaGraph and GPMS. Technical Report, LIP, Ecole Normale Supérieure
de Lyon.
[Wor] P. Worley. A new PICL trace file format. Technical Report TM, Oak Ridge
National Laboratory, October.
