Shared memory as a basis for conservative distributed architectural simulation by Stoller, Leigh B. & Swanson, Mark R.
Shared Memory as a Basis for 
Conservative Distributed Architectural Simulation 1
Mark R. Swanson 
Leigh B. Stoller 
E-mail: {swanson,stoller}@cs.utah.edu
UUCS-97-005
Department of Computer Science
University of Utah
Abstract
This paper describes experience in parallelizing an execution-driven architectural simulation
system used in the development and evaluation of the Avalanche distributed architecture. It
reports on a specific application of conservative distributed simulation on a shared memory platform.
Various communication-intensive synchronization algorithms are described and evaluated.
Performance results on a bus-based shared memory platform are reported, and extension and scalability
of the implementation to larger distributed shared memory configurations are discussed. Also
addressed are specific characteristics of architectural simulations that contribute to decisions relating
to the conservatism of the approach and to the achievable performance.
1 Introduction
Architectural simulation is a valuable tool in the processes of exploring, designing, enhancing and
implementing computer systems. Execution-driven simulation is especially valuable in gaining an
understanding of an architecture in a dynamic sense, in exploring the interaction of its parts, and in
evaluating macro and micro performance characteristics. In addition, it enables the implementation
of working software based on that architecture, allowing software costs, including programming, to
be evaluated concurrently with architectural design.
Accurate architectural simulation is costly, with slowdown factors of tens to thousands being
reported for simulations of uniprocessors. When simulating multiprocessors, as the Avalanche group
is doing, the simulation process is generally slowed down by a factor proportional to the number
of processors simulated, by the additional simulation of the interconnect, and by context switching
within the simulator between the simulated processors.
1 This work was supported by a grant from Hewlett-Packard, and by the Space and Naval Warfare Systems
Command (SPAWAR) and Advanced Research Projects Agency (ARPA), Communication and Memory Architectures
for Scalable Parallel Computing, ARPA order #B990 under SPAWAR contract #N00039-95-C-0018.
For research groups developing parallel architectures, exploiting parallelism to speed up the
required architectural simulations is an obvious approach. This paper reports on such a
parallelization effort and its unusual approach of performing distributed simulation within a shared memory
model.
1.1 The Avalanche Architecture
The Avalanche distributed system will be a cluster or network of 32 to 64 workstations[7]
interconnected with a Myrinet network. Its unique aspects lie in providing a communications interface
supporting extremely efficient message passing and distributed shared memory (DSM) and
designing that interface to plug in to commodity workstations. All interactions between processors occur
as communications over the Myrinet via this interface. The communications are always initiated
by processor memory references: stores to interface registers in the case of message passing and
cache misses in the case of DSM. The design of the Avalanche interface is intended to minimize
all aspects of interprocessor communication overhead, with special emphasis on memory hierarchy
effects. To expose these effects, a detailed and quite fine grained simulation is required.
2 The Base Simulation Environment
The base uniprocessor simulation environment developed by the Avalanche project comprises a
simulator for the HP PA-RISC architecture[5], including an instruction set interpreter, and detailed
simulation modules for the first level cache, the system bus, the memory controller, the network
interconnect, and the communications device which is the focus of the Avalanche project's research.
This environment is called PAint (PA-interpreter)[6] and is derived from the Mint simulator[9].
The simulator is designed to model multiple nodes, consisting of the modules listed above, and
the interactions between nodes, with emphasis on the effects of communication on the memory
hierarchies.
PAint schedules tasks to perform simulation events at specified times in the future, with the
present time being an acceptable degenerate case. Consider modeling a first level cache access.
When an instruction performs a memory reference, a task is scheduled to model the resulting cache
activity. This task would first need to look up the access in the tag RAMs. To model the time cost
of this, it schedules itself to resume after an appropriate delay. When the task is again executed,
it performs the tag lookup and, assuming it succeeds, once again schedules itself in the future to
model the delay involved in accessing the data RAMs and returns to the scheduling loop. When
contention arises for a resource, such as the tag RAMs, the task requiring access to that resource
will enqueue itself on the resource, and the holder of the resource will schedule that task to execute
when the holder finishes with the resource. It is thus possible that several tasks may be scheduled
concurrently for a particular node, executing both overlapped and interleaved in time, depending
on the interactions of the modeled entities. The computational granularity of these tasks varies
widely, from one or two microseconds to 30 or more microseconds (on the platform discussed in
this paper).
Current simulation time in PAint is the minimum time of the tasks in the task list[2]. Scheduling
a task is accomplished by inserting it in a list ordered by desired execution time; executing a task
involves extracting the task with the lowest execution time from the list and performing the action
described by the task. In sequential PAint, there is but a single task list for all simulated nodes, so
simulation of nodes is interleaved. A representative time to insert a task on the platform described
here is 775 nanoseconds; representative extraction times range from 700 to 1200 nanoseconds.
2 The data structure used is actually an array of lists. Its implementation is not crucial to the results described.
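The scheduling discipline described above can be sketched as a time-ordered list. This is a minimal illustration, not PAint's code: the names task_t, task_insert, task_extract, and current_time are hypothetical, and a single sorted linked list stands in for the array of lists that PAint actually uses.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical task record: perform `action` at simulated time `when`. */
typedef struct task {
    unsigned long when;             /* desired execution time, in cycles */
    void (*action)(struct task *);  /* work to perform (unused in this sketch) */
    struct task *next;
} task_t;

/* Insert a task, keeping the list sorted by ascending execution time.
 * Tasks with equal times keep their insertion order. */
static void task_insert(task_t **list, task_t *t)
{
    assert(list != NULL && t != NULL);
    while (*list != NULL && (*list)->when <= t->when)
        list = &(*list)->next;
    t->next = *list;
    *list = t;
}

/* Remove and return the task with the lowest execution time, or NULL. */
static task_t *task_extract(task_t **list)
{
    task_t *t;
    assert(list != NULL);
    t = *list;
    if (t != NULL) {
        *list = t->next;
        t->next = NULL;
    }
    return t;
}

/* Current simulation time is the minimum time of any pending task;
 * `fallback` is returned when no task is pending. */
static unsigned long current_time(const task_t *list, unsigned long fallback)
{
    return list != NULL ? list->when : fallback;
}
```

Inserting tasks with times 30, 10, and 20 and then extracting repeatedly yields them in the order 10, 20, 30, which is the interleaving behavior described for sequential PAint.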
The driving task in all of this is the instruction execution task, modeled by the next_event
function. It interprets instructions which ultimately initiate actions in all of the other modules.
PAint is an example of an execution-driven simulator. It interprets icodes, a pre-decoded form of
instructions. The decoding is performed at program load time. The next_event task interprets
icodes until it reaches one that potentially results in actions in other modules within its processor
and possibly in other processors. Such an icode is preceded by a special event icode inserted by the
loader. This event breaks the task out of the interpretation loop, suspends the next_event task,
and causes the scheduling of some other task, such as the cache task described earlier. In PAint,
the most common events of interest are memory references.
Since the next_event task may iterate through several instructions each time it is scheduled,
the advance of its time is "chunky." That is, its time will have advanced several machine cycles as a
result of being executed once, and it may be "ahead" of other tasks. This does not result in causality
errors, since it always reschedules at points where external entities, such as other processors, might
influence it.
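The "chunky" advance of the instruction execution task can be illustrated with a short sketch. The icode_t layout and the function name next_event_sketch are assumptions made for this example only; PAint's real icode representation is not described in this paper.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical pre-decoded instruction. */
typedef struct {
    int is_event;     /* nonzero for an event icode inserted by the loader */
    unsigned cycles;  /* cycles charged for interpreting this icode */
} icode_t;

/* Interpret icodes starting at *pc until an event icode (or the end of the
 * program) is reached. On return *pc indexes the event icode, and the result
 * is the number of cycles by which the task's time advanced in this "chunk". */
static unsigned long next_event_sketch(const icode_t *prog, size_t len,
                                       size_t *pc)
{
    unsigned long elapsed = 0;
    assert(prog != NULL && pc != NULL);
    while (*pc < len && !prog[*pc].is_event) {
        elapsed += prog[*pc].cycles;
        (*pc)++;
    }
    return elapsed;
}
```

The caller would add the returned cycle count to the task's time and then schedule whatever task the event icode calls for, such as the cache task.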
3 Parallelizing the Simulation
The motivation for developing PPAint (Parallel PAint) was an immediate pragmatic one of needing
to run larger simulations and run them more quickly. The simulator is a complex program, tuned for
uniprocessor performance. It was not desirable either to increase its complexity significantly or to
decrease its uniprocessor performance in order to parallelize it, nor was a long-term parallelization
effort possible.
A shared memory approach was chosen for three reasons:
1. synchronization costs: at present, basic communication costs between processors in
message passing systems are between one and two orders of magnitude greater than those in
shared memory systems. It appeared likely that these costs on a message passing platform
would dwarf the actual computation, resulting in little or no speedup.
2. complexity: the software cost of building efficient, message based synchronization is also
higher than the methods used in the work reported here, which are based on shared variables.
3. availability: an SGI Power Challenge system was available and acquisition of
Hewlett-Packard shared memory multi-processor workstations is planned. In the future, the Avalanche
machine itself, with its distributed shared memory capability, will provide an additional
platform.
In spite of targeting shared memory platforms, PPAint resembles a conservatively-synchronized
distributed simulation with the simulated nodes acting as the logical processes (LPs). The use
of shared memory is limited to (1) maintaining global simulation time, (2) inter-processor task
scheduling, and (3) providing the substrate for the architectural message passing being modeled.
Particular characteristics of multi-processor simulation motivate this use of a distributed logical
process approach and the choice of the node as the LP:
1. The simulation of individual nodes is naturally independent except for well-defined
synchronizing events.
2. For many of the applications simulated, the work load per simulated node is very similar,
making the node processor an acceptable "unit" for load balancing purposes.
3. Most simulator data structures are naturally created on a per-simulated-node basis and some
are shared by components within the node. LP state is most naturally bounded at the node
boundary.
4. Within a node, modules schedule tasks in other modules frequently and often do so with zero
delay; non-instantaneous interaction is a crucial assumption in many distributed simulation
scheduling algorithms. In making the node the LP, these interactions are encapsulated within
the LP and are handled by normal sequential scheduling algorithms.
5. Tasks are scheduled between nodes relatively infrequently, and always with non-zero
delay, enabling use of standard distributed synchronization algorithms.
As a result, PPAint is structured as a collection of slightly modified uniprocessor PAints, where each
PAint models one or more simulated nodes. The modifications are largely isolated to initialization
and synchronization code; the component-modeling modules, with only two minor exceptions,
remain unchanged.
As expected, this approach proved to be economical in terms of implementation time. Getting from
initiation of the project to a working parallel simulator took only a few weeks of effort by
a single programmer. The subsequent performance improvements and experiments reported here
required only a few more weeks.
3.1 Maintaining Causality and Time between Simulated Nodes
Simulated nodes in the architectural model can only affect one another by sending messages, either
explicitly in message passing applications or implicitly in the use of DSM. The inherent latency
of this message passing, in terms of simulation cycles, is the basic lookahead factor used to reduce
synchronization between simulation processors. This latency and the resulting lookahead are
determined by the nature of the interconnect modeled and the level of accuracy desired from that model.
The interconnect modeled here has a 125 nanosecond delay, on-the-wire, for a single-switch fabric.
That translates into 15 cycles for simulated processing nodes with a 120 MHz clock. The lookahead
for a simulation that ignores contention in the switch is thus 15 cycles. Many interconnects,
including the one modeled here, would require flit-by-flit modeling to capture such contention effects.
Such fine grained simulation is prohibitively expensive for uniprocessor simulators, but that fine
grain also makes attaining speedup from parallel simulation difficult. Such accuracy is sometimes
required, of course; making it practical is one of the goals of PPAint development.
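The 15-cycle figure follows directly from the wire delay and the simulated clock rate. The symbols t_wire and f_clock are notation introduced here for the worked arithmetic:

```latex
\[
  \text{lookahead}
  = t_{\text{wire}} \times f_{\text{clock}}
  = \left(125 \times 10^{-9}\,\text{s}\right) \times \left(120 \times 10^{6}\,\text{cycles/s}\right)
  = 15\ \text{cycles}.
\]
```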
At a low level, simulated nodes in PAint actually affect each other through the scheduling of
tasks, specifically tasks executing within the network interface model which, in turn, schedules
tasks in the memory system model. The network interface can also affect the simulated processor
directly by interrupting it. Thus explicit communication between simulation processes is isolated
to just a single task scheduling location in the network interface module. The remainder of the
simulator, except for initialization code and the task extraction routine, is essentially oblivious to
whether it is running as a parallel program or not.
3.2 Time Management
As a distributed simulation, PPAint does not maintain a centralized global clock. Each simulation
process maintains its own time in a clock variable globally visible to all simulation processes. Each
simulation process also maintains a time value at which it needs to synchronize with the rest of
the simulation. The task scheduling loop extracts tasks from the local task list, advancing its clock
variable as the schedule time of those tasks moves forward. When it reaches a task that would
advance its clock beyond the synchronization time, it invokes a synchronization algorithm which
does two things: (1) it computes and returns the time for the next synchronization and (2) it dequeues
any inter-processor tasks and inserts them into the local task list.
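The loop just described might look roughly like the following sketch. It is written under several assumptions: the task list is reduced to a sorted array of task times, the synchronization algorithm is passed in as a hypothetical sync_fn hook, fixed_window_sync is a stand-in that simply grants 15 more cycles per call, and a max_tries bound replaces the indefinite retry so the sketch always terminates.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical synchronization hook: returns the next synchronization time.
 * A real version would also move inter-processor tasks into the local list. */
typedef unsigned long (*sync_fn)(void *ctx);

/* Stand-in synchronization that grants a fixed 15-cycle window per call;
 * `ctx` points at the last granted time. */
static unsigned long fixed_window_sync(void *ctx)
{
    unsigned long *granted = ctx;
    *granted += 15;
    return *granted;
}

/* Execute the tasks whose times are in times[0..n) (sorted ascending).
 * Whenever the next task would move the clock past *sync_time, call `sync`
 * until the returned time covers that task, giving up after `max_tries`
 * attempts. Returns the number of tasks executed. */
static size_t run_tasks(const unsigned long *times, size_t n,
                        unsigned long *local_clock, unsigned long *sync_time,
                        sync_fn sync, void *ctx, int max_tries)
{
    size_t done = 0;
    assert(times != NULL && local_clock != NULL && sync_time != NULL);
    assert(sync != NULL);
    while (done < n) {
        int tries = 0;
        while (times[done] > *sync_time) {
            if (tries++ >= max_tries)
                return done;          /* still blocked: stop for now */
            *sync_time = sync(ctx);
        }
        *local_clock = times[done];   /* advance the globally visible clock */
        done++;                       /* the task's action would run here */
    }
    return done;
}
```

With task times 5, 12, 20, and 44 and an initial synchronization time of 0, this loop synchronizes three times (granting 15, 30, and 45) and finishes with the clock at 44.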
3.3 Inter-Process Task Ordering and Causality
The network module discriminates between tasks it schedules for nodes simulated by its own
processor and those it schedules for nodes simulated on other processors. Tasks bound for other processors
are placed on a queue owned by that processor; access to that queue is protected by a lock. Correct
ordering of task execution is ensured as follows:
•  Inter-processor network tasks are constrained to be scheduled with a delay into the future
that is greater than the lookahead value.
•  A task adding a task to another processor's queue will not advance its own processor's time
until the queue operation is visible at the other processor.
•  After computing a new synchronization time, the synchronization code always moves all
inter-processor tasks from the queue into its local task list before allowing the local clock to advance
beyond the old synchronization time.
Maintaining these ordering constraints is trivial in a shared memory system that provides sequential
consistency. A store into a global variable is known to be logically visible to the other processors
when the store instruction completes. In a system with weaker consistency, it may be necessary to
perform a write barrier between an inter-processor queue operation and the subsequent advance of
the queuing processor's clock variable.
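On a weakly ordered machine, the second constraint could be enforced as in the sketch below. It uses C11 atomics, which postdate this work and are an assumption of the example, as are the names inbox_t, send_task, and drain_inbox and the fixed queue capacity. The release store to the clock plays the role of the write barrier.

```c
#include <assert.h>
#include <stdatomic.h>
#include <stddef.h>

#define QUEUE_CAP 64  /* illustrative fixed capacity */

/* Hypothetical per-processor inbox for tasks sent by other processors. */
typedef struct {
    atomic_flag lock;               /* spin lock protecting the queue */
    unsigned long when[QUEUE_CAP];  /* scheduled times of queued tasks */
    size_t count;
} inbox_t;

static void inbox_init(inbox_t *q)
{
    assert(q != NULL);
    atomic_flag_clear(&q->lock);
    q->count = 0;
}

/* Queue a task for another processor, then advance the sender's clock.
 * The release store orders the queue update before the clock update, so a
 * processor that reads the new clock value with acquire semantics also sees
 * the queued task. Returns 0 on success, -1 if the queue is full (in which
 * case the clock is left unchanged). */
static int send_task(inbox_t *dest, unsigned long when,
                     _Atomic unsigned long *my_clock, unsigned long new_clock)
{
    int status = 0;
    assert(dest != NULL && my_clock != NULL);
    while (atomic_flag_test_and_set_explicit(&dest->lock, memory_order_acquire))
        ;  /* spin until the lock is free */
    if (dest->count < QUEUE_CAP)
        dest->when[dest->count++] = when;
    else
        status = -1;
    atomic_flag_clear_explicit(&dest->lock, memory_order_release);
    if (status == 0)
        atomic_store_explicit(my_clock, new_clock, memory_order_release);
    return status;
}

/* Move up to `cap` queued task times into `out`; returns the number moved.
 * Tasks that do not fit stay queued, in order. */
static size_t drain_inbox(inbox_t *q, unsigned long *out, size_t cap)
{
    size_t n;
    assert(q != NULL && out != NULL);
    while (atomic_flag_test_and_set_explicit(&q->lock, memory_order_acquire))
        ;
    n = q->count < cap ? q->count : cap;
    for (size_t i = 0; i < n; i++)
        out[i] = q->when[i];
    for (size_t i = n; i < q->count; i++)
        q->when[i - n] = q->when[i];
    q->count -= n;
    atomic_flag_clear_explicit(&q->lock, memory_order_release);
    return n;
}
```

On a sequentially consistent machine such as the one used here, plain stores would suffice and the explicit memory orders would be unnecessary.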
4 Performance Results with Parallel PAint
It is widely recognized that synchronization and waiting time are the major overheads in
conservative parallel simulations. Not surprisingly, the performance of PPAint, over the range of processor
counts studied, is significantly impacted by these factors. Several progressively more sophisticated,
but still conservative, synchronization algorithms were implemented in an attempt to tolerate the
load imbalances that give rise to this waiting.
Another common difficulty for multiprocessor simulations is the frequent synchronization
required for accurate simulation of the interconnect. Variable-lookahead variants of the
synchronization algorithms were implemented to address this problem.
4.1 Experimental Testbed
All tests were conducted on an SGI Power Challenge with fourteen 90 MHz R8000 processors and a
common memory of 2 gigabytes. Each processor has 16 kilobyte on-chip data and instruction
caches and a 4 megabyte unified second level cache. The penalty for a miss to main memory is 53
cycles, while a miss to another cache takes 80 cycles. The simulator was compiled with the MIPS
C compiler in 32 bit mode. Micro measurements were obtained using a 21 nanosecond resolution
interval timer; an average overhead of reading the timer has been factored out of reported times.
Macro (whole program) times reported were wall clock time as reported by the getrusage() system
call. The tests were not run on a dedicated machine but simulated processes did not share processors
with other user processes.
4.2 Synchronization Algorithms
Figure 1: The BARRIER Time Algorithm
    mintime = clock[ThisProcessor];
    for (i = 0; i < Number_of_rprocs; i++)
        if (clock[i] < mintime)
            return mintime;
    return mintime + lookahead;
Several synchronization algorithms were evaluated; four are reported on here. The procedures are
called from a loop which checks to ensure that the returned time is in its future; if not, it calls the
procedure again. The first algorithm is BARRIER (see Figure 1), in which all processors join a
barrier every lookahead cycles. This algorithm is simple to implement, using one clock variable
per processor to hold its simulation time. The simulation time need only be updated once per
synchronization epoch, as the processor enters the barrier, minimizing communication time. As
other studies have found, this algorithm is prone to high waiting times.
Figure 2: The SIMPLEMIN Time Algorithm
    mintime = clock[0];
    for (i = 1; i < Number_of_rprocs; i++)
        mintime = min(clock[i], mintime);
    return mintime + lookahead;
SIMPLEMIN (see Figure 2) is a modified version of BARRIER that trades extra communication
and more frequent synchronization for reduced waiting time. In SIMPLEMIN, each processor determines
the minimum simulation time across all processors; the lookahead value is added to this minimum
and, if it is greater than the processor's current time, it continues simulating up to that time. The
extra communication cost arises because the processors' global clock variables must be updated
on each cycle for this algorithm to be effective. The extra synchronization cost arises because
the window between synchronizations is, on average, significantly less than the lookahead value.
Two conditions are necessary for SIMPLEMIN to outperform BARRIER: (1) the time to perform
a synchronization must be significantly less than the average time spent in simulation between
synchronizations, and (2) the cost of communicating the clock values must be low.
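The difference between the two algorithms shows up on a small example. The sketch below restates Figures 1 and 2 as self-contained C functions; the function names and the explicit parameters (the clock array, the processor count, and the caller's index) are introduced here for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* BARRIER (Figure 1): a processor may advance by `lookahead` only when no
 * other processor's clock is behind its own. */
static unsigned long barrier_time(const unsigned long *clock, size_t nprocs,
                                  size_t self, unsigned long lookahead)
{
    unsigned long mintime;
    assert(clock != NULL && self < nprocs);
    mintime = clock[self];
    for (size_t i = 0; i < nprocs; i++)
        if (clock[i] < mintime)
            return mintime;       /* someone is behind: no progress yet */
    return mintime + lookahead;
}

/* SIMPLEMIN (Figure 2): advance to the global minimum plus `lookahead`. */
static unsigned long simplemin_time(const unsigned long *clock, size_t nprocs,
                                    unsigned long lookahead)
{
    unsigned long mintime;
    assert(clock != NULL && nprocs > 0);
    mintime = clock[0];
    for (size_t i = 1; i < nprocs; i++)
        if (clock[i] < mintime)
            mintime = clock[i];
    return mintime + lookahead;
}
```

With clocks of 30, 22, 30, and 27 cycles and a lookahead of 15, BARRIER returns 30 to processor 0, which is not in its future, so it must wait and call again. SIMPLEMIN returns 22 + 15 = 37, letting processor 0 simulate 7 more cycles before it synchronizes again.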
TWOWINDOW (see Figure 3) is an adaptive algorithm. Since interaction between processors
is confined to network communication, it is possible to identify impending communications some
number of cycles before they actually occur. TWOWINDOW computes a minimum horizon value
over all processors. It is the sum of each processor's time and its current lookahead value, based on
whether or not it is, or soon will be, communicating with another processor. Three factors increase
synchronization time over the previous algorithms: (1) the added complexity in computing the
horizon, (2) the communication of two values: the processor's clock and the current lookahead
value, and (3) the dynamic maintenance of the lookahead values. TWOWINDOW can, of course,
be generalized to an arbitrary number of lookahead values, limited by the nature of the simulated
system and the acceptable complexity in determining and dynamically maintaining those values.
Like SIMPLEMIN, it depends on each processor updating its global clock on each cycle.
Figure 3: The TWOWINDOW Time Algorithm
    horizon = clock[0] + lookahead[0];
    for (i = 1; i < Number_of_rprocs; i++)
        horizon = min(horizon, clock[i] + lookahead[i]);
    return horizon;
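A sketch of how the per-processor lookahead values might feed the horizon calculation follows. The pick_lookahead policy and its communicating_soon flag are hypothetical, since the paper does not show how impending communication is detected; the values 15 and 35 are the basic and enhanced lookaheads used in the experiments reported in Section 4.3.

```c
#include <assert.h>
#include <stddef.h>

enum { BASIC_LOOKAHEAD = 15, ENHANCED_LOOKAHEAD = 35 };  /* cycles */

/* Hypothetical policy: a processor that is communicating, or soon will be,
 * publishes the basic lookahead; otherwise the larger enhanced value. */
static unsigned long pick_lookahead(int communicating_soon)
{
    return communicating_soon ? BASIC_LOOKAHEAD : ENHANCED_LOOKAHEAD;
}

/* TWOWINDOW (Figure 3): minimum over all processors of clock + lookahead. */
static unsigned long twowindow_horizon(const unsigned long *clock,
                                       const unsigned long *lookahead,
                                       size_t nprocs)
{
    unsigned long horizon;
    assert(clock != NULL && lookahead != NULL && nprocs > 0);
    horizon = clock[0] + lookahead[0];
    for (size_t i = 1; i < nprocs; i++) {
        unsigned long h = clock[i] + lookahead[i];
        if (h < horizon)
            horizon = h;
    }
    return horizon;
}
```

For clocks of 100, 90, and 120 cycles where only the third processor is about to communicate, the horizon is min(135, 125, 135) = 125. If all three were communicating, it would drop to min(115, 105, 135) = 105.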
Figure 4: The CLSTR2WIN Time Algorithm
    horizon = clock[ThisClusterBase] + lookahead[ThisClusterBase];
    for (i = ThisClusterBase + 1; i < ThisClusterMax; i++)
        horizon = min(horizon, clock[i] + lookahead[i]);
    Cluster_Horizon[ThisCluster] = horizon;
    for (i = 0; i < Number_of_clusters; i++)
        horizon = min(horizon, Cluster_Horizon[i]);
    return horizon;
CLSTR2WIN (see Figure 4) is a clustered minimum calculation. Processors are grouped into
clusters. Processors within a cluster use one of the algorithms described above, such as SIMPLEMIN
or TWOWINDOW, across the processors within the cluster. A synchronizing processor then posts
its cluster minimum in a per-cluster clock variable. Next it computes a minimum over the cluster
clocks. For a system of N processors and a cluster size of M, this algorithm decreases synchronization
time and communication from O(kN) to O(k(M + N/M)). For the small system sizes reported
here, the effects are small, but for larger systems CLSTR2WIN should extend the scalable range
of the base algorithm it is applied to. It also can form the basis for clustered clock management
in hierarchical, NUMA systems, where both locality and the number of sharers can affect the
communication cost of a given shared variable.
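Under the stated cost model, and treating M as a continuous variable (an idealization introduced here), the best cluster size can be worked out as follows:

```latex
\[
  f(M) = k\left(M + \frac{N}{M}\right), \qquad
  \frac{df}{dM} = k\left(1 - \frac{N}{M^{2}}\right) = 0
  \;\Longrightarrow\; M = \sqrt{N}, \qquad
  f\!\left(\sqrt{N}\right) = 2k\sqrt{N}.
\]
```

For example, with N = 64 processors the cost is minimized at M = 8, giving 16k instead of the 64k of the unclustered calculation. With N = 8 and M = 4 the cost is 6k instead of 8k, consistent with the small effect observed at the system sizes reported here.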
Figure 5 shows the time for these basic synchronization operations, for four and eight processor
runs. All times were produced with a load of two simulated nodes per processor, each running
an SOR calculation. The times reported were gathered for the initial iteration of the algorithm
at each synchronization event. This iteration is likely to be the most costly, since it incurs the
most communications cost, in the form of cache misses. It is also representative of the minimum
cost that a synchronizing processor must pay at each such event. Also reported is an effective
lookahead metric. This is the average number of cycles a processor is allowed to advance between
synchronization events. For the BARRIER case, it is always the lookahead factor, in this case 15.
Figure 5
Processors  Sync Algorithm  Sync Time (microseconds)  Effective Lookahead (cycles)
4           BARRIER          5.63                     15
4           SIMPLEMIN        5.75                     7
4           TWOWINDOW        7.41                     13 to 15
4           CLSTR2WIN        9.95                     13 to 15
8           BARRIER          9.51                     15
8           SIMPLEMIN       10.37                     7 to 9
8           TWOWINDOW       13.31                     8 to 10
8           CLSTR2WIN       10.35                     7 to 9
Figure 6: Synchronization Algorithm Comparison
Sync Algorithm  Runtime (Min:secs)  Effective Lookahead  Speedup
BARRIER         5:42                15                   3.57
SIMPLEMIN       4:06                3                    4.96
TWOWINDOW       3:48                4 to 5               5.35
CLSTR2WIN       3:48                4 to 5               5.35
In evaluating these synchronization times, it is useful to consider the average work performed
by one processor in a simulated cycle. In the simulations just reported for 8 processors, the average
task time was 10.7 microseconds and on average 0.45 task was executed every cycle. Consider
the most expensive synchronization algorithm, TWOWINDOW. With its effective lookahead of
approximately 9, the average computation per synchronization would be 43.3 microseconds or
about 3.25 times the average synchronization cost.
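The arithmetic behind these two figures, using the 8-processor TWOWINDOW synchronization time of 13.31 microseconds from Figure 5, is:

```latex
\[
  9\ \text{cycles} \times 0.45\ \frac{\text{tasks}}{\text{cycle}}
  \times 10.7\ \frac{\mu\text{s}}{\text{task}} \approx 43.3\ \mu\text{s},
  \qquad
  \frac{43.3\ \mu\text{s}}{13.31\ \mu\text{s}} \approx 3.25 .
\]
```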
4.3 Comparative Effectiveness of the Algorithms
Each of the algorithms described above makes a different tradeoff of communication and
computation complexity and synchronization frequency. Figure 6 shows the results of tests using 8 real
processors, simulating 32 nodes running an SOR program. The basic lookahead time is 15 cycles,
while the enhanced lookahead used by the TWOWINDOW and CLSTR2WIN algorithms is 35
cycles. Speedups are calculated based on a uniprocessor version of PAint that ran in 20 minutes 20
seconds.
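The speedup column can be reproduced from the reported runtimes. The helper names runtime_seconds and speedup below are introduced for this illustration only.

```c
#include <assert.h>
#include <stdio.h>

/* Convert a "min:sec" runtime string to seconds; returns -1 on bad input. */
static long runtime_seconds(const char *mmss)
{
    int min, sec;
    assert(mmss != NULL);
    if (sscanf(mmss, "%d:%d", &min, &sec) != 2 ||
        min < 0 || sec < 0 || sec > 59)
        return -1;
    return 60L * min + sec;
}

/* Speedup = uniprocessor runtime / parallel runtime. */
static double speedup(const char *uniprocessor, const char *parallel)
{
    long u = runtime_seconds(uniprocessor);
    long p = runtime_seconds(parallel);
    assert(u > 0 && p > 0);
    return (double)u / (double)p;
}
```

For instance, speedup("20:20", "3:48") evaluates 1220 / 228, which rounds to the 5.35 reported for TWOWINDOW, and speedup("20:20", "5:42") evaluates 1220 / 342, which rounds to the 3.57 reported for BARRIER.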
From these results, the TWOWINDOW-type algorithm (of which CLSTR2WIN is a variant) is
a clear winner. This happens despite the fact that the effective window size declines to only 4 to
5 cycles between synchronizations. Even the very simple SIMPLEMIN algorithm performs much
better than BARRIER, again in spite of significantly increased synchronization activity.
The real benefit of TWOWINDOW-type algorithms, however, lies in their ability to tolerate the
kind of lookaheads required for more accurate network simulations. Figure 7 shows performance
as the basic lookahead window is decreased, keeping the enhanced lookahead window constant at
35. The speedup degrades only modestly even with a basic lookahead of 1, which should allow
accurate, flit-by-flit modeling of the interconnect. A program that communicates more frequently
will see a greater degradation, of course.
Figure 7: Lookahead Tolerance of TWOWINDOW Algorithm
Basic Lookahead  Runtime (Min:secs)  Effective Lookahead  Speedup
15               3:48                4 to 5               5.35
5                3:56                4 to 5               5.17
2                3:52                4 to 5               5.26
1                3:57                4 to 5               5.15
4.4 Overall Performance
Figures 8 and 9 report speedups obtained for simulations of three programs: two
successive-over-relaxation programs (SOR-sync and SOR-async) and a Gaussian elimination (GAUSS). These are
modified versions of programs used by [2]. The modifications consisted of replacing the message
passing libraries with ones based on Direct Deposit[8], a protocol suite developed for the Avalanche
system. SOR-sync performs two basic kinds of communication at each time step: a global reduction
and accumulation and broadcast of a solution vector. The reduction uses a tree rooted at
simulated node 0, while the solution vector operation involves all-to-all communication. SOR-async
implements the propagation of the solution vector by having each node broadcast values as they
are computed; the particular implementation results in a large increase in messages sent and bytes
communicated but results in a faster convergence. It still performs the global reduction at each
time step. GAUSS performs a global reduction at each step to determine the owner of the pivot
row, followed by broadcast of this row from the owner to all the other simulated nodes.
Figure 8
Program    Real Processors  Simulated Nodes  Runtime (Min:secs)  Speedup
SOR-sync   1                32               20:20
SOR-sync   2                32               11:24               1.78
SOR-sync   4                32                6:19               3.22
SOR-sync   8                32                3:38               5.35
SOR-sync   1                64               53:20
SOR-sync   2                64               28:51               1.85
SOR-sync   4                64               15:34               3.43
SOR-sync   8                64                9:10               5.82
SOR-async  1                16               19:11
SOR-async  2                16               10:59               1.75
SOR-async  4                16                6:12               3.09
SOR-async  8                16                3:49               5.02
GAUSS      1                32               40:31
GAUSS      2                32               25:42               1.58
GAUSS      4                32               15:05               2.69
GAUSS      8                32                9:53               4.1
Figure 9: Speedup Curves using TWOWINDOW (plot not reproduced; horizontal axis: Real Processors)
The simulator was configured to use the TWOWINDOW synchronization algorithm, using
lookahead values of 35 and 15 cycles. The results are much as one would expect. An increase in
the number of nodes simulated per processor results in greater speedup. An increase in the amount
of communication results in a decrease in speedup; compare SOR-sync against SOR-async and
GAUSS, which have greater amounts of communication.
Figure 10 gives results using an enhanced version of TWOWINDOW, which tracks processors
that are targets of communication, rather than those that are sources. This results in a more precise
application of the appropriate lookahead value. This becomes important with increasing number of
simulated nodes and with applications with relatively high communication to computation ratios.
Figure 10
Program   Real Processors  Simulated Nodes  Runtime (Min:secs)  Speedup
SOR-sync  8                64                8:50               6.04
GAUSS     4                32               13:45               2.95
GAUSS     8                32                9:00               4.5
5 Scalability
Shared memory systems are frequently criticized as lacking scalability. Emerging DSM systems
promise to extend shared memory from the current four to sixteen processors on high-performance
bus-based systems to hundreds of processors. The high cross sectional bandwidth of switch-based
fabrics makes possible this order-of-magnitude increase in processor count. Latency of cache misses
due to fabric delays and protocol processing is higher, of course, than in bus-based systems such
as the Power Challenge.
The scalability of shared memory distributed parallel simulation such as that reported here
remains to be seen. The challenges will be in the areas they have always been in distributed
simulation: synchronization and waiting. Synchronization effects should fare no worse for DSM than for
message passing systems, since the underlying communications fabrics and transport mechanisms
will likely be identical. Arguably, DSM will do better as a synchronization substrate than message
passing. The frequent updating of each processor's clock variable is analogous to null messages in
a message based system. The variable serves to coalesce sequences of updates so that another
processor always gets the single latest clock value when it performs a synchronization. This coalescing
also serves to reduce bandwidth consumption in the interconnect. For example, the average time to
extract a task from the task list when using 8 processors simulating 64 nodes is 750 nanoseconds,
which is comparable to the time in a similar uniprocessor run. This indicates that many updates to
the clock variables, which occur within the task extraction procedure, are coalesced, costing only
as much as a local cache access.
A thornier problem is waiting for slow processors. Efficient implementations of the shared
memory model offer the opportunity to trade frequent communication and synchronization for
decreased waiting. As long as the fundamental synchronization model remains conservative, of
course, temporary local load imbalances will lead to waiting. Short of adopting an optimistic
strategy, decreasing the computational weight of individual tasks, to allow finer-grained scheduling,
offers the best hope of decreasing waiting time. Little effort was made in this direction in the work
reported here; it will be undertaken when availability of a larger platform makes it practical.
6 Related Work
Numerous groups have developed simulators for multiprocessor architectures. A few of them are
surveyed here.
The Wisconsin Wind Tunnel[3] uses direct execution of instrumented programs on a CM-5,
resulting in very fast and accurate simulation. Interactions between nodes occur only when programs
access shared memory locations, which are translated by the underlying simulation system into
message passing events between CM-5 nodes. Processor execution proceeds in lock step using a
conservative window approach, with control returning to the simulator at the end of the window,
or when a non-local memory operation is performed that misses in the cache. The CM-5's fast
reduction operators are used to ensure that all processors have reached the end of the current
window before proceeding. The main disadvantages of the WWT are the dependence of the simulation
environment on the CM-5 hardware and the lack of flexibility in modifying many aspects of the
architecture due to its direct execution nature.
Parallel Proteus[1] performs direct execution simulation, using a conservative time window
approach. To overcome a small lookahead size resulting from switch level simulation, it uses
local barriers and predictive barrier scheduling. Local barriers use a nearest neighbor approach to
reduce the number of nodes each processor must synchronize with. Predictive barriers reduce the
number of required synchronization points by taking advantage of the fact that processors need
not synchronize during periods when the simulated processors are not communicating. This is
another example of increased lookahead, and works well when the simulated processes engage in
long computational periods between communication. Both compile-time and runtime analysis are
employed to predict when simulated processors are going to communicate.
LAPSE[4] is another simulator that performs direct execution of instrumented programs, in this
case message passing programs. The granularity of synchronization is larger since there are larger
periods of execution between message events.
Parallel Embra[10] is the simulator most closely resembling PPAint. It, too, executes on a 
shared memory platform. It differs from PPAint in using a largely direct-execution model, though 
mechanisms are provided to alter the model of most architectural features. Little is published 
about it; its synchronization mechanism seems to be a conservative time window approach with 
late messages simply being moved into the present.
7 Conclusions and Future Work
The development of PPAint has demonstrated that shared memory provides an effective substrate
for distributed architectural simulation. Its most appealing characteristic is the ease with which
synchronization can be implemented and refined. The efficiency of the model encourages
experimentation with communication-intensive synchronization algorithms that would be impractical in
a message-based system.
Numerous avenues for continued effort are apparent: evaluating scalability on larger systems,
and on DSM systems in particular; determining which tasks have the largest effect on load
imbalance and whether they can be made less "chunky;" evaluating the underlying shared memory
performance, perhaps by simulating PPAint on top of PPAint, to gain insight into the dynamics of
synchronization via shared variables.
References
[1] Brewer, E., Dellarocas, C., Colbrook, A., and Weihl, W. PROTEUS: A High-Performance
Parallel Architecture Simulator. Tech. Rep. MIT/LCS/TR-516, Massachusetts
Institute of Technology, Sept. 1991.
[2] Chandra, S., Larus, J. R., and Rogers, A. Where is Time Spent in Message Passing and
Shared-Memory Programs? In Proceedings of the 6th Symposium on Architectural Support for
Programming Languages and Operating Systems (Nov. 1994), pp. 61-75.
[3] Chandrasekaran, S., and Hill, M. D. Optimistic Simulation of Parallel Architectures
Using Program Executables. In Proceedings of the 10th Workshop on Parallel and Distributed
Simulation (May 1996), pp. 143-150.
[4] Dickens, P. M., Heidelberger, P., and Nicol, D. M. Parallelized network simulators
for message-passing parallel programs. In MASCOTS 95 (Jan. 1995).
[5] Hewlett-Packard Co. PA-RISC 1.1 Architecture and Instruction Set Reference Manual,
February 1994.
[6] Stoller, L. B., and Swanson, M. R. PAINT: PA Instruction Set Interpreter. Tech. Rep.
UUCS-96-009, University of Utah, March 1996.
[7] Swanson, M. R., Davis, A., and Parker, M. Efficient Communication Mechanisms for
Cluster Based Parallel Computing. In Workshop on Communication and Architectural Support
for Network-based Parallel Computing (CANPC 97) (February 1997), vol. 1199 of Lecture
Notes in Computer Science, Springer-Verlag, pp. 1-15.
[8] Swanson, M. R., and Stoller, L. B. Direct Deposit: A Basic User-Level Protocol for
Carpet Clusters. Tech. Rep. UUCS-95-003, University of Utah, March 1995.
[9] Veenstra, J., and Fowler, R. MINT: A Front End for Efficient Simulation of
Shared-Memory Multiprocessors. In MASCOTS 1994 (Durham, NC, Jan. 1994), pp. 201-207.
[10] Witchel, E., and Rosenblum, M. Embra: Fast and Flexible Machine Simulation. In
Proceedings of the 1996 International Conference on Parallel Processing (Aug. 1996), pp. 99-107.
