Parallel Discrete Event Simulation: A Shared Memory Approach
Daniel A. Reed
Department of Computer Science
University of Illinois
Urbana, Illinois 61801

Allen D. Malony
Center for Supercomputing Research and Development
University of Illinois
Urbana, Illinois 61801

Bradley D. McCredie
Department of Electrical and Computer Engineering
University of Illinois
Urbana, Illinois 61801
ABSTRACT
With traditional event list techniques, evaluating a detailed discrete event simulation model can often require hours or even days of computation time. Parallel simulation mimics the interacting servers and queues of a real system by assigning each simulated entity to a processor. By eliminating the event list and maintaining only sufficient synchronization to insure causality, parallel simulation can potentially provide speedups that are linear in the number of processors. We present a set of shared memory experiments using the Chandy-Misra distributed simulation algorithm to simulate networks of queues. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of network nodes to processors. These experiments show that Chandy-Misra distributed simulation is a questionable alternative to sequential simulation of most queueing network models.
This work was supported in part by NSF Grant Number DCR 84-17948 and NASA Contract Number NAG-1-813.
Introduction 
Historically, there have been two major techniques for modeling systems: queueing theory and discrete event simulation. When effective, queueing theoretic techniques can quickly provide mathematical insight into the behavior of systems over a broad range of parameter values. Their major limitation is the number of restrictive assumptions that must be satisfied to insure accuracy. Conversely, simulation models can mimic a real-world system as closely as understanding permits and needs require. However, highly detailed simulation models can be computationally taxing. Computer systems simulations are particularly vexing because simulated events occur on a millisecond or microsecond time scale, often for many simulated minutes.
For example, simulating the behavior of a processor executing a user or system program may involve millions or even tens of millions of events. In one architecture performance study, we recently examined the performance of allocation strategies for register windows in reduced instruction set computers (RISCs) [PaSe82, Patt85] as a function of multiprogramming level [KoRW86]. This analysis required instruction-level simulation for many different program mixes and consumed many hours of processor time.
Simulation of complex (VLSI) digital circuits for logic verification and fault analysis is another example of the computational constraints imposed on simulation of computing components. Although such simulations can consume months of machine time [Pfis82, FrWW84], designers have little choice; an untested design is unacceptable. Moreover, simulation complexity continues to increase dramatically; technology advances are doubling the number of circuits per chip every 1-2 years.
At a much higher level than logic design, we recently encountered difficulties while studying multicomputer networks [Reed83, ReFu86], designed, ironically, to solve computationally
intensive problems. Briefly, a multicomputer network is a large number of interconnected computing nodes that asynchronously cooperate via message passing to execute the tasks of parallel programs.¹ Many design issues must be resolved before constructing a multicomputer network (e.g., the relative speeds of computation processors and internode communication links, topology of connecting communication links, buffer requirements for messages, and memory sizes). Although some of these issues can be attacked analytically, most are analytically intractable and can only be resolved via simulation. A parametric simulation study, whose individual simulation runs cover several minutes of simulated time, typically requires several hundred hours of processor time.
Although each of the three preceding examples, processor simulation, logic simulation, and network simulation, is very different, they share a common need for faster simulation techniques. Processor simulation reflects the fetch/decode/execute cycle of instruction execution and is, by its nature, sequential; parallelizing this application is the subject of architectural research. Circuit simulation, although clearly amenable to parallel processing [Pfis82], typically involves synchronous activation of many entities. In contrast, network simulation is typically asynchronous.
Prior Work 
It might initially appear that evaluating models of many complex systems is both analytically and computationally intractable. However, recent developments have suggested that the computation time for some simulations can be reduced via either vector processing [ChBr83] or distributed simulation [PeWM79, ChMi81, JeSo85].
¹Hypercubes [Seit85] are a special case of multicomputer networks.
Vector Simulation
Chandak and Browne [ChBr83] recently proved an item of computing folklore: discrete event simulation models cannot always be vectorized. Specifically, they showed that any network of queues model containing feedback is not vectorizable. This result is quite negative: most interesting simulation models contain some type of feedback.
Given this result, we recently investigated the level of vectorization practically achievable [Reed85] by instrumenting a discrete event simulation of queueing network models on a Cray X-MP. Although we simulated a variety of workloads and queueing network models, the observed vectorization level never exceeded 5 percent. Even this fraction was primarily attributable to initialization code. Thus, the efficacy of vector simulation is in doubt.
Distributed Simulation
The inherently sequential nature of event list manipulation limits the potential parallelism of standard simulation models. The head of the event list must be removed, the simulation clock advanced, and the event performed (possibly causing new events to be added to the event list). Although techniques for performing event list manipulation and event simulation in parallel have been suggested [Comf82, Comf83], large scale performance increases seem unlikely. Only by eliminating the event list, in its traditional form, can additional parallelism be obtained; this is the goal of distributed simulation.
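To make the sequential bottleneck concrete, the following C fragment sketches the event loop just described. The list representation and all names are illustrative rather than drawn from any particular simulator; the point is that every iteration serializes on the head of the event list, even when events at different servers are logically independent.

    /* Sketch of a sequential event-list loop; illustrative only. */
    #include <stdio.h>
    #include <stdlib.h>

    typedef struct event {
        double time;                  /* simulated occurrence time        */
        int    server;                /* entity at which the event occurs */
        struct event *next;
    } event_t;

    static event_t *event_list = NULL;   /* kept sorted by time         */
    static double   clock_now  = 0.0;    /* the global simulation clock */

    /* Insertion keeps the list ordered; this manipulation is the
       inherently sequential step that distributed simulation removes. */
    static void schedule(double t, int server) {
        event_t *e = malloc(sizeof *e), **p = &event_list;
        e->time = t; e->server = server;
        while (*p && (*p)->time <= t) p = &(*p)->next;
        e->next = *p; *p = e;
    }

    int main(void) {
        schedule(1.0, 0);
        while (event_list && clock_now < 10.0) {
            event_t *e = event_list;      /* remove the head   */
            event_list = e->next;
            clock_now = e->time;          /* advance the clock */
            printf("t=%.2f server=%d\n", clock_now, e->server);
            schedule(clock_now + 1.0, (e->server + 1) % 2);  /* new event */
            free(e);
        }
        return 0;
    }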
If one views a simulation model as a network of interacting servers and queues, distributed simulation maps each server/queue pair onto a processor of a multicomputer network. Each processor operates with its own simulation clock, and there is no global event list. Event occurrence times are transmitted across communication links to appropriate recipients (e.g., a message departing one server for another would carry with it its time of departure).
Several distributed simulation techniques have been proposed, notably the Chandy-Misra algorithm [ChHM79, ChMi79, ChMi81] and the Time Warp algorithm [JeSo85]. The Chandy-Misra algorithm and Time Warp differ in their approach to time management. The former is pessimistic, advancing the processor simulation clocks only when conditions permit. In contrast, Time Warp assumes the simulation clocks can be advanced until conflicting information appears; the clocks are then rolled back to a consistent state, a so-called "time warp."
Both the Chandy-Misra algorithm and Time Warp have been simulated [Seet78, JeSo85], but, to our knowledge, no experimental results have yet been reported. In the remainder of this paper, we present the Chandy-Misra algorithm [ChMi81] and the results of an extensive study of its performance on a shared memory parallel processor when simulating queueing network models. Parameters of the study include queueing network topology and routing probabilities, number of processors, and assignment of queueing network servers to processors. We conclude with a summary of lessons learned and directions for future research.
The Chandy-Misra Distributed Simulation Algorithm 
Consider some physical system composed of independent, interacting entities. A natural, distributed simulation of the physical system creates a topologically equivalent system of logical nodes. Interactions between two physical nodes are modeled by exchange of timestamped messages. The timestamp is the simulated message arrival time at the receiving node.
Each logical node is subject to some constraints. First, node interaction is only via message exchange; there are no shared variables. Second, each node must maintain a clock, representing the local simulated time. Finally, the timestamps of the messages generated by each node must be non-decreasing.
Intuitively, the distributed simulation has no single "correct" simulation time; each node operates independently subject only to those restrictions necessary to insure that events happen in the correct simulated order (i.e., causality is maintained). Independent events can be simulated in parallel even if they occur at different simulated times.
Message timestamps and node clocks are a manifestation of the need for causality: the behavior of a node P at its simulated time T cannot be influenced by any information transmitted to it after time T. This constraint has rather dramatic ramifications. Consider a node P that receives messages from two other nodes A and B. When a message arrives from node A, one would expect node P to interpret the message, perhaps producing a message as a consequence. However, if the arrival time of the message from A is greater than the arrival time of the last message from B, the message from A cannot be processed. Why? A message might later arrive from B with a smaller timestamp. Thus, a node with multiple inputs must wait until it receives messages from all inputs before selecting a message to interpret.
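A minimal C sketch of this input-selection rule, under the assumption of one first-in, first-out channel per input (the structures and names are ours, not the paper's): the node may consume a message only when every channel is non-empty, and it then takes the message with the smallest timestamp.

    #include <stdbool.h>

    #define MAX_INPUTS 8
    #define QUEUE_LEN  64

    typedef struct { double ts; int is_null; } msg_t;

    typedef struct {                /* simple FIFO per input channel */
        msg_t buf[QUEUE_LEN];
        int head, tail;             /* head == tail means empty      */
    } channel_t;

    static bool empty(const channel_t *c) { return c->head == c->tail; }

    /* Return the index of the input whose head message carries the
       smallest timestamp, or -1 if some channel is empty: the node
       must then wait, since a smaller timestamp could still arrive. */
    int select_input(channel_t in[], int n) {
        int best = -1;
        for (int i = 0; i < n; i++) {
            if (empty(&in[i])) return -1;
            if (best < 0 || in[i].buf[in[i].head].ts <
                            in[best].buf[in[best].head].ts)
                best = i;
        }
        return best;
    }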
Although appealing, distributed simulation poses several pragmatic problems:
- Optimal assignment of nodes to processors is expensive.
- Only a subset of all discrete event simulation models are amenable to distributed simulation. As noted above, shared variables are not permitted. Hence, no events depending on the global system state are possible.
- Deadlocks can occur in most simulation models. Recall that a node must insure that no information received later can affect its output; this may require waiting for additional inputs. A cycle of waiting nodes results in deadlock.
The assignment problem deserves additional comment. The natural hardware realization of the network of nodes is a multicomputer network. Pragmatics dictate, however, that the multicomputer network have a fixed interconnection topology. Thus, a node network must be mapped onto the multicomputer network. Unfortunately, finding an optimal mapping is known to be NP-complete [Chu80]. In practice, scheduling heuristics must be used; sub-optimal mappings are produced at considerably less computational expense. Even if an optimal mapping were found, the respective topologies of the multicomputer and node networks may be ill suited, resulting in either large communication delays or processor load imbalances.²
Like node scheduling, deadlock resolution, although difficult, is solvable. Chandy and Misra have described two distributed deadlock resolution techniques, avoidance and recovery [ChMi81]. For specificity's sake, we describe these techniques in the context of our RESQ implementation [SaMS80] for simulating queueing networks.
In the RESQ scheme, there are five node types: service, fork, merge, source, and sink. Service nodes correspond to the interacting entities of a physical system (e.g., servers in a queueing network). In contrast, fork and merge nodes exist only to provide routing. Finally, source and sink nodes respectively create and destroy network messages. Thus, the central server model [Buze73] of Figure 1a would be represented, using the RESQ scheme, as shown in Figure 1b.
The RESQ notation for describing queueing network models has been widely used as an input language for sequential simulations. Using RESQ for parallel simulation entails modifying the semantics of some node types. Specifically, distributed simulation with deadlock avoidance [ChMi79] requires fork, merge, and server nodes³ to send null messages under certain conditions. These null messages are time stamped and tell the receiving node that no real message will be forthcoming before the specified time. This enables the receiver to process outstanding messages with the assurance that its actions will not be revoked at a later time.
²Scheduling difficulties can be ameliorated by a shared memory implementation of message passing. This approach is discussed later.
³By definition, source and sink nodes can never be members of a deadlock set.
A fork node accepts a single stream of message inputs and distributes this stream across N outputs. Upon receiving a real or null input message, a fork node routes the message to the selected output and creates N-1 null messages, each with the same timestamp as the message received. One null message is routed to each destination not selected.
A merge node accepts N streams of message inputs and routes them in timestamp order to a single output. As noted earlier, the timestamp ordering forces the node to wait for messages, perhaps null, on all inputs before producing an output.
Finally, a server node accepts a single input stream and produces a single output stream. When the time of last message arrival is greater than the time of last message departure, and the server has no real messages to process, it produces a null message with a timestamp equal to the minimum time of next real message departure.
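The fork-node rule, for instance, can be sketched as follows; send_msg() and choose_output() are assumed helper routines for illustration, not part of the implementation described in this paper.

    typedef struct { double ts; int is_null; } msg_t;

    extern void send_msg(int dest, msg_t m);  /* assumed primitive            */
    extern int  choose_output(int n);         /* e.g., by routing probability */

    void fork_node(msg_t in, const int out[], int n) {
        int pick = choose_output(n);
        send_msg(out[pick], in);              /* forward the message received */
        for (int i = 0; i < n; i++) {         /* N-1 nulls, same timestamp    */
            if (i == pick) continue;
            msg_t null_msg = { in.ts, 1 };
            send_msg(out[i], null_msg);       /* promise: nothing real will
                                                 arrive here before time in.ts */
        }
    }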
Although the null message technique provably avoids deadlocks, it does so at the price of potentially high overhead. In networks containing many fork/merge cycles, simulations have shown that the ratio of null to real messages can be very high [Seet78, Reed85]. The alternative to deadlock avoidance is deadlock detection and recovery. In this approach, the distributed simulation alternates between computation and recovery phases. As proposed by Chandy and Misra [ChMi81], the simulation runs until a distributed deadlock detection algorithm verifies deadlock. The simulation then enters a deadlock recovery phase and finally returns to active computation.
Although deadlock detection and recovery avoids null messages, it does so by diverting computation resources to detection and recovery. The performance advantage of this approach versus deadlock avoidance depends on the relative costs of synchronization and message passing.
In light of the many potentially performance-limiting problems with distributed simulation, it seems important to analyze the performance of distributed simulation in a realistic environment. Many such performance studies of traditional simulation algorithms have been conducted, and, based on these studies, new event list algorithms have been proposed [FrMa77, Wyma75]. Only limited simulation studies of distributed simulation have been reported [Seet78, JeSo85, Reed85]; little or no empirical data are available. In the remaining sections we discuss our experimental environment, implementation, and experimental results.
Experimental Environment 
All simulation experiments were conducted on a Sequent Balance 21000 containing 20 processors and 16 Mbytes of memory. As shown in Figure 2, each Balance 21000 processor is a 10 MHz National Semiconductor NS32032 microprocessor, and all processors are connected to shared memory by a shared bus with an 80 Mbyte/sec (maximum) transfer rate. Each processor has an 8K byte, write-through cache and an 8K byte local memory; the latter contains a copy of selected read-only operating system data structures and code.
The Dynix operating system for the Balance 21000 is a variant of UC-Berkeley's 4.2BSD Unix with extensions for processor scheduling. Because Dynix schedules all processes from a common pool, a process may execute on different processors during successive time slices. However, as long as the number of active processes is less than the number of processors, each process will execute on a separate processor. In this case, process and processor are equivalent notions. To the time-sharing user, the Balance 21000 appears as a standard Unix system, albeit with better interactive response time.
Parallel programs consist of a group of Unix processes that interact using a library of primitives for shared memory allocation and process synchronization. Shared memory is implemented by mapping a region of physical memory into the virtual address space of each process. This mapping can be done only once during program execution, typically at the beginning. Once mapped, shared memory can be allocated to specific variables as desired. Access to the shared memory region is controlled by software spin locks and barriers. These locks, semantically equivalent to binary semaphores, provide mutual exclusion. Barriers are used to synchronize a group of processes; a process reaching a barrier is forced to wait until all processes in the specified group have reached the barrier.
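For readers unfamiliar with these primitives, the sketch below shows functionally similar spin locks and barriers written with C11 atomics; the actual Balance 21000 library routines differ in name and implementation.

    #include <stdatomic.h>

    typedef atomic_flag spinlock_t;     /* initialize with ATOMIC_FLAG_INIT;
                                           behaves like a binary semaphore  */

    static void lock(spinlock_t *l)   { while (atomic_flag_test_and_set(l)) ; }
    static void unlock(spinlock_t *l) { atomic_flag_clear(l); }

    typedef struct {
        atomic_int count;   /* processes that have arrived       */
        atomic_int phase;   /* flips each time the barrier opens */
        int        total;   /* size of the process group         */
    } barrier_t;

    static void barrier_wait(barrier_t *b) {
        int my_phase = atomic_load(&b->phase);
        if (atomic_fetch_add(&b->count, 1) + 1 == b->total) {
            atomic_store(&b->count, 0);     /* last arrival resets ...   */
            atomic_fetch_add(&b->phase, 1); /* ... and releases everyone */
        } else {
            while (atomic_load(&b->phase) == my_phase)
                ;                           /* spin until the barrier opens */
        }
    }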
In summary, the Balance 21000 is a "standard" Unix system with minimal extensions for parallel programming. Consequently, many parallel operations are dominated by operating system overhead. For comparison with later discussion, Table 1 shows the elapsed times for typical operations.
Shared Memory Implementation of Distributed Simulation 
A shared memory multiprocessor, such as the Balance 21000, provides a flexible testbed for studying the performance of distributed simulation. The problems associated with mapping a node network onto a multicomputer network are removed; the shared memory processors are, effectively, completely connected. By implementing message passing using shared memory, communications costs are the same for all processors. However, a shared memory implementation of distributed simulation requires special consideration for synchronization of shared message queues, processor allocation, and deadlock management.
In a shared memory implementation of distributed simulation, all node state information, including input message queues, resides in shared memory. Message-based communication between nodes is implemented via shared access to the message queues of each node. Each message queue is protected by a synchronization lock to guarantee mutual exclusion. Synchronization is only necessary, however, if the communicating nodes execute on separate processors.
Before a node can send a timestamped message to another node, it must first acquire a free message from a shared free message list. A lock is necessary to prevent simultaneous access to the free message list. After retrieving a free message, the node timestamps it and writes it to the destination node's message queue, using synchronization primitives to lock and unlock the queue, if necessary. A message is returned to the free message list once it has been processed by the destination node. Because only messages are used for internode communication, the requirement that no simulated events depend on the global system state is still satisfied.
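In outline, the send path just described might look like the following; the structures and helper names are illustrative, and under static assignment the queue lock could be skipped for messages that stay within one processor's cluster.

    typedef struct message {
        double ts;
        int    is_null;
        struct message *next;
    } message_t;

    extern void lock(void *l);          /* spin-lock primitives, as */
    extern void unlock(void *l);        /* sketched earlier         */

    static message_t *free_list;        /* shared pool, assumed to be */
    static int        free_lock;        /* populated at startup       */

    typedef struct { message_t *head, *tail; int qlock; } node_queue_t;

    void send_message(node_queue_t *dest, double ts, int is_null) {
        lock(&free_lock);               /* grab a free message         */
        message_t *m = free_list;       /* (pool exhaustion not shown) */
        free_list = m->next;
        unlock(&free_lock);

        m->ts = ts; m->is_null = is_null; m->next = NULL;

        lock(&dest->qlock);             /* append to the input queue */
        if (dest->tail) dest->tail->next = m; else dest->head = m;
        dest->tail = m;
        unlock(&dest->qlock);
    }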
Processor Allocation 
There are two basic approaches to processor allocation in a shared memory implementation of distributed simulation. The first approach, static node assignment, fixes the assignment of nodes to processors for the duration of the simulation. When the number of network nodes equals the number of allocated processors, each node is assigned to a separate processor. Otherwise, nodes must be clustered, and these clusters are assigned to individual processors. Several clusterings are possible when the number of nodes exceeds the number of processors; each such clustering exhibits different performance. One advantage of static node assignment is that communication between nodes in a cluster can be done "locally" without the overhead for locking message queues. However, intercluster message transmissions require queue locking.
The second approach, dynamic node assignment, assigns nodes to processors during the simulation. Idle processors obtain work from a shared queue of unassigned network nodes. This shared node work queue must be locked before a processor can be allocated an unassigned network node. When a processor obtains a node, it satisfies any outstanding work for the node before returning the node to the tail of the node work queue. Because processors are assigned only one node at any time, all communication between nodes must be synchronized to guarantee exclusive access to shared message queues.
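In outline, each processor under dynamic assignment runs a loop of the following form; all helper routines are assumed for illustration, and details such as termination detection are omitted.

    extern void lock(void *l);
    extern void unlock(void *l);

    typedef struct node node_t;
    extern node_t *work_queue_pop(void);        /* NULL if the queue is empty */
    extern void    work_queue_push(node_t *n);  /* append at the tail         */
    extern void    service_node(node_t *n);     /* process pending messages   */
    extern int     simulation_done(void);

    static int wq_lock;                         /* guards the node work queue */

    void worker(void) {
        while (!simulation_done()) {
            lock(&wq_lock);
            node_t *n = work_queue_pop();
            unlock(&wq_lock);
            if (!n) continue;                   /* no unassigned node available */

            service_node(n);                    /* message queues are locked per
                                                   send, since any processor may
                                                   deliver to this node         */
            lock(&wq_lock);
            work_queue_push(n);                 /* back to the tail */
            unlock(&wq_lock);
        }
    }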
With dynamic node assignment, nodes must wait on the work queue until assigned to a processor. The length of this delay depends on the size of the node work queue and can be quite large for large networks. However, not all nodes on the work queue have outstanding work (i.e., input messages that will generate output messages when processed). In deadlock avoidance mode, for example, those nodes awaiting input can only generate null messages if processed. A natural strategy for improving performance places only those nodes with outstanding work on the work queue. This reduces the size of the node work queue and the waiting delay.
Our implementation of the above node waiting strategy is conservative. When a processor identifies a node with no outstanding work, it sets a "waiting" flag in the node's state and does not place the node on the work queue. When a message is sent to a waiting node, the processor sending the message will reset the waiting flag of the waiting node and place it on the work queue. The implementation is conservative because the new message may not actually instigate any new work for the node.
To investigate the effects of this node waiting strategy, we also implemented a no node waiting scheme. In this approach, a node is immediately placed at the tail of the work queue after it has been processed, even if it has no outstanding work.
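A sketch of the conservative waiting protocol follows, with illustrative names; a single lock guards both the flag and the node's state so that parking a node and waking it cannot race.

    typedef struct node {
        int waiting;             /* set when parked off the work queue */
        int state_lock;          /* guards the flag (and input queues) */
        /* ... input queues, local clock, etc. ... */
    } node_t;

    extern void lock(void *l);
    extern void unlock(void *l);
    extern void work_queue_push(node_t *n);
    extern int  has_outstanding_work(node_t *n);

    /* Called by a processor after servicing a node. */
    void requeue_or_park(node_t *n) {
        lock(&n->state_lock);
        if (has_outstanding_work(n)) work_queue_push(n);
        else n->waiting = 1;                 /* park the node */
        unlock(&n->state_lock);
    }

    /* Called by a sender after delivering a message to dest. The wakeup
       is conservative: the message may create no real work for dest.   */
    void wake_if_parked(node_t *dest) {
        lock(&dest->state_lock);
        if (dest->waiting) {
            dest->waiting = 0;
            work_queue_push(dest);
        }
        unlock(&dest->state_lock);
    }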
Although static node assignment is efficient for nodes within a cluster, the node assignment 
cannot be changed to  balance network load. Conversely, dynamic node assignment naturally 
adjusts to network load but incurs synchronization overhead not only for all messages but  also 
for access to the node work queue. Which implementation is best for a particular simulation 
model depends on the relative costs of synchronization and the beneficial effects of load balancing. 
Deadlock Avoidance and Recovery  
Our deadlock avoidance approach is a straightforward implementation of the algorithm described earlier [Seet78, ChMi81]. In contrast, deadlock recovery merits further discussion.
As described by Chandy and Misra, distributed simulation with deadlock detection and recovery alternates between simulation and distributed deadlock detection and recovery phases. The presence of shared memory obviates the need for most of the protocol for distributed deadlock detection [ChHM83]. Instead, each processor sets a flag in global memory when it believes it is deadlocked. A guardian processor monitors the global system state and forces the processors to rendezvous at a synchronization barrier when they all report potential deadlock. The deadlock recovery algorithm is then invoked.
Notice, however, that all processors reporting an inability to progress is a necessary but not sufficient condition for deadlock. Between the time a processor P reports potential deadlock and the time the guardian processor sees this report, processor P may have received messages enabling it to progress. Consequently, the processors may appear deadlocked when they are not. To reduce the probability of detecting such false deadlocks, the guardian uses a backoff algorithm that must re-verify a potential deadlock before invoking deadlock recovery. This algorithm, controlled by an input parameter, weighs the relative cost of forcing synchronization and deadlock recovery for a false deadlock against the lost time when detection of a real deadlock is delayed.
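In outline, the guardian's backoff check might look as follows; the flag array, delay parameter, and helper names are assumptions for illustration, not the paper's code.

    #include <unistd.h>

    #define NPROC 20
    extern volatile int blocked[NPROC];       /* set by a stalled processor,
                                                 cleared when it progresses */
    extern void force_rendezvous_and_recover(void);

    static int all_blocked(void) {
        for (int i = 0; i < NPROC; i++)
            if (!blocked[i]) return 0;
        return 1;
    }

    void guardian(unsigned backoff_usec) {
        for (;;) {                            /* termination test omitted */
            if (all_blocked()) {
                usleep(backoff_usec);         /* back off, then re-verify: a
                                                 message may have unblocked a
                                                 processor in the meantime  */
                if (all_blocked())
                    force_rendezvous_and_recover();
            }
        }
    }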
One may well ask why this deadlock detection technique was used, rather than a variation of graph reduction [Fink86] or the distributed deadlock detection proposed by Chandy et al. [ChHM83]. Simply put, the number and frequency of deadlocks in a distributed simulation is potentially enormous. Hence, deadlock detection and recovery must be fast. To obtain a consistent state for graph reduction, the processors must either exchange messages or synchronize. The overhead of the first is near that for deadlock avoidance. The latter is as expensive as detecting false deadlocks. Thus, detecting some false deadlocks using a backoff mechanism seems a reasonable compromise.
Simulation Experiments 
Experimental evaluation of distributed simulation requires not only an implementation but also a set of test cases. This is particularly important in light of earlier simulation studies [Seet78, Reed85], which showed that the performance of distributed simulation is extremely sensitive to the topology of the simulated network. Simple tests (e.g., tandem queues) have easily interpretable results, but do not reflect typical simulations. Conversely, simulation of complex queueing networks, although realistic, make it difficult to interpret the sources of performance degradation in distributed simulation.⁴
As a compromise, we selected several simple queueing networks and a few complex ones:
- tandem networks (1, 3, 4, 8, and 16 server nodes),
- general, feed-forward networks (6, 10, and 14 nodes),
- cyclic networks (2, 4, and 8 nodes),
- central server networks (5 nodes), and
- cluster networks (10 and 18 nodes).
The tandem and feed-forward networks are open networks and contain no cycles. With potentially linear speedup, they represent the best-case performance of distributed simulation. The cyclic networks show the performance degradation of tandem networks when they are closed. As an often used model of computer systems, central server networks have pragmatic importance [Buze73]. In addition, they have nested cycles, a more restrictive constraint than the simple cyclic networks. Finally, the cluster networks illustrate the effects of decomposability on simulation performance.
⁴We distinguish between the performance of the Chandy-Misra simulation and the performance measures for the simulated network. The former are the subject of this study.
Each of these networks was simulated for a variety of workloads (e.g., routing probabilities, arrival rates, and service times) using six variations of a Chandy-Misra implementation: static node assignment with deadlock avoidance, static node assignment with deadlock recovery, dynamic node assignment with deadlock avoidance, dynamic node assignment with deadlock recovery, dynamic node assignment with waiting and deadlock avoidance, and dynamic node assignment with waiting and deadlock recovery. In all cases, we varied the number of processors from one to the number of nodes in the simulated network. Together, these simulations represent approximately two weeks of computation time on the Sequent Balance 21000. Figures 8-20 and Tables 2-5, discussed below, show the results of a portion of these experiments. All such figures and tables show 95 percent confidence intervals about mean values.
Speedup, defined as

    S_p = T_1 / T_p,

where T_1 and T_p are the respective execution times using one and p processors, is the performance metric used to compare all experimental results. All speedups are shown relative to a one processor distributed simulation using static assignment with deadlock recovery. In this case, all simulated nodes execute on one processor. Consequently, no synchronization is needed during queue insertion and deletion. Deadlocks can occur with one processor. This increases the value of T_1 and, consequently, increases the apparent speedup. Although it might seem preferable to use an event list oriented simulation as the point of reference, this would color the results with the idiosyncrasies of two implementations. For comparison, we conducted equivalent event list simulations on the Balance 21000 using SMPL, a portable simulation package. These results show that a single processor distributed simulation always executes more slowly than the equivalent sequential simulations. Thus, the speedups presented can be viewed as upper bounds
on the speedup achievable with distributed simulation. 
In addition to speedup, we also use deadlock recovery and null message fractions as performance measures. These are defined as

    F_D = (number of deadlock recoveries) / (number of message transmissions)

and

    F_N = (number of null message transmissions) / (number of message transmissions),

respectively. The deadlock recovery and null message fractions measure the amount of useful computation performed by each simulation.
Tandem Networks
Tandem networks are a feasibility test of distributed simulation. If distributed simulation cannot achieve good pipelined speedups for tandem networks, there is little prospect for success for networks containing cycles.
Figure 8 shows the speedups for both deadlock avoidance and recovery when nodes are statically assigned to processors. Because there are no cycles, no deadlocks occur, and there is little distinction between deadlock recovery and avoidance. Recall that deadlock avoidance must continually verify that no null messages need be sent. Conversely, deadlock recovery does nothing until deadlock is detected. Thus deadlock avoidance incurs a small overhead even if no null messages need be sent. This difference is magnified as the number of nodes increases, leading to a small, but perceptible difference at 16 server nodes.
Figure 8 shows a linear speedup for a small number of nodes, and a decrease in the slope of the speedup curve for additional nodes. This sublinear speedup for a larger number of nodes arises from memory and bus contention, as well as synchronization overhead.
By comparison, Figure 9 shows speedups when deadlock recovery is used, and nodes are retrieved from a work queue.⁵ Dynamic node assignment yields greater speedup than static assignment, but it too suffers from memory contention. Using half as many processors as nodes results in near linear speedup, albeit a smaller speedup than that obtained with maximal parallelism. This is the final confirmation of the effects of memory contention.
Waiting (i.e., placing nodes on the work queue only when they can profitably be evaluated) is ineffective because, in a tandem network, all nodes are always active. The additional overhead simply reduces the speedup, as shown in Figure 9 when maximal parallelism is used.
The previous discussion, with one exception, assumed the number of processors equaled the number of nodes. When the number of nodes exceeds the number of processors, nodes must, with static assignment, be clustered onto processors. Table 2 shows the effects of this clustering for a tandem network containing 16 server nodes. For static assignment, speedup declines precipitously as the number of processors is reduced (e.g., reducing the number of processors from 18 to 12 reduces the speedup from approximately 9 to 4). In contrast, dynamic node assignment allocates processors to nodes based on their need for evaluation. When the number of nodes exceeds the number of processors, dynamic assignment is the method of choice.
Finally, we note that the sequential execution time, 113 seconds, compares favorably to the single processor distributed simulation. This suggests that the overhead for distributed simulation, other than that for deadlock avoidance or recovery, compares favorably to that for event-driven simulation.
⁵In Figure 9, "half parallelism" means that the number of processors used is equal to one half the number of network nodes.

General Feed-forward Networks
Among the simplest generalizations of a tandem network are those containing forks and joins. Figures 4 and 5 show two such networks of differing complexity. Tables 3 and 4 show the corresponding speedups as a function of node clustering and deadlock technique.
Unlike the tandem networks, where deadlock avoidance and recovery are indistinguishable, feed-forward networks with forks necessarily distinguish between the two deadlock techniques. Because this network is open and contains no cycles, no deadlocks can occur, and deadlock detection detects none. In contrast, deadlock avoidance requires that null messages be sent at each fork node. This overhead is the reason for the difference in the performance of the two deadlock techniques.
Although speedups are not linear in the number of processors, the fork and join nodes do not require as much processing time as the server nodes. Because of this, a linear speedup from a sequential simulation cannot be expected.
Cyclic Networks
The closed equivalent of a tandem network is the cyclic queue of Figure 6. Unlike the tandem network, where the interarrival time at the source node does not affect the execution time of the simulation, the cyclic network depends on the simulated population. Figure 10 shows the speedup obtained for a four node cyclic network; similar results also hold for larger cyclic networks. In contrast to the tandem networks, the cyclic network does not show linear speedup as a function of the number of processors.
Initially, one might suspect deadlock avoidance or recovery caused this decrease in performance. However, examination of the simulations shows that deadlock avoidance sent no null messages. Instead, messages circulate in large groups or trains; a node processes a train of messages and waits until they return on their next cycle. This suggests that fewer processors, dynamically assigned to the nodes, would achieve most of the potential speedup. Figure 11
confirms this supposition: two dynamically assigned processors achieve nearly 80 percent of the speedup obtained with four processors. When dynamic assignment is augmented with waiting, as in Figure 12, the difference grows even smaller.
The second important conclusion drawn from simulating cyclic queues is the extremely high cost for deadlock recovery. As noted above, deadlock avoidance sent no null messages. In contrast, deadlock recovery detected a small number of potential deadlocks. At population 40, the deadlock recovery fraction F_D was 0.0015. This corresponds to approximately 250 deadlock recoveries in 160,000 message transmissions. As discussed earlier, deadlock recovery forces all processors to synchronize at a barrier before invoking the deadlock recovery algorithm. Execution profiling showed that the deadlock recovery routine and the barrier primitive comprised a negligible fraction of the simulation time. Synchronizing is expensive, but only because there is a significant interval between the arrival of the first processor at the barrier and the last. The parallelism declines as each processor reaches the barrier. Were the transition from computation to deadlock recovery abrupt, deadlock recovery would be inexpensive. As Figure 11 shows, dynamic node assignment improves the performance of deadlock recovery, primarily because processors can more quickly rendezvous at the synchronization barrier. Finally, because fewer processors are actively evaluating nodes, dynamic node assignment with waiting further reduces the rendezvous delay; see Figure 12.
Central Server Networks 
Central server networks have long been used as models of computer systems .Buze73!, and 
consequently have pragmatic importance. Because they contain nested cycles, central server 
networks are susceptible to  deadlock in a distributed simulation. Hence, they are a more realistic 
test of distributed simulation. Figures 13 and 14 show the speedup obtained for a central server 
network containing three servers; see Figure 1 for the network topoiogy. Even with five 
19 
processors, the speedup barely exceeds unity. Moreover, this is using the single processor, static node assignment case as the basis for calculating speedup. As Table 5 shows, the parallel implementation rarely completes more quickly than the sequential implementation. Indeed, static node assignment with deadlock avoidance runs 16 times more slowly than the sequential implementation. Consequently, the speedups over an event-driven simulation are much lower than Figures 13 and 14 suggest.
Unlike the simple cyclic network, where both deadlock avoidance and recovery were rare, the central server network frequently forces the simulation to either send null messages or attempt deadlock recovery. Figures 15 and 16 respectively show the deadlock recovery and null message fractions for static node assignment. Although not shown, the fractions for dynamic node assignment are similar.
With only one circulating message, nearly fifty null messages are transmitted for each movement of the real message. Although the null fraction decreases as the number of circulating messages increases, it converges to approximately twenty null messages per real message transmission.⁶ In contrast, the deadlock recovery fraction converges to 0.35. Although these deadlock recoveries are expensive, as the analysis of cyclic networks showed, their number is so small compared to the number of null messages sent during deadlock avoidance that deadlock recovery is significantly faster.
Finally, we must emphasize that these results are significantly more negative than earlier simulated results [Reed85]. A sequential simulation of a network, by its nature, imposes some sequential ordering on the evaluation of network nodes. When those nodes are not being evaluated, they do not generate null messages, nor can they deadlock. In contrast, in a fully parallel implementation, all nodes are always active. Thus, they continue to receive and generate null messages while awaiting receipt of real messages. Thus, the overhead is higher than suggested by a sequential simulation of distributed simulation.
⁶This value, twenty, seems independent of the network routing probabilities. Removing the nested cycle from Figure 1 neither increases the observed speedup nor decreases the null fraction. We hypothesize that the value is a function of the relative speeds of the processors and memory.
Cluster Networks
Cluster networks were the most complex simulated during our experimental study. As 
Figure 7 shows, a cluster network is composed of several tightly clustered subnetworks. This 
has two important ramifications. First, the network is nearly decomposed and should yield 
significant speedups with parallel simulation. Secondly, the clustering increases the expected 
execution time of the simulation. Why? The  clocks of all nodes must reach the terminating 
value before the simulation completes. With only a few circulating messages, some nodes may be 
idle for long periods of time. Only when a message "escapes" from a subcluster will the clocks of 
other nodes advance. This asynchrony means that  the clocks of some nodes may run far past the 
terminating simulation value. 
Figure 17 shows the speedup of the cluster network with static node assignment. For small 
populations, deadlock recovery is significantly faster than deadlock avoidance. As with the 
central server network, this is attributable to  the large number of null messages sent. As the 
population increases, the null fraction decreases precipitously, and deadlock avoidance becomes 
the method of choice. (Figure 18 shows the deadlock and null fractions.) Interestingly, the 
speedup obtained with deadlock recovery is relatively insensitive to the simulated population. As Figure 18 shows, the deadlock fraction is negligible but constant. With a large number of
processors, the delay to  synchronize at a barrier is prohibitive; this overhead is the reason for the 
poor performance of deadlock recovery. 
When nodes are dynamically assigned to processors (Figure 19), the performance of both deadlock avoidance and deadlock recovery increases significantly. As noted earlier, messages are often "trapped" in network subclusters and many nodes are often idle. With deadlock avoidance and static node assignment, many nodes continually generate null messages. These messages simply cause additional overhead and memory contention. With dynamic assignment, a node must migrate from the tail to the head of the node work queue before being evaluated. This additional delay between evaluations reduces the number of null messages and is the source of the additional speedup with dynamic node assignment.
Figure 19 also shows that a smaller number of processors yields a marginally larger speedup than that obtained with maximal parallelism. Although the difference is not statistically significant in this case, Figure 20 shows that the difference is large when node waiting is introduced. The reason is, as before, the presence of many idle nodes. By suspending nodes that cannot productively contribute to the simulation, contention for the node work queue is reduced, and those nodes with work can proceed without interference.
In summary, the cluster network shows that distributed simulation can produce significant speedups if the network is decoupled, and subclusters interact infrequently.
Summary
Distributed simulation has been the subject of several simulated performance studies; little or no experimental data have heretofore been available. Obtaining such data was the primary goal of this work. Using queueing networks as the simulation application, we simulated a variety of such networks with varying workloads using several variations of the Chandy-Misra algorithm on a shared memory machine.
These experiments show that, with rare exception, the Chandy-Misra approach to distributed simulation is not a viable approach to parallel simulation of queueing network models. There are two primary reasons for this. First, a single processor implementation of the
Chandy-Misra algorithm is sometimes slower than the equivalent sequential, event-driven simulation. Thus, multiple processors are needed just to recoup the loss due to the inefficiency. Second, networks with cycles require deadlock avoidance or recovery techniques. These techniques are extremely costly, and there is little prospect that they can be reduced to acceptable levels.
Because queueing network simulation requires little processing by server nodes, nodes interact frequently in real time. Because of this, queueing networks are a stress test for distributed simulation. In simulations that require extensive computation between node interactions, distributed simulation is analogous to a group of decoupled processes. In such cases, distributed simulation should prove more attractive.
Acknowledgments 
Jack Dongarra and the Advanced Computing Research Facility of Argonne National 
Laboratory graciously provided both advice and access to  the Sequent Balance 21000. 
References

[Buze73] J. P. Buzen, "Computational Algorithms for Closed Queueing Networks with Exponential Servers," Communications of the ACM, Vol. 16, No. 9, September 1973, pp. 527-531.

[ChHM79] K. M. Chandy, V. Holmes, and J. Misra, "Distributed Simulation of Networks," Computer Networks, Vol. 3, No. 1, February 1979, pp. 105-113.

[ChMi79] K. M. Chandy and J. Misra, "Distributed Simulation: A Case Study in Design and Verification of Distributed Programs," IEEE Transactions on Software Engineering, Vol. SE-5, No. 5, September 1979, pp. 440-452.

[ChMi81] K. M. Chandy and J. Misra, "Asynchronous Distributed Simulation via a Sequence of Parallel Computations," Communications of the ACM, Vol. 24, No. 4, April 1981, pp. 198-206.

[ChHM83] K. M. Chandy, L. M. Haas, and J. Misra, "Distributed Deadlock Detection," ACM Transactions on Computer Systems, Vol. 1, No. 2, May 1983, pp. 141-156.

[ChBr83] A. Chandak and J. C. Browne, "Vectorization of Discrete Event Simulation," Proceedings of the 1983 International Conference on Parallel Processing, August 1983, pp. 359-361.

[Chu80] W. W. Chu, et al., "Task Allocation in Distributed Data Processing," IEEE Computer, Vol. 13, No. 11, November 1980, pp. 57-69.

[Comf82] J. C. Comfort, "The Design of a Multi-microprocessor Based Simulation Computer - I," Proceedings of the Fifteenth Annual Simulation Symposium, March 1982, pp. 45-53.

[Comf83] J. C. Comfort, "The Simulation of a Master-Slave Event Set Processor," Simulation, Vol. 42, pp. 117-124.

[Fink86] R. A. Finkel, An Operating Systems Vade Mecum, Prentice-Hall, 1986.

[FrWW84] M. A. Franklin, D. F. Wann, and K. F. Wong, "Parallel Machines and Algorithms for Discrete-Event Simulation," 1984 International Conference on Parallel Processing, August 1984, pp. 449-458.

[FrMa77] D. Franta and W. Maly, "An Efficient Data Structure for the Simulation Event Set," Communications of the ACM, Vol. 20, No. 8, August 1977, pp. 596-602.

[Heid86] P. Heidelberger, "Statistical Analysis of Parallel Simulations," Proceedings of the 1986 Winter Simulation Conference, to appear.

[JeSo85] D. Jefferson and H. Sowizral, "Fast Concurrent Simulation Using the Time Warp Mechanism," Distributed Simulation 1985, The 1985 Society for Computer Simulation Multiconference, San Diego, California.

[KoRW86] M. B. Konsek, D. A. Reed, and W. Watcharawittayakul, "Context Switching with Multiple Register Windows: A RISC Performance Study," in preparation.

[PaSe82] D. A. Patterson and C. H. Sequin, "A VLSI RISC," IEEE Computer, Vol. 15, No. 9, September 1982, pp. 8-21.

[Patt85] D. A. Patterson, "Reduced Instruction Set Computers," Communications of the ACM, Vol. 28, No. 1, January 1985, pp. 8-21.

[PeWM79] J. K. Peacock, J. W. Wong, and E. G. Manning, "Distributed Simulation Using a Network of Processors," Computer Networks, Vol. 3, No. 1, February 1979, pp. 44-56.

[Pfis82] G. F. Pfister, "The Yorktown Simulation Engine: Introduction," ACM/IEEE 19th Design Automation Conference, June 1982, pp. 51-54.

[Reed83] D. A. Reed, "A Simulation Study of Multimicrocomputer Networks," Proceedings of the 1983 International Conference on Parallel Processing, August 1983, pp. 161-163.

[Reed85] D. A. Reed, "Parallel Discrete Event Simulation: A Case Study," Record of Proceedings: 18th Annual Simulation Symposium, March 1985, pp. 95-107, invited paper.

[ReFu86] D. A. Reed and R. M. Fujimoto, Multicomputer Networks: Message Based Parallel Processing, submitted to MIT Press.

[SaMS80] C. H. Sauer, E. A. MacNair, and S. Salza, "A Language for Extended Queueing Networks," IBM Journal of Research and Development, Vol. 24, No. 6, November 1980, pp. 747-753.

[Seet78] M. Seethalakshmi, "Performance Analysis of Distributed Simulation," M.S. Report, Computer Science Department, University of Texas, Austin, Texas, 1978.

[Seit85] C. L. Seitz, "The Cosmic Cube," Communications of the ACM, Vol. 28, No. 1, January 1985, pp. 22-33.

[Wyma75] P. F. Wyman, "Improved Event Scanning Mechanisms for Discrete Event Simulation," Communications of the ACM, Vol. 18, No. 4, April 1975, pp. 221-230.
Table 1
Typical Operation Times for the Sequent Balance 21000

Operation                   Time (μsec)
Lock/unlock                 60
Subroutine call/return      60
System call                 400
Context switch              1000
Process creation            60000
Table 2
Speedups for tandem network with 16 server nodes

                 STATIC                          DYNAMIC
Case   Recovery        Avoidance        Recovery        Recovery        Avoidance       Avoidance
                                                        w/ Waiting                      w/ Waiting
A      9.24 ± 0.41%    8.73 ± 0.50%     10.22 ± 1.92%   9.97 ± 0.94%    10.19 ± 2.49%   9.74 ± 1.25%
B      8.53 ± 0.31%    8.23 ± 0.56%     7.50 ± 0.99%    7.52 ± 1.32%    7.46 ± 1.48%    7.41 ± 1.71%
C      4.58 ± 8.03%    4.27 ± 11.29%
D      3.87 ± 7.61%    3.28 ± 23.16%
E      3.22 ± 3.19%    2.83 ± 8.11%
F      3.60 ± 0.23%    3.51 ± 0.86%
G      2.41 ± 2.24%    2.38 ± 1.69%
H      1.87 ± 0.21%    1.83 ± 0.62%
I
J      1.00 ± 0.33%    0.98 ± 0.37%     0.91 ± 0.17%    0.98 ± 0.37%    0.91 ± 0.39%    0.90 ± 0.77%

Parameter                        Value
Node Service Time                0.0625
Confidence Level                 95%
Speedup Base                     One processor static deadlock recovery
Mean Base Execution Time         117.88 seconds
Mean Sequential Execution Time   113.45 seconds

Cluster case A    (1) (2) ... (18)
Cluster case B    (1 17) (2) ... (15) (16 18)
Cluster case C    (1 17) (2 3) (4) ... (13) (14 15) (16 18)
Cluster case D    (1 17) (2 3) (4 5) (6) ... (11) (12 13) (14 15) (16 18)
Cluster case E    (1 17 2) (3 4 5) (6 7 8) (9 10 11) (12 13) (14) ... (16) (18)
Cluster case F    (1 17 2) (3 4 5) (6 7 8) (9 10 11) (12 13 14) (15 16 18)
Cluster case G    (1 17 2 3 4) (5 6 7 8) (9 10 11 12) (13 14 15 16 18)
Cluster case H    (1 17 2 3 4 5) (6 7 8 9 10 11) (12 13 14 15 16 18)
Cluster case I    (1 17 2 3 4 5 6 7 8) (9 10 11 12 13 14 15 16 18)
Cluster case J    (1 17 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 18)

Node numbers refer to Figure 3. Parenthesized node groups execute on one processor.
Table 3
Speedups for generalized feed-forward network with 6 nodes

                 STATIC                          DYNAMIC
Case   Recovery        Avoidance        Recovery        Recovery        Avoidance       Avoidance
                                                        w/ Waiting                      w/ Waiting
A      3.34 ± 1.00%    2.57 ± 0.62%     3.74 ± 0.00%    3.66 ± 1.40%    2.90 ± 1.80%    2.81 ± 1.35%
B      2.16 ± 0.64%    1.81 ± 1.67%     2.47 ± 1.13%    2.47 ± 1.12%    1.86 ± 1.31%    1.83 ± 2.09%
C      2.17 ± 0.38%    1.36 ± 4.95%
D      1.57 ± 1.40%    1.21 ± 0.36%
E      1.55 ± 0.46%    1.25 ± 1.46%
F      1.00 ± 0.82%    0.76 ± 0.51%     0.88 ± 0.48%    0.87 ± 0.47%    0.65 ± 0.43%    0.63 ± 0.40%

Parameter            Value
Node Service Time    1.0
Confidence Level     95%
Speedup Base         One processor static deadlock recovery

Cluster case A    (1) ... (6)
Cluster case B    (1) (2) (3 4) (5 6)
Cluster case C    (1 2) (3 4) (5 6)
Cluster case D    (1) (2) (3 4 5 6)
Cluster case E    (1 5 6) (2 3 4)
Cluster case F    (1 2 3 4 5 6)

Node numbers refer to Figure 4. Parenthesized node groups execute on one processor.
Table 4
Speedups for generalized feed-forward network with 14 nodes

                 STATIC                          DYNAMIC
Case   Recovery        Avoidance        Recovery        Recovery        Avoidance       Avoidance
                                                        w/ Waiting                      w/ Waiting
A      6.05 ± 1.46%    2.36 ± 0.45%     7.19 ± 2.27%    6.93 ± 2.69%    2.53 ± 2.56%    2.25 ± 2.54%
B      4.86 ± 0.91%    2.32 ± 0.93%     5.44 ± 0.96%    5.42 ± 0.81%    3.04 ± 0.64%    2.98 ± 0.45%
C      1.82 ± 0.42%    0.97 ± 0.19%
D      1.83 ± 0.65%    0.96 ± 0.31%
E      1.79 ± 4.69%    0.99 ± 0.52%
F      1.00 ± 0.91%    0.57 ± 0.76%     0.90 ± 0.56%    0.89 ± 0.13%    0.48 ± 0.46%    0.48 ± 0.47%

Parameter            Value
Node Service Time    1.0
Confidence Level     95%
Speedup Base         One processor static deadlock recovery

Cluster case A    (1) ... (14)
Cluster case B    (1) (2) (3 9) (4 14) (5 6) (7) (8) (10 13) (11) (12)
Cluster case C    (1) (2) (7) (8) (11) (12) (3 4 5 6 9 10 13 14)
Cluster case D    (1 2) (7 8) (11 12) (3 4 5 6 9 10 13 14)
Cluster case E    (1 2 7 8 11 12) (3 4 5 6 9 10 13 14)
Cluster case F    (1 2 7 8 11 12 3 4 5 6 9 10 13 14)

Node numbers refer to Figure 5. Parenthesized node groups execute on one processor.
Table 5
Sequential and parallel mean execution time for five node central server
(time given in seconds)

                      PARALLEL (static)      PARALLEL (dynamic)
Popu-                                                    Recovery
lation  Sequential   Recovery   Avoidance   Recovery    w/ Waiting   Avoidance
1       26.32        28.97      491.85      33.62       35.47        569.00
2       42.80        44.87      510.71      50.19       56.70        662.30
3       51.44        52.73      490.48      59.89       61.95        655.15

Routing Probability: (1) 0.10, (4) 0.45, (5) 0.45

Node numbers refer to Figure 1.
Figure 1a Central Server Queueing Model
Figure 1b RESQ Representation of Central Server Model
Figure 2 Sequent Balance 21000 Configuration
(20 NS32032 processors at 10 MHz, each with an 8K cache and 8K local memory, connected to global memory by an 80 Mbyte/sec bus)
Figure 3 Tandem queue 
Figure 4 Generalized feed-forward network (6 nodes)
Figure 5 Generalized feed-forward network (14 nodes)
Figure 6 Cyclic network
Figure 7 Cluster network
Figure 8
Speedup for tandem queue
(static node assignment)
[Plot: speedup (0-12) versus number of server nodes (1-16). Curves: Deadlock Recovery - Maximum Parallelism; Deadlock Recovery - Sequential; Deadlock Avoidance - Maximum Parallelism.]

Parameter            Value
Node Service Time    1 / number of server nodes
Confidence Level     95%
Speedup Base         One processor static deadlock recovery
Figure 9
Speedup for tandem queue with deadlock recovery
(dynamic node assignment)
[Plot: speedup (0-12) versus number of server nodes (1-16). Curves: Deadlock Recovery - Maximum Parallelism, Half Parallelism, and Sequential; with Waiting - Maximum Parallelism, Half Parallelism, and Sequential.]

Parameter            Value
Node Service Time    1 / number of delay nodes
Confidence Level     95%
Speedup Base         One processor static deadlock recovery
Figure 10
Speedup for four node cyclic queue
(static node allocation)
[Plot: speedup (0.00-2.00) versus population (0-40). Curves: Deadlock Recovery (Cluster Cases A and B); Deadlock Avoidance (Cluster Cases A and B).]

Parameter            Value
Node Service Time    0.25
Confidence Level     95%
Speedup Base         One processor deadlock recovery
Cluster case A       (1) (2) (3) (4)
Cluster case B       (1 2) (3 4)

Node numbers refer to Figure 6. Parenthesized node groups execute on one processor.
Figure 11
Speedup for four node cyclic queue
(dynamic node assignment)
[Plot: speedup (0.00-2.00) versus population (0-40). Curves: Deadlock Recovery (4 PEs, 2 PEs); Deadlock Avoidance (4 PEs, 2 PEs).]

Parameter            Value
Node Service Time    0.25
Confidence Level     95%
Speedup Base         One processor deadlock recovery
Figure 12
Speedup for four node cyclic queue
(dynamic node assignment with waiting)
[Plot: speedup (0.00-2.00) versus population (0-40). Curves: Deadlock Recovery (4 PEs, 2 PEs); Deadlock Avoidance (4 PEs, 2 PEs).]

Parameter            Value
Node Service Time    0.25
Confidence Level     95%
Speedup Base         One processor deadlock recovery
Figure 13
Speedup for five node central server
(static node assignment)
[Plot: speedup (0.00-1.25) versus population (0-40). Curves: Deadlock Recovery (Cluster Cases A, B, C); Deadlock Avoidance (Cluster Cases A, B, C).]

Parameter            Value
Routing Probability  (1) 0.10, (4) 0.45, (5) 0.45
Confidence Level     95%
Speedup Base         One processor static deadlock recovery
Cluster case A       (1) (2) (3) (4) (5)
Cluster case B       (1 2) (3) (4 5)
Cluster case C       (1 2) (3 4 5)

Node numbers refer to Figure 1. Parenthesized node groups execute on one processor.
Figure 14
Speedup for five node central server
(dynamic node assignment)
[Plot: speedup (0.00-1.25) versus population (0-40). Curves: Deadlock Recovery (5 PEs, 3 PEs); Deadlock Recovery with waiting (5 PEs, 3 PEs); Deadlock Avoidance (5 PEs, 3 PEs); Deadlock Avoidance with waiting (5 PEs, 3 PEs).]

Parameter            Value
Routing Probability  (1) 0.10, (4) 0.45, (5) 0.45
Confidence Level     95%
Speedup Base         One processor static deadlock recovery
Figure 15
Deadlock fractions for five node central server
(static node assignment)
[Plot: fraction (0.10-0.50) versus population (0-40). Curves: Deadlock Recovery - p0 (Cluster Cases A, B, C); Deadlock Recovery - p1 (Cluster Cases A, B, C).]

Parameter                 Value
p0 Routing Probability    (1) 0.0, (4) 0.5, (5) 0.5
p1 Routing Probability    (1) 0.10, (4) 0.45, (5) 0.45
Confidence Level          95%
Speedup Base              One processor static deadlock recovery
Cluster case A            (1) (2) (3) (4) (5)
Cluster case B            (1 2) (3) (4 5)
Cluster case C            (1 2) (3 4 5)

Node numbers refer to Figure 1. Parenthesized node groups execute on one processor.
Figure 16
Null fractions for five node central server
(static node assignment)
[Plot: fraction versus population (0-40).]

Parameter                 Value
p0 Routing Probability    (1) 0.0, (4) 0.5, (5) 0.5
p1 Routing Probability    (1) 0.10, (4) 0.45, (5) 0.45
Confidence Level          95%
Speedup Base              One processor static deadlock recovery
Cluster case A            (1) (2) (3) (4) (5)
Cluster case B            (1 2) (3) (4 5)
Cluster case C            (1 2) (3 4 5)

Node numbers refer to Figure 1. Parenthesized node groups execute on one processor.
Figure 17
Speedup for four block cluster
(static node assignment)
[Plot: speedup (0-6) versus population (0-16). Curves: Deadlock Recovery (Cluster Cases A, B, C); Deadlock Avoidance (Cluster Case A).]

Parameter            Value
Node Service Time    1.0
Confidence Level     95%
Speedup Base         One processor deadlock recovery
Cluster case A       (1) (2) ... (18)
Cluster case B       (1 4) (2 3) (5 8) (6 7) (9 10) ...
Cluster case C       (1 2 3 4) (5 6 7 8) (9 10) ...

Node numbers refer to Figure 7. Parenthesized node groups execute on one processor.
Figure 18
Deadlock and null fractions for four block cluster
(static node assignment)
[Plot: fraction (0-40) versus population (0-16). Curves: Deadlock (Cluster Cases A, B, C); Null (Cluster Case C).]

Parameter            Value
Node Service Time    1.0
Confidence Level     95%
Speedup Base         One processor deadlock recovery
Cluster case A       (1) (2) ... (18)
Cluster case B       (1 4) (2 3) (5 8) (6 7) (9 10) ...
Cluster case C       (1 2 3 4) (5 6 7 8) (9 10) ...

Node numbers refer to Figure 7. Parenthesized node groups execute on one processor.
Figure 19
Speedup for four block cluster
(dynamic node assignment)
[Plot: speedup (0-8) versus population (0-16). Curves: Deadlock Recovery (18 PEs, 9 PEs); Deadlock Avoidance (18 PEs, 9 PEs).]

Parameter            Value
Node Service Time    1.0
Confidence Level     95%
Speedup Base         One processor deadlock recovery
Figure 20
Speedup for four block cluster
(dynamic node assignment with waiting)
[Plot: speedup versus population (0-16).]

Parameter            Value
Node Service Time    1.0
Confidence Level     95%
Speedup Base         One processor deadlock recovery