Reactive-Process Programming and Distributed Discrete-Event Simulation by Su, Wen-King
ReactiveProcess Programming
and
Distributed DiscreteEvent Simulation
Thesis by
WenKing Su
In Partial Fulllment of the Requirements
for the Degree of
Doctor of Philosophy
California Institute of Technology
Pasadena California

Submitted October  	

CaltechCSTR	
ii
c
  
WenKing Su
All rights Reserved
iii
Acknowledgments
Many thanks:
To my thesis advisor, Dr. Charles L. Seitz, whose care and dedication made it
all possible.
To my committee members, Dr. Charles L. Seitz, Dr. Mani Chandy,
Dr. Alain Martin, Dr. Brad Sturtevant, and Dr. Eric Van de Velde, for
their careful review and analysis of my research.
To our technical editor, Dian De Sha, who spent glorious days and nights
tracking and hunting the blunders and blemishes in my writing.
To our operations manager, Arlene DesJardins, who takes care of every little
day-to-day detail and makes the department feel like a nice big family.
To my peers, Bill Athas, Bill Dally, John Ngai, and Craig Steele, for their help
and advice.
To my junior coworkers, Nanette Boden, Charles Flaig, Glenn Lewis,
Mike Pertel, and Jakov Seizovic, for their feedback and support.
To our system managers, Don Speck, Chris Lee, and Joe Beckenbach, for
keeping our machines running smoothly.
To our guests from abroad, Sven Mattisson and Lena Peterson, for their
enthusiasm and friendship.
To my advisors at UC Davis, Dr. Wen C. Lin of EECS and
Dr. George E. Bruening of BioChem, for my enlightenments.
To my buddies from UC Davis, Glenn Saito and John Bakos, for their help in
my college years.
To my teachers and counselor at Casa Roble High School, Mr. Gomez,
Dr. Smithson, Mr. Homan, Mr. Scalatta, Mr. Pickard, Mr. Hellen,
Mrs. Sproul, and Mrs. Cruzen, who worked to keep me involved in school.
To Xerox Corporation, for supporting this work through a Xerox
special-opportunity fellowship.
To my parents, who endured many difficult times to bring me here and to
raise me in this land of opportunity.
And to Freedom and Liberty,
sacred to our very heart and soul, yet sadly denied to so many.
The research described in this thesis was sponsored in part by the Defense Advanced
Research Projects Agency, DARPA Order number , and monitored by the Office
of Naval Research under contract number N K .
Abstract
The same forces that spurred the development of multicomputers (the demand for
better performance and economy) are driving the evolution of multicomputers in
the direction of more abundant and less expensive computing nodes: the direction
of fine-grain multicomputers. This evolution in multicomputer architecture derives
from advances in integrated-circuit packaging and message-routing technologies,
and carries far-reaching implications in programming and applications. This thesis
pursues that trend with a balanced treatment of multicomputer programming and
applications. First, a reactive-process programming system, Reactive-C, is
investigated; then, a model application, discrete-event simulation, is developed;
finally, a number of logic-circuit simulators written in the Reactive-C notation are
evaluated.
One difficulty in multicomputer applications is the inefficiency of many
distributed algorithms compared to their sequential counterparts. When better
formulations are developed, they often scale poorly with increasing numbers of nodes,
and their beneficial effects eventually vanish when many nodes are used. However,
rules for programming are quite different when nodes are plentiful and cheap: The
primary concern is to utilize all of the concurrency available in an application, rather
than to utilize all of the computing cycles available in a machine. We have shown in
our research that it is possible to extract the maximum concurrency of a simulation
subject, even one as difficult as a logic circuit, when one simulation element is
assigned to each node. Despite the initial inefficiency of a straightforward algorithm,
as the number of nodes increases, the computation time decreases linearly until
there are only a few elements in each node. We conclude by suggesting a technique
to further increase the available concurrency when there are many more nodes than
simulation elements.
Contents
List of Figures
List of Program Listings
1 Introduction
  Motivation
  History
  Outline
2 Reactive-Process Programming
  Definition of a Reactive Process
  Reactive-C Programming System
3 Reactive-Process Layers
  Simple Layers
    The bottom layer: b_layer
    The length-carrying layer: l_layer
    The nonblocking-receive layer: nb_layer
    Handler layering
  Message Type
  Discretion on Receive
    Discretion using b_layer functions
    The RPC-discretion layer: r_layer
    The CSP-discretion layer: csp_layer
    A more general type-discretion layer: t_layer
  Other Layers
    A flow-controlling layer: f_layer
    The CK primitives
    The RK primitives: x_primitives
  Layering on Light-Weight Processes
4 Cosmic Environment
  The Cosmic Environment Specification
  Our Cosmic Environment Implementation
    Structure of our CE implementation
    Cosmic Environment exterior
    Cosmic Environment processes
    Program compilation
    Spawning programs
    Data representation and conversion
5 Model of Simulation
  Mathematical Framework and Analysis
    Systems and elements
    States and time
    Knots and progress
    Rules of thumb: sufficient conditions for progress
    Nonexistence of necessary and sufficient progress conditions
      Simulation and Boolean satisfiability
      Simulation and simultaneous equations
  Operational Framework
    Breaking a simulation into smaller slices
    Slices and knots
    Implementation considerations
  The Generic Simulator Model and Its Derivatives
    Message-driven simulation
    Concurrent event-driven simulation
    Sequential simulator
    Concurrent backtracking simulators
    Branch-and-bound simulators
    Time-driven simulators
    Summary
6 Logic-Circuit Simulator Experiments
  Why Logic Circuits
  CMB-Variant Simulator
    The element simulators
    The simulator message system
    The variants
    Variant algorithms
    Instrumentation
    Experimental results
  Sequential Simulator
    Sequential simulator mechanism
    Hazards in sequential simulators
    Instrumentation
    Big multiplier results
    Small multiplier results
    Circuit topology vs. activity level
    Hybrid possibilities
7 Hybrid Simulators
  Coordinated Sequential Simulator Hybrid
    The algorithm
    Sorting with a different key
    The simulator mechanism
    The simulator output
    Expectation
    Experimental results
  Progressive Hybrid Simulator Hybrid
    The mechanism
    Experimental results
8 Additional Performance Results
  D Clock Network
    Description
    Sweep-mode results
    Real-mode results
  Tree-Ring Example
    Description
    Simulation results
  FIFO Loop
    Description
    Simulation results
9 Summary
  Economy and Performance of a Multicomputer
  Overhead and Latency
  Fine-Grain Multicomputer Programming
  The Next Frontier
Bibliography
List of Figures
Block diagram of a multicomputer
Possible behavior of a reactive process
Representation of a process
Operation of a Reactive-C kernel
Specification of the factorial process
The divide step
The combine step
Mapping a binary tree to a multicomputer
Process structure comparison
Structure of a l_layer message buffer
An example of a FIFO queue
Expansion steps in the mergesort program
Giving away a list for the third time (stack grows up)
Getting an out-of-sequence reply
Structure of a channel in a channel-based CSP implementation
Control flow for heavyweight processes
Control flow for lightweight processes
Elements of a computation
A process group
Partitioning into two parts
A multicomputer shared by two users
Host message-system implementation
Cosmic Environment with unified resource management
Representation of a system
Representation of a system composed of elements
Closing a system into a closed graph
Arc source and destination
Element inputs and outputs
Arcs a form a path of length
Arcs a form a circuit of length
Example of a knot-containing system
A circuit to evaluate satisfiability of a set of clauses
Mapping equations into physical system
Element-simulator operation for an element with a nonzero delay
Element-simulator operation for an element with a zero delay
A system that contains all three types of slices
Representation of an arc
Replacing tape by messages
Example of deadlock in an event-driven simulation
Model of a sequential simulator
A researcher submitting a grant
Comparison between three simulators
An example of a continuous system
A logic circuit whose behavior is different from its Boolean network
A number of circuit simulators and their relationship
Domain of the generic simulator model
Process structure and a simple example of connectivity
A sample circuit and a possible mapping to a multicomputer
Structure of a sweep-mode simulation
Structure of a real-mode simulation
Three phases of the oscillating multiplier
A gate multiplier, sweep-mode
A circuit containing a dynamic hazard condition
A gate multiplier for s on an iPSC
A gate multiplier for s on an iPSC
Combining the iPSC and iPSC graphs with sequential timing aligned
A gate multiplier for s on a Symult
A gate multiplier for s on an iPSC
A gate multiplier for s on an iPSC
A gate multiplier for s on a Symult
Effect of increased latency on simulation performance
A gate multiplier for s on a Symult, fast oscillation
Modified Laffer Curve
An event that invalidates another event
Layering in the hybrid simulator
Expected performance of the hybrid simulator
A gate multiplier for s on a Symult
A gate multiplier for s on a Symult with random placement
A faster oscillating gate multiplier for s on a Symult
A gate multiplier for s on a Symult
A gate multiplier for s on a Symult with random placement
A faster-oscillating gate multiplier for s on a Symult
A gate multiplier for s on a Symult
A FIFO consisting of units
A C-element FIFO consisting of units
A array of self-oscillating FIFO units
Sweep-mode CMB-variant simulation of an gate clock network
An gate clock network for s on a Symult
An gate clock network for s on a Symult
A gate clock network for s on a Symult
A gate clock network for s on a Symult
A gate clock network for s on a Symult
A unit tree ring
A to pulse-distributor circuit
A gate tree network for s on a Symult
A gate tree network for s on a Symult
An gate tree network for s on a Symult
An gate tree network for s on a Symult
An gate tree network for s on a Symult
Circuit for one latch
Sweep-mode CMB-variant simulation of an gate FIFO loop
An gate FIFO loop for s on a Symult
An gate FIFO loop for s on a Symult
A gate FIFO loop for s on a Symult
A gate FIFO loop for s on a Symult
Two idealized multicomputer evolution paths
Multicomputer cost space
Intersection with A plane
Intersection with B-plane
Two idealized multicomputer evolution paths in the path space
List of Program Listings
Kernel of Reactive-C programming environment
Reactive-C factorial program
Factorial main program
Heavyweight factorial program
Program fragments for mapping a binary tree to a multicomputer
The carrier program for building FIFO
The mergesort program
An incorrect implementation of the C read function
A correct implementation of the C read function
Three representations of in double-precision floating-point-number format
Three layouts of a structure in order of increasing byte address
Structure of a FRAGMENT
An inverter in a CMB-variant simulator
An XOR-gate in a CMB-variant simulator
An OR-gate in a CMB-variant simulator
CMB-variant QUEUEFRAGMENT function
CMB-variant TRIMFRAGMENT function
CMB-variant OUTPUT function
CMB-variant main loop
CMB-variant indefinitely-lazy main loop
CMB-variant demand-driven main loop
CMB-variant main loop as a lightweight process
Sequential-simulator event structure
An inverter in sequential simulator
The SENDEVENT function in sequential simulator
An OR-gate in sequential simulator
Sequential-simulator main loop as a lightweight process
A SENDEVENT function that reduces glitches
Hybrid main loop
Hybrid embedded message system
Generic logic-gate handler for hybrid
Hybrid main loop
Chapter 1 Introduction
Section 1.1 Motivation
Advances in applications, programming methods, and computer architectures are
inextricably intertwined. Architectures and programming methods develop in response to
demands from applications; they also give rise to new applications. Simulation is an
application that contributes to and benefits from the development of faster and more
economical computers. Discrete-event simulation can produce a broad variety of interaction
patterns and timing relationships; it is therefore a model application for the study of
multicomputers and reactive-process programming. This research is a study of both
reactive-process programming and distributed discrete-event simulation on multicomputers.
Figure: Block diagram of a multicomputer. N computing nodes (C) are attached to a
communication network.
A multicomputer (see the block diagram above) is composed of a collection of node
computers connected to each other via a message-passing network. Multicomputers can be
divided into three categories by their node size:
Category       Node Size       Memory per Node   N    Examples
Coarse-grain   cabinet         MB                     Network of supercomputers
Medium-grain   circuit board   MB                     iPSC, NCUBE, Symult
Fine-grain     chip            KB                     Mosaic
Each node has its own private memory that is not directly accessible by other nodes, and
each node can contain multiple processes. Processes on different nodes run asynchronously;
processes within a single node are interleaved to produce the same effect as if they were in
different nodes. Communication between processes is performed via message passing.
Section 1.2 History
Simulation and programming have long influenced each other. Although one can argue that
every computation is in fact a simulation of some physical or abstract process, the first
effort to provide a programming system for discrete-event simulation was the development of
Simula, which was based on the Algol programming language. Discrete-event simulation
operates on a system of components (physical processes) that interact by discrete actions.
Structured languages such as Algol permit the modular representation of these components.
As such languages became available, discrete-event simulation techniques began to emerge
from the traditional event-list-oriented simulation techniques. Each Simula module contains
its own set of private data and procedures, and is in effect a process that interacts with
others to perform a simulation.
Although it was initially conceived as a simulation language, Simula became a
general-purpose, object-oriented, multiple-process programming language. The assimilation of
object-oriented and multiple-process programming concepts led to the development of CSP,
Smalltalk, and other systems that are more closely identified with programming.
Although Smalltalk was created to make programming simple, its programming model also
gave it the potential for concurrent operation of its objects. CSP was created to study
and unify diverse distributed programming constructs by using concurrent processes and
synchronous messages. Smalltalk and Simula are both object-oriented systems; CSP
includes the concept of independent interacting processes without the distraction of such
object-oriented concepts as inheritance.
Multicomputer implementations for variants of Simula and Smalltalk were shown
to be feasible and useful. Occam, a CSP variant with static interprocess communication
graphs, provided a programming system for transputer-based multicomputers. However,
most commercial multicomputers do not use language derivatives as their basic programming
system, because the concepts of multiple-process programming also appear in operating
systems. Interprocess synchronization and communication capabilities became common in
such popular operating systems as UNIX. Although UNIX began as a system with simple
file locks and data streams, it evolved into one in which both servers and clients abound,
and whose processes are capable of complex interaction with other processes, either on the
same machine or on other machines via computer networks. Thus, when medium-grain
multicomputers with PC-sized nodes became available, the conventional process model of
multiprogramming operating systems was used.
These machines use generic sequential programming languages such as C, Fortran,
Lisp, and Pascal. Codes written in these languages compile into independent programs that
are run in the nodes as processes. These processes interact with each other by calling library
functions that send and receive messages. The model of a conventional operating system is
chosen because the sequential programming languages are adequate for most applications,
and also because object-oriented languages and others, such as Lisp and Prolog, can be
implemented easily on such systems. Program objects are represented by processes and
embedded processes.
Early experiments in distributed discrete-event simulation were done by Mani Chandy
and Jay Misra and, independently, by Randy Bryant. These approaches were seen
as variants of event-list-based sequential simulation algorithms in which synchronization
is accomplished by message passing. Although the degree of synchronization that exists in
most sequential simulators can be relaxed when a simulation is distributed, extra work, or
overhead, is required to maintain the necessary synchronization. Such simulators are called
conservative simulators because the processes do not perform speculative computations.
The speculative (optimistic) approach was developed by David Jefferson to improve
the performance of simulations for the medium-grain multicomputers. His research on the
Time Warp simulator resulted in a general-purpose programming system called the Virtual
Time System. The idea was to save the state of a process whenever the process encounters
a synchronization point; then, instead of blocking the process until the synchronization is
complete, to have the process select a possible outcome and continue to execute. When
the synchronization is finally complete, if the outcome differs from the selected outcome,
the process and all those that it has since affected are rolled back, and process execution
restarts at the synchronization point.
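The save, guess, and roll-back cycle described above can be sketched for a single process as follows. This is an illustrative sketch only, not the Time Warp implementation: the types `state_t` and `spec_process_t` and the functions `step`, `speculate`, and `resolve` are hypothetical names, the process state is reduced to two integers, and the cascading rollback of other affected processes is not modeled.

```c
/* Process state, reduced to two integers for illustration (an assumption). */
typedef struct { int clock; int value; } state_t;

typedef struct {
    state_t current;   /* live state of the process */
    state_t saved;     /* snapshot taken at the synchronization point */
    int guessed;       /* outcome assumed while speculating */
    int speculating;   /* nonzero between speculate() and resolve() */
} spec_process_t;

/* Hypothetical unit of work that consumes the synchronization outcome. */
void step(state_t *s, int input) {
    s->value += input;
    s->clock++;
}

/* At a synchronization point: save the state, pick an outcome, keep running. */
void speculate(spec_process_t *p, int guess) {
    p->saved = p->current;
    p->guessed = guess;
    p->speculating = 1;
    step(&p->current, guess);
}

/* When the real outcome arrives: keep the speculative work if the guess was
   right; otherwise restore the snapshot and re-execute from the
   synchronization point. Returns 1 if a rollback occurred, 0 otherwise. */
int resolve(spec_process_t *p, int actual) {
    int rolled_back = 0;
    if (p->speculating && actual != p->guessed) {
        p->current = p->saved;      /* roll back to the saved state */
        step(&p->current, actual);  /* restart with the true outcome */
        rolled_back = 1;
    }
    p->speculating = 0;
    return rolled_back;
}
```

A correct guess costs nothing beyond the saved snapshot, while a wrong guess discards the speculative step and repeats it with the actual outcome. In a full system the messages sent during speculation would also have to be undone.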
Methods for reducing overhead were studied intensively because nodes in a medium-grain
multicomputer are few and expensive. However, as multicomputers evolve toward
their next incarnation, the fine-grain multicomputers, nodes become abundant and
cheap. With a myriad of single-chip nodes, fine-grain multicomputers promise significantly
better cost-vs.-performance ratios and total computing capacity than do the medium-grain
multicomputers. The Mosaic C, currently being developed at Caltech, is an example of
a fine-grain multicomputer. While each node of the Mosaic C contains a -bit CPU, a
message router, and only Kbytes of RAM, the entire Mosaic C will contain K nodes.
A number of negrain reactiveprocessbased programming languages have been devel
oped in anticipation of the negrain multicomputer  Among them is the Cantor notation
which most strongly in
uenced the programming methods used in this research  Can
tor is being developed by W C  Athas  using a model similar to the Actor notation
  Reactiveprocess programming systems are similar to CSP but impose additional con
straints on the operation of the processes in order to simplify the operating systems of
the negrain multicomputers  Cantor also allows us to express programs in nely divided
objects that are distributed over many small nodes 
The inversion of the cost ratio between the processor and the memory forms a new set
of ground rules for multicomputer programming. The shifting focus has strong implications
for programming in general: The memory, rather than the processor, is now the scarce
commodity. Programming techniques that buy speed by using a large number of idle memory
cells are no longer favorable, but ones that buy speed by using idle processors are. Instead
of trying to have something useful happen in every available CPU cycle in the machine,
application writers should now focus on extracting as much concurrency as possible from
the application.
In this experiment, the concept of fine-grain reactive-process programming influenced
simulation. The overhead that prompted the development of optimistic approaches for
medium-grain multicomputers was recast in a more benign role. Having this overhead
merely required the use of a larger number of inexpensive processors in the multicomputer,
and did not reduce the amount of concurrency that could be extracted from the system being
simulated. A programming system similar to Cantor was developed for this research, and
a number of conservative simulators suitable for fine-grain multicomputers were developed.
Section 1.3 Outline
Since this research is a study of both programming and simulation, this thesis is divided into
two major parts. Chapters 2 through 4 deal with programming, and the chapters from
Chapter 5 onward deal with simulation. The two parts are only loosely interdependent, and
do not reflect the extensive two-way influence that exists between simulation and
programming. For example, the lazy-evaluation model of simulation guided us in the design
of the x_primitives, which are the message-handling functions of our reactive-process
programming system; and the support mechanisms in the simulator were modeled after the
mechanisms of the Reactive-C programming system.
Chapter 2 introduces reactive-process programming and the Reactive-C implementation
of its basic mechanisms. Reactive-C is merely the ordinary C programming language used
with a particular programming discipline. It is useful for exposing the simplicity of
reactive-process programming systems, a level of simplicity that is necessary for any
programming system for fine-grain multicomputers. It is not the best tool, however, for
studying reactive-process programming. Therefore, a slightly higher-level programming
system is used in Chapter 3 to demonstrate the generality and simplicity of reactive-process
programming. Chapter 4 describes the Cosmic Environment, a programming environment that
embodies the reactive-process programming discipline.
The discussion of simulation begins in Chapter  with the model of simulation. The
subject system being simulated is recursively defined to be a collection of interacting systems
or elements, and elements are simulated by a set of simulators that interact by message
passing. The condition for progress is discussed in detail, a generic simulator is described,
and the derivation of a variety of simulators is shown. Chapter  describes a direct
implementation of the generic simulator using the Reactive-C notation. Logic circuits are the
subject of choice because they are diverse and because they expose properties of the
simulators by imposing few processing requirements of their own. The performance we observed
is shown to be that which was expected: The time required for a simulation decreases
linearly as the number of computing nodes increases. Comparing the performance to the
sequential simulator shows that the overhead does not interfere with the ability to utilize
the concurrency available in the system. Chapter  introduces new simulators that do not
have an overhead when only one node is used. However, the speed increase is no longer
linear: Performance converges to that of the previous simulator as more nodes are used.
Although only one test circuit was used throughout these two chapters, additional results
on a few other circuits are presented in Chapter . The results are all similar even though
the circuits being simulated are quite different.
Finally, Chapter  defends the rationale for simulation on fine-grain multicomputers
and discusses some of its implications on programming and simulation.
Chapter  Reactive-Process Programming
Reactiveprocess programming is a discipline in which processes are inactive until they are
triggered by inputs When suitable inputs are present a process and its inputs will react
in a single atomic action in which the inputs are consumed Reactiveprocess programs can
be written in specically designed notations such as Cantor they can also be written in
vanilla notations such as C Although Cantor hides many rough edges to make programming
simpler C is perhaps better in exposing the mechanics of reactiveprocess programming We
will use C for our discussion and assume that readers are familiar with C
A reactiveprocess program can be written as a simple combination of data structure
and function as a fulledged heavyweight process with its own process context or as a
complex multitasking operating system The diversity arises from a small and elegant set
of properties that allows reactiveprocess programming systems with very dierent capa
bilities to be built on top of one another in a consistent manner Since the tailoring of a
programming system to specic requirements is made simple an application no longer has
to be twisted around the system instead the system can be crafted to suit the intrinsic
needs of the application
In this chapter we will describe reactiveprocess programming in its simplest form the
next chapter will be devoted to examples of building morecomplex programming systems
on top of simpler ones
Section  Definition of a Reactive Process
A reactive process can be characterized by its two run-states:
Waiting: While a process is waiting, it is completely inert. The process will remain
in the waiting state as long as there is no message ready for it to receive;
otherwise, the process will be run, taking the earliest-arriving message as its
input.
Running: While a process is running, it cannot receive any more messages. A process
can run for only a finite period of time before it returns to the waiting state.
While a process is running, it can
(a) modify its internal state,
(b) send messages,
(c) instantiate other processes, or
(d) self-destruct.
Message buffers remain attached to a process until they are explicitly released
by the process.
Figure  Possible behavior of a reactive process (process state plotted against time: the
process alternates between wait and run as messages arrive)
The reactiveprocess programming environment has these additional properties
 Processes do not exist until they are instantiated
	 Processes persist until they selfdestruct
  March  
Section  ReactiveC Programming System  
  Each process has a unique process ID
 Messages are addressed by the destinationprocess ID
 Message order between any pair of processes is preserved
 Messages not immediately consumed are queued
 Messages with a valid destination will eventually be delivered
 Message bu	ers are allocated by calling an allocate function

 Message bu	ers can be released by calling either a deallocate or a send function
Section  Reactive-C Programming System
ReactiveC is a minimalist implementation of a reactiveprocess programming environment
using the C programming language As shown in Figure  a process in ReactiveC is
represented by a process structure that includes two pointers a function pointer and a
data pointer The function pointer references a C function the current entry function of
the process The entry function is called when a process is run
Figure  Representation of a process (the process structure holds an entry pointer, which
references the current entry function of the process within a set of functions, and a data
pointer, which references the process data)
The data pointer references an arbitrary data structure maintained by the process.
Both the data structure and the two pointers are state variables of the process that owns
them, and the process can modify them at any time while it is running. When a process
starts to run, the triggering message and the process structure are passed to the entry
function as function arguments. A process returns to the waiting state by returning from
the entry function.
Figure  Operation of a Reactive-C kernel (the kernel takes a message from the message
queue, identifies the receiving process, and calls the entry function of that process)
Listing  is a sample kernel loop of the Reactive-C programming environment. As
shown in Figure , the kernel repeatedly gets a message from the message queue, identifies
the receiver, and calls the entry function of the receiving process.
  kernelloop
 
 char mesg	

 PROC proc	
 while 
 
 mesg  getmessage	
 proc  identifyprocessmesg	
  procentryprocmesg	
   
  
Listing  Kernel of ReactiveC programming environment
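The kernel loop can be exercised on a single processor with very little supporting code.
The following is a minimal sketch, not the kernel used in this work: the names PROC, ITEM,
toy_send, toy_kernel_loop, and demo_kernel are hypothetical, the queue is a small fixed-size
FIFO, each queue entry carries its destination directly instead of calling a separate
identify step, and the loop stops when the queue is empty instead of running forever. It
shows a process changing its own entry pointer between messages.

```c
#include <assert.h>
#include <stddef.h>

/* All names here are hypothetical; this is an illustration only. */
typedef struct proc PROC;
typedef void (*ENTRY)(PROC *self, int *mesg);

struct proc {
    ENTRY entry;   /* current entry function of the process */
    void *data;    /* state owned by the process */
};

typedef struct { PROC *dest; int *mesg; } ITEM;

#define QSIZE 16
static ITEM queue[QSIZE];
static size_t q_head = 0, q_tail = 0;   /* FIFO, so arrival order is preserved */

static void toy_send(PROC *dest, int *mesg)
{
    assert(q_tail - q_head < QSIZE);    /* the toy queue must not overflow */
    queue[q_tail++ % QSIZE] = (ITEM){ dest, mesg };
}

/* Dispatch until no message is left (a real kernel would keep waiting). */
static void toy_kernel_loop(void)
{
    while (q_head != q_tail) {
        ITEM it = queue[q_head++ % QSIZE];
        it.dest->entry(it.dest, it.mesg);   /* run; returning means waiting */
    }
}

static void add_next(PROC *self, int *mesg)
{
    *(int *)self->data += *mesg;        /* every later message is added */
}

static void store_first(PROC *self, int *mesg)
{
    *(int *)self->data = *mesg;         /* first message initializes the sum */
    self->entry = add_next;             /* change behavior for later messages */
}

int demo_kernel(void)
{
    int total = 0, a = 10, b = 20, c = 12;
    PROC p = { store_first, &total };

    toy_send(&p, &a);
    toy_send(&p, &b);
    toy_send(&p, &c);
    toy_kernel_loop();
    return total;
}
```

Because the process never asks for a message, all scheduling decisions stay in the loop;
the process only decides which function will handle its next input.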
Listing  contains an example of a reactive-process program that computes a factorial in
logarithmic time on an arbitrarily large machine.
    typedef struct { REF ID; int HI, LO; } FACDATA;

    fac1(proc, mesg)
    RCPROC *proc; FACDATA *mesg;
    {
        FACDATA *mesg1;
        int half;

        if (mesg->HI <= mesg->LO)
        {
            rcsend(mesg->ID, mesg);          /* reply value is already in LO */
            rcexit();
        }
        else
        {
            half = (mesg->HI + mesg->LO) / 2;

            mesg1 = (FACDATA *) rcmalloc(sizeof(FACDATA));
            mesg1->ID = rcmyid();
            mesg1->HI = mesg->HI;            /* upper half of the interval */
            mesg1->LO = half + 1;
            rcspawn(fac1, mesg1);

            mesg1 = (FACDATA *) rcmalloc(sizeof(FACDATA));
            mesg1->ID = rcmyid();
            mesg1->HI = half;                /* lower half of the interval */
            mesg1->LO = mesg->LO;
            rcspawn(fac1, mesg1);

            proc->data = (char *) mesg;      /* keep the request buffer */
            proc->entry = fac2;              /* wait for the first reply */
        }
    }

    fac2(proc, mesg)
    RCPROC *proc; FACDATA *mesg;
    {
        ((FACDATA *) proc->data)->LO = mesg->LO;
        rcfree(mesg);
        proc->entry = fac3;                  /* wait for the second reply */
    }

    fac3(proc, mesg)
    RCPROC *proc; FACDATA *mesg;
    {
        ((FACDATA *) proc->data)->LO *= mesg->LO;
        rcfree(mesg);
        rcsend(((FACDATA *) proc->data)->ID, proc->data);
        rcexit();
    }
Listing  Reactive-C factorial program
The three functions in Listing , fac1, fac2, and fac3, are in a suitable form for
entry functions because their arguments are the process structure and the input message,
and because they are assured to return in finite time. However, they do not represent actual
processes; they are merely message-handling functions for processes that reference them by
their entry pointers.
Let a factorial process be a process that references any of the three functions. Initially,
a factorial process waits for a message whose structure is defined by the C data structure
called FACDATA. The message is called a FACDATA message.
Figure  Specification of the factorial process (the caller sends a message carrying ID, LO,
and HI to fac, and fac returns the result to the caller in a message of the same form)
    typedef struct { REF ID; int LO, HI; } FACDATA;
ID   Data structure containing the caller's process ID
LO   Low end of a number range
HI   High end of a number range
After receiving the message (Figure ), the factorial process computes the product
of all integers within the closed interval [LO, HI]. The factorial process stores the product
in the LO field of another FACDATA message, which is returned to the requester. Thus,
sending a FACDATA message with a 1 in the LO field to the factorial process will cause the
factorial of HI to be computed.
To compute the factorial of a value, the requesting process (caller) instantiates a new
process whose entry pointer contains the address of the fac1 function. We shall call this
new process the fac1 process. The factorial is computed by a divide-and-conquer method
that iterates using the difference between HI and LO.
 ifmesg	
HI  mesg	
LO
When the fac process receives its rst message it compares the two ends of the
interval described in the message If HI equals LO then there is only one integer in the
interval If HI is  therefore less than LO which must be  at this point then the factorial
of  is to be computed In either case the correct reply value is equal to the number already
contained in LO
   rcsendmesg	
IDmesg
  rcexit
Therefore, when LO >= HI, the message is bounced back to the caller untouched. The
rcsend function called in line  causes the message buffer mesg to be sent to the process
whose ID is mesg->ID, which is in this case the ID of the caller. Since rcsend dissociates
the message buffer from the process, the process does not have to release it explicitly before
the process is terminated by calling the rcexit function.
    half = (mesg->HI + mesg->LO) / 2;
Figure  The divide step (a fac process creates two more fac processes, one for each half
of its interval)
If HI is greater than LO, the fac1 process computes a midpoint that divides the
interval into two smaller intervals. Two more fac1 processes are created to work on these
two intervals (Figure ). These processes are called the siblings of this process, and an
initialization message is sent to each sibling as it is created.
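The arithmetic of the divide step can be checked independently of any message passing.
The following sketch is a sequential rendering only: range_product is a hypothetical name,
ordinary recursion stands in for process creation, and the midpoint is assumed to be the
integer average of the two ends. The base case returns LO unchanged, which covers both the
single-integer interval and the request for the factorial of 0.

```c
#include <assert.h>

/* Product of all integers in the closed interval [lo, hi]. */
unsigned long range_product(unsigned long lo, unsigned long hi)
{
    if (hi <= lo)
        return lo;                      /* reply value is already in LO */

    unsigned long half = (hi + lo) / 2; /* assumed midpoint rule */

    /* Upper half [half + 1, hi] and lower half [lo, half]. */
    return range_product(half + 1, hi) * range_product(lo, half);
}
```

Since each call halves the interval, the recursion depth grows with the logarithm of
HI - LO, which is the source of the logarithmic running time when every call is a
separate process on its own node.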
    mesg1 = (FACDATA *) rcmalloc(sizeof(FACDATA));
Message buffers are allocated by the rcmalloc call. The function rcmalloc has the
same semantics as the malloc function in C. Depending on the implementation, rcmalloc
can be identical to C malloc, can be built on top of C malloc, or can be an entirely different
allocator that gets space from a dedicated memory region.
    mesg1->ID = rcmyid();
    mesg1->HI = mesg->HI;
    mesg1->LO = half + 1;
After a message buffer has been allocated, it is filled with data to be sent to a sibling.
Lines  are for the sibling that handles the upper half of the interval. The rcmyid
function returns the ID of the process. The process becomes the caller of its siblings after
its ID has been stored and sent in the ID fields of the initialization messages. The fac1
process will receive one reply from each of its siblings. When two replies are received, the
process multiplies the values contained in their LO fields and returns the product to its own
caller.
   rcspawnfacmesg 
Processes are created with the rcspawn function call. At line , a new process
structure is created; the entry pointer of the new process is initialized to reference the
function fac1 (first parameter to the rcspawn function), and the message mesg1 (second
parameter to rcspawn) is sent to the new process as its first input message.
    proc->data = (char *) mesg;
    proc->entry = fac2;
Figure  The combine step (the replies from the two siblings are combined by the process
that created them)
The process must now return from the fac1 function in order to wait for the replies
from its siblings (Figure ). The process will send its reply using the same message buffer
that it received; to avoid losing the reference to that message buffer, it stores the
buffer address in the data pointer of its process structure. Furthermore, since the process
is now waiting for a reply message instead of a factorial request message, the entry pointer
is changed to reference the function that handles the first reply message. By storing the
address of the fac2 function into the entry field, the fac1 process becomes a fac2
process. The process then returns from the fac1 function to indicate that it is going back
to the waiting state.
  facprocmesg
  RCPROC 	proc
 FACDATA 	mesg

  
  FACDATA 	procdataLO  mesgLO

  rcfreemesg

 procentry  fac 

 
The fac process waits for the rst reply message When it arrives its reply value is
simply copied into the LO eld of the original message buer since the process needs a value
from each reply before the product can be computed The reply message buer from the
sibling is no longer needed and is released by calling rcfree The process then becomes a
fac process
 FACDATA 	procdataLO 	 mesgLO

 rcfreemesg

 rcsendFACDATA 	procdataID procdata

 rcexit

When the fac process gets the second reply message the returned value is multiplied
into the LO eld of the original message buer The reply message buer is also freed The
original message buer now containing the product of the two reply values is sent back to
the caller Lastly the process terminates by calling rcexit
Listing  is a sample program that calls the factorial program. It waits for an
input number, computes the factorial of the input number, prints the factorial, and then
terminates.
 rcmainprocmesg
 RCPROC 	proc

  char 	mesg

 
 int hi

 FACDATA 	mesg

 rcfreemesg

  March  
   Chapter  ReactiveProcess Programming
  printfEnter number  scanfd	
hi
  mesg  FACDATA  rcmallocsizeofFACDATA
  mesgID  rcmyid
  mesgHI  hi
  mesgLO   
  rcspawnfac 	mesg
  procentry  mainreply
  
  mainreplyproc	mesg
 RCPROC proc FACDATA mesg
 
 printfdn	mesgLO rcfreemesg rcexit
 
Listing  Factorial main program
The basic ReactiveC primitives are summarized below
char rcmalloc Allocates a message buer
rcfree Releases a message buer
rcsend Sends and releases a message buer
REF rcmyid Returns the ID of the calling process
rcspawn Instantiates a new process
rcexit Terminates the calling process
Deliberately omitted from the list is a function that receives a message. In Reactive-C, a
message is implicitly requested when a process is created or when a process returns from its
entry function. The request is fulfilled when its current entry function is called. The other
unusual aspect of the Reactive-C primitives is that rcspawn does not return the ID of the
new process; thus, the only direct way for a parent process to get the ID of the sibling is to
receive the ID from a message sent by the sibling.
ReactiveC is a minimalist reactiveprocess programming system 	The kernel code for
a singleprocessor system is only 
 lines long Since the parent process can always send
its ID to the sibling during spawn and since the sibling can always send its ID back to its
parent via a message it is not necessary for the spawn function in a minimalist system to
return an ID The goal of ReactiveC is to create a system that is minimal but that is not
  March  
Section  ReactiveC Programming System   
necessarily easy on the programmer However a close relative of the ReactiveC turns out to
be well suited for writing eventdriven simulators Another derivative the Reactive Kernel
proves to be very useful in implementing the inner kernel and the handlers of multicomputer
operating systems Details of the Reactive Kernel can be found in the Masters thesis of
Jakov Seizovic 
ReactiveC is strongly in	uenced by the Cantor programming language which is a

negrain reactiveprocess programming system in which process spawning uses futures to
immediately return the sibling ID The properties and programming paradigms related
to 
negrain reactiveprocess programming are explored in detail the Doctoral thesis of
WC Athas 
In the next chapter, we will focus on the universality of reactive-process programming, a
property that is best illustrated using full-fledged coarse-grain reactive processes. Although
we will be leaving the Reactive-C environment for now, we should bear in mind that a duality
exists between a Reactive-C process and its heavyweight counterpart. What is applicable for
one is equally applicable for the other. Heavyweight programs are used for the remainder
of our discussion because they are simpler to describe.
Universality of a programming system requires the programming system to efficiently
support a large variety of other programming systems. Layering, or the implementation
of new functions on top of basic functions, is the principal means by which universality is
achieved.
Chapter  Reactive-Process Layers
In contrast to a lightweight Reactive-C process, which has only a function and a data
structure, we can generally consider a heavyweight process to be one that, although its
structure is machine dependent, has its own code, data, stack, and thread of control. We
can run heavyweight reactive processes under the Reactive-C programming environment
with minimal overhead by using a dedicated lightweight reactive process called a handler.
In one possible arrangement, the data pointer of a handler references a table containing
three segment pointers for the code, data, and stack segments, and a context structure
containing the frozen records of a suspended heavyweight process. When a message is
received by a handler, the entry function for the handler performs a context switch to
resume the execution of the heavyweight process. When the heavyweight process calls a
receive function, it saves the process context, restores the system context, and returns to
the handler. The handler returns from its entry function to request a new message.
In this manner, the combination of the heavyweight process and its handler appears to
the kernel as an ordinary Reactive-C process. The cost of supporting a heavyweight process
under a handler, as opposed to supporting it under the kernel, is no more than one extra
level of function call. A handler for a heavyweight process is an example of layering. A
handler that supports multiple heavyweight processes is used in the Reactive Kernel node
operating system for running normal user processes.
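The context switch between a handler and its heavyweight process can be imitated in
portable user-level code. The sketch below is only an illustration and assumes a system
that provides the <ucontext.h> context functions (for example, Linux with glibc); the names
hw_recv, hw_main, handler_entry, and demo_handler are hypothetical, there is a single
process, and a message is just a pointer to an integer. The heavyweight program is written
as straight-line code with two blocking receives, yet the handler sees one function call
per message.

```c
#include <assert.h>
#include <stddef.h>
#include <ucontext.h>

static ucontext_t system_ctx;         /* context of the handler side */
static ucontext_t process_ctx;        /* frozen context of the process */
static char process_stack[64 * 1024]; /* private stack of the process */
static const int *pending;            /* message handed over by the handler */
static int sum;                       /* result produced by the process */
static int finished;

/* Blocking receive: freeze the process and resume the handler. */
static const int *hw_recv(void)
{
    swapcontext(&process_ctx, &system_ctx);
    return pending;                   /* resumed: a message is available */
}

/* The heavyweight program: sequential code with its own thread of control. */
static void hw_main(void)
{
    int first = *hw_recv();
    int second = *hw_recv();
    sum = first + second;
    finished = 1;                     /* returning resumes uc_link */
}

/* Entry function of the handler: one context switch per message. */
static void handler_entry(const int *mesg)
{
    pending = mesg;
    swapcontext(&system_ctx, &process_ctx);
}

int demo_handler(int a, int b)
{
    finished = 0;
    getcontext(&process_ctx);
    process_ctx.uc_stack.ss_sp = process_stack;
    process_ctx.uc_stack.ss_size = sizeof process_stack;
    process_ctx.uc_link = &system_ctx;        /* where to go when hw_main ends */
    makecontext(&process_ctx, hw_main, 0);

    swapcontext(&system_ctx, &process_ctx);   /* run up to the first receive */
    handler_entry(&a);                        /* process blocks on receive two */
    assert(!finished);
    handler_entry(&b);                        /* process runs to completion */
    assert(finished);
    return sum;
}
```

The saved program counter and stack pointer inside process_ctx play the role that the
entry pointer and data pointer play for a lightweight process.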
Section  Simple Layers
 The bottom layer (b-layer)
As we did for Reactive-C, we shall establish the groundwork for the discussion of universality
and layering with an example. Listing  contains a heavyweight reactive-process program
that computes a factorial in the same manner as the Reactive-C example. We shall refer to
the programming system used in this example as the bottom layer, or b-layer.
    typedef struct { int pn, pp; int HI, LO; } FACDATA;

    main()
    {
        FACDATA *data;
        FACDATA *mesg = (FACDATA *) brecvb();
        FACDATA *mesg1;
        int half, k;

        if (mesg->HI <= mesg->LO)
        {
            bsend(mesg, mesg->pn, mesg->pp);
            exit();
        }
        else
        {
            half = (mesg->HI + mesg->LO) / 2;
            k = mypid() * nnodes() + mynode();

            mesg1 = (FACDATA *) bmalloc(sizeof(FACDATA));
            mesg1->pn = mynode();
            mesg1->pp = mypid();
            mesg1->HI = mesg->HI;
            mesg1->LO = half + 1;
            spawn("pfac", (2*k+1) % nnodes(), (2*k+1) / nnodes());
            bsend(mesg1, (2*k+1) % nnodes(), (2*k+1) / nnodes());

            mesg1 = (FACDATA *) bmalloc(sizeof(FACDATA));
            mesg1->pn = mynode();
            mesg1->pp = mypid();
            mesg1->HI = half;
            mesg1->LO = mesg->LO;
            spawn("pfac", (2*k+2) % nnodes(), (2*k+2) / nnodes());
            bsend(mesg1, (2*k+2) % nnodes(), (2*k+2) / nnodes());

            data = mesg;
        }
        {
            FACDATA *mesg = (FACDATA *) brecvb();    /* first reply  */
            data->LO = mesg->LO;
            bfree(mesg);
        }
        {
            FACDATA *mesg = (FACDATA *) brecvb();    /* second reply */
            data->LO *= mesg->LO;
            bfree(mesg);
            bsend(data, data->pn, data->pp);
            exit();
        }
    }
Listing  Heavyweight factorial program
A comparison between the Reactive-C example and the b-layer example reveals
numerous similarities. The three entry-function candidates are replaced by three program blocks;
each block is headed by a line that waits for and receives a message.
    FACDATA *mesg = (FACDATA *) brecvb();
Instead of messages being passed to it as function arguments, a b-layer process must
perform an explicit brecvb call to get a message. The brecvb call suspends the process
until a message arrives. The message is then returned to the process by the brecvb
function.
    typedef struct { int pn, pp; int HI, LO; } FACDATA;
A blayer process is identied by its node and pid pair rather than by just a REF value
There is no reason why it should not use the same singlevalue representation that Reactive
C uses except that heavyweight processes require better control over process placement
because they take up a great deal of memory Thus wherever ID was used it is replaced
with the node and pid pair
    k = mypid() * nnodes() + mynode();
    spawn("pfac", (2*k+1) % nnodes(), (2*k+1) / nnodes());
    bsend(mesg1, (2*k+1) % nnodes(), (2*k+1) / nnodes());
    spawn("pfac", (2*k+2) % nnodes(), (2*k+2) / nnodes());
    bsend(mesg1, (2*k+2) % nnodes(), (2*k+2) / nnodes());
Listing  Program fragments for mapping a binary tree to a multicomputer
Both bsend and spawn need node and pid as their arguments In order to give a
process better control over the placement of its siblings a process is allowed to dene the
node and pid of the new processes it creates The three program fragments shown in Listing
	 map a tree structure onto a multicomputer such that if the tree is balanced the number
of processes in any two nodes will di
er by no more than 
As shown in Figure  the tree is rst mapped to a linear array such that a process with
an ID of nodepid on a multicomputer with N nodes will have an index of k  pidN
	 node The two siblings of the process will have an index of 
k	 and 
k	
 respectively
The list is than folded into the multicomputer using the  and the  operators
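The index arithmetic of this mapping is easy to check in isolation. The sketch below is
an illustration with hypothetical names (PLACE, place_index, place_of, child_place,
max_imbalance); it assumes that the children of index k sit at 2k + 1 and 2k + 2 and that
the root has index 0, which is the usual array layout of a binary tree. The last function
counts how many processes of a complete tree land on each node and reports the largest
difference between any two nodes.

```c
#include <assert.h>

typedef struct { int node, pid; } PLACE;

/* Linear index of the process (node, pid) on a machine with n nodes. */
int place_index(PLACE p, int n)
{
    return p.pid * n + p.node;
}

/* Fold a linear index back onto the machine with the % and / operators. */
PLACE place_of(int k, int n)
{
    PLACE p = { k % n, k / n };
    return p;
}

/* Placement of child 0 (index 2k + 1) or child 1 (index 2k + 2). */
PLACE child_place(PLACE parent, int which, int n)
{
    assert(which == 0 || which == 1);
    return place_of(2 * place_index(parent, n) + 1 + which, n);
}

/* Largest difference in process count between two nodes when the first
   `total` tree indices (0 .. total - 1) are all in use. */
int max_imbalance(int total, int n)
{
    int count[64] = {0};
    assert(n > 0 && n <= 64);

    for (int k = 0; k < total; k++)
        count[k % n]++;

    int lo = count[0], hi = count[0];
    for (int i = 1; i < n; i++) {
        if (count[i] < lo) lo = count[i];
        if (count[i] > hi) hi = count[i];
    }
    return hi - lo;
}
```

Because consecutive indices go to consecutive nodes, a tree that fills its index range
without gaps spreads over the nodes like cards dealt in turn, so the counts can differ by
at most one.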
Figure  Mapping a binary tree to a multicomputer (tree indices k, 2k + 1, and 2k + 2 laid
out along the node and pid axes)
The functions mypid, mynode, and nnodes return the pid of the process, the node of
the process, and the number of nodes in the machine. The spawn function creates a process
whose program file name is specified in the first argument, and whose ID is specified in the
second and third arguments. The program file in this case is named pfac. The first process
to be spawned by the caller should have an ID of (0, 0).
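Figure  Process structure comparison (lightweight: the entry pointer names the next entry
function and the data pointer references the saved message; heavyweight: the entry pointer
names the context-switch function and the data pointer references the code, data, and stack
segment pointers together with the saved SP and PC, while the saved message is referenced
from the process stack)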
The equivalence between the lightweight and heavyweight processes is most obvious
when the process structures of the two factorial processes are compared at the time that they
are both waiting for their first reply message (Figure ). The lightweight factorial process
retains its message buffer in the data pointer of its process structure; the heavyweight
factorial process retains its message buffer in a pointer located on its program stack. The
lightweight factorial process specifies its next action with the entry pointer of its process
structure; the heavyweight factorial process specifies its next action with the program
counter stored in its context structure.
The basic blayer primitives can be summarized in the following list The set is minimal
given the decision that processes are allowed to directly control process placement
char bmalloc Allocates a message buer
bfree Releases a message buer
char brecvb Receives a message
bsend Sends and releases a message buer
int mynode Returns the node of the calling process
int mypid Returns the pid of the calling process
int nnodes Returns the number of nodes in the machine
spawn Instantiates a new process
exit Terminates the calling process
  The lengthcarrying layer llayer
We shall introduce the general concept of layering by a very simple example. We will create
a new set of functions, the l-layer functions, that are parallel to the b-layer functions, with
the exception that l-layer functions contain an additional function for accessing the length
of a message buffer. To store the length information, we will make each message buffer a
little larger than it needs to be, and store the length information in the extra space.
Figure  Structure of an l-layer message buffer (the header comes first, at the buffer address
seen by b-layer programs; the body follows, at the buffer address seen by l-layer programs)
That extra space is placed at the front of each message buffer and is called the header
of the message; the rest of the message is called the body. We can hide the header by
having l-layer functions work only with pointers to the body of the message. As a result,
the l-layer functions become a superset of the b-layer functions.
    typedef struct { int length; } HEADER;
    #define BODYOF(h) ((h) + sizeof(HEADER))   /* given header, find body */
    #define HEADOF(b) ((b) - sizeof(HEADER))   /* given body, find header */
The HEADER structure shown above defines the content of the header for an l-layer
message buffer. The only field in this header is an integer that contains the length of the
message body. In order to allow all data types in the message body, headers should normally
be padded to the maximum data alignment requirement of the hardware. In the interest of
simplicity, however, padding is neglected for our examples.
 char lmallocn	
 int n
 
 char p
  p  bmallocn 
 sizeofHEADER		
   HEADER 	 p	length  n
  returnBODYOFp		
  
  char lrecvb	  returnBODYOFbrecvb			 
The two functions that return message buffers, receive and allocate, call the
corresponding b-layer functions to get message buffers. When one is obtained, the pointer to
the body of the buffer is returned by the functions. In addition, the lmalloc function
stores the buffer length into the message header before it returns. Similarly, a function that
takes a message buffer as input has to locate the real beginning of the message buffer before
passing it to the corresponding b-layer function.
  lfreep	 char p  bfreeHEADOFp		 
  lsendpnodepid	
 char p
  int node pid
 
 bsendHEADOFp	 node pid	
 
 llengthp	
 char p
 
 returnHEADER 	HEADOFp		length	
 
This is the simplest application of layering; it does not change the message properties
in any way. By adding more fields to the header structure, we can just as easily include any
information that we would like to send along with a message, such as length of the message
buffer, message type, and sender node and pid.
   The nonblockingreceive layer nblayer
A process running in a reactive-process programming environment should not monopolize
the processor by running nonstop for long periods between receive calls, for if a process does
not call a receive function, other processes in the same node will not get a chance to run.
A conventional multitasking operating system makes scheduling fair by interrupting a
long-running process with a timer in order to wrest control away from a process. The same
thing can be done in a Reactive-C implementation of a heavyweight programming system by
treating a timer-interrupt mechanism as a process resource. A process therefore includes
an interrupt mechanism and an interrupt service routine. When a process is interrupted by
the timer, the interrupt service routine of the process calls a receive function to relinquish
control.
A timer interrupt is just one of the ways to make a process call a receive function
periodically. While a timer may still be needed as a backup mechanism to stop runaway
processes, the preferred method is to convert a nonreactive process into a reactive process
by having the process call a receive function periodically during extended computations.
Although the messages received may not be needed right away, they can always be queued
by the process until they are needed.
It is better for the process to be descheduled at choice points in the program, rather
than at arbitrary points selected by the timer. Choice points are places in a program where
much of the system resources used by the program, such as floating-point accelerators,
direct-memory-access units, and processor registers, are released by the process as a normal
part of the program execution. The amount of state information that needs to be saved
and restored when a program is stopped and restarted at a choice point is usually small
and can be reliably predicted during compile time.
Calling a receive function, either from a timer-interrupt handler or from a choice point,
presents a problem, however. A process that relinquishes control by calling a receive
function will not be restarted until a message is ready for it. As a result, a node can sit
idle with runnable processes suspended because there are no messages queued for them.
Furthermore, if a suspended process does not receive any more messages, it will remain
suspended indefinitely.
What we need is a receive function that does not block. This function can be
implemented by having the process send a uniquely identifiable message to itself just before it
calls a blocking receive function. We can create such messages by the same layering
mechanism that we used for message length. Let us prefix the new functions with nb, and let
us invent a new receive function, nbrecv. A call to nbrecv has the same effect as a call
to a normal receive function, except that in cases where a normal receive function would
block, nbrecv returns a null pointer. A nbrecv call may still return a null pointer at
other times, but it will always cause the process to release control first.
Below is a set of routines that implement the nb-layer functions. We will list only those
functions that are different in form from the l-layer functions. First of all, two private
variables are needed. The tokengot variable indicates whether a uniquely identifiable
token message has been previously allocated. The tokenmsg pointer contains the token
message if it is allocated and if the process is currently holding it; the pointer contains null
otherwise.
 typedef struct { int istoken; } HEADER;
 static int tokengot = 0;
 static char *tokenmsg = 0;

 char *nbrecv()
 {
   char *p;
   if(!tokengot) { tokenmsg = lmalloc(0); tokengot = 1; }   /* allocate the token once */
   if(tokenmsg)  { ((HEADER *) HEADOF(tokenmsg))->istoken = 1;
                   bsend(HEADOF(tokenmsg), mynode(), mypid());
                   tokenmsg = 0; }
   p = brecvb();
   if(((HEADER *) p)->istoken) { tokenmsg = BODYOF(p);      /* keep a body pointer, as lmalloc returns */
                                 return(NULL); }
   return(BODYOF(p));
 }
The rst thing that the nonblocking nbrecv does is to check for the existence of the
token message If the token message has not been allocated the function allocates it Next
the function checks to see if it is currently holding the token message If it is the function
sends the token message to itself so that a subsequent brecvb call is guaranteed to return
Lastly it calls brecvb to get a message If the message obtained is a token message the
token message is saved and null is returned Otherwise the message is returned
 char nbrecvb
  	
 char p
 p 
 brecvb
 whileHEADER  pistoken 	 tokenmsg 
 p p 
 brecvb 
 returnBODYOFp
 
 nbsendpnodepid
 char p
  int node pid
 	
 HEADER HEADOFpistoken 
 
 bsendHEADOFp node pid
 
The blocking nbrecvb waits for a nontoken message and returns that message when
it is received If a token message is received rst it is stored in tokenmsg and nbrecvb
continues to wait for the next message The nbsend function clears the token ag in the
message header before sending the message because it can only send ordinary messages
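The token mechanism can be tried out without a kernel. The following is a minimal sketch, not the thesis implementation: the kernel's input queue is replaced by a small in-memory ring, and the names Msg, sim_send, sim_recvb, holding_token, and sketch_nb_recv are hypothetical stand-ins introduced only for this illustration.

```c
#include <assert.h>

/* Hypothetical stand-in for one process's input queue (not the kernel API). */
typedef struct { int is_token; int value; } Msg;

#define QCAP 16
static Msg queue[QCAP];
static int q_head = 0, q_len = 0;

/* Append a message to the simulated input queue. */
static void sim_send(Msg m)
{
    assert(q_len < QCAP);
    queue[(q_head + q_len) % QCAP] = m;
    q_len++;
}

/* A real blocking receive would suspend the process on an empty queue;
   this sketch simply requires the queue to be non-empty. */
static Msg sim_recvb(void)
{
    Msg m;
    assert(q_len > 0);
    m = queue[q_head];
    q_head = (q_head + 1) % QCAP;
    q_len--;
    return m;
}

static int holding_token = 1;   /* the process starts out holding its token */

/* Non-blocking receive: returns 1 and fills *out with an ordinary message,
   or returns 0 when only the token came back. */
static int sketch_nb_recv(Msg *out)
{
    Msg m;
    if (holding_token) {        /* guarantee that the receive below returns */
        Msg token = { 1, 0 };
        sim_send(token);
        holding_token = 0;
    }
    m = sim_recvb();
    if (m.is_token) {           /* nothing else was ahead of the token */
        holding_token = 1;
        return 0;
    }
    *out = m;
    return 1;
}
```

On an empty queue, sketch_nb_recv returns 0 at once. If an ordinary message is queued ahead of the token, that message is returned and the token stays in the queue until a later call collects it.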
In order to improve eciency detection of token messages is ordinarily integrated into
the kernel so that the kernel can defer token messages until the input message queue is
otherwise empty The primary eect is that processes with pending nontoken messages are
favorably scheduled The side eect is that processes have a reliable method of determining
whether the input queue of the node is empty This special treatment of token messages
constitutes the basis for indenitelazy computation in distributed simulation This will be
discussed in a later section
  Handler layering
Running a heavyweight process inside a handler is an example of layering We can also run
a lightweight process inside a heavyweight process or a lightweight process inside another
lightweight process When each handler process controls just one reactive process the ID
of the handler is sucient to uniquely identify the process When there may be more than
one process inside a handler a secondary pid needs to be included in the message header
to distinguish them Examples of handler layering are the Reactive Kernel for heavyweight
processes and simulators for lightweight processes
 typedef struct { int pid; } HEADER;        /* message header */
 struct PROC ptab[MAXPID];                  /* process table */

 mainloop()
 {
   char *mesg;
   PROC *proc;
   while(1)
   {
     mesg = brecvb();
     proc = &ptab[((HEADER *) mesg)->pid];  /* find the real destination */
     (*proc->entry)(proc, BODYOF(mesg));    /* run its entry function */
   }
 }
Shown above is the main loop of a heavyweight process capable of handling more than
one lightweight process. The message functions resemble the l-layer functions, but with
the second pid, rather than the message length, in the message header. The heavyweight
process repeatedly calls brecvb to get a message, finds the real destination process by the
pid field, and calls the entry function of the process. If this program fragment looks
familiar, it is because this is the main loop of the Reactive-C kernel. The Reactive-C kernel
is itself a reactive-process program.
Although the denition of a reactiveprocess program is xed as stated in the beginning
of Chapter  certain properties of the programming system are implementationdependent
Handler layering provides a way of running a programming system with a dierent set of
properties on top of another programming system For example assume that we have a
  March  
   Chapter  ReactiveProcess Layers
programming system in which all messages to nonexisting processes are thrown away To
implement systems such as the Cantor runtime system messages to nonexisting processes
must be preserved Suppose we were to support Cantor by running a Cantor handler under
a reactive kernel As far as the kernel is concerned all messages will nd their destination
processes namely the Cantor handler processes When the handler gets a message the
message is beyond the jurisdiction of the kernel the handler can do any number of things
with it In particular the handler can queue messages for Cantor processes that have not
yet been created
Section  Message Type
It is convenient in many computations for a process to respond differently to different types
of messages. In the factorial examples, there are three types of messages: the message
from the parent, the first message to arrive from the siblings, and the second message to
arrive from the siblings. These messages do not have to be distinguished by type because
they are identified by their order of arrival. In the Reactive-C example, different responses
to different messages are specified by storing different function pointers into the process
structure after each message is received. In the b-layer version, the responses are specified
by the locations in the program where brecvb is called.
[Figure: An example of a FIFO queue. A chain of carrier processes with a head and a tail.]
In the next example however it is necessary to distinguish messages by type The
FIFO 	rstinrstout queue
 structure shown in Figure  can be constructed with the
chain of carrier processes described in Listing  The carrier processes are connected
into a singly linked list by the nextnode and nextpid variables in each process The
FIFO is accessed by a reference to the head carrier and a reference to the tail carrier
When an item is to be added to the FIFO the item is sent as a message to the tail of
the FIFO The process at the tail of the FIFO spawns a new carrier for the new item and
  March  
Section  Message Type   
returns the reference of the new carrier to the caller When an item is to be retrieved
from the FIFO a message is sent by the caller to the head of the FIFO The process at
the head of the FIFO sends its item and the reference of the next carrier to the caller
The process then removes itself from the FIFO Message types are needed because the two
commands  new item and retrieve item  can arrive in any order when a FIFO is
only one element long
 typedef struct { int type, value, node, pid; } REQMESG;

 main()
 {
   REQMESG *req;
   int value;
   int callernode, callerpid;
   int nextnode, nextpid;
   while(1)
   {
     req = (REQMESG *) brecvb();
     switch(req->type)
     {
       case ADDVALUE: spawnanywhere("carrier",&nextnode,&nextpid);
                      req->type = SETVALUE;
                      bsend(req, nextnode, nextpid);
                      break;
       case SETVALUE: value = req->value;
                      nextnode = INVALIDNODE;
                      nextpid = INVALIDPID;
                      callernode = req->node;
                      callerpid = req->pid;
                      req->node = mynode();
                      req->pid = mypid();
                      bsend(req, callernode, callerpid);   /* reply goes to the caller */
                      break;
       case GETVALUE: req->value = value;
                      callernode = req->node;
                      callerpid = req->pid;
                      req->node = nextnode;
                      req->pid = nextpid;
                      bsend(req, callernode, callerpid);   /* reply goes to the caller */
                      exit();
     }
   }
 }
Listing  The carrier program for building FIFO
When a carrier receives an ADDVALUE message it spawns another carrier and the
message is passed to the new carrier after its message type is set to SETVALUE 	
The spawnanywhere function will spawn the speci
ed process on some available node and
return the node and pid of the process in the nextnode and the nextpid variables	
When a carrier receives a SETVALUE message the process is the new tail process	
The value 
eld of the message is copied into the value variable of the carrier	 The next
reference of the carrier is initialized to a null ID	 The ID of the carrier is written into the
message and the message is returned to the caller 	 After the message is received
by the caller the callers tail reference is updated	
When a carrier receives a GETVALUEmessage its value and its nextcarrier reference
are copied into the message	 The message is sent back to the caller and the process exits
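The three message types can be exercised in a single address space. The sketch below is an analogy rather than the carrier program itself: processes become slots in a table, a message send becomes a function call, and the node half of each reference is omitted. Req, Carrier, spawn_carrier, deliver, and the ADD_VALUE, SET_VALUE, and GET_VALUE constants are hypothetical names for this illustration.

```c
#include <assert.h>

enum { ADD_VALUE, SET_VALUE, GET_VALUE };   /* message types */
#define INVALID_PID  (-1)
#define MAX_CARRIERS 8

typedef struct { int type, value, pid; } Req;      /* request and reply */
typedef struct { int value, next_pid; } Carrier;   /* one FIFO element */

static Carrier carriers[MAX_CARRIERS];
static int n_carriers = 0;

/* Stand-in for spawning a process: hands out the next free table slot. */
static int spawn_carrier(void)
{
    assert(n_carriers < MAX_CARRIERS);
    carriers[n_carriers].next_pid = INVALID_PID;
    return n_carriers++;
}

/* Deliver req to carrier pid and return the reply the caller would get. */
static Req deliver(int pid, Req req)
{
    Carrier *c = &carriers[pid];
    switch (req.type) {
    case ADD_VALUE:                 /* old tail: spawn and forward */
        c->next_pid = spawn_carrier();
        req.type = SET_VALUE;
        return deliver(c->next_pid, req);
    case SET_VALUE:                 /* new tail: store and report own ID */
        c->value = req.value;
        c->next_pid = INVALID_PID;
        req.pid = pid;
        return req;
    default:                        /* GET_VALUE: hand over value and successor */
        req.value = c->value;
        req.pid = c->next_pid;
        return req;
    }
}
```

A caller keeps a head and a tail reference, replaces the tail with the pid in each ADD_VALUE reply, and replaces the head with the pid in each GET_VALUE reply.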
	
Section  Discretion on Receive
Discretion on receive means allowing a process to select certain messages to consume while
deferring other messages. The Reactive-C, the b-layer, and other simple layered variants all
have the same message property in that they do not supply any mechanisms for discretion;
their processes have no choice but to take messages in the order they arrive. Discretion can,
however, be implemented inside a process.
 Discretion using b-layer functions
An example in which discretion is implemented in the program is a mergesort program in
which the list to be sorted is split recursively along the branches of a time-on-target tree
until every processing node in the machine is used. The machine should have a power-of-two
number of nodes to support this doubling approach.
At the beginning of the sort, the zeroth-generation process is created in a machine with
2^n nodes, and a list of numbers to be sorted is sent to the process as a message. The
zeroth-generation process then proceeds to fill the machine with processes in a total of n
expansion steps. In the kth expansion step, every process in the machine creates a new
kth-generation process, giving half of its list to the new process and keeping the other half
for itself. After n steps, there will be 2^n processes on the machine, each holding (1/2^n)th of
the original list.
[Figure: Expansion steps in the mergesort program]
The processes begin to sort their share of the list locally. When sorting is complete,
the expansion steps are reversed to merge the fragmented lists. In the kth merging step (k
decreasing), each kth-generation process sends its list back to its parent in a reply message.
After n steps, only the zeroth-generation process remains. The list that it now holds is the
sorted version of the original list.
When the process structure is fully instantiated, each kth-generation process has a
sibling for every generation number from k+1 to n. Since the computation is asynchronous,
returning messages from the siblings may arrive in a different order from the order of the
merging steps. Since each process needs to consume reply messages from its siblings in the
order of decreasing generation number, each sibling will need a different message type for
its reply message, and the process will selectively wait for a certain message in each merging
step.
The sorting program in the listing below first appeared in "Multicomputers: Message-Passing
Concurrent Computers". The first version of the program, which uses integer-based
types, was written by C. L. Seitz; the version appearing in the listing below and in the IEEE
paper was modified by the author to use pointer-based types.
  typedef struct MESG MESG  Message header structure 
 struct MESG  int pnode ppid  Address of the parent process 
	 int tbase   Base for time
on
target tree 
 int len   Number of elements in the vector
 MESG type    Type field for filtering message
  define BUFv double v  Data follows MESG immediately	 

 unsigned int thisnode thispid nodecnt
 main
  MESG v
 thisnode  mynode  Node number of this process	 
 thispid  mypid  Pid number of this process	 
 nodecnt  nnodes  number of nodes in this machine	 
 v  MESG  brecvb  Receive list from parent process	

 ifvlen   mergesortv  Sort the list	 
 bsendv vpnode vppid  Send result back to parent	 
 
 mergesortv
 MESG v 
  unsigned l l i newnode
 MESG v v v
  double d s b b

 l   vlen       Break the list into two lists	 
 l   vlen   
 v  MESG  bmallocsizeofMESGsizeofdoublel
 v  MESG  bmallocsizeofMESGsizeofdoublel
 fori  vlen  l d  BUFv s  BUFv i  d  s
 fori  vlen  l d  BUFv  i  d  s
 newnode  thisnode  vtbase  Next node to be used for 
 spawning a sibling	 
 vtbase  vtbase  vtbase    New base for building 
 timeontarget tree	 
 ifvlen    newnode  nodecnt
   If list is too long and 
 spawnmsortnewnodethispid  if next node is valid 
 vpnode  thisnode   spawn a sibling 
 vppid  thispid   and send it a list	 
 vtype  v   The type field holds the 
 bsendvnewnodethispid  address of the msg ptr	 
  v    Msg ptr is set to null	 

  else ifvlen   mergesortv  Sort if cannot split	 
 ifvlen   mergesortv  Sort the other list	 
 whilev  v  MESG  brecvb vtype  v 
 forb  BUFv b  BUFv d  BUFv l  l   merge	 
  whilel  l  l  b  b  l d  b 
  whilel  l  l  b  b  l d  b 
 

 bfreev bfreev
 
Listing  The mergesort program
In each level of recursion where a sibling is created, the type field of the message
for the sibling is filled with the address of the automatic pointer variable v. These v
pointers on the program stack are set to null before the merging phase, which begins
when the recursive mergesort function starts to unwind. Since there is at most one sibling
created in each level, the list sent to each sibling must contain an address that is different
from the others: the address of the v pointer in effect when the sibling is created.
[Figure: Giving away a list for the third time (stack grows up)]
After the expansion phase, the program progresses to the two recursive calls commented
"Sort if cannot split" and "Sort the other list", where the remaining numbers are sorted
using a sequential mergesort algorithm performed by the same mergesort function. During
the merging phase, each sibling returns a message of the type it was assigned. A process
selectively waits for the message for the current recursion level by polling the v pointer
at that level; at the same time, the process repeatedly requests a message and stores it
into the pointer whose address is equal to its message type.
[Figure: Getting an out-of-sequence reply]
When the program reaches the while loop that waits for the reply, v can take on one of
three possibilities:
 1. v is not null because its list has not been given away;
 2. v is null because, although its list has been given away, a reply has not been
    received; or
 3. v is not null because, although its list has been given away, the reply was received
    while the program was waiting for a different reply.
The distribution of work is accomplished by divide and conquer; the mergesort example
can be used as a template for other divide-and-conquer applications. Assigning deferred
messages into holding pointers is sufficient for this application because no more than one
message for each type needs to be queued. When more than one message of each type must
be deferred, the process has to store them in a more general list structure.
   The RPCdiscretion layer rlayer
While discretion is used in the mergesort program the process still takes messages in the
same order they arrive However some programs can be made simpler by creating an
illusion that messages are dispensed by the kernel in an order other than 	rst come 	rst
serve Such eects can be achieved with layering as well
The implementation of a remote procedure call RPC is one example Suppose we want
to make available a generic 	le operation read implemented by message exchange with a
	le controller a process responsible for maintaining a 	le A prototype function might look
like the one in Listing 

 typedef struct { int fsnode;               /* Structure of one entry of */
                  int fspid;  } FSTRUCT;    /* the process's file table  */

 FSTRUCT filetab[];                         /* The process's file table  */

 typedef struct { int operation;            /* Format of request message */
                  int mynode;               /* to be sent to the file    */
                  int mypid;                /* server process to request */
                  int readsize; } REQUEST;  /* for a read operation      */
 
   define OPREAD   Code read request
 
  readfdbuflen
  int fd len
  char buf
  
  REQUEST request
  char reply
  March  
Section  Discretion on Receive   
  request  REQUEST  bmallocsizeofREQUEST
 	 request
operation  OPREAD 
   request
mynode  mynode
  request
mypid  mypid 
  request
readsize  len 
  bsendchar  request filetabfdfsnode filetabfdfspid
  reply  brecvb
  bcopyreplybuflen
 bfreereply
	 returnlen
  
Listing  An incorrect implementation of the C read function
The filetab array contains the node and pid of all lecontroller processes accessible
by this process The read function sends a request to a le controller selected from filetab
using fd as the index When the le controller nishes reading the requested amount of
data the data is sent back in a message The function is shown to be waiting for the reply
using the normal brecvb function
  reply  brecvb
However the brecvb function is not adequate because it may pick up the wrong
message if another message arrives before the reply message A receivediscretion mechanism
must be used to ensure that only the reply message for the read function is returned The
reply messages called the RPC messages must therefore be distinguishable from other
messages that the process uses Furthermore messages that arrive before the reply message
must be queued and released in a transparent way so that the requesting program cannot
distinguish a local read from a RPC read
The rprimitives implement the new message properties by layering and by adding two
more functions RPC send and RPC receive The message header for this layer contains
a RPC ag and a chaining pointer Since RPC calls do not interleave in a process a
process can have no more than one outstanding reply message at any one time Storing one
distinguished type in a Boolean variable is therefore sucient for positively identifying a
  March  
   Chapter  ReactiveProcess Layers
reply message The deferh and defert pointers are used to implement a queue for non
RPC messages The next pointer in the message header is used to chain deferred messages
into a linked list for the queue
 typedef struct HEADER { int isrpc;
                         struct HEADER *next; } HEADER;
 #define BODYOF(h) ((h)+sizeof(HEADER))     /* given header, find body */
 #define HEADOF(b) ((b)-sizeof(HEADER))     /* given body, find header */
 HEADER *deferh, *defert;                   /* queue for holding non-rpc messages */
The rrecvb function replaces the brecvb function for receiving normal messages Instead
of calling brecvb immediately it checks the queue for any deferred messages If there are
deferred messages a message is removed from the queue and returned Otherwise brecvb
is called
 char rrecvb

  
   char p
  if
deferh  p  
char  deferh
  deferh  deferhnext
  return
BODYOF
p 
  return
BODYOF
brecvb

  
The rrecvrpc function is a function that waits for a reply message It calls brecvb
repeatly until a reply message is received The RPC message is then returned Meanwhile
all nonRPC messages that have arrived are stored in the queue
 char rrecvrpc

  
 char p
 while
p  brecvb

 
 if


HEADER pisrpc    return
BODYOF
p
 if
deferh defert  defertnext  
HEADER  p
 else defert  deferh  
HEADER  p
 

HEADER  pnext  
 
  
The rsend function clears the RPC ag before sending the message The rsendrpc
function sets the ag before sending the message
 rsend
pnodepid
 char p
  March  
Section  Discretion on Receive   
  int node pid
  
  HEADER 	
HEADOFp

isrpc  
  bsendHEADOFp
 node pid

  
 rsendrpcpnodepid

 char 	p
  int node pid
 
 HEADER 	
HEADOFp

isrpc  
 bsendHEADOFp
 node pid

 
If replies from the le controller are sent using rsendrpc the read function can be correctly
dened as
 readfdbuflen

  int fd len
 char 	buf
 
 REQUEST 	request
 char 	reply
 request  REQUEST 	
 rmallocsizeofREQUEST


 requestoperation  OPREAD 
 requestmynode  mynode

 requestmypid  mypid 

  requestreadsize  len 
 rsendchar 	
 request filetabfdfsnode filetabfdfspid

 reply  rrecvrpc

 bcopyreplybuflen

 rfreereply

 returnlen

 
Listing  A correct implementation of the C read function
The introduction of the RPC message type makes it possible for standard utility functions
to be implemented by message passing; however, the use of RPC and other discretion
mechanisms in utility functions has the potential effect of diminishing the available
concurrency in a program. For example, the use of read in a program forces all non-RPC messages
to wait while read is being completed, regardless of whether some of these messages can be
consumed without waiting for read to complete.
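The deferral behavior of the two r-layer receive functions can be observed in isolation. This is a minimal sketch under stated assumptions, not the layer itself: messages come from an array in a fixed arrival order, the body is reduced to a single int, and Header, inbox, sim_recvb, sketch_recvb, and sketch_recv_rpc are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Header {
    int is_rpc;             /* 1 for an RPC reply, 0 for an ordinary message */
    struct Header *next;    /* chaining pointer for the deferred queue */
    int value;              /* message body, reduced to one int here */
} Header;

/* Hypothetical stand-in for the message system: messages in arrival order. */
#define MAX_INBOX 8
static Header *inbox[MAX_INBOX];
static int n_inbox = 0, next_inbox = 0;

static Header *sim_recvb(void)
{
    assert(next_inbox < n_inbox);
    return inbox[next_inbox++];
}

static Header *defer_head = NULL, *defer_tail = NULL;

/* Ordinary receive: deferred messages are released first, in arrival order. */
static Header *sketch_recvb(void)
{
    if (defer_head) {
        Header *p = defer_head;
        defer_head = defer_head->next;
        return p;
    }
    return sim_recvb();
}

/* RPC receive: queue every ordinary message until the reply turns up. */
static Header *sketch_recv_rpc(void)
{
    for (;;) {
        Header *p = sim_recvb();
        if (p->is_rpc)
            return p;
        p->next = NULL;
        if (defer_head)
            defer_tail->next = p;
        else
            defer_head = p;
        defer_tail = p;
    }
}
```

Ordinary messages that arrive ahead of the reply are held back for the whole duration of the RPC wait and are then released in their original order, which is the loss of concurrency described above.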
    The CSPdiscretion layer csplayer
Layering can also be used to implement the CSP synchronization primitives In Hoares de
nition of CSP send and receive are performed by P expression and Pvariable respectively
where P is the process reference of the communication partner In later CSP variants such
as OCCAM send and receive are performed by Cexpression and Cvariable respectively
where C is the channel connecting the sender and the receiver Both send and receive
functions will block until the communication partner has completed the complementary op
eration on the same channel The send and the receive functions can be implemented with
a mutual exchange of messages between the two processes We will show an implementation
of CSP with channels
[Figure: Structure of a channel in a channel-based CSP implementation. A logical channel
joins endpoint j in one process to endpoint k in the other; each endpoint records the
otherend, othernode, and otherpid of its partner.]
Since messages associated with different channels may arrive in an order other than the
one in which CSP communication is to take place, messages must be tagged with a type
field, and those that have arrived early must be deferred. Let us construct a channel using
two logical communication endpoints, one each in the sender and the receiver. If we identify
the endpoints in each process by a small array index, the connectivity of the channels can
be completely described by four arrays in each process.
 typedef struct { int type; int value; } CSPMSG;
 int otherend [MAXCHAN];
 int otherpid [MAXCHAN];
 int othernode[MAXCHAN];
 CSPMSG *chanqueue[MAXCHAN];
In each process the entries othernodej and otherpidj identify the process at
the other end of channel j The entry otherendj is channel js identity at the other side
  March  
Section  Discretion on Receive   
of the channel i e  the channel j on this side and the channel otherendj on the other
side both refer to the same channel An unambiguous typing system can be constructed by
giving messages for channel j the type otherendj The chanqueue array is an array
of pointers that holds queued messages for channels Since each channel can have no more
than one pending message only one pointer for each channel is needed for buering early
messages The cspsend and the csprecv functions can be written as
  cspsendchanexpr
 int chan expr
	 

 CSPMSG sp  CSPMSG  bmallocsizeofCSPMSG
 spvalue  expr 
 sptype  otherendchan
 bsendsp othernodechan otherpidchan
 whilechanqueuechan 
 sp  CSPMSG  brecvb
  chanqueuesptype  sp 
	 bfreechanqueuechan chanqueuechan  	
 
 csprecvchanvar
 int chan var
 

 CSPMSG sp  CSPMSG  bmallocsizeofCSPMSG
  sptype  otherendchan
 bsendsp othernodechan otherpidchan
 whilechanqueuechan 
 sp  CSPMSG  brecvb
 chanqueuesptype  sp 
 var  spvalue
 bfreechanqueuechan chanqueuechan  	
 
In both functions a message buer is allocated and sent to the other side of the channel
The process then waits for a reciprocal message from the other side if one has not already
arrived The process frees that message clears the messagequeuing pointer and returns
The only dierence between the send and the receive functions is that in cspsend the
value to be sent is stored in the value eld before the send In csprecv the value is
retrieved from the message received before it is freed
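The endpoint-indexed typing scheme can be checked separately from the blocking behavior. The sketch below is an illustration under stated assumptions: two processes share one address space, node and pid are folded into a single process index, a send is modeled as a direct store into the partner's slot, and Endpoints, connect_chan, post, and take are hypothetical names. It does not model the blocking wait.

```c
#include <assert.h>

#define MAXCHAN 4
#define NPROC   2

typedef struct {
    int other_end[MAXCHAN];     /* index of the same channel at the partner */
    int other_proc[MAXCHAN];    /* partner process (node and pid folded together) */
    int slot_full[MAXCHAN];     /* 1 when an early message is parked */
    int slot_value[MAXCHAN];    /* the value carried by the parked message */
} Endpoints;

static Endpoints procs[NPROC];

/* Join endpoint ca of process pa to endpoint cb of process pb. */
static void connect_chan(int pa, int ca, int pb, int cb)
{
    procs[pa].other_proc[ca] = pb;  procs[pa].other_end[ca] = cb;
    procs[pb].other_proc[cb] = pa;  procs[pb].other_end[cb] = ca;
}

/* Send on local channel chan: the message is typed with the partner's index
   for the channel, so the partner can park it in the right slot. */
static void post(int from, int chan, int value)
{
    int dest = procs[from].other_proc[chan];
    int type = procs[from].other_end[chan];
    procs[dest].slot_full[type] = 1;
    procs[dest].slot_value[type] = value;
}

/* Take the parked message for local channel chan, if one has arrived. */
static int take(int proc, int chan, int *value)
{
    if (!procs[proc].slot_full[chan])
        return 0;
    *value = procs[proc].slot_value[chan];
    procs[proc].slot_full[chan] = 0;
    return 1;
}
```

Because each message carries the receiver's own index for the channel, the receiver can file it without consulting any table, and the two ends are free to number the same channel differently.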
More elaborate implementations of a superset of CSP were created by A. J. Martin
and by Marcel van der Goot.
   A more general typediscretion layer tlayer
When userdened message types are needed in a program with type discretion the type
information can be encoded in the message body and discretion can be handled by the
program itself as in the mergesort example Alternatively we can hide the message type
in the message header as in the tlayer example below
In the tlayer the program supplies a type for the message when it is sent with the
tsend function The tsend function stores the message type into the header before
the send In the receive function the program species the type of message to wait for
Messages of other types are queued if they arrive before a message of requested type is
received
 typedef struct HEADER { int type;
                         struct HEADER *next; } HEADER;

 tsend(p,node,pid,type)
 char *p;
 int node, pid, type;
 {
   ((HEADER *) HEADOF(p))->type = type;
   bsend(HEADOF(p), node, pid);
 }
The two pointer arrays, deferh and defert, implement the queues. This queue
structure imposes a limit on the range of usable types, but a more general queue structure
can be used instead. The trecvb function takes a message type as an argument. It waits
for and puts messages into the respective queue while the queue of the desired type remains
empty. When the queue is nonempty, a message is removed from the queue and returned
to the program.
  HEADER deferhMAXTYPE
 defertMAXTYPE
  char trecvb	type
  int type
  
  char p int t
  while	deferhtype
 
  p  brecvb	
 t  		HEADER  ptype
 if	deferht defertt  deferttnext  	HEADER  p
 else defertt  deferht  	HEADER  p
  March  
Section  Other Layers   
  HEADER  pnext  	

  
  p  char  deferhtype

  deferhtype  deferhtypenext

	 returnBODYOFp

 
Section  Other Layers
 A owcontrolling layer flayer	
Layering can also be used to implement transparent ow control of messages Suppose
we have an application where it is necessary to limit the number of unconsumed messages
produced by each process We can introduce a layer in which an acknowledgment message
is sent for every message consumed and have the send function block until the number of
messages sent is no more than a preset value over the number of acknowledgments received
In the following example when more than ten messages are outstanding the send
routine will call brecvb to wait for messages Since brecvb does not distinguish normal
messages from acknowledgment messages we will use the rlayer mechanism to selectively
wait for acknowledgment messages in the flayer routines
 typedef struct HEADER { int node, pid, isack;
                         struct HEADER *next;
                       } HEADER;
 #define BODYOF(h) ((h)+sizeof(HEADER))     /* given header, find body */
 #define HEADOF(b) ((b)-sizeof(HEADER))     /* given body, find header */
 #define COUNTMAX 10
 static int ocount;                         /* number of outstanding messages */
 HEADER *deferh, *defert;                   /* queue for holding normal messages */
Since the receiver has to send an acknowledgment to the sender, the f-layer message
header must contain the ID of the sending process in addition to the next field of
the r-layer header. The header must also contain the flag isack to differentiate a normal
message from an acknowledgment message.
 char frecvb
  
 HEADER p q

 ifdeferh  p  deferh
 deferh  deferhnext
 
 else  while  p  HEADER  brecvb

 ifpisack break

 ocount
 bfreep
  
  March  
   Chapter  ReactiveProcess Layers
  q  HEADER  bmallocsizeofHEADER
 	 q
isack  	 bsendqp
nodep
pid
  returnBODYOFcharp
  
In the receive function if there are any queued messages one message is removed
from the queue If the queue is empty the function calls brecvb repeatedly until a normal
message is received In both cases an acknowledgment is sent to the sender and the message
returned to the caller While waiting for a normal message any acknowledgment messages
received cause the outstanding message counter to decrement
  char fsendpnodepid
  char p
  int node pid
  
 HEADER q
  whileocount  COUNTMAX
 
 q  HEADER  brecvb
 ifq
isack  ocount

 bfreeq 
 else  ifdeferh defert  defert
next  q
 else defert  deferh  q
 q
next   
 
	 q  HEADER  HEADOFp
  q
node  mynode
 q
pid  mypid 
 q
isack  
 ocount
 bsendchar  q node pid
 
In the send function as long as the counter value is larger than ten brecvb is called
to obtain a message If the message is a normal message it is queued if the message is an
acknowledgment message the counter is decremented If the outstanding message counter
is or has become less than COUNTMAX the outgoing message is sent and the outstanding
message counter is incremented
If the communication graph is xed i e  channellike connectivity it is more e	cient
to have a separate counter for each channel and to send an acknowledgment for every
COUNTMAX messages in each channel Each acknowledgment message represents the con
sumption of COUNTMAX messages
  The CK primitives
The old CK Cosmic Kernel primitives the original message primitives for the Cosmic Cube
can also be built from the reactive primitives by layering The primitives are dened around
a data structure called a message descriptor This is very similar to the way in which the
C standard IO functions are dened around the FILE structure
 typedef struct {
     short node;
     short pid;
     short type;
     short seg;
     char *buf;
     unsigned short msglen;
     unsigned short buflen;
     short lock;
 } MSGDESC;
We have treated messages as information carriers. Sending and receiving messages are
similar to memory-allocation operations in C, in that it is the carrier that is affected. The
transfer of information is merely a side effect of moving these carriers. The CK primitives,
on the other hand, treat messages as information encoded in binary bit patterns and stored
in arrays of memory cells. When a message is being sent, the system fetches the information
from a designated storage buffer; when a message is received, the system writes the
information into a designated storage buffer.
Since the send and receive requests are not always completed when the send and receive
functions return, processes are allowed to run asynchronously while the transactions are
being completed. However, in order to avoid access conflicts in the buffers, a lock variable
is used for each transaction to indicate whether the transaction has completed. The buf
and lock variables in the MSGDESC structure are used to hold the buffer and the completion
lock.
When a message descriptor is used to send a message, the node and pid fields store the
ID of the destination process. The type and msglen fields store the message type and the
length of the message. The buf pointer references a memory buffer where the message is
contained. When send is called, the call will return immediately, but the lock remains set
until the send operation is complete.
When a message descriptor is used to receive a message, the type field is set to the type
of the message to be received. The buf field is set to reference the memory buffer where
the message body is to be stored. The buflen field contains the size of the memory buffer.
When a receive function is called, the call will return immediately, but the lock remains
set until the receive operation is complete. When the receive is complete, the node and pid
fields contain the ID of the sending process. The msglen field contains the actual length of
the message. Incoming messages that do not have matching receive requests waiting for them
will be queued.
typedef struct HEADER  int snode spid
int msglen
int type
struct HEADER next  HEADER
Other functions in the CK primitives are described in detail in the CK programming guide.
In making the transition from the CK primitives to the RK (Reactive Kernel) primitives,
which we use on our machines, a compatibility library was created for the old CK programs
by layering. The message header for a CK layer would therefore contain the sender node and
pid, the message length, and the message type. It would also contain a pointer for making
linked lists for discretionary receives. The details and the listings for the implementation
have been omitted for brevity.
   The RK primitives xprimitives
The RK primitives or xprimitives can also be built from the blayer functions by layering
The RK primitive set includes the following list of functions	
char xmalloc 			
 bmalloc
char xrecv 			
 nbrecv
char xrecvb 			
 brecvb
char xrecvrpc 			
 rrecvrpc
xsend 			
 bsend
xsendrpc 			
 rsendrpc
xfree 			
 bfree
int xlength 			
 llength
The xmalloc xrecvb xsend and xfree functions are equivalent to the bmalloc
brecvb bsend and bfree functions respectively The xrecv function is equivalent
to the nbrecv function the nonblocking receive The xlength function is equivalent to
the llength function the function that returns message length The RPC functions are
similarly equivalent to those of the rlayer functions
The RK primitives can therefore be implemented using a combination of llayer nb
layer and rlayer However in the actual implementation of the Reactive Kernel all three
of the layers are incorporated into the basic kernel for greater eciency
The xprimitives and associated functions will be discussed in the next section in con
junction with the description of the Cosmic Environment the generic multicomputer op
erating environment in which the xprimitives are supported as the primary programming
system
Section  Layering on LightWeight Processes
Any layering that applies to heavyweight processes and that makes sense in the context of
the lightweight processes can be applied to lightweight processes as well If we represent
the kernel handler layer routines and user program as four separate components the chain
of control ow is shown in Figure 
[Figure: chain of four components: reactive kernel, context switcher, layer functions, user
code. Arc labels toward the user code: call handler to deliver message; context switch to
deliver message; return to deliver message. Arc labels toward the kernel: call layer function
to get message; context switch back to handler to get message; return to kernel to get
message.]
Figure  Control flow for heavyweight processes
[Figure: chain of three components: reactive kernel, layer functions, user code. Arc labels
toward the user code: call handler to deliver message; call user code to deliver message.
Arc labels toward the kernel: return to layer function to get message; return to kernel to
get message.]
Figure  Control flow for lightweight processes
The control ow for lightweight processes shown in Figure  is identical except for the
absence of the handler component
Although these two programming models are essentially interchangeable lightweight
processes are more e	cient in most machines because they avoid the contextswitch cost
However programs composed of lightweight processes are more di	cult to develop because
processes are not protected against each other in case of a programming error The processes
must in practice coexist in the same address space
Chapter  Cosmic Environment
The Cosmic Environment or CE is a multicomputer programming specication that also
exists as an implementation on a number of multicomputers Details for using CE can
be found in The C Programmers Abbreviated Guide to Multicomputer Programming	

We will concentrate here on the reasoning behind the design of our implementation but rst
we will give a short denition of the Cosmic Environment Specication The specication
covers the process model the message system and the library functions
Section  The Cosmic Environment Specication
The agents of a computation in CE are:

Processes        Each process is identified by a unique process ID, which is
                 a (node, pid) pair. Node identifies the multicomputer node
                 containing the process, and pid distinguishes one process from
                 another on the same multicomputer node.
Messages         Each message is tagged by the ID of its destination process.
Message system   The message system accepts messages from the processes, routes
                 them according to their destination process ID, and delivers them
                 to their destination processes. Messages are queued en route to
                 their destinations; message order between any pair of processes is
                 preserved.

[Figure: processes sending messages to, and receiving messages from, the message system;
labels: a process, a message, sending, receiving, message system, a queue.]
Figure  Elements of a computation
In CE a process can allocate and release message buers send and receive messages
create other processes and terminate itself The functions available to a C program are
char xmallocn
unsigned n
 Allocates and returns a message buer
sucient for n bytes of data
xfreep
char p
 Releases a message buer
char xrecvb  Waits for and returns a message from the
message system
char xrecv  Returns a message from the message system
if one is available returns a null pointer
otherwise
xsendpnodepid
char p int node pid
 Frees the message buer p from the calling
process and sends the message buer to the
process whose ID is nodepid
spawnnamenodepidoption
char name option int node pid
 Runs the program called name and assigns it
the ID nodepid
int mynode  Returns the node number of the calling
process
int mypid  Returns the pid number of the calling
process
exit  Terminates the calling process
This specication is short and simple When our emphasis is on the study of multicom	
puter programming we do not need unnecessary features to distract us what we do need is
a system that does not inhibit creativity CE preserves the value of our work by making it
easy to provide ecient implementations for its specication on many multicomputers that
are otherwise software	incompatible
Our CE specication was designed with the following two rules in mind
 Programming systems should be portable
 Programming manuals are evil
The rst design rule regards the portability of CE A programming environment is portable if
many types of machines can be made to support the programming environment Portability
is easy to achieve with CE because its functions are easy to provide in most multicomputers
and multiprocessors CE can be supported at the userprogram level with a compatibility
library or at the system level with a reactive kernel The reactive kernel makes kernel
implementation or substitution simple because it does not require much support from the
hardware
The second design rule regards programming manuals Manuals are a necessary evil
Therefore whenever possible CE has been made easy to explain in order to shorten the
manuals Besides this obvious advantage for people who do not enjoy reading manuals CE
has become simple and intuitive because making it easy to explain has also made it easy to
use
Having a short programming manual is selfrewarding In an evolving system where
old features are constantly being revised or dropped and new features are constantly being
added keeping a large manual uptodate is a nontrivial task for a small research group
By keeping the manual simple we not only make manual revision less laborious but also
make system improvement easier since we are not obliged to support any misfeatures that
have not been previously documented Our view is that the less a user has to know in order
to e	ciently complete the work the better
Section  Our Cosmic Environment Implementation
An implementation of the CE specication is a programming environment that embodies
the specication Currently we have implementations that contain drivers for the Cosmic
Cube the iPSC the iPSC the Symult  and for the ghost cube 
 a set of network
connected workstations treated as a single multicomputer For historical reasons we retain
  March  
   Chapter  Cosmic Environment
the use of the word cube to mean a multicomputer even though not all multicomputers
are binary ncubes Other implementations that use shared memory for message passing
exist for the Sequent and for the Cray XMP
  Structure of our CE implementation
We start with the process model. A process group contains a set of processes connected to
the message system (see the figure below). Processes communicate with each other by sending
and receiving messages, and they refer to each other by means of their process IDs.
[Figure: a set of processes, each connected to the Message System by a send arc and a
receive arc; labels: Message System, send, receive, a process.]
Figure  A process group
In order for the set of processes to communicate with the outside world, the logically
uniform message system is physically partitioned into two parts. One resides in the
multicomputer and is called the node message system; the other resides outside of the
multicomputer and is called the host message system. The two parts are connected by a message
gateway, and the separation is made transparent to the processes (see the figure below).
Processes are then allowed to run either on the hosts or on the nodes.
[Figure: the message system split into a host system and a node system, with processes
attached to each part.]
Figure  Partitioning into two parts
Since our multicomputers are used in classes for student experiments, there are many
more users who need to use the multicomputers than there are available multicomputers.
But since most experiments require fewer nodes than are available in a multicomputer, we
want to support several users simultaneously on the same multicomputer. Space sharing is
the sharing of a multicomputer by more than one user, such that each user is given a separate
subset of nodes in a multicomputer. The programming environment within each subset is
indistinguishable from one in which the user owns an entirely separate multicomputer having
the same number of nodes as the subset. Our message gateway must therefore interface with
more than one host message system, and pass messages to and from each user's nodes (see the
figure below).
[Figure: two groups of host processes on the TCP/IP network, each group connected through
the gateway to its own subset of nodes on the multicomputer network.]
Figure  A multicomputer shared by two users
In our implementation the host system is built on top of the TCPIP network and
the host processes run on any network	connected host that uses the Berkeley UNIX socket
mechanism The node system is built on top of the multicomputer network and may involve
either a replacement kernel in each node or a set of emulation routines for the CE functions
In this particular implementation the gateway is a single ifc process and each host
message system is a single messageswitcher process The message switcher is the spoke of
the host message system It is connected to each host process and to the ifc process via
TCPIP stream sockets Message	sending functions in a host process convert CE messages
into TCPIP messages before sending them to the message switcher Depending on the
  March  
   Chapter  Cosmic Environment
ifc
interface
hardware
internet TCPIP stream socket connection
host system message switcher process
multicomputer
multicomputer interface ifc pro
123
123
123
123
123
123
123
123
123
123
cess
Figure  Host messagesystem implementation
ID of the destination process the message switcher will send a message either to another
host process or to the ifc process The ifc process waits for messages from both the
multicomputer and the switchers When it gets a message from a switcher it converts the
message into a multicomputer message and sends it to a multicomputer node owned by the
user who owns the switcher
When the ifc process gets a message from the multicomputer the node ID of the sender
is used to determine the destination switcher process The ifc process then converts the
message into a TCPIP message and sends it to the switcher When the switcher gets a
message from the ifc process it sends the message to the destination host process The
receive function in the host process then converts the message into a CE message to be
returned to the user program
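The step in which the ifc process uses the sender's node ID to pick the destination switcher can be sketched as a table lookup. The sketch below assumes, purely for illustration, that each user's subset of nodes is a contiguous range and that a switcher is identified by a socket descriptor; the structure and function names are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* One space-sharing partition (hypothetical layout). */
struct partition {
    int base;          /* first node of the user's subset */
    int count;         /* number of nodes in the subset */
    int switcher_fd;   /* stream socket to the owning user's switcher */
};

/* Return the socket of the switcher that owns `node`, or -1 if the
   node is not allocated to any user. */
int owner_switcher(const struct partition *tab, size_t n, int node)
{
    assert(tab != NULL || n == 0);
    for (size_t i = 0; i < n; i++) {
        if (node >= tab[i].base && node < tab[i].base + tab[i].count)
            return tab[i].switcher_fd;
    }
    return -1;
}
```

A linear scan is adequate here because the number of simultaneous users of one multicomputer is small.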
[Figure: the cube daemon connected to the ifc process of each cube and to each user's
switcher process.]
Figure  Cosmic Environment with unified resource management
Since we have several multicomputers, and since some of them are of the same type,
we centralize the allocation of all multicomputers in a process called the cube daemon.
When a multicomputer is requested by type, the cube daemon tries to assign an available
multicomputer of the required type by searching the list of all multicomputers registered to
it. Thus, the user is not concerned with locating an available machine, because it makes no
difference which one is assigned.
We connect all ifc processes and switcher processes with the cube daemon via TCP/IP
stream sockets. These sockets do not carry much traffic; they are merely tokens of
participation in CE for the switchers and the ifc processes.
  Cosmic Environment exterior
Having been spoiled by the convenience of the Network File System (NFS) on workstations,
the first thing that we decided that we did not want to know is where to go to access the
multicomputers. Like files in an NFS environment, CE is equally accessible from everywhere
in the same network. The cube daemon resides on a known host in a network, and a
configuration file in each participating machine is initialized to contain the network address
of the cube daemon.
Every utility that accesses CE connects to the cube daemon using the network address
found in the configuration file, making CE available and equally accessible from anywhere
within the same network. The most frequently used utility is the program called peek,
which prints the status of CE:
CUBE DAEMON version  up  days  hours on host ganymede
 	 
d cosmic cube b  venus fly trap 
h
 	 d cosmic cube b  ceres TEST  
h
 sim mikep 	 d ipsc cube  b  saturn iPSC  h
group david 	 d ipsc cube  b  titan iPSC d 
h
 	 n s  b  psyche ginzu  d
group apl 	 n s  bb salieri ginzu  h
 	 n s  b perseus S  d
group sharon	 n s  b perseus S  m
group tony 	 n s  bc  mozart S  h
The peek utility lists all available occupied and fragmented multicomputers In the
display above user tony and user sharon each occupy  nodes in a 	node S without
  March  
   Chapter  Cosmic Environment
interfering with each other User apl is using  nodes of a node S User david is
using a 	node iPSC
 and user mikep is using a node iPSC

To use a multicomputer, we must first allocate one. We specify the multicomputer type,
and the cube daemon picks the best allocation according to an algorithm specific to that
type. For example, we can request a number of nodes of an S multicomputer with the
getcube command. A peek will now show the following list:
CUBE DAEMON version  up  days  hours on host ganymede
 	 
d cosmic cube b  venus fly trap h
 	 d cosmic cube b  ceres TEST  h
 sim mikep 	 d ipsc cube  b  saturn iPSC  h
group david 	 d ipsc cube  b  titan iPSC d 
h
 	 n s  b  psyche ginzu  d
group apl 	 n s  bb salieri ginzu  h
 	 n s  b perseus S  d
group wenking	 
n s  b neptune S  s
group sharon 	 n s  b perseus S  m
group tony 	 n s  bc  mozart S  
h
GROUP group wenking	 TYPE reactive IDLE s
   SERVER s r q neptune 

 
s
   FILE MGR s r q neptune 
 
s
  CUBEIFC s r q perseus 
  s
In this example, the allocation algorithm carves out a subset of nodes from the
multicomputer shared by sharon and tony instead of from the one used by apl. After the
allocation, any multicomputer programs that we run on the hosts or on the nodes become
part of our process group. The host processes will be connected to our switcher, and the node
processes will be spawned on our nodes. Host processes are shown in the extended peek
display below the main list. In this example, a set of server programs was automatically
started and added to the process group when getcube returned.
  Cosmic Environment processes
While CE is not in use, the only active processes in the hosts are the cube daemon process
and the ifc processes. Each ifc process resides in a host containing an interface to a
multicomputer, and maintains a TCP/IP connection to the cube daemon process. The cube
daemon keeps track of its set of ifc connections; that a connection remains open is an
indication that the multicomputer attached to the ifc process is ready for use. An ifc
process passes the multicomputer status to the cube daemon via its TCP/IP connection.
The cube daemon process passes allocation and deallocation commands to the ifc process
via the same connection.
When a user requests a multicomputer by running the getcube program, the getcube
process connects to the cube daemon and sends it a set of allocation requirements. If the
requirements can be fulfilled, the requested multicomputer, or a partition of the
multicomputer, is marked as allocated in the cube daemon's table. An allocation command is then
sent to the corresponding ifc process. The ifc process initializes the nodes allocated to the
user and then connects to the user's getcube process. The getcube process then fades to
background to become the switcher process, giving the user the appearance that the getcube
command has terminated, as an indication that the allocation has completed.
A set of service processes is started by the getcube process as it fades to background.
These processes are responsible for such mundane tasks as the details of process spawning,
file access, and printing of error messages. Additional host processes and utilities are run
by the user to perform computation.
Porting CE to another multicomputer involves the creation of a new plug-in node system
for the new multicomputer. We have a choice of implementing the CE node system on top
of the native node kernel, or writing a new kernel that implements the CE node system.
The Cosmic Cube and the Symult multicomputer both have the CE node system as their native
system. We replaced the kernel of one iPSC model with a custom kernel. On the other iPSC
model, and on earlier versions of the first, the CE node system is layered on top of their
native systems, the NX kernels.
When we layer a CE node system on top of the native node kernel, the ifc process
is linked with the native host library for the multicomputer, and it interacts with the
multicomputer via the native message functions. To the native system running underneath,
the ifc process appears to be just an ordinary host process of the native system. The
CE node system can operate within the confines of user-accessible functions of the native
system because it has simple requirements; it does not need special capabilities from the
native system, and it does not interfere with the functioning of the native system.
   Program compilation
Dierent commercial multicomputers will invariably provide dissimilar methods of compiling
programs for their multicomputers The compiler options are dierent those with the same
name may have dierent meanings to dierent compilers and some that are available to one
compiler may be missing for another compiler The sequence of operations that the user has
to go through may be dierent and the set of end products may also be dierent However
we recognize that only a small set of the options is useful and we can easily hide any
dierence among the compilers by the use of a program that runs programs By declaring
that only a limited set of commonly used compiler ags are supported the compilation
tools for all machines can be described in one table
                          host    ghost   cosmic  iPSC    iPSC    S
    compiler              cch     ccgh    cccos   ccipsc  ccipsc  ccs
    linkable-file suffix  o       gho     O       o       o       so
    runnable-file suffix  (none)  gh      cos     ipsc    ipsc    s
    archiver              arh     argh    arcos   aripsc  aripsc  ars
    archive-file suffix   a       gha     A       a       a       sa
The following sequence will compile the program myprogram.c for all of these machines,
and the runnable object code generated will be named myprogram, myprogramgh,
myprogramcos, myprogramipsc, myprogramipsc, and myprograms, respectively:
  cch -o myprogram myprogram.c -lcube
  ccgh -o myprogram myprogram.c -lcube
  cccos -o myprogram myprogram.c -lcube
  ccipsc -o myprogram myprogram.c -lcube
  ccipsc -o myprogram myprogram.c -lcube
  ccs -o myprogram myprogram.c -lcube
To illustrate the amount of complexity hiding that can be performed: actual compilation
for one of the iPSC models can be done only on that machine's controller box, an Intel
system. The corresponding ccipsc program copies the source files to the controller box for
compilation, and copies back the compiled object files when compilation is completed. It
creates an illusion that compilation takes place where the ccipsc command is issued.
  Spawning programs
Like compilers dierent multicomputers supply their own method of running a node pro
gram We can hide the dierences by using programs that run other programs but unlike
the compilers we no longer have to dierentiate one multicomputer from another by giving
them dierent names While a compiler can be invoked by the user at any time a program
loader can be invoked only when the user has an active process group
We can therefore eliminate another level of complexity by having the generic loader
spawn check the type of the multicomputer being used and have it run the loader com
mand specic to that multicomputer Thus to load the program generated in the previous
example into any of the multicomputers we can run spawn myprogram	 regardless of the
multicomputer we are using
Utilities such as the nodeprogram compilers are called machinespecic utilities util
ities such as spawn are called machinedependent utilities and utilities such as peek are
called machineindependent utilities The node system for each type of multicomputer
therefore contains the ifc process the machinespecic utilities the machinedependent
utilities and the compiler libraries
  Data representation and conversion
We have tried to simplify CE and, at the same time, to hide the differences between different
multicomputers, but it is not always possible to do both. The difference in data
representation among processors of different multicomputers and hosts is one that we cannot hide
in vanilla C. When two communicating processes are run on two machines having different
data representations, data in messages sent from one process to another need to have their
representations converted before they can be used. We can always move the conversion
problem into the compiler, but we still have to decide how the problem is to be solved.
[Listing: the same number shown as three byte sequences, one per machine type; the middle
row is labeled vax.]
Listing  Three representations of the same number in double-precision floating-point-number format
Datarepresentation problems have been a subject of study ever since computers were
rst connected by networks The most common solution is to dene an interchange data
representation The sender converts data items in its outgoing messages from the senders
representation to the interchange representation the receiver converts data items in its
incoming messages from the interchange representation to the receivers representation A
set of conversion routines with the same name but having dierent functions on dierent
machines is provided to make programs portable A program needs only to be capable of
converting its data to and from the interchange representation rather than to and from all
possible representations
In the case of a multicomputer however message tra	c is usually much higher and
message latency is usually much lower between the nodes than between the hosts Having to
convert the data in each internode message to and from an interchange representation can
signicantly reduce the performance of messageintensive applications unless the interchange
representation happens to be identical to the representation of the multicomputer
Our solution is therefore to make the interchange representation adjustable we dene
the interchange representation for a process group to be the representation used by the
multicomputer of the process group Node processes are not required to convert the data
in their messages and if they do the functions that they call to perform the conversion
will have no eect A host process is required to convert message data to the interchange
representation before it sends a message and from the interchange representation after it
receives a message Host processes already have a large permessage overhead and they
can absorb the extra work of converting the data
The node programs never need any conversion routines, but host programs must carry
routines that convert data representations to and from those of all multicomputers that CE
supports. The conversion routines check the multicomputer type before deciding how data
is to be converted. Adding a new multicomputer to CE may require that host programs be
recompiled if the data format for the multicomputer is not already supported.
In order to preserve the CE specification, conversions are done in place, because message
buffers are treated like memory buffers from malloc. Having to convert a message and
put the converted data in another buffer weakens the specification. In order to have such
conversion make sense, however, the location and the size of each data item in the messages
must be the same for all processes. Unfortunately, different machines do have different sizes
and alignment rules for the same data type.
struct test  char AA
short BB 
long CC 
int DD   
	

 AAABBCCCCDDDD
vax AAABBCCCCDDDD
	
	 AAABBCCCCDD
Listing  Three layouts of a structure in order of increasing byte address
For data sizes we made the decision that in all the machines that we support data
items will have the following sizes and a message should include only the following data
types
doubleprecision floatingpoint number 	 bits
singleprecision floatingpoint number 
 bits
long integer 
 bits
short integer  bits
character  bits
For alignment we add any necessary padding to force each data item to align on its
strictest alignment boundary A kbyte data type should be aligned on a kbyte boundary
The bottom of a data structure should also be rounded out by padding it to the alignment
  March  
   Chapter  Cosmic Environment
boundary of the largest data item in the structure Whenever possible a structure should
be rearranged to minimize the amount of padding necessary
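The two alignment rules can be made concrete with a small routine that computes the offset of each item and the padded size of the whole structure. This is an illustrative sketch only: the field descriptor and function names are hypothetical, and item sizes are assumed to be 1, 2, 4, or 8 bytes.

```c
#include <assert.h>
#include <stddef.h>

/* One member of a message structure: `count` items of `size` bytes each
   (hypothetical descriptor; count > 1 describes an array). */
struct field {
    size_t size;
    size_t count;
};

/* Round x up to the next multiple of k. */
static size_t round_up(size_t x, size_t k)
{
    assert(k > 0);
    return (x + k - 1) / k * k;
}

/* Place each field on a boundary equal to its item size, store the
   resulting offsets, and return the total size padded to the boundary
   of the largest item. */
size_t layout(const struct field *f, size_t n, size_t *offset)
{
    size_t pos = 0, widest = 1;
    for (size_t i = 0; i < n; i++) {
        pos = round_up(pos, f[i].size);       /* k-byte item on a k-byte boundary */
        offset[i] = pos;
        pos += f[i].size * f[i].count;
        if (f[i].size > widest) widest = f[i].size;
    }
    return round_up(pos, widest);             /* round out the bottom of the structure */
}
```

For a three-character array followed by a short, a long, and a double, the routine places the items at offsets 0, 4, 8, and 16 and reports a total size of 24 bytes, so one byte of padding follows the characters and four bytes precede the double.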
When data items are aligned using these rules, the location of each data item in a
message is the same for all machines. A set of conversion routines can be used to perform
in-place conversion on the items:
htocspn ctohspn Convert short integers
htoclpn ctohlpn Convert long integers
htocfpn ctohfpn Convert singleprecision floatingpoint numbers
htocdpn ctohdpn Convert doubleprecision floatingpoint numbers
The htoc set of functions converts data from the format used by the calling process
to the interchange format. The ctoh set of functions performs the reverse conversion.
Parameter p is a pointer to an item of the appropriate type, and parameter n is the number
of consecutive data items to be converted by the functions. There is no conversion routine
for the character type, because the basic units of the messages are bytes, and their correct
ordering is enforced by the ifc process.
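An in-place conversion of short integers might look like the sketch below. It is not the CE routine: it handles byte order only, it takes the multicomputer's byte order as an explicit parameter rather than checking the multicomputer type itself, and the names htocs_sketch, host_order, and enum byte_order are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

enum byte_order { ORDER_LITTLE, ORDER_BIG };

/* Detect the byte order of the machine this code runs on. */
enum byte_order host_order(void)
{
    const uint16_t probe = 1;
    return *(const unsigned char *)&probe == 1 ? ORDER_LITTLE : ORDER_BIG;
}

/* Convert n consecutive 16-bit items at p, in place, between the host
   representation and the representation of the multicomputer. */
void htocs_sketch(int16_t *p, int n, enum byte_order cube)
{
    unsigned char *b = (unsigned char *)p;
    assert(n >= 0);
    if (cube == host_order())
        return;                     /* same representation: no effect */
    for (int i = 0; i < n; i++) {   /* swap the two bytes of each item */
        unsigned char t = b[2 * i];
        b[2 * i] = b[2 * i + 1];
        b[2 * i + 1] = t;
    }
}
```

Because a byte swap is its own inverse, the same body would serve for the reverse (ctoh) direction; and when the host and the multicomputer share a representation, the call does nothing, which matches the behavior described for node processes.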
The data-representation problem may require rethinking after machines with a wider
data bus become available. Data-type conversion is only an inconvenience, and it can always
be taken care of by writing a new compiler that inserts code to do the conversion for the
user. However, such work is beyond the scope of this research.
Chapter  Model of Simulation
Section  Mathematical Framework and Analysis
 Systems and elements
A system consists of a system body, a set of system inputs, and a set of system outputs.
It is a "black box" whose only external connections are the inputs and outputs. In a
representation of a simulator, each individual output conveys an atomic property of the
simulated system. A property is atomic if, at any point during the simulation, the simulator
contains all information about that property up to some simulated time, but none beyond
that simulated time.
Figure: Representation of a system (a system body with system inputs and system outputs).
A system can be dened recursively as a collection of systems linked together by arcs
each arc connects an output of its source system to an input of its destination system and
each arc represents the source systems direct inuence on the destination system The
recursion terminates with systems that are called elements the behavior of each element is
dened algorithmically to correspond to a model of some physical device or process
Figure: Representation of a system composed of elements (elements linked by arcs between
the system inputs and the system outputs).
If the hierarchy that is induced by this recursive definition is flattened by expanding
each system recursively into its constituent systems and elements, we obtain a system that
is composed entirely of elements. In order to simplify the following exposition, we shall,
without loss of generality, discuss a system that is composed entirely of elements.
In a composite system, each element input can be connected to no more than one
arc, whereas each element output can be connected to any number of arcs. The set of
system inputs is the set of unconnected element inputs, whereas the set of system outputs
can be any subset of the element outputs. Systems without any inputs are called closed
systems. In order to simplify the mathematical framework, we shall close each system with
an environment element, $e_e$, that provides inputs to all unconnected system inputs and
accepts outputs from all unconnected system outputs.
Figure: Closing a system into a closed graph (the environment element is connected to the
system inputs and outputs).
The representation is now a graph that can be described as below:
Figure: Arc source and destination (an arc $a$ running from $\mathrm{src}(a)$ to $\mathrm{dst}(a)$).
Figure: Element inputs and outputs (an element $e$ with arcs $\mathrm{inp}(e)$ and $\mathrm{out}(e)$).
Figure: Arcs forming a path.
Figure: Arcs forming a circuit.
$E$: The set of elements in a system.
$A$: The set of arcs in a system.
$U = E \cup \{e_e\}$
$\mathrm{inp}(e)$: The set of all arcs terminating at $e$.
$\mathrm{out}(e)$: The set of all arcs originating from $e$.
$\mathrm{src}(a)$: The source element of $a$.
$\mathrm{dst}(a)$: The destination element of $a$.
path: A path of length $n$ is a sequence of arcs $(a_0, a_1, a_2, \ldots, a_{n-1})$ such
that $\mathrm{dst}(a_i) = \mathrm{src}(a_{i+1})$ for $0 \le i < n-1$.
circuit: A circuit of length $n$ is a path of length $n$ in which
$\mathrm{src}(a_0) = \mathrm{dst}(a_{n-1})$.
  States and time
The state of a system includes both its internal state and the state of its outputs. Let
$S_U(t_0, t_1)$ be the state description of the closed system between the times $t_0$ and
$t_1$ ($t_0 \le t_1$), and let $S_L(t_0, t_1)$ be the state description restricted to the
subset (or member) $L$. The state of the closed system can be written as a Cartesian product
of the environment state and the system state:
$$S_U(t_0, t_1) = S_{e_e}(t_0, t_1) \times S_E(t_0, t_1)$$
Similarly, the system state can be written as the Cartesian product of the element states:
$$S_E(t_0, t_1) = S_{e_1}(t_0, t_1) \times S_{e_2}(t_0, t_1) \times S_{e_3}(t_0, t_1) \times \cdots \times S_{e_n}(t_0, t_1)$$
A simulator is said to be progressive if it can compute the following function for any
valid input description, $S_{\mathrm{inp}(E)}(t_0, t_1)$, which is a description of input
state over a time interval, and any valid initial state of the system, $S_E(t_0, t_0)$:
$$S_{\mathrm{inp}(E)}(t_0, t_1) \times S_E(t_0, t_0) \rightarrow S_E(t_0, t_1)$$
A simulator may be able to compute more state information for some of its outputs
than is specified above. For example, if the system can compute the following function for
some $\Delta \ge 0$, the output $o$ is said to have a delay of no less than $\Delta$ at
time $t_1$:
$$S_{\mathrm{inp}(E)}(t_0, t_1) \times S_E(t_0, t_0) \rightarrow S_o(t_0, t_1 + \Delta)$$
If $\Delta$ is the largest value for the above to remain true, then $\Delta$ is the delay of
the output $o$ at simulated time $t_1$. The delay of a system at simulated time $t_1$ is
defined to be the smallest of all output delays of the system at $t_1$. The definition of a
progressive simulator precludes the possibility of negative delays.
  Knots and progress
In this section we shall dene a set of rules that allows us to recursively construct progres

sive system simulators by connecting progressive element simulators in the same manner in
which the elements of the system are connected We shall call such a simulator a composite
simulator In order to discuss progress we make a minimal assumption that information
computed at any element simulator e will be available to all dstoute We shall assume
for the moment that elements are deterministic that is S
inpe
t  t  and S
e
t  t 
completely determine S
e
t  t  Thus in order to determine whether a simulator is pro

gressive we need to consider only the arc state S
A
t  t 
A simulator lacks progress if and only if there exists a combination of
$S_{\mathrm{inp}(E)}(t_0, t_1)$ and $S_E(t_0, t_0)$ such that the simulator fails to compute
$S_a(t_0, t_1)$ for some $a \in A$. Let $t_K$ be the time value, $t_0 \le t_K < t_1$, such
that the simulator can compute $S_A(t_0, t_K)$ but not $S_A(t_0, t_K^{+})$. Let
$K \subseteq A$ be the set of arcs such that the simulator can compute $S_a(t_0, t_K)$ but
not $S_a(t_0, t_K^{+})$. The set $K$ is called a knot in the simulation. The presence of a
knot is synonymous with a lack of progress.
Knot: The simulator can compute $S_a(t_0, t_K^{+})$ for all $a \notin K$.
      The simulator can compute only $S_a(t_0, t_K)$ for all $a \in K$.
Figure: Example of a knot-containing system (a composite system in which a NAND gate
receives the system input on one arc and its own output on another).
An example of a knotcontaining system is a zerodelay NANDgate with one of its
inputs connected to its output as shown in Figure 	
	 Although the element simulator
for the NANDgate may be progressive the composite simulator for this system cannot be	
For example if the input to the system is the following
S
inpE
  
 
 for   t   
  for    t  
then the composite simulator can compute only the following for the arc a 
S
a 
  
 
  for   t   
 for    t  
The simulator cannot compute S
a 
for    t   because a selfconsistent state assignment
for a  cannot be found	 The set of arcs fa g is a knot	
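The reason a self-consistent state assignment may not exist can be checked by brute force. The sketch below is an illustration only: it assumes Boolean signal values and a two-input NAND gate whose second input is its own output, and the function names are hypothetical.

```python
def nand(x, y):
    """Two-input NAND on the values 0 and 1."""
    return 1 - (x & y)


def consistent_outputs(system_input):
    """Feedback-arc values v that satisfy v == NAND(system_input, v)."""
    return [v for v in (0, 1) if v == nand(system_input, v)]


while_input_is_0 = consistent_outputs(0)
while_input_is_1 = consistent_outputs(1)
```

When the system input is 0, the gate output is forced to 1 regardless of the feedback, so exactly one assignment exists. When the system input is 1, the gate must output the complement of its own output, so the list of consistent values is empty; that interval is where the feedback arc forms a knot.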
Theorem. If $a$ is an arc of knot $K$, then the following conditions hold:
a. $\mathrm{inp}(\mathrm{src}(a))$ is not empty; i.e., $\mathrm{src}(a)$ is not a source
   node in the directed graph of elements.
b. The delay of $\mathrm{src}(a)$ at $t_K$ is 0.
c. Some member of $\mathrm{inp}(\mathrm{src}(a))$ is also a member of $K$.
Proof:
a. If the set of arcs $\mathrm{inp}(\mathrm{src}(a))$ is empty, then $\mathrm{src}(a)$ is a
   closed system. A closed system does not need any information from its environment in
   order to compute its state; it is able to compute its outputs up to any arbitrary time.
   Therefore, $\mathrm{inp}(\mathrm{src}(a))$ cannot be empty.
b. By the definition of a knot, the simulator can compute up to $t_K$ for all arcs in
   $\mathrm{inp}(\mathrm{src}(a))$. If the delay for $\mathrm{src}(a)$ were greater than
   zero, the simulator would be able to compute up to $t_K^{+}$ for $a$. Since it cannot,
   by definition, the delay of $\mathrm{src}(a)$ must be zero.
c. If no member of $\mathrm{inp}(\mathrm{src}(a))$ is in $K$, then by the definition of a
   knot, the simulator should be able to compute up to $t_K^{+}$ for all members of
   $\mathrm{inp}(\mathrm{src}(a))$. Furthermore, since delay cannot be negative, the
   simulator should be able to compute up to $t_K^{+}$ for $a$. Therefore, if $a$ is in
   $K$, some member of $\mathrm{inp}(\mathrm{src}(a))$ must also be a member of $K$.
Rules of thumb: sufficient conditions for progress
Corollary. Every knot contains a circuit.
Proof: There is a finite number of arcs in a system. If for every arc $a_i \in K$
there is at least one arc $a_j \in K$ such that $a_j \in \mathrm{inp}(\mathrm{src}(a_i))$,
then there must be a circuit in $K$.
Corollary. If the system contains no circuits, then the composite simulator is
progressive.
Proof: Since every knot must contain a circuit, a system that does not contain
any circuits cannot have knots.
Corollary. If every element has a delay greater than 0, then the composite simulator
is progressive.
Proof: Follows directly from the theorem, part b.
Corollary. If in every circuit there is some element with nonzero delay, then the
simulator is progressive.
Proof: From the corollary that every knot contains a circuit, if $K$ exists, it must
contain a circuit. From the theorem, if such a circuit exists, all the elements in it must
have zero delay. Therefore, if all circuits have at least one element with nonzero delay,
then $K$ cannot exist.
Although the progress conditions stated in the last three corollaries identify a set
of systems with progressive simulators, they do not identify, either by themselves or all
together, the set of all systems with progressive simulators. These are not minimal
conditions, because there are systems with progressive simulators that do not satisfy any
of the three corollaries. The corollaries are useful as simple rules of thumb because there
exists an effective procedure for testing each of them.
   Nonexistence of necessary and sucient progress conditions
   Simulation and Boolean satisability
An algorithm that tests for a necessary and su	cient condition if any such condition does
exist must be NPhard Figure 
 shows a system that tests for the satisability condition
in a set of Boolean clauses The system contains a zerodelay NAND gate a counter a clock
source and a network of zerodelay gates forming the clauses A simulator for the system
is not progressive if and only if there exists a counter output such that all of the clauses
are true If there is an algorithm that can determine whether a simulator for any system of
this form has progress we can use it to determine whether any collection of clauses can all
be true at the same time Since the latter operation Boolean satisability  is known
to be NPcomplete the algorithm must be NPhard Therefore any generic algorithm that
tests for a necessary and su	cient condition must be NPhard
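The correspondence used in this argument can be sketched in a few lines. The code below does not model the gates themselves; it only mirrors the stated equivalence by letting the counter enumerate every assignment and reporting a lack of progress when some assignment makes all clauses true. The clause encoding (signed integers for literals) and the function names are assumptions made for this illustration.

```python
from itertools import product


def clause_true(clause, assignment):
    """clause: list of nonzero integers; literal k refers to variable |k|
    and is negated when k < 0. assignment: tuple of booleans."""
    return any(assignment[abs(k) - 1] == (k > 0) for k in clause)


def simulator_lacks_progress(clauses, num_vars):
    """True if some counter output makes every clause true at once."""
    for counter_output in product([False, True], repeat=num_vars):
        if all(clause_true(c, counter_output) for c in clauses):
            return True
    return False


unsatisfiable = simulator_lacks_progress([[1], [-1]], 1)
satisfiable = simulator_lacks_progress([[1, 2], [-1]], 2)
```

The loop visits up to 2^num_vars counter outputs, which is the exhaustive behavior the counter in the circuit provides; no shortcut is implied.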
   Simulation and simultaneous equations
Another way to demonstrate the futility of searching for a necessary and sufficient condition
is to examine the relationship between simulation and simultaneous equations. We define a
progressive simulator to be one that can compute the following function for any valid input
description $S_{\mathrm{inp}(E)}(t_0, t_1)$ and any valid initial state $S_E(t_0, t_0)$:
$$S_{\mathrm{inp}(E)}(t_0, t_1) \times S_E(t_0, t_0) \rightarrow S_E(t_0, t_1)$$
Figure: A circuit to evaluate satisfiability of a set of clauses (a clock driving a counter,
whose outputs feed the clause networks).
Let $H_e$ be the mapping associated with a progressive simulator for the element $e$; we can
express a composite simulator as the following set of equations:
$$\forall e \in E: \quad S_e(t_0, t_1) = H_e\bigl(S_e(t_0, t_0),\, S_{\mathrm{inp}(e)}(t_0, t_1)\bigr)$$
Since $S_e(t_0, t_1)$ describes $S_{\mathrm{out}(e)}(t_0, t_1)$, and since $S_A(t_0, t_1)$
and $S_E(t_0, t_0)$ determine $S_E(t_0, t_1)$, a composite simulator can also be expressed
as the following set of equations:
$$\forall a \in A: \quad S_a(t_0, t_1) = G_a\bigl(S_{\mathrm{src}(a)}(t_0, t_0),\, S_{\mathrm{inp}(\mathrm{src}(a))}(t_0, t_1)\bigr)$$
$G_a$ is $H_{\mathrm{src}(a)}$ restricted to the arc $a$. These are simultaneous equations
in the form
$$\forall i: \quad X_i = F_i(X_1, X_2, \ldots, X_n)$$
Furthermore, any set of simultaneous equations can be transformed into a physical
system for which a composite simulator can be constructed. The set of all simulators and
the set of all simultaneous equations must be equivalent.
Figure: Mapping equations into a physical system (each function $F_i$ becomes an element
whose output $X_i$ feeds the inputs of the other elements, for
$\forall i: X_i = F_i(X_1, X_2, \ldots, X_n)$).
In any set of simultaneous equations, only one of the three possibilities listed below can
exist:
- The simultaneous equations have no solution.
- The simultaneous equations have exactly one solution.
- The simultaneous equations have more than one solution.
Since a simulation is progressive if and only if its set of simultaneous equations has a
solution, any test for determining progress of a simulator can be used as a test for
determining the existence of solutions for simultaneous equations, and vice versa. Since the
test for the latter has not been found, the test for the former also has not been found. The
search for a necessary and sufficient condition is therefore both difficult and, so far, futile.
Section  Operational Framework
Although an effective simultaneous-equation solver for the general case does not exist, the
simultaneous-equation representation brings us one step closer to an operational model,
because effective procedures (such as Gaussian elimination for ordinary linear equations)
exist for specific classes of equations.
The equations for a simulation are generally difficult to analyze, because their variables
and constants describe states over the entire simulation interval, and the equations
themselves can be arbitrarily complex. We may be able to obtain a set of simpler equations,
however, if we restrict the analysis to those simulations that span only a short interval.
If the interval of a simulation can be broken down into a finite number of smaller intervals
such that each interval can be computed by an effective procedure, we will have found an
effective procedure for the simulation.
 Breaking a simulation into smaller slices
Any equation whose associated output has a delay $\Delta$ such that $\Delta > 0$ can be
reduced to a constant equation by restricting the simulation to an interval equal to
$\Delta$. Let $L$ be the set of output arcs with a nonzero delay at time $t$. Suppose $L$ is
nonempty; let $\delta$ be the smallest nonzero delay. The states of all arcs between $t$ and
$t + \delta$ are related by the following set of simultaneous equations (justifications to
follow shortly):
$$\forall a \in A: \quad S_a(t, t+\delta) =
\begin{cases}
G_a\bigl(S_{\mathrm{src}(a)}(t, t)\bigr) & \text{if } a \in L \\
G_a\bigl(S_{\mathrm{src}(a)}(t, t),\, S_{\mathrm{inp}(\mathrm{src}(a))}(t, t+\delta)\bigr) & \text{if } a \notin L
\end{cases}$$
If equations like these can be solved, simulation for a system can be performed by
dividing the simulation interval into $\delta$-wide slices and repeatedly solving for
$S_A(t, t+\delta)$, computing $S_E(t+\delta, t+\delta)$, and advancing to time $t + \delta$.
Since the set of equations above covers a slice of time, let us simply refer to it as a
slice. The operation of a composite simulator that advances one slice at a time can be
described by the actions of its element simulators.
The figure below depicts the sequence of actions taken by the simulator for element $e$,
whose output arc $a$ has a nonzero delay of $\Delta$. At the beginning of the slice that
starts at $t$ (panel a), the simulator has progressed to $t$ and has computed $S_e(t, t)$.
Since the delay for $a$ is $\Delta$, $S_e(t, t)$ contains the output state description
$S_a(t, t+\Delta)$.
Figure: Element-simulator operation for an element with a nonzero delay (panels a to d,
showing element $e$, arc $a$, and the state descriptions $S_e$, $S_a$, and
$S_{\mathrm{inp}(e)}$ as the slice advances).
Since $\delta$ is no larger than $\Delta$, the equation for $S_a$ does not depend on the
state of other arcs, and the simulator can output the state description $S_a(t, t+\delta)$
(panel b) without any additional inputs. If the state description over the interval
$(t, t+\delta)$ can be computed for every arc in the system, $S_{\mathrm{inp}(e)}(t, t+\delta)$
will become available to $e$ (panel c). Since element simulators are assumed to be
progressive, the simulator for $e$ will compute $S_e(t+\delta, t+\delta)$ from $S_e(t, t)$
and $S_{\mathrm{inp}(e)}(t, t+\delta)$, and will be ready for the next slice (panel d).
Figure: Element-simulator operation for an element with a zero delay (panels a to d,
showing element $e$, arc $a$, and the state descriptions $S_e$, $S_a$, and
$S_{\mathrm{inp}(e)}$ as the slice advances).
If the delay is zero (see the zero-delay figure), the simulator for $e$ does not contain any
output state description beyond the starting time of the slice (panel a). The equation for
$S_a$ depends on the state of other arcs, and the simulator is unable to produce
$S_a(t, t+\delta)$ until it has received $S_{\mathrm{inp}(e)}(t, t+\delta)$ (panel b). If
$e$ is not a member of a zero-delay circuit (see the corollary on circuits that contain an
element with nonzero delay), $S_{\mathrm{inp}(e)}(t, t+\delta)$ will eventually be available.
When $S_a(t, t+\delta)$ is computed, the simulator will be ready for the next slice.
A slice that does not contain zero-delay circuits can be solved by simple variable
substitution; a slice that contains zero-delay circuits, called an obligatory slice,
requires simultaneous-equation solving. A system has a progressive simulator if and only if
a solution exists for every slice of a system. If a slice has no solutions, then the slice
contains a knot.
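Solving a slice by variable substitution can be sketched as follows. This is an illustration under simplifying assumptions: each arc carries a single value that is constant over the slice, arcs whose source has a nonzero delay are supplied as already-known constants, and the function name `solve_slice` and its data layout are hypothetical.

```python
def solve_slice(known, equations):
    """known: dict arc -> value, for arcs whose source has a nonzero delay.
    equations: dict arc -> (function, list of input arcs), for zero-delay arcs.
    Returns a dict holding a value for every arc. Raises ValueError when the
    remaining arcs cannot be resolved by substitution alone."""
    values = dict(known)
    pending = dict(equations)
    while pending:
        # An arc is ready once every arc it depends on already has a value.
        ready = [arc for arc, (_, deps) in pending.items()
                 if all(d in values for d in deps)]
        if not ready:
            raise ValueError("substitution is stuck on arcs %s" % sorted(pending))
        for arc in ready:
            function, deps = pending.pop(arc)
            values[arc] = function(*[values[d] for d in deps])
    return values


solved = solve_slice(
    {"a0": 1, "a1": 0},
    {"a2": (lambda x, y: x ^ y, ["a0", "a1"]),
     "a3": (lambda x: 1 - x, ["a2"])},
)
```

When the zero-delay arcs depend on each other in a circuit, no arc ever becomes ready and the function raises; that is the case the text calls an obligatory slice, for which substitution is not enough.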
  Slices and knots
For a system that contains only deterministic elements, a non-obligatory slice always has
exactly one solution. An obligatory slice, however, can have three possible outcomes: no
solution, one solution, and multiple solutions. All three of the outcomes can be found in
the cross-coupled zero-delay XOR/NOR circuit in the figure below.
$$S_{a_0}(t, t+\delta) = \text{a function of the environment}$$
$$S_{a_1}(t, t+\delta) = \text{a function of the environment}$$
$$S_{a_2}(t, t+\delta) = \neg\bigl(S_{a_0}(t, t+\delta) \lor S_{a_3}(t, t+\delta)\bigr)$$
$$S_{a_3}(t, t+\delta) = S_{a_1}(t, t+\delta) \oplus S_{a_2}(t, t+\delta)$$
Figure: A system that contains all three types of slices (a NOR gate and an XOR gate, each
taking one environment input and the other gate's output).
When the inputs $a_0$ and $a_1$ are both 0 over the $(t, t+\delta)$ interval, the set of
simultaneous equations for the circuit can be reduced to the set of two equations below,
which has no solution:
$$S_{a_2}(t, t+\delta) = \neg\bigl(S_{a_3}(t, t+\delta)\bigr)$$
$$S_{a_3}(t, t+\delta) = S_{a_2}(t, t+\delta)$$
The slice is a no-solution slice. A no-solution slice contains a knot, and no simulator is
able to complete the simulation when a no-solution slice is encountered. When $a_0$ is 0 and
$a_1$ is 1, the set of simultaneous equations for the circuit can be reduced to the set of
two equations below, which has arbitrarily many solutions:
$$S_{a_2}(t, t+\delta) = \neg\bigl(S_{a_3}(t, t+\delta)\bigr)$$
$$S_{a_3}(t, t+\delta) = \neg\bigl(S_{a_2}(t, t+\delta)\bigr)$$
The value of the pair $(a_2, a_3)$ can be either $(0, 1)$ or $(1, 0)$, but their value can
switch between the two, spontaneously and for arbitrarily many times. The slice is a
multiple-solution slice. A simulator can make progress if it is able to continue using one
of the solutions. When both $a_0$ and $a_1$ are 1, the only solution for the simultaneous
equations is $a_2 = 0$ and $a_3 = 1$. The slice is a single-solution slice.
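The three kinds of slice can be reproduced by enumerating every candidate state assignment. The sketch below assumes one particular wiring of the cross-coupled gates (a NOR gate fed by one environment input and the XOR output, and an XOR gate fed by the other environment input and the NOR output) and treats each arc as constant over the slice; the function and parameter names are hypothetical.

```python
from itertools import product


def slice_solutions(nor_input, xor_input):
    """All (nor_output, xor_output) pairs consistent with both gates."""
    return [
        (nor_out, xor_out)
        for nor_out, xor_out in product((0, 1), repeat=2)
        if nor_out == 1 - (nor_input | xor_out)   # NOR of its input and the XOR output
        and xor_out == (xor_input ^ nor_out)      # XOR of its input and the NOR output
    ]


no_solution = slice_solutions(0, 0)
multiple_solutions = slice_solutions(0, 1)
single_solution = slice_solutions(1, 1)
```

Enumeration only counts the consistent assignments within one slice. It does not capture the fact that, in a multiple-solution slice, the pair of outputs may switch between the solutions at arbitrary instants.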
  Implementation considerations
Thus far, we have analyzed the composite simulator using only abstract models, because any
real simulator is bounded by these frameworks. We can never find progressive simulators
for more systems than those indicated by these frameworks. We can derive a number of
simulators directly from the framework, but in order for any implementation to cover the
range indicated by the framework, it must satisfy the following two conditions:
Eventual Delivery: The simulator must make available any information
                   that is present in the simulation to any element that
                   requires it for further computation.
Slice Resolution:  The simulator must have mechanisms to resolve any
                   obligatory slice that has a solution.
Simulators that satisfy both of these properties are called complete simulators. Not all
simulators are, or need to be, complete simulators. For example, if every element in a system
has a nonzero delay, slice resolution is not necessary. Complete simulators that operate
on all possible systems are beyond our goal; we often restrict ourselves to specific subjects,
such as discrete-event simulation. We will temporarily restrict ourselves to systems that do
not require slice resolution, in order to allow for the development of a working simulator
model.
Section  The Generic Simulator Model and Its Derivatives
Since it is sucient to synchronize the elements through their inputs and outputs strict
synchronization of all elements on slice boundaries is unnecessary elements should be al
lowed to progress at their own pace as their input data becomes available Furthermore if
  for an element is larger than  the element does not have to stop producing output at
t 	  because it already has computed S
out
e

t t	  
Figure: Representation of an arc (a tape with a write head, a read head, the recorded
region, and the gap between the two heads).
If we ignore the existence of obligatory slices, we can construct a generic simulator
model using a set of multitape automata. We replace each arc in the system with a read
head, a write head, and a tape, such that:
1. As information is produced by the originator of the arc, the information and the
   simulation time are recorded along the length of tape as the write head advances. The
   recorded time strictly increases.
2. The read head recovers the recorded information and the time from the tape as it
   advances.
3. Both tape heads move in one direction only, but the read head will never move past
   the write head.
Since information over periods of time is written onto the tape by its source element
before being read from the tape by the destination element, element simulators are decoupled
in simulated time. The gap between a write head and a read head on the same tape is called
the slack. Since the element simulators are moved forward by consuming and producing
slack, this simulator model is called the slack-driven simulator model.
A slackdriven simulator is not a complete simulator because the model does not include
a mechanism to solve simultaneous equations	 when a system encounters an obligatory slice
and equationsolving is required the element simulators involved will stop They are blocked
while waiting for each other to produce more tape	 this condition is called deadlock We
will describe in brief a few derivatives of the slackdriven simulator some of which are
more permissive and some more restrictive	 thus some are more complete and some are less
complete than the slackdriven simulator
  Messagedriven simulation
A slackdriven simulator can be expressed as a set of concurrent messagepassing processes
in which the processes are the element simulators and the message streams are the tapes
Whenever a stretch of tape is written by the slackdriven simulator the information on
the tape is sent in a message	 whenever a stretch of tape is read the information in a
  March  
Section  The Generic Simulator Model and Its Derivatives   
received message is read Since the slack is represented by messages queued in transit
a messagepassing implementation of a slackdriven simulator is called a messagedriven
simulator
simulator simulator
messages
process pro
12
12
1
1
123456
123456
123
123
123
123
123
123
123
123
123
cess
Figure  Replacing tape by messages
Since a messagedriven simulator is an exact implementation of a slackdriven simulator
the simulation will not make any further progress when equationsolving is required
  Concurrent eventdriven simulation
The slackdriven simulator satises eventual delivery because each stretch of tape written is
immediately available to the destination process The messagedriven simulator duplicates
that property by immediately packing and sending the output information as a message
oblivious to the value of the information content of the message An eventdriven simulation
is a modied messagedriven simulation in which message trac is reduced by classifying
messages and by treating dierent types of messages dierently
Messages are classied by whether they are needed at the receiving end Messages that
are considered to be nonessential are held back with the objective of combining as many
nonessential messages as possible with the next essential message and packaging them
all in a single entity The total volume of messages in the simulation is reduced without
impeding the progress of the simulation Whether a message is needed however depends
on the state of the simulation and is often impossible to determine on the basis of local
information alone
  March  
   Chapter  Model of Simulation
In eventdriven systems however messages containing state transitions are more likely
to be needed than those that do not most eventdriven simulators make the classication
on that basis alone Since the transitions are often called events and since there is generally
one in each message for such a simulator these simulators are called eventdriven simulators
Messages containing no events are called null messages Eventdriven simulators were rst
explored by Chandy Misra and Bryant 	 
 though their derivation paths are dierent
from ours This exposition illustrates that null messages are a consequence of applying a
more general model to a specic class of subjects rather than a necessity when going from
a sequential simulator to a distributed simulator
Culling null messages as is true with many other methods for reducing message volume
violates the rule of eventual delivery because the rules that decide whether a message is
needed at the receiving end can fail Without additional mechanisms to assure eventual
delivery of necessary null messages deadlock may still occur A ring of elements with
stable values for their cyclic outputs will fail to produce progress because each element is
waiting for its preceding element to produce a message yet none will arrive if they send
only messages containing transitions
Figure: Example of deadlock in an event-driven simulation. One element cannot produce more
information because it has not received any more information; its predecessor cannot send
the information that is waiting to be sent because that information does not contain any
transitions.
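The ring deadlock can be reproduced with a toy model. The sketch below assumes a ring of elements whose outputs never change, each with the same nonzero delay, and tracks only how far in simulated time each element's input is known; the function name and parameters are hypothetical, and the model ignores message contents entirely.

```python
def run_ring(n, delay, steps, send_null_messages):
    """Return, for each element, the time up to which its input is known
    after the given number of rounds."""
    input_horizon = [0] * n
    for _ in range(steps):
        # Each element can extend its output description by its own delay.
        outgoing = [input_horizon[i] + delay for i in range(n)]
        if send_null_messages:
            # The output of element i - 1 is the input of element i.
            input_horizon = [outgoing[(i - 1) % n] for i in range(n)]
        # Otherwise every message is held back: none contains a transition.
    return input_horizon


with_null_messages = run_ring(3, 2, 4, send_null_messages=True)
without_null_messages = run_ring(3, 2, 4, send_null_messages=False)
```

With null messages, every input horizon grows by one delay per round. Without them, the horizons stay at zero forever, which is the deadlock described in the text.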
Sequential simulator
A sequential simulator is a simple example of a backtracking simulator for event-driven
systems. If we describe it in the context of our model, a sequential simulator keeps all of
its read heads aligned during the simulation. All read heads are initially aligned at the
starting time of the simulation. Each write head records not only the output state derived
from the element input, but also the expected output state, assuming that the element will
encounter no further input change.
If there are currently no state transitions recorded under the read heads, the sequential
simulator is free to move the read heads forward without delivering any of the state
descriptions to any elements. The state description on the portion of the tapes covered
by the motion was produced on the assumption that no transition has occurred over that
period, and the assumption was shown to be valid. When a transition is encountered, the
assumption by its destination element is shown to be false, and the transition is delivered
to its destination element so that a new output can be computed. Since the delay of an
element must not be negative, the tape already covered by the read heads will never have
to be revised.
In an implementation of the sequential simulator, the set of tapes is replaced by a
merged list of pending events. Each pending event represents an expected change in an
output of an element, given that the input state of the element remains unchanged. Items
in the list are sorted in ascending order with respect to their time values.
The position of the read heads is kept in a single variable called the global clock. Moving
the read heads forward is accomplished by storing increasingly larger values into the global
clock as events are pulled from the list of pending events. The simulator repeatedly sets the
global clock to the time of the earliest event in the list, pulls that event from the list, and
delivers it to the destination element. All events in the list except the topmost event are
subject to revision, because the assumptions of the elements that posted them (that their
inputs will remain unchanged) may now be shown to be false. The event pulled from
the top of the list will never need to be revised, because the assumption of the element that
posted it is now shown to be correct. The sequence of events pulled from the list represents
the result of the simulation.
Figure: Model of a sequential simulator (a sorted event list whose entries hold an element,
a state, and a time; the simulated time; and the element simulators; with the steps: update
time, identify destination element, deliver event, and update event list).
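The event loop described above can be sketched with a priority queue. The sketch is a simplification: it keeps the sorted list of pending events and the global clock, but it does not model the revision of pending events when an element's assumptions fail. The function name, the event layout, and the sample inverter and probe elements are assumptions made for this illustration.

```python
import heapq


def run_sequential(initial_events, elements, end_time):
    """initial_events: list of (time, destination, value) tuples.
    elements: dict mapping a destination name to a function(time, value)
    that returns new (time, destination, value) events, none earlier than
    the current time. Returns the delivered events in order."""
    pending = list(initial_events)
    heapq.heapify(pending)                 # pending events sorted by time
    delivered = []
    while pending and pending[0][0] <= end_time:
        time, dest, value = heapq.heappop(pending)   # earliest pending event
        global_clock = time                           # read heads move forward
        delivered.append((global_clock, dest, value))
        for event in elements[dest](global_clock, value):
            heapq.heappush(pending, event)
    return delivered


# Hypothetical elements: an inverter with a delay of 3 feeding a probe.
elements = {
    "inv": lambda t, v: [(t + 3, "probe", 1 - v)],
    "probe": lambda t, v: [],
}
trace = run_sequential([(0, "inv", 0), (5, "inv", 1)], elements, 100)
```

Because events are always pulled in time order and new events are never earlier than the global clock, a delivered event is final, which matches the observation that the topmost event never needs revision.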
Suppose an obligatory slice is encountered during the simulation. If the state under
the read heads forms a self-consistent state assignment for the slice, then there will be no
events scheduled to change that assignment. The simulator will pass over the slice without
detecting it. If the state assignment is not self-consistent, there will be events that change
the state assignment. As the result of delivering such events, more events may be scheduled
for the current simulation time, because some destination elements may have a zero delay.
If the intermediate state assignments eventually lead to a consistent state assignment, the
pool of events under the read head will become empty, and the global clock will be allowed
to advance; if not, the simulator will be stuck processing an endless stream of events having
the same event time.
Since there is one event delivery for every transition, a sequential simulator is also
labeled event-driven; however, unlike the concurrent event-driven simulator described
previously, the sequential simulator will never deadlock. The simulator is a complete simulator.
  Concurrent backtracking simulators
Messagedriven simulators do not backtrack because every piece of information that each
element simulator produces is correct Backtracking simulators produce speculative infor
mation that can be revised when assumptions fail In a sequential eventdriven simulator
the amount of backtracking is limited by the alignment of the read heads Since alignment is
costly and reduces concurrency concurrent backtracking simulators do not align read heads
The element simulators are allowed to produce outputs and to consume inputs according
to their own heuristics and assumptions When those assumptions are shown to be wrong
they have to restart the simulation from the point where the computation went wrong by
backing up the write heads to discard erroneous information
When a write head needs to be moved back behind a read head, the destination element
of the read head has already consumed the false inputs and may have produced its state
and output based on them; it too must be rolled back. In order to roll back to the time at
which the input becomes invalid, the element simulator has to store a sequence of past
states in addition to its current state.
Not all of the past states need to be stored, however. In the Time Warp simulator of
David Jefferson, a behind-the-scenes mechanism called the global virtual time is used
to compute concurrently the lower bound of time for which rollback may still occur. The
global virtual time attempts to keep track of the minimum time of all events and elements
in the simulation. Any saved state with a time value less than the global virtual time can
be discarded, because no element will ever roll back to an earlier time.
The advantage of a backtracking simulator is that, when a processor of the machine is
otherwise idle, spare cycles can be used for speculative computing. Since this simulator must
keep a record of past states for the elements, the concurrent backtracking simulator trades
off space for speed by using larger processing nodes than would otherwise be necessary.
Concurrent backtracking simulators are complete simulators, and they handle obligatory
slices the same way as do sequential simulators. When one is encountered, and if the
state assignment of the elements involved is already self-consistent, the simulator moves
ahead without detecting it. If the state assignment is not self-consistent, some of the
elements involved will be rolled back to the starting time of the slice, and perhaps some more
after that. The flurry of rollbacks ends when a self-consistent state is achieved.
   Branchandbound simulators
If a backtracking simulator is likened to a depth-first search, then its breadth-first equivalent
resembles a branch-and-bound simulator. This is one that trades off space for speed by using
more processing nodes (rather than larger nodes) than would otherwise be necessary.
Suppose an element simulator computes to a point where its output can take on one of
several states, depending on some inputs that have not yet arrived. Instead of producing a
speculative output, as would a backtracking simulator, the element simulator will, in effect,
fork the simulation into a set of concurrent branches to follow each of the possibilities. In
each branch, when the decisive input has finally arrived, should the input not match the
assumption for a branch, then the branch will be terminated (bound).
[Figure: a Researcher element connected to two Agency elements in tandem]
Figure: A researcher submitting a grant
For comparison suppose that a research grant request has to be approved in tandem
by two government agencies The rst agency spends a long time classifying the grant into
one of three classes A B or C The second agency spends a long time deciding whether
the grant will be accepted based on the classication and the available funding for each
class A researcher submitting a grant can be represented by the system in Figure 
In a messagedriven simulator only one agency simulator can be active at any one time
The time it takes to simulate the approval of the grant is equal to the sum of the time taken
in each agency because the operation is sequential In a backtracking simulator while the
simulator for the rst agency is working the second agency can choose and pursue one
  March  
Section  The Generic Simulator Model and Its Derivatives   
message driven simulator
AAR AAR AAR
petition
C
OK
12312123 121212 1212312
123
123
123
123
123
123
123
123
123
C
backtracking simulator
AAR AAR AAR
C
OK
inconsistency detected rollback needed
BC
OK
assume
12
12
12
12312123 121212 1212312
123
123
123
123
123
123
123
123
123
B
branchandbound simulator
AAR AAR AAR
AAR AAR
AAR AAR
OK
C
OK
OK
OK
C
B
A
C
C
C
assume C
assume A
assume
123
123
123
123
123
123
123
123
123
123
123
123
123
123
123
123
123
123
123
123
123
12
12
123
123
1212
12
123
123
12
12
123
123
12
12
12
12
12
12
123
123
12
12
123
123
12
12
1212
12
123
123
12
12
123
123
12
12
123
123
123
B
Figure  Comparison between three simulators
but only one of the three possible outcomes produced by the rst agency In a branch
andbound simulator three copies of the simulation are produced each pursuing one of the
three possibilities
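The fork and bound steps of the grant example can be sketched in C as follows. This is only an illustration under stated assumptions: the Branch structure, the funding array, and the names fork_branches and bound_branches are hypothetical, and each branch's decision is reduced to a single flag rather than a long-running simulation.

```c
#include <assert.h>

#define NUM_CLASSES 3   /* classes A, B, and C from the grant example */

/* One forked copy of the second agency's simulation (hypothetical). */
typedef struct {
    char assumed_class;  /* the classification this branch assumes */
    int  alive;          /* 1 until the branch is terminated (bound) */
    int  approved;       /* decision computed under the assumption */
} Branch;

/* Fork one branch per possible classification. funding[k] is nonzero
 * if class 'A' + k still has funding available. */
void fork_branches(Branch branches[NUM_CLASSES], const int funding[NUM_CLASSES])
{
    int k;

    for (k = 0; k < NUM_CLASSES; k++) {
        branches[k].assumed_class = (char)('A' + k);
        branches[k].alive = 1;
        branches[k].approved = (funding[k] != 0);
    }
}

/* The decisive input has arrived: terminate every branch whose assumption
 * does not match it. Returns the number of surviving branches. */
int bound_branches(Branch branches[NUM_CLASSES], char actual_class)
{
    int k;
    int survivors = 0;

    assert(actual_class >= 'A' && actual_class < 'A' + NUM_CLASSES);
    for (k = 0; k < NUM_CLASSES; k++) {
        if (branches[k].alive && branches[k].assumed_class != actual_class)
            branches[k].alive = 0;
        survivors += branches[k].alive;
    }
    return survivors;
}
```

When the first agency finally reports class C, only the branch that assumed C survives, and its decision is already available.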
A branchandbound simulator is also a complete simulator If there are any nosolution
slices all branches will be terminated and none will remain at the end If there are any
multisolution slices but no nosolution slices more than one set of simulations will remain
at the end and each will correspond to one possible outcome If there are only single
solution slices then exactly one set will remain The simulator will fail however if the
number of solutions is unbounded because the computing resource is bounded
The branchandbound simulator is the only interesting type of distributed simulator
that so far as we know is still to be explored Ecient algorithms to fork and terminate
the simulator may provide hope for the simulation of systems with very little intrinsic
parallelism and whose grain size is too small or whose behavior too unpredictable for
rollback to be protable
  Timedriven simulators
Thus far we have discussed simulators that resolve slices by trialanderror backtracking
and by exploring all possibilities branchandbound In both methods each element sim
ulator needs only local information for progress Neither method is appropriate however
when the number of possibilities that must be explored is innite Exact simulation of such
a system may require solving simultaneous equations analytically When the equations can
be solved they yield functions of time reducing the simulation to a simple task of func
tion evaluation When an analytical solution is inappropriate or dicult to nd empirical
approximations must be used
[Figure: a physical system, a capacitor (CAP) connected to a load (LOAD), and its logical
representation, in which $I = i(V)$ and $V = \int I \, dt$]
Figure: An example of a continuous system
An example of such a system is an electrical circuit. In the system in the figure above, the
voltage across a capacitor is the integral of the current through the capacitor; the current,
in turn, is a function of the voltage across the capacitor.
The equations:
    $V = \int I \, dt$
    $I = i(V)$
In order to simulate this kind of system, we need to find a replacement system that is
discrete but that will either approximate the behavior of the target system or converge to
the nal state of the target system The usual method of building a simulator for such a
system is to divide the simulation interval into a sequence of small slices We then assume
that information exchange takes place only at the boundaries of these slices and information
about the others can be accurately extrapolated between the boundaries
For example when integration of a continuous function is involved discrete methods
such as taking the Riemann sum can be used to approximate the integral of the function
Although discrete integration is seldom exact we can get increasingly better approximations
by reducing the size of the slices when the size is reduced the Riemann sum approaches
the integral However due to accumulated numerical errors the simulation may eventually
diverge and produce an output that is valid only for a limited span of simulated time
Simulators of this type are called timedriven simulators because they are moved forward
at one time slice per step Simulators of this type are also complete
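A time-driven treatment of the capacitor example can be sketched in C as follows. The sketch rests on assumptions that are not in the text: the load is taken to be linear, $i(V) = -V$ (unit resistance and capacitance), and the names load_current and simulate_voltage are hypothetical. The integral is approximated by a left Riemann sum, with the current sampled only at the start of each time slice.

```c
#include <assert.h>

/* Hypothetical load: the current into the capacitor is I = i(V) = -V. */
static double load_current(double v)
{
    return -v;
}

/*
 * Advance V = integral of I dt from time 0 to t_end in `steps` equal
 * time slices. Information is exchanged only at slice boundaries: the
 * current is sampled at the start of each slice (a left Riemann sum).
 */
double simulate_voltage(double v0, double t_end, int steps)
{
    double v = v0;
    double dt;
    int k;

    assert(steps > 0);
    dt = t_end / steps;
    for (k = 0; k < steps; k++)
        v += load_current(v) * dt;   /* area of one slice of the sum */
    return v;
}
```

For this particular load the exact solution is known, $V(t) = V(0) e^{-t}$, so the approximation can be checked: starting from $V(0) = 1$, the exact value at $t = 1$ is about 0.3679, ten slices give $0.9^{10} \approx 0.3487$, and a thousand slices come much closer.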
  Summary
The slackdriven simulator is a generic simulator model that covers a large array of existing
and hypothetical simulators Simulators that perform computation on speculation such as
the concurrentrollback simulator are called optimistic simulators Simulators that produce
no output other than that implied by the input are called conservative simulators We will
concentrate on the messagedriven simulator which is a conservative simulator
We are particularly interested in the characteristics of the simulator itself, not those of a
simulator plus any system it simulates. Thus, we have chosen the most revealing simulation
subject, devised a series of conservative simulators, and reported in the following chapters
the results obtained.
Chapter  Logic-Circuit Simulator Experiments
A Boolean network is a network of Boolean logic gates connected such that each input is
driven from the output of another gate or from an input to the network. A logic circuit
is a Boolean network that includes a notion of time. Each logic element in the network is
assigned a positive value called the delay of the element. The input and output states of
the gates are time-variant. If $F$ is the Boolean function of a logic gate whose delay is $\delta$,
then the input state $I$ and output state $O$ are related by the equation
    $O(t + \delta) = F(I(t))$
Thus unlike a Boolean network which has a static value that is computed by solving a
set of simultaneous equations a logic circuit can have timedependent behaviors such as
memory and oscillation Simulation is a way of computing the behavior of a logic circuit
[Figure: a logic gate whose output x is fed back to its own input]
Figure: A logic circuit whose behavior is different from its Boolean network
The Boolean network in the figure above can be described by the equation $x = \mathrm{NOT}\ x$,
which does not have a solution. As a logic circuit, however, the network is an oscillator.
Although the input-output relationship of a logic circuit, when it does reach a stable state, is
consistent with the corresponding Boolean network, our definition of a logic-circuit simulator
is one that reproduces the behavior of a logic circuit rather than one that solves for a stable
state. The other definition is used by simulators such as MOSSIM, which simulates
and verifies digital integrated circuits.
Most existing circuits found in computers and other digital systems belong to a class of
circuits called clocked logic circuits. Clocked logic circuits are very well suited for the
stable-state-solving form of simulation, because they are designed to reach a stable state during
each clock cycle and because only the final state of a clock cycle is needed to determine
the future state. The exact sequence and timing of transitions that lead to a stable state
are usually not important; only the final stable state of the circuit is important. Such
simulators, however, will not work very well for unclocked or self-timed logic circuits.
Section  Why Logic Circuits
We study logiccircuit simulation because it stresses a distributed simulator and is itself of
practical interest It is easy to construct examples of logic circuits with diverse behaviors
structural diculties such as large fanin and fanout and highly nonuniform activity levels
Simple logic gates exhibit responses in which an input event may or may not inuence the
outputs depending on the internal state of the element and on the states of other inputs
yet they require very little computation to simulate their behavior Thus the performance
results shown later involve practically no computation other than the distributed simulation
itself They are therefore uncluttered studies of how well the simulator itself performs
A number of related simulators, each supporting an array of different simulation modes,
have been written during the course of this study. These simulators run on multicomputers
such as the Cosmic Cube, Intel iPSC, and Symult. Since they are written to run
under the Cosmic Environment, they can be compiled for all of these machines without
modification. The historical relationship between these simulators is shown in the figure below.
The arrows indicate predecessor-successor relationships.
[Figure: the sequential, CMB-variant, pruned-CMB-variant, coordinated-sequential, and
progressive-hybrid simulators, linked by predecessor-successor arrows]
Figure: A number of circuit simulators and their relationship
Of the ve simulators shown results obtained on three of them  the CMBvariant the
coordinatedsequential and the progressivehybrid simulators  are of interest The se
quential simulator and the prunedCMBvariant are used for comparison only The pruned
CMBvariant simulator will not be discussed
The CMBvariant simulator is a straightforward implementation of the generic simu
lator in which the basic unit of information transfer is a block of state description over a
time interval The CMBvariant simulator shows excellent speedup as the number of nodes
is increased but since it is totally oblivious to the content and eect of its information
carriers much of the work it has to do can be eliminated when an eventdriven system is
simulated on one node using a sequential simulator However sequential simulators can
not be readily distributed and they cannot in their original form benet from the use of
multicomputers
The three succeeding simulators attempt to combine the advantages of sequential and
distributed simulators. The pruned-CMB-variant simulator is a CMB-variant simulator
with sequential simulation mechanisms added. The coordinated-sequential simulator is a
sequential simulator with CMB-variant mechanisms added. The progressive-hybrid
simulator is the final merger of the two. In the following sections, we will describe each of these
simulators in chronological order.
Section  CMB-Variant Simulator
The CMBvariant simulator for logic circuits is a proof of concept for the generic simulator
model described in Chapter  Since this is a demonstration of a generic model in order
to cover the greatest range of possible simulation subjects special but useful properties
of logic circuits have been ignored in building this simulator In particular the simulator
ignores the fact that logic circuits are eventdriven systems We will discuss such systems
in greater detail when we compare the result of this simulator to ones that do make use of
the eventdriven properties
domain of eventdriven simulators
domain of the generic simulators
logic circuit systems
Figure  Domain of the generic simulator model
The tapewriting and reading processes in the generic simulator model are replaced by
messagesending and receiving processes in the CMBvariant simulator These are light
weight reactive processes and the simulator is a reactive kernel for the reactive processes
As in a usual reactiveprocess program the distribution of the simulation task on a multi
computer is accomplished by partitioning the set of reactive processes across a set of reactive
kernels that run on a multicomputer
We will present a simplied description of the CMBvariant simulator the actual im
plementation contains extensive measurement setups and programming shortcuts that are
inappropriate to report here The simulator presented however is functionally correct ex
presses the same principle as does the actual implementation and is easier to understand
 The element simulators
First of all a reactive process is represented by two pointers the entryfunction pointer and
the data pointer The entryfunction pointer always contains the reference to the function
that handles the next message for the process but the data pointer can hold any private
data structures needed by the process For an element simulator the private data may
include one data structure for each of the element's outputs. An output data structure
contains the references to all inputs to which it connects. Each input reference contains the
ID of the element that owns it and the index that identifies the input within the element.
One output structure can contain more than one reference, because an output can connect
to more than one input.
The private data may also include one input data structure for each of the element's
inputs. Each input data structure contains the ID of the process and the identity of the
output to which it connects. Each input can, and must, connect to one output.
[Figure: process A and process B, each with an entry pointer and a data pointer, labeled with
the input structure, output structure, input reference, and output reference that connect them]
Figure: Process structure and a simple example of connectivity
We may need a variable-sized message format to describe a piece of tape recording,
because the information on the tape can be arbitrarily complex. In the interest of simplicity,
however, we choose to represent each tape recording with one or more simple fixed-sized
messages. We will call the structure a STATEFRAGMENT. We use the name fragment to
contrast it with the name event used in the study of traditional event-driven simulation
systems, and to convey the fact that every entity is a fragment of a continuum that can be
merged with adjacent entities and sliced into arbitrarily many entities.
The essential elds of a fragment are shown in Listing  When a fragment is received
by a process the inputid eld identies the element input to receive the fragment The
state and span elds describe the duration of a state at that input
  struct STATEFRAGMENT
 
 int inputid  Index of the input at the dest element	 

 int state  State contained in this fragment	 
 int span  Duration of this fragment	 
 STATEFRAGMENT next  Pointer to make a linked list of fragments	
  
Listing  Structure of a FRAGMENT
When a piece of tape is to be written by an element in the generic simulator model,
the corresponding process in the CMB-variant simulator produces one fragment or a stream
of several fragments to carry the information recorded on the tape. When a fragment has
arrived at its destination, the entry function of the destination process is called to accept
the fragment. It is worth noting that reactive-process programming systems are themselves
event-driven systems whose inputs are fragments. Thus, the simulator is always an
event-driven system, even though the system it simulates may not be.
  inverterentryppsb
 PROCESS pp
 STATEFRAGMENT sb

 
 OUTPUTppsbstatesbspan
 freefragmentsb
 
Listing  An inverter in a CMBvariant simulator
The listing above contains a sample entry function for an inverter element. As in an ordinary
reactive process, the two parameters to its entry function are the process structure and the
input message. When called, the entry function simply outputs another fragment of the
same length but with a complementary state value. The delay of the inverter is equal to
the difference between the amount of fragments produced and the amount of fragments
consumed. Such differences are set up during initialization by producing one fragment for
each output of every gate, such that each fragment has a span that equals the delay of its
output.
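To see why one initial fragment per output yields the gate delay, the inverter's input and output streams can be laid out as arrays. The C sketch below is illustrative only: the array representation, the Fragment type, and the functions inverter_stream and state_at are hypothetical and stand apart from the simulator's message mechanism.

```c
#include <assert.h>

typedef struct {
    int state;  /* signal value held during the fragment */
    int span;   /* duration of the fragment */
} Fragment;

/*
 * Build the inverter's output stream: one initial fragment whose span is
 * the gate delay, followed by one complemented fragment per input fragment.
 * `out` must have room for n_in + 1 fragments. Returns the output count.
 */
int inverter_stream(const Fragment *in, int n_in, int delay,
                    int initial_state, Fragment *out)
{
    int k;

    assert(delay > 0);
    out[0].state = initial_state;   /* produced during initialization */
    out[0].span  = delay;
    for (k = 0; k < n_in; k++) {
        out[k + 1].state = !in[k].state;  /* complementary state */
        out[k + 1].span  = in[k].span;    /* same length */
    }
    return n_in + 1;
}

/* State of a fragment stream at time t, or -1 past the end of the stream. */
int state_at(const Fragment *s, int n, int t)
{
    int k;
    int start = 0;

    for (k = 0; k < n; k++) {
        if (t < start + s[k].span)
            return s[k].state;
        start += s[k].span;
    }
    return -1;
}
```

Because the output stream is always ahead of the input stream by exactly the span of the initial fragment, the output at time t is the complement of the input at time t minus the delay, for every t at or beyond the delay.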
The OUTPUT function takes four parameters. The first two parameters are the process
structure and an index that identifies an output of the element. The function needs these
two parameters in order to access the list of destination input references for the output
fragments. The next two parameters describe the state and the span of the fragment. In
this example, there is only one output for the inverter, and its output index is 0. The state
of the new fragment is the complement of the state contained in sb->state, and the length
of the fragment is the same as sb->span.
Since an inverter has only one input, it does not have to check the inputid of the
fragments it receives, and it can immediately process any fragments it receives without
waiting for other fragments to arrive. A gate with more than one input, however,
usually has to differentiate the fragments it receives. The listing below contains a sample entry
function for a two-input XOR-gate.
  xorentryppsb
 PROCESS pp
	 STATEFRAGMENT sb

 
 int outspan outstate
 QUEUEFRAGMENTppsb
 whileQEMPTYpp  QEMPTYpp 

   outstate   QHEADppstate  QHEADpp state 
  outspan  MIN QHEADppspan  QHEADpp span 
 
 OUTPUTppoutstateoutspan
  TRIMQUEUEppoutspan
  TRIMQUEUEpp outspan
  
  
Listing  An XORgate in a CMBvariant simulator
In a twoinput XORgate both of the inputs must have at least one fragment present
before the gate can produce output fragments The gate must therefore maintain a fragment
queue for each of its input structures When a fragment is received the entry function can
check the queues before deciding whether the fragment needs to be queued
 but in the
interest of simplicity the function always queues the fragment  The QUEUEFRAGMENT
function puts the fragment sb into an input queue of pp according to sbinputid
The QEMPTY function returns TRUE if the specied input queue for the process pp is
empty While both queues are nonempty  a length of fragment is removed from each
queue to produce an output fragment The state of the output fragment is equal to the
exclusiveor of the states of the fragments to be removed 		 The length of the output
fragment and of each fragment to be removed equals the length of the shorter fragment
at the head of the queues 	
 The QHEAD function returns a pointer to the rst fragment
in the specied queue
The output of the exclusiveor gate remains the same as long as both inputs remain
unchanged The length of the shorter fragment is the length of time both inputs are known
to remain unchanged When fragments are consumed output is produced 	 and a length
equal to the length of the output fragment is trimmed from both queues 		
The loop repeats until one of the queues becomes empty and the gate can no longer
produce any additional output fragments from its queues. The inverter and the XOR-gate are
simple because they are both strict, i.e., they do not have any partial input-state assignment
such that the state of the outputs is not influenced by the state assignment of the remaining
inputs.
An ORgate on the other hand is nonstrict If any of the inputs is 	 its output will be
	 regardless of the state of its other inputs An ORgate can therefore continue to produce
fragments in some situations where not all of its inputs are available Listing  contains
a sample entry function for an ORgate
  orentryppsb
 PROCESS pp
	 STATEFRAGMENT sb

 
 int outspan outstate
 QUEUEFRAGMENTppsb
 while 
  
   ifQEMPTYpp  QHEADppstate  TRUE
  
 	 outstate  TRUE
 
 outspan  QHEADppspan
  March  
   Chapter  LogicCircuit Simulator Experiments
   else
  ifQEMPTYpp 	 

 QHEADpp 	state  TRUE		
  
 outstate  TRUE
  outspan  QHEADpp 	span
  else
 ifQEMPTYpp	 

 QEMPTYpp 		
 
 outstate   QHEADpp	state  QHEADpp 	state 	
 outspan  MIN QHEADpp	span  QHEADpp 	span 	
  else break
 TRIMQUEUEppoutspan	
 TRIMQUEUEpp outspan	
 OUTPUTppoutstateoutspan	
 
 
Listing  An ORgate in a CMBvariant simulator
When the process receives a fragment, it is added to the queue, as in the case of the
XOR-gate. But then, instead of checking both of the queues for fragments, the function
checks first for possible nonstrict input conditions. The first test checks the input whose index
is 0; the second test checks the input whose index is 1. If a fragment for an input is available
and its state is TRUE, then a nonstrict input condition exists. The new output fragment
is specified to have a state value of TRUE and a span equal to the span of the fragment in
the queue. The function then continues to the end of the loop body, where fragments are
trimmed from the queues and an output fragment is produced. If no nonstrict conditions have
been detected, the process will compute and produce fragments in the same manner as the
XOR process.
When a nonstrict condition is detected on one input the queues in both of the inputs
are trimmed  because the state of the other input does not matter However it is
possible that the queue for the other input is empty or does not contain enough fragments
to cover the amount to be trimmed In this case the trimming extends to fragments that
have not yet arrived The process must therefore record the decit incurred and deduct it
from fragments that arrive later
    typedef struct { STATEFRAGMENT *qh;       /* Points to top           */
                     STATEFRAGMENT *qt;       /* Points to bottom        */
                     int            deficit;  /* Deficit of the queue    */
                   } IDATA;
    typedef struct { int    delay;            /* Delay of the element    */
                     IDATA *inpq;             /* One per gate input      */
                     ODATA *outq;             /* One per gate output     */
                   } ELEMENT;                 /* ODATA is declared with the OUTPUT function */
The details for the process are complete; we are ready to show the essential mechanisms
that support the processes. The process structure contains an entry function, an array of
input data structures (one for each element input), and an array of output data structures
(one for each element output). These data structures are set up during initialization. The
input structure contains the deficit count and a pair of queue pointers, one for the head of
the queue and one for the tail.
  QUEUEFRAGMENTppsb
 PROCESS pp
	 STATEFRAGMENT sb

 IDATA Q
 Q  ELEMENT ppdatainpq  sbinputid
 ifQdeficit

   ifsbspan  Qdeficit  Qdeficit  sbspan 
  freefragmentsb return 

 	 else  sbspan  Qdeficit
  Qdeficit   



  ifQqh

  ifsbstate  Qqtstate  Qqtspan  sbspan 
 freefragmentsb return 

  else  Qqt  Qqtnext  qt
 qtnext   


 else

 Qqh  Qqt  sb
 sbnext  




Listing  CMBvariant QUEUEFRAGMENT function
The QUEUEFRAGMENT function adds the fragment sb to the sbinputidth input
queue of the process pp It checks rst for the decit  If a decit exists the span of
the fragment is used to satisfy the decit if the fragment is totally consumed 		
	 the
function returns Otherwise the balance is advanced to the next step where fragments are
added to the queue  If there are already other fragments in the queue  and if
the last fragment has the same state as the new fragment  the two are simply merged
	
 Otherwise the fragment is linked into the queue 
	

 
	

  TRIMFRAGMENTppiddebit
 PROCESS pp
	 int id

 int debit
 
 IDATA Q
 STATEFRAGMENT sb
 Q  ELEMENT ppdatainpq  id
   whiledebit  Qqh

 	 ifQqhspan  debit  Qqhspan  debit
 
 debit   
  else  debit  Qqhspan
  sb  Qqh 
  Qqh  sbnext 
  freefragmentsb  

  Qdeficit  debit
 
Listing  CMBvariant TRIMFRAGMENT function
The TRIMFRAGMENT function removes debit amount of fragments from the idth input
queue of the process pp As long as there are more fragments in the queue the spans of
as many fragments as necessary taken from the head of the queue are used to satisfy the
debit Any remaining debit is added to the decit of the queue
  The simulator message system
The list of references and indices for each output structure described above represents a
one-level tree. The root of the tree is the sending process, and the leaves of the tree are
the receiving processes. The job of the OUTPUT function is simple enough: it allocates a
fragment for each leaf process and sends it along the branch that leads to the process. In
such a simulator, however, gates with a large fan-out, such as a clock driver, may have to
send the same information to the same destination computing node many times.
Because messages between computing nodes are usually more expensive than messages
within the same computing node, we reduce the internode messages by organizing the tree
as a two-level tree. The intermediate tree nodes are a set of input-port processes, one for
each computing node that contains a destination process. An output sends its fragment to
its input ports, and an input port duplicates and forwards the fragment to the destination
processes in its own computing node.
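The saving can be quantified with a small C sketch that counts the internode messages needed to deliver one fragment under each organization. The functions one_level_messages and two_level_messages are hypothetical helpers for illustration; they assume that the placement of each destination input on a computing node is given as an array of node numbers.

```c
#include <assert.h>

/* One-level tree: one internode message per destination input that
 * resides on a node other than the sender's. */
int one_level_messages(const int *dest_node, int n, int src_node)
{
    int k;
    int count = 0;

    assert(n >= 0);
    for (k = 0; k < n; k++)
        if (dest_node[k] != src_node)
            count++;
    return count;
}

/* Two-level tree: one internode message per distinct remote node; the
 * input port on that node forwards the fragment locally. */
int two_level_messages(const int *dest_node, int n, int src_node)
{
    int k, j;
    int count = 0;

    assert(n >= 0);
    for (k = 0; k < n; k++) {
        int seen = 0;

        if (dest_node[k] == src_node)
            continue;                      /* local delivery */
        for (j = 0; j < k; j++) {
            if (dest_node[j] == dest_node[k]) {
                seen = 1;                  /* node already has an input port */
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

For a gate on node 0 whose six destinations reside on nodes 1, 1, 2, 0, 2, and 1, the one-level tree sends five internode messages per fragment, whereas the two-level tree sends two.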
[Figure: a circuit mapped onto computing nodes, with labels for an input port, an output port,
a node, and a node in the tree]
Figure: A sample circuit and a possible mapping to a multicomputer
Many mechanisms can be added to the output structure for a more efficient
simulator, and such mechanisms account for the majority of the differences between the
actual implementation and this description. Here, we will present a simple OUTPUT function
that converts fragments into messages that are immediately sent.
    typedef struct { int  count;      /* Number of siblings    */
                     int *node;       /* Dest process's node   */
                     int *pid;        /* Dest process's pid    */
                     int *inputid;    /* Dest process's input  */
                   } ODATA;
The output data structure contains the number of ports connected and a list of
references to those ports. A reference for a process in the simulator contains the node and
the pid of the destination simulator process. It also contains a second pid, that of the element
process, because the element processes are embedded in the simulator by reactive-handler
layering. Only the node and the element pid need to be stored in the output structure,
because in our implementation there is only one simulator process for every node, and all of
them have the same fixed pid. The listing below contains a sample OUTPUT function.

  OUTPUTppidstatespan
 PROCESS pp
 int id
	 int state

 int span
 
 int j
 ODATA op
 STATEFRAGMENT sb
   op  ELEMENT ppdataoutq  id
  forj   j  opcount j
 	 
 
 sb  newfragment 
  sbinputid  opinputidj 
  sbstate  state 
  sbspan  span 
  ssendmsgopnodejoppidj
 
  
Listing  CMBvariant OUTPUT function
The OUTPUT function allocates a fragment for each branch of the tree, initializes
it with the input index of the destination input, sets the state and span, and
sends the fragment. The ssend function is a layered message function that sends
the message to another process in the simulator. If a two-level tree structure is used, each
fragment goes to an input-port process that is identical to the inverter process except that
the state is not inverted (a buffer process). The main function for the simulator is identical
to that of a reactive kernel.
    typedef struct { int  (*entry)();     /* Handles the next message  */
                     char *data;          /* Private data              */
                   } PROCESS;
    typedef struct { int  pid;            /* Destination process       */
                     char msgbody[1];     /* Start of the message body */
                   } MESSAGE;

    simulatormainloop()
    {
        PROCESS *proc;
        MESSAGE *mesg;

        while (1)
        {
            mesg = (MESSAGE *) xrecvb();          /* Blocking receive */
            proc = processtable + mesg->pid;
            (*proc->entry)(proc, mesg->msgbody);
        }
    }
Listing: CMB-variant main loop
This is the end of our description of a simple distributed simulator derived directly
from the generic simulator model. The description is complete except for the storage
allocation/deallocation mechanisms, the initialization/termination mechanisms, and the
result-recording mechanisms.
  The variants
Although this simulator exhibits excellent performance for some cases, much can be done
to improve its performance for difficult cases. The number of actual messages, for example,
can be reduced in a logic-circuit simulation by using a more elaborate OUTPUT function. In
particular, if message sending is deferred by putting fragments into output-holding queues,
the opportunity to merge multiple fragments into a single message increases. When two
successive fragments with the same state are put into the same holding queue, the two can
be merged into a fragment with a larger span, saving both space and handling time. Even
if they cannot merge, multiple fragments can be concatenated onto a single, longer message
to share the per-message overhead.
If sending is deferred forever, however, the simulator will fail to make any progress.
Good efficiency can be achieved with a proper balance of message deferral and message
sending. Before we devised and evaluated a number of flow-control methods, there were
two methods that represented the two extremes of possibilities: the two original CMB
methods. (Hence, our methods are called variants.) In the deadlock-avoidance method, no
fragments are deferred, and deadlock does not occur. In the deadlock-detection method, no
message is sent until the simulation runs into a deadlock, or unless the output-holding queue
contains an event. A deadlock-detection mechanism running concurrently in the simulator
message system detects the deadlock and forces deferred messages to be sent.
We generally call those methods that are more likely to send messages eager methods,
and those that are less likely to send messages lazy methods. Thus, the deadlock-avoidance
method is at the eager end of the spectrum, and the deadlock-detection method is at the lazy
end. To explore the middle ground, we needed to hold back messages by some criteria we
could select but in order to prevent deadlock detection from dominating the timing we
needed a cheaper way of ensuring progress than by using standard deadlock detection
When simulator processes defer sending output messages they may cyclically deny
themselves input messages leading to deadlock However deadlock implies that some node
has an empty inputmessage queue Since the emptiness of the queue is a local condition
we make use of that condition to modify the behavior of the simulator to prevent deadlock
Our strategy is called indenitelazy message sending and is implemented by replacing the
xrecvb function in the simulators main loop with a nonblocking xrecv
  simulatormainloop
 
 PROCESS proc	

 MESSAGE mesg	
 while 
 
 ifmesg  MESSAGE  xrecv
 
  proc  processtable  mesgpid	
   procentryproc mesgmsgbody	
   else
  
 
 takeactiontopromoteprogress	
  
  
  
Listing  CMBvariant indenitelylazy main loop
The function xrecv returns a message for an element simulator if the node's input-
message queue is not empty. The simulator goes on to deliver the message as before if a
message is returned. While an element simulator is consuming a message, it may either send
or withhold any output that the element simulator produces, according to the heuristics in
effect at the time.
If the node's input-message queue is empty, a null pointer is returned, and deadlock is
a possibility. The simulator will take special actions to break potential deadlocks. Actions
can generally be classified into two types. For the source-driven type, the simulator selects
a deferred output to send as a message; for the demand-driven type, the simulator selects
a blocked element and sends a demand message to its predecessor to request that queued
outputs be sent. The end result is that deadlock is prevented.
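A source-driven action needs a rule for choosing which deferred output to send. One plausible rule, used in several of the variants described in this chapter, is to pick the output queue that extends to the earliest time. The sketch below assumes a per-queue summary record; the OUT_QUEUE type and the name select_queue_to_send are hypothetical.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical summary of one output-holding queue (names are assumptions). */
typedef struct {
    int pending;    /* number of detained fragments */
    int end_time;   /* simulated time up to which the detained fragments extend */
} OUT_QUEUE;

/* Among the non-empty queues, return the index of the one that extends to
   the earliest time, or -1 if nothing is detained. */
int select_queue_to_send(const OUT_QUEUE *q, int n)
{
    int i, best = -1;
    assert(q != NULL || n == 0);
    for (i = 0; i < n; i++) {
        if (q[i].pending == 0)
            continue;               /* nothing to send from this queue */
        if (best < 0 || q[i].end_time < q[best].end_time)
            best = i;
    }
    return best;
}
```

A linear scan is used only to keep the sketch short; a real node with many queues would presumably keep them ordered by time.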
  Variant algorithms
We have experimented with many CMB variants. Since many of them are closely related,
and all of them show similar performance results, we will describe the operation and report
the performance of just six variants, A through F, that are representative of the range of
possibilities that we have studied.
A. Eager message sending. This is the deadlock-avoidance CMB simulator.
B. Eager event. Since successive fragments with the same state value can be merged into
one fragment, the eager-event variant detains all output fragments until a fragment
that cannot be merged with its predecessor is produced. When xrecv returns a null
pointer, the detained fragment that extends to the earliest time is sent. This is called
an eager-event variant because state changes are called events in event-driven systems,
and because this simulator will eagerly send event-conveying fragments.
C. Indefinite-lazy single-dispensation. All output fragments produced by element
simulators are queued. Messages are sent only when xrecv returns a null pointer. The output
queue that extends to the earliest time is selected, and one fragment from that queue
is sent.
D. Indefinite-lazy multiple-event. This scheme is a variation on C, motivated by
characteristics of multicomputer message systems that make it economical to pack multiple
events into fewer messages. All output fragments produced by element simulators are
queued. When xrecv returns a null pointer, the output queue that extends to the
earliest time is selected to generate a message using all of the fragments in that queue,
instead of just one.
E. Demand-driven. Although we usually think of simulation as source-driven from inputs,
one can equally well organize the simulation as demand-driven from outputs. In the pure
demanddriven form all output fragments produced by element simulators are queued
When xrecv returns a null pointer the input port that lags furthest behind is picked
to select the destination for a demand message Upon receipt of a demand message if
the output queue is not empty the simulator sends all fragments in the output queue
if the output queue is empty the simulator propagates the demand message For the
demanddriven variant the message header must also carry a type eld to distinguish
a normal message from a demand message
  struct  int pid 
 int type 
 char msgbody 	 MESSAGE

 simulatormainloop
 
 PROCESS proc
 MESSAGE mesg
  while 
   
  ifmesg  MESSAGE  xrecv
  
  ifmesgtype  DEMANDTYPE
 
 
  handledemandmessagemesgmsgbody
  	 else
  
  proc  processtable  mesgpid
 procentryproc mesgmsgbody
  	
 	 else
 
 takeactiontopromoteprogress

 	
 	
 	
Listing  CMBvariant demanddriven main loop
F Demanddriven adaptive Demand messages single out critical paths in a simulation
In an adaptive form of demanddriven simulation a threshold is associated with each
communication path Outputs of element simulators are queued only up to the thresh
old when the threshold is exceeded the contents of the queue are sent as a message
Demand messages operate as in E but also cause the threshold to be decreased for
processes that get them In the examples that we show the threshold is halved The
simulator is accordingly able to adapt itself to the characteristics of the system being
simulated.
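The threshold rule of variant F can be sketched as follows. The PATH record and the names path_enqueue and path_demand are assumptions made for illustration, and the sketch counts fragments rather than storing them.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical per-path bookkeeping; names are assumptions. */
typedef struct {
    int queued;     /* fragments currently detained on this path */
    int threshold;  /* queue length beyond which the queue is sent */
} PATH;

/* Queue one output fragment.  Returns the number of fragments to send as
   one message: 0 while the threshold is not exceeded. */
int path_enqueue(PATH *p)
{
    int n;
    assert(p != NULL);
    p->queued++;
    if (p->queued <= p->threshold)
        return 0;
    n = p->queued;                  /* threshold exceeded: send everything */
    p->queued = 0;
    return n;
}

/* Handle a demand message: release whatever is queued and halve the
   threshold.  A return value of 0 means the queue was empty, in which case
   the demand would be propagated to the predecessor, as in variant E. */
int path_demand(PATH *p)
{
    int n;
    assert(p != NULL);
    n = p->queued;
    p->queued = 0;
    p->threshold /= 2;
    return n;
}
```

With integer halving, repeated demands drive the threshold to zero, after which every output is sent at once; on such a path the sketch behaves like the eager variant A.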
  Instrumentation
Although execution time is one of the most natural bases of comparison between any two
programs that perform the same function, and although it is used below to illustrate the
performance of our distributed simulators on different commercial multicomputers, execution
time on these concurrent computers depends both on the algorithm and on the
characteristics of the particular computer. When we wish to isolate the characteristics of the
algorithm from those of the computer, we run our simulator programs under the control of
a multicomputer simulator (sweep mode). A close examination of the main routine of the
simulator reveals that it can be transformed with minimal modification into a lightweight
reactive-process program under yet another layer of the reactive kernel:
  SIMDATA simulatordata
 simulatormainloopsimpmesg
	 PROCESS simp

 MESSAGE mesg
 
 PROCESS proc
 simulatordata  SIMDATA simpdata
   ifmesg
  
  ifmesgtype  DEMANDTYPE
 	 
 
 handledemandmessagemesgmsgbody
   else
  
  proc  simulatordataprocesstable  mesgpid
  procentryproc mesgmsgbody
 
   else
 
 takeactiontopromoteprogress
	 

 
Listing  CMBvariant main loop as a lightweight process
Figure: Structure of a sweep-mode simulation (labels: reactive kernel, multicomputer
simulator, CMB-variant simulator S, element simulator e)
The process structure in this reactive kernel is described by the SIMDATA structure in
the above listing. The structure contains a list of element-simulator processes and any other
data structures private to this instance of the simulator.
Sweep-mode simulation for an N-node multicomputer is accomplished with a reactive
kernel that runs N copies of the simulators as reactive processes. Execution time is then
measured in a unit called a sweep, which corresponds here to a fixed time required
to call an element once. The time required for other operations, such as sending a message,
can be set to a particular number of sweeps. Normally, a message sent by one node in one
sweep is available in the destination node at the next sweep. However, to test the sensitivity
of the algorithms to message latency, we can also set the latency to larger values.
In the real-mode simulation, the simulator is linked with a reactive heavyweight handler
and run directly on the multicomputer. There is one copy of the simulator process in
each node, and each simulator process runs a subset of the elements as embedded reactive
processes. Each node runs at its own pace, and execution time is measured with the host
computer's real-time clock.
   Experimental results
Performance measurements have been made on a variety of logic networks, including those
that are representative of networks found in computers and VLSI chips, and those that
are designed specifically to test or to stress the simulator. Six different network types,
each in several sizes up to    logic gates, have been the principal vehicles for these
experiments. The majority of the logic gates have delays of between    and    ns, with    ns
being a typical value. Each simulation was run for a predetermined simulated interval,
and a set of measurements, including the real elapsed time, was recorded. A larger variation
in performance was observed among networks with different characteristics than between
algorithm variants.
Figure: Structure of a real-mode simulation (each computing node on the multicomputer
network runs a reactive kernel, RK, with one simulator S and its elements e)
The parallel multiplier is a good example of an ordinary logic network. The   
array multiplier used in several experiments employs    logic gates to generate the   -bit
product of two   -bit binary inputs. The multiplier network contains only limited
concurrency, and does not contain tight circuits that give the simulator artificial performance
advantages or troubles that depend on element distribution. It also contains moderately
high fanout in the multiplier and multiplicand lines; this puts pressure on the message
system. In all fairness, the distributed simulation of this multiplier network is expected to
do neither too badly nor too well.
For the simulation, the most significant bit of the product is connected back to the
multiplier input via an inverting delay. The delay is such that the multiplier reaches a
stable state before the multiplier input changes. The multiplicand input is set to a value
that causes the circuit to oscillate. The resulting activity level is quite low. The entire
circuit is idle    of the time. For the other    of the time, there is a wavefront of activity
moving diagonally down the array. After the wavefront hits the bottom-left corner, the
multiplier input changes and broadcasts the change to    gates. A trace of the product
outputs shows that the simulator and the circuit are running correctly.
Figure: Three phases of the oscillating multiplier (wavefront, broadcast, idle)
The plot in Figure    portrays, in a log-log format, the sweep count in the sweep mode
versus the number of nodes, N, for the simulation of the    multiplier network under
all six CMB variants.
It is not useful to continue the plot beyond    nodes, since at this point there are as
many nodes as simulated gates. Each horizontal division represents a factor of two in the
number of nodes used; each vertical division represents a factor of two in sweep count or
time. The placement of elements in nodes for these trials is a systematic pattern that tends
to put related elements into the same node.
The first remarkable characteristic of these performance measurements is that they are
so similar across this class of variant algorithms. Algorithms A, E, and F produce more
messages than B, C, and D; but in the sweep mode, in which messages are free but element
invocations are expensive, there is little difference between the variants. The performance
under sweep-mode execution exposes the intrinsic characteristics of the algorithm, and is not
related to such multicomputer characteristics as the relationship between node computing
time and message latency.
Figure: A   -gate multiplier, sweep mode (log sweeps versus log nodes; curves for
variants A through F and a horizontal line for the sequential simulator)
The performance is divided roughly into two regimes, the first regime being one of near-
linear speedup in N for the first    octaves, and the second regime being one of diminishing
returns in N as the computing time approaches an asymptotic minimum value. In the
linear-speedup regime, these simulators nearly halve the sweep count with each doubling of
resources until limiting effects are reached. Load balance is assured by the weak law of large
numbers when there are many elements per node. While each node has a sufficiently large
pool of work, node utilization remains high. The simulators approach asymptotic minimal
time as they exhaust the available concurrency in the system being simulated. The gradual
knee of the curve originates from progressively less-effective statistical load balancing as
the number of elements per node diminishes with larger N. The gross characteristics of these
curves are similar to those of other concurrent programs, and are quite understandable
and predictable.
Like many other concurrent algorithms, a more efficient sequential algorithm exists for
the CMB-variant simulator when applied to circuit simulation. The heavy horizontal line
represents the number of sweeps a sequential event-driven simulator requires for this same
simulation. We observe at the single-node end of the plot (N = 1) that all of the CMB
variants are somewhat inefficient in comparison with the sequential event-driven simulator.
We shall refer to this extra work that the CMB-variant simulator does as the overhead of
distributing the simulation. We will discuss the sequential event-driven simulator and
additional performance measurements in the next and subsequent sections.
Section  Sequential Simulator
At N = 1, the sequential simulator does better than do the CMB-variant simulators, for two
reasons. The first is that logic circuits are event-driven systems in which the time it takes
for a sequential simulator to handle and process a fragment is zero if the fragment does
not convey an event. (A fragment conveys an event if its state differs from that of the
fragment that precedes it. A message that carries an event-conveying fragment is an event
message; a message that does not is a null message.) The second is that logic gates are
simple, and the time it takes for an element simulator to process an event-conveying
fragment is almost zero.
Since the message-handling times for null messages and event messages are identical in
the CMB-variant simulator, the ratio at N = 1 (N is the number of nodes used) between the
time taken by the sequential and the CMB-variant circuit simulators reflects the proportion
of event messages in a CMB-variant circuit simulator. The cost of handling null messages
is the overhead of the CMB-variant simulator at N = 1.
 Sequential simulator mechanism
Like the CMBvariant simulator our sequential simulator is also a reactiveprocess program
with embedded lightweight reactive processes Each message in this simulator called an
event describes a state transition and includes the following elds
  typedef struct EVENT
  {
    int input_id;   /* Index of the input at the dest. element */
    int time;       /* Time of the transition */
  } EVENT;
Listing: Sequential-simulator event structure
The time eld of an event represents the time when a state change will occur at the
input 	identied by the value of the inputid eld of the process that receives the event
The function contained in Listing  can be used as an entry function for an inverter gate
  inverterentryppep
 PROCESS pp
 EVENT ep

 
  March  
   Chapter  LogicCircuit Simulator Experiments
  SENDEVENTpp  eptime	

 freeeventep	
 
Listing  An inverter in sequential simulator
When the simulator delivers an event to the inverter, the inverter will generate an
output event with an event time that is larger by the delay of the element. The SEND_EVENT
function takes three parameters. Like the OUTPUT function of the CMB-variant simulator,
the first two parameters are the process structure and the index that identifies an output of
the element; the third parameter is a time value whose sum with the element delay becomes
the time of the output event. Listing    contains a simple output routine for the sequential
simulator:
 SENDEVENTppidtime
 PROCESS pp	
 int id	
 int time	
  

 EVENT ep	
 ODATA op	
 int ot	
 op  ELEMENT  ppdataoutq  id	
 ot  ELEMENT  ppdatadelay  time	
 forj  	 j  opcount	 j
 
  ep  newevent 	

 epinputid  opinputidj	
 eptime  ot 	
 ADDEVENTepoppidj	
 
 
Listing  The SENDEVENT function in sequential simulator
The routine allocates an event structure for every input connected, fills in the
receiver input index, fills in the time of the event, and inserts the event into the
event list. This routine is structurally similar to the OUTPUT routine of the CMB-variant
simulator, except that node numbers are not used to identify processes because all processes
reside in the same node. In order to reduce the number of events that must be sorted when
more than one input is connected, output-event duplication in the actual implementation
is performed at the time of event delivery.
It is interesting that the entry function for an XOR-gate is identical to that of an inverter.
Listing    contains the more complex OR-gate entry function:
  orppep PROCESS pp EVENT ep
	 

 ifepinputid 
 ppentry  or  SENDEVENTppeptime 
 else 
 ppentry  or  SENDEVENTppeptime 
 freeeventep
 
 or ppep PROCESS pp EVENT ep
 

  ifepinputid 
 ppentry  or SENDEVENTppeptime 
   else 
 ppentry  or   
 	 freeeventep
  
  or ppep PROCESS pp EVENT ep
  

  ifepinputid 
 ppentry  or   
  else 
 ppentry  or SENDEVENTppeptime 
  freeeventep
	 
		 or  ppep PROCESS pp EVENT ep
	 

	 ifepinputid 
 ppentry  or  
	 else 
 ppentry  or  
	 freeeventep
	 
Listing  An ORgate in sequential simulator
When both gate inputs are 0, the entry function is or00. When an event is received,
the event is distinguished by the input it affects. If the event is for the input whose index
is 0, the entry-function pointer is set to or01 and an output event is produced. If
the event is for the other input, the entry function is set to or10 and an output is also
produced. The actions for the other three entry functions are similar.
An element can compute its output state based only on a transition from one of its
inputs, because the transition carries the assurance that the other inputs of the element
have not changed. Such assurance can be provided in several ways. The most common
method is to keep the set of yet-to-be-delivered events (the pending events) sorted by time
in an event list, and to deliver the event with the smallest time value first. Since element
delays cannot be negative, an event cannot trigger events with smaller time values. When
an event is delivered to an element, it is assured that the other inputs of the element, and
indeed of all other elements, will remain unchanged up to the time of the event.
Figure: A circuit containing a static hazard condition (output: glitch or no glitch)
  struct  int pid 
 char msgbody  MESSAGE
	 simulatormainloop
simpmesg
 PROCESS simp
 MESSAGE mesg
 
 PROCESS proc
  proc  
SIMDATA 
simpdataprocesstable  mesgpid
   
procentry
proc mesgmsgbody
  
Listing  Sequentialsimulator main loop as a lightweight process
The simulator main loop is similar to that of the CMB-variant simulator; the message
system, however, has a different property. The message system for the CMB-variant
simulator dispenses messages on a first-come, first-served basis; for the sequential simulator, the
message with the smallest time value is dispensed first.
  Hazards in sequential simulators
Although a sequential simulator will always produce a valid simulation result, it may not
always produce the same result as the CMB-variant simulator. Some input conditions in a
logic circuit may trigger more than one possible outcome, and a sequential simulator has
no consistent way of choosing one. For example, the OR-gate in Figure    can produce
either no transitions or two transitions in response to two simultaneous input events. This
condition corresponds to a static hazard in the terminology of Boolean minimization.
Both of these responses are correct, because the temporal relation between the two
input events is beyond the capability of the model to resolve; the one that is produced
depends on the order in which the two input events are consumed. Since both input events
have the same time value, they can be taken from the list in either order. If the low-going
transition is taken first, two output transitions will be produced; if the high-going transition
is taken first, no output transitions will be produced. The CMB-variant simulator, however,
consistently picks the response in which no output transitions are produced.
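The dependence on consumption order can be checked with a small stand-alone sketch. It models only a two-input OR gate and ignores time and delay; the function name or_output_transitions is an assumption made for this illustration.

```c
#include <assert.h>

/* Apply input toggles to a two-input OR gate in the given order and count
   how many times the output changes.  in0 and in1 are the initial input
   states; order[k] is the index (0 or 1) of the k-th input to toggle. */
int or_output_transitions(int in0, int in1, const int *order, int n)
{
    int in[2], out, k, count = 0;
    in[0] = (in0 != 0);
    in[1] = (in1 != 0);
    out = in[0] | in[1];
    for (k = 0; k < n; k++) {
        assert(order[k] == 0 || order[k] == 1);
        in[order[k]] = !in[order[k]];
        if ((in[0] | in[1]) != out) {   /* output changes: one transition */
            out = !out;
            count++;
        }
    }
    return count;
}
```

Starting from inputs (1, 0), toggling input 0 first (the low-going transition) yields two output transitions, while toggling input 1 first (the high-going transition) yields none, which matches the two outcomes described above.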
Although both responses are considered to be correct, the sequential simulator can
compare unfavorably with the CMB-variant simulator when there are too many extra events.
For the comparison to be meaningful, we must devise a sequential simulator that will
consistently make the same choices as does the CMB-variant simulator. In a system in which
every element has a nonzero delay, this can be accomplished by withdrawing the first of
the two output events when the second output event is to be produced, and canceling both
events. Each output data structure must maintain a reference to the last unconsumed event
that it has produced. When another output event is to be produced, if the previous event
has not been consumed, and if the two events have the same time value, then no events
are produced and the previous event is withdrawn. The following SEND_EVENT function
implements this mechanism:
  SENDEVENTppidtime
 PROCESS pp
	 int id

 int time
 
 EVENT ep
 ODATA op
 int ot
  op  ELEMENT  ppdataoutq  id 
   ot  ELEMENT  ppdatadelay  time
 	 forj   j  opcount j
 
 
  ifoplastej  oplastejtime  ot
  
  DELEVENToplastej
  oplastej  
  else
  March  
    Chapter  LogicCircuit Simulator Experiments
  
   ep  newevent 
 	 ep
inputid  op
inputidj
  ep
time  ot 
  op
lastej  ep 
  ADDEVENTepop
pid j
  
  
	 
Listing  A SENDEVENT function that reduces glitches
Missing from Listing    is the part that places a back-reference pointer into each
event structure. The back-reference is used by the simulator to dissociate an event from its
output by setting the corresponding last_e[j] to 0 when the event is delivered.
  Instrumentation
The sequential simulator also exists in two modes: sweep mode and real mode. Like the
CMB variants, the sweep-mode simulator consumes one sweep for every element input
delivery. In the real mode, the CMB-variant simulator must poll the system's input message
queue once for every null message or event message delivered; the sequential simulator is
also made to poll the same queue once for every event message delivered, even though this
is never necessary. Polling for messages consumes a significant amount of time in many
multicomputers, but there is nothing inherently costly about the operation. It should be
possible in a future machine to poll the queue by checking only a single predefined memory
location that has been mapped into each process's memory space.
The resulting realmode simulator runs at a speed of about  s per event for our
examples on the iPSC and the Symult 
 and at about  s per event on our
iPSC The polling time is about  s for the Symult  and  s for the iPSC
The iPSC multicomputers were running Cosmic Environment in compatibility mode instead
of in the potentially more ecient native mode The exact speed depends on the size of
the event list The event list is implemented with a tree structure called the leftist tree
 This data structure shows Ologn	 timing characteristics for insertion and deletion
operations in even the most highly unbalanced cases
 but it does not provide an easy way to
traverse the tree in a sorted order. The leftist tree is an excellent choice for the simulators
because tree traversal is not needed in a simulator.
  Big multiplier results
The sweepmode simulation results shown in section  indicate a   overhead when
N 	 
 the realmode results generally show a   overhead This is not unexpected
because the time required in the sweep mode to deliver a message to an element is assumed
to be the same in all simulators in reality the CMBvariant simulator has to do more work
per message than does the sequential simulator
We cannot at this moment reproduce the same sweepmode performance comparisons
using real multicomputers because we do not have access to any multicomputers with K
nodes We do however have access to an assortment of multicomputers of various sizes and
vintages that we can use to explore various regions of the result graph Figure 

 contains
the timing result for a simulation of the 
gate array multiplier from section  The
simulation is run for a duration of  s in simulated time under a 
node iPSC
Figure: A   -gate multiplier for    s on an iPSC   (log seconds versus log nodes; curves
for variants A through F and the sequential simulator)
Aside from a larger overhead, the real-mode curves generally reflect the upper third
of the sweep-mode curves. One consistent characteristic for this and other simulations is
a relatively low overhead for the variant F results at N = 1. Variants A and F share the
property that messages can be sent eagerly, while message sending in the other variants
must wait until a null pointer is returned by a call to xrecv, even if the messages are
to be sent from a simulator process to itself. Variant F has a lower overhead than variant
A because it makes eager only those elements on critical paths, thus allowing messages on
noncritical paths to merge. As the simulation becomes more distributed, however, more
elements become part of a critical path, and the advantage of variant F disappears.
When N   , variants A, E, and F fail as more of the eagerly sent demand and null
messages become internode messages and overload the buffering capacity of the message
system. The other variants are able to continue because many messages are eliminated by
being detained and merged with other messages.
Figure: A   -gate multiplier for    s on an iPSC   (log seconds versus log nodes; curves
for variants B, C, D, and E and the sequential simulator)
Figure  contains the result of the same simulation on a node iPSC Due to an
excess of null messages variant A and F fail for all N  due to a lack of memory none of the
variants will run when N   nor will the sequential simulator run at N   Our iPSC
has only onehalf megabyte of memory per node whereas the iPSC has  megabytes per
  March  
Section  Sequential Simulator    
log
 
 seconds
sequential simulator
A
B
C
D
E
F
log
 
 nodes
D
B
C
E
      	 


	







Figure  Combining the iPSC and iPSC graphs with sequential timing aligned
node The sequential simulator result is an estimate derived from a simulation of a smaller
circuit  to be described later
The results that we are able to obtain from the iPSC simulation indicate a contin
uation of the nearlinear speedup until N  	 when there are fewer than  elements in
each node The total speedup obtained is 	 when the two sets of results are combined in
Figure 	
A 	node Symult  multicomputer allows us to explore a large overlapping portion
of these two combined graphs Since the S nodes are much faster than the iPSC
nodes the simulation interval has been scaled from s to s in order for the timing to
remain meaningful when N  	 Figure 	 matches Figure 	 closely but every variant
is able to complete its simulation for every N on the S Variant F resembles variant A
because as queuing limits vanish throughout the simulator the simulator eectively becomes
a variantA simulator Variant F is a little worse than variantA because it still must produce
demand messages in addition to any eagerly sent message Variant E however resembles
other variants
Figure: A   -gate multiplier for    s on a Symult    (log seconds versus log nodes; curves
for variants A through F and the sequential simulator)
  Small multiplier results
Since we do not have a node multicomputer it is necessary to experiment with smaller
circuits to observe the asymptotic eects predicted by the sweepmode simulation for large
N  Figure 	 contains the results for the simulation of a   arraymultiplier consisting
of 	 logic gates The iPSC and iPSC simulations were performed over a simulated
interval of s The S simulation was performed over an interval of s to preserve
accuracy when many nodes are used
Not only is the reduction in slope more visible dierences between various modes are
also more apparent There are   and  elements per node when all of the nodes in the
iPSC S and iPSC respectively are in use
Compared to the iPSC curves the S curves show a steeper slope a larger overall
speedup and a closer match with the sweepmode curves The attening of the curves for
the iPSC is due to the eect of message latency The average message latency for the
iPSC when N  	 is   s this is comparable to the sperevent processing
  March  
Section  Sequential Simulator    
log
 
 seconds
log
 
 nodes
sequential simulator
A
D
C
B
E
F
      	 

	





Figure  A gate multiplier for s on an iPSC
log
 
 seconds
log
 
 nodes
sequential
simulator
A
B
C
D
E
F
      	 



	



Figure  A gate multiplier for s on an iPSC
log
 
 seconds
log
 
 nodes
sequential
simulator
A
B
C
D
E
F
      	 


	





Figure  A gate multiplier for s on a Symult 
  March  
    Chapter  LogicCircuit Simulator Experiments
time of the sequential simulator The usermode message latency for the S is    s
this is smaller than the  sperevent processing time
We can observe the effect of latency by varying latency in the sweep-mode simulation. Figure  contains two plots, one for N   and the other for N  . A message sent during a sweep is available to its destination in the following sweep when latency is 0. When latency is nonzero, the message is delayed by an amount equal to the latency. When simulation becomes dominated by latency, time increases linearly with latency.
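The delivery rule just described can be sketched as a one-line function. This is only an illustration: the name delivery_sweep is hypothetical, and the sketch assumes that latency is counted in whole sweeps, which the text does not state explicitly.

```c
/* Sketch of the sweep-mode delivery rule.
   Assumption: latency is measured in whole sweeps.
   delivery_sweep is a hypothetical name, not part of the simulator. */
int delivery_sweep(int send_sweep, int latency)
{
    /* With zero latency a message sent in one sweep is visible in the
       next sweep; nonzero latency delays it by that many extra sweeps. */
    return send_sweep + 1 + latency;
}
```

Under this assumption, once every process spends most of its sweeps waiting for messages, the sweep count grows in proportion to the latency, which is the linear regime mentioned above.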
Figure  Effect of increased latency on simulation performance (two panels, one for each value of N; log sweeps versus log latency; curve E is labeled in each panel)
In all of the results that we have shown, the source-driven variants B, C, and D are the most robust variants, and they show a larger speedup than the other variants when N is large. The demand-driven variant E is hindered by a large message latency. An idling process may be delayed for two message cycles (send a demand message, receive a normal message) before it can continue. When internode message latency is large, variant E performs poorly. Variant F does better because it becomes variant A when processes are idle more frequently.
   Circuit topology vs activity level
A CMBvariant circuit simulator must supply every element input with enough fragments
to cover the entire simulation interval Since its simulation time is only weakly dependent
  March  
Section  Sequential Simulator    
on the content of those fragments it is more strongly inuenced by the static characteristics
of the circuit connectivity such as degree of fanout than by the dynamic characteristics
of the circuit operation such as number of events produced A sequential simulator on the
other hand depends only on the number of events produced
Figure  A gate multiplier for s on a Symult 2010 (fast oscillation) (plot of log seconds versus log nodes; curves: sequential simulator and variants A, B, C, D, E, F)
For example if a circuit contains a crosscoupled latch the delay of the gates in the
latch determines the number and the span of the fragments produced and the number of
fragments produced determines the simulation time for the CMBvariant simulator The
number of times the latch is used determines the number of events generated in the latch and
the number of events generated determines the simulation time for a sequential simulator
We can expect the sequential simulator performance to change by a greater degree
compared to the CMBvariant simulator if we run the simulation using the same multiplier
circuit but with a dierent activity level Figure 	 is obtained by driving the array
multiplier at an elevated oscillation frequency Four times as many events are produced
and the time taken by the sequential simulator has increased by a factor of  The time
taken by the CMBvariant simulators however has increased by only a factor of 

  March  
   Chapter  LogicCircuit Simulator Experiments
Since fragments are more likely to carry transitions the possibility of consecutive frag
ments merging into a single fragment is reduced It becomes less protable for the simulator
to withhold messages The time taken by variant A has increased by a factor of only 
and variant A performs better than the other variants when N is not too large
  Hybrid possibilities
The CMBvariant simulator implements an algorithm that distributes well but like many
other algorithms there are sequential implementations that are more ecient than the
concurrent implementation However the CMBvariant simulator is unusual in that it is
an exact implementation of an algorithm that can be dened recursively 	 each element
simulator can also be a composite simulator We can view the simulator process on each
node as being a composite simulator that simulates the set of elements assigned to that
node We refer to the set of elements collectively as a macro element The circuit simulator
becomes one whose elements are not the logic gates but the macro elements
 of these one
exists in each node
Figure  Modified Laffer Curve (plot of log time versus log nodes; curves: sequential, hybrid, CMB-variant)
Since the elements in a macro element must reside in the same address space, and since their operations must be interleaved, it is a tempting thought that there may be a way to introduce sequential-simulator efficiency into the simulation of elements in a macro element. Suppose such a hybrid simulator were to exist. When N = 1, all logic gates would reside in the same node; the simulator would have the same performance as a sequential simulator. If N were large, there would be one logic gate per node, and the performance would converge to the performance of the CMB-variant simulator.
Figure  depicts a hypothetical performance plot of a hybrid simulator, a sequential simulator, and a CMB-variant simulator. We will call this hybrid-simulator curve the modified Laffer curve, in recognition of economist Arthur B. Laffer, who showed that tax revenue is fixed on two ends on the plot of revenue vs. tax rate. The quest for the algorithm and for the control over the shape of the curve between these two end points guides the rest of the experimental work, which will be discussed in the next chapter.
Chapter  Hybrid Simulators
Section  Coordinated Sequential Simulator (Hybrid-1)
One way to build a hybrid simulator is to use a modified sequential simulator for each macro element and to connect the sequential simulators using a CMB-variant simulator. Since a CMB-variant simulator provides coordination for a set of sequential simulators, this hybrid simulator is called the coordinated sequential simulator, designated hybrid-1. When N = 1, hybrid-1 is identical to the sequential simulator, as the modification does not introduce extra work for the simulator when the macro element is a closed system.
A macro element is an open system if any of its element inputs connect to an element output in another node. Macro-element connectivities are handled by the CMB-variant simulator, and macro-element simulators must satisfy the requirements of the CMB-variant simulator. Output state descriptions produced by each macro-element simulator are packed into fragments and sent to the encircling CMB-variant simulator. The CMB-variant simulator distributes the fragments according to the connectivity of the macro elements. When a macro-element simulator receives a fragment, events extracted from the fragment are entered into the event list.
 The algorithm
Since asynchronous events can be injected by other macro-element simulators, event order for a macro-element simulator cannot be guaranteed by the repeated delivery of the earliest event from the event list. The simulator may not be able to consume the event at the top of the list because an event with a smaller time value may yet arrive from another macro element. To avoid a simulation error, we can employ a temporal marker in each macro element to indicate the smallest time value for any future external events. As long as the time of the first event in the event list is less than the marker time, the event can be safely consumed. If the event time is greater than the marker time, the simulator must wait.
The encircling CMB-variant simulator assures that the time of the next event on any macro-element input is greater than or equal to the time of the macro-element input. The time of a macro-element input is equal to the total span of fragments that have passed through it and is updated whenever a fragment is received for that input. The minimum macro-element input time is a convenient temporal marker.
Output fragments are produced by a macro-element simulator whenever additional output descriptions are computed. Since elements are strictly synchronized in a sequential simulator, the outputs of all elements in a macro element are known up to the same simulated time. Thus, the entire state of the macro element can be treated as an atomic property (Chapter  ), and all arcs with the same source and destination nodes can be merged into one arc.
In order to compute the temporal marker, we store the input time of each macro-element input in a special stopper event. The stopper is added to the event list along with the other events. When a macro-element input receives a fragment, in addition to injecting new events, it adds the span of the fragment to its stopper time, and it repositions the stopper in the event list. As long as the event at the top of the event list is not a stopper, the macro-element simulator is free to consume the event; when a stopper appears at the top of the event list, the simulator is made to wait for more inputs.
  Sorting with a different key
A macroelement simulator derived from a conventional sequential simulator has an e	ective
delay of zero because its eventconsumption rules prevent the simulator from producing any
output description that has a time value larger than its own minimum input time A circuit
of these macroelement simulators will deadlock unless a set of alternative consumption
rules is used to produce a positive delay

The event with the smallest simulated time will be delivered rst is merely a conve
nient consumption rule that satises the following correctness conditions for a sequential
simulator When an event is delivered to an element
  March  
   Chapter  Hybrid Simulators
  The event will not need to be recalled and
 No future events for the element will have a smaller event time
We can satisfy both conditions and provide a nonzero delay by sorting events according to
the following ordered pair
\[ \mathrm{key} = (t_e + d_m,\; t_e) \]
where $t_e$ is the event time and $d_m$ (the m-delay) is the delay of a minimum-delay path (the shortest path) between the destination element of the event and any macro-element output. Macro-element output events therefore have a $d_m$ of 0. The first member of a key is the dominant member when keys are to be compared:
\[ \mathrm{key}_1 < \mathrm{key}_2 \iff \mathrm{key}_1[1] < \mathrm{key}_2[1] \ \text{ or } \ \bigl(\mathrm{key}_1[1] = \mathrm{key}_2[1] \ \text{ and } \ \mathrm{key}_1[2] < \mathrm{key}_2[2]\bigr) \]
Intuitively if input events for an element are ordered according to this key they are
ordered in t
e
as well because d
m
is the same for all input events of the element An event
whose destination element has an mdelay of d
m
can be deferred in the event list by d
m
amount of time relative to those events for the macroelement outputs because its eects
cannot propagate to the outputs before t
e
	 d
m
 The eective delay of a macro element is
therefore the minimum mdelay of its macroelement inputs
Theorem  An event produced by an element with a positive delay must have a key that is larger than the key of the event that triggers it.
Proof: Let the delay of the element be $\delta$, the time of the input event be $t_e$, and the m-delay of the element be $d_m$.
By the definition of element delay, any output event triggered by this input event must have a time value of at least $t_e + \delta$. By the definition of m-delay, the destination element of the output event must have an m-delay of at least $d_m - \delta$. Therefore, the first part of the key for the output event must be no less than $(d_m - \delta) + (t_e + \delta)$, or $t_e + d_m$, which is equal to the first part of the key for the input event.
The second part of the key for the output event is at least $t_e + \delta$, which is greater than the second part of the key of the input event. Therefore, the key of the output event must be larger than the key of the input event.
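As a worked instance of this argument, the following uses illustrative numbers that are assumptions and not values from the experiments: an input event at time $t_e = 10$, an element with m-delay $d_m = 5$, and an element delay, written here as $\delta$, of 2. Primes mark the quantities of the output event.

```latex
\begin{align*}
\text{input key}  &= (t_e + d_m,\ t_e) = (15,\ 10) \\
t_e'              &\ge t_e + \delta = 12
                  && \text{(time of the output event)} \\
d_m'              &\ge d_m - \delta = 3
                  && \text{(m-delay of its destination)} \\
\text{output key} &= (t_e' + d_m',\ t_e'),
                  \quad t_e' + d_m' \ge 15,\quad t_e' \ge 12 > 10
\end{align*}
```

The first members may tie at 15, so the strict increase comes from the second member, which grows by at least $\delta > 0$.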
Theorem  Any event appearing at the top of the event list is valid.
Proof: An event must come either from another element in the same macro element or from another macro element. Events from other macro elements are assumed to be correct because the macro-element simulators follow the rules of a CMB-variant simulator.
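Figure  An event that invalidates another event (diagram: event $e_v$, with key $(A_v, B_v)$, is consumed by $\mathrm{src}(e_t)$, which has delay $\delta$ and m-delay $A_v - B_v$; it produces event $e_t$, with key $(A_t, B_t)$, for $\mathrm{dst}(e_t)$, which has m-delay $A_t - B_t$)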
If the event is produced locally, let the event at the top of the list be $e_t$, and let $(A_t, B_t)$ be the key of that event. Let $e_v$ be the event that an element consumes to invalidate $e_t$, and let $(A_v, B_v)$ be its key.
By the definition of a key, $A_t - B_t$ is the m-delay of $\mathrm{dst}(e_t)$, and $A_v - B_v$ is the m-delay of $\mathrm{src}(e_t)$. Let $\delta$ be the delay of $\mathrm{src}(e_t)$. By the definition of m-delay, we have the inequality
\[ A_t - B_t \ge A_v - B_v - \delta \]
which we can rearrange into
\[ \delta \ge (A_v - A_t) + (B_t - B_v) \]
We also have $(B_t - B_v) \ge \delta$ because the delay of $\mathrm{src}(e_t)$ is $\delta$, and $(A_v - A_t) \ge 0$ because $e_t$ is the event at the top of the event list. The only solution to the inequality above is $(A_v - A_t) = 0$ and $(B_t - B_v) = \delta$.
Since the key of $e_t$ is no greater than the key of $e_v$, it follows that $\delta$ must be zero and that the two events must have the same event time. Since the ordering of the two events is beyond the ability of the model to resolve, it is correct to assume in this case that $e_t$ is earlier in time and is therefore valid.
Suppose $e_t$ is the event at the top of the event list, and let the first part of its key be called the event-list time. Since all macro-element output events have an m-delay of zero, and since all new events have keys that are at least as large as the key of $e_t$, the state of all macro-element outputs is known up to the event-list time. The effective delay of a macro element is therefore equal to the delay of the shortest path between any macro-element input and output.
  The simulator mechanism
The sequentialsimulator discussion in section  hints that complexities are being moved
into the message system of the reactive kernel 	the kernel of the lightweight reactive
element processes
 When a reactive kernel needs an event its message system provides the
event with the smallest time value of all events in the message system
Figure  Layering in the hybrid simulator (nested layers, from outermost to innermost: multicomputer message system, CMB-variant message system, sequential-simulator message system, sequential-simulator kernel, element processes)
In hybrid-1, the message system of a sequential simulator is sandwiched between the message system of a CMB-variant simulator and the kernel of the sequential simulator. When the kernel needs an event, its message system provides the event having the smallest key, as long as that event is not a stopper. If it is, the message system waits for the stopper to be relocated. When the message system of the CMB-variant simulator receives more fragments, it moves the stoppers. The hybrid-1 simulator can therefore be constructed by layering reactive kernels.
    typedef struct { int (*entry)();        /* entry function of the process */
                     char *data; } PROCESS; /* private data of the process   */
    typedef struct { int pid;               /* id of the destination process */
                     char *msgbody; } MESSAGE;
    SIMDATA *simulatordata;

    sequentialsimulatormainloop(simp, mesg)
    PROCESS *simp;
    MESSAGE *mesg;
    {
        PROCESS *proc;

        /* Recover the simulator state kept in the process data. */
        simulatordata = (SIMDATA *) simp->data;
        /* Look up the destination element and call its entry function. */
        proc = simulatordata->processtable + mesg->pid;
        (*proc->entry)(proc, mesg->msgbody);
    }
Listing  Hybrid-1 main loop
The kernel of the sequentialsimulator main loop can be expressed as the lightweight
reactiveprocess program shown in Listing  It returns to its message system for more
events The messagesystem layer for the sequential simulator Listing 	 takes care of
sorting the events and getting external events from the message system of a CMBvariant
simulator The message system of the sequential simulator is also a lightweight reactive
process

  PROCESS seqsim  Sequential simulator process structure only   
 sequentialsimulatormessagesystemmsys sb
	 PROCESS msys

 STATEFRAGMENT sb
 
 breakstatefragmentintoeventsmsyssb
 freefragmentsb
  whiletopoflisteventisnotstoppermsys
   
  seqsimentryseqsimgettopoflisteventmsys
  
 	 
Listing  Hybrid embedded message system
  March  
   Chapter  Hybrid Simulators
It returns to the message system of the CMB-variant simulator for a fragment, which it digests into individual events. After that, as long as the event with the smallest time is not a stopper, the message system will remove the event from the event list and deliver it to the sequential-simulator kernel.
  The simulator output
Sending only the macroelement output events is not enough to satisfy the requirements
for a CMBvariant simulator Whenever the eventlist time has increased more is known
about the outputs even if no output event has been produced The rule for eventual delivery
requires that null messages be generated
Like the CMBvariant simulator several variants of the hybrid simulator have been
created and they are characterized by how and when messages are sent Eventual delivery
is also assured by the same indenitelazy evaluation mechanism not shown in the listings
above Three adjustable parameters are available for the hybrid simulator	
Queuelimiting Messages are sent when an adjustable limit on the number of queued
output events is reached or when null is returned by xrecv
Demanddriven Demand messages are sent after an adjustable delay as measured by
the number of successive nulls returned by xrecv while a macroelement
simulator is waiting for more inputs Demand messages are sent to the
source nodes of the inputs whose stoppers are at the top of the event list
Queued messages for that output addressed by the demand message are
sent when a demand message is received
Eagermessage Each output has a prompter event that stores the sum of an adjustable
value and the simulated time of the last output action When a prompter
event reaches the top of the event list messages are sent for that output
and the prompter is rescheduled
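The three parameters can be pictured as a small parameter block with one decision rule each. The sketch below is an illustration under stated assumptions: the structure, its field names, and the three functions are hypothetical, and the result of xrecv is passed in as a flag rather than obtained by calling the message system.

```c
/* Hypothetical parameter block for one hybrid-simulator variant. */
struct hybridparams {
    int  queuelimit;        /* queued output events that force a send    */
    int  demandafternulls;  /* successive null receives before a demand  */
    long prompterdelay;     /* added to the time of the last output      */
};

/* Queue-limiting: send when the limit is reached or when the last
   receive returned null. */
int shouldsend(const struct hybridparams *p, int queued, int recvwasnull)
{
    return queued >= p->queuelimit || recvwasnull;
}

/* Demand-driven: send a demand message after enough successive nulls. */
int shoulddemand(const struct hybridparams *p, int successivenulls)
{
    return successivenulls >= p->demandafternulls;
}

/* Eager-message: time at which the prompter for an output is scheduled. */
long promptertime(const struct hybridparams *p, long lastoutputtime)
{
    return lastoutputtime + p->prompterdelay;
}
```

A variant is then one choice of the three values; a very large queue limit or prompter delay would, under this sketch, effectively disable the corresponding mechanism.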
  Expectation
Tight synchronization between elements in the same computing node greatly reduces the volume of internode messages, especially null messages, by combining internode arcs having common source and destination nodes into one single arc. Tight synchronization, however, can also reduce concurrency. When a simulator process is blocked because of a stopper appearing at the top of the event list, elements that do not depend on the input of that stopper are also prevented from making progress. Concurrency is reduced because this forces different subcircuits in the same node to progress at the same rate and ignores nonstrict input conditions, in which an element can still make progress when some of its inputs are blocked.
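The effect of arc combining on message volume can be illustrated by counting how many internode arcs remain after merging. The sketch below is only an illustration: the arc structure and the function name are hypothetical, and a quadratic scan is used for clarity rather than efficiency.

```c
/* An internode arc, identified by its source and destination nodes.
   struct arc and mergedarccount are illustrative names. */
struct arc {
    int srcnode;
    int dstnode;
};

/* Number of arcs left after all arcs with a common source node and a
   common destination node are combined into one. */
int mergedarccount(const struct arc *arcs, int n)
{
    int i, j, count = 0;

    for (i = 0; i < n; i++) {
        int seen = 0;

        /* Has an earlier arc already covered this node pair? */
        for (j = 0; j < i; j++) {
            if (arcs[j].srcnode == arcs[i].srcnode &&
                arcs[j].dstnode == arcs[i].dstnode) {
                seen = 1;
                break;
            }
        }
        if (!seen)
            count++;
    }
    return count;
}
```

With merging, the number of arcs is bounded by the number of ordered node pairs rather than by the number of element connections that cross node boundaries.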
Figure  Expected performance of the hybrid simulator (plot of log time versus log nodes; reference curve: sequential)
The purpose of this experiment is to construct a simulator that will do as little work as possible at small N rather than be as efficient as the CMB-variant simulators at large N. After all, we can already get CMB-variant-simulator performance by running a CMB-variant simulator. We expect the simulator performance graph to start at N = 1 at sequential-simulator speed. We expect to see sublinear speedup due to the lost concurrency, load imbalance, and extra work required to deal with the message system. We then expect the performance to bottom out at a level above the CMB-variant simulator when N is large.
  Experimental results
Like the CMBvariant simulator and the sequential simulator hybrid is also written in
the form of a reactive program making it suitable for sweepmode simulation however
a sweepmode simulator has not been implemented The realmode simulator has been
implemented and a node Symult 	

 was used as the primary test vehicle Although
simulation was performed using a multitude of simulation parameters only a handful will
be shown because related variants produce similar results The variants are
Queue limit   
 null xrecvs before demand message
Queue limit   
 null xrecvs before demand message
Queue limit  	
 Prompter delay  
ns
Prompter delay  
ns
Figure  contains the simulation result of a   arraymultiplier running on a 
node S	

 for 

 s simulated time It is shown alone left and superimposed over the
CMBvariant result right
hybrid only both
log
 
seconds
log
 
nodes

  	     








	
log
 
seconds
log
 
nodes

  	     








	
Figure  A gate multiplier for 

 s on a Symult 	
  March  
Section  Coordinated Sequential Simulator Hybrid   
The general characteristic of these curves matches our expectation. In the multiplier example, the extra work that the simulator has to do and the difficulty it has in subdividing the multiplier for load balancing result in no speedup from N   to  . For larger N, the curves show a slope of   until N  , where the curves level out. Between N   and  , the curves cross over those of the CMB-variant simulator. The demand-driven modes perform consistently better than the queue-limiting modes. The eager-message modes perform well for small N, but they bend upward for large N due to an excess of null messages. The more eager of the two curves bends upward sooner than the less eager one.
Due to the combining of arcs, hybrid-1 curves are strongly influenced by element distribution only when N is large. Figure  contains results of simulation using randomized element placement. Compared to Figure , the CMB-variant curves are shifted upward uniformly for all N, and the hybrid-1 curves are bent upward when N is large. The hybrid-1 curves show little change when N is small.
hybrid only both
log
 
seconds
log
 
nodes
    
  	 

	






log
 
seconds
log
 
nodes
    
  	 

	






Figure  A gate multiplier for s on a Symult  with random placement	
Since one end of the hybrid-1 curves is pegged to the sequential-simulator time, we can also expect a larger change for the hybrid-1 simulator than for the CMB-variant simulator when we increase the circuit activity level. Figure  contains the results of simulation using the same multiplier circuit operated at a higher oscillation frequency. The hybrid-1 curves are shifted upward by two octaves, while the CMB-variant curves are shifted only by one octave. A high activity level is more favorable to the CMB-variant simulator because fewer of the messages are null messages.
hybrid only both
log
 
seconds	
log
 
nodes	

       









sequential simulator
A
B
C
D
E
F
log
 
seconds	
log
 
nodes	

       









Figure  A faster oscillating gate multiplier for 

s on a Symult 	
Results from the multiplier example in this chapter, and better results from other circuits to be shown in Chapter  , have confirmed that the hybrid-1 simulator is working and performing to our expectation. Our next step is to go beyond the limitations of the hybrid-1 simulator to construct a new hybrid simulator that will converge to the CMB-variant simulators when N is large.
Section  Progressive Hybrid Simulator (Hybrid-2)
The hybrid simulator cannot achieve CMBvariant performance at large N because po
tential concurrency is lost when nonstrict conditions are ignored and elements in a macro
element are synchronized Two separate mechanisms are used to recover the lost concur
rency First when an input of an element becomes blocked it must be allowed to continue
if it can still make progress due to a nonstrict input condition Second when some el
ements are blocked we must allow those that are not blocked to continue ahead of the
blocked elements
When a stopper appears at the top of the event list elements connected to the input
of the stopper may be blocked Since hybrid macro elements are simulated by sequential
simulators when an element in a macro element becomes blocked the entire macro element
is blocked When an element becomes blocked in hybrid	 the macro element is in e
ect
reorganized by moving the blocked element out of the macro element More blocked elements
may result due to arcs leading from the blocked element to the new macro element When
only unblocked elements remain however the macroelement simulator can continue to
make progress When a blocked element has received more inputs and becomes unblocked
it is put back into the macro element
To take advantage of nonstrict input conditions stoppers in hybrid are replaced by
blocker events in hybrid	 A blocker appearing at the top of the event list does not cause
the simulator process to stop instead it is delivered like a normal event For every blocker
there is a matching antiblocker it has the same simulation time as the blocker and they
annihilate each other in the simulator Macroelement inputs produce both blockers and
antiblockers Whereas the hybrid simulator relocates the stopper as more state fragments
are received the hybrid	 simulator instead adds an antiblocker with a time value equal to
the previous blocker adds any events carried by the fragment and adds a blocker with the
time equal to the new time of the hybrid stopper
  March  
   Chapter  Hybrid Simulators
When an element receives either a blocker, an antiblocker, or a normal event, the element determines whether it is blocked. It is not blocked if all of its inputs are unblocked or if its remaining unblocked inputs contain a nonstrict input condition; it is blocked otherwise. When an unblocked element becomes blocked, it sends a blocker with a time equal to that of the current input event. When a blocked element becomes unblocked, it sends an antiblocker with a time equal to that of the previous blocker.
In a hybrid simulator when N is small most of the element inputs are not blocked
and the simulation takes on the characteristics of a hybrid simulator When N is large
many of the element inputs are blocked and the simulation produces the eciency of a
CMBvariant simulator However one clear disadvantage of hybrid compared to hybrid
 is that internode arc merging is no longer possible and the simulator is potentially more
sensitive to element placement
  The mechanism
    struct EVENT { int etype;     /* type of the event          */
                   int inputid;   /* id of the element input    */
                   int time; };   /* time of the event          */

    genericgate(pp, ep)
    PROCESS *pp;
    struct EVENT *ep;
    {
        /* A late event for a previously blocked input is moved up to
           the element time before it is processed. */
        if (ep->time < elementtime(pp)) ep->time = elementtime(pp);
        setinputbits(pp, ep);
        computestateandblockage(pp);
        if ( wasblocked(pp) && !isblocked(pp)) addantiblocker(pp, ep->time);
        if ( oldoutput(pp) != newoutput(pp))   addoutputevent(pp, ep->time);
        if (!wasblocked(pp) &&  isblocked(pp)) addblocker(pp, ep->time);
        savenewstate(pp);
        freeevent(ep);
    }
Listing  Generic logic-gate handler for hybrid-2
A sample element entry function appears in Listing . In addition to the usual inputid and time fields, the hybrid-2 event structure also contains an etype field to distinguish among normal events, blockers, and antiblockers. Since nonstrict input conditions are utilized, it is now possible for an element to receive events with a time value smaller than the time of the element. These events are for inputs that were previously blocked, but the element was able to progress further because a nonstrict input condition was present. These events do not contribute to the operation of the element other than to determine the current input state of the element. Therefore, when such an event is received, its event time is simply set to the element time before it is processed like other events.
Each element input contains a pair of variables. One indicates the state; the other indicates blockage. Each output contains two pairs of variables, one for the old state and blockage and one for the new state and blockage. When an event is received by the process, the setinputbits function is called to set or clear the affected bits in the input structure of the element. The new output state and blockage are then computed from the new input state and blockage. If the element has become unblocked due to the event, an antiblocker is sent. If the element has changed state, a normal event is sent. If the element has become blocked, a blocker is sent. The ordering of these three tests assures that the event following a blocker is an antiblocker.
The sequential-simulator main loop, the kernel to these element processes, tests the blockage flag before and after an entry function is called; blocked elements are separated from unblocked elements by treating them differently. Listing  is the kernel written as a heavyweight reactive process.
  sequentialsimulatormainloop
 
 MESSAGE mesg	

 PROCESS proc	
 mesg  getnextevent	
 proc  processtable  mesgpid	
 ifblockedproc
  
   procentryproc mesgmsgbody	
  March  
   Chapter  Hybrid Simulators
   else
  
  ifeventtimemesg	 
 elementtimeproc		
  
  queueeventprocmesg	
   else
 
  proc
entry	proc mesg
msgbody	
 ifblockedproc		 movequeuedeventsbacktoeventlistpp	
 
 
 
Listing  Hybrid main loop
When an event is returned from the message system, which contains the event list, the main loop identifies the receiver of the event and checks its blockage flag. If the element is not blocked, it is in the sequential-simulator domain, and the event is delivered to it as if it were in a normal sequential simulator.
If the element is blocked, the main loop checks its readiness to consume the event. The event cannot be consumed if its time is larger than the time of the element: the element lacks the information about the future state of its blocked inputs that is necessary to consume an event that arrives at a future time. Such an event is queued for the element. If the event time is less than or equal to the element time, the element has enough information to consume the event, and the event is sent to the element. If the element is now unblocked, its queued events are moved back into the event list to be delivered again to the element.
Queued events cannot be delivered directly to the element when the element becomes unblocked, because they arrived while some inputs of the element were blocked. There may be events for the blocked inputs that have yet to arrive and that need to be delivered in the proper order with respect to the queued events when the element becomes unblocked. Moving all queued events back into the event list is inefficient when the queue is long and when moves have to be done frequently. The actual implementation of the hybrid simulator contains an elaborate mechanism for minimizing wasted effort, and this accounts for the largest difference between the hybrid presented here and the actual implementation.
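The per-event decision described above can be isolated as a small function. The following is a minimal sketch in C, not the thesis implementation: the names classify_event, DELIVER, and QUEUE are hypothetical, and times are assumed to be nonnegative integers.

```c
#include <assert.h>

/* Hypothetical names, introduced only for illustration. */
typedef enum { DELIVER, QUEUE } action_t;

/* Decide what the main loop does with one event for one element.
 *   blocked      : nonzero if the element's blockage flag is set
 *   event_time   : time stamp carried by the event
 *   element_time : time of the receiving element                  */
action_t classify_event(int blocked, long event_time, long element_time)
{
    assert(event_time >= 0 && element_time >= 0);  /* assumed time model */
    if (!blocked)
        return DELIVER;   /* element is in the sequential-simulator domain */
    if (event_time > element_time)
        return QUEUE;     /* blocked inputs are not yet known that far ahead */
    return DELIVER;       /* blocked, but the event is not in its future */
}
```

An event classified as QUEUE is the kind that must later be moved back into the event list once the element becomes unblocked.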
  Experimental results
Like the other simulators, hybrid-2 is written in the form of a reactive-process program, making it suitable for sweep-mode simulation; but, as in the case of hybrid-1, a sweep-mode simulator has not been created. Figure  contains the simulation results of a   array multiplier running on a  -node S  for   s of simulated time. It is shown alone (left) and superimposed over both the CMB-variant result and the hybrid-1 result (right).
Figure  A gate multiplier for s on a Symult
[Plot not reproduced. Panels: hybrid only (left), all (right). Axes: log seconds versus log nodes. Legend: queue limit and number of null xrecvs before demand message; queue limit and prompter delay; prompter delay.]
The most noticeable difference between hybrid-1 and hybrid-2 curves in this graph is that whereas hybrid-1 curves level off at large N, hybrid-2 curves keep going down. Hybrid-2 curves start out very much like hybrid-1 curves because most of the elements in the hybrid-2 simulators are running under the hybrid-1 mode. As more and more nodes are used in the simulation, hybrid-1 element simulators start to become idle more frequently, and their curves start to level off. In the hybrid-2 simulator, instead of becoming idle, more of the elements enter the CMB-variant mode to provide additional speedup over hybrid-1.
The other remarkable aspect of hybrid-2 curves is that they are all very much alike until that point where most of the hybrid-1 curves level off. It is after this transition point that progress-promoting actions begin to dominate, and a variety of different performance results are produced, depending on the properties of the progress-promoting action in use.
The hybrid-2 curves appear to converge toward the CMB-variant curves, but nothing conclusive can be deduced from this graph because a  -node machine lacks sufficient nodes to demonstrate this effect. The convergence is much more obvious when elements are placed randomly. Placement has a much stronger effect on the hybrid-2 simulator than it does on the hybrid-1 simulator because random element placement greatly increases the number of internode arcs for the hybrid-2 simulator.
Figure  shows the result of random element placement (same placement for all simulations shown in this graph). The hybrid-2 curves converge immediately to the CMB-variant curves at N =  . Reduction in internode null messages by bundling internode arcs allows the hybrid-2 simulator to show a small speedup at small N.
Convergence is also more evident when we increase the circuit activity level. Figure  shows the results of simulating the multiplier with enhanced activity level. Convergence begins at a smaller N because the sequential-simulator time is now closer to the CMB-variant time when N =  . The hybrid-2 curves start out closer to the CMB-variant curves, and they converge to the CMB-variant curves at N =  .
Although we do not have a larger machine for looking at cases where there are fewer elements per node, we can reduce the number of elements per node by using smaller test circuits. We tested a    array multiplier that contains   gates. At N =  , there are no more than two gates in each node.
Figure  A gate multiplier for s on a Symult  with random placement
[Plot not reproduced. Panels: hybrid only (left), all (right). Axes: log seconds versus log nodes.]
Figure  A faster-oscillating gate multiplier for s on a Symult
[Plot not reproduced. Panels: hybrid only (left), all (right). Axes: log seconds versus log nodes. Legend: sequential simulator; curves A to F.]
Figure  A gate multiplier for  s on a Symult
[Plot not reproduced. Panels: hybrid only (left), all (right). Axes: log seconds versus log nodes. Legend: sequential simulator; curves A to F.]
The CMBvariant curves diverge wildly some of them do better than the hybrid
curves and some do worse Overall the hybrid curves seem to follow the better CMB
variant curves
  March  
   Chapter  Additional Performance Results
Chapter  Additional Performance Results
This chapter summarizes the simulation results of a few selected circuits that were used in this research. They are generally presented in the following order:
1. Description of the circuits
2. Sweep-mode simulation results on an emulated multicomputer
3. Real-mode simulation on a Symult  with systematic element distribution
4. Real-mode simulation on a Symult  with random element distribution
5. A few sets of real-mode simulation on smaller circuits of the same type
Each set of real-mode simulations contains results from running the CMB-variant simulator, the hybrid-1 simulator, and the hybrid-2 simulator. Results from other multicomputers are similar and are not shown.
Section  D Clock Network
 Description
A clock network is an arbitrarily extensible array of logic gates that oscillates when properly initialized. The frequency of the oscillation is determined by local characteristics, and the phase at any node in the network is locked to the phase of the adjacent nodes. A clock network can be used to provide synchronous communication for an arbitrarily large bounded-degree multicomputer network.
Figure  A FIFO consisting of  units
[Diagram not reproduced. Labels: register, controller, ack, req, data out, clock.]
A clock network is a generalized self-timed FIFO circuit. As shown in Figure  , a FIFO is made of a number of FIFO units connected into a chain; a FIFO unit contains a controller and a register. The registers in a FIFO are connected in a chain via their data inputs and outputs; the controllers are connected via their request and acknowledge signals. Each controller provides a clock signal to enable and disable the latches in its register. The acknowledge and request signals allow each controller to determine when the FIFO unit immediately preceding it has data for it, and when the FIFO unit immediately following it has taken the data from it.
Each FIFO unit leads, but is never more than a half cycle ahead of, the following unit, and lags, but is never more than a half cycle behind, the preceding unit. Thus, if registers were computers and register-to-register links were communication channels, the data one computer latches in at its kth clock tick is the data put out by the preceding computer at that computer's kth clock tick. With a little extra delay, synchronous communication can also take place in the reverse direction.
Figure  A C-element FIFO consisting of  units
[Diagram not reproduced. Labels: req, ack.]
A simple FIFO control can be constructed using a C element and an inverter. A C element is a state-storage device such that when all of its inputs are high, the output becomes high; when all of its inputs are low, the output becomes low; and the output remains unchanged otherwise. In the FIFO shown in Figure  , the output of a C element is connected to an input of the C element in the following unit. The inverted output of a C element is connected to an input of the C element in the preceding unit. The output of the C element is also used as the clock to the register.
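The behavior of a C element stated above can be written as a next-state function. The following is a minimal sketch in C; the function name c_element_next and its argument layout are hypothetical and are not taken from the simulator.

```c
#include <assert.h>

/* Next output of an n-input C element (n >= 1).
 * inputs[i] and prev_output are logic levels, 0 or 1. */
int c_element_next(const int *inputs, int n, int prev_output)
{
    int all_high = 1, all_low = 1;
    assert(n >= 1);
    for (int i = 0; i < n; i++) {
        if (inputs[i]) all_low = 0;   /* at least one input is high */
        else           all_high = 0;  /* at least one input is low  */
    }
    if (all_high) return 1;           /* all inputs high: output goes high */
    if (all_low)  return 0;           /* all inputs low: output goes low   */
    return prev_output;               /* inputs disagree: hold the state   */
}
```

The hold case is what makes the C element a state-storage device: with mixed inputs, the output depends on history rather than on the present inputs alone.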
Figure  A    array of self-oscillating FIFO units
The FIFO structure can be extended to a higher dimension by cross-connecting a set of FIFO controls with another set of FIFO controls. Figure  contains a two-dimensional array of   FIFO units with the registers omitted. The edges are terminated in such a way that the array will oscillate. This is essentially the same network that is used in the clock network simulation, except that each  -input C element is replaced by a gate circuit. The circuit in Figure  has   gates.
  Sweepmode results
log
 
sweeps
log
 
nodes
    	 
      






	








Figure  Sweepmode CMBvariant simulation of an gate clock network
Figure  contains the sweep-mode results of an    clock network containing   logic gates. The speedup is linear until there are fewer than   elements in each node. The null message overhead is a little larger than   at N =  , and the crossover occurs between N =   and N =  . Unlike the multiplier example we used in previous chapters, the clock network shows a much greater difference between the most-eager variant and the lazier variants. This is typical of circuits with many tight loops, where unnecessary null messages can persist as they travel around the loops. The lazier variants annihilate such null messages to achieve an improved performance over the most-eager variant.
Also unlike the multiplier example, load balancing is simple because a clock network shows a steady and uniform activity level at every part of the circuit. Although the CMB-variant simulators are relatively insensitive to the effect of load balance and activity level, the hybrid simulators are more favorably influenced, as we can see in Figure  .
  Realmode results
The performance at N 	 
 and the linear speedup for most of the lazier CMB variants
t the sweepmode prediction well The realmode curves dier from the prediction in
that the eager CMBvariant curve is not uniformly worse over all N  and the curve for
the adaptive demanddriven variant worsens more rapidly than predicted These two CMB
variants are not robust in circuits that contain many closed loops where null messages can
circulate because the persistence of the null messages depends on runtime conditions such
as congestion and order of message arrival As a consequence the result of the simulation
can vary signicantly from run to run but when N is small the behavior is more restricted
and the prediction of the sweepmode simulation prevails
The hybrid
 and hybrid curves are similar to those of the multiplier circuit except
these curves show a greater speedup due to better load balance for the clock network Thus
these curves are more similar to those of the multiplier with an enhanced activity level 
there is no signicant initial penalty at N 	  The activity level for this multiplier is more
uniform because a new wave of activities is injected into the multiplier before old ones have
completed The hybrid
 curves atten and bend upward between N 	 
 and   while
the hybrid curves continue straight down as they close in toward the CMBvariant curves
The next set of graphs shows the eect of randomized element distribution The CMB
variant curves have shifted very little but the hybrid
 curves become much shallower and
the hybrid curves show the characteristic upward hump for random element distribution
Real-mode results for an    network
Figure  An  -gate clock network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Figures  and  show the results in regions where there are many more logic elements than nodes. The three additional sets of simulation results use progressively smaller clock circuits; the last one has, on average, one logic gate per node for N =  . As the number of gates is reduced, the speedup achieved by the hybrid simulators is reduced because the advantage that can be obtained from running sequential macro-element simulators decreases. The CMB-variant simulators, which reflect the ratio of null messages to event messages, show very little change relative to the sequential simulator.
The lazy CMB variants are hardy and robust simulators. They show good speedup relative to themselves all the way down to 1 element per node, in a fashion consistent with the sweep-mode prediction.
Real-mode results with random element distribution
Figure  An  -gate clock network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results for a    network
Figure  A  -gate clock network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results for a    network
Figure  A  -gate clock network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results for a    network
Figure  A  -gate clock network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Section  TreeRing Example
 Description
Unlike the multiplier and the clock network, the tree-ring circuit has no identifiable function; it is one of the circuits we invented to test the simulator. It is made of a cycle of 1-to-8 pulse distributors whose outputs are then summed together by a ring of  -input OR-gates. Each 1-to-8 pulse distributor is composed of seven 1-to-2 distributors connected in a tree structure. A test circuit with   distributors appears in Figure  .
Figure  A  -unit tree ring
[Diagram not reproduced.]
Each to pulse distributor has one input and two outputs Pulses appearing at the
distributors input are alternatively passed to one of its outputs Thus a to distributor
spreads the pulses among its eight outputs A to pulse distributor consists of a toggle
	ip 	op made of 
 logic gates and a to demultiplexor made of  logic
123456789
123456789
123456789
123456789
123456789
123456789
123456789
gates
Figure  A to pulsedistributor circuit
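The way seven alternating two-output distributors spread pulses over eight outputs can be illustrated at the behavioral level. The following C sketch is an assumption-laden model, not the gate-level circuit: the type and function names are hypothetical, every toggle is assumed to start at 0, and the numbering of the eight outputs is arbitrary.

```c
#include <assert.h>

/* Hypothetical behavioral model of one 1-to-2 pulse distributor. */
typedef struct { int toggle; } dist2_t;

/* Pass one pulse through; returns the output (0 or 1) that receives it. */
int dist2_pulse(dist2_t *d)
{
    int out = d->toggle;
    d->toggle ^= 1;              /* the next pulse goes to the other output */
    return out;
}

/* 1-to-8 distributor: seven 1-to-2 distributors arranged as a binary tree.
 * node[0] is the root, node[1..2] the middle level, node[3..6] the last. */
typedef struct { dist2_t node[7]; } dist8_t;

void dist8_init(dist8_t *d)
{
    for (int i = 0; i < 7; i++)
        d->node[i].toggle = 0;   /* assumed initial state */
}

/* Pass one pulse through the tree; returns the leaf output (0..7). */
int dist8_pulse(dist8_t *d)
{
    int idx = 0, leaf = 0;
    assert(d != 0);
    for (int level = 0; level < 3; level++) {
        int bit = dist2_pulse(&d->node[idx]);
        leaf = 2 * leaf + bit;   /* accumulate the path as a 3-bit number */
        idx = 2 * idx + 1 + bit; /* child index in the implicit tree */
    }
    return leaf;
}
```

In this model each incoming pulse leaves on exactly one of the eight outputs, and any eight consecutive pulses visit every output once.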
  Simulation results
Sweepmode simulation has not been done for this circuit The graphs on the following
pages are for the simulation of a unit circuit using both systematic and random element
distribution a 
unit circuit a unit circuit and nally a unit circuit Treering circuits
have a lower activity level than the others examined here because only one of the eight
leaves in each unit can be active at any time Accordingly the CMBvariant curves show
an overhead of four to ve octaves relative to the sequential simulation results The CMB
variant speedup is otherwise linear with respect to itself
The hybrid curves are not as smooth as those of the other circuits because each
treering circuit contains two sets of subcircuits with very dierent properties the pulse
distributor and the ring of ORgates Partitioning of the circuit over dierentsized multi
computers produces very dierent locality relations which strongly aect the performance
of the hybrid simulators The eect of locality can also be seen in the simulation with ran
dom element distribution While the hybrid curves for the clock network merely worsen
those for this circuit converge immediately to the CMBvariant curves at N   The
CMBvariant simulator however is not strongly in	uenced by locality
  March  
   Chapter  Additional Performance Results
The CMBvariant curves which are pegged to the ratio of null messages verses event
containing messages show very little change as the size of the circuit is decreased The
hybrid simulator curves show a steady attening in slope and hybrid curves eventually
lose all speedup when there are only 	 gates left in the circuit
Real-mode results for a  -unit network
Figure  A  -gate tree network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results with random element distribution
Figure  A  -gate tree network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results for a  -unit network
Figure  An  -gate tree network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results for a  -unit network
Figure  An  -gate tree network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results for a  -unit network
Figure  An  -gate tree network for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Section  FIFO Loop
 Description
While the clock network example uses a 2D array of cross-connected FIFO controllers, the FIFO loop example uses a circularly connected linear array of FIFO controllers and FIFO registers. (Refer to the figure in the clock network section.) The registers are made of a bank of   cross-coupled latches with clocked inputs. Each latch is made of   logic gates, as shown in Figure  .
Figure  Circuit for one latch
[Diagram not reproduced. Labels: load, clear, D, D, Q, Q.]
Since the design of the controller constrains the FIFO to contain no more than 1 unit of data for every pair of FIFO units, and since we chose to initialize the FIFO loop with alternating data units of all ones and all zeros, the number of FIFO units must be a multiple of four.
 Simulation results
Figure  contains the CMB-variant sweep-mode simulation result using a loop of   FIFO units. The FIFO loop is an example with a lot of usable concurrency. However, unlike the clock network, the lazier simulation variants are not any better than the most-eager simulation variant, evidently because the majority of the circuit loops are found in the cross-coupled latches. Nonessential null messages do not remain long in the cross-coupled latch because the load signal and the reset signal must be long enough for the cross-coupled latch to settle down to a final value. In doing so, the input to one of the cross-coupled latches is held low for a sufficiently long time that all free-running null messages in the cross-coupled latch are eliminated, owing to the nonstrict input condition of the NAND-gates.
Yet there are still essential null messages in the simulation, and the overhead estimate of the sweep-mode simulation is between   and   octaves. The curves should show a linear speedup up to N =   before they start to level off.
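One way to read the nonstrict input condition mentioned above is that a NAND gate with a low input has a known high output no matter what its other input does, so its output can be known further into the future than its least-advanced input would allow. The following C sketch illustrates that reading only; it is an assumption about the mechanism rather than the rule used in the simulators, and the function name, argument layout, and integer time model are hypothetical.

```c
#include <assert.h>

/* Time through which the output of a 2-input NAND is known.
 *   val[i]   : current level of input i (0 or 1)
 *   valid[i] : time through which input i is known to hold val[i]
 *   delay    : propagation delay of the gate (assumed constant)   */
long nand_output_valid_time(const int val[2], const long valid[2], long delay)
{
    /* Strict rule: the output is known only as far as the least-advanced input. */
    long t = valid[0] < valid[1] ? valid[0] : valid[1];
    assert(delay >= 0);
    /* Nonstrict rule: a low input alone forces the output high,
     * so the output is known for as long as that input stays low. */
    for (int i = 0; i < 2; i++)
        if (val[i] == 0 && valid[i] > t)
            t = valid[i];
    return t + delay;
}
```

Under this reading, an input held low for a long time lets the gate's output run ahead without waiting on time information arriving around the latch loop, which would be consistent with free-running null messages in the latch dying out.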
Figure  Sweep-mode CMB-variant simulation of an  -gate FIFO loop
[Plot not reproduced. Axes: log sweeps versus log nodes.]
The realmode CMBvariant curves for the FIFO loop circuit matches the sweepmode
predictions well The curves for the hybrid simulators are also as expected The hybrid
curves atten out and cross over the CMBvariant curves earlier than they do in the previous
examples because the gates in this circuit are under nonstrict input conditions most of the
time and because hybrid simulators are unable to make use of such conditions
One unique characteristic of this circuit is that when the circuit size is reduced to 
FIFO units all three sets of results show evidence that the curves are bending upward at
N   This characteristic is not observed in the sweepmode result and is an indication
that some tight loops are broken up and distributed across node boundaries At N  

  March  
   Chapter  Additional Performance Results
there are  or  elements per node With granularity approaching the number of gates in a
crosscoupled latch a misalignment in a systematic distribution will cause the majority of
the crosscoupled latches to be split across node boundaries
Real-mode results for a  -element loop
Figure  An  -gate FIFO loop for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results with random element distribution
Figure  An  -gate FIFO loop for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results for a  -element loop
Figure  A  -gate FIFO loop for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Real-mode results for a  -element loop
Figure  A  -gate FIFO loop for  s on a Symult
[Plots not reproduced. Panels: CMB-variant, hybrid-1, hybrid-2, all. Axes: log seconds versus log nodes.]
Chapter  Summary
Section  Economy and Performance of a Multicomputer
Multicomputers are appealing because they improve, and with advances in VLSI technology promise to continue to improve, the two most prominent figures of merit of computing systems: performance and economy. Performance is proportional to the processing speed of a machine:
\[ \text{Performance} \propto \text{processing speed} \]
Economy is inversely proportional to the cost of running a program; it is therefore both proportional to the processing speed and inversely proportional to the cost of the machine:
\[ \text{Economy} \propto \frac{\text{processing speed}}{\text{machine cost}} \]
In most cases, performance and economy are at odds with each other because higher speed is achieved by using faster circuits; however, the increase in the machine cost is greater than the increase in the processing speed. In a multicomputer, speed is increased not by having faster circuits, but by having many cooperating computers. Hence it is possible to improve economy by increasing performance without causing a proportionally larger increase in the machine cost.
Figure  Two idealized multicomputer evolution paths
[Diagram not reproduced. Labels: single-processor computer, Path A, Path B.]
Whether one agrees that economy can be improved, however, depends on how one sees the basic premise of multicomputing. Shown in Figure  are two idealized evolutionary paths leading from the same single-node computer. We will, in our idealized model, consider computers to be made entirely of memory, because a fairly fast processor can be built in the area required for a few thousand bytes of fast memory. When we compare two single-processor computers, we compare two collections of memory attached to two identical zero-sized processors. Thus any two single-processor computers in our comparison have the same speed, regardless of their size differences. We will also assume that programs do not take up more memory as they become more distributed.
Along path A, we build an N-node multicomputer by putting together N copies of the single-node computer. Performance has improved by a factor of N because there are now N single-node computers and each is as fast as the original; economy has not changed because the total machine cost has increased by the same factor.
Along path B, the circuitry of a single-node computer is regrouped into N smaller nodes. Performance has improved by a factor of N because each of the N smaller nodes is as fast as the original; economy has also improved by a factor of N because performance has improved while the cost of the machine has remained constant.
These paths A and B also have a strong influence on multicomputer programming. The cost C of running a program in this idealized model is
\[ C = S N T \]
where
S = price per node per unit time (\propto size of the node),
N = number of nodes in the machine,
T = time it takes for the program to complete.
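The effect of the two paths on the cost C = SNT can be checked with a few lines of arithmetic. The following C sketch assumes the idealized model of the text, with the additional assumption of ideal linear speedup (run time T1/N on N nodes); the function names are hypothetical.

```c
#include <assert.h>

/* Cost of running a program: C = S * N * T. */
double run_cost(double s, double n, double t)
{
    assert(s > 0.0 && n >= 1.0 && t > 0.0);
    return s * n * t;
}

/* Path A: N copies of the original node.
 * Node price stays s1; the run time is assumed to drop to t1 / n. */
double path_a_cost(double s1, double t1, double n)
{
    return run_cost(s1, n, t1 / n);
}

/* Path B: the original circuitry regrouped into N smaller nodes.
 * Node price drops to s1 / n; the run time is assumed to drop to t1 / n. */
double path_b_cost(double s1, double t1, double n)
{
    return run_cost(s1 / n, n, t1 / n);
}
```

With s1 = 8, t1 = 64, and n = 4, path A gives a cost of 512, the same as the single-node cost s1 * t1, while path B gives 128, a fourfold reduction. This matches the statement that economy is unchanged along path A and improves by a factor of N along path B.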
When drawn as a 3D log-log-log plot, which we call the cost space, the surfaces of constant cost are given by
\[ \log S + \log N + \log T = \log C \]
Figure  Multicomputer cost space
[Diagram not reproduced. Axes: log S, log N, log T. Labels: P, C plane.]
Constant-cost surfaces, called the C planes, appear as planes perpendicular to the (1, 1, 1) direction vector. Suppose we have an application whose single-node cost is marked by point P in Figure  . If we can find a point that is lower than P for the same application, we have found a point with higher performance; if we can find a point that is on a plane closer to the origin, we have found a point with lower cost.
Figure  Intersection with A plane
[Diagram not reproduced. Left: cost space with axes log S, log N, log T, showing P, the C plane, and the A plane. Right: log T versus log N on the A plane, showing P, a cost-effective curve, a cost-ineffective curve, the attainable region, and the lower-cost region.]
Surfaces corresponding to path A correspond to constant node cost; thus they appear as planes perpendicular to the S axis. We call such a plane an A plane. Figure  shows the A plane containing P. The intersections of an A plane with C planes form lines of slope -1 on the A plane. Since superlinear speedup is impossible by our definition, the grey area shown in Figure  (right) is the possible range of N and T. The cheese area is the range intersected by those C planes that are closer to the origin than the C plane containing P. The non-cheese area (which is the same as the grey area in this case) is the range intersected by those C planes that are further away from the origin. The only way to have the application be cost-effective is for it to exhibit a linear speedup starting at N = 1. Any deviation from linear speedup means that the performance curve of the application has crossed into a C plane that is further away from the origin, and that the program will be more costly to run. In practice, there are many contributing factors to the actual cost of running a program that may more than make up for the inefficiency; but, in the long run, what we can afford to buy and what we are able to build will ultimately determine the performance improvement we can get by adding nodes.
Figure  Intersection with B plane
[Diagram not reproduced. Left: cost space with axes log S, log N, log T, showing P, the C plane, and the B plane. Right: log T versus log N on the B plane, showing P, cost-effective curves, the attainable region, and the lower-cost region.]
Surfaces corresponding to path B appear as planes perpendicular to the (1, 1, 0) direction
vector in (log S, log N, log T) coordinates. We call such a plane a B plane. All points on
a B plane have the same SN product, and correspond to multicomputers with the same total
cost. The plane that contains P is shown in the figure captioned Intersection with B plane.
The intersections of a B plane with C planes form horizontal lines on the B plane. An
application becomes cheaper to run if it shows any speedup relative to the one-node case.
Performance is improved because the time required to perform the computation is reduced.
Cost is reduced because the computation is now on a C plane that is closer to the origin.
The area that is both grey and cheese is the range that is attainable by the application
and where both performance and economy are improved.
In practice neither of the two paths can continue indenitely In path A we are limited
by the maximum physical size of a machine we are able to build and by the amount of
concurrency we can nd in computations In path B we are limited by the minimum
amount of hardware required to construct a node 	 computers are not made entirely of
memory and most programs do take up more memory as they become more distributed
Figure: Two idealized multicomputer evolution paths in the path space (axes: node count
and node size; labels: single-processor computer, paths A and B, ultimate machine)
To continue path A must use smaller and smaller nodes and path B must use more and
more hardware The two paths 
Figure  will eventually meet at the ultimate machine
where all nodes are of a sensibly minimal size and the machine contains as many nodes as
we can assemble in one machine
Section  Overhead and Latency
Along path B, we encounter a series of multicomputers with progressively smaller nodes.
Those with single-board nodes are called the medium-grain multicomputers; examples of
medium-grain multicomputers are the Cosmic Cube, the iPSC/1, the iPSC/2, and the
Symult 2010. Those with single-chip nodes are called the fine-grain multicomputers; an
example of a fine-grain multicomputer is the Mosaic. Due to the reduced node cost when
nodes become smaller and more abundant, the programming emphasis for a multicomputer
shifts from one of achieving a linear speedup to one of exploiting the maximum concurrency.
Since mediumgrain nodes are few and expensive	 the primary goal of programming
such multicomputers is to protably utilize all available CPU cycles Cycles can be lost
to sources in the application itself
 loadimbalance	 extra synchronization	 and insucient
concurrency these internal delays are called overheads Cycles can also be lost to sources
in the system
 message handling	 kernel operation	 and network congestion these external
delays are called latencies In a mediumgrain multicomputer	 overheads and latencies
are countered by employing at least several times more concurrency in the program than
there are nodes in the multicomputer The weak law of large numbers	 together with the
clustering of related elements	 covers most of the problems Nodes are seldom idle because
the chance that all of their elements are blocked is low The cost of message transactions
is low because clustering causes most of the interactions to take place between elements of
the same node
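A rough calculation illustrates why extra concurrency keeps nodes busy. The sketch below
assumes, as a simplification that the text does not make, that each element on a node is
blocked independently and with the same probability. The function name and the value 0.5
are hypothetical.

```python
def idle_probability(p_blocked, elements_per_node):
    """Probability that every element on a node is blocked at the same time,
    assuming independent elements with an identical blocking probability."""
    if not 0.0 <= p_blocked <= 1.0:
        raise ValueError("p_blocked must lie in [0, 1]")
    if elements_per_node < 1:
        raise ValueError("a node needs at least one element")
    return p_blocked ** elements_per_node


if __name__ == "__main__":
    for k in (1, 2, 4, 8):
        print(k, idle_probability(0.5, k))
```

With these assumptions the idle probability falls geometrically with the number of
elements per node, which is the sense in which several times more elements than nodes
covers most of the lost cycles.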
To exploit more concurrency, we must use more nodes in the multicomputer and fewer
program elements in each node. Although we can no longer overwhelm overheads and
latencies by an abundance of concurrency, we no longer have to be obsessed with linear
speedup, because nodes become cheaper as they decrease in size. Instead, programming for
fine-grain multicomputers emphasizes the exploitation of all available concurrency in the
program. Factors that prevent the exploitation of available concurrency are distinguished
from factors that merely require the use of more nodes.
Latencies are factors that can prevent the full exploitation of concurrency. For example,
when a message is delayed en route to a waiting element, the element is blocked and the
program may not progress as fast as it could. Overheads, on the other hand, do not prevent
the full exploitation of concurrency. When an element is blocked waiting for a message
that has not been produced, it is blocked only because the program has less concurrency
than there are nodes. Synchronization operations, such as the use of null events in the
conservative discrete-event simulators, are also overheads. They keep more of the nodes
busy without interfering with the exploitation of concurrency in the system being simulated.
An element with unconsumed normal events may still be blocked awaiting a null event. If
the required null event has been produced and sent, we would attribute the blockage to
message latency; if the null event has not been produced, then we would attribute the
blockage to lack of concurrency.
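The attribution rule for a blocked element can be written as a single test. This is a
minimal sketch; the names Blockage and classify_blockage are hypothetical and are not taken
from the simulators described in this thesis.

```python
from enum import Enum


class Blockage(Enum):
    LATENCY = "message latency"
    OVERHEAD = "lack of concurrency"


def classify_blockage(message_already_sent):
    """Attribute a blocked element's wait to latency or to overhead.

    message_already_sent: True if the awaited message (a normal or a null
    event) has been produced and sent but has not yet arrived.
    """
    return Blockage.LATENCY if message_already_sent else Blockage.OVERHEAD


if __name__ == "__main__":
    print(classify_blockage(True).value)
    print(classify_blockage(False).value)
```

Only the first case can be improved by a faster message system; the second reflects the
amount of concurrency in the program.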
Section  Fine-Grain Multicomputer Programming
To fully exploit the concurrency of a program, we must remove all latencies and overheads.
Overheads can be mitigated by putting one program element in each node, but latencies
can only be reduced by careful hardware and software design.
On the hardware side, message latency can be reduced with high-speed routers. These
routers move messages in the network via a modified form of circuit switching called
wormhole or cut-through routing, which moves a message one step through the network in a
time comparable to one memory cycle. Since a router is able to store and fetch messages at
a rate close to the bandwidth of the memory, sending a message from one node to any other
node is comparable to copying the same message from one buffer to another buffer.
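A first-order model shows why cut-through routing makes distance nearly irrelevant. The
formulas below are the usual textbook approximations for an uncongested network, not
measurements of the machines discussed here; the parameter names and the sample numbers
are hypothetical.

```python
def store_and_forward_latency(length_bytes, hops, bytes_per_cycle=1.0):
    """Each intermediate node receives the whole message before forwarding it."""
    return hops * (length_bytes / bytes_per_cycle)


def cut_through_latency(length_bytes, hops, bytes_per_cycle=1.0, cycles_per_hop=1.0):
    """The head of the message advances one hop per step and the body streams
    behind it, so the hop count adds only a small term to the transfer time."""
    return hops * cycles_per_hop + length_bytes / bytes_per_cycle


if __name__ == "__main__":
    print(store_and_forward_latency(256, 10))
    print(cut_through_latency(256, 10))
    print(cut_through_latency(256, 1))
```

In this model a 256-byte message crossing ten hops costs little more than the same message
crossing one hop, which is the sense in which a send resembles a buffer-to-buffer copy.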
On the software side, we must, without giving up generality, provide the thinnest cushion
possible between the processes and the hardware. The Reactive Kernel and a fine-grain,
lightweight programming environment such as Reactive-C or Cantor make an ideal
combination, because the program is never further than one function call away from the
system. The execution units for these programming environments, especially the more
restricted ones like Cantor, are small enough that nearly all of the concurrency in the
program can be exploited.
We have aimed in the direction of fine-grain multicomputers in all of our research, and
our work on discrete-event simulation is no exception. The CMB-variant simulator is
ideally suited for fine-grain machines because it is written in a fine-grain notation and
is able to fully exploit the concurrency of the system it simulates. The simulator takes
on a large overhead at N = 1, but this overhead does not prevent the simulation from
attaining a large speedup at a large N. In many of the logic circuits we tested,
near-linear speedup continues until there are only two or three elements in each node.
Since the CMBvariant simulator does not use any special techniques to reduce the over
head on a mediumgrain multicomputer the qualities that contribute to the performance
characteristics of the simulator persist as the simulation becomes more distributed The
hybrid simulators were created to demonstrate the eect of those techniques The overhead
is reduced when N is small but the eect of these techniques vanishes and the performance
converges to that of the CMBvariant simulator when N is large
Section  The Next Frontier
We have fully dispersed all available concurrency in a discrete-event simulation program
when we put one element on each node. If there were more nodes in a multicomputer than
elements in the simulation, we would not be able to utilize those leftover nodes. However,
we can still change the program to one that contains more concurrency. In a medium-grain
multicomputer, where it is necessary to use concurrency to overwhelm latencies and
overheads, rollback simulators such as Time Warp seek to produce additional concurrency
by computing on speculation.
The memory in each node of a fine-grain multicomputer is insufficient for storing the
previous states of its element in a rollback simulator. However, when there are more nodes
than elements, previous states can be stored on unused nodes. When an element has reached
a synchronization point where its future is to be decided by a message that has yet to
arrive, the element picks a possible outcome and ships a copy of its old self to an unused
node for storage. Alternatively, the element can make a copy of its new self, which it
spawns and runs on an unused node. But rather than becoming dormant, the old self can
continue to run and produce more copies until all possible outcomes have been exhausted.
This is the concurrent branch-and-bound simulator; it is the next frontier to be explored
when a fine-grain multicomputer becomes available.
Chapter  Bibliography
  GA Agha Actors A Model of Concurrent Computation in Distributed Systems
MIT Press 
 	 WC Athas and CL Seitz Multicomputers MessagePassing Concurrent
Computers IEEE Computer August 
 
 CL Seitz J Seizovic and WK Su The C Programmers Abbreviated Guide to
Multicomputer Programming CaltechCSTR 
  WK Su R Faucette and CL Seitz C Programmers Guide to the Cosmic Cube
Caltech CS DF 
  J Seizovic The Reactive Kernel CaltechCSTR 
  GM Birtwhistle OJ Dahl B Myrhaug and K Nygaard Simula Begin
Petrocelli New York 

  Dan Ingalls The Smalltalk  Programming System Design and Implementation
Proceedings of the Fifth ACM Conference on Principles of Programming Systems
Janurary 
  CAR Hoare Communicating Sequential Processes CACM 	 August

  CR Lang The Extension of ObjectOriented Language to a Homogeneous
Concurrent Architecture CaltechCSTR May 	
  InMos Ltd The Occam Programming Manual PrenticeHall 
  William J Dally VLSI Architecture for Concurrent Data Structure Caltech CS
	TR 
  RE Bryant Simulation of Packet Communication Architecture Computer Systems
MITLCSTR		 November 

  KM Chandy and J Misra Distributed Simulation A Case Study in Design and
Verication of Distributed Programs IEEE Software Engineering September 


  DR Jeerson Virtual Time ACM Transactions on Programming Languages and
Systems  July 
	
  WC Athas FineGrain Concurrent Computations Caltech CS TR	 
	
  Donald E Knuth The Art of Computer Programming V Sorting and Searching
AddisonWesley 

  MR Garey and DS Johnson Computers and Intractability A Guide to the
Theory of NPCompleteness WH Freeman and Company 


 	 AJ Martin A MessagePassing Model for Highly Concurrent Computation
Caltech CSTR		 
		
 
 M Schuster RE Bryant and D Whiting MOSSIM II A SwitchLevel Simulator
for MOS VLSI Users Manual Caltech CS TR	 
	