Reactive-Process Programming and Distributed Discrete-Event Simulation by Su, Wen-King
Reactive-Process Programming 
and 
Distributed Discrete-Event Simulation 
Thesis by 
Wen-King Su 
In Partial Fulfillment of the Requirements 
for the Degree of 
Doctor of Philosophy 
California Institute of Technology 
Pasadena, California 
1990 








To my thesis advisor, Dr. Charles L. Seitz, whose care and dedication made it 
all possible. 
To my committee members, Dr. Charles L. Seitz, Dr. Mani Chandy, 
Dr. Alain Martin, Dr. Brad Sturtevant, and Dr. Eric Van de Velde, for 
their careful review and analysis of my research. 
To our technical editor, Dian De Sha, who spent glorious days and nights 
tracking and hunting the blunders and blemishes in my writing. 
To our operations manager, Arlene DesJardins, who takes care of every little 
day-to-day detail and makes the department feel like a nice big family. 
To my peers, Bill Athas, Bill Dally, John Ngai, and Craig Steele, for their help 
and advice. 
To my junior co-workers, Nanette Boden, Charles Flaig, Glenn Lewis, 
Mike Pertel, and Jakov Seizovic, for their feedback and support. 
To our system managers, Don Speck, Chris Lee, and Joe Beckenbach, for 
keeping our machines running smoothly. 
To our guests from abroad, Sven Mattisson and Lena Peterson, for their 
enthusiasm and friendship. 
To my advisors at UC Davis, Dr. Wen C. Lin of EE/CS and 
Dr. George E. Bruening of BioChem, for my enlightenments. 
To my buddies from UC Davis, Glenn Saito and John Bakos, for their help in 
my college years. 
To my teachers and counselor at Casa Roble High School, Mr. Gomez, 
Dr. Smithson, Mr. Hoffman, Mr. Scalatta, Mr. Pickard, Mr. Hellen, 
Mrs. Sproul, and Mrs. Cruzen, who worked to keep me involved in school. 
To Xerox Corporation for supporting this work through a Xerox 
special-opportunity fellowship. 
To my parents, who endured many difficult times to bring me here and to 
raise me in this land of opportunity. 
And to Freedom and Liberty; 
sacred to our very heart and soul, yet sadly denied to so many. 
The research described in this thesis was sponsored in part by the Defense Advanced 
Research Projects Agency, DARPA Order number 6202; and monitored by the Office 
of Naval Research under contract number N00014-87-K-0745. 
Abstract 
The same forces that spurred the development of multicomputers - the demand for 
better performance and economy - are driving the evolution of multicomputers in 
the direction of more abundant and less expensive computing nodes - the direction 
of fine-grain multicomputers. This evolution in multicomputer architecture derives 
from advances in integrated circuit, packaging, and message-routing technologies, 
and carries far-reaching implications in programming and applications. This thesis 
pursues that trend with a balanced treatment of multicomputer programming and 
applications. First, a reactive-process programming system - Reactive-C - is 
investigated; then, a model application - discrete-event simulation - is developed; 
finally, a number of logic-circuit simulators written in the Reactive-C notation are 
evaluated. 
One difficulty in multicomputer applications is the inefficiency of many
distributed algorithms compared to their sequential counterparts. When better
formulations are developed, they often scale poorly with increasing numbers of nodes,
and their beneficial effects eventually vanish when many nodes are used. However,
rules for programming are quite different when nodes are plentiful and cheap: The
primary concern is to utilize all of the concurrency available in an application, rather
than to utilize all of the computing cycles available in a machine. We have shown in
our research that it is possible to extract the maximum concurrency of a simulation
subject, even one as difficult as a logic circuit, when one simulation element is
assigned to each node. Despite the initial inefficiency of a straightforward algorithm,
as the number of nodes increases, the computation time decreases linearly until
there are only a few elements in each node. We conclude by suggesting a technique
to further increase the available concurrency when there are many more nodes than
simulation elements.
Contents 
List of Figures 
List of Program Listings 
1 Introduction
1.1 Motivation
1.2 History
1.3 Outline
2 Reactive-Process Programming
2.1 Definition of a Reactive Process
2.2 Reactive-C Programming System
3 Reactive-Process Layers
3.1 Simple Layers
3.1.1 The bottom layer (b-layer)
3.1.2 The length-carrying layer (l-layer)
3.1.3 The non-blocking-receive layer (nb-layer)
3.1.4 Handler layering
3.2 Message Type
3.3 Discretion on Receive
3.3.1 Discretion using b-layer functions
3.3.2 The RPC-discretion layer (r-layer)
3.3.3 The CSP-discretion layer (csp-layer)
3.3.4 A more general type-discretion layer (t-layer)
3.4 Other Layers
3.4.1 A flow-controlling layer (f-layer)
3.4.2 The CK primitives



























3.5 Layering on Light-Weight Processes
4 Cosmic Environment
4.1 The Cosmic Environment Specification
4.2 Our Cosmic Environment Implementation
4.2.1 Structure of our CE implementation
4.2.2 Cosmic Environment exterior
4.2.3 Cosmic Environment processes
4.2.4 Program compilation
4.2.5 Spawning programs
4.2.6 Data representation and conversion
5 Model of Simulation
5.1 Mathematical Framework and Analysis
5.1.1 Systems and elements
5.1.2 States and time
5.1.3 Knots and progress
5.1.4 Rules of thumb - sufficient conditions for progress
5.1.5 Non-existence of necessary and sufficient progress conditions
5.1.5.1 Simulation and Boolean satisfiability
5.1.5.2 Simulation and simultaneous equations
5.2 Operational Framework
5.2.1 Breaking a simulation into smaller slices
5.2.2 Slices and knots
5.2.3 Implementation considerations
5.3 The Generic Simulator Model and Its Derivatives
5.3.1 Message-driven simulation



























5.3.3 Sequential simulator
5.3.4 Concurrent backtracking simulators
5.3.5 Branch-and-bound simulators
5.3.6 Time-driven simulators
5.3.7 Summary
6 Logic-Circuit Simulator Experiments
6.1 Why Logic Circuits?
6.2 CMB-Variant Simulator
6.2.1 The element simulators
6.2.2 The simulator message system
6.2.3 The variants
6.2.4 Variant algorithms
6.2.5 Instrumentation
6.2.6 Experimental results
6.3 Sequential Simulator
6.3.1 Sequential simulator mechanism
6.3.2 Hazards in sequential simulators
6.3.3 Instrumentation
6.3.4 Big multiplier results
6.3.5 Small multiplier results
6.3.6 Circuit topology vs. activity level
6.3.7 Hybrid possibilities
7 Hybrid Simulators
7.1 Coordinated Sequential Simulator (Hybrid-1)
7.1.1 The algorithm
7.1.2 Sorting with a different key
7.1.3 The simulator mechanism
7.1.4 The simulator output
7.1.5 Expectation
7.1.6 Experimental results
7.2 Progressive Hybrid Simulator (Hybrid-2)
7.2.1 The mechanism
7.2.2 Experimental results
8 Additional Performance Results
8.1 2-D Clock Network
8.1.1 Description
8.1.2 Sweep-mode results
8.1.3 Real-mode results
8.2 Tree-Ring Example
8.2.1 Description
8.2.2 Simulation results
8.3 FIFO Loop
8.3.1 Description





Economy and Performance of a Multicomputer 
Overhead and Latency 
Fine-Grain Multicomputer Programming 


























List of Figures 
1.1 Block diagram of a multicomputer
2.1 Possible behavior of a reactive process
2.2 Representation of a process
2.3 Operation of a Reactive-C kernel
2.4 Specification of the factorial process
2.5 The divide step
2.6 The combine step
3.1 Mapping a binary tree to a multicomputer
3.2 Process structure comparison
3.3 Structure of an l-layer message buffer
3.4 An example of a FIFO queue
3.5 Expansion steps in the merge-sort program
3.6 Giving away a list for the third time (stack grows up)
3.7 Getting an out-of-sequence reply
3.8 Structure of a channel in a channel-based CSP implementation
3.9 Control flow for heavy-weight processes
3.10 Control flow for light-weight processes
4.1 Elements of a computation
4.2 A process group
4.3 Partitioning into two parts
4.4 A multicomputer shared by two users
4.5 Host message-system implementation
4.6 Cosmic Environment with unified resource management
5.1 Representation of a system
5.2 Representation of a system composed of elements



























5.4 Arc source and destination
5.5 Element inputs and outputs
5.6 Arcs a0-4 form a path of length 5
5.7 Arcs a0-4 form a circuit of length 5
5.8 Example of a knot-containing system
5.9 A circuit to evaluate satisfiability of a set of clauses
5.10 Mapping equations into physical system
5.11 Element-simulator operation for an element with a non-zero delay
5.12 Element-simulator operation for an element with a zero delay
5.13 A system that contains all three types of slices
5.14 Representation of an arc
5.15 Replacing tape by messages
5.16 Example of deadlock in an event-driven simulation
5.17 Model of a sequential simulator
5.18 A researcher submitting a grant
5.19 Comparison between three simulators
5.20 An example of a continuous system
6.1 A logic circuit whose behavior is different from its Boolean network
6.2 A number of circuit simulators and their relationship
6.3 Domain of the generic simulator model
6.4 Process structure and a simple example of connectivity
6.5 A sample circuit and a possible mapping to a multicomputer
6.6 Structure of a sweep-mode simulation
6.7 Structure of a real-mode simulation
6.8 Three phases of the oscillating multiplier
6.9 A 1376-gate multiplier, sweep-mode





























6.11 A 1376-gate multiplier for 40µs on an iPSC/2
6.12 A 1376-gate multiplier for 40µs on an iPSC/1
6.13 Combining the iPSC/2 and iPSC/1 graphs with sequential timing aligned
6.14 A 1376-gate multiplier for 100µs on a Symult 2010
6.15 A 116-gate multiplier for 100µs on an iPSC/1
6.16 A 116-gate multiplier for 100µs on an iPSC/2
6.17 A 116-gate multiplier for 400µs on a Symult 2010
6.18 Effect of increased latency on simulation performance
6.19 A 1376-gate multiplier for 100µs on a Symult 2010 - fast oscillation
6.20 Modified Laffer Curve
7.1 An event that invalidates another event
7.2 Layering in the hybrid-1 simulator
7.3 Expected performance of the hybrid-1 simulator
7.4 A 1376-gate multiplier for 100µs on a Symult 2010
7.5 A 1376-gate multiplier for 100µs on a Symult 2010 with random placement
7.6 A faster-oscillating 1376-gate multiplier for 100µs on a Symult 2010
7.7 A 1376-gate multiplier for 100µs on a Symult 2010
7.8 A 1376-gate multiplier for 100µs on a Symult 2010 with random placement
7.9 A faster-oscillating 1376-gate multiplier for 100µs on a Symult 2010
7.10 A 116-gate multiplier for 400µs on a Symult 2010
8.1 A FIFO consisting of 4 units
8.2 A C-element FIFO consisting of 4 units
8.3 A 3 × 4 array of self-oscillating FIFO units
8.4 Sweep-mode CMB-variant simulation of an 1841-gate clock network
8.5 An 1841-gate clock network for 50µs on a Symult 2010
8.6 An 1841-gate clock network for 50µs on a Symult 2010









8.8 A 241-gate clock network for 200µs on a Symult 2010
8.9 A 65-gate clock network for 500µs on a Symult 2010
8.10 A 12-unit tree ring
8.11 A 1-to-2 pulse-distributor circuit
8.12 A 1142-gate tree network for 50µs on a Symult 2010
8.13 A 1142-gate tree network for 50µs on a Symult 2010
8.14 An 857-gate tree network for 70µs on a Symult 2010
8.15 A 572-gate tree network for 100µs on a Symult 2010
8.16 A 287-gate tree network for 200µs on a Symult 2010
8.17 Circuit for one latch
8.18 Sweep-mode CMB-variant simulation of a 1067-gate FIFO loop
8.19 A 1067-gate FIFO loop for 100µs on a Symult 2010
8.20 A 1067-gate FIFO loop for 100µs on a Symult 2010
8.21 A 459-gate FIFO loop for 100µs on a Symult 2010
8.22 A 155-gate FIFO loop for 200µs on a Symult 2010
9.1 Two idealized multicomputer evolution paths
9.2 Multicomputer cost space
9.3 Intersection with A-plane
9.4 Intersection with B-plane
9.5 Two idealized multicomputer evolution paths in the path space

6.16 Sequential-simulator main loop as a light-weight process
6.17 A SEND_EVENT function that reduces glitches
7.1 Hybrid-1 main loop
7.2 Hybrid-1 embedded message system
7.3 Generic logic-gate handler for hybrid-2








Chapter 1 Introduction 
Section 1.1 Motivation 
Advances in applications, programming methods, and computer architectures are
inextricably intertwined. Architectures and programming methods develop in response
to demands from applications; they also give rise to new applications. Simulation is
an application that contributes to and benefits from the development of faster and
more economical computers. Discrete-event simulation can produce a broad variety of
interaction patterns and timing relationships; it is, therefore, a model application
for the study of multicomputers and reactive-process programming. This research is a
study of both reactive-process programming and distributed discrete-event simulation
on multicomputers.
Figure 1.1 Block diagram of a multicomputer: N computing "nodes" (C1, C2, C3, ..., CN) attached to a communication network.
A multicomputer (Figure 1.1) is composed of a collection of node computers connected
to each other via a message-passing network. Multicomputers can be divided into three
categories by their node size:
Category      Node Size      Memory per Node  N           Examples
Coarse-grain  cabinet        ≈64MB            4-64        Network of supercomputers
Medium-grain  circuit-board  ≈2MB             16-4096     iPSC, NCUBE, Symult 2010
Fine-grain    chip           ≈16KB            1024-65536  Mosaic
Each node has its own private memory that is not directly accessible by other nodes, and
each node can contain multiple processes. Processes on different nodes run asynchronously;
processes within a single node are interleaved to produce the same effect as if they were in
different nodes. Communication between processes is performed via message passing.
Section 1.2 History 
Simulation and programming have long influenced each other. Although one can argue that
every computation is, in fact, a simulation of some physical or abstract process, the first
effort to provide a programming system for discrete-event simulation was the development of
Simula [6], which was based on the Algol programming language. Discrete-event simulation
operates on a system of components (physical processes) that interact by discrete actions.
Structured languages such as Algol permit the modular representation of these components.
As such languages became available, discrete-event simulation techniques began to emerge
from the traditional event-list-oriented simulation techniques. Each Simula module contains
its own set of private data and procedures, and is, in effect, a process that interacts with
others to perform a simulation.
Although it was initially conceived as a simulation language, Simula became a
general-purpose, object-oriented, multiple-process programming language. The assimilation of
object-oriented and multiple-process programming concepts led to the development of CSP
[8], Smalltalk [7], and other systems that are more closely identified with programming.
Although Smalltalk was created to make programming simple, its programming model also
gave it the potential for concurrent operation of its objects. CSP was created to study
and unify diverse distributed programming constructs by using concurrent processes and
synchronous messages. Smalltalk and Simula are both object-oriented systems; CSP
includes the concept of independent, interacting processes without the distraction of such
object-oriented concepts as inheritance.
Multicomputer implementations for variants of Simula [9] and Smalltalk [11] were shown
to be feasible and useful. Occam [10], a CSP variant with static interprocess communication
graphs, provided a programming system for transputer-based multicomputers. However,

Time System. The idea was to save the state of a process whenever the process encounters
a synchronization point; then, instead of blocking the process until the synchronization is
complete, to have the process select a possible outcome and continue to execute. When
the synchronization is finally complete, if the outcome differs from the selected outcome,
the process and all those that it has since affected are rolled back, and process execution
restarts at the synchronization point.
Methods for reducing overhead were studied intensively because nodes in a medium-grain
multicomputer are few and expensive. However, as multicomputers evolve toward
their next incarnation - the fine-grain multicomputers - nodes become abundant and
cheap. With a myriad of single-chip nodes, fine-grain multicomputers promise significantly
better cost-vs.-performance ratios and total computing capacity than do the medium-grain
multicomputers. The Mosaic C, currently being developed at Caltech, is an example of
a fine-grain multicomputer. While each node of the Mosaic C contains a 16-bit CPU, a
message router, and only 16 Kbytes of RAM, the entire Mosaic C will contain 16K nodes.
A number of fine-grain, reactive-process-based programming languages have been devel-
oped in anticipation of the fine-grain multicomputer. Among them is the Cantor notation, 
which most strongly influenced the programming methods used in this research. (Can-
tor is being developed by W.C. Athas [15] using a model similar to the Actor notation 
[1].) Reactive-process programming systems are similar to CSP, but impose additional con-
straints on the operation of the processes in order to simplify the operating systems of 
the fine-grain multicomputers. Cantor also allows us to express programs in finely divided 
objects that are distributed over many small nodes. 
The inversion of the cost ratio between the processor and the memory forms a new set
of ground rules for multicomputer programming. The shifting focus has strong implications
for programming in general: The memory, rather than the processor, is now the scarce
commodity. Programming techniques that buy speed by using a large number of idle memory
cells are no longer favorable, but ones that buy speed by using idle processors are. Instead
of trying to have something useful happen in every available CPU cycle in the machine,
application writers should now focus on extracting as much concurrency as possible from
the application.
In this experiment, the concept of fine-grain, reactive-process programming influenced
simulation. The overhead that prompted the development of optimistic approaches for
medium-grain multicomputers was recast in a more benign role. Having this overhead
merely required the use of a larger number of inexpensive processors in the multicomputer,
and did not reduce the amount of concurrency that could be extracted from the system being
simulated. A programming system similar to Cantor was developed for this research, and
a number of conservative simulators suitable for fine-grain multicomputers were developed.
Section 1.3 Outline 
Since this research is a study of both programming and simulation, this thesis is divided into
two major parts: Chapters 2 through 4 deal with programming, and Chapters 5 through 8
deal with simulation. The two parts are only loosely interdependent, and do not reflect the
extensive two-way influence that exists between simulation and programming. For example,
the lazy-evaluation model of simulation guided us in the design of the x-primitives, which
are the message-handling functions of our reactive-process programming system; and the
support mechanisms in the simulator were modeled after the mechanisms of the Reactive-C
programming system.
Chapter 2 introduces reactive-process programming and the Reactive-C implementation
of its basic mechanisms. Reactive-C is merely the ordinary C programming language used
with a particular programming discipline. It is useful for exposing the simplicity of
reactive-process programming systems - a level of simplicity that is necessary for any programming
system for fine-grain multicomputers. It is not the best tool, however, for studying
reactive-process programming. Therefore, a slightly higher-level programming system is used in
Chapter 3 to demonstrate the generality and simplicity of reactive-process programming.
Chapter 4 describes the Cosmic Environment, a programming environment that embodies
the reactive-process programming discipline.
The discussion of simulation begins in Chapter 5 with the model of simulation. The
subject system being simulated is recursively defined to be a collection of interacting systems
or elements, and elements are simulated by a set of simulators that interact by
message-passing. The condition for progress is discussed in detail, a generic simulator is described,
and the derivation of a variety of simulators is shown. Chapter 6 describes a direct
implementation of the generic simulator using the Reactive-C notation. Logic circuits are the
subject of choice, because they are diverse and because they expose properties of the
simulators by imposing few processing requirements of their own. The performance we observed
is shown to be that which was expected: The time required for a simulation decreases
linearly as the number of computing nodes increases. Comparing the performance to the
sequential simulator shows that the overhead does not interfere with the ability to utilize
the concurrency available in the system. Chapter 7 introduces new simulators that do not
have an overhead when only one node is used. However, the speed increase is no longer
linear: Performance converges to that of the previous simulator as more nodes are used.
Although only one test circuit was used throughout these two chapters, additional results
on a few other circuits are presented in Chapter 8. The results are all similar, even though
the circuits being simulated are quite different.
Finally, Chapter 9 defends the rationale for simulation on fine-grain multicomputers,
and discusses some of its implications on programming and simulation.
Chapter 2 Reactive-Process Programming 
Reactive-process programming is a discipline in which processes are inactive until they are
triggered by inputs. When suitable inputs are present, a process and its inputs will react
in a single atomic action in which the inputs are consumed. Reactive-process programs can
be written in specifically designed notations such as Cantor; they can also be written in
vanilla notations such as C. Although Cantor hides many rough edges to make programming
simpler, C is perhaps better in exposing the mechanics of reactive-process programming. We
will use C for our discussion, and assume that readers are familiar with C.
A reactive-process program can be written as a simple combination of data structure
and function, as a full-fledged heavy-weight process with its own process context, or as a
complex multi-tasking operating system. The diversity arises from a small and elegant set
of properties that allows reactive-process programming systems with very different
capabilities to be built on top of one another in a consistent manner. Since the tailoring of a
programming system to specific requirements is made simple, an application no longer has
to be twisted around the system; instead, the system can be crafted to suit the intrinsic
needs of the application.
In this chapter, we will describe reactive-process programming in its simplest form; the
next chapter will be devoted to examples of building more-complex programming systems
on top of simpler ones.
Section 2.1 Definition of a Reactive Process 
A reactive process can be characterized by its two run-states:
Waiting: While a process is waiting, it is completely inert. The process will remain
in the waiting state as long as there is no message ready for it to receive;
otherwise, the process will be run, taking the earliest-arriving message as its
input.
Running: While a process is running, it cannot receive any more messages. A process
can run for only a finite period of time before it returns to the waiting state.
While a process is running, it can:
a. modify its internal state,
b. send messages,
c. instantiate other processes, or
d. self-destruct.
Message buffers remain attached to a process until they are explicitly released
by the process.
Figure 2.1 Possible behavior of a reactive process: over time, the process state alternates between wait and run.
The reactive-process programming environment has these additional properties:
1. Processes do not exist until they are instantiated.
2. Processes persist until they self-destruct.
3. Each process has a unique process ID.
4. Messages are addressed by the destination-process ID.
5. Message order between any pair of processes is preserved.
6. Messages not immediately consumed are queued.
7. Messages with a valid destination will eventually be delivered.
8. Message buffers are allocated by calling an allocate function.
9. Message buffers can be released by calling either a deallocate or a send function.
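Properties 5, 6, 8, and 9 can be illustrated with a small first-in, first-out mailbox. The following is only a sketch under stated assumptions: the names MSG, QUEUE, msg_alloc, msg_send, msg_get, and fifo_demo are hypothetical and are not part of Reactive-C, and a single queue stands in for the whole message system.

```c
#include <assert.h>
#include <stdlib.h>

/* All names below are hypothetical; they are not the Reactive-C API. */
typedef struct msg {
    struct msg *next;   /* link used only while the buffer is queued */
    int dest;           /* destination-process ID (property 4) */
    int payload;        /* stand-in for the message body */
} MSG;

typedef struct { MSG *head, *tail; } QUEUE;

/* Property 8: message buffers come from an allocate function. */
MSG *msg_alloc(int dest, int payload)
{
    MSG *m = malloc(sizeof *m);
    if (m == NULL) abort();
    m->next = NULL;
    m->dest = dest;
    m->payload = payload;
    return m;
}

/* Property 9: sending releases the buffer from the sender.
   Appending at the tail preserves order (5) and queues the
   message until it is consumed (6). */
void msg_send(QUEUE *q, MSG *m)
{
    m->next = NULL;
    if (q->tail) q->tail->next = m; else q->head = m;
    q->tail = m;
}

/* Returns the earliest-arriving message, or NULL if none is waiting. */
MSG *msg_get(QUEUE *q)
{
    MSG *m = q->head;
    if (m) {
        q->head = m->next;
        if (q->head == NULL) q->tail = NULL;
        m->next = NULL;
    }
    return m;
}

/* Sends payloads 1, 2, 3 and packs them, in delivery order,
   into one decimal number. */
int fifo_demo(void)
{
    QUEUE q = { NULL, NULL };
    int digits = 0;
    MSG *m;
    msg_send(&q, msg_alloc(7, 1));
    msg_send(&q, msg_alloc(7, 2));
    msg_send(&q, msg_alloc(7, 3));
    while ((m = msg_get(&q)) != NULL) {
        digits = digits * 10 + m->payload;
        free(m);        /* property 9: explicit deallocate */
    }
    return digits;
}
```

Calling fifo_demo() returns the payloads packed in the order they were delivered, which matches the order in which they were sent.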
Section 2.2 Reactive-C Programming System
Reactive-C is a minimalist implementation of a reactive-process programming environment
using the C programming language. As shown in Figure 2.2, a process in Reactive-C is
represented by a process structure that includes two pointers: a function pointer and a
data pointer. The function pointer references a C function, the current entry function of
the process. The entry function is called when a process is run.
Figure 2.2 Representation of a process: the process structure holds an entry pointer, which references the current entry function of the process within a set of functions, and a data pointer.
The data pointer references an arbitrary data structure maintained by the process.
Both the data structure and the two pointers are state variables of the process that owns
them, and the process can modify them at any time while it is running. When a process
starts to run, the triggering message and the process structure are passed to the entry
function as function arguments. A process returns to the waiting state by returning from








Figure 2.3 Operation of a Reactive-C kernel: getmessage(); identify_process(mesg); (*proc->entry)(proc,mesg);
Listing 2.1 is a sample kernel loop of the Reactive-C programming environment. As
shown in Figure 2.3, the kernel repeatedly gets a message from the message queue, identifies
the receiver, and calls the entry function of the receiving process.
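kernel_loop()
{
    char *mesg;
    PROC *proc;

    while(1)
    {
        mesg = getmessage();
        /* identify the receiver and call its entry function (Figure 2.3) */
        proc = identify_process(mesg);
        (*proc->entry)(proc,mesg);
    }
}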




Listing 2.1 Kernel of Reactive-C programming environment. 
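To make the kernel operation concrete, here is a self-contained toy version. It is a sketch under stated assumptions rather than the Reactive-C kernel itself: the MESG header with a dest field, the fixed-size queue, toy_send, adder, and kernel_demo are all hypothetical, and the loop returns when the queue is empty (instead of looping forever) so that it can be exercised in a test.

```c
#include <assert.h>
#include <stddef.h>

#define MAXQ 16

typedef struct proc PROC;
typedef void (*ENTRY)(PROC *proc, char *mesg);
struct proc { ENTRY entry; char *data; };  /* function pointer + data pointer */

/* Hypothetical message header: the toy kernel finds the receiver here. */
typedef struct { PROC *dest; int value; } MESG;

static MESG *queue[MAXQ];
static int q_head, q_tail;

/* Queue a message for later delivery (no wrap-around in this toy). */
void toy_send(MESG *m)
{
    assert(q_tail < MAXQ);
    queue[q_tail++] = m;
}

static char *getmessage(void)
{
    return q_head < q_tail ? (char *)queue[q_head++] : NULL;
}

static PROC *identify_process(char *mesg)
{
    return ((MESG *)mesg)->dest;
}

/* Same three steps as the kernel loop in the text, except that
   this version returns once the queue has drained. */
void kernel_loop(void)
{
    char *mesg;
    while ((mesg = getmessage()) != NULL) {
        PROC *proc = identify_process(mesg);
        (*proc->entry)(proc, mesg);
    }
}

/* Example entry function: adds each message value to an int
   that the process owns through its data pointer. */
void adder(PROC *proc, char *mesg)
{
    *(int *)proc->data += ((MESG *)mesg)->value;
}

/* Creates one adder process, sends it two messages, runs the kernel. */
int kernel_demo(void)
{
    int total = 0;
    PROC p = { adder, (char *)&total };
    MESG a = { &p, 40 }, b = { &p, 2 };
    q_head = q_tail = 0;
    toy_send(&a);
    toy_send(&b);
    kernel_loop();
    return total;
}
```

The process here is nothing more than a function pointer plus a data pointer, which is the point the text makes about the simplicity of the representation.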
Listing 2.2 contains an example of a reactive-process program that computes a factorial in
logarithmic time on an arbitrarily large machine.
 1 typedef struct { REF ID; int HI, LO; } FAC_DATA;

 3 fac_1(proc,mesg)
 4 RC_PROC *proc; FAC_DATA *mesg;
 5 {
 6     FAC_DATA *mesg2;

 9     if(mesg->HI <= mesg->LO)

16         half = (mesg->HI + mesg->LO)/2;

18         mesg2 = (FAC_DATA *) rc_malloc(sizeof(FAC_DATA));
19         mesg2->ID = rc_myid();
20         mesg2->HI = mesg->HI;
21         mesg2->LO = half+1;
22         rc_spawn(fac_1, mesg2);

           mesg2 = (FAC_DATA *) rc_malloc(sizeof(FAC_DATA));
           mesg2->ID = rc_myid();
           mesg2->HI = half;

           proc->data = (char *) mesg;
           proc->entry = fac_2;
       }
   }

35 fac_2(proc,mesg)

       ((FAC_DATA *)(proc->data))->LO = mesg->LO;
       rc_free(mesg);
       proc->entry = fac_3;

43 fac_3(proc,mesg)
44 RC_PROC *proc; FAC_DATA *mesg;
45 {
46     ((FAC_DATA *)(proc->data))->LO *= mesg->LO;
47     rc_free(mesg);
48     rc_send(((FAC_DATA *)(proc->data))->ID, proc->data);
49     rc_exit();
50 }
Listing 2.2 Reactive-C factorial program. 
The three functions in Listing 2.2 (fac_1, fac_2, and fac_3) are in a suitable form for
entry functions because their arguments are the process structure and the input message,
and because they are assured to return in finite time. However, they do not represent actual
processes; they are merely message-handling functions for processes that reference them by
their entry pointers.
Let a factorial process be a process that references any of the three functions. Initially,
a factorial process waits for a message whose structure is defined by the C data structure
called FAC_DATA. The message is called a FAC_DATA message.
typedef struct { REF ID; int LO, HI; } FAC_DATA; 
ID: Data structure containing the caller's process ID. 
LO: Low end of a number range. 
HI: High end of a number range. 
After receiving the message (Figure 2.4), the factorial process computes the product
of all integers within the closed interval [LO,HI]. The factorial process stores the product
in the LO field of another FAC_DATA message, which is returned to the requester. Thus,
sending a FAC_DATA message with a 1 in the LO field to the factorial process will cause
the factorial of HI to be computed.
Figure 2.4 Specification of the factorial process.
To compute the factorial of a value, the requesting process (caller) instantiates a new
process whose entry pointer contains the address of the fac_1 function. We shall call this
new process the fac_1 process. The factorial is computed by a divide-and-conquer method
that iterates using the difference between HI and LO.
9 if(mesg->HI <= mesg->LO) 
When the fac_1 process receives its first message, it compares the two ends of the
interval described in the message. If HI equals LO, then there is only one integer in the
interval. If HI is 0 (therefore less than LO, which must be 1 at this point), then the factorial
of 0 is to be computed. In either case, the correct reply value is equal to the number already





Therefore, when LO ≥ HI, the message is bounced back to the caller, untouched. The
rc_send function called in line 11 causes the message buffer mesg to be sent to the process
whose ID is mesg->ID, which is, in this case, the ID of the caller. Since rc_send dissociates
the message buffer from the process, the process does not have to release it explicitly before




Figure 2.5 The divide step.
16 half = (mesg->HI + mesg->LO)/2;
If HI is greater than LO, the fac_1 process computes a midpoint that divides the
interval into two smaller intervals. Two more fac_1 processes are created to work on these
two intervals (Figure 2.5). These processes are called the siblings of this process, and an
initialization message is sent to each sibling as it is created.
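The arithmetic of the divide step can be checked with an ordinary sequential recursion. The sketch below is not the reactive program: range_product is a hypothetical helper in which a recursive call stands in for spawning a sibling and waiting for its reply, and it assumes the lower sibling receives the interval [LO, half], which the text implies but does not show.

```c
#include <assert.h>

/* Product of all integers in the closed interval [lo, hi].
   Mirrors the base case of fac_1: when hi <= lo the value lo is
   returned unchanged, so (lo = 1, hi = 0) yields 0! = 1. */
unsigned long range_product(unsigned long lo, unsigned long hi)
{
    unsigned long half;

    if (hi <= lo)
        return lo;

    half = (hi + lo) / 2;                   /* same midpoint as line 16 */
    return range_product(half + 1, hi)      /* upper sibling: [half+1, hi] */
         * range_product(lo, half);         /* lower sibling: [lo, half]  */
}
```

Because each call splits its interval in half, the recursion depth grows with the logarithm of HI - LO; in the reactive version the two halves run as separate processes, which is where the logarithmic running time on a sufficiently large machine comes from.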
18 mesg2 = (FAC_DATA *) rc_malloc(sizeof(FAC_DATA)); 
Message buffers are allocated by the rc_malloc call. The function rc_malloc has the
same semantics as the malloc function in C. Depending on the implementation, rc_malloc
can be identical to C malloc, can be built on top of C malloc, or can be an entirely different










After a message buffer has been allocated, it is filled with data to be sent to a sibling.
Lines 19-21 are for the sibling that handles the upper half of the interval. The rc_myid
function returns the ID of the process. The process becomes the caller of its siblings after
its ID has been stored and sent in the ID fields of the initialization messages. The fac_1
process will receive one reply from each of its siblings. When two replies are received, the
process multiplies the values contained in their LO fields and returns the product to its own
caller.
22 rc_spawn(fac_1, mesg2); 
Processes are created with the rc_spawn function call. At line 22, a new process 
structure is created, the entry pointer of the new process is initialized to reference the 
function fac_1 (first parameter to the rc_spawn function), and the message mesg2 (second 




proc->data = (char *) mesg; 
proc->entry = fac_2; 
Figure 2.6 The combine step. 
The process must now return from the fac_1 function in order to wait for the replies 
from its siblings (Figure 2.6). The process sends its reply using the same message buffer 
that it received, but to prevent losing the reference to that message buffer, it assigns the 
message buffer into the data pointer of its process structure. Furthermore, since the process 
is now waiting for a reply message instead of a factorial request message, the entry pointer 
is changed to reference the function that handles the first reply message. By storing the 
address of the fac_2 function into the entry field, the fac_1 process becomes a fac_2 
process. The process then returns from the fac_1 function to indicate that it is going back 
to the waiting state. 
35 fac_2(proc,mesg ) 







((FAC_DATA *) (proc->data))->LO = mesg->LO; 
rc_free(mesg); 
proc->entry = fac_3; 
The fac_2 process waits for the first reply message. When it arrives, its reply value is 
simply copied into the LO field of the original message buffer, since the process needs a value 
from each reply before the product can be computed. The reply message buffer from the 






((FAC_DATA *)(proc->data))->LO *= mesg->LO; 
rc_free(mesg); 
rc_send(((FAC_DATA *)(proc->data))->ID, proc->data); 
rc_exit(); 
When the fac_3 process gets the second reply message, the returned value is multiplied 
into the LO field of the original message buffer. The reply message buffer is also freed. The 
original message buffer, now containing the product of the two reply values, is sent back to 
the caller. Lastly, the process terminates by calling rc_exit. 
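The divide-and-combine structure of the fac_1, fac_2, and fac_3 processes can be sketched as an ordinary recursive C function. This is only a sequential illustration, not the Reactive-C program itself: the names interval_product and factorial are hypothetical, and a recursive call stands in for spawning a sibling and waiting for its reply.

```c
#include <assert.h>

/* Product of all integers in [lo, hi]. As in fac_1, an interval with
   hi <= lo is answered with lo itself; this also yields 0! = 1 when
   the request is lo = 1, hi = 0. */
static unsigned long interval_product(unsigned long lo, unsigned long hi)
{
    unsigned long half;

    if (hi <= lo)
        return lo;                      /* bounce the request back */
    half = (hi + lo) / 2;               /* the divide step */
    /* lower sibling gets [lo, half], upper sibling gets [half+1, hi];
       the multiplication is the combine step */
    return interval_product(lo, half) * interval_product(half + 1, hi);
}

static unsigned long factorial(unsigned long n)
{
    assert(n <= 12);   /* 12! is the largest factorial that fits in 32 bits */
    return interval_product(1, n);
}
```

In the reactive version the two recursive calls run as independent processes, so the two products may be returned in either order; multiplication is commutative, which is why fac_2 and fac_3 need not know which sibling replied first.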
Listing 2.3 is a sample program that calls the factorial program. It waits for an 
input number, computes the factorial of the input number, prints the factorial, and then 
terminates. 
1 rc_main(proc,mesg) 
2 RC_PROC *proc; 
3 char *mesg; 
4 { 
5 int hi; 
6 FAC_DATA *mesg2; 
8 rc_free(mesg); 
10 printf("Enter number: "); scanf("%d",&hi); 
12 mesg2 = (FAC_DATA *) rc_malloc(sizeof(FAC_DATA)); 
13 mesg2->ID = rc_myid(); 
14 mesg2->HI = hi; 
15 mesg2->LO = 1; 
16 rc_spawn(fac_1,mesg2); 
18 proc->entry = main_reply; 
19 } 
21 main_reply(proc,mesg) 




printf("%d\n",mesg->LO); rc_free(mesg); rc_exit(); 
Listing 2.3 Factorial main program 
The basic Reactive-C primitives are summarized below: 
char *rc_malloc();   Allocates a message buffer. 
rc_free();           Releases a message buffer. 
rc_send();           Sends and releases a message buffer. 
REF rc_myid();       Returns the ID of the calling process. 
rc_spawn();          Instantiates a new process. 
rc_exit();           Terminates the calling process. 
Deliberately omitted from the list is a function that receives a message. In Reactive-C, a 
message is implicitly requested when a process is created or when a process returns from its 
entry function. The request is fulfilled when its current entry function is called. The other 
unusual aspect of the Reactive-C primitives is that rc_spawn does not return the ID of the 
new process; thus, the only direct way for a parent process to get the ID of the sibling is to 
receive the ID from a message sent by the sibling. 
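The implicit receive can be illustrated with a small stand-alone sketch in which a loop plays the role of the kernel and delivers one message per call to the current entry function. All names here (Proc, on_first_reply, on_second_reply, combine_two_replies) are hypothetical, and the array inbox is only an assumed stand-in for the kernel's message queue.

```c
#include <assert.h>

typedef struct Proc Proc;
typedef void (*Entry)(Proc *proc, int reply);
struct Proc { Entry entry; int product; int done; };

/* Handles the second reply: combine and finish. */
static void on_second_reply(Proc *proc, int reply)
{
    proc->product *= reply;
    proc->done = 1;
}

/* Handles the first reply: remember it, then change what the
   process will do when the next message arrives. */
static void on_first_reply(Proc *proc, int reply)
{
    proc->product = reply;
    proc->entry = on_second_reply;
}

static int combine_two_replies(int first, int second)
{
    int inbox[2];
    int i;
    Proc p;

    inbox[0] = first;
    inbox[1] = second;
    p.entry = on_first_reply;
    p.product = 0;
    p.done = 0;
    /* Returning from an entry function is the request for the next
       message; the loop fulfils it by calling the current entry. */
    for (i = 0; i < 2 && !p.done; i++)
        p.entry(&p, inbox[i]);
    assert(p.done);
    return p.product;
}
```

The process never names a receive operation; its only way of choosing how the next message is treated is to store a different function in its entry field before returning.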
Reactive-C is a minimalist reactive-process programming system. (The kernel code for 
a single-processor system is only 124 lines long.) Since the parent process can always send 
its ID to the sibling during spawn, and since the sibling can always send its ID back to its 
parent via a message, it is not necessary for the spawn function in a minimalist system to 
return an ID. The goal of Reactive-C is to create a system that is minimal but that is not 
necessarily easy on the programmer. However, a close relative of Reactive-C turns out to 
be well suited for writing event-driven simulators. Another derivative, the Reactive Kernel, 
proves to be very useful in implementing the inner kernel and the handlers of multicomputer 
operating systems. Details of the Reactive Kernel can be found in the Master's thesis of 
Jakov Seizovic [5]. 
Reactive-C is strongly influenced by the Cantor programming language, which is a 
fine-grain reactive-process programming system in which process spawning uses futures to 
immediately return the sibling ID. The properties and programming paradigms related 
to fine-grain reactive-process programming are explored in detail in the doctoral thesis of 
W.C. Athas [15]. 
In the next chapter, we will focus on the universality of reactive-process programming, a 
property that is best illustrated using full-fledged, coarse-grain reactive processes. Although 
we will be leaving the Reactive-C environment for now, we should bear in mind that a duality 
exists between a Reactive-C process and its heavy-weight counterpart: What is applicable for 
one is equally applicable for the other. Heavy-weight programs are used for the remainder 
of our discussion because they are simpler to describe. 
Universality of a programming system requires the programming system to efficiently 
support a large variety of other programming systems. Layering, or the implementation 
of new functions on top of basic functions, is the principal means by which universality is 
achieved. 
Chapter 3 Reactive-Process Layers 
In contrast to a light-weight Reactive-C process, which has only a function and a data 
structure, we can generally consider a heavy-weight process to be one that, although its 
structure is machine dependent, has its own code, data, stack, and thread of control. We 
can run heavy-weight reactive processes under the Reactive-C programming environment 
with minimal overhead by using a dedicated, light-weight reactive process, called a handler. 
In one possible arrangement, the data pointer of a handler references a table containing 
three segment pointers (for the code, data, and stack segments) and a context structure 
(containing the frozen records of a suspended heavy-weight process). When a message is 
received by a handler, the entry function for the handler performs a context switch to 
resume the execution of the heavy-weight process. When the heavy-weight process calls a 
receive function, it saves the process context, restores the system context, and returns to 
the handler. The handler returns from its entry function to request a new message. 
In this manner, the combination of the heavy-weight process and its handler appears to 
the kernel as an ordinary Reactive-C process. The cost of supporting a heavy-weight process 
under a handler, as opposed to supporting it under the kernel, is no more than one extra 
level of function call. A handler for a heavy-weight process is an example of layering. A 
handler that supports multiple heavy-weight processes is used in the Reactive Kernel node 
operating system for running normal user processes. 
Section 3.1 Simple Layers 
3.1.1 The bottom layer (b-layer) 
As we did for Reactive-C, we shall establish the groundwork for the discussion of universality 
and layering with an example. Listing 3.1 contains a heavy-weight reactive-process program 
that computes a factorial in the same manner as the Reactive-C example. We shall refer to 
the programming system used in this example as the bottom, or b-layer. 



































FAC_DATA *mesg = (FAC_DATA *) b_recvb(); 
FAC_DATA *mesg2; 
int half, k; 
if(mesg->HI <= mesg->LO) 
{ 
b_send(mesg,mesg->pn,mesg->pp); 






half = (mesg->HI + mesg->LO) / 2; 
k = mypid()*nnodes() + mynode(); 
mesg2 = (FAC_DATA *) b_malloc(sizeof(FAC_DATA)); 
mesg2->pn = mynode(); 
mesg2->pp = mypid(); 
mesg2->HI = mesg->HI; 
mesg2->LO = half+1; 
spawn("pfac", (2*k+2)%nnodes(), (2*k+2)/nnodes(), ""); 
b_send(mesg2, (2*k+2)%nnodes(), (2*k+2)/nnodes()); 
mesg2 = (FAC_DATA *) b_malloc(sizeof(FAC_DATA)); 
mesg2->pn = mynode(); 
mesg2->pp = mypid(); 
mesg2->HI = half; 
mesg2->LO = mesg->LO; 
spawn("pfac", (2*k+1)%nnodes(), (2*k+1)/nnodes(), ""); 
b_send(mesg2, (2*k+1)%nnodes(), (2*k+1)/nnodes()); 
data = mesg; 
FAC_DATA *mesg = (FAC_DATA *) b_recvb(); 












FAC_DATA *mesg = ( FAC_DATA *) b_recvb(); 




Listing 3.1 Heavy-weight factorial program. 
A comparison between the Reactive-C example and the b-layer example reveals 
numerous similarities. The three entry-function candidates are replaced by three program blocks; 
each block is headed by a line that waits for and receives a message: 
7 { FAC_DATA *mesg = (FAC_DATA *) b_recvb(); 
Instead of messages being passed to it as function arguments, a b-layer process must 
perform an explicit b_recvb call to get a message. The b_recvb call suspends the process 
until a message arrives. The message is then returned to the process by the b_recvb 
function. 
typedef struct { int pn, pp; int HI, LO; } FAC_DATA; 
A b-layer process is identified by its node and pid pair rather than by just a REF value. 
There is no reason why it should not use the same single-value representation that Reactive-
C uses, except that heavy-weight processes require better control over process placement 
because they take up a great deal of memory. Thus, wherever ID was used, it is replaced 
with the node and pid pair. 
19 k = mypid()*nnodes() + mynode(); 
26 spawn("pfac", (2*k+2)%nnodes(), (2*k+2)/nnodes(), ""); 
27 b_send(mesg2, (2*k+2)%nnodes(), (2*k+2)/nnodes()); 
34 spawn("pfac", (2*k+1)%nnodes(), (2*k+1)/nnodes(), ""); 
35 b_send(mesg2, (2*k+1)%nnodes(), (2*k+1)/nnodes()); 
Listing 3.2 Program fragments for mapping a binary tree to a multicomputer . 
Both b_send and spawn need node and pid as their arguments. In order to give a 
process better control over the placement of its siblings, a process is allowed to define the 
node and pid of the new processes it creates. The three program fragments shown in Listing 
3.2 map a tree structure onto a multicomputer such that if the tree is balanced, the number 
of processes in any two nodes will differ by no more than 1. 
As shown in Figure 3.1, the tree is first mapped to a linear array such that a process with 
an ID of (node,pid) on a multicomputer with N nodes will have an index of k = pid*N 
+ node. The two siblings of the process will have indices of 2k+1 and 2k+2, respectively. 
The list is then folded into the multicomputer using the "%" and the "/" operators. 
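The index arithmetic of this mapping can be checked with a short stand-alone sketch. The type ProcId and the functions linear_index, fold, and sibling_of are hypothetical names introduced for illustration; the parameter nnodes plays the role of the value returned by nnodes() in the listing.

```c
#include <assert.h>

typedef struct { int node; int pid; } ProcId;

/* Position of a process in the linear array: k = pid*N + node. */
static int linear_index(ProcId p, int nnodes)
{
    return p.pid * nnodes + p.node;
}

/* Fold a linear index back onto the machine with % and /. */
static ProcId fold(int index, int nnodes)
{
    ProcId p;

    assert(nnodes > 0 && index >= 0);
    p.node = index % nnodes;
    p.pid  = index / nnodes;
    return p;
}

/* which = 1 gives the sibling at 2k+1, which = 2 the sibling at 2k+2. */
static ProcId sibling_of(ProcId parent, int which, int nnodes)
{
    assert(which == 1 || which == 2);
    return fold(2 * linear_index(parent, nnodes) + which, nnodes);
}
```

On a four-node machine, for example, the process (node 1, pid 0) has index 1, so its second sibling has index 4 and is folded onto (node 0, pid 1): consecutive indices cycle through the nodes before the pid is incremented, which is what keeps the per-node process counts within one of each other for a balanced tree.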
The functions mypid, mynode, and nnodes return the pid of the process, the node of 





[Figure: a linear array of process indices 0, 1, 2, 3, ...; entry k is linked to its two siblings at 2k+1 and 2k+2.] 
Figure 3.1 Mapping a binary tree to a multicomputer. 
whose program file name is specified in the first argument, and whose ID is specified in the 
second and third arguments. The program file in this case is named pfac. The first process 














[Figure: the two process structures side by side; surviving labels: stack seg ptr, data seg ptr, code seg ptr, next action.] 
Figure 3.2 Process structure comparison. 
The equivalence between the light-weight and heavy-weight processes is most obvious 
when the process structures of the two factorial processes are compared at the time that they 
are both waiting for their first reply message (Figure 3.2). The light-weight factorial process 
retains its message buffer in the data pointer of its process structure; the heavy-weight 
factorial process retains its message buffer in a pointer located on its program stack. The 
light-weight factorial process specifies its next action with the entry pointer of its process 
structure; the heavy-weight factorial process specifies its next action with the program 
counter stored in its context structure. 
The basic b-layer primitives can be summarized in the following list. The set is minimal, 
given the decision that processes are allowed to directly control process placement. 
char *b_malloc();   Allocates a message buffer. 
b_free();           Releases a message buffer. 
char *b_recvb();    Receives a message. 
b_send();           Sends and releases a message buffer. 
int mynode();       Returns the node of the calling process. 
int mypid();        Returns the pid of the calling process. 
int nnodes();       Returns the number of nodes in the machine. 
spawn();            Instantiates a new process. 
exit();             Terminates the calling process. 
3.1.2 The length-carrying layer (l-layer) 
We shall introduce the general concept of layering by a very simple example. We will create 
a new set of functions, the l-layer functions, that are parallel to the b-layer functions with 
the exception that the l-layer functions contain an additional function for accessing the length 
of a message buffer. To store the length information, we will make each message buffer a 
little larger than it needs to be, and store the length information in the extra space. 
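[Figure: a message buffer drawn as header | body. The buffer address seen by b-layer programs points to the start of the header; the buffer address seen by l-layer programs points to the start of the body.] 
Figure 3.3 Structure of an l-layer message buffer. 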
That extra space is placed at the front of each message buffer and is called the header 
of the message; the rest of the message is called the body. We can hide the header by 
having l-layer functions work only with pointers to the body of the message. As a result, 
the l-layer functions become a superset of the b-layer functions. 
1 typedef struct { int length; } HEADER; 
2 #define BODY_OF(h) (h+sizeof(HEADER)) /* given header, find body */ 
3 #define HEAD_OF(b) (b-sizeof(HEADER)) /* given body, find header */ 
The HEADER structure shown above defines the content of the header for an l-layer 
message buffer. The only field in this header is an integer that contains the length of the 
message body. In order to allow all data types in the message body, headers should normally 
be padded to the maximum data alignment requirement of the hardware. In the interest of 
simplicity, however, padding is neglected for our examples. 
5 char *l_malloc(n) 
6 int n; 
7 { 
8 char *p; 
10 p = b_malloc(n + sizeof(HEADER)); 
11 ((HEADER *) p)->length = n; 
12 return(BODY_OF(p)); 
13 } 
15 char *l_recvb() { return(BODY_OF(b_recvb())); } 
The two functions that return message buffers, receive and allocate, call the 
corresponding b-layer functions to get message buffers. When one is obtained, the pointer to 
the body of the buffer is returned by the functions. In addition, the l_malloc function 
stores the buffer length into the message header before it returns. Similarly, a function that 
takes a message buffer as input has to locate the real beginning of the message buffer before 
passing it to the corresponding b-layer function. 






int node, pid; 
23 b_send(HEAD_OF(p), node, pid); 
24 } 








return(((HEADER *) HEAD_OF(p))->length); 
This is the simplest application of layering; it does not change the message properties 
in any way. By adding more fields to the header structure, we can just as easily include any 
information that we would like to send along with a message, such as length of the message 
buffer, message type, and sender node and pid. 
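The header-hiding technique can be tried outside a multicomputer with the following sketch, in which the C library malloc and free are assumed as stand-ins for b_malloc and b_free. The names sketch_malloc, sketch_length, and sketch_free are hypothetical, and, as in the text, padding of the header to the maximum alignment is neglected.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct { int length; } HEADER;

/* Casts are added so that the macros work on any pointer type. */
#define BODY_OF(h) ((char *)(h) + sizeof(HEADER))  /* given header, find body */
#define HEAD_OF(b) ((char *)(b) - sizeof(HEADER))  /* given body, find header */

/* Allocate n body bytes plus a hidden header that records n. */
static char *sketch_malloc(int n)
{
    char *p;

    assert(n >= 0);
    p = malloc((size_t)n + sizeof(HEADER));
    if (p == NULL)
        return NULL;
    ((HEADER *)p)->length = n;
    return BODY_OF(p);               /* callers only ever see the body */
}

/* Recover the length from the header that sits in front of the body. */
static int sketch_length(char *body)
{
    return ((HEADER *)HEAD_OF(body))->length;
}

/* The lower layer must be handed the real start of the buffer. */
static void sketch_free(char *body)
{
    free(HEAD_OF(body));
}
```

A caller that writes to every byte of the body leaves the recorded length untouched, because the header lies entirely in front of the address the caller holds.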
3.1.3 The non-blocking-receive layer (nb-layer) 
A process running in a reactive-process programming environment should not monopolize 
the processor by running nonstop for long periods between receive calls, for if a process does 
not call a receive function, other processes in the same node will not get a chance to run. 
A conventional multi-tasking operating system makes scheduling fair by interrupting a 
long-running process with a timer in order to wrest control away from a process. The same 
thing can be done in a Reactive-C implementation of a heavy-weight programming system by 
treating a timer-interrupt mechanism as a process resource. A process, therefore, includes 
an interrupt mechanism and an interrupt service routine. When a process is interrupted by 
the timer, the interrupt service routine of the process calls a receive function to relinquish 
control. 
A timer-interrupt is just one of the ways to make a process call a receive function 
periodically. While a timer may still be needed as a backup mechanism to stop runaway 
processes, the preferred method is to convert a non-reactive process into a reactive process 
by having the process call a receive function periodically during extended computations. 
Although the messages received may not be needed right away, they can always be queued 
by the process until they are needed. 
It is better for the process to be de-scheduled at choice points in the program rather 
than at arbitrary points selected by the timer. Choice points are places in a program where 
much of the system resources used by the program, such as floating-point accelerators, 
direct-memory-access units, and processor registers, are released by the process as a normal 
part of the program execution. The amount of state information that needs to be saved 
and restored when a program is stopped and restarted at a choice point is usually small 
and can be reliably predicted during compile time. 
Calling a receive function, either from a timer-interrupt handler or from a choice point, 
presents a problem, however. A process that relinquishes control by calling a receive 
function will not be restarted until a message is ready for it. As a result, a node can sit 
idle with runnable processes suspended because there are no messages queued for them. 
Furthermore, if a suspended process does not receive any more messages, it will remain 
suspended indefinitely. 
What we need is a receive function that does not block. This function can be 
implemented by having the process send a uniquely identifiable message to itself just before it 
calls a blocking receive function. We can create such messages by the same layering 
mechanism that we used for message length. Let us prefix the new functions with nb_, and let 
us invent a new receive function, nb_recv. A call to nb_recv has the same effect as a call 
to a normal receive function, except that in cases where a normal receive function would 
block, nb_recv returns a null pointer. (An nb_recv call may still return a null pointer at 
other times, but it will always cause the process to release control first.) 
Below is a set of routines that implement the nb-layer functions. We will list only those 
functions that are different in form from the l-layer functions. First of all, two private 
variables are needed. The token_got variable indicates whether a uniquely identifiable 
token message has been previously allocated. The token_msg pointer contains the token 
message if it is allocated and if the process is currently holding it; the pointer contains null 
otherwise. 
1 typedef struct { int is_token; } HEADER; 
3 static int token_got; 
4 static char *token_msg; 
6 char *nb_recv() 
7 { 




10 if(!token_got) { token_msg = l_malloc(0); token_got = 1; } 
12 if(token_msg) { ((HEADER *)HEAD_OF(token_msg))->is_token = 1; 
13 b_send(HEAD_OF(token_msg), mynode(), mypid()); 
14 token_msg = 0; } 
15 p = b_recvb(); 
16 if(((HEADER *) p)->is_token) { token_msg = BODY_OF(p); return(NULL); } 
17 return(BODY_OF(p)); 
18 } 
The first thing that the non-blocking nb_recv does is to check for the existence of the 
token message. If the token message has not been allocated, the function allocates it. Next, 
the function checks to see if it is currently holding the token message. If it is, the function 
sends the token message to itself, so that a subsequent b_recvb call is guaranteed to return. 
Lastly, it calls b_recvb to get a message. If the message obtained is a token message, the 
token message is saved and null is returned. Otherwise, the message is returned. 
20 char *nb_recvb() 
21 { 
char *p; 
p = b_recvb(); 
while(((HEADER *) p)->is_token) { token_msg = BODY_OF(p); p = b_recvb(); } 
return(BODY_OF(p)); 
} 
29 nb_send(p,node,pid) 
30 char *p; 
31 int node, pid; 
32 { 
33 ((HEADER *)HEAD_OF(p))->is_token = 0; 
34 b_send(HEAD_OF(p), node, pid); 
35 } 
The blocking nb _recvb waits for a non-token message and returns that message when 
it is received. If a token message is received first, it is stored in token_msg and nb_recvb 
continues to wait for the next message. The nb_send function clears the token flag in the 
message header before sending the message because it can only send ordinary messages. 
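The token mechanism can be exercised in isolation with the following sketch. It assumes a single process whose input queue is modeled by a small ring buffer; the names Msg, toy_send, toy_recvb, and toy_nb_recv are hypothetical, and toy_recvb asserts instead of suspending because nothing else is running that could supply a message.

```c
#include <assert.h>
#include <stddef.h>

typedef struct { int is_token; int value; } Msg;

#define QMAX 16
static Msg *mailbox[QMAX];
static int q_head, q_tail;

/* Append a message to this process's own input queue. */
static void toy_send(Msg *m)
{
    assert(q_tail - q_head < QMAX);
    mailbox[q_tail++ % QMAX] = m;
}

/* Stand-in for the blocking receive: a real one would suspend
   the process when the queue is empty. */
static Msg *toy_recvb(void)
{
    assert(q_head != q_tail);
    return mailbox[q_head++ % QMAX];
}

static Msg token = { 1, 0 };
static Msg *token_held = &token;   /* null while the token is in the queue */

/* Never blocks: the token guarantees that toy_recvb has something
   to return, and receiving the token means "nothing else queued
   ahead of it". */
static Msg *toy_nb_recv(void)
{
    Msg *m;

    if (token_held) { toy_send(token_held); token_held = NULL; }
    m = toy_recvb();
    if (m->is_token) { token_held = m; return NULL; }
    return m;
}
```

When an ordinary message is already queued, the first call returns it and leaves the token in the queue; the following call then picks up the token and reports null, which is the case described in the text where null is returned even though the function would not have blocked.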
In order to improve efficiency, detection of token messages is ordinarily integrated into 
the kernel so that the kernel can defer token messages until the input message queue is 
otherwise empty. The primary effect is that processes with pending non-token messages are 
favorably scheduled. The side effect is that processes have a reliable method of determining 
whether the input queue of the node is empty. This special treatment of token messages 
constitutes the basis for indefinite-lazy computation in distributed simulation. This will be 
discussed in a later section. 
3.1.4 Handler layering 
Running a heavy-weight process inside a handler is an example of layering. We can also run 
a light-weight process inside a heavy-weight process, or a light-weight process inside another 
light-weight process. When each handler process controls just one reactive process, the ID 
of the handler is sufficient to uniquely identify the process. When there may be more than 
one process inside a handler, a secondary pid needs to be included in the message header 
to distinguish them. Examples of handler layering are the Reactive Kernel for heavy-weight 
processes and simulators for light-weight processes. 
1 typedef struct { int pid2; } HEADER; /* message header */ 
3 struct PROC ptab[MAX_PID2]; /* process table */ 
5 main_loop() 
6 { 
7 char *mesg; 
8 PROC *proc; 
10 while(1) 
11 { 
12 mesg = b_recvb(); 
13 proc = ptab + ((HEADER *)mesg)->pid2; 
14 (*proc->entry)(proc,BODY_OF(mesg)); 
15 } 
16 } 
Shown above is the main loop of a heavy-weight process capable of handling more than 
one light-weight process. The message functions resemble the l-layer functions, but with 
the second pid, rather than the message length, in the message header. The heavy-weight 
process repeatedly calls b_recvb to get a message, finds the real destination process by the 
pid2 field, and calls the entry function of the process. If this program fragment looks 
familiar, it is because this is the main loop of the Reactive-C kernel. The Reactive-C kernel 
is itself a reactive-process program. 
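The dispatch step of such a handler can be sketched without a kernel by feeding it an array of already-received messages. The Packet type, the accumulate entry function, and the functions init_table and dispatch are hypothetical; an array stands in for the stream of messages that b_recvb would return.

```c
#include <assert.h>

#define MAX_PID2 4

typedef struct { int pid2; } HEADER;               /* message header */
typedef struct { HEADER head; int body; } Packet;  /* header followed by body */

typedef struct PROC PROC;
struct PROC { void (*entry)(PROC *proc, int body); int total; };

static PROC ptab[MAX_PID2];                        /* process table */

/* A trivial light-weight process: sum the bodies it receives. */
static void accumulate(PROC *proc, int body)
{
    proc->total += body;
}

static void init_table(void)
{
    int i;

    for (i = 0; i < MAX_PID2; i++) {
        ptab[i].entry = accumulate;
        ptab[i].total = 0;
    }
}

/* One pass of the handler loop per message: pick the destination
   by the secondary pid, then call its current entry function. */
static void dispatch(const Packet *packets, int count)
{
    int i;

    for (i = 0; i < count; i++) {
        PROC *proc;

        assert(packets[i].head.pid2 >= 0 && packets[i].head.pid2 < MAX_PID2);
        proc = ptab + packets[i].head.pid2;
        (*proc->entry)(proc, packets[i].body);
    }
}
```

Because the entry function is looked up at each delivery, a light-weight process inside the handler can still change its behavior between messages by overwriting its own entry field.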
Although the definition of a reactive-process program is fixed as stated in the beginning 
of Chapter 2, certain properties of the programming system are implementation-dependent. 
Handler layering provides a way of running a programming system with a different set of 
properties on top of another programming system. For example, assume that we have a 
programming system in which all messages to non-existing processes are thrown away. To 
implement systems such as the Cantor run-time system, messages to non-existing processes 
must be preserved. Suppose we were to support Cantor by running a Cantor handler under 
a reactive kernel. As far as the kernel is concerned, all messages will find their destination 
processes, namely, the Cantor handler processes. When the handler gets a message, the 
message is beyond the jurisdiction of the kernel; the handler can do any number of things 
with it. In particular, the handler can queue messages for Cantor processes that have not 
yet been created. 
Section 3.2 Message Type 
It is convenient in many computations for a process to respond differently to different types 
of messages. In the factorial examples, there are three types of messages: the message 
from the parent. the first message to arrive from the siblings, and the second message to 
arrive from the siblings. These messages do not have to be distinguished by type because 
they are identified by their order of arrival. In the Reactive-C example, different responses 
to different messages are specified by storing different function pointers into the process 
structure after each message is received. In the b-layer version, the responses are specified 
by the locations in the program where b_recvb is called. 
[Figure: a chain of carrier processes, labeled head at one end and tail at the other.] 
Figure 3.4 An example of a FIFO queue. 
In the next example, however, it is necessary to distinguish messages by type. The 
FIFO (first-in-first-out queue) structure shown in Figure 3.4 can be constructed with the 
chain of carrier processes described in Listing 3.3. The carrier processes are connected 
into a singly linked list by the next_node and next_pid variables in each process. The 
FIFO is accessed by a reference to the head carrier and a reference to the tail carrier. 
When an item is to be added to the FIFO, the item is sent as a message to the tail of 
the FIFO. The process at the tail of the FIFO spawns a new carrier for the new item and 
returns the reference of the new carrier to the caller. When an item is to be retrieved 
from the FIFO , a message is sent by the caller to the head of the FIFO. The process at 
the head of the FIFO sends its item and the reference of the next carrier to the caller. 
The process then removes itself from the FIFO. Message types are needed because the two 
commands, "new item" and "retrieve item", can arrive in any order when a FIFO is 
only one element long. 








































int caller_node, caller_pid; 
int next_node, next_pid; 
while ( 1) 
{ 
} 




case ADD_VALUE: spawn_anywhere("carrier",&next_node,&next_pid); 
req->type = SET_VALUE; 
b_send(req, next_node, next_pid); 
break; 































Listing 3.3 The carrier program for building a FIFO. 
When a carrier receives an ADD_VALUE message, it spawns another carrier, and the 
message is passed to the new carrier after its message type is set to SET_VALUE (16-19). 
The spawn_anywhere function will spawn the specified process on some available node and 
return the node and pid of the process in the next_node and the next_pid variables. 
When a carrier receives a SET_VALUE message, the process is the new tail process. 
The value field of the message is copied into the value variable of the carrier. The next 
reference of the carrier is initialized to a null ID. The ID of the carrier is written into the 
message, and the message is returned to the caller (21-29). After the message is received 
by the caller, the caller's tail reference is updated. 
When a carrier receives a GET_VALUE message, its value and its next-carrier reference 
are copied into the message. The message is sent back to the caller and the process exits 
(31-38). 
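The three message types can be traced with a sequential sketch in which each carrier is a heap-allocated structure and "sending a message" is a direct function call. Everything here is an assumption made for illustration: the Request and Fifo types, new_carrier, carrier_handle, fifo_put, and fifo_get are hypothetical, pointers replace the node and pid pairs, and malloc and free replace spawn and exit.

```c
#include <assert.h>
#include <stdlib.h>

enum { ADD_VALUE, SET_VALUE, GET_VALUE };

typedef struct Carrier Carrier;
struct Carrier { int value; Carrier *next; };

typedef struct { int type; int value; Carrier *ref; } Request;

static Carrier *new_carrier(void)
{
    Carrier *c = malloc(sizeof *c);

    assert(c != NULL);
    return c;
}

/* What one carrier does with one message, selected by message type. */
static void carrier_handle(Carrier *self, Request *req)
{
    switch (req->type) {
    case ADD_VALUE:                 /* old tail: create and link a new tail */
        self->next = new_carrier();
        req->type = SET_VALUE;
        carrier_handle(self->next, req);
        break;
    case SET_VALUE:                 /* new tail: store item, report own reference */
        self->value = req->value;
        self->next = NULL;
        req->ref = self;
        break;
    case GET_VALUE:                 /* head: hand over item and successor, then exit */
        req->value = self->value;
        req->ref = self->next;
        free(self);
        break;
    }
}

typedef struct { Carrier *head, *tail; } Fifo;   /* the caller's two references */

static void fifo_put(Fifo *f, int value)
{
    Request req;

    req.value = value;
    req.ref = NULL;
    if (f->tail == NULL) {          /* empty FIFO: create the first carrier */
        req.type = SET_VALUE;
        carrier_handle(new_carrier(), &req);
        f->head = req.ref;
    } else {
        req.type = ADD_VALUE;
        carrier_handle(f->tail, &req);
    }
    f->tail = req.ref;              /* caller updates its tail reference */
}

static int fifo_get(Fifo *f)
{
    Request req;

    assert(f->head != NULL);
    req.type = GET_VALUE;
    req.value = 0;
    req.ref = NULL;
    carrier_handle(f->head, &req);
    f->head = req.ref;
    if (f->head == NULL)
        f->tail = NULL;
    return req.value;
}
```

In this sequential form the order of requests is fixed by the caller; in the concurrent program the head and the tail are the same carrier when the FIFO holds one item, which is exactly where an ADD_VALUE and a GET_VALUE can race and must be told apart by type.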
Section 3.3 Discretion on Receive 
Discretion on receive means allowing a process to select certain messages to consume while 
deferring other messages. Reactive-C, the b-layer, and other simple layered variants all 
have the same message property in that they do not supply any mechanisms for discretion; 
their processes have no choice but to take messages in the order they arrive. Discretion can, 
however, be implemented inside a process. 
3.3.1 Discretion using b-layer functions 
An example in which discretion is implemented in the program is a merge-sort program, in 
which the list to be sorted is split recursively along the branches of a time-on-target tree 
until every processing node in the machine is used. The machine should have a power-of-two 
number of nodes to support this doubling approach. 
Figure 3.5 Expansion steps in the merge-sort program. 
At the beginning of the sort, the zeroth-generation process is created in a machine with 
2^n nodes, and a list of numbers to be sorted is sent to the process as a message. The 
zeroth-generation process then proceeds to fill the machine with processes in a total of n 
expansion steps. In the kth expansion step, every process in the machine creates a new 
kth-generation process, giving half of its list to the new process and keeping the other half 
for itself. After n steps, there will be 2^n processes on the machine, each holding 1/2^n of 
the original list. 
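The arithmetic of the expansion can be followed with a small sketch that tracks only list lengths. The function expand and the constant MAX_NODES are hypothetical, and the choice of which node receives each new process (node i + 2^step for a process on node i) is an assumption made here for simplicity, not necessarily the placement used by the sort program.

```c
#include <assert.h>

#define MAX_NODES 64

/* Run n expansion steps on a machine of 2^n nodes. On return,
   len[i] is the length of the list held by the process on node i.
   Returns the number of processes that exist after the last step. */
static int expand(int total, int n, int len[MAX_NODES])
{
    int procs = 1;
    int step, i;

    assert(n >= 0 && (1 << n) <= MAX_NODES);
    len[0] = total;                        /* zeroth-generation process */
    for (step = 0; step < n; step++) {
        for (i = 0; i < procs; i++) {
            int given = len[i] / 2;        /* half goes to the new process */

            len[i + procs] = given;
            len[i] -= given;               /* the rest is kept */
        }
        procs *= 2;                        /* every process created one more */
    }
    return procs;
}
```

For a list of 100 numbers on an eight-node machine the lengths go 100, then 50 and 50, then four lists of 25, and finally four lists of 13 and four of 12, so the shares differ by at most one element when the original length is not a multiple of 2^n.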
The processes begin to sort their share of the list locally. When sorting is complete, 
the expansion steps are reversed to merge the fragmented lists. In the kth merging step (k 
decreasing), each kth-generation process sends its list back to its parent in a reply message. 
After n steps, only the zeroth-generation process remains. The list that it now holds is the 
sorted version of the original list. 
When the process structure is fully instantiated, each kth-generation process has a 
sibling for every generation number from k+1 to n. Since the computation is asynchronous, 
returning messages from the siblings may arrive in a different order from the order of the 
merging steps. Since each process needs to consume reply messages from its siblings in the 
order of decreasing generation number, each sibling will need a different message type for 
its reply message, and the process will selectively wait for a certain message in each merging 
step. 
The sorting program in Listing 3.4 first appeared in "Multicomputers: Message-Passing 
Concurrent Computers" [2]. The first version of the program, which uses integer-based 
types, was written by C.L. Seitz; the version appearing in Listing 3.4 and in the IEEE 
paper was modified by the author to use pointer-based types. 
1 typedef struct MESG MESG; /* Message header structure. */ 
2 struct MESG { int pnode, ppid; /* Address of the parent process. */ 
3 int tbase; /* Base for time-on-target tree. */ 
4 int len; /* Number of elements in the vector. */ 
5 MESG **type; }; /* Type field for filtering message. */ 
6 #define BUF(v) ((double *)(v+1)) /* Data follows MESG immediately. */ 
8 unsigned int this_node, this_pid, node_cnt; 
10 main() 
11 { MESG *v; 
13 this_node = mynode(); /* Node number of this process. */ 
14 this_pid = mypid(); /* Pid number of this process. */ 
15 node_cnt = nnodes(); /* Number of nodes in this machine. */ 




if(v->len > 1) merge_sort(v); 
b_send(v, v->pnode, v->ppid); 
22 merge_sort(v) 
23 MESG *v; 
24 { unsigned l1, l2, i, new_node; 
25 MESG *v1, *v2, *v3; 
26 double *d, *s, *b1, *b2; 
/* Sort the list. */ 
l1 = (v->len + 1) / 2; 
l2 = (v->len    ) / 2; 
v1 = (MESG *) b_malloc(sizeof(MESG)+sizeof(double)*l1); 
v2 = (MESG *) b_malloc(sizeof(MESG)+sizeof(double)*l2); 
for(i = v1->len = l1, d = BUF(v1), s = BUF(v); i--; ) *d++ = *s++; 
for(i = v2->len = l2, d = BUF(v2); i--; ) *d++ = *s++; 












/* spawning a sibling. */ 
v1->tbase = v2->tbase = v->tbase << 1; /* New base for building */ 
if(v1->len > 20 && new_node < node_cnt) /* If list is too long and */ 
{ /* if next node is valid */ 
41 spawn("msort",new_node,this_pid,""); /* spawn a sibling */ 
42 v1->pnode = this_node; /* and send it a list. */ 
43 v1->ppid = this_pid; 
44 v1->type = &v1; /* The type field holds the */ 
45 b_send(v1,new_node,this_pid); /* address of the msg ptr. */ 
46 v1 = 0; /* Msg ptr is set to null. */ 
} else if(v1->len > 1) merge_sort(v1); /* Sort if cannot split. */ 
if(v2->len > 1) merge_sort(v2); /* Sort the other list. */ 
52 while(!v1) { v3 = (MESG *) b_recvb(); *v3->type = v3; } 
54 for(b1 = BUF(v1), b2 = BUF(v2), d = BUF(v); l1 || l2; ) /* merge. */ 
55 { while(l1 && (!l2 || (l2 && *b1 <= *b2))) { l1--; *d++ = *b1++; } 
56 while(l2 && (!l1 || (l1 && *b2 <= *b1))) { l2--; *d++ = *b2++; } 
57 } 
57 } 
58 b_free(v1); b_free(v2); 
59 } 
Listing 3.4 The merge-sort program. 
In each level of recursion where a sibling is created (41), the type field of the message
for the sibling is filled with the address of the automatic pointer variable, v1 (44). These v1
pointers on the program stack are set to null before the merging phase (46), which begins
when the recursive merge_sort function starts to unwind. Since at most one sibling is
created in each level, the list sent to each sibling must contain an address that is different
from the others: the address of the v1 pointer in effect when the sibling is created.
Figure 3.6 Giving away a list for the third time (stack grows up).
After the expansion phase, the program progresses to lines 48 and 50, where the remaining
numbers are sorted using a sequential merge-sort algorithm performed by the same
merge_sort function. During the merging phase, each sibling returns a message of the type
it was assigned (19). A process selectively waits for the message for the current recursion
level by polling the v1 pointer at that level; at the same time, the process repeatedly requests
a message and stores it into the pointer whose address is equal to its message type (52).
Figure 3.7 Getting an out-of-sequence reply. 
When the program reaches line 52, v1 is in one of three states:
1. v1 is not null, because its list has not been given away;
2. v1 is null, because although its list has been given away, a reply has not been
received; or
3. v1 is not null, because although its list has been given away, the reply was received
while the program was waiting for a different reply.
The distribution of work is accomplished by divide and conquer; the merge-sort example
can be used as a template for other divide-and-conquer applications. Assigning deferred
messages into holding pointers is sufficient for this application because no more than one
message of each type needs to be queued. When more than one message of each type must
be deferred, the process has to store them in a more general list structure.
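The holding-pointer technique can be tried in isolation. The following is a minimal sketch, assuming a small in-memory FIFO (mailbox_put and mailbox_get, both hypothetical) in place of the kernel's b_recvb; the Reply structure and the demo function are likewise illustrative and are not part of Listing 3.4. It shows a reply for an outer recursion level arriving before the reply for the inner level, and being parked in the pointer named by its own type field.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical stand-in for a reply message: the type field holds the
   address of the pointer variable that should receive the message. */
typedef struct Reply {
    struct Reply **type;
    int            payload;
} Reply;

/* A tiny in-memory FIFO that stands in for b_recvb() (assumption). */
#define MAILBOX_MAX 8
static Reply *mailbox[MAILBOX_MAX];
static int    mb_head, mb_tail;

static void mailbox_put(Reply *r)
{
    assert(mb_tail < MAILBOX_MAX);
    mailbox[mb_tail++] = r;
}

static Reply *mailbox_get(void)
{
    assert(mb_head < mb_tail);   /* a real b_recvb would block instead */
    return mailbox[mb_head++];
}

/* Wait until *slot is non-null, parking every message that arrives
   in the pointer named by its own type field (compare line 52). */
static Reply *wait_for(Reply **slot)
{
    while (*slot == NULL) {
        Reply *r = mailbox_get();
        *r->type = r;
    }
    return *slot;
}

/* Two outstanding replies arrive in the "wrong" order. */
int demo_out_of_order(void)
{
    Reply *outer = NULL, *inner = NULL;   /* one holding pointer per level */
    Reply  a = { &outer, 10 };
    Reply  b = { &inner, 20 };

    mb_head = mb_tail = 0;
    mailbox_put(&a);                      /* reply for the outer level first */
    mailbox_put(&b);

    Reply *first  = wait_for(&inner);     /* parks a in outer while waiting */
    Reply *second = wait_for(&outer);     /* already present: no receive    */
    return first->payload * 100 + second->payload;
}
```

Because each level owns exactly one holding pointer, no list structure is needed; a design with several replies of the same type in flight would need a queue per type instead.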
3.3.2 The RPC-discretion layer (r-layer)
While discretion is used in the merge-sort program, the process still takes messages in the
same order they arrive. However, some programs can be made simpler by creating an
illusion that messages are dispensed by the kernel in an order other than first come, first
served. Such effects can be achieved with layering as well.
The implementation of a remote procedure call (RPC) is one example. Suppose we want
to make available a generic file operation, read, implemented by message exchange with a
file controller, a process responsible for maintaining a file. A prototype function might look
like the one in Listing 3.5.
 1 typedef struct { int fs_node;            /* Structure of one entry of   */
 2                  int fs_pid; } FSTRUCT;  /* the process's file table.   */
 4 FSTRUCT file_tab[20];                    /* The process's file table.   */
 6 typedef struct { int                     /* Format of request message   */
 7                  int                     /* to be sent to the file      */
 8                  int                     /* server process to request   */
 9                  int         } REQUEST;  /* for a read operation.       */
11 #define OP_READ 3
13 read(fd,buf,len)
14 int fd, len;











   request = (REQUEST *) b_malloc(sizeof(REQUEST));




26 b_send((char *) request, file_tab[fd].fs_node, file_tab[fd].fs_pid);
28 reply = b_recvb();
29 bcopy(reply,buf,len);
30 b_free(reply);
31 return(len);
32 }










The file_tab array contains the node and pid of all file-controller processes accessible
by this process. The read function sends a request to a file controller selected from file_tab
using fd as the index. When the file controller finishes reading the requested amount of
data, the data is sent back in a message. The function is shown waiting for the reply
using the normal b_recvb function.
28 reply = b_recvb();
However, the b_recvb function is not adequate because it may pick up the wrong
message if another message arrives before the reply message. A receive-discretion mechanism
must be used to ensure that only the reply message for the read function is returned. The
reply messages, called the RPC messages, must therefore be distinguishable from other
messages that the process uses. Furthermore, messages that arrive before the reply message
must be queued and released in a transparent way so that the requesting program cannot
distinguish a local read from an RPC read.
The r-primitives implement the new message properties by layering and by adding two
more functions: RPC send and RPC receive. The message header for this layer contains
an RPC flag and a chaining pointer. Since RPC calls do not interleave in a process, a
process can have no more than one outstanding reply message at any one time. Storing one
distinguished type in a Boolean variable is therefore sufficient for positively identifying a
reply message. The defer_h and defer_t pointers are used to implement a queue for
non-RPC messages. The next pointer in the message header is used to chain deferred messages
into a linked list for the queue.
typedef struct HEADER { int is_rpc;
                        struct HEADER *next; } HEADER;

#define BODY_OF(h) (h+sizeof(HEADER))   /* given header, find body */
#define HEAD_OF(b) (b-sizeof(HEADER))   /* given body, find header */

HEADER *defer_h, *defer_t;              /* queue for holding non-rpc messages */
The r_recvb function replaces the b_recvb function for receiving normal messages. Instead
of calling b_recvb immediately, it checks the queue for any deferred messages. If there are
deferred messages, a message is removed from the queue and returned. Otherwise, b_recvb
is called.
char *r_recvb()
{
  char *p;

  if(defer_h) { p = (char *) defer_h;
                defer_h = defer_h->next;
                return(BODY_OF(p)); }

  return(BODY_OF(b_recvb()));
}
The r_recvrpc function waits for a reply message. It calls b_recvb
repeatedly until a reply message is received. The RPC message is then returned. Meanwhile,
all non-RPC messages that have arrived are stored in the queue.
char *r_recvrpc()
{
  char *p;

  while(1)
  { p = b_recvb();
    if(((HEADER *)p)->is_rpc == 1) return(BODY_OF(p));
    if(defer_h) defer_t = defer_t->next = (HEADER *) p;
    else        defer_t = defer_h       = (HEADER *) p;
    ((HEADER *) p)->next = 0;
  }
}
The r_send function clears the RPC flag before sending the message. The r_sendrpc
function sets the flag before sending the message.
r_send(p,node,pid)
char *p;
int node, pid;
{
  ((HEADER *)HEAD_OF(p))->is_rpc = 0;
  b_send(HEAD_OF(p), node, pid);
}

r_sendrpc(p,node,pid)
char *p;
int node, pid;
{
  ((HEADER *)HEAD_OF(p))->is_rpc = 1;
  b_send(HEAD_OF(p), node, pid);
}
If replies from the file controller are sent using r_sendrpc, the read function can be correctly
defined as:




read(fd,buf,len)
int fd, len;
char *buf;
{
  REQUEST *request;
  char *reply;

  request = (REQUEST *) r_malloc(sizeof(REQUEST));

  r_send((char *) request, file_tab[fd].fs_node, file_tab[fd].fs_pid);

  reply = r_recvrpc();
  bcopy(reply,buf,len);
  r_free(reply);
  return(len);
}
Listing 3.6 A correct implementation of the C read function.
The introduction of the RPC message type makes it possible for standard utility functions
to be implemented by message passing; however, the use of RPC and other discretion
mechanisms in utility functions has the potential effect of diminishing the available
concurrency in a program. For example, the use of read in a program forces all non-RPC messages
to wait while read is being completed, regardless of whether some of these messages can be
consumed without waiting for read to complete.
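The deferral behavior of the r-layer can be exercised without a kernel. The sketch below is an illustration under stated assumptions: sim_arrive and sim_recvb are hypothetical stand-ins for message arrival and b_recvb, the Msg structure folds header and body together, and the functions carry a _sim suffix to mark them as simplified versions of r_recvb and r_recvrpc rather than the thesis code.

```c
#include <assert.h>
#include <stddef.h>

typedef struct Msg {
    int         is_rpc;   /* 1 for an RPC reply, 0 for a normal message */
    int         body;     /* stand-in for the message body              */
    struct Msg *next;     /* chains deferred messages                   */
} Msg;

/* Simulated kernel queue standing in for b_recvb() (assumption). */
static Msg *arrivals[8];
static int  arr_head, arr_tail;
static Msg *defer_h, *defer_t;

void sim_reset(void)    { arr_head = arr_tail = 0; defer_h = defer_t = NULL; }
void sim_arrive(Msg *m) { assert(arr_tail < 8); arrivals[arr_tail++] = m; }
static Msg *sim_recvb(void)
{
    assert(arr_head < arr_tail);   /* a real b_recvb would block instead */
    return arrivals[arr_head++];
}

/* Normal receive: drain the deferred queue before asking the kernel. */
Msg *r_recvb_sim(void)
{
    if (defer_h) {
        Msg *m = defer_h;
        defer_h = m->next;
        return m;
    }
    return sim_recvb();
}

/* RPC receive: defer everything until the reply shows up. */
Msg *r_recvrpc_sim(void)
{
    for (;;) {
        Msg *m = sim_recvb();
        if (m->is_rpc) return m;
        m->next = NULL;
        if (defer_h) defer_t->next = m; else defer_h = m;
        defer_t = m;
    }
}

/* Returns the delivery order encoded as decimal digits. */
int demo_rpc_order(void)
{
    Msg a = {0, 1, NULL}, b = {0, 2, NULL}, r = {1, 9, NULL}, c = {0, 3, NULL};
    sim_reset();
    sim_arrive(&a); sim_arrive(&b); sim_arrive(&r); sim_arrive(&c);
    int order = r_recvrpc_sim()->body;          /* the reply jumps the queue */
    order = order * 10 + r_recvb_sim()->body;   /* deferred messages follow  */
    order = order * 10 + r_recvb_sim()->body;
    order = order * 10 + r_recvb_sim()->body;   /* then the kernel queue     */
    return order;
}
```

The two normal messages that arrived ahead of the reply are delivered afterwards in their original order, which is the transparency property the r-layer is meant to provide; it also makes the loss of concurrency visible, since neither could be consumed while the RPC was pending.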
3.3.3 The CSP-discretion layer (csp-layer)
Layering can also be used to implement the CSP synchronization primitives. In Hoare's
definition of CSP, send and receive are performed by P!expression and P?variable, respectively,
where P is the process reference of the communication partner. In later CSP variants, such
as OCCAM, send and receive are performed by C!expression and C?variable, respectively,
where C is the channel connecting the sender and the receiver. Both send and receive
functions will block until the communication partner has completed the complementary
operation on the same channel. The send and the receive functions can be implemented with
a mutual exchange of messages between the two processes. We will show an implementation
of CSP with channels.
Figure 3.8 Structure of a channel in a channel-based CSP implementation. 
Since messages associated with different channels may arrive in an order other than the
one in which CSP communication is to take place, messages must be tagged with a type
field, and those that have arrived early must be deferred. Let us construct a channel using
two logical communication endpoints, one each in the sender and the receiver. If we identify
the endpoints in each process by a small array index, the connectivity of the channels can
be completely described by four arrays in each process:
typedef struct { int type; int value; } CSP_MSG;

int other_end [MAX_CHAN];
int other_pid [MAX_CHAN];
int other_node[MAX_CHAN];
CSP_MSG *chan_queue[MAX_CHAN];
In each process, the entries other_node[j] and other_pid[j] identify the process at
the other end of channel j. The entry other_end[j] is channel j's identity at the other side
of the channel; i.e., the channel j on this side and the channel other_end[j] on the other
side both refer to the same channel. An unambiguous typing system can be constructed by
giving messages for channel j the type other_end[j]. The chan_queue array is an array
of pointers that holds queued messages for channels. Since each channel can have no more
than one pending message, only one pointer for each channel is needed for buffering early
messages. The csp_send and the csp_recv functions can be written as:
csp_send(chan,expr)
int chan, expr;
{
  CSP_MSG *sp = (CSP_MSG *) b_malloc(sizeof(CSP_MSG));

  sp->value = expr;
  sp->type  = other_end[chan];
  b_send(sp, other_node[chan], other_pid[chan]);

  while(!chan_queue[chan]) { sp = (CSP_MSG *) b_recvb();
                             chan_queue[sp->type] = sp; }

  b_free(chan_queue[chan]); chan_queue[chan] = 0;
}

csp_recv(chan,var)
int chan, *var;
{
  CSP_MSG *sp = (CSP_MSG *) b_malloc(sizeof(CSP_MSG));

  sp->type = other_end[chan];
  b_send(sp, other_node[chan], other_pid[chan]);

  while(!chan_queue[chan]) { sp = (CSP_MSG *) b_recvb();
                             chan_queue[sp->type] = sp; }
  *var = chan_queue[chan]->value;  /* read the queued message, which need */
                                   /* not be the last one received        */
  b_free(chan_queue[chan]); chan_queue[chan] = 0;
}
In both functions, a message buffer is allocated and sent to the other side of the channel.
The process then waits for a reciprocal message from the other side, if one has not already
arrived. The process frees that message, clears the message-queuing pointer, and returns.
The only difference between the send and the receive functions is that in csp_send, the
value to be sent is stored in the value field before the send. In csp_recv, the value is
retrieved from the message received before it is freed.
A more elaborate implementation of a superset of CSP was created by A. J. Martin [1]
and Marcel van der Goot.
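The per-channel holding pointers can be illustrated with a small single-process sketch. It makes several assumptions: an in-memory inbox (sim_recvb, hypothetical) replaces b_recvb, the reciprocal send is omitted, and csp_take is an illustrative name for the waiting half of csp_recv only. It shows a message for channel 0 arriving while the process is waiting on channel 1.

```c
#include <assert.h>
#include <stddef.h>

#define MAX_CHAN 4

typedef struct { int type; int value; } CspMsg;

static CspMsg *chan_queue[MAX_CHAN];   /* one early message per channel   */
static CspMsg *inbox[8];               /* stand-in for the kernel queue   */
static int     in_head, in_tail;

static CspMsg *sim_recvb(void)
{
    assert(in_head < in_tail);         /* a real b_recvb would block      */
    return inbox[in_head++];
}

/* Wait for the partner's message on chan and return its value. */
int csp_take(int chan)
{
    assert(chan >= 0 && chan < MAX_CHAN);
    while (chan_queue[chan] == NULL) {
        CspMsg *m = sim_recvb();
        chan_queue[m->type] = m;       /* park it under its own channel   */
    }
    int value = chan_queue[chan]->value;   /* read from the channel slot  */
    chan_queue[chan] = NULL;
    return value;
}

int demo_csp(void)
{
    CspMsg early = {0, 7}, wanted = {1, 42};
    in_head = in_tail = 0;
    for (int i = 0; i < MAX_CHAN; i++) chan_queue[i] = NULL;
    inbox[in_tail++] = &early;         /* channel 0 traffic arrives first */
    inbox[in_tail++] = &wanted;
    int first  = csp_take(1);          /* skips over the channel-0 message */
    int second = csp_take(0);          /* already parked: no receive       */
    return first * 100 + second;
}
```

Reading the value through chan_queue[chan] rather than through the most recently received pointer matters in the second call, where the wanted message was parked during an earlier wait and no receive takes place at all.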
3.3.4 A more general type-discretion layer (t-layer)
When user-defined message types are needed in a program with type discretion, the type
information can be encoded in the message body, and discretion can be handled by the
program itself, as in the merge_sort example. Alternatively, we can hide the message type
in the message header, as in the t-layer example below.
In the t-layer, the program supplies a type for the message when it is sent with the
t_send function. The t_send function stores the message type into the header before
the send. In the receive function, the program specifies the type of message to wait for.











typedef struct HEADER { int type;
                        struct HEADER *next; } HEADER;

t_send(p,node,pid,type)
char *p;
int node, pid, type;
{
  ((HEADER *)HEAD_OF(p))->type = type;
  b_send(HEAD_OF(p), node, pid);
}
The two pointer arrays, defer_h and defer_t, implement the queues. This queue
structure imposes a limit on the range of usable types, but a more general queue structure
can be used instead. The t_recvb function takes a message type as an argument. It waits
for and puts messages into the respective queue while the queue of the desired type remains
empty. When the queue is non-empty, a message is removed from the queue and returned
to the program.
HEADER *defer_h[MAX_TYPE], *defer_t[MAX_TYPE];

char *t_recvb(type)
int type;
{
  char *p;
  int t;

  while(!defer_h[type])
  { p = b_recvb();
    t = ((HEADER *) p)->type;
    if(defer_h[t]) defer_t[t] = defer_t[t]->next = (HEADER *) p;
    else           defer_t[t] = defer_h[t]       = (HEADER *) p;
    ((HEADER *) p)->next = 0;
  }
  p = (char *) defer_h[type];
  defer_h[type] = defer_h[type]->next;
  return(BODY_OF(p));
}
Section 3.4 Other Layers
3.4.1 A flow-controlling layer (f-layer)
Layering can also be used to implement transparent flow control of messages. Suppose
we have an application where it is necessary to limit the number of unconsumed messages
produced by each process. We can introduce a layer in which an acknowledgment message
is sent for every message consumed, and have the send function block until the number of
messages sent is no more than a preset value over the number of acknowledgments received.
In the following example, when ten or more messages are outstanding, the send
routine will call b_recvb to wait for messages. Since b_recvb does not distinguish normal
messages from acknowledgment messages, we will use the r-layer mechanism to selectively
wait for acknowledgment messages in the f-layer routines:
typedef struct HEADER { int node, pid, is_ack;
                        struct HEADER *next; } HEADER;

#define BODY_OF(h) (h+sizeof(HEADER))   /* given header, find body */
#define HEAD_OF(b) (b-sizeof(HEADER))   /* given body, find header */
#define COUNT_MAX 10

static int o_count;           /* number of outstanding messages.    */
HEADER *defer_h, *defer_t;    /* queue for holding normal messages. */
Since the receiver has to send an acknowledgment to the sender, the f-layer message
header must contain the ID of the sending process in addition to the next field of
the r-layer header. The header must also contain the flag is_ack to differentiate a normal
message from an acknowledgment message.
char *f_recvb()
{
  HEADER *p, *q;

  if(defer_h) { p = defer_h; defer_h = defer_h->next; }
  else { while(1) { p = (HEADER *) b_recvb();
                    if(!p->is_ack) break;
                    o_count--; b_free(p); } }

  q = (HEADER *) b_malloc(sizeof(HEADER));
  q->is_ack = 1; b_send(q,p->node,p->pid);

  return(BODY_OF(((char *)p)));
}
In the receive function, if there are any queued messages, one message is removed
from the queue. If the queue is empty, the function calls b_recvb repeatedly until a normal
message is received. In both cases, an acknowledgment is sent to the sender and the message
is returned to the caller. While waiting for a normal message, any acknowledgment messages
received cause the outstanding message counter to decrement.
char *f_send(p,node,pid)
char *p;
int node, pid;
{
  HEADER *q;

  while(o_count >= COUNT_MAX)
  {
    q = (HEADER *) b_recvb();
    if(q->is_ack) { o_count--; b_free(q); }
    else { if(defer_h) defer_t = defer_t->next = q;
           else        defer_t = defer_h       = q;
           q->next = 0; }
  }

  q = (HEADER *) HEAD_OF(p);
  q->node   = mynode();
  q->pid    = mypid();
  q->is_ack = 0;

  o_count++;
  b_send(q, node, pid);
}








In the send function, as long as the counter value is at least COUNT_MAX (ten), b_recvb is called
to obtain a message. If the message is a normal message, it is queued; if the message is an
acknowledgment message, the counter is decremented. Once the outstanding message counter
is less than COUNT_MAX, the outgoing message is sent and the outstanding
message counter is incremented.
If the communication graph is fixed (i.e., channel-like connectivity), it is more efficient
to have a separate counter for each channel, and to send an acknowledgment for every
COUNT_MAX/2 messages in each channel. Each acknowledgment message represents the
consumption of COUNT_MAX/2 messages.
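The per-channel variant with batched acknowledgments can be sketched as plain bookkeeping, with no messages actually exchanged. The Channel structure and the function names below are assumptions made for illustration; the scheduling in demo_acks (the receiver consumes only when the sender is blocked) is chosen to make the count easy to follow and is not implied by the text.

```c
#include <assert.h>

#define COUNT_MAX 10
#define ACK_BATCH (COUNT_MAX / 2)

typedef struct {
    int outstanding;   /* sender side: sent but not yet acknowledged     */
    int unacked;       /* receiver side: consumed since the last ack     */
} Channel;

/* Sender: returns 1 if a message may be sent now, 0 if it must block. */
int chan_try_send(Channel *c)
{
    if (c->outstanding >= COUNT_MAX) return 0;
    c->outstanding++;
    return 1;
}

/* Receiver: consume one message; returns 1 when an ack is due. */
int chan_consume(Channel *c)
{
    if (++c->unacked < ACK_BATCH) return 0;
    c->unacked = 0;
    return 1;
}

/* Sender: one acknowledgment stands for ACK_BATCH consumed messages. */
void chan_ack(Channel *c)
{
    assert(c->outstanding >= ACK_BATCH);
    c->outstanding -= ACK_BATCH;
}

/* Count the acknowledgments needed to push n messages through one
   channel when the receiver runs only while the sender is blocked. */
int demo_acks(int n)
{
    Channel c = {0, 0};
    int sent = 0, consumed = 0, acks = 0;
    while (sent < n) {
        if (chan_try_send(&c)) { sent++; continue; }
        /* Sender is blocked: let the receiver drain until it acks. */
        while (consumed < sent) {
            consumed++;
            if (chan_consume(&c)) { chan_ack(&c); acks++; break; }
        }
    }
    return acks;
}
```

With these settings, sending 25 messages takes 3 acknowledgments, where the one-acknowledgment-per-message scheme would have used one for each of the 15 messages consumed in the same run.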
3.4.2 The CK primitives 
The old CK (Cosmic Kernel) primitives, the original message primitives for the Cosmic Cube,
can also be built from the reactive primitives by layering. The primitives are defined around
a data structure called a message descriptor. (This is very similar to the way in which the







unsigned short msglen; 




We have treated messages as information carriers. Sending and receiving messages are
similar to memory allocation operations in C, in that it is the carrier that is affected. The
transfer of information is merely a side effect of moving these carriers. The CK primitives,
on the other hand, treat messages as information encoded in binary bit patterns and stored
in arrays of memory cells. When a message is being sent, the system fetches the information
from a designated storage buffer; when a message is received, the system writes the
information into a designated storage buffer.
Since the send and receive requests are not always completed when the send and receive
functions return, processes are allowed to run asynchronously while the transactions are
being completed. However, in order to avoid access conflicts in the buffers, a lock variable
is used for each transaction to indicate whether the transaction has completed. The buf
and lock variables in the MSGDESC structure are used to hold the buffer and the completion
lock.
When a message descriptor is used to send a message, the node and pid fields store the
ID of the destination process. The type and msglen fields store the message type and the
length of the message. The buf pointer references a memory buffer where the message is
contained. When send is called, the call will return immediately, but the lock remains set
until the send operation is complete.
When a message descriptor is used to receive a message, the type field is set to the type
of the message to be received. The buf field is set to reference the memory buffer where
the message body is to be stored. The buflen field contains the size of the memory buffer.
When a receive function is called, the call will return immediately, but the lock remains
set until the receive operation is complete. When receive is complete, the node and pid
fields contain the ID of the sending process. The msglen field contains the actual length of the
message. Incoming messages that do not have matching receive requests waiting for them
will be queued.
typedef struct HEADER { int snode, spid; 
int msglen; 
int type; 
struct HEADER *next; } HEADER; 
Other functions in the CK primitives are described in detail in the CK programming guide
[4]. In making the transition from the CK primitives to the RK (Reactive Kernel) primitives,
which we use on our machines, a compatibility library was created for the old CK programs
by layering. The message header for a CK layer would therefore contain the sender node and
pid, the message length, and the message type. It would also contain a pointer for making
linked lists for discretionary receives. The details and the listings for the implementation
have been omitted for brevity.
3.4.3 The RK primitives (x-primitives)
The RK primitives, or x-primitives, can also be built from the b-layer functions by layering.
The RK primitive set includes the following list of functions:
char *xmalloc();    --->  b_malloc();
char *xrecv();      --->  nb_recv();
char *xrecvb();     --->  b_recvb();
char *xrecvrpc();   --->  r_recvrpc();
xsend();            --->  b_send();
xsendrpc();         --->  r_sendrpc();
xfree();            --->  b_free();
int xlength();      --->  l_length();
The xmalloc, xrecvb, xsend, and xfree functions are equivalent to the b_malloc,
b_recvb, b_send, and b_free functions, respectively. The xrecv function is equivalent
to the nb_recv function, the non-blocking receive. The xlength function is equivalent to
the l_length function, the function that returns message length. The RPC functions are
similarly equivalent to those of the r-layer.
The RK primitives can therefore be implemented using a combination of the l-layer,
nb-layer, and r-layer. However, in the actual implementation of the Reactive Kernel, all three
of the layers are incorporated into the basic kernel for greater efficiency.
The x-primitives and associated functions will be discussed in the next section in
conjunction with the description of the Cosmic Environment, the generic multicomputer
operating environment in which the x-primitives are supported as the primary programming
system.
Section 3.5 Layering on Light-Weight Processes
Any layering that applies to heavy-weight processes and that makes sense in the context of
the light-weight processes can be applied to light-weight processes as well. If we represent
the kernel, handler, layer routines, and user program as four separate components, the chain
of control flow is shown in Figure 3.9.
Figure 3.9 Control flow for heavy-weight processes.
The control flow for light-weight processes, shown in Figure 3.10, is identical except for the 
absence of the handler component. 
Figure 3.10 Control flow for light-weight processes.
Although these two programming models are essentially interchangeable, light-weight
processes are more efficient in most machines because they avoid the context-switch cost.
However, programs composed of light-weight processes are more difficult to develop because
processes are not protected against each other in case of a programming error. The processes
must, in practice, coexist in the same address space.
Chapter 4 Cosmic Environment 
The Cosmic Environment, or CE, is a multicomputer programming specification that also
exists as an implementation on a number of multicomputers. Details for using CE can
be found in "The C Programmer's Abbreviated Guide to Multicomputer Programming" [3].
We will concentrate here on the reasoning behind the design of our implementation, but first
we will give a short definition of the Cosmic Environment Specification. The specification
covers the process model, the message system, and the library functions.
Section 4.1 The Cosmic Environment Specification
The agents of a computation in CE are: 
Processes: Each process is identified by a unique process ID, which is
a (node, pid) pair. Node identifies the multicomputer node
containing the process, and pid distinguishes one process from
another on the same multicomputer node.





Figure 4.1 Elements of a computation.
Message system: The message system accepts messages from the processes, routes
them according to their destination process ID, and delivers them
to their destination processes. Messages are queued en route to
their destinations; message order between any pair of processes is
preserved.
In CE, a process can allocate and release message buffers, send and receive messages,
create other processes, and terminate itself. The functions available to a C program are:
char *xmalloc(n)                      Allocates and returns a message buffer
unsigned n;                           sufficient for n bytes of data.

xfree(p)                              Releases a message buffer.
char *p;

char *xrecvb()                        Waits for and returns a message from the
                                      message system.

char *xrecv()                         Returns a message from the message system
                                      if one is available; returns a null pointer
                                      otherwise.

xsend(p,node,pid)                     Frees the message buffer, p, from the calling
char *p; int node, pid;               process, and sends the message buffer to the
                                      process whose ID is (node, pid).

spawn(name,node,pid,option)           Runs the program called name and assigns it
char *name, *option; int node, pid;   the ID (node, pid).

int mynode()                          Returns the node number of the calling
                                      process.

int mypid()                           Returns the pid number of the calling
                                      process.

exit()                                Terminates the calling process.
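The buffer life cycle implied by these functions (allocate, fill, send, receive, free) can be tried with loopback stand-ins. The sketch below is an assumption-laden illustration: the xmalloc, xfree, xsend, and xrecv defined here are hypothetical single-process substitutes written for this example, every message is delivered back to the sender whatever the destination ID, and the hidden Node header is an invented detail, not the layout used by any CE implementation.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hidden header placed before each message buffer (assumption). */
typedef struct Node { struct Node *next; } Node;
static Node *q_head, *q_tail;

char *xmalloc(unsigned n)
{
    Node *h = malloc(sizeof(Node) + n);
    if (h == NULL) return NULL;
    h->next = NULL;
    return (char *)(h + 1);
}

void xfree(char *p) { free((Node *)p - 1); }

/* Loopback: every message comes back to the sender, whatever the ID. */
void xsend(char *p, int node, int pid)
{
    Node *h = (Node *)p - 1;
    (void)node; (void)pid;
    h->next = NULL;
    if (q_tail) q_tail->next = h; else q_head = h;
    q_tail = h;
}

/* Non-blocking receive: null pointer when nothing is queued. */
char *xrecv(void)
{
    Node *h = q_head;
    if (h == NULL) return NULL;
    q_head = h->next;
    if (q_head == NULL) q_tail = NULL;
    return (char *)(h + 1);
}

/* Allocate, fill, send (which gives the buffer away), receive, free.
   Returns 1 on a clean round trip, 0 on a mismatch, -1 on failure. */
int demo_roundtrip(const char *text)
{
    unsigned n = (unsigned)strlen(text) + 1;
    char *out = xmalloc(n);
    if (out == NULL) return -1;
    memcpy(out, text, n);
    xsend(out, 0, 0);             /* out must not be touched after this */
    char *in = xrecv();
    if (in == NULL) return -1;
    int same = strcmp(in, text) == 0;
    xfree(in);
    return same && xrecv() == NULL;
}
```

The point of the exercise is the ownership rule stated for xsend: the sender gives the buffer up at the call, and the receiver becomes responsible for releasing it with xfree.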
This specification is short and simple. When our emphasis is on the study of multicomputer
programming, we do not need unnecessary features to distract us; what we do need is
a system that does not inhibit creativity. CE preserves the value of our work by making it
easy to provide efficient implementations for its specification on many multicomputers that
are otherwise software-incompatible.
Our CE specification was designed with the following two rules in mind:
1. Programming systems should be portable. 
2. Programming manuals are evil. 
The first design rule regards the portability of CE. A programming environment is portable if
many types of machines can be made to support the programming environment. Portability
is easy to achieve with CE because its functions are easy to provide in most multicomputers
and multiprocessors. CE can be supported at the user-program level with a compatibility
library, or at the system level with a reactive kernel. The reactive kernel makes kernel
implementation or substitution simple because it does not require much support from the
hardware.
The second design rule regards programming manuals. Manuals are a necessary evil.
Therefore, whenever possible, CE has been made easy to explain in order to shorten the
manuals. Besides this obvious advantage for people who do not enjoy reading manuals, CE
has become simple and intuitive because making it easy to explain has also made it easy to
use.
Having a short programming manual is self-rewarding. In an evolving system where
old features are constantly being revised or dropped and new features are constantly being 
added, keeping a large manual up-to-date is a non-trivial task for a small research group. 
By keeping the manual simple, we not only make manual revision less laborious, but also 
make system improvement easier, since we are not obliged to support any mis-features that 
have not been previously documented. Our view is that the less a user has to know in order
to efficiently complete the work, the better. 
Section 4.2 Our Cosmic Environment Implementation 
An implementation of the CE specification is a programming environment that embodies
the specification. Currently we have implementations that contain drivers for the Cosmic
Cube, the iPSC/1, the iPSC/2, the Symult 2010, and for the ghost cube - a set of
network-connected workstations treated as a single multicomputer. (For historical reasons, we retain
the use of the word "cube" to mean a multicomputer even though not all multicomputers
are binary n-cubes.) Other implementations that use shared memory for message passing
exist for the Sequent and for the Cray X-MP.
4.2.1 Structure of our CE implementation 
We start with the process model. A process group contains a set of processes connected to
the message system (Figure 4.2). Processes communicate with each other by sending and
receiving messages, and they refer to each other by means of their process IDs.
Figure 4.2 A process group.
In order for the set of processes to communicate with the outside world, the logically
uniform message system is physically partitioned into two parts: One resides in the
multicomputer and is called the node message system; the other resides outside of the
multicomputer and is called the host message system. The two parts are connected by a message
gateway, and the separation is made transparent to the processes (Figure 4.3). Processes
are then allowed to run either on the hosts or on the nodes.
Figure 4.3 Partitioning into two parts. 
Since our multicomputers are used in classes for student experiments, there are many
more users who need to use the multicomputers than there are available multicomputers.
But since most experiments require fewer nodes than are available in a multicomputer, we
want to support several users simultaneously on the same multicomputer. Space sharing is
the sharing of a multicomputer by more than one user such that each user is given a separate
subset of nodes in a multicomputer. The programming environment within each subset is
indistinguishable from one in which the user owns an entirely separate multicomputer having
the same number of nodes as the subset. Our message gateway must therefore interface with



















In our implementation, the host system is built on top of the TCP/IP network, and
the host processes run on any network-connected host that uses the Berkeley UNIX socket
mechanism. The node system is built on top of the multicomputer network, and may involve
either a replacement kernel in each node or a set of emulation routines for the CE functions.
In this particular implementation, the gateway is a single ifc process, and each host
message system is a single message-switcher process. The message switcher is the spoke of
the host message system. It is connected to each host process and to the ifc process via
TCP/IP stream sockets. Message-sending functions in a host process convert CE messages







Figure 4.5 Host message-system implementation.
ID of the destination process, the message switcher will send a message either to another
host process or to the ifc process. The ifc process waits for messages from both the
multicomputer and the switchers. When it gets a message from a switcher, it converts the
message into a multicomputer message and sends it to a multicomputer node owned by the
user who owns the switcher.
When the ifc process gets a message from the multicomputer, the node ID of the sender
is used to determine the destination switcher process. The ifc process then converts the 
message into a TCP/IP message and sends it to the switcher. When the switcher gets a 
message from the ifc process, it sends the message to the destination host process. The 
receive function in the host process then converts the message into a CE message to be 
returned to the user program. 
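The lookup from sender node to destination switcher can be pictured as a table search. The following is a minimal sketch under assumptions that the text does not state: each user's subset of nodes is taken to be a contiguous range, the switcher is represented by an integer handle, and the Partition structure and function names are hypothetical.

```c
#include <assert.h>

typedef struct {
    int first_node, node_count;   /* contiguous block given to one user  */
    int switcher;                 /* hypothetical handle of the user's   */
                                  /* message-switcher connection         */
} Partition;

/* Return the switcher that owns the given node, or -1 if the node
   belongs to no current user. */
int switcher_for_node(const Partition *parts, int n, int node)
{
    for (int i = 0; i < n; i++)
        if (node >= parts[i].first_node &&
            node <  parts[i].first_node + parts[i].node_count)
            return parts[i].switcher;
    return -1;
}

int demo_route(void)
{
    const Partition parts[] = { {0, 8, 100}, {8, 8, 200}, {16, 4, 300} };
    /* Node 11 falls in the second partition; node 40 is unowned. */
    return switcher_for_node(parts, 3, 11) + switcher_for_node(parts, 3, 40);
}
```

A linear search is adequate for the handful of users that share one machine; an actual gateway would also need to translate node numbers between the user's view and the physical machine, which this sketch leaves out.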
Figure 4.6 Cosmic Environment with unified resource management.
Since we have several multicomputers, and since some of them are of the same type,
we centralize the allocation of all multicomputers in a process called the cube daemon.
When a multicomputer is requested by type, the cube daemon tries to assign an available
multicomputer of the required type by searching the list of all multicomputers registered to
it. Thus, the user is not concerned with locating an available machine because it makes no
difference which one is assigned.
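Allocation by type amounts to a search over the registered machines. The sketch below assumes a flat array of registrations and whole-machine allocation only; the Cube structure, the type strings, and cube_alloc are hypothetical names, and the space sharing described earlier (handing out a subset of nodes) is deliberately left out.

```c
#include <assert.h>
#include <string.h>

typedef struct {
    const char *type;     /* e.g. "s2010"                         */
    const char *host;     /* host that the machine is attached to */
    int         in_use;
} Cube;

/* Mark and return the index of the first free multicomputer of the
   requested type; return -1 if none is available. */
int cube_alloc(Cube *list, int n, const char *type)
{
    for (int i = 0; i < n; i++) {
        if (!list[i].in_use && strcmp(list[i].type, type) == 0) {
            list[i].in_use = 1;
            return i;
        }
    }
    return -1;
}

int demo_alloc(void)
{
    Cube reg[] = {
        { "cosmic", "venus",   1 },
        { "s2010",  "psyche",  1 },
        { "s2010",  "perseus", 0 },
    };
    int first  = cube_alloc(reg, 3, "s2010");   /* the free s2010, index 2 */
    int second = cube_alloc(reg, 3, "s2010");   /* none left               */
    return first * 10 + second;
}
```

The caller names only the type, never the machine, which is the sense in which the user need not care which multicomputer is assigned.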
We connect all ifc processes and switcher processes with the cube daemon via TCP/IP
stream sockets. These sockets do not carry much traffic; they are merely tokens of
participation in CE for the switchers and the ifc processes.
4.2.2 Cosmic Environment exterior 
Having been spoiled by the convenience of the Network File System (NFS) on workstations,
the first thing that we decided that we did not want to know is where to go to access the
multicomputers. Like files in an NFS environment, CE is equally accessible from everywhere
in the same network. The cube daemon resides on a known host in a network, and a
configuration file in each participating machine is initialized to contain the network address
of the cube daemon.
Every utility that accesses CE connects to the cube daemon using the network address
found in the configuration file, making CE available and equally accessible from anywhere
within the same network. The most frequently used utility is the program called peek,
which prints the status of CE:
CUBE DAEMON version 7.2, up 9 days 20 hours on host ganymede 
{              } 3d cosmic cube, b:0000 [  venus fly trap] 2.3h
{              } 6d cosmic cube, b:0000 [  ceres TEST    ] 2.3h
{  sim mikep   } 4d ipsc2 cube,  b:0000 [ saturn iPSC2   ] 2.1h
{group david   } 7d ipsc cube    b:0000 [  titan :iPSC d7] 3.4h
{              } 28n s2010       b:0000 [ psyche :ginzu  ] 4.9d
{group apl     } 4n s2010        b:000b [salieri :ginzu  ] 6.9h
{              } 48n s2010       b:0000 [perseus :S2010  ] 4.9d
{group sharon  } 8n s2010        b:0007 [perseus :S2010  ] 29.2m
{group tony    } 8n s2010        b:000c [ mozart :S2010  ] 4.7h
The peek utility lists all available, occupied, and fragmented multicomputers. In the
display above, user tony and user sharon each occupy 8 nodes in a 64-node S2010 without
interfering with each other. User apl is using 4 nodes of a 32-node S2010. User david is
using a 128-node iPSC/1, and user mikep is using a 16-node iPSC/2.
To use a multicomputer, we must first allocate it. We specify the multicomputer
type, and the cube daemon picks the best allocation according to an algorithm
specific to that type. To allocate a 3-node S2010, we can enter "getcube 3n";
peek will now show the following list:
CUBE DAEMON version 7.2, up 9 days 20 hours on host ganymede 
{              } 3d cosmic cube, b:0000 [  venus fly trap]
{              } 6d cosmic cube, b:0000 [  ceres TEST    ]
{  sim mikep   } 4d ipsc2 cube,  b:0000 [ saturn iPSC2   ]
{group david   } 7d ipsc cube    b:0000 [  titan :iPSC d7]
{              } 28n s2010       b:0000 [ psyche :ginzu  ]
{group apl     } 4n s2010        b:000b [salieri :ginzu  ]
{              } 45n s2010       b:0000 [perseus :S2010  ]
{group wen-king} 3n s2010        b:0007 [neptune :S2010  ]
{group sharon  } 8n s2010        b:0007 [perseus :S2010  ]
{group tony    } 8n s2010        b:000c [ mozart :S2010  ]
GROUP {group wen-king} TYPE reactive IDLE 5.0s
( -1 -1)
( -1 -2)
(--- ---)
SERVER 0s
[neptune 18339] 3.0s
[neptune 18340] 3.0s
In this example, the allocation algorithm carves out a 3-node subset from the
multicomputer shared by sharon and tony, instead of from the one used by apl. After the
allocation, any multicomputer programs that we run on the hosts or on the nodes become
part of our process group. The host processes will be connected to our switcher and the node
processes will be spawned on our nodes. Host processes are shown in the extended peek
display below the main list. In this example, a set of server programs was automatically
started and added to the process group when getcube returned.
4.2.3 Cosmic Environment processes 
While CE is not in use, the only active processes in the hosts are the cube daemon process
and the ifc processes. Each ifc process resides in a host containing an interface to a
multicomputer, and maintains a TCP/IP connection to the cube daemon process. The cube
daemon keeps track of its set of ifc connections; that a connection remains open is an
indication that the multicomputer attached to the ifc process is ready for use. An ifc
process passes the multicomputer status to the cube daemon via its TCP/IP connection.
The cube daemon process passes allocation and deallocation commands to the ifc process
via the same connection.
When a user requests a multicomputer by running the getcube program, the getcube
process connects to the cube daemon and sends it a set of allocation requirements. If the
requirements can be fulfilled, the requested multicomputer or a partition of the
multicomputer is marked as allocated in the cube daemon's table. An allocation command is then sent to
the corresponding ifc process. The ifc process initializes nodes allocated to the user and
then connects to the user's getcube process. The getcube process then fades to background
to become the switcher process, giving the user the appearance that the getcube command
has terminated as an indication that the allocation has completed.
A set of service processes is started by the getcube process as it fades to background.
These processes are responsible for such mundane tasks as the details of process spawning,
file access, and printing of error messages. Additional host processes and utilities are run
by the user to perform computation.
Porting CE to another multicomputer involves the creation of a new plug-in node system 
for the new multicomputer. We have a choice of implementing the CE node system on top 
of the native node kernel or writing a new kernel that implements the CE node system. 
The Cosmic Cube and the S2010 both have the CE node system as their native system. We
replaced the iPSC/2 kernel with a custom kernel. On the iPSC/1 and on earlier versions of
the iPSC/2, the CE node system is layered on top of their native systems, the NX kernels.
When we layer a CE node system on top of the native node kernel, the ifc process
is linked with the native host library for the multicomputer, and it interacts with the
multicomputer via the native message functions. To the native system running underneath,
the ifc process appears to be just an ordinary host process of the native system. The
CE node system can operate within the confines of user-accessible functions of the native
system because it has simple requirements: it does not need special capabilities from the
native system and it does not interfere with the functioning of the native system.
4.2.4 Program compilation
Different commercial multicomputers will invariably provide dissimilar methods of compiling
programs for their multicomputers. The compiler options are different; those with the same
name may have different meanings to different compilers and some that are available to one
compiler may be missing for another compiler. The sequence of operations that the user has
to go through may be different, and the set of end products may also be different. However,
we recognize that only a small set of the options is useful, and we can easily hide any
difference among the compilers by the use of a program that runs programs. By declaring
that only a limited set of commonly used compiler flags are supported, the compilation











ghost    cosmic   iPSC/1   iPSC/2    S2010
ccgh     cccos    ccipsc   ccipsc2   ccs2010
.gh.o    .o86     .o286
.gh      .cos     .ipsc
argh     arcos    aripsc
.gh.a    .a86     .a286




The following sequence will compile the program myprogram.c for all of these machines,
and the runnable object code generated will be named myprogram, myprogram.gh,
myprogram.cos, myprogram.ipsc, myprogram.ipsc2, and myprogram.s2010, respectively.
% cch -o myprogram myprogram.c -lcube
% ccgh -o myprogram myprogram.c -lcube
% cccos -o myprogram myprogram.c -lcube
% ccipsc -o myprogram myprogram.c -lcube
% ccipsc2 -o myprogram myprogram.c -lcube
% ccs2010 -o myprogram myprogram.c -lcube
To illustrate the amount of complexity hiding that can be performed, consider that actual
compilation for the iPSC/1 can be done only on the controller box of the iPSC/1, the Intel 286/310.
The program ccipsc copies the source files to the 286/310 for compilation, and copies back
compiled object files when compilation is completed. It creates an illusion that compilation
takes place where the ccipsc command is issued.
4.2.5 Spawning programs 
Like compilers, different multicomputers supply their own method of running a node
program. We can hide the differences by using programs that run other programs; but, unlike
the compilers, we no longer have to differentiate one multicomputer from another by giving
them different names. While a compiler can be invoked by the user at any time, a program
loader can be invoked only when the user has an active process group.
We can therefore eliminate another level of complexity by having the generic loader,
spawn, check the type of the multicomputer being used and have it run the loader
command specific to that multicomputer. Thus, to load the program generated in the previous
example into any of the multicomputers, we can run "spawn myprogram," regardless of the
multicomputer we are using.
Utilities such as the node-program compilers are called machine-specific utilities;
utilities such as spawn are called machine-dependent utilities; and utilities such as peek are
called machine-independent utilities. The node system for each type of multicomputer,
therefore, contains the ifc process, the machine-specific utilities, the machine-dependent
utilities, and the compiler libraries.
4.2.6 Data representation and conversion
We have tried to simplify CE and, at the same time, to hide the differences between different
multicomputers; but it is not always possible to do both. The difference in data
representation among processors of different multicomputers and hosts is one that we cannot hide
in vanilla C. When two communicating processes are run on two machines having different
data representations, data in messages sent from one process to another need to have their
representations converted before they can be used. We can always move the conversion
problem into the compiler, but we still have to decide how the problem is to be solved.
68020: 01000000 00001001 00100001 11111011 01010100 01000100 00101101 00011000
vax:   01001001 01000001 11011010 00001111 00100001 10100010 11000010 01101000
80286: 00011000 00101101 01000100 01010100 11111011 00100001 00001001 01000000
Listing 4.1 Three representations of π in double-precision floating-point-number format.
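The 68020 and 80286 rows of Listing 4.1 hold the same eight bytes in opposite order, which a short sketch can confirm. The function names below are hypothetical, and the code assumes that the host stores a double as a 64-bit IEEE 754 value. The vax row uses a different floating-point format, so byte reversal alone would not produce it.

```c
#include <stdint.h>
#include <string.h>

/* Write the IEEE 754 bit pattern of x into out[0..7], most significant
 * byte first (the 68020 row of Listing 4.1). Assumes that the host
 * double is a 64-bit IEEE 754 value. */
void double_to_big_endian(double x, unsigned char out[8])
{
    uint64_t bits;
    int i;
    memcpy(&bits, &x, sizeof bits);
    for (i = 0; i < 8; i++)
        out[i] = (unsigned char)(bits >> (56 - 8 * i));
}

/* Reverse an 8-byte item in place, turning the 68020 byte order into
 * the 80286 byte order and vice versa. */
void reverse8(unsigned char b[8])
{
    int i;
    for (i = 0; i < 4; i++) {
        unsigned char tmp = b[i];
        b[i] = b[7 - i];
        b[7 - i] = tmp;
    }
}
```

Applied to π, double_to_big_endian yields the bytes 40 09 21 FB 54 44 2D 18 in hexadecimal, which are the eight binary groups of the 68020 row.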
Data-representation problems have been a subject of study ever since computers were
first connected by networks. The most common solution is to define an interchange data
representation. The sender converts data items in its outgoing messages from the sender's
representation to the interchange representation; the receiver converts data items in its
incoming messages from the interchange representation to the receiver's representation. A
set of conversion routines with the same name but having different functions on different
machines is provided to make programs portable. A program needs only to be capable of
converting its data to and from the interchange representation, rather than to and from all
possible representations.
In the case of a multicomputer, however, message traffic is usually much higher and
message latency is usually much lower between the nodes than between the hosts. Having to
convert the data in each internode message to and from an interchange representation can
significantly reduce the performance of message-intensive applications unless the interchange
representation happens to be identical to the representation of the multicomputer.
Our solution is therefore to make the interchange representation adjustable: we define
the interchange representation for a process group to be the representation used by the
multicomputer of the process group. Node processes are not required to convert the data
in their messages, and, if they do, the functions that they call to perform the conversion
will have no effect. A host process is required to convert message data to the interchange
representation before it sends a message, and from the interchange representation after it
receives a message. Host processes already have a large per-message overhead, and they
can absorb the extra work of converting the data.
The node programs never need any conversion routines, but host programs must carry
routines that convert data representations to and from those of all multicomputers that CE
supports. The conversion routines check the multicomputer type before deciding how data
is to be converted. Adding a new multicomputer to CE may require that host programs be
recompiled if the data format for the multicomputer is not already supported.
In order to preserve the CE specifications, conversions are done in place, because
message buffers are treated like memory buffers from malloc. Having to convert a message and
put the converted data in another buffer weakens the specification. In order to have such
conversion make sense, however, the location and the size of each data item in the messages
must be the same for all processes. Yet different machines do have different sizes and
alignment rules for the same data type.








Listing 4.2 Three layouts of a structure, in order of increasing byte address.
For data sizes, we made the decision that in all the machines that we support data
items will have the following sizes, and a message should include only the following data
types:
double-precision floating-point number   64 bits
single-precision floating-point number   32 bits
long integer                             32 bits
short integer                            16 bits
character                                 8 bits
For alignment, we add any necessary padding to force each data item to align on its
strictest alignment boundary: a k-byte data type should be aligned on a k-byte boundary.
The bottom of a data structure should also be rounded out by padding it to the alignment
boundary of the largest data item in the structure. Whenever possible, a structure should
be rearranged to minimize the amount of padding necessary.
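The alignment rule can be checked mechanically. The following sketch, with the hypothetical helper names align_up and layout, computes the offset of each member and the padded size of a structure from the member sizes alone, assuming every size is one of the 1-, 2-, 4-, or 8-byte sizes listed above.

```c
#include <assert.h>
#include <stddef.h>

/* Round offset up to the next multiple of align (align > 0). */
size_t align_up(size_t offset, size_t align)
{
    assert(align > 0);
    return (offset + align - 1) / align * align;
}

/* Given the sizes of the n members of a structure in declaration order,
 * store each member's offset in offsets[] and return the padded total
 * size. Each k-byte item is placed on a k-byte boundary, and the end of
 * the structure is padded to the boundary of its largest item. */
size_t layout(const size_t sizes[], size_t n, size_t offsets[])
{
    size_t pos = 0, largest = 1, i;
    for (i = 0; i < n; i++) {
        pos = align_up(pos, sizes[i]);
        offsets[i] = pos;
        pos += sizes[i];
        if (sizes[i] > largest)
            largest = sizes[i];
    }
    return align_up(pos, largest);
}
```

For a structure declared as character, double, short, long, the sketch gives a padded size of 24 bytes; rearranged as double, long, short, character, the same members occupy 16 bytes, which illustrates why rearrangement is recommended.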
When data items are aligned using these rules, the location of each data item in a
message is the same for all machines. A set of conversion routines can be used to perform
in-place conversion on the items:
htocs(p,n)   ctohs(p,n)   Convert short integers.
htocl(p,n)   ctohl(p,n)   Convert long integers.
htocf(p,n)   ctohf(p,n)   Convert single-precision floating-point numbers.
htocd(p,n)   ctohd(p,n)   Convert double-precision floating-point numbers.
The htoc set of functions converts data from the format used by the calling process
to the interchange format. The ctoh set of functions performs the reverse conversion.
Parameter p is a pointer to an item of the appropriate type and parameter n is the number
of consecutive data items to be converted by the functions. There is no conversion routine
for the character type because the basic units of the messages are bytes and their correct
ordering is enforced by the ifc process.
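As an illustration of the (p,n) calling convention, the sketch below converts n consecutive 32-bit integers in place. It is not the CE implementation: the name sketch_convert_longs is hypothetical, the swap flag stands in for the check of the multicomputer type, and the only difference assumed between the two representations is byte order.

```c
#include <stddef.h>

/* Hypothetical stand-in for an htocl/ctohl-style routine: convert n
 * consecutive 32-bit integers starting at p, in place. The swap flag
 * replaces the real check of the multicomputer type; when it is zero
 * the call has no effect, as it would in a node process. */
void sketch_convert_longs(void *p, size_t n, int swap)
{
    unsigned char *b = p;
    size_t i;
    if (!swap)
        return;
    for (i = 0; i < n; i++, b += 4) {
        unsigned char t0 = b[0], t1 = b[1];
        b[0] = b[3];
        b[1] = b[2];
        b[2] = t1;
        b[3] = t0;
    }
}
```

Because byte reversal is its own inverse, the same body serves both directions in this sketch; a pair of formats that differ in more than byte order would need distinct htoc and ctoh bodies.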
The data representation problem may require rethinking after machines with a 64-bit
data bus become available. Data-type conversion is only an inconvenience, and it can always
be taken care of by writing a new compiler that inserts code to do the conversion for the
user. However, such is beyond the scope of this research.
Chapter 5 Model of Simulation 
Section 5.1 Mathematical Framework and Analysis
5.1.1 Systems and elements
A system consists of a system body, a set of system inputs, and a set of system outputs.
It is a "black box" whose only external connections are the inputs and outputs. In a
representation of a simulator, each individual output conveys an atomic property of the
simulated system. A property is atomic if at any point during the simulation the simulator
contains all information about that property up to some simulated time, but none beyond
that simulated time.
Figure 5.1 Representation of a system.
A system can be defined recursively as a collection of systems linked together by arcs;
each arc connects an output of its source system to an input of its destination system, and
each arc represents the source system's direct influence on the destination system. The
recursion terminates with systems that are called elements; the behavior of each element is
defined algorithmically to correspond to a model of some physical device or process.
Figure 5.2 Representation of a system composed of elements.
If the hierarchy that is induced by this recursive definition is flattened by expanding
each system recursively into its constituent systems and elements, we obtain a system that
is composed entirely of elements. In order to simplify the following exposition, we shall,
without loss of generality, discuss a system that is composed entirely of elements.
In a composite system, each element input can be connected to no more than one
arc, whereas each element output can be connected to any number of arcs. The set of
system inputs is the set of unconnected element inputs, whereas the set of system outputs
can be any subset of the element outputs. Systems without any inputs are called closed
systems. In order to simplify the mathematical framework, we shall close each system with
an environment element, $e_e$, that provides inputs to all unconnected system inputs and
accepts outputs from all unconnected system outputs.
Figure 5.3 Closing a system into a closed graph.
The representation is now a graph that can be described as below:
$E$: The set of elements in a system.
$A$: The set of arcs in a system.
$U$: $E \cup \{e_e\}$.
$\mathrm{inp}(e)$: The set of all arcs terminating at $e$.
$\mathrm{out}(e)$: The set of all arcs originating from $e$.
$\mathrm{src}(a)$: The source element of $a$.
$\mathrm{dst}(a)$: The destination element of $a$.
Figure 5.4 Arc source and destination.
Figure 5.5 Element inputs and outputs.
path: A path of length $n$ is a sequence of arcs, $(a_0, a_1, a_2, \ldots, a_{n-1})$, such that
$\mathrm{dst}(a_i) = \mathrm{src}(a_{i+1})$ for $0 \le i < n-1$.
Figure 5.6 Arcs $a_0$ through $a_4$ form a path of length 5.
Figure 5.7 Arcs $a_0$ through $a_4$ form a circuit of length 5.
circuit: A circuit of length $n$ is a path of length $n$ in which $\mathrm{src}(a_0) = \mathrm{dst}(a_{n-1})$.
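The path and circuit definitions translate directly into code. In the sketch below, the structure arc and the functions is_path and is_circuit are hypothetical names; elements are identified by integer indices, and a sequence of arcs is given as indices into an array of arcs.

```c
#include <assert.h>
#include <stddef.h>

/* One arc of the closed graph; elements are identified by index. */
struct arc {
    int src;   /* src(a): element from which the arc originates */
    int dst;   /* dst(a): element at which the arc terminates   */
};

/* Return 1 if seq[0..n-1] (indices into arcs[]) is a path, that is,
 * dst(a_i) == src(a_{i+1}) for 0 <= i < n-1; otherwise return 0. */
int is_path(const struct arc arcs[], const int seq[], size_t n)
{
    size_t i;
    assert(n > 0);
    for (i = 0; i + 1 < n; i++)
        if (arcs[seq[i]].dst != arcs[seq[i + 1]].src)
            return 0;
    return 1;
}

/* A circuit is a path in which src(a_0) == dst(a_{n-1}). */
int is_circuit(const struct arc arcs[], const int seq[], size_t n)
{
    return is_path(arcs, seq, n) &&
           arcs[seq[0]].src == arcs[seq[n - 1]].dst;
}
```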
5.1.2 States and time
The state of a system includes both its internal state and the state of its outputs. Let
$S_U(t_0, t_1)$ be the state description of the closed system between the times $t_0$ and $t_1$, $t_0 \le t_1$,
and let $S_L(t_0, t_1)$ be the state description restricted to the subset or member, $L$. The state
of the closed system can be written as a Cartesian product of the environment state and
the system state:
$$S_U(t_0, t_1) = S_{e_e}(t_0, t_1) \times S_E(t_0, t_1)$$
Similarly, the system state can be written as the Cartesian product of the element states:
$$S_E(t_0, t_1) = \prod_{e \in E} S_e(t_0, t_1)$$
A simulator is said to be progressive if it can compute the following function for any
valid input description, $S_{\mathrm{inp}(E)}(t_0, t_1)$, which is a description of input state over a time
interval, and any valid initial state of the system, $S_E(t_0, t_0)$:
$$S_{\mathrm{inp}(E)}(t_0, t_1),\ S_E(t_0, t_0) \mapsto S_E(t_0, t_1)$$
A simulator may be able to compute more state information for some of its outputs
than is specified above. For example, if the system can compute the following function for
some $\delta \ge 0$, the output $o$ is said to have a delay of no less than $\delta$ at time $t_1$:
$$S_{\mathrm{inp}(E)}(t_0, t_1),\ S_E(t_0, t_0) \mapsto S_o(t_0, t_1 + \delta)$$
If $\delta$ is the largest value for the above to remain true, then $\delta$ is the delay of the output $o$ at
simulated time $t_1$. The delay of a system at simulated time $t_1$ is defined to be the smallest
of all output delays of the system at $t_1$. The definition of a progressive simulator precludes
the possibility of negative delays.
5.1.3 Knots and progress
In this section, we shall define a set of rules that allows us to recursively construct
progressive system simulators by connecting progressive element simulators in the same manner in
which the elements of the system are connected. We shall call such a simulator a composite
simulator. In order to discuss progress, we make a minimal assumption that information
computed at any element simulator, $e$, will be available to all $\mathrm{dst}(\mathrm{out}(e))$. We shall assume
for the moment that elements are deterministic; that is, $S_{\mathrm{inp}(e)}(t_0, t_1)$ and $S_e(t_0, t_0)$
completely determine $S_e(t_0, t_1)$. Thus, in order to determine whether a simulator is progressive,
we need to consider only the arc state, $S_A(t_0, t_1)$.
A simulator lacks progress if and only if there exists a combination of $S_{\mathrm{inp}(E)}(t_0, t_1)$
and $S_E(t_0, t_0)$ such that the simulator fails to compute $S_a(t_0, t_1)$ for some $a \in A$. Let $t_K$
be the time value, $t_0 \le t_K < t_1$, such that the simulator can compute $S_A(t_0, t_K)$ but not
$S_A(t_0, t_1)$. Let $K \subseteq A$ be the set of arcs such that the simulator can compute $S_a(t_0, t_K)$
but not $S_a(t_0, t_1)$. The set $K$ is called a knot in the simulation. The presence of a knot is
synonymous with a lack of progress.
Knot: Simulator can compute $S_a(t_0, t_1)$ for all $a \notin K$.
Simulator can compute only $S_a(t_0, t_K)$ for all $a \in K$.
Figure 5.8 Example of a knot-containing system.
An example of a knot-containing system is a zero-delay NAND-gate with one of its
inputs connected to its output, as shown in Figure 5.8. Although the element simulator
for the NAND-gate may be progressive, the composite simulator for this system cannot be.
For example, if the input to the system is the following:
0 for $0 \le t < 1$;
1 for $1 \le t \le 2$,
then the composite simulator can compute only the following for the arc $a_2$:
1 for $0 \le t < 1$;
undefined for $1 \le t \le 2$.
The simulator cannot compute $S_{a_2}$ for $1 \le t \le 2$ because a self-consistent state assignment
for $a_2$ cannot be found. The set of arcs $\{a_2\}$ is a knot.
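The absence of a self-consistent assignment can be confirmed by exhaustive search over the two possible values of the feedback arc. The function names in this sketch are hypothetical; the sketch treats the gate as a purely Boolean, zero-delay function.

```c
/* Zero-delay NAND gate. */
int nand(int x, int y)
{
    return !(x && y);
}

/* Count the self-consistent values of the feedback arc a2 for a given
 * system input: a2 must satisfy a2 == NAND(input, a2). */
int consistent_assignments(int input)
{
    int a2, count = 0;
    for (a2 = 0; a2 <= 1; a2++)
        if (a2 == nand(input, a2))
            count++;
    return count;
}
```

With the input at 0 there is exactly one consistent value (the output is 1); with the input at 1 there is none, which is the knot described above.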
Theorem 5.1: If $a$ is an arc of knot $K$, then the following conditions hold:
a. $\mathrm{inp}(\mathrm{src}(a))$ is not empty; i.e., $\mathrm{src}(a)$ is not a source node in the directed
graph of elements.
b. The delay of $\mathrm{src}(a)$ at $t_K$ is 0.
c. Some member of $\mathrm{inp}(\mathrm{src}(a))$ is also a member of $K$.
Proof:
a. If the set of arcs, $\mathrm{inp}(\mathrm{src}(a))$, is empty, then $\mathrm{src}(a)$ is a closed system.
A closed system does not need any information from its environment in
order to compute its state; it is able to compute its outputs up to any
arbitrary time. Therefore, $\mathrm{inp}(\mathrm{src}(a))$ cannot be empty.
b. By the definition of a knot, the simulator can compute up to $t_K$ for all arcs
in $\mathrm{inp}(\mathrm{src}(a))$. If the delay for $\mathrm{src}(a)$ is greater than zero, the simulator
would be able to compute up to $t_1$ for $a$. Since it cannot, by definition,
the delay of $\mathrm{src}(a)$ must be zero.
c. If no member of $\mathrm{inp}(\mathrm{src}(a))$ is in $K$, then, by the definition of a knot,
the simulator should be able to compute up to $t_1$ for all members of
$\mathrm{inp}(\mathrm{src}(a))$. Furthermore, since delay cannot be negative, the simulator
should be able to compute up to $t_1$ for $a$. Therefore, if $a$ is in $K$, some
member of $\mathrm{inp}(\mathrm{src}(a))$ must also be a member of $K$.
5.1.4 Rules of thumb - sufficient conditions for progress
Corollary 5.2: Every knot contains a circuit.
Proof: There is a finite number of arcs in a system. If for every arc $a_i \in K$ there
is at least one arc $a_j \in K$ such that $a_j \in \mathrm{inp}(\mathrm{src}(a_i))$, then there must
be a circuit in $K$.
Corollary 5.3: If the system contains no circuits, then the composite simulator is
progressive.
Proof: Since every knot must contain a circuit, a system that does not contain
any circuits cannot have knots.
Corollary 5.4: If every element has a delay greater than 0, then the composite simulator
is progressive.
Proof: Follows directly from Theorem 5.1, part b.
Corollary 5.5: If in every circuit there is some element with non-zero delay, then the
simulator is progressive.
Proof: From Corollary 5.2, if $K$ exists, it must contain a circuit. From Theorem
5.1, if such a circuit exists, all the elements in it must have zero delay.
Therefore, if all circuits have at least one element with non-zero delay,
then $K$ cannot exist.
Although the progress conditions stated in Corollaries 5.3, 5.4, and 5.5 identify a set
of systems with progressive simulators, they do not identify, either by themselves or all
together, the set of all systems with progressive simulators. These are not minimal
conditions, because there are systems with progressive simulators that do not satisfy any of the
three corollaries. The corollaries are useful as simple rules of thumb because there exists an
effective procedure for testing each of them.
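One possible effective procedure for the hypothesis of Corollary 5.5 is sketched below. It is an illustration rather than the thesis's own procedure: it assumes a single constant delay per element, and the names are hypothetical. Every circuit contains an element with non-zero delay exactly when the subgraph induced by the zero-delay elements is acyclic, which the sketch tests with Kahn's topological sort.

```c
#include <assert.h>
#include <stdlib.h>

struct arc { int src, dst; };   /* element indices */

/* Return 1 if every circuit contains at least one element with a
 * non-zero delay (the hypothesis of Corollary 5.5), else 0.
 * delay[e] is the delay of element e. */
int every_circuit_has_delay(int n_elem, const double delay[],
                            int n_arc, const struct arc arcs[])
{
    int *indeg, *stack;
    int top = 0, zero = 0, removed = 0, e, i;

    assert(n_elem > 0);
    indeg = calloc((size_t)n_elem, sizeof *indeg);
    stack = malloc((size_t)n_elem * sizeof *stack);
    assert(indeg != NULL && stack != NULL);

    /* In-degrees within the subgraph of zero-delay elements. */
    for (i = 0; i < n_arc; i++)
        if (delay[arcs[i].src] == 0 && delay[arcs[i].dst] == 0)
            indeg[arcs[i].dst]++;
    for (e = 0; e < n_elem; e++)
        if (delay[e] == 0) {
            zero++;
            if (indeg[e] == 0)
                stack[top++] = e;
        }
    /* Repeatedly remove zero-delay elements with no remaining inputs. */
    while (top > 0) {
        e = stack[--top];
        removed++;
        for (i = 0; i < n_arc; i++)
            if (arcs[i].src == e && delay[arcs[i].dst] == 0 &&
                --indeg[arcs[i].dst] == 0)
                stack[top++] = arcs[i].dst;
    }
    free(indeg);
    free(stack);
    return removed == zero;   /* leftovers imply a zero-delay circuit */
}
```

The same routine covers Corollary 5.3 (a system with no circuits passes regardless of delays) and Corollary 5.4 (with no zero-delay elements there is nothing to remove, and the test passes).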
5.1.5 Non-existence of necessary and sufficient progress conditions
5.1.5.1 Simulation and Boolean satisfiability
An algorithm that tests for a necessary and sufficient condition, if any such condition does
exist, must be NP-hard. Figure 5.9 shows a system that tests for the satisfiability condition
in a set of Boolean clauses. The system contains a zero-delay NAND gate, a counter, a clock
source, and a network of zero-delay gates forming the clauses. A simulator for the system
is not progressive if and only if there exists a counter output such that all of the clauses
are true. If there is an algorithm that can determine whether a simulator for any system of
this form has progress, we can use it to determine whether any collection of clauses can all
be true at the same time. Since the latter operation (Boolean satisfiability [17]) is known
to be NP-complete, the algorithm must be NP-hard. Therefore, any generic algorithm that
tests for a necessary and sufficient condition must be NP-hard.
Figure 5.9 A circuit to evaluate satisfiability of a set of clauses.
5.1.5.2 Simulation and simultaneous equations
Another way to demonstrate the futility of searching for a necessary and sufficient condition
is to examine the relationship between simulation and simultaneous equations. We define a
progressive simulator to be one that can compute the following function for any valid input
description, $S_{\mathrm{inp}(E)}(t_0, t_1)$, and any valid initial state, $S_E(t_0, t_0)$:
$$S_{\mathrm{inp}(E)}(t_0, t_1),\ S_E(t_0, t_0) \mapsto S_E(t_0, t_1)$$
Let $H_e$ be the mapping associated with a progressive simulator for the element $e$; we can
express a composite simulator as the following set of equations:
$$\forall e \in E:\quad S_e(t_0, t_1) = H_e(S_e(t_0, t_0),\ S_{\mathrm{inp}(e)}(t_0, t_1))$$
Since $S_e(t_0, t_1)$ describes $S_{\mathrm{out}(e)}(t_0, t_1)$, and since $S_A(t_0, t_1)$ and $S_E(t_0, t_0)$ determine
$S_E(t_0, t_1)$, a composite simulator can also be expressed as the following set of equations:
$$\forall a \in A:\quad S_a(t_0, t_1) = G_a(S_{\mathrm{src}(a)}(t_0, t_0),\ S_{\mathrm{inp}(\mathrm{src}(a))}(t_0, t_1))$$
$G_a$ is $H_{\mathrm{src}(a)}$ restricted to the arc $a$. These are simultaneous equations in the form:
$$\forall i:\quad X_i = F_i(X_1, X_2, \ldots, X_n)$$
Furthermore, any set of simultaneous equations can be transformed into a physical
system for which a composite simulator can be constructed. The set of all simulators and
the set of all simultaneous equations must be equivalent.
Figure 5.10 Mapping equations into physical system. 
In any set of simultaneous equations, only one of the three possibilities listed below can
exist.
1. The simultaneous equations have no solution.
2. The simultaneous equations have exactly one solution.
3. The simultaneous equations have more than one solution.
Since a simulation is progressive if and only if its set of simultaneous equations has a
solution, any test for determining progress of a simulator can be used as a test for
determining the existence of solutions for simultaneous equations, and vice versa. Since the test
for the latter has not been found, the test for the former also has not been found. The
search for a necessary and sufficient condition is, therefore, both difficult and, so far, futile.
Section 5.2 Operational Framework
Although an effective simultaneous-equation solver for the general case does not exist, the
simultaneous-equation representation brings us one step closer to an operational model,
because effective procedures (such as Gaussian elimination for ordinary linear equations)
exist for specific classes of equations.
The equations for a simulation are generally difficult to analyze because their variables and
constants describe states over the entire simulation interval, and the equations themselves
can be arbitrarily complex. We may be able to obtain a set of simpler equations, however, if
we restrict the analysis to those simulations that span only a short interval. If the interval
of a simulation can be broken down into a finite number of smaller intervals such that
each interval can be computed by an effective procedure, we will have found an effective
procedure for the simulation.
5.2.1 Breaking a s imulation into smaller slices 
Any equation whose associated output has a delay $\delta$ such that $\delta \ge \sigma$ can be reduced to a
constant equation by restricting the simulation to an interval equal to $\sigma$. Let $L$ be the set of
output arcs with a non-zero delay at time $t$. Supposing $L$ is non-empty, let $\sigma$ be the smallest
non-zero delay. The states of all arcs between $t$ and $t + \sigma$ are related by the following set of
simultaneous equations (justifications to follow shortly):
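$$S_a(t, t+\sigma) = \begin{cases} \text{a constant fixed by } S_{\mathrm{src}(a)}(t, t) & \text{if } a \in L; \\ G_a(S_{\mathrm{src}(a)}(t, t),\ S_{\mathrm{inp}(\mathrm{src}(a))}(t, t+\sigma)) & \text{if } a \notin L. \end{cases}$$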
If equations like these can be solved, simulation for a system can be performed by
dividing the simulation interval into $\sigma$-wide slices, and repeatedly solving for $S_A(t, t+\sigma)$,
computing $S_E(t, t+\sigma)$, and advancing to time $t + \sigma$. Since the set of equations above covers
a slice of time, let us simply refer to it as a slice. The operation of a composite simulator
that advances one slice at a time can be described by the actions of its element simulators.
Figure 5.11 depicts the sequence of actions taken by the simulator for element $e$, whose
output arc, $a$, has a non-zero delay of $\delta$. At the beginning of the slice that starts at $t$ (Figure
5.11(a)), the simulator has progressed to $t$ and has computed $S_e(t, t)$. Since the delay for $a$
is $\delta$, $S_e(t, t)$ contains the output state description, $S_a(t, t+\delta)$.
Figure 5.11 Element-simulator operation for an element with a non-zero delay.
Since $\sigma$ is no larger than $\delta$, the equation for $S_a$ does not depend on the state of other arcs,
and the simulator can output the state description, $S_a(t, t+\sigma)$, (Figure 5.11(b)) without
any additional inputs. If the state description over the interval $(t, t+\sigma)$ can be computed
for every arc in the system, $S_{\mathrm{inp}(e)}(t, t+\sigma)$ will become available to $e$ (Figure 5.11(c)).
Since element simulators are assumed to be progressive, the simulator for $e$ will compute












Figure 5.12 Element-simulator operation for an element with a zero delay.
If the delay is zero (Figure 5.12), the simulator for $e$ does not contain any output state
description beyond the starting time of the slice (Figure 5.12(a)). The equation for $S_a$
depends on the state of other arcs, and the simulator is unable to produce $S_a(t, t+\sigma)$ until
it has received $S_{\mathrm{inp}(e)}(t, t+\sigma)$ (Figure 5.12(b)). If $e$ is not a member of a zero-delay circuit
(Corollary 5.5), $S_{\mathrm{inp}(e)}(t, t+\sigma)$ will eventually be available. When $S_e(t, t+\sigma)$ is computed,
the simulator will be ready for the next slice.
A slice that does not contain zero-delay circuits can be solved by simple variable
substitution; a slice that contains zero-delay circuits (called an obligatory slice) requires
simultaneous equation solving. A system has a progressive simulator if and only if a solution
exists for every slice of a system. If a slice has no solutions, then the slice contains a knot.
5.2.2 Slices and knots
For a system that contains only deterministic elements, a non-obligatory slice always has
exactly one solution. An obligatory slice, however, can have three possible outcomes: no
solution, one solution, and multiple solutions. All three of the outcomes can be found in
the cross-coupled zero-delay XOR-NOR circuit in Figure 5.13.
$S_{a_1}(t, t+\sigma)$: A function of the environment.
$S_{a_2}(t, t+\sigma)$: A function of the environment.
$\neg(S_{a_1}(t, t+\sigma) \lor S_{a_4}(t, t+\sigma))$
Whe n the inputs a 1 and a 2 are both 0 over the ( I , t +a) in terval, t he set of simultaneous 




not require slice resolution, in order to allow for the development of a working simulator
model.
Section 5.3 The Generic Simulator Model and Its Derivatives
Since it is sufficient to synchronize the elements through their inputs and outputs, strict
synchronization of all elements on slice boundaries is unnecessary; elements should be
allowed to progress at their own pace as their input data becomes available. Furthermore, if
b for an element is larger than a, the element does not have to stop producing output at
t + a, because it already has computed Sout(e)(t, t + b).




Figure 5.14 Representation of an arc.
If we ignore the existence of obligatory slices, we can construct a generic simulator
model using a set of multi-tape automata. We replace each arc in the system with a read
head, a write head, and a tape, such that:
1. As information is produced by the originator of the arc, the information and the
simulation time are recorded along the length of tape as the write head advances. The
recorded time strictly increases.
2. The read head recovers the recorded information and the time from the tape as it
advances.
3. Both tape heads move in one direction only, but the read head will never move past
the write head.
Since information over periods of time is written onto the tape by its source element
before being read from the tape by the destination element, element simulators are decoupled
in simulated time. The gap between a write head and a read head on the same tape is called
the slack. Since the element simulators are moved forward by consuming and producing
slack, this simulator model is called the slack-driven simulator model.
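The three tape rules can be illustrated with a minimal C sketch. It is not taken from the thesis implementation: the names Tape, Stretch, tape_write, tape_read, and tape_slack, and the fixed capacity TAPE_CAP, are hypothetical and chosen only for illustration.

```c
#include <assert.h>
#include <stddef.h>

#define TAPE_CAP 64              /* hypothetical fixed capacity for this sketch */

/* One recorded stretch of tape: a state that holds until time `until`. */
typedef struct {
    int  state;
    long until;                  /* simulated time at which the stretch ends */
} Stretch;

/* An arc: one tape with a write head and a read head. */
typedef struct {
    Stretch cell[TAPE_CAP];
    size_t  write_pos;           /* next cell the write head will fill */
    size_t  read_pos;            /* next cell the read head will deliver */
    long    written_until;       /* simulated time covered by the write head */
    long    read_until;          /* simulated time covered by the read head */
} Tape;

void tape_init(Tape *t)
{
    assert(t != NULL);
    t->write_pos = t->read_pos = 0;
    t->written_until = t->read_until = 0;
}

/* Rule 1: the write head only advances and the recorded time strictly increases. */
int tape_write(Tape *t, int state, long until)
{
    if (until <= t->written_until || t->write_pos == TAPE_CAP)
        return 0;
    t->cell[t->write_pos].state = state;
    t->cell[t->write_pos].until = until;
    t->write_pos++;
    t->written_until = until;
    return 1;
}

/* Rules 2 and 3: the read head only advances and never passes the write head. */
int tape_read(Tape *t, Stretch *out)
{
    if (t->read_pos == t->write_pos)
        return 0;                /* no slack left: the reader has to wait */
    *out = t->cell[t->read_pos++];
    t->read_until = out->until;
    return 1;
}

/* The slack is the simulated time between the read head and the write head. */
long tape_slack(const Tape *t)
{
    return t->written_until - t->read_until;
}
```

In this sketch a reader that finds no slack simply gets 0 back from tape_read; blocking and resuming the element simulator is left out.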
A slack-driven simulator is not a complete simulator because the model does not include
a mechanism to solve simultaneous equations; when a system encounters an obligatory slice
and equation-solving is required, the element simulators involved will stop. They are blocked
while waiting for each other to produce more tape; this condition is called deadlock. We
will describe, in brief, a few derivatives of the slack-driven simulator, some of which are
more permissive and some more restrictive; thus, some are more complete and some are less
complete than the slack-driven simulator.
5.3.1 Message-driven simulation
A slack-driven simulator can be expressed as a set of concurrent message-passing processes
in which the processes are the element simulators and the message streams are the tapes.
Whenever a stretch of tape is written by the slack-driven simulator, the information on
the tape is sent in a message; whenever a stretch of tape is read, the information in a
received message is read. Since the slack is represented by messages queued in transit. 




Figure 5.15 Replacing tape by messages. 
Since a message-driven simulator is an exact implementation of a slack-driven simulator,
the simulation will not make any further progress when equation-solving is required.
5.3.2 Concurrent event-driven simulation
The slack-driven simulator satisfies eventual delivery because each stretch of tape written is
immediately available to the destination process. The message-driven simulator duplicates
that property by immediately packing and sending the output information as a message,
oblivious to the value of the information content of the message. An event-driven simulation
is a modified message-driven simulation in which message traffic is reduced by classifying
messages and by treating different types of messages differently.
Messages are classified by whether they are needed at the receiving end. Messages that
are considered to be non-essential are held back with the objective of combining as many
non-essential messages as possible with the next essential message, and packaging them
all in a single entity. The total volume of messages in the simulation is reduced without
impeding the progress of the simulation. Whether a message is needed, however, depends
on the state of the simulation, and is often impossible to determine on the basis of local
information alone.
In event-driven systems, however, messages containing state transitions are more likely
to be needed than those that do not; most event-driven simulators make the classification
on that basis alone. Since the transitions are often called events, and since there is generally
one in each message for such a simulator, these simulators are called event-driven simulators.
Messages containing no events are called null messages. Event-driven simulators were first
explored by Chandy, Misra, and Bryant [13, 12], though their derivation paths are different
from ours. This exposition illustrates that null messages are a consequence of applying a 
more general model to a specific class of subjects, rather than a necessity when going from 
a sequential simulator to a distributed simulator. 
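The classification rule can be sketched in C as a sender that holds back output carrying no transition and folds it into the next message that does carry one. This is only an illustration under stated assumptions: EventMsg, EventSender, sender_init, and sender_produce are hypothetical names, and the message layout (a quiet span followed by the new state) is one possible encoding, not the one used in the simulators reported here.

```c
#include <assert.h>

/* Hypothetical message: the state was unchanged for quiet_span, then became new_state. */
typedef struct {
    long quiet_span;
    int  new_state;
} EventMsg;

/* Hypothetical per-output sender state. */
typedef struct {
    int  state;      /* state most recently announced to the receiver */
    long held_span;  /* span of unchanged state held back so far */
} EventSender;

void sender_init(EventSender *s, int initial_state)
{
    s->state = initial_state;
    s->held_span = 0;
}

/*
 * Offer a newly computed stretch of output to the sender.
 * Returns 1 and fills *msg when the stretch starts with a transition (an event);
 * returns 0 when the stretch is a null message and is merely accumulated.
 */
int sender_produce(EventSender *s, int state, long span, EventMsg *msg)
{
    assert(span > 0);
    if (state == s->state) {
        s->held_span += span;        /* null message: hold it back */
        return 0;
    }
    msg->quiet_span = s->held_span;  /* all held null messages travel with the event */
    msg->new_state = state;
    s->state = state;
    s->held_span = span;
    return 1;
}
```

As long as the output state is stable, sender_produce returns 0 and nothing reaches the receiver, which is exactly the situation in which eventual delivery is lost.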
Culling null messages, as is true with many other methods for reducing message volume,
violates the rule of eventual delivery because the rules that decide whether a message is
needed at the receiving end can fail. Without additional mechanisms to assure eventual
delivery of necessary null messages, deadlock may still occur. A ring of elements with
stable values for their cyclic outputs will fail to produce progress because each element is
waiting for its preceding element to produce a message, yet none will arrive if they send
only messages containing transitions.
delay = 3
Information waiting to be sent, containing: state = 0 from t = 3 to t = 6.
Cannot send this information because it does not contain any transitions.
Cannot produce more information because it has not received any more information.
Figure 5.16 Example of deadlock in an event-driven simulation.
5.3.3 Sequential simulator
A sequential simulator is a simple example of a backtracking simulator for event-driven
systems. If we describe it in the context of our model, a sequential simulator keeps all of
its read heads aligned during the simulation. (All read heads are initially aligned at t = 0
at the start of the simulation.) Each write head records not only the output state derived
from the element input, but also the expected output state, assuming that the element will
encounter no further input change.
If there are currently no state transitions recorded under the read heads, the sequential
simulator is free to move the read heads forward without delivering any of the state
descriptions to any elements. The state descriptions on the portion of the tapes covered
by the motion were produced on the assumption that no transition has occurred over that
period, and the assumption was shown to be valid. When a transition is encountered, the
assumption by its destination element is shown to be false and the transition is delivered
to its destination element so that a new output can be computed. Since the delay of an
element must not be negative, the tape already covered by the read heads will never have
to be revised.
In an implementation of the sequential simulator, the set of tapes is replaced by a
merged list of pending events. Each pending event represents an expected change in an
output of an element given that the input state of the element remains unchanged. Items
in the list are sorted in ascending order with respect to their time values.
The position of the read heads is kept in a single variable called the global clock. Moving
the read heads forward is accomplished by storing increasingly larger values into the global
clock as events are pulled from the list of pending events. The simulator repeatedly sets the
global clock to the time of the earliest event in the list, pulls that event from the list, and
delivers it to the destination element. All events in the list except the top-most event are
subject to revision because the assumptions of the elements that posted them - that their
inputs will remain unchanged - may now be shown to be false. The event pulled from
the top of the list will never need to be revised because the assumption of the element that
posted it is now shown to be correct. The sequence of events pulled from the list represents
the result of the simulation. 
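The pending-event list and the global clock can be sketched in C as follows. This is a minimal sketch, not the implementation used in this study: Event, EventList, list_init, post_event, and pull_event are hypothetical names, the list is a small sorted array rather than a tuned priority queue, and the revision of pending events is omitted.

```c
#include <assert.h>

#define MAX_EVENTS 32                /* hypothetical capacity for this sketch */

typedef struct {
    long time;                       /* simulated time of the expected change */
    int  dest;                       /* identity of the destination element */
    int  state;                      /* new state to deliver */
} Event;

typedef struct {
    Event ev[MAX_EVENTS];            /* kept sorted in ascending time order */
    int   count;
    long  clock;                     /* the global clock: position of all read heads */
} EventList;

void list_init(EventList *l)
{
    l->count = 0;
    l->clock = 0;
}

/* Post a pending event; an event earlier than the global clock is rejected. */
int post_event(EventList *l, long time, int dest, int state)
{
    int i;
    if (time < l->clock || l->count == MAX_EVENTS)
        return 0;
    i = l->count;
    while (i > 0 && l->ev[i - 1].time > time) {   /* shift later events up */
        l->ev[i] = l->ev[i - 1];
        i--;
    }
    l->ev[i].time = time;
    l->ev[i].dest = dest;
    l->ev[i].state = state;
    l->count++;
    return 1;
}

/* Pull the earliest event and advance the global clock to its time. */
int pull_event(EventList *l, Event *out)
{
    int i;
    if (l->count == 0)
        return 0;
    *out = l->ev[0];
    for (i = 1; i < l->count; i++)
        l->ev[i - 1] = l->ev[i];
    l->count--;
    assert(out->time >= l->clock);   /* the clock never moves backward */
    l->clock = out->time;
    return 1;
}
```

A driver loop would call pull_event repeatedly, deliver each event to the element identified by dest, and let that element post new events at or after the current clock.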
[Figure: the sorted event list, ordered by simulated time; each event entry records a time
and a destination element identity; events are delivered to element simulators such as e1
and e2.]
Figure 5.17 Model of a sequential simulator.
Suppose an obligatory slice is encountered during the simulation. If the state under
the read heads forms a self-consistent state assignment for the slice, then there will be no
events scheduled to change that assignment. The simulator will pass over the slice without
detecting it. If the state assignment is not self-consistent, there will be events that change
the state assignment. As the result of delivering such events, more events may be scheduled
for the current simulation time because some destination elements may have a zero delay.
If the intermediate state assignments eventually lead to a consistent state assignment, the
pool of events under the read head will become empty and the global clock will be allowed
to advance; if not, the simulator will be stuck processing an endless stream of events having
the same event time.
Since there is one event delivery for every transition, a sequential simulator is also
labeled event-driven; however, unlike the concurrent event-driven simulator described
previously, the sequential simulator will never deadlock. The simulator is a complete simulator.
5.3.4 Concurrent backtracking simulators 
Message-driven simulators do not backtrack, because every piece of information that each
element simulator produces is correct. Backtracking simulators produce speculative
information that can be revised when assumptions fail. In a sequential event-driven simulator,
the amount of backtracking is limited by the alignment of the read heads. Since alignment is
costly and reduces concurrency, concurrent backtracking simulators do not align read heads.
The element simulators are allowed to produce outputs and to consume inputs according
to their own heuristics and assumptions. When those assumptions are shown to be wrong,
they have to restart the simulation from the point where the computation went wrong by
backing up the write heads to discard erroneous information.
When a write head needs to be moved back behind a read head, the destination element
of the read head has already consumed the false inputs and may have produced its state
and output based on them; it too must be rolled back. In order to roll back to the time at
which the input becomes invalid, the element simulator has to store a sequence of past states
in addition to its current state.
Not all of the past state needs to be stored, however. In the Time Warp simulator of
David Jefferson [14], a behind-the-scenes mechanism called the global virtual time is used
to compute concurrently the lower bound of time for which rollback may still occur. The
global virtual time attempts to keep track of the minimum time of all events and elements
in the simulation. Any saved state with a time value less than the global virtual time can
be discarded, because no element will ever roll back to an earlier time.
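The discard rule can be sketched in C. The sketch makes a strong simplifying assumption: the global virtual time is computed here as a plain minimum over an array of times gathered in one place, whereas Time Warp computes it concurrently. The names SavedState, global_virtual_time, and discard_before are hypothetical.

```c
#include <assert.h>

/* Hypothetical record of one saved past state of an element. */
typedef struct {
    long time;       /* simulated time at which the state was saved */
    int  state;
} SavedState;

/* Lower bound for rollback: the minimum over all event and element times supplied. */
long global_virtual_time(const long *times, int n)
{
    long gvt;
    int i;
    assert(n > 0);
    gvt = times[0];
    for (i = 1; i < n; i++)
        if (times[i] < gvt)
            gvt = times[i];
    return gvt;
}

/* Drop every saved state older than gvt; returns the number of states kept. */
int discard_before(SavedState *saved, int n, long gvt)
{
    int i, kept = 0;
    for (i = 0; i < n; i++)
        if (saved[i].time >= gvt)
            saved[kept++] = saved[i];
    return kept;
}
```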
The advantage of a backtracking simulator is that when a processor of the machine is
otherwise idle, spare cycles can be used for speculative computing. Since this simulator must
keep a record of past states for the elements, the concurrent backtracking simulator trades
off space for speed by using larger processing nodes than would otherwise be necessary.
Concurrent backtracking simulators are complete simulators, and they handle obligatory
slices the same way as do sequential simulators. When one is encountered, and if the
state assignment of the elements involved is already self-consistent, the simulator moves
ahead without detecting it. If the state assignment is not self-consistent, some of the
elements involved will be rolled back to the starting time of the slice, and perhaps some more
after that. The flurry of rollbacks ends when a self-consistent state is achieved.
5.3.5 Branch-and-bound simulators
If a backtracking simulator is likened to a depth-first search, then its breadth-first equivalent
resembles a branch-and-bound simulator. This is one that trades off space for speed by using
more processing nodes (rather than larger nodes) than would otherwise be necessary.
Suppose an element simulator computes to a point where its output can take on one of
several states, depending on some inputs that have not yet arrived. Instead of producing a
speculative output as would a backtracking simulator, the element simulator will, in effect,
fork the simulation into a set of concurrent branches to follow each of the possibilities. In
each branch, when the decisive input has finally arrived, should the input not match the
assumption for a branch, then the branch will be terminated (bound).
Researcher --> Agency 1 --> Agency 2 -->
Figure 5.18 A researcher submitting a grant.
For comparison, suppose that a research grant request has to be approved in tandem
by two government agencies. The first agency spends a long time classifying the grant into
one of three classes, A, B, or C. The second agency spends a long time deciding whether
the grant will be accepted based on the classification and the available funding for each
class. A researcher submitting a grant can be represented by the system in Figure 5.18.
message-driven simulator:
petition
backtracking simulator:
inconsistency detected, rollback needed
branch-and-bound simulator:
assume A
assume B
assume C
Figure 5.19 Comparison between three simulators.
In a message-driven simulator, only one agency simulator can be active at any one time.
The time it takes to simulate the approval of the grant is equal to the sum of the time taken
in each agency, because the operation is sequential. In a backtracking simulator, while the
simulator for the first agency is working, the second agency can choose and pursue one
but only one of the three possible outcomes produced by the first agency. In a
branch-and-bound simulator, three copies of the simulation are produced, each pursuing one of the
three possibilities.
A branch-and-bound simulator is also a complete simulator. If there are any no-solution
slices, all branches will be terminated and none will remain at the end. If there are any
multi-solution slices, but no no-solution slices, more than one set of simulations will remain
at the end, and each will correspond to one possible outcome. If there are only
single-solution slices, then exactly one set will remain. The simulator will fail, however, if the
number of solutions is unbounded, because the computing resource is bounded.
The branch-and-bound simulator is the only interesting type of distributed simulator
that, so far as we know, is still to be explored. Efficient algorithms to fork and terminate
the simulator may provide hope for the simulation of systems with very little intrinsic
parallelism, and whose grain size is too small or whose behavior too unpredictable for
rollback to be profitable.
5.3.6 Time-driven simulators
Thus far, we have discussed simulators that resolve slices by trial-and-error (backtracking)
and by exploring all possibilities (branch-and-bound). In both methods, each element
simulator needs only local information for progress. Neither method is appropriate, however,
when the number of possibilities that must be explored is infinite. Exact simulation of such
a system may require solving simultaneous equations analytically. When the equations can
be solved, they yield functions of time, reducing the simulation to a simple task of
function evaluation. When an analytical solution is inappropriate or difficult to find, empirical
approximations must be used.
An example of such a system is an electrical circuit. In the system in Figure 5.20, the
voltage across a capacitor is the integral of the current through the capacitor; the current,






A physical system ... and its logical representation
Figure 5.20 An example of a continuous system.
The equations:
$V = \int I \, dt$
$I = i(V)$
In order to simulate this kind of system, we need to find a replacement system that is
discrete but that will either approximate the behavior of the target system or converge to
the final state of the target system. The usual method of building a simulator for such a
system is to divide the simulation interval into a sequence of small slices. We then assume
that information exchange takes place only at the boundaries of these slices, and information
about the others can be accurately extrapolated between the boundaries.
For example, when integration of a continuous function is involved, discrete methods,
such as taking the Riemann sum, can be used to approximate the integral of the function.
Although discrete integration is seldom exact, we can get increasingly better approximations
by reducing the size of the slices; when the size is reduced, the Riemann sum approaches
the integral. However, due to accumulated numerical errors, the simulation may eventually
diverge and produce an output that is valid only for a limited span of simulated time.
Simulators of this type are called time-driven simulators because they are moved forward
at one time slice per step. Simulators of this type are also complete. 
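A time-driven step for the capacitor example can be sketched in C with a left Riemann sum. The current law used here, i(v) = 1 - v (a 1 V source charging a unit capacitor through a unit resistor), is an assumption made only so that the result can be compared with the known solution 1 - exp(-t); the names current and simulate_voltage are hypothetical.

```c
#include <assert.h>

/* Hypothetical current law i(v): a 1 V source charging through a unit resistor. */
static double current(double v)
{
    return 1.0 - v;
}

/*
 * Time-driven simulation of V = integral of I dt with I = i(V), unit capacitance.
 * The interval [0, t_end] is divided into `steps` slices, and the state is
 * exchanged only at the slice boundaries (left Riemann sum).
 */
double simulate_voltage(double t_end, int steps)
{
    double dt, v = 0.0;
    int k;
    assert(steps > 0);
    dt = t_end / steps;
    for (k = 0; k < steps; k++)
        v += current(v) * dt;    /* one time slice per step */
    return v;
}
```

With t_end = 1, the exact voltage is 1 - exp(-1), about 0.632; a run with 1000 slices lands much closer to that value than a run with 10 slices, which illustrates how reducing the slice size improves the approximation.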
5.3.7 Summary
The slack-driven simulator is a generic simulator model that covers a large array of existing
and hypothetical simulators. Simulators that perform computation on speculation, such as
the concurrent-rollback simulator, are called optimistic simulators. Simulators that produce
no output other than that implied by the input are called conservative simulators. We will
concentrate on the message-driven simulator, which is a conservative simulator.
We are particularly interested in the characteristics of the simulator itself, not those of a
simulator plus any system it simulates. Thus, we have chosen the most revealing simulation
subject, devised a series of conservative simulators, and reported in the following chapters
the results obtained.
Chapter 6 Logic-Circuit Simulator Experiments
A Boolean network is a network of Boolean logic gates connected such that each input is
driven from the output of another gate or from an input to the network. A logic circuit
is a Boolean network that includes a notion of time: Each logic element in the network is
assigned a positive value called the delay of the element. The input and output states of
the gates are time-variant. If F is the Boolean function of a logic gate whose delay is δ,
then the input state, I, and output state, O, are related by the equation:
$O(t + \delta) = F(I(t))$
Thus, unlike a Boolean network, which has a static value that is computed by solving a
set of simultaneous equations, a logic circuit can have time-dependent behaviors, such as
memory and oscillation. Simulation is a way of computing the behavior of a logic circuit.
Figure 6.1 A logic circuit whose behavior is different from its Boolean network.
The Boolean network in Figure 6.1 can be described by the equation x = NOT(x),
which does not have a solution. As a logic circuit, however, the network is an oscillator.
Although the input-output relationship of a logic circuit when it does reach a stable state is
consistent with the corresponding Boolean network, our definition of a logic circuit simulator
is one that reproduces the behavior of a logic circuit rather than one that solves for a stable
state. The other definition is used by simulators such as MOSSIM [19], which simulates
and verifies digital integrated circuits. 
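The oscillation of the self-connected inverter follows directly from the delay equation with F = NOT and the output fed back to the input. A minimal C sketch, assuming integer time steps and an initial state that holds for the first delay period (the function name simulate_inverter_loop is hypothetical):

```c
#include <assert.h>

/*
 * Fill wave[0..n-1] with x(t) for an inverter whose output drives its own input:
 * x(t + delay) = NOT x(t), with x(t) = initial for 0 <= t < delay.
 */
void simulate_inverter_loop(int *wave, int n, int delay, int initial)
{
    int t;
    assert(delay > 0);
    for (t = 0; t < n; t++)
        wave[t] = (t < delay) ? initial : !wave[t - delay];
}
```

With a delay of 2 and an initial state of 0, the first eight samples are 0 0 1 1 0 0 1 1: the circuit oscillates with a period of twice the delay, although x = NOT(x) has no static solution.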
Most existing circuits found in computers and other digital systems belong to a class of
circuits called clocked logic circuits. Clocked logic circuits are very well suited for the
stable-state-solving form of simulation, because they are designed to reach a stable state during
each clock cycle, and because only the final state of a clock cycle is needed to determine
the future state. The exact sequence and timing of transitions that lead to a stable state
are usually not important; only the final stable state of the circuit is important. Such
simulators, however, will not work very well for the unclocked, or self-timed, logic circuits.
Section 6.1 Why Logic Circuits? 
We study logic-circuit simulation because it stresses a distributed simulator, and is itself of
practical interest. It is easy to construct examples of logic circuits with diverse behaviors,
structural difficulties such as large fan-in and fan-out, and highly non-uniform activity levels.
Simple logic gates exhibit responses in which an input event may or may not influence the
outputs, depending on the internal state of the element and on the states of other inputs;
yet, they require very little computation to simulate their behavior. Thus, the performance
results shown later involve practically no computation other than the distributed simulation
itself. They are, therefore, uncluttered studies of how well the simulator itself performs.
A number of related simulators, each supporting an array of different simulation modes,
have been written during the course of this study. These simulators run on multicomputers,
such as the Cosmic Cube, Intel iPSC, and Symult 2010. Since they are written to run
under the Cosmic Environment, they can be compiled for all of these machines without
modification. The historical relationship between these simulators is shown in Figure 6.2.













Figure 6.2 A number of circuit simulators and their relationship.
Of the five simulators shown, results obtained on three of them - the CMB-variant, the
coordinated-sequential, and the progressive-hybrid simulators - are of interest. The
sequential simulator and the pruned-CMB-variant are used for comparison only. The
pruned-CMB-variant simulator will not be discussed.
The CMB-variant simulator is a straightforward implementation of the generic simu-
lator in which the basic unit of information transfer is a block of state description over a 
time interval. The CMB-variant simulator shows excellent speedup as the number of nodes 
is increased, but, since it is totally oblivious to the content and effect of its information 
carriers, much of the work it has to do can be eliminated when an event-driven system is 
simulated on one node using a sequential simulator. However, sequential simulators
cannot be readily distributed, and they cannot, in their original form, benefit from the use of
multicomputers. 
The three succeeding simulators attempt to combine the advantages of sequential and 
distributed simulators. The pruned-CMB-variant simulator is a CMB-variant simulator
with sequential simulation mechanisms added. The coordinated-sequential simulator is a
sequential simulator with CMB-variant mechanisms added. The progressive-hybrid
simulator is the final merger of the two. In the following sections, we will describe each of these
simulators in their chronological order. 
Section 6.2 CMB-Variant Simulator
The CMB-variant simulator for logic circuits is a proof of concept for the generic simulator
model described in Chapter 5. Since this is a demonstration of a generic model, in order
to cover the greatest range of possible simulation subjects, special but useful properties
of logic circuits have been ignored in building this simulator. In particular, the simulator
ignores the fact that logic circuits are event-driven systems. We will discuss such systems
in greater detail when we compare the result of this simulator to ones that do make use of
the event-driven properties.




"-- domain of event-driven simulators 
Figure 6.3 Domain of the generic simulator model. 
The tape-writing and -reading processes in the generic simulator model are replaced by
message-sending and -receiving processes in the CMB-variant simulator. These are
lightweight, reactive processes, and the simulator is a reactive kernel for the reactive processes.
As in a usual reactive-process program, the distribution of the simulation task on a
multicomputer is accomplished by partitioning the set of reactive processes across a set of reactive
kernels that run on a multicomputer.
We will present a simplified description of the CMB-variant simulator; the actual
implementation contains extensive measurement setups and programming short-cuts that are
inappropriate to report here. The simulator presented, however, is functionally correct,
expresses the same principle as does the actual implementation, and is easier to understand.
6.2.1 The element simulators 
First of all, a reactive process is represented by two pointers: the entry-function pointer and
the data pointer. The entry-function pointer always contains the reference to the function
that handles the next message for the process, but the data pointer can hold any private
data structures needed by the process. For an element simulator, the private data may
include one data structure for each of the element's outputs. An output data structure
contains the references to all inputs to which it connects. Each input reference contains the
ID of the element that owns it and the index that identifies the input within the element.
One output structure can contain more than one reference, because an output can connect
to more than one input.
The private data may also include one input data structure for each of the element's
inputs. Each input data structure contains the ID of the process and the identity of the









output structure
input structure
Figure 6.4 Process structure and a simple example of connectivity.
We may need a variable-sized message format to describe a piece of tape recording,
because the information on the tape can be arbitrarily complex. In the interest of simplicity,
however, we choose to represent each tape recording with more than one simple, fixed-sized
message. We will call the structure a STATE_FRAGMENT. We use the name fragment to
contrast it with the name event used in the study of traditional event-driven simulation
systems, and to convey the fact that every entity is a fragment of a continuum that can be
merged with adjacent entities and sliced into arbitrarily many entities.
The essential fields of a fragment are shown in Listing 6.1. When a fragment is received
by a process, the input_id field identifies the element input to receive the fragment. The



















/* Index of the input at the dest element. */
/* State contained in this fragment. */
/* Duration of this fragment. */
/* Pointer to make a linked list of fragments. */
Listing 6.1 Structure of a FRAGMENT
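A fragment structure consistent with the four field descriptions above might look like the following C sketch. The field names input_id, state, and span are the ones used in the text; the field types and the name next are assumptions. The two helper functions, split_fragment and merge_with_next, are hypothetical and only illustrate that a fragment can be sliced and merged with an adjacent fragment without losing information.

```c
#include <assert.h>
#include <stddef.h>

typedef struct state_fragment {
    int  input_id;                   /* index of the input at the destination element */
    int  state;                      /* state contained in this fragment */
    long span;                       /* duration of this fragment (type assumed) */
    struct state_fragment *next;     /* linked-list pointer (name assumed) */
} STATE_FRAGMENT;

/*
 * Slice f at offset `at` (0 < at < f->span): f keeps the first `at` units and
 * `rest`, supplied by the caller, receives the remainder and is linked after f.
 */
void split_fragment(STATE_FRAGMENT *f, long at, STATE_FRAGMENT *rest)
{
    assert(at > 0 && at < f->span);
    rest->input_id = f->input_id;
    rest->state = f->state;
    rest->span = f->span - at;
    rest->next = f->next;
    f->span = at;
    f->next = rest;
}

/* Merge f with its successor when both describe the same state on the same input. */
int merge_with_next(STATE_FRAGMENT *f)
{
    STATE_FRAGMENT *n = f->next;
    if (n == NULL || n->state != f->state || n->input_id != f->input_id)
        return 0;
    f->span += n->span;
    f->next = n->next;
    return 1;
}
```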
When a piece of tape is to be written by an element in the generic simulator model,
the corresponding process in the CMB-variant simulator produces one fragment or a stream
of several fragments to carry the information recorded on the tape. When a fragment has
arrived at its destination, the entry function of the destination process is called to accept
the fragment. It is worth noting that reactive-process programming systems are themselves
event-driven systems whose inputs are fragments. Thus, the simulator is always an
event-driven system, even though the system it simulates may not be.
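inverter_entry(pp,sb)
PROCESS *pp;            /* process structure of this inverter */
STATE_FRAGMENT *sb;     /* fragment received on the inverter input */
{
    /* Send a fragment of the same span, with the complementary state, on output 0. */
    OUTPUT(pp,0,!sb->state,sb->span);
    free_fragment(sb);  /* the input fragment has been consumed */
}
Listing 6.2 An inverter in a CMB-variant simulator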
Listing 6.2 contains a sample entry function for an inverter element. As in an ordinary
reactive process, the two parameters to its entry function are the process structure and the
input message. When called, the entry function simply outputs another fragment of the
same length, but with a complementary state value. The delay of the inverter is equal to
the difference between the amount of fragments produced and the amount of fragments
consumed. Such differences are set up during initialization by producing one fragment for
each output of every gate, such that each fragment has a span that equals the delay of its
output.
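The way the initial fragment establishes the delay can be traced with a small C sketch. It only keeps count of the spans consumed and produced by one inverter; InverterTrace, inverter_init, inverter_accept, and inverter_lead are hypothetical names, and no fragments are actually sent.

```c
#include <assert.h>

/* Hypothetical bookkeeping for one inverter, for illustration only. */
typedef struct {
    long consumed;    /* total span of input fragments consumed so far */
    long produced;    /* total span of output fragments produced so far */
    int  last_state;  /* state carried by the most recent output fragment */
} InverterTrace;

/* Initialization: one output fragment whose span equals the delay of the output. */
void inverter_init(InverterTrace *iv, long delay, int initial_state)
{
    assert(delay > 0);
    iv->consumed = 0;
    iv->produced = delay;
    iv->last_state = initial_state;
}

/* Entry step: consume one fragment and produce a complementary one of the same span. */
void inverter_accept(InverterTrace *iv, int state, long span)
{
    assert(span > 0);
    iv->consumed += span;
    iv->produced += span;
    iv->last_state = !state;
}

/* The amount produced minus the amount consumed stays equal to the delay. */
long inverter_lead(const InverterTrace *iv)
{
    return iv->produced - iv->consumed;
}
```

Because every later step produces exactly as much span as it consumes, the output always runs ahead of the input by the span of the initial fragment, which is the delay.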
The OUTPUT function takes four parameters. The first two parameters are the process
structure and an index that identifies an output of the element. The function needs these
two parameters in order to access the list of destination input references for the output
fragments. The next two parameters describe the state and the span of the fragment. In
this example, there is only one output for the inverter, and its output index is 0. The state
of the new fragment is the complement of the state contained in sb->state and the length
of the fragment is the same as sb->span.
Since an inverter has only one input, it does not have to check the input_id of the
fragments it receives, and it can immediately process any fragments it receives without
waiting for other fragments to arrive. For a gate with more than one input, however, it
usually has to differentiate the fragments it receives. Listing 6.3 contains a sample entry

















int out_span, out_state; 
QUEUE_FRAGMENT(pp,sb); 




( Q_HEAD(pp,0)->state ^ Q_HEAD(pp,1)->state );






Listing 6.3 An XOR-gate in a CMB-variant simulator. 
In a two-input XOR-gate, both of the inputs must have at least one fragment present
before the gate can produce output fragments. The gate must therefore maintain a fragment
queue for each of its input structures. When a fragment is received, the entry function can
check the queues before deciding whether the fragment needs to be queued; but, in the
interest of simplicity, the function always queues the fragment (7). The QUEUE_FRAGMENT
function puts the fragment sb into an input queue of pp according to sb->input_id.
The Q_EMPTY function returns TRUE if the specified input queue for the process pp is
empty. While both queues are non-empty (9), a length of fragment is removed from each
queue to produce an output fragment. The state of the output fragment is equal to the
exclusive-or of the states of the fragments to be removed (11). The length of the output
fragment (and of each fragment to be removed) equals the length of the shorter fragment
at the head of the queues (12). The Q_HEAD function returns a pointer to the first fragment
in the specified queue.
The outp ut of the exclusi\·e-or gate remains the same as long as both inputs remain 
unchanged. The length of the shorter fragment is the length of t ime both inputs are known 
to re main unchanged. When fragments a.re consumed , output is produced ( 14) . and a length 
eq ua l to the length of the output fragment is trimmed from both queues ( 16,17). 
The loop repeats until o ne o f the qu e ues becomes empty a nd the gate can no lo nger 
produce any a dditional output fragments from its queues. The inverter and the XOR- gate are 
simple because they are both strict: ie, they do not have any partial input-state assignment 
such that the state of the outputs is not influenced by the state assignment of the remaining 
inpu ts . 
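The XOR loop described above can be sketched in self-contained C as follows. This is only an illustration reconstructed from the prose, not the thesis listing: the array-based FragQueue type and the helpers fq_push, fq_trim, and xor_gate are hypothetical names introduced here. Only the rule itself (output state is the exclusive-or of the head states, output span is the shorter head span, and both queues are trimmed by that span) is taken from the text.

```c
#include <stddef.h>

/* A fragment: the signal holds `state` for `span` time units. */
typedef struct { int state; int span; } Frag;

/* Hypothetical fixed-capacity FIFO of fragments (the thesis uses a linked queue). */
#define FQ_CAP 16
typedef struct { Frag f[FQ_CAP]; size_t head, count; } FragQueue;

static int fq_empty(const FragQueue *q) { return q->count == 0; }

static void fq_push(FragQueue *q, int state, int span)
{
    if (q->count < FQ_CAP) {
        q->f[(q->head + q->count) % FQ_CAP] = (Frag){ state, span };
        q->count++;
    }
}

/* Remove `span` time units from the head fragment.
   The caller guarantees span <= span of the head fragment. */
static void fq_trim(FragQueue *q, int span)
{
    Frag *h = &q->f[q->head];
    h->span -= span;
    if (h->span == 0) {
        q->head = (q->head + 1) % FQ_CAP;
        q->count--;
    }
}

/* Produce output fragments while both inputs have data.
   Returns the number of fragments written to `out`. */
static size_t xor_gate(FragQueue *a, FragQueue *b, Frag *out, size_t max_out)
{
    size_t n = 0;
    while (!fq_empty(a) && !fq_empty(b) && n < max_out) {
        const Frag *ha = &a->f[a->head];
        const Frag *hb = &b->f[b->head];
        int span = ha->span < hb->span ? ha->span : hb->span;
        out[n].state = ha->state ^ hb->state;  /* exclusive-or of the states */
        out[n].span  = span;                   /* shorter of the two heads   */
        n++;
        fq_trim(a, span);
        fq_trim(b, span);
    }
    return n;
}
```

With one input holding a single fragment (1, span 10) and the other holding (0, span 4) followed by (1, span 3), the loop emits two fragments and leaves three time units of the first input waiting for more data on the second.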
An OR-gate, on the other hand, is non-strict: If any of the inputs is 1, its output will be
1, regardless of the state of its other inputs. An OR-gate can therefore continue to produce
fragments in some situations where not all of its inputs are available. Listing 6.4 contains
a sample entry function for an OR-gate:
1 or_entry(pp,sb)
2 PROCESS *pp; 
3 STATE_FRAGMENT *sb; 
4 { 








if(!Q_EMPTY(pp,0) && (Q_HEAD(pp,0)->state == TRUE))
{
out_state = TRUE;


























if( !Q_EMPTY(pp,0) && !Q_EMPTY(pp,1))
{
out_state = ( Q_HEAD(pp,0)->state
out_span = MIN ( Q_HEAD(pp,0)->span
} else break;
TRIM_QUEUE(pp,0, out_span);




Listing 6.4 An OR-gate in a CMB-variant simulator.
When the process receives a fragment, it is added to the queue, as in the case of the
XOR-gate. But then, instead of checking both of the queues for fragments, the function
checks first for possible non-strict input conditions. Lines 11-16 check the input whose index
is 0; lines 18-23 check the input whose index is 1. If a fragment for an input is available
and its state is TRUE, then a non-strict input condition exists. The new output fragment
is specified to have a state value of TRUE and a span equal to the span of the fragment in
the queue. The function then continues to line 32, where fragments are trimmed from the
queues and an output fragment is produced. If no non-strict conditions have been detected,
the process will compute and produce fragments in the same manner as the XOR process
(26-28).
When a non-strict condition is detected on one input, the queues in both of the inputs
are trimmed (32-33) because the state of the other input does not matter. However, it is
possible that the queue for the other input is empty or does not contain enough fragments
to cover the amount to be trimmed. In this case, the trimming extends to fragments that
have not yet arrived. The process must therefore record the deficit incurred and deduct it
from fragments that arrive later.
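The decision the OR-gate makes on each pass can be condensed into a small function. The following is a sketch under stated assumptions, not the thesis code: the names or_next and trim_deficit are hypothetical, and the head of an empty queue is represented here by a NULL pointer.

```c
#include <stddef.h>

/* A fragment: the signal holds `state` for `span` time units. */
typedef struct { int state; int span; } Frag;

/* h0/h1: head fragment of each input queue, or NULL when that queue is empty.
   Returns 1 and fills *out when an output fragment can be produced, else 0. */
static int or_next(const Frag *h0, const Frag *h1, Frag *out)
{
    if (h0 && h0->state) {          /* non-strict condition on input 0 */
        out->state = 1;
        out->span  = h0->span;
        return 1;
    }
    if (h1 && h1->state) {          /* non-strict condition on input 1 */
        out->state = 1;
        out->span  = h1->span;
        return 1;
    }
    if (h0 && h1) {                 /* strict case: both inputs are known */
        out->state = h0->state | h1->state;
        out->span  = h0->span < h1->span ? h0->span : h1->span;
        return 1;
    }
    return 0;                       /* must wait for more input */
}

/* Deficit incurred when `span` time units are trimmed from a queue
   that currently holds only `available` time units of fragments. */
static int trim_deficit(int available, int span)
{
    return span > available ? span - available : 0;
}
```

A TRUE fragment on one input yields output even when the other queue is empty, in which case the whole output span becomes a deficit on the empty queue.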
typedef struct { int     delay;            /* Delay of the element. */
                 I_DATA *inpq;             /* One per gate input. */
                 O_DATA *outq; } ELEMENT;  /* One per gate output. */
typedef struct { STATE_FRAGMENT *qh;       /* Points to top. */
                 STATE_FRAGMENT *qt;       /* Points to bottom. */
                 int deficit; } I_DATA;    /* Deficit of the queue. */
The details for the process are complete; we are ready to show the essential mechanisms
that support the processes. The process structure contains an entry function; an array of
input data structures, one for each element input; and an array of output data structures,
one for each element output. These data structures are set up during initialization. The
input structure contains the deficit count and a pair of queue pointers, one for the head of
the queue and one for the tail.
1 QUEUE_FRAGMENT(pp,sb) 
2 PROCESS *pp; 
3 STATE_FRAGMENT *sb; 
{ 
5 I_DATA *Q; 
7 Q = ((ELEMENT *)(pp->data))->inpq + sb->input_id; 
9 if(Q->deficit) 
{ 
11 if(sb->span <= Q->deficit) { Q->deficit -= sb->span;
12 free_fragment(sb); return; }
13 else { sb->span -= Q->deficit;











if(sb->state == Q->qt->state) { Q->qt->span += sb->span;
free_fragment(sb); return; }
else { Q->qt = Q->qt->next = sb;




Q->qh = Q->qt = sb;
sb->next = 0;
Listing 6.5 CMB-variant QUEUE_FRAGMENT function.
The QUEUE_FRAGMENT function adds the fragment, sb, to the (sb->input_id)th input
queue of the process pp. It checks first for the deficit (9). If a deficit exists, the span of
the fragment is used to satisfy the deficit; if the fragment is totally consumed (11-12), the
function returns. Otherwise, the balance is advanced to the next step, where fragments are
added to the queue (11). If there are already other fragments in the queue (11), and if
the last fragment has the same state as the new fragment (19), the two are simply merged
(19-20). Otherwise, the fragment is linked into the queue (21-22, 25-26).
TRIM_FRAGMENT(pp,id,debit)























Q = ((ELEMENT *)(pp->data))->inpq + id; 
while(debit && Q->qh) 
{ 
if(Q->qh->span > debit) { Q->qh->span -= debit;
debit = 0; }





Q->deficit += debit; 
Listing 6.6 CMB-variant TRIM_FRAGMENT function.
The TRIM_FRAGMENT function removes debit amount of fragments from the id-th input
queue of the process pp. As long as there are more fragments in the queue, the spans of
as many fragments as necessary, taken from the head of the queue, are used to satisfy the
debit. Any remaining debit is added to the deficit of the queue.
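The interplay between queueing, merging, trimming, and the deficit can be seen in one place in the following sketch. It is a modernized illustration written for this purpose, not a transcription of Listings 6.5 and 6.6: the names InputQueue, queue_fragment, and trim_fragment are assumptions, fragments are allocated inside the queue function rather than passed in, and the process and element structures are omitted.

```c
#include <stdlib.h>

typedef struct Frag { int state, span; struct Frag *next; } Frag;

/* One input queue: head, tail, and the deficit still owed by future fragments. */
typedef struct { Frag *qh, *qt; int deficit; } InputQueue;

/* Add a fragment (state, span) to the queue, paying off any deficit first. */
static void queue_fragment(InputQueue *q, int state, int span)
{
    if (q->deficit) {
        if (span <= q->deficit) {       /* fragment totally consumed */
            q->deficit -= span;
            return;
        }
        span -= q->deficit;             /* only the balance is queued */
        q->deficit = 0;
    }
    if (q->qt && q->qt->state == state) {   /* same state: merge with the tail */
        q->qt->span += span;
        return;
    }
    Frag *f = malloc(sizeof *f);
    if (!f) abort();
    f->state = state;
    f->span  = span;
    f->next  = NULL;
    if (q->qt) q->qt->next = f; else q->qh = f;
    q->qt = f;
}

/* Remove `debit` time units from the head of the queue;
   whatever cannot be satisfied is recorded as a deficit. */
static void trim_fragment(InputQueue *q, int debit)
{
    while (debit && q->qh) {
        if (q->qh->span > debit) {
            q->qh->span -= debit;
            debit = 0;
        } else {
            Frag *old = q->qh;
            debit -= old->span;
            q->qh = old->next;
            if (!q->qh) q->qt = NULL;
            free(old);
        }
    }
    q->deficit += debit;
}
```

Trimming 10 time units from a queue that holds only 8 leaves a deficit of 2, so a later fragment of span 2 is absorbed without ever being linked into the queue.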
6.2.2 The simulator message system 
The list of references and indices for each output structure described above represents a
one-level tree. The root of the tree is the sending process and the leaves of the tree are
the receiving processes. The job of the OUTPUT function is simple enough: it allocates a
fragment for each leaf process and sends it along the branch that leads to the process. In
such a simulator, however, gates with a large fan-out, such as a clock driver, may have to
send the same information to the same destination computing node many times.
Because messages between computing nodes are usually more expensive than messages
within the same computing node, we reduce the internode messages by organizing the tree
as a two-level tree. The intermediate tree nodes are a set of input port processes, one for
each computing node that contains a destination process. An output sends its fragment to
its input ports, and an input port duplicates and forwards the fragment to the destination
processes in its own computing node.
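The saving from the two-level tree can be made concrete by counting internode messages per output fragment. The sketch below is illustrative only; the function names and the representation of the fan-out as an array of destination node numbers are assumptions made here, not part of the simulator.

```c
#include <stddef.h>

/* dest_node[i] is the computing node that holds the i-th destination process. */

/* One-level tree: one message per destination process on another node. */
static size_t internode_msgs_one_level(const int *dest_node, size_t n, int src_node)
{
    size_t msgs = 0;
    for (size_t i = 0; i < n; i++)
        if (dest_node[i] != src_node)
            msgs++;
    return msgs;
}

/* Two-level tree: one message per distinct remote node (its input port),
   which then forwards the fragment locally. */
static size_t internode_msgs_two_level(const int *dest_node, size_t n, int src_node)
{
    size_t msgs = 0;
    for (size_t i = 0; i < n; i++) {
        if (dest_node[i] == src_node)
            continue;
        int seen = 0;
        for (size_t j = 0; j < i; j++)
            if (dest_node[j] == dest_node[i]) { seen = 1; break; }
        if (!seen)
            msgs++;
    }
    return msgs;
}
```

For a gate on node 0 whose six destinations sit on nodes 1, 1, 1, 2, 2, and 0, the one-level tree needs five internode messages per fragment and the two-level tree needs two.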
[Figure: circuit elements grouped into nodes; legend: output port, input port]
Figure 6.5 A sample circuit and a possible mapping to a multicomputer.
Many mechanisms can be added to the output structure for a more efficient
simulator, and such mechanisms account for the majority of the differences between the
actual implementation and this description. Here we will present a simple OUTPUT function
that converts fragments into messages that are immediately sent.
typedef struct { int  count;               /* Number of siblings. */
                 int *node;                /* Dest process's node. */
                 int *pid2;                /* Dest process's pid2. */
                 int *input_id; } O_DATA;  /* Dest process's input. */
The output data structure contains the number of ports connected and a list of
references to those ports. A reference for a process in the simulator contains the node and
the pid of the destination simulator process. It also contains a pid2, because the element
processes are embedded in the simulator by reactive-handler layering. Only the node and
the pid2 need to be stored in the output structure, because in our implementation there is
only one simulator process for every node, and all of them have the same fixed pid. Listing
6.7 contains a sample OUTPUT function:
1 OUTPUT(pp,id,state,span)















11 op = ((ELEMENT *)(pp->data))->outq + id;













sb->span = span;
s_send(msg, op->node[j], op->pid2[j]);
Listing 6.7 CMB-variant OUTPUT function.
The OUTPUT function allocates a fragment for each branch of the tree (15), initializes
it with the input index of the destination input (16), sets the state and span (17-18), and
sends the fragment (19). The s_send function is a layered message function that sends
the message to another process in the simulator. If a two-level tree structure is used, each
fragment goes to an input port process that is identical to the inverter process except that
the state is not inverted (a buffer process). The main function for the simulator is identical
to that of a reactive kernel:




typedef struct { int  (*entry)();
                 char *data; } PROCESS;
typedef struct { int  pid2;
                 char msg_body[]; } MESSAGE;
simulator_main_loop()
{
PROCESS *proc;









mesg = (MESSAGE*) xrecvb(); 
proc = process_table + mesg->pid2; 
(*proc->entry)(proc , mesg->msg_body); 
Listing 6.8 CMB-variant main loop.
This is the end of our description of a simple, distributed simulator derived directly
from the generic simulator model. The description is complete except for the storage
allocation/de-allocation mechanisms, the initialization/termination mechanisms, and the
result-recording mechanisms.
6.2.3 The variants 
Although this simulator exhibits excellent performance for some cases, much can be done
to improve its performance for difficult cases. The number of actual messages, for example,
can be reduced in a logic circuit simulation by using a more elaborate OUTPUT function. In
particular, if message sending is deferred by putting fragments into output-holding queues,
the opportunity to merge multiple fragments into a single message increases. When two
successive fragments with the same state are put into the same holding queue, the two can
be merged into a fragment with a larger span, saving both space and handling time. Even
if they cannot merge, multiple fragments can be concatenated onto a single, longer message
to share the per-message overhead.
If sending is deferred forever, however, the simulator will fail to make any progress.
Good efficiency can be achieved with a proper balance of message deferral and message
sending. Before we devised and evaluated a number of flow control methods, there were
two methods that represented the two extremes of possibilities: the two original CMB
methods. (Hence, our methods are called variants.) In the deadlock-avoidance method, no
fragments are deferred and deadlock does not occur. In the deadlock-detection method, no
message is sent until the simulation runs into a deadlock, or unless the output-holding queue
contains an event. A deadlock-detection mechanism running concurrently in the simulator
message system detects the deadlock and forces deferred messages to be sent.
We generally call those methods that are more likely to send messages eager methods,
and those that are less likely to send messages lazy methods. Thus, the deadlock-avoidance
method is at the eager end of the spectrum, and the deadlock-detection method is at the lazy
end. To explore the middle ground, we needed to hold back messages by some criteria we
could select, but in order to prevent deadlock detection from dominating the timing, we
needed a cheaper way of ensuring progress than by using standard deadlock detection.
When simulator processes defer sending output messages, they may cyclically deny
themselves input messages, leading to deadlock. However, deadlock implies that some node
has an empty input-message queue. Since the emptiness of the queue is a local condition,
we make use of that condition to modify the behavior of the simulator to prevent deadlock.
Our strategy is called indefinite-lazy message sending, and is implemented by replacing the
xrecvb function in the simulator's main loop with a non-blocking xrecv.
simulator_main_loop()
{
  PROCESS *proc;
  MESSAGE *mesg;
  while(1)
  {
    if(mesg = (MESSAGE *) xrecv())
    {
      proc = process_table + mesg->pid2;
      (*proc->entry)(proc, mesg->msg_body);






Listing 6.9 CMB-variant indefinitely-lazy main loop. 
The function xrecv returns a message for an element simulator if the node's input-message
queue is not empty. The simulator goes on to deliver the message as before if a
message is returned. While an element simulator is consuming a message, it may either send
or withhold any output that the element simulator produces according to the heuristics in
effect at the time.
If the node's input-message queue is empty, a null pointer is returned and deadlock is
a possibility. The simulator will take special actions to break potential deadlocks. Actions
can generally be classified into two types: For the source-driven type, the simulator selects
a deferred output to send as a message; for the demand-driven type, the simulator selects
a blocked element, and sends a demand message to its predecessor to request that queued
outputs be sent. The end result is that deadlock is prevented.
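The control flow of the indefinite-lazy loop with a source-driven action can be sketched without any message system at all. In the sketch below everything is a stand-in chosen for illustration: MsgQueue, Node, and lazy_step are hypothetical names, messages are plain integers, and "delivering" a message simply defers one output.

```c
#include <stddef.h>

#define MQ_CAP 8
typedef struct { int msg[MQ_CAP]; size_t head, count; } MsgQueue;

static void mq_push(MsgQueue *q, int m)
{
    if (q->count < MQ_CAP) {
        q->msg[(q->head + q->count) % MQ_CAP] = m;
        q->count++;
    }
}

/* Non-blocking pop: returns 1 and stores a message, or 0 if the queue is empty. */
static int mq_pop(MsgQueue *q, int *m)
{
    if (q->count == 0) return 0;
    *m = q->msg[q->head];
    q->head = (q->head + 1) % MQ_CAP;
    q->count--;
    return 1;
}

typedef struct {
    MsgQueue inbox;     /* messages that have arrived at this node      */
    MsgQueue deferred;  /* outputs produced but withheld so far         */
    int delivered;      /* messages handed to element simulators        */
    int forced;         /* deferred outputs sent because input ran dry  */
} Node;

/* One pass of the lazy loop. Returns 0 only when the node has
   neither input nor deferred output left. */
static int lazy_step(Node *n)
{
    int m;
    if (mq_pop(&n->inbox, &m)) {        /* plays the role of a non-null xrecv */
        n->delivered++;
        mq_push(&n->deferred, m);       /* stand-in for an element producing output */
        return 1;
    }
    if (mq_pop(&n->deferred, &m)) {     /* empty input queue: source-driven action */
        n->forced++;
        return 1;
    }
    return 0;
}
```

The point of the structure is that deferred output is released only when the local input queue is empty, which is the local condition the text uses in place of global deadlock detection.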
6.2.4 Variant algorithms
We have experimented with many CMB variants. Since many of them are closely related,
and all of them show similar performance results, we will describe the operation and report
the performance of just six variants (A-F) that are representative of the range of possibilities
that we have studied:
A Eager message sending: This is the deadlock-avoidance CMB simulator.
B Eager event: Since successive fragments with the same state value can be merged into
one fragment, the eager-event variant detains all output fragments until a fragment
that cannot be merged with its predecessor is produced. When xrecv returns a null
pointer, the detained fragment that extends to the earliest time is sent. This is called
an eager-event variant because state changes are called events in event-driven systems,
and because this simulator will eagerly send event-conveying fragments.
C Indefinite-lazy, single-dispensation: All output fragments produced by element
simulators are queued. Messages are sent only when xrecv returns a null pointer. The output
queue that extends to the earliest time is selected, and one fragment from that queue
is sent.
D Indefinite-lazy, multiple-event: This scheme is a variation on C, motivated by
characteristics of multicomputer message systems that make it economical to pack multiple
events into fewer messages. All output fragments produced by element simulators are
queued. When xrecv returns a null pointer, the output queue that extends to the
earliest time is selected to generate a message using all of the fragments in that queue,
instead of just one.
E Demand-driven: Although we usually think of simulation as source-driven from inputs,
one can equally well organize the simulation as demand-driven from outputs. In the pure
demand-driven form, all output fragments produced by element simulators are queued.
When xrecv returns a null pointer, the input port that lags furthest behind is picked
to select the destination for a demand message. Upon receipt of a demand message, if
the output queue is not empty, the simulator sends all fragments in the output queue;
if the output queue is empty, the simulator propagates the demand message. For the
demand-driven variant, the message header must also carry a type field to distinguish
demand messages from fragment messages:









msg_body[]; } MESSAGE;
simulator_main_loop()
{
PROCESS *proc;






















if(mesg = (MESSAGE * ) xrecv()) 
{ 












Listing 6.10 CMB-variant demand-driven main loop. 
F Demand-driven, adaptive: Demand messages single out critical paths in a simulation.
In an adaptive form of demand-driven simulation, a threshold is associated with each
communication path. Outputs of element simulators are queued only up to the threshold;
when the threshold is exceeded, the contents of the queue are sent as a message.
Demand messages operate as in E, but also cause the threshold to be decreased for
processes that get them. In the examples that we show, the threshold is halved. The
simulator is accordingly able to adapt itself to the characteristics of the system being
simulated.
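The threshold rule of variant F can be illustrated with a per-path counter. This is a minimal sketch under assumptions made here: the Path structure and the functions path_enqueue and path_demand are hypothetical, fragments are only counted rather than stored, and the propagation of a demand message when the queue is empty (as in variant E) is left out.

```c
/* One communication path with an output-holding queue. */
typedef struct {
    int queued;      /* fragments currently held back          */
    int threshold;   /* how many may be held before a send     */
    int sent_msgs;   /* messages actually sent on this path    */
} Path;

/* An element simulator produced one more output fragment for this path. */
static void path_enqueue(Path *p)
{
    p->queued++;
    if (p->queued > p->threshold) {   /* threshold exceeded: flush as one message */
        p->sent_msgs++;
        p->queued = 0;
    }
}

/* A demand message arrived for this path. */
static void path_demand(Path *p)
{
    if (p->queued > 0) {              /* as in E: send everything that is queued */
        p->sent_msgs++;
        p->queued = 0;
    }
    p->threshold /= 2;                /* adaptive part: the threshold is halved */
}
```

Paths that keep receiving demands therefore drift toward eager sending, while paths that are never demanded keep batching their fragments.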
6.2.5 Instrumentation
Although execution time is one of the most natural bases of comparison between any two
programs that perform the same function, and although it is used below to illustrate the
performance of our distributed simulators on different commercial multicomputers, execution
time on these concurrent computers depends both on the algorithm and on the
characteristics of the particular computer. When we wish to isolate the characteristics of the
algorithm from those of the computer, we run our simulator programs under the control of
a multicomputer simulator (sweep mode). A close examination of the main routine of the
simulator reveals that it can be transformed with minimal modification into a light-weight
reactive-process program under yet another layer of the reactive kernel:
SIM_DATA *simulator_data;
simulator_main_loop(simp,mesg)
PROCESS *simp;
MESSAGE *mesg;
{

















simulator_data = (SIM_DATA *)(simp->data);
if(mesg) 
{ 
if(mesg->type == DEMAND_TYPE) 
{ 








take_action_to_promote_progress( ) ; 
} 
Listing 6.11 CMB-variant main loop as a light-weight process.
The process structure in this reactive kernel is described by the SIM_DATA structure in
the above listing. The structure contains a list of element simulator processes and any other
data structures private to this instance of the simulator.
[Figure label: reactive kernel]
Figure 6.6 Structure of a sweep-mode simulation.






Sweep-mode simulation for an N-node multicomputer is accomplished with a reactive
kernel that runs N copies of the simulators as reactive processes. Execution time is then
measured in a unit called a sweep [2, 1.5], which corresponds here to a fixed time required
to call an element once. The time required for other operations, such as sending a message,
can be set to a particular number of sweeps. Normally, a message sent by one node in one
sweep is available in the destination node at the next sweep. However, to test the sensitivity
of the algorithms to message latency, we can also set the latency to larger values.
[Figure label: multicomputer network]






In the real-mode simulation, the simulator is linked with a reactive heavy-weight handler
and run directly on the multicomputer. There is one copy of the simulator process in
each node, and each simulator process runs a subset of the elements as embedded reactive
processes. Each node runs at its own pace, and execution time is measured with the host
computer's real-time clock.
6.2.6 Experimental results
Performance measurements have been made on a variety of logic networks, including those
that are representative of networks found in computers and VLSI chips, and those that
are designed specifically to test or to stress the simulator. Six different network types,
each in several sizes up to 4000 logic gates, have been the principal vehicles for these
experiments. The majority of the logic gates have delays of between 1 and 80ns, with 20ns
being a typical value. Each simulation was run for a predetermined, simulated interval,
and a set of measurements, including the real elapsed time, was recorded. A larger variation
in performance was observed among networks with different characteristics than between
algorithm variants.
The parallel multiplier is a good example of an ordinary logic network. The 14x14
array multiplier used in several experiments employs 1376 logic gates to generate the 28-bit
product of two 14-bit binary inputs. The multiplier network contains only limited
concurrency, and does not contain tight circuits that give the simulator artificial performance
advantages or troubles that depend on element distribution. It also contains moderately
high fan-out in the multiplier and multiplicand lines; this puts pressure on the message
system. In all fairness, the distributed simulation of this multiplier network is expected to
do neither too badly nor too well.
For the simulation, the most significant bit of the product is connected back to the
multiplier input via an inverting delay. The delay is such that the multiplier reaches a
stable state before the multiplier input changes. The multiplicand input is set to a value
that causes the circuit to oscillate. The resulting activity level is quite low: The entire
circuit is idle 25% of the time. For the other 75% of time, there is a wavefront of activity
moving diagonally down the array. After the wavefront hits the bottom-left corner, the
multiplier input changes and broadcasts the change to 144 gates. A trace of the product
outputs shows that the simulator and the circuit are running correctly.
[Figure panels: idle, broadcast, wavefront]
Figure 6.8 Three phases of the oscillating multiplier.
The plot in Figure 6.9 portrays in a log-log format the sweep count in the sweep mode
versus the number of nodes, N, for the simulation of the 14x14 multiplier network under













[Plot: sweep count versus log2(nodes), from 0 to 11; a horizontal line marks the sequential simulator]
Figure 6.9 A 1376-gate multiplier, sweep-mode.
It is not useful to continue the plot beyond 2^11 nodes, since at this point there are as
many nodes as simulated gates. Each horizontal division represents a factor of two in the
number of nodes used; each vertical division represents a factor of two in sweep count or
time. The placement of elements in nodes for these trials is a systematic pattern that tends
to put related elements into the same node.
The first remarkable characteristic of these performance measurements is that they are
so similar across this class of variant algorithms. Algorithms A, E, and F produce more
messages than B, C, and D; but in the sweep mode, in which messages are free but element
invocations are expensive, there is little difference between the variants. The performance
under sweep-mode execution exposes the intrinsic characteristics of the algorithm, and is not
related to such multicomputer characteristics as the relationship between node computing
time and message latency.
The performance is divided roughly into two regimes, the first regime being one of
near-linear speedup in N for the first 7-8 octaves, and the second regime being one of diminishing
returns in N as the computing time approaches an asymptotic minimum value. In the
linear speedup regime, these simulators nearly halve the sweep count with each doubling of
resources until limiting effects are reached. Load balance is assured by the weak law of large
numbers when there are many elements per node. While each node has a sufficiently large
pool of work, node utilization remains high. The simulators approach asymptotic minimal
time as they exhaust the available concurrency in the system being simulated. The gradual
"knee" of the curve originates from progressively less-effective statistical load balancing as
the number of elements per node diminishes with larger N. The gross characteristics of these
curves are similar to those of other concurrent programs [2], and are quite understandable
and predictable.
Like many other concurrent algorithms, a more efficient sequential algorithm exists for
the CMB-variant simulator when applied to circuit simulation. The heavy horizontal line
represents the number of sweeps a sequential event-driven simulator requires for this same
simulation. We observe at log2 N = 0 (1 node) that all of the CMB variants are somewhat
inefficient in comparison with the sequential event-driven simulator. We shall refer to this
extra work that the CMB-variant simulator does as the overhead of distributing the
simulation. We will discuss the sequential event-driven simulator and additional performance
measurements in the next and subsequent sections.
Section 6.3 Sequential Simulator 
At N = 1, the sequential simulator does better than do the CMB-variant simulators for two
reasons: The first is that logic circuits are event-driven systems in which the time it takes
for a sequential simulator to handle and process a fragment is zero if the fragment does
not convey an event. (A fragment conveys an event if its state differs from the fragment
that precedes it. A message that carries an event-conveying fragment is an event message;
a message that does not is a null message.) The second is that logic gates are simple and
the time it takes for an element simulator to process an event-conveying fragment is almost
zero.
Since the message-handling times for null messages and event messages are identical in
the CMB-variant simulator, the ratio at N = 1 (N is the number of nodes used) between the
time taken by the sequential and the CMB-variant circuit simulators reflects the proportion
of event messages in a CMB-variant circuit simulator. The cost of handling null messages
is the overhead of the CMB-variant simulator at N = 1.
6.3.1 Sequential simulator mechanism
Like the CMB-variant simulator, our sequential simulator is also a reactive-process program
with embedded, light-weight, reactive processes. Each message in this simulator, called an
event, describes a state transition and includes the following fields:






input_id; /* Index of the input at the dest element. */
time; /* Time of the transition. */
}
Listing 6.12 Sequential-simulator event structure.
The time field of an event represents the time when a state change will occur at the
input (identified by the value of the input_id field) of the process that receives the event.
The function contained in Listing 6.13 can be used as an entry function for an inverter gate.
inverter_entry(pp,ep)
PROCESS *pp;
EVENT *ep;
{
  SEND_EVENT(pp, 0, ep->time);
  free_event(ep);
}
Listing 6.13 An inverter in sequential simulator.
When the simulator delivers an event to the inverter, the inverter will generate an
output event with an event time that is pp->delay units larger. The SEND_EVENT function
takes three parameters: Like the OUTPUT function of the CMB-variant simulator, the first
two parameters are the process structure and the index that identifies an output of the
element; the third parameter is a time value whose sum with the element delay becomes
the time of the output event.

























((ELEMENT *) (pp->data))->outq 
((ELEMENT *) (pp->data))->delay 


















Listing 6.14 The SEND_EVENT function in sequential simulator.
The routine allocates an event structure (15) for every input connected, fills in the
receiver input index (16), fills in the time of the event (17), and inserts the event into the
event list (19). This routine is structurally similar to the OUTPUT routine of the CMB-variant
simulator, except that node numbers are not used to identify processes because all processes
reside in the same node. In order to reduce the number of events that must be sorted when
more than one input is connected, output-event duplication in the actual implementation
is performed at the time of event delivery.
It is interesting that the entry function for an XOR-gate is identical to that of an inverter.
Listing 6.15 contains the more complex, OR-gate entry function.
1 or_00(pp,ep) PROCESS *pp; EVENT *ep;
2 {
3 if(!ep->input_id) { pp->entry = or_01;
4 else { pp->entry = or_10;
5 free_event(ep);
6 }
8 or_01(pp,ep) PROCESS *pp; EVENT *ep;
9 {
10 if(!ep->input_id) { pp->entry = or_00;
11 else { pp->entry = or_11;
12 free_event(ep);
13 }
15 or_10(pp,ep) PROCESS *pp; EVENT *ep;
16 {
17 if(!ep->input_id) { pp->entry = or_11;
18 else { pp->entry = or_00;
19 free_event(ep);
20 }






if(!ep->input_id) { pp->entry = or_10;















When both gate inputs are 0, the entry function is or_00. When an event is received,
the event is distinguished by the input it affects. If the event is for the input whose index
is 0, the entry-function pointer is set to or_01, and an output event is produced (2). If
the event is for the other input, the entry function is set to or_10 and an output is also
produced (3). The actions for the other three entry functions are similar.
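The four entry functions form a small state machine in which the current entry pointer encodes the gate's input state. The following self-contained sketch shows the complete transition table in ANSI-style C. It is an illustration, not the thesis listing: the Gate structure, the events_sent counter (standing in for calls to SEND_EVENT), and the simplified entry signature taking only an input index are assumptions made here. As in the text, or_XY denotes input 1 in state X and input 0 in state Y.

```c
typedef struct Gate Gate;
typedef void (*Entry)(Gate *g, int input_id);

struct Gate {
    Entry entry;       /* current entry function encodes the input state */
    int events_sent;   /* stand-in for output events passed to SEND_EVENT */
};

static void or_00(Gate *g, int input_id);
static void or_01(Gate *g, int input_id);
static void or_10(Gate *g, int input_id);
static void or_11(Gate *g, int input_id);

/* Both inputs 0: any input transition raises the output. */
static void or_00(Gate *g, int input_id)
{
    g->entry = (input_id == 0) ? or_01 : or_10;
    g->events_sent++;
}

/* Input 0 is 1, input 1 is 0. */
static void or_01(Gate *g, int input_id)
{
    if (input_id == 0) { g->entry = or_00; g->events_sent++; }  /* output falls */
    else               { g->entry = or_11; }                    /* output stays 1 */
}

/* Input 0 is 0, input 1 is 1. */
static void or_10(Gate *g, int input_id)
{
    if (input_id == 0) { g->entry = or_11; }                    /* output stays 1 */
    else               { g->entry = or_00; g->events_sent++; }  /* output falls */
}

/* Both inputs 1: a single transition cannot change the output. */
static void or_11(Gate *g, int input_id)
{
    g->entry = (input_id == 0) ? or_10 : or_01;
}
```

Because an event always means "this input toggled", the gate never has to store or inspect input values; only transitions that change the OR of the inputs produce output events.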
An element can compute its output state based only on a transition from one of its
inputs, because the transition carries the assurance that the other inputs of the element
have not changed. Such assurance can be provided in several ways. The most common
method is to keep the set of yet-to-be-delivered events (the pending events) sorted by time
in an event list, and to deliver the event with the smallest time value first. Since element
delays cannot be negative, an event cannot trigger events with smaller time values. When
an event is delivered to an element, it is assured that the other inputs of the element, and
indeed of all other elements, will remain unchanged up to the time of the event.
[Figure labels: 1-0, 0; glitch or no glitch?]
Figure 6.10 A circuit containing a dynamic hazard condition
typedef struct { int  pid2;
                 char msg_body[]; } MESSAGE;
simulator_main_loop(simp,mesg)
PROCESS *simp;
MESSAGE *mesg;
{




proc = ((SIM_DATA *)(simp->data))->process_table + mesg->pid2;
(*proc->entry)(proc, mesg->msg_body);
Listing 6.16 Sequential-simulator main loop as a light-weight process.
The simulator main loop is similar to that of the CMB-variant simulator; the message
system, however, has a different property. The message system for the CMB-variant
simulator dispenses messages on a first-come, first-served basis; for the sequential simulator, the
message with the smallest time value is dispensed first.
6.3.2 Hazards in sequential simulators 
Although a sequential simulator will always produce a valid simulation result, it may not
always produce the same result as the CMB-variant simulator. Some input conditions in a
logic circuit may trigger more than one possible outcome, and a sequential simulator has
no consistent way of choosing one. For example, the OR-gate in Figure 6.10 can produce
either no transitions, or two transitions in response to two simultaneous input events. This
condition corresponds to a static hazard in the terminology of Boolean minimization.
Both of these responses are correct because the temporal relation between the two
input events is beyond the capability of the model to resolve; the one that is produced
depends on the order in which the two input events are consumed. Since both input events
have the same time value, they can be taken from the list in either order. If the low-going
transition is taken first, two output transitions will be produced; if the high-going transition
is taken first, no output transitions will be produced. The CMB-variant simulator, however,
consistently picks the response in which no output transitions are produced.
Although both responses are considered to be correct, the sequential simulator can
compare unfavorably with the CMB-variant simulator when there are too many extra events.
For the comparison to be meaningful, we must devise a sequential simulator that will
consistently make the same choices as does the CMB-variant simulator. In a system in which
every element has a non-zero delay, this can be accomplished by withdrawing the first of
the two output events when the second output event is to be produced, and canceling both
events. Each output data structure must maintain a reference to the last unconsumed event
that it has produced. When another output event is to be produced, if the previous event
has not been consumed and if the two events have the same time value, then no events
are produced and the previous event is withdrawn. The following SEND_EVENT function
implements this scheme:

















op = ((ELEMENT *)(pp->data))->outq + id;
ot = ((ELEMENT *)(pp->data))->delay + time;







if(op->last_e[j] && (op->last_e[j]->time == ot))
{ 
DEL_EVENT(op->last_e[j]); 
























Listing 6.17 A SEND_EVENT function that reduces glitches . 
Missing from Listing 6.17 is the part that places a back-reference pointer into each
event structure. The back-reference is used by the simulator to dissociate an event from its
output (by setting the corresponding last_e[j] to 0) when the event is delivered.
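The cancellation rule can be exercised in isolation with a much simpler event store. The sketch below is an illustration under assumptions made here, not the thesis implementation: the event list is a plain array, a live flag stands in for DEL_EVENT, each Output has a single destination (so last_e is one pointer instead of an array), and the names send_event, on_delivery, and live_events are hypothetical.

```c
#include <stddef.h>

#define MAX_EVENTS 32
typedef struct { int time; int live; } Event;
typedef struct { Event ev[MAX_EVENTS]; size_t n; } EventList;

/* One element output with a single destination. */
typedef struct {
    int delay;       /* element delay                          */
    Event *last_e;   /* last produced event not yet delivered  */
} Output;

/* Schedule an output event, or cancel it against a pending event
   from the same output that carries the same time value. */
static void send_event(EventList *list, Output *op, int time)
{
    int ot = op->delay + time;
    if (op->last_e && op->last_e->time == ot) {
        op->last_e->live = 0;      /* withdraw the previous event         */
        op->last_e = NULL;         /* and produce no new event            */
        return;
    }
    if (list->n < MAX_EVENTS) {
        Event *e = &list->ev[list->n++];
        e->time = ot;
        e->live = 1;
        op->last_e = e;
    }
}

/* Called when an event is delivered: dissociate it from its output. */
static void on_delivery(Output *op, const Event *e)
{
    if (op->last_e == e)
        op->last_e = NULL;
}

/* Number of events that are still pending. */
static size_t live_events(const EventList *list)
{
    size_t k = 0;
    for (size_t i = 0; i < list->n; i++)
        if (list->ev[i].live)
            k++;
    return k;
}
```

Two sends with the same input time from the same output cancel each other, which reproduces the "no output transitions" response that the CMB-variant simulator picks for simultaneous input events.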
6.3.3 Instrumentation
The sequential simulator also exists in two modes, sweep mode and real mode. Like the
CMB-variants, the sweep-mode simulator consumes one sweep for every element input
delivery. In the real mode, the CMB-variant simulator must poll the system's input message
queue once for every null message or event message delivered; the sequential simulator is
also made to poll the same queue once for every event message delivered, even though this
is never necessary. Polling for messages consumes a significant amount of time in many
multicomputers, but there is nothing inherently costly about the operation. It should be
possible in a future machine to poll the queue by checking only a single pre-defined memory
location that has been mapped into each process's memory space.
The resulting real-mode simulator runs at a speed of about 500 µs per event for our
examples on the iPSC/2 and the Symult 2010, and at about 3000 µs per event on our
iPSC/1. The polling time is about 170 µs for the Symult 2010 and 760 µs for the iPSC/1.
The iPSC multicomputers were running Cosmic Environment in compatibility mode instead
of in the potentially more efficient native mode. The exact speed depends on the size of
the event list. The event list is implemented with a tree structure called the leftist tree
[16]. This data structure shows O(log n) timing characteristics for insertion and deletion
operations in even the most highly unbalanced cases, but it does not provide an easy way to
traverse the tree in sorted order. The leftist tree is an excellent choice for the simulators
because tree traversal is not needed in a simulator.
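For readers unfamiliar with the structure, a leftist tree used as a priority queue can be sketched as follows. This is a generic textbook formulation keyed on event time, not the thesis's implementation; the names Node, heap_merge, heap_insert, and heap_delete_min are assumptions made for the example.

```c
#include <assert.h>
#include <stdlib.h>

typedef struct Node {
    double time;                 /* key: event time */
    int    rank;                 /* length of the right spine below this node */
    struct Node *left, *right;
} Node;

static int rank_of(const Node *n) { return n ? n->rank : 0; }

/* Merge two leftist heaps; the recursion only follows right spines,
   whose length is logarithmic in the number of nodes. */
Node *heap_merge(Node *a, Node *b) {
    if (!a) return b;
    if (!b) return a;
    if (b->time < a->time) { Node *t = a; a = b; b = t; }  /* a has the smaller root */
    a->right = heap_merge(a->right, b);
    if (rank_of(a->left) < rank_of(a->right)) {            /* restore leftist property */
        Node *t = a->left; a->left = a->right; a->right = t;
    }
    a->rank = rank_of(a->right) + 1;
    return a;
}

/* Insertion is a merge with a one-node heap. */
Node *heap_insert(Node *root, double time) {
    Node *n = malloc(sizeof *n);
    assert(n != NULL);
    n->time = time; n->rank = 1; n->left = n->right = NULL;
    return heap_merge(root, n);
}

/* Remove the minimum; store its time in *time_out and return the new root. */
Node *heap_delete_min(Node *root, double *time_out) {
    assert(root != NULL);
    *time_out = root->time;
    Node *rest = heap_merge(root->left, root->right);
    free(root);
    return rest;
}
```

Both operations reduce to a single merge, which is why the structure suits an event list that only ever inserts events and removes the earliest one.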
6.3.4 Big multiplier results
The sweep-mode simulation results, shown in section 5.2, indicate a 2-4x overhead when
N = 1; the real-mode results generally show a 4-8x overhead. This is not unexpected
because the time required in the sweep mode to deliver a message to an element is assumed
to be the same in all simulators; in reality, the CMB-variant simulator has to do more work
per message than does the sequential simulator.
We cannot, at this moment, reproduce the same sweep-mode performance comparisons
using real multicomputers, because we do not have access to any multicomputers with 2K
nodes. We do, however, have access to an assortment of multicomputers of various sizes and
vintages that we can use to explore various regions of the result graph. Figure 6.11 contains
the timing result for a simulation of the 1376-gate array multiplier from section 5.2. The








Figure 6.11 A 1376-gate multiplier for 40 µs on an iPSC/2. (Horizontal axis: log2(nodes).)
Aside from a larger overhead, the real-mode curves generally reflect the upper third
of the sweep-mode curves. One consistent characteristic for this and other simulations is
a relatively low overhead for the variant F results at N = 1. Variants A and F share the
property that messages can be sent eagerly, while message sending in the other variants
must wait until a null pointer is returned by a call to xrecv, even if the messages are
to be sent from a simulator process to itself. Variant F has a lower overhead than variant
A because it makes eager only those elements on critical paths, thus allowing messages on
non-critical paths to merge. As the simulation becomes more distributed, however, more
elements become part of a critical path, and the advantage of variant F disappears.
When N > 1, variants A, E, and F fail as more of the eagerly-sent demand and null
messages become internode messages and overload the buffering capacity of the message
system. The other variants are able to continue because many messages are eliminated by








Figure 6.12 A 1376-gate multiplier for 40 µs on an iPSC/1. (Horizontal axis: log2(nodes); one curve is labeled "sequential simulator".)
Figure 6.12 contains the result of the same simulation on a 128-node iPSC/1. Due to an
excess of null messages, variants A and F fail for all N; due to a lack of memory, none of the
variants will run when N < 4, nor will the sequential simulator run at N = 1. (Our iPSC/1
has only one-half megabyte of memory per node, whereas the iPSC/2 has 4 megabytes per
node.) The sequential simulator result is an estimate derived from a simulation of a smaller
circuit (to be described later).
Figure 6.13 Combining the iPSC/2 and iPSC/1 graphs with sequential timing aligned. (Axes: log2(seconds) versus log2(nodes); one curve is labeled "sequential simulator".)
The results that we are able to obtain from the iPSC/1 simulation indicate a continuation
of the near-linear speedup until N > 64, when there are fewer than 22 elements in
each node. The total speedup obtained is 64 when the two sets of results are combined in
Figure 6.13.
A 64-node Symult 2010 multicomputer allows us to explore a large overlapping portion
of these two combined graphs. Since the S2010 nodes are much faster than the iPSC/1
nodes, the simulation interval has been scaled from 40 µs to 100 µs in order for the timing to
remain meaningful when N = 64. Figure 6.14 matches Figure 6.13 closely, but every variant
is able to complete its simulation for every N on the S2010. Variant F resembles variant A
because, as queuing limits vanish throughout the simulator, the simulator effectively becomes
a variant-A simulator. Variant F is a little worse than variant A because it still must produce
demand messages in addition to any eagerly sent message. Variant E, however, resembles
the other variants.
6.3.5 Small multiplier results









Figure 6.14 A 1376-gate multiplier for 100 µs on a Symult 2010.
circuits to observe the asymptotic effects predicted by the sweep-mode simulation for large
N. Figure 6.15 contains the results for the simulation of a 4 x 4 array multiplier consisting
of 116 logic gates. The iPSC/1 and iPSC/2 simulations were performed over a simulated
interval of 100 µs. The S2010 simulation was performed over an interval of 400 µs to preserve
accuracy when many nodes are used.
Not only is the reduction in slope more visible, differences between various modes are
also more apparent. There are 1, 2, and 8 elements per node when all of the nodes in the
iPSC/1, S2010, and iPSC/2, respectively, are in use.
Compared to the iPSC/1 curves, the S2010 curves show a steeper slope, a larger overall
speedup, and a closer match with the sweep-mode curves. The flattening of the curves for
the iPSC/1 is due to the effect of message latency. The average message latency for the
iPSC/1 when N = 64 is approximately 3000 µs; this is comparable to the 3000 µs-per-event
processing time of the sequential simulator. The user-mode message latency for the S2010 is
approximately 200 µs; this is smaller than the 600 µs-per-event processing time.
We can observe the effect of latency by varying latency in the sweep-mode simulation.








Figure 6.16 A 116-gate multiplier for 100 µs on an iPSC/2.
Figure 6.17 A 116-gate multiplier for 400 µs on a Symult 2010. (Axes: log2(seconds) versus log2(nodes).)
Figure 6.18 contains two plots, one for N = 256 and the other for N = 2048. A message
sent during a sweep is available to its destination in the following sweep when latency is 0.
When latency is non-zero, the message is delayed by an amount equal to the latency. When
simulation becomes dominated by latency, time increases linearly with latency.






Figure 6.18 Effect of increased latency on simulation performance. (Axes: log2(sweeps) versus log2(latency).)
In all of the results that we have shown, the source-driven variants, B, C, and D, are
the most robust variants, and they show a larger speedup than the other variants when N
is large. The demand-driven variant E is hindered by a large message latency. An idling
process may be delayed for two message cycles (send a demand message, receive a normal
message) before it can continue. When internode message latency is large, variant E
performs poorly. Variant F does better because it becomes variant A when processes are
idle more frequently.
6.3.6 Circuit topology vs. activity level
A CMB-variant circuit simulator must supply every element input with enough fragments
to cover the entire simulation interval. Since its simulation time is only weakly dependent
on the content of those fragments, it is more strongly influenced by the static characteristics














Figure 6.19 A 1376-gate multiplier for 100 µs on a Symult 2010, fast oscillation. (Horizontal axis: log2(nodes).)
of the circuit than by the dynamic characteristics of the circuit operation, such as the
number of events produced. A sequential simulator, on the other hand, depends only on the
number of events produced.
For example, if a circuit contains a cross-coupled latch, the delay of the gates in the
latch determines the number and the span of the fragments produced, and the number of
fragments produced determines the simulation time for the CMB-variant simulator. The
number of times the latch is used determines the number of events generated in the latch, and
the number of events generated determines the simulation time for a sequential simulator.
We can expect the sequential simulator performance to change by a greater degree
compared to the CMB-variant simulator if we run the simulation using the same multiplier
circuit, but with a different activity level. Figure 6.19 is obtained by driving the array
multiplier at an elevated oscillation frequency. Four times as many events are produced,
and the time taken by the sequential simulator has increased by a factor of 4. The time
taken by the CMB-variant simulators, however, has increased by only a factor of 2.
Since fragments are more likely to carry transitions, the possibility of consecutive
fragments merging into a single fragment is reduced. It becomes less profitable for the simulator
to withhold messages. The time taken by variant A has increased by a factor of only 1.5,
and variant A performs better than the other variants when N is not too large.
6.3.7 Hybrid possibilities
The CMB-variant simulator implements an algorithm that distributes well, but, like many
other algorithms, there are sequential implementations that are more efficient than the
concurrent implementation. However, the CMB-variant simulator is unusual in that it is
an exact implementation of an algorithm that can be defined recursively: each element
simulator can also be a composite simulator. We can view the simulator process on each
node as being a composite simulator that simulates the set of elements assigned to that
node. We refer to the set of elements, collectively, as a macro element. The circuit simulator
becomes one whose elements are not the logic gates but the macro elements; of these, one
exists in each node.
Figure 6.20 Modified Laffer curve. (Vertical axis: log(time); one curve is labeled "hybrid".)
Since the elements in a macro element must reside in the same address space, and since
their operations must be interleaved, it is a tempting thought that there may be a way to
introduce sequential-simulator efficiency into the simulation of elements in a macro element.
Suppose such a hybrid simulator were to exist. When N = 1, all logic gates would reside in
the same node; the simulator would have the same performance as a sequential simulator. If
N were large, there would be one logic gate per node and the performance would converge
to the performance of a CMB-variant simulator.
Figure 6.20 depicts a hypothetical performance plot of a hybrid simulator, a sequential
simulator, and a CMB-variant simulator. We will call this hybrid-simulator curve the
modified Laffer curve (in recognition of economist Arthur B. Laffer, who showed that tax
revenue is fixed on two ends on the plot of revenue vs. tax rate). The quest for the algorithm
and for the control over the shape of the curve between these two end points guides the rest
of the experimental work, which will be discussed in the next chapter.
Chapter 7 Hybrid Simulators
Section 7.1 Coordinated Sequential Simulator (Hybrid-1)
One way to build a hybrid simulator is to use a modified sequential simulator for each
macro element, and to connect the sequential simulators using a CMB-variant simulator.
Since a CMB-variant simulator provides coordination for a set of sequential simulators,
this hybrid simulator is called the coordinated sequential simulator (designated hybrid-1).
When N = 1, hybrid-1 is identical to the sequential simulator, as the modification does not
introduce extra work for the simulator when the macro element is a closed system.
A macro element is an open system if any of its element inputs connect to an element
output in another node. Macro-element connectivities are handled by the CMB-variant
simulator, and macro-element simulators must satisfy the requirements of the CMB-variant
simulator: Output state descriptions produced by each macro-element simulator are packed
into fragments and sent to the encircling CMB-variant simulator. The CMB-variant
simulator distributes the fragments according to the connectivity of the macro elements. When
a macro-element simulator receives a fragment, events extracted from the fragment are
entered into the event list.
7.1.1 The algorithm
Since asynchronous events can be injected by other macro-element simulators, event order
for a macro-element simulator cannot be guaranteed by the repeated delivery of the
earliest event from the event list. The simulator may not be able to consume the event at
the top of the list because an event with a smaller time value may yet arrive from another
macro element. To avoid a simulation error, we can employ a temporal marker in each
macro element to indicate the smallest time value for any future external events. As long
as the time of the first event in the event list is less than the marker time, the event can
be safely consumed. If the event time is greater than the marker time, the simulator
must wait.
The encircling CMB-variant simulator assures that the time of the next event on any
macro-element input is greater than or equal to the time of the macro-element input. The
time of a macro-element input is equal to the total span of fragments that have passed
through it, and is updated whenever a fragment is received for that input. The minimum
macro-element input time is a convenient temporal marker.
Output fragments are produced by a macro-element simulator whenever additional
output descriptions are computed. Since elements are strictly synchronized in a sequential
simulator, the outputs of all elements in a macro element are known up to the same simulated
time. Thus, the entire state of the macro element can be treated as an atomic property
(Chapter 5), and all arcs with the same source and destination nodes can be merged into
one arc.
In order to compute the temporal marker, we store the input time of each macro-element
input in a special stopper event. The stopper is added to the event list along with the
other events. When a macro-element input receives a fragment, in addition to injecting new
events, it adds the span of the fragment to its stopper time, and it repositions the stopper
in the event list. As long as the event at the top of the event list is not a stopper, the
macro-element simulator is free to consume the event; when a stopper appears at the top
of the event list, the simulator is made to wait for more inputs.
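The stopper mechanism can be illustrated with a small sketch. Everything here is hypothetical (Kind, Event, EventList, add_event, advance_stopper, consume_next), an unsorted array with a linear scan stands in for the real event list, and one point is an assumption rather than something the text states: a stopper is placed ahead of a normal event with the same time, following the "less than the marker time" condition above.

```c
#include <assert.h>

enum { MAX_EVENTS = 64 };
typedef enum { STOPPER = 0, NORMAL = 1 } Kind;   /* STOPPER wins ties (assumption) */

typedef struct { double time; Kind kind; int input; } Event;
typedef struct { Event ev[MAX_EVENTS]; int n; } EventList;

/* Ordering: earlier time first; on equal times a stopper precedes a normal event. */
static int before(const Event *a, const Event *b) {
    if (a->time != b->time) return a->time < b->time;
    return a->kind < b->kind;
}

/* Index of the earliest entry, or -1 if the list is empty. */
static int top_index(const EventList *l) {
    int best = -1;
    for (int i = 0; i < l->n; i++)
        if (best < 0 || before(&l->ev[i], &l->ev[best])) best = i;
    return best;
}

void add_event(EventList *l, double time, Kind kind, int input) {
    assert(l->n < MAX_EVENTS);
    l->ev[l->n++] = (Event){ time, kind, input };
}

/* A fragment of the given span arrived on `input`: move that input's stopper.
   Returns 1 if the stopper was found, 0 otherwise. */
int advance_stopper(EventList *l, int input, double span) {
    for (int i = 0; i < l->n; i++)
        if (l->ev[i].kind == STOPPER && l->ev[i].input == input) {
            l->ev[i].time += span;
            return 1;
        }
    return 0;
}

/* Pop the earliest normal event into *out.  Returns 0 when the list is
   empty or a stopper is on top, i.e. the simulator must wait. */
int consume_next(EventList *l, Event *out) {
    int i = top_index(l);
    if (i < 0 || l->ev[i].kind == STOPPER) return 0;
    *out = l->ev[i];
    l->ev[i] = l->ev[--l->n];
    return 1;
}
```

With one stopper per macro-element input, the earliest stopper in the list plays the role of the temporal marker, so no separate minimum needs to be maintained.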
7.1.2 Sorting with a different key
A macro-element simulator derived from a conventional sequential simulator has an effective
delay of zero because its event-consumption rules prevent the simulator from producing any
output description that has a time value larger than its own minimum input time. A circuit
of these macro-element simulators will deadlock unless a set of alternative consumption
rules is used to produce a positive delay.
"The event with the smallest simulated time will be delivered first" is merely a
convenient consumption rule that satisfies the following correctness conditions for a sequential
simulator. When an event is delivered to an element:
1. The event will not need to be recalled, and
2. No future events for the element will have a smaller event time.
We can satisfy both conditions and provide a non-zero delay by sorting events according to
the following ordered pair:
    (te + dm, te)
where te is the event time, and dm (the m_delay) is the delay of a minimum-delay path
(the shortest path) between the destination element of the event and any macro-element
output. Macro-element output events therefore have a dm of 0. The first member of a key
is the dominant member when keys are to be compared.
Intuitively, if input events for an element are ordered according to this key, they are
ordered in te as well, because dm is the same for all input events of the element. An event
whose destination element has an m_delay of dm can be deferred in the event list by dm
amount of time relative to those events for the macro-element outputs because its effects
cannot propagate to the outputs before te + dm. The effective delay of a macro element is
therefore the minimum m_delay of its macro-element inputs.
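The ordered-pair key amounts to a two-level comparison, which might be written as below. The names Key, make_key, and key_compare are illustrative assumptions, not identifiers from the thesis.

```c
#include <assert.h>

typedef struct {
    double first;    /* te + dm: the dominant member */
    double second;   /* te: breaks ties between equal first members */
} Key;

/* Build the key of an event with time te whose destination element
   has an m_delay of dm (dm is 0 for macro-element outputs). */
Key make_key(double te, double dm) {
    assert(dm >= 0.0);
    Key k = { te + dm, te };
    return k;
}

/* Returns <0, 0, or >0 as a sorts before, equal to, or after b. */
int key_compare(Key a, Key b) {
    if (a.first != b.first)   return a.first  < b.first  ? -1 : 1;
    if (a.second != b.second) return a.second < b.second ? -1 : 1;
    return 0;
}
```

As a worked instance, an event at te = 10 for an element with dm = 5 has key (15, 10). An event at time 12 for a macro-element output has key (12, 12) and is therefore delivered first, even though its event time is later.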
Theorem 7.1: An event produced by an element with a positive delay must have a key that
is larger than the key of the event that triggers it.
Proof: Let the delay of the element be δ, the time of the input event be te, and the
m_delay of the element be dm.
By the definition of element delay, any output event triggered by this
input event must have a time value of at least te + δ. By the definition of
m_delay, the destination element of the output event must have an m_delay
of at least dm − δ. Therefore, the first part of the key for the output event
must be no less than dm − δ + te + δ, or te + dm, which is equal to the first
part of the key for the input event.
The second part of the key for the output event is at least te + δ, which is greater
than the second part of the key of the input event. Therefore, the key of the
output event must be larger than the key of the input event.
Theorem 7.2: Any event appearing at the top of the event list is valid.
Proof: An event must come either from another element in the same macro element
or from another macro element. Events from other macro elements are
assumed to be correct because the macro-element simulators follow the rules
of a CMB-variant simulator.
Figure 7.1 An event that invalidates another event. (Labels in the figure: m_delay, delay.)
If the event is produced locally, let the event at the top of the list be
et, and let (At, Bt) be the key of that event. Let ev be the event that an
element consumes to invalidate et, and let (Av, Bv) be its key.
By the definition of a key, At − Bt is the m_delay of dst(et), and Av − Bv
is the m_delay of src(et). Let δ be the delay of src(et). By the definition of
m_delay, we have the inequality:
    (Av − Bv) ≤ δ + (At − Bt)
which we can rearrange into:
    (Av − At) + (Bt − Bv) ≤ δ
We also have (Bt − Bv) ≥ δ, because the delay of src(et) is δ; and (Av − At) ≥
0, because et is the event at the top of the event list. The only solution to
the inequality above is (Av − At) = 0 and (Bt − Bv) = δ.
Since the key of et is no greater than the key of ev, it follows that δ
must be zero and that the two events must have the same event time. Since
the ordering of the two events is beyond the ability of the model to resolve,
it is correct to assume in this case that et is earlier in time, and is therefore
valid.
Suppose et is the event at the top of the event list, and let the first part of its key be
called the event-list time. Since all macro-element output events have an m_delay of zero,
and since all new events have keys that are at least as large as the key of et, the state of all
macro-element outputs is known up to the event-list time. The effective delay of a macro
element is therefore equal to the delay of the shortest path between any macro-element
input and output.
7.1.3 The simulator mechanism
The sequential-simulator discussion in section 6.2 hints that complexities are being moved
into the message system of the reactive kernel (the kernel of the light-weight, reactive
element processes). When a reactive kernel needs an event, its message system provides the
event with the smallest time value of all events in the message system.
    multicomputer message system
    CMB-variant message system
    sequential-simulator message system
    sequential-simulator kernel
    element processes
Figure 7.2 Layering in the hybrid-1 simulator.
In hybrid-1, the message system of a sequential simulator is sandwiched between the
message system of a CMB-variant simulator and the kernel of the sequential simulator.
When the kernel needs an event, its message system provides that event having the smallest
key, as long as that event is not a stopper. If it is, the message system waits for the stopper
to be relocated. When the message system of the CMB-variant simulator receives more
fragments, it moves the stoppers. The hybrid-1 simulator can therefore be constructed by
layering reactive kernels.
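typedef struct PROCESS {
    int  (*entry)();            /* entry function of the process */
    char *data;                 /* private data of the process */
} PROCESS;

typedef struct {
    int  pid2;                  /* index of the destination element process */
    char msg_body[];            /* message contents */
} MESSAGE;

SIM_DATA *simulator_data;       /* SIM_DATA (declared elsewhere) holds process_table */

sequential_simulator_main_loop(simp, mesg)
PROCESS *simp;
MESSAGE *mesg;
{
    PROCESS *proc;

    simulator_data = (SIM_DATA *)(simp->data);

    /* look up the destination element and run its entry function */
    proc = simulator_data->process_table + mesg->pid2;
    (*proc->entry)(proc, mesg->msg_body);
}
Listing 7.1 Hybrid-1 main loop.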
The kernel of the sequential-simulator main loop can be expressed as the light-weight,
reactive-process program shown in Listing 7.1. It returns to its message system for more
events. The message-system layer for the sequential simulator (Listing 7.2) takes care of
sorting the events and getting external events from the message system of a CMB-variant
simulator. The message system of the sequential simulator is also a light-weight reactive
process:
PROCESS *seqsim;    /* Sequential simulator process structure (only 1) */

sequential_simulator_message_system(msys, sb)
PROCESS *msys;
[...]









Listing 7.2 Hybrid-1 embedded message system . 
It returns to the message system of the CMB-variant simulator for a fragment, which
it digests into individual events. After that, as long as the event with the smallest time is
not a stopper, the message system will remove the event from the event list and deliver it
to the sequential-simulator kernel.
7.1.4 The simulator output
Sending only the macro-element output events is not enough to satisfy the requirements
for a CMB-variant simulator. Whenever the event-list time has increased, more is known
about the outputs, even if no output event has been produced. The rule for eventual delivery
requires that null messages be generated.
Like the CMB-variant simulator, several variants of the hybrid-1 simulator have been
created, and they are characterized by how and when messages are sent. Eventual delivery
is also assured by the same indefinite-lazy evaluation mechanism (not shown in the listings
above). Three adjustable parameters are available for the hybrid-1 simulator:
Queue-limiting: Messages are sent when an adjustable limit on the number of queued
output events is reached, or when null is returned by xrecv.
Demand-driven: Demand messages are sent after an adjustable delay, as measured by
the number of successive nulls returned by xrecv while a macro-element
simulator is waiting for more inputs. Demand messages are sent to the
source nodes of the inputs whose stoppers are at the top of the event list.
Queued messages for that output addressed by the demand message are
sent when a demand message is received.
Eager-message: Each output has a prompter event that stores the sum of an adjustable
value and the simulated time of the last output action. When a prompter
event reaches the top of the event list, messages are sent for that output
and the prompter is rescheduled.
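The three parameters can be read as three small decision rules. The sketch below is a literal rendering of the descriptions above, with hypothetical names throughout (HybridParams, should_send_queued, should_send_demand, prompter_time); the real simulator's data structures are not shown in the text.

```c
#include <assert.h>

typedef struct {
    int    queue_limit;      /* queue-limiting: queued output events that force a send */
    int    demand_nulls;     /* demand-driven: successive null xrecv results before a demand */
    double prompter_delay;   /* eager-message: added to the time of the last output action */
} HybridParams;

/* Queue-limiting: send when the limit is reached or xrecv returned null. */
int should_send_queued(const HybridParams *p, int queued, int xrecv_was_null) {
    assert(queued >= 0);
    return queued >= p->queue_limit || xrecv_was_null;
}

/* Demand-driven: send a demand message once enough successive nulls
   have been seen while waiting for more inputs. */
int should_send_demand(const HybridParams *p, int successive_nulls) {
    return successive_nulls >= p->demand_nulls;
}

/* Eager-message: simulated time at which the output's prompter is scheduled. */
double prompter_time(const HybridParams *p, double last_output_time) {
    return last_output_time + p->prompter_delay;
}
```

A smaller prompter_delay or queue_limit makes a variant more eager, at the cost of more messages.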
7.1.5 Expectation
Tight synchronization between elements in the same computing node greatly reduces the
volume of internode messages, especially null messages, by combining internode arcs having
common source and destination nodes into one single arc. Tight synchronization, however,
can also reduce concurrency. When a simulator process is blocked because of a stopper
appearing at the top of the event list, elements that do not depend on the input of that
stopper are also prevented from making progress. Concurrency is reduced because this
forces different sub-circuits in the same node to progress at the same rate, and ignores
non-strict input conditions in which an element can still make progress when some of its





Figure 7.3 Expected performance of the hybrid-1 simulator. 
The purpose of this experiment is to construct a simulator that will do as little work as
possible at small N rather than be as efficient as the CMB-variant simulators at large N.
After all, we can already get CMB-variant-simulator performance by running a CMB-variant
simulator. We expect the simulator performance graph to start at N = 1 at sequential
simulator speed. We expect to see sub-linear speedup due to the lost concurrency, load
imbalance, and extra work required to deal with the message system. We then expect the
performance to bottom out at a level above the CMB-variant simulator when N is large.
7.1.6 Experimental results
Like the CMB-variant simulator and the sequential simulator, hybrid-1 is also written in
the form of a reactive program, making it suitable for sweep-mode simulation; however,
a sweep-mode simulator has not been implemented. The real-mode simulator has been
implemented, and a 64-node Symult 2010 was used as the primary test vehicle. Although
simulation was performed using a multitude of simulation parameters, only a handful will
be shown because related variants produce similar results. The variants are:
    Queue limit = 1.
    Queue limit = 5.
    Queue limit = 20.
    10 null xrecvs before demand message.
    30 null xrecvs before demand message.
    Prompter delay = 10 ns.
    Prompter delay = 30 ns.
Figure 7.4 contains the simulation result of a 14 x 14 array multiplier running on a
64-node S2010 for 100 µs simulated time. It is shown alone (left) and superimposed over the







Figure 7.4 A 1376-gate multiplier for 100 µs on a Symult 2010. (Panel label: "both"; horizontal axis: log2(nodes).)
The general characteristic of these curves matches our expectation. In the multiplier
example, the extra work that the simulator has to do and the difficulty it has in subdividing
the multiplier for load balancing result in no speedup from N = 1 to 2. For larger N, the
curves show a slope of approximately 1/2 until N = 32, where the curves level out. Between
N = 32 and 64, the curves cross over those of the CMB-variant simulator. The demand-driven
modes perform consistently better than the queue-limiting modes. The eager-message modes
perform well for small N, but they bend upward for large N due to an excess of null messages.
The more eager of the two curves bends upward sooner than the less eager one.
Due to the combining of arcs, hybrid-1 curves are strongly influenced by element
distribution only when N is large. Figure 7.5 contains results of simulation using randomized
element placement. Compared to Figure 7.4, the CMB-variant curves are shifted upward
uniformly for all N, and the hybrid-1 curves are bent upward when N is large. The hybrid-1








Figure 7.5 A 1376-gate multiplier for 100 µs on a Symult 2010 with random placement. (Panels: "hybrid-1 only" and "both"; horizontal axis: log2(nodes).)
Since one end of the hybrid-1 curves is pegged to the sequential simulator time, we can
also expect a larger change for the hybrid-1 simulator than for the CMB-variant simulator
when we increase the circuit activity level. Figure 7.6 contains the results of simulation using
the same multiplier circuit that is operated at a higher oscillation frequency. The hybrid-1
curves are shifted upward by two octaves, while the CMB-variant curves are shifted only by
one octave. A high activity level is more favorable to the CMB-variant simulator because
fewer of the messages are null messages.
Results from the multiplier example in this chapter, and better results from other
circuits to be shown in Chapter 8, have confirmed that the hybrid-1 simulator is working




Figure 7.6 A faster oscillating 1376-gate multiplier for 100 µs on a Symult 2010. (Panel label: "hybrid-1 only"; horizontal axis: log2(nodes).)
hybrid-1 simulator to construct a new hybrid simulator that will converge to the
CMB-variant simulators when N is large.
Section 7.2 Progressive Hybrid Simulator (Hybrid-2)
The hybrid-1 simulator cannot achieve CMB-variant performance at large N because
potential concurrency is lost when non-strict conditions are ignored and elements in a macro
element are synchronized. Two separate mechanisms are used to recover the lost
concurrency. First, when an input of an element becomes blocked, the element must be allowed
to continue if it can still make progress (due to a non-strict input condition). Second, when
some elements are blocked, we must allow those that are not blocked to continue ahead of the
blocked elements.
When a stopper appears at the top of the event list, elements connected to the input
of the stopper may be blocked. Since hybrid-1 macro elements are simulated by sequential
simulators, when an element in a macro element becomes blocked, the entire macro element
is blocked. When an element becomes blocked in hybrid-2, the macro element is, in effect,
reorganized by moving the blocked element out of the macro element. More blocked elements
may result due to arcs leading from the blocked element to the new macro element. When
only unblocked elements remain, however, the macro-element simulator can continue to
make progress. When a blocked element has received more inputs and becomes unblocked,
it is put back into the macro element.
To take advantage of non-strict input conditions, stoppers in hybrid-1 are replaced by
blocker events in hybrid-2. A blocker appearing at the top of the event list does not cause
the simulator process to stop; instead, it is delivered like a normal event. For every blocker,
there is a matching anti-blocker; it has the same simulation time as the blocker, and they
annihilate each other in the simulator. Macro-element inputs produce both blockers and
anti-blockers. Whereas the hybrid-1 simulator relocates the stopper as more state fragments
are received, the hybrid-2 simulator instead adds an anti-blocker with a time value equal to
the previous blocker, adds any events carried by the fragment, and adds a blocker with the
time equal to the new time of the hybrid-1 stopper.
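The three additions made on receipt of a fragment can be written out as a short sketch. The types and the function receive_fragment are hypothetical, and the fragment is reduced to a span plus a list of event times; the real fragment format is not given in this section.

```c
#include <assert.h>

typedef enum { NORMAL, BLOCKER, ANTI_BLOCKER } EventType;

typedef struct { EventType type; double time; } Event;

typedef struct {
    double blocker_time;   /* time of the blocker currently outstanding for this input */
} MacroInput;

/*
 * Expand one received fragment into the events that are added to the
 * event list.  `out` must have room for n_events + 2 entries.
 * Returns the number of entries written.
 */
int receive_fragment(MacroInput *in, double span,
                     const double *event_times, int n_events, Event *out)
{
    int k = 0;
    assert(span >= 0.0 && n_events >= 0);

    /* 1. an anti-blocker at the time of the previous blocker */
    out[k++] = (Event){ ANTI_BLOCKER, in->blocker_time };

    /* 2. the events carried by the fragment */
    for (int i = 0; i < n_events; i++)
        out[k++] = (Event){ NORMAL, event_times[i] };

    /* 3. a new blocker where hybrid-1 would have moved its stopper */
    in->blocker_time += span;
    out[k++] = (Event){ BLOCKER, in->blocker_time };
    return k;
}
```

The new blocker time is the old one plus the span of the fragment, which is the same update hybrid-1 applies to its stopper.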
When an element receives either a blocker, an anti-blocker, or a normal event, the
element determines whether it is blocked. It is not blocked if all of its inputs are unblocked
or if its remaining unblocked inputs contain a non-strict input condition; it is blocked
otherwise. When an unblocked element becomes blocked, it sends a blocker with a time
equal to the current input event. When a blocked element becomes unblocked, it sends an
anti-blocker with a time equal to the previous blocker.
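As an illustration of the blockage rule, consider an AND gate, for which any unblocked input at 0 is a non-strict input condition: the output is 0 no matter what the blocked inputs carry. The sketch below uses hypothetical names (AndGate, and_is_blocked, update_blockage) and covers only the blocker and anti-blocker decisions, not output events.

```c
#include <assert.h>

enum { MAX_INPUTS = 8 };
typedef enum { SENT_NOTHING, SENT_BLOCKER, SENT_ANTI_BLOCKER } Action;

typedef struct {
    int    n_inputs;
    int    state[MAX_INPUTS];     /* last known logic level of each input */
    int    blocked[MAX_INPUTS];   /* 1 while the input is covered by a blocker */
    int    was_blocked;           /* blockage of the gate after the previous event */
    double blocker_time;          /* time of the blocker most recently sent */
} AndGate;

/* Blocked only if some input is blocked and no unblocked input is 0. */
int and_is_blocked(const AndGate *g) {
    int any_blocked = 0;
    assert(g->n_inputs <= MAX_INPUTS);
    for (int i = 0; i < g->n_inputs; i++) {
        if (g->blocked[i]) any_blocked = 1;
        else if (g->state[i] == 0) return 0;   /* non-strict condition holds */
    }
    return any_blocked;
}

/* Call after state[] and blocked[] were updated for an event at `time`.
   Reports what the gate sends; *sent_time is set when something is sent. */
Action update_blockage(AndGate *g, double time, double *sent_time) {
    int now = and_is_blocked(g);
    Action a = SENT_NOTHING;
    if (!g->was_blocked && now) {
        g->blocker_time = time;            /* blocker at the current event time */
        *sent_time = time;
        a = SENT_BLOCKER;
    } else if (g->was_blocked && !now) {
        *sent_time = g->blocker_time;      /* anti-blocker at the previous blocker's time */
        a = SENT_ANTI_BLOCKER;
    }
    g->was_blocked = now;
    return a;
}
```

Other gate types would differ only in the test for a non-strict condition; for an OR gate, for example, an unblocked input at 1 plays the same role.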
In a hybrid-2 simulator, when N is small, most of the element inputs are not blocked, 
and the simulation takes on the characteristics of a hybrid-1 simulator. When N is large, 
many of the element inputs are blocked, and the simulation produces the efficiency of a 
CMB-variant simulator. However, one clear disadvantage of hybrid-2, compared to hybrid-1, 
is that internode arc merging is no longer possible, and the simulator is potentially more 
sensitive to element placement. 
7.2.1 The mechanism 
 1  struct EVENT { 
 2      int e_type;       /* type of the event.        */ 
 3      int input_id;     /* id of the element input.  */ 
        ...               /* time of the event.        */ 
    ... 
 9      if(ep->time < element_time(pp)) ep->time = element_time(pp); 
11      set_input_bits(pp,ep); 
12      compute_state_and_blockage(pp); 
14      if(was_blocked(pp) && !is_blocked(pp)) add_anti_blocker(pp,ep->time); 
15      if(old_output(pp) != new_output(pp)) add_output_event(pp,ep->time); 
    ... 
        save_new_state(pp); 
        free_event(ep); 
    ... 
Listing 7.3 Generic logic-gate handler for hybrid-2. 
A sample element entry function appears in Listing 7.3. In addition to the usual input_id 
and time fields, the hybrid-2 event structure also contains an e_type field to distinguish 
among normal events, blockers, and anti-blockers. Since non-strict input conditions are 
utilized, it is now possible for an element to receive events with a time value smaller than 
the time of the element. These events are for inputs that were previously blocked, but 
the element was able to progress further because a non-strict input condition was present. 
These events do not contribute to the operation of the element, other than to determine 
the current input state of the element. Therefore, when such an event is received, its event 
time is simply set to the element time (9) before it is processed like other events. 
Each element input contains a pair of variables: One indicates the state, the other 
indicates blockage. Each output contains two pairs of variables, one for the old state and 
blockage, and one for the new state and blockage. When an event is received by the process, 
the set_input_bits function is called to set or clear the affected bits in the input structure 
of the element. The new output state and blockage are then computed from the new input 
state and blockage (12). If the element has become unblocked due to the event (14), an 
anti-blocker is sent. If the element has changed state (15), a normal event is sent. If the 
element has become blocked (16), a blocker is sent. The ordering of lines 14-16 assures that 
the event following a blocker is an anti-blocker. 
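The ordering argument can be checked with a small stand-alone sketch. It is not the thesis handler: `emit_outputs` and `out_event` are hypothetical names, the blockage and output values are passed in rather than computed, and the anti-blocker is assumed to carry the time of the blocker it cancels, as the prose above states.

```c
#include <assert.h>

/* Hypothetical type codes for the three kinds of events. */
enum { EV_NORMAL, EV_BLOCKER, EV_ANTI_BLOCKER };

struct out_event { int e_type; long time; };

/* Emit output events in the order anti-blocker, normal event, blocker.
 * 'out' must have room for at least 3 entries; returns the number written. */
int emit_outputs(int was_blocked, int is_blocked, int old_output, int new_output,
                 long now, long prev_blocker_time, struct out_event *out)
{
    int n = 0;
    assert(out != 0);
    if (was_blocked && !is_blocked) {          /* became unblocked */
        out[n].e_type = EV_ANTI_BLOCKER;
        out[n].time = prev_blocker_time;
        n++;
    }
    if (old_output != new_output) {            /* changed state */
        out[n].e_type = EV_NORMAL;
        out[n].time = now;
        n++;
    }
    if (!was_blocked && is_blocked) {          /* became blocked */
        out[n].e_type = EV_BLOCKER;
        out[n].time = now;
        n++;
    }
    return n;
}
```

Because a blocker is always the last event of a call and an anti-blocker is always the first event of the call that unblocks the element, nothing can be emitted between a blocker and its matching anti-blocker.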
The sequential-simulator main loop, the kernel to these element processes, tests the 
blockage flag before and after an entry function is called; blocked elements are separated 
from unblocked elements by treating them differently. Listing 7.4 is the kernel written as a 
heavy-weight reactive process: 
 1  sequential_simulator_main_loop() 
 2  { 
 3      MESSAGE *mesg; 
 4      PROCESS *proc; 
        ... 
 6      mesg = get_next_event(); 
 7      proc = process_table + mesg->pid2; 
 9      if(!blocked(proc)) 
10      { 
11          (*proc->entry)(proc, mesg->msg_body); 
13      } else 
14      { 
15          if(event_time(mesg) > element_time(proc)) 
16          { 
            ... 
21              (*proc->entry)(proc, mesg->msg_body); 
                if(!blocked(proc)) move_queued_events_back_to_event_list(proc); 
            ... 
    } 
Listing 7.4 Hybrid-2 main loop 
When an event is returned from the message system (which contains the event list), 
the main loop identifies the receiver of the event (7) and checks its blockage flag (9). If the 
element is not blocked, it is in the sequential-simulator domain and the event is delivered 
to it as if it were in a normal sequential simulator (11). 
If the element is blocked, the main loop checks its readiness to consume the event. The 
event cannot be consumed if its time is larger than the time of the element. The element 
lacks information about the future state of its blocked inputs necessary to consume an event 
that arrives at a future time. The event is queued for the element (17). If the event time 
is less than or equal to the element time, the element has enough information to consume 
the event, and the event is sent to the element (21). If the element is now unblocked, its 
queued events are moved back into the event list to be delivered again for the element. 
Queued events cannot be delivered directly to the element when the element becomes 
unblocked because they are ones that arrived while some inputs of the element were blocked. 
There may be events for the blocked inputs that have yet to arrive and that need to be 
delivered in the proper order (with respect to the queued events) when the element becomes 
unblocked. Moving all queued events back into the event list is inefficient when the queue 
is long and when moves have to be done frequently. The actual implementation of the 
hybrid-2 simulator contains an elaborate mechanism for minimizing wasted efforts, and 
this accounts for the largest difference between the hybrid-2 presented here and the actual 
implementation. 
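The deliver-or-queue decision of the main loop can be sketched as a single dispatch function. This is a simplified illustration, not the actual kernel: the `process` structure, the fixed-size queue, the `relist` callback standing in for the event list, and the two `demo_` helpers are all hypothetical, and events are reduced to their time stamps.

```c
#include <assert.h>

#define MAX_QUEUED 16

struct process;
typedef void (*entry_fn)(struct process *p, long event_time);
typedef void (*relist_fn)(struct process *p, long event_time);

struct process {
    int blocked;               /* blockage flag tested by the main loop  */
    long time;                 /* current simulation time of the element */
    entry_fn entry;            /* element entry function                 */
    long queued[MAX_QUEUED];   /* events held while the element is blocked */
    int n_queued;
};

enum dispatch { DELIVERED, QUEUED };

/* One iteration of the main-loop body for an event addressed to p. */
enum dispatch dispatch_event(struct process *p, long event_time, relist_fn relist)
{
    int i;
    if (!p->blocked) {                     /* sequential-simulator domain */
        p->entry(p, event_time);
        return DELIVERED;
    }
    if (event_time > p->time) {            /* future event: cannot be consumed yet */
        assert(p->n_queued < MAX_QUEUED);
        p->queued[p->n_queued++] = event_time;
        return QUEUED;
    }
    p->entry(p, event_time);               /* enough information to consume it */
    if (!p->blocked) {                     /* just unblocked: re-deliver via the list */
        for (i = 0; i < p->n_queued; i++)
            relist(p, p->queued[i]);
        p->n_queued = 0;
    }
    return DELIVERED;
}

/* Demo helpers (hypothetical): an entry function that unblocks the element,
 * and a relist callback that only counts how many events were moved back. */
int demo_relisted = 0;
void demo_entry_unblock(struct process *p, long t) { (void)t; p->blocked = 0; }
void demo_relist(struct process *p, long t) { (void)p; (void)t; demo_relisted++; }
```

Returning queued events through the event list, rather than calling the entry function on them directly, mirrors the reasoning above: events for the previously blocked inputs may still arrive and must be interleaved in time order.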
7.2.2 Experimental results 
Like the other simulators, hybrid-2 is written in the form of a reactive-process program, 
making it suitable for sweep-mode simulation; but, as in the case of hybrid-1, a sweep-mode 
simulator has not been created. Figure 7.7 contains the simulation results of a 14x14 
array-multiplier running on a 64-node S2010 for 100 μs simulated time. It is shown alone (left) 
and superimposed over both the CMB-variant result and the hybrid-1 result (right). 












[Figure 7.7 plots omitted. Panels: hybrid-2 only (left); superimposed on the CMB-variant and hybrid-1 results (right). Legend: queue limit = 5; queue limit = 20; 30 null xrecvs before demand message; prompter delay = 10 ns; prompter delay = 30 ns.] 




The most noticeable difference between hybrid-1 and hybrid-2 curves in this graph is 
that whereas hybrid-1 curves level off at large N, hybrid-2 curves keep going down. Hybrid-2 
curves start out very much like hybrid-1 curves because most of the elements in the hybrid-
2 simulators are running under the hybrid-1 mode. As more and more nodes are used in 
the simulation, hybrid-1 element simulators start to become idle more frequently, and their 
curves start to level off. In the hybrid-2 simulator, instead of becoming idle, more of the 
elements enter the CMB-variant mode to provide additional speedup over hybrid-1. 
The other remarkable aspect of hybrid-2 curves is that they are all very much alike 
until that point where most of the hybrid-1 curves level off. It is after this transition point 
that progress-promoting actions begin to dominate, and a variety of different performance 
results are produced, depending on the properties of the progress-promoting action in use. 
The hybrid-2 curves appear to converge toward the CMB-variant curves, but nothing 
conclusive can be deduced from this graph because a 64-node machine lacks sufficient nodes 
to demonstrate this effect. The convergence is much more obvious when elements are placed 
randomly. Placement has a much stronger effect on the hybrid-2 simulator than it does on 
the hybrid-1 simulator because random element placement greatly increases the number of 




[Plots omitted; panels include hybrid-2 only; x-axis: log2(nodes).] 
Figure 7.8 A 1376-gate multiplier for 100 μs on a Symult 2010 with random placement. 
Figure 7.8 shows the result of random element placement (same placement for all 
simulations shown in this graph). The hybrid-2 curves converge immediately to the CMB-variant 
curves at N = 2. Reduction in internode null messages by bundling internode arcs allows 
the hybrid-1 simulator to show a small speedup at small N. 
Convergence is also more evident when we increase the circuit activity level. Figure 7.9 
shows the results of simulating the multiplier with enhanced activity level. Convergence 
begins at a smaller N because the sequential-simulator time is now closer to the 
CMB-variant time when N = 1. The hybrid-2 curves start out closer to the CMB-variant curves, 
and they converge to the CMB-variant curves at N = 32. 





[Plots omitted; x-axis: log2(nodes).] 
Figure 7.9 A faster-oscillating 1376-gate multiplier for 100 μs on a Symult 2010. 
Although we do not have a larger machine for looking at cases where there are fewer 
elements per node, we can reduce the number of elements per node by using smaller test 
circuits. We tested a 4x4 array-multiplier that contains 116 gates. At N = 64, there are 
no more than two gates in each node. 
The CMB-variant curves diverge wildly; some of them do better than the hybrid-2 





curves and some do worse. Overall, the hybrid-2 curves seem to follow the better CMB-
variant curves. 
Chapter 8 Additional Performance Results 
This chapter summarizes the simulation results of a few selected circuits that were used in 
this research. They are generally presented in the following order: 
1. Description of the circuits. 
2. Sweep-mode simulation results on an emulated multicomputer. 
3. Real-mode simulation on a Symult 2010 with systematic element distribution. 
4. Real-mode simulation on a Symult 2010 with random element distribution. 
5. A few sets of real-mode simulation on smaller circuits of the same type. 
Each set of real-mode simulations contains results from running the CMB-variant simulator, 
the hybrid-1 simulator, and the hybrid-2 simulator. Results from other multicomputers are 
similar and are not shown. 
Section 8.1 2-D Clock Network 
8.1.1 Description 
A clock network is an arbitrarily extensible array of logic gates that oscillates when properly 
initialized. The frequency of the oscillation is determined by local characteristics, and 
the phase at any node in the network is locked to the phase of the adjacent nodes. A 
clock network can be used to provide synchronous communication for an arbitrarily large. 






Figure 8.1 A FIFO consisting of 4 units. 
A clock network is a generalized self-timed FIFO circuit. As shown in Figure 8.1, a 
FIFO is made of a number of FIFO units connected into a chain; a FIFO unit contains a 
controller and a register. The registers in a FIFO are connected in a chain via their data 
inputs and outputs; the controllers are connected via their request and acknowledge signals. 
Each controller provides a clock signal to enable and disable the latches in its register. The 
acknowledge and request signals allow each controller to determine when the FIFO unit 
immediately preceding it has data for it, and when the FIFO unit immediately following it 
has taken the data from it. 
Each FIFO unit leads but is never more than a half cycle ahead of the following unit, 
and lags but is never more than a half cycle behind the preceding unit. Thus, if registers 
were computers and register-to-register links were communication channels, the data one 
computer latches in at its kth clock tick is the data put out by the preceding computer at 
that computer's kth clock tick. With a little extra delay, synchronous communication can 
also take place in the reverse direction. 
Figure 8.2 A C-element FIFO consisting of 4 units. 
A simple FIFO control can be constructed using a C element and an inverter. A C 
element is a state-storage device such that when all of its inputs are high, the output 
becomes high; when all of its inputs are low, the output becomes low; and the output 
remains unchanged otherwise. In the FIFO shown in Figure 8.2, the output of a C element 
is connected to an input of the C element in the following unit. The inverted output of a 
C element is connected to an input of the C element in the preceding unit. The output of 
the C element is also used as the clock to the register. 
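The C-element rule stated above translates directly into a small behavioral function. This is a zero-delay sketch for illustration only; the function name and the array-based interface are assumptions, not the gate-level circuit used in the simulations.

```c
#include <assert.h>

/* Behavioral model of an n-input C element.
 * Returns the new output given the inputs and the previous output:
 * all inputs high -> 1, all inputs low -> 0, otherwise hold 'prev'. */
int c_element(const int *in, int n, int prev)
{
    int i, all_high = 1, all_low = 1;
    assert(n > 0);
    for (i = 0; i < n; i++) {
        if (in[i])
            all_low = 0;
        else
            all_high = 0;
    }
    if (all_high) return 1;
    if (all_low)  return 0;
    return prev;               /* inputs disagree: the element keeps its state */
}
```

The `prev` argument is what makes the element a state-storage device: with mixed inputs the result depends only on the stored value.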
Figure 8.3 A 3 x 4 array of self-oscillating FIFO units. 
The FIFO structure can be extended to a higher dimension by cross-connecting a set 
of FIFO controls with another set of FIFO controls. Figure 8.3 contains a two-dimensional 
array of 12 FIFO units with the registers omitted. The edges are terminated in such a way 
that the array will oscillate. This is essentially the same network that is used in the 
clock-network simulation, except that each 4-input C element is replaced by a 12-gate circuit. 
The circuit in Figure 8.3 has 1.51 gates. 
8.1.2 Sweep-mode results 















[Plot omitted; x-axis: log2(nodes), 0 to 11.] 
Figure 8.4 Sweep-mode CMB-variant simulation of an 1841-gate clock network. 
Figure 8.4 contains the sweep-mode results of an 8x16 clock-network containing 1841 logic 
gates. The speedup is linear until there are fewer than 4 elements in each node. The null 
message overhead is a little larger than 8 at N = 1, and the crossover occurs between N = 8 
and N = 16. Unlike the multiplier example we used in previous chapters, the clock network 
shows a much greater difference between the most-eager variant and the lazier variants. This 
is typical of circuits with many tight loops, where unnecessary null messages can persist as 
they travel around the loops. The lazier variants annihilate such null messages to achieve 
an improved performance over the most-eager variant. 
Also, unlike the multiplier example, load balancing is simple because a clock network 
shows a steady and uniform activity level at every part of the circuit. Although the 
CMB-variant simulators are relatively insensitive to the effect of load balance and activity level, 
the hybrid simulators are more favorably influenced, as we can see in Figure 8.4. 
8.1.3 Real-mode results 
The performance at N = 1 and the linear speedup for most of the lazier CMB variants 
fit the sweep-mode prediction well. The real-mode curves differ from the prediction in 
that the eager CMB-variant curve is not uniformly worse over all N, and the curve for 
the adaptive demand-driven variant worsens more rapidly than predicted. These two CMB 
variants are not robust in circuits that contain many closed loops where null messages can 
circulate, because the persistence of the null messages depends on run-time conditions such 
as congestion and order of message arrival. As a consequence, the result of the simulation 
can vary significantly from run to run, but when N is small, the behavior is more restricted, 
and the prediction of the sweep-mode simulation prevails. 
The hybrid-1 and hybrid-2 curves are similar to those of the multiplier circuit, except 
these curves show a greater speedup due to better load balance for the clock network. Thus, 
these curves are more similar to those of the multiplier with an enhanced activity level - 
there is no significant initial penalty at N = 2. The activity level for this multiplier is more 
uniform because a new wave of activities is injected into the multiplier before old ones have 
completed. The hybrid-1 curves flatten and bend upward between N = 16 and 32, while 
the hybrid-2 curves continue straight down as they close in toward the CMB-variant curves. 
The next set of graphs shows the effect of randomized element distribution. The CMB-
variant curves have shifted very little, but the hybrid-1 curves become much shallower, and 
the hybrid-2 curves show the characteristic upward hump for random element distribution. 
Real-mode results for an 8x16 network 
[Plots omitted; x-axis: log2(nodes).] 
Figure 8.5 An 1841-gate clock network for 50 μs on a Symult 2010. 
Figures 8.4 and 8.5 show the results in regions where there are many more logic 
elements than nodes. The three additional sets of simulation results use progressively smaller 
clock circuits; the last one has, on average, one logic gate per node for N = 64. As the 
number of gates is reduced, speedup achieved by the hybrid simulators is reduced because 
the advantage that can be obtained from running sequential, macro-element simulators 
decreases. The CMB-variant simulators, which reflect the ratio of null messages and event 
messages, show very little change relative to the sequential simulator. 
The lazy CMB-variants are hardy and robust simulators: They show good speedup 
relative to themselves all the way down to 1 element per node in a fashion consistent with 
the sweep-mode prediction. 
Real-mode results with random element distribution 
[Plots omitted; x-axis: log2(nodes).] 





Real-mode results for a 4x8 network 
[Plots omitted; panels include hybrid-2 and all 3; axes: log2(seconds) versus log2(nodes).] 
Figure 8.7 A 473-gate clock network for 200 μs on a Symult 2010. 
[Plots omitted; panels include hybrid-1; x-axis: log2(nodes).] 
Figure 8.8 A 241-gate clock network for 200 μs on a Symult 2010. 
Real-mode results for a 2 x 2 network 
[Plots omitted; x-axis: log2(nodes).] 
Figure 8.9 A 65-gate clock network for 500 μs on a Symult 2010. 
Section 8.2 Tree-Ring Example 
8.2.1 Description 
Unlike the multiplier and the clock network, the tree-ring circuit has no identifiable 
functions; it is one of the circuits we invented to test the simulator. It is made of a cycle of 1-to-8 
pulse distributors whose outputs are then summed together by a ring of 3-input OR-gates. 
Each 1-to-8 pulse distributor is composed of seven 1-to-2 distributors connected in a tree 
structure. A test circuit with 12 distributors appears in Figure 8.10: 
Figure 8.10 A 12-unit tree ring. 
Each 1-to-2 pulse distributor has one input and two outputs. Pulses appearing at the 
distributor's input are alternately passed to one of its outputs. Thus, a 1-to-8 distributor 
spreads the pulses among its eight outputs. A 1-to-2 pulse distributor consists of a toggle 
flip flop, made of 9 logic gates, and a 1-to-2 demultiplexor, made of 4 logic gates: 
Figure 8.11 A 1-to-2 pulse-distributor circuit 
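The behavior of the pulse distributors can be sketched at a level above the gates. The code below is a behavioral model only, under assumptions introduced here: each 1-to-2 distributor is reduced to one toggle bit, output 0 is assumed to fire first, and the seven distributors of a 1-to-8 unit are indexed like a binary heap. None of these names come from the thesis.

```c
#include <assert.h>

/* A 1-to-2 distributor reduced to its toggle flip-flop state. */
struct distributor { int toggle; };

/* Route one pulse: returns the output (0 or 1) that fires, then toggles. */
int distribute(struct distributor *d)
{
    int out = d->toggle;
    d->toggle ^= 1;
    return out;
}

/* A 1-to-8 distributor: seven 1-to-2 distributors in a 3-level tree.
 * node[0] is the root; the children of node[i] are node[2i+1] and node[2i+2]. */
struct tree8 { struct distributor node[7]; };

/* Route one pulse through the tree; returns the leaf output (0..7) that fires. */
int tree8_pulse(struct tree8 *t)
{
    int idx = 0, leaf = 0, level;
    for (level = 0; level < 3; level++) {
        int b = distribute(&t->node[idx]);
        leaf = leaf * 2 + b;       /* accumulate the path taken */
        idx = 2 * idx + 1 + b;     /* descend to the chosen child */
    }
    assert(leaf >= 0 && leaf < 8);
    return leaf;
}
```

In this model eight consecutive pulses reach eight different leaves before the pattern repeats, which matches the statement that a 1-to-8 distributor spreads the pulses among its eight outputs.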
8.2.2 Simulation results 
Sweep-mode simulation has not been done for this circuit. The graphs on the following 
pages are for the simulation of a 12-unit circuit, using both systematic and random element 
distribution; a 9-unit circuit; a 6-unit circuit; and, finally, a 3-unit circuit. Tree-ring circuits 
have a lower activity level than the others examined here because only one of the eight 
leaves in each unit can be active at any time. Accordingly, the CMB-variant curves show 
an overhead of four to five octaves relative to the sequential simulation results. The 
CMB-variant speedup is, otherwise, linear with respect to itself. 
The hybrid-2 curves are not as smooth as those of the other circuits because each 
tree-ring circuit contains two sets of sub-circuits with very different properties (the pulse 
distributor and the ring of OR-gates). Partitioning of the circuit over different-sized 
multicomputers produces very different locality relations, which strongly affect the performance 
of the hybrid simulators. The effect of locality can also be seen in the simulation with 
random element distribution. While the hybrid-2 curves for the clock network merely worsen, 
those for this circuit converge immediately to the CMB-variant curves at N = 2. The 
CMB-variant simulator, however, is not strongly influenced by locality. 
The CMB-variant curves, which are pegged to the ratio of null messages versus 
event-containing messages, show very little change as the size of the circuit is decreased. The 
hybrid simulator curves show a steady flattening in slope, and hybrid-1 curves eventually 
lose all speedup when there are only 287 gates left in the circuit. 
Real-mode results for a 12-unit network 
[Plots omitted; x-axis: log2(nodes).] 
Figure 8.12 A 1142-gate tree network for 50 μs on a Symult 2010. 
Real-mode results with random element distribution 
[Plots omitted; x-axis: log2(nodes).] 
Figure 8.13 A 1142-gate tree network for 50 μs on a Symult 2010. 
Real-mode results for a 9-unit network 
[Plots omitted; x-axis: log2(nodes).] 
Figure 8.14 An 857-gate tree network for 70 μs on a Symult 2010. 
Real-mode results for a 6-unit network 
[Plots omitted; x-axis: log2(nodes).] 
Figure 8.15 A 572-gate tree network for 100 μs on a Symult 2010. 
Real-mode results for a 3-unit network 
[Plots omitted; x-axis: log2(nodes).] 
Figure 8.16 A 287-gate tree network for 200 μs on a Symult 2010. 
Section 8.3 FIFO Loop 
8.3.1 Description 
While the clock network example uses a 2-D array of cross-connected FIFO controllers, the 
FIFO loop example uses a circularly connected linear array of FIFO controllers and FIFO 
registers. (Refer to the figure in the clock network section.) The registers are made of a 
bank of cross-coupled latches with clocked inputs. Each latch is made of 4 logic gates, as 
shown in Figure 8.17. 
Figure 8.17 Circuit for one latch. 
Since the design of the controller constrains the FIFO to contain no more than 1 unit 
of data for every pair of FIFO units, and since we chose to initialize the FIFO loop with 
alternating data units of all ones and all zeros, the number of FIFO units must be a multiple 
of four. 
8.3.2 Simulation results 
Figure 8.18 contains the CMB-variant sweep-mode simulation result using a loop of 28 
FIFO units. The FIFO loop is an example with a lot of usable concurrency. However, 
unlike the clock network, the lazier simulation variants are not any better than the most 
eager simulation variant, evidently due to the majority of the circuit loops being found in the 
cross-coupled latches. Non-essential null messages do not remain long in the cross-coupled 
latch because the load signal and the reset signal must be long enough for the cross-coupled 
latch to settle down to a final value. In doing so, the input to one of the cross-coupled 
latches is held low for a sufficiently long time that all free-running null messages in the 
cross-coupled latch are eliminated due to the non-strict input condition of the NAND-gates. 
Yet, there are still essential null messages in the simulation, and the overhead estimate 
of the sweep-mode simulation is between 2 and 3 octaves. The curves should show a linear 














[Plot omitted; x-axis: log2(nodes), 0 to 11.] 
Figure 8.18 Sweep-mode CMB-variant simulation of a 1067-gate FIFO loop. 
The real-mode CMB-variant curves for the FIFO loop circuit match the sweep-mode 
predictions well. The curves for the hybrid simulators are also as expected. The hybrid-1 
curves flatten out and cross over the CMB-variant curves earlier than they do in the previous 
examples because the gates in this circuit are under non-strict input conditions most of the 
time, and because hybrid-1 simulators are unable to make use of such conditions. 
One unique characteristic of this circuit is that when the circuit size is reduced to 4 
FIFO units, all three sets of results show evidence that the curves are bending upward at 
N = 32. This characteristic is not observed in the sweep-mode result, and is an indication 
that some tight loops are broken up and distributed across node boundaries. At N = 64, 
there are 2 or 3 elements per node. With granularity approaching the number of gates in a 
cross-coupled latch, a misalignment in a systematic distribution will cause the majority of 
the cross-coupled latches to be split across node boundaries. 
Real-mode results for a 28-element loop 
[Plots omitted; panels include CMB-variant and hybrid-1; axes: log2(seconds) versus log2(nodes).] 
Figure 8.19 A 1067-gate FIFO loop for 100 μs on a Symult 2010. 
Real-mode results with random element distribution 
[Plots omitted; panels include CMB-variant and hybrid-1; axes: log2(seconds) versus log2(nodes).] 
Figure 8.20 A 1067-gate FIFO loop for 100 μs on a Symult 2010. 
Real-mode results for a 12-element loop 
[Plots omitted; x-axis: log2(nodes).] 
Figure 8.21 A 459-gate FIFO loop for 100 μs on a Symult 2010. 
Real-mode results for a 4-element loop 
[Plots omitted; panels include all 3; x-axis: log2(nodes).] 
Figure 8.22 A 155-gate FIFO loop for 200 μs on a Symult 2010. 
Chapter 9 Summary 
Section 9.1 Economy and Performance of a Multicomputer 
Multicomputers are appealing because they improve (and, with advances in VLSI 
technology, promise to continue to improve) the two most prominent figures of merit of computing 
systems: performance and economy. Performance is proportional to the processing speed 
of a machine: 
\[ \text{Performance} \propto \text{processing speed} \] 
Economy is inversely proportional to the cost of running a program; it is, therefore, both 
proportional to the processing speed and inversely proportional to the cost of the machine: 
\[ \text{Economy} \propto \frac{\text{processing speed}}{\text{machine cost}} \] 
In most cases, performance and economy are at odds with each other because higher speed 
is achieved by using faster circuits; however, the increase in the machine cost is greater than 
the increase in the processing speed. In a multicomputer, speed is increased not by having 
faster circuits, but by having many cooperating computers. Hence, it is possible to improve 





















Figure 9.1 Two idealized multicomputer evolution paths. 
Whether one agrees that economy can be improved, however, depends on how one sees 
the basic premise of multicomputing. Shown in Figure 9.1 are two idealized evolutionary 
paths leading from the same single-node computer. We will, in our idealized model, consider 
computers to be made entirely of memory, because a fairly fast processor can be built in 
the area required for a few thousand bytes of fast memory. When we compare two single-
processor computers, we compare two collections of memory attached to two identical, 
zero-sized processors . Thus, any two single-processor computers in our comparison have 
the same speed regardless of their size differences. We will also assume that programs do 
not take up more memory as they become more distributed. 
Along path A, we build an N-node multicomputer by putting together N copies of the 
single-node computer. Performance has improved by a factor of N because there are now N 
single-node computers, and each is as fast as the original; economy has not changed because 
the total machine cost has increased by the same factor. 
Along path B, the circuitry of a single-node computer is regrouped into N smaller 
nodes. Performance has improved by a factor of N because each of the N smaller nodes is 
as fast as the original; economy has also improved by a factor of N because performance 
has improved while the cost of the machine has remained constant. 
These paths A and B also have a strong influence on multicomputer programming. The 
cost C of running a program, in this idealized model, is: 
C = SNT 
S = Price per node per unit time (proportional to the size of the node). 
N = Number of nodes in the machine. 
T = Time it takes for the program to complete. 
When drawn as a 3-D log-log-log plot, which we call the cost space, the surfaces of constant 
cost are given by: 
log(S) + log(N) + log(T) = log(C) 
Constant-cost surfaces, called the C planes, appear as planes perpendicular to the 
(1,1,1) direction vector. Suppose we have an application whose single-node cost is marked 




Figure 9.2 Multicomputer cost space. 
we have found a point with higher performance; if we can find a point that is on a plane 




[Plot omitted; curve label: a cost-ineffective curve; legend: attainable region, lower-cost region; horizontal axis: log(N).] 
Figure 9.3 Intersection with A plane 
Surfaces corresponding to path A correspond to constant node cost; thus they appear 
as planes perpendicular to the S-axis. We call such a plane an A plane. Figure 9.3 shows 
the A plane containing P. The intersections of an A plane with C planes form lines of 
slope -1 on the A plane. Since super-linear speedup is impossible by our definition, the 
grey area shown in Figure 9.3 (right) is the possible range of N and T. The cheese area 
is the range intersected by those C planes that are closer to the origin than the C plane 
containing P. The non-cheese area (which is the same as the grey area in this case) is the 
range intersected by those C planes that are further away from the origin. The only way to 
have the application be cost-effective is for it to exhibit a linear speedup starting at N = 1. 
Any deviation from linear speedup means that the performance curve of the application 
has crossed into a C plane that is further away from the origin, and that the program will 
be more costly to run. In practice, there are many contributing factors to the actual cost 
of running a program that may more than make up for the inefficiency, but, in the long 
run, what we can afford to buy and what we are able to build will ultimately determine the 






[Plot omitted; curve labels: a cost-effective curve (two curves); legend: attainable region, lower-cost region.] 
Figure 9.4 Intersection with B plane. 
Surfaces corresponding to path B appear as planes perpendicular to the (1,1,0) direction 
vector. We call such a plane a B plane. All points on a B plane have the same SN product, 
and correspond to multicomputers with the same total cost. The plane that contains P is 
shown in Figure 9.4. The intersections of a B plane with C planes form horizontal lines 
on the B plane. An application becomes cheaper to run if it shows any speedup relative 
to the 1-node case. Performance is improved because the time required to perform the 
computation is reduced. Cost is reduced because the computation is now on a C plane that 
is closer to the origin. The area that is both grey and cheese is that range that is attainable 
by the application, and where both performance and economy are improved. 
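The contrast between the two paths can be made concrete with the cost formula C = SNT. The sketch below is an illustration under the idealized model only; the function names, and the choice to express the run time as the single-node time divided by an achieved speedup, are assumptions introduced here.

```c
#include <assert.h>

/* Cost of running a program: price per node per unit time (s),
 * number of nodes (n), and completion time (t). */
double run_cost(double s, double n, double t)
{
    return s * n * t;
}

/* Path A: N copies of the original node, so the node price s1 is unchanged.
 * t1 is the single-node run time; speedup must lie in (0, n]. */
double cost_path_a(double s1, double t1, double n, double speedup)
{
    assert(speedup > 0.0 && speedup <= n);
    return run_cost(s1, n, t1 / speedup);
}

/* Path B: the original node is regrouped into N smaller nodes,
 * so each node costs s1 / n and the product SN stays constant. */
double cost_path_b(double s1, double t1, double n, double speedup)
{
    assert(speedup > 0.0 && speedup <= n);
    return run_cost(s1 / n, n, t1 / speedup);
}
```

For example, with s1 = 1, t1 = 64, and N = 8, a speedup of 4 gives a cost of 128 along path A (more than the single-node cost of 64) but 16 along path B; along path A the cost returns to 64 only when the speedup is the full factor of 8.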
In practice, neither of the two paths can continue indefinitely. In path A, we are limited 
by the maximum physical size of a machine we are able to build, and by the amount of 
concurrency we can find in computations. In path B, we are limited by the minimum 
amount of hardware required to construct a node - computers are not made entirely of 







Figure 9.5 Two idealized multicomputer evolution paths in the path space (axis label: node count). 
To continue, path A must use smaller and smaller nodes and path B must use more and 
more hardware. The two paths (Figure 9.5) will eventually meet at the ultimate machine 
where all nodes are of a sensibly minimal size and the machine contains as many nodes as 
we can assemble in one machine. 
Section 9.2 Overhead and Latency 
Along path B, we encounter a series of multicomputers with progressively smaller nodes. 
Those with single-board nodes are called the medium-grain multicomputers; examples of 
medium-grain multicomputers are the Cosmic Cube, the iPSC/1, the iPSC/2, and the 
Symult 2010. Those with single-chip nodes are called the fine-grain multicomputers; an 
example of a fine-grain multicomputer is the Mosaic. Due to the reduced node cost when 
nodes become smaller and more abundant, the programming emphasis for a multicomputer 
shifts from one of achieving a linear speedup to one of exploiting the maximum concurrency. 
Since medium-grain nodes are few and expensive, the primary goal of programming 
such multicomputers is to profitably utilize all available CPU cycles. Cycles can be lost 
to sources in the application itself: load imbalance, extra synchronization, and insufficient 
concurrency; these internal delays are called overheads. Cycles can also be lost to sources 
in the system: message handling, kernel operation, and network congestion; these external 
delays are called latencies. In a medium-grain multicomputer, overheads and latencies 
are countered by employing at least several times more concurrency in the program than 
there are nodes in the multicomputer. The weak law of large numbers, together with the 
clustering of related elements, covers most of the problems. Nodes are seldom idle because 
the chance that all of their elements are blocked is low. The cost of message transactions 
is low because clustering causes most of the interactions to take place between elements of 
the same node. 
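The claim that nodes are seldom idle can be illustrated numerically. The following sketch assumes, purely for illustration, that each element on a node is blocked independently with the same probability; real elements are correlated, so this is an idealization rather than a model taken from the simulators described here. The function names are hypothetical.

```python
def idle_probability(p_blocked, elements_per_node):
    """Probability that every element on a node is blocked at once,
    assuming independent blocking with probability p_blocked each."""
    return p_blocked ** elements_per_node


def expected_utilization(p_blocked, elements_per_node):
    """Fraction of time the node has at least one runnable element."""
    return 1.0 - idle_probability(p_blocked, elements_per_node)


if __name__ == "__main__":
    for k in (1, 2, 4, 8, 16):
        print(k, expected_utilization(0.5, k))
```

Even when each element is blocked half of the time, a node holding eight such elements would be idle less than one percent of the time under this assumption, which is why several times more concurrency than nodes is enough to hide most delays.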
To exploit more concurrency, we must use more nodes in the multicomputer and fewer 
program elements in each node. Although we can no longer overwhelm overheads and 
latencies by an abundance of concurrency, we no longer have to be obsessed with linear 
speedup, because nodes become cheaper as they decrease in size. Instead, programming for 
fine-grain multicomputers emphasizes the exploitation of all available concurrency in the 
program. Factors that prevent the exploitation of available concurrency are distinguished 
from factors that merely require the use of more nodes. 
Latencies are factors that can prevent the full exploitation of concurrency. For example, 
when a message is delayed en route to a waiting element, the element is blocked and the 
program may not progress as fast as it could. Overheads, on the other hand, do not prevent 
the full exploitation of concurrency. When an element is blocked waiting for a message 
that has not been produced, it is blocked only because the program has less concurrency 
than there are nodes. Synchronization operations, such as the use of null events in the 
conservative discrete-event simulators, are also overheads: they keep more of the nodes 
busy without interfering with the exploitation of concurrency in the system being simulated. 
An element with unconsumed normal events may still be blocked awaiting a null event. If 
the required null event has been produced and sent, we would attribute the blockage to 
message latency; if the null event has not been produced, then we would attribute the 
blockage to lack of concurrency. 
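The attribution rule just described can be written down as a small decision function. This is a minimal sketch of the rule as stated in the text, not code from any of the simulators; the names `Blockage` and `classify_blockage` are hypothetical.

```python
from enum import Enum


class Blockage(Enum):
    NONE = "not blocked"
    LATENCY = "message latency"
    LACK_OF_CONCURRENCY = "lack of concurrency"


def classify_blockage(waiting_for_event, event_produced):
    """Attribute a blockage according to whether the awaited event
    (normal or null) has already been produced by its sender."""
    if not waiting_for_event:
        return Blockage.NONE
    if event_produced:
        # The event exists but has not arrived: the system is at fault.
        return Blockage.LATENCY
    # The event does not exist yet: the program has nothing more to offer.
    return Blockage.LACK_OF_CONCURRENCY


if __name__ == "__main__":
    print(classify_blockage(True, True).value)
    print(classify_blockage(True, False).value)
```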
Section 9.3 Fine-Grain Multicomputer Programming 
To fully exploit the concurrency of a program, we must remove all latencies and overheads. 
Overheads can be mitigated by putting one program element in each node, but latencies 
can only be reduced by careful hardware and software design. 
On the hardware side, message latency can be reduced with high-speed routers. These 
routers move messages in the network via a modified form of circuit switching called wormhole 
or cut-through routing, which moves a message one step through the network in a time 
comparable to one memory cycle. Since a router is able to store and fetch messages at a 
rate close to the bandwidth of the memory, sending a message from one node to any other 
node is comparable to copying the same message from one buffer to another buffer. 
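A rough latency model shows why such routing makes a remote send comparable to a local copy. The sketch below is an idealization under stated assumptions: no network contention, one word moved per memory cycle, and one cycle per hop for the head of the message. The store-and-forward function is included only as a point of comparison introduced here; all names are hypothetical.

```python
def buffer_copy_time(message_words, cycle_time=1.0):
    """Time to copy a message between two buffers in one memory."""
    return message_words * cycle_time


def store_and_forward_latency(hops, message_words, cycle_time=1.0):
    """Each intermediate node receives the whole message before forwarding it."""
    return hops * message_words * cycle_time


def wormhole_latency(hops, message_words, cycle_time=1.0):
    """The head advances one hop per cycle and the body pipelines behind it."""
    return (hops + message_words) * cycle_time


if __name__ == "__main__":
    hops, words = 10, 100
    print(buffer_copy_time(words))
    print(wormhole_latency(hops, words))
    print(store_and_forward_latency(hops, words))
```

In this model the distance travelled adds only one cycle per hop to the cost of a buffer copy, so for messages much longer than the path the two are nearly equal.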
On the software side, we must, without giving up generality, provide the thinnest cushion 
possible between the processes and the hardware. The Reactive Kernel and a fine-grain, 
light-weight programming environment, such as Reactive-C or Cantor, make an ideal combination 
because the program is never further than one function call away from the system. 
The execution units for these programming environments, especially the more restricted 
ones like Cantor, are small enough that nearly all of the concurrency in the program can 
be exploited. 
We have aimed in the direction of fine-grain multicomputers in all of our research, and 
our work on the discrete-event simulation is no exception. The CMB-variant simulator is 
ideally suited for fine-grain machines because it is written in a fine-grain notation, and is 
able to fully exploit the concurrency of the system it simulates. The simulator takes on a 
large overhead at N = 1, but this overhead does not prevent the simulation from attaining 
a large speedup at a large N. In many of the logic circuits we tested, near-linear speedup 
continues until there are only two or three elements in each node. 
Since the CMB-variant simulator does not use any special techniques to reduce the overhead 
on a medium-grain multicomputer, the qualities that contribute to the performance 
characteristics of the simulator persist as the simulation becomes more distributed. The 
hybrid simulators were created to demonstrate the effect of those techniques. The overhead 
is reduced when N is small, but the effect of these techniques vanishes and the performance 
converges to that of the CMB-variant simulator when N is large. 
Section 9.4 The Next Frontier 
We have fully dispersed all available concurrency in a discrete-event simulation program 
when we put one element on each node. If there were more nodes in a multicomputer than 
elements in the simulation, we would not be able to utilize those leftover nodes. However, 
we can still change the program to one that contains more concurrency. In a medium-grain 
multicomputer, where it is necessary to use concurrency to overwhelm latencies and 
overheads, rollback simulators such as Time Warp seek to produce additional concurrency 
by computing on speculation. 
The memory in each node of a fine-grain multicomputer is insufficient for storing the 
previous states of its element in a rollback simulator. However, when there are more nodes 
than elements, previous states can be stored on unused nodes. When an element has reached 
a synchronization point, where its future is to be decided by a message that has yet to arrive, 
the element picks a possible outcome and ships a copy of its old self to an unused node for 
storage. Alternatively, the element can make a copy of its new self, which it spawns and 
runs on an unused node. But rather than becoming dormant, the old self can continue 
to run and produce more copies until all possible outcomes have been exhausted. This is 
the concurrent branch-and-bound simulator; it is the next frontier to be explored when a 
fine-grain multicomputer becomes available. 
Chapter 10 Bibliography 
[1] G.A. Agha, Actors: A Model of Concurrent Computation in Distributed Systems, 
MIT Press, 1986. 
[2] W.C. Athas, and C.L. Seitz, Multicomputers: Message-Passing Concurrent 
Computers, IEEE Computer, August 1988. 
[3] C.L. Seitz, J. Seizovic, and W-K. Su, The C Programmer's Abbreviated Guide to 
Multicomputer Programming, Caltech-CS-TR-88-1, 1988. 
[4] W-K. Su, R. Faucette, and C.L. Seitz, C Programmer's Guide to the Cosmic Cube, 
Caltech CS 51.50:DF:84, 1984. 
[5] J. Seizovic, The Reactive Kernel, Caltech-CS-TR-88-10, 1988. 
[6] G.M. Birtwistle, O-J. Dahl, B. Myhrhaug, and K. Nygaard, Simula Begin, 
Petrocelli, New York, 1973. 
[7] Dan Ingalls, The Smalltalk 76 Programming System: Design and Implementation, 
Proceedings of the Fifth ACM Conference on Principles of Programming Systems, 
January 1978. 
[8] C.A.R. Hoare, Communicating Sequential Processes, CACM 21(8):666-677, August 
1978. 
[9] C.R. Lang, The Extension of Object-Oriented Language to a Homogeneous, 
Concurrent Architecture, Caltech-CS-TR-5014, May 1982. 
[10] InMos, Ltd., The Occam Programming Manual, Prentice-Hall, 1985. 
[11] William J. Dally, VLSI Architecture for Concurrent Data Structure, Caltech CS 
5209:TR:86, 1986. 
[12] R.E. Bryant, Simulation of Packet Communication Architecture Computer Systems, 
MIT/LCS/TR-188, November 1977. 
[13] K.M. Chandy, and J. Misra, Distributed Simulation: A Case Study in Design and 
Verification of Distributed Programs, IEEE Software Engineering, September 1979. 
[14] D.R. Jefferson, Virtual Time, ACM Transactions on Programming Languages and 
Systems, 7(3):404-425, July 1985. 
[15] W.C. Athas, Fine-Grain Concurrent Computations, Caltech CS 5242:TR:87, 1987. 
[16] Donald E. Knuth, The Art of Computer Programming, V3, Sorting and Searching, 
Addison-Wesley, 1973. 
[17] M.R. Garey, and D.S. Johnson, Computers and Intractability, A Guide to the 
Theory of NP-Completeness, W.H. Freeman and Company, 1979. 
[18] A.J. Martin, A Message-Passing Model for Highly Concurrent Computation, 
Caltech-CS-TR-88-13, 1988. 
[19] M. Schuster, R.E. Bryant, and D. Whiting, MOSSIM II: A Switch-Level Simulator 
for MOS VLSI, User's Manual, Caltech CS 5033:TR:82, 1982. 
