A VLSI Architecture for Concurrent Data Structures by Dally, William James
A VLSI Architecture for 
Concurrent Data Structures 
Thesis by 
William J. Dally 
In Partial Fulfillment of the Requirements 
for the Degree of 
Doctor of Philosophy 
California Institute of Technology 
Pasadena, California 
1986 
(Submitted March 3, 1986) 
Copyright © William J. Dally, 1986 
All Rights Reserved 
Acknowledgments
While a graduate student at Caltech I have been fortunate to have the opportunity to 
work with three exceptional people: Chuck Seitz, Jim Kajiya, and Randy Bryant. My 
ideas about the architecture of VLSI systems have been guided by my thesis advisor, 
Chuck Seitz, who also deserves thanks for teaching me to be less an engineer and more 
a scientist. Many of my ideas on object-oriented programming come from my work with
Jim Kajiya, and my work with Randy Bryant was a starting point for my research on
algorithms.
I thank all the members of my reading committee: Randy Bryant, Dick Feynman, Jim 
Kajiya, Alain Martin, Bob McEliece, Jerry Pine, and Chuck Seitz for their helpful com-
ments and constructive criticism. 
My fellow students, Bill Athas, Ricky Mosteller, Mike Newton, Fritz Nordby, Don Speck,
Craig Steele, Brian Von Herzen, and Dan Whelan have provided constructive criticism,
comments, and assistance. 
This manuscript was prepared using TeX [73] and the LaTeX macro package [78]. I
thank Calvin Jackson, our local TeXpert, for his help with typesetting problems. Most
of the figures in this thesis were prepared using software developed by Wen-King Su. Bill
Athas, Sharon Dally, John Tanner, and Doug Whiting deserve thanks for their careful
proofreading of this document.
Financial support for this research was provided by the Defense Advanced Research 
Projects Agency. I am grateful to AT&T Bell Laboratories for the support of an AT&T 
Ph.D. fellowship. 
Most of all, I thank Sharon Dally for her support and encouragement of my graduate 
work, without which this thesis would not have been written. 
Abstract
Concurrent data structures simplify the development of concurrent programs by encap-
sulating commonly used mechanisms for synchronization and communication into data 
structures. This thesis develops a notation for describing concurrent data structures, 
presents examples of concurrent data structures, and describes an architecture to sup-
port concurrent data structures. 
Concurrent Smalltalk (CST), a derivative of Smalltalk-80 with extensions for concurrency,
is developed to describe concurrent data structures. CST allows the programmer
to specify objects that are distributed over the nodes of a concurrent computer. These 
distributed objects have many constituent objects and thus can process many messages 
simultaneously. They are the foundation upon which concurrent data structures are 
built. 
The balanced cube is a concurrent data structure for ordered sets. The set is distributed
by a balanced recursive partition that maps to the subcubes of a binary n-cube using
a Gray code. A search algorithm, VW search, based on the distance properties of the
Gray code, searches a balanced cube in O(log N) time. Because it does not have the root
bottleneck that limits all tree-based data structures to O(1) concurrency, the balanced
cube achieves O(N/log N) concurrency.
Considering graphs as concurrent data structures, graph algorithms are presented for the 
shortest path problem, the max-flow problem, and graph partitioning. These algorithms 
introduce new synchronization techniques to achieve better performance than existing 
algorithms. 
A message-passing, concurrent architecture is developed that exploits the characteristics 
of VLSI technology to support concurrent data structures. Interconnection topologies 
are compared on the basis of dimension. It is shown that minimum latency is achieved 
with a very low dimensional network. A deadlock-free routing strategy is developed for 
this class of networks, and a prototype VLSI chip implementing this strategy is described. 
A message-driven processor complements the network by responding to messages with 
a very low latency. The processor directly executes messages, eliminating a level of 
interpretation. To take advantage of the performance offered by specialization while at 
the same time retaining flexibility, processing elements can be specialized to operate on a 
single class of objects. These object experts accelerate the performance of all applications 
using this class. 
Contents
Acknowledgments
Abstract
List of Figures
1 Introduction
1.1 Original Results
1.2 Motivation
1.3 Background
1.4 Concurrent Computers
1.4.1 Sequential Computers
1.4.2 Shared-Memory Concurrent Computers
1.4.3 Message-Passing Concurrent Computers
1.5 Summary
2 Concurrent Smalltalk
2.1 Object-Oriented Programming
2.2 Distributed Objects
2.3 Concurrency
2.4 Locks
2.5 Blocks
2.6 Performance Metrics
2.7 Summary
3 The Balanced Cube
3.1 Data Structure
3.1.1 The Ordered Set
3.1.2 The Binary n-Cube
3.1.3 The Gray Code
3.1.4 The Balanced Cube
3.2 Search
3.2.1 Distance Properties of the Gray Code
3.2.2 VW Search
3.3 Insert
3.4 Delete
3.5 Balance
3.6 Extension to B-Cubes
3.7 Experimental Results
3.8 Applications
3.9 Summary
4 Graph Algorithms
4.1 Nomenclature
4.2 Shortest Path Problems
4.2.1 Single Point Shortest Path
4.2.2 Multiple Point Shortest Path
4.2.3 All Points Shortest Path
4.3 The Max-Flow Problem
4.3.1 Constructing a Layered Graph
4.3.2 The CAD Algorithm
4.3.3 The CVF Algorithm
4.3.4 Distributed Vertices
4.3.5 Experimental Results
4.4 Graph Partitioning
4.4.1 Why Concurrency is Hard
4.4.2 Gain
4.4.3 Coordinating Simultaneous Moves
4.4.4 Balance
4.4.5 Allowing Negative Moves
4.4.6 Performance
4.4.7 Experimental Results
4.5 Summary
5 Architecture
5.1 Characteristics of Concurrent Algorithms
5.2 Technology
5.2.1 Wiring Density
5.2.2 Switching Dynamics
5.2.3 Energetics
5.3 Concurrent Computer Interconnection Networks
5.3.1 Network Topology
5.3.2 Deadlock-Free Routing
5.3.3 The Torus Routing Chip
5.4 A Message-Driven Processor
5.4.1 Message Reception
5.4.2 Method Lookup
5.4.3 Execution
5.5 Object Experts
5.6 Summary
6 Conclusion
A Summary of Concurrent Smalltalk
B Unordered Sets
B.1 Dictionaries
B.2 Union-Find Sets
C On-Chip Wire Delay
Glossary
Bibliography
List of Figures 
1.1 Motivation for Concurrent Data Structures
1.2 Information Flow in a Sequential Computer
1.3 Information Flow in a Shared-Memory Concurrent Computer
1.4 Information Flow in a Message-Passing Concurrent Computer
2.1 Distributed Object Class Tally Collection
2.2 A Concurrent Tally Method
2.3 Description of Class Interval
2.4 Synchronization of Methods
3.1 Binary 3-Cube
3.2 Gray Code Mapping on a Binary 3-Cube
3.3 Header for Class Balanced Cube
3.4 Calculating Distance by Reflection
3.5 Neighbor Distance in a Gray 4-Cube
3.6 Search Space Reduction by vSearch Method
3.7 Methods for at: and vSearch
3.8 Search Space Reduction by wSearch Method
3.9 Method for wSearch
3.10 Example of VW Search
3.11 VW Search Example 2
3.12 Method for localAt:put:
3.13 Method for split:key:data:flag:
3.14 Insert Example
3.15 Merge Dimension Cases
3.16 Method for mergeReq:flag:dim:
3.17 Methods for mergeUp and mergeDown:data:flag:
3.18 Methods for move: and copy:data:flag:
3.19 Merge Example: A dim = B dim
3.20 Merge Example: A dim < B dim
3.21 Balancing Tree, n = 4
3.22 Method for size:of:
3.23 Method for free:
3.24 Balance Example
3.25 Throughput vs. Cube Size for Direct Mapped Cube. Diamonds represent experimental data.
3.26 Barrier Function (n=10)
3.27 Throughput vs. Cube Size for Balanced Cube. Diamonds represent experimental data.
3.28 Mail System
4.1 Headers for Graph Classes
4.2 Example Single Point Shortest Path Problem
4.3 Dijkstra's Algorithm
4.4 Example Trace of Dijkstra's Algorithm
4.5 Simplified Version of Chandy and Misra's Concurrent SPSP Algorithm
4.6 Example Trace of Chandy and Misra's Algorithm
4.7 Pathological Graph for Chandy and Misra's Algorithm
4.8 Synchronized Concurrent SPSP Algorithm
4.9 Petri Net of SPSP Synchronization
4.10 Example Trace of Simple Synchronous SPSP Algorithm
4.11 Speedup of Shortest Path Algorithms vs. Problem Size
4.12 Speedup of Shortest Path Algorithms vs. Number of Processors
4.13 Speedup of Shortest Path Algorithms for Pathological Graph
4.14 Speedup for 8 Simultaneous Problems on R2.10
4.15 Speedup vs. Number of Problems for R2.10, n=10
4.16 Floyd's Algorithm
4.17 Example of Suboptimal Layered Flow
4.18 CAD and CVF Macro Algorithm
4.19 CAD and CVF Layering Algorithm
4.20 Propagate Methods
4.21 Reserve Methods
4.22 Confirm Methods
4.23 request Methods for CVF Algorithm
4.24 sendMessages Method for CVF Algorithm
4.25 reject and ackFlow Methods for CVF Algorithm
4.26 Petri Net of CVF Synchronization
4.27 Pathological Graph for CVF Algorithm
4.28 A Bipartite Flow Graph
4.29 Distributed Source and Sink Vertices
4.30 Number of Operations vs. Graph Size for Max-Flow Algorithms
4.31 Speedup of CAD and CVF Algorithms vs. No. of Processors
4.32 Speedup of CAD and CVF Algorithms vs. Graph Size
4.33 Thrashing
4.34 Simultaneous Move That Increases Cut
4.35 Speedup of Concurrent Graph Partitioning Algorithm vs. Graph Size
5.1 Distribution of Message and Method Lengths
5.2 Packaging Levels
5.3 A Concurrent Computer
5.4 A Binary 6-Cube Embedded in the Plane
5.5 A Ternary 4-Cube Embedded in the Plane
5.6 An 8-ary 2-Cube (Torus)
5.7 Wire Density vs. Position for One Row of a Binary 2-Cube
5.8 Pin Density vs. Dimension for 256, 4K, and 1M Nodes
5.9 Latency vs. Dimension for 256, 4K, and 1M Nodes, Constant Delay
5.10 Latency vs. Dimension for 256, 4K, and 1M Nodes, Logarithmic Delay
5.11 Latency vs. Dimension for 256, 4K, and 1M Nodes, Linear Delay
5.12 Deadlock in a 4-Cycle
5.13 Breaking Deadlock with Virtual Channels
5.14 3-ary 2-Cube
5.15 Photograph of the Torus Routing Chip
5.16 A Packaged Torus Routing Chip
5.17 A Dimension 4 Node
5.18 A Torus System
5.19 A Folded Torus System
5.20 Packet Format
5.21 Virtual Channel Protocol
5.22 Channel Protocol Example
5.23 TRC Block Diagram
5.24 Input Controller Block Diagram
5.25 Crosspoint of the Crossbar Switch
5.26 Output Multiplexer Control
5.27 TRC Performance Measurements
5.28 Message Format
5.29 Message Reception
5.30 Method Lookup
5.31 Instruction Translation Lookaside Buffer
5.32 A Context
5.33 Instruction Formats
5.34 A Coding Example: Locks
A.1 Class Declaration
A.2 Methods
B.1 A Concurrent Hash Table
B.2 Concurrent Hashing
B.3 A Concurrent Union-Find Structure
B.4 Concurrent Union-Find
C.1 Model of Inverter Driving Wire
Chapter 1 
Introduction 
Computing systems have two major problems: they are too slow, and they are too hard 
to program. 
Very large scale integration (VLSI) [86] technology holds the promise of improving com-
puter performance. VLSI has been used to make computers less expensive by shrinking 
a rack of equipment several meters on a side down to a single chip a few millimeters 
on a side. VLSI technology has also been applied to increase the memory capacity of 
computers. This is possible because memory is incrementally extensible; one simply 
plugs in more chips to get a larger memory. Unfortunately, it is not clear how to apply 
VLSI to make computer systems faster. To apply the high density of VLSI to improving 
the speed of computer systems, a technique is required to make processors incrementally 
extensible so one can increase the processing power of a system by simply plugging in 
more chips. 
Ensemble machines [110], collections of processing nodes connected by a communications 
network, offer a solution to the problem of building extensible computers. These concur-
rent computers are extended by adding processing nodes and communication channels. 
While it is easy to extend the hardware of an ensemble machine, it is more difficult to 
extend its performance in solving a particular problem. The communication and syn-
chronization problems involved in coordinating the activity of the many processing nodes 
make programming an ensemble machine difficult. If the processing nodes are too tightly 
synchronized, most of the nodes will remain idle; if they are too loosely synchronized, 
too much redundant work is performed. Because of the difficulty of programming an 
ensemble machine, most successful applications of these machines have been to problems 
where the structure of the data is quite regular, resulting in a regular communication 
pattern. 
Object-oriented programming languages make programming easier by providing data 
abstraction, inheritance, and late binding [121]. Data abstraction separates an object's 
protocol, the things it knows how to do, from an object's implementation, how it does 
them. This separation encourages programmers to write modular code. Each module 
describes a particular type or class of object. Inheritance allows a programmer to de-
fine a subclass of an existing class by specifying only the differences between the two 
classes. The subclass inherits the remaining protocol and behavior from its superclass, 
the existing class. Late, run-time, binding of meaning to objects makes for more flexible 
code by allowing the same code to be applied to many different classes of objects. Late 
binding and inheritance make for very general code. If the problems of programming 
an ensemble machine could be solved inside a class definition, then applications could 
share this class definition rather than repeatedly solving the same problems, once
for each application.
This thesis addresses the problem of building and programming extensible computer 
systems by observing that most computer applications are built around data structures.
These applications can be made concurrent by using concurrent data structures, data
structures capable of performing many operations simultaneously. The details of communication
and synchronization are encapsulated inside the class definition for a concurrent
data structure. The use of concurrent data structures relieves the programmer
of many of the burdens associated with developing a concurrent application. In many
cases communication and synchronization are handled entirely by the concurrent data 
structure and no extra effort is required to make the application concurrent. This thesis 
develops a computer architecture for concurrent data structures. 
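The encapsulation argument above can be illustrated with a minimal Python sketch. It is not Concurrent Smalltalk and is not taken from this thesis: the class name `ConcurrentSet`, the shard count, and the use of threads with per-shard locks are all assumptions made for illustration, standing in for constituent objects placed on the nodes of a message-passing machine. The only point is that the application code in `fill` contains no synchronization of its own.

```python
import threading


class ConcurrentSet:
    """Hypothetical set split into independently locked shards.

    Each shard plays the role of one constituent object, so operations on
    different shards can proceed at the same time.
    """

    def __init__(self, shards=8):
        self._shards = [set() for _ in range(shards)]
        self._locks = [threading.Lock() for _ in range(shards)]

    def _index(self, key):
        # Choose the shard responsible for this key.
        return hash(key) % len(self._shards)

    def add(self, key):
        i = self._index(key)
        with self._locks[i]:          # synchronization is hidden in the class
            self._shards[i].add(key)

    def __contains__(self, key):
        i = self._index(key)
        with self._locks[i]:
            return key in self._shards[i]

    def __len__(self):
        total = 0
        for lock, shard in zip(self._locks, self._shards):
            with lock:
                total += len(shard)
        return total


def fill(shared, start, stop):
    # Application code: no locks, no explicit communication.
    for key in range(start, stop):
        shared.add(key)


def demo(workers=4, per_worker=1000):
    shared = ConcurrentSet()
    threads = [
        threading.Thread(target=fill,
                         args=(shared, w * per_worker, (w + 1) * per_worker))
        for w in range(workers)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return shared
```

Calling `demo()` runs four workers against one shared set. Any application that uses the class inherits the same concurrency control without repeating it.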
1.1 Original Results 
The following results are the major original contributions of this thesis: 
• In Section 2.2, I introduce the concept of a distributed object, a single object that 
is distributed across the nodes of a concurrent computer. Distributed objects can 
perform many operations simultaneously. They are the foundation upon which 
concurrent data structures are built. 
• A new data structure for ordered sets, the balanced cube, is developed in Chapter 3. 
The balanced cube achieves greater concurrency than conventional tree-based data 
structures. 
• In Section 4.2, a new concurrent algorithm for the shortest path problem is de-
scribed. 
• Two new concurrent algorithms for the max-flow problem are presented in Section 4.3.
• A new concurrent algorithm for graph partitioning is developed in Section 4.4. 
• In Section 5.3.1, I compare the latency of k-ary n-cube networks as a function 
of dimension and derive the surprising result that, holding wiring bisection width 
constant, minimum latency is achieved at a very low dimension. 
• In Section 5.3.2, I develop the concept of virtual channels. Virtual channels can 
be used to generate a deadlock-free routing algorithm for any strongly connected 
interconnection network. This method is used to generate a deadlock-free routing 
algorithm for k-ary n-cubes. 
• The torus routing chip (TRC) has been designed to demonstrate the feasibility 
of constructing low-latency interconnection networks using wormhole routing and 
virtual channels. The design and testing of this self-timed VLSI chip are described 
in Section 5.3.3. 
• In Section 5.5, I introduce the concept of an object expert, hardware specialized to 
accelerate operations on one class of object. Object experts provide performance 
comparable to that of special-purpose hardware while retaining the flexibility of a 
general purpose processor. 
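The latency result for k-ary n-cubes listed above can be made concrete with a rough calculation. The sketch below is an assumption-laden illustration, not the analysis of Section 5.3.1: it assumes N = k^n nodes, an average distance of n(k-1)/2 hops, a channel width of k/2 (normalized so that a binary n-cube has width 1 at equal bisection width), constant wire delay, and a hypothetical 150-bit message. The function names are invented for this example.

```python
def latency(num_nodes, dimension, message_bits):
    """Latency in channel cycles under the assumed model: hops + bits / width."""
    k = round(num_nodes ** (1.0 / dimension))
    if k ** dimension != num_nodes:
        raise ValueError("num_nodes is not an exact k-ary n-cube for this dimension")
    hops = dimension * (k - 1) / 2.0   # assumed average distance
    width = k / 2.0                    # assumed width at constant bisection
    return hops + message_bits / width


def best_dimension(num_nodes, message_bits):
    """Evaluate every dimension that gives an exact k-ary n-cube."""
    table = {}
    n = 1
    while 2 ** n <= num_nodes:
        try:
            table[n] = latency(num_nodes, n, message_bits)
        except ValueError:
            pass                       # skip dimensions with non-integer radix
        n += 1
    return min(table, key=table.get), table
```

Under these assumptions, `best_dimension(256, 150)` compares dimensions 1, 2, 4, and 8 and selects dimension 2. The binary 8-cube has the fewest hops but the narrowest channels, so serializing the message dominates its latency.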
1.2 Motivation 
Two forces motivate the development of new computer architectures: need and technology.
As computer applications change, users need new architectures to support their new
programming styles and methods. Applications today deal frequently with non-numeric 
data such as strings, relations, sets, and symbols. In implementing these applications, 
programmers are moving towards fine-grain object-oriented languages such as Smalltalk, 
where non-numeric data can be packaged into objects on which specific operations are 
defined. This packaging allows a single implementation of a popular object such as an 
ordered set to be used in many applications. These languages require a processor that 
can perform late binding of types and that can quickly allocate and de-allocate resources.
New architectures are also developed to take advantage of new technology. The emerging
VLSI technology has the potential to build chips with 10^7 transistors with switching
times of 10^-10 seconds. Wafer-scale systems may contain as many as 10^9 devices. This
technology is limited by its wiring density and communication speed. The delay in 
traversing a single chip may be 100 times the switching time. Also, wiring is limited to 
a few planar layers, resulting in a low communications bandwidth. Thus, architectures 
that use this technology must emphasize locality. The memory that stores data must be 
kept close to the logic that operates on the data. VLSI also favors specialization. Because 
a special purpose chip has a fixed communication pattern, it makes more effective use of 
limited communication resources than does a general purpose chip. Another way to view 
VLSI technology is that it has high throughput (because of the fast switching times) and 
high latency (because of the slow communications). Harnessing the high throughput of
this technology requires architectures that distribute computation in a loosely coupled
manner so that the latency of communication does not become a bottleneck.
Figure 1.1: Motivation for Concurrent Data Structures (diagram labels: Concurrent Data Structures, Object-Oriented Programming, VLSI)
This thesis develops a computer architecture that efficiently supports object-oriented 
programming using VLSI technology. As shown in Figure 1.1, the central idea of this 
thesis is concurrent data structures. The development of concurrent data structures 
is motivated by two underlying concepts: object-oriented programming and VLSI. The 
paradigm of object-oriented programming allows programs to be constructed from object 
classes that can be shared among applications. By defining concurrent data structures as
distributed objects, these data structures can be shared across many applications. VLSI
circuit technology motivates the use of concurrency and the construction of ensemble
machines. These highly concurrent machines are required to take advantage of this high
throughput, high latency technology. 
1.3 Background 
Much work has been done on developing data structures that permit concurrent access 
[31], [32], [33], [34], [76], [81]. A related area of work is the development of distributed 
data structures [39]. These data structures, however, are primarily intended for allowing 
concurrent access for multiple processes running on a sequential computer or for a data 
structure distributed across a loosely coupled network of computers. The concurrency 
achieved in these data structures is limited, and their analysis for the most part ignores 
communication cost. In contrast, the concurrent data structures developed here are 
intended for tightly coupled concurrent computers with thousands of processors. Their 
concurrency scales with the size of the problem, and they are designed to minimize 
communications. 
Many algorithms have been developed for concurrent computers [7], [9], [15], [75], [85], [102],
[116]. Most concurrent algorithms are for numerical problems. These algorithms tend to
be oriented toward a small number of processors and use a MIMD [42] shared-memory
model that ignores communication cost and imposes global synchronization. 
Object-oriented programming began with the development of SIMULA [11], [19]. SIMULA
incorporated data abstraction with classes, inheritance with subclasses, and late
binding with virtual procedures. SIMULA is even a concurrent language in the sense
that it provides co-routining to give the illusion of simultaneous execution for simulation 
problems. Smalltalk [51], [52], [74], [136] combines object-oriented programming with an 
interactive programming environment. Actor languages [1], [17] are concurrent object-
oriented languages where objects may send many messages without waiting for a reply. 
The programming notation used in this thesis combines the syntax of Smalltalk-80 with 
the semantics of actor languages. 
The approach taken here is similar in many ways to that of Lang [79]. Lang also proposes
a concurrent extension of an object-oriented programming language, SIMULA, and
analyzes communication networks for a concurrent computer to support this language. 
There are several differences between Lang's work and this thesis. First, this work de-
velops several programming language features not found in Lang's concurrent SIMULA: 
distributed objects to allow concurrent access, simultaneous execution of several methods 
by the same object, and locks for concurrency control. Second, by analyzing interconnection
networks using a wire cost model, I derive the result that low dimensional networks
are preferable for constructing concurrent computers, contradicting Lang's result that 
high dimensional binary n-cube networks are preferable. 
1.4 Concurrent Computers 
This thesis is concerned with the design of concurrent computers to manipulate data 
structures. We will limit our attention to message-passing [112] MIMD [42] concurrent 
computers. By combining a processor and memory in each node of the machine, this class 
of machines allows us to manipulate data locally. By using a direct network, message-passing
machines allow us to exploit locality in the communication between nodes as
well. 
Figure 1.2: Information Flow in a Sequential Computer (diagram labels: Processor, Memory, Address, Old Data, New Data)
Concurrent computers have evolved out of the ideas developed for programming multi-
programmed, sequential computers. Since multiple processes on a sequential computer 
communicate through shared memory, the first concurrent computers were built with 
shared memory. As the number of processors in a computer increased, it became necessary
to separate the channels used for interprocessor communication from those used
to access memory. The result of this separation is the message-passing concurrent com-
puter. 
Concurrent programming models have evolved along with the machines. The problem 
of synchronizing concurrent processes was first investigated in the context of multiple 
processes on a sequential computer. This model was used almost without change on 
shared-memory machines. On message-passing machines, explicit communication primitives
have been added to the process model.
1.4.1 Sequential Computers 
A sequential computer consists of a processor connected to a memory by a communication
channel. As shown in Figure 1.2, modifying a single data object requires three
messages: an address message from processor to memory, a data message back to the 
processor containing the original object, and a data message back to memory containing 
the modified object. The single communication channel over which these messages travel 
is the principal limitation on the speed of the computation, and has been referred to as 
the Von Neumann bottleneck [4]. 
Even when a programmer has only a single processor, it is often convenient to organize a 
program into many concurrent processes. Multiprogramming systems are constructed on 
sequential computers by multiplexing many processes on the single processor. Processes 
in a multiprogramming system communicate through shared memory locations. Higher 
level communication and synchronization mechanisms such as interlocked read-modify-
write operations, semaphores, and critical sections are built up from reading and writing 
shared memory locations. On some machines interlocked read-modify-write operations 
are provided in hardware. 
Communication between processes can be synchronous or asynchronous. In programming
systems such as CSP [62] and OCCAM [64] that use synchronous communication,
the sending and receiving processes must rendezvous. Whichever process performs the
communication action first must wait for the other process. In systems such as the Cos-
mic Cube [123] and actor languages [1],[17] that use asynchronous communication, the 
sending process may transmit the data and then proceed with its computation without 
waiting for the receiving process to accept the data. 
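The difference between the two styles can be sketched in a few lines of Python. This is an illustrative model only, not CSP, OCCAM, or the Cosmic Cube interface: the `Channel` class, its `synchronous` flag, and the `run` helper are hypothetical names, and threads stand in for processes. The trace records whether the sender proceeds before or after the receiver accepts the message.

```python
import queue
import threading


class Channel:
    """Hypothetical channel supporting both communication styles."""

    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.events = []              # ordered trace of what happened
        self._queue = queue.Queue()

    def send(self, message):
        accepted = threading.Event()
        self._queue.put((message, accepted))
        if self.synchronous:
            accepted.wait()           # rendezvous: block until accepted
        self.events.append("sender proceeds")

    def receive(self):
        message, accepted = self._queue.get()
        self.events.append("receiver accepts")
        accepted.set()                # release a sender waiting at the rendezvous
        return message


def run(synchronous):
    channel = Channel(synchronous)
    sender = threading.Thread(target=channel.send, args=("hello",))
    sender.start()
    if not synchronous:
        sender.join()                 # an asynchronous sender can finish first
    channel.receive()
    sender.join()
    return channel.events
```

With `run(False)` the sender completes before any receiver exists, while with `run(True)` the sender cannot proceed until `receive` has accepted the message.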
Since there is only a single processor on a sequential computer, there is a unique global
ordering of communication events. Communication also takes place without delay. A
shared memory location written by process A on one memory cycle can be read by
process B on the next cycle.¹ With global ordering of events and instantaneous communication,
the strong synchronization implied by synchronous communication can be
implemented without significant cost. The same is not true of concurrent computers
where communication events are not uniquely ordered and the delay of communication
is the major cost of computation.
It is possible for concurrent processes on a sequential computer to access an object 
simultaneously because the access is not really simultaneous. The processes, in fact,
access the object one at a time. On a concurrent computer the illusion of simultaneous
access can no longer be maintained. Most memories have a single port and can service
only a single access at a given time.
1.4.2 Shared-Memory Concurrent Computers 
To eliminate the Von Neumann bottleneck, the processor and memory can be replicated 
and interconnected by a switch. Shared-memory concurrent computers such as the
NYU Ultracomputer [106], [54], [55], C.mmp [135], and RP3 [100] consist of a number of
processors connected to a number of memories through a switch, as shown in Figure 1.3. 
Although there are many paths through the switch, and many messages can be transmitted
simultaneously, the switch is still a bottleneck. While the bottleneck has been
made wider, it has also been made longer. Every message must travel from one side
of the switch to the other, a considerable distance that grows larger as the number of
processors increases. Most shared-memory concurrent computers are constructed using
indirect networks and cannot take advantage of locality. All messages travel the same
distance regardless of their destination.
¹ Some sequential computers overlap memory cycles and require a delay to read a location just written.
Figure 1.3: Information Flow in a Shared-Memory Concurrent Computer (diagram labels: Switch, Address, Old Data, New Data)
Shared-memory computers are programmed using the same process-based model of computation
described above for multiprogrammed sequential computers. As the name implies,
communication takes place through shared memory locations. Unlike sequential
computers, however, there is no unique global order of communication events in a shared-
memory concurrent computer, and several processors cannot access the same memory 
location at the same time. 
Some designers have avoided the uniformly high communication costs of shared-memory 
computers by placing cache memories in the processing nodes [53]. Using a cache, mem-
ory locations used by only a single processor can be accessed without communication 
overhead. Shared memory locations, however, still require communication to synchronize 
the caches.² The cache nests the communication channel used to access local memory 
inside the channel used for interprocessor communication. This division of function 
between memory access and communication is made more explicit in message-passing 
concurrent computers. 
² The problem of synchronizing cache memories in a concurrent computer is known as the cache coherency problem.
Figure 1.4: Information Flow in a Message-Passing Concurrent Computer (diagram labels: Control Message)
1.4.3 Message-Passing Concurrent Computers
In contrast to sequential computers and shared-memory concurrent computers which 
operate by sending messages between processors and memories, a message-passing concurrent
computer operates by sending messages between processing nodes that contain
both logic and memory. 
As shown in Figure 1.4, message-passing concurrent computers such as the Caltech Cosmic
Cube [112] and the Intel iPSC [65] consist of a number of processing nodes interconnected
by communication channels. Each processing node contains both a processor and
a local memory. The communication channels used for memory access are completely 
separate from those used for inter-processor communication. 
Message-passing computers take a further step toward reducing the Von Neumann bot-
tleneck by using a direct network which allows locality to be exploited. A message to an 
object resident in a neighboring processor travels a variable distance which can be made 
short by appropriate process placement. 
Shared-memory computers, even implemented with direct networks, use the available 
communications bandwidth inefficiently. Three messages are required for each data 
operation. A message-passing computer can make more efficient use of the available 
communications bandwidth by keeping the data state stationary and passing control 
messages. Since a processor is available at every node, data operations are performed 
place. Only a single message is required to modify a data object. The single message 
specifies: the object to be modified, the modification to be performed, and the location 
to which the control state is to move next. 
Keeping data stationary also encourages locality. Each data object is associated with
the procedures that operate on it. This association allows us to place the logic that
operates on a class of objects in close proximity to the memory that stores instances of
the objects. As Seitz points out, "both the cost and performance metrics of VLSI favor
architectures in which communication is localized" [111].
Message-passing concurrent computers are programmed using an extension of the process 
model that makes communication actions explicit. Under the Cosmic Kernel [123], for
example, a process can send and receive messages as well as spawn other processes. This
model makes the separation of communication from memory visible to the programmer.
It also provides a base upon which an object-oriented model of computation can be built. 
1.5 Summary 
In this thesis I develop an architecture for concurrent data structures. I begin in Chapter 2
by developing the concept of a distributed object. A programming notation, Concurrent
Smalltalk (CST), is presented that incorporates distributed objects, concurrent
execution and locks for concurrency control. In Chapter 3 I use this programming nota-
tion to describe the balanced cube, a concurrent data structure for ordered sets. Con-
sidering graphs as concurrent data structures, I develop a number of concurrent graph 
algorithms in Chapter 4. New algorithms are presented for the shortest path problem, 
the max-flow problem, and graph partitioning. Chapter 5 develops an architecture based 
on the properties of the algorithms developed in Chapters 3 and 4 and the character-
istics of VLSI technology. Network topologies are compared on the basis of dimension, 
and it is shown that low dimensional networks give lower latency than high dimensional 
networks for constant wire cost. A new algorithm is developed for deadlock-free routing 
in k-ary n-cube networks, and a VLSI chip implementing this algorithm is described. 
Chapter 5 also outlines the architecture of a message driven processor and describes how 
object experts can be used to accelerate operations on common data types. 
Chapter 2 
Concurrent Smalltalk 
The message-passing paradigm of object-oriented languages such as Smalltalk-80 [51]
introduces a discipline into the use of the communication mechanism of message-passing 
concurrent computers. Object-oriented languages also promote locality by grouping 
together data objects with the operations that are performed on them. 
Programs in this thesis are described using Concurrent Smalltalk (CST), a derivative of 
Smalltalk-80 with three extensions. First, messages can be sent concurrently without
waiting for a reply. Second, several methods may access an object concurrently. Locks
are provided for concurrency control. Finally, the language allows the programmer to
specify objects that are distributed over the nodes of a concurrent computer. These 
distributed objects have many constituent objects and thus can process many messages 
simultaneously. They are the foundation upon which concurrent data structures are 
built. 
The remainder of this chapter describes the novel features of Concurrent Smalltalk. This 
discussion assumes that the reader is familiar with Smalltalk-80 [51]. A brief overview 
of CST is presented in Appendix A. In Section 2.1 I discuss the object-oriented model 
of programming and show how an object-oriented system can be built on top of the 
conventional process model. Section 2.2 introduces the concept of distributed objects. A 
distributed object can process many requests simultaneously. Section 2.3 describes how 
a method can exploit concurrency in processing a single request by sending a message 
without waiting for a reply. The use of locks to control simultaneous access to a CST
object is described in Section 2.4. Section 2.5 describes how CST blocks include local 
variables and locks to permit concurrent execution of a block by the members of a 
collection. This chapter concludes with a brief discussion of performance metrics in 
Section 2.6. 
2.1 Object-Oriented Programming 
Object-oriented languages such as SIMULA [11] and Smalltalk [51] provide data abstraction
by defining classes of objects. A class specifies both the data state of an object and
the procedures or methods that manipulate this data.
Object-oriented languages are well suited to programming message-passing concurrent 
computers for four reasons. 
• The message-passing paradigm of languages like Smalltalk introduces a discipline
into the use of the communication mechanism of message-passing computers.
• These languages encourage locality by associating each data object with the meth-
ods that operate on the object. 
• The information hiding provided by object-oriented languages makes it very convenient
to move commonly used methods or classes into hardware while retaining
compatibility with software implementations.
• Object names provide a uniform address space independent of the physical placement
of objects. This avoids the problems associated with the partitioned address
space of the process model: memory addresses internal to the process and process
identifiers external to the process. Even when memory is shared, there is still a
partition between memory addresses and process identifiers.
In an object-oriented language, computation is performed by sending messages to objects.
Objects never wait for or explicitly receive messages. Instead, objects are reactive. The 
arrival of a message at an object triggers an action. The action may involve modifying 
the state of the object, transmitting messages that continue the control flow, and/or 
creating new objects. 
The behavior of an object can be thought of as a function, $B$ [1]. Let $S$ be the set of all
object states and $M$ the set of all messages. An object with initial state $i \in S$, receiving
a message $m \in M$, transitions to a new state $n \in S$, transmits a possibly empty set of
messages $m' \subset M$, and creates a possibly empty set of new objects $o \subset O$.
$$B : S \times M \rightarrow \mathcal{P}(M),\ S,\ \mathcal{P}(O). \qquad (2.1)$$
Actions as described by the behavior function (2.1) are the primitives from which more 
complex computations are built. In analyzing timing and synchronization each action is 
considered to take place instantaneously, so it is possible to totally order the actions for 
a single object. 
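As an illustration only, a behavior of the form (2.1) can be sketched in Python. The sketch below is not CST; the names Message and counter_behavior, and the use of a dictionary for the object state, are assumptions made for the example. Each call corresponds to one action: it consumes a state and a message and yields the messages to transmit, the new state, and the objects to create.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    target: str        # name of the receiving object (hypothetical addressing scheme)
    selector: str      # which action the message triggers
    args: tuple = ()


def counter_behavior(state, msg):
    """Hypothetical behavior: (state, message) -> (messages sent, new state, objects created)."""
    if msg.selector == "add:":
        # Modify the state; send nothing and create nothing.
        return [], {"count": state["count"] + msg.args[0]}, []
    if msg.selector == "report:":
        # Leave the state alone and transmit one message that continues the control flow.
        reply = Message(target=msg.args[0], selector="reply:", args=(state["count"],))
        return [reply], state, []
    return [], state, []


# Two actions applied in order to a single object.
state = {"count": 0}
sent, state, created = counter_behavior(state, Message("c", "add:", (5,)))
sent, state, created = counter_behavior(state, Message("c", "report:", ("caller",)))
```

Because each call is treated as instantaneous, the sequence of calls on one object is a total order, as assumed in the text.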
Methods are constructed from a set of primitive actions by sequencing the actions with
messages. Often a method will send a message to an object and wait for a reply before 
proceeding with the computation. For example, in the code fragment below, the message 
size is sent to object x, and the method must wait for the reply before continuing. 
xSize ← x size.
ySize ← xSize * 2.
Since there is no receive statement, multiple actions are required to implement this 
method. The first action creates a context and sends the size message. The context contains
all method state: a pointer to the receiver, temporary variables, and an instruction
pointer into the method code. A pointer to the context is placed in the reply-to field
of the size message to cause the size method to reply to the context rather than to the
original object. When the size method replies to the context, the second action resumes 
execution by storing the value of the reply into the variable xSize. The context is used 
to hold the state of the method between actions. 
Objects with behaviors specified by (2.1) can be constructed using the message-passing 
process model. Each object is implemented by a process that executes an endless
receive-dispatch-execute loop. The process receives the next message, dispatches control to the
associated action, and then executes the action. The action may change the state of the 
object, send new messages, and/or create new objects. In Chapter 5 we will see how, 
by tailoring the hardware to the object model, we can make the receive-dispatch-execute 
process very fast. 
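The receive-dispatch-execute loop can be sketched as follows. This is a minimal software model under stated assumptions: the class ObjectProcess, the tuple layout of a message, and the list standing in for the network are all hypothetical, and the loop stops when the mailbox is empty rather than running endlessly.

```python
from collections import deque


class ObjectProcess:
    """One object implemented as a process with a mailbox (illustrative names)."""

    def __init__(self, state, actions):
        self.state = state
        self.actions = actions      # selector -> function(state, *args) -> (new_state, outgoing)
        self.mailbox = deque()

    def run(self, network):
        # The text describes an endless loop; this sketch stops when no messages remain.
        while self.mailbox:
            selector, args = self.mailbox.popleft()              # receive
            action = self.actions[selector]                      # dispatch
            self.state, outgoing = action(self.state, *args)     # execute
            network.extend(outgoing)                             # hand new messages to the network


def increment(state, amount):
    return state + amount, []


def report(state, reply_to):
    return state, [(reply_to, "reply:", (state,))]


obj = ObjectProcess(state=0, actions={"increment:": increment, "report:": report})
obj.mailbox.extend([("increment:", (3,)), ("increment:", (4,)), ("report:", ("caller",))])
network = []
obj.run(network)
```

Every message pays for the receive and dispatch steps before any useful work is done, which is why the cost of this loop matters.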
2.2 Distributed Objects 
In many cases we want an object that can process many messages simultaneously. Since 
the actions on an object are ordered, simultaneous processing of messages is not consistent
with the model of computation described above. We can circumvent this limitation
by using a distributed object. A distributed object consists of a collection of constituent 
objects, each of which can receive messages on behalf of the distributed object. Since 
many constituent objects can receive messages at the same time, the distributed object 
can process many messages simultaneously. 
Figure 2.1 shows an example CST class definition. The definition begins with a header
that identifies the name of the class, TallyCollection, the superclass from which
TallyCollection inherits behavior, DistributedCollection, and the instance variables and locks
that make up the state of each instance of the class. The header is followed by definitions
of class methods, omitted here, and definitions of instance methods. Class methods define
the behavior of the class object, TallyCollection, and perform tasks such as creating new
instances of the class. Instance methods define the behavior of instances of class
TallyCollection, the collections themselves. In Figure 2.1 two instance methods are defined.
class                 TallyCollection          the class name
superclass            DistributedCollection    a distributed object
instance variables    data                     local collection of data
class variables       none
locks                 none

class methods
    class methods ...

instance methods

tally: aKey                                    count data matching aKey
    | |
    (self upperNeighbor) localTally: aKey sum: 0 returnFrom: myId

localTally: aKey sum: anInt returnFrom: anId
    | newSum |
    newSum ← anInt.
    data do: [:each |
        (each = aKey) ifTrue: [newSum ← newSum + 1]].
    (myId = anId) ifTrue: [requester reply: newSum]
        ifFalse: [(self upperNeighbor) localTally: aKey sum: newSum returnFrom: anId].

other instance methods ...

Figure 2.1: Distributed Object Class TallyCollection
Instances of class TallyCollection are distributed objects made up of many constituent
objects (COs). Each CO has an instance variable data and understands the messages
tally: and localTally:. A distributed object is created by sending a newOn: message to its
class.
aTallyCollection ← TallyCollection newOn: someNodes.
The argument of the newOn: message, someNodes, is a collection of processing nodes¹.
The newOn: message creates a CO on each member of someNodes. There is no guarantee 
that the COs will remain on these processing nodes, however, since objects are free to 
migrate from node to node. 
When an object sends a message to a distributed object, the message may be delivered 
to any constituent of the distributed object. The sender has no control over which 
CO receives the message. The constituents themselves, however, can send messages to
specific COs by using the message co:. For example, in the code below, the receiver (self),
a constituent of a distributed object, sends a localTally message to the anId-th constituent
of the same distributed object.
(self co: anId) localTally: #foo sum: 0 returnFrom: myId.
The argument of the co: message is a constituent identifier. Constituent identifiers are
integers assigned to each constituent sequentially beginning with one. The constant
myId gives each CO its own index and the constant maxId gives each CO the number of
constituents.
The method tally: aKey in Figure 2.1 counts the occurrences of aKey in the distributed
collection and returns this number to the sender. The constituent object that receives
the tally message sends a localTally message to its neighbor². The localTally method
counts the number of occurrences of aKey in the receiver node, adds this number to the 
sum argument of the message and propagates the message to the next CO. When the 
localTally message has visited every CO and arrives back at the original receiver, the 
total sum is returned to the original customer by sending a reply: message to requester. 
Distributed objects often forward messages between COs before replying to the original
requesting object. TallyCollection, for example, forwards localTally messages in a cycle
to all COs before replying. CST supports this style of programming by providing the 
reserved word requester. For messages arriving from outside the object, requester is 
bound to the sender. For internal messages, requester is inherited from the sending 
method. 
¹ Processing nodes are objects.
² The message upperNeighbor returns the CO with identifier myId + 1 if myId ≠ maxId and the CO
with identifier 1 otherwise.
This forwarding behavior illustrates a major difference between CST and Smalltalk-80:
CST methods do not necessarily return a value to the sender. Methods that do not
explicitly return a value using '↑' terminate without sending a reply. The tally: method
terminates without sending a reply to the sender. The reply is sent later by the localTally
method.
The tally: method shown in Figure 2.1 exhibits no concurrency. The point of a dis-
tributed object is not to provide concurrency in performing a single operation on the 
object, but rather to allow many operations to be performed concurrently. For example, 
suppose we had a TallyCollection with 100 COs. This object could receive 100 messages
simultaneously, one at each CO. After passing 10,000 localTally messages internally, 100 
replies would be sent to the original senders. The 100 requests are processed concurrently. 
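The message count in this example can be checked with a small simulation. The sketch below is a sequential Python model of the ring traversal, not CST; the class TallyCO, the helper names, and the sample data are assumptions made for illustration. Constituent identifiers start at one and the neighbor rule wraps from the last constituent to the first.

```python
class TallyCO:
    """One constituent object holding a local slice of the collection (hypothetical model)."""

    def __init__(self, my_id, data):
        self.my_id = my_id      # constituent identifiers start at 1
        self.data = data


def upper_neighbor(my_id, max_id):
    # myId + 1 unless myId equals maxId, in which case wrap around to 1.
    return my_id + 1 if my_id != max_id else 1


def tally(cos, entry_id, key):
    """Return (count, localTally messages passed) for a request that arrives at entry_id."""
    max_id = len(cos)
    total, messages = 0, 0
    current = upper_neighbor(entry_id, max_id)   # tally: forwards to the neighbor first
    while True:
        messages += 1                            # one localTally message delivered
        total += sum(1 for item in cos[current - 1].data if item == key)
        if current == entry_id:                  # back at the original receiver: reply
            return total, messages
        current = upper_neighbor(current, max_id)


# 100 constituents; even-numbered ones hold one matching element each (made-up data).
cos = [TallyCO(i, ["foo"] if i % 2 == 0 else ["bar"]) for i in range(1, 101)]
count, per_request = tally(cos, entry_id=1, key="foo")
total_messages = sum(tally(cos, entry_id=i, key="foo")[1] for i in range(1, 101))
```

Each request passes one localTally message per constituent, so 100 simultaneous requests on 100 COs account for the 10,000 internal messages cited above.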
Some concurrent applications require global communication. For example, the concur-
rent garbage collector described by Lang [79] requires that processes running in each 
processor be globally synchronized. The hardware of some concurrent computers supports
this type of global communication. The Caltech Cosmic Cube, for instance, provides
several wire-or global communication lines for this purpose [112].
Some applications require global communication combined with a simple computation. 
For example, branch and bound search problems require that the minimum bound be 
broadcast to all processors. Ideally, a communication network would accept a bound 
from each processor, compute the minimum, and broadcast it. In fact, the computation 
can be carried out in a distributed manner on the wire-or lines provided by the Cosmic
Cube. 
Distributed objects provide a convenient and machine-independent means of describing a 
broad class of global communication services. The service is formulated as a distributed 
object that responds to a number of messages. For example, the synchronization ser-
vice can be defined as an object of class Sync that responds to the message wait. The 
distributed object waits for a specified number of wait messages and then replies to all 
requesters. On machines that provide special hardware, class Sync can make use of 
this hardware. On other machines, the service can be implemented by passing messages
among the constituent objects. 
2.3 Concurrency 
CST does not exclude the use of concurrency in performing a single method. A more 
sophisticated tally: method is shown in Figure 2.2. Here I use messages upperChild and
lowerChild to embed a tree on the COs³. When a CO receives a tally: message it sends
two localTally messages down the tree simultaneously. When the localTally messages 
³ The implementation of methods upperChild and lowerChild is straightforward and will not be shown
here.
instance methods for class TallyCollection

tally: aKey                                    count data matching aKey
    | |
    ↑self localTally: aKey level: 0 root: myId.

localTally: aKey level: anInt root: anId
    | upperTally lowerTally sum aLevel |
    aLevel ← anInt + 1.
    sum ← 0.
    data do: [:each |
        (each = aKey) ifTrue: [sum ← sum + 1]].
    (anInt < maxLevel) ifTrue: [
        upperTally ← (self upperChild: anId level: aLevel) localTally: aKey level: 1 root: anId,
        lowerTally ← (self lowerChild: anId level: aLevel) localTally: aKey level: 1 root: anId.
        ↑upperTally + lowerTally + sum].
    ↑sum.

Figure 2.2: A Concurrent Tally Method
reach the leaves of the tree, the replies are propagated back up the tree concurrently.
The new TallyCollection can still process many messages concurrently, but now it uses 
concurrency in the processing of a single message as well. 
The use of a comma, ',', rather than a period, '.', at the end of a statement indicates that
the method need not wait for a reply from the send implied by that statement before 
continuing to the next statement. When a statement is terminated with a period, '.', 
the method waits for all pending sends to reply before continuing. 
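To see what the tree formulation buys, the following Python sketch computes both the tally and the longest chain of localTally messages. It is an illustration under assumptions: upperChild and lowerChild are not shown in the text, so the sketch assumes a heap-style embedding rooted at constituent 1 in which the children of CO i are 2i and 2i + 1, and it models the two concurrent sends by combining subtree depths with max.

```python
def tree_tally(data_by_id, key, node=1):
    """Return (count, depth) for the subtree rooted at `node`.

    count is the number of matching elements at or below node; depth is the
    longest chain of localTally messages, starting with the one delivered to node.
    Children of CO i are assumed to be 2i and 2i + 1 (a hypothetical embedding).
    """
    if node not in data_by_id:
        return 0, 0
    local = sum(1 for item in data_by_id[node] if item == key)
    upper_count, upper_depth = tree_tally(data_by_id, key, 2 * node)
    lower_count, lower_depth = tree_tally(data_by_id, key, 2 * node + 1)
    # Both subtrees are searched at the same time, so their depths combine with max.
    return local + upper_count + lower_count, 1 + max(upper_depth, lower_depth)


# 127 constituents form a complete binary tree; even-numbered ones hold a match (made-up data).
data_by_id = {i: ["foo"] if i % 2 == 0 else ["bar"] for i in range(1, 128)}
count, depth = tree_tally(data_by_id, "foo")
```

Under these assumptions the longest request chain grows with the logarithm of the number of constituents, whereas the ring formulation of Figure 2.1 passes one message per constituent in sequence.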
A simpler example of concurrency is shown in Figure 2.3. This figure shows a portion of 
the definition of class Interval⁴. The definition has two methods; l:u: is a class method
that creates a new interval, and contains: is an instance method that checks if a number 
is contained in an interval. 
As shown in Figure 2.4, the contains: method is initiated by sending a message, contains:
aNum, to object, anInterval, of class Interval. Objects of class Interval have two
acquaintances⁵, l and u. To check if it contains aNum, object anInterval sends messages to both
⁴ The term Interval here means a closed interval over the real numbers, $\{a \in \mathbb{R} \mid l \le a \le u\}$. This differs
from the Smalltalk-80 [51] definition of class Interval.
⁵ In the parlance of actor languages [1], an object A's acquaintances are those objects to which A can
send messages.
class                 Interval                 the class name
superclass            Object                   the name of its superclass
instance variables    l                        lower bound
                      u                        upper bound
class variables       none
locks                 rwlock                   implements readers and writers

class methods

l: aNum u: anotherNum                          creates a new interval
    | newInterval |
    newInterval ← self newLocal.
    newInterval l: aNum.
    newInterval u: anotherNum.
    ↑newInterval

other class methods ...

instance methods

contains: aNum                                 tests for number in interval
    require rwlock.
    | lin uin |
    lin ← l ≤ aNum,
    uin ← u ≥ aNum.
    ↑(lin and: uin)

other instance methods ...

Figure 2.3: Description of Class Interval
Figure 2.4: Synchronization of Methods
l and u, asking l if l ≤ aNum, and asking u if u ≥ aNum. After receiving both replies,
anInterval replies with their logical and.
Observe that the contains: method requires three actions. The first action occurs when
the contains: message is received by anInterval. This action sends messages to l and u
and creates a context, aContext, to which l and u will reply. The first reply to aContext
triggers the second action which simply records its occurrence and the value in the reply. 
The second reply to aContext triggers the final action which computes the result and 
replies to the original sender. In this example the context object is used to join two 
concurrent streams of execution. 
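The three actions can be sketched in Python as follows. This is a simplified model, not CST: the classes Context and Interval, the reply keys "lin" and "uin", and the use of a callable to stand in for the original sender are assumptions for illustration, and the two sends are modeled as direct calls whose order is arbitrary.

```python
class Context:
    """Joins two concurrent replies, then answers the original sender."""

    def __init__(self, reply_to):
        self.reply_to = reply_to    # callable standing in for the original sender
        self.replies = {}

    def reply(self, name, value):
        self.replies[name] = value             # second action: record the reply and its value
        if len(self.replies) == 2:             # final action: both replies have arrived
            self.reply_to(self.replies["lin"] and self.replies["uin"])


class Interval:
    def __init__(self, l, u):
        self.l, self.u = l, u

    def contains(self, a_num, reply_to):
        # First action: create the context and issue both comparisons.
        context = Context(reply_to)
        # Replies may arrive in either order; here the upper-bound reply comes first.
        context.reply("uin", self.u >= a_num)
        context.reply("lin", self.l <= a_num)


results = []
Interval(1, 10).contains(5, results.append)
Interval(1, 10).contains(11, results.append)
```

The interval object does no work after its first action; the context alone waits for the second reply and computes the result.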
Only the first action of the contains: method is performed by object anInterval. The
subsequent actions are performed by object aContext. Thus, once the first action is
complete anInterval is free to accept additional messages. The ability to process several
requests concurrently can result in a great deal of concurrency. This simple approach
to concurrency can cause problems, however, if precautions are not taken to exclude 
incompatible methods from running concurrently. 
Figure 2.3 illustrates another novel feature of CST. Instance variables in CST may be 
either internal variables or external variables. Internal variables are stored in the same 
processing node as an object and may be accessed without passing messages. External 
variables may be stored anywhere and require message passing for access. An internal 
variable is created using the newLocal message, while an external variable may be acquired
by passing a pointer, or may be created using the new message. The l:u: method
in Figure 2.3 uses the newLocal method to generate new instances of class Interval.
2.4 Locks 
Some problems require that an object be capable of sending messages and receiving
their replies while deferring any additional requests. In other cases we may want to 
process some requests concurrently, while deferring others. To defer some messages 
while accepting others requires the ability to select a subset of all incoming messages to 
be received. This capability is also important in database systems, where it is referred 
to as concurrency control [133]. 
Consider our example object, anInterval. To maintain consistency, anInterval must defer
any messages that would modify l or u until after the contains: method is complete. On
the other hand, we want to allow anInterval to process any number of contains: messages
simultaneously. 
SAL, an actor language, handles this problem by creating an insensitive actor which only 
accepts become messages [1]⁶. The insensitive actor buffers new requests until the original
method is complete. Lang's concurrent SIMULA [79] incorporates a select construct to 
allow objects to select the next message to receive. While exclusion can be implemented 
using select, Lang's language treats each object as a critical region, allowing only a 
single method to proceed at a time. Neither insensitive actors nor critical regions allow 
an object to selectively defer some methods while performing others concurrently. 
Adding locks to objects provides a general mechanism for concurrency control. A lock 
is part of an object's state. Locks impose a partial order on methods that execute on 
the object. Each method specifies two possibly empty sets of locks: a set of locks the 
method requires, and a set of locks the method excludes. A method is not allowed to
begin execution until all previous methods executing on the same object that exclude a 
required lock or require an excluded lock have completed. The concept of locks is similar 
to that of triggers [90]. 
A solution to the readers and writers problem is easily implemented with this locking
mechanism. All readers exclude rwlock, while all writers both require and exclude 
rwlock. Many reader methods can access the object concurrently since they do not 
exclude each other. As soon as a writer message is received, it excludes new reader 
methods from starting while it waits for existing readers to complete. Only one writer 
at a time can gain access to the object since writers both require and exclude rwlock. 
This illustrates how mutual exclusion can also be implemented with a single lock. 
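The admission rule can be sketched in Python. This is an illustrative model rather than the CST mechanism itself: the Method class and the functions conflicts and can_start are hypothetical names, and the sketch checks a new method only against methods currently executing, which simplifies the ordering over all previous methods described above. The reader and writer lock sets follow the assignment given in the preceding paragraph.

```python
class Method:
    """A method together with the locks it requires and excludes (hypothetical model)."""

    def __init__(self, name, requires=(), excludes=()):
        self.name = name
        self.requires = frozenset(requires)
        self.excludes = frozenset(excludes)


def conflicts(earlier, later):
    """True if `later` must wait for `earlier` to complete."""
    # Earlier excludes a lock that later requires, or earlier requires a lock later excludes.
    return bool(earlier.excludes & later.requires) or bool(earlier.requires & later.excludes)


def can_start(method, running):
    """A method may begin only if it conflicts with no method already executing."""
    return not any(conflicts(earlier, method) for earlier in running)


# Readers exclude rwlock; writers both require and exclude it.
reader = Method("contains:", excludes={"rwlock"})
writer = Method("l:", requires={"rwlock"}, excludes={"rwlock"})

readers_share = can_start(reader, [reader])           # readers run together
writer_waits = not can_start(writer, [reader])        # writer waits for existing readers
reader_waits = not can_start(reader, [writer])        # new readers wait for the writer
writers_exclusive = not can_start(writer, [writer])   # one writer at a time
```

A method with empty require and exclude sets never conflicts with anything, so unlocked methods keep the unrestricted concurrency described in Section 2.3.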
⁶ CST objects could use the Smalltalk become: message to implement insensitive actors.
2.5 Blocks 
Blocks in CST differ from Smalltalk-80 blocks in two ways.
• A CST block may specify local variables and locks in addition to just arguments. 
[ :arg1 :arg2 | (locks) :var1 :var2 | code]
• It is possible to break out of a CST block without returning from the context in 
which the value message was sent to the block. The down-arrow symbol, '↓', is
used to break out of a block in the same way that '↑' is used to return out of a
block. 
Sending a block to a collection can result in concurrent execution of the block by members 
of the collection. Giving blocks local variables allows greater concurrency than is possible 
when all temporary values must be stored in the context of the creating method. Locks 
are provided to synchronize access to static variables during concurrent execution. 
2.6 Performance Metrics 
Performance of sequential algorithms is measured in terms of time complexity, the number
of operations performed, and space complexity, the amount of storage required [2].
On a concurrent machine we are also concerned with the number of operations that can 
be performed concurrently. 
The algorithms and data structures developed in this thesis are based on a message-
passing model of concurrent computation. Message-passing concurrent computers are 
communication limited. The time required to pass messages dominates the processing 
time, which we will ignore. 
In sharp contrast, most existing concurrent algorithms have been developed assuming an
ideal shared-memory multiprocessor. In the shared-memory model, communication cost 
is ignored. Processes can access any memory location with unit cost, and an unlimited 
number of processes can access a single memory location simultaneously. Performance 
of algorithms analyzed using the shared-memory model does not accurately reflect their
performance on message-passing concurrent computers. 
Communication cost has two components: 
latency: the delay of delivering a single message in isolation, 
throughput: the amount of message traffic the communication network can handle per 
unit time. 
For purposes of analysis I will ignore throughput and consider only latency. 
The programs in this thesis are analyzed assuming a binary n-cube interconnection topology.
Programs are charged one unit of time for each communication channel traversed
in a binary n-cube.
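Under this cost model, the charge for a message is a hop count. The Python sketch below assumes that a message takes a shortest path, crossing one channel for each address bit in which source and destination differ, so the charge equals the Hamming distance between the two node addresses; the function names hops and latency are hypothetical.

```python
def hops(src, dst):
    """Channels traversed between two nodes of a binary n-cube (Hamming distance)."""
    return bin(src ^ dst).count("1")


def latency(path):
    """Time units charged for a message forwarded through a list of node addresses."""
    return sum(hops(a, b) for a, b in zip(path, path[1:]))


corner_to_corner = hops(0b000, 0b111)                  # opposite corners of a 3-cube
forwarded = latency([0b000, 0b001, 0b011, 0b111])      # three single-channel steps
```

Throughput effects such as channel contention are deliberately absent from this model, consistent with the decision above to consider only latency.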
2.7 Summary 
In this chapter I have introduced Concurrent Smalltalk (CST), a programming notation 
for message-passing concurrent computers. Its novel features include locks for concur-
rency control, and the ability to create distributed objects. CST borrows its syntax, 
late-binding, and inheritance directly from the Smalltalk programming language [51].
Many of the ideas in CST are borrowed from Athas' language, XCPL [3].
Distributed objects are implemented as a collection of constituent objects (COs). Any 
CO can receive a message sent to the distributed object. Since many COs can receive
messages at the same time, a distributed object can process many messages simultaneously.
The constituents of a distributed object are assigned to processing nodes when
the object is created. Thus, distributed objects provide a mechanism for mapping a data 
structure onto an interconnection topology. Distributed objects are the foundation upon 
which concurrent data structures, such as the balanced cube described in Chapter 3, are 
built. 
CST permits methods to exploit concurrency by sending several messages before waiting 
for any replies. CST also allows some methods to terminate without sending any reply. 
Thus a message can be forwarded across many objects before a reply is finally sent to 
the original requester. 
CST methods are compiled into sequences of primitive actions that can be described using 
a behavior function (2.1). Context objects are used to hold the state of a method between 
actions and to join concurrent streams of execution. Primitive object behaviors can be 
implemented using the message-passing process model of computation [123]. However, 
as we will see in Chapter 5, a direct hardware implementation of the behavior function 
results in improved performance. 
Chapter 3 
The Balanced Cube 
Sequential computers spend a large fraction of their time manipulating ordered sets of 
data. For these operations to be performed efficiently on a concurrent computer, a new
data structure for ordered sets is required. Conventional ordered set data structures 
such as heaps, balanced trees, and B-trees [2] have a single root node through which 
all operations must pass. This root bottleneck limits the potential concurrency of tree
structures, making them unable to take advantage of the power of concurrent computers. 
Their maximum throughput is O(1). This chapter presents a new data structure for
implementing ordered sets, the balanced cube [21], which offers significantly improved 
concurrency. 
The balanced cube eliminates the root bottleneck, allowing it to achieve a throughput
of $O(N / \log N)$ operations per unit time¹. Concurrency in the balanced cube is achieved
through uniformity. With the exception of the balancing algorithm, all nodes are equals. 
An operation may originate at any node and need not pass through a root bottleneck as 
in a tree structure. In keeping with the spirit of a homogeneous machine, the balanced 
cube is a homogeneous data structure. 
Why is a concurrent data structure such as the balanced cube needed? Many applications
are organized around an ordered set data structure. By using a balanced cube to
implement this data structure, the application can be made concurrent with very little 
effort. The application is divided into partitions that communicate by storing data in
and retrieving data from the balanced cube. Because the balanced cube can process these 
requests concurrently, accesses to the balanced cube do not serialize the application. In 
Section 3.8 we will see how a balanced cube can be used in a concurrent computer mail 
system, in a concurrent artwork analysis program, and in a concurrent directed-search 
algorithm. 
The balanced cube's topology is well matched to binary n-cube multiprocessors. The 
¹ Unless otherwise specified, all logarithms are base two.
balanced cube maps members of an ordered set to subcubes of a binary n-cube. A Gray 
code mapping is used to preserve the linear adjacency of the ordered set in the Hamming 
distance adjacency of the cube. 
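The adjacency property can be checked with a short Python sketch. It assumes the standard binary-reflected Gray code, computed as i XOR (i shifted right by one); the mapping of set members to subcubes used by the balanced cube is not reproduced here. Consecutive positions in the linear order receive addresses that differ in exactly one bit, so they lie on directly connected nodes.

```python
def gray(i):
    """Binary-reflected Gray code of a non-negative integer i (assumed encoding)."""
    return i ^ (i >> 1)


def hamming(a, b):
    """Number of bit positions in which two addresses differ."""
    return bin(a ^ b).count("1")


n = 4
codes = [gray(i) for i in range(2 ** n)]
# Linear neighbors in the ordered set map to cube neighbors (Hamming distance one).
adjacent = all(hamming(codes[i], codes[i + 1]) == 1 for i in range(len(codes) - 1))
```

The encoding is a permutation of the n-bit addresses, so every node of the cube is used exactly once.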
Previous work on concurrent data structures has concentrated on reducing the interfer-
ence between concurrent processes accessing a common data base but has not addressed 
the limited concurrency of existing data structures. Kung and Lehman [76] have developed
concurrent algorithms for manipulating binary search trees. Lehman and Yao [81]
have extended these concepts and applied them to B-trees. Algorithms for concurrent 
search and insertion of data in AVL-trees [31] and 2-3 trees [32] have been developed
by Ellis. Ellis has also developed concurrent formulations of linear hashing [33] and 
extendible hashing [34]. 
These papers introduce a number of useful concepts that minimize locking of records,
postpone operations to be performed, and use marking mechanisms to modify the data
structure. However, these papers consider the processes and the data to be stationary,
and thus do not address the problems of moving processes and data between the nodes
of a concurrent computer. The cost of communications, which we assume to dominate
processing costs, has largely been ignored. 
The remainder of this chapter describes the balanced cube and how it addresses the issues 
of correctness, concurrency, and throughput. In the next section the data structure is 
presented, and the consistency conditions are described. The VW search algorithm is 
described in Section 3.2. VW search uses the distance properties of the Gray code 
to search the balanced cube for a data record in O(log N) time while locking only a 
single node at a time. An insert algorithm is presented in Section 3.3. Insertion is 
performed by recursively splitting subcubes of the balanced cube. Section 3.4 discusses 
the delete operation. Deletion is accomplished by simply marking a record as deleted. 
A background garbage collection process reclaims deleted subcubes. The insertion and 
deletion algorithms tend to unbalance the cube. A balancing algorithm, presented in 
Section 3.5, acts to restore balance. Each of the algorithms presented in this chapter
is analyzed in terms of complexity, concurrency, and correctness. Section 3.6 extends
the balanced cube concept to B-cubes which store several records in each node. Section
3.7 discusses the results of experiments run to verify the balanced cube algorithms.
The chapter concludes with a discussion of some possible balanced cube applications in
Section 3.8. 
3.1 Data Structure 
3.1.1 The Ordered Set 
An ordered set is a set, $S$, of objects on which a linear ordering $<$ has been defined:
$\forall a, b \in S$, either $a < b$ or $b < a$, and $a \neq b$ unless $a$ and $b$ are the same object. In many
25 
applications these objects are records and the linear order is defined by the value of a key 
field in each record. In this context the ordered set is used to store a database of relations 
associating the key field with the other fields of the record. The order relation defined 
on the keys of the records is implicit in the structure. A data structure implementing
the ordered set must efficiently support the following operations. 
at: key return the object associated with a key. 
at: key put: object add an object to the set 
delete: key remove the object associated with key from the set. 
from: lkey to: ukey do: aBlock concurrently send a value: message to aBlock for each
element of the set of objects with keys in the range [lkey, ukey].
succ: key1 return the object with the smallest key greater than key1.
pred: key1 return the object with the largest key smaller than key1.
max return the maximum object. 
min return the minimum object. 
In this chapter we will restrict our attention to developing algorithms for the search (at:), 
insert (at:put:) and delete operations. The remaining functions can be implemented as
simple extensions of these three fundamental operations. The succ: and pred: operations
can be implemented using the nearest neighbor links present in the balanced cube.
3.1.2 The Binary n-Cube 
The balanced cube is a data structure for representing ordered sets that stores data in 
subcubes of a binary n-cube [96], [124]. A binary n-cube has $N = 2^n$ nodes accessed by
n-bit addresses. Each bit of the address corresponds to a dimension of the cube. The
node or subcube with address $a_i$ is denoted $N[a_i]$. If the address is implicit, the node will
be referred to as N. The binary n-cube is connected so that node $N[a_i]$ is adjacent to all
nodes whose addresses differ from $a_i$ in exactly one bit position: $\{ a_i \oplus 2^j \mid 0 \le j \le n-1 \}$.
A binary 3-cube with nodes labeled by address is shown in Figure 3.1.
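The adjacency rule can be checked with a few lines of Python. This is only an illustrative sketch: the function name `neighbors` does not come from the thesis, and node addresses are assumed to be held as plain integers.

```python
def neighbors(address: int, n: int) -> list[int]:
    """Return the n addresses that differ from `address` in exactly one bit."""
    # Toggling bit j is the same as XOR-ing the address with 2**j.
    return [address ^ (1 << j) for j in range(n)]


if __name__ == "__main__":
    # Neighbors of node 000 in a binary 3-cube.
    print([format(a, "03b") for a in neighbors(0b000, 3)])
```

Every node of an n-cube has exactly n neighbors, one per dimension, which is why a single bit index is enough to name a link.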
An m-subcube of a binary n-cube is a set of $M = 2^m$ nodes whose addresses are identical
in all but m positions. An m-subcube is identified by an address that contains unknowns,
represented by the character X, in the m bit positions in which its members' addresses
may differ. For example, in Figure 3.1 the top of the 3-cube is the 1XX subcube. The
top front edge is the 1X1 subcube.
Figure 3.1: Binary 3-Cube
A right m-subcube is an m-subcube which has unknowns in the least significant m bits
of the address. No X is to the left of a 0 or 1 in a right subcube address. For example,
the 1XX subcube is a right subcube while the 1X1 subcube is not. A node is a right
0-subcube, a singleton set, since it has zero Xs in its address. The corner node of a right
subcube N[a] is the node with the lowest address in the subcube, N[min(a)]. The corner
node address is the subcube address with all unknown bits set to zero. The upper nodes
of a right subcube N[a] are all the nodes in the subcube other than the corner node: the
elements of the set $N[a] \setminus N[\min(a)]$.
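The subcube vocabulary above can be made concrete with a short Python sketch. The helper names (`subcube_members`, `is_right_subcube`, `corner_node`, `upper_nodes`) and the use of strings such as "1XX" for subcube addresses are assumptions made for illustration only.

```python
def subcube_members(pattern: str) -> list[int]:
    """Expand a subcube address such as '1XX' into its node addresses."""
    members = [0]
    for ch in pattern:
        if ch == "X":
            # An unknown bit doubles the set: one copy with 0, one with 1.
            members = [(m << 1) | b for m in members for b in (0, 1)]
        else:
            members = [(m << 1) | int(ch) for m in members]
    return sorted(members)


def is_right_subcube(pattern: str) -> bool:
    """True when no X appears to the left of a 0 or 1."""
    return "X" not in pattern.rstrip("X")


def corner_node(pattern: str) -> int:
    """Lowest address in the subcube: every unknown bit set to zero."""
    return int(pattern.replace("X", "0"), 2)


def upper_nodes(pattern: str) -> list[int]:
    """All members of the subcube except its corner node."""
    corner = corner_node(pattern)
    return [m for m in subcube_members(pattern) if m != corner]


if __name__ == "__main__":
    print(subcube_members("1XX"), is_right_subcube("1XX"), is_right_subcube("1X1"))
    print(corner_node("1XX"), upper_nodes("1XX"))
```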
3.1.3 The Gray Code 
The balanced cube uses a Gray code [56] to map the elements of an ordered set to the 
vertices of a binary n-cube. Consider an integer, I, encoded as a weighted binary vector, 
$b_{n-1}, \ldots, b_0$, so that
$$ I = \sum_{j=0}^{n-1} b_j 2^j. \tag{3.1} $$
The reflected binary code or Gray code representation of I, a bit vector $G(I) = g_{n-1}, \ldots, g_0$,
is generated by taking the modulo-2 sum of adjacent bits of the binary encoding for I
[56].
$$ g_i = \begin{cases} b_i \oplus b_{i+1} & \text{if } i < n-1 \\ b_i & \text{if } i = n-1 \end{cases} \tag{3.2} $$
Since the $\oplus$ operation is linear, we can convert back to binary by swapping $g_i$ and $b_i$
in equation 3.2. We use the function B(J) to represent the binary number whose Gray
code representation is J.
$$ b_i = \begin{cases} g_i \oplus b_{i+1} & \text{if } i < n-1 \\ g_i & \text{if } i = n-1 \end{cases} \tag{3.3} $$
By repeated substitution of equation 3.3 into itself we can express bi as a modulo-2 
summation of the bits of G(I). 
$$ b_i = \sum_{j=i}^{n-1} g_j \pmod{2} \tag{3.4} $$
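Equations (3.2) and (3.4) translate directly into bit operations. The sketch below is one conventional way to write them in Python; the names `to_gray` and `from_gray` are chosen here to stand for the thesis functions G and B and are not taken from the original text.

```python
def to_gray(i: int) -> int:
    """G(I): each Gray bit is the XOR of adjacent binary bits, as in (3.2)."""
    return i ^ (i >> 1)


def from_gray(g: int) -> int:
    """B(J): bit i of the result is the XOR of Gray bits i and above, as in (3.4)."""
    b = 0
    while g:
        b ^= g      # fold in the Gray word shifted by 0, 1, 2, ... positions
        g >>= 1
    return b


if __name__ == "__main__":
    print([to_gray(i) for i in range(8)])
    print([from_gray(to_gray(i)) for i in range(8)])
```

The shift-and-XOR form of `to_gray` works on the whole word at once because the most significant bit is XOR-ed with an implicit zero, which reproduces the i = n-1 case of (3.2).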
While these equations serve as a useful recipe for converting between binary and Gray
codes, we gain more insight into the structure of the code by considering a recursive
list definition of the Gray code. For any integer, n, we can construct a list of $N = 2^n$
integers, gray(n), so that the Ith element of gray(n) is an integer whose binary encoding
is identical to the Gray encoding of I. The construction begins with the Gray code of
length 1. At the ith step we double the length of the code by appending to the current
list a reversed copy of itself with the ith bit set to one².
$$ \mathrm{gray}(0) = [0]. \tag{3.5} $$
$$ \mathrm{gray}(n) = \mathrm{append}(\,\mathrm{gray}(n-1),\ 2^{(n-1)} + \mathrm{reverse}(\mathrm{gray}(n-1))\,). \tag{3.6} $$
It is this reversal that gives the code the symmetry and reflection properties that we will 
use in developing the balanced cube search algorithm. 
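The recursive construction of (3.5) and (3.6) can be written almost verbatim in Python. This is a sketch for illustration; the function name `gray_list` is an assumption, and the list comprehension plays the role of the scalar-extended addition.

```python
def gray_list(n: int) -> list[int]:
    """Build gray(n) by appending a reversed, offset copy of gray(n - 1)."""
    if n == 0:
        return [0]                                   # equation (3.5)
    previous = gray_list(n - 1)
    # Add 2**(n-1) to every element of the reversed list, then append: (3.6).
    reflected = [2 ** (n - 1) + g for g in reversed(previous)]
    return previous + reflected


if __name__ == "__main__":
    print(gray_list(3))
```

Because the second half is a mirror image of the first, the last element of the first half and the first element of the second half differ only in the newly set bit.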
² In (3.5), [0] denotes the list containing the number zero. The function append(x, y) in (3.6) appends
lists x and y. The function reverse(z) reverses the order of list z. Also in (3.6), the addition is performed
with scalar extension: the number $2^{n-1}$ is added to every element of the reversed list.
Figure 3.2: Gray Code Mapping on a Binary 3-Cube
In the linear space of the ordered set, element I is adjacent to elements I ± 1. In the cube
space, however, the distance between two nodes is the Hamming distance between the
node addresses: the number of bit positions in which the two addresses differ. For nodes
A and B to be adjacent, they must be Hamming distance one apart, $d_H(A, B) = 1$. The
Hamming distance between the weighted binary encodings of I and I - 1, $d_{HA}(I)$, is given
by the recursive equation
$$ d_{HA}(I) = \begin{cases} d_{HA}(I/2) + 1 & \text{if } 2 \mid I,\ I \neq 0 \\ 1 & \text{if } 2 \nmid I \\ \text{undefined} & \text{if } I = 0 \end{cases} \tag{3.7} $$
A plot of this function is shown in Figure 3.26 on page 60. 
For example, in the case where $I = N/2$ and $I - 1 = N/2 - 1$, the elements are at opposite
corners of the cube, distance n apart. The Gray code has the property that
$d_H(G(I), G(I+1)) = 1$, $\forall I \ni 0 \le I \le (N-2)$. Thus, if we map element I of the
linear order to node G(I) of the binary n-cube, nodes that are adjacent in linear space 
are also adjacent in cube space. A Gray code mapping of integers onto a binary 3-cube 
is shown in Figure 3.2. 
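The contrast between the two mappings can be checked numerically. The following Python sketch compares the recursive distance of (3.7) for the plain binary encoding with the unit distance obtained under the Gray mapping; the helper names `hamming`, `d_ha`, and `to_gray` are illustrative and not part of the thesis.

```python
def hamming(a: int, b: int) -> int:
    """Number of bit positions in which addresses a and b differ."""
    return bin(a ^ b).count("1")


def d_ha(i: int) -> int:
    """Recursive distance between the binary encodings of i and i - 1, as in (3.7)."""
    if i == 0:
        raise ValueError("d_HA is undefined for I = 0")
    return 1 if i % 2 else d_ha(i // 2) + 1


def to_gray(i: int) -> int:
    return i ^ (i >> 1)


if __name__ == "__main__":
    n = 4
    # Plain binary: linear neighbors can be up to n links apart.
    print([d_ha(i) for i in range(1, 2 ** n)])
    # Gray mapping: linear neighbors are always one link apart.
    print([hamming(to_gray(i), to_gray(i + 1)) for i in range(2 ** n - 1)])
```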
3.1.4 The Balanced Cube 
In a balanced cube, each datum is associated with a right subcube, $N[a_i]$, and is stored
in a constituent object in the corner node, $N[\min(a_i)]$, of the subcube. Figure 3.3
shows the header for class Balanced Cube. A datum is composed of a key, N key, an
object associated with the key, N object, the dimension of the subcube, N dim, and a
flag, N flag, that indicates the status of the subcube. The data are ordered so that if
$B(a_1) > B(a_2)$, then $N[a_1]\ \mathrm{key} \ge N[a_2]\ \mathrm{key}$. Node addresses are ordered using the inverse
Gray code function; thus, if two addresses are adjacent in the order, they will also be
Hamming distance one apart.
class                Balanced Cube        the class name
superclass           Distributed Object
instance variables   key                  defines the order
                     data                 object associated with key
                     dim                  the dimension of the subcube
                     flag                 status of subcube
class variables      none
locks                rwlock               implements readers and writers
Figure 3.3: Header for Class Balanced Cube
For the remainder of this chapter I will refer to both cube addresses and to linear 
order addresses. A cube address is the physical address of a processing node. The 
parenthesized binary numbers in Figure 3.2 are cube addresses. A linear address is the 
position of a node in the linear order. For example, the integers (0-7) in Figure 3.2 are 
linear addresses. Linear addresses $A_{lin}$ are related to cube addresses $A_{cube}$ by (3.2) and
(3.4).
$$ A_{lin} = B(A_{cube}), \qquad A_{cube} = G(A_{lin}) \tag{3.8} $$
Upper nodes of the subcube $N[a_i]$ are flagged as slaves to the corner node by setting
N flag <- #slave. Any messages transmitted to an upper node $N[a_u]$ are routed to the
corner node of the subcube to which $N[a_u]$ belongs. There is one exception to this
routing rule. A split message is always accepted by its destination and never forwarded.
This message is the mechanism by which upper nodes become corner nodes. Since the
cube is balanced, most corner nodes have dimensions differing only by a small constant.
Thus, the message routing time between adjacent corner nodes will be limited by a small
constant.
Data are associated with the subcubes rather than the nodes of a binary n-cube to allow 
ordered sets of varying sizes to be mapped to a cube of size $2^n$. For example, a singleton
set mapped to the 3-cube of Figure 3.1 would be associated with the subcube XXX, the
entire cube. If a second element is added to the set, the cube will be split. One element
will be associated with the 0XX subcube and the other element with the 1XX subcube.
This splitting is repeated as more elements are added to the set. 
A balanced cube is balanced in the sense that in the steady state, the dimensions of 
any two subcubes of the balanced cube will differ by no more than one. This degree of 
balance guarantees O(log N) access time to any datum stored in the cube. The balance 
condition is valid only in the steady state. Several insert or delete operations in quick
succession may unbalance the cube. A balancing process which runs continuously acts 
to rebalance the cube. 
There are two consistency conditions for a balanced cube. It must be ordered as described
above, and operations on the cube must be serializable, giving results consistent
with sequential execution of the same operations ordered by time of completion. This
condition guarantees correct results from concurrent operations.
3.2 Search 
3.2.1 Distance Properties of the Gray Code 
To develop a search algorithm for the balanced cube, we need to know the distance 
properties of the Gray code; that is, for any element of the ordered set mapped onto 
the cube, at what distance in linear space its neighbors are in cube space. The distance
properties of the mapping tell us how much we can reduce the (linear) search space with 
each nearest neighbor query in the cube. To achieve O(log N) search time we must cut 
the search space in half with no more than a constant number of messages. 
The reflection properties of the Gray code give us an easy method of calculating distance
in a balanced cube. Consider some node, X, in a balanced n-cube. As shown in
Figure 3.4, if we toggle the most significant bit of node address X, we generate address
$Y = X \oplus 2^{n-1}$. In linear space, Y is the reflection of X through $\frac{2^n - 1}{2}$. Thus, the linear
distance between node X and its neighbor, Y, in the (n - 1)st dimension is
$$ d_{LN}(X, n-1) = 2\left| X - \frac{2^n - 1}{2} \right|. \tag{3.9} $$
To calculate the distance in a lower dimension, say k, we reflect about the center of the
local gray(k) list. Thus, the linear distance from a node with address X to its neighbor
in the kth dimension is given by
$$ d_{LN}(X, k) = 2\left| \left(X \bmod 2^{k+1}\right) - \frac{2^{k+1} - 1}{2} \right|. \tag{3.10} $$
Figure 3.4: Calculating Distance by Reflection
The tables below show the distance function (3.10) for each dimension, k, of a balanced
4-cube. The first table shows the cube address, G(X), for the
Xth element in the linear order. The second table lists the neighbor of each node, X,
in the kth dimension, N(X, k). The third table shows the distance to this neighbor,
$d_{LN}(X, k) = |N(X, k) - X|$. To find X's neighbor in dimension k, we convert X to a
cube address, G(X), toggle the kth bit, $G(X) \oplus 2^k$, and convert back to the linear order,
$N(X, k) = B(G(X) \oplus 2^k)$. For example, the neighbor of node X = 4 in dimension k = 1
is node N(4, 1) = 7. The distance to this node is $d_{LN}(4, 1) = |7 - 4| = 3$.
X            0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
G(X)         0   1   3   2   6   7   5   4  12  13  15  14  10  11   9   8

X            0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
N(X,0)       1   0   3   2   5   4   7   6   9   8  11  10  13  12  15  14
N(X,1)       3   2   1   0   7   6   5   4  11  10   9   8  15  14  13  12
N(X,2)       7   6   5   4   3   2   1   0  15  14  13  12  11  10   9   8
N(X,3)      15  14  13  12  11  10   9   8   7   6   5   4   3   2   1   0

X            0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
dLN(X,0)     1   1   1   1   1   1   1   1   1   1   1   1   1   1   1   1
dLN(X,1)     3   1   1   3   3   1   1   3   3   1   1   3   3   1   1   3
dLN(X,2)     7   5   3   1   1   3   5   7   7   5   3   1   1   3   5   7
dLN(X,3)    15  13  11   9   7   5   3   1   1   3   5   7   9  11  13  15
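The neighbor and distance tables follow mechanically from the recipe "convert to a cube address, toggle bit k, convert back". A short Python sketch of that recipe is given below; the function names are illustrative, and the Gray conversions are the standard bitwise forms rather than code from the thesis.

```python
def to_gray(i: int) -> int:
    return i ^ (i >> 1)


def from_gray(g: int) -> int:
    b = 0
    while g:
        b ^= g
        g >>= 1
    return b


def neighbor(x: int, k: int) -> int:
    """N(X, k): linear address of X's cube neighbor in dimension k."""
    return from_gray(to_gray(x) ^ (1 << k))


def linear_distance(x: int, k: int) -> int:
    """d_LN(X, k) = |N(X, k) - X|."""
    return abs(neighbor(x, k) - x)


if __name__ == "__main__":
    for k in range(4):
        print(k, [neighbor(x, k) for x in range(16)])
    for k in range(4):
        print(k, [linear_distance(x, k) for x in range(16)])
```

Toggling Gray bit k complements binary bits 0 through k, so the neighbor is always the mirror image of X inside its aligned block of 2^(k+1) linear addresses. This is the reflection that the search algorithm relies on.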
The data from these tables are plotted in Figure 3.5. The symmetry of reflection is
clearly visible. In each dimension, k, we have $2^{n-k-1}$ Vs centered on right subcubes of
dimension k + 1. There are eight Vs of dimension 0, four Vs of dimension 1, two Vs of
dimension 2 and one V of dimension 3. For example, the nodes at linear addresses 4-7
constitute a V of dimension 1. Combining this V with its neighboring V (addresses 8-11)
gives addresses 4-11, a W of dimension 1.
Definition 3.1 A V of dimension k is a right subcube of dimension k + 1: a collection
of $2^{k+1}$ nodes beginning on a multiple of $2^{k+1}$ in the linear order.
Definition 3.2 A W of dimension k is two adjacent Vs of dimension k.
We use these Vs and Ws in the following section to develop a new search algorithm. 
3.2.2 VW Search 
VW search finds a search key in the Gray cube by traversing the Vs and Ws of the
distance function shown in Figure 3.5. The neighbors of a node, X, are those nodes
that are directly across a V from X in Figure 3.5. The search procedure sends messages
across these valleys, selecting a search path that guarantees that the search space is
halved every two messages.
Figure 3.5: Neighbor Distance in a Gray 4-Cube (distance plotted against linear address)
Messages:
VW search is performed by passing messages between the nodes of the cube being 
searched. The body of the search uses two messages: vSearch and wSearch. When 
a node receives one of these search messages, it updates the state fields of the message 
and forwards it to the next node in the search path. Nodes never wait for a reply from 
a message. The formats of the search messages are shown below. The search state is 
represented by the destination node, two dimensions: vDim and wDim, and a search 
mode: V or W. 
vSearch: aKey vDim: vDim wDim: wDim
wSearch: aKey vDim: vDim wDim: wDim
In VW search we encode the search space into the destination address, self, and a 
dimension, wDim. wDim is the dimension of the smallest W in the distance function 
which contains the search space. A second dimension, vDim, is the dimension of the 
smallest V which completely contains our current W and thus the search space. vDim 
can be computed from wDim and self; however, it is more convenient to pass it in the 
message than to recompute it at each node. 
Figure 3.6: Search Space Reduction by vSearch Method
The wDim, self encoding of the search space can be converted to the conventional upper
bound, lower bound (U, L) representation by means of the reflect function. From (3.10)
we know that the reflection in the linear space about dimension, d, of node X is given
by
$$ f_R(X, d) = X - 2\left(X \bmod 2^{d+1}\right) + 2^{d+1} - 1. \tag{3.11} $$
The current position, self, or its reflection in the wDim dimension is one bound of the
search space, and the reflection of this bound in the vDim dimension is the other bound.
Thus, if the current address is S, the wDim is W, and the vDim is V, we can calculate
the linear bounds of the search space (L, U) from
$$ L(S, W, V) = \min(S,\ f_R(S, W),\ f_R(S, V),\ f_R(f_R(S, W), V)), \tag{3.12} $$
$$ U(S, W, V) = \max(S,\ f_R(S, W),\ f_R(S, V),\ f_R(f_R(S, W), V)). \tag{3.13} $$
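Equations (3.11) through (3.13) are simple enough to evaluate directly. The Python sketch below does so; the names `reflect` and `bounds` are assumptions introduced for this example, and the sample arguments are arbitrary values chosen only to exercise the formulas.

```python
def reflect(x: int, d: int) -> int:
    """f_R(X, d): mirror X within its aligned block of 2**(d+1) linear addresses."""
    block = 2 ** (d + 1)
    return x - 2 * (x % block) + block - 1


def bounds(s: int, w: int, v: int) -> tuple[int, int]:
    """(L, U) of the search space for position s, wDim w and vDim v, per (3.12)-(3.13)."""
    points = [s, reflect(s, w), reflect(s, v), reflect(reflect(s, w), v)]
    return min(points), max(points)


if __name__ == "__main__":
    print(reflect(4, 1))      # neighbor of linear address 4 across dimension 1
    print(bounds(5, 1, 2))    # arbitrary sample state
```

Reflecting twice in the same dimension returns the original address, which is why a node and its neighbor always name each other.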
Algorithm: 
VWsearch operates by passing vSearch and wSearch messages between the nodes of a 
balanced cube. Each message reduces the search space by comparing the search key to 
the key stored in the destination node. 
When a node receives a vSearch message, the search space extends between the current 
node's neighbors in the V and W dimensions (Nv and Nw) as shown in Figure 3.6. 
These neighbors will always be in opposite directions. By examining the key at the 
present node, the vSearch method makes the current node a new endpoint of the search 
space selecting Nv or Nw as the other endpoint. The dimension of the neighbor chosen 
becomes the new V dimension and the W dimension is decreased until a W neighbor in 
the appropriate direction is found. 
When the W dimension has been reduced below the dimension of the current node, X, 
then X's W neighbor is contained within X's subcube. Thus there is no point in sending a 
message to the W neighbor, and the search is completed. Before terminating the search, 
however, X checks the contents of its linear neighbor in the direction of the key to verify 
that the key hasn't been inserted in the cube during the search. If the key isn't found, 
the search terminates with a nil reply. Otherwise, the search continues by increasing the
W dimension above the dimension of X's subcube. The method for vSearch is shown in
Figure 3.7³.
When a node receives a wSearch message, the search space extends from its W neighbor 
(Nw) to that neighbor's V neighbor (Nv) as shown in Figure 3.8. The wSearch method 
makes the current node one endpoint of the new search space, selecting between Nw 
and Nv as the other endpoint. If Nw is chosen as the endpoint, the search proceeds 
as in vSearch. If Nv is the endpoint, however, the dimension remains unchanged and a
vSearch message is forwarded to the current node's V neighbor. The wSearch method is 
shown in Figure 3.9. 
Example 3.1 The search technique is best described by means of an example. Consider
the following table.
X       0   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15
G(X)    0   1   3   2   6   7   5   4  12  13  15  14  10  11   9   8
Data   $A  $B  $C  $D  $E  $F  $G  $H  $I  $J  $K  $L  $M  $N  $O  $P
The table represents a Gray 4-cube where each node of the cube stores a single character
symbol. Figure 3.10 shows the search of this Gray 4-cube for the key $G stored at node
G(6). The search begins at node G(2). The search is started with the message vSearch:
$G wDim: 5 vDim: 5. Since we know that the search key must be in the current dimension
5 trough of the W (this is the whole 4-cube), we start the search with a vSearch message.
The subsequent search messages are as follows:
1. Since the search key, $G, is greater than the key $C stored at G(2), node G(2) sends
the message wSearch: $G wDim: 4 vDim: 5 to its dimension 4 neighbor, G(13).
2. Since the key, $G, is between G(2)'s key, $C, and G(13)'s key, $N, G(13) sends the
message wSearch: $G wDim: 3 vDim: 4 to node G(10).
³ The methods for neighbor:, upperNeighbor:, lowerNeighbor:, key:sameSideAsDim:, and
reduceDim:key: are omitted for the sake of brevity. Their implementation is straightforward.
instance methods for class Balanced Cube

at: aKey                                          reply the object associated with aKey
    | |
    ^self vSearch: aKey wDim: MaxWdim vDim: MaxVdim mode: vMode

vSearch: aKey wDim: wDim vDim: vDim               search for aKey
    exclude rwlock.                               a reader method
    | newVDim newWDim |                           new dimensions of search space
    (key = aKey) ifTrue: [requester reply: data].     check if found
    (self key: aKey sameSideAsDim: wDim) ifTrue: [
        newVDim <- wDim.
        newWDim <- wDim - 1.]
    ifFalse: [
        newVDim <- vDim.
        newWDim <- wDim.]
    newWDim <- self reduceDim: wDim key: aKey.
    (wDim < dim) ifTrue: [
        (key < aKey) ifTrue: [
            (aKey < ((self upperNeighbor) key)) ifTrue: [requester reply: nil]]
        ifFalse: [
            (aKey > ((self lowerNeighbor) key)) ifTrue: [requester reply: nil]].
        newWDim <- self increaseDim: wDim key: aKey.]
    (self neighbor: newWDim) wSearch: aKey wDim: newWDim vDim: newVDim.
Figure 3.7: Methods for at: and vSearch
Figure 3.8: Search Space Reduction by wSearch Method
instance methods for class Balanced Cube

wSearch: aKey wDim: wDim vDim: vDim               search for aKey
    exclude rwlock.
    | |
    (key = aKey) ifTrue: [requester reply: data].     check if found
    (self key: aKey sameSideAsDim: wDim)
        ifTrue: [self vSearch: aKey wDim: wDim vDim: vDim]
        ifFalse: [(self neighbor: vDim) vSearch: aKey wDim: wDim vDim: vDim.]
Figure 3.9: Method for wSearch
Figure 3.10: Example of VW Search (distance plotted against position)
3. The search key is not between $N and $K, so G(10) must reflect the search (in the
vDim dimension) to the other trough of the W by sending the message vSearch:
$G wDim: 3 vDim: 4 to node G(5).
4. Since the key is not between G(5)'s key, $F, and its neighbor G(2)'s key, $C, the
wDim is decreased to find a neighbor in the direction of the key. G(5) sends the
message wSearch: $G wDim: 2 vDim: 4 on to G(6), where the search terminates
successfully.
Figure 3.11: VW Search Example 2
Example 3.2 Figure 3.11 shows two examples of searching a balanced cube which is
not full and which is temporarily out of balance.
1. In the first example, a search for the contents of G(4) is initiated from node G(5)
with the message vSearch: 4 wDim: 4 vDim: 4.
2. Since the search key is less than G(5), node G(5) forwards the search message to
node G(2), a slave node of node G(0), with the message wSearch: 4 wDim: 3 vDim:
4. Since the search key is greater than the contents of G(0), the W dimension is
decremented to 0 and the message wSearch: 4 wDim: 0 vDim: 4 is sent to node
G(3). It is important to note that although G(2) is a slave node and thus uses the
value of the corner node, G(0), the search continues from G(2) and is not detoured
to G(0).
3. Node G(3) is also a slave of G(0) and thus less than the search key, so the search is
reflected across the V dimension to node G(4) with the message vSearch: 4 wDim:
0 vDim: 4. The search key is found at node G(4).
The second example in Figure 3.11 illustrates the case in which the search key is not 
present in the cube. 
1-3. The search for the key, 3, is initiated at node G(5). The search proceeds as above
until the message vSearch: 4 wDim: 0 vDim: 4 reaches node G(4).
4. To confirm that the key has not been inserted during the search, node G(4)
examines the key of its linear address neighbor, node G(3), by sending G(3) a key
message.
5. Node G(3) replies with the value associated with its subcube, 0. Since 0 and the 
contents of G(4), 4, bracket the search key, the search terminates by sending a nil 
reply to the original requester. 
The remainder of this section analyzes the VW search algorithm to show that the order 
of the algorithm is O(log N) and to prove that the algorithm is deadlock free. 
Lemma 3.1 Each execution of vSearch decreases wDim by at least 1.
Proof: There are two cases, as shown in Figure 3.6:
1. If the search key is between the current node, self, and its W neighbor, wDim is
explicitly decremented.
2. If the search key is between the current node, self, and its V neighbor, then the
current W neighbor is in the wrong direction, so reduceDim:key: will decrement
wDim by at least 1 to find a neighbor in the proper direction. ∎
Lemma 3.2 vSearch is executed at least once for every two search messages⁴.
Proof: The only case in which vSearch is not executed is when a wSearch message is
received and the key is not between the current node and its neighbor. The next message
generated in this case is a vSearch message. Thus the vSearch method will be executed
at least once for every two messages. ∎
Theorem 3.1 A VW Search of a Gray n-cube requires no more than $2(\log N + 1)$
messages.
Proof: From Lemmas 3.1 and 3.2, wDim is decremented at least once every two messages.
Since wDim is initially $n = \log N$, after $2 \log N$ messages wDim will be zero. An additional
two messages will either find the search key or decrement wDim below zero, causing
termination. ∎
Theorem 3.2 The VW Search algorithm is deadlock free. 
Proof: The VW Search algorithm locks only one node at a time: the one currently
conducting the search. Since rwlock is never required, the key messages transmitted
before terminating an unsuccessful search are never blocked. Thus, there is no possibility
of deadlock. ∎
⁴ Messages to self are local to the node and thus are not counted in this analysis.
3.3 Insert
Messages:
The insert operation is initiated by sending an at:put: message to any node in the 
cube. This message starts a search of the cube for the insert key, aKey. When the 
search terminates, the data, anObject, is inserted by calling method localAt:put:. A 
split:key:data:flag: message is used by this method to split an existing right subcube into 
two right subcubes of lower dimension to make room for the insert. 
at: aKey put: anObject 
localAt: aKey put: anObject 
split: aDim key: aKey data: anObject flag: aFlag
Algorithm: 
The insert algorithm is identical to the search algorithm except that on completion, in 
addition to sending a reply, the insert splits a node and inserts the key and associated 
data. Rather than repeat the search algorithm here, only the changes will be described. 
If the key being inserted is already in the cube, the insert replaces the object bound to 
the key with the object in the at:put: message. If the key being inserted is not already 
in the cube, the insert procedure must insert it. To do this, the not found reply of the 
search procedure listed above: 
requester reply: nil. 
is replaced by a call to the method localAt:put: shown in Figure 3.12. 
If the present node has a dimension greater than zero, then it is split by sending a split 
message to its upper half and decrementing its dimension. If the dimension is already 
zero, the insert terminates with a reply of nil. This does not necessarily mean that the 
cube is full. The cube may just be temporarily out of balance. 
If the insert key and the linear order of the neighbor's address have the same relation to 
the current key and current address, the split message inserts the key and record into the 
corner node of the upper half subcube and sets its dimension to prevent it from routing 
further messages to the original corner node. The method belowNeighbor: dim returns 
true if the linear order address of the current node is less than the linear address of its 
neighbor in dimension dim. Figure 3.13 shows the split method. Once the dimension of 
the split node is set, the split is complete in that the split node will begin responding to 
messages rather than forwarding them to its corner node. 
If the insert key and the linear order of the neighbor's address have opposite relations to 
the current key and current address, the split message copies the original corner node's 
key and record into the upper half subcube. The lower half subcube is then set with the 
new key and record. Note that between the assignment of the key and the assignment 
of the record to the lower half subcube, this subcube is in an inconsistent state. 
instance methods for class Balanced Cube

localAt: aKey put: anObject                       insert after completing search
    require rwlock exclude rwlock.
    | |
    (dim > 0) ifTrue: [
        dim <- dim - 1.
        (self key: aKey sameSideAsDim: dim) ifTrue: [
            (self neighbor: dim) split: dim key: aKey data: anObject flag: #valid]
        ifFalse: [
            (self neighbor: dim) split: dim key: key data: data flag: flag.
            key <- aKey.
            data <- anObject]
        requester reply: anObject]
    ifFalse: [
        requester reply: nil]
Figure 3.12: Method for localAt:put:
instance methods for class Balanced Cube

split: aDim key: aKey data: anObject flag: aFlag      splits a slave node from its parent
    require rwlock exclude rwlock.
    | |
    key <- aKey.
    data <- anObject.
    dim <- aDim.
    flag <- aFlag.
Figure 3.13: Method for split:key:data:flag:
To prevent an inconsistent state from being observed, both localAt: put: and split: 
key: data: flag: are writer methods. They both require and exclude rwlock. Thus, no 
other operation can be performed on the current node during an inconsistent state. This 
locking cannot cause deadlock, since the split node is in fact part of the locked node 
until the split is completed. This is an important distinction.
Consider splitting the subcube 000XXX into 0000XX and 0001XX. In the instant of time
before the split, all nodes in 000XXX must route their messages to 000000. Immediately
after the split, all messages to the upper half subcube 0001XX must be routed to 000100.
For the cube algorithms to operate correctly, the split must be an atomic operation.
Since the split occurs when the dimension of node 000100 is written, it is an indivisible
operation. Before the dimension is written, messages to nodes 0001XX are routed to
000100, which forwards them to 000000 since it is not a corner node. After the dimension
is written, these messages are accepted directly by 000100. Because the key and record
of the split node are in fact not accessible before the dimension is updated, the split
procedure does not have to require rwlock. This lock, however, makes the analysis of
the operation simpler.
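The effect of the split on message routing can be illustrated with a small global model in Python. This is a deliberately simplified sketch under stated assumptions: a single dictionary `corners` (a hypothetical name) maps each corner node's cube address to its dimension, whereas in the thesis every node holds its own dim and flag and forwarding is done by messages. The two-step handoff (corner decrements, then the upper node's dimension is written) is collapsed into one function call.

```python
from typing import Dict, Optional


def owner(corners: Dict[int, int], address: int, n: int) -> Optional[int]:
    """Return the corner node whose right subcube contains `address`, if any."""
    for dim in range(n + 1):
        corner = (address >> dim) << dim        # clear the low `dim` bits
        if corners.get(corner) == dim:
            return corner
    return None


def split(corners: Dict[int, int], corner: int) -> int:
    """Split the subcube at `corner` in half and return the new corner address."""
    new_dim = corners[corner] - 1
    if new_dim < 0:
        raise ValueError("a 0-subcube cannot be split")
    corners[corner] = new_dim
    upper = corner | (1 << new_dim)
    corners[upper] = new_dim                    # writing dim makes `upper` a corner
    return upper


if __name__ == "__main__":
    corners = {0b000000: 3}                     # only the 000XXX subcube is modeled
    print(format(owner(corners, 0b000101, 6), "06b"))
    split(corners, 0b000000)
    print(format(owner(corners, 0b000101, 6), "06b"))
```

In this model the ownership of addresses in 0001XX changes at the single assignment that records the new corner's dimension, which mirrors the argument that the split is indivisible.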
To prevent the possibility of simultaneously inserting the same key in the cube twice, it 
is necessary that the search terminate in the up direction unless the insert key is lower 
than the lowest key in the cube. 
Example 3.3 Figure 3.14 shows the steps required to insert the key 3 into the cube of 
Figure 3.11. The search part of the insert proceeds as in Example 3.2. However, instead
of terminating with a not found message, the key, 3, is inserted as follows:
1. Since the search must terminate in the UP direction, node G(4) sends the search
back to node G(3). The state of the cube at this point is shown in Figure 3.14A. 
2. As shown in Figure 3.14B, G(O), the corner node of the OXX subcube to which 
G(3) belongs, decrements its dimension (from two to one), effectively detaching 
the OlX subcube, and sends a split message to its neighbor in dimension 1, node 
G(3). G(3) becomes the corner node of the newly formed subcube. 
3. The split message inserts the key, 3, into node G(3) and sets its dimension to 1 as
shown in Figure 3.14C. 
4. Finally, both nodes are unlocked as shown in Figure 3.14D. 
Theorem 3.3 An insert operation in a stationary cube containing N nodes requires
O(log N) time.
Proof: The initial stages of the insert are identical to the search operation and thus
require O(log N) time. The final stage of the insert is the split operation which takes
constant time. ∎
Linear   Cube          A            B            C            D
Order    Address    Key  Dim     Key  Dim     Key  Dim     Key  Dim
000      000         0    2       0*   1       0*   1       0    1    (Reply)
001      001
010      011
011      010                     (Split)       3*   1       3    1
* = Locked
Figure 3.14: Insert Example
Theorem 3.4 An insert operation will not deadlock with other concurrent operations.
Proof: While the insert operation can lock out readers on two nodes simultaneously, the
second node locked is part of the subcube which is locked by the first node. Placing the 
second lock operation does not increase the number of nodes which are locked. Rather, 
requiring and excluding rwLock in the split method assures that the upper half subcube 
will remain locked after its dimension is set to make it an independent subcube. This 
second subcube is in effect created by the insert and thus cannot previously have been 
locked by another operation. This node cannot be created by another operation during 
the final stage of the insert, since its corner node is locked, and the only way to create 
a node is to split it from its corner node. Thus, an insert operation will never have to 
wait to gain access to the split node. ∎
3.4 Delete 
Messages: 
The delete operation is initiated by sending a delete: message to any node in the cube. 
This message initiates a search for the node containing the delete key. If found, the 
operation marks this node as deleted and replies to the requester. After the node is 
marked deleted, it sends a mergeReq message to its merge neighbor. The merge neighbor 
merges with the deleted node to recover its space. The messages mergeUp, mergeDown,
move and copy are used to merge the two nodes.
The following is a list of the principal message selectors used to implement the delete:
operation.
delete: aKey
mergeReq: anId flag: aFlag dim: aDim
mergeUp
mergeDown: aKey data: anObject flag: aFlag
move: aNode
copy: aKey data: anObject flag: aFlag
Algorithm: 
The delete algorithm is identical to the search until the key is found. Then the node is 
marked deleted, flag ← #deleted, and a mergeReq message is sent to the deleted node's
merge neighbor. Marking the node deleted causes all messages addressed to it, except
mergeUp, mergeDown, and copy messages, to be routed to its merge neighbor.
Definition 3.3 The merge neighbor of a node, N[a], with address, a, is the node N[m(a)]
with address $m(a) = a \oplus 2^{N[a]\,\mathrm{dim}}$. If the subcubes cornered by nodes N[a] and N[m(a)]
are of the same dimension, they can be merged to form a subcube of greater dimension.
Further, node N[m(a)] is the only node with which node N[a] can be merged.
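As a small illustration of Definition 3.3, the merge neighbor can be computed by flipping the address bit selected by the node's dimension. The function name `merge_neighbor` is an assumption for this sketch; addresses are cube addresses written as integers.

```python
def merge_neighbor(addr, dim):
    """Address of the merge neighbor: flip the bit selected by the node's dimension."""
    return addr ^ (1 << dim)


# A node at cube address 10 with dimension 1 has merge neighbor 00, and vice versa.
print(format(merge_neighbor(0b10, 1), "02b"))
print(format(merge_neighbor(0b00, 1), "02b"))
```

Because exclusive-or with a fixed bit is its own inverse, two nodes of equal dimension are merge neighbors of each other, which is the situation that later requires a priority rule between merge messages.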
When a node, A, receives a mergeReq message from another node, B, A determines
if B is its merge neighbor by comparing dimensions. There are two possible cases, as 
shown in Figure 3.15. If the two nodes are of the same dimension (Figure 3.15A,B), they 
are merged. The merge is accomplished by node A's sending a mergeUp or mergeDown 
message to node B. If A is below B (Figure 3.15A) a mergeUp message is sent. A 
mergeDown message is sent if A is above B (Figure 3.15B). These messages have the 
effect of extending the subcube cornered by node A to include the subcube cornered by 
node B. The method invoked by a mergeReq message is shown in Figure 3.16. 
When the two adjacent nodes A and B have different dimensions, a simple merge is not
possible. This situation is shown in Figure 3.15C,D. Since A is the merge neighbor of B,
it will always be the case that A dim < B dim. In this case we copy the contents of node
C, the linear address neighbor of node B, to node B and mark C deleted. In performing
the copy we reduce the dimension of the deleted subcube and make it possible for the
linear address neighbor node, C, to merge subsequently with its merge neighbor A. The
move: and copy:data:flag: messages are used to move the contents of node C to node B.
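The case analysis performed on receipt of a mergeReq message can be summarized in a few lines of Python. This is a sketch under stated assumptions: `merge_request_action` is a hypothetical name, node identifiers are taken to be cube addresses, and the function only reports which message would be sent and to which address rather than sending anything.

```python
def merge_request_action(my_id, my_dim, sender_id, sender_dim, sender_flag):
    """Return (message, destination address) chosen by the receiver of a mergeReq."""
    if sender_flag != "deleted":
        return None                               # nothing to collect
    if sender_dim == my_dim:                      # same dimension: simple merge
        if sender_id > my_id:
            return ("mergeUp", sender_id)         # receiver is below the deleted node
        return ("mergeDown", sender_id)           # receiver is above the deleted node
    # Receiver is smaller than the deleted subcube (sender_dim >= 1 here):
    # ask the neighbor in dimension sender_dim - 1 to move into the deleted node.
    neighbor = my_id ^ (1 << (sender_dim - 1))
    return ("move", neighbor)


print(merge_request_action(0b00, 1, 0b10, 1, "deleted"))
print(merge_request_action(0b10, 0, 0b00, 1, "deleted"))
```

The first call corresponds to the equal-dimension case, the second to the case A dim < B dim, where the move is directed at the receiver's neighbor across dimension 0.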
The merge operation combines the subcube cornered by node, A, with its adjacent 
subcube cornered by B. If the current node is the corner of the upper half subcube, 
the state of the current node is copied into the available lower half subcube with the 
mergeDown message. The method for mergeDown is shown in Figure 3.17. If this method 
Figure 3.15: Merge Dimension Cases. Panels A and B: A dim = B dim. Panels C and D: A dim < B dim. B is the deleted node (DEL).
instance methods for class Balanced Cube
mergeReq: anId flag: aFlag dim: aDim
    "invoked after node anId is deleted"
    require rwlock exclude rwlock.
    | |
    (aFlag = #deleted) ifTrue: [
        (aDim = dim) ifTrue: [    "same dimension, just merge"
            (anId > myId) ifTrue: [
                ((self co: anId) mergeUp) ifTrue: [dim ← dim + 1]]
            ifFalse: [
                ((self co: anId) mergeDown: key data: data flag: flag) ifTrue: [flag ← #slave]]]
        ifFalse: [    "smaller than neighbor, send move"
            (self neighbor: (aDim - 1)) move: anId]]
Figure 3.16: Method for mergeReq:flag:dim:
instance methods for class Balanced Cube
mergeDown: aKey data: anObject flag: aFlag
    "copy a node's state and absorb it"
    require rwlock exclude rwlock or ↑false.
    | |
    key ← aKey. data ← anObject. flag ← aFlag.
    ↑true.
mergeUp
    "merge with node below by becoming a slave; a reader operation"
    exclude rwlock.
    | |
    flag ← #slave.
    ↑true.
Figure 3.17: Methods for mergeUp and mergeDown:data:flag:
is successful, the current node flag is set to #slave to indicate that it is no longer a corner 
node. Since the nodes are inconsistent while the copying takes place, this operation 
requires rwlock. 
If the current subcube is below its adjacent subcube, then the current node is the corner 
of the combined subcube. In this case a mergeUp message is sent to the adjacent subcube 
to set its flag to #slave. Once this message completes successfully, the dimension of the 
current subcube is incremented to extend its domain over the merged subcube. The 
method for mergeUp is also shown in Figure 3.17. 
Since a merge operation must lock both nodes A and B, a priority mechanism is used 
to prevent deadlock. If a mergeDown message arrives at a node which is locked, it 
terminates unsuccessfully. A mergeUp message will wait until the node is unlocked. The 
alternative 'or ↑false' in the lock specification for mergeDown causes it to return false
rather than wait on an incompatible lock. 
Only a node's merge neighbor can send it a merge message; thus, there is only one case in
which merge messages can form a cycle for resources. If two adjacent nodes of the same 
dimension, such as A and B in Figure 3.15A, are both deleted, these nodes will send 
mergeReq messages to each other. The mergeReq method will lock each node and send
a mergeUp or mergeDown message to the other node. If the merge messages were both 
to wait on the locks, deadlock would occur. Instead, the mergeDown message terminates 
immediately. Its reply unlocks node B and allows the mergeUp message to proceed. 
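The priority rule can be mimicked with an ordinary lock: mergeDown uses a non-blocking acquire and gives up if the node is busy, while mergeUp waits. The sketch below uses Python's `threading.Lock` as a stand-in for rwlock; the `Node` class and the function names are assumptions for illustration only.

```python
import threading


class Node:
    def __init__(self):
        self.lock = threading.Lock()   # stand-in for the node's rwlock
        self.key = None
        self.data = None
        self.flag = "deleted"


def merge_down(target, key, data, flag):
    """Like 'or ↑false': fail immediately instead of waiting on a held lock."""
    if not target.lock.acquire(blocking=False):
        return False
    try:
        target.key, target.data, target.flag = key, data, flag
        return True
    finally:
        target.lock.release()


def merge_up(target):
    """Waits for the lock, then turns the target into a slave."""
    with target.lock:
        target.flag = "slave"
        return True


b = Node()
b.lock.acquire()                               # B is locked by its own mergeReq
refused = merge_down(b, 5, "record", "live")   # terminates unsuccessfully
b.lock.release()                               # the failed reply lets B unlock
accepted = merge_up(b)                         # mergeUp can now proceed
print(refused, accepted, b.flag)
```

Since only one of the two competing messages ever waits, the two nodes cannot end up waiting on each other.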
The messages move: and copy:data:flag: are used to move the contents of one node to
another. When the move: message is received by a node, that node attempts to copy 
instance methods for class Balanced Cube
move: anId
    "attempt to move contents to node anId; a writer operation"
    require rwlock exclude rwlock.
    | |
    ((self co: anId) copy: key data: data flag: flag) ifTrue: [flag ← #deleted]
copy: aKey data: anObject flag: aFlag
    "replace contents if deleted or free; doesn't wait"
    require rwlock exclude rwlock or ↑false.
    | |
    ((flag = #deleted) or: (flag = #free)) ifTrue: [
        key ← aKey. data ← anObject. flag ← aFlag.
        ↑true]
    ifFalse: [↑false]
Figure 3.18: Methods for move: and copy:data:flag:
itself to the destination of the move by sending a copy: message to the destination. If 
the copy: succeeds, it replies to the move: which then marks the source node deleted. 
The methods for move and copy are shown in Figure 3.18. 
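The interplay of move and copy can be modeled sequentially as follows. This is a simplified sketch, not the thesis code: nodes are plain dictionaries, the names `copy_into` and `move` are assumptions, and locking is left out.

```python
def copy_into(dest, key, data, flag):
    """Overwrite dest only if it is deleted or free; report success."""
    if dest["flag"] in ("deleted", "free"):
        dest.update(key=key, data=data, flag=flag)
        return True
    return False


def move(source, dest):
    """Copy source into dest and, only if that succeeded, mark source deleted."""
    if copy_into(dest, source["key"], source["data"], source["flag"]):
        source["flag"] = "deleted"
        return True
    return False


src = {"key": 1, "data": "record", "flag": "live"}
dst = {"key": None, "data": None, "flag": "deleted"}
moved = move(src, dst)
print(moved, src["flag"], dst)
```

Marking the source deleted only after a successful reply means the record is never lost if the destination turns out to be occupied.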
Example 3.4 This example illustrates the simplest case of garbage collection, where 
the nodes are the same size and all that is required is a merge. Figure 3.19A shows 
the state of a 2-cube where the key stored in G(3) has just been deleted. The following 
messages merge the deleted node with its neighbor. 
1. To initiate collection, G(3) sends a mergeReq message to its merge neighbor G(0).
2. The mergeReq method locks G(0) and sends a mergeUp message to G(3) as shown
in Figure 3.19B. This message locks G(3). It will always succeed since mergeUp
messages have priority over mergeDown messages.
3. As shown in Figure 3.19C, the mergeUp method sets G(3)'s flag equal to #slave,
effectively attaching it to the 0X subcube.
4. After the merge method replies, G(0) increments its dimension to 2 to reflect the
fact that the two subcubes, 1X and 0X, have been merged to form a single subcube,
XX. The final state of the subcube is shown in Figure 3.19D.
Example 3.5 Figure 3.20 illustrates the case where A dim < B dim. 
Figure 3.19: Merge Example: A dim = B dim. Panels A to D show the Key and Dim fields of each node, listed by linear order and cube address; * marks a locked node.
1. Node G(3)/0 (node G(3) with dimension 0) receives a mergeReq message from node
G(0)/1, as shown in Figure 3.20A.
2. The linear address neighbor of G(0)/1 is the neighbor of G(3) in the dim - 1
dimension, G(2). Node G(3) sends a move: G(0) message to G(2) as shown in
Figure 3.20B.
3. The move locks node G(2) and copies the key, record and flag from node G(2) to
node G(0) by sending a copy message as shown in Figure 3.20C.
4. When copy replies successfully to node G(2) (Figure 3.20D), node G(2) is marked
deleted.
5. As illustrated in Figure 3.20E, node G(2) will now send a mergeReq to node G(3),
initiating equal dimension garbage collection.
Theorem 3.5 To delete a key from a cube with N nodes requires O(log N) time.
Proof: The search portion of the delete requires O(log N) time. Marking the node
deleted and merging the node with its neighbor requires constant time. ∎
Theorem 3.6 The delete operation will not deadlock with other concurrent operations.
Proof: The delete operation locks only one node at a time. ∎
Figure 3.20: Merge Example: A dim < B dim. Panels A to E show the Key and Dim fields of each node, listed by linear order and cube address; * marks a locked node.
Theorem 3.7 The merge operations will not deadlock with other concurrent operations.
Proof: Although the merge operations lock two nodes simultaneously, this locking is
ordered so that a node, A, will only wait for a node with an address greater than A to
become unlocked. Thus, it is impossible to have a cycle of nodes waiting on each other's
locks. ∎
Before proving that concurrent search, insert, delete, and merge operations will give the 
same result as running the operations sequentially in order of completion, we need to 
define some terms and prove one lemma about concurrency. 
Definition 3.4 An operation commits when it has made a final decision to modify the 
state of a node in the cube and/or to reply with a particular result. Once an operation 
commits to modifying the state of a node, it must follow through and perform the 
modification. It cannot back out after committing.
Definition 3.5 The commit condition is the condition which must occur for an opera-
tion to commit. 
Definition 3.6 An operation completes when it has finished modifying the state of a 
node. After an operation completes it cannot modify any additional state. 
Definition 3.7 The vulnerable period of an operation is the period between the time it
commits and the time it completes.
Definition 3.8 A snapshot of the cube is the state of all corner nodes of the cube with
no methods in progress. Since there is no concept of simultaneity between nodes of the
cube, each node may be stopped at any point as long as causality and order of completion
are preserved.
Definition 3.9 The neighborhood of an operation includes all nodes whose states are
examined by the operation between the time it commits and the time it completes.
Here are some examples: 
• A search operation commits and completes at the same time. A successful search
commits to replying with the data when it finds the requested key in the current 
node. An unsuccessful search commits to replying nil when it receives a reply from 
a linear address neighbor confirming that the search key is not in the cube. 
• An insert operation commits when the search portion of the insert receives the
reply from the query message to an adjacent node. The commit condition is that
the present node and the adjacent node bracket the insert key. The insert operation
completes when the split method unlocks its node. The node which is split
constitutes the neighborhood of the insert operation.
• The commit condition for a delete operation is the key stored in the present node
matching the delete key. When this condition is discovered, the operation commits.
A delete is completed when the delete flag of the node is set true.
• A merge commits when the mergeUp or mergeDown message is accepted. The 
commit condition is that the two nodes being merged are adjacent. Completion 
occurs when the merged node is unlocked. 
Lemma 3.3 If the commit condition of an operation, P, is valid throughout P's vulnerable
period and if P's neighborhood is not changed by another operation during this period,
then any concurrent execution of P is consistent with a sequential execution of P ordered
as follows:
• P is ordered after all operations R which complete before P commits.
• P is ordered before all operations S which commit after P completes.
• P is ordered either before or after any operation Q that completes during P's 
vulnerable period. 
Proof: P's commit condition and P's neighborhood constitute the state of the cube 
which is visible to P. If this state remains constant from the time P commits to the time
P completes, then P will act as if there were no concurrent operations, Q, during this
period since it cannot see any changes caused by Q. It follows that P can be serialized
with operations Q in any order. Since P's commit decision is valid after all operations
R have completed, it will be valid if P is not started until after these operations have
completed. Applying the same logic with S in place of P shows that operations S can be
started after P completes without changing S's commit condition. ∎
Theorem 3.8 Concurrent search, insert, delete, and merge operations will give the same 
result as running the operations sequentially in order of completion. 
Proof: The search, insert, delete and merge operations all meet the conditions in the
hypothesis of Lemma 3.3:
Search completes at the time it commits and thus meets this condition.
The commit condition for insert is that the present node, A, and the node directly above
the present node, B, straddle the key to be inserted, K. This condition always holds at
completion since: (1) a new node C<K cannot be inserted between A and B since this
insert would have to be performed at A and A is locked, and (2) if B is deleted during
this period, then for any node D>B, D>K.
The commit decision for delete is that the delete key is found. The node containing this
key is locked, so the condition still holds at completion. 
The commit condition for merge is that the adjacent node is marked deleted and the 
merge operation is able to lock the node. Since both of the nodes being merged are locked 
during the vulnerable period, this condition is still valid when the operation completes. 
For all these operations, the neighborhood is the present node which is locked and thus 
remains constant during the critical period. ∎
3.5 Balance 
The balancing process proceeds in three steps.
1. An imbalance between two adjacent subcubes, A and B, in the cube is recognized.
2. The subcube containing fewer data, say A, frees space on its border with B. Without
loss of generality assume A is below B. To free space, the node containing the
highest datum in A, A_H, splits itself, freeing half its space.
3. The heavier subcube (containing more data), in this case B, moves its smallest 
datum to the space freed in step 2. 
Figure 3.21: Balancing Tree, n = 4. The leaves of the tree are the 16 nodes in linear order:
Linear 0000 0001 0010 0011 0100 0101 0110 0111 1000 1001 1010 1011 1100 1101 1110 1111
Cube   0000 0001 0011 0010 0110 0111 0101 0100 1100 1101 1111 1110 1010 1011 1001 1000
Imbalance is recognized by embedding a tree in the cube. As shown in Figure 3.21, for 
n=4, the tree is constructed by recursively dividing the cube into two subcubes. The 
node of each subcube closest in linear order to the other subcube is chosen as the corner 
node. This tree has one idiosyncrasy: messages to the outer child of a node⁵ must
traverse two communication links, while messages to the inner child of a node need to 
traverse only one link. Despite this shortcoming, however, the tree is ideal for balancing 
for two reasons. First, it evenly distributes the task of recognizing imbalance over all 
nodes of the cube except the zero node. Also, the root node of every cube is on the 
boundary of the cube across which a datum must be moved to balance the cube with an 
adjacent cube at the same level. Each root node participates in correcting an imbalance 
recognized by its parent. 
The cube is balanced if, for each internal node in the tree, the numbers of keys stored in
the subcubes represented by the two children of the node differ by a ratio of less than 2:1. Using
the number of keys in a subcube as the balancing criterion rather than the maximum or
minimum dimension of a node in the subcube has the advantage that local imbalances
are averaged out when considering global balance.
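The balance criterion can be expressed as a recursive check over the tree of key counts. The sketch below is an assumption-laden model: a tree is either an integer (the key count of a subcube that is not divided further) or a pair of subtrees, the function names are hypothetical, and the boundary case of exactly 2:1 is treated as balanced, following the strict comparisons of the size method in Figure 3.22.

```python
def subtree_size(tree):
    """Total number of keys below a tree node."""
    if isinstance(tree, int):
        return tree
    lower, upper = tree
    return subtree_size(lower) + subtree_size(upper)


def pair_is_balanced(lower_size, upper_size):
    """Neither half holds more than twice as many keys as the other."""
    return lower_size <= 2 * upper_size and upper_size <= 2 * lower_size


def cube_is_balanced(tree):
    """Check the 2:1 criterion at every internal node of the tree."""
    if isinstance(tree, int):
        return True
    lower, upper = tree
    return (pair_is_balanced(subtree_size(lower), subtree_size(upper))
            and cube_is_balanced(lower)
            and cube_is_balanced(upper))


print(cube_is_balanced((4, 1)), cube_is_balanced(((2, 1), 2)))
```

Counting keys rather than comparing node dimensions means a lopsided quarter of the cube only matters to the internal node directly above it.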
Messages: 
Leaf and internal nodes periodically transmit size messages to their parent nodes. When
the parent node receives the size message, it updates its size and checks the sizes of its
two subcubes for imbalance.
⁵ Here the outer child of a node, A, is the child of A to the outside of the subtree rooted by A's parent
as drawn in Figure 3.21. The outer child of the root is the left child as drawn in Figure 3.21.
If the root node of a subcube detects imbalance between the two halves of its subcube, 
it initiates balancing by moving records between its two children. This data transfer
takes place in two steps. First, a free message is transmitted to the boundary node 
of the subcube containing fewer elements. This message causes the boundary node to 
split itself as in the insert operation, with the old key and record remaining in the node 
farthest from the subcube boundary. The boundary node of the subcube with the larger 
size is then sent a move message. This message locks the boundary node, copies its key 
and record to the freed node, and then marks the adjacent node deleted. The net effect
is to move one datum from the larger subcube to the smaller subcube. While a node is 
marked free, it routes all its messages to the destination node. The root subcube repeats 
this operation until balance is restored to a 2:1 size ratio. It is important to note that 
because of the Gray code mapping, most of these messages traverse only a single link in 
the cube. The message from the root to its outer child is the only message that must 
traverse two links. 
size: anInt of: anId
free: anId
Algorithm: 
The size method, shown in Figure 3.22, updates the size of self, checks for balance 
between its two subcubes, and possibly initiates balancing by sending a free message to 
the smaller of the two subcubes. The free method splits its destination subcube in half 
and sends a move message to the node in the other half subcube, instructing it to copy 
itself to the freed node and then to delete itself. 
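Ignoring message passing and locking, the net effect of repeated free and move messages on two adjacent subcubes can be sketched with two sorted lists. The function name `rebalance` and the list representation are assumptions for this example.

```python
def rebalance(lower, upper):
    """Move boundary data one at a time until neither half exceeds twice the other.

    `lower` and `upper` are sorted lists of keys; every key in `lower` is
    smaller than every key in `upper`. Returns the number of data moved.
    """
    moves = 0
    while len(lower) > 2 * len(upper):
        upper.insert(0, lower.pop())      # highest datum of the lower half crosses over
        moves += 1
    while len(upper) > 2 * len(lower):
        lower.append(upper.pop(0))        # lowest datum of the upper half crosses over
        moves += 1
    return moves


lower, upper = [0, 1, 2, 3], [4]
moves = rebalance(lower, upper)
print(moves, lower, upper)
```

Only the data adjacent to the boundary are touched, so the sorted order across the two halves is preserved after every move.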
The free method, shown in Figure 3.23, is similar to insert in that it must split the 
present node to generate a free block. There are two cases. If the subcube contains 
more than one element, the boundary node is a corner node. Since it is right on the 
boundary, it must copy its present state into the split subcube and then free itself. If 
the subcube contains only a single element, the boundary node is a slave to the root 
which recognized the imbalance. In this case the root simply sends a split message to
free the boundary half of its subcube. As with insert, locking two nodes simultaneously 
is permissible during a split since the two nodes were the same node at the time of the 
first lock, and it is impossible for another process to attempt to lock the split subcube 
after the original subcube is locked. 
After a node is freed, the node which is to move to the freed subcube receives the move 
message. The move copies the boundary node's key and record to the freed node while 
preserving the freed node's dimension. After the copy completes, the boundary node 
is marked deleted. Although two nodes are locked simultaneously, unlike the merge 
operation, no priority resolution is required to prevent deadlock. Once a node is freed, 
instance methods for class Balanced Cube
size: anInt of: anId
    "update size of subcube rooted at receiver"
    | |
    (myId < anId) ifTrue: [
        lowerSize ← anInt]
    ifFalse: [
        upperSize ← anInt].
    mySize ← lowerSize + upperSize.
    (lowerSize > (2 * upperSize)) ifTrue: [
        (self upperChild) free: lowerChild].
    (upperSize > (2 * lowerSize)) ifTrue: [
        (self lowerChild) free: upperChild]
Figure 3.22: Method for size:of:
instance methods for class Balanced Cube
free: anId
    "split self and send a move message to anId"
    require rwlock exclude rwlock.
    | |
    (dim > 0) ifTrue: [
        ((flag = #deleted) or: (flag = #free)) ifTrue: [(self co: anId) move: myId]
        ifFalse: [
            dim ← dim - 1.
            (self adjacentTo: anId) ifTrue: [
                (self neighbor: dim) split: dim key: key data: data flag: flag.
                flag ← #free.
                (self co: anId) move: myId]
            ifFalse: [
                (self neighbor: dim) split: dim key: key data: data flag: #free.
                (self co: anId) move: (myId xor: 2^dim)]]]
Figure 3.23: Method for free:
Figure 3.24: Balance Example. Panels A to D show the Key and Dim fields of each node of a 3-cube, listed by linear order and cube address.
there is only one node which can send a copy message to that node. Thus, as in the 
insert and free operations, for purposes of locking, the freed node is part of the boundary 
node from the moment it unlocks after being tagged free. 
Example 3.6 Figure 3.24 shows a balancing operation on a 3-cube. 
1. In Figure 3.24A, root node 100, G(7), sees one record in the upper half of the cube
and four records in the lower half of the cube. Recognizing this imbalance, G(7)
sends a free message to G(4).
2. As shown in Figure 3.24B, since G(4) is a slave to G(7), the free operation locks
G(7), decrements its dimension, and sends a split message to G(4).
3. After the split message has marked G(4) free, a move message is sent to G(3) as
shown in Figure 3.24C. 
4. After the move completes, G(3) is marked deleted and the cube is balanced as 
shown in Figure 3.24D. 
The balancing operations alter none of the arguments in the proofs of Theorems 3.1 
to 3.8 above. Thus, all of these theorems hold in a cube which is being dynamically 
balanced. 
3.6 Extension to B-Cubes 
A straightforward extension of the balanced cube is the B-cube. The B-cube is to a
balanced cube what a B-tree is to a balanced tree. In the B-cube, rather than storing 
one record in each node, up to k records may be stored in each node. B-cube operations 
attempt to keep the number of records in each node between $\lceil k/2 \rceil$ and k by splitting
nodes when the number of records exceeds k and merging adjacent nodes when their
combined number of records drops below k + 1. Within a B-cube node, records are 
sorted and searched by conventional means. Between nodes, the algorithms presented 
here for balanced cubes are applied with some modifications. For example, in the search 
procedure, a query message would reply with both upper and lower keys. The test for 
equality in this case would be lower <= key <= upper.
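A B-cube node can be pictured as a small sorted list with a range test and a split on overflow. The class below is a hedged sketch: `BCubeNode`, its method names, and the choice to split at the midpoint are assumptions, and only the within-node behavior is modeled, not the inter-node messages.

```python
import bisect


class BCubeNode:
    """Holds up to k sorted keys; splits in half when an insert overflows."""

    def __init__(self, k, keys=()):
        self.k = k
        self.keys = sorted(keys)

    def covers(self, key):
        """The range test lower <= key <= upper used in place of equality."""
        return bool(self.keys) and self.keys[0] <= key <= self.keys[-1]

    def contains(self, key):
        """Conventional search inside the node (binary search)."""
        i = bisect.bisect_left(self.keys, key)
        return i < len(self.keys) and self.keys[i] == key

    def insert(self, key):
        """Insert key; return a new upper node if the node had to split."""
        bisect.insort(self.keys, key)
        if len(self.keys) <= self.k:
            return None
        mid = len(self.keys) // 2
        upper = BCubeNode(self.k, self.keys[mid:])
        self.keys = self.keys[:mid]
        return upper


node = BCubeNode(4, [10, 20, 30, 40])
print(node.covers(25), node.contains(25))
upper = node.insert(25)
print(node.keys, upper.keys)
```

Most inserts return `None` and stay inside one node, which is the reduction in node interactions that the B-cube aims for.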
B-cubes have several advantages over balanced cubes: 
• The overhead for maintaining the dimension and flag fields in each node is reduced. 
Rather than maintaining these fields for each record, their cost is spread out over 
up to k records. Locks in a B-cube can be either on a record basis or on a node 
basis. Write-locking at the node level and read-locking at the record level seem to 
make the most sense. 
• In a B-cube, the majority of inserts and deletes can be performed entirely within a 
single node without splitting or merging. Thus, the number of node interactions is 
reduced. Also, balancing is required less frequently, since the number of operations 
which changes the node counts is reduced. Note, however, that when balancing is 
performed the amount of data to be moved has increased. 
• It is expected that nodes will be swapped from a mass storage device. In the B-
cube, the size of a node can be chosen to match a convenient transfer size for the 
storage device. In general, this size is larger than a single record. 
A possible disadvantage of B-cubes is that they reduce the potential concurrency of 
the data structure. However, in most applications the number of records will greatly 
exceed the number of available processors, and the concurrency of B-cubes will not be 
the limiting factor. In fact, this reduction of concurrency is an advantage in the sense
that it allows the granularity of the data structure to be smoothly varied over a large 
range. 
3.7 Experimental Results
The balanced cube data structure has been implemented on a multiprocessor simulator,
and a number of experiments have been performed to verify the correctness of the algorithms
and to measure their throughput. The balanced cube simulator is a 3000-line C
program [68]. The code is divided fairly evenly into three parts:
• A binary n-cube simulator which provides the message passing environment of a 
concurrent computer. 
• The balanced cube algorithms.
• Instrumentation code to configure the cube simulator and to measure the performance
of the balanced cube algorithms.
The decision to use a simulator instead of an actual concurrent computer for these experiments
was a difficult one. The Caltech Cosmic Cube was available and was ideally suited
to run the balanced cube algorithms. The simulator was chosen over the Cosmic Cube,
however, because it offered greater flexibility and ease of instrumentation. The simulator
can model the behavior of a wide range of concurrent computers. Computers of any size
from one processor to $2^{16}$ processors can be simulated. For the experiments described
below, the simulator was configured as a binary n-cube with $1 \le n \le 13$. New communication
topologies, such as a linear connected cube, can be easily added to the simulator. Also, 
it is easy to model different weightings of processing time to communication time on the 
simulator. 
Two sets of experiments were run. The first set of experiments, described in detail in 
[20], was performed on an early version of the balanced cube which directly mapped
the elements of the ordered set to the nodes of a binary n-cube. The current balanced 
cube algorithms, using a Gray code mapping, were used in the second set of experiments. 
After a few experiments were run to verify that the insert, delete, and balance operations 
consume only a modest portion of the cube's resources, all remaining experiments were 
performed using only the search operation. 
Throughput experiments were run to determine if the data structure can achieve the
predicted $O(N / \log N)$ throughput. These experiments were run using a load model that
applied a maximum uniform load to the cube. The experiments were run for both the
direct mapped cube and the current balanced cube.
Throughput is the number of operations the data structure can perform per unit time.
The balanced cube can perform N operations at a time and each operation requires
O(log N) time, so the predicted throughput is $O(N / \log N)$. In the steady state, the balanced
cube can perform $O(N / \log N)$ operations each message time.
The throughput results presented in this section assume a uniform load. Both the constituents
to which requests are made and the keys searched for are uniformly distributed.
A concentration of messages to one constituent or searching for a single key would cause
a hot spot and reduce throughput. These throughput results also assume that data
inserted into the balanced cube is uniformly distributed. If an adversary inserts a pathological
sequence of data, balancing can, in the worst case, require O(N) messages per
operation, reducing throughput to O(1).
Figure 3.25: Throughput vs. Cube Size for Direct Mapped Cube (throughput, on a logarithmic scale, plotted against log N). The solid line is proportional to $N / \log^2 N$. Diamonds represent experimental data.
The throughput results for the original direct mapped cube of [20], shown in Figure 3.25,
fail to achieve the predicted throughput. The direct mapped cube achieves a throughput
of only $O(N / \log^2 N)$.
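To make the two orders of growth concrete, the following sketch tabulates N / log N against N / log² N for a few cube sizes. Constant factors are deliberately omitted, base-2 logarithms are assumed so that log N = n for a binary n-cube, and the function names are hypothetical; the numbers show only how the curves scale, not the measured values.

```python
def predicted_throughput(n):
    """N / log2 N for a binary n-cube with N = 2**n nodes (n >= 1)."""
    return 2 ** n / n


def direct_mapped_throughput(n):
    """N / (log2 N)**2, the degraded order of growth of the direct mapping."""
    return 2 ** n / n ** 2


for n in (4, 8, 12):
    ratio = predicted_throughput(n) / direct_mapped_throughput(n)
    print(n, predicted_throughput(n), direct_mapped_throughput(n), ratio)
```

The ratio of the two columns grows as n, that is, as log N, which is the size of the degradation under discussion.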
The degradation of O(log N) is due to the non-uniformity of the Hamming distance
between linear order neighbors, as expressed in Equation (3.7). The function, dHA, can
be thought of as a barrier function. Shown in Figure 3.26, this function represents how 
many channels a message between linear address neighbors must traverse. Degradation 
occurs because the channels corresponding to the higher barriers must carry more traffic 
than the channels corresponding to lower barriers. Hence, these channels become congested.
Figure 3.26: Barrier Function (n = 10), plotted against position.
The average barrier height is given by: 
The degradation is the ratio of maximum barrier height to average barrier height, or $n/2$.
The experimental data of Figure 3.25 agrees exactly with this figure. 
The Gray code mapping used in the current balanced cube eliminates this degradation as 
shown in Figure 3.27. The throughput difference of O(log N) between Figures 3.25 and
3.27 illustrates the importance of developing data structures which match the topology
of concurrent computers. 
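The barrier function and the effect of the Gray code mapping can be reproduced with a few lines of Python. This is an illustrative sketch: the barrier height is taken to be the Hamming distance between the cube addresses of consecutive linear positions, the Gray code is assumed to be the standard reflected code, and all names are hypothetical.

```python
def gray(i):
    """Reflected binary Gray code of i."""
    return i ^ (i >> 1)


def hamming(a, b):
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def barrier_heights(n, mapping=lambda i: i):
    """Hamming distance between the addresses of linear neighbors i and i + 1."""
    return [hamming(mapping(i), mapping(i + 1)) for i in range((1 << n) - 1)]


n = 10
direct = barrier_heights(n)              # direct mapping: address = linear position
gray_mapped = barrier_heights(n, gray)   # Gray code mapping

avg_direct = sum(direct) / len(direct)
print(max(direct), round(avg_direct, 3), round(max(direct) / avg_direct, 2))
print(max(gray_mapped), min(gray_mapped))
```

For the direct mapping the highest barrier is n while the average stays close to 2, so the ratio is close to n/2; under the Gray code mapping every barrier has height 1, so no channel is singled out to carry extra traffic.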
Figure 3.27: Throughput vs. Cube Size for Balanced Cube (throughput, on a logarithmic scale, plotted against log N). The solid line is proportional to $N / \log N$. Diamonds represent experimental data.
Figure 3.28: Mail System. Objects exchange messages with their local Post Offices, which look up name:address associations in the balanced cube.
3.8 Applications 
A Mail System 
Concurrent data structures such as the balanced cube provide a medium through which
objects can communicate without knowing of each other's existence or physical location. 
Consider a mail system that forwards messages between objects that occasionally migrate 
from node to node. As shown in Figure 3.28, the mail system consists of a balanced cube 
used to hold the associations between object names and their current addresses, and 
local Post Offices that cache these associations and handle communications with objects. 
Objects interact through the Post Office rather than directly communicating with each
other. 
• When an object moves to a new node, it registers its new address by sending the 
message at: <name> put: <address> to its local Post Office. The Post Office 
inserts this association in the balanced cube. 
• To send a message to an object, B, the sender object, A, transmits a message
to its local Post Office. Each local Post Office maintains a cache of recently used
object-address associations. If the address is not found in this cache, an at: <name>
message is sent to the balanced cube to look up the address.
• If an address in the local cache is stale (the object has moved), the destination 
Post Office consults the balanced cube to find the correct address, forwards the 
message, and notifies the sending Post Office of the new address. 
Using the Post Office mechanism, objects can communicate without ever knowing anything
about each other. Objects send messages to names. The object receiving messages
for a given name can move or be replaced without notifying any of its customers. There 
is no central name server to become a bottleneck. The server that associates names with 
addresses is distributed and can process many requests simultaneously. 
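The caching and forwarding behavior described above can be sketched as follows. This is a toy model under explicit assumptions: a shared Python dictionary stands in for the balanced cube, node addresses are small integers, and the class and method names are invented for the example.

```python
class PostOffice:
    def __init__(self, node, directory, offices):
        self.node = node
        self.directory = directory   # shared name -> node map (stand-in for the cube)
        self.offices = offices       # node -> PostOffice, used to reach other offices
        self.cache = {}              # recently used name -> node associations
        self.inboxes = {}            # messages for objects living on this node
        offices[node] = self

    def register(self, name):
        """An object arrives on this node: record its new address."""
        self.inboxes[name] = []
        self.directory[name] = self.node

    def depart(self, name):
        """An object leaves this node."""
        del self.inboxes[name]

    def send(self, name, message):
        """Send via the cached address, looking it up on a cache miss."""
        if name not in self.cache:
            self.cache[name] = self.directory[name]
        # The destination office reports where the message was finally delivered.
        self.cache[name] = self.offices[self.cache[name]].deliver(name, message)

    def deliver(self, name, message):
        """Deliver locally, or forward if the sender's address was stale."""
        if name in self.inboxes:
            self.inboxes[name].append(message)
            return self.node
        return self.offices[self.directory[name]].deliver(name, message)


directory, offices = {}, {}
p0, p1, p2 = (PostOffice(n, directory, offices) for n in range(3))
p1.register("B")
p0.send("B", "hello")
p1.depart("B")
p2.register("B")          # B migrates from node 1 to node 2
p0.send("B", "again")     # stale cache entry: node 1 forwards and p0 learns the new address
print(p2.inboxes["B"], p0.cache["B"])
```

The sender never learns more than a name and a cached address, and a stale address costs one extra lookup rather than a failed delivery.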
Artwork Analysis 
Applications can be constructed by combining concurrent data structures. Consider the 
problem of integrated circuit artwork analysis. This problem has two aspects: 
circuit extraction: discovering the electrical circuit of an integrated circuit from an
examination of its layout geometry.
design rule checking: verifying that the layout obeys a set of geometrical design rules. 
These rules specify restrictions such as minimum feature width, minimum feature 
spacing, etc.
Traditionally, artwork analysis has been performed using a scan-line algorithm [5],[29],[16]. 
However, scan-line algorithms are inherently sequential, as they involve traversing the chip
in sequence from one end to the other. In this section we examine an approach to con-
current artwork analysis using balanced cubes. 
The artwork for an integrated circuit is a set of polygons. Artwork analysis involves 
checking for interactions between polygons. An efficient algorithm must be selective in 
these checks to avoid the $O(N^2)$ complexity required to check every pair of polygons. If
polygons are compared only with neighboring polygons, the number of comparisons can
be significantly reduced.
To reduce the number of comparisons, we use a B-cube to maintain the spatial rela-
tionship between polygons in one dimension. Each polygon is enclosed in a bounding 
box, and the B-cube is ordered by the left x coordinate of the bounding box. Within 
each node of the B-cube, two indices into the local list of polygons are maintained, one 
ordered by x coordinate and one by y coordinate. 
Artwork analysis is performed concurrently on this structure by having each polygon 
send a from: leftX to: rightX do: aBlock message to the B-cube. At each node of the 
B-cube, aBlock executes and, using the y index, selects only those polygons that overlap 
the sender in both coordinates. These polygons are then compared with the sender to 
check for design rule errors. 
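The selection step can be sketched sequentially as follows. This is an illustrative stand-in, not the B-cube itself: a sorted Python list plays the role of the x-ordered B-cube, a linear filter plays the role of the per-node y index, and the names Box, overlaps, and candidate_pairs are assumptions made for the example. Each polygon queries the range from its left x to its right x, so every overlapping pair is found by the polygon whose left edge is smaller or equal.

```python
from bisect import bisect_left, bisect_right
from collections import namedtuple
from itertools import combinations

Box = namedtuple("Box", "name left bottom right top")


def overlaps(a, b):
    """True if two bounding boxes overlap (or touch) in both coordinates."""
    return (a.left <= b.right and b.left <= a.right and
            a.bottom <= b.top and b.bottom <= a.top)


def candidate_pairs(boxes):
    """Pairs selected by a from:to:do: style range query on the left x key."""
    by_left = sorted(boxes, key=lambda b: b.left)
    lefts = [b.left for b in by_left]
    pairs = set()
    for sender in by_left:
        lo = bisect_left(lefts, sender.left)     # from: leftX
        hi = bisect_right(lefts, sender.right)   # to: rightX
        for other in by_left[lo:hi]:
            if other is sender:
                continue
            # Stand-in for the y index: keep only boxes that also overlap in y.
            if other.bottom <= sender.top and sender.bottom <= other.top:
                pairs.add(frozenset((sender.name, other.name)))
    return pairs


boxes = [
    Box("A", 0, 0, 4, 4),
    Box("B", 3, 3, 6, 6),
    Box("C", 5, 0, 8, 2),
    Box("D", 10, 10, 12, 12),
    Box("F", 3, 1, 5, 2),
]
pairs = candidate_pairs(boxes)
brute = {frozenset((a.name, b.name))
         for a, b in combinations(boxes, 2) if overlaps(a, b)}
```

On this small hypothetical layout the range query selects the same pairs as the all-pairs check.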
$O(N^{1.5})$ comparisons will be made on y coordinates, making this algorithm less efficient
than $O(N \log N)$ sequential algorithms. This algorithm has the advantage, however, of
being very concurrent, while the scan-line algorithms are inherently sequential.
By using a two-dimensional corner-stitched data structure as described in [93] it is
possible to achieve concurrency without the $O(\sqrt{N})$ penalty imposed by ordering primarily
in a single dimension. A corner-stitched data structure can be distributed by using 
the pointers as keys into a balanced cube. Since order is not required, the concurrent 
dictionary described in Appendix B could be used instead of the balanced cube. 
Directed Search 
Many problems involve the directed search of a state space. For example, most
game-playing programs are built around an α-β search of a game tree that represents the
state space of positions. The program begins from the current position and generates 
all possible successor positions. These successor positions are then expanded to generate 
positions two moves ahead and so on. At each step of the search there is a set of 
active positions: those positions that have been generated but not yet expanded. Active 
positions are expanded in order of their merit as determined by some evaluation function.
Some positions may be pruned, eliminated from further consideration, on the basis of 
static evaluation functions. 
We can construct a concurrent directed search algorithm by storing all generated posi-
tions in a balanced cube. As in the artwork analysis example above, a B-cube is used. 
Some hash function of position is used as a key to insert positions into the B-cube. 
Within each node, two indices are kept into the data: an index ordered by keys and
an index ordered by the evaluation function. An expand method running in each node 
repeatedly removes the most promising position from the local B-cube node, expands 
that position, and inserts its descendants into the B-cube. 
The directed search algorithm that results from using a B-cube in this manner has a 
number of desirable properties. 
• Identical positions can be converged, since they will hash to the same key. 
• The hash function in combination with the balance property of the B-cube will 
evenly distribute positions over the processing nodes of a concurrent computer, 
resulting in good load balancing.
• Perhaps most importantly, no special effort is required to make the expand method 
concurrent. The method simply removes a position from the B-cube, expands 
it, and inserts the descendants into the B-cube. All of the communication and 
synchronization, all of the burdens of concurrency, are handled by the B-cube. 
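The expand loop can be sketched sequentially as below. This is an assumption-laden illustration rather than the B-cube version: a set of keys stands in for the index ordered by keys, a heap stands in for the index ordered by the evaluation function, and the function directed_search with its parameters is hypothetical. The position itself is used as the key by default so that distinct positions are never merged by accident.

```python
import heapq


def directed_search(start, is_goal, successors, merit, key=lambda p: p):
    """Best-first search; a lower merit value marks a more promising position."""
    seen = {key(start)}                  # index by key: merges identical positions
    active = [(merit(start), start)]     # index by evaluation function
    expanded = 0
    while active:
        _, position = heapq.heappop(active)   # most promising active position
        if is_goal(position):
            return position, expanded
        expanded += 1
        for child in successors(position):
            k = key(child)
            if k not in seen:            # identical positions share one key
                seen.add(k)
                heapq.heappush(active, (merit(child), child))
    return None, expanded


# Toy state space: reach 37 from 1 using the moves n + 1 and 2 * n.
target = 37
found, expanded = directed_search(
    start=1,
    is_goal=lambda n: n == target,
    successors=lambda n: [m for m in (n + 1, 2 * n) if m <= 2 * target],
    merit=lambda n: abs(target - n),
)
```

In the concurrent setting described above, one such loop would run in every node against its local portion of the B-cube, with insertion of descendants routed by key.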
3.9 Summary 
I have developed a new data structure for implementing ordered sets, the balanced cube. 
The balanced cube is a distributed ordered set object. It is an ordered set of data, along 
with operations to manipulate those data, distributed over the nodes of a concurrent 
computer. Operations are initiated by messages to any node. Thus, many operations 
may be initiated simultaneously. The balanced cube offers significantly improved con-
currency over conventional data structures such as heaps, balanced trees, and B-trees. 
On sequential machines, complexity is measured by instruction counts. Based on these 
conventional measures, the balanced cube performs as well as balanced trees or B-trees 
requiring O(log N) time to search, insert, or delete a record in a structure of N records. 
For concurrent machines, however, communications costs are more important than in-
struction counts, and the throughput of several operations executing in parallel is more 
important than the latency of a single operation. Based on this performance model,
a balanced cube offers $O(N / \log N)$ throughput as compared to $O(1)$ throughput for
conventional data structures. Consider, for example, an N = 1024 processor concurrent
computer. A conventional data structure implemented on such a machine can process 
only a single access per unit time. A balanced cube, on the other hand, can process over 
100 accesses simultaneously.
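As a rough check of the figure quoted above, assume the hidden constant is 1 and the logarithm is base 2 (both are assumptions made only for this illustration):

```latex
\[
  \frac{N}{\log_2 N} \;=\; \frac{1024}{\log_2 1024} \;=\; \frac{1024}{10} \;\approx\; 102 ,
\]
```

which is consistent with "over 100" simultaneous accesses on a 1024-processor machine.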
In any concurrent system, consistency of interacting operations and deadlock avoidance 
are critical. The balanced cube is provably deadlock free. Each operation locks at most 
one non-deleted node at a time and unlocks this node before locking the next node. In the 
case of the merge operation, where there may be competition for access to deleted nodes, 
a priority scheme is used to resolve any conflicts. In the balanced cube, concurrently 
executing operations produce results that are consistent with a sequential execution of 
the same operations ordered by time of completion. This consistency is achieved by the 
judicious use of locking to make the completion of an operation appear instantaneous 
and to assure that the neighborhood of an operation is not modified between the time 
it commits to modifying the state of the cube and the time it completes performing the 
modification. 
Balanced cubes and B-cubes can be used to construct concurrent applications. In many 
cases, such as in the directed search example of Section 3.8, no special effort is required 
to make an application concurrent. Many instances of the application simply insert and 
remove data from the balanced cube. The balanced cube data structure handles all 
communication and synchronization. 
Chapter 4 
Graph Algorithms 
In this chapter I represent graphs as concurrent data structures and develop algorithms 
for manipulating graphs on message-passing concurrent computers. Unlike the ordered 
set structure examined in Chapter 3, a graph does not have a fixed set of operations 
defined on it. Instead, a graph serves as a framework for modeling and solving a number 
of combinatorial problems. 
Graph data structures have been applied to a wide range of problem areas, including 
transportation, communications, computer aided design, and game playing. Because of 
their importance, graph algorithms for sequential machines have been studied in depth
[36], [43], [63], [6i], [95], and some work has been done on concurrent graph algorithms 
[102], [103], [116], [85]. However, little work has been done on algorithms for message-
passing concurrent computers, and very little experimental work has been done to de-
termine the performance of concurrent graph algorithms on large (> 100 processor) 
machines. 
This chapter addresses these gaps in the literature by formulating new concurrent graph 
algorithms for three important graph problems and evaluating their performance through 
both analysis and experiment. Section 4.2 discusses concurrent shortest path algorithms. 
A weakness in an existing concurrent shortest path algorithm is exposed, and a new 
algorithm is developed to overcome this problem. Max-flow algorithms are discussed in 
Section 4.3. Two new max-flow algorithms are developed. Finally, Section 4.4 deals with
the graph partitioning problem. Novel techniques are developed to prevent thrashing of 
vertices between partitions and to keep the partitions balanced concurrently. 
4.1 Nomenclature 
Definition 4.1 A graph G(V, E) consists of a set of vertices, V, and a set of edges,
$E \subseteq V \times V$. The source vertex of edge $e_n$ is denoted $s_n$ and the destination, $d_n$.
class               Graph           generic graph
superclass          Object
instance variables  vertices        a distributed collection
                    edges           a distributed collection
class variables     none
locks               none

class               Vertex
superclass          Object
instance variables  forwardEdges
                    backwardEdges
class variables     none
locks               none

class               Edge
superclass          Object
instance variables  source          s, where e = (s, d)
                    dest            d, where e = (s, d)
class variables     none
locks               none
Figure 4.1: Headers for Graph Classes 
Definition 4.2 A path is a sequence of edges $P = e_1, \ldots, e_k$ such that $\forall i,\ d_i = s_{i+1}$. The source
of the path is $s_P = s_1$ and the destination of the path is $d_P = d_k$.
Definition 4.3 A path P is said to visit a vertex v if P contains an edge $e_n$ and $v = s_n$
or $v = d_n$. A proper path visits no vertex twice.
Definition 4.4 The degree of a vertex, v, is the number of edges incident on v. The 
in-degree of v is the number of edges with destination v and the out-degree of v is the 
number of edges with source v. 
Most graphs encountered in computer aided design and transportation problems are 
sparse: $O(|E|) \approx O(|V|)$. For this reason I restrict my attention to sparse graphs.
The CST headers for classes Graph, Vertex, and Edge are shown in Figure 4.1. A graph is
represented by two distributed collection objects, vertices, V, and edges, E. Elements of
vertices are of class Vertex and consist of forward and backward adjacency lists. The
adjacency list representation is used here, since it is more efficient than adjacency matrices
in dealing with the sparse graphs characteristic of most large problems. Each edge in
the graph is an instance of class Edge which relates its source and destination vertices. 
In the following sections I will define subclasses of Vertex and Edge to include problem 
specific instance variables such as length, weight, capacity and flow. To conserve space, 
these subclasses will not be explicitly declared. Instead, the new instance variables in 
each subclass will be informally described. 
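For readers more comfortable with a conventional language, the representation can be sketched in Python as below. This is only an analogy to the CST headers: the collections are ordinary in-memory containers rather than distributed collections, and the helper methods (add_vertex, add_edge, degree, and so on) and the length field are assumptions added for illustration.

```python
from dataclasses import dataclass, field


@dataclass(eq=False)
class Vertex:
    name: str
    forward_edges: list = field(default_factory=list, repr=False)
    backward_edges: list = field(default_factory=list, repr=False)

    def out_degree(self):
        return len(self.forward_edges)      # edges with this vertex as source

    def in_degree(self):
        return len(self.backward_edges)     # edges with this vertex as destination

    def degree(self):
        return self.in_degree() + self.out_degree()


@dataclass(eq=False)
class Edge:
    source: Vertex
    dest: Vertex
    length: float = 1.0     # example of a problem-specific instance variable


class Graph:
    def __init__(self):
        self.vertices = {}  # name -> Vertex
        self.edges = []

    def add_vertex(self, name):
        self.vertices[name] = Vertex(name)
        return self.vertices[name]

    def add_edge(self, source, dest, length=1.0):
        edge = Edge(self.vertices[source], self.vertices[dest], length)
        edge.source.forward_edges.append(edge)
        edge.dest.backward_edges.append(edge)
        self.edges.append(edge)
        return edge


g = Graph()
for name in "sab":
    g.add_vertex(name)
first = g.add_edge("s", "a", 1)
g.add_edge("s", "b", 2)
g.add_edge("a", "b", 1)
```

Keeping both forward and backward adjacency lists lets a vertex reach its predecessors as cheaply as its successors, which the synchronized algorithm of Section 4.2.1 relies on.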
4.2 Shortest Path Problems 
The shortest path problem has wide application in the areas of transportation,
communication and computer-aided design. For example, finding optimal routings for aircraft,
trucks or trains is a shortest path problem as is routing phone calls in a telephone net-
work. Shortest path algorithms are also used to solve computer-aided design problems 
such as circuit board routing and switch level simulation [14].
To discuss shortest paths, we must first define length. 
Definition 4.5 Length, $l$, is a function $E \mapsto \mathbb{R}$. The length of a path is the sum of the
edge lengths along the path, $l(P) = \sum_{e_j \in P} l(e_j)$.
Definition 4.6 The diameter, D, of a graph, G, is the maximum over all pairs of points 
of the minimum length of a path between a pair of points, 
$D = \max \{ \min l(P) \mid s_P = v_i,\ d_P = v_j \} \quad \forall\, v_i, v_j \in V$ (4.1)
4.2.1 Single Point Shortest Path 
The single point shortest path problem (SPSP) involves finding the shortest path from a
distinguished vertex, $s \in V$, to every other vertex. In this section I examine an existing
concurrent SPSP algorithm due to Chandy and Misra [15] and show that it has
exponential complexity in the worst case. I go on to develop a new concurrent algorithm for
the SPSP problem that overcomes the problem of Chandy and Misra's algorithm and
requires at most $O(|V|^2)$ messages.
The SPSP problem was solved for sequential computers by Dijkstra in 1959 [27]. Shown 
in Figure 4.3, Dijkstra's algorithm begins at the source and follows edges outward to 
find the distance from the source to each vertex. The wavefront of activity is contained 
Figure 4.2: Example Single Point Shortest Path Problem
spsp: 1
    | vSet u v |
    vertices do: [:aVertex | aVertex distance: infinity].
    source distance: 0.
    vSet ← SortedCollection sortBlock: [:a :b | a distance < b distance].
    vSet add: source.
    [vSet isEmpty] whileFalse: [
        u ← vSet removeFirst.
        (u forwardEdges) do: [:edge |
            v ← edge destination.
            ((u distance + edge length) < v distance) ifTrue: [
                v distance: (u distance + edge length).
                v pred: u.
                vSet add: v]]]
Figure 4.3: Dijkstra's Algorithm
Vertex u  Distance  Pred  vSet (vertex.dist)
s         0         nil   (a.1),(b.2)
a         1         s     (b.2),(c.2),(d.3),(e.5)
b         2         s     (c.2),(d.3),(e.5)
c         2         a     (d.3),(e.4)
d         3         a     (e.4),(f.4)
e         4         d     (f.4),(g.6)
f         4         d     (g.5)
g         5         f
Figure 4.4: Example Trace of Dijkstra's Algorithm
in vSet, the set of vertices that have been visited but not yet expanded. To avoid
traversing an edge more than once, the algorithm keeps vSet in sorted order. On each
iteration of the whileFalse: loop, the active vertex nearest the source, u, is removed
from vSet and expanded by updating the distance of all forward neighbors. When the
algorithm terminates, the distance from source to a vertex, v, is in v distance and the
path can be found by following the pred links from v back to source. Dijkstra's algorithm
remains the best known algorithm for the sequential SPSP problem.
A trace of Dijkstra's Algorithm on the graph of Figure 4.2 is shown in Figure 4.4. For 
each iteration of the whileFalse: loop, the figure shows the vertex expanded, its distance 
from the source, its predecessor and the state of the active set. Note that each vertex, 
and thus each edge, is examined exactly once. Because of this property, for sparse graphs 
Dijkstra's algorithm has a time complexity of $O(|V| \log |V|)$. The loop is iterated $|V|$
times and the rate-limiting step in each iteration, selecting the vertex u, can be performed
in $O(\log |V|)$ time using a heap.^1
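A heap-based version of the sequential algorithm can be sketched in Python as follows. The sketch uses the standard heapq module in place of a SortedCollection, and skips stale heap entries instead of re-sorting. The example graph is hypothetical: its edge lengths are an assumption, chosen to be consistent with the distances in the trace of Figure 4.4, and are not taken from Figure 4.2 itself.

```python
import heapq


def dijkstra(graph, source):
    """graph maps each vertex to a list of (destination, length) pairs."""
    dist = {v: float("inf") for v in graph}
    pred = {v: None for v in graph}
    dist[source] = 0
    heap = [(0, source)]
    done = set()
    order = []                          # vertices in order of expansion
    while heap:
        d, u = heapq.heappop(heap)      # active vertex nearest the source
        if u in done:
            continue                    # stale entry left behind by an update
        done.add(u)
        order.append(u)
        for v, length in graph[u]:
            if d + length < dist[v]:
                dist[v] = d + length
                pred[v] = u
                heapq.heappush(heap, (dist[v], v))
    return dist, pred, order


def path_to(pred, v):
    """Follow the pred links from v back to the source."""
    path = []
    while v is not None:
        path.append(v)
        v = pred[v]
    return path[::-1]


# Hypothetical graph; edge lengths assumed for illustration.
example = {
    "s": [("a", 1), ("b", 2)],
    "a": [("c", 1), ("d", 2), ("e", 4)],
    "b": [("d", 3)],
    "c": [("e", 2)],
    "d": [("e", 2), ("f", 1)],
    "e": [("f", 2), ("g", 2)],
    "f": [("g", 1)],
    "g": [],
}
dist, pred, order = dijkstra(example, "s")
```

Each vertex is expanded once, and each heap operation costs logarithmic time, which gives the bound quoted above for sparse graphs.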
Chandy and Misra [15] have developed a concurrent version of Dijkstra's Algorithm.
This algorithm is simple and elegant; however, as we will see shortly, it has a worst case
time complexity of $O(2^{|V|})$. A simplified form of Chandy and Misra's algorithm is shown
in Figure 4.5. While Chandy and Misra's original algorithm uses two passes to detect
negative weight cycles in the graph, the simple algorithm uses only a single pass. As with
Dijkstra's Algorithm, Chandy and Misra's Algorithm works by propagating distances
from the source. The algorithm is initiated by sending the source a setDistance: 0
from: nil message. When a vertex receives a setDistance:from: message with a distance
smaller than its current distance, it updates its distance and sends messages to all of
^1 If there are only a constant number of edge lengths, then the selection can be performed in constant
time using a bucket list and the time complexity of the algorithm is $O(|V|)$.
instance methods for class PathGraph

spsp: 1
    | |
    source setDistance: 0 from: nil.

instance methods for class PathVertex

setDistance: aDist from: aVertex
    | |
    (aDist < distance) ifTrue: [
        distance ← aDist.
        (pred notNil) ifTrue: [pred ack].
        pred ← aVertex.
        forwardEdges do: [:edge |
            (edge destination) setDistance: (distance + edge length) from: self.
            nrMsgs ← nrMsgs + 1]]
    ifFalse: [aVertex ack].

ack
    | |
    nrMsgs ← nrMsgs - 1.
    (nrMsgs = 0) ifTrue: [
        (pred notNil) ifTrue: [pred ack].
        (self = graph source) ifTrue: [graph reply].
        pred ← nil].

Figure 4.5: Simplified Version of Chandy and Misra's Concurrent SPSP Algorithm
Time  a      b      c      d      e      f      g
1     (s.1)  (s.2)
2                   (a.2)  (b.5)  (a.5)
                           (a.3)
3                                 (d.7)  (d.4)  (e.7)
                                  (d.5)  (e.7)
                                  (c.4)
4                                        (e.6)  (e.6)
                                                (f.5)
Figure 4.6: Example Trace of Chandy and Misra's Algorithm
its successors. Every setDistance:from: message is acknowledged with an ack message to
detect termination as described in [28]. When the source replies to the graph the problem
is solved and the algorithm terminates. Unlike Dijkstra's algorithm, the expansion of
vertices is not ordered but takes place concurrently. This is both the strength and the
weakness of this algorithm.
A trace of Chandy and Misra's algorithm on the graph of Figure 4.2 is shown in
Figure 4.6. Each row of the figure corresponds to one arbitrary time period. Each column
corresponds to one vertex. For each time period, the messages (vertex, distance) received 
by the vertices are shown in the corresponding columns. For instance, during the first 
time period vertex a receives the message setDistance: 1 from: s, or (s.1) and vertex b 
receives ( s.2). 
The order of message arrival at reconvergent vertices is nondeterministic. Figure 4.6
shows a particularly pessimistic message ordering to illuminate a problem with the
algorithm. During time period 2, messages (b.5) and (a.3) are received by vertex d. In the
example I assume the message from b arrives before the message from a. Vertex d
updates its distance twice and sends two messages to vertex e. Unlike Dijkstra's algorithm,
Chandy and Misra's algorithm may traverse an edge more than once.
This multiple edge traversal, due to the very loose synchronization of the algorithm, can
result in exponential time complexity. Consider the graph of Figure 4.7. If messages
arrive in the worst possible order, Chandy and Misra's algorithm requires $O(2^{|V|})$ time
to solve the SPSP problem on this graph. Each triangular stage doubles the number of
messages. Vertex $v_1$ receives messages with distances 3 and 2; $v_2$ receives 7, 6, 5, and 4;
$v_k$ receives $2^k + 2^k - 1, \ldots, 2^k$ in that order. Although it is unlikely that the situation
[Diagram: a chain of triangular stages from s through v_1, ..., v_n, each stage bypassing an intermediate vertex u_i.]
Figure 4.7: Pathological Graph for Chandy and Misra's Algorithm
will ever get this bad, the problem is clear. Tighter synchronization is required.
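The doubling effect can be reproduced with a small simulation. The sketch below is an illustration under stated assumptions, not the experiment reported later: the graph built by the hypothetical function pathological_graph is a chain of triangular stages whose edge lengths are chosen for the example rather than read from Figure 4.7, acknowledgements are omitted, and the adversarial message order is obtained simply by delivering the most recently sent message first.

```python
def pathological_graph(n):
    """n triangular stages: v[i-1] reaches v[i] directly (long) or via u[i] (short)."""
    graph = {("v", 0): []}
    for i in range(1, n + 1):
        delta = 2 ** (n - i)            # extra length of the direct edge
        graph[("v", i)] = []
        graph[("u", i)] = [(("v", i), 1)]
        # The direct edge is listed last so the LIFO scheduler delivers it first.
        graph[("v", i - 1)] = [(("u", i), 1), (("v", i), 2 + delta)]
    return graph


def chandy_misra_worst_case(graph, source):
    """Unsynchronized relaxation with newest-first (LIFO) message delivery."""
    dist = {v: float("inf") for v in graph}
    updates = {v: 0 for v in graph}
    messages = 1                        # the initial message to the source
    pending = [(source, 0)]
    while pending:
        vertex, d = pending.pop()
        if d < dist[vertex]:            # smaller distance: update and propagate
            dist[vertex] = d
            updates[vertex] += 1
            for dest, length in graph[vertex]:
                pending.append((dest, d + length))
                messages += 1
    return dist, updates, messages


results = {}
for n in range(1, 9):
    dist, updates, messages = chandy_misra_worst_case(
        pathological_graph(n), ("v", 0))
    results[n] = (dist[("v", n)], updates[("v", n)], messages)
```

With one stage the last vertex receives distances 3 and 2; with two stages it receives 7, 6, 5, and 4. In general the final vertex of an n-stage chain is updated 2^n times under this delivery order, although the graph has only 2n + 1 vertices.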
To solve the synchronization problem with Chandy and Misra's algorithm I have
developed a new concurrent algorithm for the SPSP problem that synchronizes all active
vertices. This algorithm, shown in Figure 4.8, synchronizes all vertices in the graph with
their neighbors. By forcing a vertex to examine all of its input edges before propagating
a value on its output edges, the worst case time complexity of the algorithm is reduced
to $O(|V|)$ for sparse graphs.^2 The worst case number of messages required for sparse
graphs is $O(|V|^2)$.
The algorithm is initialized by sending an spsp: source message to the graph. The graph
then initializes each non-source vertex by sending it an spspInit: ∞ message. The source
receives an spspInit: 0 message. The spspInit messages initialize the distance instance
variable of each vertex and start the synchronized distance computation by having each
vertex send setDist:over: messages to all of its forward neighbors.
Figure 4.9 illustrates the synchronization imposed by this algorithm on each vertex by 
means of a Petri Net [98]. During each step of the algorithm, each vertex sends setDist 
messages to all of its forward neighbors. When setDist messages have arrived from all 
backward neighbors, the vertex acknowledges these messages with ackDist messages. 
When ackDist messages are received from all forward neighbors, the cycle begins again. 
Using this mechanism, vertices are kept locally synchronized. They do not operate in 
lockstep, but, on the other hand, two vertices cannot be out of synchronization by more 
than the number of edges separating them. 
The algorithm as presented will run forever since no check is made for completion.
^2 On any real concurrent computer $O(|V|)$ performance will not be seen, since it ignores communication
latency between vertices. On a binary n-cube processor, for example, the average latency is $O(\log N)$,
where N is the number of processors, giving a time complexity of $O(|V| \log N)$.
instance methods for class PathGraph

spsp: s
    | |
    vertices do: [:vertex |
        (vertex = source) ifTrue: [vertex spspInit: 0]
                          ifFalse: [vertex spspInit: ∞]]

instance methods for class PathVertex

setDist: aDist over: anEdge
    | |
    nrMsgs ← nrMsgs - 1.
    (aDist < distance) ifTrue: [
        distance ← aDist.
        pred ← (anEdge source)].
    (nrMsgs = 0) ifTrue: [
        self sendAcks.
        (nrAcks = 0) ifTrue: [self sendMsgs]]

spspInit: aDist
    | |
    distance ← aDist.
    self sendMsgs

sendMsgs
    | |
    nrAcks ← (forwardEdges size).
    acksSent ← false.
    forwardEdges do: [:edge | (edge destination) setDist: (distance + edge length) over: edge]

sendAcks
    | |
    nrMsgs ← (backwardEdges size).
    acksSent ← true.
    backwardEdges do: [:edge | (edge source) ackDist]

ackDist
    | |
    nrAcks ← nrAcks - 1.
    (acksSent and: (nrAcks = 0)) ifTrue: [self sendMsgs]

Figure 4.8: Synchronized Concurrent SPSP Algorithm
[Petri net: transitions sendMsgs and sendAcks; places for setDist to forward neighbors, setDist from backward neighbors, ackDist to backward neighbors, and ackDist from forward neighbors.]
Figure 4.9: Petri Net of SPSP Synchronization
Completion detection can be added to the algorithm in one of two ways. 
• Embed a tree into the graph. Each step, each vertex (leaf) transmits up the tree a 
message indicating whether or not its distance has changed. Internal nodes of the 
tree combine and forward these messages. When the root of the tree detects no 
change for h consecutive steps, where h is the height of the tree, the computation 
is finished. 
• This shortest path computation is an example of a diffusing computation as defined in [28] and
thus the termination detection technique described there can be applied to this 
algorithm. 
For the sake of brevity, the details of implementing completion detection will not be 
described here. In the experiments described below, the second termination technique 
was implemented to give a fair comparison with Chandy and Misra's algorithm. 
An example trace of the synchronous SPSP (SSP) algorithm on the sample graph of 
Figure 4.2 is shown in Figure 4.10. Since each vertex waits for distance messages on all 
incoming edges before propagating its next message on an outgoing edge, an unfortunate 
message ordering cannot cause an exponential number of messages.
Theorem 4.1 The SSP algorithm requires at most $O(|V| \times |E|)$ total messages.
Proof: In a graph with positive edge lengths, all shortest paths must be simple paths, or 
we could make them shorter by eliminating their cycles. Thus, a shortest path contains 
Time  a      b      c      d      e      f      g
1     (s.1)  (s.2)  (a.∞)  (a.∞)  (a.∞)  (d.∞)  (e.∞)
                           (b.∞)  (c.∞)  (e.∞)  (f.∞)
                                  (d.∞)
2     (s.1)  (s.2)  (a.2)  (a.3)  (a.5)  (d.∞)  (e.∞)
                           (b.5)  (c.∞)  (e.∞)  (f.∞)
                                  (d.∞)
3     (s.1)  (s.2)  (a.2)  (a.3)  (a.5)  (d.4)  (e.∞)
                           (b.5)  (c.4)  (e.7)  (f.∞)
                                  (d.5)
4     (s.1)  (s.2)  (a.2)  (a.3)  (a.5)  (d.4)  (e.6)
                           (b.5)  (c.4)  (e.6)  (f.6)
                                  (d.5)
Figure 4.10: Example Trace of Simple Synchronous SPSP Algorithm
at most $|V| - 1$ edges. By induction we see that the algorithm finds all shortest paths
containing i edges after i iterations of exchanging messages with its neighbors. Thus, at
most $|V| - 1$ iterations are required. Since $|E|$ messages are sent during each iteration,
$O(|V| \times |E|)$ total messages are required. ∎
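The message bound can be checked on a small example with a round-based simulation. This is a simplified stand-in for the SSP algorithm, written under several assumptions: a global loop replaces the local setDist and ackDist handshake, a round in which no distance changes replaces distributed completion detection (so one extra confirming round is counted), and the example graph is hypothetical, with edge lengths chosen to be consistent with the distances of Figure 4.4.

```python
def synchronous_spsp(graph, source):
    """Each round, every vertex hears from all backward neighbors before updating."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    rounds = messages = 0
    changed = True
    while changed:
        rounds += 1
        # Every vertex sends (distance + length) over every forward edge.
        inbox = {v: [] for v in graph}
        for u, edges in graph.items():
            for v, length in edges:
                inbox[v].append(dist[u] + length)
                messages += 1
        # A vertex examines all of its input edges before its value is propagated.
        changed = False
        for v, offers in inbox.items():
            best = min(offers, default=float("inf"))
            if best < dist[v]:
                dist[v] = best
                changed = True
    return dist, rounds, messages


# Hypothetical graph; edge lengths assumed for illustration.
example = {
    "s": [("a", 1), ("b", 2)],
    "a": [("c", 1), ("d", 2), ("e", 4)],
    "b": [("d", 3)],
    "c": [("e", 2)],
    "d": [("e", 2), ("f", 1)],
    "e": [("f", 2), ("g", 2)],
    "f": [("g", 1)],
    "g": [],
}
num_edges = sum(len(edges) for edges in example.values())
dist, rounds, messages = synchronous_spsp(example, "s")
```

Exactly one message crosses each edge per round, so the total is the number of rounds times the number of edges, independent of the order in which messages arrive within a round.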
The experiments discussed below were performed by coding both Chandy and Misra's
algorithm and the SSP algorithm in C and running them on a binary n-cube simulator.
The simulator charges one unit of time for each communications channel traversed in
the graph. The experiments show that for large graphs the SSP algorithm outperforms
Chandy and Misra's algorithm because it has better asymptotic performance, while for
small graphs Chandy and Misra's algorithm performs better since it is not burdened
with synchronization overhead.
Figure 4.11 shows the speedup of both algorithms as a function of the problem size. The 
line marked with circles shows the speedup of Chandy and Misra's algorithm, while the 
line marked with diamonds shows the speedup of the SSP algorithm. The graph shows 
that the SSP algorithm performs better than Chandy and Misra's algorithm for large 
graphs. The abrupt change in performance between 128 and 256 vertices is an anomaly 
probably due to the fact that only a single graph of each size was tested. 
The algorithms were run on random graphs of degree four with uniformly distributed 
edge lengths. Tests were run varying the graph size in multiples of two from 16 to 4096 
[Plot: speedup vs. log of problem size; curves for Chandy and Misra and for SSP.]
Figure 4.11: Speedup of Shortest Path Algorithms vs. Problem Size
[Plot: speedup vs. Log(N); curves for Chandy and Misra and for SSP.]
Figure 4.12: Speedup of Shortest Path Algorithms vs. Number of Processors
vertices. In each test the number of processors was equal to the number of vertices in 
the graph. The speedup figure in the graph is given by $T_s / T_c$, where $T_s$ is the number of
operations required by Dijkstra's algorithm on a sequential processor ignoring accesses 
to the priority queue, and Tc is the time for the concurrent algorithm on a concurrent 
processor. Note that these speedup figures are, in fact, pessimistic since they ignore the 
time required by the sequential algorithm to access the priority queue. 
Figure 4.12 shows the speedup of both algorithms as a function of the number of
processors. These tests were run on a random graph of degree 4 with 4096 vertices
and uniformly distributed edge weights. For this graph size, the SSP algorithm is about
four times as fast as Chandy and Misra's algorithm for all configurations tested. The
speedup of both algorithms is approximately $N / \log N$ over much of the range, with Chandy and Misra's
algorithm falling short of this asymptote for large N.
Figure 4.13 shows the speedup of both algorithms for different-size instances of the
pathological graph of Figure 4.7. Because the graph is very narrow and does not offer
[Plot: speedup vs. log of problem size; curves for Chandy and Misra and for SSP.]
Figure 4.13: Speedup of Shortest Path Algorithms for Pathological Graph
much potential for concurrency, neither algorithm performed particularly well. The SSP 
algorithm, however, outperformed Chandy and Misra's algorithm by a significant margin. 
Data are not available for Chandy and Misra's algorithm on graphs of more than 256
vertices because the algorithm did not terminate on the 512 vertex case after two days
of run time on a VAX 11/750! The SSP algorithm performs moderately well even on a
4096 vertex graph.
As we will see in the next section, additional speedup can be gained by exploiting
concurrency at a higher level by running several shortest path problems simultaneously.
4.2.2 Multiple Point Shortest Path 
In the multiple shortest path problem there are several source vertices, $s_1, \ldots, s_k$. The
problem is to find the minimum length path from each source vertex, $s_i$, to every node
in the graph. For example, during the loose routing phase an integrated circuit router
assigns signals to channels by independently finding the shortest path from each signal's 
source to its destination. Since each signal is handled independently, on a concurrent 
computer all signals can be routed simultaneously. 
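Because the problems are independent, they can be submitted side by side with no coordination between them. The sketch below illustrates only that independence: it uses Python threads from the standard concurrent.futures module rather than a message-passing machine, so no speedup should be expected from it, and the function names and the example graph are assumptions made for illustration.

```python
import heapq
from concurrent.futures import ThreadPoolExecutor


def shortest_distances(graph, source):
    """Single point shortest path distances from one source (sequential)."""
    dist = {v: float("inf") for v in graph}
    dist[source] = 0
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue                    # stale heap entry
        for v, length in graph[u]:
            if d + length < dist[v]:
                dist[v] = d + length
                heapq.heappush(heap, (dist[v], v))
    return dist


def multiple_point_shortest_path(graph, sources, workers=4):
    """Solve one independent problem per source; sources must be a list."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda s: shortest_distances(graph, s), sources)
        return dict(zip(sources, results))


# Hypothetical graph; edge lengths assumed for illustration.
example = {
    "s": [("a", 1), ("b", 2)],
    "a": [("c", 1), ("d", 2), ("e", 4)],
    "b": [("d", 3)],
    "c": [("e", 2)],
    "d": [("e", 2), ("f", 1)],
    "e": [("f", 2), ("g", 2)],
    "f": [("g", 1)],
    "g": [],
}
dist_from = multiple_point_shortest_path(example, ["s", "a", "d"])
```

The graph is shared read-only and each problem keeps its own distance table, which is why no locking is needed between problems.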
The results of a number of experiments run to measure the concurrency of running 
multiple shortest path problems simultaneously are shown in Figures 4.14 and 4.15. 
Figure 4.14 shows the speedup vs. number of processors for eight simultaneous shortest
path problems on graph R2.10, a random graph of degree 2 and 1024 vertices. This
figure shows an almost linear speedup for small N trailing off to an $N / \log N$ speedup as N,
the number of processors, approaches the size of the graph. This degradation is due to
the uneven distribution of load that results when only a few vertices of the graph are
assigned to each processing node. The maximum speedup of $N / \log N$ is due to the log N
cost of communication in an N processor binary n-cube.
Figure 4.15 shows the speedup of the multiple path algorithm vs. the number of
simultaneous problems for a fixed computer of dimension 10, 1024 nodes. For a small number
of problems the speedup is limited by the number of problems available to run. As more
problems are added the speedup increases to a point where it is limited by the number
of processors available. Beyond this point the speedup remains at a constant level. In 
this experiment the processors become the limiting factor beyond 10 problems. Running 
a sufficient number of shortest path problems simultaneously gives a speedup that is 
independent of the diameter of the graph and is instead dependent on the number of 
available processors and the distribution of work to those processors. 
The experiments shown in Figures 4.14 and 4.15 were run using Chandy and Misra's
algorithm. Even greater performance gains are expected for the SSP algorithm since the
multiple problems could share the significant synchronization overhead of this algorithm. 
[Plot: speedup vs. log(N).]
Figure 4.14: Speedup for 8 Simultaneous Problems on R2.10
[Plot: speedup vs. number of problems.]
Figure 4.15: Speedup vs. Number of Problems for R2.10, n=10
floyd
    | i j k |
    vertices do: [:vi |
        vertices do: [:vj |
            vi distTo: vj put: length of edge from i to j]].
    vertices do: [:vk |
        vertices do: [:vi |
            vertices do: [:vj |
                vi distTo: vj put: ((vi distTo: vj) min: ((vi distTo: vk) + (vk distTo: vj)))]]]
Figure 4.16: Floyd's Algorithm
4.2.3 All Points Shortest Path 
The all points shortest path problem is the extreme case of the multiple shortest path 
problem described above, where every vertex in the graph is a source vertex. An efficient 
sequential algorithm for solving this problem is given by Floyd [41] based on a transitive 
closure algorithm by Warshall [134]. This algorithm, shown in Figure 4.16, finds the
shortest path between any pair of vertices, vi and vj, by incremental construction. The
algorithm begins, k=0, with vi distTo: vj containing the length of the edge (if any) from
vi to vj. That is, the shortest path from vi to vj containing no other vertices. On the first
iteration, the algorithm considers paths from vi to vj that pass through the first vertex
vk. On the mth iteration, the shortest path passing through vertices numbered less than
or equal to m is found. Thus, when the algorithm completes, vi distTo: vj contains the
length of the shortest path from vi to vj. This algorithm has time complexity $O(|V|^3)$
and space complexity $O(|V|^2)$.
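A direct Python transcription of the triple loop is sketched below. Two details are assumptions added for the example: a missing edge is represented by an infinite length, and the distance from a vertex to itself is initialized to zero, which the description above does not state explicitly. The small test graph is hypothetical.

```python
def floyd(vertices, length):
    """length maps (i, j) to the length of the edge from i to j, if any."""
    inf = float("inf")
    # k = 0: paths that contain no intermediate vertices.
    dist = {(i, j): 0 if i == j else length.get((i, j), inf)
            for i in vertices for j in vertices}
    for k in vertices:          # allow vertex k as an intermediate vertex
        for i in vertices:
            for j in vertices:
                via_k = dist[i, k] + dist[k, j]
                if via_k < dist[i, j]:
                    dist[i, j] = via_k
    return dist


vertices = ["a", "b", "c", "d"]
length = {
    ("a", "b"): 1,
    ("b", "c"): 2,
    ("a", "c"): 5,
    ("c", "d"): 1,
    ("d", "a"): 7,
}
dist = floyd(vertices, length)
```

The dictionary holds one entry for every ordered pair of vertices, which is the quadratic space cost noted above.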
A concurrent version of this algorithm is given in [57]. This algorithm uses $|V|^2$
processors, one for each pair of vertices, to execute the inner two loops above in a single step.
$O(|V|)$ steps are required to perform the path computation. This approach is similar to
one described by Levitt and Kautz for cellular automata [82]. Although it gives linear
speedup, this algorithm is impractical for all but the smallest graphs because it requires
$|V|^2$ processors. Since graphs of interest in computer-aided design problems often
contain $10^5$ to $10^6$ vertices, practical algorithms must require a number of processors that
grows no faster than linearly with the problem size.
Both the sequential and concurrent versions of Floyd's algorithm are very inefficient for
sparse graphs. Floyd's algorithm requires $O(|V|^3)$ operations, while $|V|$ repetitions of
Dijkstra's algorithm require only $O(|V|^2 \log |V|)$ for graphs of constant degree. This
is even better than the expected case performance of Spira's $O(|V|^2 \log^2 |V|)$ algorithm
[119]. Thus, for sparse graphs, it is much more efficient to run multiple shortest path
problems as described in Section 4.2.2 than it is to run Floyd's algorithm.
The space complexity of $O(|V|^2)$ is a serious problem with the all points shortest path
problem. Note that this space requirement is inherent in the problem since the solution
is of size $|V|^2$. Another advantage of running multiple shortest path problems instead
of the all points problem is that the problem can be run in pieces and backed up to
secondary storage.
4.3 The Max-Flow Problem
The problem of determining the maximum flow in a network subject to capacity con-
straints, the max-flow problem, is a form of linear programming problem that is often 
encountered in solving communication and transportation problems. These problems 
usually involve large networks and are very computation-intensive. 
Consider a directed graph $G(V, E)$ with two distinguished vertices, the source, s, and the
sink, t. Each edge $e \in E$ has a capacity, $c(e)$. A flow function $f : E \to \mathbb{R}$
assigns a real number $f(e)$ to each edge e subject to the constraints:
1. The flow in each edge is non-negative and no greater than the edge capacity.
$$0 \le f(e) \le c(e), \qquad (4.2)$$
2. Except for s and t, the flow out of a vertex equals the flow into a vertex: vertices
conserve flow.
$$\forall v \in V \setminus \{s, t\}, \quad \sum_{e \in in(v)} f(e) = \sum_{e \in out(v)} f(e), \qquad (4.3)$$
where in(v) is the set of edges into vertex v, and out(v) is the set of edges out of
vertex v.
The network flow, $F(G, f)$, is the sum of the flows out of s. It is easy to show that F is
also the sum of the flows into t, the sink^3.
$$F = \sum_{e \in out(s)} f(e) = \sum_{e \in in(t)} f(e) \qquad (4.4)$$
^3 It is assumed that there is no flow into the source or out of the sink.
The max-flow problem is to find a legal flow function, f, that maximizes the network
flow F.
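The constraints of Equations (4.2) and (4.3) and the network flow of Equation (4.4) can be checked mechanically. The following is a minimal sketch, assuming a graph stored as a dictionary from (tail, head) pairs to capacities; the function name `network_flow` and this representation are hypothetical and do not come from the thesis.

```python
def network_flow(cap, flow, s, t):
    """Return F for a legal flow, or raise ValueError if `flow` is not legal.

    cap:  dict (u, v) -> capacity c(e)
    flow: dict (u, v) -> f(e), with the same keys as cap
    """
    def flow_in(v):
        return sum(f for (a, b), f in flow.items() if b == v)

    def flow_out(v):
        return sum(f for (a, b), f in flow.items() if a == v)

    # Equation (4.2): 0 <= f(e) <= c(e) on every edge.
    for e, c in cap.items():
        if not 0 <= flow[e] <= c:
            raise ValueError(f"capacity constraint violated on edge {e}")
    # Equation (4.3): every vertex other than s and t conserves flow.
    vertices = {v for e in cap for v in e}
    for v in vertices - {s, t}:
        if flow_in(v) != flow_out(v):
            raise ValueError(f"flow not conserved at vertex {v}")
    # Equation (4.4): F is the total flow out of the source.
    return flow_out(s)
```

Integer flows are used in the intended usage so that the conservation test is an exact comparison.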
The max-flow problem was first formulated and solved by Ford and Fulkerson [43]. To 
understand their algorithm we first need the following definitions. 
Definition 4.7 An edge e is useful(v, u) if either
1. $e = (v, u)$ and $f(e) < c(e)$, or
2. $e = (u, v)$ and $f(e) > 0$.
An edge that is useful(v, u) can be used to increase the flow between v and u either by
increasing the flow in the forward direction or decreasing the flow in the reverse direction.
Definition 4.8 The available flow, $a_i$, over an edge $e_i = (s_i, d_i)$ is
1. $c(e_i) - f(e_i)$ in the forward direction from $s_i$ to $d_i$,
2. $f(e_i)$ in the backward direction from $d_i$ to $s_i$.
The available flow is the amount the flow can be increased over an edge in a given
direction without violating the capacity constraint.
Definition 4.9 An augmenting path is a sequence of edges $e_1, \ldots, e_n$ where
1. $e_1$ is useful($s, v_1$),
2. $e_i$ is useful($v_{i-1}, v_i$) for all i such that $1 < i < n$,
3. $e_n$ is useful($v_{n-1}, t$).
Thus, an augmenting path is a sequence of edges from s to t along which the flow can
be increased by increasing the flow on the forward edges and decreasing the flow on the
reverse edges.
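Definitions 4.7 through 4.9 translate directly into a few lines of code. The sketch below is an illustration under assumed names (`available_flow`, `is_useful`, `path_bottleneck`) and an assumed dictionary representation of capacities and flows; none of these come from the thesis.

```python
def available_flow(edge, cap, flow, frm):
    """Available flow over `edge` = (src, dst) when leaving vertex `frm`."""
    src, dst = edge
    if frm == src:
        return cap[edge] - flow[edge]   # forward direction (Definition 4.8, case 1)
    if frm == dst:
        return flow[edge]               # backward direction (Definition 4.8, case 2)
    raise ValueError("vertex is not an endpoint of the edge")

def is_useful(edge, cap, flow, frm):
    """Definition 4.7: an edge is useful if some flow is available over it."""
    return available_flow(edge, cap, flow, frm) > 0

def path_bottleneck(path, cap, flow, s):
    """Minimum available flow along a sequence of edges that starts at s."""
    v, amounts = s, []
    for edge in path:
        amounts.append(available_flow(edge, cap, flow, v))
        src, dst = edge
        v = dst if v == src else src    # step to the opposite endpoint
    return min(amounts)
```

A path is augmenting exactly when its bottleneck is positive, and the bottleneck is the amount by which the flow along the path can be increased.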
The Ford and Fulkerson algorithm begins with any feasible flow and constructs a maximal
flow by adding flow along augmenting paths. An arbitrary search algorithm is used to
find each augmenting path. Flow in each edge of the path is then increased by the
minimum of the available flow for all edges in the path. The original Ford and Fulkerson
algorithm may require an unbounded amount of time to solve certain pathological graphs.
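[Figure 4.17 diagram: layered graph on vertices s, a, b, c, d, t; surviving edge labels
(capacity/flow): a-c 1/0, a-d 1/1, b-d 1/0]
Figure 4.17: Example of Suboptimal Layered Flow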
Edmonds and Karp [30] later discovered that restricting the search for augmenting paths
to be breadth-first makes the time complexity of the algorithm $O(|E|^2 |V|)$. For dense
graphs where $|E| = O(|V|^2)$ this is quite bad, $O(|V|^5)$; however, for sparse graphs where
$|E| = O(|V|)$, Edmonds and Karp's algorithm requires only $O(|V|^3)$ time. Only recently
have better algorithms been discovered for sparse graphs.
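The breadth-first restriction can be illustrated with a short sequential sketch. This is a conventional textbook-style rendering of the augmenting path method with breadth-first search, not code from the thesis; the function name `edmonds_karp` and the dictionary representation are assumptions.

```python
from collections import deque

def edmonds_karp(cap, s, t):
    """Return (max flow value, flow dict) for capacities cap: (u, v) -> c."""
    flow = {e: 0 for e in cap}
    adj = {}  # vertex -> list of (edge, neighbor, is_forward)
    for (u, v) in cap:
        adj.setdefault(u, []).append(((u, v), v, True))
        adj.setdefault(v, []).append(((u, v), u, False))

    def avail(edge, forward):
        return cap[edge] - flow[edge] if forward else flow[edge]

    total = 0
    while True:
        # Breadth-first search over useful edges finds a shortest augmenting path.
        pred = {s: None}
        queue = deque([s])
        while queue and t not in pred:
            u = queue.popleft()
            for edge, w, forward in adj.get(u, []):
                if w not in pred and avail(edge, forward) > 0:
                    pred[w] = (u, edge, forward)
                    queue.append(w)
        if t not in pred:
            return total, flow          # no augmenting path remains
        # Walk back from t to s, collecting the edges of the path.
        path, v = [], t
        while pred[v] is not None:
            u, edge, forward = pred[v]
            path.append((edge, forward))
            v = u
        # Augment by the minimum available flow along the path.
        delta = min(avail(e, fwd) for e, fwd in path)
        for e, fwd in path:
            flow[e] += delta if fwd else -delta
        total += delta
```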
Dinic introduced the use of layering to solve the max-flow problem [35]. Dinic's algorithm 
constructs a max-flow in phases. Each phase begins by constructing an auxiliary layered 
graph that uses only useful edges in the original flow graph. 
Definition 4.10 A layered graph is a graph where the vertex set, V, has been partitioned
into layers. Each vertex, v, is assigned to layer $l(v)$ and edges are restricted to connect
adjacent layers: $\forall e = (u, v),\ l(v) = l(u) + 1$. The layer of a vertex corresponds to the
number of edges between the source and that vertex. A layered graph is constructed
from a general flow graph by breadth-first search.
• The source, s, is assigned to layer $l(s) = 0$.
• For each layer i from 1 to k, a vertex u that has not already been assigned to a layer
is assigned to layer i if there exists an edge, e, which is useful(v, u) for some vertex v
in layer i - 1.
During each phase of Dinic's algorithm a maximal layered flow is found in the layered
graph using depth-first search. The flows added to the layered graph are added to
the original flow graph, and the next phase of the algorithm begins by relayering the
graph. The number of layers in the auxiliary graph is guaranteed to increase by at least
one each iteration and obviously can be no more than $|V|$, so the number of
iterations is at most $|V| - 1$.
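The breadth-first layering of Definition 4.10 can be sketched as follows. This is a sequential illustration under assumed names (`layer_vertices`) and an assumed dictionary representation; it scans every edge for each vertex, which is adequate for a sketch but not tuned for performance.

```python
from collections import deque

def layer_vertices(cap, flow, s):
    """Return dict vertex -> layer for vertices reachable over useful edges."""
    layer = {s: 0}
    queue = deque([s])
    while queue:
        v = queue.popleft()
        for (a, b), c in cap.items():
            if a == v and flow[(a, b)] < c and b not in layer:
                layer[b] = layer[v] + 1      # useful in the forward direction
                queue.append(b)
            elif b == v and flow[(a, b)] > 0 and a not in layer:
                layer[a] = layer[v] + 1      # useful in the backward direction
                queue.append(a)
    return layer
```

On a six-vertex graph with unit capacities (edges s-a, s-b, a-c, a-d, b-d, c-t, d-t), the sink is in layer 3 when the flow is zero. After one unit is sent along s, a, d, t, relayering places the sink in layer 5, which illustrates how the number of layers grows between phases.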
Definition 4.11 A maximal layered flow is a legal assignment of flows to edges of a 
layered graph such that 
• Flows are augmented only in the forward direction. The flow over a forward edge 
(from layer i to layer i + 1) can only be increased and the flow over a reverse edge 
(from i + 1 to i) can only be decreased. 
• All paths from the source to the sink are saturated. 
Because of the layering constraint, a maximal layered flow is not necessarily a maximal 
flow on the layered network and may not be the best achievable within the constraints. 
For example, Figure 4.17 shows a layered graph where each edge is labeled with its
capacity and flow (capacity/flow). The one unit of flow along path s, a, d, t is a maximal
layered flow even though a two-unit flow is possible (paths s, a, c, t and s, b, d, t).
Finding a maximal layered flow is much easier than finding a max-flow in a general graph
because each edge has been assigned a direction. Dinic's algorithm [35] constructs the
layered max-flow using depth-first search, which requires $O(|V| \times |E|)$ time for each
phase or $O(|V|^2 |E|)$ total time. Algorithms due to Karzanov [63] and Malhotra, Kumar, and
Maheshwari (MKM) [84] also use layering, but construct layered max-flows by pushing
flow from vertex to vertex. Karzanov's algorithm constructs preflows, pushing flow
from the source, while the simpler MKM algorithm identifies a flow limiting vertex, v,
and then saturates v, propagating flow towards both the source and sink from v. Both
of these algorithms require $O(|V|^3)$ time and Galil has shown that these bounds are
tight [46]. While considerably better for dense graphs, these layered algorithms offer no
improvement over Edmonds and Karp for sparse graphs.
Cherkasky developed an $O(|V|^2 \sqrt{|E|})$ algorithm by further partitioning the layered
graph into superlayers [45]. Karzanov's algorithm is applied between the superlayers while
Dinic's algorithm is used within each superlayer. Galil improved Cherkasky's superlayer
algorithm to have complexity $O(|V|^{5/3} |E|^{2/3})$ by using a set of data structures called
forests to efficiently represent paths in the superlayers [44].
Galil and Naamad have developed an $O(|V| \times |E| \log^2 |V|)$ algorithm that uses a form
of path compression. The algorithm follows the general form of Dinic's algorithm, but
avoids rediscovering paths by storing path fragments in a 2-3 tree [45]. The fastest known
algorithm for the max-flow problem, due to Sleator [118], also stores path fragments in
a tree structure. Sleator's algorithm uses a novel data structure called a biased 2-3 tree
on which join, split and splice operations can be performed very efficiently to give an
$O(\log |V|)$ improvement over Galil and Naamad.
Despite the intensive research that has been performed on the max-flow problem, little 
work has been done on concurrent max-flow algorithms. This paucity of concurrent 
algorithms may be due to the fact that all of the sequential algorithms reviewed above 
are inherently sequential. They depend upon a strict ordering of operations and cannot 
be made parallel in a straightforward manner. 
Shiloach and Vishkin (SV) [116] have developed a concurrent max-flow algorithm based
on Karzanov's algorithm. Like Karzanov's algorithm, the SV algorithm operates in
stages, constructing a maximal layered flow at each stage by pushing preflows from the
source to the sink. A novel data structure called a partial-sum tree (PS-tree) is used
to make the pushing and rejection of flow efficient in dense graphs. The SV algorithm
is based on a synchronized, shared-memory model of computation wherein all processors
have access to a common memory and can even read and write the same location
simultaneously. The algorithm assumes that all processors are synchronized so that
all active vertices finish their flow propagation before any new active vertices begin
processing. The CVF algorithm, described below, is very similar to the SV algorithm
but is based on a message-passing model of computation wherein shared memory and
global synchronization signals are not provided.
Marberg and Gafni have developed a message-passing version of the SV algorithm [85];
however, their algorithm is quite different from the CVF algorithm. The CVF algorithm
is locally synchronized; vertices communicate only with their neighbors. Each cycle of
the algorithm requires only two channel traversals for synchronization^4. Marberg and
Gafni, on the other hand, use global synchronization. All vertices are embedded in a tree
which is used to broadcast STARTPULSE messages to all vertices to begin each cycle
and to combine ENDPULSE messages to detect the completion of each cycle. The same
tree is used to detect completion of each phase of the algorithm. With this approach
each cycle requires a minimum of $2 \log |V|$ channel traversals for synchronization.
4.3.1 Constructing a Layered Graph 
The remainder of this section describes two novel concurrent max-flow algorithms: 
• the concurrent augmenting digraph (CAD) algorithm, 
• the concurrent vertex flow (CVF) algorithm. 
Both algorithms are similar to Dinic's algorithm in that they iteratively partition the
flow graph into layers and construct a maximal layered flow on the partitioned network.
This common macro algorithm is illustrated in Figure 4.18. The algorithms differ in their 
approach to increasing flow in the layered network. The CAD algorithm increases flow 
by finding augmenting paths, while the CVF algorithm works by pushing flow between 
vertices. 
Both the CAD and CVF algorithms construct a layered network using an algorithm
similar to Chandy and Misra's shortest path algorithm. As shown in Figure 4.19,
partitioning the vertices into layers is the same as finding the shortest path when all edge
lengths are one.
maxFlow: g source: s sink: t
    While an augmenting path exists from s to t
        Construct a layered graph g' from graph g
        Construct a maximal layered flow in g'
Figure 4.18: CAD and CVF Macro Algorithm
^4 A round trip between neighboring vertices is performed each cycle.
The algorithm shown in Figure 4.19 differs from the algorithm shown in Figure 4.5 in 
three ways: 
• Both forward and backward edges are used in constructing paths. 
• Only edges that are useful in the proper direction are considered. 
• All edge lengths are considered to be unity. 
Restricting edge lengths to unity results in greatly improved worst case complexity.
With unit edge lengths there are at most $|V|$ possible values for a vertex's distance
from the source. A vertex can change its value at most $|V|$ times, resulting in $O(|V|^2)$
messages in the worst case. For unit edge lengths the looser synchronization of Misra
and Chandy's algorithm is preferable to the tight synchronization of the SSP algorithm.
Since the algorithm performs at most $O(|V|)$ layerings in the worst case, the contribution
of layering to the total number of messages required to solve the flow problem is $O(|V|^3)$.
In addition to partitioning the vertices into layers, it is also necessary to partition the
edges incident on each vertex, v, in layer i into a set of edges to layer i + 1, outEdges,
a set of edges to layer i - 1, inEdges, and all remaining edges. Collections inEdges and
outEdges will be used extensively in the following algorithms. The partitioning of edges
is straightforward and will not be shown here.
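One possible way to perform this edge partitioning is sketched below. The function name `partition_edges` and the data layout are hypothetical. The sketch also assumes that only edges that are useful in the direction of increasing layer number are placed in inEdges and outEdges, which is one reading of the description of the propagation phase and is not stated explicitly in the text.

```python
def partition_edges(cap, flow, layer):
    """Return (in_edges, out_edges), each a dict vertex -> list of edges.

    cap, flow: dict (a, b) -> capacity / current flow
    layer:     dict vertex -> layer number from the layering step
    """
    in_edges = {v: [] for v in layer}
    out_edges = {v: [] for v in layer}
    for (a, b), c in cap.items():
        if a not in layer or b not in layer:
            continue                      # endpoint unreachable: a "remaining" edge
        f = flow[(a, b)]
        if f < c and layer[b] == layer[a] + 1:
            # useful forward: carries flow from a (layer i) to b (layer i + 1)
            out_edges[a].append((a, b))
            in_edges[b].append((a, b))
        elif f > 0 and layer[a] == layer[b] + 1:
            # useful backward: carries flow from b (layer i) to a (layer i + 1)
            out_edges[b].append((a, b))
            in_edges[a].append((a, b))
    return in_edges, out_edges
```

Edges that satisfy neither test fall into the set of remaining edges and are simply left out of both collections.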
4.3.2 The CAD Algorithm 
The CAD algorithm constructs a maximal flow in each layered network by finding
augmenting paths. Multiple paths are explored concurrently and the algorithm merges
convergent paths into a digraph to improve performance. To prevent several paths from
claiming the same edge capacity, each path is constructed in three phases: propagation,
reservation and confirmation.
instance methods for class FlowVertex
layer: aLayer over: anEdge
    | |
    (aLayer < layer) ifTrue: [
        (pred notNil) ifTrue: [pred ackFrom: self].
        layer ← aLayer.
        pred ← anEdge.
        forwardEdges do: [:edge |
            edge layer: (layer + 1) from: self.
            nrMsgs ← nrMsgs + 1].
        backwardEdges do: [:edge |
            edge layer: (layer + 1) from: self.
            nrMsgs ← nrMsgs + 1].
        (nrMsgs = 0) ifTrue: [
            pred ackFrom: self.
            pred ← nil]]
    ifFalse: [anEdge ackFrom: self].
ack
    | |
    nrMsgs ← nrMsgs - 1.
    (nrMsgs = 0) ifTrue: [
        (pred notNil) ifTrue: [pred ackFrom: self].
        pred ← nil].
instance methods for class FlowEdge
layer: aLayer from: aVertex
    | |
    (aVertex = source) ifTrue: [
        (flow < capacity) ifTrue: [dest layer: aLayer over: self]
                          ifFalse: [aVertex ack]]
    ifFalse: [
        (flow > 0) ifTrue: [source layer: aLayer over: self]
                   ifFalse: [aVertex ack]].
ackFrom: aVertex
    | |
    (self oppositeVertex: aVertex) ack
Figure 4.19: CAD and CVF Layering Algorithm
Propagation: All potential augmenting paths from s to t in the layered network are
found by constructing a path digraph rooted at s. Construction of the path
digraph begins by sending propagate messages from the source over all useful edges
to vertices in layer 1. A vertex in layer i waits until it has received messages over
all incoming useful edges from layer i - 1. It then sends propagate messages over
all outgoing useful edges to layer i + 1. The propagation process continues until
vertex t is reached.
For each edge, e, the maximum flow that can reach that edge from the source is
recorded in instance variable reserveFlow. The capacity of the edges used by the
paths discovered during the propagation phase is not locked, however, and several
paths may use the same capacity. Conflicts over edge capacity are resolved during
the reservation phase.
Reservation: Paths discovered during the propagation phase reserve edge capacity by
following links in the path digraph backwards from t to s. When a propagate
message reaches the sink, the reservation process is initiated by the sink sending
a reserve message back to the preceding layer. A vertex in layer i waits until
it receives reserve messages over all outgoing edges and then parcels the reserve
flow among incoming edges. Since there may not be sufficient flow into the vertex
from layer i + 1 to satisfy all reservations, some edges may reduce the value of
reserveFlow. It is also possible that some vertices may have more incoming flow
from layer i + 1 than can be reserved on all incoming edges. In this case the excess
reservations in the higher layers will be reduced during the confirmation phase.
Confirmation: Reservations are confirmed and possibly reduced during the confirmation
phase. When a reserve message reaches the source, confirmation is initiated by
the source sending a confirm message back to layer 1. When a vertex in layer i
has received confirm messages over all incoming edges, it partitions the flow over
the outgoing edges, possibly reducing or completely canceling the reservation on
some of these edges, and propagates confirm messages to layer i + 1. Because of
the way reservations are made during the reserve phase, the reservations made on
incoming edges are no greater than the reservations on outgoing edges. Thus, the
flow into a vertex during the confirm phase is guaranteed to be no greater than
the reserved flow on outgoing edges.
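The three phases can be illustrated with a sequential simulation of one CAD iteration. This sketch is not the message-passing implementation described in the thesis: it visits vertices layer by layer instead of exchanging messages, and it assumes the layered graph is given as edges (u, v) with layer[v] = layer[u] + 1, every endpoint present in `layer`, and the sink in the last layer. The name `cad_iteration` and the `avail` dictionary of available flows are hypothetical.

```python
import math

def cad_iteration(avail, layer, s, t):
    """One propagate/reserve/confirm pass; returns dict edge -> flow added."""
    order = sorted(layer, key=layer.get)          # vertices by increasing layer
    ins = {v: [e for e in avail if e[1] == v] for v in layer}
    outs = {v: [e for e in avail if e[0] == v] for v in layer}

    # Propagation: record on each edge the most flow that could reach it.
    reach = {v: 0 for v in layer}
    reach[s] = math.inf                           # the source supply is unbounded
    reserve = {}
    for v in order:
        for e in outs[v]:
            reserve[e] = min(reach[v], avail[e])
            reach[e[1]] += reserve[e]

    # Reservation: from the sink back, parcel reserved flow among in-edges.
    for v in reversed(order):
        if v == t or v == s:
            continue                              # sink and source reflect messages
        want = sum(reserve[e] for e in outs[v])   # zero at a dead end
        for e in ins[v]:
            reserve[e] = min(want, reserve[e])
            want -= reserve[e]

    # Confirmation: from the source forward, confirm or cancel reservations.
    confirmed = {}
    for v in order:
        have = math.inf if v == s else sum(confirmed[e] for e in ins[v])
        for e in outs[v]:
            confirmed[e] = min(have, reserve[e])
            have -= confirmed[e]
    return confirmed
```

Repeating the iteration, subtracting the confirmed flow from `avail` each time, until no flow reaches the sink yields a maximal layered flow for that layering.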
The propagate methods for both vertices and edges are shown in Figure 4.20. When
a non-sink vertex, v, receives a propagate message, it accumulates the total flow that
could possibly reach v in instance variable inFlow and counts the number of propagate
messages received in instance variable nrMsgs. When messages have been received over
all incoming edges^5, a propagate message is transmitted to each outgoing edge in
collection outEdges. An edge receiving a propagate message takes the minimum of the flow
the vertex can deliver, aFlow, and its own available flow and propagates the resulting
outFlow to the next layer of the graph. When a propagate message reaches the sink, the
sink immediately sends a reserve message back to the sender to initiate the reservation
phase^6.
instance methods for class FlowVertex
propagate: aFlow over: anEdge
    | |
    (isSink) ifFalse: [                                      "internal vertex"
        inFlow ← inFlow + aFlow.
        nrMsgs ← nrMsgs + 1.
        (nrMsgs = (inEdges size)) ifTrue: [                  "propagate flow to next layer"
            outEdges do: [:edge | edge propagate: inFlow from: self].
            inFlow ← 0.
            nrMsgs ← 0].
        (outEdges size = 0) ifTrue: [                        "dead end, reserve 0 flow"
            inEdges do: [:edge | edge reserve: 0 from: self]]]
    ifTrue: [anEdge reserve: aFlow from: self].              "sink reflects messages"
instance methods for class FlowEdge
propagate: aFlow from: aVertex
    | outFlow |
    (aVertex = source) ifTrue: [outFlow ← aFlow min: (capacity - flow)]   "forward edge"
                       ifFalse: [outFlow ← aFlow min: flow].              "backward edge"
    (self oppositeVertex: aVertex) propagate: outFlow over: self.
    reserveFlow ← outFlow.
Figure 4.20: Propagate Methods
instance methods for class FlowVertex
reserve: aFlow over: anEdge
    | outFlow |
    (isSource) ifFalse: [                                    "internal vertex"
        inFlow ← inFlow + aFlow.
        nrMsgs ← nrMsgs + 1.
        (nrMsgs = (outEdges size)) ifTrue: [
            inEdges do: [:edge |
                outFlow ← inFlow min: edge reserveFlow.
                edge reserve: outFlow from: self.
                inFlow ← inFlow - outFlow].
            inFlow ← 0.
            nrMsgs ← 0]]
    ifTrue: [anEdge confirm: aFlow from: self].              "source reflects messages"
instance methods for class FlowEdge
reserve: aFlow from: aVertex
    | |
    reserveFlow ← aFlow.
    (self oppositeVertex: aVertex) reserve: aFlow over: self.
Figure 4.21: Reserve Methods
The code that propagates reservations back toward the source is shown in Figure 4.21.
A vertex, v, waits to receive reserve messages from all of its outgoing edges, summing
the reserved flow in instance variable inFlow. When v has received messages from all
outgoing edges the value of inFlow represents the flow reserved between v and the sink,
t. Vertex v divides this flow among its incoming edges, sending each of them a reserve
message to propagate the reservations back to the next layer. A reserve message received
by the source is reflected back to the sender to initiate the confirmation phase.
^5 Recall that inEdges, outEdges is a partition of edges constructed during layering and may be
different from the forwardEdges, backwardEdges partition defined by the structure of the graph.
^6 The sink could test for termination at this point by checking if any flow can reach it; however,
for the sake of simplicity this test has been omitted.
instance methods for class FlowVertex
confirm: aFlow over: anEdge
    | outFlow |
    (isSink) ifFalse: [                                      "internal vertex"
        inFlow ← inFlow + aFlow.
        nrMsgs ← nrMsgs + 1.
        (nrMsgs = (inEdges size)) ifTrue: [
            outEdges do: [:edge |
                outFlow ← inFlow min: edge reserveFlow.
                edge confirm: outFlow from: self.
                inFlow ← inFlow - outFlow].
            nrMsgs ← 0]]
    ifTrue: [                                                "sink"
        inFlow ← inFlow + aFlow.
        nrMsgs ← nrMsgs + 1.
        (nrMsgs = (inEdges size)) ifTrue: [requester reply: inFlow]].
instance methods for class FlowEdge
confirm: aFlow from: aVertex
    | |
    reserveFlow ← 0.
    (aVertex = source) ifTrue: [flow ← flow + aFlow]         "forward edge"
                       ifFalse: [flow ← flow - aFlow].       "back edge"
    (self oppositeVertex: aVertex) confirm: aFlow over: self.
Figure 4.22: Confirm Methods
The details of the confirmation phase are shown in Figure 4.22. As in the propagate
stage, a vertex, v, waits for messages on all incoming edges before sending messages over
all outgoing edges. When v receives the confirm message from the last incoming edge,
instance variable inFlow represents the amount of flow that has been added to paths from
the source, s, to v. Vertex v uses this flow to confirm reservations on outgoing edges until
it is used up. If the incoming flow is not sufficient to satisfy all outgoing reservations,
one outgoing edge may only have part of its reservation confirmed (aFlow < reserveFlow)
and some edges may have their reservation completely canceled (aFlow = 0). An edge
receiving a confirm message increments or decrements its flow by the specified amount
depending on whether it is a forward or backward edge.
When all confirm messages reach the sink, t, an iteration is complete and t replies with
the added flow to the macro-level algorithm. If the added flow is zero, a maximal layered
flow has been constructed and the macro-level algorithm proceeds to re-layer the network
for the next solution phase. Otherwise, another iteration of path finding is initiated by
sending a propagate message to the source.
Lemma 4.1 Each iteration of the CAD algorithm saturates at least one vertex, v, leaving
no useful flow into v.
Proof: There are two cases:
1. No vertex reduces the reservation by having inFlow > 0 after sending all of its
reserve messages: all flow propagated into the sink is confirmed. Proof by induction
on the number of layers, l.
• For l = 1, since the flow propagated into the sink comes directly from the
source, confirming this flow saturates all edges into the sink and thus saturates
the sink vertex.
• Consider a network of l layers. The flow propagated along each edge is the
minimum of the available flow on the edge and the maximum flow that can
reach the preceding vertex. Thus, for each edge into the sink, either that edge
is saturated, or all flow propagated to the preceding vertex will be confirmed.
If all edges into the sink are saturated, then the sink is saturated. If some edge
e = (v, t) into the sink is not saturated, then all flow propagated into vertex
v is confirmed. This situation is analogous to vertex v being the sink vertex
of an (l - 1)-layer network. Thus by induction, some vertex will be saturated.
2. If some vertex v reduces the reservation, then there exists a vertex u such that u
reduces the reservation and no vertex in a layer lower than l(u) reduces the reservation.
Thus, all flow propagated into u is confirmed. Consider u as the sink of a graph of depth
l(u); then by the result of case (1) above, some vertex in this subgraph is saturated.
Theorem 4.2 The CAD algorithm requires $O(|V|^2 |E|)$ messages.
Proof: The CAD algorithm sends exactly $3 \times |E|$ messages during each iteration of the
three phases. By Lemma 4.1 each iteration saturates at least one vertex, so there can
be at most $|V|$ iterations per layering. Since at most $|V| - 1$ layerings are constructed,
the total number of messages sent is at most
$$3 \times |E| \times |V| \times (|V| - 1) = O(|V|^2 |E|). \qquad (4.5)$$
∎
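As a small worked example of the bound in Equation (4.5), the product can be evaluated for a concrete graph size. The helper name `cad_message_bound` is hypothetical and the function only restates the arithmetic of the proof.

```python
def cad_message_bound(num_vertices, num_edges):
    """Upper bound of Equation (4.5): 3 |E| |V| (|V| - 1) messages."""
    messages_per_iteration = 3 * num_edges      # propagate, reserve, confirm
    iterations_per_layering = num_vertices      # at most |V|, by Lemma 4.1
    max_layerings = num_vertices - 1            # at most |V| - 1 layerings
    return messages_per_iteration * iterations_per_layering * max_layerings
```

For a graph with 6 vertices and 7 edges the bound evaluates to 3 × 7 × 6 × 5 = 630 messages.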
4.3.3 The CVF Algorithm 
Like the CAD algorithm, the CVF algorithm works by iteratively partitioning the graph
into layers and then constructing a maximal layered flow for each partition. Rather
than using augmenting paths to construct a maximal layered flow, however, the CVF
algorithm works by pushing flow from the source vertex to the sink vertex.
The concept of a preflow ([63], p. 53) is helpful in understanding this algorithm.
Definition 4.12 A preflow, $f : E \to \mathbb{R}$, is an assignment of flow to the edges of the
graph so that Equation (4.2) is satisfied, but Equation (4.3) is reduced to an inequality:
the flow into a vertex may exceed the flow out of a vertex.
$$\forall v \in V \setminus \{s, t\}, \quad \sum_{e \in in(v)} f(e) \ge \sum_{e \in out(v)} f(e). \qquad (4.6)$$
The CVF algorithm constructs a preflow by pushing flow requests from source to sink.
The preflow is converted into a maximal layered flow by rejecting excess flow requests.
If a vertex, v, in layer i cannot push all requested flow on to layer i + 1, it rejects the
remaining flow, sending it back to layer i - 1. The vertex, u, receiving the rejected flow
may send a request to another vertex in layer i, or it may reject the flow itself, passing
the problem back to layer i - 2.
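The difference between a flow and a preflow can be made concrete by computing the excess at each vertex. The sketch below uses assumed names (`excess`, `is_preflow`) and an assumed dictionary representation. It takes the inequality of Definition 4.12 to mean that the flow in is at least the flow out, following the wording of the definition.

```python
def excess(flow, v):
    """Flow into v minus flow out of v."""
    inflow = sum(f for (a, b), f in flow.items() if b == v)
    outflow = sum(f for (a, b), f in flow.items() if a == v)
    return inflow - outflow

def is_preflow(cap, flow, s, t):
    """True if `flow` respects capacities and no internal vertex has a deficit."""
    vertices = {v for e in cap for v in e}
    within_capacity = all(0 <= flow[e] <= cap[e] for e in cap)
    no_deficit = all(excess(flow, v) >= 0 for v in vertices - {s, t})
    return within_capacity and no_deficit
```

A vertex with positive excess is exactly one that must either push more flow forward or reject flow back to the previous layer.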
This approach to constructing a maximal layered flow is not unique. The CVF algorithm
is a concurrent version of Karzanov's algorithm [63]. It is almost identical to the SV
algorithm [116]. There are three major differences between the CVF algorithm and the
SV algorithm.
1. The SV algorithm depends on a synchronized model of computation where all
vertices operate in lockstep. The CVF algorithm, on the other hand, is based
on an asynchronous message-passing model of computation. Vertices operate
autonomously and all synchronization is explicitly performed using message passing.
2. The SV algorithm uses PS-trees to combine communications from several edges.
The CVF algorithm is intended for sparse graphs where vertex degree is small and
such a structure is not needed.
3. The SV algorithm does not detect termination. All vertices become idle when a
maximal layered flow has been constructed, but there is no mechanism to detect
this condition. The CVF algorithm explicitly detects termination by propagating
acknowledgements.
An iteration of the CVF algorithm is started by having the source vertex send request
messages over all of its outgoing edges. The code for the request method is shown in
Figure 4.23. When a vertex, v, in layer i has received a request message for a non-zero
amount of flow, it records the flow quantum requested on a LIFO stack and accumulates
the flow in instance variable inFlow. When request messages have been received over all
incoming edges, instance variable inFlow represents the total unbalanced flow into the
vertex. Method sendMessages balances this flow by either pushing it to layer i + 1 or
rejecting it back to layer i - 1. After its first activation, vertex v waits for messages over
all incoming edges and all outgoing edges, accumulating both flow pushed from layer
i - 1 and flow rejected from layer i + 1 before calling method sendMessages to balance
the flow.
instance methods for class FlowVertex
request: aFlow over: anEdge
    | |
    (isSink) ifFalse: [                                      "internal vertex"
        inFlow ← inFlow + aFlow.                             "accumulate flow"
        (inFlow > 0) ifTrue: [stack push: aFlow@anEdge].     "record flow quanta on stack"
        nrRequests ← nrRequests + 1.
        ((nrRequests = inEdges size) and:
            ((state = #inactive) or: (nrRejects = outEdges size))) ifTrue: [
                self sendMessages]]                          "distribute accumulated flow"
    ifTrue: [anEdge ackFlow: self].                          "sink acknowledges immediately"
instance methods for class FlowEdge
request: aFlow from: aVertex
    | |
    (aVertex = source) ifTrue: [flow ← flow + aFlow]         "forward edge"
                       ifFalse: [flow ← flow - aFlow].       "backward edge"
    (aFlow = (self availFlow: aVertex)) ifTrue: [state ← #saturated].
    (self oppositeVertex: aVertex) request: aFlow over: self.
    rejectableFlow ← rejectableFlow + aFlow.
availFlow: aVertex
    | |
    (state ~= #active) ifTrue: [↑0].                         "no more flow on saturated edge"
    (aVertex = source) ifTrue: [↑capacity - flow]            "forward edge"
                       ifFalse: [↑flow]                      "backward edge"
Figure 4.23: request Methods for CVF Algorithm
Method sendMessages, shown in Figure 4.24, balances the flow at a vertex, v, and
synchronizes v with its neighbors. To balance the flow at vertex v the method first tries to
push the excess flow to layer i + 1 by sending request messages over output edges. These
request messages propagate the preflow to the next layer of the graph. If flow remains
after all requests have been sent, the remaining flow is rejected back to layer i - 1 by
sending reject messages over incoming edges. Flow is rejected in LIFO order by rejecting
flow quanta popped off the stack until the excess flow has been rejected. Once the flow
has been rejected, sync messages are sent to all back edges to push the rejected flow back
to layer i - 1 and to synchronize the algorithm. Request messages are always sent to
all outgoing edges and sync or ack messages to all incoming edges to keep the algorithm
synchronized. Many of these messages carry zero flow.
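The balancing step, with its LIFO rejection of flow quanta, can be sketched in isolation from the message passing. The function name `balance` and its list-based arguments are hypothetical; the sketch covers only the flow arithmetic and omits the sync and ack messages and the state changes.

```python
def balance(in_flow, out_avail, stack):
    """Push as much of in_flow forward as possible, then reject the rest.

    in_flow:   total unbalanced flow at the vertex
    out_avail: list of available flows, one per outgoing edge
    stack:     list of (amount, edge) quanta, most recent last (mutated)
    Returns (requests, rejects): requests holds one amount per outgoing
    edge, rejects is a list of (edge, amount) pairs in rejection order.
    """
    requests = []
    for a in out_avail:
        out = min(in_flow, a)
        requests.append(out)            # zero-flow requests are still sent
        in_flow -= out
    rejects = []
    while in_flow > 0 and stack:
        amount, edge = stack.pop()      # most recently recorded quantum first
        out = min(in_flow, amount)
        rejects.append((edge, out))
        in_flow -= out
        if amount > out:
            stack.append((amount - out, edge))   # partially rejected quantum
    return requests, rejects
```

For example, with 5 units of unbalanced flow, outgoing edges that can accept 2 and 0 units, and recorded quanta of 3 units over edge e1 followed by 2 units over edge e2, the vertex requests 2 and 0 units, rejects 2 units over e2 and 1 unit over e1, and leaves a 2-unit quantum for e1 on the stack.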
Method sendMessages also performs completion detection by propagating acknowledgements.
Sink vertices acknowledge all flow pushed into them by sending an ackFlow message
back to the sending edge. When a non-sink vertex, v, receives acknowledgement
from all of its neighbors in layer i + 1 and receives no additional flow requests, it sends
acknowledgements to all of its neighbors in layer i - 1. These acknowledgements, however,
can be canceled by sending a non-zero flow request to v. When the source receives
acknowledgements from all of its neighbors, completion is detected and the algorithm
terminates.
Figure 4.25 shows the details of rejection. When an edge, e, receives a reject message,
it adjusts its flow accordingly and changes its state to either #saturated (no more flow
can be requested across e) or #done (no more flow can be requested or rejected across
e). Flow rejections are accumulated until e receives a sync message. The sync message
causes e to propagate the rejected flow back to the vertex at its opposite end. Vertices
handle flow rejections exactly the same as flow requests: flow is accumulated until all
requests and rejections are in, and then the vertex is balanced by calling sendMessages.
Both vertices and edges have a state encoded in instance variable state. Edge states
progress from #active to #saturated, and finally to #done.
#active: All edges begin each CVF iteration in the #active state. Flow can be requested 
only across active edges. 
instance methods for class FlowVertex

sendMessages
    | outFlow quantum |
    (inFlow > 0) ifTrue: [nrAcks ← 0].                  "reactivate unbalanced vertex"
    outEdges do: [:edge |                               "request flow from next layer"
        outFlow ← inFlow min: edge availFlow: self.
        edge request: outFlow from: self.
        inFlow ← inFlow - outFlow].
    ((inFlow = 0) and: (nrAcks = outEdges size)) ifTrue: [
        inEdges do: [:edge |                            "send acknowledges to previous layer"
            edge ackFlow: self]]
    ifFalse: [
        (inFlow > 0) ifTrue: [
            state ← #saturated.
            [(inFlow > 0) and: (stack notEmpty)] whileTrue: [   "reject flow to previous layer"
                quantum ← stack pop.
                outFlow ← inFlow min: quantum x.
                quantum y reject: outFlow from: self.
                inFlow ← inFlow - outFlow.
                (quantum x > outFlow) ifTrue: [
                    stack push: quantum y @ (quantum x - outFlow)]]].
        inEdges do: [:edge | edge sync: self].          "sync up previous layer"
        (state ≠ #saturated) ifTrue: [state ← #active]].    "become active"
    inFlow ← 0.                                         "reset flow rejected to source"
    nrRequests ← 0.                                     "reset message counts"
    nrRejects ← 0.
Figure 4.24: sendMessages Method for CVF Algorithm
instance methods for class FlowVertex

reject: aFlow over: anEdge
    | outFlow |
    inFlow ← inFlow + aFlow.                            "accumulate flow"
    nrRejects ← nrRejects + 1.
    ((nrRejects = outEdges size) and:
     (nrRequests = inEdges size)) ifTrue: [
        self sendMessages].                             "distribute excess flow"

ackFlow: anEdge
    | |
    nrAcks ← nrAcks + 1.                                "count acks"
    self reject: 0 over: anEdge.

instance methods for class FlowEdge

reject: aFlow from: aVertex
    | |
    (aVertex = source) ifTrue: [flow ← flow + aFlow]    "forward edge"
        ifFalse: [flow ← flow - aFlow].                 "backward edge"
    rejectableFlow ← rejectableFlow - aFlow.
    rejectedFlow ← rejectedFlow + aFlow.
    (aFlow > 0) ifTrue: [
        (rejectableFlow = 0) ifTrue: [state ← #done]    "no more flow to reject"
            ifFalse: [state ← #saturated]].             "no more requests"

sync: aVertex
    | |
    rejectableFlow ← rejectableFlow - rejectedFlow.
    (self oppositeVertex: aVertex) reject: rejectedFlow over: self.
    rejectedFlow ← 0.

ackFlow: aVertex
    | |
    (state = #saturated) ifTrue: [state ← #done].
    (self oppositeVertex: aVertex) ackFlow: self.
Figure 4.25: reject and ackFlow Methods for CVF Algorithm
#saturated: When the maximum possible flow has been requested across an #active
edge, or when any flow is rejected across an edge, the edge becomes #saturated.
No further flow can be requested across a #saturated edge.
#done: When all requested flow is rejected across an edge, or the flow in a #saturated
edge is acknowledged, the edge becomes #done. The flow in a #done edge cannot
be changed.
Vertex states progress from #inactive to #active to #saturated:
#inactive: To initiate synchronization, all vertices begin in the #inactive state. Inactive
vertices wait only for messages on their incoming edges before calling sendMessages
to balance their flow. After their first balancing operation, all vertices
become #active or #saturated.
#active: As with edges, a vertex remains #active until it rejects flow.
#saturated: Once a vertex rejects flow, it becomes #saturated and will no longer accept
flow requests.
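The edge states described above can be summarized as a small state machine. The Python sketch below follows the prose descriptions of #active, #saturated, and #done; it is not a translation of the Concurrent Smalltalk code in Figure 4.25. The class name, the method names, and the choice to return the rejected flow from sync instead of sending a message are assumptions made to keep the example self-contained, and the forward/backward flow bookkeeping is omitted.

```python
class FlowEdge:
    """Hypothetical sketch of the edge states: active -> saturated -> done."""

    def __init__(self, rejectable_flow):
        self.state = "active"
        self.rejectable_flow = rejectable_flow  # flow that may still be rejected
        self.rejected_flow = 0                  # rejections accumulated until sync

    def reject(self, flow):
        # Accumulate the rejection; any non-zero rejection stops further requests.
        self.rejectable_flow -= flow
        self.rejected_flow += flow
        if flow > 0:
            self.state = "done" if self.rejectable_flow == 0 else "saturated"

    def sync(self):
        # Hand the accumulated rejection to the opposite vertex (returned here).
        flow, self.rejected_flow = self.rejected_flow, 0
        return flow

    def ack_flow(self):
        # Once the flow in a saturated edge is acknowledged it can no longer change.
        if self.state == "saturated":
            self.state = "done"
```

A zero-valued reject leaves the state unchanged, which matches the way ackFlow reuses reject with a flow of 0 in Figure 4.25.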
Lemma 4.2 Each iteration of the CVF algorithm constructs a maximal layered flow.
Proof:
• The flow is legal since acknowledgements are only propagated back from the sink to the
source when all vertices are balanced.
• Suppose the flow was not maximal; then there exists an augmenting path, P, in the
layered network. Let v_i be the vertex of P in the ith layer of the graph. The source
requests all possible flow from v_1, so some vertex on P must have rejected some
of this flow. Let v_j ≠ t be the vertex of P furthest from the source that rejected
the flow. Each vertex v_i requests all possible flow from all of its neighbors in layer
i + 1, including v_{i+1}, before rejecting any flow to v_{i-1}. Since v_j rejected the flow
and v_{j+1} didn't, all edges out of v_j, including the edge (v_j, v_{j+1}), must be saturated.
Then we have a contradiction since P includes (v_j, v_{j+1}), but an augmenting path
cannot contain a saturated edge. ∎
The CVF algorithm is synchronized by having each vertex, v, in layer i wait for messages
from all of its neighbors in layers i ± 1 before sending messages to layers i ± 1. This
synchronization, illustrated in the Petri Net of Figure 4.26, causes operation of the
layers to alternate: even layers send messages to odd layers and then odd layers send
messages to even layers. Since the layers are not completely connected, this alternation
is somewhat loose; however, the Petri Net assures us that each vertex will execute the
same number of message sending cycles.
Figure 4.26: Petri Net of CVF Synchronization
Lemma 4.3 Each iteration of the CVF algorithm requires at most O(|V|) cycles and
thus O(|V|^2) messages.
Proof: Flow pushed from the source is either acknowledged or rejected. Acknowledged
flow takes O(|V|) cycles to reach the sink from its last point of rejection. Rejected flow
performs a depth-first search (DFS) of the layered graph before it is either rejected back
to the source or is acknowledged by the sink. Flow first pushes forward (depth-first);
then, if it is rejected, it follows the same path backward, since requests are rejected in a
LIFO manner. Each time flow backtracks over a node, that node is saturated and will
not be visited again. In the worst case a single flow quantum traverses the entire layered
graph taking O(|V|) cycles. Since every vertex sends messages every cycle, O(|V|^2)
messages are required. If several flow quanta are being rejected simultaneously, the
traversal takes less time. ∎
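The LIFO rejection that the proof relies on can be illustrated with a short Python sketch. It mirrors the rejection loop of the sendMessages method in spirit only: the function name, the (flow, edge) tuple layout, and the returned list of rejections are assumptions introduced for this example.

```python
def reject_lifo(excess, stack):
    """Reject `excess` flow against a stack of (flow, edge) request quanta.

    Hypothetical helper: the most recently accepted request is rejected first.
    Returns (rejections, leftover_excess) and mutates `stack`.
    """
    rejections = []
    while excess > 0 and stack:
        flow, edge = stack.pop()
        out = min(excess, flow)
        rejections.append((edge, out))
        excess -= out
        if flow > out:
            # Only part of this quantum was rejected; keep the remainder.
            stack.append((flow - out, edge))
    return rejections, excess


requests = [(3, "e1"), (4, "e2")]      # e2 was requested last
rejected, leftover = reject_lifo(5, requests)
```

Because the newest quantum is popped first, rejected flow retraces the path along which it was pushed forward, which is the backtracking behavior the proof describes.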
To see that this bound is tight, consider the graph of Figure 4.27, a binary tree where all
internal edges have capacity 100 and all leaves are connected to the sink with capacity
1. The CVF algorithm will perform DFS on this graph taking 2|V| cycles to construct a
maximal layered flow. In contrast, the CAD algorithm will find a maximal layered flow
for this graph in O(log |V|) cycles.
The graph of Figure 4.27 illustrates the major difference between the CAD and CVF
algorithms. In the CAD algorithm all potential paths from source to sink are discovered
simultaneously without considering possible conflicts. The CVF algorithm, on the other
hand, never generates any conflicts. It explores only those paths that have guaranteed
available capacity on their initial segments. This conservative approach to augmenting
flow can result in sequential execution for graphs like the one shown in Figure 4.27 that
bottleneck near their sink.
Figure 4.27: Pathological Graph for CVF Algorithm
Theorem 4.3 The CVF algorithm requires at most O(|V|^3) messages.
Proof: The contribution of layering is O(|V|^3). By Lemma 4.2 a maximal layered flow
is constructed by each iteration of the CVF algorithm. Since at most |V| - 1 layerings
are produced, at most O(|V|) iterations are performed. By Lemma 4.3 each iteration
takes O(|V|^2) messages. Thus, the algorithm requires O(|V|^3) messages. ∎
4.3.4 Distributed Vertices 
In both the CAD and CVF algorithms, the source and the sink are bottlenecks that 
serialize the algorithm. At most one path can be processed by the source or sink per 
unit time and each path must pass through both the source and sink twice. The problem 
is especially acute in the case of a flow-graph for solving a bipartite matching problem
where the fanout of the source and fan-in of the sink are |V|/2 - 1 as shown in Figure 4.28.
Figure 4.28: A Bipartite Flow Graph
Figure 4.29: Distributed Source and Sink Vertices
Figure 4.30: Number of Operations vs. Graph Size for Max-Flow Algorithms
[Log-log plot of operations against number of vertices for the CAD algorithm, the CVF algorithm, and Dinic's algorithm.]
The source and sink bottlenecks can be removed by distributing these vertices. The only
operations performed at the source and sink are keeping message counts and reflecting
messages back across edges. Messages to or from the source or sink on a particular edge
affect no other edges. Thus, we can split the source and sink into multiple vertices: one
for each edge incident on the original source or sink as shown in Figure 4.29. The individual
source and sink vertices act independently, reflecting messages and keeping message
counts. When all source vertices have been acknowledged (in the CVF algorithm) or
have received confirm messages (in the CAD algorithm), completion is detected and the
algorithm terminates.
4.3.5 Experimental Results 
The CAD and CVF algorithms have been run on a concurrent computer simulator to
measure their performance experimentally. Dinic's sequential max-flow algorithm was
also tested to give a performance baseline for comparisons. Randomly generated bipartite
graphs with uniformly distributed edge capacities were used as test cases for the max-flow
algorithms. The tests were run on a simulated binary n-cube interconnection network
where one unit of time is charged for each channel traversed by a message. This implies
that a random message requires on the average n = log N units of time to reach its
destination. The results of these experiments are shown in Figures 4.30, 4.31 and 4.32.
The number of messages required by each of the algorithms as a function of graph size
is shown in Figure 4.30. For purposes of comparison, Dinic's algorithm was charged
one message for each edge traversed. While the worst-case complexity of these three
algorithms is O(|V|^3), all three give linear performance on the test cases. The CAD
algorithm (squares) requires the fewest messages, ≈ 9|V|, followed by the CVF algorithm
(diamonds) with ≈ 11|V|, and, finally, Dinic's sequential algorithm (triangles)
required ≈ 30|V| edge traversals to construct a max-flow. This figure shows that the
CAD and CVF algorithms are, in fact, good sequential algorithms. The overhead of synchronization
does not greatly increase the number of messages required when compared
to a strictly sequential algorithm. The CAD algorithm requires fewer messages than the
CVF algorithm because it propagates wavefronts of activity across the graph. Only the
vertices on the wavefront are active at a given time. In contrast, the CVF algorithm is
tightly synchronized with all vertices actively sending messages all the time.
The speedup of the two concurrent algorithms relative to the sequential algorithm is
shown in Figure 4.31 as a function of the number of processors for a 4096 vertex graph. 
Both the CAD algorithm (squares) and the CVF algorithm (diamonds) show nearly linear 
speedup until they saturate at 128 processors with speedups of close to 200. The speedup
is greater than the number of processors because the CAD and CVF algorithms are better 
than Dinic's algorithm even for a single processor. The speedup of the CAD algorithm 
varies from 2.5 on a single processor to 204 on 256 processors, a relative speedup of 
81.6. The speedup of the CVF algorithm varies from 1.7 on a single processor to 202 on 
1024 processors for a relative speedup of 119. As expected the CAD algorithm performs 
slightly better for small numbers of processors with the CVF algorithm catching up for 
large numbers of processors. 
Figure 4.32 shows the speedup of the CAD (squares) and CVF (diamonds) algorithms 
as a function of graph size. Each test was run with I~! processing nodes. For the CAD 
algorithm, speedup varies from 0.9 for a 4 vertex graph to 196 for a 4096 vertex graph. 
CVF speedups were nearly the same, varying from 0.7 for a 4 vertex graph to 202 for a 
4096 vertex graph. The speedup grows slower than linearly, almost logarithmically, as 
the number of vertices is increased from 4 to 128 and then just about linearly from 128 
to 4096 vertices. This irregularity in the speedup curve may be due to the fact that only 
one graph of each size was tested. 
Figure 4.31: Speedup of CAD and CVF Algorithms vs. No. of Processors
Figure 4.32: Speedup of CAD and CVF Algorithms vs. Graph Size
4.4 Graph Partitioning 
The graph partitioning problem involves partitioning the vertices of a graph into two sets 
in a manner that minimizes the sum of the weights of edges incident on both sets. This 
problem has important applications in computer aided design where graphs representing 
the interconnection of logic circuits are partitioned onto several physical packages [105]. 
Graph partitioning is also used in process placement on multiprocessors where a possibly 
dynamic graph representing the interconnection of logical processes is partitioned over 
a set of physical processors [20].
Unfortunately, this important problem is NP-Complete [47]. In practice, however, polynomial
time heuristics based on iterative improvement methods are used with good
results [67], [38].
In the Kernighan and Lin algorithm [67], an initial partition is improved by exchanging
pairs of vertices between the two sets. At each step, the pair that results in the greatest
reduction in the weight of the cutset is chosen for exchange. The limiting step of the
algorithm is computing the weight reduction associated with each pair and sorting the
pairs according to this number. Based on this step, the time complexity of the algorithm
is estimated to be O(|V|^2 log |V|).
Fiduccia and Mattheyses [38] improve upon this algorithm to give a linear-time heuristic.
Their most important modification is to consider single vertex moves rather than pairwise 
exchanges. They also use a bucket list to sort the vertices so vertices can be added or 
deleted from the list in constant time. 
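A bucket list of this kind can be sketched in Python as an array of buckets indexed by gain. This is an illustrative sketch only, not the data structure as published in [38]: the class name GainBuckets, the bound max_gain on integer gains, and the use of Python sets (which give expected rather than worst-case constant time) are all assumptions.

```python
class GainBuckets:
    """Hypothetical bucket list: one bucket per integer gain in [-max_gain, max_gain]."""

    def __init__(self, max_gain):
        self.max_gain = max_gain
        self.buckets = [set() for _ in range(2 * max_gain + 1)]
        self.gain_of = {}

    def insert(self, vertex, gain):
        # Adding a vertex touches exactly one bucket.
        self.buckets[gain + self.max_gain].add(vertex)
        self.gain_of[vertex] = gain

    def remove(self, vertex):
        # Deleting a vertex touches exactly one bucket.
        gain = self.gain_of.pop(vertex)
        self.buckets[gain + self.max_gain].discard(vertex)

    def update(self, vertex, new_gain):
        self.remove(vertex)
        self.insert(vertex, new_gain)

    def best(self):
        # Scan from the highest bucket down; a tuned implementation would
        # keep a pointer to the highest non-empty bucket instead.
        for index in range(2 * self.max_gain, -1, -1):
            if self.buckets[index]:
                vertex = next(iter(self.buckets[index]))
                return vertex, index - self.max_gain
        return None
```

Because gains are bounded integers, no comparison sort is needed, which is what removes the logarithmic factor of the pairwise-exchange approach.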
A novel approach to the graph partitioning problem using linear programming has been 
developed by Barnes [6]. This approach converts the partitioning problem into a matrix 
approximation problem. The matrix approximation problem is then solved using linear 
programming. This method is good for finding an approximate solution near a local 
minimum for the problem. Barnes then uses an iterative improvement algorithm similar 
to that of Kernighan and Lin to fine-tune this approximate solution. 
The partitioning problem can also be approximately solved using simulated annealing
[69]. Simulated annealing, as applied to graph partitioning, involves randomly selecting
a move to alter the partition, and then accepting this move with a probability dependent
on its gain and the current annealing temperature. At high temperatures most moves are
accepted regardless of gain. As the graph cools, the algorithm becomes more selective, 
accepting fewer negative gain moves. At zero temperature only positive gain moves are 
accepted. This technique generally achieves better solutions than the straight iterative 
improvement algorithms, because by occasionally accepting bad moves it is capable of 
avoiding local minima. Simulated annealing requires considerably more computing time 
than the other methods. 
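The acceptance step can be sketched as follows. The text only says that the probability depends on gain and temperature; the exponential rule exp(gain / temperature) used here is the common Metropolis-style choice and is an assumption, as are the function name and the optional rng parameter.

```python
import math
import random


def accept_move(gain, temperature, rng=random):
    """Decide whether to accept a move with the given gain.

    Assumed rule: positive-gain moves are always accepted; other moves are
    accepted with probability exp(gain / temperature).
    """
    if gain > 0:
        return True
    if temperature <= 0:
        # At zero temperature only positive gain moves are accepted.
        return False
    return rng.random() < math.exp(gain / temperature)
```

With this rule a move of gain -1 is accepted almost always at temperature 100 and almost never at temperature 0.1, which reproduces the qualitative behavior described above.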
Consider an undirected graph G(V, E) where edges e = (v_1, v_2) ∈ E are assigned
weight w(e). The vertices are partitioned into two disjoint sets A and B.
Definition 4.13 The cut defined by A and B is the set of edges C(A, B) = {(a, b) | a ∈
A, b ∈ B}. The sum of the weights of edges in the cut is the weight of the cut
    W(A, B) = \sum_{e \in C(A,B)} w(e).    (4.7)
Definition 4.14 The imbalance of a partition A, B is
    I(A, B) = |A| - |B|.    (4.8)
The object of a graph partitioning algorithm is to find a partition of V into A, B, subject
to a balance constraint |I(A, B)| < C_b so as to minimize the weight of the cut, W(A, B).
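The two definitions translate directly into code. The sketch below is illustrative: representing the graph as a dict from vertex pairs to weights and the partition as a dict from vertex to "A" or "B" is an assumption of this example, not a representation used in the thesis.

```python
def cut_weight(edges, side):
    """W(A, B): sum of the weights of edges whose endpoints lie in different sets.

    edges: dict mapping (u, v) pairs to weights; side: dict vertex -> "A" or "B".
    """
    return sum(w for (u, v), w in edges.items() if side[u] != side[v])


def imbalance(side):
    """Signed imbalance I(A, B) = |A| - |B|."""
    size_a = sum(1 for s in side.values() if s == "A")
    return size_a - (len(side) - size_a)


edges = {("a", "b"): 5, ("a", "c"): 2, ("b", "d"): 1}
side = {"a": "A", "c": "A", "b": "B", "d": "B"}
balanced = abs(imbalance(side)) < 2   # balance constraint with C_b = 2
```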
The remainder of this section describes a novel concurrent heuristic graph partitioning 
algorithm. Like the sequential algorithms described in [67] and [38], it is an iterative
improvement algorithm. Starting from an initial partition, vertices are moved from one
set to the other to improve the partition. The algorithm is concurrent in that it moves 
many vertices simultaneously while sequential algorithms move only one or two vertices 
at a time. 
4.4.1 Why Concurrency is Hard 
Concurrency introduces two major problems: thrashing and balancing. There are cases
where making several simultaneous moves increases the weight of the cut even though
each move taken individually would reduce the weight of the cut. The simplest example
of this thrashing problem is shown in Figure 4.33. Vertices a ∈ A and b ∈ B are
connected with weight w_h to each other, and with weight w_l to another element of the
same set where w_h > w_l. Individually, moving a to set B or b to set A would decrease
the weight of the cut by w_h - w_l, but moving both a and b at the same time increases
the weight of the cut by 2w_l.
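The thrashing example can be checked numerically. In the sketch below the weights 5 and 2 and the vertex names a2 and b2 (the other elements of sets A and B) are arbitrary choices for illustration.

```python
def cut_weight(edges, side):
    """Sum of the weights of edges whose endpoints lie in different sets."""
    return sum(w for (u, v), w in edges.items() if side[u] != side[v])


w_h, w_l = 5, 2  # example weights with w_h > w_l
edges = {("a", "b"): w_h, ("a", "a2"): w_l, ("b", "b2"): w_l}
start = {"a": "A", "a2": "A", "b": "B", "b2": "B"}

move_a = dict(start, a="B")             # move a alone
move_both = dict(start, a="B", b="A")   # move a and b simultaneously

before = cut_weight(edges, start)
single_change = cut_weight(edges, move_a) - before
double_change = cut_weight(edges, move_both) - before
```

Here single_change equals -(w_h - w_l) and double_change equals 2 * w_l, so the two individually beneficial moves are harmful when made together.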
A balance constraint must be imposed on the partitioning to prevent the algorithm from
reducing W(A, B) to zero by moving all of the vertices into one set. We require that
|I(A, B)| < C_b for some constant C_b. In a sequential algorithm it is quite easy to keep a
running count of the size of each set. Moves are checked in sequence against the count, 
and only moves that keep the counts within the balance constraint are allowed. In the 
concurrent algorithm, this sequential checking of moves against a count is not possible, 
and another mechanism is required to enforce balance. 
The remainder of this section develops a concurrent algorithm that meets the challenges 
described above. It uses a method of inhibiting gain to prevent thrashing and uses a 
matching tree to impose balance. 
Figure 4.33: Thrashing
4.4.2 Gain 
An iterative improvement algorithm searches a state space by applying simple transition 
moves to an initial state. In the case of graph partitioning, the state space is the space 
of all possible partitions. Each transition move is the transfer of a vertex from one set 
to the other. My algorithm is greedy in the sense that it moves all those vertices that 
are guaranteed to give the largest immediate gain in the objective function -W(A, B). 
Definition 4.15 The gain of a vertex g(v) is the amount by which W(A, B) is decreased
by moving v from one set to another. If we define int(v) to be the set of edges connecting
v to vertices in the same set and ext(v) to be the set of edges connecting v to elements of
the other set, then
    g(v) = \sum_{e \in ext(v)} w(e) - \sum_{e \in int(v)} w(e).    (4.9)
During the first phase of the algorithm, all of the vertices compute their gain as follows.
1. All vertices transmit their set and the weight of the connecting edge to all
neighboring vertices.
2. As vertices receive messages from their neighbors, they compute their gain as the
sum of the weights received from vertices in the opposite set less the sum of the
weights received from vertices in the same set.
Figure 4.34: Simultaneous Move That Increases Cut
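The two-step gain computation can be sketched sequentially in Python. Message passing is replaced here by a single loop over the edges, and the dict-based graph and partition representations are assumptions of this example.

```python
def gains(edges, side):
    """Return g(v) for every vertex: external edge weight minus internal edge weight.

    Each edge contributes to both endpoints, mirroring the exchange in which
    every vertex sends its set and the edge weight to each neighbor.
    """
    gain = {v: 0 for v in side}
    for (u, v), w in edges.items():
        delta = w if side[u] != side[v] else -w
        gain[u] += delta
        gain[v] += delta
    return gain


edges = {("a", "b"): 5, ("a", "a2"): 2, ("b", "b2"): 2}
side = {"a": "A", "a2": "A", "b": "B", "b2": "B"}
g = gains(edges, side)
```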
4.4.3 Coordinating Simultaneous Moves 
Because of the thrashing problem, if vertices were moved on the basis of gain alone, moves
could potentially increase W(A, B) as shown in Figure 4.34. The vertices adjacent to
vertex a can be divided into four sets:
• A_m, vertices in set A with positive gain,
• A_s, vertices in set A with negative or zero gain,
• B_m, vertices in set B with positive gain,
• B_s, vertices in set B with negative or zero gain.
The gain of a before moving any vertices is
    g(a) = (w(B_m) + w(B_s)) - (w(A_m) + w(A_s)),    (4.10)
where w(S) denotes the weight of all edges connecting a to set S. If all vertices with
positive gain including a are moved simultaneously, the new gain of a (pushing a back
into set A) becomes
    g_1(a) = (w(B_m) + w(A_s)) - (w(A_m) + w(B_s)).    (4.11)
If w(A_s) > w(B_s), moving a increases the value of the cut, W(A, B).
To solve this problem of simultaneously moving vertices, we inhibit vertices from moving
if they are adjacent to vertices of larger gain in the opposite set. Thus, any vertex, a ∈ A,
that moves knows that all of its neighbors in set B will remain stationary. The set B_m
is empty and the gain, g(a), is guaranteed. If some neighbor of a, a_1 ∈ A, moves with a
to set B, the actual gain will be larger than g(a). To prevent ties, two vertices a and b
with equal gains g(a) = g(b) compare their vertex IDs. The vertex with the larger ID
inhibits the other vertex.
Inhibiting nodes from moving based on gain has the disadvantage of reducing the concurrency
of the partitioning algorithm. To calculate the degradation in concurrency,
assume that all vertices have degree d, and that positive gains are uniformly distributed
over some range (0, n]. Then the probability of a vertex with gain g moving is P_g = (g/n)^d.
Thus, the fraction of nodes with positive gain that can be expected to move is given by
    f_m = \sum_{k=1}^{n} \frac{1}{n} P_k
        = \frac{1}{n} \sum_{k=1}^{n} \left(\frac{k}{n}\right)^d
        \approx \frac{1}{n} \int_{k=0}^{n} \left(\frac{k}{n}\right)^d dk
        = \int_{x=0}^{1} x^d \, dx
        = \frac{1}{d+1}.    (4.12)
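The closed form can be compared with the discrete sum it approximates. The sketch below evaluates the sum exactly for a chosen n; the function name and the sample values of n and d are assumptions for illustration.

```python
def moving_fraction(n, d):
    """Exact value of (1/n) * sum over k = 1..n of (k/n)**d."""
    return sum((k / n) ** d for k in range(1, n + 1)) / n


for d in (1, 2, 4, 8):
    exact = moving_fraction(1000, d)
    print(d, round(exact, 4), round(1 / (d + 1), 4))
```

Since (k/n)^d increases with k, the sum is slightly larger than the integral and approaches 1/(d+1) as n grows. For d = 1 the sum is exactly (n + 1) / (2n).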
Even with gain inhibition, vertices must be locked after they are moved to avoid thrash-
ing. This is even true of sequential algorithms. To implement locking, the algorithm is 
performed in phases. At the beginning of each phase, all vertices are unlocked. Whenever
a vertex is moved it is locked and cannot be moved again until the next phase.
Phases are repeated until there are no vertices with positive gain. 
Using locking and gain inhibition to prevent thrashing, the algorithm becomes 
1. Set all vertices unlocked.
2. While there is some unlocked positive gain vertex, 
(a) All vertices transmit their set and the weight of the connecting edge to all 
neighboring vertices. 
(b) As vertices receive messages from their neighbors, they compute their gain as 
the sum of the weights received from vertices in the opposite set less the sum 
of the weights received from vertices in the same set. 
(c) Unlocked vertices transmit their gain to all their neighbors. Locked vertices
transmit -∞ to all their neighbors.
(d) Vertices that have a positive gain greater than the gain of all of their neighbors
move to the opposite set and become locked. Vertex IDs are used to break
ties.
3. Repeat steps 1 and 2 until there are no positive gain vertices. 
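One phase of the algorithm above can be simulated sequentially. The following Python sketch is an illustration under stated assumptions: the function names, the dict-based graph, and the use of (gain, id) tuple comparison for tie-breaking are choices made here, and vertex IDs are assumed to be mutually comparable.

```python
def select_movers(edges, side, locked):
    """Return the unlocked vertices allowed to move in one iteration.

    A vertex moves only if its gain is positive and its (gain, id) pair beats
    that of every neighbor; locked vertices advertise a gain of -infinity.
    """
    gain = {v: 0 for v in side}
    neighbors = {v: set() for v in side}
    for (u, v), w in edges.items():
        delta = w if side[u] != side[v] else -w
        gain[u] += delta
        gain[v] += delta
        neighbors[u].add(v)
        neighbors[v].add(u)
    advertised = {v: float("-inf") if v in locked else gain[v] for v in side}
    movers = []
    for v in side:
        if v in locked or gain[v] <= 0:
            continue
        if all((advertised[v], v) > (advertised[n], n) for n in neighbors[v]):
            movers.append(v)
    return movers


def run_phase(edges, side):
    """Repeat iterations until no unlocked vertex can move; mutates side."""
    locked = set()
    while True:
        movers = select_movers(edges, side, locked)
        if not movers:
            return side
        for v in movers:
            side[v] = "B" if side[v] == "A" else "A"
            locked.add(v)


edges = {("a", "b"): 5, ("a", "a2"): 2, ("b", "b2"): 2}
side = {"a": "A", "a2": "A", "b": "B", "b2": "B"}
first = select_movers(edges, side, set())
final = run_phase(edges, dict(side))
```

On the thrashing example, a and b have equal gain, so only b (the larger ID) moves in the first iteration. Run to completion, the phase drives every vertex into set A, which shows why the balance constraint of Section 4.4.4 is needed.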
4.4.4 Balance 
A balance constraint is required to prevent the algorithm from finding a cut of weight 
zero by moving all vertices into one set. Specifically, no move is allowed that will make 
one set larger than the other by more than some constant Cb. Thus, given a legal initial 
partition, the condition ||A| - |B|| < C_b will always be true.
The algorithm enforces the balancing constraint using a matching tree, a binary tree with 
all vertices at its leaves. At the end of the gain exchange, step 2( c) above, all vertices 
transmit their intentions (move or stay put) and their set (A or B) up the matching 
tree. Each internal vertex of the matching tree waits for all of its children to respond. 
It then attempts to match requests to move from set A with requests to move from set 
B. Matched requests are granted and the grant message is transmitted back down the 
tree. Unmatched requests are collected into a single message (count and set) that is 
transmitted up to the next level of the tree. At the root of the tree, a count of the
current imbalance, |A| - |B|, is kept. The root acknowledges unmatched requests as long
as |I(A, B)| < C_b.
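The matching performed inside the tree can be sketched recursively. In this illustrative Python version a leaf is "A" or "B" for a vertex requesting to leave that set, or None for a vertex staying put, and an internal vertex is a tuple of children. The representation and the function name are assumptions, and the root's check against the balance constraint is left out.

```python
def match_tree(node):
    """Match move requests in a matching tree.

    Returns (granted_pairs, unmatched_from_a, unmatched_from_b) for the subtree.
    """
    if not isinstance(node, tuple):
        # Leaf: one request to move out of A, out of B, or no request at all.
        return 0, int(node == "A"), int(node == "B")
    granted = unmatched_a = unmatched_b = 0
    for child in node:
        g, a, b = match_tree(child)
        granted += g
        unmatched_a += a
        unmatched_b += b
    # Pair requests from opposite sets; the remainder travels up one level.
    matched = min(unmatched_a, unmatched_b)
    return granted + matched, unmatched_a - matched, unmatched_b - matched


tree = (("A", "A"), ("B", None))
result = match_tree(tree)
```

Every matched pair moves one vertex in each direction and so leaves the imbalance unchanged; only the unmatched remainder that reaches the root has to be checked against C_b.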
With balancing, the concurrent partitioning algorithm becomes
1. Set all vertices unlocked.
2. While there is some unlocked positive gain vertex not blocked by the balancing
constraint,
(a) All vertices transmit their set and the weight of the connecting edge to all
neighboring vertices.
(b) As vertices receive messages from their neighbors, they compute their gain as
the sum of the weights received from vertices in the opposite set less the sum
of the weights received from vertices in the same set.
(c) Unlocked vertices transmit their gain to all their neighbors. Locked vertices
transmit -∞ to all their neighbors.
(d) Vertices that have a positive gain greater than the gain of all of their neighbors
transmit their set and their intention to move to their parent in the matching
tree. All other vertices transmit their intention to remain stationary and their
set to their matching tree parent. Vertex IDs are used to break ties in gain
comparison.
(e) The matching tree validates requests to move against the balance constraint
as follows:
i. Once each matching tree vertex receives messages from all of its children,
it matches requests from sets A and B. Matched requests are acknowledged.
ii. Unmatched requests are collected into a message to the next level of the
matching tree.
iii. The root of the matching tree acknowledges up to C_b - I(A, B) unmatched
requests from set A or up to C_b + I(A, B) unmatched requests from set B
and updates I(A, B) accordingly. All remaining unmatched requests are
rejected. If |I(A, B)| = C_b, all vertices in the smaller set are temporarily
locked until |I(A, B)| ≠ C_b.
(f) Vertices that receive acknowledgements to their requests to move become
members of the opposite set.
3. Repeat steps 1 and 2 until there are no unblocked positive gain vertices.
4.4.5 Allowing Negative Moves 
There are cases where any single move increases W(A, B), but a sequence of moves can
decrease W(A, B). Consider for example the case where |A| = |B| + C_b, and ∀ b ∈
B, g(b) < 0. There are no unblocked positive gain vertices to move. Moving a vertex b
with small negative gain from B to A, however, may enable a vertex with large positive
gain to move from A to B.
The algorithm can be extended to find some of these sequences by maintaining two
partitions and accepting negative gain moves. The partition A', B' is updated every
move and its cut weight, W(A', B'), is computed. The best partition A, B is updated
whenever W(A', B') < W(A, B).
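Maintaining the two partitions can be sketched as follows. The function name, the list of single-vertex moves supplied by the caller, and the recomputation of the cut weight from scratch after each move are simplifying assumptions of this example.

```python
def track_best(edges, side, moves):
    """Apply single-vertex moves in order and return the best partition seen.

    `current` plays the role of A', B' and `best` the role of A, B.
    """
    def cut_weight(s):
        return sum(w for (u, v), w in edges.items() if s[u] != s[v])

    current = dict(side)
    best = dict(side)
    best_weight = cut_weight(best)
    for v in moves:
        current[v] = "B" if current[v] == "A" else "A"
        weight = cut_weight(current)
        if weight < best_weight:
            best, best_weight = dict(current), weight
    return best, best_weight


edges = {("a", "b"): 1, ("b", "c"): 3, ("c", "d"): 1}
side = {"a": "A", "b": "A", "c": "B", "d": "B"}
best, weight = track_best(edges, side, ["a", "b"])
```

In this example the first move raises the cut weight from 3 to 4, and only the second move reveals a better partition. The example ignores the balance constraint, which a full implementation would also have to enforce.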
4.4.6 Performance 
Each iteration of the algorithm takes O(d + log |V|) time. Exchanging edge weights and
gains between neighbors takes O(d) time while propagating comparisons up the matching
tree takes O(log |V|) time. Since probabilistically all of the positive gain vertices are
moved in each iteration, the algorithm should complete after O(d) iterations. Thus, the
time complexity of the algorithm on a computer with a processor for each vertex of the
graph is estimated to be O(d^2 + d log |V|) or, if we assume d is constant, O(log |V|), a
speedup of |V| / log |V| over the linear-time sequential algorithm of Fiduccia and Mattheyses
[38].
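The estimate can be written out step by step. The derivation below only restates the quantities given in the text (O(d) iterations, O(d + log |V|) time per iteration, and a linear-time sequential baseline); the labels T_iter and T_total are introduced here for convenience.

```latex
\begin{align*}
T_{\mathrm{iter}}  &= O(d) + O(\log |V|) = O(d + \log |V|), \\
T_{\mathrm{total}} &= O(d) \cdot T_{\mathrm{iter}} = O(d^{2} + d \log |V|), \\
T_{\mathrm{total}} &= O(\log |V|) \quad \text{for constant } d, \\
\text{speedup}     &= \frac{O(|V|)}{O(\log |V|)} = O\!\left(\frac{|V|}{\log |V|}\right).
\end{align*}
```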
4.4.7 Experimental Results
The speedup of the concurrent graph partitioning algorithm compared to a sequential
algorithm similar to Fiduccia and Mattheyses is shown in Figure 4.35. The tests were
run on random graphs with average degree 4 and uniformly distributed edge weights. In
each test the number of processors was equal to the number of vertices.
The speedup is quite disappointing for small graphs but increases significantly for large 
graphs. This behavior is due to the fact that the time required to perform an iteration 
increases very slowly with the graph size, while the number of vertices moved each 
iteration grows almost linearly with graph size. 
The data in Figure 4.35 suggest that the efficiency of the algorithm could be improved
by using fewer processors than vertices and performing balancing for all vertices on a 
single processor locally. This would reduce the height of the balancing tree and thus 
reduce the time required for each iteration of the algorithm. 
For each data point shown in Figure 4.35 the concurrent and sequential algorithms produced
partitions of similar weight. A partition of the same graphs performed using
simulated annealing consistently produced partitions with 20% lower weight. While the
gradient-following algorithms, both sequential and concurrent, get stuck in a local minimum,
the simulated annealing program is able to find a point near the global minimum.
The techniques developed in this section, using gain inhibition to prevent thrashing
and using a matching tree to enforce balance constraints, are completely applicable to a
partitioning program that uses simulated annealing. In a concurrent simulated annealing
program each vertex would compute its inhibited gain, the difference between its gain
and the largest of its neighbors' gains. Vertices then move with a probability that is a
function of their inhibited gain and the current annealing temperature. The matching
tree is used to keep track of balance and to broadcast the current imbalance I(A, B) to
each vertex so that balance information can be incorporated in the gain function.
Figure 4.35: Speedup of Concurrent Graph Partitioning Algorithm vs. Graph Size
4.5 Summary 
In this chapter I have developed concurrent algorithms for three graph problems.
In Section 4.2 I developed a new algorithm for the single point shortest path problem.
Chandy and Misra's shortest path algorithm [15], because it is under-synchronized, has
an exponential worst case time complexity. By adding synchronization to this algorithm
I developed the SSP algorithm which has polynomial worst case time complexity. Experimental
comparison of these algorithms verified that the SSP algorithm outperforms
Chandy and Misra's algorithm for large graphs. Further experiments showed that additional
concurrency can be attained by running several problems simultaneously. Running
multiple problems is particularly advantageous for the SSP algorithm where the multiple
problem instances can share the considerable synchronization overhead.
Two new algorithms for solving the max-flow problem were developed in Section 4.3.
Both of the algorithms operate by repeatedly layering the graph and constructing a
maximal layered flow. The CAD (concurrent augmenting digraph) algorithm constructs
a layered flow by simultaneously finding all possible augmenting paths. These paths
compete with one another for shared edge capacity through a three-step reservation
process. The CVF (concurrent vertex flow) algorithm is similar to an existing concurrent
max-flow algorithm [116], [85], but introduces new methods for synchronization and
completion detection. Experimental results show that both of these new algorithms
achieve significant speedups.
Finally, in Section 4.4 I developed a concurrent algorithm for graph partitioning. Concurrent
graph partitioning is difficult for two reasons. First, moving several vertices
between partitions simultaneously can result in thrashing: two vertices in opposite sets
that are attracted to each other may indefinitely swap sets. Second, multiple simultaneous
moves may result in a loss of balance: all vertices could simultaneously jump into
the same set. The new algorithm solves the thrashing problem by using gain to inhibit
simultaneous moves that might interfere with one another. The balancing problem is
solved by embedding a tree into the graph. The tree matches moves in one direction
with moves in the other direction to assure that the moves made during one iteration of
the algorithm will not unbalance the partition.
The algorithms developed in this chapter have a great deal in common: 
• They are synchronized by passing messages. 
• Messages are short, containing between zero and three arguments. 
• Methods are short; most are under 10 lines. 
In Chapter 5 I will investigate how to build hardware to efficiently execute programs 
having these characteristics. 
Chapter 5 
Architecture 
The objective of computer architecture is to organize a computer system to apply available
technology to the solution of specific problems. At the Processor-Memory-Switch
(PMS) level [117], architecture involves the organization of processing elements and communication
channels into a computer system. At the Register Transfer (RT) level [60],
architecture involves organizing registers, arithmetic units, finite state machines, and
transmission lines into the processing elements and communication channels that form
the building blocks at the PMS level. This chapter addresses both the RT and PMS
levels of architecture.
Computer architecture cannot ignore the physical organization of the machine. VLSI
computing systems are wire-limited; the complexity of what can be constructed is limited
by wire density, the speed at which a machine can run is limited by wire delay, and the
majority of power consumed by a machine is used to drive wires. Thus, machines must
be organized both logically and physically to keep wires short by exploiting locality 
wherever possible. The VLSI architect must organize a computing system so that its 
form (physical organization) fits its function (logical organization). 
I start this chapter with an intended application - the model of computation developed 
in Chapter 2 and the algorithms developed in Chapters 3 and 4 - and a technology -
VLSI. From this starting point I develop a new architecture that takes advantage of the 
cost performance characteristics of VLSI technology and includes many features designed 
to enhance the performance of concurrent data structures. 
In Section 5.1 I analyze the algorithms developed in Chapters 3 and 4. These algorithms 
are characterized by short messages, short methods, and a limited number of pending 
messages. In Section 5.3.1 I use the characteristics of these concurrent algorithms to
analyze the performance of interconnection networks. 
In Section 5.2 I look at VLSI technology. VLSI technology is wire-limited both by the 
maximum wire density that the technology can support and, since driving capacitive 
wires dissipates power, by the maximum power density that can be tolerated. Propagation
delays in VLSI systems are also wire-limited. The delay of very short wires scales
logarithmically with wire length until a critical length is reached¹. Beyond this critical
length, wire delay is bounded by the speed of light and grows linearly with wire length.
In Section 5.3.1 I use these characteristics of the technology to derive some surprising
results on network topology.
The development of an architecture that applies VLSI technology to support concurrent
data structures is approached in two steps. 
• First I consider the interconnection network over which processing elements (PEs) 
communicate. Based on measurements of programs and characteristics of the technology,
in Section 5.3.1 I show that a 2-dimensional torus or grid network topology
is preferable to a higher dimensional network. Experimental results back up this 
surprising result. 
In addition to a topology a network requires a routing algorithm. In Section 5.3.2 I
go on to develop a new method for constructing deadlock-free routing algorithms in
concurrent computer interconnection networks and apply this method to the two-dimensional
torus network. The design of a self-timed VLSI chip that implements
this algorithm is discussed in Section 5.3.3. 
• To take advantage of a low latency communications network, the PEs must be 
designed to operate efficiently in the message-passing environment. In Section 5.4 
mechanisms are developed to implement the model of computation described in 
Chapter 2 in hardware. The arrival of a message at a node results in the PE's 
performing the required action with a minimum of delay. Also, the sending of a 
message is made indistinguishable from a method call. 
To take advantage of VLSI technology, we must both exploit locality and build 
hardware that is specialized to particular applications. In Section 5.5 I introduce 
the concept of an object expert (OE) to achieve both of these goals. OEs exploit 
locality by storing objects of a particular class near the logic that operates on that
class. 
5.1 Characteristics of Concurrent Algorithms 
In Chapters 3 and 4, 42 CST methods were written. Here we examine these methods to 
find the average message length, the average method length, and the average number of 
pending messages per object. 
¹ This critical length is about 30mm for a typical 1.25µ CMOS technology.
Figure 5.1: Distribution of Message and Method Lengths (left panel: Message Length; right panel: Method Length)
Message Length 
Message Length (fields)    3    4    5    6
Number of Messages         3   10   19   10
Every method has at least three fields: receiver, selector, and an implicit reply-to field 
(either the sender or the requester). Thus, the minimum message length is 3 fields; 
any message arguments add to this minimum length. The table above gives the static²
frequency of message lengths for the 42 methods examined. These data are also shown
in the left half of Figure 5.1. The average message length, L, is 4.9 fields. If we assume
a 32-bit field size, L ≈ 160 bits.
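The averages quoted above can be checked directly from the table. The following is a minimal sketch in Python, assuming the counts as read from the table (3, 10, 19, and 10 messages of 3, 4, 5, and 6 fields) and the 32-bit field size assumed in the text; the variable names are illustrative only.

```python
# Static frequency of message lengths (fields -> number of messages),
# as read from the table above.
message_counts = {3: 3, 4: 10, 5: 19, 6: 10}

FIELD_BITS = 32  # field size assumed in the text

total_messages = sum(message_counts.values())
average_fields = sum(n * c for n, c in message_counts.items()) / total_messages
average_bits = average_fields * FIELD_BITS

print(f"{total_messages} messages, average {average_fields:.1f} fields, "
      f"about {average_bits:.0f} bits")
```

The unrounded average is 204/42 fields, which rounds to the 4.9 fields quoted in the text; the 160-bit figure is the same quantity rounded more coarsely.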
Method Length 
Method Length (lines)   1   2   3   4   5   6   7   8   9  10  11  12  13  14  23
Number of Methods       4   7   6   6   3   1   3   4   0   1   2   1   2   1   1
CST methods tend to be quite short. The lengths of the 42 methods presented in Chapters
3 and 4 are tabulated above and shown in the right side of Figure 5.1. The average
method length is 5.7 lines. While counting static method length does not account for time
taken in loops, this inaccuracy is partially offset by the fact that many of the methods
considered involve multiple actions. Because methods are short, each message received
results in only a small amount of computation. Thus, the latency of message transmission
must be kept very small, or excessive time will be spent transmitting messages
between processing nodes and little time will be spent computing at each node.
² Static frequency is a measure of how often an event occurs in the program text. Dynamic frequency,
on the other hand, is a measure of how often an event occurs during execution of the program.
Pending Messages
A CST object usually has only a small number of messages pending at any instant in
time. An object typically transmits a number of messages (usually < 3) and then waits
for replies from these messages before transmitting additional messages. Thus, the total
number of messages in the network at any given time is a small multiple of the number
of objects.
The characteristics of concurrent programs described in this section guide the development
of a concurrent computer architecture in the remainder of this chapter. The
message length is an important factor in deciding on the topology of the network, as
described in Section 5.3.1. The short method length means that network latency is a
critical parameter. Since the computation initiated by the arrival of a message takes only
a short period of time, message delivery must be made fast, or all processing elements 
will become idle waiting for messages. Also, processing elements must be able to handle 
messages quickly, since the time (Tnode) required to send a message and to initiate an 
action upon receipt of a message contributes to the total message latency. Finally, since 
each object typically has only a few messages pending at once, the required network 
throughput can be calculated as a function of the number of objects managed by each
processing element. 
Before we begin developing our concurrent architecture, we must first examine the avail-
able technology. 
5.2 Technology 
5.2.1 Wiring Density 
VLSI systems (VLSI chips packaged together on modules and boards) are limited by 
wire density, not by terminal or logic density. Current packaging technology allows us to 
make more connections from VLSI chips to modules and boards than can be routed away 
from the chips. Since VLSI systems are wire-limited, the techniques of VLSI complexity 
theory [127] used to calculate bounds on the performance of VLSI chips are applicable
to systems as well. In Section 5.3.1 I use this wire-cost model of VLSI systems to derive
some results on concurrent computer interconnection networks. 
VLSI complexity theorists, by considering the wire-limited nature of VLSI chips, have
been able to prove lower bounds on the area times time squared ($AT^2$) required to
perform a computation [87], [127]. The bound is calculated by finding the minimum
bisection width of all possible communication graphs for the computation. Thompson
shows that the area, $A$, of a VLSI chip is proportional to the square of the bisection
width, while the time required for the computation, $T$, is inversely proportional to the
bisection width. Thus, the quantity $AT^2$ is a bound independent of bisection width.
By considering the wire density, not the logic density, as the limiting factor of the
technology, VLSI complexity theory has been able to compute new bounds on the
complexity of sorting [129], computing Fourier transforms [128], and numerous other
transitive computations.
Figure 5.2: Packaging Levels
Modern high performance computers are packaged in three primary levels as shown in
Figure 5.2 [115]. 
Chip: Circuit components and local interconnections are fabricated on a monolithic 
silicon die. 
Module: Silicon dice are bonded to a (usually ceramic) module which provides intercon-
nections between the chips and from the chips to board pins. Connections from 
chip to module can be made either by wire bonds or by solder bumps. With wire 
bonding the chip is placed face-up on the module, and bonds are made by running 
wire from pads on the periphery of the chip to corresponding pads on the module. 
Connections are limited to one or two rows of pads about the periphery of the 
chip. Typical pad dimensions are 100µ on 200µ centers. Solder bump connections 
are made by depositing solder bumps on the face of the chip and then placing 
the chip face down on the module and reflowing the solder. Solder bumps can be 
distributed over the face of the chip on 250µ centers [12].
Board: A number of modules are assembled on a printed circuit board (PCB) that
provides interconnection between modules. Modules are connected to a PCB
either by pins brazed to the back of the module that fit into holes drilled through
the PCB or by surface mounting the module to the PCB in a manner similar to
solder bumping chips to a module. Boards are connected using either cables or a
backplane. 
There are often two secondary levels of packaging as well. Boards are packaged together
in chassis, and chassis are assembled into racks. 
                       Level of Packaging
Dimension          Chip     Module    Board    Units
Wire Width         1.25     100       200      µ
Via Diameter       1.25     100       500      µ
Wire-Hole Pitch    3.75     200       750      µ
Signal Layers      2        > 10      > 10
Linear Size        10       100       600      mm
The table above compares the characteristics of these three levels of packaging. The 
PCB data are derived from design rules for a circuit board with 8 mil wire width, 8
mil spacing and 20 mil minimum hole diameter. The module characteristics are derived
from available data on IBM's thermal conduction module (TCM) [12] and a comparable
ceramic technology available from Kyocera [77]. The design rules for several 1.25µ CMOS 
processes were consulted to construct the chip column of the table. 
These numbers are for technologies that are in production today (1986). Integrated 
circuit design rules are halved every 4 to 6 years [91], so that by 1990 it is reasonable to 
expect chips to have 0.5µ wide wires. Module and PCB technology also scale with time 
but at a slower rate, so that the density gap between chips (≈ 250 wires/mm) and modules
(5 wires/mm) will continue to widen. Wafer-scale integration [97] attempts to close this gap by
increasing chip size to the module level. 
Most of the complexity of a VLSI system is at the chip level. Dividing linear size by
wire pitch, modern chips contain ≈ 2500 wiring tracks, compared to 500 (100mm/200µ)
for modules and 800 (600mm/750µ) for PCBs. While modules and PCBs can have more
layers than chips, the use of additional layers is limited by the fact that in most PCB
technologies, every via penetrates through the entire thickness of the board. Because
chips are 50 times as dense and significantly more complex than modules, the amount of
information that can be transferred from chip to module is a bottleneck that limits the
performance achievable by a VLSI system.
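The density and track counts above follow from the wire-hole pitch and linear size rows of the packaging table. A minimal sketch of that arithmetic, assuming one wiring track per pitch on a single layer (the function names and the wires/mm unit are illustrative assumptions):

```python
# Wire-hole pitch (microns) and linear size (mm) per packaging level,
# taken from the packaging table above.
levels = {
    "chip":   {"pitch_um": 3.75,  "size_mm": 10.0},
    "module": {"pitch_um": 200.0, "size_mm": 100.0},
    "board":  {"pitch_um": 750.0, "size_mm": 600.0},
}

def wires_per_mm(pitch_um):
    """Wiring density on one layer, assuming one track per pitch."""
    return 1000.0 / pitch_um

def wiring_tracks(pitch_um, size_mm):
    """Tracks that fit across one layer of the given linear size."""
    return size_mm * wires_per_mm(pitch_um)

for name, lv in levels.items():
    density = wires_per_mm(lv["pitch_um"])
    tracks = wiring_tracks(lv["pitch_um"], lv["size_mm"])
    print(f"{name:6s} {density:7.2f} wires/mm {tracks:7.0f} tracks")
```

With the exact 3.75µ pitch the chip figures come out slightly above the rounded values of 250 wires/mm and 2500 tracks used in the text; the module and board figures match exactly.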
The number of connections from a chip to a module is limited by the wiring density of 
the module, not, as many believe, by the number of terminals that can be placed on a 
chip. Consider a 10mm chip with bond pads on 250µ centers (the spacing of TCM bond
pads [12]). The chip could make over 1600 connections if it were completely covered
with pads. There would be no point, however, in having this number of connections. A
10mm slice of the ceramic substrate is capable of handling only ≈ 25 wires per layer³.
Even if 10 wiring layers were used, only 250 wires could be routed away from the chip⁴.
At the PCB level, assuming 10 layers and two wires between pins on a 2.5mm grid, only
80 wires can be routed out from under the chip. Even wire-bonding can achieve terminal
densities that can saturate module technology. Two rows of pads on 200µ centers about
the periphery of the chip would be sufficient to make 400 connections.
Wires, not terminals or logic, are the limiting factor in high-performance VLSI systems.
³ Assume alternate wiring channels are used by vias to lower layers.
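The pad and wire counts in the paragraph above can be reproduced with a few lines of arithmetic. This is a sketch under the assumptions stated in the text (a 10mm chip, 200µ module wire pitch with alternate channels given to vias, 10 layers, and all wires leaving across one chip-width slice); the helper names are hypothetical.

```python
def area_pads(chip_mm, pad_pitch_um):
    """Pads that fit if the whole chip face is covered on a square grid."""
    per_side = int(chip_mm * 1000 // pad_pitch_um)
    return per_side * per_side

def module_escape_wires(chip_mm, wire_pitch_um, layers):
    """Wires crossing one chip-width slice of the module; alternate
    channels are assumed to be used by vias to lower layers."""
    per_layer = int(chip_mm * 1000 // wire_pitch_um) // 2
    return per_layer * layers

def pcb_escape_wires(chip_mm, pin_grid_mm, wires_between_pins, layers):
    """Wires routed out from under the chip between pins on a grid."""
    gaps = int(chip_mm // pin_grid_mm)
    return gaps * wires_between_pins * layers

def periphery_pads(chip_mm, pad_pitch_um, rows):
    """Wire-bond pads in `rows` rows around all four chip edges."""
    per_edge = int(chip_mm * 1000 // pad_pitch_um)
    return 4 * per_edge * rows

print(area_pads(10, 250))                 # pads covering the chip face
print(module_escape_wires(10, 200, 10))   # module wires, 10 layers
print(pcb_escape_wires(10, 2.5, 2, 10))   # PCB wires, 10 layers
print(periphery_pads(10, 200, 2))         # two rows of wire-bond pads
```

Under these assumptions even two rows of wire-bond pads supply more terminals than the module can route away, which is the point the text makes.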
5.2.2 Switching Dynamics 
The intrinsic delay of an MOS device is the transit time, $\tau$, the time required for a charge
to cross the channel [86].
$$\tau = \frac{L}{\mu E}, \tag{5.1}$$
where $\mu$ is the carrier mobility, $L$ is the channel length, and $E$ is the electric field. Since
$E = \frac{V}{L}$, (5.1) can be rewritten as
$$\tau = \frac{L^2}{\mu V}. \tag{5.2}$$
$\tau$ also represents the time required for a device to transfer the amount of charge on the
gate, $Q_g$, from the drain to the source, so $I_{ds} = \frac{Q_g}{\tau}$ [50].
A more useful time measure is the delay of an inverter driving another inverter of the
same size [111].
$$\tau_{inv} = \frac{C_{inv}}{C_g}\,\tau \tag{5.3}$$
$C_{inv}$ is the input capacitance of the inverter, and $C_g$ is the gate capacitance of the
inverter's n-channel transistor. For a CMOS inverter with the p-channel device twice
the size of the n-channel device, $\tau_{inv} = 3\tau$. In a typical 1.25µ CMOS technology with a
2.5V supply voltage⁵, $\tau = 25$ps and $\tau_{inv} = 75$ps.
An inverter driving a load with capacitance $C_L$ has delay,
$$t = \frac{C_L}{C_{inv}}\,\tau_{inv}. \tag{5.4}$$
⁴ To convert pin density to wire density, this calculation assumes that all the pins are routed to one edge
of the chip.
⁵ As geometries get smaller, carrier velocity saturation limits device current, so that increasing the applied
voltage does not reduce $\tau$ linearly. For (5.1) to hold, voltages must be scaled to keep $E < 2$ V/µ.
To drive large capacitances the delay can be minimized by using an exponential horn, a
chain of inverters with each stage $e$ times the size of the preceding stage [86]. Using this
technique, the minimum delay to drive a load from a minimum size inverter is
$$t_{min} = \tau_{inv}\, e \log_e \frac{C_L}{C_{inv}}. \tag{5.5}$$
For short wires, wire delay depends logarithmically on wire length, $l_w$,
$$t_{shortwire} = \tau_{inv}\, e \log_e K l_w, \tag{5.6}$$
where $K = \frac{C_w}{C_{inv}}$ and $C_w$ is the capacitance per unit length of wire. Typically, $K$ is in
the range $0.1 < K < 0.2$. Long wires, on the other hand, act as transmission lines and
are limited by the speed of light. Let $l_c$ be the critical length at which speed of light
limits transmission time. The delay of an optimally sized exponential horn driving a
transmission line is the logarithmic delay of the first $\log_e K l_c - 1$ stages of the driver plus
the linear delay of the wire,
$$t_{longwire} = \tau_{inv}\, e\, (\log_e K l_c - 1) + \frac{l_w \sqrt{\epsilon_r}}{c}, \tag{5.7}$$
or asymptotically,
$$t_{longwire} > \frac{l_w \sqrt{\epsilon_r}}{c}. \tag{5.8}$$
The crossover from a capacitive (short) wire to a transmission line (long) wire occurs
when the delay of the last driver stage equals the time of flight along the wire,
$\tau_{inv}\, e = \frac{l_c \sqrt{\epsilon_r}}{c}$. This equation can be rewritten as
$$l_c = \frac{\tau_{inv}\, e\, c}{\sqrt{\epsilon_r}}. \tag{5.9}$$
With $\tau_{inv} = 75$ps and $\epsilon_r \approx 4$, the crossover from a capacitive (short) wire to a transmission
line (long) wire occurs at $l_w \approx 30$mm. Thus for today's technology (1.25µ), even
relatively short wires are speed-of-light limited. In a 0.5µ technology $\tau_{inv} = 30$ps, and
the crossover is at $l_w \approx 10$mm, about the length of a chip.
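The crossover lengths can be estimated by equating the delay of the last driver stage, the inverter delay times e, with the time of flight along the wire. The sketch below assumes the vacuum speed of light and the relative permittivity of 4 used in the text; the function name is illustrative.

```python
import math

C_LIGHT = 2.998e8  # speed of light in vacuum, m/s

def critical_length_mm(tau_inv_s, eps_r):
    """Wire length at which one driver-stage delay (tau_inv * e)
    equals the time of flight along the wire, in millimetres."""
    return tau_inv_s * math.e * C_LIGHT / math.sqrt(eps_r) * 1000.0

for tau_ps in (75, 30):
    length = critical_length_mm(tau_ps * 1e-12, 4.0)
    print(f"tau_inv = {tau_ps} ps -> critical length of about {length:.1f} mm")
```

For an inverter delay of 75ps this gives a length close to the 30mm quoted above; for 30ps it gives a value slightly above 10mm, consistent with the text's order-of-magnitude estimate.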
These speed-of-light wires are off-chip wires. As shown in Appendix C, the high resistivity
of on-chip wires limits on-chip signal velocity to ≈ 8 × 10^6 m/sec.
The delay of global wires in VLSI systems is due to speed-of-light delay in the wire (not
the RC delay of the driver) and thus increases linearly with wire length. For short wires,
$l_w < \frac{\tau_{inv}\, e\, c}{\sqrt{\epsilon_r}}$, delay is due to the RC delay of the driver and thus grows logarithmically
with wire length. In Section 5.3.1 I consider both linear and logarithmic delay models.
5.2.3 Energetics 
The energy dissipated by a switching event in a VLSI system, $E_{sw}$, is almost entirely
used to charge the capacitance of the circuit node being switched.
(5.10) 
When $C$ is the gate capacitance of a minimum-sized inverter, $C_{inv}$, $E_{sw}$ is the switching
energy of the technology, a figure of merit commonly used to compare logic technologies.
Since $V$ and $C_{inv}$ both scale linearly with linear dimensions, $\lambda$, the switching energy of
MOS logic scales as the cube of the linear dimensions, $E_{sw} \propto \lambda^3$.
In most VLSI systems the wiring capacitance dominates device gate capacitance, and 
most of the switching energy is used to drive wires. The power required to drive these 
wires must be supplied to each logic circuit by a power distribution system. This power,
in the form of heat, must also be removed by a cooling system. The power density 
that the power supply and cooling systems can handle limits the performance of VLSI 
systems. With very advanced cooling technology [12], power densities of 30 W/cm² have been
achieved. 
If CA is the capacitance per unit area and Tcy is the cycle time of the system, power 
density, PA, is given by 
(5.11) 
Power density remains constant since, as voltage scales down, delay also scales down and
capacitance per unit area scales up (all linearly with $\lambda$).
Consider a typical 1.25µ technology. Let us make the following assumptions: 
• CA is the capacitance of one metal layer, CA= 10-"r!;. 
• The cycle time is 100 inverter delays, $T_{cy} = 100\tau_{inv} = 7.5$ns.
• The supply voltage, V, is 2.5V. 
Figure 5.3: A Concurrent Computer 
Then the power density is $P_A \approx 40$ W/cm². Even with a very modest cycle time, the power
density of a VLSI chip exceeds the capability of state-of-the-art cooling technology. Thus, 
power density limits the wiring density of a VLSI system independent of the wire density 
of the interconnection technology. We cannot escape from the problem of wiring density 
by adding more wire layers. 
To reduce power density we must run our system more slowly. From (5.11) one would
expect that power density varies as the inverse of cycle time; however, using hot-clock⁶
logic [114], the power density can be made to scale as the inverse square of the cycle time,
$P_A \propto T_{cy}^{-2}$. This relation is a strong argument for concurrency. Concurrent computing is
energy efficient. We can run two computers at half speed with half the energy required
to run one computer at full speed.
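The closing claim can be illustrated numerically. The sketch below assumes identical machines, perfect concurrency (no communication overhead), and power scaling as a power of the cycle time; the function names and the `exponent` parameter are illustrative and not part of the original analysis.

```python
def relative_power(machines, slowdown, exponent=2):
    """Total power of `machines` identical computers, each with its cycle
    time stretched by `slowdown`, relative to one full-speed computer.
    exponent=2 models the inverse-square (hot-clock) relation,
    exponent=1 the conventional inverse relation."""
    return machines * slowdown ** (-exponent)

def relative_throughput(machines, slowdown):
    """Aggregate operations per unit time, assuming perfect concurrency."""
    return machines / slowdown

# Two computers at half speed versus one at full speed.
print(relative_throughput(2, 2), relative_power(2, 2), relative_power(2, 2, exponent=1))
```

With the inverse-square relation the pair delivers the same aggregate throughput at half the power; with the conventional inverse relation there is no saving.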
5.3 Concurrent Computer Interconnection Networks 
Figure 5.3 shows the organization of a concurrent computer. A number of processing
nodes (N) communicate by means of an interconnection network. From Section 5.1 we
know that the network must have a low latency to support fine-grain concurrent algorithms.
We also know, from Section 5.2, that since VLSI technology is wire-limited,
these networks are limited by the amount of wire required to construct them. In Section
5.3.1 I compare networks under the assumption of constant wire cost and show that
low-dimension networks (e.g., a torus) offer lower latency than can be achieved with a
high-dimensional interconnect (e.g., a binary n-cube). This surprising result strongly
motivates the use of low-dimension k-ary n-cubes for the interconnection networks of
concurrent computers.
⁶ Slow-clock logic is a better name for this technique since it is the speed of the clocks relative to the
circuit rather than their voltage level that results in an energy savings.
A deadlock-free routing algorithm for k-ary n-cube networks is required if these networks 
are to be useful. In Section 5.3.2 I develop a novel method for constructing deadlock-
free routing algorithms and apply this method to several networks including k-ary n-
cubes. To test these ideas, I have designed a VLSI chip that implements such a routing 
algorithm. The design and testing of this chip is described in Section 5.3.3. 
5.3.1 Network Topology 
Interconnection networks for concurrent computers have been studied intensely, and
many different network topologies have been proposed. Tree networks have been proposed
for use in concurrent computers [13]. However, it has been shown that most logical
communication graphs do not map well onto a tree network topology [120]. A crossbar
switch can be used to connect every node, $P_i$, to every other node, $P_j$. A crossbar has the
desirable characteristic of being non-blocking. In a non-blocking network, any connection
that describes a permutation of the processing nodes can be constructed without interference.
Unfortunately crossbars are impractical for large systems because their wiring
density grows as $N^2$. Benes [10] developed a non-blocking network for telephone systems
that requires only $O(N \log N)$ switching elements. The Benes network has the disadvantage,
however, that it requires a long time to configure for a particular permutation.
In a concurrent computer where the pattern of communications varies dynamically, this
long configuration time is unacceptable. Batcher's sorting network [7] is a more practical
non-blocking network. While it requires $O(N \log^2 N)$ switching elements and has
$O(\log^2 N)$ delay, it can be configured dynamically as messages are routed.
Most concurrent computers are constructed using blocking networks because the advantages
of a non-blocking network are not sufficient to offset the $O(\log N)$ increased cost
of a non-blocking network. The Omega network [80], a multiple stage shuffle-exchange
network [122], is an example of such a blocking network. The Omega network has
$O(N \log N)$ switching elements⁷ and a delay of $O(\log N)$⁸. The Omega network is isomorphic
to the indirect binary n-cube or flip network [8] [108]. The direct version of this
network is the binary n-cube [111], [96], [124]. The binary n-cube is a special case
of the family of k-ary n-cubes, cubes with n dimensions and k nodes in each dimension.
Since most of the interconnection networks used for concurrent computers are isomorphic
to binary n-cubes, a subset of k-ary n-cubes, in this section we restrict our attention to
k-ary n-cube networks. It is the dimension of the network that is important, not the
details of its topology. We refer to n as the dimension of the cube and k as the radix.
Dimension, radix, and number of nodes are related by the equation
$$N = k^n. \tag{5.12}$$
⁷ Recall from Section 5.2 that it is the wiring density that is important, not the number of switching
elements. I use the number of switching elements here for purposes of comparison only.
⁸ The Omega network has $O(\log N)$ delay under the assumption that wire delay is independent of wire
length. Again, I use this assumption here for purposes of comparison only. We have already seen that
this assumption is not consistent with the characteristics of VLSI technology.
Figure 5.4: A Binary 6-Cube Embedded in the Plane
We can construct k-ary n-cubes with (approximately) the same number of nodes but 
with different dimensions. Figures 5.4-5.6 show three k-ary n-cube networks in order 
of decreasing dimension. Figure 5.4 shows a binary 6-cube (64 nodes). A 3-ary 4-cube 
(81 nodes) is shown in Figure 5.5. An 8-ary 2-cube (64 nodes), or torus, is shown 
in Figure 5.6. Each line in Figure 5.4 represents two communication channels, one in
each direction, while each line in Figures 5.5 and 5.6 represents a single communication 
channel. 
Networks have traditionally been analyzed under the assumption of constant channel 
bandwidth. Under this assumption each channel is one bit wide (W = 1) and has unit 
delay (Tc = 1). Thus, the constant bandwidth assumption favors networks with high 
dimensionality (e.g., binary n-cubes). 
The constant bandwidth assumption, however, is not consistent with the properties of 
VLSI technology. Networks with many dimensions require more and longer wires than 
do low-dimensional networks. Thus, large dimensional networks cost more and run more 
slowly than low-dimensional networks. A realistic comparison of network topology must 
take both wire density and wire length into account. 
Figure 5.5: A Ternary 4-Cube Embedded in the Plane
Figure 5.6: An 8-ary 2-Cube (Torus)
In this section we compare the performance of k-ary n-cube interconnection networks
using the following assumptions:
• Networks must be embedded into the plane⁹.
• Nodes are placed systematically by embedding $\frac{n}{2}$ logical dimensions in each of the
two physical dimensions. We assume that both n and k are even integers. The
long end-around connections shown in Figure 5.6 can be avoided by folding the
network as shown in Figure 5.19 on page 155.
• For networks with the same number of nodes, wire density is held constant. Each 
network is constructed with the same bisection width, B, the total number of wires 
crossing the midpoint of the network. To keep the bisection width constant, we 
vary the width, W, of the communication channels. We normalize to the bisection 
width of a bit-serial (W = 1) binary n-cube. 
• The networks use wormhole routing, described in Section 5.3.2. 
• No more than a single bit is in transit on any wire at a given time.
• Channel delay, $T_c$, is a function of wire length, $L$. We begin by considering channel
delay to be constant. Later, the comparison is performed for both logarithmic and
linear wire delays; $T_c \propto \log L$ and $T_c \propto L$.
When k is even, the channels crossing the midpoint of the network are all in the highest
dimension. For each of the $\sqrt{N}$ rows of the network, there are $k^{(\frac{n}{2}-1)}$ of these channels
in each direction for a total of $2\sqrt{N}\,k^{(\frac{n}{2}-1)}$ channels. Thus, the bisection width, $B$, of a
k-ary n-cube with W-bit wide communication channels is
$$B(k,n) = 2W\sqrt{N}\,k^{(\frac{n}{2}-1)}. \tag{5.13}$$
For a binary n-cube, k = 2, the bisection width is B(2, n) = W N. We set B equal to N
to normalize to a binary n-cube with unit width channels, W = 1. The channel width,
W(k, n), of a k-ary n-cube with the same bisection width, B, follows from (5.13):
$$2W(k,n)\sqrt{N}\,k^{(\frac{n}{2}-1)} = N, \tag{5.14}$$
$$W(k,n) = \frac{\sqrt{N}}{2k^{(\frac{n}{2}-1)}} = \frac{\sqrt{N}\,k}{2k^{\frac{n}{2}}} = \frac{k\sqrt{N}}{2\sqrt{N}} = \frac{k}{2}. \tag{5.15}$$
The peak wire density is greater than the bisection width in networks with n > 2 because
the lower dimensions contribute to wire density. The maximum density, however, is
bounded by
$$\begin{aligned}
D_{max} &= 2W\sqrt{N}\sum_{i=0}^{\frac{n}{2}-1} k^i \\
&= k\sqrt{N}\sum_{i=0}^{\frac{n}{2}-1} k^i \\
&= k\sqrt{N}\left(\frac{k^{\frac{n}{2}}-1}{k-1}\right) \\
&= k\sqrt{N}\left(\frac{\sqrt{N}-1}{k-1}\right) \\
&< \left(\frac{k}{k-1}\right)B.
\end{aligned} \tag{5.16}$$
⁹ If a three-dimensional packaging technology becomes available, the comparison changes only slightly.
A plot of wire density as a function of position for one row of a binary 20-cube is shown 
in Figure 5.7. The density is very low at the edges of the cube and quite dense near the 
center. The peak density for the row is 1364 at position 341. Compare this density with 
the bisection width of the row, which is 1024. In contrast, a two-dimensional torus has 
a wire density of 1024 independent of position. One advantage of high-radix networks is 
that they have a very uniform wire density. They make full use of available area. 
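The density figures quoted for the binary 20-cube can be reproduced with a small model. The sketch below assumes that the 1024 nodes of one row are placed in order of their 10-bit row address, that dimension i links each node to the node whose address differs in bit i, and that each link carries two one-bit channels (one per direction); the function name and the indexing of positions by the gap after node `cut` are assumptions made for illustration.

```python
def row_wire_density(row_dims, cut):
    """Number of one-bit channels crossing the gap between node `cut` and
    node `cut + 1` in one row that embeds `row_dims` cube dimensions."""
    links = 0
    for i in range(row_dims):
        period = 1 << (i + 1)   # dimension i pairs nodes within blocks of 2**(i+1)
        half = period >> 1
        r = cut % period
        # Links run from the lower half of each block to the upper half.
        links += (r + 1) if r < half else (period - 1 - r)
    return 2 * links            # one channel in each direction per link

ROW_DIMS = 10                   # a binary 20-cube embeds 10 dimensions per row
densities = [row_wire_density(ROW_DIMS, p) for p in range(2 ** ROW_DIMS - 1)]
peak = max(densities)
print(peak, densities[341], densities[511], densities[0])
```

Under these assumptions the model gives a density of 1024 at the midpoint of the row, a peak of 1364 that is reached at position 341, and a very low density at the edges, in agreement with the values quoted above.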
Each processing node has 2n channels each of which is $\frac{k}{2}$ bits wide. Thus, the number
of pins per processing node is
$$N_p = nk. \tag{5.17}$$
A plot of pin density as a function of dimension for N = 256, 4K and 1M nodes¹⁰ is shown
in Figure 5.8. Low dimensional networks have the disadvantage of requiring many pins
per processing node. A two-dimensional network with 1M nodes (not shown) requires
2048 pins and is clearly unrealizable. However, the number of pins decreases very rapidly
as the dimension, n, increases. Even for 1M nodes, a dimension 4 node has only 128 pins.
Recall from Section 5.2.1, however, that the wire density of the board under the chips
becomes saturated before the maximum pin density of the chip is exceeded. Since all
of the 1M node configurations have the same bisection width, B = 1M, these machines
cannot be wired in a single plane. However, if the pin density is within reason, we may
still be able to construct these machines by escaping into three dimensions.
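The pin counts quoted above follow from the channel width of k/2 bits and the 2n channels per node. A minimal sketch of the relations (5.13), (5.15), and (5.17), with hypothetical helper names and allowing a non-integer radix:

```python
def radix(nodes, n):
    """Radix k of a k-ary n-cube with `nodes` nodes (may be non-integer)."""
    return nodes ** (1.0 / n)

def channel_width(k):
    """Channel width W(k, n) = k/2 from (5.15)."""
    return k / 2.0

def bisection_width(k, n, width):
    """B(k, n) = 2 W sqrt(N) k^(n/2 - 1) from (5.13), with N = k^n."""
    nodes = k ** n
    return 2.0 * width * nodes ** 0.5 * k ** (n / 2.0 - 1.0)

def pins_per_node(k, n):
    """2n channels of k/2 bits each, as in (5.17)."""
    return 2 * n * channel_width(k)

ONE_M = 2 ** 20
for n in (2, 4, 20):
    k = radix(ONE_M, n)
    print(n, round(k, 3), round(pins_per_node(k, n)),
          round(bisection_width(k, n, channel_width(k))))
```

For 1M nodes this gives 2048 pins at n = 2 and 128 pins at n = 4, and the bisection width stays at N for every dimension, as the normalization requires.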
Latency 
From Section 5.1 we know that the latency of the network is the critical performance
measure. Latency, $T_l$, is the sum of the latency due to the network and the latency due
to the processing node,
$$T_l = T_{net} + T_{node}. \tag{5.18}$$
In this section we are concerned only with $T_{net}$. We will consider $T_{node}$ in Section 5.4.
¹⁰ 1K = 1024 and 1M = 1K × 1K = 1048576.
Figure 5.7: Wire Density vs. Position for One Row of a Binary 20-Cube (vertical axis: row wire density; horizontal axis: position)
Figure 5.8: Pin Density vs. Dimension for 256, 4K, and 1M Nodes (vertical axis: pins per node; horizontal axis: dimension, n)
Network latency depends on the time required to drive the channel, $T_c$, the number of
channels a message must traverse, $D$, and the number of cycles required to transmit the
message across a single channel, $\frac{L}{W}$, where $L$ is message length.
$$T_{net} = T_c\left(D + \frac{L}{W}\right) \tag{5.19}$$
If we select two processing nodes, $P_i$, $P_j$, at random, the average number of channels
that must be traversed to send a message from $P_i$ to $P_j$ is given by the following three
equations for the torus, the binary n-cube and general k-ary n-cubes:
$$D_t = \sqrt{N} - 1, \tag{5.20}$$
$$D_b = \frac{n}{2}, \tag{5.21}$$
$$D(k,n) = \left(\frac{k-1}{2}\right) n. \tag{5.22}$$
The average latency of a k-ary n-cube is calculated by substituting (5.15) and (5.22)
into (5.19)
$$T_{net} = T_c\left(\left(\frac{k-1}{2}\right) n + \frac{2L}{k}\right). \tag{5.23}$$
Figure 5.9 shows the average network latency, $T_{net}$, as a function of dimension, n, for
k-ary n-cubes with $2^8$ (256), $2^{14}$ (4K), and $2^{20}$ (1M) nodes¹¹. The leftmost data point
in this figure corresponds to a torus (n = 2) and the rightmost data point corresponds
to a binary n-cube (k = 2). This figure assumes constant wire delay, $T_c$, and a message
length, $L$, of 150 bits. Although constant wire delay is unrealistic, this figure illustrates
that even ignoring the dependence of wire delay on wire length, low-dimensional networks
achieve lower latency than high-dimensional networks.
¹¹ For the sake of comparison we allow radix to take on non-integer values. For some of the dimensions
considered, there is no integer radix, k, that gives the correct number of nodes. In fact, this limitation
can be overcome by constructing a mixed-radix cube.
Figure 5.9: Latency vs. Dimension for 256, 4K, and 1M Nodes, Constant Delay (vertical axis: latency; horizontal axis: dimension, n)
In general the lowest latency is achieved when the component of latency due to distance,
$D$, and the component due to message length, $\frac{L}{W}$, are approximately equal, $D \approx \frac{L}{W}$.
For the three cases shown in Figure 5.9, minimum latencies are achieved for n = 2, 4,
and 5 respectively. The following table shows the values for $D$ and $\frac{L}{W}$ for these three
configurations.
N       n    k     W    D     L/W
256     2    16    8    15    19
4K      4    8     4    14    38
1M      5    16    8    38    19
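The entries in the table above can be recomputed from the average distance (5.22) and the channel width W = k/2, using the 150-bit message length assumed for Figure 5.9. This is an illustrative sketch; the function names are hypothetical and latency is expressed in units of the channel delay.

```python
def avg_distance(k, n):
    """D(k, n) = ((k - 1) / 2) * n, from (5.22)."""
    return (k - 1) / 2.0 * n

def constant_delay_latency(k, n, msg_bits):
    """T_net / T_c = D + L/W with W = k/2, from (5.23)."""
    width = k / 2.0
    return avg_distance(k, n) + msg_bits / width

L_BITS = 150  # message length assumed for Figure 5.9
for label, n, k in (("256", 2, 16), ("4K", 4, 8), ("1M", 5, 16)):
    d = avg_distance(k, n)
    cycles = L_BITS / (k / 2.0)
    print(f"N={label:>3} n={n} k={k:2d} D={d:5.1f} L/W={cycles:5.2f} "
          f"T_net/T_c={constant_delay_latency(k, n, L_BITS):6.2f}")
```

The computed values of D and L/W agree with the table once rounded to whole numbers.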
The length of the longest wire in the system, l_w, becomes a bottleneck that determines
the rate at which each channel operates, T_c. The length of this wire is given by
    l_w(k, n) = k^{n/2 - 1}.    (5.24)
If the wires are sufficiently short, delay depends logarithmically on wire length. If the
channels are longer, they become limited by the speed of light, and delay depends linearly
on channel length. Substituting (5.24) into (5.6) and (5.8) gives
    T_c \propto \begin{cases}
        1 + \log_e l_w = 1 + \left(\frac{n}{2} - 1\right) \log_e k & \text{logarithmic delay} \\
        l_w = k^{n/2 - 1} & \text{linear delay.}
    \end{cases}    (5.25)
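We substitute (5.25) into (5.23) to get the network latency for these two cases:
    T_{net} \propto \begin{cases}
        \left[1 + \left(\frac{n}{2} - 1\right) \log_e k\right] \left[\left(\frac{k-1}{2}\right) n + \frac{2L}{k}\right] & \text{logarithmic delay} \\
        k^{n/2 - 1} \left[\left(\frac{k-1}{2}\right) n + \frac{2L}{k}\right] & \text{linear delay.}
    \end{cases}    (5.26)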
Figure 5.10 shows the average network latency as a function of dimension for k-ary
n-cubes with 2^8 (256), 2^14 (4K), and 2^20 (1M) nodes, assuming logarithmic wire delay
and a message length, L, of 150. Figure 5.11 shows the same data assuming linear wire
delays. In both figures, the leftmost data point corresponds to a torus (n = 2) and the
rightmost data point corresponds to a binary n-cube (k = 2).
In the linear delay case, Figure 5.11, a torus (n = 2) always gives the lowest latency. This 
is because a torus offers the highest bandwidth channels and the most direct physical 
route between two processing nodes. Under the linear delay assumption, latency is 
determined solely by bandwidth and by the physical distance traversed. There is no 
advantage in having long channels. 
Under the logarithmic delay assumption, Figure 5.10, a torus has the lowest latency for
small networks (N = 256). For the larger networks, the lowest latency is achieved with
slightly higher dimensions. With N = 4K, the lowest latency occurs when n is three¹².
With N = 1M, the lowest latency is achieved when n is 5. It is interesting that assuming
constant wire delay does not change this result much. Recall that under the (unrealistic)
constant wire delay assumption, Figure 5.9, the minimum latencies are achieved with
dimensions of 2, 4, and 5 respectively.
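The minima quoted above can be checked numerically. The sketch below is an illustration under stated assumptions, not the author's program: it evaluates (5.23) with the three wire-delay factors of (5.25), sets the constant of proportionality to 1, allows non-integer radix, uses L = 150, and takes the node counts from the exponents 8, 14, and 20 given in the text. The names `latency` and `best_dimension` are hypothetical.

```python
import math


def latency(num_nodes, n, model, msg_bits=150):
    """Relative network latency of a k-ary n-cube under one wire-delay model."""
    k = num_nodes ** (1.0 / n)                        # radix, possibly non-integer
    base = (k - 1.0) / 2.0 * n + 2.0 * msg_bits / k   # D + L/W with W = k/2
    if model == "constant":
        t_c = 1.0
    elif model == "logarithmic":
        t_c = 1.0 + (n / 2.0 - 1.0) * math.log(k)     # 1 + log_e of wire length
    elif model == "linear":
        t_c = k ** (n / 2.0 - 1.0)                    # wire length itself
    else:
        raise ValueError(f"unknown delay model: {model}")
    return t_c * base


def best_dimension(log2_nodes, model):
    """Lowest-latency dimension between the torus (n = 2) and the binary n-cube."""
    dims = range(2, log2_nodes + 1)
    return min(dims, key=lambda n: latency(2 ** log2_nodes, n, model))


for model in ("constant", "logarithmic", "linear"):
    print(model, [best_dimension(e, model) for e in (8, 14, 20)])
```

Under these assumptions the sweep gives dimensions 2, 4, 5 for constant delay, 2, 3, 5 for logarithmic delay, and 2 in every case for linear delay.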
The results shown in Figures 5.9 through 5.11 were derived by comparing networks under
the assumption of constant wire cost to a binary n-cube with W = 1. For small networks
it is possible to construct binary n-cubes with wider channels, and for large networks
(e.g., 1M nodes) it may not be possible to construct a binary n-cube at all. In the case
of small networks, the comparison against binary n-cubes with wide channels can be
performed by expressing message length in terms of the binary n-cube's channel width,
in effect decreasing the message length for purposes of comparison. The net result is the
same: lower-dimensional networks give lower latency. Even if we perform the 256 node
comparison against a binary n-cube with W = 16, the torus gives the lowest latency
under the logarithmic delay model, and a dimension 3 network gives minimum latency
under the constant delay model. For large networks, the available wire is less than
assumed, so the effective message length should be increased, making low-dimensional
networks look even more favorable.
¹² In an actual machine the dimension n would be restricted to be an even integer.
In this comparison we have assumed that only a single bit of information is in transit
on each wire of the network at a given time. Under this assumption, the delay between
nodes, T_c, is equal to the period of each node, T_p. In a network with long wires, however,
it is possible to have several bits in transit at once. In this case, the channel delay, T_c, is
a function of wire length, while the channel period, T_p < T_c, remains constant. Similarly,
in a network with very short wires we may allow a bit to ripple through several channels
before sending the next bit. In this case, T_p > T_c. Separating the coefficients, T_c and
T_p, (5.19) becomes
    T_{net} = T_c D + T_p \frac{L}{W}.    (5.27)
The net effect of allowing T_c ≠ T_p is the same as changing the length, L, by a factor of
T_p/T_c and does not change our results significantly.
When wire cost is considered, low-dimensional networks (e.g., tori) offer lower latency 
than high-dimensional networks (e.g., binary n-cubes). Intuitively, tori outperform bi-
nary n-cubes because they better match form to function. The logical and physical 
graphs of the torus are identical; thus, messages always travel the minimum distance
from source to destination. In a binary n-cube, on the other hand, the fit between form 
and function is not as good. A message in a binary n-cube embedded into the plane may 
have to traverse considerably more than the minimum distance between its source and 
destination. 
Throughput 
Throughput, another important metric of network performance, is defined as the total 
number of messages the network can handle per unit time. One method of estimating 
throughput is to calculate the capacity of a network, the total number of messages that 
can be in the network at once. Typically the maximum throughput of a network is some 
fraction of its capacity. The network capacity per node is the total bandwidth out of each 
node divided by the average number of channels traversed by each message. For k-ary 
n-cubes, the bandwidth out of each node is nW, and the average number of channels 
traversed is given by (5.22), so the network capacity per node is given by 
    \Gamma(k, n) \propto \frac{n W(k, n)}{D(k, n)}.    (5.28)
Figure 5.10: Latency vs. Dimension for 256, 4K, and 1M Nodes, Logarithmic Delay
Figure 5.11: Latency vs. Dimension for 256, 4K, and 1M Nodes, Linear Delay
The network capacity is independent of dimension. For a constant amount of wire, there
is a constant network bandwidth.
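As a quick check of this claim, assume the channel width W(k, n) = k/2 that the latency table implies and take D(k, n) from (5.22). The dimension n then cancels:

```latex
\Gamma(k,n) \;\propto\; \frac{n\,W(k,n)}{D(k,n)}
  \;=\; \frac{n\,(k/2)}{\left(\frac{k-1}{2}\right) n}
  \;=\; \frac{k}{k-1} \;\approx\; 1 .
```

Under that assumption the ratio depends only on the radix and lies between 1 and 2 for every k ≥ 2.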
To verify that throughput is proportional to capacity and to determine the constant of 
proportionality, I measured the throughput of k-ary n-cubes with 256 and 4K nodes 
using a k-ary n-cube network simulator. The simulations were performed using random 
traffic and the deterministic routing algorithm described in Section 5.3.2. The traffic 
was generated by repeatedly having each node randomly choose a destination node and 
initiate transmission of a message to the destination node. 
The results of the throughput experiment are tabulated below. Throughput is shown as
a fraction of capacity. The experimental results indicate that throughput is between 0.4
and 0.7 of capacity for the cases tested. For practical purposes throughput is independent
of dimension.
    Parameter      256 Nodes                4K Nodes
    Dimension      2       4       8        2       4       6       12
    Radix          16      4       2        64      8       4       2
    Throughput     0.468   0.618   0.596    0.492   0.426   0.427   0.460
Hot Spot Throughput 
In many situations traffic is not uniform, but rather is concentrated into hot spots. A hot 
spot is a pair of nodes that accounts for a disproportionately large portion of the total
network traffic. As described by Pfister [101] for a shared-memory computer, hot-spot 
traffic can degrade performance of the entire network by causing congestion. 
The hot-spot throughput of a network is the maximum rate at which messages can be 
sent from one specific node, P_i, to another specific node, P_j. For a k-ary n-cube with
deterministic routing, the hot-spot throughput, Θ_HS, is just the bandwidth of a single
channel, W. Thus, under the assumption of constant wire cost we have
    \Theta_{HS} = W = k - 1.    (5.29)
Figure 5.12: Deadlock in a 4-Cycle
Low-dimensional networks have greater channel bandwidth and thus have greater hot-spot
throughput than do high-dimensional networks. Intuitively, low-dimensional networks
operate better under non-uniform loads because they do more resource sharing.
In an interconnection network the resources are wires. In a high-dimensional network,
wires are assigned to particular dimensions and cannot be shared between dimensions.
For example, in a binary n-cube it is possible for a wire to be saturated while a physically 
adjacent wire assigned to a different dimension remains idle. In a torus all physically 
adjacent wires are combined into a single channel that is shared by all messages that 
must traverse the physical distance spanned by the channel. 
5.3.2 Deadlock-Free Routing
Deadlock in the interconnection network of a concurrent computer occurs when no mes-
sage can advance toward its destination because the queues of the message system are 
full [70]. Consider the example shown in Figure 5.12. The queues of each node in the
4-cycle are filled with messages destined for the opposite node. No message can advance 
toward its destination; thus the cycle is deadlocked. In this locked state, no communi-
cation can occur over the deadlocked channels until exceptional action is taken to break 
the deadlock. 
Definition 5.1 A flow control digit or flit is the smallest unit of information that a 
queue or channel can accept or refuse. Generally a packet consists of many flits. The 
unit of communication that is visible to the programmer is the message. A message may 
be composed of one or more packets, each of which carries its own routing and sequencing 
information in a header. 
This complication of standard terminology has been adopted to distinguish between 
those flow control units that always include routing information - viz. packets - and 
those lower-level flow control units that do not - viz. flits. The literature on computer 
networks [125] has been able to avoid this distinction between packets and flits because
most networks include routing information with every flow control unit; thus the flow
control units are packets. That is not the case in the interconnection networks used by 
message-passing concurrent computers such as the Caltech Cosmic Cube [112]. 
The concurrent computer interconnection networks we are concerned with in this section
are not store-and-forward networks. Instead of storing a packet completely in a node
and then transmitting it to the next node, the networks we consider here use wormhole
routing¹³ [113]. With wormhole routing, only a few flits are buffered at each node. As
soon as a node examines the header flit(s) of a packet, it selects the next channel on
the route and begins forwarding flits down that channel. As flits are forwarded, the
packet becomes spread out across the channels between the source and destination. It
is possible for the first flit of a packet to arrive at the destination node before the last
flit of the packet has left the source. Because most flits contain no routing information,
the flits in a packet must remain in contiguous channels of the network and cannot be
interleaved with the flits of other packets. When the header flit of a packet is blocked,
all of the flits of a packet stop advancing and block the progress of any other packet
requiring the channels they occupy. Because a single packet blocks many channels at
once, preventing deadlock in a wormhole network is harder than preventing deadlock in
a store-and-forward network.
I assume the following:
• Every packet arriving at its destination node is eventually consumed.
• A node can generate packets destined for any other node.
• The route taken by a packet is determined only by its destination and not by other
traffic in the network.
• A node can generate packets of arbitrary length. Packets will generally be longer
than a single flit.
• Once a queue accepts the first flit of a packet, it must accept the remainder of the
packet before accepting any flits from another packet.
• An available queue may arbitrate between packets that request that queue space
but may not choose amongst waiting packets.
• Nodes can produce packets at any rate subject to the constraint of available queue
space (source queued).
¹³ A method similar to wormhole routing, called virtual cut-through, is described in [66]. Virtual
cut-through differs from wormhole routing in that it buffers messages when they block, removing
them from the network. With wormhole routing, blocked messages remain in the network.
The following definitions develop a notation for describing networks, routing functions, 
and configurations. 
Definition 5.2 An interconnection network, I, is a strongly connected directed graph,
I = G(N, C). The vertices of the graph, N, represent the set of processing nodes. The
edges of the graph, C, represent the set of communication channels. Associated with
each channel, c_i, is a queue with capacity cap(c_i). The source node of channel c_i is
denoted s_i and the destination node d_i.
Definition 5.3 A routing function R : C × N → C maps the current channel, c_c, and
destination node, n_d, to the next channel, c_n, on the route from c_c to n_d, R(c_c, n_d) = c_n.
A channel is not allowed to route to itself, c_c ≠ c_n. Note that this definition restricts
the routing to be memoryless in the sense that a packet arriving on channel c_c destined
for n_d has no memory of the route that brought it to c_c. However, this formulation of
routing as a function from C × N to C has more memory than the conventional definition
of routing as a function from N × N to C. Making routing dependent on the current
channel rather than the current node allows us to develop the idea of channel dependence.
Observe also that the definition of R precludes the route from being dependent on the
presence or absence of other traffic in the network. R describes strictly deterministic
and non-adaptive routing functions.
Definition 5.4 A channel dependency graph, D, for a given interconnection network, I,
and routing function, R, is a directed graph, D = G(C, E). The vertices of D are the
channels of I. The edges of D are the pairs of channels connected by R:
    E = \{ (c_i, c_j) \mid R(c_i, n) = c_j \text{ for some } n \in N \}.    (5.30)
Since channels are not allowed to route to themselves, there are no 1-cycles in D.
Definition 5.5 A configuration is an assignment of a subset of N to each queue. The
number of flits in the queue for channel c_i will be denoted size(c_i). If the queue for channel
c_i contains a flit destined for node n_d, then member(n_d, c_i) is true. A configuration is
legal if
    \forall c_i \in C, \; size(c_i) \le cap(c_i).    (5.31)
Definition 5.6 A deadlocked configuration for a routing function, R, is a non-empty
legal configuration of channel queues such that
    \forall c_i \in C, \; (\forall n \ni member(n, c_i), \; n \ne d_i \text{ and } c_j = R(c_i, n) \Rightarrow size(c_j) = cap(c_j)).    (5.32)
In this configuration no flit is one step from its destination, and no flit can advance
because the queue for the next channel is full. A routing function, R, is deadlock-free on
an interconnection network, I, if no deadlocked configuration exists for that function on
that network.
Theorem 5.1 A routing function, R, for an interconnection network, I, is deadlock-free
iff there are no cycles in the channel dependency graph, D.
Proof:
⇒ Suppose a network has a cycle in D. Since there are no 1-cycles in D, this cycle must
be of length two or more. Thus, one can construct a deadlocked configuration by filling
the queues of each channel in the cycle with flits destined for a node two channels away,
where the first channel of the route is along the cycle.
⇐ Suppose a network has no cycles in D. Since D is acyclic, one can assign a total order
to the channels of C so that if (c_i, c_j) ∈ E then c_i > c_j. Consider the least channel in
this order with a full queue, c_l. Every channel, c_n, that c_l feeds is less than c_l, and thus
does not have a full queue. Thus, no flit in the queue for c_l is blocked, and one does not
have deadlock. ∎
Virtual Channels 
Now that we have established this if-and-only-if relationship between deadlock and the 
cycles in the channel dependency graph, we can approach the problem of making a 
network deadlock-free by breaking the cycles. We can break such cycles by splitting
each physical channel along a cycle into a group of virtual channels. Each group of
virtual channels shares a physical communication channel; however, each virtual channel 
requires its own queue. 
Consider for example the case of a unidirectional four-cycle as shown in Figure 5.13A,
N = {n_0, ..., n_3}, C = {c_0, ..., c_3}. The interconnection graph, I, is shown on the left
and the dependency graph, D, is shown on the right. We pick channel c_0 to be the
dividing channel of the cycle and split each channel into high virtual channels,
c_{10}, ..., c_{13}, and low virtual channels, c_{00}, ..., c_{03}, as shown in Figure 5.13B.
Figure 5.13: Breaking Deadlock with Virtual Channels (A: original four-cycle; B: with
virtual channels; left: interconnection graph, I; right: dependency graph, D)
When a packet enters the network it is routed on the high channels until it passes
through node zero. After passing through node zero, packets are routed on the low
channels. Channel c_{00} is not used. We now have a total ordering of the virtual channels
according to their subscripts: c_{13} > c_{12} > c_{11} > c_{10} > c_{03} > c_{02} > c_{01}. Thus, there
is no cycle in D, and the routing function is deadlock-free.
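The four-cycle example can be reproduced in a few lines. The sketch below is illustrative only and rests on two assumptions: physical channel c_i runs from n_i to n_(i-1 mod 4), and a packet at node i bound for node j takes the high channel when i < j and the low channel otherwise. Dependencies are collected by tracing every source/destination route rather than by enumerating R directly; all function names are hypothetical.

```python
from itertools import product

K = 4  # nodes n0..n3; channel c_i is assumed to run from n_i to n_(i-1 mod K)


def route_plain(src, dst):
    """Channels used without virtual channels: c_i is named by its source node i."""
    node, path = src, []
    while node != dst:
        path.append(node)
        node = (node - 1) % K
    return path


def route_virtual(src, dst):
    """Channels used with virtual channels, named (v, i): v = 1 is high, v = 0 is low."""
    node, path = src, []
    while node != dst:
        path.append((1 if node < dst else 0, node))
        node = (node - 1) % K
    return path


def dependency_edges(route):
    """Pairs of consecutive channels over all routes: edges of the dependency graph."""
    edges = set()
    for src, dst in product(range(K), repeat=2):
        if src != dst:
            path = route(src, dst)
            edges.update(zip(path, path[1:]))
    return edges


def has_cycle(edges):
    """Depth-first search for a directed cycle."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = dict.fromkeys(graph, 0)      # 0 = unvisited, 1 = on stack, 2 = finished

    def visit(v):
        state[v] = 1
        for w in graph[v]:
            if state[w] == 1 or (state[w] == 0 and visit(w)):
                return True
        state[v] = 2
        return False

    return any(state[v] == 0 and visit(v) for v in graph)


print("plain ring has a cycle:      ", has_cycle(dependency_edges(route_plain)))
print("virtual channels have a cycle:", has_cycle(dependency_edges(route_virtual)))
```

With these assumptions every dependency edge runs from a larger (v, i) pair to a smaller one, which is the total ordering the text describes, and the pair (0, 0) never appears on any route.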
Many deadlock-free routing algorithms have been developed for store-and-forward
computer communications networks [48], [49], [58], [89], [130], [131]. These algorithms are
all based on the concept of a structured buffer pool. The packet buffers in each node
of the network are partitioned into classes, and the assignment of buffers to packets is
restricted to define a partial order on buffer classes. The structured buffer pool method
has in common with the virtual channel method that both prevent deadlock by assign-
ing a partial order to resources. The two methods differ in that the structured buffer 
pool approach restricts the assignment of buffers to packets, while the virtual channel 
approach restricts the routing of messages. Either method can be applied to store-and-
forward networks, but the structured buffer pool approach is not directly applicable to 
wormhole networks, since the flits of a packet cannot be interleaved. 
In the next section, virtual channels are used to construct a deadlock-free routing algo-
rithm for k-ary n-cubes. In [24] algorithms are developed for cube-connected cycles and 
shuffle-exchange networks as well. 
k-ary n-cubes 
The E-cube routing algorithm [79],[124] guarantees deadlock-free routing in binary
n-cubes. In a cube of dimension d, we denote a node as n_k where k is a d-digit binary
number. Node n_k has d output channels, one for each dimension, labeled c_{0k}, ..., c_{(d-1)k}.
The E-cube algorithm routes in decreasing order of dimension. A message arriving at
node n_k destined for node n_l is routed on channel c_{ik}, where i is the position of the most
significant bit in which k and l differ. Since messages are routed in order of decreasing
dimension and hence decreasing channel subscript, there are no cycles in the channel
dependency graph, and E-cube routing is deadlock-free.
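A minimal sketch of E-cube routing as just described, assuming node addresses are plain integers and a channel c_{ik} is represented by the pair (i, k). The function name `ecube_route` is hypothetical.

```python
def ecube_route(src, dst):
    """Return the list of channels (dimension, source node) from src to dst."""
    node, path = src, []
    while node != dst:
        i = (node ^ dst).bit_length() - 1   # most significant differing bit
        path.append((i, node))              # channel c_{i,node}
        node ^= 1 << i                      # cross dimension i
    return path


# Route from node 000 to node 101 in a binary 3-cube.
print(ecube_route(0b000, 0b101))
```

Each hop clears the highest differing bit, so the dimensions along any route are strictly decreasing.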
Using the technique of virtual channels, this routing algorithm can be extended to handle
all k-ary n-cubes. Rings and toroidal meshes are included in this class of networks. This
algorithm can also handle mixed-radix cubes. Each node of a k-ary n-cube is identified
by an n-digit radix k number. The ith digit of the number represents the node's position
in the ith dimension. For example, the center node in the 3-ary 2-cube of Figure 5.14 is
n_{11}. Each channel is identified by the number of its source node and its dimension. For
example, the dimension 0 (horizontal) channel from n_{11} to n_{10} is c_{011}. To break cycles,
we divide each channel into an upper and lower virtual channel. The upper virtual
channel of c_{011} will be labeled c_{0111}, and the lower virtual channel will be labeled c_{0011}.
To give internal channels the lowest priority, they are labeled with a dimension higher
than the dimension of the cube. To assure that the routing is deadlock-free, we restrict
it to route through channels in order of descending subscripts. Priority is always given
to the message from the channel with a lower subscript.
Figure 5.14: 3-ary 2-Cube
As in the E-cube algorithm, we route in order of dimension, most significant dimension 
first. In each dimension, i, a message is routed in that dimension until it reaches a 
node whose subscript matches the destination address in the ith position. The message 
is routed on the high channel if the ith digit of the destination address is greater than 
the ith digit of the present node's address. Otherwise the message is routed on the 
low channel. It is easy to see that this routing algorithm routes in order of descending 
subscripts and is thus deadlock-free. 
Formally, we define the routing function:
    R_{KNC}(c_{dvn}, n_j) = \begin{cases}
        c_{d1(n - r^d)} & \text{if } (dig(n,d) < dig(j,d)) \wedge (dig(n,d) \ne 0), \\
        c_{d0(n - r^d)} & \text{if } (dig(n,d) > dig(j,d)) \vee (dig(n,d) = 0), \\
        c_{i1(n - r^i)} & \text{if } (\forall k > i, \; dig(n,k) = dig(j,k)) \wedge (dig(n,i) \ne dig(j,i)),
    \end{cases}    (5.33)
where dig(n, d) extracts the dth digit of n, and r is the radix of the cube. The subtraction,
n - r^d, is performed so that only the dth digit of the address n is affected.
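The routing rule can be exercised with a short simulation. This sketch follows the prose description (most significant dimension first, high channel when the destination digit exceeds the current digit, low channel otherwise) rather than transcribing (5.33) literally. It assumes unidirectional channels toward decreasing address with wrap-around from digit 0 to digit r - 1, and it represents the channel subscript dvn by the tuple (d, v, n). All names are hypothetical.

```python
from itertools import product


def dig(n, d, r):
    """Digit d of node address n written in radix r."""
    return (n // r**d) % r


def route_knc(src, dst, r, dims):
    """Return (final node, list of channels (d, v, n)) for a k-ary n-cube route."""
    node, path = src, []
    for d in reversed(range(dims)):                 # most significant dimension first
        while dig(node, d, r) != dig(dst, d, r):
            high = dig(dst, d, r) > dig(node, d, r)
            path.append((d, 1 if high else 0, node))
            if dig(node, d, r) == 0:
                node += (r - 1) * r**d              # wrap around: digit 0 -> r - 1
            else:
                node -= r**d                        # decrement digit d only
    return node, path


def check(r, dims):
    """True if every route arrives and uses strictly descending channel labels."""
    for src, dst in product(range(r**dims), repeat=2):
        end, path = route_knc(src, dst, r, dims)
        if end != dst or any(a <= b for a, b in zip(path, path[1:])):
            return False
    return True


print(check(3, 2), check(4, 3))
```

Because the labels descend along every route, no cycle of channel dependencies can form, which is the property Theorem 5.1 requires.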
Assertion 5.1 The routing function, R_KNC, correctly routes messages from any node
to any other node in a k-ary n-cube.
Proof: By induction on dimension, d.
For d = 1, a message, destined for n_j, enters the system at n_i on the internal channel,
c_{d0i}. If i < j, the message is forwarded on channels, c_{01i}, ..., c_{010}, c_{00r}, ..., c_{00(j+1)} to
node n_j. If i > j, the path taken is c_{00i}, ..., c_{00(j+1)}. In both cases the route reaches
node n_j.
Assume that the routing works for dimensions ≤ d. Then for dimension d + 1 there are two
cases. If dig(i, d) ≠ dig(j, d), then the message is routed around the most significant cycle
to a node n_k ∋ dig(k, d) = dig(j, d), as in the d = 1 case above. If dig(i, d) = dig(j, d),
then the routing need be performed only in dimensions d and lower. In each of these
cases, once the message reaches a node, n_k, ∋ dig(k, d) = dig(j, d), the third routing rule
is used to route the message to a lower-dimensional channel. The problem has then been
reduced to one of dimension ≤ d, and the routing reaches the correct node by induction.
∎
Assertion 5.2 The routing function R_KNC on a k-ary n-cube interconnection network,
I, is deadlock-free.
Proof: Since routing is performed in decreasing order of channel subscripts,
∀c_i, c_j, n_c ∋ R(c_i, n_c) = c_j ⇒ i > j, the channel dependency graph, D, is acyclic. Thus by
Theorem 5.1 the route is deadlock-free. ∎
5.3.3 The Torus Routing Chip 
I have developed the torus routing chip (TRC) as a demonstration of the use of virtual
channels for deadlock-free routing. Shown in Figures 5.15 and 5.16, the TRC is a
≈10,000-transistor chip implemented in 3µ CMOS technology and packaged in an 84-lead
pin-grid array. It provides deadlock-free packet communications in k-ary n-cube (torus)
networks with up to k = 256 nodes in each dimension. While primarily intended for
n = 2-dimensional networks, the chips can be cascaded to handle arbitrary n-dimensional
networks using ⌈n/2⌉ TRCs at each processing node. TRCs have been fabricated and
tested.
Even if only two dimensions are used, the TRC can be used to construct concurrent
computers with up to 2^16 nodes. It would be very difficult to distribute a global clock
over an array of this size [40]. To avoid this problem, the TRC is entirely self-timed
[109], thus permitting each processing node to operate at its own rate with no need for 
global synchronization. Synchronization, when required, is performed by arbiters in the 
TRC. 
Figure 5.15: Photograph of the Torus Routing Chip 
Figure 5.16: A Packaged Torus Routing Chip 
To reduce the latency of communications that traverse more than one channel, the TRC
uses wormhole [113] routing rather than store-and-forward routing. Instead of reading an 
entire packet into a processing node before starting transmission to the next node, the 
TRC forwards each byte of the packet to the next node as soon as it arrives. Wormhole 
routing thus results in a message latency that is the sum of two terms, one of which 
depends on the message length, L, and the other of which depends on the number of 
communication channels traversed, D. Store-and-forward routing gives a latency that 
depends on the product of L and D. Another advantage of wormhole routing is that 
communications do not use up the memory bandwidth of intermediate nodes. A packet 
does not interact with the processor or memory of intermediate nodes along its route. 
Packets remain strictly within the TRC network until they reach their destination. 
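The difference between the two latency forms is easy to see numerically. The sketch below is an illustration, not a model of the TRC: it assumes wormhole latency proportional to D + L/W and, as the simplest form matching "the product of L and D", store-and-forward latency proportional to D · (L/W). Both function names and the unit channel delay are assumptions.

```python
def wormhole_latency(hops, msg_bits, width, t_c=1.0):
    """Sum of a distance term and a message-length term."""
    return t_c * (hops + msg_bits / width)


def store_and_forward_latency(hops, msg_bits, width, t_c=1.0):
    """Whole message is received at every hop before being sent on (assumed form)."""
    return t_c * hops * (msg_bits / width)


# 150-bit message over 15 channels of width 8 (the 256-node torus case).
print(wormhole_latency(15, 150, 8), store_and_forward_latency(15, 150, 8))
```

For a single hop the two forms differ only by the one-cycle distance term; the gap grows with the number of channels traversed.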
System Design 
The torus routing chip (TRC) can be used to construct arbitrary k-ary n-cube
interconnection networks. Each TRC routes packets in two dimensions, and the chips are
cascadable as shown in Figure 5.17 to construct networks of dimension greater than two.
The first TRC in each node routes packets in the first two dimensions and strips off their 
address bytes before passing them to the second TRC. This next chip then treats the 
next two bytes as addresses in the next two dimensions and routes packets accordingly. 
The network can be extended to any number of dimensions. 
Figure 5.17: A Dimension 4 Node
Figure 5.18: A Torus System
Figure 5.19: A Folded Torus System
A block diagram of a 2-dimensional message-passing concurrent computer constructed
around the TRC is shown in Figure 5.18. Each node consists of a processor, its local 
memory, and a TRC. Each TRC in the torus is connected to its processor by a processor 
input channel and a processor output channel. Connections on the edges of the torus 
wrap around to the opposite edge. One can avoid the long end-around connection by 
folding the torus, as shown in Figure 5.19. 
A flit in the TRC is a byte whose 8 bits are transmitted in parallel. The X and Y 
channels each consist of 8 data lines and 4 control lines. The 4 control lines are used for 
separate request/ acknowledge signal pairs for each of two virtual channels. The processor 
channels are also 8 bits wide, but have only two control lines each.
The packet format is shown in Figure 5.20. A packet begins with two address bytes. 
The bytes contain the relative X and Y addresses of the destination node. The relative 
address in a given direction, say X, is a count of the number of channels that must be 
traversed in the X direction to reach a node with the same X address as the destination. 
After the address comes the data field of the packet. This field may contain any number 
of non-zero data bytes. The packet is terminated by a zero tail byte. Later versions of 
the TRC may use an extra bit to tag the tail of a packet, and might also include error
checking. 
Figure 5.20: Packet Format (relative address bytes X and Y, non-zero data bytes D_0, ...,
zero tail byte)
The TRC network routes packets first in the X direction, then in the Y direction. Packets 
are routed in the direction of decreasing address, decrementing the relative address at
each step. When the relative X address is decremented to zero, the packet has reached 
the correct X coordinate. The X address is then stripped from the packet, and routing 
is initiated in the Y dimension. When the Y address is decremented to zero, the packet 
has reached the destination node. The Y address is then stripped from the packet, and 
the data and tail bytes are delivered to the node. 
Each of the X and Y physical channels is multiplexed into two virtual channels. In each
dimension packets begin on virtual channel 1. A packet remains on virtual channel 1
until it reaches its destination or address zero in the direction of routing. After a packet
crosses address zero, it is routed on virtual channel 0. The address 0 origin of the torus
network in X and Y is determined by two input pins on the TRC. The effect of this
routing algorithm is to break the channel dependency cycle in each dimension into a
two-turn spiral similar to that shown in Figure 5.13. Packets enter the
spiral on the outside turn and reach the inside turn only after passing through address
zero.
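The packet format and relative addressing can be sketched as follows. This is a simplified illustration, not a model of the chip: it assumes a hypothetical ring size K = 8 per dimension, computes each relative address as (source - destination) mod K because routing proceeds toward decreasing address, and ignores virtual channels and flow control. The names `make_packet` and `deliver` are hypothetical.

```python
K = 8  # nodes per dimension; hypothetical value (the TRC allows up to 256)


def make_packet(src, dst, data):
    """Build [rel_x, rel_y, data..., 0] for a packet sent from src to dst."""
    assert data and all(0 < b < 256 for b in data), "data bytes must be non-zero"
    (sx, sy), (dx, dy) = src, dst
    return [(sx - dx) % K, (sy - dy) % K, *data, 0]


def deliver(src, packet):
    """Walk the packet through the torus; return (final node, hops, bytes delivered)."""
    pos = list(src)
    body = list(packet)
    hops = 0
    for dim in (0, 1):                      # X first, then Y
        rel = body.pop(0)                   # address byte for this dimension
        while rel > 0:
            pos[dim] = (pos[dim] - 1) % K   # one channel toward decreasing address
            rel -= 1                        # decrement the relative address
            hops += 1
        # rel == 0 here: this address byte has been stripped from the packet
    return tuple(pos), hops, body           # data bytes and the zero tail remain


packet = make_packet((1, 6), (5, 2), [7, 9])
print(packet, deliver((1, 6), packet))
```

Only the data bytes and the zero tail byte reach the destination node, since each address byte is consumed in its own dimension.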
Each virtual channel in the TRC uses the 2-cycle signaling convention shown in
Figure 5.21. Each virtual channel has its own request (R) and acknowledge (A) lines.
When R = A, the receiver is ready for the next flit (byte). To transfer information,
the sender waits for R = A, takes control of the data lines, places data on the data
lines, toggles the R line, and releases the data lines. The receiver samples data on each
transition of the R line. When the receiver is ready for the next byte, it toggles the A line.
The protocol allows both virtual channels to have requests pending. The sending end
does not wait for any action from the receiver before releasing the channel. Thus, the
other virtual channel will never wait longer than the data transmission time to gain
access to the channel. Since a virtual channel always releases the physical channel after
transmitting each byte, the arbitration is fair. If both channels are always ready, they
will alternate bytes on the physical channel.
Figure 5.21: Virtual Channel Protocol
Figure 5.22: Channel Protocol Example (data line sequence: CH 1, CH 0, CH 0, CH 1)
Consider the example shown in Figure 5.22. Virtual channel X1 gains control of the
physical channel, transmits one byte of information, and releases the channel. Before
this information is acknowledged, channel X0 takes control of the channel and transmits
two bytes of information. Then X1, having by then been acknowledged, takes the channel
again.
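The 2-cycle convention and this example can be mimicked with a toy model. The sketch below is an assumption-laden illustration rather than a description of the TRC circuitry: each virtual channel keeps one R bit and one A bit, is ready when they are equal, and shares a list standing in for the physical data lines. The class name and byte values are hypothetical.

```python
class VirtualChannel:
    """One virtual channel using 2-cycle (transition) signaling."""

    def __init__(self, name, wire):
        self.name, self.wire = name, wire
        self.r = 0   # request line
        self.a = 0   # acknowledge line

    def ready(self):
        return self.r == self.a               # receiver can take another flit

    def send(self, byte):
        assert self.ready(), "previous flit not yet acknowledged"
        self.wire.append((self.name, byte))   # drive the shared data lines
        self.r ^= 1                           # toggle R, then release the lines

    def acknowledge(self):
        assert not self.ready(), "nothing to acknowledge"
        self.a ^= 1                           # toggle A: ready for the next flit


wire = []                                     # bytes seen on the physical channel
x1, x0 = VirtualChannel("X1", wire), VirtualChannel("X0", wire)
x1.send(0x11)          # X1 sends one byte and releases the channel
x0.send(0x21)          # X0 sends before X1 has been acknowledged
x0.acknowledge()
x0.send(0x22)          # second X0 byte
x1.acknowledge()
x1.send(0x12)          # X1, now acknowledged, takes the channel again
print([name for name, _ in wire])
```

Each transfer costs one transition on R and one on A, and a channel with an unacknowledged flit does not block the other virtual channel from using the data lines.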
Logic Design 
As shown in Figure 5.23, the TRC consists of five input controllers, a five by five crossbar
switch, five output queues, and two output multiplexers. There is one input controller
and one output controller for each virtual channel. The output multiplexers serve to
multiplex two virtual channels onto a single physical channel.
The input controller is responsible for packet routing. When a packet header arrives,
the input controller selects the output channel, adjusts the header by decrementing and
sometimes stripping the byte, and then passes all bytes to the crossbar switch until the
tail byte is detected.
Figure 5.23: TRC Block Diagram
Figure 5.24: Input Controller Block Diagram
The input controller, shown in Figure 5.24, consists of a datapath and a self-timed state
machine. The datapath contains a latch, a zero checker, and a decrementer. A state
latch, logic array, and control logic comprise the state machine. When the request line
for the channel is toggled, data are latched, and the zero checker is enabled. When the
zero checker makes a decision, the logic array is enabled to determine the next state,
the selected crossbar channel, and whether to strip, decrement, or pass the current byte.
When the required operation has been completed, possibly requiring a round trip through
the crossbar, the state and selected channel are saved in cross-coupled multi-flops and
the logic array is precharged.
The input controller and all other internal logic operate using a 4-cycle signaling
convention [109]. One function of the state machine control logic is to convert the external
2-cycle convention into the on-chip 4-cycle convention. The signals are converted back
to 2-cycle at the output pads.
The crossbar switch performs the switching and arbitration required to connect the five
input controllers to the five output queues. A single crosspoint of the switch is shown
in Figure 5.25. A two-input interlock (mutual-exclusion) element in each crosspoint
arbitrates requests from the current input channel (row) with requests from all lower
channels (rows). The interlock elements are connected in a priority chain so that an
input channel must win the arbitration in the current row and all higher rows before
gaining access to the output channel (column).
Figure 5.25: Crosspoint of the Crossbar Switch
The output queues buffer data from the crossbar switch for output. The queues are of 
length four. While shorter queues would suffice to decouple input and output timing, the 
longer queues also serve to smooth out the variation in delays due to channel conflicts. 
Each output multiplexer performs arbitration and switching for the virtual channels 
that share a common physical channel. As shown in Figure 5.26, a small self-timed state 
machine sequences the events of placing the data on the output pads, asserting request, 
and removing the output data. An interlock element is used to resolve conflicts between 
channels for the data pads. 
To interface the on-chip equipotential region to the off-chip equipotential region that
connects adjacent chips, self-timed output pads (Figure 7.22 in [109]) are used. A Schmitt
trigger and exclusive-OR gate in each of these pads signal the state machine when the
pad is finished driving the output. These completion signals are used to assure that the
data pads are valid before the request is asserted and that the request is valid before the
data are removed from the pads and the channel released.
Figure 5.26: Output Multiplexer Control
Experimental Results 
The design of the TRC began in August 1985. The chip was completely designed and
simulated at the transistor level before any layout was performed. The circuit design
was described using CNTK, a language embedded in C [26], and was simulated using 
MOSSIM [14]. A subtle error in the self-timed controllers was discovered at the circuit 
level before any time-consuming layout was performed. Once the circuit design was 
verified, the TRC was laid out in the new MOSIS scalable CMOS technology [132] using 
the Magic system [94]. A second circuit description was generated from the artwork and 
six layout errors were discovered by simulation of the extracted circuit. The verified 
layout was submitted to MOSIS for fabrication in September 1985. 
The first batch of chips was completed the first week of December but failed to function 
because of fabrication errors. A second run of chips (same design), returned the second 
week of December, contained some fully functional chips. 
Performance measurements on the chips are shown in Figure 5.27. To measure the
maximum channel rate, output request and acknowledge lines were tied together, and
input acknowledge was inverted and fed back into input request. In this configuration
the chip runs at a maximum speed, shown in Figure 5.27A, of 4MHz. The delays from
input request to output request and input acknowledge, shown in Figure 5.27B, are 150ns
and 250ns respectively. Data propagation time from input to output (not shown) was
measured to be 60ns for both rising and falling edges. Thus data are set up 90ns
(150ns - 60ns) ahead of the output request. Data hold time, shown in Figure 5.27C, is 20ns.
Figure 5.27: TRC Performance Measurements (panels A, B, and C; traces include input request, input acknowledge, output request, output acknowledge, and output data)
Tau model calculations suggest that a redesigned TRC should operate at 20MHz and 
have an input to output delay of 50ns. The redesign would involve decoupling the timing 
of the input controller by placing single-stage queues between the input pads and input 
controller and between the input controller and the crossbar switch. The input controller 
would be modified to speed up critical paths. 
Summary 
Communication between nodes of a concurrent computer need not be slower than the
communication between the processor and memory of a conventional sequential
computer. By using byte-wide datapaths and wormhole routing, the TRC provides
node-to-node communication times that approach main memory access times of sequential
computers. Communications across the diameter of a network, however, may require
substantially longer than a memory access time.
The TRC serves as still another counterexample to the myth that self-timed systems 
are more complex than synchronous systems. The design of the TRC is not significantly 
more complex than a synchronous design that performs the same function. As for speed,
the TRC is probably faster than a synchronous chip, since each chip can operate at its 
full speed with no danger of timing errors. A synchronous chip is generally operated at 
a slower speed that reflects the timing of a worst-case chip and adds a timing margin. 
5.4 A Message-Driven Processor 
In Section 5.3 we investigated means of minimizing message latency, Ti, by choosing 
the proper dimension interconnection network and proper routing strategy. We ignored, 
however, the contribution of the processing node to latency: Tnode in (5.18). In this
section I present novel architectural features that minimize Tnode by matching the behavior
of the processor to the object-based model of computation described by function (2.1).
In a concurrent computer built around a conventional instruction processor, interpreting 
a message is a time-consuming process. First, the processor responds to an interrupt 
informing it that a message has arrived. Next, the message is fetched from memory, 
and the method to be executed in response to the message is determined. Finally, after 
executing approximately 100 instructions, the processor begins execution of the method. If the
execution of the method involves sending a message, another cumbersome instruction
sequence is required to initiate the send. The latency introduced by performing these
message receives and sends in software is intolerable in a system where the average
method is only 10 instructions long.
Instead of nesting the instruction fetch-decode-execute loop of a conventional processor 
inside the receive-dispatch-execute loop required to process a message, a message-driven
processor directly interprets messages. A level of interpretation is removed; messages
are the instructions of a message-driven processor. 
When a message arrives at a processing node, the processor performs the following steps: 
Reception: Upon message arrival the message is immediately removed from the network. 
The message is buffered if the processor is busy and is received when the processor 
becomes idle. Reception and buffering of messages are performed by hardware. 
The current message is placed in a receive register to allow the processor fast 
access to arguments. 
Method Lookup: Once a message has been copied into the receive register, the method 
corresponding to the message is determined by examining the message selector 
and the class of the receiver. An instruction translation lookaside buffer (ITLB) 
[22] is used to speed the translation of messages into methods. 
Execution: Methods are either primitive or defined. Primitive methods, small integer 
add for example, are performed directly by the processor. They generally involve 
modifying the contents of an object and/or transmitting a reply message.
Defined methods create a context and specify a sequence of actions. Actions are
similar to subroutines on a conventional processor. They are executed by sending
a sequence of messages. Some of the sends performed during the execution of
a defined method are handled locally. They are simply instructions. Sends to
objects outside the current processing node result in sending a message over the
network. Addressing modes are provided to allow fast access to the fields of the 
current message, acquaintances of the receiver, and the contents of the context 
during the execution of an action. 
If a method consists of more than a single action, the context is retained, and 
the messages transmitted by the method are directed to reply to the context. A 
pointer to the next sequence of messages to be executed for the method is stored 
in the context. After the final action of a method, the processor sends a reply to 
the object specified in the Reply To field of the original message unless this field 
is nil.
The classes (data types) and the operations supported by a processing node may
vary amongst nodes. As described in Section 5.5, some processing nodes may be
object experts specialized to store and operate on a particular class of objects.
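The three steps listed above (reception, method lookup, execution) can be summarized in a small Python sketch. It is illustrative only: the names `Message`, `PRIMITIVES`, `class_of`, and `handle`, the use of a Python dictionary as the method table, and the "reply" selector are assumptions made for this example, and only a single primitive method is modeled.

```python
from collections import namedtuple

Message = namedtuple("Message", "receiver selector reply_to args")

# Hypothetical method table: (class name, selector) -> Python function.
PRIMITIVES = {
    ("SmallInteger", "+"): lambda rcvr, args: rcvr + args[0],
}

def class_of(obj):
    # Assumption: Python ints stand in for small-integer primitives.
    return "SmallInteger" if isinstance(obj, int) else type(obj).__name__

def handle(message, outbox):
    """Process one received message: look up the method, execute it, reply."""
    key = (class_of(message.receiver), message.selector)
    method = PRIMITIVES[key]                          # method lookup
    result = method(message.receiver, message.args)   # execution
    if message.reply_to is not None:                  # a nil Reply To suppresses the reply
        outbox.append(Message(message.reply_to, "reply", None, (result,)))
    return result
```

For example, handling `Message(3, "+", "ctx1", (4,))` computes the sum and appends one reply message addressed to `"ctx1"`, while the same message with a `None` reply-to field produces no reply.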
  Message:   Receiver | Selector | Reply To | Arg 1 | ... | Arg N
  Object ID: Tag | Instance
Figure 5.28: Message Format
5.4.1 Message Reception 
The format of a message is shown in Figure 5.28. Each message contains the following 
fields: 
Receiver: The identifier of the object to which the message is directed. 
Selector: The name of the message. The selector, together with the class of the receiver, 
determines what method is to be executed in response to the message. If the 
message has a nil receiver, the selector directly determines the method (see footnote 14).
Reply to: The object that is to receive the reply from this message. If this field is nil, no 
reply is expected. 
Arguments: Object identifiers for the arguments of the message, if any. 
As shown in the lower portion of Figure 5.28, each object identifier consists of two fields. 
The Tag field specifies whether the object is a primitive or a reference object and, if the 
object is a primitive, specifies its class. If the object is a primitive, the Instance field is 
the object itself. For example, if the Tag field specifies that the object is of class Small 
Integer, the Instance field contains the integer. For reference objects, the Instance field
contains a pointer to the object in object space. The object pointer is translated into a 
node number by the global mail system and into an address within the node by the local 
mail system. The class of a reference object is found within the object itself. 
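The message and object identifier layout described above can be written down as plain data types. The following Python sketch is one possible rendering, not the thesis's encoding: the tag names, the use of `None` for nil, and the dictionary standing in for object space are assumptions introduced for illustration.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

@dataclass(frozen=True)
class ObjectID:
    tag: str        # class name for a primitive, "Reference" otherwise (names assumed)
    instance: int   # the value itself, or a pointer into object space

    def is_primitive(self):
        return self.tag != "Reference"

@dataclass(frozen=True)
class Message:
    receiver: Optional[ObjectID]     # None models a nil receiver
    selector: str
    reply_to: Optional[ObjectID]     # None models a nil Reply To field
    args: Tuple[ObjectID, ...] = ()

def class_of(oid, object_space):
    """Return the class of an object identifier."""
    if oid.is_primitive():
        return oid.tag                           # class is encoded in the tag
    return object_space[oid.instance]["class"]   # class is stored in the object
```

The two branches of `class_of` correspond to the two cases in the text: a primitive carries its class in the Tag field, while the class of a reference object must be read from the object itself.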
The process of message reception is illustrated in Figure 5.29. If the processor is idle 
when a message arrives from the network, the message is read directly into a receive 
register. The receive register contains slots for the receiver, selector, and reply to fields 
of the message, as well as four arguments. Additional arguments are stored in memory 
at a location referenced by the argument pointer. 
Footnote 14: Messages from the network will never have a nil receiver. Messages executed as the
instructions of a defined method, however, may have a nil receiver.
Figure 5.29: Message Reception (message buffer in memory with Head and Tail pointers; receive register holding the Receiver, Selector, Reply To, and argument fields)
If the processor is busy when a message arrives, the message is automatically buffered 
in memory. Message buffer memory access takes priority over processor memory access, 
since it is critical to network performance that a message be removed from the network 
as soon as it arrives at its destination node. Dedicated registers point to the head and 
tail of a message queue in memory. When the processor becomes idle, the message at 
the head of the queue is removed from the queue and copied into the receive register. 
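The buffering behavior just described amounts to a first-in, first-out queue in front of a single receive register. A minimal Python sketch follows; the class name `ReceptionUnit` and its method names are hypothetical, and a `deque` stands in for the memory buffer addressed by the head and tail registers.

```python
from collections import deque

class ReceptionUnit:
    def __init__(self):
        self.queue = deque()           # message buffer in memory (head is the left end)
        self.receive_register = None
        self.busy = False

    def arrive(self, message):
        """A message arrives from the network."""
        if self.busy:
            self.queue.append(message)        # buffered at the tail of the queue
        else:
            self.receive_register = message   # idle: loaded directly
            self.busy = True

    def finish(self):
        """The processor completes the current message and becomes idle."""
        if self.queue:
            self.receive_register = self.queue.popleft()   # next message from the head
        else:
            self.receive_register = None
            self.busy = False
```

In either case the arriving message leaves the network immediately, which is the property the text identifies as critical to network performance.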
The use of special purpose hardware to remove messages from the network, buffer them 
in memory, and load messages into a processor register has two significant performance 
advantages. Since messages are quickly removed from the network, network performance 
is improved; if messages were left for any period of time with their tails blocking network 
channels, severe network congestion could result. Also, message latency is reduced since 
the time for a conventional processor to respond to a network interrupt and load the
message is eliminated. If the processor is idle, the message is loaded as soon as it arrives. 
5.4.2 Method Lookup 
Once a message is received, the first step in interpreting the message is to look up
the method specified by the selector and the class of the receiver. If the receiver is a
primitive, its class is encoded in the tag part of the object ID and is already in the
receive register. If the receiver is a reference object, the class of the object must be
fetched. Part of the object's class is a table of selectors understood by that class and
the method corresponding to each selector. This table is searched, by hashing, to find
the selector in the received message. If the selector is found, the corresponding method
is executed. Otherwise, the object's superclass is checked, then the superclass' superclass,
and so on, as shown in Figure 5.30.
Figure 5.30: Method Lookup (the selector is hashed into the method table of the receiver's class; each class points to its superclass and that superclass's method table)
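The superclass search described above is a simple walk up a chain of hash tables. The Python sketch below illustrates it; the names `ClassDescription` and `lookup` are hypothetical, a Python dictionary plays the role of the hashed method table, and returning `None` when no class understands the selector is an assumption, since the text does not say how that case is handled.

```python
class ClassDescription:
    def __init__(self, name, methods, superclass=None):
        self.name = name
        self.methods = methods        # selector -> method (a hash table)
        self.superclass = superclass  # None at the top of the hierarchy

def lookup(cls, selector):
    """Search the class, then each superclass in turn, for the selector."""
    while cls is not None:
        if selector in cls.methods:
            return cls.methods[selector]
        cls = cls.superclass
    return None   # selector not understood anywhere in the chain (handling assumed)
```

The cost of a lookup grows with the depth of the class hierarchy, which motivates caching the result.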
Method lookup can be accelerated by using an ITLB as shown in Figure 5.31 [22]. The 
ITLB is an associative memory that associates selector and class with the corresponding 
method. Each entry in the ITLB corresponds to a unique method and contains three 
fields: 
Key: The selector and class that specify the method. 
Primitive Bit: Specifies whether the method is primitive or defined.
Method: How the method is to be accomplished. For a primitive method, this field 
determines which primitive operation the processor is to perform. For a defined 
method, this field contains the object ID of the method. 
Method lookup using the ITLB proceeds in three steps. First, the class of the receiver
is obtained and concatenated with the selector to form a key into the ITLB. The ITLB,
an associative memory, attempts to find an entry matching this key. If an ITLB entry
is found, then the method field and primitive bit are read from the ITLB. Otherwise,
a conventional method lookup must be performed as described above. All primitive
instructions are permanently stored in the ITLB.
Figure 5.31: Instruction Translation Lookaside Buffer (the key, formed from the receiver and the selector, is matched against N entries, each holding a key, a primitive bit P, and a method)
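The ITLB lookup can be modeled as a cache keyed by the pair (class, selector). The following Python sketch is illustrative: the class name `ITLB`, the `translate` method, and the `slow_lookup` callback are hypothetical names, loading the result of a miss into the table is an assumed policy, and replacement of entries in a finite associative memory is not modeled.

```python
class ITLB:
    def __init__(self, primitives):
        # (class name, selector) -> (primitive bit, method).
        # Primitive entries are installed once and never removed.
        self.entries = {key: (True, op) for key, op in primitives.items()}

    def translate(self, class_name, selector, slow_lookup):
        key = (class_name, selector)       # class concatenated with selector
        if key in self.entries:
            return self.entries[key]       # hit: read primitive bit and method
        method = slow_lookup(class_name, selector)   # miss: conventional lookup
        self.entries[key] = (False, method)          # assumed: cache the result
        return self.entries[key]
```

With this policy a repeated send of the same selector to objects of the same class triggers the conventional lookup only once.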
The memory requirements for a message-driven processor are quite modest. A processing 
element need not keep a complete class description for each class of objects it contains. 
In fact, a processing element need not keep any code resident at all. When a method is 
referenced, it can be copied over the network. When an ITLB miss occurs, method lookup 
can be spread across a number of processing elements (see footnote 15). Performance can be enhanced,
however, by maintaining redundant copies of some methods and class descriptions. For 
example, it would be beneficial to maintain local copies of each method referenced in the 
ITLB. 
5.4.3 Execution 
A context object, shown in Figure 5.32, is created at the start of a defined method and 
controls the execution of the method. The receiver, reply-to object, and arguments from 
the message as well as the method pointer from the method lookup operation are copied 
into the context. The context contains the instruction pointer (IP) that sequences the 
instructions in the method. Local variables are also held in the context. A context cache
as described in [22] can be used to provide fast allocation of and access to contexts.
Footnote 15: It is important that only primitive or guaranteed resident methods are used for method
lookup. Otherwise, the system could get into an infinite loop of looking up a method required to
look up a method, and so on.
Figure 5.32: A Context (the context holds the method pointer, IP, Receiver, Reply To, arguments, and local variables; the method holds a header and instructions Inst 1 through Inst N, grouped into actions)
To execute a method, the processor fetches the instruction referenced by the IP, executes 
the instruction, and increments the IP to proceed to the next instruction. Each
instruction conceptually sends a message by specifying the receiver, selector, and arguments.
Many of the instructions, however, will not result in actually sending a message, but 
instead will modify the contents of a local object or alter execution of the method by 
modifying the IP. The processor executes these instructions directly. 
Since all instructions are message sends, a processor that interprets a single instruction 
would suffice. However, to improve the efficiency of commonly used primitives and 
control instructions, certain messages will be more compactly encoded. A set of possible 
instructions is as follows: 
send: The SEND instruction sends a message to the specified receiver. Addressing 
modes are provided to represent compactly the receiver, selector, and arguments. 
If the receiver is nil, a constant, or a local object, the operation will be performed
directly and no send will occur. Otherwise, the fields of the message will be 
assembled and the message transmitted over the network. 
control instructions: These instructions can be thought of as sending messages to the 
current context. BRANCH conditionally alters the value of the IP. SUSPEND 
halts execution but preserves the context so that a reply can resume execution. 
EXECUTE suspends the current context and begins execution of another context. 
REPLY sends a reply message with a specified argument list to the object specified
by the Reply-To field of the context and then deletes the current context.
common messages: Commonly used messages such as MOVE, at:put:, or + may be
encoded directly to save space. Often these instructions combine several conceptual
sends into a single instruction. For example, the instruction A ← B + C sends the
message + C to B and then stores the result in A, conceptually sending an at:put:
message to the object containing A.
  Instruction                       Format
  SEND                              SEND | <receiver> | <selector> | <arg list>
  A ← B <op> C                      <op> | A | B | C
  BRANCH <offset> ON <condition>    BRANCH | <offset> | <condition>
  SUSPEND                           SUSPEND
  EXECUTE <context>                 EXECUTE | <context>
  REPLY <arg list>                  REPLY | <arg list>
Figure 5.33: Instruction Formats
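The fetch, execute, and increment cycle and a few of the instructions listed above can be illustrated with a toy interpreter in Python. This is a sketch under several assumptions: instructions are modeled as tuples, only addition is supported for the three-operand form, the branch offset is taken relative to the following instruction, and the names `run`, `ASSIGN`, and the dictionary used as a context are introduced here for illustration.

```python
def run(context, method):
    """Execute instructions until SUSPEND or REPLY; return (status, value)."""
    while True:
        op, *operands = method[context["ip"]]   # fetch the instruction at the IP
        context["ip"] += 1                      # advance to the next instruction
        if op == "ASSIGN":                      # A <- B + C
            dest, b, c = operands
            context[dest] = context[b] + context[c]
        elif op == "BRANCH":                    # BRANCH <offset> ON <condition>
            offset, cond = operands
            if context[cond]:
                context["ip"] += offset         # offset from the next instruction (assumed)
        elif op == "SUSPEND":
            return "suspended", None            # context is preserved for a later resume
        elif op == "REPLY":
            return "replied", context[operands[0]]
        else:
            raise ValueError(op)
```

Because the IP lives in the context, calling `run` again on a suspended context resumes execution at the instruction after the SUSPEND, which mirrors the way a reply resumes a retained context.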
Possible formats for these instructions are shown in Figure 5.33. Each instruction consists 
of an opcode field and zero or more operand fields. Each operand field contains an 
operand specifier that describes the operand using one of four addressing modes. 
global constant: A global constant table contains commonly used constants such as true, 
false, nil, and small integers. The operand descriptor specifies an offset into the 
constant table. 
local constant: A local constant table (literal table) is associated with each method.
This table contains selectors for the messages sent in the method and any other
constants required by the method that are not contained in the global table. As
with the global constant addressing mode, the operand descriptor specifies an
offset into the constant table.
context: The operand descriptor selects a field of the context, or, if a specific code is used,
the context itself. This mode is used to access arguments and local variables.
receiver: The operand descriptor selects a field (variable) of the receiver or the receiver
itself.
This code prefixes a method that is qualified by the line:
'require rwlock exclude rwlock or ifalse.'
          CONTEXT.T1 ← RCVR.REQUIRED NOR RCVR.EXCLUDED
          BRANCH LCONST.SUCCEED ON CONTEXT.T1
          REPLY GCONST.false
SUCCEED:  RCVR.REQUIRED ← GCONST.true
          RCVR.EXCLUDED ← GCONST.true
          remainder of method code
Figure 5.34: A Coding Example: Locks
The high two bits of the operand descriptor select the addressing mode. The remaining 
bits select a specific constant or a specific field of the context or receiver. 
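The operand descriptor layout, in which the high two bits select one of the four addressing modes and the remaining bits select a constant or field, can be decoded as follows. The sketch assumes an 8-bit descriptor and a particular assignment of bit patterns to modes; neither is specified in the text, and the name `decode_operand` is hypothetical.

```python
# Assumed numbering of the four addressing modes (00, 01, 10, 11).
MODES = ("global constant", "local constant", "context", "receiver")

def decode_operand(descriptor, width=8):
    """Split an operand descriptor into (addressing mode, index)."""
    if not 0 <= descriptor < (1 << width):
        raise ValueError("descriptor does not fit in the assumed width")
    mode = descriptor >> (width - 2)                 # high two bits
    index = descriptor & ((1 << (width - 2)) - 1)    # remaining low bits
    return MODES[mode], index
```

Under these assumptions an 8-bit descriptor can address 64 constants or fields in each mode.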
As an example of message-driven processor code, consider implementing locks, described
in Section 2.4, by prefixing the body of the method with the code shown in Figure 5.34.
First, two fields of the receiver, REQUIRED and EXCLUDED, are NORed to determine if
the current method can be allowed to proceed. The result of this operation is stored in
the context in temporary variable T1. If the method can proceed, a BRANCH is made
to the location labeled SUCCEED, the locks are set, and the method is continued (see footnote 16).
Otherwise, the method is terminated by sending a reply with false as the argument. In
this example the context acts as a receptionist [1] in controlling concurrent access to the
receiver.
Footnote 16: For mutual exclusion a Boolean lock is sufficient. In the general case, however, a counting semaphore
is required.
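The lock prefix of Figure 5.34 can be restated in Python to make its control flow explicit. This is a sketch of the Boolean-lock case only: the function name `locked_method_prefix` is hypothetical, a dictionary stands in for the receiver, and a `False` return value stands in for the REPLY of false.

```python
def locked_method_prefix(receiver):
    """Return True if the method body may proceed, False if it must reply false."""
    # CONTEXT.T1 <- RCVR.REQUIRED NOR RCVR.EXCLUDED
    t1 = not (receiver["REQUIRED"] or receiver["EXCLUDED"])
    if not t1:
        return False                  # REPLY GCONST.false
    receiver["REQUIRED"] = True       # SUCCEED: set both locks
    receiver["EXCLUDED"] = True
    return True                       # continue with the remainder of the method
```

A second invocation on the same receiver is refused until some other code, not shown in the figure, clears the two fields.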
5.5 Object Experts 
One approach to harnessing the power of VLSI technology is through specialization.
Many problems are so demanding of computer resources that the capabilities of
general-purpose computers are insufficient to solve these problems in a reasonable period of
time. Examples of demanding problems are finite element analysis, signal processing,
game playing, circuit simulation, and logic simulation. In recent years hardware
accelerators have been constructed for a number of these problems. Vector and array processors
[104] have been used to accelerate numerical applications such as finite element
analysis and signal processing, and special-purpose engines have been constructed for chess
playing [18] and for logic [99] and switch-level [23] circuit simulation. These machines
typically offer performance of 100 to 1000 times that of a general-purpose computer;
however, they have the disadvantage of being specialized for one problem. The low
fabrication cost of VLSI technology makes building special-purpose processors economically
feasible; however, limited design resources and economy of scale considerations make it
impractical to build a different processor for each problem. The challenge is to build an
accelerator sufficiently flexible to be applied to many problems.
A flexible accelerator can be constructed by applying specialization to data structures 
rather than to particular applications. Most applications are built around a central data 
structure, and in many cases operations on this data structure consume most of the com-
puting resources. By accelerating operations on common data structures, all applications 
using these data structures are accelerated. Just as object-oriented programming makes
it convenient to share class definitions across several applications, hardware specialized 
to operate on a particular class of objects, an object expert, can also be shared across 
many applications. 
In addition to exploiting specialization, object experts are also well suited to VLSI tech-
nology because they promote locality. Data of a specific class are stored in the object 
expert, and operations on the data are performed locally. A floating point vector expert, 
for example, would store vectors of floating point numbers near the pipelined arithmetic 
unit that operates on the vectors. 
A machine constructed around object experts is a heterogeneous machine. Different
types of processing elements specialized for different classes of objects are distributed
about the machine with several object experts clustered at each processing node. Each
processing node contains at least one general-purpose processor (an object expert for 
context objects) and possibly one processor that implements the class of objects used 
to construct the mail system. Clustering of object experts promotes two levels of con-
currency: pipelining applications within each cluster and running several clusters in 
parallel. Many applications can take advantage of two-level concurrency. For example, a 
logic simulator can pipeline the event list, fanout list, and evaluation functions and run 
several of these pipelines in parallel on different portions of the circuit simultaneously. 
Here are a few possible object experts: 
A Floating Point Vector Expert accelerates numerical applications by providing fast
arithmetic operations on vectors of floating point numbers, similar to those provided
by the vector arithmetic units of a scientific computer such as the Cray 1 [104].
This expert is an example of a processing element that stores and operates on
objects that belong to a restriction of a class. In this example, only those vector
objects that contain only floating point numbers and that are under a certain 
maximum length are eligible. If an integer is stored into one of these vectors, or if 
a vector exceeds the maximum length, the vector would be exported to a general 
purpose processor. 
A Graph Expert accelerates algorithms similar to those presented in Chapter 4. This is 
an example of an expert for a distributed object class. Constituents (edges and 
vertices) of distributed objects (graphs) are stored in each expert. Several experts 
at different processing nodes cooperate in accelerating operations on these graphs. 
Set Experts accelerate operations on collections of objects. An ordered set expert would 
accelerate the balanced cube operations described in Chapter 3. An unordered 
set expert would accelerate the concurrent hashing operations described in Ap-
pendix B. 
I/O Experts provide a convenient way of dealing with I/O devices as objects that send
and receive messages. A display screen, for example, is an expert for objects of 
class BitBlt. To draw on the screen, a copybits message is sent to one of these 
objects. Input devices such as keyboards, lightpens, and mice are handled by 
object experts that send messages in response to an input event, e.g., a keystroke, 
and reply to messages about the state of the device, e.g., mouse position. Mass 
storage experts control devices such as disks and tapes. An I/O expert maps 
physical objects that manipulate the outside world into CST objects that send 
and receive messages. 
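The idea of an expert that holds only a restriction of a class, as in the floating point vector expert above, can be sketched in Python. The names `FloatVectorExpert`, `store`, and `MAX_LENGTH`, the limit of four elements, and the two dictionaries standing in for the expert's storage and the general-purpose processor are all assumptions made for illustration.

```python
MAX_LENGTH = 4   # illustrative limit; the text does not give a value

class FloatVectorExpert:
    def __init__(self):
        self.local = {}      # vectors held by the expert
        self.exported = {}   # vectors handed to a general-purpose processor

    def store(self, name, values):
        """Store a vector, exporting it if it leaves the restricted class."""
        eligible = (len(values) <= MAX_LENGTH
                    and all(isinstance(v, float) for v in values))
        if eligible:
            self.exported.pop(name, None)
            self.local[name] = list(values)
        else:
            self.local.pop(name, None)       # no longer eligible: export it
            self.exported[name] = list(values)
        return eligible
```

Storing an integer into a vector, or growing it past the limit, moves the vector out of the expert, matching the export rule stated in the list above.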
5.6 Summary 
This chapter has presented a computer architecture designed to support concurrent 
object-oriented programming. This architecture is motivated by the latency-sensitive 
nature of the algorithms developed in Chapters 3 and 4 and the wire-limited nature of 
VLSI technology. 
In Section 5.3 I analyze concurrent computer interconnection networks to determine 
what network topology gives the lowest latency for a given amount of wire. The analysis 
is restricted to k-ary n-cube networks because the dimension of the network is more 
important than the details of the topology. In Section 5.3.1 I use a wire-cost model to
analyze k-ary n-cube networks and derive the result that keeping wire cost constant,
low-dimensional networks have lower latency than do high-dimensional networks. The
lowest latency occurs when the component of latency due to message length, L/W, is nearly
the same as the component of latency due to distance, D. Low-dimensional networks also
provide higher hot-spot throughput because each communication channel is shared by
more processor pairs. The average throughput of a network is independent of dimension. 
If low-dimensional, high-radix networks are to be used in message-passing concurrent 
computers, a deadlock-free routing algorithm for these networks is required. Avoiding 
deadlock is a more difficult problem in these networks than in conventional computer 
networks because these networks use wormhole routing to achieve low latency rather than 
the store-and-forward routing used in traditional computer networks. With wormhole
routing, queueing is allocated on the basis of flits that cannot be interleaved with flits of 
other messages, while in conventional networks, resources are allocated on the basis of 
packets that can be interleaved. In Section 5.3.2 I develop a novel deadlock-free routing 
algorithm based on the concept of virtual channels. By multiplexing two virtual channels
on each physical communications channel and by making routing a function C × N → C
rather than the traditional N × N → C, this algorithm converts cycles in the channel
dependency graph, D, into spirals, thus avoiding deadlock.
I have developed the Torus Routing Chip (TRC), described in Section 5.3.3, to demon-
strate the feasibility of the type of network described in this chapter. The TRC combines 
many novel features. 
• It is completely self-timed [109]. 
• It uses wormhole routing [113]. 
• It implements the virtual channel deadlock-free routing algorithm [24] in hardware. 
TRCs have been fabricated and they operate properly. 
In addition to minimizing network latency, the latency of each processing node must 
also be minimized by matching the architecture of the processor to the semantics of the 
programming model. Section 5.4 outlines the architecture of a message-driven processing 
element that responds directly to messages rather than interpreting messages using a 
conventional instruction processor. 
To take advantage of the performance offered by specialization while at the same time 
retaining flexibility, processing elements can be specialized to operate on a single class
of object. These object experts, Section 5.5, by accelerating common object classes,
improve the performance of all applications using those classes. Object experts also
promote locality by storing the objects local to the hardware that modifies them. 
Chapter 6 
Conclusion 
The performance of computers can be made incrementally extensible by exploiting VLSI 
technology to build concurrent computers, ensembles of processing nodes connected by 
a network. These concurrent computers can be programmed by combining concurrent
data structures. The problems of communication and synchronization are encapsulated
in the data structure, leaving the programmer free to concentrate on problems specific
to his/her application. 
This thesis has developed a paradigm for programming concurrent computers: concur-
rent data structures. To describe concurrent data structures, a programming notation, 
Concurrent Smalltalk (CST), has been developed incorporating the concept of a
distributed object. A distributed object is a single object consisting of a collection of
constituent objects, each of which can receive messages sent to the distributed object. Thus
distributed objects can process many messages simultaneously. They are the foundation 
upon which concurrent data structures are built. 
The balanced cube is a concurrent data structure for ordered sets. It achieves
concurrency by eliminating the root bottleneck of tree-based data structures. A balanced cube
has no root; all nodes are equals. An ordered set is represented in a balanced cube by
mapping elements of the set to right subcubes of the balanced cube using a Gray code.
The VW search algorithm, based on the distance properties of the Gray code, searches
a balanced cube in logarithmic time. This search algorithm can be initiated from any
node and will uniformly distribute activity over the nodes of the cube. The B-cube is an
extension of the balanced cube that stores several data in each node to match the grain
size of the data structure to the grain size of a particular computer. The balanced cube
is an example of a concurrent data structure that differs markedly from its sequential 
counterparts. 
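The distance property that the search relies on can be illustrated with a small sketch. It assumes the standard binary-reflected Gray code; the function names are hypothetical, and the mapping of set elements onto subcubes developed in Chapter 3 is not reproduced here.

```python
def gray_encode(i: int) -> int:
    """Binary-reflected Gray code of a non-negative integer."""
    return i ^ (i >> 1)


def gray_decode(g: int) -> int:
    """Invert gray_encode by folding successively shifted copies back in."""
    i = 0
    while g:
        i ^= g
        g >>= 1
    return i


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


# Consecutive integers map to codes that differ in exactly one bit,
# that is, to neighbouring node addresses in a binary n-cube.
codes = [gray_encode(i) for i in range(8)]
```

Because adjacent codes are always at Hamming distance one, neighbouring positions in an ordering can be placed on neighbouring nodes of a cube.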
Concurrent graph data structures can be used to solve many combinatorial problems. In 
Chapter 4 concurrent algorithms for the shortest path problem, the max-flow problem, 
and graph partitioning were developed. These graph algorithms illustrate many of the 
synchronization problems encountered in concurrent programming. Consider, for example,
the shortest path problem. Dijkstra's sequential algorithm [27] cannot directly be
made concurrent because it depends on a total order of events. It is too tightly
synchronized. A concurrent algorithm due to Chandy and Misra [15] that relaxes this ordering
of events but introduces no other synchronization may require exponential time because
it is too loosely synchronized.
The concurrent algorithms described here are characterized by short messages and short 
methods. Supporting this fine-grain concurrency requires a low-latency interconnection 
network for efficient execution. Because VLSI technology is wire-limited, alternative 
architectures must be compared keeping wire cost constant. Consider the family of k-
ary n-cube networks: networks with n dimensions and k processors in each dimension. 
High-dimensional networks with narrow channels are compared against low-dimensional
networks with wide channels. The minimum latency occurs when the delay due to 
message length, L/W, is nearly equal to the delay due to the distance traveled, D. This
minimum occurs at a surprisingly low dimension. For small networks, 1000 processors 
or less, the minimum latency is achieved with a two-dimensional network. Even for very 
large concurrent computers, networks with 4 or 6 dimensions are sufficient. In addition 
to providing low latency, low-dimensional networks have several other advantages. They
are easy to construct, since they fit into a plane with fewer folds than high-dimensional
networks. Two-dimensional networks are particularly easy to construct since they fit
into the plane with no folds, and all channels are the same length. Low-dimensional
networks are also easier to interface to and control. Since they have fewer channels per 
node, they require less control logic to manage communications. 
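The trade-off between distance and channel width can be explored numerically. The following is a rough sketch and not the model of Chapter 5: it assumes an average distance of n(k-1)/2 hops, a channel width of k/2 (normalized so that a binary n-cube has unit-width channels at equal wire cost), a unit cycle time per hop, and a hypothetical 150-bit message. All function and parameter names are illustrative.

```python
def latency(n: int, k: int, msg_bits: int) -> float:
    """Latency in channel cycles for a k-ary n-cube under the assumed model."""
    distance = n * (k - 1) / 2   # assumed average hop count
    width = k / 2                # assumed channel width at constant wire cost
    return distance + msg_bits / width


def best_dimension(num_nodes: int, msg_bits: int) -> int:
    """Return the dimension n with the lowest modeled latency."""
    best = None
    for n in range(1, num_nodes.bit_length()):
        k = round(num_nodes ** (1.0 / n))
        if k < 2 or k ** n != num_nodes:
            continue             # skip n that do not give an integer radix
        t = latency(n, k, msg_bits)
        if best is None or t < best[0]:
            best = (t, n)
    return best[1]
```

Under these assumptions, `best_dimension(1024, 150)` selects n = 2: the 32-ary 2-cube pays 31 cycles for distance but only about 9 for message length, whereas the binary 10-cube pays 5 cycles for distance and 150 for message length.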
Virtual channels can be used to construct deadlock-free routing algorithms for all strongly 
connected interconnection networks including k-ary n-cubes. By making routing a func-
tion of the channel on which a message arrives at a node, and by multiplexing several 
virtual channels over a single physical channel, the cycles in a channel dependency graph 
can be broken into spirals to avoid deadlock. A virtual channel routing algorithm has 
the advantage that it can be used with wormhole [113] routing. In a wormhole network, 
flow control is performed on flits that cannot be interleaved. Conventional structured
buffer pool deadlock avoidance algorithms are designed for store-and-forward networks, 
where flow control is performed at the level of packets that can be interleaved. These 
algorithms depend on the ability to interleave packets and thus cannot handle wormhole 
routing, since flits cannot be interleaved. The torus routing chip (TRC), a self-timed 
VLSI chip, has been developed to demonstrate the feasibility of wormhole routing and 
virtual channels. 
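The effect of virtual channels on the channel dependency graph can be checked mechanically on a small example. The sketch below is an illustration constructed for this summary, not the TRC's routing logic: it assumes a unidirectional ring of n nodes, writes the channel choice as a function of the current node and the destination for brevity, and uses hypothetical channel labels `c`, `hi`, and `lo`.

```python
def route_single(cur: int, dest: int, n: int):
    """One channel per link: always the link leaving node cur."""
    return ("c", cur)


def route_virtual(cur: int, dest: int, n: int):
    """Two virtual channels per link, chosen by comparing cur and dest."""
    # 'lo' is used while the wrap-around link still lies ahead, 'hi' after it.
    return ("hi", cur) if cur < dest else ("lo", cur)


def dependency_graph(route, n: int):
    """Edges (a, b): some message holds channel a while requesting channel b."""
    edges = set()
    for src in range(n):
        for dest in range(n):
            if src == dest:
                continue
            cur, prev = src, None
            while cur != dest:
                chan = route(cur, dest, n)
                if prev is not None:
                    edges.add((prev, chan))
                prev, cur = chan, (cur + 1) % n
    return edges


def has_cycle(edges) -> bool:
    """Depth-first search for a cycle in a directed graph."""
    graph = {}
    for a, b in edges:
        graph.setdefault(a, []).append(b)
        graph.setdefault(b, [])
    state = {v: 0 for v in graph}  # 0 = unseen, 1 = on stack, 2 = finished

    def visit(v) -> bool:
        state[v] = 1
        for w in graph[v]:
            if state[w] == 1 or (state[w] == 0 and visit(w)):
                return True
        state[v] = 2
        return False

    return any(state[v] == 0 and visit(v) for v in graph)
```

With a single channel per link the dependencies close into a ring; with the two virtual channels they form one open chain from the `lo` channels through the wrap-around link into the `hi` channels, so no cycle remains.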
Low-latency processing elements are required to support fine-grain concurrent compu-
tation. A conventional processor executes about one hundred instructions to receive 
and interpret a single message. A message-driven processor directly interprets messages, 
eliminating this interpretation overhead. The instructions of a message-driven processor 
are messages. By performing automatic message reception and buffering, accelerating
message lookup with an instruction translation lookaside buffer, and providing addressing
modes for fast access to the context and receiver, the architecture of a message-driven
processor is matched to the semantics of CST. 
A machine with low-latency communications channels and processing elements is capable 
of supporting instruction-size granularity. In the past, concurrent computation has been 
performed with process-size granularity. With finer-grain concurrency, less memory is 
required at each processing node. Current machines require a large memory at each 
node to support a grain size large enough to keep their high latency from dominating 
computation time. Some argue that a large memory is required to store a copy of the 
operating system at each node; however, such a practice is wasteful. By properly layering 
the operating system, only a few bottom-level modules, e.g., method lookup and mail 
delivery, need to be replicated in every processing node. Higher level modules can be 
stored in a single processing node and cached in other nodes as required. 
VLSI technology, being wire-limited, encourages specialization and locality. Storing data 
local to the logic that manipulates it results in shorter wires. A special-purpose VLSI 
chip has a fixed communication pattern and thus can make better use of the limited wires 
than a general-purpose chip that must support many different communication patterns. 
Unfortunately the high cost of designing a VLSI chip makes it impractical to build 
special-purpose VLSI chips for every application. However, specialization can be applied
to many applications by building VLSI chips to accelerate operations on common classes 
of objects. These object experts can be shared among applications, offering performance 
comparable with special-purpose hardware while retaining much of the flexibility of a 
general-purpose machine.
Computer architecture encompasses the design of programming languages, data
structures, and algorithms, as well as hardware. The approach taken here is to start with
a programming paradigm, concurrent data structures, develop a notation, CST, write
algorithms using this notation, and finally to organize hardware to support these
algorithms. In contrast, many computer architects restrict themselves to the last step. They
analyze existing algorithms and fine-tune architectures to execute these algorithms. The 
problem with this evolutionary approach is that it leads to inbreeding, amplifying both 
the good and bad features of existing computer architectures. The algorithms analyzed 
are optimized to run on the previous generation of machines, which were fine-tuned to 
execute the previous generation of algorithms, and so on. Each generation, algorithms 
are designed to make the best use of the good features of the machine and to avoid the 
bad. The next generation of machines, based on these algorithms, makes the good
features better and ignores the bad since they were not frequently used by the algorithms.
The worst effect of this approach to architecture is that it discourages new program-
ming language features. Late-binding programming languages, for example, are often 
judged to be inefficient because they cannot be efficiently implemented on conventional 
machines. Late-binding languages are not inefficient; conventional architectures are just
not well matched to these languages. 
Powerful software features such as late-binding operators and automatic storage
management that are often cited as inefficient need not be slow. These features can be made
very efficient with a modest amount of hardware support. In fact, high-level features can
lead to a more efficient computing system by replacing many ad hoc mechanisms with
a single mechanism that can then be implemented in hardware. The key to a successful
architecture is to identify a few simple mechanisms that can be accelerated by hardware. 
A VLSI architecture must match the physical form of a machine to its logical function. 
Traditionally computer architects and designers have concentrated on the logical
organization of machines, giving little consideration to their physical design. With VLSI
technology this is no longer possible. VLSI technology is wire limited. To make best use
of wiring resources, architects must carefully plan the physical design of their machines.
For example, consider the interconnection networks analyzed in Chapter 5. Consider-
ing just the logical organization of the network, one quickly deduces (as Lang [79] did) 
that binary n-cubes offer superior performance because of their smaller logical diameter. 
When the physical implementation of the network is considered, however, one finds that 
in fact low-dimensional networks offer better performance because they make better use
of their wires. The short logical diameter is no longer a great advantage since, after being
embedded into a two- or three-dimensional implementation space, all network topologies 
have the same physical diameter. 
Many experiments are required to refine the ideas presented in this thesis. A first step 
is to implement a compiler and run-time system to run CST on an existing concurrent 
machine such as the Caltech Cosmic Cube [112]. Because of the high latency of Cos-
mic Cube communication channels and the mismatch between CST semantics and the 
architecture of the Intel 8086-based processing nodes, such a system will be quite
inefficient. Nevertheless this programming system will be used to gain practical experience in
concurrent object-oriented programming and in building systems out of concurrent data 
structures. 
The next step is to build hardware to improve the efficiency of the system. This is best 
done in stages. 
1. Provide a low-latency communication facility by building a concurrent computer 
using the TRC for communications and a commercial microprocessor, such as the 
Motorola 68000 [92], for a processing element. Such a machine could be built in 
a relatively short time frame and would provide valuable experience in using a 
low-latency communication network. 
2. Build a message-driven processor to complement the low latency of the TRC-
based network. This machine will provide an efficient environment for fine-grain 
concurrent object-oriented programming and will provide further experience with 
this programming style. 
3. Provide a powerful machine for demanding applications by constructing object
experts for several commonly used classes such as floating point vectors and ordered 
sets. 
The availability of a machine comparable to (3) above will stimulate much research 
on concurrent software. Concurrent operating systems will evolve to support fine-grain 
object-oriented programming. To run in a fine-grain machine with limited storage in each
processing node, operating systems will be partitioned into layers with only the bottom
layer duplicated in each node. Memory management functions will make the partitions
between processing nodes invisible to user programs by maintaining a single name space
across the machine. The system will relocate objects as required to make efficient use of
memory and processing resources, dynamically balancing the load across the processing
nodes. Systems will evolve to the point where a host is no longer required. Input/output
devices will connect directly to processing nodes and will appear as objects to the system. 
Methods will be edited and compiled directly on the concurrent computer. 
One fertile area for further research is the development of concurrent computer-aided-
design (CAD) applications. The exponential growth in the complexity of VLSI systems 
that has made possible the construction of the machines described here has also exceeded 
the capacity of sequential CAD programs. For example, verification of a 10^5-transistor
VLSI chip by logic simulation takes several weeks of CPU time. Since simulation time
grows as the square of device complexity, one can project that a 10^6-transistor chip will
require several years to verify. Concurrent CAD programs will give performance several 
orders of magnitude better than sequential applications, reducing verification time from 
years to days. More importantly, concurrent applications give performance that scales 
with the size of the problem. As VLSI chips become more complex, we will construct 
larger concurrent computers to design these chips. We will apply VLSI technology to
solve the problem of VLSI complexity. 
To exploit the low latency but high throughput of VLSI technology, we build concurrent 
computers consisting of many processing nodes connected by a network. Software is 
the real challenge in the development of these machines. It is difficult to focus the 
activity of large numbers of processing elements on the solution of a single problem. 
This thesis proposes a solution to the problem of programming concurrent computers: 
concurrent data structures. Most applications are built around data structures. The
problem of coordinating the activity of many processing elements is solved once and
encapsulated in a class definition for a concurrent data structure. This data structure
is used to construct concurrent applications without further concern for the problems 
of communication and synchronization. The combination of VLSI and concurrency will 
make computers fast. The combination of object-oriented programming and concurrent 
data structures will make them easy to program. 
Appendix A 
Summary of Concurrent Smalltalk 
Concurrent Smalltalk (CST) is an extension of the Smalltalk-80 programming language 
[51], [52], [74], [136] that incorporates distributed objects, concurrent message sending, 
and locks. The differences between CST and Smalltalk-80 are described in Chapter 2.
This Appendix gives a brief summary of the entire programming language for those
readers not familiar with Smalltalk-80. For a more complete description of the programming
language, the interested reader should consult [51] or [136]. 
Classes 
A CST program consists of a set of class declarations. Each class declaration describes 
the state and behavior of a class of objects and has the form shown in Figure A.1. The
declaration contains the name of the class's superclass, specification of the class object, 
and specification of each instance of the class. 
class: The class name identifies the class object, the object that contains the class
variables and implements the class methods. The class object name is capitalized
since the class is a global object¹ and by convention names of shared variables are
capitalized. 
superclass: The superclass name identifies the superclass from which the current class 
inherits variables and methods. The current class is declared as an extension of the 
superclass. All class and instance variables declared in the superclass are added 
to the lists specified in the class declaration. All class and instance methods that 
are not overridden in the class definition are also inherited from the superclass. 
The inheritance can extend through many levels of the superclass hierarchy, with
the current class inheriting methods and variables from the superclass that were
in turn inherited from the superclass' superclass, and so on.
¹ Smalltalk could be greatly improved by adding some type of scoping to class names so that a user could
locally override a class in an application without changing the class used by the rest of the system.
class                 <identifier>      the class name
superclass            <identifier>      name of the superclass
instance variables    <identifier> *    state of each instance object
class variables       <identifier> *    state of the class object
locks                 <identifier> *    locks controlling access to each instance object
class methods
    class methods ...
instance methods
    instance methods ...
Figure A.1: Class Declaration
instance variables: The private memory of each instance of the defined class. For exam-
ple, if we define a class Point with instance variables x and y, then each instance 
of class Point is created with two local variables named x and y distinct from the 
variables in any other instance. The instance variables specified in this declaration 
are in addition to any instance variables specified by the superclass. 
class variables: Variables shared by the class object and all instances of the class. There
is only one instance of each class variable. This single copy of a class variable
can be accessed by any instance of the class. Class variables are capitalized since
they are shared variables. 
locks: Locks are special instance variables that control concurrent access to objects. 
class methods: Methods that define the behavior of the class object. Each method spec-
ifies a number of expressions to be performed in response to a message. Typically 
class methods handle tasks such as object creation.
instance methods: Methods that define the behavior of each instance of the class.
Messages 
Everything in CST is done by passing messages. Sending a message to an object causes 
the object to execute one of its methods. A message has three parts: 
receiver: The object to which the message is being sent. 
selector: The type of message. The selector specifies the method the receiver is to exe-
cute. 
arguments: Additional data required for the receiver to execute the method specified by 
the selector. 
Here are some examples of message expressions. 
theta sin 
This message expression sends the message with selector sin and no arguments to theta, 
the receiver. A message like sin that has no arguments is called a unary message. In a 
unary message, the selector follows the receiver. 
a + b
The receiver, a, is sent the message containing the selector, +, with argument, b. A 
message like + b, where there is a single argument and the selector consists of one or 
two special characters, is called a binary message. 
foo at: 10 put: 'hello' 
This keyword message sends the message with selector at:put: to object foo with argu-
ments 10 and 'hello'. In keyword messages the selector consists of a keyword before each 
argument. Each keyword is terminated by a colon, ':'. 
When an object receives a message, it looks up and executes the method that matches 
the message selector. The method lookup begins with the receiver first checking its 
own instance methods. If the method is not found in the receiver's class, the instance 
methods defined in the superclass are checked, and so on. 
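The lookup rule above can be sketched in a few lines. This is a simplified model with hypothetical names (`ClassObject`, `lookup`, `send`); raising an error when no class in the chain defines the selector is an assumption, since the text does not say what happens in that case.

```python
class ClassObject:
    """Hypothetical stand-in for a class object with a method table."""

    def __init__(self, name, superclass=None, methods=None):
        self.name = name
        self.superclass = superclass
        self.methods = dict(methods or {})


def lookup(cls, selector):
    """Search the receiver's class first, then each superclass in turn."""
    while cls is not None:
        if selector in cls.methods:
            return cls.methods[selector]
        cls = cls.superclass
    # Assumption: an unmatched selector is reported as an error.
    raise AttributeError(f"message not understood: {selector}")


def send(receiver_class, receiver, selector, *args):
    """Look up the method for selector and execute it on the receiver."""
    return lookup(receiver_class, selector)(receiver, *args)


# A two-level hierarchy: Integer inherits abs from Number.
number = ClassObject("Number", methods={"abs": lambda r: r if r >= 0 else -r})
integer = ClassObject("Integer", number, {"+": lambda r, b: r + b})
```

Here `send(integer, -3, "abs")` finds no `abs` in `integer` and falls through to `number`, mirroring the walk up the superclass chain.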
Literals 
The receiver and arguments in a message expression may be variables, pseudo-variables, 
or literals. CST supports the following types of literals: 
numbers: Numbers consist of an optional sign, an optional radix, an integer part, an 
optional fraction part, and an optional exponent. Here are some examples of 
numbers. 
17                  an integer, radix 10 is default
16rFF               a radix 16 (hexadecimal) integer
3.14159265358979    pi
-10.1e-2            -0.101
2r101e2             2r10100 or 20
characters: Character literals consist of a dollar sign, '$', followed by any character, e.g.,
$A.
strings: String literals consist of a sequence of characters delimited by single quotes, e.g.,
'Hello World'. To insert a single quote into a string it is duplicated, e.g., 'don''t'.
symbols: Symbols or atoms consist of a hash mark followed by the name of the symbol,
e.g., #slave.
arrays: A sequence of literals is denoted by the sequence with hash marks removed
enclosed in parentheses, '()', and preceded by a hash mark, e.g., #(1 2 slave $A
'Element' 2r10001 (1 2 3)).
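The number syntax can be made concrete with a small parser. This is a sketch under stated assumptions: the exponent is taken to scale by a power of the literal's own radix (which is what makes 2r101e2 equal to 20), digits above 9 are upper-case letters, and the names `parse_number` and `_digit` are hypothetical. Exact rationals are used to avoid floating-point rounding.

```python
import re
from fractions import Fraction

_DIGITS = "0123456789ABCDEFGHIJKLMNOPQRSTUVWXYZ"
_NUMBER = re.compile(
    r"(?P<sign>-?)(?:(?P<radix>\d+)r)?"
    r"(?P<int>[0-9A-Z]+)(?:\.(?P<frac>[0-9A-Z]+))?"
    r"(?:e(?P<exp>-?\d+))?$"
)


def _digit(ch: str, radix: int) -> int:
    """Value of one digit character, checked against the radix."""
    d = _DIGITS.index(ch)
    if d >= radix:
        raise ValueError(f"digit {ch!r} out of range for radix {radix}")
    return d


def parse_number(text: str) -> Fraction:
    """Parse sign, radix, integer part, fraction part, and exponent."""
    m = _NUMBER.match(text)
    if m is None:
        raise ValueError(f"not a number literal: {text!r}")
    radix = int(m["radix"] or 10)
    value = Fraction(0)
    for ch in m["int"]:
        value = value * radix + _digit(ch, radix)
    scale = Fraction(1, radix)
    for ch in m["frac"] or "":
        value += _digit(ch, radix) * scale
        scale /= radix
    # Assumption: the exponent is a power of the radix, not of ten.
    value *= Fraction(radix) ** int(m["exp"] or 0)
    return -value if m["sign"] else value
```

For example, `parse_number("16rFF")` gives 255 and `parse_number("-10.1e-2")` gives the exact rational -101/1000.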
Assignment 
To simplify assignment of values to variables, CST permits the result returned by a 
method to be assigned to a variable by using the backarrow, '←', character. For example,
the message
a ← 3 + 2.
assigns to variable a the result of sending the message + 2 to the object 3. Assignment 
can be thought of as sending an at: variable put: expression message to the current 
environment. 
Messages that do not include an assignment do not generate a reply. To wait for a 
message that returns no value, the message is preceded by a backarrow, '←', with no
variable. For example, the message
← aRectangle display.
sends a display message to aRectangle and expects a reply from this message. The message
aRectangle display.,
on the other hand, does not expect a reply.
instance methods for Integer
rangeProduct: upperBound                          locks would go here
    | midPoint upperProd lowerProd |
    self = upperBound ifTrue: [
        ↑self]
    ifFalse: [
        midPoint ← self + upperBound // 2.
        lowerProd ← self rangeProduct: midPoint,
        upperProd ← midPoint rangeProduct: upperBound.
        ↑lowerProd * upperProd.]
instance methods for Interval
contains: aNum                                    tests for number in interval
    require rwlock.
    | lin uin |
    lin ← l ≤ aNum.
    uin ← u ≥ aNum.
    ↑(lin and: uin)
Figure A.2: Methods
Methods
An object's protocol² is defined by the instance methods in the class declaration. Two
example method descriptions are shown in Figure A.2. The first method calculates 
the product of a sequence of integers beginning with the receiver and ending with the 
argument upperBound. This definition of the message rangeProduct: follows that of 
Theriault [126]. The second method is the contains method for class Interval described 
in Chapter 2 (Figure 2.3 on page 18). This method tests if a number is contained in a 
closed interval of numbers. 
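For readers more comfortable with a conventional language, the two methods can be paraphrased as follows. This is a sequential sketch, not a translation of CST semantics: the two recursive calls are independent and could proceed concurrently, but they are simply evaluated in order here. It also assumes the upper half starts at `mid + 1` so that the halves do not overlap and the recursion terminates; the function names are hypothetical.

```python
def range_product(lower: int, upper: int) -> int:
    """Product of the integers lower..upper inclusive (requires lower <= upper)."""
    if lower == upper:
        return lower
    mid = (lower + upper) // 2
    # The two halves do not depend on each other, so a concurrent
    # implementation could issue both requests before waiting for a reply.
    lower_prod = range_product(lower, mid)
    # Assumption: start the upper half at mid + 1 to avoid overlap.
    upper_prod = range_product(mid + 1, upper)
    return lower_prod * upper_prod


def contains(low, high, a_num) -> bool:
    """True when a_num lies in the closed interval [low, high]."""
    return low <= a_num and high >= a_num
```

Splitting the range in half gives a recursion depth that is logarithmic in the number of factors, which is what makes the concurrent form attractive.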
Each method description consists of the following parts: 
header: The method header consists of the selector that activates the method with
pseudo-variables in place of arguments. When a message is received by an object,
the object's method with the corresponding header is activated. Message arguments
are bound to the pseudo-variables in the method header. Pseudo-variables
are like instance variables except that they cannot be assigned to. For example,
the header rangeProduct: upperBound specifies that the following method will be
executed in response to a message with selector rangeProduct: and the
pseudo-variable upperBound will be bound to the argument of the message.
² An object's protocol consists of the messages understood by an object.
concurrency control: An optional concurrency control line specifies a required set of 
locks, an excluded set of locks, and an optional escape expression. The method is 
allowed to execute only when no currently pending method requires an excluded 
lock or excludes a required lock. If the method is locked out, the escape expres-
sion is executed or, if no escape expression is present, the method is suspended. 
Because the contains: method requires rwlock and specifies no escape, it will be 
suspended if some previous method excluded rwlock and will be restarted only 
when all such methods have completed. 
local variables: Local variables are declared between two vertical bars, '|'. For example,
the rangeProduct: method declares three local variables, midPoint, upperProd, and
lowerProd.
message expressions: The remainder of the method consists of message expressions.
Messages are separated by commas, ',', or periods, '.'. A comma between two messages
means that the second message can be sent before receiving a reply from the pre-
vious message. When a period follows a message, replies must be received from 
all previous messages whose results are assigned with a backarrow before the next 
message can be sent. For example, the rangeProduct: method sends messages 
to self and midPoint concurrently and then waits for replies from both messages 
before multiplying the two results. 
Messages may be nested within other messages. The reply of one message, A, 
may specify the receiver or argument of another message, B. For example, the 
message 
self = upperBound ifTrue: [· · ·] ifFalse: [· · ·].
in method rangeProduct: first sends the message, = upperBound, to self and then
sends the ifTrue:ifFalse: message to the reply of this first message. Three rules
govern the parsing of these compound messages:
1. Any messages enclosed in parentheses, '()', are evaluated before the messages
outside the parentheses. 
2. Unary messages take precedence over binary messages, and binary messages 
take precedence over keyword messages. 
3. For messages of equal precedence, evaluation proceeds from left to right. 
Two special identifiers allow a method to refer to the receiver. 
self is an expression that specifies the receiver. 
super also specifies the receiver, but messages sent to super are interpreted by 
looking up the method beginning in the receiver's superclass. Messages to
super are often used to inherit a method from the superclass while making
additions in the subclass.
Messages return a value by preceding a message expression with an uparrow,
'↑'. The value returned by the following message is in turn returned by the
method. Preceding a variable by an uparrow returns the value of the variable. A
downarrow, '↓', causes a method to terminate without returning a value.
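The second and third parsing rules given above have a consequence that surprises readers used to arithmetic precedence: all binary messages bind equally, so 3 + 4 * 2 evaluates to 14. The sketch below illustrates this with a deliberately tiny evaluator. It handles only space-separated integer literals, a few unary messages, and a few binary messages; parentheses and keyword messages are omitted, and the tables `BINARY` and `UNARY` are hypothetical.

```python
import operator

BINARY = {"+": operator.add, "-": operator.sub, "*": operator.mul}
UNARY = {"negated": operator.neg, "abs": abs}


def evaluate(expression: str) -> int:
    """Evaluate integers combined with unary and binary messages."""
    tokens = expression.split()
    pos = 0

    def operand() -> int:
        # An integer literal followed by any number of unary messages,
        # which bind more tightly than binary messages.
        nonlocal pos
        value = int(tokens[pos])
        pos += 1
        while pos < len(tokens) and tokens[pos] in UNARY:
            value = UNARY[tokens[pos]](value)
            pos += 1
        return value

    result = operand()
    while pos < len(tokens):
        selector = tokens[pos]
        pos += 1
        # Binary messages have equal precedence: apply strictly left to right.
        result = BINARY[selector](result, operand())
    return result
```

With this evaluator, `evaluate("3 + 4 * 2")` is 14 and `evaluate("3 + 4 negated * 2")` is -2, since `negated` is applied to 4 before either binary message is sent.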
Blocks 
Like Smalltalk-80, CST has no built-in control structures. Instead, control structures are 
built by sending messages to blocks. Blocks are deferred sequences of message expressions 
that are executed when they are sent a value message. Blocks are like methods in that 
they have arguments, locks, and local variables; unlike methods, however, blocks may 
have free variables that are lexically scoped. That is, a block may refer to the local 
variables of the method in which it is defined. Here is an example block: 
[:edge | require rwlock :var1 |
var1 ← edge flow.
var1 > 0 ifTrue: [↓]].
Blocks are enclosed in square brackets, '[]', and consist of the following parts.
argument list: The optional argument list specifies the names of pseudo-variables that 
are bound to arguments passed into the block with a value: arg message. Each 
identifier in the list is preceded by a colon, ': '. For example, in the block above 
the pseudo-variable edge is an argument. If this block is sent the message value: 
anEdge, the block will be executed with pseudo-variable edge bound to object
anEdge.
concurrency control: Like methods, blocks may optionally specify two sets of locks to
control concurrent access to the block. 
local variables: The optional variable list consists of a list of identifiers preceded by 
colons. Local variables exist only for one activation of the block. Each time a 
block receives a value message, it creates a new context with a new set of local 
variables, all initialized to nil. 
message expressions: The remainder of the block consists of a sequence of message
expressions. The sequence is interpreted as in a method except that uparrow, '↑',
returns out of the method calling the block and downarrow, '↓', breaks out of the
block or method calling the block. The last message expression in the block is
the value of the block expression.
A block is activated by sending it a value message. When a block receives a value message, 
the arguments of the message are bound to the arguments of the block and the message 
expressions in the block are executed. 
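Blocks behave much like closures in other languages, which the following sketch uses as an analogy. It models only deferred execution, argument binding, and lexically scoped free variables; locks and the uparrow and downarrow exits are left out. The class `Block` and the method `while_true` (standing in for the whileTrue: message listed under Common Messages) are hypothetical names.

```python
class Block:
    """Hypothetical stand-in for a CST block: a deferred body."""

    def __init__(self, body):
        self.body = body            # a callable closing over its defining scope

    def value(self, *args):
        """Bind the arguments, run the body, and reply with its result."""
        return self.body(*args)

    def while_true(self, a_block):
        """Evaluate a_block for as long as the receiver evaluates to true."""
        while self.value():
            a_block.value()


def sum_to(limit: int) -> int:
    """Add 1..limit using two blocks that share the method's local variables."""
    total, i = 0, 1

    def step():
        nonlocal total, i           # free variables of the enclosing method
        total += i
        i += 1

    Block(lambda: i <= limit).while_true(Block(step))
    return total
```

Both blocks in `sum_to` refer to `i` in the enclosing function, which corresponds to a block referring to the local variables of the method in which it is defined.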
Distributed Objects 
A distributed object is a collection of constituent objects (COs) that receive messages
sent to the distributed object. Because a distributed object contains many independent 
constituents, it can process many messages simultaneously. 
Distributed objects are declared as subclasses of class DistributedObject. A new dis-
tributed object is created by sending the newOn: message to the appropriate class object. 
For example, a new instance of a TallyCollection (described in Figure 2.1 on page 14) is
created with the message
aTallyCollection ← TallyCollection newOn: someNodes.
The argument of the newOn: message, someNodes, is a collection of processing nodes. 
The newOn: message creates a CO on each member of someNodes. 
When a message is sent to a distributed object, it may be delivered to any constituent 
of that object³. It is possible to send a message to a specific constituent of a distributed
object by indexing the object with the selector co:. For example, the message
aTallyCollection tally: 'hello'.
is sent to any constituent of aTallyCollection. The message
aTallyCollection co: 3 tally: 'hello'.
is sent to the third constituent of aTallyCollection. Constituents are indexed sequentially
beginning with one. The pseudo-variables maxId, the total number of constituents, and
myId, the index of self, are available to constituent objects for use in computing indices.
³ One hopes that the mail system will be efficient and deliver the message to the nearest CO or perhaps
the CO with the shortest message queue.
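The two delivery modes can be mimicked in a single-process sketch. Everything here is hypothetical: the class names, the counting behaviour of `tally` (the real TallyCollection is defined in Figure 2.1 and is not reproduced), and the use of a random choice to stand in for "any constituent". No actual distribution or concurrency is modeled.

```python
import random
from collections import Counter


class Constituent:
    """One constituent object (CO) holding its share of the state."""

    def __init__(self, my_id: int, max_id: int):
        self.my_id = my_id          # index of this constituent, starting at one
        self.max_id = max_id        # total number of constituents
        self.counts = Counter()

    def tally(self, key):
        self.counts[key] += 1


class DistributedTally:
    """Hypothetical distributed object built from several constituents."""

    def __init__(self, nodes):
        # One constituent per processing node, as newOn: is described to do.
        self.cos = [Constituent(i + 1, len(nodes)) for i in range(len(nodes))]

    def co(self, index: int) -> Constituent:
        """Select a specific constituent; indices begin with one."""
        return self.cos[index - 1]

    def tally(self, key):
        """Deliver to an arbitrary constituent (here chosen at random)."""
        random.choice(self.cos).tally(key)

    def total(self, key) -> int:
        """Combine the partial counts held by all constituents."""
        return sum(c.counts[key] for c in self.cos)
```

A call such as `d.co(3).tally('hello')` corresponds to the indexed form, while `d.tally('hello')` leaves the choice of constituent to the delivery mechanism.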
Common Messages 
To describe all of the classes and messages in a Smalltalk system is beyond the scope of
this appendix. I include the following list of common messages to assist the reader in 
understanding the CST code in this thesis. This list is by no means comprehensive. 
Block 
value This unary message causes a block with no arguments to be executed. 
value: anObject · · · A block with i arguments is sent a message with i value: keywords,
one for each argument. This message passes the arguments to the block and
causes the block to execute.
whileTrue: aBlock A value message is repeatedly sent to the receiver. As long as the
receiver replies with true, a value message is sent to aBlock, and the sequence is
repeated. If the receiver replies with false, the method terminates.
whileFalse: aBlock This message is similar to whileTrue: but with the receiver negated.
As long as the receiver block evaluates to false, the argument block is iterated.
Boolean 
ifTrue: aBlock Sends a value message to aBlock if the receiver is true.
ifFalse: aBlock Sends a value message to aBlock if the receiver is false. 
ifTrue: trueBlock ifFalse: falseBlock Sends a value message to trueBlock if the receiver is 
true. Otherwise, if the receiver is false, a value message is sent to falseBlock. 
ifFalse: falseBlock ifTrue: trueBlock Sends a value message to trueBlock if the receiver is
true. Otherwise, if the receiver is false, a value message is sent to falseBlock.
Number 
+ Addition. 
- Subtraction. 
* Multiplication.
/ Division. 
// Integer division rounding to -∞.
\\ Modulo (remainder of division after rounding to -∞).
quo: Integer division rounding to 0. 
rem: Modulo (remainder of division after rounding to 0). 
abs Absolute value. 
negated Additive inverse. 
reciprocal Multiplicative inverse. 
The selectors abs, negated, and reciprocal are not terminated with a colon because
they are unary messages. The selectors quo: and rem: are terminated by colons 
because they are keyword messages. 
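The two pairs of division messages differ only for operands of opposite sign. A short sketch makes the difference concrete; the function names are hypothetical, and integer operands are assumed.

```python
def floor_div(a: int, b: int) -> int:
    """Quotient rounded toward negative infinity (the // message)."""
    return a // b


def floor_mod(a: int, b: int) -> int:
    """Remainder paired with floor_div; takes the sign of the divisor."""
    return a % b


def quo(a: int, b: int) -> int:
    """Quotient rounded toward zero (the quo: message)."""
    q = abs(a) // abs(b)
    return q if (a >= 0) == (b >= 0) else -q


def rem(a: int, b: int) -> int:
    """Remainder paired with quo; takes the sign of the dividend."""
    return a - b * quo(a, b)
```

For -7 divided by 2, the floor pair gives quotient -4 and remainder 1, while the truncating pair gives quotient -3 and remainder -1. In both cases the quotient times the divisor plus the remainder recovers the dividend.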
Appendix B 
Unordered Sets 
Many applications use unordered data structures and do not require the overhead nec-
essary to support an ordered set concurrent data structure like the balanced cube of 
Chapter 3. In this appendix I present two unordered concurrent data structures. A 
dictionary can be used in applications that require a data structure to hold associations 
between objects but do not need to maintain an order relationship on the objects. A 
union-find set can be used in applications where sets of data are combined. 
B.1 Dictionaries 
A dictionary is a set of associations between pairs of objects. Each element of a dictionary 
is an ordered pair (aKey, anObject) that associates a key aKey with object anObject. A 
dictionary supports the following operations¹: 
at: aKey return the object associated with key aKey. 
at: aKey put: anObject add an object to the set. 
delete: aKey remove the object associated with key aKey from the set. 
do: aBlock send a value: anObject message to aBlock for each object in the set. 
Dictionaries represent binary relations. A common use of a dictionary is to represent 
the name-of relation by binding symbols to names. For example, the symbol table in a 
compiler is a dictionary. 
¹The complete protocol of class Dictionary is given in Chapters 9 and 10 of [51]. Most of the protocol 
is omitted here for the sake of brevity. 
[Diagram labels: table, depth bits, aKey, hKey, node] 
Figure B.1: A Concurrent Hash Table 
Dictionaries can be implemented using a variety of data structures including radix search 
tries [107], binary search trees [2], and hash tables [107]. Hash tables are usually the 
structure of choice for sequential machines. The expected case access time for a hash 
table is O(1) compared to O(log N) for the binary search tree, and hash tables make 
more efficient use of memory than radix search tries. In the past, one objection to hash 
tables was their fixed size; however, the recent development of extendible hashing [37], 
[83] makes hash tables efficient even for sets that change size dynamically. 
Ellis has developed concurrent algorithms for extendible hashing [34] and linear hashing 
[33]. These algorithms involve locking schemes and protocols to support concurrent 
access to a shared hash table. Like most work on databases, this work assumes a disk-
based system where multiple processes may compete for access to shared disk pages and 
is not directly applicable to concurrent computers. 
Unlike most sequential data structures, the hash table is ideally suited for a concur-
rent implementation. The table is homogeneous and can be distributed uniformly over 
the nodes of a concurrent computer. Hash tables, unlike tree structures, have no root 
bottleneck. 
A concurrent implementation of a hash table using a variant of bounded index hashing 
[83] is shown in Figure B.1. To see how this structure is used, consider the at: method 
for distributed object HashTable shown in Figure B.2. Search key aKey is converted to 
a hashed key hKey by sending it the message hash. The low-order bits of hKey are used 
to find the node that contains the data, while the next depth bits of hKey find the head 
of a linked list within the node. A linear search of this list is performed to return the 
object associated with aKey, or nil if this object is not found. The at:put: and delete: 
methods are obvious extensions of the at: method. 
    class                 HashTable            a distributed object 
    superclass            Dictionary 
    instance variables    table                table of links {key, data, next} 
                          depth                log of table size 
    class variables       none 
    locks                 rwlock               implements readers and writers 

    instance methods 

    at: aKey                                   find anObj in hash table 
        | hKey | 
        hKey ← aKey hash.                      compute hashed key of object 
        (self at: (hKey mod maxId)) find: aKey at: (hKey / maxId). 
                                               in proper node, find object 

    private instance methods 

    find: aKey at: hKey 
        require rwlock 
        | link | 
        link ← table at: (hKey \\ 2^depth). 
        [link isNil] whileFalse: [ 
            (link key = aKey) ifTrue: [requester reply: link data]. 
            link ← link next.] 
        requester reply: nil. 

Figure B.2: Concurrent Hashing 

An extendible hash table [37] is implemented in each node. Each node's table is initialized 
to size $2^{depth}$. When the number of entries increases beyond $\alpha 2^{depth}$ for some constant 
$\alpha$, depth is incremented and the size of the table is doubled. The objects in the table 
need not be rehashed as the table grows. Doubling the size of the table simply increases 
by one the number of significant bits of hKey. The new entries in the table initially 
duplicate the old entries. As accesses are made to the table, the linked lists are split to 
shorten the access paths. 
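The two-level lookup and the doubling without rehashing can be sketched sequentially in Python. This is only an illustration under several assumptions that are not in the original: Python lists stand in for the linked lists, the nodes are ordinary objects rather than processing nodes, the growth constant alpha is set to 2, and the names Bucket, NodeTable, max_id, at_put, and _route are hypothetical.

```python
class Bucket:
    """Entries (hashed key, key, data) that agree in their low `bits` hash bits."""

    def __init__(self, bits):
        self.bits = bits
        self.items = []


class NodeTable:
    """Local table of one node; it doubles without rehashing its entries."""

    def __init__(self, depth=1, alpha=2):
        self.depth = depth
        self.alpha = alpha                  # assumed growth constant
        self.count = 0
        self.slots = [Bucket(depth) for _ in range(1 << depth)]

    def _bucket(self, h):
        i = h % (1 << self.depth)
        while self.slots[i].bits < self.depth:   # shared list: split it on access
            self._split(self.slots[i], i)
        return self.slots[i]

    def _split(self, old, i):
        zero, one = Bucket(old.bits + 1), Bucket(old.bits + 1)
        for entry in old.items:
            (one if (entry[0] >> old.bits) & 1 else zero).items.append(entry)
        # Every slot that shared the old list is repointed at one of the halves.
        for j in range(i % (1 << old.bits), 1 << self.depth, 1 << old.bits):
            self.slots[j] = one if (j >> old.bits) & 1 else zero

    def get(self, h, key):
        for _, k, data in self._bucket(h).items:
            if k == key:
                return data
        return None

    def put(self, h, key, data):
        items = self._bucket(h).items
        for n, entry in enumerate(items):
            if entry[1] == key:
                items[n] = (h, key, data)
                return
        items.append((h, key, data))
        self.count += 1
        if self.count > self.alpha * (1 << self.depth):
            self.slots = self.slots + self.slots   # new entries duplicate the old
            self.depth += 1


class HashTable:
    """Low-order part of the hashed key picks the node, the rest indexes its table."""

    def __init__(self, max_id=4):
        self.max_id = max_id
        self.nodes = [NodeTable() for _ in range(max_id)]

    def _route(self, key):
        h = hash(key) & 0xFFFFFFFF
        return self.nodes[h % self.max_id], h // self.max_id

    def at(self, key):
        node, h = self._route(key)
        return node.get(h, key)

    def at_put(self, key, data):
        node, h = self._route(key)
        node.put(h, key, data)
```

Doubling only concatenates the slot array with itself, so its cost is independent of the number of stored entries; the work of separating a list is deferred to the first access that reaches it.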
The do: aBlock method broadcasts aBlock to each node of the distributed hash table 
object. Each node enumerates the objects in its local table, sending each of them to 
aBlock. If aBlock updates no instance or method variables, it can be replicated, and 
the value methods can be processed in parallel. If aBlock updates instance or method 
variables, then the executions must be synchronized. 
An operation on the table requires only two messages, a find:at: message to the node 
containing the key and the reply: message back. Thus, hashing is O(1) in the number 
of elements in the set. However, since the destination of these messages is random, each 
message travels an average distance of log N, where N is the number of nodes in the 
machine. This makes hash table access time grow O(log N) with the number of nodes 
in the machine. 
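The logarithmic growth of message distance can be checked with a small Python sketch. It assumes, beyond what the paragraph states, that the machine is a binary n-cube with N = 2^n nodes and that messages follow shortest paths, so the distance between two nodes is the Hamming distance between their addresses; the function name average_distance is hypothetical.

```python
def average_distance(n):
    """Mean Hamming distance from node 0 to a uniformly random node of a binary n-cube."""
    nodes = 1 << n
    return sum(bin(dest).count("1") for dest in range(nodes)) / nodes
```

Under these assumptions the mean is n/2, which grows as log N to within a constant factor.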
Since the hash function randomizes access to a hash table, there is very little interaction 
between concurrent hash operations. Thus, the concurrency of hashing is O(N). 
B.2 Union-Find Sets 
A union-find set, as the name implies, supports the operations of forming the union of 
two sets and finding the set to which an element belongs². 
union: aSet returns the union of the receiver with aSet. Both the receiver and aSet are 
modified to form the new set. 
add: anElement adds anElement to the receiver 
set returns the set to which the receiver belongs. Elements of the set must support 
this message as well. 
Algorithm 4.3 in [2] performs a sequence of union and find operations in time that is 
almost linear³ in the number of operations performed, approximately constant time per 
operation. Unfortunately this algorithm has very poor concurrency. Every find requires 
traversing a tree from the leaves to the root. The root serializes finds since it can only 
process one message at a time. 
To eliminate this root bottleneck, we store with each element the identity of the set 
to which the element belongs. As shown in Figure B.3, during a union operation the 
smaller set becomes a subset of the larger set. Each element of the smaller set must also 
be informed that it is now a member of the larger set. The code for these operations is 
shown in Figure B.4. 
Only the elements of the smaller set are updated during a union operation. Since each 
time an element is updated the size of the set it belongs to has at least doubled, an 
element is updated at most O(log N) times, where N is the number of elements⁴. Thus, 
if we implement each of the sets with a dictionary or other constant access time structure, 
the average time per union operation will be O(log N). Find operations require O(1) 
time. The concurrency of union-find operations depends on the balance of the resulting 
tree structure of sets. 
²As with Dictionary, this class supports a more complete protocol. 
³The time grows as $N\alpha(N)$ where $\alpha$ is the inverse of Ackermann's function and N is the number of 
operations. 
⁴A similar approach is used in Section 4.6 of [2]. 
[Figure panels: Before A union: B; After A union: B] 
Figure B.3: A Concurrent Union-Find Structure 
    class                 UnionFindSet         a distributed object 
    superclass            Set 
    instance variables    parent               parent set if not self 
    class variables       none 
    locks                 melock 

    instance methods 

    add: anObj                                 add an element or subset to the set 
        require melock 
        | | 
        anObj parent: self. 
        ↑super add: anObj 

    union: aSet                                make smaller set a subset of larger 
        require melock 
        | | 
        ((aSet size) > (self size)) ifTrue: [↑aSet union: self]. 
        ↑self add: aSet 

    private instance methods 

    parent: aSet                               inform all elements of new parent 
        require melock exclude melock 
        | | 
        parent ← aSet. 
        self do: [:each | each parent: aSet]. 

Figure B.4: Concurrent Union-Find 
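The relabeling argument of Section B.2 can be illustrated with a sequential Python sketch. It is a simplification, not the concurrent method of Figure B.4: the smaller set is flattened into the larger one instead of being kept as a nested subset, there are no locks, and the names Element, owner, updates, and find are hypothetical.

```python
class Element:
    def __init__(self, value):
        self.value = value
        self.owner = None    # identity of the set the element belongs to
        self.updates = 0     # number of times a union has relabeled this element


class UnionFindSet:
    def __init__(self):
        self.members = []

    def add(self, element):
        element.owner = self
        self.members.append(element)

    def union(self, other):
        """Merge the smaller set into the larger one and return the survivor."""
        if other is self:
            return self
        if len(self.members) <= len(other.members):
            small, large = self, other
        else:
            small, large = other, self
        for element in small.members:   # only the smaller set is relabeled
            element.owner = large
            element.updates += 1
        large.members.extend(small.members)
        small.members = []
        return large


def find(element):
    """Constant-time find: each element stores the identity of its set."""
    return element.owner
```

Because an element is relabeled only when it sits in the smaller operand, the set containing it at least doubles with every relabeling, which bounds the updates field by log2 of the number of elements.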
Appendix C 
On-Chip Wire Delay 
Signal velocities on an integrated circuit are limited by the resistance and capacitance 
of the wire to be far less than the speed of light. Because the resistivity of integrated 
circuit wires is high, it is not possible to build good transmission lines on a chip. Instead, 
on-chip signal wires are lossy transmission lines with a delay proportional to the square 
of their length. 
We can propagate a signal with linear delay by placing repeaters along a transmission 
line. Each repeater is an inverter of size S. The repeaters are spaced distance L apart. 
Let us make the following assumptions: 
• The ratio of inverter input capacitance to transistor gate capacitance is $X$. For 
a CMOS inverter with the p-channel transistor twice the size of the n-channel 
transistor, $X = 3$. 
• Transistors will be modeled by a linear resistance. The resistance of a minimum 
width transistor is $R_t$. The output resistance of each inverter is $R_{inv} = R_t/S$. 
• The gate capacitance of a minimum width transistor is a constant, $C_g$, and scales 
linearly with device size. The input capacitance of each inverter is $C_{inv} = X S C_g$. 
• The resistance of a unit length wire is $K_r R_t$. The resistance of the wire between 
two inverters is $R_w = L K_r R_t$. 
• The capacitance of a unit length wire is $K_c C_g$. The capacitance of the wire between 
two inverters is $C_w = L K_c C_g$. 
We model one stage of the RC transmission line with repeaters with a $\Pi$ network as 
shown in Figure C.1. Half of the distributed wire capacitance is lumped at each end of 
the wire. The output resistance of the driving repeater is added to the input end of the 
network, and the input capacitance of the receiving repeater is added to the output side 
of the network. 
(C.4) 
Setting this equal to zero and solving for L gives 
(C.5) 
Substituting (C.3) and (C.5) back into (C.l) gives the delay of the optimal segment 
(C.6) 
Dividing (C.5) by (C.6) gives the maximum signal velocity in units of so~:es 
(C.7) 
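The steps referred to by the equation numbers above can be sketched as follows. This is a reconstruction from the stated model, assuming the first-order (Elmore) delay of the $\Pi$ network and assuming that the spacing is chosen to minimize delay per unit length; it is not necessarily the form in which (C.1) through (C.7) were originally printed, and $T_{opt}$ is a name introduced here.

```latex
\begin{align*}
T &= R_{inv}\,(C_w + C_{inv}) + R_w\left(\tfrac{1}{2}C_w + C_{inv}\right) \\
  &= R_t C_g \left[ \frac{K_c L}{S} + X + \frac{K_r K_c L^2}{2} + K_r X S L \right], \\
\frac{\partial T}{\partial S} &= R_t C_g \left[ K_r X L - \frac{K_c L}{S^2} \right] = 0
  \;\Longrightarrow\; S = \sqrt{\frac{K_c}{K_r X}}, \\
\frac{\partial}{\partial L}\!\left(\frac{T}{L}\right)
  &= R_t C_g \left[ \frac{K_r K_c}{2} - \frac{X}{L^2} \right] = 0
  \;\Longrightarrow\; L = \sqrt{\frac{2X}{K_r K_c}}, \\
T_{opt} &= 2\,(1+\sqrt{2})\, X R_t C_g, \\
v &= \frac{L}{T_{opt}} = \frac{1}{(2+\sqrt{2})\, R_t C_g \sqrt{X K_r K_c}} .
\end{align*}
```

Under this model the optimal repeater size depends only on the ratio of wire capacitance to wire resistance, and the delay of an optimal segment is a fixed multiple of the intrinsic delay $X R_t C_g$.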
Let us put some real numbers into these equations. The following table gives approximate 
values for our four constants as a function of linear dimension, $\lambda$, in microns. 

    Parameter    Value                          Units 
    $R_t$        $10^4$                         Ω 
    $C_g$        $4\lambda$                     fF 
    $K_c$        0.1 
    $K_r$        $5 \times 10^{-6}/\lambda$ 
For a 1µ technology ($\lambda$ = 0.5µ), if we set $X = 3$, we can calculate: 
• Optimal repeater size is $S \approx 60$. 
• Optimal repeater spacing is $L \approx 2500$. 
• Time between repeaters is $T \approx 300$ps. 
• The maximum signal velocity, $v = L/T \approx 8 \times 10^6$ m/sec $\ll 3 \times 10^8$ m/sec. 
This calculation has not been terribly accurate. Still, it is clear that signal velocities on 
integrated circuits are limited by resistance to be much less than the speed of light. 
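The figures quoted above can be approximately reproduced with a short Python sketch. Several things here are assumptions rather than statements from the text: the delay expression is the first-order (Elmore) delay of the $\Pi$ model, the spacing minimizes delay per unit length, lengths are measured in microns, and the parameter values in the usage note ($R_t = 10^4$ ohms, $C_g = 2$ fF, $K_c = 0.1$, $K_r = 10^{-5}$ at $\lambda$ = 0.5 micron) are a reading of the table. The function name optimal_repeaters is hypothetical.

```python
import math


def optimal_repeaters(R_t, C_g, K_r, K_c, X):
    """Return (S, L, T, v) for an RC line with repeaters under the assumed Elmore model."""
    S = math.sqrt(K_c / (K_r * X))          # repeater size minimizing stage delay
    L = math.sqrt(2 * X / (K_r * K_c))      # spacing minimizing delay per unit length
    T = R_t * C_g * (K_c * L / S + X + K_r * K_c * L * L / 2 + K_r * X * S * L)
    return S, L, T, L / T                   # v is in length units per second
```

Calling optimal_repeaters(1e4, 2e-15, 1e-5, 0.1, 3) gives a repeater size near 58, a spacing near 2450, a stage delay near 290 ps, and a velocity of roughly 8.5 x 10^12 microns per second, in line with the rounded values in the list above.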
Glossary 
acquaintance: An object's, A's, acquaintances are those objects to which A can send 
messages. In most cases an object's acquaintances are its instance variables and 
class variables. 
actor: A synonym for object. 
algorithm: A finite set of instructions for solving a specific type of problem [71]. 
argument: An object passed as part of a message. Arguments are bound to pseudo-
variables in the method executed in response to the message. 
assignment: The process of binding an object to a variable. In CST assignment is indi-
cated by a backarrow, '←'. For example, a ← b assigns the value of b to variable 
a. 
balanced cube: A concurrent ordered set data structure that maps the elements of an 
ordered set to the right subcubes of a binary n-cube. 
balanced tree: An ordered set data structure based on a binary search tree whose height 
is kept within a constant factor of log2 N, where N is the number of data in the 
tree [72]. 
B-cube: A concurrent ordered set data structure where multiple data are stored in each 
node of a balanced cube. 
B-tree: An ordered set data structure based on a tree with the following properties [72]: 
1. Each internal node of a B-tree of order N has between N/2 and N children. 
2. All leaves of a B-tree are at the same level and contain no data. 
3. An internal node with k children contains k records. 
4. The ith record of an internal node is greater than the (i - 1)st record. 
5. All records stored in the ith child of a node, A, are greater than the (i - 1)st 
record stored in A and less than the ith record stored in A. 
binary message: A message with a single argument and a selector composed of one or 
two special characters. For example, a + b and p ≤ q are binary messages. 
binary n-cube: An interconnection topology with $N = 2^n$ nodes where each node has a 
binary address, a, and is connected to those nodes whose addresses differ from a 
in exactly one bit position, $a \oplus 2^i$, $0 \le i < n$. 
binding: The process of associating meaning with an object. For example, object-oriented 
programming languages bind meaning to message objects by associating a method 
with each message. 
block: In Concurrent Smalltalk, a block is a sequence of deferred message expressions 
along with arguments, locks and local variables. A block is executed when it 
receives a value message. 
[:each | require rwlock :var1 :var2 | message1. message2] 
The block above, for example, has a single argument, each, requires a lock, rwlock, 
and has two variables, var1 and var2. When it receives a value: arg message, this 
block binds each to arg and executes the two messages. 
cache: A small, fast memory used to hold frequently accessed data. 
class: An object that describes the state and behavior of objects of a certain type. 
class variable: A variable shared by objects of a certain class. It can be accessed by the 
class itself and by any instance of the class. 
communication channel: The hardware used to transmit information between the nodes 
of a network. The channel includes the physical wires that carry the information, 
the buffers or queues that store information in transit, and the logic that controls 
information flow. 
computer architecture: The process of organizing a computer system to apply available 
technology to the solution of a set of problems. 
concurrent algorithm: An algorithm for a concurrent computer. 
concurrent computer: A computer composed of many autonomous processing elements 
connected by a network. The term concurrent is used rather than the term parallel 
to emphasize the autonomous nature of the processing elements [111]. 
concurrent data structure: A data structure that can perform many operations simulta-
neously. 
constituent object (CO): An object that is part of a distributed object. Constituent 
objects receive messages sent to the distributed object. 
data abstraction: Data abstraction separates an object's protocol, the messages an object 
understands, from an object's implementation, how the object responds to the 
messages in its protocol. 
data structure: A collection of data on which some relations are defined. 
deadlock: Deadlock occurs when no progress can be made because of a cyclic conflict for 
resources. In an interconnection network deadlock occurs when no message can 
advance toward its destination because the queues of the message system are full. 
degree: The degree of a vertex, v, is the number of edges incident on v. 
diameter: The maximum over all pairs of vertices of the length of the shortest path 
between two vertices in a graph. 
direct network: An interconnection network in which the terminal nodes are also the 
switching elements as opposed to an indirect network in which the terminals and 
switching elements are distinct. 
distributed object: An object consisting of a collection of constituent objects. A message 
sent to the distributed object may be received by any constituent of the object. 
edge: An ordered pair of vertices. 
ensemble machine: A machine consisting of an ensemble of processing nodes connected 
by a network [110]. The processing nodes of an ensemble machine may be au-
tonomous as in a concurrent computer, or they may operate in lockstep as in a 
SIMD [42] parallel computer. 
flit: A FLow control digIT, the smallest unit of information that can be accepted 
or refused by a communication channel or queue. One or more flits make up a 
packet. Individual flits do not contain sequencing or routing information and thus 
flits in a packet cannot be interleaved with flits of another packet. 
heap: A data structure for implementing a priority queue. A heap is organized as a 
binary tree with one record stored in each node of the tree. The tree is ordered 
so that the record stored in each node is greater than the records stored in both 
of its children. 
hypercube: A k-ary n-cube with dimension, n, greater than three. Hypercube is often in-
correctly used as a synonym for binary n-cube; however, the radix of a hypercube 
is not restricted to be two. 
identifier: A name or symbol. In CST an identifier consists of a letter possibly followed 
by a sequence of letters and digits. 
inheritance: In an object-oriented language, a subclass inherits behavior from its super-
class. 
instance: An instance of a class, A, is an object of class A. 
instance variable: A variable local to a particular instance of an object. Instance vari-
ables make up an object's private memory. 
interconnection network: A communication network used to connect the processing nodes 
of an ensemble machine. 
indirect network: An interconnection network in which the terminal nodes are distinct 
from the switching elements as opposed to a direct network in which the terminals 
contain the switching elements. 
k-ary n-cube: An interconnection topology with $N = k^n$ nodes. Each node in a k-ary 
n-cube has an n-digit radix k address, $a = a_{n-1}, \ldots, a_0$, and is adjacent to those 
nodes with addresses $b = b_{n-1}, \ldots, b_0$ that differ from a in only one digit, say the 
ith, and this digit differs only by one, $a_i = b_i \pm 1$. Binary n-cubes are a special 
case of k-ary n-cubes where k = 2. 
keyword message: A message consisting of a selector and one or more arguments where 
the selector is a sequence of keywords terminated with colons, ':', one preceding 
each argument. For example, the message receiver at: 8 put: 'arg2' is a keyword 
message with selector at:put: and arguments 8 and 'arg2'. 
late binding: Binding meaning to objects as late as possible, usually at run-time. In 
contrast, early binding usually takes place at compile time. 
latency: The elapsed time required to perform an operation. The latency of a message 
transmission is the elapsed time from the time the first flit of the message leaves 
the source to the time the last flit of the message arrives at the destination. 
lock: A programming construct used to restrict concurrent access to an object. 
message: In an object-oriented programming language, a message is a request for an 
object to perform some action. Messages consist of three parts: a receiver that 
specifies the object which is to receive the message, a selector that specifies the 
type of action to be performed, and arguments that supply additional information 
required to perform the action. In an interconnection network, a message is a 
logical unit of communication. A message may be broken down into a number 
of packets, physical units of communication that contain routing and sequencing 
information. Packets in turn may be broken down into flits. 
message-passing concurrent computer: A concurrent computer in which the processing 
nodes communicate by passing messages over communication channels. 
method: A description of how an object is to respond to a message. Methods in object-
oriented programming languages are similar to procedures and subroutines in 
conventional programming languages. 
multiprogrammed system: A computer system that supports multiple processes on a sin-
gle processor. 
object: The primitive element of an object-oriented programming system. An object 
consists of a state and a behavior. The state of an object is made up of a number 
of variables or acquaintances. The behavior of an object is specified by a number of 
methods. The object executes these methods in response to particular messages. 
object expert: A processing element specialized to operate on a restricted class of objects. 
An object expert contains both storage for instances of this class of objects and 
logic specialized to operate on these objects. 
packet: In a communication network a packet is the smallest unit of information that 
contains routing information. Packets may be broken down into flits. 
path: A sequence of connected edges in a graph. 
protocol: The set of messages that an object understands. 
receiver: The object to which a message is sent. 
selector: A part of a message specifying the type of operation to be performed by the 
object receiving the message. 
self-timed: A design discipline where the sequencing of events is controlled by the internal 
delays of elements rather than by an external clock. 
sequential computer: A computer that executes instructions one at a time. 
shared-memory concurrent computer: A concurrent computer in which the processing 
elements communicate by reading and writing shared storage locations. 
store-and-forward routing: A routing strategy where an entire packet is stored in each 
node along a multi-hop path before transmission to the next node is initiated. 
strongly connected: A graph is strongly connected if there exists a path from every vertex 
in the graph to every other vertex. 
structured buffer pool: A technique used to prevent deadlock in an interconnection net-
work by controlling the allocation of buffers to packets. 
subclass: A class that inherits methods and variables from an existing class, its super-
class. 
superclass: The class from which methods and variables are inherited. 
throughput: The total number of operations performed per unit time. 
tori: Plural of torus. 
torus: Topologically, a torus is a doughnut shaped surface. In terms of interconnection 
networks, torus is a synonym for k-ary n-cube. 
tree: In Computer Science a tree refers to a hierarchical data structure organized as a 
connected acyclic directed graph where the in-degree of each vertex is less than 
or equal to one. 
useful: In a flow graph, an edge, e, is useful from vertex u to vertex v, denoted useful(u, v), 
if e = (u, v) and f(e) < c(e), or e = (v, u) and f(e) > 0. 
vertex: A part of a graph. 
virtual channels: A technique for preventing deadlock in an interconnection network 
by multiplexing several virtual channels, each with its own queue, over a single 
physical channel and restricting the routing on virtual channels so that there are 
no cyclic dependencies amongst channels. 
very large scale integration (VLSI): A technology for fabricating integrated circuits con-
taining over $10^4$ devices. 
wafer scale integration (WSI): A technology for fabricating integrated circuits the size 
of wafers (50-150 mm on a side). 
wormhole routing: A routing strategy where each flit of a packet is immediately for-
warded to the next node along a multi-hop path without waiting for the rest of a 
packet to arrive. 
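The adjacency rule in the k-ary n-cube entry can be made concrete with a small Python sketch. It assumes that the digit arithmetic wraps around modulo k, as the torus synonym suggests; the function name neighbors and the tuple representation of addresses are hypothetical.

```python
def neighbors(addr, k):
    """Nodes adjacent to addr, a tuple of radix-k digits (a_{n-1}, ..., a_0)."""
    result = set()
    for i in range(len(addr)):
        for step in (1, -1):
            b = list(addr)
            b[i] = (addr[i] + step) % k      # assumed wraparound in each dimension
            if b[i] != addr[i]:
                result.add(tuple(b))
    return sorted(result)
```

For k = 2 the two steps coincide, so each node has n neighbors, one per bit position, which matches the binary n-cube entry.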
Bibliography 
[1] Agha, Gul A., Actors: A Model of Concurrent Computation in Distributed 
Systems, MIT Artificial Intelligence Laboratory, Technical Report 844, June 
1985. 
[2] Aho, Alfred V., Hopcroft, John E., and Ullman, Jeffrey D., The Design and 
Analysis of Computer Algorithms, Addison-Wesley, Reading, Mass., 1974. 
[3] Athas, W.C., XCPL, an Experimental Concurrent Language, Dept. of Com-
puter Science, California Institute of Technology, Technical Report 5196, 1985. 
[4] Backus, John, "Can Programming Be Liberated from the von Neumann Style? 
A Functional Style and Its Algebra of Programs," CACM, Vol. 21, No. 8, 
August 1978, pp. 613-641. 
[5] Baird, Henry S., "Fast Algorithms for LSI Artwork Analysis," Proceedings, 14th 
ACM/IEEE Design Automation Conference, 1977, pp. 303-311. 
[6] Barnes, Earl R., "An Algorithm for Partitioning the Nodes of a Graph," SIAM 
J. Alg. Disc. Meth., Vol. 3, No. 4, December 1982, pp. 541-550. 
[7] Batcher, K.E., "Sorting Networks and Their Applications," Proceedings AFIPS 
FJCC, Vol. 32, 1968, pp. 307-314. 
[8] Batcher, K.E., "The Flip Network in STARAN," Proceedings, 1976 Interna-
tional Conference on Parallel Processing, pp. 65-71. 
[9] Baudet, Gerard M., The Design and Analysis of Algorithms for Asynchronous 
Multiprocessors, Ph.D. Thesis, Department of Computer Science, Carnegie-
Mellon University, Technical Report CMU-CS-78-116, 1978. 
[10] Benes, V.E., Mathematical Theory of Connecting Networks and Telephone Traf-
fic, Academic, New York, 1965. 
[11] Birtwhistle, Graham M., Dahl, Ole-Johan, Myhrhaug, Bjorn, and Nygaard, 
Kristen, Simula Begin, Petrocelli, New York, 1973. 
[12] Blodgett, A.J. and Barbour, D.R., "Thermal Conduction Module: A High Per-
formance Multilayer Ceramic Package," IBM J. of Research and Development, 
Vol. 26, No. 1, January 1982, pp. 30-36. 
[13] Browning, Sally, The Tree Machine: A Highly Concurrent Computing Environ-
ment, Dept. of Computer Science, California Institute of Technology, Technical 
Report 3760, 1985. 
[14] Bryant, R., "A Switch-Level Model and Simulator for MOS Digital Systems," 
IEEE Transactions on Computers, Vol. C-33, No. 2, February 1984, pp. 160-
177. 
[15] Chandy, K.M. and Misra, J., "Distributed Computation on Graphs: Shortest 
Path Algorithms," CACM, Vol. 25, No. 11, November 1982, pp. 833-837. 
[16] Chapman, P.T. and Clark, K., Jr., "The Scan-Line Approach to Design Rules 
Checking," Proceedings, 21st ACM/IEEE Design Automation Conference, 1984, 
pp. 235-241. 
[17] Clinger, W.D., Foundations of Actor Semantics, MIT Artificial Intelligence 
Laboratory, Technical Report 633, May 1981. 
[18] Condon, Joseph H. and Thompson, Ken, "Belle Chess Hardware," Advances in 
Computer Chess, Vol. 3, Pergamon Press, Oxford, 1982, pp. 45-54. 
[19] Dahl, O.J. and Nygaard, K., "SIMULA - An Algol-Based Simulation Lan-
guage," CACM, Vol. 9, No. 9, September 1966, pp. 671-678. 
[20] Dally, William J. and Seitz, Charles L., The Balanced Cube: A Concurrent 
Data Structure, Dept. of Computer Science, California Institute of Technology, 
Technical Report 5174:TR:85, February 1985, early release of [21]. 
[21] Dally, William J. and Seitz, Charles L., The Balanced Cube: A Concurrent 
Data Structure, Dept. of Computer Science, California Institute of Technology, 
Technical Report 5174:TR:85, May 1985. 
[22] Dally, William J. and Kajiya, J., "An Object Oriented Architecture," Pro-
ceedings, 12th International Symposium on Computer Architecture, 1985, pp. 
154-161. 
[23] Dally, William J. and Bryant, Randal E., "A Hardware Architecture for Switch-
Level Simulation," IEEE Transactions on Computer-Aided Design, Vol. CAD-4, 
No. 3, July 1985, pp. 239-250. 
[24] Dally, William J. and Seitz, Charles L., Deadlock-Free Message Routing in Mul-
tiprocessor Interconnection Networks, Dept. of Computer Science, California 
Institute of Technology, Technical Report 5206:TR:86, 1986. 
[25] Dally, William J. and Seitz, Charles L., "The Torus Routing Chip," to appear 
in J. Distributed Systems, Vol. 1, No. 3, 1986. 
[26] Dally, William J., CNTK: An Embedded Language for Circuit Description, 
Dept. of Computer Science, California Institute of Technology, Display File, in 
preparation. 
[27] Dijkstra, E.W., "A note on two problems in connexion with graphs," Nu-
merische Mathematik, Vol. 1, 1959, pp. 269-271. 
[28] Dijkstra, E.W. and Scholten, C.S., "Termination Detection for Diffusing Com-
putations," Information Processing Letters, Vol. 11, No. 1, August 1980, pp. 
1-4. 
[29] Donath, W.E. and Wong, C.K., "An Efficient Algorithm for Boolean Mask 
Operations," Proceedings, 20th ACM/IEEE Design Automation Conference, 
1983, pp. 358-360. 
[30] Edmonds, J. and Karp, R.M., "Theoretical Improvements in Algorithmic Ef-
ficiency for Network Flow Problems," JACM, Vol. 19, No. 2, April 1972, pp. 
248-264. 
[31] Ellis, C.S., "Concurrent Search and Insertion in AVL Trees," IEEE Transac-
tions on Computers, Vol. C-29, No. 9, September 1980, pp. 811-817. 
[32] Ellis, C.S., "Concurrent Search and Insertion in 2-3 Trees," Acta Informatica, 
Vol. 14, 1980, pp. 63-86. 
[33] Ellis, C.S., Concurrency and Linear Hashing, Computer Science Department, 
University of Rochester, TR 151, March 1985. 
[34] Ellis, C.S., Distributed Data Structures, A Case Study, Computer Science De-
partment, University of Rochester, TR 150, August 1985. 
[35] Even, S. and Tarjan, R.E., "Network Flow and Testing Graph Connectivity," 
SIAM J. Computing, Vol. 4, 1975, pp. 507-518. 
[36] Even, Shimon, Graph Algorithms, Computer Science Press, Rockville, Md., 
1979. 
[37] Fagin, Ronald, Nievergelt, Jurg, Pippenger, Nicholas and Strong, H. Raymond, 
"Extendible Hashing - A Fast Access Method for Dynamic Files," ACM Trans-
actions on Database Systems, Vol. 4, No. 3, September 1979, pp. 315-344. 
[38] Fiduccia, C.M. and Mattheyses, R.M., "A Linear-Time Heuristic for Improving 
Network Partitions," Proceedings, 19th ACM/IEEE Design Automation Con-
ference, 1982, pp. 175-181. 
[39] Filman, Robert E. and Friedman, Daniel P., Coordinated Computing, Tools and 
Techniques for Distributed Software, McGraw-Hill, New York, 1984, Ch. 17. 
[40] Fisher, A.L. and Kung, H.T., "Synchronizing Large VLSI Processor Arrays," 
IEEE Transactions on Computers, Vol. C-34, No. 8, August 1985, pp. 734-740. 
[41] Floyd, R.W., "Algorithm 97: Shortest Path," CACM, Vol. 5, No. 6, June 
1962, p. 345. 
[42] Flynn, Michael J., "Some Computer Organizations and Their Effectiveness," 
IEEE Transactions on Computers, Vol. C-21, No. 9, September 1972. 
[43] Ford, L.R., Jr. and Fulkerson, D.R., Flows in Networks, Princeton University 
Press, Princeton, N.J., 1962. 
[44] Galil, Z. and Naamad, A., "Network Flow and Generalized Path Compression," 
Proceedings, 11th ACM Symposium on the Theory of Computing, 1979, pp. 13-
26. 
[45] Galil, Z., "An $O(V^{5/3} E^{2/3})$ Algorithm for the Maximal Flow Problem," Acta 
Informatica, Vol. 14, 1980, pp. 221-242. 
[46] Galil, Z., "On the Theoretical Efficiency of Various Network Flow Algorithms," 
Theoretical Computer Science, Vol. 14, 1981, pp. 103-111. 
[47] Garey, M.R. and Johnson, D.S., Computers and Intractability, A Guide to the 
Theory of NP-Completeness, W. H. Freeman and Company, 1979, p. 209. 
[48] Gelernter, David, "A DAG-Based Algorithm for Prevention of Store-and-Forward 
Deadlock in Packet Networks," IEEE Transactions on Computers, Vol. C-30, 
No. 10, October 1981, pp. 709-715. 
[49] Gerla, Mario, and Kleinrock, Leonard, "Flow Control: A Comparative Survey," 
IEEE Transactions on Communications, Vol. COM-28, No. 4, April 1980, pp. 
553-574. 
[50] Glasser, Lance A. and Dobberpuhl, Daniel W., The Design and Analysis of 
VLSI Circuits, Addison-Wesley, Reading, Mass., 1985. 
[51] Goldberg, Adele and Robson, David, Smalltalk-80: The Language and its Im-
plementation, Addison-Wesley, Reading, Mass., 1983. 
[52] Goldberg, Adele, Smalltalk-80: The Interactive Programming Environment, 
Addison-Wesley, Reading, Mass., 1984. 
[53] Goodman, J., "Using Cache Memories to Reduce Processor-Memory Traffic," 
10th Annual Symposium on Computer Architecture, June 1983. 
[54] Gottlieb, Alan, et al., "The NYU Ultracomputer - Designing an MIMD Shared 
Memory Parallel Computer," IEEE Transactions on Computers, Vol. C-32, 
No. 2, February 1983, pp. 175-189. 
[55] Gottlieb, Alan, et al., "Basic Techniques for the Efficient Coordination of Very 
Large Numbers of Cooperating Sequential Processors," ACM TOPLAS, Vol. 
5, No. 2, April 1983, pp. 164-189. 
[56] Gray, H.J. and Levonian, P.V., "An Analog-to-Digital Converter for Serial Com-
puting Machines," Proceedings of the I.R.E., Vol. 41, No. 10, October 1953, 
pp. 1462-1465. 
[57] Guibas, L.J., Kung, H.T., and Thompson, C.D., "Direct VLSI Implementation 
of Combinatorial Algorithms," Proceedings, Caltech Conference on VLSI, 1979, 
pp. 509-525. 
[58] Gunther, Klaus D., "Prevention of Deadlocks in Packet-Switched Data Trans-
port Systems," IEEE Transactions on Communications, Vol. COM-29, No. 4, 
April 1981, pp. 512-524. 
[59] Hewitt, Carl, "The Apiary Network Architecture for Knowledgeable Systems," 
Conference Record of the 1980 LISP Conference, 1980, pp. 107-117. 
[60] Hill, F.J. and Peterson, G.R., Digital Systems: Hardware Organization and 
Design, Wiley, New York, 1978. 
[61] Hillis, W. Daniel, The Connection Machine (Computer Architecture for the
New Wave), MIT Artificial Intelligence Laboratory, AI Memo No. 646,
September 1981.
[62] Hoare, C.A.R., "Communicating Sequential Processes," CACM, Vol. 21, No. 
8, August 1978, pp. 666-677. 
[63] Hu, T.C., Combinatorial Algorithms, Addison-Wesley, 1982. 
[64] Inmos Limited, IMS T424 Reference Manual, Order No. 72 TRN 006 00,
Bristol, United Kingdom, November 1984. 
[65] Intel Scientific Computers, iPSC User's Guide, Order No. 175455-001, Santa 
Clara, Calif., Aug. 1985. 
[66] Kermani, Parviz and Kleinrock, Leonard, "Virtual Cut-Through: A New
Computer Communication Switching Technique," Computer Networks, Vol. 3, 1979,
pp. 267-286. 
[67] Kernighan, B. W. and Lin, S., "An Efficient Heuristic Procedure for Partitioning 
Graphs," Bell System Technical Journal, February 1970, pp. 291-307. 
[68] Kernighan, B.W. and Ritchie, D., The C Programming Language,
Prentice-Hall, Englewood Cliffs, N.J., 1978.
[69] Kirkpatrick, S., Gelatt, C.D. Jr., Vecchi, M.P., "Optimization by Simulated 
Annealing," Science, Vol. 220, No. 4598, 13 May 1983, pp. 671-680. 
[70] Kleinrock, Leonard, Queueing Systems: Volume 2: Computer Applications, Wi-
ley, New York, 1976, pp. 438-440. 
[71] Knuth, Donald E., The Art of Computer Programming, Volume 1/ Fundamen-
tal Algorithms, Addison-Wesley, Reading, Mass., 1973. 
[72] Knuth, Donald E., The Art of Computer Programming, Volume 3/ Sorting and
Searching, Addison-Wesley, Reading, Mass., 1973.
[73] Knuth, Donald E., The TEXbook, Addison-Wesley, Reading, Mass., 1984.
[74] Krasner, Glenn, Smalltalk-80: Bits of History, Words of Advice,
Addison-Wesley, Reading, Mass., 1983.
[75] Kung, H.T., "The Structure of Parallel Algorithms," Advances in Computers, 
Vol. 19, 1980, pp. 65-112. 
[76] Kung, H.T. and Lehman, P.L., "Concurrent Manipulation of Binary Search
Trees," ACM Transactions on Database Systems, Vol. 5, No. 3, September
1980, pp. 354-382.
[77] Kyocera, Inc., Design Guidelines, Multilayer Ceramics, CAT/2T8403TM.
[78] Lamport, Leslie, The LaTEX Document Preparation System, Second
Preliminary Edition, 1983.
[79] Lang, C.R. Jr., The Extension of Object-Oriented Languages to a Homogeneous, 
Concurrent Architecture, Dept. of Computer Science, California Institute of 
Technology, Technical Report 5014, May 1982. 
[80] Lawrie, Duncan H., "Alignment and Access of Data in an Array Processor," 
IEEE Transactions on Computers, Vol. C-24, No. 12, December 1975, pp. 
1145-1155. 
[81] Lehman, P.L. and Yao, S.B., "Efficient Locking for Concurrent Operations on
B-Trees," ACM Transactions on Database Systems, Vol. 6, No. 4, December
1981, pp. 650-670.
[82] Levitt, K.N. and Kautz, W.H., "Cellular Arrays for the Solution of Graph
Problems," CACM, Vol. 15, No. 9, September 1972, pp. 789-801. 
[83] Lomet, David B., "Bounded Index Exponential Hashing," ACM Transactions 
on Database Systems, Vol. 8, No. 1, March 1983, pp. 136-165. 
[84] Malhotra, V.M., Kumar, M.P., and Maheshwari, S.N., "An O(|V|^3) Algorithm
for Finding Maximum Flows in Networks," Information Processing Letters, Vol.
7, No. 6, October 1978, pp. 277-278.
[85] Marberg, J.M. and Gafni, E., "An O(N^3) Distributed Max-Flow Algorithm,"
Proceedings, 18th Princeton Conference on Information Sciences and Systems,
1984, pp. 478-482.
[86] Mead, Carver A. and Conway, Lynn A., Introduction to VLSI Systems, Addison-
Wesley, Reading, Mass., 1980. 
[87] Mead, Carver A. and Rem, Martin, "Cost and Performance of VLSI Computing
Structures," IEEE J. Solid-State Circuits, Vol. SC-14, No. 2, April 1979, pp.
455-462.
[88] Mead, Carver A. and Rem, Martin, "Minimum Propagation Delays in VLSI,"
IEEE J. Solid-State Circuits, Vol. SC-17, No. 4, August 1982, pp. 773-775.
[89] Merlin, Philip M. and Schweitzer, Paul J., "Deadlock Avoidance in
Store-and-Forward Networks-I: Store-and-Forward Deadlock," IEEE Transactions on
Communications, Vol. COM-28, No. 3, March 1980, pp. 345-354.
[90] Miklosko, J. and Kotov, V.E., Algorithms, Software and Hardware of
Parallel Computers, VEDA, Publishing House of the Slovak Academy of Sciences,
Bratislava, 1984.
[91] Moore, Gordon, "VLSI: Some Fundamental Challenges," IEEE Spectrum, April 
1979, pp. 30-37. 
[92] Motorola Inc., MC68000 16-bit Microprocessor User's Manual, Third Edition, 
Prentice Hall, Englewood Cliffs, N.J., 1982.
[93] Ousterhout, John K., "Corner Stitching: A Data-Structuring Technique for 
VLSI Layout Tools," IEEE Transactions on Computer Aided Design, Vol. 
CAD-3, No. 1, January 1984, pp. 87-100. 
[94] Ousterhout, John K., et al., "The Magic VLSI Layout System," IEEE Design
and Test of Computers, Vol. 2, No. 1, February 1985, pp. 19-30.
[95] Papadimitriou, C.H. and Steiglitz, K., Combinatorial Optimization: Algorithms 
and Complexity, Prentice Hall, 1982. 
[96] Pease, M.C., III, "The Indirect Binary n-Cube Microprocessor Array," IEEE 
Transactions on Computers, Vol. C-26, No. 5, May 1977, pp. 458-473. 
[97] Peltzer, Douglas L., "Wafer-Scale Integration: The Limits of VLSI?" VLSI 
Design, September 1983, pp. 43-47. 
[98] Peterson, James L., "Petri Nets," Computing Surveys, Vol. 9, No. 3, September 
1977, pp. 223-252. 
[99] Pfister, G.F., "The Yorktown Simulation Engine: Introduction," Proceedings,
19th ACM/IEEE Design Automation Conference, 1982, pp. 51-54.
[100] Pfister, G.F., et al., "The IBM Research Parallel Processor Prototype (RP3): 
Introduction and Architecture," IEEE 1985 Conf. on Parallel Processing,
August 1985, pp. 764-771.
[101] Pfister, G.F. and Norton, V.A., "Hot Spot Contention and Combining in
Multistage Interconnection Networks," IEEE Transactions on Computers, Vol. C-34,
No. 10, October 1985, pp. 943-948. 
[102] Quinn, Michael J. and Narsingh, Deo, "Parallel Graph Algorithms," Computing
Surveys, Vol. 16, No. 3, September 1984, pp. 319-348. 
[103] Quinn, Michael J. and Yoo, Year Back, "Data Structures for the Efficient
Solution of Graph Theoretic Problems on Tightly-Coupled MIMD Computers,"
Proceedings, 1984 International Conference on Parallel Processing, 1984, pp.
431-438.
[104] Ramamoorthy, C.V. and Li, H.F., "Pipeline Architecture," ACM Computing 
Surveys, Vol. 9, No. 1, March 1977, pp. 61-102. 
[105] Russo, R.L., Oden, P.H., and Wolff, P.K., "A Heuristic Procedure for the
Partitioning and Mapping of Computer Logic Blocks to Modules," IEEE
Transactions on Computers, Vol. C-20, 1971, pp. 1455-1462.
[106] Schwartz, J.T., "Ultracomputers," ACM TOPLAS, Vol. 2, No. 4, October 
1980, pp. 484-521. 
[107] Sedgewick, Robert, Algorithms, Addison-Wesley, Reading, Mass., 1983. 
[108] Seigel, Howard Jay, "Interconnection Networks for SIMD Machines," IEEE
Computer, Vol. 12, No. 6, June 1979, pp. 57-65. 
[109] Seitz, Charles L., "System Timing" in Introduction to VLSI Systems, C. A. 
Mead and L. A. Conway, Addison-Wesley, 1980, Ch. 7. 
[110] Seitz, Charles L., Experiments with VLSI Ensemble Machines, Dept. of Com-
puter Science, California Institute of Technology, Technical Report 5102, Oc-
tober 1983. 
[111] Seitz, Charles L., "Concurrent VLSI Architectures," IEEE Transactions on 
Computers, Vol. C-33, No. 12, December 1984, pp. 1247-1265. 
[112] Seitz, Charles L., "The Cosmic Cube," CACM, Vol. 28, No. 1, Jan. 1985, pp.
22-33. 
[113] Seitz, Charles L., et al., The Hypercube Communications Chip, Dept. of
Computer Science, California Institute of Technology, Display File 5182:DF:85,
March 1985.
[114] Seitz, Charles L., et al., "Hot-Clock nMOS," 1985 Chapel Hill Conference 
on Very Large Scale Integration, Henry Fuchs, ed., Computer Science Press, 
Rockville, Md., 1985.
[115] Seraphim, D.P. and Feinberg, I., "Electronic Packaging Evolution in IBM,"
IBM J. of Research and Development, Vol. 25, No. 5, September 1981, pp. 
617-629. 
[116] Shiloach, Y. and Vishkin, U., "An O(n^2 log n) Parallel MAX-FLOW
Algorithm," J. Algorithms, Vol. 3, No. 2, June 1982, pp. 128-146.
[117] Siewiorek, D.P., Bell, C.G., and Newell, A., Computer Structures: Principles 
and Examples, McGraw-Hill, New York, 1982.
[118] Sleator, Daniel D.K., An O(nm log n) Algorithm for Maximum Network Flow, 
Ph.D. Thesis, Department of Computer Science, Stanford University, Report 
No. STAN-CS-80-831, December 1980. 
[119] Spira, P.M., "A New Algorithm for Finding All Shortest Paths in a Graph of
Positive Arcs in Average Time O(n^2 log^2 n)," SIAM J. Computing, Vol. 2, No.
1, pp. 28-32. 
[120] Steele, Craig S., Placement of Communicating Processes on Multiprocessor
Networks, Dept. of Computer Science, California Institute of Technology, Technical
Report 5184, 1985.
[121] Stefik, Mark and Bobrow, Daniel G., "Object-Oriented Programming: Themes
and Variations," AI Magazine, Vol. 6, No. 4, Winter 1986, pp. 40-62. 
[122] Stone, H.S., "Parallel Processing with the Perfect Shuffle," IEEE Transactions
on Computers, Vol. C-20, No. 2, February 1971, pp. 153-161. 
[123] Su, Wen-King, Faucette, Reese, and Seitz, Charles L., C Programmer's Guide 
to the Cosmic Cube, Dept. of Computer Science, California Institute of
Technology, Technical Report 5203, September 1985.
[124] Sullivan, H. and Bashkow, T.R., "A Large Scale Homogeneous Machine," Proc.
4th Annual Symposium on Computer Architecture, 1977, pp. 105-124.
[125] Tanenbaum, A. S., Computer Networks, Prentice Hall, Englewood Cliffs, N.J., 
1981. 
[126] Theriault, D.G., Issues in the Design and Implementation of Act2, MIT
Artificial Intelligence Laboratory, Technical Report 728, June 1983.
[127] Thompson, C.D., A Complexity Theory of VLSI, Department of Computer
Science, Carnegie-Mellon University, Technical Report CMU-CS-80-140, August
1980.
[128] Thompson, C.D., "Fourier Transforms in VLSI," IEEE Transactions on Com-
puters, Vol. C-32, No. 11, November 1983, pp. 1047-1057. 
[129] Thompson, C.D., "The VLSI Complexity of Sorting," IEEE Transactions on
Computers, Vol. C-32, No. 12, December 1983, pp. 1171-1184. 
[130] Toueg, Sam and Ullman, Jeffrey D., "Deadlock-Free Packet Switching Net-
works," Proceedings, 11th ACM Symposium on the Theory of Computing, 1979, 
pp. 89-98. 
[131] Toueg, Sam, "Deadlock- and Livelock-Free Packet Switching Networks,"
Proceedings, 12th ACM Symposium on the Theory of Computing, 1980, pp. 94-99.
[132] Trotter, D., MOSIS Scalable CMOS Rules, Version 1.2, 1985. 
[133] Ullman, Jeffrey D., Principles of Database Systems, Computer Science Press, 
1982. 
[134] Warshall, S., "A Theorem on Boolean Matrices," JACM, Vol. 9, No. 1, January
1962, pp. 11-12. 
[135] Wulf, W. and Bell, C.G., "C.mmp - A Multi-Mini-Processor," Proceedings, 
AFIPS FJCC, Vol. 41, Pt. 2, 1972, pp. 765-777. 
[136] Xerox Learning Research Group, "The Smalltalk-80 System," BYTE, Vol. 6, 
No. 8, August 1981, pp. 36-48. 
