Communication in a Tree Machine by Browning, Sally A. & Seitz, Charles L.
Communication in a Tree Machine 
Sally A. Browning 
Bell Laboratories 
Murray Hill, New Jersey 07974 
Charles L. Seitz 
Computer Science Department 
California Institute of Technology 
Pasadena, California 91125 
ABSTRACT 
Communication assumes a progressively dominant and limiting role in VLSI 
because it becomes relatively more expensive in chip area, signal energy, and time. 
The principle of locality becomes all important to integrated systems design, and 
implies that larger single processors are not the route to performance improve-
ments. One computer architecture that can exploit the capabilities of VLSI is an 
ensemble of small processors operating concurrently. 
The tree machine is such a structure. Each of the many processors in the 
binary tree can communicate directly only with its parent and two children. How-
ever, the tree is programmed as if each processor had an arbitrary number of des-
cendents, and the programs are compiled into code for a binary tree. We describe 
the communication structure of tree machine programs, the compilation process, 
and the underlying hardware. 
Jan 30, 1981 
CALTECH CONFERENCE ON VLSI, January 1981
1. The Tree Machine Architecture and Project
As Very Large Scale Integration (VLSI) becomes a reality, we have the opportunity and
motivation to build entirely new kinds of computers. The traditional view of a single large
processing unit physically separated from a single large store by a memory bus is certainly
realizable in VLSI. But, as Sutherland and Mead5 have observed, this architecture requires too
much global communication to be a good fit to VLSI. It is also a carryover from the era in which
processing and storage were best implemented in different technologies.
Most of the space on an integrated circuit is occupied by the wires that carry control and data 
to the functional blocks. It is no longer sufficient to spend most of the design effort on the
individual cells and leave the wiring until the end. Rather, the interconnection must be considered
an integral part of the design. As Seitz4 has pointed out, making larger scale single processors in
VLSI scaled to submicron dimensions becomes self-defeating: the diffusion delay in a wire scales up
quadratically while the delay of a MOS switch scales down linearly. Thus, the ratio of communication
to switching delay scales up as the third power of the scaling factor, and rapidly becomes a
dominant design factor for metal sizes below about one micron. Faster signal paths on higher, thicker
layers of wider metal runs are possible and a likely feature of new technologies, but will be a limited 
resource and will require larger drivers built on the diffused, polysilicon, and thin metal layers. 
Together, these observations might be called the principle of locality for VLSI. If a signal
must travel between two widely spaced points, it will either go slowly, or require extra area for 
bigger drivers and thicker wires. The incentive to design chips with local communication is strong. 
The tree machine architecture1 provides a general purpose computing environment while
capitalizing on the properties of VLSI. The tree machine is a collection of very small processors
connected together as a binary tree. The processors are identical, and each has its own memory.
There is no global communication, only communication between parent and child in the tree, and
between the root of the tree and the external world. This architecture gives rise to integrated
circuits that have regular interconnect, local communication, and many repetitions of a single
processor design. These integrated circuits, in turn, can be assembled in regular patterns at the
printed-circuit board and backplane level to construct machines with thousands to hundreds of
thousands of processors.
The tree machine is general purpose because it is programmable. A varied collection of
algorithms has been designed that take advantage of the available concurrency. In each case, the time
complexity of the algorithm is roughly equal to the number of data elements in the problem. Thus,
sorting can be done in linear time, and matrix operations in O(n^2) time. Many graph problems,
including some that are NP-complete, can be solved in O(edges) time, provided the tree machine
has enough processors. Problems that require an exponential growth in resources, like the
NP-complete problems and divide-and-conquer algorithms, are a good fit for the tree machine
architecture: each additional level in the tree doubles the number of available processors.
ARCHITECTURE SESSION 
This paper describes work done during the summer, 1980, at Caltech. The first design has
one processor per chip. As processing with finer design geometries becomes available, several
processors and their memories will be put on a single chip. The goal is to have a tree machine built
and tested by the fall of 1981, and a machine of at least 1023 processors working in 1982.
A feature of the tree machine that makes it particularly attractive for implementation in a
research environment is that it can be viewed as a recursive structure. Each node is the root of a
subtree. Thus, once the root of the tree has been tested, it can test, recursively, the rest of the tree.
Once a bootstrap program has been loaded into the root, it can load the other processors. The
importance of simple testing and loading procedures increases with the number of transistors on an
integrated circuit.
2. Programming a Tree Machine 
Tree machines are programmed in a high level language resembling Hoare's "Communicating
Sequential Processes" (CSP) notation.3 The notation, TMPL, is described in full elsewhere;1 here we
will look only at the communication primitives. This paper describes the tree machine
communication protocol, and the mapping between the programmer's view of communication and the
hardware. Thus, it is in large part a discussion of the TMPL compiler, whose major parts are
shown in Figure 1. TMPL programs are written for trees with arbitrary fanout, and transformed by
the compiler into source code for a binary tree; this is the MAP step, and is described in section 3.
source code -> MAP -> QUEUE -> COMPILE -> ASSEMBLE -> load stream
Figure 1. TMPL Compilation Process 
While TMPL supports four communication primitives, the hardware implements only two of them.
We remove the other two in the QUEUE step. This mapping is discussed following a presentation
of the hardware implementation. The final two compilation steps are familiar: the source code is
translated into machine code in the COMPILE step, and a load stream is composed in the
ASSEMBLE step. These two processes are not described. We begin with a brief presentation of the
notation, concentrating on the constructs used in communication.
2.1. Communication Primitives in TMPL 
The basic building block of TMPL is the processor definition. Each definition describes a
self-contained computational unit that communicates through ports with other processors. A
processor definition includes the naming of one external port to its parent in the tree, and an
arbitrary number of internal ports to descendents. Communication statements mention port names
through which the message will pass, rather than naming target processors. As a result, processor
definitions are written without regard to the eventual connection plan of the tree. A processor
expects to follow a specific communication protocol when accessing a port. Any processor that
follows the same protocol can be connected to the other end of the port.
This definitional locality makes possible a parts kit of standard processor definitions. Each 
part is a processor, or tree of processors, that can be completely characterized by the behavior at its
ports. As long as the expected messages are sent and received, the parts will work anywhere in the 
system. 
Inter-process communication can be specified in two ways, either as an imperative statement, 
or as a conditional expression. The statement form results in the processor being blocked until the 
communication is successfully completed. The conditional form appears as part of a loop or case 
statement, and is performed only if both processors communicating along the specified port are 
ready to exchange the message. 
Syntactically, message statements and expressions are identical; the general form is shown 
below. We retain Hoare's notation for the direction of the communication: ? indicates input, and ! 
is used for output. 
They are distinguished by context: message statements can occur wherever a statement is valid,
while message expressions are legal only in guards, described below. 
A communication involves two processors connected via a port. An output request to the port
from one processor must match up with an input request for the same message from the other
processor in order for the communication to take place. Either the output or the input can be done
conditionally, but not both. This restriction prevents a kind of deadlock: the "after you, after you"
situation that arises when neither processor will commit itself to the conditional exchange.
TMPL provides compile-time checking for occurrences of the illegal conditional-to-conditional
communication by requiring that message names be typed. These types specify how the message
will be used: imp for imperative mode, and cond for conditional. The message names, types, and
directions (input or output) are made externally available. When the tree connection plan is read
and the processors linked through ports, the message interfaces are compared. Illegal
communications, as well as messages that cannot be paired up, are flagged.
For example, suppose we have three processor definitions called A, B, and C. Message ports,
names, types, and directions used in the three processors are shown below, in Figure 2.
Processor  Port    Port Type  Message Name  Message Type  Direction  Arguments
A          parent  external   load          cond          ?          1
           parent  external   unload        cond          !          1
           left    internal   load          imp           !          1
           left    internal   unload        imp           ?          1
           right   internal   load          imp           !          1
           right   internal   unload        imp           ?          1
B          parent  external   load          cond          ?          1
           parent  external   unload        imp           !          1
C          parent  external   load          cond          ?          1
           parent  external   unload        cond          !          1
           left    internal   load          cond          !          1
           left    internal   unload        imp           !          1
Figure 2. Message Descriptors 
If we connect the left port of A to B's port, we find that A will send load messages, and B will
receive them; similarly, B will send unload messages and A will receive them. In addition, all
communications between the two are either imperative to imperative or imperative to conditional,
and thus valid. A connection between the left port of C and B produces invalid communications.
While C outputs load messages and B receives them, both are doing it conditionally. And the
unload messages have the same direction: both processors want to send the message. Thus, B and C
cannot be connected using the left port of C. Incidentally, the protocols match up if B and C are
connected through their parent ports, but in order to preserve the tree structure, connections must
always be made from an internal port to an external one, that is, from parent to child.
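
The interface comparison described above can be illustrated with a minimal Python sketch. It is not the TMPL compiler's code: the dictionary layout and the function name check_connection are assumptions made for illustration, and only the descriptors of Figure 2 that take part in the two connections discussed are encoded.

```python
# Message descriptors per port, after Figure 2: name -> (mode, direction).
# mode is "imp" or "cond"; direction is "?" (input) or "!" (output).
A_LEFT = {"load": ("imp", "!"), "unload": ("imp", "?")}
B_PARENT = {"load": ("cond", "?"), "unload": ("imp", "!")}
C_LEFT = {"load": ("cond", "!"), "unload": ("imp", "!")}


def check_connection(internal, external):
    """Return the problems found when linking an internal port to an external one."""
    problems = []
    for name in sorted(set(internal) | set(external)):
        if name not in internal or name not in external:
            problems.append(f"{name}: unpaired message")
            continue
        (mode_a, dir_a), (mode_b, dir_b) = internal[name], external[name]
        if dir_a == dir_b:
            # Both ends want to send, or both want to receive.
            problems.append(f"{name}: both sides use {dir_a}")
        if mode_a == "cond" and mode_b == "cond":
            # The "after you, after you" case.
            problems.append(f"{name}: conditional to conditional")
    return problems


print(check_connection(A_LEFT, B_PARENT))  # A's left port to B
print(check_connection(C_LEFT, B_PARENT))  # C's left port to B
```

Under these assumptions the first connection yields no problems, while the second reports both the conditional-to-conditional load and the unload messages that travel in the same direction.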
2.2. Message Statements and Expressions 
Message statements are a familiar concept, resembling subroutine calls: the processor executing
the statement is blocked from further execution until the communication is successfully completed.
Message expressions, however, are more complicated. The following paragraphs describe them in
more detail, beginning with a description of the statements in which the message expressions may
appear.
As in Dijkstra's notation,2 TMPL has generalized loop and conditional statements made up of
a set of guarded commands. A guard is an expression; the command will be executed only if the
guard has the value true. A TMPL guard can be a logical expression, a message expression, or a
combination of the two. A single guard can contain at most one message expression. Each of the
following lines contains a valid guarded command.
    i <= 9 AND i >= 0 -> j := j*10 + i
    NOT found AND p?arc(i,j) -> found := start=i AND end=j
    left,right?answer -> p!answer; count := count + 1
The syntax for loop and conditional statements differs only in the braces used to enclose the set of
guards: curly ones are used for loops, and square ones for conditionals. Semantically, however, the
two statements are quite different. A loop statement is executed repetitively until all guards are
false. A conditional statement is executed exactly once: at least one guard must be true, otherwise
the statement and the program will abort. In both statements, if more than one guard is true, a
nondeterministic choice will be made. For example, if i=5 when the conditional statement below is
entered, all three of the guards are true. We cannot assume that the first one will be chosen, but
must settle for a random choice.
    [ i <= 9 -> SKIP
    | i > 0 -> i := -9
    | i = 5 -> p!five; found := found + 1
    ]
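
The selection rule for loops and conditionals can be sketched in Python as follows. This is only an illustration of the semantics described above, not part of TMPL: the names run_conditional, run_loop and EXAMPLE are hypothetical, guards and commands are modelled as plain functions over a state dictionary, and the output p!five is left out because no communication is simulated.

```python
import random


def run_conditional(guarded_commands, state, rng=random):
    """Execute a conditional once: choose any command whose guard is true."""
    ready = [cmd for guard, cmd in guarded_commands if guard(state)]
    if not ready:
        raise RuntimeError("no guard is true: the statement aborts")
    rng.choice(ready)(state)  # nondeterministic choice among true guards


def run_loop(guarded_commands, state, rng=random):
    """Execute a loop: repeat until every guard is false."""
    while True:
        ready = [cmd for guard, cmd in guarded_commands if guard(state)]
        if not ready:
            return
        rng.choice(ready)(state)


# The three-guard conditional from the text, without the p!five output.
EXAMPLE = [
    (lambda s: s["i"] <= 9, lambda s: None),                      # SKIP
    (lambda s: s["i"] > 0, lambda s: s.update(i=-9)),
    (lambda s: s["i"] == 5, lambda s: s.update(found=s["found"] + 1)),
]

state = {"i": 5, "found": 0}
run_conditional(EXAMPLE, state)
print(state)  # one of three outcomes, depending on the choice made
```

With i=5 all three guards are ready, so repeated runs may leave the state unchanged, set i to -9, or increment found.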
The nondeterministic property of the loop and conditional statements is an integral part of
programming a tree machine. The processors act independently; a pair of them is synchronized by
communication. Because a processor has three ports to other processors and knows nothing of the
timing characteristics of its neighbors, a TMPL program can describe only the sequence it will use to
access the ports. In many cases, a message could arrive on any one of a set of ports, giving rise to a
set of guarded commands, each one triggered by communication on a specific port. For example,
the conditional statement below has message expressions for all three ports. We can make no
statements about the order in which the guards become true: it depends both on when the ports
become active, and on the choice between true alternatives.
    [ bus,left,right?done -> active := active - 1
    | left,right?notDone -> active := active + 1
    ]
Conditional and loop statement guards are the only place that message expressions may
appear. They may be combined with the familiar logical expressions to make more complicated
expressions. While an individual guard can contain at most one message expression, a guard set can
contain many, and can mix message expression guards with those containing only logical
expressions. If logical and message expressions are combined in a guard, the logical expression must
appear first, and is evaluated first. Because the message expression can have the side effect of doing
an input or output as it is evaluated, it is examined only if the value of the logical expression is
true.
Like logical expressions, message expressions are evaluated. Unlike logical expressions, they
can assume one of three values. False signifies that the communication cannot be done, and is
assigned whenever some other communication is pending on the selected port. A message
expression is true if the communication is possible: the port is waiting for exactly this message. The
maybe value is assigned to the expression if it is neither true nor false, that is, the port mentioned
in the message expression has no pending communications.
The key to evaluating a message expression is the restriction that at least one of the two
processors involved in a communication must use the imperative form. The message statement forces
the communication, making the issuing processor and associated port busy until the transfer has
been completed successfully. Thus, a processor that is evaluating a set of message expressions has
only solid, imperative communications to match up with. The state on the busy ports won't change
while a decision is being made.
3. Mapping Arbitrary Fanouts Onto a Binary Tree 
TMPL programs are written as if the processor had as many immediate descendents as
needed. In fact, the underlying architecture is a binary tree, and arbitrary fanouts are simulated.
A processor definition that declares more than two internal ports to children becomes the root of a
composite processor, shown in Figure 3. Several layers of the tree are used to provide the required
number of descendents, and the intermediate processors, called padding processors, are provided
with a skeletal program that allows them to pass messages between the parent and children. The
process of generating those skeletal programs is the subject of this section. We begin with some
guidelines for the mapping algorithm, look at a pair of routing numbers that can uniquely identify a
descendent processor, and finish with a case by case treatment of the seven kinds of communication
that the padding processors must handle.
3.1. The Mapping Constraints 
The constraints we place on the mapping algorithm have distinctly different flavors. The first
one reflects the underlying system architecture, while the second constraint is an arbitrary design
decision.
First, there is no wildcard message name that matches everything. Because message
expressions must eventually be evaluated as true or false, the hardware knows about message names.
Thus, each padding processor must be tailor-made for the problem and will handle only those
messages that can pass between parent and child. If the parent and children exchange load, arc, and
answer messages, the padding processor must have message statements or expressions for load, arc
and answer messages as well.
Our second constraint is an attempt to minimize the amount of information the mapping
algorithm needs: we allow the program for the parent processor to be modified in the mapping
process, but not that of the child. It is clear that the code in the composite processor will require
modification: it specifies communications on more ports than it really has. But it is not necessary to
know at compile time what is connected to the ports.
Several benefits arise from this constraint. We can compile a set of padding processors based 
only on information available in a single processor definition. The resulting composite processor has
precisely the same communication characteristics as the unmapped processor. Thus, assertions about
the communication properties of the original processor hold for the composite.
3.2. The Routing Numbers 
We use two integers, a depth finder and a path, to uniquely address a specific descendent of a
composite processor. The depth finder indicates whether or not the message has reached a "leaf" of
the composite. The path selects the left or right branch at each processor in the composite. New
values for the depth finder and path are calculated at each level in the composite processor.
The depth finder measures the number of padding processors that must be traversed to reach
the bottom of the composite processor. Thus, the depth finder is not a measure of distance from
the root; rather, it is the distance from the leaves. Suppose we are simulating a fanout of n. The
composite processor will contain n - 1 processors: the root of the composite supplies two connections
to children, and each additional padding processor adds two connections while occupying an existing
one, for a net gain of one connection. Thus, a composite processor with fanout n contains one copy
of the modified parent processor and n - 2 padding processors. As we descend in the composite
processor this count can be recalculated to describe the number of padding processors in the
composite if the current one is the root. At the leaves of the composite, the depth is zero.
The rule for calculating the depth finder value depends on the fact that unbalanced composites
are heavy on the left side, as shown in Figure 3. The algorithm for calculating the depth finder is
given below. It is applied recursively for all subtrees in the composite processor: each descendent is
the root of a subtree, and supplies the value for $depth_{parent}$ to the next level.
$$depth_{parent} = n - 2$$
$$depth_{left} = \frac{depth_{parent} - 1}{2}$$
$$depth_{right} = \frac{depth_{parent} - 2}{2}$$
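
As a check on the depth finder rule, the following Python sketch applies it recursively and counts the padding processors it implies. Two points are assumptions rather than statements from the text: the divisions are taken to be integer divisions rounding down, and a negative result is read as "this child lies outside the composite", that is, it is a real child rather than a padding processor. The function names are hypothetical.

```python
def child_depths(depth_parent):
    """Depth finders passed to the left and right children (floor division assumed)."""
    return (depth_parent - 1) // 2, (depth_parent - 2) // 2


def count_padding(depth):
    """Count the padding processors below a node with the given depth finder.

    A negative child depth is taken to mean the child is not a padding processor.
    """
    total = 0
    for child in child_depths(depth):
        if child >= 0:
            total += 1 + count_padding(child)
    return total


n = 6                           # fanout of Figure 3
print(child_depths(n - 2))      # depth finders for the two children of the root
print(count_padding(n - 2))     # padding processors in the whole composite
```

For a fanout of six the root starts with a depth finder of four, both children receive one, and the recursion accounts for four padding processors, in agreement with the n - 2 count derived above.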
The second argument specifies a path to the desired child, and relies on the allocation of
logical to physical ports shown in Figure 3. The numbering of logical ports starts with 1, with odd
ports assigned to the left side, and even ones to the right. This rule is applied recursively until all
leaf processors have binary logical and physical fanout.
Figure 3. Mapping Arbitrary Fanouts onto a Binary Tree 
Suppose the processor has a logical fanout of n, and wants to communicate with its i-th child.
The path is initially set to i. At each node in the composite, odd path values are passed to the left,
and even values to the right. A new path value is computed at each level according to the
algorithm below. This, in conjunction with the depth finder, can be used to uniquely select a
descendent.
$$path_{parent} := i$$
$$odd(path_{parent}) \rightarrow path_{left} := \frac{path_{parent} + 1}{2}$$
$$even(path_{parent}) \rightarrow path_{right} := \frac{path_{parent}}{2}$$
Two routing numbers are needed to locate a specific processor because of the asymmetry of 
the composite processor. Paths from parent to child are not always the same length. In Figure 3, 
for example, it takes three steps to reach the sixth child, and only two to find the third one. Both 
the identity and depth of a descendent are essential information, and cannot be encoded in a single 
number. 
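
The two routing numbers can be exercised together in a short Python sketch that traces the turns taken from the root of a composite to a given logical child. This is an illustrative model, not the generated padding code: the function name route is hypothetical, integer division rounding down is assumed for the depth finder, and a negative depth finder is assumed to mean that the message has left the composite and reached the real child.

```python
def route(n, i):
    """Return the turns ("L" or "R") from the composite root to logical child i of n."""
    if n < 2 or not 1 <= i <= n:
        raise ValueError("need a fanout of at least 2 and a child index in 1..n")
    depth, path, turns = n - 2, i, []
    while True:
        if path % 2 == 1:                 # odd path values go left
            turns.append("L")
            path = (path + 1) // 2
            depth = (depth - 1) // 2
        else:                             # even path values go right
            turns.append("R")
            path = path // 2
            depth = (depth - 2) // 2
        if depth < 0:                     # assumed: past the last padding processor
            return turns


print(route(6, 6))   # three steps to the sixth child
print(route(6, 3))   # two steps to the third child
```

Under these assumptions the sketch reproduces the observation about Figure 3: the sixth child is three steps from the root and the third child only two, so the path value alone could not tell a padding processor when to stop forwarding.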
3.3. Generating Programs for the Padding Processors 
Padding processor programs are generated based on the parent's view of communication. We 
do not have access to the programs for the descendent processors, and must retain the original mes-
sage interface in the fully padded composite. A list of message statements and expressions that the 
parent program might utilize follows: 
1. imperative output 
2. imperative broadcast output 
3. imperative input 
4. imperative broadcast input 
5. conditional output 
6. conditional input 
7. conditional broadcast input 
Note the absence of conditional broadcast output, better described as the case where all descendents 
from a node are ready to input the same message. This state can be expanded into a guard that is 
the logical AND of a set of message expressions, one for each descendent. Since guards cannot 
contain more than one message expression, we do not implement conditional broadcast output. 
Each of the valid cases is discussed briefly below. The general strategy is this: the parent
processor definition is modified to communicate with two descendents through internal ports l and r.
Two, one, or none of the routing numbers are appended to the message, as needed. There may be 
extra messages used for synchronization between parent and child: the message no longer travels 
directly. 
Each padding processor is given the program segment required to read or write the messages 
expected by the parent. If any routing numbers have been added to a message, they must be 
stripped off before being handed to the child processor: it is expecting the original message. The 
depth finder, which records the number of padding processors yet to traverse, is zero at the bottom 
of the composite processor. 
Imperative output directed at a specific processor requires both routing numbers to locate the 
receiver. Imperative broadcast output communication requires no additional arguments or mes-
sages. The message is allowed to spread throughout the composite processor, since it is directed at 
all descendents. 
Imperative input from a specific processor requires both routing numbers, like the correspond-
ing output command. In fact, any time a specific processor is mentioned in a communication
statement, both numbers are used to locate it. Imperative broadcast input asks for a message from any
one of its children. Note that the statement c(*)?msg is equivalent to a conditional statement: 
    [ c(1)?msg -> SKIP
    | c(2)?msg -> SKIP
    | ...
    | c(n)?msg -> SKIP
    ]
Remember that guards are not evaluated in any particular order: a non-deterministic choice is made 
among those that evaluate to true. 
The imperative broadcast input is treated as if it were a conditional statement: each child is 
asked whether it has something to give the parent. If not, the next child is interrogated until one is 
found that is trying to send the matching message. The fact that the input is imperative guarantees 
that there will be at least one child that wants to send the message. 
The template for imperative broadcast input relies heavily on the fact that message expressions 
are evaluated. The pair of statements that follow demonstrate a technique for finding out how a 
message expression was evaluated. 
    got := FALSE; { NOT got AND l?msg -> got := TRUE }
The loop will execute zero or one times, and the boolean value got can be used to find out what
happened. The loop will be stuck evaluating the guard until some communication is initiated on the
port called l. If the communication is a request to output msg, the value of the guard is TRUE, and
got becomes true. The loop will not be executed again because of the NOT got phrase. If, on the
other hand, the communication initiated on port l does not match, the value of the guard is false,
and the body of the loop is never executed; got remains false. Thus, the value of got tells us
whether or not a msg was input from port l. Imperative broadcast input asks each child in turn to
output a message. As soon as a child is found that will satisfy the request, the polling terminates.
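
The behaviour of this loop under three-valued guard evaluation can be modelled with a small Python sketch. It is a simplification offered under stated assumptions: the port is represented only by the sequence of pending messages seen at successive evaluations of the guard (None while nothing is pending), and the names message_guard and try_input are hypothetical.

```python
TRUE, FALSE, MAYBE = "true", "false", "maybe"


def message_guard(pending, wanted):
    """Three-valued message expression for a port offering `pending` (or None)."""
    if pending is None:
        return MAYBE
    return TRUE if pending == wanted else FALSE


def try_input(port_states, wanted):
    """Model of: got := FALSE; { NOT got AND l?msg -> got := TRUE }."""
    got = False
    for pending in port_states:
        value = message_guard(pending, wanted)
        if value == MAYBE:
            continue        # guard undecided: keep evaluating it
        if value == TRUE:
            got = True      # the loop body runs once
        break               # guard false, or NOT got now false: the loop ends
    return got


print(try_input([None, None, "msg"], "msg"))   # the matching message arrives
print(try_input([None, "other"], "msg"))       # a different message arrives
```

In the first call got ends up true after two undecided evaluations; in the second the guard becomes false as soon as a non-matching communication is pending, and got stays false.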
Conditional input, conditional output, and conditional broadcast input are all similar. In each 
case, the specific child (or each child in succession) is asked whether or not it can supply the 
requested message. If so, the message is accepted and moved up to the top row of padding proces-
sors so that it is available for the parent. The padding tree will have only one instance of a particu-
lar message stored in it at a time. If the guard is part of a loop, the message will be re-requested 
before the next iteration of the loop. 
Conditional input and output require the use of both routing numbers, but do not send up an 
explicit no answer. Conditional broadcast input uses the no answer to continue polling the children, 
but needs only the depth finder routing number. 
Appendix 1 contains templates for generating skeleton programs for the padding processors. 
The compiler uses these templates, filling in specific port and message references as needed. The 
padding processor program is a conditional statement, with a guard for every possible communica-
tion between parent and child. 
4. Communication Primitives in the Hardware 
We devised two criteria for the tree machine processor and port design. First, the design must 
use chip area sparingly. In general, we are willing to sacrifice execution speed in order to limit the
number of functions built into the hardware. Each new instruction is carefully scrutinized before
being added to the repertoire. Second, the processor should reflect the structure and scope of
TMPL. We intend to program this machine solely in TMPL. Thus the instruction set is designed 
for easy translation from TMPL to machine code. 
These two goals complement each other when designing instructions to implement TMPL
statements that do not involve communication. However, when we attempt to implement directly the
rich communication structure of TMPL, we find the complexity of the processor, and thus the area 
it occupies, increasing. 
The double queue implementation discussed below is a nice compromise. It provides direct 
implementation of two of the four communication modes, conditional input and imperative output. 
The other two modes can be implemented via compiler translation into the chosen modes. This 
transformation is discussed later. 
4.1. An Overview of the Queue Design 
A TMPL port is a bidirectional one: the same port is used for both input and output. The
underlying hardware has two distinct unidirectional connections between the processors, controlled
by a port. The hardware port is a simple micro-coded processor, distinct from the tree machine
processor. The port and its instruction set are described in more detail in the next section.
Each unidirectional connection between processors has an associated queue that buffers
messages, thus decoupling the two processors, as shown in Figure 4. The length of the queue
determines the independence of the communicating processors, but cannot matter to the algorithms
controlling the port. One may assume a length of one for all of the discussion that follows. A queue
size will eventually be chosen through experimentation and simulation.
Figure 4. Communication Using Queues. 
Each access to a port causes one word in the associated queue to be read or written. Thus, a
TMPL message with several arguments is compiled into a sequence of messages, one for each
argument:
    l!arc(i,j)      becomes   l!arc; l!i; l!j
    l?arc(i,j) ->   becomes   l?arc -> l?i; l?j
Notice that the message name is no longer associated with its arguments. If the two communicating 
processors have different notions of how many arguments the message contains, chaos will ensue. 
An alternative implementation would tag each argument with the message name, as in the example 
below. 
    l!arc(i,j)      becomes   l!arc(i); l!arc(j)
Tagging provides some measure of run-time validity checking on messages, at the cost of extra bits
in each entry in the queue. Since we have complete information about the nature of the
communication between processors at compile time (see Figure 2), we do better to put the checking
there, retaining our streamlined port design.
4.2. The Port Instruction Set 
The port responds to three instructions from the processors it interfaces. MATCH(msg)
compares its argument with the first element in the queue. It returns true if they match, false if they
are different, and maybe if the queue is empty. READ removes the first element from the queue
and returns its value to the requesting processor. If the queue is empty, it waits until something is
inserted. WRITE(msg) adds its argument to the end of the queue, returning true to the writer. If
the queue is full, maybe is returned, and the writer must retry the operation. Figure 5 formally
presents the algorithms. We assume the usual operations on queues (called Q in Figure 5):
empty(Q) is a boolean function that is true when the queue is empty, head(Q) returns the value of
the first element in the queue, removeHead(Q) returns the first element and removes it from the
queue, full(Q) is a boolean function that is true when the queue is full, and add(Q,msg) adds msg to
the end of the queue.
It remains to show how TMPL message statements and expressions are compiled into port 
instructions. We will look only at conditional input and imperative output. The next section 
describes a technique for transforming the missing communication modes into these. 
Conditional inputs require the MATC H and READ instructions. MATCH is used to satisfy 
the condition: the message will be input only if MATC H returns true. The arguments arc retrieved 
with R EAD instructions. 
{ MATCH(msg) ->
    [ NOT empty(Q) AND head(Q) = msg -> return(TRUE)
    | NOT empty(Q) AND head(Q) ≠ msg -> return(FALSE)
    | empty(Q) -> return(MAYBE)
    ]
| READ ->
    [ NOT empty(Q) -> return(removeHead(Q))
    | empty(Q) -> SKIP
    ]
| WRITE(msg) ->
    [ NOT full(Q) -> add(Q,msg); return(TRUE)
    | full(Q) -> return(MAYBE)
    ]
}
Figure 5. The Port Instruction Set
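The port instructions can also be modelled in software. The following Python sketch is an illustration only, not the hardware design: the class name Port, the Reply enumeration, and the choice to have read return None on an empty queue (so that the caller polls, where the hardware READ waits) are assumptions introduced here. The default capacity of two is likewise just a parameter of the sketch.

```python
from collections import deque
from enum import Enum


class Reply(Enum):
    TRUE = "true"
    FALSE = "false"
    MAYBE = "maybe"


class Port:
    """One direction of a port, modelled as a bounded FIFO queue."""

    def __init__(self, capacity=2):
        self.capacity = capacity
        self.queue = deque()

    def match(self, msg):
        """Compare msg with the head of the queue without removing it."""
        if not self.queue:
            return Reply.MAYBE
        return Reply.TRUE if self.queue[0] == msg else Reply.FALSE

    def read(self):
        """Remove and return the head; None stands in for 'keep waiting'."""
        if not self.queue:
            return None
        return self.queue.popleft()

    def write(self, msg):
        """Append msg, or report MAYBE so that the writer retries."""
        if len(self.queue) >= self.capacity:
            return Reply.MAYBE
        self.queue.append(msg)
        return Reply.TRUE


port = Port()
first = port.match("arc")    # nothing queued yet
w1 = port.write("arc")
w2 = port.write(1)
w3 = port.write(2)           # queue already holds two words
```

In this model a writer that receives MAYBE simply calls write again later, which mirrors the retry behaviour described for WRITE.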
p?arc(i,j) ->   becomes   MATCH(arc) -> READ; i := READ; j := READ
Imperative outputs use only the WRITE instruction, but are complicated by the fact that the output
will fail if the queue is full. Thus, they are implemented in a loop, shown in the TMPL notation
below.
p!arc(i,j)   becomes   [ WRITE(arc) -> SKIP ];
                       [ WRITE(i) -> SKIP ];
                       [ WRITE(j) -> SKIP ]
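The two translations can be written as a pair of small text-generating functions. This Python sketch is only an illustration of the mapping described in this section; the function names compile_input and compile_output are hypothetical, and the real compiler presumably works on a parsed program rather than on strings.

```python
def compile_input(name, args):
    """Conditional input p?name(args): MATCH guards, READs fetch the words."""
    # The first READ discards the message name that MATCH has just examined.
    reads = ["READ"] + [f"{a} := READ" for a in args]
    return f"MATCH({name}) -> " + "; ".join(reads)


def compile_output(name, args):
    """Imperative output p!name(args): one retried WRITE per word."""
    words = [name, *args]
    return "; ".join(f"[ WRITE({w}) -> SKIP ]" for w in words)


input_code = compile_input("arc", ["i", "j"])
output_code = compile_output("arc", ["i", "j"])
```

Each generated WRITE is wrapped in its own conditional so that a maybe reply from a full queue causes that word alone to be retried.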
The first implementation of the port utilizes queues of length two. The port is split between
the two processors that share it, with one word of each queue in each processor. Transfers between
the two halves of the queue are done bit-serially, with one bit transferred during each storage cycle.
4.3. Other Comments about the Processor
The processors in our present design have 12-bit registers, and a 12-bit address for accessing a
program store organized as 4-bit nibbles. Instructions are variable length, ranging from two to six
nibbles long. Most data is stored in the sixteen general purpose registers. Floating point numbers
occupy four registers: one for the exponent, and three for the mantissa. A subroutine for floating
point multiply is about 150 nibbles long and executes in about 600 storage cycles.
This relatively serial organization and direct connection to on-chip storage allows a short
storage cycle and simple instruction pre-fetch techniques. Internal pipelining in the pre-fetch
organization means that some nibbles fetched are not executed, but this cycle is used to good advantage
to refresh dynamic storage. The processors are small enough that we expect to fit four of them with
1024 nibbles of storage each on a chip with λ = 1 micron, or one per chip with λ = 2 microns. By
adhering to the principles of smallness and regularity, the processors achieve very good duty-factors
on their internal parts, and are expected to be very fast.
5. Mapping TMPL Primitives onto the Hardware 
The hardware design places a pair of queues between the two communicating processors. It
supports three operations: putting a message at the end of the output queue, removing the top
element of the input queue, and non-destructively examining the top element of the input queue.
These operations correspond to imperative output and conditional input. TMPL programs also
contain imperative inputs and conditional outputs, and these must be transformed into the two forms
that are implemented. We discuss that mapping here.
5.1. Imperative Input
An imperative input statement is semantically identical to a conditional statement with only
one guard, the expression form of the imperative, and a no-op action following the guard:
p?msg   ≡   [ p?msg -> SKIP ]
Message expressions remain in the maybe state until there is some activity on the port named in the
expression. As long as not all of the guards in a conditional or loop statement are false, the
statement will not terminate. In this case, the processor will wait until something arrives at the port. If
the message matches, the input is done, the no-op is executed, and the conditional statement is
completed. If some other message is pending on the port, the conditional statement aborts, as does
a mismatched imperative.
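The three outcomes of the single-guard conditional (wait, complete, abort) can be traced with a short Python sketch. It is a simplified model under stated assumptions: the queue is an unbounded deque, and the arrivals iterator is a hypothetical stand-in for the processor at the other end of the port, delivering one word each time the guard is found in the maybe state.

```python
from collections import deque
from enum import Enum


class Tri(Enum):
    TRUE = 1
    FALSE = 2
    MAYBE = 3


def guard(queue, msg):
    """Evaluate the message expression p?msg against the input queue."""
    if not queue:
        return Tri.MAYBE
    return Tri.TRUE if queue[0] == msg else Tri.FALSE


def imperative_input(queue, msg, arrivals):
    """Run [ p?msg -> SKIP ]: wait on maybe, input on true, abort on false."""
    while True:
        state = guard(queue, msg)
        if state is Tri.TRUE:
            queue.popleft()      # the input is done; the action is a no-op
            return "completed"
        if state is Tri.FALSE:
            return "aborted"     # some other message is pending
        word = next(arrivals, None)
        if word is None:
            raise RuntimeError("nothing will ever arrive on this port")
        queue.append(word)       # activity on the port ends the maybe state


matched = imperative_input(deque(), "msg", iter(["msg"]))
mismatched = imperative_input(deque(), "msg", iter(["other"]))
```

The RuntimeError branch exists only so that the sketch terminates; a real processor would simply keep waiting.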
We have just changed an imperative input into a conditional, and in the process, may have
created an illegal conditional to conditional communication. If we stopped here, that would be a
major concern. However, we are also going to remove all conditional outputs in this compilation
step. With only conditional inputs and imperative outputs remaining, it is impossible to have an
invalid pairing of conditionals.
5.2. Conditional Output 
When conditional output is used, the processor is asking the port whether the other processor
is waiting to input the message. The conditional output can be restated as a pair of
communications. First the processor issues a message statement asking, in essence, if the other processor wants
the message. This imperative communication goes outside the body of the conditional or loop
statement that contains the original conditional output. That expression is replaced with a conditional
input waiting for a positive reply to the query. The replacement technique is shown below.
[ p!msg -> command ]   becomes   p!query; [ p?yes -> p!msg; command ]
Notice that a new message, query, has been introduced. The program in the processor at the
other end of the port must now be changed to accept the new message. The message typing
discussed earlier makes it trivial to identify the imperative input that is paired with the original
conditional output, and replace it with a conditional statement that waits for the query message, answers
it, and then inputs msg. The replacement code is given below.
p?msg;   becomes   [ p?query -> p!yes; [ p?msg -> SKIP ] ]
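The two single-guard replacements can be expressed as string templates. This Python sketch only restates the rewriting rules given in this section; the function names are hypothetical, and a real compiler would rewrite a parsed program using the connection plan rather than formatting text.

```python
def rewrite_conditional_output(port, msg, command):
    """[ p!msg -> command ] in the processor doing the conditional output."""
    return f"{port}!query; [ {port}?yes -> {port}!{msg}; {command} ]"


def rewrite_paired_input(port, msg):
    """p?msg in the processor at the other end of the same port."""
    return f"[ {port}?query -> {port}!yes; [ {port}?{msg} -> SKIP ] ]"


sender_code = rewrite_conditional_output("p", "msg", "command")
receiver_code = rewrite_paired_input("p", "msg")
```

After both rewrites only imperative outputs and conditional inputs remain, which is the pair of forms that the port supports directly.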
The example above uses a conditional statement with a single guard. The same replacement
technique is used for statements with multiple guards, but there is a subtle difference in the code in
the processor doing the imperative inputs: it must clear all of the query messages before requesting
the original message. Figure 6 shows a loop statement with three conditional outputs as guards, and
an imperative input that matches one of them. We show how the code in both processors is
rewritten to remove the conditional outputs from one and the imperative input from the other.
Removing conditional outputs from the source code is not as clean as one might like. The
connection plan for the tree must be used to identify the port connections, since both processors
involved in the conditional output must be modified. A solution that requires changing only the
program in the processor actually doing the conditional output is preferred, but elusive.
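The multiple-guard exchange of Figure 6 can be followed step by step in a small Python simulation. It is a sketch under simplifying assumptions that are not in the paper: the two directions of the port are unbounded deques rather than queues of length two, the two processors run in strict phases instead of concurrently, and the sender's loop is executed for a single pass. All function and variable names are hypothetical.

```python
from collections import deque


def send_queries(to_receiver, guards):
    """Sender: p!query1; p!query2; p!query3."""
    for k in guards:
        to_receiver.append(f"query{k}")


def clear_queries_and_ack(to_receiver, to_sender, guards, wanted):
    """Receiver: consume every query in order, then acknowledge one."""
    for k in guards:
        if to_receiver.popleft() != f"query{k}":
            raise ValueError("queries arrived out of order")
    to_sender.append(f"ack{wanted}")


def answer_ack(to_receiver, to_sender, guards):
    """Sender: one pass of { p?ackK -> p!mK; SKIP }."""
    ack = to_sender.popleft()
    for k in guards:
        if ack == f"ack{k}":
            to_receiver.append(f"m{k}")
            return k
    raise ValueError("unknown acknowledgement")


def simulate(wanted, guards=(1, 2, 3)):
    to_receiver, to_sender = deque(), deque()
    send_queries(to_receiver, guards)
    clear_queries_and_ack(to_receiver, to_sender, guards, wanted)
    chosen = answer_ack(to_receiver, to_sender, guards)
    delivered = to_receiver.popleft()      # receiver: [ p?m2 -> SKIP ]
    return chosen, delivered, len(to_receiver), len(to_sender)


result = simulate(wanted=2)
```

Because the receiver drains all three queries before sending its acknowledgement, no stale query is left at the head of its input queue when the requested message arrives.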
Original Source

    { p!m1 -> SKIP                      p?m2
    | p!m2 -> SKIP
    | p!m3 -> SKIP
    }

Rewritten Source

    p!query1; p!query2; p!query3;       [ p?query1 -> SKIP ];
    { p?ack1 -> p!m1; SKIP              [ p?query2 -> SKIP ];
    | p?ack2 -> p!m2; SKIP              [ p?query3 -> SKIP ];
    | p?ack3 -> p!m3; SKIP              p!ack2; [ p?m2 -> SKIP ]
    }

Figure 6. Removing Conditional Outputs
6. Conclusions
We have described a strategy for providing local communication among small, simple
processors connected like a binary tree. The design has both software and hardware components.
The software, a compiler, allows the programmer to write programs for trees with arbitrary
fanout using a rich set of communication primitives. The compiler recursively applies a mapping
algorithm that assigns logical ports to physical ones until the tree with arbitrary fanout has been
made binary. Skeleton programs for the intermediate processors that pass messages between parent
and child are produced.
The hardware design places a pair of queues between processors. The queues implement a
subset of the software primitives; the others can be mapped onto them by the compiler. The
advantage of the queue scheme is its simplicity. Because the power of the tree machine comes from the
number of processors, and not the capabilities of an individual one, a hardware design that is small,
simply described, and easy to use, is desired.
Acknowledgements. The research described here was sponsored in part by the Defense
Advanced Research Projects Agency, ARPA order #3771, and monitored by the Office of Naval
Research under contract #N00014-79-C-0597. Our starting point was a sketch of the machine
presented in S. A. Browning's doctoral dissertation [1]. A group of nine people met regularly to refine
that sketch into a first implementation in silicon. Martin Rem, Lennart Johnsson, and Peggy Li
made valuable contributions to the discussions of mapping problems between the notational
abstraction and the machine implementation. Each potential design was examined in light of the
requirements of the notation until one was found that satisfied both the programmers and the hardware
designers. A compiler like the one described here is being implemented by S. A. Browning at Bell
Laboratories.
A tree machine processor has been designed and is being laid out by C. L. Seitz, Howard 
Derby, Chris Kingsley, and Chris Lutz. Erik deBenedictis and Peggy Li have written a collection of 
programs to evaluate the processor design. 
References 
1. Sally A. Browning, The Tree Machine: A Highly Concurrent Computing Environment, California
Institute of Technology (1980). Computer Science Technical Report #3760.
2. E. W. Dijkstra, A Discipline of Programming, Prentice-Hall, Englewood Cliffs, New Jersey
(1976).
3. C. A. R. Hoare, "Communicating Sequential Processes," Comm. ACM 21(8), pp. 666-677 (August,
1978).
4. Charles L. Seitz, "Self-Timed VLSI Systems," Proc. Caltech Conference on Very Large Scale
Integration, pp. 345-356 (January, 1979).
5. Ivan E. Sutherland and Carver A. Mead, "Microelectronics and Computer Science," Scientific
American 237(3), pp. 210-228 (September, 1977).
Appendix 1. Templates for Padding Programs 
1. Imperative output: c(i)!msg
parent:
    [ odd(i) -> l!msg((n-3)/2,(i+1)/2)
    | even(i) ->
        [ n=3 -> r!msg
        | n>3 -> r!msg((n-4)/2,i/2)
        ]
    ]
padding:
    p?msg(d,p) ->
        [ d=0 AND odd(p) -> l!msg
        | d<2 AND even(p) -> r!msg
        | d>0 AND odd(p) -> l!msg((d-1)/2,(p+1)/2)
        | d>1 AND even(p) -> r!msg((d-2)/2,p/2)
        ]
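The routing performed by template 1 can be checked by tracing the path that a message takes from a parent with n logical children to child i. The Python sketch below follows the guards of the parent and padding templates literally; it assumes that the divisions in the template are integer divisions that round down, that children are numbered from 1, and that n is at least 3. The function name route is hypothetical.

```python
def route(n, i):
    """Return the left/right hops from the parent to logical child i of n."""
    if n < 3 or not 1 <= i <= n:
        raise ValueError("expected n >= 3 and 1 <= i <= n")
    # Parent step: odd children go left, even children go right.
    if i % 2 == 1:
        path = ["l"]
        d, p = (n - 3) // 2, (i + 1) // 2
    else:
        path = ["r"]
        if n == 3:
            return "".join(path)      # r!msg goes straight to the child
        d, p = (n - 4) // 2, i // 2
    # Padding steps: repeat until a message without (d, p) is forwarded.
    while True:
        if p % 2 == 1:
            path.append("l")
            if d == 0:
                break                 # d=0 AND odd(p) -> l!msg
            d, p = (d - 1) // 2, (p + 1) // 2
        else:
            path.append("r")
            if d < 2:
                break                 # d<2 AND even(p) -> r!msg
            d, p = (d - 2) // 2, p // 2
    return "".join(path)


paths_for_five = [route(5, i) for i in range(1, 6)]
```

Under these assumptions every child receives a distinct path and no path is a prefix of another, which is what one expects if the logical children sit at the leaves of the padded binary tree.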
2. Imperative Broadcast Output: c(*)!msg
parent:
    l,r!msg
padding:
    p?msg -> l,r!msg




        l!request((n-3)/2,(i+1)/2);
        l?msg
    | even(i) ->
        [ n=3 -> SKIP




    [ d=0 AND odd(p) -> l?msg
    | d<2 AND even(p) -> r?msg
    | d>0 AND odd(p) ->
        l!request((d-1)/2,(p+1)/2);
        l?msg
    | d>1 AND even(p) ->









    [ l?msg -> SKIP
    | l?no ->
        [ n=3 -> SKIP





    got := FALSE;
    [ d=0 ->
        { NOT got AND l?msg -> got := TRUE }
    | d>0 ->
        l?req((d-1)/2);
        [ l?msg -> got := TRUE
        | l?no -> SKIP
        ]
    ];
    [ NOT got AND d<2 ->
        { NOT got AND r?msg -> got := TRUE }
    | NOT got AND d>1 ->
        r?req((d-2)/2);
        [ r?msg -> got := TRUE
        | r?no -> SKIP
        ]
    ];
    [ got -> p!msg
    | NOT got -> p!no
    ]
5. Conditional Output: c(i)!msg ->
parent:
    [ odd(i) -> l!msg((n-3)/2,(i+1)/2)
    | even(i) ->
        [ n=3 -> SKIP
        | n>3 -> r!msg((n-4)/2,i/2)
        ]
    l,r?yes ->
        whatever was here
    [ odd(i) -> l!msg((n-3)/2,(i+1)/2)
    | even(i) ->
        [ n=3 -> SKIP




padding:
    [ d=0 AND odd(p) ->
        got := FALSE;
        { NOT got AND l!msg -> got := TRUE };
        [ got -> p!yes
        | NOT got -> SKIP
        ]
    | d<2 AND even(p) ->
        got := FALSE;
        { NOT got AND r!msg -> got := TRUE };
        [ got -> p!yes
        | NOT got -> SKIP
        ]
    | d>0 AND odd(p) -> l!msg((d-1)/2,(p+1)/2)
    | d>1 AND even(p) -> r!msg((d-2)/2,p/2)
    ]
    | l,r!yes -> p!yes
6. Conditional Input: c(i)?msg ->
parent:
    [ odd(i) -> l?req((n-3)/2,(i+1)/2)
    | even(i) ->
        [ n=3 -> SKIP
        | n>3 -> r!msg((n-4)/2,i/2)
        ]
    l,r?msg ->
        whatever was there
    [ odd(i) -> l?req((n-3)/2,(i+1)/2)
    | even(i) ->
        [ n=3 -> SKIP




padding:
    [ d=0 AND odd(p) ->
        got := FALSE;
        { NOT got AND l?msg -> got := TRUE };
        [ got -> p!msg
        | NOT got -> SKIP
        ]
    | d<2 AND even(p) ->
        got := FALSE;
        { NOT got AND r?msg -> got := TRUE };
        [ got -> p!msg
        | NOT got -> SKIP
        ]
    | d>0 AND odd(p) -> l!req((d-1)/2,(p+1)/2)
    | d>1 AND even(p) -> r!req((d-2)/2,p/2)
    ]
    | l,r?msg -> p!msg
7. Conditional Broadcast Input: c(*)?msg ->
parent:
    l!req((n-3)/2);
    [ l?yes -> SKIP
    | l?no ->
        [ n=3 -> SKIP
        | n>3 -> r!req((n-4)/2)
        ];
        [ n=3 -> SKIP
        | n>3 AND r?yes -> SKIP
        | n>3 AND r?no -> SKIP
        ]
    l,r?msg ->
        whatever was there
    l!req((n-3)/2);
    [ l?yes -> SKIP
    | l?no ->
        [ n=3 -> SKIP
        | n>3 -> r!req((n-4)/2)
        ];
        [ r?yes -> SKIP






padding:
        { NOT got AND l?msg -> got := TRUE }
    | d>0 ->
        l?req((d-1)/2);
        [ l?yes -> got := TRUE; l?msg
        | l?no -> SKIP
        ]
    ];
    [ NOT got AND d<2 ->
        { NOT got AND r?msg -> got := TRUE }
    | NOT got AND d>1 ->
        r?req((d-2)/2);
        [ r?yes -> got := TRUE; r?msg
        | r?no -> SKIP
        ]
    ];
    [ got -> p!yes; p!msg
    | NOT got -> p!no
    ]
