Computations on a Tree of Processors by Browning, Sally A.
Computations on a Tree of Processors 
Sally A. Browning 
Computer Science Department 
California Institute of Technology 
Pasadena, California 91125 
Because processors and memories are both implemented in silicon, it
is worthwhile to consider architectures that mingle both functions
on a single chip. With the VLSI promise of a million or so devices
on a chip, several hundred processors can communicate with each
other at on-chip speeds.
But in order to manage the complexity of such a chip, both when
designing it and when testing it, the interprocessor communication
paths should be regular and simple. This paper examines the utility
of a particular interconnect scheme, a binary tree.
The processors are arranged as a binary tree: each processor except
the root has a single parent, and each processor except those at
the leaves of the tree has two descendents. This arrangement models
the hierarchical communication found in large organizations.
The binary tree architecture has some interesting aspects that make
it a good choice for a general purpose structure. Any particular
processor in a tree of n processors can be accessed in at most
log2 n time. This compares favorably with the O(n) access time for a
linear arrangement of processors, or the O(sqrt(n)) access time if the
processors are arranged in a rectangular array.
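The comparison above can be made concrete with a small Python sketch. It is only an illustration under stated assumptions: the function name worst_case_hops is hypothetical, n is taken to be a perfect square so the array is square, the entry points are assumed to be one end of the list, one corner of the array, and the root of the tree, and the tree is assumed complete with nodes numbered 1..n level by level.

```python
from math import isqrt


def worst_case_hops(n: int) -> dict[str, int]:
    """Worst-case number of links from the entry processor to any other.

    Assumptions (not from the paper): n is a perfect square, the list is
    entered at one end, the array at a corner, and the tree at its root.
    """
    side = isqrt(n)
    if side * side != n:
        raise ValueError("n must be a perfect square for the array case")
    return {
        "list": n - 1,               # walk the whole chain
        "array": 2 * (side - 1),     # corner to opposite corner
        "tree": n.bit_length() - 1,  # floor(log2 n): depth of node n
    }


if __name__ == "__main__":
    print(worst_case_hops(1024))
```

For 1024 processors the sketch gives 1023 links for the list, 62 for the array, and 10 for the tree.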
The number of processors available increases exponentially at each
level in the tree machine. If the problem to be solved has this
growth pattern, then the tree geometry will match the problem. By
contrast, processors arranged as a list have a constant number of
processors (namely 1) available at each level. And rectangular
arrays make a polynomial number of processors available at each
level, according to the allocation scheme chosen. Figure 1
demonstrates this property of the three structures.
The research described in this paper was sponsored by the Defense
Advanced Research Projects Agency under contract number
N00123-78-C-0806.
CALTECH CONFERENCE ON VLSI, January 1979
These three schemes are the simplest ways to connect processors
together. They provide each processor with two (the list), three
(the tree), or four (the square array) neighbors. Each has
advantages over the others, and each has a fan club.
The point of my research is to determine whether or not there is a
predominant geometry to the problems that might be solved on a
highly concurrent machine. If such a geometry exists, and a
hardware implementation is realizable in silicon, then that machine
should be built.
Since the tree architecture appears to have more flexibility than
the other two structures, I have concerned myself mostly with it.
This paper will describe several algorithms that have been
successfully mapped onto the tree. In later sections, the
Ringmachine, a linear array of processors proposed by Mike
Ullner [7], will be introduced in order to show that problems that
are dominated by loading and unloading do not require the
additional communication paths available in the tree. The final
section of the paper describes a problem from numerical analysis
that makes effective use of the tree machine. The paper ends with
some comments about the direction future investigations will take.
A Digression into Programming Notation. 
The processors in the tree have some characteristics that must be
emphasized by the notation used to describe them.
First, each processor is a general computing machine with some
amount of local store. A template describes both the program
and the data that will characterize each processor. This template
will be instantiated as many nodes of the tree.

    List:    P_1 = 1,  P_i = 1
    Tree:    P_1 = 1,  P_i = 2^(i-1)
    Corner:  P_1 = 1,  P_i = P_(i-1) + 1
    Center:  P_1 = 1
    P_i = number of processors at level i.
Figure 1. Processors Available at each Level.

ARCHITECTURE SESSION
Communication between each processor, its parent, and its children
should be limited to explicitly defined entry points. That is,
there is no omnipotent entity that is able to oversee and influence
the actions of other processors except as explicitly described.
Each processor can expect to have local sovereignty, and can only
be affected by communication it expects.
And perhaps most importantly, locality should be encouraged in the
problem solutions. Communication between processors requires
synchronizing their actions, limiting the amount of concurrency
that can be achieved.
The notation that embodies these criteria is the class construct
described by Dahl and Hoare [4]. The class allows the programmer to
define as a single entity both a data structure and the procedures
that operate on it. Thus the implementation details are known only
to the class itself. Each object is an instance of a class, and can
be thought of as a machine, capable of local computation but
responding to well-defined requests from the outside world.
The most widely known programming language that incorporates the
class construct is SIMULA 67 [2]. SIMULA extends the syntax of
Algol 60 with class definitions. I will use a modified version of
the SIMULA syntax to describe the nodes in the processor tree. The
syntax for a class declaration can be described in BNF as follows:
    <class declaration> ::= class <class identifier>
                            <formal parameter part>;
                            <attribute part>;
                            <class body>
    <class body> ::= <statement>
In order to describe highly concurrent algorithms despite the
sequential nature of SIMULA, the meaning of the semicolon symbol is
changed. In vanilla SIMULA, semicolon is used to terminate a
statement. Instead, read semicolon as "At this point, all
statements in progress must be terminated before advancing to the
next statement". Linefeed will be used to indicate syntactic end of
the statement. In other words, linefeeds are used to separate
statements; semicolons are used to separate groups of statements
that can execute concurrently. E. W. Dijkstra introduces this
semicolon convention in [6].
Making Arbitrary Branching Ratios. 
While the physical structure of the tree restricts each processor
to two descendents, a logical structure can be imposed on the tree
to accommodate an arbitrary branching ratio. Each logical processor
consists of several physical processors, enough to provide the
desired number of offspring. A logical node with n children is
built from n-1 physical nodes and is log2 n levels deep (rounded up
when n is not a power of two). Figure 2
shows some sample logical processors. Figure 3 gives a SIMULA
representation of the algorithm used to simulate arbitrary
branching ratios.
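The construction of a logical node can be sketched in Python as follows. This is not the SIMULA of Figure 3. It is a minimal illustration that assumes the same split of n children into a left group of (n+1)//2 and a right group of n//2. The names Child, Physical, build_logical_node, physical_count, depth, and son are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Child:
    """Stand-in for one logical child (hypothetical name)."""
    index: int


@dataclass
class Physical:
    """One physical binary processor inside a logical node."""
    left: object
    right: object
    size: int  # number of logical children reachable below this node


def build_logical_node(n, first=1):
    """Fan out to n logical children, numbered first..first+n-1."""
    if n == 1:
        return Child(first)
    left_size = (n + 1) // 2
    return Physical(
        left=build_logical_node(left_size, first),
        right=build_logical_node(n - left_size, first + left_size),
        size=n,
    )


def physical_count(t):
    """Number of physical processors used by the logical node."""
    if isinstance(t, Child):
        return 0
    return 1 + physical_count(t.left) + physical_count(t.right)


def depth(t):
    """Number of physical levels between the top and the deepest child."""
    if isinstance(t, Child):
        return 0
    return 1 + max(depth(t.left), depth(t.right))


def son(t, s):
    """Route from the top of the logical node to its s-th child (1-based)."""
    while isinstance(t, Physical):
        left_size = (t.size + 1) // 2
        if s <= left_size:
            t = t.left
        else:
            t, s = t.right, s - left_size
    return t
```

For every branching ratio from 2 to 8 the sketch uses n-1 physical processors and is log2 n levels deep, rounded up, which agrees with the counts stated above.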
All of the algorithms in this paper are described in terms of logical
processors and the logical structure of the tree. The SIMULA code
assumes the existence of the logical processor defined in Figure 3,
and builds definitions based on it. The complexity of each
algorithm will be calculated for the physical structure, however.

Figure 2. Logical Processors with 2 to 8 Descendents
(solid color boxes comprise the logical processor)

CLASS Node(n); INTEGER n;
BEGIN
   REF(Node) left,right;
   !init code to build logical node;
   IF n>2 THEN left:-NEW Node((n+1)//2);
   IF n>3 THEN right:-NEW Node(n//2);
END of CLASS Node;

Node CLASS Processor;
BEGIN
   REF(Processor) PROCEDURE Son(s); INTEGER s;
   BEGIN REF(Node) it;
      it:-IF s<=(n+1)//2 THEN left ELSE right;
      WHILE NOT (it IN Processor) DO
         it:-IF s<=(it.n+1)//2 THEN it.left ELSE it.right;
      Son:-it;
   END of PROCEDURE Son;
END of CLASS Processor;
Figure 3. Making Arbitrary Branching Ratios

Sorting.
A binary tree with depth log2 n can be used to sort n numbers. The
sort is accomplished as a byproduct of loading the numbers into
memory and then reading them out again. The numbers themselves are
never in sorted order internally, but come out of the tree in the
desired order.
Sorting is a particularly interesting example because it
illustrates a fundamental issue in concurrency. It is well known
that sorting on a sequential machine can be done with O(n log2 n)
comparisons. However, it has been shown on very fundamental grounds
that if communication is restricted to nearest neighbors, at least
n^2 comparisons are required [5]. The apparent advantage of the
O(n log2 n) algorithms comes as a direct result of longer
communication paths. It is also clear that no scheme will be able
to produce an ordered set of numbers until all numbers are loaded
into the machine. This means that the best achievable time
complexity is O(n).
The algorithm I use is an implementation of heap sorting. The
algorithm that runs in each processor, given in Figure 4, has a
procedure for loading the tree called Fillup and a procedure
invoked during the output cycle called Passup.
Fillup keeps the largest number seen to date, and passes the
smaller one to the right or left child, keeping the tree balanced
by alternating sides.
Passup returns this processor's current number and refills itself
with the larger of the numbers stored in its descendents. This
action is pipelined so that the largest number is always available
in the root.
This sort algorithm is bounded by the time it takes to load and 
remove the numbers. Thus it has time complexity O(n). It requires n 
processors, one for each number to be sorted. 
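The behavior of Fillup and Passup described above can be tried out with a sequential Python simulation. It is a sketch only: the calls run one after another rather than pipelined across processors, and the names HeapNode and tree_sort are hypothetical. The tree is assumed to be deep enough to hold one number per processor.

```python
class HeapNode:
    """One processor of the sorting tree, simulated sequentially."""

    def __init__(self, depth):
        self.number = None
        self.empty = True
        self.balanced = True
        self.left = HeapNode(depth - 1) if depth > 1 else None
        self.right = HeapNode(depth - 1) if depth > 1 else None

    def fillup(self, candidate):
        """Keep the larger number here, pass the smaller one down."""
        if self.empty:
            self.number, self.empty = candidate, False
            return
        if candidate > self.number:
            candidate, self.number = self.number, candidate
        target = self.left if self.balanced else self.right
        self.balanced = not self.balanced  # alternate sides
        target.fillup(candidate)

    def passup(self):
        """Return this number, then refill from the larger child."""
        result = self.number
        kids = [k for k in (self.left, self.right)
                if k is not None and not k.empty]
        if not kids:
            self.empty = True
        else:
            self.number = max(kids, key=lambda k: k.number).passup()
        return result


def tree_sort(values):
    """Load every value, then read them back out, largest first."""
    depth = max(1, len(values).bit_length())  # 2**depth - 1 >= len(values)
    root = HeapNode(depth)
    for v in values:
        root.fillup(v)
    return [root.passup() for _ in values]
```

Because the root always holds the largest remaining number, the values leave the tree in descending order.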

Processor CLASS HeapSort;
BEGIN
   INTEGER number;
   BOOLEAN balanced,empty;
   REF(Processor) left,right;

   PROCEDURE fillup(candidate); INTEGER candidate;
   BEGIN
      IF empty THEN
         BEGIN
            number:=candidate
            empty:=FALSE;
         END
      ELSE
         BEGIN
            IF candidate>number THEN   !swap;
               BEGIN INTEGER t;
                  t:=candidate;
                  candidate:=number;
                  number:=t;
               END;
            IF balanced
               THEN left.fillup(candidate)
               ELSE right.fillup(candidate);
            balanced:=NOT balanced;
         END;
   END of procedure fillup;

   INTEGER PROCEDURE passup;
   BEGIN
      passup:=number;
      IF left==NONE AND right==NONE THEN empty:=TRUE   !it's a leaf;
      ELSE
         IF left.empty THEN
            BEGIN
               IF right.empty THEN empty:=TRUE   !both subtrees empty;
               ELSE number:=right.passup;   !fill from right son;
            END
         ELSE
            IF right.empty THEN number:=left.passup   !fill from left son;
            ELSE number:=IF left.number>right.number
                         THEN left.passup ELSE right.passup;
                         !take the larger of the two;
   END of procedure passup;

   !init code;
   empty:=TRUE;
   balanced:=TRUE;
   !left and right ....
END of class HeapSort;
Figure 4. Heap Sort

Matrix Multiplication. 
Consider the problem of multiplying two nxn matrices together. The
tree machine algorithm that provides the answer in the least amount
of time divides the multiplicand into rows and the multiplier into
columns, and pipelines the loading and multiplication of pairs of
single elements. This process requires O(n^2) processors and takes
O(n^2) time, a processor and time product of O(n^4). If each
processor has enough memory to store a row of the matrix instead of
a single element, the algorithm would require O(n) processors,
resulting in the more familiar O(n^3) product.
The algorithm makes use of a tree that has a branching ratio of n
at each node, and is two levels deep. The root node has n
descendents, each controlling n leaves of the tree. Then there are
n^2 leaves and a total of 2n^2 - 1 physical processors.
Each child node of the root, hereafter called a row supervisor,
will represent a row of the multiplicand matrix, and produce a row
of the product matrix. Each of the n descendents of a row
supervisor will hold one element of the row.
The algorithm is given in Figure 5. The multiplier matrix is loaded
into the tree one element at a time, by column. The root hands each
element to all row supervisors, which send it to their appropriate
leaf: the first element in any column goes to the first child of
each row supervisor, the nth element to the nth child. That child
multiplies the multiplier element by the multiplicand element it
holds, and returns the product to the row supervisor. When an
entire column of the multiplier has been loaded into the tree, each
row supervisor takes the n products generated in its children, adds
them, and returns one element in the corresponding column of the
product matrix. That is, when the first column of the multiplier
has been loaded into the tree, the first column of the product
matrix is available, and so on.
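The flow of elements just described can be simulated sequentially in Python. The sketch below is an illustration rather than the code of Figure 5: the multiplicand rows are assumed to be already resident in the leaves, the root's broadcast is modeled by a plain loop, and the names Leaf, RowSupervisor, and tree_matmul are hypothetical.

```python
class Leaf:
    """Holds one element of a multiplicand row."""

    def __init__(self, row_element):
        self.row_element = row_element

    def multiply(self, element):
        return self.row_element * element


class RowSupervisor:
    """Owns one multiplicand row and accumulates one product element."""

    def __init__(self, row):
        self.leaves = [Leaf(x) for x in row]
        self.count = 0
        self.product = 0.0

    def multiply(self, element):
        """Feed the next element of the current multiplier column.

        Returns the finished dot product after the last element of the
        column, and None before that.
        """
        self.product += self.leaves[self.count].multiply(element)
        self.count += 1
        if self.count < len(self.leaves):
            return None
        result, self.count, self.product = self.product, 0, 0.0
        return result


def tree_matmul(a, b):
    """Multiply square matrices a (multiplicand) and b (multiplier)."""
    n = len(a)
    supervisors = [RowSupervisor(row) for row in a]
    c = [[0.0] * n for _ in range(n)]
    for j in range(n):              # multiplier is loaded by column
        for k in range(n):          # one element at a time
            for i, sup in enumerate(supervisors):  # root hands it to all
                out = sup.multiply(b[k][j])
                if out is not None:
                    c[i][j] = out   # column j of the product is ready
    return c
```

Each completed multiplier column yields one column of the product, which is the property the pipelined version relies on.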
This process can be pipelined to take O(n^2) time. Thus the time it
takes to load the n^2 elements of the matrices dominates the time
complexity of the problem. Remember, however, that matrix
operations are meaningless except in the context of the driving
problem. The entries in the matrix are not so much loaded as
generated, and the generation time may be less than O(n^2). Care
must be taken, however, to generate the matrix entries in the
arrangement used by the multiplication algorithm; moving elements
around in the tree is costly.

Processor CLASS RowSupervisor;
BEGIN
   !the matrix size, N, is an attribute of CLASS Processor, and is available
   to us;
   REAL product;
   INTEGER count;

   PROCEDURE Load(element); REAL element;
   BEGIN
      count:=count+1;
      son[count].Load(element);
      IF count=N THEN count:=0;
   END of procedure Load;

   REAL PROCEDURE Multiply(element); REAL element;
   BEGIN
      count:=count+1;
      product:=product + son[count].Multiply(element);
      IF count=N THEN
         BEGIN
            Multiply:=product;
            count:=0;
            product:=0.0;
         END;
   END of procedure Multiply;

   !initialization;
   count:=0;
   product:=0.0;
END of class RowSupervisor;

Processor CLASS Leaf;
BEGIN
   REAL rowElement;

   PROCEDURE Load(element); REAL element;
   BEGIN
      rowElement:=element;
   END of procedure Load;

   REAL PROCEDURE Multiply(element); REAL element;
   BEGIN
      Multiply:=rowElement * element;
   END of procedure Multiply;
END of class Leaf;
Figure 5. Matrix Multiplication

The Color Cost Problem.
This NP-complete problem is an adaptation of the K-colorability
problem. Given an undirected graph G of n nodes and a set of n
colors, each with an associated cost, find a minimum cost coloring
of the graph such that no nodes sharing an edge are the same color.
There are n^n possible colorings of the graph. Evaluating them
sequentially produces a solution in time O(n^n). I present a
parallel algorithm of order n^2.
Each level in the processor tree represents the consideration of
another node. That is, level one shows possible colors for the
first node, level two colors the second node based on the choices
made at level one, and so on. I will describe the generation of
the potential colorings.
Each processor, described in Figure 6, has an edge list called edge
and a list of costs indexed by color number called colorcosts.
There is an array called coloring that reflects the color choices
for preceding nodes, and a boolean array called colors that is used
to generate the possible colorings for this node.
The algorithm, given in procedure color, begins by assuming that
all colors yield valid colorings. The array coloring is used to
eliminate those colors that have been used to color nodes that
share an edge with this node. This reduced set of colors, all of
which are legal colorings, is used to spawn descendents, one for
each coloring of this node.

Processor CLASS ColorCost;
BEGIN
   BOOLEAN ARRAY edge[1:n,1:n],colors[1:n];
   INTEGER ARRAY coloring[1:n],colorcosts[1:n];
   INTEGER cost;

   PROCEDURE color(node); INTEGER node;
   BEGIN INTEGER i;
      IF node>n THEN
         BEGIN
            cost:=0;
            FOR i:=1 TO n DO cost:=cost+colorcosts[coloring[i]];
         END
      ELSE
         BEGIN
            FOR i:=1 TO node-1 DO IF edge[i,node] THEN
               colors[coloring[i]]:=FALSE;
            FOR i:=1 TO n DO
               IF colors[i] THEN
                  BEGIN
                     son[i].coloring[node]:=i;
                     son[i].color(node+1);
                  END
               ELSE son[i]:-NONE;
            cost:=maxcost;
            FOR i:=1 TO n DO
               IF (IF son[i] == NONE
                   THEN FALSE ELSE cost>son[i].cost)
                  THEN cost:=son[i].cost;
         END;
   END of procedure color;
END of class ColorCost;
Figure 6. Color-Cost Problem

    Symbol   Color   Cost
    B        Blue    2
    G        Green   1
    R        Red     0
Figure 7. Color-Cost Example: Graph and Color List

Figure 8. Color-Cost Example: Solution Tree

When the tree is n levels deep all the legal colorings have been
generated. The leaf nodes calculate a cost for the coloring they
represent, and each parent node takes as its cost the least cost
among its children. Thus the minimum cost coloring is stored at the
root.
An example is in order. A sample graph and color set are given in
Figure 7. Figure 8 shows the colorings and costs arrived at by the
algorithm. Each level of the tree represents a node of the graph.
That is, if the root is level 0, the first node is colored in level
1, and level 3 represents potential colorings for the third node.
Besides representing a part of a coloring, each node also contains
the minimum cost coloring found among its descendent colorings.
The minimum cost of coloring the sample graph is 1, and is achieved
by coloring nodes (1,2,3) with (red,green,red).
When the color cost problem is solved in a brute force manner on a
sequential machine, it takes exponential time. The tree machine can
solve the problem in O(n^2) time using an exponential number of
processors. So on either machine, this problem exhibits exponential
growth.
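The generation of colorings level by level, with each parent keeping the least cost among its children, can be sketched as a recursive Python function. The recursion stands in for the processor tree, so it explores sequentially what the tree machine would explore concurrently. The function name min_cost_coloring is hypothetical. The sample graph used in the usage note (a path 1-2-3) is an assumption, since only the color costs of the example are stated in the text.

```python
def min_cost_coloring(n, edges, color_costs):
    """Return (cost, coloring) of a cheapest legal coloring, or None.

    n           -- number of graph nodes, numbered 1..n
    edges       -- iterable of (u, v) pairs, undirected
    color_costs -- color_costs[c] is the cost of color number c
    """
    adjacent = {frozenset(e) for e in edges}

    def color(node, coloring):
        # A leaf of the solution tree: every node has a color.
        if node > n:
            return sum(color_costs[c] for c in coloring), tuple(coloring)
        best = None
        for c in range(len(color_costs)):
            # Eliminate colors already used by neighbors colored earlier.
            clash = any(
                frozenset((prev, node)) in adjacent and coloring[prev - 1] == c
                for prev in range(1, node)
            )
            if clash:
                continue
            candidate = color(node + 1, coloring + [c])
            # The parent keeps the least cost found among its children.
            if candidate is not None and (best is None or candidate[0] < best[0]):
                best = candidate
        return best

    return color(1, [])
```

With colors numbered 0 (red, cost 0), 1 (green, cost 1), and 2 (blue, cost 2), and the assumed path graph, min_cost_coloring(3, [(1, 2), (2, 3)], [0, 1, 2]) returns cost 1 with the coloring (red, green, red), matching the minimum quoted above.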
Transitive Closure. 
Given a directed graph G, the transitive closure of G, G*, can be
generated. The arcs of G* are subject to the following condition:
for every arc (v,w) in G* there is a path, (v,e_1),(e_1,e_2), ...,
(e_m,w), in G.
The best sequential algorithm for generating the transitive closure
of a graph is attributed to Warshall [1,8]. The algorithm uses three
FOR loops that run through the incidence matrix adding arcs. After
k steps of the outer loop, there is a path from vertex i to vertex
j through vertices in the set {1,2,...,k} if and only if B[i,j]=1.
On a sequential machine, this algorithm takes O(n^3) time. The code
is given in Figure 9.
A direct mapping of Warshall's algorithm onto the tree machine
yields a rather boring n^3 algorithm that merely spreads the three
iterative steps among the processors in the tree.
There is a much more fruitful path to take. By understanding what
actually happens during the execution of the algorithm, an
effective mapping of Warshall's algorithm onto the tree machine is
discovered.
There are two key points to be made about Warshall's algorithm.
First, the algorithm is cascading. Newly created arcs can effect
the creation of yet more arcs. Any realization of the algorithm
must allow for this characteristic. It is not sufficient to
consider only the arcs in the original graph.
Also important is the comparison between arcs. In Figure 9 this
comparison is stated, up to a renaming of the indices, as
    IF b[i,j] AND b[j,k] THEN b[i,k]:=TRUE;
In English, this reads "if there is an arc from i to j, and an arc
from j to k, then create an arc from i to k".
Suppose that instead of an incidence matrix, there is a list of 
arcs. This list will be used as input to the tree machine. The 
output is the list of arcs in the transitive closure. 
The tree has a root node, n descendents of the root that are
instances of the class vertex, and n^2 descendents of the vertex
processors described by the class toVertex. The vertex processors
represent the n nodes in the graph. The toVertex processors are the
n possible arcs from each node. Jim Rowson deserves special thanks
for distilling my complicated structure into this very simple one.
The arcs in the original graph are used only as the starting place
and are indistinguishable from generated arcs. As new arcs are
created, they are considered by all the vertex processors just as
the original arcs are.
Arcs are created using a variant of the Warshall comparison. An arc
has a starting point, fromV, and an ending point, toV. Each arc is
considered by all the vertex processors. Each vertex will create an
arc by turning on its appropriate descendent if one of two
conditions is true. Either this vertex must be the starting point
of the arc, or there must be an existing arc from this vertex to
the starting point.
The first condition takes care of the arcs in the original graph.
The tree starts out with no arcs. As the original arcs are loaded
into the tree, the first condition is true and arcs are created.
The second condition is the Warshall comparison. Suppose the arc
(v,w) is being considered by vertex u. If arc (u,v) exists then arc
(u,w) is created. This is how new arcs are created.
As each arc is created, by satisfying either criterion, it is
broadcast throughout the tree; it might effect the creation of
other arcs.
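The two conditions and the rebroadcast of created arcs can be simulated in a few lines of Python. This is a sequential sketch, not the SIMULA of Figures 10 and 11: a single deque stands in for the queues, the table exists[u][w] stands in for the toVertex processors, and the name closure_by_broadcast is hypothetical. As in the paper's scheme, original arcs are loaded first and every created arc is broadcast again when it is unloaded.

```python
from collections import deque


def closure_by_broadcast(n, arcs):
    """Return the arcs of the transitive closure, in unloading order.

    n    -- number of vertices, numbered 1..n
    arcs -- iterable of (fromV, toV) pairs of the original graph
    """
    # exists[u][w] plays the role of the toVertex flag for arc (u, w).
    exists = [[False] * (n + 1) for _ in range(n + 1)]
    pending = deque()  # arcs created but not yet unloaded

    def load(v, w):
        """Broadcast arc (v, w) to every vertex processor u."""
        for u in range(1, n + 1):
            # Condition 1: u is the starting point of the arc.
            # Condition 2: an arc from u to the starting point exists.
            if (u == v or exists[u][v]) and not exists[u][w]:
                exists[u][w] = True
                pending.append((u, w))

    for v, w in arcs:
        load(v, w)

    closure = []
    while pending:
        arc = pending.popleft()  # unload one arc ...
        closure.append(arc)
        load(*arc)               # ... and broadcast it back down the tree
    return closure
```

Each arc of the closure is unloaded exactly once, so the number of unload steps equals the number of arcs in the closure.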
The code for this algorithm is given in Figures 10 and 11. Figure
10 shows the properties common to all three kinds of processor
nodes, and defines some auxiliary classes used for queueing and
passing data between processors. Figure 11 is the definition of the
three processors, including the procedures that implement the
revised Warshall algorithm.
The key routines are load and unload. Procedure load appears in the
root and vertex processors and is used to pass arcs through the
system. Unload is in the root. Each call on unload yields an arc in
the transitive closure.

BOOLEAN ARRAY B[1:n,1:n];
INTEGER i,j,k;
FOR k:=1 TO n DO
   FOR i:=1 TO n DO
      FOR j:=1 TO n DO
         IF B[i,k] AND B[k,j] THEN B[i,j]:=TRUE;
Figure 9. Warshall's Algorithm (Sequential Machine)

CLASS processor;
BEGIN
   REF(processor) ARRAY son[1:n];
   REF(processor) parent;
END of class processor;

processor CLASS Qprocessor;
BEGIN
   REF(head) Q;
   PROCEDURE insertInQ(qe); REF(queueElement) qe; qe.into(Q);
   REF(queueElement) PROCEDURE firstInQ;
   BEGIN REF(queueElement) qe;
      firstInQ:-qe:-Q.first;
      qe.out;
   END of procedure firstInQ;
END of class Qprocessor;

link CLASS queueElement(myOwner); REF(Qprocessor) myOwner;
BEGIN
END of class queueElement;

CLASS edge(fromV,toV); INTEGER fromV,toV;
BEGIN
END of class edge;
Figure 10. General Processor Definition and Auxiliary Classes

Qprocessor CLASS root;
BEGIN
   PROCEDURE load(e); REF(edge) e;
   BEGIN INTEGER i;
      FOR i:=1 STEP 1 UNTIL n DO son[i].load(e);
   END of procedure load;

   REF(edge) PROCEDURE unload;
   BEGIN
      IF Q.empty THEN unload:-NONE
      ELSE BEGIN REF(queueElement) qe; REF(edge) e;
         qe:-firstInQ;
         unload:-e:-qe.myOwner.nextEdge;
         load(e);
      END;
   END of procedure unload;

   BEGIN INTEGER i;
      FOR i:=1 STEP 1 UNTIL n DO son[i]:-NEW vertex(i);
   END of init code;
END of class root;

Qprocessor CLASS vertex(myNode); INTEGER myNode;
BEGIN
   REF(queueElement) qe;
   BOOLEAN queued;

   REF(edge) PROCEDURE nextEdge;
   BEGIN REF(queueElement) tqe;
      tqe:-firstInQ;
      nextEdge:-NEW edge(myNode,tqe.myOwner.myNode);
      IF NOT Q.empty THEN parent.insertInQ(qe)
      ELSE queued:=FALSE;
   END of procedure nextEdge;

   PROCEDURE load(e); REF(edge) e;
   BEGIN
      IF e.fromV=myNode OR son[e.fromV].edgeExists
      THEN BEGIN
         son[e.toV].insertEdge;
         IF NOT queued
         THEN BEGIN
            parent.insertInQ(qe);
            queued:=TRUE;
         END;
      END;
   END of procedure load;

   queued:=FALSE;
   qe:-NEW queueElement(THIS vertex);
   BEGIN INTEGER i;
      FOR i:=1 STEP 1 UNTIL n DO son[i]:-NEW toVertex(i);
   END of init code;
END of class vertex;

Qprocessor CLASS toVertex(myNode); INTEGER myNode;
BEGIN
   REF(queueElement) qe;
   BOOLEAN edgeExists;

   PROCEDURE insertEdge;
   BEGIN
      IF NOT edgeExists
      THEN BEGIN
         edgeExists:=TRUE;
         parent.insertInQ(qe);
      END;
   END of procedure insertEdge;

   edgeExists:=FALSE;
   qe:-NEW queueElement(THIS toVertex);
END of class toVertex;
Figure 11. Revised Warshall Algorithm Implementation

Figure 12. Arrangement of a in the Tree and Ringmachine

Each arc in the original graph is given to the root via a call on
procedure load. The arc is passed to all vertex processors. There,
on the second level, each vertex executes the test described above
to see if the arc causes the creation of an arc from this vertex.
Once all the arcs of the original graph have been loaded, the arcs
of the transitive closure are available for unloading. As an arc is
handed to the outside world by a call on the root's unload
procedure, it is passed back down the tree to the vertex
processors, just as the original arcs were, by a call on procedure
load.
A double system of queues is used to indicate the availability of
arcs for the unloading and broadcasting operations. The queue in
the root is used by the vertex processors to indicate willingness
to provide an arc to the root. When an arc is unloaded, it is also
broadcast through the tree via the load routine. The queue in the
vertex processors is used by the toVertex processors to indicate
that another arc has been created.
The queues are used to avoid polling the vertex and toVertex
processors for available arcs. The polling introduces two iteration
statements which are executed for each arc in the transitive closure. They
cloud the issue by appearing to affect the complexity. The queues,
on the other hand, simulate the hardware nicely. The two upper
levels of the tree need to respond to a signal from any one of
their children. The queues provide this effect.
The algorithm as described above and in Figure 11 has time
complexity of the order of the number of arcs in the transitive
closure. The maximum number of arcs in a directed graph of size n
is n^2; the transitive closure is itself such a graph. Thus
the time complexity of this algorithm is O(n^2), limited by the time
it takes to read out the arcs of the closure.
As described, 2n^2 - 1 processors are used to generate the closure. A
solution using only n+1 processors, yet essentially the same, can
be devised. Suppose each vertex processor, now the leaves of the
tree, contains a boolean array instead of using toVertex processors
to represent existing arcs. The vertex processors have more local
store, and a parameter of the problem, the size of the graph, has
been introduced into the physical requirements for each processor.
This is something I want to avoid. It is, however, a perfectly
valid implementation, and indeed, retains the O(n^3) total
complexity.
Is the Tree Machine Magic? 
It is time to address the question of whether these problems need
the tree machine structure. The answer is simple. No. I will give
an alternative architecture that yields an equivalent solution.
Mike Ullner has proposed the Ringmachine [7], a "tree of branching
ratio one". The structure is a doubly-linked ring of processors, or
more simply, a linear pipeline.
This structure is also capable of doing transitive closure in O(n^2)
time using O(n^2) processors, and the code is as simple as that for
the tree machine implementation. The Ringmachine algorithm is given
in detail in [3].
The key to the O(n^2) solution is the pipeline, not the
communication path. In fact, sorting and matrix multiplication are
also problems in this class. The size of the answer determines the
size of the problem. Any pipelined structure that can spew out
answers one at a time in a continuous stream is adequate.
So what is the tree machine better at? The difference between the
tree and the ring is that any particular node in the tree can be
reached in time proportional to the logarithm of the
total number of processors. Problems that have one answer that can
be in any of a large number of processors can take advantage of the
tree structure. NP-complete problems, like the color cost problem
treated earlier, are a graphic example of this. Those problems
require an exponential number of processors, however, and thus are
not practical.
An Algorithm that Uses the Tree Effectively. 
I have found a problem that does make use of the extra
communication paths in the tree. It is taken from numerical
analysis, and is presented here out of context. The problem is to
generate a vector x from a vector a according to the following
rule:
    x_i = \sum_{j=1}^{i} a_j ,   i = 1, ..., n
In other words, the ith element of the vector x is the sum of the
first i elements of the vector a. This problem is solvable on a
sequential machine in O(n) time.
If the tree machine and Ringmachine are treated as peripheral
functional units that are given a and produce x, the performance of
the two machines is identical. Loading and unloading the vectors
again dominates the time complexity. In each case, n processors are
used to solve the problem in O(n) time.
A more interesting formulation of the problem assumes that the tree
and Ringmachine are already loaded with some convenient arrangement
of a. How fast can x be generated, with x ending up in the same
arrangement as a?
Given the arrangements shown in Figure 12, the Ringmachine uses n
processors to generate x in place in O(n) steps. The tree machine,
on the other hand, uses n processors, but arrives at the answer in
O(log2 n) steps. For large n, this is a significant difference.

CLASS sum(s,max); INTEGER s,max;
BEGIN END;

Processor CLASS vectorSum;
BEGIN
   INTEGER subscript;
   INTEGER x;

   REF(sum) PROCEDURE sumup;
   BEGIN
      IF left==NONE AND right==NONE
      THEN sumup:-NEW sum(x,subscript)
      ELSE BEGIN REF(sum) l,r;
         l:-IF left==NONE THEN NEW sum(x,subscript) ELSE left.sumup;
         r:-IF right==NONE THEN NEW sum(x,subscript) ELSE right.sumup;
         x:=x+l.s;
         sumup:-NEW sum(x+r.s,r.max);
         left.sumdown(l);
         right.sumdown(NEW sum(x,subscript));
      END;
   END of procedure sumup;

   PROCEDURE sumdown(p); REF(sum) p;
   BEGIN
      IF p.max<subscript THEN x:=x+p.s;
      IF left=/=NONE THEN left.sumdown(p);
      IF right=/=NONE THEN right.sumdown(NEW sum(x,subscript));
   END of procedure sumdown;
END of class vectorSum;
Figure 13. Algorithm for finding x_i.

                        Sequential Machine       Tree Machine
                        space     time           processors   time
Heap Sort               n         n log2 n       n            n
Matrix Multiplication   n^2       n^3            n^2          n^2
Color Cost              n         n^n            n^n          n^2
Transitive Closure      n^2       n^3            n^2          n^2
x_i                     n         n              n            log2 n
Figure 14. Sequential and Tree Machine Performance.

The Ringmachine algorithm is straightforward. Starting with the
vector a distributed as in Figure 12, each processor adds numbers
that are passed in from the left to the a_i it holds before passing
them on. After the ith processor has seen i-1 numbers, it sends a_i
to the right and becomes dormant. The nth processor waits n-1 time
steps for a_1. The other n-2 elements arrive in the next n-2 time
steps, and are added to a_n to form x_n. Thus the process is complete
after 2n-1 cycles.
The algorithm on the tree machine is not as simple. The arrangement
of the a_i's given in Figure 12 is not intuitive. And the algorithm
requires data to flow up and down the tree simultaneously. The
SIMULA code is given in Figure 13.
The summing starts in the lower left hand corner of the tree. Each
node gets partial sums from its left and right children. The left
hand sum is added to the a_i in the processor, stored as x_i, and
passed to the right child. Then the sum from the right child is
added in, and this result, the sum of all three numbers available,
is sent to the parent processor. It takes log2 n cycles for the
root to receive the sum of the a_i's in the left half of the tree,
and another log2 n steps for that sum to filter down to the lower
right corner, forming x_n.
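The up-and-down flow of partial sums can be illustrated with a Python sketch. It is not a transcription of Figure 13. It assumes an in-order arrangement of a in the tree (every element in a left subtree has a smaller subscript than its parent), and it separates the upward and downward flows into two sweeps for clarity, where the description above overlaps them. The names SumNode and prefix_sums are hypothetical.

```python
class SumNode:
    """One processor holding a_i, with smaller subscripts to its left."""

    def __init__(self, values):
        mid = len(values) // 2
        self.a = values[mid]
        self.x = None
        self.left_total = 0
        self.left = SumNode(values[:mid]) if mid > 0 else None
        self.right = SumNode(values[mid + 1:]) if mid + 1 < len(values) else None

    def sum_up(self):
        """Upward sweep: report the total of this subtree to the parent."""
        self.left_total = self.left.sum_up() if self.left else 0
        right_total = self.right.sum_up() if self.right else 0
        return self.left_total + self.a + right_total

    def sum_down(self, carry=0):
        """Downward sweep: carry is the sum of everything to the left."""
        self.x = carry + self.left_total + self.a
        if self.left:
            self.left.sum_down(carry)
        if self.right:
            self.right.sum_down(self.x)

    def in_order(self):
        left = self.left.in_order() if self.left else []
        right = self.right.in_order() if self.right else []
        return left + [self.x] + right


def prefix_sums(values):
    """Return x with x_i = a_1 + ... + a_i, computed on the tree."""
    if not values:
        return []
    root = SumNode(values)
    root.sum_up()
    root.sum_down()
    return root.in_order()
```

Each sweep touches every level of the tree once, so on a machine where the children of a node work concurrently the number of steps grows with the depth of the tree, which is log2 n for a balanced arrangement.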
The algorithm described above uses the extra communication paths of
the tree to advantage. It remains to be seen if the problem can be
put back into the numerical analysis context from which it came,
and still perform better on the tree than on the Ringmachine.
Conclusions.
The work described in this paper is aimed at deciding two
questions. First, are multiprocessor systems useful? And if so,
what kind of system should be built?
The answer to the first question is a resounding yes. Figure 14
compares the performance of the algorithms described here on
sequential machines and the tree machine. In each case, the time
complexity is substantially reduced.
The second question does not yet have a clear answer. I am just
beginning to examine problems that can use the three-neighborness
of the tree to advantage. Unless the additional complexity of
building a tree rather than a Ringmachine can be justified, the
simpler structure is heavily favored. I am hopeful, however, that
numerical analysis problems will demonstrate the value of the tree
machine.
References
[1] Aho, A. V., J. E. Hopcroft, and J. D. Ullman
    The Design and Analysis of Computer Algorithms
    Addison Wesley, Reading, Massachusetts, 1974
[2] Birtwistle, G. M., O-J. Dahl, B. Myhrhaug, K. Nygaard
    SIMULA BEGIN
    Petrocelli, New York, 1973
[3] Browning, Sally A.
    "Transitive Closure and the Tree Machine"
    Computer Science Department Display file 12402
    California Institute of Technology, 1978
[4] Dahl, O-J., E. W. Dijkstra, C. A. R. Hoare
    Structured Programming
    Academic Press, New York, 1972
[5] Demuth, H. B.
    "Electronic Data Sorting"
    PhD. Thesis (Stanford University, October 1956)
[6] Dijkstra, E. W.
    A Discipline of Programming
    Prentice-Hall, Englewood Cliffs, New Jersey, 1976
[7] Ullner, Mike
    "Ringmachine"
    Computer Science Department Display file in progress
    California Institute of Technology, 1978
[8] Warshall, S.
    "A Theorem on Boolean Matrices"
    J.ACM 9:1, p.11-12
