Tree Machines: Architectures and Algorithms
A Survey Paper
Hussein A. H. Ibrahim




Recent advances in very large scale integrated (VLSI) circuit technology have led
to a surge in research aimed at finding new computer organizations that support a
great deal of concurrency. Computer organizations based on tree structures appear
well-suited to several kinds of parallel computations. In this paper we discuss
the performance of tree machines as well as issues related to their implementation
in VLSI. Examples of tree machines are presented, with an emphasis on the way
the processing elements communicate in the machine. A taxonomy of tree algorithms,
based on a taxonomy of parallel algorithms proposed by Kung in 1979, is
introduced. Examples of tree algorithms are also given.
Table of Contents
1. Introduction
2. Performance Considerations
3. Implementation Issues
3.1 Layout of Tree Machines
3.2 Pinout Limitations and Tree Machines
3.3 The Leiserson Scheme for Tree Machine Layout
4. Some Tree Machines
4.1 The Tree Channel Processor
4.2 The Caltech Tree Machine
4.2.1 The Processor Architecture
4.2.2 Communication in the Caltech Tree Machine
4.3 The X-Tree Machine
4.3.1 Communication in the X-Tree
4.3.2 The X-node Architecture
4.4 The NON-VON Supercomputer
4.4.1 Communication in NON-VON
4.5 The Stony Brook Tree Machine
4.5.1 Communication in the Stony Brook Tree Machine
4.6 Special Purpose Tree Machines
4.6.1 The DADO Tree Machine
4.6.2 A Tree Machine for Searching Problems
5. Algorithms on Tree Machines
5.1 SIMD Algorithms
5.1.1 The Transitive Closure Algorithm (SIMD Version)
5.2 MIMD Algorithms
5.2.1 The Transitive Closure Algorithm (MIMD Version)
5.3 Systolic Algorithms






























List of Figures
Figure 3-1: Hyper-H Embedding of a Binary Tree
Figure 3-2: The Mapping of a Right Skewed Binary Tree
Figure 3-3: Implementing Tree Machines Using Two Kinds of Chips
Figure 3-4: Interconnection of Two Leiserson Chips
Figure 3-5: Leiserson Layout: The Printed Circuit Board
Figure 4-1: A Linear Array Processor
Figure 4-2: A Subtree of the Tree Channel Processor
Figure 4-3: The Caltech Tree: Processor Architecture
Figure 4-4: Mapping Arbitrary Fanouts onto a Binary Tree
Figure 4-5: A Binary Tree with Full- and Half-Ring Connections
Figure 4-6: Block Diagram of an X-node
Figure 4-7: Top Level Organization of NON-VON
Figure 4-8: NON-VON Block Diagram of the Processing Element
Figure 4-9: Inorder Embedding of the Linear Array
Figure 4-10: Example of a Double Tree Network
Figure 4-11: Functional Division of the DADO Tree
Figure 4-12: Structure of the Search Tree Machine
Figure 4-13: Components of the Square Node
Figure 5-1: The Systolic Tree
1. Introduction 
Recent advances in very large scale integrated (VLSI) circuitry allow us not only to
make current processors faster and smaller in size, but also to embed a number of
processing and memory elements within a single chip in a cost-effective manner.
This has led to a surge in research directed toward finding radically new computer
organizations that support a high level of concurrency. In conventional von
Neumann machines, all communication between the processing site (usually a single
processor) and the memory site proceeds along a relatively low bandwidth
communication path, which is a serious bottleneck in conventional machines [Back
78]. Intermingling data and processing elements in VLSI technology allows many
local computations to be executed concurrently where the data elements reside.
Consequently, the bottleneck of conventional machines is eliminated, and the
amount of data that can be processed in a given period of time is greatly increased.
Certain computational tasks arising in many applications may be divided into
smaller cooperating tasks that can be performed concurrently. The results computed
by these smaller tasks may then be combined, usually in a hierarchical
manner, to compute the required result. Tree-connected processing elements (PE's)
provide a structure that allows us to exploit the hierarchical nature of this class of
computations. Hierarchical systems also enjoy the properties of local communication
and regular interconnection patterns, which are desirable for VLSI implementations.
A hierarchical communication system is also necessary for efficient global
communication in VLSI technology when the number of elements connected to the
global communication system is very large [Mead 79].
A tree machine can be informally described as a collection of processing elements
interconnected to form a complete binary tree. Tree machines, besides enjoying the
performance characteristics of hierarchical structures, also capitalize on the
properties of VLSI technology. They can be laid out optimally on the plane, and
they do not suffer from pinout limitations, as we will see later in this paper.
In 1970, Lipovski proposed a tree-organized associative machine [Lipo 70] consisting
of small PE's connected in the form of a binary tree, each with a small amount of
memory and logic. A description of this machine will be given in Section Four.
Berkling [Berk 71] later proposed another architecture for a computing machine that
implemented in hardware a hierarchical execution of programs that was modeled
after Turing tree machines. With the recent emergence of VLSI technology, tree
machines have begun to attract attention again as an interconnection scheme for
multiprocessor systems. Several projects now underway at different universities are
directed toward the construction of tree machines. Among these are the X-Tree
machine project at the University of California at Berkeley, the tree machine
project at the California Institute of Technology, the NON-VON Supercomputer and
the DADO machine at Columbia University, the Stony Brook tree machine at the State
University of New York at Stony Brook, the tree machine at the University of
North Carolina at Chapel Hill [Toll 81c], the DDM1 Machine at the University of Utah,
and the DAC project at the University of Southern California [Horo 79].
In Section Two we discuss some of the most important performance issues related
to tree machines. In Section Three some implementation issues are considered.
Section Four presents some of the architectures that have been proposed for
implementing tree machines, and discusses their differences. Certain special purpose
tree machines are also described in that section. In Section Five, we discuss some
of the algorithms that run efficiently on tree machines.
2. Performance Considerations 
The tree machine architecture is capable of providing a rather general purpose
computing environment. A tree machine can be programmed to execute a varied
collection of algorithms that take advantage of the tree structure explicitly, along
with many other algorithms that exploit its less obvious advantages in the
realization of large scale computational concurrency. We will introduce some
examples of these algorithms in this section. A detailed discussion of some of the
algorithms that run on tree machines will be presented in Section Five. In the
following examples we assume that the tree machine consists of processing elements
connected together in the form of a binary tree, and that each PE contains some
memory storage and can perform simple arithmetic and logical operations on data
stored in its local memory. Each PE can send data to and receive data from its
children and its parent. The PE at the root of the tree can communicate with the
external world.
Among the algorithms that run efficiently on tree machines are algebraically
associative operations such as the computation of the sum, the maximum, or the
mean of n numbers. Such operations can be performed in O(log n) time, provided
the data elements are already present in the tree. In these algorithms, data elements
are stored initially in the PE's at the leaves of the tree (leaf processors). The
answer is obtained by letting each PE combine the values found in its children
using the algebraically associative operator (adding two numbers in the sum
algorithm, for example). This step is repeated log n times (the number of levels in
the tree), after which the root of the tree contains the final result (the sum of the
n numbers, in the sum algorithm).
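The level-by-level combination described above can be illustrated with a short sequential sketch. It is not taken from any of the machines surveyed here: the function name tree_reduce, the power-of-two restriction on the number of leaves, and the sample data are assumptions made only for illustration. Each pass of the loop stands for one parallel step in which every parent combines its two children.

```python
from operator import add


def tree_reduce(leaf_values, combine):
    """Combine leaf values level by level, as a complete binary tree of PE's would.

    Returns the value that ends up at the root and the number of parallel
    steps (tree levels above the leaves) that were needed.
    """
    level = list(leaf_values)
    n = len(level)
    if n == 0 or n & (n - 1):
        raise ValueError("this sketch assumes the number of leaves is a power of two")
    steps = 0
    while len(level) > 1:
        # Every parent combines its two children; in the machine all parents
        # on one level would do this at the same time.
        level = [combine(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps


total, steps = tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], add)
largest, _ = tree_reduce([3, 1, 4, 1, 5, 9, 2, 6], max)
```

With eight leaves the loop runs three times, which matches the log n bound quoted above. Any associative operator can be passed in place of add or max.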
Search, insert, and delete operations on sets can also be performed in O(log n) steps,
where n is the number of data elements in the set [Bent 79a]. In the case of the
search algorithm, for example, the algorithm starts by inspecting the data held by
the root processor. If the target of the search is found, the algorithm is terminated.
Otherwise, the root processor initiates the search in both of its children. In the
worst case, this operation is repeated log n times. Note that the tree connections
are used explicitly in such algorithms to obtain the required results.
A tree machine in which all processors simultaneously perform the same instruction
on their respective local data items is said to be executing in single instruction
stream, multiple data stream (SIMD) mode [Flyn 72]. Similarly, algorithms
performed in this mode are called SIMD algorithms. Tree machines are capable of
the very rapid execution of SIMD algorithms, which require this kind of global
broadcast. The efficiency of tree machines in executing SIMD algorithms derives
from the fact that the time required for global broadcast in trees is proportional to the
number of levels in the tree, and is thus logarithmic in the number of PE's. Few
other architectures share this property. Two-dimensional orthogonal meshes, where
PE's are connected together in the form of a rectangular grid, have a global
broadcast time of O(n^(1/2)), where n is the number of PE's in the mesh. For ring
and linear array architectures, the broadcast time is O(n). The global broadcast
time in trees can be reduced further by propagating the data coming from the
parent to the children without latching it in a clocked manner. This in effect
turns the tree bus into a combinational logic circuit during global broadcast. A
global broadcast of this kind was proposed by Lipovski for his Associative Tree
Machine [Lipo 70] and is being implemented in the NON-VON Supercomputer
[Shaw 82]. It is expected that the time needed to broadcast a NON-VON
instruction to all PE's in a tree of 20 levels (one million PE's) will be about one
microsecond.
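To give a feel for these orders of growth, the following sketch counts hops under simple assumptions that are not part of the survey: the tree is complete and broadcasts from its root, the mesh is square (rounded down) and broadcasts from a corner, and the linear array broadcasts from one end. The function name broadcast_steps is hypothetical.

```python
import math


def broadcast_steps(n_pes):
    """Rough hop counts for reaching every PE from a single source (illustrative only)."""
    return {
        # complete binary tree of n = 2**k - 1 PE's: one hop per level below the root
        "tree": n_pes.bit_length() - 1,
        # square mesh, source in a corner: about 2 * (sqrt(n) - 1) hops
        "mesh": 2 * (math.isqrt(n_pes) - 1),
        # linear array, source at one end
        "linear": n_pes - 1,
    }


steps = broadcast_steps(2 ** 20 - 1)  # the 20-level tree mentioned above
```

For the 20-level example the tree needs 19 hops, the mesh roughly two thousand, and the linear array roughly one million.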
Highly efficient SIMD algorithms exist for a wide range of important tasks.
Examples include content-addressable (or associative) operations, relational database
primitives such as SELECT, PROJECT, and JOIN, and many numerical tasks
drawn from such diverse areas as signal processing, physical simulation, and low-level
computer vision. We will now show how a content-addressable search can be
performed using this kind of global broadcast. We assume for simplicity that each
PE holds a record of data elements in its local memory. We globally broadcast the
sequence of instructions for finding the data element we are looking for, and all the
processors look for this data element concurrently. The search is successful if at
least one processor finds the data element in its own local memory. We will defer
discussion of how it may be determined whether a given search has terminated
successfully.
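A minimal sketch of such a search follows. The record layout, the function name simd_search, and the use of any() to decide success are assumptions made for illustration; in particular, any() merely stands in for whatever mechanism the machine uses to report that some PE has found a match.

```python
def simd_search(local_records, predicate):
    """Apply one broadcast predicate to every PE's local record.

    Returns one match flag per PE. In the machine all PE's would evaluate the
    predicate at the same time; the list comprehension only models that.
    """
    return [predicate(record) for record in local_records]


# One hypothetical record per PE.
records = [{"key": 17, "name": "a"}, {"key": 42, "name": "b"}, {"key": 8, "name": "c"}]
flags = simd_search(records, lambda r: r["key"] == 42)
found = any(flags)  # stand-in for the machine's way of detecting success
```

The point of the sketch is that the work per PE is independent of the number of PE's: only the broadcast of the predicate depends on the size of the tree.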
Many other algorithms require a time proportional to the number of
data elements in the problem when performed on trees. Sorting, for example, can be
done in linear time [Song 81], and many matrix operations in O(m) [Brow 79],
where m is the number of elements in the matrix. Algorithms have also been
proposed to simulate three-dimensional physical systems on trees in O(n^(2/9)) time
[Shaw 82], where n is the number of points in a discrete approximation of the space
being simulated. Problems that are solvable by divide-and-conquer techniques may
also be well suited to tree machine architectures, as are many problems having
natural recursive definitions [Rem 79].
Algorithms that require extensive input and output operations can suffer from
delays if the input and output operations are confined to the root processor. There
are several solutions to this problem, which involve the distribution of input/output
operations between many PE's, as will be shown when tree machine architectures
are described. Also, algorithms that require a significant amount of communication
between arbitrary PE's, such as arbitrary permutation of data elements, are not
amenable to efficient execution on tree machines. The reason is that
exchanging data elements between two PE's in the tree requires the data elements
to travel up the tree until the lowest common ancestor of the two PE's is reached.
In the case of extensive communication between PE's, the highest nodes in the tree may
become the bottleneck of execution. The worst case occurs when the data elements
in the right and left subtrees of the root processor are to be exchanged.
The root processor becomes the bottleneck in this case, as all the data elements
have to pass through it. One solution to this problem is to add extra connections
between PE's to reduce the traffic going through higher level PE's, as will be
described when we discuss the X-Tree machine.
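The path through the lowest common ancestor can be traced with a small sketch. The heap-style numbering of the PE's (root 1, children 2i and 2i + 1) and the function name route are assumptions introduced here for illustration, not a routing algorithm from any of the machines surveyed.

```python
def route(src, dst):
    """Return the PE's visited when a message travels from src to dst.

    PE's are numbered heap-style (an assumption of this sketch): the root is 1
    and the children of PE i are 2*i and 2*i + 1.
    """
    up, down = [src], [dst]
    a, b = src, dst
    while a != b:
        # The larger index is never closer to the root, so move it up one level.
        if a > b:
            a //= 2
            up.append(a)
        else:
            b //= 2
            down.append(b)
    # `up` ends at the lowest common ancestor, and so does `down`;
    # drop the duplicate and reverse the downward half.
    return up + down[-2::-1]


sibling_path = route(4, 5)   # two leaves under the same parent
cross_path = route(4, 7)     # leaves in different subtrees of the root
through_root = all(1 in route(s, d) for s in (4, 5) for d in (6, 7))
```

In a seven-PE tree every exchange between the left leaves (4, 5) and the right leaves (6, 7) passes through PE 1, which is the root bottleneck described above.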
Another issue that relates to the performance of concurrent systems is the time
needed to access a specific element in the system. Access time for any element in
the tree is O(log n), and the maximum communication time between any two
individual elements in the tree is of the same order. Horowitz [Horo 81] presented
an algorithm to route messages in a binary tree machine. In his algorithm he
introduced extra links to shorten the average path length between the tree nodes,
and to provide for fault tolerance in case one of the tree processors is not
functioning.
Tree structures also enjoy the property that many assertions about the system can
be proven by induction over the hierarchy [Rem 79]. Tree machines are also easily
testable if a single processor is testable [Mead 79]. First, we test the root PE, and
if it is working correctly we load the same test program into its children and exercise
them. This process is repeated at each level of the tree.
3. Implementation Issues 
VLSI devices have been decreasing in size and increasing in speed at a rapid rate
over the last decade. It is estimated that in the late 1980's, we will have chips
containing millions of devices. As the dimensions of circuit devices scale down,
communication delays in wires that carry control and data to functional blocks in
an integrated circuit will become a dominant factor. In many cases, much of the
area on the chip is likely to be occupied by these communication lines. Another
important factor in current VLSI technology is the relatively small (and only slowly
increasing) number of pins each package can have. Tree machines are compatible
with these properties of VLSI by virtue of their local communication, regular
interconnection pattern, area-efficient layout, and limited number of external ports.
We will start our discussion of the VLSI implementation of tree machines by
showing how a tree machine may be laid out on a chip. Next, we will discuss the
pinout limitations imposed by current technology, and show how tree machines
avoid these limitations. We end the section by presenting a technique proposed by
Leiserson [Leis 81] for implementing tree machines using just one kind of chip.
3.1 Layout of Tree Machines 
In Figure 3-1, we have illustrated a space-economical layout proposed by Mead and
Rem [Mead 79] for a complete binary tree. This layout is known as the hyper-H
embedding of the complete binary tree. The hyper-H embedding is highly regular,
and the silicon area occupied by the tree is proportional to the number of PE's per
chip. Horowitz presented an algorithm for embedding arbitrary (possibly
incomplete) binary trees in the plane [Horo 81]. The algorithm assumes the wires
connecting the processors to be straight, and of unit width. The algorithm also
assumes the processors to be of unit area. In the case of a complete binary tree
this algorithm yields the hyper-H embedding of the tree. The area required to
lay out the tree in this case is proportional to n, where n is the number of PE's in
the tree. The worst case for the Horowitz algorithm is a binary tree skewed to the
right or the left, as shown in Figure 3-2. The algorithm embeds this skewed tree in
area O(n^2), where n is the number of PE's in the tree. The above discussion
assumes that it is sufficient to allow the tree to communicate with the external
world through the root only. The hyper-H embedding is optimal in this case. If we
assume that the tree must also have its leaves accessible along the perimeter of the
chip, then the area needed to embed the tree is at least O(n log n), as shown by
Brent and Kung [Bren 79].
Figure 3-1: Hyper-H Embedding of a Binary Tree
Figure 3-2: The Mapping of a Right Skewed Binary Tree
By way of comparison, the two-dimensional orthogonal mesh-based architectures, in
which each PE is assigned to one of the points of a rectangular mesh and
connected to its four neighbors, are asymptotically as area efficient as trees, but
suffer from pinout limitations, as we will see in the next subsection. Shuffle-exchange
and cube-connected cycles architectures [Ston 71], to cite another example,
require an area of at least O(n^2 / log^2 n) [Thom 80].
3.2 Pinout Limitations and Tree Machines 
Over the coming years, the number of pins per chip package is not expected to
increase at the same rate as the number of active devices within the chip. The
limited number of pins per package places a severe constraint on the bandwidth
available for communication with the external world, and thus represents a physical
bottleneck. Architectures having a small fixed number of external ports per chip,
independent of the number of PE's per chip, are for this reason particularly suitable
for VLSI implementation. Binary trees are quite attractive in this regard. Every
processor in a binary tree machine communicates with at most three other
processors: its parent, left child, and right child.
If we were to build the tree machine using a single PE per chip, the number of
ports per chip would be three. If the tree machine is to have more than one
processor per chip, at least two implementation methods are possible. The first uses
two kinds of chips, as shown in Figure 3-3. Chips that contain the leaf nodes (leaf
chips) will have only one port. The leaf chips are not pin-limited, and as devices
on chip scale down, we can embed an unlimited number of PE's on each chip. Chips
that contain only internal nodes (internal chips), on the other hand, will have a
number of ports depending on the number of nodes they contain. These internal
chips are thus pin-limited, and cannot contain more than a fixed number of
processors, even as dimensions scale down.
Figure 3-3: Implementing Tree Machines Using Two Kinds of Chips
Another way to implement tree machines, suggested by Leiserson [Leis 81], employs
only one kind of chip, which is discussed and illustrated in Section 3.3. In the
Leiserson scheme, every chip has exactly four ports, independent of the number of
processors it contains. Tree machines implemented using this scheme have a fixed
number of pins per chip, and can have as many PE's as the dimensions of VLSI
devices allow. Linear array machines, which consist of an ordered set of PE's,
each connected to its immediate predecessor and successor, share this property with
tree machines. Two-dimensional orthogonal mesh-based architectures have a number
of external ports equal to 4√n, where n is the number of PE's. Shuffle-exchange
and cube-connected cycles architectures have even more external ports
asymptotically than meshes.
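The contrast in port counts can be made concrete with a few lines of arithmetic. The function name external_ports is hypothetical, and the figure of two ports for a linear array segment is an assumption of this sketch (one link to the predecessor and one to the successor); the other two entries follow the counts quoted above.

```python
import math


def external_ports(n_pes_on_chip):
    """External ports per chip as a function of the number of PE's on the chip."""
    return {
        "leiserson_tree": 4,                        # fixed, whatever the chip holds
        "linear_array": 2,                          # assumed: predecessor and successor
        "mesh": 4 * math.isqrt(n_pes_on_chip),      # 4 * sqrt(n) for a square mesh
    }


ports_small = external_ports(64)
ports_large = external_ports(1024)
```

Going from 64 to 1024 PE's per chip leaves the tree and linear array port counts unchanged, while the mesh grows from 32 to 128 ports.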
3.3 The Leiserson Scheme for Tree Machine Layout
The tree architecture gives rise to IC's that have a highly regular interconnection
structure, local communication, and many repetitions of a single processor design.
These IC's can then be assembled in regular patterns at the printed-circuit (PC)
board level to construct machines with thousands to millions of processors. A
scheme for implementing binary trees using a single type of chip and a regular
inter-chip connection pattern was suggested by Leiserson. In this scheme, a complete
subtree and a single interior node are embedded on each chip, as shown in Figure
3-4. There are four ports per chip, labeled T, F, L, and R. The T connection leads
to the root of the subtree, while F, L, and R connect the single interior node to its
parent, left child, and right child, respectively. A simple recursive procedure allows
the construction of a complete binary tree, as shown in Figures 3-4 and 3-5. The
area required for routing wires within the PC board is proportional to the number
Figure 3-4: Interconnection of Two Leiserson Chips
Figure 3-5: Leiserson Layout: The Printed Circuit Board
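One way to read the recursive step is sketched below. The wiring rule used in combine is an assumption about how two chips are joined, not a transcription of Figures 3-4 and 3-5: the spare interior node of the first chip becomes the parent of both subtree roots, and the spare node of the second chip is left free to play the same role at the next level. The names Unit, combine, and build are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Unit:
    """A chip or group of chips: one complete subtree plus one spare interior node."""
    subtree_nodes: int   # PE's in the complete subtree reached through port T
    depth: int           # number of levels in that subtree
    chips: int = 1

    @property
    def ports(self):
        # T leads to the subtree root; F, L, R belong to the spare interior node.
        return ("T", "F", "L", "R")


def combine(a, b):
    """Join two equal units into one unit of the same shape, one level deeper.

    Wiring assumed here: a's spare node takes a's T on its L port and b's T on
    its R port, so it becomes the new subtree root and its F port becomes the
    new T. b's spare node stays unused and supplies the new F, L, R ports.
    """
    if a.depth != b.depth:
        raise ValueError("units must hold subtrees of equal depth")
    return Unit(a.subtree_nodes + b.subtree_nodes + 1, a.depth + 1, a.chips + b.chips)


def build(levels_per_chip, doublings):
    """Start from one chip and double the assembly the given number of times."""
    unit = Unit(2 ** levels_per_chip - 1, levels_per_chip)
    for _ in range(doublings):
        # The same value is passed twice because the two halves are identical.
        unit = combine(unit, unit)
    return unit


board = build(levels_per_chip=3, doublings=2)
```

Under this reading, four chips that each hold a three-level subtree yield a five-level complete tree of 31 PE's, and the assembly still presents exactly four ports, so the same step can be repeated at the board level.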




4. Some Tree Machines 
In this section we will discuss some of the most important tree machine designs
that have been proposed, with an emphasis on their differences. Some of these are,
at present, merely "paper machines"; others are in various stages of implementation.































4.1 The Tree Channel Processor
The Tree Channel Processor (TCP), proposed by Lipovski [Lipo 70], is an
associative machine that was designed to emulate a class of machines the author
called "linear array processors", which are based on a linear array of associative
memory cells. Figure 4-1 shows a linear array processor. Each associative cell has
a single fixed-size word of memory and a comparator. The linear array processor
was intended primarily for information storage and retrieval. "Rails" that connect
consecutive cells are used to broadcast data and instructions, and to aggregate the
results of associative search. The linear array, however, suffers from excessive
propagation delay, difficulty of segmenting the processors to execute subprograms,
and a susceptibility to faults resulting from rails stuck at one logical level.
Figure 4-1: A Linear Array Processor
Figure 4-2: A Subtree of the Tree Channel Processor
Lipovski designed the TCP interconnection scheme of Figure 4-2 to solve these
problems. The processor has a single "channel" and two identical rail complexes.
Global broadcast to all cells is performed simultaneously via the channel in the tree
branches. Each cell amplifies the channel signal and propagates it to its two
children. The rail communication makes the processor cells appear to the
programmer as an ordered one-dimensional array able to detect subsets and
substrings, to count elements, etc. The maximum delay through the channel or the
rails is O(log n). That delay determines the cycle time for the tree.
Communication with the external world is performed through input and output
channels that are connected to selected leaf nodes. The tree can be partitioned
into separate modules called instruction domains (ID's). An instruction domain is
established when a cell is disconnected from the channel by setting one of its mode
flags. The whole subtree rooted at the separated cell becomes a new ID. The
different ID's normally act independently. They can issue commands to delimit the
channel or reconnect it. If a cell in a given ID needs to use an I/O channel
connected to a leaf cell in another ID, the two ID's must be reconnected and
whatever programs they are executing must be halted until the request is satisfied.
A switching network at the root of the tree is used to accommodate these
functions.
The cells at the tree nodes can function in three different modes, which are invoked
by setting mode flags within each cell. All cells in a specific ID operate in the same
mode. The three modes are:
1. Run mode, which involves normal execution of instructions for information
retrieval.
2. Transfer mode, which provides for efficient loading and unloading of the
tree.
3. Supervisor mode, which enables channel-delimiting cells to be set up or
changed.
The TCP presented solutions to several problems associated with architectures
having a very large number of processors. The machine is segmented to allow
different problems to run concurrently, it has a small propagation delay, and it
provides for fast loading and unloading of data in parallel from different I/O
channels.
4.2 The Caltech Tree Machine 
The Caltech tree architecture comprises a collection of small, identical processors,
each with some local storage, connected to form a binary tree [Brow 80]. The
machine relies on local communication between parents and children only; there are
no global communication paths. The machine operates in multiple instruction
stream, multiple data stream (MIMD) mode [Flyn 72], with PE's independently
executing programs stored in their respective local memories. Processors
communicate by means of a message-passing protocol. We will start by describing the
processor architecture, and will then show how communication between processors is
accomplished.
4.2.1 The Processor Architecture 
Each processor contains a small amount of program memory, a few data registers, 










Figure 4-3: The Caltech Tree: Processor Architecture
Figure 4-3 illustrates the processor architecture. Underlying this architecture is the
belief that increased functionality does not justify the requisite increase in chip area,
and that there is a tendency on the part of the programmer not to exploit all
available concurrency if there is a rich set of instructions and a powerful processor
[Brow 80]. The processor has 256 bytes of program store and 16 byte registers for
storing data. Only addition, subtraction, shift operations, and logical operations are
provided by the ALU. Multiplication, division, and floating point operations are all
performed in software. The accumulator, AC, is a source, and the only destination,
for the ALU. The I register holds the instruction being executed, while the PC
register points to the next instruction to be fetched for execution. The instruction
set provides only a direct addressing scheme. There are three main categories of
instructions in the machine: control flow instructions, communication instructions,
and data flow instructions. There are seven control flow instructions: HALT for
normal termination of the program, ABORT for abnormal termination of the
program, SKIP for skipping an instruction, and four instructions to implement
conditional and unconditional branching within the program. Four communication
instructions are provided: two for input and two for output. Data flow instructions
are used to transfer data between the data registers and the accumulator, and to
perform arithmetic and logical operations. There are 14 instructions in this category.
The tree machine is programmed in a high level language that resembles Hoare's
"Communicating Sequential Processes" (CSP) notation [Hoar 78]. The Tree Machine
Programming Language (TMPL), as proposed by Browning [Brow 80], allows the
programmer to write programs for trees with arbitrary fanout. During compilation
these programs are transformed into source code for a binary tree by a program
called MAP [Brow 80]. The basic building block of TMPL is the processor
definition. Each definition describes a self-contained computational unit that
communicates with other processors through named external ports [Brow 81].
The tree machine is loaded through the external port of the root processor, and the
loaded code travels down the tree to the leaves. The code stream consists of a
header and the code itself. The header specifies the length of the code, the length
of the destination processor address, the address of the destination processor, the
initial value to be placed in PC, and an instruction code (opcode). The PE's are
uniquely identified by a bit string that grows in length with the depth of the tree.
The addresses are assigned in such a way that a child address differs from its
parent address in only one bit, so that there is no need to store the PE's address
in its own local memory. Instead the parent is able to decide where to direct the
code message on the basis of the value of a specific bit in the target address. The
opcode can be any of the following:
1. ONE, which means the code is to be loaded in a specified processor.
2. TREE, which means the code will be loaded in the entire subtree rooted
at the addressed processor.
3. LEVEL, which directs the code to all PE's at the level specified by the
processor address in the header.
4. YOU, which is used by a parent to force the loading of its children.
During loading each PE looks at the opcode. If it is YOU, the PE will load the
code into its own memory. If it is TREE or ONE, the PE will look at the address,
and if it has a nonzero length, will remove the leading bit of the address and pass
the header and the code to one of its two children depending on the value of this
bit. If the length is zero, then it will load the code stream into its own memory. In
the case of TREE, the entire subtree beneath the processor must be loaded with
the same program. This is done by passing the code and the address field equal to
zero to the children of the processor. If the opcode is LEVEL the PE will do the
same thing it did in the case of TREE, except that it will examine the length of
the address rather than the address itself.
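The loading rules can be traced with a small sequential model. Several details below are assumptions made only for illustration: PE's are numbered heap-style (root 1, children 2i and 2i + 1), an address is a string of '0' and '1' characters with '0' selecting the left child, and LEVEL is read as "shorten the address by one bit per level and hand it to both children until its length is zero". The function name load is hypothetical, and the sketch is not the loader described in [Brow 80].

```python
def load(opcode, address, code, depth):
    """Return {pe_index: code} for the PE's that end up storing `code`.

    `depth` is the number of levels in the (complete) tree being loaded.
    """
    loaded = {}
    last = 2 ** depth - 1  # index of the last PE in the tree

    def children(pe):
        return [c for c in (2 * pe, 2 * pe + 1) if c <= last]

    def receive(pe, op, addr):
        if op == "YOU":
            loaded[pe] = code
        elif op in ("ONE", "TREE"):
            if addr:
                # Nonzero length: strip the leading bit and pass the rest down.
                child = 2 * pe + int(addr[0])
                if child <= last:
                    receive(child, op, addr[1:])
            else:
                loaded[pe] = code
                if op == "TREE":
                    # A zero-length address is passed on to both children.
                    for c in children(pe):
                        receive(c, "TREE", "")
        elif op == "LEVEL":
            if addr:
                # Assumed reading: only the remaining length matters.
                for c in children(pe):
                    receive(c, "LEVEL", addr[1:])
            else:
                loaded[pe] = code
        else:
            raise ValueError(f"unknown opcode {op!r}")

    receive(1, opcode, address)
    return loaded


one = load("ONE", "10", "prog", depth=3)
tree = load("TREE", "0", "prog", depth=3)
level = load("LEVEL", "00", "prog", depth=3)
```

In a three-level tree, ONE with address "10" reaches a single leaf, TREE with address "0" loads the whole left subtree of the root, and LEVEL with a two-bit address loads all four leaves. Note that no PE ever compares the address with an identity of its own, which is the property the addressing scheme is designed to provide.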
4.2.2 Communication in the Caltech Tree Machine 
In each PE there are three groups of communication handlers, one for each of the
PE's three ports. These handlers manage message traffic, load programs, and pass
code to their descendants. The definition of the processor in TMPL includes a
named external port to communicate with its parent, and an arbitrary number of
named internal ports to communicate with its children. Communication statements
in TMPL specify the port name through which the message is to pass instead of
naming target processors. Inter-process communication can be specified in two
ways: either by an imperative statement or a conditional expression [Brow 81]. The
conditional form appears as a part of a loop or within a case statement. On
executing these statements, the processor is blocked until communication is
successfully terminated. The conditional form is executed only if both PE's
communicating along the specified port are ready to exchange a message. The
general form for a message statement is the same as that of an expression in CSP:
[ port or list of ports ] [ ? or ! ] message (arguments)
where ? indicates input and ! indicates output. Two processors will communicate
when an output request to a port from one PE matches an input request for the
same message from the other processor. In order to avoid deadlock, the restriction
is made that either the output or the input can be done conditionally, but not both.
The type of message, imp for imperative and cond for conditional, must be
specified so that the compiler can detect illegal communication operations. For
examples and more discussion the reader is referred to [Brow 81].
Figure 4-4: Mapping Arbitrary Fanouts onto a Binary Tree
Arbitrary fanouts are mapped into the binary tree by using several layers of the
tree to provide the required number of descendants, as shown in Figure 4-4. The
intermediate PE's are called padding PE's, and are provided with a skeletal program
that allows them to simply pass messages between parent and children.
Some of the algorithms that run on the Caltech tree machine will be described in
Section Five.
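The cost of this padding can be estimated with a short calculation. The sketch
below is illustrative only and is not taken from [Brow 81]: the function name
padding_for_fanout is hypothetical, and it assumes the layers are packed as
tightly as possible and that every padding PE has exactly two children.

```python
def padding_for_fanout(k: int) -> tuple[int, int]:
    """Return (layers, padding_pes) needed to give one PE k logical children.

    Assumption: the k children are the leaves of a binary subtree rooted at
    the parent, laid out with minimum depth, and every padding PE has two
    children.
    """
    if k < 1:
        raise ValueError("fanout must be at least 1")
    # Minimum number of tree layers below the parent that can hold k children.
    layers = max(1, (k - 1).bit_length())
    # A binary tree with k leaves has k - 1 internal nodes; one of them is
    # the parent itself, so the rest are padding PE's.
    padding = max(0, k - 2)
    return layers, padding
```

For example, a fanout of 5 needs three layers and three padding PE's under
these assumptions, while a fanout of 2 needs no padding at all.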
4.3 The X-Tree Machine 
The X-Tree machine architecture was formulated at the University of California at
Berkeley. It is organized as a full binary tree of multiprocessors known as X-nodes.
The objective of the X-Tree project is to define a modular component from which
general-purpose computing systems of arbitrary size could be constructed [Sequ 79].
A binary tree was chosen as the most advantageous way to interconnect these
components. The binary tree is enhanced with extra links to form a half-ring or
full-ring tree as shown in Figure 4-5. These extra links are employed to provide
fault tolerance, to shorten the average path length between tree nodes, and to
provide a more uniform distribution of message traffic throughout the tree.
Each of the X-Tree PE's was intended to have as much memory as could fit on a
single chip with the processor itself. Each X-node thus consists of a single chip
computer with as much memory as technology permits. This memory is used for
local storage, acting as a cache for the secondary memory, which is connected to
the leaf nodes only. Having a large memory in each node minimizes the need for
swapping pages of memory to and from the secondary memory.
We will start by discussing communication in the X-Tree, and then discuss the
architectural features of a single X-node.
4.3.1 Communication in the X-Tree 
All communication in the tree is in the form of messages. To obtain a specific
block of data from that portion of the secondary memory which is attached to a
given leaf node, a message is sent to this node, requesting the desired data block.
A channel is then established between the two nodes using a special message
header. After communication is ended, the channel is disabled by another special
command. Addresses are assigned to the X-nodes such that the children of node n
have node addresses 2n and 2n+1.
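This numbering makes the tree relations simple bit operations on the address.
The following is a minimal sketch, assuming the root is given address 1 (the
text does not state the root address); the function names are hypothetical.

```python
def children(n: int) -> tuple[int, int]:
    """Addresses of the left and right children of node n."""
    return 2 * n, 2 * n + 1

def parent(n: int) -> int:
    """Address of the parent of node n (drops the last address bit)."""
    return n // 2

def level(n: int) -> int:
    """Depth of node n, with the root (address 1, an assumption) at level 0."""
    return n.bit_length() - 1
```

Under this assumption the binary form of an address spells out the path from
the root: after the leading 1, each 0 bit is a left turn and each 1 bit is a
right turn.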
The half- and full-ringed trees give rise to simple routing algorithms. Decisions
about routing messages are made locally at each node depending on the destination
address and the current node address. Starting from the root, a message can travel
anywhere in the tree by letting the nodes examine the sequence of bits in the
destination address, compare them to their own local address, and route the
message depending on the result of this examination. To go from any arbitrary
node to another node in a binary tree the message will move up in the tree until a
common ancestor of the two nodes is found. This common ancestor shares its node
address bits with the leading bits of the two nodes.
Figure 4-5: A Binary Tree with Full- and Half-Ring Connections
In the case of a full-ring tree it is always shorter to travel along the ring
connections when the horizontal distance between nodes is four or less; otherwise,
the message moves up the tree. There is a routing controller at each node, which
handles in software the messages coming into the node. The routing controller in
a full-ring tree compares the address of the incoming message to its own node
address, and then acts in the following manner:
1. If the two addresses agree, then the node is the intended destination and
the message has reached its destination. The destination node examines
the incoming message to determine what action to perform.
2. If the destination lies higher in the tree, then the message is routed
upward. This is the case if the destination node address is shorter than
the current node address.
3. If the destination node lies at the same level or lower, then the
horizontal distance is computed by subtracting the current node address
from the destination node address. If this distance is less than four, the
message is routed to a ring connection. If the distance is greater than
four, the message is routed upward. When the horizontal distance is zero,
the message is routed downward.
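The three rules can be sketched as a single decision function. This is an
illustration under stated assumptions, not the controller described in
[Sequ 78]: the root is taken to have address 1, a destination below the current
level is first truncated to its ancestor at the current level before
subtracting, and a distance of exactly four is sent along the ring (following
the "four or less" statement above). The names route and RING_LIMIT are
hypothetical.

```python
RING_LIMIT = 4  # assumed threshold; the text leaves a distance of exactly 4 open

def route(cur: int, dest: int) -> str:
    """Return the direction in which the node at address cur forwards a message."""
    if dest == cur:
        return "deliver"                      # rule 1: addresses agree
    cur_len, dest_len = cur.bit_length(), dest.bit_length()
    if dest_len < cur_len:
        return "up"                           # rule 2: destination address is shorter
    # Rule 3: same level or lower. Compare against the destination's ancestor
    # on the current level (an assumption about how the subtraction is done).
    ancestor = dest >> (dest_len - cur_len)
    distance = ancestor - cur
    if distance == 0:
        # Destination lies in this subtree; the next address bit picks the child.
        next_bit = (dest >> (dest_len - cur_len - 1)) & 1
        return "down-right" if next_bit else "down-left"
    if abs(distance) <= RING_LIMIT:
        return "ring-right" if distance > 0 else "ring-left"
    return "up"
```

For instance, route(8, 10) stays on the ring because the two nodes are only two
positions apart on the same level, whereas route(8, 15) climbs toward a common
ancestor.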
If the message cannot be routed in the direction of its primary choice
because that link is broken, the routing controller will try the second and third
choices that are predefined for it. This fault tolerance scheme was simulated and
found to yield reasonable results. To prevent the possibility of a message circulating
forever, the number of rejections the message encounters is counted and the
message is purged once this count reaches a predetermined number. The count is
kept in a byte that travels with the message.
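A minimal sketch of this fallback behaviour follows. It assumes that every
broken link a message meets counts as one rejection and that a message with no
usable link is simply held; neither detail is given in the text. PURGE_LIMIT,
forward, and the direction names are hypothetical.

```python
PURGE_LIMIT = 8  # hypothetical value for the "predetermined number"

def forward(choices, link_up, rejections, limit=PURGE_LIMIT):
    """Try the predefined choices in order.

    choices    -- ordered list of directions (primary, second, third)
    link_up    -- dict mapping a direction to True if that link is working
    rejections -- count carried in the message header (one byte)
    Returns (action, direction, updated_rejections).
    """
    for direction in choices:
        if link_up.get(direction, False):
            return ("send", direction, rejections)
        # Link broken: count one rejection; the counter fits in a byte.
        rejections = min(rejections + 1, 255)
        if rejections >= limit:
            return ("purge", None, rejections)
    return ("hold", None, rejections)
```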
Routing in half-ring trees can be done using the same algorithm as for full-ring
trees, except that when a full-ring link is non-existent, the second choice is used to
route the message. This algorithm for half-ring trees is slightly less than optimal
[Sequ 78]. An optimal algorithm for half-ring trees is shown in [Sequ 78].
Whenever the tree is expanded, a storage device or a terminal may have to be
moved, thus changing its node address. A mechanism is thus required to keep the
current addresses of data items in this case. To achieve this, messages to leaf nodes
are tagged, and leaf processors are marked differently from non-leaf PE's. New
nodes are attached at the left child positions of existing leaf nodes. When a message
directed to a leaf node reaches a destination node that is not a leaf node, the
destination node directs it to its left child. This process continues downward along
the left child chain until the message reaches the leaf node.
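The forwarding rule can be sketched as a walk down the left-child chain. The
sketch assumes the left child of node n is the one with address 2n, which the
text does not state explicitly; forward_to_leaf, is_leaf, and max_depth are
hypothetical names.

```python
def forward_to_leaf(dest, is_leaf, max_depth=32):
    """Return the nodes visited by a leaf-tagged message addressed to dest.

    is_leaf   -- predicate telling whether a node is currently a leaf
    max_depth -- safety bound so the walk cannot run forever
    """
    path = [dest]
    node = dest
    for _ in range(max_depth):
        if is_leaf(node):
            return path
        node = 2 * node          # assumed address of the left child
        path.append(node)
    raise RuntimeError("no leaf found on the left-child chain")
```

If the device originally at node 3 has been pushed down twice by tree
expansion, a message addressed to node 3 visits nodes 3, 6, and 12.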
4.3.2 The X-node architecture 
The X-node is a simple microcomputer that communicates with four or five nearest
neighbors. In addition to the processor, each X-node contains a self-controlled
switching network with its own I/O buffers and controllers. This enables
computation to occur concurrently with communication. The processor is attached
to the switching network in the same manner as are the other nearest X-node
neighbors, as shown in Figure 4-6.
The switching network consists of a time-multiplexed bus with five or six attached
ports. Each port consists of an input buffer, an output buffer, and the finite state
machines necessary to control them. The internal communication bus consists of a
data bus and an address bus that carries slot and port addresses. The bus is
allocated in a fixed round-robin manner to all attached ports.
The X-node processor architecture is intended to support high level languages
directly in hardware [Patt 79]. This leads to smaller programs, and consequently to
Figure 4-6: Block Diagram of an X-node
support dynamic microprogramming. This helps optimize the execution of code for
certain applications in particular nodes. The microcode is stored in the on-chip
memory.
The X-node processor requires a substantial amount of memory to minimize the
frequency of page faults and subsequent paging traffic between the X-node and the
secondary storage [Patt 79]. A high density implementation of the node memory is
thus required. Programs, data, and microcode are stored in different areas inside the
X-node memory. Separate caches are employed to handle these three types. Thus,
parallel memory references to data, programs, and microcode can be achieved. This
on-chip memory hierarchy also includes a cache attached to a separate dedicated
ALU designed for address calculations.
Sequin and Fujimoto proposed adding a separate chip at each node dedicated to
performing the communication needed between the X-nodes. They called the new
node the Y-component. For more information on the Y-component the reader is
referred to [Sequ 82].
4.4 The NON-VON Supercomputer 
The NON-VON (non-von Neumann) Supercomputer [Shaw 82] is currently being
implemented at Columbia University. Its architecture includes a tree-structured
Primary Processing Subsystem (PPS) based on custom nMOS VLSI circuits, along
with a Secondary Processing Subsystem (SPS) based on a bank of intelligent disk
drives. Figure 4-7 shows the top level organization of a simplified initial version of
the NON-VON machine that is now under construction.
Figure 4-7: Top Level Organization of NON-VON
The PPS is configured as a binary tree of PE's. Each PE comprises a small RAM
(64 bytes in the prototype), a modest amount of processing logic, and an
Input/Output (I/O) switch. The I/O switch can be set for global bus
communication, for communication between parents and children (tree neighbor
communication), or to reconfigure the binary tree as a linear array of processors
(linear neighbor communication). All PE's are identical except for minor differences
in the "leaf nodes". At the root of the tree is a special processor called the
control processor (CP), which is responsible for coordinating different activities
within the PPS. The CP is capable of broadcasting instructions to be executed in
SIMD mode in all active PE's. SIMD execution of globally broadcast instructions
allows NON-VON to use a large number of PE's that are much smaller and less
powerful than those proposed for other tree machines. This is because
NON-VON has no need for local program memories or area-expensive processing
and communication hardware. The area occupied by the processor embodied within
a single PE is approximately equal to that required for its 64 bytes of local
memory. This makes it feasible to effect a "nearly one-to-one" association of
logical records with physical PE's. This aspect of NON-VON is central to its
processing power in large-scale data processing applications.
The first version of NON-VON, called NON-VON 1, will contain chips with only
one PE for the purpose of testing certain electrical and timing characteristics. This
chip has already been fabricated and tested. A modified version of the chip with
eight PE's is currently being implemented, and may serve as the basis of a later
prototype called NON-VON 3. The NON-VON 3 chip uses the same amount of area
per PE, but is considerably faster and has a more powerful instruction set.
Currently a new, more advanced architecture, called NON-VON 4, is being designed
with the goal of significantly expanding the range of applications that can be
executed by NON-VON 1 and 3 in a highly parallel fashion. The most important
change incorporated in NON-VON 4 is that a number (perhaps 256 to 1K) of
Large Processing Elements (LPE's), interconnected with a high-bandwidth
interconnection network, will be incorporated within the top portion of the PPS
tree. Each of these LPE's is capable of serving as a control processor for the
subtree of which it is the root. This enables NON-VON 4 to function in MIMD and
"multiple SIMD" modes. In multiple SIMD mode, each LPE broadcasts instructions
from its own local memory to be executed in SIMD mode by its own subtree.
Figure 4-8 shows the design of a single NON-VON 1 PE. A PE actively executes
the instructions broadcast by the CP as long as its enable bit is set. If the enable
bit is reset, then the PE is disabled and only an ENABLE instruction will activate
it again. Two internal buses constitute the data path. The first bus is eight bits
wide, and is used to transfer data between the byte accumulators and other byte
registers. The memory and the Memory Address Register (MAR) are both connected
to this byte-wide bus. The other bus is one bit wide and is used to transfer data
Figure 4-8: Block Diagram of the NON-VON Processing Element
Figure 4-8 also shows the main functional blocks of the PE data path. They are
the one-bit Arithmetic/Logical Unit (ALU), the Arithmetic Comparison Unit (ACU),
the byte accumulators A8 and B8, an array of byte registers, the one-bit
accumulators A1 and B1, an array of one-bit registers, the MAR, and the memory.
The ACU is an eight bit comparator that compares the contents of the two
accumulators A8 and B8, setting A1 whenever A8 = B8, and setting B1 whenever
A8 > B8. The ACU is used for content-addressable operations. A8 and B8 can
be rotated, left or right, through A1 and B1, respectively. All arithmetic and
logical operations other than the 8-bit arithmetic relational operations performed by
the ACU are performed bit-serially using the one-bit ALU.
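The division of labour between the ACU and the one-bit ALU can be illustrated
with a small model. This is only a sketch: it assumes A1 and B1 are cleared
when their condition does not hold, and it shows one plausible bit-serial
operation (an 8-bit add) rather than any actual NON-VON instruction. The
function names are hypothetical.

```python
def acu_compare(a8: int, b8: int) -> tuple[int, int]:
    """Model of the ACU: returns (A1, B1) for byte accumulators a8 and b8."""
    a1 = 1 if a8 == b8 else 0   # A1 set when A8 = B8 (assumed cleared otherwise)
    b1 = 1 if a8 > b8 else 0    # B1 set when A8 > B8 (assumed cleared otherwise)
    return a1, b1

def bit_serial_add(a8: int, b8: int) -> tuple[int, int]:
    """Add two bytes one bit at a time, as a one-bit ALU would.

    Returns (8-bit sum, carry out). Takes eight one-bit steps.
    """
    carry = 0
    result = 0
    for i in range(8):
        a = (a8 >> i) & 1
        b = (b8 >> i) & 1
        s = a ^ b ^ carry                      # one-bit full adder: sum
        carry = (a & b) | (carry & (a ^ b))    # one-bit full adder: carry
        result |= s << i
    return result, carry
```

The comparison completes in a single step, while the addition needs one pass
of the one-bit ALU per bit position.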
The NON-VON 1 architecture incorporates an SPS based on a number of rotating
storage devices. Associated with each disk head in the SPS is a separate sense
amplifier and a small amount of logic capable of dynamically examining the data
passing beneath it. These Intelligent Head Units (IHU's) are also assumed to be
capable of performing general computations (hash coding, for example), and of
serving as control processors. This supports parallel transfer of data between the
PPS and SPS, which is necessary to avoid I/O becoming a bottleneck, and allows
NON-VON 1 to function as a collection of independent SIMD machines (this
execution mode, also employed by other parallel architecture researchers, has come
to be referred to as multiple SIMD, or MSIMD).
4.4.1 Communication in NON-VON 
The NON-VON I/O switch supports three modes of communication:
1. Global bus communication, supporting both broadcast by the CP to all
PE's in the PPS as required for SIMD execution, and data transfers from
a single selected PE to the CP. No concurrency is achieved when data is
transferred from one PE to another through the CP using the global
communication instructions. An instruction called RESOLVE can be used
to disable all but a single PE chosen among a specified set of PE's. This
is an example of a hardware multiple match resolution scheme, in the
terminology of the literature of associative processors. (The CP, on
executing a RESOLVE instruction, is able to determine whether the
operation resulted in any PE being enabled or not.) The REPORT
instruction transfers data from the single chosen PE to the CP using the
global bus communication.
2. Tree communication, supporting data transfers among PE's that are
physically adjacent within the PPS tree. Instructions support data
transfers to the Parent (P), Left Child (LC), and Right Child (RC) PE's.
Full concurrency is achieved in this mode, since all nodes can
communicate with their physical tree neighbors in parallel.
3. Linear communication, supporting data transfers to the Left Neighbor
(LN) or Right Neighbor (RN) PE's in a particular logical linear
sequence. This mode is useful for applications that require a predefined
total ordering of data. Figure 4-9 shows how the linear logical sequence
is mapped onto the tree structured physical topology of the PPS by inorder
enumeration [Knut 73]. The paths needed to transfer data between linear
neighbors in the tree concurrently are shown in Figure 4-9. Two phases
are needed to complete the linear communication cycle. Note that every
other element in the inorder sequence is a leaf node. In the first phase,
data is transferred along the arrows originating from the leaf PE's, while
in the second phase, data passes along the black arrows terminating at
the leaf PE's.
Figure 4-9: Inorder Embedding of the Linear Array
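The inorder embedding used for linear communication can be checked with a few
lines of code. The sketch below labels PE's with the 1, 2n, 2n+1 numbering
purely for convenience (NON-VON's own PE labelling is not given in the text);
inorder and is_leaf are hypothetical names.

```python
def inorder(depth: int, n: int = 1) -> list[int]:
    """Inorder sequence of node labels for a complete binary tree of `depth` levels."""
    if n.bit_length() > depth:
        return []
    return inorder(depth, 2 * n) + [n] + inorder(depth, 2 * n + 1)

def is_leaf(n: int, depth: int) -> bool:
    """A node is a leaf when it sits on the last level."""
    return n.bit_length() == depth
```

For a three-level tree, inorder(3) gives 4, 2, 5, 1, 6, 3, 7. Every other
element is a leaf, so each pair of linear neighbors consists of one leaf and
one internal PE, which is why two phases are enough to move data between all
neighbors.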
The IO8 and IO1 registers are used for communication with other PE's. There are
two types of instructions for tree and linear communications. They are of the form
SEND <PE>, and RECEIVE <PE>, where <PE> can be P, LN, RN, LC, or
RC. There is no SEND P however, since children must not compete to send
messages to a common parent. On executing a SEND instruction, the contents of
A8 are transferred through the I/O switch into the IO8 register of the receiving
PE. On executing a RECV instruction, the contents of IO8 are transferred into the
IO8 register of the receiving PE.
NON-VON is designed to support the massively parallel manipulation of data
records stored in its PE's. A data record stored in a single PE is, in effect, capable
of manipulating its own contents. Because of this observation, the metaphor of an
"intelligent record" is suggested by [Shaw 82]. In real data processing applications
however, records may require just a few bytes of RAM, or may be too large to fit
in the RAM of a single physical PE. In the first case, more than one record can be
packed in a single physical PE, while in the latter case, the record must be split
into pieces and stored in several PE's. This mapping between logical records and
physical PE's is invisible to the user. Thus a programmer views his records,
regardless of their size, as stored one per "virtual" PE. Two schemes are suggested
to allocate storage to records that do not fit within a single physical PE. The first
scheme, referred to as the linear allocation method, splits each record among several
linearly adjacent (logical) neighbor PE's. The other scheme, referred to as bush
allocation, stores each record in a distinct "tree-shaped" cluster of physically
adjacent PE's called a bush. For more details on the allocation of logical records in
NON-VON, the reader is referred to [Shaw 82] and [Shaw 83].
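As a rough illustration of the two cases, the arithmetic below uses the 64-byte
prototype RAM mentioned earlier. It ignores any per-record overhead or working
storage inside a PE, which is an assumption; the function names are
hypothetical and do not come from [Shaw 82] or [Shaw 83].

```python
PE_RAM_BYTES = 64  # local RAM per PE in the prototype

def pes_per_record(record_bytes: int) -> int:
    """Number of physical PE's a record must be split across (ceiling division)."""
    return -(-record_bytes // PE_RAM_BYTES)

def records_per_pe(record_bytes: int) -> int:
    """Number of whole records that can be packed into one physical PE."""
    return PE_RAM_BYTES // record_bytes
```

Under these assumptions a 200-byte record spans four PE's, while three 20-byte
records fit in a single PE.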
4.5 The Stony Brook Tree Machine 
The Stony Brook tree machine is a tree-structured multicomputer machine that is
being implemented at the State University of New York at Stony Brook. The tree
machine has the topology of a "double tree structure". The machine consists of
two interlocking trees as shown in Figure 4-10. The first tree incorporates user
programmable modules (P-modules). The P-modules are programmed in a
superior-subordinate mode. The processors at the nodes of the P-tree (P-modules)
have their own local memories. They communicate using hand-shake techniques, and
the commands and parameters directly communicated between them are of limited
length.
The second tree comprises transactional modules (T-modules) that are not accessible
to the user. The T-modules are used to provide the necessary communication
between P-modules and external storage devices. The T-modules will be used to
transmit large data blocks and program segments to the P-modules from the
external storage devices. These communication links appear as dotted lines in
Figure 4-10. Data transfers are performed under the control of the T-modules,
which are not accessed by the user. Each superior and its immediate subordinates
are connected to the same T-module. The T-modules are distributed to permit
expandability and to exploit the low cost of small computer modules. In effect the
T-modules act as a sort of cache for files that are stored in the central system file
storage, labeled G in Figure 4-10.
Figure 4-10: Example of a Double Tree Network
The role of G is to monitor the whole system for experimental purposes. Programs
for the whole system are developed on it. In the following section we will discuss
communications in the machine in more detail.
4.5.1 Communication in the Stony Brook Tree Machine 
The lines shown in Figure 4-10 represent communication pathways of different kinds.
The control links, shown by solid lines in the figure, permit exchange of short
messages between the P-processors under program control. The control links also
provide mechanisms by which a superior can assert control over a subordinate. Each
control link consists of two independent unidirectional channels, each of which is
capable of buffering one word at a time. Each channel has a single word buffer
that can be loaded under program control by the processor on one side. The
processor on the other side can read this buffer. A flag for handshake purposes is
maintained by each channel. It is set when a buffer is being loaded, and is reset
when the buffer is read. An interrupt occurs when the incoming channel buffer to a
processor is full. The interrupted processor disables interrupts and reads the message
into its local memory by polling the flags. The control link provides two
mechanisms that allow a superior to assert control over its subordinates. The first is
a reset mechanism which is used by a superior to abort the task running in its
subordinates, and to prepare for a new assignment. The second facility provided by
the control link is a halt mechanism, by which a superior can cause the subordinate
to enter the halt state. In this state a superior can examine and alter the state of
its subordinates. This is useful for interactive debugging, and for
bootstrapping/restarting in the P-tree [Bhat 79].
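The one-word handshake on a control link can be modelled in a few lines. This
is a software sketch of the behaviour described above, not the hardware design
of [Bhat 79]; interrupts, reset, and halt are left out, and the class and
attribute names are hypothetical.

```python
class Channel:
    """Unidirectional channel with a single-word buffer and a handshake flag."""

    def __init__(self):
        self.buffer = None
        self.flag = False      # set while the buffer holds an unread word

    def load(self, word) -> bool:
        """Sender side: load one word; refuse if the previous word is unread."""
        if self.flag:
            return False
        self.buffer = word
        self.flag = True
        return True

    def read(self):
        """Receiver side: read the word and reset the flag; None if empty."""
        if not self.flag:
            return None
        word = self.buffer
        self.flag = False
        return word


class ControlLink:
    """Two independent channels, one in each direction."""

    def __init__(self):
        self.down = Channel()  # superior to subordinate
        self.up = Channel()    # subordinate to superior
```

A sender that finds the flag still set must wait (here, load returns False)
until the receiver has read the previous word.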
The data link, shown using dotted lines in Figure 4-10, can support both exchange
of short messages and efficient transfer of large blocks of data. The data link can
work in either single-word handshake mode or in a self-sequenced direct memory
access mode (DMA mode). For a P-processor to access a file found in the
T-processor to which it is connected, it must first engage in a protocol in which
short messages are exchanged with the T-processor. At this point, DMA mode is
enabled and the actual data transfer takes place. The T-processor has no control
over the P-processor. The G-T tree was designed to provide services to requests
generated by the P-tree processors. For more information about the communication
and protection mechanisms in the Stony Brook tree machine the reader is referred
to [Bhat 79].
In the prototype, currently under construction at Stony Brook, DEC LSI-11's are
being used as P- and T-processors, while the G-processor is a DEC PDP-11/60.
The mass storage and all peripheral devices have been concentrated on the
PDP-11/60 for reasons of economy. The P-tree machine is well suited to
applications involving problem-solving by decomposition [Kieb 79]. The P-tree also
can be divided into subtrees performing separate computational tasks, using the
ability of superiors to initiate and interrupt the execution of subordinate tasks. We
will describe some of the algorithms that can run on this tree machine in Section
Five.
4.6 Special Purpose Tree Machines 
In this section we will describe several special purpose tree machines that implement 
particular algorithms in a highly efficient manner. These machines are non-von 
Neumann computing devices that might typically be used as peripherals to 
conventional computers. 
4.6.1 The DADO Tree Machine 
DADO [Stol 82] is a parallel tree-structured machine designed for the highly
efficient execution of production systems. Production systems have most often been
used in artificial intelligence applications to represent a body of knowledge about
specific tasks in the real world, such as medical diagnosis. A production
system consists of a set of rules, or productions, which form the Production Memory
(PM), together with a database of assertions about the real world called the Working
Memory (WM). A production rule is a statement of the form: If the condition holds
then this action is appropriate. The condition is the Left Hand Side (LHS) of the
rule, while the action is the Right Hand Side (RHS). The WM serves as a "focus of
attention" for the production rules. The LHS of each rule represents a condition
that must be present in the WM before the action of its RHS is fired. The action
can change the WM by adding assertions to it or deleting existing ones.
The production system operation consists of a repeated match-select-act cycle. In
each cycle, rules are matched against the current WM assertions. One of the
matching rules is selected according to some predefined criteria. The action in the
RHS of this rule is then performed. This cycle continues until a goal action is
taken or there are no more rules that are matched by the WM.
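A sequential sketch of the match-select-act cycle is given below to make the
control flow concrete. It is not DADO's algorithm: assertions are plain
strings, the selection criterion is simply "first matching rule in list
order", and a rule whose action would leave the WM unchanged is treated as not
matching so that the loop terminates. All of these are assumptions, and the
names Rule and run are hypothetical.

```python
from typing import NamedTuple

class Rule(NamedTuple):
    name: str
    lhs: frozenset                    # assertions that must be present in the WM
    adds: frozenset = frozenset()     # assertions the RHS adds
    deletes: frozenset = frozenset()  # assertions the RHS deletes
    is_goal: bool = False

def run(rules, wm, max_cycles=100):
    """Repeat match-select-act; return (final WM, names of fired rules)."""
    wm = set(wm)
    fired = []
    for _ in range(max_cycles):
        # Match: LHS holds and firing would change the WM (or reach the goal).
        matching = [r for r in rules
                    if r.lhs <= wm
                    and ((r.adds - wm) or (r.deletes & wm) or r.is_goal)]
        if not matching:
            break
        rule = matching[0]            # Select: assumed criterion, first in list
        wm |= rule.adds               # Act: apply the RHS to the WM
        wm -= rule.deletes
        fired.append(rule.name)
        if rule.is_goal:
            break
    return wm, fired
```

On DADO the match step is what runs in parallel across the tree; the loop
structure itself is the same.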
PE's in DADO are connected in the form of a complete binary tree; their number
would be on the order of thousands using today's technology. Each PE contains its
own local memory (2K bytes in the current version), a specialized I/O switch
allowing global broadcast in addition to tree neighbor communications, and its own
processor. DADO is similar in many aspects to NON-VON. The most important
differences are the processor granularity and the mode in which each PE can
function. DADO has a "coarser granularity" than NON-VON; that is, its
"smallest" processing elements are based on more powerful processors and much
larger memories. Each DADO PE can operate in either of two modes. In the first
(SIMD mode), it executes instructions broadcast by some ancestor in the tree. In the
second, called MIMD mode, the PE executes instructions stored in its local memory.
In this mode, the PE is disconnected from its parent and can broadcast instructions
to its descendants, provided they are in SIMD mode.
In what follows we will briefly explain how a production system is executed on
DADO. In this discussion we assume productions whose LHS and RHS are
conjunctions of predicates in which all first order terms are composed of constants
and existentially quantified variables [Stol 82].
Figure 4-11: Functional Division of the DADO Tree
The tree machine is dynamically divided by the I/O switches into three
conceptually distinct components, as shown in Figure 4-11:
1. The PM-level consists of all PE's at a particular level within the tree
that are used to store the productions. Each production is stored in a
single PE at the PM level. The PM level in the tree is chosen to be the
lowest level having enough PE's to hold all productions.
2. The upper portion of the tree comprises all PE's above the PM level.
They are used primarily to synchronize the select and act phases of the
execution cycle.
3. The lower portion of the tree consists of all PE's found below the PM
level. Each subtree that is rooted by a PE in the PM level will store
that portion of the WM that is relevant to the production stored in this
PE. These subtrees are referred to as WM-subtrees.
The execution cycle of the production system consists of three phases:
1. The matching phase. In this phase all PE's at the PM level enter the
MIMD mode and simultaneously match the LHS of their productions
against the contents of their respective WM subtrees. The matching in
each subtree is performed associatively. After the matching phase, only
those PE's containing a ground literal that matches the LHS of the rule
are left enabled; all other PE's are disabled. Every PE at the PM level,
on finishing the matching operation, sets a flag to indicate that it has
finished. It also sets a certain word to a value that depends on whether
there was a match for the production, and on some criterion that assigns
a priority for the PE. The upper part of the tree computes the
conjunction of the flags set by the PM-level PE's, and when this
conjunction is true, the select phase starts.
2. The selection phase. The PM-level PE's are switched to SIMD mode at
the start of this phase. Each PE in DADO's PM level has a unique
identifying tag. The upper part of the DADO tree is then used to
compute the maximum value of the priorities assigned to PE's on the
PM level and to identify a PE having this value. This computation
requires O(log n) steps. The root processor uses the result of this
computation to enable that PM-level PE having the maximum priority
value. This selected PE then sends the RHS of the rule it contains to
the root processor.
3. The action phase. The whole tree is enabled again, and the action in the
RHS of the winning PE is executed. This typically involves either adding
items to or deleting items from the WM.
This cycle is repeated until a desired state is reached or until no more matchings
are found in the WM.
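The O(log n) cost of the selection phase comes from combining priorities
pairwise up the tree. The sketch below simulates that reduction level by level.
It assumes the number of PM-level PE's is a power of two and breaks ties in
favour of the larger tag, which is just a side effect of tuple comparison and
not something the text specifies; select_max is a hypothetical name.

```python
def select_max(priorities):
    """Find the winning (priority, tag) pair by pairwise comparison.

    priorities -- list of (priority, tag) pairs, one per PM-level PE;
                  the length is assumed to be a power of two.
    Returns (winner, steps), where steps is the number of tree levels used.
    """
    level = list(priorities)
    steps = 0
    while len(level) > 1:
        # Each parent keeps the larger of its two children's values.
        level = [max(level[i], level[i + 1]) for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps
```

With four PM-level PE's the winner is known after two steps, and with 1024
after ten.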
A DADO prototype incorporating 15 PE's is currently operational at Columbia
University. A larger prototype, DADO2, which comprises 1023 PE's, is under
construction, with each PE incorporating an Intel 8751 microcomputer chip and an
Intel 2168 8K x 8 RAM chip. A high-level language called PPL/M, a variant of Intel's
PL/M, will be used to implement DADO's software. The PE will be able to execute
the entire Intel 8751 instruction set in MIMD mode, and to broadcast the SIMD
instructions to its descendants. The SIMD instruction set is a superset of PL/M.
Two additions have been made to PL/M for programming the SIMD mode of
operation of DADO. The SLICE attribute defines a variable or a function that is
defined to be resident in each PE for which the declaration applies. The second
addition is a control construct, called the DO SIMD block, which defines
instructions to be broadcast to all SIMD PE's. For further details of the DADO
prototype the reader is referred to [Stol 82b].
4.6.2 A Tree Machine for Searching Problems 
This special purpose tree machine was proposed by Kung [Kung 1979301 to efficiently
search and maintain a file of fixed-format records. We must be able to insert a new
record in the file, delete an existing record, update records, and answer queries
about the file. One searching problem that is of interest is called member search. In
this problem we maintain a set of data elements, and the problem is to determine
if a specific element is a member of the set or not. In real applications, finding
the element is usually followed by other operations, such as reading information
associated with this element.
The architecture of Kung's searching machine is shown in Figure 4-12. The
machine consists of two back-to-back binary trees that share leaf nodes. The
machine is based on 16-bit instructions and 32-bit data words. There are three
kinds of nodes in the machine:
- Circle nodes at the non-leaf nodes of the upper tree, which are used to
broadcast a stream of data and/or instructions to the square nodes where
the execution of instructions takes place in parallel. The top data paths
(links from circle nodes) are 16 bits wide. A circle node must update an
internal counter in the case of insert and delete operations, and direct
the coming data and instruction to one of its two children while sending
a no-operation code to the other child in the case of insert operations.
- Square nodes at the common leaf nodes, which are used to hold the data
stored in the tree by placing each record in one of the square nodes.
Every square node is a very small von Neumann machine that receives
its instructions from an external 16-bit stream and sends 32 bits of result
data to its output port. Figure 4-13 shows the architecture of the square
node. The processor contains sixteen 32-bit words of memory, two 32-bit
general purpose registers (RA, RB), a vector of eight single bit flags
(F(0), F(1), ..., F(7)), a byte register for set identification (SetID), and an
instruction register. The processor can be disabled by resetting F(0).
The results of comparing RA and RB are stored in two flags (F(1), F(2)).
The square node may pass the incoming instruction to the triangle node
that is connected to it, if a single bit field in the instruction is reset.
- Triangle nodes at the internal nodes of the lower tree, which are used to
combine the results produced by the square nodes and produce
answers to queries. The triangle nodes operate on 80-bit packets
containing one 16-bit instruction and two 32-bit data elements coming
from the square nodes.
Figure 4-12: Structure of the Search Tree Machine
Data flows in one direction only, top to bottom, in a synchronous manner, with
operations performed and data transmitted to the next node during each major
cycle. Thus the flow of data and computations can be pipelined. There is also a
controller at the top of the input node to perform high level functions such as
loading the tree, creating and deleting sets, defining and handling records, etc. The
controller relieves the CPU of this overhead and lessens the traffic on the system
bus.
Figure 4-13: Components of the Square Node
We now describe how the member search problem is solved on this tree. Assuming
a set of n elements is stored in the machine, the algorithm starts by inserting the
value of the element we are looking for at the input node and broadcasting this
value to every square node of the tree through the circle nodes. After log n steps the
data element reaches all the square nodes. At that point each square node
compares this value with its own data element. If they are equal then the square
node sets a flag bit to one; otherwise, it sets the flag to zero. The bits are then
combined through the bottom part of the tree (triangle nodes) by letting each
triangle node compute the disjunction of its two inputs. After another log n steps a
single bit emerges from the output node telling whether the search was successful or
not. The total number of steps to solve the member search problem is thus 2 log n.
If the number of data elements that match the value being broadcast is of interest,
then the triangle nodes sum their two inputs instead of computing their disjunction.
In the same number of steps the machine obtains the number of matching elements
emerging from the output node. If the application calls for the list of records that
contain the matching data elements, then the triangle nodes compute the union of
their two inputs and send the result of this union in a pipelined fashion toward
the output node.
In general this tree machine can solve any problem that consists of computing some
function over every element in the set, and then combining the values of these
results by some associative, commutative binary operator.
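The general pattern (apply a function at every square node, then combine pairwise through the triangle nodes) can be sketched sequentially as follows. This is only an illustrative model, not the machine's instruction set: the name tree_query is hypothetical, a Python list stands in for the square nodes, and each pass of the loop stands in for one level of triangle nodes. The record set is assumed to be non-empty.

```python
from operator import add, or_


def tree_query(records, f, combine):
    """Apply f to every record, then combine the results level by level.

    Returns (answer, levels), where levels is the number of combining
    levels traversed (log2 n when n is a power of two).
    """
    level = [f(r) for r in records]          # work done in the square nodes
    levels = 0
    while len(level) > 1:                    # one pass per level of triangle nodes
        nxt = [combine(level[i], level[i + 1])
               for i in range(0, len(level) - 1, 2)]
        if len(level) % 2:                   # an unpaired value passes through
            nxt.append(level[-1])
        level = nxt
        levels += 1
    return level[0], levels


records = [5, 3, 9, 3, 7, 1, 3, 8]
found = tree_query(records, lambda r: r == 3, or_)                  # membership
count = tree_query(records, lambda r: int(r == 3), add)             # number of matches
matches = tree_query(records, lambda r: [r] if r == 3 else [], add) # list of matches
```

Swapping the combining operator is all that distinguishes the three queries, which mirrors the way the triangle nodes switch between disjunction, sum, and union in the description above.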
Other special machines that have the same architecture were proposed to perform
database operations [Song 80], sorting [Song 81], and the maintenance of priority
queues [Leis 79].
5. Algorithms on Tree Machines 
In this section we will discuss some of the algorithms that run on the tree machines
presented in this paper. These algorithms represent a subset of the parallel algorithms
that have attracted researchers' attention since the early sixties [Kung 79b]. A
parallel algorithm can be defined as a collection of independent task modules which
can be executed in parallel, and which communicate with each other during the
execution of the algorithm [Kung 79b].
We will adopt Kung's taxonomy of parallel algorithms [Kung 79b] for the case of
tree machine algorithms. In his taxonomy Kung classified parallel algorithms based
on their relation to parallel computer architectures. Three important attributes of
parallel algorithms were used for the classification. They are concurrency control,
module granularity, and communication geometry. Concurrency control enforces the
desired interaction between task modules to ensure correctness of the algorithm's
execution. Synchronous concurrency control is found in algorithms where all task
modules execute the same instruction broadcast by a central controller concurrently.
Distributed synchronous control corresponds to algorithms where coordination is
achieved by simple control mechanisms local to the task modules. Some algorithms
use asynchronous control via shared data to achieve concurrency control. Module
granularity is the maximal amount of computation performed by a task module
before it has to communicate with other modules. Communication geometry refers
to the topological layout of the network resulting from the wire connection of task
modules to perform the algorithm. The communication structure can be regular or
irregular. Trees represent one of the regular communication structures. Other
examples are one-dimensional and two-dimensional arrays.
Based on these attributes, we will divide the tree algorithms discussed in this
section into three categories:
- SIMD algorithms, where the concurrency control is synchronous and task
modules execute the same instruction simultaneously. The SIMD
notion refers to the corresponding computer architecture which executes
these algorithms efficiently.
- MIMD algorithms, where the task modules interact asynchronously. These
algorithms usually have large module granularity.
- Systolic algorithms, where the concurrency control is distributed and is
achieved by simple local control mechanisms.
This section is divided into three subsections, describing algorithms in the
three categories mentioned above. It is worth noting here that the tree machines
we discussed before are usually oriented towards efficiently executing algorithms in
one of these categories, because of the kind of concurrency control used. For
example, the Caltech tree is oriented toward MIMD algorithms
because its processors interact through messages. The same is true for the X-tree.
The NON-VON supercomputer executes SIMD and systolic algorithms efficiently
while MIMD algorithms must be adapted (where possible) to its central concurrency
control mechanism. The Stony Brook machine and the DADO machine can execute
a mixture of MIMD and SIMD algorithms because a processor in the tree can
disconnect itself and assume the role of the central controller for its
descendants.
5.1 SIMD Algorithms
In SIMD algorithms all PE's in the tree machine execute the same instruction,
broadcast by a central controller, concurrently on their own local data. Associative
algorithms constitute a subset of SIMD algorithms. Many associative algorithms for
solving different kinds of problems can be found in the literature. Examples include
[Bent 68], [Falk 62], [Esti 63], [Wesl 69], [Chen 79], [Finn 77], and [Shaw 80]. As
an example, this section presents a SIMD algorithm to find the transitive closure of
a directed graph [Shaw 83].
5.1.1 The Transitive Closure Algorithm (SIMD Version) 
The transitive closure of a directed graph (digraph) G = (V,E), where V is the set of
nodes in the graph and E is the set of arcs connecting the nodes, is defined as the
directed graph G* = (V,E*) where there is an arc from node u to node v in G* iff
there is a directed path from u to v in G. If the number of nodes in G is n, then
the digraph G can be represented as an n x n matrix known as the adjacency
matrix. If A is the adjacency matrix of the digraph G then a[i,j] = 1 if and only if there is
an arc going from node i to node j in G. Otherwise a[i,j] = 0. The number of edges
in G is greater than zero and is less than or equal to n^2. The same is true of the
number of edges in G*. There are several algorithms to find the transitive closure.
The sequential versions of some of these algorithms are mentioned in [Brow 79].
The widely known Floyd-Warshall algorithm takes O(n^3) steps on a sequential
machine to find the transitive closure of a digraph presented to the algorithm in
the form of its adjacency matrix. The best sequential time bound known at
present is O(n^(log 7) (log n)^2), where O(n^(log 7)) is the best time complexity known for
binary matrix multiplication.
The SIMD algorithm presented in this section is a parallelization of the Floyd-Warshall
algorithm. Warshall's algorithm is as follows:
for k := 1 to n do
    for i := 1 to n do
        for j := 1 to n do
            a[i,j] := a[i,j] or (a[i,k] and a[k,j]);
In Warshall's algorithm, newly created arcs affect the creation of new arcs in the
closure. The creation of a new arc results from comparing two existing arcs in the
closure found so far. The SIMD algorithm for computing the transitive closure of a
digraph has a time complexity of O(n^2).
In the SIMD algorithm, each PE in the tree is used to hold a boolean value of the
adjacency matrix. Each PE also holds a pair of integers representing the initial and
terminal endpoints of the edge held by the PE. Thus n^2 PE's are used in the tree,
each containing a boolean value (exists[*,*]). In each of n iterations, for some fixed
k, all of the boolean values given by exists[*,k] and exists[k,*] are broadcast into the
tree by the central controller in O(n) time. Note that exists[*,*] is initially equal to
the value of a[*,*] in the adjacency matrix of G. The boolean values exists[*,k] and
exists[k,*] are reported to the central controller before being broadcast throughout
the tree. This ensures the use of their most recently updated values. During each
of the n steps, PE(i,j) listens for exactly two booleans, namely exists[i,k] and
exists[k,j]. If these two values are true then exists[i,j] is set to true; otherwise it
keeps its previous value. The algorithm is formally described as follows [Shaw 83]:
For k := 1 to n Do
Begin
    For i := 1 to n Do
        Begin read exists(i,k);
              broadcast exists(i,k)
        End;
    For j := 1 to n Do
        Begin read exists(k,j);
              broadcast exists(k,j)
        End;
    exists(i,j) := exists(i,j) OR ( exists(i,k) AND exists(k,j) )
End
Note that this parallel version of the Floyd-Warshall algorithm has time complexity
O(n) for each of the n iterations, and that it is limited by the I/O time. The
interested reader is referred to [Shaw 83] for similar SIMD algorithms to compute
the product of two matrices and all-pairs shortest paths.
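The relationship between the sequential and SIMD formulations can be checked with a small sequential simulation. This is a minimal sketch under stated assumptions: the function names warshall and simd_closure are hypothetical, a nested Python list stands in for the n^2 PE's, and the simultaneous update of all PE's is modeled by rebuilding the whole matrix from the broadcast column and row. It does not model the tree, the controller, or the O(n) broadcast time.

```python
def warshall(a):
    """Sequential Warshall closure of a boolean adjacency matrix."""
    n = len(a)
    c = [row[:] for row in a]
    for k in range(n):
        for i in range(n):
            for j in range(n):
                c[i][j] = c[i][j] or (c[i][k] and c[k][j])
    return c


def simd_closure(a):
    """Simulate the SIMD version: one broadcast phase and one update per k."""
    n = len(a)
    exists = [row[:] for row in a]            # one boolean per PE(i, j)
    for k in range(n):
        # The controller reads and broadcasts exists[*, k] and exists[k, *].
        col_k = [exists[i][k] for i in range(n)]
        row_k = [exists[k][j] for j in range(n)]
        # Every PE(i, j) performs the same update on its own local value.
        exists = [[exists[i][j] or (col_k[i] and row_k[j]) for j in range(n)]
                  for i in range(n)]
    return exists


graph = [[False, True, False],
         [False, False, True],
         [False, False, False]]
closure = simd_closure(graph)
```

The two functions agree because row k and column k are not changed during iteration k, so updating all entries at once gives the same result as the in-place triple loop.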
5.2 MIMD Algorithms
In this subsection we present an algorithm to find the transitive closure of a
digraph on a MIMD machine, for purposes of comparison with the SIMD algorithm
for transitive closure. Other MIMD algorithms on tree machines can be found in
[Brow 79], [Brow 80], and [Harr 79].
5.2.1 The Transitive Closure Algorithm (MIMD Version)
We will describe the MIMD algorithm for computing the transitive closure of a
directed graph as implemented on the Caltech MIMD tree machine [Brow 79]. As
mentioned in the section describing the Caltech tree, it is possible to write
algorithms that assume arbitrary fanout of the tree node processors. A MAP
program will translate these algorithms to the physical binary tree machine. The
tree used to find the transitive closure consists of three levels. The first level
contains the root processor (the closureRoot processor) with a fanout of n, where n is
the number of nodes in the graph. The second level has n processors (node
processors), each with a fanout of n. Each node in this level represents a node in
the graph. The third level is the leaf level and contains n^2 processors
(the endNode processors). The connection between a node and an endNode
represents a potential arc in the closure. Each endNode processor represents an arc
from its parent to the node it represents. The endNode processor contains a flag
that indicates whether the arc connecting it to its parent is part of the closure.
With each arc added to the closure more arcs are broadcast around the tree and
new arcs are formed. The algorithm terminates when no new arcs are formed in the
tree. Each of the three tree node types executes different code, and they
communicate with each other using messages. In the following we will describe the
code executed by each node type.
The closureRoot processor begins by initializing the tree. Each node is assigned a
node number. The closureRoot is connected to an external system bus that provides
the arcs of the directed graph. The newly formed arcs in the closure are also sent
out on this bus. The closureRoot processor is responsible for rebroadcasting newly
formed arcs in the closure to all its descendants. The node processor, upon
receiving the message from the root assigning it a node number myY, starts the
arc-processing phase. Upon receiving the starting and terminal points (i,j) of an arc from
the closureRoot processor, the node processor compares the starting and terminal
points of this arc to its node number. The node processor creates a new arc by
turning on the appropriate descendant if one of two conditions is true. The first
condition is that the starting point is equal to the node number. This takes care
of the arcs in the initial graph. The second condition is that there is an existing
arc from the node to the starting point. In this case, the arc (myY,j) is created. This
second condition takes care of Warshall's comparison. The endNode processors are
used to store a boolean value indicating whether there is an arc from their parents
to them. If the boolean values were to be stored in the node processors, then the
algorithm would be dependent on the problem's size. Newly created arcs are
broadcast throughout the tree through the closureRoot processor. There are other
operations performed by these types of processors to ensure that all arcs are being
processed.
The algorithm as described above requires time proportional to the number of arcs
in the closure. Thus, the time complexity of the algorithm is O(n^2), limited by the
time needed to read out the arcs of the closure. The number of processors used in
this algorithm is 2n^2 - 1.
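The arc-broadcast scheme can be illustrated with a sequential simulation. This is a sketch under several assumptions, not the code run on the Caltech machine: the function name mimd_closure is hypothetical, a single queue stands in for messages passing through the closureRoot, and nodes are numbered from 0. The text leaves the "other operations" that guarantee every arc is processed unspecified; the third rule below (the node at the terminal point extends the broadcast arc with its own outgoing arcs) is an assumed stand-in for them, added so that the result does not depend on the order in which arcs arrive.

```python
from collections import deque


def mimd_closure(n, arcs):
    """Sequentially simulate the arc-broadcast closure for nodes 0..n-1."""
    # flag[y][j] stands for the boolean held by endNode j under node processor y.
    flag = [[False] * n for _ in range(n)]
    pending = deque(arcs)   # arcs waiting at the closureRoot to be broadcast
    sent = set()            # arcs already broadcast and sent out on the bus

    def create(y, j):
        # Turn on endNode (y, j) and hand the new arc to the root for rebroadcast.
        if not flag[y][j]:
            flag[y][j] = True
            pending.append((y, j))

    while pending:          # stop when no new arcs are formed
        i, j = pending.popleft()
        if (i, j) in sent:
            continue
        sent.add((i, j))
        flag[i][j] = True   # condition 1: node i matches the starting point
        for y in range(n):  # condition 2: node y already has an arc to i
            if flag[y][i]:
                create(y, j)
        for c in range(n):  # ASSUMED extra rule: node j extends the arc by its own arcs
            if flag[j][c]:
                create(i, c)
    return sorted(sent)


chain = mimd_closure(3, [(0, 1), (1, 2)])
```

Each arc of the closure is broadcast once, which matches the observation that the running time is proportional to the number of arcs in the closure.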
5.3 Systolic Algorithms 
In this section, we present a systolic algorithm proposed by Leiserson [Leis 79b] to
implement a priority queue on a binary tree. Other systolic algorithms on tree
machines can be found in [Song 79], [Song 81], and [Kung 79b].
5.3.1 The Systolic Priority Queue Algorithm 
A priority queue is a data structure that allows us to insert records into a set and
at any time to retrieve from the set the record having the smallest key according
to some ordering [Leis 79b]. The systolic tree that implements the priority queue is
a binary tree where PE's rhythmically compute and pass data among themselves.
The two operations of insertion and extraction can be pipelined. Thus, the systolic
tree captures the concept of pipelining in addition to parallelism. Each PE in the
binary tree holds two records as shown in Figure 5-1. A record with key
equal to ∞, where ∞ is larger than any possible value for the key, represents an empty
record. The two records stored in each PE (left record and right record) are such
that the key of the left record is smaller than any key found in the subtree rooted
at this PE, and the key of the right record is greater than any key in the
same subtree. Thus, the record with the smallest key is the left record found in the
root of the systolic tree.
Figure 5-1: The Systolic Tree
In insertion, each PE looks at the incoming key and the values of the keys in its
two children to decide what to do. The action taken by the PE can be one of the
following:
- If the input key is smaller than the key of the left record, it will send
the left record to its left child and replace it by the incoming record.
- If the input key is larger than the left key and is smaller than the right
key, we have three subcases: if the incoming key is smaller than the
right key of the left child, the input record will be sent to the left child.
If the incoming key is larger than the right key of the right child, the
right record will be sent to the parent and the incoming record will
replace it. Otherwise, the incoming record will be directed to the right
child.
- In the case of leaf PE's the incoming key is compared with the two
keys, and the largest of the three will be sent to the parent. The other
two are placed in such a way that the smaller will be in the left record.
Extracting an element from the queue is analogous to reading the left record of
the root and inserting an empty record.
The time complexity of the insertion algorithm is O(log n), which is the depth of
the tree. Multiple priority queues can be implemented by having every key consist
of two parts, the queue number Q and the key of the record a. In this case, a key
<Q,a> is treated as less than <Q',a'> if Q < Q', or Q = Q' and a < a'.
Figure 5-1 illustrates this case.
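The composite-key ordering can be checked with a short sketch. It uses Python's standard heapq module, an ordinary sequential binary heap, purely as a stand-in for the systolic tree; it does not model the PE's or the pipelining. The function name key_less and the sample keys are hypothetical.

```python
import heapq


def key_less(k1, k2):
    """Return True if key <Q, a> precedes key <Q', a'> under the stated rule."""
    (q1, a1), (q2, a2) = k1, k2
    return q1 < q2 or (q1 == q2 and a1 < a2)


# Python tuples compare lexicographically, which is the same ordering,
# so (queue number, record key) pairs can be stored in one heap directly.
heap = []
for queue_no, record_key in [(2, 5), (1, 9), (1, 3), (2, 1)]:
    heapq.heappush(heap, (queue_no, record_key))

order = [heapq.heappop(heap) for _ in range(len(heap))]
```

Under this ordering every record of a lower-numbered queue is extracted before any record of a higher-numbered queue, and records within one queue come out in key order.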
Acknowledgments 
I wish to acknowledge the excellent criticism of my advisor, Professor D. E. Shaw,
the valuable suggestions of Professors S. Stolfo, T. Bashkow, G. Maguire, and my













References
Aho, A. V., Hopcroft, J. E. and Ullman, J. D.
The Design and Analysis of Computer Algorithms.
Addison-Wesley, 1974.
Backus, J.
Can Programming be Liberated from the von Neumann Style? A
Functional Style and its Algebra of Programs.
Communications of ACM 21(8), August, 1978.
Banerjee, J., Hsiao, D. K. and Krishnamurthi.
DBC - A Database Computer for Very Large Databases.
IEEE Transactions on Computers 28(6), June, 1979.
Bentley, C.
Path Finding with Associative Memory.
IEEE Transactions on Computers 17(7), July, 1968.
Bentley, J. and Kung, H. T.
A Tree Machine for Searching Problems.
Technical Report, Carnegie Mellon University, August, 1979.
Bentley, J. L.
A Parallel Algorithm for Constructing Minimum Spanning Trees.
Technical Report, Computer Science Department, Carnegie Mellon
University, August, 1979.
Berkling, K. J.
A Computing Machine Based on Tree Structures.
IEEE Transactions on Computers 20(4), April, 1971.
Bhatt, D. and Smith, D. R.
Communication in a Hierarchical Multicomputer.
In The Proceedings of IEEE 1st International Conference on Distributed
Computing Systems. 1979.
Bhatt, D., et al.
An Operating System Kernel for a Hierarchical Multicomputer.
In The Proceedings of IEEE Fall Computer Conference, pages
665-672. 1980.
Brent, R. P. and Kung, H. T.
On the Area of Binary Tree Layout.
Technical Report, Computer Science Department, Carnegie Mellon













Computations on a Tree of Processors.
In The Proceedings of the First Caltech Conference on VLSI. January,
1979.
Browning, S.
The Tree Machine: A Highly Concurrent Computing Environment.
PhD thesis, California Institute of Technology, January, 1980.
Browning, S. and Seitz, C.
Communications in a Tree Machine.
In The Proceedings of The Second Caltech Conference on VLSI.
January, 1981.
Chen, I-N, Chen, P. Y. and Feng, T.
Associative Processing of Network Flow Problems.
IEEE Transactions on Computers 28(3), March, 1979.
Despain, A. M. and Patterson, D. A.
X-Tree: A Tree Structured Multi-Processor Computer Architecture.
In Proceedings of the 5th Symposium on Computer Architecture. April,
1978.
Estin, G. and Fuller, R. H.
Some Applications for Content Addressable Memories.
In The Proceedings of the Fall Joint Computer Conference, pages
495-508. 1963.
Fahlman, S. E.
The Hashnet Interconnection Scheme.
Technical Report, Computer Science Department, Carnegie Mellon
University, June, 1980.
Falkoff, A. D.
Algorithms for Parallel Search Memories.
Journal of ACM 9(10), October, 1962.
Finnila, C. A. and Love, H. Jr.
The Associative Linear Array Processor.
IEEE Transactions on Computers 26(2), February, 1977.
Fisher, M. J. and Paterson, M. S.
Optimal Tree Layout.













Flynn, M. J.
Some Computer Organizations and Their Effectiveness.
IEEE Transactions on Computers 21(9), September, 1972.
Foster, C. C.
Content Addressable Parallel Processors.
Van Nostrand Reinhold, 1976.
Fuller, R. H.
Associative Parallel Processing.
In The Proceedings of the Spring Joint Computer Conference, pages
471-475. 1967.
Gilmore, P. A.
Numerical Solution of PDE by Associative Processing.
In Proceedings of AFIPS, Fall Joint Conference, pages 411-418.
1971.
Goodman, J. R. and Sequin, C. H.
Hypertree, a Multiprocessor Interconnection Topology.
IEEE Transactions on Computers 30(12), December, 1981.
Harris, J. A. and Smith, D. R.
Simulation Experiments of a Tree Organized Machine.
In The Proceedings of IEEE Parallel Processing. 1979.
Hillis, W. D.
The Connection Machine.
Technical Memo, M.I.T. Artificial Intelligence Lab, September,
1981.
Hoare, C. A. R.
Communicating Sequential Processes.
Communications of ACM 21(8), August, 1978.
Horowitz, E. and Zorat, A.
A Divide and Conquer Computer.
Technical Report, University of Southern California, July, 1979.
Horowitz, E. and Zorat, A.
The Binary Tree as an Interconnection Network: Applications to
Multiprocessor Systems and VLSI.











Kieburtz, R. B.
A Hierarchical Multicomputer for Problem Solving by
Decomposition.
In The Proceedings of IEEE 1st International Conference on Distributed
Computing Systems. 1979.
Kieburtz, R. B.
A Distributed Operating System for the Stony Brook Multicomputer.
In The Proceedings of IEEE 2nd International Conference on
Distributed Computing Systems. April, 1981.
Knuth, D. E.
The Art of Computer Programming.
Addison Wesley, 1973.
Kuck, D. J.
A Survey of Parallel Machine Organization and Programming.
Computing Surveys 9(1), March, 1977.
Kung, H. T.
The Structure of Parallel Algorithms.
Technical Report, Computer Science Department, Carnegie Mellon
University, August, 1979.
Kung, H. T.
Let's Design Algorithms for VLSI Systems.
In The Proceedings of The First Caltech Conference on VLSI.
January, 1979.
Lea, R. M.
Associative Processing of Non-Numeric Information.
Reidel Publishing Company, Dordrecht, Holland, 1977.
Lee, C. Y. and Paull, M. C.
A Content Addressable Distributed Logic Memory With Applications
to Information Retrieval.
In The Proceedings of IEEE, pages 924-932. June, 1963.
Leiserson, C. E.
Area Efficient Graph Layout for VLSI.
Technical Report, Computer Science Department, Carnegie Mellon














Leiserson, C. E.
Systolic Priority Queues.
In The Proceedings of Caltech First Conference on VLSI. January,
1979.
Leiserson, C. E.
Area Efficient VLSI Computation.
PhD thesis, Carnegie Mellon University, October, 1981.
Lewin, D.
Introduction to Associative Processors.
Reidel Publishing Company, Dordrecht, Holland, 1977.
Lin, C. S.
Sorting with Associative Secondary Storage Devices.
In The Proceedings of the National Computer Conference. 1977.
Lipovski, G. J.
The Architecture of a Large Associative Processor.
In The Proceedings of the Spring Joint Computer Conference. 1970.
Mead, C. and Conway, L.
Introduction to VLSI Systems.
Addison Wesley, 1979.
Mead, C. A. and Rem, M.
Cost and Performance of VLSI Computing Structures.
IEEE Transactions on Solid State Circuits 14(2), April, 1979.
Patterson, D. A., Fehr, E. S. and Sequin, C. H.
Design Considerations for the VLSI Processor of X-Tree.
In The Proceedings of the 6th Annual Symposium on Computer
Architecture. April, 1979.
Potter, J. L.
Image Processing on the Massively Parallel Processor.
IEEE Computer Magazine 16(1), January, 1983.
Rem, M.
Mathematical Aspects of VLSI Design.
In The Proceedings of Caltech First Conference on VLSI. January,
1979.
Schwartz, J. T.
Ultracomputers.












Seitz, C. L.
Self Timed VLSI Systems.
In The Proceedings of the First Caltech Conference on VLSI. January,
1979.
Sequin, C. H., Despain, A. M. and Patterson, D. A.
Communication in X-Tree, a Modular Multiprocessor System.
In The Proceedings of The Annual Conference of ACM, Washington
D.C. December, 1978.
Sequin, C. H.
Single Chip Computers, The New VLSI Building Blocks.
In The Proceedings of the First Caltech Conference on VLSI. January,
1979.
Sequin, C. H. and Fujimoto, R. M.
X-Tree and Y-Components.
Technical Report, University of California at Berkeley, October,
1982.
Shaw, D. E.
A Relational Database Machine Architecture.
Technical Report, Computer Science Department, Stanford
University, October, 1979.
Shaw, D. E.
Knowledge-Based Retrieval on a Relational Database Machine.
PhD thesis, Stanford University, Computer Science Department,
August, 1980.
Shaw, D. E., Stolfo, S. J., Ibrahim, H., Wiederhold, G., Hillyer,
B. and Andrews, J. A.
The NON-VON Database Machine: A Brief Overview.
IEEE Computer Society Technical Committee, Quarterly Report 4(2),
December, 1981.
Shaw, D. E.
The NON-VON Supercomputer.
Technical Report, Computer Science Department, Columbia
University, August, 1982.
Shaw, D. E. and Hillyer, B. K.
Allocation and Manipulation of Records in the NON-VON Supercomputer.
Technical Report, Computer Science Department, Columbia









Shin, K. G., Lee, Y. H., and Sasidher, J.
Design of HM2p - A Hierarchical Multiprocessor for General
Purpose Applications.
IEEE Transactions on Computers 31(11), November, 1982.
Snyder, L.
Programming Processor Interconnection Structures.
Technical Report, Department of Computer Science, Purdue
University, October, 1981.
Song, S. W.
A Highly Concurrent Tree Machine for Database Applications.
Technical Report, Computer Science Department, Carnegie Mellon
University, August, 1979.
Song, S. W.
I/O Complexity and Design of Special Hardware for Sorting.
Technical Report, Computer Science Department, Carnegie Mellon
University, February, 1981.
Sternberg, S. R.
Biomedical Image Processing.
IEEE Computer Magazine 16(1), January, 1983.
Stillman, N. J., Defiore, C. R. and Berra, P. B.
Associative Processing of Line Drawings.
In The Proceedings of the Spring Joint Computer Conference. 1971.
Stolfo, S. J. and Shaw, D. E.
DADO: A Tree-structured Machine Architecture for Production
Systems.
In The Proceedings of the 2nd National Conference on Artificial
Intelligence. August, 1982.
Stolfo, S. J., Miranker, D., and Shaw, D. E.
Programming The DADO Machine: An Introduction to PPL/M.
Technical Report, Columbia University, November, 1982.
Stone, H.
Parallel Processing With the Perfect Shuffle.
IEEE Transactions on Computers 20(2), February, 1971.
Sutherland, I. E. and Mead, C. A.
Microelectronics and Computer Science.









Thompson, C. D.
A Complexity Theory for VLSI.
PhD thesis, Carnegie Mellon University, August, 1980.
Thurber, K. J. and Wald, L. D.
Associative and Parallel Processors.
Computing Surveys 7(4), December, 1975.
Tolle, D. M. and Siddall, W. E.
On the Complexity of Vector Computation in Binary Tree
Machines.
Information Processing Letters 13(3), December, 1981.
Tolle, D. M.
Coordination of Computation in a Binary Tree of Processors.
PhD thesis, University of North Carolina at Chapel Hill, 1981.
Tolle, D. M.
Implanting FFP Trees in Binary Trees: An Architectural Proposal.
In The Proceedings of the Conference on Functional Programming
Languages and Computer Architecture, pages 115-122. October,
1981.
Wesley, M. A.
Associative Parallel Processing for the FFT.
IEEE Transactions on Audio and Electroacoustics 17(2), June, 1969.
Yau, S. S. and Fung, H. S.
Associative Processor Architecture.
Computing Surveys 9(1), March, 1977.
