The NON-VON Supercomputer by Shaw, David Elliot
Abstract 
The NON-VON Supercomputer¹
David Elliot Shaw 




NON-VON is a highly parallel, non-von Neumann "supercomputer", portions of
which are now being implemented in the Computer Science Department at Columbia
University. The machine is intended to support the extremely rapid execution
of large scale data manipulation tasks, including relational database
operations and many other functions relevant to commercial data processing.
The NON-VON architecture includes a tree-structured Primary Processing
Subsystem (PPS), which we are implementing using custom nMOS VLSI circuits,
along with a Secondary Processing Subsystem (SPS) based on a bank of
intelligent disk drives. A high-bandwidth parallel interface provides for
rapid data transfer between the two subsystems. This paper describes the
organization of the NON-VON machine, with particular emphasis on the structure
and function of the PPS. Some of the most important NON-VON programming
techniques are then outlined, and their application to typical data processing
applications illustrated with simple examples.
¹ This research was supported in part by the Defense Advanced Research
Projects Agency under contract N00039-82-C-0427.
Table of Contents 
1 Introduction
1.1 Project History and Current Status
1.2 Comparison with von Neumann Machines
2 Organization of the NON-VON Machine
2.1 System Organization
2.2 The Primary Processing Subsystem
2.3 Topological Considerations
2.4 The Processing Element
3 Programming NON-VON
3.1 The PE Instruction Set
3.2 The "Intelligent Record" Metaphor
3.3 Associative Operations on the NON-VON Machine
3.4 Packed and Spanned Records

















List of Figures 
Figure 1: Organization of the NON-VON Machine
Figure 2: Interconnection of Two Leiserson Chips
Figure 3: The PPS Printed Circuit Board (Leiserson Layout)
Figure 4: Hyper-H Embedding of the Binary Tree
Figure 5: Inorder Embedding of the Linear Array
Figure 6: Bounded Neighborhood Embedding of the Linear Array
Figure 7: Block Diagram of the Processing Element
Figure 8: Routing of an N-Bit Data Bus through a 90-Degree Turn
Figure 9: Linear Allocation of Spanned Records




The efforts of a number of individuals are reflected in the research reported
in this paper. In particular, the author wishes to acknowledge the 
contributions of his faculty co-investigators, Professors Salvatore J. Stolfo, 
Zvi M. Kedem, and Michael Lebowitz, and of the eight gifted and zealous Ph.D. 
students who form the central core of the NON-VON Project. Specifically, much 
of the design and VLSI layout of the PPS processing element is due to efforts 
of Hussein Ibrahim and Dan Miranker, aided by the critical insights of Sanjiv 
Sharma. Dan Miranker and Dayton Clark were responsible for a large part of 
the implementation of the NON-VON Simulator, while Bruce Hillyer made 
significant contributions involving fault tolerance and testing, record 
allocation, high-level languages, and parallel algorithms. Steve Taylor has 
played a major role in both hardware design and translator development, while
Dong Choi and Yoram Eisenstadter have been recent participants in the
implementation of database management software for the NON-VON machine. 
The "real-world" expertise of our project engineers, Ted Sabety and Shun Ueda, 
has been critical to the success of our integrated circuit and system design
efforts. Substantial contributions to the theory, architecture, design,
implementation, simulation, and programming of NON-VON have also been made by 
Bob Floyd, Don Knuth, Gio Wiederhold and Terry Winograd, all of the Stanford 
Computer Science Department, and by Dave Bacon, Peter Brajak, Lincoln Hu, 
Kevin Kalajan, Stuart Kreitman, Ted Markowitz, Reynaldo Newman, Terry Newton,
Alessandro Piol, Arthur Sun, Danny Sykora, and Michael Weisberg, at Columbia. 
Finally, Jerry Wiener deserves special recognition for his role as 
administrator of the NON-VON project. The contributions of each of these 
individuals are gratefully acknowledged. 
1 Introduction
Two observations regarding the evolution of computer systems have, during the 
past decade or so, become so commonplace as to require little discussion. 
First, the cost of digital hardware has dropped to the point where, in many
applications, processors need no longer be considered a scarce resource.
Second, the cost of computer software is increasing, both in absolute terms,
and even more dramatically, by comparison with that of the hardware on which
it executes. 
The design of highly parallel machines is more commonly associated with the
first of these observations than the second. Indeed, the availability of
large numbers of inexpensive processing elements quite naturally suggests the
possibility of constructing highly concurrent systems capable of very rapid
execution. The NON-VON machine, which incorporates a large number (between
100,000 and 1,000,000, within the target time frame) of unusually simple 
processors, is one of the most ambitious proposals to date for the realization 
of very large scale parallelism using current integrated circuit technology. 
It should be emphasized, however, that issues related to software are as 
central to the goals of the NON-VON project as is the achievement of 
unprecedented processing power. 
NON-VON was designed to apply computational parallelism on a rather massive
scale to a large share of the information processing functions now performed
by digital computers. In particular, highly efficient support is provided for
the kinds of operations which seem to characterize much of the workload
involved in commercial database management and data processing applications.
This paper describes the architecture of the NON-VON machine and illustrates
the manner in which it achieves such a high degree of parallelism.
The paper is divided into three sections. The current section briefly reviews
the history and current status of the NON-VON Project, and provides an
informal comparison between the essential elements of a conventional computer
system and the analogous components of the NON-VON machine. In the second
section, NON-VON's physical organization is described at several levels. The
final section describes the instruction set of the NON-VON Processing Element
(PE), and introduces some of the most important paradigms for the
implementation of NON-VON software.
1.1 Project History and Current Status
The theoretical basis for the NON-VON machine was established in the course of 
a doctoral research project at Stanford University [16], [17]. Asymptotic
improvements in the evaluation of a number of relational database operations
were reported. These results employed a highly general technique known as
hash partitioning, by which many large-scale data processing operations having
O(n log n) time complexity on a von Neumann machine may be implemented in
linear time on a different type of machine which has the same hardware 
complexity. The interested reader is referred to these earlier results for a 
rigorous analysis of the complexity of algorithms to which the current paper 
will make only casual reference. 
Detailed design of the NON-VON hardware began in the latter part of 1981, and 
has gained momentum since that time. Major funding for the NON-VON project
has recently been obtained from the Defense Advanced Research Projects Agency,
supporting the implementation at Columbia University of certain key elements
of an initial prototype, which we have come to call NON-VON 1. These elements
employ custom-designed nMOS VLSI circuits, which are to be fabricated remotely
using DARPA's "silicon brokerage" system, MOSIS. As of August, 1982, a
preliminary data path for the NON-VON 1 Processing Element (to be described
shortly) has been laid out in nMOS VLSI, simulated and debugged at the logic
level, and mechanically checked for design rule violations. Portions of these
designs have recently been submitted for fabrication.
The development of software for the NON-VON machine has proceeded in parallel 
with our hardware implementation efforts. A simulator for the NON-VON 
instruction set was implemented in the fall of 1981, and has since been 
enhanced to provide a user-convenient vehicle for the development of NON-VON 
software. Higher-level linguistic constructs have been implemented as part of 
an evolving LISP-based programming environment, and compilers for two parallel 
languages, modelled after Pascal and APL, are now under development. About 
twenty individuals have thus far written NON-VON programs, and have tested 
their execution using the instruction set simulator. While no large-scale
applications have yet been implemented, our experience with this modest corpus
of simple NON-VON programs has already led to several minor refinements of the
architecture and instruction set. 
1.2 Comparison with von Neumann Machines
If pressed to identify a single principle underlying the essential
"philosophy" of the NON-VON architecture, we would probably choose to
highlight the strategy of extensively intermingling processing and storage 
resources. This strategy is employed at several levels within the NON-VON 
machine, and is perhaps best appreciated by contrast with the organization of 
a conventional computer system. 
In an ordinary von Neumann machine, a single (often quite powerful) central 
processing unit is connected to a single (often quite large) random access
memory, which is used for the storage of both programs and data. The CPU and
RAM communicate in a serial (or at best, weakly parallel) fashion through a







Moreover, the limitations of this organization are becoming more serious as
technological progress increases both the potential power of processing
hardware and the realizable size of computer memories.
In the NON-VON Primary Processing Subsystem (PPS), on the other hand, a large
number of very simple, highly area-efficient processing elements (PE's) are,
in effect, distributed throughout the memory. In particular, each integrated
circuit in the PPS contains a number of PE's (eight, in our planned prototype
version, which is based on typical 1982 nMOS device dimensions and die sizes).
Each PE is associated with a small amount of locally accessible random access
memory (64 bytes, in NON-VON 1). The potential processor/memory bandwidth in
NON-VON is thus many orders of magnitude higher than in conventional machines.
In practice, many or all of these tiny PE's are often able to operate
concurrently on data stored in their respective local memories, supporting
effective execution speeds far exceeding those of today's fastest
supercomputers. Because of their small size, however, the PPS is expected to
be scarcely more expensive than an equivalent amount of ordinary random access
memory. (Specifically, we estimate that a NON-VON PE might occupy as little
as twice the area that would be required for the amount of RAM it would
incorporate.) From the viewpoint of performance, the PPS may thus be regarded
as an ultra-high-speed parallel processing ensemble; from a cost perspective,
though, it is better viewed as a (slightly overpriced) random access memory
unit.
A similar comparison between the mass storage facilities of a conventional
computer system and the analogous subsystem within the NON-VON machine may
also prove instructive. In the typical large-scale data processing system, a
large bank of disk drives is charged with the task of responding "mindlessly"
to a sequence of requests for data posed by the CPU. In practice, most of
this data in fact proves irrelevant to the task at hand. The secondary
storage subsystem -- a husky and obedient, but rather dim-witted brute -- is
generally incapable of separating wheat from chaff, and must pass both along
to its more intelligent master.
As in the case of the von Neumann bottleneck, the pathway between the
"thinking part" and the "remembering part" of such a system is a relatively
narrow one, even in the most sophisticated contemporary systems. While a
modest degree of parallelism is sometimes achieved in the disk-to-computer
interface, the process of transferring data between primary and secondary
processing hardware remains, for the most part, an essentially sequential
function.
In the NON-VON Secondary Processing Subsystem (SPS), on the other hand, a
small amount of processing hardware is associated with each disk head. This
hardware allows records to be inspected "on the fly" to determine whether a
given record is relevant to the operation at hand. The NON-VON SPS is thus
able to be more discriminating in the data it passes along to the primary
processing hardware. Furthermore, the topology of the PPS supports a
massively parallel interface between primary and secondary storage, allowing
data transfers between the subsystems to keep pace with the greatly
accelerated execution possible within the PPS. In short, the SPS is able to
"filter" data before it is sent to the PPS, and to transfer the "filtrate" in




2 Organization of the NON-VON Machine
In this section, we describe the physical structure of the NON-VON machine. 
The top-level organization of the system is outlined in the first subsection. 
Our principal concern in this paper, however, will be with the Primary
Processing Subsystem, which is described in more detail in the second 
subsection. In the third subsection, we discuss certain topological 
considerations that influenced the design of the PPS. The section concludes 
with a detailed description of the individual processing elements from which 
the NON-VON PPS is constructed. 
2.1 System Organization 
The top-level organization of the NON-VON machine is illustrated in Figure 1.
The PPS is configured as a binary tree of processing elements. By dynamically
altering certain switch settings within the PE's, however, the subsystem can
be reconfigured to provide for linear, tree-structured or global bus
communication. With the exception of minor differences in the "leaf nodes",
each PE is laid out identically, and comprises a small random access memory, a
modest amount of processing logic, and an I/O switch supporting the various
modes of inter-PE communication.
At the root of the tree is a von Neumann machine called the Control Processor 
(CP), which is responsible for coordinating various activities within the PPS. 
In a production version of the NON-VON machine, the CP would in fact be 
specialized in several respects to optimize its performance as a controller 
for the PPS. In the context of this paper, however, the CP may be thought of 
as a conventional single instruction stream, single data stream (SISD) 
computer. While certain sequences of instructions are executed sequentially
within the CP, it is also capable of broadcasting instructions to be 










simultaneously executed by all enabled PE's in the tree on a single
instruction stream, multiple data stream (SIMD) basis [6].
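The SIMD mode of operation just described can be pictured with a small Python sketch. It is only an illustration under stated assumptions: the `PE` class, its `enabled` flag, the `broadcast` helper and the two sample instructions are hypothetical names, and the way a PE comes to be enabled or disabled is assumed here, since the actual PE instruction set is the subject of Section 3.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class PE:
    """Hypothetical model of one PE: a tiny local memory plus an enable flag."""
    memory: Dict[str, int] = field(default_factory=dict)
    enabled: bool = True


def broadcast(pes: List[PE], instruction: Callable[[PE], None]) -> int:
    """Apply one broadcast instruction to every enabled PE.

    The loop only simulates what would be simultaneous execution in hardware.
    Returns the number of PEs that executed the instruction.
    """
    executed = 0
    for pe in pes:
        if pe.enabled:
            instruction(pe)
            executed += 1
    return executed


def enable_if_positive(pe: PE) -> None:
    # Assumed instruction: a PE stays enabled only if its local value is positive.
    pe.enabled = pe.memory["x"] > 0


def double_x(pe: PE) -> None:
    # Assumed instruction: each enabled PE doubles its own local value.
    pe.memory["x"] *= 2


pes = [PE({"x": v}) for v in (3, -1, 0, 7)]
broadcast(pes, enable_if_positive)   # all four PEs execute the test
count = broadcast(pes, double_x)     # only the PEs left enabled execute this
values = [pe.memory["x"] for pe in pes]
```

In this toy run the same instruction stream reaches every PE, but only the two PEs holding positive values act on the second instruction, which is the sense in which a single instruction stream operates on multiple data streams.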
The SPS is based on a number of rotating storage devices, which might in
practice be realized using either slightly modified multiple-head disk drives
or unmodified single-head drives. Associated with each disk head in the SPS
is a separate sense amplifier and a small amount of logic capable of
dynamically examining the data passing beneath it. These Intelligent Head
Units (IHU's) are also capable of performing simple computations (hash coding,
for example), and of serving a control function similar to the role played by
the CP.
Assuming that the number of intelligent disk heads is equal to 2^k, for some
integer k, the k-th level of the PPS tree (where the root is considered to be
at level zero) is used to interface the PPS and SPS. Specifically, each of
the 2^k internal PPS nodes at this level is associated with a different IHU.
Physically, this connection is made by interposing the IHU between the
interface-level PE and its parent PE, as illustrated in Figure 1. In its
passive state, the IHU acts as a simple bus, passing information in both
directions without change.
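The arithmetic behind the interface level can be checked with a short Python sketch. The heap-style numbering of PE's (root = 1, children of node i at 2i and 2i + 1) and the function names are assumptions made for illustration only; the paper does not prescribe a numbering.

```python
def interface_level(num_ihus: int) -> int:
    """Return k such that num_ihus == 2**k; other counts are rejected."""
    if num_ihus < 1 or num_ihus & (num_ihus - 1):
        raise ValueError("number of IHUs must be a power of two")
    return num_ihus.bit_length() - 1


def interface_pes(num_ihus: int) -> range:
    """Heap-style indices (root = 1) of the PEs at the interface level."""
    k = interface_level(num_ihus)
    return range(2 ** k, 2 ** (k + 1))


def subtree_size(total_levels: int, num_ihus: int) -> int:
    """PEs in each interface-rooted subtree of a complete tree with total_levels levels."""
    k = interface_level(num_ihus)
    if k >= total_levels:
        raise ValueError("interface level lies below the leaves")
    return 2 ** (total_levels - k) - 1


k = interface_level(8)
nodes = list(interface_pes(8))
size = subtree_size(10, 8)
```

For a hypothetical ten-level tree (1,023 PEs) with eight IHUs, the interface sits at level 3, and the eight interface-rooted subtrees of 127 PEs each, together with the 7 PEs above them, account for every PE in the tree.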
In certain algorithms, though, each IHU serves as an active control processor
for the subtree it roots, allowing independent, asynchronous computation
within the various interface-rooted subtrees. (NON-VON is thus not, strictly
speaking, a SIMD machine; in practice, however, it often functions as either a
single SIMD machine or a collection of such machines.) The most common
application of this capability is in the concurrent loading of each
interface-rooted subtree from its respective disk drive. Such parallel
transfers between SPS and PPS account for the unusually high effective I/O
bandwidth achieved in a wide range of applications. Other algorithms make use
of the "upper part" of the PPS tree -- more precisely, the portion consisting
of all PE's lying above the interface level. Among other things, this portion
of the tree can be used for the efficient synchronization of interface-rooted
subtrees following asynchronous operation.
A more thorough discussion of the SPS, its interface to the PPS, and the kinds
of algorithms that make explicit use of the upper and lower portions of the
tree is, unfortunately, beyond the scope of this paper. The reader may,
however, find the discussion of hash partitioning presented in [18] to be
useful in gaining some appreciation for the way NON-VON-like architectures
provide support for at least one important family of highly parallel
algorithms involving large amounts of data.
2.2 The Primary Processing Subsystem
Although physically structured as a binary tree, the NON-VON PPS can be
dynamically reconfigured to support communication patterns characteristic of
two other topologies in a highly efficient manner. In this subsection, we
describe the physical organization of the NON-VON PPS and discuss the three
modes of communication it supports.
The PPS is implemented using a number of identical PPS chips. Our use of a
single circuit is made possible by the adoption of a tree-partitioning scheme
first suggested by Leiserson [12]. This approach embeds both a complete
subtree (containing 2^c - 1 constituent PE's, for some c depending on device
dimensions) and a single interior node on each chip. Four nine-bit busses
(eight bits for data, and one for a control function, which will not be
discussed in this paper) enter the chip. One, called the T connection, leads
to the root of the chip's subtree, while the other three, called the F, L, and
R connections, attach the single interior node to its father, left child and
right child, respectively, within the tree.
A simple recursive procedure allows the construction of a complete binary tree 
of arbitrary size using only chips of this type. This construction is 
illustrated for the case of two chips in Figure 2. Note that the resulting
circuit consists of a larger complete binary subtree (in this case rooted by 
the interior node of the chip on the left side of Figure 2), together with a 
single unconnected interior node (the interior node of the chip on the right). 
This circuit has the same four external connections -- T, F, L and R -- as did
a single chip. 
The interconnection scheme shown in Figure 2 may be easily extended to allow 
the construction of a simple, planar printed circuit board layout (also due to 
Leiserson), which is illustrated in Figure 3. The regularity of this PC board 
layout scheme has greatly simplified the task of designing the NON-VON PPS. 
Furthermore, the area required for routing wires within the PC board is 
strictly proportional to the number of chips, allowing the efficient 
implementation of boards of arbitrary size. 
The PPS is simply a collection of these PC boards, interconnected in precisely
the same manner as are the constituent PPS chips. This scheme is suitable for
the construction of a PPS comprising 2^b - 1 PE's, for arbitrarily large b, and
leaves only a single interior PE unused.
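The bookkeeping behind this recursive construction can be sketched in a few lines of Python. The `Module` type and the `chip`, `combine` and `build` functions are hypothetical names introduced for illustration; the sketch tracks only PE counts, not wiring or layout.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Module:
    """A chip, or any assembly of chips: one complete subtree plus one spare interior node."""
    subtree_pes: int  # PEs in the complete binary subtree reached through T
    chips: int        # number of chips in the assembly


def chip(c: int) -> Module:
    """A single chip holding a complete subtree of 2**c - 1 PEs and one interior node."""
    return Module(subtree_pes=2 ** c - 1, chips=1)


def combine(left: Module, right: Module) -> Module:
    """Join two equal modules.

    One module's spare interior node becomes the root of the larger subtree
    (its L and R connections go to the two T connections); the other module's
    interior node remains spare, so the result is again a Module.
    """
    if left.subtree_pes != right.subtree_pes:
        raise ValueError("modules must hold subtrees of equal size")
    return Module(subtree_pes=left.subtree_pes + right.subtree_pes + 1,
                  chips=left.chips + right.chips)


def build(c: int, doublings: int) -> Module:
    """Repeatedly pair identical modules, starting from a single chip."""
    module = chip(c)
    for _ in range(doublings):
        module = combine(module, module)
    return module


pps = build(3, 4)  # c = 3 matches the eight PEs per chip of the planned prototype
```

With c = 3 and four doublings, sixteen chips carry 128 PEs in all, of which 127 form a complete binary tree and exactly one interior PE is left over, in agreement with the count stated above.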
The subtree incorporated within each PPS chip is configured geometrically
according to a "hyper-H" embedding [3], as illustrated in Figure 4. This
construction is highly regular, is area-optimal (in the sense that the amount
of silicon area occupied by the tree is proportional to the number of PE's),
and is easily extended to incorporate larger numbers of PE's as device
dimensions scale downward.
Figure 3: The PPS Printed Circuit Board (Leiserson Layout)
Figure 4: Hyper-H Embedding of the Binary Tree
The tree-structured inter-PE bus structure supports three distinct modes of
communication:
1. Global bus communication, supporting both broadcast by the CP to
all PE's in the PPS and data transfers from a single selected PE to
the CP.
2. Physically adjacent (tree) communication to the Parent (P), Left
Child (LC) and Right Child (RC) PE within the physical PPS tree.
3. Linearly adjacent neighbor communication to the Left Neighbor (LN)
or Right Neighbor (RN) PE in a particular logical linear sequence.
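To make the five neighbor relations concrete, the following Python sketch computes P, LC, RC, LN and RN for a PE in a small complete binary tree. Heap-style numbering (root = 1) and the use of an inorder linear sequence are assumptions of the sketch, and `neighbors` is a hypothetical helper rather than anything defined by the NON-VON hardware.

```python
from typing import Dict, List, Optional


def inorder(node: int, n: int) -> List[int]:
    """Inorder listing of heap-indexed nodes (root = 1) in a tree of n nodes."""
    if node > n:
        return []
    return inorder(2 * node, n) + [node] + inorder(2 * node + 1, n)


def neighbors(pe: int, levels: int) -> Dict[str, Optional[int]]:
    """Tree and linear neighbors of a PE in a complete tree with the given number of levels."""
    n = 2 ** levels - 1
    seq = inorder(1, n)
    pos = seq.index(pe)
    return {
        "P": pe // 2 if pe > 1 else None,          # None at the root, whose parent is the CP
        "LC": 2 * pe if 2 * pe <= n else None,     # None for leaf PEs
        "RC": 2 * pe + 1 if 2 * pe + 1 <= n else None,
        "LN": seq[pos - 1] if pos > 0 else None,   # None at the left end of the sequence
        "RN": seq[pos + 1] if pos + 1 < n else None,
    }


internal = neighbors(2, 3)
root = neighbors(1, 3)
leaf = neighbors(4, 3)
```

In a seven-PE tree the inorder sequence is 4, 2, 5, 1, 6, 3, 7, so PE 2 has the same two PEs as children and as linear neighbors, while the root's linear neighbors (5 and 6) are both leaves two tree edges away.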
The global broadcast function supports the rapid parallel communication of
instructions and data from the CP to the individual PE's, as required for SIMD
execution. As will be seen in Section 3, it is also possible for a selected
PE to send data to the CP. Using the CP as an intermediary, any PE can thus
send data to any other PE. No communication concurrency is achieved, however,
when data is passed from one PE to another using the global mode primitives.
The physically and linearly adjacent communication modes, on the other hand,
support fully parallel communication. The former is used in many tree-based
algorithms. (Parallel sorting and the logarithmic-time addition of n numbers
are two examples.) The linear mode is used in algorithms in which many PE's
simultaneously exchange data or control information with their immediate
predecessor or successor PE's in some predefined total ordering. Several
mappings between the linear logical sequence and the tree-structured physical
topology of the PPS are possible; these alternatives are discussed in the
following subsection. By way of summary, each PE can communicate with five
other PE's, which are referred to within its own local context as P, RC, LC,
RN and LN.
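The logarithmic-time addition mentioned above can be illustrated with a minimal Python sketch. It assumes, purely for simplicity, that the n values sit at the leaves of a complete binary tree and that n is a power of two; `tree_sum` is a hypothetical name, and the list comprehension stands in for what would be one parallel step of child-to-parent communication.

```python
from typing import Iterable, Tuple


def tree_sum(leaf_values: Iterable[int]) -> Tuple[int, int]:
    """Sum values held at the leaves of a complete binary tree, level by level.

    Returns (total, number of parallel steps).
    """
    level = list(leaf_values)
    n = len(level)
    if n == 0 or n & (n - 1):
        raise ValueError("expected a power-of-two number of values")
    steps = 0
    while len(level) > 1:
        # Every parent adds the values of its two children; all parents act at once.
        level = [level[i] + level[i + 1] for i in range(0, len(level), 2)]
        steps += 1
    return level[0], steps


total, steps = tree_sum(range(1, 9))
```

Eight values are summed in three parallel steps, and in general n values need log2(n) steps, since each step halves the number of partial sums still in flight.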
2.3 Topological Considerations 
The choice of a tree-structured topology for the PPS was based on 
considerations involving such factors as the efficient use of silicon area, 
favorable pinout properties, and suitability for the rapid broadcasting of 
data. Another important factor was the ability to efficiently emulate a
linear array (a sequence of PE's, each connected only to its immediate
predecessor and successor), which, among other things, plays a central role in
one of our techniques for manipulating records too large to fit within a
single PE.
First, we observe that each PPS chip has exactly four external connections
(each nine bits wide, in NON-VON 1), regardless of the number of PE's
contained within its subtree. Because of its fixed pinout requirements,
independent of the size of the embedded subtree, the realizable capacity of
the PPS chip will increase quadratically with decreases in minimum feature
width. This will permit dramatic increases in the computational power of the
NON-VON PPS unit as device dimensions are scaled downward with continuing
advances in VLSI technology. (During the target time frame for a production
version of a NON-VON-like machine, a c value of 7 or 8, corresponding to
several hundred processing elements per PPS chip, would seem feasible.)
It is worth mentioning that, with the notable exceptions of linear arrays and
such closely related architectures as simple rings, most topologies proposed
for parallel computation in VLSI do not share the area and pinout properties
we have just outlined. A homogeneous implementation of the orthogonal and
hexagonal mesh-connected topologies proposed for the implementation of
systolic arrays [10], for example, would require a number of pins proportional
to the square root of the number of PE's embedded within a chip. This is also
true of such "nearly equivalent" architectures as toroidal meshes [7] and the
chordal ring [1]. In the absence of a breakthrough in packaging technology
allowing a dramatic increase in the number of pins per chip, such
architectures will thus become progressively more "I/O-bound" as device
dimensions continue to scale downward.
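A rough Python sketch makes the pinout contrast tangible. The tree figure follows directly from the four nine-bit connections described above; the mesh figure rests on assumptions that are not in the paper, namely a square tile of an orthogonal mesh in which every PE on the chip boundary needs one off-chip link per exposed side, each as wide as a NON-VON 1 bus. Only signal pins are counted, and both function names are hypothetical.

```python
import math

BUS_WIDTH = 9  # bits per connection in NON-VON 1 (eight data, one control)


def tree_chip_pins(pes_per_chip: int) -> int:
    """Signal pins for a PPS chip: four connections (T, F, L, R), whatever the PE count."""
    return 4 * BUS_WIDTH


def mesh_chip_pins(pes_per_chip: int) -> int:
    """Signal pins for an assumed square tile of an orthogonal mesh.

    Each of the four sides exposes sqrt(pes_per_chip) PEs, and each exposed
    PE is assumed to need one BUS_WIDTH-bit link to the neighboring chip.
    """
    side = math.isqrt(pes_per_chip)
    return 4 * side * BUS_WIDTH


comparison = {n: (tree_chip_pins(n), mesh_chip_pins(n)) for n in (64, 256, 1024)}
```

Under these assumptions the tree chip stays at 36 signal pins while the mesh tile grows from 288 to 1,152 pins as the PE count rises from 64 to 1,024, which is the square-root growth referred to above.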
A large family of closely interrelated architectures exemplified by the 
shuffle-exchange [11] and cube-connected cycles [13] networks are even more
limited in this regard. The pinout requirements of this family of
architectures grow considerably faster than those of the two- and
three-dimensional meshes. Furthermore, area proportional to n^2 / log^2 n is
(provably) required to embed n PE's within a single chip using such schemes
[19]. Thus, such architectures are subject to quickly decreasing returns to
scale as improvements are made in logic densities.
Another topological consideration in designing a machine having as many
processing elements as is envisioned for NON-VON is the manner in which global
communication is handled. If a "processor density" comparable to that of the
NON-VON machine is to be achieved, only a very small amount of local memory
can be associated with each PE. The extremely fine "granularity" of such a
massively parallel machine is thus inconsistent in principle with the
replication of substantial programs within each PE. For this reason, the
realization of very high processor densities would seem to be inextricably
tied to the efficient global broadcasting of instructions.
What are the implications of this requirement for rapid global broadcasting
capabilities? First, we note that the "bounded valence assumption" (the
restriction that no "node" be connected to more than a fixed maximum number of
"wires"), which is central to all contemporary models of computation in VLSI,
precludes the possibility of broadcasting in time less than logarithmic in the
number of recipients. While this lower bound is realized by members of the
tree-structured and shuffle-based families, most other topologies do not share
this property. The two-dimensional meshes, for example, are incapable of
broadcasting in time less than proportional to the square root of the number
of recipients. In the linear array, broadcast requires linear time. The same
is true of the ring network, which may be considered "almost equivalent" to
the linear array in the context of these concerns.
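The three growth rates can be compared numerically with a small Python sketch that counts hops from the broadcasting node to the farthest recipient. The placement of the source (tree root, mesh corner, end of the array) and the function name `broadcast_hops` are assumptions chosen for illustration; other placements change the constants but not the growth rates.

```python
import math


def broadcast_hops(n: int) -> dict:
    """Hops to the farthest of n recipients under three assumed topologies."""
    return {
        # Depth of the deepest node in a heap-numbered binary tree of n nodes.
        "tree": n.bit_length() - 1,
        # Corner-to-opposite-corner distance in a square orthogonal mesh.
        "mesh": 2 * (math.isqrt(n) - 1),
        # End-to-end distance in a linear array.
        "linear array": n - 1,
    }


small = broadcast_hops(1024)
large = broadcast_hops(1024 * 1024)
```

Going from about a thousand to about a million recipients doubles the tree figure (10 to 20 hops), multiplies the mesh figure by roughly 32, and multiplies the linear array figure by roughly 1,024.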
In the NON-VON PPS, broadcast communication is effected not only in
asymptotically optimal time, but with extremely small constants as well.
Specifically, information that is broadcast is not buffered at each level of
the tree according to a sequential discipline, but is instead propagated in an
unclocked manner, passing through a very small amount of combinational logic
at each level. NON-VON thus provides highly efficient support for the global
broadcasting of instructions and data to all processing elements.
By way of summary, the meshes are as area-efficient as the binary tree, but
would increasingly suffer from pinout limitations and broadcast inefficiencies
if used in high-density applications of the sort with which we are concerned.
Such architectures as the shuffle-exchange network and cube-connected cycles,
while matching the optimal broadcast time of the tree, have area complexity
and pinout characteristics that would be incompatible with this degree of
parallelism. Of the architectures we have considered, only the linear array
and the tree may be considered indefinitely scalable, in the sense that their
pinout is fixed, and their area proportional to the number of embedded
processors.
There are two reasons for our selection of the tree, and not the linear array,
as the topology for the NON-VON PPS. First, a strictly linear interconnection
network requires time proportional to the number of processors for broadcast.
Second, the NON-VON PPS tree is in fact capable of dynamically reconfiguring
to emulate the behavior of a linear array with only a minor constant-factor
degradation in speed, as shown below. (It should be clear that the converse
is not true.) Thus, we are in fact giving up very little by choosing the tree
over the linear array.
There are several ways in which a binary tree can be used to emulate the
behavior of a linear array. The most obvious possibility is to map the nodes
of the tree onto a linear sequence according to a standard preorder, inorder
or postorder traversal scheme [9]. The nodes of the tree shown in Figure 5,
for example, are mapped onto those of a linear array by inorder enumeration.
Let us now consider what data would have to pass along each tree edge in order
to simultaneously transfer a single data element along the path from each tree
node to its successor in the linear sequence. These paths are indicated in
Figure 5 by arrows extending from each node to its linear successor, in
general passing through intermediate nodes along the way. It should be noted
that since every other element in the inorder sequence is a leaf node, half of
these arrows (which we have colored black) originate in internal nodes and
terminate in leaf nodes, while the other half (colored white) extend from leaf
nodes to internal nodes. Note further that each tree edge is associated with
exactly one black and one white arrow. If the communication cycle is divided
into separate phases for communication to and from leaf nodes, all nodes in
the tree can thus communicate with their respective successors within a single
communication cycle.
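The absence of edge contention within each phase can be checked mechanically. The Python sketch below assumes heap-style numbering (root = 1) and names each tree edge by its child endpoint; `edge_loads` and its helpers are hypothetical names. The sketch draws only the arrows between consecutive members of the sequence, with nothing beyond its two ends, so edges on the outer left and right spines carry an arrow of one color only; what it verifies is that no edge ever carries more than one arrow of the same color, which is the property the two-phase cycle relies on.

```python
from collections import Counter
from typing import List


def inorder(node: int, n: int) -> List[int]:
    """Inorder listing of heap-indexed nodes (root = 1) in a tree of n nodes."""
    if node > n:
        return []
    return inorder(2 * node, n) + [node] + inorder(2 * node + 1, n)


def path_edges(a: int, b: int) -> List[int]:
    """Edges on the tree path between nodes a and b, each named by its child endpoint."""
    edges = []
    while a != b:
        if a > b:            # the larger index is never shallower, so step it upward
            edges.append(a)
            a //= 2
        else:
            edges.append(b)
            b //= 2
    return edges


def edge_loads(levels: int) -> Counter:
    """Count, per (edge, color), the arrows crossing each edge of a complete tree."""
    n = 2 ** levels - 1
    first_leaf = 2 ** (levels - 1)
    seq = inorder(1, n)
    loads: Counter = Counter()
    for src, dst in zip(seq, seq[1:]):
        color = "white" if src >= first_leaf else "black"  # white: leaf to internal node
        for edge in path_edges(src, dst):
            loads[(edge, color)] += 1
    return loads


loads = edge_loads(5)
worst = max(loads.values())
used_edges = {edge for edge, _ in loads}
```

For a five-level tree no edge carries two arrows of the same color, and every one of the 30 edges is used by at least one arrow, so each phase moves its data without any edge being shared.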
The inorder embedding scheme, however, has the property that the maximum
number of physical tree edges between two nodes that are adjacent in the
linear logical sequence grows logarithmically with the size of the tree. This
drawback is present in the preorder and postorder enumeration schemes as well,
since both mappings contain paths extending from root to leaf. Since each
phase of the communication cycle must be at least as long as the maximum time
required for communication between any two linearly adjacent neighbors, it is
worth investigating whether a linear array can be embedded in the binary tree
in such a way that the maximum path between linearly adjacent nodes is bounded
by a constant.
Figure 5: Inorder Embedding of the Linear Array
As it happens, we have found a way to configure NON-VON's simple I/O switches
so that the longest path between linear neighbors is exactly three. Based on
a mathematical result first reported by Sekanina [14], our scheme requires
that the I/O switch settings at successive levels of the tree alternate
between those that would be employed in a preorder configuration and those
that would be used for a postorder mapping. This "bounded neighborhood"
embedding is illustrated in Figure 6.
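One way to read the alternating scheme is as a recursive ordering in which a node visited "preorder-style" is listed before its subtrees and a node visited "postorder-style" is listed after them, with the style flipping at each level. The Python sketch below implements that reading and compares it with the inorder embedding; it is an interpretation of the description above rather than a transcription of the NON-VON switch settings, and heap-style numbering (root = 1) and all function names are assumptions.

```python
from typing import List


def inorder(node: int, n: int) -> List[int]:
    """Inorder listing of heap-indexed nodes in a tree of n nodes."""
    if node > n:
        return []
    return inorder(2 * node, n) + [node] + inorder(2 * node + 1, n)


def bounded_order(node: int, n: int, pre: bool = True) -> List[int]:
    """Ordering that alternates preorder and postorder placement at successive levels."""
    if node > n:
        return []
    below = bounded_order(2 * node, n, not pre) + bounded_order(2 * node + 1, n, not pre)
    return [node] + below if pre else below + [node]


def distance(a: int, b: int) -> int:
    """Number of tree edges between heap-indexed nodes a and b."""
    hops = 0
    while a != b:
        if a > b:
            a //= 2
        else:
            b //= 2
        hops += 1
    return hops


def longest_hop(seq: List[int]) -> int:
    """Largest tree distance between consecutive members of a linear sequence."""
    return max(distance(a, b) for a, b in zip(seq, seq[1:]))


def compare(levels: int) -> tuple:
    n = 2 ** levels - 1
    return longest_hop(inorder(1, n)), longest_hop(bounded_order(1, n))


results = {levels: compare(levels) for levels in range(3, 11)}
```

For complete trees of three to ten levels, the longest hop under the inorder embedding is one less than the number of levels, while under the alternating ordering it stays at three regardless of tree size. In the seven-PE case the alternating ordering visits the PEs as 1, 4, 5, 2, 6, 7, 3.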
In practice, however, the relative advantage of bounded neighborhood embedding
over inorder mapping is not so great as it might first appear. The reason has
to do with the fact that the delay between physically adjacent PE's is not in
fact constant throughout the PPS tree. In particular, while most pairs of
physically adjacent PE's reside on the same chip, many such pairs are located
on different chips, some on different printed circuit boards, and (in a
large-scale system) a few in different cabinets. In a realistic large-scale
system, the delays encountered between chips, boards and cabinets would
typically be considerably larger than those experienced within a given chip.
Because the speed of the communication cycle is limited by the slowest data
transfer between linearly adjacent neighbors, each communication phase must be
slow enough to allow for the transfer of data between cabinets.
Rough calculations based on estimates of intra-chip, inter-chip, inter-board 
and inter-cabinet delays suggest that the relative advantage of the bounded 
neighborhood mapping over a simple inorder embedding, while not negligible, is 
not overwhelming for PPS trees of the sizes likely to be encountered in 
practice. In the interest of simplicity, we have thus decided to adopt the
inorder embedding for use in the NON-VON 1 prototype. Later versions of NON-
VON will probably be capable of supporting any of the four orderings discussed 
above, and of dynamically switching among these orderings. 
Figure 6: Bounded Neighborhood Embedding of the Linear Array
Having argued strongly for the adoption of a tree-structured physical topology
in systems exhibiting parallelism on the massive scale attempted in NON-VON,
we must emphasize that the alternative architectures discussed in this
subsection may in fact prove well suited to applications amenable to coarser
granularities, especially in the short term. In particular, the superficially
compelling asymptotic arguments advanced above must be considered in the
context of Larry Snyder's well-phrased reminder that "we don't live in
Asymptopia". On the other hand, if device dimensions continue to decrease,
the NON-VON approach to large-scale parallelism may soon have us "living in
the suburbs".
2.4 The Processing Element 
The NON-VON PE is much simpler, smaller and less powerful than the processing
elements incorporated in previously proposed tree machines [4], [15]. In
large part, this difference reflects the SIMD execution of globally broadcast
instructions, which characterizes NON-VON's typical operation. By avoiding
extensive reliance on MIMD (multiple instruction stream, multiple data stream)
operation, NON-VON obviates the need for large local program memories and
area-expensive processing and communication hardware, and amortizes the cost
of most of its control logic over a large number of independent data paths.
The result is a PE that occupies a small fraction of the area required for an
ordinary microcomputer, supporting a "processor density" far greater than that
of most parallel machines. From an applications viewpoint, the extreme
area-efficiency of the NON-VON PE makes it economically feasible to divide primary
storage into roughly "record-sized" units, and to associate a separate 
processing element with each such unit. This aspect of the NON-VON design is 
central to its processing power in large-scale data processing applications, 
as we shall see in the remainder of this paper. 
The NON-VON PE comprises:
1. A 64 word X 8 bit random access memory
2. A set of eight 8-bit byte registers
3. A set of eight 1-bit flag registers
4. A byte-wide arithmetic comparison unit (ACU)
5. A bit-wide arithmetic logical unit (ALU)
6. A byte-wide I/O switch
7. A programmable logic array (PLA)
A top-level block diagram of the PE is presented in Figure 7. 
The data path is organized around two data buses - one eight bits wide, the
other one bit wide. The local RAM, byte-wide registers, and ACU all
communicate through the eight-bit bus. One of the eight byte registers serves
as a memory address register (MAR), into which addresses are latched in the
course of accessing the local RAM. (Although the NON-VON 1 PE contains only
64 bytes of RAM, the architecture is capable of supporting a local memory of
up to 256 bytes.)
Two of the other registers, labelled A8 and B8, are distinguished as byte
accumulators, and include special hardware for performing circular shifts. In
the course of such shift operations, the bits of A8 and B8 may be rotated
through two distinguished flag registers, A1 and B1, which are referred to as
the bit accumulators. This feature provides a bit-serial link between the
byte-wide and bit-wide portions of the data path. In addition, the ACU is
capable of comparing the contents of A8 and B8 and latching the results
into the bit accumulators. Specifically, A1 is set if and only if the
contents of A8 and B8 are identical, while B1 is set if and only if the A8
value is greater than that of B8. Another distinguished byte register, IO8,
serves a special role, discussed below, involving the latching of data to be
transmitted between PE's. The remaining byte registers (labelled C8, X8, Y8
and Z8 in Figure 7) are available for general use.
[Block diagram not reproduced. Recoverable labels: RAM; I/O switch with links
to the parent and left child; register pairs MAR/EN1, IO8/IO1, Z8/Z1, Y8/Y1,
X8/X1, C8/C1 and B8/B1.]
Figure 7: Block Diagram of the Processing Element
The one-bit data bus is used to transfer data among the single-bit flag
registers, and to supply operands to, and obtain results from, the bit-wide
ALU. As noted above, two of the flag registers, called A1 and B1, serve
special roles as accumulators. In particular, the bit accumulators serve as
inputs to the ALU, along with the contents of a third flag register, C1, which
is used to store the carry bit in the course of bit-serial addition and
subtraction. Upon execution of one of the logical function instructions
(described below), the ALU is capable of computing one of the sixteen possible
boolean functions of A1 and B1, and storing the result in A1. In response to
an ADD1 instruction, the ALU functions as a full adder, computing sum and
carry bits for the three inputs A1, B1 and C1. The sum bit is stored in A1
and the new carry bit in C1. Analogous results are produced during a SUB1
instruction.
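The behavior of the bit-wide ALU described above can be summarized in a few lines of Python. This is a sketch for illustration, not the NON-VON simulator. The mapping from the four-bit function code to a truth table is an assumption (the encoding actually used by the hardware is not given here), and the function names are hypothetical.

```python
def logical(op, a1, b1):
    """Apply the two-input boolean function selected by the 4-bit code `op`.

    Assumed encoding (not taken from the NON-VON hardware): bit
    (2*a1 + b1) of `op` holds the function's value for inputs (a1, b1).
    """
    return (op >> (2 * a1 + b1)) & 1


# A few of the sixteen codes under the assumed encoding.
AND, OR, XOR, NAND = 0b1000, 0b1110, 0b0110, 0b0111


def add1(a1, b1, c1):
    """Full adder as described for ADD1: returns (new A1, new C1)."""
    sum_bit = a1 ^ b1 ^ c1
    carry = (a1 & b1) | (a1 & c1) | (b1 & c1)
    return sum_bit, carry


if __name__ == "__main__":
    for a in (0, 1):
        for b in (0, 1):
            print(a, b, logical(AND, a, b), logical(XOR, a, b), add1(a, b, 1))
```

Because a two-input boolean function is fully determined by its four truth-table entries, a four-bit code is exactly enough to name all sixteen functions.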
Another flag register, EN1, is distinguished as an enable flag. This flag is
used to activate and deactivate individual PE's within the PPS. In general
terms, only those PE's whose enable flags are asserted will respond to
instructions broadcast by the CP. If EN1 is set to 0 in a particular PE, all
instructions except one (the ENABLE instruction, discussed below) will be
ignored. A number of tricky issues arise in considering the behavior of
enabled and disabled PE's, particularly in the case of inter-PE communication
operations. These issues will be examined as part of our detailed discussion
of the instruction set. Finally, another flag register, IO1, is the boolean
analogue of IO8, serving as an I/O latch in the transmission of single-bit
values between PE's. The other flag registers (labelled X1, Y1 and Z1 in
Figure 7) may be used to store arbitrary boolean values.
The I/O switch is connected to both the eight-bit and one-bit buses, allowing
the transfer of byte and flag data to the parent, left child and right child
PE's (and, depending on the switch settings, to other PE's as well).
Finite-state control for the I/O switch and data path is provided by a common
PLA. Consideration has been given to the possibility of "factoring out" a
portion of the PLA associated with each PE on a given chip into a single PLA
shared by all such PE's. This approach might ultimately allow the
"amortization" of part of the control logic over a large (and increasing, with
reductions in device dimensions) number of PE's. While we have not employed a
"PLA factorization" strategy in designing NON-VON 1, this approach is likely
to be incorporated in future versions.
In order to keep the area of the PE many times smaller than that of a
conventional microprocessor, many decisions have been made in which execution
speed is sacrificed for silicon area. While it is difficult to rigorously
defend such complex and interacting design decisions, an intuitive
justification for this strategy may prove illuminating. First, it is worth
mentioning that, in our experience, the savings in area made possible by such
decisions in practice often vary as the square of the associated degradation 
in speed. While such a relationship is observed in many aspects of processor 
design, the routing of an ordinary n-bit data bus through a 90-degree turn 
provides a simple example. Note that the area required to "turn the corner" 
is proportional not to n, but to the square of n, as illustrated in Figure 8. 
More substantive examples abound. 
Because the chip- and board-level layouts employed in the PPS consume area
proportional to the number of PE's, the number of PE's realizable in a system
containing a fixed number of chips varies inversely with the area of a single
PE. In the critical sections of typical NON-VON programs, all available PE's
are typically performing useful computational work in parallel. Thus,
NON-VON's maximum achievable execution speed is in some sense inversely
proportional to PE area. This being the case, we have found it
















Figure 8: Routing of an N-Bit Data Bus through a 90-Degree Turn
quadratic penalty in area. 
The PE instruction set provides another example of the sacrifice of execution
speed within the individual PE in the interest of minimizing area, thus
increasing the realizable throughput of the PPS as a whole. As we shall see
shortly, the NON-VON PE executes a very small, narrow, and rather low-level
set of instructions by comparison with the current generation of powerful 16-
and 32-bit microprocessors. In particular, all PE instructions are eight bits
long, including register operands and logical function codes. (In one case,
however, the instruction is followed by a byte of data.) In place of a rich
set of relatively powerful instructions, we have chosen a few low-level 
operations having extremely simple realizations in hardware. 
A single instruction typical of a contemporary 16-bit microprocessor might be 
implemented in NON-VON using a sequence of between one and four PE 
instructions. At the cost of a modest degradation in local execution speed, 
this strategy dramatically reduces the complexity of (and hence, the area
required for) the data path and PLA, and reduces the number of pins required
to route instructions through the PPS chips. 
3 Programming NON-VON
In this section, we introduce the NON-VON instruction set and describe the
manner in which it is typically used in the course of programming. While a
detailed discussion of each of the applications we have explored is beyond the
scope of the current paper, some feeling for the kinds of techniques employed
in constructing NON-VON programs is necessary to understand the basis for our
architectural decisions. The remainder of this paper is thus devoted to an
exposition of some of the techniques that characterize the NON-VON approach to
parallel programming.
One "conceptual metaphor" we have found particularly useful in describing the
principles underlying most NON-VON algorithms involves the notion of
"intelligent records". This construct is explicated in the subsection
immediately following our description of the instruction set. Next, we
discuss the associative operations used to access intelligent records. In the
fourth subsection, we describe and compare alternative techniques for the
allocation and manipulation of records of various sizes (relative to the local
storage capacity of a single PE). Finally, we illustrate the typical use of
the techniques introduced in this section by informally describing NON-VON
algorithms for a few simple symbolic and numerical applications.
3.1 The PE Instruction Set
The set of instructions executed by the NON-VON PE may be divided into six
categories. The complete instruction set, grouped by category, is described
below. Each instruction is followed by a brief specification of its
semantics. The following symbols are employed:
<byte reg>   One of the eight 8-bit registers
             (A8, B8, C8, X8, Y8, Z8, IO8, or MAR)
<flag reg>   One of the eight 1-bit registers
             (A1, B1, C1, X1, Y1, Z1, IO1 or EN1)
             One of the physically or linearly adjacent PE's (P, LC, RC, LN, or RN)
<address>    An eight-bit address in the local RAM
             A one-bit constant
             An eight-bit constant
After the presentation of all instructions in a given group, a narrative
description of the typical use of each instruction is provided.
1. Register Transfer Group 
OPCODE    OPERAND      SEMANTICS
LOADA8    <byte reg>   A8 <- <byte reg>
LOADB8    <byte reg>   B8 <- <byte reg>
LOADA1    <flag reg>   A1 <- <flag reg>
LOADB1    <flag reg>   B1 <- <flag reg>
STOREA8   <byte reg>   <byte reg> <- A8
STOREB8   <byte reg>   <byte reg> <- B8
STOREA1   <flag reg>   <flag reg> <- A1
STOREB1   <flag reg>   <flag reg> <- B1
The register transfer instructions are used to move data between the four
accumulators (A8, B8, A1 and B1) and any of the other registers of compatible
length. Note that the MAR may serve as the destination of an eight-bit STORE
instruction, allowing different addresses to be stored in the MAR's of
different PE's, and thus permitting simultaneous access to different locations
in the local memories of different PE's, as described below. Similarly, it is
worth noting that the value of EN1 may be changed from one to zero using an
ordinary STORE instruction, allowing selected PE's to be disabled.
Note that transfers between arbitrary registers must be mediated by one of the
accumulator registers, requiring two instructions instead of one. In the
context of a massively parallel system, however, the fact that single-operand
register transfer instructions are conveniently implemented in an eight-bit
instruction word with very little area expended for control logic represents a
significant compensating advantage.
2. Memory Access Group
OPCODE     OPERAND     SEMANTICS
READRAM    <address>   A8 <- RAM(MAR)
WRITERAM   <address>   RAM(MAR) <- A8
In order to transfer data between the local RAM and the A8 accumulator, the
address of the RAM location to be accessed must first be written into the
eight-bit MAR register using an ordinary STORE instruction. Note that
different PE's may access different RAM locations simultaneously, since the
values in their respective MAR's need not be the same. This feature is
essential to such applications as the parallel processing of variable-length
records. The starting addresses of three variable-length fields might be
stored in the first, second and third RAM locations within each PE, for
example. In order to access the first byte of the second field of each record
in parallel, the contents of RAM location two would be moved (by way of A8)
into the MAR, and a READRAM instruction executed. Successive bytes in this
field could then be accessed by performing parallel arithmetic on the address
stored in the MAR.
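The variable-length field example can be traced with a toy model. The following Python sketch is illustrative only: the `PE` class, the record layout produced by `make_ram`, and the way the constant address reaches the MAR in the first step are all assumptions made for the sketch, not features documented for NON-VON 1.

```python
class PE:
    """Toy processing element: a 64-byte RAM plus the A8 and MAR registers."""

    def __init__(self, ram):
        self.ram = list(ram)
        self.a8 = 0
        self.mar = 0


def make_ram(fields):
    """Lay out three variable-length fields in a 64-byte RAM.

    Locations 1, 2 and 3 hold the starting addresses of the fields
    (location 0 is left unused in this hypothetical layout).
    """
    ram = [0] * 64
    next_free = 4
    for i, field in enumerate(fields, start=1):
        ram[i] = next_free
        for ch in field:
            ram[next_free] = ord(ch)
            next_free += 1
    return ram


def first_byte_of_field(pes, k):
    """Fetch, in every PE, the first byte of field k via its own MAR."""
    for pe in pes:              # the same address k is placed in every MAR
        pe.mar = k              # (mechanism glossed over in this sketch)
    for pe in pes:              # READRAM: A8 <- RAM(MAR)
        pe.a8 = pe.ram[pe.mar]
    for pe in pes:              # STOREA8 MAR: each PE now holds its own address
        pe.mar = pe.a8
    for pe in pes:              # READRAM again, at a PE-specific location
        pe.a8 = pe.ram[pe.mar]
    return [pe.a8 for pe in pes]


pes = [PE(make_ram(["SMITH", "SALES", "12"])),
       PE(make_ram(["LI", "ENGINEERING", "3"]))]

if __name__ == "__main__":
    print([chr(b) for b in first_byte_of_field(pes, 2)])
    print([pe.mar for pe in pes])
```

The point of the sketch is that a single broadcast instruction sequence leaves different addresses in the MAR's of different PE's, so the final read touches a different location in each local RAM.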







OPCODE   SEMANTICS
ADD1     A1 <- A1 xor B1 xor C1
         C1 <- (A1 and B1) or (A1 and C1)
               or (B1 and C1)
SUB1     A1 <- A1 xor (not B1) xor C1
         C1 <- (A1 and (not B1)) or (A1 and C1)
               or ((not B1) and C1)
ROTRA    Rotate A8 right by one bit through A1
ROTLA    Rotate A8 left by one bit through A1
ROTRB    Rotate B8 right by one bit through B1
ROTLB    Rotate B8 left by one bit through B1
While we have recently become quite interested in the implementation of
parallel numerical algorithms on NON-VON-like machines, the rapid execution of
purely numerical problems was not among the primary motivations for the
NON-VON machine. Thus, although certain operations critical to NON-VON's
typical modes of operation (data transfer and arithmetic comparison
operations, for example) are performed eight bits at a time in NON-VON 1, all
arithmetic operations other than comparison are performed in a bit-serial
fashion. Specifically, the ADD1 and SUB1 instructions perform one-bit addition
and subtraction operations, respectively, as described earlier. Arithmetic on
operands of arbitrary width is performed by repeated execution of these
instructions. (Macros for eight-bit addition and subtraction, along with a
number of other common sequences of PE instructions, are provided as part of
the NON-VON 1 simulator.) The result is an ALU that, while fully general and
extremely compact, is rather slow by comparison with conventional
microprocessors in the performance of standard arithmetic operations.
In future versions of NON-VON, oriented toward the rapid execution of a wide
range of numerical problems, we plan to experiment with the implementation of
somewhat faster, albeit more area-expensive ALU's. It should be noted,
however, that in many common data processing applications -- performing the
same computation on a large number of records, for example, or computing such
quantities as the mean or variance of selected fields -- the ability to
perform a million or so arithmetic operations in parallel should push even
NON-VON 1's effective throughput several orders of magnitude beyond those of
today's fastest supercomputers.
The four rotate instructions treat the A8 and A1 registers (and similarly, the
B8 and B1 registers) together as a nine-bit circular shift register.
Specifically, ROTRA shifts all but the low-order bit of A8 into the next
lowest bit position within A8; the low-order bit of A8 is moved into A1, and
the value previously stored in A1 is moved into the high-order bit of A8.
ROTLA similarly performs a left circular shift of the combined A8 and A1
registers, while ROTRB and ROTLB perform analogous shifts on the B8 and B1
registers. In combination with the one-bit logical function operations
(discussed below), these instructions permit the execution of arbitrary
operations involving eight-bit operands on a bit-serial basis.
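To make the bit-serial style concrete, the sketch below builds an eight-bit addition from the ROTRA, ROTRB and ADD1 semantics given above. It is a hypothetical instruction sequence written for illustration; it is not the macro supplied with the NON-VON 1 simulator, and the `PE` class and the direct clearing of C1 are assumptions of the sketch.

```python
class PE:
    """Toy model of the accumulators involved in bit-serial addition."""

    def __init__(self, a8=0, b8=0):
        self.a8, self.b8 = a8, b8
        self.a1 = self.b1 = self.c1 = 0

    def rotra(self):
        """Rotate the nine-bit ring (A8, A1) right by one bit."""
        low = self.a8 & 1
        self.a8 = (self.a1 << 7) | (self.a8 >> 1)
        self.a1 = low

    def rotrb(self):
        """Rotate the nine-bit ring (B8, B1) right by one bit."""
        low = self.b8 & 1
        self.b8 = (self.b1 << 7) | (self.b8 >> 1)
        self.b1 = low

    def add1(self):
        """Full adder: A1 receives the sum bit, C1 the new carry."""
        a, b, c = self.a1, self.b1, self.c1
        self.a1 = a ^ b ^ c
        self.c1 = (a & b) | (a & c) | (b & c)


def add8(pe):
    """A8 <- (A8 + B8) mod 256, with the carry out left in C1."""
    pe.c1 = 0          # assumed to be cleared beforehand (set directly here)
    for _ in range(8):
        pe.rotra()     # next operand bit of A8 into A1; previous sum bit into A8
        pe.rotrb()     # next operand bit of B8 into B1
        pe.add1()      # sum bit into A1, carry into C1
    pe.rotra()         # ninth rotation pushes the last sum bit into A8
    pe.rotrb()         # ninth rotation restores B8 to its original value


if __name__ == "__main__":
    pe = PE(200, 100)
    add8(pe)
    print(pe.a8, pe.c1, pe.b8)
```

In this sequence each sum bit re-enters the high-order end of A8 on the following rotation, so after nine rotations A8 holds the eight sum bits in their proper positions.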
4. Logical Function Group 
LOGICAL <operation>   A1 <- (A1 <operation> B1)
    (where <operation> is a four-bit code specifying one of the
    sixteen possible boolean functions of two single-bit variables)
CLEAR     A1 <- 0
SET       A1 <- 1
NEGATE    A1 <- not A1
AND       A1 <- A1 and B1
OR        A1 <- A1 or B1
XOR       A1 <- (A1 and (not B1)) or ((not A1) and B1)
EQU       A1 <- (A1 and B1) or ((not A1) and (not B1))
NAND      A1 <- not (A1 and B1)
since the semantics of this operation would be undefined if both children of
that parent were enabled. Thus, only LC, RC, LN and RN are legal operands for
the SEND8 instruction. It should be noted, however, that the parent is
capable of receiving data from its children through the use of RECV8 LC and
RECV8 RC instructions. The semantics of the SEND8 and RECV8 instructions are
not immediately apparent in the case where the operand PE is currently
disabled. In such cases, it is the recipient's status, and not that of the
originator, which determines whether data is in fact transferred.
Specifically, it is always possible to RECV data from a PE, regardless of
whether it is enabled, but an attempt to SEND data to a disabled PE will not
result in a transfer of data.
The SEND1 and RECV1 instructions function in precisely the same way as SEND8
and RECV8, but operate on flag operands instead of byte-wide values.




OPCODE    SEMANTICS
ENABLE    EN1 <- 1 in all PE's, including those
          previously disabled
COMPARE   if A8 = B8 then A1 <- 1; otherwise A1 <- 0
          if A8 > B8 then B1 <- 1; otherwise B1 <- 0
RESOLVE   A1 <- 0 in all PE's except "first" PE
          where A1 = 1
          if no PE has A1 = 1,
          logical register R1 (in CP) <- 0;
          otherwise R1 <- 1
A PE may be disabled by transferring a 0 into its EN1 register using an
ordinary STOREA1 EN1 (or STOREB1 EN1) instruction. In a typical application,
the contents of A1 (or B1) will be set to the result of some boolean test
prior to the execution of such a store instruction, resulting in the selective
disabling of all PE's for which the test fails. This technique supports the
"conditional" execution of a particular code sequence. Following the
execution of such a sequence, an ENABLE instruction is issued to "awaken" all
disabled PE's. In combination with appropriate register transfer and logical
operations, this approach may be used to implement more complex conditionals,
including nested "IF-THEN-ELSE" constructs.
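The disable, execute, re-enable pattern can be sketched as follows. The Python below is a toy model written for illustration, with each PE represented by a dictionary; the field name `value`, the test, and the function name are hypothetical, and the loops over PE's stand in for operations that the hardware performs simultaneously.

```python
def run_conditional(values, threshold, bonus):
    """SIMD-style 'if value > threshold then value += bonus' over all PE's."""
    pes = [{"value": v, "EN1": 1, "A1": 0} for v in values]

    # Every enabled PE evaluates the boolean test into its A1 flag.
    for pe in pes:
        if pe["EN1"]:
            pe["A1"] = 1 if pe["value"] > threshold else 0

    # STOREA1 EN1: PE's for which the test failed disable themselves.
    for pe in pes:
        if pe["EN1"]:
            pe["EN1"] = pe["A1"]

    # The conditional body is broadcast; only enabled PE's respond.
    for pe in pes:
        if pe["EN1"]:
            pe["value"] += bonus

    # ENABLE: honored by every PE, including those previously disabled.
    for pe in pes:
        pe["EN1"] = 1

    return [pe["value"] for pe in pes]


if __name__ == "__main__":
    print(run_conditional([5, 20, 7, 30], threshold=10, bonus=100))
```

An ELSE branch could be modeled in the same way by saving the test result in a spare flag, re-enabling, and disabling on the complement of the saved flag before broadcasting the second body.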
The COMPARE instruction sets the A1 flag to 1 if the contents of A8 and B8 are
the same, and the B1 register to 1 if the contents of A8 exceed those of B8.
By combining the two bit accumulator values using the appropriate logical
instructions, it is thus possible to perform any of the six possible
arithmetic relational tests ("equal to", "not equal to", "greater than",
"greater than or equal to", "less than", or "less than or equal to") on the
values in the byte accumulators. The result may then be used to selectively
disable certain processors, allowing the use of general arithmetic tests
within a conditional.
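The derivation of the six relational tests from the two COMPARE flags can be written out explicitly. The following Python sketch is illustrative; the function names and the dictionary keys are hypothetical, and each expression corresponds to a single boolean function of the two bit accumulators.

```python
def compare(a8, b8):
    """COMPARE as described: returns (A1, B1) = (A8 == B8, A8 > B8)."""
    return int(a8 == b8), int(a8 > b8)


def relations(a8, b8):
    """Derive all six relational tests from the two COMPARE flags."""
    eq, gt = compare(a8, b8)
    return {
        "eq": eq,              # A1 itself
        "ne": 1 - eq,          # not A1
        "gt": gt,              # B1 itself
        "ge": eq | gt,         # A1 or B1
        "lt": 1 - (eq | gt),   # not (A1 or B1)
        "le": 1 - gt,          # not B1
    }


if __name__ == "__main__":
    print(relations(3, 7))
    print(relations(7, 7))
```

Since A1 and B1 can never both be set by a single COMPARE, only three of the four flag combinations occur, and each relation above is a boolean function of the pair.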
The most common use of the COMPARE instruction, however, is in the execution
of content-addressable operations. As we shall see shortly, such operations
are realized by broadcasting character strings or numeric values throughout
the PPS, comparing them in parallel with the contents of all enabled PE's, and
disabling those for which the match criteria are not satisfied. The decision
to implement the COMPARE instruction using byte-wide comparator hardware was
based in large part on the central role played by such content-addressable
operations in most NON-VON algorithms.
The RESOLVE instruction is used in practice to disable all but a single PE,
chosen arbitrarily from among a specified set of PE's. First, the A1 flag is
set to one in all PE's to be included in the candidate set. The RESOLVE
instruction is then executed, causing all but one of these flags to be changed
to zero. (Upon executing a RESOLVE instruction, one of the inputs to the CP
will become high if at least one candidate was in fact found in the tree, and
low if the candidate set was found to be empty. In our simulator, this
condition code is stored in the "logical register" R1, which may be thought of
as existing within the CP.) By issuing a STOREA1 EN1 command, all but the
single, chosen PE may be disabled, and a sequence of instructions may be
executed on the chosen PE alone. In particular, data from the chosen PE may
be communicated to the CP through a sequence of LOAD and REPORT commands.
If the candidate set is first saved (using another flag register in each PE),
each of the candidates can be chosen in turn, subjected to individual
processing, and removed from the candidate set, allowing the sequential
processing of all candidates. Typically, the individual processing performed
for each chosen candidate involves the broadcasting of information contained
in, or derived from, that candidate to other PE's within the PPS. This
paradigm for sequential enumeration is thus employed as a sort of "outer loop"
in a number of highly parallel NON-VON algorithms, including the algorithm for
set intersection described in Subsection 3.5.
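The enumeration loop just described can be sketched in a few lines. In the Python below, which is an illustration rather than NON-VON code, the list position of a flag stands in for the position of a PE in the tree, "first" is taken to mean the lowest list index, and recording the index stands in for a REPORT of the chosen PE's data. All names are hypothetical.

```python
def resolve(a1_flags):
    """Clear A1 everywhere except in the first PE where it is set.

    Returns (new flags, R1), where R1 is 1 if any candidate existed.
    """
    new_flags = [0] * len(a1_flags)
    for i, flag in enumerate(a1_flags):
        if flag:
            new_flags[i] = 1
            return new_flags, 1
    return new_flags, 0


def enumerate_candidates(candidates):
    """Visit every candidate PE once, one RESOLVE per visit."""
    saved = list(candidates)       # candidate set saved in a spare flag register
    visited = []
    while True:
        chosen, r1 = resolve(saved)
        if not r1:                 # R1 = 0: the candidate set is empty
            break
        index = chosen.index(1)    # the single PE left with A1 = 1
        visited.append(index)      # stands in for LOAD and REPORT commands
        saved[index] = 0           # remove the chosen PE from the saved set
    return visited


if __name__ == "__main__":
    print(enumerate_candidates([0, 1, 0, 1, 1]))
```

The loop body runs once per candidate and never examines non-candidates individually, which is why the cost of enumeration depends on the number of candidates rather than on the number of PE's.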
In the NON-VON 1 prototype, the A1 flag is preserved in that PE which would be
assigned the lowest number in an inorder enumeration of all nodes in the PPS
tree. The use of inorder enumeration as a criterion for selecting a single PE
is an artifact of the NON-VON 1 hardware design, however, and is not
guaranteed by the instruction set. The RESOLVE function is implemented using
special combinational hardware, embedded within the I/O switch, that
propagates a series of "kill" signals in parallel from all candidate PE's to
all higher-numbered PE's in the tree. As is the case for all of the global
communication functions, the RESOLVE operation is very fast; hundreds of
thousands of candidates might be "killed" in less than a microsecond in
NON-VON 1, for example.
3.2 The "Intelligent Record" Metaphor
A large share of the data processing applications for which computers are now
used involve operations on files that consist of a relatively large number of
comparatively small records. In many such applications, the relevant files
may greatly exceed the capacity of the primary storage device. While the
design of NON-VON's SPS, and its interface to the PPS, were in fact based
largely on the essential characteristics of such large-scale data processing
tasks, our concern in the following discussion will be with the case in which
all records are stored in the PPS. Briefly stated, the NON-VON approach to
parallelizing this sort of record-processing application is based on a "nearly
one-to-one" physical association of PE's and records. In such applications
individual records are often, in effect, capable of manipulating their own
contents in parallel. This observation suggests the notion of an
"intelligent" record.
As we shall see shortly, NON-VON is designed to support the massively parallel
manipulation of records that may be considerably larger or smaller than the
local storage available within each PE. Furthermore, the high-level languages
we are now developing for use on NON-VON permit the precise mapping between
records and PE's to be made invisible to the user in most applications. The
user-transparency of this mapping is in fact a critical aspect of NON-VON's
support for the intelligent record concept, since it insulates the programmer
from the details of the hardware, allowing each user-defined logical record to
be treated as if it had its own private processor.
As an alternative to the intelligent record metaphor, the reader may wish to
think in terms of the equivalent notion of "virtual PE's", each consisting of
a single processor and an amount of local memory just sufficient to store a
single record of arbitrary size.
3.3 Associative Operations on the NON-VON Machine
Before examining the manner in which NON-VON's hardware supports records of
arbitrary size, let us consider the fundamental mechanisms employed in
accessing and manipulating intelligent records. In contrast with a
conventional coordinate-addressable computer, whose primitive instructions
access its data by address, NON-VON may be considered a content-addressable
machine, in which data is accessed on an associative basis. In order to
illustrate the manner in which records may be accessed by content, let us
consider an example in which each PE contains a single "employee record"
containing fields for the name, department, years of service, and salary of




Suppose we wish to associatively identify the records of all employees in the
sales department, and to perform some operation on all such records (either
concurrently or in succession). Let us assume that the department name is
stored in a five-character field beginning in the 17th location within each
local RAM, and that all PE's containing an employee record are initially
enabled. We now broadcast the first character in the specified department
name, which is an "S", to all PE's. Each PE compares this character with the
contents of its 17th RAM location, and disables itself if the two are not






Send the pattern character
and save it in B8
Get the data character
Do they match?
If not, disable this PE
Using a similar set of instructions, the second character is broadcast and
compared with the 18th location in the local RAM of each enabled PE. After
the execution of five such code sequences, only those PE's whose DEPARTMENT
fields contain the string "SALES" will remain enabled. It should be noted
that this process of associative marking requires time dependent only on the
length of the pattern string, and independent of the number of employee
records. Furthermore, the values of any combination of fields may be used as
criteria for success of the associative marking operation.
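The marking procedure can be simulated directly. The Python sketch below is an illustration, not NON-VON code: each PE is modeled as a dictionary, the inner loop over PE's stands in for work that all PE's perform simultaneously, and the helper names are hypothetical. The field position (location 17, five characters) is taken from the example in the text.

```python
FIELD_START = 17   # DEPARTMENT field begins at the 17th RAM location
FIELD_WIDTH = 5


def make_pe(department):
    """Build a toy PE whose RAM holds only the DEPARTMENT field."""
    ram = [0] * 64
    padded = department.ljust(FIELD_WIDTH)[:FIELD_WIDTH]
    for i, ch in enumerate(padded):
        ram[FIELD_START + i] = ord(ch)
    return {"ram": ram, "EN1": 1}


def mark(pes, pattern, start=FIELD_START):
    """Leave enabled only the PE's whose field matches `pattern`.

    Returns the number of broadcast steps, one per pattern character.
    """
    steps = 0
    for offset, ch in enumerate(pattern):
        for pe in pes:                       # conceptually simultaneous
            if pe["EN1"]:
                b8 = ord(ch)                 # pattern character saved in B8
                a8 = pe["ram"][start + offset]   # data character
                a1 = int(a8 == b8)           # COMPARE
                pe["EN1"] = a1               # disable on mismatch
        steps += 1
    return steps


if __name__ == "__main__":
    pes = [make_pe(d) for d in ["SALES", "SHOES", "SALES", "ADMIN"]]
    print(mark(pes, "SALES"), [pe["EN1"] for pe in pes])
```

The step count returned by `mark` depends only on the pattern length, mirroring the observation that marking time is independent of the number of records. (The Python loop over PE's is of course sequential; only the broadcast steps model machine time.)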
In the case where different PE's are used for the storage of different types
of records, operations on a given record type must be preceded by the
disabling of all PE's but those containing records of that type. To
facilitate this process, each record is "tagged" internally to indicate its
record type. If there are only a few distinct record types, the records can
be tagged by associating a different one-bit register with each record type,
and setting its value to 1 in exactly those PE's containing records of the
type in question. In order to enable all records of a given type, the bit
contained in the appropriate flag register is simply transferred to EN1 using
two register transfer instructions. For a larger number (up to 256) of record
types, a distinct "tag byte" is associated with each record type, and stored
in the same way as the fields of the record itself. A single BROADCAST and
COMPARE sequence, followed by a STOREA1 EN1 instruction, may be used to
disable all PE's except those containing records of the desired type.
Depending on the application, associative marking is typically followed by one
of two operations. The first, and most common, is to perform a sequence of
operations in parallel on the records contained in each of the associatively
identified PE's. The second involves sending the "marked" records (or
selected fields thereof) one at a time to the CP in an arbitrary sequence,
using the RESOLVE and REPORT instructions. The latter operation, when applied
to associatively identified records, is called associative enumeration. It
should be noted that the time required for associative enumeration, while
proportional to the number of "matching" records, is independent of the total
number of records in the file. Both of the above applications of associative
marking will be illustrated shortly in the context of particular NON-VON
algorithms.
It is of course the case that either a conventional computer or a NON-VON-like
machine (and indeed, any device with the power of a Turing machine) is capable
of emulating the behavior of either a content- or coordinate-addressed
machine. In particular, a conventional system can implement associative
operations using only coordinate-addressable primitives by employing one of
several well-understood partial match algorithms. Because they must provide
for retrieval based on any of the 2^k possible combinations of k fields,
though, such algorithms are associated with significant costs in time, space
and conceptual complexity.
Conversely, NON-VON is capable of addressing data on a coordinate basis
whenever the data under consideration is best understood in terms of an
"address-like" numbering scheme. In such applications, coordinate values are
explicitly stored as part of each intelligent record and associatively probed
to obtain the record corresponding to a given address. This technique is
employed in a number of parallel matrix algorithms, for example.
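As a small illustration of coordinate addressing by associative probe, the Python sketch below stores a (row, column) pair in each record and selects the record whose stored coordinates match a broadcast pair. The record layout and function names are hypothetical, and the list comprehension stands in for a comparison that every PE would perform at once.

```python
def load_matrix(matrix):
    """One record per matrix element, with its coordinates stored explicitly."""
    return [
        {"row": i, "col": j, "value": value}
        for i, row in enumerate(matrix)
        for j, value in enumerate(row)
    ]


def probe(records, row, col):
    """Return the values of all records whose coordinates match the probe."""
    return [r["value"] for r in records if r["row"] == row and r["col"] == col]


if __name__ == "__main__":
    records = load_matrix([[1, 2], [3, 4]])
    print(probe(records, 1, 0))
```

The same records can equally be selected by any other stored field, so the coordinate scheme is simply one choice of match criterion rather than a separate addressing mechanism.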
What, then, are the essential differences between NON-VON's addressing
capabilities and those supported by a conventional von Neumann computer? From
a software perspective, the critical point is that NON-VON uses a numerical
addressing scheme only when the problem at hand is most easily described in
terms of a coordinate system. In the case where records are more naturally
identified by content, the programmer is relieved of the responsibility of
translating his or her intentions into an artificial coordinate-based
descriptive formalism.
It is our contention that the great majority of the computer applications
encountered to date are most naturally described in terms of
content-addressable, as opposed to coordinate-addressable, primitives. While
our argument is perhaps strongest for the kinds of "business-oriented" data
processing tasks that presently account for most of our society's expenditures
for large-scale computing, we believe that a surprising number of "scientific"
applications might also be more easily specified in content-addressable terms.
By providing direct, low-level support for associative operations, NON-VON
effectively shortens the path between the description and implementation of
many common computational tasks, thus simplifying the task of programming.
The other essential advantage of NON-VON's hardware support for content-
addressability, of course, relates to the time required for associative 
operations. In practice, NON-VON might provide as much as several orders of
magnitude improvement over the fastest associative retrieval operations on a
conventional computer system, without the need for complex, time-consuming,
and area-expensive indexing or hashing operations.
3.4 Packed and Spanned Records
Up to this point, we have considered the case in which exactly one record is 
stored in each PE. Let us now consider the manner in which records
considerably smaller or larger than the capacity of a single local RAM may be
efficiently stored and manipulated within the NON-VON PPS. The former case
involves the allocation of more than one record per PE, a scheme we call
packed record allocation. To illustrate the manner in which small records may
be packed, let us consider an application in which it is desirable to pack as
many fifteen-byte records as possible into the PPS at once. (Although records
of this size would be uncommon in most symbolic applications, they might well
occur in, say, a sparse matrix manipulation or signal processing problem.)
Four such records might be stored in each PE, beginning in local RAM locations
1, 16, 31 and 46. We will use the term record slice to refer to a set of
packed records stored in the same position within their respective PE's. (In
our example, four record slices are defined.) In general terms, each
operation to be performed on a packed record is carried out by issuing a
separate set of PE instructions for each record slice. In order to move a
single byte from the fifth to the seventh location of each of our fifteen-byte
packed records, for example, we would first execute the sequence
    READRAM 5
    WRITERAM 7
followed by the sequence
    READRAM 20
    WRITERAM 22
and then by analogous sequences of instructions corresponding to the last two 
record slices. The high-level languages now under development for use on NON-
VON are intended to relieve the programmer of the responsibility for such 
operations. In our Pascal-based language, for example, the user would simply 
declare the collection of records to be of type PACKED MULTIPLE RECORD; a
subsequent assignment statement involving two fields of that record would be
compiled into the four sequences of instructions discussed above.
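The expansion of one logical byte move into a sequence per record slice can
be sketched in Python as below. Only the READRAM and WRITERAM mnemonics and
the slice addresses 1, 16, 31 and 46 come from the text; the function names
and the idea of emitting the program as a list of strings are illustrative
assumptions, not a description of the actual compiler.

```python
def slice_bases(record_len, records_per_pe, first=1):
    """Starting local RAM location of each record slice (1-based)."""
    return [first + i * record_len for i in range(records_per_pe)]


def move_byte_program(src, dst, record_len=15, records_per_pe=4):
    """Emit one READRAM/WRITERAM pair per record slice.

    `src` and `dst` are 1-based byte positions within a packed record.
    The generated text is only a model of what a compiler might broadcast.
    """
    program = []
    for base in slice_bases(record_len, records_per_pe):
        program.append(f"READRAM {base + src - 1}")
        program.append(f"WRITERAM {base + dst - 1}")
    return program


# Move the fifth byte of every fifteen-byte packed record to the seventh.
program = move_byte_program(src=5, dst=7)
for line in program:
    print(line)
```

The length of the emitted program grows with the packing factor, which is the
time cost of packing discussed in the surrounding text.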
Not all operations on packed records, though, are so simply handled. In the 
above example, the A8 register is used only for temporary storage of the value
to be transferred, and need not be preserved after the transfer is completed
for a given record slice. In general, however, the contents of certain flag
and byte registers may have to be saved prior to operations on successive
record slices. The question of how best to reduce the overhead involved in
such "state-saving" operations is one of the more interesting considerations
involved in the design of compilers for NON-VON. 
While packed records may be quite useful in some applications, it should be
noted that the space saved by packing is at best proportional to the increased
time required to broadcast each instruction to all slices. An additional
disincentive is provided by the significant compile- and execution-time
overhead required for the support of operations on packed records. For these
reasons, small records are packed only when this option is explicitly chosen
by the programmer, based on the relative importance of time and space in the
context of a given application. 
In the case of records too large to fit within a single PE, each record is
split among several PE's according to one of two schemes. The first, referred
to as the linear allocation method, splits each record among several linearly
adjacent (logical) neighbor PE's. The other, which we call bush allocation,
stores each record in a distinct "tree-shaped" cluster of physically proximate
PE's called a bush. In order to illustrate these schemes, let us consider an
example involving records 150 bytes in length. Under either allocation
scheme, each spanned record is split among three physical PE's. We will refer
to the first part of each record as segment A, the second as segment B, and
the third as segment C.
Using one of the "tagging" techniques introduced above, all PE's containing
the A segment of a record are marked with one tag, those containing B segments
with another tag, and those containing C segments with a third. In algorithms
requiring no parallel communication between different segments of a spanned
record, the A, B, and C segments are treated as if they were distinct record
types, only one of which is enabled at any given point in time. As we shall
see shortly, algorithms in which activation (the state of being enabled) and
data must be transferred in parallel between one segment and another within
each record raise a number of more interesting issues. Parallel
inter-segmental transfers are handled differently (and with different
average-case time complexity) in the case of linear and bush allocation. We
begin with a discussion of the former technique.
In a linear allocation of our hypothetical 150-byte records, segment A of the
first record might be assigned to the first PE in the linear sequence used for
linearly adjacent neighbor communication (as described in Section 2). Segment
B of the first record would be stored in that PE having linear number two,
while segment C would be stored in the "linear three" PE. Segments A, B and C
of the second record would then be assigned to the linear four, five and six
PE's, respectively. The third record would be similarly split among the
linear seven, eight and nine PE's, and so on. It should be recalled that two
PE's that are logically adjacent in the linear sequence are not necessarily
physically adjacent in the PPS tree. Thus, a single record may be split among
PE's that are not physically contiguous, leading to a physical interleaving of
records within the PPS. The inorder embedding employed in NON-VON 1, for
example, would lead to the allocation shown in Figure 9. (The PE's are
labelled with the record number and segment of the data; segment B of record
3, for example, is labelled 3B.) 
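The mapping from (record, segment) to a linear PE number, and from there to a
position in the tree, can be sketched as follows. The first function follows
directly from the numbering in the text. The second assumes that the inorder
embedding assigns linear numbers in ordinary inorder-traversal order over a
complete binary tree whose nodes are indexed heap-style (root 1, children 2i
and 2i+1); that indexing and both function names are assumptions made for
illustration, and the figure itself is not reproduced here.

```python
def linear_position(record, segment, segments_per_record=3):
    """1-based linear PE number holding `segment` ('A', 'B', ...) of a 1-based record."""
    offset = ord(segment) - ord("A")
    return (record - 1) * segments_per_record + offset + 1


def inorder_to_heap_index(rank, depth):
    """Heap-style index of the node with 1-based inorder rank `rank`.

    The tree is complete with `depth` levels, i.e. 2**depth - 1 nodes.
    """
    lo, hi, node = 1, 2 ** depth - 1, 1
    while True:
        mid = (lo + hi) // 2          # inorder rank of the current subtree root
        if rank == mid:
            return node
        if rank < mid:                # target lies in the left subtree
            node, hi = 2 * node, mid - 1
        else:                         # target lies in the right subtree
            node, lo = 2 * node + 1, mid + 1


# Where do the three segments of record 1 land in a 15-PE (4-level) tree?
placement = {seg: inorder_to_heap_index(linear_position(1, seg), depth=4)
             for seg in "ABC"}
```

Under these assumptions the segments of a single record fall on a leaf, its
parent, and the sibling leaf, while consecutive records can be further apart,
which illustrates the physical interleaving described above.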
To see how linearly allocated spanned records might be manipulated in the
course of an actual application, let us suppose that our sample records each
describe one of the employees in our earlier example. Assume also that the
first two characters of the DEPARTMENT field are stored in segment A and the
remainder in segment B, and that the salary field is stored entirely within
segment C. Now suppose that we wish to raise the salary of all employees in
the sales department by 10% in a single parallel operation. Earlier in this
section, we presented an informal description of an algorithm for
associatively marking each such employee record in the case of one-to-one
allocation. After disabling all PE's except those containing A segments, we
employ this algorithm to disable all enabled PE's except those having "SA" as
the first two characters of their DEPARTMENT field.
Figure 9: Linear Allocation of Spanned Records
At this point, each PE that remains enabled transfers activation to its right
linear neighbor. This step is realized through the use of a code sequence
that includes a SEND1 RN instruction, which concurrently communicates a
boolean value from each PE to its linear neighbor. At the end of this
sequence, which will not be detailed here, the B segments of all records whose
DEPARTMENT fields begin with "SA" are enabled, and all A (and C) segments are
disabled. The characters "LES" are now matched against the corresponding
characters in all enabled records, leaving enabled only the B segments of all
records corresponding to employees in the sales department. Activation is now
propagated to the C segments of all such records, and a sequence of
instructions issued to increase the salary fields of all such records by 10%.
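The match, shift, match, shift, update pattern just described can be modelled
sequentially in Python. This is only a sketch under stated assumptions: each
PE is a dictionary, the list order stands for the linear neighbor ordering,
shift_right is a stand-in for the code sequence built around SEND1 RN, and
the helper names and sample data are invented for illustration.

```python
def shift_right(enabled):
    """Pass each PE's activation bit to its right linear neighbor."""
    return [False] + enabled[:-1]


def raise_sales_salaries(pes, percent=10):
    """Raise the salary of every record whose DEPARTMENT is 'SALES'."""
    # Enable A segments whose part of the field matches "SA".
    enabled = [pe["tag"] == "A" and pe.get("dept") == "SA" for pe in pes]
    # Transfer activation from each matching A segment to its B segment.
    enabled = shift_right(enabled)
    # Keep only B segments whose part of the field matches "LES".
    enabled = [on and pe["tag"] == "B" and pe.get("dept") == "LES"
               for on, pe in zip(enabled, pes)]
    # Transfer activation on to the C segments and update the salary there.
    enabled = shift_right(enabled)
    for on, pe in zip(enabled, pes):
        if on and pe["tag"] == "C":
            pe["salary"] = pe["salary"] * (100 + percent) // 100
    return pes


def make_record(dept, salary):
    """Split one employee record into its A, B and C segments."""
    return [{"tag": "A", "dept": dept[:2]},
            {"tag": "B", "dept": dept[2:]},
            {"tag": "C", "salary": salary}]


pes = (make_record("SALES", 20000) + make_record("SAFETY", 30000)
       + make_record("LEGAL", 40000))
raise_sales_salaries(pes)
salaries = [pe["salary"] for pe in pes if pe["tag"] == "C"]
```

The "SAFETY" record shows why the second match is needed: its A segment
passes the "SA" test, but activation dies at the B segment.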
In contrast with the linear allocation scheme, the technique of bush 
allocation groups all segments of a given record together physically within 
the PPS, as shown in Figure 10. Each of the "tree-shaped" clusters of PE's
enclosed within a rectangle in Figure 10 is called a bush. Within a given
bush, successive record segments are assigned to PE's according to the 
bounded-neighborhood mapping introduced in Section 2.3. The precise manner in 
which record segments are allocated within a bush, and bushes within the PPS 
tree, is presented elsewhere [18]. 
Bush-allocated spanned records are manipulated in much the same way as their 
linearly-allocated counterparts, but using the direct physical tree 
connections in place of the indirect linear pathways for the parallel
propagation of data and activation. In our example application, the first two
characters of the string "SALES" are matched concurrently in all of the "A"
PE's shown in Figure 10. Each matching PE then enables its parent (a "B" PE)
using a RECV1 LC instruction. Upon completion of this matching operation,
each PE still enabled executes a code sequence including a SEND1 RC
instruction to enable its right child (a "C" PE), which then increases its
salary field by 10%. As in the case of linear allocation, the transfer of
data and activation between segments is fully parallel.
Figure 10: Bush Allocation of Spanned Records (five bushes, labelled Record 1
through Record 5)
There are certain time/space tradeoffs involved in the choice of linear or 
bush allocation for spanned records, however. Let us first compare the space 
required for these two allocation methods. The linear allocation method makes 
progressively more efficient use of the available local RAM as the number of 
PE's spanned by each record increases. In particular, we would expect to
waste only half the space of a single local RAM (32 bytes, in NON-VON 1) per
stored record in the average case. This small amount of waste is due to the
requirement that the beginning of each record be aligned with the beginning of
some PE's local RAM, at least in the method for parallel memory accesses we
have outlined. Asymptotically (with increasing record length), the proportion 
of total available RAM wasted due to alignment thus approaches zero. 
By way of comparison, this "waste factor" approaches 25% in the case of bush
allocation. To gain an intuitive appreciation for the reason for this
comparative inefficiency, consider the case of a spanned record just large
enough to require 2^m PE's for storage. The smallest bush capable of storing
such a record would contain 2^(m+1) - 1 PE's, resulting in a waste of 2^m - 1
PE's worth of RAM (in addition to an alignment penalty), or approximately half
of the total available RAM, for large records. It is easily seen that the
average case waste factor must fall midway between this 50% asymptotic worst
case value and the best case value of no waste, which occurs for records
consuming 2^n - 1 PE's worth of RAM. Thus, linear allocation is more
space-efficient than bush allocation, particularly in the case of large
spanned records.
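The 50% worst case and 25% average figures can be checked numerically with a
short Python sketch. It counts whole PE's only, ignores the alignment penalty
mentioned above, and assumes that every record size between 2^m and
2^(m+1) - 1 PE's is equally likely; that distribution and the function names
are assumptions introduced here, not part of the original analysis.

```python
def smallest_bush(pes_needed):
    """Size of the smallest complete binary bush (2**j - 1 PE's) that fits."""
    size = 1
    while size < pes_needed:
        size = 2 * size + 1
    return size


def bush_waste(pes_needed):
    """Fraction of the bush's PE's left unused by a record of this size."""
    bush = smallest_bush(pes_needed)
    return (bush - pes_needed) / bush


def average_waste(m):
    """Mean waste over record sizes 2**m .. 2**(m+1) - 1, assumed equally likely."""
    sizes = range(2 ** m, 2 ** (m + 1))
    return sum(bush_waste(k) for k in sizes) / len(sizes)


worst = bush_waste(2 ** 10)       # record needing exactly 2**m PE's
best = bush_waste(2 ** 10 - 1)    # record that fills a bush exactly
typical = average_waste(10)
```

For m = 10 the worst case comes out just under one half and the average just
under one quarter, in line with the asymptotic values quoted in the text.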
The space advantage offered by the linear allocation scheme, however, comes at
the cost of an increase in the time complexity of data and activation
transfers among record segments. Note that in the worst case, the data in
question must be transferred from the first to the last PE in the record (with
respect to the ordering imposed for purposes of linear neighbor
communication). The number of instructions required for such a transfer thus
varies linearly with record length in the worst (and, in fact, in the average)
case. In the case of bush allocation, on the other hand, the worst case
occurs when data must be passed between two leaves of a bush. On the average,
such transfers require time logarithmic in the size of the record, a
significant advantage in the case of large records. In the case of transfers
between successive record segments, the bounded-neighborhood ordering reduces
this time to a constant.
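A rough hop count makes the linear versus logarithmic contrast concrete. The
sketch below counts neighbor-to-neighbor transfers only and assumes each hop
costs a constant number of instructions; the function names are hypothetical
and the model ignores the bounded-neighborhood ordering within a bush.

```python
def linear_worst_hops(segments):
    """Hops from the first to the last PE of a linearly allocated record."""
    return segments - 1


def bush_worst_hops(segments):
    """Hops between two leaves on opposite sides of the smallest fitting bush.

    For a positive integer s, s.bit_length() is the number of levels j of
    the smallest complete binary tree with 2**j - 1 >= s nodes. The path
    climbs from one leaf to the bush root and descends to another leaf.
    """
    levels = segments.bit_length()
    return 2 * (levels - 1)


for segments in (3, 15, 127, 1000):
    print(segments, linear_worst_hops(segments), bush_worst_hops(segments))
```

For a three-segment record the two schemes tie at two hops; the gap only
becomes significant for records spanning many PE's.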
One other point is worthy of mention in connection with the choice of
allocation method. First, we note that binary tree algorithms such as those
described by Browning [4] can only be directly implemented on NON-VON when
one-to-one allocation is possible (that is, where records are no larger than
the capacity of a single local RAM, and each is allocated to a different PE).
Many of these algorithms, however, can be easily (and in some cases,
"mechanically") adapted to apply to m-ary trees. (One important class of such
algorithms will be described shortly.)
If bush allocation is chosen, such transformed algorithms can be applied to
spanned records of arbitrary size, providing the bushes themselves are
allocated within the PPS tree in such a way as to preserve an m-ary tree
structure for purposes of inter-record communication. This requirement is
satisfied by a particular kind of bush allocation called landscaped allocation
(discussed in [18]), in which the bushes are configured as an m-ary tree.
While a thorough discussion of algorithms for landscaped bushes is beyond the
scope of the present paper, the basic approach involves choosing m to be the
number of leaves per bush, and treating each bush as a single node in an m-ary
tree, where m = 2^k for some positive integer k. (The set of bushes depicted in
Figure 10 is in fact landscaped, forming a five-node quaternary tree.)
In the case of linear allocation, no such transformation is possible, since 
record segments are interleaved throughout the PPS. The ability to execute 
many parallel algorithms intrinsically tied to a tree-structured topology thus
constitutes another significant advantage of bush allocation. 
3.5 Examples of Symbolic and Numerical Algorithms
In order to illustrate some of the more important techniques used in the
course of applications programming, we now consider a few simple NON-VON
algorithms. First, we describe a highly parallel algorithm for computing the
intersection of two sets. This algorithm is based on a commonly used NON-VON
programming technique involving a combination of associative enumeration and
parallel matching, and is closely related to the algorithms for a number of
other set theoretic and relational database operations. 
Next, we introduce an important technique for the massively parallel execution 
of algebraically associative operations. Using this technique, such 
quantities as the sum, maximum or mean of n numbers may be computed in O(log
n) time. We then consider NON-VON's application to a rather "un-NON-VON-like"
task: the simulation of large-scale physical systems. We conclude by
mentioning a few other examples of symbolic and numerical applications we have
considered for parallel implementation on the NON-VON machine. 
In general terms, the intersection of two sets is performed by sequentially
enumerating the elements of the smaller set, and performing one associative
probe for each such element to determine if it is also present in the larger
set. Suppose, for example, that we wish to intersect two sets of strings,
each stored in its own "virtual PE" (which may be realized using either
one-to-one, packed or spanned records). As in most NON-VON algorithms, these
strings may be located anywhere within the PPS, since all accesses are made on
a content-addressable basis. The elements of the two sets are distinguished
only by tagging, and may in fact be arbitrarily intermingled.
First, we enable all elements of the smaller set by associatively marking
those having the appropriate tag. An arbitrary one of these elements is then
sent to the CP using the RESOLVE and REPORT instructions, and marked so that
it will not be chosen again. This value is then matched against all elements
of the larger set in parallel, and a RESOLVE instruction executed to see if
that string is present. If it is, the element is included in the result set.
This procedure is repeated for all elements in the smaller set not already
marked as having been processed. The running time of this algorithm is linear
in the cardinality of the smaller set, and independent of the size of the
larger one. The union or difference of two sets may be constructed in a 
similar manner. 
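The "enumeration and probing" procedure can be sketched in Python as follows.
This is a sequential model under stated assumptions: the pool of (tag, value)
pairs stands for the intermingled virtual PE's, resolve is a hypothetical
stand-in for the RESOLVE and REPORT pair, and the membership test that would
be a single parallel probe on NON-VON is written as an ordinary scan.

```python
def resolve(enabled_indices):
    """Pick one arbitrary enabled element, or None if none remain."""
    return next(iter(enabled_indices), None)


def associative_intersection(pool, small_tag, large_tag):
    """Intersect two tagged sets stored intermingled in `pool`.

    `pool` is a list of (tag, value) pairs. Returns the matching values and
    the number of probes issued, which equals the size of the smaller set.
    """
    processed = set()
    result, probes = [], 0
    while True:
        # Enable the unprocessed elements of the smaller set.
        enabled = [i for i, (tag, _) in enumerate(pool)
                   if tag == small_tag and i not in processed]
        chosen = resolve(enabled)
        if chosen is None:
            break
        processed.add(chosen)            # mark so it is not chosen again
        value = pool[chosen][1]
        probes += 1
        # One associative probe of the larger set (parallel on the machine).
        if any(tag == large_tag and v == value for tag, v in pool):
            result.append(value)
    return result, probes


pool = [("S", "ann"), ("L", "bob"), ("L", "ann"),
        ("S", "cy"), ("L", "dee"), ("S", "dee")]
common, probes = associative_intersection(pool, small_tag="S", large_tag="L")
```

The probe count depends only on the smaller set, which mirrors the running
time claim in the text; the cost hidden inside each probe is what the
hardware is meant to make constant.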
It is interesting that some of the best algorithms known for set intersection
on a von Neumann machine (the hashed intersection algorithms described by
Trabb-Pardo [20], for example) may in fact be viewed as software emulations of
the associative approach employed in our algorithm. While we have chosen set
intersection to illustrate the "enumeration and probing" paradigm for
pedagogical reasons, NON-VON in fact offers more significant advantages in the
case of certain "more difficult" operations, whose implementation on a von
Neumann machine may in practice be quite expensive. One example having
particular importance in relational database management applications is the
equi-join operation [5], of which set intersection may in fact be considered a
degenerate case.
The tree-structured topology of the PPS is essential to many aspects of
NON-VON's operation, and thus plays an important implicit role in all of the
algorithms we have discussed so far. None of these algorithms, though, has
made explicit use of the tree connections. A simple example of a problem in
which explicit physical tree communication plays an important role is that of
adding a large number of numeric values, each stored in a distinct
"logical record". We might wish, for example, to determine the total yearly
payroll of our hypothetical firm by adding the salary fields of all employees.
In the interest of simplicity, let us first consider the case in which each PE 
in the PPS contains exactly one employee record. First, we disable all nodes 
except those which are the parent of some leaf node. (This is easily
accomplished in constant time using an algorithm that exploits the fact that
the leaves are the only nodes that cannot receive a message from any
descendant node.) Each of these "penultimate" nodes is then (concurrently)
instructed to obtain the salary of its left child (using a sequence of RECV8
LC instructions), and to add this value to its own salary field.
The process is repeated for all right children, at which point each 
penultimate node holds the sum of its own salary and those of its two 
children. At this point, the parents of all penultimate nodes are enabled,
and all other PE's disabled; the entire procedure is then repeated. After 
(log n - 1) such steps, the root node will contain the sum of the salaries of 
all (n - 1) employees. In a full-scale NON-VON prototype containing a million 
PE's, we would expect the effective execution speed for such a problem to be 
on the order of tens of billions of arithmetic operations per second. 
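The level-by-level summation can be modelled in Python as below. The sketch
assumes a complete binary tree of n - 1 PE's stored heap-style (node i has
children 2i and 2i+1), which is an indexing convention chosen here for
convenience; the inner loop over a level is sequential in this model but
corresponds to a single concurrent step on the machine.

```python
def tree_sum(salaries):
    """Sum one salary per PE by passing partial sums up the tree.

    `salaries[i]` is the value held by PE i+1 of a complete binary tree in
    heap order. Returns the total at the root and the number of steps.
    """
    n_pes = len(salaries)
    depth = (n_pes + 1).bit_length() - 1       # number of tree levels
    if n_pes != 2 ** depth - 1:
        raise ValueError("expected 2**d - 1 values, one per PE")
    total = [None] + list(salaries)            # 1-based heap indexing
    steps = 0
    # Start at the penultimate level and work up to the root (level 0).
    for level in range(depth - 2, -1, -1):
        for node in range(2 ** level, 2 ** (level + 1)):
            total[node] += total[2 * node]       # value from the left child
            total[node] += total[2 * node + 1]   # value from the right child
        steps += 1
    return total[1], steps


payroll, steps = tree_sum([1, 2, 3, 4, 5, 6, 7])
```

With seven PE's (n = 8) the root holds the total after two steps, matching
the (log n - 1) step count given above when the logarithm is taken base two.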
By substituting other algebraically associative operations in place of
addition, this algorithm can be adapted to compute many other values of
practical importance. The mean or maximum salary paid to any employee, for
example, can be similarly computed in logarithmic time. Such operations can
of course be combined with the techniques described earlier for the
associative identification of records satisfying various criteria, allowing,
say, the parallel computation of the average salary paid to employees in
Department C who have been employed for between 3 and 5 years.
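One way to picture this generalization is a tree reduction parameterized by
the combining operation. In the Python sketch below, the mean is obtained by
reducing (sum, count) pairs, and records that fail the selection criteria
contribute the identity pair (0, 0). The pair encoding, the function names
and the sample data are assumptions made for illustration; the text does not
say how the mean is actually accumulated.

```python
def tree_reduce(values, op):
    """Combine one value per PE up a complete binary tree in heap order."""
    depth = len(values).bit_length()
    if not values or len(values) != 2 ** depth - 1:
        raise ValueError("expected 2**d - 1 values, one per PE")
    nodes = [None] + list(values)              # 1-based heap indexing
    for level in range(depth - 2, -1, -1):     # penultimate level up to root
        for node in range(2 ** level, 2 ** (level + 1)):
            nodes[node] = op(op(nodes[node], nodes[2 * node]),
                             nodes[2 * node + 1])
    return nodes[1]


def add_pairs(a, b):
    """Componentwise addition of (sum, count) pairs; it is associative."""
    return (a[0] + b[0], a[1] + b[1])


# (department, years employed, salary), one hypothetical record per PE.
employees = [("C", 4, 30000), ("A", 4, 50000), ("C", 5, 40000),
             ("C", 1, 90000), ("B", 3, 20000), ("C", 3, 50000),
             ("A", 9, 70000)]

# Associative selection: non-matching records contribute the identity pair.
contributions = [(salary, 1) if dept == "C" and 3 <= years <= 5 else (0, 0)
                 for dept, years, salary in employees]
total, count = tree_reduce(contributions, add_pairs)
mean_salary = total / count

top_salary = tree_reduce([salary for _, _, salary in employees], max)
```

Only the combining function changes between the two queries; the tree
traversal, and hence the logarithmic step count, stays the same.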
Finally, it should be noted that such algorithms are easily generalized to
support packed records. In our example, we would first add the salary fields
of all record slices, leaving a single combined salary in each PE, at a cost
proportional to the packing factor. The algorithm for one-to-one addition
could then be applied without modification. 
Spanned records can also be accommodated, but only when landscaped allocation
is employed. In order to adapt our algorithm to the case of
landscape-allocated spanned records, we treat each k-level bush as a node in a
2^k-ary tree. The descendants of such a node are precisely the children of all
leaves of the bush in question. In Figure 10, for example, the bush containing
the root node is considered to be the root of a two-level tree with a fan-out
of four. Each of the four other bushes in the tree is treated as a leaf of
this quaternary tree. In the modified algorithm, each bush adds the salaries
of all of its "descendants" into a running sum; after a logarithmic number of
such steps, the bush containing the root node contains the sum of all
salaries.
In order to convey some feeling for the diversity of applications for which
NON-VON may provide substantial performance improvements, we now consider a
problem which might first appear to be poorly suited for execution on a
tree-structured machine. This application, while of only modest economic
importance by comparison with conventional business data processing tasks, has
for some time dominated the attention of most designers and users of
"conventional" supercomputers. Although NON-VON was in fact designed to have
its primary impact within the mainstream of business computing, we will
succumb to the temptation to discuss its use in this more glamorous
scientific application. The task to which we allude is the simulation of
large-scale three-dimensional physical systems.
One technique employed in many such simulation problems uses a large number of
records (often on the order of a million), each corresponding to a small
cubical region in the space being simulated. Each record would typically
contain a small number of scalar or vector variables (temperature or fluid
velocity, for example) whose values are known to change over time according to
certain physical laws involving largely local interactions. The behavior of
the system is simulated by repeatedly applying the following two-step process:
1. The communication step. The values of certain variables at a given
   point are communicated to adjacent and "nearly adjacent" neighbors.
2. The computation step. A new value is computed for each point in
   the system, based on the values of variables at neighboring points.
Typically, the same numerical operations are performed at all points during
each computation step. The two-step cycle is generally repeated many times to
simulate the evolution of a physical system over time.
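The two-step cycle can be illustrated with a toy simulation in Python. The
sketch below uses a small 3 x 3 x 3 grid of cubes, exchanges values only
between face-adjacent neighbors, and applies a simple diffusion-style update;
the update rule, the rate parameter and all names are hypothetical choices
made for illustration, not the physics or the allocation scheme discussed in
the paper.

```python
from itertools import product

# Offsets to the six face-adjacent neighbors of a cube.
NEIGHBOR_OFFSETS = [(1, 0, 0), (-1, 0, 0), (0, 1, 0),
                    (0, -1, 0), (0, 0, 1), (0, 0, -1)]


def communicate(grid):
    """Communication step: every cube collects its neighbors' values."""
    inbox = {}
    for (x, y, z) in grid:
        inbox[(x, y, z)] = [grid[(x + dx, y + dy, z + dz)]
                            for dx, dy, dz in NEIGHBOR_OFFSETS
                            if (x + dx, y + dy, z + dz) in grid]
    return inbox


def compute(grid, inbox, rate=0.1):
    """Computation step: the same update is applied at every point."""
    return {cell: value + rate * sum(nv - value for nv in inbox[cell])
            for cell, value in grid.items()}


def simulate(grid, steps, rate=0.1):
    """Repeat the communication/computation cycle `steps` times."""
    for _ in range(steps):
        inbox = communicate(grid)
        grid = compute(grid, inbox, rate)
    return grid


side = 3
grid = {cell: 0.0 for cell in product(range(side), repeat=3)}
grid[(1, 1, 1)] = 100.0                     # a single hot cube in the center
after = simulate(grid, steps=1)
```

In this model the computation step touches every cube independently, which is
the part that a broadcast instruction stream would execute concurrently; the
communication step is where the machine topology matters.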
Although it was certainly not designed with this sort of task in mind, the 
NON-VON architecture would in fact appear to offer significant asymptotic
advantages over existing supercomputer designs in the solution of such
problems. Not surprisingly, NON-VON permits the computational component of
such problems to be solved in time independent of n, the number of points
being simulated. To do so, each "cube" of the space being simulated is 
associated with a distinct virtual PE, and the sequence of operations is 
broadcast to all such cubes for concurrent execution. 
More interesting is the fact that NON-VON permits an O(n^(1/3)) speedup in the
communication component as well. The algorithm used for communication depends
on a particular scheme for allocating the primitive cubes among the leaves
of the PPS tree in such a way that the nodes at progressively higher levels
correspond to progressively larger cubes. While the details of this algorithm
are beyond the scope of the current paper, NON-VON's asymptotic speedup is
based on the fact that the amount of data passing through each internal node
is proportional to the surface areas of these recursively constructed cubes,
and not to their volumes. The time complexity of a single communication step
is thus O(n^(2/3)), and not O(n), as in the case of a von Neumann machine.
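The surface-versus-volume argument can be made concrete with a short
calculation. It is only a sketch: it assumes the simulated space is a single
cube of n primitive cubes and that one value crosses each unit of surface,
and the constant 6 (the number of faces) is an illustrative detail rather
than a figure taken from the paper.

```latex
\begin{align*}
\text{side length of a cube of } n \text{ points} &= n^{1/3} \\
\text{points on its surface} &\approx 6\,\bigl(n^{1/3}\bigr)^{2} = 6\,n^{2/3} \\
\frac{\text{volume}}{\text{surface}} &\approx \frac{n}{6\,n^{2/3}}
  = \tfrac{1}{6}\,n^{1/3}
\end{align*}
```

Data proportional to the surface gives the O(n^(2/3)) communication time, and
the ratio of volume to surface gives the O(n^(1/3)) speedup relative to O(n).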
While scientific computing applications have not been central to our design
goals, we have investigated the potential application of the NON-VON
architecture to a number of numerical problems. Among the applications we
have explored are a number of signal processing, matrix manipulation, graphics
and image processing problems. NON-VON's content-addressable primitives
permit significant absolute and asymptotic speedups in a number of array
processing applications, but provide particularly natural and efficient
support for problems involving the manipulation of sparse matrices.
If numerical applications were expected to constitute a large share of the
workload of such a machine, the incorporation of a full eight-bit ALU within
each PE would almost certainly be warranted, even at the expense of a modest
decrease in processor density. Such a change would alter neither the basic
NON-VON architecture nor the essential structure of the algorithms we have
developed.
Space does not permit a detailed discussion of all of the applications for
which we have designed algorithms (at various levels of detail) for NON-VON.
It is worth mentioning, though, that the NON-VON PPS supports the execution of
several linear-time sorting algorithms, and that at least one promising
technique for rapidly sorting very large files is currently under
investigation. Highly efficient parallel algorithms for simple transaction
processing, and for a number of other operations critical to large-scale
commercial data processing, have also been explored. Although we have thus
far attacked only a small sampling of the problems to which "real world"
computer systems are applied, it has been our experience that many such
largely symbolic applications prove amenable to massive parallelization on the
NON-VON machine.
References 
[1] B. Arden, "Analysis of Chordal Ring Networks", in IEEE Transactions on
Computers, vol. C-30, pp. 291-301, April 1981.
[2] J. Backus, "Can Programming be Liberated From the Von Neumann Style? A
Functional Style and its Algebra of Programs", in Communications of the ACM,
vol. 21, no. 8, pp. 613-641, August, 1978.
[3] S. Browning, "Hierarchically Organized Machines", in C. Mead and
L. Conway, Introduction to VLSI Systems, Addison-Wesley, 1978.
[4] S. Browning, "The Tree Machine: A Highly Concurrent Computing
Environment", Ph.D. Thesis and Technical Report 93760, California Institute of
Technology, January, 1980.
[5] E. F. Codd, "Relational Completeness of Data Base Sublanguages", in
R. Rustin (ed.), Courant Computer Science Symposium 6: Data Base Systems,
Prentice-Hall, Inc., 1972.
[6] M. Flynn, "Some Computer Organizations and their Effectiveness", in IEEE
Transactions on Computers, vol. C-21, pp. 948-960, September, 1972.
[7] C. Hewitt, "Design of the APIARY for Actor Systems", in Conference Record
of the 1980 LISP Conference, pp. 107-118, August, 1980.
[8] D. E. Knuth, The Art of Computer Programming, vol. 1: Fundamental
Algorithms, Addison-Wesley, 1969.
[9] H. T. Kung and C. E. Leiserson, "Systolic Arrays (for VLSI)", in Sparse
Matrix Proc. 1978, Society for Industrial and Applied Mathematics,
pp. 256-282, 1979.
[10] F. Leighton, Layouts for the Shuffle-Exchange Graph and Lower Bound
Techniques for VLSI, Ph.D. Thesis, Massachusetts Institute of Technology,
August, 1981.
[11] C. E. Leiserson, Area-Efficient VLSI Computation, Ph.D. Thesis, Dept. of
Computer Science, Carnegie-Mellon University, October 1981.
[12] F. P. Preparata and J. Vuillemin, "The Cube-Connected Cycles: A Versatile
Network for Parallel Computation", in Communications of the ACM, vol. 24, no.
5, May 1981, pp. 300-309.
[13] M. Sekanina, "On an Ordering of the Set of Vertices of a Connected
Graph", in Publications of the Faculty of Science of the University of Brno,
no. 412, pp. 137-142, 1960.
[14] C. H. Sequin, A. M. Despain, and D. A. Patterson, "Communication in
X-Tree, a Modular Multiprocessor System", in Proceedings of the Annual
Conference of the ACM, Washington, D.C., December, 1978.
[15] D. E. Shaw, "A Hierarchical Associative Architecture for the Parallel
Evaluation of Relational Algebraic Database Primitives", Stanford Computer
Science Department Report STAN-CS-79-778, October, 1979.
[16] D. E. Shaw, "A Relational Database Machine Architecture", in Proceedings
of the 1980 Workshop on Computer Architecture for Non-Numeric Processing,
Asilomar, California, March, 1980.
[17] D. E. Shaw, Knowledge-Based Retrieval on a Relational Database Machine,
Ph.D. Thesis, Department of Computer Science, Stanford University, 1980.
[18] D. E. Shaw and B. K. Hillyer, "Allocation and Manipulation of Records in
the NON-VON Supercomputer", Columbia Computer Science Department Report,
August, 1982 (in preparation).
[19] C. Thompson, "A Complexity Theory for VLSI", Ph.D. Thesis,
Carnegie-Mellon University, August, 1980.
[20] L. Trabb-Pardo, Set Representation and Set Intersection, Ph.D. Thesis
and Stanford Computer Science Department Report STAN-CS-78-681, December,
1978.
