NON-VON's Performance on Certain Database Benchmarks by Hillyer, Bruce K. et al.
CUCS-78-83 
NON-VON's Performance On Certain Database Benchmarks 1
Bruce K. Hillyer
and
David Elliot Shaw
November 20, 1983
Abstract
In a recent paper by Hawthorn and DeWitt, the projected performance of several
proposed database machines was examined for three relational database queries.
The present paper investigates the performance of the NON-VON supercomputer
for the same queries under comparable assumptions. In the case of simple
queries, a NON-VON machine of comparable size to those considered by Hawthorn
and DeWitt is found to be somewhat faster than the fastest machines examined
in their study; for a more complex database operation, NON-VON is shown to be
five to ten times faster than the fastest of these machines.
1 This research was supported in part by the Defense Advanced Research
Projects Agency under contract N00039-82-C-0427.
Technical Report, Department of Computer Science, Columbia
University, November 20, 1983
1. The NON-VON Supercomputer
NON-VON [1] is a massively parallel non-von Neumann supercomputer, portions of
which are now under construction at Columbia. While the machine has evolved
over the past several years, its most important elements remain the same. All
versions of the machine contain a primary processing subsystem (PPS),
implemented using custom nMOS VLSI circuits, and a secondary processing
subsystem (SPS), based on a set of "intelligent" disk drives.
The PPS comprises a large number (perhaps as many as a million) of processing
elements (PE's), and is constructed from custom nMOS VLSI chips, each
containing a number (eight, at present) of PE's. The PPS is organized to form a
binary tree of PE's. In all but the latest version of the machine (NON-VON 4,
which will not be discussed further in this paper), a single control processor
is attached to the root of the PPS tree. The control processor broadcasts
instructions which are executed simultaneously by all PE's in the PPS.
(NON-VON 4 incorporates a number of processors, each capable of serving as a
control processor for some subtree in the PPS; these "large processing
elements" are interconnected by a high-bandwidth interconnection network.)
Figure 1 provides a description of the PPS.
Our first prototype, called NON-VON 1, was designed using largely ad hoc 
methods. Our principal goals in constructing the NON-VON 1 prototype were to 
validate the essential architectural principles of the NON-VON design, to 
measure the area and aspect ratios of various silicon structures incorporated 
within the PE's, and to perform certain electrical measurements on the 
completed chips. For this reason, little attention was given to either 
area- or time-optimization in the NON-VON 1 prototype chip. The NON-VON PPS 
chip has now been completed, fabricated, and tested through DARPA's MOSIS 
system, and appears at present to be fully functional. 
Figure 1: NON-VON primary processing subsystem (control processor at the
root, internal nodes, leaf nodes)
A second prototype, NON-VON 3, is now under development. (The name NON-VON 2
was assigned to an interesting architectural exercise that we do not currently
plan to carry beyond the "paper-and-pencil" stage, although its central ideas
may well influence future NON-VON designs.) NON-VON 3 will be similar in most
respects to the original NON-VON 1 design, but is expected to incorporate a
number of improvements suggested by the results of our initial experiments in
chip design and software development. In particular, the NON-VON 3 SPE will
feature:
1. An area-efficient eight-bit ALU to replace the one-bit ALU
incorporated in the prototype NON-VON 1 SPE chip.
2. Fewer local registers, based on NON-VON 1 area measurements and
software simulation results.
3. A far better floor plan, formulated using precise measurements
taken from the prototype chip.
4. A generalization of certain NON-VON 1 instructions to support the
more efficient execution of many common instruction sequences.
5. Less silicon area devoted to control path logic.
Our plans called for the NON-VON 3 instruction set to be closely based on, and
with few exceptions more general than, the one employed in NON-VON 1. (Some
of the additions we plan to incorporate in fact correspond to commonly used
macros in our existing NON-VON 1 software.) It was also deemed important that
all existing NON-VON 1 software be simply and mechanically translatable into
NON-VON 3 instructions, so that none of our work to date would be lost.
(Translated programs would take advantage of some, but not all of NON-VON 3's 
enhancements.) In the future, of course, NON-VON 3 software will be written 
using NON-VON 3 instructions, allowing the exploitation of all of these 
features. 
Early in the development cycle of NON-VON 3, it was recognized that the
successful accomplishment of these ambitious area and performance goals would
be greatly accelerated by the availability of a highly automated system for
the specification, design, layout and testing of the constituent processing
elements. To be useful, such a system would have to rapidly and reliably
generate "correct" layouts, allowing the user to experiment with alternative
processing element architectures with the confidence that the resulting layout
would in fact faithfully realize the more abstractly specified design. Within
such a semi-automatic development environment, changes in the instruction set
might be realized in hardware in a fraction of the time that would otherwise
be required, facilitating extensive experimentation with and "fine tuning" of
the PE architecture.
2. Types of Processing Elements 
With minor exceptions, all PE's in the PPS tree are physically identical.
Those differences that do exist are based on the manner in which communication
among adjacent PE's is accommodated. These differences in communication regimes
define four classes of processing elements: 
1. Leaf nodes that are the left child of some node
2. Leaf nodes that are the right child of some node
3. Internal nodes that are the left child of some node
4. Internal nodes that are the right child of some node
By convention, the root is considered an internal node and a left child. Each
type of PE must be aware of its type so that it can properly participate in
the communication instructions that operate between the PE's. For example,
when executing a SEND LEFT CHILD instruction, each left child in the tree must
latch data into an I/O register as well as send data to its left child; the
right children only send data and do not latch any incoming data.
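The four classes can be illustrated with a minimal sketch. It assumes a complete binary tree stored heap-style (root numbered 1, children of node k numbered 2k and 2k+1); this numbering, the `PEType` enumeration and the `pe_type` function are illustrative assumptions and are not taken from the NON-VON design.

```python
from enum import Enum


class PEType(Enum):
    LEAF_LEFT = "leaf, left child"
    LEAF_RIGHT = "leaf, right child"
    INTERNAL_LEFT = "internal, left child"
    INTERNAL_RIGHT = "internal, right child"


def pe_type(index: int, num_pes: int) -> PEType:
    """Classify PE `index` in a complete binary tree of `num_pes` nodes.

    Assumed numbering (hypothetical): root is 1, children of k are 2k, 2k+1.
    By the convention in the text, the root counts as a left child.
    """
    if not 1 <= index <= num_pes:
        raise ValueError("index out of range")
    is_leaf = 2 * index > num_pes          # no children inside the tree
    is_left = index == 1 or index % 2 == 0  # even indices are left children
    if is_leaf:
        return PEType.LEAF_LEFT if is_left else PEType.LEAF_RIGHT
    return PEType.INTERNAL_LEFT if is_left else PEType.INTERNAL_RIGHT


if __name__ == "__main__":
    for i in range(1, 8):
        print(i, pe_type(i, 7).value)
```

In a seven-node tree, for instance, the root and node 2 are internal left children, node 3 is an internal right child, and nodes 4 through 7 alternate between left and right leaves.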
One possible technique for differentiating between the types of PE's would
be to encode the PE type on two control lines that enter the PLA at the inputs
to the AND-plane. In this scheme, one wire would distinguish between left and
right children, while the other would distinguish between leaves and internal
nodes. When the individual chips were wired together to form a complete PPS,
these inputs would be permanently wired to the appropriate constant logic
values to "bind" the type of each PE. The disadvantage of this approach,
however, is that each PE would have to contain a considerable amount of logic
that would never be used, resulting in a waste of silicon area. To save
silicon "real estate", PLATO generates a different PLA for each type of PE,
each containing only that logic which is relevant to a PE of that type.
While the particular set of processor classes enumerated above, along with
their associated communication semantics, is specific to NON-VON-like
tree-structured machines, the presence of "boundary conditions" distinguishing
various classes of processing elements is common to most parallel
architectures. In parallel machines configured as a linear array, for
example, three types of PE's (leftmost, rightmost, and intermediate) may be
defined. Machines based on the orthogonal mesh, on the other hand, may
require as many as nine PE types (central PE's, north, south, east and west
PE's, and the four corner PE's).
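The counts of three and nine PE types can be checked with a short sketch. The function names and the zero-based coordinates are assumptions made for illustration only.

```python
def linear_pe_type(i: int, n: int) -> str:
    """Boundary class of PE `i` in a linear array of `n` PE's (0-based)."""
    if i == 0:
        return "leftmost"
    if i == n - 1:
        return "rightmost"
    return "intermediate"


def mesh_pe_type(row: int, col: int, rows: int, cols: int) -> str:
    """Boundary class of the PE at (row, col) in a rows x cols mesh."""
    vertical = "north" if row == 0 else "south" if row == rows - 1 else ""
    horizontal = "west" if col == 0 else "east" if col == cols - 1 else ""
    # Corners combine both labels (e.g. "northwest"); interior PE's have none.
    return (vertical + horizontal) or "central"


if __name__ == "__main__":
    print(sorted({linear_pe_type(i, 5) for i in range(5)}))
    print(sorted({mesh_pe_type(r, c, 4, 5) for r in range(4) for c in range(5)}))
```

For any mesh with at least three rows and three columns, the second set contains nine distinct labels: the central class, four edge classes and four corner classes.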
3. Design Goals for PLATO:
Several goals were formulated when work began on the PLATO program:
1. The engineer should be able to define the PLA's for all PE types
with a single high-level definition.
2. The program should produce the smallest possible PLA consistent
with a given set of VLSI design rules.
3. The program should be integrated with all other layout and
simulation tools employed for PE design.
4. The program should execute with absolutely no intervention by the
design engineer.
The last goal is intended to minimize the possibility of errors introduced by
the human user, ensuring that all layouts correctly realize the intended
high-level function, and need not be extracted and simulated before insertion
in the PE layout.
4. The PLATO Input File
Among the advantages of the PLATO program is the fact that only one input file
need be created to generate all four types of PLA's. The system makes use of
mnemonic labels wherever possible to aid in the isolation of errors and to
make it easier to identify PLA inputs and outputs in the finished layout. The
same labels are used by a register transfer level simulator for NON-VON
processing elements that is now being designed at Columbia, and which will
interface directly with PLATO, as will be discussed shortly.
To use the PLATO system, the engineer compiles a list of instruction opcodes
with appropriate state variable inputs, and for each opcode, a list of the
control lines that must be excited to execute the instruction. The input file
contains three types of commands:
1. Commands that define the input file format.
2. Commands that describe the placement of inputs and outputs in the
layout.
3. Commands that describe the logical functionality of the array.
Figure 2 provides a simple example of a typical PLATO input file. The first
command line assigns the names INPUT_1, INPUT_2, INPUT_3, and INPUT_4 to the
first four opcode (and, in general, state) bits that will be encountered in
left-to-right order. The second command line specifies the order in which
these opcode and state bits should enter the AND plane of the PLA (listed from
the bottom to the top of the PLA, assuming that the bits enter the AND plane
from the right). The third command line in the example file specifies the
order in which the output lines of the array are to appear (listed from left
to right with the wires leaving the PLA at the bottom).
Figure 2
/* This is an example of a PLATO input file. All comments are ignored. */
/* list of input labels for input file ordering */
INPUT_1, INPUT_2, INPUT_3, INPUT_4;
/* list of same labels, but for PLA ordering */
INPUT_4, INPUT_2, INPUT_1, INPUT_3;
/* Then, a list of output labels are placed in the desired order. */
OUTPUT_1, OUTPUT_2, OUTPUT_3;
/* Mnemonics --- Opcode Bits --- List of Control Lines */
MOV_A-B 0000 OUTPUT_1, OUTPUT_2;
MOV_A-C 0101 OUTPUT_2, OUTPUT_3;
ADD_A-B 1101 OUTPUT_1, OUTPUT_3;
SUB_A-B 1111 OUTPUT_3;
The rest of the file specifies the decoding function of the PLA. To specify 
how an instruction is to be decoded, the engineer makes a list consisting of 
every instruction opcode, together with all appropriate state variable inputs 
and a list of the control lines that must be excited to execute the 
instruction properly. The processor's state machine is represented similarly: 
The present state is considered one of the input bits and the next state is 
defined as if each bit in the state machine were a control line. 
The PLATO input file is easier to use than a truth table because "don't care"
conditions are considered acceptable logic values for input bits. This 
facilitates the separation of state machine specification from instruction 
execution specification because the next-state information need not be 
included in each instruction decoding command line. This separation 
contributes greatly to PLATO's ease of use and tends to minimize the number of 
user errors. PLATO converts this input file format into a truth table format 
which is then used by such other design aids as logic minimization programs. 
The sample input file presented in Figure 2 shows the specification of four
instructions. The MOV_A-B instruction, which causes the contents of the A
register to be transferred to register B, is executed by asserting two control
lines: OUTPUT_1 and OUTPUT_2. In this example, the control line OUTPUT_1
would be the "read A" register control line and the OUTPUT_2 control line
would be the "write B" register control line. Any number of control lines may
be specified: in the case of the subtract instruction, only one control line
is asserted.
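The conversion from instruction lines to truth-table rows can be sketched as follows. This is a minimal illustration and not PLATO's actual parser: it uses a cleaned-up reading of the four instruction lines of Figure 2 (the opcode bit patterns should be taken as illustrative), it ignores comments and the ordering commands, and the function name `to_truth_table` is hypothetical.

```python
OUTPUTS = ["OUTPUT_1", "OUTPUT_2", "OUTPUT_3"]

# Illustrative instruction lines: mnemonic, opcode bits, asserted control lines.
SPEC = """
MOV_A-B 0000 OUTPUT_1, OUTPUT_2;
MOV_A-C 0101 OUTPUT_2, OUTPUT_3;
ADD_A-B 1101 OUTPUT_1, OUTPUT_3;
SUB_A-B 1111 OUTPUT_3;
"""


def to_truth_table(spec, outputs):
    """Turn 'mnemonic bits control-lines;' statements into truth-table rows."""
    rows = []
    for statement in spec.split(";"):
        statement = statement.strip()
        if not statement:
            continue
        mnemonic, bits, asserted = statement.split(None, 2)
        names = {name.strip() for name in asserted.split(",")}
        unknown = names - set(outputs)
        if unknown:
            raise ValueError(f"unknown control lines: {sorted(unknown)}")
        # One output bit per control line, in the declared output order.
        out_bits = "".join("1" if name in names else "0" for name in outputs)
        rows.append((mnemonic, bits, out_bits))
    return rows


if __name__ == "__main__":
    for mnemonic, bits, out_bits in to_truth_table(SPEC, OUTPUTS):
        print(f"{mnemonic:8} {bits} -> {out_bits}")
```

Each row pairs an input pattern with one bit per control line, which is the form a logic minimization program would expect.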
Two extra input bits are represented in the input file for the NON-VON
processing element: the "leaf/not-leaf" and "left-child/not-left-child" lines
that were discussed earlier. These two bits are analyzed by the PLATO program 
upon scanning of the input file and are used to separate the single input file 
into four truth tables, each representing the function of one of the 
corresponding types of PLA. Figure 3 shows a small piece of the NON-VON 3 
PLATO input file. Typically, this file would have more lines than the number 
of opcodes in the instruction set. The present NON-VON 3 PLATO input file, 
for example, has 147 lines. 
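The separation into four truth tables can be sketched as below. The row format (a string of '0', '1' and 'X' characters plus a list of control lines), the bit positions and the function name `split_by_pe_type` are assumptions made for illustration; the report does not describe PLATO's internal representation.

```python
from itertools import product


def split_by_pe_type(rows, leaf_pos, left_pos):
    """Split one table into four, keyed by (leaf bit, left-child bit).

    `rows` is a list of (input_pattern, control_lines); 'X' is a don't-care.
    A row is kept for a PE type when its two type bits match that type or
    are don't-cares; the type bits are then dropped from the pattern.
    """
    tables = {}
    for leaf, left in product("01", repeat=2):
        table = []
        for pattern, lines in rows:
            if pattern[leaf_pos] in ("X", leaf) and pattern[left_pos] in ("X", left):
                reduced = "".join(
                    ch for i, ch in enumerate(pattern)
                    if i not in (leaf_pos, left_pos)
                )
                table.append((reduced, lines))
        tables[(leaf, left)] = table
    return tables


if __name__ == "__main__":
    # Hypothetical rows: two opcode bits followed by leaf and left-child bits.
    rows = [
        ("00XX", ["A"]),   # applies to every PE type
        ("011X", ["B"]),   # leaves only
        ("10X1", ["C"]),   # left children only
        ("1100", ["D"]),   # internal right children only
    ]
    for key, table in split_by_pe_type(rows, leaf_pos=2, left_pos=3).items():
        print(key, table)
```

Each resulting table contains only the rows relevant to one PE type, so logic that a given type can never exercise does not appear in its PLA.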
Figure 3
/******************** NON-VON 3 PLA PLATO inputs ********************/
/* format: [illegible], mnemonic, IR0-IR7, S0, S1, EN1, LF, LC, Ctrl-lines */
/* input lines into PLA (source) */
/* [illegible] */ IR0, IR1, IR2, IR3, IR4, IR5, IR6, IR7,
/* PLA */ S0, S1,
/* PE */ EN1, LF, LC;
/* same input lines, but ordered for PLA */
S1, S0, IR0, IR1, IR2, IR3, IR4, IR5, IR6, IR7,
EN1, LC, LF;
/* Ctrl lines driven by PLA (dest) */
/* dpath */
RD1RAM8, WR1RAM8, WR1RAM1, RD1RAM1, RD1MAR, RD1EN1, SETEN1,
WR1EN1, WR1MAR, [illegible], WR1B8, WR2B1, WR1B1, [illegible], RD1B1,
RD1B8, WR1A8, WR2A1, WR1A1, RD1A1, RD1A8, ADD, SUB, [illegible],
WR2C1, WR2C8, WR1C8, WR1C1, RD1C1, RD1C8, WR2IO8, WR1IO8,
WR2IO1, WR1IO1, RD1IO1, RD2IO1, RD1IO8, RD2IO8,
/* RESOLVE */
KCA, KCB,
/* IO */
DR0, DR1, DR2, IO0, IO1, IO2, IO3, IO4, IO5, IO6, IO7,
/* PLA */
S0next, S1next;
/* (0): nl non-leaf */
/* (1): lf leaf */
            XXXXXXXX    XXX1X  IO6;
            0XXXXXXX    10X1X  S0next;
            11X1XXXX    10X0X  S0next, S1next;
            10X00XXX    00XXX  IO0, IO5, IO6;
            00000001    101XX  WR1B8, RD1A8;
"           "           100XX  RD1A8;
ADD_IO8     01010001    100XX  RD1IO8, ADD;
COMPARE_IO8 [illegible] 101XX  WR2C1, WR2C8, RD1IO8, ADD;
"           "           100XX  RD1IO8, SUB;
[illegible] [illegible] 101XX  WR2A1, WR2B1, RD1IO8, SUB;
[illegible] [illegible] 10XXX  SETEN1;

            10001011    101XX  RD1MAR, IO0, IO5, IO7, DR1, DR2;
            [illegible] 100XX  RD1B8, IO0, IO2, IO4, DR0;
                        101X0  RD1B8, WR2IO8, IO0, IO2, IO4, DR0;
                        101X1  RD1B8, IO0, IO2, IO4, DR0;
            [illegible] 10XXX  KCA, IO0, IO5, IO7, DR1, DR2;
"           "           11XXX  KCB, IO0, IO5, IO7, DR1, DR2;
"           "           00XXX  [illegible];
5. Automatic Weinberger Array Layout: 
In order to achieve an efficient use of silicon area, PLATO generates a logic
array using a variation on the Weinberger Array [3] layout technique. In a
Weinberger Array, the highly regular structure of a conventional PLA is
compressed into a functionally equivalent, but smaller form. The resulting
layout is less regular, and conceptually more complex, than a traditional PLA.
By way of background, an ordinary PLA layout consists of an AND-plane and an
OR-plane. The AND-plane comprises a set of regularly spaced columns 
incorporating logic gates capable of generating the logical conjunctions of 
its inputs. In the context of the processing element application, those 
inputs are the instruction opcodes. The OR-plane is constructed similarly 
from a set of regularly spaced gating elements that are used to generate the 
logical disjunction of the outputs of the AND-plane gates. The OR-plane is 
rotated 90 degrees from the orientation of the AND plane, allowing the outputs 
of the AND-plane to connect to the inputs of the OR-plane. 
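The logical behavior of the two planes can be sketched as follows. This models only the Boolean function (conjunctions followed by disjunctions), not the nMOS circuit that implements it; the product terms and connections are illustrative values in the style of Figure 2, and the function names are hypothetical.

```python
def and_plane(inputs, product_terms):
    """One Boolean per AND-plane column; 'X' in a term is a don't-care."""
    return [
        all(t in ("X", b) for t, b in zip(term, inputs))
        for term in product_terms
    ]


def or_plane(products, connections):
    """One Boolean per control line: the OR of the columns feeding it."""
    return [any(products[i] for i in feeders) for feeders in connections]


# Illustrative product terms (one per instruction) and, for each of three
# control lines, the set of AND-plane columns connected to it.
TERMS = ["0000", "0101", "1101", "1111"]
CONNECTIONS = [{0, 2}, {0, 1}, {1, 2, 3}]


def decode(opcode):
    return or_plane(and_plane(opcode, TERMS), CONNECTIONS)


if __name__ == "__main__":
    for opcode in ("0000", "0101", "1101", "1111", "0001"):
        print(opcode, decode(opcode))
```

Note that each control line depends on only a few AND-plane columns, which is the sparsity the Weinberger layout exploits.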
In constructing most processing elements of the kind used in highly parallel 
machines, the population of transistors in the AND-plane far exceeds that
in the OR-plane. For this reason, a considerable amount of silicon area is 
typically wasted when a conventional PLA is used to realize the control path 
logic in such a processing element. The Weinberger Array is capable of 
providing significant area savings in such applications. 
This technique uses a conventional array structure for the AND-plane, but
obviates the need for a full OR-plane. An example of a Weinberger PLA
generated by PLATO is provided in Figure 4. The AND-plane is shown on the
bottom and the OR-plane on top. The instruction opcode bits enter the
AND-plane from the right. Control lines exit the entire array from the bottom.
The columns in the AND-plane feed into the top and make contact with wires
that run horizontally in polysilicon. These wires extend only as far as is
required for them to form the gates of all transistors in the OR-plane that
require the particular result being carried by the wire. If the layout is
designed appropriately, several different wires can often share a single track
in the OR-plane. Compaction is achieved through the shared use of tracks;
careful placement of AND-plane columns yields a layout with a minimal or
near-minimal number of tracks.
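Track sharing can be illustrated with a minimal sketch. It assumes that each horizontal wire occupies the span between the leftmost and rightmost columns it touches, and that two wires may share a track only if their spans are strictly disjoint. The packing rule used here is the standard left-edge (first-fit) method for intervals, swapped in for illustration; it is not claimed to be PLATO's procedure, and the nets are hypothetical.

```python
def count_tracks(order, nets):
    """Number of tracks needed for `nets` under column ordering `order`.

    `order` lists column labels from left to right. Each net is the set of
    columns one horizontal wire must touch (its AND column plus the control
    line columns it drives).
    """
    pos = {col: i for i, col in enumerate(order)}
    spans = sorted(
        (min(pos[c] for c in net), max(pos[c] for c in net)) for net in nets
    )
    track_ends = []  # rightmost occupied position on each track
    for lo, hi in spans:
        for t, end in enumerate(track_ends):
            if end < lo:            # wire fits to the right of this track
                track_ends[t] = hi
                break
        else:
            track_ends.append(hi)   # no track is free: open a new one
    return len(track_ends)


# Hypothetical nets: i1 drives o1 and o3, i2 drives o2, and so on.
NETS = [{"i1", "o1", "o3"}, {"i2", "o2"}, {"i3", "o1", "o2"}, {"i4", "o3"}]

if __name__ == "__main__":
    print(count_tracks(["i1", "i2", "i3", "i4", "o1", "o2", "o3"], NETS))
    print(count_tracks(["o1", "i1", "i3", "o2", "i2", "o3", "i4"], NETS))
```

With these nets the first ordering needs four tracks and the second needs three, which shows how column placement alone changes the height of the OR-plane.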
The authors are not aware of any earlier PLA generation tools that generate
Weinberger Array layouts automatically. Typically, the layout engineer must
manufacture the Weinberger array by hand. PLATO, on the other hand, applies a
channel-routing algorithm (described in the next section) to automatically
specify the wiring of the Weinberger array. Unlike the usual channel routing
problem, one end of the wire in the channel is connected to an AND-plane
output while the other is the gate of a transistor in the OR-plane. The
automatic Weinberger array layout algorithm incorporated in PLATO has
successfully produced a PLA for NON-VON 3 that is approximately 25% smaller
than the corresponding one produced using conventional PLA generation
techniques.
In the rare instances in which two wires in the array share the same track and
have transistor gates at the ends that meet, the highly compact AND-plane
layout primitives used by PLATO can result in design rule violations in the
Weinberger array. PLATO detects these cases and automatically provides extra
room in the array to resolve each conflict as illustrated in Figure 5.
Empirically, however, such cases have been found to occur quite infrequently.
Out of a total of 208 AND-plane columns in a NON-VON 3 PLA, for example, only
4 incorporate extra space to avoid design rule violations. PLATO's
"management by exception" approach thus permits effective compaction of the
control path without the introduction of design rule errors.
Figure 5
Initial net representation: 4 tracks
Column order, left to right: i1, i2, i3, i4, o1, o2, o3
6. Channel Routing Algorithm 
The channel routing algorithm consists of two basic phases: The first phase 
sets up a data structure that represents the placement of columns in the AND-
plane with control lines that leave the OR-plane and are routed through the 
AND-plane. The second phase permutes this data structure, changing the 
relative positions of all of the columns in the AND-plane in such a way as to 
produce an OR-plane with a minimum number of tracks, and hence the least waste
of silicon area. 
The initial form of the data structure, called an ordering, is generated from 
the minimized truth table that represents the function to be realized by the
PLA. The ordering is a list of the desired AND-plane results appended to a 
list of the control line outputs, enumerated in the order in which they are to 
appear in the layout. Each AND result is generated by a column in the AND-
plane. The AND columns may be permuted at will, but the control line columns 
must retain the order specified by the user. 
The ordering is augmented by a net-list representation of the OR-plane. Like 
the AND-plane, this net-list is initially generated from the truth table. 
Figure 5 shows an example of this data structure. Along the bottom, the
labels i1 through i4 represent AND columns and the labels o1 through o3
represent control line columns. The rest of the figure shows a typical
net-list that represents a connection scheme between the inputs and the
outputs of the OR-plane. At this stage, whether a certain connection between a
horizontal wire and a vertical column is a contact or a transistor gate is
immaterial to the problem of minimizing the number of tracks in the array.
Once the initial setup is completed, a depth-first search for a connection
scheme with the least number of tracks is performed. Figure 6 shows the
result of applying this algorithm to the net-list depicted in Figure 5. Note
that the order of the control line columns is preserved, although their
positions have changed. The ordering of the AND plane columns has been
successfully permuted to allow a connection scheme of only three tracks, the
best possible result for this example. At this point, PLATO determines
whether a connection is a contact between a wire and a column or a transistor
gate.
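The search can be sketched as an exhaustive depth-first enumeration of column orderings in which AND columns move freely while control line columns keep their relative order. This is a simplified illustration with no pruning, not the report's actual implementation; the nets are hypothetical, and tracks are counted with a first-fit interval packing that assumes wires sharing a track must have strictly disjoint spans.

```python
def count_tracks(order, nets):
    """First-fit packing of net spans into tracks for a column ordering."""
    pos = {col: i for i, col in enumerate(order)}
    spans = sorted(
        (min(pos[c] for c in net), max(pos[c] for c in net)) for net in nets
    )
    track_ends = []
    for lo, hi in spans:
        for t, end in enumerate(track_ends):
            if end < lo:
                track_ends[t] = hi
                break
        else:
            track_ends.append(hi)
    return len(track_ends)


def best_ordering(and_cols, ctrl_cols, nets):
    """Depth-first search over orderings; returns (order, track count)."""
    best = (None, float("inf"))

    def dfs(prefix, remaining_and, next_ctrl):
        nonlocal best
        if not remaining_and and next_ctrl == len(ctrl_cols):
            tracks = count_tracks(prefix, nets)
            if tracks < best[1]:
                best = (list(prefix), tracks)
            return
        # Either place the next control line column (order is fixed) ...
        if next_ctrl < len(ctrl_cols):
            dfs(prefix + [ctrl_cols[next_ctrl]], remaining_and, next_ctrl + 1)
        # ... or place any AND column that has not been used yet.
        for col in sorted(remaining_and):
            dfs(prefix + [col], remaining_and - {col}, next_ctrl)

    dfs([], frozenset(and_cols), 0)
    return best


AND_COLS = ["i1", "i2", "i3", "i4"]
CTRL_COLS = ["o1", "o2", "o3"]
NETS = [{"i1", "o1", "o3"}, {"i2", "o2"}, {"i3", "o1", "o2"}, {"i4", "o3"}]

if __name__ == "__main__":
    order, tracks = best_ordering(AND_COLS, CTRL_COLS, NETS)
    print(order, tracks)
```

For these nets the search finds an ordering that needs three tracks. Three is also a lower bound here, because three of the wires must all pass over or touch column o2. A practical implementation would prune partial orderings that already exceed the best count found.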
Figure 6
Finished layout: 3 tracks (optimum)
Column order, left to right: o1, i1, i3, o2, i2, o3, i4
PLATO generates the actual layout description, expressed in Caltech
Intermediate Form (CIF), in two stages. First, the AND-plane is generated
by placing the layout primitives in those positions described by the AND
portion of the truth table. In the second stage, the OR-plane is constructed
by generating primitives in positions specified by the net-list. The final
layout is produced after labels are attached to appropriate places in the
layout.
7. Conclusion
The PLATO tool employs three techniques to minimize the area required for the 
control paths of processing elements for highly parallel machines: 
1. The generation of control paths through the automatic generation, 
using a channel-routing algorithm, of Weinberger Arrays. 
2. The automatic generation of multiple types of PLA, each adapted to
one of the distinct types of PE.
3. The use of highly compact layout primitives, together with an 
automatic procedure for resolving any resulting design rule 
violations. 
Based on area comparisons between the NON-VON 1 and NON-VON 3 PE layouts, it
appears that each of these three techniques has proven responsible for an area
reduction on the order of 25%. The novel techniques embodied in the PLATO
system have thus been responsible in large part for our ability to embed a
number of processing elements in one PPS chip, which is one of the
essential cornerstones of the NON-VON approach to massively parallel
computation.
References 
1. D.E. Shaw, "The NON-VON Supercomputer", Columbia Computer Science Report,
August 1982.
2. Carver Mead and Lynn Conway, Introduction to VLSI Systems, Addison-Wesley
Publishing Co., Reading, Mass., 1980.
3. A. Weinberger, "Large scale integration of MOS complex logic: A layout
method", IEEE J. Solid-State Circuits, vol. SC-2, pp. 182-190, Dec. 1967.
