The DADO Parallel Computer by Stolfo, Salvatore
The DADO Parallel Computer 
Salvatore J. Stolfo 
Department of Computer Science 
Columbia University
New York City, N.Y. 10027
CUCS-63-83
Abstract 
DADO is a parallel, tree-structured machine designed to provide significant performance improvements in the execution of large production systems. A full-scale production version of the DADO machine would comprise a large (on the order of a hundred thousand) set of processing elements (PE's), each containing its own processor, a small amount (8K bytes, in the current prototype design) of local random access memory, and a specialized I/O switch. The PE's are interconnected to form a complete binary tree.
This paper describes the organization of, and programming language for, two prototypes of the DADO system. We also detail a general procedure for the parallel execution of production systems on the DADO machine and outline how this procedure can be extended to include commutative and multiple, independent production systems. We then compare this with the RETE matching algorithm, and indicate how PROLOG programs may be implemented directly on DADO.

Table of Contents
1 Introduction
1.1 Production Systems
1.2 Goal of the Research
2 The DADO Machine Architecture
2.1 The Binary Tree Topology
3 The DADO Prototypes
3.1 The Prototype Processing Element
3.2 The PE kernel
3.2.1 SIMD Mode of Operation
3.2.2 MIMD Mode of Operation
4 Programming DADO
4.1 Conventional PL/M
4.2 Parallel Processing Primitives: PPL/M
4.3 MIMD Mode Primitives
4.4 Examples
5 The Production System Algorithm
5.1 Allocation of Productions and Working Memory
5.2 The Matching Phase
5.3 The Selection Phase
5.4 The Action Phase
5.5 Specialized Production Systems
5.6 Discussion
5.6.1 Compiling patterns
5.6.2 Data Elements may contain variables
5.6.3 Temporal Redundancy
5.6.4 WM-subtree overflow
5.6.5 Duplicate WM elements
6 Future Research
7 Conclusion

List of Figures
Figure 1. An Example Production.
Figure 2. Interconnection of two Leiserson Chips.
Figure 3. The Leiserson Printed Circuit Board.
Figure 4. Hyper-H embedding of a binary tree.
Figure 5. The DADO Prototype Processing Element.
Figure 6. Loading DADO sequentially.
Figure 7. Associative Probing: using DADO as a content-addressable memory.
Figure 8. Functional Division of the DADO tree.
1 Introduction 
As knowledge-based systems grow in size and scope, they will begin to push conventional computing systems to their limits of operation. Even for experimental systems, many researchers reportedly experience frustration with the length of time required for their operation. For applications requiring real-time response from an expert system (for example, electronic warfare or autonomous robot control systems), conventional implementations may not be practical.
DADO [Stolfo et al. 1982, Stolfo and Shaw 1982] is a parallel, tree-structured machine designed to provide highly significant performance improvements in the execution of very large production systems. Production systems form the basis for a wide range of approaches to the implementation of knowledge-based software. A number of working systems implemented by researchers in the field of Artificial Intelligence (AI) have demonstrated the considerable utility of rule-based representation schemes applied to a number of significant tasks requiring extensive domain expertise. Medical diagnosis [Davis 1976], the identification of unknown chemical compounds [Buchanan and Feigenbaum 1978], mineral exploration [Duda et al. 1979] and telephone cable maintenance [Vesonder et al. 1983] are just a few examples. As has been reported by several researchers, rule-based systems appear well-suited to the acquisition of knowledge from human experts, and are easily implemented and readily modified and extended.
1.1 Production Systems 
A production system [Newell 1973; Davis and King 1975; Rychener 1975] is defined by a set of rules, or productions, which form the production memory (PM), together with a database of assertions, called the working memory (WM). Each production consists of a conjunction of pattern elements, called the left-hand side (LHS) of the rule, along with a set of actions called the right-hand side (RHS). The RHS specifies information which is to be added to (asserted) or removed from WM when the LHS successfully matches against the contents of WM.
In operation, the PS repeatedly executes the following cycle of operations:
1. Match: For each rule, determine whether the LHS matches the current environment of WM.
2. Select: Choose exactly one of the matching rules according to some predefined criterion.
3. Act: Add to or delete from WM all assertions specified in the RHS of the selected rule.
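The three-phase cycle can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not DADO code: each hypothetical Rule carries a precompiled LHS test and fixed sets of facts to add and delete, and the selection criterion used here (highest priority, ties going to the earlier rule) merely stands in for "some predefined criterion".

```python
from dataclasses import dataclass
from typing import Callable, FrozenSet, List, Set, Tuple

Fact = Tuple[str, ...]  # a ground literal, e.g. ("Used-in", "chip7", "radio")

@dataclass
class Rule:
    name: str
    priority: int                             # hypothetical selection key
    lhs: Callable[[Set[Fact]], bool]          # True when the LHS matches WM
    adds: FrozenSet[Fact] = frozenset()       # assertions the RHS adds
    deletes: FrozenSet[Fact] = frozenset()    # assertions the RHS removes

def run(rules: List[Rule], wm: Set[Fact], max_cycles: int = 100) -> List[str]:
    """Execute match-select-act cycles until no rule matches; return fired rule names."""
    fired: List[str] = []
    for _ in range(max_cycles):
        # 1. Match: test every rule's LHS against the current contents of WM.
        conflict_set = [r for r in rules if r.lhs(wm)]
        if not conflict_set:
            break
        # 2. Select: exactly one rule; highest priority, earliest listed on ties.
        chosen = max(conflict_set, key=lambda r: r.priority)
        # 3. Act: apply the selected rule's deletions and additions to WM (in place).
        wm -= chosen.deletes
        wm |= chosen.adds
        fired.append(chosen.name)
    return fired
```

The max_cycles bound is an added safeguard against rule sets that never stop matching; it has no counterpart in the cycle described above.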
For pedagogical reasons, we will initially restrict our attention to the case in which both the LHS and RHS are conjunctions of predicates in which all first order terms are composed of constants and existentially quantified variables. Data elements in WM will have the form of arbitrary ground literals in the first order predicate calculus. (When PROLOG is considered in a later section, we briefly describe how WM elements may contain general first-order terms.) A negated pattern in the LHS causes the matching procedure to fail whenever WM contains a matching ground literal, while a negated pattern in the RHS causes all matching data elements in the WM to be deleted.
An example production is presented in figure 1. (Variables are prefixed with an equal sign.)
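To make the matching semantics concrete, here is a small Python sketch that matches the production of figure 1 against a working memory of ground tuples. It is an illustration, not the DADO matcher: the tuple encoding, the leading "NOT" marker for negated patterns, the helper names and the sample facts (chip7, radio, acme) are all assumptions. It also assumes that a negated LHS pattern appears after the positive patterns that bind its variables, as it does in figure 1.

```python
from typing import Dict, Iterator, List, Optional, Set, Tuple

Fact = Tuple[str, ...]
Pattern = Tuple[str, ...]
Bindings = Dict[str, str]

def is_var(term: str) -> bool:
    return term.startswith("=")  # variables carry an '=' prefix, as in figure 1

def unify(pattern: Pattern, fact: Fact, b: Bindings) -> Optional[Bindings]:
    """Match one pattern against one ground fact, extending a copy of bindings b."""
    if len(pattern) != len(fact):
        return None
    b = dict(b)
    for p, f in zip(pattern, fact):
        if is_var(p):
            if b.setdefault(p, f) != f:  # variable already bound to another constant
                return None
        elif p != f:
            return None
    return b

def match_lhs(lhs: List[Pattern], wm: Set[Fact], b: Optional[Bindings] = None) -> Iterator[Bindings]:
    """Yield every binding set under which the conjunction lhs holds in WM."""
    b = {} if b is None else b
    if not lhs:
        yield b
        return
    first, rest = lhs[0], lhs[1:]
    if first[0] == "NOT":
        # A negated LHS pattern fails if any WM element matches it.
        if all(unify(first[1:], f, b) is None for f in wm):
            yield from match_lhs(rest, wm, b)
    else:
        for f in wm:
            nb = unify(first, f, b)
            if nb is not None:
                yield from match_lhs(rest, wm, nb)

def apply_rhs(rhs: List[Pattern], wm: Set[Fact], b: Bindings) -> None:
    """Assert positive RHS patterns; a negated RHS pattern deletes all matching WM elements."""
    for pattern in rhs:
        if pattern[0] == "NOT":
            wm -= {f for f in wm if unify(pattern[1:], f, b) is not None}
        else:
            wm.add(tuple(b.get(t, t) for t in pattern))

# The production of figure 1, transcribed into this encoding.
LHS = [("Part-category", "=part", "electronic-component"),
       ("Used-in", "=part", "=product"),
       ("Supplied-to", "=product", "=customer"),
       ("NOT", "Manufactured-by", "=part", "=customer")]
RHS = [("Dependent-on", "=customer", "=part"),
       ("NOT", "Independent", "=customer")]
```

Adding a fact such as ("Manufactured-by", "chip7", "acme") to WM makes the negated LHS pattern fail, so the rule no longer matches for that customer.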
1.2 Goal of the Research
In practical applications of the sort anticipated by most researchers in the field of AI, the set of productions (and hence the set of LHS patterns against which WM must be matched on each cycle) is expected to be quite large. In the case of the R1/XCON program [McDermott 1981], for example, roughly 2400 specialized productions presently exist to configure a Digital Equipment Corporation VAX computing system. To fulfill their promise for the very-large-scale embodiment of domain-specific expertise, production systems are likely to require at least an order of magnitude more rules, making the question of efficiency a potentially critical concern.
Figure 1: An Example Production.
(Part-category =part electronic-component)
(Used-in =part =product)
(Supplied-to =product =customer)
(NOT Manufactured-by =part =customer)
--> (Dependent-on =customer =part)
(NOT Independent =customer)
Because the matching of each rule against WM is essentially independent of the others (at least in the absence of contention for data in WM), it is natural to attempt a decomposition of the matching portion of each cycle into a large number of tasks suitable for physically concurrent execution on parallel hardware. While this task is in fact considerably more complicated than it might first appear, we believe the immense potential value of a powerful and highly general production system machine warrants serious attention by parallel machine architects and VLSI designers.
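The decomposition itself is simple to express: each rule's LHS test becomes an independent task reading the same working memory. The Python sketch below is hypothetical (parallel_match and the lambda-style LHS tests are our own); it uses a thread pool only to show the structure of the decomposition, since CPython threads will not speed up CPU-bound matching. Freezing WM during the match phase models the absence of contention for data.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, FrozenSet, List, Tuple

Fact = Tuple[str, ...]
LHSTest = Callable[[FrozenSet[Fact]], bool]

def parallel_match(lhs_tests: List[LHSTest], wm: FrozenSet[Fact], workers: int = 4) -> List[int]:
    """Return the indices of the rules whose LHS matches WM.

    Each rule is tested as an independent task. WM is an immutable snapshot
    during the match phase, so the tasks never contend for data.
    """
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = list(pool.map(lambda test: test(wm), lhs_tests))  # map keeps rule order
    return [i for i, matched in enumerate(results) if matched]
```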
Thus, simply stated, the goal of the DADO machine project is the design and implementation of a (cost-effective) high-performance rule processor capable of rapidly executing a production system cycle for very large rule bases (ideally in an amount of time independent of the number of rules). Our goals do not include the design of a high-speed parallel processor capable of (a fruitless) parallel search through a combinatorial solution space.
Much of the experimental research conducted to date on specialized hardware for AI applications has focussed on the realization of high-performance, cleverly designed, but for the most part architecturally conventional machines. (MIT's LISP Machine exemplifies this approach.) Such machines, while quite possibly of great practical interest to the research community, make no attempt to employ hardware parallelism on the massive scale characteristic of our own work.
Recently, several AI researchers (see [Nilsson 1980], for example) have suggested that significant increases in the performance of contemporary AI systems might be realized through distributed processing or the use of specialized parallel hardware. Some attention has been given to issues of parallelism in system organizations for cooperating distributed AI subsystems [Lesser and Erman 1979, Lesser and Corkill 1979]; special hardware for high speed property inheritance and related operations in systems based on semantic network-like formalisms [Fahlman 1979, Hillis 1982]; and the design of machines supporting the parallel execution of certain relational algebraic operations having practical importance in large-scale knowledge-based systems [Shaw et al. 1981; Bonuccelli et al. 1983]. The potential applications of very large scale hardware parallelism to the execution of rule-based systems, however, have remained largely unexplored.
In this paper, we describe DADO, a tree-structured, multiprocessor-based architecture that utilizes the emerging technology of VLSI systems in support of the highly efficient parallel execution of large-scale production systems. Our research has convinced us that DADO may support many other AI applications, including the very rapid execution of PROLOG programs and a large share of the symbolic processing typical of knowledge-based systems.
A small (15 processor) prototype of the machine, constructed at Columbia University from components supplied by Intel Corporation, is operational. Based on our experiences with constructing this small prototype, we believe a larger DADO prototype, comprising 1023 processors, to be technically and economically feasible for implementation using current technology. We believe that this larger experimental device will provide us with the vehicle for evaluating the performance, as well as the hardware design, of a full-scale version of DADO implemented entirely with custom VLSI circuits.
2 The DADO Machine Architecture 
DADO is a fine-grain, parallel machine where processing and memory are extensively intermingled. A full-scale production version of the DADO machine would comprise a very large (on the order of a hundred thousand) set of processing elements (PE's), each containing its own processor, a small amount (8K bytes, in the current design of the prototype version) of local random access memory (RAM), and a specialized I/O switch. The PE's are interconnected to form a complete binary tree.
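The sizes quoted for DADO follow from the geometry of a complete binary tree: a tree with L levels has 2^L - 1 nodes, so the 15-PE and 1023-PE prototypes are complete trees of 4 and 10 levels. The Python sketch below assumes a conventional heap-style numbering of PE's (root 1, children 2i and 2i+1); this numbering is an illustrative convention of ours, not DADO's addressing scheme.

```python
from typing import List, Optional

def tree_size(levels: int) -> int:
    """Number of PE's in a complete binary tree with the given number of levels."""
    return 2 ** levels - 1

# Heap-style numbering (an illustrative assumption): the root is PE 1,
# and PE i has children 2i and 2i+1 in a tree of n PE's.
def parent(i: int) -> Optional[int]:
    return None if i == 1 else i // 2

def children(i: int, n: int) -> List[int]:
    return [c for c in (2 * i, 2 * i + 1) if c <= n]

def is_leaf(i: int, n: int) -> bool:
    return 2 * i > n
```

A 17-level tree already holds 131,071 PE's, on the order of the full-scale machine, yet no PE is more than 16 links from the root.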
Within the DADO machine, each PE is capable of executing in either of two modes. In the first, which we will call SIMD mode (for single instruction stream, multiple data stream [Flynn 1972]), the PE executes instructions broadcast by some ancestor PE within the tree. In the second, which will be referred to as MIMD mode (for multiple instruction stream, multiple data stream), each PE executes instructions stored in its own local RAM, independently of the other PE's. A single conventional coprocessor, adjacent to the root of the DADO tree, controls the operation of the entire ensemble of PE's.
When a DADO PE enters MIMD mode, its logical state is changed in such a way as to effectively "disconnect" it and its descendants from all higher-level PE's in the tree. In particular, a PE in MIMD mode does not receive any instructions that might be placed on the tree-structured communication bus by one of its ancestors. Such a PE may, however, broadcast instructions to be executed by its own descendants, provided all of these descendants have themselves been switched to SIMD mode. The DADO machine can thus be configured in such a way that an arbitrary internal node in the tree acts as the root of a tree-structured SIMD device in which all PE's execute a single instruction (on different data) at a given point in time. This flexible architectural design supports multiple-SIMD execution (MSIMD). Thus, the machine may be logically divided into distinct partitions, each executing a distinct task. This partitioning is the primary source of DADO's speed in executing a large number of primitive pattern matching operations concurrently, as will be detailed shortly.
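The effect of MIMD mode on broadcast can be modeled in Python. In this hypothetical sketch (the PE class, broadcast, build and walk are our own names), the coprocessor is represented as an extra node above the root, and a broadcast simply stops at any MIMD PE, which may in turn broadcast to its own SIMD subtree. The enabled and disabled SIMD states and all timing are ignored.

```python
from dataclasses import dataclass, field
from typing import Iterator, List

@dataclass
class PE:
    ident: int
    mode: str = "SIMD"                                   # "SIMD" or "MIMD"
    children: List["PE"] = field(default_factory=list)
    received: List[str] = field(default_factory=list)    # instructions delivered here

def broadcast(source: PE, instr: str) -> None:
    """Send instr from source (the coprocessor, or any MIMD PE) down to its SIMD descendants."""
    for child in source.children:
        if child.mode == "MIMD":
            continue  # a MIMD PE and the subtree it controls ignore their ancestors
        child.received.append(instr)
        broadcast(child, instr)

def build(ident: int = 1, levels: int = 3) -> PE:
    """Complete binary tree with heap numbering (PE i has children 2i and 2i+1)."""
    pe = PE(ident)
    if levels > 1:
        pe.children = [build(2 * ident, levels - 1), build(2 * ident + 1, levels - 1)]
    return pe

def walk(pe: PE) -> Iterator[PE]:
    """Visit every PE in the subtree rooted at pe."""
    yield pe
    for c in pe.children:
        yield from walk(c)
```

With PE 3 switched to MIMD mode in a seven-PE tree, an instruction from the coprocessor reaches PE's 1, 2, 4 and 5, while an instruction broadcast by PE 3 reaches only PE's 6 and 7: two partitions running different instruction streams at once.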
The DADO I/O switch, which will be implemented in custom VLSI and incorporated within the 1023 processing element version of the machine, has been designed to support communication between physically adjacent tree neighbors. In addition, a specialized combinational circuit incorporated within the I/O switch will allow for the very rapid selection of a single distinguished PE from a set of candidate PE's in the tree. Currently, the 15 processing element version of DADO performs these operations in firmware embodied in its off-the-shelf components.
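The selection of one distinguished PE can be pictured as a reduction up the tree. The Python sketch below is only a software stand-in for what the text describes as a combinational circuit (and, in the 15-PE prototype, firmware). The scoring rule (highest score wins, ties going to the lower-numbered PE), the heap numbering of PE's and the function name select_winner are illustrative assumptions.

```python
from typing import Dict, Optional, Tuple

def select_winner(i: int, n: int, score: Dict[int, int]) -> Optional[Tuple[int, int]]:
    """Return (score, pe) of the winning candidate in the subtree rooted at PE i.

    PE's are heap-numbered 1..n; score maps each candidate PE to its value.
    Each PE compares at most three contenders (itself and the two winners
    reported by its children) and passes one upward, so the answer reaches
    the root after a number of steps proportional to the tree depth.
    """
    if i > n:
        return None
    contenders = [select_winner(2 * i, n, score), select_winner(2 * i + 1, n, score)]
    if i in score:
        contenders.append((score[i], i))
    contenders = [c for c in contenders if c is not None]
    if not contenders:
        return None
    # Highest score wins; ties go to the lower-numbered PE (an assumption).
    return max(contenders, key=lambda c: (c[0], -c[1]))
```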
In the following sections we outline the reasons for implementing a binary tree organization. We then discuss the precise semantics of both execution modes of a DADO PE, and the methods employed to simulate each in the current DADO prototype design. Subsequently we define PPL/M, a variant of the PL/M language, providing several primitives for specifying parallel computation on DADO. The basic DADO algorithms for production system execution are then described and evaluated.
2.1 The Binary Tree Topology 
As VLSI technology continues its downward trend in scaling, many PE's may be implemented on a single silicon chip. If the minimum feature size is halved, for example, four times as many components can be placed on a single chip. Thus, future microcomputer technology may provide additional speed, function and storage capacity for a single PE on a chip. Alternatively, as is the case with many of the approaches to fine-grain parallelism, many simpler processors may be integrated on the same chip. It is crucial, therefore, to interconnect a large number of processors in the most area-efficient topology possible. Further consideration must also be given to methods which efficiently drive the large number of device components to be placed on the chip, and which are not restricted by the severe pin-out limitations of packaging technology.
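The factor of four is a matter of area. Assuming, as an idealization, that each component occupies area proportional to the square of the minimum feature size λ, the number of components N that fit on a chip of fixed area A scales as:

```latex
N(\lambda) \propto \frac{A}{\lambda^{2}}
\quad\Longrightarrow\quad
\frac{N(\lambda/2)}{N(\lambda)} = \frac{A/(\lambda/2)^{2}}{A/\lambda^{2}} = 4
```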
In our initial work, several alternative parallel machine architectures were studied to determine a suitable organization for a special-purpose production system machine. High-speed algorithms for the parallel execution of production system programs were developed for the perfect shuffle [Schwartz 1980] and binary tree machine architectures [Browning 1978]. Forgy [1980] proposed an interesting use of the mesh-connected ILLIAC IV machine [Lowrie et al. 1975] for the parallel execution of production systems, but recognized that his approach failed to find all matching rules in certain circumstances. Of these architectures, the binary tree organization was chosen for reasons of efficient implementation in VLSI technology.
First we note that the entire binary tree of PE's can be implemented using a number of identical chips. This design, first reported by Leiserson [1981], embeds both a complete subtree of PE's and a single interior node on each chip (see figure 2). Four data ports enter the chip. One, called the T port, connects to the root of the chip's subtree, while the other three ports, called F, L and R, connect the single interior PE node to its father, left child and right child, respectively.
A simple recursive procedure allows the construction of an arbitrarily large binary tree using only chips of this type. Figure 2 illustrates this construction for two chips. Note that the resulting circuit consists of a larger binary tree, together with a single unconnected interior node. This scheme may be extended to allow the construction of a planar printed-circuit board layout (also due to Leiserson), which is illustrated in figure 3. Note that the area required for routing wires within the PC board is strictly proportional to the number of chips, allowing the efficient implementation of boards of arbitrary size.
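Our reading of the recursive construction can be checked by counting. In the Python sketch below, which is an assumption-laden model rather than Leiserson's actual layout, joining two equal modules uses one module's spare interior node as the new root (its L and R ports attached to the two T ports, its F port becoming the new external T port) and leaves the other module's spare node unconnected. The chip depth k and the names Module, chip, combine and build are hypothetical.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Module:
    tree_pes: int   # PE's in the complete binary tree reachable from the T port
    spare: int      # interior nodes left unconnected

def chip(k: int) -> Module:
    """One chip: a complete subtree of k levels plus a single interior node."""
    return Module(tree_pes=2 ** k - 1, spare=1)

def combine(a: Module, b: Module) -> Module:
    """Join two equal modules: a's spare interior node becomes the new root,
    with its L and R ports wired to the T ports of a and b.
    b's spare node is left unconnected."""
    assert a.tree_pes == b.tree_pes, "subtrees must be the same size"
    return Module(tree_pes=a.tree_pes + b.tree_pes + 1, spare=b.spare)

def build(m: int, k: int) -> Module:
    """Recursively assemble 2**m identical chips, each holding a k-level subtree."""
    if m == 0:
        return chip(k)
    return combine(build(m - 1, k), build(m - 1, k))
```

For example, eight chips that each hold a 7-level (127-PE) subtree assemble into a 1023-PE tree with one spare node; in general, 2^m chips of k levels yield a tree of 2^(m+k) - 1 PE's.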
Figure 2: Interconnection of two Leiserson Chips.
The subtree of PE's incorporated within each chip is configured according to the "hyper-H" embedding as first reported by Browning [1980] (see figure 4). This construction is highly regular, is area-optimal (in that the amount of silicon area is proportional to the number of PE's) and is easily extended to incorporate larger numbers of PE's as device dimensions scale downward. (Note that the number of ports entering the Leiserson chip remains constant as larger numbers of PE's are incorporated.) It should be stressed that this architectural design is not only highly efficient from a theoretical perspective, but also inexpensive to implement (and replicate) in a working device.
Figure 3: The Leiserson Printed Circuit Board.
Although binary trees implemented in this fashion may seem to scale indefinitely, an interesting theoretical result reported in [Patterson et al. 1981] suggests that synchronous binary tree systems with massive numbers of PE's may not be practical due to the asymptotically growing lengths of wires in the tree. This "wire length problem" may introduce severe clock skew as well as other anomalous electrical problems. Our statistics suggest that for the DADO system this is not a severe limitation, since the number of PE's in a full-scale version would be limited to a manageable number. (See [Fisher and Kung 1982] for a discussion of this problem in terms of clock skew for synchronous systems and possible solutions.)
Other factors contributed to our choice of designing DADO as a binary tree machine. The most important of these factors is the requirement of broadcasting data to a very large number of processors. The contemporary models of VLSI computation dictate that global communication requires an amount of time no less than the logarithm of the number of recipients. It should be noted that no architecture based on components having a bounded valence (that is, a fixed maximum number of external connections) can perform this function in less than logarithmic time. While this lower bound is achieved by tree-structured machines and by certain other "area-expensive" organizations, like the perfect shuffle, other topologies do not share this property. For example, linear arrays require an amount of time for broadcast that is proportional to the number of recipients, while mesh-connected devices require time that is proportional to the square root of the number of recipients. In such systems consisting of many thousands of PE's, a significant delay is unfortunately introduced, requiring complex pipelined communication schemes that suggest an asynchronous design.
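The growth rates can be made concrete with an idealized hop count. This Python sketch is only an illustration under our own assumptions: one time unit per link, and the source placed at the root of the tree, at one end of the linear array, or at a corner of the smallest square mesh that holds n PE's. It ignores wire length and clock skew entirely, and the function name broadcast_steps is hypothetical.

```python
import math

def broadcast_steps(n: int, topology: str) -> int:
    """Idealized number of link traversals needed to reach all n PE's from one source."""
    if topology == "tree":     # complete binary tree, source at the root
        return math.ceil(math.log2(n + 1)) - 1
    if topology == "linear":   # source at one end of a linear array
        return n - 1
    if topology == "mesh":     # source at a corner of the smallest square mesh holding n PE's
        side = math.isqrt(n - 1) + 1   # ceil(sqrt(n)) for n >= 1
        return 2 * (side - 1)
    raise ValueError(f"unknown topology: {topology}")
```

For n = 1023 the sketch gives 9 hops for the tree, 62 for the mesh and 1022 for the linear array.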
Figure 4: Hyper-H embedding of a binary tree.
Finally, we note that binary trees do have certain limitations of practical importance. Although broadcasting a small amount of information to a large number of recipients is efficiently handled by binary trees, the converse is, in general, unfortunately not true. That is, for certain computational tasks (permutation of data, for example) the effective bandwidth of communication is restricted by the top of the tree. It is easy to prove that if n items stored at the leaves of a binary tree are to be randomly permuted, the number of such items which must pass through the root of the tree is proportional to n: each item's destination lies in the opposite subtree of the root with probability one half, so on average n/2 items must cross the root. Fortunately, as we shall see shortly, this "binary tree bottleneck" does not arise in the execution of production systems or PROLOG.
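The bottleneck is easy to observe numerically. The Python sketch below (root_crossings and mean_random_crossings are hypothetical names) counts, for a permutation of n leaves, how many items travel between the two subtrees of the root and must therefore pass through it; it assumes n is even, with the first n/2 leaves under the root's left child.

```python
import random
from typing import List

def root_crossings(perm: List[int]) -> int:
    """Count items that must pass through the root when leaf i's item moves to leaf perm[i].

    Leaves 0 .. n/2-1 lie under the root's left child and the rest under its
    right child, so an item crosses the root exactly when its source and
    destination lie in different halves.
    """
    half = len(perm) // 2
    return sum((src < half) != (dst < half) for src, dst in enumerate(perm))

def mean_random_crossings(n: int, trials: int, seed: int = 0) -> float:
    """Average root crossings over uniformly random permutations of n leaves."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        perm = list(range(n))
        rng.shuffle(perm)
        total += root_crossings(perm)
    return total / trials
```

For a reversal permutation every item crosses the root; for uniformly random permutations the count averages about n/2.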
Many of the decisions made in designing DADO were strongly influenced by the organization of the NON-VON supercomputer [Shaw 1982] and the Caltech tree machine [Browning 1978]. Perhaps the best way to distinguish DADO from these two tree machine architectures is by considering the modes of execution of each of the constituent PE's, and the implications for the hardware design.
The proposed Caltech tree machine is a full MIMD device incorporating thousands of PE's in a full-scale version. Each PE executes its own independent program and thus requires a substantial amount of local memory, as is the case in the DADO machine. Communication is supported by a buffered message passing protocol, where the recipient of each message is identified by relatively complex I/O circuitry at each node, whereas other forms of communication (for example, global broadcast) are implemented by sequential logic.
NON-VON, by comparison, is a full SIMD, massively parallel synchronous device incorporating millions of simple, highly area-efficient PE's, each associated with only 64 bytes of local RAM. In general, each NON-VON PE executes an instruction broadcast from a single control processor, located at the root of the tree, and thus requires a highly efficient method of global broadcast. The I/O switch incorporated within each node of the NON-VON tree contains a few inverters driving the signals along the broadcast bus, and therefore communication is implemented by high-speed combinational logic.
DADO, on the other hand, is capable of executing in both SIMD and MIMD modes, and thus contains elements of both machine designs. DADO incorporates a combinational I/O switch similar to that employed in NON-VON. However, each DADO PE may drive the I/O switch, in addition to the single coprocessor of DADO. Thus, DADO also supports very high speed global broadcast. Because substantial programs are replicated within various PE's in the tree, however, a DADO PE has been designed with a more general (8 bit) processor as well as an 8K byte RAM. Thus, DADO cannot achieve the same processor density as is possible in NON-VON.
The DADO design attempts to synergistically merge the advantages of both NON-VON and the Caltech tree machine. It is not clear whether or not the NON-VON approach to single-instruction stream, massive parallelism will be substantially limited by its inability to execute independent programs concurrently. Nor is it clear whether or not the Caltech approach of large-scale parallelism, albeit substantially lower than that of NON-VON, can achieve the same throughput as NON-VON for certain computational problems. It is our hope that experimentation with the DADO prototype may provide some of these answers, and begin to elucidate the precise nature of the tradeoffs involved with both approaches.
3 The DADO Prototypes 
A 15-element DADO1 prototype, constructed from (partially) donated parts supplied by Intel Corporation, has been operational since April 25, 1983. The two wire-wrap board system, housed in a chassis measuring 3.5 by [?] by 17.5 inches (roughly the size of an IBM PC), is clocked at 3.5 megahertz, producing 4 million instructions per second (MIPS). (The effective usable MIPS is considerably less due to the significant overhead incurred in interprocessor communication. For each byte quantity communicated through the system, 12 machine instructions are consumed at each level in the tree while executing an asynchronous 4-cycle handshake protocol.) DADO1 contains 124K bytes of user random access storage and 60K bytes of read only memory. A much larger version, DADO2, is currently under construction, which will incorporate 1023 PE's constructed from two commercially available Intel chips. DADO1 does not provide enormous computational resources. Rather, it is viewed as the development system for the software base of DADO2, and is not expected to demonstrate a significant improvement in the speed of execution of a production system application.
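The cost of DADO1's handshake is easy to estimate. The Python sketch below simply multiplies out the quoted figure of 12 instructions per byte per tree level, under our own assumptions that the cost is charged once for every level a byte crosses and that a root-to-leaf transfer crosses one fewer links than the tree has levels (3 in the 15-PE DADO1). It ignores the overlap that pipelining provides, and the function names are hypothetical.

```python
INSTR_PER_BYTE_PER_LEVEL = 12   # DADO1 figure quoted in the text

def links_root_to_leaf(num_pes: int) -> int:
    """Depth (levels minus one) of a complete binary tree with num_pes = 2**L - 1 PE's."""
    return num_pes.bit_length() - 1

def comm_overhead(num_bytes: int, links: int) -> int:
    """Instructions consumed moving num_bytes across `links` tree levels."""
    return num_bytes * links * INSTR_PER_BYTE_PER_LEVEL
```

Under these assumptions a 100-byte message sent from the root to a leaf of DADO1 consumes 3,600 instructions of communication overhead along its path.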
DADO2 will be implemented on 16 printed circuit boards, manufactured through the DARPA-supported MOSIS silicon foundry system, and housed in an IBM Series/1 cabinet (donated by IBM Corporation). The system, which will be integrated within a standard 19 inch rack, provides 8 megabytes of user storage. A DEC VAX 11/750 (partially donated by DEC Corporation) serves as DADO2's coprocessor (although an Apollo or SUN workstation may be used as well) and is the only device a user of DADO2 will see. Thus, DADO2 is considered a transparent back-end processor to the VAX 11/750. The DADO2 system will have roughly the same hardware complexity as a DEC VAX 11/750 system, and, if amortized over 12 units, will cost in the range of 70 to 90 thousand dollars to construct, considering 1983 market retail costs. The DADO2 custom I/O chip is planned for implementation in gate array technology and will allow DADO2 to be clocked at 12 megahertz, the full speed of the Intel chips. The effective machine instruction cycle time achievable is 1.8 microseconds, producing a system with a raw computational throughput of 570 million instructions per second. Note that, unlike in the DADO1 machine, little of this computational resource is wasted in communication overhead, since asynchronous communication is replaced with a synchronous combinational logic circuit.
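The throughput and storage figures for DADO2 follow from simple arithmetic, reproduced here in Python using only numbers quoted in the text (1023 PE's, a 1.8 microsecond effective instruction cycle, 8K bytes per PE); the variable names are our own.

```python
PES = 1023                 # DADO2 processing elements
CYCLE_US = 1.8             # effective instruction cycle time, in microseconds
RAM_PER_PE = 8 * 1024      # bytes of local RAM per PE

mips = PES / CYCLE_US              # instructions per microsecond = millions per second
total_ram = PES * RAM_PER_PE       # aggregate user storage, in bytes

print(f"raw throughput: {mips:.0f} MIPS")             # prints: raw throughput: 568 MIPS
print(f"user storage: {total_ram / 2**20:.2f} MiB")   # prints: user storage: 7.99 MiB
```

The 568 MIPS result corresponds to the "570 million instructions per second" quoted above, and 1023 x 8K bytes is just under 8 megabytes.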
In the following sections we detail the prototype processing element design as well as the software systems 
implemented for the prototypes. 
3.1 The Prototype Processing Element 
Each PE in the DADO1 prototype system incorporates an Intel 8751 microcomputer chip, serving as the processor, and an 8K x 8 Intel 2186 RAM chip, serving as the local memory. (A simple logic gate packaged in a Texas Instruments TI-7408 chip is used to properly integrate the RAM and processor.) DADO2 will incorporate a slightly modified PE. The Intel 2187, which is fully compatible with but faster than an Intel 2186, replaces the DADO1 RAM chip, allowing the processor to be clocked at its fastest speed. Further, the custom I/O chip will contain extra circuitry to quickly refresh the Intel 2187, and thus replaces the TI chip employed in DADO1. The resulting system consists of a 3 chip PE, 64 of which may be integrated on a single printed circuit board.
Although the original version of DADO had been designed to incorporate a 2K byte RAM within each PE, an 8K byte RAM was chosen for the prototype PE to allow a modest degree of flexibility in designing and implementing the software base for the full version of the machine. In addition, this extra "breathing room" within each PE allows for experimentation with various special operations that may be incorporated in the full version of the machine in combinational circuitry, as well as affording the opportunity to critically evaluate other proposed (tree-structured) parallel architectures through software simulation.
(It is worth noting, though, that the proper choice of "grain size" is an interesting open question. That is, through experimental evaluation we hope to determine the size of RAM for each PE, traded off against the number of such elements for a fixed hardware complexity, that is appropriate for the widest range of production system applications. Thus, future versions of DADO may consist of a number of PE's each containing an amount of RAM significantly larger or smaller than that implemented in the current prototype systems.)
The Intel 8751 is a moderately powerful 8-bit microcomputer incorporating a 4K erasable programmable read only memory (EPROM) and a 256-byte RAM on a single silicon chip. One of the key characteristics of the 8751 processor is its I/O capability. The four parallel 8-bit ports provided in a 40 pin package have contributed substantially to the ease of implementing a binary tree interconnection between processors. Indeed, DADO1 was implemented within 5 months of delivery of the hardware components. Figure 5 illustrates the DADO1 prototype PE at about twice actual dimensions.
In DADO1 the communication primitives and execution modes of a DADO PE are implemented by a small kernel system resident within each processor EPROM. The specialized I/O switch envisaged for the larger version of the machine is simulated in the smaller version by a short sequential computation. As noted, the 1023 element prototype would be capable of executing in excess of 570 MIPS (on 8 bit data), assuming interprocessor communication to be implemented with a combinational logic I/O switch. Although pipelined communication is employed in the kernel design, it is expected that only 150 million instructions per second would be achieved using the current design. Thus, the design and implementation of a custom I/O chip forms a major part of our current hardware research activities.
It should be noted that, in keeping with our principles of "low-cost performance," we have selected a processor technology one generation behind existing available microcomputer technology. For example, DADO2 could have been designed with 1023 Motorola 68000 processors or Intel 80286 chips. Instead, we have chosen a relatively slow technology to limit the number of chips for each PE, as well as to demonstrate our most important architectural principles in a cost-effective manner.
Furthermore, since the Intel 8751 does not press current VLSI technology to its limits, it is surely within the realm of feasibility to implement a DADO2 PE on a single silicon chip. Thus, although DADO2 may appear impressive (an inexpensive, compact system with a thousand computers executing roughly 500 million instructions per second), its design is very conservative and probably at least an order of magnitude less powerful than a similar device using faster technology. It is our conjecture, though, that the machine will be practical and useful and that many of its limitations will be ameliorated as VLSI continues its downward trend in scaling. (DADO3 may serve to prove this conjecture.)
Figure 5: The DADO Prototype Processing Element. (Components shown: Intel 8751 processor with 4K x 8 EPROM, Intel 2186 8K x 8 RAM and TI-7408 logic gate; ports labeled P, LC and RC.)
3.2 The PE kernel 
As noted, the 4K EPROM of the Intel 8751 stores the system kernel of a PE, which includes code performing the most basic communication and synchronization functions as well as the simulation of SIMD and MIMD modes of execution. Presently, the kernel software occupies less than 1K bytes of EPROM. Thus, many frequently used procedures (pattern matching, for example) are planned for implementation in DADO2's EPROM.
The kernel system is designed in such a way as to logically divide the 8K RAM space of the Intel 2186 chip into two portions, one for each of the execution modes. The size of these two portions is specified by the software declarations. By convention the initial portion of RAM, referred to as SIMD RAM, is a reserved data space for variables and constants operated upon by a PE while in SIMD mode. The remaining portion of RAM is used for storage of code, as well as the local variables used during the MIMD mode of operation. (The 8 megabyte memory provided in DADO2 is fully distributed in 8K quantities. A program written for DADO is, thus, limited to an 8K address space.) A set of reserved memory locations within the SIMD RAM have special significance to the kernel system. In the following sections we define each of these reserved locations when appropriate and briefly describe their use.
3.2.1 SIMD Mode of Operation
A processor in SIMD mode (henceforth, a SIMD PE) can be instructed to enter one of two states, as
determined by the contents of a special single-bit variable, called EN1, resident within the SIMD RAM of
the PE. If EN1 is set high (logical 1) within a PE, the processor will be in the SIMD enabled state;
otherwise it is in the SIMD disabled state. The kernel simulates the SIMD mode of operation in the
following way.
SIMD ENABLED state
A DADO PE in SIMD enabled state will repeat the following steps:
1. Accept an instruction from the broadcast bus (received from its parent).
2. Pass the instruction on to its descendants, provided the PE is not a leaf processor and its
immediate tree neighbors (children) are logically connected (see below).
3. Execute the instruction.
SIMD DISABLED state
A DADO PE in SIMD disabled state will repeat the following steps:
1. Accept an instruction from the broadcast bus.
2. As in the enabled case, it will pass the instruction on to its descendants if they are logically
connected; however,
3. The instruction is ignored unless it is one of the following special functions, to be detailed
shortly:
- RESOLVE
- ENABLE
- a communications instruction (SEND, RECV, BROADCAST or REPORT)
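The two SIMD states can be mimicked in a few lines of Python. This is a minimal sketch under our own assumptions, not the DADO kernel: the `SimdPE` class, the `(opcode, argument)` instruction encoding and the `SPECIAL` set are hypothetical, "execution" is reduced to recording the instruction, and only `ENABLE` is given an effect (it raises EN1).

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple

# An instruction is modelled as (opcode, argument); this encoding is our own.
Instruction = Tuple[str, Optional[int]]

# Functions that a SIMD-disabled PE still honours, per the kernel description.
SPECIAL = {"RESOLVE", "ENABLE", "SEND", "RECV", "BROADCAST", "REPORT"}


@dataclass
class SimdPE:
    en1: bool = True          # EN1: True = SIMD enabled, False = SIMD disabled
    connected: bool = True    # logically connected to its parent?
    children: List["SimdPE"] = field(default_factory=list)
    executed: List[Instruction] = field(default_factory=list)

    def accept(self, instr: Instruction) -> None:
        """One pass of the kernel loop for an instruction arriving from the parent."""
        op, _ = instr
        # Step 2: pass the instruction on to logically connected children
        # (a leaf has no children, so nothing is forwarded).
        for child in self.children:
            if child.connected:
                child.accept(instr)
        # Step 3: an enabled PE executes everything; a disabled PE ignores
        # everything except the special functions.
        if self.en1 or op in SPECIAL:
            if op == "ENABLE":
                self.en1 = True
            self.executed.append(instr)
```

A disabled PE therefore still forwards every instruction to its connected children and still obeys `ENABLE`, which is what lets a single broadcast reawaken a whole subtree.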
3.2.2 MIMD Mode of Operation 
Likewise, the kernel system simulates the MIMD mode of operation (a PE in this mode is henceforth called
a MIMD PE) in the following way:
1. A MIMD PE is logically disconnected from its parent. (Thus, instructions from the broadcast
bus will not be accepted.)
2. The PE executes code from its local memory. (Unless otherwise instructed, the tree below the
processor remains logically connected and thus can be utilized as a SIMD processor.)
3. Upon terminating its MIMD operation, it enters SIMD disabled state, after broadcasting an
instruction to disable its descendants.
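The MIMD entry and exit rules can be sketched in the same style. The `KernelPE` class and its `run_mimd` method are hypothetical names of ours; the local program is any Python callable, and the "disable" broadcast of step 3 is modelled simply by clearing EN1 in every descendant.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterator, List


@dataclass
class KernelPE:
    mode: str = "SIMD"        # "SIMD" or "MIMD"
    en1: bool = True          # EN1 flag, meaningful in SIMD mode
    connected: bool = True    # logical link to the parent's broadcast bus
    children: List["KernelPE"] = field(default_factory=list)

    def descendants(self) -> Iterator["KernelPE"]:
        for child in self.children:
            yield child
            yield from child.descendants()

    def run_mimd(self, program: Callable[["KernelPE"], None]) -> None:
        # 1. Disconnect from the parent: broadcast instructions are no longer seen.
        self.mode, self.connected = "MIMD", False
        # 2. Run local code; the subtree below stays connected and usable as SIMD PEs.
        program(self)
        # 3. Broadcast "disable" to the descendants, then rejoin the tree
        #    in the SIMD disabled state.
        for d in self.descendants():
            d.en1 = False
        self.mode, self.en1, self.connected = "SIMD", False, True
```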
4 Programming DADO 
PL/M [Intel 1982] is a high-level language designed by Intel Corporation as the host programming
environment for applications using the full range of Intel microcomputer and microcontroller chips. A
superset of PL/M, which we call PPL/M, has been implemented as the system-level language for the
DADO prototypes. PPL/M provides a set of facilities to specify operations to be performed by
independent PE's in parallel. In this section we discuss the additions to PL/M of new data types, built-in
functions and syntactic conventions for the parallel execution of DADO programs.
4.1 Conventional PL/M 
Before defining the primitives for parallel computation on DADO, we begin with a brief overview of
PL/M. Intel's PL/M language is based on:
- a statement-oriented syntactic structure based largely on PL/I,
- a full range of supported statements typical of a high-level language including assignment,
nested if, case, and several forms of iteration (while, and auto-increment),
- block structure, employing several forms of the PL/I DO statement,
- a full range of data type facilities including arrays, structures and pointer-based dynamic
variables, as well as subroutine and function definition statements,
- and lastly, all data is either of type BIT, BYTE or WORD (2 bytes).
A PL/M program is constructed from blocks of associated statements, delimited by either a DO or
PROCEDURE statement, and a terminating END statement. As is typical of a block-oriented language,
nesting is permitted following the usual conventions for variable scoping. Explicit data definition and
typing is specified primarily with the DECLARE statement.
4.2 Parallel Processing Primitives: PPL/M
The following two syntactic conventions have been added to PL/M for programming the SIMD mode of
operation of DADO. The design of these constructs was influenced by the methods employed in specifying
parallel computation in the GLYPNIR language [Lowrie et al. 1975] designed for the ILLIAC IV parallel
processor. The SLICE attribute defines variables and procedures that are resident within each PE. The
second addition is a syntactic construct, the DO SIMD block, which delimits PPL/M instructions
broadcast to descendent SIMD PE's. (In the following definitions, optional syntactic constructs are
represented within meta brackets.)
The SLICE attribute:
DECLARE variable[(single-array-dimension)] type SLICE;
name: PROCEDURE[(parameter-list)] [type] SLICE;
Each declaration of a SLICEd variable will cause an allocation of space for the variable to occur within
the SIMD RAM of each PE. SLICEd procedures are automatically loaded within the MIMD portion of
RAM (by an operating system executive resident in DADO's coprocessor).
Within a PPL/M program, an assignment of a value to a SLICEd variable will cause the transfer to occur
within each enabled SIMD PE concurrently. A constant appearing in the right hand side will be
automatically broadcast to all enabled PE's. Thus, the statement
X = 5;
where X is of type BYTE SLICE, will assign the value 5 to each occurrence of X in each enabled SIMD
PE. (Thus, at times it is convenient to think of SLICEd variables as vectors which may be operated upon,
in whole or in part, in parallel.) However, statements which operate upon SLICEd variables can only be
specified within the bounds of a DO SIMD block.
DO SIMD block:
DO SIMD;
r-statement1;
...
r-statementn;
END;
Each r-statement is restricted to be one of the following:
- an assignment statement incorporating only SLICEd variables and constants, or
- a call to a subroutine that has been declared to be of type SLICE (user defined SLICEd
procedures may not execute any MIMD mode primitives), or
- a call to a local user defined procedure, by way of the MIMD function (to be detailed shortly).
A non-SLICEd variable may appear within an r-statement only as an argument to the BROADCAST
function, to be defined shortly. The parameters of a SLICEd subroutine are assumed to be of type SLICE
by default. Examples of the use of these features are provided in a later section.
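Treating SLICEd variables as vectors suggests a direct model: one list slot per PE, plus a per-PE EN1 mask that gates every assignment made inside a DO SIMD block. The `SliceMachine` class and its method names are hypothetical; `disable_unless` models the PPL/M idiom of assigning the result of a boolean test to EN1, and `enable_all` models `Call ENABLE`.

```python
from typing import Callable, Dict, List


class SliceMachine:
    """Hypothetical model: each SLICEd variable is a vector with one slot per PE."""

    def __init__(self, n_pes: int):
        self.n = n_pes
        self.en1 = [True] * n_pes              # per-PE enable flag
        self.slices: Dict[str, List[int]] = {}

    def declare(self, name: str, init: int = 0) -> None:
        # DECLARE name BYTE SLICE: space is allocated in every PE's SIMD RAM.
        self.slices[name] = [init] * self.n

    def assign_const(self, name: str, value: int) -> None:
        # "name = value;" inside DO SIMD: the constant is broadcast and
        # stored only by enabled PEs.
        vec = self.slices[name]
        for pe in range(self.n):
            if self.en1[pe]:
                vec[pe] = value

    def disable_unless(self, name: str, test: Callable[[int], bool]) -> None:
        # "EN1 = <test>;": enabled PEs whose test fails disable themselves;
        # disabled PEs ignore the instruction and stay disabled.
        vec = self.slices[name]
        for pe in range(self.n):
            if self.en1[pe]:
                self.en1[pe] = test(vec[pe])

    def enable_all(self) -> None:
        # Call ENABLE: reawaken every PE.
        self.en1 = [True] * self.n
```

Used together, `disable_unless`, an assignment and `enable_all` give the "conditional execution" pattern: only PEs passing the test see the assignment.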
4.3 MIMD Mode Primitives 
In addition to the full range of instructions available in PPL/M, a DADO PE in MIMD mode will have
available to it the following list of built-in functions. (It should be noted that DADO's coprocessor may
execute the full range of PPL/M instructions as well.) These functions have been modelled after the
machine instructions employed in the NON-VON supercomputer as reported in [Shaw 1982]. For
consistency, the NON-VON registers are used in precisely the same manner as that defined in the
NON-VON instruction set.
Call RESOLVE;              -- the SLICEd variable A1, resident in all PE's,
                              is set to zero except in the "first" PE.
                              The register CPRR in the MIMD PE is set high.
                              If no descendent PE has A1=1, CPRR is set low.
Call REPORT;               -- the contents of A8 in the one enabled descendent PE
                              are written to the register CPIO in the MIMD PE. If
                              more than one descendent PE is enabled, the result
                              is undefined.
Call BROADCAST(<byte>);    -- the value of the single byte argument is stored
                              in the A8 variable of every descendent SIMD PE.
Call SEND(<neighbor-PE>);  -- the contents of register I08 of <neighbor-PE> are
                              set to the value stored in A8. <neighbor-PE> may be
                              one of: LC (left tree child)
                                      RC (right tree child)
Call RECV(<neighbor-PE>);  -- the contents of register A8 are set to the value
                              stored in I08 of <neighbor-PE>. <neighbor-PE>
                              may be one of: LC, RC, and P (parent)
Call MIMD(<address>);      -- any ENABLED SIMD PE will enter MIMD mode of
                              operation and execute code stored locally in RAM
                              starting at address <address>
Call EXIT;                 -- the MIMD PE will terminate its MIMD operation.
                              The PE will issue an instruction to its SIMD
                              descendants to disable themselves (set EN1 low)
                              and will reconnect itself to its parent in SIMD
                              disabled state.
Call ENABLE;               -- the EN1 variables of all descendent PE's are set
                              high, thus enabling the entire subtree.
Call DISABLE;              -- the EN1 variables of all descendent PE's are set
                              low, thus disabling the entire subtree.
The BROADCAST function is used to communicate a specified BYTE value from a MIMD PE or DADO's
coprocessor to all (enabled) PE's in the subtree it roots. The REPORT instruction, on the other hand,
provides the means for the contents of a variable of a single enabled PE to be communicated to the
MIMD PE. As a side effect, the DADO2 I/O chip provides the means for the byte quantity to be
simultaneously BROADCAST to all (connected) processors within the tree. The REPORT instruction is
intended for use only when it is known that at most one PE is currently enabled, for example, after use of
a RESOLVE instruction detailed below.
The SEND and RECV instructions are used for communication among physically adjacent PE's. Two
special SLICEd variables of type BYTE, called A8 and I08, take part in the data transfer operation.
Unlike the RECV instruction, a PE cannot SEND data to its parent, since this operation would be
undefined if both children of that parent were enabled. A parent is capable of receiving data from its
children through the use of RECV instructions. It should be noted that it is always possible to RECV data
from a PE, regardless of whether it is enabled, but an attempt to SEND data to a disabled PE will not
result in a transfer of data. In the case of a disconnected or nonexistent PE, all I/O operations return a
value of 0.
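The SEND/RECV rules are easy to misremember, so here is a small model of them. It assumes (our choice, not the paper's) that the PEs of a complete binary tree are numbered in heap order, so PE `i` has children `2i+1` and `2i+2`; the `TreeIO` class and its fields `a8` and `i08` (standing for the A8 and I08 registers) are hypothetical.

```python
from typing import List


class TreeIO:
    """Hypothetical model of SEND/RECV between neighbouring PEs."""

    def __init__(self, n: int):
        self.n = n
        self.a8: List[int] = [0] * n
        self.i08: List[int] = [0] * n
        self.en1: List[bool] = [True] * n

    def neighbor(self, pe: int, which: str) -> int:
        """Index of LC, RC or P in heap order, or -1 if that PE does not exist."""
        if which == "LC":
            idx = 2 * pe + 1
        elif which == "RC":
            idx = 2 * pe + 2
        elif which == "P":
            idx = (pe - 1) // 2 if pe > 0 else -1
        else:
            raise ValueError(which)
        return idx if 0 <= idx < self.n else -1

    def send(self, pe: int, which: str) -> None:
        """SEND: a child's I08 is set from this PE's A8; never to the parent."""
        if which == "P":
            raise ValueError("a PE cannot SEND to its parent")
        dst = self.neighbor(pe, which)
        if dst >= 0 and self.en1[dst]:       # no transfer to a disabled PE
            self.i08[dst] = self.a8[pe]

    def recv(self, pe: int, which: str) -> None:
        """RECV: this PE's A8 is set from the neighbour's I08, enabled or not."""
        src = self.neighbor(pe, which)
        self.a8[pe] = self.i08[src] if src >= 0 else 0   # nonexistent PE reads as 0
```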
A PE may be disabled by transferring a 0 into its EN1 variable using an ordinary assignment statement in
PPL/M, or by use of the DISABLE function. In a typical application, the contents of EN1 will be set to
the result of some boolean test prior to the execution of such a store instruction, resulting in the selective
disabling of all PE's for which the test fails. This technique supports the "conditional" execution of a
particular code sequence. Following the execution of such a sequence, an ENABLE instruction is issued to
"awaken" all disabled PE's.
The RESOLVE instruction is used in practice to disable all but a single PE, chosen arbitrarily from among
a specified set of PE's. First, the A1 flag, also a SLICEd variable, is set to one in all PE's to be included
in the candidate set. The RESOLVE instruction is then executed, causing all but one of these flags to be
changed to zero. (Upon executing a RESOLVE instruction, one of the inputs to the MIMD PE will
become high if at least one candidate was found in the tree, and low if the candidate set was found to be
empty. This condition code is stored in the SLICEd variable CPRR, which exists within the MIMD PE.)
By issuing an assignment to EN1, all but the single, chosen PE may be disabled, and a sequence of
instructions may be executed on the chosen PE alone. In particular, data from the chosen PE may be
communicated to the MIMD PE through a sequence of REPORT commands.
If the candidate set is first saved (using another flag in each PE), each of the candidates can be chosen in
turn, subjected to individual processing, and removed from the candidate set, allowing the sequential
processing of all candidates. Typically, the individual processing performed for each chosen candidate
involves the broadcasting of information contained in, or derived from, that candidate to other PE's
within the DADO tree. This paradigm for sequential enumeration is employed as a sort of "outer loop" in
a number of parallel DADO algorithms.
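The RESOLVE-driven "outer loop" can be sketched sequentially as follows. The sketch follows the DADO1 convention described in the following paragraph, where the lowest-numbered candidate survives; the function names `resolve` and `enumerate_candidates` are hypothetical, and the list index stands in for a PE's position in the tree.

```python
from typing import List, Optional


def resolve(a1: List[int]) -> bool:
    """1-bit RESOLVE: clear A1 everywhere except the first (lowest-numbered)
    PE whose A1 is set. Returns CPRR: True iff a candidate existed."""
    winner: Optional[int] = next((i for i, f in enumerate(a1) if f), None)
    for i in range(len(a1)):
        a1[i] = 1 if i == winner else 0
    return winner is not None


def enumerate_candidates(candidate: List[int]) -> List[int]:
    """Visit every candidate PE, one per iteration of the outer loop."""
    saved = list(candidate)        # the second flag holding the remaining candidates
    order = []
    while True:
        a1 = list(saved)           # A1 = saved flag, in every PE
        if not resolve(a1):        # CPRR low: candidate set exhausted
            break
        chosen = a1.index(1)       # EN1 = A1 would leave only this PE enabled
        order.append(chosen)       # ... process it here, e.g. REPORT its data ...
        saved[chosen] = 0          # remove it from the candidate set
    return order
```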
In DADO1, the RESOLVE function is implemented using special sequential code, embedded within the
EPROM, that propagates a series of "kill" signals in parallel from all candidate PE's to all (higher-numbered)
PE's in the tree. In DADO2, the RESOLVE operation has been generalized to operate on 8-bit
data, producing the maximum value stored in some candidate PE. Repeated use of this max-RESOLVE
function allows for the very rapid selection of multiple byte data. This circuit has proven very useful for a
number of DADO algorithms which made use of the SEND and RECV instructions primarily for ordering
data within the tree. The use of the high-speed max-RESOLVE often obviates the need for such
communication instructions. Consequently, the view of DADO as a binary tree architecture has become,
fortuitously, nearly transparent. The 1-bit RESOLVE implemented in DADO1 is exhibited in the
examples which follow.
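Repeated max-RESOLVE on multi-byte keys works like a radix selection from the most significant byte down: after each step, only the PEs holding the winning byte stay enabled. The sketch below assumes every key has the same length and is stored most significant byte first; `max_resolve` and `select_max_multibyte` are hypothetical helpers, not DADO2 primitives.

```python
from typing import List


def max_resolve(values: List[int], enabled: List[bool]) -> int:
    """Model of an 8-bit max-RESOLVE: the largest byte held by an enabled PE (0 if none)."""
    return max((v for v, e in zip(values, enabled) if e), default=0)


def select_max_multibyte(keys: List[bytes]) -> List[int]:
    """Indices of the PE(s) holding the largest key, one max-RESOLVE per byte."""
    enabled = [True] * len(keys)
    for pos in range(len(keys[0])):
        column = [k[pos] for k in keys]          # byte `pos` of every PE's key
        best = max_resolve(column, enabled)
        # EN1 = (byte == best): PEs that lose on this byte drop out.
        enabled = [e and c == best for e, c in zip(enabled, column)]
    return [i for i, e in enumerate(enabled) if e]
```

Each byte costs one max-RESOLVE plus one broadcast comparison, so a k-byte key needs k rounds regardless of how many PEs compete.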
Finally, the MIMD function causes an enabled SIMD PE to begin executing in MIMD mode. The
argument address is first broadcast as the base address of the local user defined procedure to be executed.
(PPL/M provides a very simple and direct means for specifying the address of an object within a program,
including the base address of a subroutine.) Return to SIMD mode is performed by the EXIT function
when the MIMD PE terminates its computation. (Synchronization can be performed with sequential logic
to explicitly test whether or not data may be transferred to the MIMD PE. Thus, when such a test
indicates that data may be transferred, the MIMD PE has terminated its operation and reconnected itself
to the tree above in SIMD mode by way of the EXIT function. Several algorithms for the synchronization
of MIMD PE's within DADO have been reported elsewhere [Stolfo 1981].)
4.4 Examples
Code for two fundamental operations is presented in this section: the first loads the DADO tree
sequentially; the second is used to associatively mark all PE's that match a given search string.
Figure 6: Loading DADO sequentially.
/* We will assume that this program is executed within
   DADO's CP. The system function READ is used to load
   string data into a buffer from some external source. */
DO;
   DECLARE Intelligent-record(64) BYTE SLICE; /* An instance of each   */
   DECLARE Not-done BIT SLICE;                /* of these SLICEd       */
   DECLARE Index BYTE SLICE;                  /* variables is in each PE. */
   DECLARE Buffer(64) BYTE;
   DECLARE i BYTE;
   DO SIMD;
      Call ENABLE;            /* All PE's are enabled. */
      Index = 0;
      Not-done = 1;           /* No PE has been loaded yet. */
   END;
   Call READ(Buffer);         /* Data provided by some external source. */
   DO WHILE length(Buffer) > 0;       /* AND CPRR */
      DO SIMD;
         Call ENABLE;
         A1 = Not-done;       /* Unloaded PE's have A1 set high. */
         Call RESOLVE;        /* Only one A1 is now set. */
         EN1 = A1;            /* Selectively disable all but one PE. */
         Not-done = 0;
      END;
      IF NOT CPRR THEN quit;  /* No PE's enabled, thus overflow. */
      DO i = 0 to length(Buffer) - 1;
         DO SIMD;
            Call BROADCAST(Buffer(i));        /* The single enabled PE */
            Intelligent-record(Index) = A8;   /* will execute this     */
            Index = Index + 1;                /* code alone.           */
         END;
      END;
      Call READ(Buffer);
   END;   /* Repeat for other PE's in the DADO tree. */
END;
The second example implements the most basic operation for associative matching on DADO. This
procedure was the first PPL/M program executed on the DADO1 system.
Figure 7: Associative Probing: using DADO as a content-addressable memory.
ASSOCIATIVE-PROBE: PROCEDURE (Search);
   DECLARE Intelligent-Record(64) BYTE SLICE; /* An instance of each   */
   DECLARE Index BYTE SLICE;                  /* appears in every PE.  */
   /* We assume each of the instances of Intelligent-Record
      has been previously loaded within the DADO tree. */
   DECLARE i BYTE;            /* i is local to this routine. */
   DECLARE Search(64) BYTE;   /* The search string is provided by
                                 some external source. */
   DO SIMD;
      Call ENABLE;   /* All descendent PE's enter SIMD enabled state. */
      A1 = 0;        /* All A1 flags within the tree below are cleared. */
   END;
   DO i = 0 to length(Search) - 1;  /* Repeat the following for
                                       each character of the search string. */
      DO SIMD;
         Call BROADCAST(i);   /* The value of the index variable i is
                                 broadcast and loaded in each SLICEd
                                 A8 variable within the tree, */
         Index = A8;          /* and transferred to a local SLICEd
                                 variable. */
         Call BROADCAST(Search(i));  /* The ith character of the
                                        search string is then broadcast and
                                        loaded in each A8 register. */
         EN1 = A8 = Intelligent-Record(Index);
         /* After comparing the search character currently stored in A8
            with the locally stored data, disable those PE's which do not
            match (by assignment of logical 0 to the enable variable EN1). */
      END;
   END;
   DO SIMD;
      A1 = 1;        /* Only those PE's that remain enabled, that is,
                        only those which match the search string, will
                        set their A1 variables high. */
      Call RESOLVE;  /* Lastly, test whether or not any A1 flags
                        in the tree are high; CPRR is set accordingly. */
   END;
   IF CPRR THEN /* We have responders. */ ;
END ASSOCIATIVE-PROBE;
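For readers more used to a modern language, the same probe can be written in Python. This is a sequential simulation under our own assumptions: each `bytes` record stands for one PE's Intelligent-Record, the inner loop stands for work the PEs do simultaneously, and a record matches when it begins with the search string, mirroring the character-by-character comparison of the PPL/M procedure. The name `associative_probe` is hypothetical.

```python
from typing import List


def associative_probe(records: List[bytes], search: bytes) -> List[int]:
    """Return the indices of the PEs that are still enabled (the responders)."""
    en1 = [True] * len(records)                     # Call ENABLE
    for i, ch in enumerate(search):                 # BROADCAST(i); BROADCAST(Search(i))
        for pe, rec in enumerate(records):          # on DADO: all PEs in one step
            if en1[pe]:                             # disabled PEs ignore the compare
                en1[pe] = i < len(rec) and rec[i] == ch   # EN1 = A8 = record(Index)
    return [pe for pe, e in enumerate(en1) if e]    # A1 = 1 in each responder
```

The CPRR test at the end of the PPL/M procedure corresponds to checking whether the returned list is non-empty.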
5 The Production System Algorithm 
The general production system algorithm implemented on DADO is presented in this section. Following
this we compare the algorithm with RETE-based matching systems and outline several ways the
algorithm may be modified to adequately treat various anomalous situations.
5.1 Allocation of Productions and Working Memory
In order to execute the production system cycle on the DADO machine, the I/O switches are configured in
such a way as to divide the DADO machine into three conceptually distinct components. One of these
components consists of all PE's at a particular level within the tree, called the PM-level, which is chosen
in a manner to be detailed shortly. The other two components are the upper portion of the tree, which
comprises all PE's located above the PM-level, and the lower portion of the tree, which consists of all PE's
found below the PM-level. This functional division is illustrated in figure 8.
Figure 8. Functional Division of the DADO tree.
[Diagram labels: PM-level (match, determine relevance, instantiate); WM-subtrees (content-addressable memories).]
Each PE at the PM-level is used to store a single production (although this restriction is easily relaxed at
a modest cost in matching time). The PM-level must thus be chosen such that the
number of nodes at that level is at least as large as the number of productions in PM. The subtree rooted
by a given PE at the PM-level will store that portion of WM that is relevant to the production stored in
that PE. A ground literal in WM is defined to be relevant to a given production if its predicate symbol
agrees with the predicate symbol in one of the pattern literals in the LHS of the production, and all
constants in the pattern literal are equal to the corresponding constants in the ground literal. Intuitively,
the set of ground literals relevant to a given production consists of exactly those literals that might match
that production, given appropriate variable bindings.
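The relevance test and the choice of the PM-level translate into two short functions. In this sketch literals are Python tuples `(predicate, arg1, arg2, ...)`, pattern variables are strings beginning with `?`, and a pattern must also agree in arity with the ground literal; these encodings and the names `relevant` and `pm_level` are assumptions made for illustration.

```python
from typing import List, Tuple

Literal = Tuple[str, ...]   # (predicate, arg1, arg2, ...), a hypothetical encoding


def is_variable(term: str) -> bool:
    # Assumption of this sketch: pattern variables start with '?'.
    return term.startswith("?")


def relevant(ground: Literal, lhs: List[Literal]) -> bool:
    """True if some LHS pattern has the same predicate (and arity) and every
    constant in the pattern equals the corresponding constant in `ground`."""
    for pattern in lhs:
        if pattern[0] == ground[0] and len(pattern) == len(ground):
            if all(is_variable(p) or p == g for p, g in zip(pattern[1:], ground[1:])):
                return True
    return False


def pm_level(n_productions: int) -> int:
    """Shallowest level (root = level 0) whose 2**level PEs can hold one production each."""
    level = 0
    while 2 ** level < n_productions:
        level += 1
    return level
```

Picking the shallowest adequate level leaves the deepest possible WM-subtrees below each production.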
The constituent subtrees that make up the lower portion of the tree will be referred to as the
WM-subtrees. For simplicity, we will assume in this paper that each PE in a WM-subtree rooted by some
production contains exactly one ground literal relevant to that production. (In a fashion similar to that for
production allocation, "packing" techniques may be employed at the expense of a modest increase in time
for matching.) It should be noted that, since a single ground literal may be relevant to more than one
production, portions of WM may in general be replicated in different WM-subtrees.
During the match phase, the WM-subtrees are used as content-addressable memories, allowing parallel
matching of a single pattern element in time independent of the size of WM. The upper portion of the
tree is used to select one of the matching productions to be executed in O(log P) time, where P is the
number of productions, and to broadcast the action resulting from this execution. Details of these
functions follow.
5.2 The Matching Phase 
At the beginning of the matching phase, all PE's at the PM-level are instructed to enter MIMD mode, and
to simultaneously (and independently) match their LHS against the contents of their respective
WM-subtrees. The ability to concurrently match the LHS of all productions accounts for some, but not all, of
the parallelism achieved in DADO's matching phase. In addition, the matching of a single LHS is
performed in a parallel manner, using the corresponding WM-subtree as an associative processing device.
The simplest case involves the matching of a single LHS pattern predicate containing at most one instance
of any variable. (The reader may wish to peruse the Associative-Probe procedure detailed in figure
7 before proceeding.) In order to match the predicate,
(Part-category =part electronic-component),
for example, the PM-level PE corresponding to the production in question would first broadcast a
sequence of instructions to all PE's in the WM-subtree that would cause each one to simultaneously
compare the field beginning in, say, its fifth RAM cell (the location of some SLICEd variable, for instance)
with the string (or some syntactic token representing) "Part-category". All non-matching PE's would
then be disabled, causing all subsequent instructions to be ignored for the duration of the match. Next,
the string "electronic-component" would be broadcast, along with the instructions necessary to match this
string against, say, the field beginning in the thirty-fifth RAM location of all currently enabled PE's.
After again disabling all non-matching PE's, the only PE's still enabled would be those containing a
ground literal that matches the predicate in question. If this were the only predicate in the LHS,
matching would terminate at this point. As noted, the time required for this matching operation depends
only on the complexity of the pattern predicate, and not on the number of ground literals stored in the
WM-subtree.
We should mention, though, that the depth of the DADO tree defines DADO's machine cycle time. Thus,
although we state that access to WM is achievable in time independent of the number of WM elements, it
is actually dependent on a logarithmic function of the number of PE's in the tree. Since WM is fully
distributed among the majority of available PE's, it is also logarithmic in the number of WM elements.
However, this delay is bounded by log(n) combinational gate delays, which is proportional to the latency
period for access to a conventional RAM of comparable size and hardware complexity, and thus may be
ignored in analysis of the time complexity.
The general matching algorithm, which accommodates a LHS consisting of a number of conjoined
predicates, possibly including common pattern variables, is considerably more complex. In this case, after
associatively probing for the first pattern predicate, each value contained in a matching WM element,
stored in the same relative location as a pattern variable, is sequentially enumerated and used for further
associative probing for subsequent patterns. In the worst case, this operation may require enumeration of
each of the elements in WM. However, the high-speed content-addressable memory operations reduce the
"look-up" time to a constant factor, and obviate the need for the expensive overhead incurred by indexing
schemes required of sequential implementations. The result of this general matching operation is a set of
variable bindings corresponding to all possible instantiations of the production in question that are
consistent with the contents of WM.
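The enumerate-and-probe scheme for a conjunctive LHS can be sketched as a sequence of probes, each run once per binding set produced by the previous one. The literal encoding (tuples, `?`-prefixed variables) and the names `probe` and `match_lhs` are our own assumptions; on DADO each call to `probe` would be a single associative operation on the WM-subtree rather than a Python loop.

```python
from typing import Dict, List, Tuple

Literal = Tuple[str, ...]
Bindings = Dict[str, str]


def probe(wm: List[Literal], pattern: Literal, env: Bindings) -> List[Bindings]:
    """All extensions of `env` under which `pattern` matches some WM literal."""
    results = []
    for lit in wm:
        if len(lit) != len(pattern) or lit[0] != pattern[0]:
            continue
        new = dict(env)
        ok = True
        for p, g in zip(pattern[1:], lit[1:]):
            if p.startswith("?"):
                # Bind a fresh variable, or check it against its existing binding.
                if new.setdefault(p, g) != g:
                    ok = False
                    break
            elif p != g:                 # constants must match exactly
                ok = False
                break
        if ok:
            results.append(new)
    return results


def match_lhs(wm: List[Literal], lhs: List[Literal]) -> List[Bindings]:
    """Every binding set consistent with all LHS patterns (one instantiation each)."""
    envs: List[Bindings] = [{}]
    for pattern in lhs:
        envs = [e2 for env in envs for e2 in probe(wm, pattern, env)]
    return envs
```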
5.3 The Selection Phase 
Since each production is asynchronously matched against the data stored in its WM-subtree, the
production matching phase will in general terminate at different times within each PM-level PE. At the
end of the matching phase, the PM-level PE's must thus be synchronized before initiation of the selection
phase. In support of this synchronization operation, each PM-level PE sets a local flag upon completion of
its own matching task. The DADO RESOLVE circuit permits the DADO tree to compute a logical
conjunction of these flags in time equal to O(log n) gate delays. (In this case, the max-RESOLVE function
of DADO2 operates on flags set to 0 upon termination, and 1 otherwise. A low bit result signals
synchronization.) DADO's tree-structured topology, along with the combinational, as opposed to
sequential, computation of this n-ary "logical AND", leads to a synchronization time which is dominated
by that required for matching, and which may, as in the case of WM access, be ignored in analysis of the
time complexity of the production system cycle.
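The synchronization step amounts to a max over one flag per PM-level PE, combined pairwise up the tree. The hypothetical `tree_max` helper below also counts the combining levels, which grows as the base-2 logarithm of the number of flags (rounded up); a result of 0 means every PE has finished.

```python
from typing import List, Tuple


def tree_max(flags: List[int]) -> Tuple[int, int]:
    """Combine flags pairwise, level by level, as a tree of max gates would.
    Flags are 0 for "finished", 1 for "still matching". Returns (result, levels)."""
    values, levels = list(flags), 0
    while len(values) > 1:
        if len(values) % 2:
            values.append(0)          # pad an odd level with a neutral element
        values = [max(values[i], values[i + 1]) for i in range(0, len(values), 2)]
        levels += 1
    return values[0], levels
```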
The selection of a single production to "fire" from among the set of all matching productions also requires
time proportional to the depth of the tree. Unlike the synchronization operation, however, the primitive
operations required for selection are computed using sequential logic in DADO1. We assume that each
PM-level PE performs some local computation prior to the synchronization operation that yields a single
numerical priority rating. PE's containing matching productions are assigned positive values, while other
PM-level PE's are assigned a priority of zero. We also assume that each PM-level PE has a distinct PE
tag, stored in a SLICEd variable within its local memory, which may be used to uniquely identify that PE.
After synchronization, all PM-level PE's are instructed to enter SIMD mode. Each such PE's priority
rating is then passed up to its parent. Each parent compares the priority ratings of its two
children, retaining the larger of the two, along with the unique tag of the "winner". The process is
repeated at successively higher levels within the tree until a single tag arrives at the root. This tag is then
broadcast to all PM-level PE's for matching, disabling all except the one having the highest priority
rating, which remains enabled for the action phase. Note that in DADO2, this operation is replaced by a
few sequential steps employing several applications of the max-RESOLVE circuit.
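The DADO1 selection scheme is a tournament. A minimal sketch, assuming the PM-level PEs are numbered left to right and that this number serves as the unique tag; ties are broken in favour of the left child, a choice the text does not specify, and `select_winner` is a hypothetical name. The input list must be non-empty.

```python
from typing import List, Optional, Tuple


def select_winner(priorities: List[int]) -> Tuple[Optional[int], int]:
    """Each parent keeps the larger priority of its two children plus that
    child's tag. Returns (winning tag or None if nothing matched, levels climbed)."""
    entries = list(enumerate(priorities))       # (tag, priority) per PM-level PE
    levels = 0
    while len(entries) > 1:
        if len(entries) % 2:
            entries.append((-1, 0))             # padding: a non-matching dummy PE
        entries = [left if left[1] >= right[1] else right
                   for left, right in zip(entries[0::2], entries[1::2])]
        levels += 1
    tag, prio = entries[0]
    return (tag if prio > 0 else None), levels
```

With eight PM-level PEs the winning tag reaches the root after three comparison levels, which is the O(log P) behaviour claimed for the upper tree.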
5.4 The Action Phase 
At this point, the "winning" PE is instructed to instantiate its RHS, which is then reported to the root.
Next, all PM-level PE's are enabled, and the RHS of the winning instance is broadcast to all. The details
of the action phase are made more complex by the importance of avoiding unnecessary replication of WM
literals within the lower portion of the tree, and of reclaiming local memory space freed by the deletion of
such literals. These functions are based on associative operations similar to those employed in the
matching operation.
The PE's at the PM-level are instructed to enter MIMD mode and to concurrently update their
WM-subtrees as specified by the RHS of the winning instance.
First, the PM-level PE's perform an associative probe for each literal to be deleted from WM, enabling
only those PE's in the WM-subtrees whose local memories are to be reclaimed. The enabled PE's are then
instructed by the PM-level PE to overwrite their stored ground literal with a special free-tag identifying
empty PE's. This tag is the target of the subsequent associative probe executed for each of the ground
literals to be added to WM.
When processing an asserted literal, the PM-level PE first determines whether or not the literal is relevant
to its stored production. Next, the associative operation identifies those relevant literals which are not
present in the WM-subtree, and thus are to be stored in some empty PE.
After probing for the free-tag, all PE's are disabled except the empty PE's. To avoid duplication of
asserted literals, all but one of these PE's is disabled by the RESOLVE circuit. The asserted literal is then
broadcast to the one enabled PE.
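The update of one WM-subtree can be sketched as follows. The `FREE` marker stands for the free-tag, each list slot stands for one PE holding at most one literal, and `apply_rhs` and its `is_relevant` callback are hypothetical names; returning `False` when no empty PE is left is our own way of signalling that the subtree has overflowed.

```python
from typing import Callable, List, Tuple

Literal = Tuple[str, ...]
FREE: Literal = ("<free>",)      # hypothetical free-tag marking an empty PE


def apply_rhs(subtree: List[Literal], deletes: List[Literal], adds: List[Literal],
              is_relevant: Callable[[Literal], bool]) -> bool:
    """Apply one RHS to one WM-subtree in place. Returns False on overflow."""
    for lit in deletes:                      # associative probe per deleted literal
        for pe, stored in enumerate(subtree):
            if stored == lit:
                subtree[pe] = FREE           # overwrite with the free-tag
    for lit in adds:
        if not is_relevant(lit) or lit in subtree:
            continue                         # not relevant here, or already present
        free = [pe for pe, stored in enumerate(subtree) if stored == FREE]
        if not free:
            return False                     # no empty PE left in this subtree
        subtree[free[0]] = lit               # RESOLVE keeps one empty PE; broadcast literal
    return True
```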
As in the matching phase, the action phase in general will terminate at different times in each PM-level
PE. After synchronization, another cycle of production system execution begins with the production
matching phase.
5.5 Specialized Production Systems
The general scheme for production system execution on DADO can be extended to support commutative
production systems, as well as "cooperating expert systems" based on multiple, independently executing
production systems.
A commutative production system allows each of the matching rules on every cycle of operation to be
selected for execution. The same combinatorial hardware used in the action phase to select a single
arbitrary "free" PE supports this operation by enumerating each of the matching productions in an
arbitrary sequential order. Each of the RHS's so reported to the root is then processed by the action
phase.
In our exposition of the general production system algorithm, it was assumed that the upper tree was
rooted at the (physical) root of DADO (see figure 8). Since each PE in the DADO tree can execute its
own independent program, the upper tree can be rooted at an arbitrary internal node of DADO. Thus,
multiple, independent production systems are executed on the DADO machine by rooting a forest of upper
trees at various levels of the DADO tree. (The "buddy system" of memory allocation provides a simple
means to allocate multiple production systems on DADO.) Communication among these independent
production systems is implemented in the same fashion as communication among the PM-level PE's
during the action phase of the (commutative) production system cycle.
5.6 Discussion
By way of summary, the basic DADO algorithm for PS execution operates in the following way:
1. By assigning a single rule to a unique PE at a fixed level within the tree (referred to as the
PM-level), executing in MIMD mode, each rule in the system is matched concurrently. Thus,
the time to calculate the set of matching rules on each cycle is independent of the number of
productions in the system.
2. By assigning a data item in Working Memory (WM) to a single PE below the PM-level
executing in SIMD mode, WM is implemented as a true hardware content-addressable memory.
Thus, the time required to match a single pattern element in the LHS of a rule is independent
of the number of facts in WM.
3. Lastly, the selection of a single rule for execution from the conflict set is also performed in
parallel. Thus, the logarithmic time lower bound of comparing and selecting a single item from
a collection of items is achievable on DADO as well.
This algorithm offers a number of advantages over the RETE algorithm reported by Forgy, while
maintaining much of RETE's efficient characteristics. We quote from [Forgy 1982]:
... Certainly the [RETE] algorithm should not be used for all match problems; its use is
indicated only if the following three conditions are satisfied.
- The patterns must be compilable [to more primitive match tests] ...
- The objects must be constant. They cannot contain variables or other non-constants as
patterns can.
- The set of objects must change relatively slowly. Since the algorithm maintains state
between cycles, it is inefficient in situations where most of the data changes on each
cycle.
5.6.1 Compiling patterns
In its current form, the DADO algorithm does not provide a means to compile patterns into primitive
match tests, although it does not directly exclude this possibility. However, the ability of a DADO PE to
execute code independently of other PE's permits pattern matching tests common to several rules to be
performed in parallel, as well as a more powerful pattern match operation, unification, discussed below.
5.6.2 Data Elements may contain variables
Data items within DADO's WM may contain variables or other non-constants. In this case, the Associative-Probe procedure is replaced by a SLICEd unification procedure local to each PE in a WM-subtree. Thus, an entire partially instantiated pattern element is first broadcast to all PE's and unified locally, and variable bindings are subsequently reported from those PE's which successfully matched the pattern element in question.
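To make the broadcast-and-unify step concrete, here is a minimal sketch assuming a simple term representation: variables are Python strings beginning with an uppercase letter, and compound terms are tuples whose first element is the functor. Each list entry stands in for the data item held by one WM-subtree PE; the loop over PEs is sequential here but would run simultaneously on DADO. The names `unify`, `walk` and `broadcast_unify` are hypothetical, and this is textbook syntactic unification with an occurs check, not the Associative-Probe code itself.

```python
from typing import Dict, List, Optional

Term = object  # str (variable or constant) or tuple (functor, *args)
Subst = Dict[str, Term]


def is_var(t: Term) -> bool:
    # Convention assumed for this sketch: variables start with an uppercase letter.
    return isinstance(t, str) and t[:1].isupper()


def walk(t: Term, s: Subst) -> Term:
    # Follow bindings until reaching an unbound variable or a non-variable.
    while is_var(t) and t in s:
        t = s[t]
    return t


def occurs(v: str, t: Term, s: Subst) -> bool:
    t = walk(t, s)
    if t == v:
        return True
    return isinstance(t, tuple) and any(occurs(v, a, s) for a in t[1:])


def unify(a: Term, b: Term, s: Subst) -> Optional[Subst]:
    """Return an extended substitution unifying a and b, or None on failure."""
    a, b = walk(a, s), walk(b, s)
    if a == b:
        return s
    if is_var(a):
        return None if occurs(a, b, s) else {**s, a: b}
    if is_var(b):
        return unify(b, a, s)
    if isinstance(a, tuple) and isinstance(b, tuple) and len(a) == len(b) and a[0] == b[0]:
        for x, y in zip(a[1:], b[1:]):
            s = unify(x, y, s)
            if s is None:
                return None
        return s
    return None


def broadcast_unify(pattern: Term, pe_items: List[Term]) -> List[Subst]:
    """Each PE unifies the broadcast pattern with its own item; PEs that
    succeed report their bindings. Assumes pattern and data variables
    are named apart."""
    reports = []
    for item in pe_items:  # conceptually simultaneous, one item per PE
        s = unify(pattern, item, {})
        if s is not None:
            reports.append(s)
    return reports
```

Broadcasting `("parent", "X", "bob")` to PEs holding `("parent", "alice", "bob")`, `("parent", "carol", "Y")` and `("parent", "dave", "erin")` yields reports from the first two only; the second report also binds the data item's own variable `Y`, which is what distinguishes unification from a one-way pattern match.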
This capability forms the basis of the implementation of PROLOG on DADO. [Taylor et al. 1983] describes this procedure, modified to permit the entire set of PROLOG clauses to be fully distributed throughout the DADO tree. The sequential semantics of PROLOG is maintained in the reported design through the use of the max-RESOLVE circuit applied to integers associated with each clause. Each of these integers represents the "position" of the clause in the PROLOG database, and thus determines the order in which clauses are reported to the coprocessor and subsequently applied. (This parallel associative PROLOG implementation is the focus of a doctoral investigation undertaken by Stephen Taylor, working in collaboration with Gerald Maguire.)
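The ordering role of the max-RESOLVE circuit can be mimicked in software. The sketch below assumes, purely for illustration, that each PE holds one clause with its database position, and that a resolve returns the PE with the largest key among PEs still enabled; encoding the key as the negated position makes the earliest clause win. Reporting the winner and then disabling it, repeatedly, yields the matching clauses in PROLOG's textual order. The `resolve_max` helper and the key encoding are hypothetical stand-ins for the hardware.

```python
from typing import List, Optional, Tuple


def resolve_max(keys: List[Optional[int]]) -> Optional[int]:
    """Stand-in for max-RESOLVE: index of the enabled PE holding the
    largest key (None marks a disabled or non-matching PE)."""
    enabled = [i for i, k in enumerate(keys) if k is not None]
    if not enabled:
        return None
    return max(enabled, key=lambda i: keys[i])


def report_in_order(clauses: List[Tuple[int, str]], matches: List[bool]) -> List[str]:
    """clauses[i] = (position in the database, clause text) held by PE i;
    matches[i] says whether PE i's clause head unified with the goal."""
    # Negate positions so that the maximum key belongs to the earliest clause.
    keys = [-pos if ok else None for (pos, _), ok in zip(clauses, matches)]
    order = []
    while (winner := resolve_max(keys)) is not None:
        order.append(clauses[winner][1])  # report to the coprocessor
        keys[winner] = None               # disable the reported PE
    return order
```

Even when clauses 3, 1, 4 and 2 are scattered across PEs in that physical order, and clause 4 fails to match, `report_in_order` returns clauses 1, 2 and 3 in database order.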
It should be noted that the introduction of general first-order terms within elements stored in WM has substantially complicated the design of the general matching procedure. This difficulty is a direct consequence of the unification process, which can generate objects that grow exponentially. For example, in order to unify the literals

$$P(f(x_1,x_1),\ f(x_2,x_2),\ \ldots,\ f(x_{n-1},x_{n-1}))$$

$$P(x_2,\ x_3,\ \ldots,\ x_n)$$

$x_n$ is forced to be bound to a term consisting of about $2^n$ symbols (since $x_k = f(x_{k-1},x_{k-1})$ for each $k$, the written-out binding of $x_n$ contains exactly $2^n - 1$ occurrences of $f$ and $x_1$). Thus, the implementation requires a representation scheme based on pointer structures (which can represent the unified literal in question in linear space), and dooms to failure any attempt to represent such objects in character form. (See the linear time unification algorithm reported in [Paterson and Wegman 1978].) The 8K RAM space of a DADO PE is more than sufficient to handle such objects adequately. (A report presently in preparation by Taylor describes the use of the Paterson and Wegman algorithm in a distributed environment.)
5.6.3 Temporal Redundancy
The DADO algorithm does not restrict the amount or scope of WM modifications, but rather permits large global changes to be made to WM very efficiently (by broadcasting such changes from the root PE). However, the DADO algorithm as outlined above does not save state between cycles. Rather, in situations in which few WM changes are made on each cycle, the DADO algorithm recomputes many of the match results already calculated on the previous cycle. The basic DADO algorithm can, however, easily be extended to exploit this temporal redundancy directly, by executing the match only for those literals recently asserted in WM while saving previous rule instantiations within the WM-subtrees. An implementation including this feature is presently under development.
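The extension described above, matching only recently asserted literals while keeping earlier instantiations in the WM-subtrees, can be sketched as a delta join. The code below is a minimal illustration under simplifying assumptions: a rule's LHS is a list of condition predicates over WM elements plus a consistency test on the joined tuple (standing in for shared-variable checks), and the class name `WMSubtreeState` is hypothetical. It is not the implementation the text says is under development; it only shows how each new instantiation can be found exactly once by requiring at least one new element.

```python
from itertools import product


def full_instantiations(conds, wm, consistent):
    """Recompute every instantiation from scratch, as the basic algorithm does."""
    pools = [[e for e in wm if c(e)] for c in conds]
    return {combo for combo in product(*pools) if consistent(combo)}


def delta_instantiations(conds, old, new, consistent):
    """Return only the instantiations that use at least one element of `new`.

    Position i is the first condition bound to a new element: earlier
    positions draw from old elements only, later positions from any
    element, so each instantiation is produced exactly once.
    """
    everything = old + new
    found = set()
    for i in range(len(conds)):
        pools = [[e for e in (old if j < i else new if j == i else everything) if c(e)]
                 for j, c in enumerate(conds)]
        found.update(combo for combo in product(*pools) if consistent(combo))
    return found


class WMSubtreeState:
    """Hypothetical per-rule state saved in a WM-subtree between cycles."""

    def __init__(self, conds, consistent):
        self.conds, self.consistent = conds, consistent
        self.wm = []        # literals currently stored
        self.saved = set()  # instantiations kept from earlier cycles

    def cycle(self, added=(), removed=()):
        removed = set(removed)
        self.wm = [e for e in self.wm if e not in removed]
        # Discard saved instantiations that used a retracted literal.
        self.saved = {inst for inst in self.saved if not removed.intersection(inst)}
        # Match only the genuinely new literals against the rule.
        new = [e for e in dict.fromkeys(added) if e not in self.wm]
        self.saved |= delta_instantiations(self.conds, self.wm, new, self.consistent)
        self.wm.extend(new)
        return self.saved
```

For a rule with conditions parent(x, y) and parent(y, z), asserting parent(a, b) on one cycle and parent(b, c) on the next produces the single instantiation (parent(a, b), parent(b, c)) on the second cycle, found by matching only the newly asserted literal; the saved set always equals what `full_instantiations` would recompute.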
Lastly, we note that the basic DADO algorithm can be modified to accommodate certain anomalous situations which may arise in practice. We briefly describe these in turn.
5.6.4 WM-subtree overflow
In the event that the number of literals to be stored within a WM-subtree is too large, two productions may be stored within a single PE one level higher than the PM-level. The resulting configuration produces a WM-subtree twice as large, at the expense of slowing the match phase to accommodate two sequential production matchings. Other allocation schemes are possible. For example, the entire upper portion of the DADO tree may be used to store productions for matching, at the expense of a matching operation requiring time proportional to log(P), the depth of an upper tree holding P productions. This last configuration is particularly useful when considering the following problem.
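The first trade-off can be quantified for a complete binary tree. The sketch below assumes a tree of `levels` levels (2^levels - 1 PEs), a PM-level at depth `pm_depth` (root at depth 0), and productions spread evenly over the PM-level PEs; the WM-subtree of a PM-level PE is everything strictly below it. The function name and parameters are hypothetical; it only does the counting.

```python
import math


def layout(levels: int, pm_depth: int, productions: int) -> dict:
    """PE counts for a complete binary tree with the PM-level at pm_depth."""
    pm_pes = 2 ** pm_depth                      # PEs on the PM-level
    upper_tree = pm_pes - 1                     # PEs above the PM-level
    wm_subtree = 2 ** (levels - pm_depth) - 2   # PEs strictly below one PM-level PE
    per_pe = math.ceil(productions / pm_pes)    # sequential matchings per PM-level PE
    return {"pm_pes": pm_pes, "upper_tree": upper_tree,
            "wm_subtree_pes": wm_subtree, "sequential_matchings": per_pe}
```

For a 10-level (1023-PE) tree and 32 productions, `layout(10, 5, 32)` gives 32 PM-level PEs with 30-PE WM-subtrees and one matching each; raising the PM-level by one, `layout(10, 4, 32)`, gives 16 PM-level PEs with 62-PE WM-subtrees, each matching two productions in sequence.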
5.6.5 Duplicate WM elements
As noted, an instance of a WM element relevant to several rules will exist within several distinct WM-subtrees. To achieve maximum parallelism in rule matching, in the worst case an exact copy of WM may exist for each rule. (Contrast this with "shared memory" models of computation, in which a single global memory is accessible to some large number of asynchronous processors.) This duplication problem is imposed on us by the fully distributed model of storage and computation in the DADO machine. To reduce the number of duplicate elements, a subsumption principle may be used to partition the productions distributed within the upper tree effectively, while maintaining log(P) time for rule matching.
If the LHS of rule P1 is a generalization of the LHS of rule P2 (that is, rule P1 matches a superset of the literals matched by P2), then P2 is placed in the subtree rooted by P1. Rule P2 thus shares a subset of the WM-subtrees accessible to P1. If the LHS of P1 is disjoint from that of P2, then both rules may be placed in sibling subtrees without sharing a common WM-subtree. Finally, in the case where the LHS of P1 overlaps with that of P2 but is not a generalization of P2, they may either be located within the same PE, or their relative positions may be determined by their respective relationships with other rules in the production system.
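The three placement cases can be expressed directly in terms of match sets. The sketch below assumes, for illustration only, that each rule's LHS is summarized by the finite set of WM literals it can match. `placement` encodes the three cases above; `build_hierarchy` is a hypothetical greedy heuristic (not from the text) that visits rules from most to least general and hangs each under the smallest already-placed rule whose match set contains it.

```python
def placement(p1: str, m1: frozenset, p2: str, m2: frozenset) -> str:
    """Relative placement of rules p1 and p2 from the literals their LHSs match."""
    if m2 <= m1:
        return f"place {p2} in the subtree rooted by {p1}"
    if m1 <= m2:
        return f"place {p1} in the subtree rooted by {p2}"
    if not (m1 & m2):
        return f"place {p1} and {p2} in sibling subtrees"
    return "overlap: same PE, or decide from relationships with other rules"


def build_hierarchy(rules: dict) -> dict:
    """Return parent[rule] (None for a root) under the subsumption principle."""
    parent, placed = {}, []
    for name in sorted(rules, key=lambda r: -len(rules[r])):  # most general first
        supersets = [p for p in placed if rules[name] <= rules[p]]
        parent[name] = min(supersets, key=lambda p: len(rules[p])) if supersets else None
        placed.append(name)
    return parent
```

With P1 matching {a, b, c, d}, P2 matching {a, b}, P5 matching {b, c} and P3 matching {c}, the heuristic places P2 and P5 under P1 and P3 under P5, while a rule matching {x, y} becomes a separate root; P2 and P5 overlap without either generalizing the other, which is the case the text leaves to other considerations.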
There are many degrees of freedom and tradeoffs involved in this allocation scheme (which forms a major part of a doctoral investigation being conducted by Daniel Miranker).
6 Future Research
Thus far, 12 people have written PPL/M programs for DADO. The applications that have been written, at various stages of completion, include system-level diagnostics and AI applications.
The diagnostic programs, which are currently being integrated within the kernel system, exercise the processor and RAM chips whenever a PE is in a non-busy state. The coprocessor has been designed to assess the status of the entire system periodically by performing a high-speed logical disjunction of the error flags of all PE's to identify any that may have failed. Furthermore, we have included in the design of DADO2 a simple sequential circuit, passing through each connector in the system, which is used to detect any faulty connections. Other than this simple "hardware hack", we have paid little attention to the issue of fault tolerance thus far. Nonetheless, statistics on the error rate of the Intel chips we have employed indicate that soft errors will appear in DADO2 about once every 1800 hours of operation.
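The system-wide check described above amounts to an OR-reduction up the binary tree. The sketch below assumes a complete binary tree stored in heap order (the children of PE i are PEs 2i+1 and 2i+2) with one error flag per PE. It computes each subtree's disjunction bottom-up and then, as a hypothetical follow-up step not described in the text, descends only into flagged subtrees to locate the failed PEs.

```python
from typing import List


def subtree_or(flags: List[bool]) -> List[bool]:
    """any_err[i] = OR of the error flags in the subtree rooted at PE i.
    PEs on one level are independent, so hardware could finish one level
    per step, with the root value ready after depth-many steps."""
    any_err = list(flags)
    for i in range(len(flags) - 1, -1, -1):  # children are visited before parents
        for c in (2 * i + 1, 2 * i + 2):
            if c < len(flags):
                any_err[i] = any_err[i] or any_err[c]
    return any_err


def locate_failures(flags: List[bool]) -> List[int]:
    """Follow only flagged subtrees down from the root."""
    any_err = subtree_or(flags)
    found: List[int] = []
    stack = [0] if flags and any_err[0] else []
    while stack:
        i = stack.pop()
        if flags[i]:
            found.append(i)
        for c in (2 * i + 1, 2 * i + 2):
            if c < len(flags) and any_err[c]:
                stack.append(c)
    return sorted(found)
```

In a 15-PE tree with PEs 4 and 9 flagged, the root's disjunction is true and the descent visits only PEs 0, 1, 4 and 9, returning `[4, 9]`; if no flag is set, the single root value ends the check.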
The bulk of our effort has concentrated on the development of the interpreter for the parallel execution of production system programs. A restricted model of production systems, Winston's animal program [Winston 1977], has been implemented in PPL/M and is currently being tested. Our plans include the completion, in the coming months, of an interpreter for a more general version of production systems, including a direct implementation of the RETE matching algorithm. A modified algorithm for the rapid evaluation of hierarchical production systems, typified by MYCIN-like systems, is being investigated for implementation on DADO as well. Indeed, the envisaged PROLOG implementation may subsume this effort.
Fahlman [1979] has proposed a special-purpose parallel architecture for high-speed property inheritance in systems based on semantic-network-like formalisms. Although it is too early in our investigations to make any precise claims, we believe that DADO may in fact provide a significant improvement over von Neumann machines in the execution of semantic-network-based systems. Currently, we have implemented the essential elements of a frame-matching operation, but have not yet explored the possibilities of applying DADO's hardware parallelism to property inheritance operations.
Lastly, we note the relationship of LISP to DADO. Part of our work has concentrated on providing LISP with additional parallel processing primitives akin to those employed in PPL/M. Thus, we have been actively pursuing the opportunity of providing SLICEd list structures within a conventional LISP environment.
More importantly, though, we have begun to formulate the essential aspects of LISP execution which may be regarded as purely associatively based, and thus suitable for direct execution on DADO. Examples of such operations include:
- finding variable bindings on an association list,
- property list operations, including the access and instantiation of function definitions,
- finding and allocating, as well as freeing, a cons cell from a large space of free memory cells.
The time required for each of these operations on a sequential machine is, in general, linear in the size of the list structures in question. For certain of these operations, space-expensive hashing may reduce the time to a constant. Within DADO, on the other hand, these operations may be executed in constant time without a significant overhead in storage management (see [Bonar and Levitan 1981]).
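The contrast between a sequential association-list search and an associative one can be sketched as follows. The code assumes, hypothetically, that each binding of an association list sits in its own PE together with its position in the list (0 for the most recent binding, which shadows later ones); a lookup broadcasts the variable name, every PE compares its own name, and a resolve-style selection of the smallest matching position returns the innermost binding. The simulation runs the per-PE comparisons in an ordinary loop, so it shows the semantics of the associative lookup, not its timing; the function names are illustrative only.

```python
from typing import Any, List, Optional, Tuple


def assoc_sequential(var: str, alist: List[Tuple[str, Any]]) -> Tuple[Optional[Any], int]:
    """LISP-style ASSOC: scan from the front; returns (value, probes made)."""
    for probes, (name, value) in enumerate(alist, start=1):
        if name == var:
            return value, probes
    return None, len(alist)


def assoc_associative(var: str, pes: List[Tuple[int, str, Any]]) -> Optional[Any]:
    """pes[k] = (position in the list, name, value) held by PE k, in any order.
    Step 1 (broadcast and compare): every PE tests its own name at once.
    Step 2 (resolve): the responding PE with the smallest position wins."""
    responders = [(pos, value) for pos, name, value in pes if name == var]
    return min(responders, key=lambda r: r[0])[1] if responders else None
```

With the association list ((x . 1) (y . 2) (x . 3)), both functions return 1 for x and 2 for y, but the sequential scan's probe count grows with the position of the binding, while the associative version consults every PE in a single broadcast regardless of where the binding sits.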
By way of summary, it is our belief that DADO can in fact support the high-speed execution of a very large class of AI applications. Coupled with an efficient implementation in VLSI technology, the large-scale parallelism achievable on DADO will indeed provide significant performance improvements over von Neumann machines. We are presently preparing detailed experiments to evaluate the performance of DADO2 empirically. If pressed to give some indication of its capabilities, we have estimated that DADO2 may execute R1/XCON, for example, at an average rate in excess of 150 production system cycles per second. Presently, R1/XCON runs on a VAX 11/780 at a rate of 2 to 600 cycles per minute.
The envisaged implementation of R1/XCON, which provides this rough estimate, consists of a PM-level of 32 PE's performing the match, 31 PE's within the upper tree performing selection, and 30 PE's in each of the 32 WM-subtrees.
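The figures in the two preceding paragraphs can be cross-checked with a few lines of arithmetic. The sketch only re-derives numbers stated in the text: the PE budget of the envisaged configuration, and the ratio between the estimated DADO2 rate and the quoted VAX 11/780 range after converting both to cycles per minute. It makes no performance claim of its own.

```python
# Envisaged R1/XCON configuration, as stated in the text.
pm_level, upper_tree, wm_subtrees, pes_per_wm_subtree = 32, 31, 32, 30
total_pes = pm_level + upper_tree + wm_subtrees * pes_per_wm_subtree
print(total_pes)  # 1023, a complete binary tree of 10 levels (2**10 - 1)

# Estimated DADO2 rate versus the quoted VAX 11/780 range, in cycles per minute.
dado2_cycles_per_min = 150 * 60   # "in excess of 150 cycles per second"
vax_range_per_min = (2, 600)      # "2 to 600 cycles per minute"
ratios = [dado2_cycles_per_min / r for r in vax_range_per_min]
print(dado2_cycles_per_min, ratios)  # 9000 [4500.0, 15.0]
```

The configuration uses exactly the 1023 PEs planned for DADO2, and, if the estimate holds, it corresponds to a speed-up of at least 15 to 4500 times depending on where in the quoted VAX range a given run falls.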
7 Conclusion 
A large part of our work continues to involve the analytical investigation of new parallel algorithms and languages for AI applications. Several researchers are actively investigating methods for the rapid execution of frame-based systems, as well as methods for improving the performance of "conventional" AI languages, including a parallel implementation of PROLOG. New methods for the parallel execution of hierarchical production systems are being investigated as well.
A large share of our efforts, though, is devoted to the hardware design and implementation of a larger experimental device. Although adequate for further development of the software base for DADO, the 15-element DADO1 system is too limited in storage capacity and processing power to demonstrate a significant performance improvement in the execution of AI systems. (A large share of the machine's processing power is used for system control and interprocessor communication.) However, as the number of processors in the system increases, the proportion of this "wasted" processing power will decrease dramatically. Concomitant with increasing the number of processing elements in the system, a
set of new technical problems dealing with interprocessor communication and fault tolerance must be solved to achieve the predicted speed-up.
However, these problems can only be investigated if a large-scale experimental device is implemented. Thus, using a slightly modified hardware design of the DADO1 system, we are currently implementing a much larger version of DADO, comprising 1023 processing elements. This version, DADO2, will incorporate a custom VLSI chip, currently being designed at Columbia University, to perform the most basic communication functions in combinational logic. This custom IC is expected to produce a significant improvement in operating speed, and would be a required component of a full version of DADO implemented entirely in VLSI. The existing DADO software will require only minor modification to run on the newer design.
Our future plans include the demonstration of the DADO2 prototype using several existing large-scale expert systems which use the production system paradigm. Digital Equipment Corporation has expressed an interest in supplying a copy of R1/XCON for implementation on DADO. Bell Laboratories has also expressed a willingness to supply a copy of ACE, an expert system that has been developed to perform telephone cable maintenance. Other systems are being actively sought from other sources in the Artificial Intelligence community.
Acknowledgements
It is a great pleasure to thank the many people and organizations who have contributed to the DADO project. Daniel Miranker, a Ph.D. student at Columbia, is responsible for many of the detailed hardware and software designs of the machine and is the driving force in the effort to implement the OPS family of languages on the system. Stephen Taylor, also a Ph.D. student, working closely with Chris Maio, Andy Lowry and faculty co-investigator Professor Gerald Maguire, has made tremendous progress in specifying the execution of PROLOG on DADO. Their design of the PROLOG system is as elegant as the hardware solutions provided by our project engineer, Shunsaku Ueda, who deserves special mention. Professor David Shaw has had a tremendous impact on our work, and we are very grateful for his involvement.
Many researchers have contributed in substantial ways, too numerous to specify in great detail. We thus would like to acknowledge the efforts of Janvid Cheng, Eugene Dong, Wai Man Wong, Jody Weiss, Mike Weisberg, Jim Gilpatrick, Monique Fei, Daphne Tzoar, Doug Degroot, Mark Lerner, Alex Pasik and Ted Sabety. We would also like to thank Lanny Forgy, Allen Newell and Mike Rychener for very interesting and thought-provoking conversations about DADO.
The Defense Advanced Research Projects Agency is our primary source of support through contract N00039-82-C-0427. Intel Corporation has contributed most of the components used in the construction of
DADO1, and continues to support our development effort on DADO2. Digital Equipment Corporation has provided computational resources for our software development. Valid Logic Systems has donated the prototype boards used in the construction of DADO1 and has continued to aid our research. Finally, we acknowledge the assistance of IBM Corporation in providing components for DADO2 as well as taking a more active role in supporting our research. Presently, IBM researchers are helping to prepare experiments for the statistical analysis of our PROLOG work.
