Performance Analysis of Two Competing DADO PE Designs by Miranker, Daniel P.
Performance Analysis of Two Competing DADO PE Designs 
Daniel P. Miranker 
Department of Computer Science 
Columbia University 
New York City, N. Y. 10027 
November 15, 1983 
Abstract 
CUCS-83-83 
In parallel processing, useful computation is performed by having a number of processors compute 
values and communicate the results to neighboring processors. A crucial design issue in any 
parallel processing architecture is to determine the optimal balance of resources among the competing 
requirements of the typical problems to be solved by the device, i.e. computation versus communication. 
For example, in a highly parallel machine consisting of many individual processing elements (PE's), there 
is a trade-off between the complexity of the constituent PE's and the number of such elements that may 
be embedded in a fixed silicon area. Part of each PE's circuitry must be dedicated to communication 
processing. The capability and speed with which a PE performs communication can be increased, 
but only at the expense of the number of processors on a chip. 
DADO is a large scale VLSI computer designed for the rapid execution of AI production systems. This 
paper analyzes the nature of the instruction stream expected to be executed on DADO for production 
system applications, with emphasis on the number of processors in the machine and the size of the 
problem. We describe four different proposed methods of handling I/O and their queueing network 
models. The models were carefully simulated to determine which I/O scheme, with its resulting circuit 
complexity, is best suited (most efficient) with respect to the DADO instruction stream. 
Table of Contents 
1 Introduction 
2 The Nature of the DADO SIMD Instruction Stream 
3 Queuing Models of Proposed DADO Hardware 
4 Analysis Method, RESQ: queuing network simulation package 
5 Results 
6 Conclusions 
7 Acknowledgments 
8 Appendix I: Simulation Results 
9 Appendix II: Code to Determine Ring Buffer Overhead 












List of Figures 
Figure 1: An Example Production Rule. 
Figure 2: Example PPL/M Code for Sequentially Loading DADO 
Figure 3: Statistics Characterizing the DADO Instruction Stream. 
Figure 4: Queuing Model of DADO 1a (no buffering). 
Figure 5: Queuing Model of DADO 1b (buffering). 
Figure 6: Queuing Model of DADO 2a (no buffering). 
Figure 7: Queuing Model of DADO 2b (buffering). 
Figure 8: Passive Queue Construct. 
Figure 9: DADO SIMD Instruction Stream Generator Model. 
Figure 10: RESQ Queueing model of a DADO 1 PE. 
Figure 11: RESQ Queueing model of a DADO 2 PE. 
Figure 12: Throughput Results For The Four DADO Models. 
Figure 13: Throughput vs. Length of Data Dependent Operations. 

















1 Introduction 
DADO is a highly parallel computer comprising a large number of processing elements (PE's) 
interconnected in a complete binary tree. Communication may occur between processors along the tree 
edges. Under certain circumstances a processor may broadcast data to all of its descendants in the tree, 
or a processor may be instructed to report a byte value to all of its ancestors. 
Each PE of DADO is a fully capable computer composed of an 8 bit microprocessor, a ROM resident 
operating system, 16K bytes of RAM, and an I/O section. Under the control of software, a PE may 
operate in one of two modes: master or slave. In master mode the PE runs a computer program stored in 
its local memory. However, embedded within the master's program are blocks of instructions that are 
broadcast to connected descendant PE's operating in slave mode. The slave PE's execute the broadcast 
instructions in parallel, in a manner similar to an array processor or the ILLIAC IV. This type of 
parallelism is known as single instruction stream multiple data stream (SIMD) execution [1]. Further, the 
machine can be arbitrarily partitioned into a number of independent subtrees. The root of such a subtree 
logically disconnects itself from its parent, and becomes the master of the PE's logically connected below. 
This type of machine has become known as a multiple SIMD (MSIMD) architecture [5]. 
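The tree relationships described above can be sketched with ordinary level-order indexing. This is only an illustration: the numbering of PEs from 1 at the root and the function names are assumptions made here, not DADO's actual addressing scheme.

```python
from typing import List, Optional

# Assumed numbering: root is PE 1, children of PE k are 2k and 2k + 1.

def pe_count(levels: int) -> int:
    """Number of PEs in a complete binary tree with the given number of levels."""
    return 2 ** levels - 1

def parent(pe: int) -> Optional[int]:
    """Parent of a PE, or None for the root."""
    return pe // 2 if pe > 1 else None

def children(pe: int, n: int) -> List[int]:
    """Children of a PE in a tree of n PEs (empty for a leaf)."""
    return [c for c in (2 * pe, 2 * pe + 1) if c <= n]

def descendants(pe: int, n: int) -> List[int]:
    """All PEs that would receive a broadcast issued by `pe`."""
    found: List[int] = []
    frontier = children(pe, n)
    while frontier:
        node = frontier.pop(0)
        found.append(node)
        frontier.extend(children(node, n))
    return found

def ancestors(pe: int) -> List[int]:
    """All PEs on the path from `pe` up to the root (targets of a report)."""
    path: List[int] = []
    node = parent(pe)
    while node is not None:
        path.append(node)
        node = parent(node)
    return path
```

A PE that becomes the master of a subtree would, in this picture, broadcast to `descendants(pe, n)` and ignore everything in `ancestors(pe)`.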
SIMD parallelism is typically controlled by a conventional processor. The control processor issues a 
stream of machine level instructions that are executed synchronously, in lock step, by all the slave 
processors in the array. DADO is different. Since each PE of DADO is a fully capable computer, and 
communication between PE's is generally expensive, we wish to make an instruction as "meaningful" as 
possible. What is communicated as an instruction in DADO is actually a pointer to a procedure stored 
locally in each slave PE. Primitive SIMD DADO instructions are in fact parallel procedure calls and may 
be viewed as macro instructions. 
For example, a common instruction that will be executed by a DADO PE is "MATCH(pattern)", where 
MATCH is a generalized pattern match routine local to each processor. 
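The idea of broadcasting a pointer to a locally stored procedure can be sketched as a table lookup. The table layout, the index values, and the simplified `match` and `delete` bodies below are hypothetical; they only show that a short index plus an argument is enough to trigger a whole routine in every slave.

```python
def match(local_wm, pattern):
    """Hypothetical local routine: return the local elements equal to the pattern."""
    return [element for element in local_wm if element == pattern]

def delete(local_wm, element):
    """Hypothetical local routine: remove an element, return how many were removed."""
    before = len(local_wm)
    local_wm[:] = [e for e in local_wm if e != element]
    return before - len(local_wm)

# The same table is assumed to be resident in every slave PE.
PROCEDURES = [match, delete]

def execute(local_wm, proc_index, argument):
    """Run the locally stored procedure named by a broadcast index."""
    return PROCEDURES[proc_index](local_wm, argument)

# One broadcast (index 0, argument "b") is executed by three slaves in turn.
slaves = [["a", "b"], ["b", "c"], ["d"]]
results = [execute(wm, 0, "b") for wm in slaves]
```

Each slave does a different amount of work for the same broadcast, which is the source of the variable service times discussed next.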
Transmitting entire procedure calls makes effective use of communications links but introduces a difficult 
problem. A procedure may behave differently depending on the local data; the same macro instruction 
may require different amounts of processing time in each PE. In such a device either the PE's must 
synchronize on every instruction, and therefore potentially lie idle while the slowest PE finishes, or the 
PE's must be able to queue the instruction stream to possibly achieve better performance. 
Two design questions need to be answered for DADO: how much circuit complexity should be used in a 
PE's I/O section, and how much of a PE's processing power should be diverted to handle I/O. For a machine 
of a given physical size, the more silicon devoted to communications, the less area there will be for the 
processors. Thus, faster I/O potentially reduces the number of available processors. The proper balance 
of the PE's capabilities with respect to the I/O section must therefore be established. 
The independence of instruction execution times among PE's suggests that the performance analysis of 
DADO is analogous in many ways to the performance analysis of open queueing networks. The remainder 
of this paper describes a careful analysis of the DADO instruction stream with respect to the size of the 
machine and the size of the problem, while considering four different configurations of DADO in terms of 
open queueing network models. 
The next section describes the nature of the DADO instruction stream. Section 3 describes the four 
different proposed configurations of DADO and their respective queueing models. Section 4 describes 
the performance simulation method. The results of the simulations are presented in section 5. 
2 The Nature of the DADO SIMD Instruction Stream 
We are primarily concerned with the behavior of the SIMD subtrees. Resident in the root of a SIMD 
subtree is a production or set of productions. A production system [4] is defined by a set of rules (or 
productions) and a collection of dynamically changing facts, called the working memory (WM). A rule in 
a production system consists of a left hand side (LHS) and a right hand side (RHS). The LHS is a 
collection of pattern elements to be matched against the contents of the working memory, while the RHS 
contains actions effecting changes in the working memory. A production system repeatedly executes the 
following cycle of operations: 
1. Match: For each rule, compare the LHS against the current WM. Determine if the WM satisfies the 
LHS. 
2. Select: Choose a satisfied rule according to some predefined criteria. 
3. Act: Add to or delete elements from the WM as specified by the RHS of the selected rule. 
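The three steps above can be sketched as a small sequential interpreter. This is a deliberately reduced illustration: rules here have no pattern variables, the rule format (dictionaries with `lhs`, `add`, `delete` sets) is invented for the example, and the selection criterion (first satisfied rule that would change WM) is an assumption, not the OPS5 conflict resolution strategy.

```python
def run(rules, wm, max_cycles=10):
    """Repeat match, select, act until no rule can change working memory."""
    wm = set(wm)
    fired = []
    for _ in range(max_cycles):
        # Match: a rule is satisfied when every LHS element is present in WM.
        satisfied = [r for r in rules if r["lhs"] <= wm]
        # Select: assumed criterion, the first satisfied rule whose RHS changes WM.
        chosen = next(
            (r for r in satisfied if not r["add"] <= wm or r["delete"] & wm),
            None,
        )
        if chosen is None:
            break
        # Act: apply the deletions and additions named by the RHS.
        wm = (wm - chosen["delete"]) | chosen["add"]
        fired.append(chosen["name"])
    return wm, fired

rules = [{
    "name": "categorize",
    "lhs": {("job", "new")},
    "add": {("class", "medium")},
    "delete": {("job", "new")},
}]
final_wm, fired = run(rules, {("job", "new")})
```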
An example rule using the OPS5 production system language syntax [2] is shown in figure 1. This rule says 
that if there is a WM element in the system representing a message about a new job, and the job's size 
matches the class definition for medium size jobs, then create a new WM element tagging the job with the 
class name medium. 
(p categorize-job-sizes                               ; rule name 
    (message ^job <x> ^size <y> ^status new)          ; pattern element. <x>, <y> 
                                                      ; are pattern variables 
    (class-definition ^size <y> ^class-name medium)   ; pattern element 
--> 
    (make job ^job-name <x> ^class medium) 
) 
Figure 1: An Example Production Rule. 
A copy of the relevant subset of working memory is stored in a subtree (see [6]). Within a subtree the 
working memory is fully distributed amongst the PE's. An effort is made to keep the number of working 
memory elements in each of the PE's of a subtree balanced. 
During the match operation, the root PE broadcasts to all its descendants the first pattern element that 
is to be matched. The slave processors then attempt to match the pattern element against their local 
WM. Under the direction of the root PE, those slave PE's that have successfully matched report any 
values they matched against pattern variables occurring in the pattern element. If there is a variable 
common to the first pattern element and to subsequent pattern elements, the root PE then substitutes 
each previously reported value in place of the pattern variables in the subsequent pattern elements. The 
process is repeated recursively with each of the remaining pattern elements until all pattern elements have 
been successfully matched, or until the match in that subtree fails. 
A particular rule and set of variable bindings is then selected. The details of this selection are unimportant 
here; the interested reader may see [2]. The RHS of the rule is then evaluated. Two types of actions are 
possible: adding a new working memory element, or deleting an old one. If the action is a delete from 
WM, the slave PE's must determine if they have a copy of the WM element to be deleted. The WM 
element in question is broadcast to all the slave PE's, which compare the broadcast element to their own 
local store. If the match is successful the slave PE frees the storage. If the action is an add, one PE is 
selected in a subtree and the WM element is stored in that PE. 
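The distributed match, delete and add operations described above can be sketched as follows. The class and function names are hypothetical, WM elements are modeled as plain tuples, pattern variables are strings beginning with "<", and the "least loaded PE" rule for an add is an assumed way of keeping WM balanced.

```python
def bind(pattern, element):
    """Match one pattern element against one WM element; return bindings or None."""
    if len(pattern) != len(element):
        return None
    bindings = {}
    for p, e in zip(pattern, element):
        if isinstance(p, str) and p.startswith("<"):
            # A variable must bind consistently within the pattern element.
            if bindings.setdefault(p, e) != e:
                return None
        elif p != e:
            return None
    return bindings

class Subtree:
    def __init__(self, n_slaves):
        self.slaves = [[] for _ in range(n_slaves)]  # local WM of each slave PE

    def add(self, element):
        """Store the element in the least loaded PE (assumed balancing policy)."""
        target = min(range(len(self.slaves)), key=lambda i: len(self.slaves[i]))
        self.slaves[target].append(element)
        return target

    def match(self, pattern):
        """Broadcast a pattern; collect the bindings each slave would report."""
        reports = []
        for wm in self.slaves:
            for element in wm:
                found = bind(pattern, element)
                if found is not None:
                    reports.append(found)
        return reports

    def delete(self, element):
        """Broadcast an element; any slave holding a copy frees it."""
        for wm in self.slaves:
            if element in wm:
                wm.remove(element)
```

In use, the root would call `match` with the first pattern element, substitute each reported value into the next pattern element, and call `match` again, mirroring the recursive process in the text.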
It should be apparent from the above description that a considerable amount of time is spent in 
communicating small amounts of data and in the storage management of the messages and WM elements. 
Data dependent operations are in fact infrequent, but lengthy in comparison to the others. 
The DADO SIMD instruction stream can be broken into three types of instructions. The first and most 
common type (I) are those that handle primitive communication steps and storage management. These 
tend to have short service times and arrive in bursts. They also have the property that they have 
identical service times in all PE's. 
The second type (E for exponential) of DADO SIMD instructions are the data dependent operations where 
intensive processing occurs, operations such as Match and Delete. These tend to be rather lengthy and 
infrequent compared to the I type. The primary component of these operations is an associative probe 
where a PE must match a pattern element against its local WM. The match must succeed through a 
number of terms within a pattern element. The likelihood of a successful match on a particular term is 
independent of the success of the match on the previous terms. Therefore an exponential service time 
distribution is a fair assumption for the E type of instructions. 
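One way to read the independence argument above: if each term comparison lets the match continue with some fixed probability, independently of earlier terms, the number of terms examined follows a geometric distribution, which is the discrete memoryless analogue of the exponential. The sketch below assumes a single continuation probability `p_continue` and ignores the finite number of terms in a pattern element; both are simplifications introduced here.

```python
import math

def geometric_pmf(k, p_continue):
    """P(exactly k terms are examined): k - 1 continuations, then a stop."""
    return (p_continue ** (k - 1)) * (1 - p_continue)

def survival(k, p_continue):
    """P(more than k terms are examined)."""
    return p_continue ** k

def mean_terms(p_continue):
    """Expected number of terms examined."""
    return 1 / (1 - p_continue)

# Memoryless property: having survived 2 terms, the chance of surviving
# 3 more equals the unconditional chance of surviving 3.
p = 0.75
conditional = survival(5, p) / survival(2, p)
unconditional = survival(3, p)
```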
The third type (S for Synchronize) of DADO SIMD instructions are those that cause the control processor 
to block and wait for all the PE's to finish all previously issued instructions. Examples of this type of 
instruction are "Report", where a byte value is communicated from a PE to its ancestors, and the 
"Resolve" operation, an instruction integral to the communication section of the PE's that 
selects a single PE from multiple responders [7]. 
To determine the arrival rates and service times of the three types of SIMD instructions, the current 
implementation of the OPS5 monitor was examined. The OPS5 monitor is written in Parallel PL/M 
(PPL/M) [7], which is an extension of the PL/M language that Intel provides for all of its microprocessors. 
The extensions to PL/M are rather simple, the primary one being a new block delimiter, "DO SIMD". 
All code within a SIMD block is executed in parallel by the slave processors. Ten pages of PPL/M code 
were compiled and the resulting assembly language output was analyzed. An example of this code is 
illustrated in figure 2. 
The original source code was examined to determine which instructions corresponded with each of the 
three types defined above. The types were then mapped onto the assembly language output. The lengths 
of each of the instruction blocks were tallied, as well as the interarrival times of the blocks. The results of 
the tallies were used to generate basic statistics of the instruction stream. The units of time are 
normalized to the time to execute a typical machine level instruction in a PE, which is two microseconds 
in the current implementation. The statistics are summarized in figure 3. 
Close examination of the code exposed an important property. The size of WM affected only the length 
of the E type instructions and none of the other characteristics of the instruction stream. If twice as 
many WM elements are packed into a PE then the E type instructions would on average take twice as 
long. 
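The time normalization and the linear scaling of E type instructions can be written down directly. The two microsecond cycle time is taken from the text; the function names and the example WM sizes are illustrative assumptions.

```python
CYCLE_US = 2.0  # microseconds per machine level instruction (from the text)

def cycles_to_us(cycles):
    """Convert normalized time units (instruction cycles) to microseconds."""
    return cycles * CYCLE_US

def scaled_e_length(base_length, base_wm_per_pe, wm_per_pe):
    """E type length grows linearly with the WM elements held in each PE."""
    return base_length * wm_per_pe / base_wm_per_pe

# Doubling the WM per PE doubles the mean E type length.
doubled = scaled_e_length(500, base_wm_per_pe=10, wm_per_pe=20)
```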
3 Queuing Models of Proposed DADO Hardware 
Within the currently available technology for building DADO there are two orthogonal implementation 
issues. One is whether the PE's processor chip should directly handle low level I/O steps, or whether a 
distinct semicustom chip should be incorporated in the PE design to handle the low level I/O. 
The selected processor for DADO is the Intel 8751 single chip computer. The DADO 1 design calls for the 
I/O ports of each computer chip to be directly tied to the I/O ports of the computer chips of its three tree 
neighbors. Data is transferred along the tree edges by using a standard four cycle handshake protocol. 
This simple design has greatly expedited our prototyping. However, to move one data byte from one PE 
Figure 2: Example PPL/M Code for Sequentially Loading DADO 
We will assume that this program is executed within 
DADO's CP. The system function READSTR loads string 
data into a buffer from some external source. 
SEQLOAD: PROCEDURE; 
  DECLARE Intelligent_Record(64) BYTE SLICE EXTERNAL; 
  DECLARE Not_Done BYTE SLICE; 
  DECLARE (Index,Length) BYTE SLICE; 
  DECLARE I BYTE; 
  DECLARE Buffer(64) BYTE; 
  DO SIMD; 
    [line illegible in source] 
    Index = 0; 
  END; 
  /* [illegible] pe to load the next record into */ 
  DO SIMD; 
    CALL Enable; 
    A1 = BOOLEAN(Not_Done); 
    CALL Resolve;                   /* Only one A1 is now set */ 
    EN1 = A1;                       /* Selectively disable all but one pe */ 
    Not_Done = 0; 
  END; 
  IF CPRR=0 THEN                    /* tree is full */ 
  DO; 
    CALL Writestr([illegible]); 
    RETURN; 
  END; 
  CALL Readstr(.Buffer, .Length);   /* Data provided by external source */ 
  IF Buffer(0) [illegible] THEN RETURN; 
  DO I = 0 TO LENGTH-1; 
    CALL Broadcast(Buffer(I)); 
    DO SIMD; 
      Intelligent_Record(Index) = A8; 








to the next requires twelve machine instructions. A byte broadcast to a subtree takes twelve machine 
instructions for each level of the tree. The simplicity of the design has allowed us to rapidly implement a 
15 node DADO, but it is expected that a considerable amount of the available processing power will be 
consumed in I/O computation. 
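The cost quoted above is easy to tabulate. The twelve instructions per level and the two microsecond instruction time come from the text; treating the cost as simply multiplicative in the number of levels and bytes, and the function names, are assumptions of this sketch.

```python
INSTRUCTIONS_PER_LEVEL = 12  # one byte moved across one tree level (from the text)
CYCLE_US = 2.0               # microseconds per machine instruction (from the text)

def broadcast_instructions(levels, n_bytes=1):
    """Instructions spent broadcasting n_bytes through `levels` levels of DADO 1."""
    return INSTRUCTIONS_PER_LEVEL * levels * n_bytes

def broadcast_microseconds(levels, n_bytes=1):
    """The same cost expressed in microseconds."""
    return broadcast_instructions(levels, n_bytes) * CYCLE_US

# A four level tree is the 15 node prototype size.
one_byte = broadcast_instructions(4)
three_bytes = broadcast_instructions(4, n_bytes=3)
```

Since a typical I type instruction is only about 14 cycles long, even a one byte broadcast through a few levels costs more than the instruction it delivers.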
The alternative is to build an I/O coprocessor for the DADO PE's. The coprocessor will be specifically 
designed for the I/O task and as a result will be very efficient. Current designs indicate that a byte may 
be broadcast to all PE's in a tree in less than one 8751 instruction cycle. The efficiency does not come 
free. The I/O coprocessor will expand the number of chips per PE from two to three. If we are 
constrained to build a DADO at a fixed complexity then a DADO incorporating the I/O coprocessor may 
contain only half as many PE's. A DADO with the I/O coprocessor and the same number of PE's will 
require twice as many boards and fifty percent more chips. 
A key question is whether the increase in communication speed will overcome the loss of half of the 
processors. The answer revealed in section 5 is yes. 
The second major issue facing the architects of DADO is whether or not to buffer incoming instructions. 
The 8751 computer has a sophisticated interrupt mechanism. All incoming communication control lines 
are in fact connected to the interrupt mechanism. Since all the necessary hardware support is available 



































; grows linearly with the number of WM elements 
Figure 3: Statistics Characterizing the DADO Instruction Stream. 
DADO configurations from polled, busy wait I/O, to asynchronous interrupt driven buffered I/O. But this 
apparent improvement does not come for free. Buffering the instructions requires an additional 25 
instructions both when performing the I/O and when processing the instructions (see appendix II). 
There are four DADO models to be analyzed. DADO 1 refers to the configuration without the I/O 
coprocessor, while DADO 2 refers to the configuration with the I/O coprocessor. Figure 4 illustrates the 
queuing model of DADO 1 without buffering (DADO 1a). The basic execution loop for a PE is to accept 
an instruction, pass it to its children and then execute the instruction. At each stage of the tree there is a 
communication delay, after which the job is given to the PE's two children to continue with the I/O 
operation and to subsequently execute. The queue length for the processor is zero. If a PE is busy its 
communication path is blocked. If the previous stage wishes to pass a new instruction to a blocked PE it 
must also block. 
It has been stated that 90 percent of the time spent in interpreting a production system is spent in the 
match phase. On DADO this corresponds to the computations performed by a working memory subtree. 
For all four models it is assumed that instructions enter from above the root of the subtree. To simplify the 
simulation it was assumed that the three instruction types arrive with independent Poisson arrival 
rates. However, the instruction stream generator must stop while an S type instruction is outstanding. 
The S type instructions have constant service time in all PE's. The E type instructions were simulated 
assuming an exponential service time whose mean was based on the size of WM. In reality different I type 
instructions may have different service times, though an individual I instruction has the same service time 
in all PE's. To capture this aspect of the I type instructions, it was assumed that all I type instructions 
had constant service times equal to their mean. 
Figure 5 illustrates the queueing model of DADO 1 with buffering (DADO 1b). The model is the same, 
except there is now an infinite queue feeding instructions to the processors. We can assume the queue is 
infinite because the synchronizing instructions (S type) from the control processor will prevent the queues 
from ever getting very large, and the PE's will never block. However, we must now add 25 units to the 
I/O delay time and also 25 units to the service time of the instructions. 

Figure 4: Queuing Model of DADO 1a (no buffering). 

Figure 6 illustrates the queueing model of DADO 2 without buffering (DADO 2a). The I/O propagation 
may be ignored. An instruction is passed to all PE's in the tree at once. Not only does the additional 
hardware increase the speed of an I/O operation, but it also has the effect of flattening the queuing model 
of the tree into a linear array. However, since there is no buffering, the I/O section must wait for all PE's 
to finish executing the instruction before communicating the next instruction. Figure 7 illustrates the 
queueing model of DADO 2 with buffering (DADO 2b) and is analogous to DADO 1b. Instructions may 
queue up at the individual processors, but 25 units of time must be added to the I/O propagation time 
and 25 time units to the service times. 
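The difference between the unbuffered and buffered DADO 2 models can be illustrated with a very reduced timing sketch. It is not the RESQ model used in the paper: it assumes a saturated instruction source, negligible broadcast time, and it lumps the two 25 unit penalties into a single 50 unit overhead per instruction. The mean lengths used in the example (14 for I type, 500 for E type) appear elsewhere in the text; the S type length of 10 is a placeholder.

```python
import random

def sample_services(stream, n_pes, rng):
    """Per-PE service times: constant for I and S types, exponential for E type."""
    rows = []
    for kind, mean in stream:
        if kind == "E":
            rows.append([rng.expovariate(1.0 / mean) for _ in range(n_pes)])
        else:
            rows.append([float(mean)] * n_pes)
    return rows

def unbuffered_time(services):
    """No buffering: every instruction waits for the slowest PE to finish."""
    return sum(max(row) for row in services)

def buffered_time(stream, services, overhead=50.0):
    """Buffering: PEs run ahead independently and line up only at S instructions."""
    clocks = [0.0] * len(services[0])
    for (kind, _), row in zip(stream, services):
        clocks = [c + s + overhead for c, s in zip(clocks, row)]
        if kind == "S":
            clocks = [max(clocks)] * len(clocks)
    return max(clocks)

stream = [("I", 14)] * 10
services = sample_services(stream, n_pes=3, rng=random.Random(0))
plain = unbuffered_time(services)             # 10 instructions of 14 cycles
with_ring = buffered_time(stream, services)   # each now costs 14 + 50 cycles
```

With only short, identical I type instructions the buffer buys nothing and the overhead dominates; buffering can only pay off when the E type service times differ enough between PEs to hide that overhead.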
4 Analysis Method, RESQ: queuing network simulation package 
The four models above were simulated using the IBM Research Queuing Network Simulation package [3]. 
The package has a number of very powerful simulation primitives, including generation of job streams 
according to a variety of distributions, and active queues with a variety of queueing service disciplines. In 
addition RESQ2 provides a new construct called a passive queue. 

Figure 5: Queuing Model of DADO 1b (buffering). 

The passive queue provides the means to introduce flow control and finite buffer lengths into a queueing 
model. A passive queue consists of a pool of tokens and nodes that may allocate, deallocate, create or 
destroy tokens. A job may only pass an allocate node if there are tokens in the pool that may be assigned 
to the job. Subsequently the job may release the tokens back into the pool by passing through a release 
node. Tokens may also be created and destroyed when a job passes the corresponding node type. Figure 
8 illustrates the use of a passive queue to model an active server with queue length of zero. By initializing 
the pool with a single token, a single job causes the passive queue's allocate node to block, and jobs must 
naturally queue up behind the allocate node. 
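The passive queue idea can be sketched as a token pool with a waiting line. This is not RESQ2 syntax; the class and method names are chosen here to mirror the allocate, release, create and destroy node types described in the text.

```python
from collections import deque

class PassiveQueue:
    """A pool of tokens; jobs wait at the allocate node until tokens are free."""

    def __init__(self, tokens):
        self.tokens = tokens
        self.waiting = deque()

    def allocate(self, job, needed=1):
        """Admit the job if enough tokens are free, otherwise queue it (FIFO)."""
        if not self.waiting and self.tokens >= needed:
            self.tokens -= needed
            return True
        self.waiting.append((job, needed))
        return False

    def release(self, count=1):
        """Return tokens to the pool and admit waiting jobs in arrival order."""
        self.tokens += count
        admitted = []
        while self.waiting and self.tokens >= self.waiting[0][1]:
            job, needed = self.waiting.popleft()
            self.tokens -= needed
            admitted.append(job)
        return admitted

    # Creating tokens has the same effect on the pool as releasing them.
    create = release

    def destroy(self, count=1):
        """Remove tokens from the pool without admitting anyone."""
        self.tokens = max(0, self.tokens - count)

# A pool of one token models a server with a queue length of zero.
server = PassiveQueue(tokens=1)
first = server.allocate("job-a")    # passes the allocate node
second = server.allocate("job-b")   # blocks behind the allocate node
resumed = server.release()          # job-a finishes, job-b is admitted
```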
Another important feature of the RESQ2 package is the ability to associate variables with each individual 
job, and assign values to them. Subsequently these variables and the states of the passive queues may be 
tested to dynamically route the jobs. Extensive use of this feature was made to model the DADO 
instruction stream (figure 9). A generator was used to simulate Poisson arrivals of each of the three 
instruction types. A job variable was associated with the job identifying the instruction type. However, 
when a sync instruction is issued, the control processor must block and continue only when all PE's have 
finished. A global variable (s_cnt) was created to represent the number of processors remaining to 
execute the sync job. Whenever a sync job is created s_cnt is initialized to the number of PE's. If 
s_cnt is zero, instructions may proceed to the PE models. If s_cnt is nonzero then all three instruction 
sources are directed into sinks. 
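The routing rule on `s_cnt` can be sketched as a small gate object. The class and method names are hypothetical; only the behavior of the counter follows the description above.

```python
class StreamGate:
    """Routes generated instructions to the PE models or to a sink."""

    def __init__(self, n_pes):
        self.n_pes = n_pes
        self.s_cnt = 0  # PEs that still have to execute the outstanding sync job

    def offer(self, kind):
        """Return True if the instruction proceeds to the PEs, False if sunk."""
        if self.s_cnt != 0:
            return False  # a sync is outstanding: all sources go to sinks
        if kind == "S":
            self.s_cnt = self.n_pes
        return True

    def pe_finished_sync(self):
        """Called once by each PE when it completes the sync job."""
        self.s_cnt -= 1

gate = StreamGate(n_pes=3)
log = [gate.offer("I"), gate.offer("S"), gate.offer("E")]
for _ in range(3):
    gate.pe_finished_sync()
after_sync = gate.offer("E")
```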
Figure 7: Queuing Model of DADO 2b (buffering). 

The RESQ2 submodel representing a DADO 1 PE is illustrated in figure 10. The key to flow control in 
DADO 1 is to introduce a feedback loop where a PE creates a dummy job and routes it back to its parent. 
A passive queue only allows a job into the PE submodel when it contains three tokens: two representing 
that each child has fed back a token indicating it has received the next instruction, and the third 
representing that the PE has completed the previous instruction. The first node in the submodel 
sends the dummy feedback jobs to a create node, adding tokens to the passive queue. Otherwise the jobs 
must wait to pick up the three tokens before passing through the I/O_in_delay. After the 
I/O_in_delay, the job is split into two copies. One copy is sent back to the PE's parent to indicate the 
end of the I/O while the second copy feeds the I/O_out_delay. From there three copies are made. 
These are directed to the two children and the PE's processing unit. After the job is serviced in the PE, a 
new token is created and added to its own passive queue. If it is a sync job, it decrements s_cnt as 
well. 

Figure 8: Passive Queue Construct. 
Figure 9: DADO SIMD Instruction Stream Generator Model. 
Figure 10: RESQ Queueing model of a DADO 1 PE. 

It should be noted that in the case of DADO 1a there is a natural pipelining effect. The upper PE's see an 
instruction earlier than the lower ones, and are able to start the next instruction before the lower ones 
have finished. The pipelining effect improves the performance of the DADO 1 design but is a great 
liability whenever data must be communicated up the tree. Before data can be communicated up the tree 
the entire pipe of instructions down the tree must first be terminated and emptied. Only after the 
reported byte has reached the top of the tree will the next instruction be allowed down. Since the root of 
the tree is waiting to receive the reported byte, the instructions that communicate information back up 
the tree are the sync instructions. To properly model the delay required to reverse the pipe, the last sync job 
is routed through an additional queue called sync_delay before the instruction stream is allowed to 
proceed. 
In DADO 2, since the tree has been flattened to an array, the RESQ model representing a DADO 2 PE is 
considerably simpler (figure 11). It is only the server portion of the DADO 1 PE. A passive queue is 
introduced outside the PE submodel to permit only a single job into the system at a time. 
Figure 11: RESQ Queueing model of a DADO 2 PE. 
Changing either of these models from the nonbuffered case to the buffered case requires that the passive 
queues be initialized to a large number of tokens. The extra tokens will allow the simulator to queue up jobs 
within the PE's. Because the instruction stream generator has been designed to block after a sync 
instruction, the queues will not grow very large. The change further requires that 50 instructions for the 
ring buffer overhead be added to the service times of the individual instructions. 
5 Results 
Figure 12 contains the most basic numbers derived from the simulation of a four level DADO tree. We 
can see that in the four level tree the I/O chip results in a 55 and 68 percent increase in speed in the cases 
with and without buffering, respectively. The results of adding a ring buffer are negative. In DADO 1 and 
DADO 2 the addition of the ring buffer reduced the throughput by 20 and 27 percent, respectively. 
The extra overhead of the ring buffer detracts from the overall speed of the machine. A closer look at the 
input data makes these results understandable. The majority of the instructions seen by the PE's are the 
I type instructions, whose average service time is 14 instruction cycles. The 50 instruction cycle overhead 
for the ring buffer causes these I type instructions to require about 4 times as many CPU cycles. A 
decrease in performance of only 20 to 27 percent instead of 400 percent indicates that buffering does 





Figure 12: Throughput Results For The Four DADO Models. 
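The overhead arithmetic quoted in the results can be checked in a few lines. The 14 cycle I type service time and the 50 cycle ring buffer overhead come from the text; the function name is an assumption.

```python
def overhead_factor(service_cycles, overhead_cycles):
    """How many times longer an instruction takes once the overhead is added."""
    return (service_cycles + overhead_cycles) / service_cycles

# An I type instruction with ring buffering costs 14 + 50 = 64 cycles.
i_type_factor = overhead_factor(14, 50)
```

The factor works out to roughly 4.6, in line with the "about 4 times" figure given in the text.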
Figure 13 shows the effects of changing the average size of the data dependent operations. Six plots are 
shown. The plots labeled 2A4 and 2A5 are throughput for a DADO 2 four and five levels deep (15 and 31 
PE's), respectively. Similarly, the plots labeled 1A4 and 1A5 correspond to a DADO 1 four and five levels 
deep. Both configurations show a moderate decrease in performance when more PE's are added. 
The important plot in figure 13 is labeled 2A4(x/2). This plot corresponds to placing twice as much 
working memory in the PE's of a four deep DADO as is in plot 2A4, or the same amount of working 
memory as is found in DADO 1A5. Assuming we are to build a DADO of fixed complexity, since a DADO 
2 would incorporate an I/O chip, it would necessarily contain half as many PE's as a DADO 1. The 
comparison of the plots 2A4(x/2) and 1A5 indicates the relative throughput of the two machines on the 
same size problems. The simulations indicate that if the data dependent instructions are in fact an 
average of 500 cycles in length, then DADO 2 still has a 16 percent improvement in speed over DADO 1. 
However, the 2A4(x/2) is extrapolated in the graph and we can see that we are not far from the crossover 






Figure 13: Throughput vs. Length of Data Dependent Operations. 
(Axes: throughput, arbitrary units; length of data dependent operations, instruction cycles.) 
The plot of 1A4(x/2) is included for comparison of the performance degradation. 
A possible deficiency in the above analysis is that we are only simulating relatively small DADO machines. 
For this reason DADO's of size 6 deep (63 PE's) were also simulated on the expected instruction stream. 
Both DADO 1 and DADO 2 degrade gracefully when expanded (see figure 14). 
6 Conclusions 
It appears that the custom I/O coprocessor will give us substantial performance increases. The I/O 
technique in DADO 1 results in a slow percolation of instructions through the DADO tree. One might 
expect the flattening effect of the I/O processor to create an appreciable performance increase. However, 
the increase is only 65 percent. The indication is that the PE's are spending an appreciable amount of 







Figure 14: Performance of Deeper DADO trees. 
(Horizontal axis: depth, tree levels.) 
percolation of instructions creates a natural pipelining of the DADO instruction stream, producing an 
appreciable performance improvement. When we simulated the loading of a DADO 2 with as much 
working memory as would be conveniently held by a DADO 1 of twice the size, we still have a 
performance improvement of 16 percent. 
The effects of buffering the instruction stream are negative. The data dependent operations simply 
appear too infrequently with respect to the storage management type. The storage management 
instructions are burdened with the overhead of the ring buffer. 
7 Acknowledgments 
I would like to thank Jim Kurose whose private tutorials saved me from hours of additional toil and 
aggravation. Also Doug DeGroot, Ed Macnair and the IBM Corporation for the extensive use of both 
hardware and software facilities, and the financial support while doing this work at the Watson Research 
Laboratory. 
8 Appendix I: Simulation Results 
Model    Length of data dependent instructions (E) 
         125     250     500     750     1000 
1A       6.93    5.81    4.46    3.61    3.00 
1B       3.37 
2A       17.3    12.3    7.59    5.45    4.45 
2B       8.56    7.03    5.59    4.24    3.43 
1A5      5.92    5.06    3.83    3.16    2.60 
1A6      3.38 
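The 16 percent figure quoted for the fixed complexity comparison can be recomputed from this table. The sketch assumes the damaged entries read 7.59, 5.92 and 4.45, and it uses the fact that the 2A4(x/2) plot at 500 cycles corresponds to the 2A row at 1000 cycles, since twice the working memory per PE doubles the E type length. The dictionary layout and function name are illustrative.

```python
# Throughput values transcribed from the table; damaged digits are assumed readings.
throughput = {
    "2A":  {125: 17.3, 250: 12.3, 500: 7.59, 750: 5.45, 1000: 4.45},
    "1A5": {125: 5.92, 250: 5.06, 500: 3.83, 750: 3.16, 1000: 2.60},
}

def improvement_percent(new, old):
    """Relative speed improvement of `new` over `old`, in percent."""
    return (new / old - 1) * 100

# DADO 2, four levels, twice the WM per PE (E = 1000) versus
# DADO 1, five levels, baseline WM per PE (E = 500).
fixed_complexity_gain = improvement_percent(
    throughput["2A"][1000], throughput["1A5"][500]
)
```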




9 Appendix II: Code to Determine Ring Buffer Overhead 
/* This is a sample of code for handling a ring buffer */ 
/* This is the input routine, called as an interrupt */ 
Procedure Interrupt; 
  call disable;                                     /* disable interrupts */ 
  if inptr = highlimit then 
       inptr = lowlimit;                            /* wrap around? */ 
  else inptr = inptr + 1; 
  if inptr = outptr + 1 then call block;            /* buffer full? */ 
  atinptr = inport;                                 /* get byte */ 
  call enable;                                      /* enable interrupts */ 
end; 
/* This is the read buffer routine */ 
Get_byte: procedure byte; 
  if outptr = inptr - 1 then call block;            /* buffer empty? */ 
  if outptr = highlimit then outptr = lowlimit;     /* wrap around? */ 
  temp = atoutptr; 
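For readers unfamiliar with the technique, the ring buffer logic above can be sketched in a higher level language. This is not a translation of the listing: it tracks an explicit element count instead of comparing the two pointers, and it raises exceptions where the original blocks. All names are chosen for the example.

```python
class RingBuffer:
    """Fixed size FIFO with wrap-around read and write positions."""

    def __init__(self, size):
        self.data = [None] * size
        self.inptr = 0    # next slot to write
        self.outptr = 0   # next slot to read
        self.count = 0    # bytes currently stored

    def put(self, byte):
        """Input side: store one byte, wrapping at the end of the buffer."""
        if self.count == len(self.data):
            raise BufferError("buffer full")   # the original blocks here
        self.data[self.inptr] = byte
        self.inptr = (self.inptr + 1) % len(self.data)
        self.count += 1

    def get(self):
        """Read side: remove and return the oldest byte."""
        if self.count == 0:
            raise IndexError("buffer empty")   # the original blocks here
        byte = self.data[self.outptr]
        self.outptr = (self.outptr + 1) % len(self.data)
        self.count -= 1
        return byte

rb = RingBuffer(2)
rb.put(1)
rb.put(2)
first = rb.get()
rb.put(3)          # reuses the slot freed by the first get
rest = [rb.get(), rb.get()]
```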






[1] Flynn, M. J. 
Some Computer Organizations and Their Effectiveness. 
IEEE Transactions on Computers, 1972. 
[2] Forgy, Charles L. 
OPS5 User's Manual. 
Technical Report CMU-CS-81-135, Department of Computer Science, 
Carnegie-Mellon University, July, 1981. 
[3] Sauer, Charles H., MacNair, Edward A., Kurose, James F. 
The Research Queueing Package, CMS Users Guide. 
Technical Report RA 139 #41127, IBM Research Division, 12, 1982. 
[4] Chase, E (editor). 
Visual Information Processing. 
Academic Press, 1973. 
[5] Siegel. 
PASM: A Partitionable SIMD/MIMD System for Image Processing and Pattern Recognition. 
IEEE Transactions on Computers C-30(12):934-941, December, 1981. 
[6] Stolfo, S. J. and Shaw, D. E. 
DADO: A Tree-Structured Machine Architecture for Production Systems. 
In Proceedings of the National Conference on Artificial Intelligence. American Association for 
Artificial Intelligence, August, 1982. 
[7] Stolfo S. J., D. Miranker, and M. Lerner. 
PPL/M: The Systems Level Language for Programming the DADO Machine. 
Technical Report, Department of Computer Science, Columbia University, 1984. 
(submitted to ACM TOPLAS). 
